Saturday, December 13, 2008

Search-related metadata

I just sent this to google
--
I was thinking about the schemas idea behind WinFS, and how it might be applied to the intarweb. I decided WinFS isn't particularly relevant to the web, and started looking up metadata for html. I don't think it's exactly what I had in mind either. I was trying to think of something that would make it easier to map the interrelations between webpages and allow authors to write metadata that aids in intelligent web searching capability. WinFs-like schemas and XSD are only about breaking down particular things, like addresses. So I decided to think up my own schema for this purpose.

The reason I'm telling you guys this is that Google supporting a particular metadata schema for its search would probably be the most effective--maybe the only--way for such a schema to become widely used. (The point of its becoming widely used is to make web searching more powerful.)

This schema would not have any mechanism for defining relationships between parts of a webpage -- only relationships between the webpage and various ideas/words or perhaps even other webpages, at least in its main intention, although doing that may also be possible -- parts of a web page could possibly be sectioned off with respect to the pertinent metadata, so perhaps in somewhat the same way they can relate to other webpages, they can relate to other parts of the webpage. This may or may not be useful for search engine purposes, but it could be useful for other general metadata-related purposes.

Here are my ideas for relationships:
- this webpage is vaguely relevant to Y in some way
- this webpage is closely relevant to Y in some way
- this webpage is about something particular to Y, in understanding
- this webpage is about something particular to Y, in function
- this webpage is about something that Y is is particular to, in understanding
- this webpage is about something that Y is is particular to, in function
- this webpage is about something particular to Y and that Y is dependent on, for understanding
- this webpage is about something particular to Y and that Y is dependent on, in function
- this webpage is about something that complements Y, in understanding
- this webpage is about something that complements Y, in function
- this webpage is about something that is dependent on Y, in understanding
- this webpage is about something that is dependent on Y, in function
- this webpage is about something that Y is dependent on, in understanding
- this webpage is about something that Y is dependent on, in function
- what this webpage is about and Y are mutually dependent, in understanding
- what this webpage is about and Y are mutually dependent, in function

..where Y could possibly be a word, a phrase, a key word out of a predefined set of key words, a URL, a metadata-delineated section of a webpage, or a list of any of the above. In all but the first two, Y is most likely a URL or webpage section.

- any of the above with the added qualifier that it applies ONLY to Y
- any of the above with the added qualifier that it was designed specifically to apply to Y
- by design, this webpage is particular to Y and is useless without Y in its functionality
- by design, Y is particular to this webpage is useless without Y in its functionality

The last two have to do with the mechanics of a website -- for example, a page for filling out a form would be useless without the main website. This is why they don't say "is about"; the webpages ARE the functionality. "by design" isn't coupled with "in understanding" for more relations because those are already covered by above relations including the added qualifiers.

I'm not saying one way or another on whether the two added qualifiers above are to be used as qualifiers in the syntax or to combinatorily create 48 more relationnship types. With enough good ideas, the metadata syntax could actually grow into a grammar of its own, which would definitely call for making them qualifiers.

I'm sure there are more good ideas for general relationships that I haven't thought of.

The syntax could also specify domains that a webpage falls under. Some examples would be:
- technical document
- interactive
- entertainment
- media

Perhaps even a hierarchy of domains, perhaps modeled after Yahoo! Directory, dmoz.org, or similar. And then a tree relationship between webpages could be automatically generated, so people can search for coordinate sisters, etc.

But even in the above, they're not all mutually exclusive. They could be made into key words, but then you'd lose the hierarchical aspect -- or maybe not. You could combine the two and have key words that aren't mutually exclusive but have hierarchical relationships to each other. Or, you could forgo hierarchies altogether and have a more web-like system where everything is interconnected according to various relationships but there's no top or bottom to it; it's not a tree-like structure. The interrelationships between the key words could be pre-defined, or perhaps they could be extrapolated somehow from analyzing the topology of hyperlinks in webpages with key words defined -- but then the relationship probably wouldn't be very semantical.

Speaking of semantics, whatever words or key words that appear in WordNet would lend themselves to automatic relational mappings according to all the relationships that that WordNet covers. Also, the relationship-type metadata defined above would lend itself to a hierarchical or web-like structures wherever chains of sub-part or complement relationships can be found, which could also aid in searching ability.

I haven't included any ideas for raw syntax of the metadata because I consider that implementation details. Needless to say, it would be something in XML that's invisible to web browsers.

No comments: