Wednesday, December 17, 2008

Three sorting algorithm ideas

Binary Sort

A sorting algorithm whose amortized time per element is constant with respect to the number of elements, but proportional to the average length, in binary, of the elements being sorted

Let's say we have a function that returns a list of {0, 1} values which is a given element's binary representation. Now we make an array of the size of our input list, which we will use as a binary tree. Each node will be a struct instance. For each element in our list to be sorted, we navigate this tree using the element's binary sequence as the path, creating new branches as necessary. When we get to the end, we insert a reference to the original object into the beginning of a linked list of references to objects belonging at that node (i.e., equivalent elements). Then at the end of our sorting we simply walk the tree and traverse its linked lists to return our sorted list. If we want equivalent objects to appear in the same order in which they appeared in the original list, we can append them to the end of the linked lists instead and have the node store a pointer to the current last linked list item.
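
Here's a minimal sketch of the naive version in Python, assuming non-negative fixed-width integer keys so that a small bits() helper can yield an element's binary representation most-significant-bit first (the names bits, Node, and binary_sort are just illustrative, and plain node objects stand in for the array of structs):

    def bits(value, width=8):
        """Yield the binary digits of value, most significant bit first."""
        for shift in range(width - 1, -1, -1):
            yield (value >> shift) & 1

    class Node:
        def __init__(self):
            self.children = [None, None]   # subtries for a 0 bit and a 1 bit
            self.equal = []                # elements whose full bit path ends here

    def binary_sort(items, width=8):
        root = Node()
        for item in items:
            node = root
            for bit in bits(item, width):          # walk/create the path for this element
                if node.children[bit] is None:
                    node.children[bit] = Node()
                node = node.children[bit]
            node.equal.append(item)                # appending keeps equal elements in input order
        out = []
        def walk(node):                            # visiting the 0-subtree first gives ascending order
            if node is None:
                return
            out.extend(node.equal)
            walk(node.children[0])
            walk(node.children[1])
        walk(root)
        return out

    print(binary_sort([5, 3, 7, 3, 0]))   # [0, 3, 3, 5, 7]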

This is only the naive version, though; we don't really have to make the full path for every element. Instead of having a function that returns an object's binary representation, we'll have a function that returns an iterator for it. Then instead of navigating/creating nodes to the end of the path, we navigate until we find a node that simply stores a pointer to an iterator for an element that's somewhere underneath it. Now, since a node can only contain one such iterator pointer, when you come to such a node you must poll values from *both* iterators (your current element's and the node's) and create a path until the values diverge, at which point you create two children and put the respective iterator and element pointers in those. Of course you'd nullify the pointers in the original node. You have a 50/50 chance of the values diverging on the first poll, a 3/4 chance of it happening by the second poll, etc. When an iterator exhausts itself, you delete the iterator object, nullify the iterator pointer, and start the linked list of equivalent elements for that node.
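
And a sketch of this lazier version, reusing the same bits() helper from above (again, LazyNode and the function names are illustrative, and Python generators stand in for the iterator objects): each node either has children, or holds a suspended (iterator, element) pair whose remaining path hasn't been built yet.

    class LazyNode:
        def __init__(self):
            self.children = [None, None]
            self.pending = None            # a parked (iterator, element) pair, or None
            self.equal = []                # elements whose bit sequence ends exactly here

    def lazy_insert(root, item, width=8):
        node, it = root, bits(item, width)
        while True:
            if node.pending is not None:
                other_it, other_item = node.pending    # someone is parked here: split
                node.pending = None
                while True:
                    b, ob = next(it, None), next(other_it, None)
                    if b is None and ob is None:       # both exhausted: the elements are equal
                        node.equal.extend([other_item, item])
                        return
                    if b is None or ob is None:        # one exhausted: it ends at this node
                        if b is None:
                            node.equal.append(item)
                            node.children[ob] = LazyNode()
                            node.children[ob].pending = (other_it, other_item)
                        else:
                            node.equal.append(other_item)
                            node.children[b] = LazyNode()
                            node.children[b].pending = (it, item)
                        return
                    if b != ob:                        # paths diverge: park each in its own child
                        node.children[b] = LazyNode()
                        node.children[b].pending = (it, item)
                        node.children[ob] = LazyNode()
                        node.children[ob].pending = (other_it, other_item)
                        return
                    node.children[b] = LazyNode()      # bits still agree: grow the shared path
                    node = node.children[b]
            b = next(it, None)
            if b is None:                              # this element's bits are exhausted
                node.equal.append(item)
                return
            if node.children[b] is None:               # fresh branch: park the rest of the path
                node.children[b] = LazyNode()
                node.children[b].pending = (it, item)
                return
            node = node.children[b]

    def lazy_binary_sort(items, width=8):
        root = LazyNode()
        for item in items:
            lazy_insert(root, item, width)
        out = []
        def walk(node):
            if node is None:
                return
            out.extend(node.equal)
            if node.pending is not None:               # a parked element has nothing below it
                out.append(node.pending[1])
            walk(node.children[0])
            walk(node.children[1])
        walk(root)
        return out

    print(lazy_binary_sort([5, 3, 7, 3, 0]))   # [0, 3, 3, 5, 7]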

As an alternative to the linked lists, we could simply have a pointer to the first instance of an equivalent member and a count value that specifies how many times it repeats, but that only works with members that don't have any 'hidden variables' with respect to their binary representations. Plain old strings or numbers would work just fine that way. Sorting person objects by last name wouldn't.

This algorithm isn't generally suitable for mixed-type collections or any type that uses a special comparison function. Strings, ints, floating points and fixed decimals should work fine (though negative values would need their bit patterns adjusted first, since a raw sign bit would sort them after the positives). The above is optimized for strings. For a numerical type, all iterators are exhausted at the same branching level (8, 16, 32, or 64), although that doesn't change the algorithm a whole lot -- maybe provides for some minor optimizations. But because comparing two numbers might be less expensive than grabbing a value from a binary iterator, we could forgo the binary representations altogether for numbers. Instead of a node storing a pointer to an iterator and an element, it would simply store a number. That could be a huge improvement. And even if we still used binary representations, most numeric values probably start with lots of 0's, so we could speed up the process by using a BSR (bit scan reverse) instruction to determine the number of leading zeros, then jump directly to a node, skipping the all-zero prefix of the path, using a table of node pointers indexed by the number of leading zeros (up to 8, 16, 32, or 64).
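
A small sketch of that zero-prefix shortcut for 32-bit keys, reusing the Node class and bits() helper from the naive sketch above (Python's bit_length() stands in for a BSR/LZCNT instruction, and the names are illustrative):

    def build_zero_chain(root, width=32):
        table = [root]                     # table[k] = node reached after k leading zero bits
        node = root
        for _ in range(width):
            if node.children[0] is None:
                node.children[0] = Node()
            node = node.children[0]
            table.append(node)
        return table

    def insert_number(table, value, width=32):
        lz = width - value.bit_length()            # leading zeros of the fixed-width value
        node = table[lz]                           # jump straight past the all-zero prefix
        for bit in bits(value, width - lz):        # walk only the remaining bits
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.equal.append(value)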

Binary Search Library Sort

Use the Library Sort method ( http://en.wikipedia.org/wiki/Library_sort ), but instead of just iterating through the list to find where an item belongs, do a binary search. Since we're starting with an empty gapped list, the elements in it will always be sorted. A new element is placed in the first empty slot along its binary-search probe path, so the search never has to worry about stepping over gaps.
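
A minimal sketch of that placement rule in Python (gapped_insert and slots are just illustrative names; None marks a gap, and the re-spacing step that Library Sort performs when a region fills up is omitted): each item is dropped into the first empty slot along its binary-search probe path, and the occupied slots then always read in sorted order from left to right.

    def gapped_insert(slots, item):
        lo, hi = 0, len(slots) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if slots[mid] is None:         # first empty probe position: place the item here
                slots[mid] = item
                return True
            if item < slots[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        return False                       # this region is full; a real Library Sort would respace

    slots = [None] * 15
    for x in [9, 4, 12, 1, 7]:
        gapped_insert(slots, x)
    print([x for x in slots if x is not None])   # [1, 4, 7, 9, 12]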

Binary Search Sort

This one doesn't produce a sorted array directly; it builds a linked structure of search points. Take one element and put it in the structure; that element becomes the first search point. The next element becomes the next search point, either larger than or smaller than the previous element, and so on (except that identical elements live in the same search point). This method actually seems to be identical to the binary sort method above where we use comparison instead of a binary representation. But we can perhaps improve the algorithm by changing search points. For example, each node can store the average value of the two nodes it's 'between' (its parent and grandparent, and we can also use the max and min of the whole collection for the first two depths if we want), and if a value comes along that is about to be placed underneath it, but it's closer to the parent's average value than its parent is, then it swaps itself with its parent node. For strings, the 'average' might just have to be a character and its position in the string.
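
A minimal comparison-based sketch of the basic version (names are illustrative, and the adaptive average-value pivot-swapping described above isn't attempted here): each inserted element becomes a search point, identical elements share a search point, and an in-order walk returns the sorted result.

    class SearchPoint:
        def __init__(self, value):
            self.values = [value]          # identical elements live in the same search point
            self.left = None
            self.right = None

    def bst_sort(items):
        root = None
        for item in items:
            if root is None:
                root = SearchPoint(item)
                continue
            node = root
            while True:
                if item == node.values[0]:             # identical: join this search point
                    node.values.append(item)
                    break
                side = 'left' if item < node.values[0] else 'right'
                child = getattr(node, side)
                if child is None:                      # new search point below this one
                    setattr(node, side, SearchPoint(item))
                    break
                node = child
        out = []
        def walk(node):                                # in-order walk yields ascending order
            if node is None:
                return
            walk(node.left)
            out.extend(node.values)
            walk(node.right)
        walk(root)
        return out

    print(bst_sort([5, 3, 7, 3, 0]))   # [0, 3, 3, 5, 7]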

Monday, December 15, 2008

The ultimate hacker's ensemble

1. A WiFi NIC that supports monitor/RFMON mode, likely a Cisco
2. A high-gain directional parabolic WiFi antenna, like the RadioLabs 24dB Parabolic Grid WiFi Antenna ($70)
3. Kismet to record packets, detect networks, etc. (free)
4. Aircrack-ng to crack WEP encryption (free)
5. A web browser. Now the trick is to feed those HTTP streams to the web browser. Since a web browser normally won't accept a web page from a server unless it has sent a request, we'd have to hack the web browser. I.e., we'd have to download the source to Firefox (open-source), or perhaps code our own web browser from WebKit. There might be an easier way, though. Given a web browser that supports the experimental Reverse HTTP, we could first send the command to put it into Reverse HTTP mode, and then every subsequent web page would be server-push. This still likely wouldn't work for AJAX, dynamic Flash, or other applications with two-way stateful communications. But for regular old HTML it'll still be cool.
6. Wireshark to decode AIM, Live Messenger, etc. messages
7. A laptop (for war-driving/stalking), or a normal desktop PC if we just want to eavesdrop on networks from around our house
8. A car with heavily tinted windows (again, only if we want to stalk particular people/organizations or go war driving)

Now we stitch it all together programmatically so that we can just point the antenna, select a random NIC on a random BSSID, and view their web sessions as if they were our own, and possibly have pop-ups for IM messages too. And we'd do this all completely passively - i.e., it would be physically impossible for them to detect that we're doing it or to know where we are. Sweet.
Wider memory bus

We could probably make computers faster with a wider memory bus. Say, for example, that the bus speed is 800 MHz and the CPU speed is 1.8 GHz. Then for every cycle of the bus, the CPU can process at most about 2 64-bit values, or 2 128-bit values with SIMD, or with four cores, perhaps 8 128-bit values. Thus, if we have a bus width of X bytes, we can define an opcode to retrieve up to X bytes of RAM from location Y and put it in the cache, and do it in fewer cycles. X could be, say, 128 bytes. (In the above scenario, requesting more than 32 bytes at once wouldn't be necessary for a given core, but perhaps a 128-byte-wide memory bus would aid in 4 cores each requesting 32 bytes at once.)
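
To make the arithmetic concrete, here's the back-of-envelope version in Python (the figures are just this post's example numbers, not measurements):

    bus_hz, cpu_hz = 800e6, 1.8e9
    values_per_bus_cycle = int(cpu_hz // bus_hz)            # ~2 values per core per bus cycle
    bytes_per_value = 16                                    # one 128-bit SIMD value
    cores = 4
    print(values_per_bus_cycle * bytes_per_value * cores)   # 128 bytes per bus cycle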

Since we don't know what the future of CPUs and bus speeds holds, for future compatibility we should probably make this bus width as wide as is practical, unless it can be increased later in a way that's easily scalable as far as programmers are concerned.

This would save time every time memory is accessed sequentially, which is often.

I guess the pipelines within the RAM would have to be changed too? I don't know much about RAM.

Apparently a CPU has its own prediction mechanisms and automatic pre-caching. This automated pre-caching can equally take advantage of a wider memory bus, and conversely, my explicit op-code idea for pre-caching carries its own advantage independently of the idea of extending the bus width. That is, why leave it all up to the processor to predict what's going to happen, when the programmer (or possibly compiler/VM) can just *tell* it?
Memory controller transactions

To help with IPC, the memory controller should support transactions. Since all CPUs must use the same memory controller in order to use the same RAM, this is analogous to a DBMS supporting transactions from multiple clients connected simultaneously. By supporting it in hardware, at the point where memory updating is necessarily done one at a time anyway, you could eliminate the convolutions and dilemmas of trying to eliminate race conditions in software.

Actually, I don't know very much about how transactions work. But the memory controller could implement a lock so that the first CPU to request a transaction has exclusive use until the transaction is over. I hear that transactions normally use a queue, but that doesn't seem necessary in this case. In the same amount of time it would take a CPU to queue a transaction, the memory controller could already have committed the last one to memory. So it's really just about a lock.

Perhaps while one core/CPU is completing a "transaction", another can be modifying memory in a different place that's defined not to be in the area of that transaction. For example, the opcode for a "transaction" could include a number of bytes of contiguous memory to be considered part of that transaction.
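
Here's a toy model of that range-scoped idea in Python (purely illustrative; the class and method names are made up, and a real controller would do this in hardware): a transaction claims a span of addresses, and a second transaction is admitted only if its span doesn't overlap an active one.

    class MemoryControllerModel:
        def __init__(self):
            self.active = []               # (start, length) spans currently in a transaction

        def begin(self, start, length):
            for s, l in self.active:
                if start < s + l and s < start + length:   # spans overlap: caller must wait/retry
                    return False
            self.active.append((start, length))
            return True

        def end(self, start, length):
            self.active.remove((start, length))

    mc = MemoryControllerModel()
    print(mc.begin(0x1000, 64))   # True  - first core claims 64 bytes
    print(mc.begin(0x1020, 64))   # False - overlapping span, second core must retry
    print(mc.begin(0x2000, 64))   # True  - disjoint region can proceed concurrently
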
CPU paging support

The CPU should store, probably within the page tables, a value indicating the last time any given page of memory was accessed. That way the OS can ask the CPU when a page was last accessed to help it with its swapping algorithm. It seems to me that swapping by time since last accessed would be almost as effective as, and a lot simpler than, using some sort of frequency-of-use algorithm. Without something like this, the OS has no direct way of knowing when a page of memory in RAM was last accessed.

I don't know what mechanism the OS uses now for deciding what and when to swap, but I know it must be vastly less efficient than this. I would recommend just swapping out and in at the granularity of memory pages. Conveniently, a page is 4 KB, which a hard drive can read or write as a small run of whole sectors in one command.
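
As a tiny illustration of the eviction policy this would enable (assuming a hypothetical per-page last-accessed timestamp exposed by the CPU), the OS could simply evict whichever page has gone untouched the longest:

    def pick_victim(pages):
        """pages: dict mapping page number -> last-accessed timestamp."""
        return min(pages, key=pages.get)

    print(pick_victim({0: 900, 1: 120, 2: 560}))   # page 1 is the least recently used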

Saturday, December 13, 2008

Search-related metadata

I just sent this to Google
--
I was thinking about the schemas idea behind WinFS, and how it might be applied to the intarweb. I decided WinFS isn't particularly relevant to the web, and started looking up metadata for HTML. I don't think that's exactly what I had in mind either. I was trying to think of something that would make it easier to map the interrelations between webpages and allow authors to write metadata that aids intelligent web searching. WinFS-like schemas and XSD are only about breaking down particular things, like addresses. So I decided to think up my own schema for this purpose.

The reason I'm telling you guys this is that Google supporting a particular metadata schema for its search would probably be the most effective--maybe the only--way for such a schema to become widely used. (The point of its becoming widely used is to make web searching more powerful.)

The main intention of this schema is not to define relationships between parts of a webpage -- only relationships between the webpage and various ideas/words, or perhaps even other webpages. That said, parts of a web page could be sectioned off with respect to the pertinent metadata, so that in somewhat the same way they relate to other webpages, they could relate to other parts of the same page. This may or may not be useful for search engine purposes, but it could be useful for other general metadata-related purposes.

Here are my ideas for relationships:
- this webpage is vaguely relevant to Y in some way
- this webpage is closely relevant to Y in some way
- this webpage is about something particular to Y, in understanding
- this webpage is about something particular to Y, in function
- this webpage is about something that Y is particular to, in understanding
- this webpage is about something that Y is particular to, in function
- this webpage is about something particular to Y and that Y is dependent on, in understanding
- this webpage is about something particular to Y and that Y is dependent on, in function
- this webpage is about something that complements Y, in understanding
- this webpage is about something that complements Y, in function
- this webpage is about something that is dependent on Y, in understanding
- this webpage is about something that is dependent on Y, in function
- this webpage is about something that Y is dependent on, in understanding
- this webpage is about something that Y is dependent on, in function
- what this webpage is about and Y are mutually dependent, in understanding
- what this webpage is about and Y are mutually dependent, in function

..where Y could possibly be a word, a phrase, a key word out of a predefined set of key words, a URL, a metadata-delineated section of a webpage, or a list of any of the above. In all but the first two, Y is most likely a URL or webpage section.

- any of the above with the added qualifier that it applies ONLY to Y
- any of the above with the added qualifier that it was designed specifically to apply to Y
- by design, this webpage is particular to Y and is useless without Y in its functionality
- by design, Y is particular to this webpage and is useless without it in its functionality

The last two have to do with the mechanics of a website -- for example, a page for filling out a form would be useless without the main website. This is why they don't say "is about"; the webpages ARE the functionality. "By design" isn't coupled with "in understanding" for more relations because those are already covered by the relations above combined with the added qualifiers.

I'm not saying one way or another whether the two added qualifiers above are to be used as qualifiers in the syntax or to combinatorially create 48 more relationship types. With enough good ideas, the metadata syntax could actually grow into a grammar of its own, which would definitely call for making them qualifiers.

I'm sure there are more good ideas for general relationships that I haven't thought of.

The syntax could also specify domains that a webpage falls under. Some examples would be:
- technical document
- interactive
- entertainment
- media

Perhaps even a hierarchy of domains, perhaps modeled after Yahoo! Directory, dmoz.org, or similar. And then a tree relationship between webpages could be automatically generated, so people can search for coordinate sisters, etc.

But even in the above, the domains aren't all mutually exclusive. They could be made into key words, but then you'd lose the hierarchical aspect -- or maybe not. You could combine the two and have key words that aren't mutually exclusive but have hierarchical relationships to each other. Or, you could forgo hierarchies altogether and have a more web-like system where everything is interconnected according to various relationships but there's no top or bottom to it; it's not a tree-like structure. The interrelationships between the key words could be pre-defined, or perhaps they could be extrapolated somehow from analyzing the topology of hyperlinks in webpages with key words defined -- but then the relationships probably wouldn't carry much semantic meaning.

Speaking of semantics, whatever words or key words appear in WordNet would lend themselves to automatic relational mappings according to all the relationships that WordNet covers. Also, the relationship-type metadata defined above would lend itself to hierarchical or web-like structures wherever chains of sub-part or complement relationships can be found, which could also aid searching.

I haven't included any ideas for the raw syntax of the metadata because I consider that an implementation detail. Needless to say, it would be something in XML that's invisible to web browsers.

Friday, December 12, 2008

Google Captchas

Most captchas out there can be cracked by software nowadays with enough effort.

Google had a program the other year where it would show the same image to two random participating users; for X amount of time they'd both type in as many possible tags for that image as they could, and the tags they both happened to type in were associated with that image.

My idea is to use that database to show those images - as thumbnails - as captchas, and let the user type in an image tag. If it matches any tag in the database for that picture, they pass. This would have to be a captcha service provided by Google.

So as not to frustrate users, we should probably accept all tags typed in by both parties - not only the ones agreed upon - presuming that information is included in the database.

In case thumbs are too small in too many cases, there could be a button to show the image full-size, perhaps in a new window, although it would be just as easy to recycle the image. Or images could be shown full-size by default--perhaps in a new window, perhaps on the same page, or perhaps the full image could pop up just when you mouse over the thumb, like profile pictures in apps.facebook.com/yesnomaybe.

Wednesday, December 03, 2008

Google Earth Flight Sim

Use Google Earth data, including but not limited to the 3D building data in certain cities, to provide accurate worldwide scenery for a flight simulation application. Since Google Earth uses 70+ TB, distribution would have to be limited to relatively small areas, or the flight simulator would have to be constantly communicating with Google Earth, if the bandwidth were sufficient.

It won't be long, though, until HDDs are large enough to store the whole thing (they're up to a TB now, and it was just a few years ago that 1 GB was a huge hard drive), although media to distribute the program on might be a different matter (Blu-ray is only 25 GB, and we don't even really have it yet).

Algorithms should probably be used to sharpen the images for very low flying, and perhaps create tree models where there appears to be vegetation, extrapolate houses and buildings to 3D, simulate water where there appear to be lakes, and try to extrapolate height data for hills and mountains.
Balloon Telescopes

Make a very large, clear balloon, fill the bottom portion of it with water, and presto, you have a huge lens. Then make the smaller lenses in the appropriate shapes to complement the odd shape of the larger lens, and presto, you have a gigantic telescope.

Or, make the balloon an air balloon and coat the bottom of the inside of it with silver, then put the complementary mirrors inside of it. Presto, a gigantic reflecting telescope. You could even float it up high into the atmosphere.