Many Little Turings FTW
The latest Intel chips (Xeon, I7 Core, etc.) have 700-800 million transistors. If one were to make the simplest logic-gate setup possible to run a Universal Turing Machine -- or not necessarily the simplest but making a few simplicity-vs.-efficacy trade-offs --, then parallel-scale it to around 800 million transistors, we could possibly have thousands, maybe even hundreds of thousands, of general-purpose CPU cores at once running in a single half-an-inch CPU.
Graphics cards, the current latest ones pushing 1 teraFLOPS, do something similar to this but their cores (or more accurately, stream processors) are not Turing-complete -- that is, they can't do general computing --, and they probably still take orders of magnitude more transistors per "core" (because they perform many sophisticated kinds of manipulations, such as floating-point instructions and specifically CGI-related functions, directly), thus meaning far fewer parallel units than this computer could have.
Multiplying two 16-bit numbers, for example, might take over 256 cycles, and adding two numbers could take at least 16, but with thousands/hundreds of thousands of cores it might all be worth it. (I come up with the values 256 and 16 because I'm thinking this machine would have no intrinsic conception of bytes and would process everything a bit at a time, but perhaps it wouldn't be that way.)
This idea is for a computer intended for very specialized applications, not general home computing as with a PC; average PC users have no use for thousands of simultaneous general-purpose cores.
It might be neat if we could actually make this self-programmable in a way, perhaps like a Turing machine with a self-modifiable state table. This kind of dynamicism would be very useful for a machine with such a limited instruction set. Maybe an AND could be re-written as an XOR, and so on. I'm not sure how that can be done in a way that doesn't incur so many extra transistors that they'd be better put to use by simply extending the instruction set andor adding more cores. Perhaps there can be intermediate router circuits that can re-route the flow of bits through the core's parts -- for example, through an AND gate instead of through an XOR gate. Or, perhaps there can be a number of different types of cores that have slightly different functionality, and the code determines which type of core a given thread goes to. Maybe a thread can even switch core-types mid-way through; its state information would be transferred to another core. (this could be effectively the same as reprogramming a core, but more limited.) The proportions of numbers of cores available for the various core-types would hopefully be determined by the most common use-case scenarios.
The Tile64 had a good idea for inter-core communication: each core has its own router, and a message passing from one core to another steps through each core in between to get there. This may be a reasonable way to do it, although it seems like it would be a rather complex thing for a system going for (per-core) minimalism. Perhaps the system could have very localized clouds of shared memory for X number of cores, and then those clouds of memory can themselves communicate with other clouds of memory. The CPU's scheduler would group tasks according to which tasks need to either communicate a lot with or share memory with which other tasks, or perhaps the operating system itself could. If the different-type-of-cores idea is used, then each localized core-cloud should probably contain a variety of different core-types. That way threads can more quickly switch core-type, and super-tasks involving different kinds of functions can more easily have their different functions inter-communicate -- though the second argument can go either way; perhaps some features should be locally grouped and some not.
Anyway, this is all on one chip. If we could have 100,000 cores on one chip, imagine how many petaFLOPS a supercomputer made up of these chips would do!