Thursday, October 15, 2009

Efficient searching for substrings

I don't know if this idea's ever been done before, so I'll post it.

It's a way to search for substrings within a large dataset. Google's code, for example, I'm sure is very efficient, but doesn't do substrings. It just indexes words. The same method wouldn't work for substrings, because the sentence "I heard it through the grape vine" has 561, give or take, possible substrings.

So here's my way. It uses SQL, but obviously using hash tables or binary trees directly would also work. It's just easier to implement it in SQL, and possibly even more efficient on account of the DBE's query optimizer and so forth.

So I'm indexing the string "I heard it through the grape vine", say it's in Record 1.

my fields are (autonumber id, int nextid, char letter, int record)
we'll call the table t, just because it's convenient.

here's what i index:

(0, 1, 'I', 1) //or normalize it to lower-case so we can do non-case-sensitive searches
(1, 2, ' ', 1)
(2, 3, 'h', 1)
(3, 4, 'e', 1)
(4, 5, 'a', 1)
(5, 6, 'r', 1)
(6, 7, 'd', 1)
(7, 8, ' ', 1)
(8, 9, 'i', 1)
(9, 10, 't', 1)
(10, 11, ' ', 1)
(11, 12, 't', 1)
(12, 13, 'h', 1)
(13, 14, 'r', 1)
(14, 15, 'o', 1)
(15, 16, 'u', 1)
(16, 17, 'g', 1)
(17, 18, 'h', 1)
(18, 19, ' ', 1)
(19, 20, 't', 1)
(20, 21, 'h', 1)
(21, 22, 'e', 1)
(22, 23, ' ', 1)
(23, 24, 'g', 1)
(24, 25, 'r', 1)
(25, 26, 'a', 1)
(26, 27, 'p', 1)
(27, 28, 'e', 1)
(28, 29, ' ', 1)
(29, 30, 'v', 1)
(30, 31, 'i', 1)
(31, 32, 'n', 1)
(32, null, 'e', 1)

now i'll search for "rape".

Now here's my search query.

select a.record from t a, t b, t c, t d on a.nextid = b.id inner join b.nextid = c.id inner join c.nextid = d.id inner join a.letter = "r" inner join b.letter = "a" inner join c.letter = "p" inner join d.letter = "e"

Or something like that. I'm actually not that experienced with SQL. But you know what I'm getting at here.

The result of that search (after it's turned into sql that will actually work) will be 1, because "rape" occurs in record 1.

---

Another thing we can do is this

(0, 1, 'I', 1, 0)
(1, 2, ' ', 1, 1)
(2, 3, 'h', 1, 2)
(3, 4, 'e', 1, 3)
(4, 5, 'a', 1, 4)
(5, 6, 'r', 1, 5)
(6, 7, 'd', 1, 6)
(7, 8, ' ', 1, 7)
(8, 9, 'i', 1, 8)

etc.

where the last field is the index within the record at which that letter occurs.
it just happens to match id and nextid-1 in this example, but it wouldn't necessarily, because the same table would include many records.

that way if I searched for "rape" i would instantly know that it occurs in Record 1, at index 24, without having to do a string search through record 1's text (which is stored as a normal sting in some other table).

Also, we might want to use ints instead of chars for the letters to speed it up.

Also, for unicode you might have to do it a little differently. The same principle still applies.

It might also be safe to forgo the "nextid" field and do "...a.id = b.id-1 inner join a.id = c.id-2 inner join a.id = d.id-3...", but i don't know which way is more efficient. i suspect this way would be much less optimized.

oh, one other thing: obviously index the "nextid" and "letter" fields.

No comments: