1. ternary search trees question
- Posted by Tone Škoda <tskoda at email.si> Aug 09, 2005
- 461 views
- Last edited Aug 10, 2005
Maybe some expert on this topic will be able to answer this question: What if we insert key "aba" and then "abandon". Since data for key is stored in eqkid, "aba" can't have any data because eqkid will be required to store "abandon" key. Am I missing something? I'm writing ternary search tree database because creation (not access) of database with EDS is too slow for something I'm doing. http://algorithm.myrice.com/resources/technical_artile/ternary_search_tree/terstree.htm
2. Re: ternary search trees question
- Posted by Matt Lewis <matthewwalkerlewis at gmail.com> Aug 09, 2005
- 460 views
- Last edited Aug 10, 2005
Tone Škoda wrote: > > Maybe some expert on this topic will be able to answer this question: > > What if we insert key "aba" and then "abandon". Since data for key is stored > in eqkid, > "aba" can't have any data because eqkid will be required to store "abandon" > key. > > Am I missing something? > > I'm writing ternary search tree database because creation (not access) of > database > with EDS is too slow for something I'm doing. I haven't thought much about ternary trees, but that's what Pete Eberlein used in Derek's contest: http://www.rapideuphoria.com/finalsubmissions.zip Matt Lewis
3. Re: ternary search trees question
- Posted by "Kat" <gertie at visionsix.com> Aug 09, 2005
- 451 views
- Last edited Aug 10, 2005
On 9 Aug 2005, at 13:08, Tone =8Akoda wrote: > > > posted by: Tone =8Akoda <tskoda at email.si> > > > > > Maybe some expert on this topic will be able to answer this question: > > What if we insert key "aba" and then "abandon". Since data for key is sto= red in > eqkid, "aba" can't have any data because eqkid will be required to store > "abandon" key. > > Am I missing something? > > I'm writing ternary search tree database because creation (not access) of= > database with EDS is too slow for something I'm doing. > > http://algorithm.myrice.com/resources/technical_artile/ternary_search_tre= e/terst > ree.htm The part i hope someone can explain to me is this: how are the branches of= such trees ordered, on what criteria, and how is such pre-set-up criteria= going to be optimal for a search in any but for the pre-set-up question? For instance, you can set up a tree for words who's every other letter is t= he same, or every 2nd letter is the same, or have common endings (minus the= suffixes), or have the same 3rd letter. But, for instance, how is the ranki= ng for=20 3rd letters arranged for "cow" vs "cot" vs "pow" determined, by what criter= ia?=20 Why would "pow" be anywhere near "cow"? I understand pattern matching, and i understand hashing, but to pre-set the database for fastest searching= of=20 a pattern, you have to know in advance what that pattern is! This leads to= many databases optomised for many patterns. Kat
4. Re: ternary search trees question
- Posted by Pete Lomax <petelomax at blueyonder.co.uk> Aug 10, 2005
- 465 views
On Tue, 09 Aug 2005 13:08:20 -0700, Tone =8Akoda <guest at RapidEuphoria.com> wrote: >Maybe some expert on this topic will be able to answer this question: > >What if we insert key "aba" and then "abandon". Since data for key is stor= ed in eqkid, "aba" can't have any data because eqkid will be required to st= ore "abandon" key. > >Am I missing something? > Pete Eberlein's winning entry solved this problem by holding completely separate trees for different word lengths. According to the notes I made at the time (see http://palacebuilders.pwp.blueyonder.co.uk/euphoria.html ) The initial block of 1911 is in fact 21 blocks of 91 first chars, the first of which holds frequency counts for the 1-letter words whereas the others are pointers for the second letter. In the remainder of the tree, the equal to node holds the frequency count when on the last letter, otherwise it holds a node pointer. Regards, Pete
5. Re: ternary search trees question
- Posted by Pete Lomax <petelomax at blueyonder.co.uk> Aug 10, 2005
- 445 views
On Tue, 09 Aug 2005 13:08:20 -0700, Tone =8Akoda <guest at RapidEuphoria.com> wrote: >What if we insert key "aba" and then "abandon". Since data for key is stor= ed in eqkid, "aba" can't have any data because eqkid will be required to st= ore "abandon" key. > >Am I missing something? > >http://algorithm.myrice.com/resources/technical_artile/ternary_search_tree= /terstree.htm Now I've studied that paper, it looks to me like the C code given stores the terminating null found on all C strings, so it does not match on the second a of "aba" but the 4th byte, zero. Regards, Pete
6. Re: ternary search trees question
- Posted by Matt Lewis <matthewwalkerlewis at gmail.com> Aug 10, 2005
- 464 views
Kat wrote: > > The part i hope someone can explain to me is this: how are the branches of > such trees ordered, on what criteria, and how is such pre-set-up criteria > going to be optimal for a search in any but for the pre-set-up question? > For instance, you can set up a tree for words who's every other letter is > the same, or every 2nd letter is the same, or have common endings > (minus the suffixes), or have the same 3rd letter. But, for instance, how > is the ranking for 3rd letters arranged for "cow" vs "cot" vs "pow" > determined, by what criteria > > Why would "pow" be anywhere near "cow"? I understand pattern matching, > and i understand hashing, but to pre-set the database for fastest searching > of a pattern, you have to know in advance what that pattern is! This > leads to many databases optomised for many patterns. > I believe it's all done lexigraphically. I suppose that you'd have to decide on a starting point to use (though I haven't looked in detail at the algorithms on the page Tone linked to). My understanding is that it's not guaranteed to be the fastest search algorithm, but it takes log(k+n) (IIRC), so it's going to be pretty fast. It compromises between fast search and lower memory requirements. To search, you basically follow the tree. I don't think that there's any optimization as far as suffixes. To get that, you probably need to use a suffix tree, which has a large overhead, but really fast searching. I think that "pow" and "cow" wouldn't be close, unless there were no words that started with any letters between "c" and "p". "Cow" and "cot" should be very close, though. Matt Lewis
7. Re: ternary search trees question
- Posted by "Kat" <gertie at visionsix.com> Aug 10, 2005
- 457 views
On 9 Aug 2005, at 19:48, Matt Lewis wrote: > > > posted by: Matt Lewis <matthewwalkerlewis at gmail.com> > > Kat wrote: > > > > The part i hope someone can explain to me is this: how are the branches of > > such trees ordered, on what criteria, and how is such pre-set-up criteria > > going to be optimal for a search in any but for the pre-set-up question? For > > instance, you can set up a tree for words who's every other letter is the > > same, or every 2nd letter is the same, or have common endings (minus the > > suffixes), or have the same 3rd letter. But, for instance, how is the ranking > > for 3rd letters arranged for "cow" vs "cot" vs "pow" determined, by what > > criteria > > > > Why would "pow" be anywhere near "cow"? I understand pattern matching, > > and i understand hashing, but to pre-set the database for fastest searching of > > a pattern, you have to know in advance what that pattern is! This leads to > > many databases optomised for many patterns. > > > > I believe it's all done lexigraphically. I suppose that you'd have to decide > on > a starting point to use (though I haven't looked in detail at the algorithms > on > the page Tone linked to). My understanding is that it's not guaranteed to be > the fastest search algorithm, but it takes log(k+n) (IIRC), so it's going to > be > pretty fast. It compromises between fast search and lower memory requirements. > > To search, you basically follow the tree. I don't think that there's any > optimization as far as suffixes. To get that, you probably need to use > a suffix tree, which has a large overhead, but really fast searching. > > I think that "pow" and "cow" wouldn't be close, unless there were no words > that started with any letters between "c" and "p". "Cow" and "cot" should > be very close, though. I can see "cow" and "cot" being close for several reasons. So searching for "co*" would find them quickly. They line up with english spelling nicely. But searching for "pow" and "cow" by using "*ow" would be very slow, effectively meaning all the leaves of 3 letters or greater would need to be searched! Any optomisation for that would make finding "cow" and "cot" as slow as "pow" and "cow" were. Multiple matches, or more complex patterns would be even more narrowly optomised, as far as i can tell. Like "cow??e" to find "coweye" vs "cowpie", you can get to "cow" fast enough, but then it's brute force of all down-branch 6-letter words after that. And that is assuming you have ordered all "cow*" by word length, which wouldn't really help finding "kewpie" and "cowpie" by looking for "*wpie". In fact, looking for "*wpie" might trigger a question "did you mean "wipe"? Any insights you can share? Kat, interested.
8. Re: ternary search trees question
- Posted by Tone Škoda <tskoda at email.si> Aug 10, 2005
- 453 views
Pete Lomax wrote: > Pete Eberlein's winning entry solved this problem by holding > completely separate trees for different word lengths. This is good idea! Speedup because trees are smaller plus less memory needed. Downsides: a little more coding and limit to length of words (not necessary though). > >Now I've studied that paper, it looks to me like the C code given >stores the terminating null found on all C strings, so it does not >match on the second a of "aba" but the 4th byte, zero. > Yes, it looks like it. I understand it now; first "n" of "abandon" would be stored at the right side of terminating 0 of "aba". (This site has clearer code: http://www.opensource.apple.com/darwinsource/10.4/gccfast-1621.1/libiberty/ternary.c) But I'll use your first advice. A hybrid between ternary search tree and digital search tree would be really fast. For first two or three letters would be used digital search tree (for real speedy search), for rest of the letters would be used ternary search tree (to not consume too much memory). Although I'm not sure if digital search tree wouldn't be too slow when inserting new keys because there would be more writing to disk needed.
9. Re: ternary search trees question
- Posted by Tone Škoda <tskoda at email.si> Aug 10, 2005
- 466 views
Kat wrote: > > The part i hope someone can explain to me is this: how are the branches of= > > such trees ordered, on what criteria, and how is such pre-set-up criteria= > > going to be optimal for a search in any but for the pre-set-up question? Yes, they are not so easy to clearly understand. That's why I made to myself this drawing which helps me to understand them clearer: http://www10.brinkster.com/tskoda/Ternary_Search_Tree_Explanation.html Speed of search for a word depends on how deeply is that word nested in tree. How deeply is word nested in tree depends on how late was word inserted in tree, i.e. words inserted later generally take more time to search, plus how many similar words already existed in tree when the word was inserted. What means "similar" here is yet to be more clearly explained.
10. Re: ternary search trees question
- Posted by "Kat" <gertie at visionsix.com> Aug 10, 2005
- 445 views
On 10 Aug 2005, at 1:36, Tone =8Akoda wrote: > > > posted by: Tone =8Akoda <tskoda at email.si> > > Kat wrote: > > >=20 > > The part i hope someone can explain to me is this: how are the branches= of= > >=20 > > such trees ordered, on what criteria, and how is such pre-set-up criter= ia= > >=20 > > going to be optimal for a search in any but for the pre-set-up question= ? > > > Yes, they are not so easy to clearly understand. That's why I made to mys= elf > this drawing which helps me to understand them clearer: > http://www10.brinkster.com/tskoda/Ternary_Search_Tree_Explanation.html > > Speed of search for a word depends on how deeply is that word nested in t= ree. > How deeply is word nested in tree depends on how late was word inserted i= n tree, > i.e. words inserted later generally take more time to search, plus how ma= ny > similar words already existed in tree when the word was inserted. What me= ans > "similar" here is yet to be more clearly explained. That looks like a regular dictionary lookup process to me. There's nothing= there related to threes (ternary). Why are later entries harder to find? It= still=20 looks to me like the tree must be built for each search pattern, because as= your latest url plainly shows, "pot" would be far away from "cow", and stil= l take a while to match "?o?" search terms. Kat