1. ternary search trees question
- Posted by Tone Škoda <tskoda at email.si>
Aug 09, 2005
-
Last edited Aug 10, 2005
Maybe some expert on this topic will be able to answer this question:
What if we insert key "aba" and then "abandon". Since data for key is stored in
eqkid, "aba" can't have any data because eqkid will be required to store
"abandon" key.
Am I missing something?
I'm writing ternary search tree database because creation (not access) of
database with EDS is too slow for something I'm doing.
http://algorithm.myrice.com/resources/technical_artile/ternary_search_tree/terstree.htm
2. Re: ternary search trees question
- Posted by Matt Lewis <matthewwalkerlewis at gmail.com>
Aug 09, 2005
-
Last edited Aug 10, 2005
Tone Škoda wrote:
>
> Maybe some expert on this topic will be able to answer this question:
>
> What if we insert key "aba" and then "abandon". Since data for key is stored
> in eqkid,
> "aba" can't have any data because eqkid will be required to store "abandon"
> key.
>
> Am I missing something?
>
> I'm writing ternary search tree database because creation (not access) of
> database
> with EDS is too slow for something I'm doing.
I haven't thought much about ternary trees, but that's what Pete Eberlein
used in Derek's contest:
http://www.rapideuphoria.com/finalsubmissions.zip
Matt Lewis
3. Re: ternary search trees question
- Posted by "Kat" <gertie at visionsix.com>
Aug 09, 2005
-
Last edited Aug 10, 2005
On 9 Aug 2005, at 13:08, Tone =8Akoda wrote:
>
>
> posted by: Tone =8Akoda <tskoda at email.si>
>
>
>
>
> Maybe some expert on this topic will be able to answer this question:
>
> What if we insert key "aba" and then "abandon". Since data for key is sto=
red in
> eqkid, "aba" can't have any data because eqkid will be required to store
> "abandon" key.
>
> Am I missing something?
>
> I'm writing ternary search tree database because creation (not access) of=
> database with EDS is too slow for something I'm doing.
>
> http://algorithm.myrice.com/resources/technical_artile/ternary_search_tre=
e/terst
> ree.htm
The part i hope someone can explain to me is this: how are the branches of=
such trees ordered, on what criteria, and how is such pre-set-up criteria=
going to be optimal for a search in any but for the pre-set-up question?
For instance, you can set up a tree for words who's every other letter is t=
he
same, or every 2nd letter is the same, or have common endings (minus the=
suffixes), or have the same 3rd letter. But, for instance, how is the ranki=
ng for=20
3rd letters arranged for "cow" vs "cot" vs "pow" determined, by what criter=
ia?=20
Why would "pow" be anywhere near "cow"? I understand pattern matching,
and i understand hashing, but to pre-set the database for fastest searching=
of=20
a pattern, you have to know in advance what that pattern is! This leads to=
many databases optomised for many patterns.
Kat
4. Re: ternary search trees question
On Tue, 09 Aug 2005 13:08:20 -0700, Tone =8Akoda
<guest at RapidEuphoria.com> wrote:
>Maybe some expert on this topic will be able to answer this question:
>
>What if we insert key "aba" and then "abandon". Since data for key is stor=
ed in eqkid, "aba" can't have any data because eqkid will be required to st=
ore "abandon" key.
>
>Am I missing something?
>
Pete Eberlein's winning entry solved this problem by holding
completely separate trees for different word lengths. According to the
notes I made at the time (see
http://palacebuilders.pwp.blueyonder.co.uk/euphoria.html )
The initial block of 1911 is in fact 21 blocks of 91 first chars, the
first of which holds frequency counts for the 1-letter words whereas
the others are pointers for the second letter. In the remainder of the
tree, the equal to node holds the frequency count when on the last
letter, otherwise it holds a node pointer.
Regards,
Pete
5. Re: ternary search trees question
On Tue, 09 Aug 2005 13:08:20 -0700, Tone =8Akoda
<guest at RapidEuphoria.com> wrote:
>What if we insert key "aba" and then "abandon". Since data for key is stor=
ed in eqkid, "aba" can't have any data because eqkid will be required to st=
ore "abandon" key.
>
>Am I missing something?
>
>http://algorithm.myrice.com/resources/technical_artile/ternary_search_tree=
/terstree.htm
Now I've studied that paper, it looks to me like the C code given
stores the terminating null found on all C strings, so it does not
match on the second a of "aba" but the 4th byte, zero.
Regards,
Pete
6. Re: ternary search trees question
Kat wrote:
>
> The part i hope someone can explain to me is this: how are the branches of
> such trees ordered, on what criteria, and how is such pre-set-up criteria
> going to be optimal for a search in any but for the pre-set-up question?
> For instance, you can set up a tree for words who's every other letter is
> the same, or every 2nd letter is the same, or have common endings
> (minus the suffixes), or have the same 3rd letter. But, for instance, how
> is the ranking for 3rd letters arranged for "cow" vs "cot" vs "pow"
> determined, by what criteria
>
> Why would "pow" be anywhere near "cow"? I understand pattern matching,
> and i understand hashing, but to pre-set the database for fastest searching
> of a pattern, you have to know in advance what that pattern is! This
> leads to many databases optomised for many patterns.
>
I believe it's all done lexigraphically. I suppose that you'd have to decide
on a starting point to use (though I haven't looked in detail at the algorithms
on the page Tone linked to). My understanding is that it's not guaranteed to
be the fastest search algorithm, but it takes log(k+n) (IIRC), so it's going
to be pretty fast. It compromises between fast search and lower memory
requirements.
To search, you basically follow the tree. I don't think that there's any
optimization as far as suffixes. To get that, you probably need to use
a suffix tree, which has a large overhead, but really fast searching.
I think that "pow" and "cow" wouldn't be close, unless there were no words
that started with any letters between "c" and "p". "Cow" and "cot" should
be very close, though.
Matt Lewis
7. Re: ternary search trees question
On 9 Aug 2005, at 19:48, Matt Lewis wrote:
>
>
> posted by: Matt Lewis <matthewwalkerlewis at gmail.com>
>
> Kat wrote:
> >
> > The part i hope someone can explain to me is this: how are the
branches of
> > such trees ordered, on what criteria, and how is such pre-set-up criteria
> > going to be optimal for a search in any but for the pre-set-up question?
For
> > instance, you can set up a tree for words who's every other letter is the
> > same, or every 2nd letter is the same, or have common endings (minus
the
> > suffixes), or have the same 3rd letter. But, for instance, how is the
ranking
> > for 3rd letters arranged for "cow" vs "cot" vs "pow" determined, by what
> > criteria
> >
> > Why would "pow" be anywhere near "cow"? I understand pattern
matching,
> > and i understand hashing, but to pre-set the database for fastest
searching of
> > a pattern, you have to know in advance what that pattern is! This leads to
> > many databases optomised for many patterns.
> >
>
> I believe it's all done lexigraphically. I suppose that you'd have to decide
> on
> a starting point to use (though I haven't looked in detail at the algorithms
> on
> the page Tone linked to). My understanding is that it's not guaranteed to
be
> the fastest search algorithm, but it takes log(k+n) (IIRC), so it's going to
> be
> pretty fast. It compromises between fast search and lower memory
requirements.
>
> To search, you basically follow the tree. I don't think that there's any
> optimization as far as suffixes. To get that, you probably need to use
> a suffix tree, which has a large overhead, but really fast searching.
>
> I think that "pow" and "cow" wouldn't be close, unless there were no words
> that started with any letters between "c" and "p". "Cow" and "cot" should
> be very close, though.
I can see "cow" and "cot" being close for several reasons. So searching for
"co*" would find them quickly. They line up with english spelling nicely. But
searching for "pow" and "cow" by using "*ow" would be very slow, effectively
meaning all the leaves of 3 letters or greater would need to be searched! Any
optomisation for that would make finding "cow" and "cot" as slow as "pow"
and "cow" were. Multiple matches, or more complex patterns would be even
more narrowly optomised, as far as i can tell. Like "cow??e" to find "coweye"
vs "cowpie", you can get to "cow" fast enough, but then it's brute force of all
down-branch 6-letter words after that. And that is assuming you have ordered
all "cow*" by word length, which wouldn't really help finding "kewpie" and
"cowpie" by looking for "*wpie". In fact, looking for "*wpie" might trigger a
question "did you mean "wipe"?
Any insights you can share?
Kat,
interested.
8. Re: ternary search trees question
Pete Lomax wrote:
> Pete Eberlein's winning entry solved this problem by holding
> completely separate trees for different word lengths.
This is good idea! Speedup because trees are smaller plus less memory needed.
Downsides: a little more coding and limit to length of words (not necessary
though).
>
>Now I've studied that paper, it looks to me like the C code given
>stores the terminating null found on all C strings, so it does not
>match on the second a of "aba" but the 4th byte, zero.
>
Yes, it looks like it.
I understand it now; first "n" of "abandon" would be stored at the right side of
terminating 0 of "aba".
(This site has clearer code:
http://www.opensource.apple.com/darwinsource/10.4/gccfast-1621.1/libiberty/ternary.c)
But I'll use your first advice.
A hybrid between ternary search tree and digital search tree would be really
fast. For first two or three letters would be used digital search tree (for real
speedy search), for rest of the letters would be used ternary search tree (to not
consume too much memory). Although I'm not sure if digital search tree wouldn't
be too slow when inserting new keys because there would be more writing to disk
needed.
9. Re: ternary search trees question
Kat wrote:
>
> The part i hope someone can explain to me is this: how are the branches of=
>
> such trees ordered, on what criteria, and how is such pre-set-up criteria=
>
> going to be optimal for a search in any but for the pre-set-up question?
Yes, they are not so easy to clearly understand. That's why I made to myself
this drawing which helps me to understand them clearer:
http://www10.brinkster.com/tskoda/Ternary_Search_Tree_Explanation.html
Speed of search for a word depends on how deeply is that word nested in tree.
How deeply is word nested in tree depends on how late was word inserted in tree,
i.e. words inserted later generally take more time to search, plus how many
similar words already existed in tree when the word was inserted. What means
"similar" here is yet to be more clearly explained.
10. Re: ternary search trees question
On 10 Aug 2005, at 1:36, Tone =8Akoda wrote:
>
>
> posted by: Tone =8Akoda <tskoda at email.si>
>
> Kat wrote:
>
> >=20
> > The part i hope someone can explain to me is this: how are the branches=
of=
> >=20
> > such trees ordered, on what criteria, and how is such pre-set-up criter=
ia=
> >=20
> > going to be optimal for a search in any but for the pre-set-up question=
?
>
>
> Yes, they are not so easy to clearly understand. That's why I made to mys=
elf
> this drawing which helps me to understand them clearer:
> http://www10.brinkster.com/tskoda/Ternary_Search_Tree_Explanation.html
>
> Speed of search for a word depends on how deeply is that word nested in t=
ree.
> How deeply is word nested in tree depends on how late was word inserted i=
n tree,
> i.e. words inserted later generally take more time to search, plus how ma=
ny
> similar words already existed in tree when the word was inserted. What me=
ans
> "similar" here is yet to be more clearly explained.
That looks like a regular dictionary lookup process to me. There's nothing=
there related to threes (ternary). Why are later entries harder to find? It=
still=20
looks to me like the tree must be built for each search pattern, because as=
your latest url plainly shows, "pot" would be far away from "cow", and stil=
l
take a while to match "?o?" search terms.
Kat