1. Suffix Tree (was ramdisk ...)
- Posted by Matt Lewis <matthew.w.lewis at saic.com> Mar 25, 2004
- 361 views
Matt Lewis wrote: > > Allen V Robnett wrote: > > <snip> > > > I am interested in the possibility of letting the user determine > > the length of the search string by pressing <ENTER> or <TAB> > > after entering the desired number of characters to be matched. EU > > seems to want sequences to be assigned a value which then > > determines the length. I am considering using a slice to truncate > > the original sequence length at whatever the length the user > > indicates. Will this work? Is there a better way? > > I think the 'better way' may involve a new design. What is the > logic of how records are allocated? Is there any structured layout > behind them? If not, might there be a way to do this? > > If the answer is still no, you could design an indexing system to > help you look up the strings. The fastest way would probably be a > suffix tree (although it's going to take up a LOT of disk space for > something this big). I have some code to build a tree, although it > wouldn't work for your case, because it's all done in memory, but > it could be done in files fairly easily. I'll see if I can get it > modified in the next few days or so. OK, it seems to be working, and it scales linearly. The tree itself (which is housed in 6 separate files) takes up about 26 times the size of the data file. So for your 200MB file, your suffix tree would take nearly 5GB. On my 2.4GHz machine, I estimate it would take about 32-36 hours to build a tree for your file (althouth the makeup of the data could change that). But, searching would be really fast. I've tested on files with up to 2^17 12-byte records, where the records are just hex numbers right justified amidst '_'s: ___________1 ___________2 ... ________1000 etc... Let me know if you're interested. It would obviously be useless if your data changed often, but it takes almost no time to search for a string, and you can look for partial results, too, so you could find all records that contain FF, for example. Matt Lewis