1. need help - with giant text files :)
- Posted by timmy <tim781 at PACBELL.NET> Jan 26, 2000
- 364 views
Hi Everyone, I'm trying to find the best way to work with very large text files (larger than 500,000 lines). I need to find information in them very quickly. Example: find all lines with the word "optical". I don't want to have to scan every line each time I need information.

My thought is to load the file and assign a memory location to each word. (This means the file will take up twice as much memory, but that's OK.) Then I would create an alphabetical list of all the words and assign the memory locations of each word. This way, when I need to find all the lines with a certain word, I can look it up quickly in the alphabetical list.

I'm writing to ask if anyone knows of a better way. :) ...thanks ...timmy
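[The word-to-lines lookup timmy describes is what's usually called an inverted index. A minimal sketch in Python (not Euphoria, just for illustration) — scan the file once, then every later query is a dictionary lookup instead of a full scan:]

```python
import re
from collections import defaultdict

def build_index(lines):
    """Map each lowercased word to the set of line numbers containing it."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            index[word.lower()].add(lineno)
    return index

def find_lines(index, word):
    """Return the sorted line numbers that contain `word`."""
    return sorted(index.get(word.lower(), ()))

lines = [
    "The optical drive failed.",
    "Rebooting did not help.",
    "Replaced the optical drive.",
]
index = build_index(lines)
print(find_lines(index, "optical"))  # → [1, 3]
```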
2. Re: need help - with giant text files :)
- Posted by "Lucius L. Hilley III" <lhilley at CDC.NET> Jan 26, 2000
- 370 views
> timmy <tim781 at PACBELL.NET> wrote:
> I'm trying to find the best way to work with very large text files
> (larger than 500,000 lines). I need to find information in it very
> quickly.

That sounds like a pretty good idea. First scan the file and make a list of every unique word in the file, and for each unique word make a list of file address pointers. The pointers would point to the exact position where the word starts in the file. Then you could save the list to an indexing file. Before doing that, it would be a good idea to sort the list so you can do binary searches.

You could also keep a list of pointers to where each line begins. If you assume the file is TEXT in either UNIX or DOS format, then a new line starts every time you come across a LF. Just think of CRs as by-products of the DOS file format.

PS: CR = 13 -- CR = '\r' -- CR = #0D
    LF = 10 -- LF = '\n' -- LF = #0A

Lucius L. Hilley III
lhilley at cdc.net
Hollow Horse Software | ICQ: 9638898 | AIM: LLHIII
http://www.cdc.net/~lhilley
3. Re: need help - with giant text files :)
- Posted by Irv Mullins <irv at ELLIJAY.COM> Jan 26, 2000
- 375 views
----- Original Message -----
From: timmy <tim781 at PACBELL.NET>
To: <EUPHORIA at LISTSERV.MUOHIO.EDU>
Sent: Wednesday, January 26, 2000 10:59 AM
Subject: need help - with giant text files :)

> I'm trying to find the best way to work with very large text files
> (larger than 500,000 lines). I need to find information in it very
> quickly.

Look at Junko's hash.ex -- it comes with the Euphoria download, in the euphoria/demo directory. I think you could modify it to store a sequence of line numbers with each unique "hashed" word. Something like: if the word is not in the hash, add it, along with the line number where it was found; if the word is already in the hash, just append the line number to the one(s) already there. It should be pretty fast, and not very memory-intensive.

Irv
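[The bucket-based scheme Irv describes can be sketched in Python (this is not Junko's hash.ex, just an illustration of the same idea; the hash function and bucket count are arbitrary stand-ins):]

```python
def hash_word(word, n_buckets):
    """Simple polynomial hash -- an illustrative stand-in, not the
    function hash.ex actually uses."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % n_buckets
    return h

N_BUCKETS = 101
buckets = [[] for _ in range(N_BUCKETS)]  # each entry: [word, [line numbers]]

def add_occurrence(word, lineno):
    """If the word isn't in its bucket, add it with this line number;
    otherwise append the line number to the one(s) already there."""
    bucket = buckets[hash_word(word, N_BUCKETS)]
    for entry in bucket:
        if entry[0] == word:
            entry[1].append(lineno)
            return
    bucket.append([word, [lineno]])

def lines_for(word):
    """Look up a word's line-number list via its hash bucket."""
    bucket = buckets[hash_word(word, N_BUCKETS)]
    for entry in bucket:
        if entry[0] == word:
            return entry[1]
    return []

for lineno, line in enumerate(["optical disk", "hard disk", "optical mouse"], 1):
    for word in line.split():
        add_occurrence(word, lineno)

print(lines_for("optical"))  # → [1, 3]
print(lines_for("disk"))     # → [1, 2]
```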