1. need help - with giant text files :)

Hi Everyone, I'm trying to find the best way to
work with very large text files (larger than 500,000 lines).
I need to find information in them very quickly. Example:
find all lines with the word "optical". I don't want to have
to scan every line each time I need information. My thought
is to load the file and assign a memory location to each word.
(This means the file will take up twice as much memory, but that's OK.)

Then I would create an alphabetical list of all the words and assign the
memory locations of each word. This way, when I need to find
all the lines with a certain word I can call it up quickly from the
alphabetical list.  I'm writing to ask if anyone knows of a
better way. :) ...thanks ...timmy
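[The scheme timmy describes — build an index once, then answer "which lines contain this word?" without rescanning — can be sketched in a few lines. This is a hypothetical illustration in Python rather than Euphoria, it tracks line numbers instead of raw memory locations, and the names build_index and find_lines are invented for the sketch:]

```python
from collections import defaultdict

def build_index(lines):
    """One pass over the file: map each word to the lines containing it."""
    index = defaultdict(list)            # word -> list of line numbers
    for lineno, line in enumerate(lines, start=1):
        for word in set(line.lower().split()):  # set(): one entry per line
            index[word].append(lineno)
    return index

def find_lines(index, word):
    """Answer a query from the index, never rescanning the file."""
    return index.get(word.lower(), [])

lines = ["The optical drive failed",
         "Replace the optical sensor",
         "Reboot the system"]
idx = build_index(lines)
print(find_lines(idx, "optical"))        # -> [1, 2]
```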


2. Re: need help - with giant text files :)

> ---------------------- Information from the mail header -----------------------
> Sender:       Euphoria Programming for MS-DOS <EUPHORIA at LISTSERV.MUOHIO.EDU>
> Poster:       timmy <tim781 at PACBELL.NET>
> Organization: Pacific Bell Internet Services
> Subject:      need help - with giant text files :)
> -------------------------------------------------------------------------------
>
> Hi Everyone, I'm trying to find the best way to
> work with very large text files (larger than 500,000 lines).
> I need to find information in them very quickly. Example:
> find all lines with the word "optical". I don't want to have
> to scan every line each time I need information. My thought
> is to load the file and assign a memory location to each word.
> (This means the file will take up twice as much memory, but that's OK.)
>
> Then I would create an alphabetical list of all the words and assign the
> memory locations of each word. This way, when I need to find
> all the lines with a certain word I can call it up quickly from the
> alphabetical list.  I'm writing to ask if anyone knows of a
> better way. :) ...thanks ...timmy
>

    That sounds like a pretty good idea.
First, scan the file and make a list of every unique word in it,
and for each unique word make a list of file address pointers.
The pointers would point to the exact position where the
word starts in the file.  Then you could save the list to an index
file.  Before doing that, it would be good to sort the list so you can
do binary searches.  You could also keep a list of pointers to where each
line begins.
    If you assume the file is TEXT in either UNIX or DOS format,
then a new line starts every time you come across a LF.
Just think of CRs as by-products of the DOS file format.

PS:
    CR = 13 -- CR = '\r' -- CR = #0D
    LF = 10 -- LF = '\n' -- LF = #0A
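[The steps above — record where each line starts (a new line after every LF, CRs ignored), collect each unique word with the offsets of the lines containing it, then sort the words so lookups can binary-search — might look like the following sketch. It is Python rather than Euphoria, and every name in it is invented for illustration:]

```python
import bisect

def build_offset_index(data):
    """Scan raw bytes once; return (sorted word list, word -> line offsets)."""
    line_starts = [0]
    for i, b in enumerate(data):
        if b == 0x0A:                    # LF starts a new line
            line_starts.append(i + 1)
    offsets = {}                         # word -> byte offsets of its lines
    for start in line_starts:
        end = data.find(b"\n", start)
        if end == -1:
            end = len(data)
        line = data[start:end].rstrip(b"\r")   # CR: DOS by-product, drop it
        for w in set(line.lower().split()):
            offsets.setdefault(w, []).append(start)
    return sorted(offsets), offsets      # sorted list enables binary search

def lookup(sorted_words, offsets, word):
    """Binary-search the sorted word list, as with a sorted index file."""
    i = bisect.bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return offsets[word]
    return []

data = b"optical disk\r\nhard disk\r\noptical fiber\r\n"
sw, off = build_offset_index(data)
print(lookup(sw, off, b"optical"))       # -> [0, 25]
```

[In a real run you would write the sorted list and offsets out to an index file, as suggested above, and reload them instead of rescanning.]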



        Lucius L. Hilley III
        lhilley at cdc.net
+----------+--------------+--------------+
| Hollow   | ICQ: 9638898 | AIM: LLHIII  |
|  Horse   +--------------+--------------+
| Software | http://www.cdc.net/~lhilley |
+----------+-----------------------------+


3. Re: need help - with giant text files :)

----- Original Message -----
From: timmy <tim781 at PACBELL.NET>
To: <EUPHORIA at LISTSERV.MUOHIO.EDU>
Sent: Wednesday, January 26, 2000 10:59 AM
Subject: need help - with giant text files :)


> Hi Everyone, I'm trying to find the best way to
> work with very large text files (larger than 500,000 lines).
> I need to find information in them very quickly. Example:
> find all lines with the word "optical". I don't want to have
> to scan every line each time I need information. My thought
> is to load the file and assign a memory location to each word.
> (This means the file will take up twice as much memory, but that's OK.)
>
> Then I would create an alphabetical list of all the words and assign the
> memory locations of each word. This way, when I need to find
> all the lines with a certain word I can call it up quickly from the
> alphabetical list.  I'm writing to ask if anyone knows of a
> better way. :) ...thanks ...timmy

Look at Junko's hash.ex - it comes with the Euphoria download, in the
euphoria/demo directory. I think you could modify it to store a sequence
of line numbers with each unique "hashed" word.

Something like: if the word is not in the hash, add it along with the line
number where it was found. If the word is already in the hash, just append
the line number to the one(s) already there. It should be pretty fast, and
not very memory-intensive.
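[hash.ex itself is Euphoria, so purely as a rough illustration of the idea, here is a hand-rolled bucket table in Python - word hashed to a bucket, each bucket holding (word, line numbers) entries. The bucket count and hash function are arbitrary choices for this sketch, not hash.ex's:]

```python
NBUCKETS = 64

def bucket_of(word):
    """Toy string hash; anything with a decent spread would do."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % NBUCKETS
    return h

def add_word(table, word, lineno):
    bucket = table[bucket_of(word)]
    for entry in bucket:                 # word already hashed? append line no.
        if entry[0] == word:
            if entry[1][-1] != lineno:   # avoid duplicates within one line
                entry[1].append(lineno)
            return
    bucket.append((word, [lineno]))      # new word: add it with its line no.

def lines_for(table, word):
    for entry in table[bucket_of(word)]:
        if entry[0] == word:
            return entry[1]
    return []

table = [[] for _ in range(NBUCKETS)]
text = ["optical drive", "sound card", "optical mouse"]
for n, line in enumerate(text, start=1):
    for w in line.split():
        add_word(table, w, n)
print(lines_for(table, "optical"))       # -> [1, 3]
```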

Irv

