Re: too much memory use!


On 21 Feb 2002, at 2:30, Jiri Babor wrote:

> 
> Kat,
> 
> try the following quick fix instead of your strtok code. Let me know if it's
> any
> better, I have not got big enough text files to test it.
> 
> jiri
> 
> constant false = 0, true = 1
> sequence data,word
> atom t
> integer c,f,inword
> 
> data = {}
> word = {}
> inword = false                  -- flag
> f = open("kat1.eml", "rb")
> c = getc(f)
> while c != -1 do
>     if find(c, {32,13,10}) then
>         if inword then
>             data = append(data, word)
>             inword = false
>             word = {}
>         end if
>     else
>         inword = true
>         word &= c
>     end if
>     c = getc(f)
> end while
> close(f)
> if inword then                  -- flush
>     data = append(data, word)
> end if
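For comparison, the same single-pass idea can be sketched in Python (illustrative only; the function and variable names here are mine, not from Jiri's code):

```python
# Rough Python sketch of the Euphoria loop above: scan the input one
# byte at a time and split words on space (32), CR (13), LF (10).
# Names are illustrative, not from the post.

SEPARATORS = {32, 13, 10}  # space, CR, LF

def split_words(raw: bytes) -> list:
    data = []
    word = bytearray()
    for c in raw:
        if c in SEPARATORS:
            if word:                    # end of a word: flush it
                data.append(bytes(word))
                word = bytearray()
        else:
            word.append(c)
    if word:                            # flush a trailing word, like
        data.append(bytes(word))        # the "-- flush" block above
    return data
```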

Jiri, i gave it a try: it ran 1 hour and 24 minutes, and used 160 megs of memory.


Modified a bit as follows:

data = {} -- the bulk file contents
readline = "" -- one of the lines in the file
word = ""
inword = false
inline = false
datafile = open(data_noun,"rb") -- 12megs

-- <Jiri's code>
-- corrected for Jiri's "data" vs my "readline"

c = getc(datafile)
while c != -1 do
    if equal(c,32) then
        if inword then
            readline = append(readline, word)
            inword = false
            inline = true
            word = {}
        end if
    elsif find(c,{10,13}) and inline then
        if inword then                  -- flush the last word on the line
            readline = append(readline, word)
            word = {}
        end if
        data = append(data, readline)
        readline = ""
        inline = false -- do not land here if there is a 13 and a 10 on the same line!!
        inword = false
    else
        inword = true
        inline = true
        word &= c
    end if
    c = getc(datafile)
end while

-- <end Jiri's code>

close(datafile)

I modified it because i need the data arranged like:

{                              -- data: one sequence
  {                            -- readline: 75,000 of them
    {word},{word},...          -- 5..1000 words per readline
  },
  {readline},...
}

{                              -- indexes: one sequence
  {                            -- readline: 145,000 of them
    {word},{word},...          -- 5..1000 words per readline
  },
  {readline},...
}
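In other words, each file becomes a sequence of lines, and each line a sequence of words. A hedged Python sketch of building that shape (names are mine):

```python
# Illustrative sketch: read raw file contents into the nested shape
# described above -- one list of lines, each line a list of words.
# Names are invented for illustration, not from the post.

def load_nested(raw: bytes) -> list:
    lines = []
    for line in raw.splitlines():
        words = line.split()        # split on runs of whitespace
        if words:                   # skip null lines
            lines.append(words)
    return lines
```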


Then i re-index the whole mess. The indexes are the 8-digit words below, 
which need to be longer too, but that's another story. In pascal (on dos), i 
would have done this with a ramdrive and file pointers. The first Eu program 
did it the same way, but without a ramdrive (i figured it would confuse win95), 
and at the end of the first day it was at about line 1000 of the 75,000-line 
file. Since windoze won't run for 75 days, this way of re-indexing was not only 
too slow but too <expletive> slow, and it meant syncing the files in memory 
with those on the drive periodically, and figuring out after each reboot where 
it left off. If i can't get this to run better, i will either forget indexing and do 
brute-force searches (making for 30-second lookup times), or look at the 
allocated-memory schemes and any libs in the Eu archives to wrap them.

One file is arranged like (two lines picked at random):
00252962 04 n 04 decrease 0 diminution 0 reduction 0 step-down 0 027 @ 
00252809 n 0000 ! 00260981 n 0101 ~ 00253565 n 0000 ~ 00254314 n 0000 
~ 00254503 n 0000 ~ 00254762 n 0000 ~ 00254954 n 0000 ~ 00255044 n 
0000 ~ 00255167 n 0000 ~ 00255414 n 0000 ~ 00255692 n 0000 ~ 
00256172 n 0000 ~ 00256313 n 0000 ~ 00258079 n 0000 ~ 00259254 n 0000 
~ 00259472 n 0000 ~ 00260041 n 0000 ~ 00260158 n 0000 ~ 00260295 n 
0000 ~ 00260392 n 0000 ~ 00262205 n 0000 ~ 00309963 n 0000 ~ 
00814987 n 0000 ~ 00833309 n 0000 ~ 11005389 n 0000 ~ 11007893 n 0000 
~ 11061536 n 0000 | the act of decreasing or reducing something  
00253565 04 n 01 cut 5 006 @ 00252962 n 0000 ~ 00253801 n 0000 ~ 
00253899 n 0000 ~ 00253992 n 0000 ~ 00254075 n 0000 ~ 00254227 n 0000 
| the act of reducing the amount or number; "the mayor proposed extensive 
cuts in the city budget"  

The other is arranged like (two lines picked at random):
amorousness n 2 2 @ ~ 2 0 06129685 06087784  
amorpha n 1 3 @ ~ #m 1 0 10220950  

One line may end in {32,32,10}, the next in {32,10}, and the next in {}, so i 
must allow for anything. There are no null lines, or lines that trim() down 
to {}.
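Trimming those variable endings before splitting is a one-liner; a hedged Python sketch (the byte values 32/13/10 match the ones used above, names are mine):

```python
# Sketch: strip any trailing run of space (32), CR (13), LF (10)
# from a raw line before splitting it into words, so that lines
# ending in {32,32,10}, {32,10}, or nothing all come out the same.

TRAILING = bytes([32, 13, 10])  # space, CR, LF

def clean_split(line: bytes) -> list:
    return line.rstrip(TRAILING).split()
```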

Kat

 
> ----- Original Message -----
> From: "Kat" <gertie at PELL.NET>
> To: "EUforum" <EUforum at topica.com>
> Sent: Wednesday, February 20, 2002 6:17 PM
> Subject: too much memory use!
> 
> 
> > Eu took 29 minutes, 36 sec to execute the following program, and used
> > 142.7Megs of memory. The file it was reading is 12.1 megabytes.
> >
> > data = {}
> > datafile = open(data_noun,"u")
> > readline = gets(datafile) -- get a line
> > while not atom(readline) do
> >   while find(readline[length(readline)],{10,13,32}) do readline =
> > readline[1..length(readline)-1] end while
> >   junk_s = parse(readline,32)
> >   data = data & {junk_s}
> >   readline = gets(datafile) -- get another line
> > end while
> > close(datafile)
> > trace(1) -- to hold the program while getting memory use data
> > abort(0)
> >
> > What am i doing that runs a 12meg file up to 142.7megabytes? and takes
> > 1/2 hour to do it?
> >
> > How can i say data = glom(file) ?
> >
> > Kat
> >
> >
> 
> 
>

