Re: too much memory use!
On 21 Feb 2002, at 2:30, Jiri Babor wrote:
>
> Kat,
>
> try the following quick fix instead of your strtok code. Let me know if it's
> any better; I have not got big enough text files to test it.
>
> jiri
>
> constant false = 0, true = 1
> sequence data, word
> atom t
> integer c, f, inword
>
> data = {}
> word = {}
> inword = false                -- flag
> f = open("kat1.eml", "rb")
> c = getc(f)
> while c != -1 do
>     if find(c, {32,13,10}) then
>         if inword then
>             data = append(data, word)
>             inword = false
>             word = {}
>         end if
>     else
>         inword = true
>         word &= c
>     end if
>     c = getc(f)
> end while
> close(f)
> if inword then                -- flush the last word
>     data = append(data, word)
> end if
Jiri, i gave it a try: it ran 1 hour and 24 minutes, and used 160 megs of memory.
I modified it a bit as follows:
sequence data, readline, word
integer datafile, c, inword, inline
-- false and true are as in Jiri's snippet above; data_noun holds the 12 meg file name

data = {}              -- the bulk file contents
readline = ""          -- one of the lines in the file
word = ""
inword = false
inline = false
datafile = open(data_noun, "rb")   -- 12 megs

-- <Jiri's code>
-- corrected for Jiri's "data" vs my "readline"
c = getc(datafile)
while c != -1 do
    if equal(c, 32) then
        if inword then
            readline = append(readline, word)
            inword = false
            inline = true
            word = {}
        end if
    elsif find(c, {10,13}) and (inline = true) then
        if inword then          -- flush a word that ends the line with no trailing space
            readline = append(readline, word)
            word = {}
        end if
        data = append(data, readline)
        readline = ""
        inline = false   -- do not land here if there is a 13 and a 10 on the same line!!
        inword = false
    else
        inword = true
        inline = true
        word &= c
    end if
    c = getc(datafile)
end while
-- <end Jiri's code>
close(datafile)
I modified it because i need the data arranged like:

{ data                              -- one sequence
    { readline }, { readline },     -- 75,000 of them
        { word }, { word },         -- 5..1000 of them per readline
}

{ indexes                           -- one sequence
    { readline }, { readline },     -- 145,000 of them
        { word }, { word },         -- 5..1000 of them per readline
}
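
To make that nesting concrete, a tiny illustration (placeholder words only, not the real data): each readline is a sequence of words, and each word is itself a sequence of characters.

sequence data
data = { {"decrease", "diminution", "reduction"},   -- readline 1: 3 words
         {"cut"} }                                  -- readline 2: 1 word
? length(data)          -- 2 readlines
? length(data[1])       -- 3 words in the first readline
puts(1, data[1][2])     -- prints "diminution"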
Then i re-index the whole mess. The indexes are the 8-digit words below, which
need to be longer too, but that's another story. In Pascal (on DOS), i would
have done this with a ramdrive and file pointers. The first Eu program did it
the same way, but with no ramdrive (i figured it would confuse win95), and at
the end of the first day it was at about line 1000 of the 75,000-line file.
Since windoze won't run for 75 days, this way of re-indexing was not only too
slow, but too <expletive> slow, and it meant syncing the files in memory with
those on the drive periodically, and determining after each reboot where it had
left off before the reboot. If i can't get this to run better, i will either
forget indexing and do brute-force searches (making for 30-second lookup
times), or look at the allocated memory schemes and any libs in the Eu archives
to wrap them.
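
For the allocated-memory route, something like this might do it (just a sketch, untested; it assumes machine.e's allocate/poke/peek/free and get.e's get_bytes, and the buffer size and file name are placeholders): read the raw file into one flat block, then scan it with peek() and keep only {start, length} pairs for the words instead of copying everything into nested sequences.

include machine.e    -- allocate(), poke(), peek(), free()
include get.e        -- get_bytes()

constant CHUNK = 16384
atom buf
integer fn, len
sequence chunk

fn = open("data.noun", "rb")      -- placeholder file name
buf = allocate(13*1024*1024)      -- a bit more than the 12 meg file
len = 0
chunk = get_bytes(fn, CHUNK)
while length(chunk) > 0 do
    poke(buf + len, chunk)
    len += length(chunk)
    chunk = get_bytes(fn, CHUNK)
end while
close(fn)
-- scan the block with peek(buf+i), 0 <= i < len, recording word positions
free(buf)

If your Eu doesn't have get_bytes(), a getc() loop would work too, just slower.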
One file is arranged like (two lines picked at random):
00252962 04 n 04 decrease 0 diminution 0 reduction 0 step-down 0 027 @
00252809 n 0000 ! 00260981 n 0101 ~ 00253565 n 0000 ~ 00254314 n 0000
~ 00254503 n 0000 ~ 00254762 n 0000 ~ 00254954 n 0000 ~ 00255044 n
0000 ~ 00255167 n 0000 ~ 00255414 n 0000 ~ 00255692 n 0000 ~
00256172 n 0000 ~ 00256313 n 0000 ~ 00258079 n 0000 ~ 00259254 n 0000
~ 00259472 n 0000 ~ 00260041 n 0000 ~ 00260158 n 0000 ~ 00260295 n
0000 ~ 00260392 n 0000 ~ 00262205 n 0000 ~ 00309963 n 0000 ~
00814987 n 0000 ~ 00833309 n 0000 ~ 11005389 n 0000 ~ 11007893 n 0000
~ 11061536 n 0000 | the act of decreasing or reducing something
00253565 04 n 01 cut 5 006 @ 00252962 n 0000 ~ 00253801 n 0000 ~
00253899 n 0000 ~ 00253992 n 0000 ~ 00254075 n 0000 ~ 00254227 n 0000
| the act of reducing the amount or number; "the mayor proposed extensive
cuts in the city budget"
The other is arranged like (two lines picked at random):
amorousness n 2 2 @ ~ 2 0 06129685 06087784
amorpha n 1 3 @ ~ #m 1 0 10220950
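
Pulling the 8-digit index words back out of a raw line could be as simple as this (a sketch only; it assumes the indexes are always exactly 8 digits and space-delimited, as in the samples above, and offsets_of is just a name i made up):

-- return the 8-digit index words found in one raw line
function offsets_of(sequence line)
    sequence offs, word
    integer digits
    offs = {}
    word = ""
    line &= ' '                      -- force a final flush
    for i = 1 to length(line) do
        if line[i] = ' ' then
            digits = 0
            for j = 1 to length(word) do
                digits += (word[j] >= '0' and word[j] <= '9')
            end for
            if length(word) = 8 and digits = 8 then
                offs = append(offs, word)
            end if
            word = ""
        else
            word &= line[i]
        end if
    end for
    return offs
end function

So offsets_of("amorpha n 1 3 @ ~ #m 1 0 10220950") would give {"10220950"}.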
One line may end in {32,32,10}, the next may end in {32,10}, and the next may
end in {} (nothing at all), so i must allow for anything. There are no null
lines, or lines that trim() down to {}.
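
So whatever ends up reading the lines, the trailing-junk handling can sit in one small helper (the obvious loop, nothing fancier):

-- strip any mix of trailing spaces, CRs and LFs from one line
function trim_tail(sequence s)
    while length(s) > 0 and find(s[length(s)], {32, 13, 10}) do
        s = s[1..length(s)-1]
    end while
    return s
end function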
Kat
> ----- Original Message -----
> From: "Kat" <gertie at PELL.NET>
> To: "EUforum" <EUforum at topica.com>
> Sent: Wednesday, February 20, 2002 6:17 PM
> Subject: too much memory use!
>
>
> > Eu took 29 minutes, 36 sec to execute the following program, and used
> > 142.7Megs of memory. The file it was reading is 12.1 megabytes.
> >
> > data = {}
> > datafile = open(data_noun, "u")
> > readline = gets(datafile)     -- get a line
> > while not atom(readline) do
> >     while find(readline[length(readline)], {10,13,32}) do
> >         readline = readline[1..length(readline)-1]
> >     end while
> >     junk_s = parse(readline, 32)
> >     data = data & {junk_s}
> >     readline = gets(datafile)  -- get another line
> > end while
> > close(datafile)
> > trace(1)   -- to hold the program while getting memory use data
> > abort(0)
> >
> > What am i doing that runs a 12meg file up to 142.7megabytes? and takes
> > 1/2 hour to do it?
> >
> > How can i say data = glom(file) ?
> >
> > Kat
> >
> >
>
>
>