Re: too much memory use!
- Posted by Kat <gertie at PELL.NET> Feb 20, 2002
On 21 Feb 2002, at 2:30, Jiri Babor wrote:

> Kat,
>
> try the following quick fix instead of your strtok code. Let me know if it's
> any better, I have not got big enough text files to test it.
>
> jiri
>
> constant false = 0, true = 1
> sequence data, word
> atom t
> integer c, f, inword
>
> data = {}
> word = {}
> inword = false -- flag
> f = open("kat1.eml", "rb")
> c = getc(f)
> while c != -1 do
>     if find(c, {32,13,10}) then
>         if inword then
>             data = append(data, word)
>             inword = false
>             word = {}
>         end if
>     else
>         inword = true
>         word &= c
>     end if
>     c = getc(f)
> end while
> close(f)
> if inword then -- flush
>     data = append(data, word)
> end if

Jiri,

I gave it a try: it ran 1 hour and 24 minutes, and used 160 megs of memory. Modified a bit as follows:

data = {}       -- the bulk file contents
readline = ""   -- one of the lines in the file
word = ""
integer inline  -- added: tracks whether we are inside a line
inword = false
inline = false
datafile = open(data_noun, "rb")  -- 12 megs
-- <Jiri's code> -- corrected for Jiri's "data" vs my "readline"
c = getc(datafile)
while c != -1 do
    if equal(c, 32) then
        if inword then
            readline = append(readline, word)
            inword = false
            inline = true
            word = {}
        end if
    elsif find(c, {10,13}) and (inline = true) then
        data = append(data, readline)
        readline = ""
        inline = false  -- do not land here if there is a 13 and a 10 on the same line!!
        inword = false
    else
        inword = true
        inline = true
        word &= c
    end if
    c = getc(datafile)
end while
-- <end Jiri's code>
close(datafile)

I modified it because I need the data arranged like:

{data                        -- one sequence
   {readline}, {readline},   -- 75,000 of them
      {word}, {word},        -- 5..1000 of them per readline
      }
   }
}

{indexes                     -- one sequence
   {readline}, {readline},   -- 145,000 of them
      {word}, {word},        -- 5..1000 of them per readline
      }
   }
}

Then I re-index the whole mess. The indexes are the 8-digit words below, which need to be longer too, but that's another story. In Pascal (on DOS), I would have done this with a ramdrive and file pointers.
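[For reference, the two-level split the modified loop performs (characters into words, words into lines) can be sketched in Python. This is an illustrative sketch, not the Euphoria above: the function name is made up, and unlike the loop above it deliberately treats a 13-followed-by-10 pair as a single line break rather than tripping the corner case flagged in the comment.]

```python
def split_lines_and_words(raw: bytes):
    """Split raw file bytes into a list of lines, each a list of words.

    Space (32) ends a word; CR/LF (13/10) end the word and the line.
    Consecutive CR+LF (or blank runs) do not produce empty extra lines.
    """
    data = []        # all lines seen so far
    readline = []    # words on the current line
    word = []        # bytes of the word being built
    for c in raw:
        if c == 32:
            if word:
                readline.append(bytes(word))
                word = []
        elif c in (10, 13):
            if word:                      # flush the last word of the line
                readline.append(bytes(word))
                word = []
            if readline:                  # flush the line (skips the LF after a CR)
                data.append(readline)
                readline = []
        else:
            word.append(c)
    if word:                              # flush a trailing word at EOF
        readline.append(bytes(word))
    if readline:                          # flush a trailing line at EOF
        data.append(readline)
    return data
```

For example, `split_lines_and_words(b"cut 5\r\ndecrease 0")` yields `[[b"cut", b"5"], [b"decrease", b"0"]]`, which matches the {data {readline {word}}} shape described above.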
The first Eu program did it the same way, but with no ramdrive (I figured it would confuse Win95), and at the end of the first day it was at about line 1000 of the 75,000-line file. Since Windoze won't run for 75 days, this way of re-indexing was not only too slow, but too <expletive> slow, and it meant syncing the files in memory with those on the drive periodically, and determining after each reboot where it had left off. If I can't get this to run better, I will either forget indexing and do brute-force searches (making for 30-second lookup times), or look at the allocated-memory schemes and any libs in the Eu archives to wrap them.

One file is arranged like this (two lines picked at random):

00252962 04 n 04 decrease 0 diminution 0 reduction 0 step-down 0 027 @ 00252809 n 0000 ! 00260981 n 0101 ~ 00253565 n 0000 ~ 00254314 n 0000 ~ 00254503 n 0000 ~ 00254762 n 0000 ~ 00254954 n 0000 ~ 00255044 n 0000 ~ 00255167 n 0000 ~ 00255414 n 0000 ~ 00255692 n 0000 ~ 00256172 n 0000 ~ 00256313 n 0000 ~ 00258079 n 0000 ~ 00259254 n 0000 ~ 00259472 n 0000 ~ 00260041 n 0000 ~ 00260158 n 0000 ~ 00260295 n 0000 ~ 00260392 n 0000 ~ 00262205 n 0000 ~ 00309963 n 0000 ~ 00814987 n 0000 ~ 00833309 n 0000 ~ 11005389 n 0000 ~ 11007893 n 0000 ~ 11061536 n 0000 | the act of decreasing or reducing something

00253565 04 n 01 cut 5 006 @ 00252962 n 0000 ~ 00253801 n 0000 ~ 00253899 n 0000 ~ 00253992 n 0000 ~ 00254075 n 0000 ~ 00254227 n 0000 | the act of reducing the amount or number; "the mayor proposed extensive cuts in the city budget"

The other is arranged like this (two lines picked at random):

amorousness n 2 2 @ ~ 2 0 06129685 06087784

amorpha n 1 3 @ ~ #m 1 0 10220950

One line may end in {32,32,10}, the next may end in {32,10}, and the next may end in {}, so I must allow for anything. There are no null lines, and no lines that trim() down to {}.
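[A sketch of pulling the 8-digit index words out of one of those lines, again in Python rather than Euphoria. It assumes only what the samples above show: whitespace-separated fields, with the offsets being exactly the 8-digit numeric fields; the function name is illustrative.]

```python
def index_offsets(line: str):
    """Return the 8-digit offset fields of a whitespace-separated line.

    Flag fields like 'n', '@', '~', '2', '0' are skipped because they
    are not 8-digit numbers; only fields such as '06129685' survive.
    """
    return [f for f in line.split() if len(f) == 8 and f.isdigit()]
```

For example, `index_offsets("amorpha n 1 3 @ ~ #m 1 0 10220950")` yields `["10220950"]`. Note that `split()` with no argument also copes with the trailing {32,32,10} / {32,10} / {} variations mentioned above, since it discards any run of whitespace.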
Kat

> ----- Original Message -----
> From: "Kat" <gertie at PELL.NET>
> To: "EUforum" <EUforum at topica.com>
> Sent: Wednesday, February 20, 2002 6:17 PM
> Subject: too much memory use!
>
>
> > Eu took 29 minutes, 36 sec to execute the following program, and used
> > 142.7 megs of memory. The file it was reading is 12.1 megabytes.
> >
> > data = {}
> > datafile = open(data_noun, "u")
> > readline = gets(datafile)  -- get a line
> > while not atom(readline) do
> >     while find(readline[length(readline)], {10,13,32}) do
> >         readline = readline[1..length(readline)-1]
> >     end while
> >     junk_s = parse(readline, 32)
> >     data = data & {junk_s}
> >     readline = gets(datafile)  -- get another line
> > end while
> > close(datafile)
> > trace(1)  -- to hold the program while getting memory-use data
> > abort(0)
> >
> > What am I doing that runs a 12-meg file up to 142.7 megabytes? And takes
> > half an hour to do it?
> >
> > How can I say data = glom(file)?
> >
> > Kat
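[Editor's sketch of the "data = glom(file)" idea from the quoted question: read the whole file in one call, then split, instead of building the structure with per-character or per-line appends. This is Python, not Euphoria, and `glom` is just an illustrative name, not a real function in either language.]

```python
def glom(path: str):
    """Read a whole file once, then split it into lines of words.

    One bulk read replaces millions of getc()/gets() calls, and the
    splitting is done by the runtime rather than by repeated appends.
    """
    with open(path, "rb") as f:
        raw = f.read()                    # one read of the full file
    # splitlines() handles CR, LF, and CRLF; split() eats trailing spaces,
    # so {32,32,10} / {32,10} / {} line endings all come out the same.
    return [line.split() for line in raw.splitlines() if line.strip()]
```

Whether this helps with the 142.7-meg blow-up depends on how the interpreter stores the resulting nested sequences, but it at least removes the per-call I/O overhead of the loop in the quoted program.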