1. RE: ramdisk
- Posted by "Kat" <gertie at visionsix.com> Mar 26, 2004
- 387 views
On 25 Mar 2004, at 16:27, Derek Parnell wrote:

> > -----Original Message-----
> > From: Allen Robnett [mailto:alrobnett at alumni.princeton.edu]
> > Sent: Thursday, 25 March 2004 3:35 PM
> > To: EUforum at topica.com
> > Subject: Re: ramdisk (was: Re: Changing data types Concluded)
> >
> > Derek wrote:
> >
> > <<As you have all the bytes in RAM, you do not have to convert them to
> > Euphoria integers etc... You can use the RAM-based string searching
> > routines built in to Windows.
> >
> > <snip>
> > while offset < FileSize do
> >     result = c_func(CompareString, {0, 0, RAMADDR+offset, len,
> >                                     FindStr, len})
> >     if result = CSTR_ERROR or result = CSTR_EQUAL then
> >         exit
> >     end if
> >     offset += recsize
> > end while
> > >>
> >
> > Euman wrote:
> >
> > << Make sure you free that allocated string. >>
> >
> > Kat wrote:
> >
> > <<No, it means it took 14 sec to find it each time. Which sounds bad
> > to me, because it means a 200meg file will take about 6 minutes to
> > find a record at the end. There must be a gotcha somewhere.>>
> >
> > Many thanks to all for the generous assistance.
> > Using Derek's suggested CompareString on my 200MB+ file
> > (Pentium 4, 2.53 GHz, 512MB RAM) I got 7.69 seconds for a
> > 4-character search string and 8.36 seconds for a 12-character
> > search string. (The majority of the fields contained simply
> > underscores. The target string was in the last 12-character
> > record, number 16,777,216, ending with byte number 201,326,592.)
> >
> > Modifying the while statement to:
> >
> > while offset < FileSize and c_func(CompareString, {0, 0,
> >         RAMADDR+offset, len, FindStr, len}) != CSTR_EQUAL do
> >     offset += recsize
> > end while
> >
> > resulted in only a very slight improvement: 7.30 seconds.
>
> CompareString() is a *very* expensive method, as it takes regional
> locale aspects into consideration.
> It might be better to get a small machine-code routine written to scan
> from a RAM address for the first occurrence of the target string.

Still not in machine code, but for a 10meg file with the find-pattern in
the middle of it, i got the time down to 4.25 sec. If i take the
allocation of vars out of the timing loop, it doesn't change. If i
remove the puts() from the loop so there is no dos box during timing,
the time drops to 3.7 seconds. I am still not terribly pleased yet.

But, i am using peek() now, and making over twice as many peeks as
needed, imho. Let's see if i can fix that somehow....

Yes, got it down to finding the string midway in a 10 megabyte file in 1
second (average time on 10 searches). So i put the string at the end of
the file, and it was found in 2 sec (average time on 10 searches). This
makes for finding the end of a 200 megabyte file in 40 seconds. That's
more reasonable, at least for this slow computer.

Naturally, if there was a clue to the approximate location of the string
in the file, it could be found even faster. I haven't checked, but i
figure from what the code is doing that a binary search on a 200 meg
file, the way i am doing it, would take 2.25 sec, much faster than the
previous scary 6 minutes. Note i am not using fixed-size records, which
would slow the speed as the record sizes get smaller using the method i
am using now. But a fixed record size *may* speed up binary searches. Of
course, a separate index file would really speed it up.

Something else interesting, which points up Eu not closing out files
properly, is that two instances of the loadsearch code (i tried to post
here in a .zip file) cannot be run at the same time; the 2nd instance
errors out with "cannot open readfile", because the first instance
(finished running, but sitting there waiting for me to press enter)
still owns the test file. So lemme ask again: Why is this?

Kat
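[Kat's loadsearch code was not archived with this thread. The overlapped-chunk idea she describes — scanning a file too big to hand to `match()` in one piece, while still catching a pattern that straddles a chunk boundary — can be sketched as follows, in Python rather than Euphoria; the function name and chunk size are illustrative, not hers.]

```python
def find_in_file(path, pattern, chunk_size=1024 * 1024):
    """Return the byte offset of the first occurrence of `pattern`
    in the file at `path`, or -1 if it is not present."""
    overlap = len(pattern) - 1
    with open(path, "rb") as f:
        buf = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            # keep the tail of the previous buffer so a match that
            # straddles the chunk boundary is still inside `buf`
            buf = (buf[-overlap:] + chunk) if overlap else chunk
            base = f.tell() - len(buf)   # file offset of buf[0]
            hit = buf.find(pattern)
            if hit != -1:
                return base + hit
```

The key point is the `overlap` of `len(pattern) - 1` bytes carried from one chunk to the next; without it, a hit sitting across a chunk boundary is silently missed — the exact "x lands in the middle of what you are looking for" problem Kat mentions later in the thread.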
2. RE: ramdisk
- Posted by kbochert at copper.net Mar 26, 2004
- 380 views
On 26 Mar 2004 at 3:50, Kat wrote:

> <snip>
>
> Yes, got it down to finding the string midway in a 10 megabyte file in
> 1 second (average time on 10 searches). So i put the string at the end
> of the file, and it was found in 2 sec (average time on 10 searches).
> This makes for finding the end of a 200 megabyte file in 40 seconds.
> That's more reasonable, at least for this slow computer.
>
> Naturally, if there was a clue to the approximate location of the
> string in the file, it could be found even faster. I haven't checked,
> but i figure from what the code is doing that a binary search on a 200
> meg file, the way i am doing it, would take 2.25 sec, much faster than
> the previous scary 6 minutes.
>
> <snip>
>
> Kat

Have you tried a Boyer-Moore search?
Is it applicable in this case?

Karl Bochert
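[For readers unfamiliar with the algorithm Karl is suggesting: Boyer-Moore — here the simpler Horspool variant — precomputes a bad-character shift table so that on a mismatch the search can skip ahead by up to the full pattern length instead of one byte. A minimal sketch in Python (not Euphoria, and not code from this thread):]

```python
def horspool(text, pattern):
    """Return the offset of the first occurrence of `pattern` in
    `text` (both bytes objects), or -1 if absent."""
    m = len(pattern)
    if m == 0:
        return 0
    if m > len(text):
        return -1
    # default shift is the whole pattern length; bytes that do occur
    # in the pattern (except its last byte) get smaller shifts
    shift = {b: m for b in range(256)}
    for i, b in enumerate(pattern[:-1]):
        shift[b] = m - 1 - i
    pos = 0
    while pos <= len(text) - m:
        if text[pos:pos + m] == pattern:
            return pos
        # shift according to the text byte aligned with the
        # pattern's last position
        pos += shift[text[pos + m - 1]]
    return -1
```

Whether it helps here is exactly Karl's question: the skipping pays off for longer patterns, but in an interpreted loop the per-byte overhead can swamp the saved comparisons — which is why Kat answers that it would likely only win in machine code.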
3. RE: ramdisk
- Posted by "Kat" <gertie at visionsix.com> Mar 26, 2004
- 398 views
On 26 Mar 2004, at 2:38, kbochert at copper.net wrote:

<snip>

> Have you tried a Boyer-Moore search?
> Is it applicable in this case?

I am using Eu's built-in match() to locate the string in the megabyte
file, without putting the megabyte file into match(). (Sure you figured
it out, but it's not as easy as that; consider a peek(fn,x) where x
lands in the middle of what you are looking for.) Using the BM search on
the file would mean accessing the file one byte at a time, and would
work well (i suspect) only in machine code. I use Euphoria to avoid
machine code.

The next problem is being able to do string insert() on a file too big
to duplicate in memory, such as inserting a record 40% of the way from
an end of the file, in a way that binary search still works quickly. I
suspect the only way to accomplish this is a separate index list.
However, if the 200 megabyte file IS the index list, which it could be,
i am back to square one.

Kat
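[Kat's 2.25-second estimate rests on fixed-size sorted records: a binary search then touches only about log2(n) records (roughly 25 seeks for her 16,777,216 records), regardless of file size. A sketch of that idea in Python — assuming, as she notes her current files do not, that the file is sorted and every record has the same length:]

```python
import os

def find_record(path, key, recsize):
    """Binary-search a sorted file of fixed-size records for `key`
    (a bytes object of length recsize). Return the zero-based
    record number, or -1 if the key is not present."""
    nrecs = os.path.getsize(path) // recsize
    with open(path, "rb") as f:
        lo, hi = 0, nrecs - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * recsize)   # jump straight to record `mid`
            rec = f.read(recsize)
            if rec == key:
                return mid
            if rec < key:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1
```

This also shows why her insert() problem is hard: inserting a record mid-file shifts every following record, breaking the `mid * recsize` arithmetic unless the whole tail is rewritten — hence her conclusion that a separate index file is probably the only way out.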