1. Searching for Fastest Way to Input a File
- Posted by euphoric (admin) May 13, 2014
- 2860 views
I did a search for "fastest way to read a file" and got no hits.
Anybody want to suggest a search phrase that will work better?
I will also try searching with Google.
If anybody has a direct link to code, please share it.
If anybody has code for this, please post it.
2. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 13, 2014
- 2858 views
I did a search for "fastest way to read a file" and got no hits.
Ummm... more detail please.
- Which language are you intending to use?
- Is it a text file or binary?
- Do you include processing / interpreting the bytes in the 'speed' test or is it just a matter of getting the bytes into RAM?
- Is hardware an issue?
- Is size of the file an issue?
- What media does the file reside on?
3. Re: Searching for Fastest Way to Input a File
- Posted by ghaberek (admin) May 13, 2014
- 2822 views
Based on my (severely limited) testing, I can say that using text mode is slower than binary mode (Windows only?), and using getc() or gets() seems to be faster than get_bytes(), read_lines(), or read_file().
-Greg
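A minimal, unscientific timing sketch along the lines Greg describes (Euphoria 4.x assumed; the file name test.dat is a hypothetical placeholder and open() error checking is omitted):

include std/io.e

constant TEST_FILE = "test.dat"  -- hypothetical test file
integer fn, c
atom t0, t1, n
sequence data

-- Method 1: getc() loop in binary mode
t0 = time()
fn = open(TEST_FILE, "rb")
c = getc(fn)
n = 0
while c != -1 do
    n += 1
    c = getc(fn)
end while
close(fn)
t1 = time()
printf(1, "getc() loop: %d bytes in %.2f seconds\n", {n, t1 - t0})

-- Method 2: read_file() in binary mode
t0 = time()
data = read_file(TEST_FILE, BINARY_MODE)
t1 = time()
printf(1, "read_file(): %d bytes in %.2f seconds\n", {length(data), t1 - t0})

Results will depend on the OS, the disk cache and the file size, so run it a few times on your own data before drawing conclusions.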
4. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 13, 2014
- 2879 views
If anybody has code for this, please post it.
include std/io.e

integer h
integer c
sequence d
atom p

d = command_line()
h = open(d[3], "rb")
if h != -1 then
    io:seek(h, -1)      -- Go to end of file.
    p = io:where(h)     -- Get file size in bytes.
    io:seek(h, 0)       -- Return to start of file.
    d = repeat(0, p)    -- Make space in RAM for entire file.
    p = 1               -- Start at first position.
    c = getc(h)         -- Get first byte.
    while c != -1 do
        d[p] = c        -- Save byte.
        p += 1          -- Go to next position in RAM.
        c = getc(h)     -- Get next byte.
    end while
    close(h)
else
    puts(1, "File '" & d[3] & "' could not be opened.\n")
end if
5. Re: Searching for Fastest Way to Input a File
- Posted by ChrisB (moderator) May 14, 2014
- 2799 views
Hi
Doesn't read_file() do it for you?
Chris
6. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 14, 2014
- 2782 views
Doesn't read_file() do it for you?
My code example is pretty much what read_file() does for 'binary' data files. For text files, read_file() does a little bit more processing to make sure line endings are standardized.
A problem with read_file() is that it attempts to fit the entire file into RAM, which means that files greater than a quarter of your real RAM size (for 32-bit systems) just won't be read, nor will files more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
But the main point I'd like to make is that the fastest way to read a file is by using the getc() function. How you use the bytes read is the application's responsibility.
7. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 14, 2014
- 2788 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this? Even the 32bit version is supposed to work with 4GB+ sized files. Large file size support was added back when 4.0.0 was alpha. If this no longer works, it's a bug in the existing functionality.
8. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 14, 2014
- 2772 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this?
Sorry. I misinformed you. It is actually a problem with the repeat() function. And the limit is "power(2,30) - 1"
repetition count must not be more than 1073741823
When I corrected my example code to cater for this, I ran out of memory for large files.
d = {}
while p > 1073741823 do
    d &= repeat(0, 1073741823)
    p -= 1073741823
end while
d &= repeat(0, p)
When I removed the actual storing of bytes, I could read files of any size (around 25MB/second)
9. Re: Searching for Fastest Way to Input a File
- Posted by petelomax May 14, 2014
- 2774 views
While getc() is fast, if the file is quite large I suspect you could do much better with MapViewOfFile. I think there is mmap on Linux, but it would be quite messy stuff that Eu does not natively support.
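For anyone curious what that would look like, here is a rough, untested, Linux-only sketch of the mmap idea using std/dll.e. The libc constant values (O_RDONLY=0, SEEK_END=2, PROT_READ=1, MAP_PRIVATE=2) are the usual Linux ones, the file name bigfile.dat is a placeholder, and error handling is omitted:

include std/dll.e
include std/machine.e

constant libc    = open_dll("libc.so.6")
constant c_open  = define_c_func(libc, "open",  {C_POINTER, C_INT}, C_INT)
constant c_lseek = define_c_func(libc, "lseek", {C_INT, C_LONG, C_INT}, C_LONG)
constant c_mmap  = define_c_func(libc, "mmap",
                   {C_POINTER, C_LONG, C_INT, C_INT, C_INT, C_LONG}, C_POINTER)
constant c_close = define_c_proc(libc, "close", {C_INT})

atom path = allocate_string("bigfile.dat")          -- placeholder file name
integer fd = c_func(c_open, {path, 0})              -- open read-only
atom size = c_func(c_lseek, {fd, 0, 2})             -- seek to end = file size
atom addr = c_func(c_mmap, {0, size, 1, 2, fd, 0})  -- map the whole file

-- The file is now addressable without reading it all in;
-- peek() touches only the pages actually needed.
sequence first100 = peek({addr, 100})
? first100
c_proc(c_close, {fd})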
10. Re: Searching for Fastest Way to Input a File
- Posted by gimlet May 15, 2014
- 2726 views
My god!
But what kind of file would need it?
11. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2770 views
While getc() is fast, if the file is quite large I suspect you could do much better with MapViewOfFile. I think there is mmap on Linux, but it would be quite messy stuff that Eu does not natively support.
I guess we could always bring it back: http://scm.openeuphoria.org/hg/euphoria/file/11871c40c37f/include/std/unix/mmap.e
12. Re: Searching for Fastest Way to Input a File
- Posted by bugmagnet May 15, 2014
- 2709 views
But what kind of file would need it?
The 20GB file I recently created in SSMS containing a schema and all the data as INSERT statements would seem to be a good candidate. It's in UTF-16LE encoding as well.
Bugmagnet
13. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2716 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this?
Sorry. I misinformed you. It is actually a problem with the repeat() function. And the limit is "power(2,30) - 1"
repetition count must not be more than 1073741823
Ah ha, now I get it. I wonder why that limit exists. I guess it is acceptable on 32bit code, but I'm not sure if it makes sense for 64bit.
When I corrected my example code to cater for this, I ran out of memory for large files.
When I removed the actual storing of bytes, I could read files of any size (around 25MB/second)
Something like PAE is needed to access more than 4GB from a 32bit process, but I don't really think it's worth the effort to do all of that. It's probably easier to just seek around the file and read in and process pieces of it at a time.
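A minimal sketch of that chunked approach (Euphoria 4.x assumed; process_chunk() and the file name are hypothetical placeholders) keeps only one piece in RAM at a time:

include std/get.e   -- get_bytes()

constant CHUNK = 1024 * 1024   -- 1 MB at a time

procedure process_chunk(sequence bytes)
    -- placeholder: scan, convert, write out, whatever the application needs
end procedure

integer fn = open("bigfile.dat", "rb")
sequence chunk = get_bytes(fn, CHUNK)
while length(chunk) > 0 do
    process_chunk(chunk)
    chunk = get_bytes(fn, CHUNK)
end while
close(fn)

io:seek() can be used first to jump straight to the region of interest if only part of the file is needed.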
14. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 15, 2014
- 2711 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this?
Sorry. I misinformed you. It is actually a problem with the repeat() function. And the limit is "power(2,30) - 1"
repetition count must not be more than 1073741823
When I corrected my example code to cater for this, I ran out of memory for large files.
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
Matt
15. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2719 views
1073741823 should be enough sequence elements for everyone!
ROFLOLMAO! This really made my day.
16. Re: Searching for Fastest Way to Input a File
- Posted by mindwalker May 15, 2014
- 2696 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
The Milky Way contains 100–400 billion stars. I guess one could make 100-400 separate sequences assuming this limit is per sequence and not an overall sequence element limit per program.
Yeah, everyone is thinking database not sequence. But to compute the net gravitational effect of the rest of the galaxy on an individual star one needs all the masses and relative coordinates of every star. And if one was trying to simulate the motion of the galaxy they would need to compute the gravitational effect on every star. Imagine the physical i/o required if one relied on a database to read this information in for each star's computation. Then repeat over and over again.
Although I enjoyed playing with the DOS version of Astro from the archive simulating up to 2500 masses at a time, I'm not planning to simulate the Milky Way. Besides there would be other computing power problems to overcome. My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
17. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2686 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
In 64bit, I agree. There's no sane reason to have a limit.
In 32bit, I'd still agree in principle. However, the 2GB limit of memory available to a 32bit process (there are ways around that, but at most you can push the limit back to 4GB, not counting things like PAE) and the fact that the minimum size of a sequence element is 4 bytes mean that, in practice, you can't get much further past this size (if at all) anyway. And trying to use multiple sequences to get around the limit is doomed to fail for the same reason.
In short, no 32bit process can model the entire galaxy (or work on similar sized datasets) while holding all the information in conventional addressing memory space.
If you really want to process datasets this big, you either 1) want to do it using a different algorithm (one that loads slices of the dataset one at a time, or something similar), 2) do it with a 64bit process, or 3) both. In light of these restrictions for 32bit processes, it can be argued that the current limitation makes sense by providing a (slightly) more user-friendly error message than a mere OOM.
18. Re: Searching for Fastest Way to Input a File
- Posted by _tom (admin) May 15, 2014
- 2668 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
The Bill Gates reference is accurate. At the time he was telling users of 64k computers (like Commodore 64) that 640k was a huge number.
The funny part was that Bill Gates provided lots of 64k limits in DOS.
Wikipedia: "The size of 64 kilobytes is a traditional limit which was inherited from the maximum size of a COM file."
In 64bit, I agree. There's no sane reason to have a limit.
Is this limit going to be raised?
_tom
19. Re: Searching for Fastest Way to Input a File
- Posted by euphoric (admin) May 15, 2014
- 2666 views
I did a search for "fastest way to read a file" and got no hits.
Ummm... more detail please.
- Which language are you intending to use?
- Is it a text file or binary?
- Do you include processing / interpreting the bytes in the 'speed' test or is it just a matter of getting the bytes into RAM?
- Is hardware an issue?
- Is size of the file an issue?
- What media does the file reside on?
My request was more for strengthening our search capabilities than obtaining the code, though obtaining the code was still an objective! :D
And I forgot about the read_lines() and read_file() commands. THOSE are what should have appeared in my search results. I guess we need smarter search, or a link to a Google search of this site.
Thank you for the code, Derek!
20. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2721 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
The Bill Gates reference is accurate. At the time he was telling users of 64k computers (like Commodore 64) that 640k was a huge number.
The funny part was that Bill Gates provided lots of 64k limits in DOS.
Wikipedia: "The size of 64 kilobytes is a traditional limit which was inherited from the maximum size of a COM file."
The reference to computer folklore is accurate. It's not clear if Bill Gates really stated the quote, however.
I recall at some point, I read a webpage that said Bill Gates had in the 1980s said to an audience of students that 640K would be enough for 10 years or something along those lines, which got corrupted into the form that folklore recalls today. Not sure if that's true, however.
In 64bit, I agree. There's no sane reason to have a limit.
Is this limit going to be raised?
I made a mistake here. I took a closer look at the code, and this is what it actually does:
if (count > MAXINT_DBL) RTFatal("repetition count must not be more than 1073741823");
(from source/be_runtime.c)
Here's how MAXINT_DBL is defined:
#define MAXINT_DBL ((eudouble)MAXINT)
And here's the actual limit for 32bit:
#define MAXINT (intptr_t) 0x3FFFFFFFL
But it's set to this for 64bit:
#define MAXINT (intptr_t) INT64_C( 0x3FFFFFFFFFFFFFFF )
(The last three snippets are from include/euphoria.h)
So, if the true limit is ever reached, the error message under 64bits will be wrong. But the limit for 64bit has already been raised.
21. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2711 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
The Bill Gates reference is accurate. At the time he was telling users of 64k computers (like Commodore 64) that 640k was a huge number.
The funny part was that Bill Gates provided lots of 64k limits in DOS.
Wikipedia: "The size of 64 kilobytes is a traditional limit which was inherited from the maximum size of a COM file."
The reference to computer folklore is accurate. It's not clear if Bill Gates really stated the quote, however.
I recall at some point, I read a webpage that said Bill Gates had in the 1980s said to an audience of students that 640K would be enough for 10 years or something along those lines, which got corrupted into the form that folklore recalls today. Not sure if that's true, however.
Found this on wikiquote though:
I have to say that in 1981, making those decisions, I felt like I was providing enough freedom for 10 years. That is, a move from 64k to 640k felt like something that would last a great deal of time. Well, it didn't - it took about only 6 years before people started to see that as a real problem.
1989 speech on the history of the microcomputer industry.
Wikiquote says its from this source: http://www.csclub.uwaterloo.ca/media/1989%20Bill%20Gates%20Talk%20on%20Microsoft.html
edit: found the page I was thinking of: http://blog.softlayer.com/2008/640k-should-be-enough-for-everybody
22. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 16, 2014
- 2633 views
Although I enjoyed playing with the DOS version of Astro from the archive simulating up to 2500 masses at a time, I'm not planning to simulate the Milky Way. Besides there would be other computing power problems to overcome. My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
There are probably better ways to do it. Such a sequence would require a massive amount of RAM just to allocate, let alone what would happen on copy-on-write operations. And hard core numerical computation like that is probably best done in something else anyways.
Matt
23. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 16, 2014
- 2620 views
I made a mistake here. I took a closer look at the code, and this is what it actually does:
if (count > MAXINT_DBL) RTFatal("repetition count must not be more than 1073741823");
(from source/be_runtime.c)
Here's how MAXINT_DBL is defined:
#define MAXINT_DBL ((eudouble)MAXINT)
And here's the actual limit for 32bit:
#define MAXINT (intptr_t) 0x3FFFFFFFL
But it's set to this for 64bit:
#define MAXINT (intptr_t) INT64_C( 0x3FFFFFFFFFFFFFFF )
(The last three snippets are from include/euphoria.h)
So, if the true limit is ever reached, the error message under 64bits will be wrong. But the limit for 64bit has already been raised.
But if you tried to make a sequence that big, there would be problems, because the actual structure uses an "int" for the sequence length. Honestly, I don't see any compelling reason to raise the limit. I can't see any sane reason to try to use a sequence that big except to say that you did it. Maybe someday we'll have large and fast enough memory to make it worthwhile.
Matt
24. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 16, 2014
- 2619 views
I made a mistake here. I took a closer look at the code, and this is what it actually does:
if (count > MAXINT_DBL) RTFatal("repetition count must not be more than 1073741823");
(from source/be_runtime.c)
Here's how MAXINT_DBL is defined:
#define MAXINT_DBL ((eudouble)MAXINT)
And here's the actual limit for 32bit:
#define MAXINT (intptr_t) 0x3FFFFFFFL
But it's set to this for 64bit:
#define MAXINT (intptr_t) INT64_C( 0x3FFFFFFFFFFFFFFF )
(The last three snippets are from include/euphoria.h)
So, if the true limit is ever reached, the error message under 64bits will be wrong. But the limit for 64bit has already been raised.
But if you tried to make a sequence that big, there would be problems, because the actual structure uses an "int" for the sequence length.
Either way, the check is broken: either the sequence length should be promoted to long, or the check should use the old (32bit) MAXINT value when determining the maximum length. The point is, the check for the maximum sequence length and the actual maximum sequence length should match.
If they don't, then it's a serious, code-freeze-overriding bug.
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
25. Re: Searching for Fastest Way to Input a File
- Posted by mindwalker May 16, 2014
- 2674 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
Thanks, I didn't catch the reference, so I thought I was looking at another of the many short-sighted assumptions littering the history of computers.
My favorite one is the IBM executive who decided that IBM wouldn't market the in-house developed minicomputer (precursor to today's PCs) because in his opinion there was probably only a market for 5-10 worldwide.
When I started my programming career I worked for a manufacturing company that produced automated mills and lathes. These were built with an attached computer that controlled the machine's operations. When I started, these computers had 32KB total for both the operating system and the parts program loaded from Mylar tape; there was no floppy or hard drive since dust and such in the work environment was very destructive.
It was a constant struggle to minimize the operating system to preserve space for the parts program. When they moved to computers with 64KB of memory, it was thought we would never need all that space. A few predicted it would be the last time the computers would need to be upgraded. But it wasn't long until someone observed that if the operating manual was in the computer it would be a great improvement. Another suggested additional industry manuals could be added. In less than 2 years we were back to struggling to keep the operating system size small.
When I left the company after working there for 6 years, they were struggling to contain everything in a 128KB space. It turns out that when the customers saw the manual stored on the computer, many asked if it was possible to keep their part programs on the machine too, so they wouldn't need to reload them each time and wouldn't have to worry about misplacing the Mylar tapes.
26. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 17, 2014
- 2612 views
Thanks, I didn't catch the reference, so I thought I was looking at another of the many short-sighted assumptions littering the history of computers.
Perhaps you thought correctly. In any case, I've now opened a ticket for this: ticket:895
27. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2543 views
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
Like I said, no compelling reason.
Matt
28. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2509 views
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
Like I said, no compelling reason.
Matt
I think that's pretty compelling. Otherwise, why even have large file support in Euphoria at all? 2GB files should be more than enough for everyone!
Edit: On the flip side, is there any compelling reason to avoid adding this?
If we do ever decide to add it, now is a good time to do so. Bumping up the limit involves deep changes in Euphoria's internal structures. What this means is that if we decide to do this 10 years down the line, any 64bit Euphoria compiled dll that was built before will have a different and incompatible structure. So you may not be able to use a dll compiled with Euphoria 5.35.7 in an Euphoria program running under an Euphoria 5.35.8 interpreter. Or maybe we can, but we'll need to add version detection and a shim layer to do conversion for us, which would be a lot of work.
If we make this change now, before the first official 64bit release, then we have no 64bit Euphoria dll incompatibilities to worry about. Waiting, on the other hand, will make this even more painful to deal with.
29. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2500 views
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
Like I said, no compelling reason.
Matt
I think that's pretty compelling. Otherwise, why even have large file support in Euphoria at all? 2GB files should be more than enough for everyone!
What is it that you think you are going to accomplish by loading that much information into RAM? Reading and writing large files does not mean that you have to put the whole thing into a flat sequence.
Edit: On the flip side, is there any compelling reason to avoid adding this?
It's not a useful thing to do.
If we do ever decide to add it, now is a good time to do so. Bumping up the limit involves deep changes in Euphoria's internal structures. What this means is that if we decide to do this 10 years down the line, any 64bit Euphoria compiled dll that was built before will have a different and incompatible structure. So you may not be able to use a dll compiled with Euphoria 5.35.7 in an Euphoria program running under an Euphoria 5.35.8 interpreter. Or maybe we can, but we'll need to add version detection and a shim layer to do conversion for us, which would be a lot of work.
If we make this change now, before the first official 64bit release, then we have no 64bit Euphoria dll incompatibilities to worry about. Waiting, on the other hand, will make this even more painful to deal with.
I'm not even a little bit worried about this, but I suppose another 8 bytes for sequence storage won't kill us.
Matt
30. Re: Searching for Fastest Way to Input a File
- Posted by ghaberek (admin) May 19, 2014
- 2501 views
What is it that you think you are going to accomplish by loading that much information into RAM? Reading and writing large files does not mean that you have to put the whole thing into a flat sequence.
I agree with Matt; loading an entire large file into a sequence is a Very Bad Idea to begin with. If this is the reason for extending the limit of a single sequence, then I would be against it. If, however, we perceive this built-in limit to be a correctable limitation for overall 64-bit compatibility, then it should be corrected before an official 64-bit release. If one must load an entire large file into memory, he would be better served with allocate()/free() and peek()/poke() instead.
-Greg
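A hedged sketch of that allocate()/peek()/poke() approach (Euphoria 4.x assumed; the file name is a placeholder and error checking is omitted). The file lives in one flat memory block rather than a sequence, so the sequence-length limit never comes into play:

include std/get.e        -- get_bytes()
include std/machine.e    -- allocate(), free()
include std/filesys.e    -- file_length()

constant FILE_NAME = "bigfile.dat"   -- placeholder
constant CHUNK = 1024 * 1024

atom fsize = file_length(FILE_NAME)
atom buf = allocate(fsize)           -- one flat block, not a sequence
atom pos = buf

integer fn = open(FILE_NAME, "rb")
sequence chunk = get_bytes(fn, CHUNK)
while length(chunk) > 0 do
    poke(pos, chunk)                 -- copy this chunk into the block
    pos += length(chunk)
    chunk = get_bytes(fn, CHUNK)
end while
close(fn)

-- later: random access with peek({buf + offset, count})
? peek({buf, 16})
free(buf)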
31. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2506 views
Like I said, no compelling reason.
Matt
I think that's pretty compelling. Otherwise, why even have large file support in Euphoria at all? 2GB files should be more than enough for everyone!
What is it that you think you are going to accomplish by loading that much information into RAM?
Maybe the hypothetical application needs to do it for speed. Perhaps the file is located on a network share, and local disk quotas prevent copying it locally (to prevent the disk from filling up with gigabytes and gigabytes of temp files copied over from network shares).
Ok, there are still better ways to do this like using mmap, but what it comes down to is programmer choice.
Reading and writing large files does not mean that you have to put the whole thing into a flat sequence.
I agree with Matt; loading an entire large file into a sequence is a Very Bad Idea to begin with. If this is the reason for extending the limit of a single sequence, then I would be against it.
If one must load an entire large file into memory, he would be better served with allocate()/free() and peek()/poke() instead.
Agreed. There are much better ways to manipulate large amounts of data. However, I think it's really backwards to have a library routine impose an arbitrary limit like this for no good reason.
Edit: On the flip side, is there any compelling reason to avoid adding this?
It's not a useful thing to do.
Regarding usefulness of reading 1GB text files: entire books could fit in a 64MB text file. Should we say the limit is 64MB, because no one would write a work of literature that could take up more space than that?
Regarding usefulness of extending the maximum length of a sequence generally: how do you know that what is not useful now won't be useful in the future? Computers originally just processed text. Later on, they could handle music and 2-D video. Then we got 3D video and full length movies. Each required a corresponding increase in available RAM and disk space.
I can see future datasets, combining text (like subtitles), sound+music, 360 degree spherical video, sensory data for touch simulators, and even encodings to duplicate taste and smell. (Perhaps a virtual world simulator would use these.) The files for this would probably be huge.
Processing 1TB of data would not have been useful back in the days of ENIAC. But today, astronomers do this all the time.
If we do ever decide to add it, now is a good time to do so. Bumping up the limit involves deep changes in Euphoria's internal structures. What this means is that if we decide to do this 10 years down the line, any 64bit Euphoria compiled dll that was built before will have a different and incompatible structure. So you may not be able to use a dll compiled with Euphoria 5.35.7 in an Euphoria program running under an Euphoria 5.35.8 interpreter. Or maybe we can, but we'll need to add version detection and a shim layer to do conversion for us, which would be a lot of work.
If we make this change now, before the first official 64bit release, then we have no 64bit Euphoria dll incompatibilities to worry about. Waiting, on the other hand, will make this even more painful to deal with.
I'm not even a little bit worried about this, but I suppose another 8 bytes for sequence storage won't kill us.
You aren't even a little bit worried about backwards compatibility between different versions of the E_OBJECT/E_SEQUENCE/E_ATOM interface?
If, however, we perceive this built-in limit to be a correctable limitation for overall 64-bit compatibility, then it should be corrected before an official 64-bit release.
We were imposing a limit on 64bit that was unnecessary and made no sense. Additionally, if it ever turned out that it would be useful to increase the limit (on a system that supported using that much RAM/memory/etc, of course), we could not do so without encountering compatibility problems. If we change it now, we can avoid those compatibility issues down the line.
32. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2449 views
You aren't even a little bit worried about backwards compatibility between different versions of the E_OBJECT/E_SEQUENCE/E_ATOM interface?
Ideally, no, because I wouldn't expand the max size of a sequence.
Working with a data set of a certain size does not imply loading it in its entirety into something like a sequence. If you think you need to do this, you need to realize you're doing it wrong and that there is a better way.
I'm not going to stop you, but I reserve the right to mock people who come back complaining about how terrible things get when you use ginormous sequences.
Matt
33. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2437 views
Working with a data set of a certain size does not imply loading it in its entirety into something like a sequence. If you think you need to do this, you need to realize you're doing it wrong and that there is a better way.
Well, arrays are "something like a sequence", and memory-mapped files are sort of like arrays. And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
The whole point is about flexibility and future-proofing. You never know what'll turn out to be useful later on. And you never know what'll still be around then either. (Just think of those stories from folklore about COBOL programmers who figured their programs would be long gone by the year 2000, only for thousands of dollars to be spent in order to make them Y2K-bug free.)
I'm not going to stop you, but I reserve the right to mock people who come back complaining about how terrible things get when you use ginormous sequences.
On the flip side, predicting the future accurately is pretty tough too. Just think of those in the past who thought we'd have Moon colonies and rocket cars by the 1990s.
So, that seems more than fair.
34. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 19, 2014
- 2421 views
I'm not going to stop you, but I reserve the right to mock people who come back complaining about how terrible things get when you use ginormous sequences.
It's kinda like the "goto" situation. Basically, do we let people do stupid/unwise/courageous programming or do we make Euphoria baby-sit them?
My concern with this proposed change is how it would affect the performance of small (normal sized) sequences. Anything that adversely affects that would be detrimental to Euphoria's acceptance.
35. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 19, 2014
- 2437 views
... And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
Note that even memory-mapping a file is limited to the system's RAM and virtual storage constraints.
36. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2453 views
... And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
Note that even memory-mapping a file is limited to the system's RAM and virtual storage constraints.
Agreed. Still, I would have thought it'd be possible to, for example, map a 5GB or 6GB file into memory under a 64bit cpu. Certainly, servers are being sold today that could support such an operation: http://www.amazon.com/Kingston-Technology-667MHz-KTH-XW667-64G/dp/B001VMJ6MO
On the other hand, the only thing that I can think of that takes up that much memory today in a single process is MS SQL Server. (I've personally seen this go up as high as 20GB.)
My concern with this proposed change is how it would affect the performance of small (normal sized) sequences. Anything that adversely affects that would be detrimental to Euphoria's acceptance.
Agreed. I haven't benchmarked it, but I wouldn't expect it to slow things down at all.
37. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2432 views
... And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
Note that even memory-mapping a file is limited to the system's RAM and virtual storage constraints.
I don't think it's limited to the RAM, is it? On a 64-bit OS, the virtual storage (memory space) is really really big. x86-64 "only" supports 256TB of a possible 16EB virtual memory space.
Matt
38. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 19, 2014
- 2427 views
I don't think it's limited to the RAM, is it? On a 64-bit OS, the virtual storage (memory space) is really really big. x86-64 "only" supports 256TB of a possible 16EB virtual memory space.
Technically still limited ... but the main issue with this is the ratio of real RAM to Virtual RAM. This ratio will have an effect on performance, especially when doing random access on the file data.
39. Re: Searching for Fastest Way to Input a File
- Posted by petelomax May 20, 2014
- 2380 views
Erm, actually I already have a perfectly reasonable use of quite ridiculously long sequences:
I sometimes watch movies online, but actually more often than not I copy them to my tablet or a thumbdrive that I can plug into the tv in the lounge. I clearly remember one being over 4GB.
Sometimes the connection craps out with like 15 mins left, and it is no big deal to drag the slider to just before the cutoff point and watch/capture the last bit. So I have a 40-line ditty to bolt the two chunks together. It copies the first chunk/file byte-by-byte, but it reads the whole of the second file into memory so that it can match() the last 100 bytes of the first file with the second, and carry on from there.
The point is, it works fine, it is dirt simple, and it is /very/ fast. It is quite clearly entirely i/o bound, and all the better precisely because it uses silly-length-sequences.
Pete
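Pete didn't post the ditty itself, but the idea is straightforward to reconstruct. Here is a hedged sketch (Euphoria 4.x assumed; file names are placeholders and error checking is omitted), not his actual code:

include std/io.e

constant PART1 = "movie_part1.mp4", PART2 = "movie_part2.mp4",
         OUT   = "movie_joined.mp4"

integer f1 = open(PART1, "rb")
integer fo = open(OUT, "wb")
sequence tail = {}
integer c = getc(f1)
while c != -1 do
    puts(fo, c)                    -- copy the first capture byte by byte
    tail = append(tail, c)
    if length(tail) > 100 then
        tail = tail[2..$]          -- remember only the last 100 bytes
    end if
    c = getc(f1)
end while
close(f1)

-- read the whole second capture into RAM and find where it overlaps
sequence part2 = read_file(PART2, BINARY_MODE)
integer at = match(tail, part2)
if at then
    puts(fo, part2[at + length(tail) .. $])   -- carry on from the overlap
end if
close(fo)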