1. Searching for Fastest Way to Input a File
- Posted by euphoric (admin) May 13, 2014
- 2860 views
I did a search for "fastest way to read a file" and got no hits.
Anybody want to suggest a search phrase that will work better?
I will also try searching with Google.
If anybody has a direct link to code, please share it.
If anybody has code for this, please post it.
2. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 13, 2014
- 2858 views
I did a search for "fastest way to read a file" and got no hits.
Ummm... more detail please.
- Which language are you intending to use?
- Is it a text file or binary?
- Do you include processing / interpreting the bytes in the 'speed' test or is it just a matter of getting the bytes into RAM?
- Is hardware an issue?
- Is size of the file an issue?
- What media does the file reside on?
3. Re: Searching for Fastest Way to Input a File
- Posted by ghaberek (admin) May 13, 2014
- 2822 views
Based on my (severely limited) testing, I can say that using text mode is slower than binary mode (Windows only?), and using getc() or gets() seems to be faster than get_bytes(), read_lines(), or read_file().
-Greg
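A minimal, unscientific timing sketch along the lines Greg describes (Euphoria 4.x assumed; the file name test.dat is a hypothetical placeholder and open() error checking is omitted):

include std/io.e

constant TEST_FILE = "test.dat"  -- hypothetical test file
integer fn, c
atom t0, t1, n
sequence data

-- Method 1: getc() loop in binary mode
t0 = time()
fn = open(TEST_FILE, "rb")
c = getc(fn)
n = 0
while c != -1 do
    n += 1
    c = getc(fn)
end while
close(fn)
t1 = time()
printf(1, "getc() loop: %d bytes in %.2f seconds\n", {n, t1 - t0})

-- Method 2: read_file() in binary mode
t0 = time()
data = read_file(TEST_FILE, BINARY_MODE)
t1 = time()
printf(1, "read_file(): %d bytes in %.2f seconds\n", {length(data), t1 - t0})

Results will depend on the OS, the disk cache and the file size, so run it a few times on your own data before drawing conclusions.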
4. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 13, 2014
- 2879 views
If anybody has code for this, please post it.
include std/io.e

integer h
integer c
sequence d
atom p

d = command_line()
h = open(d[3], "rb")
if h != -1 then
    io:seek(h, -1)      -- Go to end of file.
    p = io:where(h)     -- Get file size in bytes.
    io:seek(h, 0)       -- Return to start of file.
    d = repeat(0, p)    -- Make space in RAM for entire file.
    p = 1               -- Start at first position.
    c = getc(h)         -- Get first byte.
    while c != -1 do
        d[p] = c        -- Save byte.
        p += 1          -- Go to next position in RAM.
        c = getc(h)     -- Get next byte.
    end while
    close(h)
else
    puts(1, "File '" & d[3] & "' could not be opened.\n")
end if
5. Re: Searching for Fastest Way to Input a File
- Posted by ChrisB (moderator) May 14, 2014
- 2799 views
Hi
Doesn't read_file() do it for you?
Chris
6. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 14, 2014
- 2782 views
Doesn't read_file() do it for you?
My code example is pretty much what read_file() does for 'binary' data files. For text files, read_file() does a little bit more processing to make sure line endings are standardized.
A problem with read_file() is that it attempts to fit the entire file into RAM, which means that files greater than a quarter of your real RAM size (for 32-bit systems) just won't be read, nor will files more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
But the main point I'd like to make is that the fastest way to read a file is by using the getc() function. How you use the bytes read is the application's responsibility.
7. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 14, 2014
- 2788 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this? Even the 32bit version is supposed to work with 4GB+ sized files. Large file size support was added back when 4.0.0 was alpha. If this no longer works, it's a bug in the existing functionality.
8. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 14, 2014
- 2772 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this?
Sorry. I misinformed you. It is actually a problem with the repeat() function. And the limit is "power(2,30) - 1"
repetition count must not be more than 1073741823
When I corrected my example code to cater for this, I ran out of memory for large files.
d = {}
while p > 1073741823 do
    d &= repeat(0, 1073741823)
    p -= 1073741823
end while
d &= repeat(0, p)
When I removed the actual storing of bytes, I could read files of any size (around 25MB/second)
9. Re: Searching for Fastest Way to Input a File
- Posted by petelomax May 14, 2014
- 2774 views
While getc() is fast, if the file is quite large I suspect you could do much better with MapViewOfFile. I think there is mmap on Linux, but it would be quite messy stuff that Eu does not natively support.
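For anyone curious what that would look like, here is a rough, untested, Linux-only sketch of the mmap idea using std/dll.e. The libc constant values (O_RDONLY=0, SEEK_END=2, PROT_READ=1, MAP_PRIVATE=2) are the usual Linux ones, the file name bigfile.dat is a placeholder, and error handling is omitted:

include std/dll.e
include std/machine.e

constant libc    = open_dll("libc.so.6")
constant c_open  = define_c_func(libc, "open",  {C_POINTER, C_INT}, C_INT)
constant c_lseek = define_c_func(libc, "lseek", {C_INT, C_LONG, C_INT}, C_LONG)
constant c_mmap  = define_c_func(libc, "mmap",
                   {C_POINTER, C_LONG, C_INT, C_INT, C_INT, C_LONG}, C_POINTER)
constant c_close = define_c_proc(libc, "close", {C_INT})

atom path = allocate_string("bigfile.dat")          -- placeholder file name
integer fd = c_func(c_open, {path, 0})              -- open read-only
atom size = c_func(c_lseek, {fd, 0, 2})             -- seek to end = file size
atom addr = c_func(c_mmap, {0, size, 1, 2, fd, 0})  -- map the whole file

-- The file is now addressable without reading it all in;
-- peek() touches only the pages actually needed.
sequence first100 = peek({addr, 100})
? first100
c_proc(c_close, {fd})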
10. Re: Searching for Fastest Way to Input a File
- Posted by gimlet May 15, 2014
- 2726 views
My god!
But what kind of file would need it?
11. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2770 views
While getc() is fast, if the file is quite large I suspect you could do much better with MapViewOfFile. I think there is mmap on Linux, but it would be quite messy stuff that Eu does not natively support.
I guess we could always bring it back: http://scm.openeuphoria.org/hg/euphoria/file/11871c40c37f/include/std/unix/mmap.e
12. Re: Searching for Fastest Way to Input a File
- Posted by bugmagnet May 15, 2014
- 2709 views
But what kind of file would need it?
The 20GB file I recently created in SSMS containing a schema and all the data as INSERT statements would seem to be a good candidate. It's in UTF-16LE encoding as well.
Bugmagnet
13. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2716 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this?
Sorry. I misinformed you. It is actually a problem with the repeat() function. And the limit is "power(2,30) - 1"
repetition count must not be more than 1073741823
Ah ha, now I get it. I wonder why that limit exists. I guess it is acceptable on 32bit code, but I'm not sure if it makes sense for 64bit.
When I corrected my example code to cater for this, I ran out of memory for large files.
When I removed the actual storing of bytes, I could read files of any size (around 25MB/second)
Something like PAE is needed to access more than 4GB from a 32bit process, but I don't really think it's worth the effort to do all of that. It's probably easier to just seek around the file and read in and process pieces of it at a time.
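A minimal sketch of that chunked approach (Euphoria 4.x assumed; process_chunk() and the file name are hypothetical placeholders) keeps only one piece in RAM at a time:

include std/get.e   -- get_bytes()

constant CHUNK = 1024 * 1024   -- 1 MB at a time

procedure process_chunk(sequence bytes)
    -- placeholder: scan, convert, write out, whatever the application needs
end procedure

integer fn = open("bigfile.dat", "rb")
sequence chunk = get_bytes(fn, CHUNK)
while length(chunk) > 0 do
    process_chunk(chunk)
    chunk = get_bytes(fn, CHUNK)
end while
close(fn)

io:seek() can be used first to jump straight to the region of interest if only part of the file is needed.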
14. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 15, 2014
- 2711 views
nor will files with more than (power(2,31)-1) bytes long, due to a limitation in the seek() function.
Can you elaborate on this?
Sorry. I misinformed you. It is actually a problem with the repeat() function. And the limit is "power(2,30) - 1"
repetition count must not be more than 1073741823
When I corrected my example code to cater for this, I ran out of memory for large files.
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
Matt
15. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2719 views
1073741823 should be enough sequence elements for everyone!
ROFLOLMAO! This really made my day.
16. Re: Searching for Fastest Way to Input a File
- Posted by mindwalker May 15, 2014
- 2696 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
The Milky Way contains 100–400 billion stars. I guess one could make 100-400 separate sequences assuming this limit is per sequence and not an overall sequence element limit per program.
Yeah, everyone is thinking database not sequence. But to compute the net gravitational effect of the rest of the galaxy on an individual star one needs all the masses and relative coordinates of every star. And if one was trying to simulate the motion of the galaxy they would need to compute the gravitational effect on every star. Imagine the physical i/o required if one relied on a database to read this information in for each star's computation. Then repeat over and over again.
Although I enjoyed playing with the DOS version of Astro from the archive simulating up to 2500 masses at a time, I'm not planning to simulate the Milky Way. Besides there would be other computing power problems to overcome. My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
17. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2686 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
In 64bit, I agree. There's no sane reason to have a limit.
In 32bit, I'd still agree in principle. However, the 2GB limit of memory available to a 32bit process (there are ways around that, but at most you can push the limit back to 4GB, not counting things like PAE) and the fact that the minimum size of a sequence element is 4 bytes mean that, in practice, you can't get much further past this size (if at all) anyway. And trying to use multiple sequences to get around the limit is doomed to fail for the same reason.
In short, no 32bit process can model the entire galaxy (or work on similar sized datasets) while holding all the information in conventional addressing memory space.
If you really want to process datasets this big, you either 1) want to do it using a different algorithm (one that loads slices of the dataset one at a time, or something similar), 2) do it with a 64bit process, or 3) both. In light of these restrictions for 32bit processes, it can be argued that the current limitation makes sense by providing a (slightly) more user-friendly error message than a mere OOM.
18. Re: Searching for Fastest Way to Input a File
- Posted by _tom (admin) May 15, 2014
- 2668 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
The Bill Gates reference is accurate. At the time he was telling users of 64k computers (like Commodore 64) that 640k was a huge number.
The funny part was that Bill Gates provided lots of 64k limits in DOS.
Wikipedia: "The size of 64 kilobytes is a traditional limit which was inherited from the maximum size of a COM file."
In 64bit, I agree. There's no sane reason to have a limit.
Is this limit going to be raised?
_tom
19. Re: Searching for Fastest Way to Input a File
- Posted by euphoric (admin) May 15, 2014
- 2666 views
I did a search for "fastest way to read a file" and got no hits.
Ummm... more detail please.
- Which language are you intending to use?
- Is it a text file or binary?
- Do you include processing / interpreting the bytes in the 'speed' test or is it just a matter of getting the bytes into RAM?
- Is hardware an issue?
- Is size of the file an issue?
- What media does the file reside on?
My request was more for strengthening our search capabilities than obtaining the code, though obtaining the code was still an objective! :D
And I forgot about the read_lines() and read_file() commands. THOSE are what should have appeared in my search results. I guess we need smarter search, or a link to a Google search of this site.
Thank you for the code, Derek!
20. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2721 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
The Bill Gates reference is accurate. At the time he was telling users of 64k computers (like Commodore 64) that 640k was a huge number.
The funny part was that Bill Gates provided lots of 64k limits in DOS.
Wikipedia: "The size of 64 kilobytes is a traditional limit which was inherited from the maximum size of a COM file."
The reference to computer folklore is accurate. It's not clear if Bill Gates really stated the quote, however.
I recall at some point, I read a webpage that said Bill Gates had in the 1980s said to an audience of students that 640K would be enough for 10 years or something along those lines, which got corrupted into the form that folklore recalls today. Not sure if that's true, however.
In 64bit, I agree. There's no sane reason to have a limit.
Is this limit going to be raised?
I made a mistake here. I took a closer look at the code, and this is what it actually does:
if (count > MAXINT_DBL) RTFatal("repetition count must not be more than 1073741823");
(from source/be_runtime.c)
Here's how MAXINT_DBL is defined:
#define MAXINT_DBL ((eudouble)MAXINT)
And here's the actual limit for 32bit:
#define MAXINT (intptr_t) 0x3FFFFFFFL
But it's set to this for 64bit:
#define MAXINT (intptr_t) INT64_C( 0x3FFFFFFFFFFFFFFF )
(The last three snippets are from include/euphoria.h)
So, if the true limit is ever reached, the error message under 64bits will be wrong. But the limit for 64bit has already been raised.
21. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 15, 2014
- 2711 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
The Bill Gates reference is accurate. At the time he was telling users of 64k computers (like Commodore 64) that 640k was a huge number.
The funny part was that Bill Gates provided lots of 64k limits in DOS.
Wikipedia: "The size of 64 kilobytes is a traditional limit which was inherited from the maximum size of a COM file."
The reference to computer folklore is accurate. It's not clear if Bill Gates really stated the quote, however.
I recall at some point, I read a webpage that said Bill Gates had in the 1980s said to an audience of students that 640K would be enough for 10 years or something along those lines, which got corrupted into the form that folklore recalls today. Not sure if that's true, however.
Found this on wikiquote though:
I have to say that in 1981, making those decisions, I felt like I was providing enough freedom for 10 years. That is, a move from 64k to 640k felt like something that would last a great deal of time. Well, it didn't - it took about only 6 years before people started to see that as a real problem.
1989 speech on the history of the microcomputer industry.
Wikiquote says its from this source: http://www.csclub.uwaterloo.ca/media/1989%20Bill%20Gates%20Talk%20on%20Microsoft.html
edit: found the page I was thinking of: http://blog.softlayer.com/2008/640k-should-be-enough-for-everybody
22. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 16, 2014
- 2633 views
Although I enjoyed playing with the DOS version of Astro from the archive simulating up to 2500 masses at a time, I'm not planning to simulate the Milky Way. Besides there would be other computing power problems to overcome. My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
There are probably better ways to do it. Such a sequence would require a massive amount of RAM just to allocate, let alone what would happen on copy-on-write operations. And hard core numerical computation like that is probably best done in something else anyways.
Matt
23. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 16, 2014
- 2620 views
I made a mistake here. I took a closer look at the code, and this is what it actually does:
if (count > MAXINT_DBL) RTFatal("repetition count must not be more than 1073741823");
(from source/be_runtime.c)
Here's how MAXINT_DBL is defined:
#define MAXINT_DBL ((eudouble)MAXINT)
And here's the actual limit for 32bit:
#define MAXINT (intptr_t) 0x3FFFFFFFL
But it's set to this for 64bit:
#define MAXINT (intptr_t) INT64_C( 0x3FFFFFFFFFFFFFFF )
(The last three snippets are from include/euphoria.h)
So, if the true limit is ever reached, the error message under 64bits will be wrong. But the limit for 64bit has already been raised.
But if you tried to make a sequence that big, there would be problems, because the actual structure uses an "int" for the sequence length. Honestly, I don't see any compelling reason to raise the limit. I can't see any sane reason to try to use a sequence that big except to say that you did it. Maybe someday we'll have large and fast enough memory to make it worthwhile.
Matt
24. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 16, 2014
- 2619 views
I made a mistake here. I took a closer look at the code, and this is what it actually does:
if (count > MAXINT_DBL) RTFatal("repetition count must not be more than 1073741823");
(from source/be_runtime.c)
Here's how MAXINT_DBL is defined:
#define MAXINT_DBL ((eudouble)MAXINT)
And here's the actual limit for 32bit:
#define MAXINT (intptr_t) 0x3FFFFFFFL
But it's set to this for 64bit:
#define MAXINT (intptr_t) INT64_C( 0x3FFFFFFFFFFFFFFF )
(The last three snippets are from include/euphoria.h)
So, if the true limit is ever reached, the error message under 64bits will be wrong. But the limit for 64bit has already been raised.
But if you tried to make a sequence that big, there would be problems, because the actual structure uses an "int" for the sequence length.
Either way, the check is broken: either the sequence length should be promoted to long, or the check should use the old (32bit) MAXINT value when determining the maximum length. The point is, the check for the maximum sequence length and the actual maximum sequence length should match.
If they don't, then it's a serious, code-freeze-overriding bug.
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
25. Re: Searching for Fastest Way to Input a File
- Posted by mindwalker May 16, 2014
- 2674 views
It's actually a limit of sequences. After all, 1073741823 should be enough sequence elements for everyone!
My point is there are imaginable applications where this sequence size limitation could make the task even more difficult.
I'm pretty sure matt's comment is a joke (in-line with the supposed (and likely untrue) Bill Gates reference that "640K of RAM should be enough for everyone").
Thanks, I didn't catch the reference, so I thought I was looking at another of the many short-sighted assumptions littering the history of computers.
My favorite one is the IBM executive who decided that IBM wouldn't market the in-house developed minicomputer (precursor to today's PCs) because in his opinion there was probably only a market for 5-10 worldwide.
When I started my programming career I worked for a manufacturing company that produced automated mills and lathes. These were built with an attached computer that controlled the machine's operations. When I started, these computers had 32KB total for both the operating system and the parts program loaded from Mylar tape; there was no floppy or hard drive since dust and such in the work environment was very destructive.
It was a constant struggle to minimize the operating system to preserve space for the parts program. When they moved to computers with 64KB of memory, it was thought we would never need all that space. A few predicted it would be the last time the computers would need to be upgraded. But it wasn't long until someone observed that if the operating manual was in the computer it would be a great improvement. Another suggested additional industry manuals could be added. In less than 2 years we were back to struggling to keep the operating system size small.
When I left the company after working there for 6 years, they were struggling to contain everything in a 128KB space. It turns out that when the customers saw the manual stored on the computer, many asked if it was possible to keep their part programs on the machine too, so they wouldn't need to reload them each time and wouldn't have to worry about misplacing the Mylar tapes.
26. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 17, 2014
- 2612 views
Thanks, I didn't catch the reference, so I thought I was looking at another of the many short-sighted assumptions littering the history of computers.
Perhaps you thought correctly. In any case, I've now opened a ticket for this: ticket:895
27. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2543 views
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
Like I said, no compelling reason.
Matt
28. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2509 views
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
Like I said, no compelling reason.
Matt
I think that's pretty compelling. Otherwise, why even have large file support in Euphoria at all? 2GB files should be more than enough for everyone!
Edit: On the flip side, is there any compelling reason to avoid adding this?
If we do ever decide to add it, now is a good time to do so. Bumping up the limit involves deep changes in Euphoria's internal structures. What this means is that if we decide to do this 10 years down the line, any 64bit Euphoria compiled dll that was built before will have a different and incompatible structure. So you may not be able to use a dll compiled with Euphoria 5.35.7 in an Euphoria program running under an Euphoria 5.35.8 interpreter. Or maybe we can, but we'll need to add version detection and a shim layer to do conversion for us, which would be a lot of work.
If we make this change now, before the first official 64bit release, then we have no 64bit Euphoria dll incompatibilities to worry about. Waiting, on the other hand, will make this even more painful to deal with.
29. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2500 views
Honestly, I don't see any compelling reason to raise the limit.
Why should read_file() be limited to being able to read no more than 1GB of a file on a 64bit machine with 32GB of ram?
Like I said, no compelling reason.
Matt
I think that's pretty compelling. Otherwise, why even have large file support in Euphoria at all? 2GB files should be more than enough for everyone!
What is it that you think you are going to accomplish by loading that much information into RAM? Reading and writing large files does not mean that you have to put the whole thing into a flat sequence.
Edit: On the flip side, is there any compelling reason to avoid adding this?
It's not a useful thing to do.
If we do ever decide to add it, now is a good time to do so. Bumping up the limit involves deep changes in Euphoria's internal structures. What this means is that if we decide to do this 10 years down the line, any 64bit Euphoria compiled dll that was built before will have a different and incompatible structure. So you may not be able to use a dll compiled with Euphoria 5.35.7 in an Euphoria program running under an Euphoria 5.35.8 interpreter. Or maybe we can, but we'll need to add version detection and a shim layer to do conversion for us, which would be a lot of work.
If we make this change now, before the first official 64bit release, then we have no 64bit Euphoria dll incompatibilities to worry about. Waiting, on the other hand, will make this even more painful to deal with.
I'm not even a little bit worried about this, but I suppose another 8 bytes for sequence storage won't kill us.
Matt
30. Re: Searching for Fastest Way to Input a File
- Posted by ghaberek (admin) May 19, 2014
- 2501 views
What is it that you think you are going to accomplish by loading that much information into RAM? Reading and writing large files does not mean that you have to put the whole thing into a flat sequence.
I agree with Matt; loading an entire large file into a sequence is a Very Bad Idea to begin with. If this is the reason for extending the limit of a single sequence, then I would be against it. If, however, we perceive this built-in limit to be a correctable limitation for overall 64-bit compatibility, then it should be corrected before an official 64-bit release. If one must load an entire large file into memory, he would be better served with allocate()/free() and peek()/poke() instead.
-Greg
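A hedged sketch of that allocate()/peek()/poke() approach (Euphoria 4.x assumed; the file name is a placeholder and error checking is omitted). The file lives in one flat memory block rather than a sequence, so the sequence-length limit never comes into play:

include std/get.e        -- get_bytes()
include std/machine.e    -- allocate(), free()
include std/filesys.e    -- file_length()

constant FILE_NAME = "bigfile.dat"   -- placeholder
constant CHUNK = 1024 * 1024

atom fsize = file_length(FILE_NAME)
atom buf = allocate(fsize)           -- one flat block, not a sequence
atom pos = buf

integer fn = open(FILE_NAME, "rb")
sequence chunk = get_bytes(fn, CHUNK)
while length(chunk) > 0 do
    poke(pos, chunk)                 -- copy this chunk into the block
    pos += length(chunk)
    chunk = get_bytes(fn, CHUNK)
end while
close(fn)

-- later: random access with peek({buf + offset, count})
? peek({buf, 16})
free(buf)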
31. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2506 views
Like I said, no compelling reason.
Matt
I think that's pretty compelling. Otherwise, why even have large file support in Euphoria at all? 2GB files should be more than enough for everyone!
What is it that you think you are going to accomplish by loading that much information into RAM?
Maybe the hypothetical application needs to do it for speed. Perhaps the file is located on a network share, and local disk quotas prevent copying it locally (to prevent the disk from filling up with gigabytes and gigabytes of temp files copied over from network shares).
Ok, there are still better ways to do this like using mmap, but what it comes down to is programmer choice.
Reading and writing large files does not mean that you have to put the whole thing into a flat sequence.
I agree with Matt; loading an entire large file into a sequence is a Very Bad Idea to begin with. If this is the reason for extending the limit of a single sequence, then I would be against it.
If one must load an entire large file into memory, he would be better served with allocate()/free() and peek()/poke() instead.
Agreed. There are much better ways to manipulate large amounts of data. However, I think it's really backwards to have a library routine impose an arbitrary limit like this for no good reason.
Edit: On the flip side, is there any compelling reason to avoid adding this?
It's not a useful thing to do.
Regarding usefulness of reading 1GB text files: entire books could fit in a 64MB text file. Should we say the limit is 64MB, because no one would write a work of literature that could take up more space than that?
Regarding usefulness of extending the maximum length of a sequence generally: how do you know that what is not useful now won't be useful in the future? Computers originally just processed text. Later on, they could handle music and 2-D video. Then we got 3D video and full length movies. Each required a corresponding increase in available RAM and disk space.
I can see future datasets, combining text (like subtitles), sound+music, 360 degree spherical video, sensory data for touch simulators, and even encodings to duplicate taste and smell. (Perhaps a virtual world simulator would use these.) The files for this would probably be huge.
Processing 1TB of data would not have been useful back in the days of ENIAC. But today, astronomers do this all the time.
If we do ever decide to add it, now is a good time to do so. Bumping up the limit involves deep changes in Euphoria's internal structures. What this means is that if we decide to do this 10 years down the line, any 64bit Euphoria compiled dll that was built before will have a different and incompatible structure. So you may not be able to use a dll compiled with Euphoria 5.35.7 in an Euphoria program running under an Euphoria 5.35.8 interpreter. Or maybe we can, but we'll need to add version detection and a shim layer to do conversion for us, which would be a lot of work.
If we make this change now, before the first official 64bit release, then we have no 64bit Euphoria dll incompatibilities to worry about. Waiting, on the other hand, will make this even more painful to deal with.
I'm not even a little bit worried about this, but I suppose another 8 bytes for sequence storage won't kill us.
You aren't even a little bit worried about backwards compatibility between different versions of the E_OBJECT/E_SEQUENCE/E_ATOM interface?
If, however, we perceive this built-in limit to be a correctable limitation for overall 64-bit compatibility, then it should be corrected before an official 64-bit release.
We were imposing a limit on 64bit that was unnecessary and made no sense. Additionally, if it ever turned out that it would be useful to increase the limit (on a system that supported using that much RAM/memory/etc, of course), we could not do so without encountering compatibility problems. If we change it now, we can avoid those compatibility issues down the line.
32. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2449 views
You aren't even a little bit worried about backwards compatibility between different versions of the E_OBJECT/E_SEQUENCE/E_ATOM interface?
Ideally, no, because I wouldn't expand the max size of a sequence.
Working with a data set of a certain size does not imply loading it in its entirety into something like a sequence. If you think you need to do this, you need to realize you're doing it wrong and that there is a better way.
I'm not going to stop you, but I reserve the right to mock people who come back complaining about how terrible things get when you use ginormous sequences.
Matt
33. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2437 views
Working with a data set of a certain size does not imply loading it in its entirety into something like a sequence. If you think you need to do this, you need to realize you're doing it wrong and that there is a better way.
Well, arrays are "something like a sequence", and memory-mapped files are sort of like arrays. And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
The whole point is about flexibility and future-proofing. You never know what'll turn out to be useful later on. And you never know what'll still be around then either. (Just think of those stories from folklore about COBOL programmers who figured their programs would be long gone by the year 2000, only for thousands of dollars to be spent in order to make them Y2K-bug free.)
I'm not going to stop you, but I reserve the right to mock people who come back complaining about how terrible things get when you use ginormous sequences.
On the flip side, predicting the future accurately is pretty tough too. Just think of those in the past who thought we'd have Moon colonies and rocket cars by the 1990s.
So, that seems more than fair.
34. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 19, 2014
- 2421 views
I'm not going to stop you, but I reserve the right to mock people who come back complaining about how terrible things get when you use ginormous sequences.
It's kinda like the "goto" situation. Basically, do we let people do stupid/unwise/courageous programming or do we make Euphoria baby-sit them?
My concern with this proposed change is how it would affect the performance of small (normal sized) sequences. Anything that adversely affects that would be detrimental to Euphoria's acceptance.
35. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 19, 2014
- 2437 views
... And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
Note that even memory-mapping a file is limited to the system's RAM and virtual storage constraints.
36. Re: Searching for Fastest Way to Input a File
- Posted by jimcbrown (admin) May 19, 2014
- 2453 views
... And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
Note that even memory-mapping a file is limited to the system's RAM and virtual storage constraints.
Agreed. Still, I would have thought it'd be possible to, for example, map a 5GB or 6GB file into memory under a 64bit cpu. Certainly, servers are being sold today that could support such an operation: http://www.amazon.com/Kingston-Technology-667MHz-KTH-XW667-64G/dp/B001VMJ6MO
On the other hand, the only thing that I can think of that takes up that much memory today in a single process is MS SQL Server. (I've personally seen this go up as high as 20GB.)
My concern with this proposed change is how it would affect the performance of small (normal sized) sequences. Anything that adversely affects that would be detrimental to Euphoria's acceptance.
Agreed. I haven't benchmarked it, but I wouldn't expect it to slow things down at all.
37. Re: Searching for Fastest Way to Input a File
- Posted by mattlewis (admin) May 19, 2014
- 2432 views
... And mmap'ing a large file to operate on it is widely considered an efficient way to deal with files.
Note that even memory-mapping a file is limited to the system's RAM and virtual storage constraints.
I don't think it's limited to the RAM, is it? On a 64-bit OS, the virtual storage (memory space) is really really big. x86-64 "only" supports 256TB of a possible 16EB virtual memory space.
Matt
38. Re: Searching for Fastest Way to Input a File
- Posted by DerekParnell (admin) May 19, 2014
- 2427 views
I don't think it's limited to the RAM, is it? On a 64-bit OS, the virtual storage (memory space) is really really big. x86-64 "only" supports 256TB of a possible 16EB virtual memory space.
Technically still limited ... but the main issue with this is the ratio of real RAM to Virtual RAM. This ratio will have an effect on performance, especially when doing random access on the file data.
39. Re: Searching for Fastest Way to Input a File
- Posted by petelomax May 20, 2014
- 2380 views
Erm, actually I already have a perfectly reasonable use of quite ridiculously long sequences:
I sometimes watch movies online, but actually more often than not I copy them to my tablet or a thumbdrive that I can plug into the tv in the lounge. I clearly remember one being over 4GB.
Sometimes the connection craps out with like 15 mins left, and it is no big deal to drag the slider to just before the cutoff point and watch/capture the last bit. So I have a 40-line ditty to bolt the two chunks together. It copies the first chunk/file byte-by-byte, but it reads the whole of the second file into memory so that it can match() the last 100 bytes of the first file with the second, and carry on from there.
The point is, it works fine, it is dirt simple, and it is /very/ fast. It is quite clearly entirely i/o bound, and all the better precisely because it uses silly-length-sequences.
Pete
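Pete didn't post the ditty itself, but the idea is straightforward to reconstruct. Here is a hedged sketch (Euphoria 4.x assumed; file names are placeholders and error checking is omitted), not his actual code:

include std/io.e

constant PART1 = "movie_part1.mp4", PART2 = "movie_part2.mp4",
         OUT   = "movie_joined.mp4"

integer f1 = open(PART1, "rb")
integer fo = open(OUT, "wb")
sequence tail = {}
integer c = getc(f1)
while c != -1 do
    puts(fo, c)                    -- copy the first capture byte by byte
    tail = append(tail, c)
    if length(tail) > 100 then
        tail = tail[2..$]          -- remember only the last 100 bytes
    end if
    c = getc(f1)
end while
close(f1)

-- read the whole second capture into RAM and find where it overlaps
sequence part2 = read_file(PART2, BINARY_MODE)
integer at = match(tail, part2)
if at then
    puts(fo, part2[at + length(tail) .. $])   -- carry on from the overlap
end if
close(fo)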