1. euphoria text processing
- Posted by seany Jun 02, 2013
- 1975 views
Hello all.
I am considering stepping into Euphoria. I have a 64-bit machine, so instead of linking the 32-bit stuff, I will go with the bleeding-edge version.
But before I do so, would anyone like to give me a taste of text processing (e.g. processing raw LaTeX text, or mathematical expressions written in Unicode) in Euphoria?
thanks a bunch. s
2. Re: euphoria text processing
- Posted by DerekParnell (admin) Jun 02, 2013
- 2036 views
What do you mean by "text processing"? That term can mean a whole lot of things to different people.
In Euphoria, all characters are stored as UTF-32 integers, however the standard library routines to read and write text from files still only work with 8-bit characters. You would need to get hold of routines to read/write Unicode characters from files.
However, once you have the text data in a sequence, you can easily "process" it. The big exceptions to this at the moment are that certain standard library routines don't recognise language-specific characteristics when it comes to collating, upper/lower case conversion, right-to-left text direction, etc. What is needed is an API for Euphoria that uses established libraries for Unicode processing, such as IBM's ICU.
3. Re: euphoria text processing
- Posted by petelomax Jun 03, 2013
- 1858 views
written with unicode
I assume you mean files saved in various encodings, with the appropriate identifying BOM (Byte Order Mark)...
As a first step I think we would need something that can read the following files. What I mean by the following is five test files, each containing "Hello", ranging from 5 to 12 bytes. The () just indicate the BOM, nothing else.
Hello.txt: 48 65 6C 6C 6F
Hello.Unicode.little.endian.txt: (FF FE) 48 00 65 00 6C 00 6C 00 6F 00
Hello.Unicode.big.endian.txt: (FE FF) 00 48 00 65 00 6C 00 6C 00 6F
Hello.UTF8.txt: (EF BB BF) 48 65 6C 6C 6F
Hello.UTF7.txt: (2B 2F 76 38 2D) 48 65 6C 6C 6F
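For illustration only, here is a rough sketch in C of how the BOMs in those five test files might be recognised. It is not an existing Euphoria routine; `bom_encoding` and `detect_bom` are names made up for this example, and it only knows the BOMs listed in this post (it does not, for instance, tell UTF-32 apart from UTF-16).

```c
#include <stddef.h>
#include <string.h>

/* Encodings of the five test files listed above (names are hypothetical). */
typedef enum { ENC_NONE, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF8, ENC_UTF7 } bom_encoding;

/* Hypothetical helper: reports which of the listed BOMs, if any, the buffer
   starts with. *bom_len receives the number of BOM bytes to skip. */
bom_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    static const struct {
        bom_encoding enc;
        size_t n;
        unsigned char bytes[5];
    } table[] = {
        { ENC_UTF7,    5, { 0x2B, 0x2F, 0x76, 0x38, 0x2D } },
        { ENC_UTF8,    3, { 0xEF, 0xBB, 0xBF } },
        { ENC_UTF16LE, 2, { 0xFF, 0xFE } },
        { ENC_UTF16BE, 2, { 0xFE, 0xFF } },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].n && memcmp(buf, table[i].bytes, table[i].n) == 0) {
            *bom_len = table[i].n;
            return table[i].enc;
        }
    }
    *bom_len = 0;          /* no BOM: treat as plain 8-bit text */
    return ENC_NONE;
}
```

A reader would call this on the first few bytes of the file, skip `bom_len` bytes, and then decode the rest according to the returned encoding.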
I might be able to cobble something up for Windows, but would have no idea for Linux.
Or is there already something that can do this?
Pete
4. Re: euphoria text processing
- Posted by DerekParnell (admin) Jun 03, 2013
- 1822 views
Or is there already something that can do this?
I already have these UTF file read/write functions but they didn't make it into Eu4. I'll dig them out of whatever closet they went to hide in and put them up for inspection. Stay tuned ...
But even if you can read/write UTF files, that is not the same as "text processing". Even searching for subtext within a Unicode string is very complicated due to language characteristics.
Oh, and by the way "unicode" is not the same as UTF16 (double-byte characters).
5. Re: euphoria text processing
- Posted by EUWX Jun 03, 2013
- 1854 views
The simple fact is that Euphoria has a 4 byte character as DEFAULT. Just to bring you current, Microsoft has a 2byte character as default. so, in theory euphoria is fully ready for a total Unicode syntax based on 16 bit character as used by Microsoft. It is known as 16 bit LE (Little Endian). Microsoft chose it that way starting with Windows XP. In C or C plus plus, there is a methodology based on W prefix signifying wide characters or 16 bit characters. However, I don't think anybody has made an effort to allow this way of compilation.
You can quite easily write your own functions using Peek/Poke facilities of Euphoria to create the full 16 bit LE text, and write little functions to select a character or number of characters, rotate, etc, using your own small functions. Alternatively you can modify the actual Euphoria low level functions in C language (or C plus). Whatever you do, you will have to create two levels of text functions. If you happen to want the extended Chinese then you can use the 4byte per character setup similar to what I suggested above.
The newly created functions should preferably be Euphoria functions in a .e file.
The other popular system which the Linux enthusiasts use and the Web has almost adopted is UTF-8. While it is a neat system for the web, the creation (by you) of the new functions similar to what I suggested above would be more difficult. The reason for this is that UTF-8 for a majority of international characters creates anything from one to 5 bytes per character, and selection and extraction of text is a bit more complicated. On the other hand, you can access a vast number of pre-defined character manipulation libraries for UTF-8 and interface to them if you know how to connect with C language.
If mathematical expressions are your main target usage, then consider Freemat.
edited by jimcbrown: removed off-topic content.
6. Re: euphoria text processing
- Posted by DerekParnell (admin) Jun 03, 2013
- 1823 views
The simple fact is that Euphoria has a 4 byte character as DEFAULT.
This is nitpicking, I know, but by using the word "default" here, it might imply that there are alternatives. There are none. Euphoria only stores characters as integers, and an integer can hold any of the valid Unicode values (code points) (0 to #10FFFF) and each takes up 4 bytes in RAM.
Just to bring you current, Microsoft has a 2byte character as default. so, in theory euphoria is fully ready for a total Unicode syntax based on 16 bit character as used by Microsoft. It is known as 16 bit LE (Little Endian). Microsoft chose it that way starting with Windows XP.
Firstly, just to make it very clear, the MICROSOFT default is UTF16LE encoding for strings in RAM and files, but this is not the WINDOWS default. Windows has no default, as it fully supports both ANSI encoding and UTF16 encoding; it's up to the application to choose which to use. In Microsoft's case, their applications (eg. Word, Excel, Powerpoint, cmd.exe, etc...) choose to use UTF16 encoding of text strings. The 'little endian' choice was influenced by the native way that Intel chips store 16-bit integers in RAM. Had Microsoft designed their PCs around non-Intel chips (eg. Motorola, as Apple did) then they probably would have used 'big endian' encoding.
In C or C plus plus, there is a methodology based on W prefix signifying wide characters or 16 bit characters. However, I don't think anybody has made an effort to allow this way of compilation.
Unless I'm misunderstanding you, the "W" suffix is actually a Windows convention in naming their API functions and has nothing to do with C or C++. In the C/C++ language, there is a way of specifying 'wide' characters, and that is to use the 'L' prefix. eg... wchar_t wszStr[] = L"1a1g";
You can quite easily write your own functions using Peek/Poke facilities of Euphoria to create the full 16 bit LE text, and write little functions to select a character or number of characters, rotate, etc, using your own small functions. Alternatively you can modify the actual Euphoria low level functions in C language (or C plus). Whatever you do, you will have to create two levels of text functions. If you happen to want the extended Chinese then you can use the 4byte per character setup similar to what I suggested above.
The need for peek() and poke() would only be required for converting text created by non-Euphoria applications into Euphoria sequences. Once text was in a sequence, the standard Euphoria functions could be used. Except that when dealing with Language Text, you need to be mindful of the many weird and wonderful ways that we humans have developed spoken and written text.
For example, in German a double 's' is regarded as a single character for sorting and comparison, but can be written as either two 's' characters or the special character 'ß'. Then there is the double 'L' in Spanish, which is sometimes regarded as a single character and sometimes not, depending on the context. Consider the Thai language, where some vowels are written before the consonant though sorted as if written after it. And some vowels are written below, some above, some on either side, and some around their consonants. Arabic and Hebrew (and some others) are written Right-To-Left.
In general, once you leave the simple world of ANSI/ASCII, text processing has complications!
The other popular system which the Linux enthusiasts use and the Web has almost adopted is UTF-8. While it is a neat system for the web, the creation (by you) of the new functions similar to what I suggested above would be more difficult. The reason for this is that UTF-8 for a majority of international characters creates anything from one to 5 bytes per character, and selection and extraction of text is a bit more complicated. On the other hand, you can access a vast number of pre-defined character manipulation libraries for UTF-8 and interface to them if you know how to connect with C language.
Both UTF8 and UTF16 use variable length characters. If you need speed in text processing, it might be better to convert text to UTF32 first.
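To make the "convert to UTF32 first" idea concrete, here is a minimal sketch in C of decoding UTF-8 bytes into one 32-bit integer per code point, which is the same shape as a Euphoria sequence of characters. The function name `utf8_to_utf32` and its error convention are assumptions made for this example, not part of any Euphoria library. Note that standard UTF-8 as defined today uses one to four bytes per code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes UTF-8 in src[0..len) into out[], one code
   point per element. Returns the number of code points written, or
   (size_t)-1 on malformed input or if out_cap is too small. */
size_t utf8_to_utf32(const unsigned char *src, size_t len,
                     uint32_t *out, size_t out_cap)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = src[i++];
        uint32_t cp;
        size_t extra;                     /* continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;           /* stray continuation or bad lead byte */

        if (extra > len - i) return (size_t)-1;      /* truncated sequence */
        for (size_t k = 0; k < extra; k++) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values above #10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (n == out_cap) return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

Once the text is in this fixed-width form, indexing and slicing by character position are simple array operations, which is the speed benefit being described. The language-specific issues (collation, combining marks, and so on) remain.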
Forked into: variable length characters
7. Re: euphoria text processing
- Posted by mattlewis (admin) Jun 04, 2013
- 1744 views
The simple fact is that Euphoria has a 4 byte character as DEFAULT.
This is nitpicking, I know, but by using the word "default" here, it might imply that there are alternatives. There are none. Euphoria only stores characters as integers, and an integer can hold any of the valid Unicode values (code points) (0 to #10FFFF) and each takes up 4 bytes in RAM.
Or 8 bytes if you're using a 64-bit euphoria!
In C or C plus plus, there is a methodology based on W prefix signifying wide characters or 16 bit characters. However, I don't think anybody has made an effort to allow this way of compilation.
Unless I'm misunderstanding you, the "W" suffix is actually a Windows convention in naming their API functions and has nothing to do with C or C++. In the C/C++ language, there is a way of specifying 'wide' characters, and that is to use the 'L' prefix. eg... wchar_t wszStr[] = L"1a1g";
And even then, the size of a wchar_t is platform dependent. On Windows, it's 16 bits, but on Unix like systems, it's usually 32 bits.
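A small C sketch of that platform dependence, for anyone who wants to check their own compiler. The function names are made up for this example. The assumption is that a compiler with a 16-bit wchar_t stores a character outside the Basic Multilingual Plane as two units (a surrogate pair), while one with a 32-bit wchar_t stores it as a single unit.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* A character outside the BMP (U+1F600), written as a wide literal. */
static const wchar_t emoji[] = L"\U0001F600";

/* Width of wchar_t in bits on the platform this is compiled for. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Number of wchar_t units in the literal L"1a1g", excluding the terminator. */
size_t sample_units(void)
{
    static const wchar_t s[] = L"1a1g";
    return wcslen(s);
}

/* Number of wchar_t units the compiler used for U+1F600:
   typically 2 where wchar_t is 16 bits, 1 where it is 32 bits. */
size_t emoji_units(void)
{
    return sizeof emoji / sizeof emoji[0] - 1;
}
```

So the same source line can produce differently sized data on Windows and on Unix-like systems, which matters when passing wide strings across a C interface.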
You can quite easily write your own functions using Peek/Poke facilities of Euphoria to create the full 16 bit LE text, and write little functions to select a character or number of characters, rotate, etc, using your own small functions. Alternatively you can modify the actual Euphoria low level functions in C language (or C plus). Whatever you do, you will have to create two levels of text functions. If you happen to want the extended Chinese then you can use the 4byte per character setup similar to what I suggested above.
The need for peek() and poke() would only be required for converting text created by non-Euphoria applications into Euphoria sequences.
Indeed, euphoria already has peek2/poke2 routines (and of course, has had peek4/poke4 forever). The memstruct functionality would allow another way to read / write wide chars to memory in a portable way.
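As a rough picture of what reading wide chars from memory involves, here is a C sketch that reads 16-bit little-endian units from a byte buffer (the C analogue of a 2-byte peek) and turns them into full code points, combining surrogate pairs where needed. The names `read_u16le` and `utf16le_to_utf32` are hypothetical; this is not how peek2 or memstruct are implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads one 16-bit little-endian unit from memory. */
static uint16_t read_u16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: decodes UTF-16LE bytes in src[0..len) into out[],
   one code point per element. Returns the number of code points written,
   or (size_t)-1 on malformed input or if out_cap is too small. */
size_t utf16le_to_utf32(const unsigned char *src, size_t len,
                        uint32_t *out, size_t out_cap)
{
    size_t n = 0;

    if (len % 2 != 0) return (size_t)-1;           /* odd byte count */

    for (size_t i = 0; i < len; i += 2) {
        uint32_t cp = read_u16le(src + i);

        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            uint32_t lo;
            if (len - i < 4) return (size_t)-1;    /* pair is cut off */
            lo = read_u16le(src + i + 2);
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;                                /* consumed the low half too */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                     /* lone low surrogate */
        }

        if (n == out_cap) return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

The result is again one integer per character, so anything outside the 16-bit range ends up as a single element rather than two.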
Matt
8. Re: euphoria text processing
- Posted by useless_ Jun 04, 2013
- 1757 views
The simple fact is that Euphoria has a 4 byte character as DEFAULT.
This is nitpicking, I know, but by using the word "default" here, it might imply that there are alternatives. There are none. Euphoria only stores characters as integers, and an integer can hold any of the valid Unicode values (code points) (0 to #10FFFF) and each takes up 4 bytes in RAM.
This is nitpicking, I know, but Euphoria does not only store characters as integers, it also manipulates them as integers. But I agree Euphoria stores characters only as integers.
useless
9. Re: euphoria text processing
- Posted by EUWX Jun 05, 2013
- 1723 views
Matt: I understand the enormous power of representing characters in Euphoria. I have tried to make it clear, and perhaps I was misunderstood.
My argument is that with that tremendous power there, and the ability to use arrays and nested arrays, Euphoria has not created a separate W character (16 bit basic unicode), a separate W2 character (32 bit extended Unicode), and now, as you imply, a 64 bit WC character (to mean "wide colour") where each character has individually a 32 bit extended character and has the RGB, bold, italics, etc coded (embedded) in the other 32 bits. These character sequences would be easily manipulatable, in terms of extracting a substring, inserting, indexing etc.
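To show what such a 64 bit "WC" value could look like, here is a purely illustrative C sketch. The bit layout (code point in the low 32 bits, RGB in the next 24, style flags in the top 8) and all the names are invented for this example; nothing like it exists in Euphoria as far as this thread says.

```c
#include <stdint.h>

/* Hypothetical style flags stored in the top 8 bits. */
#define WC_BOLD   0x1u
#define WC_ITALIC 0x2u

/* Packs a code point plus colour and style into one 64-bit value:
   bits 0-31 code point, bits 32-55 RGB, bits 56-63 flags. */
uint64_t wc_pack(uint32_t cp, uint8_t r, uint8_t g, uint8_t b, unsigned flags)
{
    uint32_t attrs = ((uint32_t)(flags & 0xFFu) << 24) |
                     ((uint32_t)r << 16) |
                     ((uint32_t)g << 8) |
                     (uint32_t)b;
    return ((uint64_t)attrs << 32) | cp;
}

/* Extracts the code point (low 32 bits). */
uint32_t wc_codepoint(uint64_t wc)
{
    return (uint32_t)(wc & 0xFFFFFFFFu);
}

/* Extracts the colour as 0xRRGGBB. */
uint32_t wc_rgb(uint64_t wc)
{
    return (uint32_t)(wc >> 32) & 0xFFFFFFu;
}

/* Extracts the style flags. */
unsigned wc_flags(uint64_t wc)
{
    return (unsigned)(wc >> 56);
}
```

An array of such values could be sliced and indexed like any other array of integers, which is the property being argued for here.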
I was only voicing a weakness that is easy to address and correct.
Happy computing, and happy summer to all living in the God forsaken cold north.
10. Re: euphoria text processing
- Posted by EUWX Jun 10, 2013
- 1563 views
I had posted here and it showed up in the index for 2-3 days as the last post, but in fact it had been removed.
In that post, I reminded everybody that, contrary to a statement that there are no good file sharing sites, there are in fact many, and their number is increasing.
That post was removed.
Why would anybody want to withhold important and straight forward information from the readers?