Re: euphoria text processing
- Posted by EUWX Jun 03, 2013
Hello all.
I am considering stepping into Euphoria. I have a 64-bit machine, so instead of linking the 32-bit stuff, I will go with the bleeding-edge version.
But before I do so, would anyone like to give me a taste of text processing (e.g. processing raw LaTeX text, or mathematical expressions written in Unicode) in Euphoria?
thanks a bunch. s
The simple fact is that Euphoria uses a 4-byte character by DEFAULT. Just to bring you up to date, Microsoft uses a 2-byte character by default. So, in theory, Euphoria is fully ready for complete Unicode text based on the 16-bit characters Microsoft uses. That encoding is known as UTF-16LE (little-endian). Windows has used 16-bit characters natively since Windows NT, and UTF-16 proper since Windows 2000. In C or C++, there is a convention based on a W prefix signifying wide, or 16-bit, characters. However, I don't think anybody has made an effort to allow this way of compilation.
You can quite easily use Euphoria's peek/poke facilities to build the full 16-bit LE text, and write small functions of your own to select a character or a number of characters, rotate, and so on. Alternatively, you can modify the actual low-level Euphoria functions in C (or C++). Whatever you do, you will have to create two levels of text functions. If you happen to want extended Chinese, you can use a 4-byte-per-character setup similar to what I suggested above.
The newly created functions should preferably be Euphoria functions in a .e file.
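To make the 16-bit LE idea concrete, here is a minimal sketch of the kind of small functions I mean. It is written in Python rather than Euphoria, purely to show the byte arithmetic; in Euphoria the same steps would go through peek and poke. The function names (pack_utf16le, char_at, slice_chars, rotate_left) are made up for illustration, and the sketch assumes every character fits in one 16-bit unit (no surrogate pairs).

```python
import struct

def pack_utf16le(code_points):
    """Pack code points (each 0..0xFFFF) as 2 bytes each, low byte first."""
    # Assumption: no surrogate pairs, so one character == one 16-bit unit.
    return struct.pack("<%dH" % len(code_points), *code_points)

def char_at(buf, index):
    """Return the code point of the 0-based index-th character."""
    return struct.unpack_from("<H", buf, 2 * index)[0]

def slice_chars(buf, start, count):
    """Return `count` characters starting at `start`, still as UTF-16LE bytes."""
    return buf[2 * start : 2 * (start + count)]

def rotate_left(buf, n):
    """Rotate the text left by n characters."""
    if not buf:
        return buf
    k = 2 * (n % (len(buf) // 2))
    return buf[k:] + buf[:k]

# "x", superscript two, "+", Greek small pi: the integers a Euphoria
# sequence would hold for this text.
text = [0x78, 0xB2, 0x2B, 0x3C0]
buf = pack_utf16le(text)          # 8 bytes: 2 per character
```

Because every character is exactly 2 bytes wide, selecting character i is just a matter of looking at byte offset 2*i. That fixed width is what makes this layout easy to work with.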
The other popular system, which Linux enthusiasts use and the Web has almost entirely adopted, is UTF-8. While it is a neat system for the Web, writing your own functions similar to the ones I suggested above would be more difficult. The reason is that UTF-8 uses anywhere from one to four bytes per character, so selecting and extracting text is a bit more complicated. On the other hand, there is a vast number of ready-made character manipulation libraries for UTF-8, and you can interface to them if you know how to connect with C.
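Here is a rough sketch of why selection is harder in UTF-8: to find the n-th character you have to walk the bytes from the start, reading each lead byte to learn how long that character is. Again this is Python, not Euphoria, and only illustrates the logic; utf8_char_len and utf8_char_at are hypothetical helper names, and the input is assumed to be valid UTF-8.

```python
def utf8_char_len(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

def utf8_char_at(buf, index):
    """Return the bytes of the 0-based index-th character.

    Assumes buf is valid UTF-8; scans from the start because
    characters have different widths.
    """
    pos = 0
    for _ in range(index):
        pos += utf8_char_len(buf[pos])
    return buf[pos : pos + utf8_char_len(buf[pos])]

# "x", superscript two, "+", Greek small pi, mathematical italic x (U+1D465)
sample = "x\u00b2+\u03c0\U0001d465".encode("utf-8")
third = utf8_char_at(sample, 3)   # the two bytes that encode pi
```

Compare this with fixed-width 16-bit text, where the n-th character sits at a known offset and no scan is needed.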
If mathematical expressions are your main target, then consider FreeMat.
edited by jimcbrown: removed off-topic content.