OpenEuphoria: Forum: Euphoria and Unicode

1. Euphoria and Unicode

Posted by HappyGene Oct 24, 2008
947 views

Hi,

Does anyone have any experience that the Euphoria system will process and maintain Unicode without error? I will need to preserve Latin diacriticals on English cpu's.

Thanks, Gene

2. Re: Euphoria and Unicode

Posted by DerekParnell (admin) Oct 24, 2008
924 views

HappyGene said...

Does anyone have any experience that the Euphoria system will process and maintain Unicode without error? I will need to preserve Latin diacriticals on English cpu's.

Define "process".

Define "maintain".

And by "Unicode" do you mean the UTF encoding (eg. UTF-8, UTF-16, UTF-32)?

Euphoria does no interpretation of the data in sequences. That means when doing comparisons it strictly uses the numeric value of the sequence items. Which in turn means that, in the main, text is treated as ASCII encoding. Case conversion is based only on ASCII encoding.

Version 4 supports code-pages but not Unicode yet.

So if your application stores Unicode code points in a sequence, you are responsible for text comparisons, case conversions, and output rendering.
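
To make the case-conversion point concrete, here is a minimal sketch in C++ (not Euphoria code, and not any library's API). It models a text sequence as a vector of integer code points and applies an ASCII-only upper-casing rule. The names `CodePoints` and `ascii_upper` are made up for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of a Euphoria text sequence: one integer per character.
using CodePoints = std::vector<std::uint32_t>;

// ASCII-only upper-casing: only 'a'..'z' (0x61..0x7A) are changed.
// Every other value, including accented Latin letters, is copied unchanged.
CodePoints ascii_upper(const CodePoints& text) {
    CodePoints out;
    out.reserve(text.size());
    for (std::uint32_t c : text) {
        if (c >= 0x61 && c <= 0x7A) {
            c -= 0x20;  // distance between 'a' and 'A' in ASCII
        }
        out.push_back(c);
    }
    return out;
}
```

Fed the code points for "café" (0x63, 0x61, 0x66, 0xE9), this kind of routine upper-cases the first three letters and leaves 0xE9 (é) as it was. The data is preserved, but the result is not what a user would call upper case.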

3. Re: Euphoria and Unicode

Posted by mattlewis (admin) Oct 24, 2008
947 views

HappyGene said...

Does anyone have any experience that the Euphoria system will process and maintain Unicode without error? I will need to preserve Latin diacriticals on English cpu's.

wxEuphoria works with Unicode. Basically, each euphoria 'character' is an element in a sequence, same as ASCII. Under the hood, it basically treats them as UTF-16, I believe.

Matt

4. Re: Euphoria and Unicode

Posted by DerekParnell (admin) Oct 24, 2008
938 views

mattlewis said...

HappyGene said...

Does anyone have any experience that the Euphoria system will process and maintain Unicode without error? I will need to preserve Latin diacriticals on English cpu's.

wxEuphoria works with Unicode. Basically, each euphoria 'character' is an element in a sequence, same as ASCII. Under the hood, it basically treats them as UTF-16, I believe.

By "it basically treats them as UTF-16" do you mean Euphoria or wxEuphoria? I'm very certain that Euphoria does not treat them as UTF-16. In which case, how does conversion from Euphoria text to UTF-16 occur?

5. Re: Euphoria and Unicode

Posted by HappyGene Oct 24, 2008
956 views

Wayyyl...

By "process and maintain w/o error" and preservation I mean:

- Assigning and transporting wide string char data retrieved from other in- and out-of-process data objects such as DDE if available, ODBC data sets and Euphoria keyboard input;
- manipulating those strings with standard functions like Trim/Mid/Replace/[=, <>, like];
- and, if not by reference, passing/retrieving full width strings as parameters...

...all without stripping any foreign language info from the tuple when placing/returning it to whatever data store I choose.

Does that clarify? I'm sure there are many other ways of getting this across and I'll be glad to re-phrase for anyone to understand. I'm good with words.

Another way is, "If I build/use an international app with Euphoria, will terrorists put a contract out on me because I didn't use the right 'n'?"

Thanks,

Gene

6. Re: Euphoria and Unicode

Posted by mattlewis (admin) Oct 24, 2008
901 views

DerekParnell said...

By "it basically treats them as UTF-16" do you mean Euphoria or wxEuphoria? I'm very certain that Euphoria does not treat them as UTF-16. In which case, how does conversion from Euphoria text to UTF-16 occur?

wxEuphoria does. And it puts each 'wide' character in a single euphoria integer as an element of a sequence. Here is the C++ code that converts a sequence to a wxString. It actually works for both the unicode and the ANSI version of wxWidgets:

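wxString get_string( object seq )
{
	// Convert a Euphoria sequence of character codes into a wxString.
	if( !IS_SEQUENCE(seq) )
		RTFatal("expected a sequence");
	s1_ptr s1 = SEQ_PTR(seq);
	int len = s1->length;
	wxChar * str = new wxChar[len+1];
	// s1->base is addressed from 1 here; the C buffer is addressed from 0.
	for( int i = 1; i <= len; i++ ){
		object x = s1->base[i];
		if( IS_SEQUENCE( x ) ){
			// A nested sequence is not text: free the buffer and give up.
			delete[] str;
			return wxEmptyString;
		}
		// Atoms stored as doubles are truncated to an integer code.
		if( !IS_ATOM_INT( x ) )
			x = (long) DBL_PTR(x)->dbl;
		str[i-1] = (wxChar) x;
	}
	str[len] = 0;
	wxString s( str );
	delete[] str;
	return s;
}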

Matt

7. Re: Euphoria and Unicode

Posted by DerekParnell (admin) Oct 24, 2008
953 views

HappyGene said...

Wayyyl...

I'm pedantic; sorry, but you'll just have to adjust any expectations you have of me ;)

HappyGene said...

By "process and maintain w/o error" and preservation I mean:

- Assigning and transporting wide string char data retrieved from other in- and out-of-process data objects such as DDE if available, ODBC data sets and Euphoria keyboard input;

By "WIDE" I assume you mean UTF-16 encoding.

Assigning is just copying numeric values around. That's ok.
Transporting(?) I guess means moving text sequences to/from external storage. Not so straightforward. See below...

HappyGene said...

- manipulating those strings with standard functions like Trim/Mid/Replace/[=, <>, like];

Manipulating is fine, except when it's based on the values within the string. So trim() is a problem because it only trims white-space and so far it only knows about ASCII whitespace. Unicode whitespace is a superset of ASCII whitespace. There are rare times when subscripting UTF-16 will fail but that is mainly when dealing with some Chinese ideographs, because these might take 32 bits (a surrogate pair) rather than 16 bits to encode.
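
As a rough illustration of the trim() issue, the C++ sketch below (illustrative only; `trim`, `is_ascii_space` and `is_unicode_space` are hypothetical names, not Euphoria routines) trims a vector of code points using a caller-supplied whitespace test. The wider test lists the code points that carry the Unicode White_Space property.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using CodePoints = std::vector<std::uint32_t>;

// ASCII whitespace: space, tab, LF, VT, FF, CR.
bool is_ascii_space(std::uint32_t c) {
    return c == 0x20 || (c >= 0x09 && c <= 0x0D);
}

// Code points with the Unicode White_Space property (a superset of ASCII).
bool is_unicode_space(std::uint32_t c) {
    return is_ascii_space(c) || c == 0x85 || c == 0xA0 || c == 0x1680 ||
           (c >= 0x2000 && c <= 0x200A) || c == 0x2028 || c == 0x2029 ||
           c == 0x202F || c == 0x205F || c == 0x3000;
}

// Remove leading and trailing characters for which is_space returns true.
CodePoints trim(const CodePoints& text, bool (*is_space)(std::uint32_t)) {
    std::size_t first = 0;
    std::size_t last = text.size();
    while (first < last && is_space(text[first])) ++first;
    while (last > first && is_space(text[last - 1])) --last;
    return CodePoints(text.begin() + static_cast<std::ptrdiff_t>(first),
                      text.begin() + static_cast<std::ptrdiff_t>(last));
}
```

With the ASCII test, a string that starts with an ideographic space (U+3000) or ends with a no-break space (U+00A0) comes back untrimmed; with the wider test both are removed.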

Comparisons between UTF-16 strings are not easy. Equality tests are okay, but anything based on collating order is a problem. Euphoria only does ASCII. It can't tell if A-Grave is lower or higher than A-Acute, for example - and this is usually language based anyway aside from UTF-16 encoding. That is, different languages collate the same characters in different orders.
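
A small C++ sketch of what "numeric value only" comparison does, assuming text is held as a vector of code points (`numeric_less` is a hypothetical helper, not part of Euphoria):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using CodePoints = std::vector<std::uint32_t>;

// Order two strings purely by the numeric value of their elements,
// element by element. No language-specific collation rules are applied.
bool numeric_less(const CodePoints& a, const CodePoints& b) {
    return std::lexicographical_compare(a.begin(), a.end(),
                                        b.begin(), b.end());
}
```

Under this ordering "Z" (0x5A) sorts before "a" (0x61), and "é" (0xE9) sorts after "z" (0x7A). A-Grave (0xC0) happens to come before A-Acute (0xC1), but only because of where the two characters sit in the code chart, not because any language says so. Equality (`a == b` on the vectors) is unaffected.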

HappyGene said...

- and, if not by reference, passing/retrieving full width strings as parameters...

This is not so easy. There is no built-in way to convert a Euphoria text sequence (which is stored as an array of 30-bit values) to a RAM array of 16-bit values. The function to do this isn't difficult and someone can show you how.
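
One possible shape for such a conversion, sketched in C++ rather than Euphoria and under stated assumptions: the input is a vector of Unicode code points, invalid values are replaced with U+FFFD, and a terminating zero is appended for C APIs. The name `to_utf16` is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Encode code points as UTF-16 code units.
// Code points above U+FFFF become a surrogate pair (two 16-bit units).
std::vector<std::uint16_t> to_utf16(const std::vector<std::uint32_t>& code_points) {
    std::vector<std::uint16_t> out;
    out.reserve(code_points.size() + 1);
    for (std::uint32_t cp : code_points) {
        // Assumption: lone surrogates and out-of-range values are
        // replaced with U+FFFD rather than reported as errors.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    out.push_back(0);  // terminator, as wide-string C APIs usually expect
    return out;
}
```

For Latin text with diacriticals every character fits in one 16-bit unit, so the output has the same length as the input plus the terminator. Only characters outside the Basic Multilingual Plane, such as the ideograph U+20000, take two units.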

HappyGene said...

...all without stripping any foreign language info from the tuple when placing/returning it to whatever data store I choose.

"tuple"? Do you mean sequence? What and how is "foreign language info" stored in the sequence? Euphoria reads and writes bytes. Each byte read in occupies a sequence element. If the file is stored in UTF-16 format (with or without BOM), you will have to have a special read/write routine to convert bytes read in to UTF-16 values. Likewise, to write UTF-16 values you will need to have a special routine to convert them to a byte stream.

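The read side of such a routine could look roughly like the C++ sketch below. It is a sketch only: `bytes_to_utf16` is a hypothetical name, and treating BOM-less input as little-endian is an assumption, not something the file format guarantees.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine a raw byte stream into 16-bit UTF-16 code units.
// A leading BOM (FE FF = big-endian, FF FE = little-endian) is consumed.
std::vector<std::uint16_t> bytes_to_utf16(const std::vector<std::uint8_t>& bytes) {
    std::size_t pos = 0;
    bool big_endian = false;  // assumption: little-endian when no BOM is present
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            big_endian = true;
            pos = 2;
        } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            pos = 2;
        }
    }
    std::vector<std::uint16_t> units;
    // Consume two bytes at a time; a trailing odd byte is ignored.
    for (; pos + 1 < bytes.size(); pos += 2) {
        std::uint16_t hi = big_endian ? bytes[pos] : bytes[pos + 1];
        std::uint16_t lo = big_endian ? bytes[pos + 1] : bytes[pos];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

Writing is the mirror image: emit the BOM if wanted, then split each 16-bit value into two bytes in the chosen order.
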
HappyGene said...

Does that clarify? I'm sure there are many other ways of getting this across and I'll be glad to re-phrase for anyone to understand. I'm good with words.

Another way is, "If I build/use an international app with Euphoria, will terrorists put a contract out on me because I didn't use the right 'n'?"

Yes, I believe they will.

8. Re: Euphoria and Unicode

Posted by HappyGene Oct 24, 2008
916 views

Ok, I'm back in the saddle (I commute.)

Matt & Derek,

I skimmed some material from MS in the past and I believe the compliance I enjoy now is from patching XP, Office, JET and SQL Server 2K to a minimum of UTF-8 or an ISO-2XXX series. I haven't prepared a test bed, yet, to verify that. But the effect is observable and suffices for Latin and Germanic.

My hope is to move partially away from our MS entrenchment. I'm reviewing more than 11 packages and 3 platforms and can't prove concepts on all of them. So, I'm hoping some users from each camp might have already worked with wide-char and have either design or production knowledge.

Many years ago, I was using Balena's routines to strip and re-apply the encoding for api's that didn't have "...W" equivalents. I'd like to find a non-MS IDE that would process (assign and manipulate) strings natively as wide or at least strip and re-apply without my intervention.

Some authors have stated for the cursory view what is their level of compliance and I'm trying to pin down the rest.

I appreciate all of your comments, <smile - I don't know how on this forum> Gene

9. Re: Euphoria and Unicode

Posted by HappyGene Oct 24, 2008
958 views

Matt, I understand a bit more, now.

No problem, Derek, I stand adjusted!

Yes, I was stripping euphemistically via masking, hence the implied granularity with "tuple."

I thought during the first part of your second response of having a single entry point for all strings and converting them. But now I see that I would probably have to store a secondary (and keyed) sequence of the extension mask for each element and re-apply it on the way out.

That actually sounds like a fun & delicious project. And I could still pass all assignment through two master functions if I didn't get too much going on at once.

So, then, external libs (wxEuphoria & brethren) are safe; but I'd need pre- and post-processing for each string's path through the main program.

In y'all's Euphoric experience, would this be worthwhile to a busy guy?

Thanks again, Gene.
