1. UTF-8 in Windows
- Posted by rodoval Sep 13, 2008
- 1259 views
Hello!
Can the Windows version of Euphoria output text in UTF-8 format?
I have finished a program in Linux and now I am testing it in Windows XP. In Linux, it seems that the encoding of the source file determines the encoding of the output; I have checked this with UTF-8 and ISO-8859-15 by redirecting the output to a file from a terminal. In Windows this works only for DOS and ANSI (Windows-1252) modes. I am using version 3.1.1 of Euphoria and exwc in Windows.
2. Re: UTF-8 in Windows
- Posted by DerekParnell (admin) Sep 13, 2008
- 1284 views
Yes, but not by default. It does not convert Extended ASCII text into UTF-8. Of course, plain ASCII (values 0 - 127) is already valid UTF-8, but the byte values 128 - 255, used in code pages, are not converted.
There will be a routine in Version 4 to convert many code pages into Unicode (UTF-32, UTF-16, and UTF-8).
By the way, I assume you are talking about sending text to a console window and not a Windows control object. In that case you must also set the console code page to 65001 and use the "Lucida Console" font.
So, in summary, for Euphoria 3 programs, you have to convert your extended ASCII characters into their UTF-8 equivalents before sending them to a Windows console.
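The conversion described in the summary could look roughly like the sketch below. It is only an illustration, not a routine from Euphoria itself: the function name latin1_to_utf8 is made up, and it assumes the input is ISO-8859-1, where each byte value equals its Unicode code point. Windows-1252 differs from that in the range 128 - 159 and would need a lookup table there.

```euphoria
-- Sketch for Euphoria 3.x: re-encode a byte string as UTF-8.
-- Assumption: the input is ISO-8859-1 (Latin-1), so byte value = code point.
-- latin1_to_utf8 is a hypothetical helper name, not a library routine.
function latin1_to_utf8(sequence s)
    sequence out
    integer c
    out = {}
    for i = 1 to length(s) do
        c = s[i]
        if c < 128 then
            out &= c                        -- plain ASCII is unchanged
        else
            -- code points 128..255 need the two-byte form 110xxxxx 10xxxxxx
            out &= #C0 + floor(c / 64)      -- lead byte: top bits
            out &= #80 + remainder(c, 64)   -- continuation byte: low 6 bits
        end if
    end for
    return out
end function

-- #F1 is "n with tilde" in Latin-1; this sends the bytes 195, 177, then "u"
puts(1, latin1_to_utf8({#F1} & "u\n"))
```

Whether the console then shows the right glyph still depends on the console code page and font, as discussed above.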
3. Re: UTF-8 in Windows
- Posted by rodoval Sep 14, 2008
- 1263 views
Thanks, Derek. It is strange that, in my tests, the output seems OK when redirected to a file, but it is not displayed correctly on the console. This is an example:
- Open a console, set the Lucida Console font and execute "chcp 65001".
- Create the file "test.ex" using Notepad in the "UTF-8 without BOM" format with this single line:
puts(1, "ñu\n")
The first letter of the string is an "n with tilde", a Spanish letter not in the 0-127 range.
- When "exwc test.ex" is executed from the console, the "extended" letter is not displayed (a rectangle appears instead).
- "exwc test.ex > test.txt" creates a UTF-8 file (checked with Notepad), but surprisingly, "type test.txt" now shows the "n with tilde" correctly.
4. Re: UTF-8 in Windows
- Posted by jacquesd Sep 14, 2008
- 1248 views
Windows GUI applications use ANSI code pages but cmd.exe uses OEM code pages. You must use CharToOem() to convert text written in notepad+ so that it displays properly in cmd.exe.
5. Re: UTF-8 in Windows
- Posted by jacquesd Sep 14, 2008
- 1248 views
Here is a program I wrote for my own needs:
--NAME: OemAnsi.exw
--DESCRIPTION: convert oem text file to ansi text file (or reverse). It can be used as a command line filter too.
--TARGET_PLATFORM: windows 9x+
--DATE: 2007-09-14
--AUTHOR: Jacques Deschênes, Baie-Comeau, Canada
--DETAIL: This program is only of interest to those who use windows in languages with accented characters like french language.
-- In french when one sends the output of a console command to a text file, the accented characters are
-- incorrect because console uses oem character set and windows applications like notepad use ansi character set.
-- This is annoying if, like me, you often use command script to gather information in text files.
--
-- This program can be used with given input and output file names as parameters or as a filter.
-- examples:
-- to convert Oem_file.txt to ansi_file.txt type on command line: exwc.exe OemAnsi.exw -i oem_file.txt -o ansi_file.txt
-- to convert from ansi to oem type: exwc.exe OemAnsi.exw -r -i ansi_file.txt -o oem_file.txt
-- to use as a command line filter type: for /? | exwc.exe OemAnsi.exw > for_help_ansi.txt
-- then you get the help for "for" command in an ansi text file.
-- file names with spaces must be quoted.
-- NOTES:
-- if you intend to bind or convert this program in C use -con option if you want it to work as a filter.
-- 1) When translating with open watcom version 1.7, and oemansi.exe is executed without any option,
-- accented characters are ignored. This problem doesn't exist with borland 5.5.1 compiler.
-- 2) when translated with borland 5.5.1 CTRL-Z is not recognized as end of file when reading from STDIN.
-- REF: http://msdn.microsoft.com/en-us/library/ms647473(VS.85).aspx
-- http://msdn.microsoft.com/en-us/library/ms647493(VS.85).aspx

include machine.e
include dll.e
include wildcard.e
with trace

-- win32 api calls
constant user32=open_dll("user32.dll")
constant iOemToChar = define_c_func(user32,"OemToCharA",{C_POINTER,C_POINTER},C_INT)
constant iCharToOem = define_c_func(user32,"CharToOemA",{C_POINTER,C_POINTER},C_INT)

-- convert an ansi string to oem (in place in the allocated buffer)
function CharToOem(sequence line)
    atom pString, fnVal
    sequence oem
    pString = allocate_string(line)
    fnVal = c_func(iCharToOem,{pString,pString})
    oem = peek({pString,length(line)})
    free(pString)
    return oem
end function

-- convert an oem string to ansi (in place in the allocated buffer)
function OemToChar(sequence line)
    atom pString, fnVal
    sequence ansi
    pString = allocate_string(line)
    fnVal = c_func(iOemToChar,{pString,pString})
    ansi = peek({pString,length(line)})
    free(pString)
    return ansi
end function

----------------
integer fBinded, fReverse, fFilterStdIn
sequence InpFile, OutFile
constant STDIN=0, STDOUT=1, STDERR=2, BAD_FILE_HANDLE = -1

procedure usage(integer exit_code)
    puts(STDOUT,"oemansi convert oem text to ansi text or reverse.\n"&
        "USAGE: oemansi [-r] [-i input_file] [-o output_file]\n"&
        " -r option to convert from ansi to oem\n"&
        " -i input_file indicate file name to convert.\n"&
        " If this option is missing, input is read from STDIN\n"&
        " -o out_file indicate name of output file.\n"&
        " If this option is missing output to STDOUT\n"&
        " filter mode usage example: type oem.txt | oemansi > ansi.txt\n\n")
    abort(exit_code)
end procedure

procedure error()
    puts(STDERR,"oemansi bad usage.\n\n")
    usage(0)
end procedure

procedure ParseCommandLine()
    sequence argv
    integer switch
    --trace(1)
    fBinded = 0
    fReverse = 0
    fFilterStdIn = 0
    InpFile = ""
    OutFile = ""
    switch = 0
    argv = command_line()
    if equal(argv[1], argv[2]) then
        fBinded=1
    end if
    for i = 3 to length(argv) do
        if switch = 0 then
            if argv[i][1]= '-' then
                if length(argv[i])<2 then
                    error()
                end if
                switch = upper(argv[i][2])
                if switch = 'R' then
                    fReverse = 1
                    switch = 0
                elsif switch = '?' then
                    usage(1)
                end if
            end if
        elsif switch = 'I' then
            InpFile = argv[i]
            if InpFile[1]!='"' then
                switch = 0
            elsif InpFile[$]='"' then
                InpFile = InpFile[2..$-1]
                switch = 0
            else
                switch = -'I'   -- quoted name continues in next argument
            end if
        elsif switch = -'I' then
            InpFile &= ' ' & argv[i]
            if InpFile[$] = '"' then
                InpFile = InpFile[2..$-1]
                switch = 0
            end if
        elsif switch = 'O' then
            OutFile = argv[i]
            if OutFile[1] != '"' then
                switch = 0
            elsif OutFile[$] = '"' then
                OutFile = OutFile[2..$-1]
                switch = 0
            else
                switch = -'O'   -- quoted name continues in next argument
            end if
        elsif switch = -'O' then
            OutFile &= ' ' & argv[i]
            if OutFile[$] = '"' then
                OutFile = OutFile[2..$-1]
                switch = 0
            end if
        else
            error()
        end if
    end for
    if length(InpFile) = 0 then
        fFilterStdIn = 1
    end if
end procedure

procedure Convert()
    integer fi, fo
    object line
    if fFilterStdIn then
        fi = STDIN
    else
        fi = open(InpFile,"r")
        if fi = BAD_FILE_HANDLE then
            printf(STDERR,"Failed to open %s\n",{InpFile})
            abort(0)
        end if
    end if
    if length(OutFile) then
        fo = open(OutFile,"w")
        if fo = BAD_FILE_HANDLE then
            printf(STDERR,"failed to open %s\n",{OutFile})
            abort(0)
        end if
    else
        fo = STDOUT
    end if
    line = gets(fi)
    while sequence(line) do
        if fReverse then
            puts(fo, CharToOem(line))
        else
            puts(fo, OemToChar(line))
        end if
        line = gets(fi)
    end while
    -- use != here: unary "not" binds tighter than "=", so
    -- "not fo = STDOUT" would never close a named output file
    if fi != STDIN then
        close(fi)
    end if
    if fo != STDOUT then
        close(fo)
    end if
end procedure

ParseCommandLine()
Convert()
abort(1)
6. Re: UTF-8 in Windows
- Posted by jacquesd Sep 14, 2008
- 1257 views
Oops! I just realized that my code doesn't work with UTF-8 as it is. The OemToCharA() and CharToOemA() functions should be replaced by OemToCharW() and CharToOemW().
7. Re: UTF-8 in Windows
- Posted by SDPringle Sep 14, 2008
- 1249 views
- Last edited Sep 15, 2008
To eliminate the possibility of bugs in Notepad I rewrote your example by encoding utf8 myself and giving Euphoria only numbers:
puts(1, or_bits( or_bits( #80, #40 ), floor( #F1 / 64 ) ) & or_bits( #80, remainder( #F1, 64 ) ) & "u\n" )
you can also do:
puts( 1, { 195, 177 } & "u\n" )
I have tried both chcp and mode con cp select=65001 and I get funny looking characters on the screen but not what is known as 'n-yeah' in Argentina. Just to be sure, I wrote an n with tilde in a file and read it in. I got: { 195, 177 }, which matches what the numeric expression prior to the "u\n" reads. So, I have agreement through jEdit and my own encoding of utf based on the specs I read online. It's the console not displaying things correctly.
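The check described above, that the bytes { 195, 177 } really stand for code point #F1, can also be done in the other direction. The following is a minimal sketch for Euphoria 3.x; the function name utf8_decode_char is hypothetical, it handles only sequences of 1 to 3 bytes, and it does not validate the continuation bytes.

```euphoria
-- Sketch: decode one UTF-8 sequence (1 to 3 bytes) back to a code point.
-- utf8_decode_char is a made-up name. Returns -1 for input that this
-- simple sketch does not handle. Continuation bytes are not validated.
function utf8_decode_char(sequence b)
    if length(b) = 1 and b[1] < #80 then
        return b[1]                                   -- 0xxxxxxx
    elsif length(b) = 2 and and_bits(b[1], #E0) = #C0 then
        -- 110xxxxx 10xxxxxx
        return and_bits(b[1], #1F) * 64 + and_bits(b[2], #3F)
    elsif length(b) = 3 and and_bits(b[1], #F0) = #E0 then
        -- 1110xxxx 10xxxxxx 10xxxxxx
        return and_bits(b[1], #0F) * 4096
             + and_bits(b[2], #3F) * 64
             + and_bits(b[3], #3F)
    end if
    return -1
end function

? utf8_decode_char({195, 177})   -- prints 241, which is #F1
```

This agrees with the hand-built expression: the bytes are correct UTF-8, so any remaining problem is on the display side.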
8. Re: UTF-8 in Windows
- Posted by jacquesd Sep 14, 2008
- 1236 views
- Last edited Sep 15, 2008
9. Re: UTF-8 in Windows
- Posted by jacquesd Sep 14, 2008
- 1244 views
- Last edited Sep 15, 2008
10. Re: UTF-8 in Windows
- Posted by jacquesd Sep 14, 2008
- 1287 views
- Last edited Sep 15, 2008
Sorry for these last 2 empty replies. Shawn, the problem is that cmd.exe doesn't really work with Unicode; it works with the OEM character set. To use puts() to print a Unicode string properly on screen, you MUST use the AnsiToCharW() win32 API function.
11. Re: UTF-8 in Windows
- Posted by SDPringle Sep 14, 2008
- 1296 views
- Last edited Sep 15, 2008
Jacques,
I couldn't find the documentation for this function on the web. It looks like it returns the little endian 16-bit representation of the Unicode string. If so, then just puts( 1, { #F1, 00 } ) should print out the n with tilde character but it doesn't even after all of that preparation... Should the codepage be set to 65001 or should I be using something else? What encoding does the function you described return?
Shawn Pringle
12. Re: UTF-8 in Windows
- Posted by jacquesd Sep 15, 2008
- 1273 views
Sorry, I gave you the wrong name for the function; it is CharToOem. There are two versions of this function: CharToOemA and CharToOemW.
ref.: http://msdn.microsoft.com/en-us/library/ms647473.aspx
jacques d.
13. Re: UTF-8 in Windows
- Posted by jacquesd Sep 15, 2008
- 1264 views
Shawn,
Here is some code I tested with success. Whatever code page I select in the console, the result is the same.
It seems that CharToOem() checks the console code page and does the conversion accordingly.
Tested with cp select=863 and cp select=65001
include machine.e

-- store 16-bit values at p, low byte first (little endian)
procedure pokew(atom p, object o)
    if atom(o) then
        poke(p,remainder(o,256))
        poke(p+1,floor(o/256))
    else
        for i = 1 to length(o) do
            poke(p+2*(i-1),o[i])
            poke(p+2*(i-1)+1,floor(o[i]/256))
        end for
    end if
end procedure

-- allocate a zero-terminated 16-bit string
function allocate_unicode(sequence u)
    atom p
    p = allocate(2*length(u)+2)
    pokew(p,u&0)
    return p
end function

include dll.e
constant user32=open_dll("user32.dll")
constant iCharToOem=define_c_func(user32,"CharToOemW",{C_POINTER,C_POINTER},C_UINT)

function CharToOem(sequence ustring)
    atom fnVal, pUStr, pChar
    sequence oem
    pUStr = allocate_unicode(ustring)
    -- one byte per character plus terminator (the earlier duplicate allocate() leaked memory)
    pChar = allocate(length(ustring)+1)
    fnVal = c_func(iCharToOem,{pUStr,pChar})
    oem = peek({pChar,length(ustring)})--oem is not unicode
    free(pUStr)
    free(pChar)
    return oem
end function

puts(1,{#F1}&'\n') -- without conversion
puts(1,CharToOem({#F1})&'\n') -- with conversion
jacques d.
14. Re: UTF-8 in Windows
- Posted by SDPringle Sep 16, 2008
- 1251 views
I don't think chcp has any effect on what you are doing here. Perhaps my Windows installation is broken somehow, but no matter what code page I choose, the first character is displayed the same way. That is preposterous if the code pages are actually changing. It means that chcp doesn't really change the code page. What I read was that you could set the code page to Unicode in DOS, but my tests do not work. In fact, if I didn't already have n with tilde in your character set I couldn't display it on the command line no matter what I did. The command line is stuck on this code page. Correct me if you have different results. I am using a Spanish localized Windows XP. Does this work on special copies of Windows? Is there a boot-time option to enable this?
There shouldn't be a problem using the GUI with Unicode, however.
Shawn Pringle
15. Re: UTF-8 in Windows
- Posted by jcmarsh Sep 16, 2008
- 1284 views
Check which code page MS console is using when it starts up. (example: mode con) This might be the only code page v3.1.1 can output if it is not a unicode program. Also, this might be the only code page that puts() can display on the console screen if it only takes OEM code pages. Try running "ascii.bat" and see what characters display. puts(1, {164}) gives me the "n-yeah".
The default code page for MS console is OEM 437. It may be possible to substitute unicode values for OEM ones using this chart http://www.microsoft.com/globaldev/reference/oem/437.mspx
Here's a list of some localized MS operating systems and code pages http://www.microsoft.com/globaldev/reference/oslocversion.mspx
Note: English's OEM code page is 437, Spanish's OEM code page is 850. The difference between them is that 850 has more accented characters and 437 has more line art. (I think this is changed when changing the system's localization settings.)
Alternatively, you could switch between standard OEM code pages (example: mode con cp select=850) for displaying special characters that are not in the code page that console started in. Make sure to change it back to the default afterward (example, back to 437).
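Substituting OEM values for Unicode ones, as suggested above, might be sketched like this for Euphoria 3.x. The names UNI, OEM and unicode_to_oem are invented for the example, and the table holds only a handful of accented letters whose byte values are the same in code pages 437 and 850; a complete table would have to be built from the charts linked above.

```euphoria
-- Sketch: map a few Unicode code points to their OEM 437/850 byte values.
-- UNI, OEM and unicode_to_oem are hypothetical names for this example.
-- Characters that are not in the table are replaced by '?'.
constant UNI = {#E1, #E9, #ED, #F3, #FA, #F1, #D1, #FC}  -- a' e' i' o' u' n~ N~ u"
constant OEM = {160, 130, 161, 162, 163, 164, 165, 129}

function unicode_to_oem(sequence s)
    sequence out
    integer pos
    out = repeat('?', length(s))
    for i = 1 to length(s) do
        if s[i] < 128 then
            out[i] = s[i]              -- ASCII is the same in every code page
        else
            pos = find(s[i], UNI)
            if pos then
                out[i] = OEM[pos]      -- substitute the OEM byte
            end if
        end if
    end for
    return out
end function

puts(1, unicode_to_oem({#F1} & "u\n"))   -- sends byte 164, then "u"
```

A table like this only helps for characters that exist in the active OEM code page; anything else still cannot be shown without switching code pages.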
- Quote from a Wikipedia article (emphasis added):
- "Recent Microsoft products and application program interfaces use Unicode internally, but many applications and APIs continue to use the default encoding of the computer's locale when reading and writing text data to files or standard output. Therefore, though Unicode is the accepted standard, there is still backwards compatibility with the older Windows code pages." http://en.wikipedia.org/wiki/Windows_code_page
16. Re: UTF-8 in Windows
- Posted by jacquesd Sep 16, 2008
- 1233 views
- Last edited Sep 17, 2008
When I open a console, mode con shows 850 as the code page.
When I run ascii.ex in the console, n tilde shows as character 164.
If, using Notepad, I write the following line, save it as ntilde.exw in ANSI format, and run it with exwc:
puts(1,{164})
exwc ntilde.exw
n tilde is displayed properly.
Also, if I select cp 65001 in the console and type French accented characters at the keyboard, they are replaced by graphics characters. So "mode con cp select=65001" is effective but has no effect on the output of the ntilde.exw or ascii.ex programs.
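The code page that "mode con" reports can also be queried from inside a program, which may help when comparing results between machines. The sketch below assumes Euphoria 3.x on Windows and calls the kernel32 function GetConsoleOutputCP; the wrapper name console_output_cp is made up, and it returns -1 when the DLL or the function is not available.

```euphoria
include dll.e

-- Sketch: ask Windows which code page the console uses for output.
-- GetConsoleOutputCP is a kernel32 function that takes no arguments;
-- console_output_cp is a hypothetical wrapper name for this example.
function console_output_cp()
    atom kernel32
    integer id
    kernel32 = open_dll("kernel32.dll")
    if kernel32 = 0 then
        return -1                 -- not on Windows, or DLL not found
    end if
    id = define_c_func(kernel32, "GetConsoleOutputCP", {}, C_UINT)
    if id = -1 then
        return -1                 -- function not found in the DLL
    end if
    return c_func(id, {})
end function

printf(1, "console output code page: %d\n", {console_output_cp()})
```

Running this before and after "mode con cp select=65001" would show whether the change is visible to the program at all.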
Jacques d.