1. UTF-8 in Windows

Hello!

Can the Windows version of Euphoria output text in UTF-8 format?

I have finished a program in Linux and am now testing it in Windows XP. In Linux, it seems that the encoding of the source file determines the encoding of the output; I have checked this with UTF-8 and ISO-8859-15 by redirecting the output to a file from a terminal. In Windows this works only for the DOS and ANSI (Windows-1252) encodings. I am using version 3.1.1 of Euphoria, with exwc in Windows.

2. Re: UTF-8 in Windows

rodoval said...

Can the Windows version of Euphoria output text in UTF-8 format?

Yes, but not by default. Euphoria does not convert extended ASCII text into UTF-8. Plain ASCII (values 0 - 127) is already valid UTF-8, but the byte values 128 - 255, whose meaning depends on the code page, are not converted.

There will be a routine in Version 4 to convert many code pages into Unicode (UTF-32, UTF-16, and UTF-8).

By the way, I assume you are talking about sending text to a console window and not a Windows control object. In that case you must also set the console code page to 65001 and use the "Lucida Console" font.

So, in summary, for Euphoria 3 programs, you have to convert your extended ASCII characters into the utf-8 equivalents before sending them to a Windows console.
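
The conversion described above could be sketched roughly as follows for Euphoria 3. The name latin1_to_utf8 is hypothetical, and the sketch assumes the input is ISO-8859-1 (Latin-1), where each byte value equals its Unicode code point. That does not hold for bytes 128 - 159 of Windows-1252, which would need a lookup table instead.

```euphoria
-- Sketch only: convert a string of ISO-8859-1 (Latin-1) bytes to UTF-8.
-- Assumption: every byte value is also the Unicode code point, which is
-- true for Latin-1 but NOT for bytes 128-159 of Windows-1252.
function latin1_to_utf8(sequence s)
    sequence out
    integer c
    out = ""
    for i = 1 to length(s) do
        c = s[i]
        if c < 128 then
            out &= c                        -- plain ASCII is unchanged
        else
            -- two-byte form: 110xxxxx 10xxxxxx
            out &= #C0 + floor(c / 64)
            out &= #80 + remainder(c, 64)
        end if
    end for
    return out
end function

-- #F1 is "n with tilde" in Latin-1; it becomes the two bytes 195, 177
puts(1, latin1_to_utf8({#F1} & "u\n"))
```

Whether the console then shows the right glyph is a separate matter of code page and font.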

3. Re: UTF-8 in Windows

Thanks, Derek. It is strange that, in my tests, the output seems OK when redirected to a file but is not displayed correctly on the console. Here is an example:

- Open a console, select the Lucida Console font and execute "chcp 65001".

- Create the file "test.ex" using Notepad in the "UTF-8 without BOM" format with this single line:

  puts(1, "ñu\n") 

The first letter of the string is an "n with tilde", a Spanish letter outside the 0-127 range.

- Run "exwc test.ex" from the console: the "extended" letter is not displayed (a rectangle appears instead).

- Run "exwc test.ex > test.txt": this creates a UTF-8 file (checked with Notepad) and, surprisingly, "type test.txt" now shows the "n with tilde" correctly.

4. Re: UTF-8 in Windows

Windows GUI applications use ANSI code pages, but cmd.exe uses OEM code pages. You must use CharToOem() to convert text written in Notepad so that it displays properly in cmd.exe.

5. Re: UTF-8 in Windows

Here is a program I wrote for my own needs:

--NAME: OemAnsi.exw 
--DESCRIPTION: convert oem text file to ansi text file (or reverse). It can be used as a command line filter too. 
--TARGET_PLATFORM: windows 9x+ 
--DATE: 2007-09-14 
--AUTHOR: Jacques Deschênes,  Baie-Comeau, Canada 
--DETAIL: This program is only of interest to those who use Windows in languages with accented characters, such as French. 
--  In French, when one sends the output of a console command to a text file, the accented characters are 
--  incorrect because the console uses the OEM character set and Windows applications like Notepad use the ANSI character set.  
--  This is annoying if, like me, you often use command scripts to gather information in text files. 
-- 
--  This program can be used with given input and output file names as parameters or as a filter. 
-- examples:  
--   to convert Oem_file.txt to ansi_file.txt type on command line:  exwc.exe OemAnsi.exw  -i oem_file.txt -o ansi_file.txt   
--   to convert from ansi to oem  type: exwc.exe OemAnsi.exw -r -i ansi_file.txt  -o oem_file.txt 
--   to use as a command line filter type:   for /? | exwc.exe OemAnsi.exw > for_help_ansi.txt 
--                                           then you get  the help for "for" command in an ansi text file.                              
--   file names with spaces must be quoted. 
-- NOTES: 
--   if you intend to bind this program or translate it to C, use the -con option if you want it to work as a filter. 
--   1) When translated with Open Watcom version 1.7 and oemansi.exe is executed without any option, 
--      accented characters are ignored. This problem doesn't exist with the Borland 5.5.1 compiler. 
--   2) When translated with Borland 5.5.1, CTRL-Z is not recognized as end of file when reading from STDIN. 
-- REF: http://msdn.microsoft.com/en-us/library/ms647473(VS.85).aspx 
--      http://msdn.microsoft.com/en-us/library/ms647493(VS.85).aspx 
 
 
include machine.e 
include dll.e 
include wildcard.e 
with trace 
-- win32 api calls 
constant user32=open_dll("user32.dll") 
 
constant iOemToChar = define_c_func(user32,"OemToCharA",{C_POINTER,C_POINTER},C_INT) 
constant iCharToOem = define_c_func(user32,"CharToOemA",{C_POINTER,C_POINTER},C_INT) 
 
function CharToOem(sequence line) 
atom pString, fnVal 
sequence oem 
 
   pString = allocate_string(line) 
   fnVal = c_func(iCharToOem,{pString,pString}) 
   oem = peek({pString,length(line)}) 
   free(pString) 
   return oem 
end function 
 
function OemToChar(sequence line) 
atom pString, fnVal 
sequence ansi 
 
   pString = allocate_string(line) 
   fnVal = c_func(iOemToChar,{pString,pString}) 
   ansi = peek({pString,length(line)}) 
   free(pString) 
   return ansi 
end function 
 
---------------- 
 
 
integer fBinded, fReverse, fFilterStdIn 
sequence  InpFile, OutFile 
 
constant STDIN=0, STDOUT=1, STDERR=2, BAD_FILE_HANDLE = -1 
 
procedure usage(integer exit_code) 
  puts(STDOUT,"oemansi convert oem text to ansi text or reverse.\n"& 
  "USAGE: oemansi [-r] [-i input_file] [-o output_file]\n"& 
  "  -r  option to convert from ansi to oem\n"& 
  "  -i input_file  indicate file name to convert.\n"& 
  "     If this option is missing, input is read from STDIN\n"& 
  "  -o out_file  indicate name of output file.\n"& 
  "     If this option is missing output to STDOUT\n"& 
  " filter mode usage example: type oem.txt | oemansi > ansi.txt\n\n") 
   
  abort(exit_code) 
end procedure 
 
procedure error() 
  puts(STDERR,"oemansi bad usage.\n\n") 
  usage(0) 
end procedure 
 
procedure ParseCommandLine() 
sequence argv 
integer switch 
--trace(1)  
 
  fBinded = 0 
  fReverse = 0 
  fFilterStdIn = 0 
  InpFile = "" 
  OutFile = "" 
  switch = 0 
  argv = command_line() 
  if equal(argv[1], argv[2]) then fBinded=1 end if 
  for i = 3 to length(argv) do 
     if switch = 0 then 
       if argv[i][1]= '-' then 
         if length(argv[i])<2 then 
           error() 
         end if 
         switch = upper(argv[i][2]) 
         if switch = 'R' then 
           fReverse = 1 
           switch = 0 
         elsif switch = '?' then 
           usage(1) 
         end if 
       end if 
     elsif switch = 'I' then 
        InpFile = argv[i] 
        if InpFile[1]!='"' then 
           switch = 0 
        elsif InpFile[$]='"' then 
           InpFile = InpFile[2..$-1] 
           switch = 0  
        else 
          switch = -'I' 
        end if 
     elsif switch = -'I' then 
        InpFile &= ' ' & argv[i] 
        if InpFile[$] = '"' then 
          InpFile = InpFile[2..$-1] 
          switch = 0 
        end if    
     elsif switch = 'O' then 
        OutFile = argv[i] 
        if OutFile[1] != '"' then 
          switch = 0 
        elsif OutFile[$] = '"' then 
          OutFile = OutFile[2..$-1] 
          switch = 0 
        else 
          switch = -'O' 
        end if 
     elsif switch = -'O' then 
        OutFile &= ' ' & argv[i] 
        if OutFile[$] = '"' then 
          OutFile = OutFile[2..$-1] 
          switch = 0 
        end if 
     else 
       error() 
     end if 
  end for 
  if length(InpFile) = 0 then 
     fFilterStdIn = 1  
  end if 
end procedure 
 
procedure Convert() 
integer fi, fo 
object line 
 
   if fFilterStdIn then 
     fi = STDIN 
   else 
     fi = open(InpFile,"r") 
     if fi = BAD_FILE_HANDLE then 
       printf(STDERR,"Failed to open %s\n",{InpFile}) 
       abort(0) 
     end if      
   end if 
   if length(OutFile) then 
     fo = open(OutFile,"w") 
     if fo = BAD_FILE_HANDLE then 
       printf(STDERR,"failed to open %s\n",{OutFile}) 
       abort(0) 
     end if   
   else 
     fo = STDOUT 
   end if 
   line = gets(fi) 
   while sequence(line) do 
     if fReverse then 
       puts(fo, CharToOem(line)) 
     else 
       puts(fo, OemToChar(line)) 
     end if 
     line = gets(fi) 
   end while 
   -- "not" binds tighter than "=", so compare with != instead 
   if fi != STDIN then 
     close(fi) 
   end if 
   if fo != STDOUT then 
     close(fo) 
   end if       
end procedure 
 
ParseCommandLine() 
Convert() 
abort(1) 
 
6. Re: UTF-8 in Windows

Oops! I just realized that my code doesn't work with UTF-8 as it is. The OemToCharA() and CharToOemA() functions should be replaced by OemToCharW() and CharToOemW().

7. Re: UTF-8 in Windows

To eliminate the possibility of bugs in Notepad, I rewrote your example by encoding the UTF-8 myself and giving Euphoria only numbers:

puts(1, or_bits( or_bits( #80, #40 ), floor( #F1 / 64 ) ) & or_bits( #80, remainder( #F1, 64 ) ) & "u\n" )

You can also do:

puts( 1, { 195, 177 } & "u\n" )

I have tried both chcp and mode con cp select=65001, and I get funny-looking characters on the screen but not what is known as 'n-yeah' in Argentina. Just to be sure, I wrote an n with tilde in a file and read it in. I got { 195, 177 }, which matches the value of the numeric expression before the "u\n". So jEdit and my own UTF-8 encoding, based on the specs I read online, agree. It's the console that is not displaying things correctly.
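
As a cross-check on the byte pair, the encoding can be run in reverse. The following is a minimal sketch in Euphoria 3; decode_utf8_pair is a hypothetical helper, not a library routine, and it only handles the two-byte UTF-8 form (code points #80 to #7FF).

```euphoria
-- Hypothetical helper: decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
-- Returns the code point, or -1 if the bytes do not have that form.
function decode_utf8_pair(integer b1, integer b2)
    if and_bits(b1, #E0) != #C0 or and_bits(b2, #C0) != #80 then
        return -1
    end if
    -- low 5 bits of the lead byte, then low 6 bits of the continuation byte
    return and_bits(b1, #1F) * 64 + and_bits(b2, #3F)
end function

printf(1, "%d\n", {decode_utf8_pair(195, 177)})   -- prints 241, i.e. #F1
```

Getting #F1 back confirms that { 195, 177 } is the correct UTF-8 encoding of n with tilde, independent of any editor.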

10. Re: UTF-8 in Windows

Sorry for the last two empty replies. Shawn, the problem is that cmd.exe doesn't really work with Unicode; it works with the OEM character set. To use puts() to print a Unicode string properly on screen, you MUST use the AnsiToCharW() Win32 API function.

11. Re: UTF-8 in Windows

Jacques,

I couldn't find the documentation for this function on the web. It looks like it returns the little endian 16-bit representation of the Unicode string. If so, then just puts( 1, { #F1, 00 } ) should print out the n with tilde character but it doesn't even after all of that preparation... Should the codepage be set to 65001 or should I be using something else? What encoding does the function you described return?

Shawn Pringle

12. Re: UTF-8 in Windows

Sorry, I gave you the wrong name for the function; it is CharToOem. There are two versions of this function, CharToOemA and CharToOemW.
Ref.: http://msdn.microsoft.com/en-us/library/ms647473.aspx

jacques d.

13. Re: UTF-8 in Windows

Shawn,
Here is some code I tested successfully. Whatever code page I select in the console, the result is the same.
It seems that CharToOem() checks the console code page and does the conversion accordingly.
Tested with cp select=863 and cp select=65001.

include machine.e 
 
procedure pokew(atom p, object o) 
  if atom(o) then 
    poke(p,remainder(o,256)) 
    poke(p+1,floor(o/256)) 
  else 
    for i = 1 to length(o) do 
      poke(p+2*(i-1),o[i]) 
      poke(p+2*(i-1)+1,floor(o[i]/256)) 
    end for 
  end if 
end procedure 
 
function allocate_unicode(sequence u) 
atom p   
  p = allocate(2*length(u)+2) 
  pokew(p,u&0) 
  return p 
end function 
 
include dll.e 
constant user32=open_dll("user32.dll") 
constant iCharToOem=define_c_func(user32,"CharToOemW",{C_POINTER,C_POINTER},C_UINT) 
 
function CharToOem(sequence ustring) 
atom fnVal, pUStr, pChar 
sequence oem 
   pUStr = allocate_unicode(ustring) 
   pChar = allocate(length(ustring)+1) -- one OEM byte per character plus the terminating 0 
   fnVal = c_func(iCharToOem,{pUStr,pChar}) 
   oem = peek({pChar,length(ustring)})--oem is not unicode 
   free(pUStr) 
   free(pChar) 
   return oem 
end function 
puts(1,{#F1}&'\n') -- without conversion 
puts(1,CharToOem({#F1})&'\n') -- with conversion 


jacques d.

14. Re: UTF-8 in Windows

I don't think chcp has any effect on what you are doing here. Perhaps my Windows installation is broken somehow, but no matter which code page I choose, the first character is displayed the same way. That is preposterous if the code pages are actually changing; it means that chcp doesn't really change the code page. What I read was that you could set the code page to Unicode in DOS, but it does not work in my tests. In fact, if I didn't already have n with tilde in the character set, I couldn't display it on the command line no matter what I did. The command line is stuck on this code page. Correct me if you have different results. I am using a Spanish localized Windows XP. Does this work on special copies of Windows? Is there a boot-time option to enable this?

There shouldn't be a problem using the GUI with Unicode, however.

Shawn Pringle

15. Re: UTF-8 in Windows

Check which code page MS console is using when it starts up. (example: mode con) This might be the only code page v3.1.1 can output if it is not a unicode program. Also, this might be the only code page that puts() can display on the console screen if it only takes OEM code pages. Try running "ascii.bat" and see what characters display. puts(1, {164}) gives me the "n-yeah".

The default code page for MS console is OEM 437. It may be possible to substitute unicode values for OEM ones using this chart http://www.microsoft.com/globaldev/reference/oem/437.mspx

Here's a list of some localized MS operating systems and code pages http://www.microsoft.com/globaldev/reference/oslocversion.mspx

Note: English's OEM code page is 437, Spanish's OEM code page is 850. The difference between them is that 850 has more accented characters and 437 has more line art. (I think this is changed when changing the system's localization settings.)

Alternatively, you could switch between standard OEM code pages (example: mode con cp select=850) for displaying special characters that are not in the code page that console started in. Make sure to change it back to the default afterward (example, back to 437).
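
The substitution idea mentioned above (replacing character values with their OEM equivalents) might look like the sketch below for a handful of letters. The table and the name latin1_to_oem are illustrative assumptions, not part of Euphoria. The eight OEM values listed are the same in code pages 437 and 850; any character not in the table is passed through unchanged.

```euphoria
-- Sketch only: map a few Latin-1 (ANSI) accented letters to OEM byte values.
-- Order: a-acute, e-acute, i-acute, o-acute, u-acute,
--        n-tilde, N-tilde, u-diaeresis
constant LATIN1 = {#E1, #E9, #ED, #F3, #FA, #F1, #D1, #FC}
constant OEM    = {160, 130, 161, 162, 163, 164, 165, 129}

function latin1_to_oem(sequence s)
    integer k
    for i = 1 to length(s) do
        k = find(s[i], LATIN1)
        if k then
            s[i] = OEM[k]   -- substitute the OEM position for this letter
        end if
    end for
    return s
end function

-- #F1 (n with tilde in Latin-1) becomes 164, the value reported to work above
puts(1, latin1_to_oem({#F1} & "u\n"))
```

A full table for a given code page would have to come from the charts linked above rather than from this short list.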

Quote from a Wikipedia article (emphasis added):
"Recent Microsoft products and application program interfaces use Unicode internally, but many applications and APIs continue to use the default encoding of the computer's locale when reading and writing text data to files or standard output. Therefore, though Unicode is the accepted standard, there is still backwards compatibility with the older Windows code pages." http://en.wikipedia.org/wiki/Windows_code_page

16. Re: UTF-8 in Windows

When I open a console, mode con shows 850 as the code page.
When I run ascii.ex in the console, n tilde shows as character 164.
Using Notepad, I wrote the following line and saved it as ntilde.exw in ANSI format:

puts(1,{164}) 

Running "exwc ntilde.exw" displays n tilde properly.
Also, if I select cp 65001 in the console and type at the keyboard, French accented characters are replaced by graphics characters. So "mode con cp select=65001" is effective but has no effect on the output of the ntilde.exw or ascii.ex programs.

Jacques d.
