1. Use a BOM to identify Unicode source files
- Posted by ArthurCrump Feb 17, 2011
- 1499 views
According to the recent Wiki article about Unicode plans, source files may be confused with early shrouded output. If a Unicode source file were required to begin with a Unicode Byte Order Mark (BOM), could this ever look like the beginning of a shrouded file?
Files with a BOM at the front would begin:
- #EF,#BB,#BF if coded in UTF-8, probably the preferred encoding
- #FE,#FF if coded in UTF-16 big-endian
- #FF,#FE if coded in UTF-16 little-endian
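For illustration, the check on the first bytes of a file could look roughly like the sketch below. It is written in Python rather than Euphoria, and the name detect_bom is made up for this example; it only covers the three BOMs listed above and is not how the interpreter actually does it.

```python
import codecs

# BOM byte sequences from the list above, mapped to an encoding label.
# UTF-32 BOMs are deliberately left out, matching that list.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def detect_bom(head: bytes):
    """Return (encoding, bom_length) if head starts with a known BOM, else None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None
```

A loader would read the first three bytes of the file, call detect_bom on them, and skip bom_length bytes before decoding the rest. A file with no BOM returns None and can be treated as plain 8-bit source.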
2. Re: Use a BOM to identify Unicode source files
- Posted by jimcbrown (admin) Feb 17, 2011
- 1517 views
I'm not sure, but I'd really doubt it.
Scrambling added another layer of complexity ... but even so, the odds...
In any case, support for these formats was dropped in 2.5 when "shrouded" was changed to mean IL bytecode files. So backwards compatibility is no longer an issue.
3. Re: Use a BOM to identify Unicode source files
- Posted by Vinoba Feb 17, 2011
- 1469 views
In general, UTF-8 would add more complexity than using UTF-16 little-endian. In fact, the correct approach would be to go completely UTF-16 little-endian and make 8-bit characters an exception that can be easily handled. Whilst a lot of people would make Microsoft the excuse for going the little-endian route, mine is a slightly more thought-out approach. The Intel CPU which most of us use has 16-bit and 32-bit reads and writes; the 8-bit access is there because the 8088 processor was 8-bit. Most other processors on the market are also 16/32/64-bit. And, of course, everything related to Windows is 16-bit little-endian. In any case, I feel a BOM should be a requirement in all future string-related software.
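To make the comparison concrete, here is a small Python sketch (not Euphoria) that counts how many bytes a piece of text takes in each encoding. The helper name encoded_sizes is made up for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text in UTF-8 and UTF-16 little-endian."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }

# Sample characters: ASCII "A", the euro sign U+20AC, and U+1D11E,
# which lies outside the Basic Multilingual Plane.
for sample in ("A", "\u20ac", "\U0001D11E"):
    print(repr(sample), encoded_sizes(sample))
```

An ASCII character takes 1 byte in UTF-8 and 2 in UTF-16, the euro sign takes 3 and 2, and a character outside the Basic Multilingual Plane takes 4 bytes in both, since UTF-16 needs a surrogate pair there.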