Re: Use a BOM to identify Unicode source files
- Posted by Vinoba Feb 17, 2011
In general, UTF-8 would add more complexity than using UTF-16 little-endian. In fact, the correct approach would be to go completely UTF-16 little-endian and make 8-bit characters an exception that can be easily handled. Whilst a lot of people would make Microsoft the excuse for going the little-endian route, mine is a slightly more thought-out approach. The Intel CPUs which most of us use have 16- and 32-bit reads and writes; the 8-bit access is there because that is what the 8088 processor was. Most other processors on the market are also 16/32/64-bit. And, of course, everything related to Windows is 16-bit little-endian. In any case, I feel a BOM should be a requirement in all future string-related software.
The recent Wiki article about Unicode plans notes that source files may be confused with early shrouded output. If a Unicode source file were required to begin with a Unicode Byte Order Mark (BOM), could this ever look like the beginning of a shrouded file?
Files with a BOM at the front would begin:
- #EF,#BB,#BF if coded in UTF-8, probably the preferred encoding
- #FE,#FF if coded in UTF-16 big-endian
- #FF,#FE if coded in UTF-16 little-endian
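
As a rough illustration of how a loader could tell these three prefixes apart, here is a minimal Python sketch. The function name `detect_bom` and the sample inputs are hypothetical, made up for this example; it says nothing about how shrouded files actually begin, it only shows the BOM check itself.

```python
import codecs

# BOM byte sequences for the three encodings listed above,
# taken from Python's standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper for illustration only. It checks just the three
    BOMs from the list above; a UTF-32 little-endian BOM (FF FE 00 00)
    would be reported as UTF-16 little-endian by this sketch.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

if __name__ == "__main__":
    print(detect_bom(b"\xef\xbb\xbfputs(1, \"hello\")"))
    print(detect_bom(b"puts(1, \"hello\")"))
```

A file with no BOM returns `None`, so under the proposal above anything that does not start with one of these prefixes would be treated as something other than a Unicode source file.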