Monday, June 11, 2007

How to write a BOM (Byte Order Mark) into a file using fstream

The Byte Order Mark (BOM) is the Unicode character U+FEFF. (It can also represent a Zero Width No-Break Space.) Its byte-swapped counterpart, U+FFFE, is a noncharacter and should never appear in a Unicode character stream. The BOM can therefore be used as the first character of a file (or, more generally, of a string) as an indicator of endian-ness. With UTF-16, if the first 16-bit code unit is read as 0xFEFF, the text has the same endian-ness as the machine reading it. If it is read as 0xFFFE, the endian-ness is reversed and all 16-bit words should be byte-swapped as they are read in. In the same way, the BOM indicates the endian-ness of text encoded with UTF-32.
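The byte-swap rule above can be sketched in a few lines. This is only an illustration, assuming the file has already been loaded into a vector of 16-bit words in the machine's native byte order; the helper names swap16 and normalize_utf16 are made up for this example.

```cpp
#include <cstdint>
#include <vector>

// Swap the two bytes of a 16-bit code unit.
inline std::uint16_t swap16(std::uint16_t u) {
    return static_cast<std::uint16_t>((u << 8) | (u >> 8));
}

// Hypothetical helper: `units` holds 16-bit words exactly as they were
// loaded from the file, in the machine's native byte order.
// If the first word reads as 0xFFFE, every word is byte-swapped.
// A leading BOM (0xFEFF) is then removed.
// Returns true if a BOM was found, false otherwise.
inline bool normalize_utf16(std::vector<std::uint16_t>& units) {
    if (units.empty()) return false;
    if (units.front() == 0xFFFE) {
        // Reversed endian-ness: swap every word, including the BOM itself.
        for (auto& u : units) u = swap16(u);
    }
    if (units.front() != 0xFEFF) return false;  // no BOM present
    units.erase(units.begin());                 // drop the BOM
    return true;
}
```

When no BOM is found the words are left untouched, and the caller has to decide on a byte order by other means.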

Note, however, that not all files start with a BOM. According to the Unicode Standard, UTF-16 or UTF-32 text that does not begin with a BOM, and whose byte order is not specified by a higher-level protocol, should be interpreted as big-endian.

The character U+FEFF also serves as an encoding signature for the Unicode encoding forms. The table below shows the encoding of U+FEFF in each of them. Note that, by definition, text labeled as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE should not have a BOM, because the endian-ness is already indicated in the label.

For text that is compressed with the SCSU (Standard Compression Scheme for Unicode) algorithm, there is also a recommended signature.

Encoding Form                        BOM Encoding

UTF-8                                EF BB BF
UTF-16 (big-endian)                  FE FF
UTF-16 (little-endian)               FF FE
UTF-16BE, UTF-32BE (big-endian)      No BOM!
UTF-16LE, UTF-32LE (little-endian)   No BOM!
UTF-32 (big-endian)                  00 00 FE FF
UTF-32 (little-endian)               FF FE 00 00
SCSU (compression)                   0E FE FF
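As a rough illustration of how the signatures in the table can be used when reading a file, here is a minimal sketch that maps the leading bytes of a buffer to an encoding name. The function detect_bom and the strings it returns are invented for this example. The 4-byte UTF-32 signatures are tested first, because FF FE 00 00 also begins with the UTF-16 little-endian signature FF FE (so UTF-16 little-endian text whose first character is U+0000 would be misreported as UTF-32).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding implied by the leading
// bytes of a buffer, or an empty string if no known signature is found.
inline std::string detect_bom(const unsigned char* p, std::size_t n) {
    // Longest signatures first, so UTF-32 LE is not mistaken for UTF-16 LE.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return "UTF-32 (big-endian)";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return "UTF-32 (little-endian)";
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8";
    if (n >= 3 && p[0] == 0x0E && p[1] == 0xFE && p[2] == 0xFF)
        return "SCSU";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "UTF-16 (big-endian)";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "UTF-16 (little-endian)";
    return "";  // no BOM: the encoding must be determined some other way
}
```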

 

 

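Solutions:

Case 1: using the standard std::ofstream:

#include <fstream>

// Writes the BOM in the machine's native byte order. Note that wchar_t is
// 2 bytes on Windows but 4 bytes on most Unix-like systems.
const wchar_t BOM = 0xFEFF;
std::ofstream outFile("filename.dat", std::ios::out | std::ios::binary);
outFile.write(reinterpret_cast<const char *>(&BOM), sizeof(BOM));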
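Writing the raw bytes of a wchar_t produces whatever byte order and width the machine uses. If a specific signature is wanted regardless of platform, one option is to write the bytes explicitly. The following is a minimal sketch under that assumption; write_utf16_bom is a hypothetical helper name, not part of any library.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: creates (or truncates) the file at `path` and
// writes the UTF-16 BOM as explicit bytes, independent of the
// endian-ness of the machine and of sizeof(wchar_t).
// Returns true if the file was opened and the bytes were written.
inline bool write_utf16_bom(const std::string& path, bool big_endian) {
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    const char be[] = {'\xFE', '\xFF'};  // UTF-16 big-endian signature
    const char le[] = {'\xFF', '\xFE'};  // UTF-16 little-endian signature
    out.write(big_endian ? be : le, 2);
    return static_cast<bool>(out);
}
```

For example, write_utf16_bom("filename.dat", false) would leave a two-byte file containing FF FE.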

 

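Case 2: using the standard std::wofstream:

  #include <fstream>

  const wchar_t BOM = 0xFEFF;
  const char *fname = "abc.txt";

  std::wofstream wfout;
  wfout.open(fname, std::ios_base::binary);

  // Note: std::wofstream converts wide characters to bytes through the
  // stream's locale. With the default "C" locale, writing U+FEFF may fail,
  // so check the stream state or imbue a suitable locale first.

  // S1: stream insertion
  wfout << BOM;

  // S2: put() as an alternative
  // wfout.put(BOM);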
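Because std::wofstream passes every wide character through the locale's conversion facet, the bytes that end up in the file depend on the platform and the locale in effect. One way to sidestep this, shown here only as a sketch, is to serialize the UTF-16 code units by hand and write the result with a plain binary std::ofstream. The helper name to_utf16le_with_bom is an assumption made for this example.

```cpp
#include <string>

// Hypothetical helper: serializes UTF-16 code units as little-endian
// bytes, prefixed with the FF FE signature.
inline std::string to_utf16le_with_bom(const std::u16string& text) {
    std::string bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes.push_back('\xFF');  // BOM, low byte first
    bytes.push_back('\xFE');
    for (char16_t u : text) {
        bytes.push_back(static_cast<char>(u & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((u >> 8) & 0xFF));  // high byte
    }
    return bytes;
}
```

The returned string can then be written with write(bytes.data(), bytes.size()) on a std::ofstream opened in binary mode.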

 

1 comment:

Unknown said...

It seems the values for UTF-16LE and UTF-16BE should be exchanged. I mean:

0xFEFF (UTF-16LE)
0xFFFE (UTF-16BE)