How to make a filestream read in UTF-8 C++

I am able to successfully read in UTF8 character text files by redirecting input and output on the terminal and then using wcin and wcout
_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);
Now I'd like to be able to read in UTF8 text using filestreams, but I don't know how to set the mode of the filestreams so that it could read in these characters like I did with stdin and stdout. I've tried using wifstreams/wofstreams and those still read and write garbage, by themselves.

C++'s iostreams library doesn't have built-in support for conversions from one text encoding to another. If you need your input text converted from utf-8 into another format (say, for example, the underlying codepoints of the encoding), you'll need to write that conversion manually.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::string data;
std::ifstream in("utf8.txt", std::ios::binary);
in.seekg(0, std::ios::end);
auto size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(size);
in.read(data.data(), size);
//data now contains the entire contents of the file
uint32_t partial_codepoint = 0;
unsigned remaining_bytes = 0; //continuation bytes still expected
std::vector<uint32_t> codepoints;
for(char c : data) {
    uint8_t byte = uint8_t(c);
    if(byte < 128) {
        if(remaining_bytes > 0) {
            //Data was malformed: error handling?
            //Codepoint abruptly ended
            remaining_bytes = 0;
        }
        //Character is just a basic ascii character, so we'll just set that as the codepoint value
        codepoints.push_back(byte);
    } else if(remaining_bytes > 0) {
        //We've already begun storing the codepoint
        if((byte >> 6) != 0b10) {
            //Data was malformed: error handling?
            //Codepoint abruptly ended
        }
        partial_codepoint = (partial_codepoint << 6) | (0b0011'1111 & byte);
        remaining_bytes--;
        if(remaining_bytes == 0) {
            codepoints.emplace_back(partial_codepoint);
            partial_codepoint = 0;
        }
    } else {
        //Beginning of new codepoint
        if((byte >> 6) == 0b10) {
            //Data was malformed: error handling?
            //Codepoint did not have proper beginning
        }
        //The number of leading 1 bits is the total length of the sequence
        unsigned num_of_bytes = 0;
        while(byte & 0b1000'0000) {
            num_of_bytes++;
            byte = byte << 1;
        }
        partial_codepoint = byte >> num_of_bytes;
        remaining_bytes = num_of_bytes - 1; //the lead byte itself is already consumed
    }
}
This code will reliably convert from [correctly-encoded] utf-8 to utf-32, which is usually the easiest form to convert directly into glyphs + characters—though remember that codepoints are not characters.
To keep things consistent in your code, my recommendation is that utf-8 encoded text be stored in your program using std::string, and utf-32 encoded text be stored as std::vector<uint32_t>.
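If you follow that convention, you will probably also want the reverse direction at some point. The following is only a minimal sketch of turning a std::vector<uint32_t> of codepoints back into a utf-8 std::string. The name encode_utf8 is made up for this example, and the function assumes every value is a valid codepoint (no surrogates, nothing above 0x10FFFF); it does no error handling.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode UTF-32 codepoints as UTF-8 bytes.
// Assumes all inputs are valid Unicode scalar values (not checked).
std::string encode_utf8(const std::vector<uint32_t>& codepoints) {
    std::string out;
    for (uint32_t cp : codepoints) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```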

Related

Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. Currently what I'm trying to do is to encode an input string from the user and then convert it to Hex formats. I can do it properly if it does not contain any Vietnamese character.
It works if my inputString is "Hello", but when I try to input a string such as "Tôi", I don't know how to do it.
enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };
switch (Encodings)
{
case USASCII:
ASCIIToHex(inputString, &ascii); //hello output 48656C6C6F
return new ByteField(ascii.c_str());
case ISO88591:
ASCIIToHex(inputString, &ascii);//hello output 48656C6C6F
//tôi output 54F469
return new ByteField(ascii.c_str());
case UTF8:
ASCIIToHex(inputString, &ascii);//hello output 48656C6C6F
//tôi output 54C3B469
return new ByteField(ascii.c_str());
case UTF16BE:
ToUTF16(inputString, &ascii, Encodings);//hello output 00480065006C006C006F
//tôi output 005400F40069
return new ByteField(ascii.c_str());
case UTF16:
ToUTF16(inputString, &ascii, Encodings);//hello output FEFF00480065006C006C006F
//tôi output FEFF005400F40069
return new ByteField(ascii.c_str());
case UTF16LE:
ToUTF16(inputString, &ascii, Encodings);//hello output 480065006C006C006F00
//tôi output 5400F4006900
return new ByteField(ascii.c_str());
}
void StringUtilLib::ASCIIToHex(std::string s, std::string * result)
{
int n = s.length();
for (int i = 0; i < n; i++)
{
unsigned char c = s[i];
long val = long(c);
std::string bin = "";
while (val > 0)
{
(val % 2) ? bin.push_back('1') :
bin.push_back('0');
val /= 2;
}
reverse(bin.begin(), bin.end());
result->append(ConvertBinToHex(bin));
}
}
std::string ToUTF16(std::string s, std::string * result, int encodings) {
int n = s.length();
if (encodings == UTF16) {
result->append("FEFF");
}
for (int i = 0; i < n; i++)
{
int val = int(s[i]);
std::string bin = "";
while (val > 0)
{
(val % 2) ? bin.push_back('1') :
bin.push_back('0');
val /= 2;
}
reverse(bin.begin(), bin.end());
if (encodings == UTF16 || encodings == UTF16BE) {
result->append("00" + ConvertBinToHex(bin));
}
if (encodings == UTF16LE) {
result->append(ConvertBinToHex(bin) + "00");
}
}
}
std::string ConvertBinToHex(std::string str) {
long long temp = atoll(str.c_str());
int dec_value = 0;
int base = 1;
int i = 0;
while (temp) {
int last_digit = temp % 10;
temp = temp / 10;
dec_value += last_digit * base;
base = base * 2;
}
char hexaDeciNum[10];
while (dec_value != 0)
{
int temp = 0;
temp = dec_value % 16;
if (temp < 10)
{
hexaDeciNum[i] = temp + 48;
i++;
}
else
{
hexaDeciNum[i] = temp + 55;
i++;
}
dec_value = dec_value / 16;
}
str.clear();
for (int j = i - 1; j >= 0; j--) {
str = str + hexaDeciNum[j];
}
return str;
}

The question is completely unclear. To encode something you need an input right? So when you say "Encoding Vietnamese Character to UTF8, UTF16" what's your input string and what's the encoding before converting to UTF-8/16? How do you input it? From file or console?
And why on earth are you converting to binary and then to hex? You can print directly to binary and hex from the bytes, no need to convert from binary to hex. Note that converting to binary like that is fine for testing but vastly inefficient in production code. I also don't know what you mean by "But what if my letter is "Á" or "À" which is a Vietnamese letter I cannot get the value of it". Please show a minimal, reproducible example along with the input/output
But I think you just want to output the UTF encoded bytes from a string literal in the source code like "ÁÀ". In that case it isn't called "encoding a string" but just "outputting a string"
Both Á and À in Unicode can be represented by precomposed characters (U+00C1 and U+00C0) or combining characters (A + U+0301 ◌́/U+0300 ◌̀). You can switch between them by selecting "Unicode dựng sẵn" or "Unicode tổ hợp" in Unikey. Suppose you have those characters in string literal form then std::string str = "ÁÀ" contains a series of bytes that corresponds to the above letters in the source file encoding. So depending on which encoding you save the *.cpp file as (CP1252, CP1258, UTF-8...), the output byte values will be different
To force UTF-8/16/32 encoding you just need to use the u8, u and U prefixes respectively, along with the correct type (char8_t, char16_t, char32_t or std::u8string/std::u16string/std::u32string)
std::u8string utf8 = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";
Then just use c_str() to get the underlying buffers and print the bytes. std::u8string is only available from C++20, so with earlier standards just save the file as UTF-8 and use std::string. For a user-input string, note that std::cin is a char stream: read into a std::string, and the bytes you get are in whatever encoding the console or terminal uses
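As a rough illustration of "print the bytes", here is a small sketch that hex-encodes a UTF-8 string and a UTF-16 string directly, with no detour through binary. The helper names bytes_to_hex and utf16_to_hex are made up for this example, and it only handles the byte-order part of UTF-16 (no BOM). The literals in the usage note use escapes so the result does not depend on how the source file is saved.

```cpp
#include <string>

// Hypothetical helper: hex-encode every byte of a string, two digits per byte.
std::string bytes_to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}

// Hypothetical helper: write each UTF-16 code unit in the requested
// byte order, then hex-encode the resulting bytes.
std::string utf16_to_hex(const std::u16string& units, bool big_endian) {
    std::string bytes;
    for (char16_t unit : units) {
        char hi = static_cast<char>(unit >> 8);
        char lo = static_cast<char>(unit & 0xFF);
        bytes.push_back(big_endian ? hi : lo);
        bytes.push_back(big_endian ? lo : hi);
    }
    return bytes_to_hex(bytes);
}
```

With these, bytes_to_hex("T\xC3\xB4i") gives 54C3B469 and utf16_to_hex(u"T\u00F4i", true) gives 005400F40069, which are the values the question expects for "Tôi".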
Edit:
To convert between UTF encodings use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16... (note that std::wstring_convert and the <codecvt> facets are deprecated since C++17)
Working on non-Unicode encodings is trickier and needs some external library like ICU or OS-dependent APIs
WideCharToMultiByte and MultiByteToWideChar on Windows
iconv on Linux
Limiting to ISO-8859-1 makes it easier but you still need many lookup tables, and there's no way to convert other encodings to ASCII without loss of information

-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar which accepts a code page parameter (of course you have to use the correct code page). Alternatively you may try a standard function like mbstowcs but you need to set up your locale correctly before using it.
You might find it easier to switch to wide characters throughout your application, and avoid most transcoding.
As a side note, converting an integer to binary only to convert that to hexadecimal is not an easy or efficient way to display a hexadecimal representation of an integer.

Comparing UTF8 encoded chars

There is a csv file which contains text in many different languages, encoded in utf-8. I have to parse the file and check it for invalid characters.
I have written a sample program as shown below…
int main(void)
{
string invalidUTF8Chars = ""; // Invalid UTF-8 Chars array.
invalidUTF8Chars+= "\u00A0";
invalidUTF8Chars+= "\u005E";
invalidUTF8Chars+= "\u00FE";
invalidUTF8Chars+= "\u00BA";
invalidUTF8Chars+= "\u00AF";
FILE* fp;
int ch; // int rather than char, so that EOF can be told apart from a 0xFF byte
fp = fopen("unicodeUTF8TextFile.txt","r");
if(fp != NULL)
{
while(( ch = fgetc(fp) ) != EOF ) // Reading byte by byte form input file.
{
//if (strchr(invalidUTF8Chars.c_str(), ch)) // How do I validate here?
{
printf("Invalid character\n");
}
}
}
return 0;
}
How do I compare the data read from the file against the invalid chars?

When strchr() fails to find a character it returns a NULL-pointer, so a non-NULL result means the character is in your list of invalid characters. What you need to do is check whether the return value was a NULL-pointer or not:
if(strchr(invalidUTF8Chars.c_str(), ch) != nullptr){
    printf("Invalid character\n");
}
Here's the strchr() reference for your convenience.

Invalid character for UTF-8 may either mean that the UTF-8 encoding is invalid and doesn't correspond to any character, or that the UTF-8 decoding will lead to a character that you don't want.
You are interested in the second variant, where each character is encoded as one or more bytes in UTF-8, specifically "\u005E" is one byte in UTF-8 and the others are 2 bytes.
Thus you cannot reject individual bytes in your example, but would either need to decode to Unicode-characters or read everything as UTF-8 and then find the issues using something like:
if (strstr(readFile, u8"\u00A0") != nullptr || strstr(readFile, u8"\u005E") != nullptr ... ) printf("Found bad character\n");
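To make the "read everything as UTF-8 and search" variant concrete, here is a minimal sketch using std::string::find instead of strstr. The function name contains_forbidden is made up for this example, and the forbidden list simply spells out the UTF-8 bytes of the five characters from the question. Searching for whole encoded sequences like this is safe on valid UTF-8 because a complete sequence can never match in the middle of another character.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if `text` (assumed to be valid UTF-8) contains
// any of the unwanted characters from the question.
bool contains_forbidden(const std::string& text) {
    static const std::vector<std::string> forbidden = {
        "\xC2\xA0", // U+00A0 no-break space
        "^",        // U+005E, a single byte in UTF-8
        "\xC3\xBE", // U+00FE
        "\xC2\xBA", // U+00BA
        "\xC2\xAF", // U+00AF
    };
    for (const std::string& seq : forbidden) {
        if (text.find(seq) != std::string::npos)
            return true;
    }
    return false;
}
```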

How to remove the last character of a UTF-8 string in C++?

The text is stored in a std::string.
If the text is 8-bit ASCII, then it is really easy:
text.pop_back();
But what if it is UTF-8 text?
As far as I know, there are no UTF-8 related functions in the standard library which I could use.

You really need a UTF-8 Library if you are going to work with UTF-8. However for this task I think something like this may suffice:
void pop_back_utf8(std::string& utf8)
{
if(utf8.empty())
return;
auto cp = utf8.data() + utf8.size();
while(--cp >= utf8.data() && ((*cp & 0b10000000) && !(*cp & 0b01000000))) {}
if(cp >= utf8.data())
utf8.resize(cp - utf8.data());
}
int main()
{
std::string s = "κόσμε";
while(!s.empty())
{
std::cout << s << '\n';
pop_back_utf8(s);
}
}
Output:
κόσμε
κόσμ
κόσ
κό
κ
It relies on the fact that UTF-8 Encoding has one start byte followed by several continuation bytes. Those continuation bytes can be detected using the provided bitwise operators.

What you can do is pop off characters until you reach the leading byte of a code point. The leading byte of a code point in UTF8 is either of the pattern 0xxxxxxx or 11xxxxxx, and all non-leading bytes are of the form 10xxxxxx. This means you can check the first and second bit to determine if you have a leading byte.
bool is_leading_utf8_byte(char c) {
auto first_bit_set = (c & 0x80) != 0;
auto second_bit_set = (c & 0X40) != 0;
return !first_bit_set || second_bit_set;
}
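void pop_utf8(std::string& x) {
    // Drop trailing continuation bytes, then the leading byte itself.
    // The empty() checks avoid calling back() on an empty string.
    while (!x.empty() && !is_leading_utf8_byte(x.back()))
        x.pop_back();
    if (!x.empty())
        x.pop_back();
}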
This of course does no error checking and assumes that your string is valid utf-8.

How to process CSV lines with nul char in some elements?

When reading and parsing a CSV-file line, I need to process the nul character that appears as the value of some row fields. It is complicated by the fact that sometimes the CSV file is in windows-1250 encoding, sometimes it is in UTF-8, and sometimes in UTF-16. Because of this, I started down one path and only found the nul char problem later -- see below.
Details: I need to clean a CSV files from third party to the form common to our data extractor (that is the utility works as a filter -- storing one CSV form to another CSV form).
My initial approach was to open the CSV file in binary mode and check whether the first bytes form BOM. I know all the given Unicode files start with BOM. If there is no BOM, I know that it is in windows-1250 encoding.
The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I open it using the related mode, like this:
// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<unsigned char> buf(4, 0); // unsigned, so the comparisons with 0xEF etc. below can succeed
fread(&buf[0], 1, 3, fh);
::fclose(fh);
// Set the isUnicode flag and open the file according to that.
string mode{ "r" }; // init
bool isUnicode = false; // pessimistic init
if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
mode += ", ccs=UTF-8";
isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF) // UTF-16 BE BOM
|| (buf[0] == 0xFF && buf[1] == 0xFE)) // UTF-16 LE BOM
{
mode += ", ccs=UNICODE";
isUnicode = true;
}
// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);
After the successful open, the input line is read either via fgets or via fgetws, depending on whether Unicode was detected or not. Then the idea was to convert the buffer content from Unicode to 1250 if Unicode was detected earlier, or to leave the buffer in 1250 otherwise. The s variable should contain the string in the windows-1250 encoding. The ATL::CW2A(buf, 1250) is used when conversion is needed:
const int bufsize = 4096;
wchar_t buf[bufsize];
// Read the line from the input according to the isUnicode flag.
while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
: (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
{
// If the input is in Unicode, convert the buffer content
// to the string in cp1250. Otherwise, do not touch it.
string s;
if (isUnicode) s = ATL::CW2A(buf, 1250);
else s = reinterpret_cast<char*>(buf);
...
// Now processing the characters of the `s` to form the output file
}
It worked fine... until a file with a nul character used as the value in the row appeared. The problem is that when the s variable is assigned, the nul cuts the rest of the line. In the observed case, it happened with the file that used 1250 encoding. But it can probably happen also in the UTF encoded files.
How to solve the problem?

The NUL character problem is solved by using either C++ or Windows functions. In this case, the easiest solution is MultiByteToWideChar which will accept an explicit string length, precisely so it doesn't stop on NUL.

Reading Unicode characters from a file in C++

I want to read a Unicode file (UTF-8) character by character, but I don't know how to read from a file one character at a time.
Can anyone tell me how to do that?

First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description
Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, according to that table:
(Row 1) If the most significant bit is 0 ((byte & 0x80) == 0), you have your character.
(Row 2) If the three most significant bits are 110 ((byte & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the 6 least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily using the shift and bitwise operators of C/C++:
UnicodeByte1 = (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);
And so on...
Sounds a bit complicated, but it's not difficult if you know how to modify the bits to put them in proper place to decode a UTF-8 string.

UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:
#include <iostream>
#include <string>
#include <fstream>
std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());
The resultant string has characters corresponding to UTF-8 bytes. You could loop through it like so:
for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
char nextChar = *i;
// do stuff here.
}
Alternatively, you could open the file in binary mode, and then move through each byte that way:
std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
    char nextChar;
    // get() reads raw bytes; operator>> would skip whitespace
    while (fs.get(nextChar)) {
        // do stuff here.
    }
}
If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of stuff. At least, less painful than ICU, for doing largely practical things.
QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8"); // Qt 5; in Qt 6 use in.setEncoding(QStringConverter::Utf8)
    QString contents = in.readAll();
}

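In theory, stdlib.h has a function mblen which should return the length of a multibyte character. But in my case it returned -1 for the first byte of a multibyte character and kept returning -1 from then on (mblen depends on the current C locale, so this is what happens in the default "C" locale). So I wrote the following:
// The original function header was not shown; the name utf8_char_length is a placeholder.
int utf8_char_length(const char* i_ch)
{
    if(i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    int mask = 0x80;
    // Count the leading 1 bits of the first byte.
    while(ch & mask) {
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;           // plain ASCII byte
    if (l == 1 || l > 4) return -1; // continuation byte or invalid lead byte
    return l;
}
It took less time than researching how to use mblen.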

Try this: read the file into a string and then loop through the text based on its length.
Pseudocode:
String s = file.toString();
int len = s.length();
for(int i=0; i < len; i++)
{
    String the_character = s[i];
    // TODO : Do your thing :o)
}
