There is a CSV file containing text in many different languages, encoded in UTF-8. I have to parse the file and check it for invalid characters.
I have written a sample program, shown below:
#include <cstdio>
#include <cstring>
#include <string>
using namespace std;

int main(void)
{
    string invalidUTF8Chars = ""; // Invalid UTF-8 chars, collected into one string.
    invalidUTF8Chars += "\u00A0";
    invalidUTF8Chars += "\u005E";
    invalidUTF8Chars += "\u00FE";
    invalidUTF8Chars += "\u00BA";
    invalidUTF8Chars += "\u00AF";
    FILE* fp;
    int ch; // int, not char, so that EOF can be distinguished from a valid byte
    fp = fopen("unicodeUTF8TextFile.txt", "r");
    if (fp != NULL)
    {
        while ((ch = fgetc(fp)) != EOF) // Reading byte by byte from the input file.
        {
            //if (strchr(invalidUTF8Chars.c_str(), ch)) // How do I validate here?
            {
                printf("Invalid character\n");
            }
        }
        fclose(fp);
    }
    return 0;
}
How do I compare the data read from the file against the invalid chars?
When strchr() does not find the character, it returns a null pointer. Since invalidUTF8Chars holds the characters you want to reject, a non-null return means the byte you just read is one of them:
if (strchr(invalidUTF8Chars.c_str(), ch) != nullptr) {
    printf("Invalid character\n");
}
Here's the strchr() reference for your convenience.
"Invalid character" for UTF-8 may mean either that the byte sequence is not valid UTF-8 and doesn't decode to any character, or that decoding yields a character you don't want.
You are interested in the second variant. Each character is encoded as one or more bytes in UTF-8; specifically, "\u005E" is one byte in UTF-8 and the others in your list are two bytes.
Thus you cannot reject individual bytes as your example does. You would either need to decode to Unicode characters, or read everything as UTF-8 and then search for the offending byte sequences using something like:
if (strstr(readFile, u8"\u00A0") != nullptr || strstr(readFile, u8"\u005E") != nullptr ... ) printf("Found bad character\n");
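A minimal sketch of that whole-file approach, assuming the file fits comfortably in memory (the file name matches the question's, and the sequence list is just the question's five characters):

#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    // Slurp the whole file as raw bytes; no decoding happens here.
    std::ifstream in("unicodeUTF8TextFile.txt", std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();
    const std::string content = ss.str();

    // Each entry is the UTF-8 byte sequence of one unwanted character
    // (narrow \u literals encode as UTF-8 when the execution charset is UTF-8).
    const std::vector<std::string> invalidSequences = {
        "\u00A0", "\u005E", "\u00FE", "\u00BA", "\u00AF"
    };

    for (const std::string& seq : invalidSequences) {
        if (content.find(seq) != std::string::npos) {
            std::printf("Found bad character\n");
        }
    }
    return 0;
}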
I need to read both a UTF-8 string (std::string) and a UTF-16 string (std::u16string) from a file (opened with ifstream).
The UTF-8 string is easy; I think I can just use something like std::getline(stream, str, '\0').
But for UTF-16, I'm not sure how to read it. I know I could loop over the file and read 2 bytes at a time until a 0x0000 value, but I'm not sure that is the right or best way to do it.
So, how can I read it?
-- edit --
For now, I'm doing it this way, is this ok?
std::string binaryReader::ru16str_n()
{
    std::u16string str;
    char16_t ch = 0;
    while (true)
    {
        // Read one UTF-16 code unit (2 bytes) from the binary stream.
        binary.read(reinterpret_cast<char*>(&ch), 2);
        if (ch != u'\0')
            str.push_back(ch);
        else
            break;
    }
    // Convert the collected UTF-16 code units to a UTF-8 std::string.
    return std::wstring_convert<
        std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(str);
}
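One caveat worth noting about the loop above: it never checks whether the read succeeded, so a string missing its 0x0000 terminator would spin forever on stale data. A sketch of a more defensive version (same assumed binaryReader class and binary stream member):

std::string binaryReader::ru16str_n()
{
    std::u16string str;
    char16_t ch = 0;
    // Stop on the NUL terminator *or* when the stream runs out of data.
    while (binary.read(reinterpret_cast<char*>(&ch), 2) && ch != u'\0')
        str.push_back(ch);
    return std::wstring_convert<
        std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(str);
}

This still assumes the file's byte order matches the host's and that no BOM needs skipping; note also that std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17, though still available.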
I am able to successfully read in UTF-8 text files by redirecting input and output on the terminal and then using wcin and wcout:
_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);
Now I'd like to be able to read in UTF-8 text using file streams, but I don't know how to set the mode of the file streams so that they can read these characters the way I did with stdin and stdout. I've tried using wifstream/wofstream, and those still read and write garbage by themselves.
The C++ iostreams library doesn't have built-in support for converting one text encoding to another. If you need your input text converted from UTF-8 into another format (for example, the underlying code points of the encoding), you'll need to write that conversion manually:
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::string data;
std::ifstream in("utf8.txt", std::ios::binary); // binary, so tellg() matches what read() delivers
in.seekg(0, std::ios::end);
auto size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(size);
in.read(&data[0], size); // &data[0] instead of data.data(), which is const before C++17
//data now contains the entire contents of the file
uint32_t partial_codepoint = 0;
unsigned num_of_bytes = 0;
std::vector<uint32_t> codepoints;
for(char c : data) {
uint8_t byte = uint8_t(c);
if(byte < 128) {
//Character is just a basic ascii character, so we'll just set that as the codepoint value
codepoints.push_back(byte);
if(num_of_bytes > 0) {
//Data was malformed: error handling?
//Codepoint abruptly ended
}
} else {
//Character is part of multi-byte encoding
if(num_of_bytes > 0) { //Check the byte count, not partial_codepoint, which can legitimately be 0 mid-sequence
//We've already begun storing the codepoint
if((byte >> 6) != 0b10) {
//Data was malformed: error handling?
//Codepoint abruptly ended
}
partial_codepoint = (partial_codepoint << 6) | (0b0011'1111 & byte);
num_of_bytes--;
if(num_of_bytes == 0) {
codepoints.emplace_back(partial_codepoint);
partial_codepoint = 0;
}
} else {
//Beginning of new codepoint
if((byte >> 6) == 0b10) {
//Data was malformed: error handling?
//Codepoint did not have proper beginning
}
//Count the leading 1-bits: that count is the total number of bytes in the sequence
while(byte & 0b1000'0000) {
num_of_bytes++;
byte = byte << 1;
}
partial_codepoint = byte >> num_of_bytes;
//The lead byte itself is included in that count, so only num_of_bytes - 1 continuation bytes follow
num_of_bytes--;
}
}
}
This code will reliably convert [correctly encoded] UTF-8 to UTF-32, which is usually the easiest form to convert directly into glyphs and characters. Remember, though, that code points are not characters.
To keep things consistent in your code, my recommendation is that utf-8 encoded text be stored in your program using std::string, and utf-32 encoded text be stored as std::vector<uint32_t>.
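As a quick sanity check, here is a sketch of how the decoded output could be inspected, continuing directly after the loop above (the sample input and expected values are for illustration only):

#include <cstdio>

// ...after the decoding loop has filled 'codepoints'...
for (uint32_t cp : codepoints) {
    // For an input file containing "Aä€" the expected output is
    // U+0041, U+00E4, U+20AC.
    std::printf("U+%04X\n", cp);
}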
When reading and parsing a CSV-file line, I need to process the NUL character that appears as the value of some row fields. It is complicated by the fact that sometimes the CSV file is in windows-1250 encoding, sometimes in UTF-8, and sometimes in UTF-16. Because of this, I started one way and only later ran into the NUL-char problem -- see below.
Details: I need to clean a CSV files from third party to the form common to our data extractor (that is the utility works as a filter -- storing one CSV form to another CSV form).
My initial approach was to open the CSV file in binary mode and check whether the first bytes form a BOM. I know all the given Unicode files start with a BOM; if there is no BOM, I know the file is in windows-1250 encoding.
The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I open it using the related mode, like this:
// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<unsigned char> buf(4, '\0'); // unsigned char, so the comparisons against 0xEF etc. below work
fread(&buf[0], 1, 3, fh);
::fclose(fh);
// Set the isUnicode flag and open the file according to that.
string mode{ "r" }; // init
bool isUnicode = false; // pessimistic init
if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
mode += ", ccs=UTF-8";
isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF) // UTF-16 BE BOM
|| (buf[0] == 0xFF && buf[1] == 0xFE)) // UTF-16 LE BOM
{
mode += ", ccs=UNICODE";
isUnicode = true;
}
// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);
After the successful open, each input line is read via either fgets or fgetws, depending on whether Unicode was detected. The idea was then to convert the buffer content from Unicode to 1250 if Unicode was detected earlier, or leave the buffer untouched otherwise. The s variable should then contain the string in the windows-1250 encoding. ATL::CW2A(buf, 1250) is used when conversion is needed:
const int bufsize = 4096;
wchar_t buf[bufsize];
// Read the line from the input according to the isUnicode flag.
while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
: (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
{
// If the input is in Unicode, convert the buffer content
// to the string in cp1250. Otherwise, do not touch it.
string s;
if (isUnicode) s = ATL::CW2A(buf, 1250);
else s = reinterpret_cast<char*>(buf);
...
// Now processing the characters of the `s` to form the output file
}
It worked fine... until a file appeared that used a NUL character as a value in a row. The problem is that when the s variable is assigned, the NUL cuts off the rest of the line. In the observed case it happened with a file in the 1250 encoding, but it can probably also happen with the UTF-encoded files.
How do I solve the problem?
The NUL character problem is solved by using either C++ or Windows functions. In this case, the easiest solution is MultiByteToWideChar, which accepts an explicit string length precisely so that it doesn't stop at a NUL.
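A minimal sketch of that idea, assuming the raw line sits in a buffer whose length is tracked explicitly (rawLen here is whatever the read actually returned, not strlen):

#include <windows.h>
#include <string>

// Convert 'rawLen' bytes of UTF-8 (embedded NULs included) to windows-1250.
std::string utf8_to_cp1250(const char* raw, int rawLen)
{
    // First hop: UTF-8 -> UTF-16; the explicit length lets NULs pass through.
    int wideLen = MultiByteToWideChar(CP_UTF8, 0, raw, rawLen, nullptr, 0);
    std::wstring wide(wideLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, raw, rawLen, &wide[0], wideLen);

    // Second hop: UTF-16 -> windows-1250, again with explicit lengths.
    int narrowLen = WideCharToMultiByte(1250, 0, wide.data(), wideLen,
                                        nullptr, 0, nullptr, nullptr);
    std::string narrow(narrowLen, '\0');
    WideCharToMultiByte(1250, 0, wide.data(), wideLen,
                        &narrow[0], narrowLen, nullptr, nullptr);
    return narrow; // narrow.size() counts embedded NULs, unlike strlen
}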
My company uses some code like this:
std::string(CT2CA(some_CString)).c_str()
which I believe converts a Unicode string (whose type is CString) into ANSI encoding; this string is used as an email's subject. However, the header of the email (which includes the subject) indicates that the mail client should decode it as Unicode (this is what the original code does). Thus, some German characters like "ä ö ü" will not be displayed properly in the title.
Is there any way I can put this header back into UTF-8 and store it in a std::string or const char*?
I know there are a lot of smarter ways to do this, but I need to keep the code close to its original form (i.e., send the header as std::string or const char*).
Thanks in advance.
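One possibility, sketched: the ATL conversion helpers accept an explicit code page (the same mechanism as the ATL::CW2A(buf, 1250) call seen earlier on this page), so the subject can be converted to UTF-8 instead of ANSI. Assuming a Unicode build, where CT2CA expands to CW2A:

#include <atlstr.h> // CString and the CT2CA conversion helper
#include <string>

// Sketch: produce UTF-8 bytes for the subject instead of the ANSI code page.
std::string subject_to_utf8(const CString& subject)
{
    return std::string(CT2CA(subject, CP_UTF8));
}

The returned std::string (or its c_str()) can then be sent as before; the header itself must of course still declare UTF-8 for mail clients to decode it correctly.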
Be careful: it's '|' and not '&'!
*buffer++ = 0xC0 | (c >> 6);
*buffer++ = 0x80 | (c & 0x3F);
This sounds like a plain conversion from one encoding to another: you can use std::codecvt<char, char, mbstate_t> for this. Whether your implementation ships with a suitable conversion, I don't know. From the sounds of it, you are just trying to convert ISO-Latin-1 into UTF-8. That should be pretty much trivial: the first 128 characters (0 to 127) map identically to UTF-8, and the second half maps to the corresponding Unicode code points, i.e., you just need to encode the corresponding value into UTF-8; each character in that second half becomes two bytes. That is, I think the conversion is something like this:
#include <stdexcept>

// Takes the next position and the end of a buffer as the first two arguments and the
// character to convert from ISO-Latin-1 as the third argument.
// Returns a pointer to the end of the produced sequence.
char* iso_latin_1_to_utf8(char* buffer, char* end, unsigned char c) {
if (c < 128) {
if (buffer == end) { throw std::runtime_error("out of space"); }
*buffer++ = c;
}
else {
if (end - buffer < 2) { throw std::runtime_error("out of space"); }
*buffer++ = 0xC0 | (c >> 6);
*buffer++ = 0x80 | (c & 0x3f);
}
return buffer;
}
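A quick usage sketch (the sample string and buffer size are arbitrary):

#include <cstdio>

int main() {
    // "Grüße" in ISO-Latin-1: 0xFC is 'ü' and 0xDF is 'ß'.
    const unsigned char latin1[] = { 'G', 'r', 0xFC, 0xDF, 'e' };
    char out[16];
    char* pos = out;
    for (unsigned char c : latin1)
        pos = iso_latin_1_to_utf8(pos, out + sizeof out, c);
    std::fwrite(out, 1, pos - out, stdout); // writes the UTF-8 bytes of "Grüße"
    std::printf("\n");
}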
I want to read a Unicode (UTF-8) file character by character, but I don't know how to read one character at a time from a file.
Can anyone tell me how to do that?
First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description
Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, decide according to that table:
(Row 1) If the most significant bit is 0 ((c & 0x80) == 0), you have your character.
(Row 2) If the three most significant bits are 110 ((c & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) make the first byte of the Unicode character (00000YYY), and its two least significant bits together with the 6 least significant bits of the next byte (10xxxxxx) make the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily using shifts and logical operators in C/C++:
UnicodeByte1 = (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ((UTF8Byte1 << 6) & 0xC0) | (UTF8Byte2 & 0x3F);
And so on...
It sounds a bit complicated, but it's not difficult if you know how to manipulate bits to put them in the proper place to decode a UTF-8 string.
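As a sketch of the two-byte case described above (the function name is just illustrative):

#include <cstdio>

// Decode a 2-byte UTF-8 sequence (110YYYyy 10xxxxxx) into a code point.
unsigned int decode_two_byte(unsigned char b1, unsigned char b2) {
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F); // 00000YYY yyxxxxxx
}

int main() {
    // 0xC3 0xA9 is the UTF-8 encoding of 'é' (U+00E9).
    std::printf("U+%04X\n", decode_two_byte(0xC3, 0xA9));
}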
UTF-8 is ASCII compatible, so you can read a UTF-8 file the same way you would read an ASCII file. The C++ way to read a whole file into a string is:
#include <iostream>
#include <string>
#include <fstream>
std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());
The resulting string has characters corresponding to UTF-8 bytes. You could loop through it like so:
for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
char nextChar = *i;
// do stuff here.
}
Alternatively, you could open the file in binary mode, and then move through each byte that way:
std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
    char nextChar;
    while (fs.get(nextChar)) { // get() does not skip whitespace, unlike operator>>
        // do stuff here.
    }
}
If you want to do more complicated things, I suggest you take a peek at Qt. I've found it rather useful for this sort of thing; at least, it is less painful than ICU for largely practical work.
QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8");
    QString contents = in.readAll();
    return;
}
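Note that QTextStream::setCodec was removed in Qt 6, where streams default to UTF-8 and QTextStream::setEncoding is used instead; the snippet above is Qt 5 style.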
In theory, stdlib.h has a function mblen which should return the length of a multibyte symbol. But in my case it returned -1 for the first byte of a multibyte symbol and kept returning -1 after that. So I wrote the following:
int utf8_symbol_len(const char* i_ch)
{
    if (i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch); // unsigned, so the mask test is well defined
    int mask = 0x80;
    while (ch & mask) {
        l++;
        mask = (mask >> 1);
    }
    if (l == 0) return 1;           // plain ASCII byte: a 1-byte symbol
    if (l == 1 || l > 4) return -1; // continuation byte or invalid lead byte
    return l;                       // lead byte of an l-byte sequence
}
It took less time than researching how to use mblen properly. (mblen interprets bytes according to the current LC_CTYPE locale; in the default "C" locale it will not recognize UTF-8 sequences, which is the likely reason it kept returning -1.)
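For reference, a sketch of how mblen can be made to work, assuming a UTF-8 locale is available on the system (the locale name is an assumption; "" would pick up the environment's):

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    // mblen() honours LC_CTYPE; in the default "C" locale it rejects
    // bytes >= 0x80, which looks like a permanent -1.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* s = "\u00E9"; // "é": two bytes in UTF-8
    std::mblen(nullptr, 0);   // reset the internal conversion state
    std::printf("%d\n", std::mblen(s, MB_CUR_MAX)); // expected: 2
}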
Try this: get the file and then loop through the text based on its length.
Pseudocode:
String s = file.toString();
int len = s.length();
for(int i=0; i < len; i++)
{
String the_character = s[i];
// TODO : Do your thing :o)
}