How to process CSV lines with nul char in some elements? - c++

When reading and parsing a CSV-file line, I need to process the nul character that appears as the value of some row fields. It is complicated by the fact that the CSV file is sometimes in windows-1250 encoding, sometimes in UTF-8, and sometimes in UTF-16. Because of this, I started down one path and only discovered the nul-char problem later -- see below.
Details: I need to clean CSV files from a third party into the form common to our data extractor (that is, the utility works as a filter -- reading one CSV form and storing another).
My initial approach was to open the CSV file in binary mode and check whether the first bytes form a BOM. I know that all the Unicode files we receive start with a BOM, so if there is no BOM, the file is in windows-1250 encoding.
The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I reopen it in the corresponding mode, like this:
// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<unsigned char> buf(4, '\0'); // unsigned char: with plain (signed) char the BOM comparisons below would never match
fread(&buf[0], 1, 3, fh); // a shorter file simply leaves the zero padding in place
::fclose(fh);
// Set the isUnicode flag and open the file according to that.
string mode{ "r" }; // init
bool isUnicode = false; // pessimistic init
if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
mode += ", ccs=UTF-8";
isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF) // UTF-16 BE BOM
|| (buf[0] == 0xFF && buf[1] == 0xFE)) // UTF-16 LE BOM
{
mode += ", ccs=UNICODE";
isUnicode = true;
}
// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);
After a successful open, each input line is read either via fgets or via fgetws, depending on whether Unicode was detected. If Unicode was detected earlier, the buffer content is then converted to windows-1250; otherwise the buffer is left untouched. The s variable should contain the string in the windows-1250 encoding. ATL::CW2A(buf, 1250) is used when conversion is needed:
const int bufsize = 4096;
wchar_t buf[bufsize];
// Read the line from the input according to the isUnicode flag.
while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
: (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
{
// If the input is in Unicode, convert the buffer content
// to the string in cp1250. Otherwise, do not touch it.
string s;
if (isUnicode) s = ATL::CW2A(buf, 1250);
else s = reinterpret_cast<char*>(buf);
...
// Now processing the characters of the `s` to form the output file
}
It worked fine... until a file appeared with a nul character used as a field value. The problem is that when the s variable is assigned, the nul cuts off the rest of the line. In the observed case it happened with a file in the 1250 encoding, but it can presumably happen with the UTF-encoded files as well.
How to solve the problem?

The NUL character problem is solved by using either standard C++ or Windows functions. In this case, the easiest solution is the Win32 conversion pair MultiByteToWideChar / WideCharToMultiByte (the latter being the wide-to-1250 direction used above): both accept an explicit string length, precisely so they don't stop on NUL, and std::string can likewise be constructed with an explicit length.
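A minimal sketch of the wide-to-1250 step (the helper name and its length parameter are mine, not part of the question's code):
#include <windows.h>
#include <string>

// `len` is the number of wide characters in `buf` and must be determined
// by the caller (e.g. by scanning for the trailing L'\n' stored by fgetws,
// since wcslen would stop at the first embedded NUL).
std::string wideToCp1250(const wchar_t* buf, int len)
{
    // The first call computes the required output size; because len is passed
    // explicitly, embedded L'\0' characters are converted like any other.
    int needed = WideCharToMultiByte(1250, 0, buf, len, nullptr, 0, nullptr, nullptr);
    std::string s(static_cast<size_t>(needed), '\0');
    if (needed > 0)
        WideCharToMultiByte(1250, 0, buf, len, &s[0], needed, nullptr, nullptr);
    return s; // s.size() == needed; embedded NULs survive
}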

Related

Copying picture (.tga) from text content

I am working on the creation of a game. I want to hide all my .tga files.
I concatenate the string content of all my files into a single file in order to make it illegible to players.
I want my program to load a picture by creating a temporary .tga file from
the saved content.
So I am trying to copy a .tga file from the content of an original one.
More precisely, I read a .tga file as text and write it back.
Even though Notepad++ shows the original file and the new file as identical, the new file cannot be opened as a .tga file. Windows reports the file sizes with a 1-byte offset.
Can you explain what I am doing wrong?
Or maybe suggest a better way to hide my files.
Regards
More precisely, I read a .tga file as text and write it back
Herein may lie your problem: you have to read and write the .tga file as a binary file. Otherwise, any occurrence of the byte sequence 0x0D 0x0A (CR LF, the Windows line ending) may be replaced with a single 0x0A (LF, the Unix line ending) or vice versa, or a 0x1A (DOS end of file) byte may be stripped or appended. Depending on the code you are using, you may also end up stripping any 0x00 (NUL) bytes.
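For reference, a byte-exact copy needs nothing more than two binary-mode streams; a minimal sketch (file names taken from your code):
#include <fstream>

int main()
{
    std::ifstream in("my_picture.tga", std::ios::binary);
    std::ofstream out("test.tga", std::ios::binary);
    out << in.rdbuf(); // copies every byte; 0x00, 0x0D and 0x1A included
}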
I tried to read and write a .tga file as a binary file with my program (C++), but the generated file was still corrupted. The code is below.
std::string name = "my_picture.tga";
std::ifstream FileIn(name, std::ios_base::binary);
std::vector<char> listChar;
bool stopp = false;
if (FileIn) {
while (!(stopp))
{
char xin;
FileIn.read(reinterpret_cast<char*>(&xin), sizeof(char));
listChar.push_back(xin);
if (FileIn.eof()) stopp = true;
}
FileIn.close();
}
std::ofstream FileOut(".\\test.tga", std::ios_base::binary);
bool isCarriageReturn = false;
for (char xout : listChar) {
isCarriageReturn = xout == '\r';
if (!isCarierReturn) FileOut.write(reinterpret_cast<const char*>(&xout), sizeof(char));
}
FileOut.close();
I compared the original file and the new one on a hexadecimal reader and files are effectively different.
The difference between the original file and the new one is a mismatch in the line endings: where the original file had just 0x0A ('\n'), the new file had the byte sequence 0x0D 0x0A ('\r' '\n'). On some other pictures the generated file was incomplete, with the break always right before a 0x1A byte (as @Christoph Lipka said).
I managed to write the right sequence by testing whether the char is a carriage return followed by a line feed; in that case the char is not written, so only the 0x0D byte is skipped, see below:
std::ofstream FileOut(".\\test.tga", std::ios_base::binary);
bool isCarriageReturn = false;
char xout_p1 = '\0';
if (listChar.size() >= 1) xout_p1 = listChar.at(0);
for (unsigned i(0); i < listChar.size(); i++) {
char xout = xout_p1;
if (i < listChar.size() - 1) xout_p1 = listChar.at(i + 1);
else xout_p1 = '\0';
isCarriageReturn = xout == '\r' && xout_p1 == '\n';
if (!isCarriageReturn) FileOut.write(reinterpret_cast<const char*>(&xout), sizeof(char));
}
FileOut.close();
The incomplete reads are solved by opening the file as a binary file.
It works.

How to make a filestream read in UTF-8 C++

I am able to successfully read UTF-8 text files by redirecting input and output at the terminal and then using wcin and wcout:
_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);
Now I'd like to be able to read UTF-8 text using file streams, but I don't know how to set the mode of the file streams so that they can read these characters the way stdin and stdout did above. I've tried using wifstream/wofstream, and those still read and write garbage by themselves.
C++'s iostreams library doesn't have built-in support for conversions from one text encoding to another. If you need your input text converted from UTF-8 into another format (say, for example, the underlying codepoints of the encoding), you'll need to write that conversion manually:
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::string data;
std::ifstream in("utf8.txt", std::ios::binary); // binary: get the bytes exactly as stored
in.seekg(0, std::ios::end);
auto size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(size);
in.read(&data[0], size); // &data[0]: the non-const data() overload only exists since C++17
//data now contains the entire contents of the file
uint32_t partial_codepoint = 0;
unsigned num_of_bytes = 0; // continuation bytes still expected
std::vector<uint32_t> codepoints;
for(char c : data) {
    uint8_t byte = uint8_t(c);
    if(byte < 128) {
        //Character is just a basic ascii character, so its value is the codepoint
        if(num_of_bytes > 0) {
            //Data was malformed: error handling?
            //Previous codepoint abruptly ended
            num_of_bytes = 0;
        }
        codepoints.push_back(byte);
    } else if((byte >> 6) == 0b10) {
        //Continuation byte (10xxxxxx) of a multi-byte encoding
        if(num_of_bytes == 0) {
            //Data was malformed: error handling?
            //Continuation byte without a leading byte
            continue;
        }
        partial_codepoint = (partial_codepoint << 6) | (byte & 0b0011'1111);
        if(--num_of_bytes == 0) {
            //All continuation bytes seen; the codepoint is complete
            codepoints.push_back(partial_codepoint);
            partial_codepoint = 0;
        }
    } else {
        //Leading byte (110xxxxx, 1110xxxx or 11110xxx) of a new codepoint
        if(num_of_bytes > 0) {
            //Data was malformed: error handling?
            //Previous codepoint abruptly ended
        }
        unsigned ones = 0;
        while(byte & 0b1000'0000) {
            ones++;
            byte = byte << 1;
        }
        partial_codepoint = byte >> ones; // the payload bits of the leading byte
        num_of_bytes = ones - 1;          // that many continuation bytes must follow
    }
}
This code will reliably convert [correctly encoded] UTF-8 to UTF-32, which is usually the easiest form to convert directly into glyphs and characters (though remember that codepoints are not the same thing as characters).
To keep things consistent in your code, my recommendation is that UTF-8 encoded text be stored in your program as std::string, and UTF-32 encoded text be stored as std::vector<uint32_t>.
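For instance, the loop above can be wrapped in a helper whose signature embodies exactly that recommendation (the function name is mine and error handling is omitted for brevity):
#include <cstdio>
#include <cstdint>
#include <string>
#include <vector>

// Compact variant of the decoding loop above: UTF-8 in, UTF-32 out.
std::vector<uint32_t> decode_utf8(const std::string& data)
{
    std::vector<uint32_t> codepoints;
    uint32_t cp = 0;
    unsigned rest = 0; // continuation bytes still expected
    for (char c : data) {
        uint8_t byte = uint8_t(c);
        if (byte < 128) { codepoints.push_back(byte); rest = 0; }
        else if ((byte >> 6) == 0b10 && rest > 0) {
            cp = (cp << 6) | (byte & 0b0011'1111);
            if (--rest == 0) codepoints.push_back(cp);
        } else {
            unsigned ones = 0;
            while (byte & 0b1000'0000) { ones++; byte <<= 1; }
            cp = byte >> ones;
            rest = ones - 1;
        }
    }
    return codepoints;
}

int main()
{
    std::string s = "h\xC3\xA9llo"; // the UTF-8 bytes of "héllo"
    for (uint32_t cp : decode_utf8(s))
        std::printf("U+%04X\n", static_cast<unsigned>(cp)); // U+0068, U+00E9, ...
}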

Comparing UTF8 encoded chars

There is a CSV file which contains many different languages encoded in UTF-8. I have to parse the file and validate it for invalid characters.
I have written a sample program, shown below…
int main(void)
{
string invalidUTF8Chars = ""; // Invalid UTF-8 Chars array.
invalidUTF8Chars+= "\u00A0";
invalidUTF8Chars+= "\u005E";
invalidUTF8Chars+= "\u00FE";
invalidUTF8Chars+= "\u00BA";
invalidUTF8Chars+= "\u00AF";
FILE* fp;
int ch; // int, not char, so EOF can be distinguished from valid data bytes
fp = fopen("unicodeUTF8TextFile.txt","r");
if(fp != NULL)
{
while(( ch = fgetc(fp) ) != EOF ) // Reading byte by byte from input file.
{
//if (strchr(invalidUTF8Chars.c_str(), ch)) // How do I validate here?
{
printf("Invalid character\n");
}
}
}
return 0;
}
How do I compare the data read from the file against the invalid chars?
When strchr() finds a character it returns a pointer to it, and a NULL pointer otherwise. Since invalidUTF8Chars holds the characters you want to reject, what you need to do is check that the return was not a NULL pointer:
if(strchr(invalidUTF8Chars.c_str(), ch) != nullptr){
printf("Invalid character\n");
}
Here's the strchr() reference for your convenience.
"Invalid character" for UTF-8 may mean either that the UTF-8 encoding is invalid and doesn't correspond to any character, or that decoding yields a character you don't want.
You are interested in the second variant. Each character is encoded as one or more bytes in UTF-8; of your list, "\u005E" is one byte in UTF-8 and the others are 2 bytes each.
Thus you cannot reject individual bytes as your example tries to; you would either need to decode the bytes to Unicode characters, or read everything as UTF-8 and then search for the offending sequences with something like:
if (strstr(readFile, u8"\u00A0") != nullptr || strstr(readFile, u8"\u005E") != nullptr ... ) printf("Found bad character\n");
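A minimal self-contained sketch of that approach, assuming the whole file fits in memory (the file name comes from the question; the unwanted characters are spelled as explicit UTF-8 byte sequences so the snippet does not depend on u8-literal semantics, which changed in C++20):
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("unicodeUTF8TextFile.txt", std::ios::binary);
    std::stringstream ss;
    ss << in.rdbuf();
    const std::string readFile = ss.str();
    // UTF-8 byte sequences of U+00A0, U+005E, U+00FE, U+00BA, U+00AF.
    const char* bad[] = { "\xC2\xA0", "^", "\xC3\xBE", "\xC2\xBA", "\xC2\xAF" };
    for (const char* b : bad)
        if (readFile.find(b) != std::string::npos)
            std::printf("Found bad character\n");
}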

fread and fgetc in the same file stream not working

I have a binary file with the following repeating format: 6 float values + 3 unsigned char values (bytes, integers from 0 to 255).
I am parsing it like this:
FILE *file = fopen("file.bin", "r");
bool valid = true;
while(!feof(file)) {
float vals[6];
valid = valid && (fread((void*)(&vals), sizeof(float), 6, file) == 6);
unsigned char a,b,c;
a = fgetc(file); b = fgetc(file); c = fgetc(file);
(...)
}
This works fine for the first 30 iterations or so, but after that it simply stops parsing, way before the end of the file.
What could be wrong?
I also tried parsing the unsigned char bytes with
fread((void*)&(a), sizeof(unsigned char), 1, file);
but it stops parsing early just the same.
You and the C standard library are having a difference of opinion about where the end of the file is. On DOS/Windows, the ASCII control character SUB (decimal 26, hex 1A, aka Ctrl+Z) traditionally marks the end of a text file, and stdio honors that convention in text mode. There's also the file length stored in the filesystem metadata, which is what binary mode goes by.
The C stdio functions can operate in two modes, text and binary (text being the default when no 'b' is given), and the mode controls several behaviors:
Newline translation (implementation-defined): enabled in text mode (on Windows, "\r\n" on disk is read as "\n"), disabled in binary mode.
End of file: implementation-defined, but on Windows a 0x1A character ends the stream in text mode, while in binary mode only the filesystem's file length matters.
Since your file contains binary data, you should force binary mode by using "b" in the mode string to fopen, e.g.
FILE* file = fopen("file.bin", "rb");
When you do so, characters with value 26 are treated like any other byte and lose their "EOF" meaning.
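Putting it together, a minimal sketch of the reading loop in binary mode (record layout as described in the question; note that checking fread's return value, rather than feof(), also avoids processing one garbage record at the end, since feof() only becomes true after a read has already failed):
#include <cstdio>

int main()
{
    FILE* file = fopen("file.bin", "rb"); // "b": no newline or EOF translation
    if (file == nullptr) return 1;
    float vals[6];
    unsigned char bytes[3];
    while (fread(vals, sizeof(float), 6, file) == 6
        && fread(bytes, 1, 3, file) == 3)
    {
        // ... process one record (vals[0..5], bytes[0..2]) ...
    }
    fclose(file);
}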

Need explanation on creating utf-8 encoded files on linux using c++

I need some explanation of file encodings when using g++ on Linux.
I have a simple program:
int main ()
{
FILE * pFile;
char buffer[] = { 'x' , 'y' , 'z' ,'é' };
pFile = fopen ("myfile", "wt, ccs=UTF-8");
//pFile = fopen ("myfile", "wt");
fwrite (buffer , sizeof(char), sizeof(buffer), pFile);
fclose (pFile);
return 0;
}
Even though the "ccs=UTF-8" part is added on the fopen line, the output file of this program is always encoded in iso-8859-1. However, if I create a file containing these characters using vi on Linux, the resulting file is UTF-8 encoded (I use the command "file myfile" to see the encoding of the file, and "xxd -b myfile" confirms this behavior).
So I would like to understand:
1- Why doesn't g++ on Linux create a UTF-8 file by default?
2- What is the aim of ccs=UTF-8 if the created file is not encoded in UTF-8?
3- How can I create a UTF-8 file based on this simple code?
Thanks.
Your file may appear to be in ISO-8859-1, but it's actually not. It's simply broken.
Your file contains the byte A9, which is the lower byte of the UTF-8 representation of é (C3 A9).
When you wrote 'é', the compiler should have warned you:
aaa.c:4:38: warning: multi-character character constant [-Wmultichar]
char buffer[] = { 'x' , 'y' , 'z' ,'é' };
^
char is not a type for a character, it's a type for one byte. GCC treats multi-character constants as big-endian integers; here the value is immediately truncated to char, leaving only the lowest byte: A9.
(BTW, é in ISO-8859-1 is E9, not A9)
You open your file with an encoding, but then you write raw bytes into it. Those bytes correspond to the ISO-8859-1 characters xyz©.
If you want to write characters, not bytes, then use wchar_t instead of char and fputws instead of fwrite:
#include <stdio.h>
#include <wchar.h>
int main ()
{
FILE * pFile;
// note final zero and L indicating wchar_t literal
wchar_t buffer[] = { 'x' , 'y' , 'z' , L'é' , 0};
// note no space before ccs
pFile = fopen ("myfile", "wt,ccs=UTF-8");
fputws(buffer, pFile);
fclose (pFile);
return 0;
}