Qt - Reading file containing multiple NULL characters - c++

I have a file that contains multiple NUL characters (\0) in one line. My goal is to load the file and replace each \0 with something else, but I am having trouble doing that.
Qt stops reading the file once it reaches the first NUL character.
Code:
QTextStream fileContent;
QFile file(pendingFile);
if (file.open(QIODevice::ReadOnly | QIODevice::Text))
{
fileContent.append(file.readAll());
}
File:
Text
Text
Text /x00/x00/x00/x00/x00/x00/x00
More Text
I am currently using Qt 5.9.1 and develop with VS2017.

Read the file with a QDataStream:
QFile file("raw.dat");
file.open(QIODevice::ReadOnly);
QDataStream fileStream(&file);
qint64 fileSize = file.size();
QByteArray data(fileSize, '\0');
fileStream.readRawData(data.data(), fileSize);
In the QByteArray you can then replace all \0 bytes.
Update (comments, look at underscore_d's answer):
data.replace('\0','_');
QString dataString(data);

The problem is not that QFile stops reading at NUL (i.e. '\0', not the NULL macro), but rather that QTextStream considers a NUL to mean end-of-string and thus stops append()ing after the first NUL it encounters.
Here is a previous thread on another site discussing something very similar, with a suggested alternative, which boils down to this:
You need to replace the NULs before feeding them into the QTextStream, as it will not pass them through. The suggestion there is to use a QDataStream. Maybe you don't need a TextStream or any Stream at all, and could just read the files into memory as binary and replace the NULs with your choice of substitute before writing out again.
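As a Qt-free illustration of the "read as binary, replace the NULs, then treat it as text" idea, here is a minimal sketch in standard C++. The helper names replaceNuls and readAllBytes are made up for this example; with Qt you would do the equivalent on a QByteArray, as in the QDataStream answer above.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Replace every NUL byte in `bytes` with `substitute`.
// std::string tracks its own length, so embedded NULs are ordinary bytes here.
std::string replaceNuls(std::string bytes, char substitute)
{
    std::replace(bytes.begin(), bytes.end(), '\0', substitute);
    return bytes;
}

// Read a whole file as raw bytes (binary mode, no newline translation).
// Returns an empty string if the file cannot be opened.
std::string readAllBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return std::string();
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Something like replaceNuls(readAllBytes("raw.dat"), '_') would then give you a buffer that is safe to hand to any API that stops at the first NUL.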

Related

Qt UTF-8 File to std::string Adds extra characters

I have a UTF-8 encoded text file, which has characters such as ²,³,Ç and ó. When I read the file using the below, the file appears to be read appropriately (at least according to what I can see in Visual Studio's editor when viewing the contents of the contents variable)
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents get converted to an std::string, additional characters are added. For example, the ² gets converted to Â², when it should just be ². This appears to happen for every non-ANSI character: an extra Â is added, which, of course, means that when a new file is saved, the characters are not correct in the output file.
I have, of course, tried simply doing toStdString(), I've also tried toUtf8 and have even tried using the QTextCodec but each fails to give the proper values.
I do not understand why going from UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?
As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not read in the encoding, and, therefore, does not actually read the file correctly. The QTextStream must have its codec set prior to reading the file in order to properly parse the characters. I added a comment to the code below to show the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.
Now, about what you're seeing in the debugger :
You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct).". This is only partially true : QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values : 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16 and thus renders the characters correctly.
You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place : the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.
Shameless self plug : I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing : ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.
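To make the byte arithmetic above concrete, here is a small sketch of the "double encoding" effect in plain C++. It assumes the reader decoded the file one byte per character (Latin-1 is used here; for the bytes C2 and B2, Windows-1252 gives the same result) and then re-encoded the result as UTF-8. The function name latin1ToUtf8 is made up for this illustration.

```cpp
#include <string>

// Treat every input byte as one Latin-1 (ISO 8859-1) character and
// re-encode it as UTF-8. Applied to bytes that are already UTF-8,
// this produces "double-encoded" output.
std::string latin1ToUtf8(const std::string& bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            // ASCII is identical in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of ² (C2 B2) yields C3 82 C2 B2, which is the four-byte sequence discussed in the answer.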

UTF-16LE Encoding woes with Qt text editor written in C++

So I have a Qt text editor that I have started creating. I started with this http://doc.qt.io/archives/qt-5.7/gettingstartedqt.html and I have added on to it. So far I have added a proper save/save as function (the version in the link only really has a save as function), a "find" function, and an "open new window" function. Very soon, I will add a find and replace function.
I am mainly doing this for the learning experience, but I am also going to eventually add a few more functions that will specifically help me create PLC configuration files at work. These configuration files could be in many different encodings, but most of them seem to be in UTF-16LE (according to Emacs anyway). My text editor originally had no problem reading the UTF-16LE, but it wrote plain text, so I needed to change that.
Here is the snippet from the Emacs description of the encoding system of one of these UTF16-LE files.
U -- utf-16le-with-signature-dos (alias: utf-16-le-dos)
UTF-16 (little endian, with signature (BOM)).
Type: utf-16
EOL type: CRLF
This coding system encodes the following charsets:
unicode
And here is an example of the code that I am using to encode the text in my Qt text editor.
First, this is similar to the link that I gave earlier. The only difference here is that "saveFile" is a global variable that I created to perform a simple "Save" function instead of a "Save As" function. This saves the text as plain text and works like a charm.
void findreplace::on_actionSave_triggered()
{
if (!saveFile.isEmpty())
{
QFile file(saveFile);
if (!file.open(QIODevice::WriteOnly))
{
// error message
}
else
{
QTextStream stream(&file);
stream << ui->textEdit->toPlainText();
stream.flush();
file.close();
}
}
}
Below is my newer version which attempts to save the code in "UTF-16LE." My text editor can read the text just fine after saving it with this, but Emacs will not read it at all. This to me says that the configuration file will probably not be readable by the programs that read it. Something changed, not sure what.
void findreplace::on_actionSave_triggered()
{
if (!saveFile.isEmpty())
{
QFile file(saveFile);
if (!file.open(QIODevice::WriteOnly))
{
// error message
}
else
{
QTextStream stream(&file);
stream << ui->textEdit->toPlainText();
stream.setCodec("UTF-16LE");
QString stream3 = stream.readAll();
//QString stream2 = stream3.setUnicode();
//QTextCodec *codec = QTextCodec::codecForName("UTF-16LE");
//QByteArray stream2 = codec->fromUnicode(stream3);
//file.write(stream3);
stream.flush();
file.close();
}
}
}
The parts that are commented out I also tried, but they ended up writing the file as Asian (Chinese or Japanese) characters. Like I said my text editor, (and Notepad in Wine) can read the file just fine, but Emacs now describes the encoding as the following after saving.
= -- no-conversion (alias: binary)
Do no conversion.
When you visit a file with this coding, the file is read into a
unibyte buffer as is, thus each byte of a file is treated as a
character.
Type: raw-text (text with random binary characters)
EOL type: LF
This indicates to me that something is not right in the file. Eventually this text editor will be used to create multiple text files at once and modify their contents via user input. It would be great if I could get this encoding right.
Thanks to the kind fellows that commented on my post here, I was able to answer my own question. This code here solved my problem.
void findreplace::on_actionSave_triggered()
{
if (!saveFile.isEmpty())
{
QFile file(saveFile);
if (!file.open(QIODevice::WriteOnly))
{
// error message
}
else
{
QTextStream stream(&file);
stream.setCodec("UTF-16LE");
stream.setGenerateByteOrderMark(true);
stream << ui->textEdit->toPlainText();
stream.flush();
file.close();
}
}
}
I set the codec of the stream, and then set the generate-BOM flag to true. I guess that I have more to learn about encodings. I thought that the byte order mark had to be set to a specific value or something. I wasn't aware that I just had to set this flag to true and that it would take care of itself. Emacs can now read the files that are generated by saving a document with this code, and the encoding description from Emacs is the same as before. I will eventually add options for the user to pick which encoding they need while saving. Glad that I was able to learn something here.
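For anyone curious what "UTF-16LE with a BOM" looks like on disk, here is a rough sketch in standard C++ of the byte layout such a file has: the byte order mark U+FEFF first, then each 16-bit code unit with its low byte first. The function name encodeUtf16LeWithBom is invented for this example; it is not how Qt implements it, only an illustration of the resulting bytes.

```cpp
#include <string>

// Serialise UTF-16 code units as little-endian bytes, preceded by the
// byte order mark U+FEFF (which appears on disk as FF FE).
std::string encodeUtf16LeWithBom(const std::u16string& text)
{
    std::string out;
    out.reserve(2 + text.size() * 2);
    out.push_back(static_cast<char>(0xFF)); // BOM, low byte
    out.push_back(static_cast<char>(0xFE)); // BOM, high byte
    for (char16_t unit : text) {
        out.push_back(static_cast<char>(unit & 0xFF));        // low byte first
        out.push_back(static_cast<char>((unit >> 8) & 0xFF)); // then high byte
    }
    return out;
}
```

The letter A, for instance, comes out as FF FE 41 00. Without the leading FF FE, a tool has to guess the encoding, which is presumably why Emacs fell back to "no-conversion" for the earlier output.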

Why am i getting these invalid characters before my file data?

I am trying to read a file into a string, either with the getline function or with fileContents.assign( (istreambuf_iterator<char>(myFile)), (istreambuf_iterator<char>()));
Either way gives me the output shown in the image.
First way:
string fileContents;
ifstream myFile("textFile.txt");
while(getline(myFile,fileContents))
cout<<fileContents<<endl;
Alternate way:
string fileContents;
ifstream myFile(fileName.c_str());
if (myFile.is_open())
{
fileContents.assign( (istreambuf_iterator<char>(myFile) ),
(istreambuf_iterator<char>() ) );
cout<<fileContents;
}
The file begins with those characters, most likely a BOM to tell you what the encoding of the file is.
You probably are not able to see them in Windows Notepad because Notepad hides the encoding bytes. Get a decent text editor that lets you see the binary of the file and you will see those characters.
Your file starts with a UTF-8 BOM (bytes 0xEF 0xBB 0xBF). You are reading the file's raw bytes as-is and outputting them to a display that is using an OEM font for codepage 437. To handle text files properly, especially Unicode-encoded text files, you need to read the first few bytes, check for a BOM (and there are several you can look for), and if detected then seek past the BOM and interpret the remaining bytes of the file in the specified encoding, in this case UTF-8.
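The BOM check described above could look roughly like this in standard C++. This is only a sketch: the BomInfo struct and the detectBom function are names made up for this example, and it only recognises the UTF-8, UTF-16 and UTF-32 marks.

```cpp
#include <cstddef>
#include <string>

struct BomInfo {
    const char* encoding; // "UTF-8", "UTF-16LE", ... or "none"
    std::size_t length;   // number of BOM bytes to skip
};

// Inspect the first bytes of a buffer and report which BOM, if any, it starts with.
BomInfo detectBom(const std::string& bytes)
{
    auto startsWith = [&bytes](const std::string& prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };

    if (startsWith("\xEF\xBB\xBF"))
        return {"UTF-8", 3};
    // The UTF-32LE mark begins with the UTF-16LE mark, so test it first.
    if (startsWith(std::string("\xFF\xFE\x00\x00", 4)))
        return {"UTF-32LE", 4};
    if (startsWith(std::string("\x00\x00\xFE\xFF", 4)))
        return {"UTF-32BE", 4};
    if (startsWith("\xFF\xFE"))
        return {"UTF-16LE", 2};
    if (startsWith("\xFE\xFF"))
        return {"UTF-16BE", 2};
    return {"none", 0};
}
```

After reading the file into a string, you would skip the first detectBom(contents).length bytes before printing. One caveat of checking UTF-32LE first: a UTF-16LE file whose first character after the BOM is U+0000 would be misidentified.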

QFile does not open file

QLabel* codeLabel = new Qlabel;
QFile file("C:\index.txt");
file.open(stderr, QIODevice::WriteOnly);
QByteArray data;
data = file.readAll();
codeLabel->setText("test"+QString(data));
file.close();
Then there is only "test" in QLabel.
Help, Please
Aside from the fact you should escape backslashes within C-style strings (c:\\index.txt), you have a problem with the following sequence:
// vvvvvvvvv
file.open(stderr, QIODevice::WriteOnly);
:
data = file.readAll();
// ^^^^
What exactly did you think was going to happen when you opened the file write-only, then tried to read it? You need to open it for reading such as with QIODevice::ReadOnly or QIODevice::ReadWrite.
On top of that, you should check the return code of all functions that fail by giving you a return code. You currently have no idea whether the file.open() worked or not.
I'm also not convinced that you should be opening stderr (which is really an output "device") for input. You'll almost certainly never get any actual data coming in on that file descriptor, which is probably why your input is empty.
You need to step back and ask what you're trying to achieve. For example, are you trying to capture everything your process sends to standard error? If so, it's not going to work that way.
If you're just trying to read the index.txt file, you're using the wrong overload. Remove the stderr parameter altogether:
file.open (QIODevice::ReadOnly);
If it's something else you're trying to do, add that to the question.
file.open(stderr, QIODevice::WriteOnly);
this closes the file again and reopens with the stderr stream in write only mode
you'll want to change that to
file.open(QIODevice::ReadOnly);
QFile file("C:\index.txt");
Here you try to open a file called C:index.txt, because '\i' is converted to i. You want to double your backslash:
QFile file("C:\\index.txt");
Because you read from a file you opened write-only.
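On the backslash point raised in the answers, here is a short sketch of three ways to spell the same Windows path in a C++ string literal. The function names are only for illustration. The forward-slash variant relies on the fact that QFile expects '/' as the separator on every platform.

```cpp
#include <string>

// Escaped form: each "\\" in the source is one backslash in the string.
std::string escapedPath()
{
    return "C:\\index.txt";
}

// Raw string literal (C++11): backslashes are taken literally.
std::string rawPath()
{
    return R"(C:\index.txt)";
}

// Forward slashes need no escaping at all.
std::string forwardSlashPath()
{
    return "C:/index.txt";
}
```

The first two produce the identical 12-character string, so either can be passed to the QFile constructor.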

How to convert ANSI format file to Unicode

I have a file that is encoded in ANSI (Notepad++ shows it as "Encoded in ANSI"). It contains special characters (degree Celsius, pound, etc.), and while reading it I want to convert all the characters to Unicode.
How can I convert ANSI to Unicode in C/C++ or Qt?
My Qt is still very rusty, but something along the following lines:
QFile inFile("foo.txt");
if (!inFile.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QFile outFile("foo.out.txt");
if (!outFile.open(QIODevice::WriteOnly | QIODevice::Truncate))
return;
QTextStream in(&inFile);
QTextStream out(&outFile);
out.setCodec("UTF-8");
while (!in.atEnd()) {
QString line = in.readLine(); // readLine() strips the line terminator
out << line << '\n'; // so write it back explicitly
}
Pieced together from the documentation of QFile and QTextStream, both of which include examples for reading and writing files. The default for QTextStream is to use the legacy encoding, so we only need to set an explicit encoding on the output QTextStream.
If the file isn't too large you could probably also use
out << in.readAll();
instead of the loop over the lines. The loop especially might add a trailing line break to the output file (although the docs aren't very clear on that).
Hope this helps.
http://geekswithblogs.net/dastblog/archive/2006/11/24/98746.aspx
Just read it with QTextStream. It will apply QTextCodec::codecForLocale, which uses the default ("ANSI") translation of 8 bits characters to Unicode.
Note that this won't work if you've copied an ANSI text file to Mac or Linux, as they don't have the notion of ANSI. For them, the ANSI text file will be ASCII-like so you should first convert to Unicode (UTF-8) and then copy.
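If you need the conversion without depending on the machine's locale (for example after copying the file to Linux), you can do it by hand. The sketch below assumes that "ANSI" here means Windows-1252, which is only true for Western European Windows settings; the function names cp1252ToUtf8 and appendUtf8 are invented for this example. Bytes that Windows-1252 leaves undefined are passed through as the matching C1 control code points.

```cpp
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of a code point below U+10000.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Convert Windows-1252 bytes to UTF-8.
std::string cp1252ToUtf8(const std::string& bytes)
{
    // Code points for bytes 0x80..0x9F; all other bytes equal their code point.
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };

    std::string out;
    for (unsigned char c : bytes) {
        std::uint32_t cp = (c >= 0x80 && c <= 0x9F) ? high[c - 0x80] : c;
        appendUtf8(out, cp);
    }
    return out;
}
```

With this, the degree sign (byte B0) becomes C2 B0 and the pound sign (byte A3) becomes C2 A3, which is what a UTF-8 aware editor expects.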