How to convert ANSI format file to Unicode - c++

I have a file that is encoded in ANSI format (Notepad++ shows it as "Encoded in ANSI"). It contains special characters (degree Celsius, pound, etc.), and while reading it I want to convert all the characters to Unicode.
How can I convert ANSI to Unicode in C/C++ or Qt?

My Qt is still very rusty, but something along the following lines:
QFile inFile("foo.txt");
if (!inFile.open(QIODevice::ReadOnly | QIODevice::Text))
    return;
QFile outFile("foo.out.txt");
if (!outFile.open(QIODevice::WriteOnly | QIODevice::Truncate))
    return;
QTextStream in(&inFile);   // reads with the locale ("ANSI") codec by default
QTextStream out(&outFile);
out.setCodec("UTF-8");     // write the output as UTF-8
while (!in.atEnd()) {
    QString line = in.readLine();  // readLine() strips the line terminator
    out << line << '\n';           // so write it back explicitly
}
Pieced together from the documentation of QFile and QTextStream, both of which include examples for reading and writing files. By default QTextStream uses the locale's legacy encoding (QTextCodec::codecForLocale()), so we only need to set an explicit encoding on the output QTextStream.
If the file isn't too large you could probably also use
out << in.readAll();
instead of looping over the lines. The loop might also add a trailing line break to the output file that the input did not have (the docs aren't very clear on that).
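QTextStream::readLine() returns each line without its terminator, so a line-by-line copy has to write the line break itself, and it then ends the output with a line break even when the input had none. The same effect can be seen in standard C++ with std::getline. This is only a sketch; copyLineByLine is a hypothetical helper, not part of Qt:

```cpp
#include <sstream>
#include <string>

// Copy text line by line. std::getline (like QTextStream::readLine) drops the
// terminator, so it is appended again after every line, including the last one.
std::string copyLineByLine(const std::string& input) {
    std::istringstream in(input);
    std::string out;
    std::string line;
    while (std::getline(in, line)) {
        out += line;
        out += '\n';
    }
    return out;
}
```

Copying "a\nb" this way yields "a\nb\n", whereas copying the whole content in one go reproduces the input unchanged.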

Hope this helps.
http://geekswithblogs.net/dastblog/archive/2006/11/24/98746.aspx

Just read it with QTextStream. It will apply QTextCodec::codecForLocale, which uses the default ("ANSI") translation of 8-bit characters to Unicode.
Note that this won't work if you've copied an ANSI text file to Mac or Linux, as they have no notion of an ANSI code page. There the locale codec is typically UTF-8, so only the ASCII part of the file will decode correctly. Convert the file to Unicode (UTF-8) on Windows first and then copy it.
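If Qt is not available, or the file has already been moved to a non-Windows machine, the translation can be done by hand once the code page is known. The sketch below assumes the "ANSI" code page is Windows-1252 (the usual one on English-language Windows, but an assumption here) and uses a hypothetical helper name, cp1252ToUtf8. Bytes that Windows-1252 leaves undefined are passed through as the control character with the same value:

```cpp
#include <cstdint>
#include <string>

// Append one code point from the Basic Multilingual Plane as UTF-8.
// All Windows-1252 characters are in that range.
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert Windows-1252 bytes to UTF-8. Only 0x80..0x9F differ from Latin-1.
std::string cp1252ToUtf8(const std::string& in) {
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    std::string out;
    for (unsigned char b : in) {
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? high[b - 0x80] : b;
        appendUtf8(out, cp);
    }
    return out;
}
```

With this table the degree sign (byte B0) becomes C2 B0 and the pound sign (byte A3) becomes C2 A3. For any other code page the table would have to be replaced.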

Related

Qt UTF-8 File to std::string Adds extra characters

I have a UTF-8 encoded text file, which has characters such as ²,³,Ç and ó. When I read the file using the below, the file appears to be read appropriately (at least according to what I can see in Visual Studio's editor when viewing the contents of the contents variable)
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents get converted to an std::string, additional characters are added. For example, the ² gets converted to Â², when it should just be ². This appears to happen for every non-ANSI character: an extra Â is added, which, of course, means that when a new file is saved, the characters are not correct in the output file.
I have, of course, tried simply doing toStdString(), I've also tried toUtf8 and have even tried using the QTextCodec but each fails to give the proper values.
I do not understand why going from UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?
As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not detect the file's encoding and therefore does not decode the file correctly. The codec must be set on the QTextStream before reading so that the characters are parsed properly. I added a comment to the code below to mark the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.
Now, about what you're seeing in the debugger :
You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct).". This is only partially true : QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values : 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16 and thus renders the characters correctly.
You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place : the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.
Shameless self plug : I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing : ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.
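The byte values above can be reproduced without Qt. The sketch below decodes the UTF-8 bytes of ² (C2 B2) as if they were Latin-1, which agrees with Windows-1252 for these two bytes, and then re-encodes the result as UTF-8. The helper names decodeLatin1 and encodeUtf8 are hypothetical, and the encoder deliberately ignores surrogate pairs:

```cpp
#include <string>

// Decode bytes as Latin-1: each byte becomes the UTF-16 code unit of the same value.
std::u16string decodeLatin1(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out += static_cast<char16_t>(b);
    }
    return out;
}

// Encode UTF-16 code units as UTF-8. Surrogate pairs are not handled,
// which is enough for this demonstration.
std::string encodeUtf8(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        unsigned int cp = unit;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

decodeLatin1("\xC2\xB2") gives the two code units 0x00C2 0x00B2, and encodeUtf8 of that result gives the four bytes C3 82 C2 B2, which is the doubled encoding described above.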

UTF-16LE Encoding woes with Qt text editor written in C++

So I have a Qt text editor that I have started creating. I started with this http://doc.qt.io/archives/qt-5.7/gettingstartedqt.html and I have added on to it. So far I have added a proper save/save as function (the version in the link only really has a save as function), a "find" function, and an "open new window" function. Very soon, I will add a find and replace function.
I am mainly doing this for the learning experience, but I am also going to eventually add a few more functions that will specifically help me create PLC configuration files at work. These configuration files could be in many different encodings, but most of them seem to be in UTF-16LE (according to Emacs anyway). My text editor originally had no problem reading UTF-16LE, but it wrote plain text, and I needed to change that.
Here is the snippet from the Emacs description of the encoding system of one of these UTF16-LE files.
U -- utf-16le-with-signature-dos (alias: utf-16-le-dos)
UTF-16 (little endian, with signature (BOM)).
Type: utf-16
EOL type: CRLF
This coding system encodes the following charsets:
unicode
And here is an example of the code that I am using to encode the text in my Qt text editor.
First... This is similar to the link that I gave earlier. The only difference here is that "saveFile" is a global variable that I created to perform a simple "Save" function instead of a "Save As" function. This saves the text as plain text and works like a charm.
void findreplace::on_actionSave_triggered()
{
if (!saveFile.isEmpty())
{
QFile file(saveFile);
if (!file.open(QIODevice::WriteOnly))
{
// error message
}
else
{
QTextStream stream(&file);
stream << ui->textEdit->toPlainText();
stream.flush();
file.close();
}
}
}
Below is my newer version which attempts to save the code in "UTF-16LE." My text editor can read the text just fine after saving it with this, but Emacs will not read it at all. This to me says that the configuration file will probably not be readable by the programs that read it. Something changed, not sure what.
void findreplace::on_actionSave_triggered()
{
if (!saveFile.isEmpty())
{
QFile file(saveFile);
if (!file.open(QIODevice::WriteOnly))
{
// error message
}
else
{
QTextStream stream(&file);
stream << ui->textEdit->toPlainText();
stream.setCodec("UTF-16LE");
QString stream3 = stream.readAll();
//QString stream2 = stream3.setUnicode();
//QTextCodec *codec = QTextCodec::codecForName("UTF-16LE");
//QByteArray stream2 = codec->fromUnicode(stream3);
//file.write(stream3);
stream.flush();
file.close();
}
}
}
The parts that are commented out I also tried, but they ended up writing the file as Asian (Chinese or Japanese) characters. Like I said my text editor, (and Notepad in Wine) can read the file just fine, but Emacs now describes the encoding as the following after saving.
= -- no-conversion (alias: binary)
Do no conversion.
When you visit a file with this coding, the file is read into a
unibyte buffer as is, thus each byte of a file is treated as a
character.
Type: raw-text (text with random binary characters)
EOL type: LF
This indicates to me that something is not right in the file. Eventually this text editor will be used to create multiple text files at once and modify their contents via user input. It would be great if I could get this encoding right.
Thanks to the kind fellows that commented on my post here, I was able to answer my own question. This code here solved my problem.
void findreplace::on_actionSave_triggered()
{
if (!saveFile.isEmpty())
{
QFile file(saveFile);
if (!file.open(QIODevice::WriteOnly))
{
// error message
}
else
{
QTextStream stream(&file);
stream.setCodec("UTF-16LE");
stream.setGenerateByteOrderMark(true);
stream << ui->textEdit->toPlainText();
stream.flush();
file.close();
}
}
}
I set the codec of the stream and then enabled BOM generation with setGenerateByteOrderMark(true). I guess that I have more to learn about encodings. I thought that the byte order mark had to be set to a specific value or something. I wasn't aware that I just had to set this flag to true and that it would take care of itself. Emacs can now read the files generated by saving a document with this code, and the encoding description from Emacs is the same as for the original files. I will eventually add options for the user to pick which encoding they need while saving. Glad that I was able to learn something here.
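At the byte level, a UTF-16LE file with a signature starts with the BOM bytes FF FE, followed by each 16-bit code unit stored low byte first. A minimal sketch in standard C++, without Qt, is shown below. The function name encodeUtf16LeWithBom is hypothetical, and the code only illustrates the file layout, not how QTextStream is implemented:

```cpp
#include <string>

// Serialize UTF-16 code units as little-endian bytes, preceded by the BOM FF FE.
std::string encodeUtf16LeWithBom(const std::u16string& text) {
    std::string bytes = "\xFF\xFE";  // U+FEFF written low byte first
    for (char16_t unit : text) {
        bytes += static_cast<char>(unit & 0xFF);         // low byte
        bytes += static_cast<char>((unit >> 8) & 0xFF);  // high byte
    }
    return bytes;
}
```

For the text "A" this produces the four bytes FF FE 41 00. Without the leading FF FE a reader has no reliable way to tell the byte order, which may be why Emacs fell back to a raw binary view.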

Qt - Reading file containing multiple NULL characters

I have a file that contains multiple NULL characters (\0) in a line. My goal was to load the file and replace each \0 with something else, but I ran into some problems doing that.
Qt stops reading the file once it gets to the point where the NULL character appears.
Code:
QTextStream fileContent;
QFile file(pendingFile);
if (file.open(QIODevice::ReadOnly | QIODevice::Text))
{
fileContent.append(file.readAll());
}
File:
Text
Text
Text /x00/x00/x00/x00/x00/x00/x00
More Text
I am currently using Qt 5.9.1 and develop with VS2017.
Read the file with a QDataStream:
QFile file("raw.dat");
file.open(QIODevice::ReadOnly);
QDataStream fileStream(&file);
qint64 fileSize = file.size();
QByteArray data(fileSize, '\0');
fileStream.readRawData(data.data(), fileSize);
In the QByteArray you can then replace all \0 elements.
Update (comments, look at underscore_d's answer):
data.replace('\0','_');
QString dataString(data);
The problem is not that QFile stops reading at NUL (i.e. '\0', not the NULL macro), but rather that QTextStream considers a NUL to mean end-of-string and thus stops append()ing after the first NUL it encounters.
Here is a previous thread on another site discussing something very similar, with a suggested alternative, which boils down to this:
You need to replace the NULs before feeding them into the QTextStream, as it will not pass them through. The suggestion there is to use a QDataStream. Maybe you don't need a TextStream or any Stream at all, and could just read the files into memory as binary and replace the NULs with your choice of substitute before writing out again.
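The "read as binary, then replace" idea can also be sketched in standard C++ without any Qt stream. The helper names readAllBytes and replaceNuls are hypothetical. The important detail is that the buffer carries an explicit length, so embedded NUL bytes do not end the data:

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes. Binary mode leaves NUL bytes and line
// endings untouched. An unreadable file yields an empty string.
std::string readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Return a copy of the data with every NUL byte replaced by the substitute.
std::string replaceNuls(std::string data, char substitute) {
    std::replace(data.begin(), data.end(), '\0', substitute);
    return data;
}
```

A call such as replaceNuls(readAllBytes("raw.dat"), '_') would then give text that is safe to hand to string APIs that stop at the first NUL.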

C++ Qt cannot read the whole text file

I'm writing a tool for private use. The problem is that Qt cannot read a text file containing all contents published here.
It only reads this
The three dots at the end were added by Qt.
My code for reading the file is following
QFile file;
file.setFileName(m_filename);
if (!file.open(QIODevice::ReadOnly))
return;
QTextStream in(&file);
while (!in.atEnd()) {
m_fileContents += in.readLine();
}
file.close();
Do you have any idea why it doesn't work?
QFile file;
file.setFileName(m_filename);
if (!file.open(QIODevice::ReadOnly))
return;
m_fileContents = file.readAll();
I just tested your code on my own computer with your data and it works well.
If you're using an IDE, it may not display the whole text of your final string, which would explain the three dots at the end of your sample.
Also, as evilruff suggests, you can use the QFile::readAll method directly.

How to read unicode from a file and display the data in a QTextEdit?

I want to read Unicode text from a file and display the corresponding data in a QTextEdit. Please give me some suggestions.
Your question is a bit poor, but you need to use QFile and QTextEdit for this as follows:
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QTextStream in(&file);
while (!in.atEnd())
myTextEdit.append(in.readLine());
or if you are not dealing with a huge file and small memory, you can read the file in as a whole without reading lines and chunks:
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
myTextEdit.setText(file.readAll());
// or setPlainText(file.readAll());
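Both snippets end up with a QString, which stores Unicode internally, but the bytes are decoded differently. QTextStream decodes with the locale's codec by default and switches automatically only when it detects a Unicode byte order mark (BOM), so for UTF-8 input without a BOM call in.setCodec("UTF-8") before reading. In the second snippet the QByteArray returned by readAll() is converted to a QString implicitly, which in Qt 5 assumes UTF-8.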
There are several ways of doing it, so this answer is just giving you some taste, and you will need to fine-tune this based on your specific scenario. You will need to add proper error handling, includes, build system files, etc.
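When the encoding of the input is not known in advance, one common first step is to look for a byte order mark at the start of the raw data. The sketch below is plain C++ with hypothetical names (Bom, detectBom). It only recognises the standard BOM signatures and says nothing about files that have no BOM:

```cpp
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the first bytes of a buffer for a Unicode byte order mark.
// UTF-32LE is tested before UTF-16LE because its BOM (FF FE 00 00) begins
// with the UTF-16LE BOM (FF FE). A UTF-16LE text whose first character is
// U+0000 would therefore be misreported, which is rare in practice.
Bom detectBom(const std::string& data) {
    auto startsWith = [&data](const std::string& prefix) {
        return data.compare(0, prefix.size(), prefix) == 0;
    };
    if (startsWith(std::string("\xFF\xFE\x00\x00", 4))) return Bom::Utf32LE;
    if (startsWith(std::string("\x00\x00\xFE\xFF", 4))) return Bom::Utf32BE;
    if (startsWith("\xEF\xBB\xBF")) return Bom::Utf8;
    if (startsWith("\xFF\xFE")) return Bom::Utf16LE;
    if (startsWith("\xFE\xFF")) return Bom::Utf16BE;
    return Bom::None;
}
```

A result of Bom::None means the encoding has to come from somewhere else, for example a user setting or a known file format.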