Detect text file encoding - C++

In my program I load plain text files supplied by the user:
QFile file(fileName);
file.open(QIODevice::ReadOnly);
QTextStream stream(&file);
const QString &text = stream.readAll();
This works fine when the files are UTF-8 encoded, but some users import Windows-1252 encoded files, and words with special characters (for example the "è" in "boutonnière") then display incorrectly.
Is there a way to detect the encoding, or at least distinguish between UTF-8 (possibly without BOM), and Windows-1252, without asking the user to tell me the encoding?

It turns out that auto-detecting the encoding is impossible in the general case.
However, there is a workaround to at least fall back to the system locale if the text is not valid UTF-8/UTF-16/UTF-32 text. It uses QTextCodec::codecForUtfText(), which tries to decode a byte array using UTF-8, UTF-16 and UTF-32, and returns the supplied default codec if it fails.
Code to do it:
const QByteArray byteArray = file.readAll(); // read the raw bytes instead of going through QTextStream
QTextCodec *codec = QTextCodec::codecForUtfText(byteArray, QTextCodec::codecForName("System"));
const QString text = codec->toUnicode(byteArray);
Update
The above code will not detect UTF-8 without a BOM, however, as codecForUtfText() relies on the BOM markers. To detect UTF-8 without a BOM, see https://stackoverflow.com/a/18228382/492336.

This trick works for me, at least so far, and it does not require a BOM:
QTextCodec::ConverterState state;
QTextCodec *codec = QTextCodec::codecForName("UTF-8");
const QByteArray data(readSource());
const QString text = codec->toUnicode(data.constData(), data.size(), &state);
if (state.invalidChars > 0)
{
    // Not valid UTF-8 - fall back to the system default locale
    QTextCodec *localeCodec = QTextCodec::codecForLocale();
    if (!localeCodec)
        return;
    ui->textBrowser->setPlainText(localeCodec->toUnicode(data));
}
else
{
    ui->textBrowser->setPlainText(text);
}

Related

Convert QByteArray to QString

I want to encrypt the data of a database, and to do this I used the AES_128 implementation from this link.
The result of the encryption is a QByteArray. The QByteArray is saved to the text file in the correct shape and I can decrypt it correctly, but I need to convert it to a QString and back to a QByteArray to store and read it in the SQLite DB. I tried some options like
QByteArray encodedText;
QString DataAsString = QString(encodedText);
and
std::string DataAsString1 = encodedText.toStdString();
and
QString DataAsString = QTextCodec::codecForName("UTF-8")->toUnicode(encodedText);
and other solutions like this link, but the outputs of these options are not correct, because after the conversion I could no longer decrypt the encoded text correctly.
This is the input string of encoded text:
"\x14r\xF7""6#\xFE\xDB\xF0""D\x1B\xB5\x10\xEDx\xE1""F"
and these are the outputs for the different options:
\024r�6#���D\033�\020�x�F
and
\024r�6#���D\033�\020�x�F
Does anybody have a suggestion about the right conversion?
try to use this:
QString QString::fromUtf8(const QByteArray &str)
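A minimal sketch of that call, using the names from the question. The base64 helpers at the end are my own addition, not part of the original answer: encrypted output is arbitrary binary data and usually not valid UTF-8, so a base64 string round-trips to the original bytes without loss.
#include <QByteArray>
#include <QString>

// The suggested conversion from the answer: interpret the bytes as UTF-8 text.
QString asUtf8Text(const QByteArray &encodedText)
{
    return QString::fromUtf8(encodedText);
}

// My own addition (assumption, not from the answer): a lossless textual form
// that can safely be stored in SQLite and converted back.
QString asBase64Text(const QByteArray &encodedText)
{
    return QString::fromLatin1(encodedText.toBase64());
}

QByteArray fromBase64Text(const QString &stored)
{
    return QByteArray::fromBase64(stored.toLatin1());
}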

Image loading error from base64 string Qt

std::string text = "Some base64 string";
QByteArray bytes = QByteArray::fromBase64(text.c_str());
cover.loadFromData(bytes);
This is my code. It works fine when the base64 string starts with "iV", but in other cases (for example when it starts with "/9j/") cover ends up null. What's the problem?
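A hedged guess: "/9j/" is the base64 prefix of JPEG data while "iV" marks PNG, so the JPEG image format plugin may be unavailable at runtime. A diagnostic sketch (assuming cover is a QPixmap):
#include <QByteArray>
#include <QDebug>
#include <QImageReader>
#include <QPixmap>
#include <string>

void loadCover(QPixmap &cover, const std::string &text)
{
    QByteArray bytes = QByteArray::fromBase64(text.c_str());
    if (!cover.loadFromData(bytes)) {
        // If "jpeg" is missing from this list, the imageformats plugin
        // directory is probably not deployed next to the application.
        qDebug() << "loadFromData failed; supported formats:"
                 << QImageReader::supportedImageFormats();
    }
}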

Why are Unicode fonts not showing properly in the QTextBrowser when Unicode content is read from an HTML file?

I am reading an HTML file. The file basically contains Unicode text such as the following:
<b>akko- sati (ā + kruś), akkhāti (ā + khyā), abbahati (ā + bṛh)</b>
But the QTextBrowser is not interpreting the Unicode fonts, so it shows them as follows:
akko- sati (Ä + kruÅ›), akkhÄti (Ä + khyÄ), abbahati (Ä + bá¹›h)
The QTextBrowser is correctly interpreting the HTML tags, but what's wrong with the Unicode fonts?
Following is my code for reading and populating the Unicode content:
void MainWindow::populateTextBrowser(const QModelIndex &index)
{
    QFile file("Data\\" + index.data().toString() + ".html");
    if (!file.open(QFile::ReadOnly | QFile::Text)) {
        statusBar()->showMessage("Cannot open file: " + file.fileName());
    }
    QTextStream textStream1(&file);
    QString string = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><link rel='stylesheet' type='text/css' href='Data/Accessories/qss.css' />";
    string += textStream1.readAll();
    ui->textBrowser->setHtml(string);
}
However, if I do not read the Unicode content from an HTML file but type it directly into the parameter, only then does it interpret the Unicode fonts. For example, the following works fine:
ui->textBrowser->setHtml("<b>akko- sati (ā + kruś), akkhāti (ā + khyā), abbahati (ā + bṛh)</b>");
How can I read the Unicode content from HTML files and show it in the QTextBrowser?
I shall be very thankful if someone shows me the buggy parts in my code or tells me a better way of solving my problem.
You read a binary file into a QString but do not tell the program which bytes correspond to which Unicode character, i.e. you don't specify the "encoding", a.k.a. the "codec".
To debug your problem, ask QTextStream which codec it uses by default:
QTextStream textStream1(&file);
qDebug() << textStream1.codec()->name();
On my Linux system that is already "UTF-8", but it might be different on your system. To force QTextStream to interpret the input as UTF-8, use QTextStream::setCodec(), as sketched below.
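For instance, applied to the question's populateTextBrowser() (a sketch that keeps the original names; the early return on open failure is my addition):
void MainWindow::populateTextBrowser(const QModelIndex &index)
{
    QFile file("Data\\" + index.data().toString() + ".html");
    if (!file.open(QFile::ReadOnly | QFile::Text)) {
        statusBar()->showMessage("Cannot open file: " + file.fileName());
        return; // my addition: don't try to read a file that failed to open
    }

    QTextStream textStream1(&file);
    textStream1.setCodec("UTF-8"); // force UTF-8 instead of the locale default

    QString string = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />"
                     "<link rel='stylesheet' type='text/css' href='Data/Accessories/qss.css' />";
    string += textStream1.readAll();
    ui->textBrowser->setHtml(string);
}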

Load and display QString with proper encoding

I am trying to load a name containing several special characters from a file, and if it is present in the file (it looks like meno: Marek Ružička/) display it. Code here:
QFile File("info/"+meno+".txt");
File.open(QIODevice::ReadOnly);
QVariant Data(File.readAll());
QString in = Data.toString(), pom;
if (in.contains("meno:")) {
    pom = in.split("meno:").at(1);
    pom = pom.split("/").at(0);
    ui->label_meno->setText(trUtf8("Celé meno: ") + pom);
}
The part trUtf8("Celé meno: ") displays well, but I can't find how to display the string in pom. By itself it looks like Marek RužiÄka, and using the toUtf8() function makes it Marek RuþiÃÂka; I've tried converting it to std::string too, but that doesn't work either. I am not sure whether the conversion from QFile to QVariant and then to QString is right; if this is the cause of the problem, how do I read the data properly?
Try this:
QTextCodec *utf = QTextCodec::codecForName("UTF-8");
QByteArray data = <<INPUT QBYTEARRAY>>; // e.g. the bytes returned by File.readAll()
QString utfString = utf->toUnicode(data);
qDebug() << utfString;
One of the right ways is to use QTextStream for the reading; then you can specify the codec for UTF-8 as follows:
in.setCodec("UTF-8");
See the documentation for further details:
void QTextStream::setCodec(const char * codecName)
Sets the codec for this stream to the QTextCodec for the encoding specified by codecName. Common values for codecName include "ISO 8859-1", "UTF-8", and "UTF-16". If the encoding isn't recognized, nothing happens.
Example:
QTextStream out(&file);
out.setCodec("UTF-8");
Another right way would be to fix your current code without using QTextStream by using the dedicated QString method as follows:
QString in = QString::fromUtf8(File.readAll()), pom;
Please note that you may wish to add more error handling to your code than there is now; for example:
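A minimal sketch of what that could look like, built on the question's code (the exact reporting is only an illustration):
QFile File("info/" + meno + ".txt");
if (!File.open(QIODevice::ReadOnly)) {
    qWarning() << "Cannot open" << File.fileName() << ":" << File.errorString();
    return;
}

const QString in = QString::fromUtf8(File.readAll());
if (!in.contains("meno:"))
    return; // avoids calling .at(1) on a split that produced only one part

const QString pom = in.split("meno:").at(1).split("/").at(0);
ui->label_meno->setText(trUtf8("Celé meno: ") + pom);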

QXmlStreamWriter and Cyrillic

I have a problem with encoding when writing XML files via QXmlStreamWriter on Windows; how can I resolve it? Using stream.setCodec("UTF-8") or "windows-1251" did not help.
QFile *file = new QFile(filename);
if (file->open(QIODevice::WriteOnly | QIODevice::Text))
{
    QXmlStreamWriter stream(file);
    stream.setAutoFormatting(true);
    stream.writeStartDocument();
    stream.writeStartElement("СЕКЦИЯ"); // start root section
    stream.writeStartElement("FIELD");
    stream.writeAttribute("name", "Имя");
    stream.writeAttribute("value", "Иван");
    stream.writeEndElement();
    stream.writeEndElement(); // end СЕКЦИЯ
    file->close();
}
Most likely the interpretation of the string literals in your source file is the problem, not the configuration of the stream writer.
Make sure your source file is encoded in UTF-8 and use QString::fromUtf8("Имя") etc. instead of the implicit literal-to-QString conversion, for example as sketched below.
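A sketch of the question's writer code with that change applied (assuming the .cpp file itself is saved as UTF-8; the QFile is put on the stack here only to avoid the leak in the original):
QFile file(filename);
if (file.open(QIODevice::WriteOnly | QIODevice::Text))
{
    QXmlStreamWriter stream(&file);
    stream.setAutoFormatting(true);
    stream.writeStartDocument();

    // Decode the UTF-8 bytes of the source literals explicitly instead of
    // relying on the implicit const char* to QString conversion.
    stream.writeStartElement(QString::fromUtf8("СЕКЦИЯ")); // start root section
    stream.writeStartElement("FIELD");
    stream.writeAttribute("name", QString::fromUtf8("Имя"));
    stream.writeAttribute("value", QString::fromUtf8("Иван"));
    stream.writeEndElement();
    stream.writeEndElement(); // end СЕКЦИЯ

    stream.writeEndDocument();
    file.close();
}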