Search for UTF-encoded characters using QRegExp

I'm trying to check for characters §£¤ using QRegExp.
QString string = "§¤£";
int res = string.count(QRegExp("[§¤£]"));
And res returns 0.

Edit your .pro file and set the following:
CODECFORSRC = UTF-8
CODECFORTR = UTF-8
Then add to your .cpp file:
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8"));
That will give you UTF-8 support for your source code, and for internationalization if you need it.
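Note that CODECFORSRC and setCodecForCStrings are Qt 4 mechanisms; Qt 5 assumes string literals are UTF-8 already (see the related answer below). A minimal Qt 5 sketch, assuming the .cpp file itself is saved as UTF-8, that avoids relying on any default codec by decoding the literals explicitly:

#include <QString>
#include <QRegExp>
#include <QDebug>

int main() {
    // Decode both the haystack and the pattern explicitly from UTF-8
    QString string = QString::fromUtf8("§¤£");
    int res = string.count(QRegExp(QString::fromUtf8("[§¤£]")));
    qDebug() << res; // expected: 3
    return 0;
}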

Related

Django DJFile encoding

I'm trying to save .txt files with UTF-8 text inside. The text sometimes contains emojis or characters like "ü", "ä", "ö", etc.
Opening the file like this:
with file.open(mode='rb') as f:
    print(f.readlines())
    newMessageObject.attachment = DJFile(f, name=file.name)
    sha256 = get_checksum(attachment, algorithm="SHA256")
    newMessageObject.media_sha256 = sha256
    newMessageObject.save()
    logger.debug(f"[FILE][{messageId}] Added file to database")
readlines() gives me raw bytes, but the file that is created with DJFile is not UTF-8 encoded. How can I achieve that?
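A minimal sketch of one way to do it, assuming file is a pathlib.Path and that you know (or can guess) the source encoding: decode the raw bytes, re-encode them as UTF-8, and wrap the result in Django's ContentFile before attaching it.

from django.core.files.base import ContentFile

raw = file.read_bytes()                       # raw bytes, whatever the encoding
text = raw.decode('utf-8', errors='replace')  # swap in the real source encoding if known
utf8_file = ContentFile(text.encode('utf-8'), name=file.name)
newMessageObject.attachment = utf8_file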

Qt: Safe parsing of Windows format data under Linux

I have a Server-Client application in which JSON data is sent between the two. The Client has a Linux and a Windows version, while the Server runs under Linux.
The Linux Client communicates just fine, but I have problems with the Windows Client.
The problematic JSON data contains a text field with an apostrophe. Say the content is "a dog’s name": the Windows client sends this as "a dog\x92s name", while the Linux client goes for "a dog\xE2\x80\x99s name", at least that is what qDebug() shows me.
I parse the JSON data with the lines
QJsonDocument document = QJsonDocument::fromJson(body);
if(document.isArray()) json_data = document.array();
if(document.isObject()) json_data.append(document.object());
where body is a QByteArray and json_data is a QJsonArray.
If the Windows data is fed into this, the Qt JSON parser does not seem to recognize it as valid JSON, and json_data ends up empty.
I really don't want to special-case those exact characters manually; I want it to work not only with that apostrophe but with any special character a user might enter. Is there some way to handle this in general? I assume the Windows data is in something like the Windows-1252 encoding?
I think the Windows client sends strings encoded in CP1251 or CP1252, while the JSON decoder expects UTF-8.
Perhaps the client's source code is not saved as UTF-8 and contains string literals. Qt 4 has QTextCodec::setCodecForCStrings; Qt 5 assumes string literals are encoded in UTF-8.
$ echo -n "’" | iconv -f utf-8 -t cp1251 | xxd
00000000: 92
$ echo -n "’" | xxd
00000000: e280 99
If you don't want to fix the Windows client the proper way (fixing its output encoding), you can deal with the situation by converting all input from the Windows client to Unicode before building the QJsonDocument on the server.
QByteArray bodycp1252; // raw bytes as received from the Windows client
QTextCodec* cp1252 = QTextCodec::codecForName("CP1252");
QTextCodec* utf8 = QTextCodec::codecForName("UTF-8");
// decode CP1252 to QString, then re-encode as UTF-8 for the JSON parser
QByteArray body = utf8->fromUnicode(cp1252->toUnicode(bodycp1252));
QJsonDocument document = QJsonDocument::fromJson(body);
It's possible to check whether a QByteArray contains valid UTF-8 data with the QUtf8::isValidUtf8(const char *chars, qsizetype len) function. It is defined in private headers, so you need to add QT += core-private. Unfortunately the implementation is not visible to the linker (it is not exported from QtCore.lib), so you also need to add qutfcodec.cpp from the Qt sources to your project to resolve the linker errors.
////////////////// is-valid-utf8.pro
QT -= gui
QT += core core-private
CONFIG += c++11 console
CONFIG -= app_bundle
qt_src = "C:/Qt/5.15.1/Src"
SOURCES += \
main.cpp \
$$qt_src/qtbase/src/corelib/codecs/qutfcodec.cpp
////////////////// main.cpp
#include <QCoreApplication>
#include <private/qutfcodec_p.h>
#include <QTextCodec>
#include <QDebug>
bool isValidUtf8(const QByteArray& data) {
    return QUtf8::isValidUtf8(data.data(), data.size()).isValidUtf8;
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QTextCodec* utf8 = QTextCodec::codecForName("UTF-8");
    QTextCodec* cp1251 = QTextCodec::codecForName("CP1251");

    QByteArray utf8data1 = utf8->fromUnicode("Привет мир");
    QByteArray cp1251data1 = cp1251->fromUnicode("Привет мир");
    QByteArray utf8data2 = utf8->fromUnicode("Hello world");
    QByteArray cp1251data2 = cp1251->fromUnicode("Hello world");

    Q_ASSERT(isValidUtf8(utf8data1));
    Q_ASSERT(isValidUtf8(cp1251data1) == false);
    Q_ASSERT(isValidUtf8(utf8data2));
    Q_ASSERT(isValidUtf8(cp1251data2));

    qDebug() << "test passed";
    return 0;
}
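Putting the check and the conversion together, a sketch of what the server side could do (this assumes, rather than detects, that any non-UTF-8 input is CP1252 from the Windows client):

QByteArray normalizeToUtf8(const QByteArray& body) {
    if (isValidUtf8(body))
        return body; // already UTF-8, e.g. from the Linux client
    // Assumed fallback: treat anything else as CP1252
    QTextCodec* cp1252 = QTextCodec::codecForName("CP1252");
    return cp1252->toUnicode(body).toUtf8();
}

QJsonDocument document = QJsonDocument::fromJson(normalizeToUtf8(body));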

How to encode text in Choregraphe NAO

I want to read a list from a file, but it all comes out garbled, and .encode doesn't really work.
import json, sys

with open('your_file.txt') as f:
    lines = f.read().splitlines()

self.logger.info(lines)
self.tts.say(lines[1])
If your file is saved with UTF-8 encoding, this should work. Note that Choregraphe runs Python 2.7, where the built-in open() has no encoding parameter, so use io.open:
import io

with io.open('text.txt', encoding='utf-8', mode='r') as my_file:
If this doesn't work, your text file's encoding is not UTF-8. Write your file's actual encoding in place of utf-8 (see: How to determine the encoding of text?).
Or, if you share your input file as-is, I can figure that out for you.
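For completeness, a sketch of the whole box under those assumptions (Python 2.7, a UTF-8 file, and a tts proxy whose say() wants a byte string):

# -*- coding: utf-8 -*-
import io

# Assumes your_file.txt is saved as UTF-8
with io.open('your_file.txt', mode='r', encoding='utf-8') as f:
    lines = f.read().splitlines()   # unicode strings

self.logger.info(lines)
# NAOqi's say() expects a str in Python 2, so re-encode the unicode line:
self.tts.say(lines[1].encode('utf-8'))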

C++ MySQL Connector no utf8

I have a problem getting UTF-8 strings from a MySQL database. I use C++ Connector 1.1 and connect with the following code:
sql::ConnectOptionsMap connection_properties;
connection_properties["hostName"] = server;
connection_properties["userName"] = user;
connection_properties["password"] = password;
connection_properties["schema"] = database;
connection_properties["port"] = 3306;
connection_properties["OPT_CHARSET_NAME"] = "utf8";
connection_properties["characterSetResults"] = "utf8";
connection_properties["preInit"] = "SET NAMES utf8";
driver = get_driver_instance();
con = driver->connect(connection_properties);
con->setSchema(database);
I already tried different utf8 options, as you can see....
If a statement returns database strings like "アフガニスタン", I only see characters like "アフガニスタン" in the Visual Studio debugger. The relevant code:
std::string name = res->getString(2);
After JSON encoding it prints "ÒéóÒâòÒé¼ÒâïÒé╣Òé┐Òâ│" on the command line.
Other utf8 columns with plain Latin characters are returned as expected; it only affects translation columns with non-Latin characters.
The same database call from PHP with the same logic (DB connection and JSON encode) on the same PC prints "\u30a2\u30d5\u30ac\u30cb\u30b9\u30bf\u30f3".
Any ideas about that?
Actually there is no problem. I wrote the returned data to a file and all UTF-8 characters are correct. The debugger and CMD are simply not able to display UTF-8 data as expected...
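If you do want cmd.exe to render the UTF-8 bytes instead of mojibake, one option (not part of the original answer) is to switch the console output code page to UTF-8 before printing:

#include <windows.h>
#include <cstdio>

int main() {
    // Code page 65001 = UTF-8; tells the console how to interpret output bytes.
    // The console font must also contain the glyphs you print.
    SetConsoleOutputCP(CP_UTF8);
    std::printf("アフガニスタン\n"); // assumes this file is saved/compiled as UTF-8 (e.g. MSVC /utf-8)
    return 0;
}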

Python 2.7 and Sublime 2 + unicode don't mix

First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other, and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets, but I am still none the wiser about the following:
I'm running Python from a file created in Sublime (not compiling) and executing via the command prompt on an XP machine.
I have a couple of text files named with accents (German, Spanish & French mostly). I want to remove the accented characters (umlauts, acutes, graves, cedillas, etc.) and replace them with their equivalent non-accented look-alikes.
I can strip the accents from a string inside the script. But accessing a text file of the same name causes the stripAccents function to fail. I'm all out of ideas, as I think this is due to a conflict between Sublime and Python.
Here's my script
# -*- coding: utf-8 -*-
import unicodedata
import os

def stripAccents(s):
    try:
        us = unicode(s, "utf-8")
        nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
        print nice
        return nice
    except:
        print ("Fail! : %s" % (s))
        return None

stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften

root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
    for name in files:
        x = name
        x = stripAccents(x)
For the record:
C:\chcp
gets me 437
This is what the code produces for me:
The error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
  File "D:\LearnPython\unicode_accents.py", line 37, in <module>
    x = stripAccents(x)
  File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
    us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte
C:\WINDOWS\system32>
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows's filenames in their native Unicode form, you have to ask for that specifically, by passing a Unicode string to filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from here using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence, so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it is better to stick to Unicode filepath strings so you can include characters that don't fit in the ANSI code page.
Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals create Unicode strings, or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents() and always pass it Unicode strings instead (try the unidecode package to transliterate Unicode to ASCII).
Using Unicode solves several issues transparently:
there won't be a UnicodeDecodeError, because Windows provides a Unicode API for filenames: if you pass Unicode input, you get Unicode output
you won't get mojibake when a bytestring containing text in your Windows encoding (such as cp1252) is displayed in a console using the cp437 encoding, e.g., Blühende -> Blⁿhende (the ü is corrupted)
you may be able to work with text that can't be represented in either cp1252 or cp437, e.g., '❤' (U+2764 HEAVY BLACK HEART)
To print Unicode text to Windows console, you could use win-unicode-console package.
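Following that advice, a sketch of the script rewritten around Unicode strings (Python 2.7; the NFD + encode('ascii', 'ignore') transliteration is kept from the original):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
import unicodedata

def stripAccents(s):
    # s is already a unicode string here; no decoding step is needed
    return unicodedata.normalize("NFD", s).encode("ascii", "ignore")

root = u"D:\\temp\\test\\"  # unicode root, so os.walk returns unicode names
for path, subdirs, files in os.walk(root):
    for name in files:
        print stripAccents(name)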