Why are ANSI escape codes not displaying properly?

I am implementing a Python console application, which uses ANSI escape codes to colorize various things. I develop on Pop OS (an Ubuntu derivative), and colorization works as designed.
I just tried the app on a CentOS machine, and while the colors come out correctly, there is additional text surrounding the colorized text (tiny boxes containing digits, stacked vertically) that apparently corresponds to the escape codes.
The escape codes are all specified in this bit of Python:
style = ('\033[1m\033[3m' if bold and italic else
         '\033[1m' if bold else
         '\033[3m' if italic else
         '\033[0m')
return f'\001{style}\002\001\033[38;5;{color.code}m\002{s}\001\033[0m\002'
(The project I'm working on is https://github.com/geophile/marcel, and the above code comes from marcel.util.colorize().)
What's really odd is that in some cases the extra characters aren't there, while in other cases they are. Also, if I ssh from my Pop OS machine to my CentOS machine, the text is colorized correctly in all cases.
What explains this difference in behavior -- something in .bashrc? Something about X configuration?

That \002 is not an "ANSI" escape code; it's the ASCII STX control character. The \001/\002 pair is a GNU readline convention (RL_PROMPT_START_IGNORE / RL_PROMPT_END_IGNORE) for bracketing non-printing sequences in a prompt so readline can compute the prompt's visible width correctly. Only programs that use readline interpret those bytes; terminals do not. Depending on how the string is used, the bytes may bypass the program that was intended to process them and reach the terminal raw, and some terminal fonts render stray control characters as exactly the tiny numbered boxes you describe. (Some terminals may of course provide their own interpretation for \002, etc., but you're unlikely to find that documented anywhere except their source code.)
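As a sketch of the usual fix (the function name and structure here are my own, not marcel's API): emit the \001/\002 brackets only when the string will be embedded in a readline prompt, and plain escape codes for ordinary output.

BOLD, RESET = '\033[1m', '\033[0m'

def colorize(s, style, for_prompt=False):
    # Sketch only; this colorize() is hypothetical, not marcel's function.
    if for_prompt:
        # \001/\002 tell GNU readline "the bytes in between are invisible",
        # so its line-editing width calculations stay correct.
        return f'\001{style}\002{s}\001{RESET}\002'
    # Ordinary output: the terminal should see only the ANSI codes.
    return f'{style}{s}{RESET}'

print(colorize('error', BOLD))             # plain output: no \001/\002
text = input(colorize('M> ', BOLD, True))  # safe inside a readline prompt

(input() only routes through readline when the readline module has been imported, so treat this as a sketch of the idea rather than a drop-in patch.)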

Related

C++ char spacing in console output, UTF-16 characters

I'm making a game in the C++ console using UTF-16 characters to make it a little more interesting, but some characters are a different size than others. So, when I print the level, everything after such a character is shifted further right than it should be. Is there any way to add spacing between characters with some console function? I tried to google something helpful, but I haven't found anything.
I tried to change the font size with CONSOLE_FONT_INFOEX, but it changed nothing; maybe I implemented it in the wrong way, or it doesn't work with UTF-16 characters.
// I tried this
CONSOLE_FONT_INFOEX cfi = {};
cfi.cbSize = sizeof(cfi);
cfi.dwFontSize.X = 24;
cfi.dwFontSize.Y = 24;
// note: filling in the struct alone has no effect; it would also need to be
// applied with SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &cfi)
Unfortunately I expect that this will heavily depend on the particular console you're using. Some less Unicode-friendly consoles will treat all characters as the same size (possibly cutting off the right half of larger characters), and some consoles will cause larger characters to push the rest of the line to the right (which is what I see in the linked image). The most reasonable consoles I've observed have a set of characters considered "double-wide" and reserve two monospace columns for those characters instead of one, so the rest of the line still fits into the grid.
That said, you may want to experiment with different console programs. Can I assume you are on Windows? In that case, you might want to give Windows Terminal a try. If that doesn't work, there are other console programs available, such as MSYS's mintty or ConEmu.
So, after some intense googling I found the solution, and the solution is to fight fire with fire. Unicode includes the character THIN SPACE, which is about 1/5 the width of a normal space, so if I insert two of them together with one normal space after my problematic character, the output displays how I want. If anybody runs into a similar issue, Unicode has a lot of different-sized spaces; I found a website that shows all of them, with their properties.
[picture of the fixed output]
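A sketch of this workaround in Python (the padding rule, the helper name, and the use of unicodedata.east_asian_width are my assumptions; the right amount of padding depends on the console font):

import unicodedata

THIN_SPACE = '\u2009'  # U+2009 THIN SPACE, narrower than a regular space

def pad_wide_chars(line, pad=THIN_SPACE * 2):
    # Append thin spaces after characters the console is likely to render
    # wider than one cell ('W' = Wide, 'F' = Fullwidth).
    out = []
    for ch in line:
        out.append(ch)
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            out.append(pad)
    return ''.join(out)

print(pad_wide_chars('ＡＢＣ vs ABC'))  # fullwidth letters get extra spacing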

QML word wrap with nbsp on Linux

I have the following problem:
When I build my application on Windows, QML texts actually wrap correctly with respect to the nbsp character (U+00A0, I think). On my Raspberry Pi with Raspbian, however, it seems that the nbsp is ignored and the text is wrapped as if it were just a normal space.
There are several things that may have some importance here:
On Windows I have Qt 5.4, whereas on the Raspberry Pi there is 5.2
I think it may have something to do with encoding. The thing is, I remember it worked before I forced the g++ compiler on the Pi to read the input files as CP1250 (I added QMAKE_CXXFLAGS += -finput-charset=CP1250 to the project file). I had to make this tweak because of the diacritics in some of the string literals (otherwise the texts are completely broken on the Raspberry). So, as I said, I think the word wrap worked before I changed this compiler switch.
But still, there is not a single problem with the display of anything, except that the texts happen to be broken where they shouldn't be. Note that no "random" character appears in place of the nbsp, just what looks like a regular space. That's absolutely strange, as it suggests there is no problem with the encoding but rather with the word-wrapping algorithm itself. But as I said, it used to work when g++ assumed the string literals were in whatever the default on Linux is (UTF-8, I guess...).
As for the QML Text assignment, these strings are taken from a C array and assigned to the QML text using QObject::setProperty, if that is of any importance...
Also note that I probably cannot change the encoding of my sources to UTF-8, because the file with the strings is also shared with an embedded project that works on the other side of the communication, and that one has to be CP1250 because of its IDE.
Thanks in advance
EDIT:
I have some additional information: if I go through one of the affected string literals on Windows, it is in fact shorter than the same literal compiled on the Raspberry, even when the source encoding is set to CP1250. For example, the nbsp is encoded in only one byte on Windows (160), but in two bytes on the Raspberry (194, 160). That's strange, isn't it? I'd expect that after telling g++ the source code is encoded in CP1250, it would encode the literals the same way. Or maybe not, because this is the encoding of the string in memory, whose default differs between Windows and Linux. But I still don't see where the problem is.
As suggested by Kevin Krammer,
QString::fromLocal8Bit()
was the solution.
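The byte-level difference described in the edit can be reproduced in a few lines of Python (illustration only; the project itself is C++/Qt):

nbsp = '\u00a0'                     # NO-BREAK SPACE
print(list(nbsp.encode('cp1250')))  # [160]      -- one byte, as on Windows
print(list(nbsp.encode('utf-8')))   # [194, 160] -- two bytes, as on the Pi

g++'s -finput-charset only controls how the source file is read; the execution charset still defaults to UTF-8, so the literal lands in memory as 0xC2 0xA0. QString::fromLocal8Bit() then decodes those bytes with the local 8-bit codec (UTF-8 on the Pi), which is presumably why it restored the single U+00A0 character and the correct wrapping.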

Encoding issue using XZIP

I wrote a C++ program that needs to zip files as part of its work. For creating these zip files I used the XZip library. During development the program ran on a Win7 machine, where it works fine.
Now the program is supposed to be used on a Windows XP machine. The issue I run into is:
If I let XZip create the zip archive "ü.zip" and add the file "ü.txt" to it, on Win7 it works as intended. On Windows XP, however, I end up with the "ü.zip" file containing "³.txt".
The "³" => "ü" thing is of course an encoding issue between UTF8 and Ascii (ü = 252 in UTF8 and 252 = ³ in Ascii), BUT I can't really imagine how this could affect the creation of the internal zip structure differently depending on the OS.
//EDIT to clear it up:
The problem is that I ran a test with XZip on Win7 and got the archive "ü.zip" containing the file named "ü.txt".
When I run that test on an XP machine, I get the archive "ü.zip" containing the file "³.txt".
//Edit2:
What makes me wonder is what exactly causes the zip to change between XP and Win7. The fact that it does change means that either a Windows function behaves differently or XZip has OS-specific behavior built in.
Having a quick look at XZip, I can't see that it changes the encoding flag on the zip archives. The question, of course, can only be answered by people who have had a closer look at this exact problem before.
As a general rule, if you want any sort of portability between locales, OSes (including different versions) and what have you, you should limit your filenames to the usual 26 letters, the 10 digits, and perhaps '_' and '-' (and I'm not even sure about the latter), and one '.', no more than three characters from the end. Once you start using letters beyond the original ASCII character set, you're at the mercy of the various programs which interpret the character set.
Also, 252 isn't anything in ASCII, since ASCII only uses character codes in the range 0...127. And in UTF-8, 252 would be the first byte of a six-byte sequence, something that doesn't exist in Unicode: in UTF-8, LATIN SMALL LETTER U WITH DIAERESIS is the two-byte sequence 0xC3, 0xBC. 252 is the encoding of LATIN SMALL LETTER U WITH DIAERESIS in ISO 8859-1, otherwise known as Latin-1; it's also the code point's value in UTF-16 and UTF-32.
None of this, of course, should affect what is in the file.
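The byte values are easy to check (Python used just for the arithmetic). As a guess about the questioner's setup, the ü => ³ substitution is consistent with a Latin-1/CP1252 byte 0xFC being displayed by a tool that uses the OEM code page 850, where 0xFC is SUPERSCRIPT THREE:

u = '\u00fc'                      # LATIN SMALL LETTER U WITH DIAERESIS
print(list(u.encode('latin-1')))  # [252] -- 252 is Latin-1, not ASCII or UTF-8
print(list(u.encode('utf-8')))    # [195, 188] -- i.e. 0xC3, 0xBC
print(b'\xfc'.decode('cp850'))    # '³' -- one way 'ü' can turn into '³'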
Maybe you are building your Win32 program (or the library) as ASCII (not as UNICODE). It may help to build your Win32 applications with the UNICODE configuration setting (you can change it in your Visual Studio project settings).
It is impossible to say what happened in your program without seeing your code. Maybe your library or the archive format is not UNICODE-aware, maybe your program's code is not UNICODE-aware, maybe you don't handle strings carefully enough, or maybe you just have to change your project setting to UNICODE. Your Windows "8-bit encoding for non-Unicode programs" setting also matters if you don't use UNICODE strings.
As for 252, UTF8 and ASCII, read the post by James Kanze. It is more or less safe to use ASCII file names with no ':', '?', '*', '/', '\' characters. Using non-ASCII characters may lead to encoding problems if you are not using UNICODE-aware programs and file systems.

Why aren't my hyphens displaying correctly using std::cout?

I am trying to print out the following string using std::cout :
"Encryptor –pid1 0x34f –pid2"
the '-' characters appear as u's with a circumflex above them (I'm not sure how to type this).
How do I print out the hyphen as intended?
That was not a hyphen.
It was an "en dash" (U+2013), which will render differently across consoles based on encoding settings.
The hyphen key is usually on the number row of your keyboard, on Western layouts.
Make sure your terminal's idea of the character encoding matches that of your source code. How to do this, of course, depends on your operating system, which terminal emulator (assuming it's an emulator at all) you're using, and so on, neither of which you state.
Also, that's not a hyphen in your example: it's too long. It's an "en dash".
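One plausible chain for the "u with a circumflex", sketched in Python (the exact code pages are an assumption about your setup): the en dash is stored in the source as the single Windows-1252 byte 0x96, and the console decodes that byte with the OEM code page 437, where 0x96 is 'û'.

print('\u2013'.encode('cp1252'))  # b'\x96' -- EN DASH in Windows-1252
print(b'\x96'.decode('cp437'))    # 'û'     -- the same byte under CP437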

How can I detect Russian spam posts with Perl?

I have an English language forum site written in perl that is continually bombarded with spam in Russian. Is there a way using Perl and regex to detect Russian text so I can block it?
You can use the following to detect Cyrillic characters (used in Russian); note that in Perl regex syntax, code points in a character class are written with \x{...}:
[\x{0400}-\x{04FF}]+
If you really just want Russian characters, you can take a look at the Unicode code chart for the Cyrillic block, which gives the exact range used for the basic Russian alphabet: [\x{0410}-\x{044F}]. Of course you'd also need to consider the additional Cyrillic characters used in Russian, such as Ё (U+0401) and ё (U+0451), which fall outside that basic range.
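As a sketch of how such a check might look (Python for illustration; the helper name and threshold are made up), counting what fraction of a post's letters are Cyrillic is more robust than matching a single character:

import re

CYRILLIC = re.compile('[\u0400-\u04FF]')

def looks_russian(text, threshold=0.3):
    # Flag the post if a sizeable share of its letters are Cyrillic.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    hits = sum(1 for ch in letters if CYRILLIC.match(ch))
    return hits / len(letters) >= threshold

print(looks_russian('Buy cheap watches'))    # False
print(looks_russian('Купите дешёвые часы'))  # True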
Using the Unicode Cyrillic charset as suggested by JG is fine if everything is encoded as such. However, this is spam, and for the most part it is not. Additionally, spammers will very often use a mix of charsets in their spam, which further undermines this approach.
I find that the best way (or at least a preliminary step in the process) of detecting Russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
The next step after that would be to try some language-detection algorithms on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also "detects" languages) or Xerox; these services provide, IMO, the best language detection around.
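A rough version of that first pass (a sketch; the function and list are mine, and a real filter would parse message headers rather than substring-scan the body):

SUSPECT_CHARSETS = ('koi8-r', 'windows-1251', 'iso-8859-5')

def has_suspect_charset(raw_post):
    # Cheap preliminary filter: look for charset names commonly seen in
    # Russian-language spam before running real language detection.
    lowered = raw_post.lower()
    return any(cs in lowered for cs in SUSPECT_CHARSETS)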