wrong text encoding on linux

wrong text encoding on linux - c++

I downloaded a source code .rar file from internet to my linux server. Then, I extract all source files into local directory. When I use "cat" command to see the content of each file, the wrong text encoding is shown on my terminal (There are some chinese characters in the source file).
I use
file -bi testapi.cpp
then shows:
text/plain; charset=iso-8859-1
I tried to convert that file to uft-8 encoding with following command:
iconv -f ISO88591 -t UTF8 testapi.cpp > new.cpp
But it doesn't work.
I set my .vimrc file with following two lines:
set encoding=utf-8
set fileencoding=utf-8
After this, when I vim testapi.cpp, the chinese characters will be normally displayed in the vim. But cat testapi.cpp doesn't work.
When I compile and run the program, the printf statement with chinese characters will print wrong characters like ????
What should I do to display correct chinese characters when I run the program?

TLDR Quickest Solution: Copy/Paste the Visible Text to a Brand-New, Confirmed UTF-8 File
Your file is marked as latin1, but the data is stored as utf8.
When you set set-enc=utf8 or set fileencoding=utf-8 in VIM, you're not changing the data or converting it. You're looking at the same exact data, but interpreting as if it is the utf8 charset. So, good news: Your data is good. No conversion or changing necessary.
You just need to put the same exact data into a file already marked as UTF-8 encoding. That can be done easily by simply making a brand new file in vim, using set enc=utf8, and then copy-pasting your old data into the new file. You can test this out by making a testfile with the only text "汉语" ("chinese language"), set enc, save, close, reopen, and see that the text didn't get corrupted. And you can test with file -bi $pathtofile, though that is not super reliable.
Anyway, TLDR: Make a brand new UTF-8 file, confirm that it's utf-8, make your data visible, and then copy/paste and/or transfer it to the new UTF-8 file, without doing any conversion.
Also, theoretically, I considered that iconv -f utf8 -t utf8 would work, since all I wanted to do was make utf-8-encoded data be marked as utf-8-encoded, without changing it. But this gave me an error that indicated it was still trying to do a data conversion.

Related

C++: Problem of Korean alphabet encoding in text file write process with std::ofstream

I have a code for save the log as a text file.
It usually works well, but I found a case where doesn't work:
{Id": "testman", "ip": "192.168.1.1", "target": "?뚯뒪??exe", "desc": "?덈뀞諛⑷??뚯슂"}
My code is a simple logic that saves the log string as a text file.
My code was works well when log is English, but there is a problem when log is Korean language.
After checking through various experiments, it was confirmed that Korean language would not problem if the file could be saved as utf-8 format.
I think, if Korean language is included in log string, c++ is basically saved as ANSI format.
This is my c++ code:
string logfilePath = {path};
log = "{\Id\": \"testman\", \"ip\": \"192.168.1.1\", \"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";
ofstream output(logFilePath, ios::app);
output << log << endl;
output.close();
Is there a way to save log files as uft-8 or any other good way?
Please give me some advice.

You could set UTF-8 in File->Advanced Save Options.
If you do not find it, you could add Advanced Save Options in Tools->Customize->Commands->Add Command..->File.

TDLR: write 0xefbbbf (3-bytes UTF-8 BOM) in the beginning of the file before writing out your string.
One of the hints that text viewer software use to determine if the file should be shown in the Unicode format is something called the Byte Order Marker (or BOM for short). It is basically a series of bytes in the beginning of a stream of text that specifies the encoding and endianness of the text string. For UTF-8 it is these three bytes 0xEF 0xBB 0xBF.
You can experiment with this by opening notepad, writing a single character and saving file in the ANSI format. Then look at the size of file in bytes. It will be 1 byte. Now open the file and save it in UTF-8 and look at the size of file again. It will 4 bytes that is three bytes for the BOM and one byte for the single character you put in there. You can confirm this by viewing both files in some hex editor.
That being said, you may need to insert these bytes to your files before writing your string to them. So why UTF-8? you may ask, well, it depends on the encoding the original string is encoded in (your std::string log) which in this case it is an string literal written in a source file whose encoding is (most likely) UTF-8. Therefor the bytes that build up the string are made according to this encoding and are put into your executable.
note that std::string can contain Unicode string, it just can't make sense of it. For example it reports its length wrong. But it can be used to carry Unicode string around fine.

replace part of file name with wrong encoding

Need some guidance how to solve this one. Have 10 000s of files in multiple subfolders where the encoding got screwed up. Via ls command I see a filename named like this 'F'$'\366''ljesedel.pdf', that includes the ' at beginning and end. That's just one example where the Swedish characters åäö got wrong, in this example this should have been 'Följesedel.pdf'. If If I run
#>find .
Then I see a list of files like this:
./F?ljesedel.pdf
Not the same encoding. How on earth solving this one? The most obvious ways:
myvar='$'\366''
char="ö"
find . -name *$myvar* -exec rename 's/$myvar/ö' {} \;
and other possible ways fails since
find . -name cannot find it due to the ? instead of the "real" characters " '$'\366'' "
Any suggestions or guidance would be very much appreciated.

The first question is what encoding your terminal expects. Make sure that is UTF-8.
Then you need to find what bytes the actual filename contains, not just what something might display it as. You can do this with a perl oneliner like follows, run in the directory containing the file:
perl -E'opendir my $dh, "."; printf "%s: %vX\n", $_, $_ for grep { m/jesedel\.pdf/ } readdir $dh'
This will output the filename interpreted as UTF-8 bytes (if you've set your terminal to that) followed by the hex bytes it actually contains.
Using that you can determine what your search pattern should be. Your replacement must be the UTF-8 encoded representation of ö, which it will be by default as part of the command arguments if your terminal is set to that.

I'm not an expert - but it might not be a problem with the file name (which seems to hold the correct Unicode file name) - but with the way ls (and many other utilities) show the name to the terminal.
I was able to show the correct name by setting the terminal character encoding to Unicode. Also I've noticed the GUI programs (file manager, etc), were able to show the correct file name.
Gnome Terminal: "Terminal .. set character encoding - Unicode UTF8
It is still a challenge with many utilities to 'select' those files (e.g., REGEXP, wildcard). In few cases, you will have to select those character using '*' pattern. If this is a major issue considering using Ascii only - may be use the 'o' instead of 'ö'. Not sure if this is acceptable.

c++ visual studio code appears weird on windows command line?

opening on windows
opening on powershell
I had the problem of exporting my c++ files from visual studio to my school server/folder, where I would use powershell to open and run the files on the command line. The code is all spaced out and weird font when I open them on file, and it appears as strange characters when I open them on the command line. This causes the code to not run at all.
How do I fix this issue?
edit: I have added some pictures for better reference

This may be because the file is encoded UTF-8 but being read as ANSI or vice-versa (or some other mismatch of encodings). Try navigating to the files directly in powershell, i.e.
cd C:\Users\username\source\repos\projectname\projectname
if you are using the default path, and open a file with notepad then click 'Save as' and check the encoding (left of save button). The default indicates what encoding is being used, try changing to UTF-8 or ANSI - whichever the default is not. If that doesn't work you can also try UTF-16 and UTF-32 (which I believe are listed as Unicode and Unicode Big Endian in notepad, but I haven't verified that).
In visual studio, per this article, you can do this from the save dialog by going to File > Save As and in the Save As dialog you click the down arrow next to Save and select Save with encoding... The default appears to be code 1252, I would recommend trying UTF-8 first and see if that works.

What you have is an encoding problem. The first file starts with Unicode byte order mark ÿþ. That is, UTF-16 little endian. Because UTF-16 uses two bytes for each character and your characters are from ASCII subset, each other byte is 00 - which is rendered as extra spaces.
The second file is harder to dechipher, as Nano doesn't render the characters properly. I'd guess it has exactly the same problem - UTF-16.
It seems that some version of Visual Studio ninja-changed default file encoding as UTF-16.
As how to fix the situation, save the files in ASCII or UTF8 encoding on your Windows system, then upload those just like #Ghost adviced.

How to write Russian characters to file with cffile

I have a string I need to write to create an XML file. The string has Russian characters in it, which I can cfoutput to the page no problem, but when I write the file with cffile, those characters return with a ?. I tried changing the charset to the following with no success:
windows-1252
iso-8859-1
cp1251
cp866
I'm sure the charset is the problem here. Any suggestions?
Here is one of the strings in question: Другие
I'm running ColdFusion 10 on a Windows Server 2008 R2 System.

Untested, but I have had pageencoding problems in the past. Try <cfprocessingDirective pageencoding="utf-8"> Make sure you put it every template in the operation. Putting it in application.cfm or application.cfc may not be sufficient.

Set the charset = "utf-8" when writing to the file

c++ opening file with non-Latin name

I have such files. I just want to open files with non-Latin names correctly.
I have no problems with files that have Latin names only with non-Latin names.
I use QDir for scanning directory and I hold names in QString, so it's held fine inside.
But there is a bottleneck with opening the file.
It gets so that I don't want to use QFile, I can use only C++ streams (more preferred) or C files.
When I want to open file, I do so:
fstream stream(source.toStdString().c_str(),ios_base::in | ios_base::binary);
After that I check whether attempt was successful:
if(!stream.is_open())
{ cout<<"file wasn't opened " <<source.toStdString().c_str())<<"\n";
return false; // cout was redirected to file // just a notice
}
I get in my log file:
file wasn't opened /home/sh/.mozilla/firefox/004_??????? - ????? - ?????.mp3
It doesn't work for any file with non-Latin name and it does work fine for every file with Latin names.
I understand that this problem can be jumped over using QFile.
But I wonder, is it possible to get it done without third-party libraries or are there some another ways for solving it?
Thanks in advance for any tips.

Things are going wrong when you call toStdString() on your QString. It will convert the contents based on QTextCodec::codecForCStrings(), if it has been set, and latin-1 will be used otherwise. Latin-1 will collapse your non-latin characters to '?'s.
Using source.toLocal8Bit().data() or source.toUtf8().data() instead will likely do what you want, but failing that you'll need to deal with QTextCodecs to get the right 8-bit encoding.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

wrong text encoding on linux - c++

Related

C++: Problem of Korean alphabet encoding in text file write process with std::ofstream

replace part of file name with wrong encoding

c++ visual studio code appears weird on windows command line?

How to write Russian characters to file with cffile

c++ opening file with non-Latin name

Categories

Resources