Encoding problems in C++

I have this program that is supposed to load everything from a .txt file into a string and then display it. The problem I'm getting is that when I import the contents of the file they look different than if you view it in a simple text editor. This is what it looks like in a text editor:
bvwÅ.wÅ.Å}.ÅsqÄsÇ.sÑs|.]po{o.r}sÅ|Ç.y|}Ö.op}ÉÇ.wÇ
And this is what it looks like when it's imported and printed in my program:
bvw\201.w\201.\201}.\201sq\200s\202.s\204s|.]po{o.r}s\201|\202.y|}\205.op}\203\202.w\202
It seems like some characters are being encoded in a strange way, e.g. Swedish "å" is stored as "\201". I want all of the text that my program handles to be Unicode, so that I can convert characters back and forth between chars and ints. This is how I import the text file:
//Imports the entire file as a string
string toBeDecrypted;
string line;
while (getline(inputFile, line)) {
    if (!toBeDecrypted.empty())
        toBeDecrypted.append("\n");
    toBeDecrypted.append(line);
}
inputFile.close();
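For what it's worth, a common alternative (a minimal sketch; the function name is mine) is to read the whole file in one go, which sidesteps line-by-line reassembly entirely:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads an entire file into a string, byte for byte. Binary mode keeps
// the stream from performing any newline translation on the contents.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copy the whole stream buffer at once
    return buffer.str();
}
```

Note that this does not change the encoding: whatever bytes are in the file end up in the string unmodified.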
My program also writes to files, so I want it to write in Unicode too.
EDIT
I solved the problem by changing the way the input file is created; it no longer contains any extended-ASCII characters.

Related

C++: Problem of Korean alphabet encoding in text file write process with std::ofstream

I have code that saves a log as a text file.
It usually works well, but I found a case where it doesn't work:
{Id": "testman", "ip": "192.168.1.1", "target": "?뚯뒪??exe", "desc": "?덈뀞諛⑷??뚯슂"}
My code is simple logic that saves a log string to a text file.
It works well when the log is in English, but there is a problem when the log is in Korean.
After various experiments, I confirmed that the Korean text would be fine if the file could be saved in UTF-8 format.
I think that if the log string contains Korean, C++ saves it in ANSI format by default.
This is my c++ code:
string logFilePath = {path};
string log = "{\"Id\": \"testman\", \"ip\": \"192.168.1.1\", \"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";
ofstream output(logFilePath, ios::app);
output << log << endl;
output.close();
Is there a way to save log files as UTF-8, or is there another good approach?
Please give me some advice.
You could set UTF-8 via File -> Advanced Save Options (Visual Studio's dialog for choosing the source file's encoding).
If you don't find it, you can add Advanced Save Options via Tools -> Customize -> Commands -> Add Command... -> File.
TL;DR: write 0xEF 0xBB 0xBF (the 3-byte UTF-8 BOM) at the beginning of the file before writing out your string.
One of the hints that text viewer software uses to determine whether a file should be interpreted as Unicode is something called the byte order mark (or BOM for short). It is a short sequence of bytes at the beginning of a stream of text that identifies the encoding and endianness of the text. For UTF-8 it is these three bytes: 0xEF 0xBB 0xBF.
You can experiment with this by opening Notepad, typing a single character, and saving the file in the ANSI format. Look at the size of the file in bytes: it will be 1 byte. Now save the same file as UTF-8 and look at the size again: it will be 4 bytes, that is, three bytes for the BOM and one byte for the single character you put in there. You can confirm this by viewing both files in a hex editor.
That being said, you may need to insert these bytes into your files before writing your string to them. So why UTF-8, you may ask? Well, it depends on the encoding of the original string (your std::string log), which in this case is a string literal written in a source file whose encoding is (most likely) UTF-8. The bytes that make up the string therefore follow that encoding and are put into your executable as-is.
Note that a std::string can carry a UTF-8 string just fine; it simply can't make sense of it. For example, it reports its length in bytes rather than characters. But it works well enough as a container for shuttling UTF-8 data around.
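As a sketch of this approach (the function name and file handling are mine, not the asker's code): open the stream in binary mode and emit the three BOM bytes before the payload, which is assumed to already be UTF-8:

```cpp
#include <fstream>
#include <string>

// Writes a UTF-8 BOM followed by the given (already UTF-8 encoded) text.
// Binary mode ensures the bytes go out exactly as written.
void write_utf8_file(const std::string& path, const std::string& utf8_text) {
    std::ofstream output(path, std::ios::binary);
    output << "\xEF\xBB\xBF";   // the 3-byte UTF-8 BOM
    output << utf8_text;
}
```

With this, editors like Notepad should detect the file as UTF-8 from the BOM alone.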

Formatting issue with getline()

I'm trying to read some data out of a text file. One of the data names is Chamber temperature [°C]. I read the file with: getline(myfile, tab, '\t');
The problem is that the degree sign is formatted into "Chamber temperature [�C]".
How can I prevent C++ from mangling the degree sign?
P.S.: In the text file the sign is displayed correctly.
Code:
//just create a .txt file on your desktop which only stores "Chamber Temperature [°C]"
ifstream myfile;
myfile.open("C:\\Users\\user\\Desktop\\test.txt");
string tab = "";
getline(myfile, tab, '\t');
cout << tab << endl;
With the same settings as I describe below you should see the same problem; well, it is not really a problem, just an encoding difference: UTF-8 just can't interpret the characters the same way ANSI does.
There are solutions where I could search for the substring and then replace it as I wish, but I would like a foolproof and safe way to use this code in any case. So I'm looking for a conversion between these two encodings.
Additional Information about my environment:
I use Eclipse with a MinGW compiler and the C++11 standard. The default text file encoding is UTF-8 and the new-text-file delimiter is UNIX.
I opened the file in Notepad++ and it estimates the file format to be "ANSI".
I use a simple ifstream to read the data into a 3D vector (first dimension: file; second dimension: row data; third dimension:columns). I use the getline to read each sequence delimited by a tab into a variable ... and in the end into my matrix.
Now, after I have stored the data in my matrix, I do some data searching, and here comes my problem: because the file is encoded in ANSI, I can't compare the string Chamber Temperature [°C] with the stored data, since it will never find a match.
I need to convert the text file into UTF-8 format and then store it in my 3D matrix. Is this possible? I'm new to coding, so could you please provide some example code or pseudocode?
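If the file really is in a single-byte ANSI code page, a small conversion is feasible by hand. Below is a minimal sketch assuming plain Latin-1 (ISO-8859-1), where every byte maps directly to the Unicode code point of the same value; Windows-1252 differs in the 0x80-0x9F range, so a full solution would use a proper conversion library:

```cpp
#include <string>

// Converts a Latin-1 (ISO-8859-1) byte string to UTF-8. Bytes below 0x80
// pass through; bytes 0x80-0xFF become two-byte UTF-8 sequences.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // leading byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}
```

Applied to the ANSI bytes of "Chamber temperature [°C]" (where '°' is byte 0xB0), this yields a UTF-8 string that compares equal to a UTF-8 literal of the same text.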

C++ on Windows: I can't put the Enter character into a .txt file

I made a program which uses Huffman coding to compress and decompress .txt files (ANSI, Unicode, UTF-8, Big Endian Unicode...).
In the decompression I take characters from a binary tree and put them into a .txt file in binary mode:
ofstream F;
F.open("example.txt", ios::binary);
I have to write to the .txt file in binary mode because I need to decompress every type of .txt file (not only ANSI), so my symbols are the individual bytes.
On Windows it writes every symbol but doesn't handle the Enter character!
For example, if I have this example.txt file:
Hello
World!
=)
I compress it into example.dat file and I save the Huffman tree into another file (exampletree.dat).
Now, to decompress example.dat, I take characters from the tree saved in exampletree.dat and put them into a new .txt file through put() or fwrite(), but on Windows the result looks like this:
HelloWorld!=)
On Ubuntu it works perfectly and also saves the Enter character!
It isn't a code error, because if I print the decompressed .txt file to the console, it prints the Enter characters too! So there is a problem on Windows. Could someone help me?
Did you try opening the file in WordPad or another advanced text editor (e.g. Notepad++) that recognizes a bare LF as a newline character? The default editor, Notepad, would put it all on a single line like you described.
This may not be the solution you are looking for, but the problem appears to be due to having LF as the line break instead of the Windows default CR/LF.
It looks like it is the difference in end-of-line handling on Linux vs. Windows. The EOL can be just "\n" or "\r\n"; i.e. Windows usually puts 0x0D, 0x0A at the end of lines.
On Windows there's a difference between:
fopen( "filename", "w" );   // text mode (the default on Windows)
fopen( "filename", "wb" );  // binary mode
quote:
In text mode, carriage return–linefeed combinations are translated into single linefeeds on input, and linefeed characters are translated to carriage return–linefeed combinations on output
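A minimal sketch of that difference using iostreams (file names are illustrative):

```cpp
#include <fstream>
#include <string>

// Writes the same text in binary mode and in text mode. In binary mode the
// bytes are written verbatim; in text mode, Windows expands each '\n' to
// "\r\n" on output (on Linux both files come out identical).
void demo_newline_modes() {
    std::ofstream bin("out_binary.txt", std::ios::binary);
    bin << "Hello\nWorld!\n=)\n";
    bin.close();

    std::ofstream txt("out_text.txt");   // default: text mode
    txt << "Hello\nWorld!\n=)\n";
    txt.close();
}
```

So for the Huffman decompressor the options are either to keep writing in binary mode and view the output in an editor that accepts bare LF, or to write "\r\n" explicitly when targeting Windows.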

C++ doesn't save a UTF-8 encoded file

I'm working on software that needs to save a UTF-8 file with special characters (like 'çäüëé').
I receive the content to save (a regular string with the special characters encoded) from a web service (with gSOAP). When I try to save it using ofstream, the file shows squares and other strange characters instead of the special characters.
When I try to convert the regular string to a wide string, the special characters are lost (they are replaced by different ones). And, using wofstream, the file is not written at all when there are special characters.
I also tried utf8-cpp, but the file wasn't written correctly either.

Read Chinese Characters in Dicom Files

I have just started to get a feel for the DICOM standard. I am trying to write a small program that reads a DICOM file and dumps the information to a text file. I have a dataset that has the patient names in Chinese. How can I read and store these names?
Currently, I am reading the names as char* from the DICOM file, converting the char* to wchar_t* using code page 950 for Chinese, and writing the result to a text file. Instead of seeing Chinese characters I see *, ? and % in my text file. What am I missing?
I am working in C++ on Windows.
If the text file contains UTF-16, have you included a BOM?
There may be multiple issues at hand.
First, do you know the character encoding of the Chinese name, e.g. Big5 or GB*? See http://en.wikipedia.org/wiki/Chinese_character_encoding
Second, do you know the encoding of your output text file? If it is ASCII, then you probably won't ever be able to view the Chinese characters. In that case, I would suggest changing it to Unicode (e.g. UTF-8).
Then, when you read the Chinese name, convert the raw bytes and write out the result. For example, if the DICOM file stores it in Big5 and your text file is UTF-8, you will need a Big5-to-UTF-8 converter.
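On Windows the usual route for that conversion is MultiByteToWideChar with code page 950 followed by WideCharToMultiByte with CP_UTF8. As a portable sketch of the same Big5-to-UTF-8 step using POSIX iconv (error handling reduced to returning an empty string; the function name is mine):

```cpp
#include <iconv.h>
#include <string>

// Converts a Big5-encoded byte string to UTF-8 using POSIX iconv.
// Returns an empty string if the converter is unavailable or the input is invalid.
std::string big5_to_utf8(std::string in) {        // by value: iconv wants a mutable buffer
    iconv_t cd = iconv_open("UTF-8", "BIG5");
    if (cd == (iconv_t)-1) return "";

    std::string out(in.size() * 4 + 4, '\0');     // UTF-8 needs at most 4 bytes per character
    char* src = &in[0];
    size_t src_left = in.size();
    char* dst = &out[0];
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return "";

    out.resize(out.size() - dst_left);            // trim the unused tail
    return out;
}
```

The same shape works for any source encoding iconv knows about; only the "BIG5" argument changes.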