I am trying to read the contents of a text file into Glib::ustrings. As I understand from the gtkmm-list at gnome.org, I should read a line into an std::string and only then do the conversion. It works when the text is ASCII, but fails when given anything else.
Example code is:
#include <glibmm/ustring.h>
#include <fstream>
#include <iostream>
int main()
{
    std::ifstream fin("testfile");
    while (fin)
    {
        Glib::ustring str;
        {
            std::string s;
            std::getline(fin, s);
            std::cout << s << std::endl;
            str.assign(s);
        }
        std::cout << str << std::endl;
    }
    return 0;
}
And the file contents (saved as UTF-8) are
hello
привет
The first line is printed twice, meaning the ustring gets constructed just fine. The second line also comes out fine as an std::string, but then Glib::ConvertError is thrown; the error's code is ILLEGAL_SEQUENCE.
I have double-checked, and the file is as follows:
00000000: 68 65 6c 6c 6f 0a d0 bf d1 80 d0 b8 d0 b2 d0 b5 hello...........
00000010: d1 82 0a -- -- -- -- -- -- -- -- -- -- -- -- -- ...-------------
It does appear to be UTF-8. For example, d1 82 is 11010001 10000010, which encodes the code point 10001000010 in binary, i.e. U+0442, Cyrillic 'т'. I have displayed every character of each std::string read in and have confirmed that the file was read correctly.
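To double-check that arithmetic, here is a tiny sketch (independent of Glib) that reassembles the code point from the two bytes by hand:

#include <cstdio>
int main()
{
    unsigned char b1 = 0xD1, b2 = 0x82;
    // Two-byte UTF-8 sequence: 110xxxxx 10xxxxxx -> 11 payload bits
    unsigned cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    printf("U+%04X\n", cp); // prints U+0442
    return 0;
}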
For me the answer actually was not in the reading, but in the output. After changing the output to
std::cout << str.raw() << std::endl;
everything worked like a charm. Printing a Glib::ustring with operator<< converts the UTF-8 text to the current locale's encoding first; in the default "C" locale that conversion fails for non-ASCII text, which is where the ConvertError came from. raw() returns the underlying std::string, so the UTF-8 bytes are written out unconverted.
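An alternative, if you want operator<< itself to work, is to switch the program out of the default "C" locale so the UTF-8-to-locale conversion can succeed. A minimal sketch, assuming your terminal's locale is itself UTF-8 (e.g. en_US.UTF-8):

#include <glibmm/ustring.h>
#include <clocale>
#include <iostream>
int main()
{
    // Adopt the user's environment locale so Glib converts UTF-8
    // to a charset that can actually represent the text:
    std::setlocale(LC_ALL, "");
    Glib::ustring str("привет");
    std::cout << str << std::endl; // no ConvertError if the locale charset matches
    return 0;
}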
So, here is some simple code to recreate my issue:
#include <cstdio>
const char* badString = u8"aš𒀀";
const char* anotherBadString = u8"\xef\x96\xae\xef\x96\xb6\x61\xc5\xa1\xf0\x92\x80\x80";
const char* goodString = "\xef\x96\xae\xef\x96\xb6\x61\xc5\xa1\xf0\x92\x80\x80";
void printHex(const char* str)
{
    for (; *str; ++str)
    {
        printf("%02X ", *str & 0xFF);
    }
    puts("");
}

int main(int argc, char *argv[])
{
    printHex(badString);
    printHex(anotherBadString);
    printHex(goodString);
    return 0;
}
I would expect all of these strings to print out the same result: EF 96 AE EF 96 B6 61 C5 A1 F0 92 80 80. However, in MSVC 2019, the first two strings print out C3 AF C2 96 C2 AE C3 AF C2 96 C2 B6 61 C3 85 C2 A1 C3 B0 C2 92 C2 80 C2 80, which looks like the text has been run through UTF-8 encoding an extra time.
I've read in other threads that a solution to this problem is to add the /utf-8 flag to the project, but I've tried that and it doesn't make any difference. Is there something more fundamental that I'm not understanding here?
Thanks a bunch!
The first character of the first string is ï (U+00EF, Latin Small Letter I With Diaeresis), whose UTF-8 encoding is C3 AF.
You apparently want the first string to begin with U+F5AE, but whatever editor you opened the source file in agrees with MSVC that it doesn't begin with that character.
The source file is probably encoded as UTF-8 with a BOM, and that's why the /utf-8 flag doesn't change anything. The string was corrupted at some point, and now its corrupted form is faithfully represented in the file, and MSVC is faithfully preserving it in the compiled code.
The second string begins with \xef, which MSVC is interpreting as equivalent to \u00ef, which is ï again. I can't find any clear statement in the C++20 draft standard regarding what \x is supposed to mean in UTF-8 strings (although I didn't look very hard). From experimentation, it appears that most compilers other than MSVC treat \x followed by hex digits as a literal byte, even if that makes the string not valid UTF-8. I think you shouldn't use \x in u8 prefixed strings because it isn't portable (except for \x00 through \x7f, probably). If you want U+F5AE then write \uf5ae.
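To illustrate, here is a sketch of the portable spelling (assuming C++17 or earlier, where u8 literals are still arrays of plain char; under C++20 they become char8_t and the const char* declaration stops compiling). Because only \u and \U escapes are used, the source file's encoding no longer matters:

#include <cstdio>
// Same code points as intended: U+F5AE U+F5B6 'a' U+0161 (š) U+12000 (𒀀)
const char* portableString = u8"\uF5AE\uF5B6a\u0161\U00012000";
int main()
{
    for (const char* p = portableString; *p; ++p)
        printf("%02X ", *p & 0xFF);
    puts(""); // prints: EF 96 AE EF 96 B6 61 C5 A1 F0 92 80 80
    return 0;
}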
So I am writing a program to turn a Chinese-English definition .txt file into a vocab trainer that runs through the CLI. However, on Windows, when I compile it in VS2017, the output turns into gibberish, and I'm not sure why. It was working OK in Linux, but Windows seems to mess it up quite a bit. Does this have something to do with the encoding table in Windows? Am I missing something? I wrote the code and the input file in Linux, but I also tried entering the characters with the Windows IME and got the same result. I think the picture speaks best for itself. Thanks
Note: Added sample of input/output as it appears in Windows, as requested. Also, input is UTF-8.
Sample of input
人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth
Sample of output
人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth
土(tǔ),earth
[Picture of input file & output]
TL;DR: The Windows terminal hates Unicode. You can work around it, but it's not pretty.
Your issues here are unrelated to "char versus wchar_t". In fact, there's nothing wrong with your program! The problems only arise when the text leaves through cout and arrives at the terminal.
You're probably used to thinking of a char as a "character"; this is a common (but understandable) misconception. In C/C++, the char type is usually synonymous with an 8-bit integer, and thus is more accurately described as a byte.
Your text file chineseVocab.txt is encoded as UTF-8. When you read this file via fstream, what you get is a string of UTF-8-encoded bytes.
There is no such thing as a "character" in I/O; you're always transmitting bytes in a particular encoding. In your example, you are reading UTF-8-encoded bytes from a file handle (fin).
Try running this, and you should see identical results on both platforms (Windows and Linux):
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
    {
        cout << "Number of bytes in the line: " << dec << line.length() << endl;
        cout << "    ";
        for (char c : line)
        {
            // Here we need to trick the compiler into displaying this "char" as an integer:
            unsigned int byte = (unsigned char)c;
            cout << hex << byte << " ";
        }
        cout << endl;
        cout << endl;
    }
    return 0;
}
Here's what I see in mine (Windows):
Number of bytes in the line: 16
e4 ba ba 28 72 c3 a9 6e 29 2c 70 65 72 73 6f 6e
Number of bytes in the line: 15
e5 88 80 28 64 c4 81 6f 29 2c 6b 6e 69 66 65
Number of bytes in the line: 14
e5 8a 9b 28 6c c3 ac 29 2c 70 6f 77 65 72
Number of bytes in the line: 27
e5 8f 88 28 79 c3 b2 75 29 2c 72 69 67 68 74 20 68 61 6e 64 3b 20 61 67 61 69 6e
Number of bytes in the line: 15
e5 8f a3 28 6b c7 92 75 29 2c 6d 6f 75 74 68
So far, so good.
The problem starts now: you want to write those same UTF-8-encoded bytes to another file handle (cout).
The cout file handle is connected to your CLI (the "terminal", the "console", the "shell", whatever you wanna call it). The CLI reads bytes from cout and decodes them into characters so they can be displayed.
Linux terminals are usually configured to use a UTF-8 decoder. Good news! Your bytes are UTF-8-encoded, so your Linux terminal's decoder matches the text file's encoding. That's why everything looks good in the terminal.
Windows terminals, on the other hand, are usually configured to use a system-dependent decoder (yours appears to be DOS codepage 437). Bad news! Your bytes are UTF-8-encoded, so your Windows terminal's decoder does not match the text file's encoding. That's why everything looks garbled in the terminal.
OK, so how do you solve this? Unfortunately, I couldn't find any portable way to do it... You will need to fork your program into a Linux version and a Windows version. In the Windows version:
Convert your UTF-8 bytes into UTF-16 code units.
Set standard output to UTF-16 mode.
Write to wcout instead of cout.
Tell your users to change their terminals to a font that supports Chinese characters.
Here's the code:
#include <fstream>
#include <iostream>
#include <string>

#include <windows.h>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

using namespace std;

// Based on this article:
// https://msdn.microsoft.com/magazine/mt763237?f=255&MSPPError=-2147217396
wstring utf16FromUtf8(const string& utf8)
{
    std::wstring utf16;

    // Empty input --> empty output
    if (utf8.length() == 0)
        return utf16;

    // Reject the string if its bytes do not constitute valid UTF-8
    constexpr DWORD kFlags = MB_ERR_INVALID_CHARS;

    // Compute how many 16-bit code units are needed to store this string:
    const int nCodeUnits = ::MultiByteToWideChar(
        CP_UTF8,       // Source string is in UTF-8
        kFlags,        // Conversion flags
        utf8.data(),   // Source UTF-8 string pointer
        utf8.length(), // Length of the source UTF-8 string, in bytes
        nullptr,       // Unused - no conversion done in this step
        0              // Request size of destination buffer, in wchar_ts
    );

    // Invalid UTF-8 detected? Return empty string:
    if (!nCodeUnits)
        return utf16;

    // Allocate space for the UTF-16 code units:
    utf16.resize(nCodeUnits);

    // Convert from UTF-8 to UTF-16
    int result = ::MultiByteToWideChar(
        CP_UTF8,       // Source string is in UTF-8
        kFlags,        // Conversion flags
        utf8.data(),   // Source UTF-8 string pointer
        utf8.length(), // Length of source UTF-8 string, in bytes
        &utf16[0],     // Pointer to destination buffer
        nCodeUnits     // Size of destination buffer, in code units
    );

    return utf16;
}

int main()
{
    // Based on this article:
    // https://blogs.msmvps.com/gdicanio/2017/08/22/printing-utf-8-text-to-the-windows-console/
    _setmode(_fileno(stdout), _O_U16TEXT);

    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
        wcout << utf16FromUtf8(line) << endl;

    return 0;
}
In my terminal, it mostly looks OK after I change the font to MS Gothic.
Some characters are still messed up, but that's due to the font not supporting them.
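For completeness, another workaround that often comes up is to leave the text as UTF-8 and instead tell the console to decode it as UTF-8 via SetConsoleOutputCP. This is a sketch, not a guarantee: on older Windows consoles the stream buffering can still split multi-byte sequences and garble the output, and you still need a font with Chinese glyphs.

#include <fstream>
#include <iostream>
#include <string>
#include <windows.h>
using namespace std;

int main()
{
    // Ask the console to interpret our output bytes as UTF-8 (code page 65001):
    SetConsoleOutputCP(CP_UTF8);

    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
        cout << line << endl; // bytes pass through unchanged; the console decodes them
    return 0;
}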
I'm trying to implement AES for a school project. My goal is to output the encrypted text to both the screen and a .txt file. The encryption goes totally as expected, and I can verify this by looking at this:
for (int j = 0; j < object.words * 4; j++)
{
    printf("%02x ", Encryptor.out[j]);
}
The text it is encrypting is "im so glad this works", with the 128-bit key 'dog', and this loop prints the first 16 bytes of the encryption, which read:
c8 88 45 0d 5d 40 ff 5b a4 55 91 c9 c4 00 f5 a4
I've verified that this is what AES should print in this context. Later, I have the following lines of output:
this is Encryptor.out[0] in cout: ╚
This is Encryptor.out[0] in printf with the format code '%02x': c8
Press any key to continue . . .
My cout call probably just needs a formatting code, so I'm not concerned about that. The complication is at this point:
ofstream OutFile("Encrypted.txt");
OutFile << Encryptor.out[0];
At this point, the only thing contained within Encrypted.txt is the single character 'È'. I know that hex c8 is 'È' in extended ASCII (Latin-1), but I want it to print the original hex value.
So ultimately, my question is, how do I get this character to be saved in my output file as 'c8'? Is there a formatting code that ofstream can use, or do I have to jump through some hoops?
Thanks guys!
As @stark commented, to print data in hex you can use std::hex, which modifies the way your data is formatted. However, std::hex only changes the way that numbers are printed, so you need to tell the compiler to treat your bytes as numbers. Fortunately there's an easy way to do this. You can use
ofstream OutFile("Encrypted.txt");
OutFile << std::hex << std::setfill('0'); // needs #include <iomanip> for setfill/setw
for (int j = 0; j < object.words * 4; j++)
{
    // Cast through unsigned char first so a byte like 0xC8 doesn't
    // sign-extend and print as ffffffc8; setw(2) zero-pads single-digit bytes.
    OutFile << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(Encryptor.out[j]))
            << ' ';
}
// Reset back to normal printing
OutFile << std::dec << std::setfill(' ');
and you will get the hex bytes (c8 88 45 ...) instead of accented characters like È.
Check out std::hex here http://en.cppreference.com/w/cpp/io/manip/hex
I have written a small C++ program to understand the use of \b. The program is given below -
#include <iostream>
using namespace std;

int main(){
    cout << "Hello World!" << "\b";
    return 0;
}
So, this program gives the desired output Hello World. This should not happen, because backspace only moves the cursor one space back and does not delete the character from the buffer. So why is the ! not printed?
Now, consider another program:
#include <iostream>
using namespace std;

int main(){
    cout << "Hello World!" << "\b";
    cout << "\nAnother Line\n";
    return 0;
}
So, here the output is -
Hello World!
Another Line
Why does the backspace not work here? A newline should not flush the buffer, so the ! should be deleted. What is the issue here?
Also, when I add either endl or \n after the \b, the output in both cases is Hello World!. But the newline character does not flush the buffer, whereas endl does. So how is the output the same in both cases?
I assume the output from your first program looks something like this?
$ ./hello
Hello World$
If so, the ! is not deleted from the buffer; it is clobbered when the shell prints the prompt.
With regard to the second program, when the buffer is flushed only influences when \b is sent to the terminal, not how it is processed. The \b is a part of the stream and a terminal happens to interpret this to mean "back up one column". If this is not clear, take a look at the actual bytes sent to stdout:
$ ./hello2 | hexdump -C
00000000 48 65 6c 6c 6f 20 57 6f 72 6c 64 21 08 0a 41 6e |Hello World!..An|
00000010 6f 74 68 65 72 20 4c 69 6e 65 0a |other Line.|
0000001b
The \b is followed by the \n (08 and 0a respectively), matching what you wrote to cout in your program.
Finally, cout is flushed when the program exits so it does not matter whether you pass \n or endl in this example. In fact, \n will likely flush anyway since stdout is connected to a terminal.
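To convince yourself that \b only moves the cursor, print another character right after it and watch it overwrite the ! (a tiny sketch; the exact rendering depends on your terminal):

#include <iostream>
using namespace std;

int main(){
    // The cursor backs up over '!' and '?' is written in its place:
    cout << "Hello World!" << "\b?" << "\n"; // displays: Hello World?
    return 0;
}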
I would like to ask if there is a method in C++ to turn a .txt file with hex digits, for example
0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21 54 68 69 73 20 70 72
into a new .txt file that looks like this:
\x0E\x1F\xBA\x0E\x00\xB4\x09\xCD\x21\xB8\x01\x4C\xCD\x21\x54\x68\x69\x73\x20\x70\x72
I searched for an answer on Google but found nothing, and the C++ program I tried does not work, failing with the error message "\x used with no following hex digits":
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;

int main()
{
    string hexaEnter;
    ifstream read;
    ofstream write;
    write.open("newhexa.txt", std::ios_base::app);
    read.open("hexa.txt");
    while (!read.eof())
    {
        read >> hexaEnter;
        write << "\x" + hexaEnter;
    }
    write.close();
    read.close();
    system("pause");
    return 1;
}
write << "\x" + hexaEnter;
// ^^
Here, C++ sees the beginning of a hex escape sequence, like \x0E or \x1F, but it can't find the actual hex values because you didn't provide any.
That's because what you intended to do was literally write the character \ and the character x, so escape the backslash to make that happen:
write << "\\x" + hexaEnter;
// ^^^
As an aside, your loop condition is wrong.
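Testing eof() before reading means the loop body runs one extra time after the last token (so the final word gets written twice) and a failed read goes unnoticed. A minimal sketch of the usual fix, using the extraction itself as the loop condition:

// The stream converts to false once extraction fails, so the body
// only runs for tokens that were actually read:
while (read >> hexaEnter)
{
    write << "\\x" + hexaEnter;
}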