I want to read a Unicode (UTF-8) file character by character, but I don't know how to read one character at a time from a file.
Can anyone tell me how to do that?
First, look at how UTF-8 encodes characters: http://en.wikipedia.org/wiki/UTF-8#Description
Each Unicode character is encoded as one or more UTF-8 bytes. After you read the next byte from the file, interpret it according to that table:
(Row 1) If the most significant bit is 0 ((byte & 0x80) == 0), you already have your character.
(Row 2) If the three most significant bits are 110 ((byte & 0xE0) == 0xC0), you have to read another byte. Bits 4, 3, 2 of the first UTF-8 byte (110YYYyy) become the first byte of the Unicode character (00000YYY), and its two least significant bits, combined with the 6 least significant bits of the next byte (10xxxxxx), become the second byte of the Unicode character (yyxxxxxx). You can do the bit arithmetic easily with C/C++ shifts and logical operators:
UnicodeByte1 = (UTF8Byte1 >> 2) & 0x07;
UnicodeByte2 = ( (UTF8Byte1 << 6) & 0xC0 ) | (UTF8Byte2 & 0x3F);
And so on...
Sounds a bit complicated, but it's not difficult once you know how to shift the bits into the right places to decode a UTF-8 string.
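For illustration, here is a rough sketch of how that table turns into a byte-by-byte decoding loop (the name read_code_point is mine, and the sketch does not reject overlong encodings or surrogates):

#include <cstdint>
#include <istream>

// Read one UTF-8 encoded character from the stream and return its code point,
// or -1 on end of file or malformed input.
long read_code_point(std::istream& in) {
    int first = in.get();
    if (!in) return -1;                                   // end of file (or read error)
    int extra;                                            // continuation bytes that follow
    uint32_t cp;                                          // payload bits collected so far
    if      ((first & 0x80) == 0x00) { extra = 0; cp = first;        } // 0xxxxxxx
    else if ((first & 0xE0) == 0xC0) { extra = 1; cp = first & 0x1F; } // 110xxxxx
    else if ((first & 0xF0) == 0xE0) { extra = 2; cp = first & 0x0F; } // 1110xxxx
    else if ((first & 0xF8) == 0xF0) { extra = 3; cp = first & 0x07; } // 11110xxx
    else return -1;                                       // continuation byte where a lead byte was expected
    for (int i = 0; i < extra; ++i) {
        int next = in.get();
        if (!in || (next & 0xC0) != 0x80) return -1;      // each following byte must be 10xxxxxx
        cp = (cp << 6) | (next & 0x3F);
    }
    return cp;
}

Calling it in a loop until it returns -1 gives you the file one character (code point) at a time.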
UTF-8 is ASCII compatible, so you can read a UTF-8 file like you would an ASCII file. The C++ way to read a whole file into a string is:
#include <iostream>
#include <string>
#include <fstream>
#include <iterator>
std::ifstream fs("my_file.txt");
std::string content((std::istreambuf_iterator<char>(fs)), std::istreambuf_iterator<char>());
The resulting string's characters correspond to the file's UTF-8 bytes. You can loop through it like so:
for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
    char nextChar = *i;
    // do stuff here.
}
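If, inside such a loop, you want to step per character (code point) rather than per byte, you can, for example, count only the bytes that are not continuation bytes; a small sketch, assuming content holds valid UTF-8:

std::size_t num_code_points = 0;
for (std::string::iterator i = content.begin(); i != content.end(); ++i) {
    unsigned char byte = static_cast<unsigned char>(*i);
    if ((byte & 0xC0) != 0x80)   // not a 10xxxxxx continuation byte, so a new character starts here
        ++num_code_points;
}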
Alternatively, you could open the file in binary mode, and then move through each byte that way:
std::ifstream fs("my_file.txt", std::ifstream::binary);
if (fs.is_open()) {
char nextChar;
while (fs.good()) {
fs >> nextChar;
// do stuff here.
}
}
If you want to do more complicated things, I suggest you take a look at Qt. I've found it rather useful for this sort of thing, and at least less painful than ICU for largely practical purposes.
QFile file("my_file.txt");
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    in.setCodec("UTF-8");
    QString contents = in.readAll();
    return;
}
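To then step through the text character by character you can index into the QString; keep in mind that QString stores UTF-16, so each element is a QChar code unit, and characters outside the BMP occupy two consecutive units (a sketch):

for (int i = 0; i < contents.size(); ++i) {
    QChar ch = contents.at(i);
    // do stuff here.
}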
In theory, stdlib.h has a function mblen which should return the length of a multibyte character. But in my case it returned -1 for the first byte of a multibyte character and kept returning -1 after that. So I wrote the following:
// Returns the length in bytes of the UTF-8 character starting at i_ch,
// or -1 if the byte is not a valid lead byte.
int utf8_char_len(const char* i_ch)
{
    if (i_ch == nullptr) return -1;
    int l = 0;
    unsigned char ch = static_cast<unsigned char>(*i_ch);
    unsigned char mask = 0x80;
    while (ch & mask) {              // count the leading 1 bits
        l++;
        mask >>= 1;
    }
    if (l == 0) return 1;            // plain ASCII byte
    if (l == 1 || l > 4) return -1;  // continuation byte or invalid lead byte
    return l;
}
It took less time than figuring out how mblen is supposed to be used.
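For what it's worth, here is how such a function can be used to walk a UTF-8 string one character at a time (a sketch, assuming the function above is named utf8_char_len and text is a std::string holding UTF-8 data):

const char* p = text.c_str();
while (*p != '\0') {
    int len = utf8_char_len(p);
    if (len < 0) break;                // malformed input
    std::string character(p, len);     // one complete UTF-8 character
    // do stuff with character here.
    p += len;
}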
Try this: get the file contents and then loop through the text based on its length.
Pseudocode:
String s = file.toString();
int len = s.length();
for (int i = 0; i < len; i++)
{
    String the_character = s[i];
    // TODO : Do your thing :o)
}
I am able to successfully read UTF-8 text files by redirecting input and output on the terminal and then using wcin and wcout:
_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);
Now I'd like to be able to read UTF-8 text using file streams, but I don't know how to set the mode of the file streams so that they can read these characters the way stdin and stdout did above. I've tried using wifstream/wofstream, and those still read and write garbage by themselves.
C++'s iostreams library doesn't have built-in support for converting one text encoding to another. If you need your input text converted from UTF-8 into another format (say, for example, the underlying code points of the encoding), you'll need to write that conversion manually.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::string data;
std::ifstream in("utf8.txt");
in.seekg(0, std::ios::end);
auto size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(size);
in.read(data.data(), size);
//data now contains the entire contents of the file
uint32_t partial_codepoint = 0;
unsigned num_of_bytes = 0;
std::vector<uint32_t> codepoints;
for(char c : data) {
    uint8_t byte = uint8_t(c);
    if(byte < 128) {
        //Character is just a basic ascii character, so we'll just set that as the codepoint value
        codepoints.push_back(byte);
        if(num_of_bytes > 0) {
            //Data was malformed: error handling?
            //Codepoint abruptly ended; at minimum, drop the partial state
            partial_codepoint = 0;
            num_of_bytes = 0;
        }
    } else {
        //Character is part of multi-byte encoding
        if(num_of_bytes > 0) {
            //We've already begun storing the codepoint
            if((byte >> 6) != 0b10) {
                //Data was malformed: error handling?
                //Codepoint abruptly ended
            }
            partial_codepoint = (partial_codepoint << 6) | (0b0011'1111 & byte);
            num_of_bytes--;
            if(num_of_bytes == 0) {
                codepoints.emplace_back(partial_codepoint);
                partial_codepoint = 0;
            }
        } else {
            //Beginning of new codepoint
            if((byte >> 6) == 0b10) {
                //Data was malformed: error handling?
                //Codepoint did not have proper beginning
            }
            while(byte & 0b1000'0000) {
                num_of_bytes++;
                byte = byte << 1;
            }
            partial_codepoint = byte >> num_of_bytes;
            num_of_bytes--;   //the count above included the lead byte itself; only continuation bytes remain
        }
    }
}
This code will reliably convert from [correctly-encoded] utf-8 to utf-32, which is usually the easiest form to convert directly into glyphs + characters—though remember that codepoints are not characters.
To keep things consistent in your code, my recommendation is that utf-8 encoded text be stored in your program using std::string, and utf-32 encoded text be stored as std::vector<uint32_t>.
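For example, once you have the codepoints vector from the loop above, dumping it in the usual U+XXXX notation is a one-liner (a sketch, using std::printf from <cstdio> for brevity):

for (uint32_t cp : codepoints)
    std::printf("U+%04X\n", static_cast<unsigned>(cp));   // e.g. U+03BA for κ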
The text is stored in a std::string.
If the text is 8-bit ASCII, then it is really easy:
text.pop_back();
But what if it is UTF-8 text?
As far as I know, there are no UTF-8 related functions in the standard library which I could use.
You really need a UTF-8 library if you are going to work with UTF-8. However, for this task I think something like this may suffice:
#include <iostream>
#include <string>

void pop_back_utf8(std::string& utf8)
{
    if (utf8.empty())
        return;
    auto cp = utf8.data() + utf8.size();
    // Walk backwards over continuation bytes (10xxxxxx) until the lead byte is reached.
    while (--cp >= utf8.data() && ((*cp & 0b10000000) && !(*cp & 0b01000000))) {}
    if (cp >= utf8.data())
        utf8.resize(cp - utf8.data());
}

int main()
{
    std::string s = "κόσμε";
    while (!s.empty())
    {
        std::cout << s << '\n';
        pop_back_utf8(s);
    }
}
Output:
κόσμε
κόσμ
κόσ
κό
κ
It relies on the fact that a UTF-8 encoded character consists of one start byte optionally followed by continuation bytes, and those continuation bytes can be detected with the bitwise tests shown above.
What you can do is pop off bytes until you reach the leading byte of a code point. The leading byte of a code point in UTF-8 is either of the pattern 0xxxxxxx or 11xxxxxx, and all non-leading bytes are of the form 10xxxxxx. This means you can check the first and second bits to determine whether you have a leading byte.
bool is_leading_utf8_byte(char c) {
    auto first_bit_set = (c & 0x80) != 0;
    auto second_bit_set = (c & 0x40) != 0;
    return !first_bit_set || second_bit_set;
}

void pop_utf8(std::string& x) {
    while (!is_leading_utf8_byte(x.back()))
        x.pop_back();
    x.pop_back();
}
This of course does no error checking and assumes that your string is valid utf-8.
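A quick usage sketch (since pop_utf8 assumes a non-empty string, guard the call):

std::string s = "κόσμε";
while (!s.empty()) {
    pop_utf8(s);
    std::cout << s << '\n';
}

This prints progressively shorter strings, one UTF-8 character removed per iteration.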
I'm trying to read data from a binary file using the following code:
fstream s;
s.open(L"E:\\test_bin.bin", ios::in | ios::binary);
int c = 0;
while (!s.eof())
{
    s >> c;
    cout << c;
}
c is always 0 (or whatever the current value of c is: if I set c to 1, the result is 1). The file exists and it contains data that is not all zeros, so the problem is not the file. I can read this file using fread and using s.get(), so why is the given code not working?
Using the ios::binary flag doesn't necessarily mean that you read and write binary data. Take a look at https://stackoverflow.com/a/2225612/2372604 . ios::binary means "data is read or written without translating..."
What you probably want to do is use s.read(...). In your case the stream operator attempts to read a complete integer as formatted text (something like "1234") rather than the raw bytes that fit into your integer.
For reading 4 bytes, something like the following might work (untested):
int n;
while (s.read((char*) &n, 4) && s.gcount() != 0 ) { /* process n here */ }
What's wrong with:
int c = 0;
char ch;
int shift = 32;
while ( shift != 0 && s.get( ch ) ) {   // test shift first so no extra byte is consumed
    shift -= 8;
    c |= (ch & 0xFF) << shift;
}
if ( shift != 0 ) {
    // Unexpected end of file...
}
This is the (more or less) standard way of reading binary 32-bit integers off the network. (This supposes that the native int is 32-bit 2's complement, of course.) Some protocols use different representations of 32-bit ints, and so will require different code.
As for your original code: the test s.eof() is always wrong, and >> is for inputting text; in particular, it will skip leading whitespace (and binary data may contain codes which correspond to whitespace).
I might also add that you should ensure that the stream is imbued with the "C" locale, so that no code translation occurs.
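For instance, the imbuing mentioned above might look like this (a sketch; imbue the stream before opening the file, since changing a file stream's locale after I/O has begun is not reliable):

std::ifstream s;
s.imbue(std::locale::classic());                        // the "C" locale: no code translation
s.open("E:\\test_bin.bin", std::ios::in | std::ios::binary);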
I have this bit of code below that I've written that uses utfcpp to convert from a UTF-16 encoded file to a UTF-8 string.
I think I must be using it improperly, because the result isn't changing. The utf8content variable comes out with null characters (\0) in every other position, exactly like the UTF-16 that I put into it.
//get file content
string utf8content;
std::ifstream ifs(path);
vector<unsigned short> utf16line((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
//convert
if(!utf8::is_valid(utf16line.begin(), utf16line.end())){
utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8content));
}
I found the location in the library that does the append; it treats everything in the first octet the same, whereas my thought is that it should handle 0s differently.
From checked.h, here is the append method (line 106). It is called by utf16to8 (line 202). Notice that I added the first branch of the if so that it skips the null characters, in an attempt to fix the problem.
template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result)
{
    if (!utf8::internal::is_code_point_valid(cp))
        throw invalid_code_point(cp);
    if (cp < 0x01)          //<===I added this line and..
        *(result++);        //<===I added this line
    else if (cp < 0x80)     // one octet
        *(result++) = static_cast<uint8_t>(cp);
    else if (cp < 0x800) {  // two octets
        *(result++) = static_cast<uint8_t>((cp >> 6) | 0xc0);
        *(result++) = static_cast<uint8_t>((cp & 0x3f) | 0x80);
    }
    else if (cp < 0x10000) { // three octets
        *(result++) = static_cast<uint8_t>((cp >> 12) | 0xe0);
        *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f) | 0x80);
        *(result++) = static_cast<uint8_t>((cp & 0x3f) | 0x80);
    }
    else {                  // four octets
        *(result++) = static_cast<uint8_t>((cp >> 18) | 0xf0);
        *(result++) = static_cast<uint8_t>(((cp >> 12) & 0x3f) | 0x80);
        *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f) | 0x80);
        *(result++) = static_cast<uint8_t>((cp & 0x3f) | 0x80);
    }
    return result;
}
I can't imagine that simply removing the null characters from the string is the real solution, however, and why wouldn't the library have handled this itself? So clearly I'm doing something wrong.
So, my question is: what is wrong with the way I'm using utfcpp in the first bit of code? Is there some type conversion that I've done wrong?
My content is a UTF-16 encoded XML file. It seems to truncate the results at the first null character.
std::ifstream reads the file in 8-bit char units. UTF-16 uses 16-bit units instead. So if you want to read the file and fill your vector with proper UTF-16 units, use std::wifstream instead (or std::basic_ifstream<char16_t> or equivalent, if wchar_t is not 16-bit on your platform).
And do not call utf8::is_valid() here. It expects UTF-8 input, but you have UTF-16 input instead.
If sizeof(wchar_t) is 2:
std::wifstream ifs(path);
std::istreambuf_iterator<wchar_t> ifs_begin(ifs), ifs_end;
std::wstring utf16content(ifs_begin, ifs_end);
std::string utf8content;
try {
    utf8::utf16to8(utf16content.begin(), utf16content.end(), std::back_inserter(utf8content));
}
catch (const utf8::invalid_utf16 &) {
    // bad UTF-16 data!
}
Otherwise:
// if char16_t is not available, use uint16_t or unsigned short instead
std::basic_ifstream<char16_t> ifs(path);
std::istreambuf_iterator<char16_t> ifs_begin(ifs), ifs_end;
std::basic_string<char16_t> utf16content(ifs_begin, ifs_end);
std::string utf8content;
try {
    utf8::utf16to8(utf16content.begin(), utf16content.end(), std::back_inserter(utf8content));
}
catch (const utf8::invalid_utf16 &) {
    // bad UTF-16 data!
}
The problem is where you're reading the file:
vector<unsigned short> utf16line((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
This line is taking a char iterator and using it to fill a vector one byte at a time. You're essentially casting each byte instead of reading two bytes at a time.
This is breaking each UTF-16 entity into two pieces, and for much of your input one of those two pieces will be a null byte.
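To make this concrete (hypothetical file contents, just for illustration): if the file is UTF-16LE and holds the two characters "AB", its bytes on disk are 41 00 42 00, so the line above builds the wrong vector:

std::vector<unsigned short> actual   = { 0x41, 0x00, 0x42, 0x00 }; // one element per byte
std::vector<unsigned short> expected = { 0x0041, 0x0042 };         // one element per UTF-16 unit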
My company use some code like this:
std::string(CT2CA(some_CString)).c_str()
which I believe converts a Unicode string (whose type is CString) into ANSI encoding; this string is used for an email's subject. However, the header of the email (which includes the subject) indicates that the mail client should decode it as Unicode (this is what the original code does). Thus, some German characters like "ä ö ü" will not be displayed properly in the subject.
Is there any way I can convert this header to UTF-8 and store it in a std::string or const char*?
I know there are a lot of smarter ways to do this, but I need to keep the code close to the original (i.e. send the header as std::string or const char*).
Thanks in advance.
Be careful: it's '|' and not '&'!
*buffer++ = 0xC0 | (c >> 6);
*buffer++ = 0x80 | (c & 0x3F);
This sounds like a plain conversion from one encoding to another: you can use std::codecvt<char, char, mbstate_t> for this. Whether your implementation ships with a suitable conversion, I don't know, however. From the sounds of it, you are just trying to convert ISO-Latin-1 into UTF-8. That should be pretty much trivial: the first 128 characters (0 to 127) map identically to UTF-8, and the second half conveniently maps to the corresponding Unicode code points, i.e., you just need to encode the corresponding value into UTF-8; each character in the upper half will be replaced by two bytes. That is, I think the conversion is something like this:
// Takes the next position and the end of a buffer as first two arguments and the
// character to convert from ISO-Latin-1 as third argument.
// Returns a pointer to end of the produced sequence.
char* iso_latin_1_to_utf8(char* buffer, char* end, unsigned char c) {
    if (c < 128) {
        if (buffer == end) { throw std::runtime_error("out of space"); }
        *buffer++ = c;
    }
    else {
        if (end - buffer < 2) { throw std::runtime_error("out of space"); }
        *buffer++ = 0xC0 | (c >> 6);
        *buffer++ = 0x80 | (c & 0x3f);
    }
    return buffer;
}
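If it helps, here is one way the function above might be wrapped to convert a whole ISO-Latin-1 string into a UTF-8 std::string (just a sketch; the helper name latin1_string_to_utf8 is mine):

#include <string>

std::string latin1_string_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);     // worst case: two bytes per input character
    for (unsigned char c : latin1) {
        char buffer[2];
        char* end = iso_latin_1_to_utf8(buffer, buffer + sizeof buffer, c);
        utf8.append(buffer, end);
    }
    return utf8;
}

The resulting std::string can then be handed to the existing code that expects the subject as std::string or const char*.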