How do file names work between `char` and 2-byte characters? (C++)

I'm creating a wrapper for boost::filesystem for my application. I'm investigating what's going to happen if I have some non-ASCII characters in the file names.
On Windows, the documentation says that all characters are wchar_t. That's very understandable and coherent.
But on Linux, the documentation says that all characters are char! So 1-byte characters. I was wondering whether this can even work and read non-ASCII characters. So I created a directory with the Arabic name تجريب (a 5-letter word) and read it with boost::filesystem. I printed the path in the terminal, and it worked fine (apart from the terminal, Terminator, rendering it incorrectly left-to-right). The printed result on the terminal was:
/mnt/hgfs/D/تجريب
Something doesn't add up there. How could this be 1-byte char string, and still print Arabic names? So I did the following:
std::for_each(path.string().begin(), path.string().end(), [](char c) {
    std::cout << c << std::endl;
});
Running this, where path is the directory mentioned above, gave:
/
m
n
t
/
h
g
f
s
/
D
/
�
�
�
�
�
�
�
�
�
�
And at this point, I really, really got lost. The Arabic word is 10 bytes, which encodes a 5-letter word.
Here comes my question: some of the characters are 1 byte, and some are 2 bytes. How does Linux know that those 2 bytes form a single 2-byte character? Does this mean that I never need a 2-byte character type on Linux for its file system, and that char is good for all languages?
Could someone please explain how this works?

OK. The answer is that this is UTF-8 encoding, which is variable-length by design. Wikipedia answers my question "How does Linux know that those 2 bytes form a single 2-byte character?"
The answer is quoted from there:
Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as end of string.
So, there's no ambiguity when interpreting the letters.
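To make this concrete, here is a minimal sketch (assuming the source file and the execution character set are UTF-8, as is typical on Linux) that walks the bytes of the path from above and counts code points by inspecting each lead byte:

#include <cstddef>
#include <iostream>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte b.
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;             // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx: 2-byte sequence (Arabic letters land here)
    if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx: 3-byte sequence
    if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx: 4-byte sequence
    return 1;                           // continuation/invalid byte: treat as one byte
}

int main() {
    const std::string path = "/mnt/hgfs/D/تجريب";  // same directory as above, stored as UTF-8 bytes
    std::size_t code_points = 0;
    for (std::size_t i = 0; i < path.size();
         i += utf8_sequence_length(static_cast<unsigned char>(path[i])))
        ++code_points;
    std::cout << path.size() << " bytes, " << code_points << " code points\n";
    // Prints: 22 bytes, 17 code points (12 ASCII bytes + 5 Arabic letters * 2 bytes each)
}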

Related

How to find whether a byte read is Japanese or English?

I have an array which contains Japanese and ASCII characters.
I am trying to find whether a character read is an English character or a Japanese character.
In order to solve this, I proceeded as follows:
read the first byte; if the multibyte character width is not equal to one, move the pointer to the next byte
now display the whole two bytes together and indicate that a Japanese character has been read.
if the multibyte character width is equal to one, display the byte and show a message that English has been read.
The above algorithm works fine but fails for the half-width forms of Japanese, e.g. シ, ァ etc., as they are only one byte.
How can I find out whether the characters are Japanese or English?
**Note:** What I tried
I read on the web that the first byte will tell whether it is Japanese or not, which I have covered in step 1 of my algorithm. But it won't work for half-width characters.
EDIT:
In the problem I was solving, I include the control character 0x80 at the start and end of my characters to delimit the string of characters.
I wrote the following to identify the ending control character.
cntlchar.....(my characters, can be Japanese).....cntlchar
if ((buf[*p+1] & 0X80) && (mbMBCS_charWidth(&buf[*p]) == 1))
    // end of control characters reached
else
    // *p++
It worked fine for English but didn't work for Japanese half-width characters.
How can I handle this?
Your data must be using Windows code page 932. That is a guess, but examining the code points matches what you are describing.
The codepage shows that characters in the range 00 to 7F are "English" (a better description is "7-bit ASCII"), the characters in the ranges 81 to 9F and E0 to FF are the first byte of a multibyte code, and everything between A1 and DF are half-width Kana characters.
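Based on those ranges, a first-byte classifier might look like the sketch below. It only tells you what kind of byte you are looking at under the CP932 assumption, not which language the text is in; the enum and function names are made up for illustration.

// Rough classification of the first byte of a CP932 (Shift-JIS) encoded character,
// following the ranges described above.
enum class Cp932ByteKind { Ascii, LeadByte, HalfWidthKana, Other };

Cp932ByteKind classify_cp932_first_byte(unsigned char b) {
    if (b <= 0x7F)                                return Cp932ByteKind::Ascii;         // 7-bit ASCII ("English")
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0))  return Cp932ByteKind::LeadByte;      // first byte of a 2-byte code
    if (b >= 0xA1 && b <= 0xDF)                   return Cp932ByteKind::HalfWidthKana; // single-byte half-width kana
    return Cp932ByteKind::Other;                                                       // 0x80, 0xA0
}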
For individual bytes this is impractical to impossible. For larger sets of data you could do statistical analysis on the bytes and see if it matches known English or Japanese patterns. For example, vowels are very common in English text, while Japanese text has its own characteristic frequency patterns.
Things get more complicated than testing bits if your data includes accented characters.
If you're dealing with Shift-JIS data and Windows-1252 encoded text, ideally you just remap it to UTF-8. There's no standard way to identify text encoding within a text file, although things like MIME can help if added on externally as metadata.

Qt QString from string - Strange letters

Whenever I try to convert a std::string into a QString with this letter in it ('ß'), the QString will turn into something like "Ã" or some other really strange letters. What's wrong? I used this code and it didn't cause any errors or warnings!
std::string content = "Heißes Teil.";
ui->txtFind_lang->setText(QString::fromStdString(content));
The std::string has no problem with this character. I even wrote it into a text file without problems. So what am I doing wrong?
You need to set the codec to UTF-8:
QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
By default, Qt uses the Latin-1 encoding, which is limited. By adding this code, you set the default encoding to UTF-8, which allows you to use many more characters.
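Note that setCodecForTr and setCodecForCStrings exist only in Qt 4 (they were removed in Qt 5). If you would rather not change process-wide defaults, a sketch of the explicit route, assuming the std::string really holds UTF-8 bytes:

std::string content = "Heißes Teil.";                           // UTF-8 bytes in the std::string
ui->txtFind_lang->setText(QString::fromUtf8(content.c_str()));  // decode explicitly as UTF-8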
Though antoyo's answer works, I wasn't too sure why. So, I decided to investigate.
All of my documents are encoded in UTF-8, as are most web pages. The ß character has the Unicode code point U+00DF.
Since UTF-8 is a variable-length encoding, in binary form ß is encoded as 11000011 10011111, or C3 9F. Because Qt relies on the Latin-1 encoding by default, it reads ß as two different characters: the first one, C3, maps to Ã, and the second one, 9F, does not map to anything, since Latin-1 does not recognize bytes between 128 and 159 (in decimal).
That's why ß appears as Ã when using the Latin-1 encoding.
Side note: You might want to brush up on how UTF-8 encoding works, because otherwise it seems a little unintuitive that ß takes two bytes even though its code point DF is less than FF and should consume just one byte.
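For reference, here is a tiny sketch of the 2-byte branch of a UTF-8 encoder, showing how U+00DF turns into C3 9F:

#include <cstdio>

// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
void encode_utf8_2byte(unsigned int cp, unsigned char out[2]) {
    out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));    // 110xxxxx carries the top 5 bits
    out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));  // 10xxxxxx carries the low 6 bits
}

int main() {
    unsigned char bytes[2];
    encode_utf8_2byte(0x00DF, bytes);                 // 'ß'
    std::printf("%02X %02X\n", bytes[0], bytes[1]);   // prints: C3 9F
}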

Weird ASCII/Unicode Character

Peter Thiel's CS183 Notes has a filename with the ASCII string "Peter Thiel's CS183.pdf", or at least that is how it prints out in Windows Explorer. However, while debugging my program, I noticed that the ' character isn't a plain apostrophe; it has an (unsigned char) value of 146, not the expected 39.
To test whether it was a bug in my program, I renamed the file, erased the character, and retyped an apostrophe. Sure enough, this time my program displayed the correct value. I therefore reasoned that it must be a Unicode character (since I don't see it in the ASCII table). However, it isn't a multibyte character, because the next byte in the string is an 's'.
Can someone help explain whats going on here?
Your mistake is believing this string is ASCII.
If you are using a Windows machine with the character encoding CP-1252 (see http://en.wikipedia.org/wiki/Windows-1252), then your "code" 146 is a kind of quote (see the table on the Wikipedia page).
It is the right single quotation mark in the Windows code page CP-1252; it does not exist at that value in ASCII (or ISO-8859-1) or any form of Unicode.
It's a Right single quotation mark instead of a Single quote:
http://www.ascii-code.com/
Like you said, 39 is a Single quote, but the file must have been named using a Right single quotation mark, decimal value 146 in the Windows Latin-1 extended characters, CP-1252.
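If you need to turn such a byte into its proper Unicode character programmatically on Windows, one option is to let the OS do the mapping. A minimal sketch using the Win32 MultiByteToWideChar call (assuming the bytes really are CP-1252):

#include <windows.h>
#include <cstdio>

int main() {
    const char narrow[] = "Peter Thiel\x92s CS183.pdf";   // 0x92 = 146, the CP-1252 right single quote
    wchar_t wide[64] = {};
    // Convert from code page 1252 to UTF-16; returns the number of wchar_t written, including the null.
    if (MultiByteToWideChar(1252, 0, narrow, -1, wide, 64) > 0)
        std::printf("U+%04X\n", static_cast<unsigned>(wide[11]));  // prints: U+2019
}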

ASCII character problem on Mac: can't print the black square (which is char(219))

When I try to run this code in C++:
cout << char(219);
the output on my Mac is a question mark: ?
However, on a PC it gives me a black square.
Does anyone have any idea why on the Mac there are only 128 characters, when there should be 256?
Thanks for your help.
There's no such thing as ASCII character 219. ASCII only goes up to 127. chars 128-255 are defined in different ways in different character encodings for different languages and different OSs.
MacRoman defines it as €.
IBM code page 437 (used at the Windows command prompt) defines it as █.
Windows code page 1252 (used in Windows GUI programs) defines it as Û.
UTF-8 defines it as a part of a 2-byte character. (Specifically, the lead byte of the characters U+06C0 to U+06FF.)
ASCII is really a 7-bit encoding. If you are printing char(219) that is using some other encoding: on Windows most probably CP 1252. On Mac, I have no idea...
When a character is missing from an encoding set, Windows shows a box (it's not character 219, which doesn't exist), while Macs show a question mark in a diamond symbol because a designer wanted it that way. But they both mean the same thing: a missing/invalid character.
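If the goal is just to print the full block character, a more portable route is to print its UTF-8 bytes and rely on the terminal being set to UTF-8 (the default on macOS and most Linux terminals; on Windows the console code page may need switching first). A minimal sketch:

#include <iostream>

int main() {
    std::cout << "\xE2\x96\x88\n";  // UTF-8 encoding of U+2588 FULL BLOCK (█)
}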

Unicode Woes! Ms-Access 97 migration to Ms-Access 2007

Problem is categorized in two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool to separately convert the XML strings from UTF-8 to Unicode. In order to convert a UTF-8 string to Unicode, I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, #newName, Size); (where newName is an array declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But with Greek and Chinese characters it is choking.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting new string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db text strings; the application GUI takes Unicode input, which is saved in Access 97.
First I checked with Arabic and Spanish characters; it seemed then that no explicit character encoding is required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct???); the result was again disappointing. The Spanish characters, which are OK without conversion, get their Unicode characters either lost or converted to regular ASCII letters.
The Greek and Chinese characters show similar behaviour to that mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around???
Well Right now I am confused and full of Questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's code page. Every character that had no correspondence in that encoding was replaced with ?. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable, you have to know the code page it is stored in. Using that, you can get the actual bytes and then convert them to Unicode.
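As a sketch of that last step in C++, using the same Win32 API from the question (the code page value 1253 for Greek is only an example; substitute whatever code page the data was actually stored in, and the variable name is hypothetical):

#include <windows.h>
#include <string>

// Convert a narrow string stored in the given Windows code page to UTF-16.
std::wstring to_utf16(const std::string& bytes, UINT codePage) {
    if (bytes.empty()) return std::wstring();
    // First call asks how many wchar_t are needed, second call performs the conversion.
    int len = MultiByteToWideChar(codePage, 0, bytes.data(),
                                  static_cast<int>(bytes.size()), nullptr, 0);
    std::wstring result(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(codePage, 0, bytes.data(),
                        static_cast<int>(bytes.size()), &result[0], len);
    return result;
}

// Usage: std::wstring text = to_utf16(rawBytesFromAccess, 1253);  // 1253 = Windows Greek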