Weird ASCII/Unicode character - C++

Peter Thiel's CS183 notes file has the name "Peter Thiel's CS183.pdf", or at least that is how it prints in Windows Explorer. However, while debugging my program, I noticed that the ' character isn't a plain apostrophe: it has an (unsigned char) value of 146, not the expected 39.
To test whether it was a bug in my program, I renamed the file, erasing the character and retyping an apostrophe. Sure enough, this time my program displayed the expected value. I reasoned that it must therefore be a Unicode character (since I don't see it in the ASCII table). However, it isn't a multibyte character, because the next byte in the string is an 's'.
Can someone help explain what's going on here?

Your mistake is believing this string is ASCII.
If you are on a Windows machine using the character encoding CP-1252 (see http://en.wikipedia.org/wiki/Windows-1252), then your "code" 146 is a kind of quotation mark (see the table on the Wikipedia page).

It is the right single quotation mark in the Windows code page CP-1252; byte 146 does not denote that character in ASCII, in ISO-8859-1, or in any form of Unicode.

It's a right single quotation mark instead of a plain single quote:
http://www.ascii-code.com/
As you said, 39 is the single quote, but the file must have been named using a right single quotation mark, decimal value 146 in the Windows Latin-1 extended characters (CP-1252).
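A quick way to see this for yourself is to dump the raw bytes of the name. A minimal sketch (`hex_bytes` is just an illustrative helper, not part of any library):

```cpp
#include <cstdio>
#include <string>

// Render each byte of a string as two hex digits. In the filename above,
// the apostrophe-looking character is the single byte 0x92 (146) in
// CP-1252, while a plain ASCII apostrophe would be 0x27 (39).
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X ", c);
        out += buf;
    }
    return out;
}
```

For example, `hex_bytes("Thiel\x92s")` ends with `92 73`: the CP-1252 right single quotation mark followed by the ASCII 's' that made the character look like it wasn't multibyte.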

Related

How do file names work between `char` and 2-byte characters?

I'm creating a wrapper for boost::filesystem for my application. I'm investigating what's going to happen if I have some non-ASCII characters in the file names.
On Windows, the documentation says that all characters are wchar_t. That's very understandable and coherent.
But on Linux, the documentation says that all characters are char — so 1-byte characters. I was wondering: will this even work and read non-ASCII characters? So I created a directory with the Arabic name تجريب (it's a 5-letter word) and read it with boost::filesystem. I printed it in the terminal, and it worked fine (apart from the terminal, Terminator, rendering it incorrectly left-to-right). The printed result was:
/mnt/hgfs/D/تجريب
Something doesn't add up here. How could this be a 1-byte char string and still print Arabic names? So I did the following:
std::for_each(path.string().begin(), path.string().end(), [](char c) {
    std::cout << c << std::endl;
});
Running this, where path is the directory I mentioned above, gave:
/
m
n
t
/
h
g
f
s
/
D
/
�
�
�
�
�
�
�
�
�
�
And at this point, I really, really got lost. The Arabic word is 10 bytes, yet it forms a 5-letter word.
Here comes my question: some of the characters are 1 byte, and some are 2 bytes. How does Linux know that those 2 bytes form a single 2-byte character? Does this mean that I never need a 2-byte character type on Linux for its file system, and char is good for all languages?
Could someone please explain how this works?
OK. The answer is that this is UTF-8 encoding, which is variable-length by design. Wikipedia answers my question "How does Linux know that those 2 bytes form a single 2-byte character?"
The answer is quoted from there:
Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as end of string.
So, there's no ambiguity when interpreting the letters.
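The scheme can be sketched in a few lines. In UTF-8, continuation bytes always have the bit pattern 10xxxxxx (0x80–0xBF), so every byte that is *not* a continuation byte starts a new code point — which is why the 10-byte Arabic string above unambiguously decodes to 5 characters (a sketch, not production code):

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting the bytes that start
// a sequence. Continuation bytes are always 10xxxxxx, so any byte whose
// top two bits are not "10" begins a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++n;
    return n;
}
```

Applied to the 10 bytes of تجريب, this returns 5; applied to a plain ASCII path component like "mnt", it returns 3, since ASCII bytes are never continuation bytes.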

JFlex String Regex Strange Behaviour

I am trying to write a JSON string parser in JFlex, so far I have
string = \"((\\(\"|\\|\/|b|f|n|r|t|u[0-9a-fA-F]{4})) | [^\"\\])*\"
which I thought captured the specs (http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf).
I have tested it on the control characters and standard characters and symbols, but for some reason it does not accept £ or ( or ) or ¬. Please can someone let me know what is causing this behaviour?
Perhaps you are running in JLex compatability mode? If so, please see the following from the official JFlex User's Manual. It seems that it will use 7bit character codes for input by default, whereas what you want is 16bit (unicode).
You can fix this by adding the line %unicode after the first %%.
Input Character sets
%7bit
Causes the generated scanner to use an 7 bit input character set (character codes 0-127). If an input character with a code greater than 127 is encountered in an input at runtime, the scanner will throw an ArrayIndexOutofBoundsException. Not only because of this, you should consider using the %unicode directive. See also Encodings for information about character encodings. This is the default in JLex compatibility mode.
%full
%8bit
Both options cause the generated scanner to use an 8 bit input character set (character codes 0-255). If an input character with a code greater than 255 is encountered in an input at runtime, the scanner will throw an ArrayIndexOutofBoundsException. Note that even if your platform uses only one byte per character, the Unicode value of a character may still be greater than 255. If you are scanning text files, you should consider using the %unicode directive. See also section Econdings for more information about character encodings.
%unicode
%16bit
Both options cause the generated scanner to use the full Unicode input character set, including supplementary code points: 0-0x10FFFF. %unicode does not mean that the scanner will read two bytes at a time. What is read and what constitutes a character depends on the runtime platform. See also section Encodings for more information about character encodings. This is the default unless the JLex compatibility mode is used (command line option --jlex).

ASCII character problem on Mac: can't print the black square (char(219))

When I try this code in C++
cout << char(219);
the output on my Mac is a question mark: ?
However, on a PC it gives me a black square.
Does anyone have any idea why on the Mac there are only 128 characters, when there should be 256?
Thanks for your help.
There's no such thing as ASCII character 219. ASCII only goes up to 127. Characters 128-255 are defined differently by different character encodings for different languages and different OSes.
MacRoman defines it as €.
IBM code page 437 (used at the Windows command prompt) defines it as █.
Windows code page 1252 (used in Windows GUI programs) defines it as Û.
UTF-8 defines it as a part of a 2-byte character. (Specifically, the lead byte of the characters U+06C0 to U+06FF.)
ASCII is really a 7-bit encoding. If you are printing char(219), you are using some other encoding: on Windows, most probably CP-1252. On Mac, I have no idea...
When a character is missing from an encoding, Windows shows a box (it's not character 219, which doesn't exist); Macs show a question mark in a diamond because a designer wanted it that way. But both mean the same thing: a missing/invalid character.
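If the goal is a black square that works on both platforms, one option (a sketch, assuming a UTF-8 terminal, which macOS uses by default) is to emit the UTF-8 encoding of U+2588 FULL BLOCK instead of the code-page-specific byte 219:

```cpp
#include <string>

// char(219) only means "black square" in IBM code page 437. The character
// itself is Unicode U+2588 FULL BLOCK; on a UTF-8 terminal you can print
// its three-byte UTF-8 encoding directly instead of a raw single byte.
std::string full_block_utf8() {
    return "\xE2\x96\x88";  // UTF-8 encoding of U+2588 FULL BLOCK
}

// Usage: std::cout << full_block_utf8() << '\n';
```

Since the bytes are spelled out explicitly, this does not depend on the compiler's execution character set; it only assumes the terminal interprets output as UTF-8.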

Why aren't my hyphens displaying correctly using std::cout?

I am trying to print out the following string using std::cout :
"Encryptor –pid1 0x34f –pid2"
but the '–' characters appear as u's with a circumflex above them (I'm not sure how to type this).
How do I print out the hyphen as intended?
That was not a hyphen.
It was an "en dash", which will render differently across consoles depending on encoding settings.
The hyphen key is usually on the number row of your keyboard, on Western layouts.
Make sure your terminal's idea of the character encoding matches that of your source code. How to do this, of course, depends on your operating system, which terminal emulator (assuming it's an emulator at all) you're using, and so on, neither of which you state.
Also, that's not a hyphen in your example; it's too long. It's probably an "em dash".
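One way to catch this kind of problem is to scan the string for non-ASCII bytes: a real hyphen-minus is the single ASCII byte 0x2D, while an en dash (U+2013) or em dash (U+2014) pasted from a word processor shows up as three UTF-8 bytes (E2 80 93 or E2 80 94). A minimal sketch (`non_ascii_offsets` is an illustrative helper):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Report the offsets of any non-ASCII bytes in a line of text.
// A plain '-' never appears here; a pasted en/em dash contributes
// three offsets, one per UTF-8 byte.
std::vector<std::size_t> non_ascii_offsets(const std::string& line) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < line.size(); ++i)
        if (static_cast<unsigned char>(line[i]) > 0x7F)
            offsets.push_back(i);
    return offsets;
}
```

An empty result means the string is pure ASCII and the dashes really are hyphens; any reported offsets point at the smart-punctuation bytes to retype.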

Unicode woes! MS Access 97 migration to MS Access 2007

The problem falls into two steps:
Problem step 1: the Access 97 db contains XML strings that are encoded in UTF-8.
So I created a patch tool to separately convert the XML strings from UTF-8 to Unicode. To convert a UTF-8 string to Unicode, I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName[0], Size); (where newName is declared as "newName: array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But when I work with Greek and Chinese characters, it chokes.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting new string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem step 2: Access 97 db text strings; the application GUI takes Unicode input, which is saved in Access 97.
First I checked with Arabic and Spanish characters, and it seemed then that no explicit character encoding was required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters, which are OK without conversion, get their Unicode characters either lost or converted to regular ASCII letters.
The Greek and Chinese characters show behaviour similar to that mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions. :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's code page. Every character that had no correspondence in that encoding was replaced with '?'. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable, you have to know the code page it is stored in. Using that, you can get the actual bytes and then convert them to Unicode.
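That recovery step — take the raw bytes exactly as stored, then reinterpret them as UTF-8 — can be sketched with a minimal decoder (no error handling for truncated or invalid sequences, so a sketch only, not a replacement for MultiByteToWideChar):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points. The lead byte's
// high bits give the sequence length; each continuation byte contributes
// its low 6 bits.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        unsigned char b = bytes[i];
        std::uint32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
        else               { cp = b & 0x07; len = 4; }  // 11110xxx
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (bytes[i + j] & 0x3F);     // 10xxxxxx
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For instance, the UTF-8 bytes CE 95 CF 85 CE B3 decode to the code points U+0395, U+03C5, U+03B3 ("Ευγ") — Greek text stored this way is recoverable, whereas bytes already flattened to '?' by the code-page conversion are not.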