I'm writing a text renderer in C++: I have a string containing a phrase, and for each char value I display the corresponding glyph from a bitmap font.
For now it works for the regular ASCII characters, but I'm getting weird values for accented characters such as À, Á, Â, Ã, etc.
I'm doing this:
int charToPrint = 'a';
// use this value to figure out which bitmap glyph to display
The bitmap font does include these characters, but on this line I'm not getting the values I expect, such as 195 for Ã, 199 for Ç, etc.
I tried changing my project's character set from Multi-Byte to Unicode, but I don't think that affects the char-to-int conversion...
How can I get this conversion with chars?
Edit: I'm using Visual Studio 2012, Windows 7, and it's an OpenGL application with a bitmap font.
I've mapped the position/width/height of each character according to its char value, so the character 'a' is at position 97 in my bitmap font (with width accounted for).
To draw, I just need to figure out the position based on the char code.
I have a string containing the phrase I want to display; I loop through each character, work out its charCode, and call my draw function.
For the characters with accents, I'm getting negative values, so my draw function does nothing (there's no position -30 for Ç, for example).
I need to figure out how to get these values properly and pass them to the draw function.
Use Unicode, it is 2013 already :) See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Use wchar_t as the type, with UTF-16 / UTF-32 encoding. That will make your code support not only these "irregular" characters but many more "irregular" characters :) (there is no such thing as regular characters).
Example
setlocale(LC_ALL, "");  // from <clocale>; needed so %lc prints wide characters
wchar_t c = L'Á';
printf("char: %lc encoding: %d\n", c, c);
c = 0xc1;  // 0xC1 == 193, the Unicode code point U+00C1 for Á
printf("char: %lc encoding: %d\n", c, c);
Output
char: Á encoding: 193
char: Á encoding: 193
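Applied to the bitmap-font loop from the question, a minimal sketch (drawGlyph is hypothetical, standing in for the draw function mentioned in the question):

#include <string>

void drawGlyph(int charCode);  // assumed: looks up the glyph position in the bitmap font

void drawText(const std::wstring &phrase) {
    for (wchar_t ch : phrase) {
        int charCode = static_cast<int>(ch);  // Á -> 193, Ã -> 195, Ç -> 199
        drawGlyph(charCode);                  // never negative, unlike a signed char
    }
}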
Related
In my C++ textbook, there is an "ASCII Table of Printable Characters."
I noticed a few odd things that I would appreciate some clarification on:
Why do the values start at 32? I tested a simple program with the lines char ch = 1; std::cout << ch << "\n"; and nothing printed out, so I am curious why the values start at 32.
I noticed the last value, 127, was "Delete." What is this for, and what does it do?
I thought char could store 256 values, so why are there only 127 here? (Please let me know if I have this wrong.)
Thanks in advance!
The printable characters start at 32. Below 32 there are non-printable (control) characters, such as BEL, TAB, NEWLINE, etc.
DEL is a non-printable control character. Historically, its code (127, all seven bits set) let you erase a character on punched paper tape by punching out every hole.
char can indeed store 256 values, but its signedness is implementation-defined. If you need to store values from 0 to 255 you have to explicitly specify unsigned char; similarly, for -128 to 127, you have to specify signed char.
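A small illustration of the signedness pitfall (a sketch; the byte value assumes a Windows-1252 source encoding):

char c = '\xC7';                           // the byte for Ç in Windows-1252
int bad  = c;                              // may be -57 if plain char is signed
int good = static_cast<unsigned char>(c);  // always 199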
EDIT
The so-called extended ASCII characters, with codes above 127, are not part of the ASCII standard. Their representation depends on the "code page" chosen by the operating system. For example, MS-DOS used such extended ASCII characters for drawing directory trees, window borders, etc. By changing the code page you could also display non-English characters, etc.
ASCII is a mapping between integers and characters, plus "control characters" like space, line feed and carriage return that are interpreted by (possibly virtual) display devices. As such the mapping is arbitrary, but it is organized by binary values.
32 is a power of 2, and the printable characters start there (32 itself is the space character).
Delete is the signal sent by your keyboard's delete key.
At the time the code was designed, only 7 bits were standard; not all bytes (parts of machine words) were 8 bits.
I have an array which contains Japanese and ASCII characters.
I am trying to find out whether a character that was read is an English (ASCII) character or a Japanese character.
To solve this, I proceeded as follows:
1. Read the first byte; if the multibyte character width is not equal to one, move the pointer to the next byte.
2. Now display the two bytes together and report that a Japanese character has been read.
3. If the multibyte character width is equal to one, display the byte and report that an English character has been read.
The above algorithm works fine but fails for the halfwidth forms of Japanese, e.g. シ, ァ, etc., as they are only one byte.
How can I find out whether the characters are Japanese or English?
**Note:** What I tried
I read on the web that the first byte tells whether a character is Japanese or not, which I covered in step 1 of my algorithm, but that doesn't work for the halfwidth forms.
EDIT:
In the problem I am solving, I include the control character 0x80 at the start and end of my characters to delimit the string of characters.
I wrote the following to identify the closing control character:
cntlchar.....(my characters, can be Japanese).....cntlchar
if ((buf[*p + 1] & 0x80) && (mbMBCS_charWidth(&buf[*p]) == 1)) {
    // end of control characters reached
} else {
    (*p)++;  // advance to the next byte
}
It worked fine for English but didn't work for Japanese halfwidth characters.
How can i handle this?
Your data must be using Windows code page 932. That is a guess, but examining the code points matches what you are describing.
The code page shows that bytes in the range 00 to 7F are "English" (a better description is "7-bit ASCII"), bytes in the ranges 81 to 9F and E0 to FF are the first byte of a multibyte code, and everything between A1 and DF is a halfwidth Kana character.
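As a minimal classification sketch under that code page 932 assumption (ranges taken from the description above):

// classify one byte, assuming Windows code page 932
enum ByteClass { ASCII7, LEAD_BYTE, HALFWIDTH_KANA };

ByteClass classify(unsigned char b) {
    if (b <= 0x7F) return ASCII7;                              // "English" (7-bit ASCII)
    if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFF))
        return LEAD_BYTE;                                      // first byte of a 2-byte code
    return HALFWIDTH_KANA;                                     // A1 to DF (note: 0x80 and 0xA0 are unassigned and would need separate handling)
}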
For individual bytes this is impractical to impossible. For larger sets of data you could do statistical analysis on the bytes and see whether they match known English or Japanese patterns: for example, vowels are very common in English text, while Japanese text shows quite different byte-frequency patterns.
Things get more complicated than testing bits if your data includes accented characters.
If you're dealing with Shift-JIS data and Windows-1252 encoded text, ideally you just remap it to UTF-8. There's no standard way to identify text encoding within a text file, although things like MIME can help if added on externally as metadata.
Whenever I try to convert a std::string containing the letter 'ß' into a QString, the QString turns into something like "Ã" or other really strange letters. What's wrong? I used this code and it caused no errors or warnings:
std::string content = "Heißes Teil.";
ui->txtFind_lang->setText(QString::fromStdString(content));
The std::string has no problem with this character; I even wrote it to a text file without problems. So what am I doing wrong?
You need to set the codec to UTF-8:
QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
By default, Qt uses the Latin-1 encoding, which is limited. By adding this code you set the default encoding to UTF-8, which allows you to use many more characters.
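Alternatively (a sketch, assuming the source file and the std::string contents are UTF-8), you can decode explicitly at the call site instead of changing the global codecs:

std::string content = "Heißes Teil.";                          // UTF-8 bytes
ui->txtFind_lang->setText(QString::fromUtf8(content.c_str()));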
Though antoyo's answer works, I wasn't too sure why. So, I decided to investigate.
All of my documents are encoded in UTF-8, as are most web pages. The ß character has the Unicode code point U+00DF.
Since UTF-8 is a variable-length encoding, ß is encoded in binary as 11000011 10011111, i.e. the bytes C3 9F. Because Qt relies on the Latin-1 encoding by default, it reads ß as two different characters: the first byte, C3, maps to Ã, and the second byte, 9F, maps to nothing, because Latin-1 does not assign printable characters to the bytes between 128 and 159 (in decimal).
That's why ß appears as Ã when using the Latin-1 encoding.
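You can see this concretely with a small sketch (the byte literals are the UTF-8 encoding of ß):

const char utf8Bytes[] = "\xC3\x9F";             // the two UTF-8 bytes of ß
QString wrong = QString::fromLatin1(utf8Bytes);  // misread as Latin-1: Ã plus an unprintable 0x9F
QString right = QString::fromUtf8(utf8Bytes);    // decoded correctly: ß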
Side note: you might want to brush up on how UTF-8 encoding works, because otherwise it seems a little unintuitive that ß takes two bytes even though its code point DF is less than FF and would appear to fit in a single byte.
I'm writing a terminal (console) application that is supposed to wrap arbitrary Unicode text.
Terminals usually use a monospaced (fixed-width) font, so wrapping text is barely more than counting characters, checking whether a word still fits into the line, and acting accordingly.
The problem is that there are fullwidth characters in the Unicode table that take up the width of 2 characters in a terminal.
Counting these sees 1 Unicode character, but the printed character is 2 "normal" (halfwidth) characters wide, which breaks a wrapping routine that is not aware of characters taking up twice the width.
As an example, this is a fullwidth character (U+3004, the JIS symbol), shown above the two halfwidth characters "12" for comparison:
〄
12
It may not take up the full width of 2 characters here even though the text is preformatted, but it does use twice the width of a Western character in a terminal.
To deal with this, I have to distinguish between fullwidth and halfwidth characters, but I cannot find a way to do so in C++. Do I really have to know every fullwidth character in the Unicode table to get around the problem?
You should use ICU's u_getIntPropertyValue function with the UCHAR_EAST_ASIAN_WIDTH property.
For example:
#include <unicode/uchar.h>

bool is_fullwidth(UChar32 c) {
    int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
    return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}
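A quick usage sketch (assuming ICU is installed and linked):

UChar32 c = 0x3004;                   // 〄, the JIS symbol from the question
int cells = is_fullwidth(c) ? 2 : 1;  // reserve two terminal cells for wide characters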
Note that if your graphics library supports combining characters then you'll have to consider those as well when determining how many cells a sequence uses; for example e followed by U+0301 COMBINING ACUTE ACCENT will only take up 1 cell.
There's no need to build the tables yourself; Markus Kuhn has already done that:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
The same code is used in terminal emulating software such as xterm[1], konsole[2] and quite likely others...
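For example, a minimal width-counting sketch using the POSIX wcwidth() function, which implements the same logic (call setlocale(LC_ALL, "") first so the locale's wide-character tables are active):

#include <wchar.h>

int display_width(const wchar_t *s) {
    int cells = 0;
    for (; *s; ++s) {
        int w = wcwidth(*s);    // 2 for fullwidth, 1 for normal, 0 for combining marks
        if (w > 0) cells += w;  // non-printable characters return -1; counted as 0 here
    }
    return cells;
}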
The problem is categorized into two steps:
Problem Step 1: Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8, so I created a patch tool to convert the XML strings separately from UTF-8 to Unicode. To convert a UTF-8 string to Unicode, I used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is declared as "newName: array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But when I work with Greek and Chinese characters it chokes.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting string contains null characters in between, and when it is stored into a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2: Access 97 db text strings; the application GUI takes Unicode input and saves it to Access 97.
First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding was required. But again the problem appears with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing: the Spanish characters, which are fine without conversion, get their Unicode characters either lost or converted to plain ASCII letters.
The Greek and Chinese characters show behaviour similar to that described in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access didn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's code page, and every character that had no correspondence in that encoding was replaced with '?'. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable, you have to know the code page it is stored in. With that you can get the actual bytes back and then convert them to Unicode.
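As a sketch of that last step, in C++ against the Win32 API (the question's tool appears to be Delphi, but the call is the same; the code page value, e.g. 1253 for Greek, is an assumption you would have to verify against your data):

#include <windows.h>
#include <string>

// decode bytes stored under a known code page into a UTF-16 wstring
std::wstring decodeBytes(const std::string &bytes, UINT codePage) {
    int len = MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), NULL, 0);
    std::wstring result(len, L'\0');
    MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), &result[0], len);
    return result;
}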