C++ Visual Studio Unicode confusion

I've been looking at the Unicode chart, and know that the first 128 code points are the same in almost all encoding schemes: ASCII (probably the original), UCS-2, ANSI, UTF-8, UTF-16, UTF-32 and anything else.
I wrote a loop to go through the characters starting from decimal 122, which is lowercase "z". After that there are a few more characters such as {, |, and }, and then it gets into no-man's land: roughly 30 "control characters". The printable characters begin again at 161 with an inverted exclamation mark, 162 which is the cent sign with a stroke through it, and so on.
The problem is, my results don't correspond to the Unicode chart, the UTF-8 chart, or the UCS-2 chart; the symbols seem random. By the way, the reason I made the character variable a four-byte int is that when I was using char (which is essentially a one-byte signed data type), it cycled back to -128 after 127, and I thought this might be messing things up.
I know I'm doing something wrong; can anyone figure out what's happening? This happens whether I set the character set to Unicode or to Multi-Byte in the project settings. Here is the code so you can run it:
#include <cstdlib>  // for system()
#include <iostream>
using namespace std;

int main()
{
    unsigned int character = 122; // Starting at "z"

    for (int i = 0; i < 100; i++)
    {
        cout << (char)character << endl;
        cout << "decimal code point = " << (int)character << endl;
        cout << "size of character = " << sizeof(character) << endl;
        character++;
        system("pause");
        cout << endl;
    }

    return 0;
}
By the way, here is the Unicode chart
http://unicode-table.com/en/#control-character

Very likely the bytes you're printing are displayed using the console code page (sometimes referred to as OEM), which may be different than the local single- or double-byte character set used by Windows applications (called ANSI).
For instance, on my English language Windows install ANSI means windows-1252, while a console by default uses code page 850.
There are a few ways to write arbitrary Unicode characters to the console, see How to Output Unicode Strings on the Windows Console
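One commonly suggested approach (a minimal sketch only, assuming a console font that contains the glyphs; the specific byte sequences below are just examples) is to switch the console output code page to UTF-8 and then write UTF-8 encoded bytes:

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <iostream>

int main()
{
    // Tell the console to interpret the bytes we write as UTF-8.
    ::SetConsoleOutputCP(65001);  // 65001 = CP_UTF8

    // U+00A1 '¡' and U+00A2 '¢', written as explicit UTF-8 byte sequences
    // so the result does not depend on the source file's own encoding.
    std::cout << "\xC2\xA1 and \xC2\xA2\n";
}

Whether the characters actually render still depends on the particular console and its font, so treat this as a starting point rather than a guaranteed fix.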

Related

Printing unicode Characters in C++

I'm trying to print an interface using these characters:
"╣║╗╝╚╔╩╦╠═╬"
but when I try to print them, I get something like this:
"ôöæËÈ"
interface.txt
unsigned char* tabuleiroImportado() {
    std::ifstream TABULEIRO;
    TABULEIRO.open("tabuleiro.txt");

    unsigned char tabu[36][256];
    for (unsigned char i = 0; i < 36; i++) {
        TABULEIRO >> tabu[i];
        std::cout << tabu[i] << std::endl;
    }
    return *tabu;
}
I'm using this function to import the interface.
Just like every other kind of data that lives in your computer, text must be represented by a sequence of bytes, and each byte can have just 256 possible values.
All the carbon-based life forms that live on the third planet from the sun use all sorts of different alphabets with all sorts of characters, whose total number is much, much more than 256.
A single byte by itself therefore cannot express all characters. The simplest way of handling this is to pick just 256 (or fewer) characters at a time, assign each of them one of the possible byte values, and call that your "character set".
Such is, apparently, your "tabuleiro.txt" file: its contents must be using some particular character set which includes the characters you expect to see there.
Your screen display, however, uses a different character set, hence the same values show different characters.
However, it's probably more complicated than that: modern operating systems and modern terminals employ multi-byte character sequences, where a single character can be represented by a specific sequence of more than one byte. It's fairly likely that your terminal is using a multi-byte Unicode encoding such as UTF-8.
In summary: you need to figure out two things:
Which character set your file uses
Which character set your terminal display uses
Then write the code to properly translate one to the other
It goes without saying that no one else can tell you which character set your file uses or which character set your terminal display uses. That's something you'll need to figure out yourself, and without knowing both, you can't write the translation code.
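As a rough illustration of that last step, here is a sketch only, assuming the file turns out to be Windows-1252 and the console uses whatever GetConsoleOutputCP reports; the recode helper is hypothetical and the code page numbers are examples to be replaced with the ones you actually determine:

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <iostream>
#include <string>
#include <vector>

// Convert bytes from one code page to another by going through UTF-16.
std::string recode(const std::string& in, UINT fromCP, UINT toCP)
{
    // bytes in "fromCP" -> UTF-16
    int wlen = ::MultiByteToWideChar(fromCP, 0, in.data(), (int)in.size(), nullptr, 0);
    std::vector<wchar_t> wide(wlen);
    ::MultiByteToWideChar(fromCP, 0, in.data(), (int)in.size(), wide.data(), wlen);

    // UTF-16 -> bytes in "toCP"
    int nlen = ::WideCharToMultiByte(toCP, 0, wide.data(), wlen, nullptr, 0, nullptr, nullptr);
    std::string out(nlen, '\0');
    ::WideCharToMultiByte(toCP, 0, wide.data(), wlen, &out[0], nlen, nullptr, nullptr);
    return out;
}

int main()
{
    std::string line = "\xC9";  // 'É' if the file is Windows-1252
    std::cout << recode(line, 1252, ::GetConsoleOutputCP()) << '\n';
}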
To print Unicode characters, you can put the Unicode code point in the string literal with the \u escape prefix.
If the console does not support Unicode, you will not get the correct result.
Example:
#include <iostream>

int main() {
    std::cout << "Character: \u2563" << std::endl;
    std::cout << "Character: \u2551" << std::endl;
    std::cout << "Character: \u2560" << std::endl;
}
Output:
Character: ╣
Character: ║
Character: ╠
Another way is to use an unsigned char the same way you would use a char and assign the character's code number to it. That is how I did it when I was making a game engine for cmd; it works in C++17 with GNU GCC, in 2021 and 2022 alike. You can use any variable name in place of a.

Ascii character codes in C++ visual studio don't match up with actual character symbols?

I am trying to represent 8-bit numbers with characters, and I don't want to have to use an int to define the character; I want to use the actual character symbol. But when I use standard ASCII codes, everything outside the range 32 to 120-something comes out completely different.
So I looped through all 256 ASCII codes and printed them; for instance, it said that the character '±' is code 241. But when I copy and paste that symbol from the console, or even use the Alt code, it says that character is code -79... How can I get these two things to match up? Thank you!
It should be a problem of code pages. Windows console applications have a code page that "maps" the 0-255 character values to a subset of Unicode characters. It is something that exists from the pre-Unicode era to support non-American character sets. There are Windows APIs to select the code page for the console: SetConsoleOutputCP (and SetConsoleCP for input):
#include <iostream>

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

int main() {
    unsigned int cp = ::GetConsoleOutputCP();
    std::cout << "Default cp of console: " << cp << std::endl;

    ::SetConsoleCP(28591);
    ::SetConsoleOutputCP(28591);

    cp = ::GetConsoleOutputCP();
    std::cout << "Now the cp of console is: " << cp << std::endl;

    for (int i = 32; i < 256; i++) {
        std::cout << (char)i;
        if (i % 32 == 31) {
            std::cout << std::endl;
        }
    }
}
The 28591 code page is ISO-8859-1, which maps the character values 0-255 directly to the Unicode code points 0-255 (the number 28591 is taken from https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers).

C++ Non ASCII letters

How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows!
for (int i = 0; i < text.length(); i++)
{
    std::cout << text[i];
}
But on Linux, if I do:
std::string text = "á";
std::cout << text.length() << std::endl;
It tells me the string "á" has a length of 2, while on Windows it's only 1.
But with ASCII letters it works fine!
In your Windows system's code page, á is a single-byte character, i.e. every char in the string is indeed one character, so you can just loop over them and print them.
On Linux, á is represented as the multibyte (2 bytes to be exact) utf-8 character 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters because the utf-8 representation of every ASCII character fits in a single byte.
Unfortunately, utf-8 is not really supported by C++ standard facilities. As long as you only handle the whole string and neither access individual chars from it nor assume the length of the string equals the number of actual characters in the string, std::string will most likely do fine.
If you need more utf-8 support, look for a good library that implements what you need.
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
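If you only need to walk the string code point by code point, a minimal hand-rolled sketch is shown below, assuming the string really is valid UTF-8; utf8_length here is just an illustrative helper, not a standard facility:

#include <iostream>
#include <string>

// Count the code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte, so it starts a code point
            ++count;
    return count;
}

int main()
{
    std::string text = "\xC3\xA1";            // "á" encoded as UTF-8
    std::cout << text.size() << '\n';         // 2: number of bytes
    std::cout << utf8_length(text) << '\n';   // 1: number of code points
}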
Try using std::wstring. As far as I know, the encoding it uses isn't pinned down by the standard, so I wouldn't save its contents to a file without a library that handles a specific format. It can hold characters beyond ASCII, so you can use accented letters and other symbols not supported by ASCII.
#include <iostream>
#include <string>

int main()
{
    std::wstring text = L"áéíóú";

    for (std::size_t i = 0; i < text.length(); i++)
        std::wcout << text[i];

    std::wcout << text.length() << std::endl;
}

Decimal values of Extended ASCII characters

I wrote a function to test if a string consists only of letters, and it works well:
bool is_all_letters(const char* src) {
    while (*src) {
        // A-Z, a-z
        if ((*src > 64 && *src < 91) || (*src > 96 && *src < 123)) {
            src++;
        }
        else {
            return false;
        }
    }
    return true;
}
My next step was to include “Extended ASCII Codes”. I thought it was going to be really easy, but that’s where I ran into trouble. For example:
std::cout << (unsigned int)'A';  // 65 <-- decimal ASCII value
std::cout << (unsigned int)'ñ'; // 4294967281 <-- what?
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
My goal is to restrict user input to only letters in ISO 8859-1 (latin 1). I’ve only worked with single byte characters and would like to avoid multi-byte characters if possible.
I am guessing that I can compare against the unsigned int values above, i.e. 4294967281, but it does not feel right to me, and besides, I don’t know whether that large integer is VC 8.0's representation of 'ñ' or whether it changes from compiler to compiler.
Please advise
UPDATE - Per some suggestions made by Christophe, I ran the following code:
locale loc("spanish") ;
cout<<loc.name() << endl; // Spanish_Spain.1252
for (int i = 0; i < 255; i++) {
cout << i << " " << isalpha(i, loc)<< " " << (isprint(i,loc) ? (char)(i):'?') << endl;
}
It does return Spanish_Spain.1252, but unfortunately the loop prints the same data as the default C locale (using VC++ 8 / VS 2005).
Christophe reports different (desired) results in his answer below, but he uses a much newer version of VC++.
The code chart you found on the internet is actually Windows OEM code page 437, which was never endorsed as a standard. Although it is sometimes called "extended ASCII", that description is highly misleading. (See the Wikipedia article Extended ASCII: "The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.")
You can find the history of OEM437 on Wikipedia, in various versions.
What was endorsed as a standard 8-bit encoding is ISO-8859-1, which later became the first 256 code points of Unicode. (It's one of a series of 8-bit encodings designed for use in different parts of the world; ISO-8859-1 is intended for the Americas and Western Europe.) So that's what you will find on most computers produced in this century in those regions, although more recently more and more operating systems are converting to full Unicode support.
The value you see for (unsigned int)'ñ' is the result of casting the ISO-8859-1 code 0xF1 from a (signed) char (that is, -15) to an unsigned int. Had you cast it to an int, you would have seen -15.
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
Asciitable.com appears to give the code for the old IBM437 DOS character set (still used in the Windows command prompt), in which ñ is indeed 164. But that's just one of hundreds of “extended ASCII” variants.
The value 4294967281 = 0xFFFFFFF1 you got is a sign-extension of the (signed) char value 0xF1, which is how ñ is encoded in ISO-8859-1 and close variants like Windows-1252.
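A short sketch of that sign-extension, assuming a platform where plain char is signed (the VC++ default):

#include <iostream>

int main()
{
    char c = (char)0xF1;  // 'ñ' in ISO-8859-1 / Windows-1252

    std::cout << (int)c << '\n';                 // -15: char is signed here
    std::cout << (unsigned int)c << '\n';        // 4294967281 (0xFFFFFFF1): sign-extended
    std::cout << (int)(unsigned char)c << '\n';  // 241 (0xF1): the byte value you probably wanted
}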
To start with, you're trying to reinvent std::isalpha. But you'll need to pass it the ISO-8859-1 locale; IIRC, by default it just checks ASCII.
The behavior you see is because char is signed: you didn't compile with /J (which makes char unsigned and is the smart thing to do when you use more than just ASCII), and VC++ defaults to signed char.
There is already plenty of information here. However, I'd like to propose some ideas to address your initial problem, namely the categorisation of the extended character set.
For this, I suggest the use of <locale> (country specific topics), and especially the new locale-aware form of isalpha(), isspace(), isprint(), ... .
Here is a little piece of code to help you find out which chars could be letters in your local alphabet:
std::locale::global(std::locale("")); // sets the environment default locale currently in place
std::cout << std::locale().name() << std::endl; // display name of current locale
std::locale loc ; // use a copy of the active global locale (you could use another)
for (int i = 0; i < 255; i++) {
cout << i << " " << isalpha(i, loc)<< " " << (isprint(i,loc) ? (char)(i):'?') << endl;
}
This will print the character codes from 0 to 255, followed by an indicator of whether each one is a letter according to the locale settings, and the character itself if it's printable.
For example, on my PC, all the accented characters, as well as ñ and the Greek letters, are considered alpha, whereas £ and the mathematical symbols are considered non-alpha but printable.

g++ unicode character ifstream

This is a question about Unicode characters in a text input file. This discussion was close but not quite the answer. Compiled with VS2008 and executed on Windows, these characters are recognized on read (maybe represented as a different symbol, but read); compiled with g++ and executed on Linux, they are displayed as blank.
‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
The rest of the Unicode symbols appear to work fine, I did not check them all but found this set did not work.
Questions:
(1) Why?
(2) Is there a solution?
void Lexicon::buildMapFromFile(string filename) //map
{
    ifstream file;
    file.open(filename.c_str(), ifstream::binary);

    string wow, mem, key;
    unsigned int x = 0;

    while (true) {
        getline(file, wow);
        cout << wow << endl;
        if (file.fail()) break; //boilerplate check for error

        while (x < wow.length()) {
            if (wow[x] == ',') { //look for csv delimiter
                key = mem;
                mem.clear();
                x++; //step over ','
            } else
                mem += wow[x++];
        }

        //cout << mem << " code " << key << " is " << (key[0] - '€') << " from €" << endl;
        cout << "enter 1 to continue: ";
        while (true) {
            int choice = GetInteger();
            if (choice == 1) break;
        }

        list_map0[key] = mem; //char to string
        list_map1[mem] = key; //string to char
        mem.clear(); //reset memory
        x = 0; //reset index
    }
    //printf("%d\n", list_map0.size());
    file.close();
}
The Unicode symbols are read from a CSV file and parsed into the Unicode symbol and an associated string. Initially I thought there was a bug in the code, but in this post the review found it is fine, and I traced the issue to how the characters are handled.
The test is cout << wow << endl;
The characters you show are all characters from Windows code page 1252 which do not exist in the ISO-8859-1 encoding. These two encodings are similar and so are often confused.
If the input is CP1252 and you are reading it as though it were ISO-8859-1, then those characters are read as control characters and will not behave as normal, visible characters.
There are many possible things you could be doing incorrectly to cause this, but you'll have to post more details in order to determine which. A more complete answer requires knowing how you are reading the data, how you are converting and storing it internally, how you are testing the read data, and the input data and/or encoding.
Your displayed code doesn't do any conversions while reading the data, and the commented-out code to print the data is the same; no conversions. This means to print the data you are simply relying on the input data to be correct for the platform you run the program on. That means that, for example, if you run your program in the console on Windows then your input file needs to be encoded using the console's codepage*.
To fix the problem you can either ensure the input file matches the encoding required by the particular console you run the program on, or specify the input encoding, convert to a known internal encoding when reading, and then convert to the console encoding when printing.
* and if it's not, for example if the console is cp437 and the file is cp1252 then the characters you listed will instead show up as: É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
Your problem statement does not detail the platform for g++, but from your tags it appears that you are compiling the same code on Linux.
Windows and Linux are both Unicode-enabled, but wide strings do not behave the same on both: if your Windows/VS2008 code used the wstring class, you will have to change it back to string on Linux/g++, because wstring will not work the same way there.
Unicode handling in C++ is not straightforward and is implementation-dependent (you have already seen that the output changes between VS2008 and g++). Furthermore, Unicode can be implemented with different character encodings (like UTF-8 and UTF-16).
There is an enlightening article on this page. It talks about Unicode conversion for STL-based software. For text I/O the main tool is codecvt, a locale facet that can be used to translate strings between different character encodings.
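As a minimal sketch of that idea, here is one way to round-trip between UTF-8 bytes and UTF-32 code points using std::wstring_convert with std::codecvt_utf8 (both deprecated since C++17 but still widely available); the sample string is just illustrative:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Converts between UTF-8 encoded bytes and UTF-32 code points.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::string utf8 = "\xE2\x84\xA2 trademark sign";  // U+2122 followed by ASCII text
    std::u32string utf32 = conv.from_bytes(utf8);

    std::cout << "bytes: " << utf8.size()
              << ", code points: " << utf32.size() << '\n';

    std::string back = conv.to_bytes(utf32);
    std::cout << "round-trip ok: " << (back == utf8) << '\n';
}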