How to convert from ASCII to string or symbol - c++

I want to output this ^ to the console, but I want to do it using its ASCII code rather than the character itself. Does anyone have an idea of how to do this?

That symbol is called a caret. The ASCII code is 0x5e in hexadecimal (= 94 in decimal).
C version:
printf("%c", 0x5e);
C++ version:
std::cout << static_cast<char>(0x5e);
Both of these assume that you are running on a system where the default character encoding assigns the caret symbol the value 0x5e.
To avoid relying on this assumption, it is better not to use the numeric code and to simply write '^' instead.
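For comparison, a minimal sketch (it assumes an ASCII-compatible execution character set, so both lines print the same symbol):
#include <iostream>

int main() {
    std::cout << '^' << '\n';                       // portable: the character itself
    std::cout << static_cast<char>(0x5e) << '\n';   // relies on 0x5e mapping to '^'
}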

The hexadecimal value for the caret character (^) is most often 0x5e (94 in decimal).
std::cout << static_cast<char> (0x5e) << " " << (char)94 << " " << '\x5e';
output on my platform: "^ ^ ^"
I write "most often" because the standard doesn't guarantee what integer value is used to represent a given character, so you shouldn't rely on it the way you are implying.
Even though it will probably work out the way you want (since most modern operating systems represent the caret with that value), it's not recommended: if it's not in the standard, no one can guarantee that it will work on all platforms, in all cases.
What does the standard say?
2.3/3   Character sets   [lex.charset]
The execution character set and the execution wide-character set are
implementation-defined supersets of the basic execution character set
and the basic execution wide-character set, respectively. The values
of the members of the execution character sets and the sets of
additional members are locale-specific.

According to this page: http://www.ascii-code.com/
It's ASCII code 0x5e.

Related

Why does C++ return wrong codes for some characters, and how do I fix this?

I have a simple line of code:
std::cout << std::hex << static_cast<int>('©');
This character is the copyright sign (©); its code point is a9, but the app writes c2a9. The same happens with lots of Unicode characters. Another example: ™ (code point 2122) suddenly comes out as e284a2. Why does C++ return wrong codes for some characters, and how do I fix this?
Note: I'm using Microsoft Visual Studio, a file with my code is saved in UTF-8.
An ordinary character literal (one without prefix) usually has type char and can store only elements of the execution character set that are representable as a single byte.
If the character is not representable in this way, the character literal is only conditionally-supported, with type int and an implementation-defined value. Compilers typically warn about this under common warning flags, since it is usually a mistake; whether you see a warning depends on which flags you have enabled.
A byte is typically 8 bits, and it is therefore impossible to store all of Unicode in it. I don't know what execution character set your implementation uses, but clearly neither © nor ™ is in it.
It also seems that your implementation chose to support the non-representable character by encoding it in UTF-8 and using that as the value of the literal. You are seeing a representation of the numeric value of the UTF-8 encoding of the two characters.
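You can see this for yourself by printing the individual bytes of the ordinary literal (a sketch, assuming the source file is UTF-8 and the compiler keeps the literal as UTF-8, which your output suggests):
#include <cstdio>

int main() {
    const char text[] = "©";             // two UTF-8 code units: 0xC2 0xA9
    for (unsigned char byte : text) {
        if (byte != 0)                    // skip the terminating '\0'
            std::printf("%02x ", byte);   // prints "c2 a9"
    }
    std::printf("\n");
}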
If you want the numeric value of the Unicode code point for the character, then you should use a character literal with the U prefix, which gives the character's value according to UTF-32 with type char32_t, a type large enough to hold all Unicode code points:
std::cout << std::hex << static_cast<std::uint_least32_t>(U'©');

Visual Studio C++ C2022. Too big for character error occurs when trying to print a Unicode character

When I try to print a Unicode character to the console, Visual Studio gives me an error. How do I fix this and get Visual Studio to print the Unicode character?
#include <iostream>

int main() {
    std::cout << "\x2713";
    return 0;
}
Quite simply, \x2713 is too large for a single character. If you want two characters, write \x27\x13; if you want the wide character, prefix the literal with L, i.e. L"\x2713", and use std::wcout instead of std::cout.
Note, from the C++20 standard (draft) [lex.ccon]/7 (emphasis mine):
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character-literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character-literals with no prefix) or wchar_t (for character-literals prefixed by L).
Essentially, the compiler may treat that character how it wants; g++ issues a warning, MSVC (for me) gives a compiler error, and clang also treats it as an error.
\xNNN (any positive number of hex digits) means a single byte whose value is given by NNN, unless it appears in a string or character literal prefixed with L, in which case it means a wchar_t whose value is given by NNN.
If you are looking to encode a Unicode code point, the syntax is \uNNNN (exactly 4 digits) or \UNNNNNNNN (exactly 8 digits). Note that this is the code point, not a UTF representation.
Using the u or U forms instead of L avoids portability problems due to wchar_t having different size on different platforms.
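For instance, a minimal sketch (C++11 or later) of the u and U character literals, which give you the code point value in fixed-size types:
#include <iostream>

int main() {
    char16_t check16 = u'\u2713';   // one UTF-16 code unit for U+2713 CHECK MARK
    char32_t check32 = U'\u2713';   // the UTF-32 code point, same numeric value here
    std::cout << std::hex
              << static_cast<unsigned>(check16) << ' '
              << static_cast<unsigned long>(check32) << '\n';   // prints "2713 2713"
}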
To get well-defined behaviour you can manually specify the encoding of a string literal, e.g.:
std::cout << u8"\u2713" << std::endl;
which will encode the code point as UTF-8. Of course you still need a UTF-8 aware terminal to see meaningful output.
Without an encoding prefix, it is up to the compiler (I think) how to encode the code point.
See:
Escape sequences
String literal

Error in getting ASCII of character in C++

I saw this question : How to convert an ASCII char to its ASCII int value?
The most voted answer (https://stackoverflow.com/a/15999291/14911094) states the solution as :
Just do this:
int(k)
But i am having issues with this.
My code is :
std::cout << char(144) << std::endl;
std::cout << (int)(char(144)) << std::endl;
std::cout << int('É') << std::endl;
Now the output comes as :
É
-112
-55
Now I can understand the first line, but what is happening in the second and third lines?
Firstly, how can an ASCII value be negative, and secondly, how can it be different for the same character?
Also, as far as I have tested, this is not some random garbage from memory, as the values stay the same every time I run the program. Also:
If I change it to 145:
æ
-111
The output also changes by 1, so my guess is that this may be due to some kind of overflow.
But I cannot see exactly why, as I am converting to int, and that should be enough (4 bytes) to store the result.
Can anyone suggest a solution?
If your platform is using ASCII for the character encoding (most do these days), then bear in mind that ASCII is only a 7 bit encoding.
It so happens that char is a signed type on your platform. (The signedness or otherwise of char doesn't matter for ASCII as only the first 7 bits are required.)
Hence char(144) gives you a char with a value of -112. (You have a 2's complement char type on your platform: from C++20 you can assume that, but you can't in earlier standards.)
The third line implies that that character (which is not in the ASCII set) has a value of -55.
int(static_cast<unsigned char>('É'))
would force it to a positive value on all but the most exotic of platforms.
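A minimal sketch of the difference (it assumes an 8-bit signed char, as on your platform):
#include <iostream>

int main() {
    char c = char(144);                                       // bit pattern 0x90, reads back as -112
    std::cout << int(c) << '\n';                              // -112: the sign bit is extended
    std::cout << int(static_cast<unsigned char>(c)) << '\n';  // 144: reinterpreted as unsigned first
}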
The C++ standard only guarantees that characters in the basic execution character set (1) have non-negative encodings. Characters outside that basic set may have negative encodings - it depends on the locale.
(1) Upper- and lowercase Latin alphabet, decimal digits, most punctuation, and control characters like tab, newline, form feed, etc.

Unicode string indexing in C++

I come from Python, where you can use string[10] to access a character in a sequence, and if the string is Unicode it gives me the expected results. However, when I use indexing on a string in C++, it works as long as the characters are ASCII, but when I put a Unicode character inside the string and use indexing, in the output I get an octal representation like \201.
For example:
string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";
cout << ramp[5] << "\n";
Output:
ÐðŁłŠšÝýÞþŽž
\201
Why is this happening, and how can I access that character in the string, or convert the octal representation to the actual character?
Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.
The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner, because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus $, @, and the backtick).
C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more damage than good...
Microsoft defined wchar_t as 16 bits, which was enough for the Unicode code points at the time. However, since then Unicode has been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).
All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...
C++11 made significant improvements in this area, adding char16_t and char32_t and their associated string variants to remove the ambiguity, but it is still not fully equipped for Unicode operations.
Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short would suddenly be interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>
#include <iostream>
int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}
Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.
ÐðŁłŠšÝýÞþŽž
353
š
(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a unicode string.
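A small sketch of that footnote (a hypothetical example, not taken from the question): length() and operator[] count UTF-16 code units, while countChar32() and char32At() work in code points:
#include <unicode/unistr.h>
#include <cstdio>

int main()
{
    // One code point outside the BMP (U+1F600), stored as a surrogate pair.
    icu::UnicodeString s( static_cast<UChar32>( 0x1F600 ) );
    std::printf( "%d code units, %d code point\n",
                 (int)s.length(), (int)s.countChar32() );   // 2 code units, 1 code point
    std::printf( "U+%X\n", (int)s.char32At( 0 ) );          // U+1F600, not a lone surrogate
}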
C++ has no useful native Unicode support. You almost certainly will need an external library like ICU.
To access codepoints individually, use u32string, which represents a string as a sequence of UTF-32 code units of type char32_t.
u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";
cout << ramp[5] << "\n";
In my opinion, the best solution is to do any task with strings using iterators. I can't imagine a scenario where one really has to index strings: if you need indexing like ramp[5] in your example, then the 5 is usually computed in another part of the code, and you usually scan all the preceding characters anyway. That's why the Standard Library uses iterators in its API.
A similar problem comes up if you want to get the size of a string. Should it be character (or code point) count or merely number of bytes? Usually you need the size to allocate a buffer so byte count is more desirable. You only very, very rarely have to get Unicode character count.
If you want to process UTF-8 encoded strings using iterators then I would definitely recommend UTF8-CPP.
As for what is going on, cplusplus.com makes it clear:
Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).
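You can see that byte orientation directly (a minimal sketch, assuming the source file and therefore the literal are UTF-8 encoded, so every character in this particular string occupies two bytes):
#include <iostream>
#include <string>

int main() {
    std::string ramp = "ÐðŁłŠšÝýÞþŽž";
    std::cout << ramp.size() << '\n';                   // 24: bytes, not 12 characters
    std::cout << (int)(unsigned char)ramp[5] << '\n';   // 129 (octal 201): the second byte of 'Ł'
}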
About the solution, others had it right: ICU if you are not using C++11; u32string if you are.

Representing any universal character within the range of 0x00 to 0x7F in C++?

I am writing a lexer in MSVC and I need a way to represent an exact character match for all 128 Basic Latin Unicode characters.
However, according to this MSDN article, "With the exception of 0x24 and 0x40, characters in the range of 0 to 0x20 and 0x7f to 0x9f cannot be represented with a universal character name (UCN)."
...Which basically means I am not allowed to declare something like wchar_t c = '\u0000';, let alone use a switch statement on this 'disallowed' range of characters. Also, for '\n' and '\r', it is my understanding that the actual values/lengths vary between compilers/target OSes...
(i.e. Windows uses '\r\n', while Unix simply uses '\n' and older versions of Mac OS use '\r')
...and so I have made a workaround for this using universal characters in order to ensure proper encoding schemes and byte lengths are detected and properly used.
But this C3850 compiler error simply refuses to allow me to do things my way...
So how can this be solved in a manner that ensures proper encoding schemes & character matches given ANY source input?
In C++11 the restrictions on what characters you may represent with universal character names do not apply inside character and string literals.
C++11 2.3/2
Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.
That means that those restrictions on UCNs don't apply inside character and string literals:
wchar_t c = L'\u0000'; // perfectly okay
switch (c) {
    case L'\u0000':
        ;
}
This was different in C++03 and I assume from your question that Microsoft has not yet updated their compiler to allow this. However I don't think this matters because using UCNs does not solve the problem you're trying to solve.
and so I have made a workaround for this using universal characters in order to ensure proper encoding schemes and byte lengths are detected and properly used
Using UCNs does not do anything to determine the encoding scheme used. A UCN is a source encoding independent method of including a particular character in your source, but the compiler is required to treat it exactly the same as if that character had been written literally in the source.
For example, take the code:
#include <iostream>

int main() {
    unsigned char c = 'µ';
    std::cout << (int)c << '\n';
}
If you save the source as UTF-16 and build this with Microsoft's compiler on a Windows system configured to use code page 1252 then the compiler will convert the UTF-16 representation of 'µ' to the CP1252 representation. If you build this source on a system configured with a different code page, one that does not contain the character, then the compiler will give a warning/error when it fails to convert the character to that code page.
Similarly, if you save the source code as UTF-8 (with the so-called 'BOM', so that the compiler knows the encoding is UTF-8) then it will convert the UTF-8 source representation of the character to the system's code page if possible, whatever that is.
And if you replace 'µ' with a UCN, '\u00B5', the compiler will still do exactly the same thing; it will convert the UCN to the system's code page representation of U+00B5 MICRO SIGN, if possible.
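A minimal sketch of that equivalence (it assumes the compiler decodes the source file's encoding of 'µ' correctly):
#include <iostream>

int main() {
    // Whether written as a UCN or typed directly, the literal must end up
    // with the same value, so this prints 1.
    std::cout << (L'\u00B5' == L'µ') << '\n';
}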
So how can this be solved in a manner that ensures proper encoding schemes & character matches given ANY source input?
I'm not sure what you're asking. I'm guessing you want to ensure that the integral values of char or wchar_t variables/literals are consistent with a certain encoding scheme (probably ASCII since you're only asking about characters in the ASCII range), but what is the 'source input'? The encoding of your lexer's source files? The encoding of the input to your lexer? How do you expect the 'source input' to vary?
Also, for '\n' and '\r', it is my understanding that the actual values/lengths vary between compilers/target OSes...
(i.e. Windows uses '\r\n', while Unix simply uses '\n' and older versions of Mac OS use '\r')
This is a misunderstanding of text mode I/O. When you write the character '\n' to a text mode file the OS can replace the '\n' character with some platform specific representation of a new line. However this does not mean that the actual value of '\n' is any different. The change is made purely within the library for writing files.
For example you can open a file in text mode, write '\n', then open the file in binary mode and compare the written data to '\n', and the written data can differ from '\n':
#include <fstream>
#include <iostream>
int main() {
    char const * filename = "test.txt";
    {
        std::ofstream fout(filename);
        fout << '\n';
    }
    {
        std::ifstream fin(filename, std::ios::binary);
        char buf[100] = {};
        fin.read(buf, sizeof(buf));
        if (sizeof('\n') == fin.gcount() && buf[0] == '\n') {
            std::cout << "text mode written '\\n' matches value of '\\n'\n";
        } else {
            // This will be executed on Windows
            std::cout << "text mode written '\\n' does not match value of '\\n'\n";
        }
    }
}
This also doesn't depend on using the '\n' syntax; you can rewrite the above using 0xA, the ASCII newline character, and the results will be the same on Windows. (I.e., when you write the byte 0xA to a text mode file Windows will actually write the two bytes 0xD 0xA.)
I found that omitting the character literal and simply using the hexadecimal value of the character allows everything to compile just fine.
For example, you would change the following line:
wchar_t c = L'\u0000';
...to:
wchar_t c = 0x0000;
Though I'm still not sure whether this actually holds the same encoding-independent values that a UCN would provide.
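For what it's worth, a small sketch under the assumption that the wide execution character set is Unicode-based (UTF-16 on Windows, UTF-32 on Linux); in that case the plain hexadecimal value and the character written directly agree throughout the 0x00 to 0x7F range:
#include <iostream>

int main() {
    wchar_t a = 0x0041;   // plain numeric value
    wchar_t b = L'A';     // the character written directly
    std::cout << (a == b) << '\n';   // prints 1 on such platforms
}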