How do I create a character array using decimal/hexadecimal representations of characters instead of the actual characters?
The reason I ask is that I am writing C code and I need to create a string that includes characters not used in the English language. That string would then be parsed and displayed on an LCD screen.
For example, '\0' decodes to 0, and '\n' to 10. Are there any more of these special characters that I can sacrifice to display custom characters? I could send "Temperature is 10\d C" and a degree sign would be printed instead of '\d'. Something like this would be great.
Assuming you have a character code that is a degree sign on your display (with a custom display, I wouldn't necessarily expect it to "live" at the common place in the extended IBM ASCII character set, or that the display supports a Unicode encoding), you can use the escapes \nnn or \xhh, where nnn is up to three octal (base 8) digits and hh is one or more hex digits. Unfortunately, there is no decimal escape available; Dennis Ritchie and/or Brian Kernighan were probably more used to octal, as it was quite common at the time when C was first developed.
E.g.
const char *str = "ABC\101\102\103";
std::cout << str << std::endl;
should print ABCABC (assuming ASCII encoding)
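For the degree-sign case from the question, here is a minimal sketch. The code 0xDF is an assumption: many HD44780-style character LCDs place a degree-like glyph there, but check your display's datasheet for the actual code.
// Assumption: 0xDF is the degree-like glyph on the target display (verify in the datasheet).
const char msg_hex[]   = "Temperature is 10\xDF" "C";  // hex escape, split so the 'C' isn't read as another hex digit
const char msg_octal[] = "Temperature is 10\337C";     // same byte as an octal escape (0337 == 0xDF); octal stops after 3 digits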
You can directly write
char myValues[] = {1,10,33,...};
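If such an array is going to be treated as a string (for example, handed to a routine that expects a nul-terminated buffer), remember to add the terminating 0 yourself. A small sketch with made-up values:
char myMessage[] = {72, 105, 33, 0};  // the bytes for "Hi!" written as code values, plus the terminating 0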
Use \u00b0 to make a degree sign (I simply looked up the Unicode code point for it).
This requires Unicode support in the terminal.
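A minimal sketch of that suggestion, assuming a UTF-8 execution character set and a UTF-8 terminal (note this targets a Unicode-capable terminal rather than a character LCD):
#include <iostream>

int main() {
    std::cout << "Temperature is 10\u00b0C\n";  // \u00b0 is U+00B0 DEGREE SIGN, encoded as 0xC2 0xB0 in UTF-8
}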
Simple: use std::ostringstream and cast the characters:
#include <iostream>
#include <sstream>
#include <string>
int main() {
    std::string s = "hello world";
    std::ostringstream os;
    for (auto const& c : s)
        os << static_cast<unsigned>(c) << ' ';
    std::cout << "\"" << s << "\" in ASCII is \"" << os.str() << "\"\n";
}
Prints:
"hello world" in ASCII is "104 101 108 108 111 32 119 111 114 108 100 "
A little more research and I found the answer to my own question.
A character preceded by a '\' is called an escape sequence.
You can put the octal equivalent of an ASCII code in your string by using an escape sequence from '\000' to '\777'.
The same goes for hex, '\x00' to '\xFF'.
I am printing my custom characters by using '\xC1' to '\xC8', as I only have 8 custom characters.
Everything is done in a single line of code: lcd_putc("Degree \xC1");
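One pitfall to watch with the hex form (an aside, not from the original answer): a hex escape keeps consuming hex digits, so if the next real character is 0-9 or A-F you need to split the literal. A sketch:
// "\xC1C" would be read as one (overlong) hex escape, not '\xC1' followed by 'C'.
// Adjacent string literals are concatenated, which keeps the two apart:
const char msg[] = "Degree \xC1" "C";  // bytes: 'D','e','g','r','e','e',' ',0xC1,'C',0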
I do not understand what's going on here. This is compiled with the GCC 10.2.0 compiler. Printing out the whole string gives a different result than printing out each character.
#include <iostream>
int main(){
char str[] = "“”";
std::cout << str << std::endl;
std::cout << str[0] << str[1] << std::endl;
}
Output
“”
��
Why are the two output lines not the same? I would expect the same line twice. Printing out alphanumeric characters does output the same line twice.
Bear in mind that, on almost all systems, the maximum value a (signed) char can hold is 127. So, more likely than not, your two 'special' characters are actually being encoded as multi-byte combinations.
In such a case, passing the string pointer to std::cout will keep feeding data from that buffer until a zero (nul-terminator) byte is encountered. Further, it appears that, on your system, the std::cout stream can properly interpret multi-byte character sequences, so it shows the expected characters.
However, when you pass the individual char elements, as str[0] and str[1], there is no possibility of parsing those arguments as components of multi-byte characters: each is interpreted 'as-is', and those values do not correspond to valid, printable characters, so the 'weird' � symbol is shown, instead.
"“”" contains more bytes than you think. It's usually encoded as UTF-8. To see that, you can print the size of the array:
std::cout << sizeof str << '\n';
Prints 7 in my testing. UTF-8 is a multi-byte encoding; characters outside the ASCII range are encoded in multiple bytes. Now, you're printing individual bytes of a UTF-8 encoded string, and those bytes are not printable on their own. That's why you get � when you try to print them.
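A quick way to see those bytes for yourself (a sketch, assuming the source file is saved as UTF-8): print every char of the array as an unsigned number.
#include <iostream>

int main() {
    char str[] = "“”";
    for (unsigned char c : str)
        std::cout << std::hex << static_cast<unsigned>(c) << ' ';
    std::cout << '\n';  // typically prints: e2 80 9c e2 80 9d 0  (the final 0 is the nul terminator)
}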
I tried:
#define EURO char(128)
cout << EURO; // only worked on my Windows desktop, not Linux
Or is there a character similar to the euro sign to display?
According to this https://www.compart.com/en/unicode/U+20AC the following should work if your Linux session is configured to use UTF-8:
std::cout << "\xe2\x82\xac" << std::endl;
Note it has to be a string literal, not a single char, as the UTF-8 encoding of the euro sign is 3 bytes.
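An alternative sketch that avoids hard-coding the bytes, assuming the compiler's execution character set is UTF-8 (the default for GCC and Clang; with MSVC you may need the /utf-8 switch):
#include <iostream>

int main() {
    std::cout << "\u20AC\n";  // U+20AC EURO SIGN; comes out as 0xE2 0x82 0xAC when the execution charset is UTF-8
}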
I have to write code in C++ that identifies and counts English and non-English characters in a string.
The user writes an input and the program must count user's letters and notify when it finds non-English letters.
My problem is that I get a question mark instead of the non-English letter!
At the beginning of the code I wrote:
...
#include <clocale>
int main() {
std::setlocale(LC_ALL, "sv_SE.UTF-8");
...
(the locale is Swedish)
If I try to print out Swedish letters before the counting loops (as a test), it does work, so I guess the clocale is working fine.
But when I launch the counting loop below,
for (unsigned char c: rad) {
if (c < 128) {
if (isalpha(c) != 0)
bokstaver++;
}
if (c >= 134 && c <= 165) {
cout << "Your text contains a " << c << '\n';
bokstaver++;
}
}
my non-English letter is taken into account but not printed out with cout.
I used unsigned char since non-English letters are between ASCII 134 and 165, so I really don't know what to do.
test with the word blå:
non-English letters are between ASCII 134 and 165
No, they aren't. Non-English characters are not between any ASCII characters in UTF-8. Non-ASCII characters consist of two or more code units, and none of those individual code units falls in the ASCII range. å, for example, consists of 0xC3 followed by 0xA5.
The C and C++ library functions which only accept a single char (such as std::isalpha) are not useful when using UTF-8 because that single char can only represent a single code unit.
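A hedged sketch of one way to restructure the loop for UTF-8 input: treat bytes below 128 as ASCII, group a lead byte with its continuation bytes (10xxxxxx), and print the whole multi-byte sequence instead of a single byte. It assumes the input really is UTF-8 and crudely counts every non-ASCII code point as a letter rather than doing proper Unicode classification.
#include <cctype>
#include <iostream>
#include <string>

int main() {
    std::string rad = "blå";  // assumed UTF-8 input (stands in for what was read from the user)
    int bokstaver = 0;
    for (std::string::size_type i = 0; i < rad.size(); ) {
        unsigned char c = rad[i];
        if (c < 128) {                          // single-byte (ASCII) code point
            if (std::isalpha(c))
                ++bokstaver;
            ++i;
        } else {                                // lead byte of a multi-byte UTF-8 sequence
            std::string::size_type len = 1;
            while (i + len < rad.size() &&
                   (static_cast<unsigned char>(rad[i + len]) & 0xC0) == 0x80)
                ++len;                          // include the continuation bytes (10xxxxxx)
            std::cout << "Your text contains a " << rad.substr(i, len) << '\n';
            ++bokstaver;                        // crude: count every non-ASCII code point as a letter
            i += len;
        }
    }
    std::cout << bokstaver << " letters\n";
}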
cout << hex << setfill('0');
cout << 12 << setw(2);
output : 0a??????
I have no lead on UTF-8. From my understanding, unless it is a non-ASCII character there is no difference.
What is the C++ equivalent to find the hex of a UTF-8 encoded character not included in ASCII?
Correct me if I am wrong (I'm very new to C++): if I use this expression, does it mean that if I take an output, let's say 12, and set the width to 2, I will get an output of 0a?
The expression itself does not create any output of any sort.
How do I tweak it so it can take a UTF-8 character? Right now I can only deal with ASCII.
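If the goal is just to see the hex of a UTF-8 encoded character, one sketch is to dump every byte of the string as an unsigned value (assuming the input is already UTF-8; the euro sign used here is three bytes). Note that std::setw applies only to the next item written, so it goes inside the loop.
#include <iomanip>
#include <iostream>
#include <string>

int main() {
    std::string s = "€";  // assumed UTF-8: 0xE2 0x82 0xAC
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<unsigned>(c) << ' ';
    std::cout << '\n';  // prints: e2 82 ac
}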
ASCII characters ranging from 32 to 126 are printable. 127 is DEL, and everything after that is considered an extended character.
To check, how are they stored in the std::string, I wrote a test program:
#include <iostream>
#include <string>
using namespace std;

int main ()
{
  string s; // ASCII
  s += "!"; // 33
  s += "A"; // 65
  s += "a"; // 97
  s += "â"; // 131
  s += "ä"; // 132
  s += "à"; // 133
  cout << s << endl; // Print directly
  for(auto i : s) // Print after iteration
    cout << i;
  cout << "\ns.size() = " << s.size() << endl; // outputs 9!
}
The special characters visible in the code above actually look different and those can be seen in this online example (also visible in vi).
In the string s, the first 3 normal characters take 1 byte each, as expected. The next 3 extended characters surprisingly take 2 bytes each.
Questions:
Despite being ASCII (within the range of 0 to 256), why do those 3 extended characters take 2 bytes of space?
When we iterate through s using a range-based loop, how is it figured out that it has to advance 1 byte for normal characters and 2 bytes for extended characters?
[Note: This may also apply to C and other languages.]
Despite being ASCII (within the range of 0 to 256), why do those 3 extended characters take 2 bytes of space?
If you define 'being ASCII' as containing only bytes in the range [0, 256), then all data is ASCII: [0, 256) is the same as the range a byte is capable of representing, thus all data represented with bytes is ASCII, under your definition.
The issue is that your definition is incorrect, and you're looking incorrectly at how data types are determined; The kind of data represented by a sequence of bytes is not determined by those bytes. Instead, the data type is metadata that is external to the sequence of bytes. (This isn't to say that it's impossible to examine a sequence of bytes and determine statistically what kind of data it is likely to be.)
Let's examine your code, keeping the above in mind. I've taken the relevant snippets from the two versions of your source code:
s += "â"; // 131
s += "ä"; // 132
s += "Ã¢"; // 131
s += "Ã¤"; // 132
You're viewing these source code snippets as text rendered in a browser, rather than as raw binary data. You've presented these two things as the 'same' data, but in fact they're not the same: shown above are two different sequences of characters.
However there is something interesting about these two sequences of text elements: one of them, when encoded into bytes using a certain encoding scheme, is represented by the same sequence of bytes as the other sequence of text elements when that sequence is encoded into bytes using a different encoding scheme. That is, the same sequence of bytes on disk may represent two different sequences of text elements depending on the encoding scheme! In other words, in order to figure out what the sequence of bytes means, we have to know what kind of data it is, and therefore what decoding scheme to use.
So here's what probably happened. In vi you wrote:
s += "â"; // 131
s += "ä"; // 132
You were under the impression that vi would represent those characters using extended ASCII, and thus use the bytes 131 and 132. But that was incorrect. vi did not use extended ASCII, and instead it represented those characters using a different scheme (UTF-8) which happens to use two bytes to represent each of those characters.
Later, when you opened the source code in a different editor, that editor incorrectly assumed the file was extended ASCII and displayed it as such. Since extended ASCII uses a single byte for every character, it took the two bytes vi used to represent each of those characters, and showed one character for each byte.
The bottom line is that you're incorrect that the source code is using extended ASCII, and so your assumption that those characters would be represented by single bytes with the values 131 and 132 was incorrect.
When we iterate through s using a range-based loop, how is it figured out that it has to advance 1 byte for normal characters and 2 bytes for extended characters?
Your program isn't doing this. The characters are printing okay in your ideone.com example because independently printing out the two bytes that represent those characters works to display that character. Here's an example that makes this clear: live example.
#include <iostream>

int main() {
    std::cout << "Printed together: '";
    std::cout << (char)0xC3;
    std::cout << (char)0xA2;
    std::cout << "'\n";
    std::cout << "Printed separated: '";
    std::cout << (char)0xC3;
    std::cout << '/';
    std::cout << (char)0xA2;
    std::cout << "'\n";
}
Printed together: 'â'
Printed separated: '�/�'
The '�' character is what shows up when an invalid encoding is encountered.
If you're asking how you can write a program that does do this, the answer is to use code that understands the details of the encoding being used. Either get a library that understands UTF-8 or read the UTF-8 spec yourself.
You should also keep in mind that the use of UTF-8 here is simply because this editor and compiler use UTF-8 by default. If you were to write the same code with a different editor and compile it with a different compiler, the encoding could be completely different; assuming code is UTF-8 can be just as wrong as your earlier assumption that the code was extended ASCII.
Your terminal probably uses UTF-8 encoding. It uses one byte for ASCII characters, and 2-4 bytes for everything else.
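A small sketch that makes the 1-to-4-byte point visible, assuming both the source file and the terminal use UTF-8:
#include <iostream>
#include <string>

int main() {
    for (std::string s : { "a", "é", "€", "😀" })
        std::cout << s << " takes " << s.size() << " byte(s)\n";  // 1, 2, 3 and 4 bytes respectively under UTF-8
}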
The basic source character set for C++ source code does not include extended ASCII characters (ref. §2.3 in ISO/IEC 14882:2011) :
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
So, an implementation has to map those characters from the source file to characters in the basic source character set, before passing them on to the compiler. They will likely be mapped to universal character names, following ISO/IEC 10646 (UCS) :
The universal-character-name construct provides a way to name other characters.
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN.
A universal character name in a narrow string literal (as in your case) may be mapped to multiple chars, using multibyte encoding (ref. §2.14.5 in ISO/IEC 14882:2011) :
In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding.
That's what you're seeing for those 3 last characters.
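That multibyte mapping is easy to observe; a sketch, assuming a UTF-8 execution character set (the default for GCC and Clang):
#include <iostream>

int main() {
    // The universal-character-name \u00e4 ('ä') maps to two chars under a UTF-8 execution charset.
    std::cout << sizeof("\u00e4") - 1 << '\n';  // prints 2 (array size minus the nul terminator)
}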