I'm trying to learn more about encoding. I know that CR is 0x0d in hex and LF is 0x0a, but CRLF is apparently neither 0x0d 0x0a nor 0x0d0a: I tried std::cout << std::hex << (int)'\r\n' in C++ and the result was 0x0d.
So, is CRLF == CR? And are these hex values the same on all operating systems?
Edit
The following is the result when tried on a Windows 10 machine using MSVC (v16.2.0-pre.2.0):
#include <iostream>
int main() {
    const char crlf = '\r\n';
    std::cout << std::hex << (int)crlf << std::endl;
    std::cout << std::hex << (int)'\r\n' << std::endl;
    std::cout << std::hex << (int)'\n\r' << std::endl;
}
Output
d
a0d
d0a
If you write '\r\n', your compiler should warn you, since that's a multi-character literal, which is implementation-defined and rarely used for that reason. In this case it looks like the compiler discarded the extra character.
Yes, CR is 0xd and LF is 0xa in the ASCII standard. The C standard doesn't require ASCII by itself, as far as I know, so in theory they could be something else. That's why we write \n instead of 0xa (also for clarity). But practically every system in use today uses ASCII as the basis of its character set, possibly extended.
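To check those values without relying on a multi-character literal at all, print each character on its own; a minimal sketch:

#include <iostream>

int main() {
    // On an ASCII-based system this prints d, then a.
    std::cout << std::hex << (int)'\r' << '\n';
    std::cout << std::hex << (int)'\n' << '\n';
}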
Conclusion
CR is a char and equals 0x0d in hex.
LF is a char and equals 0x0a in hex.
CRLF is a character sequence, equal to CR and LF separately, so it is 0x0d 0x0a in hex (as mentioned by @tkausl).
The explanation of the result I got is that const char crlf = '\r\n'; is compiled down to \n (at least by MSVC).
When I looked at the assembly output, I found this comment: ; 544 : // doing \r\n -> \n translation
Thanks for all of the helpful comments.
Related
When I wrote this code:
#include <iostream>
using namespace std;

int main(){
    char x[] = "γεια σας";
    cout << x;
    return 0;
}
I noticed that the compiler gave me the output I expected, γεια σας, although the type of the array is char. That is, it should just accept ASCII characters.
So why didn't the compiler give an error?
Here's some code showing what C++ really does:
#include <cstring>
#include <iostream>
#include <iomanip>
using namespace std;

int main(){
    char x[] = "γεια σας";
    cout << x << endl;
    auto len = strlen(x);  // needs <cstring>; counts bytes, not characters
    cout << "Length (in bytes): " << len << endl;
    for (size_t i = 0; i < len; i++)
        cout << "0x" << setw(2) << hex << static_cast<int>(static_cast<unsigned char>(x[i])) << ' ';
    cout << endl;
    return 0;
}
The output is:
γεια σας
Length (in bytes): 15
0xce 0xb3 0xce 0xb5 0xce 0xb9 0xce 0xb1 0x20 0xcf 0x83 0xce 0xb1 0xcf 0x82
So the string takes up 15 bytes and is encoded as UTF-8. UTF-8 is a Unicode encoding using between 1 and 4 bytes per character (in the sense of the smallest unit you can select with the text cursor). UTF-8 can be saved in a char array. Even though it's called char, it basically corresponds to a byte and not what we typically think of as a character.
What you have got with 99.99% likelihood is Unicode code points stored in UTF-8 format. Each code point is turned into one to four chars.
Unicode in the ASCII range is turned into one ASCII byte from 0x00 to 0x7f. There are 2048 code points translated to two bytes with the binary pattern 110x xxxx 10yy yyyy, 65536 are translated to three bytes 1110 xxxx 10yy yyyy 10zz zzzz, and the rest become four bytes 1111 0xxx 10yy yyyy 10zz zzzz 10uu uuuu.
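To make those bit patterns concrete, here is a minimal sketch of the mapping; encode_utf8 is a hypothetical helper, and it does no validation (surrogates and out-of-range input are not rejected):

#include <cstdint>
#include <iostream>
#include <string>

// Encode one Unicode code point into UTF-8 bytes, following the
// bit patterns described above.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {            // 0xxx xxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {    // 110x xxxx 10yy yyyy
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {   // 1110 xxxx 10yy yyyy 10zz zzzz
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                     // 1111 0xxx 10yy yyyy 10zz zzzz 10uu uuuu
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    // U+03B3 (γ) encodes as 0xce 0xb3, matching the dump above.
    for (unsigned char b : encode_utf8(0x3B3))
        std::cout << "0x" << std::hex << static_cast<int>(b) << ' ';
    std::cout << '\n';
}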
Most C and C++ string functions work just fine with UTF-8. An exception is strncpy or strncat, which could create an incomplete code point. The old interview problem "reverse the characters in a string" becomes more complicated, because reversing the bytes inside a code point produces nonsense.
Although the type of the array is char, that is, it should just accept ASCII characters.
You've assumed wrongly.
Unicode has several transformation formats. One popular such format is UTF-8. The code units of UTF-8 are 8 bits wide, as implied by the name. It is always possible to use char to represent the code units of UTF-8, because char is guaranteed to be at least 8 bits wide.
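To see the code unit versus character distinction in numbers, here is a minimal sketch that counts code points by skipping UTF-8 continuation bytes; it assumes the source file is saved as UTF-8:

#include <cstddef>
#include <iostream>

int main() {
    char x[] = "γεια σας";  // stored as UTF-8 code units in plain char
    std::size_t code_points = 0;
    for (std::size_t i = 0; x[i] != '\0'; ++i) {
        // Continuation bytes match the pattern 10xx xxxx; count only
        // the bytes that start a new code point.
        if ((static_cast<unsigned char>(x[i]) & 0xC0) != 0x80)
            ++code_points;
    }
    std::cout << "Bytes: " << sizeof x - 1
              << ", code points: " << code_points << '\n';  // Bytes: 15, code points: 8
}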
I do not understand what's going on here. This is compiled with GCC 10.2.0. Printing out the whole string is different from printing out each character.
#include <iostream>

int main(){
    char str[] = "“”";
    std::cout << str << std::endl;
    std::cout << str[0] << str[1] << std::endl;
}
Output
“”
��
Why are the two printed lines not the same? I would expect the same line twice. Printing alphanumeric characters does output the same line twice.
Bear in mind that, on almost all systems, the maximum value a (signed) char can hold is 127. So, more likely than not, your two 'special' characters are actually being encoded as multi-byte combinations.
In such a case, passing the string pointer to std::cout will keep feeding data from that buffer until a zero (nul-terminator) byte is encountered. Further, it appears that, on your system, the std::cout stream can properly interpret multi-byte character sequences, so it shows the expected characters.
However, when you pass the individual char elements, as str[0] and str[1], there is no possibility of parsing those arguments as components of multi-byte characters: each is interpreted 'as-is', and those values do not correspond to valid, printable characters, so the 'weird' � symbol is shown, instead.
"“”" contains more bytes than you think. It's usually encoded as utf8. To see that, you can print the size of the array:
std::cout << sizeof str << '\n';
This prints 7 in my testing. UTF-8 is a multi-byte encoding, meaning each character is encoded in multiple bytes. Here you're printing single bytes of a UTF-8 encoded string, and a lone byte from a multi-byte sequence is not printable by itself. That's why you get � when you try to print them.
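If the bytes of one character are written together, the terminal can reassemble them. A minimal sketch, assuming both the source file and the terminal use UTF-8 (each quote mark is 3 bytes there):

#include <iostream>

int main() {
    char str[] = "“”";           // 6 bytes of UTF-8 plus the terminating nul
    std::cout << str << '\n';    // whole sequence: prints “”
    std::cout.write(str, 3);     // the first 3 bytes together: prints “
    std::cout << '\n';
    std::cout << str[0] << '\n'; // one byte alone is not valid UTF-8: prints �
}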
cout << hex << setfill('0');
cout << 12 << setw(2);
output : 0a??????
I have no lead on UTF-8. From my understanding, unless it is a non-ASCII character, there is no difference.
What is the C++ equivalent to find the hex of a UTF-8 encoded character not included in ASCII?
Correct me if I am wrong, I am very new to C++: if I use this expression, does it mean that if I take an output of, let's say, 12 and set the width to 2, I will get an output of 0a?
The expression itself does not create any output of any sort.
How do I tweak it so it can take a UTF-8 character? Right now I can only deal with ASCII.
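One way to adapt the same setw/setfill idea to UTF-8 is to hex-dump each byte of the encoded string. A minimal sketch, assuming a UTF-8 source encoding; note that setw only applies to the next output item, so it has to be repeated inside the loop:

#include <cstddef>
#include <iomanip>
#include <iostream>

int main() {
    char s[] = "é";  // two bytes in UTF-8: 0xc3 0xa9
    std::cout << std::hex << std::setfill('0');
    for (std::size_t i = 0; s[i] != '\0'; ++i)
        // Cast through unsigned char to avoid sign extension for
        // bytes above 0x7f.
        std::cout << std::setw(2)
                  << static_cast<int>(static_cast<unsigned char>(s[i])) << ' ';
    std::cout << '\n';  // prints: c3 a9
}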
How do I create a character array using the decimal/hexadecimal representation of characters instead of the actual characters?
The reason I ask is that I am writing C code, and I need to create a string that includes characters not used in the English language. That string would then be parsed and displayed on an LCD screen.
For example, '\0' decodes to 0 and '\n' to 10. Are there any more of these special characters that I can sacrifice to display custom characters? I could send "Temperature is 10\d C" and a degree sign would be printed instead of '\d'. Something like this would be great.
Assuming you have a character code that is a degree sign on your display (with a custom display, I wouldn't necessarily expect it to "live" at the usual place in the extended IBM ASCII character set, or that the display supports Unicode character encoding), you can use the encoding \nnn or \xhh, where nnn is up to three digits in octal (base 8) and hh is up to two digits of hex code. Unfortunately, there is no decimal encoding available; Dennis Ritchie and/or Brian Kernighan were probably more used to using octal, as that was quite common at the time when C was first developed.
E.g.
const char *str = "ABC\101\102\103";
cout << str << endl;
should print ABCABC (assuming ASCII encoding)
You can directly write
char myValues[] = {1,10,33,...};
Use \u00b0 to make a degree sign (I simply looked up the Unicode code point for it).
This requires unicode support in the terminal.
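For example, a minimal sketch, assuming the compiler's execution character set and the terminal both use UTF-8:

#include <iostream>

int main() {
    // \u00b0 is the universal character name for U+00B0 DEGREE SIGN;
    // the compiler converts it to the execution character set.
    std::cout << "Temperature is 10\u00b0C\n";
}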
Simple: use std::ostringstream and cast the characters:
#include <iostream>
#include <sstream>
#include <string>
int main() {
    std::string s = "hello world";
    std::ostringstream os;
    for (auto const& c : s)
        os << static_cast<unsigned>(c) << ' ';
    std::cout << "\"" << s << "\" in ASCII is \"" << os.str() << "\"\n";
}
Prints:
"hello world" in ASCII is "104 101 108 108 111 32 119 111 114 108 100 "
A little more research and I found the answer to my own question.
A '\' followed by certain characters forms an escape sequence.
You can put the octal equivalent of an ASCII code in your string by using an escape sequence from '\000' to '\377'.
The same goes for hex: '\x00' to '\xFF'.
I am printing my custom characters by using '\xC1' to '\xC8', as I only had 8 custom characters.
Everything is done in a single line of code: lcd_putc("Degree \xC1");
For example, I want to write std::string my_str( "foobar\n" ); literally into std::ostream& with the backslash-n intact (no formatting).
Perhaps I need to convert my_str with a formatting function that converts backslash to double-backslash first? Is there a standard library function for that?
Or maybe there is a directive I can pass to std::ostream&?
The easiest way to do this for larger strings is with raw string literals:
std::string my_str(R"(foobar\n)");
If you want parentheses in it, use a delimiter:
R"delim(foobar\n)delim"
I don't know of anything that will let you keep the escape codes in the string but output it without them. However, std::transform with an std::ostream_iterator<std::string> destination and a function that handles the cases you want should do it, as in this example:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

// Map each escape character to its printable two-character form.
std::string filter(char c) {
    if (c == '\n') return "\\n";
    if (c == '\t') return "\\t";
    // etc.
    return std::string(1, c);  // note: braces {1, c} would build a two-character string instead
}

int main() {
    std::string str{"abc\nr\tt"};
    std::cout << "Without transform: " << str << '\n';
    std::cout << "With transform: ";
    std::transform(std::begin(str), std::end(str), std::ostream_iterator<std::string>(std::cout), filter);
    std::cout << '\n';
}
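With the corrected filter, the first line prints the string with a real newline and tab in it, while the second line prints abc\nr\tt with the escapes spelled out.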
Since "foobar\n" is a constant literal, just write it as "foobar\\n", so that "\\" becomes "\", letting n to leave in peace.
What you call "formatting" is not.
The stream does not format the string. The substitution of "\n" with char(10) is made by the compiler when producing the actual value form the literal. (ostream will at most translate char(10) into { char(10), char(13) } if the underlying platform requires it, but that's another story).
This is not related to std::ostream or std::string. "foobar\n" is a literal where \n already means end of line.
You have two options:
Escape the \n with a backslash: std::string str("foobar\\n");
Use a modern C++11 raw string literal: std::string str(R"(foobar\n)");
The "backslash + n" is not formatted by the stream; for example the length of std::string("\n") is 1 (and not 2). Likewise '\n' is a single character. The fact that you write backslash + n is just a shortcut to represent non-printable (and non-ASCII) characters.
As another example, '\x0a' == '\n' (because 0a is the hexadecimal code for the line-feed character). And std::string("\x0a").size() == 1 too.
If (on Linux [1]) you open a std::ofstream and write '\x0a' to that stream, you will end up with a file containing a single byte, whose hexadecimal value is 0a.
As such, it is not your stream that is transforming what you wrote; it is the compiler. Depending on your use case, you may want to either:
change the string as written in the code, using "foobar\\n" (note this increases the length by 1)
perform a transformation while streaming, to print the hexadecimal or escape code of non-printable characters
[1] On Windows, the '\n' character is translated to "\r\n" (carriage return + line feed) in text mode.
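To observe that translation directly, here is a minimal sketch; it assumes you inspect the resulting files with a hex viewer:

#include <fstream>

int main() {
    std::ofstream text("text.txt");                  // text mode (the default)
    text << '\n';                                    // 0d 0a on Windows, 0a on Linux
    std::ofstream bin("bin.txt", std::ios::binary);  // binary mode suppresses the translation
    bin << '\n';                                     // a single 0a byte on every platform
}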