I do not understand what's going on here. This is compiled with GCC 10.2.0. Printing the whole string gives a different result from printing each character.
#include <iostream>
int main(){
char str[] = "“”";
std::cout << str << std::endl;
std::cout << str[0] << str[1] << std::endl;
}
Output
“”
��
Why aren't the two output lines the same? I would expect the same line twice. Printing alphanumeric characters does output the same line twice.
Bear in mind that, on almost all systems, the maximum value a (signed) char can hold is 127. So, more likely than not, your two 'special' characters are actually being encoded as multi-byte combinations.
In such a case, passing the string pointer to std::cout will keep feeding bytes from that buffer until a zero (NUL terminator) byte is encountered. std::cout itself just forwards those bytes; it appears that, on your system, the terminal can properly interpret multi-byte character sequences, so it shows the expected characters.
However, when you pass the individual char elements, as str[0] and str[1], you are sending only the first two bytes of a longer multi-byte sequence. Those two bytes on their own do not form a valid, printable character, so the 'weird' � symbol is shown, instead.
"“”" contains more bytes than you think. It's usually encoded as UTF-8. To see that, you can print the size of the array:
std::cout << sizeof str << '\n';
Prints 7 in my testing: three bytes for each quotation mark plus the NUL terminator. UTF-8 is a variable-length encoding: ASCII characters take one byte, and every other character takes two to four. You're printing individual bytes of a UTF-8 encoded string, which are not printable on their own. That's why you get � when you try to print them.
I have an issue in which the size of the string is affected by the presence of a '\0' character. I searched all over SO and still could not find the answer.
Here is the snippet.
int main()
{
std::string a = "123123\0shai\0";
std::cout << a.length();
}
http://ideone.com/W6Bhfl
The output in this case is
6
Whereas the same program with a different string, having digits instead of letters after the '\0',
int main()
{
std::string a = "123123\0123\0";
std::cout << a.length();
}
http://ideone.com/mtfS50
gives an output of
8
What exactly is happening under the hood? How does the presence of a '\0' character change the behavior?
The sequence \012 when used in a string (or character) literal is an octal escape sequence. It's the octal number 12 which corresponds to the ASCII linefeed ('\n') character.
That means your second string is actually equal to "123123\n3\0" (plus the actual string literal terminator).
It would have been very clear if you tried to print the contents of the string.
Octal sequences are one to three digits long, and the compiler will use as many digits as possible.
If you check the coloring at ideone you will see that \012 has a different color. That is because this is a single character written in octal.
I am making an OpenVG application for Raspberry Pi that displays some text, and I need support for foreign characters (Polish in this case). I plan to write a tool in some higher-level language that maps Unicode characters to C literals, but for now there's a problem with printing those literals in C.
Given the code below:
//both output the "ó" character, as expected
char A[] = "\xF3";
wchar_t B[] = L"\xF3";
//"ś" is expected as output but instead I get character with code 0x5B - "["
char A[] = "\x15B";
wchar_t B[] = L"\x15B";
Most Polish characters have 3-digit hexadecimal codes. When I attempt to print "ś" (0x15B), it prints the character "[" (0x5B) instead. It turns out I cannot print any Unicode characters with codes longer than 2 digits.
Is the data type I'm using the cause? I have considered using char16_t and char32_t, but the header files are nowhere to be found on the system.
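What you want is this:
char A[]={'\xc5','\x9b','\0'};
C5 9B is the UTF-8 encoding of "ś" (U+015B); the trailing '\0' makes the array usable as a string. A hex escape in a narrow literal describes a single byte, so \x15B does not fit in a char, which is consistent with the 0x5B you observed.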
Assume I've an array of strings which contain some chinese chars inside.
Eg: " This is a sample 在按键 needs to be tested"
^ ^
| |
start end
I need to extract only the Chinese characters from the char array.
Thanks
Vijay
Pseudo-code (in my gcc world ...sorry, no MS dev access tonight):
wcsncpy(wcDest, wcschr(L" This is a sample 在按键 needs to be tested", L'在'), 3);
wcDest[3] = L'\0'; /* wcsncpy does not terminate when the source is longer than n */
The wcschr() function is the wide-character equivalent of the strchr() function.
From the wcschr() man page:
"It searches the first occurrence of wc in the wide-character string pointed to by wcs."
The wcsncpy() function is the wide-character equivalent of the strncpy().
From the wcsncpy() man page:
"It copies at most n wide characters from the wide-character string pointed to by src, including the terminating null wide character (L'\0'), to the array pointed to by dest. Exactly n wide characters are written at dest. If the length wcslen(src) is smaller than n, the remaining wide characters in the array pointed to by dest are filled with null wide characters. If the length wcslen(src) is greater or equal to n, the string pointed to by dest will not be terminated by a null wide character."
I'm doing a review of my first semester C++ class, and I think I'm missing something. How many bytes does a string take up? A char?
The examples we were given are, some being character literals and some being strings:
'n', "n", '\n', "\n", "\\n", ""
I'm particularly confused by the usage of newlines in there.
#include <iostream>
int main()
{
std::cout << sizeof 'n' << std::endl; // 1
std::cout << sizeof "n" << std::endl; // 2
std::cout << sizeof '\n' << std::endl; // 1
std::cout << sizeof "\n" << std::endl; // 2
std::cout << sizeof "\\n" << std::endl; // 3
std::cout << sizeof "" << std::endl; // 1
}
Single quotes indicate characters.
Double quotes indicate C-style strings with an invisible NUL terminator.
\n (line break) is only a single char and so is \\ (backslash). \\n is just a backslash followed by n.
'n': not a string but a char literal; one byte, the character code for the letter n.
"n": a string; two bytes, one for n and one for the null character every string has at the end.
"\n": two bytes, as \n stands for "new line", which takes one byte, plus one byte for the null char.
'\n': same as the first: a char literal, not a string; one byte.
"\\n": three bytes: one for \, one for n, and one for the null character.
"": one byte, just the null character.
A char, by definition, takes up one byte.
Literals using ' are char literals; literals using " are string literals.
A string literal is implicitly null-terminated, so it will take up one more byte than the observable number of characters in the literal.
\ is the escape character and \n is a newline character.
Put these together and you should be able to figure it out.
The following will take x consecutive chars in memory:
'n' - 1 char (type char)
"n" - 2 chars (above plus zero character) (type const char[2])
'\n' - 1 char
"\n" - 2 chars
"\\n" - 3 chars ('\', 'n', and zero)
"" - 1 char
The number of bytes a string takes up is equal to the number of characters in the string plus 1 (the terminator), times the number of bytes per character. The number of bytes per character can vary. It is 1 byte for a regular char type.
All your examples are one character long except for the second to last, which is two, and the last, which is zero. (Some are of type char and only define a single character.)
'n' -> One char. A char is always 1 byte. This is not a string.
"n" -> A string literal, containing one n and one terminating NUL char. So 2 bytes.
'\n' -> One char. A char is always 1 byte. This is not a string.
"\n" -> A string literal, containing one \n and one terminating NUL char. So 2 bytes.
"\\n" -> A string literal, containing one \, one n, and one terminating NUL char. So 3 bytes.
"" -> A string literal, containing one terminating NUL char. So 1 byte.
You appear to be referring to string constants. And distinguishing them from character constants.
A char is one byte on all architectures. A character constant uses the single quote delimiter '.
A string is a contiguous sequence of characters with a trailing NUL character to identify the end of string. A string uses double quote characters '"'.
Also, you introduce the C string constant escape syntax, which uses backslashes to indicate special characters. \n is one character in a string constant.
So for the examples 'n', "n", '\n', "\n":
'n' is one character
"n" is a string with one character, but it takes two characters of storage (one for the letter n and one for the NUL)
'\n' is one character, the newline (ctrl-J on ASCII based systems)
"\n" is one character plus a NUL.
I leave the others to puzzle out based on those.
'n' - 0x6e
"n" - 0x6e00
'\n' - 0x0a
"\n" - 0x0a00
"\\n" - 0x5c6e00
"" - 0x00
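It depends on the encoding. In UTF-8 an ASCII character takes 1 byte (other characters take 2 to 4); in UTF-16 a character takes 2 bytes (4 for characters outside the Basic Multilingual Plane). It doesn't matter whether the byte is 00000001 or 10000000: a full byte is reserved for the character once it is declared, and if the character changes, that storage is updated with the new value.
A string's size in bytes is the number of characters between the "" times the bytes per character (in C and C++, plus one terminator).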
example: 11111111 is a filled byte,
UTF8 char T = 01010100 (1 byte)
UTF16 char T = 01010100 00000000 (2 bytes)
UTF8 string "coding" = 011000110110111101100100011010010110111001100111 (6 bytes)
UTF16 string "coding" = 011000110000000001101111000000000110010000000000011010010000000001101110000000000110011100000000 (12 bytes)
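UTF8 \n as typed in source (backslash, n) = 0101110001101110 (2 bytes); the compiled newline = 00001010 (1 byte)
UTF16 \n as typed in source (backslash, n) = 01011100000000000110111000000000 (4 bytes); the compiled newline = 0000101000000000 (2 bytes)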
Note: every space and every character you type takes up 1 to 2 bytes, but memory is so plentiful that unless you are writing code for a computer or game console from the early 90s with 4 MB or less, you shouldn't worry about the bytes used by strings or chars.
What tends to be costly instead is heavy computation with floats, decimals, or doubles, and calling a random-number function inside a loop or update method. Such work is better run once at startup, or on a fixed time step and averaged over the time span.