how to copy a wchar_t from a char array - c++

Assume I've an array of strings which contain some chinese chars inside.
Eg: " This is a sample 在按键 needs to be tested"
^ ^
| |
start end
I need to extract only the chinese alone from the char array.
Thanks
Vijay

Pseudo-code (in my gcc world ...sorry, no MS dev access tonight):
wcsncpy(wcDest, wcschr(" This is a sample 在按键 needs to be tested", "在"), 4);
The wcschr() function is the wide-character equivalent of the strchr() function.
From the wcschr() man page:
"It searches the first occurrence of wc in the wide-character string pointed to by wcs."
The wcsncpy() function is the wide-character equivalent of the strncpy().
From the wcsncpy() man page:
"It copies at most n wide characters from the wide-character string pointed to by src, including the terminating null wide character (Laq\0aq), to the array pointed to by dest. Exactly n wide characters are written at dest. If the length wcslen(src) is smaller than n, the remaining wide characters in the array pointed to by dest are filled with null wide characters. If the length wcslen(src) is greater or equal to n, the string pointed to by dest will not be terminated by a null wide character."

Related

Base64 encoded String too big, trailing characters truncated in c++

I have an image which I have to convert to base64. After the conversion, below is its value:
"
and so on...
This a quite a big value. I need to put this in a char data[] like below:
char sPostData[21070] = "{ \"image\" : \"<base64 encoded value>\" , \"name\": \"dev\"}";
but it throws this error:
Error C2026 string too big, trailing characters truncated
How can I resolve it?
The Microsoft compiler imposes a limit of 16380 single-byte characters for a string literal. The documentation says
Prior to adjacent strings being concatenated, a string cannot be longer than 16380 single-byte characters.
Break the string into adjacent chunks, something like
char[] = "a whole bunch of characters"
"a whole bunch more characters"
" and even more characters";
According to the documentation for that error, there is a limit of 16380 bytes in a character array (characters for narrow strings, fewer for Unicode).
Character string pointers (const char *) have a different limit, 65535 bytes.

C++ character array doesn't take inputs of more than 4 characters

I'm trying to make a char array in C++ that will store a limited number of characters that I set (in this case 5). My program looks like this:
char name[5];
cout << "Enter 5 character name: ";
cin.getline(name, 5);
cout << name;
I defined a char variable named "name" and set it to store only 5 characters, but whenever I run the program and try to enter anything more than 4 characters, the program truncates anything longer than 4 characters. This happens even if I change the number of characters in the char definition or use a cin statement.
I think it's because of the null terminator character, it might also depend on the version of your C++ compiler, anyways reading the reference here http://www.cplusplus.com/reference/istream/istream/getline/...
It says:
Extracts characters from the stream as unformatted input and stores them into s as a c-string, until either the extracted character is the delimiting character, or n characters have been written to s (including the terminating null character).
If it includes the null character terminator, then it should be reserved a place at the very end of the character array, that's why if you have a 5 characters char array, cin.getline fills in 4 only.

Printing unicode literals in C

I am making an OpenVG application for Raspberry Pi that displays some text and I need a support for foreign characters (Polish in this case). I plan to prepare a function that maps unicode characters to literals in C in some higher level language but for now there's a problem with printing those literals in C.
Given the code below:
//both output the "ó" character, as expected
char A[] = "\xF3";
wchar_t B[] = L"\xF3";
//"ś" is expected as output but instead I get character with code 0x5B - "["
char A[] = "\x15B";
wchar_t B[] = L"\x15B";
Most of Polish characters have 3-digit hexadecimal codes. When I attempt to print "ś" (0x15B), it prints character "[" (0x5B) instead. It turns out I cannot print any unicode characters with more than 2-digit codes.
Is used data type the cause? I have considered using char16_t and char32_t but the header files are nowhere to be found in the system.
It's what in this
char A[]={'\xc5','\x9b'};
c59b is "ś" (0x15B) by UTF-8.

How many bytes does a string take? A char?

I'm doing a review of my first semester C++ class, and I think I missing something. How many bytes does a string take up? A char?
The examples we were given are, some being character literals and some being strings:
'n', "n", '\n', "\n", "\\n", ""
I'm particularly confused by the usage of newlines in there.
#include <iostream>
int main()
{
std::cout << sizeof 'n' << std::endl; // 1
std::cout << sizeof "n" << std::endl; // 2
std::cout << sizeof '\n' << std::endl; // 1
std::cout << sizeof "\n" << std::endl; // 2
std::cout << sizeof "\\n" << std::endl; // 3
std::cout << sizeof "" << std::endl; // 1
}
Single quotes indicate characters.
Double quotes indicate C-style strings with an invisible NUL
terminator.
\n (line break) is only a single char and so is \\ (backslash). \\n is just a backslash followed by n.
'n': is not a string, is a literal char, one byte, the character code for the letter n.
"n": string, two bytes, one for n and one for the null character every string has at the end.
"\n": two bytes as \n stand for "new line" which takes one byte, plus one byte for the null char.
'\n': same as the first, literal char, not a string, one byte.
"\\n": three bytes.. one for \, one for newline and one for the null character
"": one byte, just the null character.
A char, by definition, takes up one byte.
Literals using ' are char literals; literals using " are string literals.
A string literal is implicitly null-terminated, so it will take up one more byte than the observable number of characters in the literal.
\ is the escape character and \n is a newline character.
Put these together and you should be able to figure it out.
The following will take x consecutive chars in memory:
'n' - 1 char (type char)
"n" - 2 chars (above plus zero character) (type const char[2])
'\n' - 1 char
"\n" - 2 chars
"\\n" - 3 chars ('\', 'n', and zero)
"" - 1 char
edit: formatting fixed
edit2: I've written something very stupid, thanks Mooing Duck for pointing that out.
The number of bytes a string takes up is equal to the number of characters in the string plus 1 (the terminator), times the number of bytes per character. The number of bytes per character can vary. It is 1 byte for a regular char type.
All your examples are one character long except for the second to last, which is two, and the last, which is zero. (Some are of type char and only define a single character.)
'n' -> One char. A char is always 1 byte. This is not a string.
"n" -> A string literal, containing one n and one terminating NULL char. So 2 bytes.
'\n' -> One char, A char is always 1 byte. This is not a string.
"\n" -> A string literal, containing one \n and one terminating NULL char. So 2 bytes.
"\\n" -> A string literal, containing one \, one '\n', and one terminating NULL char. So 3 bytes.
"" -> A string literal, containing one terminating NULL char. So 1 byte.
You appear to be referring to string constants. And distinguishing them from character constants.
A char is one byte on all architectures. A character constant uses the single quote delimiter '.
A string is a contiguous sequence of characters with a trailing NUL character to identify the end of string. A string uses double quote characters '"'.
Also, you introduce the C string constant expression syntax which uses blackslashes to indicate special characters. \n is one character in a string constant.
So for the examples 'n', "n", '\n', "\n":
'n' is one character
"n" is a string with one character, but it takes two characters of storage (one for the letter n and one for the NUL
'\n' is one character, the newline (ctrl-J on ASCII based systems)
"\n" is one character plus a NUL.
I leave the others to puzzle out based on those.
'n' - 0x6e
"n" - 0x6e00
'\n' - 0x0a
"\n" - 0x0a00
"\\n" - 0x5c6e00
"" - 0x00
Depends if using UTF8 a char is 1byte if UTF16 a char is 2bytes doesn't matter if the byte is 00000001 or 10000000 a full byte is registered and reserved for the character once declared for initialization and if the char changes this register is updated with the new value.
a strings bytes is equal to the number of char between "".
example: 11111111 is a filled byte,
UTF8 char T = 01010100 (1 byte)
UTF16 char T = 01010100 00000000 (2 bytes)
UTF8 string "coding" = 011000110110111101100100011010010110111001100111 (6 bytes)
UTF16 string "coding" = 011000110000000001101111000000000110010000000000011010010000000001101110000000000110011100000000 (12 bytes)
UTF8 \n = 0101110001101110 (2 bytes)
UTF16 \n = 01011100000000000110111000000000 (4 bytes)
Note: Every space and every character you type takes up 1-2 bytes in the compiler but there is so much space that unless you are typing code for a computer or game console from the early 90s with 4mb or less you shouldn't worry about bytes in regards to strings or char.
Things that are problematic to memory are calling things that require heavy computation with floats, decimals, or doubles and using math random in a loop or update methods. That would better be ran once at runtime or on a fixed time update and averaged over the time span.

Understanding Endianness - a variable value

I'm using a piece of code (found else where on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
short int number = 0x1;
char *numPtr = (char*)&number;
std::cout << numPtr << std::endl;
std::cout << *numPtr << std::endl;
return (numPtr[0] == 1);
}
When in debug mode, the value numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I'm know that \0 is null termination in old-style C++, but why is it at the front? Is it to do with endianness?
On a little-endian machine: 01 the first byte and therefore least significant (byte place 0), and \0 the second byte/final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or it's value. Reasons for this?
The others have given you a clear answer to what "\000" means, so this is an answer to your question:
On a little-endian machine: 01 the first byte and therefore least significant (byte place 0), and \0 the second byte/final byte (byte place 1)?
Yes, this is correct. Of you look at value like 0x1234, it consists of two bytes, the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
Did you known that the term "endian" predated the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described if people were eating the egg from the pointy or the round end.
the easiest way to check for endianness is to let the system do it for you:
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless: on a machine
where sizeof(int) is 4, there are 24 possible byte orders. Most, of
course, don't make sense, but I've seen at least three. And endianness
isn't the only thing which affects integer representation. If you have
an int, and you want to get the low order 8 bits, use intValue &
0xFF, for the next 8 bits, (intValue >> 8) & 0xFF.
With regards to your precise question: I presume what you are describing
as "looks like this" is what you see in the debugger, when you break at
the return. In this case, numPtr is a char* (a unsigned char
const* would make more sense), so the debugger assumes a C style
string. The 0x7fffffffe6ee is the address; what follows is what the
compiler sees as a C style string, which it displays as a string, i.e.
"...". Presumably, your platform is a traditional little-endian
(Intel); the pointer to the C style string sees the sequence (numeric
values) of 1, 0. The 0 is of course the equivalent of '\0', so it
considers this a one character string, with that one character having
the encoding of 1. There is no printable character with an encoding of
one, and it doesn't correspond to any of the normal escape sequences
(e.g. '\n', '\t', etc.) either. So the debugger outputs it using
the octal escape sequence, a '\' followed by 1 to 3 octal digits.
(The traditional '\0' is just a special case of this; a '\' followed
by a single octal digit.) And it outputs 3 digits, because (probably)
it doesn't want to look ahead to ensure that the next character isn't an
octal digit. (If the sequence were the two bytes 1, 49, for example,
49 is '1' in the usual encodings, and if it output only a single byte
for the octal encoding of 1, the results would be "\11", which is a
single character string—corresponding in the usual encodings to
'\t'.) So you get " this is a string, \001 with first character
having an encoding of 1 (and no displayable representation), and "
that's the end of the string.
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 isn't a NUL, the debugger is showing you numPtr as a string, the first character of which is \001 or control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or
it's value. Reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address, it prints it as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;