In a C++ class I have the following code/while loop:
uint8_t len = 0;
while (*s != ',') {
    len = (uint8_t)(len + 1u);
    ++s;
}
return (len);
The outcome should be a value between 0 and at most 20.
I receive a strange outcome, so I started debugging. When I step through this I get the following values for the variable len:
'\01', '\02', '\03', '\04', '\05', '\06', '\a', '\b', '\t'
I don't understand the change from '\06' to '\a'!
Can somebody explain this? I expect that the len value is simply increased by 1 until the character array pointer s hits the ',' character.
The values are correct, but your debugger interprets them as a char type, not an integer type.
These are the escape sequences used in C++ (with the corresponding values in ASCII):
\01 - 1 in octal, 1 in decimal
\02 - 2 in octal, 2 in decimal
...
\06 - 6 in octal, 6 in decimal
\a - equivalent to \07 (7 octal, 7 decimal), the ASCII code for the bell (alert) character
\b - equivalent to \010 (10 octal, 8 decimal), the ASCII code for the "backspace" character
\t - equivalent to \011 (11 octal, 9 decimal), the ASCII code for the horizontal tab character
etc.
I don't know if you can change the way your debugger interprets the data. Worst case, you can always print the value after casting it to int.
(gdb) p static_cast<int>(len)
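The same cast helps in ordinary program output, too, because iostreams treat uint8_t (usually an alias for unsigned char) as a character type. A minimal sketch, with the wrapper function and sample input made up just so it compiles and runs:

#include <cstdint>
#include <iostream>

// Same counting loop as in the question, wrapped in a function (hypothetical name).
static uint8_t len_until_comma(const char* s) {
    uint8_t len = 0;
    while (*s != ',') {
        len = (uint8_t)(len + 1u);
        ++s;
    }
    return len;
}

int main() {
    uint8_t len = len_until_comma("ABCDEFGHI,rest");   // 9 characters before the ','
    std::cout << len << '\n';                           // streamed as a char: prints a tab ('\t')
    std::cout << static_cast<int>(len) << '\n';         // streamed as an int: prints 9
}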
My problem is as follows.
I'm reading a piece of ASCII data from a sensor; let's say it's "400". It's stored in an array of characters, so as ASCII bytes that would be { 0x34, 0x30, 0x30 }.
What I'm trying to get from that set of characters is the decimal integer corresponding to hex 0x400, which would be 1024. All the other numeric values in this array of ASCII characters are represented in decimal, so I've been using this:
int num_from_ascii(char reading[], int start, int length) {
    printf("++++++++num_from_ascii+++++++++\n");
    char radar_block[length];
    for (int i = 0; i < length; i++) {
        radar_block[i] = reading[start + i];
        printf("%02x ", reading[start + i]);
    }
    printf("\n");
    return atoi(radar_block);
}
This obviously just gives me back 400, but I need a decimal integer from a hex value. Any advice?
As Eugene has suggested, all you need to do is replace atoi(radar_block) by strtol(radar_block, NULL, 16). That takes a "base" argument, which can be 10 for decimal, 16 for hex (which is what you want), etc., or 0 to auto-detect using the usual C/C++ literal rules (leading "0x" for hex, leading "0" for octal).
You should never use atoi anyway, because it does not handle invalid input safely. strtol does everything that atoi does, has well-defined errno behaviour for all edge cases, and also allows you to distinguish "0" from non-numeric input.
As user3121023 mentioned, don't forget to NUL-terminate the string you pass to strtol (this is a serious bug in your code calling atoi as well).
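Putting both fixes together, here is a rough sketch of the function with a NUL-terminated buffer and strtol in base 16 (std::string is used here only as a convenient NUL-terminated buffer; the debug printing is kept from the original):

#include <cstdio>
#include <cstdlib>
#include <string>

int num_from_ascii(char reading[], int start, int length) {
    printf("++++++++num_from_ascii+++++++++\n");
    // Copy the slice into a std::string; c_str() is guaranteed to be NUL-terminated.
    std::string radar_block(reading + start, length);
    for (int i = 0; i < length; i++) {
        printf("%02x ", reading[start + i]);
    }
    printf("\n");
    // Base 16: the characters "400" parse to 0x400 == 1024.
    return (int)strtol(radar_block.c_str(), NULL, 16);
}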
The following string has size 4, not 3 as I would have expected.
std::string s = "\r\n½";
int ss = s.size(); //ss is 4
When I loop through the string character by character, converting each one to hex, I get
0x0D (hex code for carriage return)
0x0A (hex code for line feed)
0xc2 (hex code, but what is this?)
0xbd (hex code for the ½ character)
Where does the 0xc2 come from?
Is it some sort of encoding information? I thought std::string had one char per visible character in the string. Can someone confirm whether 0xc2 is a "character set modifier"?
"½" has, in unicode, the code point U+00BD and is represented by UTF-8 by the two bytes sequence 0xc2bd. This means, your string contains only three characters, but is four bytes long.
See https://www.fileformat.info/info/unicode/char/00bd/index.htm
Additional reading on SO: std::wstring VS std::string.
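You can see this for yourself by dumping the bytes of the string in hex. A minimal sketch, assuming the source file (and therefore the string literal) is encoded as UTF-8:

#include <cstdio>
#include <string>

int main() {
    std::string s = "\r\n½";                 // 4 bytes when the literal is UTF-8-encoded
    std::printf("size = %zu\n", s.size());   // prints: size = 4
    for (unsigned char c : s) {
        std::printf("0x%02X ", (unsigned)c); // prints: 0x0D 0x0A 0xC2 0xBD
    }
    std::printf("\n");
}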
I'm trying to tokenize input consisting of UTF-8 characters. While trying to learn UTF-8 I get output that I cannot understand. When I input the character π (pi), I get three different numbers: 207 128 10. How can I use them to determine which category the character belongs to?
ostringstream oss;
oss << cin.rdbuf();
string input = oss.str();
for (int i = 0; i < input.size(); i++)
{
    unsigned char code_unit = input[i];
    cout << (int)code_unit << endl;
}
Thanks in advance.
Characters encoded with UTF-8 may take up more than a single byte (and often do). The number of bytes used to encode a single code point can vary from 1 to 6 bytes (at most 4 under RFC 3629). In the case of π, the UTF-8 encoding, in binary, is:
11001111 10000000
That is, it is two bytes. You are reading these bytes out individually. The first byte has decimal value 207 and the second has decimal value 128 (if you interpret them as unsigned integers). The following byte that you're reading has decimal value 10; that is the Line Feed character, which you produce when you hit Enter.
If you're going to do any processing of these UTF-8 characters, you're going to need to interpret what the bytes mean. What exactly you'll need to do depends on how you're categorising the characters.
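For example, if you need the actual code point (say, to classify the character), the bytes of one sequence have to be combined. A minimal decoding sketch for 1- to 4-byte sequences, with no validation of malformed input:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// Decode the UTF-8 sequence starting at input[i] and advance i past it.
// No error checking: malformed input is not handled.
uint32_t next_code_point(const std::string& input, std::size_t& i) {
    unsigned char b0 = input[i++];
    if (b0 < 0x80) return b0;                                     // 1 byte:  0xxxxxxx
    uint32_t cp;
    int extra;
    if ((b0 & 0xE0) == 0xC0)      { cp = b0 & 0x1F; extra = 1; }  // 2 bytes: 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }  // 3 bytes: 1110xxxx
    else                          { cp = b0 & 0x07; extra = 3; }  // 4 bytes: 11110xxx
    while (extra-- > 0)
        cp = (cp << 6) | (input[i++] & 0x3F);                     // continuation: 10xxxxxx
    return cp;
}

int main() {
    std::string input = "\xCF\x80\n";   // the bytes 207 128 10 from the question
    std::size_t i = 0;
    while (i < input.size())
        std::cout << next_code_point(input, i) << '\n';           // prints 960 (U+03C0), then 10
}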
I'm using a piece of code (found elsewhere on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
    short int number = 0x1;
    char *numPtr = (char*)&number;
    std::cout << numPtr << std::endl;
    std::cout << *numPtr << std::endl;
    return (numPtr[0] == 1);
}
When in debug mode, the value numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I know that \0 is null termination in old-style C++, but why is it at the front? Is it to do with endianness?
On a little-endian machine: is 01 the first byte and therefore the least significant (byte place 0), and \0 the second/final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or its value. Reasons for this?
The others have given you a clear answer to what "\001" means, so this is an answer to your question:
On a little-endian machine: is 01 the first byte and therefore the least significant (byte place 0), and \0 the second/final byte (byte place 1)?
Yes, this is correct. If you look at a value like 0x1234, it consists of two bytes: the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
Did you know that the term "endian" predates the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described whether people ate their egg from the pointy or the round end.
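You can see that layout directly by copying the object representation of such a value into a byte array; a minimal sketch (the output shown is for a little-endian machine):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint16_t value = 0x1234;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);                        // copy out the raw bytes
    for (std::size_t i = 0; i < sizeof value; ++i)
        std::printf("addr+%zu: 0x%02X\n", i, (unsigned)bytes[i]);    // little endian: 0x34, then 0x12
}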
The easiest way to check for endianness is to let the system do it for you:
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
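As a self-contained sketch (assuming a POSIX system, where htonl is declared in <arpa/inet.h>; on Windows it lives in <winsock2.h>):

#include <arpa/inet.h>
#include <cstdio>

int main() {
    // htonl is the identity on a big-endian host and a byte swap on a little-endian one.
    if (htonl(0xFFFF0000u) == 0xFFFF0000u)
        std::printf("Big endian\n");
    else
        std::printf("Little endian\n");
}

If you can use C++20, std::endian in <bit> provides the same information at compile time.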
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless: on a machine where sizeof(int) is 4, there are 24 possible byte orders. Most, of course, don't make sense, but I've seen at least three. And endianness isn't the only thing which affects integer representation. If you have an int and you want the low-order 8 bits, use intValue & 0xFF; for the next 8 bits, (intValue >> 8) & 0xFF.
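A minimal sketch of that mask-and-shift approach; it reads the value the same way regardless of how the bytes happen to be laid out in memory:

#include <cstdio>

int main() {
    int intValue = 0x12345678;
    unsigned byte0 = intValue & 0xFF;           // low-order 8 bits:  0x78
    unsigned byte1 = (intValue >> 8) & 0xFF;    // next 8 bits:       0x56
    unsigned byte2 = (intValue >> 16) & 0xFF;   // next 8 bits:       0x34
    unsigned byte3 = (intValue >> 24) & 0xFF;   // high-order 8 bits: 0x12
    std::printf("0x%02X 0x%02X 0x%02X 0x%02X\n", byte0, byte1, byte2, byte3);
}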
With regards to your precise question: I presume what you are describing as "looks like this" is what you see in the debugger when you break at the return. In this case, numPtr is a char* (an unsigned char const* would make more sense), so the debugger assumes a C style string. The 0x7fffffffe6ee is the address; what follows is what the debugger sees as a C style string, which it displays as a string, i.e. "...".

Presumably, your platform is a traditional little-endian one (Intel); the C style string the pointer designates is the byte sequence (numeric values) 1, 0. The 0 is of course the equivalent of '\0', so the debugger considers this a one character string, with that one character having the encoding of 1. There is no printable character with an encoding of one, and it doesn't correspond to any of the normal escape sequences (e.g. '\n', '\t', etc.) either. So the debugger outputs it using the octal escape sequence, a '\' followed by 1 to 3 octal digits. (The traditional '\0' is just a special case of this: a '\' followed by a single octal digit.) It outputs 3 digits because (probably) it doesn't want to look ahead to ensure that the next character isn't an octal digit. (If the sequence were the two bytes 1, 49, for example, 49 is '1' in the usual encodings, and if only a single digit were output for the octal encoding of 1, the result would be "\11", which reads as a single character string corresponding in the usual encodings to '\t'.) So what you see is: " meaning this is a string, \001 meaning the first character has an encoding of 1 (and no displayable representation), and " meaning that's the end of the string.
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 isn't a NUL; the debugger is showing you numPtr as a string, the first character of which is \001, or control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two-character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or its value. Reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address; it prints the pointed-to data as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;
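Putting those two casts back into the original function gives something like this (a sketch that keeps the question's structure):

#include <iostream>

static bool isLittleEndian()
{
    short int number = 0x1;
    char *numPtr = (char*)&number;
    std::cout << (void*)numPtr << std::endl;   // the pointer's address, e.g. 0x7fffffffe6ee
    std::cout << (int)*numPtr << std::endl;    // 1 on a little-endian machine, 0 on a big-endian one
    return (numPtr[0] == 1);
}

int main()
{
    std::cout << (isLittleEndian() ? "little endian" : "big endian") << std::endl;
}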