storing 0 in an unsigned char array - c++

I have an unsigned char array like this:
unsigned char myArr[] = {100, 128, 0, 32, 2, 9};
I am using reinterpret_cast to convert it to a const char* because I have to pass a const char* to a method. This information is then sent over grpc, and the other application (Erlang-based) receives it and stores it in an Erlang binary. But what I observe is that the Erlang application only receives <<100, 128>> and nothing after that. What could be causing this? Is it the 0 in the character array that is the problem here? Could someone explain how to handle the 0 in an unsigned char array? I did read quite a few answers but nothing clearly explains my problem.

What could be causing this? Is it the 0 in the character array that is the problem here?
Most likely, yes.
Probably one of the functions that the pointer is passed to is specified to accept an argument that points to a null-terminated string. Your array happens to be incidentally null terminated by containing a null character at index 2, which is where the string terminates. Such a function typically has well-defined behavior only when the array is null terminated, so passing a pointer to arbitrary binary data that might not contain a null character would be quite dangerous.
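For illustration, here is a minimal standalone sketch (not the grpc call itself) showing how any length measurement that assumes null termination sees only the first two bytes:

#include <cstring>
#include <iostream>

int main() {
    unsigned char myArr[] = {100, 128, 0, 32, 2, 9};
    const char* p = reinterpret_cast<const char*>(myArr);
    // strlen stops at the first 0 byte, which sits at index 2, so any
    // API that measures the "string" this way sees only <<100, 128>>.
    std::cout << std::strlen(p) << '\n'; // prints 2
}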
Could someone explain how to handle the 0 in an unsigned char array?
Don't pass the array into functions that expect null terminated strings. This includes most formatted output functions and most functions in <cstring>. The documentation of the function should mention the pre-conditions.
If such functions are your only option, then you can encode the binary data in a textual format and decode it at the other end. A commonly used textual encoding for binary data is Base64, although it is not necessarily the most compact.
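Before reaching for Base64, check whether the method (or a nearby overload) can take a pointer-plus-length pair or a std::string; protobuf bytes fields, for instance, are typically set from a std::string. A minimal sketch, assuming a std::string can be handed over:

#include <iostream>
#include <string>

int main() {
    unsigned char myArr[] = {100, 128, 0, 32, 2, 9};
    // Constructing the string with an explicit length preserves the
    // embedded 0 byte; nothing here relies on null termination.
    std::string payload(reinterpret_cast<const char*>(myArr), sizeof(myArr));
    std::cout << payload.size() << '\n'; // prints 6, the 0 at index 2 included
}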

Related

What happens if strncpy copies random data into a buffer?

Suppose you have a char buffer that you want to copy an std::string into. Are there consequences of copying extra data into the buffer, outside of the string's scope, even though the buffer has adequate size?
Example
std::string my_string = "hello";
char my_buffer[128];
memset(my_buffer, 0, 128);
strncpy(my_buffer, my_string.c_str(), 128);
So "hello" gets copied into my_buffer, but so will 123 other bytes of data that comes after my_string. Are there any consequences of this? Is it harmful for the buffer to hold this other data?
but so will 123 other bytes of data that comes after my_string
This assumption is incorrect: strncpy pays attention to null termination of the source string, never reading past null terminator. The remaining data will be set to '\0' characters:
destination is padded with zeros until a total of num characters have been written to it. [reference]
This is in contrast to memcpy, which requires both the source and the destination to be of sufficient size in order to avoid undefined behavior.
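A quick way to convince yourself of the padding behaviour (a standalone sketch with a smaller buffer, not the asker's exact code):

#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    char my_buffer[16];
    std::strncpy(my_buffer, "hello", sizeof my_buffer);
    // strncpy zero-padded the tail: bytes 5..15 are all '\0'.
    for (std::size_t i = 5; i < sizeof my_buffer; ++i)
        std::printf("%d ", my_buffer[i]); // prints eleven 0s
    std::printf("\n");
}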
OK, let's assume what you wanted is:
strncpy(my_buffer, my_string.c_str(), 128);
The source, my_string.c_str(), is always a 0-terminated string by definition, so considering:
Copies at most count characters of the character array pointed to by src (including the terminating null character, but not any of the characters that follow the null character) to character array pointed to by dest.
You won't get anything copied after "hello" from the original string; the rest will be 0s:
If, after copying the terminating null character from src, count is not reached, additional null characters are written to dest until the total of count characters have been written.
According to the strncpy() description here [1], the copy is done up to the length you provide for a null-terminated source string; when the end of the string comes first, as in this case, copying stops there, so the rest of the "123 bytes" is not copied and the copy loop terminates.
The other answers to this question have addressed what happens with strncpy() (i.e. it will copy your string correctly because it stops at the 0-terminator byte), so perhaps what gets more to the intent of the question would be, what if you had this instead?
memcpy(my_buffer, my_string.c_str(), 128);
In this case, the function (memcpy()) doesn't know about 0-terminated-string semantics, and will always just blindly copy 128 bytes (starting at the address returned by my_string.c_str()) to the address my_buffer. The first 6 (or so) of those bytes will be from my_string's internal buffer, and the remaining bytes will be from whatever happens to be in memory after that.
So the question is, what happens then? Well, this memcpy() call reads from "mystery memory" whose purpose you're not aware of, so you're invoking undefined behavior by doing that, and therefore in principle anything could happen. In practice, the likely result is that your buffer will contain a copy of whatever bytes were read (although you probably won't notice them, since you'll be using string functions that don't look past the 0/terminator byte in your array anyway).
There is a small chance, though, that the "extra" memory bytes that memcpy() read could be part of a memory-page that is marked as off-limits, in which case trying to read from that page would likely cause a segmentation fault.
And finally, there's the real bugaboo of undefined behavior: your C++ compiler's optimizer is allowed to make all kinds of radical modifications to your code's logic in the name of efficiency, and (assuming the optimizer isn't buggy) all of those optimizations will still leave the program running as intended -- as long as the program follows the rules and doesn't invoke undefined behavior. If your program invokes undefined behavior in any way, the optimizations may be applied in ways that are very difficult to predict or understand, resulting in bizarre and unexpected behavior. So the general rule is: avoid undefined behavior like the plague, because even if you think it "should be harmless", there's a very real possibility that it will end up doing things you wouldn't expect, and then you're in for a long, painful debugging session as you try to figure out what's going on.
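If you do need memcpy-style copying, the safe pattern is to bound the copy by the source's real size instead of the destination's. A minimal sketch using the question's names:

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

int main() {
    std::string my_string = "hello";
    char my_buffer[128] = {}; // zero-initialized, so the result stays terminated
    // Copy only what the source actually owns; never read past its storage.
    std::size_t n = std::min(my_string.size(), sizeof my_buffer - 1);
    std::memcpy(my_buffer, my_string.data(), n); // my_buffer[n] is already '\0'
}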
(This assumes you mean actually copying that extra data; your current code, using strncpy, would stop at the null terminator.)
Essentially, no. Whatever data ends up in there will be used only as string data, so unless you then try to do something weird with that spare data (point a function pointer at it and execute it, etc.) it's basically safe.
The real issue is with initially copying random data past the end of your original string: that read could stray into protected memory that you don't have access to and throw a segfault (inaccessible memory exception).
The strncpy function will copy everything up to the NUL character or the specified size, whichever comes first. That's it.
In addition to the other answers, please remember that strncpy does not NUL-terminate the dst buffer when the src string length (excluding the NUL) is equal to or greater than the buffer size given. You should always do something like my_buffer[127] = 0 to prevent buffer overruns when handling the string later on. libbsd defines strlcpy, which always NUL-terminates the buffer.
Check out this snippet:
#include <stdio.h>
#include <string.h>
#include <bsd/string.h> /* strlcpy is a BSD extension; on Linux it comes from libbsd */

int main(void)
{
    char buf[8];
    const char *str = "deadbeef"; /* 8 chars: fills buf with no room for a NUL */
    strlcpy(buf, str, sizeof(buf));
    printf("%s\n", buf);          /* "deadbee": strlcpy truncated and terminated */
    strncpy(buf, str, sizeof(buf));
    printf("%s\n", buf);          /* undefined: buf is not NUL terminated here */
    return 0;
}
To see the problem, compile it with the flag -fsanitize=address. The second buffer is not NUL terminated!

Correctly formatting data for I2C (Wire.write)

I am connecting multiple Arduino Mega units together to create a bank of IO all controlled by a master on an I2C bus.
I had it working with the slave populating a string with the status of the analog inputs etc. each separated by a colon. The string would then be looped through with a Wire.write.
The initial reqNo would tell the master which batch were being returned. E.g. batch 0 would be analog 0 - 5, batch 1 would be analog 6 - 11 etc.
It was all working, until further reading led me to an article that advocated against using strings due to memory usage and related issues. I have tried to refactor my code to avoid the use of strings; however, now I am getting output like this:
:⸮:⸮:⸮:⸮:⸮:⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮⸮W
returned instead of my expected output.
I think this is an encoding issue or similar? Could anyone please offer advice on what I'm doing wrong, or another way of achieving this? It's pretty important that the device functions for very long periods of time without reboots or any issues, which is why I was pretty keen to remove Strings if they could cause issues.
Master code:
int i = 0;
char res[32] = "";
while (Wire.available()) {
    char c = Wire.read();
    Serial.print(c);
    res[i] = c;
    i++;
}
Slave code:
void requestStatus() {
    int i;
    Wire.write(reqNo);
    if (reqNo == 0) {
        for (i = 0; i < 6; i++) {
            Wire.write(':');
            Wire.write(analogRead(i));
        }
    } else if (reqNo == 1) {
        for (i = 6; i < 12; i++) {
            Wire.write(':');
            Wire.write(analogRead(i));
        }
    } else if (reqNo == 2) {
        for (i = 12; i < 16; i++) {
            Wire.write(':');
            Wire.write(analogRead(i));
        }
    }
    reqNo++;
    if (reqNo == 3) {
        reqNo = 0;
    }
}
There are two problems with your code. First:
...
// master side
char c=Wire.read();
...
// slave side
Wire.write(analogRead(i));
you are treating integer values as if they were ASCII encoded; they are not. You have to convert them to ASCII at some point (e.g. on the master side). Consider using sscanf or snprintf for the conversion.
Second, you are not NUL terminating a C string.
Wire.write(reqNo); for instance will write a character of code 0, 1 or 2, not the character '0', '1' or '2', if that is what you expected.
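For instance, a minimal sketch of the slave side using snprintf as suggested (standard Arduino Wire API assumed; only batch 0 is shown and the buffer size is illustrative):

void requestStatus() {
    char msg[32];
    int len = 0;
    // Build one NUL-terminated ASCII message of the form ":val:val:...".
    for (int i = 0; i < 6; i++) {
        len += snprintf(msg + len, sizeof(msg) - len, ":%d", analogRead(i));
    }
    // Send the text plus its terminator so the master can find the end.
    Wire.write((const uint8_t*)msg, len + 1);
}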
It was all working, until further reading led me to an article that advocated against using strings due to memory usage and related issues.
I assume it was talking about String objects and not C strings (NUL-terminated char arrays).
String objects use dynamic memory allocation and cause heap fragmentation. C strings are usually statically allocated (char res[32];) and, if defined inside a function, are cleaned up automatically with the stack.
write only accepts a single byte, an array of bytes (or chars) with an explicit length, or a C string.
Passing it an int will just convert it to a byte and keep only the lower 8 bits.
What you need is the print method. It converts integers to characters on the fly. It doesn't use String objects or C strings.
You should replace all of your Wire.write with Wire.print.
Also, in your master code you are not NUL terminating the res char array, which will cause problems, since there is no explicit end to the string. You should also make sure you don't write more than 31 characters into it (the NUL terminator also counts as a character) and cause a buffer overflow.
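A sketch of the master-side loop with both fixes applied (the 32-byte size comes from the original res declaration):

int i = 0;
char res[32] = "";
while (Wire.available() && i < (int)sizeof(res) - 1) { // leave room for the NUL
    char c = Wire.read();
    Serial.print(c);
    res[i++] = c;
}
res[i] = '\0'; // res is now a valid C string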

Which is more efficient way for storing strings, an array or a pointer in C/C++? [duplicate]

This question already has answers here:
What is the difference between char s[] and char *s?
(14 answers)
Closed 6 years ago.
We can store a string using two methods.
Method 1: using an array
char a[]="str";
Method 2: using a pointer
char *b="str";
In method 1 the memory is used only for storing the string "str", so the memory used is 4 bytes.
In method 2 the memory is used for storing the string "str" in read-only memory and then for storing the pointer to the 1st character of the string.
So the memory used must be 4 bytes for storing the string in ROM and then 8 bytes for storing the pointer (on a 64-bit machine) to the first character.
In total, method 1 uses 4 bytes and method 2 uses 12 bytes. So is method 1 always better than method 2 for storing strings in C/C++?
Unless you use a highly resource-limited system, you should not care too much about the memory used by a pointer. Anyway, optimizing compilers could generate the same code in both cases.
You should care more about undefined behaviour in the second case!
char a[] = "str";
correctly declares a non-const character array which is initialized to "str". That means that a[0] = 'S'; is perfectly allowed and will change a to "Str".
But with
char *b = "str";
you declare a non-const pointer to a literal char array, which is implicitly const. That means that b[0] = 'S'; tries to modify a literal string and is undefined behaviour => it can work, segfault, or do anything in between, including not changing the string.
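The safe spellings, for comparison (a small standalone sketch; both forms compile cleanly as standard C++):

#include <cstdio>

int main() {
    char a[] = "str";      // writable copy of the literal
    a[0] = 'S';            // fine: modifies the local array
    const char* b = "str"; // pointer to the literal, correctly marked const
    // b[0] = 'S';         // now a compile-time error instead of UB
    std::puts(a);          // prints "Str"
    std::puts(b);          // prints "str"
}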
All of the numbers that you cite, and the type of the memory where the string literal is stored are platform specific.
Which is more efficient way for storing strings, an array or a pointer
Some pedantry about terminology: a pointer cannot store a string; it stores an address. The string is always stored in an array, and a pointer can point to it. String literals in particular are stored in an array of static storage duration.
Method 1: using array char a[]="str";
This makes a copy of the content of the string literal, into a local array of automatic storage duration.
Method 2: char *b="str";
You may not bind a non-const pointer to a string literal in standard C++. This is ill-formed in that language (since C++11; prior to that, the conversion was merely deprecated). Even in C (and in extensions of C++) where this conversion is allowed, it is quite dangerous, since you might accidentally pass the pointer to a function that tries to modify the pointed-to string. Const correctness replaces accidental UB with a compile-time error.
Ignoring that, this doesn't make a copy of the literal, but points to it instead.
So is method 1 always better than method 2 for storing strings in C/C++?
Memory use is not the only metric that matters. Method 1 requires copying of the string from the literal into the automatic array. Not making copies is usually faster than making copies. This becomes more and more important with increasingly longer strings.
The major difference between method 1 and 2, are that you may modify the local array of method 1 but you may not modify string literals. If you need a modifiable buffer, then method 2 doesn't give you that - regardless of its efficiency.
Additional considerations:
Suppose your system is not a RAM-based PC computer but rather a computer with true non-volatile memory (NVM), such as a microcontroller. The string literal "str" would then in both cases get stored in NVM.
In the array case, the string literal has to be copied down from NVM in run-time, whereas in the pointer case you don't have to make a copy, you can just point straight at the string literal.
This also means that on such systems, assuming 32 bit, the array version will occupy 4 bytes of RAM for the array, while the pointer version will occupy 4 bytes of RAM for the pointer. Both cases will have to occupy 4 bytes of NVM for the string literal.

int8_t and char: converts between pointers to integer types with different sign - but it doesn't

I'm working with some embedded code and I am writing something new from scratch, so I prefer to stick with the uint8_t, int8_t and so on types.
However, when porting a function:
void functionName(char *data)
to:
void functionName(int8_t *data)
I get the compiler warning "converts between pointers to integer types with different sign" when passing a literal string to the function. ( i.e. when calling functionName("put this text in"); ).
Now, I understand why this happens, and these lines are only debug; however, I wonder what people feel is the most appropriate way of handling this, short of typecasting every literal string. I don't feel that blanket typecasting is any safer in practice than using potentially ambiguous types like "char".
You seem to be doing the wrong thing, here.
Characters are not defined by C as being 8-bit integers, so why would you ever choose to use int8_t or uint8_t to represent character data, unless you are working with UTF-8?
A C string literal has type array of char (which decays to pointer to char), and char is not at all guaranteed to be 8-bit.
It's also not defined whether char is signed or unsigned, so just use const char * for string literals.
To answer your addendum (the original question was nicely answered by @unwind): I think it mostly depends on the context. If you are working with text, i.e. string literals, you have to use const char* or char*, because the compiler will convert the characters accordingly. Short of writing your own string implementation, you are probably stuck with whatever the compiler provides. However, the moment you have to interact with someone/something outside of your CPU context, e.g. network, serial, etc., you have to have control over the exact size (which I suppose is where your question stems from). In this case I would suggest writing functions to convert strings, or any data type for that matter, to uint8_t buffers for serialized sending (or receiving).
const char* my_string = "foo bar!";
uint8_t *buffer = string2sendbuffer(my_string);
my_send(buffer, destination);
The string2sendbuffer function would know everything there is to know about putting characters in a buffer. For example, it might know that you have to encode each char into two buffer elements using big-endian byte ordering. This function is most certainly platform dependent, but it encapsulates all of that platform dependence, so you would gain a lot of flexibility.
The same goes for every other complex data type. For everything else (where the compiler does not have as strong an opinion) I would advise using the (u)intX_t types provided by stdint.h (which should be portable).
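string2sendbuffer and my_send above are names invented for this answer, not library functions. A minimal sketch of what the conversion helper might look like, assuming a trivial one-byte-per-char encoding:

#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: copies a NUL-terminated string into a freshly
// allocated uint8_t buffer. A real version would apply whatever encoding
// and byte ordering the wire protocol requires.
uint8_t* string2sendbuffer(const char* s)
{
    size_t n = strlen(s) + 1;                        // include the NUL
    uint8_t* buf = static_cast<uint8_t*>(malloc(n)); // caller frees
    if (buf)
        memcpy(buf, s, n); // char -> uint8_t is a plain byte copy here
    return buf;
}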
It is implementation-defined whether the type char is signed or unsigned. It looks like you are using an environment where it is unsigned.
So, you can either use uint8_t or stick with char, whenever you are dealing with characters.

About std::basic_ostream::write

I'm reading up on the write method of basic_ostream objects and this is what I found on cppreference:
basic_ostream& write( const char_type* s, std::streamsize count );
Behaves as an UnformattedOutputFunction. After constructing and checking the sentry object, outputs the characters from successive locations in the character array whose first element is pointed to by s. Characters are inserted into the output sequence until one of the following occurs:
exactly count characters are inserted
inserting into the output sequence fails (in which case setstate(badbit) is called)
So I get that it writes a chunk of characters from a buffer into the stream. And the number of characters are the bytes specified by count. But there are a few things of which I'm not sure. These are my questions:
Should I use write only when I want to specify how many bytes I want to write to a stream? Because normally when you print a char array it will print the entire array until it reaches the null byte, but when you use write you can specify how many characters you want written.
char greeting[] = "Hello World";
std::cout << greeting; // prints the entire string
std::cout.write(greeting, 5); // prints "Hello"
But maybe I'm misinterpreting something with this one.
And I often see this in code samples that use write:
stream.write(reinterpret_cast<char*>(buffer), sizeof(buffer));
Why is the reinterpret_cast to char* being used? When should I know to do something like that when writing to a stream?
If anyone can help me with these two question it would be greatly appreciated.
Should I use write only when I want to specify how many bytes I want to write to a stream?
Yes - you should use write when there's a specific number of bytes of data arranged contiguously in memory that you'd like written to the stream in order. But sometimes you might want a specific number of bytes and need to get them another way, such as by formatting a double's ASCII representation to have specific width and precision.
Other times you might use <<, but that has to be user-defined for non-builtin types, and when it is defined (normally for the better, but it may be worse for your purposes) it prints whatever the class designer chose, including potentially data that's linked from the object via pointers or references, static data of interest, and/or values calculated on the fly. It may change the data representation: say, converting binary doubles to ASCII representations, or ensuring a network byte order regardless of the host's endianness. It may also omit some of the object's data, such as cache entries, counters used to manage (but not logically part of) the data, array elements that aren't populated, etc.
Why is the reinterpret_cast to char* being used? When should I know to do something like that when writing to a stream?
The write() function signature expects a const char* argument, so this conversion is being done. You'll need to use a cast whenever you can't otherwise get a char* to the data.
The cast reflects the way write() treats the data starting at the first byte of the object: as 8-bit values, without any consideration of the actual pre-cast type of the data. This ties in with being able to do things like a write() of the last byte of a float and the first 3 bytes of a double appearing next in the same structure - all the data boundaries and interpretation are lost after the reinterpret_cast<>.
(You've actually got to be more careful about this when doing a read() of bytes from an input stream... say you read data that constituted a double when written into memory that's not aligned appropriately for a double, then try to use it as a double: you may get a SIGBUS or similar alignment exception from your CPU, or degraded performance, depending on your system.)
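To make the cast concrete, here's a minimal round-trip sketch with a trivially copyable struct (the file name and struct are invented for illustration):

#include <fstream>

struct Sample { float f; double d; }; // trivially copyable

int main() {
    Sample out{1.5f, 2.5};
    std::ofstream("sample.bin", std::ios::binary)
        .write(reinterpret_cast<const char*>(&out), sizeof out);

    Sample in{};
    std::ifstream("sample.bin", std::ios::binary)
        .read(reinterpret_cast<char*>(&in), sizeof in);
    // in now holds the same bytes as out on this platform/ABI;
    // note this is not a portable serialization format.
}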
basic_ostream::write and its counterpart basic_istream::read, is used to perform unformatted I/O on a data stream. Typically, this is raw binary data which may or may not contain printable ascii characters.
The main difference between read/write and the formatted operators like <<, >>, getline, etc. is that the former don't make any assumptions about the data being worked on -- you have full control over which bytes get read from and written to the stream -- whereas the latter may skip over whitespace, discard or ignore it, and so on.
To answer your second question, the reinterpret_cast<char *> is there to satisfy the function signature and to work with the buffer a byte at a time. Don't let the type char fool you: the reason char is used is that it's the smallest built-in primitive type provided by the language. Perhaps a better name would be something like uint8 to indicate that it's really an unsigned byte type.