Append to String a Signed Int (Converted to Bytes) in Big Endian - c++

I have a 4-byte signed integer, and (i) I want to reverse the byte order, and (ii) I want to store the 4 bytes as bytes of a string. I am working in C++. To convert the byte order to big endian I was using ntohl, but I cannot use that due to the fact that my numbers can also be negative.
Example:
int32_t a = -123456;
string s;
s.append(reinterpret_cast<char*>(reinterpret_cast<void*>(&a))); // int to byte
Further, when I am appending these data, it seems that I am appending 8 bytes instead of 4, why?
I need to use the append (I cannot use memcpy or something else).
Do you have any suggestion?

I was using ntohl, but I cannot use that due to the fact that my numbers can also be negative.
It's unclear why you think that negative number would be a problem. It's fine to convert negative numbers with ntohl.
s.append(reinterpret_cast<char*>(reinterpret_cast<void*>(&a)));
std::string::append(const char*) requires that the argument points to a null-terminated string. An integer is not null terminated (unless it happens to contain a byte that coincidentally represents a null terminator character). As a result of violating this requirement, the behaviour of the program is undefined.
Do you have any suggestion?
To fix this bug, you can use the std::string::append(char*, size_type) overload instead:
s.append(reinterpret_cast<char*>(&a), sizeof a);
reinterpret_cast<char*>(reinterpret_cast<void*>
The inner cast to void* is redundant. It makes no difference.
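As a rough sketch of how the byte order and the append could be handled together without relying on ntohl: the helper name append_big_endian below is hypothetical, and the shifts are just one way to produce big-endian bytes regardless of the host's endianness. Converting to uint32_t first keeps negative values well defined.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the four bytes of value to out,
// most significant byte first (big endian), independent of host byte order.
void append_big_endian(std::string& out, std::int32_t value) {
    // Conversion to unsigned is well defined (modulo 2^32), also for negatives.
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    const char bytes[4] = {
        static_cast<char>((u >> 24) & 0xFF),
        static_cast<char>((u >> 16) & 0xFF),
        static_cast<char>((u >> 8) & 0xFF),
        static_cast<char>(u & 0xFF),
    };
    // The (pointer, count) overload copies exactly 4 bytes, embedded zeros included.
    out.append(bytes, sizeof bytes);
}
```

Calling append_big_endian(s, -123456) on an empty string should leave s with exactly four bytes, 0xFF 0xFE 0x1D 0xC0.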

Related

C++ Turning Character types into int type

So I read and was taught that subtracting '0' from my given character turns it into an int, however my Visual Studio isn't recognizing that here, saying a value of type "const char*" cannot be used to initialize an entity of type int in C++ programming here.
bigint::bigint(const char* number) : bigint() {
int number1 = number - '0'; // error code
for (int i = 0; number1 != 0 ; ++i)
{
digits[i] = number1 % 10;
number1 /= 10;
digits[i] = number1;
}
}
The goal of the first half is to simply turn the given number into a type int. The second half is outputting that number backwards with no leading zeroes. Please note this function is a part of the class declared in a header file here:
class bigint {
public:
static const int MAX_DIGITS = 50;
private:
int digits[MAX_DIGITS];
public:
// constructors
bigint();
bigint(int number);
bigint(const char * number);
};
Is there any way to convert the char parameter to an int so I can then output an int? Without using the std library or strlen, since I know there is a way to use the '0' char but I can't seem to be doing it right.
You can turn a single character in the range '0'..'9' into a single digit 0..9 by subtracting '0', but you cannot turn a string of characters into a number by subtracting '0'. You need a parsing function like std::stoi() to do the conversion work character-by-character.
But that's not what you need here. If you convert the string to a number, you then have to take the number apart. The string is already in pieces, so:
bigint::bigint(const char* number) : bigint() {
int i = 0; // next free slot in digits
while (*number && i < MAX_DIGITS) // keep looping until we hit the string's null terminator (or run out of slots)
{
digits[i++] = *number - '0'; // store the digit for the current character
number++; // advance the string to the next character
}
}
There could be some extra work involved in a more advanced version, such as sizing digits appropriately to fit the number of digits in number. Currently we have no way to know how many slots are actually in use in digits, and this will lead to problems later when the program has to figure out where to stop reading digits.
I don't know what your understanding is, so I will go over everything I see in the code snippet.
First, what you're passing to the function is a pointer to a char, with const keyword making the char immutable or "read only" if you prefer.
A char is actually an 8-bit integer [1]. It can store a numerical value in binary form, which can also be interpreted as a character.
Fundamental types - cppreference.com
The standard also expects char to be a "type for character representation". It could be represented in ASCII, but it could be something else, such as EBCDIC. For future reference, just remember that ASCII is not guaranteed, although you are unlikely ever to use a system where it is not ASCII. But it's not so much that char somehow enforces an encoding: it's the functions that you pass those chars and char pointers to that interpret their content as characters in ASCII, while on some obscure or legacy platforms they could interpret them in some less common encoding. The standard does, however, demand that the encoding has this property: the codes for characters '0' to '9' are consecutive, and thus '9' - '0' means: subtract the code of '0' from the code of '9'. The result is 9, because the code for '9' is 9 positions after the code for '0'. In ASCII the ranges 'a'-'z' and 'A'-'Z' are contiguous as well, in case you need that, but the standard does not guarantee it (in EBCDIC the letters are not contiguous), and it gets a little trickier if your input is in a base higher than 10, like the popular base 16, called hexadecimal.
A pointer stores an address, so the most basic functionality for it is to "point" to a variable. But it can be used in various ways, one of which, very frequent in C, is to store address of the beginning of an array of variables of the same type. Those could be chars. We could interpret such an array as a line of text, or a string (a concept, not to be confused with C++ specific string class).
Since a pointer does not contain information on the length or end of such an array, we need to get that information across to the function we pass the pointer to. Sometimes we can just provide the length, sometimes we provide the end pointer. When dealing with "lines of text" or C-style strings, we use (and C standard library functions expect) what is called a null-terminated string. In such a string, the first char after the last one used for the text is a null character, which is, to simplify, basically a 0. A 0, but not a '0'.
So what you're passing to the function, and what you interpret as, say, 416, is actually a pointer to a place in memory where '4' is encoded and stored as a number, followed by '1' and then '6', taking up three bytes. And depending on how you obtained this line of text, '6' is probably followed by a null character (often loosely written NULL), that is, a zero.
NULL - cppreference.com
Conversion of such a string to a number first requires a data type able to hold it. In case of 416 it could be anything from short upwards. If you wanted to do that on your own, you would need to iterate over entire line of text and add the numbers multiplied by proper powers of 10, take care of signedness too and maybe check if there are any edge cases. You could however use a standard function like int atoi (const char * str);
atoi - cplusplus.com
Now, that would be nice of course, but you're trying to work with "bigints". However you define them, it means your class's purpose is to deal with numbers too big to be stored in built-in types. So there is no way you can convert them just like that.
What you're trying to write right now seems to be a constructor that creates a bigint out of a number represented as a C-style string. You want to store your bigint internally as an array of its digits in base 10 (a good choice for code simplicity, readability and maintainability, as well as interoperation with a base 10 textual representation, although it doesn't make efficient use of memory and processing power), and your input is also an array of digits in base 10, except that internally you're storing numbers as numbers, while your input is encoded characters. You need to:
sanitize the input (you need criteria for what kind of input is acceptable, e.g. whether there can be any leading or trailing whitespace, whether the number can be followed by non-numerical characters to be discarded, how to represent signedness, whether + for positive numbers is optional or forbidden, etc.), and throw an exception if the input is invalid.
convert whatever standard you enforce for your input into whatever uniform standard you employ internally, e.g. strip leading whitespace, remove the + sign if it's optional and you don't use it internally, etc.
when you know which positions in your internal array correspond to which positions in the input string, you can iterate over it and copy every digit, decoding it first from its character code.
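The three steps above might look roughly like the following sketch. DigitArray is a hypothetical stand-in for the asker's bigint class, and the input policy (an optional leading '+', leading zeros stripped, anything else rejected) and the least-significant-digit-first storage order are assumptions made only for illustration. It also uses exceptions from the standard library, which the original question wanted to avoid.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the asker's class.
// Digits are stored least significant first; length counts the slots in use.
struct DigitArray {
    static const int MAX_DIGITS = 50;
    int digits[MAX_DIGITS] = {};
    int length = 0;

    explicit DigitArray(const char* text) {
        // Step 1: sanitize (assumed policy: optional '+', then decimal digits only).
        if (text == nullptr) throw std::invalid_argument("null input");
        if (*text == '+') ++text;
        // Step 2: normalize by stripping leading zeros, keeping a lone "0".
        while (*text == '0' && text[1] != '\0') ++text;
        int n = 0;
        while (text[n] != '\0') {
            if (text[n] < '0' || text[n] > '9')
                throw std::invalid_argument("non-digit character");
            if (++n > MAX_DIGITS)
                throw std::invalid_argument("too many digits");
        }
        if (n == 0) throw std::invalid_argument("empty input");
        // Step 3: decode each character and copy it to its slot.
        for (int i = 0; i < n; ++i)
            digits[i] = text[n - 1 - i] - '0';
        length = n;
    }
};
```

Tracking length alongside the array is one way to address the concern that the program otherwise cannot tell how many slots are in use.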
A side note - I can't be sure as to what exactly it is that you expect your input to be, because it's only likely that it is a textual representation - as it could just as easily be an array of unencoded chars. Of course it's obviously the former, which I know because of your post, but the function prototype (the line with return type and argument types) does not assure anyone about that. Just another thing to be aware of.
Hope this answer helped you understand what is happening there.
PS. I cannot emphasize strongly enough that the biggest problem with your code is that even if this line worked:
int number1 = number - '0'; // error code
You'd be trying to store a number on the order of 10^50 in a variable capable of holding a number on the order of 10^9.
The crucial part of this problem, which I have a vague feeling you may have found on spoj.com, is that you're handling BIGints: integers too big to be stored in a trivial manner.
[1] The standard does not directly require char to be this size; it only requires char to be at least 8 bits, and it can be more on obscure platforms. And yes, I think there were some platforms where it was indeed over 8 bits. The same goes for pointers, which may behave strangely on obscure architectures.

optimizing string search for a single byte

In general, string search algorithms (like Boyer-Moore) are optimized for cases where the search string is long-ish. That is, Boyer-Moore is great because by lining up the search string with our text, we can skip N = len(search string) characters if the end of the search string doesn't match the text.
But what if our search string is really short? Like a single byte or character? In this case, Boyer-Moore doesn't help much.
So, what are some alternative algorithms for speeding up searching?
I know that many optimized library search routines (like memchr in C), take the strategy of reading the input string word by word, rather than char by char. So on a 64-bit machine, 8 bytes can be examined at once, rather than a single byte.
I'd like to know how these optimized string/byte searches actually work. How does the actual comparison work then? I know it obviously must involve bit masking - but I don't see how performing all the bit masking is any better than simply search character by character.
So, suppose our search char is 0xFF. Ignoring alignment issues, let's say we have some input buffer: void* buf. We can read it word by word by saying:
const unsigned char search_char = 0xFF;
unsigned char* bufptr = static_cast<unsigned char*>(buf);
unsigned char* bufend = bufptr + BUF_SIZE;
while (bufptr != bufend)
{
// Ignore alignment concerns for now, assume BUF_SIZE % sizeof(uintptr_t) == 0
//
std::uintptr_t next_word = *reinterpret_cast<std::uintptr_t*>(bufptr);
// ... but how do we compare next_word with our search char?
bufptr += sizeof(std::uintptr_t);
}
I also realize that the above code is not strictly portable, because std::uintptr_t isn't guaranteed to actually be word size. But let's assume for the sake of this question that std::uintptr_t is equal to the processor word size. (An actual implementation would likely need platform-specific macros to get the actual word size.)
So, how do we actually check if the byte 0xFF occurs anywhere in the value of next_word?
We can use OR operations of course, but it seems we'd still need to perform a lot of OR'ing and bit shifting to check each byte of next_word, at which point it becomes questionable whether this optimization is actually any better than simply scanning character by character.
You can use this snippet from Bit Twiddling Hacks:
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
#define hasvalue(x,n) \
(haszero((x) ^ (~0UL/255 * (n))))
It effectively XORs each byte with the character to be tested, then determines if any byte is now zero. As written, the macros assume a 32-bit word; for a 64-bit word the constants widen to 0x0101010101010101 and 0x8080808080808080.
At this point you can get the location of the first matching byte from the lowest set bit of the expression's value, e.g. the value will be 0x00000080 if only the least significant byte matches.
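To make the idea concrete, here is a minimal sketch of a word-at-a-time byte search built on the same trick, widened to 64 bits. The names has_zero_byte and find_byte are hypothetical, and this is not how any particular library implements memchr; it copies each word with memcpy to sidestep alignment and aliasing concerns, and falls back to a byte loop to pin down the exact position.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// 64-bit variant of the haszero macro: nonzero if and only if some byte of v is zero.
inline std::uint64_t has_zero_byte(std::uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

// Hypothetical helper: returns a pointer to the first occurrence of target
// in buf[0..len), or nullptr if it does not occur.
const unsigned char* find_byte(const unsigned char* buf, std::size_t len,
                               unsigned char target) {
    // Replicate target into every byte of a word, e.g. 0xFF -> 0xFFFFFFFFFFFFFFFF.
    const std::uint64_t pattern = 0x0101010101010101ULL * target;
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, buf + i, sizeof word);  // alignment-safe load
        // XOR turns every matching byte into zero; one test covers 8 bytes.
        if (has_zero_byte(word ^ pattern)) break;
    }
    // Scan the flagged word (or the tail shorter than a word) byte by byte.
    for (; i < len; ++i) {
        if (buf[i] == target) return buf + i;
    }
    return nullptr;
}
```

The win comes from the common case: a word with no match costs one load, one XOR, and a handful of arithmetic operations instead of eight compares and branches.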

Sizeof and Strlen

I am trying to implement an encryption using a Salt and a Password. And since the recommended size for a Salt is 64 bits, I declared.
char Salt[8];
I used RAND_pseudo_bytes to get a random Salt this way:
RAND_pseudo_bytes((unsigned char*)Salt, sizeof Salt);
The hexdump output had a different length each time I compiled (sometimes 5, mostly 24 bytes), because I had wrongly used strlen instead of sizeof:
RAND_pseudo_bytes((unsigned char*)Salt, strlen(Salt));
I tried the following line to figure out what's happening:
printf("\n%d\n",strlen(Salt));
which outputs 24 each time.
So, my question is: Why is the strlen(Salt)=24 when I declared Salt's length 8(sizeof(Salt)=8)? I would understand a 9(with the '\0', although not entirely sure how exactly would that happen), but 24 strikes me as odd. Thank you.
strlen is going to walk down the pointer you gave it and count the number of bytes until it reaches a null byte. In this case, your char array of 8 bytes has no null bytes, so strlen happily continues past the boundary into a region of memory beyond the defined char array on the stack, and whatever happens to be there will determine the behaviour of strlen. In this case, 24 bytes past the beginning of the array, there was a null byte.
Don't use char to represent bytes.
Over half of the values of a byte are not printable, i.e. they don't have corresponding printable characters.
I suggest you iterate over the array of uint8_t using printf("0x%02X\n", array[i]);
strlen() searches for the first null character and counts all bytes before it, excluding the null byte itself.
A salt is 8 random bytes, and there's no guarantee that any of them, or the byte following the array, is a null byte.
That's why sizeof and strlen differ.
sizeof is an operator that returns the number of bytes needed to store a specific data structure. Applying it to an array of characters is one of the three cases where the name of the array does not decay to a pointer to its first element (the other two are the use of & and initialization via a string literal).
strlen is instead a function, assuming that its input is a null-terminated sequence of characters. Because when you pass the name of the array of characters to a function, it does decay to the pointer of its first element, strlen has no way to know the size of the original data structure (like sizeof does). All it gets is a pointer to char. The only way it can determine the end of the string is by running through the sequence of characters, looking for a '\0'. In your case, it cannot find one before the 24th byte in memory. That happens by pure chance.
Try initializing your array with:
char Salt[8] = {0};
And make sure that your RAND_pseudo_bytes function preserves the sentinel '\0' in the treated string.
Besides the null termination of the salt, as others pointed out, you need to change the format specifier in printf to %zu, because strlen's return type is size_t. Using the wrong specifier invokes undefined behavior.
Addressing Your Question about strlen()
What strlen() is counting is the number of bytes until the first '\0' in memory.
char Salt[9] = { '\0' };
Will initialize Salt with all '\0's.
NOTE: As @OliCharlesworth pointed out, Salt can have embedded null bytes. Don't use any str*() functions. You need to use mem*() functions only and keep track of the length yourself. Don't rely on sizeof inside other functions, because arrays are turned into pointers when passed to functions.
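A small sketch of what "keep track of the length yourself" could look like when dumping the salt. The helper name to_hex is hypothetical, and the salt here is a fixed array rather than the output of RAND_pseudo_bytes, so the example stays self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: formats exactly len bytes as uppercase hex.
// The length is passed explicitly, so embedded zero bytes are handled correctly.
std::string to_hex(const std::uint8_t* data, std::size_t len) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(len * 2);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(kDigits[data[i] >> 4]);    // high nibble
        out.push_back(kDigits[data[i] & 0x0F]);  // low nibble
    }
    return out;
}
```

With a salt declared as std::uint8_t salt[8], calling to_hex(salt, sizeof salt) in the same scope as the declaration always covers all 8 bytes, whereas strlen would stop at the first zero byte or run past the array if there is none.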

When to use unsigned char pointer

What is the use of unsigned char pointers? I have seen in many places that a pointer is type cast to a pointer to unsigned char. Why do we do so?
We receive a pointer to int and then type cast it to unsigned char*. But if we try to print an element of that array using cout, it does not print anything. Why? I do not understand. I am new to C++.
EDIT Sample Code Below
int Stash::add(void* element)
{
if(next >= quantity)
// Enough space left?
inflate(increment);
// Copy element into storage, starting at next empty space:
int startBytes = next * size;
unsigned char* e = (unsigned char*)element;
for(int i = 0; i < size; i++)
storage[startBytes + i] = e[i];
next++;
return(next - 1); // Index number
}
You are actually looking for pointer arithmetic:
unsigned char* bytes = (unsigned char*)ptr;
for(int i = 0; i < size; i++)
// work with bytes[i]
In this example, bytes[i] is equal to *(bytes + i) and it is used to access the memory on the address: bytes + (i* sizeof(*bytes)). In other words: If you have int* intPtr and you try to access intPtr[1], you are actually accessing the integer stored at bytes: 4 to 7:
0 1 2 3
4 5 6 7 <--
The size of type your pointer points to affects where it points after it is incremented / decremented. So if you want to iterate your data byte by byte, you need to have a pointer to type of size 1 byte (that's why unsigned char*).
unsigned char is usually used for holding binary data where 0 is valid value and still part of your data. While working with "naked" unsigned char* you'll probably have to hold the length of your buffer.
char is usually used for holding characters representing a string, where 0 is equal to '\0' (the terminating character). If your buffer of characters is always terminated with '\0', you don't need to know its length, because the terminating character exactly specifies the end of your data.
Note that in both of these cases it's better to use some object that hides the internal representation of your data and will take care of memory management for you (see RAII idiom). So it's much better idea to use either std::vector<unsigned char> (for binary data) or std::string (for string).
In C, unsigned char is the only type guaranteed to have no trapping values, and which guarantees copying will result in an exact bitwise image. (C++ extends this guarantee to char as well.) For this reason, it is traditionally used for "raw memory" (e.g. the semantics of memcpy are defined in terms of unsigned char).
In addition, unsigned integral types in general are used when bitwise operations (&, |, >> etc.) are going to be used. unsigned char is the smallest unsigned integral type, and may be used when manipulating arrays of small values on which bitwise operations are used. Occasionally, it's also used because one needs the modulo behavior in case of overflow, although this is more frequent with larger types (e.g. when calculating a hash value). Both of these reasons apply to unsigned types in general; unsigned char will normally only be used for them when there is a need to reduce memory use.
The unsigned char type is usually used as a representation of a single byte of binary data. Thus, an array of them is often used as a binary data buffer, where each element is a single byte.
The unsigned char* construct will be a pointer to the binary data buffer (or its 1st element).
I am not 100% sure what the C++ standard says precisely about the size of unsigned char, whether it is fixed at 8 bits or not. Usually it is. I will try to find and post it.
After seeing your code
When you use something like void* input as a parameter of a function, you deliberately strip away information about the input's original type. This is a very strong suggestion that the input will be treated in a very general manner, i.e. as an arbitrary string of bytes. int* input, on the other hand, would suggest it will be treated as a "string" of signed integers.
void* is mostly used in cases when input gets encoded, or treated bit/byte wise for whatever reason, since you cannot draw conclusions about its contents.
Then, in your function, you seem to want to treat the input as a string of bytes. But to operate on objects, e.g. performing operator= (assignment), the compiler needs to know what to do. Since you declare input as void*, an assignment such as *input = something would make no sense, because *input is of void type. To make the compiler treat the input elements as the "smallest raw memory pieces", you cast it to the appropriate type, which is unsigned char*.
The cout probably did not work because of a wrong or unintended type conversion. char* is considered a null-terminated string, and it is easy to confuse the signed and unsigned versions in code. If you pass an unsigned char* to ostream::operator<<, it is treated like a char*: the stream expects the bytes to be normal ASCII characters, where 0 marks the end of the string rather than an integer value of 0. When you want to print the contents of memory, it is best to cast explicitly.
Also note that to print the memory contents of a buffer you would need to use a loop, since otherwise the printing function would not know when to stop.
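A minimal sketch of the explicit cast and the loop described above. The function name byte_values is hypothetical, and it returns a string instead of writing to cout directly only so the result is easy to check. Casting each byte to unsigned int is what makes the stream print a number rather than a character.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: renders size bytes starting at data as decimal
// numbers separated by spaces, e.g. "0 65 255".
std::string byte_values(const void* data, std::size_t size) {
    // View the object as raw bytes.
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::ostringstream out;
    for (std::size_t i = 0; i < size; ++i) {
        if (i != 0) out << ' ';
        // Without this cast the stream would try to print a character.
        out << static_cast<unsigned int>(bytes[i]);
    }
    return out.str();
}
```

For an int variable x, std::cout << byte_values(&x, sizeof x) would show all of its bytes, including any that are zero; the order depends on the machine's endianness.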
Unsigned char pointers are useful when you want to access the data byte by byte. For example, a function that copies data from one area to another could need this:
void memcpy (unsigned char* dest, unsigned char* source, unsigned count)
{
for (unsigned i = 0; i < count; i++)
dest[i] = source[i];
}
It also has to do with the fact that the byte is the smallest addressable unit of memory. If you want to read anything smaller than a byte from memory, you need to get the byte that contains that information, and then select the information using bit operations.
You could very well copy the data in the above function using an int pointer, but that would copy in chunks of sizeof(int) bytes (typically 4), which may not be the correct behavior in some situations.
As for why nothing appears on the screen when you try to use cout, the most likely explanation is that the data starts with a zero character, which in C++ marks the end of a string of characters.

Integer to Character conversion in C

Let us consider this snippet:
int s;
scanf("%c",&s);
Here I have used int, and not char, for the variable s. To use s safely as a character I have to convert it to char again, because when scanf reads a character it overwrites only one byte of the variable it is assigning to, and not all four that an int has.
For conversion I could use s = (char)s; as the next line, but is it possible to implement the same by subtracting something from s ?
What you've done is technically undefined behaviour. The %c format calls for a char*, you've passed it an int* which will (roughly speaking) be reinterpreted. Even assuming that the pointer value is still good after reinterpreting, storing an arbitrary character to the first byte of an int and then reading it back as int is undefined behaviour. Even if it were defined, reading an int when 3 bytes of it are uninitialized, is undefined behaviour.
In practice it probably does something sensible on your machine, and you just get garbage in the top 3 bytes (assuming little-endian).
Writing s = (char)s converts the value from int to char and then back to int again. This is implementation-defined behaviour: converting an out-of-range value to a signed type. On different implementations it might clean up the top 3 bytes, it might return some other result, or it might raise a signal.
The proper way to use scanf is:
char c;
scanf("%c", &c);
And then either int s = c; or int s = (unsigned char)c;, according to whether you want negative-valued characters to result in a negative integer, or a positive integer (up to 255, assuming 8-bit char).
I can't think of any good reason for using scanf improperly. There are good reasons for not using scanf at all, though:
int s = getchar();
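The read-into-char-then-widen approach described above can be sketched as follows. The function name first_char_value is hypothetical, and sscanf on a string is used instead of scanf on standard input only to keep the example self-contained; the %c handling is the same.

```cpp
#include <cstdio>

// Hypothetical helper: reads one character from input with %c into a char,
// then widens it to int via unsigned char.
// Returns a value in 0..255 (assuming 8-bit char), or -1 if nothing was read.
int first_char_value(const char* input) {
    char c;  // %c expects a char*, so the destination is a char
    if (std::sscanf(input, "%c", &c) != 1) return -1;
    // Going through unsigned char avoids negative results for bytes above 127.
    return static_cast<unsigned char>(c);
}
```

This mirrors the int s = (unsigned char)c; variant; dropping the cast gives the other variant, in which characters with the high bit set may come out negative on platforms where char is signed.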
Are you trying to convert a digit to its decimal value? If so, then
char c = '8';
int n = c - '0';
n should be 8 at this point.
That's probably not a good idea; GCC gives me a warning for that code:
main.c:10: warning: format ‘%c’ expects type ‘char *’, but
argument 2 has type ‘int *’
In this case you're ok since you're passing a pointer to more space than you need (for most systems), but what if you did it the other way around? Could be crash city. If you really want to do something like what you have there, just do the typecast or mask it - the mask will be endian-dependent.
As written, this won't work reliably. The argument &s passed to scanf is a pointer to int, and scanf is expecting a pointer to char. The two data types (int and char) have different sizes (at least on most architectures), so the data may get put in the wrong spot in memory, and the other part of s may not get properly cleared.
The answers suggesting manipulation of the result after using a pointer to int rely on unspecified behavior (i.e. that scanf will put the character value it has in the least significant byte of the int you're pointing to), and are not safe.
No, but you could use the following:
s = s & 0xFF;
That will blank out all of the data except the first byte. But in general all these ideas (and the ones above) are bad ideas, since not all systems store the lowest part of the integer in memory first. So if you ever have to port this code to a big endian system, you'll be screwed.
True, you may never have to port the code, but why write unportable code to begin with?
See this for more info:
http://en.wikipedia.org/wiki/Endianness