C++: <= conflict between signed and unsigned

I have created a wrapper around the .substr function:
wstring MidEx(wstring u, long uStartBased1, long uLenBased1)
{
    // Extracts a substring. It is fail-safe: in case we read beyond the string, it will just read as much as it has.
    // For example, when we read from the word HELLO at position 4 with len 5000, it will just return LO.
    if (uStartBased1 > 0)
    {
        if (uStartBased1 <= u.size())
        {
            return u.substr(uStartBased1 - 1, uLenBased1);
        }
    }
    return wstring(L"");
}
It works fine; however, the compiler gives me the warning "<= Conflict between signed and unsigned".
Can somebody tell me how to do it correctly?
Thank you very much!

You should use wstring::size_type (or size_t) instead of long:
wstring MidEx(wstring u, wstring::size_type uStartBased1, wstring::size_type uLenBased1)
{
    // Extracts a substring. It is fail-safe: in case we read beyond the string, it will just read as much as it has.
    // For example, when we read from the word HELLO at position 4 with len 5000, it will just return LO.
    if (uStartBased1 > 0)
    {
        if (uStartBased1 <= u.size())
        {
            return u.substr(uStartBased1 - 1, uLenBased1);
        }
    }
    return wstring(L"");
}
which is the exact return type of u.size(). This way, you ensure that the comparison gives the expected result.
If you are working with std::wstring or another standard library container (std::vector, etc.), its size_type member will be defined as size_t, so using it keeps the code consistent.
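To see why the warning matters beyond this function: in a mixed signed/unsigned comparison the signed operand is converted to unsigned, so a negative value compares as a huge positive one. A small illustration (my own, not part of the answer; the > 0 guard in MidEx already rules this case out):
#include <iostream>
#include <string>
int main() {
    std::wstring u = L"HELLO";
    long pos = -1;
    // pos is converted to an unsigned type for the comparison,
    // so -1 becomes a huge value and the test is unexpectedly false.
    if (pos <= u.size())
        std::cout << "compared mathematically\n";
    else
        std::cout << "-1 compared as " << (unsigned long)pos << "\n";
}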

You want unsigned arguments, something like:
wstring MidEx(wstring u, unsigned long uStartBased1, unsigned long uLenBased1)

Related

Can I safely use std::string to assemble binary data into messages?

I am using a std::string to hold binary data read from a socket.
The data consists of messages beginning with a '$' and ending with a '#'. Each message may contain '\0' characters.
I use std::string::find() to find the location of the first message and extract it from the string using std::string::substr():
class MessageSplitter {
public:
    MessageSplitter() { m_data.reserve(1'000'000); }
    void appendBinaryData(const std::string& binaryData) {
        m_data.append(binaryData);
    }
    bool popMessage(std::string& msg) {
        size_t beg_index = m_data.find("$");
        if (beg_index == std::string::npos) {
            return false;
        }
        size_t end_index = m_data.find("#", beg_index);
        if (end_index == std::string::npos) {
            return false;
        }
        size_t count = end_index - beg_index + 1; // the "#" terminator is one character
        msg = m_data.substr(beg_index, count);
        m_data = m_data.substr(end_index + 1);
        return true;
    }
private:
    std::string m_data;
};
I read from the socket this way (error checking on recv omitted):
char buffer[4096];
int ret = ::recv(m_socket, buffer, 4096, 0);
std::string binaryData = std::string(buffer, ret);
This approach seems to work fine on Windows.
However, is it guaranteed to work on other platforms according to the C++ standard?
This is perfectly safe from a language level. std::string is guaranteed to be able to handle non-printable characters including embedded nul characters just fine.
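In particular, the length-taking operations count bytes rather than scanning for a terminator, so embedded NULs survive. A quick illustration (mine, not from the answer):
#include <cassert>
#include <string>
int main() {
    std::string s("a\0b", 3);                // length-taking ctor keeps the NUL
    assert(s.size() == 3);
    assert(std::string("a\0b").size() == 1); // C-string ctor stops at the NUL
}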
From a programmer's perspective, though, it's somewhat unsafe because it's surprising. When I see std::string I generally expect it to be printable text. It has an operator<<, for example, to make it easy to print to output streams, and I have to remember never to use that.
For the second reason, I would tend to prefer something more explicit. std::vector<std::byte> or std::vector<unsigned char> or similar. Something that doesn't act like text is much more difficult to accidentally treat as text.
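If you go that route, the splitter translates almost mechanically. A rough sketch over std::vector<unsigned char> (my own names and interface, not the poster's code):
#include <algorithm>
#include <cstddef>
#include <vector>
std::vector<unsigned char> buffer;
void appendBinaryData(const unsigned char* data, std::size_t len) {
    buffer.insert(buffer.end(), data, data + len);
}
bool popMessage(std::vector<unsigned char>& msg) {
    auto beg = std::find(buffer.begin(), buffer.end(), '$');
    if (beg == buffer.end()) return false;
    auto end = std::find(beg, buffer.end(), '#');
    if (end == buffer.end()) return false;
    msg.assign(beg, end + 1);              // include the trailing '#'
    buffer.erase(buffer.begin(), end + 1); // drop the consumed bytes
    return true;
}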

Parse string to unsigned int with error handling if string represents negative number

Unfortunately, parsing the string representation of a negative number to an unsigned int with proper error handling seems way more complicated than one would expect. Neither std::stoul, nor strtoul, nor boost::lexical_cast, nor the stringstream approach detects the "error"; they all happily parse the string "-1" by performing a wrap-around.
Is there any other way of converting a string to unsigned int with proper error handling? The way proposed in a comment on the Boost bug report seems a bit... strange.
AFAIK, there is no such function/operator in the standard library. You could:
Check for a '-' character in the string beforehand, as already suggested, which, while strange, will work AFAICT:
int mystrtoul(char const *s, unsigned &y)
{
    if (strchr(s, '-') == NULL) {
        y = strtoul(s, NULL, 0);
        return 0;
    }
    return -1;
}
Use strtod() first. It will detect negative numbers and you can then call strtoul() if the number is not negative, like:
int yastrtoul(char const *s, unsigned &y)
{
    if (strtod(s, NULL) >= 0) {
        y = strtoul(s, NULL, 0);
        return 0;
    }
    return -1;
}
Do the whole thing yourself, but be careful as detecting overflow is tricky, because of undefined behavior.
Not sure what is meant by error handling. If you are truly trying to parse an unsigned int, then anytime there is a '-' at index 0 of your string you should flag the error; otherwise, parse as normal.
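As an aside, C++17's std::from_chars gives you this behavior directly: for an unsigned destination type a leading '-' is not part of the accepted pattern, so "-1" fails to parse instead of wrapping around. A minimal sketch (assuming a C++17 standard library):
#include <charconv>
#include <string>
#include <system_error>
// Returns true only if the whole string parsed as a non-negative number.
bool parseUnsigned(const std::string& s, unsigned& out) {
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), out);
    return ec == std::errc() && ptr == s.data() + s.size();
}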

Displaying integer on an LCD

I'm trying to display an integer on an LCD display. The way the LCD works is that you send it an 8-bit ASCII character and it displays that character.
The code I have so far is:
unsigned char text[17] = "ABCDEFGHIJKLMNOP";
int32_t n = 123456;
lcd.printInteger(text, n);
//-----------------------------------------
void LCD::printInteger(unsigned char headLine[17], int32_t number)
{
    //......
    int8_t str[17];
    itoa(number, (char*)str, 10);
    for (int i = 0; i < 16; i++)
    {
        if (str[i] == 0x0)
            break;
        this->sendCharacter(str[i]);
        _delay_ms(2);
    }
}
void LCD::sendCharacter(uint8_t character)
{
    //....
    *this->cOutputPort = character;
    //...
}
So if I try to display 123456 on the LCD, it actually displays -7616, which obviously is not the correct integer.
I know that there is probably a problem because I convert the characters to signed int8_t and then output them as unsigned uint8_t. But I have to output them in unsigned format. I don't know how I can convert the int32_t input integer to an ASCII string of uint8_t.
On your architecture, int is an int16_t, not int32_t. Thus, itoa treats 123456 as -7616, because:
123456 = 0x0001_E240
-7616 = 0xFFFF_E240
They are the same if you truncate them down to 16 bits, and that's what your code is doing. Instead of using itoa, you have the following options:
calculate the ASCII representation yourself (see the sketch after the snprintf example below);
use ltoa(long value, char * buffer, int radix), if available, or
leverage s[n]printf if available.
For the last option you can use the following, "mostly" portable code:
void LCD::printInteger(unsigned char headLine[17], int32_t number) {
    ...
    char str[17];
    if (sizeof(int) == sizeof(int32_t))
        snprintf(str, sizeof(str), "%d", number);
    else if (sizeof(long int) == sizeof(int32_t))
        snprintf(str, sizeof(str), "%ld", number);
    else if (sizeof(long long int) == sizeof(int32_t))
        snprintf(str, sizeof(str), "%lld", number);
    ...
}
If, and only if, your platform doesn't have snprintf, you can use sprintf and remove the 2nd argument (sizeof(str)). Your go-to function should always be the n variant, as it gives you one less bullet to shoot your foot with :)
Since you're compiling with a C++ compiler that is, I assume, at least half-decent, the above should do "the right thing" in a portable way, without emitting all the unnecessary code. The test conditions passed to if are compile-time constant expressions. Even some fairly old C compilers could deal with such properly.
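The first option is also straightforward for this fixed case. A hand-rolled sketch (mine, not the answer's code; it only assumes a char buffer for the LCD):
#include <cstdint>
#include <cstddef>
// Convert an int32_t to decimal ASCII without itoa/printf.
// Digits are generated least-significant first, then reversed into buf.
void int32ToAscii(int32_t value, char* buf, size_t bufSize) {
    char tmp[12]; // fits -2147483648 plus the sign
    size_t i = 0;
    uint32_t mag = (value < 0) ? 0u - (uint32_t)value : (uint32_t)value;
    do {
        tmp[i++] = (char)('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);
    if (value < 0)
        tmp[i++] = '-';
    size_t n = 0;
    while (i > 0 && n + 1 < bufSize)
        buf[n++] = tmp[--i];
    buf[n] = '\0';
}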
Nitpick: Don't use int8_t where a char would do. itoa, s[n]printf, etc. expect char buffers, not int8_t buffers.

Basics of strtol?

I am really confused. I have to be missing something rather simple but nothing I am reading about strtol() is making sense. Can someone spell it out for me in a really basic way, as well as give an example for how I might get something like the following to work?
string input = getUserInput;
int numberinput = strtol(input,?,?);
The first argument is the string. It has to be passed in as a C string, so if you have a std::string use .c_str() first.
The second argument is optional: it's a char ** through which strtol stores a pointer to the character just past the end of the number. This is useful when converting a string containing several integers, but if you don't need it, just set this argument to NULL.
The third argument is the radix (base) to convert. strtol can do anything from binary (base 2) to base 36. If you want strtol to pick the base automatically based on prefix, pass in 0.
So, the simplest usage would be
long l = strtol(input.c_str(), NULL, 0);
If you know you are getting decimal numbers:
long l = strtol(input.c_str(), NULL, 10);
strtol returns 0 if there are no convertible characters at the start of the string. If you want to check if strtol succeeded, use the middle argument:
const char *s = input.c_str();
char *t;
long l = strtol(s, &t, 10);
if (s == t) {
    /* strtol failed */
}
If you're using C++11, use stol instead:
long l = stol(input);
Alternately, you can just use a stringstream, which has the advantage of being able to read many items with ease just like cin:
stringstream ss(input);
long l;
ss >> l;
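One caveat: extraction can fail, and the failure is reported through the stream state rather than a return value. A small sketch (mine) of checking it:
#include <sstream>
#include <string>
// operator>> sets the stream's failbit on bad input,
// so the extraction itself can be tested directly.
bool parseLong(const std::string& input, long& out) {
    std::stringstream ss(input);
    return static_cast<bool>(ss >> out);
}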
Suppose you're given a string char const * str. Now convert it like this:
#include <cstdlib>
#include <cerrno>
char * e;
errno = 0;
long n = std::strtol(str, &e, 0);
The last argument 0 determines the number base you want to apply; 0 means "auto-detect". Other sensible values are 8, 10 or 16.
Next you need to inspect the end pointer e. This points to the character after the consumed input. Thus if all input was consumed, it points to the null-terminator.
if (*e != '\0') { /* error, die */ }
It's also possible to allow for partial input consumption using e, but that's the sort of stuff that you'll understand when you actually need it.
Lastly, you should check for errors, which can essentially only be overflow errors if the input doesn't fit into the destination type:
if (errno != 0) { /* error, die */ }
In C++, it might be preferable to use std::stol (it also takes an optional base argument, defaulting to 10):
#include <string>
try { long n = std::stol(str); }
catch (std::invalid_argument const & e) { /* error */ }
catch (std::out_of_range const & e) { /* error */ }
Quote from C++ reference:
long int strtol ( const char * str, char ** endptr, int base );
Convert string to long integer
Parses the C string str interpreting its content as an integral number of the specified base, which is returned as a long int value. If endptr is not a null pointer, the function also sets the value of endptr to point to the first character after the number.
So try something like
long l = strtol(pointerToStartOfString, NULL, 0);
I always use simply strtol(str, 0, 0) - it returns a long value. 0 for the radix (the last parameter) means to auto-detect it from the input string, so both 0x10 as hex and 10 as decimal can be used in the input string.
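To illustrate the auto-detection (my example, not the answerer's):
#include <cstdlib>
#include <iostream>
int main() {
    // With base 0, the radix is inferred from the prefix.
    std::cout << std::strtol("0x10", nullptr, 0) << '\n'; // 16 (hex prefix)
    std::cout << std::strtol("10",   nullptr, 0) << '\n'; // 10 (decimal)
    std::cout << std::strtol("010",  nullptr, 0) << '\n'; // 8  (octal prefix)
}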

C++ substring multi byte characters

I have a std::string which contains some characters that span multiple bytes.
When I do a substring on this string, the output is not valid because, of course, these characters are counted as 2 characters. In my opinion I should be using a wstring instead, because it will store these characters as one element instead of more.
So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.
Is there a good solution for converting a string to a wstring, merging the special characters into 1 element instead of 2?
Thanks
Simpler version, based on the solution Marcelo Cantos provided in Getting the actual length of a UTF-8 encoded std::string?:
std::string substr(std::string originalString, int maxLength)
{
    std::string resultString = originalString;
    int len = 0;
    int byteCount = 0;
    const char* aStr = originalString.c_str();
    while (*aStr)
    {
        if ((*aStr & 0xc0) != 0x80) // count only UTF-8 lead bytes
            len += 1;
        if (len > maxLength)
        {
            resultString = resultString.substr(0, byteCount);
            break;
        }
        byteCount++;
        aStr++;
    }
    return resultString;
}
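For example (my usage sketch; "é" is the two-byte UTF-8 sequence 0xC3 0xA9):
#include <cassert>
#include <string>
int main() {
    // "héllo" is 6 bytes but 5 characters; asking for 2 characters
    // yields the 3-byte prefix "hé" instead of cutting é in half.
    std::string firstTwo = substr("h\xC3\xA9llo", 2);
    assert(firstTwo == "h\xC3\xA9");
}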
A std::string object is not a string of characters, it's a string of bytes. It has no notion of what's called "encoding" at all. The same goes for std::wstring, except that it's a string of wchar_t values (commonly, but not always, 16 bits wide).
In order to perform operations on your text which require addressing distinct characters (as is the case when you want to take the substring, for instance) you need to know what encoding is used for your std::string object.
UPDATE: Now that you clarified that your input string is UTF-8 encoded, you still need to decide on an encoding to use for your output std::wstring. UTF-16 comes to mind, but it really depends on what the API you will pass the std::wstring objects to expects. Assuming that UTF-16 is acceptable, you have various choices:
On Windows, you can use the MultiByteToWideChar function; no extra dependencies required (see the sketch after this list).
The UTF8-CPP library claims to provide a lightweight solution for dealing with UTF-* encoded strings. Never tried it myself, but I keep hearing good things about it.
On Linux systems, using the libiconv library is quite common.
If you need to deal with all sorts of crazy encodings and want the full-blown alpha-and-omega as far as encodings go, look at ICU.
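For the first bullet, a minimal Windows-only sketch (mine; error handling omitted) converting a UTF-8 std::string into a UTF-16 std::wstring:
#include <string>
#include <windows.h>
std::wstring utf8ToWide(const std::string& s) {
    if (s.empty()) return std::wstring();
    // First call computes the required length, second call converts.
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
    return w;
}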
There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters and convert them into a single element.
For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes. For UTF-8, I use the following:
#include <cstddef>
#include <iterator>

typedef unsigned char Byte ;

// 256-entry table mapping a UTF-8 lead byte to the length of its sequence
// (declaration only; the table itself is filled in elsewhere).
extern size_t const byteCountTable[ 256 ] ;

inline size_t
size(
    Byte ch )
{
    return byteCountTable[ ch ] ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    InputIterator end )
{
    //  Assumes well-formed UTF-8: a sequence never runs past end.
    if ( begin != end ) {
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;
    }
    return begin ;
}

template< typename InputIterator >
size_t
characterCount(
    InputIterator begin,
    InputIterator end )
{
    size_t result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
Based on this I've written my utf8 substring function:
void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
{
    int len = 0;
    size_t byteIndex = 0;
    const char* aStr = originalString.c_str();
    size_t origSize = originalString.size();
    for (byteIndex = 0; byteIndex < origSize; byteIndex++)
    {
        if ((aStr[byteIndex] & 0xc0) != 0x80) // lead byte of a character
            len += 1;
        if (len > SubStrLength)               // note: >, not >=, so the last wanted character is kept
            break;
    }
    csSubstring = originalString.substr(0, byteIndex);
}
Unicode is hard.
1. std::wstring is not a list of code points, it's a list of wchar_t, and their width is implementation-defined (commonly 16 bits with VC++ and 32 bits with gcc and clang). Yes, it means it's useless for portable code...
2. A single character may be encoded over several code points (because of diacritics).
3. In some languages, two different characters together form a "unit" that is not really separable (for example, LL is considered a letter on its own in Spanish).
So... it's a bit hard.
Solving 3) may be costly (it requires specific language/usage annotations); solving 1) and 2) is absolutely necessary... and requires Unicode-aware libraries or coding your own (and probably getting it wrong).
1) is trivially solved: writing a routine transforming UTF-8 into code points is trivial (a code point can be represented with a uint32_t); see the sketch below.
2) is more difficult: it requires a list of diacritics, and the subroutine must know never to cut just before a diacritic (they follow the character they qualify).
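For point 1), a rough sketch of such a routine (mine; it assumes well-formed UTF-8 and performs no validation):
#include <cstdint>
#include <string>
#include <vector>
std::vector<uint32_t> decodeUtf8(const std::string& s) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i < s.size(); ) {
        unsigned char b = s[i];
        // Sequence length from the lead byte; payload bits from the mask.
        size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        uint32_t cp = (len == 1) ? b : (uint32_t)(b & (0x7F >> len));
        for (size_t j = 1; j < len && i + j < s.size(); ++j)
            cp = (cp << 6) | (s[i + j] & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}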
Otherwise, there is probably what you seek in ICU. I wish you good luck finding it.
Let me assume for simplicity that your encoding is UTF-8. In this case we would have some chars occupying more than one byte, as in your case.
Then you have std::string, where those UTF-8 encoded characters are stored.
And now you want to substr() in terms of chars, not bytes.
I'd write a function that converts a character count to a byte count. For the UTF-8 case it would look like:
#define UTF8_CHAR_LEN( byte ) ((( 0xE5000000 >> (( byte >> 3 ) & 0x1e )) & 3 ) + 1 )

int GetByteCountForCharCount(const char* utf8Str, int charCnt)
{
    int byteCount = 0;
    for (int i = 0; i < charCnt; i++)
    {
        int charlen = UTF8_CHAR_LEN(*utf8Str); // sequence length from the lead byte
        byteCount += charlen;
        utf8Str += charlen;
    }
    return byteCount;
}
So, say you want to substr() the string from the 7th char. No problem:
int pos = GetByteCountForCharCount(str.c_str(), 7);
std::string result = str.substr(pos);