bool check(const char *text) {
    char c;
    while (c = *text++) {
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}
What's 0x80, and what does the whole mysterious function do?
Testing the result of an x & 0x80 expression for non-zero (as is done twice in the code you show) checks whether the most significant bit (bit 7) of the char operand x is set [1]. In your case, the code loops through the given string looking for two consecutive characters (c, which is a copy of the 'current' character, and *text, the next one) with that bit set.
If such a combination is found, the function returns true; if it is not found and the loop reaches the nul terminator (so that the c = *text++ expression becomes zero), it returns false.
As to why it does such a check, I can only guess: if that upper bit is set, then the character is not a standard ASCII value (and may be the first byte of a Unicode pair, or of some other multi-byte character representation).
Possibly helpful references:
Bitwise operators
Hexadecimal constants
[1] Note that this bitwise AND test is really the only safe way to check that bit, because the C++ Standard allows the char type to be either signed (where testing for a negative value would be an alternative) or unsigned (where testing for >= 128 would be required); either of those tests would fail if the implementation's char had the 'wrong' signedness.
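As a quick illustration (my sketch, not part of the original answer), here are the three tests side by side; only the bitwise one behaves the same regardless of char's signedness:

#include <iostream>

bool msb_set_portable(char x)  { return (x & 0x80) != 0; }  // correct either way
bool msb_set_signed(char x)    { return x < 0; }            // only if char is signed
bool msb_set_unsigned(char x)  { return x >= 128; }         // only if char is unsigned

int main()
{
    char c = '\xB5';                              // a byte with bit 7 set
    std::cout << msb_set_portable(c) << '\n';     // always prints 1
    std::cout << msb_set_signed(c) << '\n';       // 1 only where char is signed
    std::cout << msb_set_unsigned(c) << '\n';     // 1 only where char is unsigned
}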
I can't be totally sure without more context, but it looks to me like this function checks to see if a string contains any UTF-8 characters outside the classic 7-bit US-ASCII range.
while (c = *text++) will loop until it finds the nul terminator of a C-style string, assigning each char to c as it goes. c & 0x80 checks whether the most significant bit of c is set. *text & 0x80 does the same for the char pointed to by text (which is the one after c, since text was incremented as part of the while condition).
Thus this function will return true if any two adjacent chars in the string pointed to by text have their most-significant-bit set. That's the case for any code points U+0080 and above in UTF-8; hence my guess that this function is for detecting UTF-8 text.
Rewriting to be less compact:
while (true)
{
    char c = *text;
    text += 1;
    if (c == '\0')                  // at the end of the string?
        return false;
    int temp1 = c & 0x80;           // test MSB of c
    int temp2 = (*text) & 0x80;     // test MSB of the next character
    if (temp1 != 0 && temp2 != 0)   // if both are set, return true
        return true;
}
MSB means Most Significant Bit, i.e. bit 7. It is zero for plain ASCII characters.
I have an std::string with UTF-8 characters (some Latin, some non-Latin) on Linux and Mac.
As we know, the size of a UTF-8 character is not fixed, and some characters are not just 1 byte (as regular Latin characters are).
The question is: how can I get the character at offset i?
It makes sense to use an int32 data type to store the character, but how do I get it?
For example:
std::string str = read_utf8_text();
int c_can_be_more_than_one_byte = str[i]; // <-- obviously this code is wrong
It is important to point out that I do not know the size of the character at offset i.
It's very simple.
First, you have to understand that you can't calculate the position without iterating over the string (that's obvious for variable-length characters).
Second, you need to remember that in UTF-8 characters occupy 1 to 4 bytes, and when a character occupies more than one byte, every trailing byte has its two most significant bits set to 10. So you just count bytes, ignoring one if (byte_val & 0xC0) == 0x80.
Unfortunately, I don't have a compiler at my disposal right now, so please be kind to possible mistakes in the code:
int desired_index = 19;
int index = 0;
const char* p = my_str.c_str();    // note: c_str() returns a const char*
while ( *p && index < desired_index ) {
    if ( (*p & 0xC0) != 0x80 )     // if it is the first byte of the next character
        index++;
    p++;
}
// now p may point into the trailing (10xxxxxx) bytes of the previous
// character; skip them
while ( (*p & 0xC0) == 0x80 )
    p++;
if ( *p ) {
    // here p points to your desired character
} else {
    // we reached the end of the string while searching
}
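To then extract the character into a 32-bit value, as the question asks, the sequence at p can be decoded. A hedged sketch of my own (decode_utf8_at is a made-up name; it assumes the input is valid UTF-8):

#include <cstdint>

uint32_t decode_utf8_at(const char* p)
{
    unsigned char lead = static_cast<unsigned char>(*p);
    uint32_t cp;     // accumulated code point
    int extra;       // number of trailing 10xxxxxx bytes
    if      (lead < 0x80) { cp = lead;        extra = 0; }   // 0xxxxxxx
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }   // 110xxxxx
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }   // 1110xxxx
    else                  { cp = lead & 0x07; extra = 3; }   // 11110xxx
    for (int k = 1; k <= extra; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(p[k]) & 0x3F);
    return cp;
}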
This code produces Medium-level warnings on the lines with return:
// Checks if the symbol starts a two-byte Unicode sequence
bool doubleSymbol(const char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}
// Checks if the symbol starts a three-byte Unicode sequence
bool tripleSymbol(const char c) {
    static const char THREE_SYMBOLS_MASK = 0b1110;
    return (c >> 4) == THREE_SYMBOLS_MASK;
}
// Checks if the symbol starts a four-byte Unicode sequence
bool quadrupleSymbol(const char c) {
    static const char FOUR_SYMBOLS_MASK = 0b11110;
    return (c >> 3) == FOUR_SYMBOLS_MASK;
}
PVS says that the expressions are always false (V547), but they actually aren't: the char may be part of a Unicode symbol that has been read into a std::string!
Here is the UTF-8 representation of symbols:
1 byte - 0xxx'xxxx - 7 bits
2 bytes - 110x'xxxx 10xx'xxxx - 11 bits
3 bytes - 1110'xxxx 10xx'xxxx 10xx'xxxx - 16 bits
4 bytes - 1111'0xxx 10xx'xxxx 10xx'xxxx 10xx'xxxx - 21 bits
The following code counts number of symbols in a Unicode text:
size_t symbolCount = 0;
std::string s;
while (getline(std::cin, s)) {
for (size_t i = 0; i < s.size(); ++i) {
const char c = s[i];
++symbolCount;
if (doubleSymbol(c)) {
i += 1;
} else if (tripleSymbol(c)) {
i += 2;
} else if (quadrupleSymbol(c)) {
i += 3;
}
}
}
std::cout << symbolCount << "\n";
For the input Hello! the output is 6, and for Привет, мир! it is 12; this is right!
Am I wrong or doesn't PVS know something? ;)
The PVS-Studio analyzer knows that there are signed and unsigned char types. Which one is used depends on the compiler switches, and PVS-Studio takes these switches into account.
I think this code is compiled with char being signed. Let's see what consequences that brings.
Let’s look only at the first case:
bool doubleSymbol(const char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}
If the value of the variable 'c' is less than or equal to 01111111, the condition will always be false, because the maximum value the shift can produce is 011.
It means we are interested only in cases where the highest bit of the variable 'c' equals 1. Since the variable is of type signed char, a set highest bit means the variable stores a negative value. Before the shift, the signed char is promoted to a signed int, and the value stays negative.
Now let's see what the standard says about the right-shift of negative numbers:
The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a non-negative value, the value of the result is the integral part of the quotient of E1/2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
Thus, the right shift of a negative number is implementation-defined: the highest bits may be filled with zeros or with ones, and both would be correct.
PVS-Studio assumes that the highest bits are filled with ones. It has every right to do so, because it has to choose one of the implementations. So the expression (c >> 5) will have a negative value whenever the highest bit of the variable 'c' is originally 1, and a negative number can never equal TWO_SYMBOLS_MASK.
Thus, from the viewpoint of PVS-Studio the condition is always false, and it correctly issues the V547 warning.
In practice, the compiler may behave differently: the highest bits may be filled with 0, and then everything works correctly.
In any case, the code should be fixed, because it relies on implementation-defined compiler behavior.
The code might be fixed as follows:
bool doubleSymbol(const unsigned char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}
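As a usage note (my addition): a std::string still yields plain char, which converts to unsigned char at the call boundary, so the counting loop needs no other changes. A minimal sketch:

#include <iostream>
#include <string>

bool doubleSymbol(const unsigned char c) {
    return (c >> 5) == 0b110;        // now a well-defined unsigned shift
}

int main()
{
    std::string s = "é";             // two bytes in UTF-8: 0xC3 0xA9
    std::cout << doubleSymbol(s[0]) << '\n';   // prints 1
}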
I have a valid UTF-8 encoded string in a std::string, and a limit in bytes. I would like to truncate the string and append "..." at MAX_SIZE - 3 - x, where x is the value that prevents a UTF-8 character from being cut in half.
Is there a function that could determine x based on MAX_SIZE, without the need to scan from the beginning of the string?
If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.
You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).
The way UTF-8 works is that a byte can be one of three things, based on its upper bits. If the topmost bit is 0, then the byte is an ASCII character, and the next 7 bits are the Unicode codepoint value itself. If the topmost bits are 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multi-byte sequence is coded with 11 in the top 2 bits.
So if the top bits of a byte are not 10, then it's either an ASCII character or the start of a multibyte sequence. Either way, it's a valid place to cut.
Note however that, while this algorithm breaks the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters they combine with; accents can be lost from characters, for example. Doing proper grapheme cluster analysis would require access to the Unicode table that says whether a codepoint is a combining character.
But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)
The code would look something like this (in C++14):
auto FindCutPosition(const std::string &str, size_t max_size)
{
    assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
    assert(str.size() > 3 && "Make sure stupidity hasn't happened.");
    max_size -= 3;
    for(size_t pos = max_size; pos > 0; --pos)
    {
        unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
        if((byte & 0xC0) != 0x80)  //not a continuation byte, so a valid cut point
            return pos;
    }
    unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
    if((byte & 0xC0) != 0x80)
        return size_t(0);          //match the deduced return type
    //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
    throw bad_utf8_encoded_text(...);
}
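A hypothetical caller (my addition, not part of the original answer) could then truncate and append the ellipsis like this:

#include <string>

std::string Truncate(const std::string &str, size_t max_size)
{
    if (str.size() <= max_size)
        return str;    // already short enough, nothing to cut
    // FindCutPosition returns a codepoint boundary at or before max_size - 3,
    // which leaves room for the three-character "..." suffix.
    return str.substr(0, FindCutPosition(str, max_size)) + "...";
}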
I have this code which handles strings like "19485" or "10011010" or "AF294EC"...
long long toDecimalFromString(string value, Format format) {
    long long dec = 0;
    for (int i = value.size() - 1; i >= 0; i--) {
        char ch = value.at(i);
        int val = int(ch);
        if (ch >= '0' && ch <= '9') {
            val = val - 48;
        } else {
            val = val - 55;
        }
        dec = dec + val * (long long)(pow((int) format, (value.size() - 1) - i));
    }
    return dec;
}
This code works for all values which are not in 2's complement.
If I pass a hex string which is supposed to be a negative number in decimal, I don't get the right result.
If you don't handle the minus sign, it won't handle itself. Check for it, and remember the fact that you've seen it. Then, at the end, if you'd seen a '-' as the first character, negate the result.
Other points:
You don't need (nor want) to use pow: it's just results = format * results + digit each time through.
You do need to validate your input, making sure that the digit you obtain is legal in the base (and that you don't have any other odd characters).
You also need to check for overflow.
You should use isdigit and isalpha (or islower and isupper) for your character checking.
You should use e.g. val -= '0' (and not 48) for your conversion from character code to digit value.
You should use [i], and not at(i), to read the individual characters. Compile with the usual development options, and you'll get a crash, rather than an exception, in case of error. But you should probably use iterators, and not an index, to go through the string. It's far more idiomatic.
You should almost certainly accept both upper and lower case for the alphas, and probably skip leading white space as well.
Technically, there's also no guarantee that the alphabetic characters are in order and adjacent. In practice, I think you can count on it for characters in the range 'A'-'F' (or 'a'-'f'), but the surest way of converting a character to a digit is to use a table lookup.
A sketch pulling these points together follows.
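Here is that sketch (my code, not the answerer's; I use a plain int base in place of the question's Format enum, and the overflow check is left as a comment):

#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

long long toDecimalFromString(const std::string& value, int base)
{
    auto it = value.begin();
    while (it != value.end() && std::isspace(static_cast<unsigned char>(*it)))
        ++it;                                   // skip leading white space
    bool negative = false;
    if (it != value.end() && (*it == '-' || *it == '+')) {
        negative = (*it == '-');                // remember the sign for the end
        ++it;
    }
    if (it == value.end())
        throw std::invalid_argument("no digits");
    long long result = 0;
    static const std::string digits = "0123456789abcdef";   // lookup table
    for (; it != value.end(); ++it) {
        std::size_t d = digits.find(
            static_cast<char>(std::tolower(static_cast<unsigned char>(*it))));
        if (d == std::string::npos || static_cast<int>(d) >= base)
            throw std::invalid_argument("bad digit for this base");
        result = result * base + static_cast<long long>(d);  // no pow needed
        // (a full implementation would check for overflow here)
    }
    return negative ? -result : result;
}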
You need to know whether the specified number is to be interpreted as signed or unsigned (in other words, is "ffffffff" -1 or 4294967295?).
If signed, then to detect a negative number, test the most significant bit. If the ms bit is set, then after converting the number as you do (producing an unsigned value), take the two's complement (bitwise negate it, then add 1) to get the magnitude of the negative value.
Note: to test the ms bit, you can't just test the leading character. If the number is signed, is "ff" supposed to be -1 or 255? You need to know the size of the expected result (if 32 bits and signed, then "ffffffff" is negative, i.e. -1; but if 64 bits and signed, "ffffffff" is positive, i.e. 4294967295). Thus there is more than one right answer for the example "ffffffff".
Instead of testing the ms bit you could just test whether the unsigned result is greater than the "midway point" of the result range (for example, 2^31 - 1 for 32-bit numbers).
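For instance, assuming a 32-bit signed result (a sketch of my own, not the answerer's code; hexToSigned32 is a made-up name):

#include <cstdint>
#include <limits>
#include <string>

int32_t hexToSigned32(const std::string& hex)
{
    // Parse the digits as an unsigned value first ("ffffffff" -> 0xFFFFFFFF).
    uint32_t u = static_cast<uint32_t>(std::stoul(hex, nullptr, 16));
    // If the most significant bit is set, the bits denote a negative number;
    // fold the value into the signed range (i.e. compute u - 2^32) without
    // overflowing.
    if (u & 0x80000000u)
        return static_cast<int32_t>(u - 0x80000000u)
               + std::numeric_limits<int32_t>::min();
    return static_cast<int32_t>(u);    // "7fffffff" and below stay positive
}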
I have a char a[] of hexadecimal characters like this:
"315c4eeaa8b5f8aaf9174145bf43e1784b8fa00dc71d885a804e5ee9fa40b16349c146fb778cdf2d3aff021dfff5b403b510d0d0455468aeb98622b137dae857553ccd8883a7bc37520e06e515d22c954eba5025b8cc57ee59418ce7dc6bc41556bdb36bbca3e8774301fbcaa3b83b220809560987815f65286764703de0f3d524400a19b159610b11ef3e"
I want to convert it to the letters corresponding to each hexadecimal number, like this:
68656c6c6f = hello
and store the result in char b[], and then do the reverse.
I don't want a block of code, please; I want an explanation, which libraries were used, and how to use them.
Thanks
Assuming you are talking about ASCII codes: well, the first step is to find the size of b. Assuming every character is represented by 2 hexadecimal digits (for example, a tab would be 09), the size of b is simply strlen(a) / 2 + 1.
That done, you need to go through the letters of a, two by two, convert them to their integer value, and store them as a string. Written as a formula, you have:
b[i] = (to_digit(a[2*i]) << 4) + to_digit(a[2*i+1])
where to_digit(x) converts '0'-'9' to 0-9 and 'a'-'f' or 'A'-'F' to 10-15.
Note that if characters below 0x10 are shown with only one digit (the only one I can think of is tab), then instead of using 2*i as the index into a, you should keep a next_index in your loop, which is incremented by 2 if a[next_index] < '8', or by 1 otherwise. In the latter case, b[i] = to_digit(a[next_index]).
The reverse of this operation is very similar. Each character b[i] is written as:
a[2*i] = to_char(b[i] >> 4)
a[2*i+1] = to_char(b[i] & 0xf)
where to_char is the opposite of to_digit.
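Sketches of those two helpers (my code; the names follow the formulas above):

int to_digit(char x)
{
    if (x >= '0' && x <= '9') return x - '0';
    if (x >= 'a' && x <= 'f') return x - 'a' + 10;
    if (x >= 'A' && x <= 'F') return x - 'A' + 10;
    return -1;                         // not a hexadecimal digit
}

char to_char(int v)                    // v must be in 0..15
{
    return v < 10 ? static_cast<char>('0' + v)
                  : static_cast<char>('a' + v - 10);
}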
Converting the hexadecimal string to a character string can be done by using std::string::substr to get the next two characters of the hex string, then using std::stoi to convert the substring to an integer. This can be cast to a character that is appended to a std::string. The std::stoi function is C++11 only; if you don't have it, you can use e.g. std::strtol.
To do the opposite you loop over each character in the input string, cast it to an integer and put it in an std::ostringstream preceded by manipulators to have it presented as a two-digit, zero-prefixed hexadecimal number. Append to the output string.
Use std::string::c_str to get an old-style C char pointer if needed.
No external library, only using the C++ standard library.
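A minimal sketch of that approach (my code; hexToChars and charsToHex are made-up names, and C++11 is assumed for std::stoi):

#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

std::string hexToChars(const std::string& hex)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out += static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16));
    return out;
}

std::string charsToHex(const std::string& text)
{
    std::ostringstream oss;
    for (unsigned char c : text)
        oss << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<int>(c);    // two-digit, zero-prefixed hex
    return oss.str();
}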
Forward:
1. Read two hex chars from input.
2. Convert to int (0..255). (hint: sscanf is one way)
3. Append the int to the output char array.
4. Repeat 1-3 until out of chars.
5. Null-terminate the array.
Reverse:
1. Read a single char from the array.
2. Convert to 2 hexadecimal chars (hint: sprintf is one way).
3. Concat the buffer from (2) to the final output string buffer.
4. Repeat 1-3 until out of chars.
Almost forgot to mention: stdio.h and the regular C runtime are the only requirements, assuming you're using sscanf and sprintf. You could alternatively create a pair of conversion tables that would radically speed up the conversions.
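A hedged sketch of those steps (my code; hex_to_bytes and bytes_to_hex are made-up names, and the caller is responsible for sizing the buffers):

#include <cstdio>
#include <cstring>

void hex_to_bytes(const char* a, char* b)    // b must hold strlen(a)/2 + 1 bytes
{
    std::size_t n = std::strlen(a) / 2;
    for (std::size_t i = 0; i < n; ++i) {
        unsigned int v;
        std::sscanf(a + 2 * i, "%2x", &v);   // steps 1-2: two hex chars -> int
        b[i] = static_cast<char>(v);         // step 3: append to the output
    }
    b[n] = '\0';                             // step 5: null-terminate
}

void bytes_to_hex(const char* b, char* a)    // a must hold 2*strlen(b) + 1 bytes
{
    a[0] = '\0';                             // cover the empty-input case
    for (std::size_t i = 0; b[i] != '\0'; ++i)
        std::sprintf(a + 2 * i, "%02x",
                     static_cast<unsigned char>(b[i]));  // steps 1-3 of "Reverse"
}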
Here's a simple piece of code to do the trick:
unsigned int hex_digit_value(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    return -1;   // wraps to UINT_MAX, signalling "not a hex digit"
}

std::string dehexify(std::string const & s)
{
    std::string result(s.size() / 2, '\0');   // one output byte per digit pair
    for (std::size_t i = 0; i != s.size() / 2; ++i)
    {
        result[i] = hex_digit_value(s[2 * i]) * 16
                  + hex_digit_value(s[2 * i + 1]);
    }
    return result;
}
Usage:
char const a[] = "12AB";
std::string s = dehexify(a);
Notes:
A proper implementation would add checks that the input string length is even and that each digit is in fact a valid hex numeral.
Dehexifying has nothing to do with ASCII. It just turns any hexified sequence of nibbles into a sequence of bytes. I just use std::string as a convenient "container of bytes", which is exactly what it is.
There are dozens of answers on SO showing you how to go the other way; just search for "hexify".
Each hexadecimal digit corresponds to 4 bits, because 4 bits has 16 possible bit patterns (and there are 16 possible hex digits, each standing for a unique 4-bit pattern).
So, two hexadecimal digits correspond to 8 bits.
And on most computers nowadays (some Texas Instruments digital signal processors are an exception) a C++ char is 8 bits.
This means that each C++ char is represented by 2 hex digits.
So, simply read two hex digits at a time, convert to int using e.g. an istringstream, convert that to char, and append each char value to a std::string.
The other direction is just the opposite, but with a twist: because char is signed on most systems, you need to convert to unsigned char before converting that value to hex digits again.
Conversion to and from hexadecimal can be done using the hex manipulator, e.g.
cout << hex << x;
cin >> hex >> x;
for a suitable definition of x, e.g. int x.
This should work for string streams as well.
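Putting that together, a short sketch (my code) of the round trip using string streams:

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("68656c6c6f");
    std::string result;

    char pair[3] = {};                      // two hex digits plus terminator
    while (in.read(pair, 2))                // read two hex digits at a time
    {
        int x = 0;
        std::istringstream(pair) >> std::hex >> x;
        result += static_cast<char>(x);
    }
    std::cout << result << '\n';            // prints "hello"

    std::ostringstream out;
    for (unsigned char c : result)          // back to hex, via unsigned char
        out << std::hex << static_cast<int>(c);
    std::cout << out.str() << '\n';         // prints "68656c6c6f"
    // (bytes below 0x10 would also need setw(2)/setfill('0') for padding)
}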