Does PVS-Studio know about Unicode chars? - c++

This code produces Medium-level warnings at the lines with return:
// Checks if the symbol defines two-symbols Unicode sequence
bool doubleSymbol(const char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}

// Checks if the symbol defines three-symbols Unicode sequence
bool tripleSymbol(const char c) {
    static const char THREE_SYMBOLS_MASK = 0b1110;
    return (c >> 4) == THREE_SYMBOLS_MASK;
}

// Checks if the symbol defines four-symbols Unicode sequence
bool quadrupleSymbol(const char c) {
    static const char FOUR_SYMBOLS_MASK = 0b11110;
    return (c >> 3) == FOUR_SYMBOLS_MASK;
}
PVS says that the expressions are always false (V547), but they actually aren't: the char may be part of a Unicode symbol that is read into a std::string!
Here is the Unicode (UTF-8) representation of symbols:
1 byte - 0xxx'xxxx - 7 bits
2 bytes - 110x'xxxx 10xx'xxxx - 11 bits
3 bytes - 1110'xxxx 10xx'xxxx 10xx'xxxx - 16 bits
4 bytes - 1111'0xxx 10xx'xxxx 10xx'xxxx 10xx'xxxx - 21 bits
The following code counts the number of symbols in a Unicode text:
size_t symbolCount = 0;
std::string s;
while (getline(std::cin, s)) {
    for (size_t i = 0; i < s.size(); ++i) {
        const char c = s[i];
        ++symbolCount;
        if (doubleSymbol(c)) {
            i += 1;
        } else if (tripleSymbol(c)) {
            i += 2;
        } else if (quadrupleSymbol(c)) {
            i += 3;
        }
    }
}
std::cout << symbolCount << "\n";
For the Hello! input the output is 6 and for Привет, мир! is 12 — this is right!
Am I wrong or doesn't PVS know something? ;)

The PVS-Studio analyzer knows that there are signed and unsigned char types. Whether char is signed or unsigned depends on the compiler switches, and PVS-Studio takes these switches into account.
I think this code is compiled in a mode where char is signed. Let's see what consequences that brings.
Let’s look only at the first case:
bool doubleSymbol(const char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}
If the value of the variable 'c' is non-negative (its highest bit is 0), the condition will always be false, because after shifting right by 5 the largest value you can get is 0b011, which can never equal 0b110.
It means we are interested only in the cases where the highest bit of the variable 'c' is equal to 1. As this variable is of signed char type, the highest bit means that the variable stores a negative value. Before the shift, the signed char is promoted to a signed int, and the value remains negative.
Now let's see what the standard says about the right-shift of negative numbers:
The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a non-negative value, the value of the result is the integral part of the quotient of E1/2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
Thus, the right shift of a negative number is implementation-defined. This means that the highest bits are filled either with zeros or with ones; both would be correct.
PVS-Studio assumes that the highest bits are filled with ones. It has every right to do so, because it has to choose one of the allowed implementations. So it turns out that the expression (c >> 5) will have a negative value if the highest bit of the variable 'c' is originally equal to 1, and a negative number cannot be equal to TWO_SYMBOLS_MASK.
It turns out that from the viewpoint of PVS-Studio, the condition will always be false, and it correctly issues a warning V547.
In practice, the compiler may behave differently: the highest bits will be filled with 0 and then everything will work correctly.
In any case, it is necessary to fix the code, as it relies on implementation-defined behavior of the compiler.
The code might be fixed as follows:
bool doubleSymbol(const unsigned char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}
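If changing the signature is undesirable, a similar fix (a sketch of mine, not from the original answer) is to keep the char parameter and convert to unsigned char before shifting, so the shift always operates on a non-negative value:
bool doubleSymbol(const char c) {
    static const unsigned char TWO_SYMBOLS_MASK = 0b110;
    // Converting to unsigned char first makes the shifted value non-negative,
    // so the result no longer depends on implementation-defined behavior.
    return (static_cast<unsigned char>(c) >> 5) == TWO_SYMBOLS_MASK;
}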

Related

What does this char string related piece of C++ code do?

bool check(const char *text) {
    char c;
    while (c = *text++) {
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}
What's 0x80 and what does the whole mysterious function do?
Testing the result of an x & 0x80 expression for non-zero (as is done twice in the code you show) checks if the most significant bit (bit 7) of the char operand (x) is set [1]. In your case, the code loops through the given string looking for two consecutive characters (c, which is a copy of the 'current' character, and *text, the next one) with that bit set.
If such a combination is found, the function returns true; if it is not found and the loop reaches the nul terminator (so that the c = *text++ expression becomes zero), it returns false.
As to why it does such a check – I can only guess but, if that upper bit is set, then the character will not be a standard ASCII value (and may be the first of a Unicode pair, or some other multi-byte character representation).
Possibly helpful references:
Bitwise operators
Hexadecimal constants
[1] Note that this bitwise AND test is really the only safe way to check that bit, because the C++ Standard allows the char type to be either signed (where testing for a negative value would be an alternative) or unsigned (where testing for >= 128 would be required); either of those tests would fail if the implementation's char had the 'wrong' kind of signedness.
I can't be totally sure without more context, but it looks to me like this function checks to see if a string contains any UTF-8 characters outside the classic 7-bit US-ASCII range.
while (c=*text++) will loop until it finds the nul-terminator in a C-style string; assigning each char to c as it goes. c & 0x80 checks if the most-significant-bit of c is set. *text & 0x80 does the same for the char pointed to by text (which will be the one after c, since it was incremented as part of the while condition).
Thus this function will return true if any two adjacent chars in the string pointed to by text have their most-significant-bit set. That's the case for any code points U+0080 and above in UTF-8; hence my guess that this function is for detecting UTF-8 text.
Rewriting to be less compact:
while (true)
{
    char c = *text;
    text += 1;
    if (c == '\0')                 // at the end of the string?
        return false;
    int temp1 = c & 0x80;          // test MSB of c
    int temp2 = (*text) & 0x80;    // test MSB of the next character
    if (temp1 != 0 && temp2 != 0)  // if both are set, return true
        return true;
}
MSB means Most Significant Bit (bit 7). It is zero for plain ASCII characters.
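A minimal usage sketch (my addition; it assumes the source and terminal are UTF-8 encoded and reproduces check() from the question, with extra parentheses to silence the assignment-in-condition warning):
#include <iostream>

bool check(const char *text) {
    char c;
    while ((c = *text++)) {
        if ((c & 0x80) && ((*text) & 0x80))
            return true;
    }
    return false;
}

int main() {
    std::cout << check("Hello") << '\n';   // 0: every byte is plain 7-bit ASCII
    std::cout << check("Привет") << '\n';  // 1: adjacent bytes with the high bit set (UTF-8)
}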

Unsigned char value cycle C++

I (think I) understand how the maths with different variable types works. For example, if I go over the max limit of an unsigned int variable, it will loop back to 0.
I don't understand the behavior of this code with unsigned char:
#include <iostream>

int main() {
    unsigned char var{ 0 };
    for (int i = 0; i < 501; ++i) {
        var += 1;
        std::cout << var << '\n';
    }
}
This just outputs 1...9, then some symbols and capital letters, and then it just doesn't print anything. It doesn't loop back to the values 1...9 etc.
On the other hand, if I cast to int before printing:
#include <iostream>

int main() {
    unsigned char var{ 0 };
    for (int i = 0; i < 501; ++i) {
        var += 1;
        std::cout << (int)var << '\n';
    }
}
It does print from 1...255 and then loops back from 0...255.
Why is that? It seems that the unsigned char variable does loop (as we can see from the int cast).
Is it safe to do maths with unsigned char variables? What is the behavior that I see here?
Why doesn't it print the expected integer value?
The issue is not with the wrap-around of the char. The issue is with the insertion operation for std::ostream objects and 8-bit integer types. The non-member operator<< overloads for these types treat all 8-bit integers (char, signed char, and unsigned char) as character types and print them as characters.
operator<<(std::basic_ostream)
The canonical way to handle outputting 8-bit integer types is the way you're doing it. I personally prefer this instead:
char foo;
std::cout << +foo;
The unary + operator promotes the char type to an integer type, which then causes the integer printing function to be called.
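A minimal sketch (my addition) applying that trick to the loop from the question, so the numeric value is printed instead of a character:
#include <iostream>

int main() {
    unsigned char var{ 0 };
    for (int i = 0; i < 501; ++i) {
        var += 1;                   // wraps around to 0 after 255
        std::cout << +var << '\n';  // unary + promotes to int, so numbers are printed
    }
}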
Note that wrap-around on overflow is only guaranteed for unsigned integer types. If you repeat this with char or signed char, the standard does not guarantee the same wrap-around (the conversion of the out-of-range value back to the signed type is implementation-defined before C++20). SOMETHING will happen, for sure, because we live in reality, but that behavior may differ from compiler to compiler.
Why doesn't it repeat the 0..9 characters
I tested this using g++ to compile, and bash on Ubuntu 20.04. My non-printable characters are shown as explicit symbols in some cases, and nothing is printed in other cases. The non-repeating behavior must be due to how your shell handles these non-printable characters; we can't answer that without more information.
Unsigned chars aren't treated as numbers in this case. This data type is literally a byte:
1 byte = 8 bits = 0000 0000, which means 0.
What cout is printing is the character that represents the byte you keep changing by adding 1 to it.
For example:
0 = 0000 0000
1 = 0000 0001
2 = 0000 0010
.
.
.
9 = 0000 1001
Then come other characters that aren't related to digits.
So, if you cast it to int, it will give you the numeric representation of that byte, giving you a 0-255 output.
Hope this clarifies!
Edit: Made the explanation more clear.

How to convert multi-character constant to integer in C?

How do I convert the multi-character constant in x to an integer?
I tried for example '13' as ('3' + '1' << 3), but it doesn't work properly.
I don't mean "0123", but '0123'. It compiles, but I don't understand how the compiler gets the octal result 6014231063 when printing it. I am not looking for atoi, which just converts a digit string to the number it spells. For example, int x = '1' would print 49 in the decimal number system. Now I am interested in what int x = '0123' would print. This task is from a programming competition, so the answer shouldn't be unexpected behavior.
#include <stdio.h>

int main(void) {
    int x = '0123';
    printf("%o\n", x);
    printf("%d\n", x >> 24);
    printf("%d\n", x << 8 >> 24);
    printf("%d\n", x & 0xff);
    return 0;
}
How to convert multi-character constant to integer in C?
'0123' already is an int.
int x = '0123';
'0123' is a character constant. In C, this is one of the forms of a constant and it has type int. It is rarely used, as its value is implementation-defined. It is usually one of the following, depending on endianness and character encoding (e.g. ASCII):
(('0'*256 + '1')*256 + '2')*256 + '3' = 808530483 = 0x30313233
(('3'*256 + '2')*256 + '1')*256 + '0' = 858927408 = 0x33323130
Further, it is a challenge to write useful portable code with it; many coding styles bar character constants with more than one character.
'0123' is a multi-character constant/literal (C calls it a constant, C++ calls it a literal). In both languages, it is of type int and has an implementation-defined value.
It's probably typical for '0123' to have the value
('0' << 24) + ('1' << 16) + ('2' << 8) + '3'
(assuming CHAR_BIT==8, and keeping in mind that the values of '0' et al are themselves implementation-defined).
Because the value is implementation-defined, multi-character constants are rarely useful, and nearly useless in portable code. The standard doesn't even guarantee that '0123' and '1234' have distinct values.
But to answer your question, '0123' is already of type int, so no conversion is necessary. You can store, manipulate, or print that value in any way you like.
For example, on my system this program:
#include <stdio.h>

int main(void) {
    printf("0x%x\n", (unsigned int)'0123');
}
prints (after a compile-time warning):
0x30313233
which is consistent with the formula above -- but the result might differ under another implementation.
The "implementation-defined" value means that an implementation is required to document it. gcc's behavior (for version 5.3) is documented here:
The preprocessor and compiler interpret character constants in the same way; i.e. escape sequences such as ‘\a’ are given the values they would have on the target machine.

The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.

For example, 'ab' for a target with an 8-bit char would be interpreted as (int) ((unsigned char) 'a' * 256 + (unsigned char) 'b'), and '\234a' as (int) ((unsigned char) '\234' * 256 + (unsigned char) 'a').
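As a small check of that documented rule (my addition, not part of the quoted documentation), the following program should print two equal values with gcc, after a multi-character constant warning:
#include <stdio.h>

int main(void) {
    int literal = 'ab';  /* implementation-defined value */
    int formula = (int)((unsigned char)'a' * 256 + (unsigned char)'b');
    printf("%d %d\n", literal, formula);  /* expected to match under gcc's documented rule */
    return 0;
}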
You could try something along the lines of creating a function like this:
#include <math.h>  /* for pow() */

int StringLiteralToInt(const char *string, int numberOfCharacters)
{
    int result = 0;
    for (int ch = 0; ch < numberOfCharacters; ch++)
    {
        /* weight of this digit position: 10^(numberOfCharacters - (ch + 1)) */
        int powerTen = (int)pow(10, numberOfCharacters - (ch + 1));
        /* convert the character to its digit value before scaling */
        result += (string[ch] - '0') * powerTen;
    }
    return result;
}
I just wrote that inline, so it might not be 100% right, but it should be the right idea. Just multiply each digit by a power of ten (rightmost: 10^0, leftmost: 10^(stringSize-1)).
Hope that helps :)
Well, you could try this:
#include <stdio.h>

int main()
{
    int x = '0123';
    printf("%x\n", x);
}
For me this prints 30313233, as I expect.
Here it is broken apart, as it looks like you were trying to do:
printf("%o ", (x >> 24) & 0xff);
printf("%o ", (x >> 16) & 0xff);
printf("%o ", (x >> 8) & 0xff);
printf("%o\n", x & 0xff);
These printouts show that the multi-character character constant is, in some sense, made up of the characters '0', '1', '2', and '3' all jammed together. But there is really no sense in which this multi-character character constant has any meaningful relationship to the integer 123. (We could write some code to shift and mask by 8 bits, then subtract '0' to convert from character to digit, then multiply by 10 and add, just like atoi, but it wouldn't really mean anything.)
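For completeness, here is a sketch (my addition) of that shift-and-mask conversion, assuming the 0x30313233 packing shown above; it recovers 123 from '0123', though, as said, the result has no portable meaning:
#include <stdio.h>

int main(void) {
    int x = '0123';
    int value = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
        int digit = ((x >> shift) & 0xff) - '0';  /* extract one character, convert it to a digit */
        value = value * 10 + digit;
    }
    printf("%d\n", value);  /* prints 123 with the packing assumed above */
    return 0;
}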

How does the compiler know whether to extend zeros or ones?

#include <stdio.h>

int main()
{
    char c = -1;
    unsigned char u = -1;
    printf("c = %u\n", c);
    printf("c = %d\n", c);
    printf("u = %u\n", u);
    printf("u = %d\n", u);
}
The result is:
c = 4294967295
c = -1
u = 255
u = 255
When I try to convert it into unsigned int, because of sign extension I get c = 4294967295.
But when I try to convert the unsigned char into unsigned int, I get u = 255.
In the first case the 1 bits are extended into the upper bits and that number is printed; in the second case 0 bits are extended.
But my question is: how does the compiler decide whether to extend zeros or ones when it fits small data into a larger type?
Unsigned numbers (such as your u) will not be sign-extended, they will be zero-extended. That's because they have no sign! Signed numbers (i.e. variables of signed integral type) are always sign extended. The confusion may come from your initial assignment of -1 into an unsigned type--once it's assigned, there is no longer a possibility for sign extension.
The printing involves two steps:
a) The 8-bit chars (signed/unsigned) are converted to 32 bits.
b) The 32-bit value is printed as signed or unsigned.
So after step a), c = 0b1111.1111.1111.1111.1111.1111.1111.1111 while u = 0b0000.0000.0000.0000.0000.0000.1111.1111, due to the rule described by John Zwinck.
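A small sketch (my addition, assuming plain char is signed on this platform, as in the question) that makes those two steps explicit:
#include <stdio.h>

int main(void) {
    char c = -1;           /* assumes char is signed here */
    unsigned char u = -1;  /* becomes 255 */
    int ci = c;            /* sign-extended: the upper bits become 1 */
    int ui = u;            /* zero-extended: the upper bits become 0 */
    printf("%d %d\n", ci, ui);  /* -1 255 */
    return 0;
}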

Get number of bits in char

How do I get the number of bits in type char?
I know about CHAR_BIT from climits. It is described as »The macro yields the maximum value for the number of bits used to represent an object of type char.« in Dinkumware's C Reference. I understand that to mean the number of bits in a char, doesn't it?
Can I get the same result with std::numeric_limits somehow? std::numeric_limits<char>::digits returns 7 here, which is correct but unfortunate, because this value respects the signedness of the 8-bit char…
CHAR_BIT is, by definition, the number of bits in the object representation of type [signed/unsigned] char.
numeric_limits<>::digits is the number of non-sign bits in the value representation of the given type.
Which one do you need?
If you are looking for the number of bits in the object representation, then the correct approach is to take the sizeof of the type and multiply it by CHAR_BIT (of course, there's no point in multiplying by sizeof in the specific case of char types, since their size is always 1, and since CHAR_BIT by definition already contains what you need).
If you are talking about value representation then numeric_limits<> is the way to go.
For unsigned char type the bit-size of object representation (CHAR_BIT) is guaranteed to be the same as bit-size of value representation, so you can use numeric_limits<unsigned char>::digits and CHAR_BIT interchangeably, but this might be questionable from the conceptual point of view.
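A short sketch (my addition) contrasting the two, on a typical platform where char has 8 bits:
#include <climits>
#include <iostream>
#include <limits>

int main() {
    std::cout << CHAR_BIT << '\n';                                     // 8: bits in the object representation
    std::cout << std::numeric_limits<unsigned char>::digits << '\n';  // 8: value bits of unsigned char
    std::cout << std::numeric_limits<char>::digits << '\n';           // 7 if char is signed, 8 if unsigned
}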
If you want to be overly specific, you can do this:
sizeof(char) * CHAR_BIT
If you know you are definitely dealing with char, the sizeof is a bit of overkill, as sizeof(char) is guaranteed to be 1.
But if you move to a different type such as wchar_t, that will be important.
Looking at the snippets archive for this code here, here's an adapted version (I do not claim this code as my own):
int countbits(char ch) {
    int n = 0;
    if (ch) {
        do n++;
        while (0 != (ch = ch & (ch - 1)));
    }
    return n;
}
Hope this helps,
Best regards,
Tom.
A non-efficient way:
char c;
int bits;
for (c = 1, bits = 0; c; c <<= 1, bits++)
    ;
printf("bits = %d\n", bits);
No reputation here yet, so I'm not allowed to comment on #t0mm13b's answer, but I wanted to point out that there's a problem with the code:
int countbits(char ch) {
    int n = 0;
    if (ch) {
        do n++;
        while (0 != (ch = ch & (ch - 1)));
    }
    return n;
}
The above won't count the number of bits in a character, it will count the number of set bits (1 bits).
For example, the following call will return 4:
char c = 'U';
countbits(c);
The code
ch = ch & (ch - 1)
is a trick to strip off the rightmost (least significant) bit that is set to 1. So it glosses over any bits set to 0 and doesn't count them.