How to get Windows-1252 character values in C++?

I have a weird input file with all kinds of control characters like nulls. I want to remove all control characters from this Windows-1252 encoded text file, but if you do this:
std::string test = "tést";
for (int i = 0; i < test.length(); i++)
{
    if (test[i] < 32) test[i] = 32; // change all control characters into spaces
}
It will change the é into a space as well.
So if you have a string like this, encoded in Windows-1252:
std::string test="tést";
The hex values would be:
t é s t
74 E9 73 74
See https://en.wikipedia.org/wiki/ASCII and https://en.wikipedia.org/wiki/Windows-1252
test[0] equals decimal 116 (= 0x74), but apparently, with é/0xE9, test[1] does not equal the decimal value 233.
So how can you recognize that é properly?

32 is a signed integer literal, so the comparison between the char and 32 is performed as signed: 0xE9 is -23 as a signed char, and -23 < 32 returns true.
Using an unsigned literal, 32u, forces the comparison to be performed on unsigned values: the char operand is converted to unsigned, the negative value for 0xE9 wraps to a large positive number, and the comparison with 32 returns false.
Replace:
if (test[i] < 32) test[i] = 32;
with:
if (test[i] < 32u) test[i] = 32u;
And you should get the expected result.
Test this here:
https://onlinegdb.com/BJ8tj0kbd
Note: you can check that char is signed with the following code:
#include <iostream>
#include <limits>
...
std::cout << std::numeric_limits<char>::is_signed << std::endl;
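For completeness, here is a minimal self-contained sketch of the whole fix (the sample string is made up for illustration: \xE9 is é in Windows-1252 and \x01 stands in for one of the control bytes):
#include <iostream>
#include <string>

int main()
{
    // Windows-1252 encoded text: \xE9 is é, \x01 stands in for a control byte
    std::string test = "t\xE9st\x01";

    for (std::string::size_type i = 0; i < test.length(); i++)
    {
        // 32u makes the comparison unsigned: a negative char such as 0xE9 is
        // converted to a large unsigned value and left alone, while genuine
        // control characters (0..31) are turned into spaces.
        if (test[i] < 32u) test[i] = 32u;
    }

    std::cout << test << std::endl; // "tést " when the terminal uses Windows-1252
    return 0;
}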

Change
if (test[i]<32)
to
if (test[i] >= 0 && test[i] < 32)
char is often a signed type, and 0xE9 is a negative value in an eight-bit signed integer.
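Applied to the whole loop, that check might look like this (a sketch; the strip_controls name is mine, not from the answer):
#include <string>

// Replace only genuine ASCII control characters (0..31) with spaces.
// Where char is signed, é (0xE9) is negative, fails the s[i] >= 0 test,
// and is therefore left untouched.
void strip_controls(std::string& s)
{
    for (std::string::size_type i = 0; i < s.length(); i++)
    {
        if (s[i] >= 0 && s[i] < 32) s[i] = ' ';
    }
}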

Related

How to read ASCII value from a character and convert it into hexadecimal formatted string

I need to read the value of a character as a number and find the corresponding hexadecimal value for it.
#include <iostream>
#include <iomanip>
using namespace std;

int main() {
    char c = 197;
    cout << hex << uppercase << setw(2) << setfill('0') << (short)c << endl;
}
Output:
FFC5
Expected output:
C5
The problem is that when you use char c = 197 you are overflowing the char type, producing a negative number (-59). Starting there it doesn't matter what conversion you make to larger types, it will remain a negative number.
To fully understand why you must know how two's complement works.
Basically, -59 and 197 have the same binary representation: 1100 0101; depending on the data type, it is interpreted one way or the other. When you print it using hexadecimal format, the binary representation (the actual value stored in memory) is the one used, producing C5.
When the char is converted into a short/unsigned short, the -59 is converted into its short/unsigned short representation, which is 1111 1111 1100 0101 (FFC5) in both cases.
The correct way to do it would be to store the initial value (197), from the very beginning, in a variable whose data type is able to represent it (unsigned char, short, unsigned short, ...).
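A sketch of that suggestion, assuming the goal is simply to print C5:
#include <iostream>
#include <iomanip>

int main()
{
    // 197 fits in an unsigned char (0..255), so there is no wrap-around and
    // no sign extension later on.
    unsigned char c = 197;
    std::cout << std::hex << std::uppercase
              << std::setw(2) << std::setfill('0')
              << static_cast<int>(c) << std::endl; // prints C5
    return 0;
}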

Difference between converting int to char by (char) and by ASCII

I have an example:
int var = 5;
char ch = (char)var;
char ch2 = var+48;
cout << ch << endl;
cout << ch2 << endl;
I had some other code where (char) returned the wrong answer but +48 didn't. When I changed ONLY (char) to +48, my code worked.
What is the difference between converting int to char by using (char) and +48 (ASCII) in C++?
char ch=(char)var; has the same effect as char ch=var; and assigns the numeric value 5 to ch. You're using ASCII (supported by all modern systems), and ASCII character code 5 represents Enquiry ('ENQ'), an old terminal control code. Perhaps some old-timer has a clue what it did!
char ch2 = var+48; assigns the numeric value 53 to ch2 which happens to represent the ASCII character for the digit '5'. ASCII 48 is zero (0) and the digits all appear in the ASCII table in order after that. So 48+5 lands on 53 (which represents the character '5').
In C++ char is an integer type. The value is interpreted as representing an ASCII character, but it should be thought of as holding a number.
Its numeric range is either [-128,127] or [0,255]. That's because C++ requires sizeof(char)==1 and all modern platforms have 8 bit bytes.
NB: C++ doesn't actually mandate ASCII, but again that will be the case on all modern platforms.
PS: I think it's an unfortunate artifact of C (inherited by C++) that sizeof(char)==1 and there isn't a separate fundamental type called byte.
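A quick sketch to see both values side by side (not part of the original answer):
#include <iostream>

int main()
{
    int var = 5;
    char ch  = static_cast<char>(var); // numeric value 5  -> the unprintable ENQ control code
    char ch2 = var + 48;               // numeric value 53 -> the printable character '5'

    std::cout << static_cast<int>(ch) << " " << static_cast<int>(ch2) << std::endl; // 5 53
    std::cout << ch2 << std::endl;                                                  // 5
    return 0;
}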
A char is simply the smallest integral type in C++. Output facilities like cout and printf map char values to the corresponding entry of the platform's character mapping; on Windows computers this is typically ASCII.
Note that code 5 in ASCII maps to the Enquiry character, which has no printable representation, while code 53 maps to the printable character 5.
A generally accepted hack to store a number 0-9 in a char is to do const char ch = var + '0'. It's important to note the caveats here:
If your code is running on some non-ASCII character mapping, the digits 0 through 9 are still guaranteed by the standard to be contiguous and in increasing order, so this particular worry does not apply in practice (the same cannot be said for the letters).
If var is outside the 0-9 range, var + '0' will map to something other than a digit character.
A way to get the most significant digit of a number without worrying about either point is to use:
const auto ch = std::to_string(var).front();
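For example (a sketch; it assumes var holds a non-negative int, since for a negative number front() would return the '-' sign):
#include <iostream>
#include <string>

int main()
{
    int var = 5;
    const auto ch = std::to_string(var).front(); // '5'; for var = 42 it would give '4'
    std::cout << ch << std::endl;
    return 0;
}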
Generally, char represents a number just as int does. Casting an int value to char doesn't produce its ASCII representation.
The ASCII codes for the digits range from 48 (== '0') to 57 (== '9'). So to get the printable digit you have to add '0' (or 48).
The difference is that casting to char with (char) explicitly converts the value to a char, while adding 48 does not.
It's important to note that an int is typically 32 bits and a char is typically 8 bits. This means the numbers you can store in a char range from -128 to +127 (or 0 to 255, i.e. 2^8 - 1, for unsigned char), while an int ranges from -2,147,483,648 (-2^31) to 2,147,483,647 (2^31 - 1) (or 0 to 2^32 - 1 for unsigned).
Adding 48 to a value does not change its type to char.
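Those ranges can be checked directly with std::numeric_limits (a small sketch, not part of the original answer):
#include <iostream>
#include <limits>

int main()
{
    // The char limits are cast to int so they print as numbers, not as characters.
    std::cout << "char: " << static_cast<int>(std::numeric_limits<char>::min())
              << " to "   << static_cast<int>(std::numeric_limits<char>::max()) << "\n";
    std::cout << "int:  " << std::numeric_limits<int>::min()
              << " to "   << std::numeric_limits<int>::max() << "\n";
    return 0;
}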

C++ int with preceding 0 changes entire value

I have this very strange problem where if I declare an int like so
int time = 0110;
and then display it to the console the value returned is 72. However when I remove the 0 at the front so that int time = 110; the console then displays 110 like expected.
Two things I'd like to know: first of all, why does it do this with a preceding 0 at the start of the int, and is there a way to stop it so that 0110 at least equals 110? Secondly, is there any way to keep it so that 0110 returns 0110?
As you can guess from the variable name, I'm trying to do operations with 24-hour time, but at this point any time before 1000 is causing problems because of this.
Thanks in advance!
An integer literal that starts with 0 is an octal integer literal. In C++ there are four categories of integer literals:
integer-literal:
    decimal-literal integer-suffix(opt)
    octal-literal integer-suffix(opt)
    hexadecimal-literal integer-suffix(opt)
    binary-literal integer-suffix(opt)
An octal-literal is defined the following way (the optional ' is a digit separator):
octal-literal:
    0
    octal-literal '(opt) octal-digit
That is, it starts with 0.
Thus this octal integer literal
0110
corresponds to the decimal number
1*8^2 + 1*8^1 + 0*8^0
which is equal to 72.
You can be sure that 72 in octal representation is equivalent to 110 by running the following simple program
#include <iostream>
#include <iomanip>

int main()
{
    std::cout << std::oct << 72 << std::endl;
    return 0;
}
The output is
110
It is because of integer literals. Placing a 0 before a number means it's an octal number. For binary the prefix is 0b, for hexadecimal it is 0x or 0X. You don't need to write anything for decimal. See the code below.
#include <stdio.h>

int main()
{
    int binary = 0b10;
    int octal = 010;
    int decimal = 10;
    int hexa = 0x10;
    printf("%d %d %d %d\n", octal, decimal, hexa, binary); // prints 8 10 16 2
}
For more information visit tutorialspoint.
The compiler is interpreting the leading zero as an octal number. The octal value of "110" is 72 in decimal. There's no need for the leading zero if you're just storing an int value.
You're trying to store "time" as it appears on a clock. That's actually more complicated than a simple int. You could store the number of minutes since midnight instead.
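A sketch of that idea (the to_minutes helper is made up, not part of the answer):
#include <iostream>

// Store a clock time as minutes since midnight instead of as a literal like 0110.
int to_minutes(int hours, int minutes) { return hours * 60 + minutes; }

int main()
{
    int t = to_minutes(1, 10);                          // 01:10 -> 70 minutes since midnight
    std::cout << t / 60 << ":" << t % 60 << std::endl;  // prints 1:10
    return 0;
}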
A zero at the start means the number is in octal; without it, the number is decimal.

Weird casting from string element

I get a std::string which should contain bytes (an array of chars). I'm trying to display the bytes, but the first byte always contains weird data:
4294967169; how can this be a byte/char?!
void do_log(const std::string & data) {
    std::stringstream ss;
    ss << "do_log: ";
    int min = min((data.length()), (20)); // first 20 bytes
    for (int i = 0; i < min; i++)
    {
        ss << setfill('0') << setw(2) << hex << (unsigned int)data.at(i) << " ";
    }
    log(ss.str());
}
The data I log is:
ffffff81 0a 53 3a 30 30 38 31 30 36 43 38
Why and how does this ffffff81 appear, if string::at should return a char?
When you write (unsigned int)data.at(i), data.at(i) is a char which is then subjected to integer promotion. If char is signed on your system, values greater than 127 are interpreted as negative numbers. The sign is preserved by the promotion (sign extension), giving such strange results.
You can verify whether char is signed by looking at std::numeric_limits<char>::is_signed.
You can easily solve the issue by getting rid of the bits added by the integer promotion, by ANDing the integer with 0xff: (static_cast<int>(data.at(i)) & 0xff)
Another way is to force your compiler to work with unsigned chars. For example, with option -funsigned-char on gcc or /J with MSVC.
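Putting the masking approach together, the logging helper could look like this (a sketch; the function is renamed and returns the string so it does not depend on the asker's log() function):
#include <algorithm>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

std::string hex_dump(const std::string& data)
{
    std::stringstream ss;
    ss << "do_log: ";
    std::size_t count = std::min<std::size_t>(data.length(), 20); // first 20 bytes
    for (std::size_t i = 0; i < count; i++)
    {
        // Mask off the sign-extended bits so 0x81 prints as "81", not "ffffff81".
        ss << std::setfill('0') << std::setw(2) << std::hex
           << (static_cast<int>(data.at(i)) & 0xff) << " ";
    }
    return ss.str();
}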
std::string contains char elements, which are signed on your platform.
So a character with value 0x81 is interpreted as a negative number, and when widened it becomes 0xFFFFFF81.
You cast this character to an unsigned int, so it becomes a very large number.

Unexpected results when looking at ASCII codes in C++

The bit of code below is extracting ASCII codes from characters.
When I convert characters in the normal ASCII region I get the value I expect.
When I convert £ and € from the extended region I get a load of 1's padding the INT that I'm storing the character in.
e.g. the output of the below is:
45 (ASCII 'E' as expected)
FFFFFF80 (extended-ASCII € as expected, but padded with ones)
It's not causing me an issue but I'm just wondering why this happens.
Here's the code...
unsigned int asciichar[3];
string cTextToEncode = "E€";
for (unsigned int i = 0; i < cTextToEncode.length(); i++)
{
    asciichar[i] = (unsigned int)cTextToEncode[i];
    cout << hex << asciichar[i] << "\n";
}
Can anyone explain why this is?
Thanks
Depending on the implementation, a char can be either signed or unsigned. In your case it appears to be signed, so 0x80 is interpreted as -128 instead of 128; hence, when cast to an integer, it becomes 0xffffff80.
By the way, this has nothing at all to do with ASCII.
First, there's no € in ASCII (extended or otherwise) because the euro didn't exist when ASCII was created. However, several ASCII-friendly 8-bit encodings do support the € character, but the conversion is done by your source code editor (the compiler merely sees a byte which happens to represent € in your editor, but might be something else entirely on, say, a computer in Israel).
Second, (unsigned int) casts do not extract the ASCII encoding of a character. They merely convert the value of the underlying numeric char type to an unsigned integer. This causes strange things to happen when the converted value is negative - on your compiler, char happens to be signed char and thus characters with an ASCII value larger than 127 end up being negative char values.
You should convert to an unsigned char first, and then to an unsigned int.
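Applied to the loop from the question, that would look like this (a sketch; \x80 is used instead of a literal € so the example does not depend on the source file's encoding; € is 0x80 in Windows-1252):
#include <iostream>
#include <string>

int main()
{
    unsigned int asciichar[3];
    std::string cTextToEncode = "E\x80"; // 0x80 is € in Windows-1252
    for (unsigned int i = 0; i < cTextToEncode.length(); i++)
    {
        // Going through unsigned char first keeps the value in 0..255,
        // so 0x80 is not sign-extended to 0xffffff80.
        asciichar[i] = static_cast<unsigned char>(cTextToEncode[i]);
        std::cout << std::hex << asciichar[i] << "\n"; // prints 45 then 80
    }
    return 0;
}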
You should be careful when promoting signed values.
When promoting a signed char to a signed int, the first bit (the sign bit) is taken into account. The algorithm roughly looks like this:
1) If you have 1X-XX-XX-XX (the char in binary, X being any binary digit), then the int will be (starting with 24 ones) 1...1-1X-XX-XX-XX in binary -> 0xFFFFFFYY in hex.
2) If you have 0X-XX-XX-XX in binary, then you'll have (starting with 24 zeroes) 0...0-0X-XX-XX-XX in binary -> 0x000000YY in hex.
In your case you want to force rule #2 all the time. In order to do this, you need to tell the compiler not to treat the first bit as a sign bit; for that you need to use unsigned char.