I'm completely new to C++, so I guess this might be a very trivial question. If this is a duplicate of an already answered question (I bet it is...), please point me to that answer!
I have a file with the following cut from hexdump myfile -n 4:
00000000 02 00 04 00 ... |....|
00000004
My problem/confusion comes when trying to read these values and convert them to ints (0x0200 -> 512 in decimal, and 0x0400 -> 1024).
A minimum working example based on this answer:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(void){
    char fn[] = "myfile";
    ifstream file;
    file.open(fn, ios::in | ios::binary);

    string fbuff = "  "; // two placeholder chars, so read() below has room for 2 bytes
    file.read((char *)&fbuff[0], 2);
    cout << "fbuff: " << fbuff << endl;

    // works
    string a = "0x0200";
    cout << "a: " << a << endl;
    cout << "stoi(a): " << stoi(a, nullptr, 16) << endl;

    // doesn't work
    string b = "\x02\x00";
    cout << "b: " << b << endl;
    cout << "stoi(b): " << stoi(b, nullptr, 16) << endl;

    // doesn't work
    cout << "stoi(fbuff): " << stoi(fbuff, nullptr, 16) << endl;

    file.close();
    return(0);
}
What I can't get my head around is the difference between a and b: the former is defined with 0x (which works perfectly) and the latter with \x, which breaks stoi. My guess is that what's being read from the file is in the \x format, based on the output when running the code within sublime-text3 (below), and every example I've seen only deals with 0x0200-formatted inputs.
// Output from sublime, which just runs g++ file.cpp && ./file.cpp
fbuff: <0x02> <0x00>
a: 0x0200
stoi(a): 512
b:
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi
[Finished in 0.8s with exit code -6]
Is there a simple way to read two, or more, bytes, group them and convert into a proper short/int/long?
The literal string "0x0200" is really an array of seven bytes:
0x30 0x78 0x30 0x32 0x30 0x30 0x00
The first six are ASCII encoded characters for '0', 'x', '0', '2', '0' and '0'. The last is the null-terminator that all strings have.
The literal string "\x02\x00" is really an array of three bytes:
0x02 0x00 0x00
That is not really what is normally called a "string", but rather just a collection of bytes. Note also that the std::string constructor taking a const char* stops at the first null byte, so b above ends up containing just the single unprintable byte 0x02. There is nothing there that std::stoi can parse as a number, so the function throws std::invalid_argument.
You might want to get a couple of good books to read and learn more about strings.
Note: This answer assumes ASCII encoding and 8-bit bytes, which is by far the most common.
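As for the last part of the question - reading two or more bytes and grouping them into an integer - don't go through strings at all; read the raw bytes and combine them arithmetically. A minimal sketch, assuming the pair should be interpreted big-endian as in the question (02 00 -> 0x0200 = 512); for little-endian data, swap the two bytes:

#include <cstdint>
#include <fstream>
#include <iostream>

int main() {
    std::ifstream file("myfile", std::ios::binary);
    unsigned char bytes[2]; // unsigned char avoids sign-extension surprises
    if (file.read(reinterpret_cast<char*>(bytes), 2)) {
        // big-endian: the first byte is the high half, 0x02 0x00 -> 0x0200
        std::uint16_t value = (static_cast<std::uint16_t>(bytes[0]) << 8) | bytes[1];
        std::cout << value << '\n'; // prints 512
    }
}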
Related
I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.
Trying to debug, I have something like the following:
std::wstring foo = L"ÆØÅ";
std::wcout << foo;

for (int i = 0; i < foo.length(); ++i) {
    std::wcout << std::hex << (int)foo[i] << " ";
    std::wcout << (char)foo[i];
}
Characteristics of output I get:
The first print shows: ???
The loop prints the hex for the three characters as c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed
Environmental variable $LANG is set to default en_US.UTF-8
In the conclusion of the answer I linked (which I still recommend reading) we can find:
When I should use std::wstring over std::string?
On Linux? Almost never, unless you use a toolkit/framework.
Short explanation why:
First of all, Linux is natively encoded in UTF-8 and is consistent about it (in contrast to e.g. Windows, where files have one encoding and cmd.exe another).
Now let's have a look at a simple program:
#include <iostream>
int main()
{
    std::string foo = "ψA"; // character 'A' is just a control sample
    std::wstring bar = L"ψA"; // --

    for (int i = 0; i < foo.length(); ++i) {
        std::cout << static_cast<int>(foo[i]) << " ";
    }
    std::cout << std::endl;

    for (int i = 0; i < bar.length(); ++i) {
        std::wcout << static_cast<int>(bar[i]) << " ";
    }
    std::cout << std::endl;

    return 0;
}
The output is:
-49 -120 65
968 65
What does it tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 and 968 correspond to 'ψ'. In the case of char, the character 'ψ' actually takes two chars; in the case of wchar_t, it's just one wchar_t.
Let's also check the sizes of those types:
std::cout << "sizeof(char) : " << sizeof(char) << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
Output:
sizeof(char) : 1
sizeof(wchar_t) : 4
1 byte on my machine has the standard 8 bits. char has 1 byte (8 bits), while wchar_t has 4 bytes (32 bits).
UTF-8 operates on, nomen omen, code units of 8 bits. There is a fixed-length UTF-32 encoding that uses exactly 32 bits (4 bytes) per code point, but it's UTF-8 that Linux uses.
Ergo, the terminal expects to receive those two negatively signed values to print the character 'ψ', not one value far above the ASCII table (codes are defined only up to 127 - half of the possible char values).
That's why std::cout << char(-49) << char(-120); will also print ψ.
But it shows the const char[] printing correctly. But when I typecast to (char), nothing is printed.
The characters were already encoded differently; there are different values in there, and simple casting won't be enough to convert them.
And as I've shown, the size of char is 1 byte and of wchar_t is 4 bytes. You can safely cast upward, not downward.
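If you do end up holding a std::wstring and need to print it on a UTF-8 terminal, the fix is to re-encode the code points rather than cast them. A minimal sketch using std::wstring_convert (deprecated since C++17, but still widely available):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::wstring foo = L"ÆØÅ";
    // re-encode each wide code point as a UTF-8 byte sequence
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string utf8 = conv.to_bytes(foo);
    std::cout << utf8 << '\n'; // the terminal now receives valid UTF-8
}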
I'm trying to output the plaintext contents of this .exe file. It has plaintext in it, like "Changing the code in this way will not affect the quality of the resulting optimized code." and all the other stuff Microsoft puts into .exe files. When I run the following code I get the output M Z E followed by a heart and a diamond. What am I doing wrong?
ifstream file;
char inputCharacter;
file.open("test.exe", ios::binary);
while ((inputCharacter = file.get()) != EOF)
{
    cout << inputCharacter << "\n";
}
file.close();
I would use something like std::isprint to make sure the character is printable and not some weird control code before printing it.
Something like this:
#include <cctype>
#include <fstream>
#include <iostream>
int main()
{
    std::ifstream file("test.exe", std::ios::binary);
    char c;
    while (file.get(c)) // don't loop on EOF
    {
        // cast to unsigned char first: std::isprint requires a non-negative value
        if (std::isprint(static_cast<unsigned char>(c))) // check if printable
            std::cout << c;
    }
}
You have opened the stream in binary mode, which is good for the intended purpose. However, you print all the binary data as it is; some of these characters are not printable, giving weird output.
Potential solutions:
If you want to print the content of an exe, you'll get more non-printable chars than printable ones. So one approach could be to print the hex value instead:
while (file.get(inputCharacter))
{
    cout << setw(2) << setfill('0') << hex << (int)(inputCharacter & 0xff) << "\n";
}
Or you could use the debugger approach of displaying the hex value, and then display the char if it's printable or '.' if not:
while (file.get(inputCharacter)) {
    cout << setw(2) << setfill('0') << hex << (int)(inputCharacter & 0xff) << " ";
    if (isprint(inputCharacter & 0xff))
        cout << inputCharacter << "\n";
    else
        cout << ".\n";
}
Well, for the sake of ergonomics, if the file contains any real exe, you'd better opt for displaying several chars on each line ;-) - see the sketch below.
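A minimal sketch of that idea - 16 bytes per line, hex on the left, printable characters (or '.') on the right; the exact layout here is my own choice, not anything standard:

#include <cctype>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <string>

int main() {
    std::ifstream file("test.exe", std::ios::binary);
    std::string text; // printable view of the current line
    char c;
    while (file.get(c)) {
        unsigned char b = static_cast<unsigned char>(c);
        std::cout << std::setw(2) << std::setfill('0') << std::hex << (int)b << ' ';
        text += std::isprint(b) ? c : '.';
        if (text.size() == 16) { // flush one full line
            std::cout << ' ' << text << '\n';
            text.clear();
        }
    }
    if (!text.empty())
        std::cout << ' ' << text << '\n'; // last partial line (hex column unpadded)
}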
A binary file is a collection of bytes, and a byte has a range of values 0..255. Printable characters that can be safely "printed" form a much narrower range. Assuming the most basic ASCII encoding:
32..63
64..95
96..126
plus, maybe, some above 128, if your codepage has them
(see an ASCII table).
Every character that falls outside that range may, at least:
print out as invisible
print out as some weird trash
be in fact a control character that changes the settings of your terminal
Some terminals support an "end of text" character and will simply stop printing any text afterwards. Maybe you hit that.
I'd say, if you are interested only in text, then print only the printables and ignore the others. Or, if you want everything, then maybe write the bytes out in hex form instead?
This worked:
ifstream file;
char inputCharacter;
string Result;
file.open("test.exe", ios::binary);
while (file.get(inputCharacter))
{
    if ((inputCharacter > 31) && (inputCharacter < 127)) // printable ASCII range
        Result += inputCharacter;
}
cout << Result << endl;
cout << "These are the ascii characters in the exe file" << endl;
file.close();
I'd like to take the next two hex characters from a stream and store the associated hex->decimal numeric value in a char.
So if an input file contains 2a3123, I'd like to grab 2a, and store the numeric value (decimal 42) in a char.
I've tried
char c;
instream >> std::setw(2) >> std::hex >> c;
but this gives me garbage (if I replace c with an int, I get the maximum value for signed int).
Any help would be greatly appreciated! Thanks!
edit: I should note that the characters are guaranteed to be within the proper range for chars and that the file is valid hexadecimal.
OK, I think decoding the ASCII by hand is a bad idea and does not really answer the question.
I think your code does not work because setw()/istream::width() only takes effect when you read into a std::string or char*. I gather that from here.
However, you can use the goodness of the standard C++ iostream converters. I came up with an idea that uses the stringstream class and a string as a buffer. The thing is to read n chars into the buffer and then use the stringstream as a conversion facility.
I am not sure if this is the most optimal version. Probably not.
Code:
#include <iostream>
#include <sstream>
int main(void){
    int c;
    std::string buff;
    std::stringstream ss_buff;

    std::cin.width(2); // limits the next >> into a string to 2 characters
    std::cin >> buff;
    ss_buff << buff;
    ss_buff >> std::hex >> c;

    std::cout << "read val: " << c << '\n';
}
Result:
luk32#genaker:~/projects/tmp$ ./a.out
0a10
read val: 10
luk32#genaker:~/projects/tmp$ ./a.out
10a2
read val: 16
luk32#genaker:~/projects/tmp$ ./a.out
bv00
read val: 11
luk32#genaker:~/projects/tmp$ ./a.out
bc01
read val: 188
luk32#genaker:~/projects/tmp$ ./a.out
01bc
read val: 1
And as you can see, it's not very error resistant. Nonetheless, it works for the given conditions, can be expanded into a loop, and most importantly uses the iostream conversion facilities, so no ASCII magic on your side. C/ASCII would probably be way faster, though.
PS. An improved version below. It uses a simple char[2] buffer and unformatted write/read to move data through the buffer (get/write as opposed to operator<</operator>>). The rationale is pretty simple: we do not need any fanciness to move 2 bytes of data. We do, however, use the formatted extractor to do the conversion. I made it a loop version for convenience. It was not super simple, though; it took me a good 40 minutes of fooling around to figure out the two very important lines. Without them, the extraction works only for the first 2 characters.
#include <iostream>
#include <sstream>
int main(void){
    int c;
    char* buff = new char[3];
    std::stringstream ss_buff;

    std::cout << "read vals: ";
    std::string tmp;
    while( std::cin.get(buff, 3).gcount() == 2 ){
        std::cout << '(' << buff << ") ";
        ss_buff.seekp(0); // VERY important lines
        ss_buff.seekg(0); // VERY important lines
        ss_buff.write(buff, 2);
        if( ss_buff.fail() ){ std::cout << "error\n"; break; }
        std::cout << ss_buff.str() << ' ';
        ss_buff >> std::hex >> c;
        std::cout << c << '\n';
    }
    std::cout << '\n';
    delete [] buff;
}
Sample output:
luk32#genaker:~/projects/tmp$ ./a.out
read vals: 0aabffc
(0a) 0a 10
(ab) ab 171
(ff) ff 255
Please note that the trailing c was not read, since it does not form a complete two-character pair.
I found everything needed here http://www.cplusplus.com/reference/iostream/
You can cast a char to an int, and the int will hold the ASCII value of the char. For example, '0' will be 48 and '5' will be 53. The letters occur higher up, so 'a' will be cast to 97, 'b' to 98, etc. Knowing this, you can take the int value and subtract 48; if the result is greater than 9, subtract another 39. Then char '0' will have been turned into int 0, char '1' into int 1, all the way up to char 'a' becoming int 10, char 'b' int 11, etc.
Next you need to multiply the value of the first digit by 16 and add it to the second to account for its place value. Using your example of 2a:
char '2' casts to int 50. Subtract 48 and get 2. Multiply by 16 and get 32.
char 'a' casts to int 97. Subtract 48 and get 49; this is higher than 9, so subtract another 39 and get 10. Add this to the result of the previous step (32) and you get 42.
Here is the code:
int Convert(int in); // forward declaration: HexToInt uses Convert below

int HexToInt(char hi, char low)
{
    int retVal = 0;
    int hiBits = (int)hi;
    int loBits = (int)low;
    retVal = Convert(hiBits) * 16 + Convert(loBits);
    return retVal;
}

int Convert(int in)
{
    int retVal = in - 48;
    // if it was not a digit ('A'..'F' sit 7 places above where they would follow '9')
    if(retVal > 10)
        retVal = retVal - 7;
    // if it was not an upper case hex digit (lower case is another 32 higher)
    if(retVal > 15)
        retVal = retVal - 32;
    return retVal;
}
The first function can actually be written as one line thus:
int HexToInt(char hi, char low)
{
    return Convert((int)hi) * 16 + Convert((int)low);
}
NOTE: The walkthrough above only covers lower case letters (the Convert function handles upper case as well), and this only works on systems that use ASCII, i.e. not IBM EBCDIC-based systems.
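For completeness, a quick sanity check wired up to the functions above (assuming both are compiled into the same file):

#include <iostream>

int Convert(int in);             // defined above
int HexToInt(char hi, char low); // defined above

int main()
{
    std::cout << HexToInt('2', 'a') << std::endl; // prints 42, matching the worked example
    return 0;
}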
I'm trying to write a program in C++ that will take 2 files and compare them byte by byte.
I was looking at the following post
Reading binary istream byte by byte
I'm not really sure about parts of this. When using get(char& c), it reads in a char and stores it in c. Is this stored as, say, 0x0D, or as the actual char value "c" (or whatever)?
If I wish to use this method to compare two files byte by byte, would I just use get(char& c) on both and then compare the chars that were read, or do I need to cast to byte?
(I figured starting a new post would be better since the original is quite an old one)
chars are nothing but a "special type of storage" (excuse the expression) for integers; in memory there is no difference between 'A' and the decimal value 65 (ASCII assumed).
c will, in other words, contain the byte read from the file.
To answer your added question: no, there is no cast required; doing c1 == c2 will be just fine.
char c1 = 'A', c2 = 97, c3 = 0x42;

std::cout << c1 << " " << c2 << " " << c3 << std::endl;
std::cout << +c1 << " " << +c2 << " " << +c3 << std::endl;

/* Writing +c1 in the above will cast c1 to an int; it is the same thing as
   writing (int)c1 or the more correct (C++ish) static_cast<int>(c1). */
output:
A a B
65 97 66
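And to tie it back to the original question, a minimal sketch of the byte-by-byte comparison itself (the file names are placeholders):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream f1("a.bin", std::ios::binary);
    std::ifstream f2("b.bin", std::ios::binary);
    char c1, c2;
    long pos = 0;
    for (;;) {
        bool got1 = static_cast<bool>(f1.get(c1));
        bool got2 = static_cast<bool>(f2.get(c2));
        if (!got1 || !got2) {
            if (got1 != got2) // one file ended before the other
                std::cout << "files differ in length\n";
            else
                std::cout << "files are identical\n";
            break;
        }
        if (c1 != c2) { // plain char comparison, no cast needed
            std::cout << "files differ at byte " << pos << '\n';
            break;
        }
        ++pos;
    }
}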
Ehm,
a char contains 1 byte.
The interpretation of that value indeed depends on you, the programmer.
If you print that byte to the cout stream, it is interpreted via its ASCII code; therefore, if your char was 0x63, it will print 'c' on the screen.
If you just use the value, you can use it however you like:
char c = 0x63;
c++;
// c is now: 0x64
Note that you can also use decimal literals as input.
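For example, initialising with a decimal literal produces exactly the same byte:

char d = 99; // decimal 99 == 0x63, so this prints 'c' through cout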
What I would like to be able to do is convert a char array (which may be binary data) to a list of hex values of the form: ab 0d 12 f4, etc.
I tried doing this with
lHexStream << "<" << std::hex << std::setw (2) << character << ">";
but this did not work since I would get the data printing out as:
<ffe1><2f><ffb5><54>< 6><1b><27><46><ffd9><75><34><1b><ffaa><ffa2><2f><ff90><23><72><61><ff93><ffd9><60><2d><22><57>
Note here that some of the values have 4 hex digits in them, e.g. <ffe1>.
What I'm looking for is what they have in Wireshark, where they represent a char array (or binary data) in a hex format like:
08 0a 12 0f
where each character value is represented by just 2 hex digits, as shown above.
It looks like byte values of 0x80 and above are being sign-extended to short (I don't know why it's stopping at short, but that's not important right now). Try this:
lHexStream << '<' << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<unsigned int>(static_cast<unsigned char>(character))
           << '>';
You may be able to remove the outer cast but I wouldn't rely on it.
EDIT: added std::setfill call, which you need to get <06> instead of < 6>. Hat tip to jkerian; I hardly ever use iostreams myself. This would be so much shorter with fprintf:
fprintf(ihexfp, "<%02x>", (unsigned char)character);
As Zack mentions, the 4-digit values appear because all byte values over 127 are interpreted as negative (the base type is signed char); that 'negative value' is then sign-extended as it is widened to a signed short.
Personally, I found the following to work fairly well:
char *myString = inputString;
for (int i = 0; i < length; i++)
    std::cout << std::hex << std::setw(2) << std::setfill('0')
              << static_cast<unsigned int>(static_cast<unsigned char>(myString[i])) << " ";
// the cast through unsigned char guards against the sign extension described above
I think the problem is that the binary data is being interpreted as a multi-byte encoding when you're reading the characters. This is evidenced by the fact that each of the 4-character hex codes in your example has the high bit set in the lower byte.
You probably want to read the binary stream in ASCII mode.