Reading from a file byte per byte C++ - c++

Im trying to write a program in C++ that will take 2 files and compare them byte by byte.
I was looking at the following post
Reading binary istream byte by byte
Im not really sure about parts of this. When using get(char& c) it reads in a char and stores it in c. Is this storing as, say 0x0D, or is it storing the actual char value "c" (or whatever)?
If i wish to use this method to compare two files byte by byte would i just use get(char& c) on both then compare the chars that were got, or do i need to cast to byte?
(I figured starting a new post would be better since the original is quite an old one)

chars are nothing but a "special type of storage" (excuse the expression) for integers, in memory there is no difference between 'A' and the decimal value 65 (ASCII assumed).
c will in other words contain the read byte from the file.
To answer your added question; no, there is no cast required doing c1 == c2 will be just fine.
char c1 = 'A', c2 = 97, c3 = 0x42;
std::cout << c1 << " " << c2 << " " << c3 << std::endl;
std::cout << +c1 << " " << +c2 << " " << +c3 << std::endl;
/* Writing +c1 in the above will cast c1 to an int, it's is the same thing as writing (int)c1 or the more correct (c++ish) static_cast<int> (c1). */
output:
A a B
65 97 66

Ehm,
a char contains 1 Byte
The interpretation of that value is indeed depending on you, the programmer.
If you print that byte in the cout stream it is interpreted via ASCII Code and therefor if your char was 0x63 then it will print 'c' on the screen.
If you just use the value you can use it as you like..
char c = 0x63;
c++;
// c is now: 0x64
Note that you can also input decimals

Related

Printing Latin characters in Linux terminal using `std::wstring` and `std::wcout`

I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.
Trying to debug, I have something like the following:
std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
std::wcout << std::hex << (int)foo[i] << " ";
std::wcout << (char)foo[i];
}
Characteristics of output I get:
The first print shows: ???
The loop prints the hex for the three characters as c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed
Environmental variable $LANG is set to default en_US.UTF-8
In the conclusion of the answer I linked (which I still recommend reading) we can find:
When I should use std::wstring over std::string?
On Linux? Almost never, unless you use a toolkit/framework.
Short explanation why:
First of all, Linux is natively encoded in UTF-8 and is consequent in it (in contrast to e.g. Windows where files has one encoding and cmd.exe another).
Now let's have a look at such simple program:
#include <iostream>
int main()
{
std::string foo = "ψA"; // character 'A' is just control sample
std::wstring bar = L"ψA"; // --
for (int i = 0; i < foo.length(); ++i) {
std::cout << static_cast<int>(foo[i]) << " ";
}
std::cout << std::endl;
for (int i = 0; i < bar.length(); ++i) {
std::wcout << static_cast<int>(bar[i]) << " ";
}
std::cout << std::endl;
return 0;
}
The output is:
-49 -120 65
968 65
What does it tell us? 65 is ASCII code of character 'A', it means that that -49 -120 and 968 corresponds to 'ψ'.
In case of char character 'ψ' takes actually two chars. In case of wchar_t it's just one wchar_t.
Let's also check sizes of those types:
std::cout << "sizeof(char) : " << sizeof(char) << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
Output:
sizeof(char) : 1
sizeof(wchar_t) : 4
1 byte on my machine has standard 8 bits. char has 1 byte (8 bits), while wchar_t has 4 bytes (32 bits).
UTF-8 operates on, nomen omen, code units having 8 bits. There is is a fixed-length UTF-32 encoding used to encode Unicode code points that uses exactly 32 bits (4 bytes) per code point, but it's UTF-8 which Linux uses.
Ergo, terminal expects to get those two negatively signed values to print character 'ψ', not one value which is way above ASCII table (codes are defined up to number 127 - half of char possible values).
That's why std::cout << char(-49) << char(-120); will also print ψ.
But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.
The character was already encoded different, there are different values in there, simple casting won't be enough to convert them.
And as I've shown, size char is 1 byte and of wchar_t is 4 bytes. You can safely cast upward, not downward.

How to avoid 0xFF prefix when converting char to short?

When I do:
cout << std::hex << (short)('\x3A') << std::endl;
cout << std::hex << (short)('\x8C') << std::endl;
I expect the following output:
3a
8c
but instead, I have:
3a
ff8c
I suppose that this is due to the way char—and more precisely a signed char—is stored in memory: everything below 0x80 would not be prefixed; the value 0x80 and above, on the other hand, would be prefixed with 0xFF.
When given a signed char, how do I get a hexadecimal representation of the actual character inside it? In other words, how do I get 0x3A for \x3A, and 0x8C for \x8C?
I don't think a conditional logic is well suited here. While I can subtract 0xFF00 from the resulting short when needed, it doesn't seem very clear.
Your output might make more sense if you looked at it in decimal instead of hexadecimal:
std::cout << std::dec << (short)('\x3A') << std::endl;
std::cout << std::dec << (short)('\x8C') << std::endl;
output:
58
-116
The values were cast to short, so we are (most commonly) dealing with 16 bit values. The 16-bit binary representation of -116 is 1111 1111 1000 1100, which becomes FF8C in hexadecimal. So the output is correct given what you requested (on systems where char is a signed type). So not so much the way the char is stored in memory, but more the way the bits are interpreted. As an unsigned value, the 8-bit pattern 1000 1100 represents -116, and the conversion to short is supposed to preserve this value, rather than preserving the bits.
Your desired output of a hexadecimal 8C corresponds (for a short) to the decimal value 140. To get this value out of 8 bits, the value has to be interpreted as an unsigned 8-bit value (since the largest signed 8-bit value is 127). So the data needs to be interpreted as an unsigned char before it gets expanded to some flavor of short. For a character literal like in the example code, this would look like the following.
std::cout << std::hex << (unsigned short)(unsigned char)('\x3A') << std::endl;
std::cout << std::hex << (unsigned short)(unsigned char)('\x8C') << std::endl;
Most likely, the real code would have variables instead of character literals. If that is the case, then rather than casting to an unsigned char, it might be more convenient to declare the variable to be of unsigned char type. Which is possibly the type you should be using anyway, based on the fact that you want to see its hexadecimal value. Not definitively, but this does suggest that the value is seen simply as a byte of data rather than as a number, and that suggests that an unsigned type is appropriate. Have you looked at std::byte?
One other nifty thought to throw out: the following also gives the desired output as a reasonable facsimile of using an unsigned char variable.
#include <iostream>
unsigned char operator "" _u (char c) { return c; } // Suffix for unsigned char literals
int main()
{
std::cout << std::hex << (unsigned short)('\x3A'_u) << std::endl;
std::cout << std::hex << (unsigned short)('\x8C'_u) << std::endl;
}
A more straightforward approach is to cast a signed char to an unsigned char. In other words, this:
cout << std::hex << (short)(unsigned char)('\x3A') << std::endl;
cout << std::hex << (short)(unsigned char)('\x8C') << std::endl;
produces the expected result:
3a
8c
Not sure this is particularly clear, though.

C++ convert string of bytes to ints from file

I'm completely new to C++, so I guess this might be a very trivial question. If this is a duplicate of an already answered question (I bet it is...), please point me to that answer!
I have a file with the following cut from hexdump myfile -n 4:
00000000 02 00 04 00 ... |....|
00000004
My problem/confusion comes when trying to read these values and convert them to ints ( [0200]_hex --> [512]_dec and [0400]_hex --> [1024]_dec).
A minimum working example based on this answer:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(void){
char fn[] = "myfile";
ifstream file;
file.open(fn, ios::in | ios::binary);
string fbuff = " ";
file.read((char *)&fbuff[0], 2);
cout << "fbuff: " << fbuff << endl;
// works
string a = "0x0200";
cout << "a: " << a << endl;
cout << "stoi(a): " << stoi(a, nullptr, 16) << endl;
// doesn't work
string b = "\x02\x00";
cout << "b: " << b << endl;
cout << "stoi(b): " << stoi(b, nullptr, 16) << endl;
// doesn't work
cout << "stoi(fbuff): " << stoi(fbuff, nullptr, 16) << endl;
file.close();
return(0);
}
What I cant get my head around is the difference between a and b; the former defined with 0x (which works perfect) and the latter defined with \x and breaks stoi. My guess is that whats being read from the file is in the \x-format, based on the output when running the code within sublime-text3 (below), and every example I've seen only deals with for example 0x0200-formatted inputs
// Output from sublime, which just runs g++ file.cpp && ./file.cpp
fbuff: <0x02> <0x00>
a: 0x0200
stoi(a): 512
b:
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi
[Finished in 0.8s with exit code -6]
Is there a simple way to read two, or more, bytes, group them and convert into a proper short/int/long?
The literal string "0x0200" is really an array of seven bytes:
0x30 0x78 0x30 0x32 0x30 0x30 0x00
The first six are ASCII encoded characters for '0', 'x', '0', '2', '0' and '0'. The last is the null-terminator that all strings have.
The literal string "\x00\x02" is really an array of three bytes:
0x00 0x02 0x00
That is not really what is normally called a "string", but rather just a collection of bytes. And it's nothing that can be parsed as a string by std::stoi. And as std::stoi can't parse it the function will throw an exception.
You might want to get a couple of good books to read and learn more about strings.
Note: This answer assumes ASCII encoding and 8-bit bytes, which is by far the most common.

Initializing an unsigned char array with hex values in C++

I would like to initialize an unsigned char array with 16 hex values. However, I don't seem to know how to properly initialize/access those values. When I try to access them as I might want to intuitively, I'm getting no value at all.
This is my output
The program was run with the following command: 4
Please be a value! -----> p
Here's some plaintext
when run with the code below -
int main(int argc, char** argv)
{
int n;
if (argc > 1) {
n = std::stof(argv[1]);
} else {
std::cerr << "Not enough arguments\n";
return 1;
}
char buff[100];
sprintf(buff,"The program was run with the following command: %d",n);
std::cout << buff << std::endl;
unsigned char plaintext[16] =
{0x0f, 0xb0, 0xc0, 0x0f,
0xa0, 0xa0, 0xa0, 0xa0,
0x00, 0x00, 0xa0, 0xa0,
0x00, 0x00, 0x00, 0x00};
unsigned char test = plaintext[1]^plaintext[2];
std::cout << "Please be a value! -----> " << test << std::endl;
std::cout << "Here's some plaintext " << plaintext[3] << std::endl;
return 0;
}
By way of context, this is part of a group project for school. We are ultimately trying to implement the Serpent cipher, but keep on getting tripped up by unsigned char arrays. Our project specification says that we must have two functions that take what would be Byte arrays in Java. I assume the closest relative in C++ is an unsigned char[]. Otherwise I would use vector. Elsewhere in the code I've implemented a setKey function which takes an unsigned char array, packs its values into 4 long long ints (the key needs to be 256 bits) and performs various bit-shifting and xor operations on those ints to generate the keys necessary for the cryptographic algorithm. Hope that's enough background on what I'm looking to do. I'm guessing I'm just overlooking some basic C++ functionality here. Thanks for any and all help!
A char is an 8-bit value capable of storing -128 <= n <= +127, frequently used to store character representations in different encodings and commonly - in Western, Roman-alphabet installations - char is used to indicate representation of ASCII or utf encoded values. 'Encoded' means the symbols/letter in the character set have been assigned numeric values. Think of the periodic table as an encoding of elements, so that 'H' (Hydrogen) is encoded as 1, Germanium as 32. In the ASCII (and UTF-8) tables, position 32 represents the character we call "space".
When you use operator << on a char value, the default behavior is to assume you are passing it a character encoding, e.g. an ASCII character code. If you do
char c = 'z';
char d = 122;
char e = 0x7A;
char f = '\x7a';
std::cout << c << d << e << f << "\n";
All four assignments are equivalent. 'z' is a shortcut/syntactic-sugar for char(122), 0x7A is hex for 122, and '\x7a' is an escape that forms the ascii character with a value of 0x7a or 122 - i.e. z.
Where many new programmers go wrong is that they do this:
char n = 8;
std::cout << n << endl;
this does not print "8", it prints ASCII character at position 8 in the ASCII table.
Think for a moment:
char n = 8; // stores the value 8
char n = a; // what does this store?
char n = '8'; // why is this different than the first line?
Lets rewind a moment: when you store 120 in a variable, it can represent the ASCII character 'x', but ultimately what is being stored is just the numeric value 120, plain and simple.
Specifically: When you pass 122 to a function that will ultimately use it to look up a font entry from a character set using the Latin1, ISO-8859-1, UTF-8 or similar encodings, then 120 means 'z'.
At the end of the day, char is just one of the standard integer value types, it can store values -128 <= n <= +127, it can trivially be promoted to a short, int, long or long long, etc, etc.
While it is generally used to denote characters, it also frequently gets used as a way of saying "I'm only storing very small values" (such as integer percentages).
int incoming = 5000;
int outgoing = 4000;
char percent = char(outgoing * 100 / incoming);
If you want to print the numeric value, you simply need to promote it to a different value type:
std::cout << (unsigned int)test << "\n";
std::cout << unsigned int(test) << "\n";
or the preferred C++ way
std::cout << static_cast<unsigned int>(test) << "\n";
I think (it's not completely clear what you are asking) that the answer is as simple as this
std::cout << "Please be a value! -----> " << static_cast<unsigned>(test) << std::endl;
If you want to output the numeric value of a char or unsigned char, you have to cast it to an int or unsigned first.
Not surprisingly, by default, chars are output as characters not integers.
BTW this funky code
char buff[100];
sprintf(buff,"The program was run with the following command: %d",n);
std::cout << buff << std::endl;
is more simply written as
std::cout << "The program was run with the following command: " << n << std::endl;
std::cout and std::cin always treats char variable as a char
If you want to input or output as int, you must manually do it like below.
std::cin >> int_var; c = int_var;
std::cout << (int)c;
If using scanf or printf, there is no such problem as the format parameter ("%d", "%c", "%s") tells howto covert input buffer (integer, char, string).

How to convert a char array to a list of HEX values?

Would I would like to be able to do is convert a char array (may be binary data) to a list of HEX values of the form: ab 0d 12 f4 etc....
I tried doing this with
lHexStream << "<" << std::hex << std::setw (2) << character << ">";
but this did not work since I would get the data printing out as:
<ffe1><2f><ffb5><54>< 6><1b><27><46><ffd9><75><34><1b><ffaa><ffa2><2f><ff90><23><72><61><ff93><ffd9><60><2d><22><57>
Note here that some of the values would have 4 HEX values in them? e.g.
What I would be looking for is what they have in wireshark, where they represent a char aray (or binary data) in a HEX format like:
08 0a 12 0f
where each character value is represented by just 2 HEX characters of the form shown above.
It looks like byte values greater than 0x80 are being sign-extended to short (I don't know why it's stopping at short, but that's not important right now). Try this:
IHexStream << '<' << std::hex << std::setw(2) << std::setfill('0')
<< static_cast<unsigned int>(static_cast<unsigned char>(character))
<< '>';
You may be able to remove the outer cast but I wouldn't rely on it.
EDIT: added std::setfill call, which you need to get <06> instead of < 6>. Hat tip to jkerian; I hardly ever use iostreams myself. This would be so much shorter with fprintf:
fprintf(ihexfp, "<%02x>", (unsigned char)character);
As Zack mentions, The 4-byte values are because it is interpreting all values over 128 as negative (the base type is signed char), then that 'negative value' is extended as the value is expanded to a signed short.
Personally, I found the following to work fairly well:
char *myString = inputString;
for(int i=0; i< length; i++)
std::cout << std::hex << std::setw(2) << std::setfill('0')
<< static_cast<unsigned int>(myString[i]) << " ";
I think the problem is that the binary data is being interpreted as a multi-byte encoding when you're reading the characters. This is evidenced byt he fact that each of the 4-character hex codes in your example have the high bit set in the lower byte.
You probably want to read the binary stream in ascii mode.