I encountered an odd problem when exporting float values to a file. I would expect every float to be of the same length (obviously), but my programme sometimes exports it as a 32-bit number and sometimes as a 40-bit number.
A minimal working example of a programme that still shows this behaviour is:
#include <stdio.h>

const char* fileName = "C:/Users/Path/To/TestFile.txt";
float array[5];

int main(int argc, char* argv[])
{
    float temp1 = 1.63006e-33f;
    float temp2 = 1.55949e-32f;

    array[0] = temp1;
    array[1] = temp2;
    array[2] = temp1;
    array[3] = temp2;
    array[4] = temp2;

    FILE* outputFile;
    if (!fopen_s(&outputFile, fileName, "w"))
    {
        fwrite(array, 5 * sizeof(float), 1, outputFile);
        fclose(outputFile);
    }

    return true;
}
I would expect the output file to contain exactly 20 (5 times 4) bytes, each four of which represent a float. However, I get this:
8b 6b 07 09 // this is indeed 1.63006e-33f
5b f2 a1 0d 0a // I don't know what this is but it's a byte too long
8b 6b 07 09
5b f2 a1 0d 0a
5b f2 a1 0d 0a
So the float temp2 takes 5 bytes instead of four, and the total length of the file is 23. How is this possible?! The numbers aren't so small that they are subnormal, and I can't think of any other reason why there would be a difference in size.
I am using the MSVC 2010 compiler on a 64-bit Windows 7 system.
Note: I already asked a very similar question here, but when I realised the problem was more general, I decided to repost it in a more concise way.
QDataStream uses sometimes 32 bit and sometimes 40 bit floats
The problem is that on Windows, you have to differentiate between text and binary files. You have the file opened as text, which means 0d (carriage-return) is inserted before every 0a (line-feed) written. Open the file like this:
if (!fopen_s(&outputFile, fileName, "wb"))
The rest as before, and it should work.
You're not writing text; you're writing binary data... However, your file is open for writing text ("w") instead of writing binary ("wb"). Hence, fwrite() is translating '\n' to "\r\n".
Change this:
if (!fopen_s(&outputFile, fileName, "w"))
To this:
if (!fopen_s(&outputFile, fileName, "wb"))
In "wb", the b stands for binary mode.
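As a quick sanity check, here is a minimal sketch (file names are placeholders) that writes a single float whose representation contains a 0x0a byte, once in text mode and once in binary mode, and reports the resulting file sizes. On Windows the text-mode file comes out one byte larger.

#include <stdio.h>

/* One of the bytes of 1.55949e-32f is 0x0a, so text mode expands it to 0x0d 0x0a. */
static long writeAndMeasure(const char* path, const char* mode)
{
    FILE* f;
    long size = -1;
    float value = 1.55949e-32f;

    if (fopen_s(&f, path, mode) == 0)
    {
        fwrite(&value, sizeof(float), 1, f);
        fclose(f);
    }
    if (fopen_s(&f, path, "rb") == 0)   /* reopen in binary mode to measure the true size */
    {
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fclose(f);
    }
    return size;
}

int main(void)
{
    printf("text mode  : %ld bytes\n", writeAndMeasure("text_mode.dat", "w"));    /* 5 on Windows */
    printf("binary mode: %ld bytes\n", writeAndMeasure("binary_mode.dat", "wb")); /* 4 everywhere */
    return 0;
}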
Related
I have written a small application which works at some point with binary data. In unit tests, I compare this data with the expected one. When an error occurs, I want the test to display the hexadecimal output such as:
Failure
Expected: string_to_hex(expected, 11)
Which is: "01 43 02 01 00 65 6E 74 FA 3E 17"
To be equal to: string_to_hex(writeBuffer, 11)
Which is: "01 43 02 01 00 00 00 00 98 37 DB"
In order to display that (and to compare binary data in the first place), I used the code from Stack Overflow, slightly modifying it for my needs:
std::string string_to_hex(const std::string& input, size_t len)
{
    static const char* const lut = "0123456789ABCDEF";
    std::string output;
    output.reserve(2 * len);

    for (size_t i = 0; i < len; ++i)
    {
        const unsigned char c = input[i];
        output.push_back(lut[c >> 4]);
        output.push_back(lut[c & 15]);
    }
    return output;
}
When checking for memory leaks with valgrind, I found a lot of errors such as this one:
Use of uninitialised value of size 8
at 0x11E75A: string_to_hex(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)
I'm not sure I understand it. First, everything seems initialized, including, unless I'm mistaken, output. Moreover, there is no mention of size 8 in the code; the value of len varies from test to test, while valgrind reports the same size 8 every time.
How should I fix this error?
So this is one of the cases where passing a char pointer that points to a buffer filled with arbitrary binary data into the evil implicit constructor of std::string causes the string to be truncated at the first \0. The straightforward fix is to pass a raw pointer plus a length, but a better solution is to start using array_view, span, or a similar utility class that provides index validation, at least in debug builds, for both input and lut.
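A minimal sketch of the pointer-plus-length variant (the name bytes_to_hex is illustrative, not from the original code):

#include <cstddef>
#include <string>

// No implicit std::string is constructed from the buffer, so embedded '\0'
// bytes can no longer truncate the input.
std::string bytes_to_hex(const unsigned char* data, std::size_t len)
{
    static const char* const lut = "0123456789ABCDEF";
    std::string output;
    output.reserve(2 * len);
    for (std::size_t i = 0; i < len; ++i)
    {
        output.push_back(lut[data[i] >> 4]);
        output.push_back(lut[data[i] & 15]);
    }
    return output;
}

// In the tests, pass the buffer and its length directly, e.g.:
//   bytes_to_hex(reinterpret_cast<const unsigned char*>(writeBuffer), 11);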
So I am writing a program to turn a Chinese-English definition .txt file into a vocab trainer that runs through the CLI. However, on Windows, when I compile and run it in VS2017, the output turns into gibberish and I'm not sure why. I think it was working OK in Linux, but Windows seems to mess it up quite a bit. Does this have something to do with the encoding table in Windows? Am I missing something? I wrote the code in Linux, as well as the input file, but I also tried writing the characters using the Windows IME and it still gives the same result. I think the picture speaks best for itself. Thanks.
Note: Added sample of input/output as it appears in Windows, as requested. Also, input is UTF-8.
Sample of input
人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth
Sample of output
人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth
土(tǔ),earth
Picture of Input file & Output
TL;DR: The Windows terminal hates Unicode. You can work around it, but it's not pretty.
Your issues here are unrelated to "char versus wchar_t". In fact, there's nothing wrong with your program! The problems only arise when the text leaves through cout and arrives at the terminal.
You're probably used to thinking of a char as a "character"; this is a common (but understandable) misconception. In C/C++, the char type is usually synonymous with an 8-bit integer, and thus is more accurately described as a byte.
Your text file chineseVocab.txt is encoded as UTF-8. When you read this file via fstream, what you get is a string of UTF-8-encoded bytes.
There is no such thing as a "character" in I/O; you're always transmitting bytes in a particular encoding. In your example, you are reading UTF-8-encoded bytes from a file handle (fin).
Try running this, and you should see identical results on both platforms (Windows and Linux):
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
    {
        cout << "Number of bytes in the line: " << dec << line.length() << endl;
        cout << " ";
        for (char c : line)
        {
            // Here we need to trick the compiler into displaying this "char" as an integer:
            unsigned int byte = (unsigned char)c;
            cout << hex << byte << " ";
        }
        cout << endl;
        cout << endl;
    }
    return 0;
}
Here's what I see in mine (Windows):
Number of bytes in the line: 16
e4 ba ba 28 72 c3 a9 6e 29 2c 70 65 72 73 6f 6e
Number of bytes in the line: 15
e5 88 80 28 64 c4 81 6f 29 2c 6b 6e 69 66 65
Number of bytes in the line: 14
e5 8a 9b 28 6c c3 ac 29 2c 70 6f 77 65 72
Number of bytes in the line: 27
e5 8f 88 28 79 c3 b2 75 29 2c 72 69 67 68 74 20 68 61 6e 64 3b 20 61 67 61 69 6e
Number of bytes in the line: 15
e5 8f a3 28 6b c7 92 75 29 2c 6d 6f 75 74 68
So far, so good.
The problem starts now: you want to write those same UTF-8-encoded bytes to another file handle (cout).
The cout file handle is connected to your CLI (the "terminal", the "console", the "shell", whatever you wanna call it). The CLI reads bytes from cout and decodes them into characters so they can be displayed.
Linux terminals are usually configured to use a UTF-8 decoder. Good news! Your bytes are UTF-8-encoded, so your Linux terminal's decoder matches the text file's encoding. That's why everything looks good in the terminal.
Windows terminals, on the other hand, are usually configured to use a system-dependent decoder (yours appears to be DOS codepage 437). Bad news! Your bytes are UTF-8-encoded, so your Windows terminal's decoder does not match the text file's encoding. That's why everything looks garbled in the terminal.
OK, so how do you solve this? Unfortunately, I couldn't find any portable way to do it... You will need to fork your program into a Linux version and a Windows version. In the Windows version:
Convert your UTF-8 bytes into UTF-16 code units.
Set standard output to UTF-16 mode.
Write to wcout instead of cout.
Tell your users to change their terminals to a font that supports Chinese characters.
Here's the code:
#include <fstream>
#include <iostream>
#include <string>
#include <windows.h>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
using namespace std;
// Based on this article:
// https://msdn.microsoft.com/magazine/mt763237?f=255&MSPPError=-2147217396
wstring utf16FromUtf8(const string & utf8)
{
    std::wstring utf16;

    // Empty input --> empty output
    if (utf8.length() == 0)
        return utf16;

    // Reject the string if its bytes do not constitute valid UTF-8
    constexpr DWORD kFlags = MB_ERR_INVALID_CHARS;

    // Compute how many 16-bit code units are needed to store this string:
    const int nCodeUnits = ::MultiByteToWideChar(
        CP_UTF8,       // Source string is in UTF-8
        kFlags,        // Conversion flags
        utf8.data(),   // Source UTF-8 string pointer
        utf8.length(), // Length of the source UTF-8 string, in bytes
        nullptr,       // Unused - no conversion done in this step
        0              // Request size of destination buffer, in wchar_ts
    );

    // Invalid UTF-8 detected? Return empty string:
    if (!nCodeUnits)
        return utf16;

    // Allocate space for the UTF-16 code units:
    utf16.resize(nCodeUnits);

    // Convert from UTF-8 to UTF-16
    int result = ::MultiByteToWideChar(
        CP_UTF8,       // Source string is in UTF-8
        kFlags,        // Conversion flags
        utf8.data(),   // Source UTF-8 string pointer
        utf8.length(), // Length of source UTF-8 string, in bytes
        &utf16[0],     // Pointer to destination buffer
        nCodeUnits     // Size of destination buffer, in code units
    );

    return utf16;
}

int main()
{
    // Based on this article:
    // https://blogs.msmvps.com/gdicanio/2017/08/22/printing-utf-8-text-to-the-windows-console/
    _setmode(_fileno(stdout), _O_U16TEXT);

    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
        wcout << utf16FromUtf8(line) << endl;

    return 0;
}
In my terminal, it mostly looks OK after I change the font to MS Gothic:
Some characters are still messed up, but that's due to the font not supporting them.
#include <iostream>
#include <fstream>

using namespace std;

struct example
{
    int num1;
    char abc[10];
} obj;

int main()
{
    ofstream myfile1, myfile2;
    myfile1.open("example1.txt");
    myfile2.open("example2.txt");

    myfile1 << obj.num1 << obj.abc;          // instruction 1
    myfile2.write((char*)&obj, sizeof(obj)); // instruction 2

    myfile1.close();
    myfile2.close();
    return 0;
}
In this example, will both files end up containing identical data or different data? Are instruction 1 and instruction 2 the same?
There's a massive difference.
Approach 1) writes the number using ASCII encoding, so there's an ASCII-encoded byte for each digit in the number. For example, the number 28 is encoded as one byte containing ASCII '2' (value 50 decimal, 32 hex) and another for '8' (56 / 0x38). If you look at the file in a program like less, you'll be able to see the 2 and the 8 in there as human-readable text. Then << obj.abc writes the characters in abc up until (but excluding) the first NUL (0-value byte): if there's no NUL you run off the end of the buffer and have undefined behaviour: your program may or may not crash, it may print nothing or garbage, all bets are off. If your file is in text mode, it might translate any newline and/or carriage return characters in abc to some other standard representation of line breaks your operating system uses (e.g. it might automatically place a carriage return after every newline you write, or remove carriage returns that were in abc).
Approach 2) writes the sizeof(obj) bytes in memory: that's a constant number of bytes regardless of their content. The number will be stored in binary, so a program like less won't show you the human-readable number from num1.
Depending on the way your CPU stores numbers in memory, you might have the bytes of the number stored in different orders in the file (something called endianness). There'll then always be 10 characters from abc even if there's a NUL in there somewhere. Writing out binary blocks like this is normally substantially faster than converting numbers to ASCII text and having the computer worry about if/where there are NULs. Not that you normally have to care, but not all the bytes written necessarily contribute to the logical value of obj: some may be padding.
A more subtle difference is that for approach 1) there are ostensibly multiple object states that could produce the same output. Consider {123, "45"} and {12345, ""}: either way you'd print "12345". So, you couldn't later open and read the file and be sure to set num1 and abc to what they used to be. I say "ostensibly" above because you might happen to have some knowledge we don't, such as that the abc field will always start with a letter. Another problem is knowing where abc finishes, as its length can vary. If these issues are relevant to your actual use (e.g. abc could start with a digit), you could for example write << obj.num1 << ' ' << obj.abc << '\n' so the space and newline would tell you where the fields end (assuming abc won't contain newlines: if it could, consider another delimiter character or an escaping/quoting convention). With the space/newline delimiters, you can read the file back by changing the type of abc to std::string to protect against overruns by corrupt or tampered-with files, then using if (inputStream >> obj.num1 && getline(inputStream, obj.abc)) ...process obj.... getline can cope with embedded spaces and will read until a newline.
Example: {258, "hello\0\0\0\0\0"} on a little-endian system where int is 32 bits and the structure's padded out to 16 bytes would produce (offsets and byte values shown in hexadecimal):
             bytes in file at offset...
             00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
approach 1)  32 35 38 68 65 6c 6c 6f
             '2''5''8''h''e''l''l''o'
approach 2)  02 01 00 00 68 65 6c 6c 6f 00 00 00 00 00 00 00
             [32-bit 258]'h''e''l''l''o'\0 \0 \0 \0 \0 padpad
Notes: for approach 2, the bytes 02 01 00 00 are the little-endian encoding of 258 decimal (binary 100000010): the least significant byte is stored first. (Search for "binary encoding" to learn more about this.)
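For completeness, here is a minimal sketch of that delimiter-based round trip (the file name, the struct name Record, and the std::string member are illustrative, not from the question):

#include <fstream>
#include <iostream>
#include <string>

// Write num1, a space, the text of abc and a newline, then read it back.
// std::string is used for abc so reads can't overrun a fixed buffer.
struct Record
{
    int num1;
    std::string abc;
};

int main()
{
    Record out{258, "hello"};
    {
        std::ofstream os("example3.txt");
        os << out.num1 << ' ' << out.abc << '\n';
    }

    Record in{};
    std::ifstream is("example3.txt");
    if (is >> in.num1 && is.get() == ' ' && std::getline(is, in.abc))
        std::cout << in.num1 << " \"" << in.abc << "\"\n";   // prints: 258 "hello"
    return 0;
}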
I have a 320Mb binary file (data.dat), containing 32e7 lines of hex numbers:
1312cf60 d9 ff e0 ff 05 00 f0 ff 22 00 2f 00 fe ff 33 00 |........"./...3.|
1312cf70 00 00 00 00 f4 ff 1d 00 3d 00 6d 00 53 00 db ff |........=.m.S...|
1312cf80 b7 ff b0 ff 1e 00 0c 00 67 00 d1 ff be ff f8 ff |........g.......|
1312cf90 0b 00 6b 00 38 00 f3 ff cf ff cb ff e4 ff 4b 00 |..k.8.........K.|
....
Original numbers were:
(16,-144)
(-80,-64)
(-80,16)
(16,48)
(96,95)
(111,-32)
(64,-96)
(64,-16)
(31,-48)
(-96,-48)
(-32,79)
(16,48)
(-80,80)
(-48,128)
...
I have a matlab code which can read them as real numbers and convert them to complex numbers:
nsamps = (256*1024);
for i = 1:305
    nstart = 1 + (i - 1) * nsamps;
    fid = fopen('data.dat');
    fseek(fid, 4 * nstart, 'bof');
    y = fread(fid, [2,nsamps], 'short');
    fclose(fid);
    x = complex(y(1,:), y(2,:));
I am using C++ and trying to get data as a vector<complex<float>>:
std::ifstream in("data.dat", std::ios_base::in | std::ios_base::binary);
fseek(infile1, 4*nstart, SEEK_SET);
vector<complex<float> > sx;
in.read(reinterpret_cast<char*>(&sx), sizeof(int));
and I am very confused about how to get the complex data using C++. Can anyone help me?
Theory
I'll try to explain some points using the issues in your code as examples.
Let's start from the end of the code. You try to read a number, which is stored as a four-byte single-precision floating point number, but you use sizeof(int) as a size argument. While on modern x86 platforms with modern compilers sizeof(int) tends to be equal to sizeof(float), it's not guaranteed. sizeof(int) is compiler dependent, so please use sizeof(float) instead.
In the matlab code you read 2*nsamps numbers, while in the C++ code only four bytes (one number) are read. Something like sizeof(float) * 2 * nsamps would be closer to the matlab code.
Next, std::complex is a complicated class, which (in general) may have any implementation-defined internal representation. But luckily, here we read that
For any object z of type complex<T>, reinterpret_cast<T(&)[2]>(z)[0]
is the real part of z and reinterpret_cast<T(&)[2]>(z)[1] is the
imaginary part of z.
For any pointer to an element of an array of complex<T> named p and
any valid array index i, reinterpret_cast<T*>(p)[2*i] is the real part
of the complex number p[i], and reinterpret_cast<T*>(p)[2*i + 1] is
the imaginary part of the complex number p[i].
so we can just cast an std::complex to a char type and read binary data there. But std::vector is a class template with its own implementation-defined internal representation! That means we can't just reinterpret_cast<char*>(&sx) and write binary data through that pointer, as it points to the beginning of the vector object, which is unlikely to be the beginning of the vector's data. The modern C++ way to get the beginning of the data is to call sx.data(); the pre-C++11 way is to take the address of the first element: &sx[0]. Overwriting the object from the beginning will almost always result in a segfault.
OK, now we have the beginning of a data buffer that is able to receive the binary representation of complex numbers. But when you declared vector<complex<float> > sx;, it got zero size, and since you are not pushing or emplacing elements, the vector will not "know" that it should resize. Segfault again. So just call resize:
sx.resize(number_of_complex_numbers_to_store);
or use an appropriate constructor:
vector<complex<float> > sx(number_of_complex_numbers_to_store);
before writing data to the vector. Note that these methods deal with the "high-level" concept of the number of stored elements, not the number of bytes to store.
Putting it all together, the last two lines of your code should look like:
vector<complex<float> > sx(nsamps);
in.read(reinterpret_cast<char*>(sx.data()), 2 * nsamps * sizeof(float));
Minimal example
If you continue having troubles, try a simpler sandbox code first.
For example, let's write six floats to a binary file:
std::ofstream ofs("file.dat", std::ios::binary | std::ios::out);
float foo[] = {1,2,3,4,5,6};
ofs.write(reinterpret_cast<char*>(foo), 6*sizeof(float));
ofs.close();
then read them to a vector of complex:
std::ifstream ifs("file.dat", std::ios::binary | std::ios::in);
std::vector<std::complex<float>> v(3);
ifs.read(reinterpret_cast<char*>(v.data()), 6*sizeof(float));
ifs.close();
and, finally, print them:
std::cout << v[0] << " " << v[1] << " " << v[2] << std::endl;
The program prints:
(1,2) (3,4) (5,6)
so this approach works fine.
Binary files
Here is the remark about binary files which I initially posted as a comment.
Binary files haven't got the concept of "lines". The number of "lines" in a binary file depends entirely on the size of the window you are viewing it in. Think of a binary file as a magnetic tape, where each discrete position of the head is able to read only one byte. Interpretation of those bytes is up to you.
If everything seems like it should work but you still get weird numbers, check the displacement in the fseek call: being off by even a few bytes yields random-looking values instead of the floats you expect.
Of course, you could also just read a vector (or an array) of plain floats, observing the above considerations, and then convert them to complex numbers in a loop, as sketched below. That is also a good way to debug your fseek call and make sure you start reading from the right place.
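A minimal sketch of that alternative, following the answer's assumption that the file holds 32-bit floats (nsamps and the file name are taken from the question; the seek offset is left out for brevity):

#include <complex>
#include <cstddef>
#include <fstream>
#include <vector>

// Read 2*nsamps plain floats, then pair them up as (real, imaginary).
int main()
{
    const std::size_t nsamps = 256 * 1024;

    std::vector<float> raw(2 * nsamps);
    std::ifstream in("data.dat", std::ios::binary);
    in.read(reinterpret_cast<char*>(raw.data()),
            static_cast<std::streamsize>(raw.size() * sizeof(float)));

    std::vector<std::complex<float> > sx;
    sx.reserve(nsamps);
    for (std::size_t i = 0; i < nsamps; ++i)
        sx.push_back(std::complex<float>(raw[2 * i], raw[2 * i + 1]));
    return 0;
}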
What I must do is open a file in binary mode that contains stored data that is intended to be interpreted as integers. I have seen other examples, such as Stack Overflow: Reading "integer" size bytes from a char* array, but I want to try taking a different approach (I may just be stubborn, or stupid :/). I first created a simple binary file in a hex editor that reads as follows.
00 00 00 47 00 00 00 17 00 00 00 41
This (should) equal 71, 23, and 65 if the 12 bytes were divided into 3 integers.
After opening this file in binary mode and reading 4 bytes into an array of chars, how can I use bitwise operations to make the bits of char[0] become the first 8 bits of an int, and so on, until the bits of each char are part of the int?
My integer = 00 00 00 00
+ ^ ^ ^ ^
Chars Char[0] Char[1] Char[2] Char[3]
00 00 00 47
So my integer(hex) = 00 00 00 47 = numerical value of 71
Also, I don't know how the endianness of my system comes into play here, so is there anything that I need to keep in mind?
Here is a code snippet of what I have so far, I just don't know the next steps to take.
std::fstream myfile;
myfile.open("C:\\Users\\Jacob\\Desktop\\hextest.txt", std::ios::in | std::ios::out | std::ios::binary);

if (myfile.is_open() == false)
{
    std::cout << "Error" << std::endl;
}

char* mychar;
std::cout << myfile.is_open() << std::endl;
mychar = new char[4];
myfile.read(mychar, 4);
I eventually plan on dealing with reading floats from a file and maybe a custom data type eventually, but first I just need to get more familiar with using bitwise operations.
Thanks.
You want the bitwise left shift operator:
typedef unsigned char u8; // in case char is signed by default on your platform
unsigned num = ((u8)chars[0] << 24) | ((u8)chars[1] << 16) | ((u8)chars[2] << 8) | (u8)chars[3];
What it does is shift the left argument a specified number of bits to the left, adding zeros from the right as stuffing. For example, 2 << 1 is 4, since 2 is 10 in binary and shifting one to the left gives 100, which is 4.
This can be written in a more general loop form:
unsigned num = 0;
for (int i = 0; i != 4; ++i) {
    num |= (u8)chars[i] << (24 - i * 8); // += could have also been used
}
The endianness of your system doesn't matter here; you know the endianness of the representation in the file, which is constant (and therefore portable), so when you read in the bytes you know what to do with them. The internal representation of the integer in your CPU/memory may differ from that of the file, but the logical bitwise manipulation of it in code is independent of your system's endianness; the least significant bits are always at the right, and the most significant at the left (in code). That's why shifting is cross-platform -- it operates at the logical bit level :-)
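Putting the pieces together, here is a small end-to-end sketch (the file name is the one from the question, simplified to a relative path) that reads the 12-byte sample file and prints 71, 23 and 65:

#include <cstdint>
#include <fstream>
#include <iostream>

// Reads the file four bytes at a time and assembles each group into one
// unsigned 32-bit integer, most significant byte first.
int main()
{
    std::ifstream file("hextest.txt", std::ios::binary);
    unsigned char bytes[4];
    while (file.read(reinterpret_cast<char*>(bytes), 4))
    {
        std::uint32_t value = (std::uint32_t(bytes[0]) << 24) |
                              (std::uint32_t(bytes[1]) << 16) |
                              (std::uint32_t(bytes[2]) << 8)  |
                               std::uint32_t(bytes[3]);
        std::cout << value << '\n';   // 71, 23, 65 for the sample file
    }
    return 0;
}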
Have you thought of using Boost.Spirit to make a binary parser? You might hit a bit of a learning curve when you start, but if you want to expand your program later to read floats and structured types, you'll have an excellent base to start from.
Spirit is very well-documented and is part of Boost. Once you get around to understanding its ins and outs, it's really mind-boggling what you can do with it, so if you have a bit of time to play around with it, I'd really recommend taking a look.
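For the binary-parser route, a minimal sketch using Spirit's big-endian binary parsers might look like this (assuming Boost is available; the sample bytes are the ones from the question):

#include <cstdint>
#include <iostream>
#include <vector>
#include <boost/spirit/include/qi.hpp>

// qi::big_dword consumes four bytes and exposes them as a big-endian
// 32-bit unsigned integer; the Kleene star collects them into a vector.
int main()
{
    namespace qi = boost::spirit::qi;

    const char data[] = "\x00\x00\x00\x47\x00\x00\x00\x17\x00\x00\x00\x41";
    const char* first = data;
    const char* last  = data + 12;

    std::vector<std::uint32_t> values;
    if (qi::parse(first, last, *qi::big_dword, values))
        for (std::uint32_t v : values)
            std::cout << v << '\n';   // 71, 23, 65
    return 0;
}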
Otherwise, if you want your binary to be "portable" - i.e. you want to be able to read it on both a big-endian and a little-endian machine - you'll need some sort of byte-order mark (BOM). That would be the first thing you'd read, after which you can simply read your integers byte by byte. The simplest thing would probably be to read them into a union (if you know the size of the integers you're going to read), like this:
union U
{
    unsigned char uc_[4];
    unsigned long ui_;
};
read the data into the uc_ member, swap the bytes around if you need to change endianness, and read the value from the ui_ member. There's no shifting etc. to be done - except for the swapping if you want to change endianness.
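A minimal sketch of that union-based read (assuming the file holds big-endian 32-bit values and the host is little-endian; unsigned int is used here on the assumption that it is 32 bits on the target platform):

#include <algorithm>
#include <fstream>
#include <iostream>

// Big-endian values from the file are read into uc_, the byte order is
// reversed for a little-endian host, and the result is read back through ui_.
union U
{
    unsigned char uc_[4];
    unsigned int  ui_;   // assumed 32 bits here; uint32_t would pin it down
};

int main()
{
    std::ifstream file("hextest.txt", std::ios::binary);
    U u;
    while (file.read(reinterpret_cast<char*>(u.uc_), 4))
    {
        std::reverse(u.uc_, u.uc_ + 4);   // swap the bytes to change endianness
        std::cout << u.ui_ << '\n';       // 71, 23, 65 for the sample file
    }
    return 0;
}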
HTH
rlc