I'm trying to read a file that contains double-formatted numbers in a matrix of 82503x1200. I can read the file, but I can't find a way to specify the correct size of each number for the offset passed to lseek. Why does it give me these numbers instead of the numbers in the file?
float fd;
float ret;
float b;
const size_t NUM_ELEMS = 11;
const size_t NUM_BYTES = NUM_ELEMS * sizeof(float);
fd = open("signal_80k.txt",O_RDONLY);
if(fd < 0){
perror("open");
//exit(1);
}
ret = lseek(fd, seekCounter*NUM_BYTES, SEEK_SET);
ret = read(fd, &b, sizeof(float));
cout<<"> " << seekCounter << ": " << b<<endl;
seekCounter++;
close(fd);
it prints:
0: 1.02564e-08
1: 1.08604e-05
2: 0.000174702
3: 6.56482e-07
4: 2.57894e-09
but the first values are:
9.402433000000000e
8.459109000000000e
8.947654000000000e+03
9.021620000000000e
This is how it looks in matlab
In your comments you clarified that the file contains text data, and my answer is based on that. Now, let's take a look at the first number in the file:
1.02564e-08
How many characters are there? I count 11 characters. Then, there's a space after it, so the next value after this one will be twelve characters after the first one.
By casual inspection, it appears that your code sets
const size_t NUM_ELEMS = 11;
to be the number of values per row.
Then your code sets
const size_t NUM_BYTES = NUM_ELEMS * sizeof(float);
To calculate the number of characters taken up by each row. Now, it's possible that I missed the actual meaning of these constants, but in any case, you have a target value in the file, and you're attempting to seek to it directly, that's the bottom line. So, for the purpose of this answer I'll go with this interpretation, but the answer's still the same, in any case.
Pop quiz for you. What is sizeof(float)?
Answer: it's 4 bytes, on most implementations (so I'll assume that going forward). So, you compute that there's going to be 44 characters per row, and you use that to attempt to seek to the appropriate line in the file. That's, at least, how I parsed your code.
The problem, of course, is that, assuming that each value is represented in scientific notation, with 11 values per line, and each value taking up 12 characters (including either a trailing space or a newline), each line will actually take 11 * 12 or 132 characters, and not 44. Add one more character per line if your operating system uses \r\n for a new line.
So, you need to make some adjustments there. And even after that, this whole house of cards depends on each value in the file always being represented in scientific notation, with the same precision.
Which is an assumption you can't really make. Furthermore, that's not the only problem here.
The second problem is you are attempting to read() the contents of the file directly into float datatypes. Yes, each float datatype will be four bytes, because that's how many bytes it takes to represent a float value in binary. The problem here is that the file does not contain raw binary data, but text data.
In conclusion, I don't see much choice here but to read the file from start to finish, instead of attempting to seek to the right spot, since you have no guarantees that each value in the file will occupy the same number of characters; and then read the file as text, and convert its contents, using operator>>, to float values.
If the file was binary, then lseek would be the suitable method?
I changed the approach to this:
ifstream inFile("signal_80k.txt");
string line;
int count = 0 ;
if(!inFile.is_open())
{
cout<<"\n Cannot open the signal_80k.txt file"<<"\n";
}
else
{
cout<<"loading all data... "<<"\n";
while(getline( inFile , line) ){
vector< string > numbers = ci::split( line, " ", false );
for(int i = 0; i <numbers.size(); i++){
try{
float thisNumber = std::stof(numbers.at(i));
cout<<"numbers at: " << " = "<< thisNumber <<"\n";
}
catch (...){
}
}
count++;
cout<<"done: "<<count<<"\n";
}
cout<<"all data ready!"<<"\n";
inFile.close();
}
I have an extremely large fixed-width CSV file (1.3 million rows and 80K columns). It's about 230 GB in size. I need to be able to fetch a subset of those rows. I have a vector of row indices that I need. However, I need to now figure out how to traverse such a massive file to get them.
The way I understand it, C++ will go through the file line by line, until it hits the newline (or a given delimiter), at which point, it'll clear the buffer, and then move onto the next line. I have also heard of a seek() function that can go to a given position in a stream. So is it possible to use this function somehow to get the pointer to the correct line number quickly?
I figured that since the program doesn't have to basically run billions of if statements to check for newlines, it might improve the speed if I simply tell the program where to go in the fixed-width file. But I have no idea how to do that.
Let's say that my file has a width of n characters and my line numbers are {l_1, l_2, l_3, ... l_m} (where l_1 < l_2 < l_3, ... < l_m). In that case, I can simply tell the file pointer to go to (l_1 - 1) * n, right? But then for the next line, do I calculate the next jump from the end of the l_1 line or from the beginning of the next line? And should I include the newlines when calculating the jumps?
Will this even help improve speed, or am I just misunderstanding something here?
Thanks for taking the time to help
EDIT: The file will look like this:
id0000001,AB,AB,AA,--,BB
id0000002,AA,--,AB,--,BB
id0000003,AA,AA,--,--,BB
id0000004,AB,AB,AA,AB,BB
As I proposed in the comment, you can compress your data field to two bits:
-- 00
AA 01
AB 10
BB 11
That reduces your file size by a factor of 12, to about 20GB. Since your processing is likely IO-bound, it may speed up by roughly the same factor.
The resulting file will have a record length of 20,000 bytes, so it will be easy to calculate an offset to any given record. No new line symbols to consider :)
Here is how I build that binary file:
#include <cstdint> // uint64_t
#include <fstream>
#include <iostream>
#include <string>
#include <chrono>
int main()
{
auto t1 = std::chrono::high_resolution_clock::now();
std::ifstream src("data.txt", std::ios::binary);
std::ofstream bin("data.bin", std::ios::binary);
size_t length = 80'000 * 3 + 9 + 2; // the `2` is a length of CR/LF on my Windows; use `1` for other systems
std::string str(length, '\0');
while (src.read(&str[0], length))
{
size_t pos = str.find(',') + 1;
for (int group = 0; group < 2500; ++group) {
uint64_t compressed(0), field(0);
for (int i = 0; i < 32; ++i, pos += 3) {
if (str[pos] == '-')
field = 0;
else if (str[pos] == 'B')
field = 3;
else if (str[pos + 1] == 'B')
field = 2;
else
field = 1;
compressed <<= 2;
compressed |= field;
}
bin.write(reinterpret_cast<char*>(&compressed), sizeof compressed);
}
}
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
// clear the eof/fail bits set by trying to read past EOF
src.clear();
// rewind to the first record
src.seekg(0);
src.read(&str[0], length);
// read next (second) record
src.read(&str[0], length);
// read forty second record from start (skip 41)
src.seekg(41 * length, std::ios_base::beg);
src.read(&str[0], length);
// read next (forty third) record
src.read(&str[0], length);
// read fiftieth record (skip 6 from current position)
src.seekg(6 * length, std::ios_base::cur);
src.read(&str[0], length);
return 0;
}
This can encode about 1,600 records in a second, so the whole file will take ~15 minutes. How long does it take you now to process it?
UPDATE:
Added example of how to read individual records from src.
I only managed to make seekg() work in binary mode.
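For reading data.bin back, here is a rough sketch of the arithmetic. It assumes exactly the layout the encoder above produces: 2500 words of 64 bits per record, 32 fields per word, the first field of each group in the top two bits, and words stored in the byte order of the machine that wrote them. The names recordOffset and decodeField are mine, not part of the original program:

```cpp
#include <cstddef>
#include <cstdint>

// Layout assumed from the encoder: 2500 packed 64-bit words per record.
constexpr std::size_t kWordsPerRecord = 2500;
constexpr std::size_t kRecordBytes = kWordsPerRecord * sizeof(std::uint64_t); // 20,000

// Byte offset of a zero-based record in data.bin.
constexpr std::uint64_t recordOffset(std::uint64_t recordIndex)
{
    return recordIndex * kRecordBytes;
}

// Returns the 2-bit code (0 "--", 1 "AA", 2 "AB", 3 "BB") of field
// `indexInWord` (0..31) inside one packed word.
constexpr unsigned decodeField(std::uint64_t word, unsigned indexInWord)
{
    return static_cast<unsigned>((word >> (2 * (31 - indexInWord))) & 3u);
}
```

To fetch column c of record r you would seekg to recordOffset(r) + (c / 32) * 8, read one uint64_t, and call decodeField(word, c % 32).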
The seek family of functions in the <iostream> classes is byte-oriented. You can use them if and only if you are absolutely confident that your records (lines, in this case) have a fixed number of bytes; in that case, instead of getline, you can open the file as binary and use .read, which reads a specified number of bytes into a byte array of sufficient capacity. But because the file stores text after all, if even a single record has a different size you'll get out of alignment; if the id field is guaranteed to equal the line number, or at least to increase with it, an educated guess followed by trial and error can help. You should switch to a proper database management system soon; even a 10GB single binary file is too large and prone to corruption. You may consider chopping it into much smaller slices (on the order of 100MB, maybe) to minimize the chance of damage propagating. You will also need some redundancy mechanism for recovery and correction.
I want to store multiple arrays whose entries are all either 0 or 1.
The file would be quite large if I kept doing it the way I do now.
I made a minimal version of what I currently do.
#include <iostream>
#include <fstream>
using namespace std;
int main(){
ofstream File;
File.open("test.csv");
int array[4]={1,0,0,1};
for(int i = 0; i < 4; ++i){
File << array[i] << endl;
}
File.close();
return 0;
}
So basically, is there a way of storing this in a binary file or something, since my data is 0 or 1 in the first place anyway?
If yes, how do I do this? Can I also still have line breaks and maybe even commas in that file? If either of the latter does not work, that's also fine. More importantly, how do I store this as a binary file which has only 0 and 1, so my file is smaller?
Thank you very much!
The obvious solution is to take 64 characters, say A-Z, a-z, 0-9, and + and /, and have each character code for six entries in your table. There is, in fact, a standard for this called Base64. In Base64, A encodes 0,0,0,0,0,0 while / encodes 1,1,1,1,1,1. Each combination of six zeroes or ones has a corresponding character.
This still leaves commas, spaces, and newlines free for your use as separators.
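A minimal sketch of that idea. Note that it only borrows the Base64 alphabet to carry six 0/1 entries per character; it is not a full Base64 encoder (no byte grouping, no "=" padding), and the name encodeBits is made up for this example:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Packs 0/1 entries six at a time into characters of the Base64 alphabet.
// The first entry of each group becomes the most significant bit; a final
// short group is padded with zeros, so the caller must remember the count.
std::string encodeBits(const std::vector<int>& bits)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (std::size_t i = 0; i < bits.size(); i += 6) {
        unsigned value = 0;
        for (std::size_t j = 0; j < 6; ++j) {
            value <<= 1;
            if (i + j < bits.size() && bits[i + j] != 0) value |= 1u;
        }
        out.push_back(alphabet[value]);
    }
    return out;
}
```

Each array can then be written as one line of text, with commas or newlines between arrays, at one sixth of the size of one character per entry.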
If you want to store the data as compactly as possible, I'd recommend storing it as binary data, where each bit in the binary file represents one boolean value. This will allow you to store 8 boolean values for each byte of disk space you use up.
If you want to store arrays whose lengths are not multiples of 8, it gets a little bit more complicated since you can't store a partial byte, but you can solve that problem by storing an extra byte of meta-data at the end of the file that specifies how many bits of the final data-byte are valid and how many are just padding.
Something like this:
#include <iostream>
#include <fstream>
#include <cstdint>
#include <vector>
using namespace std;
// Given an array of ints that are either 1 or 0, returns a packed-array
// of uint8_t's containing those bits as compactly as possible.
vector<uint8_t> packBits(const int * array, size_t arraySize)
{
const size_t vectorSize = ((arraySize+7)/8)+1; // round up, then +1 for the metadata byte
vector<uint8_t> packedBits;
packedBits.resize(vectorSize, 0);
// Store 8 boolean-bits into each byte of (packedBits)
for (size_t i=0; i<arraySize; i++)
{
if (array[i] != 0) packedBits[i/8] |= (1<<(i%8));
}
// The last byte in the array is special; it holds the number of
// valid bits that we stored to the byte just before it.
// That way if the number of bits we saved isn't an even multiple of 8,
// we can use this value later on to calculate exactly how many bits we should restore
packedBits[vectorSize-1] = arraySize%8;
return packedBits;
}
// Given a packed-bits vector (i.e. as previously returned by packBits()),
// returns the vector-of-integers that was passed to the packBits() call.
vector<int> unpackBits(const vector<uint8_t> & packedBits)
{
vector<int> ret;
if (packedBits.size() < 2) return ret;
const size_t validBitsInLastByte = packedBits[packedBits.size()-1]%8;
const size_t numValidBits = 8*(packedBits.size()-((validBitsInLastByte>0)?2:1)) + validBitsInLastByte;
ret.resize(numValidBits);
for (size_t i=0; i<numValidBits; i++)
{
ret[i] = (packedBits[i/8] & (1<<(i%8))) ? 1 : 0;
}
return ret;
}
// Returns the size of the specified file in bytes, or -1 on failure
static ssize_t getFileSize(ifstream & inFile)
{
if (inFile.is_open() == false) return -1;
const streampos origPos = inFile.tellg(); // record current seek-position
inFile.seekg(0, ios::end); // seek to the end of the file
const ssize_t fileSize = inFile.tellg(); // record current seek-position
inFile.seekg(origPos); // so we won't change the file's read-position as a side effect
return fileSize;
}
int main(){
// Example of packing an array-of-ints into packed-bits form and saving it
// to a binary file
{
const int array[]={0,0,1,1,1,1,1,0,1,0};
// Pack the int-array into packed-bits format
const vector<uint8_t> packedBits = packBits(array, sizeof(array)/sizeof(array[0]));
// Write the packed-bits to a binary file
ofstream outFile;
outFile.open("test.bin", ios::binary);
outFile.write(reinterpret_cast<const char *>(&packedBits[0]), packedBits.size());
outFile.close();
}
// Now we'll read the binary file back in, unpack the bits to a vector<int>,
// and print out the contents of the vector.
{
// open the file for reading
ifstream inFile;
inFile.open("test.bin", ios::binary);
const ssize_t fileSizeBytes = getFileSize(inFile);
if (fileSizeBytes < 0)
{
cerr << "Couldn't read test.bin, aborting" << endl;
return 10;
}
// Read in the packed-binary data
vector<uint8_t> packedBits;
packedBits.resize(fileSizeBytes);
inFile.read(reinterpret_cast<char *>(&packedBits[0]), fileSizeBytes);
// Expand the packed-binary data back out to one-int-per-boolean
vector<int> unpackedInts = unpackBits(packedBits);
// Print out the int-array's contents
cout << "Loaded-from-disk unpackedInts vector is " << unpackedInts.size() << " items long:" << endl;
for (size_t i=0; i<unpackedInts.size(); i++) cout << unpackedInts[i] << " ";
cout << endl;
}
return 0;
}
(You could probably make the file even more compact than that by running zip or gzip on the file after you write it out :) )
You can indeed write and read binary data. However, having line breaks and commas would be difficult. Imagine you save your data as boolean data, so only ones and zeros. Then having a comma would mean you need a special character, but you have only ones and zeros! The next best thing would be to make an object of two booleans, one holding the data you need (C++ would then read the data in pairs of bits) and the other indicating whether a comma follows, but I doubt this is what you need. If you want something like a CSV, it would be easier to fix the size of each column (an int would be 4 bytes, a string no more than 32 chars, for example) and then read and write accordingly. Suppose you have your binary
To initially save your array of objects, say pets, you would use
FILE *apFile;
apFile = fopen(FILENAME,"wb");
fwrite(ARRAY_OF_PETS, sizeof(Pet),SIZE_OF_ARRAY, apFile);
fclose(apFile);
To access your idx pet, you would use
Pet m;
ifstream input_file (FILENAME, ios::in|ios::binary|ios::ate);
input_file.seekg (sizeof(Pet) * idx, ios::beg);
input_file.read((char*) &m,sizeof(Pet));
input_file.close();
You can also add data at the end, change data in the middle and so on.
I want to read double values from a binary file and store them in a vector. My values have the following form: 73.6634, 73.3295, 72.6764 and so on. I have this code that read and store data in memory. It works perfectly with char types since the read function has as input a char type (istream& read (char* s, streamsize n)). When I try to convert char type to double I get obviously integer values as 74, 73, 73 and so on. Is there any function which allows me to read directly double values or any other way of doing that?
If I change char * memblock to double * memblock and memblock = new char[] to memblock = new double[] , I get errors when compiling because again read function can only have char type input variable...
Thanks, I will appreciate your help :)
// reading an entire binary file
#include <iostream>
#include <fstream>
using namespace std;
int main () {
streampos size;
char * memblock;
int i=0;
ifstream file ("example.bin", ios::in|ios::binary|ios::ate);
if (file.is_open())
{
size = file.tellg();
cout << "size=" << size << "\n";
memblock = new char [size];
file.seekg (0, ios::beg);
file.read (memblock, size);
file.close();
cout << "the entire file content is in memory \n";
for(i=0; i<=10; i++)
{
double value = memblock [i];
cout << "value ("<<i<<")=" << value << "\n";
}
delete[] memblock;
}
else cout << "Unable to open file";
return 0;
}
(sorry about the "Like I'm 5" tone, I have no idea how much you know or don't)
Intro Binary Data
As you probably know, your computer doesn't think about numbers the way you do.
To start, the computer thinks about all numbers in a "base 2" system. But it doesn't stop there. Your computer also associates a fixed size with every number; it gives the numbers a fixed "width". This size is (almost always) a whole number of bytes, where a byte is a group of 8 binary digits. This is (pretty close to) the equivalent of looking at the numbers [1,15,30002], when you do math on them, as
[
00000001
00000015
00030002
]
(doubles are a little weirder, but I'll get to that in a second).
Let's pretend for demonstrative purposes that each 2 characters above represent a single byte of data. This means that, in the computer, it thinks about the numbers like this:
[
00,00,00,01
00,00,00,15
00,03,00,02
]
File IO is all done in units of one "byte" (char): it typically has no idea what it is reading. It is up to YOU to figure that out. When writing binary data to a file (from an array at least) we just dump it all. So in the example above, we write it all to the file like this:
[00,00,00,01,00,00,00,15,00,03,00,02]
But when you read it back, you'll have to reinterpret it in groups of 4 bytes, the size of the original type.
Luckily, this is stupidly easy to do in C++:
size = file.tellg();
cout << "size=" << size << "\n";
memblock = new char [size];
file.seekg (0, ios::beg);
file.read (memblock, size);
file.close();
cout << "the entire file content is in memory \n";
double* double_values = (double*)memblock;//reinterpret as doubles
for(i=0; i<=10; i++)
{
double value = double_values[i];
cout << "value ("<<i<<")=" << value << "\n";
}
What this basically does is say, interpret those bytes (char) as double.
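The cast tends to work with common compilers, but it leans on the buffer being suitably aligned and on the compiler tolerating the type pun. A more conservative sketch copies the bytes into real doubles with memcpy. It assumes the file was written with the same double format and byte order as the reading machine; bytesToDoubles is a name I made up:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copies raw bytes into properly aligned doubles. Assumes the bytes were
// written by a program using the same double format and byte order as
// this machine. Trailing bytes that do not fill a whole double are ignored.
std::vector<double> bytesToDoubles(const char* bytes, std::size_t byteCount)
{
    std::vector<double> values(byteCount / sizeof(double));
    if (!values.empty()) {
        std::memcpy(values.data(), bytes, values.size() * sizeof(double));
    }
    return values;
}
```

In the code above you would call it as bytesToDoubles(memblock, size) after the read, and loop up to the returned vector's size() instead of a hard-coded 10.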
edit: Endianness
Endianness is (again, LI5) the order in which the computer writes the bytes of a number. You are used to twenty-five being written left to right (25, twenty-five), but it would be just as valid to write the number from right to left (52, five-twenty). We have big-endian (most significant byte at the lowest address) and little-endian (most significant byte at the highest address).
This was never standardized between architectures or virtual machines, so if the writer and the reader disagree you can get weird results.
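If the file turns out to be in the other byte order, the bytes of each value have to be reversed before use. A small sketch, assuming 8-byte IEEE 754 doubles stored big-endian in the file (that is an assumption about your data, so check how the file was written); both function names are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// True when this machine stores the least significant byte first.
bool hostIsLittleEndian()
{
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Decodes 8 bytes that hold an IEEE 754 double in big-endian order.
double doubleFromBigEndian(const unsigned char* bytes)
{
    unsigned char tmp[sizeof(double)];
    std::memcpy(tmp, bytes, sizeof(double));
    if (hostIsLittleEndian()) {
        std::reverse(tmp, tmp + sizeof(double)); // swap into host order
    }
    double value;
    std::memcpy(&value, tmp, sizeof(double));
    return value;
}
```

The swap is done on the raw bytes before they are ever treated as a double, so a half-converted bit pattern is never interpreted as a number.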
A special case: doubles
Not really in line with your question, but I have to point out that doubles are a special case: while reading and writing looks the same, the underlying data isn't just a simple number. I like to think of doubles as the "scientific notation" of computers. The double format uses a base and a power to get your number: in the same amount of space as a 64-bit integer it stores (sign) x (significand) x 2^(exponent). This gives a much larger dynamic range of representable values, BUT you lose a certain sense of "human readability" of the bytes, and you get the SAME number of distinct values, so you can lose precision (though it's relative precision, just like scientific notation, so you may not be able to distinguish ten quadrillion from ten quadrillion and 1, but that 1 is TINY compared to the number).
writing data in C++
We might as well point out one quirk of C++: when you write the data, you have to make sure the stream doesn't try to reformat it as text, so open the file in binary mode. http://www.cplusplus.com/forum/general/21018/
The issue is this -- there is no guarantee that binary data written by another program (you said Matlab) can be read back by another program by merely casting, unless you know that the data written by this secondary program is the same as data written by your program.
It may not be sufficient to just cast -- you need to know the exact form of the data that is written. You need to know the binary format (for example IEEE), the number of bytes each value occupies, endianness, etc. so that you can interpret the data correctly.
What you should do is this -- write a small program that writes out the number you claim this file has to another file. Then look at the file you just wrote in a hex editor. Then take the file you're attempting to read that was created by MatLab and compare the contents side-by-side with the one you just wrote. Do you see a pattern? If not, then either you have to find one, or forget about it and get the two files to be the same.
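If you don't have a hex editor at hand, a small helper like this can show the bytes your own program would write for a known value. It is only a sketch; doubleBytesAsHex is a made-up name:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Returns the in-memory bytes of `value` as hex, lowest address first.
// These are the bytes a raw binary write of the double would put in a file.
std::string doubleBytesAsHex(double value)
{
    static const char digits[] = "0123456789ABCDEF";
    unsigned char bytes[sizeof(double)];
    std::memcpy(bytes, &value, sizeof(double));
    std::string hex;
    for (std::size_t i = 0; i < sizeof(double); ++i) {
        hex.push_back(digits[bytes[i] >> 4]);
        hex.push_back(digits[bytes[i] & 0x0F]);
    }
    return hex;
}
```

On a little-endian machine with IEEE 754 doubles, doubleBytesAsHex(1.0) gives 000000000000F03F; a big-endian writer would have produced 3FF0000000000000. Seeing which of the two patterns the other program's file contains tells you whether a byte swap is needed.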
Does anyone know how to read in a file with raw encoding? So stumped.... I am trying to read in floats or doubles (I think). I have been stuck on this for a few weeks. Thank you!
File that I am trying to read from:
http://www.sci.utah.edu/~gk/DTI-data/gk2/gk2-rcc-mask.raw
Description of raw encoding:
hello://teem.sourceforge.net/nrrd/format.html#encoding (change hello to http to go to page)
- "raw" - The data appears on disk exactly the same as in memory, in terms of byte values and byte ordering. Produced by write() and fwrite(), suitable for read() or fread().
Info of file:
http://www.sci.utah.edu/~gk/DTI-data/gk2/gk2-rcc-mask.nhdr - I think the only things that matter here are the big endian (still trying to understand what that means from google) and raw encoding.
My current approach, uncertain if it's correct:
//Function ripped off from example of c++ ifstream::read reference page
void scantensor(string filename){
ifstream tdata(filename, ifstream::binary); // not sure if I should put ifstream::binary here
// other things I tried
// ifstream tdata(filename) ifstream tdata(filename, ios::in)
if(tdata){
tdata.seekg(0, tdata.end);
int length = tdata.tellg();
tdata.seekg(0, tdata.beg);
char* buffer = new char[length];
tdata.read(buffer, length);
tdata.close();
double* d;
d = (double*) buffer;
} else cerr << "failed" << endl;
}
/* P.S. I attempted to print the first 100 elements of the array.
Then I print 100 other elements at some arbitrary array indices (i.e. 9,900 - 10,000). I actually kept increasing the number of 0's until I ran out of bound at 100,000,000 (I don't think that's how it works lol but I was just playing around to see what happens)
Here's the part that makes me suspicious: ifstream can be constructed with different open modes, like the ones I tried above.
the first 100 values are always the same.
if I use ifstream::binary, then I get some values for the 100 arbitrary printing
if I use the other two options, then I get -6.27744e+066 for all 100 of them
So for now I am going to assume that ifstream::binary is the correct one. The thing is, I am not sure if the file I provided is what binary files actually look like. I am also unsure if these are the actual numbers that I am supposed to read in or just casting gone wrong. I do realize that my casting from char* to double* can be unsafe, and I got that from one of the threads.
*/
I really appreciate it!
Edit 1: Right now the data being read in using the above method is apparently "incorrect" since in paraview the values are:
Dxx,Dxy,Dxz,Dyy,Dyz,Dzz
[0, 1], [-15.4006, 13.2248], [-5.32436, 5.39517], [-5.32915, 5.96026], [-17.87, 19.0954], [-6.02961, 5.24771], [-13.9861, 14.0524]
It's a 3 x 3 symmetric matrix, so 7 distinct values, 7 ranges of values.
The floats that I am currently parsing from the file right now are very large (i.e. -4.68855e-229, -1.32351e+120).
Perhaps somebody knows how to extract the floats from Paraview?
Since you want to work with doubles, I recommend reading the data from the file as a buffer of doubles:
const long machineMemory = 0x40000000; // 1 GB
FILE* file = fopen("c:\\data.bin", "rb");
if (file)
{
int size = machineMemory / sizeof(double);
if (size > 0)
{
double* data = new double[size];
size_t read = 0; // fread returns a size_t count of whole items
while ((read = fread(data, sizeof(double), size, file)) > 0)
{
// Process data here (read = number of doubles)
}
delete [] data;
}
fclose(file);
}
What is an efficient, proper way of reading in a data file with mixed characters? For example, I have a data file that contains a mixture of data loaded from other files, 32-bit integers, characters and strings. Currently, I am using an fstream object, but it gets stopped once it hits an int32 or the end of a string. If I add random data onto the end of the string in the data file, it seems to follow through with the rest of the file. This leads me to believe that the null-termination added onto strings is messing it up. Here's an example of loading in the file:
void main()
{
fstream fin("C://mark.dat", ios::in|ios::binary|ios::ate);
char *mymemory = 0;
int size;
size = 0;
if (fin.is_open())
{
size = static_cast<int>(fin.tellg());
mymemory = new char[static_cast<int>(size+1)];
memset(mymemory, 0, static_cast<int>(size + 1));
fin.seekg(0, ios::beg);
fin.read(mymemory, size);
fin.close();
printf(mymemory);
std::string hithere;
hithere = cin.get();
}
}
Why might this code stop after reading in an integer or a string? How might one get around this? Is this the wrong approach when dealing with these types of files? Should I be using fstream at all?
Have you ever considered that the file reading is working perfectly and it is printf(mymemory) that is stopping at the first null?
Have a look with the debugger and see if I am right.
Also, if you want to print someone else's buffer, use puts(mymemory) or printf("%s", mymemory). Don't accept someone else's input for the format string, it could crash your program.
Try
for (int i = 0; i < size ; ++i)
{
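// 0 - pad with 0s
// 2 - to a minimum width of two characters
// X - a Hex value with capital A-F (0A, 1B, etc)
// cast through unsigned char so bytes >= 0x80 don't print as FFFFFFxx
printf("%02X ", (unsigned)(unsigned char)mymemory[i]);
if (i % 32 == 31)
printf("\n"); //New line every 32 bytes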
}
as a way to dump your data file back out as hex.
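Here is the same dump as a standalone function that builds a string instead of printing, which makes it easy to check or log. It is a sketch; hexDump is not a library function:

```cpp
#include <cstddef>
#include <string>

// Formats `size` bytes as two-digit uppercase hex, 32 bytes per line.
// Going through unsigned char keeps bytes >= 0x80 from turning into
// negative values, and zero bytes are shown like any other byte.
std::string hexDump(const char* data, std::size_t size)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < size; ++i) {
        const unsigned char byte = static_cast<unsigned char>(data[i]);
        out.push_back(digits[byte >> 4]);
        out.push_back(digits[byte & 0x0F]);
        out.push_back((i % 32 == 31 || i + 1 == size) ? '\n' : ' ');
    }
    return out;
}
```

Because the length is passed explicitly, embedded null bytes do not cut the output short the way printf(mymemory) does.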