C++ loading data from part of a file

I want to keep a bunch of simple structures (just 3 ints per structure at the moment) in a file, and be able to read back just one of those structures at any given time.
As a first step, I'm trying to output them to a file, then read them back using boost::serialization. Currently I'm doing this, which crashes:
std::array<Patch, 3> outPatches;
outPatches[0].ZOrigin = 0;
outPatches[0].XOrigin = 0;
outPatches[0].Resolution = 64;

outPatches[1].ZOrigin = 1;
outPatches[1].XOrigin = 5;
outPatches[1].Resolution = 3;

outPatches[2].ZOrigin = 123;
outPatches[2].XOrigin = 546;
outPatches[2].Resolution = 6;

std::ofstream ofs("testing.sss", std::ios::binary);
for (auto const& patch : outPatches)
{
    std::cout << "start archive: " << ofs.tellp() << std::endl;
    {
        boost::archive::binary_oarchive oa(ofs);
        std::cout << "start patch: " << ofs.tellp() << std::endl;
        oa << patch;
    }
}
ofs.close();

std::array<Patch, 3> inPatches;
std::ifstream ifs("testing.sss", std::ios::binary);
for (auto& patch : inPatches)
{
    std::cout << "start archive: " << ifs.tellg() << std::endl;
    {
        boost::archive::binary_iarchive ia(ifs); // <-- crash here on second patch
        std::cout << "start patch: " << ifs.tellg() << std::endl;
        ia >> patch;
    }
}
ifs.close();

for (int i = 0; i != 3; ++i)
    std::cout << "check: " << (inPatches[i] == outPatches[i]) << std::endl;
I was planning on using tell to make an index of where each structure is, and seek to skip to that structure on load. Is this a reasonable approach to take? I don't know much about streams beyond the basics.
I've tried putting all the patches in one o/iarchive instead, which works fine for reading everything sequentially. However, seeking on the stream didn't work.
I've found this, which might be what I want, but I have no idea what it's doing or how to use it, or whether it would work with boost::serialization: read part of a file with iostreams
I'd probably be willing to switch to another serialization method if necessary, since I've not got very far with this.
Edit 3: Moved edits 1 and 2 to an answer.

I once had a similar case (with boost::serialization). What I did back then (and it was quite efficient, as far as I remember) was to map the file into a virtual address range, write a streamer that operates on memory buffers instead of files, and, for each part I wanted to read, point the streamer at the appropriate offset and length within the mapping, then initialize the iarchive with that streamer. That way the serialization library treated each object as if it were in a separate file.
Of course, adding to the file required a re-map. Now that I look back at this it seems a bit weird, but it was efficient, as far as I recall.
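A minimal sketch of that idea, assuming Boost.Iostreams is available and that the file has already been mapped into memory (e.g. with POSIX mmap); the function name loadPatchAt and the offset/length bookkeeping are only illustrative, not the original code:

#include <boost/archive/binary_iarchive.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>
#include <cstddef>

Patch loadPatchAt(const char* mappedBase, std::size_t offset, std::size_t length)
{
    // Wrap the relevant slice of the mapping in an in-memory source ...
    boost::iostreams::array_source src(mappedBase + offset, length);
    boost::iostreams::stream<boost::iostreams::array_source> is(src);

    // ... and let the archive read from it as if that object were in its own file.
    boost::archive::binary_iarchive ia(is);
    Patch p;
    ia >> p;
    return p;
}

The mapping itself could come from mmap or MapViewOfFile; the point is that the archive only ever sees a (pointer, length) window, so no seeking on a file stream is needed.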

Boost serialization
It doesn't seem possible to skip around inside a boost serialization archive. The best I've got so far is to use multiple archives on one stream:
static const int numPatches = 5000;

std::vector<int> indices(numPatches, 0);
std::iota(indices.begin(), indices.end(), 0);

std::vector<Patch> outPatches(numPatches, Patch());
std::for_each(outPatches.begin(), outPatches.end(),
    [] (Patch& p)
    {
        p.ZOrigin = rand();
        p.XOrigin = rand();
        p.Resolution = rand();
    });

std::vector<int64_t> offsets(numPatches, 0);
std::ofstream ofs("testing.sss", std::ios::binary);
for (auto i : indices)
{
    offsets[i] = ofs.tellp();
    boost::archive::binary_oarchive oa(ofs,
        boost::archive::no_header | boost::archive::no_tracking);
    oa << outPatches[i];
}
ofs.close();

std::random_shuffle(indices.begin(), indices.end());

std::vector<Patch> inPatches(numPatches, Patch());
std::ifstream ifs("testing.sss", std::ios::binary);
for (auto i : indices)
{
    ifs.seekg(offsets[i]);
    boost::archive::binary_iarchive ia(ifs,
        boost::archive::no_header | boost::archive::no_tracking);
    ia >> inPatches[i];
    ifs.clear();
}

std::cout << std::all_of(indices.begin(), indices.end(),
    [&] (int i) { return inPatches[i] == outPatches[i]; }) << std::endl;
Unfortunately, this is very slow, so I don't think I can use it. Next up is testing protobuf.
google::protobuf
I've got something working with protobuf. It required a bit of fiddling around (apparently I have to use the LimitingInputStream type, and store the size of each object), but it's a lot faster than the boost::serialization version:
static const int numPatches = 500;

std::vector<int> indices(numPatches, 0);
std::iota(indices.begin(), indices.end(), 0);

std::vector<Patch> outPatches(numPatches, Patch());
std::for_each(outPatches.begin(), outPatches.end(),
    [] (Patch& p)
    {
        p.ZOrigin = rand();
        p.XOrigin = rand();
        p.Resolution = 64;
    });

std::vector<int64_t> streamOffset(numPatches, 0);
std::vector<int64_t> streamSize(numPatches, 0);

std::ofstream ofs("testing.sss", std::ios::binary);
PatchBuffer buffer;
for (auto i : indices)
{
    buffer.Clear();
    WriteToPatchBuffer(buffer, outPatches[i]);
    streamOffset[i] = ofs.tellp();
    streamSize[i] = buffer.ByteSize();
    buffer.SerializeToOstream(&ofs);
}
ofs.close();

std::random_shuffle(indices.begin(), indices.end());

std::vector<Patch> inPatches(numPatches, Patch());
std::ifstream ifs("testing.sss", std::ios::binary);
for (auto i : indices)
{
    ifs.seekg(streamOffset[i]);
    buffer.Clear();
    google::protobuf::io::IstreamInputStream iis(&ifs);
    google::protobuf::io::LimitingInputStream lis(&iis, streamSize[i]);
    buffer.ParseFromZeroCopyStream(&lis);
    ReadFromPatchBuffer(inPatches[i], buffer);
    ifs.clear();
}

std::cout << std::all_of(indices.begin(), indices.end(),
    [&] (int i) { return inPatches[i] == outPatches[i]; }) << std::endl;

Related

Reading huge binary file (~1.5 GB) and writing the results to a text file C++

I am trying to read a huge binary file chunk by chunk, decode each chunk, and output it into a text file for easy troubleshooting. So far I have written code that does this, but it is extremely slow (it takes hours to decode the whole file).
Here is my code:
template<class T> std::vector<T> readBytes(std::ifstream& input, int numOfBytes) {
    std::vector<T> output;
    output.reserve(numOfBytes);
    T* buf = new T[numOfBytes];
    input.read((char*)buf, sizeof(T) * numOfBytes);
    for (int i = 0; i < numOfBytes; ++i) {
        output.push_back(buf[i]);
    }
    delete[] buf;
    return output;
}

std::ifstream file("lidar_Mission.dat", std::ios::binary | std::ios::ate);
std::streampos total_bytes(file.tellg());
file.seekg(12, std::ios::beg); //skip the header
while (file) {
    if (file.good()) {
        //Read the required chunk and store it in a vector
        std::vector<std::int8_t> time(readBytes<std::int8_t>(file, 8));
        std::vector<std::int8_t> lidarx(readBytes<std::int8_t>(file, 4));
        std::vector<std::int8_t> lidary(readBytes<std::int8_t>(file, 4));
        std::vector<std::int8_t> lidarz(readBytes<std::int8_t>(file, 4));
        std::vector<std::int8_t> intensity(readBytes<std::int8_t>(file, 2));
        std::vector<char> classification(readBytes<char>(file, 1));
        std::vector<char> Return_scan(readBytes<char>(file, 1));

        uint8_t timeArr[8] = { time[0], time[1], time[2], time[3], time[4], time[5], time[6], time[7] };
        uint8_t lidarxArr[4] = { lidarx[0], lidarx[1], lidarx[2], lidarx[3] };
        uint8_t lidaryArr[4] = { lidary[0], lidary[1], lidary[2], lidary[3] };
        uint8_t lidarzArr[4] = { lidarz[0], lidarz[1], lidarz[2], lidarz[3] };
        uint8_t intenArr[2] = { intensity[0], intensity[1] };
        uint8_t clssArr[1] = { classification[0] };
        uint8_t Retn_scnArr[1] = { Return_scan[0] };

        //Type punning
        double timestamp = *((double*)&timeArr);
        float x = *((float*)lidarxArr);
        float y = *((float*)lidaryArr);
        float z = *((float*)lidarzArr);
        uint16_t inten = *((uint16_t*)intenArr);
        uint8_t clss = *((uint8_t*)clssArr);
        uint8_t Retn_scn = *((uint8_t*)Retn_scnArr);

        //Write to a text file
        std::ofstream fout;
        fout.open("test2", std::ios::out | std::ios::app);
        fout << std::fixed << std::setprecision(9) << std::left << std::setw(19) << timestamp
             << std::setprecision(10) << std::setw(15) << x
             << std::setprecision(10) << std::setw(15) << y
             << std::setw(16) << z
             << std::setw(10) << inten
             << std::endl;
        fout.close();
    } else {
        throw std::exception();
    }
}
Any ideas on how to make this run faster? Thanks
Do as much outside the loop as possible, especially I/O. Open fout once before entering your loop and close it once after exiting it. Also close the file if there is any fatal error, with an indication of why the operation failed.
You might also test moving your other declarations outside the loop as well, referring to predefined variables within it. If you aren't sure whether something like that might be optimized out by the compiler, it's an easy test to run.
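A minimal sketch of that restructuring, assuming the 24-byte record layout from the question (8-byte timestamp, three 4-byte floats, a 2-byte intensity and two 1-byte fields); it goes slightly beyond the advice above by reading a whole record per call and decoding it with memcpy, but the key change is that fout is opened once outside the loop:

#include <cstdint>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>

void decodeFile()  // hypothetical wrapper, for illustration only
{
    std::ifstream file("lidar_Mission.dat", std::ios::binary);
    std::ofstream fout("test2");                 // opened once, before the loop
    if (!file || !fout) {
        std::cerr << "could not open input or output file\n";
        return;
    }
    file.seekg(12, std::ios::beg);               // skip the header

    char record[24];                             // 8+4+4+4+2+1+1 bytes per record
    while (file.read(record, sizeof record)) {
        double timestamp;
        float x, y, z;
        std::uint16_t inten;
        std::memcpy(&timestamp, record + 0,  sizeof timestamp);
        std::memcpy(&x,         record + 8,  sizeof x);
        std::memcpy(&y,         record + 12, sizeof y);
        std::memcpy(&z,         record + 16, sizeof z);
        std::memcpy(&inten,     record + 20, sizeof inten);

        fout << std::fixed << std::setprecision(9) << std::left << std::setw(19) << timestamp
             << std::setprecision(10) << std::setw(15) << x
             << std::setprecision(10) << std::setw(15) << y
             << std::setw(16) << z
             << std::setw(10) << inten
             << '\n';                            // '\n' instead of std::endl avoids a flush per record
    }
}                                                // fout is flushed and closed here by its destructor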

Writing and reading a binary file to fill a vector - C++

I'm working on a project that involves binary files.
I've started researching binary files, but I'm still confused about how to write one and then fill a vector from the binary file I wrote earlier.
Here's the code for writing:
void binario() {
    ofstream fout("./Binario/Data.AFe", ios::out | ios::binary);
    vector<int> enteros;
    enteros.push_back(1);
    enteros.push_back(2);
    enteros.push_back(3);
    enteros.push_back(4);
    enteros.push_back(5);
    //fout.open()
    //if (fout.is_open()) {
    std::cout << "Entre al if" << '\n';
    //while (!fout.eof()) {
    std::cout << "Entre al while" << '\n';
    std::cout << "Enteros size: " << enteros.size() << '\n';
    int size1 = enteros.size();
    for (int i = 0; i < enteros.size(); i++) {
        std::cout << "for " << i << '\n';
        fout.write((char*)&size1, 4);
        fout.write((char*)&enteros[i], size1 * sizeof(enteros));
        //cout << fout.get(entero[i]) << endl;
    }
    //fout.close();
    //}
    fout.close();
    cout << "copiado con exito" << endl;
    //}
}
Here's the code for reading:
void leerBinario() {
    vector<int> list2;
    ifstream is("./Binario/Data.AFe", ios::binary);
    int size2;
    is.read((char*)&size2, 4);
    list2.resize(size2);
    is.read((char*)&list2[0], size2 * sizeof(list2));
    std::cout << "Size del vector: " << list2.size() << endl;
    for (int i = 0; i < list2.size(); i++) {
        std::cout << i << ". " << list2[i] << '\n';
    }
    std::cout << "Antes de cerrar" << '\n';
    is.close();
}
I don't know if I'm writing to the file correctly; this is just a test so I don't mess up my main file. Instead of writing numbers, I need to save objects that are stored in a vector and load them every time the user runs the program.
Nope, you're a bit confused. You're writing the size in every iteration, and then you're doing something completely undefined when you try to write the value. Since you're using a vector, you can actually do this without the loop:
fout.write(reinterpret_cast<const char*>(&size1), sizeof(size1));
fout.write(reinterpret_cast<const char*>(enteros.data()), size1 * sizeof(int));
And reading in:
is.read(reinterpret_cast<char*>(list2.data()), size2 * sizeof(int));
To be more portable, you might want to use data types whose size won't change (for example, when you switch from 32-bit compilation to 64-bit). In that case, use the types from <cstdint> -- e.g. int32_t for both the size and the value data.
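A self-contained sketch of the corrected round trip under that suggestion; the function names escribir and leer are made up for illustration, and the path is the one from the question:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

void escribir(const std::vector<std::int32_t>& enteros)
{
    std::ofstream fout("./Binario/Data.AFe", std::ios::binary);
    std::int32_t size1 = static_cast<std::int32_t>(enteros.size());
    // Write the element count once, then the whole contiguous buffer.
    fout.write(reinterpret_cast<const char*>(&size1), sizeof(size1));
    fout.write(reinterpret_cast<const char*>(enteros.data()),
               size1 * sizeof(std::int32_t));
}

std::vector<std::int32_t> leer()
{
    std::ifstream is("./Binario/Data.AFe", std::ios::binary);
    std::int32_t size2 = 0;
    is.read(reinterpret_cast<char*>(&size2), sizeof(size2));
    std::vector<std::int32_t> list2(size2);
    is.read(reinterpret_cast<char*>(list2.data()),
            size2 * sizeof(std::int32_t));
    return list2;
}

int main()
{
    escribir({1, 2, 3, 4, 5});
    for (std::int32_t v : leer())
        std::cout << v << '\n';
}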

How to speed up counting the occurrences of a word in large files?

I need to count the occurrences of the string "<page>" in a 104 GB file, to get the number of articles in a given Wikipedia dump. First, I tried this:
grep -F '<page>' enwiki-20141208-pages-meta-current.xml | uniq -c
However, grep crashes after a while, so I wrote the following program. It only processes about 20 MB/s of the input file on my machine, which is roughly 5% of what my HDD can deliver. How can I speed up this code?
#include <iostream>
#include <fstream>
#include <string>

int main()
{
    // Open up file
    std::ifstream in("enwiki-20141208-pages-meta-current.xml");
    if (!in.is_open()) {
        std::cout << "Could not open file." << std::endl;
        return 0;
    }

    // Statistics counters
    size_t chars = 0, pages = 0;

    // Token to look for
    const std::string token = "<page>";
    size_t token_length = token.length();

    // Read one char at a time
    size_t matching = 0;
    while (in.good()) {
        // Read one char at a time
        char current;
        in.read(&current, 1);
        if (in.eof())
            break;
        chars++;

        // Continue matching the token
        if (current == token[matching]) {
            matching++;

            // Reached full token
            if (matching == token_length) {
                pages++;
                matching = 0;

                // Print progress
                if (pages % 1000 == 0) {
                    std::cout << pages << " pages, ";
                    std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
                }
            }
        }
        // Start over again
        else {
            matching = 0;
        }
    }

    // Print result
    std::cout << "Overall pages: " << pages << std::endl;

    // Cleanup
    in.close();
    return 0;
}
Assuming there are no insanely large lines in the file, using something like
for (std::string line; std::getline(in, line); ) {
    // find the number of "<page>" strings in line
}
is bound to be a lot faster! Reading the file one character at a time is about the worst thing you can possibly do; it is really hard to get any slower. For each character, the stream will do something like this:
1. Check if there is a tie()ed stream which needs flushing (there isn't, i.e., that's pointless).
2. Check if the stream is in good shape (except when having reached the end it is, but this check can't be omitted entirely).
3. Call xsgetn() on the stream's stream buffer. This function first checks if there is another character in the buffer (that's similar to the eof check, but different; in any case, doing the eof check only once the buffer is empty removes a lot of the eof checks).
4. Transfer the character to the read buffer.
5. Have the stream check if it reached all (1) characters and set stream flags as needed.
There is a lot of waste in there!
I can't really imagine why grep would fail, except that some line blows massively past the expected maximum line length. Although the use of std::getline() and std::string() is likely to have a much bigger upper bound, it is still not efficient to process huge lines. If the file may contain massive lines, it may be more reasonable to use something along these lines:
for (std::istreambuf_iterator<char> it(in), end;
     (it = std::find(it, end, '<')) != end; ) {
    // match "<page>" at the start of the sequence [it, end)
}
For a bad implementation of streams, that's still doing too much. A good implementation will do the calls to std::find(...) very efficiently and will probably check multiple characters at once, adding a check and loop only for something like every 16th iteration. I'd expect the above code to turn your CPU-bound implementation into an I/O-bound one. A bad implementation may still be CPU-bound, but it should still be a lot better.
In any case, remember to enable optimizations!
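With g++ or clang++ that typically just means building with an optimization flag, for example (the file and binary names here are placeholders):

g++ -O2 -o count_pages count_pages.cpp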
I'm using this file to test with: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current1.xml-p000000010p000010000.bz2
It takes roughly 2.4 seconds versus 11.5 using your code. The total character count is slightly different due to not counting newlines, but I assume that's acceptable since it's only used to display progress.
void parseByLine()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if (!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }

    size_t chars = 0;
    size_t pages = 0;
    const std::string token = "<page>";

    std::string line;
    while (std::getline(in, line))
    {
        chars += line.size();
        size_t pos = 0;
        for (;;)
        {
            pos = line.find(token, pos);
            if (pos == std::string::npos)
            {
                break;
            }
            pos += token.size();
            if (++pages % 1000 == 0)
            {
                std::cout << pages << " pages, ";
                std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
            }
        }
    }

    // Print result
    std::cout << "Overall pages: " << pages << std::endl;
}
Here's an example that adds each line to a buffer and then processes the buffer when it reaches a threshold. It takes 2 seconds versus ~2.4 for the first version. I played with several different buffer-size thresholds, and also with processing after a fixed number of lines (16, 32, 64, 4096), and it all seems about the same as long as there is some batching going on. Thanks to Dietmar for the idea.
int processBuffer(const std::string& buffer)
{
    static const std::string token = "<page>";

    int pages = 0;
    size_t pos = 0;
    for (;;)
    {
        pos = buffer.find(token, pos);
        if (pos == std::string::npos)
        {
            break;
        }
        pos += token.size();
        ++pages;
    }
    return pages;
}

void parseByMB()
{
    // Open up file
    std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
    if (!in)
    {
        std::cout << "Could not open file." << std::endl;
        return;
    }

    const size_t BUFFER_THRESHOLD = 16 * 1024 * 1024;
    std::string buffer;
    buffer.reserve(BUFFER_THRESHOLD);

    size_t pages = 0;
    size_t chars = 0;
    size_t progressCount = 0;

    std::string line;
    while (std::getline(in, line))
    {
        buffer += line;
        if (buffer.size() > BUFFER_THRESHOLD)
        {
            pages += processBuffer(buffer);
            chars += buffer.size();
            buffer.clear();
        }
        if ((pages / 1000) > progressCount)
        {
            ++progressCount;
            std::cout << pages << " pages, ";
            std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
        }
    }

    if (!buffer.empty())
    {
        pages += processBuffer(buffer);
        chars += buffer.size();
        std::cout << pages << " pages, ";
        std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
    }
}

How to get a value from a char[]

First of all, thanks to everyone who has helped me.
I want to get the VmRSS value from /proc/pid/status; below is the code:
int main()
{
    const int PROCESS_MEMORY_FILE_LEN = 500;
    FILE* file;
    std::string path("/proc/4378/status");
    //path += boost::lexical_cast<std::string>( pid );
    //path += "/status";
    if (!(file = fopen(path.c_str(), "r")))
    {
        std::cout << "open " << path << " is failed " << std::endl;
        return float(-1);
    }
    char fileBuffer[PROCESS_MEMORY_FILE_LEN];
    memset(fileBuffer, 0, PROCESS_MEMORY_FILE_LEN);
    if (fread(fileBuffer, 1, PROCESS_MEMORY_FILE_LEN - 1, file) != (PROCESS_MEMORY_FILE_LEN - 1))
    {
        std::cout << "fread /proc/pid/status is failed" << std::endl;
        return float(-1);
    }
    fclose(file);
    unsigned long long memoryUsage = 0;
    int a = sscanf(fileBuffer, "VmRSS: %llu", &memoryUsage);
    std::cout << a << std::endl;
    std::cout << memoryUsage << std::endl;
}
Thanks again.
Based on your comments: to find VmRSS within your char array, use the C++ algorithms in combination with the C++ string library. That will give you the position of VmRSS, and all you'll have to do is retrieve the wanted value; with a little knowledge of the structure of these entries, this should be an easy task.
In addition, it might be better to use an fstream to read the file directly into a string.
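A minimal sketch of that approach, assuming the usual "VmRSS:    <value> kB" layout of /proc/<pid>/status; the helper name vmRssKb is made up for illustration:

#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

long vmRssKb(const std::string& path = "/proc/self/status")
{
    // Read the whole status file into a string.
    std::ifstream in(path);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

    // Locate the VmRSS entry and parse the number that follows it.
    std::string::size_type pos = contents.find("VmRSS:");
    if (pos == std::string::npos)
        return -1;                               // entry not present
    std::istringstream iss(contents.substr(pos + 6));
    long kb = -1;
    iss >> kb;                                   // the value is reported in kB
    return kb;
}

int main()
{
    std::cout << "VmRSS: " << vmRssKb() << " kB\n";
}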

Can someone provide an example of seeking, reading, and writing a >4GB file using boost iostreams

I have read that boost iostreams supposedly supports 64-bit access to large files in a semi-portable way. Their FAQ mentions 64-bit offset functions, but there are no examples of how to use them. Has anyone used this library for handling large files? A simple example of opening two files, seeking to their middles, and copying one to the other would be very helpful.
Thanks.
Short answer
Just include
#include <boost/iostreams/seek.hpp>
and use the seek function as in
boost::iostreams::seek(device, offset, whence);
where
device is a file, stream, streambuf or any object convertible to seekable;
offset is a 64-bit offset of type stream_offset;
whence is BOOST_IOS::beg, BOOST_IOS::cur or BOOST_IOS::end.
The return value of seek is of type std::streampos, and it can be converted to a stream_offset using the position_to_offset function.
Example
Here is a long, tedious and repetitive example, which shows how to open two files, seek to offsets >4GB, and copy data between them.
WARNING: This code will create very large files (several GB). Try this example on an OS/file system which supports sparse files. Linux is ok; I did not test it on other systems, such as Windows.
/*
 * WARNING: This creates very large files (several GB)
 * unless your OS/file system supports sparse files.
 */
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/positioning.hpp>
#include <boost/iostreams/seek.hpp>
#include <cstring>
#include <iostream>

using boost::iostreams::file_sink;
using boost::iostreams::file_source;
using boost::iostreams::position_to_offset;
using boost::iostreams::seek;
using boost::iostreams::stream_offset;

static const stream_offset GB = 1000*1000*1000;

void setup()
{
    file_sink out("file1", BOOST_IOS::binary);
    const char *greetings[] = {"Hello", "Boost", "World"};
    for (int i = 0; i < 3; i++) {
        out.write(greetings[i], 5);
        seek(out, 7*GB, BOOST_IOS::cur);
    }
}

void copy_file1_to_file2()
{
    file_source in("file1", BOOST_IOS::binary);
    file_sink out("file2", BOOST_IOS::binary);
    stream_offset off;

    off = position_to_offset(seek(in, -5, BOOST_IOS::end));
    std::cout << "in: seek " << off << std::endl;

    for (int i = 0; i < 3; i++) {
        char buf[6];
        std::memset(buf, '\0', sizeof buf);

        std::streamsize nr = in.read(buf, 5);
        std::streamsize nw = out.write(buf, 5);
        std::cout << "read: \"" << buf << "\"(" << nr << "), "
                  << "written: (" << nw << ")" << std::endl;

        off = position_to_offset(seek(in, -(7*GB + 10), BOOST_IOS::cur));
        std::cout << "in: seek " << off << std::endl;

        off = position_to_offset(seek(out, 7*GB, BOOST_IOS::cur));
        std::cout << "out: seek " << off << std::endl;
    }
}

int main()
{
    setup();
    copy_file1_to_file2();
}
int main()
{
setup();
copy_file1_to_file2();
}