I have large files containing a small number of large datasets. Each dataset contains a name and its size in bytes, which allows skipping it and jumping to the next dataset.
I want to build an index of dataset names very quickly. An example file is about 21MB and contains 88 datasets. Reading the 88 names by using a std::ifstream and seekg() to skip between datasets takes about 1300ms, which I would like to reduce.
So in fact, I'm reading 88 chunks of about 30 bytes, at given positions in a 21MB file, and it takes 1300ms.
Is there a way to improve this, or is it an OS and filesystem limitation? I'm running the test under Windows 7 64bit.
I know that having a complete index at the beginning of the file would be better, but the file format does not have this, and we can't change it.
You could use a memory-mapped file interface (I recommend Boost's implementation).
This maps the file into your application's virtual address space, giving quicker lookups without going back to the disk for every seek.
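If it helps, here is a minimal sketch of that approach using boost::iostreams::mapped_file_source. The header layout assumed here (a 4-byte name length, the name bytes, then an 8-byte payload size) is only a guess, so adapt the walking code to your real format:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Map the whole file and walk the dataset headers directly in memory.
std::vector<std::string> indexNames(const std::string& path)
{
    boost::iostreams::mapped_file_source file(path);
    const char* p   = file.data();
    const char* end = p + file.size();

    std::vector<std::string> names;
    while (p < end) {
        std::uint32_t nameLen;                              // assumed header field
        std::memcpy(&nameLen, p, sizeof nameLen);           p += sizeof nameLen;
        names.emplace_back(p, nameLen);                     p += nameLen;
        std::uint64_t payloadSize;                          // assumed header field
        std::memcpy(&payloadSize, p, sizeof payloadSize);   p += sizeof payloadSize;
        p += payloadSize;                                   // skip to the next dataset
    }
    return names;
}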
You could scan the file and make your own header with the key and the index in a separate file. Depending on your use case you can do it once at program start and every time the file changes.
Before accessing the big data, a lookup in the smaller file gives you the needed index.
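As an illustration, here is a minimal sketch of the lookup side, assuming the side file simply stores one "name offset" pair per line (that layout is an assumption, not something from your format):

#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// The side file is tiny, so loading it into a map at program start is cheap.
std::map<std::string, std::uint64_t> loadIndex(const std::string& idxFile)
{
    std::map<std::string, std::uint64_t> index;
    std::ifstream in(idxFile);
    std::string name;
    std::uint64_t offset;
    while (in >> name >> offset)
        index[name] = offset;   // later: seekg(index["someName"]) in the big file
    return index;
}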
You may be able to use a buffer-queuing process with multithreading. You could create a custom struct that stores a batch of data.
You said:
Each dataset contains a name and its size in bytes, which allows skipping it and jumping to the next dataset.
So, since opening and closing the file over and over again is slow, you could read the file all in one go, store it into a buffer object, and then parse it or store it into batches. How easy the file is to parse also depends on whether you are reading in text or binary mode. I'll demonstrate the latter, populating multiple batches while reading in a buffer-sized amount of data from the file.
Pseudo Code
struct Batch {
std::string name; // Name of Dataset
unsigned size; // Size of Dataset
unsigned indexOffset; // Index to next read location
bool empty = true; // Flag to tell if this batch is full or empty
std::vector<DataType> dataset; // Container of Data
};
std::vector<Batch> finishedBatches;
// This doesn't matter on the size of the data set; this is just a buffer size on how much memory to digest in reading the file
const unsigned bufferSize = 1 << 20; // set to your preference, e.g. 1MB - 4MB
void loadDataFromFile( const std::string& filename, unsigned bufferSize, std::vector<Batch>& batches ) {
// Set ifstream's buffer size
// OpenFile For Reading and read in and upto buffer size
// Spawn different thread to populate the Batches and while that batch is loading
// in data read in that much buffer data again. You will need to have a couple local
// stack batches to work with. So if a batch is complete and you reached the next index
// location from the file you can fill another batch.
// When a single batch is complete push it into the vector to store that batch.
// Change its flag and clear its vector and you can then use that empty batch again.
// Continue this until you reached end of file.
}
This would be a two-threaded system: the main thread opens, reads, and seeks in the file, while a worker thread fills the batches, pushes completed batches into the container, and swaps to the next empty batch.
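Below is a rough, simplified sketch of that two-thread scheme using a chunk queue. The 1 MB chunk size and the placeholder "parse into Batch" step are assumptions you would replace with your real format handling:

#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Main thread reads raw chunks, worker thread consumes them.
void readFileInChunksTwoThreads(const std::string& filename)
{
    const std::size_t chunkSize = 1 << 20;               // 1 MB, tune to taste
    std::queue<std::vector<char>> chunks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread worker([&] {
        std::vector<std::vector<char>> rawBatches;        // stand-in for the Batch filling
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !chunks.empty() || done; });
            if (chunks.empty() && done) break;
            std::vector<char> chunk = std::move(chunks.front());
            chunks.pop();
            lock.unlock();
            rawBatches.push_back(std::move(chunk));       // parse chunk into a Batch here
        }
    });

    std::ifstream in(filename, std::ios::binary);
    while (in) {
        std::vector<char> buf(chunkSize);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        buf.resize(static_cast<std::size_t>(in.gcount()));
        if (buf.empty()) break;
        {
            std::lock_guard<std::mutex> lock(m);
            chunks.push(std::move(buf));
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    worker.join();
}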
Related
I'm using zlib for C++.
Quote from
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding gzwrite function:
The gzwrite() function shall write data to the compressed file referenced by file, which shall have been opened in a write mode (see gzopen() and gzdopen()). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.
I interpret this as the return value will NOT tell me how much larger the file became when writing. Only how much data was compressed into the file.
The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?
A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).
Another workaround, which would give more of an estimate (not really what I want), would be to write until the uncompressed size reaches the limit, close the file, read the file size from the file system, and calculate the compression ratio so far. Then I can use this compression ratio to calculate a new limit for the uncompressed file size at which the compression should get me down to the limit for the compressed file size. If I repeat this, the estimate would improve, but again, this is not what I'm looking for.
Are there better options?
The preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when I close the file.
zlib provides the function gzoffset(), which does exactly what you're asking.
If for some reason you are stuck with a version of zlib from before gzoffset() was added (more than about eight years old), then this is easy to do with gzdopen(). You open the output file with fopen() or open(), and provide the file descriptor (using fileno() and dup() if you used fopen()), and then provide that descriptor to gzdopen(). Then you can use ftell() or lseek() at any time to see how much has been written. Be careful not to double-close the descriptor. See the comments for gzdopen().
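For illustration, here is a minimal sketch of using gzoffset() to enforce a size limit. One caveat worth hedging: gzoffset() reports what has already been written to the underlying file, so compressed data still sitting in zlib's internal buffer is not counted until it is flushed; treat the check as approximate, or call gzflush() first at some cost in compression ratio:

#include <zlib.h>

// Append one buffer and report whether the compressed file is still under 'limit'.
bool appendUntilLimit(gzFile gz, const char* buf, unsigned len, z_off_t limit)
{
    if (gzwrite(gz, buf, len) != static_cast<int>(len))
        return false;                 // write error
    return gzoffset(gz) < limit;      // true: still room for more data
}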
You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe. After that, you read the data from the other end of the pipe, count it and write it to the actual file.
To set this up you need to first open the file to write to via a plain open. Then create a pipe via pipe2 and initialize zlib by passing the pipe's write end to gzdopen:
int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[1], "w"); // p[1] is the pipe's write end
You can now write the data to the pipe first and then splice it from the pipe's read end to the out file:
gzwrite(zFile, buf, 1024); // or any other length
ssize_t bytesWritten = 0;
do {
    bytesWritten = splice(p[0], NULL, out, NULL, 1024, SPLICE_F_NONBLOCK | SPLICE_F_MORE);
} while (bytesWritten == 1024);
As you can see, you now have bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need to. (Or splice it in one go: write everything to the zFile and then splice once, passing the amount of data you are allowed to store as the fifth parameter. If you want to avoid compressing unnecessary data, simply do it in chunks as shown above.)
A note on splice: splice() is Linux-specific and is basically just a very efficient copy. You can always replace it with a simple read-and-write combo, i.e. read data from p[0] into a buffer and then write that buffer to out; splice is just faster and less code.
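For completeness, here is a sketch of that portable read-and-write fallback as a small helper (the descriptor roles mirror the example above: readFd is the pipe's read end, outFd the target file):

#include <cstddef>
#include <unistd.h>

// Drain the pipe's read end into outFd; returns the number of compressed bytes copied.
static std::size_t drainPipe(int readFd, int outFd)
{
    char buf[1024];
    ssize_t n;
    std::size_t total = 0;
    while ((n = read(readFd, buf, sizeof buf)) > 0) {
        if (write(outFd, buf, static_cast<std::size_t>(n)) != n)
            break;                                   // write error: report what we copied
        total += static_cast<std::size_t>(n);
    }
    // with O_NONBLOCK, read() returns -1 (errno == EAGAIN) once the pipe is empty
    return total;
}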
I have a binary file. I am reading a block of data from that file into an array of structs using fread. My struct looks like this:
struct Num {
uint64_t key;
uint64_t val;
};
My main goal is to write the array into a different text file with space separated key and value pairs in each line as shown below.
Key1 Val1
Key2 Val2
Key3 Val3
I have written a simple function to do this.
Num *buffer = new Num[buffer_size];
// Read a block of data from the binary file into the buffer array.
ofstream out_file(OUT_FILE, ios::out);
for(size_t i=0; i<buffer_size; i++)
out_file << buffer[i].key << ' ' << buffer[i].val << '\n';
The code works. But it's slow. One more approach would be to create the entire string first and write to file only once at the end.
But I want to know if there are any best ways to do this. I found some info about ostream_iterator. But I am not sure how it works.
The most efficient method to write structures to a file is to write as many as you can in the fewest transactions.
Usually this means using an array and writing entire array with one transaction.
The file is a stream device and is most efficient when data is continuously flowing in the stream. This can range from something as simple as writing the array in one call to more complicated schemes using threads. You will save more time by performing block or burst I/O than by worrying about which function call to use.
Also, in my own programs, I have observed that placing formatted text into a buffer (array) and then block writing the buffer is faster than using a function to write the formatted text to the file. There is a chance that the data stream may pause during the formatting. With writing formatted data from a buffer, the flow of data through the stream is continuous.
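As a sketch of that buffer-then-block-write idea, here is one possible shape, reusing the Num struct from the question (the 1 MB flush threshold is an arbitrary choice):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

struct Num { std::uint64_t key; std::uint64_t val; };   // mirrors the struct above

void dumpAsText(const Num* buffer, std::size_t count, const char* path)
{
    std::ofstream out(path, std::ios::out | std::ios::binary);
    std::string chunk;
    chunk.reserve(1 << 20);                             // accumulate ~1 MB before writing

    char line[64];
    for (std::size_t i = 0; i < count; ++i) {
        int n = std::snprintf(line, sizeof line, "%llu %llu\n",
                              (unsigned long long)buffer[i].key,
                              (unsigned long long)buffer[i].val);
        chunk.append(line, static_cast<std::size_t>(n));
        if (chunk.size() >= (1 << 20)) {                // block write, one transaction
            out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            chunk.clear();
        }
    }
    out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));  // final partial block
}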
There are other factors involved in writing to a file, such as allocating space on the media, other tasks running on your system and any sharing of the file media.
By using the above techniques, I was able to write GBs of data in minutes instead of the previous duration of hours.
Say I have a file of arbitrary length S and I need to remove the first N bytes of it (where N is much less than S). What is the most efficient way to do this on Windows?
I'm looking for a WinAPI to do this, if one is available.
Otherwise, what are my options: load it into RAM and then rewrite the existing file with the remainder of the data (in which case I cannot be sure the PC has enough RAM), or write the remainder of the file data into a new file, erase the old one, and rename the new file to the old name (in which case, what do I do if any of these steps fail, and what about the fragmentation this method causes on disk)?
There is no general way to do this built into the OS. There are theoretical ways to edit the file system's data structures underneath the operating system on sector or cluster boundaries, but these are different for each file system and would violate the operating system's security model.
To accomplish this you can read in the data starting at byte N in chunks of, say, 4k, write them back out starting at byte zero, and then use the file truncate call (SetEndOfFile) to set the new, smaller end of file when you are finished copying the data.
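Here is a minimal Win32 sketch of that copy-down-and-truncate approach (error handling kept to a minimum, 4k chunk size assumed):

#include <windows.h>
#include <vector>

// Shift everything after the first skipBytes bytes to the start of the file,
// then truncate with SetEndOfFile().
bool removeLeadingBytes(const wchar_t* path, LONGLONG skipBytes)
{
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    std::vector<char> buf(4096);
    LONGLONG readPos = skipBytes, writePos = 0;
    DWORD got = 0, put = 0;
    for (;;) {
        LARGE_INTEGER pos; pos.QuadPart = readPos;
        SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);
        if (!ReadFile(h, buf.data(), (DWORD)buf.size(), &got, nullptr) || got == 0)
            break;                                   // end of file (or read error)
        pos.QuadPart = writePos;
        SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);
        WriteFile(h, buf.data(), got, &put, nullptr);
        readPos  += got;
        writePos += put;
    }
    LARGE_INTEGER end; end.QuadPart = writePos;
    SetFilePointerEx(h, end, nullptr, FILE_BEGIN);
    BOOL ok = SetEndOfFile(h);                       // truncate to the new length
    CloseHandle(h);
    return ok != FALSE;
}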
The most efficient method to delete data at the beginning of the file would be to modify the directory entry on the hard drive that tells where the data starts.
Note: This may not be possible if the data must start on a sector or cluster boundary. If that is the case, you may have to write the remaining data on the sector(s) to new sector(s), essentially moving the data.
The preferred method is to write a new file that starts with data copied after the deleted area.
Moving files on the same drive is faster than copying them since the data isn't duplicated; only the file pointer, (symbolic) links and file allocation/index table are updated.
In principle, the move command in CMD could be extended to let the user set file start and end markers, effecting file truncation without copying file data and saving valuable time and RAM/disk overhead.
An alternative would be to send the commands directly to the device/disk driver, bypassing the operating system, as long as the OS still knows where to find the file and its properties, e.g. file size, name and the sectors occupied on disk.
I'm working on a project and I'm wondering which way is the most efficient to read a huge amount of data from a file (I'm speaking of files of 100 lines up to roughly 3 billion lines, possibly more). Once read, the data will be stored in a structured data set (vector<entry>, where "entry" defines a structured line).
A structured line of this file may look like :
string int int int string string
Each line ends with the appropriate platform EOL and is TAB-delimited.
What I wish to accomplish is :
Read file into memory (string) or vector<char>
Read raw data from my buffer and format it into my data set.
I need to consider memory footprint and have a fast parsing rate.
I'm already avoiding the use of stringstream, as it seems too slow.
I'm also avoiding multiple I/O calls to my file by using:
// open the stream
std::ifstream is(filename);
// determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size = is.tellg();
is.seekg(0, std::ios_base::beg);
// "out" can be a std::string or vector<char>
out.reserve(size / sizeof (char));
out.resize(size / sizeof (char), 0);
// load the data
is.read((char *) &out[0], size);
// close the file
is.close();
I've thought of taking this huge std::string and looping over it line by line, extracting the line information (string and integer parts) into a row of my data set. Is there a better way of doing this?
EDIT: This application may run on a 32-bit or 64-bit computer, or on a supercomputer for bigger files.
Any suggestions are very welcome.
Thank you
Some random thoughts:
Use vector::resize() at the beginning (you did that)
Read large blocks of file data at a time, at least 4k, better still 256k. Read them into a memory buffer, then parse that buffer into your vector (see the sketch after this list).
Don't read the whole file at once, this might needlessly lead to swapping.
sizeof(char) is always 1 :)
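A sketch of that block-reading point, assuming 256k blocks and leaving the actual field parsing as a placeholder to fill in for your TAB-delimited format:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct entry {                       // placeholder field names, types from the question
    std::string a;
    int i1, i2, i3;
    std::string b, c;
};

void readInBlocks(const std::string& filename, std::vector<entry>& out)
{
    std::ifstream in(filename, std::ios::binary);
    const std::size_t blockSize = 256 * 1024;
    std::vector<char> block(blockSize);
    std::string carry;                               // partial line left over from the previous block

    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        carry.append(block.data(), got);

        std::size_t start = 0, eol;
        while ((eol = carry.find('\n', start)) != std::string::npos) {
            // split carry.substr(start, eol - start) on tabs into an entry
            // and push it into 'out' here
            start = eol + 1;
        }
        carry.erase(0, start);                       // keep the unfinished tail line
    }
}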
While I cannot speak for supercomputers, with 3 billion lines you will get nowhere keeping everything in memory on a desktop machine.
I think you should first try to figure out all the operations on that data. You should try to design all algorithms to operate sequentially; if you need random access you will be swapping all the time. This algorithm design will have a big impact on your data model.
So do not start with reading all the data just because that is the easy part; design the whole system with a clear view of what data is in memory during the whole processing.
Update
When you do all processing in a single run over the stream and separate the data processing into stages (read - preprocess - ... - write), you can utilize multithreading effectively.
Finally
Whatever you want to do in a loop over the data, try to keep the number of loops to a minimum. Averaging, for example, you can certainly do in the read loop.
Immediately make up a test file of the size you expect to be the worst case, and time two different approaches:
time
loop
    read line from disk
time
loop
    process line (counting words per line)
time
loop
    write data (word count) from line to disk
time

versus

time
loop
    read line from disk
    process line (counting words per line)
    write data (word count) from line to disk
time
If you already have the algorithms, use yours; otherwise make one up (like counting words per line). If the write stage does not apply to your problem, skip it. This test takes less than an hour to write but can save you a lot.
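If useful, a tiny helper for that timing experiment (the stage functions in the usage comment are hypothetical placeholders):

#include <chrono>
#include <iostream>

// Wrap each stage (or the whole single-pass loop) and print the elapsed time.
template <typename F>
void timeIt(const char* label, F&& stage)
{
    auto t0 = std::chrono::steady_clock::now();
    stage();
    auto t1 = std::chrono::steady_clock::now();
    std::cout << label << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}

// usage:
//   timeIt("read",    [&] { readAllLines(); });
//   timeIt("process", [&] { countWordsPerLine(); });
//   timeIt("write",   [&] { writeCounts(); });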
I'm making a simple DLL packet sniffer in C++ that hooks into an application and writes the received packets into an INI file. Unfortunately, after 20-30 minutes it crashes the main application.
When a packet is received, receivedPacket() is called. After 20+ minutes the WriteCount value is around 150,000-200,000, and I start getting a C++ runtime error/crash. The GetLastError() code I get is 0x8, which is ERROR_NOT_ENOUGH_MEMORY, and WritePrivateProfileStringA() returns 0.
void writeToINI(LPCSTR iSec,LPCSTR iKey,int iVal){
sprintf(inival, _T("%d"), iVal);
WritePrivateProfileStringA(iSec,iKey,inival,iniloc);
//sprintf(strc, _T("%d \n"), WriteCount);
//WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), strc, strlen(strc), 0, 0);
WriteCount++;
}
void receivedPacket(char *packet,WORD size){
switch ( packet[2] )
{
case 0x30:
// Size : 0x5F
ID = *(signed char*)&packet[0x10];
X = *(signed short*)&packet[0x20];
Y = *(signed short*)&packet[0x22];
Z = *(signed short*)&packet[0x24];
sprintf(inisec, _T("PACKET_%d"), (ID+1));
writeToINI(inisec,"id",ID);
writeToINI(inisec,"x",X);
writeToINI(inisec,"y",Y);
writeToINI(inisec,"z",Z);
}
[.....OTHER CASES.....]
}
Thanks :)
WritePrivateProfileString() and GetPrivateProfileString() are very slow (because the INI file is parsed on each call). Instead you can:
use one of the existing parsing libraries, though I am not sure about their memory efficiency or support for sequential writes.
write your own sequential INI writer:
read file (or only part, by part, if it is too big)
find section and key (if not found, create new section at end of file, or find insertion position, if you want sorted sections), save file position of key and next key
change value
save the result: the beginning of the original file up to the position of the key, then the changed key, then everything from the position of the next key in the original file to the end of the file. (If a new section is created at the end, you can simply append it to the original file. If packets rewrite the same ID often, you can pad each key with whitespace, large enough to hold any value of the desired type, for example change X=1---\n to X=100-\n where - stands for whitespace; with a constant key size you can update just that part of the file.)
database, for example MySQL
write a binary file (fastest solution, sketched below) and make a program to read the values or to convert them to text
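As a sketch of that binary-file option: append one fixed-size record per packet and convert to text later with a small reader tool. The record layout mirrors the fields from the question; the caller keeps the FILE* open in append mode instead of reopening it for every packet:

#include <cstdint>
#include <cstdio>

struct PacketRecord {
    std::int8_t  id;
    std::int16_t x, y, z;    // note: the struct has padding, which is fine for a local tool
};

void appendRecord(std::FILE* f, const PacketRecord& rec)
{
    std::fwrite(&rec, sizeof rec, 1, f);   // f opened once with fopen(path, "ab")
}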
A little note: I used GetPrivateProfileString() a few years ago to read a settings file (about 1KB in size). Reading from HDD: 50ms, reading from a USB flash disk: 1000ms! After changing to (1. read the file into memory, 2. run my own parser) it ran in 1ms on both HDD and USB.
Thanks for the replies guys, but it looks like the problem didn't come from WritePrivateProfileStringA().
I just needed to allocate extra size in malloc() for the hook.
:)