It appears that fseek, at least in my implementation, now supports large files naturally, without fseek64, lseek, or some strange compiler macro.
When did this happen?
#include <cstdio>
#include <cstdlib>
void writeF(const char *fname, size_t nItems) {
  FILE *fp = NULL;
  if (NULL == (fp = fopen(fname, "w"))) {
    fprintf(stderr, "\t-> problems opening file:%s\n", fname);
    exit(0);
  }
  for (size_t i = 0; i < nItems; i++)
    fwrite(&i, sizeof(size_t), 1, fp);
  fclose(fp);
}
void getIt(const char *fname, size_t offset, int whence, int nItems) {
  size_t ary[nItems];
  FILE *fp = fopen(fname, "r");
  fseek(fp, offset * sizeof(size_t), whence);
  fread(ary, sizeof(size_t), nItems, fp);
  for (int i = 0; i < nItems; i++)
    fprintf(stderr, "%lu\n", ary[i]);
  fclose(fp);
}
int main() {
  const char *fname = "temp.bin";
  writeF(fname, 1000000000);              // write the file
  getIt(fname, 999999990, SEEK_SET, 10);  // get last 10, seeking from the start
  getIt(fname, -10, SEEK_END, 10);        // get last 10, seeking from the end
  return 0;
}
The code above writes a big file containing the values 0 through 10^9-1 in binary size_t format, and then prints the last 10 entries, first seeking from the beginning of the file and then seeking from the end of the file.
Linux x86-64 has had large file support (LFS) from pretty much day one and doesn't require any special macros etc. to enable it: both traditional fseek() and the LFS fseeko64() already use a 64-bit off_t.
Linux i386 (32-bit) typically defaults to a 32-bit off_t, as anything else would break a huge number of applications, but you can check what is in effect in your environment by looking at the _FILE_OFFSET_BITS macro (defining it to 64 selects the 64-bit off_t).
See http://www.suse.de/~aj/linux_lfs.html for full details on Linux large file support.
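If you need the 64-bit off_t on a 32-bit build, the usual approach is to compile with -D_FILE_OFFSET_BITS=64 (before any system header is included). A minimal sketch, assuming a POSIX system, to confirm which width is in effect in your build:

#include <sys/types.h>
#include <cstdio>

// Fails to compile if large file support is not active in this translation unit.
static_assert(sizeof(off_t) == 8, "off_t is not 64-bit; compile with -D_FILE_OFFSET_BITS=64");

int main() {
  std::printf("sizeof(off_t) = %zu, sizeof(long) = %zu\n", sizeof(off_t), sizeof(long));
  return 0;
}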
The signature is
int fseek ( FILE * stream, long int offset, int origin );
so the range depends on the size of long.
On some systems long is 32-bit and you have a problem with large files; on other systems it is 64-bit.
999999990 is a normal int and fits perfectly into 32 bits (though note that the byte offset actually passed to fseek, 999999990 * sizeof(size_t), already does not). I don't believe that you'd get away with this though:
getIt(fname,99999999990LL,SEEK_SET,10);
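If you do need offsets that exceed a 32-bit long, a common workaround (a sketch, not tied to any particular libc version) is to wrap the platform's 64-bit seek function: fseeko() on POSIX, where off_t is 64 bits given _FILE_OFFSET_BITS=64, and _fseeki64() on Windows.

#include <cstdio>

// Sketch of a 64-bit seek wrapper; assumes _FILE_OFFSET_BITS=64 is defined
// on 32-bit POSIX builds so that off_t is 64 bits wide.
int seek64(std::FILE *fp, long long offset, int whence) {
#ifdef _WIN32
  return _fseeki64(fp, offset, whence);                    // MSVC 64-bit seek
#else
  return fseeko(fp, static_cast<off_t>(offset), whence);   // POSIX 64-bit seek
#endif
}

With that, something like seek64(fp, 99999999990LL * (long long)sizeof(size_t), SEEK_SET) stays well-defined even where long is 32-bit.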
I have found the following strange behaviour on Visual Studio 2015 when reading a file for a large array of bytes. The file that I load is about 80 MB and is large enough.
#include <cstdio>
#include <vector>
int main() {
  std::FILE* file;
  errno_t error = _wfopen_s(&file, L"/User/account/Desktop/file.data", L"r");
  const std::size_t n = 16384;
  std::vector<unsigned char> v(n);
  const std::size_t nb_bytes_read = std::fread(v.data(), sizeof(unsigned char), n, file);
  // At this point error = 0 and nb_bytes_read = 3473
}
So I ask std::fread for 16384 bytes and it just gives me 3473 even though the file is large enough. Should it be considered as a bug? The standard does not seem to say so but this behavior is very weird to me.
Try opening the file in binary mode, "rb", which is probably what you want anyway. Otherwise, on the Windows platform, the byte 0x1A (Ctrl-Z) terminates input in text mode. Also, line breaks like \r\n are converted to \n, which may also result in fewer bytes read than specified.
According to this reference, fread() will only return fewer than the requested number of bytes if EOF was reached or an error occurred. You can check for those with feof() and ferror().
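A minimal sketch combining both suggestions, assuming the same MSVC environment as the question (the path is the one from the question): open in binary mode and, when fread() comes up short, use feof()/ferror() to tell a short file apart from a real error.

#include <cstdio>
#include <vector>

int main() {
  std::FILE* file = nullptr;
  // "rb" avoids the text-mode translation described above.
  errno_t error = _wfopen_s(&file, L"/User/account/Desktop/file.data", L"rb");
  if (error != 0 || file == nullptr)
    return 1;

  const std::size_t n = 16384;
  std::vector<unsigned char> v(n);
  const std::size_t nb_bytes_read = std::fread(v.data(), sizeof(unsigned char), n, file);
  if (nb_bytes_read < n) {
    if (std::feof(file))   std::fprintf(stderr, "hit end of file\n");
    if (std::ferror(file)) std::fprintf(stderr, "read error\n");
  }
  std::fclose(file);
}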
I am porting C++ code from Linux to Windows. I am currently using Visual Studio 2013 to port my code.
I need to read a binary file and am using this portion of c++ code:
// Open the stream
std::ifstream is("myfile.bin");
// Determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size=is.tellg();
is.seekg(0, std::ios_base::begin);
// Create a vector to store the data
int* Data = new int[size/sizeof(int)];
// Load the data
is.read((char*) &Data[0], size);
// Close the file
is.close();
In linux, the size of my binary file is correctly found to be 744mb. However, in windows, the size of my binary file is incorrectly found to be >4GB. How can I correct this issue?
Change std::ifstream is("myfile.bin"); to std::ifstream is("myfile.bin", std::ios::binary);
With your current default open mode, the stream is opened in text mode. On Linux, text mode and binary mode behave the same, but on Windows, text mode translates \r\n line endings to \n and treats the byte 0x1A as end-of-file, so binary data gets mangled and the reported size and read counts no longer match the bytes on disk.
I finally had the time to actually run this myself, though I had to fix a couple of things, like ios_base::beg instead of begin (which doesn't exist). Also, as mentioned, the array allocation should be int* Data = new int[size / sizeof(int) + 1]; // At most one extra int
I found your problem: you're not in the right directory. Check if you successfully opened the file or not. If you don't, then you get a huge garbage value (probably -1, but unsigned, so massive) for size.
Try this to find your current directory on Windows (you will probably need to include Windows.h):
#include <windows.h>
#include <iostream>

char dirBuf[256];
GetCurrentDirectoryA(256, dirBuf);  // ANSI variant, since dirBuf is a char array
std::cout << "Current directory is: " << dirBuf << std::endl;
See if that's where your file is and move it accordingly. Or specify the ENTIRE path in the constructor to ifstream.
Also, it has nothing to do with ios::binary or not. Works fine both ways, or fails if the file isn't there.
std::size_t size=is.tellg();
The standard doesn't require tellg to return the byte offset from the beginning of the file. In general, this may not be a reliable way to get the size of the file, though it probably does what you expect on Linux and Windows.
The return type of the tellg method is std::ifstream::pos_type (std::streampos), so you're starting with an implicit conversion to std::size_t which may or may not be appropriate. In a 32-bit build, for example, it's conceivable that the size of a file could be larger than a std::size_t can represent.
But the root problem is that you're not checking for errors. If you have exceptions disabled, then tellg reports an error by returning pos_type(-1). When you cast that to an unsigned type (which std::size_t is), then you get a very large value. I suspect you failed to open the file, and since you didn't detect that error, the seekg and the tellg failed. You then coerced pos_type(-1) to a std::size_t, which made it look like the file was huge.
You also have the problems others have noted: failing to open the file in binary mode and computing the wrong size for the buffer when the file isn't a multiple of the size of an int.
The most reliable way to get the file size is to use the OS's API. On Windows, you can do this instead:
// Open the file. [TODO: Get the file name in wide characters and use
// CreateFileW instead. If the file name contains characters not
// representable by the user's ANSI codepage, then CreateFileA will fail.]
HANDLE hfile = CreateFileA("myfile.bin", GENERIC_READ, FILE_SHARE_READ,
nullptr, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
nullptr);
if (hfile == INVALID_HANDLE_VALUE) { error handling here }
// Figure out how big it is.
LARGE_INTEGER li_size;
if (!GetFileSizeEx(hfile, &li_size)) { error handling here }
// TODO: On a 32-bit build, this won't be able to handle huge files,
// so check that here.
std::size_t size = li_size.QuadPart;
// Create a buffer to store the data, being careful to round up to a
// multiple of sizeof(int). [TODO: Use a std::vector instead.]
int* Data = new int[(size + sizeof(int) - 1) / sizeof(int)];
// Load the data.
const DWORD BytesToRead = static_cast<DWORD>(size);
DWORD BytesRead = 0;
if (!ReadFile(hfile, Data, BytesToRead, &BytesRead, nullptr) || BytesRead < BytesToRead) {
    error handling here
}
// Close the file
CloseHandle(hfile);
int* Data = new int[size/sizeof(int)];
Be careful with this: the integer division truncates, so if size is not an exact multiple of sizeof(int) the buffer ends up slightly too small; round up as shown above instead.
Also, it should be std::ifstream f("filename.bin", std::ios::binary);
I have a complex interpreter reading in commands from (sometimes) multiple files (the exact details are out of scope), and it requires iterating over these multiple files (some could be GBs in size, preventing nice buffering) multiple times.
I am looking to increase the speed of reading in each command from a file.
I have used the RDTSC (time-stamp counter) register to micro-benchmark the code enough to know that more than 80% of the time is spent reading in from the files.
Here is the thing: the program that generates the input file is literally faster than reading that file back in my small interpreter. That is, instead of outputting the file I could (in theory) just link the generator of the data to the interpreter and skip the file, but that shouldn't be faster, right?
What am I doing wrong? Or is writing supposed to be 2x to 3x (at least) faster than reading from a file?
I have considered mmap but some of the results on http://lemire.me/blog/archives/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/ appear to indicate it is no faster than ifstream. or would mmap help in this case?
details:
I have (so far) tried adding a buffer, tweaking parameters, and removing the ifstream buffer (that slowed it down by 6x in my test case); I am currently at a loss for ideas after searching around.
The important section of the code is below. It does the following:
if data is left in the buffer, copy from the buffer to memblock (where it is then used)
if no data is left in the buffer, check how much data is left in the file; if more than the size of the buffer remains, copy a buffer-sized chunk
if less than a buffer's worth remains, copy only what is left
//if data in buffer
if (leftInBuffer[activefile] > 0)
{
    //cout << bufferloc[activefile] << "\n";
    memcpy(memblock, (buffer[activefile]) + bufferloc[activefile], 16);
    bufferloc[activefile] += 16;
    leftInBuffer[activefile] -= 16;
}
else //buffers blank
{
    //read in block
    long blockleft = (cfilemax - cfileplace) / 16;
    int read = 0;
    /* slow block starts here */
    if (blockleft >= MAXBUFELEMENTS)
    {
        currentFile->read((char *)(&(buffer[activefile][0])), 16 * MAXBUFELEMENTS);
        leftInBuffer[activefile] = 16 * MAXBUFELEMENTS;
        bufferloc[activefile] = 0;
        read = 16 * MAXBUFELEMENTS;
    }
    else //read in part of the block
    {
        currentFile->read((char *)(&(buffer[activefile][0])), 16 * (blockleft));
        leftInBuffer[activefile] = 16 * blockleft;
        bufferloc[activefile] = 0;
        read = 16 * blockleft;
    }
    /* slow block ends here */
    memcpy(memblock, (buffer[activefile]) + bufferloc[activefile], 16);
    bufferloc[activefile] += 16;
    leftInBuffer[activefile] -= 16;
}
edit: this is on a Mac, OS X 10.9.5, with an i7 and an SSD
Solution:
As was suggested below, mmap was able to increase the speed by about 10x.
(For anyone else who searches for this.)
specifically open with:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <string>

uint8_t * openMMap(std::string name, long & size)
{
    int m_fd;
    struct stat statbuf;
    uint8_t * m_ptr_begin;
    if ((m_fd = open(name.c_str(), O_RDONLY)) < 0)
    {
        perror("can't open file for reading");
    }
    if (fstat(m_fd, &statbuf) < 0)
    {
        perror("fstat in openMMap failed");
    }
    if ((m_ptr_begin = (uint8_t *)mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, m_fd, 0)) == MAP_FAILED)
    {
        perror("mmap in openMMap failed");
    }
    uint8_t * m_ptr = m_ptr_begin;
    size = statbuf.st_size;
    return m_ptr;
}
read by:
uint8_t * mmfile = openMMap("my_file", length);
uint32_t * memblockmm;
memblockmm = (uint32_t *)mmfile; //cast file to uint32 array
uint32_t data = memblockmm[0]; //take int
mmfile +=4; //increment by 4 as I read a 32 bit entry and each entry in mmfile is 8 bits.
This should be a comment, but I don't have 50 reputation to make a comment.
What is the value of MAXBUFELEMENTS? In my experience, many smaller reads are far slower than one larger read. I suggest reading the entire file in if possible; some files could be GBs in size, but even reading in 100 MB at once would perform better than reading 1 MB 100 times (see the sketch below).
If that's still not good enough, the next thing you can try is to compress the input files with zlib (you may have to break them into chunks due to size) and decompress them in memory. This method is usually faster than reading in uncompressed files.
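As a rough illustration of the "one big read" idea, here is a sketch that slurps a whole file with a single fread() (the file name argument is whatever you pass in, and the fseek/ftell size trick assumes the file size fits in a long):

#include <cstdio>
#include <vector>

// Sketch: read an entire file with one fread(); returns an empty vector on failure.
std::vector<char> slurp(const char *fname) {
  std::FILE *fp = std::fopen(fname, "rb");
  if (!fp) return {};
  std::fseek(fp, 0, SEEK_END);
  long size = std::ftell(fp);
  std::fseek(fp, 0, SEEK_SET);
  std::vector<char> data(size > 0 ? static_cast<std::size_t>(size) : 0);
  if (!data.empty())
    std::fread(data.data(), 1, data.size(), fp);
  std::fclose(fp);
  return data;
}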
As #Tony Jiang said, try experimenting with the buffer size to see if that helps.
Try mmap to see if that helps.
I assume that currentFile is a std::ifstream? There's going to be some overhead for using iostreams (for example, an istream will do its own buffering, adding an extra layer to what you're doing); although I wouldn't expect the overhead to be huge, you can test by using open(2) and read(2) directly.
You should be able to run your code through dtruss -e to verify how long the read system calls take. If those take the bulk of your time, then you're hitting OS and hardware limits, so you can address that by piping, mmap'ing, or adjusting your buffer size. If those take less time than you expect, then look for problems in your application logic (unnecessary work on each iteration, etc.).
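To rule out iostream overhead, a quick A/B test with raw open(2)/read(2) might look like the sketch below ("input.bin" is a placeholder; adjust the buffer size to experiment):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Sketch: count bytes using raw read(2), bypassing iostream buffering entirely.
int main() {
  const size_t BUFSZ = 1 << 20;   // 1 MB per read
  std::vector<char> buf(BUFSZ);
  int fd = open("input.bin", O_RDONLY);
  if (fd < 0) { std::perror("open"); return 1; }
  size_t total = 0;
  ssize_t n;
  while ((n = read(fd, buf.data(), buf.size())) > 0)
    total += static_cast<size_t>(n);
  close(fd);
  std::printf("read %zu bytes\n", total);
  return 0;
}

If this is not noticeably faster than the ifstream version, the bottleneck is the OS and disk rather than the stream layer.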
I'm doing stuff related to parsing huge globs of text files, and was testing which input method to use.
There is not much of a difference between C++ std::ifstream and C FILE.
According to the documentation of zlib, it supports uncompressed files and will read the file without decompression.
I'm seeing a difference from 12 seconds using non-zlib to more than 4 minutes using zlib.h.
I've tested this over multiple runs, so it's not a disk cache issue.
Am I using zlib in some wrong way?
thanks
#include <zlib.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>

#define LENS 1000000

size_t fg(const char *fname){
  fprintf(stderr,"\t-> using fgets\n");
  FILE *fp = fopen(fname,"r");
  size_t nLines = 0;
  char *buffer = new char[LENS];
  while(NULL != fgets(buffer,LENS,fp))
    nLines++;
  fprintf(stderr,"%lu\n",nLines);
  return nLines;
}

size_t is(const char *fname){
  fprintf(stderr,"\t-> using ifstream\n");
  std::ifstream is(fname,std::ios::in);
  size_t nLines = 0;
  char *buffer = new char[LENS];
  while(is.getline(buffer,LENS))
    nLines++;
  fprintf(stderr,"%lu\n",nLines);
  return nLines;
}

size_t iz(const char *fname){
  fprintf(stderr,"\t-> using zlib\n");
  gzFile fp = gzopen(fname,"r");
  size_t nLines = 0;
  char *buffer = new char[LENS];
  while(0 != gzgets(fp,buffer,LENS))
    nLines++;
  fprintf(stderr,"%lu\n",nLines);
  return nLines;
}

int main(int argc,char**argv){
  if(atoi(argv[2])==0)
    fg(argv[1]);
  if(atoi(argv[2])==1)
    is(argv[1]);
  if(atoi(argv[2])==2)
    iz(argv[1]);
}
I guess you are using zlib-1.2.3. In this version, gzgets() is virtually calling gzread() for each byte. Calling gzread() in this way has a big overhead. You can compare the CPU time of calling gzread(gzfp, buffer, 4096) once and of calling gzread(gzfp, buffer, 1) for 4096 times. The result is the same, but the CPU time is hugely different.
What you should do is implement buffered I/O for zlib, reading ~4 KB of data in a chunk with one gzread() call (like what fread() does for read()). The latest zlib-1.2.5 is said to be significantly improved on gzread/gzgetc/...; you may try that as well, though as it was released very recently I have not tried it personally.
EDIT:
I have tried zlib-1.2.5 just now. gzgetc and gzgets in 1.2.5 are much faster than those in 1.2.3.
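To illustrate the buffered approach suggested above, here is a rough sketch (not the poster's code) that pulls 4 KB per gzread() call and counts newlines itself instead of calling gzgets() line by line:

#include <zlib.h>
#include <cstdio>

// Sketch: count lines by reading 4 KB chunks with gzread() rather than letting
// gzgets() fetch data (nearly) byte-by-byte as in zlib-1.2.3.
size_t iz_buffered(const char *fname) {
  gzFile fp = gzopen(fname, "r");
  if (!fp) return 0;
  char chunk[4096];
  size_t nLines = 0;
  int n;
  while ((n = gzread(fp, chunk, sizeof(chunk))) > 0)
    for (int i = 0; i < n; i++)
      if (chunk[i] == '\n')
        nLines++;
  gzclose(fp);
  return nLines;
}

Note that this counts newline characters, so a final line without a trailing newline is not counted, unlike the gzgets() loop.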
If I have a large binary file (say it has 100,000,000 floats), is there a way in C (or C++) to open the file and read a specific float, without having to load the whole file into memory (i.e. how can I quickly find what the 62,821,214th float is)? A second question, is there a way to change that specific float in the file without having to rewrite the entire file?
I'm envisioning functions like:
float readFloatFromFile(const char* fileName, int idx) {
    FILE* f = fopen(fileName, "rb");
    // What goes here?
}

void writeFloatToFile(const char* fileName, int idx, float f) {
    // How do I open the file? fopen can only append or start a new file, right?
    // What goes here?
}
You know the size of a float is sizeof(float), so multiplication can get you to the correct position:
FILE *f = fopen(fileName, "rb");
fseek(f, idx * sizeof(float), SEEK_SET);
float result;
fread(&result, sizeof(float), 1, f);
Similarly, you can write to a specific position using this method.
fopen allows you to open a file for modification (not just for appending) by using either the rb+ or wb+ mode. See here: http://www.cplusplus.com/reference/clibrary/cstdio/fopen/
To position the file at a specific float, you can use fseek with index*sizeof(float) as the offset and SEEK_SET as the origin. See here: http://www.cplusplus.com/reference/clibrary/cstdio/fseek/
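Putting that together, the write side of the question could look roughly like this (a sketch without full error handling; "rb+" requires the file to already exist, and the long offset is still subject to the fseek range discussed earlier):

#include <cstdio>

// Sketch: overwrite the idx-th float of an existing binary file in place.
void writeFloatToFile(const char* fileName, long idx, float f) {
    FILE* fp = fopen(fileName, "rb+");   // read/update, no truncation
    if (!fp) return;
    fseek(fp, idx * (long)sizeof(float), SEEK_SET);
    fwrite(&f, sizeof(float), 1, fp);
    fclose(fp);
}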
Here is an example if you would like to use C++ streams:
#include <fstream>
using namespace std;
int main()
{
    // in | out is needed alongside binary; ios::binary alone will fail to open.
    fstream file("floats.bin", ios::in | ios::out | ios::binary);
    float number;
    file.seekg(62821214*sizeof(float), ios::beg);  // position the get pointer
    file.read(reinterpret_cast<char*>(&number), sizeof(float));
    file.seekp(0, ios::beg); // move the put pointer to the beginning of the file
    number = 3.2f;
    // write number at the beginning of the file
    file.write(reinterpret_cast<char*>(&number), sizeof(float));
}
One way would be to call mmap() on the file. Once you've done that, you can read/modify the file as if it was an in-memory array.
Of course that method only works if the file is small enough to fit in your process's address space... if you're running in 64-bit mode, you'll be fine; in 32-bit mode, a file with 100,000,000 floats should fit, but another order or two of magnitude above that and you might run into trouble.
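For completeness, a minimal mmap() sketch for this use case (POSIX only; error handling trimmed, and the file name is a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("floats.bin", O_RDWR);   // placeholder file name
    if (fd < 0) { std::perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    // Map the whole file read/write; MAP_SHARED writes changes back to the file.
    void *p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
    float *floats = static_cast<float*>(p);
    std::printf("%f\n", floats[62821214]);  // read one float
    floats[62821214] = 3.2f;                // modify it in place
    munmap(p, st.st_size);
    close(fd);
    return 0;
}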
I know this question has been answered already, but Linux/Unix provides easy system calls, pread and pwrite, to read and write in the middle of a file. If you look at the kernel source code for the system calls read and pread, both eventually call vfs_read(), and vfs_read() requires an offset, i.e. a position in the file to read from. With pread this offset is supplied by the caller; with read() the offset is calculated internally in the kernel and maintained for the file descriptor. pread() offers exceptional performance compared to read(), and with pread you can read and write through the same file descriptor simultaneously from multiple threads in different parts of the file. In my humble opinion, never use read() or other file streams; use pread(). Hopefully the file-stream libraries have wrapped the read() calls so the streams perform well by making fewer system calls.
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
int main()
{
    float buf;                              // destination for the value we read
    int idx = 62821214;                     // index of the float we want
    off_t offToStart = idx * sizeof(float);
    size_t sizeToRead = sizeof(float);
    int fd = open("fileName", O_RDONLY);
    ssize_t ret = pread(fd, &buf, sizeToRead, offToStart);
    // process the value read into 'buf'
    close(fd);
    return 0;
}