Portable way to get file size in C/C++

I need to determine the byte size of a file.
The coding language is C++ and the code should work on Linux, Windows, and any other operating system. This implies using standard C or C++ functions/classes.
This trivial need apparently has no trivial solution.

Using a standard stream you can do:
std::ifstream ifile(....);
ifile.seekg(0, std::ios_base::end);  // seek to end
// the current position is now the length of the file
ifile.tellg();
If you are dealing with a write-only file (std::ofstream), the corresponding methods are seekp/tellp:
ofile.seekp(0, std::ios_base::end);
ofile.tellp();

You can use the stat system call:
#ifdef WIN32
    _stat64()
#else
    stat64()
#endif
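For illustration, a minimal sketch of how those calls might be wrapped in one function. The macro and the exact 64-bit struct names vary by platform and compiler, so treat these as assumptions to check against your headers:
#include <sys/types.h>
#include <sys/stat.h>
#include <cstdint>

// Returns the file size in bytes, or -1 on error.
std::int64_t file_size_stat(const char* filename)
{
#ifdef _WIN32
    struct __stat64 st;                 // MSVC's 64-bit stat variant
    if (_stat64(filename, &st) != 0)
        return -1;
#else
    struct stat st;                     // with _FILE_OFFSET_BITS=64, st_size is 64-bit
    if (stat(filename, &st) != 0)
        return -1;
#endif
    return static_cast<std::int64_t>(st.st_size);
}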

If you only need the file size this is certainly overkill, but in general I would go with Boost.Filesystem for platform-independent file operations.
Amongst other attribute functions it contains:
template <class Path> uintmax_t file_size(const Path& p);
You can find the reference here. Although the Boost libraries may seem huge, I have found that they often implement things very efficiently. You could also extract only the function you need, but this might prove difficult as Boost is rather complex.
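Typical usage is essentially a one-liner; a minimal sketch (the file name is just a placeholder, and C++17's std::filesystem::file_size mentioned in another answer is a drop-in replacement):
#include <boost/filesystem.hpp>
#include <cstdint>
#include <iostream>

int main()
{
    try
    {
        std::uintmax_t size = boost::filesystem::file_size("some_file.bin");
        std::cout << "size: " << size << " bytes\n";
    }
    catch (const boost::filesystem::filesystem_error& e)
    {
        std::cerr << e.what() << '\n';   // missing file, permission problems, ...
    }
}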

std::intmax_t file_size(std::string_view const& fn)
{
    std::filebuf fb;
    return fb.open(fn.data(), std::ios::binary | std::ios::in) ?
        std::intmax_t(fb.pubseekoff({}, std::ios::end, std::ios::in)) :
        std::intmax_t(-1);
}
We sacrifice 1 bit for the error indicator and standard disclaimers apply when running on 32-bit systems. Use std::filesystem::file_size(), if possible, as std::filebuf may dynamically allocate buffers for file io. This would make all the iostream-based methods wasteful and slow. Files were/are meant to be streamed, though much more so in the past than today, which relegates file sizes to secondary importance.

Simples:
std::ifstream ifs;
ifs.open("mybigfile.txt", std::ios::binary);
ifs.seekg(0, std::ios::end);
std::streampos pos = ifs.tellg();

Portability requires you to use the lowest common denominator, which is C (not C++).
The method that I use is the following.
#include <stdio.h>
long filesize(const char *filename)
{
    FILE *f = fopen(filename, "rb"); /* open the file read-only, in binary mode */
    long size = 0;
    if (f != NULL)
    {
        if (fseek(f, 0, SEEK_END) == 0) /* seek to end succeeded */
            size = ftell(f);            /* current position == length of file */
        fclose(f);
    }
    return size;
}

The prize for absolute inefficiency would go to:
auto file_size(std::string_view const& fn)
{
    std::ifstream ifs(fn.data(), std::ios::binary);
    // istreambuf_iterator visits every byte (istream_iterator would skip whitespace)
    return std::distance(std::istreambuf_iterator<char>(ifs), {});
}

Often we want to get things done in the most portable manner, but in certain situations, especially ones like this, I would strongly recommend using system APIs for the best performance.


Read an image or pdf using C++ without external library

After reading about Java and C#, I was wondering whether C++ can also read image and PDF files without the use of external libraries. C++ doesn't have a byte type like Java and C#, so how can we accomplish the task (again, without using an external library)?
Can anyone give a small demonstration (i.e. a program or code to read, copy, or write image or PDF files)?
You can use unsigned char, or char reinterpreted as some integer type, to parse binary file formats like PDF, JPEG, etc. You can create a buffer as std::vector<char> and read into it as follows:
std::ifstream infile("file.bin", std::ios::binary); // the name is a placeholder; the binary flag matters
std::vector<char> buffer(
    (std::istreambuf_iterator<char>(infile)),
    (std::istreambuf_iterator<char>()));
Related questions: Reading and writing binary file
There is no difference in how you read a file opened in binary mode; the only difference is in how you interpret the data you get from it.
It's significantly better to use a ready-made library such as libjpeg or similar; there are plenty of them. But if you really want to do this yourself, you should first define suitable structures and constants (see the links below) to make the code convenient and usable. Then you just read the data and try to interpret it step by step. The code below is just pseudo code; I didn't compile it.
#include <fstream>
#include <stdexcept>
#include <string>

// define header structure
struct jpeg_header
{
    // 0xffd8 is the SOI (start of image) marker that every JPEG file begins with
    enum marker: unsigned short { soi = 0xffd8, sof0 = 0xffc0 ... };
    ...
};

bool is_soi(unsigned short m) { return jpeg_header::soi == m; }

jpeg_header read_jpeg_header(const std::string& fn)
{
    std::ifstream inf(fn, std::ifstream::binary);
    if (!inf)
    {
        throw std::runtime_error("Can't open file: " + fn);
    }
    inf.exceptions(std::ifstream::failbit | std::ifstream::eofbit);

    unsigned short marker = inf.get() << 8;
    marker |= inf.get();
    if (!is_soi(marker))
    {
        throw std::runtime_error("Invalid jpeg header");
    }
    ...
    jpeg_header header;
    // read further and fill header structure
    ...
    return header;
}
To read a huge block of data, use the ifstream::read() or ifstream::readsome() methods. Here is a good example: http://en.cppreference.com/w/cpp/io/basic_istream/read.
Those functions are also faster than stream iterators. It's also better to define your own exception classes derived from std::runtime_error.
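For illustration, a minimal sketch of a block read with ifstream::read(), checking gcount() for how much actually arrived (the function name and block size are arbitrary):
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

std::vector<char> read_block(const std::string& fn, std::size_t n)
{
    std::ifstream inf(fn, std::ifstream::binary);
    std::vector<char> buf(n);
    inf.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    buf.resize(static_cast<std::size_t>(inf.gcount()));  // keep only what was actually read
    return buf;
}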
For details on the file formats you are interested in, look here:
Structure of a PDF file?
https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format
https://en.wikipedia.org/wiki/JPEG
It would be a strange world to have a systems language like C, and in this case C++, without a byte type :).
Yes, I admit it has a strange name, unsigned char, but it is still there :).
Really, just think about the magnitude of redevelopment needed everywhere to avoid bytes :). Peripherals, many registers in CPUs and other chips, communication, data protocols. It would all have to be redone :).

Streaming into vector?

I have stumbled upon this code to insert the contents of a file into a vector. It seems like a useful thing to learn how to do:
#include <iostream>
#include <fstream>
#include <vector>

int main() {
    typedef std::vector<char> fileContainer;
    std::ifstream testFile("testfile.txt");
    fileContainer container;
    container.assign(
        (std::istreambuf_iterator<char>(testFile)),
        std::istreambuf_iterator<char>());
    return 0;
}
It works, but I'd like to ask: is this the best way to do such a thing? That is, to take the contents of any file type and insert it into an appropriate STL container. Is there a more efficient way of doing it than the above? As I understand it, it creates a testFile instance of ifstream and fills it with the contents of testfile.txt, and then that copy is copied again into the container through assign. That seems like a lot of copying?
As for speed/efficiency, I'm not sure how to estimate the file size and use the reserve function with it; using reserve even appears to slow this code down. At the moment, swapping out the vector and just using a deque seems to be quite a bit more efficient.
I'm not sure that there's a best way, but using the two iterator
constructor would be more idiomatic:
FileContainer container( (std::istreambuf_iterator<char>( testFile )),
(std::istreambuf_iterator<char>()) );
(I notice that you have the extra parentheses in your assign. They
aren't necessary there, but they are when you use the constructor.)
With regards to performance, it would be more efficient to pre-allocate
the data, something like:
FileContainer container( actualSizeOfFile );
std::copy( std::istreambuf_iterator<char>( testFile ),
std::istreambuf_iterator<char>(),
container.begin() );
This is slightly dangerous; if your estimation is too small, you'll
encounter undefined behavior. To avoid this, you could also do:
FileContainer container;
container.reserve( estimatedSizeOfFile );
container.insert( container.begin(),
std::istreambuf_iterator<char>( testFile ),
std::istreambuf_iterator<char>() );
Which of these two is faster will depend on the implementation; the last
time I measured (with g++), the first was slightly faster, but if you're
actually reading from file, the difference probably isn't measurable.
The problem with these two methods is that, despite other answers, there
is no portable way of finding the file size other than by actually
reading the file. Non-portable methods exist for some systems (fstat
under Unix), but on other systems, like Windows, there is no means
of finding the exact number of char you can read from a text file.
And of course, there's no guarantee that the results of tellg() will
even convert to an integral type, and that if it does, that they won't
be a magic cookie, with no numerical signification.
Having said that, in practice, the use of tellg() suggested by other
posters will often be "portable enough" (Windows and most Unix, at
least), and the results will often be "close enough"; they'll usually be
a little too high under Windows (since the results will count the
carriage return characters which won't be read), but in a lot of cases,
that's not a big problem. In the end, it's up to you to decide what
your requirements are with regards to portability and precision of the
size.
it creates a testFile instance of ifstream and fills it with the contents of testfile.txt
No, it opens testfile.txt and calls the handle testFile. There is one copy being made, from disk to memory. (Except that I/O is commonly done by another copy through kernel space, but you're not going to avoid that in a portable way.)
As for speed/efficiency, I'm not sure how to estimate the file size and use the reserve function with that
If the file is a regular file:
std::ifstream testFile("testfile.txt");
testFile.seekg(0, std::ios::end);
std::streampos size = testFile.tellg();
testFile.seekg(0, std::ios::beg);
std::vector<char> container;
container.reserve(size);
Then fill container as before. Or construct it as std::vector<char> container(size) and fill it with
testFile.read(&container.front(), size);
Which one is faster should be determined by profiling.
The std::ifstream is not filled with the contents of the file; the contents are read on demand. Some kind of buffering is involved, so the file will be read in chunks of k bytes. Since stream iterators are InputIterators, it should be more efficient to call reserve on the vector first; but only if you already have that information or can guess a good approximation, otherwise you would have to iterate through the file contents twice.
People much more frequently want to read from a file into a string than a vector. If you can use that, you might want to see the answer I posted to a previous question.
A minor edit of the fourth test there will give this:
std::vector<char> s4;
file.seekg(0, std::ios::end);
s4.resize(file.tellg());
file.seekg(0, std::ios::beg);
file.read(&s4[0], s4.size());
My guess is that this should give performance essentially indistinguishable from the code using a string. Depending on your compiler/standard library, this is likely to be substantially faster than your current code (again, see the timing results there for some idea of the difference you're likely to see).
Also note that this gives a little extra ability to detect and diagnose errors. For example, you can check whether you successfully read the entire file by comparing s4.size() to file.gcount() (and/or check for file.eof()). This also makes it a bit easier to prevent problems by limiting the amount you read, in case somebody decides to see what happens when/if they try to use your program to read a file that's, say, 6 terabytes.
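For example, a quick check along those lines (reusing file and s4 from the snippet above; needs <iostream> for std::cerr; a sketch, not production-grade error handling):
if (!file || file.gcount() != static_cast<std::streamsize>(s4.size()))
    std::cerr << "short read: only " << file.gcount() << " bytes arrived\n";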
There is definitely a better way if you want to make it efficient. You can check the file size, pre-allocate the vector, and read directly into the vector's memory. A simple example:
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <iostream>
using namespace std;
int main ()
{
    int fd = open ("test.data", O_RDONLY);
    if (fd == -1)
    {
        perror ("open");
        return EXIT_FAILURE;
    }

    struct stat info;
    int res = fstat (fd, &info);
    if (res != 0)
    {
        perror ("fstat");
        return EXIT_FAILURE;
    }

    std::vector<char> data;
    if (info.st_size > 0)
    {
        data.resize (info.st_size);
        ssize_t x = read (fd, &data[0], data.size ());
        if (x != info.st_size)
        {
            perror ("read");
            return EXIT_FAILURE;
        }
        cout << "Data (" << info.st_size << "):\n";
        cout.write (&data[0], data.size ());
    }
}
There are other, more efficient ways for some tasks. For example, to copy a file without transferring data to and from user space, you can use sendfile etc.
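For instance, on Linux a whole-file copy can stay in the kernel with sendfile(); a sketch with error handling trimmed (note that on kernels older than 2.6.33 the destination had to be a socket rather than a regular file):
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Copy src to dst without bouncing the data through user space.
int copy_file(const char* src, const char* dst)
{
    int in = open(src, O_RDONLY);
    if (in == -1) return -1;
    struct stat st;
    if (fstat(in, &st) != 0) { close(in); return -1; }
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1) { close(in); return -1; }
    off_t offset = 0;
    ssize_t sent = sendfile(out, in, &offset, st.st_size);
    close(in);
    close(out);
    return sent == st.st_size ? 0 : -1;
}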
It does work, and it is convenient, but there are many situations where it is a bad idea.
Error handling in a user-edited file, for example. If the user has hand edited a data file or it has been imported from a spreadsheet or even a database with lax field definitions, then this method of filling the vector will result in a simple error with no detail.
In order to process the file and report where the error happened, you need to read it line by line and attempt the conversion to a number on each line. Then you can report the line number and the text that failed to convert. This is extremely useful. Without this feature the user is left to wonder which line caused the problem instead of being able to immediately fix it.

c++ Fastest way to write/read array of shorts to/from file

What is the fastest way to write an array of unsigned short values into a file, and then what would be the fastest way to read these in another application?
Also this would be on an apple machine running snow leopard.
If you don't worry about cross-platform compatibility, just write them all out in binary:
int save_shorts(const unsigned short* array, size_t num_shorts)
{
    int ok = 0;
    FILE* out = fopen("numbers.bin", "wb");
    if (out != NULL)
    {
        ok = fwrite(array, num_shorts * sizeof *array, 1, out) == 1;
        fclose(out);
    }
    return ok;
}
Reading them back in is very similar but with fread(), of course. You could probably use C++ (binary) streams too, but this is simple enough.
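A matching read routine might look like this (a sketch; it assumes the caller already knows num_shorts, e.g. from the file size):
#include <cstdio>

int load_shorts(unsigned short* array, size_t num_shorts)
{
    int ok = 0;
    FILE* in = fopen("numbers.bin", "rb");
    if (in != NULL)
    {
        ok = fread(array, num_shorts * sizeof *array, 1, in) == 1;
        fclose(in);
    }
    return ok;
}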
There are about a gazillion approaches to this, starting from simple file I/O to compressing the data in memory under whatever format (zlib?) and then writing it to the file. It's up to you.
You can't go faster than mapping the file into memory and writing to that memory. This avoids stdio buffering and copying data between user-space and kernel.

How to speed-up loading of 15M integers from file stream?

I have an array of precomputed integers; it is a fixed size of 15M values. I need to load these values at program start. Currently it takes up to 2 minutes to load, and the file size is ~130MB. Is there any way to speed up loading? I'm free to change the save process as well.
std::array<int, 15000000> keys;
std::string config = "config.dat";
// how the array is saved
std::ofstream out(config.c_str());
std::copy(keys.cbegin(), keys.cend(),
          std::ostream_iterator<int>(out, "\n"));
// loading of the array
std::ifstream in(config.c_str());
std::copy(std::istream_iterator<int>(in),
          std::istream_iterator<int>(), keys.begin());
in.close();
Thanks in advance.
SOLVED. Used the approach proposed in accepted answer. Now it takes just a blink.
Thanks all for your insights.
You have two issues regarding the speed of your write and read operations.
First, std::copy cannot do a block-copy optimization when writing to an ostream_iterator because it doesn't have direct access to the underlying target.
Second, you're writing the integers out as ASCII, not binary, so for each write the output iterator creates an ASCII representation of your int, and on read it has to parse the text back into an integer. I believe this is the brunt of your performance issue.
The raw storage of your array (assuming a 4-byte int) should only be 60MB, but since each character of an integer in ASCII is 1 byte, any ints with more than 4 characters will take more space than their binary storage, hence your 130MB file.
There is no easy way to solve your speed problem portably (so that the file can be read on machines with a different endianness or int size), or while sticking with std::copy. The easiest way is to just dump the whole array to disk and then read it all back using fstream::write and fstream::read; just remember that this is not strictly portable.
To write:
std::fstream out(config.c_str(), std::ios::out | std::ios::binary);
out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
And to read:
std::fstream in(config.c_str(), std::ios::in | std::ios::binary);
in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
----Update----
If you are really concerned about portability, you could easily use a portable format (like your initial ASCII version) in your distribution artifacts; then, when the program is first run, it could convert that portable format into a locally optimized version for use during subsequent executions.
Something like this perhaps:
std::array<int, 15000000> keys;
// data.txt holds the ascii values and data.bin is the binary version
if (!file_exists("data.bin")) {
    std::ifstream in("data.txt");
    std::copy(std::istream_iterator<int>(in),
              std::istream_iterator<int>(), keys.begin());
    in.close();

    std::fstream out("data.bin", std::ios::out | std::ios::binary);
    out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
} else {
    std::fstream in("data.bin", std::ios::in | std::ios::binary);
    in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
}
If you have an install process this preprocessing could also be done at that time...
Attention. Reality check ahead:
Reading integers from a large text file is an I/O-bound operation unless you're doing something completely wrong (like using C++ streams for this). Loading 15M integers from a text file takes less than 2 seconds on an AMD64 @ 3GHz when the file is already buffered (and only a bit longer if it has to be fetched from a sufficiently fast disk). Here's a quick & dirty routine to prove my point (that's why I do not check for all possible errors in the format of the integers, nor close my files at the end, because I exit() anyway).
$ wc nums.txt
15000000 15000000 156979060 nums.txt
$ head -n 5 nums.txt
730547560
-226810937
607950954
640895092
884005970
$ g++ -O2 read.cc
$ time ./a.out <nums.txt
=>1752547657
real 0m1.781s
user 0m1.651s
sys 0m0.114s
$ cat read.cc
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <vector>

int main()
{
    int c;      /* int, not char, so the comparison with EOF is well defined */
    int num=0;
    int pos=1;
    int line=1;
    std::vector<int> res;
    while(c=getchar(),c!=EOF)
    {
        if (c>='0' && c<='9')
            num=num*10+c-'0';
        else if (c=='-')
            pos=0;
        else if (c=='\n')
        {
            res.push_back(pos?num:-num);
            num=0;
            pos=1;
            line++;
        }
        else
        {
            printf("I've got a problem with this file at line %d\n",line);
            exit(1);
        }
    }
    // make sure the optimizer does not throw the vector away; also a check.
    unsigned sum=0;
    for (size_t i=0;i<res.size();i++)
    {
        sum=sum+(unsigned)res[i];
    }
    printf("=>%d\n",sum);
}
UPDATE: and here's my result when reading the text file (not binary) using mmap:
$ g++ -O2 mread.cc
$ time ./a.out nums.txt
=>1752547657
real 0m0.559s
user 0m0.478s
sys 0m0.081s
code's on pastebin:
http://pastebin.com/NgqFa11k
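In case the pastebin link goes stale: the idea is simply to mmap() the whole text file and run the same hand-rolled parser over the mapped bytes. A rough sketch of that approach (not the exact posted code):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    void* map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    const char* p = static_cast<const char*>(map);

    std::vector<int> res;
    int num = 0, sign = 1;
    for (off_t i = 0; i < st.st_size; ++i)
    {
        char c = p[i];
        if (c >= '0' && c <= '9')   num = num * 10 + (c - '0');
        else if (c == '-')          sign = -1;
        else if (c == '\n')         { res.push_back(sign * num); num = 0; sign = 1; }
    }

    unsigned sum = 0;
    for (size_t i = 0; i < res.size(); ++i) sum += (unsigned)res[i];
    printf("=>%u\n", sum);

    munmap(map, st.st_size);
    close(fd);
}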
What do I suggest?
1-2 seconds is a realistic lower bound for a typical desktop machine loading this data. 2 minutes sounds more like a 60 MHz microcontroller reading from a cheap SD card. So either you have an undetected/unmentioned hardware condition, or your implementation of C++ streams is somehow broken or unusable. I suggest establishing a lower bound for this task on your machine by running my sample code.
If the integers are saved in binary format and you're not concerned with endian problems, try reading the entire file into memory at once (fread) and casting the pointer to int *.
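A sketch of that whole-file approach, here reading straight into a std::vector<int> instead of casting a raw char buffer (assumes the file holds raw native-endian ints):
#include <cstdio>
#include <vector>

std::vector<int> load_raw_ints(const char* path)
{
    std::vector<int> v;
    FILE* f = fopen(path, "rb");
    if (!f) return v;
    fseek(f, 0, SEEK_END);
    long bytes = ftell(f);
    fseek(f, 0, SEEK_SET);
    if (bytes <= 0) { fclose(f); return v; }
    v.resize(bytes / sizeof(int));
    size_t want = v.size() * sizeof(int);
    if (fread(v.data(), 1, want, f) != want)
        v.clear();                        // short read: treat as failure
    fclose(f);
    return v;
}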
You could precompile the array into a .o file, which wouldn't need to be recompiled unless the data changes.
thedata.hpp:
static const int NUM_ENTRIES = 5;
extern int thedata[NUM_ENTRIES];
thedata.cpp:
#include "thedata.hpp"
int thedata[NUM_ENTRIES] = {
10
,200
,3000
,40000
,500000
};
To compile this:
# make thedata.o
Then your main application would look something like:
#include "thedata.hpp"
using namespace std;
int main() {
for (int i=0; i<NUM_ENTRIES; i++) {
cout << thedata[i] << endl;
}
}
Assuming the data doesn't change often, and that you can process the data to create thedata.cpp, then this is effectively instant loadtime. I don't know if the compiler would choke on such a large literal array though!
Save the file in a binary format.
Write the file by taking a pointer to the start of your int array and convert it to a char pointer. Then write the 15000000*sizeof(int) chars to the file.
And when you read the file, do the same in reverse: read the file as a sequence of chars, take a pointer to the beginning of the sequence, and convert it to an int*.
Of course, this assumes that endianness isn't an issue.
For actually reading and writing the file, memory mapping is probably the most sensible approach.
If the numbers never change, preprocess the file into a C++ source and compile it into the application.
If the numbers can change, and you thus have to keep them in a separate file that you load on startup, then avoid reading them number by number using C++ IO streams. C++ IO streams are a nice abstraction, but there is too much of it for such a simple task as loading a bunch of numbers fast. In my experience, a huge part of the run time is spent parsing the numbers and another part in accessing the file char by char.
(Assuming your file is more than a single long line.) Read the file line by line using std::getline(), then parse the numbers out of each line using std::strtol() rather than streams. This avoids a huge part of the overhead. You can get more speed out of the streams by crafting your own variant of std::getline() that reads the input ahead (using istream::read()); the standard std::getline() also reads input char by char.
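A sketch of that getline/strtol combination (file name handling and error checks kept minimal):
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

std::vector<int> load_ints(const char* path)
{
    std::vector<int> v;
    v.reserve(15000000);               // known size, so reserve up front
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))     // one number per line, as in the question
        v.push_back(static_cast<int>(std::strtol(line.c_str(), NULL, 10)));
    return v;
}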
Use a buffer of 1000 (or even 15M; you can modify this size as you please) integers, not integer after integer. Not using a buffer is clearly the problem in my opinion.
If the data in the file is binary and you don't have to worry about endianness, and you're on a system that supports it, use the mmap system call. See this article on IBM's website:
High-performance network programming, Part 2: Speed up processing at both the client and server
Also see this SO post:
When should I use mmap for file access?

Fastest way to write large STL vector to file using STL

I have a large vector (10^9 elements) of chars, and I was wondering what the fastest way is to write such a vector to a file. So far I've been using the following code:
vector<char> vs;
// ... Fill vector with data
ofstream outfile("nanocube.txt", ios::out | ios::binary);
ostream_iterator<char> oi(outfile, '\0');
copy(vs.begin(), vs.end(), oi);
With this code it takes approximately two minutes to write all the data to the file. The actual question is: "Can I make it faster using the STL, and how?"
With such a large amount of data to be written (~1GB), you should write to the output stream directly, rather than using an output iterator. Since the data in a vector is stored contiguously, this will work and should be much faster.
ofstream outfile("nanocube.txt", ios::out | ios::binary);
outfile.write(&vs[0], vs.size());
There is a slight conceptual error with your second argument to ostream_iterator's constructor. It should be a NULL pointer if you don't want a delimiter (although, luckily for you, this will be treated as such implicitly), or the second argument should be omitted.
However, this means that after writing each character, the code needs to check the pointer designating the delimiter (which might be somewhat inefficient).
I think, if you want to go with iterators, perhaps you could try ostreambuf_iterator.
Other options might include using the write() method (if it can handle output this large, or perhaps output it in chunks), and perhaps OS-specific output functions.
Since your data is contiguous in memory (as Charles said), you can use low level I/O. On Unix or Linux, you can do your write to a file descriptor. On Windows XP, use file handles. (It's a little trickier on XP, but well documented in MSDN.)
XP is a little funny about buffering. If you write a 1GB block to a handle, it will be slower than if you break the write up into smaller transfer sizes (in a loop). I've found the 256KB writes are most efficient. Once you've written the loop, you can play around with this and see what's the fastest transfer size.
OK, I wrote a method implementation with a for loop that writes 256KB blocks of data (as Rob suggested) on each iteration, and the result is 16 seconds, so the problem is solved. This is my humble implementation, so feel free to comment:
void writeCubeToFile(const vector<char> &vs)
{
    const unsigned int blocksize = 262144;
    unsigned long blocks = distance(vs.begin(), vs.end()) / blocksize;
    ofstream outfile("nanocube.txt", ios::out | ios::binary);
    for (unsigned long i = 0; i <= blocks; i++)
    {
        unsigned long position = blocksize * i;
        if (blocksize > distance(vs.begin() + position, vs.end()))
            outfile.write(&*(vs.begin() + position), distance(vs.begin() + position, vs.end()));
        else
            outfile.write(&*(vs.begin() + position), blocksize);
    }
    outfile.write("\0", 1);
    outfile.close();
}
Thanks to all of you.
If you have another structure, this method is still valid.
For example:
typedef std::pair<int,int> STL_Edge;
vector<STL_Edge> v;

void write_file(const char * path){
    ofstream outfile(path, ios::out | ios::binary);
    outfile.write((const char *)&v.front(), v.size()*sizeof(STL_Edge));
}

void read_file(const char * path, int reserveSpaceForEntries){
    ifstream infile(path, ios::in | ios::binary);
    v.resize(reserveSpaceForEntries);
    infile.read((char *)&v.front(), v.size()*sizeof(STL_Edge));
}
Instead of writing via the file I/O methods, you could try to create a memory-mapped file and then copy the vector into the memory-mapped file using memcpy.
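A POSIX sketch of that idea (error handling trimmed; on Windows the equivalent would be CreateFileMapping/MapViewOfFile):
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <vector>

bool write_vector_mmap(const char* path, const std::vector<char>& vs)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return false;
    if (ftruncate(fd, vs.size()) != 0) { close(fd); return false; }   // size the file first
    void* p = mmap(NULL, vs.size(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }
    std::memcpy(p, vs.data(), vs.size());   // the "write" is just a memory copy
    munmap(p, vs.size());
    close(fd);
    return true;
}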
Use the write method on it; it is in RAM after all, and you have contiguous memory. Fastest, while keeping flexibility for later? Lose the built-in buffering, hint sequential I/O, lose the hidden overhead of iterators/utilities, avoid streambuf when you can, but do get your hands dirty with boost::asio.