I want to read a text file from local storage. I'm trying to experiment with multiprocessing, so I want to break the text file into smaller chunks and run a process on each of them.
Rough idea:
Input: 10Kb text file
Program to separate them into chunks of 1Kb each
Run a function on each chunk separately (e.g. capitalise certain characters, find the frequency of letters, or search for a word in that chunk)
Output: Return the function output with no memory leaks or mismatches in reads
I've tried using pread, but I'm on Windows, so any solution or leads to solving this would be helpful.
Maybe you have chosen the wrong example to learn multithreading.
A file stored on a sequential drive will be read fastest in sequential mode.
Therefore, in my example below, I read the complete file in one go into a string. For test purposes I used a "Lorem Ipsum" generator and created a file with 1 million characters; 1 million characters is nowadays still considered small.
For demo purposes, I will create 4 parallel threads.
After having the complete file in one string, I split the big string into 4 substrings, one for each thread.
For the thread function, I created a 4-line test function that calculates the count of letters in a given substring.
For easier learning, I will use std::async to create the threads. The result of std::async will be stored in a std::future, from which we can pick up the test function's result later. We need to use a std::shared_future to be able to store all of them in a std::array, because std::future's copy constructor is deleted.
Then, we let the threads do their work.
In an additional loop, we use the future's get function, which will wait for thread completion and then give us the result.
We sum up the values from all 4 threads and then print them out in a sorted way. Please note: the '\n' characters will also be counted, which will look a little bit strange in the output.
Please note: this is just a demo. It will be even slower than a straightforward solution; it is only meant to show how multithreading could work.
Please see below one simple example (one of many many possible solutions):
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <iterator>
#include <future>
#include <thread>
#include <array>
#include <set>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<char, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
// We will use 4 threads for our task
constexpr size_t NumberOfThreads = 4u;
// Some test function used by a thread. Count characters in text
Counter countCharacters(const std::string& text) {
// Definition of the counter
Counter counter{};
// Count all letters
for (const char c : text) counter[c]++;
// Give back result
return counter;
}
// Test / driver Code
int main() {
// Open a test file with 1M characters and check, if it could be opened
if (std::ifstream sourceStream{ "r:\\text.txt" }; sourceStream) {
// Read the complete 1M file into a string
std::string text(std::istreambuf_iterator<char>(sourceStream), {});
// ------------------------------------------------------------------------------------------------
// This is for the multithreading part
// We will split the big string into parts and give each thread the task of working with its part
// Calculate the length of one partition + some reserve in case of rounding problems
const size_t partLength = text.length() / NumberOfThreads + NumberOfThreads;
// We will create NumberOfThreads substrings starting at equidistant positions. This is the start position.
size_t threadStringStartpos = 0;
// Container for the futures. Please note. We can only use shared futures in containers.
std::array<std::shared_future<Counter>, NumberOfThreads> counter{};
// Now create the threads
for (unsigned int threadNumber{}; threadNumber < NumberOfThreads; ++threadNumber) {
// Start a thread: call std::async with our test function and a part of the string, and store the returned future
counter[threadNumber] = std::async( countCharacters, text.substr(threadStringStartpos, partLength));
// Calculate next part of string
threadStringStartpos += partLength;
}
// Combine results from threads
Counter result{};
for (unsigned int threadNumber{}; threadNumber < NumberOfThreads; ++threadNumber) {
// Get will get the result from the thread via the assigned future
for (const auto& [letter, count] : counter[threadNumber].get())
result[letter] += count; // Sum up all counts
}
// ------------------------------------------------------------------------------------------------
for (const auto& [letter, count] : Rank(result.begin(), result.end())) std::cout << letter << " --> " << count << '\n';
}
else std::cerr << "\n*** Error: Could not open source file\n";
}
My program reads a file, processes it and saves the results in a csv file.
The whole thing includes a loop in which many different files are processed; a separate CSV file is generated for each of these files.
I was able to implement the processing very efficiently in terms of time, so that saving the respective results is the longest process in the loop.
The results are available as std::vector<float*> and are currently saved as follows:
std::vector<float*> out = calculation(bla);
fstream data;
data.open(savepfad + name + ".csv", ios::out);
data<< sizex << endl;
data<< sizey << endl;
data<< dim << endl;
for (int d = 0; d < dim; d++)
{
for (int x = 0; x < sizex * sizey; x++)
{
data << out[d][x] << ",";
}
data << endl;
}
data.close();
My first thought was to simply offload the saving to a new thread (possibly with a fork) so I could continue with the main loop, but I am on Windows.
Can I somehow write the data to the hard drive faster?
Does anyone have a brilliant idea?
EDIT:
So I rebuilt the code according to the suggestions, but there is no real speed advantage. The code now looks like this:
std::vector<float*> out = calculation(bla);
string line = std::to_string(sizex) + "\n" + std::to_string(sizey ) + "\n" + std::to_string(dim) + "\n";
for (int d = 0; d < dim; d++)
{
for (int x = 0; x < sizex * sizey; x++)
{
line += std::to_string(out[d][x]);
line += ",";
}
line += "\n";
}
fstream data;
data.open(savepfad + name + ".csv", ios::out);
data<<line;
data.close();
I also noticed that if out[][] == 0, std::to_string(out[][]) turns the 0 into 0.000000, while data << out[][] writes just 0 into the file; this changes the file size from 8000 KB to 36000 KB.
In Python I can dump 100 MB onto the hard disk almost instantly, so I should be able to write 8000 KB relatively quickly; currently it takes between 1 and 2 minutes.
example size:
sizex = 638
sizey = 958
dim = 8
The time measurement shows that almost all of the time is spent going through the two loops. out is a vector of arrays; is the access to out too slow?
data << endl sends a newline AND flushes the result to disk.
You could do
data << "\n";
instead to send a newline without flushing.
The end result is that you flush fewer times, which means you spend less time waiting for the OS.
If that is still not fast enough, consider buffering everything into an std::ostringstream and dumping that into data in one go.
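For illustration, here is a minimal sketch of that idea (only a sketch, reusing the question's out, sizex, sizey, dim, savepfad and name):
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Minimal sketch of "buffer everything, write once", assuming the question's
// variables (out, sizex, sizey, dim, savepfad, name).
void saveBuffered(const std::vector<float*>& out, int sizex, int sizey, int dim,
                  const std::string& savepfad, const std::string& name)
{
    std::ostringstream buffer;                        // everything goes to memory first
    buffer << sizex << '\n' << sizey << '\n' << dim << '\n';
    for (int d = 0; d < dim; ++d)
    {
        for (int x = 0; x < sizex * sizey; ++x)
            buffer << out[d][x] << ',';
        buffer << '\n';                               // newline without flushing
    }
    std::ofstream data(savepfad + name + ".csv");
    data << buffer.str();                             // one single large write
}                                                     // the destructor closes and flushes once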
There are a couple of things you can do which may help, I would try implementing them one after another and measure the performance.
Don't flush after every line:
std::endl actually flushes the buffers and saves the file to the drive, that's probably killing the performance. So use << '\n';
You can try to minimize memory allocation and copying, if you buffer every line (or multiple lines) before writing it out. I would try to reserve a big string (std::string line; line.reserve(<big number enough for the full line>);) and do line += std::to_string(out[d][x]); line += ',';
You can optimize this even further by trying std::to_chars.
+1. If you are on Windows, you can try the latest MSVC; they reported a 5x speedup in float-to-string conversion (compared to the CRT functions) after implementing to_chars. https://www.youtube.com/watch?v=4P_kbF0EbZM
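A minimal sketch of the std::to_chars idea (C++17 <charconv>; it needs a standard library that implements floating-point to_chars, e.g. a recent MSVC; the helper name appendFloat is made up for the example):
#include <charconv>   // std::to_chars (C++17)
#include <string>

// Convert one float straight into a small stack buffer and append it to the
// output line, avoiding the temporary std::string that std::to_string creates.
void appendFloat(std::string& line, float value)
{
    char buf[32];
    auto result = std::to_chars(buf, buf + sizeof(buf), value);
    if (result.ec == std::errc())          // conversion succeeded
        line.append(buf, result.ptr);      // append only the characters written
    line += ',';
}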
The setup
Hello, I have Fortran code for reading in ASCII double precision data (example of data file at bottom of question):
program ReadData
integer :: mx,my,mz
doubleprecision, allocatable, dimension(:,:,:) :: charge
! Open the file 'CHGCAR'
open(11,file='CHGCAR',status='old')
! Get the extent of the 3D system and allocate the 3D array
read(11,*)mx,my,mz
allocate(charge(mx,my,mz) )
! Bulk read the entire block of ASCII data for the system
read(11,*) charge
end program ReadData
and the "equivalent" C++ code:
#include <fstream>
#include <vector>
using std::ifstream;
using std::vector;
using std::ios;
int main(){
int mx, my, mz;
// Open the file 'CHGCAR'
ifstream InFile("CHGCAR", ios::in);
// Get the extent of the 3D system and allocate the 3D array
InFile >> mx >> my >> mz;
vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));
// Method 1: std::ifstream extraction operator to double
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
for (int k = 0; k < mz; ++k)
InFile >> charge[i][j][k];
return 0;
}
Fortran kicking #$$ and taking names
Note that the line
read(11,*) charge
performs the same task as the C++ code:
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
for (int k = 0; k < mz; ++k)
InFile >> charge[i][j][k];
where InFile is an ifstream object (note that while indices in the Fortran code start at 1 and not 0, the range is the same).
However, the Fortran code runs way, way faster than the C++ code, I think because Fortran does something clever like reading/parsing the file according to the range and shape (values of mx, my, mz) all in one go, and then simply pointing charge to the memory the data was read to. The C++ code, by comparison, needs to access InFile and then charge (which is typically large) back and forth with each iteration, resulting in (I believe) many more IO and memory operations.
I'm reading in potentially billions of values (several gigabytes), so I really want to maximize performance.
My question:
How can I achieve the performance of the Fortran code in C++?
Moving on...
Here is a much faster (than the above C++) C++ implementation, where the file is read in one go into a char array, and then charge is populated as the char array is parsed:
#include <fstream>
#include <vector>
#include <cstdlib>
#include <cstring>
using std::ifstream;
using std::vector;
using std::ios;
int main(){
int mx, my, mz;
// Open the file 'CHGCAR'
ifstream InFile("CHGCAR", ios::in);
// Get the extent of the 3D system and allocate the 3D array
InFile >> mx >> my >> mz;
vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));
// Method 2: big char array with strtok() and atof()
// Get file size
InFile.seekg(0, InFile.end);
int FileSize = InFile.tellg();
InFile.seekg(0, InFile.beg);
// Read in entire file to FileData
vector<char> FileData(FileSize);
InFile.read(FileData.data(), FileSize);
InFile.close();
/*
* Now simply parse through the char array, saving each
* value to its place in the array of charge density
*/
char* TmpCStr = strtok(FileData.data(), " \n");
// Gets TmpCStr to the first data value
for (int i = 0; i < 3 && TmpCStr != NULL; ++i)
TmpCStr = strtok(NULL, " \n");
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
for (int k = 0; k < mz && TmpCStr != NULL; ++k){
charge[i][j][k] = atof(TmpCStr);
TmpCStr = strtok(NULL, " \n");
}
return 0;
}
Again, this is much faster than the simple >> operator-based method, but still considerably slower than the Fortran version--not to mention much more code.
How to get better performance?
I'm sure that method 2 is the way to go if I am to implement it myself, but I'm curious how I can increase performance to match the Fortran code. The types of things I'm considering and currently researching are:
C++11 and C++14 features
Optimized C or C++ library for doing just this type of thing
Improvements on the individual methods being used in method 2
tokenization library such as that in the C++ String Toolkit Library instead of strtok()
more efficient char to double conversion than atof()
C++ String Toolkit
In particular, the C++ String Toolkit Library will take FileData and the delimiters " \n" and give me a string token object (call it FileTokens); then the triple for loop would look like
for (int k = 0; k < Mz; ++k)
for (int j = 0; j < My; ++j)
for (int i = 0; i < Mx; ++i)
Charge[k][j][i] = FileTokens.nextFloatToken();
This would simplify the code slightly, but there is extra work in copying (in essence) the contents of FileData into FileTokens, which might kill any performance gains from using the nextFloatToken() method (presumably more efficient than the strtok()/atof() combination).
There is an example on the C++ String Toolkit (StrTk) Tokenizer tutorial page (included at the bottom of the question) using StrTk's for_each_line() processor that looks to be similar to my desired application. A difference between the cases, however, is that I cannot assume how many data will appear on each line of the input file, and I do not know enough about StrTk to say if this is a viable solution.
NOT A DUPLICATE
The topic of fast reading of ASCII data to an array or struct has come up before, but I have reviewed the following posts and their solutions were not sufficient:
Fastest way to read data from a lot of ASCII files
How to read numbers from an ASCII file (C++)
Read Numeric Data from a Text File in C++
Reading a file and storing the contents in an array
C/C++ Fast reading large ASCII data file to array or struct
Read ASCII file into matrix in C++
How can I read ASCII data file in C++
Reading a file and storing the contents in an array
Reading in data in columns from a file (C++)
The Fastest way to read a .txt File
How does fast input/ output work in C/C++, by using registers, hexadecimal number and the likes?
reading file into struct array
Example data
Here is an example of the data file I'm importing. The ASCII data is delimited by spaces and line breaks like the below example:
5 3 3
0.23080516813E+04 0.22712439791E+04 0.21616898980E+04 0.19829996749E+04 0.17438686650E+04
0.14601734127E+04 0.11551623512E+04 0.85678544224E+03 0.59238325489E+03 0.38232265554E+03
0.23514479113E+03 0.14651943589E+03 0.10252743482E+03 0.85927499703E+02 0.86525872161E+02
0.10141182750E+03 0.13113419142E+03 0.18057147781E+03 0.25973252462E+03 0.38303754418E+03
0.57142097675E+03 0.85963728360E+03 0.12548019843E+04 0.17106124085E+04 0.21415379433E+04
0.24687336309E+04 0.26588012477E+04 0.27189091499E+04 0.26588012477E+04 0.24687336309E+04
0.21415379433E+04 0.17106124085E+04 0.12548019843E+04 0.85963728360E+03 0.57142097675E+03
0.38303754418E+03 0.25973252462E+03 0.18057147781E+03 0.13113419142E+03 0.10141182750E+03
0.86525872161E+02 0.85927499703E+02 0.10252743482E+03 0.14651943589E+03 0.23514479113E+03
StrTk example
Here is the StrTk example mentioned above. The scenario is parsing the data file that contains the information for a 3D mesh:
input data:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
code:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
// Process points section
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
// Process triangles section
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
This...
vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));
...creates a temporary vector<double>(mz), with all 0.0 values, and copies it my times (or perhaps moves then copies my-1 times with a C++11 compiler, but little difference...) to create a temporary vector<vector<double>>(my, ...) which is then copied mx times (...as above...) to initialise all the data. You're reading data in over these elements anyway - there's no need to spend time initialising it here. Instead, create an empty charge and use nested loops to reserve() enough memory for the elements without populating them yet.
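As a rough sketch (not a drop-in replacement, just the shape of the reserve() idea with the question's mx/my/mz):
// Build the structure empty and only reserve() capacity, so no element is
// value-initialised before the read loop fills it with emplace_back.
std::vector<std::vector<std::vector<double>>> charge;
charge.reserve(mx);
for (int i = 0; i < mx; ++i) {
    charge.emplace_back();
    charge.back().reserve(my);
    for (int j = 0; j < my; ++j) {
        charge.back().emplace_back();
        charge.back().back().reserve(mz);   // innermost vectors get filled while reading
    }
}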
Next, check you're compiling with optimisation on. If you are and you're still slower than FORTRAN, in the data-populating nested loops try creating a reference to the vector you're about to .emplace_back elements onto:
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
{
std::vector<double>& v = charge[i][j];
for (int k = 0; k < mz; ++k)
{
double d;
InFile >> d;
v.emplace_back(d);
}
}
That shouldn't help if your optimiser's done a good job, but is worth trying as a sanity check.
If you're still slower - or just want to try to be even faster - you could try optimising your number parsing: you say your data's all formatted like 0.23080516813E+04 - with fixed sizes like that you can easily calculate how many bytes to read into a buffer to give you a decent number of values from memory, then for each value you could start an atol just after the . to extract 23080516813, then multiply it by 10 to the power of minus (11 (your number of digits) minus 04): for speed, keep a table of those powers of ten and index into it using the extracted exponent (i.e. 4). (Note that multiplying by e.g. 1E-7 can be faster than dividing by 1E7 on a lot of common hardware.)
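As a rough sketch of that idea - and only a sketch, because the offsets assume every token looks exactly like 0.23080516813E+04 (a leading "0.", eleven mantissa digits, then 'E' and a signed exponent):
#include <cstdlib>   // atol, atoi
#include <cmath>     // pow, used once to build the lookup table

// value = mantissa * 10^(exponent - 11) for tokens of the form 0.dddddddddddE+xx
double parseFixedFormat(const char* token)
{
    static double powersOfTen[64];            // lookup table, built on first use
    static bool initialised = false;
    if (!initialised) {
        for (int i = 0; i < 64; ++i) powersOfTen[i] = std::pow(10.0, i - 32);
        initialised = true;
    }
    long mantissa = std::atol(token + 2);     // digits after "0.", stops at 'E'
    int  exponent = std::atoi(token + 14);    // signed exponent after the 'E'
    return mantissa * powersOfTen[(exponent - 11) + 32];
}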
And if you want to blitz this thing, switch to using memory mapped file access. Worth considering boost::mapped_file_source as it's easier to use than even the POSIX API (let alone Windows), and portable, but programming directly against an OS API shouldn't be much of a struggle either.
UPDATE - response to first & second comments
Example of using boost memory mapping:
#include <boost/iostreams/device/mapped_file.hpp>
#include <sstream>
#include <cassert>
#include <cstring>

boost::iostreams::mapped_file_params params("dbldat.in");
boost::iostreams::mapped_file_source file(params);
assert(file.is_open());

const char* p = file.data();
const char* nl = strchr(p, '\n');
std::istringstream iss(std::string(p, nl - p));
size_t x, y, z;
assert(iss >> x >> y >> z);
The above maps the file into memory at address p, then parses the dimensions from the first line. Continue parsing the actual double representations from ++nl onwards. I mention an approach to that above, and you're concerned about the data format changing: you could add a version number to the file, so you can use the optimised parsing until the version number changes, then fall back on something generic for "unknown" file formats. As far as something generic goes, for in-memory text using int chars_to_skip; double my_double; assert(sscanf(ptr, "%lf%n", &my_double, &chars_to_skip) == 1); is reasonable (note %lf for double): see the sscanf documentation - you can then advance the pointer through the data by chars_to_skip.
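For what it's worth, the generic fallback could look roughly like this (just a sketch; parseAll and its parameters are made up for the example):
#include <cstdio>
#include <vector>

// Walk a pointer through the mapped text with sscanf; %n reports how many
// characters each conversion consumed, so we know how far to advance.
void parseAll(const char* ptr, const char* end, std::vector<double>& values)
{
    double v;
    int consumed;
    while (ptr < end && std::sscanf(ptr, "%lf%n", &v, &consumed) == 1) {
        values.push_back(v);
        ptr += consumed;
    }
}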
Next, are you suggesting to combine the reserve() solution with the reference creation solution?
Yes.
And (pardon my ignorance) why would using a reference to charge[i][j] and v.emplace_back() be better than charge[i][j].emplace_back()?
That suggestion was to sanity check that the compiler's not repeatedly evaluating charge[i][j] for each element being emplaced: hopefully it will make no performance difference and you can go back to charge[i][j].emplace_back(), but IMHO it's worth a quick check.
Lastly, I'm skeptical about using an empty vector and reserve()ing at the tops of each loop. I have another program that came to a grinding halt using that method, and replacing the reserve()s with a preallocated multidimensional vector sped it up a lot.
That's possible, but not necessarily true in general or applicable here - a lot depends on the compiler / optimiser (particularly loop unrolling) etc.. With unoptimised emplace_back you're having to check vector size() against capacity() repeatedly, but if the optimiser does a good job that should be reduced to insignificance. As with a lot of performance tuning, you often can't reason about things perfectly and conclude what's going to be fastest, and will have to try alternatives and measure them with your actual compiler, program data etc..
Is there a more effective way to compare data bytewise than using the comparison
operator of the C++ list container?
I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.
Therefore I read files bytewise and store the values in a list of unsigned chars.
The resources of this list are handled by a shared_ptr, so that I can pass it around in the program without needing to worry about the size of the list.
typedef boost::shared_ptr< std::list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;
contentPtr GetContent( fs::path filePath ){
    contentPtr actualContent( new std::list< unsigned char > );
    // Read the file with a stream, put the read values into actualContent
    return actualContent;
}
This is done twice, because there're always two copies of the file.
The content of these two files has to be compared, and throw an exception if a mismatch is found
void CompareContent() throw( NotMatchingException ){
// this part is very fast, below 50ms
contentPtr contentA = GetContent("/fileA");
contentPtr contentB = GetContent("/fileB");
// the next part takes about 2secs with a file size of ~64kByte
if( *contentA != *contentB )
throw( NotMatchingException() );
}
My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it will take up to two seconds to compare the content. Increasing and decreasing with the file size....
Is there a more effective way of doing this comparison? Is it a problem of the used container?
Don't use a std::list use a std::vector.
std::list is a linked-list, elements are not guaranteed to be stored contiguously.
std::vector on the other hand seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).
If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
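For example, a minimal sketch of that idea using a simple FNV-1a hash (for real integrity checking a stronger, e.g. cryptographic, hash would probably be preferable; this only shows the pattern of hashing each file once and comparing digests):
#include <cstdint>
#include <fstream>
#include <string>

// FNV-1a over the raw bytes of a file.
std::uint64_t hashFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
    char c;
    while (in.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ull;                   // FNV prime
    }
    return h;
}

// Equal hashes mean the files are almost certainly equal; unequal hashes
// mean they definitely differ.
bool probablyEqual(const std::string& a, const std::string& b)
{
    return hashFile(a) == hashFile(b);
}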
My first piece of advice would be to profile your code.
The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.
It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.
If there's nothing else to be done with the contents of those files (looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, not creating any containers at all?
Here's a piece of my code that compares two files bytewise:
// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
std::istreambuf_iterator<char>(),
std::istreambuf_iterator<char>(host_f)))
{
// we're good: move table to OutPath, remove other files
EDIT: if you do need to keep contents around, I think std::deque might be slightly more efficient than std::vector for the reasons explained in GOTW #54.. or not -- profiling will tell. And still, there would be the need for only one of the two identical files to be loaded in memory -- I'd read one into a deque and compare with the other file's istreambuf_iterator.
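A sketch of that variant (only the first file is held in memory; the helper name filesEqual is made up for the example):
#include <algorithm>
#include <deque>
#include <fstream>
#include <iterator>

bool filesEqual(const char* pathA, const char* pathB)
{
    std::ifstream fa(pathA, std::ios::binary);
    std::istreambuf_iterator<char> aBegin(fa), aEnd;
    std::deque<char> contentA(aBegin, aEnd);          // only file A is buffered

    std::ifstream fb(pathB, std::ios::binary);
    std::istreambuf_iterator<char> bBegin(fb), bEnd;
    // The four-iterator std::equal (C++14) also handles files of different length.
    return std::equal(contentA.begin(), contentA.end(), bBegin, bEnd);
}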
As you write, you are comparing contents of two files. Then you can make use of boost's mapped_files. You really do not need to read the whole file. You can read on the fly (in an optimized way as boost does) and stop when you find the first unequal byte...
Just like the very elegant solution in Cubbi's answer here: http://www.cplusplus.com/forum/general/94032/ Note that just below it he also adds some benchmarks which clearly show this is the fastest way. I will rewrite his answer a bit, add a check for zero-sized files (opening a zero-sized file with mapped_file_source would otherwise throw an exception), and enclose the test in a function to benefit from early returns:
#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>
namespace io = boost::iostreams;
namespace fs = boost::filesystem;
bool files_equal(const std::string& path1, const std::string& path2)
{
fs::path f1(path1);
fs::path f2(path2);
if (fs::file_size(f1) != fs::file_size(f2))
return false;
// zero-sized files cannot be opened with mapped_file_source
// hence we consider all zero-sized files equal
if (fs::file_size(f1) == 0)
return true;
io::mapped_file_source mf1(f1.string());
io::mapped_file_source mf2(f2.string());
return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}
int main()
{
if (files_equal("test.1", "test.2"))
std::cout << "The files are equal.\n";
else
std::cout << "The files are not equal.\n";
}
std::list is monumentally inefficient for a char element - there is overhead for every element to facilitate O(1) insertion and removal, which is really not what your task requires.
If you must use STL, then either std::vector or the iterator approach suggested would be preferable to std::list, but why not just read the data into a char* wrapped in some smart pointer of your choice and use memcmp?
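A minimal sketch of that suggestion (using std::vector<char> as the contiguous buffer for brevity; a char array behind a smart pointer would work the same way):
#include <cstring>    // memcmp
#include <fstream>
#include <iterator>
#include <vector>

bool sameBytes(const char* pathA, const char* pathB)
{
    auto slurp = [](const char* path) {
        std::ifstream in(path, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
    };
    std::vector<char> a = slurp(pathA);
    std::vector<char> b = slurp(pathB);
    return a.size() == b.size() &&
           (a.empty() || std::memcmp(a.data(), b.data(), a.size()) == 0);
}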
It is crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to code it in assembly language.)
In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):
std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks
What happens on your system?
#include <algorithm>
#include <vector>
#include <iostream>
#include <cstring>   // memcmp
#include <ctime>     // clock
#include <cstdlib>   // exit
using namespace std;
static char* buff ;
static vector<char> v0, v1 ;
static int const BufferSize = 100000 ;
static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;
int main (int argc, char** argv)
{
// Allocate a buffer
buff = new char[BufferSize] ;
// Create two vectors
vector<char> v0 (buff, buff + BufferSize) ;
vector<char> v1 (buff, buff + BufferSize) ;
clock_t t ;
// Compare them 10000 times using std::equal
t = StartTimer() ;
for (int i = 0 ; i < 10000 ; i++)
if (!equal (v0.begin(), v0.end(), v1.begin()))
cout << "Error in std::equal\n", exit (1) ;
t = EndTimer (t) ;
cout << "std::equal: " << t << " clock ticks\n" ;
// Compare them 10000 times using operator==
t = StartTimer() ;
for (int i = 0 ; i < 10000 ; i++)
if (v0 != v1)
cout << "Error in operator==\n", exit (1) ;
t = EndTimer (t) ;
cout << "operator==: " << t << " clock ticks\n" ;
// Compare them 10000 times using memcmp
t = StartTimer() ;
for (int i = 0 ; i < 10000 ; i++)
if (memcmp (&v0[0], &v1[0], v0.size()))
cout << "Error in memcmp\n", exit (1) ;
t = EndTimer (t) ;
cout << "memcmp: " << t << " clock ticks\n" ;
return 0 ;
}
static clock_t StartTimer()
{
// Start on a clock tick, to enhance reproducibility
clock_t t = clock() ;
while (clock() == t)
;
return clock() ;
}
static clock_t EndTimer (clock_t t)
{
return clock() - t ;
}
Here's a problem I've solved from a programming problem website (codechef.com, in case anyone wants to try it themselves before seeing a solution). My solution processed the test data in about 5.43 seconds; others have solved the same problem with the same test data in 0.14 seconds, but with much more complex code. Can anyone point out specific areas of my code where I am losing performance? I'm still learning C++, so I know there are a million ways I could solve this problem, but I'd like to know if I can improve my own solution with some subtle changes rather than rewrite the whole thing. Or, if there are any relatively simple solutions of comparable length that would perform better than mine, I'd be interested to see them as well.
Please keep in mind I'm learning C++ so my goal here is to improve the code I understand, not just to be given a perfect solution.
Thanks
Problem:
The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.
The input begins with two positive integers n k (n, k<=10^7). The next n lines of input contain one positive integer ti, not greater than 10^9, each.
Output
Write a single integer to output, denoting how many integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
Solution:
#include <iostream>
#include <stdio.h>
using namespace std;
int main(){
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total;
//initialize total to zero
total=0;
//read in n and k from stdin
scanf("%i%i",&n,&k);
//loop n times and if k divides into n, increment total
for (n; n>0; n--)
{
scanf("%i",&inputnum);
if(inputnum % k==0) total += 1;
}
//output value of total
printf("%i",total);
return 0;
}
The speed is not being determined by the computation—most of the time the program takes to run is consumed by i/o.
Add setvbuf calls before the first scanf for a significant improvement:
setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);
-- edit --
The alleged magic numbers are the new buffer size. By default, FILE uses a buffer of 512 bytes. Increasing this size decreases the number of times that the C++ runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.
By keeping the buffer size a multiple of 512, that eliminates buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16 bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe managing it. For any larger system, make it as large as useful—depending on available memory and what else it will be competing against.
Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while a default 512 byte buffer requires 500 reads. Further increasing the buffer size so only one read is needed is unlikely to make a significant performance improvement over three reads.
I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then finishes up to the next newline. The goal here is to reduce I/O costs, since scanf() is reading a character at a time. Even with stdio, the buffer is likely too small.
Once the block is ready, I parse the numbers directly in memory.
This isn't the most elegant of codes, and I might have some edge cases a bit off, but it's enough to get you going with a faster approach.
Here are the timings (without the optimizer my solution is only about 6-7 times faster than your original reference)
[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w
Here's the code.
#include <iostream>
#include <stdio.h>
using namespace std;
const int BUFFER_SIZE=400000;
const int EXTRA=30; // well over the size of an integer
void read_to_newline(char *buffer) {
int c;
while (1) {
c = getc_unlocked(stdin);
if (c == '\n' || c == EOF) {
*buffer = '\0';
return;
}
*buffer++ = c;
}
}
int main() {
char buffer[BUFFER_SIZE+EXTRA];
char *end_buffer;
char *startptr, *endptr;
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total,nbytes;
//initialize total to zero
total=0;
//read in n and k from stdin
read_to_newline(buffer);
sscanf(buffer, "%i%i",&n,&k);
while (1) {
// Read a large block of values
// There should be one integer per line, with nothing else.
// This might truncate an integer!
nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
if (nbytes == 0) {
cerr << "Reached end of file too early" << endl;
break;
}
// Make sure I read to the next newline.
read_to_newline(buffer+nbytes);
startptr = buffer;
while (n>0) {
inputnum = 0;
// I had used strtol but that was too slow
// inputnum = strtol(startptr, &endptr, 10);
// Instead, parse the integers myself.
endptr = startptr;
while (*endptr >= '0') {
inputnum = inputnum * 10 + *endptr - '0';
endptr++;
}
// *endptr might be a '\n' or '\0'
// Might occur with the last field
if (startptr == endptr) {
break;
}
// skip the newline; go to the
// first digit of the next number.
if (*endptr == '\n') {
endptr++;
}
// Test if this is a factor
if (inputnum % k==0) total += 1;
// Advance to the next number
startptr = endptr;
// Reduce the count by one
n--;
}
// Either we are done, or we need new data
if (n==0) {
break;
}
}
// output value of total
printf("%i\n",total);
return 0;
}
Oh, and it very much assumes the input data is in the right format.
Try replacing the if statement with total += ((inputnum % k) == 0);. That might help a little bit.
But I think you really need to buffer your input into a temporary array. Reading one integer from the input at a time is expensive. If you can separate data acquisition from data processing, the compiler may be able to generate optimized code for the mathematical operations.
The I/O operations are bottleneck. Try to limit them whenever you can, for instance load all data to a buffer or array with buffered stream in one step.
Although your example is so simple that I hardly see what you can eliminate - assuming it's a part of the question to do subsequent reading from stdin.
A few comments on the code: your example doesn't make use of any streams, so there is no need to include the iostream header. You already load the C library elements into the global namespace by including stdio.h instead of the C++ version of the header, cstdio, so using namespace std is not necessary either.
You can read each line with gets(), and parse the strings yourself without scanf(). (Normally I wouldn't recommend gets(), but in this case, the input is well-specified.)
A sample C program to solve this problem:
#include <stdio.h>
int main() {
int n,k,in,tot=0,i;
char s[1024];
gets(s);
sscanf(s,"%d %d",&n,&k);
while(n--) {
gets(s);
in=s[0]-'0';
for(i=1; s[i]!=0; i++) {
in=in*10 + s[i]-'0'; /* For each digit read, multiply the previous
value of in with 10 and add the current digit */
}
tot += in%k==0; /* returns 1 if in%k is 0, 0 otherwise */
}
printf("%d\n",tot);
return 0;
}
This program is approximately 2.6 times faster than the solution you gave above (on my machine).
You could try to read input line by line and use atoi() for each input row. This should be a little bit faster than scanf, because you remove the "scan" overhead of the format string.
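A minimal sketch of what that could look like (assuming, as in the problem statement, one integer per line, and using fgets() here to read the lines):
#include <cstdio>
#include <cstdlib>

int main()
{
    char line[64];
    int n, k, total = 0;
    if (!std::fgets(line, sizeof line, stdin)) return 1;   // the "n k" line
    std::sscanf(line, "%d %d", &n, &k);
    while (n-- > 0 && std::fgets(line, sizeof line, stdin))
        if (std::atoi(line) % k == 0)                      // no format-string scanning
            ++total;
    std::printf("%d\n", total);
    return 0;
}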
I think the code is fine. I ran it on my computer in less than 0.3s
I even ran it on much larger inputs in less than a second.
How are you timing it?
One small thing you could do is remove the if statement.
Start with total = n and then, inside the loop, subtract 1 for every value that is not divisible:
total -= (inputnum % k != 0); // subtracts 0 if divisible, 1 if not
Though I doubt CodeChef will accept it, one possibility is to use multiple threads, one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but can help even with a single core. For example, on Windows you could use code like this (no real attempt at conforming with CodeChef requirements -- I doubt they'll accept it with the timing data in the output):
#include <windows.h>
#include <process.h>
#include <iostream>
#include <cstring>    // memcpy
#include <cctype>     // isdigit
#include <time.h>
#include "queue.hpp"
namespace jvc = JVC_thread_queue;
struct buffer {
static const int initial_size = 1024 * 1024;
char buf[initial_size];
size_t size;
buffer() : size(initial_size) {}
};
jvc::queue<buffer *> outputs;
void read(HANDLE file) {
// read data from specified file, put into buffers for processing.
//
char temp[32];
int temp_len = 0;
int i;
buffer *b;
DWORD read;
do {
b = new buffer;
// If we have a partial line from the previous buffer, copy it into this one.
if (temp_len != 0)
memcpy(b->buf, temp, temp_len);
// Then fill the buffer with data.
ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);
// Look for partial line at end of buffer.
for (i=read; b->buf[i] != '\n'; --i)
;
// copy partial line to holding area.
memcpy(temp, b->buf+i, temp_len=read-i);
// adjust size.
b->size = i;
// put buffer into queue for processing thread.
// transfers ownership.
outputs.add(b);
} while (read != 0);
}
// A simplified istrstream that can only read int's.
class num_reader {
buffer &b;
char *pos;
char *end;
public:
num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size) {}
num_reader &operator>>(int &value){
int v = 0;
// skip leading "stuff" up to the first digit.
while ((pos < end) && !isdigit(*pos))
++pos;
// read digits, create value from them.
while ((pos < end) && isdigit(*pos)) {
v = 10 * v + *pos-'0';
++pos;
}
value = v;
return *this;
}
// return stream status -- only whether we're at end
operator bool() { return pos < end; }
};
int result;
unsigned __stdcall processing_thread(void *) {
int value;
int n, k;
int count = 0;
// Read first buffer: n & k followed by values.
buffer *b = outputs.pop();
num_reader input(b);
input >> n;
input >> k;
while (input >> value && ++count < n)
result += ((value %k ) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
// Then read subsequent buffers:
while ((b=outputs.pop()) && (b->size != 0)) {
num_reader input(b);
while (input >> value && ++count < n)
result += ((value %k) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
}
return 0;
}
int main() {
HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);
clock_t start = clock();
read(standard_input);
WaitForSingleObject(processor, INFINITE);
clock_t finish = clock();
std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
std::cout << result;
return 0;
}
This uses a thread-safe queue class I wrote years ago:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED
namespace JVC_thread_queue {
template<class T, unsigned max = 256>
class queue {
HANDLE space_avail; // at least one slot empty
HANDLE data_avail; // at least one slot full
CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
T buffer[max];
long in_pos, out_pos;
public:
queue() : in_pos(0), out_pos(0) {
space_avail = CreateSemaphore(NULL, max, max, NULL);
data_avail = CreateSemaphore(NULL, 0, max, NULL);
InitializeCriticalSection(&mutex);
}
void add(T data) {
WaitForSingleObject(space_avail, INFINITE);
EnterCriticalSection(&mutex);
buffer[in_pos] = data;
in_pos = (in_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(data_avail, 1, NULL);
}
T pop() {
WaitForSingleObject(data_avail,INFINITE);
EnterCriticalSection(&mutex);
T retval = buffer[out_pos];
out_pos = (out_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(space_avail, 1, NULL);
return retval;
}
~queue() {
DeleteCriticalSection(&mutex);
CloseHandle(data_avail);
CloseHandle(space_avail);
}
};
}
#endif
Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.
2.5mb/sec is 400ns/byte.
There are two big per-byte processes, file input and parsing.
For the file input, I would just load it into a big memory buffer. fread should be able to read that in at roughly full disc bandwidth.
For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:
#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
while(*p && *p <= ' ') p++; // scan over whitespace
if (!DIGIT(*p)) return false;
num = 0;
while(DIGIT(*p)){
num = num * 10 + (*p++ - '0');
}
return true;
}
The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly a lot less than 400ns/byte.
Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by looking at some of the smaller primes, say 2, 3, and 5 for now. If k is divisible by one of these primes but inputnum is not, then inputnum cannot be divisible by k. Of course there are more tricks to play (you could use a bitwise AND of inputnum with 1 to determine whether it is divisible by 2), but I think just removing the low-prime possibilities will give a reasonable speed improvement (worth a shot anyway).
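A rough sketch of the idea (whether it actually pays off depends on how expensive the modulo by k really is, so it would have to be measured):
// If a small prime p divides k but not inputnum, then k cannot divide
// inputnum, so the cheap test on p lets us skip the full modulo by k.
bool divisibleByK(int inputnum, int k)
{
    static const int smallPrimes[] = { 2, 3, 5 };
    for (int p : smallPrimes)
        if (k % p == 0 && inputnum % p != 0)
            return false;               // quick rejection
    return inputnum % k == 0;           // full check only when the filter passes
}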