While looking at some of our logging I noticed in the profiler that we were spending a lot of time in operator<< formatting ints and such. It looks like there is a shared lock that is taken whenever ostream::operator<< is called to format an int (and presumably doubles). Upon further investigation I've narrowed it down to this example:
Loop1 uses an ostringstream to do the formatting:
DWORD WINAPI doWork1(void* param)
{
    int nTimes = *static_cast<int*>(param);
    for (int i = 0; i < nTimes; ++i)
    {
        ostringstream out;
        out << "[0";
        for (int j = 1; j < 100; ++j)
            out << ", " << j;
        out << "]\n";
    }
    return 0;
}
Loop2 uses the same ostringstream for everything except the int formatting, which is done with itoa:
DWORD WINAPI doWork2(void* param)
{
    int nTimes = *static_cast<int*>(param);
    for (int i = 0; i < nTimes; ++i)
    {
        ostringstream out;
        char buffer[13];
        out << "[0";
        for (int j = 1; j < 100; ++j)
        {
            _itoa_s(j, buffer, 10);
            out << ", " << buffer;
        }
        out << "]\n";
    }
    return 0;
}
For my test I ran each loop a number of times with 1, 2, 3 and 4 threads (I have a 4-core machine). The number of trials is constant. Here is the output:
doWork1: all ostringstream
n   Total
1     557
2    8092
3   15916
4   15501

doWork2: use itoa
n   Total
1     200
2     112
3     100
4     105
As you can see, the performance when using ostringstream is abysmal: it gets roughly 30 times worse as threads are added, whereas the itoa version gets about 2 times faster.
One idea is to use _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) as recommended by Microsoft in this article. That doesn't seem to help me. Here's another user who seems to be having a similar issue.
We need to be able to format ints in several threads running in parallel for our application. Given this issue, we either need to figure out how to make that work or find another formatting solution. I may code up a simple class with operator<< overloaded for the integral and floating-point types, plus a templated version that just calls operator<< on the underlying stream. A bit ugly, but I think I can make it work, though maybe not for a user-defined operator<<(ostream&, T), because the wrapper is not an ostream.
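For illustration, here is a minimal sketch of what that wrapper might look like (hypothetical names; _gcvt_s is shown as one locale-free conversion for doubles, an assumption rather than a measured fix):

#include <sstream>
#include <string>
#include <stdlib.h>  // _itoa_s, _gcvt_s, _CVTBUFSIZE

class FastStream {
    std::ostringstream m_out;
public:
    FastStream& operator<<(int v) {
        char buf[13];
        _itoa_s(v, buf, 10);               // avoids the stream's locale machinery
        m_out << buf;
        return *this;
    }
    FastStream& operator<<(double v) {
        char buf[_CVTBUFSIZE];
        _gcvt_s(buf, _CVTBUFSIZE, v, 12);  // locale-free double conversion
        m_out << buf;
        return *this;
    }
    template <typename T>
    FastStream& operator<<(const T& v) {
        m_out << v;                        // everything else takes the normal stream path
        return *this;
    }
    std::string str() const { return m_out.str(); }
};

Since the template overload forwards to the underlying ostringstream, a user-defined operator<<(ostream&, T) still works through it; manipulators such as std::endl, however, would need their own overloads because template deduction fails on them.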
I should also make clear that this is being built with Microsoft Visual Studio 2005. And I believe this limitation comes from their implementation of the standard library.
If Visual Studio 2005's standard library implementation has bugs, why not try other implementations? For example:
STLport
Apache C++ Standard Library (STDCXX)
or even Dinkumware, on which the Visual Studio 2005 standard library is based; maybe they have fixed the problem since 2005.
Edit: The other user you mentioned used Visual Studio 2008 SP1, which suggests that Dinkumware has not fixed this issue.
Doesn't surprise me, MS has put "global" locks on a fair few shared resources - the biggest headache for us was the BSTR memory lock a few years back.
The best thing you can do is copy the code and replace the ostream lock and shared conversion memory with your own class. I have done that where I write the stream using a printf-style logging system (i.e. I had to use a printf logger, and wrapped it with my stream operators). Once you've compiled that into your app, you should be as fast as itoa. When I'm in the office I'll grab some of the code and paste it for you.
EDIT:
as promised:
CLogger& operator<<(long l)
{
    if (m_LoggingLevel < m_levelFilter)
        return *this;
    // 33 is the max length of data returned from _ltot
    resize(33);
    _ltot(l, buffer + m_length, m_base);
    m_length += (long)_tcslen(buffer + m_length);
    return *this;
};

static CLogger& hex(CLogger& c)
{
    c.m_base = 16;
    return c;
};

void resize(long extra)
{
    if (extra + m_length > m_size)
    {
        // resize buffer to fit.
        TCHAR* old_buffer = buffer;
        m_size += extra;
        buffer = (TCHAR*)malloc(m_size * sizeof(TCHAR));
        _tcsncpy(buffer, old_buffer, m_length + 1);
        free(old_buffer);
    }
}

static CLogger& endl(CLogger& c)
{
    if (c.m_length == 0 && c.m_LoggingLevel < c.m_levelFilter)
        return c;
    c.Write();
    return c;
};
Sorry I can't let you have all of it, but those 3 methods show the basics - I allocate a buffer, resize it if needed (m_size is buffer size, m_length is current text length) and keep it for the duration of the logging object. The buffer contents get written to file (or OutputDebugString, or a listbox) in the endl method. I also have a logging 'level' to restrict output at runtime. So you just replace your calls to ostringstream with this, and the Write() method pumps the buffer to a file and clears the length. Hope this helps.
The problem could be memory allocation. malloc, which "new" uses internally, has an internal lock; you can see it if you step into it. Try to use a thread-local allocator and see if the bad performance disappears.
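For what it's worth, here is a sketch of that idea on Windows (an assumption, not something measured against this exact problem): give each thread its own Win32 heap created with HEAP_NO_SERIALIZE, so allocations skip the global lock.

#include <windows.h>

__declspec(thread) HANDLE t_heap = NULL;  // one private heap per thread

void* thread_local_alloc(size_t n)
{
    if (t_heap == NULL)
        t_heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);  // heap with no internal locking
    return HeapAlloc(t_heap, 0, n);
}

void thread_local_free(void* p)
{
    HeapFree(t_heap, 0, p);  // must be called on the thread that allocated p
}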
I have a super fast M.2 drive. How fast is it? It doesn’t matter because I cannot utilize this speed anyway. That’s why I’m asking this question.
I have an app that needs a lot of memory - so much that it won't fit in RAM. Fortunately it is not needed all at once; instead it is used to save intermediate results from computations.
Unfortunately the application is not able to write and reads this data fast enough. I tried using multiple reader and writer threads but it only made it worse (later I read that it is because of this).
So my question is: is it possible to have truly asynchronous file IO in C++ to fully exploit those advertised gigabytes per second? If it is, then how (in a cross-platform way)?
You could also recommend a library that's good with tasks like that if you know one, because I believe there is no point in reinventing the wheel.
Edit:
Here is code that shows how I do file IO in my program. It isn't from the program mentioned above, because that wouldn't be minimal, but it illustrates the problem nevertheless. Do not mind Windows.h; it is used only to set thread affinity. In the actual program I also set affinity, which is why I included it.
#include <fstream>
#include <thread>
#include <memory>
#include <string>
#include <Windows.h> // for SetThreadAffinityMask()

void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}

void lock_thread(unsigned core_idx)
{
    SetThreadAffinityMask(GetCurrentThread(), 1LL << core_idx);
}

int main()
{
    std::ios_base::sync_with_stdio(false);
    lock_thread(0);
    auto worker_count = std::thread::hardware_concurrency() - 1;
    std::unique_ptr<std::thread[]> threads = std::make_unique<std::thread[]>(worker_count); // faster than std::vector
    for (int i = 0; i < worker_count; ++i)
    {
        threads[i] = std::thread(
            [](unsigned idx) {
                lock_thread(idx);
                stress_write(1'000'000'000, idx);
            },
            i + 1
        );
    }
    stress_write(1'000'000'000, 0);
    for (int i = 0; i < worker_count; ++i)
    {
        threads[i].join();
    }
}
As you can see, it's just plain old fstream. On my machine this uses 100% CPU, but only 7-9% disk (around 190 MB/s). I am wondering if it could be increased.
The easiest thing to get (up to) a 10x speed up is to change this:
void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}
to this:
void stress_write(unsigned bytes, int num)
{
    constexpr auto chunk_size = (1u << 12u); // tune as needed
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned chunk = 0; chunk < (bytes + chunk_size - 1) / chunk_size; ++chunk)
    {
        char chunk_buff[chunk_size];
        auto count = (std::min)(bytes - chunk_size * chunk, chunk_size);
        for (unsigned j = 0; j < count; ++j)
        {
            unsigned i = j + chunk_size * chunk;
            chunk_buff[j] = char(i); // processing
        }
        out.write(chunk_buff, count);
    }
}
where we group writes up to 4096 bytes before sending to the std ofstream.
The streaming operations have a number of annoying, hard for compilers to elide, virtual calls that dominate performance when you are writing only a handful of bytes at a time.
By chunking data into larger pieces we make the vtable lookups rare enough that they no longer dominate.
See this SO post for more details as to why.
To get the last iota of performance, you may have to use something like boost.asio or access your platform's raw async file IO libraries.
But when you are working at < 10% of the drive bandwidth while railing your CPU, aim at the low-hanging fruit first.
Chunking the I/O is indeed the most important optimization here and should suffice in most cases. However, the direct answer to the exact question asked about asynchronous IO is the following.
Boost::Asio added support for file operations in version 1.21.0. The interface is similar to the rest of Asio.
First, we need to create an object representing a file. The most common use cases would use either a random_access_file or a stream_file. For this example code, a streaming file is enough.
Reading is done through async_read_some, but the usual async_read helper function can be used to read a specific number of bytes at once.
If the operating system supports that, these operations do indeed run in the background and use little processor time. Both Windows and Linux do support this.
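As a minimal sketch of what that looks like (my own example, assuming Boost 1.21+; on Linux it additionally needs an io_uring-enabled build, e.g. -DBOOST_ASIO_HAS_IO_URING plus linking liburing), reading back one of the temp files the test program writes:

#include <boost/asio.hpp>
#include <array>
#include <iostream>

int main()
{
    boost::asio::io_context ctx;
    boost::asio::stream_file file(ctx, "temp0",
                                  boost::asio::stream_file::read_only);
    std::array<char, 4096> buf;
    file.async_read_some(boost::asio::buffer(buf),
        [](boost::system::error_code ec, std::size_t n) {
            if (!ec)
                std::cout << "read " << n << " bytes\n";  // completion handler
        });
    ctx.run();  // the read proceeds asynchronously until the handler runs
}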
I'm making an ASCII game and I need performance, so I decided to go with printf(). But there is a problem: I designed my character grid as a multidimensional char ** array, and printing it outputs memory garbage instead of the data. I know it's possible to print it with a for loop, but performance drops rapidly that way. I need to printf it like a static array[][]. Is there a way?
I made an example with a working and a non-working array. I only need printf() to work with the notWorking array.
edit: using Visual Studio 2015 on Win 10, and yes, I tested performance: cout is much slower than printf (though I don't really know why that is)
#include <iostream>
#include <cstdio>
#include <cstdlib> // system()

int main()
{
    const int X_SIZE = 40;
    const int Y_SIZE = 20;
    char works[Y_SIZE][X_SIZE];
    char** notWorking;
    notWorking = new char*[Y_SIZE];
    for (int i = 0; i < Y_SIZE; i++) {
        notWorking[i] = new char[X_SIZE];
    }
    for (int i = 0; i < Y_SIZE; i++) {
        for (int j = 0; j < X_SIZE; j++) {
            works[i][j] = '#';
            notWorking[i][j] = '#';
        }
        works[i][X_SIZE - 1] = '\n';
        notWorking[i][X_SIZE - 1] = '\n';
    }
    works[Y_SIZE - 1][X_SIZE - 1] = '\0';
    notWorking[Y_SIZE - 1][X_SIZE - 1] = '\0';
    printf("%s\n\n", works);
    printf("%s\n\n", notWorking);
    system("PAUSE");
}
Note: I think I could make some kind of a buffer or static array for just copying and displaying data, but I wonder if that can be done without it.
If you would like to print a 2D structure with printf without a loop, you need to present it to printf as a contiguous one-dimensional C string. Since your game needs access to the string as a 2D structure, you could make an array of pointers into this flat structure, which would look like this:
Array of pointers partitions the buffer for use as a 2D structure, while the buffer itself can be printed by printf because it is a contiguous C string.
Here is the same structure in code:
// X_SIZE+1 is for '\n's; overall +1 is for '\0'
char buffer[Y_SIZE*(X_SIZE+1)+1];
char *array[Y_SIZE];

// Set up the buffer and the array
for (int r = 0 ; r != Y_SIZE ; r++) {
    array[r] = &buffer[r*(X_SIZE+1)];
    for (int c = 0 ; c != X_SIZE ; c++) {
        array[r][c] = '#';
    }
    array[r][X_SIZE] = '\n';
}
buffer[Y_SIZE*(X_SIZE+1)] = '\0';
printf("%s\n", buffer);
Some things you can do to increase performance:
There is absolutely no reason to have an array of pointers, each pointing at an array. This will cause heap fragmentation, as your data will end up all over the heap. Allocating memory in adjacent cells has many benefits in terms of speed; for example, it might improve the use of the data cache.
Instead, allocate a true 2D array:
char (*array2D)[Y] = new char[X][Y];
printf as well as cout are both incredibly slow, as they come with tons of overhead and extra features which you don't need. Since they are just advanced wrappers around the system-specific console functions, you should consider using the system-specific functions directly, for example the Windows console API. It will, however, make your program non-portable.
If that's not an option, you could try to use puts instead of printf, since it has far less overhead.
The main performance issue with printf/cout is that they write to the end of the "standard output stream", meaning you can't write where you like, only at the bottom of the screen. This forces you to redraw the whole thing every time something changes, which will be slow and may cause flicker issues.
Old DOS/Turbo C programs solved this with a non-standard function called gotoxy, which allowed you to move the "cursor" and print where you liked. In modern programming, you can do this with the console API functions (example for Windows; see the sketch after this list).
You could/should separate graphics from the rest of the program. If you have one thread handling graphics only and the main thread handling algorithms, the graphics updates will run smoother, without having to wait for whatever else the program is doing. It makes the program far more advanced though, as you have to consider thread-safety issues.
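As a concrete illustration of the gotoxy-style approach mentioned above, here is a small sketch using documented Win32 console calls (my example, not the original answer's code):

#include <windows.h>
#include <cstring>

// Move the console cursor and print a string at that position,
// much like the old gotoxy() + printf() pattern.
void print_at(short x, short y, const char* s)
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    COORD pos = { x, y };
    SetConsoleCursorPosition(h, pos);
    DWORD written = 0;
    WriteConsoleA(h, s, static_cast<DWORD>(std::strlen(s)), &written, NULL);
}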
My program crashes only when running the release executable (no problems occur when running through Visual Studio). When using the "attach to process" function, Visual Studio indicates the crash occurred in the following function:
World::blockmap World::newBlankBlockmap(int sideLen, int h){
    cout << "newBlankBlockmap side: " << std::to_string((long long)sideLen) << endl;
    cout << "newBlankBlockmap height: " << std::to_string((long long)h) << endl;
    short*** bm = new short**[sideLen];
    for(int i = 0; i < sideLen; i++){
        bm[i] = new short*[h];
        for(int j = 0; j < h; j++){
            bm[i][j] = new short[sideLen];
            for (int k = 0; k < sideLen; k++)
            {
                bm[i][j][k] = blocks->getAIR_BLOCK();
            }
        }
    }
    return (blockmap)bm;
}
Which is called from a child class...
World::chunk* World_E::newChunkMap(World::floatmap north, World::floatmap east, World::floatmap south, World::floatmap west,
                                   float lowlow, float highlow, float highhigh, float lowhigh, bool displaceSides){
    World::chunk* c = newChunk(World::CHUNK_SIZE + 1, World::HEIGHT);
    for (int i = 0; i < World::CHUNK_SIZE; i++)
    {
        for (int k = 0; k < World::CHUNK_SIZE; k++)
        {
            c->bm[i][0][k] = blocks->getDUMMY_BLOCK();
        }
    }
    c->bm[(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
    c->bm[(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
    c->bm[(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
    c->bm[(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
    return c;
}
where...
class World {
public:
    typedef short*** blockmap;
    ...
The line which VS points at is...
short*** bm = new short**[sideLen];
The "attach to process" function stats the Local variables are...
sideLen = 1911407648
h = 0
which is what I did NOT expect, but the cout outputs 9 and 30 respectively, which was expected.
I am aware that most "crashes in release only" problems are due to uninitialized variables; however, I fail to see how that is related here.
The only error message I get is...
Windows has triggered a breakpoint in Blocks Project.exe.
This may be due to a corruption of the heap
I am stumped on this problem. What's the error? How can I better debug a release executable?
I can post more code if needed, however, bear in mind there is a lot of it.
Thank you in advance.
"And I don't see World::newBlankBlockmap() called from that second chunk of code. – Michael Burr", I forgot that bit, here you go...
World::chunk* World::newChunk(int side, int height){
    cout << "newChunk side: " << std::to_string((long long)side) << endl;
    cout << "newChunk height: " << std::to_string((long long)height) << endl;
    chunk* ch = new chunk();
    ch->bm = newBlankBlockmap(side, height);
    ch->fm = newBlankFloatmap(side);
    return ch;
}
where...
struct chunk {
    blockmap bm;
    floatmap fm;
};
as defined in the World class
To reiterate what the comments were hinting at: from what you've posted, your code seems to be badly structured. Triple-pointer constructs like short*** are almost impossible to debug and should be avoided at all costs. The heap-corruption error message you got suggests that you have a bad memory access somewhere in your code, which is impossible to find automatically with your current setup.
Your only options at this point are to either dig through your entire code manually until you've found the bug, or start refactoring. The latter might seem like the more time-consuming option now, but it won't be if you plan to work with this code in the future.
Consider the following as possible hints for a refactoring:
Don't use plain arrays for storing values. std::vector is just as effective and a lot easier to debug.
Avoid plain new and delete. In modern C++ with the STL containers and smart pointers, plain memory allocation should only happen in very rare exceptional cases.
Always range-check your array access operations. If you worry about performance, use asserts which disappear in release builds, but be sure the checks are there when you need them for debugging.
Modeling three-dimensional arrays in C++ can be tricky, since operator[] only offers support for one-dimensional arrays. A nice compromise is using operator() instead, which can take an arbitrary number of indices (see the sketch after this list).
Avoid C-style casts. They can be very unpredictable. Use the C++ casts static_cast, dynamic_cast and reinterpret_cast instead. If you find yourself using reinterpret_cast regularly, you probably have a mistake in your design somewhere.
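To make the operator() suggestion concrete, here is a minimal sketch (hypothetical names, not the poster's code) that stores the whole 3D block map in one contiguous allocation and range-checks with assert:

#include <vector>
#include <cassert>
#include <cstddef>

class Blockmap3D {
    std::vector<short> data;
    int nx, ny, nz;
public:
    Blockmap3D(int x, int y, int z, short fill)
        : data(std::size_t(x) * y * z, fill), nx(x), ny(y), nz(z) {}

    short& operator()(int x, int y, int z) {
        assert(x >= 0 && x < nx && y >= 0 && y < ny && z >= 0 && z < nz);
        return data[(std::size_t(x) * ny + y) * nz + z];  // one contiguous block
    }
};

With something like this, a bad index fails loudly in a debug build instead of silently corrupting the heap.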
There is a problem in this line: short*** bm = new short**[sideLen];. The memory is allocated for sideLen elements, but the assignment line bm[i][j][k] = blocks->getAIR_BLOCK(); requires an array of size sideLen * sideLen * h. To fix this problem, change the first line to short*** bm = new short**[sideLen * sideLen * h];.
Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>

int irand(int top) {
    return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}

template<typename T>
class Vector {
    T *vec;
    const int sz;
public:
    Vector(int n) : sz(n) {
        vec = new T[sz];
    }
    ~Vector() {
        delete [] vec;
    }
    int size() const { return sz; }
    const T & operator [] (int i) const { return vec[i]; }
    T & operator [] (int i) { return vec[i]; }
};

int main() {
    const int tmax = 20000; // increase this to make it run longer
    const int m = 10000;

    Vector<int> vec(150);
    for (int i = 0; i < vec.size(); ++i)
        vec[i] = 0;

    // main loop
    for (int t = 0; t < tmax; ++t)
        for (int j = 0; j < m; ++j) {
            int s = irand(100) + 1;
            vec[s] += 1;
        }

    return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
    static int c = 0;
    return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I have replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point is to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or does the compiler generate calls to something else?
class ran1 {
    static const int table_len = 32;
    static const int int_max = (1u << 31) - 1;
    int idum;
    int next;
    int *shuffle_table;
    void propagate() {
        const int int_quo = 1277731;
        int k = idum/int_quo;
        idum = 16807*(idum - k*int_quo) - 2836*k;
        if (idum < 0)
            idum += int_max;
    }
public:
    ran1() {
        shuffle_table = new int[table_len];
        seedrand(54321);
    }
    ~ran1() {
        delete [] shuffle_table;
    }
    void seedrand(int seed) {
        idum = seed;
        for (int i = table_len-1; i >= 0; i--) {
            propagate();
            shuffle_table[i] = idum;
        }
        next = idum;
    }
    double frand() {
        int i = next/(1 + (int_max-1)/table_len);
        next = shuffle_table[i];
        propagate();
        shuffle_table[i] = idum;
        return next/(int_max + 1.0);
    }
} rng;

int irand(int top) {
    return int(std::floor(rng.frand() * top));
}
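One quick way to probe whether floor() is to blame (a sketch, not from the original post): drop the call entirely. For the non-negative values frand() produces, conversion to int already truncates toward zero, so the behavior is unchanged.

int irand(int top) {
    return int(rng.frand() * top);  // frand() is in [0, 1), so floor() is redundant here
}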
edit: It turned out that the culprit was floor() and not rand() as I suspected - see the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
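Such a test could be as simple as the following sketch (my example): build it once, then time the same binary under both Windows and Wine.

#include <cstdlib>
#include <cstdio>
#include <ctime>

int main()
{
    std::srand(12345);
    unsigned long sum = 0;
    std::clock_t t0 = std::clock();
    for (long i = 0; i < 100000000L; ++i)
        sum += std::rand();  // accumulate so the loop isn't optimized away
    std::clock_t t1 = std::clock();
    std::printf("sum=%lu, %.2f s\n", sum, double(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}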
edit: I've had a look at the WINE source code, and here is its implementation of rand():
/*********************************************************************
 * rand (MSVCRT.#)
 */
int CDECL MSVCRT_rand(void)
{
    thread_data_t *data = msvcrt_get_thread_data();

    /* this is the algorithm used by MSVC, according to
     * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
    data->random_seed = data->random_seed * 214013 + 2531011;
    return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions of a Windows computer by providing alternative implementations of the DLLs that Windows programs call,[citation needed] and a process to substitute for the Windows NT kernel. This method of duplication differs from other methods that might also be considered emulation, where Windows programs run in a virtual machine.[2] Wine is predominantly written using black-box testing reverse-engineering, to avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.
Is there a more effective way to compare data bytewise than using the comparison operator of the C++ list container?
I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.
Therefore I read files bytewise and store the values in a list of unsigned chars.
The resources of this list are handled by a shared_ptr, so that I can pass it around in the program without needing to worry about the size of the list:
typedef boost::shared_ptr< list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;

contentPtr GetContent( fs::path filePath ){
    contentPtr actualContent( new list< unsigned char > );
    // Read the file with a stream, put read values into actualContent
    return actualContent;
}
This is done twice, because there are always two copies of the file. The content of these two files has to be compared, and an exception thrown if a mismatch is found:
void CompareContent() throw( NotMatchingException ){
    // this part is very fast, below 50ms
    contentPtr contentA = GetContent("/fileA");
    contentPtr contentB = GetContent("/fileB");

    // the next part takes about 2secs with a file size of ~64kByte
    if( *contentA != *contentB )
        throw( NotMatchingException() );
}
My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it takes up to two seconds to compare the content, scaling up and down with the file size.
Is there a more effective way of doing this comparison? Is it a problem with the container used?
Don't use a std::list, use a std::vector.
std::list is a linked-list, elements are not guaranteed to be stored contiguously.
std::vector on the other hand seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).
If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
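A sketch of the hashing idea using boost::crc (CRC32 is a checksum rather than a cryptographic hash, but it is adequate for detecting storage corruption):

#include <boost/crc.hpp>
#include <fstream>
#include <vector>

unsigned int file_crc32(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(64 * 1024);
    boost::crc_32_type crc;
    while (in.read(&buf[0], buf.size()) || in.gcount() > 0)
        crc.process_bytes(&buf[0], static_cast<std::size_t>(in.gcount()));
    return crc.checksum();  // compare the two files' checksums for equality
}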
My first piece of advice would be to profile your code.
The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.
It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.
If there's nothing else to be done with the contents of those files (it looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, without creating any containers at all?
Here's a piece of my code that compares two files bytewise:
// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
          std::istreambuf_iterator<char>(),
          std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files
}
EDIT: if you do need to keep the contents around, I think std::deque might be slightly more efficient than std::vector, for the reasons explained in GotW #54... or not -- profiling will tell. And still, only one of the two identical files would need to be loaded in memory -- I'd read one into a deque and compare it with the other file's istreambuf_iterator.
As you write, you are comparing the contents of two files. Then you can make use of boost's mapped_file. You really do not need to read the whole file; you can read on the fly (in an optimized way, as boost does) and stop when you find the first unequal byte...
Just like the very elegant solution in Cubbi's answer here: http://www.cplusplus.com/forum/general/94032/ Note that just below it he also adds some benchmarks which clearly show this is the fastest way. I will rewrite his answer a bit, add a check for zero-sized files (which mapped_file_source cannot open), and enclose the test in a function to benefit from early returns:
#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>

namespace io = boost::iostreams;
namespace fs = boost::filesystem;

bool files_equal(const std::string& path1, const std::string& path2)
{
    fs::path f1(path1);
    fs::path f2(path2);

    if (fs::file_size(f1) != fs::file_size(f2))
        return false;

    // zero-sized files cannot be opened with mapped_file_source,
    // hence we consider all zero-sized files equal
    if (fs::file_size(f1) == 0)
        return true;

    io::mapped_file_source mf1(f1.string());
    io::mapped_file_source mf2(f2.string());
    return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}

int main()
{
    if (files_equal("test.1", "test.2"))
        std::cout << "The files are equal.\n";
    else
        std::cout << "The files are not equal.\n";
}
std::list is monumentally inefficient for a char element - there is overhead for every element to facilitate O(1) insertion and removal, which is really not what your task requires.
If you must use STL, then either std::vector or the iterator approach suggested would be preferable to std::list, but why not just read the data into a char* wrapped in some smart pointer of your choice and use memcmp?
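For instance (a sketch, assuming both files have already been read into contiguous buffers):

#include <cstring>
#include <vector>

bool buffers_equal(const std::vector<char>& a, const std::vector<char>& b)
{
    return a.size() == b.size()
        && (a.empty() || std::memcmp(&a[0], &b[0], a.size()) == 0);
}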
It is crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to code it in assembly language.)
In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):
std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks
What happens on your system?
#include <algorithm>
#include <vector>
#include <iostream>
#include <cstring>   // memcmp, memset
#include <cstdlib>   // exit
#include <ctime>     // clock

using namespace std;

static char* buff ;
static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
{
    // Allocate a buffer and fill it, so we compare defined data
    buff = new char[BufferSize] ;
    memset (buff, 'x', BufferSize) ;

    // Create two vectors
    vector<char> v0 (buff, buff + BufferSize) ;
    vector<char> v1 (buff, buff + BufferSize) ;
    clock_t t ;

    // Compare them 10000 times using std::equal
    t = StartTimer() ;
    for (int i = 0 ; i < 10000 ; i++)
        if (!equal (v0.begin(), v0.end(), v1.begin()))
            cout << "Error in std::equal\n", exit (1) ;
    t = EndTimer (t) ;
    cout << "std::equal: " << t << " clock ticks\n" ;

    // Compare them 10000 times using operator==
    t = StartTimer() ;
    for (int i = 0 ; i < 10000 ; i++)
        if (v0 != v1)
            cout << "Error in operator==\n", exit (1) ;
    t = EndTimer (t) ;
    cout << "operator==: " << t << " clock ticks\n" ;

    // Compare them 10000 times using memcmp
    t = StartTimer() ;
    for (int i = 0 ; i < 10000 ; i++)
        if (memcmp (&v0[0], &v1[0], v0.size()))
            cout << "Error in memcmp\n", exit (1) ;
    t = EndTimer (t) ;
    cout << "memcmp: " << t << " clock ticks\n" ;

    return 0 ;
}

static clock_t StartTimer()
{
    // Start on a clock tick, to enhance reproducibility
    clock_t t = clock() ;
    while (clock() == t)
        ;
    return clock() ;
}

static clock_t EndTimer (clock_t t)
{
    return clock() - t ;
}