I am loking for free tool to Visual Studio 2015, which could control accessing to memory. I found just this article (https://msdn.microsoft.com/en-us/library/e5ewb1h3%28v=vs.90%29.aspx?f=255&MSPPError=-2147217396), which controls just memory leaks - its fine, but not enought. When I create array, and access out of that array, I need to know it. For example this code:
int* pointer = (int*)malloc(sizeof(int)*8);
for (int a = 0;a <= 8;a++)
std::cout << *(pointer+a) << ' ';
free(pointer);
_CrtDumpMemoryLeaks();
This example will not throw exception, but still it access to memory, which is out of allocated space. Is tool for Visual Studio 2015, which will tell mi about it? Something like valgrind.
Thank you.
You can replace the given Microsoft'ish code
int* pointer = (int*)malloc(sizeof(int)*8);
for (int a = 0;a <= 8;a++)
std::cout << *(pointer+a) << ' ';
free(pointer);
_CrtDumpMemoryLeaks();
… with this:
{
vector<int> v( 8 );
for( i = 0; i <= 8; ++i )
{
cout << v.at( i ) << ' ';
}
}
_CrtDumpMemoryLeaks();
And since there's no point in checking for memory leaks in that code (indeed the exception from the attempt to access v[8] ensures that the _CrtDumpMemoryLeaks() statement is not executed), you can simplify it to just
vector<int> v( 8 );
for( i = 0; i <= 8; ++i )
{
cout << v.at( i ) << ' ';
}
Now, it's so long since I've been checking for indexing errors etc. that I'm no longer sure of the exact magic incantations to add checking of ordinary [] indexing with g++ and Visual C++. I just remember that one could, at least with g++. But anyway, the practical way to go about things is not to laboriously check indexing code, but rather avoid manual indexing, writing
// Fixed! No more problems! Hurray!
vector<int> v( 8 );
for( int const value : v )
{
cout << value << ' ';
}
About the more general aspect of the question, how to ensure that every invalid memory access generates a trap or something of the sort, that's not generally practically possible, because the granularity of memory protection is a page of memory, which in the old days of Windows was 4 KB (imposed by the processor's memory access architecture). It's probably 4 KB still. There would just be too dang much overhead if every little allocation was rounded up to 4K with a 4K protected page after that.
Still I think it's done for the stack. That's affordable: a single case of great practical utility.
However, the C++ standard library has no functionality for that, and I don't think Visual C++ has any extensions for that either, but if you really want to go that route, then check out the Windows API, such as VirtualAlloc and friends.
Related
My question is related to a problem described here. I have written a C++ implementation of the Sieve of Eratosthenes that hits a memory overflow if I set the target value too high. As suggested in that question, I am able to fix the problem by using a boolean <vector> instead of a normal array.
However, I am hitting the memory overflow at a much lower value than expected, around n = 1 200 000. The discussion in the thread linked above suggests that the normal C++ boolean array uses a byte for each entry, so with 2 GB of RAM, I expect to be able to get to somewhere on the order of n = 2 000 000 000. Why is the practical memory limit so much smaller?
And why does using <vector>, which encodes the booleans as bits instead of bytes, yield more than an eightfold increase in the computable limit?
Here is a working example of my code, with n set to a small value.
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;
int main() {
// Count and sum of primes below target
const int target = 100000;
// Code I want to use:
bool is_idx_prime[target];
for (unsigned int i = 0; i < target; i++) {
// initialize by assuming prime
is_idx_prime[i] = true;
}
// But doesn't work for target larger than ~1200000
// Have to use this instead
// vector <bool> is_idx_prime(target, true);
for (unsigned int i = 2; i < sqrt(target); i++) {
// All multiples of i * i are nonprime
// If i itself is nonprime, no need to check
if (is_idx_prime[i]) {
for (int j = i; i * j < target; j++) {
is_idx_prime[i * j] = 0;
}
}
}
// 0 and 1 are nonprime by definition
is_idx_prime[0] = 0; is_idx_prime[1] = 0;
unsigned long long int total = 0;
unsigned int count = 0;
for (int i = 0; i < target; i++) {
// cout << "\n" << i << ": " << is_idx_prime[i];
if (is_idx_prime[i]) {
total += i;
count++;
}
}
cout << "\nCount: " << count;
cout << "\nTotal: " << total;
return 0;
}
outputs
Count: 9592
Total: 454396537
C:\Users\[...].exe (process 1004) exited with code 0.
Press any key to close this window . . .
Or, changing n = 1 200 000 yields
C:\Users\[...].exe (process 3144) exited with code -1073741571.
Press any key to close this window . . .
I am using the Microsoft Visual Studio interpreter on Windows with the default settings.
Turning the comment into a full answer:
Your operating system reserves a special section in the memory to represent the call stack of your program. Each function call pushes a new stack frame onto the stack. If the function returns, the stack frame is removed from the stack. The stack frame includes the memory for the parameters to your function and the local variables of the function. The remaining memory is referred to as the heap. On the heap, arbitrary memory allocations can be made, whereas the structure of the stack is governed by the control flow of your program. A limited amount of memory is reserved for the stack, when it gets full (e.g. due to too many nested function calls or due to too large local objects), you get a stack overflow. For this reason, large objects should be allocated on the heap.
General references on stack/heap: Link, Link
To allocate memory on the heap in C++, you can:
Use vector<bool> is_idx_prime(target);, which internally does a heap allocation and deallocates the memory for you when the vector goes out of scope. This is the most convenient way.
Use a smart pointer to manage the allocation: auto is_idx_prime = std::make_unique<bool[]>(target); This will also automatically deallocate the memory when the array goes out of scope.
Allocate the memory manually. I am mentioning this only for educational purposes. As mentioned by Paul in the comments, doing a manual memory allocation is generally not advisable, because you have to manually deallocate the memory again. If you have a large program with many memory allocations, inevitably you will forget to free some allocation, creating a memory leak. When you have a long-running program, such as a system service, creating repeated memory leaks will eventually fill up the entire memory (and speaking from personal experience, this absolutely does happen in practice). But in theory, if you would want to make a manual memory allocation, you would use bool *is_idx_prime = new bool[target]; and then later deallocate again with delete [] is_idx_prime.
I'm making ASCII game and I need performance, so decided to go with printf(). But there is a problem, I designed my char array as multidimensional char ** array, and printing it outputs garbage of memory instead of data. I know it's possible to print it with a for loop but the performance rapidly drops that way. I need to printf it like a static array[][]. Is there a way?
I did some example of working and notWorking array. I only need printf() to work with nonWorking array.
edit: using Visual Studio 2015 on Win 10, and yeah, I tested performance and cout is much slower than printf (but I don't really know why is this happening)
#include <iostream>
#include <cstdio>
int main()
{
const int X_SIZE = 40;
const int Y_SIZE = 20;
char works[Y_SIZE][X_SIZE];
char ** notWorking;
notWorking = new char*[Y_SIZE];
for (int i = 0; i < Y_SIZE; i++) {
notWorking[i] = new char[X_SIZE];
}
for (int i = 0; i < Y_SIZE; i++) {
for (int j = 0; j < X_SIZE; j++) {
works[i][j] = '#';
notWorking[i][j] = '#';
}
works[i][X_SIZE-1] = '\n';
notWorking[i][X_SIZE - 1] = '\n';
}
works[Y_SIZE-1][X_SIZE-1] = '\0';
notWorking[Y_SIZE-1][X_SIZE-1] = '\0';
printf("%s\n\n", works);
printf("%s\n\n", notWorking);
system("PAUSE");
}
Note: I think I could make some kind of a buffer or static array for just copying and displaying data, but I wonder if that can be done without it.
If you would like to print a 2D structure with printf without a loop, you need to present it to printf as a contiguous one-dimension C string. Since your game needs access to the string as a 2D structure, you could make an array of pointers into this flat structure that would look like this:
Array of pointers partitions the buffer for use as a 2D structure, while the buffer itself can be printed by printf because it is a contiguous C string.
Here is the same structure in code:
// X_SIZE+1 is for '\n's; overall +1 is for '\0'
char buffer[Y_SIZE*(X_SIZE+1)+1];
char *array[Y_SIZE];
// Setup the buffer and the array
for (int r = 0 ; r != Y_SIZE ; r++) {
array[r] = &buffer[r*(X_SIZE+1)];
for (int c = 0 ; c != X_SIZE ; c++) {
array[r][c] = '#';
}
array[r][X_SIZE] = '\n';
}
buffer[Y_SIZE*(X_SIZE+1)] = '\0';
printf("%s\n", buffer);
Demo.
Some things you can do to increase performance:
There is absolutely no reason to have an array of pointers, each pointing at an array. This will cause heap fragmentation as your data will end up all over the heap. Allocating memory in adjacent cells have many benefits in terms of speed, for example it might improve the use of data cache.
Instead, allocate a true 2D array:
char (*array2D) [Y] = new char [X][Y];
printf as well as cout are both incredibly slow, as they come with tons of overhead and extra features which you don't need. Since they are just advanced wrappers around the system-specific console functions, you should consider using the system-specific functions directly. For example, the Windows console API. It will however turn your program non-portable.
If that's not an option, you could try to use puts instead of printf, since it has far less overhead.
Main performance issue with printf/cout is that they write to the end of the "standard output stream", meaning you can't write where you like, but always at the bottom of the screen. Forcing you to constantly redraw the whole thing every time you changed something, which will be slow and possibly cause flicker issues.
Old DOS/Turbo C programs solved this with a non-standard function called gotoxy which allowed you to move the "cursor" and print where you liked. In modern programming, you can do this with the console API functions. Example for Windows.
You could/should separate graphics from the rest of the program. If you have one thread handing graphics only and the main thread handling algorithms, the graphic updates will work smoother, without having to wait for whatever else the program is doing. It makes the program far more advanced though, as you have to consider thread safety issues.
When running the release executable only (No problems occur when running through visual studio) my program crashes. When using "attach to process" function visual studio indicates the crash occurred in the following function:
World::blockmap World::newBlankBlockmap(int sideLen, int h){
cout << "newBlankBlockmap side: "<<std::to_string((long long)sideLen) << endl;
cout << "newBlankBlockmap height: "<<std::to_string((long long)h) << endl;
short*** bm = new short**[sideLen];
for(int i=0;i<sideLen;i++){
bm[i] = new short*[h];
for(int j=0;j<h;j++){
bm[i][j] = new short[sideLen];
for (int k = 0; k < sideLen ; k++)
{
bm[i][j][k] = blocks->getAIR_BLOCK();
}
}
}
return (blockmap)bm;
}
Which is called from a child class...
World::chunk* World_E::newChunkMap(World::floatmap north, World::floatmap east, World::floatmap south, World::floatmap west
,float lowlow, float highlow, float highhigh, float lowhigh, bool displaceSides){
World::chunk* c = newChunk(World::CHUNK_SIZE+1,World::HEIGHT);
for (int i = 0; i < World::CHUNK_SIZE ; i++)
{
for (int k = 0; k < World::CHUNK_SIZE ; k++)
{
c->bm[i][0][k] = blocks->getDUMMY_BLOCK();
}
}
c->bm[(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
c->bm[(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
c->bm[(int)floor((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
c->bm[(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1][1][(int)ceil((float)(World::CHUNK_SIZE+1)/2.0f)-1] = blocks->getSTONE_BLOCK();
return c;
}
where...
class World {
public: typedef short*** blockmap;
...
The line which VS points at is...
short*** bm = new short**[sideLen];
The "attach to process" function stats the Local variables are...
sideLen = 1911407648
h = 0
which is what i did NOT expect, but the cout outputs 9 and 30 respectively, which was expected.
I am aware that most "crashes in release only" problems are due to uninitialized variables, however, I fail to see that related here.
The only error message I get is...
Windows has triggered a breakpoint in Blocks Project.exe.
This may be due to a corruption of the heap
I am stumped on this problem, what's the error? how can I better debug release executable?
I can post more code if needed, however, bear in mind there is a lot of it.
Thank you in advanced.
"And I don't see World::newBlankBlockmap() called from that second chunk of code. – Michael Burr", I forgot that bit, here you go...
World::chunk* World::newChunk(int side, int height){
cout << "newChunk side: "<<std::to_string((long long)side) << endl;
cout << "newChunk height: "<<std::to_string((long long)height) << endl;
chunk* ch = new chunk();
ch->bm = newBlankBlockmap(side,height);
ch->fm = newBlankFloatmap(side);
return ch;
}
where...
struct chunk {
blockmap bm;
floatmap fm;
};
as defined in the World class
To reiterate what the comments where hinting at: From what you've posted, you're code seems to be badly structured. Triple pointer constructs like short*** are almost impossible to debug and should be avoided at all costs. The heap corruption error message you got suggests that you have a bad memory access somewhere in your code, which is impossible to find automatically with your current setup.
Your only options at this point are to either dig through your entire code manually, until you've found the bug, or start refactoring. The latter might seem like the more time-consuming now, but it won't be if you plan to work with this code in the future.
Consider the following as possible hints for a refactoring:
Don't use plain arrays for storing values. std::vector is just as effective and a lot easier to debug.
Avoid plain new and delete. In modern C++ with the STL containers and smart pointers, plain memory allocation should only happen in very rare exceptional cases.
Always range-check your array access operations. If you worry about performance, use asserts which disappear in release builds, but be sure the checks are there when you need them for debugging.
Modeling three-dimensional arrays in C++ can be tricky, since operator[] only offers support for one-dimensional arrays. A nice compromise is using operator() instead, which can take an arbitrary number of indices.
Avoid C-style casts. They can be very unpredictable. Use the C++ casts static_cast, dynamic_castand reinterpret_cast instead. If you find yourself using reinterpret_cast regularly, you probably have a mistake in your design somewhere.
There is a problem in this line short*** bm = new short**[sideLen];. The memory is allocated for sideLen elements, but the assignment line bm[i][j][k] = blocks->getAIR_BLOCK(); requires an array having size sideLen * sideLen * h. To fix this problem changing of the 1st line to short*** bm = new short**[sideLen * sideLen * h]; is required.
when using C++ vector, time spent is 718 milliseconds,
while when I use Array, time is almost 0 milliseconds.
Why so much performance difference?
int _tmain(int argc, _TCHAR* argv[])
{
const int size = 10000;
clock_t start, end;
start = clock();
vector<int> v(size*size);
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
v[i*size+j] = 1;
}
}
end = clock();
cout<< (end - start)
<<" milliseconds."<<endl; // 718 milliseconds
int f = 0;
start = clock();
int arr[size*size];
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
arr[i*size+j] = 1;
}
}
end = clock();
cout<< ( end - start)
<<" milliseconds."<<endl; // 0 milliseconds
return 0;
}
Your array arr is allocated on the stack, i.e., the compiler has calculated the necessary space at compile time. At the beginning of the method, the compiler will insert an assembler statement like
sub esp, 10000*10000*sizeof(int)
which means the stack pointer (esp) is decreased by 10000 * 10000 * sizeof(int) bytes to make room for an array of 100002 integers. This operation is almost instant.
The vector is heap allocated and heap allocation is much more expensive. When the vector allocates the required memory, it has to ask the operating system for a contiguous chunk of memory and the operating system will have to perform significant work to find this chunk of memory.
As Andreas says in the comments, all your time is spent in this line:
vector<int> v(size*size);
Accessing the vector inside the loop is just as fast as for the array.
For an additional overview see e.g.
[What and where are the stack and heap?
[http://computer.howstuffworks.com/c28.htm][2]
[http://www.cprogramming.com/tutorial/virtual_memory_and_heaps.html][3]
Edit:
After all the comments about performance optimizations and compiler settings, I did some measurements this morning. I had to set size=3000 so I did my measurements with roughly a tenth of the original entries. All measurements performed on a 2.66 GHz Xeon:
With debug settings in Visual Studio 2008 (no optimization, runtime checks, and debug runtime) the vector test took 920 ms compared to 0 ms for the array test.
98,48 % of the total time was spent in vector::operator[], i.e., the time was indeed spent on the runtime checks.
With full optimization, the vector test needed 56 ms (with a tenth of the original number of entries) compared to 0 ms for the array.
The vector ctor required 61,72 % of the total application running time.
So I guess everybody is right depending on the compiler settings used. The OP's timing suggests an optimized build or an STL without runtime checks.
As always, the morale is: profile first, optimize second.
If you are compiling this with a Microsoft compiler, to make it a fair comparison you need to switch off iterator security checks and iterator debugging, by defining _SECURE_SCL=0 and _HAS_ITERATOR_DEBUGGING=0.
Secondly, the constructor you are using initialises each vector value with zero, and you are not memsetting the array to zero before filling it. So you are traversing the vector twice.
Try:
vector<int> v;
v.reserve(size*size);
To get a fair comparison I think something like the following should be suitable:
#include <sys/time.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include <numeric>
int main()
{
static size_t const size = 7e6;
timeval start, end;
int sum;
gettimeofday(&start, 0);
{
std::vector<int> v(size, 1);
sum = std::accumulate(v.begin(), v.end(), 0);
}
gettimeofday(&end, 0);
std::cout << "= vector =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
gettimeofday(&start, 0);
int * const arr = new int[size];
std::fill(arr, arr + size, 1);
sum = std::accumulate(arr, arr + size, 0);
delete [] arr;
gettimeofday(&end, 0);
std::cout << "= Simple array =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
}
In both cases, dynamic allocation and deallocation is performed, as well as accesses to elements.
On my Linux box:
$ g++ -O2 foo.cpp
$ ./a.out
= vector =
(0 s, 21085 us)
sum = 7000000
= Simple array =
(0 s, 21148 us)
sum = 7000000
Both the std::vector<> and array cases have comparable performance. The point is that std::vector<> can be just as fast as a simple array if your code is structured appropriately.
On a related note switching off optimization makes a huge difference in this case:
$ g++ foo.cpp
$ ./a.out
= vector =
(0 s, 120357 us)
sum = 7000000
= Simple array =
(0 s, 60569 us)
sum = 7000000
Many of the optimization assertions made by folks like Neil and jalf are entirely correct.
HTH!
EDIT: Corrected code to force vector destruction to be included in time measurement.
Change assignment to eg. arr[i*size+j] = i*j, or some other non-constant expression. I think compiler optimizes away whole loop, as assigned values are never used, or replaces array with some precalculated values, so that loop isn't even executed and you get 0 milliseconds.
Having changed 1 to i*j, i get the same timings for both vector and array, unless pass -O1 flag to gcc, then in both cases I get 0 milliseconds.
So, first of all, double-check whether your loops are actually executed.
You are probably using VC++, in which case by default standard library components perform many checks at run-time (e.g whether index is in range). These checks can be turned off by defining some macros as 0 (I think _SECURE_SCL).
Another thing is that I can't even run your code as is: the automatic array is way too large for the stack. When I make it global, then with MingW 3.5 the times I get are 627 ms for the vector and 26875 ms (!!) for the array, which indicates there are really big problems with an array of this size.
As to this particular operation (filling with value 1), you could use the vector's constructor:
std::vector<int> v(size * size, 1);
and the fill algorithm for the array:
std::fill(arr, arr + size * size, 1);
Two things. One, operator[] is much slower for vector. Two, vector in most implementations will behave weird at times when you add in one element at a time. I don't mean just that it allocates more memory but it does some genuinely bizarre things at times.
The first one is the main issue. For a mere million bytes, even reallocating the memory a dozen times should not take long (it won't do it on every added element).
In my experiments, preallocating doesn't change its slowness much. When the contents are actual objects it basically grinds to a halt if you try to do something simple like sort it.
Conclusion, don't use stl or mfc vectors for anything large or computation heavy. They are implemented poorly/slowly and cause lots of memory fragmentation.
When you declare the array, it lives in the stack (or in static memory zone), which it's very fast, but can't increase its size.
When you declare the vector, it assign dynamic memory, which it's not so fast, but is more flexible in the memory allocation, so you can change the size and not dimension it to the maximum size.
When profiling code, make sure you are comparing similar things.
vector<int> v(size*size);
initializes each element in the vector,
int arr[size*size];
doesn't. Try
int arr[size * size];
memset( arr, 0, size * size );
and measure again...
When looking at some of our logging I've noticed in the profiler that we were spending a lot of time in the operator<< formatting ints and such. It looks like there is a shared lock that is used whenever ostream::operator<< is called when formatting an int(and presumably doubles). Upon further investigation I've narrowed it down to this example:
Loop1 that uses ostringstream to do the formatting:
DWORD WINAPI doWork1(void* param)
{
int nTimes = *static_cast<int*>(param);
for (int i = 0; i < nTimes; ++i)
{
ostringstream out;
out << "[0";
for (int j = 1; j < 100; ++j)
out << ", " << j;
out << "]\n";
}
return 0;
}
Loop2 that uses the same ostringstream to do everything but the int format, that is done with itoa:
DWORD WINAPI doWork2(void* param)
{
int nTimes = *static_cast<int*>(param);
for (int i = 0; i < nTimes; ++i)
{
ostringstream out;
char buffer[13];
out << "[0";
for (int j = 1; j < 100; ++j)
{
_itoa_s(j, buffer, 10);
out << ", " << buffer;
}
out << "]\n";
}
return 0;
}
For my test I ran each loop a number of times with 1, 2, 3 and 4 threads (I have a 4 core machine). The number of trials is constant. Here is the output:
doWork1: all ostringstream
n Total
1 557
2 8092
3 15916
4 15501
doWork2: use itoa
n Total
1 200
2 112
3 100
4 105
As you can see, the performance when using ostringstream is abysmal. It gets 30 times worse when adding more threads whereas the itoa gets about 2 times faster.
One idea is to use _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) as recommended by M$ in this article. That doesn't seem to help me. Here's another user who seem to be having a similar issue.
We need to be able to format ints in several threads running in parallel for our application. Given this issue we either need to figure out how to make this work or find another formatting solution. I may code up a simple class with operator<< overloaded for the integral and floating types and then have a templated version that just calls operator<< on the underlying stream. A bit ugly, but I think I can make it work, though maybe not for user defined operator<<(ostream&,T) because it's not an ostream.
I should also make clear that this is being built with Microsoft Visual Studio 2005. And I believe this limitation comes from their implementation of the standard library.
If the Visual Studio 2005's standard library implementation has bugs why not try other implementations? Like:
STLport
Apache C++ Standard Library (STDCXX)
or even Dinkumware upon which Visual Studio 2005 standard library is based on, maybe the have fixed the problem since 2005.
Edit: The other user you mentioned used Visual Studio 2008 SP1, which means that probably Dinkumware has not fixed this issue.
Doesn't surprise me, MS has put "global" locks on a fair few shared resources - the biggest headache for us was the BSTR memory lock a few years back.
The best thing you can do is copy the code and replace the ostream lock and shared conversion memory with your own class. I have done that where I write the stream using a printf-style logging system (ie I had to use a printf logger, and wrapped it with my stream operators). Once you've compiled that into your app you should be as fast as itoa. When I'm in the office I'll grab some of the code and paste it for you.
EDIT:
as promised:
CLogger& operator<<(long l)
{
if (m_LoggingLevel < m_levelFilter)
return *this;
// 33 is the max length of data returned from _ltot
resize(33);
_ltot(l, buffer+m_length, m_base);
m_length += (long)_tcslen(buffer+m_length);
return *this;
};
static CLogger& hex(CLogger& c)
{
c.m_base = 16;
return c;
};
void resize(long extra)
{
if (extra + m_length > m_size)
{
// resize buffer to fit.
TCHAR* old_buffer = buffer;
m_size += extra;
buffer = (TCHAR*)malloc(m_size*sizeof(TCHAR));
_tcsncpy(buffer, old_buffer, m_length+1);
free(old_buffer);
}
}
static CLogger& endl(CLogger& c)
{
if (c.m_length == 0 && c.m_LoggingLevel < c.m_levelFilter)
return c;
c.Write();
return c;
};
Sorry I can't let you have all of it, but those 3 methods show the basics - I allocate a buffer, resize it if needed (m_size is buffer size, m_length is current text length) and keep it for the duration of the logging object. The buffer contents get written to file (or OutputDebugString, or a listbox) in the endl method. I also have a logging 'level' to restrict output at runtime. So you just replace your calls to ostringstream with this, and the Write() method pumps the buffer to a file and clears the length. Hope this helps.
The problem could be memory allocation. malloc which "new" uses has an internal lock. You can see it if you step into it. Try to use a thread local allocator and see if the bad performance disappears.