Performance tuning - C++

I have the following code segment. For a vector of size 176,000 the loop takes up to 8 minutes to execute. I am not sure what is taking so much time.
XEPComBSTR bstrSetWithIdsAsString; // wrapper class for BSTR
std::vector<__int64>::const_iterator it;
for (it = vecIds.begin(); it != vecIds.end(); ++it)
{
    __int64 i64Id = *it;
    __int64 i64OID = XPtFunctions::GetOID(i64Id);
    // set ',' between two set members
    if (it != vecIds.begin())
        bstrSetWithIdsAsString.Append(XEPComBSTR(L","));
    wchar_t buf[20];
    _i64tow_s(i64OID, buf, 20, 10);
    bstrSetWithIdsAsString.Append(buf);
}
__int64 GetOID(const __int64 &i64Id)
{
    __int64 numId = i64Id;
    numId <<= 16;
    numId >>= 16;
    return numId;
}

I think your bottleneck is the Append function. The string keeps some allocated memory inside, and when you try to append something that won't fit, it reallocates a larger block and copies everything over; done once per element, that makes the whole loop quadratic. Try allocating the necessary memory once at the beginning. HTH
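A minimal sketch of the preallocation idea, assuming the result can be built in a std::wstring first and handed to the BSTR wrapper once at the end (JoinIds is an illustrative name, and __int64 is spelled long long here for portability):

```cpp
#include <string>
#include <vector>

// Join IDs into one comma-separated wide string with a single upfront
// allocation, so no per-append reallocation can occur.
std::wstring JoinIds(const std::vector<long long>& ids)
{
    std::wstring out;
    out.reserve(ids.size() * 21);          // up to 20 digits plus a comma each
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i != 0)
            out += L',';
        out += std::to_wstring(ids[i]);    // appends into the reserved buffer
    }
    return out;
}
```

The single resulting string can then be appended to bstrSetWithIdsAsString in one call.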

I am not sure what this is doing:
bstrSetWithIdsAsString.Append(buf);
but I guess this is where the slowness is, particularly if it has to work out where the end of the buffer is every time by looking for the first zero byte, and possibly needs to do a lot of reallocating.
Why not use wostringstream?
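A sketch of the wostringstream suggestion: the stream manages one growable buffer internally, and the BSTR wrapper is only touched once at the end. BuildIdList is an illustrative name; GetOID's shift-up-shift-down step is inlined, and long long stands in for __int64.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Accumulate everything in a wostringstream, then convert to one string.
std::wstring BuildIdList(const std::vector<long long>& ids)
{
    std::wostringstream os;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i != 0)
            os << L',';
        os << ((ids[i] << 16) >> 16);   // same masking as GetOID()
    }
    return os.str();   // append this one string to the BSTR wrapper
}
```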

The only way to find out what takes up all this time is to profile the application. Some editions of Visual Studio come with a fully functional profiler.
Alternatively, simply run the program in the debugger, and break it at random intervals, and note where in the code you are.
But I can see a few potential trouble spots:
you perform a lot of string appends. Do they allocate new memory each time? Does your string type allow you to reserve memory in advance, like std::string can do? In general, is the string class efficient?
you loop over iterators, and given the horrible Hungarian Notation you appear to use, I am assuming you are working on Windows, probably using MSVC. Some versions of MSVC enable a lot of runtime checking of STL iterators even in release builds, unless you explicitly disable it. VS2005 and 2008 specifically are guilty of this. 2010 only enables this checking in debug mode.
and, of course, you are building with optimizations enabled, right?
But I'm just pointing out what seems like it could potentially slow down your code. I have no clue what's actually happening. To be sure, I'd have to profile your code. You can do that. I can't. So do it.
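If the MSVC iterator checking mentioned above is the culprit, it can be switched off before any standard header is included. These macros are MSVC-specific (harmless no-ops on other compilers), and the names apply to the VS2005/2008 era mentioned above:

```cpp
// Must appear before any standard header. _SECURE_SCL=0 disables
// checked iterators in release builds; _HAS_ITERATOR_DEBUGGING=0
// disables the heavier debug-build checks.
#define _SECURE_SCL 0
#define _HAS_ITERATOR_DEBUGGING 0
#include <vector>

// With checking disabled, iterator use compiles down to plain
// pointer arithmetic in release builds.
long long SumAll(const std::vector<long long>& v)
{
    long long total = 0;
    for (std::vector<long long>::const_iterator it = v.begin(); it != v.end(); ++it)
        total += *it;
    return total;
}
```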

Related

Slow memcpy Performance

This may seem like a stupid/obvious question to some of you, but I'm still learning so please be gentle haha.
I'm writing an application without the CRT, so I have to implement my own memcpy function. After doing everything and getting it working, I noticed the application was performing significantly slower than its CRT counterpart. After a while I tracked it down to my custom memcpy function.
void* _memcpy(void* destination, void* source, size_t num)
{
    char* d = (char*)destination;
    char* s = (char*)source;
    while (num--)
        *d++ = *s++;
    return destination;
}
My friend told me this was a complete sh*t implementation, so I'm posting this here to ask how I could at least improve it to meet the performance of its CRT counterpart, and also to get an explanation of why it's so slow.
First things first. Computers handle things in words. A typical word size is 4 or 8 bytes (except on some 8-bit micros). If you can copy a word at a time, things will be much faster.
There are complications, though. Many processors don't like misaligned access, so each copy should be done on word boundaries.
Other optimizations might include prefetching data, but these start becoming more complicated.
Take a look a newlib-nano's implementation for inspiration. https://github.com/eblot/newlib/blob/master/newlib/libc/string/memcpy.c
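The word-at-a-time idea can be sketched like this: copy byte-by-byte until the destination is word-aligned, bulk-copy in size_t-sized chunks, then finish the tail byte-by-byte. This is only an illustration of the technique; real implementations such as newlib's add loop unrolling on top.

```c
#include <stddef.h>
#include <stdint.h>

void *my_memcpy(void *destination, const void *source, size_t num)
{
    unsigned char *d = (unsigned char *)destination;
    const unsigned char *s = (const unsigned char *)source;

    /* fast path only if both pointers share the same word-alignment offset */
    if (((uintptr_t)d & (sizeof(size_t) - 1)) ==
        ((uintptr_t)s & (sizeof(size_t) - 1))) {
        /* advance byte-by-byte up to a word boundary */
        while (num && ((uintptr_t)d & (sizeof(size_t) - 1))) {
            *d++ = *s++;
            num--;
        }
        /* bulk copy one word at a time */
        size_t *dw = (size_t *)d;
        const size_t *sw = (const size_t *)s;
        while (num >= sizeof(size_t)) {
            *dw++ = *sw++;
            num -= sizeof(size_t);
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }
    /* remaining tail (or the whole copy, if alignments differ) */
    while (num--)
        *d++ = *s++;
    return destination;
}
```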

Text iteration, Assembly versus C++

I am making a program which frequently reads chunks of text received from the web, looking for specific characters and parsing the data accordingly. I am becoming fairly skilled with C++ and have made it work well; however, is Assembly going to be faster than a
for (size_t len = 0; len != tstring.length(); len++) {
    if (tstring[len] == ',')
        stuff();
}
Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to waste my time working with asm just to be able to say I used it, but for true speed purposes.
Thank you,
No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.
An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
you're retrieving tstring.length() once per loop iteration; that's unnecessary.
you're using random indexing, tstring[len] which might be a more-expensive operation than using a forward iterator.
you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results.
There's already a likely low-level-optimized standard library function available, strchr(), for exactly that kind of scanning. The C++ STL std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: Check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you'd switch from your hand-grown character-by-character simple search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...
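Combining two of the points above, a sketch that scans once with strchr() and collects the comma positions first, so the scanned string stays cache-hot and stuff() runs afterwards (FindCommas is an illustrative name):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Collect the index of every ',' using the library's optimized strchr().
std::vector<std::size_t> FindCommas(const std::string& tstring)
{
    std::vector<std::size_t> positions;
    const char *base = tstring.c_str();
    for (const char *p = std::strchr(base, ','); p != 0;
         p = std::strchr(p + 1, ','))
        positions.push_back(static_cast<std::size_t>(p - base));
    return positions;   // iterate over these afterwards, calling stuff()
}
```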
Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster.
string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');
while (pos != string::npos) {
    do_stuff(pos);
    pos = s.find(',', pos + 1);
}
Each iteration of the loop will give you the next position of a ',' character, so the program will need only a few iterations to finish the job.
Would an inline-assembly routine using cmp and jz/jnz be faster?
Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.
First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.
If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.
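The measure-first advice above can be sketched with std::chrono; the callable and iteration count are placeholders for the real loop and workload:

```cpp
#include <chrono>

// Time a callable over a number of iterations and return milliseconds,
// so the maintainable C++ version and any asm rewrite can be compared
// on the same footing.
template <typename F>
double MillisecondsFor(F f, int iterations)
{
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```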

Different Unallocated Memory Behaviour Between Visual Studio Versions

I'm having a weird situation. I'm trying to integrate a 10+ year old PCI camera device SDK into my camera management software. The manufacturer is no longer in business and I have no chance of getting official help. So here I am, looking for some help with my ugly problem.
The SDK comes with Visual Studio 6.0 samples. One of the include files has a structure ending with a one-byte array, like below:
typedef struct AVData {
    ...
    BYTE audioVideoData[1];
} AVDATA, *PAVDATA;
But this single-byte array receives whole video frames and, weirdly enough, it works fine when built with Visual Studio 6.0. If I try it with Visual Studio 2005/2008/2010, I start getting memory access violations, which actually makes sense, since it shouldn't be possible to store more data in a fixed-size array than was allocated, no? But the same code runs fine under VS 6.0?! It's probably caused by either compiler or C++ runtime differences, but I'm not very experienced in this area, so it's hard for me to tell the exact reason.
I tried changing the size to an expected maximum number of bytes, like below:
typedef struct AVData {
    ...
    BYTE audioVideoData[20000];
} AVDATA, *PAVDATA;
This got it working, but from time to time I still get memory access violations when trying to destroy the decoder object of the library.
There is something definitely wrong with this. I don't have the source codes of the SDK, only the DLL, Lib and Header files. My questions are:
1) Is it really legal to allocate space to a fixed size array in Visual Studio 6.0 version?
2) Is there any possible way (a compiler option etc.) to make the same code work with newer VS versions / C++ runtimes?
3) Since my workaround of editing the header file works up to a point but still having problems, do you know any better way to get around of this problem?
IIRC it's an old trick to create a struct that is variable in size. Consider:
struct s {
    int len;
    char name[1];
};
The 'name' can now be of variable length if the appropriate allocation is done, and it will be laid out sequentially in memory:
char* foo = "abc";
int len = strlen(foo);
struct s* p = malloc(sizeof(int) + len + 1);
p->len = len;
strcpy(p->name, foo);
I think the above should work fine in newer versions of Visual Studio as well. Maybe it is a matter of packing: have you done #pragma pack(1) to get structs on byte boundaries? I know VS6 had that as the default.
A one-element array in a C structure like this often means that the size is unknown until runtime. (For a Windows example, see BITMAPINFO.)
Usually, there will be some other information (possibly in the struct) that tells you how large the buffer needs to be. You would never allocate one of these directly, but instead allocate the right size block of memory, then cast it:
int size = /* calculate frame size somehow */;
AVDATA *data = (AVDATA *) malloc(sizeof(AVDATA) + size);
// use data->audioVideoData
The code almost certainly exhibits undefined behaviour in some way, and there is no way to fix this except to fix the interface or source code of the SDK. As the manufacturer is no longer in business, this is impossible.

Where does a process's memory space start, and where does it end?

On Windows platform, I'm trying to dump memory from my application where the variables lie. Here's the function:
void MyDump(const void *m, unsigned int n)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(m);
    char buffer[16];
    unsigned int mod = 0;

    for (unsigned int i = 0; i < n; ++i, ++mod) {
        if (mod % 16 == 0) {
            mod = 0;
            std::cout << " | ";
            for (unsigned short j = 0; j < 16; ++j) {
                switch (buffer[j]) {
                case 0xa:
                case 0xb:
                case 0xd:
                case 0xe:
                case 0xf:
                    std::cout << " ";
                    break;
                default:
                    std::cout << buffer[j];
                }
            }
            std::cout << "\n0x" << std::setfill('0') << std::setw(8)
                      << std::hex << (long)i << " | ";
        }
        buffer[i % 16] = p[i];
        std::cout << std::setw(2) << std::hex
                  << static_cast<unsigned int>(p[i]) << " ";
        if (i % 4 == 0 && i != 1)
            std::cout << " ";
    }
}
Now, how can I know at which address my process's memory space starts, where all the variables are stored? And how do I know how long the area is?
For instance:
MyDump(0x0000 /* <-- Starts from here? */, 0x1000 /* <-- This much? */);
Best regards,
nhaa123
The short answer to this question is that you cannot approach the problem this way. The way processes are laid out in memory is very much compiler- and operating-system-dependent, and there is no easy way to determine where all of the code and variables lie. To accurately and completely find all of the variables, you'd need to write large portions of a debugger yourself (or borrow them from a real debugger's code).
But, you could perhaps narrow the scope of your question a little bit. If what you really want is just a stack trace, those are not too hard to generate: How can one grab a stack trace in C?
Or if you want to examine the stack itself, it is easy to get a pointer to the current top of the stack (just declare a local variable and then take its address). The easiest way to get the bottom of the stack is to declare a variable in main, store its address in a global variable, and use that address later as the "bottom" (this is easy but not really 'clean').
Getting a picture of the heap is much, much harder, because you need extensive knowledge of the internal workings of the heap to know which pieces of it are currently allocated. Since the heap is basically "unlimited" in size, that's quite a lot of data to print if you just print all of it, even the unused parts. I don't know of a way to do this, and I would highly recommend you not waste time trying.
Getting a picture of static global variables is not as bad as the heap, but also difficult. These live in the data segments of the executable, and unless you want to get into some assembly and parsing of executable formats, just avoid doing this as well.
Overview
What you're trying to do is absolutely possible, and there are even tools to help, but you'll have to do more legwork than I think you're expecting.
In your case, you're particularly interested in "where the variables lie." The system heap API on Windows will be an incredible help to you. The reference is really quite good, and though it won't be a single contiguous region the API will tell you where your variables are.
In general, though, not knowing anything about where your memory is laid out, you're going to have to do a sweep of the entire address space of the process. If you want only data, you'll have to do some filtering of that, too, because code and stack nonsense are also there. Lastly, to avoid seg-faulting while you dump the address space, you may need to add a segfault signal handler that lets you skip unmapped memory while you're dumping.
Process Memory Layout
What you will have, in a running process, is multiple disjoint stretches of memory to print out. They will include:
Compiled code (read-only),
Stack data (local variables),
Static Globals (e.g. from shared libraries or in your program), and
Dynamic heap data (everything from malloc or new).
The key to a reasonable dump of memory is being able to tell which range of addresses belongs to which family. That's your main job, when you're dumping the program. Some of this, you can do by reading the addresses of functions (1) and variables (2, 3 and 4), but if you want to print more than a few things, you'll need some help.
For this, we have...
Useful Tools
Rather than just blindly searching the address space from 0 to 2^64 (which, we all know, is painfully huge), you will want to employ OS and compiler developer tools to narrow down your search. Someone out there needs these tools, maybe even more than you do; it's just a matter of finding them. Here are a few of which I'm aware.
Disclaimer: I don't know many of the Windows equivalents for many of these things, though I'm sure they exist somewhere.
I've already mentioned the Windows system heap API. This is a best-case scenario for you. The more things you can find in this vein, the more accurate and easy your dump will be. Really, the OS and the C runtime know quite a bit about your program. It's a matter of extracting the information.
On Linux, memory types 1 and 3 are accessible through pseudo-files like /proc/pid/maps, where you can see the ranges of the address space reserved for libraries and program code. You can also see the protection bits; read-only ranges, for instance, are probably code, not data.
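As a Linux-only sketch of the /proc/pid/maps approach (the question itself is about Windows, where VirtualQueryEx() plays the analogous role), the current process can enumerate its own mapped ranges like this:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Print each mapped region of the current process and return how many
// there were. Each line holds start-end addresses, protection bits,
// and the backing file, e.g. "00400000-0040b000 r-xp ... /path/to/exe".
int CountMappedRegions()
{
    std::ifstream maps("/proc/self/maps");
    int regions = 0;
    std::string line;
    while (std::getline(maps, line)) {
        std::cout << line << '\n';
        ++regions;
    }
    return regions;
}
```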
For Windows tips, Mark Russinovich has written some articles on how to learn about a Windows process's address space and where different things are stored. I imagine he might have some good pointers in there.
Well, you can't, not really... at least not in a portable manner. For the stack, you could do something like:
void* ptr_to_start_of_stack = 0;

int main(int argc, char* argv[])
{
    int item_at_approximately_start_of_stack;
    ptr_to_start_of_stack = &item_at_approximately_start_of_stack;
    // ...
    // ... do lots of computation
    // ... a function called here can do something similar, and
    // ... attempt to print out from ptr_to_start_of_stack to its own
    // ... approximate start of stack
    // ...
    return 0;
}
In terms of attempting to look at the range of the heap, on many systems you could use the sbrk() function (specifically sbrk(0)) to get a pointer to the current end of the heap (typically, the heap grows upward from the end of the data segment at the low end of the address space, while the stack typically grows downward from the high end).
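A minimal sketch of the sbrk(0) idea (POSIX/glibc; sbrk is deprecated on modern systems, and modern allocators often satisfy large requests with mmap outside the break, so treat the result as approximate):

```c
#include <stdio.h>
#include <unistd.h>

/* Print the current program break (roughly, the end of the classic
 * heap) alongside the address of a stack local for comparison. */
void show_break_and_stack(void)
{
    void *program_break = sbrk(0);   /* current end of the data segment */
    int on_stack = 0;
    printf("heap break: %p, stack local: %p\n",
           program_break, (void *)&on_stack);
}
```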
That said, this is a really bad idea. Not only is it platform dependent, but the information you can get from it is really not as useful as good logging. I suggest you familiarize yourself with Log4Cxx.
Good logging practice, in addition to the use of a debugger such as GDB, is really the best way to go. Trying to debug your program by looking at a full memory dump is like trying to find a needle in a haystack, and so it really is not as useful as you might think. Logging where the problem might logically be, is more helpful.
AFAIK this depends on the OS; you should look at e.g. memory segmentation.
Assuming you are running on a mainstream operating system, you'll need help from the operating system to find out which addresses in your virtual memory space have mapped pages. For example, on Windows you'd use VirtualQueryEx(). The memory dump you get can be as large as two gigabytes, so it isn't likely you'll discover anything recognizable quickly.
Your debugger already allows you to inspect memory at arbitrary addresses.
You can't, at least not portably. And you can't make many assumptions either.
Unless you're running this on CP/M or MS-DOS.
But on modern systems, the wheres and hows of your data and code locations are, in the general case, not really up to you.
You can play linker games and such to get better control of the memory map for your executable, but you won't have any control over, say, any shared libraries you may load, etc.
There's no guarantee that any of your code, for example, is even in a contiguous space. The virtual memory system and the loader can place code pretty much wherever they want. Nor is there any guarantee that your data is anywhere near your code. In fact, there's no guarantee that you can even READ the memory space where your code lives. (Execute, yes. Read, maybe not.)
At a high level, your program is split into three sections: code, data, and stack. The OS places these where it sees fit, and the memory manager controls what you can see and where.
There are all sorts of things that can muddy these waters.
However.
If you want.
You can try having "markers" in your code. For example, put a function at the start of your file called "startHere()" and then one at the end called "endHere()". If you're lucky, for a single-file program, you'll have a contiguous blob of code between the function pointers for "startHere" and "endHere".
Same thing with static data. You can try the same concept if you're interested in that at all.

segfault on Win XP x64, doesn't happen on XP x32 - strncpy issue? How to fix?

I'm pretty inexperienced using C++, but I'm trying to compile version 2.0.2 of the SBML toolbox for matlab on a 64-bit XP platform. The SBML toolbox depends upon Xerces 2.8 and libsbml 2.3.5.
I've been able to build and compile the toolbox on a 32-bit machine, and it works when I test it. However, after rebuilding it on a 64-bit machine (which is a HUGE PITA!), I get a segmentation fault when I try to read long .xml files with it.
I suspect that the issue is caused by pointer address issues.
The stack trace from the segmentation fault starts with:
[ 0] 000000003CB3856E libsbml.dll+165230 (StringBuffer_append+000030)
[ 6] 000000003CB1BFAF libsbml.dll+049071 (EventAssignment_createWith+001631)
[ 12] 000000003CB1C1D7 libsbml.dll+049623 (SBML_formulaToString+000039)
[ 18] 000000003CB2C154 libsbml.dll+115028 (
So I'm looking at the StringBuffer_append function in the libsbml code:
LIBSBML_EXTERN
void
StringBuffer_append (StringBuffer_t *sb, const char *s)
{
    unsigned long len = strlen(s);

    StringBuffer_ensureCapacity(sb, len);
    strncpy(sb->buffer + sb->length, s, len + 1);
    sb->length += len;
}
ensureCapacity looks like this:
LIBSBML_EXTERN
void
StringBuffer_ensureCapacity (StringBuffer_t *sb, unsigned long n)
{
    unsigned long wanted = sb->length + n;
    unsigned long c;

    if (wanted > sb->capacity)
    {
        /**
         * Double the total new capacity (c) until it is greater-than wanted.
         * Grow StringBuffer by this amount minus the current capacity.
         */
        for (c = 2 * sb->capacity; c < wanted; c *= 2) ;
        StringBuffer_grow(sb, c - sb->capacity);
    }
}
and StringBuffer_grow looks like this:
LIBSBML_EXTERN
void
StringBuffer_grow (StringBuffer_t *sb, unsigned long n)
{
    sb->capacity += n;
    sb->buffer = (char *) safe_realloc(sb->buffer, sb->capacity + 1);
}
Is it likely that the
strncpy(sb->buffer + sb->length, s, len + 1);
in StringBuffer_append is the source of my segfault?
If so, can anyone suggest a fix? I really don't know C++, and am particularly confused by pointers and memory addressing, so am likely to have no idea what you're talking about - I'll need some hand-holding.
Also, I put details of my build process online here, in case anyone else is dealing with trying to compile C++ for 64-bit systems using Microsoft Visual C++ Express Edition.
Thanks in advance!
-Ben
Try printing values, or use a debugger, to see what values you're getting for some of your intermediate variables. In StringBuffer_append(), output len; in StringBuffer_ensureCapacity(), observe sb->capacity and c before and inside the loop. See if the values make sense.
A segmentation fault may be caused by accessing data beyond the end of the string.
The strange fact that it worked on a 32-bit machine and not a 64-bit O/S is also a clue. Is the physical and pagefile memory size the same for the two machines? Also, in a 64-bit machine the kernel space may be larger than the 32-bit machine, and eating some available memory space that was in the user part of the memory space for 32-bit O/S. For XML the entire document must fit into memory. There are probably some switches to set the size if this is the problem. The difference in machines being the cause of the problem should only be the case if you are working with a very large string. If the string is not huge, it may be some problem with library or utility method that doesn't work well in a 64-bit environment.
Also, use a simple/small xml file to start with if you have nothing else to try.
Where do you initialize sb->length? Your problem is likely in strncpy(), though I don't know why the 32-bit to 64-bit O/S change would matter. Your best bet is looking at the intermediate values; the problem will then be obvious.
Being one of the developers of libsbml, I just stumbled over this. Is this still a problem for you? In the meantime we have released libsbml 5, with separate 64-bit and 32-bit versions and much improved testing; please have a look at:
http://sf.net/projects/sbml/files/libsbml
The problem could be pretty much anything. True, it might be that strncpy does something bad, but most likely, it is simply passed a bad pointer. Which could originate from anywhere. A segfault (or access violation in Windows) simply means that the application tried to read or write to an address it did not have permission to access. So the real question is, where did that address come from? The function that tried to follow the pointer is probably ok. But it was passed a bad pointer from somewhere else. Probably.
Unfortunately, debugging C code is not trivial at the best of time. If the code isn't your own, that doesn't make it easier. :)
StringBuffer is defined as follows:
/**
* Creates a new StringBuffer and returns a pointer to it.
*/
LIBSBML_EXTERN
StringBuffer_t *
StringBuffer_create (unsigned long capacity)
{
StringBuffer_t *sb;
sb = (StringBuffer_t *) safe_malloc(sizeof(StringBuffer_t));
sb->buffer = (char *) safe_malloc(capacity + 1);
sb->capacity = capacity;
StringBuffer_reset(sb);
return sb;
}
A bit more of the stack trace is:
[ 0] 000000003CB3856E libsbml.dll+165230 (StringBuffer_append+000030)
[ 6] 000000003CB1BFAF libsbml.dll+049071 (EventAssignment_createWith+001631)
[ 12] 000000003CB1C1D7 libsbml.dll+049623 (SBML_formulaToString+000039)
[ 18] 000000003CB2C154 libsbml.dll+115028 (Rule::setFormulaFromMath+000036)
[ 20] 0000000001751913 libmx.dll+137491 (mxCheckMN_700+000723)
[ 25] 000000003CB1E7B2 libsbml.dll+059314 (KineticLaw_getFormula+000018)
[ 37] 0000000035727749 TranslateSBML.mexw64+030537 (mexFunction+009353)
Seems if it was in any of the StringBuffer_* functions, that would be in the stack trace. I disagree with how _ensureCapacity and _grow are implemented. None of the functions check if realloc works or not. Realloc failure will certainly cause a segfault. I would insert a check for null after _ensureCapacity. With the way _ensureCapacity and _grow are, it seems possible to get an off-by-one error. If you're running on Windows, the 64-bit and 32-bit systems may have different page protection mechanisms that cause it to fail. (You can often live through off-by-one errors in malloc'ed memory on systems with weak page protection.)
Let's assume that safe_malloc and safe_realloc do something sensible like aborting the program when they can't get the requested memory. That way your program won't continue executing with invalid pointers.
Now let's look at how big StringBuffer_ensureCapacity grows the buffer to, in comparison to the wanted capacity. It's not an off-by-one error. It's an off-by-a-factor-of-two error.
How your program ever worked on x86, I can't guess.
In response to bk1e's comment on the question - unfortunately, I need version 2.0.2 for use with the COBRA toolbox, which doesn't work with the newer version 3. So, I'm stuck with this older version for now.
I'm also hitting some walls attempting to debug - I'm building a .dll, so in addition to recompiling xerces to make sure it has the same debugging settings in MSVC++, I also need to attach to the Matlab process to do the debugging - it's a pretty big jump for my limited experience in this environment, and I haven't dug into it very far yet.
I had hoped there was some obvious syntax issue with the buffer allocation or expansion. Looks like I'm in for a few more days of pain, though.
unsigned long is not a safe type to use for sizes on a 64-bit machine in Windows. Unlike Linux, Windows defines long to be 32 bits on both 32- and 64-bit architectures. So if the buffer being appended to grows beyond 4 GB in size (or if you're trying to append a string whose length is >4GB), you need to change the unsigned long type declarations to size_t (which is 64 bits on 64-bit architectures, in all operating systems).
However, if you're only reading a 1.5 MB file, I don't see how you'd ever get a StringBuffer to exceed 4 GB in size, so this might be a blind alley.
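Pulling together the suggestions in this thread (size_t instead of unsigned long, a growth loop that cannot spin at capacity zero, and a checked reallocation), a repaired append might look like the sketch below. Note that StringBuffer_t here is a minimal stand-in, not libsbml's actual definition, and safe_realloc is replaced by a plain checked realloc:

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *buffer;
    size_t length;
    size_t capacity;   /* bytes allocated, including room for '\0' */
} StringBuffer_t;

/* Returns 0 on success, -1 on allocation failure (instead of writing
 * through a null pointer and segfaulting). */
int StringBuffer_append_checked(StringBuffer_t *sb, const char *s)
{
    size_t len = strlen(s);
    size_t wanted = sb->length + len;

    if (wanted + 1 > sb->capacity) {
        size_t c = sb->capacity ? sb->capacity : 16;  /* never loop from 0 */
        while (c < wanted + 1)
            c *= 2;
        char *p = (char *)realloc(sb->buffer, c);
        if (p == NULL)
            return -1;
        sb->buffer = p;
        sb->capacity = c;
    }
    memcpy(sb->buffer + sb->length, s, len + 1);   /* copies the '\0' too */
    sb->length = wanted;
    return 0;
}
```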