C++ Benchmark tool

C++ Benchmark tool - c++

I have some application, which makes database requests. I guess it doesn't actually matter, what kind of the database I am using, but let's say it's a simple SQLite-driven database.
Now, this application runs as a service and does some amount of requests per minute (this number might actually be huge).
I'm willing to benchmark the queries to retrieve their number, maximal / minimal / average running time for some period and I wish to design my own tool for this (obviously, there are some, but I need my own for some appropriate reasons :).
So - could you advice an approach for this task?
I guess there are several possible cases:
1) I have access to the application source code. Here, obviously, I want to make some sort of cross-application integration, probably using pipes. Could you advice something about how this should be done and (if there is one) any other possible solution?
2) I don't have sources. So, is this even possible to perform some neat injection from my application to benchmark the other one? I hope there is a way, maybe hacky, whatever.
Thanks a lot.

See C++ Code Profiler for a range of profilers.
Or C++ Logging and performance tuning library for rolling your own simple version

My answer is valid just for the case 1).
In my experience profiling it is a fun a difficult task. Using professional tools can be effective but it can take a lot of time to find the right one and learn how to use it properly. I usually start in a very simple way. I have prepared two very simple classes. The first one ProfileHelper the class populate the start time in the constructor and the end time in the destructor. The second class ProfileHelperStatistic is a container with extra statistical capability (a std::multimap + few methods to return average, standard deviation and other funny stuff).
The ProfilerHelper has an reference to the container and before exit the destructor push the data in the container.You can declare the ProfileHelperStatistic in the main and if you create on the stack ProfilerHelper at the beginning of a specific function the job is done. The constructor of the ProfileHelper will store the starting time and the destructor will push the result on the ProfileHelperStatistic.
It is fairly easy to implement and with minor modification can be implemented as cross-platform. The time to create and destroy the object are not recorded, so you will not polluted the result. Calculating the final statistic can be expensive, so I suggest you to run it once at the end.
You can also customize the information that you are going to store in ProfileHelperStatistic adding extra information (like timestamp or memory usage for example).
The implementation is fairly easy, two class that are not bigger than 50 lines each. Just two hints:
1) catch all in the destructor!
2) consider to use collection that take constant time to insert if you are going to store a lot of data.
This is a simple tool and it can help you profiling your application in a very effective way. My suggestion is to start with few macro functions (5-7 logical block) and then increase the granularity. Remember the 80-20 rule: 20% of the source code use 80% of the time.
Last note about database: database tunes the performance dynamically, if you run a query several time at the end the query will be quicker than at the beginning (Oracle does, I guess other database as well). In other word, if you test heavily and artificially the application focusing on just few specific queries you can get too optimistic results.

I guess it doesn't actually matter,
what kind of the database I am using,
but let's say it's a simple
SQLite-driven database.
It's very important what kind of database you use, because the database-manager might have integrated monitoring.
I could speak only about IBM DB/2, but I beliefe that IBM DB/2 is not the only dbm with integrated monitoring tools.
Here for example an short overview what you could monitor in IBM DB/2:
statements (all executed statements, execution count, prepare-time, cpu-time, count of reads/writes: tablerows, bufferpool, logical, physical)
tables (count of reads / writes)
bufferpools (logical and physical reads/writes for data and index, read/write times)
active connections (running statements, count of reads/writes, times)
locks (all locks and type)
and many more
Monitor-data could be accessed via SQL or API from own software, like for example DB2 Monitor does.

Under Unix, you might want to use gprof and its graphical front-end, kprof. Compile your app with the -pg flag (I assume you're using g++) and run it through gprof and observe the results.
Note, however, that this type of profiling will measure the overall performance of an application, not just SQL queries. If it's the performance of queries you want to measure, you should use special tools that are designed for your DBMS - for example, MySQL has a builtin query profiler (for SQLite, see this question: Is there a tool to profile sqlite queries? )

There is a (linux) solution you might find interesting since it could be used on both cases.
It's the LD_PRELOAD trick. It's an environment variable that let's you specify a shared library to be loaded right before your program is executed. The symbols load from this library will override any other available on the system.
The basic idea is to this custom library as a wrapper around the original functions.
There is a bunch of resources available that explain how to use this trick: 1 , 2, 3

Here, obviously, I want to make some sort of cross-application integration, probably using pipes.
I don't think that's obvious at all.
If you have access to the application, I'd suggest dumping all the necessary information to a log file and process that log file later on.
If you want to be able to activate and deactivate this behavior on-the-fly, without re-starting the service, you could use a logging library that supports enabling/disabling log channels on-the-fly.
Then you'd only need to send a message to the service by whatever means (socket connection, ...) to enable/disable logging.
If you don't have access to the application, then I think the best way would be what MacGucky suggested: let the profiling/monitoring tools of the DBMS do it. E.g. MS-SQL has a nice profiler that can capture requests to the server, including all kinds of useful data (CPU time for each request, IO time, wait time etc.).
And if it's really SQLite (plus you don't have access to the source) then your chances are rather low. If the program in question uses SQLite as a DLL, then you could substitute your own version of SQLite, modified to write the necessary log files.

Use the apache jmeter.
To test performances of your sql queries under high load

Related

Profiling a multiprocess system

I have a system that i need to profile.
It is comprised of tens of processes, mostly c++, some comprised of several threads, that communicate to the network and to one another though various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the roundtrip times of various code sequences (for example processing an incoming packet or a cli command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem and repeating.
This method seems sorta hacky and guess-worky. I dont like it.
How would you suggest to approach this problem?
Are there tools that would help me out (multi-process profiler?)?
What im looking for is more of a strategy than just specific tools.
Should i profile every process separately and look for problems? if so how do i approach this?
Do i try and isolate the problematic processes and go from there? if so, how do i isolate them?
Are there other options?

I don't think there is a single answer to this sort of question. And every type of issue has it's own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent - the next question is of course whether that time is actually necessary or not, and no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps, and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU-time to do both, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multiprogram system can be a good starting point for figuring out IF that code is well written, or can be improved.
If the system is I/O bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter - if you contact google to search for "kerflerp", or if you contact your local webserver that is a meter away, will have a dramatic impact on the time for a reasonable response.
There are lots of other issues - running two pieces of code in parallel that uses LOTS of memory can cause both to run slower than if they are run in sequence - because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file-I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE it is spending time is another method that works reasonably well. Particularly if you KNOW what the use-case is where it takes time.
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit test to check that the code is behaving as expected, and no-one added a lot of code to slow it down would also be a useful thing.

Performance goes bad for range based query on cassandra wide column family when implemented using multithreaded pycassa xget() calls?

I have a typical wide column family with {rowkey--> uuid4+date and timeseries data as columns}, on which I have implemented a range based query using pycassa xget() calls. Not that I was plagued with poor performance with single threaded code, I was more like curious to understand the difference in performance when the xget() calls are made in parallel rather than sequential (from inside of a for: loop).
I have used the "threading" python library to implement the multithreaded version of the range based query and performance actually degraded considerably. Now I am aware of the effect that python GIL has on multithreaded code but is there any way I can be sure that this is infact caused by GIL? Can it be something else that is causing this ?
Thanks in advance.
Note: I am not considering the "multiprocessing" library because I can't afford to have different ConnectionPool object for each sub-process.

One thing I would try is playing around with different values for the buffer_size kwarg for xget() (the default is 1024).
If the GIL is the problem, you'll see CPU usage somewhere between ~90% and ~120% for the process. Otherwise, you may want to adjust the size of the ConnectionPool to make sure there is at least one connection available for each thread.
If all else fails, try profiling your application: http://docs.python.org/2/library/profile.html.

check the performance of an exe through code

I want to check the performance of an application (whose exe i have, no source code) by running it multiple times and possibly compare the results, dint find much on the internet regarding this topic,
Since i have to do it with multiple input times, i thought doing it through code(no bar on the language used) can make things easier, as i may have to repeat them many times,
can anyone help me start off???
Note: by Performance i mean the memory usage, cpu and possibly the time taken to do it!
(I'm currently using perfmon on windows by using necessary counters to check these parameters and manually noting it down)
Thanks

It strongly depends upon your operating system. On Linux, you could use the time utility. And strace might help you understanding the system calls that are used.
I have no idea of the equivalent on Windows systems.

I think that you could create a bash/batch script to call your program as many times as you need and with different inputs.
You could then have your script create a CSV file that contains the time it took to execute your program (start date and end date for example). CSV files are usually compatible with most spreadsheet programs like Excel, so I think that can make it easier for you to process your data, like creating means and standard deviations.
I don't have much to say regarding the memory and CPU usage, but if you are in Windows it wouldn't hurt to take a look at the Process Explorer and the Process Monitor (you can find them in this page). I think that they might help you in your task.
Finally if you are in Linux I think that you might be able to use grep with the top command to gather some statistics.
Regards,
Felipe

If you want exact results, Rational Purify (on Windows), or valgrind (on Linux) are the best tools; these run your application in a virtual machine that can be instructed to do exact cycle counting.

In another post an utility named timethis.exe was mentioned for measuring time under Windows. Maybe it is useful for your purposes.

I used the perform im using to manually note down in an automated way,
that is, i used the performance counter class available in dot net and obtained samples of the particular application at regular intervals and generated a graph with those values..
Thanks :)

How to get a debug flow of execution in C++

I work on a global trading system which supports many users. Each user can book,amend,edit,delete trades. The system is regulated by a central deal capture service. The deal capture service informs all the user of any updates that occur.
The problem comes when we have crashes, as the production environment is impossible to re-create on a test system, I have to rely on crash dumps and log files.
However this doesn't tell me what the user has been doing.
I'd like a system that would (at the time of crashing) dump out a history of what the user has been doing. Anything that I add has to go into the live environment so it can't impact performance too much.
Ideas wise I was thinking of a MACRO at the top of each function which acted like a stack trace (only I could supply additional user information, like trade id's, user dialog choices, etc ..) The system would record stack traces (on a per thread basis) and keep a history in a cyclic buffer (varying in size, depending on how much history you wanted to capture). Then on crash, I could dump this history stack.
I'd really like to hear if anyone has a better solution, or if anyone knows of an existing framework?
Thanks
Rich

Your solution sounds pretty reasonable, though perhaps rather than relying on viewing your audit trail in the debugger you can trigger it being printed with atexit() handlers. Something as simple as a stack of strings that have __FILE__,__LINE__,pthread_self() in them migth be good enough
You could possibly use some existing undo framework, as its similar to an audit trail, but it's going to be more heavyweight than you want. It will likely be based on the command pattern and expect you to implement execute() methods, though I suppose you could just leave them blank.

Trading systems usually don't suffer the performance hit of instrumentation of that level. C++ based systems, in particular, tend to sacrifice the ease of debugging for performance. Otherwise, more companies would be developing such systems in Java/C#.
I would avoid an attempt to introduce stack traces into C++. I am also not confident that you could introduce such a system in a way that would not affect the behavior of the program in some way (e.g., affect threading behavior).
It might, IMHO, be preferable to log the external inputs (e.g., user GUI actions and message traffic) rather than attempt to capture things internally in the program. In that case, you might have a better chance of replicating the failure and debugging it.
Are you currently logging all network traffic to/from the client? Many FIX based systems record this for regulatory purposes. Can you easily log your I/O?

I suggest creating another (circular) log file that contains your detailed information. Beware that this file will grow exponentially compared to other files.
Another method is to save the last N transactions. Write a program that reads the transaction log and feeds the data into your virtual application. This may help create the cause. I've used this technique with embedded systems before.

Fastest small datastore on Windows

My app keeps track of the state of about 1000 objects. Those objects are read from and written to a persistent store (serialized) in no particular order.
Right now the app uses the registry to store each object's state. This is nice because:
It is simple
It is very fast
Individual object's state can be read/written without needing to read some larger entity (like pulling out a snippet from a large XML file)
There is a decent editor (RegEdit) which allow easily manipulating individual items
Having said that, I'm wondering if there is a better way. SQLite seems like a possibility, but you don't have the same level of multiple-reader/multiple-writer that you get with the registry, and no simple way to edit existing entries.
Any better suggestions? A bunch of flat files?

If what you mean by 'multiple-reader/multiple-writer' is that you keep a lot of threads writing to the store concurrently, SQLite is threadsafe (you can have concurrent SELECTs and concurrent writes are handled transparently). See the [FAQ [1]] and grep for 'threadsafe'
[1]: http://www.sqlite.org/faq.html/ FAQ

If you do begin to experiment with SQLite, you should know that "out of the box" it might not seem as fast as you would like, but it can quickly be made to be much faster by applying some established optimization tips:
SQLite optimization
Depending on the size of the data and the amount of RAM available, one of the best performance gains will occur by setting sqlite to use an all-in-memory database rather than writing to disk.
For in-memory databases, pass NULL as the filename argument to sqlite3_open and make sure that TEMP_STORE is defined appropriately
On the other hand, if you tell sqlite to use the harddisk, then you will get a similar benefit to your current usage of RegEdit to manipulate the program's data "on the fly."
The way you could simulate your current RegEdit technique with sqlite would be to use the sqlite command-line tool to connect to the on-disk database. You can run UPDATE statements on the sql data from the command-line while your main program is running (and/or while it is paused in break mode).

I doubt any sane person would go this route these days, however some of what you describe could be done with Window's Structured/Compound Storage. I only mention this since you're asking about Windows - and this is/was an official Windows way to do this.
This is how DOC files were put together (but not the new DOCX format). From MSDN it'll appear really complicated, but I've used it, it isn't the worst API in Win32.
it is not simple
it is fast, I would guess it's faster then the registry.
Individual object's state can be read/written without needing to read some larger entity.
There is no decent editor, however there are some real basic stuff (VC++ 6.0 had the "DocFile Viewer" under Tools. (yeah, that's what that thing did) I found a few more online.
You get a file instead of registry keys.
You gain some old-school Windows developer geek-cred.
Other random thoughts:
I think XML is the way to go (despite the random access issue). Heck, INI files may work. The registry gives you very fine grain security if you need it - people seem to forget this when the claim using files are better. An embedded DB seems like overkill if I'm understanding what you're doing.

Do you need to persist the objects on each change event or just in memory and store on shutdown? If so, just load them up and serialize them at the end, assuming your app runs for a long time (and you don't share that state with another program) then in memory is going to be a winner.
If you've got fixed size structures then you could consider just using a memory mapped file and allocate memory from that?

If the only thing you do is serialize/deserialize individual objects (no fancy queries), then use a btree database, for example Berkeley DB. It is very fast at storing and retrieving chunks of data by key (I assume your objects have some id that can be used as a key) and access by multiple processes is supported.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

C++ Benchmark tool - c++

See C++ Code Profiler for a range of profilers. Or C++ Logging and performance tuning library for rolling your own simple version

Use the apache jmeter. To test performances of your sql queries under high load

Related

Profiling a multiprocess system

Performance goes bad for range based query on cassandra wide column family when implemented using multithreaded pycassa xget() calls?

check the performance of an exe through code

How to get a debug flow of execution in C++

Fastest small datastore on Windows

Categories

Resources