To compare C++ and Java on certain tasks I have written two similar programs, one in Java and one in C++. When I run the Java one it uses 25% CPU without fluctuation, which is what you would expect on a quad core. However, the C++ version only uses about 8% and fluctuates heavily. I run both programs on the same computer, on the same OS, with the same programs active in the background. How do I make the C++ program use one full core? Neither program is interrupted by anything: both ask for some input and then enter an infinite loop until you exit the program, reporting how many calculations they perform per second.
The code:
http://pastebin.com/5rNuR9wA
http://pastebin.com/gzSwgBC1
http://pastebin.com/60wpcqtn
To answer some questions:
I'm basically looping a bunch of code and measuring how many times per second it loops. The problem is that it doesn't use all the CPU it could. The whole point is to have the same processor do the same task in Java and C++ and compare the number of loops per second. But if one uses irregular amounts of CPU time while the other loops steadily at a fixed percentage, they are hard to compare. By the way, if I ask it to execute this:
while(true){}
it uses 25%, so why doesn't my code do the same?
----edit:----
After some experimenting it seems that my code starts to use less than 25% CPU if I use a cout statement. It isn't clear to me why a cout would cause the program to use less CPU (I guess it pauses until the output is written, which apparently takes a while?).
With this knowledge I will reprogram both programs (to keep them comparable) and have them report their results after 60 seconds instead of after every completed loop.
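Roughly, the measuring loop will look something like the sketch below; do_work() here is just a placeholder for the real calculation, not the code from the pastebins above:

#include <chrono>
#include <cstdint>
#include <iostream>

// Placeholder for the real calculation being benchmarked.
void do_work() {
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x += i;
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::uint64_t iterations = 0;

    // No output inside the hot loop, so it never waits on I/O.
    while (clock::now() - start < std::chrono::seconds(60)) {
        do_work();
        ++iterations;
    }

    // Report once, after the measurement window is over.
    std::cout << iterations / 60.0 << " loops per second\n";
}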
Thanks for all the help, some of the tips were really helpful. After I discovered the answer, someone also gave it as an answer here, so even if I hadn't found it myself I would have gotten the answer. Thanks!
(Though I would still like to know why a std::cout call takes so much time.)
Your main loop has a cout in it, which will call out to the OS to write the accumulated output at some point. Either that OS time is not counted against your app, or it causes some disk I/O or other activity that forces your program to wait.
It's probably not accurate to compare the two programs running at the same time without considering the fact that they will compete for CPU time. The OS will choose the scheduling for the two tasks automatically, which can be affected by which one started first and a multitude of other criteria.
Running them both at the same time would require configuring the scheduling so that each one is confined to one (or two) CPUs and the two applications use different CPUs. This can be done by having each main function run a separate thread that performs all the work and then setting the CPU on which that thread will run. In C++11 this can be done by creating a std::thread, getting its native_handle(), and setting the underlying CPU affinity through it.
I'm not sure how to do this in Java but I'm sure the process is similar.
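For example, on Linux a sketch of pinning the worker thread might look like this (pthread_setaffinity_np is a non-portable GNU extension, so this is Linux-specific; on Windows you would use SetThreadAffinityMask with the same native handle; compile with -pthread):

#include <pthread.h>   // pthread_setaffinity_np (GNU extension, Linux)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <iostream>
#include <thread>

void work() {
    // Stand-in for the benchmarked loop.
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000000UL; ++i) sink = i;
}

int main() {
    std::thread t(work);

    // Pin the worker thread to CPU 0 through its native handle.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    if (pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset) != 0) {
        std::cerr << "failed to set affinity\n";
    }

    t.join();
}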
I used to play with Visual Basic 5.0 as a kid. It allowed me to place lots of 'timers' on a 'form' that would seemingly run simultaneously... whereas now I'm starting to learn lower-level programming languages and everything seems to run one-thing-at-a-time.
Can someone help me grasp this concept of simultaneity, and why VB seemed to make it so easily available, while so far in learning C++ I haven't come across anything that lets me replicate that simultaneous running of code?
Is most of the 'simultaneity' in simple Visual Basic programs actually an illusion that C++ code can easily recreate? Sorry for lacking the terminology.
edit: Thanks for the replies. They have clarified that it was indeed usually an illusion of simultaneity. To explain further what was in my mind, at my early stage in learning C++, I don't think I know how to write a program that increments the value of 'x' every 2 seconds while simultaneously incrementing 'y' every 5 seconds.
This question isn't appropriate for Stack Overflow, but I really like answering people's questions, so I will anyway, even though the answer will be necessarily incomplete.
What Visual Basic was doing was providing you with a special environment in which things appeared to be running simultaneously. All those timers were just entries in a data structure that told Visual Basic when it had to go act like that timer had just expired. It used the current time to decide when it should run them.
In general, Visual Basic is what controlled the flow of your program. In between all the statements you wrote, the Visual Basic runtime was doing all kinds of things behind the scenes for you. Things like checking up on the data structures it was keeping of which timer to fire when and so on.
There are two main ways of controlling the flow of your program in such a way as to make it seem like several things are happening at the same time. What Visual Basic was doing is called an 'event loop'. That's a main loop in your program that keeps track of all the things that need doing and goes and runs the code that does them at the right times. When someone clicks on a window, your program gets an event, the event loop sees that event and runs the 'clicked on a window' code when it gets control. It sort of seems like your program just responds instantly to the click, but that's not what's really happening.
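As a rough sketch (not how Visual Basic actually implemented it), a timer-driven event loop in C++ can be as simple as a list of timers, each with a next-fire time, all serviced by one loop:

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

struct Timer {
    std::chrono::steady_clock::time_point next_fire;
    std::chrono::milliseconds interval;
    std::function<void()> callback;
};

int main() {
    using namespace std::chrono;
    auto now = steady_clock::now();

    // Two "timers", much like dropping two timer controls on a VB form.
    std::vector<Timer> timers = {
        { now + seconds(2), seconds(2), [] { std::cout << "tick x\n"; } },
        { now + seconds(5), seconds(5), [] { std::cout << "tick y\n"; } },
    };

    // The event loop: a single thread takes turns running whichever
    // timer callback is due. Nothing actually runs simultaneously.
    while (true) {
        now = steady_clock::now();
        for (auto& t : timers) {
            if (now >= t.next_fire) {
                t.callback();
                t.next_fire += t.interval;
            }
        }
        std::this_thread::sleep_for(milliseconds(10));  // avoid spinning the CPU
    }
}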
The other way is called 'multi-threading'. In that way your program may really do several things at the same time. The operating system itself decides how to allocate CPU resources to running 'threads'. If there is a lot of competition for CPU resources, the operating system may start running one program (or thread) for a little while (a thousandth of a second or so) and then switch to running a different program, aka process (threads and processes are strongly related but definitely distinct concepts). The switching happens so fast, and can happen at any instant, so it seems like several things are happening at once. But if there are enough CPU resources available (multiple cores), the computer can actually run several things at the same time.
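For contrast, here is a minimal multi-threaded sketch of the example from the question's edit: one thread increments 'x' every 2 seconds while another increments 'y' every 5 seconds (compile with -pthread; the console output from the two threads may interleave):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> x{0}, y{0};

    // One thread increments x every 2 seconds...
    std::thread tx([&] {
        while (true) {
            std::this_thread::sleep_for(std::chrono::seconds(2));
            ++x;
            std::cout << "x = " << x << '\n';
        }
    });

    // ...while another increments y every 5 seconds.
    std::thread ty([&] {
        while (true) {
            std::this_thread::sleep_for(std::chrono::seconds(5));
            ++y;
            std::cout << "y = " << y << '\n';
        }
    });

    tx.join();  // never returns; the threads run forever
    ty.join();
}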
There are advantages and disadvantages to each approach. Actually dealing with two things modifying the same data at the exact same time and making sure everything stays consistent can be very hard, and is usually accomplished by having sections of your program in which only one thread is allowed to be running at a time, so that those modifications can't actually happen in the same instant and are instead 'serialized': one thread's modification happens before another's. These are usually called 'critical sections', and access to them is typically controlled by something called a 'mutex' (aka mutual exclusion) lock.
Event driven code doesn't have that problem since typically one event must be fully processed before the code to process the next event can start. But this can lead to under-utilization of CPU resources, and it can also sometimes introduce noticeable delays in handling events as one event can't be processed until the code to process the preceding event is finished.
In short, Visual Basic was making event driven programming really easy and clean to use, and providing you with the illusion that several things were running at the same time.
These, in general, are fairly deep computer science topics, and people get their PhDs studying some of this stuff. But a working understanding of how event-driven code and multi-threaded code work isn't that hard to acquire.
There are also other models of concurrency out there, and hybrid models like 'co-routines' that look like threads but are really events.
I put all the most useful concept handles in quotes to emphasize them.
Is there a minimum number of instructions guaranteed to be executed by a thread during any given time slot? The Wikipedia page for Execution Model says "The addition operation is an indivisible unit of work in many languages".
I would like to learn more about the execution model of POSIX Threads used with C/C++ and the minimum number of indivisible instructions or statements guaranteed to be executed in a single time slot. Can someone give me a pointer to where I can learn more about this? Thanks in advance.
No, there are no guarantees on the number of instructions executed per time slot. The way things work is more complicated than executing a set number of instructions anyway.
Executed instructions depends more on the processor architecture than the language. The "traditional" MIPS architecture taught in many introductory design courses would execute one instruction per clock cycle; a processor designed like this running at 1MHz would execute one million operations per second. Real-world processors use techniques such as pipelines, branch prediction, "hyper-threading", etc. and do not have a set number of operations per clock cycle.
Beyond that, real-world processors will generally function under an operating system with multi-tasking capabilities. This means that a thread can be interrupted by the kernel at unknown points, and not execute any code at all as other threads are given processor time. There are "real-time" operating systems that are designed to give more guarantees about how long it takes to execute the code running on the processor.
You have already done some research on Wikipedia; some of the keywords above should help track down more articles on the subject, and from there you should be able to find plenty of primary sources to learn more on the subject.
In POSIX threads, there are two main scheduling policies (FIFO and Round Robin). Round Robin is the default scheduler as it's more fair.
When the RR scheduler is used, each thread gets an amount of time to run (its quantum), so there's no guarantee that some number X of instructions will get executed - unless we knew exactly how much time each instruction takes.
You can find more about scheduling algorithms on PThreads here: http://maxim.int.ru/bookshelf/PthreadsProgram/htm/r_37.html
Just to give an idea of how Linux defines the round-robin quantum:
/*
* default timeslice is 100 msecs (used only for SCHED_RR tasks).
* Timeslices get refilled after they expire.
*/
#define RR_TIMESLICE (100 * HZ / 1000)
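For completeness, here is a hedged sketch of explicitly requesting the Round Robin policy for a new thread through the pthreads API (on Linux this typically requires root or CAP_SYS_NICE, and the thread body here is just a placeholder; compile with -pthread):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

void* worker(void*) {
    // Real thread work would go here.
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    // Ask for the Round Robin policy explicitly instead of inheriting
    // the creating thread's scheduling attributes.
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);

    sched_param param{};
    param.sched_priority = sched_get_priority_min(SCHED_RR);
    pthread_attr_setschedparam(&attr, &param);

    pthread_t tid;
    int err = pthread_create(&tid, &attr, worker, nullptr);
    if (err != 0) {
        // Without sufficient privileges this typically fails with EPERM.
        std::fprintf(stderr, "pthread_create failed: %d\n", err);
        return 1;
    }
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
}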
I have a system that I need to profile.
It is made up of tens of processes, mostly C++, some of which have several threads, that communicate over the network and with one another through various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, in inefficient use of syscalls, or somewhere else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the round-trip times of various code sequences (for example processing an incoming packet or a CLI command) and seeing which process takes the most time. After that, profiling that process, fixing the problem and repeating.
This method seems rather hacky and guesswork-driven. I don't like it.
How would you suggest approaching this problem?
Are there tools that would help me out (a multi-process profiler?)?
What I'm looking for is more of a strategy than just specific tools.
Should I profile every process separately and look for problems? If so, how do I approach this?
Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?
Are there other options?
I don't think there is a single answer to this sort of question, and every type of issue has its own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent. The next question is of course whether that time is actually necessary or not: no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps and one that performs a million-element matrix multiplication very efficiently - it takes the same amount of CPU time to do both, but only one of them actually achieves anything. However, knowing which program takes most of the time in a multi-program system can be a good starting point for figuring out whether that code is well written or can be improved.
If the system is I/O-bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter - whether you contact Google to search for "kerflerp" or contact your local webserver that is a meter away will have a dramatic impact on the time for a reasonable response.
There are lots of other issues - running two pieces of code in parallel that use LOTS of memory can cause both to run slower than if they were run in sequence, because the high memory usage causes swapping, or because the OS isn't able to use the spare memory for caching file I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications so that you can see WHERE they are spending time is another method that works reasonably well, particularly if you KNOW what the use-case is where it takes time.
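A cheap way to do that kind of logging is a small RAII timer that reports how long a scope took when it exits; this is only a sketch, and handle_packet() is a hypothetical stand-in for one of your code sequences:

#include <chrono>
#include <cstdio>
#include <string>

// Minimal RAII timer: logs how long a scope took when it is destroyed.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        using namespace std::chrono;
        auto us = duration_cast<microseconds>(steady_clock::now() - start_).count();
        std::fprintf(stderr, "[timing] %s: %lld us\n", label_.c_str(),
                     static_cast<long long>(us));
    }

private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};

void handle_packet() {
    ScopedTimer t("handle_packet");   // hypothetical code sequence to measure
    // ... actual packet processing would go here ...
}

int main() {
    handle_packet();
}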
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit tests to check that the code is behaving as expected, and that no one has added a lot of code that slows it down, would also be a useful thing.
I have a multi-threaded process. Each thread is CPU-bound (performs calculations) and also uses a lot of memory. The process starts at 100% CPU utilization according to Resource Monitor, but after several hours the CPU utilization starts to degrade, slowly. After 24 hours it's at 90-95% and falling.
The question is - what should I look for, and what best-known-methods can I use to debug this?
Additional info:
I have enough RAM - most of it is unused at any given moment.
According to perfmon - memory doesn't grow (so I don't think it's leaking).
The code is a mix of .Net and native c++, with some data marshaling back and forth.
I saw this on several different machines (servers with 24 logical cores).
One thing I saw in perfmon - Modified Page List Bytes indicator increases over time as CPU utilization degrades.
Edit 1
One of the third-party libraries in use is openfst. It looks like the issue is closely related to some misuse of that library.
Specifically, I noticed that I have the following warnings:
warning LNK4087: CONSTANT keyword is obsolete; use DATA
Edit 2
Since the question is closed, and wasn't reopened, I will write my findings and how the issue was solved in the body of the question (sorry) for future users.
It turns out there is an openfst.def file that defines all the openfst FLAGS_* symbols to be used by consuming applications/DLLs. I had to fix those to use the keyword "DATA" instead of "CONSTANT" (CONSTANT is obsolete because it's risky - more info: https://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx).
After that, no more degradation in CPU utilization was observed, and no more rise in the "Modified Page List Bytes" indicator. I suspect it was related to the default values of the FLAGS (specifically the garbage collection flag FLAGS_fst_default_cache_gc), which were non-deterministic because of the misuse of the CONSTANT keyword in the openfst.def file.
Conclusion: Understand your warnings! Eliminate as many of them as you can!
Thanks.
For a non-obvious issue like this, you should also use a profiler that actually samples the underlying hardware counters in the CPU. Most profilers that I’m familiar with use kernel supplied statistics and not the underlying HW counters. This is especially true in Windows. (The reason is in part legacy, and in part that Windows wants its kernel statistics to be independent of hardware. PAPI APIs attempt to address this but are still relatively new.)
One of the best profilers is Intel’s VTune. Yes, I work for Intel but the internal HPC people use VTune as well. Unfortunately, it costs. If you’re a student, there are discounts. If not, there is a trial period.
You can find a lot of optimization and performance issue diagnosis information at software.intel.com. Here are pointers for optimization and for profiling. Even if you are not using an x86 architecture, the techniques are still valid.
As to what might be the issue, a degradation that slow is strange.
How often do you use new memory or access old memory? At what rate? If the rate is very slow, you might still be running into a situation where you are slowly using up a resource, e.g. pages.
What are your memory access patterns? Does it change over time? How rapidly? Perhaps your memory access patterns over time are spreading, resulting in more cache misses.
Perhaps your partitioning of the problem space is such that you have entered a new computational domain and there is no real pathology.
Look at whether there are periodic maintenance activities that take place over a longer interval, though these would result in a periodic degradation, say every 24 hours. That doesn't sound like your situation, since what you are experiencing is a gradual degradation.
If you are using an x86 architecture, consider submitting a question in an Intel forum (e.g. "Intel® Clusters and HPC Technology" and "Software Tuning, Performance Optimization & Platform Monitoring").
Let us know what you ultimately find out.
I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.
In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.
As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.
I know I should try to reduce this to a simpler case rather than asking about a 4000-line program. But there are so many possible avenues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem, or that the cause will be obvious to someone with more experience than me.
Some facts about my library and the benchmarks:
The library consists of a front-end API that pushes the log arguments onto a lock-free queue (Boost.Lockfree) and a back-end thread that performs string formatting and writes the log entries to disk.
The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
The benchmark is run on a 4-core Intel CPU (i7-3770K).
The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
Source code for the Mandelbrot computation
Source code for the benchmark.
Root of the code repository and documentation.
My question is, how can I explain the apparent speed increase when my logging library is enabled?
Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here
Other things that I tried are:
Running it with the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job to make it run for about 10 minutes. Almost all the time was spent in the inner loop of the Mandelbrot computation. But maybe I would be able to interpret the results differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
Removing the locks. This did have a significant effect on performance, but the results were still weird, and in any case I couldn't make that change in any of the multithreaded variants.
Comparing the generated assembly code. There were differences, and the logging build was clearly doing more, but nothing stood out as an obvious performance killer.
When uninitialised memory is first accessed, page faults will affect timing.
So, before your first call to std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.
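A minimal sketch of that ordering (the buffer type and size here are placeholders, not the benchmark's actual values):

#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <memory>

int main() {
    // Stand-in for the benchmark's sample/image buffer: uninitialised heap memory.
    const std::size_t buffer_size = 1024 * 1024 * 4;
    std::unique_ptr<unsigned char[]> sample_buffer(new unsigned char[buffer_size]);

    // Touch every page before timing starts, so the first-access page
    // faults are not charged to the measured region.
    std::memset(sample_buffer.get(), 0, buffer_size);

    const auto start = std::chrono::steady_clock::now();
    // ... run the Mandelbrot computation and logging here ...
    const auto stop = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}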