Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have a multi-threaded process. Each thread is CPU-bound (performs calculations) and also uses a lot of memory. The process starts at 100% CPU utilization according to Resource Monitor, but after several hours, CPU utilization slowly starts to degrade. After 24 hours, it's at 90-95% and falling.
The question is - what should I look for, and what best-known-methods can I use to debug this?
Additional info:
I have enough RAM - most of it is unused at any given moment.
According to perfmon - memory doesn't grow (so I don't think it's leaking).
The code is a mix of .NET and native C++, with some data marshaling back and forth.
I saw this on several different machines (servers with 24 logical cores).
One thing I saw in perfmon - Modified Page List Bytes indicator increases over time as CPU utilization degrades.
Edit 1
One of the third-party libraries used is OpenFst. It looks like the problem is closely related to a misuse of that library.
Specifically, I noticed that I have the following warnings:
warning LNK4087: CONSTANT keyword is obsolete; use DATA
Edit 2
Since the question was closed and wasn't reopened, I will write my findings and how the issue was solved in the body of the question (sorry), for future users.
Turns out there is an openfst.def file that defines all the OpenFst FLAGS_* symbols to be exported for consuming applications/DLLs. I had to fix those entries to use the keyword "DATA" instead of "CONSTANT" (CONSTANT is obsolete because it's risky - more info: https://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx).
After that, no more degradation in CPU utilization was observed, and no more rise in the "Modified Page List Bytes" indicator. I suspect it was related to the default values of the FLAGS (specifically the garbage-collection flag FLAGS_fst_default_cache_gc), which were non-deterministic because of the misuse of the CONSTANT keyword in the openfst.def file.
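For illustration, the relevant part of a corrected openfst.def might look like the sketch below. The exact symbol list depends on your OpenFst version; only FLAGS_fst_default_cache_gc comes from my findings above, the other names are just examples of the same pattern.

; openfst.def (sketch) - exported global flag variables must be marked DATA.
; With the obsolete CONSTANT keyword, importers that don't declare the symbol
; __declspec(dllimport) end up reading the import-table pointer instead of the
; flag's actual value, so the flags take on effectively random values.
EXPORTS
    FLAGS_fst_default_cache_gc          DATA
    FLAGS_fst_default_cache_gc_limit    DATA
    FLAGS_fst_verify_properties         DATA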
Conclusion: Understand your warnings! Eliminate as many of them as you can!
Thanks.
For a non-obvious issue like this, you should also use a profiler that actually samples the underlying hardware counters in the CPU. Most profilers that I’m familiar with use kernel supplied statistics and not the underlying HW counters. This is especially true in Windows. (The reason is in part legacy, and in part that Windows wants its kernel statistics to be independent of hardware. PAPI APIs attempt to address this but are still relatively new.)
One of the best profilers is Intel’s VTune. Yes, I work for Intel but the internal HPC people use VTune as well. Unfortunately, it costs. If you’re a student, there are discounts. If not, there is a trial period.
You can find a lot of optimization and performance issue diagnosis information at software.intel.com. Here are pointers for optimization and for profiling. Even if you are not using an x86 architecture, the techniques are still valid.
As to what might be the issue, a degradation that slow is strange.
How often do you use new memory or access old memory? At what rate? If the rate is very slow, you might still be running into a situation where you are slowly using up a resource, e.g. pages.
What are your memory access patterns? Does it change over time? How rapidly? Perhaps your memory access patterns over time are spreading, resulting in more cache misses.
Perhaps your partitioning of the problem space is such that you have entered a new computational domain and there is no real pathology.
Look at whether there are periodic maintenance activities that take place over a longer interval, though these would cause a periodic degradation, say every 24 hours. This doesn't sound like your situation, since what you are experiencing is a gradual degradation.
If you are using an x86 architecture, consider submitting a question in an Intel forum (e.g. "Intel® Clusters and HPC Technology" and "Software Tuning, Performance Optimization & Platform Monitoring").
Let us know what you ultimately find out.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
In an embedded/mobile environment, battery drain has to be considered when developing software, so energy-efficient programming matters in the embedded/mobile world.
The problem is (regarding Assembly/C/C++):
Suppose we have a stable software release that, when run on an arbitrary platform X, consumes Y amount of power (in watts). Now we carry out some code change, and we want to measure at build time how it affects energy consumption and efficiency.
INT16U x = 0xFFFF;
/* ... some code in stable release ... */
for(;x<4096;++x)
{
    /* ... some code in stable release ... */
    INT64U foo = x >> 256 ? x : 4096;  // point 1 sample code change
    if(~foo & foo) foo %= 64;          // point 1 sample code change
    /* ... some code in stable release ... */
}
Simply put: we want to measure how the code change at point 1 affects energy efficiency (relative to the stable release's statistics), rather than profiling space and time (performance plus memory). If we want to build a simple energy/power analysis and profiling tool in C/C++:
Is there any recommended C/C++ library or source for building a power analysis tool?
If we have to analyze and determine the change in power consumption at the CPU/GPU instruction level for each code change (for example, as at point 1), how can we determine the power consumption of each instruction for an arbitrary CPU or GPU on the respective platform?
How can a developer be aware of how much power consumption is reduced or increased by a code change at application build time rather than at runtime?
TL;DR: You can't do it with software, only with a physical meter.
To refine what @Some programmer dude already hinted at:
One problem is that the actual hardware implementation of a given operation is unknown. You might get lists of cycles per opcode, but you do not know what those cycles do. They can take a long route around the chip; some need to pass through more parts, some fewer, and so on. Hence it is unknown how much power an individual cycle needs.
Another problem is the nearly nondeterministic set of paths through complex code with a large input domain (e.g. 16-bit ADCs) and multiple inputs (e.g. reading several sensors at once), especially if you work with floating-point arithmetic.
It is possible to get relative differences in power consumption, but only coarse ones. Coarse as in "100 loops of the same code need more power than 10 loops". Or: if it runs faster, it most likely needs less power.
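If your development platform happens to be Linux on a recent Intel CPU (not a typical MCU target), one way to get exactly this kind of coarse, relative number without lab equipment is the RAPL energy counter that the kernel exposes through the powercap sysfs interface. This is package-level energy only, nowhere near per-instruction resolution, and the sysfs path below is an assumption about your machine (it may be named differently, be absent entirely, or require root); a minimal sketch:

#include <cstdint>
#include <fstream>
#include <iostream>

// Reads the cumulative package energy counter (in microjoules) exposed by the
// Linux powercap/RAPL driver. The path is machine-dependent; adjust as needed.
static std::uint64_t read_energy_uj()
{
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    std::uint64_t uj = 0;
    f >> uj;
    return uj;
}

int main()
{
    const std::uint64_t before = read_energy_uj();
    // ... run the workload you want to compare (stable release vs. point 1 change) ...
    const std::uint64_t after = read_energy_uj();
    // The counter wraps around eventually; ignored here for a short run.
    std::cout << "package energy: " << (after - before) << " uJ\n";
    return 0;
}

Run the stable build and the changed build back to back under the same conditions and compare the two numbers; anything finer-grained than that really does need the meter.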
No, you have to swallow the bitter pill and go to the nearest Rohde & Schwarz (not affiliated, I just saw an ad in the sidebar while writing this) shop and get a power supply, a meter (two, actually), a frequency generator, and the necessary connecting material. That will set you back somewhere in the mid to upper five-digit range (US $). Once you have it, you need to measure the power consumption of several MCUs/CPUs (ca. 30+, to be able to assume a uniform distribution) from every single batch of the processor you use.
That is a lot of work and investment if you don't have the tools already. Measuring is also an art in itself; you need to know what you are doing, and a lot of things can go wrong.
It might be a good idea to spend that money if you want a million-dollar contract with the military and need to guarantee that your product can run five years on a single battery (and the first thing said military does is slap a sticker on it that says "Change battery every 6 months!"). Otherwise: don't even start; it's not worth the headaches.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Hyper-threading can hurt the performance of some applications, and in those cases it should not be used. From the Microsoft website (https://msdn.microsoft.com/en-us/library/cc615012%28BTS.10%29.aspx):
It is critical hyper-threading be turned off for BizTalk Server
computers. Hyper-threading makes the server appear to have more
processors/processor cores than it actually does; however
hyper-threaded processors typically provide between 20 and 30% of the
performance of a physical processor/processor core. When BizTalk
Server counts the number of processors to adjust its self-tuning
algorithms; the hyper-threaded processors cause these adjustments to
be skewed which is detrimental to overall performance.
The Process Lasso program allows you to disable hyper-threading for some processes:
You can use programs like Process Lasso (free) to set default CPU
affinities for critical processes, so that their threads never get
allocated to logical cores. We call this feature HyperThreaded Core
Avoidance.
I've got some older programs which perform a lot of mathematical computations. It is frustrating to see them use one core when they could use four. I want to rewrite them to use many threads. They use large contiguous memory blocks, so the number of cache misses is minimal. My questions are the following:
How do you decide whether or not to use hyper-threading in your application? (General guidance, with some technical details if necessary.)
Does it come down to performing experiments to make the final decision?
How do you avoid hyper-threading in your application if it is not advantageous? (Examples in C++ and C.)
I don't know how Process Lasso works with respect to "disabling HyperThreading". For that particular goal, the best you could do is inject a DLL into every process in the system and call SetProcessAffinityMask with something that only amounts to a guess, masking off every other logical core in the hope that the OS will avoid scheduling onto the hyper-threaded logical cores.
Guesses and hopes; there's nothing in the Windows API that will do this for certain. This answers your third bullet point.
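To make that concrete, here is a minimal sketch of the "guess" described above: it restricts the current process to every other logical processor, under the unverified assumption that logical processors 2n and 2n+1 share a physical core (GetLogicalProcessorInformation would tell you the real layout).

#include <windows.h>
#include <cstdio>

int main()
{
    DWORD_PTR processMask = 0, systemMask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask))
        return 1;

    // Guess: keep only even-numbered logical processors, hoping each pair
    // (2n, 2n+1) maps to one physical core. Windows does not guarantee this.
    DWORD_PTR everyOther = 0;
    for (unsigned i = 0; i < sizeof(DWORD_PTR) * 8; i += 2)
        everyOther |= (DWORD_PTR)1 << i;

    DWORD_PTR newMask = processMask & everyOther;
    if (newMask == 0)
        newMask = processMask;   // don't strand the process with no allowed CPUs

    if (!SetProcessAffinityMask(GetCurrentProcess(), newMask))
        return 1;

    std::printf("affinity mask set to 0x%llx\n", (unsigned long long)newMask);
    return 0;
}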
You can disable HyperThreading at the BIOS level (usually).
I can't comment on the Microsoft advice to disable HT for BizTalk in your linked article, since I can't find a date for that article. The only interesting bit was about "Assigning interrupt affinity to logical processors...", which was new to me. The only other advice in that article regarding HT is rather weak.
On a larger note: I don't know why you're asking about HyperThreading, when you should be concerned with multithreading in general. If you're concerned about multiple threads contending for the same shared resource... then don't use threads in your app.
A humorous aside: the same company also sells a product called SmartTrim, reminiscent of the RAM-doublers that were popular in the '90s.
Basically, it comes down to configuring the number of concurrent threads executing CPU workloads. The OS is aware of hyperthreading, and will assign threads to physical cores until it runs out, and only if there are more threads than physical cores will it start assigning work to logical cores.
To decide whether the optimal number of threads is the number of physical or logical cores, measuring performance of your real tasks is the best approach. Synthetic benchmarks can teach you something about how hyperthreading works, but won't tell you what is best for your particular mix of instructions.
The exact way to control the number of threads depends on the multithreading construct you use - if you create threads yourself it is obvious, but thread pools and automated parallelism frameworks such as OpenMP also provide ways to tune the thread count, as in the sketch below.
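A minimal sketch of the "create threads yourself" case, assuming the worker function and the choice of thread count are yours to supply (std::thread::hardware_concurrency() reports logical processors, so halving it is only a rough guess at the physical core count on a typical HT machine):

#include <functional>
#include <thread>
#include <vector>

// Runs 'work(worker_index)' on a configurable number of worker threads.
void run_parallel(unsigned thread_count, const std::function<void(unsigned)>& work)
{
    if (thread_count == 0)
        thread_count = std::thread::hardware_concurrency();   // logical processors

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < thread_count; ++i)
        workers.emplace_back(work, i);
    for (auto& t : workers)
        t.join();
}

// Usage: time your real workload with both settings and compare.
//   run_parallel(std::thread::hardware_concurrency(), job);      // one thread per logical core
//   run_parallel(std::thread::hardware_concurrency() / 2, job);  // rough guess at physical cores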
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
In a few months I will start to write my bachelor's thesis. Although we have only discussed the topic of my thesis very roughly, the main problem will be something like this:
A program written in C++ (more or less an HTTP server, but I guess it doesn't matter here) has to be executed to fulfill its task. There are several instances of this program running at the same time, and a load balancer takes care of distributing HTTP requests equally between all instances. Every time the program's code is changed to enhance it, or to get rid of bugs, all instances have to be restarted. This can take up to 40 minutes for one instance. As there are more than ten instances running, the restart process can take up to one work day. This is way too slow.
The presumed bottleneck is the access to the database during startup to load all necessary data (I guess it will be a MySQL database). The team leader's idea for decreasing the startup time is to serialize the content of the database to a file, and read from this file instead of reading from the database. That would be my task. Of course, the problem is to check whether there is new data in the database that is not yet in the file; I guess write operations are still applied to the database, not to the serialized file. My first idea is to use Apache Thrift for serialization and deserialization, as I have already worked with it and it is fast, as far as I know (maybe I'll write a small Python program to take care of this). However, I have some basic questions regarding this problem:
Is it a good solution to read from a file instead of reading from the database? Is there any chance this will save time?
Would Thrift work well in this scenario, or is there some faster way to do the serialization/deserialization?
As I am only reading, not writing, I don't have to take care of consistency, right?
Can you recommend some books or online literature worth reading on this topic?
If I'm missing information, just ask. Thanks in advance. I just want to be well informed and prepared before I start the thesis; that is why I ask.
Kind regards
Michael
Cache is king
As a general recommendation: Cache is king, but don't use files.
Cache? What cache?
The cache I'm talking about is of course an external cache. There are plenty of systems available, and a lot of them are able to form a cache cluster with cached items spread across multiple machines' RAM. If you do it cleverly, the cost of serializing/deserializing into memory will make your algorithms shine compared to the cost of grinding the database. And on top of that, you get nice features like a TTL for cached data, a cache that persists even if your business logic crashes, and much more.
What about consistency?
As I am only reading, not writing, I don't have to take care of consistency, right?
Wrong. The issue is not who writes to the database. It is about whether or not someone writes to the database, how often this happens, and how up-to-date your data needs to be.
Even if you cache your data into a file as planned in your question, you have to be aware that this produces a redundant duplicate of the data, disconnected from the original data source. So the real question you have to answer (I can't do this for you) is what the optimum update frequency should be. Do you need immediate updates in near-real-time? Is a certain time lag acceptable?
This is exactly the purpose of the TTL (time to live) value that you can put on your cached data. If you need more frequent updates, set a short TTL. If you are OK with less frequent updates, set the TTL accordingly, or have a scheduled task/thread/process running that does the update.
Ok, understood. Now what?
Check out Redis, or the "oldtimer" Memcached. You didn't say much about your platform, but there are Linux and Windows versions available for both (and especially on Windows you will have a lot more fun with Redis).
PS: Oh yes, Thrift serialization can be used for the serialization part.
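To make the TTL idea concrete, here is a minimal read-through sketch using the hiredis C client; the host, key name, and 300-second TTL are placeholders, and error handling is trimmed for brevity.

#include <hiredis/hiredis.h>
#include <iostream>
#include <string>

int main()
{
    redisContext* c = redisConnect("127.0.0.1", 6379);   // placeholder host/port
    if (c == nullptr || c->err)
        return 1;

    const char* key = "startup:data";                    // placeholder key name
    redisReply* r = (redisReply*)redisCommand(c, "GET %s", key);

    if (r && r->type == REDIS_REPLY_STRING) {
        std::string value(r->str, r->len);               // cache hit: use the cached blob
        std::cout << "hit, " << value.size() << " bytes\n";
    } else {
        // Cache miss: load from the database (not shown), serialize the result
        // (e.g. with Thrift), then store it with a TTL of 300 seconds.
        std::string blob = "...";                        // placeholder for the serialized data
        if (r) freeReplyObject(r);
        r = (redisReply*)redisCommand(c, "SETEX %s %d %b",
                                      key, 300, blob.data(), (size_t)blob.size());
    }

    if (r) freeReplyObject(r);
    redisFree(c);
    return 0;
}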
Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit of reading on how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine whether certain approaches are effective.
Bottom line: is it possible to be 100% certain about what does and does not make it into the CPU cache?
Searching for this topic returns several results on how to determine the cache size, but not contents.
Edit: To clarify some of the comments below: since software would undoubtedly alter the cache, do CPU manufacturers have a tool / hardware diagnostic system (built in) which provides this functionality?
Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).
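A minimal sketch of that kind of comparison, where the two candidate workloads are placeholders for your real alternatives:

#include <chrono>
#include <cstdio>

// Times one candidate implementation, averaged over enough repetitions to dwarf
// timer noise. Run it on the production hardware; cache effects show up indirectly
// in the elapsed time rather than being observed directly.
template <typename F>
double time_ms(F&& candidate, int repetitions = 100)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        candidate();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / repetitions;
}

int main()
{
    auto rowMajor = []{ /* placeholder: traverse the data row by row       */ };
    auto colMajor = []{ /* placeholder: traverse the data column by column */ };
    std::printf("row-major: %.3f ms, col-major: %.3f ms\n",
                time_ms(rowMajor), time_ms(colMajor));
    return 0;
}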
In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools, it's unlikely that such a "debug" facility would be made available to regular mortals like you and me; and even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all the other hotspots in the code, and is poor cache behavior by the CPU really the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.
If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.
Optimizing for one specific CPU cache size is usually in vain, since the optimization will break as soon as your assumptions about the cache sizes turn out to be wrong on a different CPU.
But there is a way out. You should optimize for access patterns that allow the CPU to easily predict which memory locations will be read next (the most obvious one is a linearly increasing read). To fully utilize a CPU, you should read about cache-oblivious algorithms. Most of them follow a divide-and-conquer strategy, where the problem is divided into sub-parts until all memory accesses fit completely into the CPU cache.
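As an illustration of that divide-and-conquer idea, here is a sketch (not tuned) of a cache-oblivious out-of-place matrix transpose: it keeps splitting the larger dimension until a sub-block is small enough to run entirely from whatever cache the machine happens to have, without ever knowing the cache size.

#include <cstddef>

// src is rows x cols with row stride srcStride; dst is cols x rows with row
// stride dstStride. Recursively split the larger dimension; the small base case
// fits in cache regardless of the actual cache size.
void transpose(const double* src, double* dst,
               std::size_t rows, std::size_t cols,
               std::size_t srcStride, std::size_t dstStride)
{
    const std::size_t kBase = 16;   // base-case block edge (any small value works)
    if (rows <= kBase && cols <= kBase) {
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                dst[c * dstStride + r] = src[r * srcStride + c];
    } else if (rows >= cols) {
        const std::size_t half = rows / 2;
        transpose(src, dst, half, cols, srcStride, dstStride);
        transpose(src + half * srcStride, dst + half,
                  rows - half, cols, srcStride, dstStride);
    } else {
        const std::size_t half = cols / 2;
        transpose(src, dst, rows, half, srcStride, dstStride);
        transpose(src + half, dst + half * dstStride,
                  rows, cols - half, srcStride, dstStride);
    }
}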
It is also worth mentioning that the code cache and the data cache are separate. Herb Sutter has a nice video online where he talks about CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and the L2 cache. These options are available when you select instrumentation profiling.
Intel also has a paper online which talks in greater detail about these CPU counters, about what the task managers of Windows and Linux show you, and about how wrong that is for today's CPUs, which internally work asynchronously and in parallel at many different levels. Unfortunately there is no tool from Intel to display this stuff directly. The only tool I know of is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code, you might as well look into GPU programming. You need at least a PhD to get your head around SIMD instructions, cache locality, ... to get perhaps a factor of 5 over your original design. But by porting your algorithm to a GPU you can get a factor of 100 with much less effort on a decent graphics card. NVIDIA GPUs which support CUDA (all cards sold today do) can be programmed very nicely in a C dialect. There are even wrappers for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform-agnostic by using OpenCL, but NVIDIA's OpenCL support is very bad. The OpenCL drivers are at least 8 times slower than their CUDA counterparts.
Almost everything you do will be in the cache at the moment you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's the frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
In nearly all cases, CPUs have a fairly good system for knowing what to keep and what to throw away in the cache, and the cache is nearly always "full" - not necessarily of useful stuff; if, for example, you are working your way through an enormous array, it will just contain a lot of "old array" [this is where the "non-temporal" memory operations come in handy, as they allow you to read and/or write data that won't be stored in the cache, since the next time you get back to the same point, it won't be in the cache anyway].
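For completeness, here is a sketch of what a non-temporal store looks like with the x86 SSE2 intrinsics - streaming a large write-only buffer so it does not displace hot cache lines (the buffer is assumed to be 16-byte aligned):

#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_set1_epi32
#include <cstddef>
#include <cstdint>

// Fills a 16-byte-aligned buffer using non-temporal stores: the data goes out
// through write-combining buffers instead of evicting useful cache lines.
void stream_fill(std::int32_t* dst, std::size_t count, std::int32_t value)
{
    const __m128i v = _mm_set1_epi32(value);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    _mm_sfence();                      // make the streamed stores globally visible
    for (; i < count; ++i)             // plain scalar tail for the last few elements
        dst[i] = value;
}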
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read with suitable software [you need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. In Linux, there's "perf". And AMD has a similar set of performance counters that can be used to find out, for example, "how many cache misses have we had over this period of time" or "how many cache hits in L" have we had, etc.
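On Linux those counters are also reachable directly from user space through the perf_event_open system call (the same interface perf uses). A minimal sketch counting last-level-cache misses around a region of code - no error handling, and the hardware event may be unavailable inside some virtual machines:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   // last-level-cache misses
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    // Count for the calling thread, on any CPU it runs on.
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... the code whose cache behaviour you want to measure ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("cache misses: %lld\n", misses);
    close(fd);
    return 0;
}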
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
I have an application (basically a C++ application) which has the properties below:
Multi-threaded.
Each thread has its own thread attributes (like stack size, etc.).
Multi-process (i.e. it will run multiple processes).
Runs on an 8-core processor.
Uses shared memory/IPCs, extensive heap management (allocation/deallocation), system sleep, etc.
So now, I am supposed to find the system CAPS at max CPU. The ideal way is to load the system to 100% CPU and then check the (successful) CAPS the system supports.
I know that in complex systems, CPU time will be "dead" due to context switches, page swaps, I/O, etc.
But my system tops out at 95% CPU (no more than that, irrespective of the load). So the idea here is to find out which of these points is really contributing to the "CPU eating" and then see if I can engineer them to reduce/eliminate the unused CPU.
Question
How do we find out which I/O, context switching, etc. is the cause of the unconquerable 5% of CPU? Is there any tool for this? I am aware of OProfile/Quantify and vmstat reports, but none of them gives this information.
There may be some operations I am not aware of that restrict the max CPU utilization. Any link/document that can help me understand in detail the set of operations that reduce my CPU usage would be very helpful.
Edit 1:
Added some more information:
a. The OS in question is SUSE 10 Linux Server.
b. CAPS - the average calls per second the system can handle. It is basically a telecommunications term, but it can be considered generic: assume your application provides a protocol implementation; how many protocol calls can you make per second?
"100% CPU" is a convenient engineering concept, not a mathematical absolute. There's no objective definition of what it means. For instance, time spent waiting on DRAM is often counted as CPU time, but time spent waiting on Flash is counted as I/O time. With my hardware hat on, I'd say that both Flash and DRAM are solid-state cell-organized memories, and could be treated the same.
So, in this case, your system is running at "100% CPU" for engineering purposes. The load is CPU-limited, and you can measure the Calls Per Second in this state.
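If it helps to see how the kernel itself buckets that time on the SUSE box, here is a minimal sketch that samples the aggregate "cpu" line of /proc/stat twice while the load is running and prints where the non-idle time went (user, system, iowait, irq, ...). It won't name the responsible calls, but it shows which bucket is absorbing the "missing" 5%:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <unistd.h>

// Reads the aggregate "cpu" line of /proc/stat: jiffies spent in
// user, nice, system, idle, iowait, irq, softirq, steal, ...
static std::vector<long long> read_cpu_jiffies()
{
    std::ifstream f("/proc/stat");
    std::string label;
    f >> label;                       // the leading "cpu" label
    std::vector<long long> v;
    long long x;
    while (f >> x) {
        v.push_back(x);
        if (f.peek() == '\n')         // stop at the end of the first line
            break;
    }
    return v;
}

int main()
{
    static const char* names[] = {"user", "nice", "system", "idle",
                                  "iowait", "irq", "softirq", "steal"};
    const std::vector<long long> a = read_cpu_jiffies();
    sleep(10);                        // sampling window while the load runs
    const std::vector<long long> b = read_cpu_jiffies();

    long long total = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        total += b[i] - a[i];
    for (std::size_t i = 0; i < a.size() && i < b.size() && i < 8; ++i)
        std::cout << names[i] << ": " << 100.0 * (b[i] - a[i]) / total << "%\n";
    return 0;
}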