Perfmon - L2 cache miss per thread - scheduling

Architecture:
**AMD** Opteron quad-core, 2 CPUs (a NUMA system).
Each CPU has a shared L3 cache; each core has a private L1 and L2.
Processor: x86_64. Operating system: GNU/Linux.
I am new to the world of Perfmon. I am trying to get performance counters such as last-level cache misses (LLCM) and IPS.
I am able to fetch them when there is just one thread per core.
Is it also possible to fetch performance counters such as IPS and LLCM per thread when there are more than two threads per core?
From my research, I gathered that it is not possible to get LLCM/IPS per thread when there is more than one thread per core, as AMD does not provide those performance counters per thread.
So my question is: is it possible to fetch the performance counters per thread at the L2 cache level?
If yes, how?
Thanks.

I researched it a little more and asked my professor about it.
It looks like Perfmon does allow us to do this.
Hope this helps. Let me know if I can help out in some way.
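For what it is worth, here is a minimal sketch of how you might try this with the Linux perf tool, a different front end to the same hardware counters. Whether these events are exposed per thread on your Opteron depends on your kernel's PMU support, and an L2-specific miss event would need a raw event code from AMD's documentation, which I will not guess here:

```
# Count instructions and cache misses per thread of process <pid>
# for roughly 10 seconds; <pid> is a placeholder for your process ID.
perf stat --per-thread -p <pid> -e instructions,cache-misses sleep 10
```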

Related

yourkit cpu view and top command - showing different results

I am looking at my process via the top command and it shows a very high value for CPU%. However, when I look at the same process in the YourKit CPU view, it shows a completely different result. How can that be?
The YourKit profiler treats the entire CPU, with all of its cores, as 100%. That means if you have 4 cores and 1 core is fully loaded while the other 3 cores sleep, the CPU usage shown will be 25% (not 100%).
With this explanation in mind, the YourKit results correlate well with top.
I had the same confusion. From what I understand, top displays usage as a percentage of a single CPU, so on multi-core systems you can see percentages greater than 100%.
https://unix.stackexchange.com/questions/145247/understanding-cpu-while-running-top-command

How to make TensorFlow use more available CPU

How can I fully utilize each of my EC2 cores?
I'm using a c4.4xlarge AWS Ubuntu EC2 instance and TensorFlow to build a large convolutional neural network. nproc says that my EC2 instance has 16 cores. When I run my convnet training code, the top utility says that I'm only using 400% CPU. I was expecting it to use 1600% CPU because of the 16 cores. The AWS EC2 monitoring tab confirms that I'm only using 25% of my CPU capacity. This is a huge network, and on my new Mac Pro it consumes about 600% CPU and takes a few hours to build, so I don't think the problem is that my network is too small.
I believe the line below ultimately determines CPU usage:
sess = tf.InteractiveSession(config=tf.ConfigProto())
I admit I don't fully understand the relationship between threads and cores, but I tried increasing the number of threads. It had the same effect as the line above: still 400% CPU.
NUM_THREADS = 16
sess = tf.InteractiveSession(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
EDIT:
htop shows that I am actually using all 16 of my EC2 cores, but each core is only at about 25%.
top shows that my total CPU % is around 400%, but occasionally it will shoot up to 1300% and then almost immediately drop back down to ~400%. This makes me think there could be a deadlock problem.
Several things you can try:
Increase the number of threads
You already tried changing the intra_op_parallelism_threads. Depending on your network it can also make sense to increase the inter_op_parallelism_threads. From the doc:
inter_op_parallelism_threads: Nodes that perform blocking operations are enqueued on a pool of inter_op_parallelism_threads available in each process. 0 means the system picks an appropriate number.
intra_op_parallelism_threads: The execution of an individual op (for some op types) can be parallelized on a pool of intra_op_parallelism_threads. 0 means the system picks an appropriate number.
(Side note: the values in the configuration file referenced above are not the actual defaults TensorFlow uses, just example values. You can see the actual default configuration by manually inspecting the object returned by tf.ConfigProto().)
By default, TensorFlow uses 0 for both options, meaning it tries to choose appropriate values itself. I don't think TensorFlow picked poor values that caused your problem, but you can try different values for these options to be on the safe side.
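If you want to experiment with both options, here is a minimal sketch using the same TF 1.x-style session API as the snippet above; the thread count is just an example value, not a recommendation:

```python
import tensorflow as tf

NUM_THREADS = 16  # example value; 0 lets TensorFlow pick for itself

config = tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS,  # threads inside a single op
    inter_op_parallelism_threads=NUM_THREADS,  # threads across independent ops
)
sess = tf.InteractiveSession(config=config)
```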
Extract traces to see how well your code parallelizes
Have a look at
tensorflow code optimization strategy
It gives you something like this. In that picture you can see that the actual computation happens on far fewer threads than are available. This could also be the case for your network. I marked potential synchronization points; there you can see that all threads are active for a short moment, which is potentially the reason for the sporadic peaks in CPU utilization that you experience.
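For reference, the usual TF 1.x recipe for extracting such a trace looks roughly like this; train_op and feed_dict stand in for whatever your training step actually runs, and the resulting JSON file can be opened in chrome://tracing:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step with tracing enabled.
sess.run(train_op, feed_dict=feed_dict,
         options=run_options, run_metadata=run_metadata)

# Convert the collected step stats into a Chrome trace file.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```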
Miscellaneous
Make sure you are not running out of memory (htop)
Make sure you are not doing a lot of I/O or something similar

Concurrency problems in netty application

I implemented a simple HTTP server (link), but the result of the test (ab -n 10000 -c 100 http://localhost:8080/status) is very bad (see test.png in the previous link).
I don't understand why it doesn't work correctly with multiple threads.
I believe that, by default, Netty's thread pool is configured with as many threads as there are cores on the machine, the idea being to handle requests asynchronously and non-blocking (where possible).
Your /status test includes a database transaction, which blocks because of the intrinsic design of database drivers etc. So your performance, at a high level, is essentially a result of:
a) you are running a pretty hefty test of 10,000 requests, attempting to run 100 requests in parallel
b) you are calling into a database for each request, so this will not be quick (relatively speaking, compared to some non-blocking I/O operation)
A couple of questions/considerations for you:
Machine Spec.?
What is the spec. of the machine you are running your application and test on?
How many cores?
If you only have 8 cores available, then you will only have 8 threads running in parallel at any time. That means those batches of 100 requests at a time will be queueing up.
Consider what is running on the machine during the test
It sounds like you are running the application AND Apache Bench on the same machine, so be aware that your application and the testing tool will both be contending for those cores (in addition to any background processes also contending for them, such as the OS).
What will the load be?
Predicting load is difficult, right? If you do think you are likely to have 100 requests into the database at any one time, then you may need to think about:
a. your production environment may need a couple of instances to handle the load
b. try changing the config of Netty's default thread pool to increase the number of threads
c. think about your application architecture - can you cache any of those results instead of going to the database for each request?
Could this be linked to the use of database access (a synchronous task) within one of your handlers (at least in your TrafficShappingHandler)?
You might need to make your database calls asynchronous (handing them off to other threads in a producer/consumer fashion, for instance)...
If it is something else, I do not have enough information...

CUDA: How many concurrent threads in total on a device?

How do I programmatically find the maximum number of concurrent CUDA threads or streaming multiprocessors on a device / NVIDIA graphics card? I know about warpSize, but there is no warpCount.
Most answers on the internet concern themselves with looking things up in PDFs.
Have you tried checking the SDK samples? I think this sample is the one you want:
Device Query
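If you would rather query the numbers from code than run the sample, here is a rough sketch of the same idea using PyCUDA (an assumption on my part that a Python binding is acceptable; in CUDA C the same values come from cudaGetDeviceProperties). It computes an upper bound on concurrently resident threads; as the next answer points out, actual occupancy can be lower:

```python
import pycuda.autoinit            # creates a context on the default device
import pycuda.driver as cuda

dev = cuda.Device(0)
attr = cuda.device_attribute

sm_count = dev.get_attribute(attr.MULTIPROCESSOR_COUNT)
threads_per_sm = dev.get_attribute(attr.MAX_THREADS_PER_MULTIPROCESSOR)
warp_size = dev.get_attribute(attr.WARP_SIZE)

# Upper bound on threads resident on the device at once;
# registers, shared memory and block size can push real occupancy lower.
max_resident_threads = sm_count * threads_per_sm

print("%s: %d SMs, %d threads/SM, warp size %d, at most %d resident threads"
      % (dev.name(), sm_count, threads_per_sm, warp_size, max_resident_threads))
```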
This does not only depend on the device but also on your code, e.g. things like the number of registers each thread uses or the amount of shared memory your block needs. I would suggest reading about occupancy.
Another thing I would note: if your code relies on having a certain number of threads resident on the device (e.g. if you wait for several threads to reach some execution point), you are bound to face some race conditions and see your code hang.

How can I learn if I am using all of my cores to the maximum level

I have a time-critical application which processes a sequence of images coming from a camera. It is written in C++ and it uses the Qt, OpenCV and Boost libraries. It is going to run on a dedicated PC.
Currently, the GUI runs in the main thread and I open a new thread for image processing. I didn't bother to divide the processing section into threads because I think OpenCV is already doing that. However, I am having trouble maintaining the maximum tolerable delay.
My question is: how can I tell whether my application is using all the cores to the maximum level?
When I look at the performance monitor, the pattern I see is really strange. The CPU usage is around 35-40%; all the cores are working, but not at full throttle.
Am I doing something wrong?
You are not doing anything wrong; however, you could change your code to make full use of the CPU cores by:
1 - setting the core affinity so that the thread does not move from one core to another; this could improve the cache usage (L1 and maybe L2)
2 - setting the scheduling policy of the threads to FIFO so a thread does not get context-switched before finishing its processing
3 - running that thread at a higher priority (this would require root privileges for the process)
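The question's code is C++ (where the equivalents are pthread_setaffinity_np and sched_setscheduler), but here is a rough, Linux-only sketch of points 1 and 2 using Python's os module, with the core number and priority as example values:

```python
import os

# 1 - Pin the calling process to core 0 so its threads do not migrate.
os.sched_setaffinity(0, {0})

# 2 - Switch to the FIFO real-time scheduling policy with priority 10;
#     this needs elevated privileges, similar to point 3.
os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(10))
```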
Cheers