I'm running a gSLIC segmentation algorithm on my GT 740m (GK107) and the segmentation takes 93ms.
From the gSLIC report http://www.robots.ox.ac.uk/~carl/papers/gSLIC_report.pdf I know that they were using GTX 460 (GF104) and their segmentation takes 13ms.
The GK107 has 384 CUDA cores in two SMXs, while the GF104 has 336 CUDA cores in seven SMs.
Based on the algorithm's shared memory requirements I calculated that my GK107 can keep 1280 active threads on one SMX, which is 2x1280 = 2560 active threads overall, while the GF104 can keep 1280 active threads on one SM, which is 7x1280 = 8960 active threads overall. But the GF104 has fewer CUDA cores than the GK107, so it should process fewer threads concurrently, shouldn't it? Or does the GF104 have some advantage simply because of its larger number of SMs?
What could be the reason for these results?
But the GF104 has fewer CUDA cores than the GK107, so it should process fewer threads concurrently, shouldn't it?
The number of concurrent threads is not the only metric, especially considering that the GTX 460 is of the Fermi architecture, whereas the GT 740M is Kepler. What about the speed at which these threads are executed? That is where one of the main differences between Fermi and Kepler lies; you can read more about it in this article, which should provide the necessary insight. A small teaser:
Because NVIDIA has essentially traded a fewer number of higher clocked units (Fermi) for a larger number of lower clocked units (Kepler), NVIDIA had to go in and double the size of each functional unit inside their SM. Whereas a block of 16 CUDA cores would do when there was a shader clock, now a full 32 CUDA cores are necessary.
Also, as sonicwave pointed out, the GT 740M is a mobile GPU, which by definition has a narrower memory bus than a desktop GPU, simply because of space limitations (desktop vs. laptop). This results in a quite significant difference in bandwidth, as Robert Crovella states as well, so in memory-heavy applications the GTX 460 will simply outperform the GT 740M. gpuBoss has a nice GPU comparison utility; see here for the complete results, or below for the important points (and a small query sketch after the list).
Reasons to consider the Nvidia GeForce GTX 460:
Higher effective memory clock speed: 3,400 MHz vs 1,802 MHz (around 90% higher)
Higher memory bandwidth: 108.8 GB/s vs 28.8 GB/s (more than 3.8x higher)
More render output processors: 32 vs 16 (twice as many)
Wider memory bus: 256 bit vs 128 bit (2x wider)
More texture mapping units: 56 vs 32 (24 more)
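If you want to check the corresponding numbers on your own two cards, the CUDA runtime exposes them through cudaGetDeviceProperties. Below is a minimal host-side sketch; the factor of 2 in the bandwidth estimate assumes DDR-style signalling, so treat the result as a rough theoretical peak rather than a measured value.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0

    // Rough theoretical peak bandwidth: memory clock x bus width x 2 (DDR).
    double gbps = 2.0 * prop.memoryClockRate * 1e3  // kHz -> Hz
                * (prop.memoryBusWidth / 8.0)       // bits -> bytes
                / 1e9;

    printf("%s: %d SMs, %d-bit bus, ~%.1f GB/s theoretical peak bandwidth\n",
           prop.name, prop.multiProcessorCount, prop.memoryBusWidth, gbps);
    return 0;
}

Running this on both the GT 740M and the GTX 460 should reproduce roughly the 28.8 GB/s vs 108.8 GB/s gap listed above.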
I am new to CUDA programming. I am currently running Monte Carlo simulations on a large number of large data samples.
I'm trying to dynamically calculate and maximize the number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear on is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition is as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is whether the grid size refers to:
the maximum total number of threads that can be submitted to the GPU
(in which case I would calculate the number of blocks as MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK), or
the maximum number of blocks of 1024 threads (in which case I would simply use MAX_GRID_SIZE).
The last one seems kind of insane to me, since MAX_GRID_SIZE = 2^31-1 (2147483647), so the maximum number of threads would be (2^31-1)*1024, roughly 2.2 trillion threads. Which is why I tend to think the first option is correct. But I am looking for outside input.
I have found many discussions on the subject of calculating blocks, but almost all of them were specific to one GPU rather than the general way of calculating or thinking about it.
On Nvidia CUDA the grid size signifies the number of blocks (not the number of threads), which are sent to the GPU in one kernel invocation.
The maximum grid size can be, and is, huge, because the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps to run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks, while the threads within a block can cooperate (especially through shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
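To make that "automatic loop" concrete, here is a minimal grid-stride-loop sketch; the kernel and the scaling operation are just illustrative, not taken from your code.

#include <cstddef>

// Grid-stride loop: correct whether the grid covers all n elements,
// only a fraction of them, or far more than needed.
__global__ void scale(float *x, float a, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

// Example launch: any reasonable grid works; a multiple of the SM count
// (7 SMXs on a GTX 670) keeps the device evenly loaded.
// scale<<<7 * 16, 256>>>(d_x, 2.0f, n);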
If you want to optimize the occupancy (parallel efficiency) of your GPU to the maximum, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM times the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation); each SMX can hold up to 2048 resident threads, while a single block is limited to 1024 threads. So for maximum occupancy you want to run a multiple of 7 x 2048 threads.
There are other limiting factors for the threads per multiprocessor, mainly the amount of registers and shared memory each of your threads or blocks needs. The GTX 670 has 48 KB of shared memory and 65,536 32-bit registers per SMX, so a block of 1024 threads can use at most 64 registers per thread, and keeping two such blocks (2048 threads) resident requires staying at or below 32 registers per thread.
Sometimes one runs kernels with smaller blocks, e.g. 256 threads per block. The GTX 670 can run up to 16 blocks per SMX at the same time, but you still cannot get more than 2048 resident threads per SMX altogether, so nothing is gained in raw occupancy.
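If you would rather not work these limits out by hand, the runtime's occupancy API (introduced in CUDA 6.5, which still covers Kepler) can suggest a block size for you. A sketch, where mykernel stands for whatever kernel you actually launch:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(float *data, size_t n) { /* ... your work ... */ }

int main() {
    int minGridSize = 0, blockSize = 0;
    // Suggested block size for maximum occupancy of this particular kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mykernel, 0, 0);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mykernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("block size %d, %d blocks/SM -> %d resident threads per SM (hardware limit %d)\n",
           blockSize, blocksPerSM, blockSize * blocksPerSM,
           prop.maxThreadsPerMultiProcessor);
    return 0;
}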
To optimize your kernel itself, and to get nice graphical and numeric feedback about its efficiency and bottlenecks, use Nvidia's Nsight Compute tool (if there is a version that still supports the Kepler 3.0 generation; otherwise the older Visual Profiler / nvprof covers that hardware).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp run in lockstep as much as possible. Additionally, you should try to replace accesses to global memory with accesses to shared memory, being careful about bank conflicts, as in the sketch below.
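As a small illustration of both points, here is the well-known tiled transpose pattern: the global loads and stores touch consecutive addresses (coalesced), and the extra padding column in shared memory avoids bank conflicts. The 32x32 tile size is an assumption for illustration.

#define TILE 32

// Tiled transpose of a width x height row-major matrix.
// Both the load and the store are coalesced; the "+ 1" column
// avoids shared-memory bank conflicts.
__global__ void transpose(float *out, const float *in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Transposed coordinates: the block swaps the roles of x and y.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}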
I saw that GCloud offers N2 instances with up to 128 vCPUs. I wonder what kind of hardware that is. Do they really put 128 cores into 1 chip? If so, Intel doesn't make them generally available for sale to the public, right? If they use several chips, how do they split the cores? Also, I assume that all cores are on the same node, do they place more than 2 CPU chips on that node or do they have chips with 56 cores (which also is a lot)?
Thanks!
You can easily build or purchase a system with 128 vCPUs. Duplicating Google's custom hardware and firmware is another matter. 128 vCPUs is not large today.
Google Cloud publishes the processor families: CPU platforms
The Intel Ice Lake Xeon motherboards support multiple processor chips.
With a two-processor motherboard using the 40 core model (8380), 160 vCPUs are supported.
For your example, Google is using 32-core CPUs: two 32-core chips per node give 2 x 32 cores x 2 = 128 vCPUs.
Note: one physical core is two vCPUs (link).
I am not sure what Google is using for n2d-standard-224 which supports 224 vCPUs. That might be the Ice Lake 4 processor 28-core models.
GCloud N2 machines: 128 vCPUs in 1 chip?
Currently, the only processors that support 64 cores (128 vCPUs) that I am aware of are ARM processors from Ampere. That means Google is using one or more processor chips on a multi-cpu motherboard.
If so, Intel doesn't make them generally available for sale to the public, right?
You can buy just about any processor on Amazon, for example.
If they use several chips, how do they split the cores? Also, I assume that all cores are on the same node, do they place more than 2 CPU chips on that node or do they have chips with 56 cores (which also is a lot)?
You are thinking in terms of laptop and desktop technology. Enterprise rack mounted servers typically support two or more processor chips. This has been the norm for a long time (decades).
I have written a program that works just fine. I now want to run 32 independent instances of it, in parallel, on our 32-core machine (AMD Threadripper 2990WX, 128 GB DDR4 RAM, Ubuntu 18.04). However, the performance gains are almost zero beyond about 12 processes running concurrently on the same machine. I now need to optimize this. Here is a plot of the average speedup:
I want to identify the source of this scaling bottleneck.
I would like to know the available techniques to see, in my code, whether there are any "hot" parts that prevent 32 processes from yielding significant gains compared to 12.
My guess is that it has to do with memory access and the NUMA architecture. I tried experimenting with numactl and assigning a core to each process, without noticeable improvement.
Each instance of the application uses at most about 1GB of memory. It is written in C++, and there is no "parallel code" (no threads, no mutexes, no atomic operations), each instance is totally independent, there is no interprocess communication (I just start them with nohup, through a bash script). The core of this application is an agent-based simulation: a lot of objects are progressively created, interact with each other and are regularly updated, which is probably not very cache friendly.
I have tried to use Linux perf, but I am not sure what I should look for; also, the mem module of perf doesn't work on AMD CPUs.
I have also tried using AMD uProf, but again I am not sure where this system-wide bottleneck would show up.
Any help would be greatly appreciated.
The problem may be the Threadripper architecture. It is a 32-core CPU, but those cores are distributed among 4 NUMA nodes, with half of them not directly connected to memory. So you may need to (see the sketch after this list):
set processor affinity for all your processes to ensure that they never jump between cores
ensure that processes running on the normal NUMA nodes only access memory directly attached to that node
put less load on the cores situated on the memory-starved NUMA nodes
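A minimal sketch of the first two points, assuming Linux with libnuma installed (build with g++ pin.cpp -o pin -lnuma; the wrapper name and argument layout are made up for illustration). In many cases plain numactl --physcpubind=<core> --membind=<node> ./your_app in the launch script does the same job without any code.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>      // sched_setaffinity, CPU_SET
#include <numa.h>       // numa_available, numa_set_preferred
#include <stdlib.h>     // atoi
#include <unistd.h>     // execvp

// Hypothetical wrapper: ./pin <core> <numa-node> <program> [args...]
// CPU affinity and the memory policy both survive execvp, so the real
// program inherits them.
int main(int argc, char **argv) {
    if (argc < 4) return 1;
    int core = atoi(argv[1]);
    int node = atoi(argv[2]);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   // pin to one core, no migration

    if (numa_available() != -1)
        numa_set_preferred(node);              // prefer memory local to that node

    execvp(argv[3], &argv[3]);                 // run the actual simulation instance
    return 1;                                  // only reached if exec failed
}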
I have an Intel Xeon E5-2620 system with 24 logical cores on 2 CPUs. I have written an application that creates 24 threads for AES decryption using OpenSSL. When I increase the thread count from 1 to 24 while decrypting 1 million data items, I get results such as in the following image.
The problem is that when I increase the number of threads, all of the cores I assigned reach 100% utilization, and since the system has 32 GB of RAM, at least half of the RAM is always free, which indicates that the problem is neither core usage nor a RAM limit.
I would like to know whether I should set a special parameter at the OS level to increase performance, or whether this is a per-process limitation that cannot get beyond roughly the performance of 4 threads.
I should mention that when I execute "openssl evp ..." to test AES encryption/decryption, it forks processes and the performance increases to about 20 times the single-core performance.
Does anyone have any idea?
I finally found the reason. On servers, multiple CPUs each have their own RAM at different distances (NUMA). Up to 4 of my threads were created on one single CPU, but the fifth thread was placed on the second CPU, which decreased performance because the OS was not placing threads and memory in a NUMA-aware way.
So when I disabled the cores of the second CPU, the performance with 6 threads increased as expected.
You can take the 7th logical CPU (cpu6, counting from zero) offline with the following commands:
cd /sys/devices/system/cpu/
echo 0 > cpu6/online
If multiprocessing gives a 20x speedup and equivalent multithreading only gives 2.5x, there's clearly a bottleneck in the multithreaded code. Furthermore, this bottleneck is unrelated to the hardware architecture.
It could be something in your code, or in the underlying library. It's really impossible to tell without studying both in some detail.
I'd start by looking at lock contention in your multithreaded application.
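With OpenSSL in particular, a frequent source of contention is shared state between threads. A hedged sketch of the "one context per thread" pattern (key, iv and the buffers are placeholders, and it assumes OpenSSL 1.1+, where no locking callbacks are required):

#include <openssl/evp.h>

// Each worker owns its own EVP_CIPHER_CTX, so nothing is shared and
// there is no lock contention inside the decryption loop.
static void decrypt_chunk(const unsigned char *key, const unsigned char *iv,
                          const unsigned char *in, int in_len,
                          unsigned char *out) {
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len = 0, total = 0;
    EVP_DecryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv);
    EVP_DecryptUpdate(ctx, out, &len, in, in_len);
    total = len;
    EVP_DecryptFinal_ex(ctx, out + total, &len);
    EVP_CIPHER_CTX_free(ctx);
}

// Launch, e.g. with std::thread, one call per chunk:
// for (int t = 0; t < nthreads; ++t)
//     pool.emplace_back(decrypt_chunk, key, iv, in + t * chunk, chunk, out + t * chunk);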
My computer has two Intel® Xeon® X5650 processors with 6 cores each and HT support. But when I run MPI code it won't get past a 6x speedup.
Here are some current run-times:
NP 1: 20 minutes
NP 6: 4 minutes
NP 12: 3.5 minutes
NP 24: 3.1 minutes (full HT)
So up to 6 started processes it runs as planned: all cores are active and the runtime decreases linearly. The same happens with OpenMP.
Could this be due to cache incoherence on the machine?
I heard about it once at an MPI conference.
Is there a fix for this?
In short, yes, but this is problem specific: some applications simply do not scale linearly with the number of cores, and there are many causes for this (e.g. insufficient thread- or data-level parallelism in your application). In fact, in my experience, you would be hard pressed to find anything other than embarrassingly parallel applications (e.g. Monte Carlo simulation?) that scale perfectly with the number of cores. It is unlikely anyone will give you an accurate answer without profiling the application, since there are many possible causes of sub-linear scaling.
However, in your case, the most obvious issue may be caused by HyperThreading (HT). The most counter-intuitive result you show is that moving from 12 threads to 24 threads (i.e. when using hyperthreading at its maximum) results in almost no speedup. In some cases, HT does not lead to performance increase. This is typical when:
running applications which fully utilise the CPU's arithmetic units. See this for example.
when there is substantial I/O from main memory (for example) per thread (in other words if your application becomes memory bound). You can use the roofline model to see if your application is memory or compute bound.
This is because ultimately HT works by sharing many of the execution units within a CPU core between the threads running on that core. If, for example, each core has one floating point unit that is shared by all threads running on that core, you cannot perform more than one floating point operation per clock cycle, regardless of how many threads you use. To investigate whether this is the cause, I would suggest disabling HT (there may even be a performance overhead to leaving it on). There is typically a kernel boot option on Unix machines to disable HT.
Finally, another typical issue is that dual-socket machines are usually (?) NUMA machines. This means that accessing the same memory contents from different CPUs may take a different amount of time. So your implementation should be NUMA-aware.
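If you want to see where your ranks actually land on the two sockets, a small check like the following can help (it assumes a Linux MPI setup where the GNU sched_getcpu() call is available); with Open MPI you can then steer placement with options such as --bind-to core --map-by socket.

// mpicc whereami.c -o whereami && mpirun -np 12 ./whereami
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Report which core this rank is running on right now; on a dual-socket
    // X5650 box the two sockets own different core ranges (check lscpu,
    // the exact numbering varies between machines).
    printf("rank %d is on core %d\n", rank, sched_getcpu());

    MPI_Finalize();
    return 0;
}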