GCloud N2 machines: 128 vCPUs in 1 chip? [closed] - google-cloud-platform

I saw that GCloud offers N2 instances with up to 128 vCPUs. I wonder what kind of hardware that is. Do they really put 128 cores into 1 chip? If so, Intel doesn't make them generally available for sale to the public, right? If they use several chips, how do they split the cores? Also, I assume that all cores are on the same node, do they place more than 2 CPU chips on that node or do they have chips with 56 cores (which also is a lot)?
Thanks!

You can easily build or purchase a system with 128 vCPUs. Duplicating Google's custom hardware and firmware is another matter. 128 vCPUs is not large today.
Google Cloud publishes the processor families: CPU platforms
The Intel Ice Lake Xeon motherboards support multiple processor chips.
With a two-processor motherboard using the 40-core model (Xeon Platinum 8380), 160 vCPUs are supported.
For your 128-vCPU example, that would be two 32-core CPUs.
Note: one physical core is two vCPUs (one vCPU per hardware thread).
I am not sure exactly which parts Google is using for n2d-standard-224, which supports 224 vCPUs; the N2D series is documented as using AMD EPYC processors rather than Ice Lake.
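As a quick sanity check on the arithmetic, here is a minimal sketch; the socket and core counts are the assumptions made in this answer, not published details of Google's hardware:

```c
/* Minimal sketch of the vCPU arithmetic above. The socket and core
 * counts are this answer's assumptions, not published specs. */
#include <stdio.h>

static int vcpus(int sockets, int cores_per_socket, int threads_per_core)
{
    return sockets * cores_per_socket * threads_per_core;
}

int main(void)
{
    /* Two-socket Ice Lake Xeon 8380 (40 cores each, 2 threads/core). */
    printf("2 x 40-core, SMT2: %d vCPUs\n", vcpus(2, 40, 2));

    /* The 128-vCPU N2 shape fits a two-socket 32-core configuration. */
    printf("2 x 32-core, SMT2: %d vCPUs\n", vcpus(2, 32, 2));
    return 0;
}
```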
GCloud N2 machines: 128 vCPUs in 1 chip?
Single chips with 64 or more cores do exist (AMD's EPYC line, and Arm-based parts from Ampere), but the Intel Xeons used for the N2 series do not reach 64 cores per chip. That means Google is using more than one processor chip on a multi-CPU motherboard.
If so, Intel doesn't make them generally available for sale to the public, right?
You can buy just about any processor on Amazon, for example.
If they use several chips, how do they split the cores? Also, I assume that all cores are on the same node, do they place more than 2 CPU chips on that node or do they have chips with 56 cores (which also is a lot)?
You are thinking in terms of laptop and desktop technology. Enterprise rack-mounted servers typically support two or more processor chips. This has been the norm for a long time (decades).

Related

How to identify the bottlenecks preventing my program from scaling well on a 32-core CPU? [closed]

I have written a program that works just fine. I now want to run 32 independent instances of it, in parallel, on our 32-core machine (AMD Threadripper 2990WX, 128 GB DDR4 RAM, Ubuntu 18.04). However, the performance gains are almost nil once about 12 processes are running concurrently on the same machine. I now need to optimize this. Here is a plot of the average speedup:
I want to identify the source of this scaling bottleneck.
I would like to know the available techniques to see, in my code, if there are any "hot" parts that prevent 32 processes from yielding significant gains compared to 12.
My guess is it has to do with memory access and the NUMA architecture. I tried experimenting with numactl and assigning a core to each process, without noticeable improvement.
Each instance of the application uses at most about 1 GB of memory. It is written in C++, and there is no "parallel code" (no threads, no mutexes, no atomic operations); each instance is totally independent, and there is no interprocess communication (I just start them with nohup, through a bash script). The core of this application is an agent-based simulation: a lot of objects are progressively created, interact with each other and are regularly updated, which is probably not very cache-friendly.
I have tried to use Linux perf, but I am not sure what I should look for; also, perf's mem module doesn't work on AMD CPUs.
I have also tried using AMD uProf, but again I am not sure where this system-wide bottleneck would show up.
Any help would be greatly appreciated.
The problem may be the Threadripper architecture. It is a 32-core CPU, but those cores are distributed among 4 NUMA nodes, with half of them not directly connected to memory. So you may need to do the following (a minimal affinity sketch follows the list):
set processor affinity for all your processes to ensure that they never jump between cores
ensure that processes running on the normal NUMA nodes only access memory directly attached to that node
put less load on cores situated on crippled NUMA nodes
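For the first two points, here is a minimal, Linux-specific sketch of a launcher that pins a process to one core before starting the real work (sched_setaffinity with glibc's CPU_* macros); pairing each pinned process with numactl --membind, or libnuma, is the complementary step that keeps its memory on the local node:

```c
/* Minimal sketch: pin the current process to a single core so the
 * scheduler never migrates it to another NUMA node (Linux-specific). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <core-id>\n", argv[0]);
        return 1;
    }
    int core = atoi(argv[1]);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* Pin this process (and any children it forks) to the chosen core. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pinned pid %d to core %d\n", (int)getpid(), core);
    /* A real launcher would now execvp() the simulation binary. */
    return 0;
}
```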

Mapping of Google Cloud VMs to physical machines

I am using Google Cloud to run a few experiments. Now, when I create a VM instance with, say, 4 vCPUs, what is the mapping of those 4 vCPUs to the actual physical machine? Also, what do 4 vCPUs actually entail? Am I getting a machine that has, say, 4 processors? Or do I get 4 nodes on a machine that has, say, 8 processors? If the latter is the case, doesn't the utilization of the remaining 4 nodes affect the performance of my job?
In the Google Cloud documentation, they say that "for the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread". The thing is, I'm not exactly sure what a single hardware hyper-thread means. An interesting fact is that I ran cat /proc/cpuinfo on an 8-vCPU instance that I had reserved, and it had a field called cpu cores whose value was 4. Again, what does that indicate?
I would like to understand the underlying hardware below the VM instances as it would help me in optimizing jobs that have multithreading enabled.
Any help will be appreciated. Thanks.
When you run cat /proc/cpuinfo and it shows 8 processors, it means that the system has access to 8 hardware threads (as in your example); cpu cores is 4 because that is the number of physical cores.
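To make the relationship concrete, here is a minimal, Linux-specific sketch that prints both numbers: the count of logical CPUs (hardware threads, i.e. vCPUs) and the "cpu cores" value from /proc/cpuinfo:

```c
/* Minimal sketch (Linux-specific): count logical CPUs and read the
 * "cpu cores" field from /proc/cpuinfo, mirroring the 8-vs-4 example. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN);  /* hardware threads / vCPUs */

    long cores_per_socket = -1;
    char line[256];
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f) {
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "cpu cores", 9) == 0) {
                char *colon = strchr(line, ':');
                if (colon)
                    sscanf(colon + 1, "%ld", &cores_per_socket);
                break;  /* the same value repeats for every logical CPU */
            }
        }
        fclose(f);
    }

    printf("logical CPUs: %ld, physical cores per socket: %ld\n",
           logical, cores_per_socket);
    return 0;
}
```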
About your question of "optimizing jobs that have multithreading enabled", the only difference is that you're accessing the vCPUs through a hypervisor rather than directly as hyper-threads on an Intel processor. In fact, multi-threading strategies for applications shouldn't really be any different just because of the hypervisor layer.
You can also read the discussion here, which covers the relationship between virtual CPUs, hyper-threads, and physical cores.

Detecting CPU and Core information from my Intel System

I am currently using the Windows 8 Pro OS, along with the processor Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, with 8 GB of RAM.
I wanted to know how many physical processors and how many actual cores my system has. With my very basic understanding of hardware and this discussion here, when I search the Intel information for this processor at this Intel site here, it says:
# of Cores 4
# of Threads 8
In the Task Manager of my System for CPU, it says:
Maximum Speed: 3.60 GHz
Sockets: 1
Cores: 4
Physical processors: 8
Am I correct in assuming that I have 1 physical processor with 4 actual physical cores, and that each physical core has 2 virtual cores (= 2 threads)? As such, the total would be 8, as mentioned in my Task Manager. But if my assumption is correct, then why say physical processors = 8, and not virtual processors?
I need to know the core details of my machine as I need to write low-latency programs, maybe using OpenMP.
Thanks for your time...
From the perspective of your operating system, even HyperThreaded processors are "real" processors - they exist in the CPU. They use real, physical resources like instruction decoders and ALUs. Just because those resources are shared between HT cores doesn't mean they're not "real".
General computing will see a speedup from Hyper-Threading, because the various threads are doing different kinds of things and can share the hardware resources. A CPU-intensive task running in parallel, however, may not see as large a gain, because of contention for those shared resources. For example, if there's only one ALU, it doesn't make sense to have two threads competing for it.
Run benchmarks and determine for your application what the appropriate settings are, regarding HT being enabled or not. With a question this broad, we can't give you a definitive answer.
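As a starting point for such a benchmark, here is a minimal sketch (assuming GCC/Clang with -fopenmp; the loop body is a stand-in for your real workload) that queries the logical processor count and times the same per-thread work at 4 threads (physical cores) and 8 threads (with Hyper-Threading):

```c
/* Minimal sketch: compare per-thread throughput at 4 vs 8 threads on a
 * 4-core/8-thread CPU. Compile with -fopenmp; busy_work() is a stand-in. */
#include <omp.h>
#include <stdio.h>

static double busy_work(long iters)
{
    double x = 0.0;
    for (long i = 1; i <= iters; i++)
        x += 1.0 / (double)i;
    return x;
}

int main(void)
{
    printf("logical processors visible to OpenMP: %d\n", omp_get_num_procs());

    int counts[] = {4, 8};  /* physical cores only vs. cores + HT threads */
    for (int c = 0; c < 2; c++) {
        int n = counts[c];
        double sum = 0.0;
        double t0 = omp_get_wtime();
        /* Each thread does the same fixed amount of work. */
        #pragma omp parallel for num_threads(n) reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += busy_work(50 * 1000 * 1000);
        double t1 = omp_get_wtime();
        printf("%d threads: %.3f s (checksum %.3f)\n", n, t1 - t0, sum);
    }
    return 0;
}
```

If the 8-thread run takes noticeably longer than the 4-thread run, the threads are contending for shared execution resources, and Hyper-Threading is buying little for that particular workload.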

GTX 460 (GF104) is faster than GT 740m (GK107), why? [closed]

I'm running a gSLIC segmentation algorithm on my GT 740m (GK107) and the segmentation takes 93ms.
From the gSLIC report http://www.robots.ox.ac.uk/~carl/papers/gSLIC_report.pdf I know that they were using GTX 460 (GF104) and their segmentation takes 13ms.
The GK107 architecture has 384 CUDA cores in two SMXs and the GF104 has 336 CUDA cores in seven SMs.
Depending on the algorithm (shared-memory occupancy), I calculated that my GK107 can run 1280 active threads on one SMX, which is 2 x 1280 = 2560 active threads overall, and that the GF104 can run 1280 active threads on one SM, which is 7 x 1280 = 8960 active threads overall. But the GF104 has fewer CUDA cores than the GK107, so it should process fewer threads concurrently, shouldn't it? Or does the GF104 have an advantage because of its larger number of SMs?
What could be the reason for these results?
But the GF104 has fewer CUDA cores than the GK107, so it should process fewer threads concurrently, shouldn't it?
The number of concurrent threads is not the only metric, especially considering the fact that the GTX460 is of the Fermi architecture, whereas the GT740m is Kepler. How about the speed at which those threads are executed? That's where one of the main differences between Fermi and Kepler lies; you may read more about it in this article, which should provide you with the necessary insight. Small teaser:
Because NVIDIA has essentially traded a fewer number of higher clocked units (Fermi) for a larger number of lower clocked units (Kepler), NVIDIA had to go in and double the size of each functional unit inside their SM. Whereas a block of 16 CUDA cores would do when there was a shader clock, now a full 32 CUDA cores are necessary.
Also, sonicwave pointed out that the GT740m is a mobile GPU which, we could say by definition, has a narrower bus than a desktop GPU, simply because of space limitations (desktop vs laptop). This results in quite a significant difference in bandwidth, as Robert Crovella states as well, and therefore in memory-heavy applications the GTX460 will simply outperform the GT740m. At gpuBoss they have a nice GPU comparison utility; see here for the complete results, or below for the important points (with a small bandwidth calculation after the list).
Reasons to consider the Nvidia GeForce GTX 460
Higher effective memory clock speed: 3,400 MHz vs 1,802 MHz (around 90% higher)
Higher memory bandwidth: 108.8 GB/s vs 28.8 GB/s (more than 3.8x higher)
More render output processors: 32 vs 16 (twice as many)
Wider memory bus: 256 bit vs 128 bit (2x wider)
More texture mapping units: 56 vs 32 (24 more)
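The bandwidth figures in that list follow directly from the effective memory clock and the bus width; here is a minimal sketch of the calculation, using only the numbers quoted above:

```c
/* Minimal sketch: theoretical memory bandwidth =
 * effective memory clock x bus width in bytes, using the figures
 * quoted from the gpuBoss comparison above. */
#include <stdio.h>

static double bandwidth_gbps(double effective_mhz, int bus_bits)
{
    return effective_mhz * 1e6 * (bus_bits / 8) / 1e9;  /* GB/s */
}

int main(void)
{
    double gtx460 = bandwidth_gbps(3400.0, 256);  /* GTX 460  */
    double gt740m = bandwidth_gbps(1802.0, 128);  /* GT 740M  */
    printf("GTX 460 : %.1f GB/s\n", gtx460);   /* ~108.8 GB/s */
    printf("GT 740M : %.1f GB/s\n", gt740m);   /* ~28.8 GB/s  */
    printf("ratio   : %.1fx\n", gtx460 / gt740m);
    return 0;
}
```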

System not reaching 100% CPU: how to troubleshoot [closed]

I have an application (basically a C++ application) which has the properties below:
Multi-threaded
Each thread has its own thread attributes (like stack size, etc.).
Multi-process (i.e. it will run multiple processes).
Runs on an 8-core processor.
Uses shared memory, IPCs, extensive heap management (allocation/deallocation), system sleep, etc.
So now I am supposed to find the system's CAPS at max CPU. The ideal way is to load the system to 100% CPU and then check the (successful) CAPS the system supports.
I know that in complex systems, CPU time will be "lost" to context switches, page swaps, I/O, etc.
But my system can reach at most 95% CPU, irrespective of the load. So the idea here is to find out which points are really "eating" the CPU, and then see if I can engineer them to reduce or eliminate the unused CPU time.
Question
How do I find out which of these (I/O, context switching, etc.) is the cause of the unreachable 5% of CPU? Is there any tool for this? I am aware of OProfile/Quantify and vmstat reports, but none of them gives this information.
There may also be operations I am not aware of that restrict the maximum CPU utilization. Any link/document that helps me understand, in detail, the set of operations that reduce CPU usage would be very helpful.
Edit 1:
Added some more information
a. The OS in question is SUSE Linux 10 (server).
b. CAPS is the average number of calls your system can handle per second. It is basically a telecommunications term, but it can be considered generic: assume your application provides a protocol implementation; how many protocol calls can it handle per second?
"100% CPU" is a convenient engineering concept, not a mathematical absolute. There's no objective definition of what it means. For instance, time spent waiting on DRAM is often counted as CPU time, but time spent waiting on Flash is counted as I/O time. With my hardware hat on, I'd say that both Flash and DRAM are solid-state cell-organized memories, and could be treated the same.
So, in this case, your system is running at "100% CPU" for engineering purposes. The load is CPU-limited, and you can measure the Calls Per Second in this state.
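If you still want to see where the remaining few percent actually goes, one low-tech check (a minimal, Linux-specific sketch, not a substitute for OProfile or vmstat) is to sample the aggregate cpu line in /proc/stat twice and break the interval down into user, system, iowait and idle time. A persistently non-zero iowait or idle share under full load points at blocking I/O or sleep/lock waits rather than compute:

```c
/* Minimal sketch (Linux-specific): sample /proc/stat twice and report
 * where the CPU time went over the interval. */
#include <stdio.h>
#include <unistd.h>

struct cpu_times { unsigned long long user, nice, sys, idle, iowait, irq, softirq; };

static int read_times(struct cpu_times *t)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->sys, &t->idle,
                   &t->iowait, &t->irq, &t->softirq);
    fclose(f);
    return n == 7 ? 0 : -1;
}

int main(void)
{
    struct cpu_times a, b;
    if (read_times(&a) != 0) return 1;
    sleep(5);                       /* measurement interval */
    if (read_times(&b) != 0) return 1;

    double user   = (double)(b.user + b.nice - a.user - a.nice);
    double sys    = (double)(b.sys + b.irq + b.softirq - a.sys - a.irq - a.softirq);
    double idle   = (double)(b.idle - a.idle);
    double iowait = (double)(b.iowait - a.iowait);
    double total  = user + sys + idle + iowait;

    printf("user %.1f%%  system %.1f%%  iowait %.1f%%  idle %.1f%%\n",
           100 * user / total, 100 * sys / total,
           100 * iowait / total, 100 * idle / total);
    return 0;
}
```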