I have a SQL system that at times does a large amount of disk reads, at a throughput of about 1GB/s.
During those periods of heavy disk activity, the iSCSI traffic shows up as over 8GB/s of network receive activity.
In this scenario we are using jumbo frames, so I would assume an efficiency of about 99.1%, and we are certainly not seeing that.
Setup:
HPE 3PAR SAN
VMware virtual machine with a dedicated 10GB virtual adapter for iSCSI
All the recommended advanced network adapter settings are set on the NIC
Using the in-guest Microsoft iSCSI initiator driver and setup - no third-party tools
We are set up with iSCSI MPIO (round robin with subset) from the single client-side iSCSI IP to two 10GB array-side iSCSI IP/IQNs
Can anyone think of why the disk reads would cause such a large amount of network traffic?
Normally, disks are measured in Bytes (more precisely, octets) and networks are measured in Bits. Because the two abbreviations are so easily confused, it is usually advised never to abbreviate them: in practice both Byte and Bit end up written as "B", and the meaning is assumed to be clear from context (when talking about storage we essentially always mean Bytes; when talking about network throughput we generally mean Bits).
So, a throughput of 1 GByte/s on the disk is 8 Gbit/s, which matches pretty much exactly what you are seeing on the network.
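For reference, here is that arithmetic spelled out as a tiny program. It is only a sketch: the 9000-byte MTU and the per-frame header sizes are typical assumed values, not numbers measured on this setup.

```cpp
// Back-of-the-envelope arithmetic for this thread: the byte-to-bit
// conversion and the ~99% jumbo-frame efficiency the question mentions.
// The MTU and header sizes below are assumptions for illustration.
#include <cstdio>

int main() {
    const double disk_gbytes_per_sec = 1.0;                        // reported disk throughput
    const double wire_gbits_per_sec  = disk_gbytes_per_sec * 8.0;  // 8 bits per byte

    const double mtu      = 9000.0;               // jumbo frame MTU (assumed)
    const double ip_tcp   = 20.0 + 20.0;          // IPv4 + TCP headers inside the MTU
    const double ethernet = 14.0 + 4.0 + 20.0;    // header + FCS + preamble/interframe gap
    const double efficiency = (mtu - ip_tcp) / (mtu + ethernet);

    std::printf("1 GByte/s of disk reads = %.0f Gbit/s on the wire\n", wire_gbits_per_sec);
    std::printf("jumbo-frame payload efficiency ~= %.1f%%\n", efficiency * 100.0);
    return 0;
}
```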
I saw that GCloud offers N2 instances with up to 128 vCPUs. I wonder what kind of hardware that is. Do they really put 128 cores into 1 chip? If so, Intel doesn't make them generally available for sale to the public, right? If they use several chips, how do they split the cores? Also, assuming all cores are on the same node, do they place more than 2 CPU chips on that node, or do they have chips with 56 cores (which is also a lot)?
Thanks!
You can easily build or purchase a system with 128 vCPUs. Duplicating Google's custom hardware and firmware is another matter. 128 vCPUs is not large today.
Google Cloud publishes the processor families: CPU platforms
The Intel Ice Lake Xeon motherboards support multiple processor chips.
With a two-processor motherboard using the 40-core model (Xeon Platinum 8380), 160 vCPUs are supported.
For your example, Google is using 32-core CPUs.
Note: one physical core provides two vCPUs (one per hardware thread).
I am not sure what Google is using for n2d-standard-224, which supports 224 vCPUs. That might be a four-processor configuration of the 28-core Ice Lake models.
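To make the core count arithmetic explicit, here is a tiny sketch. The socket and core counts are the ones discussed above; the four-socket guess for the 224-vCPU shape is just that, a guess.

```cpp
// vCPU count = sockets x physical cores per socket x 2 hardware threads per core.
#include <cstdio>

static int vcpus(int sockets, int cores_per_socket) {
    return sockets * cores_per_socket * 2;   // 2 vCPUs per physical core
}

int main() {
    std::printf("2 x 32-core chips -> %d vCPUs (the 128-vCPU N2 shape)\n", vcpus(2, 32));
    std::printf("2 x 40-core 8380  -> %d vCPUs\n", vcpus(2, 40));
    std::printf("4 x 28-core chips -> %d vCPUs (guess for the 224-vCPU shape)\n", vcpus(4, 28));
    return 0;
}
```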
GCloud N2 machines: 128 vCPUs in 1 chip?
Currently, the only processors I am aware of that put 64 cores (128 vCPUs) on a single chip are ARM processors from Ampere. That means Google is using more than one processor chip on a multi-CPU motherboard.
If so, Intel doesn't make them generally available for sale to the public, right?
You can buy just about any processor on Amazon, for example.
If they use several chips, how do they split the cores? Also, assuming all cores are on the same node, do they place more than 2 CPU chips on that node, or do they have chips with 56 cores (which is also a lot)?
You are thinking in terms of laptop and desktop technology. Enterprise rack-mounted servers typically support two or more processor chips; this has been the norm for decades.
I have written a program that works just fine. I now want to run 32 independent instances of it, in parallel, on our 32-core machine (AMD Threadripper 2990WX, 128GB DDR4 RAM, Ubuntu 18.04). However, the performance gains are almost nil after about 12 processes running concurrently on the same machine; a plot of the average speedup flattens out around that point. I now need to optimize this.
I want to identify the source of this scaling bottleneck.
I would like to know the available techniques for seeing, in my code, whether there are any "hot" parts that prevent 32 processes from yielding significant gains compared to 12.
My guess is it has to do with memory access and the NUMA architecture. I tried experimenting with numactl and assigning a core to each process, without noticeable improvement.
Each instance of the application uses at most about 1GB of memory. It is written in C++, and there is no "parallel code" (no threads, no mutexes, no atomic operations); each instance is totally independent, and there is no interprocess communication (I just start them with nohup through a bash script). The core of this application is an agent-based simulation: a lot of objects are progressively created, interact with each other and are regularly updated, which is probably not very cache friendly.
I have tried to use Linux perf, but I am not sure what I should look for; also, the mem module of perf doesn't work on AMD CPUs.
I have also tried using AMD uProf, but again I am not sure where this system-wide bottleneck would appear.
Any help would be greatly appreciated.
The problem may be the Threadripper architecture. It is a 32-core CPU, but those cores are distributed among 4 NUMA nodes, with half of them not directly connected to memory. So you may need to:
set processor affinity for all your processes to ensure that they never jump between cores (see the sketch after this list)
ensure that processes running on the normal NUMA nodes only access memory directly attached to that node
put less load on the cores sitting on the crippled (memory-less) NUMA nodes
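A minimal sketch of the first point, assuming Linux and g++ (which defines _GNU_SOURCE by default, needed for the CPU_SET macros): it pins the current process to one core passed on the command line. The core index is purely illustrative.

```cpp
// Pin the current process to a single core so the scheduler cannot
// migrate it between cores or NUMA nodes. The core index comes from the
// command line; per-node memory binding is left to numactl.
#include <sched.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    int core = (argc > 1) ? std::atoi(argv[1]) : 0;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    // pid 0 means "the calling process"
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }

    std::printf("pinned to core %d\n", core);
    // ... run the simulation from here ...
    return 0;
}
```

For the second point, launching each instance as numactl --cpunodebind=N --membind=N ./your_program keeps both the process and its allocations on node N without touching the code.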
I'm using 32-bit Ubuntu.
- My app needs to store incoming data in RAM (because I need to do a lot of searches on the incoming data and calculate something).
- I need to keep the data for X seconds, so I need to allocate 12GB of memory (client requirement).
- I'm using 32-bit Ubuntu (and don't want to move to 64-bit Ubuntu).
- So I am using a RAM disk to store the incoming data and search it (that way I can use 12GB of RAM on a 32-bit system).
When I test the app with 2GB of allocated memory (instead of 12GB), I see that CPU usage with plain RAM is better than with the RAM disk when I just write data into my DB (15% vs 17% CPU usage),
but when I test the queries (which read a lot of data, or files if I'm working with the RAM disk) I see a huge difference (20% vs 80% CPU usage).
I don't understand why there is such a huge difference.
Both RAM and the RAM disk are backed by RAM, no? Is there anything I can do to get better performance?
There are two reasons that I can think of as to why a RAM disk is slower.
With a RAM disk we may be using RAM as the file medium, but we still have the overhead of a filesystem. That involves system calls to access the data, along with other forms of indirection and copying. Directly accessing memory is just that: direct.
Memory access tends to be fast because we can often find what we are looking for in the processor cache, which saves us from reading from the slower RAM itself. A RAM disk will probably not be able to exploit the processor cache to the same extent, if for no other reason than that every access requires a system call.
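To make the contrast concrete, here is a small sketch of the two access paths. The /mnt/ramdisk path and the 64 MB buffer are illustrative assumptions (use whatever mount point your RAM disk actually has), and a real benchmark would time the two functions over many queries.

```cpp
// Both datasets live in RAM, but the "RAM disk" path goes through the
// filesystem (open/read system calls, extra copies), while the in-memory
// path is plain pointer access that can stay in the CPU cache.
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <vector>

static long search_in_memory(const std::vector<char>& data, char needle) {
    // No system calls: the whole buffer is directly addressable.
    return std::count(data.begin(), data.end(), needle);
}

static long search_via_ramdisk(const char* path, char needle) {
    // Every query re-reads the file: each read() is a system call plus a
    // copy from the RAM disk's pages into this process's buffer.
    std::ifstream in(path, std::ios::binary);
    std::vector<char> chunk(1 << 20);
    long hits = 0;
    while (in.read(chunk.data(), chunk.size()) || in.gcount() > 0) {
        hits += std::count(chunk.begin(), chunk.begin() + in.gcount(), needle);
    }
    return hits;
}

int main() {
    std::vector<char> data(64u << 20, 'a');   // 64 MB of dummy "incoming data"
    std::ofstream("/mnt/ramdisk/data.bin", std::ios::binary)
        .write(data.data(), data.size());     // stand-in for the saved data

    std::printf("in memory:    %ld hits\n", search_in_memory(data, 'a'));
    std::printf("via RAM disk: %ld hits\n", search_via_ramdisk("/mnt/ramdisk/data.bin", 'a'));
    return 0;
}
```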
If I have two Linux boxes and I am writing a C/C++ program to send a message on one box and receive on the other box, what is the fastest approach?
I am not sure if the various socket/networking technologies I hear bandied around are simply wrappers around an underlying technology, or if they are alternative possibilities. I just want to know what would be the closest to "bare metal" that I could implement from my application.
I was thinking the fastest method would involve writing my program as a driver and loading it into the kernel. However, I would still need to know the fastest socket implementation to use with this approach.
Any modern PC will be able to keep the Ethernet chip's buffers fully loaded, so "bare metal" programming will provide no benefit. The added latency through the kernel is so small compared to the network latency (i.e. speed-of-light limitations) that it's not worth optimizing.
For "fast" as in high-bandwidth data moving between two connected Linux boxes, TCP is your friend, as it will optimize itself to the network's maximum ability without you having to detect and adjust things yourself. A direct connection will have negligible packet loss and generally low latency, so you don't have to worry about window sizes, etc.
If you want "fast" as in quick turnaround to small requests, use UDP.
If you have some other definition of "fast", well, you need to elaborate.
The question is incomplete, as you do not specify any requirements except that it must be fast. There are lots of aspects to consider here, such as the protocol to use (TCP for reliability, UDP for streaming, etc), serialization (what kind of data do you plan to send over the network, can you use a serialization library such as Google Protobuf?), and so on.
My suggestion would be to have a look at various RPC frameworks such as Apache Thrift, Apache Etch or ZeroC Ice and benchmark them before you decide that you really need to use the BSD sockets API or a similarly low-level abstraction.
Well, unless you want to build a kernel module for custom communication over Ethernet, the fastest userspace API from libc is the Berkeley sockets API. Yes, that is a wrapper over the kernel's TCP and UDP implementations, which are layers over IP, which in turn is a layer over Ethernet, LAN and WWAN links, which sit over something else again, but unless you need that kind of extreme, exacting performance, I suggest staying with the simple stuff in userland rather than writing the kernel modules you'd need to use anything lower. Unless I'm completely wrong, there is no way to access raw Ethernet, WWAN, or LAN from userspace, let alone actually accessing the hardware.
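For completeness, here is a minimal sketch of that API: a TCP receiver and sender in one file. The port number and the loopback default are placeholders; on real hardware you would pass the second box's address, and a production version would check every return value.

```cpp
// Run "./a.out server" on one box and "./a.out client <server-ip>" on the other.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static const uint16_t kPort = 5000;                 // illustrative port

static int run_receiver() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(kPort);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 1);

    int conn = accept(listener, nullptr, nullptr);  // block until the sender connects
    char buf[4096];
    ssize_t n = recv(conn, buf, sizeof(buf) - 1, 0);
    if (n > 0) { buf[n] = '\0'; std::printf("received: %s\n", buf); }
    close(conn);
    close(listener);
    return 0;
}

static int run_sender(const char* host) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(kPort);
    inet_pton(AF_INET, host, &addr.sin_addr);       // host is a placeholder address
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        std::perror("connect");
        return 1;
    }
    const char msg[] = "hello from the other box";
    send(fd, msg, sizeof(msg) - 1, 0);
    close(fd);
    return 0;
}

int main(int argc, char** argv) {
    if (argc >= 2 && std::strcmp(argv[1], "client") == 0)
        return run_sender(argc >= 3 ? argv[2] : "127.0.0.1");
    return run_receiver();
}
```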
Note: If you have a few years to rewrite the entire UNIX networking stack and the network card drivers, you can get x86 I/O port access from userspace when running as root with the ioperm() call, but I don't suggest rewriting the entire UNIX networking stack. That's almost two decades of work. Also, direct hardware access from a third-party application is a security disaster waiting to happen.
Note: If you are OK with not using any traditional networking hardware, you could write a custom driver for double-ended USB cables and create a custom network protocol over that; Linux USB device drivers are probably the easiest kind of driver to write, as there is a large API for them. I really don't know how the speed would stack up here, though: USB 2.0 is faster than older Ethernet standards, but 1 Gbps Ethernet is becoming common and now there's USB 3.0, so this could be faster or slower depending on the available hardware. This is more about ease of use.
EDIT: Please, never, ever, ever put code in the kernel for the sake of speed. Please. The huge security hole you put in a machine is not worth the small boost in performance. There was a time when system calls were very expensive and you wanted to minimize them, so adding code to the kernel was an option, but with newer mechanisms like Intel's sysenter/sysexit and AMD's syscall/sysret, they are cheap enough not to warrant the security hole.
I have an application (basically a C++ application) which has the properties below:
Multi-threaded
Each thread has its own thread attributes (like stack size, etc.).
Multi-process (i.e. it runs multiple processes).
Runs on an 8-core processor.
Uses shared memory/IPCs/extensive heap management (allocation/deallocation), system sleep, etc.
So now I am supposed to find the system CAPS at max CPU. The ideal way is to load the system to 100% CPU and then check the CAPS (successful calls) the system supports.
I know that in complex systems, CPU time will be "lost" to context switches, page swaps, I/O, etc.
But my system tops out at 95% CPU (no more than that, irrespective of the load). So the idea here is to find the points that are really eating CPU and then see if I can engineer them to reduce or eliminate the unused CPU time.
Question
How do we find out which of these (I/O, context switching, etc.) is the cause of the unconquerable 5% of CPU? Is there any tool for this? I am aware of OProfile/Quantify and vmstat reports, but none of them give this information.
There may be operations I am not aware of that restrict the maximum CPU utilization. Any link/document that could help me understand the set of operations that reduce my usable CPU would be very helpful.
Edit 1:
Added some more information
a. The OS in question is SUSE Linux 10 (server edition).
b. CAPS is the average number of calls you can run on your system per second. It is basically a telecommunications term, but it can be considered generic: assume your application provides a protocol implementation; how many protocol calls can you make per second?
"100% CPU" is a convenient engineering concept, not a mathematical absolute. There's no objective definition of what it means. For instance, time spent waiting on DRAM is often counted as CPU time, but time spent waiting on Flash is counted as I/O time. With my hardware hat on, I'd say that both Flash and DRAM are solid-state cell-organized memories, and could be treated the same.
So, in this case, your system is running at "100% CPU" for engineering purposes. The load is CPU-limited, and you can measure the Calls Per Second in this state.
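If it helps, here is a minimal sketch of measuring calls per second in that saturated state. The process_one_call() function is a hypothetical stand-in for one protocol call in the real application, and the 10-second window is arbitrary.

```cpp
// Count how many calls complete per second while the system is held at
// its natural maximum load. process_one_call() is a placeholder for one
// protocol call in the real application.
#include <chrono>
#include <cstdio>

static void process_one_call() {
    // placeholder for the real per-call work
    volatile long sink = 0;
    for (int i = 0; i < 100000; ++i) sink += i;
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto window = std::chrono::seconds(10);   // measurement window
    const auto start = clock::now();

    long completed = 0;
    while (clock::now() - start < window) {
        process_one_call();
        ++completed;
    }

    const double secs =
        std::chrono::duration<double>(clock::now() - start).count();
    std::printf("CAPS = %.1f calls/second\n", completed / secs);
    return 0;
}
```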