Sandy Bridge QPI bandwidth perf event

Sandy Bridge QPI bandwidth perf event - profiling

I'm trying to find the proper raw perf event descriptor to monitor QPI traffic (bandwidth) on Intel Xeon E5-2600 (Sandy Bridge).
I've found an event that seems relative here (qpi_data_bandwidth_tx: Number of data flits transmitted . Derived from unc_q_txl_flits_g0.data. Unit: uncore_qpi) but I can't use it in my system. So probably these events refer to a different micro-architecture.
Moreover, I've looked into the "Intel ® Xeon ® Processor E5-2600 Product Family Uncore Performance Monitoring Guide" and the most relative reference I found is the following:
To calculate "data" bandwidth, one should therefore do:
data flits * 8B / time (for L0)
or 4B instead of 8B for L0p
The events that monitor the data flits are:
RxL_FLITS_G0.DATA
RxL_FLITS_G1.DRS_DATA
RxL_FLITS_G2.NCB_DATA
Q1: Are those the correct events?
Q2: If yes, should I monitor all these events and add them in order to get the total data flits or just the first?
Q3: I don't quite understand in what the 8B and time refer to.
Q4: Is there any way to validate?
Also, please feel free to suggest alternatives in monitoring QPI traffic bandwidth in case there are any.
Thank you!

A Xeon E5-2600 processor has two QPI ports, each port can send up to one flit and receive up to one flit per QPI domain clock cycle. Not all flits carry data, but all non-idle flits consume bandwidth. It seems to me that you're interested in counting only data flits, which is useful for detecting remote access bandwdith bottlenecks at the socket level (instead of a particular agent within a socket).
The event RxL_FLITS_G0.DATA can be used to count the number of data flits received. This is equal to the sum of RxL_FLITS_G1.DRS_DATA and RxL_FLITS_G2.NCB_DATA. You only need to measure the latter two events if you care about the break down. Note that there are only 4 event counter per QPI port. The event TxL_FLITS_G0.DATA can be used to count the number of data flits transmitted to other sockets.
The events RxL_FLITS_G0.DATA and TxL_FLITS_G0.DATA can be used to measure the total number of flits transferred through the specified port. So it takes two out of the four counts available in each port to count total data flits.
There is no accurate way to convert data flits to bytes. A flit may contain up to 8 valid bytes. This depends on the type of transaction and power state of the link direction (power states are per link per direction). A good estimate can be obtained by reasonably assuming that most data flits are part of full cache line packets and are being transmitted in the L0 power state, so each flit does contains exactly 8 valid bytes. Alternatively, you can just measure port utilization in terms of data flits rather than bytes.
The unit of time is up to you. Ultimately, if you want to determine whether QPI bandwdith is a bottleneck, the bandwdith has to be measured periodically and compared against the theoretical maximum bandwidth. You can, for example, use total QPI clock cycles, which can be counted on one of the free QPI port PMU counters. The QPI frequency is fixed on JKT.
For validation, you can write a simple program that allocates a large buffer in remote memory and reads it. The measured number of bytes should be about the same as the size of the buffer in bytes.

Related

8051 external ram(62256) and also using address & data lines as GPIO

My application requires 8051 with external RAM 32K(62256) I plan to use one chip(62256) to address 32k, and I want to use the other 32K to access GPIO like higher 32k goes to RAM & lower 32k to keypad and other GPIO peripherals is this possible to do so?

Yes, it's possible. In this particular case, it's even pretty simple/easy.
You're splitting the address space in half. When you address the lower half of the address space, A15 will be low. When you address the upper half, A15 will be high.
The 62256 has an active low chip-enable pin (CE#), meaning the chip is enabled only when CE# is low. You want to enable the 62256 only when A15 is high, so you'll connect A15 on the 8051 to an inverter, and from there to CE# on the 62256.
Although you haven't described the other chips in any real detail, the same basic idea applies with them--you wire up logic that enables each chip if and only if the address is in the correct range. For example, let's say you have some peripheral that looks to the processor like 256 bytes of memory. To keep things really simple, let's assume this peripheral has an AD0 through AD7 that it uses for addresses and data, and uses the same bus cycles as an 8051.
Since you want the CPU to see that chip in the first 256 bytes of the address space, that means it should be active only when all the higher address lines (A8 through A15) are low. So, we feed them into an 8-input OR gate, so its output is high if any of its inputs are high.
So, as a starting point, your decoding circuitry would look vaguely like this:
This is just a sketch though. Just for example, you'll also need circuitry for the OE# pin on the 62256, which will be activated by the WR# pin on the 8051, and unless the bus cycles for the other chips happen to match perfectly with those for the 8051, you'll end up with (for example) some buffers to hold data coming from one until it's time to send it to the other.

Measuring Round Trip Time using DPDK

My system is CentOS 8 with kernel: 4.18.0-240.22.1.el8_3.x86_64 and I am using DPDK 20.11.1. Kernel:
I want to calculate the round trip time in an optimized manner such that the packet sent from Machine A to Machine B is looped back from Machine B to A and the time is measured. While this being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach can be to use DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find it to be calculating the Round Trip Time in such a way. Though ping is another way but when Machine B receives ping packet from Machine A, it would have to process the packet and then respond back to Machine A, which would add some cycles (which is undesired in my case).
Open to suggestions and approaches to calculate this time. Also a benchmark to compare the RTT (Round Trip Time) of a DPDK based application versus non-DPDK setup would also give a better comparison.
Edit: There is a way to enable latency in DPDK pktgen. Can anyone share some information that how this latency is being calculated and what it signifies (I could not find solid information regarding the page latency in the documentation.

It really depends on the kind of round trip you want to measure. Consider the following timestamps:
-> t1 -> send() -> NIC_A -> t2 --link--> t3 -> NIC_B -> recv() -> t4
host_A host_B
<- t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4'
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A. (On host b runs a forwarding application.) See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME) which has an overhead of 18 ns or so.
This works for single packet transmits via rte_eth_tx_burst() and single packet receives - which aren't necessarily realistic for your target application. For larger bursts you would have to use get a timestamp before the first transmit and after the last transmit and compute the average delta then.
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.
If you want to compute the roundtrip t2' - t2 then you first need to discipline the NIC's clock (e.g. with phc2ys), enable timestamping and get those timestamps. However, AFAICS dpdk doesn't support obtaining the TX timestamps, in general.
Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX end of NIC_A and connect the monitor ports to a packet capture NIC that supports receive hardware timestamping. With such as setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.

The ideal way to latency for sending and receiving packets through an interface is setup external Loopback device on the Machine A NIC port. This will ensure the packet sent is received back to the same NIC without any processing.
The next best alternative is to enable Internal Loopback, this will ensure the desired packet is converted to PCIe payload and DMA to the Hardware Packet Buffer. Based on the PCIe config the packet buffer will share to RX descriptors leading to RX of send packet. But for this one needs a NIC
supports internal Loopback
and can suppress Loopback error handlers.
Another way is to use either PCIe port to port cross connect. In DPDK, we can run RX_BURST for port-1 on core-A and RX_BURST for port-2 on core-B. This will ensure an almost accurate Round Trip Time.
Note: Newer Hardware supports doorbell mechanism, so on both TX and RX we can enable HW to send a callback to driver/PMD which then can be used to fetch HW assisted PTP time stamps for nanosecond accuracy.
But in my recommendation using an external (Machine-B) is not desirable because of
Depending upon the quality of the transfer Medium, the latency varies
If machine-B has to be configured to the ideal settings (for almost 0 latency)
Machine-A and Machine-B even if physical configurations are the same, need to be maintained and run at the same thermal settings to allow the right clocking.
Both Machine-A and Machine-B has to run with same PTP grand master to synchronize the clocks.
If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC
With these changes, a dummy UDP packet can be created, where
first 8 bytes should carry the actual TX time before tx_burst from Machine-A (T1).
second 8 bytes is added by machine-B when it actually receives the packet in SW via rx_burst (2).
third 8 bytes is added by Machine-B when tx_burst is completed (T3).
fourth 8 bytes are found in Machine-A when packet is actually received via rx-burst (T4)
with these Round trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 gives receive and transmit time from Machine A and T3 and T2 gives the processing overhead.
Note: depending upon the processor and generation, no-variant TSC is available. this will ensure the ticks rte_get_tsc_cycles is not varying per frequency and power states.
[Edit-1] as mentioned in comments
#AmmerUsman, I highly recommend editing your question to reflect the real intention as to how to measure the round trip time is taken, rather than TX-RX latency from DUT?, this is because you are referring to DPDK latency stats/metric but that is for measuring min/max/avg latency between Rx-Tx on the same DUT.
#AmmerUsman latency library in DPDK is stats representing the difference between TX-callback and RX-callback and not for your use case described. As per Keith explanation pointed out Packet send out by the traffic generator should send a timestamp on the payload, receiver application should forward to the same port. then the receiver app can measure the difference between the received timestamp and the timestamp embedded in the packet. For this, you need to send it back on the same port which does not match your setup diagram

Low latency serial communication on Linux

I'm implementing a protocol over serial ports on Linux. The protocol is based on a request answer scheme so the throughput is limited by the time it takes to send a packet to a device and get an answer. The devices are mostly arm based and run Linux >= 3.0. I'm having troubles reducing the round trip time below 10ms (115200 baud, 8 data bit, no parity, 7 byte per message).
What IO interfaces will give me the lowest latency: select, poll, epoll or polling by hand with ioctl? Does blocking or non blocking IO impact latency?
I tried setting the low_latency flag with setserial. But it seemed like it had no effect.
Are there any other things I can try to reduce latency? Since I control all devices it would even be possible to patch the kernel, but its preferred not to.
---- Edit ----
The serial controller uses is an 16550A.

Request / answer schemes tends to be inefficient, and it shows up quickly on serial port. If you are interested in throughtput, look at windowed protocol, like kermit file sending protocol.
Now if you want to stick with your protocol and reduce latency, select, poll, read will all give you roughly the same latency, because as Andy Ross indicated, the real latency is in the hardware FIFO handling.
If you are lucky, you can tweak the driver behaviour without patching, but you still need to look at the driver code. However, having the ARM handle a 10 kHz interrupt rate will certainly not be good for the overall system performance...
Another options is to pad your packet so that you hit the FIFO threshold every time. It will also confirm that if it is or not a FIFO threshold problem.
10 msec # 115200 is enough to transmit 100 bytes (assuming 8N1), so what you are seeing is probably because the low_latency flag is not set. Try
setserial /dev/<tty_name> low_latency
It will set the low_latency flag, which is used by the kernel when moving data up in the tty layer:
void tty_flip_buffer_push(struct tty_struct *tty)
{
unsigned long flags;
spin_lock_irqsave(&tty->buf.lock, flags);
if (tty->buf.tail != NULL)
tty->buf.tail->commit = tty->buf.tail->used;
spin_unlock_irqrestore(&tty->buf.lock, flags);
if (tty->low_latency)
flush_to_ldisc(&tty->buf.work);
else
schedule_work(&tty->buf.work);
}
The schedule_work call might be responsible for the 10 msec latency you observe.

Having talked to to some more engineers about the topic I came to the conclusion that this problem is not solvable in user space. Since we need to cross the bridge into kernel land, we plan to implement an kernel module which talks our protocol and gives us latencies < 1ms.
--- edit ---
Turns out I was completely wrong. All that was necessary was to increase the kernel tick rate. The default 100 ticks added the 10ms delay. 1000Hz and a negative nice value for the serial process gives me the time behavior I wanted to reach.

Serial ports on linux are "wrapped" into unix-style terminal constructs, which hits you with 1 tick lag, i.e. 10ms. Try if stty -F /dev/ttySx raw low_latency helps, no guarantees though.
On a PC, you can go hardcore and talk to standard serial ports directly, issue setserial /dev/ttySx uart none to unbind linux driver from serial port hw and control the port via inb/outb to port registers. I've tried that, it works great.
The downside is you don't get interrupts when data arrives and you have to poll the register. often.
You should be able to do same on the arm device side, may be much harder on exotic serial port hw.

Here's what setserial does to set low latency on a file descriptor of a port:
ioctl(fd, TIOCGSERIAL, &serial);
serial.flags |= ASYNC_LOW_LATENCY;
ioctl(fd, TIOCSSERIAL, &serial);

In short: Use a USB adapter and ASYNC_LOW_LATENCY.
I've used a FT232RL based USB adapter on Modbus at 115.2 kbs.
I get about 5 transactions (to 4 devices) in about 20 mS total with ASYNC_LOW_LATENCY. This includes two transactions to a slow-poke device (4 mS response time).
Without ASYNC_LOW_LATENCY the total time is about 60 mS.
With FTDI USB adapters ASYNC_LOW_LATENCY sets the inter-character timer on the chip itself to 1 mS (instead of the default 16 mS).
I'm currently using a home-brewed USB adapter and I can set the latency for the adapter itself to whatever value I want. Setting it at 200 µS shaves another mS off that 20 mS.

None of those system calls have an effect on latency. If you want to read and write one byte as fast as possible from userspace, you really aren't going to do better than a simple read()/write() pair. Try replacing the serial stream with a socket from another userspace process and see if the latencies improve. If they don't, then your problems are CPU speed and hardware limitations.
Are you sure your hardware can do this at all? It's not uncommon to find UARTs with a buffer design that introduces many bytes worth of latency.

At those line speeds you should not be seeing latencies that large, regardless of how you check for readiness.
You need to make sure the serial port is in raw mode (so you do "noncanonical reads") and that VMIN and VTIME are set correctly. You want to make sure that VTIME is zero so that an inter-character timer never kicks in. I would probably start with setting VMIN to 1 and tune from there.
The syscall overhead is nothing compared to the time on the wire, so select() vs. poll(), etc. is unlikely to make a difference.

Packet Delay Variation (PDV)

I am currently implementing video streaming application where the goal is to utilize as much as possible gigabit ethernet bandwidth
Application protocol is built over tcp/ip
Network library is using asynchronous iocp mechanism
Only streaming over LAN is needed
No need for packets to go through routers
This simplifies many things. Nevertheless, I am experiencing problems with packet delay variation.
It means that a video frame which should arrive for example every 20 ms (1280 x 720p 50Hz video signal) sometimes arrives delayed by tens of milliseconds. More:
Average frame rate is kept
Maximum video frame delay is dependent on network utilization
The more data on LAN, the higher the maximum video frame delay
For example, when bandwidth usage is 800mbps, PDV is about 45 - 50 ms.
To my questions:
What are practical boundaries in lowering that value?
Do you know about measurement report available on internet dealing with this?
I want to know if there is some subtle error in my application (perhaps excessive locking) or if there is no way to make numbers better with current technology.

For video streaming, I would recommend using UDP instead of TCP, as it has less overhead and packet confirmation is usually not needed, as the retransmited data would already be obsolete.

How do you throttle the bandwidth of a socket connection in C?

I'm writing a client-server app using BSD sockets. It needs to run in the background, continuously transferring data, but cannot hog the bandwidth of the network interface from normal use. Depending on the speed of the interface, I need to throttle this connection to a certain max transfer rate.
What is the best way to achieve this, programmatically?

The problem with sleeping a constant amount of 1 second after each transfer is that you will have choppy network performance.
Let BandwidthMaxThreshold be the desired bandwidth threshold.
Let TransferRate be the current transfer rate of the connection.
Then...
If you detect your TransferRate > BandwidthMaxThreshold then you do a SleepTime = 1 + SleepTime * 1.02 (increase sleep time by 2%)
Before or after each network operation do a
Sleep(SleepTime)
If you detect your TransferRate is a lot lower than your BandwidthMaxThreshold you can decrease your SleepTime. Alternatively you could just decay/decrease your SleepTime over time always. Eventually your SleepTime will reach 0 again.
Instead of an increase of 2% you could also do an increase by a larger amount linearly of the difference between TransferRate - BandwidthMaxThreshold.
This solution is good, because you will have no sleeps if the user's network is already not as high as you would like.

The best way would be to use a token bucket.
Transmit only when you have enough tokens to fill a packet (1460 bytes would be a good amount), or if you are the receive side, read from the socket only when you have enough tokens; a bit of simple math will tell you how long you have to wait before you have enough tokens, so you can sleep that amount of time (be careful to calculate how many tokens you gained by how much you actually slept, since most operating systems can sleep your process for longer than you asked).
To control the size of the bursts, limit the maximum amount of tokens you can have; a good amount could be one second worth of tokens.

I've had good luck with trickle. It's cool because it can throttle arbitrary user-space applications without modification. It works by preloading its own send/recv wrapper functions which do the bandwidth calculation for you.
The biggest drawback I found was that it's hard to coordinate multiple applications that you want to share finite bandwidth. "trickled" helps, but I found it complicated.
Update in 2017: it looks like trickle moved to https://github.com/mariusae/trickle

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js