How to get current data rate or available bandwidth value in NS-3 or DCE-NS-3? - c++

I am trying to get the current data rate or available bandwidth on each Wi-Fi interface in DCE-NS-3 or NS-3. I have two mobile nodes in Wi-Fi ad hoc mode, using the IEEE 802.11b and 802.11g standards, with the data rate set to 11 Mbps and 54 Mbps respectively. But when using the iperf application in dce-ns3 in UDP mode, the reported transfer is always a constant 1.31 MBytes for 11 Mbps and 6.46 MBytes for 54 Mbps.
I can get the static value, which never changes, through the GetBitRate() method of the DataRate class, but I need the available bandwidth or data rate periodically: because of traffic from other layers, the available bandwidth or data rate on each wireless interface changes continuously.
Hence, I am looking for the proper method to get the current or available data rate during my simulation (at runtime), whenever I want to transfer data using my scheduling mechanism.
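For reference, here is a minimal sketch (not an official NS-3 recipe) of the two pieces involved: DataRate::GetBitRate() only returns the configured, static value, while a rough runtime estimate can be obtained by counting bytes on a MAC trace source and sampling the counter periodically with Simulator::Schedule(). The trace path and the 1-second sampling interval are assumptions; adjust them to your topology.

    #include "ns3/core-module.h"
    #include "ns3/network-module.h"
    #include <iostream>

    using namespace ns3;

    static uint64_t g_txBytes = 0;

    // Trace sink: count every byte handed to the Wi-Fi MAC for transmission.
    void MacTxTrace (std::string context, Ptr<const Packet> pkt)
    {
      g_txBytes += pkt->GetSize ();
    }

    // Sample the counter every `interval` and print the achieved rate.
    void SampleRate (Time interval)
    {
      static uint64_t lastBytes = 0;
      double bps = (g_txBytes - lastBytes) * 8.0 / interval.GetSeconds ();
      lastBytes = g_txBytes;
      std::cout << Simulator::Now ().GetSeconds () << "s  ~" << bps << " bit/s" << std::endl;
      Simulator::Schedule (interval, &SampleRate, interval);
    }

    // In main(), after the Wi-Fi devices are installed (paths are assumptions):
    //   DataRate phyRate ("11Mbps");
    //   std::cout << phyRate.GetBitRate () << std::endl;   // static value only
    //   Config::Connect ("/NodeList/*/DeviceList/*/$ns3::WifiNetDevice/Mac/MacTx",
    //                    MakeCallback (&MacTxTrace));
    //   Simulator::Schedule (Seconds (1.0), &SampleRate, Seconds (1.0));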

Related

Measuring Round Trip Time using DPDK

My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64 and I am using DPDK 20.11.1.
I want to calculate the round trip time in an optimized manner, such that a packet sent from Machine A to Machine B is looped back from Machine B to A and the time is measured. While this is being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach could be to use the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find it calculating the Round Trip Time in this way. Ping is another option, but when Machine B receives a ping packet from Machine A it has to process the packet and then respond back to Machine A, which adds some cycles (undesired in my case).
I am open to suggestions and approaches to calculate this time. A benchmark comparing the RTT (Round Trip Time) of a DPDK-based application versus a non-DPDK setup would also give a better comparison.
Edit: There is a way to enable latency measurement in DPDK pktgen. Can anyone share some information on how this latency is calculated and what it signifies? (I could not find solid information regarding latency in the documentation.)
It really depends on the kind of round trip you want to measure. Consider the following timestamps:
host_A                                                          host_B
t1  -> send() -> NIC_A -> t2  --link--> t3  -> NIC_B -> recv() -> t4
t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4'
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A (on host B a forwarding application runs). See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
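A minimal sketch of that t1' - t1 measurement on host A, assuming port 0 is already configured, a probe mbuf has been prepared, and host B simply forwards the packet back; rte_rdtsc_precise(), rte_get_tsc_hz() and the rx/tx burst calls are real DPDK APIs, the rest is illustrative glue:

    #include <rte_cycles.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    // Returns the software round trip time t1' - t1 in nanoseconds for a
    // single probe packet sent on `port` and looped back by host B.
    static double measure_rtt_ns(uint16_t port, struct rte_mbuf *probe)
    {
        uint64_t t1 = rte_rdtsc_precise();           // right before the transmit call
        rte_eth_tx_burst(port, 0, &probe, 1);        // single-packet transmit

        struct rte_mbuf *rx = nullptr;
        while (rte_eth_rx_burst(port, 0, &rx, 1) == 0)
            ;                                        // busy-wait for the looped-back packet
        uint64_t t1p = rte_rdtsc_precise();          // right after the receive call
        rte_pktmbuf_free(rx);

        return (double)(t1p - t1) * 1e9 / (double)rte_get_tsc_hz();
    }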
For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME) which has an overhead of 18 ns or so.
This works for single-packet transmits via rte_eth_tx_burst() and single-packet receives, which aren't necessarily realistic for your target application. For larger bursts you would have to get a timestamp before the first transmit and after the last transmit and then compute the average delta.
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.
If you want to compute the roundtrip t2' - t2 then you first need to discipline the NIC's clock (e.g. with phc2sys), enable timestamping and get those timestamps. However, AFAICS DPDK doesn't support obtaining the TX timestamps, in general.
Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX side of NIC_A and connect the monitor ports to a packet-capture NIC that supports receive hardware timestamping. With such a setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.
The ideal way to measure the latency of sending and receiving packets through an interface is to set up an external loopback device on the Machine-A NIC port. This ensures the packet sent is received back on the same NIC without any processing.
The next best alternative is to enable internal loopback. This ensures the desired packet is converted to a PCIe payload and DMA'd to the hardware packet buffer; based on the PCIe config, the packet buffer is shared with the RX descriptors, leading to RX of the sent packet. But for this one needs a NIC that supports internal loopback and can suppress loopback error handlers.
Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run rx_burst for port 1 on core A and rx_burst for port 2 on core B. This will give an almost accurate Round Trip Time.
Note: newer hardware supports a doorbell mechanism, so on both TX and RX we can enable the HW to send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
But in my recommendation, using an external machine (Machine-B) is not desirable, because:
Depending upon the quality of the transfer medium, the latency varies.
Machine-B has to be configured with ideal settings (for almost zero latency).
Even if the physical configurations of Machine-A and Machine-B are the same, they need to be maintained and run at the same thermal settings to allow the right clocking.
Both Machine-A and Machine-B have to run with the same PTP grandmaster to synchronize the clocks.
If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush() to ensure the packet is sent out to the NIC.
With these changes, a dummy UDP packet can be created, where
first 8 bytes should carry the actual TX time before tx_burst from Machine-A (T1).
second 8 bytes are added by Machine-B when it actually receives the packet in SW via rx_burst (T2).
third 8 bytes is added by Machine-B when tx_burst is completed (T3).
fourth 8 bytes are filled in on Machine-A when the packet is actually received via rx_burst (T4).
With these, Round Trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 give the receive and transmit times on Machine-A, and T3 and T2 give the processing overhead on Machine-B.
Note: depending upon the processor and generation, an invariant TSC is available. This ensures the tick count from rte_get_tsc_cycles() does not vary with frequency and power states.
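As an illustration of the four-timestamp scheme above (the struct and field names are mine, not a DPDK structure), the dummy UDP payload and the final computation could look like this, with T1-T4 filled from rte_get_tsc_cycles() or a PTP-disciplined clock at the points listed:

    #include <cstdint>

    // Hypothetical payload layout for the dummy UDP probe described above.
    struct ProbePayload {
        uint64_t t1;  // written by Machine-A just before its tx_burst
        uint64_t t2;  // written by Machine-B right after its rx_burst
        uint64_t t3;  // written by Machine-B when its tx_burst is completed
        uint64_t t4;  // filled in by Machine-A right after its rx_burst
    };

    // Round Trip Time = (T4 - T1) - (T3 - T2): total time seen by Machine-A
    // minus the software processing overhead spent on Machine-B.
    static inline uint64_t round_trip_cycles(const ProbePayload &p)
    {
        return (p.t4 - p.t1) - (p.t3 - p.t2);
    }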
[Edit-1] As mentioned in the comments:
@AmmerUsman, I highly recommend editing your question to reflect the real intention, namely how the round trip time is to be measured, rather than TX-RX latency on the DUT. This is because you are referring to the DPDK latency stats/metrics, but those measure min/max/avg latency between RX and TX on the same DUT.
@AmmerUsman, the latency library in DPDK produces stats representing the difference between the TX callback and the RX callback, and is not for the use case you described. As the explanation Keith pointed out says, the packet sent out by the traffic generator should carry a timestamp in the payload, the receiver application should forward it back on the same port, and the generator can then measure the difference between the receive time and the timestamp embedded in the packet. For this, you need to send it back on the same port, which does not match your setup diagram.

Sandy Bridge QPI bandwidth perf event

I'm trying to find the proper raw perf event descriptor to monitor QPI traffic (bandwidth) on Intel Xeon E5-2600 (Sandy Bridge).
I've found an event that seems relevant here (qpi_data_bandwidth_tx: Number of data flits transmitted. Derived from unc_q_txl_flits_g0.data. Unit: uncore_qpi), but I can't use it on my system, so these events probably refer to a different micro-architecture.
Moreover, I've looked into the "Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide", and the most relevant reference I found is the following:
To calculate "data" bandwidth, one should therefore do:
data flits * 8B / time (for L0)
or 4B instead of 8B for L0p
The events that monitor the data flits are:
RxL_FLITS_G0.DATA
RxL_FLITS_G1.DRS_DATA
RxL_FLITS_G2.NCB_DATA
Q1: Are those the correct events?
Q2: If yes, should I monitor all these events and add them in order to get the total data flits or just the first?
Q3: I don't quite understand what the 8B and the time refer to.
Q4: Is there any way to validate?
Also, please feel free to suggest alternatives in monitoring QPI traffic bandwidth in case there are any.
Thank you!
A Xeon E5-2600 processor has two QPI ports; each port can send up to one flit and receive up to one flit per QPI domain clock cycle. Not all flits carry data, but all non-idle flits consume bandwidth. It seems to me that you're interested in counting only data flits, which is useful for detecting remote-access bandwidth bottlenecks at the socket level (instead of at a particular agent within a socket).
The event RxL_FLITS_G0.DATA can be used to count the number of data flits received. This is equal to the sum of RxL_FLITS_G1.DRS_DATA and RxL_FLITS_G2.NCB_DATA, so you only need to measure the latter two events if you care about the breakdown. Note that there are only 4 event counters per QPI port. The event TxL_FLITS_G0.DATA can be used to count the number of data flits transmitted to other sockets.
Together, RxL_FLITS_G0.DATA and TxL_FLITS_G0.DATA measure the total number of data flits transferred through the specified port, so it takes two of the four counters available in each port to count total data flits.
There is no accurate way to convert data flits to bytes. A flit may contain up to 8 valid bytes, depending on the type of transaction and the power state of the link direction (power states are per link per direction). A good estimate can be obtained by reasonably assuming that most data flits are part of full cache line packets and are transmitted in the L0 power state, so each flit contains exactly 8 valid bytes. Alternatively, you can just measure port utilization in terms of data flits rather than bytes.
The unit of time is up to you. Ultimately, if you want to determine whether QPI bandwidth is a bottleneck, the bandwidth has to be measured periodically and compared against the theoretical maximum bandwidth. You can, for example, use total QPI clock cycles, which can be counted on one of the free QPI port PMU counters. The QPI frequency is fixed on JKT.
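Putting the above together, a back-of-the-envelope estimate (assuming the L0 power state, i.e. 8 valid bytes per data flit; the parameter names are just placeholders for whatever raw counter deltas you read) would be:

    #include <cstdint>

    // Per-port QPI data bandwidth estimate from raw counter deltas taken over
    // the same sampling window, assuming the L0 power state (8 bytes/flit).
    double qpi_data_bandwidth_Bps(uint64_t rx_data_flits,     // delta of RxL_FLITS_G0.DATA
                                  uint64_t tx_data_flits,     // delta of TxL_FLITS_G0.DATA
                                  uint64_t qpi_clock_cycles,  // delta of QPI clock cycles
                                  double qpi_hz)              // fixed QPI frequency on JKT
    {
        double seconds = static_cast<double>(qpi_clock_cycles) / qpi_hz;
        return (rx_data_flits + tx_data_flits) * 8.0 / seconds;   // bytes per second
    }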
For validation, you can write a simple program that allocates a large buffer in remote memory and reads it. The measured number of bytes should be about the same as the size of the buffer in bytes.

How to calculate throughput for each node

I am working on a network congestion control algorithm and I would like to get the fairness, more specifically Jain's fairness index.
However I need to get the throughput of each node or vehicle.
I know the basic definition of throughput in networks, which is the amount of data transferred over a period of time.
I am not sure what the term means in a vehicular network, due to the network being broadcast only.
In my current setup, the only message transmitted is the BSM, at a rate of 10 Hz.
Using: OMNeT++ 5.1.1, VEINS 4.6 and SUMO 0.30
Thank you.
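Not VEINS-specific, but for reference: with per-node throughputs x_1..x_n (e.g. received BSM bytes over a measurement window), Jain's fairness index is (sum x_i)^2 / (n * sum x_i^2), ranging from 1/n (worst) to 1 (perfectly fair). A small helper with illustrative names only:

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Per-node throughput over a fixed measurement window.
    double ThroughputBps(uint64_t bytesReceived, double windowSeconds)
    {
        return bytesReceived * 8.0 / windowSeconds;
    }

    // Jain's fairness index over the per-node throughputs.
    double JainsIndex(const std::vector<double> &throughputs)
    {
        if (throughputs.empty())
            return 0.0;                        // undefined for an empty set
        double sum = std::accumulate(throughputs.begin(), throughputs.end(), 0.0);
        double sumSq = 0.0;
        for (double x : throughputs)
            sumSq += x * x;
        double n = static_cast<double>(throughputs.size());
        return (sum * sum) / (n * sumSq);
    }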

Measure data transfer rate (bandwidth) between 2 apps across network using C++, how to get unbiased and accurate result?

I am trying to measure the IO data transfer rate (bandwidth) between 2 simulation applications (written in C++). I created a very simple perfclient and perfserver program just to verify that my approach to calculating the network bandwidth is correct before implementing it in the real applications. So in this case, I need to do it programmatically (NOT using iperf).
I tried to run my perfclient and perfserver programs in various setups (localhost, a computer connected to Ethernet, and a computer connected over a wireless connection). However, I always get about the same bandwidth in each of these setups, around 1900 Mbps (tested using a data size of 1472 bytes). Is this a reasonable result, or can I get a better and more accurate bandwidth?
Should I use 1472 (which is the Ethernet MTU minus the headers) as the maximum data size for each send() and recv(), and why/why not? I also tried using different data sizes, and here are the average bandwidths that I get (tested over the Ethernet connection), which did not make sense to me because the numbers exceeded 1 Gbps and reached something like 28 Gbps.
SIZE BANDWIDTH
1KB 1396 Mbps
2KB 2689 Mbps
4KB 5044 Mbps
8KB 9146 Mbps
16KB 16815 Mbps
32KB 22486 Mbps
64KB 28560 Mbps
Here is my current approach:
I did a basic ping-pong-fashion loop, where the client continuously sends bytes of a data stream to the server program. The server reads the data and reflects (sends) it back to the client program. The client then reads the reflected data (2-way transmission). The above operation is repeated 1000 times, and I then divided the total time by 1000 to get the average latency. Next, I divided the average latency by 2 to get the 1-way transmission time. Bandwidth can then be calculated as follows:
bandwidth = total bytes sent / average 1-way transmission time
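A bare-bones sketch of that ping-pong loop on the client side, assuming a connected TCP socket fd and that each send()/recv() moves the whole buffer (real code must check return values and loop on short reads/writes); the function and variable names are illustrative:

    #include <chrono>
    #include <cstddef>
    #include <sys/socket.h>

    // Returns the estimated one-way bandwidth in bytes per second, following
    // the formula above: total bytes sent / average 1-way transmission time.
    double MeasureBandwidth(int fd, char *buf, size_t size, int iterations = 1000)
    {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            send(fd, buf, size, 0);                // client -> server
            recv(fd, buf, size, MSG_WAITALL);      // server reflects the data back
        }
        auto end = std::chrono::steady_clock::now();

        double totalSeconds = std::chrono::duration<double>(end - start).count();
        double avgLatency = totalSeconds / iterations;  // one ping-pong round
        double oneWayTime = avgLatency / 2.0;           // assume a symmetric path
        return size / oneWayTime;                       // multiply by 8 for bits/s
    }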
Is there anything wrong with my approach? How can I make sure that my result is not biased? Once I get this right, I will need to test this approach in my original application (not this simple testing application), and I want to put this performance testing result in a scientific paper.
EDIT:
I have solved this problem. Check out the answer that I posted below.
Unless you have a need to reinvent the wheel, iperf was made to handle just this problem.
Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, datagram loss.
I was finally able to figure this out and solve it :-)
As I mentioned in the question, regardless of the network setup that I used (localhost, 1 Gbps Ethernet card, wireless connection, etc.), my measured bandwidth scaled up to 28 Gbps. I tried binding the server to several different IP addresses, as follows:
127.0.0.1
IP address given by my LAN connection
IP address given by my wireless connection
So I thought this would give me a correct result, but in fact it didn't.
This was mainly because I was running both the client and the server program on the same computer (in different terminal windows, even though the client and server were bound to different IP addresses). My guess is that this is caused by the internal loopback. This is the main reason why the result was so biased and inaccurate.
Anyway, I then ran the client on one workstation and the server on another workstation, tested them over the different network connections, and it worked as expected :-)
On the 1 Gbps connection I got about 980 Mbps (0.96 Gbps), and on the 10 Gbps connection I got about 10100 Mbps (9.86 Gbps). So this works exactly as I expected, and my approach is correct. Perfect!

Packet Delay Variation (PDV)

I am currently implementing a video streaming application where the goal is to utilize as much of the gigabit Ethernet bandwidth as possible.
The application protocol is built over TCP/IP.
The network library uses the asynchronous IOCP mechanism.
Only streaming over the LAN is needed.
There is no need for packets to go through routers.
This simplifies many things. Nevertheless, I am experiencing problems with packet delay variation.
It means that a video frame which should arrive, for example, every 20 ms (1280x720p 50 Hz video signal) sometimes arrives delayed by tens of milliseconds. More specifically:
Average frame rate is kept
Maximum video frame delay is dependent on network utilization
The more data on LAN, the higher the maximum video frame delay
For example, when bandwidth usage is 800 Mbps, PDV is about 45-50 ms.
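For reference, one simple way to quantify that variation from the frame-arrival timestamps (illustrative only, independent of the IOCP code) is the spread between the largest and smallest inter-arrival gap around the nominal 20 ms period:

    #include <algorithm>
    #include <chrono>
    #include <vector>

    // Packet/frame delay variation estimated as the spread between the largest
    // and smallest inter-arrival gap (the nominal gap is the 20 ms frame period).
    double PdvMilliseconds(const std::vector<std::chrono::steady_clock::time_point> &arrivals)
    {
        std::vector<double> gapsMs;
        for (size_t i = 1; i < arrivals.size(); ++i)
            gapsMs.push_back(std::chrono::duration<double, std::milli>(
                arrivals[i] - arrivals[i - 1]).count());
        if (gapsMs.empty())
            return 0.0;
        auto mm = std::minmax_element(gapsMs.begin(), gapsMs.end());
        return *mm.second - *mm.first;
    }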
To my questions:
What are the practical limits on lowering that value?
Do you know of any measurement reports available on the internet dealing with this?
I want to know whether there is some subtle error in my application (perhaps excessive locking) or whether there is no way to make the numbers better with current technology.
For video streaming, I would recommend using UDP instead of TCP, as it has less overhead and packet confirmation is usually not needed, since the retransmitted data would already be obsolete.