Realtime receiving of UDP packets with QNX RTOS - c++

I have a source which sends UDP packets at a rate of 819.2 Hz (~1.2ms) to my QNX Neutrino machine. I want to receive and process those messages with as little delay and jitter as possible.
My first code was basically:
SetupUDPSocket();
while (true) {
    recv(socket, buffer, BufferSize, MSG_WAITALL); // blocks until whole packet is received
    processPacket(buffer);
}
The problem is that recv() only checks for a new packet at each system timer tick. The timer tick is usually 1 ms. So with this approach I get a huge jitter, because I process a packet every 1 ms or every 2 ms. I could change the timer tick size, but that would affect the whole system (and the timers of other processes, etc.). And I would still have jitter, because the tick would certainly never exactly match the 819.2 Hz.
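For reference, changing the tick from within the process would look roughly like this with ClockPeriod() (a minimal sketch; the 100 µs value is only an illustration, and as said it affects every timer in the system):

#include <sys/neutrino.h>
#include <time.h>
#include <cerrno>
#include <iostream>

// Sketch: shrink the system tick to 100 us (system-wide side effect!)
bool setTick100us()
{
    struct _clockperiod period = { 100000 /* nsec */, 0 /* fract */ };
    if (ClockPeriod(CLOCK_REALTIME, &period, nullptr, 0) == -1) {
        std::cerr << "ClockPeriod failed, errno: " << errno << std::endl;
        return false;
    }
    return true;
}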
So I tried to use the interrupt line of the network card (IRQ 5). But it seems that other things also cause that interrupt to fire. I used the following code:
ThreadCtl(_NTO_TCTL_IO, 0);
SIGEV_INTR_INIT(&event);
iID = InterruptAttachEvent(IRQ5, &event, _NTO_INTR_FLAGS_TRK_MSK);

while (true) {
    if (InterruptWait(0, NULL) == -1) {
        std::cerr << "errno: " << errno << std::endl;
    }
    length = recv(socket, buffer, bufferSize, 0); // non-blocking this time
    LogTimeAndLength();
    InterruptUnmask(IRQ5, iID);
}
This results in a single successful read at the beginning, followed by reads of 0 bytes with no time passing in between. It seems that after the InterruptUnmask(), the InterruptWait() does not wait at all, so there must already be a new (or the same?!) interrupt pending.
Is it possible to do something like that with the interrupt line of the network card? Are there any other possibilities to receive the packets at a rate of 819.2 Hz?
Some information about the network card:
'pci -vvv' outputs:
Class = Network (Ethernet)
Vendor ID = 8086h, Intel Corporation
Device ID = 107ch, 82541PI Gigabit Ethernet Controller
PCI index = 0h
Class Codes = 020000h
Revision ID = 5h
Bus number = 4
Device number = 15
Function num = 0
Status Reg = 230h
Command Reg = 17h
I/O space access enabled
Memory space access enabled
Bus Master enabled
Special Cycle operations ignored
Memory Write and Invalidate enabled
Palette Snooping disabled
Parity Error Response disabled
Data/Address stepping disabled
SERR# driver disabled
Fast back-to-back transactions to different agents disabled
Header type = 0h Single-function
BIST = 0h Build-in-self-test not supported
Latency Timer = 40h
Cache Line Size= 8h un-cacheable
PCI Mem Address = febc0000h 32bit length 131072 enabled
PCI Mem Address = feba0000h 32bit length 131072 enabled
PCI IO Address = ec00h length 64 enabled
Subsystem Vendor ID = 8086h
Subsystem ID = 1376h
PCI Expansion ROM = feb80000h length 131072 disabled
Max Lat = 0ns
Min Gnt = 255ns
PCI Int Pin = INT A
Interrupt line = 5
CPU Interrupt = 5h
Capabilities Pointer = dch
Capability ID = 1h - Power Management
Capabilities = c822h - 28002000h
Capability ID = 7h - PCI-X
Capabilities = 2h - 400000h
Device Dependent Registers:
0x040: 0000 0000 0000 0000 0000 0000 0000 0000
...
0x0d0: 0000 0000 0000 0000 0000 0000 01e4 22c8
0x0e0: 0020 0028 0700 0200 0000 4000 0000 0000
0x0f0: 0500 8000 0000 0000 0000 0000 0000 0000
and 'nicinfo' outputs:
wm1:
INTEL 82544 Gigabit (Copper) Ethernet Controller
Physical Node ID ........................... 000E0C C5F6DD
Current Physical Node ID ................... 000E0C C5F6DD
Current Operation Rate ..................... 100.00 Mb/s full-duplex
Active Interface Type ...................... MII
Active PHY address ....................... 0
Maximum Transmittable data Unit ............ 1500
Maximum Receivable data Unit ............... 0
Hardware Interrupt ......................... 0x5
Memory Aperture ............................ 0xfebc0000 - 0xfebdffff
Promiscuous Mode ........................... Off
Multicast Support .......................... Enabled
Thanks for reading!

I am not quite sure why the statement "The problem is that recv() only checks at each timer tick of the system if there is a new packet available. The timer tick is usually 1ms." would be true for a preemptive OS. There must be something in the system configuration, or the network protocol stack implementation has some issues.
Years ago, when I was working on an IPTV STB project for Yahoo BB Japan, I ran into an issue with RTP receiving. The issue was not delay or jitter, but the overall system performance in the STB after we added some NDS algorithm. We were using VxWorks, and VxWorks supports an Ethernet hook interface, which is called each time an Ethernet packet is received by the driver.
I hooked an API into it and parsed the UDP with the specified port directly from the Ethernet packets. Of course we made the assumption that there is no fragmentation, which was guaranteed by the network setup for performance reasons. Maybe you can check whether you can get the same hook in the QNX Ethernet driver. At least you would find out whether the jitter comes from the driver or not.

How big are your UDP packets? If the packet size is small, you will gain greater efficiency by packing more data into a single packet and decreasing the transmission rate.

I suspect the interrupt service routine (ISR) is not masking the interrupt. Perhaps it is designed for edge sensitivity while the interrupt is level-sensitive.
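If that is the case, draining the socket before unmasking would at least bound the cost of the extra wakeups. A rough sketch based on the loop in the question (the MSG_DONTWAIT drain is an assumption, not something verified against this driver):

while (true) {
    if (InterruptWait(0, NULL) == -1) {
        std::cerr << "errno: " << errno << std::endl;
    }
    // Drain everything the NIC has delivered so far; a spurious or shared
    // interrupt then only costs one recv() returning -1/EWOULDBLOCK.
    while (true) {
        length = recv(socket, buffer, bufferSize, MSG_DONTWAIT);
        if (length <= 0)
            break;
        LogTimeAndLength();
    }
    InterruptUnmask(IRQ5, iID);
}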

Sorry I'm a bit late to the party, but I came across your question and saw that it was similar to a situation I encountered. Instead of hardware interrupts, you could try a software interrupt using signals. QNX has some documentation here: http://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysarch/microkernel.html#IPCSIGNALS . I was using CentOS at the time, but the theory is the same. According to http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/lib_ref/s/socket.html you can use ioctl() to set up which process (group) receives the SIGIO signal for a given file descriptor, in your case a UDP socket. When the socket has data ready for reading, a SIGIO signal is sent to the process indicated by ioctl(). Use sigaction() to tell the OS which signal handling function to use. In your case, the signal handler can read the data off the socket and store it in a buffer for processing. Use pause() to suspend the process until it handles the SIGIO signal. When the signal handler returns, the process will wake up and you can process the data in the buffer.
That should allow you to process your data as it comes in without having to deal with timers or hardware interrupts. One thing to be aware of is whether your system can process those signals as fast as the UDP traffic is coming in.
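A minimal sketch of that approach, assuming the BSD-style FIOSETOWN/FIOASYNC ioctls and MSG_DONTWAIT are available on the QNX socket stack (the buffer handling is deliberately simplistic):

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t packetReady = 0;
static int sock;                 // the already-set-up UDP socket
static char buf[2048];
static ssize_t len;

static void sigioHandler(int)
{
    // Keep the handler short: pull the datagram, let the main loop process it.
    len = recv(sock, buf, sizeof(buf), MSG_DONTWAIT);
    packetReady = 1;
}

void runSigioLoop()
{
    struct sigaction sa = {};
    sa.sa_handler = sigioHandler;
    sigaction(SIGIO, &sa, nullptr);

    int pid = getpid();
    int on = 1;
    ioctl(sock, FIOSETOWN, &pid);  // deliver SIGIO to this process
    ioctl(sock, FIOASYNC, &on);    // enable asynchronous notification

    while (true) {
        pause();                   // sleep until SIGIO arrives
        if (packetReady && len > 0) {
            packetReady = 0;
            processPacket(buf);    // from the first snippet in the question
        }
    }
}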

Related

Interrupt to trigger SPI read of external ADC (MCP3464)

I have been trying to speed up the sampling rate on an SPI-controlled ADC for an ECG device and have been banging my head against the wall.
I am using an external ADC (MCP3464 datasheet) to read ADC data over 8 channels.
Here is how I am reading the adc:
// SPI full duplex transfer
digitalWrite(adcChipSelectPin,LOW);
SPI.transfer(readConversionData);
adcReading = (SPI.transfer(0) << 8);
adcReading += SPI.transfer(0);
digitalWrite(adcChipSelectPin, HIGH);
I tried single conversion mode, but this gave slow speeds of 3200 samples per second. I believe the ESP32 can clock SPI at 80 MHz, so I'm surprised it's that slow.
Instead, I have set the MCP3464 to SCAN mode (see page 89 of datasheet) where it completes continuous conversions and you can set delays between each channel and each cycle (8 channels). That way you don't have to write an additional 2 SPI commands to change the channel and start the next conversion.
From the datasheet:
Each conversion within the SCAN cycle leads to a data ready interrupt and to an update of the ADCDATA register as soon as the current conversion is finished. In SCAN mode, each result has to be read when it is available and before it is overwritten by the next conversion result
Hence the data ready interrupt needs to be detected by the ESP32 to immediately read the latest value from the channel, before a new conversion to the next channel occurs.
I have tried waiting for IRQ to be LOW (I am multi-threading so I can afford to blocking wait) :
// wait for IRQ to trigger active LOW indicating data ready state
while(digitalRead(interruptPin));
*(packetLocation + ((ch+2) + (ts*(interface.numOfCh + 2)))) = adc.read();
*(unsigned long*)(packetLocation + (ts*(interface.numOfCh + 2))) = micros();
However this results in slow speeds of around 2640 samples per second and I am worried the synchronicity of reading the correct channel will collapse if just one active LOW is undetected.
I have also tried attaching an interrupt:
attachInterrupt(interruptPin, packageADC, FALLING);
but I am unable to link a non-static method to an IRQ.
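(I gather the usual workaround is a static trampoline that forwards to the instance through a file-scope pointer, since attachInterrupt() only takes a plain function pointer. An untested sketch, with made-up names:)

#include <Arduino.h>

class AdcReader {
public:
    void begin(uint8_t pin) {
        instance = this;                    // remember which object the ISR should poke
        pinMode(pin, INPUT);
        attachInterrupt(digitalPinToInterrupt(pin), isrTrampoline, FALLING);
    }
    void IRAM_ATTR onDataReady() {
        dataReady = true;                   // keep the ISR body minimal
    }
    volatile bool dataReady = false;

private:
    static AdcReader *instance;
    static void IRAM_ATTR isrTrampoline() { // static, so it is a plain function pointer
        if (instance) instance->onDataReady();
    }
};

AdcReader *AdcReader::instance = nullptr;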
Any ideas how to either get the interrupt to work quickly and reliably in SCAN mode or speed up the ESP32 SPI communication in Single shot mode?
Sorry, quite a lot to take in but any suggestions much appreciated!
Thanks in advance,
Will

rte_eth_tx_burst() descriptor/mbuf management guarantees vs. free thresholds

The rte_eth_tx_burst() function is documented as:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
I have a small test program where this doesn't seem to hold true (when using the ixgbe driver on a vfio X553 1GbE NIC).
So my program sets up one transmit queue like this:
uint16_t tx_ring_size = 1024-32;
rte_eth_dev_configure(port_id, 0, 1, &port_conf);
r = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &rx_ring_size, &tx_ring_size);
struct rte_eth_txconf txconf = dev_info.default_txconf;
r = rte_eth_tx_queue_setup(port_id, 0, tx_ring_size,
        rte_eth_dev_socket_id(port_id), &txconf);
The transmit mbuf packet pool is created like this:
struct rte_mempool *pkt_pool = rte_pktmbuf_pool_create("pkt_pool", 1023, 341, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
That way, when sending packets, I run out of TX descriptors before I run out of packet buffers (the program generates packets with just one segment).
My expectation is that when I call rte_eth_tx_burst() in a loop (to send one packet after another), it never fails, since it transparently frees the mbufs of already-sent packets.
However, this doesn't happen.
I basically have a transmit loop like this:
for (unsigned i = 0; i < 2048; ++i) {
    struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
    // error check, prepare packet etc.
    uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
    // error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
I use the default threshold values, i.e. the queried values are (from dev_info.default_txconf):
tx thresh : 32
tx rs thresh: 32
wthresh : 0
So the main question now is: How hard is rte_eth_tx_burst() supposed to try to free mbuf buffers (and thus descriptors)?
I mean, it could busy loop until the transmission of previously supplied mbufs is completed.
Or it could just quickly check if some descriptors are free again. But if not, just give up.
Related question: Are the default threshold values appropriate for this use case?
So I work around this like this:
for (;;) {
    uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
    if (l == 1) {
        break;
    } else {
        RTE_LOG(ERR, USER1, "cannot send packet\n");
        int r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
        if (r < 0) {
            rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
        }
        RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
    }
}
With that I get output like this:
USER1: cannot send packet
USER1: 1086. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1118. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1150. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 1182. cleaned up 0 descriptors ...
[..]
USER1: cannot send packet
USER1: 1950. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1982. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 2046. cleaned up 32 descriptors ...
Meaning that it frees at most 32 descriptors like this. And that it doesn't always succeed, but then the next rte_eth_tx_burst() succeeds freeing some.
Side question: Is there a better more dpdk-idiomatic way to handle the recycling of mbufs?
When I change the code such that I run out of mbufs before I run out of transmit descriptors (i.e. tx ring created with 1024 descriptors, mbuf pool still has 1023 elements), I have to change the alloc part like this:
struct rte_mbuf *pkt;
do {
    pkt = rte_pktmbuf_alloc(args.pkt_pool);
    if (!pkt) {
        r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
        if (r < 0) {
            rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
        }
        RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
    }
} while (!pkt);
The output is similar, e.g.:
USER1: 1023. cleaned up 95 descriptors ...
USER1: 1118. cleaned up 32 descriptors ...
USER1: 1150. cleaned up 32 descriptors ...
USER1: 1182. cleaned up 32 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 32 descriptors ...
[..]
That means the freeing of descriptors/mbufs is so 'slow' that it has to busy loop up to 3 times.
Again, is this a valid approach, or are there better dpdk ways to solve this?
Since rte_eth_tx_done_cleanup() might return -ENOTSUP, this may be a hint that my usage of it is not the best solution.
Incidentally, even with the ixgbe driver it fails for me when I disable checksum offloads!
Apparently, ixgbe_dev_tx_done_cleanup() then invokes ixgbe_tx_done_cleanup_vec() instead of ixgbe_tx_done_cleanup_full() which unconditionally returns -ENOTSUP:
static int
ixgbe_tx_done_cleanup_vec(struct ixgbe_tx_queue *txq __rte_unused,
                          uint32_t free_cnt __rte_unused)
{
    return -ENOTSUP;
}
Does this make sense?
So then perhaps the better strategy is then to make sure that there are less descriptors than pool elements (e.g. 1024-32 < 1023) and just re-call rte_eth_tx_burst() until it returns one?
That means like this:
for (;;) {
    uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
    if (l == 1) {
        break;
    } else {
        RTE_LOG(ERR, USER1, "%u. cannot send packet - retry\n", i);
    }
}
This works, and the output shows again that the descriptors are freed 32 at a time, e.g.:
USER1: 1951. cannot send packet - retry
USER1: 1951. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2047. cannot send packet - retry
USER1: 2047. cannot send packet - retry
I know that I also can use rte_eth_tx_burst() to submit bigger bursts. But I want to get the simple/edge cases right and understand the dpdk semantics, first.
I'm on Fedora 33 and DPDK 20.11.2.
Recommendation/solution: after confirming with either rte_mempool_list_dump or dpdk-procinfo that the cause of the issue is indeed the TX descriptors, use rte_eth_tx_buffer_flush or change the settings for the TX thresholds.
Explanation:
The mbuf-free behaviour varies across PMDs, and it also varies between PF and VF of the same NIC. The following points help to understand this properly:
An rte_mempool can be created with or without cache elements.
When created with cache elements, depending on the available lcores (EAL options) and the number of cache elements per core, the configured mbufs are added to the per-core cache.
When the HW offload DEV_TX_OFFLOAD_MBUF_FAST_FREE is available and enabled, the agreement is that the mbuf will have a ref_cnt of 1.
So whenever tx_burst is invoked (success or failure), the threshold levels are checked to see whether free mbufs/mbuf segments can be pushed back to the pool.
With DEV_TX_OFFLOAD_MBUF_FAST_FREE enabled, the driver blindly puts the elements into the lcore cache.
Without DEV_TX_OFFLOAD_MBUF_FAST_FREE, the generic approach of validating the mbuf (checking nb_segments and ref_cnt) is taken before it is pushed to the mempool.
But in either case a fixed number (32, I believe, is the default for all PMDs) or all available free mbufs are pushed to the cache or pool.
Facts:
In the case of the IXGBE VF driver, the option DEV_TX_OFFLOAD_MBUF_FAST_FREE is not available. This means that each time the thresholds are met, each individual mbuf is checked and pushed to the mempool.
As per the code snippet, rte_eth_dev_configure is configured only for TX, and rte_pktmbuf_pool_create is called with 341 cache elements.
The assumption has to be made that there is only 1 lcore (which runs the loop of alloc and tx).
Code Snippet-1:
for (unsigned i = 0; i < 2048; ++i) {
    struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
    // error check, prepare packet etc.
    uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
    // error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
[Observation] If the mbufs were indeed running out, rte_pktmbuf_alloc should fail before rte_eth_tx_burst. But failing at 1086 creates an interesting phenomenon, because the total number of mbufs created is 1023, and the failure happens after 2 iterations of releasing 32 mbufs back to the mempool. Analyzing the ixgbe driver code, the only place that returns 0 in tx_xmit_pkts is
/* Only use descriptors that are available */
nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
if (unlikely(nb_pkts == 0))
    return 0;
Even though tx_ring_size is set to 992 in the config, internally rte_eth_dev_adjust_nb_rx_tx_desc sets it to the maximum of *nb_desc and desc_lim->nb_min. Based on the code, the failure is not because there are no free mbufs, but because the number of free TX descriptors is low or zero.
In all other cases, rte_eth_tx_done_cleanup or rte_eth_tx_buffer_flush actually pushes any pending descriptors out of the SW PMD to be DMAed immediately. This internally frees up more descriptors, which makes the tx_burst much smoother.
To identify the root cause, whenever the DPDK API tx_burst returns 0, either
invoke rte_mempool_list_dump, or
make use of the mempool dump via dpdk-procinfo.
Note: most PMDs amortize the cost of the descriptor (PCIe payload) writes by batching at least 4 of them (in the case of SSE). Hence even a single packet for which DPDK tx_burst returns 1 will not necessarily be pushed out of the NIC. Hence, to make sure the packet actually leaves, use rte_eth_tx_buffer_flush.
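For completeness, a minimal sketch of the rte_eth_tx_buffer / rte_eth_tx_buffer_flush pattern mentioned above (queue 0, BURST_SIZE, port_id and pkt are placeholders; this follows the pattern used by the l2fwd example):

#include <rte_ethdev.h>
#include <rte_malloc.h>

#define BURST_SIZE 32

struct rte_eth_dev_tx_buffer *tx_buffer =
    (struct rte_eth_dev_tx_buffer *)rte_zmalloc_socket("tx_buffer",
        RTE_ETH_TX_BUFFER_SIZE(BURST_SIZE), 0, rte_eth_dev_socket_id(port_id));
rte_eth_tx_buffer_init(tx_buffer, BURST_SIZE);

/* queue packets; a burst is transmitted automatically once BURST_SIZE accumulate */
uint16_t sent = rte_eth_tx_buffer(port_id, 0, tx_buffer, pkt);

/* at the end of the loop iteration, push out anything still buffered */
sent += rte_eth_tx_buffer_flush(port_id, 0, tx_buffer);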
Say, you invoke rte_eth_tx_burst() to send one small packet (single mbuf, no offloads). Suppose, the driver indeed pushes the packet to the HW. Doing so eats up one descriptor in the ring: the driver "remembers" that this packet mbuf is associated with that descriptor. But the packet is not sent instantly. The HW typically has some means to notify the driver of completions. Just imagine: if the driver checked for completions on every rte_eth_tx_burst() invocation (thus ignoring any thresholds), then calling rte_eth_tx_burst() one more time in a tight loop manner for another packet would likely consume one more descriptor rather than recycle the first one. So, given this fact, I'd not use tight loop when investigating tx_free_thresh semantics. And it shouldn't matter whether you invoke rte_eth_tx_burst() once per a packet or once per a batch of them.
Now. Say, you have a Tx ring of size N. Suppose, tx_free_thresh is M. And you have a mempool of size Z. What you do is allocate a burst of N - M - 1 small packets and invoke rte_eth_tx_burst() to send this burst (no offloads; each packet is assumed to eat up one Tx descriptor). Then you wait for some wittingly sufficient (for completions) amount of time and check the number of free objects in the mempool. This figure should read Z - (N - M - 1). Then you allocate and send one extra packet. Then wait again. This time, the number of spare objects in the mempool should read Z - (N - M). Finally, you allocate and send one more packet (again!) thus crossing the threshold (the number of spare Tx descriptors becomes less than M). During this invocation of rte_eth_tx_burst(), the driver should detect crossing the threshold and start checking for completions. This should make the driver free (N - M) descriptors (consumed by two previous rte_eth_tx_burst() invocations) thus clearing up the whole ring. Then the driver proceeds to push the new packet in question to the HW thus spending one descriptor. You then check the mempool: this should report Z - 1 free objects.
So, the short of it: no loop, just three rte_eth_tx_burst() invocations with sufficient waiting time between them. And you check the spare object count in the mempool after each send operation. Theoretically, this way, you'll be able to understand the corner case semantics. That's the gist of it. However, please keep in mind that the actual behaviour may vary across different vendors / PMDs.
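A rough sketch of that three-step experiment (N, M, Z, the 100 ms wait, and the packet preparation are placeholders for whatever the actual setup uses):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_cycles.h>

/* send 'count' one-segment packets, then report the spare mempool objects */
static void send_and_check(uint16_t port_id, struct rte_mempool *pool, unsigned count)
{
    for (unsigned i = 0; i < count; ++i) {
        struct rte_mbuf *pkt = rte_pktmbuf_alloc(pool);
        /* ... fill in a small test frame here ... */
        rte_eth_tx_burst(port_id, 0, &pkt, 1);
    }
    rte_delay_ms(100);  /* assumed long enough for TX completions */
    printf("spare mempool objects: %u\n", rte_mempool_avail_count(pool));
}

/* Usage, with ring size N, threshold M, pool size Z:
 *   send_and_check(port_id, pool, N - M - 1);  // expect Z - (N - M - 1)
 *   send_and_check(port_id, pool, 1);          // expect Z - (N - M)
 *   send_and_check(port_id, pool, 1);          // crosses the threshold: expect Z - 1
 */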
Relying on rte_eth_tx_done_cleanup() really isn't an option since many PMDs don't implement it. Mostly the Intel PMDs provide it, but e.g. the SFC, MLX*, and af_packet ones don't.
However, it's still unclear why the ixgbe PMD doesn't support cleanup when no offloads are enabled.
The requirements on rte_eth_tx_burst() with respect to freeing are really light - from the API docs:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
[..]
* @return
* The number of output packets actually stored in transmit descriptors of
* the transmit ring. The return value can be less than the value of the
* *tx_pkts* parameter when the transmit ring is full or has been filled up.
So just attempting to free (but not waiting on the results of that attempt) and returning 0 (since 0 is less than tx_pkts) is covered by that 'contract'.
FWIW, no example distributed with DPDK loops around rte_eth_tx_burst() to re-submit not-yet-sent packets. There are some examples that use rte_eth_tx_burst() and discard unsent packets, though.
AFAICS, besides rte_eth_tx_done_cleanup() and rte_eth_tx_burst() there is no other function for requesting the release of mbufs previously submitted for transmission.
Thus, it's advisable to size the mbuf packet pool larger than the configured ring size, in order to survive situations where all mbufs are in flight and can't be recovered because there is no mbuf left for calling rte_eth_tx_burst() again.
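For instance, something along these lines (the 2047-element count is only an example, chosen to be larger than the ring and to follow the 2^n - 1 mempool sizing hint):

/* keep the pool strictly larger than the TX ring so rte_eth_tx_burst() can
 * always be called with a fresh mbuf, even when the whole ring is occupied */
uint16_t tx_ring_size = 1024 - 32;                       /* 992 descriptors */
struct rte_mempool *pkt_pool = rte_pktmbuf_pool_create("pkt_pool",
        2047 /* > ring size */, 341, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());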

STM32F767ZI External Interrupt Handling

I'm attempting to create a proper SPI slave interface for an AD7768-4 ADC. The ADC has a SPI interface, but it doesn't output the conversions via SPI. Instead, there are data outputs that are clocked out on individual GPIO pins. So I basically need to bit-bang data, and output to SPI to get a proper slave SPI interface. Please don't ask why I'm doing it this way, it was assigned to me.
The issue I'm having is with the interrupts. I'm using the STM32F767ZI processor - it runs at 216 MHz, and my ADC data MUST BE clocked out at 20MHz. I've set up my NMIs but what I'm not seeing is where the system calls or points to the interrupt handler.
I used the STMCubeMX software to assign pins and generate the setup code, and in the stm32F7xx.c file, it shows the NMI_Handler() function, but I don't see a pointer to it anywhere in the system files. I also found void HAL_GPIO_EXTI_IRQHandler() function in STM32F7xx_hal_gpio.c, which appears to check if the pin is asserted, and clears any pending bits, but it doesn't reset the interrupt flag, or check it, and again, I see no pointer to this function.
To more thoroughly complicate things, I have 10 clock cycles to determine which flag is set (one of two at a time), reset it, increment a variable, and move data from the GPIO registers. I believe this is possible, but again, I'm uncertain of what the system is doing as soon as the interrupt is tripped.
Does anyone have any experience in working with external interrupts on this processor that could shed some light on how this particular system handles things? Again - 10 clock cycles to do what I need to... moving data should only take me 1-2 clock cycles, leaving me 8 to handle interrupts...
EDIT:
We changed the DCLK speed to 5.12 MHz (20.48 MHz MCLK/4) because at 2.56 MHz we had exactly 12.5 microseconds to pipe data out and set up for the next DRDY pulse, and 80 kHz speed gives us exactly zero margin. At 5.12 MHz, I have 41 clock cycles to run the interrupt routine, which I can reduce slightly if I skip checking the second flag and just handle incoming data. But I feel I must use the DRDY flag check at least, and use the routine to enable the second interrupt otherwise I'll be constantly interrupting because DCLK on the ADC is always running. This allows me 6.12 microseconds to read in the data, and 6.25 microseconds to shuffle it out before the next DRDY pulse. I should be able to do that at 32 MHz SPI clock (slave) but will most likely do it at 50MHz. This is my current interrupt code:
void NMI_Handler(void)
{
    if(__HAL_GPIO_EXTI_GET_IT(GPIO_PIN_0) != RESET)
    {
        count = 0;
        __HAL_GPIO_EXTI_CLEAR_IT(GPIO_PIN_0);
        HAL_GPIO_EXTI_Callback(GPIO_PIN_0);
        // __HAL_GPIO_EXTI_CLEAR_FLAG(GPIO_PIN_0);
        HAL_NVIC_EnableIRQ(GPIO_PIN_1);
    }
    else
    {
        if(__HAL_GPIO_EXTI_GET_IT(GPIO_PIN_1) != RESET)
        {
            data_pad[count] = GPIOF->IDR;
            count++;
            if (count == 31)
            {
                data_send = !data_send;
                HAL_NVIC_DisableIRQ(GPIO_PIN_1);
            }
            __HAL_GPIO_EXTI_CLEAR_IT(GPIO_PIN_1);
            HAL_GPIO_EXTI_Callback(GPIO_PIN_1);
            // __HAL_GPIO_EXTI_CLEAR_FLAG(GPIO_PIN_0);
        }
    }
}
I am still concerned about clock cycles, and I believe I can get away with only checking the DRDY flag if I operate on the presumption that the only other EXTI flag that will trip is for the clock pin. Although I question how this will work if SYS_TICK is running in the background... I'll have to find out.
We're investigating a faster processor to handle the bit-banging, but right now, it looks like the PI3 won't be able to handle it if it's running Linux, and I'm unaware of too many faster processors that run either a very small reliable RTOS, or can be bare metal programmed in a pinch...
10 clock cycles to do what I need to... moving data should only take me 1-2 clock cycles, leaving me 8 to handle interrupts...
No way. Interrupt entry (pushing registers, fetching the vector and filling the pipeline) takes 10-12 cycles even on a Cortex-M7. Then consider a very simple interrupt handler, just moving the input data bits to a buffer and clearing the interrupt flag:
uint32_t *p;

void handler(void) {
    *p++ = GPIOA->IDR;
    EXTI->PR = 0x10;
}
it gets translated to something like this:
handler:
    ldr  r0, .addr_of_idr   // load &GPIOA->IDR
    ldr  r1, [r0]           // load GPIOA->IDR
    ldr  r2, .addr_of_p     // load &p
    ldr  r3, [r2]           // load p
    str  r1, [r3]           // store the value from IDR to *p
    adds r3, r3, #4         // increment p
    str  r3, [r2]           // store p
    ldr  r0, .addr_of_pr    // load &EXTI->PR
    movs r1, #0x10
    str  r1, [r0]           // store 0x10 to EXTI->PR
    bx   lr
.addr_of_p:
    .word p
.addr_of_idr:
    .word 0x40020010
.addr_of_pr:
    .word 0x40013C14
So it's 11 instructions, each taking at least one cycle, after interrupt entry. That's assuming the code, vector table, and the stack are all in the fastest RAM region. I'm not sure whether literal pools work in ITCM at all, using immediate literals would add 3 more cycles. Forget it.
This has to be solved with hardware.
The controller has 6 SPI interfaces, pick 4 of them. Connect DRDY to all four NSS pins, DCLK to all SCK pins, and each DOUT pin to one MISO pin. Now each SPI interface handles a single channel, and can collect up to 32 bits in its internal FIFO.
Then I'd set an interrupt on a rising edge on one of the NSS pins (EXTI still works even if the pin is in alternate function mode), and read all data at once.
EDIT
It turns out that the STM32 SPI requires an inordinate amount of delay between NSS falling and SCK rising, which the AD7768 does not provide, so it will not work.
Sigma-Delta interface
The STM32F767 has a DFSDM peripheral, designed to receive data from external ADCs. It can receive up to 8 channels of serial data with 20 MHz, and it can even do some preprocessing that your application might need.
The problem is that the DFSDM has no DRDY input, so I don't know exactly how the data transfer could be synchronized. It might work by asserting the START# signal to reset the communication.
If that doesn't work, then you can try starting the DFSDM channels using a timer and DMA. Connect DRDY to the external trigger of TIM1 or TIM8 (other timers won't work, because they are connected to the slower APB1 bus and the other DMA controller), start it on the rising edge of ETR, and let it generate a DMA request after ~20 ns. Then let the DMA write the value needed to start the channel to the DFSDM channel configuration register. Repeat for the other three channels.
There's a startup file generated before compile: startup_stm32f767xx.s - which contains all the pointers to functions.
Under the marker g_pfnVectors: is .word NMI_Handler, pointing to a function for handling the non-maskable interrupts, and two other pointers, .word EXTI0_IRQHandler and .word EXTI1_IRQHandler, as vectors to the external interrupt handlers. Further down in the same file are the following assembler directives:
.weak NMI_Handler
.thumb_set NMI_Handler,Default_Handler
.weak EXTI0_IRQHandler
.thumb_set EXTI0_IRQHandler,Default_Handler
.weak EXTI1_IRQHandler
.thumb_set EXTI1_IRQHandler,Default_Handler
This was the info I was looking for to be able to control my interrupts with more precision and fewer clock cycles.
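For example, a handler defined in my own code with the same name simply overrides the weak default from the startup file. A bare-register sketch for EXTI line 1 (skipping the HAL wrapper to save cycles; assumes the line has already been configured and unmasked):

#include "stm32f7xx.h"

// Overrides the weak EXTI1_IRQHandler alias from startup_stm32f767xx.s
void EXTI1_IRQHandler(void)
{
    if (EXTI->PR & EXTI_PR_PR1) {     // interrupt pending on line 1?
        EXTI->PR = EXTI_PR_PR1;       // clear it by writing 1
        // grab the data bits as early as possible, e.g.:
        // data_pad[count++] = GPIOF->IDR;
    }
}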
I read the AD7768 DS more carefully and found that it can send four channels of data to one DOUT pin. So I am talking again about the serial audio interface (SAI).
If you can lower the DCLK frequency to 2.5 MHz, then you lower the sample rate by a ratio of 1:8 (the ratio of 2.5 MHz to 20 MHz) with respect to the sample rate at full ADC clock.
If you route all 4 channels to one output, DOUT0, you slow down the sample rate by only a ratio of 1:4.
AD7768-4 DS
page 53
On the AD7768, the interface can be configured to output conversion data on one, two, or eight of the DOUTx pins. The DOUTx configuration for the AD7768 is selected using the FORMATx pins (see Table 33).
page 66 table 34: (for AD7768-4)
page 67 figure 98:
FORMAT0 = 1 All channels output on the DOUT0 pin, in TDM output. Only DOUT0 is in use.
You can use SAI with FS = DRDY and four slots, 32 bits/slot

No DPDK packet fragmentation supported in Mellanox ConnectX-3?

Hello Stackoverflow Experts,
I am using DPDK on a Mellanox NIC, but am struggling with applying packet fragmentation in a DPDK application.
sungho#c3n24:~$ lspci | grep Mellanox
81:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
The DPDK applications (l3fwd, ip-fragmentation, ip-assemble) did not recognize the received packets as having an IPv4 header.
At first I crafted my own packets with IPv4 headers, so I assumed that I was crafting the packets in a wrong way.
So I used DPDK-pktgen, but the DPDK applications (l3fwd, ip-fragmentation, ip-assemble) still did not recognize the IPv4 header.
As a last resort, I tested dpdk-testpmd, and found this in the status info.
********************* Infos for port 1 *********************
MAC address: E4:1D:2D:D9:CB:81
Driver name: net_mlx4
Connect to socket: 1
memory allocation on the socket: 1
Link status: up
Link speed: 10000 Mbps
Link duplex: full-duplex
MTU: 1500
Promiscuous mode: enabled
Allmulticast mode: disabled
Maximum number of MAC addresses: 127
Maximum number of MAC addresses of hash filtering: 0
VLAN offload:
strip on
filter on
qinq(extend) off
No flow type is supported.
Max possible RX queues: 65408
Max possible number of RXDs per queue: 65535
Min possible number of RXDs per queue: 0
RXDs number alignment: 1
Max possible TX queues: 65408
Max possible number of TXDs per queue: 65535
Min possible number of TXDs per queue: 0
TXDs number alignment: 1
testpmd> show port
According to the DPDK documentation, the supported flow types should show up in the info status of port 1, but mine shows that no flow type is supported.
The example below is what should be displayed under flow types:
Supported flow types:
ipv4-frag
ipv4-tcp
ipv4-udp
ipv4-sctp
ipv4-other
ipv6-frag
ipv6-tcp
ipv6-udp
ipv6-sctp
ipv6-other
l2_payload
port
vxlan
geneve
nvgre
So, does my NIC, the Mellanox ConnectX-3, not support DPDK IP fragmentation? Or is there additional configuration that needs to be done before trying out packet fragmentation?
-- [EDIT]
So I have checked the packets from DPDK-pktgen and the packets received by the DPDK application.
The packets that I receive are exactly the ones that I have sent from the application (I get the correct data).
The problem begins at the code
struct rte_mbuf *pkt
RTE_ETH_IS_IPV4_HDR(pkt->packet_type)
which determines whether the packet is IPv4 or not.
The value of pkt->packet_type is zero both from DPDK-pktgen and from the DPDK application, and if pkt->packet_type is zero then the DPDK application treats this packet as NOT having an IPv4 header.
So this basic type check already fails from the start.
The data I receive has some pattern: at the beginning I receive the correct message, but after that the sequence of packets has different data between the MAC address and the data offset.
So what I assume is that they are interpreting the data differently and getting the wrong result.
I am pretty sure any NIC, including the Mellanox ConnectX-3, MUST support IP fragments.
The flow type you are referring to is for the Flow Director, i.e. mapping specific flows to specific RX queues. Even if your NIC does not support the Flow Director, it does not matter for IP fragmentation.
I guess there is an error in the setup or in the app. You wrote:
the dpdk application did not recognized the received packet as the ipv4 header.
I would look into this more closely. Try to dump those packets with dpdk-pdump, or even by simply dumping the received packets on the console with rte_pktmbuf_dump().
If you still suspect the NIC, the best option would be to temporarily substitute it with another brand or a virtual device, just to confirm it is indeed the NIC.
EDIT:
Have a look at mlx4_ptype_table: for fragmented IPv4 packets it should return packet_type set to RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4_EXT_UNKNOWN | RTE_PTYPE_L4_FRAG
Please note the functionality was added in DPDK 17.11.
I suggest you dump pkt->packet_type on the console to make sure it is indeed zero. Also make sure you have the latest libmlx4 installed.
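A minimal sketch of such a check inside the receive path (queue 0 and the burst size of 32 are placeholders):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* dump packet_type and the first bytes of each received packet for inspection */
static void debug_rx_burst(uint16_t port_id)
{
    struct rte_mbuf *pkts[32];
    uint16_t n = rte_eth_rx_burst(port_id, 0, pkts, 32);

    for (uint16_t i = 0; i < n; ++i) {
        printf("packet_type = 0x%08x\n", (unsigned)pkts[i]->packet_type);
        rte_pktmbuf_dump(stdout, pkts[i], 64);   /* the headers are enough */
        rte_pktmbuf_free(pkts[i]);
    }
}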

100 Hz Data From Serial

I have a sensor which uses RS422 to spit out messages over serial. (I think that's the right terminology.) Anyways, I made my wiring harness, hooked it up to my RS422-to-USB converter, and tada, I got data in HyperTerminal. Good stuff.
Now the sensor has an odd baud rate, 1500 kbps. I am doing this in Windows, so it actually wasn't that hard to set that baud rate. Initially, at power on, the sensor sends out a 69-byte message at 10 Hz. I see this message, the correct bytes are read, and the message is very accurate (it includes a timestamp which, wait for it, increases by 0.1 s every message!). MOST IMPORTANTLY, I get the message on its boundary; in other words, every read was a new message.
Anyways, things are going good so far, so I took the next step: I sent a write command over the serial port to activate a sensor data message. This message is 76 bytes large and is sent out at 100 Hz. Success again, more data begins appearing in reads. However, I am not getting it at 100 Hz; I get blocks of 3968 bytes. If I lower my buffer, I get three very very very quick reads of 1024, then immediately a read of 896 (3968 bytes again). (Note that I am now receiving two messages, one at 10 Hz with size 69 and one at 100 Hz with size 76, and that no combination of the two messages evenly divides 3968.)
My question is: somewhere, something is buffering my 100 Hz messages, and I am not getting them as they're being received. I would like to change that, but I do not know what I'm looking for. I don't need that 100 Hz message on its boundary, I just don't want it at 2 Hz. I would be happy with 30 Hz or even 20 Hz.
Below I include my Serial Port Set up code:
Port Open
serial_port_ = CreateFile(L"COM6", GENERIC_READ | GENERIC_WRITE, 0, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
CommState and Timeouts
COMMTIMEOUTS comm_timeouts;
ZeroMemory(&comm_timeouts, sizeof(COMMTIMEOUTS));
//comm_timeouts.ReadIntervalTimeout = MAXDWORD; //Instant Read, still get 3968 chunks
comm_timeouts.ReadIntervalTimeout = 1; //1ms timeout
comm_timeouts.ReadTotalTimeoutConstant = 1000; //Derp?
comm_timeouts.WriteTotalTimeoutConstant = 5000; //Derp.
SetCommTimeouts(serial_port_, &comm_timeouts);
DCB dcb_configuration;
ZeroMemory(&dcb_configuration, sizeof(DCB));
dcb_configuration.DCBlength = sizeof(dcb_configuration);
dcb_configuration.BaudRate = 1500000;
dcb_configuration.ByteSize = 8;
dcb_configuration.StopBits = ONESTOPBIT;
dcb_configuration.Parity = ODDPARITY;
if(!SetCommState(serial_port_, &dcb_configuration))
My Read
if(!ReadFile(serial_port_, read_buffer_, 1024, &bytes_read, NULL))
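(For reference, the SetCommTimeouts documentation describes a MAXDWORD combination that makes ReadFile return immediately with whatever is already buffered, instead of waiting on the 1 s constant timeout or a full buffer. A variant of the snippet above that I could try, reusing serial_port_:)

COMMTIMEOUTS comm_timeouts;
ZeroMemory(&comm_timeouts, sizeof(COMMTIMEOUTS));
// "Return immediately" combination: ReadFile comes back at once with the
// bytes already buffered by the driver (possibly zero), so the caller polls
// instead of blocking until the buffer fills or the constant timeout expires.
comm_timeouts.ReadIntervalTimeout        = MAXDWORD;
comm_timeouts.ReadTotalTimeoutMultiplier = 0;
comm_timeouts.ReadTotalTimeoutConstant   = 0;
comm_timeouts.WriteTotalTimeoutConstant  = 5000;
SetCommTimeouts(serial_port_, &comm_timeouts);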
I would suspect your serial-to-USB converter of doing the buffering. Since USB is packet based, it needs to do some buffering. At a 10 Hz rate there are probably big enough delays to flush the buffer after every message. But at 100 Hz the messages come so fast that it flushes the buffer by some other logic.
Does that make sense?