Improving SPI transfer speed - c++

I am trying to send a constant, large stream of bytes over SPI from a Linux embedded system -- an AM335x (BeagleBone, PocketBeagle version). The thing is, I am trying to increase its transfer rate.
Currently, I am accessing SPI through user-space with the following configuration on spidev (ioctl calls):
// SPI init -- (const char *device, int mode, int bits, int speed)
retv = spi_bus.spi_init(LCD_SPI_DEVICE,0,8,100000000);
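For context, here is a minimal sketch of what an init wrapper like the one above typically boils down to with spidev (the wrapper's internals are an assumption on my part, not the actual library code; error handling trimmed):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/spi/spidev.h>
#include <cstdint>

int spi_open(const char *device, uint8_t mode, uint8_t bits, uint32_t speed_hz)
{
    int fd = open(device, O_RDWR);
    if (fd < 0) return -1;
    ioctl(fd, SPI_IOC_WR_MODE, &mode);              // clock polarity/phase
    ioctl(fd, SPI_IOC_WR_BITS_PER_WORD, &bits);     // word length
    ioctl(fd, SPI_IOC_WR_MAX_SPEED_HZ, &speed_hz);  // requested clock; the controller
                                                    // clamps it to what it supports
    return fd;
}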
It's slow! I haven't measured the MB/s, but even with the clock requested at 100 MHz it isn't transferring the amount of data I expect.
I read somewhere that DMA is used automatically by the McSPI driver. However, I am not sure whether that applies to user-space access such as spidev.
My question is: how can I increase the SPI transfer rate (MB/s)?
Things I have thought about so far:
1) Look for a kernel-space driver instead of using spidev.
2) Increase the word length.
I'm not sure, though. What would you recommend to significantly increase the SPI transfer rate?

Related

Is HAL_UARTEx_RxEventCallback Size parameter calculated programmatically or by hardware

I'm implementing UART DMA with the STM HAL library and I want to know whether the message size is counted by hardware (for example, by counting clock ticks until the line is idle) or by some software method (something like strlen). If Size in
HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size)
is counted by hardware, I can send data in pure HEX format, but if it is calculated by something like strlen, I may run into problems when the data contains 0x00 and would have to send the data in ASCII.
I've tried to do some research in the generated code in Keil but failed (maybe I didn't try hard enough), so maybe somebody can help me.
If you are using UART DMA, it is calculated by hardware.
If you check the call hierarchy of HAL_UARTEx_RxEventCallback in your IDE, you can see how the Size variable is calculated.
The function is executed in the following flow (it may differ slightly depending on the version of the HAL driver):
1) A UART idle interrupt occurs.
2) HAL_UART_IRQHandler() is called.
3) If DMA mode is enabled, HAL_UARTEx_RxEventCallback(huart, (huart->RxXferSize - huart->RxXferCount)) is called.
Therefore, the Size parameter is calculated as (huart->RxXferSize - huart->RxXferCount).
huart->RxXferSize is the value set when the RX DMA transfer is initialized.
huart->RxXferCount is (huart->hdmarx)->Instance->NDTR.
NDTR is a value maintained by hardware: the amount of the buffer still remaining after the DMA has transferred data to memory!
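For illustration, a minimal usage sketch, assuming a recent HAL that provides HAL_UARTEx_ReceiveToIdle_DMA (huart2, rx_buf and handle_message are hypothetical names):

uint8_t rx_buf[128];

void start_reception(void)
{
    /* Arm reception: the DMA fills rx_buf and the idle-line event fires the callback. */
    HAL_UARTEx_ReceiveToIdle_DMA(&huart2, rx_buf, sizeof(rx_buf));
}

void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size)
{
    /* Size == huart->RxXferSize - huart->RxXferCount, i.e. the number of bytes the
       DMA actually wrote, so 0x00 bytes in the payload are handled fine. */
    handle_message(rx_buf, Size);                                 /* hypothetical handler */
    HAL_UARTEx_ReceiveToIdle_DMA(huart, rx_buf, sizeof(rx_buf));  /* re-arm reception */
}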

Interrupt to trigger SPI read of external ADC (MCP3464)

I have been trying to increase the sampling rate of an SPI-controlled ADC for an ECG device and have been banging my head against the wall.
I am using an external ADC (MCP3464 datasheet) to read ADC data over 8 channels.
Here is how I am reading the ADC:
// SPI full duplex transfer
digitalWrite(adcChipSelectPin,LOW);
SPI.transfer(readConversionData);
adcReading = (SPI.transfer(0) << 8);
adcReading += SPI.transfer(0);
digitalWrite(adcChipSelectPin, HIGH);
I tried single-conversion mode, but this only achieved a slow 3200 samples per second. I believe the ESP32 can run SPI at 80 MHz, so I'm surprised it's that slow.
Instead, I have set the MCP3464 to SCAN mode (see page 89 of the datasheet), where it performs continuous conversions and you can set delays between each channel and each cycle (8 channels). That way you don't have to send an additional two SPI commands to change the channel and start the next conversion.
From the datasheet:
Each conversion within the SCAN cycle leads to a data ready interrupt and to an update of the ADCDATA register as soon as the current conversion is finished. In SCAN mode, each result has to be read when it is available and before it is overwritten by the next conversion result
Hence the data ready interrupt needs to be detected by the ESP32 to immediately read the latest value from the channel, before a new conversion to the next channel occurs.
I have tried waiting for IRQ to be LOW (I am multi-threading so I can afford to blocking wait) :
// wait for IRQ to trigger active LOW indicating data ready state
while(digitalRead(interruptPin));
*(packetLocation + ((ch+2) + (ts*(interface.numOfCh + 2)))) = adc.read();
*(unsigned long*)(packetLocation + (ts*(interface.numOfCh + 2))) = micros();
However, this results in slow speeds of around 2640 samples per second, and I am worried that the synchronisation of reading the correct channel will collapse if just one active LOW goes undetected.
I have also tried attaching an interrupt:
attachInterrupt(interruptPin, packageADC, FALLING);
but I am unable to link a non-static method to an IRQ.
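(For reference, attachInterrupt() wants a free function or a static member, so the usual workaround is a static trampoline that forwards to an instance pointer. A minimal sketch with hypothetical names, not my actual class:)

class Adc {
public:
    void begin(int irqPin) {
        instance_ = this;
        pinMode(irqPin, INPUT);
        attachInterrupt(digitalPinToInterrupt(irqPin), Adc::isrTrampoline, FALLING);
    }
    volatile bool dataReady = false;
private:
    static void IRAM_ATTR isrTrampoline() {         // ESP32 ISRs should live in IRAM
        if (instance_) instance_->dataReady = true; // keep the ISR as short as possible
    }
    static Adc* instance_;
};
Adc* Adc::instance_ = nullptr;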
Any ideas how to either get the interrupt to work quickly and reliably in SCAN mode or speed up the ESP32 SPI communication in Single shot mode?
Sorry, quite a lot to take in but any suggestions much appreciated!
Thanks in advance,
Will

Speed up data logging code

I have a device that outputs 64 bits of binary data at a rate of 1 kHz. I am reading the device over USB via a 3rd-party DLL, converting the binary data into a float, timestamping it, and writing to file.
I have the following setup at the moment:
int main(int argc, char* argv[])
{
    unsigned char Message_Rx[64];
    USHORT Bytes_Read = 0;
    std::ofstream out(argv[1]);
    do
    {
        Result = Comms.USBRead(&Message_Rx[0], &Bytes_Read);
        unsigned long now = getTickCount(start);
        if (Result != 0)
        {
            uint16_t msb = (Message_Rx[11] & 0xff) << 8;
            uint16_t lsb = (Message_Rx[12] & 0xff);
            uint16_t rate = msb | lsb;
            char outstring[1024];
            sprintf(outstring, "%lu\t%.7f", now, (float)rate * 0.03125);
            out << outstring << "\n";
        }
    } while (!kbhit());
    out.close();
}
This produces perfectly good results on my desktop. There doesn't appear to be any data missing and the timestamps are continuous and 1ms apart.
143379582 -0.5937500
143379583 -1.5312500
143379584 -1.6250000
143379585 -1.4062500
143379586 -1.1875000
143379587 -1.3437500
143379588 -1.3125000
143379589 -1.3125000
143379590 -1.1562500
But when I run this on the old laptop that I need to use I get timestamps that appear in blocks and it looks like there must be some data missing:
143379582 -0.5937500
143379582 -1.5312500
143379582 -1.6250000
143379582 -1.4062500
143379582 -1.1875000
143379593 -1.3437500
143379593 -1.3125000
143379593 -1.3125000
143379593 -1.1562500
Is there a way to achieve a speedup of my code so that I won't lose data?
To say this loud and clear: for any PC that is not an Intel 486SX, 64 kb/s is a laughably low rate. Getting a few Mb/s over USB is very doable with small, dollar-apiece microcontrollers without any optimization.
Whatever is going wrong needs investigation much more than your code does.
I don't know the Comms library, but that's where I'd look for the place where time is spent.
Other than that, your printing should take orders of magnitude more time than your processing, but it still shouldn't be a problem. As mentioned, 1 kS/s * 64 bits per sample is nothing for modern (read: last twenty years) PC hardware.
I recommend storing the raw data until the key is hit. After the key is pressed, output the data.
You want to remove formatting and output from high performance code areas.
Paraphrasing a song: there will be time enough for printing when the data's done.
Edit 1:
An array-based circular queue is a good data structure to hold the incoming data. This gives you the last N data samples.
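For illustration, a minimal sketch of such a queue; the Sample layout and all the names here are mine, not from the question:

#include <array>
#include <cstddef>

struct Sample { unsigned long tick; unsigned char raw[64]; };

// Fixed-size circular queue that always holds the most recent N samples.
template <std::size_t N>
class RingBuffer {
    std::array<Sample, N> buf_{};
    std::size_t head_ = 0, count_ = 0;
public:
    void push(const Sample& s) {
        buf_[head_] = s;
        head_ = (head_ + 1) % N;
        if (count_ < N) ++count_;          // once full, the oldest entry is overwritten
    }
    std::size_t size() const { return count_; }
    const Sample& operator[](std::size_t i) const {   // index 0 = oldest retained sample
        return buf_[(head_ + N - count_ + i) % N];
    }
};

Push each incoming message in the acquisition loop; once the key is hit, walk the queue from oldest to newest and write it out.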
Whenever you have issues with performance, your first step should be to profile the code to see what parts of it are taking up time.
However, for your code, I would say that the printing and string handling are unnecessary in the main loop. I would keep a separate array of timestamps and, within the main loop, only acquire data.
After a key is hit, you no longer have timing restrictions and can deal with the somewhat expensive operation of file I/O and building up of the strings.
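A sketch of that restructuring, reusing the question's externals (Comms, USBRead, getTickCount, start, kbhit, USHORT); the vector names and reserve sizes are mine:

#include <array>
#include <cstdint>
#include <fstream>
#include <iomanip>
#include <vector>

int main(int argc, char* argv[])
{
    std::vector<unsigned long> stamps;
    std::vector<std::array<unsigned char, 64>> messages;
    stamps.reserve(1000000);               // room for ~1000 s of data at 1 kHz
    messages.reserve(1000000);

    do
    {
        std::array<unsigned char, 64> msg;
        USHORT bytesRead = 0;
        if (Comms.USBRead(msg.data(), &bytesRead) != 0)
        {
            stamps.push_back(getTickCount(start));   // timestamp only, no formatting
            messages.push_back(msg);
        }
    } while (!kbhit());

    // All the expensive formatting and file I/O happens after acquisition stops.
    std::ofstream out(argv[1]);
    out << std::fixed << std::setprecision(7);
    for (std::size_t i = 0; i < messages.size(); ++i)
    {
        uint16_t rate = ((messages[i][11] & 0xff) << 8) | (messages[i][12] & 0xff);
        out << stamps[i] << '\t' << rate * 0.03125f << '\n';
    }
}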
A final note is that your OS might be stealing CPU cycles from you. You may want to try to run your code with higher priorities to rule out scheduling.
With all that said, as was mentioned above, your data rate should be sustainable unless you're running on some really vintage hardware.

Serial communication protocol design issues

This is an embedded solution using C++. I'm reading changes in brightness from a cellphone screen, from very bright (white) to dark (black).
Using JavaScript and a very simple script, I'm changing the background of a webpage from white to black at 100-millisecond intervals and reading the result with my brightness sensor. As expected, the browser is not very precise on timing: sometimes it takes 100 ms, sometimes less and sometimes more, with a huge deviation at times.
var syncinterval = setInterval(function(){
bytes = "010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101";
    bit = bytes[i];
    output_bit(bit);
    i += 1;
    if (i > bytes.length) {
        clearInterval(syncinterval);
        i = 0;
        for (i = 0; i < input.length; i++) {
            tbits = input[i].charCodeAt(0).toString(2);
            while (tbits.length < 8) tbits = '0' + tbits;
            bytes += tbits;
        }
        console.log(bytes);
    }
}, sync_speed);
My initial idea, before knowing how the timing behaved in the browser, was to use asynchronous serial communication, with some known "word" to sync the stream of data, as RS-232 does with its start bit -- but with RS-232 the clocks are very precise.
I could use a second sensor to read a different part of the screen as a clock; in that case, even if the monitor or the browser "decides" to go faster or slower, my system will only read when there is a clock signal (this is a similar application, where they swipe the sensors instead of making the screen flicker as I need). But that requires more complex hardware, and I would rather not complicate things before looking for a software solution.
I don't need high speed; the data I'm trying to send is only about 8 bytes at most.
With any kind of asynchronous communication, you rely on the transmitter sending a new 'bit' of data at a fixed time interval, and the receiver sampling the data at the same (fixed) interval. If the browser isn't accurate on timing, you'll just need to slow the bit rate down until it's good enough.
There are a few tricks you can use to help improve the reliability:
a: While sending, calculate the required 'start transmit time' of each 'bit' in advance, and modify the delay after each bit has been 'sent' based on current time vs. required time. This way you avoid cumulative errors (i.e. if bit 1 is sent a little 'late', the delay before bit 2 is reduced to compensate), rather than delaying a constant N microseconds per bit.
b: While receiving, you must sample the incoming data much faster than you expect it to change (UARTs normally use a 16x oversample). This means you can resynchronise on the 'start bit' (the initial change from 1 to 0 in your diagram) and then sample each bit at the expected 'centre' of its time period.
In other words, if you're sending data at 1000 us intervals, you sample the data at ~62 us intervals; when you detect a 'start bit', you wait 500 us to land in the centre of the bit period, then take 8 single-bit samples at 1000 us intervals to form an 8-bit byte (see the sketch below).
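A minimal receiver-side sketch of (b), assuming an Arduino-style environment (delayMicroseconds); readBrightness(), THRESHOLD and the MSB-first bit order are placeholders, not from the question:

const unsigned long BIT_US    = 1000;          // one bit every 1000 us
const unsigned long SAMPLE_US = BIT_US / 16;   // ~62 us oversampling period

bool readBit() { return readBrightness() > THRESHOLD; }   // hypothetical sensor helper

uint8_t receiveByte() {
    while (readBit()) delayMicroseconds(SAMPLE_US);   // wait for the 1 -> 0 start edge
    delayMicroseconds(BIT_US / 2);                    // move to the centre of the start bit
    uint8_t value = 0;
    for (int i = 0; i < 8; ++i) {
        delayMicroseconds(BIT_US);                    // centre of data bit i
        value = (value << 1) | (readBit() ? 1 : 0);   // MSB first (an assumption)
    }
    return value;
}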
You might consider not using a fixed-rate encoding, where each bit is represented as a sequence of the same length, and instead go for a variable-rate encoding:
Time: 0 1 2 3 4
0: _/▔\_
1: _/▔▔▔▔▔\_
This means that when decoding, all you need to do is to measure the time the screen is lit. Short pulses are 0s, long pulses are 1s. It's woefully inefficient, but doesn't require accurate clocking and should be relatively resistant to inaccurate timing. By using some synchronisation pulses (say, an 010 sequence) between bytes you can automatically detect the length of the pulses and so end up not needing a fixed clock at all.
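A sketch of a decoder for this scheme; micros(), readBit() and the threshold (learned from the 010 sync pulses) are placeholder names, not from the question:

// Returns 0 for a short lit pulse, 1 for a long one.
int decodePulseBit(unsigned long shortLongThresholdUs) {
    while (!readBit()) { }                    // wait for the screen to go bright
    unsigned long start = micros();
    while (readBit()) { }                     // wait for it to go dark again
    unsigned long width = micros() - start;   // how long the screen stayed lit
    return (width > shortLongThresholdUs) ? 1 : 0;
}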

How to use ALSA's snd_pcm_writei()?

Can someone explain how snd_pcm_writei
snd_pcm_sframes_t snd_pcm_writei(snd_pcm_t *pcm, const void *buffer,
snd_pcm_uframes_t size)
works?
I have used it like so:
for (int i = 0; i < 1; i++) {
f = snd_pcm_writei(handle, buffer, frames);
...
}
Full source code at http://pastebin.com/m2f28b578
Does this mean that I shouldn't give snd_pcm_writei() the total number of frames in the buffer, but only
sample_rate * latency = frames
?
So if I e.g. have:
sample_rate = 44100
latency = 0.5 [s]
all_frames = 100000
The number of frames that I should give to snd_pcm_writei() would be
sample_rate * latency = frames
44100*0.5 = 22050
and the number of iterations of the for-loop should be:
(int) 100000/22050 = 4; with frames=22050
and one extra, but only with
100000 mod 22050 = 11800
frames?
Is that how it works?
Louise
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html#gf13067c0ebde29118ca05af76e5b17a9
frames should be the number of frames (samples) you want to write from the buffer. Your system's sound driver will start transferring those samples to the sound card right away, and they will be played at a constant rate.
The latency is introduced in several places. There's latency from the data buffered by the driver while waiting to be transferred to the card. There's at least one buffer full of data that's being transferred to the card at any given moment, and there's buffering on the application side, which is what you seem to be concerned about.
To reduce latency on the application side you need to write the smallest buffer that will work for you. If your application performs a DSP task, that's typically one window's worth of data.
There's no advantage in writing small buffers in a loop - just go ahead and write everything in one go - but there's an important point to understand: to minimize latency, your application should write to the driver no faster than the driver is writing data to the sound card, or you'll end up piling up more data and accumulating more and more latency.
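For illustration, a sketch of writing everything in one go while still coping with short writes and underruns; it assumes the question's handle is open, <alsa/asoundlib.h> is included, and buffer holds 16-bit interleaved samples with a known channel count (my assumptions):

const short *p = buffer;                        // 16-bit interleaved samples (assumed)
snd_pcm_uframes_t remaining = all_frames;
while (remaining > 0) {
    snd_pcm_sframes_t n = snd_pcm_writei(handle, p, remaining);
    if (n == -EPIPE) {                          // underrun: recover and retry
        snd_pcm_prepare(handle);
        continue;
    }
    if (n < 0) break;                           // some other error
    p += n * channels;                          // advance by frames actually written
    remaining -= n;
}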
For a design that makes producing data in lockstep with the sound driver relatively easy, look at JACK (http://jackaudio.org/), which is based on registering a callback function with the sound playback engine. In fact, you're probably better off just using JACK instead of trying to do it yourself if you're really concerned about latency.
I think the reason for the "premature" device closure is that you need to call snd_pcm_drain(handle); prior to snd_pcm_close(handle); to ensure that all data is played before the device is closed.
I did some testing to determine why snd_pcm_writei() didn't seem to work for me, using several examples I found in the ALSA tutorials, and what I concluded was that the simple examples were calling snd_pcm_close() before the sound device could play the complete stream sent to it.
I set the rate to 11025, used a 128-byte random buffer, and looped snd_pcm_writei() 11025/128 (about 86) times for each second of sound; two seconds of sound required 86*2 calls to snd_pcm_writei().
In order to give the device sufficient time to convert the data to audio, I used a for loop after the snd_pcm_writei() loop to delay execution of the snd_pcm_close() call.
After testing, I had to conclude that the sample code didn't supply enough samples to overcome the device latency before snd_pcm_close() was called, which implies that the close function has less latency than snd_pcm_writei().
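A minimal sketch of the ordering suggested at the top of this answer, using the same handle/buffer/frames names as the question:

snd_pcm_writei(handle, buffer, frames);   // queue the last chunk
snd_pcm_drain(handle);                    // block until everything queued has been played
snd_pcm_close(handle);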
If the ALSA driver's start threshold is not set properly (in your case it seems to be about 2 s), then you will need to call snd_pcm_start() to start the data rendering immediately after snd_pcm_writei().
Or you may set an appropriate threshold in the SW params of the ALSA device.
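For example, a sketch of setting the start threshold through the software params; handle is an already-open snd_pcm_t*, and period_size stands in for a value obtained from your HW params:

snd_pcm_sw_params_t *sw;
snd_pcm_sw_params_alloca(&sw);
snd_pcm_sw_params_current(handle, sw);
// Start playback as soon as one period has been queued instead of waiting
// for a larger fill level:
snd_pcm_sw_params_set_start_threshold(handle, sw, period_size);
snd_pcm_sw_params(handle, sw);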
ref:
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m___s_w___params.html