I have been trying to speed up the sampling rate of an SPI-controlled ADC for an ECG device and have been banging my head against the wall.
I am using an external ADC (MCP3464 datasheet) to read ADC data over 8 channels.
Here is how I am reading the ADC:
// SPI full duplex transfer
digitalWrite(adcChipSelectPin,LOW);
SPI.transfer(readConversionData);
adcReading = (SPI.transfer(0) << 8);
adcReading += SPI.transfer(0);
digitalWrite(adcChipSelectPin, HIGH);
I tried single conversion mode, but it was slow, at only 3200 samples per second. I believe the ESP32 can clock SPI at up to 80 MHz, so I am surprised it is that slow.
Instead, I have set the MCP3464 to SCAN mode (see page 89 of datasheet) where it completes continuous conversions and you can set delays between each channel and each cycle (8 channels). That way you don't have to write an additional 2 SPI commands to change the channel and start the next conversion.
From the datasheet:
Each conversion within the SCAN cycle leads to a data ready interrupt and to an update of the ADCDATA register as soon as the current conversion is finished. In SCAN mode, each result has to be read when it is available and before it is overwritten by the next conversion result
Hence the data ready interrupt needs to be detected by the ESP32 to immediately read the latest value from the channel, before a new conversion to the next channel occurs.
I have tried waiting for IRQ to go LOW (I am multi-threading, so I can afford a blocking wait):
// wait for IRQ to trigger active LOW indicating data ready state
while(digitalRead(interruptPin));
*(packetLocation + ((ch+2) + (ts*(interface.numOfCh + 2)))) = adc.read();
*(unsigned long*)(packetLocation + (ts*(interface.numOfCh + 2))) = micros();
However, this results in only around 2640 samples per second, and I am worried that the channel synchronisation will collapse if just one active-LOW pulse goes undetected.
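To make the indexing in the snippet above easier to follow, here is a minimal sketch of the buffer layout it appears to use. It assumes packetLocation points to 16-bit slots and that each time step holds a 32-bit timestamp (two slots) followed by one sample per channel; neither is confirmed by the code shown, and all function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed layout (not confirmed by the post): every time step stores a
// 32-bit timestamp in two 16-bit slots, followed by one 16-bit sample
// per channel. All names below are hypothetical.
constexpr std::size_t kTimestampSlots = 2;

// Number of 16-bit slots used by one time step.
constexpr std::size_t frameStride(std::size_t numOfCh) {
    return numOfCh + kTimestampSlots;
}

// Slot index where the timestamp of time step ts starts.
constexpr std::size_t timestampIndex(std::size_t ts, std::size_t numOfCh) {
    return ts * frameStride(numOfCh);
}

// Slot index of the sample for channel ch in time step ts.
constexpr std::size_t sampleIndex(std::size_t ts, std::size_t ch, std::size_t numOfCh) {
    return timestampIndex(ts, numOfCh) + kTimestampSlots + ch;
}

// Store one reading. memcpy avoids an unaligned or aliasing pointer cast
// when writing the 32-bit timestamp into the 16-bit buffer.
inline void storeReading(std::uint16_t *packet, std::size_t ts, std::size_t ch,
                         std::size_t numOfCh, std::uint16_t sample,
                         std::uint32_t timestampUs) {
    packet[sampleIndex(ts, ch, numOfCh)] = sample;
    std::memcpy(packet + timestampIndex(ts, numOfCh), &timestampUs, sizeof timestampUs);
}
```

With 8 channels, each time step occupies 10 slots, so channel 3 of time step 1 lands in slot 15.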
I have also tried attaching an interrupt:
attachInterrupt(interruptPin, packageADC, FALLING);
but I am unable to link a non-static method to an IRQ.
Any ideas how to either get the interrupt to work quickly and reliably in SCAN mode or speed up the ESP32 SPI communication in Single shot mode?
Sorry, quite a lot to take in but any suggestions much appreciated!
Thanks in advance,
Will
The setup that I'm working with is a Nucleo L432KC connected to 8 different MAX31865 ADCs to get temperature readings from RTDs (resistance temperature detectors). Each of the 8 chip selects is connected to its own pin, but the SDI/SDO of all chips are connected to the same bus, since I only read from one at a time (only 1 chip select is enabled at a time). For now, I am using a 100 ohm base resistor in the Kelvin connection, not an RTD, just to ensure an accurate resistance reading. The read from an RTD comes from calling the function rtd.read_all(). When I read from one and only one RTD, I get an accurate reading and an accurate SPI waveform (pasted below):
correct SPI reading for 1 ADC
(yellow is chip enable, green is clock, blue is miso, purple is mosi)
However, when I read from 2 or more sequentially, the SPI clock for some reason gains an additional unwanted cycle at the start of the read that throws off the transmitted values. It's been having the effect of shifting the clock to the right and bit-shifting my resistance values to the left by 1.
Logic analyzer reading of SPI; clock has additional cycle at start
What could be causing this extra clock cycle? I'm programming in C++ using mbed. I can access the SPI.h file but I can't see the implementation so I'm not sure what might be causing this extra clock cycle at the start. If I need to add the code too, let me know and I'll edit/comment.
rtd.read_all() function:
uint8_t MAX31865_RTD::read_all( )
{
uint16_t combined_bytes;
//SPI clock polarity/phase (CPOL & CPHA) is set to 11 in spi.format (bit 1 = polarity, bit 0 = phase, see SPI.h)
//polarity of 1 indicates that the SPI clock idles high (default setting is 1; polarity of 0 means idle is 0)
//phase of 1 indicates that data is sampled on the second clock edge, which with polarity 1 is the
//low-to-high transition (as opposed to phase 0, which samples data on the first edge)
//see https://deardevices.com/2020/05/31/spi-cpol-cpha/ to understand CPOL/CPHA
//chip select is negative logic, idles at 1
//When chip select is set to 0, the chip is then waiting for a value to be written over SPI
//That value represents the first register that it reads from
//registers to read from are from addresses 00h to 07h (h = hex, so 0x01, 0x02, etc)
//00 = configuration register, 01 = MSBs of resistance value, 02 = LSBs of resistance value
//Registers available on datasheet at https://datasheets.maximintegrated.com/en/ds/MAX31865.pdf
//The chip then automatically increments to read from the next register
/* Start the read operation. */
nss = 0; //tell the MAX31865 we want to start reading, waiting for starting address to be written
/* Tell the MAX31865 that we want to read, starting at register 0. */
spi.write( 0x00 ); //start reading values starting at register 00h
/* Read the MAX31865 registers in the following order:
Configuration (00)
RTD (01 = MSBs, 02 = LSBs)
High Fault Threshold (03 = MSBs, 04 = LSBs)
Low Fault Threshold (05 = MSBs, 06 = LSBs)
Fault Status (07) */
this->measured_resistance = 0;
this->measured_configuration = spi.write( 0x00 ); //read from register 00
//automatic increment to register 01
combined_bytes = spi.write( 0x00 ) << 8; //8 bit value from register 01, bit shifted 8 left
//automatic increment to register 02, OR with previous bit shifted value to get complete 16 bit value
combined_bytes |= spi.write( 0x00 );
//bit 0 of LSB is a fault bit, DOES NOT REPRESENT RESISTANCE VALUE
//bit shift 16-bit value 1 right to remove fault bit and get complete 15 bit raw resistance reading
this->measured_resistance = combined_bytes >> 1;
//high fault threshold
combined_bytes = spi.write( 0x00 ) << 8;
combined_bytes |= spi.write( 0x00 );
this->measured_high_threshold = combined_bytes >> 1;
//low fault threshold
combined_bytes = spi.write( 0x00 ) << 8;
combined_bytes |= spi.write( 0x00 );
this->measured_low_threshold = combined_bytes >> 1;
//fault status
this->measured_status = spi.write( 0x00 );
//set chip select to 1; chip stops incrementing registers when chip select is high; ends read cycle
nss = 1;
/* Reset the configuration if the measured resistance is
zero or a fault occurred. */
if( ( this->measured_resistance == 0 )
|| ( this->measured_status != 0 ) )
{
//reconfigure( );
// extra clock cycle causes measured_status to be non-zero, so chip will reconfigure even though it doesn't need to. reconfigure commented out for now.
}
return( status( ) );
}
Background:
I took a look at the entire implementation of the MAX31865_RTD class and the thing I find "troubling" is that a MAX31865_RTD instance creates its own SPI instance on construction. If you create multiple instances of this MAX31865_RTD class then there will be a separate SPI instance created and initialized for each of these.
If you have 8 of these chips and you create 8 separate MAX31865_RTD instances to provide one for each of your chips then this also creates 8 SPI instances that all point to the same physical SPI device of the microcontroller.
The problem:
When you call the read_all function on your MAX31865_RTD instance it in turn calls the SPI write functions (as seen in the code you provided). But digging deeper in the call chain you will eventually find that the code of the SPI write method (and others as well) is written in a way that it assumes that there can be multiple SPI instances that are using the same SPI hardware with different parameters (frequency, word length, etc...).
In order to actually use the SPI hardware, the SPI class instance must first take ownership of the hardware if it does not have it yet. To do this it "acquires" the hardware for itself, which basically means that it reconfigures the SPI hardware to the frequency, word length and mode that this particular SPI instance was set to. (This happens regardless of the fact that every instance is set to the same parameters. They don't know about each other. They just see the fact that they have lost ownership and thus have to reacquire it, and they also automatically assume that the settings are to be restored.)
And this frequency (= clock) reinitialization is the reason that your clock is having a weird artefact/glitch on it. Each time you call read_all on a different MAX31865_RTD instance, the SPI instance of that instance will have to do an acquire (because they steal the ownership from each other on each read_all call) and it will make the clock behave weirdly.
Why it works if you only have one device:
Because when you have one and only one MAX31865_RTD instance then it has only one SPI instance which is the sole "owner" of the SPI hardware. So no-one is stealing the ownership on each turn. Which means that it does not have to re-acquire it on every read_all call. So in that case the SPI hardware is not reinitialized every time. So you don't get the weird clock pulse and everything works as intended.
My proposed solution #1:
I propose that you change the implementation of the read_all method.
If the version of the SPI class that you use has the select method, then add the
spi.select();
line just before pulling the chip select (nss) low. Basically add the line above this block:
/* Start the read operation. */
nss = 0;
If there is no select function, then just add a
spi.write(0x00);
line in the same place instead of the line with the select.
In essence both of the proposed lines just force the acquire (and the accompanying clock glitch) before the chip select line is asserted. So by the time the chip select is pulled low and the actual data is being written the SPI instance already has ownership and the write methods will not trigger an acquire (nor the clock glitch).
My proposed solution #2:
Another solution is to modify the MAX31865_RTD class to use an SPI instance reference and provide that reference through its constructor. That way you can create one SPI instance explicitly and provide this same SPI instance to all your MAX31865_RTD instances at construction. Now since all of your MAX31865_RTD instances are using a reference to the same and only SPI instance, the SPI hardware ownership never changes since there is only one SPI class instance that is using it. Thus the hardware is never reconfigured and the glitch never happens. I would prefer this solution since it is less of a workaround.
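To illustrate why a shared SPI instance avoids the repeated acquire, here is a small host-side sketch. It does not use the real mbed classes: MockSpi, RtdOwningSpi and RtdSharedSpi are hypothetical stand-ins that only count how often the simulated peripheral would be reconfigured.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SPI driver object. Like the behaviour
// described above, an instance reconfigures the (simulated) peripheral
// whenever it is not the current owner.
class MockSpi {
public:
    std::uint8_t write(std::uint8_t value) {
        if (owner_ != this) {      // ownership lost: acquire and reconfigure
            owner_ = this;
            ++reconfigurations_;
        }
        return value;              // loop the byte back, there is no hardware
    }
    static int reconfigurations() { return reconfigurations_; }
    static void reset() { owner_ = nullptr; reconfigurations_ = 0; }
private:
    static MockSpi *owner_;
    static int reconfigurations_;
};
MockSpi *MockSpi::owner_ = nullptr;
int MockSpi::reconfigurations_ = 0;

// Mirrors the original design: every device object owns its own SPI object.
class RtdOwningSpi {
public:
    void read_all() { spi_.write(0x00); }
private:
    MockSpi spi_;
};

// Mirrors solution #2: the SPI object is passed in by reference.
class RtdSharedSpi {
public:
    explicit RtdSharedSpi(MockSpi &spi) : spi_(spi) {}
    void read_all() { spi_.write(0x00); }
private:
    MockSpi &spi_;
};

// Reads every device in turn for a number of rounds and returns how often
// the simulated peripheral was reconfigured.
inline int reconfigurationsWithOwnedSpi(std::size_t devices, int rounds) {
    MockSpi::reset();
    std::vector<RtdOwningSpi> rtds(devices);
    for (int r = 0; r < rounds; ++r)
        for (RtdOwningSpi &rtd : rtds) rtd.read_all();
    return MockSpi::reconfigurations();
}

inline int reconfigurationsWithSharedSpi(std::size_t devices, int rounds) {
    MockSpi::reset();
    MockSpi spi;
    std::vector<RtdSharedSpi> rtds;
    rtds.reserve(devices);
    for (std::size_t i = 0; i < devices; ++i) rtds.emplace_back(spi);
    for (int r = 0; r < rounds; ++r)
        for (RtdSharedSpi &rtd : rtds) rtd.read_all();
    return MockSpi::reconfigurations();
}
```

In this model, 8 devices read for 3 rounds cause 24 reconfigurations when each device owns its SPI object, and only 1 when they share one. The real class would of course also need the per-device chip select handling it already has.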
My proposed solution #3:
You could also modify the MAX31865_RTD class to have a "setter" for the nss (= chip select) pin. That way, you could have only one MAX31865_RTD instance for all your 8 devices and only change the nss pin before addressing the next device. Since there is only one MAX31865_RTD instance then there is only one SPI instance which also solves the ownership issue and since no re-acquisition has to be made then no glitch will be triggered.
And of course there can be any number of other ways to fix this, knowing the cause of the problem.
Hope this helps in solving your issue.
I'm attempting to create a proper SPI slave interface for an AD7768-4 ADC. The ADC has a SPI interface, but it doesn't output the conversions via SPI. Instead, there are data outputs that are clocked out on individual GPIO pins. So I basically need to bit-bang data, and output to SPI to get a proper slave SPI interface. Please don't ask why I'm doing it this way, it was assigned to me.
The issue I'm having is with the interrupts. I'm using the STM32F767ZI processor - it runs at 216 MHz, and my ADC data MUST BE clocked out at 20MHz. I've set up my NMIs but what I'm not seeing is where the system calls or points to the interrupt handler.
I used the STMCubeMX software to assign pins and generate the setup code, and in the stm32F7xx.c file, it shows the NMI_Handler() function, but I don't see a pointer to it anywhere in the system files. I also found void HAL_GPIO_EXTI_IRQHandler() function in STM32F7xx_hal_gpio.c, which appears to check if the pin is asserted, and clears any pending bits, but it doesn't reset the interrupt flag, or check it, and again, I see no pointer to this function.
To complicate things further, I have 10 clock cycles to determine which flag is set (1 of two at a time), reset it, increment a variable, and move data from the GPIO registers. I believe this is possible, but again, I'm uncertain of what the system is doing as soon as the interrupt is tripped.
Does anyone have any experience in working with external interrupts on this processor that could shed some light on how this particular system handles things? Again - 10 clock cycles to do what I need to... moving data should only take me 1-2 clock cycles, leaving me 8 to handle interrupts...
EDIT:
We changed the DCLK speed to 5.12 MHz (20.48 MHz MCLK/4) because at 2.56 MHz we had exactly 12.5 microseconds to pipe data out and set up for the next DRDY pulse, and 80 kHz speed gives us exactly zero margin. At 5.12 MHz, I have 41 clock cycles to run the interrupt routine, which I can reduce slightly if I skip checking the second flag and just handle incoming data. But I feel I must use the DRDY flag check at least, and use the routine to enable the second interrupt otherwise I'll be constantly interrupting because DCLK on the ADC is always running. This allows me 6.12 microseconds to read in the data, and 6.25 microseconds to shuffle it out before the next DRDY pulse. I should be able to do that at 32 MHz SPI clock (slave) but will most likely do it at 50MHz. This is my current interrupt code:
void NMI_Handler(void)
{
if(__HAL_GPIO_EXTI_GET_IT(GPIO_PIN_0) != RESET)
{
count = 0;
__HAL_GPIO_EXTI_CLEAR_IT(GPIO_PIN_0);
HAL_GPIO_EXTI_Callback(GPIO_PIN_0);
// __HAL_GPIO_EXTI_CLEAR_FLAG(GPIO_PIN_0);
HAL_NVIC_EnableIRQ(GPIO_PIN_1);
}
else
{
if(__HAL_GPIO_EXTI_GET_IT(GPIO_PIN_1) != RESET)
{
data_pad[count] = GPIOF->IDR;
count++;
if (count == 31)
{
data_send = !data_send;
HAL_NVIC_DisableIRQ(GPIO_PIN_1);
}
__HAL_GPIO_EXTI_CLEAR_IT(GPIO_PIN_1);
HAL_GPIO_EXTI_Callback(GPIO_PIN_1);
// __HAL_GPIO_EXTI_CLEAR_FLAG(GPIO_PIN_0);
}
}
}
I am still concerned about clock cycles, and I believe I can get away with only checking the DRDY flag if I operate on the presumption that the only other EXTI flag that will trip is for the clock pin. Although I question how this will work if SYS_TICK is running in the background... I'll have to find out.
We're investigating a faster processor to handle the bit-banging, but right now, it looks like the PI3 won't be able to handle it if it's running Linux, and I'm unaware of too many faster processors that run either a very small reliable RTOS, or can be bare metal programmed in a pinch...
10 clock cycles to do what I need to... moving data should only take me 1-2 clock cycles, leaving me 8 to handle interrupts...
No way. Interrupt entry (pushing registers, fetching the vector and filling the pipeline) takes 10-12 cycles even on a Cortex-M7. Then consider a very simple interrupt handler, just moving the input data bits to a buffer and clearing the interrupt flag:
uint32_t *p;
void handler(void) {
*p++ = GPIOA->IDR;
EXTI->PR = 0x10;
}
it gets translated to something like this
handler:
ldr r0, .addr_of_idr // load &GPIOA->IDR
ldr r1, [r0] // load GPIOA->IDR
ldr r2, .addr_of_p // load &p
ldr r3, [r2] // load p
str r1, [r3] // store the value from IDR to *p
adds r3, r3, #4 // increment p
str r3, [r2] // store p
ldr r0, .addr_of_pr // load &EXTI->PR
movs r1, #0x10
str r1, [r0] // store 0x10 to EXTI->PR
bx lr
.addr_of_p:
.word p
.addr_of_idr:
.word 0x40020010
.addr_of_pr:
.word 0x40013C14
So it's 11 instructions, each taking at least one cycle, after interrupt entry. That's assuming the code, vector table, and the stack are all in the fastest RAM region. I'm not sure whether literal pools work in ITCM at all; using immediate literals would add 3 more cycles. Forget it.
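The budget argument can be written down as a few lines of arithmetic. This is only a rough lower-bound sketch: it uses the 12-cycle entry figure and the 11-instruction count from above, assumes one cycle per instruction, and ignores exception return and any bus wait states. The function names are made up for illustration.

```cpp
#include <cstdint>

// Figures taken from the discussion above; both are optimistic lower bounds.
constexpr std::uint32_t kInterruptEntryCycles = 12;
constexpr std::uint32_t kHandlerInstructions = 11;

// CPU cycles available between two consecutive DCLK edges (rounded down).
constexpr std::uint32_t cyclesPerClockEdge(std::uint32_t cpuHz, std::uint32_t dclkHz) {
    return cpuHz / dclkHz;
}

// Cheapest possible cost of one interrupt, assuming one cycle per instruction.
constexpr std::uint32_t minimumHandlerCycles() {
    return kInterruptEntryCycles + kHandlerInstructions;
}

// True if even the optimistic handler cost fits between two clock edges.
constexpr bool fitsCycleBudget(std::uint32_t cpuHz, std::uint32_t dclkHz) {
    return minimumHandlerCycles() <= cyclesPerClockEdge(cpuHz, dclkHz);
}
```

At 216 MHz and a 20 MHz DCLK the budget is 10 cycles against a minimum cost of 23, so it cannot fit. At 5.12 MHz the budget is 42 cycles, which clears the lower bound but leaves little room for anything else.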
This has to be solved with hardware.
The controller has 6 SPI interfaces, pick 4 of them. Connect DRDY to all four NSS pins, DCLK to all SCK pins, and each DOUT pin to one MISO pin. Now each SPI interface handles a single channel, and can collect up to 32 bits in its internal FIFO.
Then I'd set an interrupt on a rising edge on one of the NSS pins (EXTI still works even if the pin is in alternate function mode), and read all data at once.
EDIT
It turns out that the STM32 SPI requires an inordinate amount of delay between NSS falling and SCK rising, which the AD7768 does not provide, so it will not work.
Sigma-Delta interface
The STM32F767 has a DFSDM peripheral, designed to receive data from external ADCs. It can receive up to 8 channels of serial data with 20 MHz, and it can even do some preprocessing that your application might need.
The problem is that the DFSDM has no DRDY input, so I don't know exactly how the data transfer could be synchronized. It might work by asserting the START# signal to reset the communication.
If that doesn't work, then you can try starting the DFSDM channels using a timer and DMA. Connect DRDY to the external trigger of TIM1 or TIM8 (other timers won't work, because they are connected to the slower APB1 bus and the other DMA controller), start it on the rising edge of ETR, and let it generate a DMA request after ~20 ns. Then let the DMA write the value needed to start the channel to the DFSDM channel configuration register. Repeat for the other three channels.
There's a startup file generated before compile: startup_stm32f767xx.s - which contains all the pointers to functions.
Under the label g_pfnVectors: is .word NMI_Handler, pointing to the handler for the non-maskable interrupt, and two other pointers, .word EXTI0_IRQHandler and .word EXTI1_IRQHandler, as vectors to the external interrupt handlers. Further down in the same file are the following assembler directives:
.weak NMI_Handler
.thumb_set NMI_Handler,Default_Handler
.weak EXTI0_IRQHandler
.thumb_set EXTI0_IRQHandler,Default_Handler
.weak EXTI1_IRQHandler
.thumb_set EXTI1_IRQHandler,Default_Handler
This was the info I was looking for to be able to control my interrupts with more precision and fewer clock cycles.
I read the AD7768 datasheet more carefully and found that it can send the data of all four channels on one DOUT pin. So I am again suggesting the serial audio interface (SAI).
If you can lower the DCLK frequency to 2.5 MHz, then the sample rate drops by a ratio of 1:8 (the ratio of 2.5 MHz to 20 MHz) relative to the sample rate at full ADC clock.
If you route all 4 channels to one output, DOUT0, the sample rate drops only by a ratio of 1:4.
AD7768-4 DS
page 53
On the AD7768, the interface can be configured to output conversion
data on one, two, or eight of the DOUTx pins. The DOUTx configuration
for the AD7768 is selected using the FORMATx pins (see Table 33).
page 66 table 34: (for AD7768-4)
page 67 figure 98:
FORMAT0 = 1 All channels output on the DOUT0 pin, in TDM output. Only DOUT0 is in use.
You can use SAI with FS = DRDY and four slots, 32 bits/slot
I am trying to generate a quadrature signal with as few operations as possible. I use an STM32 and GPIO pins B8 and B9 to send the signal.
The pair of pins 8 and 9 has four possible states, which are, clockwise:
0/0 1/0 1/1 and 0/1
and counter clockwise
0/0 0/1 1/1 1/0
I can't find a bitwise way to quickly set or reset the bit for the selected pin.
Moreover, I must be able to go clockwise or counterclockwise and change direction whenever I want, as if it were a rotary or linear encoder.
Thank you for your help
Bit-banging
Bitwise thinking, B9 gets the previous value of B8, and B8 gets the inverse of B9, or the other way round when counting down. You swap the two bits, and exclusive-or with 0x100 or 0x200 depending on the direction.
inline void incB89(int down) {
uint32_t temp;
/* read the current output state */
temp = GPIOB->ODR;
/* modifying the significant bit-pair
don't care about overflow */
temp = (((temp & 0x100) << 1) | ((temp & 0x200) >> 1)) ^ (0x100 << down);
/* Keep only bits 8 and 9, then set the reset bits BR8 and BR9.
A set bit takes priority over its reset bit in BSRR, so bits 8
and 9 are copied into the ODR, and the rest are left alone */
temp = (temp & 0x300) | (1 << 24) | (1 << 25);
GPIOB->BSRR = temp;
}
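The swap-and-XOR step can be checked on a PC before it goes anywhere near the hardware. The sketch below isolates the bit manipulation into pure functions; nextQuadrature and bsrrValue are names I made up, and nothing here touches a real GPIO register.

```cpp
#include <cstdint>

constexpr std::uint32_t kPin8 = 1u << 8;
constexpr std::uint32_t kPin9 = 1u << 9;

// Next state of bits 8 and 9: swap the two bits, then invert bit 8
// (down == 0) or bit 9 (down == 1). All other bits are masked off.
constexpr std::uint32_t nextQuadrature(std::uint32_t odr, int down) {
    return ((((odr & kPin8) << 1) | ((odr & kPin9) >> 1)) ^ (kPin8 << down))
           & (kPin8 | kPin9);
}

// Value to write to BSRR: set bits for the new state plus the reset bits
// BR8 and BR9. Where both are set, the set bit wins.
constexpr std::uint32_t bsrrValue(std::uint32_t next) {
    return (next & (kPin8 | kPin9)) | ((kPin8 | kPin9) << 16);
}
```

Starting from 0x000 and stepping with down == 0 gives 0x100, 0x300, 0x200, 0x000, which is the 0/0 1/0 1/1 0/1 sequence from the question written as B8/B9. With down == 1 the sequence runs the other way.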
Using a timer (or two)
On most STM32 series controllers, TIM4 channels 3 and 4 outputs can be mapped to PB8 and PB9. If you have one of these, this timer can control the outputs autonomously, unaffected by code, memory, or interrupt latency.
Set the GPIO mode and alternate function registers according to the reference manual of your controller.
Configure both channel 3 and 4 to toggle mode: set the OC3M and OC4M bits in TIM4->CCMR2 to 0b011.
Set the input clock, prescaler PSC and reload ARR to achieve twice the desired frequency, because each output will be toggled once in every timer cycle.
Set TIM4->CCR3=0 and TIM4->CCR4=(TIM4->ARR+1)/2 for counting in one direction. Swap them (while the counter is stopped) to reverse direction.
Enable the outputs in TIM4->CCER.
You can start and stop counting by setting or resetting the CEN bit of TIM4->CR1.
To count the cycles, you can configure an interrupt for toggle or update events in TIM4->DIER, or use another timer as a slave to TIM4.
To use e.g. TIM3 to count:
Set the MMS bits in TIM4->CR2 to 0b010 to output a trigger pulse on each overflow.
Configure TIM3->SMCR to External Clock Mode 1, and select the internal trigger of TIM4.
Set TIM3->ARR to the required number of half-cycles - 1.
Configure an interrupt on update.
Start the counter.
There are some more tricks possible with timers, like using DMA bursts triggered by the slave to update the ARR and CCR registers of the master timer from a table of "wavelength" values.
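As a rough sketch of the arithmetic behind the PSC, ARR and CCR settings described above, the helper below computes the values for a given output frequency. It assumes a 16-bit counter and that you already know the timer input clock; the struct and function names are hypothetical, and the code only calculates numbers, it does not write any registers.

```cpp
#include <cstdint>

// Hypothetical container for the computed register values.
struct QuadTimerSettings {
    std::uint32_t psc;   // prescaler register value
    std::uint32_t arr;   // auto-reload register value
    std::uint32_t ccr3;  // compare value for channel 3
    std::uint32_t ccr4;  // compare value for channel 4
    bool valid;          // false if the request does not fit a 16-bit counter
};

// In toggle mode each output toggles once per counter period, so the
// counter has to overflow at twice the desired output frequency.
inline QuadTimerSettings quadTimerSettings(std::uint32_t timerClockHz, std::uint32_t psc,
                                           std::uint32_t outputHz, bool reverse) {
    QuadTimerSettings s{psc, 0, 0, 0, false};
    if (outputHz == 0) return s;
    const std::uint64_t divider = (static_cast<std::uint64_t>(psc) + 1) * 2u * outputHz;
    const std::uint64_t ticksPerPeriod = timerClockHz / divider;
    if (ticksPerPeriod < 2 || ticksPerPeriod > 65536) return s;
    s.arr = static_cast<std::uint32_t>(ticksPerPeriod - 1);
    const std::uint32_t half = (s.arr + 1) / 2;
    // The two compare values are half a period apart (a quarter of the
    // output cycle); swapping them reverses the direction.
    s.ccr3 = reverse ? half : 0;
    s.ccr4 = reverse ? 0 : half;
    s.valid = true;
    return s;
}
```

For example, with an assumed 84 MHz timer clock, PSC = 83 and a 1 kHz output, this gives ARR = 499, CCR3 = 0 and CCR4 = 250. Integer division means the actual frequency can be slightly off when the numbers do not divide evenly.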
This is an embedded solution using C++. I'm reading the changes in brightness of a cellphone screen, from very bright (white) to dark (black).
Using JavaScript and a very simple script, I'm changing the background of a webpage from white to black at 100 millisecond intervals and reading the result on my brightness sensor. As expected, the browser is not very precise on timing: sometimes it takes 100 ms, sometimes less and sometimes more, with a huge deviation at times.
var syncinterval = setInterval(function(){
bytes = "010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101";
bit = bytes[i];
output_bit(bit);
i += 1;
if(i >= bytes.length) {
clearInterval(syncinterval);
i = 0;
for (i=0; i < input.length; i++) {
tbits = input[i].charCodeAt(0).toString(2);
while (tbits.length < 8) tbits = '0' + tbits;
bytes += tbits;
}
console.log(bytes);
}
}, sync_speed);
My initial idea, before I knew how imprecise the browser's timing was, was to use asynchronous serial communication, with some known "word" to sync the stream of data, as RS232 does with its start bit, but with RS232 the clocks are very precise.
I could use a second sensor to read a different part of the screen as a clock. In that case, even if the monitor or the browser "decides" to go faster or slower, my system will only read when there is a clock signal (this is a similar application where they swipe the sensors instead of making the screen flicker as I need). But this requires a more complex hardware system, and I would rather not complicate things before searching for a software solution.
I don't need high speeds; the data I'm trying to send is about 8 bytes at most.
With any kind of asynchronous communications, you rely on transmitter sending a new 'bit' of data at a fixed time interval, and the receiver sampling the data at the same (fixed) interval. If the browser isn't accurate on timings, you'll just need to slow the bitrate down until its good enough.
There are a few tricks you can use to help you improve the reliability:-
a : While sending, calculate the required 'start transmit time' of each 'bit' in advance, and modify the delay after each bit has been 'sent', based on current time vs. required time. This means you'll avoid cumulative errors (i.e. if Bit 1 is sent a little 'late', the delay to bit 2 will be reduced to compensate), rather than delaying a constant N microseconds per bit.
b: While receiving, you must sample the incoming data much faster than you expect changes. (UARTS normally use a 16x oversample) This means you can resynchronize with the 'start bit' (the initial change from 1 to 0 in your diagram) and you can then sample each bit at the expected 'centre' of its time period.
In other words, if you're sending data at 1000us intervals, you sample data at ~62us intervals, and when you detect a 'start bit', you wait 500us to put you in the centre of the time period, then take 8 single-bit samples at 1000us intervals to form an 8-bit byte.
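Both tricks can be sketched in a few lines. The code below is an illustration only, with made-up function names: delayUntilBit covers trick (a) by always measuring from the start of the transmission, and decodeByte covers trick (b) by finding the start bit in an oversampled stream and sampling each data bit at its centre. It assumes an idle-high line, a low start bit and LSB-first data, as a UART would use; your framing may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Trick (a): delay until the scheduled start of bit n, measured from the
// start of the transmission, so a late bit shortens the next delay
// instead of shifting every later bit.
inline std::uint64_t delayUntilBit(std::uint64_t startUs, std::uint64_t bitIndex,
                                   std::uint64_t bitPeriodUs, std::uint64_t nowUs) {
    const std::uint64_t scheduledUs = startUs + bitIndex * bitPeriodUs;
    return nowUs >= scheduledUs ? 0 : scheduledUs - nowUs;
}

// Helper that builds an ideal oversampled stream: idle-high samples, one
// low start bit, 8 data bits (LSB first, an assumption), one high stop bit.
inline std::vector<int> encodeByte(std::uint8_t value, std::size_t oversample,
                                   std::size_t idleSamples) {
    std::vector<int> samples(idleSamples, 1);
    samples.insert(samples.end(), oversample, 0);
    for (int bit = 0; bit < 8; ++bit)
        samples.insert(samples.end(), oversample, (value >> bit) & 1);
    samples.insert(samples.end(), oversample, 1);
    return samples;
}

// Trick (b): find the 1 -> 0 edge of the start bit, skip half a bit to
// reach its centre, then take one sample per bit period for 8 bits.
// Returns false if no start bit is found or the stream is too short.
inline bool decodeByte(const std::vector<int> &samples, std::size_t oversample,
                       std::uint8_t &out) {
    std::size_t start = 0;
    bool found = false;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        if (samples[i - 1] == 1 && samples[i] == 0) {
            start = i;
            found = true;
            break;
        }
    }
    if (!found) return false;
    std::uint8_t value = 0;
    for (std::size_t bit = 0; bit < 8; ++bit) {
        const std::size_t idx = start + oversample / 2 + (bit + 1) * oversample;
        if (idx >= samples.size()) return false;
        if (samples[idx]) value = static_cast<std::uint8_t>(value | (1u << bit));
    }
    out = value;
    return true;
}
```

Because each byte is resynchronised on its own start bit, timing errors only have to stay below half a bit period over the length of one byte.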
You might consider not using a fixed-rate encoding, where each bit is represented as a sequence of the same length, and instead go for a variable-rate encoding:
Time: 0 1 2 3 4
0: _/▔\_
1: _/▔▔▔▔▔\_
This means that when decoding, all you need to do is to measure the time the screen is lit. Short pulses are 0s, long pulses are 1s. It's woefully inefficient, but doesn't require accurate clocking and should be relatively resistant to inaccurate timing. By using some synchronisation pulses (say, an 010 sequence) between bytes you can automatically detect the length of the pulses and so end up not needing a fixed clock at all.
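A minimal sketch of the decoding side, assuming the receiver has already measured how long the screen was lit for each pulse. The function names are hypothetical; the threshold is simply the midpoint between the shortest and longest pulse seen in the sync sequence, which is one of several reasonable choices.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Midpoint between the shortest and longest sync pulse. Assumes the sync
// sequence contains at least one short (0) and one long (1) pulse.
inline std::uint32_t thresholdFromSync(const std::vector<std::uint32_t> &syncMs) {
    if (syncMs.empty()) return 0;
    const auto range = std::minmax_element(syncMs.begin(), syncMs.end());
    return (*range.first + *range.second) / 2;
}

// Classify each lit duration: longer than the threshold is a 1, else a 0.
inline std::string decodePulses(const std::vector<std::uint32_t> &litMs,
                                std::uint32_t thresholdMs) {
    std::string bits;
    for (std::uint32_t duration : litMs)
        bits += (duration > thresholdMs) ? '1' : '0';
    return bits;
}
```

For instance, a 010 sync measured as 110, 290 and 95 ms gives a threshold of 192 ms, and pulses of 100, 310, 280 and 90 ms then decode to 0110 even though none of them has the nominal length.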
Can someone explain how snd_pcm_writei
snd_pcm_sframes_t snd_pcm_writei(snd_pcm_t *pcm, const void *buffer,
snd_pcm_uframes_t size)
works?
I have used it like so:
for (int i = 0; i < 1; i++) {
f = snd_pcm_writei(handle, buffer, frames);
...
}
Full source code at http://pastebin.com/m2f28b578
Does this mean, that I shouldn't give snd_pcm_writei() the number of
all the frames in buffer, but only
sample_rate * latency = frames
?
So if I e.g. have:
sample_rate = 44100
latency = 0.5 [s]
all_frames = 100000
The number of frames that I should give to snd_pcm_writei() would be
sample_rate * latency = frames
44100*0.5 = 22050
and the number of iterations the for-loop should be?:
(int) 100000/22050 = 4; with frames=22050
and one extra, but only with
100000 mod 22050 = 11800
frames?
Is that how it works?
Louise
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html#gf13067c0ebde29118ca05af76e5b17a9
frames should be the number of frames (a frame is one sample per channel) you want to write from the buffer. Your system's sound driver will start transferring those samples to the sound card right away, and they will be played at a constant rate.
The latency is introduced in several places. There's latency from the data buffered by the driver while waiting to be transferred to the card. There's at least one buffer full of data that's being transferred to the card at any given moment, and there's buffering on the application side, which is what you seem to be concerned about.
To reduce latency on the application side you need to write the smallest buffer that will work for you. If your application performs a DSP task, that's typically one window's worth of data.
There's no advantage in writing small buffers in a loop - just go ahead and write everything in one go - but there's an important point to understand: to minimize latency, your application should write to the driver no faster than the driver is writing data to the sound card, or you'll end up piling up more data and accumulating more and more latency.
For a design that makes producing data in lockstep with the sound driver relatively easy, look at jack (http://jackaudio.org/) which is based on registering a callback function with the sound playback engine. In fact, you're probably just better off using jack instead of trying to do it yourself if you're really concerned about latency.
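If you do decide to split the data into chunks as described in the question, the bookkeeping looks roughly like the sketch below. It does not call ALSA: FrameWriter is a hypothetical callback that follows the same return convention as snd_pcm_writei() (number of frames actually written, or a negative value on error), so the loop can be tried without a sound card.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical writer callback: receives the frame offset into the buffer
// and the number of frames to write, and returns the number of frames
// actually written or a negative error code.
using FrameWriter = std::function<long(std::size_t offsetFrames, std::size_t frames)>;

// Writes totalFrames in chunks of at most chunkFrames and returns the
// size of every chunk the writer accepted. Stops early on an error.
inline std::vector<std::size_t> writeInChunks(std::size_t totalFrames,
                                              std::size_t chunkFrames,
                                              const FrameWriter &write) {
    std::vector<std::size_t> accepted;
    if (chunkFrames == 0) return accepted;
    std::size_t done = 0;
    while (done < totalFrames) {
        const std::size_t request = std::min(chunkFrames, totalFrames - done);
        const long written = write(done, request);
        if (written <= 0) break;   // error or no progress: give up
        accepted.push_back(static_cast<std::size_t>(written));
        done += static_cast<std::size_t>(written);   // may be a short write
    }
    return accepted;
}
```

With 100000 frames and chunks of 22050, a writer that always accepts everything is called five times: four chunks of 22050 and a final one of 11800, matching the arithmetic in the question. Advancing by the returned count rather than the requested count also handles a writer that accepts fewer frames than asked.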
I think the reason for the "premature" device closure is that you need to call snd_pcm_drain(handle); prior to snd_pcm_close(handle); to ensure that all data is played before the device is closed.
I did some testing to determine why snd_pcm_writei() didn't seem to work for me using several examples I found in the ALSA tutorials, and what I concluded was that the simple examples were doing a snd_pcm_close() before the sound device could play the complete stream sent to it.
I set the rate to 11025, used a 128 byte random buffer, and looped snd_pcm_writei() 11025/128 times for each second of sound. Two seconds required 86*2 calls to snd_pcm_writei().
In order to give the device sufficient time to convert the data to audio, I used a for loop after the snd_pcm_writei() loop to delay execution of the snd_pcm_close() function.
After testing, I had to conclude that the sample code didn't supply enough samples to overcome the device latency before the snd_pcm_close() function was called, which implies that the close function has less latency than the snd_pcm_writei() function.
If the ALSA driver's start threshold is not set properly (if in your case it is about 2s), then you will need to call snd_pcm_start() to start the data rendering immediately after snd_pcm_writei().
Or you may set an appropriate start threshold in the software (SW) params of the ALSA device.
ref:
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m.html
http://www.alsa-project.org/alsa-doc/alsa-lib/group___p_c_m___s_w___params.html