nesC (C-like) question - nesc

This is code from a file written in nesC:
#define BUFFERLEN 32768
uint32_t gBuffer[BUFFERLEN] __attribute__((aligned(32)));
uint32_t gNumSamples = BUFFERLEN/4;
event void Audio.ready(result_t success)
{
  call Audio.audioRecord(gBuffer, gNumSamples);
}
The buffer gBuffer is used to store sound recording samples. Samples are 16-bit stereo samples packed into a 32-bit word. Left samples are in the low 16 bits. Right samples are in the high 16 bits.
What confuses me is the number of samples, gNumSamples. As I understand it, gNumSamples should be BUFFERLEN, since each gBuffer[i] is a 32-bit word (16 bits for the left channel + 16 for the right channel). Am I right? (I changed it to gNumSamples = BUFFERLEN and it didn't work.)
Thanks for your help.
This is how gBuffer is used:
command result_t Audio.audioRecord(uint32_t *buffer, uint32_t numSamples){
  uint32_t *pBuf;
  uint32_t bufpos;
  bool initPlay;
  initPlay = gInitPlay;
  if(initPlay == TRUE){
    //gate the acceptance of a record command until we signal audio.ready();
    return FAIL;
  }
  pBuf = gRxBuffer;
  bufpos = gRxBufferPos;
  if( (bufpos != 0) || (pBuf != NULL)){
    //gate acceptance due to ongoing record command
    return FAIL;
  }
  gRxBuffer = buffer;
  gRxBufferPos = 0;
  gRxNumBytes = numSamples * 4;
  call BulkTxRx.BulkReceive((uint8_t *)buffer, ((numSamples*4) > 8188)? 8188: (numSamples*4));
  return SUCCESS;
}

I just came across this question when looking for nesC. Just answering it for whatever it's worth.
If you look at the audioRecord function, numSamples is multiplied by 4 to turn the sample count into a byte count (each 32-bit sample is 4 bytes), so the BUFFERLEN/4 samples requested earlier amount to BUFFERLEN bytes, which is a quarter of gBuffer. Without the full context, I cannot tell why the count is divided by 4 in the first place. My guess is that gBuffer is divided into 4 parts, each holding numSamples samples, so that while the producer is writing to one part, the consumer can read from another.
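For reference, the packing and the size arithmetic described in the question can be sketched in plain C. The helper names (`left_sample`, `right_sample`, `bytes_for_samples`) are hypothetical and not part of the nesC driver, and the sketch assumes the samples are signed 16-bit values, which the question does not state.

```c
#include <stddef.h>
#include <stdint.h>

#define BUFFERLEN 32768

/* Left channel: low 16 bits of the packed word (assumed to be signed). */
static int16_t left_sample(uint32_t word)
{
    return (int16_t)(uint16_t)(word & 0xFFFFu);
}

/* Right channel: high 16 bits of the packed word (assumed to be signed). */
static int16_t right_sample(uint32_t word)
{
    return (int16_t)(uint16_t)(word >> 16);
}

/* Each stereo sample occupies one 32-bit word, i.e. 4 bytes. */
static size_t bytes_for_samples(size_t num_samples)
{
    return num_samples * sizeof(uint32_t);
}
```

With gNumSamples = BUFFERLEN/4, `bytes_for_samples` gives 32768 bytes, while gBuffer itself occupies 131072 bytes.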


8b10b encoder with byte stream output (bits carry): faster bitwise algorithm?

I have written a 8b10b encoder that generates a stream of bytes intended to be sent to a serial transmitter which sends the bytes as-is LSb first.
What I'm doing here is basically lay down groups of 10 bits (encoded from the input stream of bytes) on groups of 8, so a varying number of bits get carried over from one output byte to the next - kind of like in music/rhythm.
The program has been successfully tested, but it is about 4-5x too slow for my application. I think this comes from the fact that every bit has to be looked up in an array. My gut tells me we could make it faster with some sort of rolling mask, but I can't yet see how to do that, even by swapping out the 3D array of booleans for a 2D array of integers.
Any pointers or other ideas?
Here is the code. Please ignore most of the macros and some of the code related to deciding which byte is to be written as this is application-specific.
#include <stdint.h> //for standard portable types such as uint16_t
#define MAX_USB_TRANSFER_SIZE 1016 //Bytes, size of the max payload in a USB transaction. Determined using FT4222_GetMaxTRansferSize()
#define MAX_USB_PACKET_SIZE 62 //Bytes, max size of the payload of a single USB packet
#define MANDATORY_TX_PACKET_BLOCK 5 //Bytes, constant - the minimum number of TX packet bytes that holds a whole number of 10-bit encoded groups (LCM of 8 and 10 bits = 40 bits = 5 bytes)
#define SYNC_CHARS_MAX_INTERVAL 172 //Target number of payload bytes between sync chars. Max is 188 before desynchronisation
#define ROUND_UP(N, S) ((((N) + (S) - 1) / (S)) * (S)) //Macro to round up the integer N to the next multiple of the integer S
#define ROUND_DOWN(N, S) (((N) / (S)) * (S)) //Same, rounding down
#define N_SYNC_CHAR_PAIRS_IN_PCKT(pcktSz) (ROUND_UP(((pcktSz)*1000/(SYNC_CHARS_MAX_INTERVAL+2)),1000)/1000) //Number of sync (K28.5) character/byte pairs in a given packet
#define TX_PAYLOAD_SIZE(pcktSz) (((pcktSz)*4/5)-2*N_SYNC_CHAR_PAIRS_IN_PCKT(pcktSz)) //Size in bytes of the payload data before encoding in a single TX packet
#define DEFAULT_TX_PACKET_SIZE (MAX_TX_PACKET_SIZE-MAX_USB_PACKET_SIZE*MANDATORY_TX_PACKET_BLOCK) //Default size in bytes of a TX packet with some margin
#define MAX_TX_PAYLOAD_SIZE (TX_PAYLOAD_SIZE(MAX_TX_PACKET_SIZE)) //Maximum size in bytes of the payload in a TX packet
#define DEFAULT_TX_PAYLOAD_SIZE (TX_PAYLOAD_SIZE(DEFAULT_TX_PACKET_SIZE))//Default size in bytes of the payload in a TX packet with some margin
//See string descriptors below for definitions. Error codes are individual bits so can be combined.
enum ErrCode
NO_ERR = 0,
char const * const ERR_CODE_DESC[] = {
"No error",
"Invalid size of input data",
"Invalid size of output buffer",
"Input data pointer is NULL",
"Output buffer pointer is NULL"
/** @brief Generates the bytestream to the transmitter by encoding the incoming data using 8b10b encoding
and inserting K28.5 synchronisation characters to maintain the synchronisation with the demodulator (LVDS passthrough mode)
@arg din is a pointer to an allocated array of bytes which contains the data to encode
@arg dinSize is the size of din in bytes. This size must be equal to TX_PAYLOAD_SIZE(doutSize)
@arg dout is a pointer to an allocated array of bytes which is intended to contain the output bytestream to the transmitter
@arg doutSize is the size of dout in bytes. This size must meet the conditions at the top of this function's implementation. Use DEFAULT_TX_PACKET_SIZE if in doubt.
@return error code (c.f. ErrCode) **/
int TX_gen_bytestream(uint8_t *din, uint16_t dinSize, uint8_t *dout, uint16_t doutSize);
Source file:
#include "TX_bytestream_gen.h"
#include <cstddef> //NULL
#define N_BYTE_VALUES (256+1) //256 possible data values + 1 special character (only accessible to this module)
#define N_ENCODED_BITS 10 //Number of bits corresponding to the 8b10b encoding of a byte
//Map the current running disparity, the desired value to encode to the array of encoded bits for 8b10b encoding.
//The Last value is the K28.5 sync character, only accessible to this module
//Notation = MSb to LSb
bool const encodedBits[2][N_BYTE_VALUES][N_ENCODED_BITS] =
//Long table (see appendix)
//New value of the running disparity after encoding with the specified previous running disparity and requested byte value (c.f. above)
bool const encodingDisparity[2][N_BYTE_VALUES] =
//Long table (see appendix)
int TX_gen_bytestream(uint8_t *din, uint16_t dinSize, uint8_t *dout, uint16_t doutSize)
static bool RDp = false; //Running disparity is initially negative
int ret = 0;
//If the output buffer size is not a multiple of the mandatory payload block or of the USB packet size, or if it cannot be held in a single USB transaction
//return an invalid output buffer size error
if(doutSize == 0 || (doutSize % MANDATORY_TX_PACKET_BLOCK) || (doutSize % MAX_USB_PACKET_SIZE) || (doutSize > MAX_TX_PACKET_SIZE)) //Temp
//If the input data size is not consistent with the output buffer size, return the appropriate error code
if(dinSize == 0 || dinSize != TX_PAYLOAD_SIZE(doutSize))
if(din == NULL)
ret |= NULL_DIN_PTR;
if(dout == NULL)
//If everything checks out, carry on
if(ret == NO_ERR)
uint16_t iByteIn = 0; //Index of the byte of input data currently being processed
uint16_t iByteOut = 0; //Index of the output byte currently being written to
uint8_t iBitOut = 0; //Starts with LSb
int16_t nBytesUntilSync = 0; //Countdown of bytes until a sync marker needs to be sent. Cyclic.
//For all output bytes to generate
while(iByteOut < doutSize)
bool sync = false; //Initially this byte is not considered a sync byte (in which case the next byte of data will be processed)
//If the maximum interval between sync characters has been reached, mark the two next bytes as sync bytes and reset the counter
if(nBytesUntilSync <= 0)
sync = true;
if(nBytesUntilSync == -1) //After the second SYNC is written, the counter is reset
//Append bit by bit the encoded data of the byte to write to the output bitstream (carried over from byte to byte) - LSb first
//The byte to write is either the last byte of the encodedBits map (the sync character K28.5) if sync is set, or the next byte of
//input data if it isn't
uint16_t const byteToWrite = (sync?(N_BYTE_VALUES-1):din[iByteIn]);
for(int8_t iEncodedBit = N_ENCODED_BITS-1 ; iEncodedBit >= 0 ; --iEncodedBit, iBitOut++)
//If the current output byte is complete, reset the bit index and select the next one
if(iBitOut >= 8)
iBitOut = 0;
//Effectively sets the iBitOut'th bit of the iByteOut'th byte out to the encoded value of the byte to write
bool bitToWrite = encodedBits[RDp][byteToWrite][iEncodedBit]; //Temp
dout[iByteOut] ^= (-bitToWrite ^ dout[iByteOut]) & (1 << iBitOut);
//The running disparity is also updated as per the standard (to achieve DC balance)
RDp = encodingDisparity[RDp][byteToWrite]; //Update the running disparity
//If sync was not set, this means a byte of the input data has been processed, in which case take the next one in
//Also decrement the synchronisation counter
if(!sync) {
//In any case, decrease the synchronisation counter. Even sync characters decrease it (c.f. top of while loop)
return ret;
#include <iostream>
#include "TX_bytestream_gen.h"
#define PACKET_DURATION 0.000992 //In seconds, time of continuous data stream corresponding to one packet (5MHz output, default packet size)
#define TIME_TO_SIMULATE 10 //In seconds
#include <chrono>
using namespace std;
//Testbench: measure the time taken to simulate TIME_TO_SIMULATE seconds of continuous encoding
int main()
uint8_t toEncode[PAYLOAD_SIZE] = {100}; //Dummy data, doesn't matter
uint8_t out[PACKET_SIZE] = {0};
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
for(unsigned int i = 0 ; i < N_ITERATIONS ; i++)
TX_gen_bytestream(toEncode, PAYLOAD_SIZE, out, PACKET_SIZE);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "Task execution time: " << elapsed_seconds.count()/TIME_TO_SIMULATE*100 << "% (for " << TIME_TO_SIMULATE << "s simulated)\n";
return 0;
Appendix: lookup tables. I don't have enough characters to paste it here, but it looks like so:
bool const encodedBits[2][N_BYTE_VALUES][N_ENCODED_BITS] =
//Running disparity = RD-
//Running disparity = RD+
bool const encodingDisparity[2][N_BYTE_VALUES] =
//Previous running disparity was RD-
//Previous running disparity was RD+
This will be a lot faster if you do everything a byte at a time instead of a bit at a time.
First change the way you store your lookup tables. You should have something like:
// conversion from (RD, byte) to (RD, 10-bit code)
// in each word, the lower 10 bits are the code,
// and bit 10 (the 11th bit) is the new RD
// The first 256 values are for RD -1, the next
// for RD 1
static const uint16_t BYTE_TO_CODE[512] = {
Then you need to change your encoding loop to write a byte at a time. You can use a uint16_t to store the leftover bits from each byte you output.
Something like this (I didn't figure out your sync byte logic, but presumably you can put that in the input or output byte loop):
// returns next isRD1
bool TX_gen_bytestream(uint8_t *dest, const uint8_t *src, size_t src_len, bool isRD1)
{
    // bits generated, but not yet written, LSB first
    uint16_t bits = 0;
    // number of bits in bits
    unsigned numbits = 0;
    // current RD, either 0 or 256
    uint16_t rd = isRD1 ? 256 : 0;
    for (const uint8_t *end = src + src_len; src < end; ++src) {
        // lookup code and next rd
        uint16_t code = BYTE_TO_CODE[rd + *src];
        // new rd from code bit 10
        rd = (code>>2) & 256;
        // store bits
        bits |= (code & (uint16_t)0x03FF) << numbits;
        numbits += 10;
        // write out any complete bytes
        while(numbits >= 8) {
            *dest++ = (uint8_t)bits;
            bits >>=8;
            numbits -= 8;
        }
    }
    // If src_len isn't divisible by 4, then we have some extra bits
    if (numbits) {
        *dest = (uint8_t)bits;
    }
    return !!rd;
}
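The contents of BYTE_TO_CODE are left out above. As a rough sketch only, here is how one table entry could be derived from the question's boolean tables, together with the accumulator-based packing on its own so it can be checked against a known bit pattern. The names `make_table_entry` and `pack_10bit_codes` are hypothetical. The bit order is an assumption taken from the question's loop, which walks iEncodedBit from 9 down to 0 while filling output bits LSb first, so bit j of the packed code is taken from index 9 - j.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Builds one table word: bits 0..9 hold the code in transmission order
 * (assumed: array index 9 is sent first), bit 10 holds the new RD. */
static uint16_t make_table_entry(const bool bits[10], bool next_rd_positive)
{
    uint16_t code = 0;
    for (unsigned j = 0; j < 10; ++j) {
        if (bits[9 - j]) {
            code |= (uint16_t)(1u << j);
        }
    }
    if (next_rd_positive) {
        code |= (uint16_t)(1u << 10);
    }
    return code;
}

/* Packs 10-bit codes into bytes, LSb first. Returns the number of bytes
 * written; a final partial byte is zero-padded in its upper bits.
 * dest must hold at least (n_codes * 10 + 7) / 8 bytes. */
static size_t pack_10bit_codes(uint8_t *dest, const uint16_t *codes, size_t n_codes)
{
    uint32_t acc = 0;    /* bits generated but not yet written */
    unsigned nbits = 0;  /* number of valid bits in acc */
    size_t written = 0;

    for (size_t i = 0; i < n_codes; ++i) {
        acc |= (uint32_t)(codes[i] & 0x03FFu) << nbits;
        nbits += 10;
        while (nbits >= 8) {
            dest[written++] = (uint8_t)(acc & 0xFFu);
            acc >>= 8;
            nbits -= 8;
        }
    }
    if (nbits > 0) {
        dest[written++] = (uint8_t)(acc & 0xFFu);
    }
    return written;
}
```

Four codes always fill exactly five bytes, which is why a source length that is a multiple of 4 leaves no leftover bits.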

MODBUS (RTU mode) CRC calculation: what's wrong? Is it a misprint in the DPS5020 user manual?

I am analyzing the MODBUS protocol (rs232 com port) used in the DPS5020 power supply module and I cannot understand the CRC calculation method in RTU mode (page 3)
In the first example on page 4, for sending the bytes 1, 3, 0, 2, 0, 2, the manual gives CRC = 65CB (hex), with the 2 bytes swapped.
I have also tried several online CRC calculators but can't reproduce that value.
I also worked through the calculation step by step, including the right shifts, but I don't get the expected values.
Is it necessary to use all 6 bytes of the frame for the calculation, or only the 4 data bytes? I have tried both without success.
Could you kindly show step by step how the calculation is done and what the intermediate values are (16-bit XOR with A001, shift right or not, etc.)?
I know that at the end the 2 bytes have to be swapped, but the individual values don't match anyway.
Or is it simply a misprint of the manual?
All bytes in the frame are used in the CRC calculation.
Here is a C implementation of the CRC, which should answer your question about exactly what to shift and exclusive-or when:
#include <stddef.h>
#include <stdint.h>
uint16_t crc16modbus_bit(uint16_t crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0xffff;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (unsigned k = 0; k < 8; k++) {
            crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;
        }
    }
    return crc;
}
(The initial CRC value is returned when called with mem equal to NULL.)
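Applied to the frame from the question, the same algorithm can be wrapped as in the sketch below. The names `modbus_crc16` and `modbus_append_crc` are made up for this illustration. Working through the six bytes 01 03 00 02 00 02 by hand, the CRC register after each byte is 0x807E, 0x2140, 0xF020, 0x1970, 0xE418 and finally 0xCB65. Sent low byte first, that is 65 CB, so the value in the manual is consistent with this calculation.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001. */
static uint16_t modbus_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                      /* XOR the byte into the low 8 bits */
        for (unsigned k = 0; k < 8; k++) {
            /* shift right; XOR with 0xA001 only if the bit shifted out was 1 */
            crc = (crc & 1u) ? (uint16_t)((crc >> 1) ^ 0xA001u)
                             : (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Appends the CRC to a frame, low byte first, as Modbus RTU transmits it.
 * frame must have room for len + 2 bytes. Returns the new length. */
static size_t modbus_append_crc(uint8_t *frame, size_t len)
{
    uint16_t crc = modbus_crc16(frame, len);
    frame[len] = (uint8_t)(crc & 0xFFu);
    frame[len + 1] = (uint8_t)(crc >> 8);
    return len + 2;
}
```

Note that the XOR with 0xA001 happens after the shift and depends on the bit that was shifted out, which is a common place for a hand calculation to go wrong.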

How to exploit double buffering for reading digital inputs state?

I have the following situation. I have a microcontroller which communicates with two external I/O expander chips via one SPI peripheral. Each of the chips has eight digital inputs and is equipped with a latch input, which ensures that both bytes of digital inputs can be sampled at the same instant. To transfer the state of both bytes into my microcontroller I need to do two SPI transactions. At the same time I need to ensure that the software in my microcontroller works with a consistent state of both bytes.
My first idea how to solve this problem was to use sort of double buffer. Below is a pseudocode describing my idea.
uint8_t di_array_01[2] = {0};
uint8_t di_array_02[2] = {0};
uint8_t *ready_data = di_array_01;
uint8_t *shadow_data = di_array_02;
uint8_t *temp;
if(chip_0_data_received) {
    *shadow_data = di_state_chip_0;
    chip_0_data_received = false;
} else if(chip_1_data_received) {
    *(shadow_data + 1) = di_state_chip_1;
    temp = ready_data;
    ready_data = shadow_data;
    shadow_data = temp;
    chip_1_data_received = false;
}
The higher software layer will always work with the content of the array pointed to by the ready_data pointer. My intention is that the boolean flags chip_0_data_received and chip_1_data_received will be set in the "end of transaction" interrupt, and the code above will be invoked from the background loop along with the code for starting the SPI transactions.
Does anybody see any potential problem which I have overlooked?
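If your data is only 16 bits in total, you can read and write it atomically, provided your microcontroller can load and store an aligned 16-bit value in a single instruction (which is typically not the case on 8-bit parts).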
uint16_t combined_data;
// in reading function
if (chip_0_data_received && chip_1_data_received)
combined_data = (((uint16_t)di_state_chip_1 << 8) | di_state_chip_0);
// in using function
uint16_t get_combined_data = combined_data;
uint8_t data_chip_1 = ((get_combined_data >> 8) & 0xFF);
uint8_t data_chip_0 = ((get_combined_data >> 0) & 0xFF);
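If you prefer to keep the double-buffer idea, it could look like the sketch below, which publishes a new pair of bytes with a single index write instead of a pointer swap. The names `di_store` and `di_snapshot` are hypothetical. The sketch assumes chip 1 is always read last in each cycle, and that the reader does not preempt the writer (both run from the background loop). If the reader can run from an interrupt, it would need extra protection.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two 2-byte buffers: one published to readers, one being filled. */
static uint8_t di_buffers[2][2];
static volatile uint8_t ready_index = 0;  /* index of the published buffer */

/* Called from the background loop when a chip's byte has arrived.
 * chip is 0 or 1; chip 1 is assumed to arrive last in each cycle. */
static void di_store(uint8_t chip, uint8_t value)
{
    uint8_t shadow = (uint8_t)(ready_index ^ 1u);
    di_buffers[shadow][chip] = value;
    if (chip == 1u) {
        ready_index = shadow;  /* publish both bytes with one write */
    }
}

/* Returns the published pair: chip 1 in the high byte, chip 0 in the low byte. */
static uint16_t di_snapshot(void)
{
    uint8_t idx = ready_index;
    return (uint16_t)(((uint16_t)di_buffers[idx][1] << 8) | di_buffers[idx][0]);
}
```

A half-updated pair is never visible: after only chip 0 has been stored, `di_snapshot` still returns the previous pair.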

DMA write to SD card (SSP) doesn't write bytes

I'm currently working on replacing a blocking busy-wait implementation of an SD card driver over SSP with a non-blocking DMA implementation. However, there are no bytes actually written, even though everything seems to go according to plan (no error conditions are ever found).
First some code (C++):
(Disclaimer: I'm still a beginner in embedded programming so code is probably subpar)
namespace SD {
bool initialize() {
//Setup SSP and detect SD card
//... (removed since not relevant for question)
//Setup DMA
LPC_SC->PCONP |= (1UL << 29);
LPC_GPDMA->Config = 0x01;
//Enable DMA interrupts
NVIC_SetPriority(DMA_IRQn, 4);
//enable SSP interrupts
NVIC_SetPriority(SSP2_IRQn, 4);
bool write (size_t block, uint8_t const * data, size_t blocks) {
//TODO: support more than one block
ASSERT(blocks == 1);
printf("Request sd semaphore (write)\n");
printf("Writing to block " ANSI_BLUE "%d" ANSI_RESET "\n", block);
memcpy(SD::write_buffer, data, BLOCKSIZE);
//Start the write
uint8_t argument[4];
pack_argument(argument, block);
if (!send_command(CMD::WRITE_BLOCK, CMD_RESPONSE_SIZE::WRITE_BLOCK, response, argument)){
return fail();
//needs 8 clock cycles
//reset pending interrupts
LPC_GPDMA->IntTCClear = 0x01 << SD_DMACH_NR;
LPC_GPDMA->IntErrClr = 0x01 << SD_DMACH_NR;
//Prepare channel
SD_DMACH->CSrcAddr = (uint32_t)SD::write_buffer;
SD_DMACH->CDestAddr = (uint32_t)&SD_SSP->DR;
SD_DMACH->CControl = (uint32_t)BLOCKSIZE
| 0x01 << 26 //source increment
| 0x01 << 31; //Terminal count interrupt
SD_SSP->DMACR = 0x02; //Enable ssp write dma
SD_DMACH->CConfig = 0x1 //enable
| 0x1 << 11 //mem to peripheral
| 0x1 << 14 //enable error interrupt
| 0x1 << 15; //enable terminal count interrupt
return true;
extern "C" __attribute__ ((interrupt)) void DMA_IRQHandler(void) {
printf("dma irq\n");
uint8_t channelBit = 1 << SD_DMACH_NR;
if (LPC_GPDMA->IntStat & channelBit) {
if (LPC_GPDMA->IntTCStat & channelBit) {
printf(ANSI_GREEN "terminal count interrupt\n" ANSI_RESET);
LPC_GPDMA->IntTCClear = channelBit;
if (LPC_GPDMA->IntErrStat & channelBit) {
printf(ANSI_RED "error interrupt\n" ANSI_RESET);
LPC_GPDMA->IntErrClr = channelBit;
SD_DMACH->CConfig = 0;
SD_SSP->IMSC = (1 << 3);
extern "C" __attribute__ ((interrupt)) void SSP2_IRQHandler(void) {
if (SD_SSP->MIS & (1 << 3)) {
SD_SSP->IMSC &= ~(1 << 3);
printf("waiting until idle\n");
while(SD_SSP->SR & (1UL << 4));
//Stop transfer token
//I'm not sure if the part below up until deassert_cs is necessary.
//Adding or removing it made no difference.
uint8_t response;
unsigned int timeout = 4096;
do {
response = SPI::receive();
} while(response != 0x00 && --timeout);
if (timeout == 0){
//Now wait until the device isn't busy anymore
uint8_t response;
unsigned int timeout = 4096;
do {
response = SPI::receive();
} while(response != 0xFF && --timeout);
if (timeout == 0){
A few remarks about the code and setup:
- Written for the lpc4088 with FreeRTOS
- All SD_xxx defines are conditional defines to select the right pins (I need to use SSP2 in my dev setup, SSP0 for the final product)
- All external functions that are not defined in this snippet (e.g. pack_argument, send_command, semaphore.take() etc.) are known to be working correctly (most of these come from the working busy-wait SD implementation; I can't of course guarantee 100% that they are bug-free, but they seem to be working right).
- Since I'm in the process of debugging this, there are a lot of printfs and hardcoded SSP2 variables. These are of course temporary.
- I mostly used this as example code.
Now I have already tried the following things:
- Write without DMA, using busy-wait over SSP. As mentioned before, I started from a working implementation of this, so I know the problem has to be in the DMA implementation and not somewhere else.
- Write from mem->mem instead of mem->sd to eliminate the SSP peripheral. mem->mem worked fine, so the problem must be in the SSP part of the DMA setup.
- Checked whether the ISRs are called. They are: first the DMA ISR is called for the terminal count interrupt, and then the SSP2 ISR is called. So the ISRs are (probably) set up correctly.
- Made a binary dump of the entire SD card content to see whether the content might have been written to the wrong location. Result: the content sent over DMA was not present anywhere on the SD card (I repeated this after every change I made to the code; none of them got the data onto the SD card).
- Added a long (~1-2 seconds) timeout in the SSP ISR by repeatedly requesting bytes from the SD card, to make sure that there wasn't a timing issue (e.g. that I tried to read the bytes before the SD card had the chance to process everything). This didn't change the outcome at all.
Unfortunately, due to a lack of hardware tools, I haven't yet been able to verify whether the bytes are actually sent over the data lines.
What is wrong with my code, or where can I look to find the cause of this problem? After spending way more hours on this than I'd like to admit, I really have no idea how to get this working, and any help is appreciated!
UPDATE: I did a lot more testing and got some more results. The results below come from writing 4 blocks of 512 bytes. Each block contains constantly increasing numbers modulo 256, so each block contains 2 sequences going from 0 to 255. Results:
Data is actually written to the SD card. However, it seems that the first block written is lost. I suppose there is some setup done in the write function that needs to be done earlier.
The bytes are put in a very weird (and wrong) order: I basically get alternating all even numbers followed by all odd numbers. Thus I first get even numbers 0x00 - 0xFE and then all odd numbers 0x01 - 0xFF (total number of written bytes seems to be correct, with the exception of the missing first block). However, there's even one exception in this sequence: each block contains 2 of these sequences (sequence is 256 bytes, block is 512), but the first sequence in each block has 0xfe and 0xff "swapped". That is, 0xFF is the end of the even numbers and 0xFE is the end of the odd series. I have no idea what kind of black magic is going on here. Just in case I've done something dumb here's the snippet that writes the bytes:
uint8_t block[512];
for (int i = 0; i < 512; i++) {
block[i] = (uint8_t)(i % 256);
if (!SD::write(10240, block, 1)) { //this one isn't actually written
WARN("noWrite", proc);
if (!SD::write(10241, block, 1)) {
WARN("noWrite", proc);
if (!SD::write(10242, block, 1)) {
WARN("noWrite", proc);
if (!SD::write(10243, block, 1)) {
WARN("noWrite", proc);
And here is the raw binary dump. Note that this exact pattern is fully reproducible: so far each time I tried this I got this exact same pattern.
Update2: Not sure if it's relevant, but I use sdram for memory.
When I finally got my hands on a logic analyzer I got a lot more information and was able to solve these problems.
There were a few small bugs in my code, but the bug that caused this behaviour was that I didn't send the "start block" token (0xFE) before the block and I didn't send the 16 bit (dummy) crc after the block. When I added these to the transfer buffer everything was written successfully!
The fix was as follows:
bool write (size_t block, uint8_t const * data, size_t blocks) {
//TODO: support more than one block
ASSERT(blocks == 1);
printf("Request sd semaphore (write)\n");
printf("Writing to block " ANSI_BLUE "%d" ANSI_RESET "\n", block);
SD::write_buffer[0] = 0xFE; //start block
memcpy(&SD::write_buffer[1], data, BLOCKSIZE);
SD::write_buffer[BLOCKSIZE + 1] = 0; //dummy crc
SD::write_buffer[BLOCKSIZE + 2] = 0;
As a side note, the reason why the first block wasn't written was simply because I didn't wait until the device was ready before sending the first block. Doing so fixed the problem.
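The framing described above (start block token, 512 data bytes, two CRC bytes) can be isolated in a small helper, sketched here in plain C. The names `sd_build_write_frame` and `sd_data_accepted` are hypothetical and not part of the driver in the question. The DMA transfer count would then presumably have to cover the whole frame (BLOCKSIZE + 3 bytes) rather than BLOCKSIZE, and the transfer buffer needs to be at least that large.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SD_BLOCK_SIZE        512u
#define SD_START_BLOCK_TOKEN 0xFEu
#define SD_WRITE_FRAME_SIZE  (1u + SD_BLOCK_SIZE + 2u)

/* Fills frame (at least SD_WRITE_FRAME_SIZE bytes) with the start token,
 * the data block and a dummy CRC. Returns the number of bytes to send. */
static size_t sd_build_write_frame(uint8_t *frame, const uint8_t *data)
{
    frame[0] = SD_START_BLOCK_TOKEN;
    memcpy(&frame[1], data, SD_BLOCK_SIZE);
    /* Dummy CRC: not checked by the card in SPI mode unless CRC checking
     * has been switched on. */
    frame[1 + SD_BLOCK_SIZE] = 0x00;
    frame[2 + SD_BLOCK_SIZE] = 0x00;
    return SD_WRITE_FRAME_SIZE;
}

/* The card answers a data block with a data response token of the form
 * xxx0sss1; sss == 010 means the data was accepted. */
static int sd_data_accepted(uint8_t response)
{
    return (response & 0x1Fu) == 0x05u;
}
```

Checking the data response token after the block is sent gives an early error indication, before waiting for the card to leave the busy state.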

Designing a fast "rolling window" file reader

I'm writing an algorithm in C++ that scans a file with a "sliding window," meaning it will scan bytes 0 to n, do something, then scan bytes 1 to n+1, do something, and so forth, until the end is reached.
My first algorithm was to read the first n bytes, do something, drop one byte, read a new byte, and repeat. This was very slow (about 100kB/s) because calling ReadFile on the HDD one byte at a time is inefficient.
My second algorithm involves reading a chunk of the file (perhaps n*1000 bytes, meaning the whole file if it's not too large) into a buffer and reading individual bytes off the buffer. Now I get about 10MB/s (decent SSD + Core i5, 1.6GHz laptop).
My question: Do you have suggestions for even faster models?
edit: My big buffer (relative to the window size) is implemented as follows:
- for a rolling window of 5kB, the buffer is initialized to 5MB
- read the first 5MB of the file into the buffer
- the window pointer starts at the beginning of the buffer
- upon shifting, the window pointer is incremented
- when the window pointer nears the end of the 5MB buffer, (say at 4.99MB), copy the remaining 0.01MB to the beginning of the buffer, reset the window pointer to the beginning, and read an additional 4.99MB into the buffer.
- repeat
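The buffering scheme in the list above could be sketched as follows. This is not the removed implementation: it uses portable `fread` instead of ReadFile, refills whenever the buffer is exhausted rather than at a fixed threshold, and the names `scan_windows`, `window_fn` and `sum_first_bytes` are made up for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void (*window_fn)(const unsigned char *window, size_t window_len, void *ctx);

/* Example callback: adds the first byte of each window to *(unsigned long *)ctx. */
static void sum_first_bytes(const unsigned char *window, size_t window_len, void *ctx)
{
    (void)window_len;
    *(unsigned long *)ctx += window[0];
}

/* Calls fn once for every window_len-byte window of the stream, in order.
 * Returns the number of windows visited, or (size_t)-1 on a bad argument
 * or allocation failure. buf_len must be larger than window_len. */
static size_t scan_windows(FILE *fp, size_t window_len, size_t buf_len,
                           window_fn fn, void *ctx)
{
    if (window_len == 0 || buf_len <= window_len) {
        return (size_t)-1;
    }
    unsigned char *buf = malloc(buf_len);
    if (buf == NULL) {
        return (size_t)-1;
    }

    size_t have = 0;   /* number of valid bytes in buf */
    size_t count = 0;
    for (;;) {
        /* top up the buffer behind the bytes carried over */
        have += fread(buf + have, 1, buf_len - have, fp);
        if (have < window_len) {
            break;     /* not enough data left for another window */
        }
        /* slide the window pointer through the buffer */
        for (size_t pos = 0; pos + window_len <= have; ++pos) {
            fn(buf + pos, window_len, ctx);
            ++count;
        }
        /* carry over the last window_len - 1 bytes: the next window starts there */
        size_t keep = window_len - 1;
        memmove(buf, buf + have - keep, keep);
        have = keep;
    }
    free(buf);
    return count;
}
```

A file of N bytes yields N - window_len + 1 windows, and only window_len - 1 bytes are copied per refill, so with a buffer 1000 times the window size the copying cost is small.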
edit 2 - the actual implementation (removed)
Thank you all for the many insightful responses. It was hard to select a "best answer"; they were all excellent and helped with my coding.
I use a sliding window in one of my apps (actually, several layers of sliding windows working on top of each other, but that is outside the scope of this discussion). The window uses a memory-mapped file view via CreateFileMapping() and MapViewOfFile(), and I have an abstraction layer on top of that. I ask the abstraction layer for any range of bytes I need, and it ensures that the file mapping and file view are adjusted accordingly so those bytes are in memory. Every time a new range of bytes is requested, the file view is adjusted only if needed.
The file view is positioned and sized on page boundaries that are even multiples of the system granularity as reported by GetSystemInfo(). Just because a scan reaches the end of a given byte range does not necessarily mean it has reached the end of a page boundary yet, so the next scan may not need to alter the file view at all, the next bytes are already in memory. If the first requested byte of a range exceeds the right-hand boundary of a mapped page, the left edge of the file view is adjusted to the left-hand boundary of the requested page and any pages to the left are unmapped. If the last requested byte in the range exceeds the right-hand boundary of the right-most mapped page, a new page is mapped and added to the file view.
It sounds more complex than it really is to implement once you get into the coding of it:
Creating a View Within a File
It sounds like you are scanning bytes in fixed-sized blocks, so this approach is very fast and very efficient for that. Based on this technique, I can sequentially scan multi-gigabyte files from start to end fairly quickly, usually in a minute or less on my slowest machine. If your files are smaller than the system granularity, or even just a few megabytes, you will hardly notice any time elapsed at all (unless your scans themselves are slow).
Update: here is a simplified variation of what I use:
class FileView
DWORD m_AllocGran;
DWORD m_PageSize;
HANDLE m_File;
unsigned __int64 m_FileSize;
unsigned __int64 m_MapSize;
LPBYTE m_View;
unsigned __int64 m_ViewOffset;
DWORD m_ViewSize;
void CloseMap()
if (m_Map != NULL)
m_Map = NULL;
m_MapSize = 0;
void CloseView()
if (m_View != NULL)
m_View = NULL;
m_ViewOffset = 0;
m_ViewSize = 0;
bool EnsureMap(unsigned __int64 Size)
// do not exceed EOF or else the file on disk will grow!
Size = min(Size, m_FileSize);
if ((m_Map == NULL) ||
(m_MapSize != Size))
// a new map is needed...
ul.QuadPart = Size;
m_Map = CreateFileMapping(m_File, NULL, PAGE_READONLY, ul.HighPart, ul.LowPart, NULL);
if (m_Map == NULL)
return false;
m_MapSize = Size;
return true;
bool EnsureView(unsigned __int64 Offset, DWORD Size)
if ((m_View == NULL) ||
(Offset < m_ViewOffset) ||
((Offset + Size) > (m_ViewOffset + m_ViewSize)))
// the requested range is not already in view...
// round down the offset to the nearest allocation boundary
unsigned __int64 ulNewOffset = ((Offset / m_AllocGran) * m_AllocGran);
// round up the size to the next page boundary
DWORD dwNewSize = ((((Offset - ulNewOffset) + Size) + (m_PageSize-1)) & ~(m_PageSize-1));
// if the new view will exceed EOF, truncate it
unsigned __int64 ulOffsetInFile = (ulNewOffset + dwNewSize);
if (ulOffsetInFile > m_FileSize)
    dwNewSize -= (DWORD)(ulOffsetInFile - m_FileSize);
if ((m_View == NULL) ||
    (m_ViewOffset != ulNewOffset) ||
    (m_ViewSize != dwNewSize))
// a new view is needed...
// make sure the memory map is large enough to contain the entire view
if (!EnsureMap(ulNewOffset + dwNewSize))
return false;
ul.QuadPart = ulNewOffset;
m_View = (LPBYTE) MapViewOfFile(m_Map, FILE_MAP_READ, ul.HighPart, ul.LowPart, dwNewSize);
if (m_View == NULL)
return false;
m_ViewOffset = ulNewOffset;
m_ViewSize = dwNewSize;
return true;
FileView() :
// map views need to be positioned on even multiples
// of the system allocation granularity. let's size
// them on even multiples of the system page size...
SYSTEM_INFO si = {0};
GetSystemInfo(&si); // GetSystemInfo() returns void, so there is no result to test
m_AllocGran = si.dwAllocationGranularity;
m_PageSize = si.dwPageSize;
bool OpenFile(LPTSTR FileName)
if ((m_AllocGran == 0) || (m_PageSize == 0))
return false;
return false;
ul.LowPart = GetFileSize(hFile, &ul.HighPart);
if ((ul.LowPart == INVALID_FILE_SIZE) && (GetLastError() != 0))
return false;
m_File = hFile;
m_FileSize = ul.QuadPart;
return true;
void CloseFile()
m_FileSize = 0;
bool AccessBytes(unsigned __int64 Offset, DWORD Size, LPBYTE *Bytes, DWORD *Available)
if (Bytes) *Bytes = NULL;
if (Available) *Available = 0;
if ((m_FileSize != 0) && (Offset < m_FileSize))
// make sure the requested range is in view
if (!EnsureView(Offset, Size))
return false;
// near EOF, the available bytes may be less than requested
DWORD dwOffsetInView = (Offset - m_ViewOffset);
if (Bytes) *Bytes = &m_View[dwOffsetInView];
if (Available) *Available = min(m_ViewSize - dwOffsetInView, Size);
return true;
FileView fv;
if (fv.OpenFile(TEXT("C:\\path\\file.ext")))
LPBYTE data;
DWORD len;
unsigned __int64 offset = 0, filesize = fv.FileSize();
while (offset < filesize)
if (!fv.AccessBytes(offset, some size here, &data, &len))
break; // error
if (len == 0)
break; // unexpected EOF
// use data up to len bytes as needed...
offset += len;
This code is designed to allow random jumping anywhere in the file at any data size. Since you are reading bytes sequentially, some of the logic can be simplified as needed.
Your new algorithm only pays 0.1% of the I/O inefficiencies... not worth worrying about.
To get further throughput improvement, you should take a closer look at the "do something" step. See whether you can reuse part of the result from an overlapping window. Check cache behavior. Check if there's a better algorithm for the same computation.
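As one illustration of reusing the result from an overlapping window, here is a sketch of a rolling byte sum: when the window slides by one byte, the new sum is the old sum minus the byte that left plus the byte that entered. The function names are hypothetical, and whether the actual "do something" step can be decomposed this way is an assumption, since the question does not say what it computes.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of the first window of n bytes: O(n), done once. */
static uint32_t window_sum_init(const uint8_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

/* O(1) update when the window slides forward by one byte. */
static uint32_t window_sum_slide(uint32_t sum, uint8_t outgoing, uint8_t incoming)
{
    return sum - outgoing + incoming;
}
```

For a 5kB window this replaces about 5000 additions per step with two, which is the kind of saving that matters once I/O is no longer the bottleneck.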
You have the basic I/O technique down. The easiest improvement you can make now is to pick a good buffer size. With some experimentation, you'll find that read performance increases quickly with buffer size until you hit about 16k, then performance begins to level out.
Your next task is probably to profile your code, and see where it is spending its time. When dealing with performance, it is always best to measure rather than guess. You don't mention what OS you're using, so I won't make any profiler recommendations.
You can also try to reduce the amount of copying/moving of data between your buffer and your workspace. Less copying is generally better. If you can process your data in-place instead of moving it to a new location, that's a win. (I see from your edits you're already doing this.)
Finally, if you're processing many gigabytes of archived information then you should consider keeping your data compressed. It will come as a surprise to many people that it is faster to read compressed data and then decompress it than it is to just read decompressed data. My favorite algorithm for this purpose is LZO which doesn't compress as well as some other algorithms, but decompresses impressively fast. This kind of setup is only worth the engineering effort if:
Your job is I/O bound.
You are reading many gigabytes of data.
You're running the program frequently, so it saves you a lot of time to make it run