I am implementing a file system on SPI flash memory using a W25Qxx chip and an STM32F4xx in STM32CubeIDE. I have written the basic I/O for the W25 over SPI and can read and write whole sectors at a time.
In my user_diskio.c I have implemented all of the required I/O functions and have verified that they are properly linked and being called.
In my main.cpp I format the drive using f_mkfs(), then get the free space, and finally open and close a file. However, f_mkfs() keeps returning FR_MKFS_ABORTED. (FF_MAX_SS is set to 16384.)
fresult = FR_NO_FILESYSTEM;
if (fresult == FR_NO_FILESYSTEM)
{
BYTE work[FF_MAX_SS]; // Formats the drive if it has yet to be formatted
fresult = f_mkfs("0:", FM_ANY, 0, work, sizeof work);
}
f_getfree("", &fre_clust, &pfs);
total = (uint32_t)((pfs->n_fatent - 2) * pfs->csize * 0.5);
free_space = (uint32_t)(fre_clust * pfs->csize * 0.5);
fresult = f_open(&fil, "file67.txt", FA_OPEN_ALWAYS | FA_READ | FA_WRITE);
f_puts("This data is from the FILE1.txt. And it was written using ...f_puts... ", &fil);
fresult = f_close(&fil);
fresult = f_open(&fil, "file67.txt", FA_READ);
f_gets(buffer, f_size(&fil), &fil);
f_close(&fil);
Upon investigating my ff.c, it seems that the code is halting on line 5617:
if (fmt == FS_FAT12 && n_clst > MAX_FAT12) return FR_MKFS_ABORTED; /* Too many clusters for FAT12 */
n_clst is calculated a few lines up before some conditional logic, on line 5594:
n_clst = (sz_vol - sz_rsv - sz_fat * n_fats - sz_dir) / pau;
Here is what the debugger reads the variables going in as:
This results in n_clst being set to 4294935040, as it is unsigned, though the actual result of doing the calculations would be -32256 if the variable was signed. As you can imagine, this does not seem to be an accurate calculation.
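The wrapped value can be reproduced in isolation. This is a minimal sketch, assuming n_clst is a 32-bit unsigned variable; the helper name wrap_to_u32 is hypothetical and not part of ff.c:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: mimics storing a negative intermediate result
// in a 32-bit unsigned variable such as n_clst.
std::uint32_t wrap_to_u32(std::int64_t value) {
    // Conversion to an unsigned type is defined modulo 2^32,
    // so a negative input wraps around to a very large number.
    return static_cast<std::uint32_t>(value);
}
```

Calling wrap_to_u32(-32256) gives 4294935040 (that is, 2^32 - 32256), the same number the debugger shows. It suggests that the reserved, FAT and directory areas together were computed as larger than the whole volume.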
The device I am using has 16 Mbit (2 MB) of storage organized in 512 sectors of 4 KB each. The minimum erasable block size is 32 KB. If you need more information on the flash chip I am using, page 5 of this pdf outlines all of the specs.
This is what my USER_ioctl() looks like:
DRESULT USER_ioctl (
BYTE pdrv, /* Physical drive number (0..) */
BYTE cmd, /* Control code */
void *buff /* Buffer to send/receive control data */
)
{
/* USER CODE BEGIN IOCTL */
UINT* result = (UINT*)buff;
HAL_GPIO_WritePin(GPIOE, GPIO_PIN_11, GPIO_PIN_SET);
switch (cmd) {
case GET_SECTOR_COUNT:
result[0] = 512; // Sector and block sizes of
return RES_OK;
case GET_SECTOR_SIZE:
result[0] = 4096;
return RES_OK;
case GET_BLOCK_SIZE:
result[0] = 32768;
return RES_OK;
}
return RES_ERROR;
/* USER CODE END IOCTL */
}
I have tried changing the parameters to f_mkfs(), swapping FM_ANY out for FM_FAT, FM_FAT32, and FM_EXFAT (along with enabling exFAT in my ffconf.h). I have also tried several values for au rather than the default. For more detailed documentation of the f_mkfs() method I am using, check here; there are a few variations of this method floating around.
Here:
fresult = f_mkfs("0:", FM_ANY, 0, work, sizeof work);
The second argument is not valid. It should be a pointer to a MKFS_PARM structure or NULL for default options, as described at http://elm-chan.org/fsw/ff/doc/mkfs.html.
You should have something like:
MKFS_PARM fmt_opt = {FM_ANY, 0, 0, 0, 0};
fresult = f_mkfs("0:", &fmt_opt, work, sizeof work);
except that the default options are unlikely to be appropriate for your media (SPI flash): the filesystem cannot obtain formatting parameters from the media as it would for an SD card, for example. You have to provide the necessary formatting information.
Given your erase block size I would guess:
MKFS_PARM fmt_opt = {FM_ANY, 0, 32768, 0, 0};
but to be clear I have never used the ELM FatFS (which STM32Cube incorporates) with SPI flash - there may be additional issues. I also do not use STM32CubeMX - it is possible I suppose that the version has a different interface, but I would recommend using the latest code from ELM rather than ST's possibly fossilised version.
Another consideration is that FatFs is not particularly suitable for your media due to wear-levelling issues. Also, ELM FatFs has no journalling or check/repair function, so it is not power-fail safe. That is particularly important for non-removable media that you cannot easily back up or repair.
You might consider a file system specifically designed for SPI NOR flash such as SPIFFS, or the power-fail safe LittleFS. Here is an example of LittleFS in STM32: https://uimeter.com/2018-04-12-Try-LittleFS-on-STM32-and-SPI-Flash/
OK, I think the real problem was that the GET_BLOCK_SIZE ioctl call was returning a size in bytes instead of the number of sectors in the block, which is usually 1 for SPI flash.
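To illustrate the units involved, here is a minimal sketch of an ioctl handler that reports the erase block as a sector count. It assumes the geometry from the question (512 sectors of 4096 bytes, 32768-byte erase block) and follows the FatFs documentation, which describes GET_SECTOR_SIZE as filling a WORD and GET_BLOCK_SIZE as filling a DWORD counted in sectors. The type aliases, command codes and the name sketch_ioctl are stand-ins so that the example compiles on its own; a real project takes them from ff.h and diskio.h.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the FatFs types and command codes (assumption: in a real
// project these come from ff.h / diskio.h and must not be redefined).
using BYTE = std::uint8_t;
using WORD = std::uint16_t;
using DWORD = std::uint32_t;
enum DRESULT { RES_OK = 0, RES_ERROR, RES_WRPRT, RES_NOTRDY, RES_PARERR };
constexpr BYTE CTRL_SYNC = 0;
constexpr BYTE GET_SECTOR_COUNT = 1;
constexpr BYTE GET_SECTOR_SIZE = 2;
constexpr BYTE GET_BLOCK_SIZE = 3;

// Geometry taken from the question.
constexpr DWORD kSectorCount = 512;       // sectors on the device
constexpr WORD kSectorSize = 4096;        // bytes per sector
constexpr DWORD kEraseBlockBytes = 32768; // bytes per erase block

// Hypothetical handler: reports the erase block in sectors, not bytes.
DRESULT sketch_ioctl(BYTE /*pdrv*/, BYTE cmd, void *buff) {
    switch (cmd) {
    case CTRL_SYNC:
        return RES_OK; // nothing is cached in this sketch
    case GET_SECTOR_COUNT:
        *static_cast<DWORD *>(buff) = kSectorCount;
        return RES_OK;
    case GET_SECTOR_SIZE:
        *static_cast<WORD *>(buff) = kSectorSize; // 16-bit value
        return RES_OK;
    case GET_BLOCK_SIZE:
        // 32768 / 4096 = 8 sectors per erase block
        *static_cast<DWORD *>(buff) = kEraseBlockBytes / kSectorSize;
        return RES_OK;
    default:
        return RES_PARERR;
    }
}
```

With these numbers GET_BLOCK_SIZE yields 8. If the driver erases single 4 KB sectors instead, setting kEraseBlockBytes to 4096 gives the value of 1 mentioned above.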
I'm developing my own DPDK application and I want received packets to go through several threads in series. Each thread has its own duty of inspecting packets and generating some metadata for each packet. The easiest and most efficient way to transfer packets between threads appears to be rte rings. However, I need to transfer the metadata generated by each thread to the next thread as well. I have tried doing this using an array of structures for the metadata and passing a pointer to the next thread. However, this method proved to be inefficient since I got a lot of cache misses.
As a solution I came up with the idea of putting the metadata generated by each thread into the mbufs themselves. It seems to be doable with the "dynamic fields" of mbufs. However, documentation of this feature seems to be very limited. For my application I wish to use a metadata field inside the dynamic field, something like this:
typedef struct {
uint32_t packet_id;
uint64_t time_stamp;
uint8_t ip_v;
uint32_t length;
.........
.........
} my_metadata_field;
What I don't understand is how much space I can use for a dynamic field. The only thing mentioned about this in the DPDK documentation is:
"10.6.1. Dynamic fields and flags
The size of the mbuf is constrained and limited; while the amount of
metadata to save for each packet is quite unlimited. The most basic
networking information already find their place in the existing mbuf
fields and flags.
If new features need to be added, the new fields and flags should fit
in the “dynamic space”, by registering some room in the mbuf
structure:
dynamic field -
named area in the mbuf structure, with a given size (at least 1 byte) and alignment constraint."
which doesn't make much sense to me. How much memory do I have for this field? If it's almost unlimited, what are the tradeoffs (performance-wise) if I use a large metadata field?
I use DPDK 20.08.
Edit:
After some digging I have abandoned the idea of using a dynamic field for metadata because of the lack of documentation and because it doesn't appear to be able to hold more than 64 bits.
I am looking for an easy way to embed my metadata inside cache aligned mbufs (preferably using a struct like above) so I can use rte rings to share them between threads. I'm looking for any documentation or reference project for me to begin with.
There are a few ways to carry metadata along with an mbuf. The options are:
1. In rte_mempool_create, pass the custom metadata size as private_data_size instead of 0.
2. In rte_pktmbuf_pool_create, pass the custom metadata size as priv_size instead of 0.
3. If the size of the metadata is less than 128 bytes, use a typecast to access the memory area right after the rte_mbuf.
4. If no external buffers are used in the DPDK application, update the rte_mbuf shinfo or next field.
Solution 1: rte_mempool_create("FIPS_SESS_PRIV_MEMPOOL", 16, sess_sz, 0, sizeof(my_metadata_field), NULL, NULL, NULL, NULL, rte_socket_id(), 0);
Solution 2: rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS * nb_ports, MBUF_CACHE_SIZE, sizeof(my_metadata_field), RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
Solution 3:
struct rte_mbuf *bufs[BURST_SIZE];
const uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
if (unlikely(nb_rx == 0))
continue;
for (int index = 0; index < nb_rx; index++)
{
assert(sizeof(my_metadata_field) <= RTE_CACHE_LINE_SIZE);
my_metadata_field *ptr = (my_metadata_field *)(bufs[index] + 1); /* area right after the rte_mbuf header */
...
...
...
}
Solution 4:
struct rte_mempool *privdata_ptr = rte_mempool_create("METADATA_POOL", 16 * 1024, sizeof(my_metadata_field), 0, 0,
NULL, NULL, NULL, NULL, rte_socket_id(), 0);
struct rte_mbuf *bufs[BURST_SIZE];
const uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
if (unlikely(nb_rx == 0))
continue;
for (int index = 0; index < nb_rx; index++)
{
void *msg = NULL;
if (0 == rte_mempool_get(privdata_ptr, &msg))
{
assert(msg != NULL);
bufs[index]->shinfo = msg;
continue;
}
/* free the mbuf as we are not able to retrieve the private data */
}
/* before transmit or pkt free ensure to release object back to mempool via rte_mempool_put */
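The "memory right after the mbuf header" idea behind Solutions 2 and 3 can be sketched without DPDK. Everything in this block is a stand-in: FakeMbuf is not struct rte_mbuf, Slot imitates a pool element that has a private area reserved behind the header, and metadata_of is a hypothetical accessor. The 64-byte alignment is an assumption chosen to resemble a cache-aligned header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Stand-in for the fixed-size, cache-aligned mbuf header (NOT rte_mbuf).
struct alignas(64) FakeMbuf {
    std::uint16_t priv_size;
    std::uint16_t data_len;
};

// Metadata layout taken from the question (elided fields omitted).
struct MyMetadata {
    std::uint32_t packet_id;
    std::uint64_t time_stamp;
    std::uint8_t ip_v;
    std::uint32_t length;
};

// Stand-in for one pool element: header followed by the private area.
struct Slot {
    FakeMbuf header;
    unsigned char priv[64];
};

static_assert(sizeof(MyMetadata) <= sizeof(Slot::priv),
              "metadata must fit into the reserved private area");
static_assert(offsetof(Slot, priv) == sizeof(FakeMbuf),
              "private area must start directly after the header");

// Hypothetical accessor: the private area begins right after the header.
inline MyMetadata *metadata_of(FakeMbuf *m) {
    return reinterpret_cast<MyMetadata *>(
        reinterpret_cast<unsigned char *>(m) + sizeof(FakeMbuf));
}
```

Because the metadata lives in the same allocation as the header, it travels with the packet pointer through a ring, and no second lookup is needed on the receiving thread. This only holds when the pool was created with a non-zero private size, as in Solution 2.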
I'm building a graphics engine, and I need to write the resulting image to a .bmp file. I'm storing the pixels in a vector<Color>, along with the width and the height of the image. Currently I'm writing the image as follows (I didn't write this code myself):
std::ostream &img::operator<<(std::ostream &out, EasyImage const &image) {
//temporarily enable exceptions on output stream
enable_exceptions(out, std::ios::badbit | std::ios::failbit);
//declare some struct-vars we're going to need:
bmpfile_magic magic;
bmpfile_header file_header;
bmp_header header;
uint8_t padding[] =
{0, 0, 0, 0};
//calculate the total size of the pixel data
unsigned int line_width = image.get_width() * 3; //3 bytes per pixel
unsigned int line_padding = 0;
if (line_width % 4 != 0) {
line_padding = 4 - (line_width % 4);
}
//lines must be aligned to a multiple of 4 bytes
line_width += line_padding;
unsigned int pixel_size = image.get_height() * line_width;
//start filling the headers
magic.magic[0] = 'B';
magic.magic[1] = 'M';
file_header.file_size = to_little_endian(pixel_size + sizeof(file_header) + sizeof(header) + sizeof(magic));
file_header.bmp_offset = to_little_endian(sizeof(file_header) + sizeof(header) + sizeof(magic));
file_header.reserved_1 = 0;
file_header.reserved_2 = 0;
header.header_size = to_little_endian(sizeof(header));
header.width = to_little_endian(image.get_width());
header.height = to_little_endian(image.get_height());
header.nplanes = to_little_endian(1);
header.bits_per_pixel = to_little_endian(24);//3bytes or 24 bits per pixel
header.compress_type = 0; //no compression
header.pixel_size = to_little_endian(pixel_size); //converted like the other multi-byte fields
header.hres = to_little_endian(11811); //11811 pixels/meter or 300dpi
header.vres = to_little_endian(11811); //11811 pixels/meter or 300dpi
header.ncolors = 0; //no color palette
header.nimpcolors = 0;//no important colors
//okay that should be all the header stuff: let's write it to the stream
out.write((char *) &magic, sizeof(magic));
out.write((char *) &file_header, sizeof(file_header));
out.write((char *) &header, sizeof(header));
//okay let's write the pixels themselves:
//they are arranged left->right, bottom->top, b,g,r
// this is the main bottleneck
for (unsigned int i = 0; i < image.get_height(); i++) {
//loop over all lines
for (unsigned int j = 0; j < image.get_width(); j++) {
//loop over all pixels in a line
//we cast &color to char*. since the color fields are ordered blue,green,red they should be written automatically
//in the right order
out.write((char *) &image(j, i), 3 * sizeof(uint8_t));
}
if (line_padding > 0)
out.write((char *) padding, line_padding);
}
//okay we should be done
return out;
}
As you can see, the pixels are written one by one. This is quite slow: I put some timers in my program and found that the writing was my main bottleneck.
I tried to write entire (horizontal) lines, but I could not find out how to do it (the best I found was this).
Secondly, I wanted to write to the file using multithreading with OpenMP (I'm not sure whether I need threads or processes). But that means I need to specify which byte address to write to, I think, which I couldn't solve.
Lastly, I thought about immediately writing to the file whenever I drew an object, but then I had the same issue with writing to specific locations in the file.
So, my question is: what is the best (fastest) way to tackle this problem? (I'm compiling this for Windows and Linux.)
The fastest method to write to a file is to use hardware assist. Write your output to memory (a.k.a. buffer), then tell the hardware device to transfer from memory to the file (disk).
The next fastest method is to write all the data to a buffer then block write the data to the file. If you want other tasks or threads to execute during your writing, then create a thread that writes the buffer to the file.
When writing to a file, the more data per transaction, the more efficient the write will be. For example, 1 write of 1024 bytes is faster than 1024 writes of one byte.
The idea is to keep the data streaming. Slowing down the transfer rate may be faster than a burst write, delay, burst write, delay, etc.
Remember that the disk is essentially a serial device (unless you have a special hard drive). Bits are laid down on the platters using a bit stream. Writing data in parallel will have adverse effects because the head will have to be moved between the parallel activities.
Remember that if you use more than one core, there will be more traffic on the data bus. The transfer to the file will have to pause while other threads/tasks are using the data bus. So, if you can, block all tasks, then transfer your data. :-)
I've written programs that copy from slow memory to fast memory, then transferred from fast memory to the hard drive. That was also using interrupts (threads).
Summary
Fast writing to a file involves:
Keep the data streaming; minimize the pauses.
Write in binary mode (no translations, please).
Write in blocks (format into memory as necessary before writing the block).
Maximize the data in a transaction.
Use a separate writing thread if you want other tasks running "concurrently".
The hard drive is a serial device, not parallel. Bits are written to the platters in a serial stream.
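Applied to the BMP writer from the question, "format into memory, then write in one block" might look like the sketch below. It assumes the pixels sit in a row-major vector in the order the file expects (bottom line first, blue/green/red per pixel). Bgr and write_pixels_buffered are hypothetical names, not part of the original EasyImage code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream> // only needed for an in-memory check of the output
#include <vector>

// Hypothetical pixel type in the on-disk channel order.
struct Bgr {
    std::uint8_t blue, green, red;
};

// Packs every row (padded to a multiple of 4 bytes) into one buffer and
// hands it to the stream with a single write call.
void write_pixels_buffered(std::ostream &out, const std::vector<Bgr> &pixels,
                           std::size_t width, std::size_t height) {
    const std::size_t row_bytes = width * 3;
    const std::size_t padding = (4 - row_bytes % 4) % 4;
    const std::size_t stride = row_bytes + padding;

    std::vector<char> buffer(stride * height, 0); // padding bytes stay zero
    for (std::size_t y = 0; y < height; ++y) {
        char *dst = buffer.data() + y * stride;
        for (std::size_t x = 0; x < width; ++x) {
            const Bgr &p = pixels[y * width + x];
            *dst++ = static_cast<char>(p.blue);
            *dst++ = static_cast<char>(p.green);
            *dst++ = static_cast<char>(p.red);
        }
    }
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
}
```

The headers can be written as before. Only the per-pixel out.write calls are replaced by one large write, which trades width * height stream calls for a temporary buffer the size of the pixel data.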
I've noticed that, as apparently documented, IMFTransform::ProcessOutput() for a resampler can only output one sample per call! I guess it's more oriented toward video coding with large frame sizes. Given that all the code I have been looking at as reference for audio playback allocates one IMFMediaBuffer per call of ProcessOutput, this seems like a slightly insane and terrible architecture, unless I am missing something?
It is especially bad from the point of view of media buffer usage. For example, a SourceReader decoding my test MP3 gives me chunks of about 64 KB in one sample with one buffer, which is sensible. But GetOutputStreamInfo() requests a media buffer of just 24 bytes per call to ProcessOutput().
64 KB chunks => chopped into many 24 B chunks => passed on to further processing seems like very daft overhead (the resampler would incur overhead for every 24 bytes, and impose that overhead further down the pipeline if the data is not consolidated).
From https://learn.microsoft.com/en-us/windows/win32/api/mftransform/nf-mftransform-imftransform-processoutput
It says:
The MFT cannot return more than one sample per stream in a single call to ProcessOutput
The MFT writes the output data to the start of the buffer, overwriting any data that already exists in the buffer
So it can't even append to the end of a partially full buffer attached to the sample.
I could create my own pooling object that supports the media buffer interface but bumps a pointer into a plain locked media buffer, I guess. The only other option seems to be to lock and copy those 24 bytes into another, larger buffer for processing. But this all seems excessive, and at the wrong granularity.
What is the best way to deal with this?
Here is a simplified sketch of my test so far:
...
status = transform->ProcessInput(0, sample, 0);
sample->Release();
while(1)
{
MFT_OUTPUT_STREAM_INFO outDetails{};
MFT_OUTPUT_DATA_BUFFER outData{};
IMFMediaBuffer* outBuffer;
IMFSample* outSample;
DWORD outStatus;
status = transform->GetOutputStreamInfo(0, &outDetails);
status = MFCreateAlignedMemoryBuffer(outDetails.cbSize, outDetails.cbAlignment, &outBuffer);
status = MFCreateSample(&outSample);
status = outSample->AddBuffer(outBuffer);
outBuffer->Release();
outData.pSample = outSample;
status = transform->ProcessOutput(0, 1, &outData, &outStatus);
if (status == MF_E_TRANSFORM_NEED_MORE_INPUT)
break;
...
}
I wrote some code for you to show that the audio resampler is capable of processing large audio blocks at once. This is a good, efficient processing style:
winrt::com_ptr<IMFTransform> Transform;
winrt::check_hresult(CoCreateInstance(CLSID_CResamplerMediaObject, nullptr, CLSCTX_ALL, IID_PPV_ARGS(Transform.put())));
WAVEFORMATEX InputWaveFormatEx { WAVE_FORMAT_PCM, 1, 44100, 44100 * 2, 2, 16 };
WAVEFORMATEX OutputWaveFormatEx { WAVE_FORMAT_PCM, 1, 48000, 48000 * 2, 2, 16 };
winrt::com_ptr<IMFMediaType> InputMediaType;
winrt::check_hresult(MFCreateMediaType(InputMediaType.put()));
winrt::check_hresult(MFInitMediaTypeFromWaveFormatEx(InputMediaType.get(), &InputWaveFormatEx, sizeof InputWaveFormatEx));
winrt::com_ptr<IMFMediaType> OutputMediaType;
winrt::check_hresult(MFCreateMediaType(OutputMediaType.put()));
winrt::check_hresult(MFInitMediaTypeFromWaveFormatEx(OutputMediaType.get(), &OutputWaveFormatEx, sizeof OutputWaveFormatEx));
winrt::check_hresult(Transform->SetInputType(0, InputMediaType.get(), 0));
winrt::check_hresult(Transform->SetOutputType(0, OutputMediaType.get(), 0));
MFT_OUTPUT_STREAM_INFO OutputStreamInfo { };
winrt::check_hresult(Transform->GetOutputStreamInfo(0, &OutputStreamInfo));
_A(!(OutputStreamInfo.dwFlags & MFT_OUTPUT_STREAM_SINGLE_SAMPLE_PER_BUFFER));
DWORD const InputMediaBufferSize = InputWaveFormatEx.nAvgBytesPerSec;
winrt::com_ptr<IMFMediaBuffer> InputMediaBuffer;
winrt::check_hresult(MFCreateMemoryBuffer(InputMediaBufferSize, InputMediaBuffer.put()));
winrt::check_hresult(InputMediaBuffer->SetCurrentLength(InputMediaBufferSize));
winrt::com_ptr<IMFSample> InputSample;
winrt::check_hresult(MFCreateSample(InputSample.put()));
winrt::check_hresult(InputSample->AddBuffer(InputMediaBuffer.get()));
winrt::check_hresult(Transform->ProcessInput(0, InputSample.get(), 0));
DWORD const OutputMediaBufferCapacity = OutputWaveFormatEx.nAvgBytesPerSec;
winrt::com_ptr<IMFMediaBuffer> OutputMediaBuffer;
winrt::check_hresult(MFCreateMemoryBuffer(OutputMediaBufferCapacity, OutputMediaBuffer.put()));
winrt::check_hresult(OutputMediaBuffer->SetCurrentLength(0));
winrt::com_ptr<IMFSample> OutputSample;
winrt::check_hresult(MFCreateSample(OutputSample.put()));
winrt::check_hresult(OutputSample->AddBuffer(OutputMediaBuffer.get()));
MFT_OUTPUT_DATA_BUFFER OutputDataBuffer { 0, OutputSample.get() };
DWORD Status;
winrt::check_hresult(Transform->ProcessOutput(0, 1, &OutputDataBuffer, &Status));
DWORD OutputMediaBufferSize = 0;
winrt::check_hresult(OutputMediaBuffer->GetCurrentLength(&OutputMediaBufferSize));
You can see that after feeding one second of input, the output holds [almost] one second of data as expected.
Currently, I am working on a real-time interface in Visual Studio C++.
The problem I am facing is that while the buffer is being filled with data, the .exe stops responding at the point where data is stored in the buffer. I collect data at 130 Hz from a motion sensor. I have tried increasing the computer's virtual memory, but that did not solve the problem.
Code Structure:
int main(){
int no_data = 0;
float x_abs;
float y_abs;
int sensorID = 0;
while (1){
// Define Buffer
char before_trial_output_data[][8 * 4][128] = { { { 0, }, }, };
// Collect Real Time Data
x_abs = abs(inchtocm * record[sensorID].y);
y_abs = abs(inchtocm * record[sensorID].x);
//Save in buffer
sprintf(before_trial_output_data[no_data][sensorID], "%d %8.3f %8.3f\n",no_data,x_abs,y_abs);
//Increment point
no_data++;
// Break While loop, Press ESc key
if (GetAsyncKeyState(VK_ESCAPE)){
break;
}
}
//Data Save in File
printf("\nSaving results to 'RecordData.txt'..\n");
FILE *fp3 = fopen("RecordData.dat", "w");
for (i = 0; i<no_data-1; i++)
fprintf(fp3, output_data[i][sensorID]);
fclose(fp3);
printf("Complete...\n");
}
The code you posted doesn't show how you allocate more memory for your before_trial_output_data buffer when needed. Do you want me to guess? I guess you are using some flavor of realloc(), which needs to allocate ever-increasing amount of memory, fragmenting your heap terribly.
However, in order to save that data to a file later on, it doesn't need to be in contiguous memory, so some kind of list will work much better than an array.
Also, there is no provision in your "pseudo" code for a 130Hz reading; it processes records as fast as possible, and my guess is - much faster.
Is your prinf() call also a "pseudo code"? Otherwise you are looking for trouble by having mismatch of the % format specifications and number and type of parameters passed in.
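A minimal sketch of the "list instead of one growing array" idea, assuming each reading is formatted the same way as in the question. std::deque is used here as one possible container that grows in chunks without relocating earlier entries; std::list would also work. The names format_record and save_records are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <deque>
#include <string>

// Formats one reading with the layout used by the question's sprintf call.
std::string format_record(int index, float x_abs, float y_abs) {
    char line[128];
    std::snprintf(line, sizeof line, "%d %8.3f %8.3f\n", index, x_abs, y_abs);
    return std::string(line);
}

// Writes all collected lines to a file once acquisition has finished.
bool save_records(const std::deque<std::string> &records, const char *path) {
    std::FILE *fp = std::fopen(path, "w");
    if (fp == nullptr) {
        return false;
    }
    for (const std::string &record : records) {
        std::fputs(record.c_str(), fp); // fputs does not interpret '%'
    }
    return std::fclose(fp) == 0;
}
```

Inside the acquisition loop, each sample would be appended with records.push_back(format_record(no_data, x_abs, y_abs)), and save_records would be called after the loop ends. Pacing the loop to 130 Hz is a separate concern that this sketch does not address.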
I'm trying to use the Linux system call sendfile() to copy a file using threads.
I'm interested in optimizing these parts of the code:
fseek(fin, size * (number) / MAX_THREADS, SEEK_SET);
fseek(fout, size * (number) / MAX_THREADS, SEEK_SET);
/* ... */
fwrite(buff, 1, len, fout);
Code:
void* FileOperate::FileCpThread::threadCp(void *param)
{
Info *ft = (Info *)param;
FILE *fin = fopen(ft->fromfile, "r+");
FILE *fout = fopen(ft->tofile, "w+");
int size = getFileSize(ft->fromfile);
int number = ft->num;
fseek(fin, size * (number) / MAX_THREADS, SEEK_SET);
fseek(fout, size * (number) / MAX_THREADS, SEEK_SET);
char buff[1024] = {'\0'};
int len = 0;
int total = 0;
while((len = fread(buff, 1, sizeof(buff), fin)) > 0)
{
fwrite(buff, 1, len, fout);
total += len;
if(total > size/MAX_THREADS)
{
break;
}
}
fclose(fin);
fclose(fout);
return NULL; /* a void* thread function must return a value */
}
File copying is not CPU bound; if it were, you would likely find that the limitation is at the kernel level, and nothing you can do at the user level would parallelize it.
Such "improvements" done on mechanical drives will in fact degrade the throughput. You're wasting time seeking along the file instead of reading and writing it.
If the file is long and you don't expect to need the read or written data anytime soon, it might be tempting to use the O_DIRECT flag on open. That's a bad idea, since the O_DIRECT API is essentially broken by design.
Instead, you should use posix_fadvise on both source and destination files, with POSIX_FADV_SEQUENTIAL and POSIX_FADV_NOREUSE flags. After the write (or sendfile) call is finished, you need to advise that the data is not needed anymore - pass POSIX_FADV_DONTNEED. That way the page cache will only be used to the extent needed to keep the data flowing, and the pages will be recycled as soon as the data has been consumed (written to disk).
The sendfile will not push file data over to the user space, so it further relaxes some of the pressure from memory and processor cache. That's about the only other sensible improvement you can make for copying of files that's not device-specific.
Choosing a sensible chunk size is also desirable. Given that modern drives push over 100 MB/s, you might want to push a megabyte at a time, and always a multiple of the 4096-byte page size. Thus (4096*256) is a decent starting chunk size to handle in a single sendfile or read/write call.
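Put together, the advice above could be sketched as a single-threaded, Linux-only copy routine. The function name copy_file_sendfile is hypothetical; the posix_fadvise calls are hints whose return values are deliberately ignored, and error handling is reduced to a bool (no retry on EINTR).

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Copies src to dst with sendfile() in 1 MiB chunks (Linux only).
bool copy_file_sendfile(const char *src, const char *dst) {
    const size_t kChunk = 4096 * 256; // multiple of the 4096-byte page size

    int in_fd = open(src, O_RDONLY);
    if (in_fd < 0) {
        return false;
    }
    struct stat st;
    if (fstat(in_fd, &st) != 0) {
        close(in_fd);
        return false;
    }
    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd < 0) {
        close(in_fd);
        return false;
    }

    // Advisory hints: sequential access, data will not be reused.
    posix_fadvise(in_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    posix_fadvise(in_fd, 0, 0, POSIX_FADV_NOREUSE);
    posix_fadvise(out_fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    off_t offset = 0; // sendfile advances this as it reads from in_fd
    bool ok = true;
    while (offset < st.st_size) {
        const off_t start = offset;
        const ssize_t sent = sendfile(out_fd, in_fd, &offset, kChunk);
        if (sent <= 0) {
            ok = false; // error, or the source shrank while copying
            break;
        }
        // The source pages of this chunk are no longer needed.
        posix_fadvise(in_fd, start, sent, POSIX_FADV_DONTNEED);
    }

    close(in_fd);
    if (close(out_fd) != 0) {
        ok = false;
    }
    return ok;
}
```

The data never passes through a user-space buffer, so the 1024-byte fread/fwrite loop and the per-thread seeking from the question disappear entirely.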
Read parallelization, as you propose it, only makes sense on RAID 0 volumes, and only when both the input and output files straddle the physical disks. You can then have one thread per the lesser of the number of source and destination volume physical disks straddled by the file. That's only necessary if you're not using asynchronous file I/O. With async I/O you wouldn't need more than one thread anyway, especially not if the chunk sizes are large (megabyte+) and the single-thread latency penalty is negligible.
There's no sense for parallelization of a single file copy on SSDs, unless you were on some very odd system indeed.