What's the smallest possible file size on disk? - c++

I'm trying to find a way to store a binary file in its smallest size on disk. I'm reading each vehicle's VIN and plate number from a database; together they are 30 bytes. When I put them in a .txt file and save it, the file size is 30 B but its size on disk is 4 KB, which means that saving 100,000 files or more would waste a lot of storage space.
So my question is: how can I write these 30 B to an individual binary file so that it takes the smallest possible size on disk, and what is the smallest possible on-disk size for 30 B, including other information such as the file name and permissions?
Note: I do not want to store this text in a database; I just want to make separate binary files.

the smallest size of a file is always the cluster size of your disk, which is typically 4k. for data like this, having many records in a single file is really the only reasonable solution.
although another possibility would be to store those files in an archive, a zip file for example. under windows you can even access the zip contents pretty similar to ordinary files in explorer.
another creative possibility: store all the data in the filename only. a zero byte file takes only 1024 bytes in the MFT. (assuming NTFS)
edit: reading up on resident files, i found that on the newer 4k sector drives, the MFT entry is actually 4k, too. so it doesn't get smaller than this, whether the data size is 0 or not.
another edit: huge directories, with tens or hundreds of thousands of entries, will become quite unwieldy. don't try to open one in explorer, or be prepared to go drink a coffee while it loads.
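As a rough illustration of the "many records in a single file" suggestion above, here is a minimal sketch that keeps fixed-size 30-byte records in one binary file and reads any of them back by index. The names `make_record`, `append_record` and `read_record`, and the zero-padded layout, are assumptions made for this example, not anything from the question.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical layout: every record is exactly 30 bytes, zero-padded.
constexpr std::size_t kRecordSize = 30;
using Record = std::array<char, kRecordSize>;

// Build a zero-padded record from a shorter string.
Record make_record(const std::string& text) {
    assert(text.size() <= kRecordSize);
    Record rec{};  // value-initialized, so all bytes start as zero
    std::copy(text.begin(), text.end(), rec.begin());
    return rec;
}

// Append one record to the end of the shared file.
bool append_record(const std::string& path, const Record& rec) {
    std::ofstream out(path, std::ios::binary | std::ios::app);
    out.write(rec.data(), static_cast<std::streamsize>(rec.size()));
    out.flush();
    return out.good();
}

// Read record number `index` (0-based) by seeking to index * kRecordSize.
bool read_record(const std::string& path, std::size_t index, Record& rec) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(index * kRecordSize));
    in.read(rec.data(), static_cast<std::streamsize>(rec.size()));
    return in.gcount() == static_cast<std::streamsize>(rec.size());
}
```

With this layout, 100,000 records take about 3 MB in a single file, and the file name plus the record index replace the per-record file name.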

Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
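To put numbers on this, a small sketch of the rounding involved, assuming the file's data is not stored inline in the file system's own metadata (the function name `size_on_disk` is made up for this example):

```cpp
#include <cassert>
#include <cstdint>

// Space a file's data occupies when it is allocated in whole clusters.
// Assumption: nothing is stored inline in file-system metadata.
std::uint64_t size_on_disk(std::uint64_t file_size, std::uint64_t cluster_size) {
    assert(cluster_size > 0);
    // Round up to the next multiple of the cluster size.
    return (file_size + cluster_size - 1) / cluster_size * cluster_size;
}
```

With 4096-byte clusters, a 30-byte file occupies 4096 bytes, so 100,000 such files occupy 409,600,000 bytes for 3,000,000 bytes of payload.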

You should consider using an indexed file library like gdbm: it associates arbitrary data with an arbitrary key. You won't spend a file on each association (there is only a single file for all of them).
You should reconsider your opposition to "databases". SQLite is a library giving you SQL and database abilities. And there are NoSQL databases like MongoDB.
Of course, all this is horribly operating system and file system specific (but gdbm and SQLite should work on many systems).
AFAIU, you can configure and use both gdbm and SQLite to store millions of entries of a few dozen bytes each quite efficiently.

On file systems you have the same problem. The smallest allocation is one data block plus an inode. For example, in IBM JFS2 the smallest block size is 4 KB, and an inode has to be allocated as well. The second problem is that you will write many files in a short time. Writing many inodes in a short time causes performance problems.
Every write operation must be journaled and committed, unless you use an old, non-journaled file system.
One idea is to group many of your data records, put a separator between them, and write 200-1000 of them to one file.
For example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them by file name, using sequence numbers or similar.
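A minimal sketch of how such an index could work, assuming 30-byte records, a 2-byte ";;" separator, 1000 records per file, and a made-up naming scheme `batch_<n>.dat` (all of these are illustrative choices, not part of the answer):

```cpp
#include <cstddef>
#include <string>

// Assumed sizes for this sketch.
constexpr std::size_t kRecordSize = 30;       // bytes of payload per record
constexpr std::size_t kSeparatorSize = 2;     // the ";;" between records
constexpr std::size_t kRecordsPerFile = 1000; // records grouped into one file

struct Location {
    std::string file_name;  // which batch file holds the record
    std::size_t offset;     // byte offset of the record inside that file
};

// Map a global record sequence number to its batch file and offset.
Location locate(std::size_t record_number) {
    const std::size_t batch = record_number / kRecordsPerFile;
    const std::size_t slot = record_number % kRecordsPerFile;
    return {"batch_" + std::to_string(batch) + ".dat",
            slot * (kRecordSize + kSeparatorSize)};
}
```

Because every record has the same length, the separator is not strictly needed for lookup; it only makes the file easier to inspect.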

Related

Most efficient file type for random data access

I am writing a password generation program. I have collected a list of around 30,000 English words and plan on picking from them at random by index.
Currently, I have all the words in a .txt file each separated by a newline character and organized by length.
My current plan is to write the program in C++ because that is the language I am most comfortable in so I could just load the entire file into memory, but that seems incredibly sloppy.
What would be a more efficient way (or file type like JSON if necessary) to do this? Thanks
30,000 words sounds like an insignificant amount of data to load. Even if it's ~50-500MB just load it in and forget about it.
On a modern system this will take a fraction of a second to accomplish the first time, any SSD can do ~600MB/s+, and even less once it's in the OS disk buffer.
You'd only concern yourself with not loading it if you've got a file too big to fit in memory.

FatFS - can I create multiple seek locations?

I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application consists of logging data to a data format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers. The number of buffers depends on how fast I acquire the data: log a batch of one buffer . . . do other stuff . . . log a batch of five buffers . . . do other stuff . . . etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then before I write the batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between header-location and data-location?
I am open to modifying FatFS.
The problem, at the low-level, is simpler because the header only shares one 512 byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between header-location and data-location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly; you need to save the buffer and the header (footer?), and update the directory entry to reflect the new file size, and update the file allocation table to account for sectors allocated; and you need to write to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
1) pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
2) create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
3) atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
4) free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two "cluster chains" and alternate between them; such that one chain of clusters is for the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).

Remove beginning of file without rewriting the whole file

I have an embedded Linux system, that stores data in a very large file, appending new data to the end. As the file size grows near filling available storage space, I need to remove oldest data.
Problem is, I can't really accept the disruption it would take to move the massive bulk of data "up" the file, like normal - lock the file for an extended period of time just to rewrite it (plus this being a flash medium, it would cause unnecessary wear to the flash).
Probably the easiest way would be to split the file into multiple smaller ones, but this has several downsides related to how the data is handled and processed - all the 'client end' software expects single file. OTOH it can handle 'corruption' of having the first record cut in half, so the file doesn't need to be trimmed at record offsets, just 'somewhere up there', e.g. first few iNodes freed. Oldest data is obsolete anyway so even more severe corruption of the beginning of the file is completely acceptable, as long as the 'tail' remains clean, and liberties can be taken with how much exactly is removed - 'roughly several first megabytes' is okay, no need for 'first 4096KB exactly' precision.
Is there some method, API, trick, hack to truncate beginning of file like that?
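You can achieve this with Linux kernel 3.15 or later on an ext4 or XFS file system. Note that with FALLOC_FL_COLLAPSE_RANGE the offset and length must be multiples of the file system block size.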
int ret = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096);
See here
Truncating the first 100MB of a file in linux
The easiest solution for your old applications would be a FUSE filesystem which gives them access to the underlying file, but with the offset cyclically shifted. This would allow you to implement a ringbuffer at the physical level. The FUSE layer would be fairly trivial as it only needs to adjust all filepositions by a constant, modulo filesize.
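The offset adjustment such a layer would do is small. Here is a sketch of just that arithmetic (not of a FUSE file system), where `start` is assumed to be the position of the oldest byte in the fixed-size backing file; the function and parameter names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Map an offset as seen by the client to an offset in the backing file.
// `start` is where the oldest byte currently lives; `capacity` is the
// fixed size of the backing file used as a ring buffer.
std::uint64_t physical_offset(std::uint64_t logical, std::uint64_t start,
                              std::uint64_t capacity) {
    assert(capacity > 0 && start < capacity && logical < capacity);
    return (start + logical) % capacity;
}
```

Dropping the oldest data then amounts to advancing `start`; no bytes are moved on the flash.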
What about setting up a separate process that renames the output file when it reaches a predefined size (for instance by adding the Linux time at the end of the file name)?
This would allow you to keep the old data and the main process will recreate the output file the next time it writes to it.
Another cron job may remove the old file every now and then.

FAT, optimize performance when retrieve a file

I have an implementation of database with one file per record, and I have about 10000 records.
I'm trying to optimize file access performance, and I have a couple of questions.
Is splitting the files into folders better than keeping them all in a single folder for quick access? E.g. files 0 to 999 in folder 0, 1000 to 1999 in folder 1, etc.
Which is better for this, FAT16 or FAT32?
If you are accessing the files directly, then you won't have any performance drop. If you are searching for a particular file on the disk, it would be faster to store them in folders. This way folders would emulate db indexes. But as @blow mentioned, why don't you use something like SQLite?
When you retrieve a file by filename you most likely do a linear search in the directory containing that file, you skip all directory entries until you find the one that matches the given filename.
This search can be slow if you repeat it for every file access, if the directory contains many files, and if reads are slow (a slow CPU makes it worse).
You may want to build some sort of an index, a compact array of pairs filename+location sorted by filename, which you can keep in memory to quickly find files w/o rereading the directory entries.
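A minimal sketch of such an index, assuming the "location" is the file's first cluster number (the struct and function names are invented for this example):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    std::string name;       // file name as stored in the directory
    std::uint32_t cluster;  // assumed "location": first cluster of the file
};

// Sort once after scanning the directory.
void sort_index(std::vector<IndexEntry>& index) {
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) {
                  return a.name < b.name;
              });
}

// Binary search by name; returns nullptr if the file is not indexed.
const IndexEntry* find_entry(const std::vector<IndexEntry>& index,
                             const std::string& name) {
    auto it = std::lower_bound(
        index.begin(), index.end(), name,
        [](const IndexEntry& e, const std::string& n) { return e.name < n; });
    return (it != index.end() && it->name == name) ? &*it : nullptr;
}
```

For 10,000 files a lookup takes about 14 comparisons instead of a scan over up to 10,000 directory entries.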
Things can be greatly simplified if there's a constant number of files and they have the same length or are padded to the same length. In that case you don't need any search as you can calculate the location of each file directly from the filename, provided, of course, that the order of the files is fixed.
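As an illustration of computing the location directly, here is a sketch under assumptions that are mine, not the answer's: files are named by zero-padded record number (for example "0042.DAT"), are stored contiguously in that order, and are each padded to 512 bytes.

```cpp
#include <cstdint>
#include <string>

// Assumed padded length of every file for this sketch.
constexpr std::uint64_t kRecordBytes = 512;

// Byte offset of a file's data from the start of the data area.
// std::stoull parses the leading decimal digits of the name and throws
// std::invalid_argument if the name does not start with a digit.
std::uint64_t offset_for(const std::string& file_name) {
    const std::uint64_t number = std::stoull(file_name);
    return number * kRecordBytes;
}
```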
The only practical difference between FAT1x and FAT32 in this context is the size of the file allocation table, that set of linked lists/chains that tells you which clusters are free or occupied by file/directory data and tells you which cluster is the next in a file/directory after the given one. In FAT32, the cluster chain elements are 32-bit, 2 times larger than on FAT16. If the number of used clusters is small (less than ~64K), you are going to read twice as much data from FAT32 while traversing the cluster chains compared with FAT16. Also, finding a free cluster on FAT32 (when you create a new file/dir or grow an existing one) can be slow if there are many clusters on the disk (and there can be up to 2^28 on FAT32 AFAIR vs 2^16 of FAT16). You don't want to start searching for a free cluster from the beginning of the FAT every time. You want to keep somewhere a pointer to the last place you stopped the search and the next time search from there and then go to the beginning of the FAT when you've reached the FAT's end.
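The "remember where the last search stopped" idea can be sketched on a simplified in-memory table. This is not a real FAT layout: it ignores the reserved first entries and just treats 0 as free and any other value as used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::uint32_t kEndOfChain = 0x0FFFFFFF;  // marks an allocated entry
constexpr std::size_t kNoCluster = static_cast<std::size_t>(-1);

class ClusterAllocator {
public:
    explicit ClusterAllocator(std::vector<std::uint32_t> fat)
        : fat_(std::move(fat)) {}

    // Find a free entry, starting where the previous search stopped and
    // wrapping around to the beginning of the table at most once.
    std::size_t allocate() {
        for (std::size_t i = 0; i < fat_.size(); ++i) {
            const std::size_t idx = (next_ + i) % fat_.size();
            if (fat_[idx] == 0) {
                fat_[idx] = kEndOfChain;
                next_ = (idx + 1) % fat_.size();
                return idx;
            }
        }
        return kNoCluster;  // table is full
    }

    void release(std::size_t idx) {
        assert(idx < fat_.size());
        fat_[idx] = 0;
    }

private:
    std::vector<std::uint32_t> fat_;
    std::size_t next_ = 0;  // where the next search starts
};
```

On a mostly full table this avoids rescanning the same used entries at the front on every allocation.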
Split them across directories (the number of files per directory depending on your cluster size) and do not use LFN (long file names) if you can avoid it, because it will slow down your operations. I also work on embedded systems. I did not have to access thousands of files like you, but I avoided LFN (especially for royalty reasons).

What's the best way to write to more files than the kernel allows open at a time?

I have a very large binary file and I need to create separate files based on the id within the input file. There are 146 output files and I am using the C standard I/O functions fopen and fwrite (from cstdio). FOPEN_MAX is 20, so I can't keep all 146 output files open at the same time. I also want to minimize the number of times I open and close an output file.
How can I write to the output files effectively?
I also must use the C standard I/O library due to legacy code.
The executable must also be UNIX and windows cross-platform compatible.
A couple possible approaches you might take:
Keep a cache of open output file handles that is smaller than FOPEN_MAX. If a write needs to occur on a file that's already open, just do the write. Otherwise, close one of the handles in the cache and open the output file. If the data for a particular set of files is generally clumped together in the input file, this should work nicely with an LRU policy for the file handle cache.
Handle the output buffering yourself instead of letting the library do it for you: keep your own set of 146 (or however many you might need) output buffers and buffer the output to those, and perform an open/flush/close when a particular output buffer gets filled. You could even combine this with the above approach to really minimize the open/close operations.
Just be sure you test well for the edge conditions that can happen on filling or nearly filling an output buffer.
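A minimal sketch of the first approach, an LRU cache of FILE handles built on fopen and fclose. The class name and interface are invented for this example, and files are opened in append mode on the assumption that a file evicted earlier must keep its old contents when reopened.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class FileCache {
public:
    explicit FileCache(std::size_t max_open) : max_open_(max_open) {
        assert(max_open > 0);
    }
    ~FileCache() {
        for (auto& entry : open_) std::fclose(entry.second);
    }
    FileCache(const FileCache&) = delete;
    FileCache& operator=(const FileCache&) = delete;

    // Return an open handle for `path`, closing the least recently used
    // handle first if the cache is full. Returns nullptr if fopen fails.
    std::FILE* get(const std::string& path) {
        auto it = lookup_.find(path);
        if (it != lookup_.end()) {
            // Already open: move it to the front (most recently used).
            open_.splice(open_.begin(), open_, it->second);
            return it->second->second;
        }
        if (open_.size() == max_open_) {
            std::fclose(open_.back().second);
            lookup_.erase(open_.back().first);
            open_.pop_back();
        }
        std::FILE* fp = std::fopen(path.c_str(), "ab");
        if (!fp) return nullptr;
        open_.emplace_front(path, fp);
        lookup_[path] = open_.begin();
        return fp;
    }

    std::size_t open_count() const { return open_.size(); }

private:
    using Entry = std::pair<std::string, std::FILE*>;
    std::size_t max_open_;
    std::list<Entry> open_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> lookup_;
};
```

With FOPEN_MAX at 20, a cache size somewhat below that (leaving room for the input file and the standard streams) would be the natural choice. Existing output files should be removed before a run, since append mode keeps whatever is already there.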
It may also be worth scanning the input file, making a list of each output id and sorting it so that you write all the file1 entries first, then all the file2 entries etc..
If you cannot increase the max FOPEN_MAX somehow, you can create a simple queue of requests and then close and re-open files as needed.
You can also keep track of the last write-time for each file, and try to keep the most recently written files open.
The solution seems obvious - open N files, where N is somewhat less than FOPEN_MAX. Then read through the input file and extract the contents of the first N output files. Then close the output files, rewind the input, and repeat.
First of all, I hope you are running as much in parallel as possible. There is no reason why you can't write to multiple files at the same time. I'd recommend doing what thomask said and queue requests. You can then use some thread synchronization to wait until the entire queue is flushed before allowing the next round of writes to go through.
You haven't mentioned if it's critical to write to these outputs in "real-time", or how much data is being written. Subject to your constraints, one option might be to buffer all the outputs and write them at the end of your software run.
A variant of this is to set up internal buffers of a fixed size: once you hit the internal buffer limit, open the file, append, and close, then clear the buffer for more output. The buffers reduce the number of open/close cycles and give you bursts of writes, which the file system is usually set up to handle nicely. This would be for cases where you need somewhat real-time writes, and/or the data is bigger than available memory, and the number of file handles exceeds some maximum in your system.
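A rough sketch of that variant: one in-memory buffer per output file, with an open/append/close cycle whenever the buffer reaches its limit. The class name and the `flush_count` counter are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>

class BufferedAppender {
public:
    BufferedAppender(std::string path, std::size_t limit)
        : path_(std::move(path)), limit_(limit) {
        assert(limit > 0);
    }
    ~BufferedAppender() { flush(); }  // write out whatever is left
    BufferedAppender(const BufferedAppender&) = delete;
    BufferedAppender& operator=(const BufferedAppender&) = delete;

    // Collect data in memory; touch the file only when the limit is hit.
    void write(const char* data, std::size_t len) {
        buffer_.append(data, len);
        if (buffer_.size() >= limit_) flush();
    }

    // Open, append, close. The file handle is held only for this call.
    bool flush() {
        if (buffer_.empty()) return true;
        std::FILE* fp = std::fopen(path_.c_str(), "ab");
        if (!fp) return false;
        bool ok = std::fwrite(buffer_.data(), 1, buffer_.size(), fp) ==
                  buffer_.size();
        ok = (std::fclose(fp) == 0) && ok;
        if (ok) {
            buffer_.clear();
            ++flushes_;
        }
        return ok;
    }

    std::size_t flush_count() const { return flushes_; }

private:
    std::string path_;
    std::size_t limit_;
    std::string buffer_;
    std::size_t flushes_ = 0;
};
```

With 146 of these, at most one output file is open at any moment, and the buffer limit trades memory for the number of open/close cycles.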
You can do it in 2 steps.
1) Write the first 19 ids to one file, the next 19 ids to the next file and so on. So you need 8 output files (and the input file) opened in parallel for this step.
2) For every so created file create 19 (only 13 for the last one) new files and write the ids to it.
Regardless of how large the input file is and how many id-datasets it contains, you always need to open and close 163 files. But you need to write the data twice, so it may only be worth it if the id-datasets are really small and randomly distributed.
I think in most cases it is more efficient to open and close the files more often.
The safest method is to open a file and flush after writing, then close if no more recent writing will take place. Many things outside your program's control can corrupt the content of your file. Keep this in mind as you read on.
I suggest keeping an std::map or std::vector of FILE pointers. The map allows you to access file pointers by an ID. If the ID range is small, you could create a vector, reserving elements, and using the ID as an index. This will allow you to keep a lot of files open at the same time. Beware the concept of data corruption.
The limit of simultaneous open files is set by the operating system. For example, if your OS has a maximum of 10, you will have to make arrangements when the 11th file is requested.
Another trick is to reserve buffers in dynamic memory for each file. When all the data is processed, open a file (or more than one), write the buffer (using one fwrite), close and move on. This may be faster since you are writing to memory during the data processing rather than a file. An interesting side note is that your OS may also page the buffers to the hard drive as well. The size and quantities of buffers is an optimization issue that is platform dependent (you'll have to adjust and test to get a good combination). Your program will slow down if the OS pages the memory to the disk.
Well, if I was writing it with your listed constraints in the OP, I would create 146 buffers and plop the data into them, then at the end, sequentially walk through the buffers and close/open a single file-handle.
You mentioned in a comment that speed was a major concern and that the naive approach is too slow.
There are a few things that you can start considering. One is a reorganizing of the binary file into sequential strips, which would allow parallel operations. Another is a least-recently used approach to your filehandle collection. Another approach might be to fork out to 8 different processes, each outputting to 19-20 files.
Some of these approaches will be more or less practical to write depending on the organization of the binary file (highly fragmented vs. highly sequential).
A major constraint is the size of your binary data. Is it bigger than cache? bigger than memory? streamed out of a tape deck? Continually coming off a sensor stream and only existing as a 'file' in memory? Each of those presents a different optimization strategy...
Another question is usage patterns. Are you doing occasional spike writes to the files, or are you having massive chunks written only a few times? That determines the effectiveness of the different caching/paging strategies of filehandles.
Assuming you are on a *nix system, the limit is per process, not system-wide. So that implies you could launch multiple processes, each responsible for a subset of the id's you are filtering for. Each could keep within the FOPEN_MAX for its process.
You could have one parent process reading the input file then sending the data to various 'write' processes through pipe special files.
"Fewest File Opens" Strategy:
To achieve a minimum number of file opens and closes, you will have to read through the input multiple times. Each time, you pick a subset of the ids that need sorting, and you extract only those records into the output files.
Pseudocode for each thread:
1) Run through the file, collect all the unique ids.
2) fseek() back to the beginning of the input.
3) For every group of 19 IDs:
   a) Open a file for each ID.
   b) Run through the input file, appending matching records to the corresponding output file.
   c) Close this group of 19 output files.
   d) fseek() to the beginning of the input.
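A single-threaded sketch of this pseudocode follows. To keep it short it uses iostreams rather than fopen/fwrite, and it assumes a made-up record format of one text line per record, "<id> <payload>", with output going to "<prefix><id>.txt". The same loop structure should carry over to the C stdio functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split `in` into one file per id, opening at most `group_size` outputs
// at a time and rereading the input once per group.
void split_in_passes(std::istream& in, const std::string& prefix,
                     std::size_t group_size) {
    assert(group_size > 0);
    // First run: collect all the unique ids.
    std::set<int> ids;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int id;
        if (fields >> id) ids.insert(id);
    }
    const std::vector<int> ordered(ids.begin(), ids.end());

    for (std::size_t first = 0; first < ordered.size(); first += group_size) {
        const std::size_t last = std::min(first + group_size, ordered.size());
        // Open one output file for each id in this group.
        std::map<int, std::ofstream> outputs;
        for (std::size_t i = first; i < last; ++i) {
            outputs[ordered[i]].open(prefix + std::to_string(ordered[i]) + ".txt");
        }
        // Rewind the input and copy the matching records.
        in.clear();
        in.seekg(0);
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            int id;
            if (fields >> id) {
                auto it = outputs.find(id);
                if (it != outputs.end()) it->second << line << '\n';
            }
        }
        // Leaving this scope closes the group's output files.
    }
}
```

For 146 ids and groups of 19, this reads the input nine times in total (once to collect ids, then once per group) and opens each output file exactly once.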
This method doesn't work quite as nicely with multiple threads, because eventually the threads will be reading totally different parts of the file. When that happens, it's difficult for the file cache to be efficient. You could use barriers to keep the threads more-or-less in lock-step.
"Fewest File Operations" Strategy
You could use multiple threads and a large buffer pool to make only one run-through of the input. This comes at the expense of more file opens and closes (probably). Each thread would, until the whole file was sorted:
1) Choose the next unread page of the input.
2) Sort that input into 2-page buffers, one buffer for each output file. Whenever one buffer page is full:
   a) Mark the page as unavailable.
   b) If this page has the lowest page-counter value, append it to the file using fwrite(). If not, wait until it is the lowest (hopefully, this doesn't happen much).
   c) Mark the page as available, and give it the next page number.
You could change the unit of flushing output files to disk. Maybe you have enough RAM to collect 200 pages at a time, per output file?
Things to be careful about:
Is your data page-aligned? If not, you'll have to be clever about reading "the next page".
Make sure you don't have two threads fwrite()'ing to the same output file at the same time. If that happens, you might corrupt one of the pages.