Delete or update a dataset in HDF5? - C++

I would like to programmatically change the data associated with a dataset in an HDF5 file. I can't seem to find a way to either delete a dataset by name (allowing me to add it again with the modified data) or update a dataset by name. I'm using the C API for HDF5 1.6.x, but pointers towards any HDF5 API would be useful.

According to the user guide:
HDF5 does not at this time provide an easy mechanism to remove a dataset from a file or to reclaim the storage space occupied by a deleted object.
So simple deletion appears to be out of the question. But the section continues:
Removing a dataset and reclaiming the space it used can be done with the H5Ldelete function and the h5repack utility program. With the H5Ldelete function, links to a dataset can be removed from the file structure. After all the links have been removed, the dataset becomes inaccessible to any application and is effectively removed from the file. The way to recover the space occupied by an unlinked dataset is to write all of the objects of the file into a new file. Any unlinked object is inaccessible to the application and will not be included in the new file. Writing objects to a new file can be done with a custom program or with the h5repack utility program.

If you want to delete a dataset in C++ you need the following commands:
H5File m_h5File(pathAndNameToHDF5File, H5F_ACC_RDWR); // The HDF5 C++ file object.
std::string channelName = "/myGroup/myDataset";
herr_t result = H5Ldelete(m_h5File.getId(), channelName.c_str(), H5P_DEFAULT);
result will be a non-negative value if successful; otherwise it returns a negative value. See https://support.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Delete
As @MaxLybbert said, the hard-disk space is not recovered; you must use the repack tool.
However, with HDF5 v1.10 the space can be recovered, although the user's guide for this feature is not ready yet: https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesFileSpaceMgmtDocs.html
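If the replacement data has the same shape and type as the existing dataset, you may not need to delete anything at all: you can simply overwrite the data in place. A minimal sketch using the HDF5 1.8+ C API (the dataset path, element type and helper name are illustrative assumptions):

#include <hdf5.h>

herr_t overwrite_dataset(hid_t file_id, const double* new_data)
{
    // Open the existing dataset by name (H5Dopen2 is the 1.8+ spelling).
    hid_t dset = H5Dopen2(file_id, "/myGroup/myDataset", H5P_DEFAULT);
    if (dset < 0) return -1;

    // Overwrite the whole dataset; the file keeps its existing layout,
    // so nothing is unlinked and no repacking is needed afterwards.
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                             H5P_DEFAULT, new_data);
    H5Dclose(dset);
    return status;
}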

Related

FatFS - can I create multiple seek locations?

I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application consists of logging data to a data format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers; the number of buffers depends on how fast I acquire the data: log a batch of one buffer ... do other stuff ... log a batch of five buffers ... do other stuff ... etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then before I write the batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow, but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between header-location and data-location?
I am open to modifying FatFS.
The problem, at the low-level, is simpler because the header only shares one 512 byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_lseek to ping-pong between header-location and data-location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
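A minimal sketch of such a wrapper, built only on the public FatFS API in ff.h (the choice of FR_DISK_ERR for a short write is an assumption, and FSIZE_t is DWORD in older FatFS releases):

#include "ff.h"

FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw)
{
    FSIZE_t saved = f_tell(fp);          /* remember the current (data) position */
    FRESULT res = f_lseek(fp, 0);        /* jump to the header at offset 0 */
    if (res != FR_OK) return res;

    res = f_write(fp, buff, btw, bw);    /* write the header bytes */
    if (res != FR_OK) return res;
    if (*bw != btw) return FR_DISK_ERR;  /* treat a short write as an error */

    return f_lseek(fp, saved);           /* seek back to where data is appended */
}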
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly: you need to save the buffer and the header (footer?), update the directory entry to reflect the new file size, and update the file allocation table to account for the sectors allocated; and you need to write to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two 'cluster chains' and alternate between them, so that one chain of clusters holds the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in the previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).

Armadillo: Save multiple datasets in one hdf5 file

I am trying to save multiple datasets into a single hdf5 file using armadillo's new feature to give custom names to datasets (using armadillo version 8.100.1).
However, only the last saved dataset will end up in the file. Is there any way to append to an existing hdf5 file with armadillo instead of replacing it?
Here is my example code:
#define ARMA_USE_HDF5
#include <armadillo>

int main() {
    arma::mat A(2, 2, arma::fill::randu);
    arma::mat B(3, 3, arma::fill::eye);
    A.save(arma::hdf5_name("multi-hdf5.mat", "dataset1"), arma::hdf5_binary);
    B.save(arma::hdf5_name("multi-hdf5.mat", "dataset2"), arma::hdf5_binary);
    return 0;
}
The hdf5 file is read out using the h5dump utility.
Unfortunately, I don't think you can do that. I'm an HDF5 developer, not an armadillo developer, but I took a peek at their source for you.
The save functions look like they are designed to dump a single matrix to a single file. In the function save_hdf5_binary() (diskio_meat.hpp:1255 for one version) they call H5Fcreate() with the H5F_ACC_TRUNC flag, which will clobber any existing file. There's no 'open if file exists' or clobber/non-clobber option. The only H5Fopen() calls are in the hdf5_binary_load() functions and those don't keep the file open for later writing.
This clobbering is what is happening in your case, btw. A.save() creates a file containing dataset1, then B.save() clobbers that file with a new file containing dataset2.
Also, for what it's worth, 'appending to an HDF5 file' is not really the right way to think about that. HDF5 files are not byte/character streams like a text file. Appending to a dataset, yes. Files, no. Think of it like a relational database: You might append data to a table, but you probably wouldn't say that you were appending data to the database.
The latest versions of Armadillo already cover that possibility.
You have to use hdf5_opts::append in the save method, so if you want to add a matrix A to an existing file you can write
A.save(arma::hdf5_name(filename, dataset, arma::hdf5_opts::append));
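Applied to the question's example, a minimal sketch (assuming an Armadillo version that provides hdf5_opts::append, as described above):

#define ARMA_USE_HDF5
#include <armadillo>

int main() {
    arma::mat A(2, 2, arma::fill::randu);
    arma::mat B(3, 3, arma::fill::eye);

    // The first save creates the file and writes dataset1.
    A.save(arma::hdf5_name("multi-hdf5.mat", "dataset1"), arma::hdf5_binary);

    // hdf5_opts::append keeps the existing file and adds dataset2 alongside dataset1.
    B.save(arma::hdf5_name("multi-hdf5.mat", "dataset2", arma::hdf5_opts::append),
           arma::hdf5_binary);
    return 0;
}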

Solution to WORK disk space not enough in SAS

I have more than 50 tables running in work. Before, it worked well.
But recently, there are some errors like:
ERROR: An I/O error has occurred on file WORK.'SASTMP-000000030'n.UTILITY.
ERROR: File WORK.'SASTMP-000000030'n.UTILITY is damaged. I/O processing did not complete.
NOTE: Error was encountered during utility-file processing. You may be able to execute the SQL statement successfully if you allocate more space to the WORK library.
ERROR: There is not enough WORK disk space to store the results of an internal sorting phase.
ERROR: An error has occurred.
Does anyone know how to solve this error?
Your disk is full. If this is running on a server, ask your system administrator to investigate the problem.
If this is your desktop, find and delete un-needed files to free up space.
Clean out old SAS Work Folders
Often, old SAS Work folders do not get cleared when SAS closes. You can get back a lot of disk space by going to the path defined for SAS Work, and deleting all the old folders.
In SAS
%put %sysfunc(pathname(work));
will show you where the current WORK library is located. One level up is where all SAS Work folders are created.
On my system, that returns:
C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\_TD9512_GXM2L12-PAZZULA_
That means that I should look in "C:\Users\dpazzula\AppData\Local\Temp\SAS Temporary Files\" to find old folders to delete.
Your work space is full.
Your SAS server uses a dedicated directory where all SAS sessions store their temporary files: All files in the work libraries, as well as temp files as used while sorting, joining etc.
Solutions:
Have more space allocated.
Make certain to put only necessary files into WORK, clean up, and close old sessions.
Run fewer processes.
Replace interim datasets with views instead, especially if you're using large source datasets:
data master / view=master;
  set lib.monthlydata20: ;  /* all datasets since Jan 2000 */
run;

proc sql;
  create table want as
  select *
  from master
  where ID in (select ID from lookup);
quit;
Try to compress all datasets using this option:
OPTIONS COMPRESS=YES REUSE=YES;
This should be at the very beginning of your code. It will compress all datasets by nearly 98%. It will also make your code run faster; it will consume more CPU but will decrease the size on disk.
In some cases, this might not help if the compressed data sets exceed the hard disk space.
Also, change your work directory to the biggest drive that has disk space.
Study your code.
Create a Data Flow Diagram to determine WHEN each file is created and where it is used downstream. Find out when a data set is no longer needed and DELETE it. If you have 50 data sets, chances are numerous data sets are 'value-added' by a subsequent step and can go away, freeing up your work space. A cute trick is to REUSE some of the data set names to keep the number of unneeded data sets in check.
Rule of thumb: leave the environment the way you found it. If there were no files in WORK to start with, manually clean up after yourself, unless it is a Stored Process, which starts a completely new SAS job and will clean up after itself upon completion of the job.

Delete record from file C++

I'm working on a simple database console application in C++ for adding, editing and deleting records in a .dat file. I have the addition and modification down; I'm just finding it hard to understand the concept of deletion in this scenario. Below is how I write a record.
Write record
fh.seekp(num * sizeof(customerObj), ios::beg); // Move the write pointer to where the record is
fh.write((char*)&customerObj, sizeof(customerObj)); // Write the updated record
Any ideas how instead of write() I could have something equivalent to delete()... or is it not that simple?
C and C++ don't have functions to delete parts of files. Many operating systems don't either.
Possible options:
If this is the last record, truncate the file. If not, move (=copy) all records after it back by one position, overwriting it, then truncate. Alternatively you could move (=copy) the last record into its place and then truncate.
Create an extra file and copy to it all records before this and after this record. Then delete the old file and rename the new file.
Mark the record as unused. When writing new records check if you have any unused locations and use them first.
Use a file per record.
Introduce a marker for deleted records. Instead of moving large chunks of the file you only need to write a single flag. When you need to allocate a new record you can iterate over the already-deleted slots and just clear the marker to reuse one.
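A minimal sketch of that idea on top of the question's seekp/write pattern (the Customer layout and the deleted field are illustrative assumptions):

#include <fstream>

struct Customer { bool deleted; char name[32]; int balance; }; // illustrative layout

void markDeleted(std::fstream& fh, long num)
{
    Customer rec{};
    fh.seekg(num * sizeof(Customer), std::ios::beg);
    fh.read(reinterpret_cast<char*>(&rec), sizeof(rec));        // load the record
    rec.deleted = true;                                         // set the tombstone flag
    fh.seekp(num * sizeof(Customer), std::ios::beg);
    fh.write(reinterpret_cast<const char*>(&rec), sizeof(rec)); // write it back in place
}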
Well, deleting from files is not trivial, as you cannot delete a row from a file directly.
One approach is to read the entire file (or read it in chunks) and write it back without the unwanted record. This is quite robust, but not efficient for large files.
If you divide your record file into smaller partitioned files, then doing the above becomes more efficient.
Another thing you can do is just mark a row in the file as invalid (as is done in memory when deleting a pointer) and overwrite it when needed; how exactly depends on how you write your records, but I hope you get what I mean.
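A minimal sketch of the read-and-rewrite approach mentioned above (the Customer layout and the file names are illustrative assumptions):

#include <cstdio>
#include <fstream>

struct Customer { char name[32]; int balance; }; // illustrative layout

bool deleteRecord(const char* path, long indexToDelete)
{
    std::ifstream in(path, std::ios::binary);
    std::ofstream out("temp.dat", std::ios::binary);
    if (!in || !out) return false;

    Customer rec;
    long index = 0;
    while (in.read(reinterpret_cast<char*>(&rec), sizeof(rec))) {
        if (index++ != indexToDelete)                            // skip the record to delete
            out.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
    }
    in.close();
    out.close();

    return std::remove(path) == 0 &&                             // drop the old file
           std::rename("temp.dat", path) == 0;                   // give the new file the old name
}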

How do I truncate a file in Visual C++?

I have the following case: I have a big file, say 1 KB. I want to read the first 100 bytes, then delete those 100 bytes from the file, and then read the next 100 bytes. Reading 100 bytes is fine, but how do I delete 100 bytes from the file?
This is commonly done as a multiple-step process:
Rename the original file.
Write the data you want into a new file with the original file name.
Delete the old file with the temporary name that contains the data you no longer want.
That way, if something were to go wrong, you could simply restore the original file that you renamed. Moving a file from one place to another is implemented this way, as well.
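A minimal sketch of that multi-step process for dropping the first 100 bytes (the file names and helper are illustrative assumptions):

#include <cstdio>
#include <fstream>

bool dropFirst100Bytes(const char* path)
{
    const char* backup = "data.bak";                   // temporary name, illustrative
    if (std::rename(path, backup) != 0) return false;  // step 1: rename the original

    std::ifstream in(backup, std::ios::binary);
    std::ofstream out(path, std::ios::binary);         // step 2: new file with the old name
    if (!in || !out) return false;
    in.seekg(100, std::ios::beg);                      // skip the bytes already processed
    out << in.rdbuf();                                 // copy the remainder
    in.close();
    out.close();

    return std::remove(backup) == 0;                   // step 3: delete the renamed original
}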
However, if you don't want to do this, the SetEndOfFile function is another viable option to truncate the contents of a file in-place. From the documentation:
Sets the physical file size for the specified file to the current position of the file pointer.
The physical file size is also referred to as the end of the file. The SetEndOfFile function can be used to truncate or extend a file.
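A minimal sketch of in-place truncation with SetEndOfFile (the helper name is an assumption; note this removes data from the end of the file, not the beginning):

#include <windows.h>

bool truncateTo(const wchar_t* path, LONGLONG newSize)
{
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER pos;
    pos.QuadPart = newSize;
    bool ok = SetFilePointerEx(h, pos, nullptr, FILE_BEGIN)  // move the pointer to the new end
              && SetEndOfFile(h);                            // cut everything after it
    CloseHandle(h);
    return ok;
}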
That wouldn't be called truncating; that term refers to removing data from the end, not the beginning. I'm not aware of any operating system where this is possible, other than by copying the contents of the file to a new file, starting at the 100th byte.
Deleting data that has been processed in a file is time consuming and in most cases not necessary.
Deleting data near the top or middle of the file requires writing a new file, which takes time and disk space. Most applications will read and process the entire file, then rename the file (with a backup extension); this is useful for debugging purposes. Deleting an entire file is often a faster operation than writing a new file without the processed data.
Deletions should only take place when necessary. For files, one can store an offset of where the valid data begins, reducing the need to delete data from a file. For security purposes, overwriting data in the file is often faster than creating a new file without the processed data.
First try writing your program so that it does not delete data in the file. Only add deletion when necessary, after the program is robust and working correctly. Many people would suggest deleting files only when there is no more space on the drive.