Armadillo: Save multiple datasets in one hdf5 file - c++

I am trying to save multiple datasets into a single hdf5 file using armadillo's new feature to give custom names to datasets (using armadillo version 8.100.1).
However, only the last saved dataset will end up in the file. Is there any way to append to an existing hdf5 file with armadillo instead of replacing it?
Here is my example code:
#define ARMA_USE_HDF5
#include <armadillo>

int main() {
    arma::mat A(2, 2, arma::fill::randu);
    arma::mat B(3, 3, arma::fill::eye);
    A.save(arma::hdf5_name("multi-hdf5.mat", "dataset1"), arma::hdf5_binary);
    B.save(arma::hdf5_name("multi-hdf5.mat", "dataset2"), arma::hdf5_binary);
    return 0;
}
The hdf5 file is read out using the h5dump utility.

Unfortunately, I don't think you can do that. I'm an HDF5 developer, not an armadillo developer, but I took a peek at their source for you.
The save functions look like they are designed to dump a single matrix to a single file. In the function save_hdf5_binary() (diskio_meat.hpp:1255 in one version) they call H5Fcreate() with the H5F_ACC_TRUNC flag, which will clobber any existing file. There's no 'open if the file exists' or clobber/non-clobber option. The only H5Fopen() calls are in the load_hdf5_binary() functions, and those don't keep the file open for later writing.
This clobbering is what is happening in your case, btw. A.save() creates a file containing dataset1, then B.save() clobbers that file with a new file containing dataset2.
Also, for what it's worth, 'appending to an HDF5 file' is not really the right way to think about that. HDF5 files are not byte/character streams like a text file. Appending to a dataset, yes. Files, no. Think of it like a relational database: You might append data to a table, but you probably wouldn't say that you were appending data to the database.
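For illustration, here is a rough sketch of what writing two datasets into one file looks like if you drive the HDF5 C API directly instead of going through Armadillo's save(); the matrices and dataset names follow the question, everything else is my own assumption:

#include <armadillo>
#include <hdf5.h>

int main() {
    arma::mat A(2, 2, arma::fill::randu);
    arma::mat B(3, 3, arma::fill::eye);

    // Create the file once, then create and write both datasets into it.
    hid_t file = H5Fcreate("multi-hdf5.mat", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // Armadillo stores matrices column-major while HDF5 datasets are row-major,
    // so transposed copies are written to keep the on-disk layout intuitive.
    arma::mat At = A.t();
    hsize_t dimsA[2] = {A.n_rows, A.n_cols};
    hid_t spaceA = H5Screate_simple(2, dimsA, NULL);
    hid_t dsetA  = H5Dcreate2(file, "dataset1", H5T_NATIVE_DOUBLE, spaceA,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dsetA, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, At.memptr());
    H5Dclose(dsetA);
    H5Sclose(spaceA);

    arma::mat Bt = B.t();
    hsize_t dimsB[2] = {B.n_rows, B.n_cols};
    hid_t spaceB = H5Screate_simple(2, dimsB, NULL);
    hid_t dsetB  = H5Dcreate2(file, "dataset2", H5T_NATIVE_DOUBLE, spaceB,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dsetB, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, Bt.memptr());
    H5Dclose(dsetB);
    H5Sclose(spaceB);

    H5Fclose(file);
    return 0;
}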

The latest version of Armadillo already covers that possibility. You have to use hdf5_opts::append in the save call, so if you want to save a matrix A you can write
A.save(arma::hdf5_name(filename, dataset, arma::hdf5_opts::append));
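Applied to the example from the question, a minimal sketch could look like this (the first save creates the file, the second one appends to it):

#define ARMA_USE_HDF5
#include <armadillo>

int main() {
    arma::mat A(2, 2, arma::fill::randu);
    arma::mat B(3, 3, arma::fill::eye);

    // hdf5_opts::append tells save() to add the dataset to an existing file
    // instead of replacing the file.
    A.save(arma::hdf5_name("multi-hdf5.mat", "dataset1"), arma::hdf5_binary);
    B.save(arma::hdf5_name("multi-hdf5.mat", "dataset2", arma::hdf5_opts::append), arma::hdf5_binary);

    return 0;
}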

Save large RasterBrick to file for later use

I have a large RasterBrick, created by compiling a large number of .nc files and then manipulating the result in a few ways (cropping, collapsing, naming layers). I want to save this brick to a file on my laptop so that I can access it without having to import all the data and manipulate it anew.
How do I do this? I think it should involve writeRaster, but I'm not sure how to specify the options.
My RasterBrick is 18 by 25, with 14975 layers, each named with the relevant date.
I tried this code from Save multi layer RasterBrick to harddisk:
outfile <- writeRaster(windstack_mn, filename='dailywindgrid.tif', format="GTiff", overwrite=TRUE,options=c("INTERLEAVE=BAND","COMPRESS=LZW"))
However, this code produces a .tif file that holds a single 18-by-25 layer. I think it saved only the first layer of my RasterBrick, because if I bring in the saved .tif file and plot it, it looks identical to plotting the first layer of the original RasterBrick.
Did you look at outfile? Can you show it to us?
You should show what you do to "bring in the saved .tif". I am guessing that you do
raster('dailywindgrid.tif')
whereas you should be doing
brick('dailywindgrid.tif')
The comment/answer from Robert solves my issue, with the one addition that one needs to specify the raster format. So I am now saving the file with this code:
writeRaster(StackName, filename='FileNAme.grd', format="raster", overwrite=TRUE,options=c("INTERLEAVE=BAND","COMPRESS=LZW"))
And that .grd file can later be opened using this code:
ImportName <- brick("FileNAme.grd")

How to read a text file and return an additional input field using TextIO?

I have a PCollection of KV where the key is a filename and the value is some additional info about the file (e.g., the "Source" system that generated the file). E.g.,
KV("gs://bucket1/dir1/X1.dat", "SourceX"),
KV("gs://bucket1/dir2/Y1.dat", "SourceY")
I need to read all lines from the files and pair each line with the "Source" field, returning a KV PCollection:
KV(line1 from X1.dat, "SourceX")
KV(line2 from X1.dat, "SourceX")
...
KV(line1 from Y1.dat, "SourceY")
I was able to achieve this by calling FileIO.match() followed by a DoFn in which I sequentially read the file and append the Source value (retrieved from a map passed in as a side input).
To get the benefit of parallel reading, could I use TextIO.readAll() to achieve this? TextIO.read() returns a PCollection without filename info. How can I join it back to the filename-to-Source mapping? I tried the WithKeys transform, but it's not working...
Currently using FileIO.match() as you are doing is the best way to accomplish this, but once https://github.com/apache/beam/pull/12645 is merged you'll be able to use the new ContextualTextIO transforms.
Note that computing line numbers in a distributed manner is inherently expensive; you might want to see if you can use offsets instead (much easier to compute, and ordered the same as line numbers).
If I understand correctly, you want to read the file in parallel? Unfortunately, TextIO.readAll does not have this feature. You will have to use FileIO.match, and then write your DoFn to read the file in the custom way that you want.
This is because you will not be able to do a random seek into a file and preserve the count of line numbers.
Is reading files serially a bottleneck for your pipeline?

Generate dictionary file from Stata data

I know that I can create a .dta file if I have a .dat file and a dictionary (.dct) file. However, I want to know whether the reverse is also possible. In particular, if I have a .dta file, is it possible to generate a .dct file along with a .dat file? (Stata has an export command that allows exporting as an ASCII file, but I haven't found a way to generate a .dct file.) StatTransfer does generate .dct and .dat files, but I was wondering whether it is possible without using StatTransfer.
Yes. outfile will create dictionaries as well as export data in ASCII (text) form.
If you want dictionaries and dictionaries alone, you would need to delete the data part.
If you really want two separate files, you would need to split each file produced by outfile.
Either is programmable in Stata, or you could just use your favourite text editor or scripting language.
Dictionaries are in some ways a very good idea, but they are not as important to Stata as they were in early versions.

Library for data storage and analysis

So, I have this program that collects a bunch of interesting data. I want to have a library that I can use to sort this data into columns and rows (or similar), save it to a file, and then use some other program (like OpenOffice Spreadsheet, or MATLAB since I own it, or maybe some other spreadsheet/database grapher that I don't know of) to analyse and graph the data however I want. I prefer this library to be open source, but it's not really a requirement.
OK, so my mistake: you wanted a writer. Writing a CSV is simple, and apparently reading one into MATLAB is simple too.
http://www.mathworks.com.au/help/techdoc/ref/csvread.html
A CSV has a simple structure: rows are separated by newlines, and columns within a row are separated by commas.
0,10,15,12
4,7,0,3
So all you really need to do is grab your data, separate it into rows, then write each row out as one line with the columns separated by commas.
This shouldn't be too difficult; a rough sketch is below.
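For what it's worth, here is a minimal C++ sketch of such a writer (the container type and all names are just assumptions for illustration):

#include <fstream>
#include <string>
#include <vector>

// Write each row as one comma-separated line.
void write_csv(const std::string& path,
               const std::vector<std::vector<double>>& rows)
{
    std::ofstream out(path);
    for (const auto& row : rows) {
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (i > 0) out << ',';
            out << row[i];
        }
        out << '\n';
    }
}

int main() {
    // Example data matching the snippet above.
    std::vector<std::vector<double>> data = {{0, 10, 15, 12},
                                             {4, 7, 0, 3}};
    write_csv("data.csv", data);  // can then be read in MATLAB with csvread('data.csv')
    return 0;
}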

Delete or update a dataset in HDF5?

I would like to programmatically change the data associated with a dataset in an HDF5 file. I can't seem to find a way either to delete a dataset by name (allowing me to add it again with the modified data) or to update a dataset by name. I'm using the C API for HDF5 1.6.x, but pointers towards any HDF5 API would be useful.
According to the user guide:
HDF5 does not at this time provide an easy mechanism to remove a dataset from a file or to reclaim the storage space occupied by a deleted object.
So simple deletion appears to be out of the question. But the section continues:
Removing a dataset and reclaiming the space it used can be done with the H5Ldelete function and the h5repack utility program. With the H5Ldelete function, links to a dataset can be removed from the file structure. After all the links have been removed, the dataset becomes inaccessible to any application and is effectively removed from the file. The way to recover the space occupied by an unlinked dataset is to write all of the objects of the file into a new file. Any unlinked object is inaccessible to the application and will not be included in the new file. Writing objects to a new file can be done with a custom program or with the h5repack utility program.
If you want to delete a dataset in C++ you need the following commands:
#include "H5Cpp.h"  // HDF5 C++ API; also pulls in the C API used below
using namespace H5;

H5File m_h5File(pathAndNameToHDF5File, H5F_ACC_RDWR);  // the HDF5 C++ file object, opened read-write
std::string channelName = "/myGroup/myDataset";
herr_t result = H5Ldelete(m_h5File.getId(), channelName.data(), H5P_DEFAULT);
result will be non-negative if successful and negative otherwise. https://support.hdfgroup.org/HDF5/doc/RM/RM_H5L.html#Link-Delete
As @MaxLybbert said, the hard-disk space is not recovered; you must use the h5repack tool.
However, with HDF5 v1.10 the space can be recovered, although the user's guide for this is not ready yet: https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesFileSpaceMgmtDocs.html