Store a file metadata in an extra file - c++

I have a bunch of image files (mostly .jpg). I would like to store metadata about these files (e.g. dominant color, color distribution, maximum gradient flow field, interest points, ...). These data fields are not fixed and are not available in all images.
Right now I am storing the metadata for each file as a separate file with the same name but a different extension. The format is just text:
metadataFieldName1 metadataFieldValue1
metadataFieldName2 metadataFieldValue2
This gets me wondering, is there a better/easier way to store these metadata? I thought of ProtocolBuffer since I need to be able to read and write these information in both C++ and Python. But, how do I support the case where some metadata are not available?

I would suggest that you store such metadata within the image files themselves.
Most image formats support storing metadata. I think that .jpeg support it through Exif.
If you're on Windows you can use the WIC to store and retrieve metadata in a unified manner.

Why protocol buffers and not XML or INI files or whatever text-ish format? Just choose some format...
And what do you mean with "metadata not available"? It is up to your application to respond to such error situations...what has this to do with the format of the storage?

Look at http://www.yaml.org. YAML is less verbose than XML and more human friendly to read.
There are YAML libraries for both C++, Python and many other languages.
Example:
import yaml
data = { "field1" : "value1",
"field2" : "value2" }
serializedData = yaml.dump(data, default_flow_style=False)
open("datafile", "w").write(serializedData)

I thought long on this matter and went with ProtocolBuffer to store metadata for my images. For each image e.g. Image00012.jpg, I store the metadata in Image00012.jpg.pbmd. Once I have my .proto file setup, the Python class and C++ class got auto-generated. It works very well and require me to spend little time on parsing (clearly better than writing custom reader for YAML files).
RestRisiko brings up a good point about how I should handle metadata not available. The good thing about ProtocolBuffer is it supports optional/required fields. This solves my problem on this front.
The reason I think XML and INI are not good for this purpose is because many of my metadata are complex (color distribution, ...) and require a bit of storage customization. ProtocolBuffer allows me to nest proto declaration. Plus, the size of the metadata file and the parsing speed is clearly superior to my hand-roll XML reading/writing.

Related

How to write file-wide metadata into parquetfiles with apache parquet in C++

I use apache parquet to create Parquet tables with process information of a machine and I need to store file wide metadata (Machine ID and Machine Name).
It is stated that parquet files are capable of storing file wide metadata, however i couldn't find anything in the documentation about it.
There is another stackoverflow post that tells how it is done with pyarrow. As far as the post is telling, i need some kind of key value pair (maybe map<string, string>) and add it to the schema somehow.
I Found a class inside the parquet source code that is called parquet::FileMetaData that may be used for this purpose, however there is nothing in the docs about it.
Is it possible to store file-wide metadata with c++ ?
Currently i am using the stream_reader_writer example for writing parquet files
You can pass the file level metadata when calling parquet::ParquetFileWriter::Open, see the source code here

How to serialize a XML file to Thrift file ? (to put in HDFS)

Since many days, I inquire a lot of informations about Big Data and especially about Thrift and HDFS/Hadoop.
I have many many XML files which I want to store in a HDFS file system. (and after, make statistics etc... from the data of these files)
So I would like to serialize my XML files with Thrift. (to validate the structure and to make durable ..)
Then, stock them in HDFS.
Is it possible ? ( XML => Thrift => HDFS ) without use RPC service.
To do the test, I would like to use a linux VM (for HDFS) and PHP language (for thrift).
Thank you.
You can use the serialization part without the RPC part, yes. Look for "serializer" in the Thrift source tree, you should find some examples. If not for PHP, then for sure for some other languages.
You have to do a little work on your own, because there is not such a thing a "the" way to convert XML into Thrift structures. The steps are - roughly - as follows
define the data structures to hold the XML data as Thrift IDL constructs
generate the desired code using the Thrift Compiler
add the serializer code as needed
put together some code that
reads each XML file
builds the Thrift structures from it
serializes the data and puts them into HDFS
Depending on the layout of your XML data and on the number of XML structures used, this may need some effort. It could be an idea to generate at least the IDL file programmatically by some other tool, maybe even some of the other code needed. Thrift cannot support you with this, although it could be an option - again, depending on your current situation, language and tools available.

What is the relationship between exif and xmp? Are they interchangeable?

I am writing an application which read/write metadata for an image (it converts a raw file into a jpeg/tiff) and I need to write metadata about camera/mode/...into the generated jpeg.
I know that I can do this using exif and in windows I am using GDI to do this. But I am reading information about xmp and xmp sdk from adobe.
I am wondering which one should I use? exif or xmp?
How they are relates to each other?
Why one may select to write exif metadata and somebody else may select XMP? What is the pros/cons of selecting any of them.
I am writing in c++ on windows (visual studio 2012)
If you are writing JPEG or TIFF formats, I think you should stick with Exif.
All TIFF format readers will be able to parse Exif, as Exif is a subset of TIFF (a sub-IFD with special Exif-defined tags). Most "JPEG" readers also know how to parse Exif/TIFF.
Now, you could of course add XMP to the file as well, but I suggest you do this in addition to plain Exif, and only if you really need it.
XMP does allow richer metadata. But as the metadata you have got is from a camera RAW file, I would think they map directly to Exif tags, given Exif was developed by the digital camera industry for exactly the use you describe.
So, basically: Go with Exif because it has better tool support. Add XMP only if you need it, for richer metadata.
XMP does indeed map values to exif tags.
Both EXIF and XMP are rich sources of image metadata.
The benefit of XMP is that it is easily readable and any XMP-aware application (like all Adobe products) can manipulate these properties.
XMP derives the values from the native values in the image, so basically XMP provides a mapping between properties in the exif block in images and reconciles these values in the XMP namespaces as defined in the ADOBE XMP SDK documentation.
The benefit of using the XMP SDK to manipulate metadata is that then the responsibility of reconciling between different image metadata formats (Like exif, IPTC or XMP) while reading or writing is transferred to the XMP SDK.
If any change is made to the XMP property, it is reflected back to the exif block in image. Similarly if any non-XMP aware application has modified the exif metadata without modifying the corresponding xmp value, at the time of read operation, the XMP SDK will reconcile this change into the XMP value and while writing, this change will be saved back.
Using the XMP SDK is to manipulate metadata is basically easier as you can leave a lot of format specific detailing upto the SDK to handle.
More information on different sources of image metadata is available on the Metadata Working Group.
The complete Spec can be downloaded from here.

Creating metadata for binary file

I have a binary file I'm creating in C++, I'm tasked to create a metadata format to describe the data that it can be read in Java using the metadata.
One record in the data file has Time, then 64 bytes of data, then a CRC, then a new line delimiter. How should the metadata look to describe what is in the 64 bytes? I've never created a metadata file before.
Probably you want to generate a file which describes how many entries there are in the data file, and maybe the time range. Depending on what kind of data you have, the metadata might contain either a per-record entry (RawData, ImageData, etc.) or one global entry (data stored as float.)
It totally depends on what the Java-code is supposed to do, and what use-cases you have. If you want to know whether to open the file at all depending on date, that should be part of the metadata, etc.
I think that maybe you have the design backwards.
First, think about the end.
What result do you want to see? A Java program will create some kind of .csv file?
What kind(s) of file(s)?
What information will be needed to do this?
Then design the metadata to provide the information that is needed to perform the necessary tasks (and any extra tasks you anticipate).
Try to make the metadata extensible so that adding extra metadata in the future will not break the programs that you are writing now. e.g. if the Java program finds metadata it doesn't understand, it just skips it.

XML usage for c++ application

I have a couple of questions about XML.
Can XML be used for normal c++ application instead of using a text file ?
If so, does this method have advantages?
and finally, how can I use XML to store data? what tools are needed?
Regards.
You can use XML for storing information - it's less Human readable than a text file, but can be more easily communicated with other systems and coding languages.
If all you need is a few text/numeric properties, stick to a property file.
If you need a mix of configuration options, and you want to use validation (can be accomplished using XML schema), automatic modification (e.g. XSL transformations) or communicate it easily with Web Services, than XML is useful.
If you want to store binary data, XML is probably not that answer. Though you can store it in a filesystem and use the XML for the metadata (i.e. where each file is located).
Take a look at Apache Xerces-C for C++ XML code - http://xerces.apache.org/xerces-c/
XML can be parsed as a text file by your application. There are libraries available.
Advantage: the files can be exchanged with other applications more easily, especially if you provide an XML-schema file.
Storing data in XML can be done with boost.serialization
It depends of the kind of data you want to read/write, but XML is generally a good way to go for storing structured and hierarchical datas.
You can use librairies such as TinyXML to easily parse and write XML files in C++.
The main drawback is that XML is verbose ; that's why you can also use an alternative such as JSON to store your datas.