Use Arrow schema with Parquet StreamWriter - C++

I am attempting to use the C++ StreamWriter class provided by Apache Arrow.
The only example of using StreamWriter uses the low-level Parquet API, i.e.
parquet::schema::NodeVector fields;
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "string_field", parquet::Repetition::OPTIONAL, parquet::Type::BYTE_ARRAY,
    parquet::ConvertedType::UTF8));
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "char_field", parquet::Repetition::REQUIRED, parquet::Type::FIXED_LEN_BYTE_ARRAY,
    parquet::ConvertedType::NONE, 1));
auto node = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
The end result is a std::shared_ptr<parquet::schema::GroupNode> that can then be passed into StreamWriter.
Is it possible to construct and use "high-level" Arrow schemas with StreamWriter? They are supported when using the WriteTable function (non-streaming), but I've found no examples of their use with the streaming API.
I can, of course, resort to using the low-level API, but it is extremely verbose when creating large and complicated schemas, and I would prefer (but do not need) to use the high-level Arrow schema mechanisms.
For example,
std::shared_ptr<arrow::io::FileOutputStream> outfile_;
PARQUET_ASSIGN_OR_THROW(outfile_, arrow::io::FileOutputStream::Open("test.parquet"));
// construct an arrow schema
auto schema = arrow::schema({arrow::field("field1", arrow::int64()),
                             arrow::field("field2", arrow::float64()),
                             arrow::field("field3", arrow::float64())});
// build the writer properties
parquet::WriterProperties::Builder builder;
auto properties = builder.build();
// my current best attempt at converting the Arrow schema to a Parquet schema
std::shared_ptr<parquet::SchemaDescriptor> parquet_schema;
parquet::arrow::ToParquetSchema(schema.get(), *properties, &parquet_schema); // parquet_schema is now populated
// and now I try to build the writer - this fails
auto writer = parquet::ParquetFileWriter::Open(outfile_, parquet_schema->group_node(), properties);
The last line fails because parquet_schema->group_node() (which is the only way I'm aware of to get access to the GroupNode for the schema) returns a const GroupNode*, whereas ParquetFileWriter::Open() needs a std::shared_ptr<GroupNode>.
I'm not sure that casting away the constness of the returned group node and forcing it into the ::Open() call is an officially supported (or correct) usage of StreamWriter.
Is what I want to do possible?

It seems that you need to use the low-level API for StreamWriter.
A very tricky way:
auto writer = parquet::ParquetFileWriter::Open(
    outfile_,
    std::shared_ptr<parquet::schema::GroupNode>(
        const_cast<parquet::schema::GroupNode *>(parquet_schema->group_node())),
    properties);
You may need to convert the schema manually.
The source code in cpp/src/parquet/arrow/schema.cc may help you:
PARQUET_EXPORT
::arrow::Status ToParquetSchema(const ::arrow::Schema* arrow_schema,
                                const WriterProperties& properties,
                                std::shared_ptr<SchemaDescriptor>* out);
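Putting the pieces together, here is a minimal sketch of how the whole workaround could look. It assumes a reasonably recent Arrow/Parquet C++ release; the exact headers and the ToParquetSchema overload may differ between versions (newer releases also have an overload taking ArrowWriterProperties), and whether handing a const_cast group node to ParquetFileWriter::Open() is officially supported remains the open question above.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/schema.h>
#include <parquet/exception.h>
#include <parquet/file_writer.h>
#include <parquet/stream_writer.h>

int main()
{
    std::shared_ptr<arrow::io::FileOutputStream> outfile;
    PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("test.parquet"));

    // High-level Arrow schema, as in the question.
    auto schema = arrow::schema({arrow::field("field1", arrow::int64()),
                                 arrow::field("field2", arrow::float64()),
                                 arrow::field("field3", arrow::float64())});

    parquet::WriterProperties::Builder builder;
    auto properties = builder.build();

    // Convert the Arrow schema into a Parquet SchemaDescriptor.
    std::shared_ptr<parquet::SchemaDescriptor> parquet_schema;
    PARQUET_THROW_NOT_OK(
        parquet::arrow::ToParquetSchema(schema.get(), *properties, &parquet_schema));

    // group_node() returns a const GroupNode*; the aliasing shared_ptr
    // constructor keeps the SchemaDescriptor alive without claiming
    // ownership of the node itself.
    auto group_node = std::shared_ptr<parquet::schema::GroupNode>(
        parquet_schema,
        const_cast<parquet::schema::GroupNode *>(parquet_schema->group_node()));

    parquet::StreamWriter writer{
        parquet::ParquetFileWriter::Open(outfile, group_node, properties)};

    // Values must be fed in schema order with exactly matching types.
    writer << int64_t{1} << 1.5 << 2.5 << parquet::EndRow;
    return 0;
}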

Related

Serializing a FlatBuffer object to JSON without its schema file

I've been working with FlatBuffers as a solution for various things in my project, one of them specifically being JSON support. However, while FB natively supports JSON generation, the documentation for flatbuffers is poor, and the process is somewhat cumbersome. Right now, I am working in the Object->JSON direction. The issue I am having doesn't really arise the other way around (I think).
I currently have JSON generation working per an example I found here (line 630, JsonEnumsTest()) - by parsing a .fbs file into a flatbuffers::Parser, building and packaging my flatbuffer object, then running GenerateText() to generate a JSON string. The code I have is simpler than the example in test.cpp, and looks vaguely like this:
bool MyFBSchemaWrapper::asJson(std::string& jsonOutput)
{
    //**This is the section I don't like having to do
    std::string schemaFile;
    if (flatbuffers::LoadFile((std::string(getenv("FBS_FILE_PATH")) + "MyFBSchema.fbs").c_str(), false, &schemaFile))
    {
        flatbuffers::Parser parser;
        const char *includePaths[] = { getenv("FBS_FILE_PATH"), nullptr };
        parser.Parse(schemaFile.c_str(), includePaths);
        //**End bad section
        parser.opts.strict_json = true;
        flatbuffers::FlatBufferBuilder fbBuilder;
        auto testItem1 = fbBuilder.CreateString("test1");
        auto testItem2 = fbBuilder.CreateString("test2");
        MyFBSchemaBuilder myBuilder(fbBuilder);
        myBuilder.add_item1(testItem1);
        myBuilder.add_item2(testItem2);
        FinishMyFBSchemaBuffer(fbBuilder, myBuilder.Finish());
        auto result = GenerateText(parser, fbBuilder.GetBufferPointer(), &jsonOutput);
        return true;
    }
    return false;
}
Here's my issue: I'd like to avoid having to include the .fbs files to set up my Parser. I don't want to clutter an already large monolithic program by adding even more random folders, directories, environment variables, etc. I'd like to be able to generate JSON from the compiled FlatBuffer schemas, and not have to search for a file to do so.
Is there a way for me to avoid having to read back in my .fbs schemas into the parser? My intuition is pointing to no, but the lack of documentation and community support on the topic of FlatBuffers & JSON is telling me there might be a way. I'm hoping that there's a way to use the already generated MyFBSchema_generated.h to create a JSON string.
Yes, see Mini Reflection in the documentation: http://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html
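For completeness, a bare-bones sketch of the mini-reflection route might look like the snippet below. It assumes the schema was compiled with flatc --reflect-names (so that a type table accessor such as MyFBSchemaTypeTable() is generated; the exact name follows your root type), and the resulting text is formatted slightly differently from the Parser/GenerateText output.

#include <string>
#include "flatbuffers/flatbuffers.h"
#include "flatbuffers/minireflect.h"
#include "MyFBSchema_generated.h"  // generated with --reflect-names

// Turn a finished FlatBuffer into JSON-like text without a Parser or .fbs file.
std::string asJsonViaMiniReflection(const flatbuffers::FlatBufferBuilder& fbBuilder)
{
    return flatbuffers::FlatBufferToString(fbBuilder.GetBufferPointer(),
                                           MyFBSchemaTypeTable());
}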

boost::log - Using independent severity levels inside a library/plugin

This is a kind of follow-up of another question I asked (here) where I was made aware that using the same backend with multiple sinks is not a safe approach.
What I am trying to obtain is to "decouple" the severity levels inside a library/plugin from the applications using them, while being able to write the different logs to the same output (be it stdout or, more likely, a file or a remote logger); this is for the following reasons:
I wouldn't like to tie the severity levels inside my library/plugin to those of the applications using them, because a change (for whatever reason) to the severity level list in one of the applications using the library/plugin would force the library to be "updated" with the new severities and, as a cascade, all the other applications which use the library/plugin as well
I would like to be able to use library-specific severity levels (which, to be displayed correctly in the log messages, should be supplied to the sink formatter - hence my need for different sinks)
What is the best way to obtain this?
Some afterthoughts: as per Andrey's reply to my previous question, the "problem" is that the backend is not synchronized to receive data from multiple sources (sinks); thus the solution might seem to be to create a synchronized version of the backends (e.g. wrapping the writes to the backend in a boost::asio post)...
Is this the only solution?
Edit/Update
I am updating the question after Andrey's awesome reply, mainly for the sake of completeness: the libraries/plugins are meant to be used with internally developed applications only, so it is assumed that there will be a common API we can shape for defining the log structure and behaviour.
Plus, most applications are meant to run mainly "unmanned", i.e. with really minimal, if not null, user/runtime interaction, so the basic idea is to have the log level set in some plugin-specific configuration file, read at startup (or set to be reloaded upon a specific API command from the application).
First, I'd like to address this premise:
to be displayed correctly in the log messages, [the severity levels] should be supplied to the sink formatter - hence my need for different sinks
You don't need different sinks to be able to filter or format different types of severity levels. Your filters and formatters have to deal with that, not the sink itself. Only create multiple sinks if you need multiple log targets. So to answer your question, you should focus on the protocol of setting up filters and formatters rather than sinks.
The exact way to do that is difficult to suggest because you didn't specify the design of your application/plugin system. What I mean by that is that there must be some common API shared by both the application and the libraries, and the way you set up logging will depend on where that API belongs. Severity levels, among other things, must be a part of that API. For example:
If you're writing plugins for a specific application (e.g. plugins for a media player) then the application is the one that defines the plugin API, including the severity levels and even possibly the attribute names the plugins must use. The application configures sinks, including filters and formatters, using the attributes mandated by the API, and plugins never do any configuration and only emit log records. Note that the API may include some attributes that allow to distinguish plugins from each other (e.g. a channel name), which would allow the application to process logs from different plugins differently (e.g. write to different files).
If you're writing both plugins and application(s) to adhere to some common API, possibly defined by a third party, then the logging protocol must still be defined by that API. If it's not, then you cannot assume that any other application or plugin not written by you supports logging of any kind, or even that it uses Boost.Log at all. In this case every plugin and the application itself must deal with logging independently, which is the worst case scenario because the plugins and the application may affect each other in unpredictable ways. It is also difficult to manage a system like that because every component has to be configured separately by the user.
If you're writing an application that must be compatible with multiple libraries, each having its own API, then it is the application that should be aware of the logging conventions used in each and every library it uses; there's no way around it. This may include setting up callbacks in the libraries, intercepting file output and translating between the library's log severity levels and the application's severity levels. If the libraries use Boost.Log to emit log records then they should document the attributes they use, including the severity levels, so that the application is able to set up the logging properly.
So, in order to take one approach or the other, you should first decide how your application and plugins interface with each other, what API they share and how that API defines logging. The best case scenario is when you define the API, so you can also set the logging conventions you want. In that case, although possible, it is not advisable or typical to have arbitrary severity levels allowed by the API, because it significantly complicates implementation and configuration of the system.
However, just in case if for some reason you do need to support arbitrary severity levels and there's no way around that, you can define an API for the library to provide, which can help the application to set up filters and formatters. For example, each plugin can provide API like this:
// Returns the filter that the plugin wishes to use for its records
boost::log::filter get_filter();
// The function extracts log severity from the log record
// and converts it to a string
typedef std::function<
    std::string (boost::log::record_view const&)
> severity_formatter;
// Returns the severity formatter, specific for the plugin
severity_formatter get_severity_formatter();
Then the application can use a special filter that will make use of this API.
struct plugin_filters
{
    std::shared_mutex mutex;
    // Plugin-specific filters
    std::vector< boost::log::filter > filters;
};

// Custom filter
bool check_plugin_filters(
    boost::log::attribute_value_set const& values,
    std::shared_ptr< plugin_filters > const& p)
{
    // Filters can be called in parallel, so we need to synchronize
    std::shared_lock< std::shared_mutex > lock(p->mutex);
    for (auto const& f : p->filters)
    {
        // Call each of the plugin filters and pass the record
        // if any of them passes
        if (f(values))
            return true;
    }
    // Suppress the record by default
    return false;
}

std::shared_ptr< plugin_filters > pf = std::make_shared< plugin_filters >();

// Set the filter
sink->set_filter(std::bind(&check_plugin_filters, std::placeholders::_1, pf));

// Add filters from plugins
std::unique_lock< std::shared_mutex > lock(pf->mutex);
pf->filters.push_back(plugin1->get_filter());
pf->filters.push_back(plugin2->get_filter());
...
And a similar formatter:
struct plugin_formatters
{
    std::shared_mutex mutex;
    // Plugin-specific severity formatters
    std::vector< severity_formatter > severity_formatters;
};

// Custom severity formatter
std::string plugin_severity_formatter(
    boost::log::record_view const& rec,
    std::shared_ptr< plugin_formatters > const& p)
{
    std::shared_lock< std::shared_mutex > lock(p->mutex);
    for (auto const& f : p->severity_formatters)
    {
        // Call each of the plugin formatters and return the result
        // if any of them is able to extract the severity
        std::string str = f(rec);
        if (!str.empty())
            return str;
    }
    // By default return an empty string
    return std::string();
}

std::shared_ptr< plugin_formatters > pf =
    std::make_shared< plugin_formatters >();

// Set the formatter
sink->set_formatter(
    boost::log::expressions::stream << "["
        << boost::phoenix::bind(&plugin_severity_formatter,
               boost::log::expressions::record, pf)
        << "] " << boost::log::expressions::message);

// Add formatters from plugins
std::unique_lock< std::shared_mutex > lock(pf->mutex);
pf->severity_formatters.push_back(plugin1->get_severity_formatter());
pf->severity_formatters.push_back(plugin2->get_severity_formatter());
...
Note, however, that at least with regard to filters, this approach is flawed because you allow the plugins to define the filters. Normally, it should be the application that selects which records are logged. And for that there must be a way to translate library-specific severity levels to some common levels, probably defined by the application.
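To illustrate that last point, here is a rough sketch of such a translation hook, assuming the shared API defines the application's severity scale and each plugin exposes a translator for its own levels; the attribute name "PluginSeverity" and the numeric levels are made up for the example.

#include <functional>
#include <boost/log/attributes/attribute_value_set.hpp>
#include <boost/log/attributes/value_extraction.hpp>

// Severity scale owned by the application (part of the shared API)
enum app_severity { app_debug, app_info, app_warning, app_error };

// Provided by each plugin through the shared API
typedef std::function<
    app_severity (boost::log::attribute_value_set const&)
> severity_translator;

// A translator one plugin might return, assuming it logs an int attribute
// named "PluginSeverity" where a higher value means a more severe record
app_severity translate_plugin1_severity(boost::log::attribute_value_set const& values)
{
    int sev = boost::log::extract_or_default< int >("PluginSeverity", values, 0);
    if (sev >= 3) return app_error;
    if (sev == 2) return app_warning;
    if (sev == 1) return app_info;
    return app_debug;
}

// The application keeps the filtering decision to itself, e.g. by accepting
// only records at or above a threshold on its own scale
bool app_accepts(boost::log::attribute_value_set const& values,
                 severity_translator const& translate,
                 app_severity threshold)
{
    return translate(values) >= threshold;
}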

How to call sink->imbue when using boost::log::init_from_settings?

How to call sink->imbue for text file sink when using init_from_settings?
I checked the source code and didn't find a way to re-access those sinks.
It seems that register_sink_factory is the extension point, but the default factories are all in init_from_settings.cpp, so I'm not able to use a decorator pattern to implement this easily.
What I tried was setting the global locale, but that breaks the RotationSize param (which doesn't accept an int with a decimal point).
Another way is:
auto previousLocale = std::locale::global(boost::locale::generator()("zh_CN.UTF-8"));
logging::init_from_settings(settings);
logging::add_common_attributes();
std::locale::global(previousLocale);
Any better ideas?
You can register a sink factory that will configure the sink the way you need. You can find an example here.
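For the record, a rough sketch of such a factory follows. The factory name, the hard-coded file name and the locale are placeholders; a real implementation would read its parameters (file name, rotation size, format, etc.) from the settings section it receives, much like the built-in factories do.

#include <boost/log/keywords/file_name.hpp>
#include <boost/log/sinks/sync_frontend.hpp>
#include <boost/log/sinks/text_file_backend.hpp>
#include <boost/log/utility/setup/from_settings.hpp>
#include <boost/log/utility/setup/settings.hpp>
#include <boost/locale.hpp>
#include <boost/make_shared.hpp>

namespace logging = boost::log;
namespace sinks = boost::log::sinks;

// Builds the text file sink itself, so it can call imbue() on the frontend
// before handing the sink back to init_from_settings
class imbued_text_file_factory : public logging::sink_factory< char >
{
public:
    boost::shared_ptr< sinks::sink > create_sink(settings_section const& settings) override
    {
        auto backend = boost::make_shared< sinks::text_file_backend >(
            logging::keywords::file_name = "app_%N.log");
        auto sink = boost::make_shared<
            sinks::synchronous_sink< sinks::text_file_backend > >(backend);
        sink->imbue(boost::locale::generator()("zh_CN.UTF-8"));
        return sink;
    }
};

// Register the factory before init_from_settings; the settings file then
// selects it with Destination=ImbuedTextFile in the relevant [Sinks.*] section
void init_logging(logging::settings const& settings)
{
    boost::shared_ptr< logging::sink_factory< char > > factory =
        boost::make_shared< imbued_text_file_factory >();
    logging::register_sink_factory("ImbuedTextFile", factory);
    logging::init_from_settings(settings);
}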

How to exchange custom data between Ops in Nuke?

This question is addressed to developers using C++ and the NDK of Nuke.
Context: Assume a custom Op which implements the interfaces of DD::Image::NoIop and
DD::Image::Executable. The node iterates over a range of frames, extracting information at
each frame, which is stored in a custom data structure. A custom knob, which is a member
variable of the above Op (but invisible in the UI), handles the loading and saving
(serialization) of the data structure.
Now I want to exchange that data structure between Ops.
So far I have come up with the following ideas:
Expression linking
Knobs can share information (matrices, etc.) using expression linking.
Can this feature be exploited for custom data as well?
Serialization to image data
The custom data would be serialized and written into a (new) channel. A
node further down the processing tree could grab that and de-serialize
again. Of course, the channel must not be altered between serialization
and de-serialization or else ... this is a hack, I know, but, hey, any port
in a storm!
GeoOp + renderer
In cases where the custom data is purely point-based (which, unfortunately,
it isn't in my case), I could turn the above node into a 3D node and pass
point data to other 3D nodes. At some point a render node would be required
to come back to 2D.
Am I going in the correct direction with this? If not, what is a sensible
approach to make this data structure available to other nodes which rely on the
information contained in it?
This question has been answered on the Nuke-dev mailing list:
If you know the actual class of your Op's input, it's possible to cast the
input to that class type and access it directly. A simple example could be
this snippet below:
//! @file DownstreamOp.cpp
#include "UpstreamOp.h" // The Op that contains your custom data.
// ...
UpstreamOp * upstreamOp = dynamic_cast< UpstreamOp * >( input( 0 ) );
if ( upstreamOp )
{
    YourCustomData * data = upstreamOp->getData();
    // ...
}
// ...
UPDATE
Update with reference to a question that I received via email:
I am trying to do this exact same thing, pass custom data from one Iop
plugin to another.
But these two plugins are defined in different dso/dll files.
How did you get this to work ?
Short answer:
Compile your Ops into a single shared object.
Long answer:
Say
UpstreamOp.cpp
DownstreamOp.cpp
define the two dependent Ops.
In a first attempt I compiled the first plugin using only UpstreamOp.cpp,
as usual. For the second plugin I compiled both DownstreamOp.cpp and
UpstreamOp.cpp into that plugin.
Strangely enough that worked (on Linux; didn't test Windows).
However, by overriding
bool Op::test_input( int input, Op * op ) const;
things will break. Creating and saving a Comp using the above plugins still
works. But loading that same Comp again breaks the connection in the node graph
between UpstreamOp and DownstreamOp and it is no longer possible to connect
them again.
My hypothesis is this: since both plugins contain symbols for UpstreamOp, it
depends on the load order of the plugins whether a node uses instances of UpstreamOp
from the first or from the second plugin. So, if UpstreamOp from the first plugin
is used, then any dynamic_cast in Op::test_input() will fail and the two Ops cannot
be connected anymore.
It is still surprising that Nuke would even bother to start at all with the above
configuration, since it can be rather picky about symbols from plugins, e.g. if they
are missing.
Anyway, to get around this problem I did the following:
compile both Ops into a single shared object, e.g. myplugins.so, and
add a TCL or Python script (init.py/menu.py) which instructs Nuke how to load
the Ops correctly.
An example of a TCL script can be found in the dev guide, and the instructions
for your menu.py could be something like this:
menu = nuke.menu( 'Nodes' ).addMenu( 'my-plugins' )
menu.addCommand('UpstreamOp', lambda: nuke.createNode('UpstreamOp'))
menu.addCommand('DownstreamOp', lambda: nuke.createNode('DownstreamOp'))
nuke.load('myplugins')
So far, this works reliably for us (on Linux & Windows, haven't tested Mac).

design pattern for "save"

I'm currently working on a "save" mechanism, which allows a user to save the project he is working on to hard disk. The output will be an XML file containing all kinds of data.
Now our project structure is about to change and we need to write a new XML file (create a new save method).
So now here comes the challenge: when saving, I want the user to be able to choose which file format he will be creating (version1 (old) or version2 (new)).
Does anyone know how to achieve that? Is there a suitable design pattern around?
Remarks:
- The data we are saving can be seen as unrelated blocks, so it would actually be easy to exchange an old block with a new one.
- The whole point is that it should still be readable when loading an old project. (I assume this can be done with tags, and by just reacting to the tags when loading?)
This sounds like a good application for the Strategy pattern.
You would create an abstract base class FileFormat (the Strategy interface) with two virtual functions, projectToXml and xmlToProject, which are supposed to turn your internal project representation into XML or vice versa.
Then you create two implementing subclasses FileFormatNew and FileFormatLegacy (these are the concrete strategies).
Your save functions would then additionally require an instance of FileFormat, and call the corresponding method of that object to do the data conversion. Your load function could choose the strategy to use by examining the XML tree for something which tells it which version it is.
And whenever you need to support another file format, you just have to create a new class which is a subclass of FileFormat.
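A bare-bones sketch of that structure could look like the following; the Project and XmlDocument types are placeholders standing in for your real project representation and XML handling.

#include <string>

// Placeholder types for the real project representation and XML document.
struct Project {};
using XmlDocument = std::string;

// Strategy interface: one implementation per file-format version.
class FileFormat
{
public:
    virtual ~FileFormat() = default;
    virtual XmlDocument projectToXml(const Project& project) const = 0;
    virtual Project xmlToProject(const XmlDocument& xml) const = 0;
};

// Concrete strategy for the old (version1) format.
class FileFormatLegacy : public FileFormat
{
public:
    XmlDocument projectToXml(const Project& /*project*/) const override
    { /* write the version1 blocks */ return XmlDocument(); }
    Project xmlToProject(const XmlDocument& /*xml*/) const override
    { /* read the version1 blocks */ return Project(); }
};

// Concrete strategy for the new (version2) format.
class FileFormatNew : public FileFormat
{
public:
    XmlDocument projectToXml(const Project& /*project*/) const override
    { /* write the version2 blocks */ return XmlDocument(); }
    Project xmlToProject(const XmlDocument& /*xml*/) const override
    { /* read the version2 blocks */ return Project(); }
};

// The save routine is handed the strategy the user chose; the load routine
// would pick one after peeking at a version tag in the XML.
void save(const Project& project, const FileFormat& format)
{
    XmlDocument xml = format.projectToXml(project);
    // ... write xml to disk ...
}

So save(myProject, FileFormatNew()) writes the new format, while save(myProject, FileFormatLegacy()) keeps producing the old one.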
Addendum after the exchange in the comments
When you are going to have a lot of versions with very small differences and you still want to use the Strategy pattern, you could make the FileFormat a composite of multiple strategies: a CircleStrategy, a RectangleStrategy, a LineStrategy, etc. In that case I wouldn't use different classes for different versions of the FileFormat. I would create a static factory function for each version which returns a FileFormat with the Strategy objects used in that version.
FileFormat* FileFormat::createVersion1_0() {
    return new FileFormat(
        new LineStrategyOld(),
        new CircleStrategyOld(),
        new RectangleStrategyOld()
    );
}

FileFormat* FileFormat::createVersion1_1() {
    // the 1.1 version introduced the new way to save lines
    return new FileFormat(
        new LineStrategyNew(),
        new CircleStrategyOld(),
        new RectangleStrategyOld()
    );
}

FileFormat* FileFormat::createVersion1_2() {
    // 1.2 uses the new format to save circles
    return new FileFormat(
        new LineStrategyNew(),
        new CircleStrategyNew(),
        new RectangleStrategyOld()
    );
}

FileFormat* FileFormat::createVersion1_3() {
    // 1.3 uses a new format to save rectangles, but we realized that
    // the new way to save lines wasn't that good after all, so we
    // returned to the old way.
    return new FileFormat(
        new LineStrategyOld(),
        new CircleStrategyNew(),
        new RectangleStrategyNew()
    );
}
Note: In real code you would of course use more descriptive suffixes than "Old" and "New" for your strategy class names.