Apache Arrow Flight: Getting sorted data from multiple endpoints - apache-arrow

According to the documentation (https://arrow.apache.org/docs/dev/format/Flight.html), an Apache Arrow Flight client cannot get sorted data from multiple endpoints. It seems that this is by design.
In the introductory blog post (https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/), the authors say "While Flight streams are not necessarily ordered, we provide for application-defined metadata which can be used to serialize ordering information." But I think the application-defined metadata is not very useful, since a generic client (like a BI application) that goes through a wrapper - for example, Apache Arrow Flight SQL, let alone a wrapper of a wrapper such as the Apache Arrow Flight SQL JDBC driver - has no way of knowing about it.
Is there any standard way to get sorted data from multiple Apache Arrow Flight endpoints? If not, why did the designers choose not to support that feature?
Thanks.

It was not considered at the time, but you are right: it would be useful to have a way to indicate this so that various wrappers and projects building on top have a standardized way to know how to handle this.
The main idea is that if data is sorted, you should return a single endpoint. I believe the reasoning was that it would be rare to have an implementation capable of doing sorting across multiple endpoints, since that would be expensive to implement. Of course, that isn't very useful if your backend can actually sort data across multiple workers!
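To make the current situation concrete, here is a minimal sketch using pyarrow.flight (the server URI and the SQL command are made up for illustration): the client reads every endpoint listed in a FlightInfo and concatenates the results, and since the protocol itself does not guarantee an order across endpoints, any merge or re-sort has to happen on the client.

import pyarrow as pa
import pyarrow.flight as flight

# connect and describe the query; the URI and command are placeholders
client = flight.connect("grpc://localhost:8815")
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM t ORDER BY k"))

tables = []
for endpoint in info.endpoints:
    # each endpoint may point at a different worker (endpoint.locations);
    # here we simply fetch every stream through the same client
    reader = client.do_get(endpoint.ticket)
    tables.append(reader.read_all())

# the client still has to merge/re-sort locally if ordering across endpoints matters
result = pa.concat_tables(tables).sort_by("k")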
I (as one of the contributors to the project) am planning to put up a proposal to handle this case. If you are interested, please keep an eye on the mailing list: dev@arrow.apache.org.

Related

Separation of Runtime and History

I would like to use separate databases for runtime and history data without implementing a custom HistoryEventHandler. Does anyone know whether this is possible?
I read the Camunda user guides, but that did not help much because they only hint at the custom-implementation route.
Currently, every time I query history data (about 2 million activity entries), the performance of the system drops, as it effectively blocks the runtime too. I'd like to avoid this without losing the ability to query historic data.
That would be a really cool feature, but it is currently not supported. You will have to disable the default history and implement a custom handler.
Camunda BPM offers Optimize, which pulls the history data from the engine into an Elasticsearch database. If you are using the Enterprise version, it may be a way to solve this.
(Based on your comments to other answers, it appears that you're interested in learning more about custom HistoryEventHandler implementations. Thus, I'm adding this answer in the hope that it will help.)
Implementing a custom History Event Handler isn't difficult, but there are a few important points to keep in mind:
Unless you want to skip the storage of history information in the standard Camunda history tables, you'll want to use their CompositeHistoryEventHandler. This simply gives you the ability to use multiple HistoryEventHandler implementations.
Any HistoryEventHandler implementations will run in the same threads as the ones executing process instances; thus, you will want to be cognizant of the performance impact your custom HistoryEventHandler will have.
You may want to consider publishing your history events through a message bus or messaging system to allow for reliable delivery without impacting Camunda workflow instance performance (a sketch of this pattern follows below).
Finally, it may make sense to use your custom HistoryEventHandler along with Camunda's default HistoryEventHandler and their functionality for deleting process instances after a period of time. This would allow you to use their querying capabilities for some period of time without having the history stack up (and thus slowing down your system).
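Camunda's HistoryEventHandler and CompositeHistoryEventHandler are Java interfaces, so the following is only a language-agnostic sketch (written in Python, with made-up class names) of the pattern described above: a composite handler forwards each history event both to a default store and to a message queue, so the work done on the engine's thread stays small and the real persistence happens elsewhere.

import json
import queue

class DefaultDbHandler:
    """Stand-in for Camunda's built-in database history handler."""
    def handle_event(self, event):
        pass  # the real implementation writes to the standard history tables

class QueuePublishingHandler:
    """Serializes each event and hands it to a queue; a separate consumer persists it."""
    def __init__(self, out_queue):
        self._queue = out_queue
    def handle_event(self, event):
        self._queue.put(json.dumps(event))

class CompositeHandler:
    """Mirrors the idea of CompositeHistoryEventHandler: fan out to several handlers."""
    def __init__(self, *handlers):
        self._handlers = handlers
    def handle_event(self, event):
        for handler in self._handlers:
            handler.handle_event(event)

events = queue.Queue()
handler = CompositeHandler(DefaultDbHandler(), QueuePublishingHandler(events))
handler.handle_event({"type": "activity-instance-end", "processInstanceId": "42"})

In Camunda itself these would of course be Java classes implementing HistoryEventHandler and wired into the process engine configuration.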

Getting feed from server client model

In a typical client-server model, what does it mean to subscribe or unsubscribe to a feed? Is there a generic codebase, boilerplate model, set of standard procedures, or class design and functionality involved? This is all C++ based. There is no other information other than that the client is attempting to connect to the server to retrieve data based on some sort of signature. I know it's somewhat vague, but I guess this is really a question of what to keep in mind and what a typical subscribe or unsubscribe method might entail - perhaps something along the lines of extending a client-server model like http://www.linuxhowtos.org/C_C++/socket.htm.
This is primarily an information architecture question. "Subscribing to feeds" implies that the server offers a lot of information, which may not be uniformly relevant to all clients. Feeds are a mechanism by which clients can select relevant information.
Concretely, you first need to identify the atoms of information that you have. What are the smallest chunks of data? What properties do they have? Can new atoms replace older atoms, and if so, what identifies their relation? Are there other atom relations besides replacement?
Next, there's the mapping of those atoms to particular feeds. What are the possible combinations of atoms needed by a client? How can these combinations be bundled into two or more feeds? Is it possible to map each atom uniquely to a single feed? Or must atoms be shared between feeds? If so, is that rare enough that you can ignore it and just send duplicates?
If a client connects, how do you figure out which atoms need to be shared? Is it just live streaming (atoms are sent only when they're generated on the server), do you have a set of current atoms (sent when a client connects), or do you need some history as well? Is there client caching?
It's clear that you can't have a single off-the-shelf solution when the business side is so diverse.
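The subscribe/unsubscribe mechanics themselves, though, usually come down to simple bookkeeping on the server. Here is a minimal, language-agnostic sketch (written in Python for brevity; a C++ server would keep the same registry around its sockets, and all names here are invented):

from collections import defaultdict

class FeedRegistry:
    """Tracks which clients want which feeds and pushes new atoms only to subscribers."""
    def __init__(self):
        self._subscribers = defaultdict(set)   # feed id -> set of client handles

    def subscribe(self, feed_id, client):
        self._subscribers[feed_id].add(client)

    def unsubscribe(self, feed_id, client):
        self._subscribers[feed_id].discard(client)

    def publish(self, feed_id, atom):
        # 'client.send' stands in for whatever your transport layer offers
        for client in self._subscribers[feed_id]:
            client.send(atom)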

Restful API - handling large amounts of data

I have written my own Restful API and am wondering about the best way to deal with large amounts of records returned from the API.
For example, if I use the GET method on myapi.co.uk/messages/, this will bring back the XML for all message records, which in some cases could run into the thousands. This makes using the API very sluggish.
Can anyone suggest the best way of dealing with this? Is it standard to return results in batches and to specify batch size in the request?
You can change your API to include additional parameters to limit the scope of data returned by your application.
For instance, you could add limit and offset parameters to fetch just a slice of the collection; this is how pagination is typically done in accordance with REST. A request like the one below would fetch 10 resources from the messages collection, the 21st through the 30th. This way you can ask for a specific portion of a huge data set:
myapi.co.uk/messages?limit=10&offset=20
Another way to decrease the payload would be to ask for only certain parts of your resources' representation. Here's how Facebook does it:
/joe.smith/friends?fields=id,name,picture
Remember that while using either of these methods, you have to provide a way for the client to discover each of the resources. You can't assume they'll just look at the parameters and start changing them in search of data. That would be a violation of the REST paradigm. Provide them with the necessary hyperlinks to avoid it.
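As a rough sketch of how pagination, field selection, and a "next" hyperlink can fit together in one endpoint (Flask is used here purely for illustration; the route, field names, and in-memory data are all made up):

from flask import Flask, jsonify, request, url_for

app = Flask(__name__)
MESSAGES = [{"id": i, "text": f"message {i}"} for i in range(1000)]  # stand-in data

@app.route("/messages")
def list_messages():
    limit = min(int(request.args.get("limit", 10)), 100)   # cap the page size
    offset = int(request.args.get("offset", 0))
    fields = request.args.get("fields")                     # e.g. fields=id,text

    page = MESSAGES[offset:offset + limit]
    if fields:
        wanted = set(fields.split(","))
        page = [{k: v for k, v in m.items() if k in wanted} for m in page]

    body = {"items": page}
    if offset + limit < len(MESSAGES):
        # hypermedia: tell the client where the next page lives
        # instead of making it guess parameter values
        body["next"] = url_for("list_messages", limit=limit, offset=offset + limit)
    return jsonify(body)

A request such as myapi.co.uk/messages?limit=10&offset=20&fields=id,text would then return just that slice, with only the requested fields and a link to the following page.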
I strongly recommend viewing this presentation on RESTful API design by Apigee (the screencast is called "Teach a Dog to REST"). Good practices and neat ideas for approaching everyday problems are discussed there.
EDIT: The video has been updated a number of times since I posted this answer; you can check out the 3rd edition from January 2013.
In general, there are several ways to improve API performance, including for large responses. Each of these topics can be explored in depth:
Reduce Size With Pagination
Organizing Using Hypermedia
Exactly What a User Needs With Schema Filtering
Defining Specific Responses Using The Prefer Header
Using Caching To Make Responses More Efficient
More Efficiency Through Compression
Breaking Things Down With Chunked Responses
Switch To Providing More Streaming Responses
Moving Forward With HTTP/2
Source: https://apievangelist.com/2018/04/20/delivering-large-api-responses-as-efficiently-as-possible/
If you are using .NET Core, you should try this magic package:
Microsoft.AspNetCore.ResponseCompression
Then add this line in ConfigureServices in your Startup file:
services.AddResponseCompression();
And in the Configure method:
app.UseResponseCompression();

Is there a database implementation that has notifications and revisions?

I am looking for a database library that can be used within an editor to replace a custom document format. In my case the document would contain a functional program.
I want application data to be persistent even while editing, so that when the program crashes, no data is lost. I know that all databases offer that.
On top of that, I want to access and edit the document from multiple threads, processes, possibly even multiple computers.
Format: a simple key/value database would totally suffice. SQL usually needs to be wrapped, and if I can avoid pulling in a heavy ORM dependency, that would be splendid.
Revisions: I want to be able to roll back changes up to the first change to the document that has ever been made, not only in one session, but also between sessions/program runs.
I need notifications: each process must be able to be notified of changes to the document so it can update its view accordingly.
I see these requirements as rather basic, a foundation to solve the usual tough problems of an editing application: undo/redo, multiple views on the same data. Thus, the database system should be lightweight and undemanding.
Thank you for your insights in advance :)
Berkeley DB is an undemanding, lightweight key-value database that supports locking and transactions. There are bindings for it in a lot of programming languages, including C++ and Python. You'll have to implement revisions and notifications yourself, but that's actually not all that difficult.
It might be a bit more than what you're asking for, but you should definitely look at CouchDB.
It is a document database with "document" being defined as a JSON record.
It stores all the changes to the documents as revisions, so you get revisions out of the box (see the sketch at the end of this answer).
It has a powerful JavaScript-based view engine to aggregate all the data you need from the database.
All the commits to the database are written to the end of the repository file and the writes are atomic, meaning that unsuccessful writes do not corrupt the database.
Another nice bonus you'll get is easy and flexible replication of your database.
See the full feature list on their homepage
On the minus side (depending on your point of view) is the fact that it is written in Erlang and (as far as I know) runs as an external process...
I don't know anything about notifications, though - it seems that if you are working with replicated databases, the changes are instantly replicated/synchronized between them. Other than that, I suppose you should be able to roll your own notification scheme...
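Coming back to the revision model: here is a minimal sketch against CouchDB's plain HTTP API using the Python requests library (the server URL, database name, and document contents are made up, and authentication is omitted):

import requests

base = "http://localhost:5984/documents"
requests.put(base)  # create the database (fine to run repeatedly for a demo)

# create a document; CouchDB answers with the first revision id
resp = requests.put(f"{base}/doc-1", json={"title": "my program", "body": "v1"})
rev1 = resp.json()["rev"]

# every update must name the revision it is based on, which yields a new revision
resp = requests.put(f"{base}/doc-1", json={"title": "my program", "body": "v2", "_rev": rev1})
rev2 = resp.json()["rev"]

# older revisions stay readable until the database is compacted
old = requests.get(f"{base}/doc-1", params={"rev": rev1}).json()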
Check out ZODB. It doesn't have notifications built in, so you would need a messaging system there (since you may use separate computers). But it has transactions; you can roll back forever (unless you pack the database, which removes earlier revisions); you can access it directly as an integrated part of the application or run it as client/server (with multiple clients, of course); you get automatic persistence; there is no ORM; etc.
It's pretty much Python-only though (it's based on Pickles).
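A minimal sketch of the transaction and undo workflow, assuming the ZODB package is installed (the file name, keys, and values here are made up for illustration):

import transaction
import ZODB
import ZODB.FileStorage

storage = ZODB.FileStorage.FileStorage("document.fs")
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()

root["program"] = {"main": "(define main ...)"}   # any picklable value
transaction.commit()                               # first revision

root["program"] = {"main": "(define main 42)"}
transaction.commit()                               # second revision

# undo the most recent transaction; possible for any transaction until you pack
db.undo(db.undoLog(0, 1)[0]["id"])
transaction.commit()

conn.close()
db.close()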
http://en.wikipedia.org/wiki/Zope_Object_Database
http://pypi.python.org/pypi/ZODB3
http://wiki.zope.org/ZODB/guide/index.html
http://wiki.zope.org/ZODB/Documentation

Query building in a database agnostic way

In a C++ application that can use just about any relational database, what would be the best way of generating queries that can be easily extended to allow for a database engine's eccentricities?
In other words, the code may need to retrieve data in a way that is not consistent among the various database engines. What's the best way to design the code on the client side to generate queries so that supporting a new database engine is a relatively painless affair?
For example, if I have (MFC) code that looks like this:
CString query = "SELECT id FROM table";
results = dbConnection->Query(query);
and we decide to support some database that uses, um, "AVEC" instead of "FROM". Now whenever the user uses that database engine, this query will fail.
Options so far:
Worst option: have the code making the query check the database type.
Better option: Create query request method on the db connection object that takes a unique query "code" and returns the appropriate query based on the database engine in use.
Betterer option: Create a query builder class that allows the caller to construct queries without using any SQL directly. Once the query is completed, the caller can invoke a "Generate" method which returns a query string appropriate for the active database engine.
Best option: ??
Note: The database engine itself is abstracted away through some thin layers of our own creation. The queries themselves are the only remaining problem.
Solution:
I've decided to go with the "better" option (query "selector") for two reasons.
Debugging: As mentioned below, debugging is going to be slightly easier with the selector approach since the queries are pre-built and listed out in a readable form in code.
Flexibility: It occurred to me that some databases might have vastly better and completely different ways of solving a particular query. For example, with Access I perform a complicated query on multiple tables each time because I have to, but on SQL Server I'd like to set up a view. Selecting from the view and selecting from several tables are completely different queries (I think), and this query selector would handle it easily.
You need your own query-writing object, which can be inherited from by database-specific implementations.
So you would do something like:
#include <memory>

std::unique_ptr<DbAgnosticQueryObject> query = std::make_unique<PostgresSQLQuery>();
query->setFrom("foo");
query->setSelect("id");
// and so on
CString queryString = query->toString();
It can get pretty complicated in there once you go past simple selects from a single table. There are already ORM packages out there that deal with a lot of these nuances; it may be worth looking at them instead of writing your own.
Best option: Pick a database, and code to it.
How often are you going to up and swap out the database on the back end of a production system? And even if you did, you'd have a lot more to worry about than minor syntax issues. (Major things like join syntax and even datatypes can differ widely between databases.)
Now, if you are designing a commercial application where you want the customer to be able to use one of several back-end options when they implement it, then you may have to specify "we support Oracle, MS SQL, or MySQL" and code to those specific options.
All of your options can be reduced to
Worst option: have the code making the query check the database type.
It's just a matter of where you're putting the logic to check the database type.
The option that I've seen work best in practice is
Better option: Create query request method on the db connection object that takes a unique query "code" and returns the appropriate query based on the database engine in use.
In my experience it is much easier to test queries independently from the rest of your code. It gets a lot harder if you have objects that are piecing together queries from bits of syntax, because then you have to test the query-creation code and the query itself.
If you pull all of your SQL out into separate files that are written and maintained by hand, you can have someone who is an expert in SQL write them (you can still automate the testing of these queries). If you try to write query-generating functions you'll essentially have a C++ expert writing SQL.
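As a language-agnostic illustration of this "query selector" idea (shown in Python only for brevity; the file layout and query codes are made up), the hand-written, per-engine SQL lives in plain files and the calling code only refers to a symbolic query code:

import json
import os

# write a tiny per-engine SQL file so the example is self-contained
os.makedirs("sql", exist_ok=True)
with open("sql/postgres.json", "w") as f:
    json.dump({"FETCH_ALL_IDS": "SELECT id FROM table"}, f)

class QuerySelector:
    """Looks up hand-written SQL by a symbolic code for the active database engine."""
    def __init__(self, engine):
        # e.g. sql/postgres.json, sql/oracle.json, each mapping code -> SQL text
        with open(f"sql/{engine}.json") as f:
            self._queries = json.load(f)

    def get(self, code):
        return self._queries[code]

selector = QuerySelector("postgres")
sql = selector.get("FETCH_ALL_IDS")   # e.g. "SELECT id FROM table"

The SQL files can then be written and reviewed by someone who knows each engine, and tested independently of the application code.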
Choose an ORM, and start mapping.
If you are to support more than one DB, your problem is only going to get worse.
And just think of the databases that are coming - cloud databases with no (or close to no) SQL, and object databases.
Take your queries outside the code - put them in the DB or in a resource file and allow overrides for different database engines.
If you use stored procedures (SPs) it's potentially even easier, since the SPs abstract away your database differences.
I would think that what you would want to do, if you needed the ability to support multiple databases, would be to create a data provider interface (or abstract class) and associated concrete implementations. The data provider would need to support your standard query operators and other common functionality required to support your query operations (have a look at the IEnumerable extension methods in .NET 3.5). Each concrete provider would then translate these into specific queries based on the target database engine.
Essentially, what you do is create a database abstraction layer and have your code interact with it. If you can find one of these for C++, it would probably be worth buying instead of writing. You may also want to look for Inversion of Control (IoC) containers for C++ that would basically do this and more. I know of several for Java and C#, but I'm not familiar with any for C++.