How to convert Calcite Logical plan to another data engine's physical plan? - apache-calcite

I am trying to design the relational table structure and standard SQL query for Apache - IoTDB (a time series database) with Calcite. Now I want to know how i can convert Calcite's logical plan to IoTDB own physical plan easily.
For example, I execute a simple query:
SELECT deviceid, temperature FROM root_ln WHERE deviceid = 'd1'
After parsing by Calcite, Logical Plan represented by RelNodes is like this:
LogicalProject(DEVICEID=[$1], TEMPERATURE=[$2])
LogicalFilter(condition=[=($1, 'd1')])
EnumerableTableScan(table=[[IoTDBSchema, ROOT_LN]])
And I want to convert it to IoTDB's own physical plan, which i need to provide:(just the simple example)
Project's path, like root.ln.d1.temperature, we execute query by these paths. I have to put tablename(root.ln), deviceid(d1), and measurement(temperature) together to get a whole path. This needs to scan the whole logical plan.
Project's datatype, like float. I can get it from paths, it's simple.
Filter's expression, I have to convert LogicalFilter's expression to IoTDB's expression again, including parse what $1 means in this example.
I think it involves too much work. When the query becomes more complex, the conversion becomes more difficult too.
So I think maybe there is a simpler way to do this work.

The standard way of creating a physical plan for a particular data source is to create an adapter for that data source. This amounts to writing rules which can convert logical operators to physical operators. This allows data sources to specify what logical operators it can implement and how.

Related

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

I need help in designing my C++ Console application

I have a task to complete.
There are two types of csv files 4000+ both related to each other.
2 types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries and for each country multiple security files
Now I need to READ them do some CALCULATION and then WRITE the output in another files
READ
Read both the file Country 2.csv and Security.csv and extract all the data from them.
For example :
Read France 2.csv, extract Security_Name, Final NOS, Final FFR
Then Read Security.csv(which matches the Security_Name) and extract Date, Close Price, Volume
Calculation
Calculations are basically finding Median of the values extracted which is quite simple.
For Example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
Write
Based on the month I need to sort the output in two different file with following formats:
If Month % 3 = 0
Save It as MONTH_NAME.csv in following format:
Security name; 12-month indicator; 3-month indicator; FOT
Else
Save It as MONTH_NAME.csv in following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is how do I design my application in such a way that it is maintainable and the flow of data throughout the execution is seamless?
So first thing. Based on the kind of data you are looking to generate, I would probably be looking at moving this data to a SQL db if at all possible. This is "one SQL query" kind of stuff. And far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier it's not too much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design my input classes. Data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class...or something along those lines. You can think about whether a Security class should contain a collection of SecurityClosees (data), or whether the data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line. So "1,2,3" gets turned into a vector<string> with the contents "1" "2" "3". Then, each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher level data along too. Like the filename if you want the security data to know which security it belongs to..
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.

Tool for querying large numbers of csv files

We have large numbers of csv files, files/directories are partitioned by date and several other factors. For instance, files might be named /data/AAA/date/BBB.csv
There are thousands of files, some are in the GB range in size. Total data sizes are in the terabytes.
They are only ever appended to, and usually in bulk, so write performance is not that important. We don't want to load it into another system because there are several important processes that we run that rely on being able to stream the files quickly, which are written in c++.
I'm looking for tool/library that would allow sql like queries against the data directly off the data. I've started looking at hive, spark, and other big data tools, but its not clear if they can access partitioned data directly from a source, which in our case is via nfs.
Ideally, we would be able to define a table by giving a description of the columns, as well as partition information. Also, the files are compressed, so handling compression would be ideal.
Are their open source tools that do this? I've seen a product called Pivotal, which claims to do this, but we would rather write our own drivers for our data for an open source distributed query system.
Any leads would be appreciated.
Spark can be a solution. It is in memory distributed processing engine. Data can be loaded into memory on multiple nodes in the cluster and can be processed in memory. You do not need to copy data to another system.
Here are the steps for your case:
Build multiple node spark cluster
Mount NFS on to one of the nodes
Then you have to load data temporarily into memory in the form of RDD and start processing it
It provides
Support for programming languages like scala, python, java etc
Supports SQL Context and data frames. You can define structure to the data and start accessing using SQL Queries
Support for several compression algorithms
Limitations
Data has to be fit into memory to be processed by Spark
You need to use data frames to define structure on data after which you can query the data using sql embedded in programming languages like scala, python, java etc
There are subtle differences between traditional SQL in RDBMS and SQL in distributed systems like spark. You need to aware of those.
With hive, you need to have data copied to HDFS. As you do not want to copy the data to another system, hive might not be solution.

What are the steps of preprocessing anonymized data for predictive analysis?

Suppose we have a large dataset of anonymized data. Dataset consist if certain number of variables and observations. All we can learn about data is a type(numeric, char, date, etc.) of variable. We can do it by looking to data manually.
What are the best practise steps of pre-proccessing dataset for the further analysis?
Just for instance, let this data set be just one table, so we don't need to check any relations between tables.
This link gives the complete set of validations currently in practice. Still, to start with:
wherever possible, have your data written in such a way that you can parse it as fast and as easily as possible, using your preferred programming language's methods/constructors;
you can verify if all the data types match correctly - like int fields do not contain string data etc;
you can verify that your values are in acceptable range;
check if a non-nullable field has null values;
check if dates are in expected ranges;
check if data follows correct set-membership constraints wherever applicable;
if you have pattern following data like phone numbers, make sure they are in (XXX) XXX-XXXX design, if you prefer them that way;
are the zip codes at correct accuracy level (in US you may have 5 or 9 digits of accuracy);
if your data is time-series, is it complete (i.e. you have values for all dates)?
is there any unwanted duplication?
Hope this is good enough to get you started...

Underlying mechanism in firing SQL Queries in Oracle

When we fire a SQL query like
SELECT * FROM SOME_TABLE_NAME under ORACLE
What exactly happens internally? Is there any parser at work? Is it in C/C++ ?
Can any body please explain ?
Thanks in advance to all.
Short answer is yes, of course there is a parser module inside Oracle that interprets the statement text. My understanding is that the bulk of Oracle's source code is in C.
For general reference:
Any SQL statement potentially goes through three steps when Oracle is asked to execute it. Often, control is returned to the client between each of these steps, although the details can depend on the specific client being used and the manner in which calls are made.
(1) Parse -- I believe the first action is actually to check whether Oracle has a cached copy of the exact statement text. If so, it can save the work of parsing your statement again. If not, it must of course parse the text, then determine an execution plan that Oracle thinks is optimal for the statement. So conceptually at least there are two entities at work in this phase -- the parser and the optimizer.
(2) Execute -- For a SELECT statement this step would generally run just enough of the execution plan to be ready to return some rows to the client. Depending on the details of the plan, that might mean running the whole thing, or might mean doing just a very small fraction of the work. For any other kind of statement, the execute phase is when all of the work is actually done.
(3) Fetch -- This is when rows are actually returned to the client. Generally the client has a predetermined fetch array size which sets the maximum number of rows that will be returned by a single fetch call. So there may be many fetches made for a single statement. Of course if the statement is one that cannot return rows, then there is no fetch step necessary.
Manasi,
I think internally Oracle would have its own parser, which does parsing and tries compiling the query. Think its not related to C or C++.
But need to confirm.
-Justin Samuel.