Tool for querying large numbers of csv files - c++

We have large numbers of CSV files; the files and directories are partitioned by date and several other factors. For instance, files might be named /data/AAA/date/BBB.csv
There are thousands of files, some are in the GB range in size. Total data sizes are in the terabytes.
They are only ever appended to, and usually in bulk, so write performance is not that important. We don't want to load it into another system because there are several important processes that we run that rely on being able to stream the files quickly, which are written in c++.
I'm looking for a tool/library that would allow SQL-like queries directly against the data. I've started looking at Hive, Spark, and other big data tools, but it's not clear whether they can access partitioned data directly from a source, which in our case is NFS.
Ideally, we would be able to define a table by giving a description of the columns, as well as partition information. Also, the files are compressed, so handling compression would be ideal.
Are there open source tools that do this? I've seen a product called Pivotal, which claims to do this, but we would rather write our own drivers for our data for an open source distributed query system.
Any leads would be appreciated.

Spark can be a solution. It is an in-memory distributed processing engine: data is loaded into memory across multiple nodes in the cluster and processed there. You do not need to copy the data to another system.
Here are the steps for your case:
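1. Build a multi-node Spark cluster
2. Mount the NFS share at the same path on every node (when Spark reads from a local filesystem path, that path must be accessible on all worker nodes, not just one)
3. Load the data into memory as an RDD or DataFrame and start processing it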
It provides:
Support for programming languages such as Scala, Python, and Java
A SQL context and DataFrames: you can define a structure for the data and then query it with SQL
Support for several compression codecs
Limitations:
Spark works best when the working set fits in cluster memory; data that does not fit is spilled to disk or recomputed, which is considerably slower
You need to use DataFrames to define a structure on the data, after which you can query it using SQL embedded in a host language such as Scala, Python, or Java
There are subtle differences between traditional SQL in an RDBMS and SQL in distributed systems like Spark. You need to be aware of those.
With Hive, the data typically has to live in HDFS (or another Hadoop-compatible file system). Since you do not want to copy the data to another system, Hive might not be a solution.
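Whichever engine is used, the core of what the question asks for (a table defined by a list of columns plus partition values taken from the path) can be sketched in plain Python. The /data/AAA/date/BBB.csv layout comes from the question; the partition names `source`, `date`, and `name`, the `.csv.gz` suffix, and the function `scan_partitions` are assumptions made for illustration and are not part of Spark or Hive.

```python
import csv
import gzip
from pathlib import Path


def scan_partitions(root, columns, date=None):
    """Yield rows as dicts from files laid out as <root>/<source>/<date>/<name>.csv.gz.

    Partition values come from the path, so files whose date does not
    match are skipped without ever being opened (partition pruning).
    """
    for path in sorted(Path(root).glob("*/*/*.csv.gz")):
        source, file_date = path.parts[-3], path.parts[-2]
        if date is not None and file_date != date:
            continue  # pruned: this file is never read
        name = path.name[: -len(".csv.gz")]
        # gzip.open in text mode decompresses while streaming
        with gzip.open(path, "rt", newline="") as handle:
            for record in csv.reader(handle):
                row = dict(zip(columns, record))
                row.update(source=source, date=file_date, name=name)
                yield row
```

Because the function is a generator, a caller can filter or aggregate rows without holding a whole file in memory. A distributed engine applies the same pruning idea, but spreads the surviving files across workers.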

Related

Supporting a Delta Lake-like format in BigQuery

I usually use Parquet to load data into BigQuery as a starting point, as with the compression and support it seems to be the best fit when compared with other formats, such as JSON, CSV, Avro, and ORC (at least in our tests of it).
However, I'm wondering if it's possible to attain a sort of Delta Lake-like quality so that we can use Parquet perhaps as a starting point and then some other stored file(s) to process a transaction log of modifications to the data (Insert, Update, Delete), particularly the Update and Delete operations. We can use the streaming/storage-write API, but we'd also like the ability to re-play the data if we ever need to snapshot or rollback the data.
I suppose I'm basically looking for something like a "File-ingest" plus "CDC-log" for data ingestion. Is there a file-only architecture that could support this?
I don't think BigQuery alone currently offers a file-only option like this. I'm not too familiar with Delta Lake, but since it works with Spark, you could use something like Dataproc to emulate that kind of architecture.
Here is a link to an implementation of Delta Lake on GCP:
https://cloud.google.com/blog/topics/developers-practitioners/how-build-open-cloud-datalake-delta-lake-presto-dataproc-metastore

Does dask S3 reading cache the data on disk/RAM?

I've been reading about dask and how it can read data from S3 and do processing from that in a way that does not need the data to completely reside in RAM.
I want to understand what dask would do if I have a very large S3 file that I am trying to read. Would it:
Load that S3 file into RAM ?
Load that S3 file and cache it in /tmp or something ?
Make multiple calls to the S3 file in parts
I am assuming here I am doing a lot of different complicated computations on the dataframe and it may need multiple passes on the data - i.e. let's say a join, group by, etc.
Also, a side question is if I am doing a select from S3 > join > groupby > filter > join - would the temporary dataframes which I am joining with be on S3 ? or on disk ? or RAM ?
I know Spark uses RAM and overflows to HDFS for such cases.
I'm mainly thinking of single machine dask at the moment.
For many file-types, e.g., CSV, parquet, the original large files on S3 can be safely split into chunks for processing. In that case, each Dask task will work on one chunk of the data at a time by making separate calls to S3. Each chunk will be in the memory of a worker while it is processing it.
When doing a computation that involves joining data from many file-chunks, preprocessing of the chunks still happens as above, but now Dask keeps temporary structures around to accumulate partial results. How much memory will depend on the chunking size of the data, which you may or may not control, depending on the data format, and exactly what computation you want to apply to it.
Yes, Dask is able to spill to disk when memory usage is large. This is better handled in the distributed scheduler (which is now the recommended default even on a single machine). Use the --memory-limit and --local-directory CLI arguments, or their equivalents when using Client()/LocalCluster(), to control how much memory each worker can use and where temporary files are put.
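The idea behind spilling can be illustrated with a toy key-value store that keeps a bounded number of results in memory and pickles the rest to a directory. This is only a sketch of the concept, not Dask's implementation: the class `SpillStore` and its `max_items` limit are made up for illustration, and Dask limits memory by bytes rather than by item count.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillStore:
    """Keep at most max_items values in RAM; spill the least recently used to disk."""

    def __init__(self, max_items, directory=None):
        self.max_items = max_items
        self.directory = directory or tempfile.mkdtemp()
        self.memory = OrderedDict()  # key -> value, least recently used first
        self.on_disk = {}            # key -> path of the pickled value
        self._counter = 0            # used to build unique file names

    def put(self, key, value):
        self._drop_file(key)  # any spilled copy of this key is now stale
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.max_items:
            old_key, old_value = self.memory.popitem(last=False)
            self._counter += 1
            path = os.path.join(self.directory, f"spill-{self._counter}.pkl")
            with open(path, "wb") as handle:
                pickle.dump(old_value, handle)
            self.on_disk[old_key] = path

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        with open(self.on_disk[key], "rb") as handle:  # KeyError if unknown key
            value = pickle.load(handle)
        self.put(key, value)  # reload into RAM, possibly spilling another item
        return value

    def _drop_file(self, key):
        path = self.on_disk.pop(key, None)
        if path is not None:
            os.remove(path)
```

Here `max_items` plays the role of the memory limit and `directory` the role of the local directory for temporary files. Reading a spilled value costs a disk round trip, which is why a computation that spills heavily runs noticeably slower.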

When to use MapReduce in HBase?

I want to understand MapReduce on HBase from an application point of view. I need some real use cases to better understand when writing these jobs is the efficient choice.
If there is any link to a document or examples that explain real use cases, please share.
I can give some examples based on my use cases. If you already store your data in HBase, you can write a Java program that scans a table, does something with the rows, and writes the output to HBase or somewhere else. Or you can use MapReduce to do the same. The difference is that MapReduce runs where the data is, so network traffic is used only for the result data. We have hourly jobs that calculate the sum and average of KPIs; the input data is huge, but the output data is tiny. If I did not use MapReduce, I would need to move one hour of data, which is 18 GB, over the network. The MapReduce output is only 1 MB, and I can write it to HBase, a file, or somewhere else.
MapReduce also gives you parallel task execution, which you could build yourself in Java, but why would you :)
Keep in mind that YARN creates map tasks according to your HBase table's split count. So if you need more map tasks, split your table.
If you already store your data in Hadoop HDFS, you are lucky: a MapReduce job reading from HDFS is much faster than one reading from HBase. You can still write the MapReduce output to HBase if you want.
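The hourly KPI job described in this answer can be sketched as follows. Each split's rows are collapsed to a small (sum, count) pair where the data lives, and only those pairs are combined centrally. The sketch is plain Python rather than the Hadoop API; `map_region`, `reduce_partials`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_region(rows):
    """Runs next to one split's data: collapse (kpi, value) rows to partial sums."""
    partial = defaultdict(lambda: [0.0, 0])
    for kpi, value in rows:
        partial[kpi][0] += value
        partial[kpi][1] += 1
    return {kpi: tuple(acc) for kpi, acc in partial.items()}


def reduce_partials(partials):
    """Combine the small per-split results into a final sum and average per KPI."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for kpi, (total, count) in partial.items():
            totals[kpi][0] += total
            totals[kpi][1] += count
    return {kpi: {"sum": s, "avg": s / n} for kpi, (s, n) in totals.items()}


# Two splits of made-up KPI rows; only the tiny partial dicts leave each split.
regions = [
    [("latency", 10.0), ("latency", 20.0), ("errors", 1.0)],
    [("latency", 30.0), ("errors", 3.0)],
]
result = reduce_partials(map_region(rows) for rows in regions)
```

The map step ships (sum, count) rather than a per-split average, because averaging the per-split averages would weight a split with one row the same as a split with a million rows.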
Please look into the use cases given:
1. here.
2. And a small reference here - 30.Joins
3. Maybe an end-to-end example here
In the end, it all depends on your understanding of each concept (MapReduce, HBase) and on using them as your project needs. The same task can be done with or without MapReduce. Happy coding

Cloudera Impala: How does it read data from HDFS blocks?

I had a basic question in Impala. We know that Impala allows you to query data that is stored in HDFS. Now, if a file is split into multiple blocks, and let us say a line of text is spread across two blocks. In Hive/MapReduce, the RecordReader takes care of this.
How does Impala read the record in such a scenario?
Referencing my answer on the Impala user list:
When Impala finds an incomplete record (e.g. which can happen scanning certain file formats such as text or rc files), it will continue to read incrementally from the next block(s) until it has read the entire record. Note that this may require small amounts of 'remote reads' (reading from a remote datanode), but usually this is a very small amount compared to the entire block which should have been read locally (and ideally via a short circuit read).
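For newline-delimited text, one common way to implement this rule looks like the sketch below. It shows the general technique, not Impala's actual scanner. A reader that starts in the middle of the file discards everything up to the first newline, because the previous reader owns that record, and every reader keeps reading past its block end until the record it started is complete. The function name `read_block` and the byte offsets are illustrative.

```python
def read_block(path, start, end):
    """Return the records owned by the byte range [start, end) of a text file.

    A record belongs to the block in which its first byte lies, with one
    twist: a record starting exactly at `end` is read by this block, since
    the next block always skips its first line.
    """
    records = []
    with open(path, "rb") as handle:
        handle.seek(start)
        if start != 0:
            # Skip the tail of a record that began in the previous block
            # (or a whole record starting exactly at `start`, which the
            # previous block has already read).
            handle.readline()
        while handle.tell() <= end:
            line = handle.readline()  # may run past `end` into the next block
            if not line:
                break  # end of file
            records.append(line.rstrip(b"\n"))
    return records
```

With this convention every record is read exactly once no matter where the block boundaries fall, and the bytes read beyond `end` correspond to the small remote reads mentioned above.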

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, ten 1 GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10;
Another way you could split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This would result in at least one file per partition. For this to make sense you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column or you would have one file for each record. This approach will guarantee at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it needs to be set for this query to work).
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE table_to_export_to_redshift (
  id INT,
  value INT
)
PARTITIONED BY (country STRING);
INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine grained control, you can write your own reduce script to pass to hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.
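As a rough sketch of the "break them apart yourself" route, the following Python function copies one plain-text file into size-capped parts, cutting only between lines. It assumes the Hive output has already been pulled onto a local filesystem; the function name `split_by_size` and the `part-00000` naming are assumptions for illustration.

```python
import os


def split_by_size(path, max_bytes, out_dir):
    """Copy `path` into numbered part files of at most `max_bytes` each.

    Parts break only between lines, so no record is cut in half. A single
    line longer than max_bytes still gets a part of its own.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    written = 0
    with open(path, "rb") as source:
        for line in source:
            # Start a new part for the first line, or when this line
            # would push the current part over the cap.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"part-{len(parts):05d}")
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Calling it with `max_bytes=1024**3` would yield parts of roughly 1 GB each, which is the size the question mentions.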
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: Splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.) you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.