Cloudera Impala: How does it read data from HDFS blocks? - hdfs

I had a basic question in Impala. We know that Impala allows you to query data that is stored in HDFS. Now, if a file is split into multiple blocks, and let us say a line of text is spread across two blocks. In Hive/MapReduce, the RecordReader takes care of this.
How does Impala read the record in such a scenario?

Referencing my answer on the Impala user list:
When Impala finds an incomplete record (e.g. which can happen scanning certain file formats such as text or rc files), it will continue to read incrementally from the next block(s) until it has read the entire record. Note that this may require small amounts of 'remote reads' (reading from a remote datanode), but usually this is a very small amount compared to the entire block which should have been read locally (and ideally via a short circuit read).

Related

AWS Athena - how to process huge results file

Looking for a way to process ~ 4Gb file which is a result of Athena query and I am trying to know:
Is there some way to split Athena's query result file into small pieces? As I understand - it is not possible from Athena side. Also, looks like it is not possible to split it with Lambda - this file too large and looks like s3.open(input_file, 'r') does not work in Lambda :(
Is there some other AWS services that can solve this issue? I want to split this CSV file to small (about 3 - 4 Mb) to send them to external source (POST requests)
You can use the option to CTAS with Athena and use the built-in partition capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to have the newly created table in a compressed format such as Parquet, however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost to query the data, especially in Athena that is charged by data scanned. The meaningful partitioning is based on your use case and the way that you want to split your data. The most common way is using time partitions, such as yearly, monthly, weekly, or daily. Use the logic that you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is reoccurring and needs to be automated and you wanted to still be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specific of how to use split, this other stack overflow question may help:
How to split CSV files as per number of rows specified?
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-1000 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation you can request a range that is about what you think you can fit in memory, process it, and then request the next. Request the next chunk when you see the last line break, and prepend the partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though, you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line you can pad each chunk with that amount (both at the beginning and end). When you read a chunk you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding – with special handling of the first and last chunk, obviously.

Does dask S3 reading cache the data on disk/RAM?

I've been reading about dask and how it can read data from S3 and do processing from that in a way that does not need the data to completely reside in RAM.
I want to understand what dask would do if I have a very large S3 file what I am trying to read. Would it:
Load that S3 file into RAM ?
Load that S3 file and cache it in /tmp or something ?
Make multiple calls to the S3 file in parts
I am assuming here I am doing a lot of different complicated computations on the dataframe and it may need multiple passes on the data - i.e. let's say a join, group by, etc.
Also, a side question is if I am doing a select from S3 > join > groupby > filter > join - would the temporary dataframes which I am joining with be on S3 ? or on disk ? or RAM ?
I know Spark uses RAM and overflows to HDFS for such cases.
I'm mainly thinking of single machine dask at the moment.
For many file-types, e.g., CSV, parquet, the original large files on S3 can be safely split into chunks for processing. In that case, each Dask task will work on one chunk of the data at a time by making separate calls to S3. Each chunk will be in the memory of a worker while it is processing it.
When doing a computation that involves joining data from many file-chunks, preprocessing of the chunks still happens as above, but now Dask keeps temporary structures around to accumulate partial results. How much memory will depend on the chunking size of the data, which you may or may not control, depending on the data format, and exactly what computation you want to apply to it.
Yes, Dask is able to spill to disc in the case that memory usage is large. This is better handled in the distributed scheduler (which is now the recommended default even on a single machine). Use the --memory-limit and --local-directory CLI arguments, or their equivalents if using the Client()/LocalCluster(), to control how much memory each worker can use and where temporary files get put.

Redshift unload's file name

I'm running a Redshift unload command, but am not getting the name I desire. The command is:
UNLOAD ('select * from foo')
TO 's3://mybucket/foo'
CREDENTIALS 'xxxxxx'
GZIP
NULL AS 'NULL'
DELIMITER as '\t'
allowoverwrite
parallel off
The result is mybucket/foo-000.gz. I don't want the slice number to be the end of the file name (it'd be great if it can be eliminated completely), I want to add a file extension at end of the file name. I'd like to see either of the following:
mybucket/foo-000.txt.gz
mybucket/foo.txt.gz
Is there any way to do this (without writing a lambda post process renamer script)?
TL;DR
No.
Explanation:
As it says in Amazon Redshift UNLOAD document, if you do not want it to be split into several parts, you can use PARALLEL FALSE, but it is strongly recommended to leave it enabled. Even then, the file will always include the 000.[EXT] suffix (when the [EXT] exists only when the compression is enabled), because there is a limit to a file size that Redshift can output, as says in the documentation:
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
s3://mybucket/key000 6.2 GB
s3://mybucket/key001 6.2 GB
s3://mybucket/key002 1.0 GB
Therefore, it will alway add at least the prefix 000, because Redshift doesn't know what size of the file he is going to output in the first place, so he's adding this suffix in case the output will reach the size of 6.2 GB.
If you ask why the use of PARALLEL FALSE is not recommended, I'll try to explain it in several points:
The most important reason is because of the way a Redshift cluster designed. Each cluster includes at least 2 servers, when one of them is a leader node and the rest are data nodes. The purpose of leader node, is to control the data nodes, it hold the necessary information to work with all data in Redshift, either read or write.
When you unload data from Redshift while the flag PARALLEL is TRUE, it will create at least X files, when X is the number of nodes you choose to construct the Redshift cluster of, in the first place. It means, that the data is written directly from the data nodes themselves, which is much faster because it's doing it in parallel and skips the leader node.
When you decide to turn this flag to off, all data is gathered from all of the data nodes into a single node, the leader node, because it needs to reorganize the sorting of the rows to output and also compress it if needed as a single stream. This action causes you data to be written much slower.
Also, this is significantly decreases Redshift cluster performance in a matter of reading and writing data, because everything (read and write queries) goes through the leader node, and as it says above, when the leader node is overloaded, there will be a performance issue.
The queries COPY and UNLOAD work directly with the data nodes, therefore, they behave almost the same way as if you would use PARALLEL TRUE. In the contrary, queries like SELECT, UPDATE, DELETE and INSERT, are processed by the leader node, that's why they suffer from the leader node loads.

Tool for querying large numbers of csv files

We have large numbers of csv files, files/directories are partitioned by date and several other factors. For instance, files might be named /data/AAA/date/BBB.csv
There are thousands of files, some are in the GB range in size. Total data sizes are in the terabytes.
They are only ever appended to, and usually in bulk, so write performance is not that important. We don't want to load it into another system because there are several important processes that we run that rely on being able to stream the files quickly, which are written in c++.
I'm looking for tool/library that would allow sql like queries against the data directly off the data. I've started looking at hive, spark, and other big data tools, but its not clear if they can access partitioned data directly from a source, which in our case is via nfs.
Ideally, we would be able to define a table by giving a description of the columns, as well as partition information. Also, the files are compressed, so handling compression would be ideal.
Are their open source tools that do this? I've seen a product called Pivotal, which claims to do this, but we would rather write our own drivers for our data for an open source distributed query system.
Any leads would be appreciated.
Spark can be a solution. It is in memory distributed processing engine. Data can be loaded into memory on multiple nodes in the cluster and can be processed in memory. You do not need to copy data to another system.
Here are the steps for your case:
Build multiple node spark cluster
Mount NFS on to one of the nodes
Then you have to load data temporarily into memory in the form of RDD and start processing it
It provides
Support for programming languages like scala, python, java etc
Supports SQL Context and data frames. You can define structure to the data and start accessing using SQL Queries
Support for several compression algorithms
Limitations
Data has to be fit into memory to be processed by Spark
You need to use data frames to define structure on data after which you can query the data using sql embedded in programming languages like scala, python, java etc
There are subtle differences between traditional SQL in RDBMS and SQL in distributed systems like spark. You need to aware of those.
With hive, you need to have data copied to HDFS. As you do not want to copy the data to another system, hive might not be solution.

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all out data in hive, and I'm wondering if there's a way to create, say 10 1GB files which might make copying to redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reduces writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers run this before your query:
set mapred.reduce.tasks=10
Another way you could split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This would result in at least one file per partition. For this to make sense you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column or you would have one file for each record. This approach will guarantee at least output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode, it needs to be set for this query to work).
hive.exec.dynamic.partition.mode=nonstrict
CREATE TABLE table_to_export_to_redshift (
id INT,
value INT
)
PARTITIONED BY (country STRING)
INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table
To get more fine grained control, you can write your own reduce script to pass to hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in it's warehouse directory (ex, /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: Splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc) you might see a performance loss if the files are too small (though 1GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get for paralleling your Redshift data load.