I've been reading about Dask and how it can read data from S3 and process it in a way that does not need the data to reside completely in RAM.
I want to understand what Dask would do if I have a very large S3 file that I am trying to read. Would it:
Load that S3 file into RAM ?
Load that S3 file and cache it in /tmp or something ?
Make multiple calls to the S3 file in parts
I am assuming here I am doing a lot of different complicated computations on the dataframe and it may need multiple passes on the data - i.e. let's say a join, group by, etc.
Also, a side question is if I am doing a select from S3 > join > groupby > filter > join - would the temporary dataframes which I am joining with be on S3 ? or on disk ? or RAM ?
I know Spark uses RAM and overflows to HDFS for such cases.
I'm mainly thinking of single machine dask at the moment.

For many file-types, e.g., CSV, parquet, the original large files on S3 can be safely split into chunks for processing. In that case, each Dask task will work on one chunk of the data at a time by making separate calls to S3. Each chunk will be in the memory of a worker while it is processing it.
When doing a computation that involves joining data from many file-chunks, preprocessing of the chunks still happens as above, but now Dask keeps temporary structures around to accumulate partial results. How much memory will depend on the chunking size of the data, which you may or may not control, depending on the data format, and exactly what computation you want to apply to it.
Yes, Dask is able to spill to disk when memory usage is large. This is better handled in the distributed scheduler (which is now the recommended default even on a single machine). Use the --memory-limit and --local-directory CLI arguments, or their equivalents if using Client()/LocalCluster(), to control how much memory each worker can use and where temporary files are put.
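As a rough sketch of the Client()/LocalCluster() equivalents mentioned above, assuming Dask with the distributed scheduler and s3fs are installed. The bucket path, the column names "key" and "value", the sizes and the spill directory are hypothetical placeholders, not values from the question.

```python
def spill_settings(memory_per_worker="2GB", spill_dir="/tmp/dask-spill", n_workers=2):
    """Keyword arguments for LocalCluster: cap worker memory, choose the spill location."""
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,
        "memory_limit": memory_per_worker,  # per-worker limit, like --memory-limit
        "local_directory": spill_dir,       # like --local-directory
    }


def total_by_key(path="s3://my-bucket/big-file.csv"):
    # Imported here so the settings above can be inspected without Dask installed.
    import dask.dataframe as dd
    from dask.distributed import Client, LocalCluster

    with LocalCluster(**spill_settings()) as cluster, Client(cluster):
        # Each block of roughly 64MB becomes one partition, read from S3 by its own task.
        df = dd.read_csv(path, blocksize="64MB")
        # Hypothetical columns; partial sums per partition are combined at the end.
        return df.groupby("key")["value"].sum().compute()
```

Calling total_by_key() would start the workers, run the aggregation partition by partition, and shut the cluster down again; a smaller blocksize lowers the memory needed per task at the cost of more tasks.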


AWS Athena - how to process huge results file

I am looking for a way to process a ~4 GB file that is the result of an Athena query, and I would like to know:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda: the file is too large and looks like, 'r') does not work in Lambda :(
Are there other AWS services that can solve this issue? I want to split this CSV file into small files (about 3-4 MB each) to send them to an external source (POST requests)
You can use CTAS in Athena together with its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to have the newly created table in a compressed format such as Parquet, however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost to query the data, especially in Athena that is charged by data scanned. The meaningful partitioning is based on your use case and the way that you want to split your data. The most common way is using time partitions, such as yearly, monthly, weekly, or daily. Use the logic that you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
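To get a feel for how the partition_char expression above spreads rows, here is a small Python sketch that mimics it (first character of the base64-encoded SHA-256 digest). It is only an illustration, not Athena itself; the input values are made up.

```python
import base64
import hashlib
from collections import Counter


def partition_char(value: str) -> str:
    """First character of the base64-encoded SHA-256 digest of value."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")[0]


# Hypothetical values standing in for some_column_in_your_data.
values = [f"row-{i}" for i in range(10_000)]
counts = Counter(partition_char(v) for v in values)
print(len(counts), "partitions; largest holds", max(counts.values()), "rows")
```

Because the base64 alphabet has 64 characters, this yields at most 64 partitions of roughly equal size, whatever the original distribution of the column.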
This was not possible with Lambda when its memory maxed out around 3 GB and its file system storage at 512 MB. Both limits can now be raised to 10 GB, so check the current limits before ruling it out.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is reoccurring and needs to be automated and you wanted to still be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation you can request a range that is about what you think you can fit in memory, process it, and then request the next. Process each chunk up to its last line break, and prepend the partial line that follows it to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line you can pad each chunk with that amount (both at the beginning and end). When you read a chunk you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding, with special handling of the first and last chunk, obviously.
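The sequential variant described above (request a range, process up to the last line break, carry the partial line over to the next chunk) could look roughly like this. fetch_range is a hypothetical callable; with boto3 it would wrap a get_object call with Range=f"bytes={start}-{end}". Here an in-memory bytes object stands in for the S3 object.

```python
from typing import Callable, Iterator


def iter_lines(fetch_range: Callable[[int, int], bytes], size: int,
               chunk_size: int) -> Iterator[bytes]:
    """Yield complete lines from an object that is read in inclusive byte ranges."""
    leftover = b""
    start = 0
    while start < size:
        end = min(start + chunk_size, size) - 1   # Range ends are inclusive
        chunk = leftover + fetch_range(start, end)
        cut = chunk.rfind(b"\n")
        if cut == -1:
            leftover = chunk                      # no line break yet, keep accumulating
        else:
            yield from chunk[:cut].split(b"\n")
            leftover = chunk[cut + 1:]            # partial line, prepended to the next chunk
        start = end + 1
    if leftover:
        yield leftover                            # final line without a trailing newline


# In-memory stand-in for the S3 object.
data = b"id,name\n1,alpha\n2,beta\n3,gamma\n"


def fake_fetch(start: int, end: int) -> bytes:
    return data[start:end + 1]


lines = list(iter_lines(fake_fetch, len(data), chunk_size=10))
print(lines)
```

Only one chunk plus one partial line is held at a time, so memory use is bounded by chunk_size and the longest line, not by the file size.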

Apache Spark: Regex with ReduceByKey is lot slower than GREP command

I have a file with strings (textData) and a set of regex filters (regx) that I want to apply and get count. Before we migrated to Spark, I used GREP as follows:
from subprocess import check_output
for reg in regx:  # regx is a list of all the filters
    result[reg] = system.exec('grep -e ' + reg + ' file.txt | wc -l')
Note: I am paraphrasing here with 'system.exec', I am actually using check_output.
I moved to Spark for other things, so I also want to take advantage of Spark here. So I wrote this code.
import re
from pyspark import SparkContext
sc = SparkContext('local[*]')
rdd = sc.textFile('file.txt')  # containing the strings as before
# pair every line with every filter, count the matches, then sum per filter
result = (rdd.flatMap(lambda line: [(reg, line) for reg in regx])
          .map(lambda pair: (pair[0], len(re.findall(pair[0], pair[1]))))
          .reduceByKey(lambda a, b: a + b))
I thought I was being smart but the code is actually slower. Can anyone point out any obvious errors? I am running it as
spark-submit --master local[*]
I haven't run both versions on the same exact data to check exactly how much slower. I could easily do that, if required. When I checked localhost:4040 most of the time is being taken by the reduceByKey job.
To give a sense of the time taken: the file has 100,000 rows with an average of ~1000 characters per line, and the number of filters is len(regx)=20. This code has been running for 44 min on an 8-core processor with 128 GB RAM.
EDIT: just to add, the number of regex filters and text files will grow 100-fold in the final system. Also, rather than writing/reading data from text files, I would be querying for the data in the RDD with an SQL statement. Hence, I thought Spark was a good choice.
I'm a quite heavy user of sort as well, and whilst Spark doesn't feel as fast in a local setup, you should consider some other things:
How big is your dataset? sort swaps records to /tmp when it requires large amounts of RAM.
How much RAM have you assigned to your Spark app? By default it has only 1 GB, which is pretty unfair when comparing against a sort command without RAM restrictions.
Are both tasks executed on the same machine? Is the Spark machine a virtual appliance running in an "auto-expand" disk file (bad performance)?
Spark Clusters will spread your tasks across multiple servers automatically. If running on Hadoop, remember that files are sliced in 128MB blocks, each block can be an RDD partition.
I.e. in a Hadoop cluster, RDD partitions can be processed in parallel. This is where you'll notice the performance.
Spark works with Hadoop to do its best to achieve "data locality", meaning that your processes run directly against local hard drives; otherwise the data has to be moved across the network, as happens when executing reduce-like processes. These are the stages. Understanding stages and how data is moved across the executors will lead you to nice improvements, especially considering that sort is of type "reduce" and triggers a new execution stage on Spark, potentially moving data across the network. Having spare resources on the same nodes where the maps are being executed can save a lot of network overhead.
Otherwise it will still work reasonably well, and you can't destroy a file in HDFS by mistake :-)
This is where you really get performance and safety of data and execution, by spreading the task in parallel to work against a lot of hard drives in a self-recovering execution environment.
In a local setup it simply feels unresponsive, mostly because it takes a while to load, launch and track the process, but it feels quick and safe when dealing with many GBs across several nodes.
I also love shell scripting and I deal with reasonable amounts of GBs quite often, but you can't regex-match 5 TB of data without distributing disk IO or paying for RAM as if there was no tomorrow.

Tool for querying large numbers of csv files

We have large numbers of csv files, files/directories are partitioned by date and several other factors. For instance, files might be named /data/AAA/date/BBB.csv
There are thousands of files, some are in the GB range in size. Total data sizes are in the terabytes.
They are only ever appended to, and usually in bulk, so write performance is not that important. We don't want to load it into another system because there are several important processes that we run that rely on being able to stream the files quickly, which are written in c++.
I'm looking for a tool/library that would allow SQL-like queries directly against the data. I've started looking at Hive, Spark, and other big data tools, but it's not clear if they can access partitioned data directly from a source, which in our case is via NFS.
Ideally, we would be able to define a table by giving a description of the columns, as well as partition information. Also, the files are compressed, so handling compression would be ideal.
Are there open source tools that do this? I've seen a product called Pivotal, which claims to do this, but we would rather write our own drivers for our data for an open source distributed query system.
Any leads would be appreciated.
Spark can be a solution. It is an in-memory distributed processing engine. Data can be loaded into memory on multiple nodes in the cluster and processed in memory. You do not need to copy the data to another system.
Here are the steps for your case:
Build a multi-node Spark cluster
Mount the NFS share on the nodes, at the same path on each, so that every worker can read the files
Then load the data temporarily into memory in the form of an RDD and start processing it
It provides
Support for programming languages like scala, python, java etc
Supports SQL Context and data frames. You can define structure to the data and start accessing using SQL Queries
Support for several compression algorithms
Data is processed in memory, but it does not all have to fit in memory at once: Spark works partition by partition and can spill to disk
You need to use data frames to define structure on the data, after which you can query it using SQL embedded in programming languages like Scala, Python, Java, etc.
There are subtle differences between traditional SQL in an RDBMS and SQL in distributed systems like Spark. You need to be aware of those.
With Hive, you need to have the data copied to HDFS. As you do not want to copy the data to another system, Hive might not be the solution.
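A minimal PySpark sketch of the "define structure, then query with SQL" steps above. It assumes pyspark is installed, that the NFS mount is visible at the same path on every node, and that the files are gzip-compressed CSVs with a header row; the column names, the view name and the .csv.gz suffix are hypothetical, since the question does not give them.

```python
SCHEMA_DDL = "ts TIMESTAMP, symbol STRING, price DOUBLE"  # hypothetical columns


def csv_glob(root: str, date: str) -> str:
    """Glob for the /data/AAA/date/BBB.csv layout, restricted to one date."""
    return f"{root}/*/{date}/*.csv.gz"


def average_price(root: str, date: str):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-sql").getOrCreate()
    # Only files matching the glob are read; gzip is detected from the extension.
    df = spark.read.csv(csv_glob(root, date), schema=SCHEMA_DDL, header=True)
    df.createOrReplaceTempView("ticks")
    return spark.sql(
        "SELECT symbol, avg(price) AS avg_price FROM ticks GROUP BY symbol"
    ).collect()
```

Restricting the glob to the wanted date plays the role of partition pruning here, because the directory names are not in the key=value form that Spark's automatic partition discovery expects.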

When does shuffling occur in Apache Spark?

I am optimizing parameters in Spark, and would like to know exactly how Spark is shuffling data.
Precisely, I have a simple word count program, and would like to know how spark.shuffle.file.buffer.kb is affecting the run time. Right now, I only see slowdown when I make this parameter very high (I am guessing this prevents every task's buffer from fitting in memory simultaneously).
Could someone explain how Spark is performing reductions? For example, the data is read and partitioned in an RDD, and when an "action" function is called, Spark sends out tasks to the worker nodes. If the action is a reduction, how does Spark handle this, and how are shuffle files / buffers related to this process?
Question: When is shuffling triggered in Spark?
Answer : Any join, cogroup, or ByKey operation involves holding objects in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data structures in the tasks for the stages that are on the fetching side of the shuffles they trigger. reduceByKey and aggregateByKey use data structures in the tasks for the stages on both sides of the shuffles they trigger.
Explanation : How does shuffle operation work in Spark?
The shuffle operation is implemented differently in Spark compared to Hadoop. I don't know if you are familiar with how it works with Hadoop but let's focus on Spark for now.
On the map side, each map task in Spark writes out a shuffle file (OS disk buffer) for every reducer, which corresponds to a logical block in Spark. These files are not intermediary in the sense that Spark does not merge them into larger partitioned ones. Since scheduling overhead in Spark is lower, the number of mappers (M) and reducers (R) is far higher than in Hadoop. Thus, shipping M*R files to the respective reducers could result in significant overhead.
Similar to Hadoop, Spark also provides a parameter, spark.shuffle.compress, to compress map outputs; the compression library itself is chosen with spark.io.compression.codec. In this case, it could be Snappy (by default) or LZF. Snappy uses only 33KB of buffer for each opened file and significantly reduces the risk of encountering out-of-memory errors.
On the reduce side, Spark requires all shuffled data to fit into the memory of the corresponding reducer task, unlike Hadoop, which has an option to spill this over to disk. This would of course happen only in cases where the reducer task demands all shuffled data, for a groupByKey or a reduceByKey operation, for instance. Spark throws an out-of-memory exception in this case, which has proved quite a challenge for developers so far.
Also with Spark there is no overlapping copy phase, unlike Hadoop that has an overlapping copy phase where mappers push data to the reducers even before map is complete. This means that the shuffle is a pull operation in Spark, compared to a push operation in Hadoop. Each reducer should also maintain a network buffer to fetch map outputs. Size of this buffer is specified through the parameter spark.reducer.maxMbInFlight (by default, it is 48MB).
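The map-side/reduce-side split described above can be pictured with a toy, single-process model of a reduceByKey word count. This is not Spark's implementation, only a sketch of the data flow: each map task combines locally and writes one bucket per reducer, and each reducer pulls its bucket from every map output. All names are made up for the illustration.

```python
from collections import defaultdict


def map_task(lines, num_reducers):
    """Map side: combine counts locally, then write one bucket per reducer."""
    combined = defaultdict(int)
    for line in lines:
        for word in line.split():
            combined[word] += 1                   # map-side combine, as in reduceByKey
    buckets = [{} for _ in range(num_reducers)]
    for word, count in combined.items():
        buckets[hash(word) % num_reducers][word] = count
    return buckets                                # stands in for R shuffle files


def reduce_task(reducer_id, all_map_outputs):
    """Reduce side: pull this reducer's bucket from every map output and merge."""
    totals = defaultdict(int)
    for buckets in all_map_outputs:
        for word, count in buckets[reducer_id].items():
            totals[word] += count
    return dict(totals)


partitions = [["a b a", "c a"], ["b b c"], ["a c"]]   # M = 3 input partitions
R = 2
map_outputs = [map_task(p, R) for p in partitions]    # M * R buckets in total
result = {}
for r in range(R):
    result.update(reduce_task(r, map_outputs))
print(result)
```

With M map tasks and R reducers there are M*R buckets to move, which is the overhead the answer refers to; the in-memory dictionaries on both sides correspond to the hashmaps mentioned for reduceByKey.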
For more information about shuffling in Apache Spark, I suggest the following readings :
Optimizing Shuffle Performance in Spark by Aaron Davidson and Andrew Or.
SPARK-751 JIRA issue and Consolidating Shuffle files by Jason Dai.
It occurs whenever data needs to be moved between executors (worker nodes)

Reading many small files from S3 very slow

Loading many small files (>200000, 4kbyte) from a S3 Bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot exactly figure out where the bottleneck is.
Pig Code Sample
data = load 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);
Hive Code Sample
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effects:
Increase #Task Nodes
set hive.optimize.s3.query=true
manually set #mappers
Increase instance type from medium up to xlarge
I know that s3distcp would speed up the process, but I could only get better performance by doing a lot of tweaking including setting #workerThreads and would prefer changing parameters directly in my PIG/Hive scripts.
You can either:
use distcp to merge the files before your job starts, or
have a Pig script that will do it for you, once.
If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:
-- false: number of mappers = number of combined blocks. Set to true for one mapper per file.
SET pig.noSplitCombination false;
-- set the size so that SUM(size) / X = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
Please provide metrics for those cases
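The SUM(size) / X rule of thumb for pig.maxCombinedSplitSize can be worked through with the figures from the question (200000 files of about 4 KB). This is only back-of-the-envelope arithmetic; it ignores how Pig actually groups files, and the target of 16 mappers is a hypothetical choice.

```python
import math


def mappers_for(total_bytes: int, max_combined_split_size: int) -> int:
    """Approximate mapper count when small files are combined up to the split size."""
    return max(1, math.ceil(total_bytes / max_combined_split_size))


def split_size_for(total_bytes: int, wanted_mappers: int) -> int:
    """Split size X such that total_bytes / X is about wanted_mappers."""
    return math.ceil(total_bytes / wanted_mappers)


total = 200_000 * 4 * 1024        # about 800 MB of input in total
print(mappers_for(total, 250_000_000))
print(split_size_for(total, 16))  # value to try for pig.maxCombinedSplitSize
```

With the 250000000 setting shown above, the input collapses to only a handful of mappers, so a smaller split size is what increases parallelism.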