Apache Hadoop: Insert compressed data into HDFS - compression

I need to upload 100 text files into HDFS to do some data transformation with Apache Pig.
In your opinion, what is the best option:
a) Compress all the text files and upload only one file,
b) Load all the text files individually?

It depends on your file sizes, cluster parameters and processing methods.
If your text files are comparable in size to the HDFS block size (e.g. block size = 256 MB, file size = 200 MB), it makes sense to load them as is.
If your text files are very small, you will hit the typical HDFS small-files problem: each file occupies at least one HDFS block (logically, not physically, so a small file does not use a full block of disk space), and the NameNode (which handles metadata) incurs overhead managing a large number of blocks. To solve this you could merge your files into a single one, use Hadoop archives (HAR), or use a container file format (SequenceFiles, for example).
If a custom format is used, processing takes extra work: you will need to use custom input formats.
In my opinion, 100 files is not enough to significantly affect NameNode performance, so both options seem viable.
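To make the block-count argument concrete, here is a rough sketch that counts how many block entries the NameNode would track. The function name and the file sizes are hypothetical, and the 256 MB default simply mirrors the example block size above; it ignores replication and per-file metadata.

```python
import math


def hdfs_block_count(file_sizes_mb, block_size_mb=256):
    """Approximate number of block entries tracked for the given files.

    Each file is split into ceil(size / block_size) blocks; blocks are
    never shared between files, so many small files mean many blocks.
    """
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)


# Hypothetical sizes: 100 text files of 1 MB each vs. one merged 100 MB file.
small_files = [1] * 100
merged_file = [sum(small_files)]

print("separate files:", hdfs_block_count(small_files))
print("merged file:   ", hdfs_block_count(merged_file))
```

With only 100 files the difference is a few dozen metadata entries, which is consistent with the conclusion that either option is workable at this scale.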

Related

Is double compression less effective?

Let's say we have multiple packages stored as .tar.gz files and we want to combine them into one bundle. Everything I know about lossless file compression is that it attempts to find patterns in the data. From that, my intuition is that it would be able to find more patterns, and therefore produce a smaller bundle, if I first decompress the packages into .tar files and then combine them into one bundle.tar.gz. Is my intuition correct? Or is it not worth the hassle, and creating the bundle from the .tar.gz files directly would produce similar results?
I tested it with a random collection of txts (RFC 1-500 from https://www.rfc-editor.org/retrieve/bulk/) and compressing each of them individually and then creating the final .tar.gz from the compressed files yields a 15% bigger result, which supports my intuition but maybe not to an extent I expected.
total size of txts: 5.6M
total size of individually compressed txts: 2.7M
size of .tar.gz from txts: 1.4M
size of .tar.gz from compressed txts: 1.6M
I would like to understand more how it behaves in general.
Compressing something with gzip that is already compressed will generally expand the data, but only by a very small amount, multiplying the size by about 1.0003.
The fact that you are getting a 15% benefit from decompressing the pieces and recompressing the bundle means that your pieces must be relatively small in order for gzip's 32K byte matching distance to find more matches and increase the compression by that much. (You did not say how many of these individually compressed texts there were.)
By the way, it is easy to combine several .tar files into a single .tar file. Each .tar file is terminated with at least 1024 zero bytes (two 512-byte blocks; some tools pad further to a full record). Strip the trailing zero blocks from every .tar file except the last one, and concatenate them. Then you have one .tar file to compress.
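The concatenation trick can be sketched in Python with the standard tarfile module. This is only an illustration under stated assumptions: the helper names are made up, the archives are held in memory as bytes, and they are assumed to contain only regular files. Instead of blindly cutting 1024 bytes, it finds where the last member's data ends, which also copes with tools that pad the archive with extra zero blocks.

```python
import io
import tarfile

BLOCK = 512  # tar archives are organised in 512-byte blocks


def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def payload_end(tar_bytes):
    """Offset just past the last member's data, rounded up to a full block.

    Assumes every member is a regular file.
    """
    end = 0
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        for member in tf:
            data_end = member.offset_data + member.size
            end = -(-data_end // BLOCK) * BLOCK  # ceiling to a block boundary
    return end


def concat_tars(archives):
    """Join several tar archives into one by dropping each end-of-archive marker."""
    out = bytearray()
    for tar_bytes in archives:
        out += tar_bytes[:payload_end(tar_bytes)]
    out += bytes(2 * BLOCK)  # write a single end-of-archive marker
    return bytes(out)
```

The resulting bytes can then be compressed once, for example with gzip.compress(concat_tars([...])), so the compressor sees all members in a single stream.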

Read Large CSV from S3 using Lambda

I have multiple compressed (.gzip) CSV files in S3 which I wish to parse, preferably using Lambda. The largest compressed file seen so far is 80MB. On decompressing, the file size becomes 1.6GB. I estimate that a single uncompressed file can be up to approximately 2GB (the files are stored compressed in S3).
After parsing, I am interested in selected rows from the CSV file. I do not expect the memory used by the filtered rows to be more than 200MB.
However, given Lambda's limits on time (15 min) and memory (3GB), is using Lambda for such a use case a feasible option in the long run? Any alternatives to consider?
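One possible approach, sketched here under assumptions rather than offered as a tested Lambda solution, is to decompress and parse the file as a stream so that only the selected rows are held in memory. The function name and predicate are hypothetical, and the sketch assumes the S3 object body is available as a binary file-like object with a read() method; an in-memory buffer stands in for it below.

```python
import csv
import gzip
import io


def filter_gzip_csv(stream, predicate):
    """Yield the header and every row for which predicate(row) is true.

    The gzip data is decompressed incrementally, so memory use depends on
    the selected rows rather than on the full uncompressed size.
    """
    with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
        text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
        reader = csv.reader(text)
        yield next(reader)  # header row
        for row in reader:
            if predicate(row):
                yield row


# Stand-in for the S3 object body (assumption: a real one would be streamed).
raw = "id,status\n1,ok\n2,bad\n3,ok\n".encode("utf-8")
body = io.BytesIO(gzip.compress(raw))

for row in filter_gzip_csv(body, lambda r: r[1] == "ok"):
    print(row)
```

Whether this fits within the time limit depends on how fast the rows can be decompressed and filtered, which would need to be measured on real files.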

Compress .npy data to save space in disk

I have a huge dataset stored on my disk. Since the dataset is about 1.5 TB, I divided it into 32 samples to be able to use numpy.save('data_1.npy') in Python 2.7. Here is a sample of 9 sub-datasets. Each one is about 30 GB.
The shape of each .npy file is (number_of_examples, 224, 224, 19) and the values are floats.
data_1.npy
data_2.npy
data_3.npy
data_4.npy
data_5.npy
data_6.npy
data_7.npy
data_8.npy
data_9.npy
Saved with np.save('*.npy'), my dataset occupies 1.5 TB on disk.
1) Is there an efficient way to compress my dataset in order to free up some disk space?
2) Is there an efficient way of saving the files that takes less disk space than np.save()?
Thank you
You might want to check out xz compression mentioned in this answer. I've found it to be the best compression method while saving hundreds of thousands of .npy files adding up to a few hundred GB. The shell command for a directory called dataset containing your .npy files would be:
tar -cJvf dataset.tar.xz dataset/
This is just to save disk space while storing and moving the dataset; it needs to be decompressed before loading into python.
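If you would rather do the same from Python, the standard tarfile module can write xz-compressed archives. This is a minimal sketch; the helper names are made up, and the directory and archive paths are whatever you pass in.

```python
import os
import tarfile


def archive_xz(src_dir, archive_path):
    """Pack src_dir into an xz-compressed tar archive and return its path."""
    src_dir = os.path.normpath(str(src_dir))
    with tarfile.open(str(archive_path), "w:xz") as tf:
        # arcname keeps only the directory name inside the archive,
        # not the full path leading to it.
        tf.add(src_dir, arcname=os.path.basename(src_dir))
    return str(archive_path)


def list_members(archive_path):
    """Return the sorted member names stored in an xz-compressed tar archive."""
    with tarfile.open(str(archive_path), "r:xz") as tf:
        return sorted(tf.getnames())
```

For example, archive_xz("dataset", "dataset.tar.xz") would produce an archive comparable to the one created by the shell command. How much space it saves depends heavily on the data; float arrays with little redundancy may not shrink much.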

Does dask S3 reading cache the data on disk/RAM?

I've been reading about dask and how it can read data from S3 and do processing from that in a way that does not need the data to completely reside in RAM.
I want to understand what dask would do if I have a very large S3 file that I am trying to read. Would it:
Load that S3 file into RAM?
Load that S3 file and cache it in /tmp or something?
Make multiple calls to the S3 file in parts?
I am assuming here I am doing a lot of different complicated computations on the dataframe and it may need multiple passes on the data - i.e. let's say a join, group by, etc.
Also, a side question is if I am doing a select from S3 > join > groupby > filter > join - would the temporary dataframes which I am joining with be on S3 ? or on disk ? or RAM ?
I know Spark uses RAM and overflows to HDFS for such cases.
I'm mainly thinking of single machine dask at the moment.
For many file-types, e.g., CSV, parquet, the original large files on S3 can be safely split into chunks for processing. In that case, each Dask task will work on one chunk of the data at a time by making separate calls to S3. Each chunk will be in the memory of a worker while it is processing it.
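The idea of splitting a delimited file into independently readable chunks can be illustrated with plain Python. This is not Dask's implementation, only a sketch of the principle: the function names are hypothetical and an in-memory bytes object stands in for ranged reads against S3. Each chunk owns the lines whose first byte falls inside its byte range, so every line is processed exactly once.

```python
def byte_ranges(total_size, chunk_size):
    """Split [0, total_size) into (offset, length) pairs of at most chunk_size."""
    return [(start, min(chunk_size, total_size - start))
            for start in range(0, total_size, chunk_size)]


def read_chunk_lines(blob, start, length):
    """Return the complete lines that begin inside [start, start + length)."""
    end = start + length
    # If the range starts in the middle of a line, that line belongs to the
    # previous chunk: skip ahead to the next line start.
    if start > 0 and blob[start - 1:start] != b"\n":
        newline = blob.find(b"\n", start)
        if newline == -1:
            return []
        start = newline + 1
    if start >= end:
        return []
    # Read past the end of the range to finish the last line that starts in it.
    newline = blob.find(b"\n", end - 1)
    stop = len(blob) if newline == -1 else newline + 1
    return blob[start:stop].splitlines()


data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
for offset, length in byte_ranges(len(data), 5):
    print((offset, length), read_chunk_lines(data, offset, length))
```

In a real setting each (offset, length) pair would be one task issuing its own ranged request, and only that task's lines would be in a worker's memory at a time.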
When doing a computation that involves joining data from many file-chunks, preprocessing of the chunks still happens as above, but now Dask keeps temporary structures around to accumulate partial results. How much memory will depend on the chunking size of the data, which you may or may not control, depending on the data format, and exactly what computation you want to apply to it.
Yes, Dask is able to spill to disk when memory usage is large. This is better handled in the distributed scheduler (which is now the recommended default even on a single machine). Use the --memory-limit and --local-directory CLI arguments, or their equivalents when using Client()/LocalCluster(), to control how much memory each worker can use and where temporary files are put.

Parquet partitioning and HDFS filesize

My data are in the form of relatively small Avro records, written in Parquet files (on average < 1 MB).
Up to now I used my local filesystem to do some tests with Spark.
I partitioned the data using a hierarchy of directories.
I wonder if it would be better to "build" the partitioning onto the Avro record and accumulate bigger files... However I imagine that partitioned Parquet files would "map" onto HDFS partitioned files too.
What approach would be best?
Edit (clarifying based on comments):
"build the partitioning onto the Avro record": imagine that my directory structure is P1=/P2=/file.avro and that the Avro record contains fields F1 and F2. I could save all of that in a single Avro file containing the fields P1, P2, F1 and F2, i.e. there is no need for a partitioning structure with directories, as it is all present in the Avro records.
About Parquet partitions and HDFS partitions: will HDFS split a big Parquet file across different machines, and will that correspond to distinct Parquet partitions? (I don't know if that clarifies my question; if not, it means I don't really understand.)
The main reason for partitioning at the folder level is that when Spark, for instance, reads the data with a filter on the partition column (extracted from the folder name, as long as the format is path/partitionName=value), it reads only the needed folders instead of reading everything and then applying the filter. So if you want to use this mechanism, use a hierarchy in your folder structure (I use it often).
Generally speaking, I would recommend avoiding many folders with little data in them (not sure if that is the case here).
About Spark input partitioning (same word, different meaning): when reading from HDFS, Spark tries to line up its input partitions with how the files are stored, so that tasks can run close to the data. HDFS does not partition data in the Parquet/Hive sense; it splits each file into fixed-size blocks, stores the blocks on different DataNodes and replicates each block (to increase reliability). So a single large Parquet file is still a single file on HDFS, and its HDFS blocks do not correspond to partition directories. Because Parquet is splittable, Spark will usually read such a file into several input partitions; you can also repartition it or define the number of partitions when reading (there are several ways to do it depending on Spark version. see this)
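To illustrate the folder-level filtering idea without Spark, here is a small sketch of how partitionName=value path segments can be used to skip whole directories. It is not Spark's implementation; the function names and example paths are hypothetical.

```python
def parse_partitions(path):
    """Extract {partitionName: value} from the directory part of a path."""
    partitions = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def prune(paths, **filters):
    """Keep only the paths whose partition values match every filter."""
    return [
        path for path in paths
        if all(parse_partitions(path).get(name) == value
               for name, value in filters.items())
    ]


# Hypothetical layout following the P1=/P2=/file pattern from the question.
paths = [
    "data/P1=a/P2=x/part-0.parquet",
    "data/P1=a/P2=y/part-0.parquet",
    "data/P1=b/P2=x/part-0.parquet",
]

print(prune(paths, P1="a"))
print(prune(paths, P1="a", P2="y"))
```

The filter is decided from the path alone, so files in non-matching folders are never opened. If P1 and P2 were stored only as fields inside the records, every file would have to be read before filtering.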