Spark writing/reading to/from S3 - Partition Size and Compression

Spark writing/reading to/from S3 - Partition Size and Compression - amazon-web-services

I am doing an experiment to understand which file size behaves best with s3 and [EMR + Spark]
Input data :
Incompressible data: Random Bytes in files
Total Data Size: 20GB
Each folder has varying input file size: From 2MB To 4GB file size.
Cluster Specifications :
1 master + 4 nodes : C3.8xls
--driver-memory 5G \
--executor-memory 3G \
--executor-cores 2 \
--num-executors 60 \
Code :
scala> def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
time: [R](block: => R)R
scala> val inputFiles = time{sc.textFile("s3://bucket/folder/2mb-10240files-20gb/*/*")};
scala> val outputFiles = time {inputFiles.saveAsTextFile("s3://bucket/folder-out/2mb-10240files-20gb/")};
Observations
2MB - 32MB: Most of the time is spent in opening file handles [Not efficient]
64MB till 1GB: Spark itself is launching 320 tasks for all these file sizes, it's no longer the no of files in that bucket with 20GB
data e.g. 512 MB files had 40 files to make 20gb data and could
just have 40 tasks to be completed but instead, there were 320
tasks each dealing with 64MB data.
4GB file size : 0 Bytes outputted [Not able to handle in-memory /Data not even splittable ???]
Questions
Any default setting that forces input size to be dealt with to be 64MB ??
Since the data I am using is random bytes and is already compressed how is it splitting this data further? If it can split this data why is it not able to split file size of 4gb object file
size?
Why is compressed file size increased after uploading via spark? The 2MB compressed input file becomes 3.6 MB in the output bucket.

Since it is not specified, I'm assuming usage of gzip and Spark 2.2 in my answer.
Any default setting that forces input size to be dealt with to be 64MB ??
Yes, there is. Spark is a Hadoop project, and therefore treats S3 to be a block based file system even though it is an object based file system.
So the real question here is: which implementation of S3 file system are you using(s3a, s3n) etc. A similar question can be found here.
Since the data I am using is random bytes and is already compressed how is it splitting this data further? If it can split this data why is it not able to split file size of 4gb object file size?
Spark docs indicate that it is capable of reading compressed files:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/.txt"), and textFile("/my/directory/.gz").
This means that your files were read quite easily and converted to a plaintext string for each line.
However, you are using compressed files. Assuming it is a non-splittable format such as gzip, the entire file is needed for de-compression. You are running with 3gb executors which can satisfy the needs of 4mb-1gb files quite well, but can't handle a file larger than 3gb at once (probably lesser after accounting for overhead).
Some further info can be found in this question. Details of splittable compression types can be found in this answer.
Why is compressed file size increased after uploading via spark?The 2MB compressed input file becomes 3.6 MB in output bucket.
As a corollary to the previous point, this means that spark has de-compressed the RDD while reading as plaintext. While re-uploading, it is no longer compressed. To compress, you can pass a compression codec as a parameter:
sc.saveAsTextFile("s3://path", classOf[org.apache.hadoop.io.compress.GzipCodec])
There are other compression formats available.

Related

Removing null bytes from a file results in larger output after XZ compression

I am developing a custom file format consisting of a large number of integers. I'm using XZ to compress the data as it has the best compression ratio of all compression algorithms I've tested.
All integers are stored as u32s in RAM, but they are all a maximum of 24 bits large, so I updated the serializer to skip the high byte (as it will always be 0 anyways), to try to compress the data further. However, this had the opposite effect: the compressed file is now larger than the original.
$ xzcat 24bit.xz | hexdump -e '3/1 "%02x" " "'
020000 030000 552d07 79910c [...] b92c23 c82c23
$ xzcat 32bit.xz | hexdump -e '4/1 "%02x" " "'
02000000 03000000 552d0700 79910c00 [...] b92c2300 c82c2300
$ xz -l 24bit.xz 32bit.xz
Strms Blocks Compressed Uncompressed Ratio Check Filename
1 1 82.4 MiB 174.7 MiB 0.472 CRC64 24bit.xz
1 1 77.2 MiB 233.0 MiB 0.331 CRC64 32bit.xz
-------------------------------------------------------------------------------
2 2 159.5 MiB 407.7 MiB 0.391 CRC64 2 files
Now, I wouldn't have an issue if the size of the file had remained the same, as a perfect compressor would detect that all of those bytes are redundant anyways, and compress them all down to practically nothing. However, I do not understand how removing data from the source file can possibly result in a larger file after compression?
I've tried changing the LZMA compressor settings, and xz --lzma2=preset=9e,lc=4,pb=0 yielded a slightly smaller file at 82.2M, but this is still significantly larger than the original file.
For reference, both files can be downloaded from https://files.spaceface.dev/sW1Xif/32bit.xz and https://files.spaceface.dev/wKXEQm/24bit.xz.
The order of the integers is somewhat important, so naively sorting the entire file won't work. The file is made up of different chunks, and the numbers making up each chunk are currently sorted for slightly better compression; however, the order does not matter, just the order of the chunks themselves.
Chunk 1: 000002 000003 072d55 0c9179 148884 1e414b
Chunk 2: 00489f 0050c5 0080a6 0082f0 0086f6 0086f7 01be81 03bdb1 03be85 03bf4e 04dfe6 04dfea 0583b1 061125 062006 067499 07d7e6 08074d 0858b8 09d35d 09de04 0cfd78 0d06be 0d3869 0d5534 0ec366 0f529c 0f6d0d 0fecce 107a7e 107ab3 13bc0b 13e160 15a4f9 15ab39 1771e3 17fe9c 18137d 197a30 1a087a 1a2007 1ab3b9 1b7d3c 1ba52c 1bc031 1bcb6b 1de7d2 1f0866 1f17b6 1f300e 1f39e1 1ff426 206c51 20abbe 20cbbc 211a58 211a59 215f73 224ea8 227e3f 227eab 22f3b7 231aef 004b15 004c86 0484e7 06216e 08074d 0858b8 0962ed 0eb020 0ec366 1a62c2 1fefae 224ea8 0a2701 1e414b
Chunk 3: 000006 003b17 004b15 004b38 [...]

Apache Flume taking more time than copyFromLocal command

I have 24GB folderin my local file system. My task is to move that folder to HDFS. Two ways I did it.
1) hdfs dfs -copyFromLocal /home/data/ /home/
This took around 15mins to complete.
2) Using Flume.
Here is my agent
spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink_to_hdfs
# source
spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /home/data/
spool_dir.sources.src-1.fileHeader = false
# HDFS sinks
spool_dir.sinks.sink_to_hdfs.type = hdfs
spool_dir.sinks.sink_to_hdfs.hdfs.fileType = DataStream
spool_dir.sinks.sink_to_hdfs.hdfs.path = hdfs://192.168.1.71/home/user/flumepush
spool_dir.sinks.sink_to_hdfs.hdfs.filePrefix = customevent
spool_dir.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
spool_dir.sinks.sink_to_hdfs.hdfs.batchSize = 1000
spool_dir.channels.channel-1.type = file
spool_dir.channels.channel-1.checkpointDir = /home/user/spool_dir_checkpoint
spool_dir.channels.channel-1.dataDirs = /home/user/spool_dir_data
spool_dir.sources.src-1.channels = channel-1
spool_dir.sinks.sink_to_hdfs.channel = channel-1
This step took almost an hour to push data to HDFS.
As per my knowledge Flume is distributed, so should not it be that Flume should load data faster than copyFromLocal command.

If you're looking simple at read and write operations flume is going to be at least 2x slower with your configuration as you're using a file channel - every file read from disk is encapsulated into a flume event (in memory) and then serialized back down to disk via the file channel. The sink then reads the event back from the file channel (disk) before pushing it up to hdfs.
You also haven't set a blob deserializer on your spoolDir source (so it's reading one line at a time from your source files, wrapping in a flume Event and then writing to the file channel), so paired with the HDFS Sink default rollXXX values, you'll be getting a file in hdfs per 10 events / 30s / 1k rather than a file per input file that you'd get with copyFromLocal.
All of these factors add up to give you slower performance. If you want to get a more comparable performance, you should use the BlobDeserializer on the spoolDir source, coupled with a memory channel (but understand that a memory channel doesn't guarantee delivery of an event in the event of the JRE being prematurely terminated.

Apache Flume is not intended for moving or copying folders from local file system to HDFS. Flume is meant for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. (Reference: Flume User Guide)
If you want to move large files or directories, you should use hdfs dfs -copyFromLocal as you have already mentioned.

Flume HDFS Sink generates lots of tiny files on HDFS

I have a toy setup sending log4j messages to hdfs using flume. I'm not able to configure the hdfs sink to avoid many small files. I thought I could configure the hdfs sink to create a new file every-time the file size reaches 10mb, but it is still creating files around 1.5KB.
Here is my current flume config:
a1.sources=o1
a1.sinks=i1
a1.channels=c1
#source configuration
a1.sources.o1.type=avro
a1.sources.o1.bind=0.0.0.0
a1.sources.o1.port=41414
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
#never roll-based on time
a1.sinks.i1.hdfs.rollInterval=0
#10MB=10485760
a1.sinks.il.hdfs.rollSize=10485760
#never roll base on number of events
a1.sinks.il.hdfs.rollCount=0
#channle config
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.sources.o1.channels=c1
a1.sinks.i1.channel=c1

It is your typo in conf.
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
#never roll-based on time
a1.sinks.i1.hdfs.rollInterval=0
#10MB=10485760
a1.sinks.il.hdfs.rollSize=10485760
#never roll base on number of events
a1.sinks.il.hdfs.rollCount=0
where in the line 'rollSize' and 'rollCount', you put il as i1.
Please try to use DEBUG, then you will find like:
[SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.shouldRotate:465) - rolling: rollSize: 1024, bytes: 1024
Due to il, default value of rollSize 1024 is being used .

HDFS Sink has a property hdfs.batchSize (default 100) which describes "number of events written to file before it is flushed to HDFS". I think that's your problem here.
Consider also checking all other properties: HDFS Sink .

This can possibly happen because of the memory channel and its capacity. I guess its dumping data to HDFS as soon as its capacity becomes full. Did you try using file channel instead of memory ?

Compression File

I have a X bytes file. And I want to compress it in block of size 32Kb, for example.
Is there any lib that Can I do this?
I used Zlib for Delphi but I just can compress a full file in new compressed file.
Tranks a lot,
Pedro

Why don't you use a simple header to determine block boundaries? Consider this:
Read fixed amount of data from input into a buffer (say 32 KiB)
Compress that buffer with a "freshly created" deflate stream (underlying compression algorithm of ZLIB).
Write compressed size to output stream
Write compressed data to output stream
Go to step 1 until you reach end-of-file.
Pros:
You can decompress any block even in multi-threaded fashion.
Data corruption only limited to corrupted block. Rest of data can be restored.
Cons:
You loss most of contextual information (similarities between data). So, you will have lower compression ratio.
You need slightly more work.

Reading binary files, Linux Buffer Cache

I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However if I run it within 60 seconds of the first run, it will then take around 3 seconds to run. How is that possible? Surely the only place that could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50GB of ram, and the drive is a NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed? Is my code wrong? It appears to take a correct amount of time the first time the file is read.
Edited to Add:
Found out that my files were only reading up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why changing this allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.

On Linux, you can do this for a quick troughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly

in_avail doesn't give the length of the file, but a lower bound of what is available (especially if the buffer has already been used, it return the size available in the buffer). Its goal is to know what can be read without blocking.
unsigned int is most probably unable to hold a length of more than 4GB, so what is read can very well be in the cache.
C++0x Stream Positioning may be interesting to you if you are using large files

in_avail returns the lower bound of how much is available to read in the streams read buffer, not the size of the file. To read the whole file via the stream, just keep
calling the stream's readsome() method and checking how much was read with the gcount() method - when that returns zero, you have read everthing.

It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?

One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
Finally, to drop all caches on a linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js