We currently have some files stored in Amazon S3. The files are log files (.log extension but plain-text content) that have been gzipped to reduce disk space.
But gzip isn't splittable, so we are now looking for a few good alternatives for storing and processing our files on Amazon EMR.
What is the best compression codec or file format to use for log files? I came across Avro, SequenceFile, bzip2, LZO, and Snappy. It's a lot to take in and I am a bit overwhelmed.
I would appreciate any insights on this matter.
The data will be used for Pig jobs (MapReduce jobs).
Kind regards
If you check the Best Practices for Amazon EMR whitepaper, there's a section about compressing the outputs:
Compress mapper outputs: compression means less data written to disk, which improves disk I/O. You can monitor how much data is written to disk by looking at the FILE_BYTES_WRITTEN Hadoop metric. Compression can also help with the shuffle phase, where reducers pull data. Compression can benefit your cluster's HDFS data replication as well. Enable compression by setting mapred.compress.map.output to true. When you enable compression, you can also choose the compression algorithm. LZO has better performance and is faster to compress and decompress.
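For reference, if you are writing your own MapReduce driver (rather than letting Pig manage this), the setting from the quote can be applied roughly as in the sketch below. It assumes the hadoop-lzo codec is installed on the cluster, and the job name is just a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class CompressedMapOutputDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Enable map-output compression (old mapred.* property names, as in the
            // EMR whitepaper; newer Hadoop also accepts mapreduce.map.output.compress).
            conf.setBoolean("mapred.compress.map.output", true);
            // LZO requires the hadoop-lzo library on the cluster; a codec that ships
            // with Hadoop (e.g. org.apache.hadoop.io.compress.SnappyCodec) also works here.
            conf.setClass("mapred.map.output.compression.codec",
                    com.hadoop.compression.lzo.LzoCodec.class,
                    CompressionCodec.class);
            Job job = Job.getInstance(conf, "log-processing");
            // ... set mapper/reducer and input/output paths as usual ...
        }
    }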
Hi, we can use the following algorithms depending on the use case:
GZIP: splittable: no; compression ratio: high; compress/decompress speed: medium
Snappy: splittable: no; compression ratio: low; compress/decompress speed: very fast
bzip2: splittable: yes; compression ratio: very high; compress/decompress speed: slow
LZO: splittable: yes (with an index); compression ratio: low; compress/decompress speed: fast
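For example, if the main goal is to keep the stored log files splittable, a job can write its final output with the bzip2 codec that ships with Hadoop. A minimal sketch (the output path and job name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Bzip2OutputExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "recompress-logs");
            // ... mapper/reducer setup omitted ...
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            // Compress the final output with bzip2: slower, but splittable,
            // so later jobs can process a single file with many mappers.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }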
I have gigabytes of data for which I want to make predictions using an AWS SageMaker endpoint. I have two main issues:
The data comes as Excel files, and AWS Batch Transform needs it in JSON format to be able to process it. Reading Excel just to save it as JSON is redundant and a big I/O slowdown.
The endpoint can only be invoked over HTTP, which means a payload limit of a few MB; chunking the data into such small pieces slows things down as well.
How can I tackle these issues?
Pipe Mode could be a potential solution, but from what I read it is used for training only. Is it possible to use Pipe Mode for inference to speed things up?
I would recommend performing some preprocessing and ETL operations using either Glue or EMR, and then using mini-batches to send the data to Batch Transform.
A blog post about this can be found here.
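For the Excel-to-JSON part of that ETL, a rough sketch is below (assuming Apache POI; the file names, header handling, and JSON escaping are simplified placeholders):

    import java.io.FileInputStream;
    import java.io.PrintWriter;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class ExcelToJsonLines {
        public static void main(String[] args) throws Exception {
            DataFormatter fmt = new DataFormatter();
            try (Workbook wb = new XSSFWorkbook(new FileInputStream("input.xlsx"));
                 PrintWriter out = new PrintWriter("input.jsonl")) {
                Sheet sheet = wb.getSheetAt(0);
                Row header = sheet.getRow(0);
                for (int r = 1; r <= sheet.getLastRowNum(); r++) {
                    Row row = sheet.getRow(r);
                    if (row == null) continue;
                    StringBuilder json = new StringBuilder("{");
                    for (int c = 0; c < header.getLastCellNum(); c++) {
                        if (c > 0) json.append(",");
                        String key = fmt.formatCellValue(header.getCell(c));
                        String val = fmt.formatCellValue(row.getCell(c));
                        // NOTE: naive quoting; use a real JSON library for production data.
                        json.append("\"").append(key).append("\":\"").append(val).append("\"");
                    }
                    json.append("}");
                    // One JSON object per line, which Batch Transform can split by line.
                    out.println(json);
                }
            }
        }
    }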
Thanks,
Raghu
I am trying to determine the ideal size for a file stored in S3 that will be used in Hadoop jobs on EMR.
Currently I have large text files around 5–10 GB. I am worried about the delay in copying these large files to HDFS to run MapReduce jobs. I have the option of making these files smaller.
I know S3 files are copied in parallel to HDFS when using S3 as an input directory in MapReduce jobs. But will a single large file be copied to HDFS using a single thread, or will this file be copied as multiple parts in parallel? Also, does gzip compression affect copying a single file in multiple parts?
There are two factors to consider:
Compressed files cannot be split between tasks. For example, if you have a single, large, compressed input file, only one Mapper can read it.
Using more, smaller files makes parallel processing easier, but there is more overhead when starting the map/reduce tasks for each file. So, fewer files are faster.
Thus, there is a trade-off between the size and quantity of files. The recommended size is listed in a few places:
The Amazon EMR FAQ recommends:
If you are using GZIP, keep your file size to 1–2 GB because GZIP files cannot be split.
The Best Practices for Amazon EMR whitepaper recommends:
That means that a single mapper (a single thread) is responsible for fetching the data from Amazon S3. Since a single thread is limited to how much data it can pull from Amazon S3 at any given time (throughput), the
process of reading the entire file from Amazon S3 into the mapper becomes the bottleneck in your data processing workflow. On the other hand, if your data files can be split, more than a single mapper can process your file. The suitable size for such data files is between 2 GB and 4 GB.
The main goal is to keep all of your nodes busy by processing as many files in parallel as possible, without introducing too much overhead.
Oh, and keep using compression. The savings in disk space and data transfer time make it more advantageous than enabling splitting.
I am running a cityscape and nature photography website that contains a lot of images ranging from 50 KB to 2 MB in size. I have already shrunk them down using a batch photo editor, so I can't lose any more quality without the images getting too grainy.
Google PageSpeed Insights recommends lossless compression, and I am trying to figure out how to do this. These specific images are in S3 buckets and are being served by AWS CloudFront:
Losslessly compressing https://d339oe4gm47j4m.cloudfront.net/bw107.jpg could save 57.6KiB (38% reduction).
Losslessly compressing https://luminoto-misc.s3-us-west-2.amazonaws.com/bob_horsch.jpg could save 40.6KiB (42% reduction). ...... and a hundred more of the same.
Can CloudFront do the compression before the image is served to the client? Or do I have to do some other type of compression and then re-upload each file to a new S3 bucket? I am looking for a solution where CloudFront does it.
I have searched around but haven't found a definitive answer.
Thanks,
Jeff
Update
As implicitly pointed out by Ryan Parman (+1), there are two different layers at play when it comes to compression (and/or optimization), which seem to have gotten mixed up a bit in this discussion so far:
My initial answer below has addressed lossless compression using Cloudfront as per your question title, which is concerned with the HTTP compression layer:
HTTP compression is a capability that can be built into web servers and web clients to make better use of available bandwidth, and provide greater transmission speeds between both.
[...] data is compressed before it is sent from the server: compliant browsers will announce what methods are supported to the server before downloading the correct format; browsers that do not support compliant compression method will download uncompressed data. [...]
That is, the compress/decompress operation is usually handled automatically by the server and the client to optimize bandwidth usage and transmission performance. The difference with CloudFront is that its server implementation does not handle compression automatically like most web servers do, which is why you need to prepare a compressed representation yourself if desired.
This kind of compression works best with text files like HTML, CSS and JavaScript, but isn't useful (or is even detrimental) with binary data formats that are already compressed themselves, like ZIP and other prepacked archives and especially image formats like PNG and JPEG.
Now, your question body talks about a different compression/optimization layer altogether, namely lossy JPEG compression and specifically lossless editing, as well as optimization via jpegoptim. This has nothing to do with how files are handled by HTTP servers and clients; rather, it is about compressing/optimizing the files themselves to better match the performance constraints of specific use cases like web or mobile browsing, where transmitting a digital photo at its original size makes no sense when it is simply going to be viewed on a web page.
This kind of compression/optimization is rarely offered by web servers themselves so far, even though notable efforts like Google's mod_pagespeed are available these days. Usually it is the responsibility of the web designer to prepare appropriate assets, ideally optimized for and selectively delivered to the expected target audience via CSS media queries.
Initial Answer
AWS CloudFront is capable of Serving Compressed Files; however, this is to be taken literally:
Amazon CloudFront can serve both compressed and uncompressed files
from an origin server. CloudFront relies on the origin server either
to compress the files or to have compressed and uncompressed versions
of files available; CloudFront does not perform the compression on
behalf of the origin server. With some qualifications, CloudFront can
also serve compressed content from Amazon S3. For more information,
see Choosing the File Types to Compress. [emphasis mine]
That is, you'll need to provide compressed versions yourself, but once you've set this up, this is transparent for clients - please note that the content must be compressed using gzip; other compression algorithms are not supported:
[...] If the request header includes additional content encodings, for example, deflate or sdch, CloudFront removes them before forwarding the request to the origin server. If gzip is missing from the Accept-Encoding field, CloudFront serves only the uncompressed version of the file. [...]
Details regarding the requirements and process are outlined in How CloudFront Serves Compressed Content from a Custom Origin and Serving Compressed Files from Amazon S3.
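As an illustration of preparing such a compressed representation yourself, here is a minimal sketch assuming the AWS SDK for Java v1; the bucket, key, and file names are placeholders, and again this only makes sense for text assets, not for already-compressed images like JPEG:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPOutputStream;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class UploadGzippedAsset {
        public static void main(String[] args) throws Exception {
            // Gzip the asset locally (text assets only: CSS, JS, HTML, ...).
            byte[] original = Files.readAllBytes(Paths.get("styles.css"));
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
                gz.write(original);
            }
            byte[] gzipped = buffer.toByteArray();

            // Upload with Content-Encoding: gzip so the compressed version is
            // what S3/CloudFront hand out to clients that accept gzip.
            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentType("text/css");
            metadata.setContentEncoding("gzip");
            metadata.setContentLength(gzipped.length);

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            s3.putObject(new PutObjectRequest("my-bucket", "assets/styles.css",
                    new ByteArrayInputStream(gzipped), metadata));
        }
    }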
JPEGOptim doesn't do any compression -- it does optimization.
The short answer is, yes, you should always use JPEGOptim on your .jpg files to optimize them before uploading them to S3 (or whatever your source storage is). This has been a good idea since forever.
If you're talking about files which are plain text-based (e.g., CSS, JavaScript, HTML), then gzip-compression is the appropriate solution, and Steffen Opel would have had the 100% correct answer.
The only compression Amazon really supports is zip or gzip. You are able to load those compressed files into S3 and then do things like loading them directly into resources like Redshift. So in short, no, Amazon does not provide the service you are looking for. This is something you would have to handle yourself...
I am looking at options in Java programs that can write files into HDFS with the following requirements.
1) Transaction support: each file is either written fully and successfully, or it fails completely without any partial file blocks being written.
2) Compression support/file formats: the compression type or file format can be specified when writing the contents.
I know how to write data into a file on HDFS by opening an FSDataOutputStream, as shown here. I am just wondering if there are libraries or out-of-the-box solutions that provide the support I mentioned above.
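For context, the kind of hand-rolled code I am trying to replace looks roughly like the sketch below; the temp-path-plus-rename is my crude attempt at all-or-nothing semantics, and the paths and codec are hard-coded placeholders:

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class HdfsAtomicCompressedWrite {
        public static void write(byte[] content) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path tmp = new Path("/data/.tmp/events.log.gz");
            Path dst = new Path("/data/events.log.gz");

            CompressionCodec codec = new CompressionCodecFactory(conf)
                    .getCodecByClassName(GzipCodec.class.getName());

            FSDataOutputStream raw = fs.create(tmp, true);
            try (OutputStream out = codec.createOutputStream(raw)) {
                out.write(content);
            } catch (IOException e) {
                fs.delete(tmp, false);   // drop the partial file on failure
                throw e;
            }
            // Publish the file only after it has been fully written.
            if (!fs.rename(tmp, dst)) {
                fs.delete(tmp, false);
                throw new IOException("Could not publish " + dst);
            }
        }
    }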
I stumbled upon Flume, which provides an HDFS sink that supports transactions, compression, file rotation, etc. But it doesn't seem to provide an API to be used as a library. The features Flume provides are tightly coupled to its architectural components, like sources, channels, and sinks, and don't seem to be usable independently. All I need is the HDFS loading part.
Does anyone have some good suggestions?
I think using Flume as a "gateway" to HDFS would be a good solution. Your program sends data to Flume (using one of the interfaces provided by its sources), and Flume writes to HDFS.
This way you don't need to maintain a bunch of custom code for interacting with HDFS. On the other hand, you need to install and configure Flume, but in my experience it is much easier (see this comment for installation recommendations).
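For example, the sending side can use the Flume client SDK; a minimal sketch (the host, port, and payload are placeholders, and the agent's Avro source must be configured to listen on that port):

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeSender {
        public static void main(String[] args) throws Exception {
            // Connects to a Flume agent whose Avro source listens on flume-host:41414;
            // the agent's HDFS sink then handles compression, rotation, and transactions.
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
            try {
                Event event = EventBuilder.withBody("one log line", StandardCharsets.UTF_8);
                client.append(event);   // throws EventDeliveryException if not accepted
            } finally {
                client.close();
            }
        }
    }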
Finally, the Flume HDFS sink is an open-source component, so you are free to reuse its code under the terms of the Apache license. Get the sources here: https://git-wip-us.apache.org/repos/asf?p=flume.git;a=tree;f=flume-ng-sinks/flume-hdfs-sink;h=b9414a2ebc976240005895e3eafe37b12fad4716;hb=trunk
I am a new learner of Hadoop.
While reading about Apache HDFS I learned that HDFS is a write-once file system. Some other distributions (Cloudera) provide an append feature. It would be good to know the rationale behind this design decision. In my humble opinion, this design creates a lot of limitations on Hadoop and makes it suitable only for a limited set of problems (problems similar to log analytics).
Expert comments will help me understand HDFS better.
HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support modification too). At present, HDFS strictly has one writer for a file at any time. This assumption enables high-throughput data access and also simplifies data coherency issues. A web crawler or a MapReduce application is best suited for HDFS.
As HDFS works on the principle of 'write once, read many', streaming data access is extremely important in HDFS. HDFS is designed for batch processing rather than interactive use by users, so the emphasis is on high throughput of data access rather than low latency. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete dataset is more important than the time taken to fetch a single record from it. HDFS relaxes a few POSIX requirements in order to implement streaming data access.
http://www.edureka.co/blog/introduction-to-apache-hadoop-hdfs/
There are three major reasons that HDFS has the design it has:
HDFS was designed by slavishly copying the design of Google's GFS, which was intended to support batch computations only
HDFS was not originally intended for anything but batch computation
Designing a real distributed file system that can support high-performance batch operations as well as real-time file modifications is difficult, and it was beyond the budget and experience level of the original implementors of HDFS.
There is no inherent reason that Hadoop couldn't have been built as a fully read/write file system. MapR FS is proof of that. But implementing such a thing was far outside of the scope and capabilities of the original Hadoop project and the architectural decisions in the original design of HDFS essentially preclude changing this limitation. A key factor is the presence of the NameNode since HDFS requires that all meta-data operations such as file creation, deletion or file length extensions round-trip through the NameNode. MapR FS avoids this by completely eliminating the NameNode and distributing meta-data throughout the cluster.
Over time, not having a real mutable file system has become more and more annoying as the workloads for Hadoop-related systems such as Spark and Flink have moved more and more toward operational, near-real-time or real-time operation. The responses to this problem have included:
MapR FS. As mentioned above, MapR implemented a fully functional, high-performance re-implementation of HDFS that includes POSIX functionality as well as NoSQL table and streaming APIs. This system has been in production for years at some of the largest big data systems around.
Kudu. Cloudera essentially gave up on implementing viable mutation on top of HDFS and has announced Kudu with no timeline for general availability. Kudu implements table-like structures rather than fully general mutable files.
Apache NiFi and the commercial version HDF. Hortonworks has also largely given up on HDFS and announced its strategy as forking applications into batch (supported by HDFS) and streaming (supported by HDF) silos.
Isilon. EMC implemented the HDFS wire protocol as part of their Isilon product line. This allows Hadoop clusters to have two storage silos, one for large-scale, high-performance, cost-effective batch based on HDFS and one for medium-scale mutable file access via Isilon.
Other. There are a number of essentially defunct efforts to remedy the write-once nature of HDFS. These include KFS (Kosmix File System) and others. None of these have significant adoption.
An advantage of this technique is that you don't have to bother with synchronization. Since you write once, your readers are guaranteed that the data will not be manipulated while they read.
Though this design decision does impose restrictions, HDFS was built keeping in mind efficient streaming data access.
Quoting from Hadoop - The Definitive Guide:
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied
from source, and then various analyses are performed on that dataset over time.
Each analysis will involve a large proportion, if not all, of the dataset, so the time
to read the whole dataset is more important than the latency in reading the first
record.