AWS S3 Block Size to calculate total number of mappers for Hive Workload - amazon-web-services

Does S3 store data in the form of blocks? If yes, what is the default block size? Is there a way to alter the block size?

Block Size is not applicable to Amazon S3. It is an object storage system, not a virtual disk.

There is believed to be some partitioning of uploaded data into the specific blocks in which it was uploaded, and if you knew those values then readers might get more bandwidth. But the open source Hive/Spark/MapReduce applications have no API calls to discover that information or look at those details. Instead, the S3 connector takes a configuration option (for s3a: fs.s3a.block.size) to simulate blocks.
It wouldn't be that beneficial to work out the real block size if it took an HTTP GET request against each file to determine the partitioning; that would slow down the (sequential) query planning before tasks on split files were farmed out to the worker nodes. HDFS lets you get the listing and the block partitioning plus locations in one API call (listLocatedStatus(path)); S3 only has a list call that returns the (objects, timestamps, etags) under a prefix (S3 List API v2), so that extra check would slow things down. If someone could fetch that data and show there'd be a benefit, maybe it would be useful enough to implement. For now, calls to S3AFileSystem.listLocatedStatus() against S3 just get a made-up list of locations, splitting each object into blocks of fs.s3a.block.size, all with the location "localhost". All the apps know that location == localhost means "whatever".
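
For illustration, a minimal PySpark sketch of passing that option through to the S3A connector; the bucket and path are placeholders, and the resulting partition count is only a rough proxy for the number of mappers a Hive/MapReduce job would get.

from pyspark.sql import SparkSession

# Placeholder bucket/path; fs.s3a.block.size only changes what the S3A
# connector *reports* as block boundaries, nothing stored in S3 changes.
spark = (
    SparkSession.builder
    .appName("s3a-block-size-demo")
    .config("spark.hadoop.fs.s3a.block.size", str(128 * 1024 * 1024))  # 128 MB
    .getOrCreate()
)

# With TextInputFormat-style splitting, each simulated block typically
# becomes one input split, so for a large file the partition count is
# roughly ceil(object_size / fs.s3a.block.size).
rdd = spark.sparkContext.textFile("s3a://my-bucket/warehouse/my_table/data.csv")
print(rdd.getNumPartitions())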

Related

AWS Lambda generates large size files to S3

Currently we have an AWS Lambda function (Java-based runtime) which takes an SNS message as input, performs business logic, generates one XML file and stores it to S3.
The current implementation creates the XML in the /tmp location, which we know is subject to the AWS Lambda space limitation (500 MB).
Is there any way to still use Lambda but stream the XML file to S3 without using the /tmp folder?
I have done some research but have not found a solution.
Thank you.
You can load an object directly to S3 from memory without having to store it locally, using the PutObject API. However, keep in mind that Lambda also has time and total memory limits; you may run out of those too if your object is too big.
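
The question uses a Java runtime, but as a hedged illustration of the "upload straight from memory" idea, here is a minimal Python/boto3 sketch; the bucket, key and build_xml() helper are placeholders.

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # build_xml() is a hypothetical stand-in for the business logic
    xml_bytes = build_xml(event).encode("utf-8")
    s3.put_object(
        Bucket="my-output-bucket",          # placeholder bucket
        Key="reports/output.xml",           # placeholder key
        Body=xml_bytes,                     # uploaded from memory, no /tmp file
        ContentType="application/xml",
    )
    return {"status": "ok"}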
If you can split the file into chunks and don't need to update the beginning of the file while working with its end, you can use a multipart upload: provide a ready-to-go chunk, then free the memory before building the next chunk.
Otherwise you still need temporary storage to assemble all the parts of the XML. You can use DynamoDB or Redis; once you have collected all the parts of the XML there, you can upload it part by part, then clean up the DB (or set a TTL to automate the cleanup).
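
And a hedged boto3 sketch of the multipart route described above; generate_chunks() is a placeholder that yields ready-to-upload pieces of at least 5 MB each (S3's minimum size for every part except the last).

import boto3

s3 = boto3.client("s3")
bucket, key = "my-output-bucket", "reports/large-output.xml"  # placeholders

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
try:
    # generate_chunks() is hypothetical; it yields >= 5 MB chunks of the XML
    for part_number, chunk in enumerate(generate_chunks(), start=1):
        resp = s3.upload_part(
            Bucket=bucket, Key=key,
            UploadId=mpu["UploadId"],
            PartNumber=part_number,
            Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        # the chunk can be dropped from memory before building the next one
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key,
        UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )
except Exception:
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"])
    raise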

How to use S3 and EBS in tandem for cost effective analytics on AWS?

I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.
I might need to work with the files in the same way as I increase the number of features for future improved models.
Clearly because S3 stores data as objects, every time I make a change, I have to read and write 5TB of data.
What is the best approach I can take to process these data cost effectively and promptly:
1. Store the 5 TB file on S3 as a single object; every time, read the object, do the processing and save the result back to S3.
2. Store the 5 TB file on S3 as an object, read it, chunk it into smaller objects and save them back to S3 as multiple objects, so in future I only work with the chunks I am interested in.
3. Save everything on EBS from the start, mount it to the EC2 instance and do the processing there.
Thank you
First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.
The smarter way of processing this amount of data is to do it in parallel and preferably in multiple, smaller files rather than a single 5TB file.
Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.
If you are sticking with your current method of processing the data, I would recommend the following (a rough sketch follows the list):
1. If your application can read directly from S3, use that as the source. Otherwise, copy the file(s) to EBS.
2. Process the data.
3. Store the output locally in EBS, preferably in smaller files (GBs rather than TBs).
4. Copy the files to S3 (or keep them on EBS if that meets your needs).
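
To make the "smaller files" advice concrete, here is a hedged sketch of option 2 from the question: stream the large CSV in chunks and write each processed chunk back as its own, smaller object. It assumes pandas with s3fs installed; the bucket, column names and transform are placeholders.

import pandas as pd

src = "s3://my-bucket/raw/huge_input.csv"             # placeholder path
reader = pd.read_csv(src, chunksize=5_000_000)        # rows per chunk

for i, chunk in enumerate(reader):
    chunk["new_feature"] = chunk["existing_col"] * 2  # placeholder transform
    chunk.to_csv(
        f"s3://my-bucket/processed/part-{i:05d}.csv.gz",
        index=False,
        compression="gzip",                           # compress on the way out
    )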

Basics of Hadoop and MapReduce functioning

I have just started to learn Hadoop and MapReduce concepts and have the following questions that I would like to get cleared up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved. When we store a file in HDFS, the file is split into blocks (the block size is typically 64 MB or 128 MB, or whatever is configured for the current system). Once the big file is split into these blocks, the blocks are stored across the cluster. This is handled internally by the Hadoop environment.
The background for the question is:
Let us say there are multiple such huge files being stored in the system. Now, blocks of these different files may be stored at a data node, A (there are 3 data nodes: A, B and C). Also, multiple blocks of the same file may be stored at the same data node, A.
Scenario 1:
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen? Will multiple mappers be assigned to these different blocks, or will the same mapper process the multiple blocks?
Another part of the same question is: how does the client know which blocks, or let's say which part of the file, will be required for processing?
As the client doesn't know how the files are stored, how will it ask the NameNode for block locations and so on? Or, for every such processing job, are ALL the blocks of the respective file processed? I mean to ask: what metadata is stored on the NameNode?
Scenario 2:
If there are two different requests to access blocks of different files on the same data node, what will happen? In this case, won't there be other data nodes with no work to do, and won't there be a bottleneck at a single data node?
1) No. of mappers = no. of blocks of the file. That is, a separate mapper for each block (a quick calculation follows after this answer). Ideally, the number of nodes in the cluster should be high enough that no two blocks of the same file are stored on the same machine.
2) Whenever a client submits a job, the job executes on the whole file and not on particular chunks.
3) When a client submits a job or stores a file inside HDFS, it is up to the framework how it functions. The client should not be aware of Hadoop's internals; basically, it is none of its business. The client should know only two things: the file and the job (.jar).
4) The NameNode stores all the metadata about all files stored inside HDFS: how many blocks each file is divided into, and on which nodes/machines each block is stored. On average, the NameNode needs about 150 bytes of metadata per block.
5) Scenario 2: The NameNode manages such issues very well. HDFS has a default replication factor of 3, which means every block is stored on 3 different nodes. Through these other nodes, HDFS manages load balancing, but the primary goal of replication is to ensure data availability. Also take into consideration that there will be relatively few requests that just read the contents of a file; Hadoop is meant for processing the data, not just for reading it.
I hope this will solve some of your doubts.
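
To make point 1 concrete, a back-of-the-envelope estimate; the sizes below are purely illustrative: one mapper per block means roughly ceil(file_size / block_size) mappers.

import math

block_size = 128 * 1024 * 1024        # 128 MB (dfs.blocksize / fs.s3a.block.size)
file_size = 10 * 1024 * 1024 * 1024   # a hypothetical 10 GB input file

num_mappers = math.ceil(file_size / block_size)
print(num_mappers)                    # 80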
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen?
A client is not required to be a mapper; at this level we are working on HDFS, and the data node will serve the same data to any client which requests it.
Will there be multiple mappers assigned to these different blocks, or will the same mapper process the multiple blocks?
Each MapReduce job has its own mappers. More jobs involving the same data block means more mappers working on the same data.
Another part of the same question is: how does the client know which blocks, or let's say which part of the file, will be required for processing?
As the client doesn't know how the files are stored, how will it ask the NameNode for block locations and so on? Or, for every such processing job, are ALL the blocks of the respective file processed? I mean to ask: what metadata is stored on the NameNode?
The client knows which blocks are required thanks to the NameNode. At the beginning of file access, the client goes to the NameNode with the file name and gets back a list of blocks where the data are stored, together with the data nodes which hold them.
The NameNode holds the "directory information" together with the block list of where the data are; all this information is kept in RAM and is rebuilt on each system startup. Each data node also sends heartbeats to the NameNode, along with block allocation information.
EVERY datanode reports to EVERY namenode.
If there are two different requests to access blocks of different files on the same data node, what will happen? In this case, won't there be other data nodes with no work to do, and won't there be a bottleneck at a single data node?
Unless the data node does not respond (failure), access always goes to the same data node. Replication is not used to make things faster; it is all about making sure no data is lost.
I.e., when you write to HDFS, your data is forwarded to every replica block, and this makes writes slower. We need to be sure the data is safe.

Top level solution to rename AWS bucket item's folder names?

I've inherited a project at work. It's essentially a niche content repository, and we use S3 to store the content. The project was severely outdated, and I'm in the process of a thorough update.
For some unknown and undocumented reason, the content is stored in an AWS S3 bucket with the pattern web_cl_000000$DB_ID$CONTENT_NAME. So, one particular folder can be named web_cl_0000003458zyxwv. This makes no sense, and requires a bit of transformation logic to construct a URL to serve up the content!
I can write a Python script using the boto3 library to do an item-by-item rename, but I would like to know if there's a faster way to do so. There are approximately 4M items in that bucket, so it will take quite a long time.
That isn't possible, because the folders are an illusion derived from the strings between / delimiters in the object keys.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. (emphasis added)
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
The console contributes to the illusion by allowing you to "create" a folder, but all that actually does is create a 0-byte object with / as its last character, which the console will display as a folder whether there are other objects with that prefix or not, making it easier to upload objects manually with some organization.
But any tool or technique that allows renaming folders in S3 will in fact be making a copy of each object with the modified name, then deleting the old object, because S3 does not actually support rename or move, either -- objects in S3, including their key and metadata, are actually immutable. Any "change" is handled at the API level with a copy/overwrite or copy-then-delete.
Worth noting, S3 should be able to easily sustain 100 such requests per second, so with asynchronous requests or multi-threaded code, or even several processes each handling a shard of the keyspace, you should be able to do the whole thing in a few hours.
Note also that the less sorted (more random) the new keys are in the requests, the harder you can push S3 during a mass-write operation like this. Sending the requests so that the new keys are in lexical order will be the most likely scenario in which you might see 503 Slow Down errors... in which case, you just back off and retry... but if the new keys are not ordered, S3 can more easily accommodate a large number of requests.
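
For illustration, a hedged boto3 sketch of that copy-then-delete "rename", run in parallel; the bucket name, key-rewriting rule and thread count are placeholders to tune (and note that copy_object only handles objects up to 5 GB; larger ones need a multipart copy).

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")                      # boto3 clients are thread-safe
bucket = "my-content-bucket"                 # placeholder bucket

def new_key_for(old_key):
    # hypothetical rewrite rule; replace with your own mapping
    return old_key.replace("web_cl_", "content/", 1)

def rename_one(old_key):
    s3.copy_object(
        Bucket=bucket,
        Key=new_key_for(old_key),
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    s3.delete_object(Bucket=bucket, Key=old_key)

paginator = s3.get_paginator("list_objects_v2")
with ThreadPoolExecutor(max_workers=20) as pool:
    for page in paginator.paginate(Bucket=bucket, Prefix="web_cl_"):
        for obj in page.get("Contents", []):
            pool.submit(rename_one, obj["Key"])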

Upload files to S3 Bucket directly from a url

We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1000+ files, > 1 TB total file size). Running an upload tool directly on the storage server is not an option.
I have already created a tool that downloads each file, uploads it to the S3 bucket and updates the DB record with the new HTTP URL; it works perfectly, except that it takes forever.
Downloading a file takes some time (considering each file is close to a gigabyte) and uploading it takes even longer.
Is it possible to upload the video files directly from the CDN to S3, so I could cut the processing time in half? Something like reading a chunk of the file and then putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and the AWS SDK to upload.
PS: I have no problem with internet speed; I run the app on a server with a 1 Gbit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream," you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this operation with threads or asynch I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
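
The question uses .NET, but as a hedged illustration of the ranged-GET plus multipart-upload approach just described, here is a Python sketch using requests and boto3; the source URL, bucket, key, part size and thread count are all placeholders, and it assumes the origin returns Content-Length and honours Range requests.

from concurrent.futures import ThreadPoolExecutor
import boto3
import requests

SRC_URL = "https://cdn.example.com/video.mp4"        # placeholder origin URL
BUCKET, KEY = "my-video-bucket", "videos/video.mp4"  # placeholder destination
PART_SIZE = 16 * 1024 * 1024                         # 16 MB, above S3's 5 MB minimum

s3 = boto3.client("s3")
total = int(requests.head(SRC_URL).headers["Content-Length"])
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

def copy_part(part_number, start):
    end = min(start + PART_SIZE, total) - 1
    data = requests.get(SRC_URL, headers={"Range": f"bytes={start}-{end}"}).content
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
                          PartNumber=part_number, Body=data)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

ranges = enumerate(range(0, total, PART_SIZE), start=1)
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda args: copy_part(*args), ranges))

s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},   # pool.map preserves part order
)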
This has been answered by me in this question, here's the gist:
object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull by S3, though. At least it doesn't download each file and then upload it serially; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
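
For readers not on Ruby, a rough boto3 equivalent of the snippet above (assuming the requests library); upload_fileobj streams the HTTP body into S3 without a local file and switches to multipart uploads for large bodies.

import boto3
import requests

s3 = boto3.client("s3")

with requests.get("http://example.com/file.ext", stream=True) as resp:
    resp.raise_for_status()
    resp.raw.decode_content = True   # let urllib3 undo any gzip/deflate encoding
    s3.upload_fileobj(resp.raw, "target-bucket", "target-key")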
If a proxy (Node/Express) is suitable for you, then the portions of code at these 2 routes could be combined to do a GET/POST fetch chain, retrieving and then re-posting the response body to your destination S3 bucket.
Step one creates response.body.
Step two:
Set the stream in the 2nd link to the response from the GET op in link 1, and you will upload to the destination bucket the stream (arrayBuffer) from the first fetch.