Basics of Hadoop and MapReduce functioning

I have just started learning Hadoop and MapReduce concepts and have a few questions that I would like to get cleared up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved. When we store a file in HDFS, the file is split into blocks (the block size is typically 64 MB or 128 MB, or whatever is configured for the current system). Once the big file is split into blocks, these blocks are stored across the cluster. This is handled internally by the Hadoop environment.
The background for the question is:
Let us say there are multiple such huge files being stored in the system. Now, blocks of these different files may be stored at a data node, A (there are three data nodes: A, B, and C). Also, multiple blocks of the same file may be stored at the same data node, A.
Scenario 1:
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen? Will there be multiple mappers assigned to these different blocks, or will the same mapper process the multiple blocks?
Another part of the same question is: how does the client know which blocks, or let's say which part of the file, will be required for processing?
Since the client doesn't know how the files are stored, how will it ask the NameNode for block locations and so on? Or, for every such processing, are ALL the blocks of the respective file processed? What I mean to ask is: what metadata is stored on the NameNode?
Scenario 2:
If there are two different requests to access blocks of different files on the same data node, what will happen? In that case, there will be other data nodes with no work to do, so won't there be a bottleneck at a single data node?

1) No. of mappers = no. of blocks of the file; that is, a separate mapper for each block (more precisely, one mapper per input split, and by default there is one split per block). Ideally, the number of nodes in the cluster should be high enough that no two blocks of the same file are stored on the same machine. (See the sketch after this answer for the arithmetic.)
2) Whenever a client submits a job, the job executes on the whole file and not on particular chunks.
3) When a client submits a job or stores a file in HDFS, it is up to the framework how it functions. The client should not need to be aware of Hadoop's internals; basically, that is none of its business. The client only needs to know two things: the file and the job (.jar).
4) The NameNode stores all the metadata about the files kept in HDFS: how many blocks each file is divided into, and which nodes/machines hold each block. On average, the NameNode needs about 150 bytes of memory to store the metadata for each block.
5) Scenario 2: The NameNode manages such situations well. HDFS has a default replication factor of 3, which means every block is stored on 3 different nodes. Through these other replicas HDFS can balance the load, but the primary goal of replication is to ensure data availability. Also take into account that there will be relatively few requests that merely read the contents of a file: Hadoop is meant for processing data, not just for reading it.
I hope this will solve some of your doubts.
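To put rough numbers on points 1) and 4), here is a minimal sketch; the 10 GB file size is an arbitrary example, and the 128 MB block size and ~150 bytes per block are just the figures mentioned above, not measurements:

import math

BLOCK_SIZE = 128 * 1024 * 1024          # a typical HDFS block size (128 MB)
BYTES_PER_BLOCK_METADATA = 150          # rough NameNode memory per block, as noted above

def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    # One input split per block by default, hence one mapper per block.
    return math.ceil(file_size_bytes / block_size)

def namenode_metadata_bytes(file_size_bytes, block_size=BLOCK_SIZE):
    # Very rough estimate of NameNode RAM needed for this file's block metadata.
    return block_count(file_size_bytes, block_size) * BYTES_PER_BLOCK_METADATA

print(block_count(10 * 1024**3))              # a 10 GB file -> 80 blocks, so ~80 mappers
print(namenode_metadata_bytes(10 * 1024**3))  # -> about 12,000 bytes of metadata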

If a client request comes that requires to access the multiple blocks of the same file on the same data node, then what will happen?
A client is not required to be a mapper; at this level we are working with HDFS, and the data node will serve the same data to any client that requests it.
Will there be multiple mappers assigned to these different blocks or the same mapper will process the multiple blocks?
Each MapReduce job has its own mappers. More jobs that involve the same data block means more mappers working on the same data.
Another part in the same question is, how does the client know what blocks, or let's say what part of the file, will be required for processing?
As the client doesn't know how the files are stored, how will it ask the NameNode for block locations and so on? Or is it that, for every such processing, ALL the blocks of the respective file are processed? I mean to ask, what metadata is stored on the NameNode?
Clients know which blocks are required thanks to the NameNode. At the beginning of file access, the client goes to the NameNode with the file name and gets back a list of blocks where the data is stored, together with the DataNodes that hold them.
The NameNode holds the "directory information" together with the block list of where the data is; all this information is kept in RAM and rebuilt on each system startup. DataNodes also send heartbeats to the NameNode, together with block allocation information.
EVERY datanode reports to EVERY namenode.
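If you want to see this block-to-DataNode mapping for a concrete file, one way (just a sketch, and the path below is a made-up example) is to ask the NameNode through the hdfs fsck tool, which prints every block of a file along with the DataNodes holding its replicas:

import subprocess

# Ask the NameNode which blocks make up the file and where their replicas live.
# Requires a configured Hadoop client on this machine.
result = subprocess.run(
    ["hdfs", "fsck", "/user/example/bigfile.txt", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # each block id is listed with the DataNode addresses holding it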
If there are two different requests to access blocks of different files on the same data node, then what will happen? In this case, there will be other data nodes with no work to do and won't there be a bottleneck at a single data node?
Unless the DataNode does not respond (failure), access always goes to the same DataNode. Replication is not used to make things fast; it is all about making sure no data will be lost.
I.e., when you write to HDFS, your data is forwarded to every replica, and this makes writes slower. We need to be sure the data is safe.
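As a trivial illustration of why writes cost more than reads under replication (a sketch; the 1 GB figure is an arbitrary example and 3 is the default replication factor mentioned above):

REPLICATION_FACTOR = 3   # HDFS default: every block is written to 3 DataNodes

def raw_bytes_persisted(logical_bytes, replication=REPLICATION_FACTOR):
    # A write is pipelined through all replicas, so the cluster stores and
    # transfers roughly replication x the logical data before the write is safe.
    return logical_bytes * replication

print(raw_bytes_persisted(1 * 1024**3))   # writing 1 GB persists ~3 GB of raw data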

Related

AWS S3 Block Size to calculate total number of mappers for Hive Workload

Does S3 store the data in the form of blocks? If yes, what is the default block size? Is there a way to alter the block size?
Block Size is not applicable to Amazon S3. It is an object storage system, not a virtual disk.
There is believed to be some internal partitioning of uploaded data into the specific blocks in which it was uploaded, and if you knew those values then readers might get more bandwidth. But the open-source Hive/Spark/MapReduce applications certainly don't know the API calls to find this information out or look at these details. Instead, the S3 connector takes a configuration option (for s3a: fs.s3a.block.size) to simulate blocks.
It would not be so beneficial to work out that block size if it took an HTTP GET request against each file to determine the partitioning; that would slow down the (sequential) query planning before tasks on split files were farmed out to the worker nodes. HDFS lets you get the listing and the block partitioning + locations in one API call (listLocatedStatus(path)); S3 only has a list call that returns the list of (objects, timestamps, etags) under a prefix (S3 List API v2), so that extra check would slow things down. If someone could fetch that data and show that there would be benefits, maybe it would be useful enough to implement. For now, calls to S3AFileSystem.listLocatedStatus() against S3 just get a made-up list of locations, splitting blocks by the fs.s3a.block.size value and giving the location as (localhost). All the apps know that location == localhost means "whatever".
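To make the simulated-block behaviour concrete, here is a minimal sketch of how a planner ends up with a mapper count for an S3 object; the object size is an arbitrary example, and the helper is illustrative rather than the actual Hive/Spark planning code:

import math

def estimated_mappers(object_size_bytes, s3a_block_size=128 * 1024 * 1024):
    # The S3A connector reports the object as if it were split into
    # fs.s3a.block.size "blocks" located on localhost; split planners then
    # create roughly one split (and one mapper) per simulated block.
    return max(1, math.ceil(object_size_bytes / s3a_block_size))

print(estimated_mappers(10 * 1024**3))   # a 10 GB object with 128 MB blocks -> 80 mappers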

Spark Streaming with S3 vs Kinesis

I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures:
Have Spark Streaming watch an S3 prefix and pick up new objects as they come in
Stream data from S3 to a Kinesis stream (through a Lambda function triggered as new S3 objects are created by DMS) and use the stream as input for the Spark application.
While the second solution will work, the first solution is simpler. But are there any pitfalls? Looking at this guide, I'm concerned about two specific points:
The more files under a directory, the longer it will take to scan for changes — even if no files have been modified.
We will be keeping the S3 data indefinitely. So the number of objects under the prefix being monitored is going to increase very quickly.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
I'm not sure if this applies to S3, since to my understanding objects are created atomically and cannot be updated afterwards as is the case with ordinary files.
I posted this to Spark mailing list and got a good answer from Steve Loughran.
There's a slightly more optimised streaming source for cloud streams here:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/streaming/hortonworks/CloudInputDStream.scala
Even so, the cost of scanning S3 is one LIST request per 5000 objects; I'll leave it to you to work out how many there will be in your application, and how much it will cost. And of course, the more LIST calls there are, the longer things take, and the bigger your window needs to be.
“Full” Filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream - after which updates to the file within the same window will be ignored. That is: changes may be missed, and data omitted from the stream.
Objects written to S3 aren't visible until the upload completes, in an atomic operation. You can write in place and not worry.
The timestamp on S3 artifacts comes from the PUT time. For multipart uploads of many MB/many GB, that's when the first POST to initiate the MPU is kicked off. So if the upload starts in time window t1 and completes in window t2, the object won't be visible until t2, but the timestamp will be from t1. Bear that in mind.
The lambda callback probably does have better scalability and resilience; not tried it myself.
Since the number of objects in my scenario is going to be much larger than 5000 and will continue to grow very quickly, S3 to Spark doesn't seem to be a feasible option. I did consider moving/renaming processed objects in Spark Streaming, but the Spark Streaming application code seems to only receive DStreams and no information about which S3 object the data is coming from. So I'm going to go with the Lambda and Kinesis option.
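For reference, the back-of-the-envelope arithmetic behind that decision, assuming the one-LIST-per-5000-objects figure quoted above; the object count, scan interval, and per-request price are placeholder assumptions, not measured values:

import math

OBJECTS_PER_LIST = 5000            # figure quoted above for the optimised streaming source
LIST_PRICE_USD = 0.005 / 1000      # assumed price per LIST request; check current S3 pricing

def daily_scan_cost(num_objects, scans_per_day):
    lists_per_scan = math.ceil(num_objects / OBJECTS_PER_LIST)
    requests = lists_per_scan * scans_per_day
    return requests, requests * LIST_PRICE_USD

# Example: 1 million objects under the prefix, scanned every 30 seconds
print(daily_scan_cost(1_000_000, scans_per_day=24 * 60 * 2))   # -> (576000, ~2.88 USD/day)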

How to sync a number between multiple google cloud instances using google cloud storage?

I am trying to sync an operation between multiple instances in google cloud.
In the home folder of the image from which I create new instances, I have several files named like this: 1.txt, 2.txt, 3.txt, ... 50000.txt.
I have another file in the Google Cloud Storage bucket, named gs://bucket/current_file.txt, that contains a number on a single line indicating the latest file being processed by all the running Google Cloud instances. Initially this file looks like this:
0
Now I am creating multiple google instances one by one. The instances have a startup script like this:
# Download the shared counter from the bucket.
gsutil cp gs://bucket/current_file.txt /home/ubuntu/;
past_file=`tail /home/ubuntu/current_file.txt`;
# Claim the next file by incrementing the counter.
current_file=$((past_file+1));
echo $current_file > /home/ubuntu/current_file.txt;
# Upload the updated counter so other instances see it.
gsutil cp /home/ubuntu/current_file.txt gs://bucket/;
process.py /home/ubuntu/$current_file.txt;
So this script downloads the value of the current file that is being processed by another instance, increments it by 1, and starts processing the incremented file. gs://bucket/current_file.txt is also updated so that other instances know the name of the next file they can start processing. When I have only 1 instance running, gs://bucket/current_file.txt is updated properly, but when I am running multiple instances, the value in gs://bucket/current_file.txt sometimes climbs to a certain value and then erratically falls back to a lower value.
My assumption is that somehow two different instances are trying to upload the same file at the same time, which messes up the integer value inside the text file.
Is it possible to lock the file in any way, so that other instances wait before one instance can overwrite gs://bucket/current_file.txt?
If not, can someone suggest any other mechanism through which I can update the current_file number once it is being processed by one instance, and then communicate it to the other instances so that they can start processing the following files when they complete the file at hand?
You are correct. In your architecture, you need some mechanism to lock your current-file counter so that only one process at a time is able to change its value. You want to be able to apply a mutex or lock to the file, when one process opens it to increment it, so that another process is unable to increment it concurrently.
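One sketch of how such a compare-and-swap could be approximated on GCS itself (assuming the google-cloud-storage Python client; the bucket and object names are the ones from the question, and this is illustrative rather than a complete solution): read the counter and its object generation, then write the incremented value only if the generation is unchanged, retrying on a precondition failure.

from google.cloud import storage
from google.api_core.exceptions import PreconditionFailed

def claim_next_file(bucket_name="bucket", object_name="current_file.txt"):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    while True:
        blob.reload()                                   # fetch the current object generation
        current = int(blob.download_as_text().strip())
        try:
            # The write succeeds only if nobody else updated the object meanwhile.
            blob.upload_from_string(str(current + 1),
                                    if_generation_match=blob.generation)
            return current + 1                          # this instance now owns <n>.txt
        except PreconditionFailed:
            continue                                    # another instance won the race; retry

Even with such a scheme, the drawbacks described below about blocking and one-file-at-a-time processing still apply.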
I recommend you consider alternative approaches.
Even if you are able to lock the counter, your "workers" will block, waiting their turn to increment this variable, when they should be able to continue processing files. You also limit processing to one file at a time, when it may be more efficient for your processes to grab batches of files at a time.
There are various approaches for you to consider.
If your set of files is predetermined, i.e. you always have 50k files, then when you start you could decide how many workers you wish to use and give each of them part of the problem to solve. If you chose 1000 workers, the first might be assigned 1.txt..50.txt, the 2nd 51.txt..100.txt, and so on. If there are gaps in the files, a worker simply skips the missing file.
In a more complex scenario, where files are created in the bucket randomly and on an ongoing basis, a common practice is to queue the processing. Have a look at Task Queues and Cloud Pub/Sub. In this approach, you track files as they arrive, and for each file you enqueue a job to process it. With both Task Queues and Pub/Sub you can create push or pull queues.
In either approach, you would write a worker that accepts jobs (files) from the queue, processes them, and does something with the processed file. This approach has two advantages over the simpler case. The first is that you can dynamically increase or reduce the number of workers based on the queue depth (the number of files to be processed). The second is that, if a worker fails, the job is not removed from the queue, so another worker can replace it and complete the file processing.
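As a rough illustration of such a pull-queue worker (a sketch assuming Cloud Pub/Sub's Python client; the project id, subscription name, and process_file function are placeholder names):

from google.cloud import pubsub_v1

def process_file(file_name):
    # Placeholder for the real work, e.g. the process.py step from the question.
    print("processing", file_name)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "files-to-process")

def callback(message):
    file_name = message.data.decode("utf-8")   # e.g. "42.txt", published when the file arrives
    process_file(file_name)
    message.ack()                              # ack only after successful processing

# Blocks this worker; messages that are never acked are redelivered to another worker.
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()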
You could move processed files to a "processed" bucket to track completion. This way, if your job fails, you need only restart with the files that have not yet been processed.
Lastly, rather than creating instances one-by-one, have a look at auto-scaling using Managed Instance Groups or perhaps consider using Kubernetes. Both these technologies help you clone many similar processes from a single template. While neither of these solutions solves your coordination problem, either would help you manage all the workers.

How to use S3 and EBS in tandem for cost effective analytics on AWS?

I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.
I might need to work with the files in the same way as I increase the number of features for future improved models.
Clearly because S3 stores data as objects, every time I make a change, I have to read and write 5TB of data.
What is the best approach I can take to process this data cost-effectively and promptly:
Store the 5TB file on S3 as an object and, every time, read the object, do the processing, and save the result back to S3
Store the 5TB on S3 as an object, read the object, chunk it into smaller objects, and save them back to S3 as multiple objects, so that in the future I only work with the chunks I am interested in
Save everything on EBS from the start, mount it to an EC2 instance, and do the processing there
Thank you
First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.
The smarter way to process this amount of data is to do it in parallel, and preferably with multiple smaller files rather than a single 5TB file (see the sketch after the steps below).
Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.
If you are sticking with your current method of processing the data, I would recommend:
If your application can read directly from S3, use that as the source. Otherwise, copy the file(s) to EBS.
Process the data
Store the output locally in EBS, preferably in smaller files (GBs rather than TBs)
Copy the files to S3 (or keep them on EBS if that meets your needs)
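To illustrate the "smaller files" recommendation, here is a hedged boto3 sketch that streams a large CSV from S3 and rewrites it as smaller chunked objects; the bucket names, keys, and lines-per-chunk value are made-up examples, and in practice you would tune the chunk size and add error handling:

import boto3

s3 = boto3.client("s3")
LINES_PER_CHUNK = 5_000_000   # tune so each output object lands in the GB range, not TB

def rechunk_csv(src_bucket, src_key, dst_bucket, dst_prefix):
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]   # streamed, not loaded at once
    buffer, part = [], 0
    for line in body.iter_lines():
        buffer.append(line)
        if len(buffer) >= LINES_PER_CHUNK:
            s3.put_object(Bucket=dst_bucket,
                          Key=f"{dst_prefix}/part-{part:05d}.csv",
                          Body=b"\n".join(buffer))
            buffer, part = [], part + 1
    if buffer:                                                      # flush the final partial chunk
        s3.put_object(Bucket=dst_bucket,
                      Key=f"{dst_prefix}/part-{part:05d}.csv",
                      Body=b"\n".join(buffer))

rechunk_csv("client-uploads", "huge-file.csv", "analytics-data", "huge-file-chunks")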

Amazon S3 files access from multiple processes

How does AWS handle multiple different threads accessing the same file at the same time? For example, say I have several big (1GB+) data files stored in S3 and two different processes need to work with the file at the same time (for example, one might be copying the data to another bucket while the other process is loading the data into Redshift for analysis). I know for redundancy they keep multiple copies of every file in S3 but how does it handle multiple requests coming in for operations on the same file?
thanks
You can read the file from as many simultaneous processes as you like. Writes to a given object are atomic, so if you try to write from more than one process, each write completes as a whole and the last write overwrites the previous ones. Therefore be careful when writing from multiple processes.
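As a small illustration of the "many concurrent readers, last writer wins" behaviour (a sketch; the bucket, key, and byte ranges are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "big-data-file.csv"

def read_range(byte_range):
    # Any number of readers can fetch the same object, or disjoint ranges of it, concurrently.
    return s3.get_object(Bucket=BUCKET, Key=KEY, Range=byte_range)["Body"].read()

with ThreadPoolExecutor(max_workers=2) as pool:
    chunks = list(pool.map(read_range, ["bytes=0-1048575", "bytes=1048576-2097151"]))

# Writers are not coordinated: if two processes put_object to the same key,
# whichever write completes last is the version subsequent readers will see.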