I am new to Hadoop and have just started learning it. I have a question regarding distributing our data on HDFS.
Suppose we have 100 TB of data in the form of a flat file.
Where will we initially load our data? On the master node?
Does Hadoop distribute the data across the cluster by itself, or do we have to do it ourselves?
Hi, please find the answers below.
Ans 1) Hadoop follows a master/slave architecture with two kinds of nodes:
Name (Master) Node
Data Node
NameNodes are the masters and DataNodes are the slaves. The data itself is stored on the DataNodes, while the master/NameNode stores each file's metadata: block information (where the blocks of the file are stored), file size, file permissions, and so on.
For more information, read the section "Anatomy of a File Write" in the book Hadoop: The Definitive Guide.
Ans 2) Hadoop takes care of data distribution for you. Distribution is based on file size: the default block size is 128 MB, and it is configurable. If your file is 256 MB, then 256/128 = 2, so your file will be stored as 2 blocks; the blocks are written in parallel, and each block's replicas are created sequentially.
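To make that arithmetic concrete, here is a minimal sketch (the file path is hypothetical) that uses the Hadoop FileSystem API to read a file's block size and work out how many blocks it occupies:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockInfo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml from the classpath
        val fs   = FileSystem.get(conf)
        val file = new Path("/data/flatfile.dat")   // hypothetical path to the big flat file

        val status    = fs.getFileStatus(file)
        val blockSize = status.getBlockSize         // e.g. 134217728 bytes = 128 MB
        val numBlocks = math.ceil(status.getLen.toDouble / blockSize).toLong

        println(s"size=${status.getLen} bytes, blockSize=$blockSize bytes, blocks=$numBlocks")
      }
    }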
Related
I have multiple S3 files in a bucket.
Input S3 bucket:
File 1 - 2 GB data
File 2 - 500 MB data
File 3 - 1 GB data
File 4 - 2 GB data
and so on. Assume there are 50 such files. The data within the files has the same schema, let's say attribute1, attribute2.
I want to merge these files and output them into a new bucket as follows, such that each file is less than 1 GB and has the same schema as before.
File 1 - < 1 GB
File 2 - < 1 GB
File 3 - < 1 GB
I am looking for AWS-based solutions which I can deliver using AWS CDK. I was considering the following two solutions:
AWS Athena - reads from and writes to S3, but I am not sure whether I can set up a 1 GB limit while writing.
AWS Lambda - read the files sequentially, buffer them in memory, and when the size is near 1 GB write out a new file to the S3 bucket; repeat until all files are processed. I'm worried about the 15-minute timeout and am not sure Lambda will be able to finish the processing.
Expected scale -> total input file size: 1 TB
What would be a good way to go about implementing this? I hope I have phrased the question right; I'd be happy to clarify in the comments if anything is unclear.
Thanks!
Edit:
Based on a comment ->
Apologies for calling it a merge; it's more of a re-split. All the files have the same schema and are stored as CSV files. In pseudocode:
List<File> listOfFiles = readFromS3(key)
File temp = new File("temp.csv")
for (File file : listOfFiles)
    append file's contents to temp.csv
List<File> finalList = split temp.csv into chunks of at most 1 GB each
for (File file : finalList)
    writeToS3(file)
Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (e.g. the same columns in a CSV file).
It can store the result in a new External Table, with a location pointing to an S3 bucket, by using a CREATE TABLE AS command and a LOCATION parameter.
The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
See:
Bucketing vs partitioning - Amazon Athena
Set the number or size of files for a CTAS query in Amazon Athena
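For what it's worth, here is a hedged sketch of what that CTAS could look like when submitted with the AWS SDK for Java v2 from Scala. The database, table, column, and S3 prefix names are placeholders, and the bucket_count is an assumption you would tune so each output file stays under 1 GB:

    import software.amazon.awssdk.services.athena.AthenaClient
    import software.amazon.awssdk.services.athena.model.{QueryExecutionContext, ResultConfiguration, StartQueryExecutionRequest}

    object MergeWithAthena {
      def main(args: Array[String]): Unit = {
        val athena = AthenaClient.create()

        // CTAS: read every CSV object behind input_table and rewrite it as bucketed
        // CSV files under the output prefix. bucket_count controls how many output
        // files you get, and so (indirectly) their size.
        val ctas =
          """CREATE TABLE merged_output
            |WITH (
            |  external_location = 's3://my-output-bucket/merged/',
            |  format            = 'TEXTFILE',
            |  field_delimiter   = ',',
            |  bucketed_by       = ARRAY['attribute1'],
            |  bucket_count      = 60
            |) AS
            |SELECT attribute1, attribute2 FROM input_table""".stripMargin

        val request = StartQueryExecutionRequest.builder()
          .queryString(ctas)
          .queryExecutionContext(QueryExecutionContext.builder().database("my_db").build())
          .resultConfiguration(ResultConfiguration.builder()
            .outputLocation("s3://my-athena-query-results/")   // where Athena writes query metadata
            .build())
          .build()

        println(s"Started CTAS query ${athena.startQueryExecution(request).queryExecutionId()}")
      }
    }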
If your process includes an ETL (Extract, Transform, Load) post-processing step, you could use AWS Glue.
Please find here an example of Glue using S3 as a source.
If you’d like to use it with the Java SDK, the best starting points are:
the Glue GitHub repo
the AWS Java code sample catalog for Glue
Out of all of them, the tutorial for creating a crawler (which you can find on GitHub via the URL above) should match your case best, as it crawls an S3 bucket and puts the result in a Glue catalog for transformation.
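If you prefer to script it rather than click through the console, a minimal sketch with the AWS SDK for Java v2 might look like this; the crawler name, IAM role, database, and bucket path are all placeholders:

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.{CrawlerTargets, CreateCrawlerRequest, S3Target, StartCrawlerRequest}

    object CreateS3Crawler {
      def main(args: Array[String]): Unit = {
        val glue = GlueClient.create()

        glue.createCrawler(CreateCrawlerRequest.builder()
          .name("csv-input-crawler")                                 // hypothetical crawler name
          .role("arn:aws:iam::123456789012:role/MyGlueCrawlerRole")  // placeholder IAM role
          .databaseName("my_db")                                     // Glue Data Catalog database to populate
          .targets(CrawlerTargets.builder()
            .s3Targets(S3Target.builder().path("s3://my-input-bucket/").build())
            .build())
          .build())

        // Kick off the first crawl so the CSV schema lands in the catalog.
        glue.startCrawler(StartCrawlerRequest.builder().name("csv-input-crawler").build())
      }
    }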
In Spark, if the data files are in AWS S3 (an object store), how are the blocks of the files read by the executors, and how do the executors coordinate, i.e. executor 1 reads block 1 (1-128 MB) and executor 2 reads block 2 (128-256 MB)? How is the entire process managed, and who manages it?
Secondly, how are the objects broken down into blocks?
The AWS S3 API allows reading just a byte range of an object. Spark splits the file into partitions of a defined size and assigns each partition to an executor; every executor then reads only its own partition.
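For illustration only, a ranged read with the AWS SDK for Java v2 looks roughly like this; the bucket, key, and the 128 MB byte range are assumptions:

    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.GetObjectRequest

    object RangedRead {
      def main(args: Array[String]): Unit = {
        val s3 = S3Client.create()

        // Ask S3 for only the second 128 MB "block" of the object (byte offsets are inclusive).
        val request = GetObjectRequest.builder()
          .bucket("my-bucket")                  // hypothetical bucket
          .key("data/part-00000.csv")           // hypothetical key
          .range("bytes=134217728-268435455")   // 128 MB .. 256 MB - 1
          .build()

        val stream = s3.getObject(request)      // streams just this slice, not the whole object
        try {
          // ... parse the records that fall inside this range ...
        } finally stream.close()
      }
    }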
Just a few simple questions on the actual mechanism behind reading a file on s3 into an EMR cluster with Spark:
Does spark.read.format("com.databricks.spark.csv").load("s3://my/dataset/").where($"state" === "WA") bring the whole dataset into the EMR cluster's local HDFS and then perform the filter afterwards? Or does it filter records while bringing the dataset into the cluster? Or does it do neither? If so, what's actually happening?
The official documentation lacks an explanation of what's going on (or if it does have an explanation, I cannot find it). Can someone explain, or link to a resource with such an explanation?
I can't speak for the closed-source AWS connector, but the ASF s3a: connector does its work in S3AInputStream.
Reading data is done via HTTPS, which has a slow startup time, and if you need to stop the download before the GET has finished, it forces you to abort the TCP stream and create a new one.
To keep this cost down, the code has features like:
Lazy seek: when you do a seek(), it updates its internal pointer but doesn't issue a new GET until you actually do a read.
Choosing whether to abort() or read to the end of a GET, based on how much data is left.
Three IO modes:
"sequential", GET content range is from (pos, EOF). Best bandwidth, worst performance on seek. For: CSV, .gz, ...
"random": small GETs, min(block-size, length(read)). Best for columnar data (ORC, Parquet) compressed in a seekable format (snappy)
"adaptive" (new last week, based on some work from microsoft on the Azure WASB connector). Starts off sequential, as soon as you do a backwards seek switches to random IO
The code is all there; improvements are welcome. (The current perf work, especially on random IO, is based on TPC-DS benchmarking of ORC data on Hive, BTW.)
Assuming you are reading CSV and filtering there, it'll be reading the entire CSV file and then filtering. This is horribly inefficient for large files. It's best to import the data into a columnar format and use predicate pushdown so the layers below can seek around the file, filtering and reading only the needed columns.
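A sketch of that conversion, assuming a hypothetical s3a:// prefix: write the CSV out once as Parquet, then filter the Parquet copy so only the matching row groups and columns are read.

    import org.apache.spark.sql.SparkSession

    object ConvertAndFilter {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()
        import spark.implicits._

        // One-off conversion: a full CSV scan, written back out as columnar Parquet.
        spark.read.option("header", "true").csv("s3a://my/dataset/")
          .write.mode("overwrite").parquet("s3a://my/dataset-parquet/")

        // Later reads fetch only the "state" column and can skip row groups whose
        // min/max statistics rule out state = 'WA' (predicate pushdown).
        spark.read.parquet("s3a://my/dataset-parquet/")
          .where($"state" === "WA")
          .show()
      }
    }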
Loading data from S3 (s3://...) on EMR usually goes via EMRFS.
EMRFS accesses S3 directly (not via HDFS).
When Spark loads data from S3, it is held in the cluster as a Dataset according to the StorageLevel (memory or disk).
Finally, Spark filters the loaded data.
When you specify files located on S3, they are read into the cluster. The processing happens on the cluster nodes.
However, this may be changing with S3 Select, which is now in preview.
I have just started to learn Hadoop and MapReduce concepts and have the following few questions that I would like to get cleared up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved. When we store a file in HDFS, the file is split into blocks (the block size is typically 64 MB or 128 MB, or whatever is configured for the current system). Once the big file is split into blocks, these blocks are stored across the cluster. This is handled internally by the Hadoop environment.
The background for the question is:
Let us say there are multiple such huge files being stored in the system. Blocks of these different files may be stored at one data node, A (there are 3 data nodes: A, B and C). Also, multiple blocks of the same file may be stored at the same data node, A.
Scenario 1:
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen? Will there be multiple mappers assigned to these different blocks, or will the same mapper process the multiple blocks?
Another part of the same question is: how does the client know which blocks, or let's say which part of the file, will be required for processing?
As the client doesn't know how the files are stored, how will it ask the NameNode for block locations, etc.? Or, for every such processing job, are ALL the blocks of the respective file processed? What I mean to ask is: what metadata is stored on the NameNode?
Scenario 2:
If there are two different requests to access blocks of different files on the same data node, what will happen? In this case, other data nodes will have no work to do, so won't there be a bottleneck at a single data node?
1) Number of mappers = number of blocks of the file, i.e. a separate mapper for each block. Ideally, the number of nodes in the cluster should be high enough that no two blocks of the same file are stored on the same machine.
2) Whenever a client submits a job, the job executes on the whole file and not on particular chunks.
3) When a client submits a job or stores a file inside HDFS, it is up to the framework how it functions. The client should not need to be aware of Hadoop's internals; basically, it is none of its business. The client should know only two things: the file and the job (.jar).
4) The NameNode stores the metadata for all files stored inside HDFS: how many blocks each file is divided into, and on which nodes/machines each block of the file is stored. On average, the NameNode needs about 150 bytes to store the metadata for each block.
5) Scenario 2: The NameNode manages such issues well. HDFS has a default replication factor of 3, which means every block will be stored on 3 different nodes. Through these other nodes HDFS can balance the load, but yes, the primary goal of replication is to ensure data availability. Also take into consideration that there will be relatively few requests that just read the contents of a file; Hadoop is meant for processing the data, not merely reading it.
I hope this will solve some of your doubts.
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen?
A client is not necessarily a mapper; at this level we are working with HDFS, and the data node will serve the same data to any client that requests it.
Will there be multiple mappers assigned to these different blocks, or will the same mapper process the multiple blocks?
Each MapReduce job has its own mappers. More jobs involving the same data block means more mappers working on the same data.
Another part of the same question is: how does the client know which blocks, or let's say which part of the file, will be required for processing?
As the client doesn't know how the files are stored, how will it ask the NameNode for block locations, etc.? Or, for every such processing job, are ALL the blocks of the respective file processed? What I mean to ask is: what metadata is stored on the NameNode?
Clients know which blocks are required thanks to the NameNode. At the beginning of file access, the client goes to the NameNode with the file name and gets back a list of the blocks where the data is stored, together with the DataNodes that hold them.
The NameNode holds the "directory information" together with the block list saying where the data is; all of this information is kept in RAM and is updated on each system startup. The DataNodes also send heartbeats to the NameNode, along with block allocation reports.
EVERY DataNode reports to EVERY NameNode.
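For what it's worth, that lookup is also visible through the client API. A small sketch (the path is hypothetical) that asks the NameNode which DataNodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocations {
      def main(args: Array[String]): Unit = {
        val fs     = FileSystem.get(new Configuration())
        val path   = new Path("/data/huge-file.dat")   // hypothetical file
        val status = fs.getFileStatus(path)

        // The NameNode answers this from its in-memory metadata: for each block,
        // the offset, the length, and the DataNodes holding a replica.
        fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
          println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
        }
      }
    }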
If there are two different requests to access blocks of different files on the same data node, what will happen? In this case, other data nodes will have no work to do, so won't there be a bottleneck at a single data node?
Unless the DataNode does not respond (failure), access always goes to the same DataNode. Replication is not used to make things fast; it is all about making sure no data will be lost.
I.e., when you write to HDFS, your data is forwarded to every replica block, and this makes writes quite slow. We need to be sure the data is safe.
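If you want to see the replication factor a file is using (and experiment with changing it), here is a minimal sketch against a hypothetical path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ReplicationInfo {
      def main(args: Array[String]): Unit = {
        val fs   = FileSystem.get(new Configuration())
        val path = new Path("/data/huge-file.dat")   // hypothetical file

        // Every block of this file is kept on this many DataNodes (default 3).
        println(s"replication factor: ${fs.getFileStatus(path).getReplication}")

        // You can raise it per file; the write pipeline still streams each block
        // through the replicas one after another.
        fs.setReplication(path, 4.toShort)
      }
    }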
I am trying to understand a few key concepts of MapReduce, particularly as they pertain to AWS EMR. Let's say that for all of this I am using an external Hive metastore (RDS).
My questions are as follows:
EMR has the concept of a task node (which processes data but does not hold persistent data, from what I understand). Does it copy the data over from HDFS (from the core nodes) to the task node to do the processing and then send the end results back to the core nodes? I am a little confused because MapReduce is supposed to be all about moving "the code" to where the data is. As a task node does not have any data on it, what is it processing (unless it copies the data over for processing, as I mentioned earlier)?
How does this work when all my data is sitting in S3? Again I am confused by the "code moves to the data" idea... as my data is sitting in S3, is it copied over to the core nodes for processing and then written back to S3?
Let's say I have all my data in S3: is it more efficient to use s3distcp to copy the data over to HDFS for processing, or am I better off pointing my Hive tables at the S3 bucket location?