AWS EMR format datanode periodically

I am using EMR and running a Spark Streaming job, with YARN as the resource manager and Hadoop 2.7.3-amzn-0.
I want to periodically clean up datanode block files such as /mnt/hdfs/current/BP-2030300665-192.168.0.1-1495611838265/current/finalized/subdir0/subdir230/blk_1073800835
and blk_1073800835_60012.meta.
They keep growing my storage usage, and I am facing a disk-full issue.
Is there any way to achieve this, and is there any impact on my cluster if I delete these files?
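For context, here is a hedged sketch of how one might investigate before deleting anything (the block ID is taken from the question above; the final path is a hypothetical placeholder). Those blk_* files are the on-disk blocks of HDFS files, so removing them directly from the local disk can corrupt HDFS; the usual approach is to find the owning HDFS files and remove them through HDFS instead:

hdfs fsck / -files -blocks | grep blk_1073800835    # locate the HDFS file that owns this block
hdfs dfs -du -h /                                   # see which HDFS directories consume the space
hdfs dfs -rm -r -skipTrash /path/to/unneeded/data   # hypothetical path; removing files via HDFS frees their blocks safely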

Related

Why do we need HDFS on EMR when we have S3

In our place, we use AWS services for all our data infrastructure and service needs. Our Hive tables are external tables, and the actual data files are stored in S3. We use Apache Spark for data ingestion and transformation. We have an ever-running EMR cluster with 1 master node (always running) and 1 core node (always running); whenever data processing happens, additional core nodes and task nodes are added, and they are removed once processing is done. Our EC2 instances have EBS volumes for temporary storage/scratch space for the executors.
Given this context, I am wondering why we need HDFS in our EMR cluster at all. I also see that the HDFS NameNode service is always running on the master node, and DataNode services are running on the core node. They do manage some blocks, but I am not able to find which files those blocks belong to. Also, the total size of all the blocks is very small (~2 GB).
Software versions used
Python version: 3.7.0
PySpark version: 2.4.7
EMR version: 5.32.0
If you know the answer to this question, can you please help me understand this need for HDFS? Please let me know if you have any questions for me.
HDFS in EMR is a built-in component that is provided to store secondary information, such as credentials when your Spark executors need to authenticate themselves to read a resource, or log files. In my personal experience, I have also used it as a staging area to store partial results during a long computation, so that if something went wrong in the middle I would have a checkpoint from which to resume execution instead of starting the computation from scratch. Storing the final result on HDFS is strongly discouraged.
Spark on EMR runs on YARN, which itself uses HDFS. The Spark executors run inside YARN containers, and Spark distributes its code and configuration by placing them in HDFS, from which they are distributed to all of the nodes running Spark executors in YARN containers. Additionally, the Spark event logs from each running and completed application are stored in HDFS by default.
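As a concrete illustration (hedged: the paths below are what I have seen on EMR 5.x releases and may differ on yours), you can verify the event-log usage on the cluster itself:

grep eventLog /etc/spark/conf/spark-defaults.conf   # spark.eventLog.dir typically points at an hdfs:/// path on EMR
hdfs dfs -ls /var/log/spark/apps                    # event logs of running/completed applications (path assumed from spark-defaults)
hdfs dfs -du -h /var/log/spark/apps                 # HDFS space the event logs occupy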

Spark on EMR | EKS | YARN

I am migrating from on-premises to the AWS stack. I have a doubt and am often confused about how Apache Spark works on AWS and similar platforms.
Let me share my current understanding of on-premises Spark running on YARN. When an application is submitted to the Spark cluster, an application master is created on one of the data nodes (as a container), and it takes care of the application by spawning executor tasks on the data nodes. This means the Spark code is deployed to the nodes where the data resides, which means less network transfer. Moreover, this is logical and easy to visualise (at least to me).
But suppose the same Spark application runs on AWS, fetching its data from S3 and running on top of EKS. Here, as I understand it, the Spark driver and the executor tasks will be spawned in Kubernetes pods.
- Does this mean the data has to be transferred over the network from S3 to the EKS cluster, to the node where the executor pod is spawned?
I have seen some videos that use EMR on top of EKS, but I am a little confused here.
- Since EMR provides the Spark runtime, why do we use EKS here? Can't we run EMR alone for Spark applications in an actual production environment? (I know that EKS can be a replacement for YARN in the Spark world.)
- Can't we run Spark on top of EKS without using EMR? (I am thinking of EMR as a cluster in which Spark drivers and executors can run.)
Edit: this is a query more about Kubernetes integration with Spark, not specific to AWS.
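On the last question: yes, plain open-source Spark can be submitted straight to a Kubernetes cluster without EMR. A minimal sketch, assuming a reachable Kubernetes API server and a pre-built Spark container image (every name, URL, and image tag below is a placeholder):

# The Kubernetes API server acts as the cluster manager; with --deploy-mode cluster
# the driver itself also runs in a pod. Reading from s3a:// requires the hadoop-aws
# connector to be present in the image.
spark-submit \
  --master k8s://https://my-eks-api-endpoint:443 \
  --deploy-mode cluster \
  --name my-app \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-repo/spark:3.3.0 \
  --class com.example.MyApp \
  s3a://my-bucket/jars/my-app.jar

As I understand it, what EMR on EKS adds on top of this is the EMR-optimized Spark runtime plus the EMR job-submission APIs and console, which is the main reason teams combine the two.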

Turn on/off AWS EMR clusters

How can I turn EMR clusters on and off? The only option available is to terminate them permanently. What if I do not need the cluster at night and do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported. You simply terminate it when you don't need it.
To protect your data, you should be using EMRFS, which allows the EMR cluster to read data directly from S3. This way, there is no need to copy any data from S3 to HDFS.
You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g. RAM/CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via the AWS CLI, and schedule such jobs to run in the morning and in the evening (sketched below).
From my experience, resizing works well for task nodes, while resizing core nodes requires an HDFS sync that works only if you don't run any tasks on your EMR cluster.
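A hedged sketch of that scheduled resize approach using the AWS CLI (the cluster ID, instance-group ID, and counts are placeholders):

aws emr list-instance-groups --cluster-id j-XXXXXXXX             # find the task instance group's ID
aws emr modify-instance-groups --cluster-id j-XXXXXXXX \
  --instance-groups InstanceGroupId=ig-XXXXXXXX,InstanceCount=0  # evening: scale task nodes to zero
aws emr modify-instance-groups --cluster-id j-XXXXXXXX \
  --instance-groups InstanceGroupId=ig-XXXXXXXX,InstanceCount=10 # morning: scale back up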

Is it possible to specify the number of mappers-reducers while using s3-dist-cp?

I'm trying to copy data from an EMR cluster to S3 using s3-dist-cp. Can I specify a number of reducers greater than the default so as to speed up the process?
To set the number of reducers, you can use the mapreduce.job.reduces property, similar to the following:
s3-dist-cp -Dmapreduce.job.reduces=10 --src hdfs://path/to/data/ --dest s3://path/to/s3/
Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.
You can call S3DistCp by adding it as a step in your existing EMR cluster. Steps can be added to a cluster at launch or to a running cluster using the console, AWS CLI, or API.
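For instance, a hedged sketch of submitting the same s3-dist-cp command as a step via the AWS CLI (the cluster ID is a placeholder, and the source/destination paths are taken from the example above; on EMR, s3-dist-cp is typically launched through command-runner.jar):

# Add S3DistCp as a step on a running cluster; j-XXXXXXXX is a placeholder.
aws emr add-steps --cluster-id j-XXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=S3DistCp,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,-Dmapreduce.job.reduces=10,--src,hdfs://path/to/data/,--dest,s3://path/to/s3/]'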
So you control the number of workers during EMR cluster creation, or you can resize an existing cluster. You can check the exact steps in the EMR docs.

What is a suitable configuration for setting up a 2-3 node Hadoop cluster on AWS?

I want to set up Hive, HBase, Solr, and Tomcat on a Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR, or with EC2 and manually setting up the cluster there.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (eg Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster under Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR
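As a starting point, a hedged sketch of launching such a small POC cluster from the AWS CLI (the name, key pair, and instance type are placeholders; note that Solr and Tomcat are not EMR-managed applications, so they would have to be installed separately, e.g. via a bootstrap action):

# Minimal 3-node POC cluster; adjust the release label, instance type, and key pair for your account.
aws emr create-cluster \
  --name "hadoop-poc" \
  --release-label emr-5.32.0 \
  --applications Name=Hadoop Name=Hive Name=HBase \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles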