We use AWS for all of our data infrastructure and services. Our Hive tables are external tables, and the actual data files are stored in S3. We use Apache Spark for data ingestion and transformation. We have a long-running EMR cluster with 1 master node (always running) and 1 core node (always running); whenever data processing happens, additional core and task nodes are added, and they are removed once processing is done. Our EC2 instances have EBS volumes that serve as temporary storage/scratch space for executors.
Given this context, I am wondering why we need HDFS in our EMR cluster at all. I also see that the HDFS NameNode service is always running on the master node and the DataNode service is running on the core node. They do manage some blocks, but I am not able to find which files those blocks belong to. The total size of all the blocks is also very small (~2 GB).
Software versions used
Python version: 3.7.0
PySpark version: 2.4.7
EMR version: 5.32.0
If you know the answer, can you please help me understand why HDFS is needed? Please let me know if you have any questions for me.
HDFS in EMR is a built-in component that stores secondary information, such as credentials if your Spark executors need to authenticate themselves to read a resource. Another use is storing log files. In my personal experience, I have used it as a staging area for partial results in a long computation, so that if something went wrong in the middle I had a checkpoint from which to resume instead of starting the computation from scratch. Storing the final result on HDFS is strongly discouraged, because the cluster's HDFS is lost when the cluster is terminated.
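The staging-area idea described above could look roughly like the following. This is a minimal sketch, not the answerer's actual code: the path layout, the helper names (`staging_uri`, `final_uri`, `load_or_compute`) and the choice of Parquet are all assumptions made for illustration. It expects an existing `SparkSession` to be passed in.

```python
def staging_uri(job_name: str, step: str) -> str:
    """Cluster-local HDFS path for an intermediate result (hypothetical layout)."""
    return f"hdfs:///tmp/staging/{job_name}/{step}"


def final_uri(bucket: str, job_name: str) -> str:
    """Durable S3 path for the final result (hypothetical layout)."""
    return f"s3://{bucket}/output/{job_name}"


def load_or_compute(spark, path, compute):
    """Reuse the result saved at `path`, or compute it and save it first.

    `compute` is a zero-argument callable that returns a DataFrame.
    """
    try:
        # Succeeds only if an earlier run already wrote this step.
        return spark.read.parquet(path)
    except Exception:
        # Assumption: any read failure means "no usable checkpoint here";
        # a real job would likely catch a narrower exception type.
        compute().write.mode("overwrite").parquet(path)
        return spark.read.parquet(path)
```

A long job would wrap each expensive step in `load_or_compute(spark, staging_uri("daily", "join"), build_join)` and write only the last DataFrame to `final_uri(...)`, so a rerun after a failure skips the steps that already completed.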
Spark on EMR runs on YARN, which itself uses HDFS. The Spark executors run inside of YARN containers, and Spark distributes the Spark code and config by placing it in HDFS and distributing it to all of the nodes running the Spark executors in the YARN containers. Additionally, the Spark Event Logs from each running and completed application are stored in HDFS by default.
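To see which files the existing HDFS blocks belong to, the standard `hdfs` command-line tool can list usage per path and the blocks behind each file. The sketch below only assembles those commands; the helper names are hypothetical, and it assumes the `hdfs` binary is on the `PATH` of the node where it is run (typically the master node).

```python
import subprocess


def hdfs_inspection_commands(path="/"):
    """Return the commands used to inspect HDFS under `path`."""
    return {
        # Space used by each entry directly under `path`, human-readable.
        "usage": ["hdfs", "dfs", "-du", "-h", path],
        # Every file under `path`, followed by the blocks that make it up.
        "blocks": ["hdfs", "fsck", path, "-files", "-blocks"],
    }


def run(cmd):
    """Run one command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `print(run(hdfs_inspection_commands("/")["blocks"]))` on the master node would show the file that each block ID belongs to.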
Related
I am trying to set up a data pipeline in AWS, hopefully using serverless and hosted services.
However, one of the steps requires a large amount of RAM (120 GB) and cannot be broken down into smaller chunks.
Ideally I would also run the steps as containers, since the package requirements are a bit exotic.
So far it seems like neither AWS Glue nor MWAA handles more than 32 GB of RAM.
The one that does handle it is AWS Data Pipeline, which is being deprecated.
Am I missing some (hosted) options? Otherwise I know that I can do things like running Flyte on managed k8s.
Regards,
Niklas
For a use case like this, where you need a containerized approach and prefer it to be serverless, you can check out EMR Serverless:
Amazon EMR Serverless is a new deployment option for Amazon EMR. EMR
Serverless provides a serverless runtime environment that simplifies
the operation of analytics applications that use the latest open
source frameworks, such as Apache Spark and Apache Hive. With EMR
Serverless, you don’t have to configure, optimize, secure, or operate
clusters to run applications with these frameworks.
EMR Serverless helps you avoid over- or under-provisioning resources
for your data processing jobs. EMR Serverless automatically determines
the resources that the application needs, gets these resources to
process your jobs, and releases the resources when the jobs finish.
Additionally, you can build your own containers with custom images that contain your specific package requirements.
And a note: Glue can process this file too. The G.2X worker type has 32 GB of memory, but it also has 128 GB of disk space, which a worker uses when it needs the space (and during shuffle operations). You can also add your custom packages per job.
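For the EMR Serverless route, a job is submitted with the `start_job_run` call of the boto3 `emr-serverless` client. The sketch below builds such a request; the helper names, the default memory values and all identifiers passed in are placeholders. Whether the memory you request fits within the worker sizes the service allows is something to check against the current limits, not something this sketch guarantees.

```python
def build_job_run_request(application_id, execution_role_arn, entry_point,
                          driver_memory="16g", executor_memory="16g"):
    """Build keyword arguments for the EMR Serverless start_job_run call."""
    params = " ".join([
        f"--conf spark.driver.memory={driver_memory}",
        f"--conf spark.executor.memory={executor_memory}",
    ])
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": params,
            }
        },
    }


def submit(request):
    """Submit the job; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr-serverless")
    return client.start_job_run(**request)["jobRunId"]
```

Keeping the request builder separate from the network call makes it easy to review or log the exact parameters before anything is launched.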
I am migrating from on-premises to the AWS stack, and I am often confused about how Apache Spark works on AWS and similar platforms.
I will share my current understanding of on-premises Spark running on YARN. When an application is submitted to the Spark cluster, an application master is created on one of the data nodes (as a container), and it takes care of the application by spawning executor tasks on the data nodes. This means the Spark code is deployed to the nodes where the data resides, which means less network transfer. Moreover, this is logical and easy to visualise (at least to me).
But suppose the same Spark application runs on AWS. It fetches the data from S3 and runs on top of EKS. Here, as I understand it, the Spark driver and the executors are spawned in k8s pods.
- Then does this mean the data has to be transferred over the network from S3 to the EKS cluster, to the node where the executor pod is spawned?
I have seen some videos that use EMR on top of EKS, but I am a little confused here.
- Since EMR provides the Spark runtime, why do we use EKS here? Can't we run EMR alone for Spark applications in an actual production environment? (I know that EKS can be a replacement for YARN in the Spark world.)
- Can't we run Spark on top of EKS without using EMR? (I am thinking of EMR as a cluster in which Spark drivers and executors can run.)
Edit - This is more a question about Spark integration with k8s, not specific to AWS.
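On the last point: open-source Spark can use Kubernetes directly as its cluster manager, by passing a `k8s://` master URL to `spark-submit`. The sketch below assembles such a command as an illustration. The function name, the application name `s3-job` and every argument value are placeholders; the container image would need to contain Spark plus whatever S3 connector the job uses.

```python
def spark_on_k8s_submit(api_server, image, app, executors=2,
                        namespace="default"):
    """Assemble a spark-submit command that targets a Kubernetes cluster."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself also runs in a pod.
        "--deploy-mode", "cluster",
        "--name", "s3-job",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        # e.g. a local:// path to a script baked into the image
        app,
    ]
```

With this setup the executors run in pods on whichever nodes Kubernetes schedules them, so input stored in S3 is read over the network; there is no HDFS-style data locality to exploit.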
How can I turn EMR clusters on and off? The only option is to terminate a cluster permanently. What if I do not need the cluster at night and I do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported. You simply terminate it when you don't need it.
To protect your data, you should be using EMRFS, which allows an EMR cluster to read data from S3. This way, there is no need to copy any data from S3 to HDFS.
You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g. RAM/CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via awscli, and schedule such jobs to run in the morning and in the evening.
In my experience, resizing works well for task nodes, while resizing core nodes requires an HDFS sync that works only if you are not running any tasks on your EMR cluster.
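The scheduled resize described above can also be done from Python with the boto3 EMR client's `modify_instance_groups` call, instead of awscli. This is a minimal sketch: the helper names, the day/night hours and all IDs are placeholders, and it assumes the cluster uses instance groups (not instance fleets) and that the group being resized is a task group.

```python
def target_count(hour, day_count, night_count, day_start=8, day_end=20):
    """Pick the node count for the given hour of day (0-23)."""
    return day_count if day_start <= hour < day_end else night_count


def resize_request(cluster_id, instance_group_id, count):
    """Build keyword arguments for the EMR modify_instance_groups call."""
    if count < 0:
        raise ValueError("instance count cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": count}
        ],
    }


def apply(request):
    """Send the resize; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the helpers work without boto3

    boto3.client("emr").modify_instance_groups(**request)
```

A cron job or scheduled Lambda could call `apply(resize_request(cluster_id, group_id, target_count(hour, 4, 0)))` once in the morning and once in the evening.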
I want to get a clear big picture of AWS Glue regarding the following aspects.
How does AWS Glue prepare and provision its infrastructure? It is serverless, but how does it manage that?
How does it use Apache Spark and Hadoop to run so many ETL jobs at a time, for hundreds of AWS Glue customers in every region?
Thanks
AWS Glue uses EMR underneath. It spawns a new cluster with the required number of executors (depending on the configured DPU) when a new job starts. However, to improve cold start time, they keep a buffer of already provisioned EMR clusters for the most common numbers of DPUs. To manage all this, they have a set of automated services that monitor the state of each cluster, start new ones, etc.
I am using EMR and running a Spark Streaming job with YARN as the resource manager, on Hadoop 2.7.3-amzn-0.
I want to periodically clean up DataNode files such as /mnt/hdfs/current/BP-2030300665-192.168.0.1-1495611838265/current/finalized/subdir0/subdir230/blk_1073800835
and blk_1073800835_60012.meta
They keep increasing my storage usage, and I am running into disk-full issues.
Is there any way to achieve this, and would deleting these files have any impact on my cluster?
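The `blk_*` files are the HDFS blocks themselves, so removing them by hand would leave the NameNode with missing blocks. A safer first step is to find which HDFS paths occupy the space and then delete those through HDFS. The sketch below parses the plain output of `hdfs dfs -du <path>` and sorts it by size. It assumes each line starts with the size in bytes and ends with the path (the number of columns in between varies across Hadoop versions), and that paths contain no spaces. `parse_du` is a hypothetical helper name.

```python
def parse_du(text):
    """Parse `hdfs dfs -du` output into (bytes, path) pairs, largest first."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that does not start with a byte count.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        rows.append((int(parts[0]), parts[-1]))
    return sorted(rows, reverse=True)
```

Running this on the output for `/` and then on the largest subdirectory, repeatedly, narrows down what keeps growing; for a streaming job, checkpoint directories and application logs are common candidates to check.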