EMR & Spark adding dependencies after cluster creation

Is it possible to install additional libs/dependencies after the cluster is already up and running?
Things I've done that are related to this:
I've already done the pre-creation bootstrapping process (this is a different solution altogether)
Alternatively, I've SSH'd into each node and installed dependencies after the cluster came up
I think the post-startup installation solution would involve being able to fire off a command to all the executors from the driver. Is there a way to accomplish this?

If you want additional dependencies for your steps, you can add them in the command of the step (e.g. for Spark, use the --jars option).
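For instance, a rough sketch of submitting a Spark step to a running cluster with the AWS CLI (the cluster ID, bucket names, and jar/script paths are placeholders, not from the original question):

# Add a Spark step that pulls an extra dependency in via --jars
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=MyJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--jars,s3://my-bucket/libs/extra-lib.jar,s3://my-bucket/jobs/my_job.py]'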

Related

My GKE pods stopped with error "no command specified: CreateContainerError"

Everything was OK and the nodes were fine for months, but suddenly some pods stopped with an error.
I tried deleting the pods and nodes, but the issue remains.
Try the possible solutions below to resolve your issue:
Solution 1:
Check for a malformed character in your Dockerfile that could be causing the crash.
When you encounter CreateContainerError, the first thing to check is that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. However, if you don't have access to the Dockerfile, you can configure your pod object by setting a valid command in the command attribute of the object.
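As a minimal sketch of that pod-side workaround (the image and command are placeholders, not from the original question), you can supply the command when creating the pod:

# Run the image with an explicit command so the container has something to execute
kubectl run myapp \
  --image=us-docker.pkg.dev/my-project/my-repo/my-image:latest \
  --command -- /bin/sh -c 'exec /app/server'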
Another possible workaround is to not specify any workerConfig explicitly, which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime, the similar SO1 and SO2 threads, and this similar GitHub link for more information.
Solution 2:
The kubectl describe pod <podname> command provides detailed information about each of the pods that make up your Kubernetes infrastructure. Use it to check for clues; if you see Insufficient CPU, follow the solution below.
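A quick check might look like this (the pod and namespace names are placeholders):

# Inspect the pod and recent events for scheduling problems such as "Insufficient cpu"
kubectl describe pod my-pod -n my-namespace
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp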
The solution is to either:
1) Upgrade the boot disk: if you are using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2) Increase the disk size.
3) Use a node pool with a machine type that has more CPU cores.
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
If you still have the issue, you can then update the GKE version for your cluster by manually upgrading the control plane to one of the fixed versions.
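A hedged sketch of that manual control-plane upgrade with gcloud (cluster name, zone, and target version are placeholders):

# Upgrade only the control plane (master) to a fixed GKE version
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --master \
  --cluster-version FIXED_GKE_VERSION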
Also check whether you have updated it in the last year to use the new kubectl authentication plugin coming in GKE v1.26.
Solution 3:
If you have a pipeline on GitLab that deploys an image to a GKE cluster, check the version of the GitLab runner that handles the jobs of your pipeline.
It turns out that every image built through a GitLab runner running an old version causes this issue at container start. Simply deactivate the old runners, keep only GitLab runners on the latest version in the pool, and replay all pipelines.
Also check whether the GitLab CI script uses an old Docker image like docker:19.03.5-dind; updating it to docker:dind helps Kubernetes start the pod again.

Is it possible to use a custom hadoop version with EMR?

As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which uses Hadoop 3.2.1.
I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: you can set either ReleaseLabel or the Hadoop version, but not both.
import boto3

client = boto3.client("emr", region_name="us-west-1")
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}]
)
Another approach, which also does not seem to be an option, is loading a specific Hadoop jar with SparkSession.builder.getOrCreate(), like so:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .getOrCreate()
Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?
I'm afraid not. AWS doesn't want the support headache of allowing unsupported Hadoop versions, so they're always a little bit behind, presumably because they take time to test each new release and its compatibility with the other Hadoop tools: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html
You'd have to build your own cluster from scratch in EC2.
You just need to add the script to the bootstrap actions section when you spin up your cluster, namely spark-patch-s3a-fix_emr-6.6.0.sh. Amazon provided this fix only for EMR 6.6.0.
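As a rough sketch with the AWS CLI (the S3 location of the script and the other cluster settings are assumptions), the script is attached as a bootstrap action at cluster creation:

# Spin up the cluster with the patch script as a bootstrap action
aws emr create-cluster \
  --name my-cluster \
  --release-label emr-6.6.0 \
  --applications Name=Spark Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh,Name=patch-s3a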

Run command conditionally while deploying with amazon ECS

I have a Django app deployed on ECS. On my first deployment I need to run fixtures, a management command that has to be run inside the container.
I want to be able to run fixtures conditionally, not only on the first deployment. One approach I was considering is to maintain a variable in my environment and have entrypoint.sh run the fixtures accordingly.
Is this a good way to go about it? Also, what are some other standard ways to do the same?
Let me know if I have missed some details you might need to understand my problem.
You probably need to handle it in your entrypoint.sh script. In my experience, you won't be able to run commands conditionally on ECS without a script.
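A minimal entrypoint.sh sketch along those lines (the RUN_FIXTURES variable name and the fixture file are assumptions, set for example through the ECS task definition):

#!/bin/sh
set -e

# Only load fixtures when the environment variable is set
if [ "$RUN_FIXTURES" = "true" ]; then
  python manage.py loaddata initial_data.json
fi

# Hand off to the main process defined by the image/task definition
exec "$@"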

Elastic Kubernetes Service AWS Deployment process to avoid down time

It's been a month since I started working on AWS EKS, and up till now I have successfully deployed my code.
The steps which I follow for deployment are given below:
Create image from docker terminal.
Tag and push to ECR AWS.
Create the deployment "project.json" and service file "project-svc.json".
Save the above files in the "kubectl/bin" path and deploy them with the following commands:
"kubectl apply -f projectname.json" and "kubectl apply -f projectname-svc.json".
So if I want to deploy the same project again with a change, I push the new image to ECR, delete the existing deployment using "kubectl delete -f projectname.json" (without deleting the existing service), and then deploy it again using "kubectl apply -f projectname.json".
Now I'm confused: after I delete the existing deployment there is downtime until I apply or create the deployment again. How can I avoid this? I don't want that downtime; it's the reason I started using EKS in the first place.
Also, the deployment process is a bit long. I know I'm missing something; can anybody guide me properly, please?
The project is in .NET Core, and if there is any simplified way to do the deployment using Visual Studio, please guide me on that as well.
Thank You in advance!
There is actually no need to delete your deployment. You just need to update the desired state (the deployment configuration) and let K8s do its magic and apply the needed changes, like deploying a new version of your container.
If you have a single instance of your container, you will experience a short down time while changes are applied. If your application supports multiple replicas (HA), you can enjoy the rolling upgrade feature.
Start by reading the official Kubernetes documentation on Performing a Rolling Update.
You only need to use delete/apply if you are changing the ConfigMap attached to the Deployment (and only if you have one).
If the only change you make is the "image" of the deployment, you should use the "set image" command.
kubectl lets you change the actual deployment image and does the rolling update all by itself, and with 3+ pods you have the minimum chance of downtime.
Even better, if you use the --record flag, you can "rollback" to your previous image with no effort because it keeps track of the changes.
You can also specify the "context", with no need to jump between contexts.
You can go like this:
kubectl set image deployment DEPLOYMENT_NAME DEPLOYMENT_NAME=IMAGE_NAME --record -n NAMESPACE
Or, specifying the cluster:
kubectl set image deployment DEPLOYMENT_NAME DEPLOYMENT_NAME=IMAGE_NAME_ECR -n NAMESPACE --cluster EKS_CLUSTER_NPROD --user EKS_CLUSTER --record
For example:
kubectl set image deployment nginx-dep nginx-dep=ecr12345/nginx:latest -n nginx --cluster eu-central-123-prod --user eu-central-123-prod --record
The --record flag is what lets you track all the changes; if you want to roll back, just do:
kubectl rollout undo deployment.v1.apps/nginx-dep
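You can also watch the rollout progress before deciding whether a rollback is needed (the names follow the nginx example above):

kubectl rollout status deployment/nginx-dep -n nginx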
More documentation about it here:
Updating a deployment
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment
Roll Back Deployment
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-a-deployment

How to run pyspark on EC2 with IPython starting from the spark-ec2 launch process?

Three steps and I have a spark context in my IPython notebook:
1.) Launch Spark on EC2 using these instructions.
2.) Install anaconda and py4j on every node (set PATH accordingly).
3.) Login to master, cd to the spark folder, then run:
MASTER=spark://<public DNS>:7077 PYSPARK_PYTHON=~/anaconda2/bin/python PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="*"' ./bin/pyspark
This process makes the IPython notebook available on <master public DNS>:8888, which is great, but ... I am currently using a csshx-style solution to accomplish step 2.
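For reference, the per-node setup for step 2 is roughly the following (the Anaconda installer version/URL and paths are assumptions; adjust to whatever you actually install):

# Install Anaconda (matching the ~/anaconda2 path used above) and py4j on a node
wget https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh
bash Anaconda2-4.2.0-Linux-x86_64.sh -b -p ~/anaconda2
~/anaconda2/bin/pip install py4j
# Put the Anaconda python first on the PATH for this user
echo 'export PATH=~/anaconda2/bin:$PATH' >> ~/.bash_profile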
Question:
How can I set install requirements (on AWS or elsewhere) so that the spark-ec2 script spins up machines with the desired setup?
If that's not possible or simply clunky, what would you suggest? (command line only solutions are preferred)