Spark not able to access S3 when run standalone on AWS Batch - amazon-web-services

With the AWS library I can access S3, but if I try to access S3 with Spark program (build with NativePackager) this doesn't work.
I tried s3://, s3n:// and s3a://.
Let me show a few off my tests:
test 1:
If I do nothing special. Failed as described before.
test 2:
Following https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html, I did this code before invoking my code:
curl --location http://169.254.170.2/$$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI > credentials.txt
export AWS_ACCESS_KEY_ID=`cat credentials.txt | perl -MJSON::PP -E 'say decode_json(<>)->{"AccessKeyId"}'`
export AWS_SECRET_ACCESS_KEY=`cat credentials.txt | perl -MJSON::PP -E 'say decode_json(<>)->{"SecretAccessKey"}'`
The some error has before
test 3:
If I set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with my personal keys. Both AWS library and Spark Work
Considering that the test 3 works, my code works. For obvious reason, I don't like to maintain keys around. The question is:
How can I use the AWS Batch (ECS) created credentials on my Spark Job?

I had the same issue and after reading the documentation carefully I realized I needed to add this to my spark properties:
sparkConf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.DefaultAWSCredentialsProviderChain')
Hope it helps

Related

spark cluster on aws emr cant find spark-env.sh

I am playing with apache-spark on aws emr, and trying to use this to set the cluster to use python3,
I use the command as the last command in a bootstrap script
sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
When I use it the cluster crashes during the bootstrap with the following error.
sed: can't read /etc/spark/conf/spark-env.sh: No such file or
directory
How should I set it to use python3 properly?
This is not a duplicate of, My issue is that the cluster is not finding the spark-env.sh file while bootstrapping, while the other question addresses the issue of the system not finding python3
In the end I did not use that script, but Used the EMR configuration file that is available on the creation stage, It gave me the proper configurations via spark_submit (in the aws gui) If you need it to be available for pyspark scripts in a more programatic way, you can use os.environ to set the pyspark python version in the python script

passing aws creds to kitchen ec2 command line

I am trying to do chef cookbook development via Jenkinsfile pipeline. I have my jenkins server running as a container (using jenkinsci/blueocean image). As one of the stages, I am trying to do aws configure and then run kitchen test. For some reason with below code, I am getting unauthorized operation error. For some reason, my AWS creds are not sent properly to .kitchen.yml (No need to check IAM creds, because they have admin access)
stage('\u27A1 Verify Kitchen') {
steps {
sh '''mkdir -p ~/.aws/
echo 'AWS_ACCESS_KEY_ID=...' >> ~/.aws/credentials
echo 'AWS_SECRET_ACCESS_KEY=...' >> ~/.aws/credentials
cat ~/.aws/credentials
KITCHEN_LOCAL_YAML=.kitchen.yml /opt/chefdk/embedded/bin/kitchen list
KITCHEN_LOCAL_YAML=.kitchen.yml /opt/chefdk/embedded/bin/kitchen test'''
}
}
Is there anyway, I can pass AWS creds here. Also .kitchen.yml no longer supports passing AWS creds inside the file. Is there someway I can pass creds on command i.e. .kitchen.yml access_key=... secret_access_key=... /opt/chefdk/embedded/bin/kitchen test
Really appreciate your help.
You don't need to set KITCHEN_LOCAL_YAML=.kitchen.yml, that's already the primary config file.
You probably want to be using a Jenkins credential file, not hardcoding things into the job. But that said, the reason this isn't working is because the AWS credentials file is not a shell script, which is the syntax you're using there. It's an INI/TOML file and is paired with a similar config file that shares a similar structure.
You should probably just be using the environment variable support in kitchen-ec2 via the withEnv pipeline helper method or similar things for integrating with Jenkins managed credentials.

Semantic versioning with AWS CodeBuild

Currently my team is using Jenkins to manage our CI/CD workflow. As our infrastructure is entirely in AWS I have been looking into migrating to AWS CodePipeline/CodeBuild to manage this.
In current state, we are versioning our artifacts as such <major>.<minor>.<patch>-<jenkins build #> i.e. 1.1.1-987. However, CodeBuild doesn't seem to have any concept of a build number. As artifacts are stored in s3 like <bucket>/<version>/<artifact> I would really hate to lose this versioning approach.
CodeBuild does provide a few env variables that i can see here: http://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref.html#build-env-ref-env-vars
But from what is available it seems silly to try to use the build ID or anything else.
Is there anything readily available from CodeBuild that could support an incremental build #? Or is there an AWS recommended approach to semantic versioning? Searching this topic returns remarkably low results
Any help or suggestions is greatly appreciated
The suggestion to use date wasn't really going to work for our use case. We ended up creating a base version in SSM and creating a script that runs within the buildspec that grabs, increments, and updates the version back to SSM. It's easy enough to do:
Create a String/SecureString within SSM as [NAME]. For example lets say "BUILD_VERSION". The value should be in [MAJOR.MINOR.PATCH] or [MAJOR.PATCH].
Create a shell script. The one below should be taken as a basic template, you will have to modify it to your needs:
#!/bin/bash
if [ "$1" = 'next' ]; then
version=$(aws ssm get-parameter --name "BUILD_VERSION" --region 'us-east-1' --with-decryption | sed -n -e 's/.*Value\"[^\"]*//p' | sed -n -e 's/[\"\,]//gp')
majorminor=$(printf $version | grep -o ^[0-9]*\\.[0-9]*\. | tr -d '\n')
patch=$(printf $version | grep -o [0-9]*$ | tr -d '\n')
patch=$(($patch+1))
silent=$(aws ssm put-parameter --name "BUILD_VERSION" --value "$majorminor$patch" --type "SecureString" --overwrite)
echo "$majorminor$patch"
fi
Call the versioning script from within buildspec and use the output however you need.
It may be late while I post this answer, however since this feature is not yet released by AWS this may help a few people in a similar boat.
We used Jenkins build numbers for versioning and were migrating to codebuild/code-pipeline. codebuild-id did not work for us as it was very random.
So in the interim we create our own build number in buildspec file
BUILD_NUMBER=$(date +%y%m%d%H%M%S).
This way at least we are able to look at the id and know when it was deployed and have some consistency in the numbering.
So in your case, it would be 1.1.1-181120193918 instead of 1.1.1-987.
Hope this helps.
CodeBuild supports semantic versioning.
In the configuration for the CodeBuild project you need to enable semantic versioning (or set overrideArtifactName via the CLI/API).
Then in your buildspec.yml file specify a name using the Shell command language:
artifacts:
files:
- '**/*'
name: myname-$(date +%Y-%m-%d)
Caveat: I have tried lots of variations of this and cannot get it to work.

AWS CLI save output as log file

I'm new to AWS CLI (and programming), but I've looked through documentation and posted questions and can't find this addressed, I must be missing something basic?
How do I save the output? I'd like to run AWS S3 Sync to backup my data overnight, and I'd like to see a log report in the morning of what happened.
At this point, I can run AWS from a command prompt:
aws s3 sync "my local directory" s3://mybucket
I've set output format to Text in the config. But I'm only seeing the text in the command prompt. How can I export it as a log file?
Is this not possible, what am I missing?
Many thanks in advance,
Matthew
aws s3 sync "my local directory" s3://mybucket --debug 2> "local path\logname.txt"
Not only did I figure out adding > filename to the end of the command, but I also figured out that when saving this as a batch file, it won't run as a scheduled task in Windows Server 2008 r2, or Windows 7, if it contains drive mappings. UNC paths are required.
Thanks!
Matthew
this perfectly worked for me
aws cloudformation describe-stack-events --stack-name "stack name" --debug 2> "C:\Users\ravi\Desktop\CICDWORKFolder\RedshiftFolder\logname.txt"

amazon emr spark submission from S3 not working

I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. Only thing is I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: IT gets further along when I manually create the python file after SSH'ing into the cluster, and specifying as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But, not doing the job.
Any help appreciated. This is really frustrating as a well documented method on AWS's own blog here https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit does not work.
I will ask, just in case, you used your correct buckets and cluster ID-s?
But anyways, I had similar problems, like I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, then I think it worked. But I think I still needed something different, can't remember exactly, but I resorted to a solution like this:
Firstly, I use a bootstrap action where a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
and then I add a step to the cluster creation through the SDK in my Go application, but I can recollect this info and give you the CLI command like this:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt> s3://<mybucket>/<outputFolder>/]
I searched for days and finally discovered this thread which states
PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file system.
Until this is supported, the straightforward workaround then is to just
copy the files to your local machine.