I'm trying to package a PySpark job with PEX to run on Google Cloud Dataproc, but I'm getting a Permission Denied error.
I've packaged my third-party and local dependencies into env.pex, and an entrypoint that uses those dependencies into main.py. I then gsutil cp those two files up to gs://<PATH> and run the script below.
from google.cloud import dataproc_v1 as dataproc
from google.cloud import storage


def submit_job(project_id: str, region: str, cluster_name: str):
    job_client = dataproc.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = job_client.submit_job_as_operation(
        request={
            "project_id": project_id,
            "region": region,
            "job": {
                "placement": {"cluster_name": cluster_name},
                "pyspark_job": {
                    "main_python_file_uri": "gs://<PATH>/main.py",
                    "file_uris": ["gs://<PATH>/env.pex"],
                    "properties": {
                        "spark.pyspark.python": "./env.pex",
                        "spark.executorEnv.PEX_ROOT": "./.pex",
                    },
                },
            },
        }
    )
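For reference, submit_job_as_operation returns a long-running operation; a minimal way to block until the job finishes (and surface the failure below) is to wait on it, e.g.:

# Hypothetical continuation of submit_job: wait for the Dataproc job to finish.
# operation.result() blocks and raises if the job fails.
job = operation.result()
print("Job finished:", job.reference.job_id)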
The error I get is
Exception in thread "main" java.io.IOException: Cannot run program "./env.pex": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 14 more
Should I expect packaging my environment like this to work? I don't see a way to change the permissions of files included as file_uris in the PySpark job config, and I don't see any documentation on Google Cloud about packaging with PEX, but the official PySpark docs include this guide.
Any help is appreciated - thanks!
You can always run a PEX file using a compatible interpreter. So instead of specifying ./env.pex as the program, you could try python env.pex; that does not require env.pex to be executable.
I wasn't able to run the pex directly in the end, but I did get a workaround working for now, which was suggested by a user in the Pants Slack community (thanks!).
The workaround is to unpack the pex as a venv in a cluster initialization script.
The initialization script, copied with gsutil to gs://<PATH TO INIT SCRIPT>:
#!/bin/bash
set -exo pipefail

readonly PEX_ENV_FILE_URI=$(/usr/share/google/get_metadata_value attributes/PEX_ENV_FILE_URI || true)
readonly PEX_FILES_DIR="/pexfiles"
readonly PEX_ENV_DIR="/pexenvs"

function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $*" >&2
  exit 1
}

function install_pex_into_venv() {
  local -r pex_name=${PEX_ENV_FILE_URI##*/}
  local -r pex_file="${PEX_FILES_DIR}/${pex_name}"
  local -r pex_venv="${PEX_ENV_DIR}/${pex_name}"

  echo "Installing pex from ${pex_file} into venv ${pex_venv}..."
  gsutil cp "${PEX_ENV_FILE_URI}" "${pex_file}"
  PEX_TOOLS=1 python "${pex_file}" venv --compile "${pex_venv}"
}

function main() {
  if [[ -z "${PEX_ENV_FILE_URI}" ]]; then
    err "ERROR: Must specify PEX_ENV_FILE_URI metadata key"
  fi
  install_pex_into_venv
}

main
To start the cluster and run the initialization script to unpack the pex into a venv on the cluster:
from google.cloud import dataproc_v1 as dataproc


def start_cluster(project_id: str, region: str, cluster_name: str):
    cluster_client = dataproc.ClusterControllerClient(...)
    operation = cluster_client.create_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster": {
                "project_id": project_id,
                "cluster_name": cluster_name,
                "config": {
                    "master_config": <CONFIG>,
                    "worker_config": <CONFIG>,
                    "initialization_actions": [
                        {
                            "executable_file": "gs://<PATH TO INIT SCRIPT>",
                        },
                    ],
                    "gce_cluster_config": {
                        "metadata": {"PEX_ENV_FILE_URI": "gs://<PATH>/env.pex"},
                    },
                },
            },
        }
    )
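As with the job submission, create_cluster returns a long-running operation; if you want to wait until the cluster (including the initialization action) is up before submitting work, a minimal sketch is:

# Hypothetical continuation of start_cluster: block until the cluster
# (including the initialization action) has been created.
cluster = operation.result()
print("Cluster created:", cluster.cluster_name)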
To start the job and use the unpacked pex venv to run the pyspark job:
def submit_job(project_id: str, region: str, cluster_name: str):
    job_client = dataproc.JobControllerClient(...)
    operation = job_client.submit_job_as_operation(
        request={
            "project_id": project_id,
            "region": region,
            "job": {
                "placement": {"cluster_name": cluster_name},
                "pyspark_job": {
                    "main_python_file_uri": "gs://<PATH>/main.py",
                    "properties": {
                        "spark.pyspark.python": "/pexenvs/env.pex/bin/python",
                    },
                },
            },
        }
    )
Following @megabits' answer, here is the bash-based workflow that works for me.
Copy the init script (from that answer) to GCS as gs://BUCKET/pkg/cluster-env-init.bash.
Build the PEX with the --include-tools argument, which is required by the initialization script, e.g.
pex --include-tools -r requirements.txt -o env.pex
Put the PEX file on GCS:
gsutil mv env.pex "gs://BUCKET/pkg/env.pex"
Create the cluster, using the PEX file to set up the environment:
gcloud dataproc clusters create your-cluster --region us-central1 \
--initialization-actions="gs://BUCKET/pkg/cluster-env-init.bash" \
--metadata "PEX_ENV_FILE_URI=gs://BUCKET/pkg/env.pex"
Run the job:
gcloud dataproc jobs submit pyspark your-script.py \
--cluster=your-cluster --region us-central1 \
--properties spark.pyspark.python="/pexenvs/env.pex/bin/python"
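A quick way to confirm the job really runs inside the unpacked venv is a tiny sanity-check script that prints the interpreter path on the driver and the executors (a sketch; your-script.py is whatever you submit):

# your-script.py - sanity check that spark.pyspark.python is picked up
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("Driver python:", sys.executable)

# Each executor worker should also report /pexenvs/env.pex/bin/python
executor_pythons = (
    spark.sparkContext.parallelize(range(4))
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print("Executor python(s):", executor_pythons)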
Related
Looking for a way to pause a GitLab job and resume it based on AWS Lambda input.
Due to restrictive permissions in my organization, below is my current CI workflow:
In the diagram above, on a push event the Lambda triggers a GitLab job through a webhook. The GitLab job only gets the latest code, zips the files, and copies the archive to a certain S3 bucket. After the GitLab job is finished, the same Lambda triggers a CodeBuild build, which gets the latest zip file from that S3 bucket, creates the UI chunk files, and pushes the artifacts to a different S3 bucket.
### gitlab-ci.yml ###
variables:
  environment: <aws-account-number>

stages:
  - get-latest-code

get-latest-code:
  stage: get-latest-code
  script:
    - zip -r project.zip $(pwd)/*
    - export PATH=$PATH:/tmp/project/.local/bin
    - pip install awscli
    - aws s3 cp $(pwd)/project.zip s3://project-input-bucket-dev
  rules:
    - if: ('$CI_PIPELINE_SOURCE == "merge_request_event"' || '$CI_PIPELINE_SOURCE == "push"')
### lambda code ###
def runner_lambda_handler(event, context):
    cb = boto3.client('codebuild')
    builds_dir = os.environ.get('BUILDS_DIR', '/tmp/project/builds')

    logger.debug("STARTING GIT LAB RUNNER")
    gitlab_runner_cmd = f"gitlab-runner --debug run-single -u https://git.company.com/ -t {token} " \
                        f"--builds-dir {builds_dir} --max-builds 1 " \
                        f"--wait-timeout 900 --executor shell"
    s3_libraries_loader.subprocess_cmd(gitlab_runner_cmd)

    cb.start_build(projectName='PROJECT-Deploy-dev')
    return {
        "statusCode": 200,
        "body": "Gitlab build success."
    }
#### Codebuild Stack ####
codebuild.Project(self, f"Project-{env_name}",
    project_name=f"Project-{env_name}",
    role=codebuild_role,
    environment_variables={
        "INPUT_S3_ARTIFACTS_BUCKET": {
            "value": input_bucket.bucket_name
        },
        "INPUT_S3_ARTIFACTS_OBJECT": {
            "value": "project.zip"
        },
        "OUTPUT_S3_ARTIFACTS_BUCKET": {
            "value": output_bucket.bucket_name
        },
        "PROJECT_NAME": {
            "value": f"Project-{env_name}"
        }
    },
    cache=codebuild.Cache.bucket(input_bucket),
    environment=codebuild.BuildEnvironment(
        build_image=codebuild.LinuxBuildImage.STANDARD_5_0
    ),
    vpc=vpc,
    security_groups=[codebuild_sg],
    artifacts=codebuild.Artifacts.s3(
        bucket=output_bucket,
        include_build_id=False,
        package_zip=False,
        encryption=False
    ),
    build_spec=codebuild.BuildSpec.from_object({
        "version": "0.2",
        "cache": {
            "paths": ['/root/.m2/**/*', '/root/.npm/**/*', 'build/**/*', '*/project/node_modules/**/*']
        },
        "phases": {
            "install": {
                "runtime-versions": {
                    "nodejs": "14.x"
                },
                "commands": [
                    "aws s3 cp --quiet s3://$INPUT_S3_ARTIFACTS_BUCKET/$INPUT_S3_ARTIFACTS_OBJECT .",
                    "unzip $INPUT_S3_ARTIFACTS_OBJECT",
                    "cd project",
                    "export SASS_BINARY_DIR=$(pwd)",
                    "npm cache verify",
                    "npm install",
                ]
            },
            "build": {
                "commands": [
                    "npm run build"
                ]
            },
            "post_build": {
                "commands": [
                    "echo Clearing s3 bucket folder",
                    "aws s3 rm --recursive s3://$OUTPUT_S3_ARTIFACTS_BUCKET/$PROJECT_NAME"
                ]
            }
        },
        "artifacts": {
            "files": [
                "**/*.html",
                "**/*.js",
                "**/*.css",
                "**/*.ico",
                "**/*.woff",
                "**/*.woff2",
                "**/*.svg",
                "**/*.png"
            ],
            "discard-paths": "yes",
            "base-directory": "$(pwd)/dist/proj"
        }
    })
)
What's needed:
Currently there is a disconnect between the GitLab job and the CodeBuild job. I'm looking for a way to PAUSE the GitLab job after all its steps are executed; later, on successful completion of the CodeBuild job, I can resume the same GitLab job and mark it as done.
Thanks in advance.
You need to change GitLab's webhook update part; from there, with curl, you can check the status of the other parts.
If their status is not finished, pause the git push / push event.
You'll need to implement your own logic to wait for the CodeBuild job to finish; you can use batch-get-builds to check the status.
Check out this example; you can have something similar in your GitLab job, waiting for the CodeBuild job to finish.
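A minimal sketch of that polling with boto3 (the build id would come from the start_build call shown in the question; names are placeholders):

import time

import boto3

codebuild = boto3.client("codebuild")


def wait_for_build(build_id: str, poll_seconds: int = 30) -> str:
    """Poll batch-get-builds until the build leaves IN_PROGRESS."""
    while True:
        build = codebuild.batch_get_builds(ids=[build_id])["builds"][0]
        status = build["buildStatus"]  # SUCCEEDED, FAILED, STOPPED, IN_PROGRESS, ...
        if status != "IN_PROGRESS":
            return status
        time.sleep(poll_seconds)


# e.g. in the Lambda (or in a GitLab job step that has boto3 available):
# build_id = cb.start_build(projectName="PROJECT-Deploy-dev")["build"]["id"]
# final_status = wait_for_build(build_id)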
You can create a manual job in your pipeline to serve as a pause. CodeBuild can call the GitLab API to run the manual job, allowing the pipeline to continue. That would probably be the most resource-efficient way to handle this scenario.
However, you won't be able to resume the same job with this method.
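For example, on success CodeBuild could trigger the manual job through GitLab's jobs "play" endpoint (host, project id, job id, and token below are placeholders):

import requests

GITLAB_URL = "https://git.company.com"  # placeholder
PROJECT_ID = 123                        # placeholder: numeric project id
MANUAL_JOB_ID = 456                     # placeholder: id of the manual "resume" job

# POST /projects/:id/jobs/:job_id/play starts a manual job.
resp = requests.post(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs/{MANUAL_JOB_ID}/play",
    headers={"PRIVATE-TOKEN": "<token>"},
)
resp.raise_for_status()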
There is no mechanism to ‘pause’ a job, but you can implement a polling mechanism as suggested in another answer if you really need to keep things in the same job for some reason. However, you will end up consuming more minutes (if using GitLab.com shared runners) or system resources than needed. You may also want to consider your job timeouts if the process takes a long time.
After deploying my server on Cloud Run, calling the endpoint with BloomRPC or other clients, the service returns "error": "2 UNKNOWN: No status received". Calling the server locally with PORT=7000 node index.js on localhost:7000 works fine. I guess Cloud Run is adding some TLS magic somewhere and maybe resets the headers but I have no idea how to fix that.
Here's my code:
const grpc = require("@grpc/grpc-js");
const protoLoader = require("@grpc/proto-loader");

const reportGenProto = protoLoader.loadSync("reportGen.proto");
const packageObject = grpc.loadPackageDefinition(reportGenProto);

const server = new grpc.Server();

server.addService(packageObject.ReportGeneratorService.service, {
  Ping: async function (call, callback) {
    console.log("PONG");
    callback(null, {
      pong: "pong"
    });
  },
});

server.bindAsync(`0.0.0.0:${process.env.PORT}`, grpc.ServerCredentials.createInsecure(), (err) => {
  if (err) throw err;
  console.log('Server running on port', process.env.PORT);
  server.start();
});
And this is the .proto:
syntax = "proto3";

service ReportGeneratorService {
  rpc Ping (Empty) returns (Pong) {}
}

message Pong {
  string pong = 1;
}

message Empty {}
Cloud Run logs don't say anything other than that console.log("PONG"). I've tried enabling HTTP/2 even though I'm not using streams but the same thing is returned.
I'm building and deploying inside a Docker container, as I need some dependencies for what I'll do later with the server. This is the Dockerfile:
# we just need the LibreOffice environment to convert the file, which we'll do later
FROM ideolys/carbone-env-docker
ENV DIR /app
WORKDIR ${DIR}
COPY . ${DIR}
RUN npm install
ENV TZ Europe/Rome
CMD [ "node", "index.js" ]
I should also add that using the deprecated grpc Node package works fine without changing anything in the configuration. The problem happens only when using the new @grpc/grpc-js package.
I repro'd your solution (code unchanged) and it works for me.
{
  "name": "72141720",
  "version": "0.0.1",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "@grpc/grpc-js": "^1.6.7",
    "@grpc/proto-loader": "^0.6.12"
  }
}
And:
BILLING=
PROJECT=
REGION=
SERVICE=
gcloud projects create ${PROJECT}
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}
for API in "artifactregistry" "cloudbuild" "run"
do
  gcloud services enable ${API}.googleapis.com \
  --project=${PROJECT}
done
# Deploy from source
gcloud run deploy ${SERVICE} \
--source=${PWD} \
--project=${PROJECT} \
--region=${REGION}
# Determine service endpoint
# Remove "https://"
# Add ":443"
ENDPOINT=$(\
gcloud run services describe ${SERVICE} \
--project=${PROJECT} \
--region=${REGION} \
--format="value(status.Url)") && \
ENDPOINT="${ENDPOINT#https://}:443"
# gRPCurl it
# https://github.com/fullstorydev/grpcurl
grpcurl \
-proto ./reportGen.proto \
${ENDPOINT} \
ReportGeneratorService/Ping
Yields:
{
  "pong": "pong"
}
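If you are calling the service from your own client rather than grpcurl, the key detail is that Cloud Run terminates TLS on port 443, so the client must use transport credentials (a plaintext channel will not work). A minimal Python sketch against the same service (hostname is a placeholder; the response is left as raw protobuf bytes to avoid generating stubs):

import grpc

# Placeholder Cloud Run hostname (the service URL without "https://").
target = "your-service-abc123-uc.a.run.app:443"

# Cloud Run terminates TLS, so use SSL channel credentials, not an insecure channel.
channel = grpc.secure_channel(target, grpc.ssl_channel_credentials())

# The proto has no package, so the full method name is /ReportGeneratorService/Ping.
# Empty serializes to zero bytes; the Pong response is kept as raw bytes here.
ping = channel.unary_unary(
    "/ReportGeneratorService/Ping",
    request_serializer=lambda _: b"",
    response_deserializer=lambda raw: raw,
)

print(ping(None))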
I am trying to run a script in the cfn-init command but it keeps timing out.
What am I doing wrong when running the startup-script.sh?
"WebServerInstance" : {
"Type" : "AWS::EC2::Instance",
"DependsOn" : "AttachGateway",
"Metadata" : {
"Comment" : "Install a simple application",
"AWS::CloudFormation::Init" : {
"config" : {
"files": {
"/home/ec2-user/startup_script.sh": {
"content": {
"Fn::Join": [
"",
[
"#!/bin/bash\n",
"aws s3 cp s3://server-assets/startserver.jar . --region=ap-northeast-1\n",
"aws s3 cp s3://server-assets/site-home-sprint2.jar . --region=ap-northeast-1\n",
"java -jar startserver.jar\n",
"java -jar site-home-sprint2.jar --spring.datasource.password=`< password.txt` --spring.datasource.username=`< username.txt` --spring.datasource.url=`<db_url.txt`\n"
]
]
},
"mode": "000755"
}
},
"commands": {
"start_server": {
"command": "./startup_script.sh",
"cwd": "~",
}
}
}
}
},
The file part works fine and it creates the file, but it times out when running the command.
What is the correct way of executing a shell script?
You can tail the logs in /var/log/cfn-init.log and detect the issues while running the script.
The commands in CloudFormation Init are run as the root user by default. The issue may be that your script resides in /home/ec2-user/ while you are trying to run it from '~' (i.e. /root).
Use the absolute path (/home/ec2-user) in cwd; that should resolve the issue.
The exact cause, however, can only be determined from the logs.
Usually the init scripts are executed by root unless specified otherwise. Can you try giving the full path when running your startup script? You can also give cloudkast a try; it is an online CloudFormation template generator that makes it easier to create objects such as AWS::CloudFormation::Init.
I'm trying to figure out how to properly add a Spark step to my AWS EMR cluster from the command line with the AWS CLI.
Some background:
I have a large dataset (thousands of .csv files) that I need to read in and analyze. I have a python script that looks something like:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3

# Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

df = sqlContext.read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("s3n://data_input/*csv")


def analysis(df):
    # do a bunch of stuff, create the output dataframe
    return df_output


df_output = analysis(df)
df_output.save_as_csv_to_s3_somehow
I want the output csv file to go to the directory s3://dataoutput/
Do I need to add the py file to a jar or something? What command do I use to run this analysis utilizing my cluster nodes, and how do I get the output to the correct directory? Thanks.
I launch the cluster using:
aws emr create-cluster --release-label emr-5.5.0\
--name PySpark_Analysis\
--applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Ganglia Name=Presto Name=Zeppelin\
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge\
--region us-west-2\
--log-uri s3://emr-logs-zerex/\
--configurations file://./zeppelin-env-config.json\
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-code/bootstraps/install_python_packages_custom.bash"
I usually use the --steps parameter of aws emr create-cluster, which can be specified like --steps file://mysteps.json. The file looks like this:
[
  {
    "Type": "Spark",
    "Name": "KB Spark Program",
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "Args": [
      "--verbose",
      "--packages",
      "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1,com.amazonaws:aws-java-sdk-s3:1.11.27,org.apache.hadoop:hadoop-aws:2.7.2,com.databricks:spark-csv_2.11:1.5.0",
      "/tmp/analysis_script.py"
    ]
  },
  {
    "Type": "Spark",
    "Name": "KB Spark Program",
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "Args": [
      "--verbose",
      "--packages",
      "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1,com.amazonaws:aws-java-sdk-s3:1.11.27,org.apache.hadoop:hadoop-aws:2.7.2,com.databricks:spark-csv_2.11:1.5.0",
      "/tmp/analysis_script_1.py"
    ]
  }
]
You can read more about steps here. I use the bootstrap script to load my code from S3 into /tmp and then specify the steps of execution in the file.
As for writing to S3, here is a link that explains that.
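For what it's worth, on emr-5.5.0 (Spark 2.1) the built-in CSV writer can write straight to S3; a minimal sketch with a placeholder output path:

# Write the result dataframe as CSV directly to S3 (output path is a placeholder).
df_output.write.mode("overwrite").option("header", "true").csv("s3://dataoutput/result/")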
I am unable to set environment variables for my Spark application. I am using AWS EMR to run a Spark application, which is more like a framework I wrote in Python on top of Spark to run multiple Spark jobs depending on which environment variables are present. So in order to start the exact job I want, I need to pass the environment variable into spark-submit. I tried several methods to do this, but none of them works: when I try to print the value of the environment variable inside the application, it returns empty.
To launch the cluster in EMR I am using the following AWS CLI command:
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark --ec2-attributes '{"KeyName":"<Key>","InstanceProfile":"<Profile>","SubnetId":"<Subnet-Id>","EmrManagedSlaveSecurityGroup":"<Group-Id>","EmrManagedMasterSecurityGroup":"<Group-Id>"}' --release-label emr-5.13.0 --log-uri 's3n://<bucket>/elasticmapreduce/' --bootstrap-action 'Path="s3://<bucket>/bootstrap.sh"' --steps file://./.envs/steps.json --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"c4.xlarge","Name":"Master"}]' --configurations file://./.envs/Production.json --ebs-root-volume-size 64 --service-role EMRRole --enable-debugging --name 'Application' --auto-terminate --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region <region>
Now Production.json looks like this:
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "FOO": "bar"
        }
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "2800m",
      "spark.driver.memory": "900m"
    }
  }
]
And steps.json like this:
[
  {
    "Name": "Job",
    "Args": [
      "--deploy-mode", "cluster",
      "--master", "yarn", "--py-files",
      "s3://<bucket>/code/dependencies.zip",
      "s3://<bucket>/code/__init__.py",
      "--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
      "--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
      "--conf", "spark.executorEnv.SHAPE=SQUARE"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "Spark"
  }
]
When I try to access the environment variable inside my __init__.py code, it simply prints empty. As you can see, I am running the step using Spark on YARN in cluster mode. I went through these links to get this far.
How do I set an environment variable in a YARN Spark job?
https://spark.apache.org/docs/latest/configuration.html#environment-variables
https://spark.apache.org/docs/latest/configuration.html#runtime-environment
Thanks for any help.
Use classification yarn-env to pass environment variables to the worker nodes.
Use classification spark-env to pass environment variables to the driver, with deploy mode client. When using deploy mode cluster, use yarn-env.
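For example, with the yarn-env export above and the spark.yarn.appMasterEnv / spark.executorEnv properties from steps.json, the variables would be read inside the application like this (a minimal sketch):

import os

# FOO comes from the yarn-env "export" classification;
# SHAPE comes from spark.yarn.appMasterEnv.* (driver in cluster mode)
# and spark.executorEnv.* (executors).
print("FOO =", os.environ.get("FOO"))
print("SHAPE =", os.environ.get("SHAPE"))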
(Dear moderator, if you want to delete the post, let me know why.)
To work with EMR clusters, I use AWS Lambda, creating a project that builds an EMR cluster when a flag is set in the condition.
Inside this project, we define the variables that you can set in the Lambda and then replace them with their values. To do this, we have to use the AWS API; the method to use is AWSSimpleSystemsManagement.getParameters.
Then, make a map like val parametersValues = parameterResult.getParameters.asScala.map(k => (k.getName, k.getValue)) to get tuples of name and value.
E.g.: ${BUCKET} = "s3://bucket-name/
This means that in your JSON you only have to write ${BUCKET} instead of the full path.
Once you have replaced the values, the step JSON can look like this:
[
  {
    "Name": "Job",
    "Args": [
      "--deploy-mode", "cluster",
      "--master", "yarn", "--py-files",
      "${BUCKET}/code/dependencies.zip",
      "${BUCKET}/code/__init__.py",
      "--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
      "--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
      "--conf", "spark.executorEnv.SHAPE=SQUARE"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "Spark"
  }
]
I hope this can help you to solve your problem.
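If it helps, a rough Python/boto3 equivalent of that lookup-and-substitute step (parameter names and file paths are placeholders):

import boto3

# Fetch the values from SSM Parameter Store, mirroring the Scala
# getParameters call above (parameter names are placeholders).
ssm = boto3.client("ssm")
resp = ssm.get_parameters(Names=["BUCKET"], WithDecryption=True)
values = {p["Name"]: p["Value"] for p in resp["Parameters"]}

# Substitute ${NAME} placeholders in the step template.
with open("steps.json") as f:
    template = f.read()
for name, value in values.items():
    template = template.replace("${" + name + "}", value)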