We have spun up a Google Cloud Composer environment, but we only need it for testing purposes. Is there a way to pause the environment and only use it when needed?
I am unable to find a way to do it.
Please suggest any possible solutions to pause or disable it, rather than deleting it.
Thanks!
I tried to find a way to disable/pause the environment but could not find one.
You can't do that, but if you are using Cloud Composer 2, it runs on a GKE cluster in Autopilot mode.
Autopilot mode is cost-optimized when there are no DAG executions in the cluster.
If the environment is used for testing purposes, I recommend using the small environment size and a cheap, small configuration for the worker and web server (CPU, memory and storage), for example:
gcloud composer environments create example-environment \
--location us-central1 \
--image-version composer-2.0.31-airflow-2.2.5 \
--environment-size small \
--scheduler-count 1 \
--scheduler-cpu 0.5 \
--scheduler-memory 2.5 \
--scheduler-storage 2 \
--web-server-cpu 1 \
--web-server-memory 2.5 \
--web-server-storage 2 \
--worker-cpu 1 \
--worker-memory 2 \
--worker-storage 2 \
--min-workers 1 \
--max-workers 2
Check the documentation for the best sizing in your case.
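Since the Autopilot cluster only scales down when nothing is running, you can also make sure no DAGs execute while the environment sits idle. A minimal sketch, assuming an Airflow 2 environment named example-environment in us-central1 and a hypothetical DAG called example_dag:
# Pause a DAG so it stops scheduling runs (repeat per DAG, or pause them in the Airflow UI)
gcloud composer environments run example-environment \
--location us-central1 \
dags pause -- example_dag
Run the same command with dags unpause when you need the environment again.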
Related
I'm working on an auto DevOps workflow based only on the Dockerfile, using Cloud Build on GCP. When I try to use the following command, it seems that the --dockerfile-image flag is not used:
gcloud beta builds triggers create cloud-source-repositories \
--name="test-trigger-2" \
--repo="projects/nodrize-dev/repos/b722166a-56e0-46af-bd0d-42af8d37c570/bf11672f-34d5-4d8c-80cb-31120f39251a/quirino-backend" \
--branch-pattern="^master$" \
--dockerfile="Dockerfile" \
--dockerfile-dir="" \
--dockerfile-image="gcr.io/nodrize-dev/test-backend"
Created [https://cloudbuild.googleapis.com/v1/projects/nodrize-dev/triggers/896f8ac8-397c-464a-84f7-43e69f1bc6cb].
NAME CREATE_TIME STATUS
test-trigger-2 2021-06-02T21:06:54+00:00
I want to create the trigger now and run it later, but the last flag isn't working. I assume it is using the default or a fallback, because as you can see the image name is:
gcr.io/nodrize-dev/b722166a-56e0-46af-bd0d-42af8d37c570/bf11672f-34d5-4d8c-80cb-31120f39251a/quirino-backend:$COMMIT_SHA:
Docker image name in the GCP console:
I hope someone can help me or at least knows what is happening.
This works for me.
I suspect that either the trigger is incorrect, or it is not being triggered, and/or the image is not the one generated by the trigger.
PROJECT=...
REPO=...

# Create the source repository
gcloud source repos create ${REPO} \
--project=${PROJECT}

# Create the trigger
gcloud beta builds triggers create cloud-source-repositories \
--name="trigger" \
--project=${PROJECT} \
--repo=${REPO} \
--branch-pattern="^master$" \
--dockerfile="Dockerfile" \
--dockerfile-dir="." \
--dockerfile-image="gcr.io/${PROJECT}/freddie-01"

NAME     CREATE_TIME                STATUS
trigger  2021-06-03T15:24:27+00:00

# Push to fire the trigger
git push google master

# List the images produced by the resulting build
gcloud builds list \
--project=${PROJECT} \
--format="value(images)"

gcr.io/${PROJECT}/freddie-01:7dcf74e126af711d24bb2b652d86f0d28bbe3bd9

gcloud container images list \
--project=${PROJECT}

NAME
gcr.io/${PROJECT}/freddie-01
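If you want to double-check what an existing trigger is configured to build, you could also describe it; for example, with the trigger name and project from your command:
gcloud beta builds triggers describe test-trigger-2 \
--project=nodrize-dev
The output should include the Dockerfile image name the trigger will apply, so you can confirm whether --dockerfile-image was actually picked up.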
I found this article about running a Dataflow batch job on preemptible machines.
I tried to use this feature with this script:
gcloud beta dataflow jobs run $JOB_NAME \
--gcs-location gs://.../Datastore_to_Datastore_Delete \
--flexRSGoal=COST_OPTIMIZED \
--region ...1 \
--staging-location gs://.../temp \
--network XXX \
--subnetwork regions/...1/subnetworks/... \
--max-workers 1 \
--parameters \
datastoreReadGqlQuery="$QUERY",\
datastoreReadProjectId=$PROJECTID,\
datastoreDeleteProjectId=$PROJECTID
But this is the result:
ERROR: (gcloud.beta.dataflow.jobs.run) unrecognized arguments:
--flexRSGoal=COST_OPTIMIZED
To search the help text of gcloud commands, run: gcloud help -- SEARCH_TERMS
I ran gcloud beta dataflow jobs run --help and it seems the flexRSGoal option is not there...
# gcloud version
Google Cloud SDK 319.0.0
alpha 2020.11.13
beta 2020.11.13
bq 2.0.62
core 2020.11.13
gsutil 4.55
kubectl 1.16.13
What am I missing?
Have you followed this? It seems that the correct flag should be:
--flexrs_goal=COST_OPTIMIZED
It seems the --flexrs_goal flag [1] is not intended for the gcloud beta dataflow jobs run command, but for the Java/Python pipeline launchers, for example the python3 -m ... commands in [2] (reading that doc in full is recommended).
So instead of using:
gcloud beta dataflow jobs run <job_name>
--flexRSGoal=COST_OPTIMIZED ...
Run:
python3 <my-pipeline-script.py> \
--flexrs_goal=COST_OPTIMIZED ...
If you prefer to use Java, just switch the --flexrs_goal flag to --flexRSGoal and follow [3] instead of [2].
[1] https://cloud.google.com/dataflow/docs/guides/flexrs#python
[2] https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python#run-wordcount-on-the-dataflow-service
[3] https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven
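For illustration, a minimal sketch of launching the wordcount example from [2] with FlexRS enabled; the project, region and bucket values are placeholders to replace with your own:
python3 -m apache_beam.examples.wordcount \
--runner DataflowRunner \
--project my-project \
--region us-central1 \
--temp_location gs://my-bucket/temp \
--output gs://my-bucket/results/output \
--flexrs_goal=COST_OPTIMIZED
Note that FlexRS is only available in a subset of regions, so check [1] for the regions that currently support it.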
I am new to Dataproc and PySpark. I created a cluster with the below configuration:
gcloud beta dataproc clusters create $CLUSTER_NAME \
--zone $ZONE \
--region $REGION \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--num-workers 3 \
--bucket $GCS_BUCKET \
--image-version 1.4-ubuntu18 \
--optional-components=ANACONDA,JUPYTER \
--subnet=default \
--enable-component-gateway \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--properties ${PROPERTIES}
Here are the property settings I am currently using, based on what I found on the internet.
PROPERTIES="\
spark:spark.executor.cores=2,\
spark:spark.executor.memory=8g,\
spark:spark.executor.memoryOverhead=2g,\
spark:spark.driver.memory=6g,\
spark:spark.driver.maxResultSize=6g,\
spark:spark.kryoserializer.buffer=128m,\
spark:spark.kryoserializer.buffer.max=1024m,\
spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark:spark.default.parallelism=512,\
spark:spark.rdd.compress=true,\
spark:spark.network.timeout=10000000,\
spark:spark.executor.heartbeatInterval=10000000,\
spark:spark.rpc.message.maxSize=256,\
spark:spark.io.compression.codec=snappy,\
spark:spark.shuffle.service.enabled=true,\
spark:spark.sql.shuffle.partitions=256,\
spark:spark.sql.files.ignoreCorruptFiles=true,\
yarn:yarn.nodemanager.resource.cpu-vcores=8,\
yarn:yarn.scheduler.minimum-allocation-vcores=2,\
yarn:yarn.scheduler.maximum-allocation-vcores=4,\
yarn:yarn.nodemanager.vmem-check-enabled=false,\
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
"
I want to understand whether this is the right property setting for my cluster and, if not, how to assign the most suitable values to these properties (especially the cores, memory and memoryOverhead) so that my PySpark jobs run as efficiently as possible. I am also asking because I am facing
this error: Container exited with a non-zero exit code 143. Killed by external signal.
It is important here to understand the configuration and limitations of the machines you are using, and how memory is allocated to spark components.
n1-standard-4 is a 4 core machine with 15GB RAM. By default, 80% of a machine's memory is allocated to YARN Node Manager. Since you are not setting it explicitly, in this case it will be 12GB.
Spark Executor and Driver run in the containers allocated by YARN.
The total memory allocated to a Spark executor is the sum of spark.executor.memory and spark.executor.memoryOverhead, which in this case is 10GB. I would advise allocating more memory to the executor than to the memoryOverhead, as the former is used for running tasks and the latter for special purposes. By default, spark.executor.memoryOverhead is max(384MB, 0.10 * executor memory).
In this case, you can have only one executor per machine (10GB per executor and 15GB machine capacity). With this configuration you are also underutilizing the cores, because each executor uses only 2 of them. It is advised to leave 1 core per machine for other OS processes, so it might help to change executor.cores to 3 here.
In general it is recommended to use default memory configurations, unless you have a very good understanding of all the properties you are modifying. Based on the performance of your application under default settings, you may tweak other properties. Also consider changing to a different machine type based on the memory requirements of your application.
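For illustration only, a sketch of how that advice could look if you keep most memory settings at their defaults; the exact values here are assumptions, not a tuned configuration:
# Hypothetical revision: 3 cores per executor, an 8g heap, memoryOverhead left at its
# default of max(384MB, 0.10 * executor memory), everything else at Dataproc defaults.
PROPERTIES="\
spark:spark.executor.cores=3,\
spark:spark.executor.memory=8g,\
spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark:spark.sql.shuffle.partitions=256"
Based on how the job behaves with something close to the defaults, you can then tune individual properties one at a time.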
References -
1. https://mapr.com/blog/resource-allocation-configuration-spark-yarn/
2. https://sujithjay.com/spark/with-yarn
This is an unsolved part from another post. I am trying to submit a Google Cloud job that trains a CNN model on the MNIST digits.
Here is my setup: Windows 10, Anaconda, Jupyter Notebook 6, Python 3.6, TF 1.13.0.
I use the gcloud command for local training. The second cell seems stuck at the [*] status and shows nothing until I close and halt the ipynb file. The training starts right after that, and the results are correct as I monitored them on TensorBoard.
I can make it run in a terminal without this issue. I also submitted the job to the cloud and it finished successfully.
Any thoughts on the local training problem? The code is here:
import shutil

OUTDIR = 'trained_test'
INPDIR = r'..\data'  # Windows-style relative path to the input data
shutil.rmtree(path=OUTDIR, ignore_errors=True)
!gcloud ai-platform local train \
--module-name=trainer.task \
--package-path=trainer \
-- \
--output_dir=$OUTDIR \
--input_dir=$INPDIR \
--epochs=2 \
--learning_rate=0.001 \
--batch_size=100
I am new to Google Cloud and was told to use Variant Transforms in order to get .vcf files into BigQuery. I did everything specified in the Variant Transforms README and copied and pasted the first block of code into a bash file:
#!/bin/bash
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
TEMP_LOCATION=gs://BUCKET/temp
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
--project ${GOOGLE_CLOUD_PROJECT} \
--input_pattern ${INPUT_PATTERN} \
--output_table ${OUTPUT_TABLE} \
--temp_location ${TEMP_LOCATION} \
--job_name vcf-to-bigquery \
--runner DataflowRunner"
gcloud alpha genomics pipelines run \
--project "${GOOGLE_CLOUD_PROJECT}" \
--logging "${TEMP_LOCATION}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
--zones us-west1-b \
--service-account-scopes https://www.googleapis.com/auth/cloud-platform \
--docker-image gcr.io/gcp-variant-transforms/gcp-variant-transforms \
--command-line "${COMMAND}"
I tried to run this, replacing the parameters appropriately, and got this error:
ERROR: (gcloud.alpha.genomics.pipelines.run) INVALID_ARGUMENT: Error: validating pipeline: zones and regions cannot be specified together
Since then I have tried specifying the region and zone on separate lines and have even changed the default region and zone. I have even tried example pipelines from Google themselves, and they still result in the same error. Am I doing something wrong, or is there something more I need to install for this to work?
You need to use the --regions flag first and only at the end the --zones flag. As a workaround, you can set the default zone and region in your local gcloud client. Also keep in mind that the region is us-west1 and the zone suffix is just b.
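A minimal sketch of the two options, using the us-west1-b zone from your command. Either set the defaults in your local client:
gcloud config set compute/region us-west1
gcloud config set compute/zone us-west1-b
or pass only --regions (instead of --zones) when running the pipeline, for example:
gcloud alpha genomics pipelines run \
--project "${GOOGLE_CLOUD_PROJECT}" \
--logging "${TEMP_LOCATION}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
--regions us-west1 \
--service-account-scopes https://www.googleapis.com/auth/cloud-platform \
--docker-image gcr.io/gcp-variant-transforms/gcp-variant-transforms \
--command-line "${COMMAND}"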