Unable to create environments on Google Cloud Composer

I tried to create a Google Cloud Composer environment, but on the setup page I get the following errors:
Service Error: Failed to load GKE machine types. Please leave the field
empty to apply default values or retry later.
Service Error: Failed to load regions. Please leave the field empty to
apply default values or retry later.
Service Error: Failed to load zones. Please leave the field empty to apply
default values or retry later.
Service Error: Failed to load service accounts. Please leave the field
empty to apply default values or retry later.
The only parameters GCP lets me change are the region and the number of nodes, but it still lets me create the environment. After 30 minutes the environment creation fails with the following error:
CREATE operation on this environment failed 1 day ago with the following error message:
Http error status code: 400
Http error message: BAD REQUEST
Errors in: [Web server]; Error messages:
Failed to deploy the Airflow web server. This might be a temporary issue. You can retry the operation later.
If the issue persists, it might be caused by problems with permissions or network configuration. For more information, see https://cloud.google.com/composer/docs/troubleshooting-environment-creation.
An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-07-20T14:31:23.047Z7050.xd.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
Got error "Another operation failed." during CP_DEPLOYMENT_CREATING_STANDARD []
Is it a problem with permissions? If so, what permissions do I need? Thank you!

This looks more like a temporary issue than a permissions problem:
The first set of errors says the console could not load the metadata (the lists of regions, zones, machine types and service accounts); you do not get an explicit PERMISSION_DENIED error.
The second error itself suggests the same thing: "This might be a temporary issue. You can retry the operation later."
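If you want to rule out an obvious setup problem before retrying, a quick check from the CLI is a reasonable sketch (the environment name and location below are placeholders; adjust them to your project):
# Confirm the Cloud Composer API is enabled in the project.
gcloud services list --enabled --filter="name:composer.googleapis.com"
# Retry the creation from the CLI with default values instead of the console form.
gcloud composer environments create example-environment \
    --location=us-central1
If the CLI attempt fails with an explicit permission error, that message should name the missing permission; otherwise retrying later, as suggested above, is usually enough.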

Related

Can't access Elastic Beanstalk Configuration: A problem occurred while loading your page: Configuration validation exception: Invalid option value

My application is running on Elastic Beanstalk AL2 with Docker. It is still up and running; this issue is not user-facing and is internal to AWS only.
I upgraded to AL2 about 7 months ago and there were no problems. Recently I logged in to the Elastic Beanstalk console to look into upgrading the platform. When I clicked on "Configuration" I got an error and was redirected back to the application list.
The error says:
Error

A problem occurred while loading your page: Configuration validation exception: Invalid option value: 'awseb-e-XXXXXXXX-stack-AWSEBSecurityGroup-XXXXXXXX' (Namespace: 'aws:autoscaling:launchconfiguration', OptionName: 'SecurityGroups'): The security group 'awseb-e-XXXXXXXX-stack-AWSEBSecurityGroup-XXXXXXXX' does not exist
The error before this one actually referenced a launch template, so I created a version 2 of the launch template without the failing security group (as indicated by the UI); the error message then changed to the one shown above.
I tried redeploying, thinking the new settings needed to take effect. That didn't work; the error is still present.
I tried creating a new, unrelated environment from scratch, but it shows the same error message when I click on Configuration.
I tried cloning the production environment but am blocked by the same error message.
I also posted this question here: https://repost.aws/questions/QU4VTSJDyCT7OtIIgPCblUPA/elastic-beanstalk-unable-to-access-environments-configuration-error-a-problem-occurred-while-loading-your-page-configuration-validation-exception
I tried the suggestions there, but none of them worked.
Any ideas on how to fix or debug this further?
Thank you
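One way to debug further might be to dump the environment's saved option settings from the CLI and search them for the stale security group, then compare against the groups that actually exist (a rough sketch; the application, environment, and group names are placeholders):
# Dump the environment's saved configuration and look for the old security group.
aws elasticbeanstalk describe-configuration-settings \
    --application-name my-app --environment-name my-env \
    --output json | grep -n "awseb-e-XXXXXXXX-stack-AWSEBSecurityGroup"
# List the security groups that actually exist in the account.
aws ec2 describe-security-groups \
    --query "SecurityGroups[].{Id:GroupId,Name:GroupName}" --output table
Whatever option setting still references the missing group is likely the one that needs to be updated or removed.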

AWS EKS Returns Error 'certificate has expired or is not yet valid'

When I create new deployments or edit any settings, it returns the following error:
Error creating: Internal error occurred: failed calling webhook
"mpod.elbv2.k8s.aws": Post
"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s":
x509: certificate has expired or is not yet valid: current time
2022-01-28T02:05:13Z is after 2022-01-20T10:00:30Z
How can I fix it?
I think the reason is that your time and date are not right. As I can see in the log, your time is 8 days behind the current date.
Please sync the time on this server and try again.
You need a new certificate for aws-load-balancer-webhook-service. We have an issuer set up in the cluster, and when we get a similar error with OPA we do a rollout restart for OPA.
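A minimal sketch of that rollout-restart approach for the AWS Load Balancer Controller, assuming it runs in kube-system with the default Helm chart deployment and secret names (adjust these to your install):
# Check when the webhook's serving certificate actually expires.
kubectl -n kube-system get secret aws-load-balancer-tls \
    -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# Restart the controller so it picks up a fresh certificate.
kubectl -n kube-system rollout restart deployment aws-load-balancer-controller
kubectl -n kube-system rollout status deployment aws-load-balancer-controller
If the certificate lives in a Secret created once at install time, a restart alone may not renew it; in that case the certificate needs to be re-issued (for example via your issuer or a helm upgrade) before restarting the controller.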

Google Cloud Django App Deployment - Permission Issues

I'm following this tutorial, yet I get stuck at the very end when I try to deploy the app on App Engine.
I get the following error message:
Updating service [default] (this may take several minutes)...failed.
ERROR: (gcloud.app.deploy) Error Response: [13] Flex operation projects/responder-289707/regions/europe-west6/operations/a0e5f3f4-29a7-49d8-98b5-4a52b7bf04ca error [INTERNAL]: An internal error occurred while processing task /app-engine-flex/insert_flex_deployment/flex_create_resources>2020-09-21T20:32:48.366Z12808.hy.0: Deployment Manager operation responder-289707/operation-1600720369987-5afd8c109adf5-6a4ad9a9-e71b9336 errors: [code: "RESOURCE_ERROR"
location: "/deployments/aef-default-20200921t223056/resources/aef-default-20200921t223056"
message: "{\"ResourceType\":\"compute.beta.regionAutoscaler\",\"ResourceErrorCode\":\"403\",\"ResourceErrorMessage\":{\"code\":403,\"message\":\"The caller does not have permission\",\"status\":\"PERMISSION_DENIED\",\"statusMessage\":\"Forbidden\",\"requestPath\":\"https://compute.googleapis.com/compute/beta/projects/responder-289707/regions/europe-west6/autoscalers\",\"httpMethod\":\"POST\"}}"
I don't really understand why, though. I have authenticated my gcloud, made sure my account has App Engine Admin/Deployment rights, and have everything in place.
Any hints would be much appreciated.
You apparently do not have the rights for autoscaling resources. This could be due to a free account, or because you need different rights to deploy an autoscaling service (other than App Engine Admin/Deployment).
Seeing as you're doing the tutorial, you could define a static resource amount; this is safer for your wallet as well.
app.yaml
# add this
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 2
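If you also want to confirm which roles your account actually has on the project, a gcloud check along these lines should work (PROJECT_ID and the email address are placeholders):
# List the roles granted to your user on the project.
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --format="table(bindings.role)" \
    --filter="bindings.members:user:you@example.com"
If the roles look right and the 403 persists, the problem may be with the service account App Engine Flex deploys with rather than with your own user.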

BigQuery unable to insert job. Workflow failed

I need to run a batch job from GCS to BigQuery via Dataflow and Beam. All my files are Avro with the same schema.
I've created a Dataflow Java application that is successful on a smaller set of data (~1 GB, about 5 files).
But when I try to run it on a bigger set of data (>500 GB, >1000 files), I receive an error message:
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix 1b83679a4f5d48c5b45ff20b2b822728_6e48345728d4da6cb51353f0dc550c1b_00001_00000, reached max retries: 3, last failed load job: ...
After 3 retries it terminates with:
Workflow failed. Causes: S57....... A work item was attempted 4 times without success....
This step is the load to BigQuery.
Stackdriver says the processing is stuck in step .... for 10m00s ... and
Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes.....
I looked up the 409 error code, which states that I might have an existing job, dataset, or table. I've removed all the tables and re-run the application, but it still shows the same error message.
I am currently limited to 65 workers, and I have them using n1-standard-4 machines.
I believe there are other ways to move the data from GCS to BigQuery, but I need to demonstrate Dataflow.
"java.lang.RuntimeException: Failed to create job with prefix beam_load_csvtobigqueryxxxxxxxxxxxxxx, reached max retries: 3, last failed job: null.
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:198)..... "
One possible cause could be a privilege issue. Ensure that the account which interacts with BigQuery has the "bigquery.jobs.create" permission, which is included in the predefined role "BigQuery User".
Posting the comment of @DeaconDesperado as community wiki: they experienced the same error, and what they did was remove from the table name any characters outside the allowed set (Unicode letters, marks, numbers, connectors, dashes or spaces), after which the error was gone.
I got the same problem using "roles/bigquery.jobUser", "roles/bigquery.dataViewer", and "roles/bigquery.user". But only when granting "roles/bigquery.admin" did the issue get resolved.
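For the permission angle, a minimal sketch of granting a BigQuery role to the service account the Dataflow workers run as (by default the Compute Engine default service account; PROJECT_ID and PROJECT_NUMBER are placeholders):
# Grant BigQuery rights to the service account used by the Dataflow workers.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/bigquery.admin"
roles/bigquery.admin is broad; per the earlier answer, "BigQuery User" (roles/bigquery.user) may already be enough if the job-creation permission is the only thing missing.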

Amazon SageMaker throwing error when building your own algorithm container, at deployment time?

I am trying to run my own algorithm container in Amazon SageMaker. At deployment time, I am getting an error like the one below.
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)
ValueError: Error hosting endpoint decision-trees-sample-2018-03-01-09-59-06-832: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.
Then I ran the same line of code, and this time I got the error below.
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)
ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: Cannot create already existing endpoint "arn:aws:sagemaker:us-east-1:69759707XXxXX:endpoint/decision-trees-sample-2018-03-01-09-59-06-832".
Check out this issue: https://github.com/awslabs/amazon-sagemaker-examples/issues/210
@djarpin wrote:
The ping health check message is a general error that can be caused by several different issues. Typically the error message in the CloudWatch log group named /aws/sagemaker/Endpoints/ will provide a more detailed description of why the ping health check didn't pass.
Hope that helps!
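A rough sketch of both follow-ups from the CLI (assuming AWS CLI v2; the endpoint name is the one from the error above, and the region is taken from the ARN):
# Read the detailed container logs behind the failed ping health check.
aws logs tail "/aws/sagemaker/Endpoints/decision-trees-sample-2018-03-01-09-59-06-832" \
    --region us-east-1 --since 1h
# Delete the half-created endpoint so the next tree.deploy(...) call does not
# fail with "Cannot create already existing endpoint".
aws sagemaker delete-endpoint \
    --endpoint-name decision-trees-sample-2018-03-01-09-59-06-832 --region us-east-1
Once the container issue shown in the logs is fixed, re-running the deploy call should create the endpoint cleanly.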