Cloud Build builds in worker pools stuck in Queued - google-cloud-platform

We are using Google Cloud Build as our CI/CD tool, and we use private pools to be able to connect to our database over private IPs.
Since 08/27 our builds that use private pools are stuck in Queued and are never executed, or fail due to timeout; they just hang there until we cancel them.
We have already tried the following without success:
Changing the worker pool to another region (from southamerica-east1 to us-central1);
Recreating the worker pool with different configurations;
Recreating all triggers and connections.
Removing the worker pool configuration (running the build in the global pool) allowed the build to execute.
cloudbuild.yaml:
steps:
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  id: Backup database
  args: ['gcloud', 'sql', 'backups', 'create', '--instance=${_DATABASE_INSTANCE_NAME}']
- name: 'node:14.17.4-slim'
  id: Migrate database
  entrypoint: npm
  dir: 'build'
  args: ['...']
  secretEnv: ['DATABASE_URL']
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  id: Migrate traffic to new version
  dir: 'build'
  entrypoint: bash
  args: ['-c', 'gcloud app services set-traffic ${_SERVICE_NAME} --splits ${_VERSION_NAME}=1']
availableSecrets:
  secretManager:
  - versionName: '${_DATABASE_URL_SECRET}'
    env: 'DATABASE_URL'
options:
  pool:
    name: 'projects/$PROJECT_ID/locations/southamerica-east1/workerPools/<project-id>'
our worker pool configuration:
$ gcloud builds worker-pools describe <worker-pool-id> --region=southamerica-east1 --project=<project-id>
createTime: '2021-08-30T19:35:57.833710523Z'
etag: W/"..."
name: <worker-pool-id>
privatePoolV1Config:
  networkConfig:
    egressOption: PUBLIC_EGRESS
    peeredNetwork: projects/<project-id>/global/networks/default
  workerConfig:
    diskSizeGb: '1000'
    machineType: e2-medium
state: RUNNING
uid: ...
updateTime: '2021-08-30T20:14:13.918712802Z'

This came out of my discussion with the Cloud Build PM last week... TL;DR: if you don't have a support subscription or a corporate account, you can't use it (for now).
In detail, if you check the first link in RJC's answer, you will see the following.
If you take a closer look, you can see that (with my personal account, even though I have an Organization structure) the Concurrent Builds per worker pool quota is set to 0. That is the reason your build job is queued indefinitely.
The most annoying part is this one: tick the checkbox on the Concurrent Builds per worker pool line and then click Edit to change the limit. Here is what you get.
Read carefully: set a limit between 0 and 0.
Therefore, if you don't have a support subscription (like me), you can't use the feature with your personal account. I was able to use it with my corporate account, even though I shouldn't have...
For now, I don't have a solution, only this latest message from the PM:
The behaviour around quota restrictions in private pools is a recent change that we're still iterating on and appreciate the feedback to make it easier for personal accounts to try out the feature.
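If you want to confirm the quota value from the CLI rather than the Console, a minimal sketch is below, assuming the alpha quota commands are available in your gcloud SDK; PROJECT_ID is a placeholder and the output format may differ:
# List Cloud Build quotas for the project and look for the
# "Concurrent Builds per Worker Pool" entry (assumes the gcloud alpha surface)
gcloud alpha services quota list \
  --service=cloudbuild.googleapis.com \
  --consumer=projects/PROJECT_ID | grep -i -B2 -A4 concurrent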

A build stuck in the Queued state can have the following possible causes:
Concurrency limits. Cloud Build enforces quotas on running builds for various reasons. By default, Cloud Build allows only 10 concurrent builds, while each worker pool has a limit of 30 concurrent builds. You can further check the quota limits in this link.
Using a custom machine size. In addition to the standard machine type, Cloud Build provides four high-CPU virtual machine types to run your builds.
You are using the worker pools alpha and have too few nodes available.
Additionally, if the issue still persists, you can submit a bug to Google Cloud. I see that your colleague has already filed a report in the public issue tracker at this link. If you have a free trial or paid support plan, it would be better to use it to file the issue.
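As a quick way to rule out the concurrency limit, and to apply the workaround the asker already found (running in the default pool), something like the following should work; the config file name matches the one in the question:
# Count builds that are currently queued or running
gcloud builds list --ongoing

# Workaround: submit the same build without the options.pool block,
# so it runs in the default (global) pool
gcloud builds submit --config=cloudbuild.yaml .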

Related

Data Fusion pipelines fail without executing

I have more than 50 Data Fusion pipelines running concurrently in an Enterprise instance of Data Fusion.
About 4 of them fail at random on each concurrent run, with the logs showing only the provisioning followed by the deprovisioning of the Dataproc cluster, as in this log:
2021-04-29 12:52:49,936 - INFO [provisioning-service-4:i.c.c.r.s.p.d.DataprocProvisioner#203] - Creating Dataproc cluster cdap-fm-smartd-cc94285f-a8e9-11eb-9891-6ea1fb306892 in project project-test, in region europe-west2, with image 1.3, with system labels {goog-datafusion-version=6_1, cdap-version=6_1_4-1598048594947, goog-datafusion-edition=enterprise}
2021-04-29 12:56:08,527 - DEBUG [provisioning-service-1:i.c.c.i.p.t.ProvisioningTask#116] - Completed PROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
2021-04-29 13:04:01,678 - DEBUG [provisioning-service-7:i.c.c.i.p.t.ProvisioningTask#116] - Completed DEPROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
When a failed pipeline is restarted, it completes successfully.
All the pipelines are started and monitored via Composer using an async start and a custom wait SensorOperator.
There is no warning of quota exceeded.
Additional info:
Data Fusion 6.1.4
with an ephemeral Dataproc cluster with 1 master and 2 workers. Image version 1.3.89
EDIT
The appfabric logs related to each failed pipeline are:
WARN [program.status:i.c.c.i.a.r.d.DistributedProgramRuntimeService#172] - Twill RunId does not exist for the program program:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow, runId f34a6fb4-acb2-11eb-bbb2-26edc49aada0
WARN [pool-11-thread-1:i.c.c.i.a.s.RunRecordCorrectorService#141] - Fixed RunRecord for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.fdc22f56-acb2-11eb-bbcf-26edc49aada0 in STARTING state because it is actually not running
Further research somehow connected the problem to an inconsistent state in the CDAP run records when many concurrent requests are made (via the REST API).
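If you want to inspect the run records that this points at, a rough sketch against the CDAP lifecycle REST API is below; the instance endpoint, namespace, and pipeline name are placeholders, and the exact path should be checked against your CDAP version:
# Fetch recent run records for one pipeline from the CDAP lifecycle REST API.
# CDAP_ENDPOINT is the Data Fusion instance API endpoint and PIPELINE_NAME is hypothetical.
TOKEN=$(gcloud auth print-access-token)
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/apps/PIPELINE_NAME/workflows/DataPipelineWorkflow/runs"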

How to troubleshoot long pod kill time for GKE?

When using helm upgrade --install, I every so often run into timeouts. The error I get is:
UPGRADE FAILED
Error: timed out waiting for the condition
ROLLING BACK
If I look in the GKE cluster logs on GCP, I see that when this happens it's because this step takes an unusually long time to execute:
Killing container with id docker://{container-name}:Need to kill Pod
I've seen it range from a few seconds to 9 minutes. If I go into the log message's metadata to find the specific container and look at its logs, there is nothing in them suggesting a difference between it and a quickly killed container.
Any suggestions on how to keep troubleshooting this?
You could refer to this troubleshooting guide for general issues with Google Kubernetes Engine.
As mentioned there, you may need to use the 'Troubleshooting Applications' guide to further debug the application pods or their controller objects.
I am assuming that you have checked the logs (1) of the container in the respective pod, or described (2) the pod (look at the reason for termination) using the commands below. If not, you can try these as well to get more useful information.
1. kubectl logs POD_NAME -c CONTAINER_NAME -p
2. kubectl describe pods POD_NAME
Note: I saw a similar discussion thread about helm upgrade failures reported on github.com. You can have a look there as well.
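As a supplementary check beyond the guides above, you could also look at the events Kubernetes recorded for a slow-to-terminate pod and at its configured grace period; the pod name below is a placeholder:
# Events for the pod in question, sorted by time (may reveal failing preStop hooks, etc.)
kubectl get events --field-selector involvedObject.name=POD_NAME --sort-by=.lastTimestamp

# The termination grace period configured for that pod
kubectl get pod POD_NAME -o jsonpath='{.spec.terminationGracePeriodSeconds}'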

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain couple of times, and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of pre-created subnets
After executing:
the task appears in the cluster's tasks list in the PENDING state
it takes a couple of minutes
eventually (using the refresh button), it changes to the message mentioned above - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach the outside network
What could I be doing wrong? How can I fix it?
In my case, too, the log group was the problem. The one I had configured wasn't working. Hence I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" of the container settings.
Also, if you open the stopped task, navigate to the container section, and expand it, you can see a detailed error message under the Details section. Screenshot below.
It could be a problem with the entry point, as pointed out in the comments on the question (in the task definition): Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
I just created the log group in my CloudWatch console because it had not been created, and now everything is working well.
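For completeness, if you prefer to create the missing log group yourself instead of relying on the "Auto-configure CloudWatch Logs" option, a minimal sketch is below; the group name and region are placeholders and must match the awslogs settings in your task definition:
# Create the CloudWatch log group that the task definition's awslogs
# log configuration points at (name and region are hypothetical)
aws logs create-log-group --log-group-name /ecs/ping-task --region us-east-2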

How to schedule task to call gRPC method?

I have a .NET server running in Google Kubernetes Engine. It is configured to use gRPC through Google Cloud Endpoints. Now I need to schedule a task to call my gRPC method once per day.
The first thing I tried was to use Google Cloud Scheduler to call the HTTP method directly. For that I have:
Set up HTTP to gRPC transcoding on my server so my gRPC method can be called through HTTP.
Created and enabled an SSL certificate as described here.
Created a service account in the IAM & Admin console with the Service Account Token Creator and Service Account User roles.
Created a Cloud Scheduler job with my URL and the Auth header set to an OIDC token for the service account created above.
Deployed a Google Cloud Endpoints configuration with the following parameters (among others):
authentication:
  providers:
  - id: google_service_account
    issuer: MY_SERVICE_ACCOUNT_EMAIL
    jwks_uri: https://www.googleapis.com/robot/v1/metadata/x509/MY_SERVICE_ACCOUNT_EMAIL
  rules:
  - selector: "*"
    requirements:
    - provider_id: google_service_account
After that, when I run the scheduler job, it returns the result "Failed". In the logs it writes an ERROR with status UNKNOWN.
The second thing I tried was to use Google Cloud Scheduler to publish a message to a Pub/Sub topic with my server as the subscriber.
That was unsuccessful too, because I can't verify ownership of the Google Cloud Endpoints domain. I asked a related question here: How to verify ownership of Google Cloud Endpoints service URL?
Now the question: what is the best way to schedule a task that calls a gRPC method, given the following environment:
.NET server running on GKE
gRPC
Automated periodic calls of that task (I can call it manually, but that defeats the purpose)
So you were able to make an HTTP call manually, but not automatically via Google Cloud Scheduler, is that correct?
If so, check whether the request reaches the Cloud Endpoints proxy in the Cloud Console Endpoints logging; it may give you some hints.
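For reference, the Cloud Scheduler job with an OIDC token described in the question can also be created from the CLI. This is only a sketch with placeholder values for the job name, schedule, URL, and service account:
# Daily job that calls the transcoded HTTP endpoint, authenticating with an OIDC token
gcloud scheduler jobs create http call-grpc-daily \
  --schedule="0 3 * * *" \
  --uri="https://api.example.com/v1/myMethod" \
  --http-method=POST \
  --oidc-service-account-email=scheduler-sa@PROJECT_ID.iam.gserviceaccount.com \
  --oidc-token-audience="https://api.example.com/v1/myMethod"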
Distributed scheduler
For more details, refer to the source code: Distributed scheduler
This application can be run on different hosts and offers functionality to schedule the execution of arbitrary commands at a particular time or periodically.
There are two ways to communicate with the application: gRPC and REST. The remote interfaces are specified in the dsched.proto file.
The corresponding REST API can also be found there in the form of API annotations. We also provide generated Swagger files.
To specify task execution timing, we use the notation adopted by cron.
Scheduled tasks are stored in a file and loaded automatically during startup.
Building
Install gRPC
Install gRPC gateway
To parse crontab statements and schedule task execution, we use the gopkg.in/robfig/cron.v2 library.
So it should also be installed: go get -u gopkg.in/robfig/cron.v2. Documentation can be found here.
Get the dsched package: go get -u gitlab.com/andreynech/dsched
Now it is possible to run the standard go build command in the dscheduler and gateway directories to generate binaries for the scheduler and the REST/JSON API gateway. It might also be helpful to examine our CI configuration file to see how we set up the build environment.
Running
All the scheduling functionality is implemented by the dscheduler executable, so it can be run on system startup or on demand. As described by dscheduler --help, there are two command-line parameters:
-i string - File name to store task list (default "/var/run/dscheduler.db")
-p string - Endpoint to listen (default ":50051")
If there is a need to offer the REST/JSON API, the gateway application located in the gateway directory should be run. It can reside on the same host as dscheduler, but typically it would be another host which is accessible over HTTP from outside and can at the same time talk to dscheduler running in the internal network. This setup was also the reason to split the scheduler and gateway into two executables. gateway is a mostly generated application and supports several command-line parameters, described by running gateway --help. The important parameter is -sched_endpoint string, which is the endpoint of the Scheduler service (default "localhost:50051"). It specifies the host name and port where dscheduler is listening for requests.
Scheduling tasks (testing)
There are three ways to control the scheduler server:
Using the Go client implemented in the cli/ directory
Using the Python client implemented in the py_cli directory
Using the REST/JSON API gateway and curl
The Go and Python clients have a similar set of command-line parameters.
$ ./cli --help
Usage of cli:
-a string
The command to execute at time specified by -c parameter
-c string
Statement in crontab format describes when to execute the command
-e string
Host:port to connect (default "localhost:50051")
-l List scheduled tasks
-p Purge all scheduled tasks
-r int
Remove the task with specified id from schedule
-s Schedule task. -c and -a arguments are required in this case
They use the gRPC protocol to talk to the scheduler server. Here are several example invocations:
$ ./cli -l                               # list currently scheduled tasks
$ ./cli -s -c "#every 0h00m10s" -a "df"  # schedule the df command for execution every 10 seconds
$ ./cli -s -c "0 30 * * * *" -a "ls -l"  # schedule the ls -l command to run every 30 minutes
$ ./cli -r 3                             # remove the task with ID 3
$ ./cli -p                               # remove all scheduled tasks
It is also possible to use curl to invoke dscheduler functionality over the REST/JSON API gateway. Assuming that the dscheduler and gateway applications are running, here are some invocations to list, add, and remove scheduling entries from the same host (localhost):
curl 'http://localhost:8080/v1/scheduler/list'   # list currently scheduled tasks
curl -d '{"id":0, "cron":"#every 0h00m10s", "action":"ls"}' -X POST 'http://localhost:8080/v1/scheduler/add'   # schedule the ls command for execution every 10 seconds
curl -d '{"id":0, "cron":"0 30 * * * *", "action":"ls -l"}' -X POST 'http://localhost:8080/v1/scheduler/add'   # schedule ls -l to run every 30 minutes
curl -d '{"id":2}' -X POST 'http://localhost:8080/v1/scheduler/remove'   # remove the task with ID 2
curl -X POST 'http://localhost:8080/v1/scheduler/removeall'   # remove all scheduled tasks
All changes are automatically saved to the file.
Thoughts on scheduler service discovery
In large deployment scenarios (like hundreds of hosts) it can be a challenging problem to find out all the IP addresses and ports where the scheduler service is started. It would be pretty easy to add support for Zeroconf (Bonjour/Avahi) technology to simplify service discovery. As an alternative, it might be possible to implement something similar to the CORBA Naming Service, where running services register themselves and the location of the naming service is well known. We decided to collect feedback before settling on a particular service discovery implementation, so your input is very welcome!

Unable to launch task from a spring cloud data flow stream

I registered my task app in Spring Cloud Data Flow and created a definition for it, and the status shows 'unknown'. I created the stream and am trying to launch the task through task-sink, and I get an error:
java.lang.IllegalStateException: failed to resolve MavenResource:
How do I launch a task from the task-sink? Am I missing something? Any help is appreciated. Another question I have: how do I access the payload sent via TaskLaunchRequest in my task?
S1 http | step1: transformer-rabbit | log
S2 :S1.step1 > filter --expression=payload.contains('CUSTADDRMODRQ_V15') | task-processor | task-sink
task-sink is launching the task provided by the URI in the TaskLaunchRequest. It is looking for the resource as shown in the log:
OUT Using manager EnhancedLocalRepositoryManager with priority 10.0 for /home/vcap/.m2/repository
OUT Using transporter HttpTransporter with priority 5.0 for https://repo.spring.io/libs-snapshot and finally failing.
The task is deployed in our repository and, as mentioned, I registered it and created the definition for it as well.
This is in a CF environment, and I am using SCDF server 1.0.0.M4.
In the application.properties for the task-sink I am providing maven.remote.repositories.snapshots.url=**
task create fis-ifx-event-task --definition "fis-event-task"
My goal is to launch the task from the stream.
Thanks for the information. I am in fact using the BUILD-SNAPSHOT, as I was unable to enable tasks in the 1.0.0.M4 version. Here is the one I am using: spring-cloud-dataflow-server-cloudfoundry-1.0.0.BUILD-20160808.144306-116. I am able to register and create task definitions. The status of the task definition shows as 'unknown' even when I use the sample task module provided by your team. But when I initiate the flow of the stream and task-sink tries to launch the task, it is unable to find the Maven resource. When I create the task definition, does the task module get deployed? I don't see any app in Pivotal Apps Manager. As mentioned earlier, I provided maven.remote.repositories.snapshot.url in the application.properties file for the task-sink application. Another thing I observed: when I launch the task manually from the Data Flow shell, it gives the error CF-UnprocessableEntity(10008): The request is semantically invalid: Unknown field(s): 'staging_disk_in_mb', 'staging_memory_in_mb', and also a message saying 'Source is empty'. Presently the task is only supposed to print the timestamp and is not dependent on any input.
TaskProcessor code:
@EnableBinding(Processor.class)
@EnableConfigurationProperties(TaskProcessorProperties.class)
public class TaskProcessor {

    @Autowired
    private TaskProcessorProperties processorProperties;

    public TaskProcessor() {
    }

    @Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    @ELI(level = "info", eventType = ELIEventType.INBOUND)
    public Object setupRequest(String message) {
        Map<String, String> properties = new HashMap<String, String>();
        properties.put("payload", message);
        TaskLaunchRequest request = new TaskLaunchRequest(processorProperties.getUri(), null, properties, null);
        return new GenericMessage<>(request);
    }
}
TaskSink code:
@SpringBootApplication
@EnableTaskLauncher
@EnableBinding(Sink.class)
@EnableConfigurationProperties(TaskSinkProperties.class)
public class FisIfxEventTaskSinkApplication {

    public static void main(String[] args) {
        SpringApplication.run(FisIfxEventTaskSinkApplication.class, args);
    }
}
I provided the stream I am using earlier in the post. The sink is receiving the TaskLaunchRequest with the URI and payload, as you can see here, but it is unable to launch the task.
OUT registering [40, java.io.File] with serializer org.springframework.integration.codec.kryo.FileSerializer
2016-08-10T16:08:55.02-0600 [APP/0]
OUT Launching Task for the following resource TaskLaunchRequest{uri='maven://com.xxx:fis.ifx.event-task:jar:1.0-SNAPSHOT', commandlineArguments=[], environmentProperties={payload={"statusCode":0,"fisTopic":"CustomerDataUpdated","payloadId":"CUSTADDRMODRQ_V15","customerIds":[1597304]}}, deploymentProperties={}}
Before I begin, you have a number of questions here. In the future, it's better to break them up into multiple questions so that they are easier to find by other users and easier to answer. That being said:
A little context on the current state of things
In order to understand how things will work, it's important to understand the current state of things. The current releases of the software involved are:
Pivotal Cloud Foundry (PCF) - 1.7.12. This version is required for any task support.
Spring Cloud Task (SCT) - 1.0.2.RELEASE
Spring Cloud Data Flow CF (SCDF) - 1.0.0.BUILD-SNAPSHOT (current as of the date of this post).
Currently PCF 1.7.12+ has all the capabilities to run tasks. You can create v3 applications (the type of application used to launch a task), run it as a task, etc. However, the tooling around that functionality is not currently complete. There is no support for v3 applications in Apps Manager or the CLI. There is a plugin for the CLI that is more of a dev tool that can be used to help with some functions (it will show you logs, etc), but it is not fully functional and requires a specific version of the CLI to work [1]. This is one of the reasons that the task functionality within PCF is still considered experimental.
Spring Cloud Task is currently GA and supports all the functionality needed to effectively run tasks on CF. However, it's important to note that SCT doesn't handle orchestration so the actual launching of tasks on CF is the responsibility of either the user, or Spring Cloud Data Flow (the easier route).
Spring Cloud Data Flow's Cloud Foundry server implementation currently has functionality to launch tasks on PCF in the latest snapshots. We have validated this against 1.7.12 as well as the development branch of 1.8.
The task workflow within SCDF
Tasks are fundamentally different from stream applications within the context of SCDF. When you create a stream definition, you are given the option to deploy it. What this does is actually download the Spring Boot über jars and deploy them to PCF as long-running processes. If they go down, PCF will relaunch them as expected, etc.
Tasks, on the other hand, are not deployed. They are launched. The difference is that while you create a task definition, nothing is deployed until you click launch. And when the task completes, the software is shut down and cleaned up. So while a stream definition may have states, it's really a one-to-one relationship between the definition and the deployed software. Whereas with a task, you can launch a task definition as many times as you want.
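To make the deploy-versus-launch distinction concrete, here is a rough sketch of the Data Flow shell commands involved, using the sample timestamp task app as an example; the app coordinates and task name are illustrative and the exact syntax may differ between SCDF releases:
dataflow:> app register --name timestamp --type task --uri maven://org.springframework.cloud.task.app:timestamp-task:jar:1.0.0.BUILD-SNAPSHOT
dataflow:> task create my-timestamp --definition "timestamp"
dataflow:> task launch my-timestamp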
Your issues
Reading through your post, I see a few things that you are struggling with. Let me see if I can help:
Task Definitions within SCDF and launching them via a stream - When launching a task from a stream, the task registry within SCDF is not used. The sink expects the URL for the resource to be within the TaskLaunchRequest.
Apps Manager and tasks - As mentioned above, there is no support for v3 applications in Apps Manager yet so you won't be able to see your tasks there.
Viewing the logs - In order to debug what's going wrong with launching your task on CF, you're going to want to view the logs. To do so, use the v3 CLI plugin mentioned above to view them. It's important to note that you can only tail live logs with the plugin, not view logs that have previously been rendered. Because of that, when testing, you'll want to tail the logs as soon as the app is created, before it's launched.
Error in SCDF Shell - The error you received from the SCDF shell (CF-UnprocessableEntity(10008):...) leads me to wonder if you have both the correct version of PCF (1.7.12+) and the correct version of the following other libraries:
spring-cloud-deployer-cloudfoundry - The latest snapshots
cf-java-client - 2.0.0.M10+
reactor-core - 3.0.0.RC1+
I hope this helps!
[1] https://github.com/cloudfoundry/v3-cli-plugin
Task support is not available in the 1.0.0.M4 release of SCDF's CF server. In this release, the task commands/REST APIs should be disabled - see here. For that reason, you won't see any docs related to tasks in the 1.0.0.M4 reference guide.
That said, task support is available/enabled in the BUILD-SNAPSHOT release. If you build the CF server locally and push it to CF, you can take advantage of the task commands in the shell to create and launch task definitions.