I have a streaming Dataflow job that sinks its output data to a Pub/Sub topic, but the job logs intermittently show an error:
There were errors attempting to publish to topic projects/my_project/topics/my_topics. Recent publish calls completed successfully: 574, recent publish calls completed with error: 1.
There is no stack trace provided by Dataflow, and in the job metrics the error type is "unavailable". After some time the errors stop and the pipeline keeps running as usual. Does this error occur because of an internal error in the GCP service, or because of a quota issue? The output request rate peaked at 10 req/s.
I found a similar issue that was resolved by granting the "Pub/Sub Admin" role to the Compute Engine default service account under IAM permissions.
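In case the same fix applies here, a minimal sketch of granting that role with gcloud might look like the following (the project ID and project number are placeholders; the Compute Engine default service account has the form PROJECT_NUMBER-compute@developer.gserviceaccount.com, and the narrower roles/pubsub.publisher may be enough if the job only publishes):

# Hypothetical example; substitute your own project ID and project number.
gcloud projects add-iam-policy-binding my_project \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/pubsub.admin"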
Loading log-based metrics suddenly stopped working with this error:
An error occurred requesting data. Can not find metric metric_types: "logging.googleapis.com/user/<redacted>" resource_container_ids: <redacted> request_context: REQUEST_CONTEXT_READ
Alerts defined on this metric are still triggering properly.
Any idea what can be the cause of this error? And since it was working before, perhaps this is a GCP bug?
Earlier this month we enabled Stackdriver Monitoring in 3 of our projects on GCP.
Recently we found that the Stackdriver API metrics show an error rate of around 85%:
On the graphs, these errors have code 429:
I've checked the quotas, and everything seems fine:
The next metrics graph shows which method is causing the errors:
Using the other graph, "Errors by credential", I found out that the failing API requests are made by our GKE service account. We have a custom service account for the GKE instances, and as far as we know it has all the required roles for monitoring (see the check sketched after the list below):
roles/logging.logWriter
roles/monitoring.metricWriter
roles/stackdriver.resourceMetadata.writer (as noted in this issue)
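For reference, this is roughly how we listed the roles bound to that account (the project and service-account address below are placeholders for our custom GKE account):

# Hypothetical check; replace the project and service-account address with your own.
gcloud projects get-iam-policy my_project \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:gke-nodes@my_project.iam.gserviceaccount.com" \
    --format="table(bindings.role)"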
Also, the stackdriver-metadata-agent pods in the GKE cluster log a related error every minute:
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent I0203 15:03:16.911940 1 binarylog.go:265] rpc: flushed binary log to ""
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent W0203 15:03:56.495034 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent I0203 15:04:16.912272 1 binarylog.go:265] rpc: flushed binary log to ""
stackdriver-metadata-agent-cluster-level-d6556b55-2bkbc metadata-agent W0203 15:04:56.657831 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
Aside from that, I haven't found any logs related to the issue, and I cannot figure out what is making 2 requests per second to the Stackdriver API and receiving 429 errors.
I should add that everything above is true for all 3 projects.
Can someone suggest how we can solve this issue?
Is this still a quota issue? If so, why do the quota request metrics look fine, and why does the "Quota exceeded errors count" metric contain no data?
Are we missing any permissions on our GKE service account?
What else could be related?
Thanks in advance.
This is a known behavior: containers and pods tend to publish metadata updates very frequently, which hits the rate limits. There are no performance or functionality issues from this behavior, aside from the noisy logs.
It's also possible to apply a logs exclusion so these entries don't get posted to Stackdriver Logging.
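For example, an exclusion filter along these lines should match the noisy entries (a sketch that assumes the agent is logged under the k8s_container resource type in the kube-system namespace; adjust the labels to whatever your log entries actually show):

resource.type="k8s_container"
resource.labels.namespace_name="kube-system"
resource.labels.container_name="metadata-agent"
"Failed to publish resource metadata"

Attaching a filter like this to a logs exclusion drops the matching entries at ingestion instead of storing them.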
I have a Dataflow pipeline running on GCP which reads messages from Pub/Sub and writes them to a GCS bucket. My Dataflow pipeline was cancelled by some user, and I want to know who that user is.
You can view all Step Logs for a pipeline step in Stackdriver Logging by clicking the Stackdriver link on the right side of the logs pane.
Here is a summary of the different log types available for viewing from the Monitoring→Logs page:
job-message logs contain job-level messages that various components of Cloud Dataflow generate. Examples include the autoscaling configuration, when workers start up or shut down, progress on the job step, and job errors. Worker-level errors that originate from crashing user code and that are present in worker logs also propagate up to the job-message logs.
worker logs are produced by Cloud Dataflow workers. Workers do most of the pipeline work (for example, applying your ParDos to data). Worker logs contain messages logged by your code and Cloud Dataflow.
worker-startup logs are present on most Cloud Dataflow jobs and can capture messages related to the startup process. The startup process includes downloading a job's jars from Cloud Storage, then starting the workers. If there is a problem starting workers, these logs are a good place to look.
shuffler logs contain messages from workers that consolidate the results of parallel pipeline operations.
docker and kubelet logs contain messages related to these public technologies, which are used on Cloud Dataflow workers.
As mentioned in the previous comment, you should filter by pipeline ID; the actor of the task will be in the AuthenticationEmail entry.
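If you want to pull that from the CLI rather than the UI, something along these lines should work (a sketch; <PROJECT_ID> is a placeholder, and it assumes the cancel request shows up in the always-on Admin Activity audit logs, where the caller is recorded under protoPayload.authenticationInfo.principalEmail):

# List recent Dataflow admin actions and who performed them, then look for the
# entry that references your pipeline/job ID.
gcloud logging read \
    'logName:"cloudaudit.googleapis.com%2Factivity" AND protoPayload.serviceName="dataflow.googleapis.com"' \
    --project=<PROJECT_ID> \
    --limit=50 \
    --format="value(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)"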
I have a large backlog of undelivered messages for one Google Cloud Pub/Sub subscription. I would rather not have to process every message to get caught up, and I cannot delete the subscription manually because it was created using Cloud Deployments.
The gcloud seek command appears to be what I need (https://cloud.google.com/sdk/gcloud/reference/alpha/pubsub/subscriptions/seek). However, upon running this command in the Google Cloud Shell, I receive a "method not found" exception:
gcloud alpha pubsub subscriptions seek my_subscription__name --time=2016-11-11T06:20:57
ERROR: (gcloud.alpha.pubsub.subscriptions.seek) Subscription [my_subscription__name:seek] not found: Method not found.
The subscription type is "Pull".
The API for this method is white-list only at the moment -- but stay tuned. We'll find a way to clarify this in the CLI documentation or output.
I am getting the below error from AWS EMR. I have submitted a job from the CLI, and the job status is pending.
A client error (ThrottlingException) occurred when calling the ListSteps operation: Rate exceeded
How can I see all the active jobs in the EMR cluster, and how can I kill them from the CLI and also from the AWS console?
Regards
sanjeeb
AWS APIs are rate limited. According to the AWS docs, the recommended approach to dealing with a throttling response is to implement exponential backoff in your retry logic: when you get a ThrottlingException, catch it, sleep for some time (say half a second), and then retry, doubling the delay on each subsequent failure.
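As a rough illustration (not an official recipe), a shell wrapper around the same call could back off like this; the cluster ID is a placeholder, and aws emr cancel-steps is the CLI counterpart for cancelling steps you no longer want:

# Hypothetical sketch: retry ListSteps with exponential backoff (1s, 2s, 4s, ...).
CLUSTER_ID="j-XXXXXXXXXXXXX"    # placeholder cluster ID
delay=1
for attempt in 1 2 3 4 5; do
  if aws emr list-steps --cluster-id "$CLUSTER_ID" --step-states PENDING RUNNING; then
    break                       # call succeeded, stop retrying
  fi
  sleep "$delay"                # back off before the next attempt
  delay=$((delay * 2))          # double the delay each time
done
# To cancel steps, pass their IDs to:
#   aws emr cancel-steps --cluster-id "$CLUSTER_ID" --step-ids <STEP_ID> [<STEP_ID> ...]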