Dataflow: SDK harness disconnected errors - google-cloud-platform

We have a pipeline that extracts embeddings (feature vectors) from images stored in a Cloud Storage bucket and inserts them into a BigQuery table.
We're consistently getting "SDK harness sdk-0-1 disconnected." errors when the Dataflow job runs on N1 VM instances.
Error message from worker:
Data channel closed, unable to send additional data to SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
SDK harness sdk-0-0 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-1
Notes
N2 machines work fine but N1 machines fail, which is somewhat surprising because N1 is the Google default machine type.
Jobs run slower on N1 machines and sometimes appear to fail due to these errors.
Using a larger VM (more memory, CPU and disk) didn't resolve the errors.
We also have another pipeline that extracts embeddings from text using the lapse model, and it has the same errors on both N1 and N2 machines.
Diagnostics tab: No errors found during this interval.
We're creating DF job templates (Apache Beam 2.40 Python), storing them on Cloud Storage and using API to launch new jobs.
We're batching the items before giving them to the stage where embeddings are extracted. Reducing batch size didn't matter.
Changing the pipeline option sdk_worker_parallelism from 0 (default) to 1 didn't change anything.
Disabling auto-scaling (max_worker=1) gave the same errors.
With the Reshuffle stage removed from the pipeline, there are still disconnect errors, e.g. "SDK harness sdk-0-0 disconnected.", but no data channel errors, e.g. "Data channel closed, unable to send additional data to SDK sdk-0-3".

This error message can have a wide variety of causes, which cannot easily be identified unless it is accompanied by other symptoms. It could be due to any of the errors listed in the Dataflow documentation.
For more information about the error, check the Diagnostics tab of the job.
The Diagnostics tab shows a timeline of the errors that occurred and possible recommendations for your pipeline. You can also view the job metrics to monitor your Dataflow jobs.
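Since the message alone says little, it can help to see which harnesses are affected and how often. Below is a minimal sketch that tallies the two kinds of worker log lines quoted in the question. The function name and regular expressions are my own, written against the exact wording of those log lines; this is not a Dataflow API.

```python
import re
from collections import Counter

# Patterns written against the worker log lines quoted in the question.
DISCONNECT = re.compile(r"SDK harness (sdk-\d+-\d+) disconnected")
DATA_CHANNEL = re.compile(
    r"Data channel closed, unable to (?:send|receive) additional data "
    r"(?:to|from) SDK (sdk-\d+-\d+)"
)


def tally_harness_errors(log_text):
    """Count disconnect and data-channel errors per SDK harness id."""
    disconnects = Counter()
    data_channel = Counter()
    for line in log_text.splitlines():
        match = DISCONNECT.search(line)
        if match:
            disconnects[match.group(1)] += 1
            continue
        match = DATA_CHANNEL.search(line)
        if match:
            data_channel[match.group(1)] += 1
    return disconnects, data_channel


# The log excerpt from the question, verbatim.
SAMPLE = """\
Data channel closed, unable to send additional data to SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
SDK harness sdk-0-0 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-1
"""

if __name__ == "__main__":
    disconnects, data_channel = tally_harness_errors(SAMPLE)
    print("disconnects:", dict(disconnects))
    print("data channel errors:", dict(data_channel))
```

If the same harness ids keep showing up, that points at specific worker processes rather than at the job as a whole, which is a useful detail to have when comparing N1 and N2 runs.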

Related

Dataflow - 20 streaming Windmill RPC errors for a stream

I've been using dataflow and pubsub for streaming for over a year, and today without me changing anything dataflow is not reading from pubsub anymore. At first, I was getting the below error in my logging but it stopped popping up once I updated pubsub to the latest version and apache beam sdk from 2.10.0 to 2.17.0
20 streaming Windmill RPC errors for a stream, last was: org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
I found the link below, but at the end it just says GCP is working on it and does not say whether the poster did anything to fix the issue. How does this get fixed, and what is causing it?
Dataflow: streaming Windmill RPC errors for a stream

SWF Activity is not completing even though the computation has finished

I'm testing a new SWF workflow, and I've got some activity that makes a RESTful call out to another service. Problem is, I can see through logging that the actual call takes less than a second to complete, but the Activity always times out in SWF (START_TO_CLOSE of 5 mins). Being more specific, the RESTful call is a list call, and when I limit the batch size to a small number, the Activity completes and moves on very quickly. But at some seemingly arbitrary threshold, it chokes completely.
Does anyone have any insight into this? I've read that SWF calls have a size limitation of 1 MB, does anyone know how to find the size of data my workers are trying to pass SWF?
After some remote debugging, it turns out the response from the task is too big and the activity is failing silently. The failure occurs when the framework tries to report the response back to SWF, and the SDK calls RespondActivityTaskCompleted. That API has a length restriction on the internal result param:
Length Constraints: Maximum length of 32768.
This is a validation error that throws an uncaught exception and is swallowed internally until the Activity times out.
I wouldn't recommend using activity input and output parameters for passing large data sets. SWF is an orchestration technology, not a data-passing one. The standard workarounds are:
Store the result in a separate store (S3, for example) and pass a reference to it.
Cache the result locally on a machine and route all following activities to the same host so that they have access to the cached result. See the fileprocessing sample for details of the routing approach.
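A minimal sketch of the first workaround, assuming the activity result is JSON-serializable. The function name, the "inline"/"ref" envelope, and the store callable are all hypothetical; in practice the store would be an S3 upload or similar. Only the 32768 limit comes from the API constraint quoted above.

```python
import json

# Limit quoted above for the result parameter of RespondActivityTaskCompleted.
MAX_RESULT_LENGTH = 32768


def build_activity_result(payload, store, key):
    """Return a result string short enough to report back to SWF.

    `store` is a hypothetical callable (key, body) -> location standing in
    for an S3 upload or similar; it is only called for oversized payloads.
    """
    inline = json.dumps({"inline": payload})
    if len(inline) <= MAX_RESULT_LENGTH:
        return inline
    # Too large to return directly: save the body elsewhere, return a pointer.
    location = store(key, json.dumps(payload))
    return json.dumps({"ref": location})


if __name__ == "__main__":
    saved = {}

    def memory_store(key, body):
        # In-memory stand-in for a real object store.
        saved[key] = body
        return "memory://" + key

    print(build_activity_result(["a", "b"], memory_store, "run-1/result.json"))
    print(build_activity_result(["x" * 100] * 1000, memory_store, "run-2/result.json"))
```

The next activity then checks whether it received an inline value or a reference and fetches the body itself in the latter case. Checking the length before responding also makes the failure visible instead of letting the activity time out silently.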
BTW, have you checked out Cadence, which is an open source version of SWF with much better client-side libraries?

Sitecore 8.2 Separate Processing Server

I am using Sitecore 8.2 Update 6. Previously my CM, Processing and Reporting roles were on a single CM server. Now I need to use a separate Processing server, with Reporting and CM remaining on one server.
I have configured my processing server as mentioned in the following url:
https://doc.sitecore.net/sitecore_experience_platform/82/setting_up_and_maintaining/xdb/configuring_servers/configure_a_processing_server
and configured my connection strings as per the following url:
https://doc.sitecore.net/sitecore_experience_platform/81/setting_up_and_maintaining/xdb/configuring_servers/database_connection_strings_for_configuring_servers
Now I have a couple of questions:
1) Is there any change required in my CM or CD so that they know about the separate processing server?
2) How can I test whether my processing server is doing the required tasks?
Thanks,
Nicks
Your CM and CD do not need to know about the processing server, but you need to make sure that processing functions are not enabled on the CM or CD.
You will know if processing is working by looking at the logs and seeing if the pipelines are executing and not throwing errors.
You will also see analytics data being processed and showing up in the reporting database. If you are not seeing analytics data, this is an indication you might have errors in processing.
Note that there are several possible reasons reporting data might not be working, but if it is succeeding at getting your new analytics data then processing is running.

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Due to a recent unplanned extended outage (due to an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before but had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1
110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue. Support will have full access to the DMS logs and be able to see exactly what is going on. This behavior is not expected.
While I agree a support case would be reasonable, I think you should also try scaling up to, say, DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400 as described here. Note that a larger resource class gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'

Forwarding journald to Cloudwatch Logs

I'm a newbie to CentOS and wanted to know the best way to parse journal logs to CloudWatch Logs.
My thought processes so far are:
Use a FIFO to parse the journal logs and ingest them into CloudWatch Logs. It looks like this could come with drawbacks, where logs could be dropped if we hit buffering limits.
Forward journal logs to syslog and send the syslog output to CloudWatch Logs.
The idea is essentially to have everything logging to journald as JSON and then forward this across to CloudWatch Logs.
What is the best way to do this? How have others solved this problem?
Take a look at https://github.com/advantageous/systemd-cloud-watch
We had problems with journald-cloudwatch-logs. It just did not work for us at all.
It does not limit the size of the message or commandLine that it sends to CloudWatch, and CloudWatch sends back an error that journald-cloudwatch-logs cannot handle, which puts it out of sync.
systemd-cloud-watch is stateless and it asks CloudWatch where it left off.
systemd-cloud-watch also creates the log-group if missing.
systemd-cloud-watch also uses the name tag and the private ip address so that you can easily find the log you are looking for.
We also include a packer file to show you how to build and configure a systemd-cloud-watch image with EC2/Centos/Systemd. There is no question about how to configure systemd because we have a working example.
Take a look at https://github.com/saymedia/journald-cloudwatch-logs by Martin Atkins.
This open source project creates a binary that does exactly what you want - ship your (systemd) journald logs to AWS CloudWatch Logs.
The project depends on libsystemd to forward directly to CloudWatch. It does not rely on forwarding to syslog. This is a good thing.
The project appears to use golang's concurrent channels to read the logs and batch writes.
Vector can be used to ship logs from journald to AWS CloudWatch Logs.
journald can be used as a source and AWS Cloudwatch Logs as a sink.
I'm working on integrating this with an existing deployment of about 6 EC2 instances that generate about 30 GB of logs daily. I'll update this answer with any caveats or gotchas after we've used Vector in production for a few weeks.
EDIT 8/17/2020
A few things to be aware of. The max batch size for PutLogEvents is 1 MB and there is a max of 5 requests per second per stream. See the limits here.
To help with that, in my setup each journald unit has its own log stream. Also, the Vector journald source includes a lot of fields, so I used a Vector transform to remove all the ones I didn't need. However, I'm still running into rate limits.
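For anyone writing their own shipper instead of using Vector, here is a minimal sketch of splitting messages into batches that stay under the 1 MB limit mentioned above. The 26 bytes of per-event overhead and the 10,000-event cap are from my reading of the PutLogEvents API documentation, not from the discussion above, so treat them as assumptions. The function name is made up, and the 5 requests per second limit is not handled here.

```python
MAX_BATCH_BYTES = 1_048_576   # the 1 MB PutLogEvents batch limit
PER_EVENT_OVERHEAD = 26       # assumed: bytes counted per event on top of the message
MAX_BATCH_EVENTS = 10_000     # assumed: maximum number of events per batch


def batch_events(messages, max_bytes=MAX_BATCH_BYTES, max_events=MAX_BATCH_EVENTS):
    """Yield lists of messages whose accounted size fits in one batch."""
    batch, size = [], 0
    for message in messages:
        cost = len(message.encode("utf-8")) + PER_EVENT_OVERHEAD
        if cost > max_bytes:
            raise ValueError("single message does not fit in a batch")
        # Close the current batch if adding this message would exceed a limit.
        if batch and (size + cost > max_bytes or len(batch) >= max_events):
            yield batch
            batch, size = [], 0
        batch.append(message)
        size += cost
    if batch:
        yield batch


if __name__ == "__main__":
    # Tiny limit so the splitting is visible: each message costs 74 + 26 = 100.
    for chunk in batch_events(["a" * 74] * 5, max_bytes=250):
        print(len(chunk))
```

Each yielded batch would then be sent as one request per log stream, with the caller responsible for pacing requests.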
EDIT 10/6/2020
I have this running in production now. I had to update the version of Vector I was using from 0.8.1 to 0.10.0 to take care of an issue with Vector not respecting the max bytes per batch requirement for AWS CloudWatch Logs. As for the rate limit issues I thought I was experiencing, it turns out I wasn't having any. I was getting this message in the Vector logs: tower_limit::rate::service: rate limit exceeded, disabling service. What that actually means is that Vector is temporarily pausing sending logs to respect the rate limit of the sink. Also, each CloudWatch Log Stream can consume up to 18 GB per hour, which is fine for my 30 GB per day requirement for over 30 different services on 6 VMs.
One issue I did run into was a CPU spike on our main API service. I had a source for each service unit to tail the journald logs. I believe this somehow blocked our API from writing to journald (not 100% sure though). What I did was use one source with multiple units to follow, so there was only one command tailing the logs, and I increased the batch size since each service generates a lot of logs. I then used Vector's template syntax to split the Log Group and Log Stream based on the service name. Below is an example configuration:
[sources.journald_logs]
type = "journald"
units = ["api", "sshd", "vector", "review", "other-service"]
batch_size = 100
[sinks.cloud_watch_logs]
type = "aws_cloudwatch_logs"
inputs = ["journald_logs"]
group_name = "/production/{{host}}/{{_SYSTEMD_UNIT}}"
healthcheck = true
region = "${region}"
stream_name = "{{_SYSTEMD_UNIT}}"
encoding = "json"
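To make the effect of the templates concrete, here is a small Python sketch that mimics the substitution for a single event. It is not Vector's implementation, just an illustration of the idea; the render function, the regular expression, and the example host and unit values are all hypothetical.

```python
import re

# Matches {{field_name}} placeholders like the ones in the config above.
FIELD = re.compile(r"\{\{\s*([A-Za-z0-9_]+)\s*\}\}")


def render(template, event):
    """Replace each {{field}} with the matching value from the event dict."""
    return FIELD.sub(lambda match: str(event[match.group(1)]), template)


if __name__ == "__main__":
    # Hypothetical journald event; only the two fields used by the templates.
    event = {"host": "ip-10-0-0-1", "_SYSTEMD_UNIT": "api.service"}
    print(render("/production/{{host}}/{{_SYSTEMD_UNIT}}", event))  # log group
    print(render("{{_SYSTEMD_UNIT}}", event))                       # log stream
```

With one source following several units, every event carries its own unit name, so each service still ends up in its own Log Group and Log Stream.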
I have one final issue I need to iron out, but it's not related to this question. I'm using a file source for nginx since it writes to an access log file. Vector is consuming 80% of the CPU on that machine getting the logs and sending them to AWS CloudWatch. Filebeat also runs on the same box sending the logs to Logstash, but it's never caused any issues. Once we get vector working reliably we'll retire the Elastic Stack, but for now we have them running side by side.