Cloud Data Fusion - Errors with quickstart

I'm working on testing Cloud Data Fusion in GCP by executing their quickstart tutorial. The tutorial I am following is here
I configured my environment to have all the appropriate permissions and get to the point where my Dataproc cluster is up and running and the job starts.
After a few minutes, the job fails with the following error:
java.io.IOException: com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
And:
io.grpc.netty.shaded.io.netty.channel.ChannelException: eventfd_write(...) failed: Bad file descriptor
For the second error, I manually changed the 'input' format to JSON instead of text (the default when you import the pipeline from the Hub), but still no luck. For the first error, I'm not sure what is going wrong.
I have already reviewed the Creating a Cloud Data Fusion instance documentation, but I still receive errors.
Any suggestions?

Related

PERMISSION_DENIED for BigQuery Storage API on Apache Beam 2.39.0 and DataFlow runner

I get the following error for one of my Dataflow jobs:
2022-06-15T16:12:27.365182607Z Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: BigQuery Storage API has not been used in project 770406736630 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/bigquerystorage.googleapis.com/overview?project=770406736630 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
The same code works fine with Apache Beam 2.38.0. I tested multiple times and this is not a temporary issue. The project number mentioned in the error (770406736630) is not mine.
Any idea why I get this error?
I had the same issue. I'm using Spring Cloud GCP and hadn't set the spring.cloud.gcp.project-id property, which I'm guessing makes the SDK or API use some default value.
I don't know how you've set up your environment, because you haven't specified, but look into how you can explicitly set the project ID. You can get it from the dialog for selecting a project in the GCP Console.
I just ran into this, and simply needed to re-authenticate with the gcloud CLI by running gcloud auth application-default login.
The error happens with the latest Apache Beam SDK (2.41.0) when BigQueryIO.Write.Method.STORAGE_WRITE_API is used and the destination does not specify the project, for example dataset.table instead of project-id:dataset.table.
This is the solution that worked for me:
BigQueryIO.writeTableRows()
.to("project-id:dataset.table")
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
For some reason the Apache Beam implementation for BigQuery Write Storage API does not handle this situation even though it works fine for FILE_LOADS method.
You may also receive a slightly different error with the latest Beam SDK:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException:
java.lang.RuntimeException:
java.lang.RuntimeException: com.google.api.gax.rpc.PermissionDeniedException:
io.grpc.StatusRuntimeException:
PERMISSION_DENIED: Permission denied: Consumer 'project:null' has been suspended.
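One way to guard against the unqualified-destination problem described above is to normalize the table string before handing it to the sink. The following is a minimal sketch in Python, not part of Beam: the helper name qualify_table_spec and its parameters are hypothetical, and it assumes destinations arrive as dataset.table, project:dataset.table, or project.dataset.table.

```python
def qualify_table_spec(table_spec: str, default_project: str) -> str:
    """Return table_spec with an explicit project prefix.

    Hypothetical helper for illustration; it is not a Beam API.
    """
    if not default_project:
        raise ValueError("default_project must be a non-empty project ID")
    if ":" in table_spec:
        # Already in the "project:dataset.table" form.
        return table_spec
    dots = table_spec.count(".")
    if dots == 2:
        # Already in the "project.dataset.table" form.
        return table_spec
    if dots == 1:
        # Bare "dataset.table": prepend the project explicitly.
        return f"{default_project}:{table_spec}"
    raise ValueError(f"Unrecognized table spec: {table_spec!r}")


if __name__ == "__main__":
    print(qualify_table_spec("dataset.table", "project-id"))
```

The returned string can then be passed as the write destination, so the project no longer depends on whatever default the client resolves.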

Google Cloud Composer Airflow sqlalchemy OperationalError causing DAG to hang forever

I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently, and we have not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The issue is related to SQLAlchemy using one session per thread and creating a callable session that can be used later in the Airflow code. If the delays between queries within a session approach the connection timeout, MySQL might close the connection. The connection timeout is set to approximately 10 minutes.
Solutions:
- Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and closes the session at the end of the function.
- Do not use a single long-running function. Instead, move all database queries to separate functions, so that there are multiple functions with the airflow.utils.db.provide_session decorator. In this case, sessions are automatically closed after retrieving query results.
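The pattern in the two solutions above can be sketched without Airflow installed. The code below is only an illustration under stated assumptions: provide_session here is a stand-in written for this example (it is not Airflow's implementation), the database is a temporary SQLite file, and the table and function names are hypothetical. The point it shows is that each short function gets its own session, which is closed before any slow non-database work runs.

```python
import functools
import os
import sqlite3
import tempfile

# Hypothetical database location, used only for this sketch.
DB_PATH = os.path.join(tempfile.mkdtemp(), "sketch.db")


def provide_session(func):
    """Stand-in decorator: open a session, pass it in, always close it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if kwargs.get("session") is not None:
            # The caller supplied a session and is responsible for it.
            return func(*args, **kwargs)
        kwargs.pop("session", None)
        session = sqlite3.connect(DB_PATH)
        try:
            result = func(*args, session=session, **kwargs)
            session.commit()
            return result
        finally:
            session.close()
    return wrapper


@provide_session
def record_state(task_id, state, session=None):
    # One short query per function, so the session never sits idle.
    session.execute(
        "CREATE TABLE IF NOT EXISTS task_state "
        "(task_id TEXT PRIMARY KEY, state TEXT)"
    )
    session.execute(
        "INSERT OR REPLACE INTO task_state VALUES (?, ?)", (task_id, state)
    )


@provide_session
def read_state(task_id, session=None):
    row = session.execute(
        "SELECT state FROM task_state WHERE task_id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None


def long_running_task():
    record_state("extract", "running")
    # Slow, non-database work would go here; no session is held open.
    record_state("extract", "success")
    return read_state("extract")


if __name__ == "__main__":
    print(long_running_task())
```

In a real DAG the decorator would be imported from Airflow rather than defined locally; the structure of the decorated functions stays the same.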

Unable to create environments on Google Cloud Composer

I tried to create a Google Cloud Composer environment, but on the setup page I get the following errors:
Service Error: Failed to load GKE machine types. Please leave the field
empty to apply default values or retry later.
Service Error: Failed to load regions. Please leave the field empty to
apply default values or retry later.
Service Error: Failed to load zones. Please leave the field empty to apply
default values or retry later.
Service Error: Failed to load service accounts. Please leave the field
empty to apply default values or retry later.
The only parameters GCP lets me change are the region and the number of nodes, but it still lets me create the environment. After 30 minutes the environment crashes with the following error:
CREATE operation on this environment failed 1 day ago with the following error message:
Http error status code: 400
Http error message: BAD REQUEST
Errors in: [Web server]; Error messages:
Failed to deploy the Airflow web server. This might be a temporary issue. You can retry the operation later.
If the issue persists, it might be caused by problems with permissions or network configuration. For more information, see https://cloud.google.com/composer/docs/troubleshooting-environment-creation.
An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-07-20T14:31:23.047Z7050.xd.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
Got error "Another operation failed." during CP_DEPLOYMENT_CREATING_STANDARD []
Is it a problem with permissions? If so, what permissions do I need? Thank you!
It looks more like a temporary issue:
- The first set of errors states that the metadata (regions list, zones list, and so on) cannot be loaded.
- You don't have a clear PERMISSION_DENIED error.
- The second error also suggests this: "This might be a temporary issue."

Apache Beam StatusRuntimeException on Dataflow pipeline

I am working on a Dataflow pipeline written in Python 2.7 using apache_beam==2.24.0. The pipeline consumes Pub/Sub messages from a subscription in batches using Beam's ReadFromPubSub, does some processing on the messages, and then persists the resulting data to two different BigQuery tables. I am consuming a lot of data. The google-cloud-pubsub version is 1.7.0. When I start the pipeline everything works fine, but after a few hours I start getting this exception:
org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
In the GCP Dataflow console, the logs show this error, but the job itself seems to work fine. It consumes data from the subscription and writes it to BigQuery. Which call does "CANCELLED: call already cancelled" refer to, and why am I getting this error? How can I resolve it?
Full stacktrace:
Caused by: org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Status.asRuntimeException(Status.java:524)
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341)
org.apache.beam.sdk.fn.stream.DirectStreamObserver.onNext(DirectStreamObserver.java:98)
org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.flush(BeamFnDataSizeBasedBufferingOutboundObserver.java:100)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.shouldWait(RemoteGrpcPortWriteOperation.java:124)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.maybeWait(RemoteGrpcPortWriteOperation.java:167)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.process(RemoteGrpcPortWriteOperation.java:196)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:123)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1365)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1085)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
The client I am working for has the option of raising a ticket with Google Cloud Support. The exact reply from Google Cloud Support:
This error you are finding is rather harmless. The dataflow is a massively parallel data processing platform and when there are autoscaling events which can move the worker VM around. When the VM is getting shut down the grpc channel is closed before the runner process and the work item being processed will be retried on another newly launched runner. These errors can be ignored.

AWS Glue job runs correct but returns a connection refused error

I am running a test job on AWS. I am reading CSV data from an S3 bucket, running a Glue ETL job on it, and storing the same data in Amazon Redshift. The Glue job just reads the data from S3 and stores it in Redshift without any modification. The job runs fine and I get the desired result in Redshift, but it returns an error that I am unable to understand.
Here is the error log:
18/11/14 09:17:31 WARN YarnClient: The GET request failed for the URL http://169.254.76.1:8088/ws/v1/cluster/apps/application_1542186720539_0001
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 169.254.76.1:8088 [/169.254.76.1] failed: Connection refused (Connection refused)
It is a WARN rather than an error, but I want to understand what is causing it. I tried to search for the IP indicated in the WARN, but I am not able to find a machine with that IP.
I noticed this error coming up in my AWS Glue job, and I found something from AWS that could be helpful:
This WARN message is not so special, and does not mean job failure or any errors directly. I guess there should be other cause.
I would recommend you to enable continuous logging, and check both driver/executor logs to see if there are any suspicious behavior.
If you enable job bookmark, please try disabling it and see how it goes without bookmark.
https://forums.aws.amazon.com/thread.jspa?messageID=927547
I had disabled bookmarks from the beginning. What I found was that my Glue job was writing data to S3 and hit a memory exception, so I repartitioned the data:
MyDynamicFrame.coalesce(100).write.partitionBy("month").mode("overwrite").parquet("s3://"+bucket+"/"+path+"/out_data")
So if you have write operations, I recommend checking how you are writing to S3.