GCloud Dataflow recreate BigQuery table if it gets deleted during job run - google-cloud-platform

I have set up a GCloud Dataflow pipeline which consumes messages from a Pub/Sub subscription, converts them to table rows and writes those rows to a corresponding BigQuery table.
The table destination is decided based on the contents of the Pub/Sub message, which occasionally leads to a situation where a table does not exist yet and has to be created first. For this I use the create disposition CREATE_IF_NEEDED, which works great.
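For context, the write part of my pipeline is set up roughly like the sketch below (project, subscription, dataset, and field names here are placeholders rather than my real ones):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder schema; the real one matches the Pub/Sub payload.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("type").setType("STRING"),
        new TableFieldSchema().setName("payload").setType("STRING")));

    p.apply("read-messages", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        .apply("to-table-rows", MapElements.via(new SimpleFunction<String, TableRow>() {
          @Override
          public TableRow apply(String message) {
            // The real conversion inspects the message to decide the destination.
            return new TableRow().set("type", "event").set("payload", message);
          }
        }))
        .apply("write-rows-to-bigquery", BigQueryIO.writeTableRows()
            // Route each row to a table derived from its content.
            .to(row -> new TableDestination(
                "my-project:my_dataset.events_" + row.getValue().get("type"), null))
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}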
However, I have noticed that if I manually delete a newly created table in BigQuery while the Dataflow job is still running, Dataflow will get stuck and will not recreate the table. Instead I get an error:
Operation ongoing in step write-rows-to-bigquery/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 05m00s without outputting or completing in state finish
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
    at java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:816)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:881)
    at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:143)
    at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:115)
    at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
If I go back to BigQuery and manually recreate the table, the Dataflow job continues working.
However, is there a way to instruct the Dataflow pipeline to recreate the table if it gets deleted while the job is running?

This is not possible with the current BigQueryIO connector. If you look at the connector's source on GitHub, you will see that for StreamingWriteFn, which is what your pipeline uses, table creation happens in getOrCreateTable, which is called from finishBundle. A set of createdTables is maintained, and in finishBundle the table is created only if it is not already in that set; once the table exists and its spec is stored in the set, it is not re-created, as shown below:
public TableReference getOrCreateTable(BigQueryOptions options, String tableSpec)
    throws IOException {
  TableReference tableReference = parseTableSpec(tableSpec);
  if (!createdTables.contains(tableSpec)) {
    synchronized (createdTables) {
      // Another thread may have succeeded in creating the table in the meanwhile, so
      // check again. This check isn't needed for correctness, but we add it to prevent
      // every thread from attempting a create and overwhelming our BigQuery quota.
      if (!createdTables.contains(tableSpec)) {
        TableSchema tableSchema = JSON_FACTORY.fromString(jsonTableSchema, TableSchema.class);
        Bigquery client = Transport.newBigQueryClient(options).build();
        BigQueryTableInserter inserter = new BigQueryTableInserter(client);
        inserter.getOrCreateTable(tableReference, WriteDisposition.WRITE_APPEND,
            CreateDisposition.CREATE_IF_NEEDED, tableSchema);
        createdTables.add(tableSpec);
      }
    }
  }
  return tableReference;
}
To meet your requirement you would have to maintain your own fork of BigQueryIO in which you do not perform this specific check:
if (!createdTables.contains(tableSpec)) {
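For illustration only, a fork of the quoted method with the createdTables cache removed might look roughly like the sketch below. It reuses the names from the snippet above, pays one extra BigQuery API call per bundle flush, and is not an official patch:

public TableReference getOrCreateTable(BigQueryOptions options, String tableSpec)
    throws IOException {
  TableReference tableReference = parseTableSpec(tableSpec);
  // No createdTables cache: the CREATE_IF_NEEDED path runs on every bundle flush,
  // so a table deleted mid-job is simply re-created on the next flush.
  TableSchema tableSchema = JSON_FACTORY.fromString(jsonTableSchema, TableSchema.class);
  Bigquery client = Transport.newBigQueryClient(options).build();
  BigQueryTableInserter inserter = new BigQueryTableInserter(client);
  inserter.getOrCreateTable(tableReference, WriteDisposition.WRITE_APPEND,
      CreateDisposition.CREATE_IF_NEEDED, tableSchema);
  return tableReference;
}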
The more important question, though, is why a table would get deleted in a production system in the first place. That underlying problem should be fixed rather than trying to re-create the table from Dataflow.

Related

How to apply Path Patterns in GCP Eventarc for BigQuery service's jobCompleted method?

I am developing a solution where a cloud function calls a BigQuery procedure, and upon successful completion of this stored proc another cloud function should be triggered. For this I am using the Audit Logs "jobservice.jobcompleted" method. The problem with this approach is that it triggers the cloud function for every job completed in BigQuery, irrespective of dataset and procedure.
Is there any way to add a path pattern to the filter so that it triggers only for a specific query completion and not for all of them?
My query starts something like: CALL storedProc() ...
Also, when I tried to create a 2nd gen function from the console with an Eventarc trigger, to my surprise the BigQuery event provider does not have an event for jobCompleted.
Now I'm wondering if it's possible to trigger based on the job-complete event at all.
Update: I changed my logic to use the google.cloud.bigquery.v2.TableService.InsertTable method, so that after inserting a record into a table an audit log message is emitted that can trigger the next service. This insert statement is the last statement in the BigQuery procedure.
After running the procedure, the insert statement inserts the data, but the resource name comes through as projects/<project_name>/jobs.
I was expecting something like projects/<project_name>/tables/<table_name> so that I can apply a path pattern to the resource name.
Do I need to use a different protoPayload.method?
Try creating a log sink for the job-completed audit entries, filtered on the unique principal email of the service account, and route the sink to Pub/Sub.
Then use the published Pub/Sub event to run the destination service.
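As a rough sketch of that last step, assuming a hypothetical subscription bq-job-completed-sub attached to the sink's topic, a minimal pull subscriber that reacts to the exported audit-log entries could look like this (a push subscription to a Cloud Function would work just as well):

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class LogSinkListener {
  public static void main(String[] args) throws Exception {
    // Placeholder project and subscription names.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "bq-job-completed-sub");

    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      // Each message carries the exported LogEntry as JSON; inspect it here
      // and kick off the next service if it matches your procedure.
      System.out.println("Audit log entry: " + message.getData().toStringUtf8());
      consumer.ack();
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    // Keep the process alive for a while to receive messages.
    Thread.sleep(60_000);
    subscriber.stopAsync().awaitTerminated();
  }
}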

Facing issue while updating job in AWS Glue Job

I am new to AWS Glue Studio. I am trying to create a job involving multiple joins and custom code, reading data from the Glue catalog and writing it to an S3 bucket. It was working fine until recently; I only added more withColumn operations in the custom transform block. Now when I try to save the job I get the following error:
Failed to update job
[gluestudio-service.us-east-2.amazonaws.com] updateDag: InternalFailure: null
I tried cloning the job and making changes to the clone. I also tried creating a new job from scratch.

Dataflow Pipeline - “Processing stuck in step <STEP_NAME> for at least <TIME> without outputting or completing in state finish…”

Since I'm not allowed to ask my question in the thread where another person has the same problem (but is not using a template), I'm creating this new thread.
The problem: I'm creating a Dataflow job from a template in GCP to ingest data from Pub/Sub into BQ. The job deploys fine, but once it executes it gets "stuck" and does not write anything to BQ.
I can't do much because I can't choose the Beam version in the template. This is the error:
Processing stuck in step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite for at least 01h00m00s without outputting or completing in state finish
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:803)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:867)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:140)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:112)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)
Any ideas how to get this to work?
The issue comes from the step WriteSuccessfulRecords/StreamingInserts/StreamingWriteTables/StreamingWrite, which suggests a problem while writing the data.
Your error can be replicated (using either the Pub/Sub Subscription to BigQuery or the Pub/Sub Topic to BigQuery template) by:
Configuring the template with a table that doesn't exist.
Starting the template with a correct table and deleting it during the job execution.
In both cases the "stuck" message appears because the data is being read from Pub/Sub but the write is waiting for the table to become available. The error is reported every 5 minutes and resolves itself once the table is created.
To verify the table configured in your template, see the outputTableSpec property in the PipelineOptions shown in the Dataflow UI.
I had the same issue before. The problem was that I used NestedValueProviders to evaluate the Pub/Sub topic/subscription, and this is not supported for templated pipelines.
I was getting the same error, and the reason was that I had created an empty BigQuery table without specifying a schema. Make sure to create the BQ table with a schema before you load data into it via Dataflow.
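For that last point, a minimal sketch of creating the destination table with an explicit schema up front, using the BigQuery Java client (dataset, table, and field names are placeholders and must match what the template writes):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateOutputTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder schema; it must match the rows the Dataflow template inserts.
    Schema schema = Schema.of(
        Field.of("payload", StandardSQLTypeName.STRING),
        Field.of("publish_time", StandardSQLTypeName.TIMESTAMP));

    TableId tableId = TableId.of("my_dataset", "my_output_table");
    bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
    System.out.println("Created " + tableId);
  }
}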

How can I set up a delete trigger on AWS RDS

I have a setup on AWS RDS with MariaDB 10.3, with several DBs on the RDS instance. I'm trying to replicate a table (routes) from one DB (att) to another DB (pro) using triggers. I have triggers for insert, update and delete. The insert and update triggers work fine, while the delete trigger gives the error message below. I've tested all triggers locally and they work.
My trigger looks like this.
CREATE DEFINER=`root`@`%` TRIGGER routes_delete
AFTER DELETE ON `routes`
FOR EACH ROW
BEGIN
  DELETE FROM `pro`.`routes`
  WHERE `route_id` = OLD.route_id;
END
Error message
Query execution failed
Reason:
SQL Error [1442] [HY000]: (conn:349208) Can't update table 'routes' in
stored function/trigger because it is already used by statement which
invoked this stored function/trigger
Query is : DELETE FROM `att`.routes WHERE route_code = 78 AND company_id = 3
I don't understand what other statement is using the routes table, since there is nothing else linked to it. What adjustment is needed to get this to work on AWS RDS?
what other statement is using the routes table
The "other" query is the query that invoked the trigger.
...is already used by [the] statement which invoked this stored function/trigger
A trigger is not allowed to modify its own table. BEFORE INSERT and BEFORE UPDATE triggers can modify the current row before it is written to the table using the NEW alias, but that is the extent to which a trigger can modify the table where it is defined.
Triggers are subject to all the same limitations as stored functions, and a stored function...
Cannot make changes to a table that is already in use (reading or writing) by the statement invoking the stored function.
https://mariadb.com/kb/en/library/stored-function-limitations/

Importing data from Excel sheet to DynamoDB table

I am having a problem importing data from an Excel sheet into an Amazon DynamoDB table. I have the Excel sheet in an Amazon S3 bucket and want to import the data from this sheet into a table in DynamoDB.
Currently I am following Import and Export DynamoDB Data Using AWS Data Pipeline, but my pipeline is not working normally.
It gives me a WAITING_FOR_RUNNER status, and after some time the status changes to CANCELED. Please suggest what I am doing wrong, or whether there is another way to import data from an Excel sheet into a DynamoDB table.
The potential reasons are as follows:
Reason 1:
If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the WAITING_FOR_RUNNER state, ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks. If both values are empty or missing, the task cannot start because there is no association between the task and a worker to perform the tasks. In this situation, you've defined work but haven't defined what computer will do that work. If applicable, verify that the workerGroup value assigned to the pipeline component is exactly the same name and case as the workerGroup value that you configured for Task Runner.
Reason 2:
Another potential cause of this problem is that the endpoint and access key provided to Task Runner are not the same as those used by the AWS Data Pipeline console or the computer where the AWS Data Pipeline CLI tools are installed. You might have created new pipelines with no visible errors, but Task Runner polls the wrong location due to the difference in credentials, or polls the correct location with insufficient permissions to identify and run the work specified by the pipeline definition.