Dataprep: No scheduled destinations set. Create an output to set a destination - google-cloud-platform

What does this error mean?
No scheduled destinations set. Create an output to set a destination.
I am getting this error in Dataprep when I attempt to create a run schedule for my jobs. They work perfectly when I simply hit Run, but this error appears when I try to schedule them.

As per the Dataprep documentation (emphasis mine):
To add a scheduled execution of the recipes in your flow:
1. Define the scheduled time and interval of execution at the flow level. See Add Schedule Dialog. After the schedule has been created, you can review, edit, or delete the schedule through the Clock icon.
2. Define the scheduled destinations for each recipe through its output object. These destinations are targets for the scheduled job. See View for Outputs below.
You'll find detailed instructions on how to set these up here.

Related

GCP BigQuery - Verify successful execution of stored procedure

I have a BigQuery routine that inserts records into a BQ Table.
I am looking to have an Eventarc trigger that invokes Cloud Run and performs some action on successful execution of the BigQuery routine.
From Cloud Logging, I can see two events that would seem to confirm the successful execution of the BQ Routine.
protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob"
protoPayload.metadata.tableDataChange.insertedRowsCount
However, this does not give me the Job ID.
So, I am looking at this event instead:
protoPayload.methodName="jobservice.jobcompleted"
Would it be correct to assume that, if protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error is empty, then the stored procedure execution was successful?
Thanks!
Decided to go with protoPayload.methodName="jobservice.jobcompleted" in this case.
It gives the job ID at protoPayload.requestMetadata.resourceName, the status at protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state, and any errors at protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error.
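For illustration, here is a minimal sketch of a Cloud Run handler that applies this check. It assumes the Eventarc trigger delivers the audit LogEntry as the JSON request body (the exact envelope depends on your trigger configuration), and the route and function names are placeholders:

from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_job_completed():
    # Assumption: the audit LogEntry arrives as the JSON body of the push request.
    entry = request.get_json(silent=True) or {}
    payload = entry.get("protoPayload", {})

    job_id = payload.get("requestMetadata", {}).get("resourceName")
    job_status = (payload.get("serviceData", {})
                         .get("jobCompletedEvent", {})
                         .get("job", {})
                         .get("jobStatus", {}))

    if not job_status.get("error"):
        # An empty or absent error object means the routine completed successfully.
        print(f"Job {job_id} succeeded, state={job_status.get('state')}")
        # ... perform the follow-up action here ...
    else:
        print(f"Job {job_id} failed: {job_status['error']}")

    return ("", 204)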

How to process files serially in cloud function?

I have written a Cloud Storage triggered Cloud Function. I have 10-15 files landing at 5-second intervals in a Cloud Storage bucket, and the function loads their data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, since all the files write to the same table.
Currently the Cloud Function is triggered for multiple files at a time, and the BigQuery operation fails because multiple invocations try to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in Advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
Firstly, use the notification capability of Google Cloud Storage to sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (OBJECT_FINALIZE), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. That way only one instance can run at a time, so there is no concurrency.
Finally, create a Pub/Sub push subscription on the topic, with or without a filter, that calls your function over HTTP.
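As a rough illustration of this setup, here is a minimal sketch of such an HTTP function using functions-framework; it assumes a Pub/Sub push subscription delivering Cloud Storage notifications, and the function name and attribute handling are only an example:

import functions_framework

@functions_framework.http
def process_gcs_event(request):
    # Pub/Sub push delivery wraps the message in a JSON envelope.
    envelope = request.get_json(silent=True) or {}
    message = envelope.get("message", {})

    # Cloud Storage notifications carry bucket/object info in the message attributes.
    attrs = message.get("attributes", {})
    bucket = attrs.get("bucketId")
    name = attrs.get("objectId")

    if attrs.get("eventType") == "OBJECT_FINALIZE":
        print(f"Processing gs://{bucket}/{name}")
        # ... load the file into BigQuery here, waiting for the job to finish ...

    # Any 2xx response acknowledges the message so Pub/Sub doesn't redeliver it.
    return ("", 204)

Deployed with max instances set to 1, only one message is processed at a time.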
EDIT
Thanks to your code, I understood what happens. BigQuery is an asynchronous system: when you run a query or a load job, a job is created and it runs in the background.
In Python you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did this correctly in the _insert_into_bigquery_dwh function, but it's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh job works on old data because the staging load isn't yet finished when you trigger this job.
If the staging load takes, let's say, 10 seconds and runs in the "background" (you don't wait for its end explicitly in your code) while the dwh load takes 1 second, the next file is processed at the end of the dwh function, even though the staging load is still running in the background. And that leads to your issue.
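For reference, a minimal sketch of a load that blocks until the job finishes, using the BigQuery client library; the function name, table ID and DataFrame are placeholders standing in for your staging/dwh loads:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

def insert_and_wait(df: pd.DataFrame, table_id: str) -> None:
    # Truncate-and-load the DataFrame into the target table.
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    # Wait for the load job to complete; raises an exception if the job failed.
    job.result()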
The architecture you describe isn't the same as the one from the documentation you linked. Note that in the flow diagram and the code samples the storage event triggers the Cloud Function, which streams the data directly to the destination table. Since BigQuery allows multiple concurrent streaming inserts, several function instances can run at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are sent: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if they are processed one at a time. Limiting the function to one instance may lead to overhead for that reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, this may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration deadline for it using the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise the great answer from Guillaume would completely get the job done.
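For illustration, a minimal sketch of setting such an expiration with the BigQuery Python client; the table ID is a placeholder, and expires is assigned as a property and then pushed with update_table:

import datetime
from google.cloud import bigquery

client = bigquery.Client()

def set_table_expiration(table_id: str, hours: int = 1) -> None:
    # Give the per-message table a short time-to-live so leftovers clean themselves up.
    table = client.get_table(table_id)
    table.expires = (datetime.datetime.now(datetime.timezone.utc)
                     + datetime.timedelta(hours=hours))
    # Send only the 'expires' field in the update request.
    client.update_table(table, ["expires"])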

Trigger event when specific 3 AWS jobs are completed

I have submitted 3 jobs in parallel in AWS Batch and I want to trigger an action when all 3 jobs are completed.
Something like: I should be able to specify the 3 job IDs and update the DB once all 3 jobs are done.
I could do this easily with long polling, but I wanted to do something event-based.
I need your help with this.
The easiest option would be to create a fourth Batch job that specifies the other three jobs as dependencies. This job will sit in the PENDING state until the other three jobs have succeeded, and then it will run. Inside that job, you could update the DB or do whatever other actions you wanted.
One downside to this approach is that if one of the jobs fails, the pending job will automatically go into a FAILED state without running.
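A rough sketch of that approach with boto3; the queue, job definition and job IDs below are placeholders:

import boto3

batch = boto3.client("batch")

# IDs of the three jobs already submitted in parallel (placeholders).
upstream_job_ids = ["job-id-1", "job-id-2", "job-id-3"]

# The fourth job stays PENDING until all dependencies succeed, then runs.
response = batch.submit_job(
    jobName="update-db-after-batch",
    jobQueue="my-job-queue",
    jobDefinition="my-update-db-job-def",
    dependsOn=[{"jobId": job_id} for job_id in upstream_job_ids],
)
print("Submitted finalizer job:", response["jobId"])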

PubSub resource setup failing for Dataflow job when assigning timestampLabel

After modifying my job to start using timestampLabel when reading from PubSub, resource setup seems to break every time I try to start the job with the following error:
(c8bce90672926e26): Workflow failed. Causes: (5743e5d17dd7bfb7): Step setup_resource_/subscriptions/project-name/subscription-name__streaming_dataflow_internal25: Set up of resource /subscriptions/project-name/subscription-name__streaming_dataflow_internal failed
where project-name and subscription-name represent the actual values of my project and PubSub subscription I'm trying to read from. Before trying to attach timestampLabel on message entry, the job was working correctly, consuming messages from the specified PubSub subscription, which should mean that my API/network settings are OK.
I'm also noticing two warnings with the payload
Internal Issue (119d3b54af281acf): 65177287:8503
but no more information can be found in the worker logs. For the few seconds that my job is setting up I can see the timestampLabel being set in the first step of the pipeline. Unfortunately I can't find any other cases or documentation about this error.
When using the timestampLabel feature, a second subscription is created for tracking purposes. Double-check the permission settings on your topic to make sure they match the required permissions.
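For context, this is roughly how a timestamp attribute is attached to the Pub/Sub read; the snippet below uses the Beam Python SDK (where the option is called timestamp_attribute rather than timestampLabel) and placeholder names, so treat it as an illustration rather than the poster's exact pipeline:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming must be enabled for Pub/Sub reads; names below are placeholders.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/project-name/subscriptions/subscription-name",
            # The named message attribute is used as the element's event timestamp,
            # the Python counterpart of timestampLabel in the older Java SDK.
            timestamp_attribute="ts",
        )
        | "Print" >> beam.Map(print)
    )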

Sitecore Scheduled Job: Unable to run

I am very new to Sitecore. I am trying to create a task: after creating the task, I configured the command and the task in the Content Editor, but I still don't see a Run Now option for my task in the Content Editor. I need help. I also want to know where the logs of scheduled jobs are written.
There are 2 places where you can define a custom task:
In database
In config file
If you decide to go with the 1st option:
A task item must be created under /sitecore/system/tasks/schedules in the “core” database (default behavior).
No matter what schedule you set on that item, it may never be executed if you do not have the right DatabaseAgent looking after that task item.
The DatabaseAgent periodically checks task items and, if a task must be executed (based on the value set in the Scheduling field), it executes the actual code.
By default the DatabaseAgent is called every 10 minutes.
If you decide to go with the 2nd option, check this article first.
In short, you need to define your task class and start method in the config files (check the /sitecore/admin/showconfig.aspx page to make sure your config changes were applied successfully):
<scheduling>
  <!-- Time between checking for scheduled tasks waiting to execute -->
  <frequency>00:00:05</frequency>
  <agent type="NameSpace.TaskClass" method="MethodName" interval="00:10:00"/>
</scheduling>
As specified in the other answers, you can use a config file or the database to execute your task. However, it seems that you want to run it manually.
I have a custom module on the Sitecore Marketplace which allows you to select the task you want to run. Here is the link.
In brief, you need to go to the Sitecore Control Panel, then click on Administration and lastly click on Run Agent.
It will open a window where you can select the task. I am assuming that the task you have implemented does not depend on the item you are on when triggering the job.