I have a Python application that fetches data from one database and loads it into another. I want to schedule this process to run on a daily basis. How can I implement this?
I want to retrieve data from BigQuery that arrives every hour, do some processing, and push the newly calculated variables into a new BigQuery table. The thing is that I've never worked with GCP before, and now I have to for my job.
I already have Python code to process the data, but it works only with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery console, write your query.
Once the query is correct, click Schedule query and then Create new scheduled query.
Pay attention to these two fields:
Schedule options: there are some preconfigured schedules such as daily, monthly, etc. If you need to run the query every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the date and time at which your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Keep in mind that this option is not available if you use scripting; in other words, your transformations must be plain SQL, not scripting.
Click Schedule.
After that, your query will run according to your schedule and destination-table configuration.
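The same scheduled query can also be created programmatically. Below is a minimal sketch using the BigQuery Data Transfer Service Python client; the project, dataset, table, and query are placeholder values:

# A minimal sketch of creating a scheduled query programmatically
# (pip install google-cloud-bigquery-datatransfer).
# Project, dataset, table, and query below are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",       # dataset that receives the results
    display_name="hourly transformation",
    data_source_id="scheduled_query",          # identifies this as a scheduled query
    params={
        "query": "SELECT * FROM `my-project.my_dataset.raw_events`",
        "destination_table_name_template": "processed_events",
        "write_disposition": "WRITE_APPEND",
    },
    schedule="every 2 hours",                  # same syntax as the Custom schedule field
)

created = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created scheduled query:", created.name)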
In line with Google's recommendations, when your data are already in BigQuery and the transformed results are also to be stored in BigQuery, it is always quicker and cheaper to do the processing in BigQuery itself, as long as you can express it in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, express your transformations directly in SQL, you can create user-defined functions (UDFs) in BigQuery in JavaScript.
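For illustration, here is a small sketch of a JavaScript UDF, submitted from Python through the google-cloud-bigquery client; the sales table and price column are made-up names:

# A temporary JavaScript UDF embedded in a query, run from Python.
# The table `my-project.my_dataset.sales` and its price column are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE TEMP FUNCTION add_tax(price FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS "return price * 1.2;";

SELECT add_tax(price) AS gross_price
FROM `my-project.my_dataset.sales`;
"""

for row in client.query(sql).result():
    print(row.gross_price)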
EDIT
If you have no information about when the data are loaded into BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if the data arrive through Pub/Sub; otherwise, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries are the right solution if you use BigQuery for your processing.
Does anyone know a way to generate a record in a table without the user having to interact with the system?
I need to generate something like a notification or reminder, built from various data obtained from other tables, similar to a report.
Thank you
To run periodic tasks, you will need some sort of task scheduler such as Celery or Huey. With that in place, you can create and save instances of whatever model you have in mind from the task functions, and the scheduler will repeat the job periodically.
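As a rough sketch with Celery (the Reminder model, app paths, and schedule are made-up placeholders):

# tasks.py: a periodic task that writes a "report" record.
# Reminder and myapp are hypothetical stand-ins for your own model and app.
from celery import shared_task

from myapp.models import Reminder

@shared_task
def generate_daily_reminder():
    # Gather whatever data you need from other tables here,
    # then persist the result as a new record.
    Reminder.objects.create(message="Daily report generated")

# settings.py: have celery beat trigger the task every day at 06:00.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "generate-daily-reminder": {
        "task": "myapp.tasks.generate_daily_reminder",
        "schedule": crontab(hour=6, minute=0),
    },
}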
I have a BigQuery table and an external data import process that should add entries every day. I need to verify that the table contains current data (with a timestamp of today). Writing the SQL query is not a problem.
My question is how best to set up such monitoring in GCP. Can Stackdriver execute custom BigQuery SQL? Would a Cloud Function be more suitable? An App Engine application with a cron job? What's the best practice?
Not sure what the best practice is here, but one simple solution is to use a BigQuery scheduled query: schedule the query, make it fail when something is wrong by using the ERROR() function, and configure the scheduled query to notify you (it sends an email) whenever it fails.
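For example, the scheduled query could be a freshness check along these lines; the table and timestamp column below are placeholders, and ERROR() aborts the query, which in turn triggers the failure notification:

# The check itself is plain SQL; the string below is what you would paste
# into the scheduled-query editor. Table and column names are hypothetical.
FRESHNESS_CHECK = """
SELECT
  IF(
    COUNT(*) > 0,
    'ok',
    ERROR('No rows with a timestamp of today')  -- makes the scheduled query fail
  ) AS status
FROM `my-project.my_dataset.my_table`
WHERE DATE(inserted_at) = CURRENT_DATE()
"""

# Optional: run the check once from Python before scheduling it.
from google.cloud import bigquery
print(list(bigquery.Client().query(FRESHNESS_CHECK).result()))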
Suppose I have multiple jobs that update/load the same table. Following the semaphore concept, if one process is loading data into the table, the other processes wait until the resource for that table is free. I would like to know: is there any semaphore concept for loading data into a BigQuery table using Dataflow? If yes, how should such a scenario be handled for a BigQuery table load using Dataflow?
I don't believe Dataflow has any knowledge of the table's activity; it just sends the requested update to BigQuery as a job.
BigQuery receives the job and then places it in the queue for the given table, so the whole "semaphore concept" is handled internally by BigQuery for that table.
For example, imagine you run three queries in parallel that update the same table, two of them via Dataflow and the other via a script.
All three go to the same queue, and BigQuery processes them one by one (each starting after the previous one completes), in the order in which they arrived.
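In other words, a Dataflow pipeline simply writes to the table and lets BigQuery serialize the work. A minimal Apache Beam sketch, with a hypothetical table and schema:

# Minimal Beam pipeline writing to BigQuery; no locking is done here, since
# concurrent jobs against the same table are queued by BigQuery itself.
# The table reference and schema are hypothetical.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateRows" >> beam.Create([{"id": 1, "value": "a"}])
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,value:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )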
I have a Django application.
One of my models looks like this:
class MyModel(models.Model):
    def house_cleaning(self):
        # clean up this instance's data
        ...
Every time I update an instance of MyModel, I need to clean up the instance's data N days later. So I'd like to schedule a job to call
this_instance.house_cleaning()
N days from now.
Is there any job queue that would allow me to:
Integrate well with Django - allow me to call a method of individual model instances
Only run jobs that are scheduled to run today
Ideally handle failures gracefully
Thanks
django-chronograph might be good for your use case. If you write your cleanup jobs as Django management commands, you can then schedule them to run at a given time. It uses Unix cron behind the scenes.
Is there any reason why a cron job wouldn't work? Or something like django-cron that behaves the same way? It's pretty easy to write stand-alone Django scripts. If you want to trigger house cleaning a certain number of days after some change to your model, why not add a date field to the model that is set to N days in the future whenever the job needs to be scheduled? You could then run a script daily that pulls all records where the date is <= today, calls each instance's house_cleaning() method, and clears the date field. If an exception is raised during the process, it's easy enough to log it or dispatch an email. A sketch of such a command is below.
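A minimal sketch of that daily script as a Django management command, assuming a hypothetical cleanup_due DateField (with null=True) has been added to MyModel; the app path is also made up:

# myapp/management/commands/run_house_cleaning.py
# cleanup_due is an assumed nullable DateField set to N days in the future on update.
import logging
from datetime import date

from django.core.management.base import BaseCommand

from myapp.models import MyModel  # hypothetical app path

logger = logging.getLogger(__name__)

class Command(BaseCommand):
    help = "Run house_cleaning() on every instance whose cleanup date has arrived."

    def handle(self, *args, **options):
        due = MyModel.objects.filter(cleanup_due__lte=date.today())
        for instance in due:
            try:
                instance.house_cleaning()
                instance.cleanup_due = None   # clear the flag so it does not run twice
                instance.save(update_fields=["cleanup_due"])
            except Exception:
                logger.exception("house_cleaning failed for pk=%s", instance.pk)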