Schedule BigQuery backfill programmatically - google-cloud-platform

I need to schedule backfills in the BigQuery Data Transfer Service at least a few hundred times, across several data sources.
The REST API is deprecated and the Python client is not helping either.
How can I automate this?

@Yun Zhang is right. I will elaborate a bit more.
The deprecated method required you to set up transfer runs on a specific schedule, and creating regular transfers was a different flow that used another endpoint.
Now, by using the ScheduleOptions argument, we can set the times at which transfers start, so we can schedule (or not schedule) the transfers ourselves. This is why using that endpoint is the right path.
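For completeness, here is a minimal sketch of what an automated backfill might look like with the current Python client (google-cloud-bigquery-datatransfer). The project, location and config IDs are placeholders, and exact field names may differ slightly between client versions:

```python
# Hedged sketch: triggering backfill runs for one transfer config over a
# date range via the BigQuery Data Transfer Service Python client.
# PROJECT_ID / LOCATION / CONFIG_ID are placeholders.
import datetime

from google.cloud import bigquery_datatransfer_v1

client = bigquery_datatransfer_v1.DataTransferServiceClient()

# Fully qualified name of an existing transfer config.
parent = "projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID"

start = datetime.datetime(2021, 1, 1, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2021, 1, 31, tzinfo=datetime.timezone.utc)

# start_manual_transfer_runs schedules one backfill run per period in the
# requested window; looping over a list of configs covers several sources.
response = client.start_manual_transfer_runs(
    request={
        "parent": parent,
        "requested_time_range": {"start_time": start, "end_time": end},
    }
)
for run in response.runs:
    print(run.name, run.state)
```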
Hope this is helpful! :)

Related

Identify AWS service for fast retrieval of data

I have a generic question; actually, I am hunting for a solution to a problem.
Currently we generate reports directly from an Oracle database. From a performance perspective, we want to migrate the data from Oracle to whichever AWS service performs better, and then pass data from that AWS service to our reporting software.
Could you please advise which service would be ideal for this?
Thanks,
Vishwajeet
To answer well, additional info is needed:
How much data is needed to generate a report?
Are there any transformed/computed values needed?
What is good performance? 1 second? 30 seconds?
What are the current query times on Oracle, and what kinds of queries are involved? Joins, aggregations, etc.?

Can we use Google cloud function to convert xls file to csv

I am new to Google Cloud Functions. My requirement is to trigger a Cloud Function on receiving a Gmail message and convert the xls attachment from the email to csv.
Can we do this using GCP?
Thanks in advance!
Very briefly: that is possible, as far as I know.
But.
You might find that, in order to automate this task in a reliable, robust and self-healing way, you need half a dozen Cloud Functions, Pub/Sub topics, maybe a Cloud Storage bucket, maybe a Firestore collection, Secret Manager, a custom service account with the relevant IAM permissions, and so on. Maybe more than a dozen, or two dozen, different GCP resources. And, obviously, the code for those Cloud Functions has to be developed. Altogether, it may not be very easy or quick to implement.
At the same time, I have personally seen (and contributed to the development of) a functional component, based on Cloud Functions, that did exactly what you would like to achieve. And it ran in production.
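To give a feel for the core conversion step alone (not the Gmail trigger part), here is a minimal sketch of a storage-triggered Cloud Function that converts an uploaded spreadsheet to CSV. The output bucket name is a placeholder, and pandas (with xlrd/openpyxl) plus google-cloud-storage are assumed to be listed in requirements.txt:

```python
# Hedged sketch: a background Cloud Function triggered when an .xls/.xlsx
# file lands in a bucket; it rewrites the first sheet as CSV elsewhere.
import io

import pandas as pd
from google.cloud import storage

OUTPUT_BUCKET = "my-csv-output-bucket"  # hypothetical bucket name


def convert_xls_to_csv(event, context):
    """Triggered by a finalized Cloud Storage object."""
    bucket_name = event["bucket"]
    blob_name = event["name"]
    if not blob_name.lower().endswith((".xls", ".xlsx")):
        return  # ignore non-spreadsheet uploads

    client = storage.Client()
    data = client.bucket(bucket_name).blob(blob_name).download_as_bytes()

    # Read the first sheet and re-serialize it as CSV.
    frame = pd.read_excel(io.BytesIO(data))
    csv_bytes = frame.to_csv(index=False).encode("utf-8")

    out_name = blob_name.rsplit(".", 1)[0] + ".csv"
    client.bucket(OUTPUT_BUCKET).blob(out_name).upload_from_string(
        csv_bytes, content_type="text/csv"
    )
```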

PythonOperator or SimpleHttpOperator to make HTTP get request and save results to GCP storage

I am in the early stages of learning Airflow. I am learning it to build a simple ETL (ELT?) data pipeline, and am in the process of figuring out the architecture for the pipeline (which operators I should use). The basics of my data pipeline are going to be:
Make HTTP GET request from API for raw data.
Save raw JSON results into a GCP bucket.
Transform the data and save into a BigQuery database.
...and the pipeline will be scheduled to run once daily.
As the title suggests, I am trying to determine whether the SimpleHttpOperator or the PythonOperator is more appropriate for making the HTTP GET requests for data. In this somewhat related stackoverflow post, the author simply concluded:
Though I think I'm going to simply use the PythonOperator from now on
It seems simple enough to write a 10-20 line Python script that makes the HTTP request, identifies the GCP storage bucket, and writes to that bucket. However, I'm not sure if this is the best approach for this type of task (call API --> get data --> write to GCP storage bucket).
Any help or thoughts on this, any example links on building similar pipelines, etc., would be greatly appreciated. Thanks in advance.
I recommend seeing Airflow as the glue between processing steps. The processing performed inside Airflow should be limited to conditionally triggering steps, looping over steps, and handling errors.
Why? Because if tomorrow you choose to change your workflow tool, you won't have to rewrite your processing code; you will only have to rewrite the workflow logic (because you changed your workflow tool). A simple separation of concerns.
Therefore, I recommend deploying your 10-20 lines of Python code as a Cloud Function and setting up a SimpleHttpOperator to call it.
In addition, a function is far easier to run and to read than a whole workflow, and deployments and updates will also be easier.
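As a rough illustration of that split, a minimal DAG might look like the sketch below. The connection id, endpoint and import path are assumptions and depend on your Airflow and provider versions:

```python
# Hedged sketch: an Airflow DAG that triggers a Cloud Function (exposed
# over HTTP) once a day via SimpleHttpOperator. The function itself does
# the work: call the API and write the raw JSON to the GCS bucket.
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="api_to_gcs_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    trigger_ingest = SimpleHttpOperator(
        task_id="trigger_ingest_function",
        http_conn_id="http_my_function",  # connection holding the function's base URL
        endpoint="ingest",                # hypothetical path
        method="POST",
        log_response=True,
    )
```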

Schedule loading data from GCS to BigQuery periodically

I've researched this and have currently come up with a strategy using Apache Airflow, but I'm still not sure how to do it. Most of the blogs and answers I'm finding are just code, rather than material that helps me understand it better. Also, please suggest if there is a better way to do this.
I also got an answer like using Background Cloud Function with a Cloud Storage trigger.
You can use BigQuery's Cloud Storage transfers, but note that it's still in BETA.
It gives you the option to schedule transfers from Cloud Storage to BigQuery with certain limitations.
Most of the blogs and answers I'm finding are just code
Apache Airflow comes with a rich UI for many tasks, but that doesn't mean you are not supposed to write code to get your task done.
For your case, you need to use the BigQuery command-line operator for Apache Airflow.
A good walkthrough of how to do this can be found in this link.
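For illustration, here is a minimal sketch of one way to do the load from Airflow, using the GCS-to-BigQuery transfer operator from the Google provider package (a different operator than the command-line one mentioned above). Bucket, dataset and table names are placeholders, and the import path depends on your provider version:

```python
# Hedged sketch: a daily Airflow DAG that loads newline-delimited JSON
# files from a GCS bucket into a BigQuery table.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bigquery_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_raw_json",
        bucket="my-raw-data-bucket",              # hypothetical bucket
        source_objects=["raw/{{ ds }}/*.json"],   # one folder per run date
        destination_project_dataset_table="my_project.my_dataset.raw_events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )
```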

Are there any Schedulers for AWS/DynamoDB?

We're trying to move to AWS and to use DynamoDB. It would be nice to keep everything in DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use for a scheduler. There are going to be dynamically set schedules, in the range of thousands or more, possibly with many running at the same time. For languages, Java or at least something on the JVM would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler, I'm thinking of something general-purpose like Quartz: I want to set a cron expression and have it run the code I give it at that time. This isn't for running some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems like a bit too much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant, since that might mean biting off a lot of hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression:
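For example, rate(5 minutes) or cron(0 12 * * ? *). Below is a hedged sketch of a minimal Python handler such a scheduled rule could invoke; the schedule itself lives on the CloudWatch Events rule, not in the code, and the handler body is a placeholder for the periodic task:

```python
# Hedged sketch: a minimal Python Lambda handler wired to a scheduled
# CloudWatch Events rule, e.g. rate(5 minutes) or cron(0 12 * * ? *).
def handler(event, context):
    # "event" is the scheduled-event payload; "time" carries the
    # invocation timestamp for a scheduled rule.
    print("Scheduled invocation at", event.get("time"))
    # ... run the periodic task (the "code I give it") here ...
```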