I am trying to run a SQL activity on a Redshift cluster through Data Pipeline. After the SQL activity, a few log entries need to be written to a table in Redshift (such as the number of rows affected and the error message, if any).
Requirement:
If the SQL activity finishes successfully, a row is written to that table with the 'error' column set to null;
otherwise, if the SQL activity fails with an error, that error message needs to be written to the 'error' column in the Redshift table.
Can we achieve this through Data Pipeline? If yes, how?
Thanks,
Ravi.
Unfortunately, you cannot do this directly with SqlActivity in Data Pipeline. The workaround is to write a Java program (or any other executable) that does what you want and schedule it via Data Pipeline using ShellCommandActivity.
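As a rough illustration of what such an executable might do, here is a minimal Python sketch. It assumes a DB-API connection (for Redshift that would typically come from a PostgreSQL driver such as psycopg2) and a hypothetical log table named activity_log with columns statement, rows_affected and error; none of these names come from the question. The placeholder style is sqlite3's `?`, because the sketch is written against the standard-library sqlite3 module so it can be tried locally; psycopg2 would use `%s` instead.

```python
import sqlite3  # stand-in for a Redshift driver; only the DB-API calls matter


def run_and_log(conn, sql, log_table="activity_log"):
    """Run `sql`, then record rows affected and any error in `log_table`.

    `log_table` and its columns are hypothetical names for illustration.
    Returns True if the statement succeeded, False otherwise.
    """
    cur = conn.cursor()
    rows, error = None, None
    try:
        cur.execute(sql)
        rows = cur.rowcount
        conn.commit()
    except Exception as exc:
        # Undo the failed statement and keep its message for the log row.
        conn.rollback()
        error = str(exc)
    # 'error' stays NULL on success, as the question requires.
    # Note: '?' is sqlite3's placeholder style; psycopg2 expects '%s'.
    cur.execute(
        f"INSERT INTO {log_table} (statement, rows_affected, error) "
        "VALUES (?, ?, ?)",
        (sql, rows, error),
    )
    conn.commit()
    return error is None
```

A script like this could then be the command that ShellCommandActivity runs, in place of the SqlActivity.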
Related
I am trying to create a trigger for a Cloud Function to copy events_intraday table data as soon as new data has been exported.
So far I have been following this answer to generate a sink from Cloud Logging to Pub/Sub.
I have only been able to find logs for events_YYYYMMDD tables, but none for events_intraday_YYYYMMDD, either in Cloud Logging or in the BigQuery job history (here are my queries for events tables and events_intraday tables in Cloud Logging).
Am I looking in the wrong place? How is it possible for the table to be updated without any logs being generated?
Update: One log entry is generated per day when the table is created, but I have yet to find any "table update" logs.
Try
protoPayload.authorizationInfo.permission="bigquery.tables.create"
protoPayload.methodName="google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName : "projects/'your_project'/datasets/'your_dataset'/tables/events_intraday_"
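If the filter is going to be reused for a sink, it may be handy to build it from the project and dataset names. The sketch below only assembles the three clauses above into one string joined with AND; the function name and the example project and dataset IDs are made up for illustration.

```python
def intraday_table_filter(project, dataset):
    """Build the Cloud Logging filter for events_intraday_ table creation."""
    clauses = [
        'protoPayload.authorizationInfo.permission="bigquery.tables.create"',
        'protoPayload.methodName='
        '"google.cloud.bigquery.v2.TableService.InsertTable"',
        # ':' is the "has" operator, so this matches any date suffix.
        f'protoPayload.resourceName:"projects/{project}/datasets/{dataset}'
        '/tables/events_intraday_"',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Hypothetical project and dataset IDs.
    print(intraday_table_filter("my-project", "analytics_123456"))
```

Note that this matches the table-creation entry only, which is consistent with the one log per day mentioned in the update.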
I want to retrieve data from BigQuery that arrives every hour, do some processing, and put the newly calculated variables in a new BigQuery table. The thing is that I've never worked with GCP before, and I now have to for my job.
I already have Python code to process the data, but it only works with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click Schedule query and then Create new scheduled query.
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the date and time when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
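The same settings can also be supplied programmatically. The following is a sketch only, assuming the google-cloud-bigquery-datatransfer Python client is installed and authenticated; the display name, the WRITE_APPEND disposition and the helper names are illustrative choices, not something the steps above prescribe.

```python
def scheduled_query_settings(query, dataset_id, table, schedule="every 2 hours"):
    """Collect the fields the console form asks for (names are illustrative)."""
    return {
        "display_name": f"scheduled {table}",
        "data_source_id": "scheduled_query",
        # Destination for query results: dataset and table.
        "destination_dataset_id": dataset_id,
        # Schedule options: the same text as the Custom schedule field.
        "schedule": schedule,
        "params": {
            "query": query,
            "destination_table_name_template": table,
            "write_disposition": "WRITE_APPEND",  # assumption: append each run
        },
    }


def create_scheduled_query(project_id, settings):
    """Create the scheduled query via the BigQuery Data Transfer Service."""
    # Imported lazily so the helper above works without the client installed.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    config = bigquery_datatransfer.TransferConfig(**settings)
    return client.create_transfer_config(
        parent=client.common_project_path(project_id),
        transfer_config=config,
    )
```

For example, `create_scheduled_query("my-project", scheduled_query_settings("SELECT ...", "my_dataset", "my_table"))`, where the project, dataset and table names are placeholders.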
According to Google's recommendation, when your data is in BigQuery and you want to transform it and store the result in BigQuery, it is always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, use SQL directly, you can create a User Defined Function (UDF) in BigQuery in JavaScript.
EDIT
If you have no information about when the data is updated in BigQuery, Dataflow won't help you with this. Dataflow can process real-time data only if the data is published to Pub/Sub. If not, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.
I have a BigQuery table and an external data import process that should add entries every day. I need to verify that the table contains current data (with a timestamp of today). Writing the SQL-query is not a problem.
My question is how best to set up such monitoring in GCP. Can Stackdriver execute custom BigQuery SQL? Or would a Cloud Function be more suitable? An App Engine application with a cron job? What's the best practice?
Not sure what the best practice is here, but one simple solution is to use a BigQuery scheduled query. Schedule the query, make it fail if something is wrong by using the ERROR() function, and configure the scheduled query to send a notification (it sends an email) if it fails.
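A possible shape for such a query is sketched below as a small Python helper that produces the SQL text. The table and column names are placeholders, and the check assumes the timestamp column can be passed to DATE(), which for a TIMESTAMP column means "today" is evaluated in UTC unless a time zone is supplied.

```python
def freshness_check_sql(table, ts_column):
    """Return a query that raises an error when no row is dated today.

    `table` and `ts_column` are placeholders for the real names.
    """
    return (
        "SELECT IF(\n"
        f"  COUNTIF(DATE({ts_column}) = CURRENT_DATE()) > 0,\n"
        "  'ok',\n"
        f"  ERROR('No rows dated today in {table}')\n"
        ") AS status\n"
        f"FROM `{table}`"
    )


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(freshness_check_sql("my-project.my_dataset.imports", "created_at"))
```

The printed SQL is what would be pasted into the scheduled query; when the check fails, the run fails and the failure notification is sent.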
I'm setting up a scheduled query in the new BigQuery UI as the project owner and have enabled the data transfer API. The query itself is a very simple SELECT * FROM table query written in standard SQL. The datasets I'm using are in the same region.
No matter how I set up the schedule options (start now, schedule start time, daily, weekly, etc.) or the destination dataset/table, I always get the same error:
"Error updating scheduled query: Request contains an invalid argument."
I have no idea which argument is invalid, it gives no more detail than that.
How do I solve this problem?
Trying to schedule the query in the classic BigQuery UI shows a more descriptive error, which illustrates the issue:
Error in creating a new transfer: BigQuery Data Transfer Service does not yet support location northamerica-northeast1.
The data must be stored in either the US or the EU at this time, it seems.
What I understand from the AWS Glue docs is that a crawler will help crawl and discover new data. However, I noticed that after I have crawled once, if new data goes into S3, it is already discovered when I query the Data Catalog from Athena, for example. So, can I say that I do not need to run a crawler every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
If the data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog to be able to query them via Athena.
However, if the data is not partitioned, or arrives in already registered partitions, then there is no need to run a crawler.
As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or by registering them manually.
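Registering a partition manually usually means an ALTER TABLE ... ADD PARTITION statement. The helper below is a sketch that builds such a statement for the key=value folder layout shown above; the table name and bucket are placeholders, and it assumes the partition columns are declared as strings, so every value is quoted.

```python
def add_partition_sql(table, location_root, **keys):
    """Build an Athena ALTER TABLE ... ADD PARTITION statement.

    Keys are given in folder order, e.g. year=2018, month=11, day=3.
    Assumes string-typed partition columns, so values are quoted.
    """
    spec = ", ".join(f"{k}='{v}'" for k, v in keys.items())
    path = "/".join(f"{k}={v}" for k, v in keys.items())
    root = location_root.rstrip("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{root}/{path}/'"
    )


if __name__ == "__main__":
    # Hypothetical table name and bucket.
    print(add_partition_sql("events", "s3://my-bucket/data",
                            year=2018, month=11, day=3))
```

Compared with MSCK REPAIR TABLE, this touches only the one partition named, which can be preferable when the table already has many partitions.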
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
If you have the schema then you don't need to use the crawler and you might get better results (the crawler assumes partition columns are strings for example).
As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.
MSCK can time out if you've added a lot of partitions. If it does, keep running it until it completes normally.
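The "keep running it" advice could be automated along these lines. This is a sketch, assuming a boto3 Athena client is passed in as `athena`; the database name, output location, attempt limit and polling interval are all placeholders chosen for illustration.

```python
import time


def repair_until_done(athena, table, database, output_location,
                      max_attempts=5, poll_seconds=5):
    """Re-run MSCK REPAIR TABLE until one run succeeds.

    `athena` is expected to behave like boto3.client("athena").
    Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        query_id = athena.start_query_execution(
            QueryString=f"MSCK REPAIR TABLE {table}",
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        # Poll until the query leaves the QUEUED/RUNNING states.
        while True:
            execution = athena.get_query_execution(QueryExecutionId=query_id)
            state = execution["QueryExecution"]["Status"]["State"]
            if state not in ("QUEUED", "RUNNING"):
                break
            time.sleep(poll_seconds)
        if state == "SUCCEEDED":
            return attempt
    raise RuntimeError(
        f"MSCK REPAIR TABLE {table} did not succeed in {max_attempts} attempts"
    )
```

A call might look like `repair_until_done(boto3.client("athena"), "events", "my_database", "s3://my-bucket/athena-results/")`, with all three names being hypothetical.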