How to schedule BigQuery Data Transfer Service using the bq command

I am trying to create a transfer with the BigQuery Data Transfer Service (DTS). I used the bq command to create the transfer configuration,
and it was created successfully.
I need to specify a custom schedule time using the bq command.
Is it possible to set a custom schedule while creating the transfer? Here is a sample bq command:
bq mk --transfer_config \
--project_id='My project' \
--target_dataset='My Dataset' \
--display_name='test_bqdts' \
--params='{"data_path":<data_path>,
"destination_table_name_template":<destination_table_name>,
"file_format":<>,
"ignore_unknown_values":"true",
"access_key_id": "access_key_id",
"secret_access_key": "secret_access_key"
}' \
--data_source=data_source_id
NOTE: When you create an Amazon S3 transfer using the command-line tool, the transfer configuration is set up using the default value for Schedule (every 24 hours).

You can use the --schedule flag, as described in the documentation for scheduling queries:
Option 2: Use the bq mk command.
Scheduled queries are a kind of transfer. To schedule a query, you can
use the BigQuery Data Transfer Service CLI to make a transfer
configuration.
Queries must be in StandardSQL dialect to be scheduled.
Enter the bq mk command and supply the transfer creation flag
--transfer_config. The following flags are also required:
--data_source
--target_dataset (Optional for DDL/DML queries.)
--display_name
--params
Optional flags:
--project_id is your project ID. If --project_id isn't specified, the default project is used.
--schedule is how often you want the query to run. If --schedule isn't specified, the default is 'every 24 hours' based on creation
time.
For DDL/DML queries, you can also supply the --location flag to specify a particular region for processing. If --location isn't
specified, the global Google Cloud location is used.
--service_account_name is for authenticating your scheduled query with a service account instead of your individual user account. Note:
Using service accounts with scheduled queries is in beta.
bq mk \
--transfer_config \
--project_id=project_id \
--target_dataset=dataset \
--display_name=name \
--params='parameters' \
--data_source=data_source
For example, to set a 24-hour schedule, use --schedule='every 24 hours'.
You can find the complete reference for the time syntax here
I hope it helps
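To make the role of --schedule concrete, here is a minimal Python sketch that only assembles the bq mk argument list and prints it. The function name build_transfer_command and all the values passed to it are hypothetical; the schedule string is one of the example formats given in the Data Transfer Service reference, and whether a given data source accepts it is something to verify for your case.

```python
import json
import shlex
from typing import Dict, List


def build_transfer_command(project_id: str, target_dataset: str, display_name: str,
                           data_source: str, params: Dict[str, str],
                           schedule: str = "every 24 hours") -> List[str]:
    """Return the argv list for `bq mk --transfer_config` with an explicit schedule."""
    return [
        "bq", "mk", "--transfer_config",
        f"--project_id={project_id}",
        f"--target_dataset={target_dataset}",
        f"--display_name={display_name}",
        f"--params={json.dumps(params)}",   # the flag expects a JSON string
        f"--data_source={data_source}",
        f"--schedule={schedule}",
    ]


# All values below are placeholders, not real resources.
argv = build_transfer_command(
    project_id="my-project",
    target_dataset="my_dataset",
    display_name="test_bqdts",
    data_source="data_source_id",
    params={"ignore_unknown_values": "true"},
    schedule="1st,3rd monday of month 15:30",
)
print(shlex.join(argv))
```

The printed line can be pasted into a shell, or argv can be handed to subprocess.run(argv, check=True) once the placeholders are replaced.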

Related

Getting details of a BigQuery job using gcloud CLI on local machine

I am trying to process the billed bytes of each BigQuery job run by all users. I was able to find the details in the BigQuery UI under Project History. Also, running bq --location=europe-west3 show --job=true --format=prettyjson JOB_ID in Google Cloud Shell gives exactly the information I want (the SQL query, billed bytes, and run time for each BigQuery job).
For the next step, I want to access the JSON returned by the above command on my local machine. I have already configured the gcloud CLI properly, and I am able to list BigQuery jobs using gcloud alpha bq jobs list --show-all-users --limit=10.
I select a job ID and run the following command: gcloud alpha bq jobs describe JOB_ID --project=PROJECT_ID,
but I get (gcloud.alpha.bq.jobs.describe) NOT_FOUND: Not found: Job PROJECT_ID:JOB_ID--toyFH. This is possibly because of the creation and end times,
as shown here
What am I doing wrong? Is there another way to get details of a bigquery job using gcloud cli (maybe there is a way to get billed bytes with query details using Python SDK)?
You can get job details through different APIs, or the way you are doing it now, but first: why are you using the alpha version of the bq commands?
To do it in python, you can try something like this:
from google.cloud import bigquery


def get_job(
    client: bigquery.Client,
    job_id: str,
    location: str = "us",  # must match the location the job ran in
) -> None:
    """Print basic details of a BigQuery job."""
    job = client.get_job(job_id, location=location)
    print(f"{job.location}:{job.job_id}")
    print(f"Type: {job.job_type}")
    print(f"State: {job.state}")
    print(f"Created: {job.created.isoformat()}")
The job object exposes more properties than the ones printed here. Also check the status of the job in the console first, so that you can compare the two.
You can find more details here: https://cloud.google.com/bigquery/docs/managing-jobs#python
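For the billed bytes, query text, and run time that the question asks about, a sketch along these lines may help. It assumes the job objects returned by the google-cloud-bigquery client, where query jobs carry query and total_bytes_billed and every job carries started and ended; the helper names summarize_job and print_recent_jobs are made up for illustration.

```python
from typing import Any, Dict


def summarize_job(job: Any) -> Dict[str, Any]:
    """Collect query text, billed bytes, and run time from a job object.

    getattr() with a default is used because only query jobs have
    `query` and `total_bytes_billed`; load or copy jobs do not.
    """
    started = getattr(job, "started", None)
    ended = getattr(job, "ended", None)
    runtime = (ended - started).total_seconds() if started and ended else None
    return {
        "job_id": job.job_id,
        "query": getattr(job, "query", None),
        "billed_bytes": getattr(job, "total_bytes_billed", None),
        "runtime_seconds": runtime,
    }


def print_recent_jobs(client: Any, limit: int = 10) -> None:
    """Print a summary of the most recent jobs of all users in the client's project."""
    for job in client.list_jobs(all_users=True, max_results=limit):
        print(summarize_job(job))
```

With the library installed and credentials configured, this would be called as print_recent_jobs(bigquery.Client()) after from google.cloud import bigquery.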

BigQuery error in load operation: URI not found

I have, in the same GCP project, a BigQuery dataset and a cloud storage bucket, both within the region us-central1. The storage bucket has a single parquet file located in it. When I run the below command:
bq load \
--project_id=myProject --location=us-central1 \
--source_format=PARQUET \
myDataSet:tableName \
gs://my-storage-bucket/my_parquet.parquet
It fails with the below error:
BigQuery error in load operation: Error processing job '[job_no]': Not found: URI gs://my-storage-bucket/my_parquet.parquet
Removing the --project_id or --location flags doesn't affect the outcome.
Figured it out - the documentation is incorrect, I actually had to declare the source as gs://my-storage-bucket/my_parquet.parquet/part* and it loaded fine.
There were some internal issues with BigQuery on 3rd March, and they have now been fixed.
I have confirmed that the following command successfully loads a Parquet file from Cloud Storage into a BigQuery table using the bq command:
bq load --project_id=PROJECT_ID \
--source_format=PARQUET \
DATASET.TABLE_NAME gs://BUCKET/FILE.parquet
Please note that, according to the official BigQuery documentation, the table name has to be declared as DATASET.TABLE_NAME (in the post, I can see : instead of .).

Cloud DataFlow SQL from BigQuery UI cannot read Cloud Storage filesets: "Table not found: datacatalog.entry"

I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI.
My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog).
Following the GCP documentation, I was able to define my fileset, assign it a schema, and visualize it in the Resources tab of the BigQuery UI.
But then I cannot launch any Dataflow job in the Query Editor, because I get the following error message in the query validator: Table not found: datacatalog.entry.location.entry_group.fileset_name...
Is this an issue of some APIs not being authorized?
Thanks for your help!
You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided. For example, using the sales regions example from the docs:
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
When you are building your DataFlow SQL query:
SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions as tr
INNER JOIN
datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
ON tr.state = sr.state_code
Check the full path; it should look like the one in the query above:
datacatalog.entry, then your project ID, next your location (us-central1 in this example), next your entry group ID (dataflow_sql_dataset in this example), and finally your entry ID (us_state_salesregions in this example).
Let me know if this works for you.
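As a small illustration of the segment order used in the query above, the following sketch assembles the path from its parts. The helper name fileset_table_path is hypothetical, and the backticks around the project ID and location simply mirror the example query, where those segments contain hyphens.

```python
def fileset_table_path(project_id: str, location: str,
                       entry_group: str, entry_id: str) -> str:
    """Build a Dataflow SQL table reference for a Data Catalog fileset entry.

    Segment order follows the example query: project, location,
    entry group, entry.
    """
    return f"datacatalog.entry.`{project_id}`.`{location}`.{entry_group}.{entry_id}"


print(fileset_table_path("project-id", "us-central1",
                         "dataflow_sql_dataset", "us_state_salesregions"))
```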

Automating the maintenance of Athena views

I am currently working on creating a data lake where we can compile, combine, and analyze multiple data sets in S3.
I am using Athena and Quicksight as a central part of this to be able to quickly query and explore the data. To make things easier in Quicksight for end-users, I am creating many Athena views that do some basic transformation and aggregations.
I would like to be able to source control my views and create some automation around them so that we can have a code-driven approach and not rely on users manually updating views and running DDL to update the definitions.
There does not seem to be any support in Cloudformation for Athena views.
My current approach would be to just save the create or replace view as ... DDL in an .sql file in source control and then create some sort of script that runs the DDL so it could be made part of a continuous integration solution.
Does anyone have any other experience with automation and CI for Athena views?
It has been a long time since the OP posted, but here is a bash script that does just that. You can use it on the CI of your choice.
This script assumes that you have a directory containing all the .sql files that define your views. Within that directory there is also an env file used to make some deploy-time replacements of shell environment variables.
#!/bin/bash
export VIEWS_DIRECTORY="views"
export DEPLOY_ENVIRONMENT="dev"
export ENV_FILENAME="views.env"
export OUTPUT_BUCKET="your-bucket-name-$DEPLOY_ENVIRONMENT"
export OUTPUT_PREFIX="your-prefix"
export AWS_PROFILE="your-profile"
cd "$VIEWS_DIRECTORY" || exit 1
# Create final .env file with any shell env replaced
env_file=".env"
envsubst < "$ENV_FILENAME" > "$env_file"
# Export variables in .env as shell environment variables
export $(grep -v '^#' "./$env_file" | xargs)
# Loop through all SQL files replacing env variables and pushing to AWS
for view_file in *.sql
do
    echo "Processing $view_file file..."
    # Replacing env variables in the query (kept in memory, so that no
    # temporary .sql file is picked up by the *.sql glob on the next run)
    query=$(envsubst < "$view_file")
    # Running query via AWS CLI
    query_execution_id=$(aws athena start-query-execution \
        --query-string "$query" \
        --result-configuration "OutputLocation=s3://${OUTPUT_BUCKET}/${OUTPUT_PREFIX}" \
        --profile "$AWS_PROFILE" \
        | jq -r '.QueryExecutionId')
    # Checking for query completion successfully
    echo "Query executionID: $query_execution_id"
    while :
    do
        query_state=$(aws athena get-query-execution \
            --query-execution-id "$query_execution_id" \
            --profile "$AWS_PROFILE" \
            | jq -r '.QueryExecution.Status.State')
        echo "Query state: $query_state"
        if [[ "$query_state" == "SUCCEEDED" ]]; then
            echo "Query ran successfully"
            break
        elif [[ "$query_state" == "FAILED" || "$query_state" == "CANCELLED" ]]; then
            echo "Query $query_state with ExecutionID: $query_execution_id"
            exit 1
        elif [[ -z "$query_state" || "$query_state" == "null" ]]; then
            echo "Unexpected error. Terminating routine."
            exit 1
        else
            echo "Waiting for query to finish running..."
            sleep 1
        fi
    done
done
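If the CI environment already runs Python, the same submit-and-poll loop can be sketched with boto3 instead of the AWS CLI and jq. This is only a sketch: run_athena_ddl is a made-up helper name, and the client argument is assumed to behave like boto3.client("athena"), which is passed in rather than created here so the function can be exercised without AWS access.

```python
import time
from typing import Any


def run_athena_ddl(client: Any, sql: str, output_location: str,
                   poll_seconds: float = 1.0) -> str:
    """Submit one DDL statement to Athena and block until it finishes.

    Returns the query execution ID on success and raises RuntimeError
    if the query ends in the FAILED or CANCELLED state.
    """
    response = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    execution_id = response["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=execution_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return execution_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Query {execution_id} {state}: {reason}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING
```

A CI step would read each .sql file, apply its own variable substitution, and call run_athena_ddl(boto3.client("athena"), sql, "s3://your-bucket-name-dev/your-prefix") for it.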
I think you could use AWS Glue
When Should I Use AWS Glue?
You can use AWS Glue to build a data warehouse to organize, cleanse,
validate, and format data. You can transform and move AWS Cloud data
into your data store. You can also load data from disparate sources
into your data warehouse for regular reporting and analysis. By
storing it in a data warehouse, you integrate information from
different parts of your business and provide a common source of data
for decision making.
AWS Glue simplifies many tasks when you are building a data warehouse:
Discovers and catalogs metadata about your data stores into a central catalog.
You can process semi-structured data, such as clickstream or process logs.
Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer
the schema, format, and data types of your data. This metadata is
stored as tables in the AWS Glue Data Catalog and used in the
authoring process of your ETL jobs.
Generates ETL scripts to transform, flatten, and enrich your data from source to target.
Detects schema changes and adapts based on your preferences.
Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your data into your data
warehouse. Triggers can be used to create a dependency flow between
jobs.
Gathers runtime metrics to monitor the activities of your data warehouse.
Handles errors and retries automatically.
Scales resources, as needed, to run your jobs.
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

How do I filter and extract raw log event data from Amazon Cloudwatch

Is there any way to 1) filter and 2) retrieve the raw log data out of Cloudwatch via the API or from the CLI? I need to extract a subset of log events from Cloudwatch for analysis.
I don't need to create a metric or anything like that. This is for historical research of a specific event in time.
I have gone to the log viewer in the console but I am trying to pull out specific lines to tell me a story around a certain time. The log viewer would be nigh-impossible to use for this purpose. If I had the actual log file, I would just grep and be done in about 3 seconds. But I don't.
Clarification
In the description of Cloudwatch Logs, it says, "You can view the original log data (only in the web view?) to see the source of the problem if needed. Log data can be stored and accessed (only in the web view?) for as long as you need using highly durable, low-cost storage so you don’t have to worry about filling up hard drives." --italics are mine
If this console view is the only way to get at the source data, then storing logs via CloudWatch is not an acceptable solution for my purposes. I need to get at the actual data with sufficient flexibility to search for patterns, not click through dozens of pages of lines and copy/paste. It appears, however, that a better way to get to the source data may not be available.
For using AWSCLI (plain one as well as with cwlogs plugin) see http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/SearchDataFilterPattern.html
For pattern syntax (plain text, [space separated], as well as {JSON syntax}) see: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/FilterAndPatternSyntax.html
For python command line utility awslogs see https://github.com/jorgebastida/awslogs.
AWSCLI: aws logs filter-log-events
AWSCLI is the official CLI for AWS services, and it now supports logs too.
To show help:
$ aws logs filter-log-events help
The filter can be based on:
log group name --log-group-name (only last one is used)
log stream name --log-stream-name (can be specified multiple times)
start time --start-time
end time --end-time (not --stop-time)
filter pattern --filter-pattern
Only --log-group-name is required.
Times are expressed as epoch using milliseconds (not seconds).
The call might look like this:
$ aws logs filter-log-events \
--start-time 1447167000000 \
--end-time 1447167600000 \
--log-group-name /var/log/syslog \
--filter-pattern ERROR \
--output text
It prints 6 columns of tab separated text:
1st: EVENTS (to denote, the line is a log record and not other information)
2nd: eventId
3rd: timestamp (time declared by the record as event time)
4th: logStreamName
5th: message
6th: ingestionTime
So if you have Linux command line utilities at hand and care only about log record messages for interval from 2015-11-10T14:50:00Z to 2015-11-10T15:00:00Z, you may get it as follows:
$ aws logs filter-log-events \
--start-time `date -d 2015-11-10T14:50:00Z +%s`000 \
--end-time `date -d 2015-11-10T15:00:00Z +%s`000 \
--log-group-name /var/log/syslog \
--filter-pattern ERROR \
--output text| grep "^EVENTS"|cut -f 5
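The date -d form above relies on GNU date. Where that is not available, the same conversion to epoch milliseconds can be sketched in Python; to_epoch_ms is a hypothetical helper name, and it assumes timestamps written exactly in the YYYY-MM-DDTHH:MM:SSZ form used above.

```python
from datetime import datetime, timezone


def to_epoch_ms(utc_timestamp: str) -> int:
    """Convert 'YYYY-MM-DDTHH:MM:SSZ' (UTC) to epoch milliseconds."""
    parsed = datetime.strptime(utc_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    # Attach UTC explicitly so the local time zone never affects the result.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp()) * 1000


print(to_epoch_ms("2015-11-10T14:50:00Z"))  # 1447167000000
print(to_epoch_ms("2015-11-10T15:00:00Z"))  # 1447167600000
```

These are the same values passed to --start-time and --end-time in the first filter-log-events call.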
AWSCLI with cwlogs plugin
The cwlogs AWSCLI plugin is simpler to use:
$ aws logs filter \
--start-time 2015-11-10T14:50:00Z \
--end-time 2015-11-10T15:00:00Z \
--log-group-name /var/log/syslog \
--filter-pattern ERROR
It expects human readable date-time and always returns text output with (space delimited) columns:
1st: logStreamName
2nd: date
3rd: time
4th till the end: message
On the other hand, it is a bit more difficult to install (there are a few more steps, and current pip requires you to declare the installation domain as a trusted one).
$ pip install awscli-cwlogs --upgrade \
--extra-index-url=http://aws-cloudwatch.s3-website-us-east-1.amazonaws.com/ \
--trusted-host aws-cloudwatch.s3-website-us-east-1.amazonaws.com
$ aws configure set plugins.cwlogs cwlogs
(if you make a typo in the last command, just correct it in the ~/.aws/config file)
awslogs command from jorgebastida/awslogs
This became my favourite one: easy to install, powerful, easy to use.
Installation:
$ pip install awslogs
To list available log groups:
$ awslogs groups
To list log streams:
$ awslogs streams /var/log/syslog
To get the records and follow them (see new ones as they come):
$ awslogs get --watch /var/log/syslog
And you may filter the records by time range:
$ awslogs get /var/log/syslog -s 2015-11-10T15:45:00 -e 2015-11-10T15:50:00
Since version 0.2.0 there is also the --filter-pattern option.
The output has columns:
1st: log group name
2nd: log stream name
3rd: message
Using --no-group and --no-stream you may switch the first two columns off.
Using --no-color you may get rid of color control characters in the output.
EDIT: as awslogs version 0.2.0 adds --filter-pattern, text updated.
If you are using the Python Boto3 library to extract AWS CloudWatch Logs, note that the get_log_events() function accepts start and end times in milliseconds.
For reference: http://boto3.readthedocs.org/en/latest/reference/services/logs.html#CloudWatchLogs.Client.get_log_events
For this you can take a UTC time input and convert it into milliseconds using the datetime module and calendar.timegm, and you are good to go:
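import sys
from calendar import timegm
from datetime import datetime, timedelta

# Optional command-line arguments: start and end time in UTC
fmt = "%Y-%m-%d %H:%M:%S"
start_time = datetime.strptime(sys.argv[1], fmt) if len(sys.argv) > 1 else None
end_time = datetime.strptime(sys.argv[2], fmt) if len(sys.argv) > 2 else None
# If no time filters are given use the last hour
now = datetime.utcnow()
start_time = start_time or now - timedelta(hours=1)
end_time = end_time or now
# get_log_events() expects epoch milliseconds
start_ms = timegm(start_time.utctimetuple()) * 1000
end_ms = timegm(end_time.utctimetuple()) * 1000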
So you can pass the start and end times as command-line arguments, as shown below:
python flowlog_read.py '2015-11-13 00:00:00' '2015-11-14 00:00:00'
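Once the times are in milliseconds, fetching the filtered messages could look roughly like the sketch below. Note that it uses filter_log_events (the API behind the CLI command shown earlier) rather than get_log_events, because it searches a whole log group and accepts a filter pattern. The helper name iter_log_messages is made up, and the client argument is assumed to behave like boto3.client("logs").

```python
from typing import Any, Iterator, Optional


def iter_log_messages(client: Any, log_group: str, start_ms: int, end_ms: int,
                      filter_pattern: Optional[str] = None) -> Iterator[str]:
    """Yield the message of every matching event in the time window.

    A paginator is used so that results spanning several pages are
    all returned, not just the first page.
    """
    kwargs = {"logGroupName": log_group, "startTime": start_ms, "endTime": end_ms}
    if filter_pattern:
        kwargs["filterPattern"] = filter_pattern
    paginator = client.get_paginator("filter_log_events")
    for page in paginator.paginate(**kwargs):
        for event in page.get("events", []):
            yield event["message"]
```

For example, iter_log_messages(boto3.client("logs"), "/var/log/syslog", start_ms, end_ms, "ERROR") would yield the same messages that the cut -f 5 pipeline prints.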
While Jan's answer is a great one and probably what the author wanted, please note that there is an additional way to get programmatic access to the logs - via subscriptions.
This is intended for always-on streaming scenarios where data is constantly fetched (usually into a Kinesis stream) and then further processed.
Haven't used it myself, but here is an open-source cloudwatch to Excel exporter I came across on GitHub:
https://github.com/petezybrick/awscwxls
Generic AWS CloudWatch to Spreadsheet Exporter CloudWatch doesn't provide an Export utility - this does. awscwxls creates spreadsheets
based on generic sets of Namespace/Dimension/Metric/Statistic
specifications. As long as AWS continues to follow the
Namespace/Dimension/Metric/Statistic pattern, awscwxls should work for
existing and future Namespaces (Services). Each set of specifications
is stored in a properties file, so each properties file can be
configured for a specific set of AWS Services and resources. Take a
look at run/properties/template.properties for a complete example.
I think the best option for retrieving the data is the one described in the API documentation.