Getting details of a BigQuery job using gcloud CLI on local machine - google-cloud-platform

I am trying to process the billed bytes of each BigQuery job run by all users. I was able to find the details in the BigQuery UI under Project History. Running bq --location=europe-west3 show --job=true --format=prettyjson JOB_ID in Cloud Shell also gives exactly the information I want (the SQL query, billed bytes, and run time of each BigQuery job).
As a next step, I want to access the JSON returned by the above command from my local machine. I have already configured the gcloud CLI properly and am able to list BigQuery jobs with gcloud alpha bq jobs list --show-all-users --limit=10.
I select a job ID and run the following command: gcloud alpha bq jobs describe JOB_ID --project=PROJECT_ID,
and I get (gcloud.alpha.bq.jobs.describe) NOT_FOUND: Not found: Job PROJECT_ID:JOB_ID--toyFH. This is possibly because of the creation and end times,
as shown here.
What am I doing wrong? Is there another way to get the details of a BigQuery job using the gcloud CLI (or maybe a way to get the billed bytes along with the query details using the Python SDK)?

You can get the job details with different APIs, or the way you are doing it, but first: why are you using the alpha version of the bq commands?
To do it in Python, you can try something like this:
from google.cloud import bigquery

def get_job(
    client: bigquery.Client,
    location: str = "us",
    job_id: str = "YOUR_JOB_ID",
) -> None:
    job = client.get_job(job_id, location=location)

    print(f"{job.location}:{job.job_id}")
    print(f"Type: {job.job_type}")
    print(f"State: {job.state}")
    print(f"Created: {job.created.isoformat()}")
There are more properties that you can read off the job object. Also check the status of the job in the console first, so you can compare the two.
You can find more details here: https://cloud.google.com/bigquery/docs/managing-jobs#python
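Regarding the billed bytes specifically: for query jobs the job object exposes total_bytes_billed, and client.list_jobs can iterate over jobs from all users in the project. A minimal sketch (the project ID is a placeholder, and the attribute is read defensively because non-query jobs don't have it):
from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

# List recent jobs from all users in the project and print the billed
# bytes for query jobs (other job types don't have total_bytes_billed).
for job in client.list_jobs(all_users=True, max_results=10):
    billed = getattr(job, "total_bytes_billed", None)
    print(job.job_id, job.user_email, job.job_type, job.state, billed)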

Related

GCP Airflow Operators: BQ LOAD and sensor for job ID

I need to automatically schedule a bq load process that picks up AVRO files from a GCS bucket and loads them into BigQuery, then wait for its completion in order to execute another task upon completion, specifically a task that reads from the above-mentioned table.
As shown here, there is a nice API to run this [command][1], for example:
bq load \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro
This gives me a job_id:
Waiting on bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1 ... (0s) Current status
which I can check with bq show --job=true bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1
And that is nice... I guess under the hood the bq load job is a DataTransfer. I found some operators related to this: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery_dts.html
Even though the documentation does not specifically cover the Avro load configuration, digging through it gave me what I was looking for.
My question is: is there an easier way of getting the status of the job, given a job_id, similar to the bq show --job=true <job_id> command?
Is there something that would help me avoid creating a DataTransfer job, starting it, monitoring it, and then deleting it (I don't need it to stick around, since the parameters will change the next time)?
Maybe a custom sensor, using the python-sdk-api?
Thank you in advance.
[1]: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
I think you can use the GCSToBigQueryOperator operator and task sequencing with Airflow to solve your issue:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with airflow.DAG(
        'dag_name',
        default_args=your_args,
        schedule_interval=None) as dag:

    load_avro_to_bq = GCSToBigQueryOperator(
        task_id='load_avro_file_to_bq',
        bucket='{your_bucket}',
        source_objects=['folder/{SOURCE}/*.avro'],
        destination_project_dataset_table='{your_project}:{your_dataset}.{your_table}',
        source_format='AVRO',
        compression=None,
        create_disposition='CREATE_NEVER',
        write_disposition='WRITE_TRUNCATE'
    )

    second_task = DummyOperator(task_id='next_task', dag=dag)

    load_avro_to_bq >> second_task
The first operator loads the Avro file from GCS into BigQuery.
If this operator succeeds, the second task is executed; otherwise it is not.
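If you specifically want to poll an existing job by its ID (the equivalent of bq show --job=true <job_id>), another option is a small custom sensor built on the BigQuery Python client, as you suggested. A rough sketch, assuming Airflow 2.x; the DAG ID, task ID, project and job ID are placeholders:
import pendulum
from airflow import DAG
from airflow.sensors.python import PythonSensor
from google.cloud import bigquery

def _bq_job_is_done(job_id, project="YOUR_PROJECT", location="EU"):
    # Poke callable: returns True once the BigQuery job has finished.
    # Note that DONE does not imply success; check job.error_result too.
    job = bigquery.Client(project=project).get_job(job_id, location=location)
    print(f"Job {job.job_id} is in state {job.state}")
    return job.state == "DONE"

with DAG(
    dag_id="wait_for_bq_job_example",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule_interval=None,
) as dag:
    wait_for_bq_job = PythonSensor(
        task_id="wait_for_bq_job",
        python_callable=_bq_job_is_done,
        op_kwargs={"job_id": "YOUR_JOB_ID"},  # e.g. the bqjob_... ID printed by bq load
        poke_interval=60,
        timeout=30 * 60,
    )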

Error Bigquery/dataflow "Could not resolve table in Data Catalog"

I'm having trouble with a job I've set up on Dataflow.
Here is the context: I created a dataset in BigQuery using the following path:
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US".
Now I want to run a job on Dataflow, and I enter the following command into the Google Cloud Shell:
gcloud dataflow sql query ' SELECT country, DATE_TRUNC(ORDERDATE , MONTH),
sum(sales) FROM bi-training-gcp.sales.sales_data group by 1,2 ' --job-name=dataflow-sql-sales-monthly --region=us-east1 --bigquery-dataset=sales --bigquery-table=monthly_sales
The query is accepted by the console and returns a sort of acknowledgement message.
After that I go to the Dataflow dashboard. I can see a new job as queued, but after 5 minutes or so the job fails and I get the following error messages:
Error
2021-09-29T18:06:00.795Z Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error
2021-09-29T18:10:31.592036462Z Error occurred in the launcher container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's location is set to "US", I thought it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success...
Does anybody know how I can solve this?
Thank you
This error occurs when the Dataflow service account doesn't have access to the Data Catalog API. To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternatively, assign the roles/datacatalog.viewer role to the Dataflow service account.

Cloud DataFlow SQL from BigQuery UI cannot read Cloud Storage filesets: "Table not found: datacatalog.entry"

I'm trying to create a Dataflow job using the beta Cloud Dataflow SQL within the Google BigQuery UI.
My data source is a Cloud Storage fileset (that is, a set of files in Cloud Storage defined through Data Catalog).
Following the GCP documentation, I was able to define my fileset, assign it a schema, and see it in the Resources tab of the BigQuery UI.
But then I cannot launch any Dataflow job in the Query Editor, because I get the following error message in the query validator: Table not found: datacatalog.entry.location.entry_group.fileset_name...
Is it an issue of some API not being authorized?
Thanks for your help!
You may be using the wrong location in the full path. When you create a Data Catalog fileset, check the location you provided, e.g. using the sales regions example from the docs:
gcloud data-catalog entries create us_state_salesregions \
--location=us-central1 \
--entry-group=dataflow_sql_dataset \
--type=FILESET \
--gcs-file-patterns=gs://us_state_salesregions_{my_project}/*.csv \
--schema-from-file=schema_file.json \
--description="US State Sales regions..."
When you are building your DataFlow SQL query:
SELECT tr.*, sr.sales_region
FROM pubsub.topic.`project-id`.transactions as tr
INNER JOIN
datacatalog.entry.`project-id`.`us-central1`.dataflow_sql_dataset.us_state_salesregions AS sr
ON tr.state = sr.state_code
Check the full path; it should look like the example above: datacatalog.entry, then your project ID, then your location (us-central1 in this example), then your entry group ID (dataflow_sql_dataset in this example), and finally your entry ID (us_state_salesregions in this example).
Let me know if this works for you.

How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export from the BigQuery table to Cloud Storage. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn get triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)

    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(extract_job.job_id, dataset_name, table_name, destination_uri)
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL as the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.
Once created, and by pressing the RUN NOW button, you can test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets, and buckets for each execution, but essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing said parameters as data, which would be passed on to the Cloud Function - although that would imply making some small changes in its code (see the sketch below).
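As a rough sketch of those small changes (the JSON keys project, dataset, table and bucket below are an illustrative convention, not a fixed contract), the function could read the parameters from the POST body like this:
from google.cloud import bigquery

def export_table(request):
    # Read the export parameters from the POST body; fall back to defaults.
    params = request.get_json(silent=True) or {}
    project_name = params.get("project", "YOUR_PROJECT_ID")
    dataset_name = params.get("dataset", "YOUR_DATASET")
    table_name = params.get("table", "YOUR_TABLE")
    bucket_name = params.get("bucket", "YOUR_BUCKET")

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")
    bq_client = bigquery.Client(project=project_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        "{}.{}.{}".format(project_name, dataset_name, table_name),
        destination_uri,
        location="US",  # must match the location of the source table
        job_config=job_config,
    )
    return "Job with ID {} started exporting {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)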
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);
This could just as well be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
There is an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. That way, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status in the Pub/Sub notification. It also carries a lot of information about the scheduled query, which is useful if you want to perform more checks or generalize the function.
One more point about the SFTP transfer: I open-sourced a project for querying BigQuery, building a CSV file, and transferring this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use it. Let me know.
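A minimal sketch of that wiring, assuming a first-generation Pub/Sub-triggered Cloud Function and that the notification payload is the scheduled query's run resource serialized as JSON (verify the exact field names, such as state and errorStatus, against your own messages):
import base64
import json

def on_scheduled_query_done(event, context):
    # The Pub/Sub message data is base64-encoded JSON describing the
    # scheduled query run (assumed fields: "state", "errorStatus", "name").
    run = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    if run.get("errorStatus") or run.get("state") != "SUCCEEDED":
        print("Scheduled query did not succeed, skipping export:", run.get("errorStatus"))
        return

    # Safe to start the extract job here (same code as in Maxim's answer).
    print("Scheduled query run {} succeeded, starting export".format(run.get("name", "unknown")))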

Get the BigQuery Table creator and Google Storage Bucket Creator Details

I am trying to identify the users who created tables in BigQuery.
Is there any command line tool or API that would provide this information? I know that audit logs do provide it, but I was looking for a command line approach that could do the job, so that I could wrap it in a shell script and run it against all the tables at once. The same goes for Google Cloud Storage buckets. I did try
gsutil iam get gs://my-bkt and looked for the "role": "roles/storage.admin" entry, but I do not find the admin role on all buckets. Any help?
This is a use case for audit logs. BigQuery tables don't report metadata about the original resource creator, so scanning via tables.list or inspecting the ACLs doesn't really expose who created the resource, only who currently has access.
What's the use case? You could certainly export the audit logs back into BigQuery and query for table creation events going forward, but that's not exactly the same.
You can find this out using audit logs. You can access them either via the Console / Logs Explorer or using the gcloud tool from the CLI.
The log filter that you're interested in is this one:
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
If you want to run it from the command line, you'd do something like this:
gcloud logging read \
'
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'\
--limit 10
You can then post-process the output to find out who created the table. Look for the principalEmail field.
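If you'd rather do the post-processing in Python than in the shell, a rough sketch with the google-cloud-logging client could look like this (same filter as above; how the audit payload is exposed can vary between client versions, hence the defensive access):
from google.cloud import logging

LOG_FILTER = '''
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'''

client = logging.Client(project="YOUR_PROJECT")
for entry in client.list_entries(filter_=LOG_FILTER):
    # For audit log entries the payload is the AuditLog protoPayload;
    # recent client versions expose it as a dict with camelCase keys.
    payload = entry.payload if isinstance(entry.payload, dict) else {}
    creator = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
    print(entry.timestamp, creator)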