Creating a BigQuery dataset from a log sink in GCP

When running
gcloud logging sinks list
it seems I have several sinks for my project
▶ gcloud logging sinks list
NAME DESTINATION FILTER
myapp1 bigquery.googleapis.com/projects/myproject/datasets/myapp1 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp1"
myapp2 bigquery.googleapis.com/projects/myproject/datasets/myapp2 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp2"
myapp3 bigquery.googleapis.com/projects/myproject/datasets/myapp3 resource.type="k8s_container" resource.labels.cluster_name="mygkecluster" resource.labels.container_name="myapp3"
However, when I navigate in my BigQuery console, I don't see the corresponding datasets.
Is there a way to import these sinks as datasets so that I can run queries against them?
This guide on creating BigQuery datasets does not list how to do so from a log sink (unless I am missing something)
Also any idea why the above datasets are not displayed when using the bq ls command?

Firstly, make sure you are in the correct project. If not, you can pin a dataset from an external project by clicking the PIN button (you need sufficient permissions for this).
Secondly, a Cloud Logging sink to BigQuery doesn't create the dataset, only the tables. So, if you created the sinks without creating the dataset first, your sinks aren't running (or are running in error). Here are more details:
BigQuery: Select or create the particular dataset to receive the exported logs. You also have the option to use partitioned tables.
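For instance, a minimal sketch with the bq tool (using the dataset name from the sink list above) to create the missing dataset and confirm that it now shows up in bq ls:
# Create the dataset that the existing sink already points to
bq mk --dataset myproject:myapp1
# Verify the dataset is now listed
bq ls --project_id=myproject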

In general, what you expect this feature to do is right: the point of using BigQuery as a log sink is to let you query the logs with BQ. The problem you're facing, I believe, comes down to using the web console vs. gcloud.
When using BigQuery as log sink, there are 2 ways to specify a dataset:
point to an existing dataset
create a new dataset
When creating a new sink via the web console, there's an option to have Cloud Logging create a new dataset for you as well. However, gcloud logging sinks create does not automatically create a dataset; it only creates the log sink. It also does not appear to validate whether the specified dataset exists.
To resolve this, you could either use the web console for the task or create the datasets on your own. There's nothing special about creating a BQ dataset to be a log sink destination compared to creating a BQ dataset for any other purpose. Create a BQ dataset, then create a log sink pointing to the dataset, and you're good to go.
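As a rough sketch with the CLI (reusing the names from the question), the flow would be:
# 1. Create the destination dataset first
bq mk --dataset myproject:myapp1
# 2. Create the sink pointing at that dataset
gcloud logging sinks create myapp1 \
  bigquery.googleapis.com/projects/myproject/datasets/myapp1 \
  --log-filter='resource.type="k8s_container" AND resource.labels.cluster_name="mygkecluster" AND resource.labels.container_name="myapp1"'
# 3. Look up the sink's writer identity and grant that service account write access
#    (e.g. BigQuery Data Editor) on the dataset; gcloud-created sinks typically
#    don't get this permission set up for you automatically
gcloud logging sinks describe myapp1 --format='value(writerIdentity)'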
Conceptually, different products on GCP (BigQuery, Cloud Logging) run independently; a log sink in Cloud Logging is simply an object that pairs up a filter and a destination, but it does not own or manage the destination resource (e.g. a BQ dataset). It's just that the web console provides some extra integration to make things easier.

Related

How can I save SQL script from AWS Athena view with boto3/python

I have been working with AWS Athena for a while and need to create a backup and version control of the views. I'm trying to build an automation for the backup to run daily and capture all the views.
I tried to find a way to copy all the views created in Athena using boto3, but I couldn't find one. With DBeaver I can see and export a view's SQL script, but from what I've seen only one at a time, which does not serve the goal.
I'm open to any approach.
I tried to find an answer to my question in the boto3 and DBeaver documentation. Reading threads on Stack Overflow and some Google searching did not take me very far.
Views and Tables are stored in the AWS Glue Data Catalog.
You can Query the AWS Glue Data Catalog - Amazon Athena to obtain information about tables, partitions, columns, etc.
However, if you want to obtain the DDL that was used to create the views, you will probably need to use SHOW CREATE TABLE [db_name.]table_name:
Analyzes an existing table named table_name to generate the query that created it.
Have you tried using get_query_results in boto3?
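As a rough sketch with the AWS CLI (database, bucket, and file names are placeholders; SHOW CREATE VIEW is the view-flavoured counterpart of SHOW CREATE TABLE), you could list the views from the Glue Data Catalog and dump each one's DDL through Athena:
# Athena views are registered in Glue with TableType VIRTUAL_VIEW
views=$(aws glue get-tables --database-name my_database \
  --query "TableList[?TableType=='VIRTUAL_VIEW'].Name" --output text)
for v in $views; do
  # Ask Athena to generate the DDL for the view
  qid=$(aws athena start-query-execution \
    --query-string "SHOW CREATE VIEW my_database.${v}" \
    --result-configuration OutputLocation=s3://my-athena-results-bucket/ \
    --query 'QueryExecutionId' --output text)
  sleep 5  # a real script should poll get-query-execution until the state is SUCCEEDED
  # Save the DDL locally for backup / version control
  aws athena get-query-results --query-execution-id "$qid" \
    --query 'ResultSet.Rows[].Data[].VarCharValue' --output text > "${v}.sql"
done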

Can't find BigQuery Logs for GA4 events_intraday tables

I am trying to create a trigger for a Cloud Function to copy events_intraday table data as soon as new data has been exported.
So far I have been following this answer to generate a sink from Cloud Logging to Pub/Sub.
I have only been able to find logs for events_YYYYMMDD tables but none for events_intraday_YYYYMMDD, neither in Cloud Logging nor in the BigQuery Job History (here are my queries for events tables and events_intraday tables in Cloud Logging).
Am I looking at the wrong place? How is it possible for the table to be updated without any logs being generated?
Update: There is one (1) log generated per day when the table is created, but "table update" logs are yet to be found.
Try
protoPayload.authorizationInfo.permission="bigquery.tables.create"
protoPayload.methodName="google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName : "projects/'your_project'/datasets/'your_dataset'/tables/events_intraday_"
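If useful, a sketch of wiring that filter into a Cloud Logging sink that publishes to the Pub/Sub topic driving the Cloud Function (the project, dataset, and topic names are placeholders):
gcloud logging sinks create ga4-intraday-table-created \
  pubsub.googleapis.com/projects/your_project/topics/ga4-intraday-table-created \
  --log-filter='protoPayload.authorizationInfo.permission="bigquery.tables.create" AND protoPayload.methodName="google.cloud.bigquery.v2.TableService.InsertTable" AND protoPayload.resourceName:"projects/your_project/datasets/your_dataset/tables/events_intraday_"'
# Remember to grant the sink's writer identity the Pub/Sub Publisher role on the topic.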

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in DataFlow

I am working on a project which crunches data and does a lot of processing, so I chose BigQuery as it has good support for running analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as a transactional/OLTP store). My understanding is that BigQuery is not suitable for transactional queries. Looking into alternatives, I realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational DB fits my purpose).
However, it's not as straightforward as it seems: first I have to create a pipeline to move the data to a GCS bucket and then move it to Cloud SQL.
Is there a better way to manage it? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples which use "Create Job From SQL" to process and move data to Cloud SQL.
Consider a simple example on Robinhood:
Compute the user's returns by looking at his portfolio and show the graph with the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
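For completeness, a rough sketch of the export-then-import route with the CLI (bucket, instance, database, and table names are placeholders), since Cloud SQL can import CSV files from Cloud Storage:
# Export the computed BigQuery table to Cloud Storage as CSV (sharded if large)
bq extract --destination_format=CSV \
  'myproject:analytics.monthly_returns' \
  gs://my-export-bucket/monthly_returns-*.csv
# Import one exported file into an existing Cloud SQL table (repeat per shard)
gcloud sql import csv my-cloudsql-instance \
  gs://my-export-bucket/monthly_returns-000000000000.csv \
  --database=webapp --table=monthly_returns
# The Cloud SQL instance's service account needs read access to the bucket.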

How to find unused / unaccessed objects in BigQuery

Is there a way to find unused objects (tables, views, etc.) within datasets in BigQuery, or objects that are less frequently accessed (like we can run audits in Oracle to find the same)?
Just like you can run audits in Oracle, you can enable StackDriver logging on BigQuery and run audits from StackDriver.
If you'd also like to use BigQuery syntax to query StackDriver logging, you can export StackDriver logging to BigQuery.
You can export the Stackdriver logs back into BigQuery and run the audit queries against that table.
Alternatively, create Stackdriver Monitoring on the same using custom metrics.
Both of these incur costs.
However, BigQuery automatically lowers cost on the data stored in tables or in partitions that have not been modified in the last 90 days.
We can write a script to find unused BQ tables from the log data. The script works by building the list of BQ tables that have been queried in the last N days, then comparing that list against the full list of tables in the dataset.
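A rough sketch of that comparison with bq (dataset names are placeholders, and the audit-log field path below assumes a date-sharded export of the older AuditData format, so adjust it to your export's schema):
# All tables in the dataset being audited
bq ls --max_results=10000 myproject:mydataset | awk 'NR>2 {print $1}' | sort > all_tables.txt
# Tables referenced by query jobs in the last 90 days, from the exported audit logs
bq query --use_legacy_sql=false --format=csv '
SELECT DISTINCT ref.tableId
FROM `myproject.auditlog_dataset.cloudaudit_googleapis_com_data_access_*`,
  UNNEST(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.referencedTables) AS ref
WHERE _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))' | tail -n +2 | sort > queried_tables.txt
# Tables that never show up in the query logs are candidates for cleanup
comm -23 all_tables.txt queried_tables.txt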

Google Data Studio Billing Report Demo for GCP multiple projects

Basically I am trying to set up the Google Cloud Billing Report Demo for multiple projects.
Example mentioned in this link
In it there are 3 steps to configure datasource for data studio
Create the Billing Export Data Source
Create the Spending Trends Data Source
Create the BigQuery Audit Data Source
Now the 1st point is quite clear.
For the 2nd point, the query example provided in the demo is based on a single project. In my case I want a spending data source spanning multiple projects.
Does doing a UNION of the query for each project work in this case?
For the 3rd point, I need the BigQuery audit logs from all my projects. I thought setting a single external dataset sink for BigQuery, as shown below, in all my projects should do the trick.
bigquery.googleapis.com/projects/myorg-project/datasets/myorg_cloud_costs
But I see that in my dataset the tables are created with a suffix _ (1), as shown below:
cloudaudit_googleapis_com_activity_ (1)
cloudaudit_googleapis_com_data_access_ (1)
and these tables don't contain any data despite running BigQuery queries in all projects multiple times. In fact, previewing them shows the error below:
Unable to find table: myorg-project:cloud_costs.cloudaudit_googleapis_com_activity_20190113
I think the auto-generated name with the suffix _ (1) is causing some issue, and because of that the data is also not getting populated.
I believe there should be a very simple solution for this, I'm just not able to see it.
Can somebody please provide some information on how to solve the 2nd and 3rd requirements for multiple projects in the GCP Data Studio billing report demo?
For the 2nd point, the query example provided in the demo is based on a single project. In my case I want a spending data source spanning multiple projects. Does doing a UNION of the query for each project work in this case?
That project is the project you specify for the billing audit logs in BigQuery. The logs are attached to the billing account, which can contain multiple projects underneath it. All projects in the billing account will be captured in the logs - more specifically, in the column project.id.
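So rather than UNIONing per-project queries, you can just group by that column in the single billing export table (a sketch; the dataset and table names follow the billing export naming convention and will differ in your setup):
bq query --use_legacy_sql=false '
SELECT project.id AS project_id,
       FORMAT_DATE("%Y-%m", DATE(usage_start_time)) AS month,
       ROUND(SUM(cost), 2) AS total_cost
FROM `myorg-project.billing_dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
GROUP BY project_id, month
ORDER BY month, total_cost DESC'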
For the 3rd point, I need the BigQuery audit logs from all my projects. I thought setting a single external dataset sink for BigQuery, as shown below, in all my projects should do the trick.
You use the includeChildren property. See here. If you don't have an organisation or use folders, then you will need to create a sink per project and point it at the dataset in BigQuery where you want all the logs to go. You can script this up using the gcloud tool. It's easy.
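A sketch of both options with gcloud (the organisation ID and sink names are placeholders; the destination is the dataset from your question):
# With an organisation: one aggregated sink that covers all child projects
gcloud logging sinks create bq-audit-to-central \
  bigquery.googleapis.com/projects/myorg-project/datasets/myorg_cloud_costs \
  --organization=123456789012 --include-children \
  --log-filter='resource.type="bigquery_resource"'
# Without an organisation: loop over projects and create one sink in each
for p in $(gcloud projects list --format='value(projectId)'); do
  gcloud logging sinks create bq-audit-to-central \
    bigquery.googleapis.com/projects/myorg-project/datasets/myorg_cloud_costs \
    --project="$p" --log-filter='resource.type="bigquery_resource"'
done
# Each sink has its own writer identity; grant it write access on the destination dataset.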
I think the auto-generated name with the suffix _ (1) is causing some issue, and because of that the data is also not getting populated.
The suffix is normal. Also, it can take a few hours for your logs/sinks to start flowing.