BigQuery: save clusters of a clustered table to Cloud Storage - google-cloud-platform

I have a BigQuery table that's clustered by several columns; let's call them client_id and attribute_id.
What I'd like is to submit one job or command that exports the table data to Cloud Storage but saves each cluster (that is, each combination of client_id and attribute_id) to its own object. The final URIs might look something like this:
gs://my_bucket/{client_id}/{attribute_id}/object.avro
I know I could pull this off by iterating over all possible combinations of client_id and attribute_id, using a client library to query the relevant data into a BigQuery temp table, and then exporting that data to a correctly named object. I could also do so asynchronously.
But I imagine the clustered data is already stored in a layout somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script myself.
Is there a way to accomplish this today, without requesting a new feature?
Thanks!
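A minimal sketch of the manual approach described above, assuming the distinct (client_id, attribute_id) pairs have already been fetched. It only builds one EXPORT DATA statement per pair; submitting them through a client library is left out. The function name, the table name, and the `object-*.avro` naming are illustrative assumptions (EXPORT DATA expects a single `*` wildcard in the URI), and values are interpolated naively, so real code would need proper escaping.

```python
def build_export_statements(table, bucket, combos):
    """Return one EXPORT DATA statement per (client_id, attribute_id) pair.

    Hypothetical helper: `table` and `bucket` are placeholders, and values are
    interpolated without escaping, so this is only safe for trusted input.
    """
    statements = []
    for client_id, attribute_id in combos:
        # The export URI needs exactly one '*' wildcard.
        uri = f"gs://{bucket}/{client_id}/{attribute_id}/object-*.avro"
        statements.append(
            f"EXPORT DATA OPTIONS(uri='{uri}', format='AVRO', overwrite=true) AS\n"
            f"SELECT * FROM `{table}`\n"
            f"WHERE client_id = {client_id!r} AND attribute_id = {attribute_id!r}"
        )
    return statements


if __name__ == "__main__":
    for stmt in build_export_statements(
        "my_project.my_dataset.my_table", "my_bucket", [(1, 10), (2, 20)]
    ):
        print(stmt, end="\n\n")
```

Each statement could then be submitted as its own query job, which is what makes running them asynchronously possible.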

Related

Can I trace back the version of the data my model was trained on in Vertex AI?

Let's suppose I have a table in BigQuery and I create a dataset on Vertex AI based on it. I train my model. A while later, the data gets updated several times in BigQuery.
But can I simply go to my model and get redirected to the exact version of the data it was trained on?
Using time travel, I can still access the historical data in BigQuery. But I couldn't find a way to go from my model to the version of the data it was trained on and look at that data.
On the Vertex AI page for creating a dataset from BigQuery there is this statement:
The selected BigQuery table will be associated with your dataset. Making changes to the referenced BigQuery table will affect the dataset before training.
So no copy or clone of the table is prepared for you automatically.
Also, you usually don't need the whole base table to create the dataset; you typically select a subset by date or other WHERE conditions. The point is that you filter your base table, and your new dataset is only a subset of it.
The recommended way is to create a BigQuery dataset to hold your table sources; let's call it vertex_ai_dataset. In it you store all the tables that back a Vertex AI dataset. Make sure to version them rather than update them.
So BASETABLE -> SELECT -> WRITE AS vertex_ai_dataset.dataset_for_model_v1 (use the latter in Vertex AI).
Another option is to SNAPSHOT the base table whenever you issue a TRAIN action. But be aware that these snapshots need to be maintained and cleaned up as well.
CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname
CLONE dataset.basetable;
Other parameters and a guide are here.
You could also automate this by observing the Vertex AI training event (it should be documented here) and using Eventarc to start a Cloud Workflow that automatically creates a BigQuery table snapshot for you.
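As a rough sketch of the snapshot option, the statement could be generated with a name that records which model and training time it belongs to. The helper name, the model name, and the naming scheme are assumptions for illustration; the optional FOR SYSTEM_TIME AS OF clause pins the snapshot to the training time, which only works while that time is still inside the table's time travel window.

```python
from datetime import datetime, timezone


def snapshot_statement(base_table, snapshot_dataset, model_name, trained_at):
    """Build a CREATE SNAPSHOT TABLE statement named after a model and a time.

    Hypothetical helper: the naming scheme is an assumption, and `trained_at`
    must be a timezone-aware datetime.
    """
    utc = trained_at.astimezone(timezone.utc)
    suffix = utc.strftime("%Y%m%d_%H%M%S")
    as_of = utc.strftime("%Y-%m-%d %H:%M:%S+00:00")
    return (
        f"CREATE SNAPSHOT TABLE {snapshot_dataset}.{model_name}_{suffix}\n"
        f"CLONE {base_table}\n"
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{as_of}';"
    )


if __name__ == "__main__":
    print(
        snapshot_statement(
            "dataset.basetable",
            "dataset_to_store_snapshots",
            "churn_model",  # hypothetical model name
            datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        )
    )
```

Keeping the model name and timestamp in the snapshot name gives a simple way to look up the data a given model was trained on, and also makes old snapshots easy to find when cleaning up.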

PowerBi - Connection Type (DIRECT QUERY or IMPORT DATA) Question

I am working on a Power BI project and need some advice on the best way to approach it. I am tasked with creating a dashboard for employee metrics pulled from an on-site SQL Server database. The managers here will have access to the Power BI cloud service, so I will end up publishing the report there. There are 10 or so metrics that need to be shown on the dashboard, and we have 5000+ employees. My first thought was to dump all the metrics into a table and set the Power BI report to import the data, but that seems excessive and a waste of space, because not every manager needs access to every employee. They may want to see only one or two employees' metrics on the dashboard.
My second thought is to create a stored procedure (if this is possible) that takes an employee ID and outputs a dataset for Power BI to visualize. The dashboard would have a list of employees; when a manager selects one, Power BI would call the stored procedure with the employee ID and render the returned dataset as a visual based on my measurements. I guess I would set the report's connection type to DIRECT QUERY?
Here are my questions:
Is my second plan possible? Is this how DIRECT QUERY works?
If so, how does DIRECT QUERY work with the Power BI cloud?
What is the setup like? Do I just install and configure the Power BI data gateway as for IMPORT DATA, and Power BI does the rest?
A couple of questions:
What is the frequency of data updates?
If it is a batch job, it is preferable to import the data from the source into the Power BI model and report on the imported data, because:
a) performance would be better
b) there would be no back-and-forth of data between the on-prem database and the cloud
c) the source would not be queried constantly
Is the ask to have RLS (row-level security), so that managers see only the employees under them?
If so, RLS is easier to implement with imported data than with DirectQuery.
Also, you won't be able to pass parameters to stored procedures, and you can't execute them in DirectQuery mode. You can, however, create table-valued functions, which let you use table variables and perform other, more complex operations in DirectQuery mode.
You can refer to this for additional details:
https://community.powerbi.com/t5/Desktop/Can-i-call-Stored-Procedure-with-Direct-Query/m-p/267141#:~:text=%40Pallavi%20you%20won't%20be,nature%20in%20Direct%20Query%20mode.

AWS Glue table Map data type for arbitrary number of fields and challenges faced

We are working on a data lake project. The data comes from CloudWatch Logs and is delivered to S3 through Kinesis. Once it is in S3, we need to create a table so we can query the JSON files through Athena. We have 3 fields: one is a timestamp and another is properties, an object that may hold an arbitrary number of fields and differs from case to case. For that reason, based on some research and advice, I defined it as map<string,string> when creating the table.
The challenge is that Athena always reports zero records returned, even though there is definitely data. To confirm this, I created a second table through a crawler, and with that one I can see the data in Athena. The only difference is that the crawler defined the properties column as a struct with specific fields inside it, whereas the manual table uses map<string,string> to handle arbitrary incoming fields. I would appreciate any help identifying the root cause. Thank you.
Below is a sample JSON line from S3.
{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}
Zero records returned usually means the table's location is wrong, or that you have a partitioned table but haven't added any partitions. I assume you would have figured it out if it was the former, so I suspect the latter. When you run a crawler it will add partitions it finds in addition to creating the table.
If you need help adding partitions please edit your question and provide examples of how your data is structured on S3.
This is a case where a Glue crawler will probably not work very well: it will try too hard to figure out the schema of the properties column, and it's never going to get it right. Glue crawlers are in general pretty bad at things that aren't very basic (see this question for a similar use case to yours where Glue didn't work out).
I think you'll be fine with a manually created table that uses map<string,string> as the type for the properties column. When you know the type of a property and want to use it as that type you just cast the value at query time. An alternative is to use string as the type and use the JSON functions to extract values at query time.
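To illustrate the "keep everything as strings, cast when you need a type" idea outside of Athena, here is a small Python sketch that reads the sample line from the question, flattens properties into a string-to-string map, and casts individual values on use. The function name is made up for this example, and the snippet is only an analogy for what the map<string,string> column does; it is not Athena code.

```python
import json

SAMPLE_LINE = (
    '{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,'
    '"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}'
)


def load_record(line):
    """Parse one JSON line and store every property value as a string."""
    record = json.loads(line)
    props = record.get("properties") or {}
    # Strings stay as they are; other JSON values are re-serialized to text,
    # which mirrors treating the column as a map of string to string.
    record["properties"] = {
        str(key): value if isinstance(value, str) else json.dumps(value)
        for key, value in props.items()
    }
    return record


if __name__ == "__main__":
    rec = load_record(SAMPLE_LINE)
    # Cast at the point of use, once the expected type is known.
    event_ts = int(rec["properties"]["timestamp"])
    processed = rec["properties"]["processFlag"] == "true"
    print(event_ts, processed)
```

The benefit of this shape is that new, unknown property names never break the schema; only the code that consumes a specific property needs to know its type.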

Dataset shows only 5 event tables after re-linking Firebase with another Google Analytics account

I recently unlinked and re-linked a Firebase project with a different Google Analytics account.
The BigQuery integration configured to export GA data created a new dataset, and data started populating it.
The old dataset, corresponding to the unlinked "default" GA account, contained ~2 years of data and is still accessible in the BigQuery UI; however, only the 5 most recent events_ tables (5 days' worth of event data) are visible in the dataset.
Is it possible to extract historical data from the old, unlinked dataset?
What I would suggest is to run some queries to further validate the data you have in your BigQuery dataset.
I would start by listing the dates in each table, to see how many days of data the dataset contains.
SELECT event_date
FROM `firebase-public-project.analytics_153293282.events_*`
GROUP BY event_date ORDER BY event_date
EDIT
A better way to do this, and to list all the tables in the dataset, is to use the bq command-line tool; see the reference here.
bq ls firebase-public-project:analytics_153293282
You'll get something like this:
You could also do a COUNT(event_date) to see how many records you have per day, and compare this to what you can see in your Firebase project.
SELECT event_date, COUNT(event_date) ...
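Once you have the table list (for example, copied from the bq ls output), a short script can point out which days are missing. This is a sketch under the assumption that daily tables are named events_YYYYMMDD, as in the query above; the function name and the sample table IDs are made up for illustration.

```python
from datetime import date, timedelta


def missing_event_dates(table_ids, start, end):
    """Return the dates in [start, end] that have no events_YYYYMMDD table."""
    present = set()
    for table_id in table_ids:
        prefix, _, suffix = table_id.partition("_")
        # Only plain daily tables count; names such as events_intraday_... are skipped.
        if prefix == "events" and len(suffix) == 8 and suffix.isdigit():
            present.add(suffix)

    missing = []
    day = start
    while day <= end:
        if day.strftime("%Y%m%d") not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


if __name__ == "__main__":
    tables = ["events_20200101", "events_20200103", "events_intraday_20200104"]
    for day in missing_event_dates(tables, date(2020, 1, 1), date(2020, 1, 4)):
        print(day.isoformat())
```

The resulting list of dates tells you which days to look for when checking logs or attempting recovery.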
If data is missing, you could use table decorators to try to recover it; see the example here.
Regarding table expiration, you can see this. In short, a default expiration time can be set at the dataset level, and it applies to new tables (existing tables need their expiration time updated manually, one by one); an expiration time can also be set when a table is created. To see whether the expiration time was changed, you could search your logs for protoPayload.methodName="tableservice.update" and check whether an expireTime was set, as follows:
tableUpdateRequest: {
  resource: {
    expireTime: "2020-12-31T00:00:00Z"
    ...
  }
}
Besides this, if you have a GCP support plan, you could contact support for further assistance on what could have happened to the tables in that dataset. Otherwise, you could open an issue in the issue tracker. Keep in mind that Firebase doesn't delete your data when you unlink a Firebase project from BigQuery, so in theory the data should still be there.

Unloading & reloading data between S3 and Redshift with schema changes

I'm interested in setting up some automated jobs that will periodically export data from our Redshift instance and store it on S3, where ideally it will then be surfaced back into Redshift via an external table in Redshift Spectrum. One thing I'm not sure how best to handle is the case where certain tables I'm working with change schema over time.
I'm able to both UNLOAD data from Redshift to S3 without a problem, and I'm also able to set up an external table within Redshift and have that S3 data available for querying. However, I'm not sure how to best deal with cases where our tables will change columns over time. For example, in the case of certain event data we capture through Segment, traits that get added will result in a new column on the Redshift table that won't have existed in previous UNLOADs. In Redshift, the column value for data that came in before the column existed will just result in NULL values.
What is the best way to deal with this gradual change in data structure over time? If I just add the new fields to our external table, will Redshift be able to handle the fact that these fields don't exist in the older UNLOADs, or do I need to go some other route?