Creating an External Table with Partitions in GCP

I am trying to create an external table with a partition; below is the reference image I am using.
Here is what I am intending to do:
I have files flowing into this folder:
I need to query the external table based on the date, e.g.:
select * from where _PartitionDate ='';
My specific question is: what should I fill in for the GCS bucket & Source data partitioning fields?
Thank you.

According to the documentation that Guillaume provided [1], you should tick the Source data partitioning box and provide the following URI prefix there:
gs://datalake-confidential-redacted/ExternalTable_Data/
Also, the Table type should be External table.
Once that is fixed, you should be able to create the table. I have reproduced the issue on my own and it is working.
[1] https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#hive-partitioning-options

This part of the documentation should help you. You need to check Source data partitioning and then fill in your prefix URI, such as
gs://datalake-confidential-redacted/ExternalTable_Data/{dt:DATE}
And then, use this dt field as any field in your queries
SELECT *
FROM `external-table`
WHERE dt = "2020-01-10"
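If you would rather define the same thing in SQL instead of through the console, a minimal DDL sketch is shown below. The project, dataset, and table names are hypothetical, and it assumes the files under the prefix are laid out in hive-style folders (e.g. dt=2020-01-10/); declaring the partition column explicitly corresponds to the CUSTOM detection mode:
CREATE EXTERNAL TABLE `my-project.my_dataset.external_partitioned`
WITH PARTITION COLUMNS (
  dt DATE  -- exposed as a queryable column, as in the query above
)
OPTIONS (
  format = 'CSV',
  hive_partition_uri_prefix = 'gs://datalake-confidential-redacted/ExternalTable_Data/',
  uris = ['gs://datalake-confidential-redacted/ExternalTable_Data/*']
);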

The console's Create table wizard has an issue with this approach. Once we used Terraform scripts it was successful. It requires setting the Hive partitioning mode to CUSTOM; once the date column is defined, it is added as a column to the table, thereby allowing it to be queried.

Related

Loading multiple files from multiple paths to BigQuery

I have a file structure such as:
gs://BUCKET/Name/YYYY/MM/DD/Filename.csv
Every day my Cloud Functions create another path with another file in it, corresponding to the date of the day (so for today, the 5th of August, we would have gs://BUCKET/Name/2022/08/05/Filename.csv).
I need to find a way to query this data in BigQuery automatically, so that if I want to query it for 'manual inspection' I can select, for example, data from all 3 months in one query by doing CREATE TABLE with gs://BUCKET/Name/2022/{06,07,08}/*/*.csv.
How can I replicate this? I know that BigQuery does not support more than one wildcard, but maybe there is a way to do so.
To query data inside GCS from BigQuery you can use an external table.
The problem is that the following will fail, because you cannot have a comma (,) inside a URI in the uris list, so the {1,2,3} expansion is not supported:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/2022/{1,2,3}/data.csv']
)
You have to specify the 3 CSV file locations like this:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format = 'CSV',
  uris = [
    'gs://inigo-test1/2022/1/data.csv',
    'gs://inigo-test1/2022/2/data.csv',
    'gs://inigo-test1/2022/3/data.csv']
)
Since you're using this sporadically, it probably makes more sense to create a temporary external table.
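Alternatively, since external table URIs do allow a single * wildcard, you could cover all months with one URI, assuming every object under the 2022/ prefix is a CSV you want included; a rough sketch:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format = 'CSV',
  -- the single wildcard matches every object under the 2022/ prefix
  uris = ['gs://inigo-test1/2022/*']
)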
I found a solution that works, at least for my use case, without using an external table.
During the creation of the table in the BigQuery dataset, use "Create table from: Google Cloud Storage", and for the URI pattern I used gs://BUCKET/Name/2022/*. As long as the filename is the same in each subfolder and the schema is identical, BQ will load everything, and then you can perform date operations directly in BQ (I have a column with the ingestion date).
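For reference, the SQL equivalent of that console flow is a LOAD DATA statement; a minimal sketch, assuming a destination table name of your choice and that the detected schema matches across files:
LOAD DATA INTO `my_dataset.name_table`
FROM FILES (
  format = 'CSV',
  uris = ['gs://BUCKET/Name/2022/*']
);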

In BigQuery, why can I still query the geo_census_blockgroups table but cannot find it in the public data?

In BigQuery, I can run this query fine:
SELECT
  a.geo_id,
  a.total_pop,
  a.white_pop,
  a.black_pop,
  a.hispanic_pop,
  a.asian_pop,
  a.amerindian_pop,
  a.other_race_pop,
  a.two_or_more_races_pop,
  b.blockgroup_geom
FROM `bigquery-public-data.census_bureau_acs.blockgroup_2010_5yr` a
JOIN `bigquery-public-data.geo_census_blockgroups.us_blockgroups_national` b
  USING (geo_id)
LIMIT 100
But when I search for the table geo_census_blockgroups, I can't find it.
If I search for 'boundaries', I cannot see block groups or census tracts.
Is geo_census_blockgroups being phased out or why doesn't it appear more easily in public data searches?
The current BigQuery interface which you are using is in Preview.
As a workaround, you can switch to the legacy BQ interface to search for the geo_census_blockgroups dataset in bigquery-public-data by following the steps below:
Go to the BigQuery UI.
Select Disable Editor tabs.
Opt out, and the BigQuery UI will be switched to the legacy BigQuery interface.
Search for the dataset in the search bar using keywords, and the dataset will appear.
Also, for reference, you can check the public issue tracker; any update related to this will be provided there.
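Independently of what the search UI shows, you can confirm the dataset still exists by listing its tables through INFORMATION_SCHEMA; a minimal sketch:
SELECT table_name, table_type
FROM `bigquery-public-data.geo_census_blockgroups.INFORMATION_SCHEMA.TABLES`
ORDER BY table_name;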

Groupby existing attribute present in json string line in apache beam java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read the files.
Group by customer ID.
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one.
Flatten it, convert it to a TableRow, and insert it into BQ.
But I am unable to get past the grouping step. I see GroupByKey.create() but am unable to make it use the customer ID as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...)
// Once we have the data in JsonObjects, we key by customer ID:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                      TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check the timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.

How does AWS Athena react to schema changes in S3 files?

What happens when, after creating the table in AWS Athena for files on S3, the structure of the files on S3 changes?
For example:
If the files previously had 5 columns when the table was created, and later the new files start having 1 more column:
a) at the end?
b) in between?
What happens when some columns are not available in new files?
What happens when the columns remain the same but the column order changes?
Can we alter Athena tables to adjust to these changes?
1 - Athena is not a NoSQL solution. It does not have a dynamic schema either. If you change the schema, all your files in a particular folder should reflect that change; Athena won't magically update to include it.
2 - Then it'll be a problem and it'll break. You should include NULL or ,, to force it to be okay.
3 - Athena picks columns up by order, not really by name. If your column order changes, it'll probably break (different types).
4 - Yes. You can always easily recreate an Athena table by dropping it and creating a new one (see the sketch below).
If you have variable-length files, then you should put them into different folders so that each folder represents one consistent schema. You can then unify this later on in Athena with a union or similar to create a condensed, simplified table that you can apply the consistent schema to.
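If you do recreate the table for the simple case of a new column appended at the end, a rough Athena DDL sketch follows; the database, table, column, and bucket names are hypothetical placeholders:
DROP TABLE IF EXISTS mydb.my_table;
CREATE EXTERNAL TABLE mydb.my_table (
  col1 string,
  col2 string,
  col3 string,
  col4 string,
  col5 string,
  col6 string  -- the extra column that newer files append at the end
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-consistent-schema-folder/';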
It depends on the file format you are using and on the setup (whether the schema is matched by field order or by field name). All the details are here: https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
Take note that if the data is nested or in arrays, a schema change will completely break your data; to quote from that page:
Schema updates described in this section do not work on tables with complex or nested data types, such as arrays and structs.
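For flat data in a format that is read by column position (such as CSV), appending a column at the end can also be handled in place with an ALTER TABLE; a one-line sketch reusing the hypothetical names from the sketch above:
ALTER TABLE mydb.my_table ADD COLUMNS (col6 string);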

Unable to add duplicate entry through informatica

I am using Informatica to write to an Oracle table after performing certain logical operations on the data.
The problem is that if a certain ID was already previously processed and is present in the target table then it is not inserted again.
Please suggest a workaround.
This is because you might have the same primary key in the source that is already present in the target. Look into the primary key columns and try loading them.
Try altering your target table:
ALTER TABLE target_table_name DROP CONSTRAINT constraint_name;
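If you don't know the constraint name, you can look it up in Oracle's data dictionary first; a small sketch (the table name is a hypothetical placeholder and is stored in upper case by default):
SELECT constraint_name, constraint_type
FROM user_constraints
WHERE table_name = 'TARGET_TABLE_NAME'
  AND constraint_type IN ('P', 'U');
Keep in mind that dropping the constraint allows genuine duplicates into the target, so an update-else-insert (upsert) strategy in the mapping may be the safer long-term fix.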