Querying on BigQuery repeated fields

Below is the schema of my BigQuery table. I am selecting sentence_id, store, and BU_Model and inserting the data into another table in BigQuery. The datatypes generated for the new table are integer, repeated, and repeated respectively.
I want to flatten/unnest the repeated fields so that they are created as STRING fields in my second table. How can this be achieved using standard SQL?
+- sentences: record (repeated)
| |- sentence_id: integer
| |- autodetected_language: string
| |- processed_language: string
| +- attributes: record
| | |- agent_rating: integer
| | |- store: string (repeated)
| +- classifications: record
| | |- BU_Model: string (repeated)
The query that I am using to create the second table is as below. I would want to query on the BU_Model as a STRING column.
SELECT sentence_id, a.attributes.store, a.classifications.BU_Model
FROM staging_table, UNNEST(sentences) a
Expected output should look like this:
Staging table:
sentence_id | store             | BU_Model
41783851    | regions, district | Apparel, Footwear
12864656    | regions, district |
Final Target Table:
sentence_id | store    | BU_Model
41783851    | regions  | Apparel
41783851    | regions  | Footwear
41783851    | district | Apparel
41783851    | district | Footwear
12864656    | regions  |
12864656    | district |
I tried the query below and it seems to work as expected, but this means I would have to unnest every repeated field this way. My table in BigQuery has 50+ columns which are repeated. Is there an easier way around this?
SELECT
  sentence_id,
  flattened_stores,
  flattened_Model
FROM `staging`
LEFT JOIN UNNEST(sentences) a
LEFT JOIN UNNEST(a.attributes.store) AS flattened_stores
LEFT JOIN UNNEST(a.classifications.BU_Model) AS flattened_Model

Assuming you still want three columns in your output, with the arrays flattened into strings:
SELECT sentence_id,
  ARRAY_TO_STRING(a.attributes.store, ',') store,
  ARRAY_TO_STRING(a.classifications.BU_Model, ',') BU_Model
FROM staging_table, UNNEST(sentences) a
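With the sample staging data above, this would return one row per sentence, with each array collapsed to a comma-separated string, along these lines:
41783851 | regions,district | Apparel,Footwear
12864656 | regions,district |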
UPDATE to address recent changes in the question:
In BigQuery Standard SQL, using LEFT JOIN UNNEST() (as you did in your last query) is the most reasonable way to get the result you want.
In BigQuery Legacy SQL you can use the FLATTEN syntax, but it has the same drawback of needing to be repeated for all 50+ columns.
Very simplified example:
#legacySQL
SELECT sentence_id, store, BU_Model
FROM (FLATTEN([project:dataset.stage], BU_Model))
Conclusion: I would go with the LEFT JOIN UNNEST() approach.
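To materialize the result as the second table, the same LEFT JOIN UNNEST() query can be wrapped in a CREATE TABLE ... AS SELECT; a minimal sketch, where the project and dataset names are placeholders:
-- project/dataset/table names below are placeholders
CREATE TABLE `project.dataset.final_target` AS
SELECT
  sentence_id,
  flattened_stores AS store,
  flattened_Model AS BU_Model
FROM `project.dataset.staging`
LEFT JOIN UNNEST(sentences) a
LEFT JOIN UNNEST(a.attributes.store) AS flattened_stores
LEFT JOIN UNNEST(a.classifications.BU_Model) AS flattened_Model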

Related

How to create a dimension table in PowerBI

I have 2 data sources, each with its own dimension and fact tables.
Both data sources have a dimension called "Country".
Dimension 1 for Data Source 1
Country | Country Order
Australia | 1
Singapore | 2
Indonesia | 3
..
Dimension 2 for Data Source 2
Country | Country Order
AUSTRALIA | 1
SINGAPORE | 2
INDONESIA | 3
..
Data Source 1's Dimension Table 1 has more countries than Data Source 2's Dimension Table 2.
I am building a dashboard and I want to use country as a common filter, meaning one filter that can filter 2 different fact tables.
I tried creating a new table containing one column with all the distinct values and creating relationships to both dimension tables, but it keeps prompting an error: circular dependency.
Are there any other methods that would work for this?
Thanks!
I find it best to create my dimension tables in Power Query (PQ). Create a new query named country_dimension that is an append of your two fact tables, remove all other columns, and then remove duplicates. You can then use this table in your model and you won't get any circular-dependency problems.
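Conceptually, that append-and-deduplicate step amounts to a UNION of the country columns; a rough SQL sketch with hypothetical table names:
-- hypothetical names; UNION (unlike UNION ALL) removes exact duplicates.
-- The two sources differ in letter case (Australia vs AUSTRALIA), so the
-- values may need normalizing, e.g. with UPPER(), before deduplicating.
SELECT UPPER(Country) AS Country FROM fact_table_1
UNION
SELECT UPPER(Country) AS Country FROM fact_table_2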

Using Dataprep to write to just a date partition in a date partitioned table

I'm using a BigQuery view to fetch yesterday's data from a BigQuery table and then trying to write into a date partitioned table using Dataprep.
My first issue was that Dataprep would not correctly pick up DATE type columns, but converting them to TIMESTAMP works (thanks Elliot).
However, when using Dataprep and setting an output BigQuery table you only have 3 options: Append, Truncate, or Drop the existing table. If the table is date-partitioned and you use Truncate, it will remove all existing data, not just the data in that partition.
Is there another way to do this that I should be using? My alternative is using Dataprep to overwrite a table and then using Cloud Composer to run some SQL pushing this data into a date partitioned table. Ideally, I'd want to do this just with Dataprep but that doesn't seem possible right now.
BigQuery table schema:
Partition details:
The data I'm ingesting is simple. In one flow:
+------------+--------+
| date | name |
+------------+--------+
| 2018-08-08 | Josh1 |
| 2018-08-08 | Josh2 |
+------------+--------+
In the other flow:
+------------+--------+
| date | name |
+------------+--------+
| 2018-08-09 | Josh1 |
| 2018-08-09 | Josh2 |
+------------+--------+
It overwrites the data in both cases.
You can create a partitioned table based on a DATE column. Data written to a partitioned table is automatically delivered to the appropriate partition based on the date value (expressed in UTC) in the partitioning column.
Append the data to have the new rows added to the matching partitions.
You can create the table using the bq command (the target table reference, shown here as a [DATASET].[TABLE] placeholder, goes last):
bq mk --table --expiration [INTEGER] --schema [SCHEMA] --time_partitioning_field date [DATASET].[TABLE]
time_partitioning_field defines which field is used for the partitions.
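As a minimal illustration of the automatic routing (dataset and table names are hypothetical), appending rows with plain DML lands each row in its own date partition:
-- each row is routed to the partition matching its date value
INSERT INTO `mydataset.mytable` (date, name)
VALUES (DATE '2018-08-09', 'Josh1'),
       (DATE '2018-08-09', 'Josh2')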

Efficient for/each loop to match phrases?

I am going to use a for/each loop to search for different names (table1) in the textual information of records in another table (table2), using regular expressions.
SELECT id FROM "table1"
where tags ~* 'south\s?\*?africa'
or description ~* 'south\s?\*?africa'
order by id asc;
but I do not know how to put it in a for/each loop!
table1:
t1ID | NAME
1    | Shiraz
2    | south africa
3    | Limmatplatz
table2:
t2ID | TAGS                    | DESCRIPTIONS
101  | shiraz;Zurich;river     | It is too hot in Shiraz and Limmatplatz
201  | southafrica;limmatplatz | we went for swimming
I have a list of names in table1. Another table has some text information that might contain those names.
I would like to get back the id of each table2 row that contains an item from table1, together with the id of that item.
For example:
t2id | t1id
101  | 1
101  | 3
201  | 2
201  | 3
My tables have 60,000 and 550,000 rows.
I need an approach that is efficient time-wise!
You don't need a loop. A simple join works.
SELECT t2.t2id, t1.t1id
FROM table1 t1
JOIN table2 t2 ON t2.tags ~* replace(t1.name, ' ', '\s?\*?')
              OR t2.descriptions ~* replace(t1.name, ' ', '\s?\*?')
ORDER BY t2.t2id;
But performance will be terrible for big tables.
There are several things you can do to improve it:
Normalize table2.tags into a separate 1:n table.
Or an n:m relationship to a tag table if tags are used repeatedly (typical case). Details:
How to implement a many-to-many relationship in PostgreSQL?
Use trigram or text search indexes
PostgreSQL LIKE query performance variations
Use a LATERAL join to actually use those indexes.
LATERAL JOIN not using trigram index
Ideally, use the new capability in Postgres 9.6 to search for phrases with full text search. The release notes:
Full-text search can now search for phrases (multiple adjacent words)
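A rough sketch of that phrase-search approach, reusing the question's table and column names (this query is not from the original answer):
-- phraseto_tsquery() (Postgres 9.6+) builds a phrase query, so 'south africa'
-- only matches where the two words are adjacent; pair this with a GIN index
-- on the tsvector expression for speed
SELECT t2.t2id, t1.t1id
FROM table1 t1
JOIN table2 t2
  ON to_tsvector('simple', t2.tags || ' ' || t2.descriptions)
     @@ phraseto_tsquery('simple', t1.name)
ORDER BY t2.t2id;
Note that a concatenated tag like southafrica is a single token and will not match the phrase south africa, which is one more reason to normalize the tags as suggested above.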

Sync Framework: Map single table into multiple tables

I have two tables like the following:
On server:
| Orders Table | OrderDetails Table
-------------------------------------------------------------------------------------
| Id | Id
| OrderDate | OrderId
| ServerName | Product
| Quantity
On client:
| Orders Table | OrderDetails Table
-------------------------------------------------------------------------------------
| Id | Id
| OrderDate | OrderId
| Product
| Quantity
| ClientName
I need to sync [Server].[Orders Table].[ServerName] to [Client].[OrderDetails Table].[ClientName].
The Question:
What is the correct and efficient way of doing this?
I know that deprovisioning and provisioning with a different config is one way of doing it, so I just want to know the correct way.
Thanks.
EDIT:
The other columns of each table should sync normally ([Server].[Orders Table].[Id] to [Client].[Orders Table].[Id], ...).
And the mapping strategy sometimes changes based on the row of data (which side is sending/receiving).
Sync Fx is not an ETL tool; simply put, its DB sync is per table.
If you really want to force it to do what you want, you can intercept the ChangesSelected event for the OrderDetails table, look up the extra column from the other table, and then dynamically add the column to the dataset before it gets applied on the other side.
See this link on how to manipulate the change dataset.

Kettle Pentaho - Getting records from two tables which do not match on a common key (Merge)

I have two tables in a transformation and I need to get the data that does not match on a common key, i.e. I am doing a join on tables A and B, and from table A I need those records which are not present in table B.
It would be helpful if someone could tell me which step I can use in Kettle Spoon to do the above transformation.
You can achieve this with the Merge Join step. Under Join Type choose LEFT OUTER. After this step your results will look like this:
key_a | value_a | key_b | value_b
1     | 1       | null  | null
2     | 2       | null  | null
3     | 3       | 3     | 3
Then choose the Filter rows step and set key_b as the field and the condition to IS NULL.
If you also need records where key_a does not match key_b, choose the Join Type as FULL OUTER.
If both your tables are in a database of the same type, this can easily be achieved by using the Table input step and doing the join in the query itself:
SELECT table_a.key
, table_a.value
FROM table_a
LEFT JOIN table_b
ON table_a.key = table_b.key
WHERE table_b.key IS NULL
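For the FULL OUTER case mentioned above, the equivalent SQL would look roughly like this (same hypothetical table and column names as the query above):
-- keep only the rows that have no match on the other side
SELECT COALESCE(table_a.key, table_b.key) AS key
     , table_a.value AS value_a
     , table_b.value AS value_b
FROM table_a
FULL OUTER JOIN table_b
ON table_a.key = table_b.key
WHERE table_a.key IS NULL
OR table_b.key IS NULL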