Kettle Pentaho - Getting records from two table which does not match on common Key (Merge) - kettle

I have two tables in a transformation and need the records that do not match on a common key. That is, I am joining table A and table B,
and from table A I need the records that are not present in table B.
It would be helpful if someone could tell me which step to use in Kettle Spoon for this transformation.

You can achieve this with the Merge Join step. Under Join Type choose LEFT OUTER. After this step your results will look like this:
key_a | value_a | key_b | value_b
1 | 1 | null | null
2 | 2 | null | null
3 | 3 | 3 | 3
Then choose the Filter rows step and set key_b as the field and the condition to IS NULL.
If you also need the records from table B that have no match in table A, set the Join Type to FULL OUTER.
If both tables live in the same database, so that a single connection can reach them, this can easily be achieved by using the Table input step and doing the join in the query itself:
SELECT table_a.key
, table_a.value
FROM table_a
LEFT JOIN table_b
ON table_a.key = table_b.key
WHERE table_b.key IS NULL
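The anti-join query above can be checked outside Kettle. The following is a minimal sketch that runs the same SQL against an in-memory SQLite database from Python; the helper name rows_only_in_a and the sample rows are assumptions made for illustration, chosen to mirror the key_a/key_b result table shown earlier.

```python
import sqlite3


def rows_only_in_a(a_rows, b_rows):
    """Return rows of table_a whose key has no match in table_b (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_a (key INTEGER, value INTEGER)")
    conn.execute("CREATE TABLE table_b (key INTEGER, value INTEGER)")
    conn.executemany("INSERT INTO table_a VALUES (?, ?)", a_rows)
    conn.executemany("INSERT INTO table_b VALUES (?, ?)", b_rows)
    # Same LEFT JOIN ... IS NULL pattern as the Table input query above.
    result = conn.execute(
        """
        SELECT table_a.key, table_a.value
        FROM table_a
        LEFT JOIN table_b ON table_a.key = table_b.key
        WHERE table_b.key IS NULL
        ORDER BY table_a.key
        """
    ).fetchall()
    conn.close()
    return result


# Sample data (assumed): keys 1 and 2 exist only in table A, key 3 is in both.
print(rows_only_in_a([(1, 1), (2, 2), (3, 3)], [(3, 3)]))
```

With this sample data only the rows for keys 1 and 2 come back, which is what the Merge Join plus Filter rows combination produces as well.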

Related

SQL Scalar Function values into power bi table

I have two SQL functions in Power BI
Each returns one number (say fn_1 = 3, fn_2 = 4). What I am trying to do is create a table with 2 columns and 1 row, with fn_1 as the first column and fn_2 as the second column:
|---------------------|------------------|
| fn_1 | fn_2 |
|---------------------|------------------|
| 3 | 4 |
|---------------------|------------------|
Tried duplicate, combine, aggregate, merge. They're all returning something different.
Thanks
EDIT: The only reason I want to do that is so I can combine them into 1 row as 3/4 (which I can do fine with a regular table) and display both values as 3/4 at the top of a report using a card.
For two values, I'd try some basic table constructions:
#table({"fn_1", "fn_2"}, {{fn_1, fn_2}})
or
Table.FromRecords({[fn_1 = fn_1, fn_2 = fn_2]})
or
Table.FromColumns({{fn_1}, {fn_2}}, {"fn_1", "fn_2"})

Querying on Bigquery repeated fields

Below is the schema of my BigQuery table. I am selecting sentence_id, store and BU_Model and inserting the data into another table in BigQuery. The data types of the columns generated in the new table are integer, repeated and repeated respectively.
I want to flatten/unnest the repeated fields so that they are created as STRING fields in my second table. How can this be achieved using Standard SQL?
+- sentences: record (repeated)
| |- sentence_id: integer
| |- autodetected_language: string
| |- processed_language: string
| +- attributes: record
| | |- agent_rating: integer
| | |- store: string (repeated)
| +- classifications: record
| | |- BU_Model: string (repeated)
The query that I am using to create the second table is as below. I would want to query on the BU_Model as a STRING column.
SELECT sentence_id ,a.attributes.store,a.classifications.BU_Model
FROM staging_table , unnest(sentences) a
Expected Output should look like:
Staging table:
41783851 regions Apparel
district Footwear
12864656 regions
district
Final Target Table:
41783851 regions Apparel
41783851 regions Footwear
41783851 district Apparel
41783851 district Footwear
12864656 regions
12864656 district
I tried the query below and it seems to work as expected, but this means that I would have to unnest every repeated field. My table in BigQuery has 50+ repeated columns. Is there an easier way around this?
SELECT
sentence_id,
flattened_stores,
flattened_Model
FROM `staging`
left join unnest(sentences) a
left join unnest(a.attributes.store) as flattened_stores
left join unnest(a.classifications.BU_Model) as flattened_Model
Assuming you still want three columns in your output, with the arrays flattened into strings:
SELECT sentence_id ,
ARRAY_TO_STRING(a.attributes.store, ',') store,
ARRAY_TO_STRING(a.classifications.BU_Model, ',') BU_Model
FROM staging_table , unnest(sentences) a
UPDATE to address recent changes in the question:
In BigQuery Standard SQL, LEFT JOIN UNNEST() (as in your last query) is the most reasonable way to get the result you want.
In BigQuery Legacy SQL, you can use the FLATTEN syntax, but it has the same drawback of having to repeat it for all 50+ columns.
Very simplified example:
#legacySQL
SELECT sentence_id, store, BU_Model
FROM (FLATTEN([project:dataset.stage], BU_Model))
Conclusion: I would go with LEFT JOIN UNNEST() approach
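To see why chaining LEFT JOIN UNNEST() produces the rows in the "Final Target Table" above, here is a small Python sketch of the row expansion. It is only an emulation of the semantics, not BigQuery code: the function name left_join_unnest is hypothetical, and None stands in for the NULL that LEFT JOIN yields when an array is empty.

```python
from itertools import product


def left_join_unnest(sentence_id, stores, models):
    """Emulate two chained LEFT JOIN UNNEST() clauses for one sentence record."""
    # An empty array still contributes one row, with NULL (None) in that column.
    store_values = stores or [None]
    model_values = models or [None]
    # Each unnest multiplies the row count: one output row per (store, model) pair.
    return [(sentence_id, s, m) for s, m in product(store_values, model_values)]


# Sample records taken from the staging table in the question.
for row in left_join_unnest(41783851, ["regions", "district"], ["Apparel", "Footwear"]):
    print(row)
for row in left_join_unnest(12864656, ["regions", "district"], []):
    print(row)
```

Every additional unnested array multiplies the number of output rows, which is worth keeping in mind before unnesting all 50+ repeated columns in one query.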

Efficient for/each loop to match phrases?

I am going to use a for/each loop, to search different names (table1) among textual information of records in another table (table2) using regular expressions.
SELECT id FROM "table1"
where tags ~* 'south\s?\*?africa'
or description ~* 'south\s?\*?south'
order by id asc;
but I do not know how to put it in a for each loop!
table1:
t1ID | NAME
1 | Shiraz
2 | south africa
3 | Limmatplatz
table2:
t2ID |TAGS | DESCRIPTIONS
101 |shiraz;Zurich;river | It is too hot in Shiraz and Limmatplatz
201 |southafrica;limmatplatz| we went for swimming
I have a list of names in table1. Another table has some text information that might contain those names.
I would like to get back the id of table2 that contains items in table1 with the id of the items.
For example:
t2id | t1id
101 |1
101 |3
201 |2
201 |3
My tables have 60,000 and 550,000 rows.
I need an approach that is efficient in terms of time!
You don't need a loop. A simple join works.
SELECT t2.id AS t2id, t1.id AS t1id
FROM table1 t1
JOIN table2 t2 ON t2.tags ~* replace(t1.name, ' ', '\s?\*?')
OR t2.description ~* replace(t1.name, ' ', '\s?\*?')
ORDER BY t2.id;
But performance will be terrible for big tables.
There are several things you can do to improve it:
Normalize table2.tags into a separate 1:n table.
Or an n:m relationship to a tag table if tags are used repeatedly (typical case). Details:
How to implement a many-to-many relationship in PostgreSQL?
Use trigram or textsearch indexes
PostgreSQL LIKE query performance variations
Use a LATERAL join to actually use those indexes.
LATERAL JOIN not using trigram index
Ideally, use the new capability in Postgres 9.6 to search for phrases with full text search. The release notes:
Full-text search can now search for phrases (multiple adjacent words)
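The matching rule used by the join above can be tried on the sample rows from the question. This is a rough Python sketch of the logic only, not a replacement for the indexed approaches listed above; the function names are hypothetical, and unlike the SQL it escapes each word with re.escape, which is an added assumption so that names containing regex metacharacters are matched literally.

```python
import re


def name_to_pattern(name):
    """Turn 'south africa' into a case-insensitive regex like south\\s?\\*?africa."""
    parts = [re.escape(part) for part in name.split(" ")]
    return re.compile(r"\s?\*?".join(parts), re.IGNORECASE)


def match_names(table1, table2):
    """Return sorted (t2id, t1id) pairs where a name occurs in tags or description."""
    patterns = [(t1id, name_to_pattern(name)) for t1id, name in table1]
    matches = []
    for t2id, tags, description in table2:
        for t1id, pattern in patterns:
            if pattern.search(tags) or pattern.search(description):
                matches.append((t2id, t1id))
    return sorted(matches)


table1 = [(1, "Shiraz"), (2, "south africa"), (3, "Limmatplatz")]
table2 = [
    (101, "shiraz;Zurich;river", "It is too hot in Shiraz and Limmatplatz"),
    (201, "southafrica;limmatplatz", "we went for swimming"),
]
print(match_names(table1, table2))
```

Like the plain join, this compares every name with every row, so it shows what should match rather than how to make it fast.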

Sync Framework: Map single table into multiple tables

I have two tables like the following:
On server:
| Orders Table | OrderDetails Table |
|--------------|--------------------|
| Id           | Id                 |
| OrderDate    | OrderId            |
| ServerName   | Product            |
|              | Quantity           |
On client:
| Orders Table | OrderDetails Table |
|--------------|--------------------|
| Id           | Id                 |
| OrderDate    | OrderId            |
|              | Product            |
|              | Quantity           |
|              | ClientName         |
I need to sync the [Server].[Orders Table].[ServerName] to [Client].[OrderDetails Table].[ClientName]
The Question:
What is the correct and efficient way of doing this?
I know that deprovisioning and provisioning with a different config is one way of doing it.
So I just want to know the correct way.
Thanks.
EDIT :
Other columns of each table should sync normally ([Server].[Orders Table].[Id] to [Client].[Orders Table].[Id] ...).
And the mapping strategy sometimes changes based on the row of data (which is sending/receiving).
Sync Fx is not an ETL tool. Simply put, its DB sync is per table.
If you really want to force it to do what you want, you can intercept the ChangesSelected event for the OrderDetails table, look up the extra column from the other table, and then dynamically add the column to the dataset before it gets applied on the other side.
See this link on how to manipulate the change dataset.

Compare Tables in BigQuery

How would I compare two tables (Table1 and Table2) and find all the new entries or changes in Table2.
Using SQL Server I can use
Select * from Table1
Except
Select * from Table2
Here is a sample of what I want:
Table1
A | 1
B | 2
C | 3
Table2
A | 1
B | 2
C | 2
D | 4
So, if I compare the two tables, I want my results to show the following:
C | 2
D | 4
I tried a few statements with no luck.
Now that I have your actual sample dataset, I can write a query that finds every domain in one table that is not on the other table:
https://bigquery.cloud.google.com/table/inbound-acolyte-377:demo.1024 has 24,729,816 rows. https://bigquery.cloud.google.com/table/inbound-acolyte-377:demo.1025 has 24,732,640 rows.
Let's look at everything in 1025 that is not in 1024:
SELECT a.domain
FROM [inbound-acolyte-377:demo.1025] a
LEFT OUTER JOIN EACH [inbound-acolyte-377:demo.1024] b
ON a.domain = b.domain
WHERE b.domain IS NULL
Result: 39,629 rows.
(8.1s elapsed, 2.04 GB processed)
To get the differences (given that tkey is your unique row identifier):
SELECT a.tkey, a.name, b.name
FROM [your.tableold] a
JOIN EACH [your.tablenew] b
ON a.tkey = b.tkey
WHERE a.name != b.name
LIMIT 100
For the new rows, one way is the one you proposed:
SELECT col1, col2
FROM table2
WHERE col1 NOT IN
(SELECT col1 FROM Table1)
(you'll have to switch to a JOIN EACH when Table1 gets too large)
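The two ideas above (a join on the key to find changed values, and an anti-join to find new rows) can be combined into one query. Below is a minimal sketch that runs against in-memory SQLite from Python rather than BigQuery; the column names k and v, the status labels and the helper name new_or_changed are assumptions for illustration, and the data is the Table1/Table2 sample from the question.

```python
import sqlite3


def new_or_changed(old_rows, new_rows):
    """Return rows of table2 that are missing from table1 or carry a different value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (k TEXT, v INTEGER)")
    conn.execute("CREATE TABLE table2 (k TEXT, v INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", old_rows)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", new_rows)
    # LEFT JOIN on the key: no partner means 'new', a differing value means 'changed'.
    result = conn.execute(
        """
        SELECT t2.k, t2.v,
               CASE WHEN t1.k IS NULL THEN 'new' ELSE 'changed' END AS status
        FROM table2 t2
        LEFT JOIN table1 t1 ON t2.k = t1.k
        WHERE t1.k IS NULL OR t1.v <> t2.v
        ORDER BY t2.k
        """
    ).fetchall()
    conn.close()
    return result


table1 = [("A", 1), ("B", 2), ("C", 3)]
table2 = [("A", 1), ("B", 2), ("C", 2), ("D", 4)]
print(new_or_changed(table1, table2))
```

This assumes the first column is a unique key in both tables; with duplicate keys the join would multiply rows and the result would need deduplicating.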