Applying Left Join in Pentaho - Kettle

I'm trying to create a transformation that merges two databases using a Merge Join step, based on queries like the ones below, and I'm a little confused about what I should fill in for First Step and Second Step (the lookup) for each query format.
Query format:
SELECT * FROM A a LEFT JOIN B b on a.value=b.value
SELECT * FROM A a LEFT JOIN B b on b.value=a.value

There are various ways to do it.
Write the SQL with the join in the Table Input step. Quick and dirty solution if your tables are in the same database, but do not tell a PDI expert you did it that way.
If you know there is only one B record for each A record, use a Stream Lookup step. Very, very, very efficient. The main flow is A and the lookup step is B.
If you have many B records for each A record, use a Join Rows step. Don't be afraid: you do not really make a full Cartesian product, as you can put your condition a.value=b.value in the step.
In the same situation, you can also use a Merge Join. The first step is the step you write first in the SQL select statement, so for your queries A is the first step and B is the second.

Multiple ways to do this.
You can use the Table Input step and simply write your query there. No need to do anything else to implement the above query; see the sketch below.
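For example, a minimal sketch of the Table Input query, assuming both tables are reachable from the same database connection (table and column names are illustrative):
-- Left join done inside the database, returned as one stream to PDI
SELECT a.*, b.*
FROM A a
LEFT JOIN B b ON a.value = b.value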


Join 2 tables in Power BI

I need help on this issue as I don't have any experience in Power BI. I want to join 2 tables in Power BI that share the same column, Part_Number. How can I make these 2 tables match by Part Number and return the values?
Recon Table
Inventory Table
I would like to have Part Number, Part Name, QTY, and Total Quantity as the result. Hope I can get the clarification I need. Thanks a lot!
For this case you simply must merge the tables. It doesn't look like you have done a lot of research on the matter though, so it's hard to understand exactly what you need help with.
To merge your two tables in Power Query, I would right-click in the left-hand Queries pane and select Merge Queries as New.
After that you simply follow the on-screen instructions and select your two tables and their respective key columns. After merging you can choose to disable load of your two original tables to save space in your data model, but this depends on your requirements.
If this were my data model, I would think about whether joining these tables is necessary at all, instead of using these two tables as fact tables and creating a third table to handle the part number dimension with the associated part metadata.
Read the docs: Merge queries in Power Query
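Conceptually, the merge you set up in Power Query corresponds to a left join like the following SQL sketch (which table holds which column is an assumption based on the question):
-- Hypothetical table/column names taken from the question
SELECT r.Part_Number,
       i.Part_Name,
       r.QTY,
       i.Total_Quantity
FROM Recon r
LEFT JOIN Inventory i ON r.Part_Number = i.Part_Number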

Redshift Spectrum - Referencing an external table in a CTE?

I'm trying to make some data available via Redshift Spectrum to our reporting platform. I chose Spectrum because it offers lower latency to our data lake vs a batched ETL process.
One of the queries I have looks like this
with txns as (select * from spectrum_table where ...)
select field1, field2, ...
from txns t1
left join txns t2 on t2.id = t1.id
left join txns t3 on t3.id = t1.id
where...
Intuitively, this should cache the Spectrum query output in memory via the CTE and make it available later in the query without hitting S3 a second (or third) time.
However, I checked the explain plan, and with each join the number of "S3 Seq Scan"s goes up by one. So it appears to do the Spectrum scan each time the CTE is queried.
Questions:
Is this actually happening? Or is the explain plan wrong? The run-time of this query doesn't appear to increase linearly with the number of joins, so it's hard to tell.
If it is happening, what other options are there to achieve this sort of result? Other than manually creating a temp table (this will be accessed by a reporting tool, so I'd prefer to avoid allowing explicit write access or requiring multiple statements to get the data)
Thanks!
Yes, this is really happening. CTE references are not reused - this is because different data may be needed at each of the references, and applying WHERE clauses at the table scan is an important performance feature.
You could look into using a materialized view, but I expect that you are applying the WHERE clauses in the CTE dynamically, so this may not match your need. If it were me, I'd want to understand why the triple self-join is needed. It seems like there may be a better way to construct the query, but that is just a gut feel.
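A minimal sketch of the materialized-view idea, assuming the filter can be fixed up front (schema, table, and column names are illustrative, and materialized views over external tables come with their own restrictions):
-- Precompute the filtered Spectrum subset once inside Redshift
CREATE MATERIALIZED VIEW txns_mv AS
SELECT id, field1, field2
FROM spectrum_schema.spectrum_table
WHERE load_date >= '2023-01-01';

-- Refresh on your own schedule, then point the self-joins at txns_mv
REFRESH MATERIALIZED VIEW txns_mv;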

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1,000,000 records to only the subset I need (+/- 500) by creating an Inner Join to a different table. Unfortunately it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with the related table. I have no access to the database, unfortunately, otherwise I would have written the SQL myself. Is there a way to make merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query - right-click on its last step and check whether View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.
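For reference, when folding succeeds the merge is pushed down to the source as a single statement, conceptually something like the SQL below (illustrative table and column names; the actual generated query will differ):
-- Only the ~500 matching rows ever leave the server
SELECT big.*
FROM dbo.BigTable AS big
INNER JOIN dbo.FilterTable AS f ON big.KeyColumn = f.KeyColumn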

SAS Data Integration - Create a physical table from metadata structure

I need to use an Append object after a series of joins that have a conditional run. So a join step may not execute if its condition is not verified, and its physical work dataset will not be created.
The problem is that the Append step throws an error if one or more input physical datasets were not created.
Is there a smart way to create an empty physical table from the metadata structure of the joins' work tables, or to use the Append step with some non-created datasets?
A CREATE TABLE with the list of all the fields is not a real solution, because I would have to replicate it for 8 different joins and then replicate the job 10 times...
Thanks to all
Roberto
Thank you for your comments.
What you should do:
Amend your conditional node so that, on a positive condition, it creates a global macro variable with the value MAX, and on a negative condition it creates the same variable with the value 0.
Replace the offending SQL step with a "CREATE TABLE" node.
In the options for "CREATE TABLE", specify the macro variable for "MAXIMUM OUTPUT ROWS (OUTOBS)". See the picture below for an example of those options.
So now, when your condition is not met, you will always end up with an empty table. When the condition is met, the step executes normally.
I must say my version of DI Studio is a bit old. In my version the SQL node doesn't allow passing macro variables to SQL options; only integers can be typed in. Check if your version allows it, because if it does, then you can amend the existing SQL step and avoid replacing it with another node.
One more thing: you will get a warning when the OUTOBS option is less than the size the resulting dataset would otherwise be.
Let me know if you have any questions.
See the picture for the CREATE TABLE options.
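For readers without the screenshot, the effect of that option when the condition is false can be sketched directly in PROC SQL (dataset and column names are illustrative; the 0 comes from the macro variable described above):
proc sql outobs=0;
  /* OUTOBS=0 limits the output to zero rows, so the table is created
     with the join's column structure but no data */
  create table work.join_out as
  select a.*, b.extra_field
  from work.left_in a
  left join work.right_in b
    on a.key = b.key;
quit;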
In the end I created another step that extracts 0 rows from the source table using the condition 1=0 in the WHERE tab. In this way I have an empty table that I can use with a DATA step / SET statement in the post-SQL of the conditional run if the work table of the join does not exist.
This is not a real solution, but a valid workaround.
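A sketch of that workaround, expressed as the query the extra step runs (names are illustrative):
proc sql;
  /* WHERE 1=0 matches nothing, so the table keeps the source's
     column structure but contains no rows */
  create table work.empty_copy as
  select *
  from work.source_table
  where 1 = 0;
quit;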

Efficient way to merge/join 2 large datasets in SAS 8.2

I have tried the following options with unacceptable response times - creating an index on 'key' did not help either (NOTE: there are duplicate 'keys' in both datasets):
data a;
  merge b (in=inb)
        c;
  by key;
  if inb;   /* keep only rows that come from b (left join on b) */
run;
=== OR ===
proc sql;
  create table a as
  select *
  from b
  left outer join c
    on b.key = c.key;
quit;
You should first sort the two datasets before merging them. This is what will give you the performance. Using an index when you have to scan the whole table to get a result is usually slower than presorting the datasets and merging them.
Be sure to trim your datasets as much as possible. Sort your datasets before the data step or PROC SQL. Also, I'm not 100% sure it matters, but the ANSI-style SQL would be: proc sql; create table a as select * from b left outer join c on b.key = c.key; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good one. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your datasets and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
proc sql;
  create table newdata as
  select firsttable.*, secondtable.*
  from table1 as firsttable
  inner join table2 as secondtable
    on (firsttable.keyfield = secondtable.keyfield);
quit;
If you have more than one key field, the join order should be from the field with the fewest matches first to the field with the most matches last. SAS has a bad habit of creating a temporary dataset containing all possible matches and then sieving it down from there. That can blow out your temporary space allocation and slow everything down.
If you still wish to use a DATA step, then you need to get the duplicate keys out of one of the datasets.