In Informatica, I am trying to get the date that falls a certain number of working days out (say 10, 20, or 30) based on another condition (say priority 1, 2, or 3). I already have a DIM_DATE table where holidays and working days are configured; there is no relation between the priority table and the DIM_DATE table. I am using an unconnected lookup with a SQL query override. Below is the query I used:
select day_date as DAY_DATE
--,rank1
--,PRIORITY_name
from (
    select day_date as DAY_DATE,
           DENSE_RANK() OVER (ORDER BY day_date) as RANK1,
           PRIORITY_name as PRIORITY_NAME
    from (
        select date_id, day_date
        from dim_date
        where day_date between to_date('10.15.2018','MM.DD.YYYY')
                           and to_date('10.15.2018','MM.DD.YYYY') + interval '250' DAY(3)
          and working_day = 1
    ),
    DIM_PRIORITY
    where DIM_PRIORITY.PRIORITY_name = '3'
) where rank1 = 10
order by RANK1 --
In this example I have hardcoded the day_date, priority_name, and rank1 values, but I need to pass all of them in as inputs coming from the mapping. The hardcoded query works, but when I pass an input such as ?created? it does not; here created is the date that comes from the mapping flow. Could you please suggest whether what I am trying is feasible? With ?created? I get a "missing right parenthesis" error, while the hardcoded query runs fine in SQL.
With an unconnected lookup you match your incoming port against one of the return fields of the records in the cache via the lookup condition, not by feeding ports into the override itself.
If that is not possible for you for some unexplained reason, then you could define three mapping variables, set each of them equal to the input ports you care about (using SETVARIABLE) before feeding the record into the lookup, and then use the variables in your lookup override, as sketched below.
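A rough sketch of that second approach, assuming the variables are declared as $$CREATED_DATE, $$PRIORITY_NAME, and $$RANK1 and that $$CREATED_DATE holds the date as an MM.DD.YYYY string (the variable names, port names, and format are illustrative, not from the original mapping):
-- in an Expression transformation upstream of the lookup call:
--   SETVARIABLE($$CREATED_DATE, TO_CHAR(CREATED, 'MM.DD.YYYY'))
--   SETVARIABLE($$PRIORITY_NAME, PRIORITY_NAME)
--   SETVARIABLE($$RANK1, TO_CHAR(RANK1))
select day_date as DAY_DATE
from (
    select day_date as DAY_DATE,
           DENSE_RANK() OVER (ORDER BY day_date) as RANK1,
           PRIORITY_name as PRIORITY_NAME
    from (
        select date_id, day_date
        from dim_date
        where day_date between to_date('$$CREATED_DATE','MM.DD.YYYY')
                           and to_date('$$CREATED_DATE','MM.DD.YYYY') + interval '250' DAY(3)
          and working_day = 1
    ),
    DIM_PRIORITY
    where DIM_PRIORITY.PRIORITY_name = '$$PRIORITY_NAME'
) where rank1 = $$RANK1
order by RANK1 --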
Currently I'm loading data from Google Cloud Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there can be cases where the same order has more than one version; the field etl_timestamp tells which row is the most up to date.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
select ...
from (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) as rn
    FROM `warehouse.stage_table_orders`
)
where rn = 1
Then the production_table_orders always contains the most updated version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Any suggestions?
We are doing the same. To help improve performance, though, try partitioning the table by date_purchased and clustering it by orderid.
Use a CTAS statement (writing back to the table itself), as you cannot add partitioning after the fact; a sketch follows.
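A minimal sketch of that one-time conversion, assuming date_purchased is a DATE column and the staging table lives in a dataset called warehouse (both assumptions, adjust to your schema):
CREATE OR REPLACE TABLE `warehouse.stage_table_orders`
PARTITION BY date_purchased      -- use DATE(date_purchased) here if the column is a TIMESTAMP
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;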
EDIT: use 2 tables and MERGE
Depending on your particular use case (i.e. the number of fields that could change between the old and the new version), you could use two tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses, but that is a one-time effort if the schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all. Search on SO for similar questions using ARRAY_AGG, EXISTS with DELETE, UNION ALL, etc. Try them out and see which performs better for YOUR dataset; the ARRAY_AGG variant is sketched below.
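For illustration, a sketch of that ARRAY_AGG approach against the staging table from the question (column names taken from the query above; performance should be verified on your own data):
SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
);
Each group keeps only the row with the newest etl_timestamp, which is the same de-duplication rule as the ROW_NUMBER() query.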
The problem is a little hard to explain, but I'll do my best. I'm building a GraphQL server in Django (using Graphene), and part of this is the need to batch up database queries to avoid n+1 problems. This means that when fetching, say, a set of Events, rather than making one query per Event ID, we make a single query with where id in (...).
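To make the batching concrete, here is the difference in rough SQL (the IDs are illustrative placeholders; the table name is the one used further down):
-- n+1 pattern: one round trip per Event
SELECT * FROM events_event WHERE id = 17;
SELECT * FROM events_event WHERE id = 42;
-- batched: a single round trip for the whole set
SELECT * FROM events_event WHERE id IN (17, 42);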
I'm trying to achieve the same with lists. My immediate use case is that every Event has a Venue, and I want to retrieve the next 5 (this is arbitrary) events for each Venue, in just one query.
(Note: for a full justification behind this, I wrote an article a while ago)
The PostgreSQL query I've come up with is this:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY venue_id
               ORDER BY starts_at, ends_at, id
           ) AS row_index
    FROM events_event
    WHERE starts_at >= '2018-05-20'
) x
WHERE row_index <= 5
I don't think I'll need case statements, because I don't think there's a realistic need to batch up subsequent pages -- just the initial ones.
So I have a few questions about this:
Is there a better way of writing this query?
Is this preferable to firing off multiple (realistically between 10 and 30) queries in parallel to do this the conventional way?
How can I achieve this query using the Django ORM (without dropping down to raw SQL)?
The third question is the one I'm banging my head over because the ORM seems to provide all the necessary building blocks (I'm using Django 2.0), but I can't figure out how to stick them all together.
This is what I have so far using Django:
Event.objects.annotate(
row_number=Window(
expression=RowNumber(),
partition_by=[F('venue_id')],
order_by=[F('starts_at').asc(), F('ends_at').asc(), F('id').asc()]
)
)
But I've been unable to find a way to filter on row_number, because I don't want to return everything -- just the first 5. When I try the following, Django tells me that filtering isn't allowed on Window clauses:
Event.objects.annotate(
row_number=Window(
expression=RowNumber(),
partition_by=[F('venue_id')],
order_by=[F('starts_at').asc(), F('ends_at').asc(), F('id').asc()]
)
).filter(row_number__lte=5)
I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means that, to detect "changed records since a given point in time", some other approach has to be taken. Depending on the information in the tables, different options can be used (each is sketched below):
- If a table explicitly contains a reference to the last change time, this can be used.
- If a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this could be used, e.g. read all records whose ID is larger than the last processed ID.
- If the table does not provide intrinsic information about change time, one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
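As a rough illustration (table and column names are hypothetical; :last_run_ts and :last_processed_id stand for values you persist between runs):
-- option 1: the table carries a last-change timestamp
SELECT * FROM orders WHERE changed_at > :last_run_ts;
-- option 2: monotonically increasing IDs and no in-place updates
SELECT * FROM orders WHERE order_id > :last_processed_id;
-- option 3: compare the current table against a snapshot of already-processed rows
SELECT * FROM orders
EXCEPT
SELECT * FROM orders_snapshot;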
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming, and expensive than using the corresponding features of ETL tools.
It is possible to create a log table, with columns organized according to your needs, and then create triggers on your database tables that write a log record with timestamp values. You can then query the log table to determine which records were inserted, updated, or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one; a sketch of a delete trigger is shown below.
You can organize your log table so that it can track more than one source table if you wish, just by keeping the table name, the PK fields and values, the operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each source table.
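For completeness, a minimal sketch of the corresponding AFTER DELETE trigger, reusing the same hypothetical SalaryLog columns as above (deleted values come from the OLD ROW reference):
CREATE TRIGGER "A00077387"."SALARY_A_DEL"
AFTER DELETE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW
FOR EACH ROW
BEGIN
    -- log the deleted row's values; 'D' marks a delete operation
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:myoldrow.Employee, :myoldrow.Salary, 'D', CURRENT_DATE);
END;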
I'm using Pentaho PDI 7.1. I'm trying to convert data from MySQL to MySQL while changing the structure of the data.
I'm reading the source table (customers), and for each row I have to run another query to calculate the balance.
I was trying to use the Database value lookup step to accomplish this, but maybe it is not the best way.
I have to run a query like this to get the balance:
SELECT SUM(CASE WHEN direzione = 'ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
WHERE contoFidelizzato_id = ?
I need to set the parameter from the value coming out of the previous step. Any advice?
The Database value lookup may be a good idea, especially if you are used to database reasoning, but it results in many queries, which may not be the most efficient approach.
A more PDI-ish style would be to write the query like this:
SELECT contoFidelizzato_id
, SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
GROUP BY contoFidelizzato_id
and use it as the info source of a Stream lookup step.
An even more PDI-ish style would be to split the source table (customer) into two flows: one in which you keep the source rows, and one which you group by contoFidelizzato_id. Of course, you then need a Formula step, some JavaScript, or a formula in the SQL of the Table input step to change the sign when needed, as sketched below.
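For that last variant, a minimal sketch of what the Table input SQL could look like (column names are taken from the query in the question; importo_signed is just an illustrative alias):
SELECT contoFidelizzato_id,
       CASE WHEN direzione = 'ENTRATA' THEN -importo ELSE importo END AS importo_signed
FROM Movimento
The grouped flow can then simply sum importo_signed per contoFidelizzato_id.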
Test to see which strategy is better in your case. You'll soon discover that PDI is very good at handling large data.
I need to convert this SQL query to Hibernate Criteria. Please help, guys.
SELECT NAME, COUNT(*) AS app
FROM device
GROUP BY NAME
ORDER BY app DESC
LIMIT 3
Try this code:
select device.name, count(device)
from Device device
group by device.name
order by count(device) desc
This assumes that you have an entity class called Device with a field name along with a getter method getName(). You may have to change the query depending on what your actual code is (which you never showed us).
The LIMIT clause you had is not applicable in HQL. Instead, you should call Query.setMaxResults(), for example:
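A minimal sketch of how that could look, assuming an open Hibernate Session named session and the Device entity described above:
// HQL version of the query, with LIMIT 3 replaced by setMaxResults(3)
List<Object[]> topThree = session.createQuery(
        "select device.name, count(device) "
      + "from Device device "
      + "group by device.name "
      + "order by count(device) desc", Object[].class)
    .setMaxResults(3)
    .getResultList();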