Lead and Lag Function in informatica

Lead and Lag Function in informatica - informatica

How can we use Lead and Lag functions in Informatica?
Name | No.
------------
X | 100
Y | 200
Z | 300
I have to convert it to:
Name | No. | Lead(No.)
-----------------------------
X | 100 | 200
Y | 200 | 300
Z | 300 | 100
Name | No. | Lag(No.)
----------------------------
X | 100 | 0
Y | 200 | 100
Z | 300 | 200
The logic I used was:
EXP Transformation
Name (input & Output Port)
No. (input)
O_No.(VAR)=IIF(Prv_no IS NULL,0,No.)
Prv_no.(VAR)=No.
This was for Lag Function.

Never done that, but I would use the order of evaluation. In a EXP transformation, input port are evaluated before variable port that in turn are evaluated before output port. Also the row are readed one at a time.
If you send to the EXP TRAN sorted data, you could simulate lag() function. For Lead() you should reverse the sort.

Multiple ways-
You can use RANK transformation,check the rank (R) option for No..Set Top/Bottom as Top and Number of Ranks as 1 for
getting the lead value.**Set Top/Bottom as Bottom and Number
of Ranks as 1 for getting the lag value. Then using these
values in expression transformation, we can implement lead/lag
function.
2.Use sql query in source qualifier, select MAX() and MIN() to get the respective lead and lag values.

You can use Informatica Expressions to achieve lead and lag. For every incoming row make 2 output ports one for lead and one for lag calculation. Use mapping parameter for configuraing lead and lag variable for the run. After expression connect the ports to two different targets.

Related

Plot Min, Max, Average, Median values into a horizontal enumerated line in Power BI Desktop

I am trying to do a trivial task with Power BI Desktop. I have the following kind of data
| Name | Min | Max | Average | Median |
|-------- |----- |------- |--------- |-------- |
| team A | 0 | 3,817 | 120 | 120 |
| team B | -10 | 1,050 | 25 | 89 |
| team C | 5 | 14,320 | 50 | 48 |
And I want to create my own horizontal line with pre-defined (Start, End) points to plot for each team name the values of the Min, Max, Average, Median. And I filter the team name to adjust the numbers and the visual accordingly.
So far I have done the following static approach
The example above is totally non-dynamic because every point on the line is set by me. Also if for example, I select Team B with a higher median than average then the above visual line does not change the position of the relative spheres (in the image I posted, I have placed average always higher than the median which is not true for all the teams).
Thus, I would like to know if there is any fancy and well-plotted way to represent those 4 descriptive measures for a team name in a horizontal line that will respond when I use a different team. As I have noted on the image attached, the card visuals change when I change the team name. But the spheres do not move across the line.
My desired output
For Team B
While for Team C
I literally don't know if this is feasible in Power BI apart from the static approach I already did. Thank you in advance.
Regards.

To increment the value whenever the parameter crosses the threshold in power BI using DAX

I am using DirectQuery in Power BI and I want to create a measure for calculating the number of occurrences a particular parameter has crossed its threshold (say 150). If it has crossed twice till now , the value should be 2.
I am a newbie, if anybody could help it would be great!
The data set looks like (with threshold 150)
Parameter | Value | Count
A | 240 | 1
A | 245 | 2
A | 110 | 2
A | 160 | 3

What have you tried so far?
A simple measure like the following should give you what you need:
CALCULATE(COUNT('Table'[Value]), 'Table'[Value] > 150)

How can I visualize timeseries data aggregated by more than one dimension on AWS insights?

I'd like to use cloudwatch insights to visualize a multiline graph of average latency by host over time. One line for each host.
This stats query extracts the latency and aggregates it in 10 minute buckets by host, but it doesn't generate any visualization.
stats avg(latencyMS) by bin(10m), host
bin(10m) | host | avg(latencyMS)
0m | 1 | 120
0m | 2 | 220
10m | 1 | 130
10m | 2 | 230
The docs call this out as a common mistake but don't offer any alternative.
The following query does not generate a visualization, because it contains more than one grouping field.
stats avg(myfield1) by bin(5m), myfield4
aws docs
Experementally, cloudwatch will generate a multi line graph if each record has multiple keys. A query that would generate a line graph must return results like this:
bin(10m) | host-1 avg(latencyMS) | host-2 avg(latencyMS)
0m | 120 | 220
10m | 130 | 230
I don't know how to write a query that would output that.

Parse individual message for each host then compute their stats.
For example, to get average latency for responses from processes with PID=11 and PID=13.
parse #message /\[PID:11\].* duration=(?<pid_11_latency>\S+)/
| parse #message /\[PID:13\].* duration=(?<pid_13_latency>\S+)/
| display #timestamp, pid_11_latency, pid_13_latency
| stats avg(pid_11_latency), avg(pid_13_latency) by bin(10m)
| sort #timestamp desc
| limit 20
The regular expressions extracts duration for processes having id 11 and 13 to parameters pid_11_latency and pid_13_latency respectively and fills null where there is no match series-wise.
You can build from this example by creating the match regular expression that extracts for metrics from message for hosts you care about.

How to calculate time varying historical mean with Stata

How can I calculate the mean of X using an expanding window with at least four observations?
Here is a numeric example:
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
50.823169
49.099351
48.949562
47.410434
46.654168
44.924652
43.807024
45.679814
48.366395
49.883396
48.230502
49.869179
53.942757
56.167884
56.226512
56.25608
58.765728
62.077038
62.780799
61.858235
61.167646
60.671859
60.480263
60.226433
61.65349
60.769882
61.497553
60.146182
60.292934
60.173739
58.60077
58.445601
60.404868
end

Time-varying means in an expanding time window can be phrased otherwise as to imply the mean of all values from the start of records to the current date. You don't give a time variable so I assume data are in order and supply a time variable.
The community-contributed command rangestat (to be installed from SSC using ssc install rangestat) can give the mean of all values to date in this way:
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
50.823169
49.099351
48.949562
47.410434
end
gen t = _n
rangestat (count) X (mean) X, int(t . 0)
list
+-------------------------------------+
| X t X_count X_mean |
|-------------------------------------|
1. | 50.73547 1 1 50.73547 |
2. | 48.27841 2 2 49.506941 |
3. | 42.80767 3 3 47.273851 |
4. | 49.24785 4 4 47.767351 |
5. | 52.20223 5 5 48.654327 |
|-------------------------------------|
6. | 49.72669 6 6 48.833054 |
7. | 50.82317 7 7 49.117356 |
8. | 49.09935 8 8 49.115105 |
9. | 48.94956 9 9 49.096711 |
10. | 47.41043 10 10 48.928084 |
+-------------------------------------+
Evidently you can ignore results for small counts as you please.
The syntax is naturally explained in the help for rangestat: suffice it to say here that the syntax for the option -- namely interval(t . 0) -- is three-fold:
for the time variable t
and two offsets
backwards as far as possible: system missing . here means arbitrarily large
forwards just 0
In mathematical terms the mean is from time minus infinity, or as much as possible, to time 0, the present.
The count result is the number of observations in the window with non-missing values on X. Here as the time variable is 1 up the count is trivially the same as the time variable, but in real problems the time variable is much more likely to be a date of some kind. Unlike some other commands rangestat doesn't have an option to insist on a minimum number of points with non-missing values in a window, but you can count how many there are and decide to ignore those based on too few data. That is left to the user here.
Incidentally, you could make a good start on this kind of problem by working out a cumulative sum and then dividing by the number of values so far. That needs care with (e.g.) gaps in data, irregularly spaced data or missing values and a virtue of rangestat is that all such difficulties are considered.

I want to optimize a stored procedure that uses IN clause and a regex_str function. I am not sure that how I should optimize it more?

The response time I am getting is around 200ms.
I want to optimize it more.
How can I achieve this?
CREATE OR REPLACE
PROCEDURE GETSTORES
(
LISTOFOFFERIDS IN VARCHAR2,
REF_OFFERS OUT TYPES.OFFER_RECORD_CURSOR
)
AS
BEGIN
OPEN REF_OFFERS FOR
SELECT
/*+ PARALLEL(STORES 5) PARALLEL(MERCHANTOFFERS 5)*/
MOFF.OFFERID,
s.STOREID,
S.LAT,
s.LNG
FROM
MERCHANTOFFERS MOFF
INNER JOIN STORES s ON MOFF.STOREID =S.STOREID
WHERE
MOFF.OFFERID IN
(
SELECT
REGEXP_SUBSTR(LISTOFOFFERIDS,'[^,]+', 1, LEVEL)
FROM
DUAL CONNECT BY REGEXP_SUBSTR(LISTOFOFFERIDS, '[^,]+', 1, LEVEL) IS NOT NULL
)
;
END
GETSTORES;
I am using the regex_substr to get a list of OfferIDs from the comma separated string that comes in LISTOFOFFERIDS.
I have created the index on STOREID of the Stores table but to no avail.
A new approach to achieve the same is also fine if its faster.
The types declaration for the same:
create or replace
PACKAGE TYPES
AS
TYPE OFFER_RECORD
IS
RECORD(
OFFER_ID MERCHANTOFFERS.OFFERID%TYPE,
STORE_ID STORES.STOREID%TYPE,
LAT STORES.LAT%TYPE,
LNG STORES.LNG%TYPE
);
TYPE OFFER_RECORD_CURSOR
IS
REF
CURSOR
RETURN OFFER_RECORD;
END
TYPES;
The plan for the select reveals following information:
Plan hash value: 1501040938
-------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 276 | 67620 | 17 (12)| 00:00:01 |
|* 1 | HASH JOIN | | 276 | 67620 | 17 (12)| 00:00:01 |
| 2 | NESTED LOOPS | | | | | |
| 3 | NESTED LOOPS | | 276 | 61272 | 3 (34)| 00:00:01 |
| 4 | VIEW | VW_NSO_1 | 1 | 202 | 3 (34)| 00:00:01 |
| 5 | HASH UNIQUE | | 1 | | 3 (34)| 00:00:01 |
|* 6 | CONNECT BY WITHOUT FILTERING (UNIQUE)| | | | | |
| 7 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 8 | INDEX RANGE SCAN | OFFERID_INDEX | 276 | | 0 (0)| 00:00:01 |
| 9 | TABLE ACCESS BY INDEX ROWID | MERCHANTOFFERS | 276 | 5520 | 0 (0)| 00:00:01 |
| 10 | TABLE ACCESS FULL | STORES | 9947 | 223K| 13 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("MERCHANTOFFERS"."STOREID"="STORES"."STOREID")
6 - filter( REGEXP_SUBSTR ('M1-Off2,M2-Off5,M2-Off9,M5-Off4,M10-Off1,M1-Off3,M2-Off4,M3-Off2,M4-Of
f6,M5-Off1,M6-Off1,M8-Off1,M7-Off3,M1-Off1,M2-Off1,M3-Off1,M3-Off4,M3-Off5,M3-Off6,M4-Off1,M4-Off7,M2
-Off2,M3-Off3,M5-Off2,M7-Off1,M7-Off2,M1-Off7,M2-Off3,M3-Off7,M5-Off5,M4-Off2,M4-Off3,M4-Off5,M8-Off2
,M6-Off2,M1-Off5,M1-Off6,M1-Off9,M1-Off8,M2-Off6,M2-Off7,M4-Off4,M9-Off1,M6-Off4,M1-Off4,M1-Off10,M2-
Off8,M3-Off8,M6-Off3,M5-Off3','[^,]+',1,LEVEL) IS NOT NULL)
8 - access("MERCHANTOFFERS"."OFFERID"="$kkqu_col_1")

If your server supports it (seems you want it), change the hints into /*+ PARALLEL(S 8) PARALLEL(MOFF 8)*/. When you have aliases you must use the aliases in the hints.
you should try the compound index suggested by APC(STORES(STOREID, LAT, LNG))
Please respond to the questions: For the example presented, how many distinct stores you get (select count(distinct storeid) from (your_query)) and how many stores are in the STORES table? (Select count(*) from Stores)?
Have you analysed the table with dbms_stats.gather_table_stats?
I believe the connect by query is NOT the problem. It runs in 0.02 seconds.

If you look at you explain plan the timings for each step are the same: there is no obvious candidate to focus on tuning.
The sample you posted has fifty tokens for OFFERID. Is that representative? They map to 276 STORES - is that a representative ratio? Do any offers hit more than one Store?
276 rows is about 2.7% of the rows which is a small-ish sliver: however, as STORES seems to be a very compact table it's marginal as to whether indexed reads would be faster than a full table scan.
The only obvious thing you could do to squeeze more juice out of the database would be to build a compound index on STORES(STOREID, LAT, LNG); presumably it's not a table which sees much DML so the overhead of an additional index wouldn't be much.
One last point: your query executes in 0.2s. So how much faster do you want it to go?

Consider dropping the regex on the join, so the join can happen fast.
If there are indexes on the join columns, chances are the join may move from nested loops
to a hashed join of some sort.
Once you have that result set (with hopefully fewer rows), then filter it with your regex.
You may find that the WITH statement helpful in this scenerio.
Something on the order of this. ( untested example )
WITH
base AS
(
SELECT /*+ PARALLEL(STORES 5) PARALLEL(MERCHANTOFFERS 5) */
moff.OFFERID,
s.STOREID,
s.LAT,
s.LNG
FROM MERCHANTOFFERS moff
INNER JOIN STORES s
ON MOFF.STOREID = S.STOREID
),
offers AS
(
SELECT REGEXP_SUBSTR(LISTOFOFFERIDS,'[^,]+', 1, LEVEL) offerid
FROM DUAL
CONNECT BY REGEXP_SUBSTR(LISTOFOFFERIDS, '[^,]+', 1, LEVEL) IS NOT NULL
)
SELECT base.*
FROM base,
offers
WHERE base.offerid = offers.offerid
Oracle may execute the two views into in memory tables, then join.
No guarentees. Your milage may vary. You were looking for ideas. This is an idea.
The very best of luck to you.
If I recall a hints chapter correctly, when you alias your table names, you need to use that alias in your hint. /*+ PARALLEL(s 5) PARALLEL(moff 5) */
I would be curious as to why you decided on the value 5 for your hints. I was under the impression that Oracle would chose a best value for it, depending on system load and other mysterious conditions.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Lead and Lag Function in informatica - informatica

You can use Informatica Expressions to achieve lead and lag. For every incoming row make 2 output ports one for lead and one for lag calculation. Use mapping parameter for configuraing lead and lag variable for the run. After expression connect the ports to two different targets.

Related

Plot Min, Max, Average, Median values into a horizontal enumerated line in Power BI Desktop

To increment the value whenever the parameter crosses the threshold in power BI using DAX

How can I visualize timeseries data aggregated by more than one dimension on AWS insights?

How to calculate time varying historical mean with Stata

I want to optimize a stored procedure that uses IN clause and a regex_str function. I am not sure that how I should optimize it more?

Categories

Resources