What is the output of this aggregator - Informatica

Quick question, here is my data
Data_field: 100|address|place|456|687
This column from an Expression transformation is being passed to an Aggregator and marked as group by in the Aggregator.
What will be the output of this Aggregator? Also, can you tell me something brief about the Aggregator?
Thanks,
Teja

First, say for example your data consists of 3 records:
Data                        Amount
100|address|place|456|687 10
100|address|place|456|687 20
100|address|place|456|687 30
In Informatica, if you group by Data and apply SUM(Amount), the output will be:
100|address|place|456|687 60
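For comparison, here is a rough SQL equivalent of that grouping (the table and column names are assumed, purely for illustration):
-- illustration only: the same grouping expressed in SQL
SELECT data_field, SUM(amount) AS amount
FROM   src_table
GROUP  BY data_field;
-- returns one row: 100|address|place|456|687  60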
Say, for example, there is no Amount column, as below:
100|address|place|456|687
100|address|place|456|687
100|address|place|456|687
In Informatica, if you group on Data, then your output is as below:
100|address|place|456|687 (only one record)
One important note about the Aggregator in the above scenario: even if you have not checked the group-by option, Informatica by default returns the last record.
The Aggregator in Informatica is similar to using aggregate functions like MAX, MIN, COUNT, etc. on a group in SQL.
Example: say you want to know the max salary in a department.
SQL
select dept, max(salary) from employee group by dept;
Informatica
Enable the group-by option on dept and then create a port that holds MAX(salary). This will give output similar to the SQL above.
Things to take care of in the Aggregator for better performance:
1) Use a Sorter transformation before the Aggregator (and enable the Sorted Input option on the Aggregator)
2) Use numeric columns in the group by whenever possible (try to avoid date and string columns)
3) If the source has a huge number of records, it is better to group the records in the SQL override itself, because the Aggregator builds a cache; see the sketch after this list
4) Add a filter if required to avoid unnecessary aggregation
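As a rough illustration of point 3 (and the filter from point 4), the aggregation can be pushed into the source qualifier's SQL override so the mapping receives already-grouped rows; the table and column names below are assumed:
-- illustration only: pre-aggregate and filter in the SQL override
SELECT data_field, SUM(amount) AS amount
FROM   src_table
WHERE  amount > 0
GROUP  BY data_field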
Hope this helps
Regards
Raj

The Aggregator transformation can be used for multiple aggregation operations such as AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM and VARIANCE. The GroupBy option can be checked to calculate the aggregates of a column according to your grouping condition.
For example, consider a source with POSITION, HEIGHT and WEIGHT columns. The aggregation is defined so that the average of HEIGHT and the maximum of WEIGHT are calculated while grouping on the POSITION column. In the resulting target, the average of HEIGHT and the maximum of WEIGHT are populated for each value available in the POSITION column.
The Aggregator transformation works much the same as SQL aggregate functions combined with the SQL GROUP BY clause.
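As a rough SQL analogy of that example (the table name is assumed for illustration):
-- illustration only: average height and maximum weight per position
SELECT position, AVG(height) AS avg_height, MAX(weight) AS max_weight
FROM   players
GROUP  BY position;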

Related

PowerBi- Prevent slicer from filtering a table returned measure

I will do my best to explain this correctly. I have a data set that has 4 category columns (Date, Store, Provider, Pick Location) and 4 value columns (Total Orders, Total Deliveries, Accepted Deliveries, Assigned Deliveries). Please see the image here:
The two total columns should be grouped by date and store and have the same value regardless of the provider or pick location; there is the potential to see the same value up to 4 times depending on the category combination. The final two value columns will be different for every line and should not be grouped.
The issue I am facing is to be able to:
Sum up the total columns by the date/store group to ensure there is no duplication
Sum up the other value columns NOT by the date/store group without affecting the totals
Have the total columns be only affected by the Date and Store slicer (NOT the Provider and Location)
Have the other values filter as normal for all slicers
Ultimately calculate percentages of "total" based on the affected slicers.
Here is a visual summary of what I would be trying to accomplish:
Currently, I have been able to create a "sum of averages" to allow the total values to display correctly based on the date and store, but am unable to get it to ignore the provider and location slicer:
total_pro_orders_f =
VAR _table =
    SUMMARIZE (
        md_pbi_datafile,
        md_pbi_datafile[date],
        md_pbi_datafile[store_number],
        "_orders", AVERAGEX ( md_pbi_datafile, md_pbi_datafile[total_pro_orders] )
    )
RETURN
    CALCULATE (
        SUMX ( _table, [_orders] ),
        REMOVEFILTERS ( md_pbi_datafile[lpName], md_pbi_datafile[transaction_is_sourced] )
    )
I have tried all the different ignore filter commands I could find but no matter what I do I cannot get the totals to stay the same regardless of provider or location.
Does anyone have any advice on how to accomplish this?
Thank you all!

Calculate % of two columns Power BI

I want to calculate % of two columns which are already in %.
I want to calculate formula like
Target achieved= ACTUAL/TARGET
But here ACTUAL is already a measure/calculated metric, so I'm not able to divide these two columns.
Any help would be appreciated.
Make sure both Target and Actual are actually numbers and not strings. You can do this in the Transform Data (aka Power Query) part, before the data is loaded into the report.
After that you should be able to create the measure you want, e.g. a simple division of Actual by Target.
UPDATE: What happens if Actual is not a column, but a measure?
If the Actual measure is based on columns in the same table as Target, you shouldn't have a problem. However, you cannot combine a measure and a raw column in the same formula. A measure is an aggregated field and needs to be grouped by another field (if you are familiar with SQL, think of SUM and GROUP BY). Coming back to your problem, you need to create a measure out of the "Target" column as well (notice I have SUM('Table'[Plan]) in the formula, which makes it a measure). Then you can use both of them in the formula, but of course you need to "group" them by something (e.g. date), otherwise it will just show you a total.
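As a rough SQL analogy of that grouping idea (table and column names assumed for illustration), both sides become aggregates over the same group before the division:
-- illustration only: aggregate both Actual and Plan per date, then divide
SELECT report_date,
       SUM(actual) / SUM(plan) AS target_achieved
FROM   results
GROUP  BY report_date;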

Working with large offsets in BigQuery

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases, until it hits a ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.
Offset in systems like BigQuery works by reading and discarding all rows up to the offset.
You'll need to use a column as a lower limit to enable the engine to start directly from that part of the key range; the engine can't efficiently seek to an arbitrary point midway through a query.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
If you did this via OFFSET 100000, it takes 4s and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
If instead of offset, I had used those date and rate values, the query only takes 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
So what does this mean? Rather than allowing the user to specify result-number ranges (e.g., new rows starting at 100000), have them specify it in a more natural form (e.g., rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to allow the user to specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: say, query everything once and remember what row number is at the start of each hour of the year (8760 values), or even each minute (525600 values). You could then use this to make a better guess at an efficient starting point. Do a look-up for the closest day/minute for a given row range (e.g. in Cloud Datastore), then convert that user's query into the more efficient version above.
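A rough sketch of that precomputation in standard SQL (assuming rows are numbered in pickup-time order; the resulting lookup could then be stored in Datastore or a small table):
-- illustration only: first row number at the start of each hour, via a running count
SELECT
  hour_start,
  SUM(rows_in_hour) OVER (ORDER BY hour_start) - rows_in_hour + 1 AS first_row_number
FROM (
  SELECT
    TIMESTAMP_TRUNC(pickup_datetime, HOUR) AS hour_start,
    COUNT(*) AS rows_in_hour
  FROM `nyc-tlc.green.trips_2014`
  GROUP BY hour_start
)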
As already mentioned by Dan, you need to introduce a row number. However, ROW_NUMBER() OVER () across the full table exceeds resources. This basically means you have to split up the work of counting rows:
- decide on a few partitions, as evenly distributed as possible
- count the rows of each partition
- take the cumulative sum of the partition sizes, so you know later where to start counting rows in each partition
- split up the work of counting rows
- save a new table with the row count column for later use
As the partition key I used EXTRACT(month FROM pickup_datetime), as it distributes nicely:
WITH temp AS (
  SELECT
    *,
    -- cumulative sum of partition sizes so we know when to start counting rows here
    SUM(COALESCE(lagged, 0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative
  FROM (
    -- lag partition sizes to the next partition
    SELECT
      *,
      LAG(qty) OVER (ORDER BY month) AS lagged
    FROM (
      -- get partition sizes
      SELECT
        EXTRACT(month FROM pickup_datetime) AS month,
        COUNT(1) AS qty
      FROM
        `nyc-tlc.green.trips_2014`
      GROUP BY
        1
    )
  )
)
SELECT
  -- cumulative sum = last row of the former partition; add the new row count to it
  cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) AS row,
  *
FROM
  `nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
  temp
ON
  month = EXTRACT(month FROM pickup_datetime)
Once you have saved it as a new table, you can use your new row column to query without losing performance:
SELECT
*
FROM
`project.dataset.your_new_table`
WHERE
row BETWEEN 10000001
AND 10000100
Quite a hassle, but does the trick.
Why not export the resulting table into GCS?
It will automatically split tables into files if you use wildcards, and this export only has to be done one time, instead of querying every single time and paying for all the processing.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.
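A hedged sketch of such an export using BigQuery's EXPORT DATA statement (the bucket path and table name are assumed for illustration):
-- illustration only: export the row-numbered table as sharded CSV files in GCS
EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket/trips/rows-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT *
FROM `project.dataset.your_new_table`;
The single * wildcard in the URI is what lets BigQuery shard the output into multiple files.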

How to SUM DISTINCT Values in a column based on a unique date in another column of a Power BI table

I have a table in Power BI where I have two columns, Date and Daily Targets. The Daily Target is always the same on the same date, so I need a measure that only sums one value per date instead of calculating every row, because these two columns contain duplicate values. Please see the attached screenshot for the data table.
As you look at my data, there are two distinct dates, and all I need is that when I add this Daily Target column to any visualization, instead of adding 11653+11653+11653 for 3rd Jan, it should only sum 11653 for 3rd Jan. Please help me with it, I will be very grateful to you.
To get a measure that takes the maximum value of the Daily Target by date, you can do something like this:
Daily Target =
SUMX (
    GROUPBY (
        Table1,
        Table1[Date],
        "Max Daily Target", MAXX ( CURRENTGROUP(), [DailyTarget] )
    ),
    [Max Daily Target]
)
Assuming your table is called Table1.
The inner GROUPBY identifies the highest daily target for each date. This assumes any given date will only have a single daily target (you could equally pick the MIN or AVG, as they should all result in the same number). Note: if you have a single date with 2 different daily targets, this formula will fall down because it will only pick the biggest.
The outer SUMX sums each day's biggest daily target. This is important if you are aggregating by month or year. At the end of January, you want to have up to 31 daily targets added together.
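As a rough SQL analogy of that group-then-sum logic (table and column names assumed for illustration):
-- illustration only: one target per date, then the per-date targets are summed
SELECT SUM(max_daily_target)
FROM (
  SELECT report_date, MAX(daily_target) AS max_daily_target
  FROM   targets
  GROUP  BY report_date
) AS per_day;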
Note: In general, I would roll up the daily target by day before loading the data into Power BI. It's not fully clear from your screenshot why you have records at a lower granularity, so I can't explain how I'd do it in your particular case. However, this post by DAXPatterns.com does go into how to handle "sales vs. budget", which may be relevant to you: http://www.daxpatterns.com/handling-different-granularities/

Countif comparing dates in Tableau

I am trying to create a table where it only counts the attendees of one type of training (rows) if they attended another particular training (column) AFTER the first one. I think I need to recreate a COUNTIF function that compares the dates of the trainings, but I'm not sure how to set this up so that it compares the dates of the row trainings and column trainings. Any ideas?
Edit 3/23
Alex, your solution would work if I had different variables for the dates of each type of training. Is there a way to construct this without having to create new variables for each type of training that I want to compare? Put another way, is there a way to refer to the rows and columns of the table in the formula that would compare the dates? So, something like "count if the start date of this column exceeds the start date of this row." (basically, is there something like the Excel index function in Tableau?)
It may help to see how my data is structured -- here is a scrubbed version: https://docs.google.com/spreadsheets/d/1YR1Wz-pfGHhBxDQDGYgmemLGoCK0cSvKOeE8w33ZI3s/edit?usp=sharing
The "table" tab shows the table that I'm trying to create in Tableau.
Define a calculated field for your condition, called say, trained_after, as:
training_b_date > training_a_date
trained_after will be true or false for each data row, depending on whether the B training was dated later than the A training.
If you want more precise control over the difference between the dates, use the DATEDIFF function, say DATEDIFF("hour", training_a_date, training_b_date) > 24 to insist upon at least a 24-hour gap.
That field may be all you need. You can put trained_after on the filter shelf to filter only to see data rows meeting the condition. Or put it on another shelf to partition the data according to that condition. Or use your field to create other calculated fields.
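As a rough SQL analogy of the COUNTIF being described (table and column names assumed for illustration), the boolean condition becomes a conditional count:
-- illustration only: count attendees whose training B started after training A
SELECT COUNT(CASE WHEN training_b_date > training_a_date THEN 1 END) AS trained_after_count
FROM   attendees;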
Realize that if either of your date fields is null, your calculated field will evaluate to null in that case. Aggregate functions like SUM(), COUNT(), etc. ignore null values.