Pivoting/reshaping a dataframe to have dates as columns - python-2.7

Here is my dataframe:
ID AMT DATE
0 1496846 54.76 2015-02-11
1 1496846 195.00 2015-01-09
2 1571558 11350.00 2015-04-30
3 1498812 135.00 2014-07-11
4 1498812 157.00 2014-08-04
5 1498812 110.00 2014-09-23
6 1498812 1428.00 2015-01-28
7 1558450 4355.00 2015-01-26
8 1858606 321.52 2015-03-27
9 1849431 1046.81 2015-03-19
I would like to make this a dataframe consisting of time series data for each ID. That is, each column name is a date (sorted), and it is indexed by ID, and the values are the AMT values corresponding to each date. I can get so far as doing something like
df.set_index("DATE").T
but from here I'm stuck.
I also tried
df.pivot(index='ID', columns='DATE', values='AMT')
but this gave me an error on having duplicate entries (the IDs).
I envision it as transposing DATE, and then grouping by unique ID and melting AMT underneath.

you want to use pivot_table where there is an aggfunc parameter that handles duplicate indices.
df.pivot_table('AMT', 'DATE', 'ID', aggfunc='sum')
You'll want to choose how to handle the dups. I put 'sum' in there. It defaults to 'mean'

Related

Power BI - Filtering model on latest version of all attributes of all dimensions through DAX

I have a model that's comprised of multiple tables containing, for every ID, multiple rows with a valid_from and valid_to dates.
This model has one table in that is linked to every other table (a table working as both a fact and a dimension).
This fact has bi-directional cross filtering with the other tables.
I also have a date dimension that is not linked to any other table.
I want to be able to calculate the sum of a column in this table in the following way:
If a date range is selected, I want to get the sum of the latest value per ID from the fact able that is before the max selected date from the date dimension.
If no date is selected, I want to get the sum of the current version of the value per ID.
This comes down to selecting the latest value per ID filtered on the dates.
Because of the nature of the model (bi-directional with the fact/dimension table), I want to have the latest version of any attribute from any dimension selected in the visual.
Here's an data example and the desired outcome:
fact/dimension table:
ID
Valid_from
Valid_to
Amount
SK_DIM1
SK_DIM2
1
01-01-2020
05-12-2021
50
1234
6787
1
05-13-2021
07-31-2021
100
1235
6787
1
08-01-2021
12-25-2021
100
1236
6787
1
12-26-2021
12-31-2021
200
1236
6787
1
01-01-2022
12-31-9999
200
1236
6788
Dimension 1:
ID
SK
Valid_from
Valid_to
Name
1
1234
10-20-2019
06-01-2021
Name 1
1
1235
06-02-2021
07-31-2021
Name 2
1
1236
08-01-2021
12-31-9999
Name 3
Dimension 2:
ID
SK
Valid_from
Valid_to
Name
1
6787
10-20-2019
12-31-2021
Name 1
1
6788
01-01-2022
12-31-9999
Name 2
My measure is supposed to do the following:
If no date is selected than the result will be a matrix like the following:
Dim 1 Name
Dim 2 Name
Amount Measure
Name 3
Name 2
200
If July 2021 is selected than the result will be a matrix like the following:
Dim 1 Name
Dim 2 Name
Amount Measure
Name 2
Name 1
100
So the idea here is that the measure would filter the fact table on the latest valid value in the selected date range, and then the bi-directional relationships will filter the dimensions to get the corresponding version to that row with the max validity (last valid row) in the selected range date.
I have tried to do the following two DAX codes but it's not working:
Solution 1: With this solution, filtering on other dimensions work and I get the last version in the selected date range for all attributes of all used dimensions. But the problem here is that the max valid from is not calculated per ID, so I only get the max valid from overall.
Amount Measure=
VAR _maxSelectedDate = MAX(Dates[Dates])
VAR _minSelectedDate = MIN(Dates[Dates])
VAR _maxValidFrom =
CALCULATE(
MAX(fact[valid_from]),
DATESBETWEEN(fact[valid_from], _minSelectedDate, _maxSelectedDate)
|| DATESBETWEEN(fact[valid_to], _minSelectedDate, _maxSelectedDate)
)
RETURN
CALCULATE(
SUM(fact[Amount]),
fact[valid_from] = _maxValidFrom
)
Solution 2: With this solution, I do get the right max valid from per ID and the resulting number is correct, but for some reason, when I use other attributes from the dimensions, it duplicates the amount for every version of that attribute. The bi-directional filtering does not work anymore with Solution 2.
Amount Measure=
VAR _maxSelectedDate = MAX(Dates[Dates])
VAR _minSelectedDate = MIN(Dates[Dates])
VAR _maxValidFromPerID =
SUMMARIZE(
FILTER(
fact,
DATESBETWEEN(fact[valid_from], _minSelectedDate, _maxSelectedDate)
|| DATESBETWEEN(fact[valid_to], _minSelectedDate, _maxSelectedDate)
),
fact[ID],
"maxValidFrom",
MAX(fact[valid_from])
)
RETURN
CALCULATE(
SUM(fact[Amount]),
TREATAS(
_maxValidFromPerID,
fact[ID],
fact[valid_from]
)
)
So if somebody can explain why the bi-directional filtering doesn't work anymore that will be great, and also, more importantly, if you have any solution to have both the latest value per ID and still keep filtering on other attributes, that would be great!
Sorry for the long post, but I thought it's best to give all the details for a complete understanding of my issue, this has been picking my brain since few days now and I'm sure I'm missing something stupid but I turned to this community for help because I cannot seem to be able to find a solution!
Thank you very much in advance for any help!
Seems to be workable with a dummy model. I didn't got the point how filter ID, so if it creates a problem let me know how you handle ID. Then I changed fact to facts as fact is a function. Also, I'm not sure about the workability of the measure at your real model. Hope you will give some feedback.
Amount Measure =
VAR ValidDate=
calculate(
max(facts[Valid_to])
,ALLEXCEPT(facts,facts[ID])
,facts[Valid_to]<=MAX(Dates[Date])
)
Return
CALCULATE(
SUM(facts[Amount])
,TREATAS({ValidDate},facts[Valid_to])
)

PowerBI - lookup - is a value that's in column 1 in column 2?

This is something that I can do in my sleep with Excel but I am struggling with this basic thing in PowerBI. I have a column in my Items table called ID and another column in the same table called ParentID.
What I am looking to do in PowerBI is create the CalcColumn - Looking up to see if an ID is in the Parent ID column and bringing back a response
I've tried several different variations of a calculated column using the lookup value function but I either bring back the parent id for a value (which I already have), I bring back random id numbers (that are neither the id nor the parent) or it tells me that I've got multiple values.
ID
ParentID
CalcColum
1
Parent
2
1
Not found
3
Parent
4
1
Parent
5
3
Not found
6
4
Parent
7
6
Not found
8
Not found
How would I resolve this?
It's fairly straightforward in DAX too.
Try this expression as a calculated column:
IF ( Table1[ID] IN VALUES ( Table1[ParentID] ), "Parent", "Not found" )

Return Slicer's Value (trade simulator)

I work with a single table (called sTradeSim) that I have created in PowerQuery. It has 3 columns (Fund1, Fund2, Fund3), each having values from -10 to 10, with an increment of 1.
I also have three separate slicers, each created using an option "Greater than or equal to". Each slicer is having a field assigned to it - Slicer 1 = Fund1, Slicer 2 = Fund2, Slicer 3 = Fund3. Below is a screenshot of Slicer 1.
Right next to these three slicers is a table with three rows. For each row, I would like to retrieve the value of the respective slicers. So the desired result would look like:
Row No 1 = -10.00 (the value of Slicer 1),
Row No 2 = -2.00 (the value of Slicer 2),
Row No 3 = 3.00 (the value of Slicer 3).
Unfortunately, DAX formula that I have developed is always returning 3.00 (the value of the third slicer).
I have tried to find a solution on the forum and combine my SWITCH formula with ALL, ALLEXCEPT, SELECTEDVALUE etc., but it seems like I'm missing something very basic.
mHV_Trades =
SWITCH(
MAX(FundTable[FundsRanked]),
1, MIN(sTradeSim[Fund1]),
2, MIN(sTradeSim[Fund2]),
3, MIN(sTradeSim[Fund3])
)
What you are trying to do doesn't work, because essentially when you place 1 filter on any column on the table, it will filter all the rows that have that value. So, when you apply a filter fund1 = -10 it will also filter the values for fund 2 and fund 3.
You have 2 options:
Create independent tables each with values from -10 to 10
Create a table with all the combinations of -10 to 10 values for every fund.
For your example with 3 funds this works quite nicely (the table has about 10k records), all the combinations of -10 to 10 (21) to the power of 3, the problem with this solution is that depending on the number of funds you have you will run out of space quite quickly.

DAX selecting and displaying the max value of all selected records

Problem
I'm trying to calculate and display the maximum value of all selected rows alongside their actual values in a table in Power BI. When I try to do this with the measure MaxSelectedSales = MAXX(ALLSELECTED(FactSales), FactSales[Value]), the maximum value ends up being repeated, like this:
If I add additional dimensions to the output, even more rows appear.
What I want to see is just the selected rows in the fact table, without the blank values. (i.e., only four rows would be displayed for SaleId 1 through 4).
Does anyone know how I can achieve my goal with the data model shown below?
Details
I've configured the following model.
The DimMarket and DimSubMarket tables have two rows each, you can see their names above. The FactSales table looks like this:
SaleId
MarketId
SubMarketId
Value
IsCurrent
1
1
1
100
true
2
2
1
50
true
3
1
2
60
true
4
2
2
140
true
5
1
1
30
false
6
2
2
20
false
7
1
1
90
false
8
2
2
200
false
In the table output, I've filtered FactSales to only include rows where IsCurrent = true by setting a visual level filter.
Your max value (the measure) is a scalar value (a single value only). If you put a scalar value in a table with the other records, the value just get repeated. In general mixing scalar values and records (tables) does not really bring any benefit.
Measures like yours can be better displayed in a KPI or Multi KPI visual (normally with the year, that you get the max value per year).
If you just want to display the max value of selected rows (for example a filter in your table), use this measure:
Max Value = MAX(FactSales[Value])
This way all filter which are applied are considered in the measures calculation.
Here is a sample:
I've found a solution to my problem, but I'm slightly concerned with query performance. Although, on my current dataset, things seem to perform fairly well.
MaxSelectedSales =
MAXX(
FILTER(
SELECTCOLUMNS(
ALLSELECTED(FactSales),
"id", FactSales[SaleId],
"max", MAXX(ALLSELECTED(FactSales), FactSales[Value])
),
[id] = MAX(FactSales[SaleId])
),
[max]
)
If I understand this correctly, for every row in the output, this measure will calculate the maximum value across all selected FactSales rows, set it to a column named max and then filter the table so that only the current FactSales[SaleId] is selected. The performance hit comes from the fact that MAX needs to be executed for every row in the output and a full table scan would be done when that occurs.
Posted on behalf of the question asker

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three different questions about modifying a dataset in SAS. My data contains: the day and the specific number belonging to the tag which was registred by an antenna on a specific day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add numbers within this range which have not been registred on a specific day. So, if 160-280 is not registered for 23-May and 40-190 for 24-May to add these non-registered numbers only for that specific day? (The non registered numbers are much more scattered and for a dataset encompassing a few weeks to much to do by hand).
2) Furthermore, I want to make a new variable saying a tag has been registered (1) or not (0). Would it work to make this variable and set it to 1, then add the missing variables and (assuming the new variable is not set for the new number) set the missing values to 0.
3) the last question would be in regard to the format of the registered numbers which is along the line of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers I could make a new variable after the data has been sorted by date and the original transponder code but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data you can query that to find if a bird has been through. What would you flag be doing - finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck