Self Join in Pandas: Merge all rows with the equivalent multi-index - python-2.7

I have one dataframe in the following form:
df = pd.read_csv('data/original.csv', sep = ',', names=["Date", "Gran", "Country", "Region", "Commodity", "Type", "Price"], header=0)
I'm trying to do a self join on the index columns Date, Gran, Country, Region, producing rows of the form
Date, Gran, Country, Region, Commodity X, Type X, Price X, Commodity Y, Type Y, Price Y, Commodity Z, Type Z, Price Z
Every row should have all the different commodities and prices of a specific region.
Is there a simple way of doing this?
Any help is much appreciated!
Note: I simplified the example by ignoring a few attributes
Input Example:
Date Country Region Commodity Price
1 03/01/2014 India Vishakhapatnam Rice 25
2 03/01/2014 India Vishakhapatnam Tomato 30
3 03/01/2014 India Vishakhapatnam Oil 50
4 03/01/2014 India Delhi Wheat 10
5 03/01/2014 India Delhi Jowar 60
6 03/01/2014 India Delhi Bajra 10
Output Example:
Date Country Region Commodity1 Price1 Commodity2 Price2 Commodity3 Price3
1 03/01/2014 India Vishakhapatnam Rice 25 Tomato 30 Oil 50
2 03/01/2014 India Delhi Wheat 10 Jowar 60 Bajra 10

What you want to do is called a reshape (specifically, from long to wide). See this answer for more information.
Unfortunately as far as I can tell pandas doesn't have a simple way to do that. I adapted the answer in the other thread to your problem:
df['idx'] = df.groupby(['Date','Country','Region']).cumcount()
df.pivot(index= ['Date','Country','Region'], columns='idx')[['Commodity','Price']]
Does that solve your problem?
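The reshape above can be sketched end to end on the sample rows from the question. This is a minimal sketch, not the only way to do it: it uses set_index/unstack instead of pivot because, as far as I know, passing a list as the pivot index needs pandas 1.1 or later. The names keys, wide and ordered are made up for the illustration.

```python
import pandas as pd

# Sample rows from the question (long format)
df = pd.DataFrame({
    "Date": ["03/01/2014"] * 6,
    "Country": ["India"] * 6,
    "Region": ["Vishakhapatnam"] * 3 + ["Delhi"] * 3,
    "Commodity": ["Rice", "Tomato", "Oil", "Wheat", "Jowar", "Bajra"],
    "Price": [25, 30, 50, 10, 60, 10],
})

keys = ["Date", "Country", "Region"]

# Number the rows inside each group: 1, 2, 3, ...
df["idx"] = df.groupby(keys).cumcount() + 1

# Move the per-group counter into the columns (long -> wide)
wide = df.set_index(keys + ["idx"])[["Commodity", "Price"]].unstack("idx")

# Flatten the two-level column labels: ("Commodity", 1) -> "Commodity1"
wide.columns = [f"{name}{i}" for name, i in wide.columns]

# Interleave the columns as Commodity1, Price1, Commodity2, Price2, ...
n = int(df["idx"].max())
ordered = [f"{name}{i}" for i in range(1, n + 1) for name in ("Commodity", "Price")]
wide = wide[ordered].reset_index()

print(wide)
```

If the groups have different sizes, the shorter ones get NaN in the trailing Commodity/Price columns.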

Related

Power BI Matrix Visual Showing Row of Blank Values Even Though Source Data Does Not Have Blanks

I have two tables: one with data about franchise locations (Franchise Profile Info) and one with Award data. Each franchise location is given a certain number of awards it is allowed to give out per year. Each franchise location rolls up to a larger group depending on where in the country it is located. The tables are in a 1 to 1 relationship on Franchise ID. I am trying to create a matrix showing the number of awards, the total utilized, and the percentage utilized, rolled up to group, with the ability to expand the groups and see individual locations. For some reason, when I add the value fields a blank row is created. There are no blank rows in either of the original tables, so I'm not sure where this is coming from.
Franchise Profile Info table
ID  | Franchise Name | Group | Street Address  | City          | State
164 | Park's         | West  | 12 Park Dr.     | Los Angeles   | CA
365 | A & J          | East  | 243 Whiteoak Rd | Stafford      | VA
271 | Otto's         | South | 89 Main St.     | St. Augustine | FL
Award table
ID  | Year | TotalAwards | Utilized
164 | 2022 | 16          | 12
365 | 2022 | 5           | 5
271 | 2022 | 22          | 17
These tables are in a relationship with a 1 to 1 match on ID.
What I want the matrix to look like
Group | Total Awards | Utilized | %Awards Utilized
East  | 5            | 5        | 100%
West  | 16           | 12       | 75%
South | 22           | 17       | 77%
Instead what I'm getting is this
Group | Total Awards | Utilized | %Awards Utilized
East  | 5            | 5        | 100%
West  | 16           | 12       | 75%
South | 22           | 17       | 77%
      | 0            | 0        | 0%
I can't for the life of me figure out where this row is coming from. I can add in the Group and Franchise name as rows but as soon as I add any of the value columns this blank row shows up.
You have a value on the many side that does not exist on the one side. You can read a full explanation here. https://www.sqlbi.com/articles/blank-row-in-dax/
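One way to hunt for the offending value outside Power BI is an anti-join on the key. The snippet below is only a sketch in pandas: the extra Award row with ID 999 is hypothetical (the sample tables in the question match perfectly), added to show what an orphan key looks like.

```python
import pandas as pd

# Franchise Profile Info, reduced to the columns that matter here
franchise = pd.DataFrame({
    "ID": [164, 365, 271],
    "Group": ["West", "East", "South"],
})

# Award table; ID 999 is a hypothetical row with no matching franchise
award = pd.DataFrame({
    "ID": [164, 365, 271, 999],
    "TotalAwards": [16, 5, 22, 0],
    "Utilized": [12, 5, 17, 0],
})

# Left join with indicator=True marks award rows that found no franchise
merged = award.merge(franchise, on="ID", how="left", indicator=True)
orphans = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()

print(orphans)  # IDs present in Award but missing from Franchise Profile Info
```

Any ID listed here has no Group to roll up to, which is the kind of row that ends up under the blank group in the matrix.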

How do I create a pivot table with weighted averages from a table in PowerBI?

I have data in the following format:
Building       | Tenant | Type    | Floor | Sq Ft | Rent | Term Length
1 Example Way  | Jeff   | Renewal | 5     | 100   | 100  | 6
47 Fake Street | Tom    | New     | 3     | 500   | 200  | 12
I need to create a visualisation in PowerBI that displays a pivot table of attribute by tenant, with a weighted averages (by square foot) column, like this:
                     | Jeff          | Tom            | Weighted Average (by Sq Ft)
Building             | 1 Example Way | 47 Fake Street | -
Type                 | Renewal       | New            | -
Floor                | 5             | 3              | -
Sq Ft                | 100           | 500            | 433.3333333
Rent                 | 100           | 200            | 183.3333333
Term Length (months) | 6             | 12             | 11
I have unpivoted the original data, like this:
Tenant | Attribute            | Value
Jeff   | Building             | 1 Example Way
Jeff   | Type                 | Renewal
Jeff   | Floor                | 5
Jeff   | Sq Ft                | 100
Jeff   | Rent                 | 100
Jeff   | Term Length (months) | 6
Tom    | Building             | 47 Fake Street
Tom    | Type                 | New
Tom    | Floor                | 3
Tom    | Sq Ft                | 500
Tom    | Rent                 | 200
Tom    | Term Length (months) | 12
I can almost create what I need from the unpivoted data using a matrix (as below), but I can't calculate the weighted averages column from that matrix.
                     | Jeff          | Tom
Building             | 1 Example Way | 47 Fake Street
Type                 | Renewal       | New
Floor                | 5             | 3
Sq Ft                | 100           | 500
Rent                 | 100           | 200
Term Length (months) | 6             | 12
I can also create a table with my attributes as headers (instead of in a column). This displays the right values and lets me calculate weighted averages (as below).
                            | Building       | Type    | Floor | Sq Ft       | Rent        | Term Length (months)
Jeff                        | 1 Example Way  | Renewal | 5     | 100         | 100         | 6
Tom                         | 47 Fake Street | New     | 3     | 500         | 200         | 12
Weighted Average (by Sq Ft) | -              | -       | -     | 433.3333333 | 183.3333333 | 11
However, it's important that these values are displayed vertically instead of horizontally. This is pretty straightforward in Excel, but I can't figure out how to do it in PowerBI. I hope this is clear. Can anyone help?
Thanks!
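For reference, the weighted averages in the tables above follow from sum(value * Sq Ft) / sum(Sq Ft). The snippet below is not a Power BI solution, only a pandas sketch that checks the arithmetic and shows the vertical layout being asked for; the frame name leases and the other variable names are made up for the illustration.

```python
import pandas as pd

# Numeric attributes from the question, one row per tenant
leases = pd.DataFrame({
    "Tenant": ["Jeff", "Tom"],
    "Sq Ft": [100, 500],
    "Rent": [100, 200],
    "Term Length": [6, 12],
})

numeric_cols = ["Sq Ft", "Rent", "Term Length"]
weights = leases["Sq Ft"]

# Weighted average per attribute: sum(value * weight) / sum(weight)
weighted = leases[numeric_cols].mul(weights, axis=0).sum() / weights.sum()

# Vertical layout: attributes as rows, tenants as columns, plus the average
vertical = leases.set_index("Tenant")[numeric_cols].T
vertical["Weighted Average (by Sq Ft)"] = weighted

print(vertical)
```

For example, Rent works out as (100 * 100 + 200 * 500) / 600 = 183.33, which matches the figure in the question.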

Power BI - Showing Top 5 records in Matrix Table but total should show for all records

I have a table with thousands of records. I want to create a table visual that shows the top 5 records for each category. I created a measure to achieve this and I get exactly the result I am looking for, except for one issue.
See the image below, where I am showing the top 5 records for each category; after each category I have a total.
I don't want that total to cover only the top 5 records shown in the table. Instead, I want the total of all the records under each category.
How can I achieve that?
The measure I created is: Top 5 = RankX(AllSelected(table(Category), Table(account), table(name)),amount_measure,,,Dense)
On the visual I filter the Top 5 measure to keep only the top 5.
Category | Account | Name | P%  | amount | country | owner
Food     | A101    | AA11 | 10% | 105    | India   | A
Food     | A102    | AA12 | 20% | 120    | India   | A
Food     | A103    | AA13 | 80% | 100    | India   | A
Food     | A104    | AA14 | 30% | 150    | India   | A
Food     | A105    | AA15 | 60% | 90     | India   | A
Stat     | B101    | AA11 | 10% | 205    | India   | A
Stat     | B102    | AA12 | 20% | 220    | India   | A
Stat     | B103    | AA13 | 80% | 200    | India   | A
Stat     | B104    | AA14 | 30% | 250    | India   | A
Stat     | B105    | AA15 | 60% | 190    | India   | A
Admn     | D101    | AD11 | 10% | 305    | India   | A
Admn     | D102    | AD12 | 20% | 320    | India   | A
Admn     | D103    | AD13 | 80% | 300    | India   | A
Admn     | D104    | AD14 | 30% | 350    | India   | A
Admn     | D105    | AD15 | 60% | 290    | India   | A
Thanks,
SK
You can try this
Let's suppose you have the following measures
_sumAMT:= SUM('Table 1'[amount])
and this is your ranking measure
_sumAMTRank:= RANKX(ALLEXCEPT('Table 1','Table 1'[Category]),[_sumAMT],,DESC,Dense)
You can revise the subtotal by doing this
_sumAMT by CAT:= CALCULATE(SUM('Table 1'[amount]),ALLEXCEPT('Table 1','Table 1'[Category]))
_revisedTotal:= IF(HASONEVALUE('Table 1'[Name])=true(),[_sumAMT],[_sumAMT by CAT])
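The idea behind these measures (rank inside each category, but compute the subtotal over every row of the category) can be illustrated outside DAX. The pandas sketch below is an analogy, not a translation of the measures. It uses the Food and Stat rows from the question and a hypothetical cut-off of 2 instead of 5, because the sample has only five rows per category.

```python
import pandas as pd

# Food and Stat rows from the question
df = pd.DataFrame({
    "Category": ["Food"] * 5 + ["Stat"] * 5,
    "Account": ["A101", "A102", "A103", "A104", "A105",
                "B101", "B102", "B103", "B104", "B105"],
    "amount": [105, 120, 100, 150, 90, 205, 220, 200, 250, 190],
})

TOP_N = 2  # hypothetical cut-off so that the filter actually drops rows

# Dense rank within each category, largest amount first
df["rank"] = df.groupby("Category")["amount"].rank(method="dense", ascending=False)

# Category total over ALL rows, computed before the top-N filter is applied
df["category_total"] = df.groupby("Category")["amount"].transform("sum")

top = df[df["rank"] <= TOP_N].sort_values(["Category", "rank"])
print(top)
```

The visible rows are limited to the top N, while category_total still reflects every record in the category.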

Effective way to store list of list of dict to csv

I've got a dataframe like this:
Name Nationality Tall Age
John USA 190 24
Thomas French 194 25
Anton Malaysia 180 23
Chris Argentina 190 26
Let's say I have an incoming data structure like this, where each element holds the data for one row:
data = [{
'food':{'lunch':'Apple',
'breakfast':'Milk',
'dinner':'Meatball'},
'drink':{'favourite':'coke',
'dislike':'juice'}
},
..//and 3 other records
].
'data' is a variable that stores the food and drink predicted by my machine learning model. There are more records (about 400k rows), but I process them in batches (currently 2k records per iteration). The expected result looks like this:
Name Nationality Tall Age Lunch Breakfast Dinner Favourite Dislike
John USA 190 24 Apple Milk Meatball Coke Juice
Thomas French 194 25 ....
Anton Malaysia 180 23 ....
Chris Argentina 190 26 ....
Is there an efficient way to build that dataframe? So far I've tried iterating over the data variable and reading the value of each predicted label, which feels slow.
You need to flatten the dictionaries first, then create a DataFrame and join it to the original:
data = [{
'a':{'lunch':'Apple',
'breakfast':'Milk',
'dinner':'Meatball'},
'b':{'favourite':'coke',
'dislike':'juice'}
},
{
'a':{'lunch':'Apple1',
'breakfast':'Milk1',
'dinner':'Meatball2'},
'b':{'favourite':'coke2',
'dislike':'juice3'}
},
{
'a':{'lunch':'Apple4',
'breakfast':'Milk5',
'dinner':'Meatball4'},
'b':{'favourite':'coke2',
'dislike':'juice4'}
},
{
'a':{'lunch':'Apple3',
'breakfast':'Milk8',
'dinner':'Meatball7'},
'b':{'favourite':'coke4',
'dislike':'juice1'}
}
]
import pandas as pd

# flatten each record: merge its inner dicts ('a', 'b') into one flat dict
# (other solutions work too)
L = [{k: v for x in d.values() for k, v in x.items()} for d in data]
df1 = pd.DataFrame(L)
print(df1)
breakfast dinner dislike favourite lunch
0 Milk Meatball juice coke Apple
1 Milk1 Meatball2 juice3 coke2 Apple1
2 Milk5 Meatball4 juice4 coke2 Apple4
3 Milk8 Meatball7 juice1 coke4 Apple3
df2 = df.join(df1)
print (df2)
Name Nationality Tall Age breakfast dinner dislike favourite \
0 John USA 190 24 Milk Meatball juice coke
1 Thomas French 194 25 Milk1 Meatball2 juice3 coke2
2 Anton Malaysia 180 23 Milk5 Meatball4 juice4 coke2
3 Chris Argentina 190 26 Milk8 Meatball7 juice1 coke4
lunch
0 Apple
1 Apple1
2 Apple4
3 Apple3
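As an alternative sketch, pandas also ships pd.json_normalize, which flattens nested dicts into dotted column names such as food.lunch. The example below assumes the original 'food'/'drink' keys from the question and invents the second record purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Thomas"],
    "Nationality": ["USA", "French"],
    "Tall": [190, 194],
    "Age": [24, 25],
})

# One nested record per row of df; the second record is made up
data = [
    {"food": {"lunch": "Apple", "breakfast": "Milk", "dinner": "Meatball"},
     "drink": {"favourite": "coke", "dislike": "juice"}},
    {"food": {"lunch": "Pear", "breakfast": "Tea", "dinner": "Soup"},
     "drink": {"favourite": "water", "dislike": "soda"}},
]

# Nested keys become dotted column names: "food.lunch", "drink.dislike", ...
flat = pd.json_normalize(data)

# Keep only the inner key, e.g. "food.lunch" -> "lunch"
flat.columns = [col.split(".", 1)[1] for col in flat.columns]

# join aligns on the index, so row i of df must correspond to data[i]
result = df.join(flat)
print(result)
```

When working in batches, the index of each flattened batch has to line up with the matching slice of the original frame, otherwise join pairs the wrong rows.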

city population difference

I have an input file
Chicago 500
Newyork 200
California 100
I need the difference of the second column as output for each pair of cities:
Chicago Newyork 300
Chicago California 400
Newyork Chicago -300
Newyork California 100
California Chicago -400
California Newyork -100
I tried a lot but could not figure out the correct way to implement this in MapReduce. Please suggest a solution.
Here is some pseudocode. I use Python often, so it looks like Python. For this to work, you must know the total number of lines (i.e., cities here) and use that for N prior to running the job.
def map(dummy, line):
    city, pop = line.split()
    # send every city to every reducer key 0..N-1
    for idx in range(N):
        emit(idx, (city, int(pop)))

def reduce(idx, city_data):
    # sort by city so the indices are consistent across reducers
    city_data = sorted(city_data)
    city, pop = city_data[idx]
    for i in range(N):
        if i != idx:
            c, p = city_data[i]
            emit(city, (c, pop - p))
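The pseudocode above can be tried locally without a cluster. The sketch below simulates the map, shuffle and reduce steps in plain Python; map_phase, reduce_phase and run are hypothetical helper names, and no real MapReduce framework API is used.

```python
from collections import defaultdict

def map_phase(lines, n):
    """Emit every (city, population) pair under each reducer key 0..n-1."""
    for line in lines:
        city, pop = line.split()
        for idx in range(n):
            yield idx, (city, int(pop))

def reduce_phase(idx, city_data):
    """Reducer idx owns the idx-th city (sorted by name) and compares it to the rest."""
    city_data = sorted(city_data)
    city, pop = city_data[idx]
    for i, (other, other_pop) in enumerate(city_data):
        if i != idx:
            yield city, other, pop - other_pop

def run(lines):
    n = len(lines)  # N must be known before the job starts
    groups = defaultdict(list)
    for key, value in map_phase(lines, n):  # simulated shuffle: group by key
        groups[key].append(value)
    results = []
    for idx in sorted(groups):
        results.extend(reduce_phase(idx, groups[idx]))
    return results

lines = ["Chicago 500", "Newyork 200", "California 100"]
results = run(lines)
for row in results:
    print(*row)
```

Each reducer receives all N cities, so the shuffle moves N * N records. That is fine for a small list of cities but does not scale to large inputs.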