Hierarchical index in data frame missing columns - python-2.7

I'm trying to learn Pandas by doing different exercises. I created a DataFrame that looks like the example below. I'm trying to create a unique id by concatenating the fields, but when I get the DataFrame columns I only see fpd as a column. Could someone explain to me why I don't see all the columns?
monthID  pollutantID  processID  roadTypeID  avgSpeedBinID       Fpd
1        1            1          4           1              1.749101
                                             2              0.935300
                                             3              0.529701
                                             4              0.393052
                                             5              0.306381
                                             6              0.261649
                                             7              0.235040
I get the data frame by executing this:
fpd = data['fpd'].groupby([data['monthID'],data['pollutantID'],
data['processID'],data['roadTypeID'],data['avgSpeedBinID']]).sum()
fp = pd.DataFrame(fpd)

The grouped-by fields become the levels of a MultiIndex rather than columns, which is why only fpd shows up as a column. You can reset the MultiIndex back into columns with:
fp.reset_index(inplace=True)
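Alternatively, if you would rather keep the grouped keys as ordinary columns from the start, here is a minimal sketch using as_index=False (the toy data and the uid column are illustrative, not from the question):

import pandas as pd

# Toy stand-in for the question's data.
data = pd.DataFrame({'monthID': [1, 1], 'pollutantID': [1, 1], 'processID': [1, 1],
                     'roadTypeID': [4, 4], 'avgSpeedBinID': [1, 2],
                     'fpd': [1.749101, 0.935300]})

# as_index=False keeps monthID, pollutantID, etc. as columns instead of a MultiIndex.
keys = ['monthID', 'pollutantID', 'processID', 'roadTypeID', 'avgSpeedBinID']
fp = data.groupby(keys, as_index=False)['fpd'].sum()

# All the key columns are now visible, so a unique id can be built by concatenation.
fp['uid'] = fp[keys].astype(str).apply('-'.join, axis=1)
print(fp.columns.tolist())  # now lists all five keys plus 'fpd' and 'uid'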

Related

Merging Tables Correctly in SAS

Hi, I am trying to merge two tables: the FormA scores table that I made, which is now CalculatingScores, with the domain numbers found in DomainsFormA. I need to merge them by QuestionNum. Here is my code.
proc sql;
create table combined as
select *
from CalculatingScores inner join DomainsFormA
on CalculatingScores.Scores=DomainsFormA.QuestionNum;
quit;
proc print data=combined (obs=15);
run;
This table is what I am trying to get my merged table to look like, but for 15 observations:
Form  Student  QuestionNum  Scores  DomainNum
A     1        1            0       5
A     1        2            1       4
A     1        3            0       5
But my table looks more like this:
Form  Student  QuestionNum  Scores  DomainNum
A     1        2            1       5
A     1        4            1       5
A     1        5            1       5
My entire Scores column for these 15 observations has a value of 1, and my DomainNum column only has values of 5. My Student and Form columns are correct, but I need varied scores and varied domain numbers. Any ideas for how to solve my problem? Maybe I need an order by statement?
You appear to be joining on the incorrect columns. You coded
on CalculatingScores.Scores=DomainsFormA.QuestionNum
which joins a score to a question number. Perhaps you should be coding
on CalculatingScores.QuestionNum=DomainsFormA.QuestionNum
                     ^^^^^^^^^^^              ^^^^^^^^^^^

Increment ID# by 1 if Same month ArrayFormula

I'm trying to set up an array formula in a Google Sheet to save filling in a simple formula for ID#s.
The sheet is populated by a Google Form, so it receives a timestamp. Let's say these are orders.
If the month of the order matches that of the previous one, I want to increase the ID# by one, essentially counting this month's orders. The complete ID# is actually made up of several factors, the order count being just one of them (so that they are unique), but for the sake of this exercise I'll keep it simple.
If the month of the order does not match the previous one, then it's safe to say we've entered a new month and the ID should restart at 01.
I have a column that has the extracted month from the timestamp. So the data looks like this:
A B
ID# MONTH
1 1
2 1
3 1
4 1
5 1
6 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
I can't get the arrayformula to work! I've tried numerous countIfs and Ifs, something like
=ARRAYFORMULA(if(len(B2:B),if(B3:B<>B2:B,1,A2:A+1),""))
Does anyone have any suggestions for this?
I found it hard to Google for and have tried a few search terms!
Try:
=ARRAYFORMULA(IF(B1:B<>"", COUNTIFS(B1:B, B1:B, ROW(B1:B), "<="&ROW(B1:B)), ))
For each row, the COUNTIFS counts how many entries with the same month appear at or above that row, so the count restarts automatically whenever a new month begins.

How to replace null values from left join table in pyspark

I have two tables. Table 1 has 5 million rows and table 2 has 3 million. When I do table1.join(table2, ..., 'left_outer'), the columns from table 2 end up with null values in the new table for rows without a match. It looks like the following (var3 and var4 from table 2 are arrays of varied-length strings):
t1.id  var1  var2  table2.id  table2.var3        table2.var4
1      1.3   4     1          ['a','b','d']      ['x','y','z']
2      3.0   5     2          ['a','c','m','n']  ['x','z']
3      2.3   5
I plan to use CountVectorizer after the join, and it can't handle null values, so I want to replace the null values with empty arrays of string type.
It's a similar issue to the one discussed in PySpark replace Null with Array, but I have over 10 variables from table 2 and each has a different dimension.
Any suggestion as to what I can do? Can I run CountVectorizer before the join?
DataFrames have a .na.fill() method:
replace_cols = {col: '' for col in df.columns}
df = df.na.fill(replace_cols)
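Note that na.fill only fills atomic column types such as strings and numbers; it leaves array columns like var3 and var4 untouched, so it will not produce the empty arrays the question asks for. Here is a minimal PySpark sketch of one way to do that after the join, with toy stand-ins for the two tables (the table names and the join key id are assumptions based on the example above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the two tables in the question.
table1 = spark.createDataFrame([(1, 1.3, 4), (2, 3.0, 5), (3, 2.3, 5)],
                               ['id', 'var1', 'var2'])
table2 = spark.createDataFrame([(1, ['a', 'b', 'd'], ['x', 'y', 'z']),
                                (2, ['a', 'c', 'm', 'n'], ['x', 'z'])],
                               ['id', 'var3', 'var4'])

joined = table1.join(table2, 'id', 'left_outer')

# Replace null arrays with a typed empty-array literal; coalesce keeps the
# existing array where one is present.
for c in ['var3', 'var4']:
    joined = joined.withColumn(c, F.coalesce(F.col(c), F.array().cast('array<string>')))

joined.show(truncate=False)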

pandas keep rows based on column values for repeated values

I have a pandas data frame and a list of values. I want to keep all the rows from my original DF whose value in a certain column belongs to my list of values. However, the list I want to choose my rows from has repeated values. Each time I encounter the same value again, I want to add the rows with that column value to my new data frame again.
Let's say my frame's name is with_prot_choice_df and my list is with_prot_choices.
If I issue the following command:
with_prot_choice_df = with_df[with_df[0].isin(with_prot_choices)]
then this will only keep the rows once (as if the list contained only unique values).
I don't want to do this with for loops since I will repeat the process many times and it will be extremely time consuming.
Any advice will be appreciated. Thanks.
I'm adding an example here:
let's say my data frame is:
col1 col2
a 1
a 6
b 2
c 3
d 4
and my list is:
lst = ['a', 'b', 'a', 'a']
I want my new data frame, new_df to be:
new_df
col1 col2
a 1
a 6
b 2
a 1
a 6
a 1
a 6
Seems like you need reindex
df.set_index('col1').reindex(lst).reset_index()
Out[224]:
col1 col2
0 a 1
1 b 2
2 a 1
3 a 1
Updated
df.merge(pd.DataFrame({'col1':lst}).reset_index()).sort_values('index').drop('index',1)
Out[236]:
col1 col2
0 a 1
3 a 6
6 b 2
1 a 1
4 a 6
2 a 1
5 a 6
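For reference, here is a self-contained sketch of the merge approach above, using the example frame and list from the question (the order and new_df names are illustrative):

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'c', 'd'],
                   'col2': [1, 6, 2, 3, 4]})
lst = ['a', 'b', 'a', 'a']

# Record the position of each entry in lst, merge to pull every matching row
# (duplicates included), then restore the order of lst and drop the helper column.
order = pd.DataFrame({'col1': lst}).reset_index()
new_df = (df.merge(order, on='col1')
            .sort_values('index')
            .drop('index', axis=1)
            .reset_index(drop=True))
print(new_df)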

PowerBi Change Card values to previous month Value if Current month value is not available

I am working in Power BI, using data over a 2-month period, writing DAX queries, and trying to build reports.
I am trying to show the monthly values in card visuals, and I am able to do it using the measures below:
PlanPrevMon = CALCULATE([PlanSum],PREVIOUSMONTH('Month Year'[Date]))
CustomKPI =
IF(ISBLANK([PlanSum]), "No Data Available ", [PlanSum]) & " " &
IF([PlanSum] = [PlanPrevMon], "",
   IF([PlanSum] > [PlanPrevMon], UNICHAR(8679), UNICHAR(8681)) &
   IF([PlanSum] <= 0, "", ""))
But here I don't want the user to choose the month values from a slicer.
I would like the card to show the values as follows:
if the current month's value is not available, it should automatically compare with the previous month's value, as shown in the image below.
For example, the Apr-2017 value is not available; in this scenario I would like it to compare with the Mar-2017 value.
If the Mar-2017 value is also not available, then the previous month's value, and so on.
Edit
I tried what #user5226582 suggested, but I am still getting wrong values, as in the image below.
As you can see, I restricted the data to cross-check whether the value is coming through correctly, but it is not.
This is the measure I used, as #user5226582 suggested:
c =
var temp = 'Revenue Report'[Date]
return CALCULATE(LASTNONBLANK('Revenue Report'[Planned Rev],
    'Revenue Report'[Planned Rev]),
    ALL('Revenue Report'),
    'Revenue Report'[Date] <= temp)
Can you please correct me if I am doing anything wrong?
Does this help...?
To get the PlanPrevMon column to show like this, I first created a new index column:
id = COUNTROWS(FILTER(Sheet1, EARLIER(Sheet1[Date],1)>Sheet1[Date]))
Then I used the index to help create the PlanPrevMon column in two steps:
Step 1: I made one column named PlanPrevMon1.
PlanPrevMon1 = SUMX(FILTER(Sheet1,Sheet1[id]=EARLIER(Sheet1[id])-1),Sheet1[PlanSum])
Step 2: I made another column named PlanPrevMon.
PlanPrevMon = if(ISBLANK(Sheet1[PlanPrevMon1]),
if(Sheet1[id]=1,0,CALCULATE(LASTNONBLANK(Sheet1[PlanPrevMon1],Sheet1[PlanPrevMon1]),ALL(Sheet1),ISBLANK(Sheet1[PlanSum]))),
Sheet1[PlanPrevMon1])
For the card, I used this measure:
Card = CALCULATE(LASTNONBLANK(Sheet1[PlanPrevMon],1),FILTER(Sheet1,Sheet1[id]=max(Sheet1[id])))
I hope this helps.
You can use the LASTNONBLANK DAX function.
Example data:
a  b
1  1
2
3  2
4
5  3
Calculated column:
c =
var temp = Table1[a]
return CALCULATE(LASTNONBLANK(Table1[b],Table1[b]),ALL(Table1),Table1[a]<=temp)
Results in:
a  b  c
1  1  1
2     1
3  2  2
4     2
5  3  3