Reformat a dataframe based on final empty columns in python - python-2.7

I am working on scraping a table that has major and minor column names. When I do this, the table comes in having read both the column names and column groups, so the column names are misaligned in the dataframe like so (simplified):
unnamed1 unnamed2 unnamed3 Year Passing Rushing Receiving
2015 NA 200 60 NA NA NA
2014 NA 180 70 NA NA NA
My challenge is in shifting the column names so that 'Year' aligns over '2015' and so forth. The problem is then that the number of columns to shift does not remain constant from table to table (this is only one of many). My code at the moment looks like the following:
from pandas import read_html
table1=read_html('http://www.pro-football-reference.com/players/T/TyexWi00.htm')
df=table1[0]
to_shift=len(df.dropna(how='all', axis=1).columns) #Number of non-empty columns, used as the shift
df2=df.dropna(how='all',axis=1) #Drop the empty columns
df2.columns=df.columns[-to_shift:] #Shift all column names left by the number I've found
The problem is that for a player that has none of one stat (passing in this simple example), there are completely blank columns in the middle of the dataframe as well as at the right end, so that the code shifts too far. Is there a clean way of counting the columns from right to left until one is not completely empty?
Much thanks, and I hope my question is clear!

Is there a clean way of counting the columns from right to left until one is not completely empty?
from itertools import takewhile
len(df.columns) - len(list(takewhile(lambda col: df[col].isnull().all(), reversed(df.columns)))) - 1
Explanation:
takewhile returns the elements of an iterable (beginning at the front) for as long as the given condition is True, and stops at the first element for which it is False. When we call it on reversed(df.columns), we walk the columns from the right end. With df[col].isnull().all() we can check whether all entries of a column are null (a.k.a. NaN). Consequently the above takewhile expression returns the suffix of columns which are completely 'empty'. By calculating total_length - bad_suffix_length - 1, we get the index of the last column that is not completely empty, i.e. the first column, counting from the right, for which the condition is not satisfied.
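To make the counting concrete, here is a minimal sketch run on the simplified table from the question. The frame is rebuilt by hand (not scraped), and the variable names empty_tail, last_nonempty_idx, n_keep and naive_count are illustrative, not from the original posts.

```python
from itertools import takewhile

import numpy as np
import pandas as pd

# Hand-built copy of the simplified table from the question (an assumption,
# not the scraped data). Note the completely blank column in the middle.
df = pd.DataFrame(
    [[2015, np.nan, 200, 60, np.nan, np.nan, np.nan],
     [2014, np.nan, 180, 70, np.nan, np.nan, np.nan]],
    columns=["unnamed1", "unnamed2", "unnamed3",
             "Year", "Passing", "Rushing", "Receiving"],
)

# Walk the columns from the right and stop at the first non-empty one.
empty_tail = list(
    takewhile(lambda col: df[col].isnull().all(), reversed(df.columns))
)

# Index of the last column that holds any data, and the number of columns to keep.
last_nonempty_idx = len(df.columns) - len(empty_tail) - 1
n_keep = last_nonempty_idx + 1

# For comparison: counting every non-empty column also skips the blank
# column in the middle, which is what made the original code shift too far.
naive_count = len(df.dropna(how="all", axis=1).columns)
```

Here only the three rightmost columns are treated as the empty suffix; the blank column in the middle is left alone because takewhile stops as soon as it meets a column with data.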

Adding to the correct response from Michael Hoff (Thank you very much!), the code has been edited to
from itertools import takewhile
empty_tail=list(takewhile(lambda col: df[col].isnull().all(), reversed(df.columns))) #Completely empty columns at the right end
to_shift=len(df.columns) - len(empty_tail) #Number of columns of the original dataframe to keep
df2=df.drop(empty_tail,axis=1) #Drop the empty right side columns
df2.columns=df.columns[-to_shift:] #Shift the column names left
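Since the same fix has to be applied to many tables, the steps can be wrapped in a small helper. This is only a sketch: the name realign_columns is hypothetical, and it assumes the column labels are unique (df[col] would return a DataFrame otherwise).

```python
from itertools import takewhile

import numpy as np
import pandas as pd


def realign_columns(df):
    """Drop the all-empty columns at the right end and shift the names left.

    Hypothetical helper; assumes unique column labels.
    """
    empty_tail = list(
        takewhile(lambda col: df[col].isnull().all(), reversed(df.columns))
    )
    if not empty_tail:
        # Nothing to shift: return an unchanged copy.
        return df.copy()
    kept = df.drop(columns=empty_tail)
    # The surviving data columns take the rightmost names.
    kept.columns = df.columns[len(empty_tail):]
    return kept


# Toy input: two empty columns on the right, one empty column in the middle.
raw = pd.DataFrame([[1, np.nan, 3, np.nan, np.nan]],
                   columns=["a", "b", "c", "d", "e"])
fixed = realign_columns(raw)
```

Slicing with len(empty_tail) from the left, rather than a negative count from the right, also behaves sensibly in the corner case where nothing needs to be dropped.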

Related

Code to missing values if all Items of an Item battery have value 1

I have a large data set in Stata.
There are several item batteries in this data set.
One item battery consists of 8 items (v1 - v8), each scaled from 1 to 7.
I want to recode these items as missing whenever all eight of them take the value 1.
That is, in every row where v1 to v8 all have the value 1, v1 to v8 should be replaced with missing values.
I know how to code missing values with the if qualifier, but the selection with the complex condition causes me difficulties.
The code for R would probably solve this via rowSums, but I need the solution for Stata.
I assume in R it would work like this:
df[rowSums(df[,c("v1", ... "v8")]!=1)==0, c("v1", .... "v8")] <- NA
If I understood this correctly, you want
egen rowall = concat(v1-v8)
mvdecode v1-v8 if rowall == 8 * "1", mv(1)
That is, all instances in v1-v8 of 1 are recoded as missing if and only if the values of those variables are all 1 in any observation.
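For readers who think in data frames, the same row-wise rule can be sketched in pandas. This is not a Stata solution, only an illustration of the logic under the assumption that the items are numeric columns named v1 to v8; the sample rows are made up.

```python
import numpy as np
import pandas as pd

items = ["v{}".format(i) for i in range(1, 9)]

# Made-up sample: row 0 is all 1s, row 1 has some 1s, row 2 has none.
df = pd.DataFrame(
    [[1] * 8,
     [1, 2, 3, 4, 5, 6, 7, 1],
     [7] * 8],
    columns=items,
).astype(float)

# True only for rows in which every item equals 1.
all_ones = df[items].eq(1).all(axis=1)

# Set the whole battery to missing in those rows; other rows keep their 1s.
df.loc[all_ones, items] = np.nan
```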

Google sheets IF stops working correctly when wrapped in ARRAYFORMULA

I want this formula to calculate a date based on input from two other dates. I first wrote it for a single cell and it gives the expected results but when I try to use ARRAYFORMULA it returns the wrong results.
I first use two IF statements specifying what should happen if either one of the inputs is missing. Then the final IF statement calculates the date if both are present, based on two conditions. This seems to work perfectly if I write the formula for one cell and drag it down.
=IF( (LEN(G19)=0);(U19+456);(IF((LEN(U19)=0) ;(G19);(IF((AND((G19<(U19+456));(G19>(U19+273)) ));(G19);(U19+456))))))
However, when I want to use arrayformula to apply it to the entire column, it always returns the value_if_false if neither cell is empty, regardless of whether the conditions in the if statement are actually met or not. I am specifically talking about the last part of the formula that calculates the date if both input values are present, it always returns the result of U19:U+456 even when the result should be G19:G. Here is how I tried to write the ARRAYFORMULA:
={"Date deadline";ARRAYFORMULA(IF((LEN(G19:G400)=0);(U19:U400+456);(IF((LEN(U19:U400)=0);
(G19:G400);(IF((AND((G19:G400<(U19:U400+456));(G19:G400>(U19:U400+273)) ));(G19:G400);(U19:U400+456)))))))}
I am a complete beginner who only learned to write formulas two weeks ago, so any help or tips would be greatly appreciated!
AND and OR are not compatible with ARRAYFORMULA
Replace them by * or +
Try
={"Date deadline";ARRAYFORMULA(
IF((LEN(G19:G400)=0),(U19:U400+456),
(IF((LEN(U19:U400)=0), (G19:G400),
(IF((((G19:G400<(U19:U400+456))*(G19:G400>(U19:U400+273)) )),(G19:G400),
(U19:U400+456)))
))
)
)}
Keep in mind you cannot use the AND and OR functions in an ARRAYFORMULA, so you must find an alternative such as multiplying the conditions together and checking the product for 0 or 1 (TRUE*TRUE=1).
Based on your formulas and description, I gather that you want the following:
If G19 is blank show U19 + 456
If U19 is blank show G19
If G19 is less than U19 + 456 but greater than U19 + 273 show G19
Otherwise show U19 + 456
I'm not too sure what you want to happen when both columns G and U are empty. Based on your current formula you are returning an empty cell + 456... but with this formula it returns an empty cell rather than Column U + 456
Formula
={"Date deadline";ARRAYFORMULA(TO_DATE(ARRAYFORMULA(IFS((($G19:$G400="")*($U19:$U400=""))>0,"",$G19:$G400="",$U19:$U400+456,$U19:$U400="",$G19:$G400,(($G19:$G400<$U19:$U400+456)*($G19:$G400>$U19:$U400+273))>0,$G19:$G400,TRUE,$U19:$U400+456))))}
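The reason multiplication works where AND does not can be illustrated outside Sheets. The sketch below uses NumPy arrays as stand-ins for columns G and U; the numbers are made-up day counts rather than real dates, and the variable names are hypothetical. An AND-style aggregate collapses the whole range into a single value, while the product keeps one 0/1 flag per row.

```python
import numpy as np

# Made-up day numbers standing in for G19:G and U19:U.
g = np.array([300.0, 500.0, 100.0])
u = np.array([0.0, 0.0, 0.0])

above_lower = g > u + 273
below_upper = g < u + 456

# AND-like behaviour: every comparison is folded into ONE boolean,
# so each row would get the same answer.
collapsed = bool(np.all(above_lower) and np.all(below_upper))

# Product of the conditions: one flag per row (1 only where both hold).
per_row = above_lower.astype(int) * below_upper.astype(int)

# Row-wise choice, mirroring IF(cond1*cond2, G, U+456).
result = np.where(per_row > 0, g, u + 456)
```

Only the first row satisfies both conditions, so only that row keeps its G value; the other two fall back to U + 456.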

How to count the number of blank cells in one column based on the first blank row in another column

I have a spreadsheet set up with tv program titles in column B, the next 20 or so columns are tracking different information about that title. I need to count the number of blank cells in column R relating to the range in column B that contains titles (ie, up to the first blank row in column B.)
I can easily set up a formula to count the number of empty cells in a given range in column R, the problem is as I add more titles to the sheet I would have to keep updating the range in the formula [a simple =COUNTIF(R3:R1108, "")]. I've done a little googling of the problem but haven't quite found anything that fits the situation. I thought I would be able to get the following to work but I didn't fully understand what was going on with them and they weren't giving the expected results.
I've tried these formulas:
=ArrayFormula(sum(MIN("B3:B"&MIN(IF((R3:R)>"",ROW(B3:B)-1)))))
=ArrayFormula(sum(INDIRECT("B3:B"&MIN(IF((R3:R)>"",ROW(B3:B)-1)))))
And
=if(SUM(B3:B)="","",SUM(R3:R))
All of the above formulas give "0" as the result. Based on the COUNTIF formula I have set up it should be 840, which is a number I would expect. Currently, there are 1106 rows containing data and 840 is a reasonable number to expect in this situation.
Is this what you're looking for?
=COUNTBLANK(INDIRECT(CONCATENATE("R",3,":R",(2+COUNTA(B3:B)))))
This counts the number of non-blank rows in the B column (starting at B3), and uses that to determine the rows to perform COUNTBLANK in, in column R (starting at R3). The offset is 2 because n titles starting in row 3 end in row n+2; this assumes there are no gaps between the titles in column B. CONCATENATE is a way to give it a range by adding strings together, and the INDIRECT allows for the range reference to be a string.
A proper way would be:
=ARRAYFORMULA(COUNTBLANK(INDIRECT(ADDRESS(3, 18, 4)&":"&
ADDRESS(MAX(IF(B3:B<>"", ROW(B3:B), )), 18, 4))))
or shorter:
=ARRAYFORMULA(COUNTBLANK(INDIRECT("R3:"&
ADDRESS(MAX(IF(B3:B<>"", ROW(B3:B), )), 18, 4))))
or shorter still:
=ARRAYFORMULA(COUNTBLANK(INDIRECT("R3:R"&MAX(IF(B3:B<>"", ROW(B3:B), )))))
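The idea shared by these formulas is: find the last row that has a title in column B, then count the blanks in column R down to that row only. A plain-Python sketch of that logic, with lists standing in for the two columns and a hypothetical function name:

```python
def count_blank_until_last_title(titles, values):
    """Count blank entries in `values` up to the last non-blank entry of `titles`.

    Both arguments are equal-length lists of strings, with "" meaning blank.
    Hypothetical helper for illustration only.
    """
    # Position of the last non-blank title, or -1 if there is none.
    last = max((i for i, t in enumerate(titles) if t != ""), default=-1)
    # Only rows up to and including that position are inspected.
    return sum(1 for v in values[: last + 1] if v == "")


# Three titles followed by two unused rows; the unused rows must not be counted.
titles = ["Show A", "Show B", "Show C", "", ""]
values = ["x", "", "", "", ""]
blank_count = count_blank_until_last_title(titles, values)
```

Because the cut-off is recomputed from the titles each time, adding more titles extends the inspected range automatically, which is what the INDIRECT-based formulas achieve in the sheet.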

Join strings from the same column in pandas using a placeholder condition

I have a series of data that I need to filter.
The df consists of one column of information in which the groups are separated by a row with the value NaN.
I would like to join all of the rows that occur until each NaN in a new column.
For example my data looks something like:
the
car
is
red
NaN
the
house
is
big
NaN
the
room
is
small
My desired result is
B
the car is red
the house is big
the room is small
Thus far, I am approaching this problem by building a function and applying it to each row in my dataframe. See below for my working code example so far.
def joinNan(row):
    newRow = []
    placeholder = 'NaN'
    if row is not placeholder:
        newRow.append(row)
    if row == placeholder:
        return newRow
df['B'] = df.loc[0].apply(joinNan)
For some reason, the first row of my data is being used as the index or column title, hence why I am using 'loc[0]' here instead of a specific column name.
If there is a more straightforward way to approach this by iterating directly over the column, I am open to that suggestion too.
For now, I am trying to reach my desired solution and have not found any other similar case on Stack Overflow or the web in general to help me.
I think you need isna to test for NaNs, then create a helper Series with cumsum and aggregate with join via groupby:
df=df.groupby(df[0].isna().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
#for older versions of pandas
df=df.groupby(df[0].isnull().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
Another solution is to filter out all NaNs before the groupby:
mask = df[0].isna()
#mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')
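To see how the helper Series does the grouping, here is a self-contained sketch on the sample data from the question. It assumes, as in the answer, that the single column is labelled 0; the names groups and out are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, in a single column labelled 0.
df = pd.DataFrame({0: ["the", "car", "is", "red", np.nan,
                       "the", "house", "is", "big", np.nan,
                       "the", "room", "is", "small"]})

# Every NaN bumps the running total, so each block of words gets its own id.
# The NaN row itself lands in the following block and is dropped before joining.
groups = df[0].isna().cumsum().rename("g")

out = (
    df.groupby(groups)[0]
      .apply(lambda x: " ".join(x.dropna()))
      .to_frame("B")
)
```

If the data ended with a NaN row, this first variant would produce a trailing empty string for the final, word-less group; filtering out the NaN rows before grouping avoids that.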

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data are sorted first by country (a 3-digit numeric country code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. Each value should be standardized by subtracting the mean of the preceding m time periods (excluding the current one), and then dividing by the standard deviation over those same time periods. If this is not possible, the result should be a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and the next line delete values that rolling calculates when I actually want missing values.
replace mean_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_x = (x - mean_x[_n-1])/sd_x[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I'll get to that later.
Below are some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.
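The same "only accept full windows" rule can be sketched in pandas, where min_periods plays the role of the N check. This is an illustration of the logic, not a Stata solution, and it makes assumptions the Stata code does not: the column names country, year and x and the function name rolling_z are hypothetical, rows are sorted by country and year, and the window counts observations rather than calendar periods (so it presumes there are no gaps in the years).

```python
import numpy as np
import pandas as pd


def rolling_z(df, value="x", group="country", window=5):
    """z-score of each value against the previous `window` observations.

    Hypothetical helper. Returns NaN unless all `window` preceding
    observations in the same group are present and non-missing.
    """
    def z(s):
        prev = s.shift(1)  # exclude the current observation
        mean = prev.rolling(window, min_periods=window).mean()
        sd = prev.rolling(window, min_periods=window).std()  # n-1 denominator
        return (s - mean) / sd

    # transform applies z within each group, so windows never span two countries.
    return df.groupby(group)[value].transform(z)


# Made-up panel: two countries, window of 3 for brevity.
panel = pd.DataFrame({
    "country": [1] * 7 + [2] * 4,
    "year": list(range(2000, 2007)) + list(range(2000, 2004)),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 10.0, 20.0, 30.0, 40.0],
})
panel["z_x"] = rolling_z(panel, window=3)
```

With min_periods equal to the window length, the partial-window values never get computed in the first place, so nothing has to be deleted afterwards.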