Superhuman Level - Pandas DataFrame Reshaping because of Duplicates - python-2.7

Do you like puzzles that only superhumans can solve? This is the final test to prove such an ability.
A single company might get different levels of funding (seed, a) from multiple banks possibly at different times.
Let's look at the data then the story to get a better picture.
import pandas as pd

data = {'id': [1, 2, 2, 3, 4],
        'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
        'bank': ['z', 'x', 'y', 'z', 'j'],
        'round': ['seed', 'seed', 'seed', 'a', 'a'],
        'funding': [100, 200, 200, 300, 50],
        'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01']}
df = pd.DataFrame(data, columns=['id', 'company', 'round', 'bank', 'funding', 'date'])
df
Yields:
   id company round bank  funding        date
0   1   alpha  seed    z      100  2006-12-01
1   2    beta  seed    x      200  2004-09-01
2   2    beta  seed    y      200  2004-09-01
3   3   alpha     a    z      300  2007-05-01
4   4   alpha     a    j       50  2007-09-01
Desired Output:
  company bank_seed  funding_seed   date_seed bank_a  funding_a      date_a
0   alpha         z           100  2006-12-01  [z,j]        350  2007-09-01
1    beta     [x,y]           200  2004-09-01   None       None        None
As you can see, I am not a superhuman but shall try to explain my thought process.
Let's look at company alpha
Company alpha first got seed money for $100 from bank z in late 2006. A few months later, their investors were very happy with their progress so bank z gave them money ($300 more!). However, Company alpha needed a little more cash but had to go to some random Swiss bank j to stay alive. Bank j reluctantly gave $50 more. Yay! They now have $350 from their updated 'a' round ending in September 2007.
Company beta is pretty new. They got funding totaling $200 from two different banks. But wait... there's nothing in here about their round 'a'. That's okay, we'll put None for now and check back with them later.
The issue is that Company alpha sucks and got money from the Swiss...
This is my non-working code that had worked on a subset of my data - it won't work here.
import itertools

unique_company = df.company.unique()
df_indexed = df.set_index(['company', 'round'])
index = pd.MultiIndex.from_tuples(list(itertools.product(unique_company, df['round'].unique())))
reindexed = df_indexed.reindex(index, fill_value=0)
reindexed = reindexed.unstack().applymap(lambda cell: 0 if '1970-01-01' in str(cell) else cell)
working_df = pd.DataFrame(reindexed.iloc[:,
    reindexed.columns.get_level_values(0).isin(['company', 'funding'])].to_records())
If you know how to solve part of this problem, go ahead and put it below. Thank you in advance for taking the time to look at this! :)
Lastly, if you want to see how my code does work, do this first, but you lose so much valuable info...
df = df.drop_duplicates(subset='id')
df = df.drop_duplicates(subset='round')

Take a pre-processing step to spread out the funding across records with the same 'id' and 'date'
df.funding /= df.groupby(['id', 'date']).funding.transform('count')
Then process
d1 = df.groupby(['company', 'round']).agg(
    dict(bank=lambda x: tuple(x), funding='sum', date='last')
).unstack().sort_index(axis=1, level=1)
d1

           bank funding        date    bank funding        date
round         a       a           a    seed    seed        seed
company
alpha    (z, j)   350.0  2007-09-01    (z,)   100.0  2006-12-01
beta       None     NaN         NaT  (x, y)   200.0  2004-09-01

Then flatten the column MultiIndex to get names like funding_seed and bank_a:

d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
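If you also want the columns in exactly the order of the desired output, one extra step (a sketch, assuming the flattened column names produced above) is to reindex the columns and bring 'company' back as a column:

# sketch: reorder the flattened columns and reset the index
out = d1[['bank_seed', 'funding_seed', 'date_seed',
          'bank_a', 'funding_a', 'date_a']].reset_index()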

Groupby, aggregate and unstack will get you close to what you want:
out = df.groupby(['company', 'round']).agg(
    {'bank': lambda x: ','.join(x), 'funding': 'sum', 'date': 'max'}
).unstack().reset_index()
out.columns = ['_'.join(col).strip() for col in out.columns.values]
You get
  company_ bank_a bank_seed  funding_a  funding_seed      date_a   date_seed
0    alpha    z,j         z      350.0         100.0  2007-09-01  2006-12-01
1     beta   None       x,y        NaN         400.0        None  2004-09-01

Related

Multinomial Logit Fixed Effects: Stata and R

I am trying to run a multinomial logit with year fixed effects in mlogit in Stata (panel data: year-country), but I do not get standard errors for some of the models. When I run the same model using multinom in R I get both coefficients and standard errors.
I do not use Stata frequently, so I may be missing something or I may be running different models in Stata and R and should not be comparing them in the first place. What may be happening?
So a few details about the simple version of the model of interest. I created a data example to show what the problem is:
- Dependent variable (I will call it DV1) with 3 categories: -1, 0, 1 (unordered, 0 as reference)
- Independent variables: 2 continuous variables, 3 binary variables, and the interaction of 2 of the 3 binary variables
- Years: 1995-2003
- Number of observations in the model: 900
In R I run the code and get coefficients and standard errors as below.
R version of code creating data and running the model:
## Fabricate example data
library(fabricatr)
data <- fabricate(
  N = 900,
  id = rep(1:900, 1),
  IV1 = draw_binary(0.5, N = N),
  IV2 = draw_binary(0.5, N = N),
  IV3 = draw_binary(0.5, N = N),
  IV4 = draw_normal_icc(mean = 3, N = N, clusters = id, ICC = 0.99),
  IV5 = draw_normal_icc(mean = 6, N = N, clusters = id, ICC = 0.99))

library(AlgDesign)
DV = gen.factorial(c(3), 1, center = TRUE, varNames = c("DV"))
year = gen.factorial(c(9), 1, center = TRUE, varNames = c("year"))
DV = do.call("rbind", replicate(300, DV, simplify = FALSE))
year = do.call("rbind", replicate(100, year, simplify = FALSE))
year[year == -4] = 1995
year[year == -3] = 1996
year[year == -2] = 1997
year[year == -1] = 1998
year[year == 0]  = 1999
year[year == 1]  = 2000
year[year == 2]  = 2001
year[year == 3]  = 2002
year[year == 4]  = 2003
data1 = cbind(data, DV, year)
data1$DV1 = relevel(factor(data1$DV), ref = "0")

## Save data as csv file (to use in Stata)
library(foreign)
write.csv(data1, "datafile.csv", row.names = FALSE)

## Run multinom
library(nnet)
model1 <- multinom(DV1 ~ IV1 + IV2 + IV3 + IV4 + IV5 + IV1*IV2 + as.factor(year), data = data1)
Results from R
When I run the model using mlogit (without fixed effects) in Stata I get both coefficients and standard errors.
So I tried including year fixed effects in the model using Stata three different ways and none worked:
femlogit
- factor-variable and time-series operators not allowed
- "depvar and indepvars may not contain factor variables or time-series operators"
mlogit
- fe option: fe not allowed
- used i.year: omits certain variables and/or does not give me standard errors and only shows coefficients (example in code below)
* Read file
import delimited using "datafile.csv", clear case(preserve)
* Run regression
mlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2 i.year, base(0) iterate(1000)
Stata results
xtmlogit
- error - does not run
- error message: total number of permutations is 2,389,461,218; this many permutations require a considerable amount of memory and can result in long run times; use option force to proceed anyway, or consider using option rsample()
Fixed effects and non-linear models (such as logits) are an awkward combination. In a linear model you can simply add dummies/demean to get rid of a group-specific intercept, but in a non-linear model none of that works. I mean you could do it technically (which I think is what the R code is doing) but conceptually it is very unclear what that actually does.
Econometricians have spent a lot of time working on this, which has led to some work-arounds, usually referred to as conditional logit. IIRC this is what's implemented in femlogit. I think the mistake in your code is that you tried to include the fixed effects through a dummy specification (i.year). Instead, you should xtset your data and then run femlogit without the dummies.
xtset year
femlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2
Note that these conditional logit models can be very slow. Personally, I'm more a fan of running two one-vs-all linear regressions (1=1 and 0/-1 set to zero, then -1=1 and 0/1 set to zero). However, opinions are divided (Wooldridge appears to be a fan too, many others very much not so).
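For what it's worth, here is a minimal sketch of that one-vs-all idea (written in Python with statsmodels purely for illustration, reading the datafile.csv exported by the R code above; DV1 takes the values -1, 0, 1 as in the question):

# Sketch only: linear probability models, one outcome category vs. the rest
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("datafile.csv")              # exported by the R code above
df["y_pos"] = (df["DV1"] == 1).astype(int)    # 1 vs {0, -1}
df["y_neg"] = (df["DV1"] == -1).astype(int)   # -1 vs {0, 1}

rhs = "IV1 + IV2 + IV3 + IV4 + IV5 + IV1:IV2 + C(year)"
m_pos = smf.ols("y_pos ~ " + rhs, data=df).fit()
m_neg = smf.ols("y_neg ~ " + rhs, data=df).fit()
print(m_pos.summary())
print(m_neg.summary())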

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and devels.
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in using pvlib since we are trying to simulate the workings of a small solar panel used for IoT applications. In particular, the panel specs are the following:
12.8% max efficiency, Vmp = 5.82V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the irradiation directly from average monthly values calculated with PVWatts. I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation in Madrid has been obtained with PVWatts, and this is what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
I'm trying to understand whether pvlib computes values similar to the ones above, as averages over a day for each month, and the curve of production during the day.
I wrote this to compare pvlib with our old model:
import datetime as dt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pvlib
from pvlib.location import Location

def irradiance(day, m):
    DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0, 7090.0,
                   7200.0, 6340.0, 4870.0, 3130.0, 2130.0, 1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015, m, day, 0, 0),
                          end=dt.datetime(2015, m, day, 23, 59),
                          freq='60min')
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0) * DIrradiance[m - 1]

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 8, 15, 0, 0),
                      end=dt.datetime(2015, 8, 15, 23, 59),
                      freq='60min')

old = irradiance(15, 8)            # old model
new = madrid.get_clearsky(times)   # pvlib irradiance

plt.plot(old, 'r-')                # compare them
plt.plot(old / 6.0, 'y-')          # old seems 6 times more... I do not know why
plt.plot(new['ghi'].values, 'b-')
plt.show()
The code above computes the old irradiance using the zenith angle, and computes the GHI values using the clear-sky model. I do not understand whether the values in ghi must be multiplied by the cosine of the zenith too, or not. Anyway, they are smaller by a factor of about 6. What I'd like to have at the end is the power and current output from the panel (DC), without any inverter. We are not really interested in modelling it exactly, but we would at least like a reasonable curve. We are able to capture the amperes produced by the panel, and we want to compare the values measured with the panel on the rooftop against the values calculated by pvlib.
Any help on this would be really appreciated. Thanks
Sorry Will, I do not care a lot about my previous model since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
from pvlib import atmosphere, irradiance

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 1, 1, 0, 0),
                      end=dt.datetime(2015, 1, 1, 23, 59),
                      freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
                               ephem_data['apparent_zenith'], ephem_data['azimuth'],
                               dni=irrad_data['dni'], ghi=irrad_data['ghi'],
                               dhi=irrad_data['dhi'], airmass=AM,
                               surface_type='urban')
poa = total['poa_global'].values
Now that I know the irradiance on the POA, I want to compute the output in amperes. Is it just
(poa * PANEL_EFFICIENCY * AREA) / VOLT_OUTPUT ?
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing that it's some kind of monthly data, since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
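For the last step in your question, here is a rough sketch of the simple efficiency model you describe (this is not a pvlib function; PANEL_EFFICIENCY, AREA and VOLT_OUTPUT are the numbers from your panel spec, and it ignores temperature and IV-curve effects):

PANEL_EFFICIENCY = 0.128        # 12.8 % max efficiency, from the spec
AREA = 0.225 * 0.155            # module area in m^2 (225 mm x 155 mm)
VOLT_OUTPUT = 5.82              # Vmp from the spec [V]

dc_power = poa * PANEL_EFFICIENCY * AREA   # rough DC power [W] per timestamp
dc_current = dc_power / VOLT_OUTPUT        # rough current [A] per timestamp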

Reducing the Sparsity of a One-Hot Encoded dataset

I'm trying to run some feature selection algorithms on the UCI Adult data set and I'm running into a problem with univariate feature selection. I'm doing one-hot encoding on all the categorical data to change it to numerical, but that gives me a lot of F-scores (one per dummy column).
How can I avoid this? What should I do to make this code better?
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Encode
adult['Gender'] = adult['sex'].map({'Female': 0, 'Male': 1}).astype(int)
adult = adult.drop(['sex'], axis=1)
adult['Earnings'] = adult['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
adult = adult.drop(['income'], axis=1)

# One-hot encode
adult = pd.get_dummies(adult, columns=["race"])

target = adult["Earnings"]
data = adult.drop(["Earnings"], axis=1)

selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)

for n, s in zip(data.head(0), selector.scores_):
    print "F Score ", s, "for feature ", n
EDIT:
Partial results of current code:
F Score 26.1375747945 for feature race_Amer-Indian-Eskimo
F Score 3.91592196913 for feature race_Asian-Pac-Islander
F Score 237.173133254 for feature race_Black
F Score 31.117798305 for feature race_Other
F Score 218.117092671 for feature race_White
Expected Results:
F Score "f_score" for feature "race"
By doing the one-hot encoding, the feature above is split into many sub-features, whereas I would just like to generalize it to just race (see Expected Results), if that is possible.
One way in which you can reduce the number of features, whilst still encoding your categories in a non-ordinal manner, is by using binary encoding. One-hot encoding has a linear growth rate, n, where n is the number of categories in a categorical feature. Binary encoding has a log2(n) growth rate. In other words, doubling the number of categories adds a single column for binary encoding, whereas it doubles the number of columns for one-hot encoding.
Binary encoding can be easily implemented in Python by using the category_encoders package. The package is pip-installable and works very seamlessly with sklearn and pandas. Here is an example:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})

enc_bin = ce.BinaryEncoder(cols=['cat1'])   # with cols=None, all string columns are encoded
df_trans = enc_bin.fit_transform(df)
print(df_trans)

Out[1]:
   cat1_0  cat1_1 cat2
0       1       1    C
1       0       1    S
2       1       0    T
3       0       0    B
Here is the code from a previous answer of mine using the same variables as above, but with one-hot encoding. Let's compare how the two outputs look.
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})

enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1'])   # with cols=None, all string columns are encoded
df_trans = enc_ohe.fit_transform(df)
print(df_trans)

Out[2]:
   cat1_0  cat1_1  cat1_2  cat1_3 cat2
0       0       0       1       0    C
1       0       0       0       1    S
2       1       0       0       0    T
3       0       1       0       0    B
See how binary encoding uses half as many columns to uniquely describe each category within the category cat1.
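Applied to your Adult example, a sketch (assuming the adult frame prepared as in your question, with 'Earnings' already created and any remaining string columns also encoded) would replace the pd.get_dummies step with the binary encoder and rerun the same selection; the five race_* scores would collapse to three:

import category_encoders as ce
from sklearn.feature_selection import SelectKBest, f_classif

# sketch: binary-encode 'race' instead of one-hot encoding it
enc = ce.BinaryEncoder(cols=['race'])
adult_bin = enc.fit_transform(adult)        # 'adult' as prepared in the question

target = adult_bin['Earnings']
data = adult_bin.drop(['Earnings'], axis=1)

selector = SelectKBest(f_classif, k=5)
selector.fit(data, target)
for name, score in zip(data.columns, selector.scores_):
    print("F Score", score, "for feature", name)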

Bokeh figure doesn't show

I am new to Python. I tried the example given here http://docs.bokeh.org/en/latest/docs/gallery/color_scatter.html with my own dataset, which looks like this:
   Unnamed: 0  fans                      id  stars
0           0    69  18kPq7GPye-YQ3LyKyAZPw   4.14
1           1  1345  rpOyqD_893cqmDAtJLbdog   3.67
2           2   105  4U9kSBLuBDU391x6bxU-YA   3.68
3           3     2  fHtTaujcyKvXglE33Z5yIw   4.64
4           4     5  SIBCL7HBkrP4llolm4SC2A   3.80
Here's my code:
import pandas as pd
from bokeh.plotting import figure, show, output_file

op = pd.read_csv('FansStars.csv')

x = op.stars
y = op.fans

radii = 1.5
colors = ["#%02x%02x%02x" % (int(r), int(g), 150) for r, g in zip(50 + 2*x, 30 + 2*y)]

TOOLS = "hover,crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select"

p = figure(tools=TOOLS)
p.scatter(x, y, radius=radii,
          fill_color=colors, fill_alpha=0.6,
          line_color=None)

output_file("color_scatter.html", title="color_scatter.py example")
show(p)
However, when I run this code, I get no error and a webpage opens, but it is blank. On reloading several times, I can finally see the tools, but that's all.
Can anyone tell me where I am going wrong?
Thanks!
I can't replicate this on Python 3.4 with Bokeh 0.12.3, so in that way your code seems fine. I tried it both in the notebook (output_notebook) and to a file like you do, and both seem to work fine.
The radius of 1.5 which you specify is taken in data units (of x, apparently), and this makes the circles extremely big, covering the entire screen at first render. But using the wheel zoom to zoom out a bit reveals all the circles as expected. Here is what your code looks like in Firefox for me (after zooming out).
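One way to sidestep the problem (a sketch on top of the answer above, not part of the original gallery example) is to size the glyphs in screen units with size=, or to pick a radius that is small relative to the x range (stars run roughly 3-5):

p = figure(tools=TOOLS)
p.scatter(x, y, size=10,              # size is in screen pixels, not data units
          fill_color=colors, fill_alpha=0.6,
          line_color=None)
# or keep data-space sizing with a much smaller radius, e.g. radius=0.05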

How to count rating?

My question is more mathematical. There is a post on the site. Users can like and dislike it, and below the post is written, for example, -5 dislikes and +23 likes. On the basis of these values I want to make a rating with a range of 0-10 (or -10 to 0 and 0 to 10). How do I do this correctly?
This may not answer your question as you need a rating between [-10,10] but this blog post describes the best way to give scores to items where there are positive and negative ratings (in your case, likes and dislikes).
A simple method like
(Positive ratings) - (Negative ratings), or
(Positive ratings) / (Total ratings)
will not give optimal results.
Instead, he uses a method called the binomial proportion confidence interval.
The relevant part of the blog post is copied below:
CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter
Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the "real" fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:
score = (p̂ + z²/(2n) ± z·sqrt(p̂(1 − p̂)/n + z²/(4n²))) / (1 + z²/n)
(source: evanmiller.org)
(Use minus where it says plus/minus to calculate the lower bound.) Here p̂ is the observed fraction of positive ratings, z is the (1 − α/2) quantile of the standard normal distribution (z_(α/2)), and n is the total number of ratings.
Here it is, implemented in Ruby, again from the blog post.
require 'statistics2'

def ci_lower_bound(pos, n, confidence)
  if n == 0
    return 0
  end
  z = Statistics2.pnormaldist(1 - (1 - confidence) / 2)
  phat = 1.0 * pos / n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat) + z*z/(4*n)) / n)) / (1 + z*z/n)
end
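If you prefer Python, here is a minimal translation of the same formula (a sketch; it uses scipy for the normal quantile):

from math import sqrt
from scipy.stats import norm

def ci_lower_bound(pos, n, confidence=0.95):
    # Lower bound of the Wilson score interval for `pos` positive votes out of `n`
    if n == 0:
        return 0
    z = norm.ppf(1 - (1 - confidence) / 2)   # ~1.96 for 95% confidence
    phat = float(pos) / n
    return (phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)

# e.g. ci_lower_bound(23, 28) for your +23 likes out of 28 total votes;
# multiplying the result by 10 maps it onto a 0-10 scale.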
This is an extension to Shepherd's answer.
total_votes = num_likes + num_dislikes;
rating = round(10*num_likes/total_votes);
It depends on the number of visitors to your app. Let's say you expect about 100 users to rate your app. When the first user clicks dislike, we will rate it as 0 based on the above approach. But this is not logically right, since our sample is too small to make it a zero. The same with only one positive vote: our app gets a 10 rating.
A better approach would be to add a constant value to the numerator and denominator. Let's say our app has 100 visitors; it's safe to assume that until we get about 10 ups/downs, we should not go to extremes (neither a 0 nor a 10 rating). So just add 5 to the likes and to the dislikes:
num_likes = num_likes + 5;
num_dislikes = num_dislikes + 5;
total_votes = num_likes + num_dislikes;
rating = round(10*(num_likes)/(total_votes));
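For example, with 1 dislike and no likes this gives rating = round(10 * 5 / 11) = 5 instead of the 0 that the unsmoothed formula would produce; the rating only drifts toward 0 or 10 once the real votes outweigh the added constant.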
It sounds like what you want is basically a percentage liked/disliked. I would do 0 to 10, rather than -10 to 10, because that could be confusing. So on a 0 to 10 scale, 0 would be "all dislikes" and 10 would be "all likes":
total_votes = num_likes + num_dislikes;
rating = round(10*num_likes/total_votes);
And that's basically it.
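For example, with your +23 likes and 5 dislikes this gives rating = round(10 * 23 / 28) = 8.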