Optimise conversion to integer - pandas - python-2.7

I have a DataFrame with 80,000 rows. One column, 'prod_prom', contains either null values or string representations of numbers that include thousands separators (','). I need to convert these to integers. So far I have been doing this:
for row in DF.index:
    if pd.notnull(DF.loc[row, 'prod_prom']):
        DF.loc[row, 'prod_prom'] = int(''.join([char for char in DF.loc[row, 'prod_prom'] if char != ',']))
But it is extremely slow. Would it be quicker to do this with a list comprehension, or with an apply function? What is best practice for this kind of operation?
Thanks

So if I understand right, you have data like the following:
data = """
A,B
100,"5,000"
200,"10,000"
300,"100,000"
400,
500,"2,000"
"""
If that is the case, probably the easiest thing is to use the thousands option in read_csv (the dtype will be float instead of int because of the missing value):
import pandas as pd
from io import StringIO  # on Python 2.7: from StringIO import StringIO

df = pd.read_csv(StringIO(data), thousands=',')
A B
0 100 5000
1 200 10000
2 300 100000
3 400 NaN
4 500 2000
If that is not possible, you can do something like the following:
print df
A B
0 100 5,000
1 200 10,000
2 300 100,000
3 400 NaN
4 500 2,000
df['B'] = df['B'].str.replace(r',','').astype(float)
print df
A B
0 100 5000
1 200 10000
2 300 100000
3 400 NaN
4 500 2000
I changed the type to float because integer columns cannot hold NaN in pandas.
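Tying that back to the question, a minimal vectorized sketch for the original DataFrame (assuming DF and its 'prod_prom' column as described in the question):
# Strip the thousands separators, then convert; float because the column contains nulls.
DF['prod_prom'] = DF['prod_prom'].str.replace(',', '').astype(float)
This replaces the row-by-row loop with a single vectorized string operation, which should be considerably faster than iterating over 80,000 rows.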

Related

Drop all rows before first occurrence of a value

I have a df like so:
Year ID Count
1997 1 0
1998 2 0
1999 3 1
2000 4 0
2001 5 1
and I want to remove all rows before the first occurrence of 1 in Count which would give me:
Year ID Count
1999 3 1
2000 4 0
2001 5 1
I can remove all rows AFTER the first occurrence like this:
df=df.loc[: df[(df['Count'] == 1)].index[0], :]
but I can't seem to follow the slicing logic to make it do the opposite.
I'd do:
df[(df.Count == 1).idxmax():]
df.Count == 1 returns a boolean Series. idxmax() identifies the index of the maximum value; since the maximum value is True, and when there is more than one True it returns the position of the first one found, that is exactly what you want. By the way, that value is 2. Finally, I slice the dataframe for everything from 2 onward with df[2:]. I put all of that in one line in the answer above.
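As a self-contained sketch, rebuilding the sample frame shown in the question:
import pandas as pd

df = pd.DataFrame({'Year': [1997, 1998, 1999, 2000, 2001],
                   'ID': [1, 2, 3, 4, 5],
                   'Count': [0, 0, 1, 0, 1]})

mask = df.Count == 1     # boolean Series
first = mask.idxmax()    # index label of the first True -> 2
print(df[first:])        # rows from the first occurrence onward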
You can use the cumsum() method:
In [13]: df[(df.Count == 1).cumsum() > 0]
Out[13]:
Year ID Count
2 1999 3 1
3 2000 4 0
4 2001 5 1
Explanation:
In [14]: (df.Count == 1).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
Name: Count, dtype: int32
Timing against a 500K-row DF:
In [18]: df = pd.concat([df] * 10**5, ignore_index=True)
In [19]: df.shape
Out[19]: (500000, 3)
In [20]: %timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 3.7 ms per loop
In [21]: %timeit df[(df.Count == 1).cumsum() > 0]
100 loops, best of 3: 16.4 ms per loop
In [22]: %timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 7.02 ms per loop
Conclusion: #piRSquared's idxmax() solution is a clear winner...
Using np.where:
df[np.where(df['Count']==1)[0][0]:]
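For reference, np.where returns the positional indices of the matching rows; a short sketch of how the expression breaks down (df being the sample frame from the question):
import numpy as np

idx = np.where(df['Count'] == 1)[0]   # positional indices of the matches -> array([2, 4])
df[idx[0]:]                           # slice from the first match onward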
Timings
Timings were performed on a larger version of the DataFrame:
df = pd.concat([df]*10**5, ignore_index=True)
Results:
%timeit df[np.where(df['Count']==1)[0][0]:]
100 loops, best of 3: 2.74 ms per loop
%timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 6.18 ms per loop
%timeit df[(df.Count == 1).cumsum() > 0]
10 loops, best of 3: 26.6 ms per loop
%timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
100 loops, best of 3: 11.2 ms per loop
Just slice the other way: if idx is your index, do
df.loc[idx:]
instead of
df.loc[:idx]
That means:
df.loc[df[(df['Count'] == 1)].index[0]:, :]

python dataframe - lambda X function - more efficient implementation possible?

In a previous thread, a brilliant response was given to the following problem (Pandas: reshaping data).
The goal is to reshape a pandas series containing lists into a pandas dataframe in the following way:
In [9]: s = Series([list('ABC'),list('DEF'),list('ABEF')])
In [10]: s
Out[10]:
0 [A, B, C]
1 [D, E, F]
2 [A, B, E, F]
dtype: object
should be shaped into this:
Out[11]:
A B C D E F
0 1 1 1 0 0 0
1 0 0 0 1 1 1
2 1 1 0 0 1 1
That is, a dataframe is created where every element in the lists of the series becomes a column. For every element in the series, a row in the dataframe is created. For every element in the lists, a 1 is assigned to the corresponding dataframe column (and 0 otherwise). I know that the wording may be cumbersome, but hopefully the example above is clear.
The brilliant response by user Jeff (https://stackoverflow.com/users/644898/jeff) was to write this simple yet powerful line of code:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
That turns Out[10] into Out[11].
That line of code served me extremely well, however I am running into memory issues with a series of roughly 50K elements and about 100K different elements in all lists. My machine has 16G of memory. Before resorting to a bigger machine, I would like to think of a more efficient implementation of the function above.
Does anyone know how to re-implement the above line:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
to make it more efficient, in terms of memory usage?
You could try breaking your dataframe into chunks and writing to a file as you go, something like this:
chunksize = 10000

def chunk_to_frame(chunk):
    return chunk.apply(lambda x: Series(1, index=x)).fillna(0)

with open('out.csv', 'w') as f:
    f.write(df.iloc[[]].to_csv())  # write the header
    for _, chunk in df.groupby(np.arange(len(df)) // chunksize):
        f.write(chunk_to_frame(chunk).to_csv(header=False))
If memory use is the issue, it seems like a sparse matrix solution would be better. Pandas doesn't really have sparse matrix support, but you could use scipy.sparse like this:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
cols, ind = np.unique(np.concatenate(data), return_inverse=True)
indptr = np.cumsum([0] + list(map(len, data)))
vals = np.ones_like(ind)
M = csr_matrix((vals, ind, indptr))
This sparse matrix now contains the same data as the pandas solution, but the zeros are not explicitly stored. We can confirm this by converting the sparse matrix to a dataframe:
>>> pd.DataFrame(M.toarray(), columns=cols)
A B C D E F
0 1 1 1 0 0 0
1 0 0 0 1 1 1
2 1 1 0 0 1 1
Depending on what you're doing with the data from here, having it in a sparse form may help solve your problem without using excessive memory.
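For instance, a small sketch (using M and cols from above; not part of the original answer) of working with the sparse matrix directly, without ever densifying it:
# Per-column counts computed straight from the sparse matrix.
col_totals = pd.Series(np.asarray(M.sum(axis=0)).ravel(), index=cols)
# col_totals -> A: 2, B: 2, C: 1, D: 1, E: 2, F: 2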

SAS - Selecting optimal quantities

I'm trying to solve a problem in SAS where I have quantities of customers across a range of groups, and the quantities I select need to be as even across the different categories as possible. This will be easier to explain with a small table, which is a simplification of a much larger problem I'm trying to solve.
Here is the table:
Customer Category | Revenue band | Churn Band | # Customers
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
Suppose I need to select 3000 customers from category A, and 3000 customers from category B. From the second category, within each A and B, I need to select an equal amount from 1 and 2. If possible, I need to select a proportional amount across each 1, 2, and 3 subcategories. Is there an elegant solution to this problem? I'm relatively new to SAS and so far I've investigated OPTMODEL, but the examples are either too simple or too advanced to be much use to me yet.
Edit: I've thought about using PROC SURVEYSELECT. I can use this to select equal sizes across Revenue Bands 1, 2, and 3. However, where I'm short of customers in the individual churn bands, SURVEYSELECT may not select the maximum number of customers available, and I'm back to manually selecting customers.
There are still some ambiguities in the problem statement, but I hope that the PROC OPTMODEL code below is a good start for you. I tried to add examples of many different features, so that you can toy around with the model and hopefully get closer to what you actually need.
Of the many things you could optimize, I am minimizing the maximum violation from your "If possible" goal, e.g.:
min MaxMismatch = MaxChurnMismatch;
I was able to model your constraints as a Linear Program, which means that it should scale very well. You probably have other constraints you did not mention, but those would probably be beyond the scope of this site.
With the data you posted, you can see from the output of the print statements that the optimal penalty corresponds to choosing 1500 customers from A,1,1, where the ideal would be 1736. This is more expensive than ignoring the customers from several groups:
[1] ChooseByCat
A 3000
B 3000
[1] [2] [3] Choose IdealProportion
A 1 1 1500 1736.670
A 1 2 0 135.882
A 1 3 0 78.762
A 2 1 28 9.934
A 2 2 1240 1003.330
A 2 3 232 82.310
B 1 1 1500 1580.210
B 1 2 0 193.358
B 1 3 0 161.072
B 2 1 1500 1608.593
B 2 2 0 153.976
B 2 3 0 161.072
Proportion MaxChurnMisMatch
0.35478 236.67
That is probably not the ideal solution, but figuring out how to model your exact requirements would not be as useful for this site. You can contact me offline if that is relevant.
I've added quotes from your problem statement as comments in the code below.
Have fun!
data custCounts;
input cat $ rev churn n;
datalines;
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
;
proc optmodel printlevel = 0;
set CATxREVxCHURN init {} inter {<'A',1,1>};
set CAT = setof{<c,r,ch> in CATxREVxCHURN} c;
num n{CATxREVxCHURN};
read data custCounts into CATxREVxCHURN=[cat rev churn] n;
put n[*]=;
var Choose{<c,r,ch> in CATxREVxCHURN} >= 0 <= n[c,r,ch]
, MaxChurnMisMatch >= 0, Proportion >= 0 <= 1
;
/* From OP:
Suppose I need to select 3000 customers from category A,
and 3000 customers from category B. */
num goal = 3000;
/* See "implicit slice" for the parenthesis notation, i.e. (c) below. */
impvar ChooseByCat{c in CAT} =
sum{<(c),r,ch> in CATxREVxCHURN} Choose[c,r,ch];
con MatchCatGoal{c in CAT}:
ChooseByCat[c] = goal;
/* From OP:
From the second category, within each A and B,
I need to select an equal amount from 1 and 2 */
con MatchRevenueGroupsWithinCat{c in CAT}:
sum{<(c),(1),ch> in CATxREVxCHURN} Choose[c,1,ch]
= sum{<(c),(2),ch> in CATxREVxCHURN} Choose[c,2,ch]
;
/* From OP:
If possible, I need to select a proportional amount
across each 1, 2, and 3 subcategories. */
con MatchBandProportion{<c,r,ch> in CATxREVxCHURN, sign in / 1 -1 /}:
MaxChurnMismatch >= sign * ( Choose[c,r,ch] - Proportion * n[c,r,ch] );
min MaxMismatch = MaxChurnMismatch;
solve;
print ChooseByCat;
impvar IdealProportion{<c,r,ch> in CATxREVxCHURN} = Proportion * n[c,r,ch];
print Choose IdealProportion;
print Proportion MaxChurnMismatch;
quit;

MATLAB: How to read PRE tag and create cellarray with NaN

I am trying to read data from an HTML file.
The data are delimited by <PRE></PRE> tags, e.g.:
<pre>
12.0 29132 -60.3 -91.4 1 0.01 260 753.2 753.3 753.2
10.0 30260 -57.9 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 1009.2 1011.8 1009.3
</pre>
t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');
where t is a cell of char
Now I would like to replace the blank spaces with NaN to obtain:
12.0 29132 -60.3 -91.4 1 0.01 260 NaN 753.2 753.3 753.2
10.0 30260 -57.9 NaN 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 NaN NaN 1009.2 1011.8 1009.3
The data will then be saved to a mydata.dat file.
If you have the HTML file hosted somewhere, then:
url = 'http://www.myDomain.com/myFile.html';
html = urlread(url);
% Use regular expressions to remove undesired HTML markup.
txt = regexprep(html,'<script.*?/script>','');
txt = regexprep(txt,'<style.*?/style>','');
txt = regexprep(txt,'</?pre[^>]*>',''); % strip the <pre> tags but keep their contents
txt = regexprep(txt,'<.*?>','');
Now you should have the data in text form in the txt variable. You can use textscan to parse it, scanning either for the whitespace or for the numbers.
More Info:
- urlread
- regexprep
This isn't the perfect solution but it seems to get you there.
Assuming t is one long string, the delimiter is white space, and you know the number of columns:
numcols = 7;
% Two spaces separate adjacent values; a missing value leaves only its surrounding separators.
sample = '1  2  3  4  5    7  1    3    5    7';
test = textscan(sample,'%f','delimiter',' ','MultipleDelimsAsOne',false);
test = test{:}; % Pull the double out of the cell array
test(2:2:end) = []; % Dump out extra NaNs
test2 = reshape(test,numcols,length(test)/numcols)'; % Have to mess with it a little to reshape rowwise instead of columnwise
Returns:
test2 =
1 2 3 4 5 NaN 7
1 NaN 3 NaN 5 NaN 7
This is assuming the delimiter is white space and constant. Textscan doesn't allow you to stack whitespace as a delimiter, so it throws a NaN after each white space character if there isn't data present. In your example data there are two white space characters between each data point, so every other NaN (or, more generically, n_whitespace - 1) can be thrown out, leaving you with the NaNs you actually want.

Sqlite (C API) and query (select) on cyclic/symmetric values with user defined functions

I'm using Sqlite with C++ and have two similar problems:
1) I need to select 4 entries to make an interpolation.
For example, my table could look like this:
angle (double) | color (double)
0 0.1
30 0.5
60 0.9
90 1.5
... ...
300 2.9
330 3.5
If I want to interpolate the value corresponding to 95°, I will use the entries 60°, 90°, 120° and 150°.
To get those entries, my request will be SELECT color FROM mytable WHERE angle BETWEEN 60 and 150, no big deal.
Now if I want 335°, I will need 300°, 330°, 360°(=0°) and 390°(=30°).
My query will then be SELECT color FROM mytable WHERE angle BETWEEN 300 and 330 OR angle BETWEEN 0 and 30.
I can't use SELECT color FROM mytable WHERE angle BETWEEN 300 and 390 because this would only return 2 colors.
Can I use the C API and user defined functions to include some kind of modulo meaning in my queries ?
It would be nice if I could use a user defined function to use the query [...] BETWEEN 300 and 390 and get as result the rows 300, 330, 0 and 30.
2) Another table looks like this:
speed (double) | color (double) | var (double)
0 0.1 0
10 0.5 1
20 0.9 2
30 1.5 3
... ... ...
In reality due to symmetry, color(speed) = color(-speed) but var(-speed) = myfunc(var(speed)).
I would like to make queries such as SELECT * FROM mytable WHERE speed BETWEEN -20 and 10 and be able to make a few operations with the API on the "virtual" rows with a negative speed and return them as a regular result.
For example I would like the result of the query SELECT * FROM mytable WHERE speed BETWEEN -20 and 10 to be like this :
speed (double) | color (double) | var (double)
-20 0.9 myfunc(2)
-10 0.5 myfunc(1)
0 0.1 0
10 0.5 1
Is that possible?
Thanks for your help :)
I would suggest using a query with two intervals:
SELECT * from mytable WHERE (speed >= MIN(?1,?2) AND speed <= MAX(?1,?2)) OR ((MAX(?1,?2) > 360) AND (speed >= 0 AND speed <= MAX(?1,?2)%360));
This example works fine if ?1 and ?2 are positive.
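Regarding the user-defined-function idea from the question: in the C API this would go through sqlite3_create_function. As a rough illustration of the same idea, here is a sketch using Python's sqlite3 module (the table and column names are taken from the question; the helper name in_cyclic_range is made up):
import sqlite3

def in_cyclic_range(angle, lo, hi):
    # Treat everything modulo 360 so that a range like [300, 390] wraps around 0.
    a = angle % 360.0
    lo, hi = lo % 360.0, hi % 360.0
    if lo <= hi:
        return int(lo <= a <= hi)
    return int(a >= lo or a <= hi)  # the interval wraps past 0

conn = sqlite3.connect('mydata.db')
conn.create_function('in_cyclic_range', 3, in_cyclic_range)
rows = conn.execute(
    "SELECT angle, color FROM mytable WHERE in_cyclic_range(angle, 300, 390)").fetchall()
With the sample angle table this returns the rows for 300, 330, 0 and 30, which is what the BETWEEN 300 and 390 query was meant to express.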