Pandas list comprehension in a dataframe - python-2.7

I would like to take the next day's opening price, currently stored in the following row (row + 1), and store it in a new column if some condition is met.
df['b']=''
df['shift']=''
df['shift']=df['open'].shift(-1)
df['b']=df[x for x in df['shift'] if df["MA10"]>df["MA100"]]

There are a few approaches. Using apply:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv("bondstack.csv")
>>> df["shift"] = df["open"].shift(-1)
>>> df["b"] = df.apply(lambda row: row["shift"] if row["MA10"] > row["MA100"] else np.nan, axis=1)
which produces
>>> df[["MA10", "MA100", "shift", "b"]][:10]
MA10 MA100 shift b
0 16.915625 17.405625 16.734375 NaN
1 16.871875 17.358750 17.171875 NaN
2 16.893750 17.317187 17.359375 NaN
3 16.950000 17.279062 17.359375 NaN
4 17.137500 17.254062 18.640625 NaN
5 17.365625 17.229063 18.921875 18.921875
6 17.550000 17.200312 18.296875 18.296875
7 17.681250 17.177500 18.640625 18.640625
8 17.812500 17.159375 18.609375 18.609375
9 17.943750 17.142813 18.234375 18.234375
For a more vectorized approach, you could use
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = np.nan
>>> df.loc[df["MA10"] > df["MA100"], "b"] = df["open"].shift(-1)
or my preferred approach:
>>> df = pd.read_csv("bondstack.csv")
>>> df["b"] = df["open"].shift(-1).where(df["MA10"] > df["MA100"])
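For readers without bondstack.csv, here is a minimal, self-contained sketch of the shift-then-where approach. The tiny frame and its values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up sample data; only the column names match the question.
df = pd.DataFrame({
    "open":  [10.0, 11.0, 12.0, 13.0],
    "MA10":  [1.0, 3.0, 1.0, 3.0],
    "MA100": [2.0, 2.0, 2.0, 2.0],
})

# shift(-1) moves each value up one row, so row i holds the open of row i + 1.
next_open = df["open"].shift(-1)

# where() keeps values where the condition is True and puts NaN elsewhere.
df["b"] = next_open.where(df["MA10"] > df["MA100"])
print(df)
```

Note that the last row always ends up NaN, because there is no following row to shift from, regardless of the condition.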

Modifying DSM's approach 3, stating True/False values in np.where explicitly:
#numpy.where(condition, x, y)
df["b"] = np.where(df["MA10"] > df["MA100"], df["open"].shift(-1), np.nan)
Using list comprehension explicitly:
#[xv if c else yv for (c,xv,yv) in zip(condition,x,y)] #np.where documentation
df['b'] = [ xv if c else np.nan for (c,xv) in zip(df["MA10"]> df["MA100"], df["open"].shift(-1))]

Null independent column wise mean calculation in Python

I am trying to calculate the mean of three columns in Python. Here is the catch:
If all 3 row values of my 3 columns are not null, then my mean will be (x+y+z)/3.
If one of my row values is null (suppose z), then my mean should be (x+y)/2.
I'm storing these mean values in a separate column which is part of the pandas dataframe.
I'm looking for the best approach as my dataset has over 2 million rows.
My data is below.
Thanks in advance.
A B C
0 1 2 3 # = (1+2+3)/3 = 2
1 4 NaN 6 # = (4+6)/2 = 5
2 NaN 8 9 # = (8+9)/2 = 8.5
Just apply the numpy.nanmean function along axis 0 (columns). This is the default axis, so you get the same result if you omit axis = 0. For row-wise means, which is what the question asks for, use axis = 1:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': [2.3, 4.5, 2.1, np.nan, 6.7],
'b': [2.4, 5.6, np.nan, np.nan, 7.1],
'c': [np.nan, np.nan, np.nan, np.nan, 0.9]
})
colmeans = df.apply(np.nanmean, axis = 0)
# colmeans
# a 3.900000
# b 5.033333
# c 0.900000
# dtype: float64
rowmeans = df.apply(np.nanmean, axis = 1)
# 0 2.35
# 1 5.05
# 2 2.10
# 3 NaN
# 4 4.90
# dtype: float64
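As an alternative sketch for the row-wise case in the question: DataFrame.mean skips NaN by default (skipna=True), so it can be called directly on the three columns without apply, which should matter with a couple of million rows. The column name row_mean is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 4, np.nan],
    "B": [2, np.nan, 8],
    "C": [3, 6, 9],
})

# mean(axis=1) averages across columns for each row; NaN values are skipped
# because skipna defaults to True.
df["row_mean"] = df[["A", "B", "C"]].mean(axis=1)
print(df)
```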

how to index np.nan in a DataFrame

I created a DataFrame df with some NaN values in the column label. How can I get the index of the NaN entries?
I have tried df['label'] == np.nan, but it doesn't seem to work, whereas sum(df['Adj. Volume'] == 5090527.0) gives the right answer. What is happening? Why doesn't == np.nan work?
The DataFrame is like this
Use isnull to test for NaN values:
df[df['label'].isnull()]
This will return all rows in your df where the label is NaN.
The equality operator doesn't work with NaN, which is why == np.nan doesn't work.
NaN has the counter-intuitive property that np.nan != np.nan.
Example:
In [5]:
s = pd.Series([0, np.nan, 3])
s
Out[5]:
0 0.0
1 NaN
2 3.0
dtype: float64
In [6]:
s == np.nan
Out[6]:
0 False
1 False
2 False
dtype: bool
In [7]:
s != s
Out[7]:
0 False
1 True
2 False
dtype: bool
You can see in the last example that testing s != s returns True only for the NaN entry.
Using isnull also gives the same result:
In [8]:
s.isnull()
Out[8]:
0 False
1 True
2 False
dtype: bool
You can then access the index attribute of the above to get just the index values:
In [10]:
s[s.isnull()].index
Out[10]:
Int64Index([1], dtype='int64')
I think you need boolean indexing with isnull and then return index by .index:
print (df[df.label.isnull()].index)
Sample:
df = pd.DataFrame({'A':[1,2,3],
'label':[4,np.nan,np.nan],
'C':[7,8,9]})
print (df)
A C label
0 1 7 4.0
1 2 8 NaN
2 3 9 NaN
print (df[df.label.isnull()].index)
Int64Index([1, 2], dtype='int64')
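Putting the two answers together, here is a short sketch that contrasts the failing equality test with isnull and collects the index labels as a plain list. The variable names eq_mask and nan_positions are illustrative, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3],
                   "label": [4, np.nan, np.nan],
                   "C": [7, 8, 9]})

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)  # False

# Elementwise == against NaN is therefore False everywhere.
eq_mask = df["label"] == np.nan

# isnull() is the reliable test; use the mask to select from the index.
nan_positions = df.index[df["label"].isnull()].tolist()
print(nan_positions)
```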

Find category-of-4 for number starting at 2

I'm tasked with categorizing numbers into groups with a range of four, starting at 2 (2, 6, 10, 14...). So for the number 9, the category would be 6 (between 6 and 10). I've developed the following function but I'm guessing there's a more efficient means and one that isn't limited in range.
>>> def FindCategory (num):
        categories = [2]
        lastVal = 2
        for i in range (100):
            lastVal = lastVal + 4
            categories += [lastVal]
        try:
            return [cat for cat in categories if cat < num and num < cat + 4] [0]
        except:
            return
>>> FindCategory (56)
54
>>> FindCategory (99999999999999999999999999)
>>>
Just use math?
def category(n):
    return (((n + 2) // 4) * 4) - 2
Examples:
>>> category(2)
2
>>> category(56)
54
>>> category(99)
98
>>> category(99999999999999999999999999)
99999999999999999999999998
By way of explanation: without the shift-by-2, you're just looking for the closest (lower) multiple of four, which can be found just by integer-division and then multiplication by 4 (i.e. (n//4)*4). The +2 and -2 account for the shift in your categories.
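The same idea can be sketched with the offset and the range width as parameters. The start and width arguments are additions made for illustration; with the defaults the function should give the same results as the version above.

```python
def category(n, start=2, width=4):
    """Return the lower bound of the bucket containing n.

    Buckets are [start, start + width), [start + width, start + 2 * width), ...
    The start and width parameters are illustrative additions.
    """
    # Shift so buckets begin at 0, floor to a multiple of width, shift back.
    return start + ((n - start) // width) * width


print(category(9))
print(category(56))
```

A number sitting exactly on a boundary, such as 6, maps to itself here, whereas the loop-based function in the question returns None for it because it uses strict comparisons.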

Python read a file and make a nth list from the

I have a file with n lines, each containing 2 elements, like below:
1 2
2 3
3 4
4 5
1 6
2 7
1 8
I need to build lists in Python like this:
list[1]=[2,6,8]
list[2]=[3,7]
list[3]=[4]
list[4]=[5]
How can I do this?
Try:
import pandas as pd
a = [[1,2], [2,3], [3,4], [4, 5], [1, 6], [2,7], [1,8]]
df = pd.DataFrame(a,columns=['b','c'])
print(df)
# for each value of 'b', collect the unique values of every other column into a list
z = df.groupby(['b']).apply(lambda tdf:pd.Series(dict([[vv,tdf[vv].unique().tolist()] for vv in tdf if vv not in ['b']])))
z = z.sort_index()
print(z)
print(z['c'][1])
print(z['c'][2])
print(z['c'][3])
print(z['c'][4])
z['d'] = 0.000
z[['d']] = z[['d']].astype(float)
len_b = len(z.index)
z['d'] = float(len_b)
z['e'] = 1/z['d']
z = z[['c', 'e']]
z.to_csv('your output folder')
print(z)
See this answer for more details: https://stackoverflow.com/a/24112443/2632856
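If pandas is not otherwise needed, a plain-Python sketch with collections.defaultdict may be enough. The function name group_pairs is hypothetical, and the sample lines stand in for the file from the question.

```python
from collections import defaultdict


def group_pairs(lines):
    """Group the second value of each 'key value' line under its key."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = int(parts[0]), int(parts[1])
        groups[key].append(value)
    return dict(groups)


# Sample lines from the question; with a real file, pass the open file
# object instead, e.g. `with open(path) as f: group_pairs(f)`.
sample = ["1 2", "2 3", "3 4", "4 5", "1 6", "2 7", "1 8"]
result = group_pairs(sample)
print(result)
```

Unlike the unique() call in the pandas version, this keeps repeated values for a key.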

Pandas Dataframe ValueError: Shape of passed values is (X, ), indices imply (X, Y)

I am getting an error and I'm not sure how to fix it.
The following seems to work:
def random(row):
return [1,2,3,4]
df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df.apply(func = random, axis = 1)
and my output is:
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
However, when I change one of the columns to a value such as 1 or None:
def random(row):
    return [1,2,3,4]
df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df['E'] = 1
df.apply(func = random, axis = 1)
I get the error:
ValueError: Shape of passed values is (5,), indices imply (5, 5)
I've been wrestling with this for a few days now and nothing seems to work. What is interesting is that when I change
def random(row):
    return [1,2,3,4]
to
def random(row):
    print [1,2,3,4]
everything seems to work normally.
This question is a clearer way of asking this question, which I feel may have been confusing.
My goal is to compute a list for each row and then create a column out of that.
EDIT: I originally start with a dataframe that has one column. I add 4 columns in 4 different apply steps, and then when I try to add another column I get this error.
If your goal is to add a new column to the DataFrame, just write your function so that it returns a scalar value (not a list), something like this:
>>> def random(row):
...     return row.mean()
and then use apply:
>>> df['new'] = df.apply(func = random, axis = 1)
>>> df
A B C D new
0 0.201143 -2.345828 -2.186106 -0.784721 -1.278878
1 -0.198460 0.544879 0.554407 -0.161357 0.184867
2 0.269807 1.132344 0.120303 -0.116843 0.351403
3 -1.131396 1.278477 1.567599 0.483912 0.549648
4 0.288147 0.382764 -0.840972 0.838950 0.167222
I don't know if it is possible for your new column to contain lists, but it is definitely possible for it to contain tuples ((...) instead of [...]):
>>> def random(row):
...     return (1,2,3,4,5)
...
>>> df['new'] = df.apply(func = random, axis = 1)
>>> df
A B C D new
0 0.201143 -2.345828 -2.186106 -0.784721 (1, 2, 3, 4, 5)
1 -0.198460 0.544879 0.554407 -0.161357 (1, 2, 3, 4, 5)
2 0.269807 1.132344 0.120303 -0.116843 (1, 2, 3, 4, 5)
3 -1.131396 1.278477 1.567599 0.483912 (1, 2, 3, 4, 5)
4 0.288147 0.382764 -0.840972 0.838950 (1, 2, 3, 4, 5)
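If the column really has to hold lists, one way to sidestep apply's shape inference is to build the lists in a comprehension and wrap them in a Series. This is a sketch rather than the answerer's method; make_list is a hypothetical stand-in for the real per-row computation.

```python
import numpy as np
import pandas as pd


def make_list(row):
    # Hypothetical per-row computation; returns a list for each row.
    return [row["A"], row["B"]]


df = pd.DataFrame(np.random.randn(5, 4), columns=list("ABCD"))
df["E"] = 1

# Build one list per row, then wrap them in a Series aligned to df's index.
values = [make_list(row) for _, row in df.iterrows()]
df["new"] = pd.Series(values, index=df.index)
print(df)
```

The resulting column has object dtype, so vectorized numeric operations will not work on it directly.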
I use the code below and it works just fine:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array(your_data), columns=columns)