Removing some particular rows in pandas - python-2.7

I want to delete some rows in pandas dataframe.
ID Value
2012XY000 1
2012XY001 1
.
.
.
2015AB000 4
2015PQ001 5
.
.
.
2016DF00G 2
I want to delete rows whose ID does not start with 2015.
How should I do that?

Use startswith with boolean indexing:
print (df.ID.str.startswith('2015'))
0 False
1 False
2 True
3 True
4 False
Name: ID, dtype: bool
print (df[df.ID.str.startswith('2015')])
ID Value
2 2015AB000 4
3 2015PQ001 5
EDIT by comment:
print (df)
ID Value
0 2012XY000 1
1 2012XY001 1
2 2015AB000 4
3 2015PQ001 5
4 2015XQ001 5
5 2016DF00G 2
print ((df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X'))
0 False
1 False
2 True
3 True
4 False
5 False
Name: ID, dtype: bool
print (df[(df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X')])
ID Value
2 2015AB000 4
3 2015PQ001 5

Use str.match with regex string r'^2015':
df[df.ID.str.match(r'^2015')]
To exclude those that have an X afterwards.
df[df.ID.str.match(r'^2015[^X]')]
The regex r'^2015[^X]' translates into
^2015 - must start with 2015
[^X] - character after 2015 must not be X
consider the df
then
df[df.ID.str.match(r'^2015[^X]')]

Related

for loop in pandas to search dataframe and update list stuck

I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 -9). I would like to have a new column with the results added to a dataframe depending on a variable 'marker' (ranging from 0 - x) which tells me when one 'picture' is done and the next begins (one marker can go on for a variable length of rows). This is my code so far but it seems to be stuck and runs on without giving output. I tried reconstructing it from the beginning once but as soon as i get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
if df.marker == num: ## if marker = num its one picture
for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
if (item == aoi[index]):
aoi_array[index] += 1
print (aoi)
print (aoi_array)
se = pd.Series(aoi_array) # make list into a series to attach to dataframe
df_aoi['new_col'] = se.values #add list to dataframe
aoi_array.clear() #clears list before next picture
else:
num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear based on your question but it sounds like you want to count the number of rows for each which_AOI value in each marker.
You can accomplish this using groupby
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0

Transform a list to a list of average values (by step)

I have a two dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step so that for a step=10 it would like like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.
import pandas as pd
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)
df['group'] = (df[0] / 10).astype(int)
Now df is:
0 1 group
0 12.2 5325 1
1 13.4 235326 1
2 15.9 235326 1
3 17.7 53521 1
4 21.3 42342 2
5 22.6 6546 2
6 25.9 34634 2
7 27.2 523523 2
8 33.4 235325 3
9 36.2 235352 3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
0 1
group
1 14.80 132374
2 24.25 151761
3 34.80 235338

Keeping some data from duplicates and adding to existing python dataframe

I have an issue with keeping some data from duplicates and wanting to add valuable information to a new column in the dataframe.
import pandas as pd
data = {'id':[1,1,2,2,3],'key':[1,1,2,2,1],'value0':['a', 'b', 'x', 'y', 'a']}
frame = pd.DataFrame(data, columns = ['id','key','value0'])
print frame
Yields:
id key value0
0 1 1 a
1 1 1 b
2 2 2 x
3 2 2 y
4 3 1 a
Desired Output:
key value0_0 value0_1 value1_0
0 1 a b a
1 2 x y None
The "id" column isn't important to keep but could help with iteration and grouping.
I think this could be adapted to other projects where you don't know how many values exist for a set of keys.
set_index including a cumcount and unstack
frame.set_index(
['key', frame.groupby('key').cumcount()]
).value0.unstack().add_prefix('value0_').reset_index()
key value0_0 value0_1 value0_2
0 1 a b a
1 2 x y None
I'm questioning your column labeling but here is an approach using binary
frame.set_index(
['key', frame.groupby('key').cumcount()]
).value0.unstack().rename(
columns='{:02b}'.format
).add_prefix('value_').reset_index()
key value_00 value_01 value_10
0 1 a b a
1 2 x y None

Function I defined is not cleaning my list properly

Here is my minimal working example:
list1 = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] #len = 21
list2 = [1,1,1,0,1,0,0,1,0,1,1,0,1,0,1,0,0,0,1,1,0] #len = 21
list3 = [0,0,1,0,1,1,0,1,0,1,0,1,1,1,0,1,0,1,1,1,1] #len = 21
list4 = [1,0,0,1,1,0,0,0,0,1,0,1,1,1,1,0,1,0,1,0,1] #len = 21
I have four lists and I want to "clean" my list 1 using the following rule: "if any of list2[i] or list3[i] or list4[i] are equal to zero, then I want to eliminate the item I from list1. SO basically I only keep those elements of list1 such that the other lists all have ones there.
here is the function I wrote to solve this
def clean(list1, list2,list3,list4):
for i in range(len(list2)):
if (list2[i]==0 or list3[i]==0 or list4[i]==0):
list1.pop(i)
return list1
however it doesn't work. If you apply it, it give the error
Traceback (most recent call last):line 68, in clean list1.pop(I)
IndexError: pop index out of range
What am I doing wrong? Also, I was told Pandas is really good in dealing with data. Is there a way I can do it with Pandas? Each of these lists are actually columns (after removing the heading) of a csv file.
EDIT
For example at the end I would like to get: list1 = [4,9,11,15]
I think the main problem is that at each iteration, when I pop out the elements, the index of all the successor of that element change! And also, the overall length of the list changes, and so the index in pop() is too large. So hopefully there is another strategy or function that I can use
This is definitely a job for pandas:
import pandas as pd
df = pd.DataFrame({
'l1':list1,
'l2':list2,
'l3':list3,
'l4':list4
})
no_zeroes = df.loc[(df['l2'] != 0) & (df['l3'] != 0) & (df['l4'] != 0)]
Where df.loc[...] takes the full dataframe, then filters it by the criteria provided. In this example, your criteria are that you only keep the items where l2, l3, and l3 are not zero (!= 0).
Gives you a pandas dataframe:
l1 l2 l3 l4
4 4 1 1 1
9 9 1 1 1
12 12 1 1 1
18 18 1 1 1
or if you need just list1:
list1 = df['l1'].tolist()
if you want the criteria to be where all other columns are 1, then use:
all_ones = df.loc[(df['l2'] == 1) & (df['l3'] == 1) & (df['l4'] == 1)]
Note that I'm creating new dataframes for no_zeroes and all_ones and that the original dataframe stays intact if you want to further manipulate the data.
Update:
Per Divakar's answer (far more elegant than my original answer), much the same can be done in pandas:
df = pd.DataFrame([list1, list2, list3, list4])
list1 = df.loc[0, (df[1:] != 0).all()].astype(int).tolist()
Here's one approach with NumPy -
import numpy as np
mask = (np.asarray(list2)==1) & (np.asarray(list3)==1) & (np.asarray(list4)==1)
out = np.asarray(list1)[mask].tolist()
Here's another way with NumPy that stacks those lists into rows to form a 2D array and thus simplifies things quite a bit -
arr = np.vstack((list1, list2, list3, list4))
out = arr[0,(arr[1:] == 1).all(0)].tolist()
Sample run -
In [165]: arr = np.vstack((list1, list2, list3, list4))
In [166]: print arr
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
[ 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 1 0]
[ 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1]
[ 1 0 0 1 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 1]]
In [167]: arr[0,(arr[1:] == 1).all(0)].tolist()
Out[167]: [4, 9, 12, 18]

scikit-learn: One hot encoding of column with list values [duplicate]

This question already has answers here:
How to one hot encode variant length features?
(2 answers)
Closed 5 years ago.
I am trying to encode a dataframe like below:
A B C
2 'Hello' ['we', are', 'good']
1 'All' ['hello', 'world']
Now as you can see I can labelencod string values of second column, but I am not able to figure out how to go about encode the third column which is having list of string values and length of the lists are different. Even if i onehotencode this i will get an array which i dont know how to merge with array elements of other columns after encoding. Please suggest some good technique
Assuming we have the following DF:
In [31]: df
Out[31]:
A B C
0 2 Hello [we, are, good]
1 1 All [hello, world]
Let's use sklearn.feature_extraction.text.CountVectorizer
In [32]: from sklearn.feature_extraction.text import CountVectorizer
In [33]: vect = CountVectorizer()
In [34]: X = vect.fit_transform(df.C.str.join(' '))
In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))
In [36]: df
Out[36]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
alternatively you can use sklearn.preprocessing.MultiLabelBinarizer as #VivekKumar suggested in this comment
In [56]: from sklearn.preprocessing import MultiLabelBinarizer
In [57]: mlb = MultiLabelBinarizer()
In [58]: X = mlb.fit_transform(df.C)
In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))
In [60]: df
Out[60]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1