Pandas: strip numbers from a string column in Python (python-2.7)

I have a column in pandas that has strings and numbers mixed together, and I want to strip the purely numeric entries out of it.
A
11286011
11268163
C7DDA72897
C8ABC557
Abul
C80DAS577
C80DSS665
I want the output to be:
A
C7DDA72897
C8ABC557
Abul
C80DAS577
C80DSS665

In [52]: df
Out[52]:
A
0 11286011
1 11268163
2 C7DDA72897
3 C8ABC557
4 C80DAS577
5 C80DSS665
In [53]: df = pd.to_numeric(df.A, errors='coerce').dropna()
In [54]: df
Out[54]:
0 11286011.0
1 11268163.0
Name: A, dtype: float64
Or, using RegEx:
In [59]: df.loc[~df.A.str.contains(r'\D+')]
Out[59]:
A
0 11286011
1 11268163

You can use .str.isnumeric to build a boolean mask for slicing:
df[df.A.astype(str).str.isnumeric()]
A
0 11286011
1 11268163
As pointed out by @MaxU, assuming every element is already a string, you can shorten this to:
df[df.A.str.isnumeric()]
A
0 11286011
1 11268163
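Note that the snippets above select the purely numeric rows. To get the desired output from the question (keep the string IDs and drop the pure numbers), you can negate the mask; a minimal sketch under the same assumptions as the answers above:
# keep the rows that are NOT purely numeric, matching the desired output above
df[~df.A.astype(str).str.isnumeric()]
# or, with the to_numeric approach: keep the rows where coercion to a number fails
df[pd.to_numeric(df.A, errors='coerce').isnull()]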

Remove part of string after a specific character in Python

Given a time column as follows:
time
0 2019Y8m16d10h
1 2019Y9m3d10h
2 2019Y9m3d10h58s
3 2019Y9m3d10h
How can I remove the substring that follows the d? I have tried df['time'].str.split('d')[0], but it doesn't work.
My desired result looks like this. Thank you.
time
0 2019Y8m16d
1 2019Y9m3d
2 2019Y9m3d
3 2019Y9m3d
You are close; you need .str[0] to select the first element of each list and then add 'd' back:
df['time'] = df['time'].str.split('d').str[0].add('d')
Or:
df['time'] = df['time'].str.split('(d)').str[:2].str.join('')
print (df)
time
0 2019Y8m16d
1 2019Y9m3d
2 2019Y9m3d
3 2019Y9m3d
Or use Series.str.extract:
df['time'] = df['time'].str.extract('(.+d)')
print (df)
time
0 2019Y8m16d
1 2019Y9m3d
2 2019Y9m3d
3 2019Y9m3d
One possible solution:
df['time'].str.extract(r'([^d]+d)')
Or you can simply use apply to do the same, as follows:
df.apply(lambda x: x['time'].split('d')[0]+'d',axis=1)
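As an aside on why the original attempt df['time'].str.split('d')[0] does not work: plain [0] selects row 0 of the resulting Series of lists, whereas .str[0] selects the first piece of every row's list. A quick sketch:
parts = df['time'].str.split('d')
parts[0]       # the list from row 0 only: ['2019Y8m16', '10h']
parts.str[0]   # the first piece of every row: '2019Y8m16', '2019Y9m3', ...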

Finding occurrences of substrings within pandas dataframe -- Python

I have a list of 'words' I want to count, shown below:
word_list = ['one','two','three']
And I have a column within a pandas dataframe with the text below:
TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."
The desired output is shown below: I want to count the number of times the substrings defined in word_list appear across the strings in each row of the dataframe.
Word | Count
one 5
two 3
three 2
Is there a way to do this in Python 2.7?
I would do this with vanilla Python; first join the strings:
In [11]: long_string = "".join(df[0]).lower()
In [12]: long_string[:50] # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"
In [13]: for w in word_list:
...: print(w, long_string.count(w))
...:
one 5
two 3
three 2
If you want to return a Series, you could use a dict comprehension:
In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one 5
three 2
two 3
dtype: int64
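One caveat with the joined-string approach: joining with "" glues the last word of one row onto the first word of the next (visible in the output above), and .count also matches substrings inside longer words. A hedged variant, assuming the column is named TEXT as in the question and that whole-word matches are wanted:
import re
import pandas as pd

long_string = " ".join(df['TEXT']).lower()
pd.Series({w: len(re.findall(r'\b' + re.escape(w) + r'\b', long_string))
           for w in word_list})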
Use str.extractall + value_counts:
df
text
0 "Perhaps she'll be the one for me."
1 "Is it two or one?"
2 "Mayhaps it be three afterall..."
3 "Three times and it's a charm."
4 "One fish, two fish, red fish, blue fish."
5 "There's only one cat in the hat."
6 "One does not simply code into pandas."
7 "Two nights later..."
8 "Quoth the Raven... nevermore."
rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
one 5
two 3
three 2
Name: 0, dtype: int64
Details
rgx
'(one|two|three)'
df.text.str.lower().str.extractall(rgx).iloc[:, 0]
match
0 0 one
1 0 two
1 one
2 0 three
3 0 three
4 0 one
1 two
5 0 one
6 0 one
7 0 two
Name: 0, dtype: object
Performance
Small
# Zero's code
%%timeit
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)
10000 loops, best of 3: 132 µs per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop
Large
df = pd.concat([df] * 100000)
%%timeit
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)
10 loops, best of 3: 151 ms per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop
Use:
In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one 5
three 2
two 3
dtype: int64
Or, to count multiple occurrences in each row (this needs import re for the IGNORECASE flag):
In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one 5
three 2
two 3
dtype: int64
Use sort_values to order the result:
In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
In [56]: s.sort_values(ascending=False)
Out[56]:
one 5
two 3
three 2
dtype: int64

scikit-learn: One hot encoding of column with list values [duplicate]

I am trying to encode a dataframe like below:
A B C
2 'Hello' ['we', 'are', 'good']
1 'All' ['hello', 'world']
As you can see, I can label-encode the string values of the second column, but I cannot figure out how to encode the third column, which holds lists of strings of varying lengths. Even if I one-hot encode it, I get an array that I don't know how to merge with the encoded arrays of the other columns. Please suggest a good technique.
Assuming we have the following DF:
In [31]: df
Out[31]:
A B C
0 2 Hello [we, are, good]
1 1 All [hello, world]
Let's use sklearn.feature_extraction.text.CountVectorizer
In [32]: from sklearn.feature_extraction.text import CountVectorizer
In [33]: vect = CountVectorizer()
In [34]: X = vect.fit_transform(df.C.str.join(' '))
In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))
In [36]: df
Out[36]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
Alternatively, you can use sklearn.preprocessing.MultiLabelBinarizer, as @VivekKumar suggested in a comment:
In [56]: from sklearn.preprocessing import MultiLabelBinarizer
In [57]: mlb = MultiLabelBinarizer()
In [58]: X = mlb.fit_transform(df.C)
In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))
In [60]: df
Out[60]:
A B C are good hello we world
0 2 Hello [we, are, good] 1 1 0 1 0
1 1 All [hello, world] 0 0 1 0 1
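For the second part of the question (merging the encodings of the other columns), a minimal sketch on the same data, using pd.get_dummies for B next to MultiLabelBinarizer for C; the 'B' prefix is just illustrative:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'A': [2, 1],
                   'B': ['Hello', 'All'],
                   'C': [['we', 'are', 'good'], ['hello', 'world']]})

mlb = MultiLabelBinarizer()
c_enc = pd.DataFrame(mlb.fit_transform(df['C']), columns=mlb.classes_, index=df.index)

# one-hot encode B and concatenate everything column-wise
encoded = pd.concat([df[['A']], pd.get_dummies(df['B'], prefix='B'), c_enc], axis=1)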

How to split the data in the data frame in python?

I used the below code:
import pandas as pd
pandas_bigram = pd.DataFrame(bigram_data)
print pandas_bigram
I got the output below:
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
3 the free**2
4 free encyclopedia**2
5 encyclopedia ashoka**1
6 ashoka from**2
7 from wikipedia,**1
8 wikipedia, the**2
9 the free**2
10 free encyclopedia**2
My question is: how do I split this data frame so that I get the data in two columns? The data here is separated by "**".
import pandas as pd
df= [" ashoka -**0","- wikipedia,**1","wikipedia, the**2"]
df=pd.DataFrame(df)
print(df)
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
Use the split function: the method split() returns a list of the pieces of a string, using the given string as the separator (it splits on all whitespace if left unspecified), optionally limiting the number of splits to n.
df1 = pd.DataFrame(df[0].str.split('*', 1).tolist(),
                   columns=['0', '1'])
print(df1)
0 1
0 ashoka - *0
1 - wikipedia, *1
2 wikipedia, the *2
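A hedged alternative, assuming the delimiter really is the literal two-character '**': split on the escaped pattern so the stray '*' does not end up in the second column:
df1 = df[0].str.split(r'\*\*', n=1, expand=True)
df1.columns = ['0', '1']
print(df1)
# the second column now holds just the trailing number, without the leftover '*'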

Removing some particular rows in pandas

I want to delete some rows in a pandas dataframe.
ID Value
2012XY000 1
2012XY001 1
.
.
.
2015AB000 4
2015PQ001 5
.
.
.
2016DF00G 2
I want to delete rows whose ID does not start with 2015.
How should I do that?
Use startswith with boolean indexing:
print (df.ID.str.startswith('2015'))
0 False
1 False
2 True
3 True
4 False
Name: ID, dtype: bool
print (df[df.ID.str.startswith('2015')])
ID Value
2 2015AB000 4
3 2015PQ001 5
EDIT by comment:
print (df)
ID Value
0 2012XY000 1
1 2012XY001 1
2 2015AB000 4
3 2015PQ001 5
4 2015XQ001 5
5 2016DF00G 2
print ((df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X'))
0 False
1 False
2 True
3 True
4 False
5 False
Name: ID, dtype: bool
print (df[(df.ID.str.startswith('2015')) & (df.ID.str[4] != 'X')])
ID Value
2 2015AB000 4
3 2015PQ001 5
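Since the question asks to delete the unwanted rows, note that boolean indexing returns a new filtered frame; a small sketch of actually replacing df under the startswith approach:
# keep only the 2015 rows by assigning the filtered frame back
df = df[df.ID.str.startswith('2015')].reset_index(drop=True)

# or, equivalently, drop the unwanted rows via the inverted mask
# df = df.drop(df.index[~df.ID.str.startswith('2015')])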
Use str.match with regex string r'^2015':
df[df.ID.str.match(r'^2015')]
To exclude those that have an X right after 2015:
df[df.ID.str.match(r'^2015[^X]')]
The regex r'^2015[^X]' translates into
^2015 - must start with 2015
[^X] - character after 2015 must not be X
Applied to the example df above, df[df.ID.str.match(r'^2015[^X]')] keeps only the rows whose ID starts with 2015 and is not followed by an X.