How to remove everything but letters, numbers and ! ? . ; , # ' using regex in a Python pandas df?

I am trying to remove everything but letters, numbers and ! ? . ; , # ' from the text column of my pandas dataframe.
I have already read some other questions on the topic, but still cannot make mine work.
Here is an example of what I am doing:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['hey+ guys! wuzup',
                            'hello p3ople!What\'s up?',
                            'hey, how- thing == do##n',
                            'my name is bond, james b0nd']})
Then we have the following table:
id text
1 hey+ guys! wuzup
2 hello p3ople!What\'s up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Now, trying to remove everything but letters, numbers and ! ? . ; , # '
First try:
df.loc[:,'text'] = df['text'].str.replace(r"^(?!(([a-zA-z]|[\!\?\.\;\,\#\'\"]|\d))+)$",' ',regex=True)
output
id text
1 hey+ guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Second try
df.loc[:,'text'] = df['text'].str.replace(r"(?i)\b(?:(([a-zA-Z\!\?\.\;\,\#\'\"\:\d])))",' ',regex=True)
output
id text
1 ey+ uys uzup
2 ello 3ople hat p
3 ey ow- hing == o##
4 y ame s ond ames 0nd
Third try
df.loc[:,'text'] = df['text'].str.replace(r'(?i)(?<!\w)(?:[a-zA-Z\!\?\.\;\,\#\'\"\:\d])',' ',regex=True)
output
id text
1 ey+ uys! uzup
2 ello 3ople! hat' p?
3 ey, ow- hing == o##
4 y ame s ond, ames 0nd
Afterwards, I also tried the re.sub() function with the same regex patterns, but still did not manage to get the expected result, which is as follows:
id text
1 hey guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing don
4 my name is bond, james b0nd
Can anyone help me with that?
Links that I have seen on the topic:
Is there a way to remove everything except characters, numbers and '-' from a string
How do check if a text column in my dataframe, contains a list of possible patterns, allowing mistyping?
removing newlines from messy strings in pandas dataframe cells?
https://stackabuse.com/using-regex-for-text-manipulation-in-python/

Is this what you are looking for?
df.text.str.replace("(?i)[^0-9a-z!?.;,#' -]",'')
Out:
0 hey guys! wuzup
1 hello p3ople!What's up?
2 hey, how- thing don
3 my name is bond, james b0nd
Name: text, dtype: object
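Note that depending on your pandas version you may need to pass regex=True explicitly: str.replace stopped treating patterns as regexes by default in pandas 2.0. A minimal self-contained sketch of this approach:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['hey+ guys! wuzup',
                            'hello p3ople!What\'s up?',
                            'hey, how- thing == do##n',
                            'my name is bond, james b0nd']})

# a negated character class removes everything NOT in the allowed set;
# the trailing " -" also keeps spaces and hyphens, per the expected output
df['text'] = df['text'].str.replace(r"(?i)[^0-9a-z!?.;,#' -]", '', regex=True)
print(df['text'].tolist())
One thing to watch: '#' is inside the keep set, so 'do##n' keeps its hashes (and removing '==' leaves a double space behind); drop '#' from the class and collapse whitespace afterwards if you want the exact expected output shown in the question.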

Related

Getting value between '-' in google sheets

I'm trying to get the number between '-' and '-' in Google Sheets, but after trying many things I still haven't been able to find the solution.
Data record 1
England Premier League
West Ham vs Crystal Palace
2.090 - 3.47 - 3.770
Expected value = 3.47
Data record 2
England League Two
Carlisle vs Scunthorpe
2.830 - 3.15 - 2.820
Expected value = 3.15
Hopefully someone can help me out.
Try either of the following
option 1.
=INDEX(IFERROR(REGEXEXTRACT(AE1:AE4," \d+\.\d+ ")*1))
option 2.
=INDEX(IFERROR(REGEXEXTRACT(AE1:AE4,".* - (\d+\.\d+) ")))
(Do adjust the formula according to your ranges and locale)
use:
=INDEX(IFNA(REGEXEXTRACT(A1:A, "- (\d+(?:\.\d+)?) -")*1))
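For comparison (and since the rest of this page is Python-centric), a rough Python sketch of the same middle-number extraction, using the sample record above:
import re

record = "2.090 - 3.47 - 3.770"

# capture the number between the two " - " separators
m = re.search(r"- (\d+(?:\.\d+)?) -", record)
if m:
    print(float(m.group(1)))  # 3.47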

Strange output with Python Lists

I have a dataframe: Outlet_results
it goes something like this
index Calendar year/Week Material Sellthru Qty
0 37.2013 ABC 2
1 38.2013 ABC 7
2 37.2013 BCG 22
3 39.2013 XYZ 5
Now, I wanted a separate list for the Materials and week for further coding.
I used this code for the material list
mat_outlet = list(set(outlet_result['Material']))
It works perfectly and gives me 3 values (ABC, BCG, XYZ)
However, the week list shows a faulty output even though the code is the same.
week_outlet_list = list(set(outlet_result['Calendar Year/Week']))
I am getting a list with 4 values
['38.2013', '37.2013', 'Calendar Year/Week', '39.2013']
Why is the string (header) included in the list? Please help me understand this concept.
I am using Python 2.7... does it have something to do with that?
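No answer is recorded here, but one plausible cause (an assumption, not something stated in the question) is that the header string exists as an ordinary data row, e.g. because several exports were concatenated so the header line repeats inside the file. A small sketch that reproduces the symptom:
import pandas as pd
from io import StringIO

# a stray repeated header line in the middle of the data
raw = StringIO("Calendar Year/Week,Material\n"
               "37.2013,ABC\n"
               "Calendar Year/Week,Material\n"
               "38.2013,XYZ\n")
outlet_result = pd.read_csv(raw)
print(sorted(set(outlet_result['Calendar Year/Week'])))
# ['37.2013', '38.2013', 'Calendar Year/Week']
If that is the cause, dropping the rows that equal the header (or fixing the source file) removes the stray value; the behaviour is not specific to Python 2.7.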

How to add multiple Sentences (which are stored in a list) into a pandas dataframe

I would like to create an aspect analysis from user reviews. The reviews contain various aspects and therefore the reviews need to be separated into sentences. I save the data in a pandas dataframe and separate the sentences with the nltk library.
I put the separated sentences in a list that I want to format into a dataframe and connect to the original dataframe. However, I get an error: instead of one extra column, I get 19 new columns (the individual sentences are not stored in one cell; I think every single sentence gets its own column). I also tested itertools, but I also get a wrong result.
Can someone help me to get the right format?
I would like to have a new dataframe which looks like that:
U_REVIEW | SENTENCES
-----------------------------------------------|-----------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |[u'Im a Sentence', u'Iam another Sentence in a Row.']
Here we go, next Sentence. Blub, more blubs. |[u"Here weg o, next Sentence.", u'Blub, more blubs.']
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|[u"Once again, more Sentence.", u'And some other information.', u'The Restaurant was ok, but not awesome.']
This is what my code looks like:
ta = ta[['U_REVIEW']]
Output:
U_REVIEW
Im a Sentence. Iam another Sentence in a Row.
Here we go, next Sentence. Blub, more blubs.
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.
# the empty lists
sentences = []
ss = []
for sentence in ta['U_REVIEW']:
    # separate the review into sentences
    sentence = sent_tokenize(sentence)
    sentences.append(sentence)
test = itertools.chain(sentences)
# new dataframe to add the sentences
df2 = pd.DataFrame(sentences)
# create column
cols2 = ['REVIEW_SENTENCES']
# bring the two dataframes together
df2 = pd.DataFrame(sentences, columns=cols2)
Output of sentences:
[[u'Im a Sentence', u'Iam another Sentence in a Row.'], [u"Here weg o, next Sentence.", u'Blub, more blubs.'], [u"Once again, more Sentence.", u'And some other information.', u'The Restaurant was ok, but not awesome.']]
Output of test:
<itertools.chain object at 0x000000001316DC18>
Output and Information of the new Dataframe df2:
AssertionError: 1 columns passed, passed data had 19 columns
U_REVIEW | 0 | 1 | 2 ...
-----------------------------------------------|--------------------------|-------------------------------|---------------------------------------
Im a Sentence. Iam another Sentence in a Row. |Im a Sentence |Iam another Sentence in a Row. |
Here we go, next Sentence. Blub, more blubs. |Here we go, next Sentence.|Blub, more blubs. |
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|Once again, more Sentence.|And some other information. |The Restaurant was ok, but not awesome.
Here is a test dataset for the dataframe:
import pandas as pd
ta = pd.DataFrame( ['Im a Sentence. Iam another Sentence in a Row','Here we go, next Sentence. Blub, more blubs.','Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'])
ta.columns =['U_REVIEW']
Try this. I have done it in Python 3.5; I think it should work for 2.5 also (note it also needs numpy imported as np):
In [45]: df = pd.DataFrame(ta.U_REVIEW.str.split('.',expand=True).replace('',np.nan).fillna(np.nan).values.flatten()).dropna()
In [46]: df
Out[46]:
0
0 Im a Sentence
1 Iam another Sentence in a Row
4 Here we go, next Sentence
5 Blub, more blubs
8 Once again, more Sentence
9 And some other information
10 The Restaurant was ok, but not awsome
Or is this what you want:
ta.U_REVIEW.str.split('.',expand=True)
Out[50]:
0 1 \
0 Im a Sentence Iam another Sentence in a Row
1 Here we go, next Sentence Blub, more blubs
2 Once again, more Sentence And some other information
2 3
0 None None
1 None
2 The Restaurant was ok, but not awsome
or
In [52]: ta.U_REVIEW.str.split('.').apply(list)
Out[52]:
0 [Im a Sentence, Iam another Sentence in a Row]
1 [Here we go, next Sentence, Blub, more blubs, ]
2 [Once again, more Sentence, And some other in...
Name: U_REVIEW, dtype: object
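What the question actually asks for (a single new column holding one list of sentences per review) can also be had directly with apply; a short sketch, assuming nltk's punkt data is installed:
import pandas as pd
from nltk.tokenize import sent_tokenize  # requires nltk's 'punkt' data

ta = pd.DataFrame(['Im a Sentence. Iam another Sentence in a Row',
                   'Here we go, next Sentence. Blub, more blubs.'],
                  columns=['U_REVIEW'])

# sent_tokenize returns a list per review; apply keeps each list in one cell
ta['SENTENCES'] = ta['U_REVIEW'].apply(sent_tokenize)
print(ta)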

spaCy to CoNLL format without using spaCy's sentence splitter

This post shows how to get dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # there's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence-splitter. I would like to use it, and then give spaCy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your sentences that are split, then do something like the following:
import spacy

nlp = spacy.load('en')

def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # each doc here is a single sentence, so it starts at doc[0]
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # there's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
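For example, with a hypothetical list of pre-split sentences (the values here are made up for illustration):
# sentences as produced by your own splitter (illustrative values)
my_sentences = [u'Bob bought the pizza to Alice', u'She thanked him.']
for s in my_sentences:
    printConll(s)
    print()  # blank line between sentences, as the CoNLL format expects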
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the developers at spaCy (as in the post) is to add the flexibility to define one's own sentence boundary detection rules. This problem is solved in conjunction with dependency parsing by spaCy, not before it. Therefore, I don't think what you're looking for is supported at all by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. Though there is a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the perfect way), but if such a sentencizer exists, then you can create a custom one using your own rules, and it will affect sentence boundaries for sure.
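As a sketch of that idea (assuming spaCy v2.x, where a pipeline component can be a plain function; the semicolon rule below is purely illustrative):
import spacy

nlp = spacy.load('en')

def custom_boundaries(doc):
    # illustrative rule: start a new sentence after every semicolon
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# run before the parser so the parser respects these boundaries
nlp.add_pipe(custom_boundaries, before='parser')

doc = nlp(u'Bob bought the pizza; Alice was pleased')
for sent in doc.sents:
    print(sent.text)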

Python: How can you recursively search a .txt file, find matches and print results

I have been searching for an answer to this, but cannot seem to get what I need. I would like a Python script that reads my text file, starts from the top, works its way through each line of the file, and then prints out all the matches in another txt file. The content of the text file is just 4-digit numbers, like 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file, as it is almost 10000 lines long. I appreciate any help. If someone could just point me in the right direction here, that would be great. Thank you.
import re

file = open("filename", 'r')
fileContent = file.read()
pattern = "1234"
print(len(re.findall(pattern, fileContent)))
If I were you, I would open the file and use the split method to create a list of all the numbers, then use the Counter class from collections to count how many of each number in the list are duplicates.
from collections import Counter

filepath = 'original_file'
new_filepath = 'new_file'

file = open(filepath, 'r')
text = file.read()
file.close()

numbers_list = text.split('\n')
dupes = [[item, ':matches found =', str(count)]
         for item, count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]

new_file = open(new_filepath, 'w')
for i in dupes:
    new_file.write(i + '\n')  # newline so each result is on its own line
new_file.close()
Thanks to everyone who helped me on this. Thank you to @csabinho for the code he provided and to @IanAuld for asking me "Why do you think you need recursion here?". It got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit combos were unique. So this is what I came up with, and it worked beautifully!
import re

# read the file once, up front, instead of re-opening it on every pass
file = open("4digits.txt", 'r')
fileContent = file.read()
file.close()

a = 999
while a < 9999:
    a = a + 1
    pattern = str(a)
    result = len(re.findall(pattern, fileContent))
    if result >= 1:
        print(a, "matches", result)
    else:
        print(a, "This number is unique!")
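One caveat worth flagging: re.findall(pattern, fileContent) counts raw substring hits, so if the file ever contained digit runs longer than four, 1234 would also be counted inside 12345. Anchoring the pattern with word boundaries avoids that; a tiny sketch:
import re

sample = "1234\n12345\n1234\n"
print(len(re.findall(r"1234", sample)))      # 3: also matches inside 12345
print(len(re.findall(r"\b1234\b", sample)))  # 2: whole numbers only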