I have the following data:
Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)
I am trying to split this into Q&A format like this:
Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ?
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)
This is one conversation with a unique ID. After the split, I would like each question and answer in separate columns, with each response matched to its question.
I tried the following:
for i in d.split(':'):
    if i:
        print(i.strip().split('.'))
The output is as follows:
['Rep']
['hi ! Customer']
['i was wondering if you have a delivery option? If so what are the options available ? Rep']
["i'd be happy to answer that for you! There is a 2 and 5 day delivery options", ' Customer']
['ok! thank you Rep']
['Is there anything else that I can help you with? (Chat ended)']
You can use a much simpler regex!!!
import re
p = re.compile(r'(\w*\s*:)')
input_string = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
new_string = p.sub(r'\n\g<1>', input_string)
for line in new_string.split('\n')[1:]:
    print(line)
Splitting with ':' is dangerous because the conversation itself may contain ':'.
You should have the names of the rep and the customer first so that you can search for their names followed by : in a regex pattern, with which you can use re.findall to parse the sample chat into:
[('Rep', 'hi !'), ('Customer', 'i was wondering if you have a delivery option? If so what are the options available ?'), ('Rep', "i'd be happy to answer that for you! There is a 2 and 5 day delivery options."), ('Customer', 'ok! thank you')]
Then use a loop to map the items into a dict data structure that you prefer:
import re
from pprint import pprint
def parse_chat(chat, rep, customer):
    conversation = {}
    rep_message = ''
    for person, message in re.findall(r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(rep, customer), chat):
        if person == rep:
            rep_message = message
        else:
            conversation[rep_message] = message
    return conversation
chat = '''Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)'''
pprint(parse_chat(chat, 'Rep', 'Customer'))
This outputs:
{'hi !': 'i was wondering if you have a delivery option? If so what are the options available ?',
"i'd be happy to answer that for you! There is a 2 and 5 day delivery options.": 'ok! thank you'}
Solution
Based on the assumption that there are only single, non-space-delimited words behind colons, the best approach would be to use regex to match the Customer and Rep strings before the colons, and then insert newlines so that the appropriate format is obtained.
The following is one working example:
import re
# The data has been stored into a string by this point
data = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# First insert the newlines before the first word before a colon
newlines = re.sub(r'(\S+)\s*:', r'\n\g<1>:', data)
# Remove the first newline and fix the (Chat ended) on the end
solution = re.sub(r'\(Chat ended\)', '\n(Chat ended)', newlines[1:])
print(solution)
> "Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ?
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)"
Explanation
The newlines = re.sub... line searches the data string for any run of non-space characters followed by a colon. It replaces each match with a \n character, then the run that \S+ matched (which could be Customer, Rep, Bill, etc.), then the colon.
Finally, assuming that all conversations end with (Chat ended), the next line matches only that text and moves it onto a new line in the same way.
The output is a string, but if you need it to be anything else, you can split it based on '\n' and do what you have to do after that.
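As a minimal sketch of that last step, the formatted string can be split into (speaker, message) pairs and then into question/answer rows. The helper names to_turns and to_rows are made up for this illustration, and the pairing rule (each Customer message is matched with the next Rep message) is an assumption about the desired columns, not something the question spells out. It also assumes every spoken line starts with a speaker label.

```python
formatted = """Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ?
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)"""


def to_turns(text):
    """Split each line at its first colon into (speaker, message)."""
    turns = []
    for line in text.split("\n"):
        speaker, sep, message = line.partition(":")
        if sep:  # lines without a colon, such as "(Chat ended)", are skipped
            turns.append((speaker.strip(), message.strip()))
    return turns


def to_rows(turns, asker="Customer"):
    """Pair each message from `asker` with the next message from anyone else."""
    rows, pending = [], None
    for speaker, message in turns:
        if speaker == asker:
            pending = message
        elif pending is not None:
            rows.append({"question": pending, "answer": message})
            pending = None
    return rows


turns = to_turns(formatted)
rows = to_rows(turns)
for row in rows:
    print(row)
```

A list of dicts like rows can be handed to most tabular tools directly, with "question" and "answer" becoming the two columns.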
So you basically want to identify where you want to insert newlines - as such you could try a couple different patterns, if it's always "customer" and "rep":
(?<!^)(Customer:|Rep:|\(Chat ended)
We just check that we are not at start of the string and then match the constant tokens by OR'ing them together. Or more generically,
(?<=\s)([A-Z]\w+:|\(Chat ended)
We look back to see a space (we're not at start of string) and then match CapitalizedWord+COLON or ending sequence, then insert newlines before each match.
substitution for both:
\n$0
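In Python, the same substitution might look like the sketch below. The only change is the replacement syntax: Python's re module writes the whole-match reference as \g<0> instead of $0. The variable names are chosen for this example only.

```python
import re

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer that for you! "
        "There is a 2 and 5 day delivery options. Customer: ok! thank you "
        "Rep: Is there anything else that I can help you with? (Chat ended)")

# Generic pattern from above: a capitalized word plus colon, or the closing
# marker, but only when preceded by whitespace (so not at the start of the string).
pattern = re.compile(r"(?<=\s)([A-Z]\w+:|\(Chat ended)")

# Insert a newline before every match; \g<0> stands for the whole match.
formatted = pattern.sub(r"\n\g<0>", chat)

# The space that preceded each match is left at the end of the previous line,
# so strip each line after splitting.
lines = [line.strip() for line in formatted.split("\n")]
for line in lines:
    print(line)
```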
Related
In an effort to make our budgeting life a bit easier and to help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column but I want to see it all tabulated as one "Fast Food " expense.
For instance the Csv is setup like this:
Date Description Debit Credit
1/20/20 POS PIN BLAH BLAH ### 1.75 NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I ultimately would like to have it read off of a list? I would like to group all my expenses into categories and check those category variable names so that it would only output from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
import re
import pandas as pd

df = pd.DataFrame({"description": ['Macdonald something', 'Whataburger something', 'pizza hut something',
                                   'Whataburger something', 'Macdonald something', 'Macdonald otherthing'],
                   "debit": [1.75, 2.0, 3.5, 4.5, 1.5, 2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
# Extract the matched keyword into a new column, then total the debits per keyword
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})', flags=re.I)
print(df.groupby("found")[["debit"]].sum())
#
debit
found
Macdonald 5.25
Whataburger 6.50
pizza hut 3.50
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape will protect special characters and they will be parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex
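To extend the dynamic pattern to several categories, one option is a dict that maps each category name to its keyword list. The sketch below uses plain Python rather than pandas so that it runs on its own. The "Groceries" category, its keywords, and the sample rows are invented for illustration.

```python
import re
from collections import defaultdict

# Category name -> keywords; "Groceries" and its keywords are hypothetical.
categories = {
    "Fast Food": ["Macdonald", "Whataburger", "pizza hut"],
    "Groceries": ["heb", "kroger"],
}

# Made-up (description, debit) rows standing in for the bank export.
rows = [
    ("POS PIN WHATABURGER 123", 1.75),
    ("Pizza Hut online order", 12.00),
    ("KROGER #456", 30.25),
    ("City water utility", 40.00),
]


def build_pattern(keywords):
    """Whole-word, case-insensitive alternation of the escaped keywords."""
    joined = "|".join(map(re.escape, keywords))
    return re.compile(r"\b(?:{})\b".format(joined), re.I)


patterns = {name: build_pattern(words) for name, words in categories.items()}


def categorize(description):
    """Return the first category whose pattern matches, else 'Uncategorized'."""
    for name, pattern in patterns.items():
        if pattern.search(description):
            return name
    return "Uncategorized"


totals = defaultdict(float)
for description, debit in rows:
    totals[categorize(description)] += debit

print(dict(totals))
```

With a DataFrame, the same categorize function could be applied to the Description column to create a category column for groupby.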
I have a column in a data frame, for example df:
A
0 Good to 1. Good communication EI : tathagata.kar#ae.com
1 SAP ECC Project System EI: ram.vaddadi#ae.com
2 EI : ravikumar.swarna Role:SSE Minimum Skill
I have a list of strings
ls=['tathagata.kar#ae.com','a.kar#ae.com']
Now if i want to filter out
for i in range(len(ls)):
    df1=df[df['A'].str.contains(ls[i])]
    if len(df1.columns!=0):
        print ls[i]
I get the output
tathagata.kar#ae.com
a.kar#ae.com
But I need only tathagata.kar#ae.com
How Can It be achieved?
As you can see, I've tried str.contains, but I need something for an exact match.
You could simply use ==
string_a == string_b
It should return True if the two strings are equal. But this does not solve your issue.
Edit 2: You should use len(df1.index) instead of len(df1.columns). Indeed, len(df1.columns) will give you the number of columns, and not the number of rows.
Edit 3: After reading your second post, I've understood your problem. The solution you propose could lead to some errors.
For instance, if you have:
ls=['tathagata.kar#ae.com','a.kar#ae.com', 'tathagata.kar#ae.co']
the first and the third element will match str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
And this is an unwanted behaviour.
You could add a check on the end of the string: str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')
Like this:
for i in range(len(ls)):
    df1 = df[df['A'].str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')]
    if len(df1.index) != 0:
        print (ls[i])
(Remove the parentheses in the "print" if you use Python 2.7)
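One further point: the dots in the addresses are regex wildcards unless they are escaped. A minimal sketch of the same boundary check with re.escape added, using plain strings instead of a DataFrame so it runs on its own (the helper name exact_pattern is made up for this example):

```python
import re

# The three rows from the question, as plain strings.
rows = [
    "Good to 1. Good communication EI : tathagata.kar#ae.com",
    "SAP ECC Project System EI: ram.vaddadi#ae.com",
    "EI : ravikumar.swarna Role:SSE Minimum Skill",
]
candidates = ["tathagata.kar#ae.com", "a.kar#ae.com", "tathagata.kar#ae.co"]


def exact_pattern(address):
    """Same prefix/suffix checks as above, with the address taken literally."""
    return re.compile(r"(?:\s|^|Ei:|EI:|EI-)" + re.escape(address) + r"(?:\s|$)")


found = [c for c in candidates
         if any(exact_pattern(c).search(row) for row in rows)]
print(found)
```

The compiled pattern's .pattern string could be passed to str.contains in the loop above in the same way.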
Thanks for the help. But seems like I found a solution that is working as of now.
Must use str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
This seems to solve the problem.
Thanks to @IsaacDj for his help, though.
Why not just use:
df1 = df[df['A'].str.match(ls[i])]
It's the equivalent of re.match, which anchors the pattern at the start of each string.
Given a set of words ["college", "sports", "coding"], and a set of paragraphs of text (i.e. facebook posts), how can I see for each word the paragraphs that are related to that topic?
So for college, how can I find all the paragraphs of text that may be about the topic college?
I'm new to natural language processing, and not very advanced at regex. Clues about how to get started, what the right terms to google, etc are appreciated.
One basic idea would be to iterate over your posts and see if any post matches any of the topics.
Let's say we have the following posts:
Post 1:
Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.
Post 2:
Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.
Post 3:
Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.
and the following topics:
["college", "sports", "coding"]
The regex could be: (topicName)+
E.g.: (college)+ or (sports)+ or (coding)+
Small pseudocode:
for every topicName
    for every post
        var customRegex = new RegExp('(' + topicName + ')+');
        if customRegex.test(post) then
            //post matches topicName
        else
            //post doesn't match topicName
        endif
    endfor
endfor
Hope it could give you a starting point.
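A runnable Python version of the pseudocode above might look like this sketch. Two details are additions not present in the pseudocode: the match is case-insensitive (so "Sports" counts for "sports"), and \b word boundaries keep a topic from matching inside a longer word.

```python
import re

posts = {
    "Post 1": "Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.",
    "Post 2": "Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.",
    "Post 3": "Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.",
}
topics = ["college", "sports", "coding"]

# topic -> names of the posts that mention it
matches = {}
for topic in topics:
    pattern = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)
    matches[topic] = [name for name, text in posts.items() if pattern.search(text)]

for topic, names in matches.items():
    print(topic, names)
```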
Exact string matching won't take you far, especially with small fragments of text. I suggest using semantic similarity for this. A simple web search will turn up several implementations.
thanks for the follow :)
hii... if u want to make a new friend just add me on facebook! :) xx
Just wanna say if you ever feel lonely or sad or bored, just come and talk to me. I'm free anytime :)
I hope she not a spy for someone. I hope she real on neautral side. Because just her who i trust. :-)
not always but sometimes maybe :)
\u201c Funny how you get what you want and pray for when you want the same thing God wants. :)
Thank you :) can you follow me on Twitter so I can DM you?
RT dj got us a fallin in love and yeah earth number one m\u00fcsic listen thank you king :-)
found a cheeky weekend for \u00a380 return that's flights + hotel.. middle of april, im still looking pal :)
RT happy birthday mary ! Hope you have a good day :)
Thank god twitters not blocked on the school computers cause all my data is gone on my phone :(
enjoy tmrro. saw them earlier this wk here in tokyo :)
UPDATE:
Oki, maybe my question was wrong. I have to do this:
Open file and read from it
Remove some links, names and stuff from it (I have used regex, but don't know if it is the right way to do it)
After I get clean text (only tweets with a sad face or happy face) I have to go through each line, because I have to loop over each one like this:
for line in tweets:
    if '' in line:
        cl.train(line,'happy')
    elif '' in line:
        cl.train(line,'sad')
My code so far you see here, but it doesn't work yet.
import re
from pprint import pprint
tweets = []
tweets = open('englishtweet.txt').read()
regex_username = '@[^\s]*' # regex to detect username in file
regex_url = 'http[^\s]*' # regex to detect url in file
regex_names = '#[^\s]*' # regex to detect # in file
for username in re.findall(regex_username, tweets):
    tweets = tweets.replace(username, '')
for url in re.findall(regex_url, tweets):
    tweets = tweets.replace(url, '')
for names in re.findall(regex_names, tweets):
    tweets = tweets.replace(names, '')
If you want to read the first line, use next
with open("englishtweet.txt","r") as infile:
    print(next(infile).strip())
    # this prints the first line only, and consumes the first value from the
    # generator, so this:
    for line in infile:
        print(line.strip())
    # will print every line BUT the first (since the first has been consumed)
I'm also using a context manager here, which closes the file automatically when you exit the with block, so you don't have to remember to call tweets.close(). It also closes the file if an error occurs; without it, an exception raised partway through could keep you from ever reaching the .close() call.
If your file is very small, you could use .readlines:
with open("englishtweet.txt","r") as infile:
    tweets = infile.readlines()
# tweets is now a list, each element is a separate line from the file
print(tweets[0]) # so element 0 is the first line
for line in tweets[1:]: # the rest of the lines:
    print(line.strip())
However, reading a whole file into memory is not usually recommended: with large files it can waste a lot of memory, especially if you only need the first line.
That said, since it looks like you may be using these for more than just one iteration, maybe readlines IS the best approach
You almost have it. Just remove the .read() when you originally open the file. Then you can loop through the lines.
tweets = open('englishtweet.txt','r')
for line in tweets:
    print(line)
tweets.close()
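Putting the cleaning and the per-line loop together, a sketch might look like the following. Several things here are assumptions: the emoticons tested for (":)", ":-)", ":(", ":-(") are guessed from the sample tweets, the third sample line is invented to show URL and hashtag removal, and a plain list of (text, label) pairs stands in for the classifier's cl.train call. In practice the lines would come from iterating over the open file as shown above.

```python
import re

# Two lines from the sample data plus one made-up line with a URL and hashtag.
raw_lines = [
    "thanks for the follow :)",
    "Thank god twitters not blocked on the school computers cause all my data is gone on my phone :(",
    "RT happy birthday mary ! Hope you have a good day :) http://example.com/x #bday",
]

# Usernames (@name), URLs and hashtags (#tag) are removed in one pass.
NOISE = re.compile(r"@\S+|http\S+|#\S+")


def clean(line):
    """Remove the noise tokens and collapse the leftover whitespace."""
    return " ".join(NOISE.sub("", line).split())


def label(line):
    """Return 'happy', 'sad', or None when no known emoticon is present."""
    if ":)" in line or ":-)" in line:
        return "happy"
    if ":(" in line or ":-(" in line:
        return "sad"
    return None


training = []
for line in raw_lines:
    text = clean(line)
    mood = label(text)
    if mood is not None:
        training.append((text, mood))  # cl.train(text, mood) would go here

for pair in training:
    print(pair)
```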
I'm about to break this down into two operations since I can't seem to figure out the regular expression to do it in one. However, I thought I would ask the brain trust here to see if anyone can do it (which I'm sure someone can).
Essentially I have a string containing a recipients field from an email in Exchange. I want to parse it out into individual recipients. I don't need to validate emails or anything. Essentially the data is comma separated except if the comma is in between a set of quotes. That's the part that's messing me up.
Right now I'm using: (?"[^"\r\n]*")
Which gives me the quoted names, and ([a-zA-Z0-9_-.]+)#(([[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.)|(([a-zA-Z0-9-]+.)+))([a-zA-Z]{2,4}|[0-9]{1,3})
which gives me the email addresses
Here's what I have..
Data:
"George Washington" <gwashington#government.net>, "Abraham Lincoln" <alincoln#government.net>, "Carter, Jimmy" <jimmy.carter#presidents.com>, "Nixon, Richard M." <tricky.dick#presidents.com>
What I'd like to get back is this:
"George Washington" <gwashington#government.net>
"Abraham Lincoln" <alincoln#government.net>
"Carter, Jimmy" <jimmy.carter#presidents.com>
"Nixon, Richard M." <tricky.dick#presidents.com>
I don't know enough about Exchange to write a pattern that will match every possible recipient entry.
But based on the example you posted, I can give you this:
["][^"]+["][^",]+(?=[,]?)
This matches all of the entries that you posted.
And now a simple example of how to use it in C#:
var input = "\"George Washington\" <gwashington#government.net>, \"Abraham Lincoln\" <alincoln#government.net>, \"Carter, Jimmy\" <jimmy.carter#presidents.com>, \"Nixon, Richard M.\" <tricky.dick#presidents.com>";
var pattern = "[\"][^\"]+[\"][^\",]+(?=[,]?)";
var items = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(s => s.Value)
    .ToList();
If there is an input text for which this pattern doesn't work, please post the input here.
Regex.Matches(input, "\"[^\"]*\"\\s<[^>]*>");
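For comparison, the original requirement (split on commas, except commas inside quotes) can also be expressed as a single split. The sketch below is in Python rather than C#, and the name OUTSIDE_QUOTES is made up for this example. It assumes the quotes in the input are balanced and never escaped.

```python
import re

recipients = ('"George Washington" <gwashington#government.net>, '
              '"Abraham Lincoln" <alincoln#government.net>, '
              '"Carter, Jimmy" <jimmy.carter#presidents.com>, '
              '"Nixon, Richard M." <tricky.dick#presidents.com>')

# Split on a comma only when an even number of double quotes follows it,
# i.e. when the comma is not inside a quoted display name.
OUTSIDE_QUOTES = re.compile(r',\s*(?=(?:[^"]*"[^"]*")*[^"]*$)')

entries = OUTSIDE_QUOTES.split(recipients)
for entry in entries:
    print(entry)
```

The lookahead rescans the rest of the string at every comma, which is fine for a recipients field but would be slow on very long inputs.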