Extract specific text within content - regex

I am working on email content that is dumped into a data frame as a single value in one row. It has to be split into separate emails, each starting at a "From:" and running until the next "From:".
Sample data:
"______________________________________________
From: Kumar M,
Sent: Tuesday, 21 October 2014 7:30 AM
To: Deo, Ravinesh; G S, Venkatesh;
Cc: Monteleone, Elif; Kabyanga, Isaac
Subject: FW: Please Approve the Qlik Access.
Hi Ravi,
We will work on the providing David access to Ql and an email will be sent out once the access is set up.
Regards,
Santhosh
______________________________________________
From: Deo, Ravinesh
Sent: Tuesday, 21 October 2014 7:20 AM
To: Kabyanga, Isaac; Kumar M, Santhosh
Cc: Monteleone, Elif
Subject: FW: Please Approve the Qlikview Access.
Hi Isaac/Santhosh,
Appreciate if you can grant access to David Dennis for GPA – Timor.
David is CEO Timor Leste.
Thanks
Ravi
_____________________________________________
From: Dennis, David (Timor)
Sent: Tuesday, 21 October 2014 11:34 AM
To: Deo, Ravinesh
Subject: FW: Please Approve the Q GPA Access.
Here you go - appreciate your help Rgds
______________________________________________
From: Dennis, David (Timor)
Sent: Thursday, 9 October 2014 11:33 AM
To: Buchanan, Geoffrey (Solomon Islands)
Subject: Please Approve the Qlikview Access.
Hello,
Can you please review the attached form and click ' Manager Approval' to approve.
Thanks"
I have referred to a related answer and used the code below:
ex <- gsub("^[from:](.*?)[from:]$", "",impordata$Problem.Description[i] )
but this returns all the emails in that row!
Desired Output:
1
From: Kumar M,
Sent: Tuesday, 21 October 2014 7:30 AM
To: Deo, Ravinesh; G S, Venkatesh;
Cc: Monteleone, Elif; Kabyanga, Isaac
Subject: FW: Please Approve the Qlik Access.
Hi Ravi,
We will work on the providing David access to Ql and an email will be sent out once the access is set up.
Regards,
Santhosh
[2]
From: Deo, Ravinesh
Sent: Tuesday, 21 October 2014 7:20 AM
To: Kabyanga, Isaac; Kumar M, Santhosh
Cc: Monteleone, Elif
Subject: FW: Please Approve the Qlikview Access.
Hi Isaac/Santhosh,
Appreciate if you can grant access to David Dennis for GPA – Timor.
David is CEO Timor Leste.
Thanks
Ravi
[3]
From: Dennis, David (Timor)
Sent: Tuesday, 21 October 2014 11:34 AM
To: Deo, Ravinesh
Subject: FW: Please Approve the Q GPA Access.
Here you go - appreciate your help Rgds
[4]
From: Dennis, David (Timor)
Sent: Thursday, 9 October 2014 11:33 AM
To: Buchanan, Geoffrey (Solomon Islands)
Subject: Please Approve the Qlikview Access.
Hello,
Can you please review the attached form and click ' Manager Approval' to approve.
Thanks"
I also tried regmatches:
#Converted a row as vector to apply regmatches
vec <- as.vector(impordata$Problem.Description[1])
matc <-regmatches(vec, gregexpr("(^[from:]).*?($[from:])", vec, perl = TRUE))
This did not help either.
Can somebody correct it or provide some help?

You could use the strsplit function.
> strsplit(gsub("(?s)^_+\\s+", "", x, perl=T) , "_+\\s*(?=From:)", perl=T)[[1]]
[1] "From: Kumar M, \nSent: Tuesday, 21 October 2014 7:30 AM\nTo: Deo, Ravinesh; G S, Venkatesh;\nCc: Monteleone, Elif; Kabyanga, Isaac\nSubject: FW: Please Approve the Qlik Access.\n\n\nHi Ravi,\n\nWe will work on the providing David access to Ql and an email will be sent out once the access is set up. \n\nRegards,\nSanthosh\n\n"
[2] "From: Deo, Ravinesh \nSent: Tuesday, 21 October 2014 7:20 AM\nTo: Kabyanga, Isaac; Kumar M, Santhosh\nCc: Monteleone, Elif\nSubject: FW: Please Approve the Qlikview Access.\n\nHi Isaac/Santhosh,\n\nAppreciate if you can grant access to David Dennis for GPA – Timor.\n\nDavid is CEO Timor Leste.\n\nThanks\nRavi\n\n"
[3] "From: Dennis, David (Timor) \nSent: Tuesday, 21 October 2014 11:34 AM\nTo: Deo, Ravinesh\nSubject: FW: Please Approve the Q GPA Access.\n\nHere you go - appreciate your help Rgds\n\n"
[4] "From: Dennis, David (Timor) \nSent: Thursday, 9 October 2014 11:33 AM\nTo: Buchanan, Geoffrey (Solomon Islands)\nSubject: Please Approve the Qlikview Access.\n\nHello,\n\nCan you please review the attached form and click ' Manager Approval' to approve.\n\nThanks"

Take a look at strsplit:
splits <- strsplit(paste0(vec, collapse = "\n"), "_{45}")[[1]][-1]
cat(splits)
cat(splits[1])
# _
# From: Kumar M,
# Sent: Tuesday, 21 October 2014 7:30 AM
# To: Deo, Ravinesh; G S, Venkatesh;
# Cc: Monteleone, Elif; Kabyanga, Isaac
# Subject: FW: Please Approve the Qlik Access.
#
#
# Hi Ravi,
#
# We will work on the providing David access to Ql and an email will be sent out once the access is set up.
#
# Regards,
# Santhosh

Related

Regex Pattern for Dates in String

Need help debugging Regex
I have a string column in a pandas data frame that contains dates formatted as follows, with only one such date in each string.
Semicolons are used only to delimit the dates here and are not present in the actual strings.
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
My job is to extract these using regex. Here is the pattern I came up with.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"
sample_series.str.extract(my_pattern, expand=False)
So far, it works for every date except the format "Jan 27, 1983": it matches the month name and the day, but not the year. I am relatively new to regex, and I suspect my pattern design is poor. I need help figuring out what is wrong with my regex and how I could debug or improve it. Thanks.
Here is the sample data to make the problem reproducible.
sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
From your data :
>>> import pandas as pd
>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
0
0 .Got back to U.S. Jan 27, 1983.\n
1 .On 21 Oct 1983 patient was discharged from Sc...
2 4-13-89 Communication with referring physician...
3 7intake for follow up treatment at Anson Gener...
4 . Pt diagnosed in Apr 1976 after he presented...
5 1-14-81 Communication with referring physician...
6 . Went to Emerson, in Newfane Alaska. Started ...
7 09/14/2000 CPT Code: 90792: With medical servi...
8 . Sep 2015- Transferred to Memorial Hospital f...
9 Born and raised in Fowlerville, IN. Parents d...
We can use a tool called datefinder to find the date in each row :
>>> import datefinder
>>> def find_date(df):
...     return [match for match in datefinder.find_dates(df[0])]
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
0 Vals
0 .Got back to U.S. Jan 27, 1983.\n [1983-01-27 00:00:00]
1 .On 21 Oct 1983 patient was discharged from Sc... [1983-10-21 00:00:00]
2 4-13-89 Communication with referring physician... [1989-04-13 00:00:00]
3 7intake for follow up treatment at Anson Gener... []
4 . Pt diagnosed in Apr 1976 after he presented... [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5 1-14-81 Communication with referring physician... [1981-01-14 00:00:00]
6 . Went to Emerson, in Newfane Alaska. Started ... [2002-09-30 00:00:00]
7 09/14/2000 CPT Code: 90792: With medical servi... [2000-09-14 00:00:00]
8 . Sep 2015- Transferred to Memorial Hospital f... [2015-09-30 00:00:00]
9 Born and raised in Fowlerville, IN. Parents d... [2003-09-30 00:00:00]
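As for why the original pattern loses the year in "Jan 27, 1983": the leading month-or-number group is optional, and the separator class `[, -./]` also matches the ". " that precedes "Jan" (the ` -.` inside the class is a range from space to period). The engine therefore finds an earlier match that starts at the period, treats "Jan" as the middle element and takes "27" as the year. The sketch below reproduces this and shows one possible fix, a lookahead that forces the match to start on a letter or digit. The names `fixed_pattern` and `text` are illustrative, and the fix has not been checked against every format listed in the question.

```python
import re

# The pattern from the question, unchanged.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"

# Assumed fix: require the match to begin on a letter or digit, so a
# leading separator such as ". " can no longer start the match.
fixed_pattern = r"(?=[A-Za-z0-9])(?:" + my_pattern + r")"

text = ".Got back to U.S. Jan 27, 1983.\n"

original_match = re.search(my_pattern, text).group(0)
fixed_match = re.search(fixed_pattern, text).group(0)

print(repr(original_match))  # '. Jan 27'
print(repr(fixed_match))     # 'Jan 27, 1983'
```

Because the wrapper is a non-capturing group, the capture group numbering of the original pattern is unchanged, so it should still work with `str.extract`.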

Perl one liner using substitute with a regex

I have a file that looks like this :
7th Aug 2020 10:18:35 am Bill Smith:
NW: RE: Matt Reid - EUC23284 - INC1020721599
7th Aug 2020 10:22:02 am Bill Smith:
VK: RE: don't think we send the price, pls help check what happened - INC1020721668
7th Aug 2020 11:00:06 am Bill Smith:
*mailbox handover*
7th Aug 2020 11:06:04 am Tom Jones:
BJ - RE: Megan Holleran Unmatched Trader Trades 08/06/2020 17:35 [Restricted - External] INC1020722335
7th Aug 2020 11:07:37 am Tom Jones:
DS - RE: All summit books missing from multiple reports in ICE INC1020722348
7th Aug 2020 12:36:10 pm Tom Jones:
NW - confirm trade receipt for Jon Lett from GFI ID: 1922979 INC1020723352
And I want it to look like this :
7th Aug 2020 10:18:35 am Bill Smith: NW: RE: Matt Reid - EUC23284 - INC1020721599
7th Aug 2020 10:22:02 am Bill Smith: VK: RE: don't think we send the price, pls help check what happened - INC1020721668
7th Aug 2020 11:00:06 am Bill Smith: *mailbox handover*
7th Aug 2020 11:06:04 am Tom Jones: BJ - RE: Megan Holleran Unmatched Trader Trades 08/06/2020 17:35 [Restricted - External] INC1020722335
7th Aug 2020 11:07:37 am Tom Jones: DS - RE: All summit books missing from multiple reports in ICE INC1020722348
7th Aug 2020 12:36:10 pm Tom Jones: NW - confirm trade receipt for Jon Lett from GFI ID: 1922979 INC1020723352
So I ran this over the file. The goal is to remove the newline from each line that ends with a person's name followed by a colon; in this case I want to change "Bill Smith:\n" and "Tom Jones:\n" to "Bill Smith: " and "Tom Jones: ". As the output of the one-liner shows, the replacement does not work.
cat incfile | perl -p -e 's/\w+\s\w+\:\n/\w+\s\w+\:/g'
7th Aug 2020 10:18:35 am w+sw+:NW: RE: Matt Reid - EUC23284 - INC1020721599
7th Aug 2020 10:22:02 am w+sw+:VK: RE: don't think we send the price, pls help check what happened - INC1020721668
7th Aug 2020 11:00:06 am w+sw+:*mailbox handover*
7th Aug 2020 11:06:04 am w+sw+:BJ - RE: Megan Holleran Unmatched Trader Trades 08/06/2020 17:35 [Restricted - External] INC1020722335
7th Aug 2020 11:07:37 am w+sw+:DS - RE: All summit books missing from multiple reports in ICE INC1020722348
7th Aug 2020 12:36:10 pm w+sw+:NW - confirm trade receipt for Jon Lett from GFI ID: 1922979 INC1020723352
You were going for
perl -pe's/(\w+\s\w+:)\n/$1 /'
The substring matched by the first capture (()) is assigned to $1, which you can use in the replacement expression.
The above can be simplified/optimized to
perl -pe's/\w+\s\w+:\K\n/ /'
What matched before \K is "kept" (not replaced), so only the line feed is replaced (with a space).
Alternatively, you could simply replace the line feeds of odd-numbered lines.
perl -pe's/\n/ / if $. % 2'

Extract email address from multi lines with line break

I have a list which contains name, email address, location, date and time, etc.
From the list, I'd like to extract only name and email address.
The original text representation is like,
Email address: abc103#gmail.com
City/town: Hills, United States
Last access: Saturday, 6 January 2018, 8:46 PM (17 secs)
So, in the Python list, it shows up as below.
import re
lst = [['name1', 'Email address: abc103#gmail.com\nCity/town: Hills , United States\nLast access: Saturday, 6 January 2018, 8:46 PM (17 secs)'], ['name2', 'Email address: cde123#example.com\nCity/town: San Francisco, United States\nLast access: Saturday, 6 January 2018, 8:46 PM (48 secs)'], ['name3', 'Email address: nnn9#something.com\nCity/town: Fremont, United States\nLast access: Saturday, 6 January 2018, 8:43 PM (3 mins 21 secs)'], ['name4', 'City/town: Tenafly, United States\nLast access: Saturday, 6 January 2018, 8:36 PM (10 mins 14 secs)'],... list goes on.
for i in range(0, len(lst)):
    extract = re.findall(r'(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', lst[i][1],re.MULTILINE)
    lst[i][1] = extract
print(lst)
However, the output looks like this:
[['name1', []], ['name2', []], ['name3', []], ....
What's wrong with my regex?
How do I apply re.findall to multi-line with line breaks?
This worked for me:
import re
lst = [['name1', 'Email address: abc103#gmail.com\nCity/town: Hills , United States\nLast access: Saturday, 6 January 2018, 8:46 PM (17 secs)'], ['name2', 'Email address: cde123#example.com\nCity/town: San Francisco, United States\nLast access: Saturday, 6 January 2018, 8:46 PM (48 secs)'], ['name3', 'Email address: nnn9#something.com\nCity/town: Fremont, United States\nLast access: Saturday, 6 January 2018, 8:43 PM (3 mins 21 secs)'], ['name4', 'City/town: Tenafly, United States\nLast access: Saturday, 6 January 2018, 8:36 PM (10 mins 14 secs)']]
#lst[0][1].findall('([a-zA-Z][a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=True)
for i in range(0, len(lst)):
    extract = re.findall(r'([a-zA-Z][a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', lst[i][1],re.MULTILINE)
    lst[i][1] = extract
print(lst)
The output:
[['name1', ['abc103#gmail.com']], ['name2', ['cde123#example.com']], ['name3', ['nnn9#something.com']], ['name4', []]]
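The original attempt returns empty lists because, with `re.MULTILINE`, `^` and `$` tie the pattern to a whole line, and every line begins with a label such as "Email address: " that the character class cannot match (it contains no space or colon). Dropping the anchors fixes that. An alternative sketch, assuming the label is always exactly "Email address:", is to anchor on the label and capture what follows. The names `anchored`, `labelled` and `text` are made up for this example.

```python
import re

text = ("Email address: abc103#gmail.com\n"
        "City/town: Hills, United States\n"
        "Last access: Saturday, 6 January 2018, 8:46 PM (17 secs)")

# The question's pattern: the whole line would have to be the address.
anchored = r'^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'

# Assumed alternative: match the label, then capture the rest of the line.
labelled = r'^Email address:\s*(\S+)$'

no_hits = re.findall(anchored, text, re.MULTILINE)
hits = re.findall(labelled, text, re.MULTILINE)

print(no_hits)  # []
print(hits)     # ['abc103#gmail.com']
```

For an entry with no "Email address:" line, such as name4, `re.findall` simply returns an empty list.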

How can I read certain elements from a text file?

I have a text file that has multiple sets of book information (title, author, etc.). I need to use a loop to read from the file and assign each piece of info to a corresponding string. It gets through the entire file, but the fields get mixed up along the way.
Book readOne(ifstream &fin) {
string titleOne;
getline(fin, titleOne, ',');
string firstOne;
getline(fin, firstOne, ',');
string lastOne;
getline(fin, lastOne, ',');
string formatOne;
getline(fin, formatOne, ',');
string pubDateOne;
getline(fin, pubDateOne, ',');
string priceOne;
getline(fin, priceOne);
Here is the text file:
Gone With the Wind, Margaret Mitchell, Hardcover, 1936, 17.49
The Adventures of Sherlock Holmes, Arthur Doyle, Paperback, 1892, 6.85
The Illustrated A Brief History of Time, Stephen Hawking, Hardcover, 1996, 9.59
Frankenstein, Mary Shelley, Paperback, 1818, 7.99
Command Authority, Tom Clancy, Paperback, 2013, 15.99
Origin, Dan Brown, Ebook, 2017, 14.99
The Lost Order, Steve Berry, Audiobook, 2017, 5.95
The Hunt for Red October, Tom Clacy, Audiobook, 1984, 7.00
Patriot Games, Tom Clancy, Audiobook, 1987, 22.50
The 14th Colony, Steve Berry, Paperback, 2016, 9.99
The Bishop's Pawn, Steve Berry, Ebook, 2018, 14.99
Pride and Prejudice, Jane Austen, Ebook, 1813, 8.99
Sense and Sensibility, Jane Austen, Hardcover, 1811, 19.99
Wuthering Heights, Emily Bronte, Paperback, 1847, 6.99
Jane Eyre, Charlotte Bronte, Hardcover, 1847, 10.95
Anna Karenina, Leo Tolstoy, Paperback, 1877, 5.99
Sahara, Clive Cussler, Ebook, 1992, 5.99
The Notebook, Nicholas Sparks, Hardcover, 1996, 12.59
A Walk to Remember, Nicholas Sparks, Ebook, 1999, 7.99
See Me, Nicholas Sparks, Ebook, 2015, 7.99
The Last Song, Nicholas Sparks, Paperback, 2009, 5.99
The Wedding, Nicholas Sparks, Ebook, 2003, 7.99
My thinking was that it would read until a comma, assign that piece of info to the string, then continue. Instead, it outputs as if it didn't see certain commas.
Gone With the Wind by Margaret Mitchell Hardcover on 1936. Published on 17.49
The Adventures of Sherlock Holmes. It costs $0
The Illustrated A Brief History of Time by Stephen Hawking Hardcover on 1996. Published on 9.59
Frankenstein. It costs $0
Command Authority by Tom Clancy Paperback on 2013. Published on 15.99
Origin. It costs $0
The Lost Order by Steve Berry Audiobook on 2017. Published on 5.95
The Hunt for Red October. It costs $0
Patriot Games by Tom Clancy Audiobook on 1987. Published on 22.50
The 14th Colony. It costs $0
The Bishop's Pawn by Steve Berry Ebook on 2018. Published on 14.99
Pride and Prejudice. It costs $0
Sense and Sensibility by Jane Austen Hardcover on 1811. Published on 19.99
Wuthering Heights. It costs $0
Jane Eyre by Charlotte Bronte Hardcover on 1847. Published on 10.95
Anna Karenina. It costs $0
Sahara by Clive Cussler Ebook on 1992. Published on 5.99
The Notebook. It costs $0
A Walk to Remember by Nicholas Sparks Ebook on 1999. Published on 7.99
See Me. It costs $0
The Last Song by Nicholas Sparks Paperback on 2009. Published on 5.99
The Wedding. It costs $0
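This is simply a CSV (comma-separated values) file, and there are ample code samples for reading one. Note that each line has only five fields (title, author, format, year, price) separated by four commas; the author's first and last names are separated by a space, not a comma. Reading five comma-delimited fields therefore runs past the end of the line, so the fifth getline swallows the price together with the next line's title. Refer to this sample.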

javascript regular expression: how do I find date without year or date with year<2010

I need to find date without year, or date with year<2010.
basically,
Feb 15
Feb 20
Feb 20, 2009
Feb 20, 1995
should be accepted
Feb 20, 2010
Feb 20, 2011
should be rejected
How do I do it?
Thanks,
Cheng
Try this:
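^(Jan|Feb|Mar...Dec)\s\d{1,2}(,\s(1[0-9]{3}|200[0-9]))?$
Note: Expand the month list with proper names. I was too lazy to spell it all out. The year part is optional so that dates without a year are accepted, and the ^ and $ anchors stop a date with a later year from matching on its "Feb 20" prefix alone.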
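A sketch of the full check, written in Python for convenience (the regex syntax used here is the same in JavaScript). It assumes three-letter month abbreviations and that each string holds only the date. The year is made optional so that "Feb 15" is accepted, and the anchors prevent a year of 2010 or later from being silently ignored. The names `MONTHS`, `date_re` and `samples` are illustrative.

```python
import re

# Assumed month spelling: three-letter abbreviations only.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Month, day, then an optional ", YYYY" limited to 1000-2009.
date_re = re.compile(
    r"^(?:" + MONTHS + r")\s\d{1,2}(?:,\s(?:1\d{3}|200\d))?$"
)

samples = [
    "Feb 15", "Feb 20", "Feb 20, 2009", "Feb 20, 1995",
    "Feb 20, 2010", "Feb 20, 2011",
]

accepted = [s for s in samples if date_re.match(s)]
print(accepted)  # ['Feb 15', 'Feb 20', 'Feb 20, 2009', 'Feb 20, 1995']
```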