Extract email address from multi lines with line break - regex

I have a list which contains name, email address, location, date and time, etc.
From the list, I'd like to extract only name and email address.
The original text representation is like,
Email address: abc103#gmail.com
City/town: Hills, United States
Last access: Saturday, 6 January 2018, 8:46 PM (17 secs)
So, in the Python list, it shows up like below.
import re
lst = [['name1', 'Email address: abc103#gmail.com\nCity/town: Hills , United States\nLast access: Saturday, 6 January 2018, 8:46 PM (17 secs)'], ['name2', 'Email address: cde123#example.com\nCity/town: San Francisco, United States\nLast access: Saturday, 6 January 2018, 8:46 PM (48 secs)'], ['name3', 'Email address: nnn9#something.com\nCity/town: Fremont, United States\nLast access: Saturday, 6 January 2018, 8:43 PM (3 mins 21 secs)'], ['name4', 'City/town: Tenafly, United States\nLast access: Saturday, 6 January 2018, 8:36 PM (10 mins 14 secs)'],... list goes on.
for i in range(0, len(lst)):
    extract = re.findall(r'(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', lst[i][1], re.MULTILINE)
    lst[i][1] = extract
print(lst)
However, the output is like,
[['name1', []], ['name2', []], ['name3', []], ....
What's wrong with my regex?
How do I apply re.findall to multi-line with line breaks?

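Your pattern is anchored with ^ and $. With re.MULTILINE those match at the start and end of each line, so the pattern only matches a line that consists entirely of an email address. Every line in your data starts with a label such as "Email address: ", so nothing matches. Dropping the anchors worked for me: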
import re
lst = [['name1', 'Email address: abc103#gmail.com\nCity/town: Hills , United States\nLast access: Saturday, 6 January 2018, 8:46 PM (17 secs)'], ['name2', 'Email address: cde123#example.com\nCity/town: San Francisco, United States\nLast access: Saturday, 6 January 2018, 8:46 PM (48 secs)'], ['name3', 'Email address: nnn9#something.com\nCity/town: Fremont, United States\nLast access: Saturday, 6 January 2018, 8:43 PM (3 mins 21 secs)'], ['name4', 'City/town: Tenafly, United States\nLast access: Saturday, 6 January 2018, 8:36 PM (10 mins 14 secs)']]
#lst[0][1].findall('([a-zA-Z][a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', expand=True)
for i in range(0, len(lst)):
    extract = re.findall(r'([a-zA-Z][a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.][a-zA-Z]+)', lst[i][1], re.MULTILINE)
    lst[i][1] = extract
print(lst)
The output:
[['name1', ['abc103#gmail.com']], ['name2', ['cde123#example.com']], ['name3', ['nnn9#something.com']], ['name4', []]]
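If the anchors are worth keeping, another option is to anchor on the label instead of on the address. The following is a minimal sketch, assuming every address sits on a line that starts with "Email address:" (as in the sample data) and that a missing address should become None; the names EMAIL_LINE and extract_email are made up for this illustration.

```python
import re

records = [
    ['name1', 'Email address: abc103#gmail.com\nCity/town: Hills , United States'],
    ['name4', 'City/town: Tenafly, United States'],
]

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so the label and the address must together fill one whole line.
EMAIL_LINE = re.compile(r'^Email address:\s*(\S+)$', re.MULTILINE)

def extract_email(text):
    """Return the address on the 'Email address:' line, or None if there is no such line."""
    match = EMAIL_LINE.search(text)
    return match.group(1) if match else None

result = [[name, extract_email(details)] for name, details in records]
print(result)  # [['name1', 'abc103#gmail.com'], ['name4', None]]
```

This does not validate the address itself; it simply takes whatever non-space text follows the label.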

Related

Regex Pattern for Dates in String

Need help debugging Regex
I have a string column in pandas data frame that contains dates formatted as follows. And there is only one such date in each string.
Semicolons are only used to delimit the dates here and are not present in the actual strings.
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
My job is to extract these using regex. Here is the pattern I came up with.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"
sample_series.str.extract(my_pattern, expand=False)
So far, it works for every date except the format "Jan 27, 1983": it matches the month name and the day, but not the year. I am relatively new to regex and I suspect my pattern design is poor too. I need help figuring out what's wrong with my regular expression and how I could debug or improve it. Thanks.
Here is the sample data to make the problem reproducible.
sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
From your data:
>>> import pandas as pd
>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
0
0 .Got back to U.S. Jan 27, 1983.\n
1 .On 21 Oct 1983 patient was discharged from Sc...
2 4-13-89 Communication with referring physician...
3 7intake for follow up treatment at Anson Gener...
4 . Pt diagnosed in Apr 1976 after he presented...
5 1-14-81 Communication with referring physician...
6 . Went to Emerson, in Newfane Alaska. Started ...
7 09/14/2000 CPT Code: 90792: With medical servi...
8 . Sep 2015- Transferred to Memorial Hospital f...
9 Born and raised in Fowlerville, IN. Parents d...
We can use a tool called datefinder to find the dates in each row:
>>> import datefinder
>>> def find_date(row):
...     # apply(..., axis=1) passes one row at a time; column 0 holds the text
...     return [match for match in datefinder.find_dates(row[0])]
...
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
0 Vals
0 .Got back to U.S. Jan 27, 1983.\n [1983-01-27 00:00:00]
1 .On 21 Oct 1983 patient was discharged from Sc... [1983-10-21 00:00:00]
2 4-13-89 Communication with referring physician... [1989-04-13 00:00:00]
3 7intake for follow up treatment at Anson Gener... []
4 . Pt diagnosed in Apr 1976 after he presented... [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5 1-14-81 Communication with referring physician... [1981-01-14 00:00:00]
6 . Went to Emerson, in Newfane Alaska. Started ... [2002-09-30 00:00:00]
7 09/14/2000 CPT Code: 90792: With medical servi... [2000-09-14 00:00:00]
8 . Sep 2015- Transferred to Memorial Hospital f... [2015-09-30 00:00:00]
9 Born and raised in Fowlerville, IN. Parents d... [2003-09-30 00:00:00]
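As for why the original pattern loses the year on "Jan 27, 1983": the first group is optional and the separator class also accepts "." and " ", so the leftmost match can start at the period before "Jan" and then treat "27" as the year. The sketch below reproduces this with the standard re module on the first sample string only. The lookahead tweak at the end is an assumption for illustration, checked against this one string and not against the other formats.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# The pattern from the question, rebuilt from pieces for readability.
my_pattern = (
    r"((?:(\d{0,2}\d)|(" + MONTHS + r")[a-z]*?)?[, -./]{0,2}"
    r"(?:(\d{1,2})[dhnst]{0,2}|(" + MONTHS + r")[a-z]*?)[, -./]{1,2}"
    r"(\d{2,4}))|(\d{4})"
)

text = ".Got back to U.S. Jan 27, 1983.\n"

m = re.search(my_pattern, text)
print(repr(m.group(0)))  # '. Jan 27'  (match starts at the period, not at "Jan")
print(m.group(6))        # 27          (the day lands in the year group)

# Hypothetical tweak: require the match to start on a letter or digit.
anchored = r"(?=[A-Za-z0-9])(?:" + my_pattern + ")"
fixed = re.search(anchored, text)
print(fixed.group(0))    # Jan 27, 1983
```

Printing group(0) like this for each failing string is a quick way to see where a match really starts.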

How to identify a list of dates with a specific day value between 2 dates in powershell

So I have this code (thanks to #reeeky2001) that answers my other post, "Trying to find a way to list all friday dates between 2 dates in powershell":
$FiscalStart = [datetime]'2019-03-31'
$date2 = Get-Date -Hour 0 -Minute 0 -Second 0
$EvaluateDates = 1..($date2 - $FiscalStart).Days | % {$($FiscalStart).AddDays($_)} | ? {$_.DayOfWeek -eq 'monday'}
$EvaluateDates
Let's say today is Feb 09, 2020. If I run this code, the last Monday returned will be Feb 03, 2020. How could I exclude that last date, since the next Monday (Feb 10, 2020) is still in the future? More generally, how could I exclude any date whose following Monday would be in the future, assuming this code could be run on any day of the week?
At the end I added "-and $_ -lt (get-date)"
$FiscalStart = [datetime]'2019-12-31'
$date2 = Get-Date -Hour 0 -Minute 0 -Second 0
1..($date2 - $FiscalStart).Days | % {($FiscalStart).AddDays($_)} |
? { $_.DayOfWeek -eq 'monday' -and $_ -lt (get-date)}
Monday, January 6, 2020 12:00:00 AM
Monday, January 13, 2020 12:00:00 AM
Monday, January 20, 2020 12:00:00 AM
Monday, January 27, 2020 12:00:00 AM
Monday, February 3, 2020 12:00:00 AM

display each day's date python from today

I was able to display the week that starts every Saturday by:
today = now().date()
sat_offset = (today.weekday() - 5) % 7
week_start = today - datetime.timedelta(days=sat_offset)
This gives the start of the week (the last Saturday), but how would I show the date of each following day as well? So if the week starting Oct. 27, 2018 is displayed, it should say:
Saturday: Oct. 27, 2018
Sunday: Oct. 28, 2018
Monday: Oct. 29, 2018
Tuesday: Oct. 30, 2018
Wednesday: Oct. 31, 2018
Thursday: Nov. 01, 2018
Friday: Nov. 02, 2018
Thank you for your help.
You can iterate through the days of the week using range and timedelta like so:
for i in range(7):
    # offset from the Saturday in week_start; i = 0 is the Saturday itself
    day = week_start + datetime.timedelta(days=i)
    print(day.strftime("%A: %b. %d, %Y"))
This will produce dates like:
Saturday: Oct. 27, 2018
Sunday: Oct. 28, 2018
Monday: Oct. 29, 2018
Tuesday: Oct. 30, 2018
Wednesday: Oct. 31, 2018
Thursday: Nov. 01, 2018
Friday: Nov. 02, 2018
You can format the string however you want. Here is some info on dates in Python.
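Putting the question's Saturday offset and the loop together, a self-contained sketch might look like the following. The function name week_dates and the fixed example date are assumptions for illustration; the original uses now().date() for today.

```python
import datetime

def week_dates(today):
    """Return the seven dates of the Saturday-to-Friday week containing `today`."""
    sat_offset = (today.weekday() - 5) % 7  # days since the most recent Saturday
    week_start = today - datetime.timedelta(days=sat_offset)
    return [week_start + datetime.timedelta(days=i) for i in range(7)]

# Tuesday, Oct. 30, 2018 falls in the week that starts on Saturday, Oct. 27, 2018.
week = week_dates(datetime.date(2018, 10, 30))
for day in week:
    print(day.strftime("%A: %b. %d, %Y"))
```

Returning a list keeps the date arithmetic separate from the formatting, so the same dates can be rendered with any strftime pattern.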

Splitting a column that contains multiple date formats

I have a csv file that contains a column with multiple date formats. I need to split them and get the extracted result in the same format.
Wednesday 12 August 2015
Wednesday 12 August 2015
Friday April 1 2016
Friday April 1 2016
5/12/2016
5/12/2016
This is the file, and I want the dates in the mm/dd/yy format. My code is as follows:
import re
import csv
import pandas as pd
#delimiters = " ", "/"
#f = open('merged_34.csv')
df = pd.read_csv('test3.csv')
newDate = []  # collects one cleaned date string per row
for item in df['serverDatePrettyFirstAction']:
    if '/' in item:
        # already numeric, keep as-is
        newDate.append(item)
    else:
        # drop the leading weekday name
        item = item.split(' ', 1)[1]
        newDate.append(item)
df['newDate'] = newDate
df.to_csv('D:/Python/10.36.202.64/newfile.csv', index = False)
And this is what i get:
serverDatePrettyFirstAction newDate
Wednesday 12 August 2015 12-Aug-15
Wednesday 12 August 2015 12-Aug-15
Friday April 1 2016 April 1 2016
Friday April 1 2016 April 1 2016
5/12/2016 5/12/2016
5/12/2016 5/12/2016
Also, is there a way to overwrite the values in the same column itself?
A faster approach would be to use the pandas function to_datetime():
In [2]: df
Out[2]:
Date
0 Wednesday 12 August 2015
1 Wednesday 12 August 2015
2 Friday April 1 2016
3 Friday April 1 2016
4 5/12/2016
5 5/12/2016
In [6]: df['newDate'] = pd.to_datetime(df['Date'])
Result:
In [7]: df
Out[7]:
Date newDate
0 Wednesday 12 August 2015 2015-08-12
1 Wednesday 12 August 2015 2015-08-12
2 Friday April 1 2016 2016-04-01
3 Friday April 1 2016 2016-04-01
4 5/12/2016 2016-05-12
5 5/12/2016 2016-05-12
You can use the third-party dateutil library as long as your data is not too big (after all, it guesses the format for every value).
import pandas as pd
from dateutil import parser
df = pd.read_csv('test3.csv')
df['newDate'] = df['serverDatePrettyFirstAction'].apply(parser.parse)
df.to_csv('newfile.csv', index=False, date_format='%Y-%m-%d ')
To overwrite the values in the same column, use:
df['serverDatePrettyFirstAction']=df['serverDatePrettyFirstAction'].apply(parser.parse)
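If the set of formats is known in advance, a standard-library alternative is to try each format explicitly and emit the requested mm/dd/yy string. This is a rough sketch, not the approach from the answers above: the format list is an assumption based on the three sample shapes, "5/12/2016" is assumed to mean month/day/year, and to_mmddyy is a hypothetical helper name.

```python
from datetime import datetime

# Assumed formats, one per shape seen in the sample column.
KNOWN_FORMATS = ("%A %d %B %Y", "%A %B %d %Y", "%m/%d/%Y")

def to_mmddyy(text):
    """Parse `text` with the first format that fits and return it as mm/dd/yy."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%m/%d/%y")
        except ValueError:
            continue  # this format does not fit, try the next one
    raise ValueError("unrecognised date format: %r" % text)

samples = ["Wednesday 12 August 2015", "Friday April 1 2016", "5/12/2016"]
converted = [to_mmddyy(s) for s in samples]
print(converted)  # ['08/12/15', '04/01/16', '05/12/16']
```

With pandas, the same function could be passed to apply on the column, for example df['serverDatePrettyFirstAction'].apply(to_mmddyy). Unlike a guessing parser, it fails loudly on a format that is not listed.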

How do I get coupon payment dates for a simple fixed bond using quantlib, quantlib-swig and python

I am trying to learn QuantLib (1.3) and its Python bindings using QuantLib-SWIG (1.2) on Ubuntu 13.04. As a starter, I am trying to determine the payment dates for a very simple bond, given below, using the 30/360 European day counter:
from QuantLib import *
faceValue = 100.0
doi = Date(31, August, 2000)
dom = Date(31, August, 2008)
coupons = [0.05]
dayCounter = Thirty360(Thirty360.European)
schedule = Schedule(doi, dom, Period(Semiannual),
                    India(),
                    Unadjusted, Unadjusted,
                    DateGeneration.Backward, False)
Following are my questions:
Which method of schedule object will give me the payment dates?
Where do I need to specify the dayCounter object so that the dates are appropriately calculated?
Following Dimitri Reiswich's presentation, I tried mimicking the C++ code, but schedule.dates() raises an error because no such method exists.
The payment dates for this Fixed Rate bond are, (obtained by using oocalc)
Feb 28, 2001; Aug 31, 2001
Feb 28, 2002; Aug 31, 2002
Feb 28, 2003; Aug 31, 2003
Feb 29, 2004; Aug 31, 2004
Feb 28, 2005; Aug 31, 2005
Feb 28, 2006; Aug 31, 2006
Feb 28, 2007; Aug 31, 2007
Feb 29, 2008; Aug 31, 2008
How do I get the payment dates for this simple bond using python & quantlib? Can someone please help?
regards
K
If you want to look at the schedule you just generated, you can iterate over it:
>>> for d in schedule: print(d)
...
August 31st, 2000
February 28th, 2001
August 31st, 2001
February 28th, 2002
August 31st, 2002
February 28th, 2003
August 31st, 2003
February 29th, 2004
August 31st, 2004
February 28th, 2005
August 31st, 2005
February 28th, 2006
August 31st, 2006
February 28th, 2007
August 31st, 2007
February 29th, 2008
August 31st, 2008
or call list(schedule) if you want to store them. However, are you sure that those are the payment dates? They are the start and end date for accrual calculation; but some of these fall on a Saturday or a Sunday, and the bond will be paying on the next business day. You can see the effect if you instantiate the bond and retrieve the coupons:
>>> settlement_days = 3
>>> bond = FixedRateBond(settlement_days, faceValue, schedule, coupons, dayCounter)
>>> for c in bond.cashflows():
...     print(c.date())
...
February 28th, 2001
August 31st, 2001
February 28th, 2002
September 2nd, 2002
February 28th, 2003
September 1st, 2003
March 1st, 2004
August 31st, 2004
February 28th, 2005
August 31st, 2005
February 28th, 2006
August 31st, 2006
February 28th, 2007
August 31st, 2007
February 29th, 2008
September 1st, 2008
September 1st, 2008
(that is, unless Saturdays and Sundays shouldn't be holidays for the Indian calendar. If you think they shouldn't, file a bug report with QuantLib).
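The shift from accrual dates to payment dates can be illustrated without QuantLib. The sketch below is not QuantLib's implementation: it only rolls Saturdays and Sundays forward to the next weekday (in the spirit of a "following" business-day adjustment) and ignores the holidays a real calendar such as India() would also skip. The helper name roll_to_weekday is made up for this example.

```python
import datetime

def roll_to_weekday(d):
    """Move a Saturday or Sunday forward to the next Monday (weekends only, no holidays)."""
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += datetime.timedelta(days=1)
    return d

# Three accrual end dates taken from the schedule above.
accrual_ends = [
    datetime.date(2002, 8, 31),  # a Saturday
    datetime.date(2004, 2, 29),  # a Sunday
    datetime.date(2008, 2, 29),  # a Friday, left unchanged
]
paid_on = [roll_to_weekday(d) for d in accrual_ends]
print(paid_on)
```

For these three dates the results agree with the cash-flow dates printed above (September 2nd, 2002; March 1st, 2004; February 29th, 2008).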