Perl one liner using substitute with a regex

I have a file that looks like this :
7th Aug 2020 10:18:35 am Bill Smith:
NW: RE: Matt Reid - EUC23284 - INC1020721599
7th Aug 2020 10:22:02 am Bill Smith:
VK: RE: don't think we send the price, pls help check what happened - INC1020721668
7th Aug 2020 11:00:06 am Bill Smith:
*mailbox handover*
7th Aug 2020 11:06:04 am Tom Jones:
BJ - RE: Megan Holleran Unmatched Trader Trades 08/06/2020 17:35 [Restricted - External] INC1020722335
7th Aug 2020 11:07:37 am Tom Jones:
DS - RE: All summit books missing from multiple reports in ICE INC1020722348
7th Aug 2020 12:36:10 pm Tom Jones:
NW - confirm trade receipt for Jon Lett from GFI ID: 1922979 INC1020723352
And I want it to look like this :
7th Aug 2020 10:18:35 am Bill Smith: NW: RE: Matt Reid - EUC23284 - INC1020721599
7th Aug 2020 10:22:02 am Bill Smith: VK: RE: don't think we send the price, pls help check what happened - INC1020721668
7th Aug 2020 11:00:06 am Bill Smith: *mailbox handover*
7th Aug 2020 11:06:04 am Tom Jones: BJ - RE: Megan Holleran Unmatched Trader Trades 08/06/2020 17:35 [Restricted - External] INC1020722335
7th Aug 2020 11:07:37 am Tom Jones: DS - RE: All summit books missing from multiple reports in ICE INC1020722348
7th Aug 2020 12:36:10 pm Tom Jones: NW - confirm trade receipt for Jon Lett from GFI ID: 1922979 INC1020723352
So I run this over the file; the goal is to remove the newline from each line that ends with the person's name followed by a colon. In this case I want to change "Bill Smith:\n" and "Tom Jones:\n" to "Bill Smith: " and "Tom Jones: ". As the output of the one-liner shows, the replacement does not work.
cat incfile | perl -p -e 's/\w+\s\w+\:\n/\w+\s\w+\:/g'
7th Aug 2020 10:18:35 am w+sw+:NW: RE: Matt Reid - EUC23284 - INC1020721599
7th Aug 2020 10:22:02 am w+sw+:VK: RE: don't think we send the price, pls help check what happened - INC1020721668
7th Aug 2020 11:00:06 am w+sw+:*mailbox handover*
7th Aug 2020 11:06:04 am w+sw+:BJ - RE: Megan Holleran Unmatched Trader Trades 08/06/2020 17:35 [Restricted - External] INC1020722335
7th Aug 2020 11:07:37 am w+sw+:DS - RE: All summit books missing from multiple reports in ICE INC1020722348
7th Aug 2020 12:36:10 pm w+sw+:NW - confirm trade receipt for Jon Lett from GFI ID: 1922979 INC1020723352

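The replacement half of s/// is an interpolated string, not a pattern, so \w+\s\w+\: is inserted literally (Perl drops the meaningless backslashes, giving w+sw+:). You were going for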
perl -pe's/(\w+\s\w+:)\n/$1 /'
The substring matched by the first capture (()) is assigned to $1, which you can use in the replacement expression.
The above can be simplified/optimized to
perl -pe's/\w+\s\w+:\K\n/ /'
What matched before \K is "kept" (not replaced), so only the line feed is replaced (with a space).
Alternatively, you could simply replace the line feeds of odd-numbered lines.
perl -pe's/\n/ / if $. % 2'
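For comparison, the capture-group version above can be sketched in Python. This is only an illustration of the same idea, not part of the Perl answer; the function name join_name_lines and the shortened sample text are made up for the example.

```python
import re


def join_name_lines(text: str) -> str:
    """Replace the newline after a trailing "First Last:" with a space."""
    # (\w+\s\w+:) captures the name and colon; \1 puts it back unchanged.
    return re.sub(r"(\w+\s\w+:)\n", r"\1 ", text)


sample = (
    "7th Aug 2020 10:18:35 am Bill Smith:\n"
    "NW: RE: Matt Reid - EUC23284 - INC1020721599\n"
    "7th Aug 2020 11:00:06 am Bill Smith:\n"
    "*mailbox handover*\n"
)

print(join_name_lines(sample), end="")
```

As in the Perl version, the captured text has to be referenced explicitly in the replacement (\1 here, $1 in Perl); writing the pattern again on the replacement side just inserts it as literal text.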

Related

Regex Pattern for Dates in String

Need help debugging Regex
I have a string column in a pandas data frame that contains dates formatted as follows. There is only one such date in each string.
(Semicolons are only used to delimit the dates here and are not present in the actual strings.)
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
My job is to extract these using regex. Here is the pattern I came up with.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"
sample_series.str.extract(my_pattern, expand=False)
So far it works for every date except the format "Jan 27, 1983": it matches the month name and the day, but not the year. I am relatively new to regex and I think my pattern design is quite bad too. I need help figuring out what's wrong with my regular expression and how I could debug or improve it. Thanks.
Here is the sample data to make the problem reproducible.
sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
From your data :
>>> import pandas as pd
>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
0
0 .Got back to U.S. Jan 27, 1983.\n
1 .On 21 Oct 1983 patient was discharged from Sc...
2 4-13-89 Communication with referring physician...
3 7intake for follow up treatment at Anson Gener...
4 . Pt diagnosed in Apr 1976 after he presented...
5 1-14-81 Communication with referring physician...
6 . Went to Emerson, in Newfane Alaska. Started ...
7 09/14/2000 CPT Code: 90792: With medical servi...
8 . Sep 2015- Transferred to Memorial Hospital f...
9 Born and raised in Fowlerville, IN. Parents d...
We can use a tool called datefinder to find the date in each row :
>>> import datefinder
>>> def find_date(row):
...     # With axis=1 the function receives one row; row[0] is the text in column 0.
...     return list(datefinder.find_dates(row[0]))
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
0 Vals
0 .Got back to U.S. Jan 27, 1983.\n [1983-01-27 00:00:00]
1 .On 21 Oct 1983 patient was discharged from Sc... [1983-10-21 00:00:00]
2 4-13-89 Communication with referring physician... [1989-04-13 00:00:00]
3 7intake for follow up treatment at Anson Gener... []
4 . Pt diagnosed in Apr 1976 after he presented... [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5 1-14-81 Communication with referring physician... [1981-01-14 00:00:00]
6 . Went to Emerson, in Newfane Alaska. Started ... [2002-09-30 00:00:00]
7 09/14/2000 CPT Code: 90792: With medical servi... [2000-09-14 00:00:00]
8 . Sep 2015- Transferred to Memorial Hospital f... [2015-09-30 00:00:00]
9 Born and raised in Fowlerville, IN. Parents d... [2003-09-30 00:00:00]
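On the regex itself: as far as I can tell, the year goes missing because the leading component is optional while the separator after it can still match on its own. On the first sample the match therefore starts at the ". " before "Jan", takes "Jan" as the middle component and "27" as the "year". The sketch below reproduces that and shows one possible adjustment, where the optional leading component only matches together with its separator. The names MONTH, SEP, revised and first_date are mine, and the revised pattern is a rough illustration checked only against a few of the sample strings, not a complete date parser.

```python
import re

# The pattern from the question, unchanged (only split across lines).
original = re.compile(
    r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?"
    r"[, -./]{0,2}"
    r"(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)"
    r"[, -./]{1,2}(\d{2,4}))|(\d{4})"
)

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
# The hyphen comes first so it is a literal; in "[, -./]" the " -." part is a range.
SEP = r"[-,./ ]{1,2}"

revised = re.compile("".join([
    r"(?:(?:\d{1,3}|", MONTH, r")", SEP, r")?",    # optional leading part, with its separator
    r"(?:\d{1,2}(?:st|nd|rd|th)?|", MONTH, r")",   # day (optionally ordinal) or month
    SEP, r"\d{2,4}",                               # separator and year
    r"|\d{4}",                                     # or a bare four-digit year
]))


def first_date(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


text = ".Got back to U.S. Jan 27, 1983.\n"
print(repr(first_date(original, text)))  # '. Jan 27'
print(repr(first_date(revised, text)))   # 'Jan 27, 1983'
```

Printing match.group(0) for a failing row, as first_date does, is a quick way to debug this kind of pattern: it shows where the match actually started, which is often earlier than expected.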

All CSV values in column 0 are strings

For some reason a csv file I wrote (win7) with Python has all the values as a single string in column 0, and I cannot perform any operations on them.
It has no labels.
The format is (I would like to keep the last value - date - as a date format):
"Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """
EDIT - When I read it with the csv module it prints it out like:
['Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0," date: Feb 04, 2016\t\t\t"']
What is the best way to convert the strings into comma separated values like this?
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0, date:, Feb 04, 2016
Thanks a lot.
s="Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """
print(s)
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0, date: Feb 04, 2016
To add a comma after "date:" you need to add some logic (like replacing ":" with ":,", or inserting it after the first word, etc.).
First, your date field is quoted, which is ok (and needed) because there is a comma inside:
" date: Feb 04, 2016 "
But then the whole line also gets quoted (and thus seen as a single field). And because there are already quotes around the date field, those get escaped with another quote:
"Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """
So, if you remove that last quoting, everything should be fine (but you might want to trim the date field):
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0," date: Feb 04, 2016 "
If you want it exactly like this, you need another comma after date: :
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0, date:,"Feb 04, 2016"
On the other hand, it would be better to use a header instead:
Name,Name2,Ave,Max,Min,analist disp,date
Rob,Avanti,12.83,4.0,-21.9,-1.0,"Feb 04, 2016"
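If the file cannot be rewritten, the double quoting described above can also be undone when reading, by running the csv parser twice: once to peel off the outer quotes and once more on the single field that comes out. A minimal sketch, assuming every line has the same doubly quoted shape as the example; unwrap_row and parse_trailing_date are hypothetical helper names, and %b assumes English month abbreviations in the current locale.

```python
import csv
from datetime import datetime

raw_line = (
    '"Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,'
    '"" date: Feb 04, 2016 """'
)


def unwrap_row(line):
    """Parse a line that was quoted as a whole, then parse its content again."""
    outer = next(csv.reader([line]))      # a single field holding the real row
    inner = next(csv.reader([outer[0]]))  # the actual comma separated fields
    return [field.strip() for field in inner]


def parse_trailing_date(field):
    """Turn 'date: Feb 04, 2016' into a datetime.date."""
    _label, _sep, value = field.partition(":")
    return datetime.strptime(value.strip(), "%b %d, %Y").date()


fields = unwrap_row(raw_line)
print(fields)
print(parse_trailing_date(fields[-1]))
```

The strip() calls also take care of the tabs and spaces around the date field that show up in the csv module output in the question.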

Extract specific text within content

I am working on email content that is dumped as a single dataframe value in one row. It has to be split into separate emails, each running from one "From:" until the next "From:".
Sample data:
"______________________________________________
From: Kumar M,
Sent: Tuesday, 21 October 2014 7:30 AM
To: Deo, Ravinesh; G S, Venkatesh;
Cc: Monteleone, Elif; Kabyanga, Isaac
Subject: FW: Please Approve the Qlik Access.
Hi Ravi,
We will work on the providing David access to Ql and an email will be sent out once the access is set up.
Regards,
Santhosh
______________________________________________
From: Deo, Ravinesh
Sent: Tuesday, 21 October 2014 7:20 AM
To: Kabyanga, Isaac; Kumar M, Santhosh
Cc: Monteleone, Elif
Subject: FW: Please Approve the Qlikview Access.
Hi Isaac/Santhosh,
Appreciate if you can grant access to David Dennis for GPA – Timor.
David is CEO Timor Leste.
Thanks
Ravi
_____________________________________________
From: Dennis, David (Timor)
Sent: Tuesday, 21 October 2014 11:34 AM
To: Deo, Ravinesh
Subject: FW: Please Approve the Q GPA Access.
Here you go - appreciate your help Rgds
______________________________________________
From: Dennis, David (Timor)
Sent: Thursday, 9 October 2014 11:33 AM
To: Buchanan, Geoffrey (Solomon Islands)
Subject: Please Approve the Qlikview Access.
Hello,
Can you please review the attached form and click ' Manager Approval' to approve.
Thanks"
I have referred here and used the code below
ex <- gsub("^[from:](.*?)[from:]$", "",impordata$Problem.Description[i] )
but this returns all the mails in a single row!
Desired Output:
[1]
From: Kumar M,
Sent: Tuesday, 21 October 2014 7:30 AM
To: Deo, Ravinesh; G S, Venkatesh;
Cc: Monteleone, Elif; Kabyanga, Isaac
Subject: FW: Please Approve the Qlik Access.
Hi Ravi,
We will work on the providing David access to Ql and an email will be sent out once the access is set up.
Regards,
Santhosh
[2]
From: Deo, Ravinesh
Sent: Tuesday, 21 October 2014 7:20 AM
To: Kabyanga, Isaac; Kumar M, Santhosh
Cc: Monteleone, Elif
Subject: FW: Please Approve the Qlikview Access.
Hi Isaac/Santhosh,
Appreciate if you can grant access to David Dennis for GPA – Timor.
David is CEO Timor Leste.
Thanks
Ravi
[3]
From: Dennis, David (Timor)
Sent: Tuesday, 21 October 2014 11:34 AM
To: Deo, Ravinesh
Subject: FW: Please Approve the Q GPA Access.
Here you go - appreciate your help Rgds
[4]
From: Dennis, David (Timor)
Sent: Thursday, 9 October 2014 11:33 AM
To: Buchanan, Geoffrey (Solomon Islands)
Subject: Please Approve the Qlikview Access.
Hello,
Can you please review the attached form and click ' Manager Approval' to approve.
Thanks"
I also tried regmatches:
#Converted a row as vector to apply regmatches
vec <- as.vector(impordata$Problem.Description[1])
matc <-regmatches(vec, gregexpr("(^[from:]).*?($[from:])", vec, perl = TRUE))
This did not help either.
Can somebody correct it or provide some help?
You could use the strsplit function.
> strsplit(gsub("(?s)^_+\\s+", "", x, perl=T) , "_+\\s*(?=From:)", perl=T)[[1]]
[1] "From: Kumar M, \nSent: Tuesday, 21 October 2014 7:30 AM\nTo: Deo, Ravinesh; G S, Venkatesh;\nCc: Monteleone, Elif; Kabyanga, Isaac\nSubject: FW: Please Approve the Qlik Access.\n\n\nHi Ravi,\n\nWe will work on the providing David access to Ql and an email will be sent out once the access is set up. \n\nRegards,\nSanthosh\n\n"
[2] "From: Deo, Ravinesh \nSent: Tuesday, 21 October 2014 7:20 AM\nTo: Kabyanga, Isaac; Kumar M, Santhosh\nCc: Monteleone, Elif\nSubject: FW: Please Approve the Qlikview Access.\n\nHi Isaac/Santhosh,\n\nAppreciate if you can grant access to David Dennis for GPA – Timor.\n\nDavid is CEO Timor Leste.\n\nThanks\nRavi\n\n"
[3] "From: Dennis, David (Timor) \nSent: Tuesday, 21 October 2014 11:34 AM\nTo: Deo, Ravinesh\nSubject: FW: Please Approve the Q GPA Access.\n\nHere you go - appreciate your help Rgds\n\n"
[4] "From: Dennis, David (Timor) \nSent: Thursday, 9 October 2014 11:33 AM\nTo: Buchanan, Geoffrey (Solomon Islands)\nSubject: Please Approve the Qlikview Access.\n\nHello,\n\nCan you please review the attached form and click ' Manager Approval' to approve.\n\nThanks"
Take a look at strsplit:
splits <- strsplit(paste0(vec, collapse = "\n"), "_{45}")[[1]][-1]
cat(splits)
cat(splits[1])
# _
# From: Kumar M,
# Sent: Tuesday, 21 October 2014 7:30 AM
# To: Deo, Ravinesh; G S, Venkatesh;
# Cc: Monteleone, Elif; Kabyanga, Isaac
# Subject: FW: Please Approve the Qlik Access.
#
#
# Hi Ravi,
#
# We will work on the providing David access to Ql and an email will be sent out once the access is set up.
#
# Regards,
# Santhosh
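A note on why the original attempts match nothing useful: [from:] is a character class, so it matches one single character out of f, r, o, m and :, not the word "from:" (and the headers are spelled "From:" with a capital F anyway). The splitting approach from the answers can be sketched in Python as below. The threshold of ten underscores and the name split_emails are assumptions for the example, and the sample thread is shortened.

```python
import re


def split_emails(text):
    """Split a dumped thread on the underscore rules that precede each From: line."""
    # The lookahead (?=From:) keeps "From:" at the start of the following piece.
    parts = re.split(r"_{10,}\s*(?=From:)", text)
    return [part.strip() for part in parts if part.strip()]


thread = (
    "______________________________________________\n"
    "From: Kumar M,\n"
    "Sent: Tuesday, 21 October 2014 7:30 AM\n"
    "Hi Ravi,\n"
    "Regards,\n"
    "Santhosh\n"
    "______________________________________________\n"
    "From: Deo, Ravinesh\n"
    "Sent: Tuesday, 21 October 2014 7:20 AM\n"
    "Thanks\n"
    "Ravi\n"
)

for number, email in enumerate(split_emails(thread), start=1):
    print(f"[{number}]")
    print(email)
```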

regexp to wrap a line with ${color} and $color

Is there a way to have this regex put ${color orange} at the beginning, and $color at the end of the line where the date is found?
DJS=`date +%_d`;
cat thisweek.txt | sed s/"\(^\|[^0-9]\)$DJS"'\b'/'\1${color orange}'"$DJS"'$color'/
With this expression I get this:
Saturday Aug 13 12pm - 9pm 4pm - 5pm
Sunday Aug 14 9:30am - 6pm 1pm - 2pm
Monday Aug 15 6:30pm - 11:30pm None
Tuesday Aug 16 6pm - 11pm None
Wednesday Aug 17 Not Currently Scheduled for This Day
Thursday Aug ${color orange}18$color Not Currently Scheduled for This Day
Friday Aug 19 7am - 3:30pm 10:30am - 11:30am
What I want to have is this:
Saturday Aug 13 12pm - 9pm 4pm - 5pm
Sunday Aug 14 9:30am - 6pm 1pm - 2pm
Monday Aug 15 6:30pm - 11:30pm None
Tuesday Aug 16 6pm - 11pm None
Wednesday Aug 17 Not Currently Scheduled for This Day
${color orange}Thursday Aug 18 Not Currently Scheduled for This Day$color
Friday Aug 19 7am - 3:30pm 10:30am - 11:30am
Actually, it works for me. Depending on your version of sed, you might need to pass -r (in that case the group and the alternation are written without backslashes). Also, as tripleee says, don't use cat here
DJS=`date +%_d`
sed -r s/"(^|[^0-9])$DJS"'\b'/'\1${color orange}'"$DJS"'$color'/ thisweek.txt
EDIT: Ok, so with the new information I arrived at this:
sed -r "s/([^0-9]+19.+)/\${color orange}\1\$color/" thisweek.txt
This gives me the output
Saturday Aug 13 12pm - 9pm 4pm - 5pm
Sunday Aug 14 9:30am - 6pm 1pm - 2pm
Monday Aug 15 6:30pm - 11:30pm None
Tuesday Aug 16 6pm - 11pm None
Wednesday Aug 17 Not Currently Scheduled for This Day
Thursday Aug 18 Not Currently Scheduled for This Day
${color orange}Friday Aug 19 7am - 3:30pm 10:30am - 11:30am $color
(Note that it differs from yours since it's Friday, at least in my time zone)
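The same whole-line wrapping can be sketched in Python, which makes it a little easier to tie the day number to the date column so that times such as "6:30pm" are not picked up by accident. This assumes every line starts with "<weekday> <month> <day>"; highlight_day is a made-up name for the example.

```python
import re


def highlight_day(lines, day):
    """Wrap the line whose date column equals `day` in ${color orange} ... $color."""
    # int() tolerates the space padding produced by `date +%_d` (e.g. " 8").
    day_at_start = re.compile(r"^\S+\s+\S+\s+0?" + str(int(day)) + r"\b")
    return [
        "${color orange}" + line + "$color" if day_at_start.search(line) else line
        for line in lines
    ]


schedule = [
    "Wednesday Aug 17 Not Currently Scheduled for This Day",
    "Thursday Aug 18 Not Currently Scheduled for This Day",
    "Friday Aug 19 7am - 3:30pm 10:30am - 11:30am",
]

print("\n".join(highlight_day(schedule, 18)))
```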

javascript regular expression: how do I find date without year or date with year<2010

I need to find date without year, or date with year<2010.
basically,
Feb 15
Feb 20
Feb 20, 2009
Feb 20, 1995
should be accepted
Feb 20, 2010
Feb 20, 2011
should be rejected
How do I do it?
Thanks,
Cheng
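Try this:
^(Jan|Feb|Mar...Dec)\s\d{1,2}(,\s(1[0-9]{3}|200[0-9]))?$
Note: Expand the month list with proper names; I was too lazy to spell it all out. The year part is optional so that dates without a year are accepted, and the ^ and $ anchors keep "Feb 20, 2010" from matching on its "Feb 20" prefix.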
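To check a pattern like this against the examples from the question, here is a small sketch in Python (the regex syntax used is the same in JavaScript). It assumes three-letter month abbreviations and the exact "Mon D, YYYY" layout; is_accepted is a name made up for the example.

```python
import re

# Month, day, then an optional ", YYYY" limited to the years 1000-2009.
DATE_BEFORE_2010 = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s\d{1,2}"
    r"(?:,\s(?:1\d{3}|200\d))?"
)


def is_accepted(text):
    """True for a date without a year, or with a year before 2010."""
    # fullmatch plays the role of the ^ and $ anchors.
    return DATE_BEFORE_2010.fullmatch(text) is not None


for candidate in ["Feb 15", "Feb 20", "Feb 20, 2009", "Feb 20, 1995",
                  "Feb 20, 2010", "Feb 20, 2011"]:
    print(candidate, "->", "accepted" if is_accepted(candidate) else "rejected")
```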