RegEx String Validator - regex

In my MVC3 application, on one of the entities I am saving the date of birth as a string. Why? Because my application allows storing the date of birth of people long dead, e.g. Socrates, Plato, Epicurus, etc., and as far as I know the DateTime class doesn't allow that.
Now obviously we don't know the exact date of birth of Epicurus, for example; we just know the year of birth (341 BCE). So what I am thinking of doing is building a custom validator that will validate the input string for the date of birth and make sure it matches one of the following formats:
12 Feb 1809
Feb 1809
341
341 BCE
Oct 341 BCE
11 Mar 5 BCE
I need to write a regular expression that will match any of the above, and of course not match anything else.
Update
Thank you very much, I wish I was as good as you lot at building regexes! Since my application uses ASP.NET MVC3, I would like to stick with the .NET Regex class (for convenience's sake).
luastoned's answer seems to work; I can't seem to break its logic with all the test data I've thrown at it.
One thing though: can I also allow BC? Some people use BC and others use BCE, so would that be possible? And am I right that the regular expression cannot replace BC with BCE, so I have to do that manually in my C# code? The regex would just either match or not, is that correct?
Update 2
M42's Regular Expression seems to be working better. I've just copied it and used it in my Custom Validator (code in PasteBin link below).

How about :
/^(?:\d+\s)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?(?:\s?\d+)(?:\sBCE)?$/
Here is a perl script with test cases:
#!/usr/local/bin/perl
use strict;
use warnings;
use Test::More;
my $re1 = qr/^(?:\d+\s)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?(?:\s?\d+)(?:\sBCE)?$/;
while(<DATA>) {
    chomp;
    if (/$re1/) {
        ok(1, "input = $_");
    } else {
        ok(0, "input = $_");
    }
}
done_testing;
__DATA__
12 Feb 1809
Feb 1809
341
341 BCE
Oct 341 BCE
11 Mar 5 BCE
12D09
1s909
A3 43 4 BCE
a 1
3F9
abc
BCE
123b456
output:
# Looks like you failed 9 tests of 15.
ok 1 - input = 12 Feb 1809
ok 2 - input = Feb 1809
ok 3 - input = 341
ok 4 - input = 341 BCE
ok 5 - input = Oct 341 BCE
ok 6 - input = 11 Mar 5 BCE
not ok 7 - input = 12D09
not ok 8 - input = 1s909
not ok 9 - input = A3 43 4 BCE
not ok 10 - input = a 1
not ok 11 - input = 3F9
not ok 12 - input =
not ok 13 - input = abc
not ok 14 - input = BCE
not ok 15 - input = 123b456
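
On the BC/BCE follow-up: the pattern above would also accept an input like "12 1809" (a day with no month), and it only knows the BCE suffix. Below is a minimal Python sketch of a stricter variant, not the code from the linked validator. It assumes a day is only valid together with a month, limits the year to four digits, and accepts either BC or BCE; the names DOB_PATTERN and normalize_dob are made up for illustration. Matching and rewriting are separate steps here: the pattern validates, and the function rebuilds the string with BCE.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Assumption: a day needs a month, a month needs a year, the era is optional.
DOB_PATTERN = re.compile(
    r"(?:(?:(?P<day>\d{1,2}) )?(?P<month>" + MONTHS + r") )?"
    r"(?P<year>\d{1,4})"
    r"(?: (?P<era>BCE?))?"
)


def normalize_dob(text):
    """Return the date with BC rewritten as BCE, or None if it is invalid."""
    match = DOB_PATTERN.fullmatch(text)
    if match is None:
        return None
    era = "BCE" if match["era"] else None
    parts = [match["day"], match["month"], match["year"], era]
    return " ".join(part for part in parts if part)


for sample in ["12 Feb 1809", "341 BC", "11 Mar 5 BCE", "12 1809", "12D09"]:
    print(repr(sample), "->", normalize_dob(sample))
```

The same pattern should carry over to the .NET Regex class with the named groups written as (?<day>...) instead of (?P<day>...), but I have not run it there.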

This looks like the weirdest regex I've ever made:
(\d+\s?)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?\s(\d+\s)?(BCE)?
I have no idea how many false positives would get through, though.
You can check the sample on Regexr

Not quite a friend of RegExr (and not knowing the limitations of regexes in MVC3), allow me to present a PHP version with named captures (demo):
(?:(?:(?P<date>\d{1,2})\s)?(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))?(?:(?:^|\s)(?P<year>\d+))?(?:\s(?P<bce>BCE))?
This is based on @luastoned's answer.

Related

How to isolate specific words and years in group 1 no matter the combination

I have some values coming in with various layouts, such as
autumn ux-s 2021
2021 pes-3 autumn P-S
pes-3 autumn 2021 32
autumn usd- fosd 2021 2
I really want to isolate "autumn" and "2021", both in group 1.
Of course "autumn" could also be "spring", "summer" or "winter", and the year part should match any year.
It doesn't matter whether I get "2021 autumn" or "autumn 2021", as long as I can isolate it within the same group 1.
How could I achieve this? I simply can't see how to keep it within one single group.
I can isolate the location here, but of course it still matches the whole thing:
((?:(?:autumn|spring)(?:\s*[a-zA-Z]*\s*)\d{4})|(?:\d{4}(?:\s*[a-zA-Z]*\s*)(?:autumn|spring)))
Can I somehow extract only the parts I need from here and combine them into a single group result?
I did not find a regex to capture what you want in 1 group but maybe this one-liner solution helps you?
import re
text = ["autumn ux-s 2021", "2021 pes-3 autumn P-S", "pes-3 autumn 2021 32" ,"autumn usd- fosd 2021 2"]
pattern = r"(autumn|summer|winter|spring).*(\d{4})|(\d{4}).*(autumn|summer|winter|spring)"
print([' '.join(filter(None, re.search(pattern, txt, re.IGNORECASE).groups())) for txt in text])
output: ['autumn 2021', '2021 autumn', 'autumn 2021', 'autumn 2021']
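
The one-liner above raises an AttributeError when a string has no season or no year, because re.search then returns None. As a hedged alternative, here is a small sketch that searches for the two pieces independently and joins them afterwards. The function name season_and_year is hypothetical, and it always emits the season first, which the question says is acceptable.

```python
import re

SEASON = re.compile(r"\b(autumn|spring|summer|winter)\b", re.IGNORECASE)
YEAR = re.compile(r"\b(\d{4})\b")


def season_and_year(text):
    """Return 'season year' if both parts are present, otherwise None."""
    season = SEASON.search(text)
    year = YEAR.search(text)
    if season is None or year is None:
        return None
    return f"{season.group(1).lower()} {year.group(1)}"


samples = [
    "autumn ux-s 2021",
    "2021 pes-3 autumn P-S",
    "pes-3 autumn 2021 32",
    "autumn usd- fosd 2021 2",
    "no match here",
]
for sample in samples:
    print(repr(sample), "->", season_and_year(sample))
```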

Regex Pattern for Dates in String

Need help debugging Regex
I have a string column in a pandas data frame that contains dates formatted as follows. There is only one such date in each string.
(Semicolons are only used to delimit the dates here; they are not present in the actual strings.)
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
My job is to extract these using regex. Here is the pattern I came up with.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"
sample_series.str.extract(my_pattern, expand=False)
So far it works for every date except the format "Jan 27, 1983": it matches the month name and the day, but the year isn't matched. I am relatively new to regex and I think my pattern design is quite bad too. I need help figuring out what's wrong with my regex and how I could debug or improve it. Thanks.
Here is the sample data to make the problem reproducible.
sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
From your data :
>>> import pandas as pd
>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
0
0 .Got back to U.S. Jan 27, 1983.\n
1 .On 21 Oct 1983 patient was discharged from Sc...
2 4-13-89 Communication with referring physician...
3 7intake for follow up treatment at Anson Gener...
4 . Pt diagnosed in Apr 1976 after he presented...
5 1-14-81 Communication with referring physician...
6 . Went to Emerson, in Newfane Alaska. Started ...
7 09/14/2000 CPT Code: 90792: With medical servi...
8 . Sep 2015- Transferred to Memorial Hospital f...
9 Born and raised in Fowlerville, IN. Parents d...
We can use a tool called datefinder to find the date in each row :
>>> import datefinder
>>> def find_date(df):
...     return [match for match in datefinder.find_dates(df[0])]
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
0 Vals
0 .Got back to U.S. Jan 27, 1983.\n [1983-01-27 00:00:00]
1 .On 21 Oct 1983 patient was discharged from Sc... [1983-10-21 00:00:00]
2 4-13-89 Communication with referring physician... [1989-04-13 00:00:00]
3 7intake for follow up treatment at Anson Gener... []
4 . Pt diagnosed in Apr 1976 after he presented... [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5 1-14-81 Communication with referring physician... [1981-01-14 00:00:00]
6 . Went to Emerson, in Newfane Alaska. Started ... [2002-09-30 00:00:00]
7 09/14/2000 CPT Code: 90792: With medical servi... [2000-09-14 00:00:00]
8 . Sep 2015- Transferred to Memorial Hospital f... [2015-09-30 00:00:00]
9 Born and raised in Fowlerville, IN. Parents d... [2003-09-30 00:00:00]
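
On why the question's pattern stops at "Jan 27": as far as I can tell, the optional leading part and the leading separators can both be skipped or matched on their own, so the engine finds an earlier match that starts on the ". " before "Jan" and then reads "27" as the year. The sketch below reproduces that with the standard re module and shows one possible restructuring, where separators are only allowed after a leading day or month. REVISED is my own simplified pattern, not a drop-in replacement; it has no capture groups and has only been checked against a few of the sample strings.

```python
import re

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
SEP = r"[, \-./]"

# The pattern from the question, unchanged (split only for readability).
ORIGINAL = re.compile(
    r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?"
    r"[, -./]{0,2}"
    r"(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)"
    r"[, -./]{1,2}(\d{2,4}))|(\d{4})"
)

# Assumed restructuring: a separator may only follow a leading day or month.
LEADING = r"(?:(?:\d{1,2}|" + MONTH + r")" + SEP + r"{1,2})?"
MIDDLE = r"(?:\d{1,2}(?:st|nd|rd|th)?|" + MONTH + r")"
REVISED = re.compile(LEADING + MIDDLE + SEP + r"{1,2}\d{2,4}" + r"|\d{4}")

text = ".Got back to U.S. Jan 27, 1983.\n"
print(repr(ORIGINAL.search(text).group(0)))  # starts on the ". " before "Jan"
print(repr(REVISED.search(text).group(0)))   # starts on "Jan"
```

Printing group(0) for a failing sample, as done here, is usually the quickest way to see where a match really starts and ends.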

Pandas: select rows from columns using Regex

I want to extract the rows whose feccandid value has an H or S as the first character:
cid amount date catcode feccandid
0 N00031317 1000 2010 B2000 H0FL19080
1 N00027464 5000 2009 B1000 H6IA01098
2 N00024875 1000 2009 A5200 S2IL08088
3 N00030957 2000 2010 J2200 S0TN04195
4 N00026591 1000 2009 F3300 S4KY06072
5 N00031317 1000 2010 B2000 P0FL19080
6 N00027464 5000 2009 B1000 P6IA01098
7 N00024875 1000 2009 A5200 S2IL08088
8 N00030957 2000 2010 J2200 H0TN04195
9 N00026591 1000 2009 F3300 H4KY06072
I am using this code:
campaign_contributions.loc[campaign_contributions['feccandid'].astype(str).str.extractall(r'^(?:S|H)')]
Returns error:
ValueError: pattern contains no capture groups
Does anyone with experience using Regex know what I am doing wrong?
Why not just use str.match instead of extract and negate?
i.e. df[df['col'].str.match(r'^(S|H)')]
(I came here looking for the same answer, but the use of extract seemed odd, so I found the docs for str.ops.)
For something this simple, you can bypass the regex:
relevant = campaign_contributions.feccandid.str.startswith('H') | \
campaign_contributions.feccandid.str.startswith('S')
campaign_contributions[relevant]
However, if you want to use a regex, you can change this to
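relevant = ~campaign_contributions['feccandid'].str.extract(r'^(S|H)', expand=False).isnull()
Note that the astype is redundant, and that extract is enough. Passing expand=False makes extract return a Series, so the result can be used directly as a row mask.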
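
For reference, the original error comes from the pattern itself: (?:S|H) is a non-capturing group, and extractall needs at least one capture group. A minimal sketch of the str.match route on a cut-down copy of the data (the four rows are taken from the question, the variable names are mine):

```python
import pandas as pd

campaign_contributions = pd.DataFrame({
    "cid": ["N00031317", "N00024875", "N00031317", "N00026591"],
    "feccandid": ["H0FL19080", "S2IL08088", "P0FL19080", "H4KY06072"],
})

# str.match anchors at the start of each string and returns booleans,
# so no capture group is needed.
starts_with_h_or_s = campaign_contributions["feccandid"].str.match(r"[HS]")
selected = campaign_contributions[starts_with_h_or_s]
print(selected)
```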

Pandas dataframe applying NA to part of the data

Let me preface this by saying I am new to pandas, so I'm sorry if this question is basic or has been answered before. I looked online and couldn't find what I needed.
I have a dataframe that consists of a baseball team's schedule. Some of the games have already been played, so their results are entered in the dataframe. However, for games that are yet to happen, there is only the time they are to be played (e.g. 1:35 pm).
So, I would like to convert all of the values for the games yet to happen into NaNs.
Thank you
As requested, here is what the results column for the Arizona Diamondbacks contains:
print MLB['ARI']
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
9 0
10 1
...
151 3:40 pm
152 8:40 pm
153 8:10 pm
154 4:10 pm
155 4:10 pm
156 8:10 pm
157 8:10 pm
158 1:10 pm
159 9:40 pm
160 8:10 pm
161 4:10 pm
Name: ARI, Length: 162, dtype: object
I couldn't figure out a direct solution, only an iterative one:
for i in range(len(MLB)):
    # A start time such as '3:40 pm' marks a game that has not been played yet
    if 'pm' in MLB['ARI'].iat[i] or 'am' in MLB['ARI'].iat[i]:
        MLB['ARI'].iat[i] = np.nan
This should work if your actual values (1s and 0s) are also strings. If they are numbers, try:
for i in range(len(MLB)):
    if type(MLB['ARI'].iat[i]) != type(1):
        MLB['ARI'].iat[i] = np.nan
The more idiomatic way to do this would be with the vectorised string methods.
http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
mask = MLB['ARI'].str.contains('pm') # create a boolean array
MLB['ARI'][mask] = np.nan # the column name goes first
Create the boolean array from the column and then use it to select the data you want.
Make sure that the column name goes before the masking array, otherwise you'll be acting on a copy of the data and your original dataframe won't get updated.
MLB['ARI'][mask] # returns a view on the MLB dataframe, which will be updated
MLB[mask]['ARI'] # returns a copy of MLB, which won't be updated
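
Two caveats on the snippet above, as far as I know. If the played games are stored as numbers rather than strings, str.contains returns NaN for those rows unless na=False is passed. Current pandas also discourages chained assignment like MLB['ARI'][mask] = ... in favour of a single .loc call. A minimal sketch on a made-up six-row column (the data and the name is_start_time are illustrative only):

```python
import numpy as np
import pandas as pd

MLB = pd.DataFrame({"ARI": [0, 1, 0, "3:40 pm", "8:10 pm", "11:05 am"]})

# na=False makes non-string entries (the integer results) count as "no match".
is_start_time = MLB["ARI"].str.contains(r"[ap]m$", na=False)

# One .loc call selects rows and column together and writes to MLB itself.
MLB.loc[is_start_time, "ARI"] = np.nan
print(MLB)
```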

Need to filter log to search for the lines from the last 5 minutes

2011-04-13 00:09:07,731 INFO [STDOUT] 04/13 00:09:07 Information...
Hi everyone. I would post some of my code, but I don't even think it's worthy of posting. I've got a log file with lines like the one above. What I need to do is take the last line's timestamp and keep all the lines from the last 5 minutes (rather than the last 200 lines or whatever, which would be easier). Could anyone help? I've searched the web and found some decent tips, but I still have nothing working and I'm frustrated as hell. Thanks!
Here's a simple Perl script that iterates over the file and prints every line whose timestamp is within 5 minutes of the time at the start of execution. For more efficiency, and assuming that the lines are in timestamp order, you could modify this to set a boolean flag when it encounters the first printable line and skip the testing from that point forwards.
#!/usr/bin/perl
use POSIX qw(mktime);
$now = time();
while(<>)
{
    ($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
    $t = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
    print "$_" if ($t >= $now-300);
}
I take it by your latest comment that you are interested in finding out how to find the timestamp that is last in your log, and the entries that are 5 minutes before that.
I think Jim Garrison's solution could be patched to replace this:
$now = time();
with this:
open F, "<server.log" or die $!;
seek F,-1000,2; # set pos to last 1000 bytes
my @f = <F>;
$_ = $f[$#f];
($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
$now = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
$now should now contain the last timestamp in the log.
I approximated "-1000" to be long enough to go past at least one line in the log. You could set it much higher if you expect to have long lines in the log, but from what I saw, the last log entry "should" be fairly short.
If you have a huge log file and want to increase performance in the following search, you can use an estimation and perform a seek to find the last, say, 1000000 bytes in the file with:
seek F, -1000000, 2;
Good luck!
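
For comparison, here is the same idea as a Python sketch: take the timestamp of the last stamped line as "now" and keep everything within 5 minutes of it. It assumes every relevant line starts with a stamp shaped like 2011-04-13 00:09:07,731 and simply drops lines without one (continuation lines, for example). The names parse_stamp and last_minutes are made up for this example.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
STAMP_LENGTH = len("2011-04-13 00:09:07,731")


def parse_stamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
    except ValueError:
        return None


def last_minutes(lines, minutes=5):
    """Keep the lines stamped within `minutes` of the last stamped line."""
    stamped = [(parse_stamp(line), line) for line in lines]
    stamped = [(stamp, line) for stamp, line in stamped if stamp is not None]
    if not stamped:
        return []
    cutoff = stamped[-1][0] - timedelta(minutes=minutes)
    return [line for stamp, line in stamped if stamp >= cutoff]


log = [
    "2011-04-13 00:03:00,000 INFO [STDOUT] too old",
    "2011-04-13 00:05:00,000 INFO [STDOUT] recent",
    "2011-04-13 00:09:07,731 INFO [STDOUT] 04/13 00:09:07 Information...",
]
print("\n".join(last_minutes(log)))
```

This reads the whole file into memory, so for a very large log the seek trick above is still the better starting point.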
Iterate over all the lines, use a regexp to grab the time (e.g. 00:09:07), and check it against the current time (localtime, etc.).
If the file contains entries from different dates, then also grab the dates using a regexp, and again compare using the output of localtime.
How can I modify your script to make it work with the logs below?
Dec 18 09:41:18 sd
Dec 18 09:46:29 sds
Dec 18 09:48:39 sds
Dec 18 09:48:54 sds
Dec 18 09:54:47 sds
Dec 18 09:55:33 sds
Dec 18 09:55:38 sds
Dec 18 09:57:58 sds
Dec 18 09:58:10 sds
Dec 18 10:00:50 sdsd
Dec 18 10:03:43 sds
Dec 18 10:03:50 sdsd
Dec 18 10:04:06 sdsd
Dec 18 10:04:15 sdsd
Dec 18 10:14:50 wdad
Dec 18 10:19:16 sdadsa
Dec 18 10:19:23 dsds
Dec 18 10:21:03 sadsd
Dec 18 10:22:54 adas
Dec 18 10:27:32 qadad
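
Not a Perl answer, but here is a hedged Python sketch for this syslog-style format. The stamps carry no year, so the sketch takes one as a parameter (assumed_year is an assumption of mine, and a log that crosses New Year would need extra handling). It also relies on %b matching English month abbreviations, which depends on the locale. The function names are hypothetical.

```python
from datetime import datetime, timedelta

STAMP_LENGTH = len("Dec 18 09:41:18")


def parse_stamp(line, assumed_year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp, or return None."""
    try:
        return datetime.strptime(
            f"{assumed_year} {line[:STAMP_LENGTH]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def last_minutes(lines, minutes=5, assumed_year=2023):
    """Keep the lines stamped within `minutes` of the last stamped line."""
    stamped = [(parse_stamp(line, assumed_year), line) for line in lines]
    stamped = [(stamp, line) for stamp, line in stamped if stamp is not None]
    if not stamped:
        return []
    cutoff = stamped[-1][0] - timedelta(minutes=minutes)
    return [line for stamp, line in stamped if stamp >= cutoff]


log = [
    "Dec 18 10:14:50 wdad",
    "Dec 18 10:19:16 sdadsa",
    "Dec 18 10:21:03 sadsd",
    "Dec 18 10:22:54 adas",
    "Dec 18 10:27:32 qadad",
]
print("\n".join(last_minutes(log)))
```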