Python re.match not finding characters in the middle of the string - regex

I have a list of website links that are exactly the same except for the changing year, which is what I am trying to find. I'm using re.match to try and find it since the string is the exact same except for the 4 characters (20xx). For some reason it is only returning None though, and I don't know why.
I have tried to use other re methods such as findall and fullmatch, but it doesn't help.
state_links = ["https://2009-2017.state.gov/r/pa/prs/ps/2009/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2010/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2011/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2012/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2013/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2014/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2015/index.htm",
"https://2009-2017.state.gov/r/pa/prs/ps/2016/index.htm"]
for link in state_links:
    year = re.match(r"https://2009-2017.state.gov/r/pa/prs/ps/(.*)/index.htm", link)
    print(year)

Your example as shown works, printing a series of re.Match instances. (Though, the . is not doing what you think it's doing, and it may be sounder practice to use \d{4} inside the capturing group. A plain . is a pattern for any character; you probably want a literal period, \..)
Regardless, if your links are always that cleanly formatted, you can also just use a str method here:
>>> [int(i.rsplit("/", 2)[-2]) for i in state_links]
[2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
This splits each link into 3 parts, where each intermediate element will look like:
>>> state_links[0].rsplit("/", 2)
['https://2009-2017.state.gov/r/pa/prs/ps', '2009', 'index.htm']
The [-2] indexer then takes the year component.
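For the regex route, here is a minimal sketch of the suggestions above: escape the literal dots and capture exactly four digits. The shortened pattern and the names YEAR_RE and years are illustrative assumptions, not part of the original post. The pattern is shortened to show that re.match only tries to match at the start of the string, while re.search scans all of it.

```python
import re

state_links = [
    "https://2009-2017.state.gov/r/pa/prs/ps/2009/index.htm",
    "https://2009-2017.state.gov/r/pa/prs/ps/2010/index.htm",
]

# Escaped dot matches a literal "."; \d{4} captures exactly four digits.
YEAR_RE = re.compile(r"/(\d{4})/index\.htm$")

# re.search scans the whole string; re.match only tries position 0,
# so this shortened pattern would return None with re.match.
years = [int(YEAR_RE.search(link).group(1)) for link in state_links]
print(years)
```

With the full URL in the pattern, as in the question, re.match works as well, because the pattern then starts at the beginning of each link.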

As @Drubio indicated, your regex pattern is correct. However, check your code. The following works:
regex = r"https://2009-2017\.state\.gov/r/pa/prs/ps/(\d{4})/index\.htm"
# re.finditer expects a single string, so join the links with newlines first
years = re.finditer(regex, "\n".join(state_links))
for year in years:
    for j in range(1, len(year.groups()) + 1):
        print(year.group(j))
Output
## 2009 2010 2011 2012 2013 2014 2015 2016
Credit to @Brad for the \d{4} suggestion/correction and also the .split option

Related

How to create a regex pattern in VBA to extract dates from a string and exclude false matches

I am trying to use Regex to parse a series of strings to extract one or more text dates that may be in multiple formats. The strings will look something like the following:
24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b nnd-emvirt010a/b COSI-1.6.5
24.16 nno-emvirt010a/b nnt-emvirt010a/b nnd-emvirt010a/b EI.01.02.03\
9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log, Status_2016-07-28.log, Status_2016-08-05.log, Status_2016-08-08.log
I am not concerned about validating the individual date fields; just extracting the date string. The part I am unable to figure out is how to not match on number sequences that match the pattern but aren’t dates (‘1.6.5’ in ex. (1) and 01.02.03 in ex. (2)) and dates that are part of a file name (2016-07-27 in ex. (3)). In each of these exception cases in my input data, the initial numbers are preceded by either a period(.), underscore (_) or dash (-), but I cannot determine how to use this to edit the pattern syntax to not match these strings.
The pattern I have that partially works is below. It will only ignore the non date matches if it starts with 1 digit as in example 1.
/[^_\.\(\/]\d{1,4}[/\-\.\s*]([1-9]|0[1-9]|[12][0-9]|3[01]|[a-z]{3})[/\-\.\s*]\d{1,4}/ig
I am not sure about VBA, so check whether this works. The Regular Expressions Cookbook gives several options: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s04.html
^(?:(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9]))/(?:[0-9]{2})?[0-9]{2}$
^(?:
# m/d or mm/dd
(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])
|
# d/m or dd/mm
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9])
)
# /yy or /yyyy
/(?:[0-9]{2})?[0-9]{2}$
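As a rough illustration of the free-spacing form above, here is the same pattern in Python with re.VERBOSE (Python stands in for VBA here, and the name DATE_RE is an assumption for the example). Note that the pattern is anchored with ^ and $, so it validates a string that is only a date; it does not pull dates out of longer text.

```python
import re

# Free-spacing version of the cookbook pattern; whitespace and # comments
# inside the pattern are ignored because of re.VERBOSE.
DATE_RE = re.compile(r"""
    ^(?:
        (1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])   # m/d or mm/dd
      |
        (3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9])   # d/m or dd/mm
    )
    /(?:[0-9]{2})?[0-9]{2}$                          # /yy or /yyyy
""", re.VERBOSE)

for candidate in ["9/23/16", "31/12/2016", "13/13/2016", "1.6.5"]:
    print(candidate, bool(DATE_RE.match(candidate)))
```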
According to the test strings you've presented, you can use the following regex
See this regex in use here
(?<=[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
This regex ensures that specific date formats are met and that each date is preceded either by nothing (the beginning of the string) or by a character that is not a letter, a digit, or a dot. The date formats that will be matched are:
24 Aug 2016
24.16
9/23/16
The regex could be further manipulated to ensure numbers are in the proper range according to days/month, etc., however, I don't feel that is really necessary.
Edits
Edit 1
Since VBA doesn't support lookbehinds, you can use the following. The date is in capture group 1.
(?:[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
Edit 2
As per bulbus's comment below
(?:[^\w.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d{2,4})|(?:(?:\d{1,2}\/){2}\d{2,4})|(?:\d{2,4}(?:-\d{2}){2})|\d{2}\.\d{2})
Took liberty to edit that a bit.
1. Replaced [^a-zA-Z\d.] with [^\w.], which has the added advantage of excluding dates such as the one in _2016-07-28.log.
2. Because of 1, removed the trailing condition (?=[^a-zA-Z\d.]).
3. Forced year digits from \d+ to \d{2,4}.
Edit 3
Due to added conditions of the regex, I've made the following edits (to improve upon both previous edits). As per the OP:
The edited pattern above works in all but 2 cases:
it does not find dates with the year first (ex. 2016/07/11)
if the date is contained within parentheses in the string, it returns the left parenthesis as part of the date (ex. match = (8/20/2016)
Can you provide the edit to fix these?
In the below regexes, I've changed years to \d+ in order for it to work on any year greater than or equal to 0.
See the code in use here
(?:[^\w.]|^)((?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:\/\d{1,2}){2})|(?:\d+(?:-\d{2}){2})|\d{2}\.\d+)
This regex adds the possibility of dates in the XXXX/XX/XX format, where the year may appear first.
The reason you are getting ( as a match before the regex is the nature of the Full Match. You need to, instead, grab the value of the first capture group and not the whole regex result. See this answer on how to grab submatches from a regex pattern in VBA.
Also, note that any additional date formats you need to catch need to be explicitly set in the regex. Currently, the regex supports the following date formats:
\d{1,2}\s+[A-Z][a-z]{2}\s+\d+
12 Apr 17
12 Apr 2017
(?:\d{1,2}\/){2}\d+
1/4/17
01/04/17
1/4/2017
01/04/2017
\d+(?:\/\d{1,2}){2}
17/04/01
2017/4/1
2017/04/01
17/4/1
\d+(?:-\d{2}){2}
17-04-01
2017-04-01
\d{2}\.\d+ - Although I'm not sure what this date format is even used for and how it could be considered efficient if it's missing month
24.16
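To see the capture-group point in practice, here is a small sketch that runs the Edit 3 pattern over strings like the question's samples. It uses Python's re module as a stand-in for VBA's RegExp object, which is an assumption made purely for illustration; the helper name extract_dates is hypothetical, and \/ is written as / because Python does not need the escape. re.findall returns the contents of capture group 1, so a leading parenthesis or space is not part of the result.

```python
import re

# Edit 3 pattern; the date is in capture group 1.
DATE_RE = re.compile(
    r"(?:[^\w.]|^)("
    r"(?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)"   # 12 Apr 2017
    r"|(?:(?:\d{1,2}/){2}\d+)"             # 1/4/2017
    r"|(?:\d+(?:/\d{1,2}){2})"             # 2017/04/01
    r"|(?:\d+(?:-\d{2}){2})"               # 2017-04-01
    r"|\d{2}\.\d+"                         # 24.16
    r")"
)

def extract_dates(text):
    """Return every date-like substring (capture group 1) found in text."""
    return DATE_RE.findall(text)

samples = [
    "24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b COSI-1.6.5",
    "24.16 nno-emvirt010a/b EI.01.02.03",
    "9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log",
    "on (8/20/2016) and 2016/07/11",
]
for sample in samples:
    print(extract_dates(sample))
```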

Remove everything after a certain word in OpenRefine

I'd like to remove everything after a certain word ("am") in a cell with OpenRefine.
My data:
Workshop im Rahmen des Weiterbildungsprogramms am 02. November 2015
Brainstorming am 09. November 2015 in Bremen
Workshop "Auswählen und bewerten" am 17. November 2015 in Hamburg
Example for Regex: [\n\r].*am\s*([^\n\r]*)
See it in action here: http://rubular.com/r/bBlXOMoos1
That works. I'd like to have the following result.
Workshop im Rahmen des Weiterbildungsprogramms
Brainstorming
Workshop "Auswählen und bewerten"
I tried: value.replace(/[\n\r].*am\s*([^\n\r]*)/, '')
The problem is not so much the regex, I could remove the "am" in a 2nd step, if necessary. But I can't get the regex to work in combination with value.replace.
Could you try this with Python/Jython ?
import re
return re.sub(r"am.+","", value)
I think Python's regular expressions are often more consistent than those of GREL. But if you want to use GREL, does this not work?
value.replace(/\s+am.+/, '')
I feel you are mixing the syntax of value.match() (which requires you to match the whole string in a cell, then select the substring you want) and value.replace() (where you can only match the substring you need).
The issue is pretty simple, actually: you're missing the . before your * to remove all the trailing stuff. Right now your regex says that 0 or more spaces follow the am, but you want it to strip everything else after it. This works:
value.replace(/\sam.*/,'')
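For comparison, here is a plain-Python sketch of the same idea, outside OpenRefine (the function name strip_after_am is made up for the example). It also shows why requiring whitespace before "am" matters: a bare am.+ cuts inside "Weiterbildungsprogramms", which contains "am".

```python
import re

def strip_after_am(value):
    # Remove the standalone word "am" and everything after it.
    return re.sub(r"\s+am\b.*", "", value)

rows = [
    "Workshop im Rahmen des Weiterbildungsprogramms am 02. November 2015",
    "Brainstorming am 09. November 2015 in Bremen",
    'Workshop "Auswählen und bewerten" am 17. November 2015 in Hamburg',
]
for row in rows:
    print(strip_after_am(row))

# Without the leading whitespace, the match starts inside a word.
naive = re.sub(r"am.+", "", rows[0])
print(naive)
```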

Regex newbie: How to isolate 'num-num-num' in a string

I'm sure this is a super simple question for many of you, but I've only just started learning regex and at the moment can't for the life of me isolate what I'm after from the following:
June 2015 - Won / Void / Lost = 3-0-1
I need a solution to isolate the 'num-num-num' part at the end of the string that would work for any positive integers.
Thanks for any help
EDIT
So this line of code from a scrapy spider I'm writing produces the line above:
tips_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0]
I've tried to isolate the part I'm after with:
tips_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').re(r'\d+-\d+-\d+$').extract()[0]
No luck though :(
The regex to capture that is:
\d+-\d+-\d+$
It works as follows:
\d+- means: capture 1 or more digits (the numbers [0-9]), and then a "-".
$ means: you should now be at the end of the line.
Translating that into the full regex pattern:
Capture 1 or more digits, then a hyphen, then 1 or more digits, then a hyphen, then 1 or more digits, and we should now be at the end of the string.
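A minimal sketch of that pattern on the plain string, assuming the text has already been extracted from the page. The capture groups and the names won, void and lost are additions for illustration, not something the question asked for.

```python
import re

line = "June 2015 - Won / Void / Lost = 3-0-1"

# Same pattern, with each number in its own capture group.
match = re.search(r"(\d+)-(\d+)-(\d+)$", line)

record = None
if match:
    record = match.group(0)                      # the whole num-num-num part
    won, void, lost = map(int, match.groups())   # the three numbers as ints
    print(record, won, void, lost)
```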
EDIT: Addressing your edits and comments:
I'm not so sure what you mean by "isolate". I'll assume that you mean you want tips_str to equal "3-0-1".
I believe the easiest way would be to first use xpath to extract the string for the entire line without doing any regex. Then, when we're simply dealing with a string (instead of xpath stuff), it should be nice and easy to use regex and get the pattern out.
As far as I understand, sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0] (without .re()) is providing you with the string: "June 2015 - Won / Void / Lost = 3-0-1".
So then:
full_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0]
Now that we've got the full string, we can use standard string regex to pluck the part we want out:
import re

tips_str = False
search = re.search(r'\d+-\d+-\d+$', full_str)
if search:
    tips_str = search.group(0)
Now tips_str will equal "3-0-1". If the pattern wasn't matched at all, it'd instead equal False.
If any of my assumptions are wrong then let me know what's actually happening (like if .extract()[0] isn't giving back a string, then what is it giving back?) and I'll try to adjust this response.
Any and all numbers, so negatives, scientific notation, etc? This will match it.
/(\-?[\.\d]+(e\+|e\-)?[\.\d]*)-(\-?[\.\d]+(e\+|e\-)?[\.\d]*)-(\-?[\.\d]+(e\+|e\-)?[\.\d]*)$/ig
Tested with these:
June 2015 - Won / Void / Lost = -1.1e+3-1.01-0.1e+2
June 2015 - Won / Void / Lost = 1-2-3
June 2015 - Won / Void / Lost = 0.1--5-5.6
If you take $ out of it, it will match on all lines at the same time.

Parsing dates from string - regex

I'm terrible with regex and I can't seem to wrap my head around this simple task.
I need to parse out the two dates in a string which always has one of two formats:
"Inquiry at your property for December 29, 2013 - January 03, 2014"
OR
"Inquiry at your property for 29 December , 2013 - 03 January, 2014"
the 2 different date formats are throwing me off. Any insights would be appreciated!
/(\d+ \w+, \d+|\w+ \d+, \d+)/ for example. Try it out on Rubular.
Of course, it would pick up more than dates, such as 2013 NotReallyAMonth, 12345. But if nothing in your input looks like a date without actually being one, this might work.
You could make the regexp stronger by applying more restrictions on what is matched:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4})/
In this case the day is always two digits, the year is 4. Months are listed explicitly (you would have to list all of them).
Update: For ranges it would be a different regexp:
/((?:Jan|Dec) \d+ - \d+, \d{4})/
Obviously they can all be combined together:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4}|(?:Jan|Dec) \d+ - \d+, \d{4})/
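Here is a rough Python sketch of the two-format idea applied to the strings from the question. It is a variation, not the patterns above verbatim: it allows an optional space before the comma, because the second sample contains "29 December , 2013", which a pattern requiring the comma directly after the month would skip. The name DATE_RE is an assumption for the example.

```python
import re

subjects = [
    "Inquiry at your property for December 29, 2013 - January 03, 2014",
    "Inquiry at your property for 29 December , 2013 - 03 January, 2014",
]

# Month-day or day-month, an optional space before the comma, a 4-digit year.
DATE_RE = re.compile(r"(?:[A-Z][a-z]+ \d{1,2}|\d{1,2} [A-Z][a-z]+) ?, \d{4}")

for subject in subjects:
    print(DATE_RE.findall(subject))
```

Like the looser pattern above, this accepts any capitalized word as a month; listing the month names explicitly would tighten it.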

Regular Expression Match to test for a valid year

Given a value I want to validate it to check if it is a valid year. My criteria is simple where the value should be an integer with 4 characters. I know this is not the best solution as it will not allow years before 1000 and will allow years such as 5000. This criteria is adequate for my current scenario.
What I came up with is
\d{4}$
While this works it also allows negative values.
How do I ensure that only positive integers are allowed?
Years from 1000 to 2999
^[12][0-9]{3}$
For 1900-2099
^(19|20)\d{2}$
You need to add a start anchor ^ as:
^\d{4}$
Your regex \d{4}$ will match strings that end with 4 digits. So input like -1234 will be accepted.
By adding the start anchor you match only those strings that begin and end with 4 digits, which effectively means they must contain only 4 digits.
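The difference is easy to check in Python (used here only as an example engine; the helper name is_year is made up). One Python-specific detail: $ also matches just before a trailing newline, so re.fullmatch is the stricter way to say "the whole string".

```python
import re

def is_year(value):
    # fullmatch succeeds only if the entire string is exactly four digits.
    return re.fullmatch(r"\d{4}", value) is not None

# Without the start anchor, a string that merely ends in four digits passes.
unanchored = re.search(r"\d{4}$", "-1234") is not None
anchored = re.search(r"^\d{4}$", "-1234") is not None

# In Python, $ tolerates one trailing newline; fullmatch does not.
dollar_newline = re.search(r"^\d{4}$", "2021\n") is not None

print(unanchored, anchored, dollar_newline, is_year("2021"), is_year("2021\n"))
```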
The "accepted" answer to this question is both incorrect and myopic.
It is incorrect in that it will match strings like 0001, which is not a valid year.
It is myopic in that it will not match any values above 9999. Have we already forgotten the lessons of Y2K? Instead, use the regular expression:
^[1-9]\d{3,}$
If you need to match years in the past, in addition to years in the future, you could use this regular expression to match any positive integer:
^[1-9]\d*$
Even if you don't expect dates from the past, you may want to use this regular expression anyway, just in case someone invents a time machine and wants to take your software back with them.
Note: This regular expression will match all years, including those before the year 1, since they are typically represented with a BC designation instead of a negative integer. Of course, this convention could change over the next few millennia, so your best option is to match any integer—positive or negative—with the following regular expression:
^-?[1-9]\d*$
This works for 1900 to 2099:
/(?:(?:19|20)[0-9]{2})/
Building on @r92's answer, for years 1970-2019:
(19[789]\d|20[01]\d)
To test a year in a string which contains other words along with the year you can use the following regex: \b\d{4}\b
In theory the 4 digit option is right. But in practice it might be better to have 1900-2099 range.
Additionally, it needs to be a non-capturing group. Many comments and answers propose a capturing group, which is not proper IMHO. It might work for matching, but when extracting matches with the regex it will also extract the two-digit numbers (19 and 20) alongside the 4-digit numbers, because of the parentheses.
This will work for exact matching using non-capturing groups:
(?:19|20)\d{2}
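A quick sketch of the difference when extracting, using Python's re.findall as an example (the sample sentence and the \b word boundaries are assumptions added for the demo). With a capturing group, findall returns the group contents instead of the whole match.

```python
import re

text = "Founded in 1999, relaunched in 2021."

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"\b(19|20)\d{2}\b", text)

# Non-capturing group: findall returns the full four-digit matches.
full = re.findall(r"\b(?:19|20)\d{2}\b", text)

print(captured)
print(full)
```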
Use:
^(19|[2-9][0-9])\d{2}$
for years 1900 - 9999.
No need to worry for 9999 and onwards - A.I. will be doing all programming by then !!! Hehehehe
You can test your regex at https://regex101.com/
Also, more info about non-capturing groups (mentioned in one of the comments above) is here: http://www.manifold.net/doc/radian/why_do_non-capture_groups_exist_.htm
You can go with something like [^-]\d{4}$: this prevents a minus sign - from appearing right before your 4 digits.
You can also use ^\d{4}$, with ^ to catch the beginning of the string. It depends on your scenario, actually...
/^\d{4}$/
This will check if a string consists of only 4 numbers. In this scenario, to input a year 989, you can give 0989 instead.
You could convert your integer into a string. As the minus sign will not match the digits, you will have no negative years.
I use this regex in Java ^(0[1-9]|1[012])[/](0[1-9]|[12][0-9]|3[01])[/](19|[2-9][0-9])[0-9]{2}$
Works from 1900 to 9999
If you need to match YYYY or YYYYMMDD you can use:
^((?:(?:(?:(?:(?:[1-9]\d)(?:0[48]|[2468][048]|[13579][26])|(?:(?:[2468][048]|[13579][26])00))(?:0?2(?:29)))|(?:(?:[1-9]\d{3})(?:(?:(?:0?[13578]|1[02])(?:31))|(?:(?:0?[13-9]|1[0-2])(?:29|30))|(?:(?:0?[1-9])|(?:1[0-2]))(?:0?[1-9]|1\d|2[0-8])))))|(?:19|20)\d{2})$
You can also use this one.
([0-2][0-9]|3[0-1])\/(0[1-9]|1[0-2])\/(19[789]\d|20[01]\d)
In my case I wanted to match a string which ends with a year (4 digits) like this for example:
Oct 2020
Nov 2020
Dec 2020
Jan 2021
It'll return true with this one:
var sheetName = 'Jan 2021';
var yearRegex = new RegExp("\\b\\d{4}$");
var isMonthSheet = yearRegex.test(sheetName);
Logger.log('isMonthSheet = ' + isMonthSheet);
The code above is used in Apps Script.
Here's the link to test the Regex above: https://regex101.com/r/SzYQLN/1
You can try the following to capture valid year from a string:
.*(19\d{2}|20\d{2}).*
Works from 1950 to 2099 and value is an integer with 4 characters
^(19[5-9]\d|20\d{2})$