I have a regular expression that matches time ranges:
(([0-2]?[0-9]:[0-5][0-9]\s*)-(\s*[0-2]?[0-9]:[0-5][0-9]))
However, I need a regex that can extract everything except the time range in a string such as: "June 12, 2015 13:30 - 14:00" (i.e., "June 12, 2015"), but I can't seem to do it.
I tried using both a lookahead and lookbehind, such as the following, but they don't seem to work (at least for me ;).
(?<!(([0-2]?[0-9]:[0-5][0-9]\s*)-(\s*[0-2]?[0-9]:[0-5][0-9])))
Thanks to @OozeMeister for the hint. I'm coding in Ruby, and substitution is the solution: remove the time range and keep whatever is left.
t = "June 12, 2015 13:30 - 14:00"
reg = /(([0-2]?[0-9]:[0-5][0-9]\s*)-(\s*[0-2]?[0-9]:[0-5][0-9]))/
date = t.gsub(reg, '')
puts date
Output: "June 12, 2015 "
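For comparison, the same substitution idea can be sketched in Python. This is only an illustration: `TIME_RANGE` and `strip_time_range` are names made up here, and the `strip()` call is an addition that drops the trailing space visible in the output above.

```python
import re

# Same time-range pattern as the Ruby version, without the capture groups.
TIME_RANGE = re.compile(r"[0-2]?[0-9]:[0-5][0-9]\s*-\s*[0-2]?[0-9]:[0-5][0-9]")


def strip_time_range(text: str) -> str:
    """Remove any time range and trim the whitespace left behind."""
    return TIME_RANGE.sub("", text).strip()


print(strip_time_range("June 12, 2015 13:30 - 14:00"))  # June 12, 2015
```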
I am using a Regex to pull dates out of a series of strings. The format varies slightly, but it always contains the full month. The strings usually contain two dates to represent a range like so:
February 1, 2020 - March 18, 2020
or
February 1st 2020 - March 18th 2020
And this is working great until I come across dates like:
June 1 - July 22, 2018
where a year is not presented in the "starting" part of the range because it is the same as the "ending" year.
Below is the regex I crudely copied and applied to my code. It is JavaScript, but I really think this is more of a regex question...
const regex = /((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?((\s*[,.\-\/]\s*)\D?)?\s*((19[0-9]\d|20\d{2})|\d{2})*/gm;
var myDateString1 = "January 8, 2020 - January 27, 2020"; // THIS WORKS GREAT!
var myDateString2 = "January 8 - January 27, 2020"; // THIS DOES NOT WORK GREAT!
var dates = myDateString1.match(regex);
// returns ["January 8, 2020","January 27, 2020"]
var dates2 = myDateString2.match(regex);
// returns ["January 8 - J"]
Is there a way I can modify this so if it is met with a hyphen it discontinues that given match? So myDateString2 would return ["January 8", "January 27, 2020"]?
The strings sometimes have words before or after, like
Presented from January 8, 2020 - January 27, 2020 at such and such place
so I don't think simply having a regex based on the hyphen before/after would work.
You could use 2 capture groups and make the pattern more specific to match the format of the strings.
The /m flag can be omitted as there are no anchors in the pattern.
Note that the pattern matches a date-like string and does not validate the date itself.
\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b
See a regex101 demo.
const regex = /\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b/g;
const str = `January 8, 2020 - January 27, 2020
January 8 - January 27, 2020
Presented from January 8, 2020 - January 27, 2020 at such and such place
June 1 - July 22, 2018`;
console.log(Array.from(str.matchAll(regex), m => [m[1], m[2]]))
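Since the pattern only matches something date-like, a separate check is needed if the date itself has to be real. Below is a minimal Python sketch of such a check, assuming the matched text has the form "Month D, YYYY" (so it does not cover a first group without a year) and that month names are in English. `is_real_date` is a hypothetical helper name.

```python
from datetime import datetime


def is_real_date(text: str) -> bool:
    """Return True if text parses as 'Month D, YYYY' with a valid calendar day."""
    for fmt in ("%B %d, %Y", "%b %d, %Y"):  # full or abbreviated month name
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_real_date("January 27, 2020"))   # True
print(is_real_date("February 30, 2020"))  # False
```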
Note: the original regex tries to match every date form at once, which is not really possible with a pattern like this. I reworked it to cover about 75% of the original intent, but in the end it is still fool's gold.
The capture groups were only used for debugging.
Simply taking out the hyphen in the class and making the year optional at the end with a single ? should get what you want.
/((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?(((\s*[,./]))?\s+(19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/6NiNxy/1
And replacing the capture groups with clusters (?: ) then giving it one more level of factoring will make it quicker.
/(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/NTR0WD/1
const regex = /(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:(?:\s*[,./])?\s+(?:19[0-9]\d|20\d{2})|\d{2})?/g;
var myDateString1 = "January 8, 2020 - January 27, 2020";
var myDateString2 = "January 8 - January 27, 2020";
console.log(myDateString1.match(regex));
console.log(myDateString2.match(regex));
I need a regular expression for a string like this:
ex. 1234-1234-12345
where the first two numbers must be between 01-18 and the whole string must be 15 characters long
example: 0511-xxxx-xxxxx.
I tried using [RegularExpression(@"^[0-9]{1,18}$",
ErrorMessage = "Invalid Id.")]
but it doesn't work; it even gives me an error that says ',' is missing.
Let's make it even easier: a numeric string 13 characters long where the first two digits must be between 01 and 18.
Ex. 1234567890123
(I would prefer the first format, but this one works too.)
I don't know how to use Regex so if someone can kindly give me a link to somewhere I can learn I would very much appreciate it.
And, most importantly, if there is a better way to get around this without using Regex I would appreciate it as well.
Apparently, my request is a little unclear. What I want is for the first two digits (XXxx-xxxx-xxxxx) to be one of 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16, 17, 18.
Your "first two numbers" is a little unclear, but how about:
var pattern = @"(0\d|1[0-8])\d\d-\d{4}-\d{5}";
If you want to match the whole string and not just find the substring, you need
var pattern = @"^(0\d|1[0-8])\d\d-\d{4}-\d{5}$";
If you didn't have the groups separated by hyphens, use:
var pattern = @"^(0\d|1[0-8])\d{11}$";
You can use it like
Regex.IsMatch(aString, pattern)
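As a rough illustration of the same check, here is a Python sketch with and without a regex, since the question also asks about a non-regex route. Two things here are assumptions rather than part of the answer: the function names are hypothetical, and the sketch uses `0[1-9]` instead of `0\d` so that 00 is rejected, following the question's clarification that the prefix runs from 01 to 18 (`0\d` also accepts 00).

```python
import re

# 0[1-9]|1[0-8] restricts the first two digits to 01-18 (00 is rejected).
ID_PATTERN = re.compile(r"(0[1-9]|1[0-8])[0-9]{2}-[0-9]{4}-[0-9]{5}")


def is_valid_id(value: str) -> bool:
    """Regex check: NNNN-NNNN-NNNNN with the first two digits in 01-18."""
    return ID_PATTERN.fullmatch(value) is not None


def is_valid_id_no_regex(value: str) -> bool:
    """Same rule without a regex: length, hyphen positions, digits, range."""
    if len(value) != 15 or value[4] != "-" or value[9] != "-":
        return False
    digits = value.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    return 1 <= int(digits[:2]) <= 18


print(is_valid_id("0511-1234-12345"), is_valid_id_no_regex("0511-1234-12345"))
print(is_valid_id("1911-1234-12345"), is_valid_id_no_regex("1911-1234-12345"))
```

`fullmatch` plays the role of the `^...$` anchors, so the whole string has to fit the pattern.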
I am trying to extract dates from several articles. When I test the regular expression, the pattern matches only part of the information of interest, as you can see:
https://regex101.com/r/ATgIeZ/2
This is a sample of the text file:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] 3004
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo JULY 14, 2034
The extraction pattern that I am using and the code is this one:
import re
text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
pattern = ("[A-Z]+\.*\s(\d+)\,\s(\d+){4}")
result = re.findall(pattern,text_read)
print(result)
And the output from Anaconda is:
[('5', '6'), ('7', '5'), ('1', '6'), .....]
The expected output is:
OCT. 5, 2016, FEB. 8, 2016, JULY 14, 2034 .....
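The problem is the {4} quantifier, which sits outside your last group: (\d+){4} repeats the group four times and keeps only the last repetition, which is why you get a single digit of the year. Also, the part of the regex that matches the month was not inside a group, so findall did not return it.
Fix it like this: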
pattern = r"([A-Z]+)\.?\s(\d+)\,\s(\d{4})"
result with your data sample:
[('OCT', '5', '2016'), ('FEB', '8', '2016'), ('JULY', '14', '2034')]
Small extra fixes:
There can be 0 or 1 dot, so I replaced \.* with \.?
I used the "raw" prefix, which is always better when defining regex strings (no problem here, but it matters with \b, for instance).
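If whole date strings are wanted, as in the expected output, rather than tuples, one option is to drop the groups altogether: `re.findall` returns the full match when the pattern has no groups. The sketch below assumes a shortened version of the sample text; `sample` and `dates` are illustrative names.

```python
import re

# Shortened stand-in for the contents of the CSV file.
sample = (
    "By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016\n"
    "By DAVID JOLLY FEB. 8, 2016\n"
    "killed at least three people on Mo JULY 14, 2034\n"
)

# No capture groups, so findall returns each full match as a string.
pattern = re.compile(r"[A-Z]+\.?\s\d+,\s\d{4}")
dates = pattern.findall(sample)

print(dates)  # ['OCT. 5, 2016', 'FEB. 8, 2016', 'JULY 14, 2034']
```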
Thanks for the suggestions, they helped me understand the use of parentheses in regex.
I solved it myself with this:
pattern=("([A-Z]+\.*\s)(\d+)\,\s(\d{4})")
I'm trying to use str_extract to find dates in a text document. However, I've run into a bit of a conundrum. Generally I expect dates to come in one of two forms: 1) June 15th, 1914 2) June 15, 1914. But when I try to build a pattern to catch both of these options, I get an NA result.
For example, if I try to str_extract("No. 1. June 20th, 1914.", "[:alpha:]{3,8} [0-9]{1,2}[[a-z]{2}]?, [0-9]{4}"), I get NA. But if I remove the brackets around [a-z]{2} it works. However, if I remove the brackets, I of course get an NA for the string "No. 1. June 20, 1914.". This does, however, work if I leave the brackets.
I can of course work around this by using a simple if/else if statement, but I'm curious as to why this isn't working, and if there is a better way to handle these combined cases.
If you're trying to extract dates, why not use the lubridate package?
> lubridate::mdy("No. 1. June 20th, 1914.")
[1] "1914-01-20 UTC"
(where mdy is telling lubridate that the date-data appears in month-day-year order).
It's not working for the following reasons:
Your POSIX character class is not properly wrapped inside a bracketed expression.
You're trying to use a character class as an optional group construct.
Fixed, your regular expression would look like:
x <- 'No. 1. June 20th, 1914.'
str_extract(x, '[[:alpha:]]{3,8} [0-9]{1,2}([a-z]{2})?, [0-9]{4}')
## [1] "June 20th, 1914"
You could modify your regular expression:
str_extract(x, '[a-zA-Z]+ \\d{1,2}([a-z]{2})?, \\d{4}')
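The same distinction between a character class and an optional group carries over to other regex flavours. Below is a small Python sketch, assuming a non-capturing group `(?:...)?` for the optional suffix; `DATE` and `extract_date` are hypothetical names, and `[A-Za-z]` stands in for the POSIX `[[:alpha:]]` class, which Python's `re` module does not support.

```python
import re

# (?:[a-z]{2})? makes the two-letter ordinal suffix optional as a unit.
DATE = re.compile(r"[A-Za-z]{3,8} [0-9]{1,2}(?:[a-z]{2})?, [0-9]{4}")


def extract_date(text):
    """Return the first date-like substring, or None if there is none."""
    match = DATE.search(text)
    return match.group(0) if match else None


print(extract_date("No. 1. June 20th, 1914."))  # June 20th, 1914
print(extract_date("No. 1. June 20, 1914."))    # June 20, 1914
```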
> str_extract("No. 1. June 20, 1914.", "[[:alpha:]]{3,8} [[:digit:]]{1,2}.+?, [[:digit:]]{4}")
[1] "June 20, 1914"
> str_extract("No. 1. June 20th, 1914.", "[[:alpha:]]{3,8} [[:digit:]]{1,2}.+?, [[:digit:]]{4}")
[1] "June 20th, 1914"
Because . matches any character and +? is a lazy quantifier, .+? matches as few characters as possible (but at least one) before the ','. For "20th," it matches "th"; for "20," the digit part gives back the "0" so that .+? still has one character to match.
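To see what `.+?` actually consumes, capture groups can be added around the pieces. This is just a probe written in Python for illustration (the names are made up here), not part of the answer's R code:

```python
import re

# Groups added only to show which part of the string each piece consumes.
probe = re.compile(r"([0-9]{1,2})(.+?),")

suffix_case = probe.search("June 20th, 1914").groups()
plain_case = probe.search("June 20, 1914").groups()

print(suffix_case)  # ('20', 'th')
print(plain_case)   # ('2', '0')
```

In the second case the day is split between the two pieces, which is harmless for extraction but worth knowing: `.+?` will also accept any other characters that appear between the day and the comma.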
I'm terrible with regex and I can't seem to wrap my head around this simple task.
I need to parse out the two dates in a string which always has one of two formats:
"Inquiry at your property for December 29, 2013 - January 03, 2014"
OR
"Inquiry at your property for 29 December , 2013 - 03 January, 2014"
The two different date formats are throwing me off. Any insights would be appreciated!
/(\d+ \w+, \d+|\w+ \d+, \d+)/ for example. Try it out on Rubular.
Of course, it would also pick up other things, like 2013 NotReallyAMonth, 12345. But if the input contains nothing that looks like a date without actually being one, this might work.
You could make the regexp stricter by applying more restrictions on what is matched:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4})/
In this case the day is always two digits and the year is always four. Months are listed explicitly (you would have to list all of them).
Update: For ranges it would be a different regexp:
/((?:Jan|Dec) \d+ - \d+, \d{4})/
Obviously they can all be combined together:
/(\d{2} (?:January|December), \d{4}|(?:January|December) \d{2}, \d{4}|(?:Jan|Dec) \d+ - \d+, \d{4})/
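As a rough sketch of the "list all of them" idea, the month alternation can be built once and reused for both orders. This Python version is an illustration under a few assumptions: `MONTHS`, `DATE` and `find_dates` are names made up here, only full English month names are listed, and `\s*,` is used so that the stray space before the comma in "29 December , 2013" is tolerated.

```python
import re

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
MONTH = "(?:" + "|".join(MONTHS) + ")"

# Either "29 December, 2013" or "December 29, 2013".
DATE = re.compile(
    r"\b(?:\d{1,2} " + MONTH + r"\s*,\s*\d{4}|"
    + MONTH + r" \d{1,2}\s*,\s*\d{4})\b"
)


def find_dates(text):
    """Return every date-like substring in either of the two formats."""
    return DATE.findall(text)


print(find_dates("Inquiry at your property for December 29, 2013 - January 03, 2014"))
print(find_dates("Inquiry at your property for 29 December , 2013 - 03 January, 2014"))
```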