Regular expression bracket mystery in R - regex

I'm trying to use str_extract to find dates in a text document. However, I've run into a bit of a conundrum. Generally I expect dates to come in one of two forms: 1) June 15th, 1914 2) June 15, 1914. But when I try to build a pattern to catch both of these options, I get an NA result.
For example, if I try to str_extract("No. 1. June 20th, 1914.", "[:alpha:]{3,8} [0-9]{1,2}[[a-z]{2}]?, [0-9]{4}"), I get NA. But if I remove the brackets around [a-z]{2} it works. However, if I remove the brackets, I of course get an NA for the string "No. 1. June 20, 1914.". This does, however, work if I leave the brackets.
I can of course work around this by using a simple if/else if statement, but I'm curious as to why this isn't working, and if there is a better way to handle these combined cases.

If you're trying to extract dates, why not use the lubridate package?
> lubridate::mdy("No. 1. June 20th, 1914.")
[1] "1914-01-20 UTC"
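(where mdy tells lubridate that the date data appears in month-day-year order). Note, though, that the result shown above is January 20, not June 20: the parser appears to have picked up the "1" in "No. 1." as the month, so the date substring should be isolated before parsing.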

It's not working for the following reasons:
Your POSIX character class is not properly wrapped inside a bracketed expression: it should be [[:alpha:]], not [:alpha:].
You're trying to use a character class as an optional group construct: square brackets define a set of characters, whereas grouping requires parentheses.
Fixed, your regular expression would look like:
x <- 'No. 1. June 20th, 1914.'
str_extract(x, '[[:alpha:]]{3,8} [0-9]{1,2}([a-z]{2})?, [0-9]{4}')
## [1] "June 20th, 1914"
You could modify your regular expression:
str_extract(x, '[a-zA-Z]+ \\d{1,2}([a-z]{2})?, \\d{4}')
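As a quick check, the corrected pattern can be run against both date forms from the question. This is a minimal sketch that assumes base R's regmatches() and regexpr() in place of stringr's str_extract, so that it runs without extra packages; the variable names are illustrative.

```r
# Both date forms from the question
x <- c("No. 1. June 20th, 1914.", "No. 1. June 20, 1914.")

# [[:alpha:]] is the POSIX class wrapped in a bracket expression;
# ([a-z]{2})? is an optional group for the two-letter ordinal suffix
pattern <- "[[:alpha:]]{3,8} [0-9]{1,2}([a-z]{2})?, [0-9]{4}"

# regexpr() locates the first match in each string, regmatches() extracts it
dates <- regmatches(x, regexpr(pattern, x))
print(dates)
## [1] "June 20th, 1914" "June 20, 1914"
```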

> str_extract("No. 1. June 20, 1914.", "[[:alpha:]]{3,8} [[:digit:]]{1,2}.+?, [[:digit:]]{4}")
[1] "June 20, 1914"
> str_extract("No. 1. June 20th, 1914.", "[[:alpha:]]{3,8} [[:digit:]]{1,2}.+?, [[:digit:]]{4}")
[1] "June 20th, 1914"
The . matches any character, and the lazy quantifier +? matches one or more of them but as few as possible, so .+? consumes only what is needed to reach the first ", " that is followed by four digits. Because + requires at least one character, in "June 20, 1914" the engine backtracks so that [[:digit:]]{1,2} matches only "2" and .+? matches "0".
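To see what the ? after + changes, the lazy pattern can be compared with a greedy variant on a string that contains two comma-year sequences. This is a sketch only: the sample string is made up for illustration, and base R with perl = TRUE is assumed instead of stringr.

```r
# Hypothetical input with two ", YYYY" sequences
x <- "June 20th, 1914, reprinted May 3, 1920"

lazy   <- "[[:alpha:]]{3,8} [[:digit:]]{1,2}.+?, [[:digit:]]{4}"
greedy <- "[[:alpha:]]{3,8} [[:digit:]]{1,2}.+, [[:digit:]]{4}"

# .+? stops at the first ", " followed by four digits
lazy_match <- regmatches(x, regexpr(lazy, x, perl = TRUE))

# .+ runs to the end of the string and backtracks to the last such sequence
greedy_match <- regmatches(x, regexpr(greedy, x, perl = TRUE))

print(lazy_match)    # "June 20th, 1914"
print(greedy_match)  # "June 20th, 1914, reprinted May 3, 1920"
```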

Related

Extract just the part of string that matches a regex pattern in R

I built a data frame scraped automatically from a webpage, in which one of the variables is a date in text form, such as "May 12".
However, some observations come with extra characters (in some cases weird ones) attached after the date, for example: "May 20 õ", "Dez 1", "Oct 12ABCdáé".
For those cases, I want to keep only the correct characters, e.g. "Dec 24", "Oct 1".
After googling for a solution several times and trying functions like sub, gsub and grep, I could not find a function that works.
I see that regular expressions have a steep learning curve, but using the tool http://regexr.com/ I was able to define a regular expression that matches the pattern in the observations where the problem appears: ([A-Z]{1}[a-z]{2})\s\d+.*
At this moment, I have the following example:
vector = c("May 20", "Dez 1", "Oct 12ABCdáé")
And the last solution I tried is:
dateformat = gsub(pattern = "([A-Z]{1}[a-z]{2})\\s\\d+.*", replacement = "([A-Z]{1}[a-z]{2})\\s\\d+", x = vector)
But of course this gives me a replacement with the text string "([A-Z]{1}[a-z]{2})\s\d+" on each of them.
dateformat
[1] "([A-Z]{1}[a-z]{2})sd+" "([A-Z]{1}[a-z]{2})sd+"
[3] "([A-Z]{1}[a-z]{2})sd+"
I really do not understand what I have to include in the replacement argument to remove the bad characters if they exist.
I added a capture group and a back-reference "\\1":
sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector)
[1] "May 20" "Dez 1" "Oct 12"
The replacement argument accepts back-references like '\\1', but not regex patterns as you used. The back-reference refers back to the pattern you created and the capture group you defined. In this case our capture group was the abbreviated month and day, which we marked with parentheses (..). Any text captured within those parentheses is returned when "\\1" is placed in the replacement argument.
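A small sketch of how numbered back-references map to capture groups, using two groups instead of one. The input is an ASCII-only variant of the question's vector, and the names pattern, cleaned and swapped are illustrative.

```r
# ASCII-only variant of the example data
vector <- c("May 20", "Dez 1", "Oct 12ABCxyz")

# Group 1 captures the month abbreviation, group 2 the day number;
# the trailing .* consumes (and thereby drops) anything after the day
pattern <- "^([A-Z][a-z]{2})\\s(\\d+).*"

# "\\1" and "\\2" re-insert the captured text in the replacement
cleaned <- sub(pattern, "\\1 \\2", vector)
swapped <- sub(pattern, "\\2 \\1", vector)

print(cleaned)  # "May 20" "Dez 1"  "Oct 12"
print(swapped)  # "20 May" "1 Dez"  "12 Oct"
```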
This quick-start guide may help
We could also try
sub("\\s*[^0-9]+$", "", vector)
#[1] "May 20" "Dez 1" "Oct 12"
In case anyone else is interested in the performance of these different approaches, here is a reproducible example comparing Pierre's approach to akrun's approach.
It shows akrun's approach is marginally faster (median of about 168 vs. 173 microseconds):
library(microbenchmark)
set.seed(1234)
# Original poster's data
# vector <- c("May 20", "Dez 1", "Oct 12ABCdáé")
# Increased the size to 200
vector <- sample(c("May 20", "Dez 1", "Oct 12ABCdáé"), 200L, replace = TRUE)
# Comparison of timings with 10000 repetitions
microbenchmark(
pierre_l = sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector),
akrun = sub("\\s*[^0-9]+$", "", vector),
times = 10000L
)
#> Unit: microseconds
#>      expr     min      lq     mean  median       uq     max neval
#>  pierre_l 164.201 169.201 233.5096 173.302 220.2515 17809.1 10000
#>     akrun 159.001 164.202 228.9020 168.200 212.7010 13443.5 10000
Created on 2022-03-24 by the reprex package (v2.0.1)

Regex to match some dates matching non-dates

I'm using some Regex to find date strings of the form Jan 12, 2015 or Feb 3, 1999.
The regex I'm using is \w+\s\d{1,2},\s\d{4} and it finds the dates correctly, but the file also contains strings of the form
Weg 58, 4047 or Strasse 1, 4482, and these are matched as well.
How can I avoid those non-date matches? My approach is:
The first string (the one of the month, Jan, Feb, etc.) has to have always length 3.
The year has to start with 1 or 2.
The thing is that I don't know how to add these two conditions to my regex. Any help, please?
You can make the test right here: https://regex101.com/r/bN2pO0/1
Thanks in advance.
Since the months won't change (i.e. they are always one of January through December), we can list their three-letter abbreviations explicitly.
We can then use an OR | operator to select years starting with 1 or 2
/((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s(1|2)\d{3})/ig
https://regex101.com/r/bN2pO0/3
Just as you used \d{1,2} to match a digit 1 or 2 times and \d{4} to match a digit 4 times, you can use \w{3} to match a word character 3 times.
For the year, you can use the pipe "or" operator |.
\w{3}\s\d{1,2},\s(?:1|2)\d{3}
However, this will also match non-dates of the form Abc xy, 1xyz (where x, y and z are digits).
If you want, you can go with a brute-force approach, or just get rid of the regex and use code to capture the dates.
Brute force:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s[0-3]?[0-9],\s[12]\d{3}
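The month-list idea can be tried out in R against the strings from the question. This is a hedged sketch: it builds the alternation from R's built-in month.abb constant, and it assumes each element is a whole candidate string, which is why the pattern is anchored with ^ and $.

```r
# Two dates and the two false positives from the question
x <- c("Jan 12, 2015", "Feb 3, 1999", "Weg 58, 4047", "Strasse 1, 4482")

# month.abb is base R's vector of English abbreviations "Jan" ... "Dec"
months  <- paste(month.abb, collapse = "|")

# Month abbreviation, 1-2 digit day, comma, year starting with 1 or 2
pattern <- paste0("^(?:", months, ")\\s\\d{1,2},\\s[12]\\d{3}$")

is_date <- grepl(pattern, x, perl = TRUE)
print(is_date)
## [1]  TRUE  TRUE FALSE FALSE
```

Drop the anchors and use regmatches() with gregexpr() instead if the dates are embedded in longer text.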

Error trapping with regex

I have the following dataframe
ColumnA=c("Kuala Lumpur Sector 2 new","old Jakarta Sector31", "Sector 9, 7 Hong Kong","Jakarta new Sector22")
and am extracting the Sector number to a separate column
gsub(".*Sector ?([0-9]+).*","\\1",ColumnA)
Is there a more elegant way to capture errors if 'Sector' does not appear on one line than an if else statement?
If the word 'Sector' does not appear on one line I simply want to set the value of that row to blank.
I thought of using str_detect first to see if 'Sector' was there TRUE/FALSE, but this is quite an ugly solution.
Thanks for any help.
If the word 'Sector' does not appear on one line I simply want to set the value of that row to blank.
To achieve that, use alternation operator |:
ColumnA=c("Kuala Lumpur 2 new","old Jakarta Sector31", "Sector 9, 7 Hong Kong","Jakarta new Sector22")
gsub("^(?:.*Sector ?([0-9]+).*|.*)$","\\1",ColumnA)
Result: [1] "" "31" "9" "22" (as Kuala Lumpur 2 new has no Sector, the second part with no capturing group matched the whole string).
See IDEONE demo
library(stringr)
as.vector(sapply(str_extract(ColumnA, "(?<=Sector\\s{0,10})([0-9]+)"),function(x) replace(x,is.na(x),'')))
I think this is what you need.
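Another way to get a blank for rows without 'Sector', sketched here with base R only: regexec() returns the capture group alongside the full match, and non-matching rows come back as empty vectors that can be mapped to "". The names hits and sector are illustrative, not from the answers above.

```r
ColumnA <- c("Kuala Lumpur 2 new", "old Jakarta Sector31",
             "Sector 9, 7 Hong Kong", "Jakarta new Sector22")

# For each string: c(full match, group 1), or character(0) when no match
hits <- regmatches(ColumnA, regexec("Sector ?([0-9]+)", ColumnA))

# Take the captured number when present, otherwise an empty string
sector <- vapply(hits,
                 function(h) if (length(h) >= 2) h[2] else "",
                 character(1), USE.NAMES = FALSE)
print(sector)
## [1] ""   "31" "9"  "22"
```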

Regex: How to match a unix datestamp?

I'd like to be able to match this entire line (to highlight this sort of thing in vim): Fri Mar 18 14:10:23 ICT 2011. I'm trying to do it by finding a line that contains ICT 20 (the first two digits of the year), like this: syntax match myDate /^*ICT 20*$/, but I can't get it working. I'm very new to regex. Basically what I want to say: find a line that contains "ICT 20" and can have anything on either side of it, and match that whole line. Is there an easy way to do this?
.*ICT 20.*
should do the trick. . is a wildcard that matches any character, and * means you can have 0 or more of the pattern it follows (e.g. ba(na)* will match ba, banana, bananananana and so on).
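The behaviour of . and * can be checked outside the editor. The sketch below uses R rather than Vim's own pattern syntax, and the sample lines are made up apart from the timestamp quoted in the question.

```r
# Hypothetical file contents; only the first line comes from the question
lines <- c("Fri Mar 18 14:10:23 ICT 2011",
           "no timestamp on this line",
           "Sat Mar 19 09:00:00 ICT 2011")

# .* on both sides lets the match span the whole line around "ICT 20"
matched <- grep(".*ICT 20.*", lines, value = TRUE)
print(matched)

# (na)* means zero or more repetitions of "na"
banana <- grepl("^ba(na)*$", c("ba", "banana", "bananana", "ban"))
print(banana)
## [1]  TRUE  TRUE  TRUE FALSE
```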

Modify regex to match dates with ordinals "st", "nd", "rd", "th"

How can the regex below be modified to match dates with ordinals on the day part? This regex matches "Jan 1, 2003 | February 29, 2004 | November 02, 3202" but I need it to match also: "Jan 1st, 2003 | February 29th, 2004 | November 02nd, 3202 | March 3rd, 2010"
^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))
Thank you.
This will depend on your use case, but in the interest of pragmatism, you might do well to just match anything matching:
(1) any month name or abbreviation;
(2) whitespace;
(3) any one or two digits;
(4) whitespace;
(5) any st,nd,rd,th;
(6) whitespace OR comma + optional whitespace;
(7) any four digits;
I'm not sure what you're matching in, but if I had Jan 35nd,3001, I think I'd rather capture it now and invalidate it later than to just skip over it right at the get-go.
Also, depending on your data set, consider case sensitivity issues and common international English variants, like 1 Jan 2004 or 1st Jan, 2004 or January, 2004 etc.
line breaks added
^(?:j(?:an(?:uary)?|un(?:e)?|ul(?:y)?)?|feb(?:ruary)?|ma(?:r(?:ch)?|y)
|a(?:pr(?:il)?|ug(?:ust)?)|sep(?:t|tember)?|oct(?:ober)?|(?:nov|dec)(?:ember)?)
\s+\d{1,2}(?:st|nd|rd|th)?(?:\s+|,\s*)\d{4}\b
Even more pragmatic (and readable), unless you have a very bizarre dataset, is to allow anything after the common prefixes:
(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*?\s+\d{1,2}(?:[a-z]{2})?(?:\s+|,\s*)\d{4}\b
Would this match octagenarianism 99xx, 0000 ? Yes. Is that likely to be an issue? I doubt it.
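For illustration, the readable variant can be exercised in R against the dates from the question plus one non-date. This is a sketch under assumptions: R's grepl() with perl = TRUE and ignore.case = TRUE stands in for whatever engine is actually in use, and the last test string is borrowed from an unrelated address example.

```r
# The pragmatic pattern, split across two strings for readability
pattern <- paste0(
  "(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*?",
  "\\s+\\d{1,2}(?:[a-z]{2})?(?:\\s+|,\\s*)\\d{4}\\b"
)

x <- c("Jan 1st, 2003", "February 29th, 2004", "November 02nd, 3202",
       "March 3rd, 2010", "Jan 1, 2003", "Weg 58, 4047")

# ignore.case lets the lowercase month prefixes match capitalised input
matches <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
print(matches)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
```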
That regex is doing waaaaay too much. You'd be much better off using your language's equivalent of strptime(). However, the regex below will match ordinals:
^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31(st)?)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))(st|nd|rd|th)?|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(th)?(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))(st|nd|rd|th)?))\,\ ((1[6-9]|[2-9]\d)\d{2}))
Note that it will also match things like "20nd" but the likelihood of encountering that in real data is way too low to bother caring in most cases.
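As a rough sketch of the strptime() route in R: strip the ordinal suffix with a simple regex, then let the date parser do the validation. It assumes English month names, so the time locale is set to "C" first; the last input is a deliberately impossible date added for illustration.

```r
# Use the C locale so that %B matches English month names
invisible(Sys.setlocale("LC_TIME", "C"))

x <- c("Jan 1st, 2003", "February 29th, 2004",
       "March 3rd, 2010", "February 30th, 2004")

# Drop st/nd/rd/th when it directly follows a digit and precedes the comma
no_ordinal <- sub("(\\d)(st|nd|rd|th),", "\\1,", x)

# On input, %B accepts full and abbreviated month names;
# strings that are not real calendar dates come back as NA
parsed <- as.Date(no_ordinal, format = "%B %d, %Y")
print(parsed)
```

Here the leap-year and days-per-month logic is delegated to the parser, so the regex only has to remove the suffix.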