Issue with REGEXP_SUBSTR - regex

I have text in a column like /AB/25MAR92/ and /AB/25MAR1992/. I am trying to extract just 25MAR92 and 25MAR1992 from the column for a date calculation that I have to work on. Can you please help with the REGEXP_SUBSTR function for this issue?
Thanks!

You could try:
\b\d{1,2}[A-Z]{3}\d{2,4}\b
but this will also match 02MAR992. To exclude this possibility use:
\b\d{1,2}[A-Z]{3}(?:\d{2}|\d{4})\b
This will match 02MAR1992 and02MAR92 but will not match02MAR992.

I suggest using a pattern like this:
\/(\d{2}[A-Z]{3}(19|20)?\d{2})\/
Years are limited to 1900-2099.
Demo
If you do not want to allow any 2-digit value for the day \d{2},
you could add this pattern instead (0[1-9]|[12][0-9]|3[01]) that matches 01-31;
\/((0[1-9]|[12][0-9]|3[01])[A-Z]{3}(19|20)?\d{2})\/
Or if you allow dates like /AB/2MAR92/ that have days without a leading zero
add (0[1-9]|[12][0-9]|3[01]|[1-9]) instead:
\/((0[1-9]|[12][0-9]|3[01]|[1-9])[A-Z]{3}(19|20)?\d{2})\/
I've used / as anchors. If you don't like that, you can use \b.
In reaction to your latest comments, my recommended pattern looks like this:
\b\d{1,2}[A-Z]{3}(?:19|20)?\d{2}\b

Related

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column and add just leading 3 numbers to that column (for specific cases and all the others NOT_VALID string). So result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use following regex for replacing that. But I would like to use all possible conversations in one step. Using UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')}`)}
This replaces all with false.
Nifi uses Java regex
As soon, as you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group () optionally ? catches 021 or 085 at the beginning of string ^
The replacement - $1 - is the first group
PS: The sites like https://regex101.com/ helps to understand regex

How to extract the year from a URL path using REGEXP_EXTRACT in Google Data Studio?

I'm building out a Google Data Studio dashboard and I need to create a calculated field for the year a post was published. The year is in the URI path, but I'm not sure how to extract it using REGEXP_EXTRACT. I've tried a number of solutions proposed on here but none of them seem to work on Data Studio.
In short, I have a URI like this: /theme/2019/jan/blog-post-2019/
How do I use the REGEXP_EXTRACT function to get the first 2019 after theme/ and before /jan?
Try this:
REGEXP_EXTRACT(Page, 'theme\/([0-9]{4})\/[a-z]{3}\/')
where:
theme\/ means literally "theme/";
([0-9]{4}) is a capturing group containing 4 characters from 0 to 9 (i.e. four digits);
\/[a-z]{3}\/ means a slash, followed by 3 lowercase letters (supposing that you want the regex to match all the months), followed by another slash. If you want something more restrictive, try with \/(?:jan|feb|mar|...)\/ for the last part.
See demo.
As you mentioned, I think you only want to extract the year between the string. The following will achieve that for you.
fit the query as per your needs
SELECT *
FROM Sample_table
WHERE REGEXP_EXTRACT(url, "(?<=\/theme\/)(?<year>\d{4})(?=\/[a-zA-Z]{3})")

Regex that matches a pattern only if string does not begin with 'N'

I need to put together a regex that matches a patter only if string does not begin with 'N'.
Here is my pattern so far [A-E]+[-+]?.
Now I want to make sure that it does not match something like:
N\A
NA
NB+
NB-
NCAB
This is for REGEXP_SUBSTR command in Oracle SQL DB
UPDATE
It looks like I should have been more specific, sorry
I want to extract from a string [A-E]+[-+]? but if the string also matches ^(N|n) then I want my regex to return nothing.
See examples below:
String Returns
N/A
F1/AAA AAA
NABC
FABC ABC
To match a character between A and E not preceded by N, you can use:
([^N]|^)[A-E]+
If you want to avoid fields that contains N[A-E] use a negation in your query using the pattern N[A-E] (in other words, use two predicates, this one to exclude NA and the first to find A)
To be more clear:
WHERE NOT REGEXP_LIKE(coln, 'N[A-E]') AND REGEXP_LIKE(coln, '[A-E]')
Ok I figured it out, I broadened the scope of the problem a little, I realized that I can also play with other parameters of REGEXP_SUBSTR in this case that I can have returned only second substring.
REGEXP_SUBSTR(field1, '^([^NA-D][^A-D]*)?([A-D]+[-+]?)',1,1,'i',2)
I still have to give you guys the credit, lot of good ideas that led me to here.
Just throw a [^N]? in front. That should do it.
OOPS...
That actually needs to include an " OR ^ "...
It should look like this:
([^N]|^)[A-E]+[-+]?
Sorry about that...It looks like the right answer already got posted anyway.

Writing regex to exclude time in a date

I am trying to write a regex to capture something but having a bit of trouble. The text is:
Generated from 10-16-2015 00:00 to 11-13-2015 23:59
I would like to capture 10-16-2015 to 11-13-2015. So basically I want to exclude the time in each one. I got it down to where I can capture the first date using:
(?<=Generated from )\d*-\d*-\d*
But how do I also include the second date and also the word "to"? within one regex. Any help is appreciated
Use groups:
^Generated from (\d{2}-\d{2}-\d{4}) \d{2}:\d{2} to (\d{2}-\d{2}-\d{4}) \d{2}:\d{2}$
Demo
I would use two capture groups and combine them. Something like this
(?<=Generated from )(\d*-\d*-\d* )\s*\d{2}:\d{2}\s*(to \d*-\d*-\d*)

regex for repeating values

I am trying to find the correct regex (for use with Java and JavaScript) to validate an array of day-of-week and 24-hour time formats. I figured out the time format but am struggling to come up with the full solution.
The regex needs to validate patterns which include one or more of the following, separated by a comma.
{two-character day} HH:MM-HH:MM
Three examples of valid strings would be:
M 5:30-7:00
M 5:30-7:00, T 5:30-7:00, W 18:00-19:30
F 12:00-14:30, Sa 6:45-8:15, Su 6:45-8:15
This should validate a 24-hour time:
/^((M|T|W|Th|Fr|Sa|Su) ([01]?[0-9]|2[0-3]):[0-5][0-9]-([01]?[0-9]|2[0-3]):[0-5][0-9](, )?)+$/
Credit for the time bit goes to mkyong: http://www.mkyong.com/regular-expressions/how-to-validate-time-in-24-hours-format-with-regular-expression/
you can try this
[A-Za-z]{1,2}[ ]\d+:\d+-\d+:\d+
You could try this: ([MTWFS][ouehra]?) ([0-9]|[1-2][0-9]):([0-6][0-9])-([0-9]|[1-2][0-9]):([0-6][0-9])
I'd go with this:
(((M|T(u|h)|W|F|S(a|u)) ((1*\d)|(2[0-3])):[1-5]\d-((1*\d)|(2[0-3])):[1-5]\d(, )?)+
This should do the trick:
^(M|Tu|W|Th|F|Sa|Su) \d{1,2}:\d{2}-\d{1,2}:\d{2}(, (M|Tu|W|Th|F|Sa|Su) \d{1,2}:\d{2}-\d{1,2}:\d{2})*$
Note that you show T in your example above which is ambiguous. You might want to enforce Tu and Th as shown in my regex.
This will capture all sets in an array. The T in the short day of week list is debatable (tuesday or thursday?).
^((?:[MTWFS]|Tu|Th|Sa|Su)\s(?:[0-9]{1,2}:[0-9]{2})-(?:[0-9]{1,2}:[0-9]{2})(?:,\s)?)+$
The (?:) are non-capturing groups, so your actual matches will be (for example):
M 5:30-7:00
T 5:30-7:00
W 18:00-19:30
But the entire line will validate.
Added ^ and $ for line boundaries and an explicit time-time match because some regular expression parsers may not work with the previous way that I had it.