Writing regex to exclude time in a date - regex

I am trying to write a regex to capture something but am having a bit of trouble. The text is:
Generated from 10-16-2015 00:00 to 11-13-2015 23:59
I would like to capture 10-16-2015 to 11-13-2015. So basically I want to exclude the time in each one. I got it down to where I can capture the first date using:
(?<=Generated from )\d*-\d*-\d*
But how do I also include the second date and the word "to" within one regex? Any help is appreciated.

Use groups:
^Generated from (\d{2}-\d{2}-\d{4}) \d{2}:\d{2} to (\d{2}-\d{2}-\d{4}) \d{2}:\d{2}$

I would use two capture groups and combine them, something like this:
(?<=Generated from )(\d*-\d*-\d* )\s*\d{2}:\d{2}\s*(to \d*-\d*-\d*)
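To make the "use groups" idea concrete, here is a minimal Python sketch. It assumes the line always has exactly the shape shown in the question; the name `date_range` is made up for illustration. The two captured dates are joined with the word "to" after matching, since a single match cannot skip the time in the middle.

```python
import re

# Two capture groups for the dates; the times are matched but not captured.
PATTERN = re.compile(
    r"^Generated from (\d{2}-\d{2}-\d{4}) \d{2}:\d{2} "
    r"to (\d{2}-\d{2}-\d{4}) \d{2}:\d{2}$"
)

def date_range(line):
    """Return 'date to date' without the times, or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return f"{m.group(1)} to {m.group(2)}"

TEXT = "Generated from 10-16-2015 00:00 to 11-13-2015 23:59"
print(date_range(TEXT))
```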

Related

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column and add just leading 3 numbers to that column (for specific cases and all the others NOT_VALID string). So result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex for the replacement, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
This replaces all with false.
NiFi uses Java regex.
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group (...) optionally (?) captures 021 or 085 at the beginning of the string (^).
The replacement, $1, is the first group.
PS: Sites like https://regex101.com/ help with understanding a regex.
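For checking the pattern outside NiFi, here is a rough Python sketch of the same replacement. It is not NiFi Expression Language, and the helper name `provider` is made up. In Python an unmatched optional group is substituted as an empty string, so the sketch falls back to NOT_VALID in that case. The NiFi expression above does not add that fallback by itself.

```python
import re

# Same pattern as in the NiFi expression: optional known prefix, then the rest.
PREFIX = re.compile(r"^(021|085)?.*")

def provider(mobile):
    """Return the 3-digit prefix if it is 021 or 085, otherwise 'NOT_VALID'."""
    # count=1 mimics replaceFirst; group 1 is empty when no known prefix matched.
    prefix = PREFIX.sub(r"\1", mobile, count=1)
    return prefix or "NOT_VALID"

numbers = ["02146477474", "08585377474", "07646474637", "02158789566", "04578599525"]
for row_id, number in enumerate(numbers, start=1):
    print(f"{row_id},{number},{provider(number)}")
```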

Issue with REGEXP_SUBSTR

I have text in a column like /AB/25MAR92/ and /AB/25MAR1992/. I am trying to extract just 25MAR92 and 25MAR1992 from the column for a date calculation that I have to work on. Can you please help with the REGEXP_SUBSTR function for this issue?
Thanks!
You could try:
\b\d{1,2}[A-Z]{3}\d{2,4}\b
but this will also match 02MAR992. To exclude this possibility use:
\b\d{1,2}[A-Z]{3}(?:\d{2}|\d{4})\b
This will match 02MAR1992 and 02MAR92 but will not match 02MAR992.
I suggest using a pattern like this:
\/(\d{2}[A-Z]{3}(19|20)?\d{2})\/
Years are limited to 1900-2099.
If you do not want to allow just any 2-digit value for the day (\d{2}),
you could use the pattern (0[1-9]|[12][0-9]|3[01]) instead, which matches 01-31:
\/((0[1-9]|[12][0-9]|3[01])[A-Z]{3}(19|20)?\d{2})\/
Or, if you allow dates like /AB/2MAR92/ whose days have no leading zero,
use (0[1-9]|[12][0-9]|3[01]|[1-9]) instead:
\/((0[1-9]|[12][0-9]|3[01]|[1-9])[A-Z]{3}(19|20)?\d{2})\/
I've used / as anchors. If you don't like that, you can use \b.
In reaction to your latest comments, my recommended pattern looks like this:
\b\d{1,2}[A-Z]{3}(?:19|20)?\d{2}\b
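The patterns in this thread can be tried out quickly in Python before putting one into REGEXP_SUBSTR. This is only a sketch using Python's `re` module, whose flavor may differ from your database's (support for `\b` in particular varies). The helper name `extract_date` is made up.

```python
import re

# Day (1-2 digits), three uppercase letters, then a 2- or 4-digit year.
DATE_TOKEN = re.compile(r"\b\d{1,2}[A-Z]{3}(?:\d{2}|\d{4})\b")

def extract_date(value):
    """Return the first date-like token in value, or None if there is none."""
    m = DATE_TOKEN.search(value)
    return m.group(0) if m else None

for sample in ["/AB/25MAR92/", "/AB/25MAR1992/", "/AB/02MAR992/"]:
    print(sample, "->", extract_date(sample))
```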

regex using positive lookahead

My source data text looks something like this:
a1,a2,a3
a4,a5,a6
a7,a8,a9
test="1"
b1,b2,b3
b4,b5,b6
b7,b8,b9
test="2"
c1,c2,c3
c4,c5,c6
c7,c8,c9
test="3"
I need to parse this so the end result looks like this (appropriate “test” field included in each row):
a1,a2,a3,1
a4,a5,a6,1
a7,a8,a9,1
b1,b2,b3,2
b4,b5,b6,2
b7,b8,b9,2
c1,c2,c3,3
c4,c5,c6,3
c7,c8,c9,3
...etc
This is what I started with; it captures the fields correctly:
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+
I understand I need to use lookarounds to capture and include the “test” field on each line.
So something like this added (using a positive lookahead)…
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+(?=test="(?<test>.*?)")
This seems close, but instead of yielding all rows of data it yields only the last row of data, with the test value included, as if it is consuming the lookahead row.
This expression, with its captured groups, is fed into a .NET application that inserts the captured groups as fields in a database table. The number of fields is always static (4 in the example above; field1=f1, field2=f2, field3=f3, field4=test), but the number of records varies.
Any guidance would be appreciated.
Parsing your data to extract the relevant values
You are almost there, but you need to allow the lookahead to skip the rows between the current one and the test line:
(?ms)(?<f1>\w+),(?<f2>\w+),(?<f3>\w+)\R(?=.*?^test="(?<test>\d+)")
\R matches all sorts of newlines, and (?ms) is the inline way of turning on the multiline and dot-matches-all modifiers, so that .*?^test skips every line up to the test one. Note that \R is not available in every engine; in .NET you can use \r?\n instead.
Again, your issue was that \s+ forced the lookahead to be on the line right after the one you were matching.
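Here is the same lookahead idea as a small Python sketch, in case it helps to see every row come out with its test value. Python has no `\R` and names groups with `(?P<name>...)`, so those two details are adapted; the sample data is the one from the question.

```python
import re

DATA = '''a1,a2,a3
a4,a5,a6
a7,a8,a9
test="1"
b1,b2,b3
b4,b5,b6
b7,b8,b9
test="2"
c1,c2,c3
c4,c5,c6
c7,c8,c9
test="3"
'''

# The lookahead scans forward (without consuming) to the next line starting with test=.
ROW = re.compile(
    r'(?ms)(?P<f1>\w+),(?P<f2>\w+),(?P<f3>\w+)\r?\n'
    r'(?=.*?^test="(?P<test>\d+)")'
)

rows = [",".join(m.group("f1", "f2", "f3", "test")) for m in ROW.finditer(DATA)]
print("\n".join(rows))
```

Because the lookahead consumes nothing, the next match starts on the very next row, which is why all nine rows are returned rather than only the ones directly above a test line.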

regex for repeating values

I am trying to find the correct regex (for use with Java and JavaScript) to validate an array of day-of-week and 24-hour time formats. I figured out the time format but am struggling to come up with the full solution.
The regex needs to validate patterns which include one or more of the following, separated by a comma.
{two-character day} HH:MM-HH:MM
Three examples of valid strings would be:
M 5:30-7:00
M 5:30-7:00, T 5:30-7:00, W 18:00-19:30
F 12:00-14:30, Sa 6:45-8:15, Su 6:45-8:15
This should validate a 24-hour time:
/^((M|T|W|Th|Fr|Sa|Su) ([01]?[0-9]|2[0-3]):[0-5][0-9]-([01]?[0-9]|2[0-3]):[0-5][0-9](, )?)+$/
Credit for the time bit goes to mkyong: http://www.mkyong.com/regular-expressions/how-to-validate-time-in-24-hours-format-with-regular-expression/
You can try this:
[A-Za-z]{1,2}[ ]\d+:\d+-\d+:\d+
You could try this: ([MTWFS][ouehra]?) ([0-9]|[1-2][0-9]):([0-6][0-9])-([0-9]|[1-2][0-9]):([0-6][0-9])
I'd go with this:
((M|T(u|h)|W|F|S(a|u)) ((1?\d)|(2[0-3])):[0-5]\d-((1?\d)|(2[0-3])):[0-5]\d(, )?)+
This should do the trick:
^(M|Tu|W|Th|F|Sa|Su) \d{1,2}:\d{2}-\d{1,2}:\d{2}(, (M|Tu|W|Th|F|Sa|Su) \d{1,2}:\d{2}-\d{1,2}:\d{2})*$
Note that you show T in your example above which is ambiguous. You might want to enforce Tu and Th as shown in my regex.
This will capture all sets in an array. The T in the short day-of-week list is debatable (Tuesday or Thursday?).
^((?:[MTWFS]|Tu|Th|Sa|Su)\s(?:[0-9]{1,2}:[0-9]{2})-(?:[0-9]{1,2}:[0-9]{2})(?:,\s)?)+$
The (?:) are non-capturing groups, so your actual matches will be (for example):
M 5:30-7:00
T 5:30-7:00
W 18:00-19:30
But the entire line will validate.
Added ^ and $ for line boundaries and an explicit time-time match because some regular expression parsers may not work with the previous way that I had it.
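One way to combine the answers above is to validate the whole string first and then pull out the individual entries with a second pass. Below is a Python sketch of that approach; `parse_schedule` is a made-up name, and accepting a bare T next to Tu and Th is an assumption based on the question's examples. A repeated capturing group usually keeps only its last repetition, which is why the entries are collected with `findall` rather than read from a group.

```python
import re

DAY = r"(?:M|Tu|Th|T|W|F|Sa|Su)"      # bare T accepted as in the question's examples
TIME = r"(?:[01]?\d|2[0-3]):[0-5]\d"  # 24-hour time, leading zero optional
ENTRY = rf"{DAY} {TIME}-{TIME}"

SCHEDULE = re.compile(rf"{ENTRY}(?:, {ENTRY})*")
ENTRY_RE = re.compile(ENTRY)

def parse_schedule(text):
    """Return the list of entries if the whole string is valid, else None."""
    if SCHEDULE.fullmatch(text) is None:
        return None
    return ENTRY_RE.findall(text)

print(parse_schedule("M 5:30-7:00, T 5:30-7:00, W 18:00-19:30"))
print(parse_schedule("M 25:00-7:00"))
```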

Regex Lookaround with only one match where multiple possible matches available

Got this:
<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>
I want to match only: something two
I tried: (?<=<TAG>)(.*two.*)(?=<\/TAG>)
but got:
something one</TAG><TAG>something two</TAG><TAG>something three
Maybe another example will help:
RECORDsomething betweenRECORD RECORDanything betweenRECORD etc.
I want to get the words between RECORD.
You can use
<TAG>.+?<TAG>(.*?)</TAG>
Your something two is then in $1 of the first match.
Try this:
(?<=</TAG><TAG>)[^<]*(?=</TAG><TAG>)
As already said, parsing HTML using regular expressions is discouraged! There are plenty of HTML parsers for doing this. But if you want a regex at all costs, here is how I would do it in Python:
In [1]: import re
In [2]: s = '<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>'
In [3]: re.findall(r'(?<=<TAG>).*?(?=</TAG>)', s)[1]
Out[3]: 'something two'
However, this solution only works if you always want to extract the content of the second tag pair. But as I said, don't do this.
If you know that the TAG is not the first and not the last, you can do
(?<=.+<TAG>)(.*two.*)(?=<\/TAG>.+)
Of course, it's much better to capture the tags as well and use a capturing group
.*<TAG>(.*two.*?)<\/TAG>
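The over-matching in the question comes from `.*` being allowed to run across tag boundaries. Here is a small Python sketch that avoids it by using `[^<]*` for the tag content, assuming the content never contains a `<` character; the helper name `tag_containing` is made up. The RECORD example is handled with a lazy `.*?` between the two markers.

```python
import re

def tag_containing(text, word):
    """Return the content of the first <TAG>...</TAG> pair that contains word."""
    # [^<]* cannot cross into a neighbouring tag, unlike .*
    pattern = rf"<TAG>([^<]*{re.escape(word)}[^<]*)</TAG>"
    m = re.search(pattern, text)
    return m.group(1) if m else None

s = "<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>"
print(tag_containing(s, "two"))

# Second example: everything between each pair of RECORD markers.
records = "RECORDsomething betweenRECORD RECORDanything betweenRECORD etc."
between = re.findall(r"RECORD(.*?)RECORD", records)
print(between)
```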