How to uncapture string in regex? - regex

I would like to capture the date and time from the text below.
!----------------------------------------------
! 16/Oct/2020 10:11:14 12/Nov/2020 11:21:32
!----------------------------------------------
! 17/Oct/2020 10:11:14
!----------------------------------------------
! 18/Oct/2020 11:00:00 21/Oct/2020 12:00:00
!----------------------------------------------
My regex query:
(?P<StartDate>(?<=!\s)[^\s]+)\s+(?P<StartTime>[^\s]+)\s*(?P<EndDate>[^\s]+)\s+(?P<EndTime>[^\s]+)
However, for the second row it is capturing the exclamation mark and hyphen as well. How can I uncapture those things?

Use the following regex. This is more specific than just the generic [^\s].
Side note, [^\s] can be replaced with \S.
You can make the end date and time optional by wrapping them in a non-capturing group (?:) and then adding a question mark after it to make that group optional: (?:)?
Regex
(?<=!\s)(?P<StartDate>[0-3]?[0-9]\/[A-Za-z]+\/\d+)\s+(?P<StartTime>[0-2]?[0-9]:[0-5]?[0-9]:[0-5]?[0-9])\s+(?:(?P<EndDate>[0-3]?[0-9]\/[A-Za-z]+\/\d+)\s+(?P<EndTime>[0-2]?[0-9]:[0-5]?[0-9]:[0-5]?[0-9]))?
Formatted
(?<=!\s) # Look behind if starts with "! "
(?P<StartDate>
[0-3]?[0-9]
\/
[A-Za-z]+
\/
\d+
)
\s+
(?P<StartTime>
[0-2]?[0-9]
:
[0-5]?[0-9]
:
[0-5]?[0-9]
)
\s+
(?: # non capturing group
(?P<EndDate>
[0-3]?[0-9]
\/
[A-Za-z]+
\/
\d+
)
\s+
(?P<EndTime>
[0-2]?[0-9]
:
[0-5]?[0-9]
:
[0-5]?[0-9]
)
)? # Make this group optional
Demo
https://regex101.com/r/Ky7g45/1
Cons
This will also match invalid day numbers (00 and 32-39) and invalid hours (24-29). If that matters, you'll need to tighten the pattern with alternation (the | operator).
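As a quick check, the pattern above can be tried with Python's re module, which accepts the same (?P<name>...) syntax. This is only a sketch: the variable names are made up for illustration and the sample text is copied from the question.

```python
import re

# Sample text copied from the question.
text = """!----------------------------------------------
! 16/Oct/2020 10:11:14 12/Nov/2020 11:21:32
!----------------------------------------------
! 17/Oct/2020 10:11:14
!----------------------------------------------
! 18/Oct/2020 11:00:00 21/Oct/2020 12:00:00
!----------------------------------------------
"""

# Same pattern as in the answer, split over several raw strings for readability.
pattern = re.compile(
    r"(?<=!\s)"
    r"(?P<StartDate>[0-3]?[0-9]\/[A-Za-z]+\/\d+)\s+"
    r"(?P<StartTime>[0-2]?[0-9]:[0-5]?[0-9]:[0-5]?[0-9])\s+"
    r"(?:(?P<EndDate>[0-3]?[0-9]\/[A-Za-z]+\/\d+)\s+"
    r"(?P<EndTime>[0-2]?[0-9]:[0-5]?[0-9]:[0-5]?[0-9]))?"
)

# One dict per matched row; EndDate/EndTime are None when the row has no end part.
rows = [m.groupdict() for m in pattern.finditer(text)]
for row in rows:
    print(row)
```

Note that the \s+ after StartTime is mandatory, so a start-only row needs some whitespace (for example a newline) after the time to match.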

You should make your capture groups optional (at least the end time/date).
(?P<StartDate>(?<=!\s)[^\s]+)\s+(?P<StartTime>[^\s]+)\s*(?P<EndDate>[^!\s]+)?\s+(?P<EndTime>[^!\s]+)?
Here I make the EndDate and EndTime capture groups optional and also explicitly exclude exclamation marks. Another avenue to explore is making the capture groups more specific, so that they match only a date or time rather than any run of non-whitespace characters.
For example, the dates can be matched with
[0-9]{2}\/[A-Za-z]{3}\/[0-9]{4}
and the times with
[0-9]{2}:[0-9]{2}:[0-9]{2}

A regex quite similar to yours,
with a lookbehind (?<=) and an optional non-capturing group (?:)?:
(?<=!\s)(?P<StartDate>\S+)\s+(?P<StartTime>\S+)(?:\s+(?P<EndDate>\S+)\s+(?P<EndTime>\S+))?
Description and example at: Regex101.com
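Because \s also matches newlines and \S matches any non-whitespace character, the optional group in this pattern can reach into the following separator line when it is run over the whole text at once. One cautious way to use it is line by line, sketched here in Python (the loop and variable names are assumptions, not part of the answer):

```python
import re

pattern = re.compile(
    r"(?<=!\s)(?P<StartDate>\S+)\s+(?P<StartTime>\S+)"
    r"(?:\s+(?P<EndDate>\S+)\s+(?P<EndTime>\S+))?"
)

# A few lines from the question, already split so that no match can span two lines.
lines = [
    "!----------------------------------------------",
    "! 16/Oct/2020 10:11:14 12/Nov/2020 11:21:32",
    "!----------------------------------------------",
    "! 17/Oct/2020 10:11:14",
]

parsed = []
for line in lines:
    m = pattern.search(line)
    if m:  # separator lines have no "! " prefix and are skipped
        parsed.append(m.groupdict())

print(parsed)
```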

Related

Regex : how to optional capture a group

I'm trying to make a substring optional.
Here is the source :
Movie TOTO S09 E22 2022 Copyright
I want to optionally capture the substring : S09 E22
What I have tried so far :
/(Movie)(.*)(S\d\d\s*E\d\d)?/gmi
The problem is that it ends up by matching S09 E22 2022 Copyright instead of just S09 E22 :
Match 1 : 0-33 Movie TOTO S09 E22 2022 Copyright
Group 1 : 0-5 Movie
Group 2: 5-33 TOTO S09 E22 2022 Copyright
Is there any way to fix this issue?
Regards
You get that match because .* is greedy and first matches all the way to the end of the string.
Since (S\d\d\s*E\d\d)? is optional, it is allowed to match nothing at that point, so the overall match already succeeds and the engine has no reason to backtrack.
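The greedy behaviour can be reproduced with a short Python sketch (Python's re module is used here only for illustration; the question itself uses a /.../gmi style pattern):

```python
import re

source = "Movie TOTO S09 E22 2022 Copyright"

# The original attempt: greedy .* followed by an optional group.
m = re.search(r"(Movie)(.*)(S\d\d\s*E\d\d)?", source, re.IGNORECASE)

print(m.span(2), repr(m.group(2)))  # group 2 swallows the rest of the string
print(m.group(3))                   # the optional group did not participate
```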
If you don't want partial matches for S09 or E22 and the 4 digits for the year are not mandatory and you have movies longer than 1 word, with pcre you could use:
\b(Movie)\b\h+((?:(?!\h+[SE]\d+\b).)*)(?:\h(S\d+\h+E\d+))?
\b(Movie)\b Capture the word Movie
( Capture group
(?: Non capture group to repeat as a whole part
(?!\h+[SE]\d+\b). Match any character if either the S01 or E22 part is not directly to the right (where [SE] matches either a S or E char, and \h matches a horizontal whitespace char)
)* Close the non capture group and optionally repeat it
) Close capture group
(?:\h(S\d+\h+E\d+))? Optionally capture the S01 E22 part (where \d+ matches 1 or more digits)
Regex demo
Another option with a capture group for the S01 E22 part, or else match the rest of the line
\b(Movie)\h+([^S\n]*(?:S(?!\d+\h+E\d+\b)[^S\n]*)*+)(S\d+\h+E\d+)?
Regex demo
With your shown samples and attempts, please try the following regex.
^Movie\s+\S+\s+(S\d{2}\s+E\d{2}(?=\s+\d{4}))
Here is the Online Demo for used regex.
Explanation: Adding detailed explanation for used regex above.
^Movie\s+\S+\s+ ##Matching string Movie from starting of value followed by spaces non-spaces and spaces.
(S\d{2}\s+E\d{2} ##Creating one and only capturing group where matching:
##S followed by 2 digits followed by spaces followed by E and 2 digits.
(?=\s+\d{4}) ##Making sure by positive lookahead that previous regex is followed by spaces and 4 digits.
) ##Closing capturing group here.
One idea is to make the dot lazy (.*?) and force it to match up to the end of the string ($) if the other part doesn't exist.
Movie\s*(.*?)\s*(S\d\d\s*E\d\d|$)
See this demo at regex101 (I also added \s* around the captures to absorb the surrounding spaces)
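A minimal Python sketch of the lazy pattern, run on the sample from the question and on a second string without the season/episode part (that second string is invented for illustration):

```python
import re

pattern = re.compile(r"Movie\s*(.*?)\s*(S\d\d\s*E\d\d|$)")

with_episode = pattern.search("Movie TOTO S09 E22 2022 Copyright")
without_episode = pattern.search("Movie TOTO 2022 Copyright")  # hypothetical input

print(with_episode.groups())     # title and the S.. E.. part
print(without_episode.groups())  # second group is empty because $ matched instead
```

When the S.. E.. part is missing, the second group matches the empty string at the end of the input rather than being unset, so callers may want to test for an empty string.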
There are several errors in your regex:
Blank space after Movie is not considered.
(.*) matches everything after Movie.
Try online at https://regex101.com/
(Movie\s*)(\w*\s*)(S\d{2}\s*E\d{2}\s*)?((?:\w*\s*)*)

Pick only the alphabets and not the description from a given string

I am a newbie to Regex and require help with the following:
I have strings like: B - Comp-Band Disk,C - Check Oncoming Private,D - DL Procurement Outer. Is there a regular expression I could use to change the string to B,C,D?
You can use
(?:(?:^|,)(\w))
Regex Explanation
(?: Non-capturing group
(?: Non-capturing group
^|, Match start of the string or ,
) Close non-capturing group
( Capturing group
\w Match any word character
) Close group
) Close non-capturing group
See the demo
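To actually produce B,C,D, the captured characters still have to be joined. A small Python sketch of that step (the variable names are assumptions):

```python
import re

s = "B - Comp-Band Disk,C - Check Oncoming Private,D - DL Procurement Outer"

# findall returns the contents of the single capturing group for every match.
letters = re.findall(r"(?:(?:^|,)(\w))", s)
result = ",".join(letters)

print(result)
```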

Validate string # followed by digits but # increases after every occurrence

I have a string that looks like this
#123##1234###2356####69
It starts with # and followed by any digits, every time the # appears, the number of # increases, first time 1, second time 2, etc.
It's similar to the regex below, but since I don't know how long the pattern goes on, a fixed regex like this is not very useful.
^#\d+##\d+###\d+$
I'm using PCRE regex engine, it allows recursion (?R) and conditions (?(1)...) etc.
Is there a regex to validate this pattern?
Valid
#123
#12##235
#1234##12###368
#1234##12###368####22235#####723356
Invalid
##123
#123###456
#123##456##789
I tried ^(?(1)(?|(#\1)|(#))\d+)+$ but it doesn't seem to work at all
You can do this using PCRE conditional sub-pattern matching:
^(?:((?(1)\1)#)\d+)++$
RegEx Demo
RegEx Details:
^: Start
(?:: Start non-capture group
(: Start capture group #1
(?(1)\1): if/then/else directive that means match back-reference \1 only if 1st capture group is available otherwise match null
#: Match an additional #
): End capture group #1
\d+: Match 1+ digits
)++: End non-capture group. Match 1+ of this non-capture group.
$: End
One option could be optionally matching a backreference to group 1 inside group 1 using a possessive quantifier \1?+# adding # on every iteration.
^(?:(\1?+#)\d+)++$
^ Start of string
(?: Non capture group
(\1?+#)\d+ Capture group 1, match an optional possessive backreference to what is already captured in group 1 and add matching a # followed by 1+ digits
)++ Close the non capture group and repeat 1+ times possessively
$ End of string
Regex demo
I think you can use forward-referencing here:
^(?:((?:\1(?!^)|^)#)\d+)+$
See the regex demo.
Details:
^ - start of string
(?:((?:\1(?!^)|^)#)\d+)+ - one or more occurrences of
((?:\1(?!^)|^)#) - Group 1 (the \1 value): start of string or an occurrence of the Group 1 value if it is not at the string start position
\d+ - one or more digits
$ - end of string.
NOTE: This technique does not work in regex flavors that do not support forward referencing, like ECMAScript based flavors (e.g. JavaScript, VBA, C++ std::regex)
Although there are already working answers, and inspired by Wiktor's answer, I came up with this idea:
(?:(^#|#\1)\d+)+$
It is also quite short and effective (and also works in non-PCRE environments).
See the test cases
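For cross-checking the valid and invalid samples outside PCRE, here is a plain Python sketch. It is not one of the regexes above: Python's built-in re module rejects a backreference to a group that is still open, so this version swaps in a simple count of each # run instead. The function name is hypothetical.

```python
import re

def is_valid(s: str) -> bool:
    """Return True if the n-th run of '#' has exactly n characters."""
    # The whole string must consist of '#' runs, each followed by digits.
    if not re.fullmatch(r"(?:#+\d+)+", s):
        return False
    runs = re.findall(r"#+", s)
    # The first run must have length 1, the second length 2, and so on.
    return all(len(run) == i for i, run in enumerate(runs, start=1))

valid = ["#123", "#12##235", "#1234##12###368",
         "#1234##12###368####22235#####723356"]
invalid = ["##123", "#123###456", "#123##456##789"]

print([is_valid(s) for s in valid])
print([is_valid(s) for s in invalid])
```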

Regex doesn't ignore the optional groups

I'm trying to create a regex that matches my URL and its optional groups. The regex works fine if the URL is complete, but the optional groups are not optional at all.
Regex :
\/(.+)(?:\/(.+))(?:(?:\?(.+)))
Urls to catch :
/taxi
/taxi/lyon
/taxi/lyon?coordinates=7542
https://regex101.com/r/NKFkwq/4/
As you can see, the third line is matched, but I'd like the first and second to match too.
I thought ?: would be enough to do that, but I missed something...
Thanks a lot for your help !
Cheers
EDIT and answer
Thanks to the commenters for helping me. Here is the working regex (the one I expected): https://regex101.com/r/NKFkwq/8
Indeed, ?: only stops a group from capturing; it does not make the group optional.
Your pattern consists of capturing and non capturing groups. The (?: denotes a non capturing group.
If you want to match all 3 lines, you could match the part starting from the first forward slash and make the part starting from the second forward slash optional.
^/[^\s/]+(?:/[^\s/]+)?$
^ Start of string
/[^\s/]+ Match / and match 1+ times any char except a whitespace or /
(?: Non capturing group
/[^\s/]+ Match / and match 1+ times any char except a whitespace or /
)? Close non capturing group and make it optional
$ End of string
Regex demo
If you want to have capturing groups, but don't want to match /taxi?coordinates=7542 you could nest the groups and make them optional as well.
^/\w+(/\w+(\?\S*)?)?$
^ Start of string
/\w+ Match / and 1+ word chars
( Capture group 1
/\w+ Match / and 1+ word chars
( Capture group 2
\?\S* Match ? and 0+ times a non whitespace char
)? Close group 2
)? Close group 1
$ End of string
Regex demo
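The nested optional groups can be checked with a short Python sketch; the URL list comes from the question plus the /taxi?coordinates=7542 case mentioned above, and the variable names are assumptions.

```python
import re

pattern = re.compile(r"^/\w+(/\w+(\?\S*)?)?$")

urls = [
    "/taxi",
    "/taxi/lyon",
    "/taxi/lyon?coordinates=7542",
    "/taxi?coordinates=7542",  # should not match: the query needs a second segment
]

results = {}
for url in urls:
    m = pattern.match(url)
    # Store the captured groups, or None when the URL does not match at all.
    results[url] = m.groups() if m else None

for url, groups in results.items():
    print(url, groups)
```

Groups that did not take part in the match come back as None, which is how an optional group that was skipped can be told apart from one that matched.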

How can I match everything between 2 commas?

I want to match basically any text that has a comma separated list of weekdays.
(?i)(every (mon|tue|wed|thu|fri|sat|sun)[A-Za-z]{3,5}, .*+,
(mon|tue|wed|thu|fri|sat|sun)[A-Za-z]{3,5})
Above is what I have, and I want it to match the following strings. I don't need help with the case where only 2 weekdays are supplied.
Every mon, tue, wednesday
Every wed, Saturday, Friday, sun.
Try pattern: (?<=,|^)[^,\n]+
Explanation
(?<=,|^) - positive lookbehind: assert what precedes is comma , or beginning of the string ^
[^,\n]+ - match one or more characters other than comma , or newline \n
Demo
You might list the abbreviations and optionally match the full name by listing them using an alternation followed by a comma and a space.
Add that to a group and repeat that 0+ times. After that add the group without a comma to make sure you match at least a single day.
(?i)\bevery (?:(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?), )*(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?)\b
Explanation
(?i)\bevery Case insensitive modifier, a word boundary, then match every followed by a space
(?: Non capturing group
(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?), Match any of the listed followed by a comma and space
)* Close non capturing group and repeat 0+ times
(?: Non capturing group
mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)? Match any of the listed
)\b Close non capturing group and add a word boundary to prevent being part of a larger word
Regex demo
To match only lists of multiple days, you could change the * quantifier of the first non capturing group to, for example, + or {2,}.
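A Python sketch of the same idea, building the pattern from one day alternation so that it is not written out twice. The names DAY, any_count and two_or_more are made up for illustration, and re.IGNORECASE stands in for the inline (?i).

```python
import re

# Abbreviation with an optional full-name suffix, as in the pattern above.
DAY = (r"(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?"
       r"|fri(?:day)?|sat(?:urday)?|sun(?:day)?)")

# Zero or more "day, " items followed by a final day.
any_count = re.compile(rf"\bevery (?:{DAY}, )*{DAY}\b", re.IGNORECASE)
# Variant with + instead of *, which requires at least two days.
two_or_more = re.compile(rf"\bevery (?:{DAY}, )+{DAY}\b", re.IGNORECASE)

samples = ["Every mon, tue, wednesday", "Every wed, Saturday, Friday, sun."]
matches = [any_count.search(s).group(0) for s in samples]

print(matches)
print(two_or_more.search("Every mon"))  # a single day is rejected by the + variant
```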