Regex - optional capture group after wildcard - regex

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parentheses, if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining one pattern with week capture with a pattern without week capture works, but there has to be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and javascript but I guess this is a general regex question.

Your regex is almost there!
First and most importantly, you should add a $ at the end. This forces (?<title>.*?) to keep matching until the end of the string can be reached. Currently, (?<title>.*?) matches an empty string and then stops, because it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match an empty string. By adding the $, you make the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is what your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
Demo
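To see the effect of the $ in practice, here is a minimal JavaScript sketch that runs both versions of the pattern over the list from the question. The helper name parseLine is made up for this illustration, and it assumes an engine with named capture groups (any recent Node or browser).

```javascript
// Sample lines from the question.
const lines = [
  "No 1 And Your Bird Can Sing (4)",
  "No 2 Baby, You're a Rich Man (5)",
  "No 3 Blue Jay Way S",
  "No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)",
];

// Without $, the lazy title is satisfied by an empty string.
const unanchored = /No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?/;
// With $, the title has to expand until the rest of the line is accounted for.
const anchored = /No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$/;

// parseLine is a hypothetical helper, not part of the original answer.
function parseLine(regex, line) {
  const m = regex.exec(line);
  if (!m) return null;
  const { no, title, weeks } = m.groups;
  return { no, title, weeks: weeks ?? null }; // null when "(n)" is absent
}

const parsed = lines.map((line) => parseLine(anchored, line));
console.log(parsed);
console.log(parseLine(unanchored, lines[0])); // title comes back empty here
```

The same pattern should also work in C# as written, since .NET uses the same (?<name>...) syntax.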

You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
RegEx Demo

Another option could be for the title to match either not ( or when it does encounter a ( it should not be followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?<no>\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parentheses
)? Close group and make it optional
Regex demo
If you don't want to trim the last space of the title you could exclude it from matching before the weeks.
Regex demo
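As a rough check of this variant, the sketch below applies the pattern in JavaScript and trims the trailing space from the title afterwards. The third sample line, with a parenthesised word inside the title, is a made-up input added only to show that a ( not followed by a digit and ) stays part of the title.

```javascript
// Title = runs of non-"(" chars, or a "(" that does not start "(digit)".
const pattern =
  /^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?/;

const samples = [
  "No 1 And Your Bird Can Sing (4)",
  "No 3 Blue Jay Way S",
  "No 5 Help (Live) (2)", // hypothetical line, not from the question
];

const rows = samples.map((line) => {
  const { no, title, weeks } = pattern.exec(line).groups;
  // The title group swallows the space before "(n)", so trim it here.
  return { no, title: title.trimEnd(), weeks: weeks ?? null };
});

console.log(rows);
```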

Related

Regex that matches two or three words, but does not capture the third if it is a specific word

I need to match a specific pattern but I'm unable to do it with regular expressions. I'm looking for people's name. It follows always the same patterns. Some combinations are:
Mr. Snow
Mr. John Snow
Mr. John Snow (Winterfall of the nord lands)
My problem comes when sometimes I have things like: Mr. Snow and Ms. Stark. It captures also the and. So I'm looking for a regular expression that does not capture the second name only if it is and. Here I'm looking for ["Mr. Snow", "Ms. Stark"].
My best try is as follows:
(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?).
Note that the second name is in a non-capturing group, because I was thinking of using a negative look-ahead. But if I do that, the first word is not captured (because the entire name does not match), and I need that to be captured.
Any Ideas?
Here is some text to fast check.
Here is my two cents:
\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b
See an online demo
\b - A word-boundary;
M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group to match an uppercase letter followed by lowercase letters and 0+ 2nd names concatenated through whitespace or hyphen;
\b - A word-boundary.
As it is the name of a person, you could also check that the first letter of each word is uppercase.
M[rs].\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?
See the regex demo
Matching names is difficult, see this page for a nice article:
Falsehoods Programmers Believe About Names.
For the examples that you have given, you might use:
\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*
Explanation
\b A word boundary
M[rs]\. Match either Mr or Ms followed by a dot (note to escape it)
(?: Non capture group
Match a space (or \s+ if you want to allow newlines)
(?!M[rs]\.|and ) Negative lookahead, assert that from the current position there is not Mr or Ms or and directly to the right
\w+ Match 1+ word characters
)* Close the non capture group and optionally repeat it
Regex demo
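A quick way to try this pattern is a global match in JavaScript. This is only a sketch over the example strings from the question; the variable names are arbitrary.

```javascript
// Title, then any number of space-separated words that are neither
// another title (Mr./Ms.) nor the word "and".
const namePattern = /\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*/g;

const inputs = [
  "Mr. Snow",
  "Mr. John Snow (Winterfall of the nord lands)",
  "Mr. Snow and Ms. Stark",
];

// String.prototype.match with the g flag returns every full match.
const found = inputs.map((text) => text.match(namePattern));
console.log(found);
```

Note that any other word following a name (not just "and") would still be picked up, so the lookahead may need more alternatives for real text.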
This captures the first name in group 1 and the second in group 2 if the second name exists and is not and:
(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?
See live demo.
If you want to capture the title as group 1 and the names as groups 2 and 3, change the look behind to a capture group:
(M[rs]\.) (\w+)(?: (?!and)(\w+))?
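For illustration, the three-group version could be used from JavaScript roughly like this. extractNames is a hypothetical helper name, and the sketch assumes matchAll is available.

```javascript
// Group 1: title, group 2: first word, group 3: optional second word
// (skipped when the next word starts with "and").
const titled = /(M[rs]\.) (\w+)(?: (?!and)(\w+))?/g;

// extractNames is a hypothetical helper for this example.
function extractNames(text) {
  return [...text.matchAll(titled)].map(([, title, first, second]) => ({
    title,
    first,
    second: second ?? null, // null when group 3 did not participate
  }));
}

console.log(extractNames("Mr. Snow and Ms. Stark"));
console.log(extractNames("Mr. John Snow"));
```

Because (?!and) has no trailing boundary, a second name that merely starts with lowercase "and" would also be rejected; (?!and\b) is a possible tightening.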

Using regex to parse/reduce text file

edit
I've realized I made a mistake when explaining myself. Apologies for that.
Most of the artifacts come from this path:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\
then breaks into Artifact folders and its sub-folders like this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
I would appreciate help with the following:
I have this list (around 5k rows) of paths to different artifacts and they have different versions, to give you an example:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.3\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.3\data.xxx
And my goal to achieve is this:
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
Basically to scope it down to just 1 version.
I've tried using ^(.*)(\n\1)+$ and $1. but that obviously didn't work. So I was wondering if you have an idea how to approach this. Greatly appreciate help, thanks!
You can use
Find what: ^(.*\.)(\d+)\\[^\\\n]+(\n\1\d+\\[^\\\n]+)+$
Replace: $1$2\\
See the regex demo. Details:
^ - start of a line (it is the default ^ behavior in Visual Studio Code)
(.*\.) - Group 1: any zero or more chars other than line break chars, as many as possible, and then a .
(\d+) - Group 2: one or more digits
\\ - a \ char
[^\\\n]+ - one or more chars other than \ and a line break
(\n\1\d+\\[^\\\n]+)+ - Group 3 capturing one or more sequences of a line break and then the value captured into Group 1, one or more digits, a \ char and then one or more chars other than \ and a line break
$ - end of a line.
Here is another attempt, see regex101 demo.
The basic idea is to isolate someText-\d?. in capture group 2.
Then look for $2 in following lines. What precedes $2 or follows $2 in those following lines can vary.
Find: ^(.*\\(?=.*\\))(.*-\d+\.)(.*\\?.*)(\n.*\2.*)*
Replace: $1$2$3
So here is the most interesting part: ^(.*\\(?=.*\\))(.*-\d+\.)
This will get your Artifact-1. or Artifact-17. or someText-2. into capture group 2. Because using a positive lookahead (?=.*\\) the following group 2 (.*-\d+\.) will be in the last directory only. And then (.*\\?.*) gathers the rest of that line into group 3.
Finally (\n.*\2.*)* checks to see if there is a backreference to group 2, \2, in any following lines. [Technically, that backreference could be anywhere in a line, even the beginning, that can be fixed if necessary - let me know if you need that for your data. See safer regex101 demo if 'someText-/d.' could appear anywhere and should be ignored if not last directory and use that find.]
You cannot use a single capture group for the whole line using ^(.*), as you want to repeat only the part before the last dot using a backreference, and that will not work when capturing the whole line.
Therefore you have to capture the digits in the first match in a separate capture group to keep it in the replacement.
If you want to match all following lines with the same text before the last dot, you can use a repeating group:
^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d+\\[^\\\r\n]*)+
The pattern matches:
^ Start of string
\s* Match optional whitespace chars
(.*\.) Capture group 1, match till the last dot
(\d+\\[^\\\r\n]*) Capture group 2, match 1+ digits, \ and optional chars other than \ or a newline
(?: Non capture group
\r?\n\s*\1 Match a newline and a backreference to group 1
\d+\\[^\\\r\n]* Same pattern as in the first part
)+ Close the non capture group and repeat 1+ times
See a regex demo.
In the replacement use the 2 capture groups $1$2
The replacement will look like
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder\Artifact\Artifact-1.0\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder2\Artifact\Artifact-1.1\data.xxx
D:\Folder1\Folder2\Folder3\Folder4\Folder5\ArtifactFolder3\Artifact\Artifact-1.2\data.xxx
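Outside an editor, the same find/replace can be sketched in JavaScript. The snippet below rebuilds the sample list from the question and applies the repeating-group pattern (with \d+ in the repeated part, as in the breakdown) using the g and m flags; the variable names are arbitrary.

```javascript
// Rebuild the sample list from the question.
const base = String.raw`D:\Folder1\Folder2\Folder3\Folder4\Folder5`;
const entries = [
  ["ArtifactFolder", "1.0"], ["ArtifactFolder", "1.1"], ["ArtifactFolder", "1.2"],
  ["ArtifactFolder2", "1.1"], ["ArtifactFolder2", "1.2"], ["ArtifactFolder2", "1.3"],
  ["ArtifactFolder3", "1.2"], ["ArtifactFolder3", "1.3"],
];
const input = entries
  .map(([folder, v]) => `${base}\\${folder}\\Artifact\\Artifact-${v}\\data.xxx`)
  .join("\n");

// Group 1: everything up to the dot before the minor version.
// Group 2: the rest of the first line. The non-capturing group then
// consumes every following line that starts with the same group 1 text.
const sameArtifact = /^\s*(.*\.)(\d+\\[^\\\r\n]*)(?:\r?\n\s*\1\d+\\[^\\\r\n]*)+/gm;

// Keep only the first line of each run.
const result = input.replace(sameArtifact, "$1$2");
console.log(result);
```

This relies on the lines for one artifact being adjacent, as they are in the sample.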

Regular expression with multiline matching (subtitles strings)

Need some help in regexp matching pattern.
The text goes like here (it's subtitles for video)
...
223
00:20:47,920 --> 00:20:57,520
- Hello! This is good subtitle text.
- Yes! How are you, stackoverflow?
224
00:20:57,520 --> 00:21:11,120
Wow, seems amazing.
- We're good, thanks.
Like, you know, everyone is happy around here with their laptops.
225
00:21:11,120 --> 00:21:14,440
- Understood. Some dumb text
...
I need a set of groups:
startTime, endTime, text
For now my achievements are not very good. I can get startTime, endTime and some text, but not all the text, only the last sentence. I've attached a screenshot.
As you can see, group 3 is capturing text, but only last sentence.
Please explain what I'm doing wrong.
Thank you.
Accounting for the possibility that there is no new-line character after the final text of your string, would the following work for you?
(\d\d:\d\d:\d\d,\d\d\d)[ >-]*?((?1))\n(.*?(?=\n\n|\Z))
See the online demo
(\d\d:\d\d:\d\d,\d\d\d) - The same pattern as you used to capture starting time in 1st capture group.
[ >-]*? - 0+ (but lazy) character from the character class up to:
((?1)) - A 2nd capture group which matches the same pattern as 1st group.
\n - A newline-character.
(.*?(?=\n\n|\Z)) - A 3rd capture group that captures anything (including newline with the s-flag) up to a positive lookahead for either two newline characters or the end of the whole string.
Note that only some engines support reusing a previous subpattern through a subroutine call such as (?1). I guess the app you are using does not. Therefore, you can replace the (?1) with your own pattern to capture the 2nd group.
Another option is to use a pattern that would capture all lines in group 3 that do not start with 3 digits.
(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)
Explanation
(\d\d:\d\d:\d\d,\d\d\d) Capture group 1 Match a time like pattern
--> Match literally
(\d\d:\d\d:\d\d,\d\d\d) Capture group 2 Same pattern as group 1
( Capture group 3
(?: Non capture group
\r?\n(?!\d\d\d\b).* Match a newline and assert using a negative lookahead that the line does not start with 3 digits followed by word boundary. If that is the case, match the whole line
)* Optionally repeat all lines
) Close group 3
Regex demo
A bit more specific pattern could match all lines that do not start with digits only or with a start/end time like pattern.
^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)
Regex demo
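To show how the three groups come out, here is a small JavaScript sketch of the more specific pattern run over the sample from the question. It assumes the g and m flags (so ^ and $ work per line) and trims the leading newline that group 3 picks up; the field names mirror the ones asked for.

```javascript
// Sample subtitle text from the question.
const srt = [
  "223",
  "00:20:47,920 --> 00:20:57,520",
  "- Hello! This is good subtitle text.",
  "- Yes! How are you, stackoverflow?",
  "224",
  "00:20:57,520 --> 00:21:11,120",
  "Wow, seems amazing.",
  "- We're good, thanks.",
  "Like, you know, everyone is happy around here with their laptops.",
  "225",
  "00:21:11,120 --> 00:21:14,440",
  "- Understood. Some dumb text",
].join("\n");

// Group 3 keeps taking lines until one is digits only or starts with a time.
const cuePattern =
  /^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)/gm;

const cues = [...srt.matchAll(cuePattern)].map(([, startTime, endTime, text]) => ({
  startTime,
  endTime,
  text: text.trim(), // drop the newline that precedes the first text line
}));

console.log(cues);
```

A subtitle line that consists only of digits would end the text early, which is a limitation of this approach.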

JScript Regex - extract dates preceded by substrings

I've got a one-line string that includes several dates. In JScript Regex I need to extract dates that are preceded by the case-insensitive substrings "dat" and "wy", in the given order. The substrings can be preceded and followed by any character (except a new line).
reg = new RegExp('dat.{0,}wy.{0,}\\d{1,4}([\-/ \.])\\d{1,2}([\-/ \.])\\d{1,4}','ig');
str = ('abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30')
result = str.match(reg).toString()
Received result: 'Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30'
Expected result: 'Dat wy.03/12/2019,data wy2020-09-30' or preferably: '03/12/2019,2020-09-30'
Thanks.
Several issues.
You want to match as few characters as possible between the substrings and the date, but your current regex uses the greedy .{0,} (the same as .*). See this Question and use .*? instead.
dat.*?wy.*?FOO can still skip over any other dat. To avoid skipping over, use what some call a Tempered Greedy Token. The .*? becomes (?:(?!dat).)*? for NOT skipping over.
Not really an issue, but you can capture the date separator and reuse it.
If you want to extract only the date part, also use capturing groups. I put a demo at regex101.
dat(?:(?!dat).)*?wy.*?(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})
There are many ways to achieve your desired outcome. Another idea: if you know that digits will never appear between the substrings and the date, use \D (non-digit) instead of the .
dat\D*?wy\D*(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})
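Here is a rough sketch of how the tempered pattern could be used to pull out only the date part. It is written for a modern JavaScript engine with matchAll; in classic JScript an exec loop would be needed instead, which is an assumption on my side about your environment.

```javascript
const str = "abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30";

// (?:(?!dat).)*? refuses to step over another "dat" before reaching "wy";
// \2 forces the second separator to equal the first.
const datePattern = /dat(?:(?!dat).)*?wy.*?(\d{1,4}([\/ .-])\d{1,2}\2\d{1,4})/gi;

// Collect capture group 1 (the date) from every match.
const dates = Array.from(str.matchAll(datePattern), (m) => m[1]);
console.log(dates.join(",")); // 03/12/2019,2020-09-30
```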
You might use a capturing group with a backreference to make sure the separators like - and / are the same in the matched date.
\bdat\w*\s*wy\.?(\d{4}([-/ .])\d{2}\2\d{2}|\d{2}([-/ .])\d{2}\3\d{4})
\bdat\w*\s*wy\.? A word boundary, match dat followed by 0+ word chars and 0+ whitespace chars. Then match wy and an optional .
( Capture group 1
\d{4}([-/ .])\d{2}\2\d{2} Match a date like format starting with the year where \2 is a backreference to what is captured in group 2
| Or
\d{2}([-/ .])\d{2}\3\d{4} Match a date like format ending with the year where \3 is a backreference to what is captured in group 3
) Close group
The value is in capture group 1
Regex demo
Note that you could make the date more specific by specifying ranges for the year, month and day.

RegEx - if then else

I am trying to work out a regex expression but struggle with conditionals. I have a list of 100s of URLs that look like this:
/name/something/details/55334
/name/page/1/2
/name/somethingdifferent/34523
/name/page/1
/name/something/553/1
Bottom line is that I want to remove everything when a number appears apart from a scenario where the last thing before the number is a word 'page'.
1. /name/something/details/
2. /name/page/1/2
3. /name/somethingdifferent/
4. /name/page/1
5. /name/something
I will be removing it with Google Analytics Content Grouping or potentially with DataStudio. I already removed /name/ so I have:
1. /something/details/55334
2. /page/1/2
3. /somethingdifferent/34523
4. /page/1
5. /something/553/1
but want to add another rule and remove the numbers so I get:
1. /something/details/
2. /page/1/2
3. /somethingdifferent/
4. /page/1
5. /something
have already tried:
\(?(?=(page\/[0-9]+))(\2)|(\/\d+)
following the syntax of:
(?(?=condition))(IF)|(ELSE)
but it highlights all numbers after text.
Thanks for your help.
sampak
Try ^(\/page.*|[^0-9]*), works with your example.
A version that includes name: ^(page[\/\d]*|[^\d\s])*
One option might be to match not a whitespace or digit while not matching /page.
Then match a forward slash and 1+ digits followed by any char 0+ times to omit that from the result.
^((?:(?!\/page)[^\d\s])*\/)\d.*
In parts
^ Start of string
( Capture group 1
(?: Non capturing group
(?!\/page) Negative lookahead, assert what is directly to the right is not /page
[^\d\s] Match any char except a digit or whitespace char
)* Close non capturing group and repeat 0+ times
\/ Match /
) Close group 1
\d.* Match a digit followed by any char except a newline 0+ times
In the replacement use the first capturing group
Regex demo
If you also want to remove /name you could use:
^\/name((?:(?!\/page)[^\d\s])*\/)\d.*
Regex demo
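As a sanity check, the first pattern can be applied to the already stripped URLs from the question with a plain replace. This is a JavaScript sketch only; whether Google Analytics or DataStudio accept the same syntax is something I have not verified.

```javascript
// URLs from the question, with /name already removed.
const urls = [
  "/something/details/55334",
  "/page/1/2",
  "/somethingdifferent/34523",
  "/page/1",
  "/something/553/1",
];

// Group 1: the path up to and including the slash before the first number,
// as long as /page does not occur in that part.
const numericTail = /^((?:(?!\/page)[^\d\s])*\/)\d.*/;

// Keep group 1 only; URLs that do not match are returned unchanged.
const cleaned = urls.map((url) => url.replace(numericTail, "$1"));
console.log(cleaned);
```

Note that the last URL keeps its trailing slash (/something/), unlike item 5 in the desired output listed in the question.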