Conditional Regexp: return only one group - regex

Two types of URLs I want to match:
(1) www.test.de/type1/12345/this-is-a-title.html
(2) www.test.de/category/another-title-oh-yes.html
In the first type, I want to match "12345".
In the second type I want to match "category/another-title-oh-yes".
Here is what I came up with:
(?:(?:\.de\/type1\/([\d]*)\/)|\.de\/([\S]+)\.html)
This returns the following:
For type (1):
Match group 1: 12345
Match group 2:
For type (2):
Match group 1:
Match group 2: category/another-title-oh-yes
As you can see, it is working pretty well already.
For various reasons I need the regex to return only one match-group, though. Is there a way to achieve that?

Java/PHP/Python
Get the wanted value in group 1 for both URL types by combining positive lookbehinds with a negative lookahead.
((?<=\.de\/type1\/)\d+|(?<=\.de\/)(?!type1)[^\.]+)
Two regex patterns are combined with alternation (|).
The first pattern looks for 12345.
The second pattern looks for category/another-title-oh-yes.
Note:
Each alternative must produce exactly one match per URL.
Wrap the whole alternation in a single pair of parentheses (...|...) and drop the inner parentheses around [^\.]+ and \d+, where:
[^\.]+ matches everything up to the next dot
\d+ matches one or more digits
Here is online demo on regex101
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. [18-23] `12345`
MATCH 2
1. [57-86] `category/another-title-oh-yes`
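As a minimal sketch of how the single-group pattern could be used from code: the snippet below assumes a JavaScript engine with lookbehind support (ES2018 or later, for example a current Node.js). The helper name extractKey is made up for illustration.

```javascript
// Single capturing group: either the digits after ".de/type1/" or,
// for any other path, everything after ".de/" up to the first dot.
const singleGroup = /((?<=\.de\/type1\/)\d+|(?<=\.de\/)(?!type1)[^.]+)/;

// Hypothetical helper: returns group 1, or null when nothing matches.
function extractKey(url) {
  const m = url.match(singleGroup);
  return m ? m[1] : null;
}

console.log(extractKey("www.test.de/type1/12345/this-is-a-title.html"));
console.log(extractKey("www.test.de/category/another-title-oh-yes.html"));
```

Both calls read the same group index, which is the point of putting the lookbehinds outside the consumed text.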
JavaScript
JavaScript regexes historically lacked lookbehind, so this variant consumes the prefix with non-capturing groups instead. The digits land in group 2 and the path in group 3, so read whichever of the two is set.
((?:\.de\/type1\/)(\d+)|(?:\.de\/)(?!type1)([^\.]+))
Here is online demo on regex101.
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. `.de/type1/12345`
2. `12345`
MATCH 2
1. `.de/category/another-title-oh-yes`
3. `category/another-title-oh-yes`
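Counting opening parentheses in the pattern above, (\d+) is group 2 and ([^\.]+) is group 3, so code that wants a single value has to fall back from one to the other. A minimal sketch, with pickValue as a hypothetical helper name:

```javascript
const re = /((?:\.de\/type1\/)(\d+)|(?:\.de\/)(?!type1)([^.]+))/;

// Hypothetical helper: only one of groups 2 and 3 participates in a match,
// the other one is undefined.
function pickValue(url) {
  const m = url.match(re);
  if (!m) return null;
  return m[2] !== undefined ? m[2] : m[3];
}

console.log(pickValue("www.test.de/type1/12345/this-is-a-title.html"));
console.log(pickValue("www.test.de/category/another-title-oh-yes.html"));
```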

Maybe this:
^www\.test\.de\/(type1\/(\d+)\/.*\.html|(.*)\.html)$
Then for example:
var str = "www.test.de/type1/12345/this-is-a-title.html";
var regex = /^www\.test\.de\/(type1\/(\d+)\/.*\.html|(.*)\.html)$/;
console.log(str.match(regex));
This outputs an array: the first element is the whole match, the second is everything after the site address, the third is the ID captured for type 1 URLs, and the fourth is the path captured for all other URLs.
You can then do something like var matches = str.match(regex); return matches[2] || matches[3];

Related

Regex to get 2nd or 3rd level domain with path in it

I created this regex: [^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
Given: https://this.is.my.nice.service.co.uk
My regex will return: service.co.uk
But it will not work for: https://this.is.my.nice.service.co.uk/sample
I would like my regex to always return the 2nd or 3rd level domain regardless if there's a path or not.
So given https://this.is.my.nice.service.co.uk and https://this.is.my.nice.service.co.uk/sample the result should be: service.co.uk.
How can I achieve that?
Demo: https://regex101.com/r/kygHUa/1
For your example strings, you can exclude the dot and forward slash from the matched characters, and use a lookahead to assert an optional / followed by optional non-whitespace characters up to the end of the string.
Then take the first match, in case the pattern can match more than once in the string.
[^./\s/]*(?:\.[^\s./]{2,3}){1,2}(?=(?:\/\S*)?$)
See a regex 101 demo.
const regex = /[^./\s/]*(?:\.[^\s./]{2,3}){1,2}(?=(?:\/\S*)?$)/;
[
"https://this.is.my.nice.service.co.uk",
"https://service.co.uk",
"https://this.is.my.nice.service.co.uk/sample",
"https://service.com",
"https://this.is.my.nice.service.co.uk/"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[0]);
}
});
If all the strings start with http:// or https://, you could make the pattern more specific: start with the protocol, followed by optional non-greedy repetitions of the allowed characters and a dot.
Then get the capture group 1 value.
https?:\/\/(?:[^./\s/]*\.)*?([^./\s/]*(?:\.[^\s./]{2,3}){1,2}(?=(?:\/\S*)?$))
Regex demo
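A short sketch of reading capture group 1 from the protocol-anchored pattern, reusing the sample URLs from the earlier snippet. The function name shortDomain is an assumption made for this example only.

```javascript
// Pattern from the answer above; group 1 holds the trailing domain labels.
const withProtocol =
  /https?:\/\/(?:[^./\s/]*\.)*?([^./\s/]*(?:\.[^\s./]{2,3}){1,2}(?=(?:\/\S*)?$))/;

// Hypothetical helper: returns capture group 1, or null if nothing matches.
function shortDomain(url) {
  const m = url.match(withProtocol);
  return m ? m[1] : null;
}

[
  "https://this.is.my.nice.service.co.uk",
  "https://this.is.my.nice.service.co.uk/sample",
  "https://service.com"
].forEach(u => console.log(shortDomain(u)));
```

Because the suffix parts are limited to 2 or 3 characters by {2,3}, longer top-level labels would not be picked up by this pattern as written.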

Regex match the unknown characters with dash between

I'm struggling with the following combination of characters that I'm trying to parse:
I have two types of text:
1. AF-B-W23F4-USLAMC-X99-JLK
2. LS-V-A23DF-SDLL--X22-LSM
I want to get the last two groups of characters, separated by a dash.
From the 1. X99-JLK and from the 2. X22-LSM
I accomplished the 2. with the following regex '--(.*-.*)'
How can I parse the 1. sample, and is there a way to parse both at once with something like an OR operator?
Thanks for any help!
The pattern --(.*-.*) that you tried matches the second example because it contains -- and it matches the first occurrence.
Then it matches until the end of the string and backtracks to find another hyphen.
As .* can match any character (also -) and there are no anchors or boundaries set, this is a very broad match.
If there have to be 2 dashes, you can match the first one and capture the part containing the second one in a group, using the negated character class [^-].
Note that [^-] also matches a newline. To exclude newlines use [^-\r\n], or exclude all whitespace with [^-\s] (there is none in the example data).
-([^-]+-[^-]+)$
Explanation
- Match -
( Capture group 1
[^-]+-[^-]+ Match the second dash between chars other than -
) Close group 1
$ End of string
See a regex demo
For example using Javascript:
const regex = /-([^-]+-[^-]+)$/;
[
"AF-B-W23F4-USLAMC-X99-JLK",
"LS-V-A23DF-SDLL--X22-LSM"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
}
})
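The newline remark matters once several codes sit in one multi-line string. A sketch under that assumption, using the [^-\s] variant with the g and m flags so that $ also matches at each line end (lastPairs is a hypothetical name):

```javascript
// [^-\s] keeps a match from running across a line break;
// the m flag lets $ match before each newline.
const perLine = /-([^-\s]+-[^-\s]+)$/gm;

function lastPairs(text) {
  // matchAll requires the g flag; m[1] is capture group 1 of each match.
  return Array.from(text.matchAll(perLine), m => m[1]);
}

const text = "AF-B-W23F4-USLAMC-X99-JLK\nLS-V-A23DF-SDLL--X22-LSM";
console.log(lastPairs(text)); // [ 'X99-JLK', 'X22-LSM' ]
```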
You can also try a lookahead to match the last pair before each newline. JavaScript example:
const str = `
AF-B-W23F4-USLAMC-X99-JLK
LS-V-A23DF-SDLL--X22-LSM
`;
const re = /[^-]*-[^-]*(?=\n)/g;
console.log(str.match(re));

Regex - Match Collection Must Consume Entire String

Let's say I have the following string:
var goodStr = "abcabcabc";
I want to write a regex pattern that returns three (3) matches, each with the value "abc". However, if the string deviates from the repeated "abc" pattern at all, I do not want to return ANY matches.
Also, I do not know if there will always be 3 repetitions (there could be any number of repetitions).
For example, the following string should fail, and it should not have any matches:
var badStr = "abcabcabc123";
What pattern should I use that would return 3 matches in goodStr, but 0 matches in badStr?
In other words, the sum of the last match's index and length should equal the total length of the subject string.
I am trying not to use captures/back-references in this scenario also.
EDIT:
The pattern ^(?:abc)+$ does not suffice since it only returns 1 match.
The pattern ^abcabcabc$ does not suffice since it assumes there will only be 3 repetitions of "abc", and I don't know how many repetitions there will be in my scenario. Also, it only returns 1 match.
Solved
With Aaron's and anubhava's help we made this pattern that works for my scenario:
\Gabc(?=(?:abc)*$)
PHP doesn't support dynamic length lookbehind so you may use this regex using \G:
(?:^(?=(?:abc)+$)|(?!^)\G)abc
RegEx Demo
\G asserts position at the end of the previous match or the start of the string for the first match.
After some more iterations this regex turns out to be most efficient:
\Gabc(?=(?:abc)*$)
RegEx Demo 2
You should be able to use the following in C# :
(?<=^(?:abc)*)abc(?=(?:abc)*$)
This matches occurrences of abc from which you can reach both the start and the end of the string using only other repetitions of abc. It relies on variable-width lookbehind, which few regex engines support but C#'s does.
I've been able to test it on http://regexstorm.net/tester where it does return 3 matches for abcabcabc but 0 for abcabcabc123.
You could use \G in combination with a positive lookahead. \G matches at the start of the string or asserts the position at the end of the previous match.
You can capture abc and check if what is on the right is a repetition of the group until the end of the string.
\G(abc)(?=\1*$)
Regex demo
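JavaScript has no \G anchor, so the patterns above cannot be pasted into it unchanged. As a rough equivalent, and only as a sketch: the sticky flag (y) forces each match to start exactly where the previous one ended, and JavaScript's lookbehind accepts variable-width patterns, so the C# variant carries over as well. The function names are made up for illustration.

```javascript
// Sticky (y) plays the role of \G: every match must begin at lastIndex.
// The lookahead only succeeds if the rest of the string is all "abc".
function matchesSticky(s) {
  return s.match(/abc(?=(?:abc)*$)/gy) || [];
}

// Port of the lookbehind variant; needs an ES2018+ engine.
function matchesLookaround(s) {
  return s.match(/(?<=^(?:abc)*)abc(?=(?:abc)*$)/g) || [];
}

console.log(matchesSticky("abcabcabc").length);        // 3
console.log(matchesSticky("abcabcabc123").length);     // 0
console.log(matchesLookaround("abcabcabc").length);    // 3
console.log(matchesLookaround("abcabcabc123").length); // 0
```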

Can regular expression assert that 2 of submatches to be equal?

Let's say I have this simple regexp,
(?P<first>\d+)\.(?P<second>\d+)
it can match strings like "123.456" so that,
first -> 123, second -> 456
Based on this example, is there a way to assert "first" should equal "second", otherwise the input string won't be a match?
You could capture the first digits before the dot in a capturing group and use a backreference after the dot to group 1:
(?P<first>\d+)\.(?P<second>\1)
Or you can refer to the first capturing group by name:
(?P<first>\d+)\.(?P<second>(?P=first))
As per comment from UnbearableLightness you could use word boundaries \b or use anchors ^ and $ to assert the start and the end of the line.
\b(?P<first>\d+)\.(?P<second>(?P=first))\b
You can use a backreference to capture group 1 with this expression:
^(?P<first>\d+)\.(?P<second>\1)$
You can check it live here.
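The (?P<name>...) and (?P=name) forms are the Python/PCRE spelling. In JavaScript the same idea is written (?<name>...) with \k<name> as the backreference. A small sketch under that substitution, which also shows why the anchors matter:

```javascript
// Named group plus named backreference: "second" must repeat "first".
const anchored = /^(?<first>\d+)\.(?<second>\k<first>)$/;
const loose = /(?<first>\d+)\.(?<second>\k<first>)/;

console.log(anchored.test("123.123")); // true
console.log(anchored.test("123.456")); // false
// Without anchors a substring can satisfy the pattern: "12.12" inside "12.123".
console.log(loose.test("12.123"));     // true
console.log(anchored.test("12.123"));  // false
```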

How to optionally match a group?

I have two possible patterns:
1.2 hello
1.2.3 hello
I would like to match 1, 2 and 3 if the latter exists.
Optional items seem to be the way to go, but my pattern (\d)\.(\d)?(\.(\d)).hello matches only 1.2.3 hello (almost perfectly: I get four groups but the first, second and fourth contain what I want) - the first test string is not matched at all.
What would be the right match pattern?
Your pattern contains the (\d)\.(\d)?(\.(\d)) part, which matches a digit, then a ., then an optional digit (zero or one occurrence), and then a . followed by a digit. Thus, it can match 1..2 hello, but not 1.2 hello.
You may make the third group non-capturing and make it optional:
(\d)\.(\d)(?:\.(\d))?\s*hello
          ^^^      ^^
See the regex demo
If your regex engine does not allow non-capturing groups, use a capturing one; you will then have to grab the third digit from Group 4:
(\d)\.(\d)(\.(\d))?\s*hello
See this regex.
Note that I replaced . before hello with \s* to match zero or more whitespaces.
Note also that if you need to match these numbers at the start of a line, you might consider pre-pending the pattern with ^ (and depending on your regex engine/tool, the m modifier).
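To see how the optional group behaves on both inputs, here is a minimal JavaScript sketch using the non-capturing version of the pattern. The helper name parts is hypothetical.

```javascript
const version = /(\d)\.(\d)(?:\.(\d))?\s*hello/;

function parts(s) {
  const m = s.match(version);
  // Drop the full match and any group that did not participate.
  return m ? m.slice(1).filter(g => g !== undefined) : null;
}

console.log(parts("1.2 hello"));   // [ '1', '2' ]
console.log(parts("1.2.3 hello")); // [ '1', '2', '3' ]
```

When the third digit is absent, its group is undefined in the match array rather than an empty string, which is why the sketch filters on undefined.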