Let's say, for this simple regexp,
(?P<first>\d+)\.(?P<second>\d+)
it can match strings like "123.456" so that,
first -> 123, second -> 456
Based on this example, is there a way to assert "first" should equal "second", otherwise the input string won't be a match?
You could capture the first digits before the dot in a capturing group and use a backreference after the dot to group 1:
(?P<first>\d+)\.(?P<second>\1)
Or you can refer to the first capturing group by name:
(?P<first>\d+)\.(?P<second>(?P=first))
As per the comment from UnbearableLightness, you could add word boundaries \b, or use the anchors ^ and $ to assert the start and the end of the line, so that the pattern cannot match in the middle of a longer number.
\b(?P<first>\d+)\.(?P<second>(?P=first))\b
You can backreference the text captured by group one with this expression:
^(?P<first>\d+)\.(?P<second>\1)$
You can check it live here.
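To try the patterns above outside a live demo, here is a minimal Python sketch (Python's re module accepts the (?P<name>...) and (?P=name) syntax used in this thread). The helper name halves_equal is made up for illustration.

```python
import re

# Anchored version: the whole string must be "<digits>.<same digits>".
ANCHORED = re.compile(r"^(?P<first>\d+)\.(?P<second>(?P=first))$")
# Unanchored version, kept only to show why boundaries or anchors matter.
UNANCHORED = re.compile(r"(?P<first>\d+)\.(?P<second>(?P=first))")


def halves_equal(text):
    """Return True when the digits before and after the dot are identical."""
    return ANCHORED.match(text) is not None


print(halves_equal("123.123"))  # True
print(halves_equal("123.456"))  # False
# Without anchors, "123.345" still yields a partial match on "3.3".
print(UNANCHORED.search("123.345").group(0))  # 3.3
```

The last line shows the partial match that the word boundaries or anchors are meant to prevent.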
I have the following type of strings:
This is a test: 1, two again,three test2: what is, this
test: acid, kool-aid word: some more info
Another test: face, 3, & yes
What I'd like to do is remove test: and everything after until it hits another word that has a colon.
The result set from above would look like:
This is a test2: what is, this
word: some more info
Another
Here's what I've attempted, but this fails when there is NO word with a colon (so example 3 fails)
test:.+?(?=\w+:)
You can use this regex for matching (it starts with a space followed by *):
 *\btest:.*?\b(?=\w+:|$)
And replace with empty string.
RegEx Demo
RegEx Details:
A space followed by *: Match 0 or more spaces
\btest: Match full word test:
.*?\b: Match 0 or more of any characters (lazy match) followed by a word boundary
(?=\w+:|$): Positive lookahead to assert that we have a word + : or end of line ahead.
With your shown samples, please try the following regex. It splits each line into three parts. The first part, from the start of the line to just before the first word followed by a colon, is captured in group 1. The second part, from that word and colon up to the next word followed by a colon (or the end of the line), is matched but not captured. The third part, the rest of the line, is captured in group 2. If nothing follows the second word and colon, group 2 is empty and only group 1 holds text. Perform the substitution accordingly.
^(.*?)\s*\w+:.*?(?:\w+:|$)\s*(.*)$
Online demo for above regex
You were on the right track. For the last case, where there is no second word with a colon, you also need to match at the end of the line with the $ anchor. So you can use:
test:.*?(?=$|\b\w+:)
Demo
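As a rough check of the lookahead approach against the three sample lines from the question, here is a small Python sketch. The function name drop_test_section is hypothetical, and the final strip() is an added assumption to trim whitespace left at the edges after the removal.

```python
import re

# Remove "test:" and everything after it, up to the next "word:" or end of line.
PATTERN = re.compile(r"test:.*?(?=$|\b\w+:)")


def drop_test_section(line):
    # strip() is an extra step (an assumption) to tidy leftover edge whitespace.
    return PATTERN.sub("", line).strip()


samples = [
    "This is a test: 1, two again,three test2: what is, this",
    "test: acid, kool-aid word: some more info",
    "Another test: face, 3, & yes",
]
for sample in samples:
    print(drop_test_section(sample))
# This is a test2: what is, this
# word: some more info
# Another
```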
I'm trying to write some regex matching a start and end of a string.
start:
https://www.example.com.au/
end:
-end
Example input/match:
Input IsMatch
https://www.example.com.au/hithere-end Y
https://www.example.com.au/hi-there-end Y
https://www.example.com.au/hithere-endx N
https://www.example.com.au/end N
This is what I have so far:
^https?://(www\.)?example\.com\.au/[A-z](\-end)$
Any help?
Thanks.
Try this pattern:
^https?:\/\/(?:www\.)?example\.com\.au\/(.+)-end$
Changes from your pattern:
/ are escaped (with \, 3 times).
The first group changed to a non-capturing one (?:).
[A-z] matches a single character in the ASCII range from A to z (letters plus a few symbols such as [ and _). Changed to (.+) (a capturing group).
Removed the parentheses from the last group (you don't want to capture it); the \ before - is also not needed.
The "middle part" you want to capture is in group 1.
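To check the pattern against the four inputs from the question, a minimal Python sketch could look like this. In Python the / characters need no escaping, so they are written plainly here; the helper name middle_part is made up for illustration.

```python
import re

# Same pattern as above, without the escaped slashes (not required in Python).
URL = re.compile(r"^https?://(?:www\.)?example\.com\.au/(.+)-end$")


def middle_part(url):
    """Return the text between the site prefix and "-end", or None if no match."""
    m = URL.match(url)
    return m.group(1) if m else None


print(middle_part("https://www.example.com.au/hithere-end"))   # hithere
print(middle_part("https://www.example.com.au/hi-there-end"))  # hi-there
print(middle_part("https://www.example.com.au/hithere-endx"))  # None
print(middle_part("https://www.example.com.au/end"))           # None
```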
Check this:
^(https?://(www\.)?example\.com\.au/)[A-Za-z-]*(-end)$
Should work. Note that the class [A-Za-z-] includes the hyphen, so hi-there-end matches too.
Try this C# code; the string matches when both checks are true:
Somestring.StartsWith("https://www.example.com.au/") && Somestring.EndsWith("-end")
How to write a generic regular expression that will
1) capture string after first _ and before second _ as group 1
2) capture string after last _ as group 2
Example
ASIA_JAP_TOKYO_201109
OUTPUT Would be
group 1 - JAP
group 2 - 201109
You can do:
^[^_]*_([^_]*).*_([^_]*)$
Here first captured group will be "JAP" and second will be "201109".
^[^_]*_ matches up to the first _ from the start
The first captured group, ([^_]*), captures the string up to the next _
.*_ greedily matches up to the last _
([^_]*)$ matches the string after the last _ and puts it in captured group 2.
Demo
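The pattern explained above can be tried with a short Python sketch; the function name extract_groups is hypothetical. Note that the pattern needs at least two underscores in the input to match at all.

```python
import re

PATTERN = re.compile(r"^[^_]*_([^_]*).*_([^_]*)$")


def extract_groups(name):
    """Return (text between 1st and 2nd "_", text after last "_"), or None."""
    m = PATTERN.match(name)
    return m.groups() if m else None


print(extract_groups("ASIA_JAP_TOKYO_201109"))  # ('JAP', '201109')
print(extract_groups("A_B"))                    # None (only one underscore)
```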
You can use regex like
/^[^_]+_([^_]+).*_([^_]+)$/
Regex explanation here.
For readability purposes, I might use two separate regular expressions for this:
First regex:
^[^_]*_([^_]*?)_(.*)$
Second regex:
^(.*)_([^_]*)$
But if you are using a language such as Java or Perl, I would much rather split the string on the underscore and extract the pieces you want.
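The split approach can be sketched as follows, using Python here as a stand-in for Java or Perl. The function name extract_by_split and the rule of returning None when there are fewer than two underscores are assumptions for illustration.

```python
def extract_by_split(name):
    """Split on "_" and return (second piece, last piece), or None."""
    parts = name.split("_")
    # Assumption: at least two underscores are required, as in the regex answers.
    if len(parts) < 3:
        return None
    return parts[1], parts[-1]


print(extract_by_split("ASIA_JAP_TOKYO_201109"))  # ('JAP', '201109')
```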
Two types of URLs I want to match:
(1) www.test.de/type1/12345/this-is-a-title.html
(2) www.test.de/category/another-title-oh-yes.html
In the first type, I want to match "12345".
In the second type I want to match "category/another-title-oh-yes".
Here is what I came up with:
(?:(?:\.de\/type1\/([\d]*)\/)|\.de\/([\S]+)\.html)
This returns the following:
For type (1):
Match group 1: 12345
Match group 2:
For type (2):
Match group 1:
Match group 2: category/another-title-oh-yes
As you can see, it is working pretty well already.
For various reasons I need the regex to return only one match-group, though. Is there a way to achieve that?
Java/PHP/Python
Use a negative lookahead and a positive lookbehind so that the wanted value always ends up in the group at index 1.
((?<=\.de\/type1\/)\d+|(?<=\.de\/)(?!type1)[^\.]+)
There are two regex patterns that are ORed.
The first pattern looks for 12345.
The second pattern looks for category/another-title-oh-yes.
Note:
Each URL must produce exactly one match.
Wrap the whole pattern in parentheses (...|...) and remove the parentheses around [^\.]+ and \d+, where:
[^\.]+ matches everything up to the next dot
\d+ matches one or more digits
Here is online demo on regex101
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. [18-23] `12345`
MATCH 2
1. [57-86] `category/another-title-oh-yes`
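The same lookbehind pattern can be checked in Python, whose re module supports fixed-width lookbehind. This is only a sketch: the helper name url_key is made up, and the slashes are left unescaped because Python does not require escaping them.

```python
import re

# One capturing group; lookarounds select which part of the URL ends up in it.
PATTERN = re.compile(r"((?<=\.de/type1/)\d+|(?<=\.de/)(?!type1)[^.]+)")


def url_key(url):
    """Return the single captured value for either URL type, or None."""
    m = PATTERN.search(url)
    return m.group(1) if m else None


print(url_key("www.test.de/type1/12345/this-is-a-title.html"))    # 12345
print(url_key("www.test.de/category/another-title-oh-yes.html"))  # category/another-title-oh-yes
```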
JavaScript
Try this one and get the matched group at index 2.
((?:\.de\/type1\/)(\d+)|(?:\.de\/)(?!type1)([^\.]+))
Here is online demo on regex101.
Input:
www.test.de/type1/12345/this-is-a-title.html
www.test.de/category/another-title-oh-yes.html
Output:
MATCH 1
1. `.de/type1/12345`
2. `12345`
MATCH 2
1. `.de/category/another-title-oh-yes`
2. `category/another-title-oh-yes`
Maybe this:
^www\.test\.de/(type1/(\d+)/.*|(.*))\.html$
Debuggex Demo
Then for example:
var str = "www.test.de/type1/12345/this-is-a-title.html";
var regex = /^www\.test\.de\/(type1\/(\d+)\/.*|(.*))\.html$/;
console.log(str.match(regex));
This will output an array: the first element is the whole string, the second is whatever comes after the website address (without .html), the third is the number captured for type1 URLs, and the fourth is the capture for every other URL (undefined when the type1 branch matched).
You can do something like var matches = str.match(regex); return matches[2] || matches[3];
I am using this regex:
((?:[a-z][a-z]+))_(\d+)_((?:[a-z][a-z]+)\d+)_(\d{13})
to match strings like this:
SH_6208069141055_BC000388_20110412101855
separating into 4 groups:
SH
6208069141055
BC000388
20110412101855
Question: How do I make the first group optional, so that the resulting group is an empty string?
I want to get 4 groups in every case, when possible.
Input string for this case: (no underline after the first group)
6208069141055_BC000388_20110412101855
To make a non-capturing group optional (matching zero or one time), append ? to it.
(?: ..... )?
^          ^____ optional
|____ group
You can easily simplify your regex to be this:
(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$
^              ^^
|--------------||
| first group  ||- quantifier for 0 or 1 time (essentially making it optional)
I'm not sure whether the input string without the first group will have the underscore or not, but you can use the above regex if it's the whole string.
regex101 demo
As you can see, group 1 is empty in the second match, which starts at group 2.
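For a quick check outside regex101, here is a hedged Python sketch of the simplified pattern. It assumes a case-insensitive flag, since the sample strings use upper-case letters while the pattern uses [a-z]. In Python an optional group that did not take part in the match is None, so the hypothetical helper four_groups converts it to an empty string.

```python
import re

# Assumption: IGNORECASE, because the samples contain upper-case letters.
PATTERN = re.compile(
    r"(?:([a-z]{2,})_)?(\d+)_([a-z]{2,}\d+)_(\d+)$", re.IGNORECASE
)


def four_groups(text):
    """Return all four groups, with a missing first group as ''."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Unmatched optional groups are None in Python; map them to "".
    return tuple(g or "" for g in m.groups())


print(four_groups("SH_6208069141055_BC000388_20110412101855"))
# ('SH', '6208069141055', 'BC000388', '20110412101855')
print(four_groups("6208069141055_BC000388_20110412101855"))
# ('', '6208069141055', 'BC000388', '20110412101855')
```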