I have a URL formatted as follows: https://www.mywebsite.com/subdomain/123456789.htm. I know that the page number always has exactly 9 or 10 digits. I would like to extract this number using a regex.
The regex I use to perform this operation is:
^https://www.mywebsite.com/[A-Za-z0-9_.-~/]+([0-9]{9,10}).htm$
The problem is that when the number is 10 digits long, I get a match, which is good, but only the last 9 digits are captured. For example, https://www.mywebsite.com/subdomain/1234567890.htm captures only 234567890.
I could easily create two regexes (one for 9 digits and one for 10) and take the longer number if both match, but is there a more elegant way to solve this problem with a single regex?
EDIT
Following the remarks made below: there was indeed a mistake in my original regex. The first character class matches the first of the 10 digits and leaves only the other 9 for the capturing group. I've added a screenshot below. Adding a forward slash to the regex before the capturing group solved the issue, thanks!
As per @TheFourthBird, you are missing a match on the forward slash. A slightly different approach would be to use a non-capturing group:
^https://www\.mywebsite\.com/(?:[^/]+/)+(\d{9,10})\.htm$
The character class [A-Za-z0-9_.-~/]+ is greedy and first matches all the characters that follow, up to the end of the line.
The engine then backtracks until ([0-9]{9,10}). can match. That succeeds as soon as 9 digits are available, so only the last 9 digits end up in the capturing group.
Note that you should either escape the hyphen (\-) or place it at the start or end of the character class; otherwise it can denote a range, as .-~ does here.
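To see this backtracking in action, here is a minimal Python sketch. It assumes Python's re module handles this pattern the same way as the flavor used in the question; the variable names are only for illustration.

```python
import re

# Original pattern from the question. Inside the class, ".-~" is read as a
# range from "." to "~", which covers digits, letters and "/".
original = r"^https://www.mywebsite.com/[A-Za-z0-9_.-~/]+([0-9]{9,10}).htm$"
url = "https://www.mywebsite.com/subdomain/1234567890.htm"

# The greedy class first consumes the rest of the line, then gives characters
# back one at a time until the group can match; 9 digits are already enough.
captured = re.match(original, url).group(1)
print(captured)  # 234567890
```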
One option is to use a word boundary \b before matching the digits:
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+\b([0-9]{9,10})\.htm$
Another way could be matching the / right before the digits.
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+/([0-9]{9,10})\.htm$
If there can also be characters a-zA-Z or an underscore before the digits and lookbehind is supported, you could also assert that there is no digit before them with (?<!\d):
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+(?<!\d)([0-9]{9,10})\.htm$
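The three variants above can be compared side by side. This is only a sketch in Python (assuming its re module agrees with the original flavor for these patterns); prefix, suffix and variants are names made up for the example.

```python
import re

url10 = "https://www.mywebsite.com/subdomain/1234567890.htm"
url9 = "https://www.mywebsite.com/subdomain/123456789.htm"

# Shared parts of the three patterns; the hyphen sits at the end of the class.
prefix = r"^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+"
suffix = r"([0-9]{9,10})\.htm$"

variants = {
    "word boundary": prefix + r"\b" + suffix,
    "slash": prefix + "/" + suffix,
    "lookbehind": prefix + r"(?<!\d)" + suffix,
}

# For each variant, capture the number from a 10-digit and a 9-digit URL.
results = {
    name: (re.match(p, url10).group(1), re.match(p, url9).group(1))
    for name, p in variants.items()
}
for name, pair in results.items():
    print(name, pair)
```

Each variant stops the character class from swallowing the first digit, so the full number is captured in both cases.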
One more approach. This captures the run of digits immediately before .htm:
(\d+)(?=\.htm)
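A quick Python check of this pattern (a sketch; the second URL is a made-up example). Note that it does not enforce the 9 or 10 digit length, so shorter numbers are returned as well.

```python
import re

tail_digits = r"(\d+)(?=\.htm)"

# The lookahead requires ".htm" right after the digits without consuming it.
full = re.search(tail_digits, "https://www.mywebsite.com/subdomain/1234567890.htm")
print(full.group(1))  # 1234567890

# Hypothetical URL with a short number: it is still matched.
short = re.search(tail_digits, "https://www.mywebsite.com/subdomain/42.htm")
print(short.group(1))  # 42
```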
Related
I'm using regex in PowerShell 5.1.
I need it to detect groups of digits, but ignore groups followed or preceded by /, so from this input it should detect only 9876.
[regex]::matches('9876 1234/56','(?<!/)([0-9]{1,}(?!(\/[0-9])))').value
As it is now, the result is:
9876
123
6
More examples: "13 17 10/20" should only match 13 and 17.
Tried using something like (?!(\/([0-9]{1,}))), but it did not help.
You may use
\b(?<!/)[0-9]+\b(?!/[0-9])
Alternatively, if the numbers can be glued to text:
(?<![/0-9])[0-9]+(?!/?[0-9])
The first pattern is based on word boundaries \b, which make sure there are no letters, digits or _ right before and after an expected match. The second one just makes sure there are no digits or / on either end of the match.
Details
(?<![/0-9]) - a negative lookbehind making sure there is no digit or / immediately to the left of the current location
[0-9]+ - one or more digits
(?!/?[0-9]) - a negative lookahead making sure there is no optional / followed by a digit immediately to the right of the current location.
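Both patterns can be tried on the sample inputs. The sketch below uses Python rather than PowerShell, on the assumption that these lookarounds behave the same in both engines; the third sample string is made up to show the difference when digits are glued to text.

```python
import re

with_boundaries = r"\b(?<!/)[0-9]+\b(?!/[0-9])"
no_boundaries = r"(?<![/0-9])[0-9]+(?!/?[0-9])"

# The first two samples come from the question, the third is hypothetical.
samples = ["9876 1234/56", "13 17 10/20", "item42 7/8"]

found = {
    text: (re.findall(with_boundaries, text), re.findall(no_boundaries, text))
    for text in samples
}
for text, (a, b) in found.items():
    print(repr(text), a, b)
```

For the first two samples both patterns return the same numbers. For "item42 7/8" only the second pattern finds 42, because \b does not match between a letter and a digit.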
I am looking for help here. I want to write a regex that finds a number of EXACTLY 7 digits in a string, no more and no less.
For instance in this string:
1234567 RE:TKT-2744870-R6P1G0: Gentle Reminder
It should return only 1234567
In this one:
12345678 RE:TKT-2744870-R6P1G0: Gentle Reminder
It should return none.
Can you help me with this one?
Thanks in advance.
The proper regex should include \d{7} (7 digits) and 2 "border criteria",
for both start and end of the match, to block matching of a fragment
from longer sequence of digits.
My first thought was that neither before nor after the match there can be any digit.
But as I see from your example, these border criteria should be extended.
The set of "forbidden" chars (either before or after the match) should
include also - and letters.
E.g. 2744870 in your example data contains just 7 digits (no more, no less),
but you still don't want it to be matched, apparently because it is surrounded by - chars.
To keep the regex short, I propose:
(?<![\w-])\d{7}(?![\w-])
Details:
(?<![\w-]) - Negative lookbehind for word char or -.
\d{7} - 7 digits.
(?![\w-]) - Negative lookahead for word char or -.
If you decide to extend the set of "forbidden" chars in both border criteria,
just add them to the [...] fragments in the lookbehind / lookahead (but the - char
should remain at the end, otherwise it must be escaped with \).
A regex like (\d{7})[^\d] (from the other proposal) is wrong,
as it matches the last 7 digits of any longer sequence of digits
(there is no "front border criterion").
It also matches 2744870 (surrounded by - chars), which is not
to be matched.
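The difference between the two patterns can be checked on the example strings from the question. This is a minimal Python sketch; the names exact7, loose7, line_ok and line_long are only for illustration.

```python
import re

exact7 = r"(?<![\w-])\d{7}(?![\w-])"   # lookarounds on both sides
loose7 = r"(\d{7})[^\d]"               # only checks the character after

line_ok = "1234567 RE:TKT-2744870-R6P1G0: Gentle Reminder"
line_long = "12345678 RE:TKT-2744870-R6P1G0: Gentle Reminder"

print(re.findall(exact7, line_ok))    # ['1234567']
print(re.findall(exact7, line_long))  # []

# Without a front border, the tail of the 8-digit number and the ticket
# number between hyphens are both picked up.
print(re.findall(loose7, line_long))  # ['2345678', '2744870']
```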
This one should do for your examples:
(\d{7})[^\d]
The first matching group contains the seven digits.
Alternatively, as suggested in the comments, you can use a negative lookahead to match only the seven digits without needing a capturing group:
^\d{7}(?!\d)
I have a question about groups in a rule I created to extract dates from text.
Let's consider the following string:
fherfrefercr17hfeuetvbyeituew
The string consists of arbitrary characters at the beginning, then a number made of one or two digits, and then arbitrary characters again. I need to extract only the number "17" from the string above.
With the following rule I extract only 7 and not 17.
.*(\d{1,2}).*
Can anyone help me with that please?
Overview
Given your pattern:
.*(\d{1,2}).*
This works in the following way:
.* Match any character any number of times
The quantifier here is considered to be greedy because it will match as many characters as possible so long as the pattern matches the string.
\d{1,2} Since your pattern says to match 1 or 2 digits and the previous token is greedy, the regex is just going to match a single digit because this still satisfies the pattern (the previous token stole the first digit).
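The behaviour described above can be observed with a short Python sketch. The extra capturing group around the first .* is not part of the original pattern; it is added here only to show what the greedy token consumed.

```python
import re

text = "fherfrefercr17hfeuetvbyeituew"

# Original pattern: the capture group ends up with a single digit.
original = re.match(r".*(\d{1,2}).*", text)
print(original.group(1))  # 7

# Same pattern with the leading .* captured too (illustration only).
shown = re.match(r"(.*)(\d{1,2}).*", text)
print(shown.group(1))  # fherfrefercr1
print(shown.group(2))  # 7
```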
Code
There are multiple ways you can fix this issue
Method 1
This will simply extract all numbers (1+ digits) from the string. If you want to only match 1 or two digits use \d\d? or \d{1,2} instead.
\d+
\d\d?
\d{1,2}
Method 2
This method turns the greedy quantifier * (in .*) into a lazy quantifier .*?. This will match any character any number of times, but as few as possible. The drawback to this method is that it's expensive because the engine needs to backtrack.
.*?\d{1,2}.*
Method 3
This method matches any non-digit character any number of times, then it matches one or two digits. This is likely the solution you're looking for.
\D*(\d{1,2}).*
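Here is how the three methods could look in Python, as a sketch on the string from the question. For Method 2 a capturing group is added around \d{1,2} (an assumption for the example, since the pattern above has none) so that the number can be read back.

```python
import re

text = "fherfrefercr17hfeuetvbyeituew"

# Method 1: search for the digits directly, no surrounding wildcards.
method1 = re.search(r"\d{1,2}", text).group()

# Method 2: lazy .*? stops at the first position where the digits can match.
method2 = re.match(r".*?(\d{1,2}).*", text).group(1)

# Method 3: \D* cannot consume digits, so the group gets both of them.
method3 = re.match(r"\D*(\d{1,2}).*", text).group(1)

print(method1, method2, method3)  # 17 17 17
```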
I would like to check if a phone number contains exactly 3 digits - dot - 3 digits - dot - 3 digits. (e.g. 123.456.789)
So far I have this, but it doesn't work:
^(\d{3}\){2}\d{4}$
Note that an escaped bracket \) loses its special meaning in regex and the pattern becomes invalid since the capturing group is not closed.
If you want to match a dot with a regex, you need to include it in your pattern, and if you say 3 digits must be at the end, there is no point in declaring 4 digits with \d{4}.
^(\d{3}\.){2}\d{3}$
        ^       ^
or if we expand the first group:
^\d{3}\.\d{3}\.\d{3}$
So the whole fix consists of adding a dot after the second backslash and adjusting the final limiting quantifier.
Note that for mostly stylistic reasons (since the efficiency gain is insignificant) I'd use a non-capturing group with the first regex variant:
^(?:\d{3}\.){2}\d{3}$
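As a usage sketch, the corrected pattern could be wrapped in a small Python helper. The function name is_valid_phone is hypothetical, and the question does not say which language is used.

```python
import re

# Three digits, dot, three digits, dot, three digits.
PHONE = re.compile(r"^(?:\d{3}\.){2}\d{3}$")

def is_valid_phone(s: str) -> bool:
    """Return True if s has the form 123.456.789."""
    return PHONE.match(s) is not None

print(is_valid_phone("123.456.789"))   # True
print(is_valid_phone("123.456.7890"))  # False
print(is_valid_phone("123-456-789"))   # False
```

In Python, $ also matches just before a trailing newline; re.fullmatch without the anchors is an alternative if that matters.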
I'm trying to come up with some regex to match against 1 hyphen per any number of digit groups. No characters ([a-z][A-Z]).
123-356-129811231235123-1235612346123451235
/[^\d-]/g
The one above will match the string below, but it will let the following go through:
1223--1235---123123-------
I was looking at the following post How to match hyphens with Regular Expression? for an answer, but I didn't find anything close.
@Konrad Rudolph gave a good example.
Regular expression to match 7-12 digits; may contain space or hyphen
This tool is useful for me http://www.gskinner.com/RegExr/
Assuming it can't ever start with a hyphen:
^\d(-\d|\d)*$
broken down:
^ # match beginning of line
\d # match single digit
(-\d|\d)* # match hyphen & digit or just a digit (0 or more times)
$ # match end of line
That makes every hyphen have to have a digit immediately following it. Keep in mind though, that the following are examples of legal patterns:
213-123-12314-234234
1-2-3-4-5-6-7
12234234234
Alternatively:
^(\d+-)+(\d+)$
So it's one or more group(s) of digits followed by hyphen + final group of digits.
Nothing very fancy, but in my tests it matched only when there were hyphen(s) with digits on both sides.
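Both suggested patterns can be compared on the strings from the question. This is a Python sketch under the assumption that the target flavor treats them the same way; the names are illustrative only. Note that the second pattern requires at least one hyphen, so a plain run of digits is rejected by it.

```python
import re

digit_or_hyphen_digit = re.compile(r"^\d(-\d|\d)*$")
groups_with_hyphen = re.compile(r"^(\d+-)+(\d+)$")

samples = [
    "123-356-129811231235123-1235612346123451235",
    "1223--1235---123123-------",
    "1-2-3-4-5-6-7",
    "12234234234",
]

# For each sample: (matches first pattern, matches second pattern).
verdicts = {
    s: (bool(digit_or_hyphen_digit.match(s)), bool(groups_with_hyphen.match(s)))
    for s in samples
}
for s, v in verdicts.items():
    print(s, v)
```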