Mod Rewrite RegEx To Match Only If Previous Subset Matched - regex

I am trying to make what I think is a simple regex for use with mod_rewrite.
I've tried various expressions, many of which I thought were promising, but all of which ultimately failed for one reason or another. They all also seem to fail once I add start/end string delimiters.
For example, ^user/(\d{1,10})(?=/)$ was one I tried, but among other things, it seems to group the trailing slash, and I only want to group the digits. I think I need to use a positive lookbehind, but I'm having difficulty because it's looking behind at a group.
What I am trying to match is strings that 1) begin with "user/" and 2) possibly end with (\d{1,10})/ (1 to 10 digits followed by a single slash)
Should Match:
user/
user/123/
user/1234567890/
Should not match:
user
user//
user/-4/
user/35.5/
user/123
user/123//
user/123/5/
user/12345678901/
Edit: Sorry about the formatting; I do not understand how to format anything via this markdown. Those examples are preceded by 4 spaces which I thought should make a code block, but obviously I thought wrong.

^user/(?:([0-9]{1,10})/)?$ should work just fine.

This: ^user(?=/)(/\d{1,10})?/$ Edit: if you want to group digits, ^user(?=/)(?:/(\d{1,10}))?/$

Related

Regex everything after, but not including

I am trying to regex the following string:
https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop
I want only B00VU2BZRO.
This substring is always going to be a 10 characters, alphanumeric, preceded by dp/.
So far I have the following regex:
[d][p][\/][0-9B][0-9A-Z]{9}
This matches dp/B00VU2BZRO
I want to match only B00VU2BZRO with no dp/
How do I regex this?
Here is one regex option which would produce an exact match of what you want:
(?<=dp\/)(.*)(?=\/)
Demo
Note that this solution makes no assumptions about the length of the path fragment occurring after dp/. If you want to match a certain number of characters, replace (.*) with (.{10}), for example.
Depending on your language/method of application, you have a couple of options.
Positive look behind. This will make your regex more complicated, but will make it match what you want exactly:
(<=dp/)[0-9A-Z]{10}
The construct (<=...) is called a positive look behind. It will not consume any of the string, but will only allow the match to happen if the pattern between the parens is matched.
Capture group. This will make the regex itself slightly simpler, but will add a step to the extraction process:
dp/([0-9A-Z]{10})
Anything between plain parens is a capture group. The entire pattern will be matched, including dp/, but most languages will give you a way of extracting the portion you are interested in.
Depending on your language, you may need to escape the forward slash (/).
As an aside, you never need to create a character class for single characters: [d][p][\/] can equally well be written as just dp\/.

Regex: split number into optional first group of up to three then last group of up to three

I have two 1-6 digit numbers separated by a slash. I want these split up into groups of at most 3 digits, taking from the right.
For example:
0/1 -> [,0,,1]
1234/3 -> [1,234,,3]
12345/1234 -> [12,345,1,234]
123456/789123 -> [123,456,789,123]
I need to use a regular expression to do this because I want to do this for a location in NGINX. It's possible to do this with application logic but that is not the question due to performance.
Similar question which solves part of this was here using a negative lookahead: Regular expression to match last number in a string
What regex can achieve this split?
UPDATE:
This regex comes close to what I want (https://regex101.com/r/bQtNdK/3):
(?<prefix1>\d{0,3}?)(?<threes1>\d{0,3})\/(?<prefix2>\d{0,3}?)(?=\d)(?<threes2>\d{0,3})
It fails matching if the second number behind the slash is more than 3 digits long.
UPDATE2:
Now this regex works for most combinations (https://regex101.com/r/bQtNdK/5):
(?<prefix1>\d{0,3}?)(?<threes1>\d{1,3})\/(?<prefix2>\d{0,3})(?<threes2>\d{3})
I don't understand why this starts to fail if I use the same regex for prefix2/threes2 like prefix1/threes1 (i.e. make prefix2 also lazy). Any ideas how to solve this? So close...
I don't know that it's possible without the ability for the regex engine to remember all intermediate matches of a match group that matched an arbitrary number of times (.NET can do this, not sure what others). PCRE will apparently only remember the 'last' match for each group, other wise you could use something like this : (?<prefix1>\d{0,2})(?:(?<threes1>\d{3})*)\/(?<prefix2>\d{0,2})(?<threes2>\d{3})*\s
This regex seems to be correct now (regex101):
(?<prefix1>\d{0,3}?)(?<suffix1>\d{1,3})\/(?<prefix2>\d{0,3}?)(?<suffix2>\d{1,3})\/

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested about.
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top level domain followed by five digits. My problem arises because my existing pcre is matching on what I have described but much later in the URL then when I want it to. I want it to match on the first occurence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD when it has not bee preceeded at some point by the "/" character. I tried using negative-lookbehind but that doesn't work because that only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest you to simply preppend ^[^\/]*\. before your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase characters, then a / and at least 5 digits).
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Regex href match a number

Well, here I am back at regex and my poor understanding of it. Spent more time learning it and this is what I came up with:
/(.*)
I basically want the number in this string:
510973
My regex is almost good? my original was:
"/<a href=\"travis.php?theTaco(.*)\">(.*)<\/a>/";
But sometimes it returned me huge strings. So, I just want to get numbers only.
I searched through other posts but there is such a large amount of unrelated material, please give an example, resource, or a link directing to a very related question.
Thank you.
Try using a HTML parser provided by the language you are using.
Reason why your first regex fails:
[0-9999999] is not what you think. It is same as [0-9] which matches one digit. To match a number you need [0-9]+. Also .* is greedy and will try to match as much as it can. You can use .*? to make it non-greedy. Since you are trying to match a number again, use [0-9]+ again instead of .*. Also if the two number you are capturing will be the same, you can just match the first and use a back reference \1 for 2nd one.
And there are a few regex meta-characters which you need to escape like ., ?.
Try:
<a href=\"travis\.php\?theTaco=([0-9]+)\">\1<\/a>
To capture a number, you don't use a range like [0-99999], you capture by digit. Something like [0-9]+ is more like what you want for that section. Also, escaping is important like codaddict said.
Others have already mentioned some issues regarding your regex, so I won't bother repeating them.
There are also issues regarding how you specified what it is you want. You can simply match via
/theTaco=(\d+)/
and take the first capturing group. You have not given us enough information to know whether this suits your needs.