Use Regular Expressions to find URLs without certain word patterns

I am trying to write a regular expression that matches URLs that don't contain a certain pattern. The URLs I want to match shouldn't have an ID in them, where an ID is 40 uppercase hex characters.
For example, if I have the following URLs:
/dev/api/appid/A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5/users
/dev/api/apps/list
/dev/api/help/apps/applicationname/apple/osversion/list/
(urls are made up, but the idea is that there are some endpoints with 40-length IDs, and some endpoints that don't, and some endpoints that are really long in total characters)
I want to make sure that the regular expression is only able to match the last 2 URLs, and not the first one.
I wrote the following regex,
\S+(?:[0-9A-F]{40})\S+
and it matches endpoints that do have the long ID in them, but skips over the ones that should be filtered. If I try to negate the regex,
\S+(?![0-9A-F]{40})\S+
it matches all endpoints, because some URLs are longer than the ID itself (40 characters).
How can I use a regular expression to filter out exactly the URLs I need?

Try this regex:
^(?!.*\/[0-9A-F]{40}\/).*$
Explanation:
^ - asserts the start of the string/url
(?!.*\/[0-9A-F]{40}\/) - negative lookahead that checks for a / followed by exactly 40 uppercase hex characters followed by another / anywhere in the string. Since it is a negative lookahead, any string/url containing this pattern will not be matched.
.* - matches 0+ occurrences of any character except a newline character
$ - asserts the end of the string
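As a rough check, the pattern can be run against the three sample URLs from the question. This is a minimal Python sketch; the names NO_ID, urls and kept are made up for illustration, and Python's re module does not need the forward slashes escaped.

```python
import re

# Reject any URL containing a path segment of exactly 40 uppercase hex characters.
NO_ID = re.compile(r"^(?!.*/[0-9A-F]{40}/).*$")

urls = [
    "/dev/api/appid/A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5/users",
    "/dev/api/apps/list",
    "/dev/api/help/apps/applicationname/apple/osversion/list/",
]

# Keep only the URLs the pattern matches, i.e. the ones without an ID segment.
kept = [u for u in urls if NO_ID.match(u)]
print(kept)
```

Note that the lookahead requires a / after the ID, so a URL that ends with the ID and has no trailing slash would still be kept; replacing the final \/ with (?:\/|$) is one way to cover that case.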

^((?![A-F0-9]{40}).)*$
Uses a negative lookahead to match any line that doesn't have 40 hex digits in a row.
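A small Python sketch of this second pattern, under the assumption that filtering in code is an option. The names TEMPERED, HAS_ID and samples are hypothetical. It also compares the pattern with the simpler approach of searching for the ID and negating the result, which should give the same answer for single-line input.

```python
import re

# The lookahead is tested before every character, so no position in the
# string may start a run of 40 uppercase hex digits.
TEMPERED = re.compile(r"^((?![A-F0-9]{40}).)*$")

# Alternative: search for the ID directly and negate the result in code.
HAS_ID = re.compile(r"[A-F0-9]{40}")

hex_id = "A1B2C3D4E5" * 4  # 40 uppercase hex characters
samples = [
    "/dev/api/appid/" + hex_id + "/users",
    "/dev/api/appid/" + hex_id,  # ID at the very end, no trailing slash
    "/dev/api/apps/list",
]

tempered = [bool(TEMPERED.match(s)) for s in samples]
negated = [HAS_ID.search(s) is None for s in samples]
print(tempered, negated)
```

Unlike the slash-delimited pattern, this one also rejects a URL that ends with the ID, and it rejects any hex run longer than 40 characters as well.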

Related

No period in first part of regular expression

This is what I'm currently working with:
((?i)(\w|^){0,25}[0-9]{3})[^\.]*#(gmail)\.com
What I'm attempting to do is block any email that is any amount of characters but with 3 numbers trailing the characters.
This works. HOWEVER, when Google creates a username for people, it usually chooses firstname.lastname####gmail.com. I don't want an email with a period before the #gmail.com to be included.
I have played and played with this expression, and I can't get it. So for example john.doe123#gmail.com, the expression is tagging everything after the period. I need for the regex to check the ENTIRE email and check to see if it follows the expression. I know there is this tidbit ^[^\.]*$ but I have no idea where to put it.

You could match 0-25 word characters followed by 3 digits \w{0,25}[0-9]{3} and use anchors to assert the start ^ and the end $ of the string.
^\w{0,25}[0-9]{3}#gmail\.com$
If you want to use a negated character class, you could match 0-25 characters that are not whitespace, # or a dot, followed by 3 digits, using [^\s#.]{0,25}[0-9]{3}
^[^\s#.]{0,25}[0-9]{3}#gmail\.com$
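The two anchored patterns behave slightly differently, which a short Python sketch can show. The sample addresses and the names WORD and NEGATED are invented for illustration; the # separator is kept exactly as written in the question.

```python
import re

WORD = re.compile(r"^\w{0,25}[0-9]{3}#gmail\.com$")
NEGATED = re.compile(r"^[^\s#.]{0,25}[0-9]{3}#gmail\.com$")

samples = [
    "johndoe123#gmail.com",   # three trailing digits, no dot
    "john.doe123#gmail.com",  # dot in the local part
    "john-doe123#gmail.com",  # hyphen is not a word character
    "johndoe12#gmail.com",    # only two trailing digits
]

for s in samples:
    print(s, bool(WORD.match(s)), bool(NEGATED.match(s)))
```

Because both patterns are anchored with ^ and $, the dotted address is rejected as a whole instead of being matched from the dot onward. The only sample where the two disagree is the hyphenated one, since \w excludes the hyphen while [^\s#.] allows it.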

Regex about url encoded string

I would like to write a regex to get the URL-encoded string in the line below:
<topicref href="%E4%BA%B0.txt"/>
When I used a regex like (%[A-Z][0-9])+\.txt it only got %B0.txt. What can I do if I want to get the whole URL-encoded string, such as %E4%BA%B0.txt?
Thanks a lot.

Proper URL encoding uses hex digits only, A-F not A-Z. The encoded URL could contain non-encoded characters anywhere. Also, you should escape the full stop.
((%[0-9A-F]{2}|[^<>'" %])+)\.txt
is a quick ad-hoc fix for your regex, though obviously for any production code, probably don't use a regex for this at all, or at the very least try a well-defined and properly tested URL regex like the one you can find in the HTTP RFC.
Putting the + quantifier outside the capturing parentheses will only return the last repetition. I added a second set of parentheses to put the quantifier inside the first capture group, which assumes you are doing something to extract the first capture group in particular. (If your regex dialect has non-capturing groups, you could make the inner group non-capturing by changing its opening parenthesis from ( to (?: .)
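To illustrate the difference, here is a small Python sketch that runs the question's pattern and the fixed one on the sample line. It uses the non-capturing variant for the inner group; the variable names are hypothetical, and decoding with urllib.parse.unquote is an extra step added only to show what the captured text stands for.

```python
import re
from urllib.parse import unquote

line = '<topicref href="%E4%BA%B0.txt"/>'

# Pattern from the question: %BA fails because A is not a digit,
# so only the last escape before .txt is matched.
original = re.search(r"(%[A-Z][0-9])+\.txt", line)

# Fixed pattern, with the quantifier inside the capture group.
fixed = re.search(r"""((?:%[0-9A-F]{2}|[^<>'" %])+)\.txt""", line)

print(original.group(0))
print(fixed.group(1))
print(unquote(fixed.group(1)))  # decoded file name without the extension
```

The first group holds the encoded name without the extension; group(0) gives the full %E4%BA%B0.txt.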

You need to change your regex to
([%\dA-Z]+)\.txt
([%\dA-Z]+) - Match %, digits and uppercase letters one or more times
\.txt - Match .txt
whereas your regex means
(%[A-Z][0-9])+.txt
(%[A-Z][0-9])+
% - Match %
[A-Z] - Match A to Z one time
[0-9] - Match any digit one time
+ - Match the captured group one or more times
.txt - Match single character (anything except new line) followed by txt

Regular expression matching only URLs not having hyphen and four digits at the end

I am trying to create IIS rewrite rule for product URLs. The regular expression for this rule should be matched only by URLs like this:
catalog/products/gl1800-airbag.aspx
or
catalog/products/cab2.aspx.
URLs like
catalog/products/gl1800-airbag-2007.aspx
or
catalog/products/cab2-2007.aspx
should not be matched. It doesn't matter how many hyphens the last part of the URL has; it just should not end with something like "-0000" (a year).
I am not good at regular expressions and managed to get only to this:
catalog/products/([^/-0-9]+)\.aspx$
The second URL matches it, but the first does not. I'm not sure how to set the number of digits here, or even whether my regex is correct.

You can use
catalog/products/(?![^/]*\d{4}\.)([^/]+)\.aspx$
The [^/]+ will match 1 or more characters other than a / and the (?![^/]*\d{4}\.) negative lookahead will fail the match once it finds 4 digits right before a dot (.).
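The four URLs from the question make a convenient test set. The sketch below uses Python's re module as a stand-in for the IIS rewrite engine, so treat it as an approximation of how the rule would behave; PRODUCT and slug are hypothetical names.

```python
import re

PRODUCT = re.compile(r"catalog/products/(?![^/]*\d{4}\.)([^/]+)\.aspx$")

def slug(url):
    """Return the captured product name, or None if the URL is rejected."""
    m = PRODUCT.search(url)
    return m.group(1) if m else None

urls = [
    "catalog/products/gl1800-airbag.aspx",
    "catalog/products/cab2.aspx",
    "catalog/products/gl1800-airbag-2007.aspx",
    "catalog/products/cab2-2007.aspx",
]

for u in urls:
    print(u, "->", slug(u))
```

The lookahead does not require the hyphen, so a name such as cab2007.aspx would be rejected too. If only a hyphen followed by four digits should be excluded, (?![^/]*-\d{4}\.) is one possible variation.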

Regex to match number specific number in a string

I'm trying to fix a regex I created.
I have a URL like this:
http://www.demo.it/prodotti/822/Panasonic-TXP46G20E.html
and I have to match the product ID (822).
I wrote this regex
(?<=prodotti\/).*(?<=\/)
and the result is "822/"
What I want to match is always a group of digits between two slashes.

You're almost there!
Simply use:
(?<=prodotti\/).*?(?=\/)
instead of:
(?<=prodotti\/).*(?<=\/)
And you're good ;)
I've actually just changed two things:
replaced that lookbehind of yours ((?<=\/)) by its matching lookahead... so it asserts that we can match a / AFTER the last character consumed by .*.
changed the greediness of your matching pattern, by using .*? instead of .*. Without that change, for a URL that has several / following prodotti/, the match wouldn't have stopped at the first one.
i.e., given the input string: http://www.demo.it/prodotti/822/Panasonic/TXP46G20E.html, it would have matched 822/Panasonic.
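The effect of both changes can be seen in a short Python sketch. The variable names are made up, and the slashes are left unescaped since Python does not need them escaped.

```python
import re

LAZY = re.compile(r"(?<=prodotti/).*?(?=/)")
GREEDY = re.compile(r"(?<=prodotti/).*(?=/)")

one_slash = "http://www.demo.it/prodotti/822/Panasonic-TXP46G20E.html"
two_slashes = "http://www.demo.it/prodotti/822/Panasonic/TXP46G20E.html"

# The lazy version stops at the first / after the ID in both URLs.
print(LAZY.search(one_slash).group(0))
print(LAZY.search(two_slashes).group(0))

# The greedy version runs on to the last / it can find.
print(GREEDY.search(two_slashes).group(0))
```

Since the ID is always numeric, (?<=prodotti/)\d+ would be another way to express the same intent without relying on laziness.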

Regular expression to match last number in a string

I need to extract the last number that is inside a string. I'm trying to do this with regex and negative lookaheads, but it's not working. This is the regex that I have:
\d+(?!\d+)
And these are some strings, just to give you an idea, and what the regex should match:
ARRAY[123] matches 123
ARRAY[123].ITEM[4] matches 4
B:1000 matches 1000
B:1000.10 matches 10
And so on. The regex matches the numbers, but all of them. I don't get why the negative lookahead is not working. Anyone care to explain?

Your regex \d+(?!\d+) says "match any number if it is not immediately followed by a number", which is incorrect. A number is last if it is not followed (anywhere after it, not just immediately) by any other number.
When translated to regex we have:
(\d+)(?!.*\d)
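Running both patterns over the strings from the question shows the difference. This is a Python sketch with hypothetical names (LAST, ORIGINAL, cases), not the exact environment the question was asked in.

```python
import re

LAST = re.compile(r"(\d+)(?!.*\d)")
ORIGINAL = re.compile(r"\d+(?!\d+)")

cases = {
    "ARRAY[123]": "123",
    "ARRAY[123].ITEM[4]": "4",
    "B:1000": "1000",
    "B:1000.10": "10",
}

for text, expected in cases.items():
    print(text, LAST.search(text).group(1), expected)

# The original pattern accepts every number in the string.
print(ORIGINAL.findall("ARRAY[123].ITEM[4]"))
```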

I took it this way: you need to make sure the match is close enough to the end of the string; close enough in the sense that only non-digits may intervene. What I suggest is the following:
/(\d+)\D*\z/
\z at the end means that that is the end of the string.
\D* before that means that an arbitrary number of non-digits can intervene between the match and the end of the string.
(\d+) is the matching part. It is in parentheses so that you can pick it up, as was pointed out by Cameron.
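The pattern above uses the Ruby/Perl spelling \z. As a sketch of the same idea in Python, whose re module writes the strict end-of-string anchor as \Z, something like the following should work; TRAILING and last_number are hypothetical names.

```python
import re

# (\d+) captures the digits, \D* allows trailing non-digits,
# and \Z anchors at the very end of the string.
TRAILING = re.compile(r"(\d+)\D*\Z")

def last_number(text):
    """Return the last run of digits in text, or None if there is none."""
    m = TRAILING.search(text)
    return m.group(1) if m else None

for text in ["ARRAY[123]", "ARRAY[123].ITEM[4]", "B:1000", "B:1000.10"]:
    print(text, last_number(text))
```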

You can use
.*(?:\D|^)(\d+)
to get the last number; this is because the matcher will gobble up all the characters with .*, then backtrack to the first non-digit character or the start of the string, then match the final group of digits.
Your negative lookahead isn't working because on the string "1 3", for example, the 1 is matched by the \d+, then the space matches the negative lookahead (since it's not a sequence of one or more digits). The 3 is never even looked at.
Note that your example regex doesn't have any groups in it, so I'm not sure how you were extracting the number.
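The "1 3" example can be reproduced directly. The sketch below is in Python with made-up names, and shows both the original pattern stopping at the first number and the backtracking pattern reaching the last one.

```python
import re

BACKTRACK = re.compile(r".*(?:\D|^)(\d+)")

def last_number(text):
    """Return the last run of digits via greedy .* plus backtracking."""
    m = BACKTRACK.search(text)
    return m.group(1) if m else None

# The original pattern: the lookahead succeeds at the space after "1",
# so the first match is "1" and the search is already done.
print(re.search(r"\d+(?!\d+)", "1 3").group(0))

# The backtracking pattern reaches the final group of digits.
print(last_number("1 3"))
print(last_number("B:1000.10"))
print(last_number("123"))  # the ^ alternative handles a string that is all digits
```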

I still had issues with managing the capture groups
(for example, if using Inline Modifiers (?imsxXU)).
This worked for my purposes -
.(?:\D|^)\d(\D)
