How can I check if a string has multiple matching groups that are the same? - regex

Currently, I am filtering out URL paths using Regex (Python). A couple of the URL paths I have come across are irrelevant and I want to detect URLs that are like this.
For example:
/ugrad/honors/index.php/policies/sao/policies/overview/step-1-course-requirements.html
/ugrad/honors/index.php/overview/sao/overview/sao/policies/noodle.html
In the examples above, you can see that policies and overview are repeated both times.
How can I design a Regex function to detect if there are 2+ matching texts anywhere in a URL path?
I have attempted something like this, but I am unsure whether it is possible to detect 2+ matching texts anywhere in the string.
My attempt: \S+(\/.+)\1\S+

Capture a slash, followed by non-slashes, followed by a slash again. Then match anything and use a backreference to require the captured group a second time:
(\/[^\/]+\/).*\1
https://regex101.com/r/ygqRZc/1
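To try the pattern from Python, here is a minimal sketch. The helper name has_repeated_segment is made up for illustration, and the third path is an assumed example with no repeats. In a Python raw string the slashes need no escaping.

```python
import re

# Same pattern as in the answer; "/" needs no escaping in Python.
REPEATED_SEGMENT = re.compile(r"(/[^/]+/).*\1")


def has_repeated_segment(path: str) -> bool:
    """Return True if some /segment/ occurs at least twice in the path."""
    return REPEATED_SEGMENT.search(path) is not None


paths = [
    "/ugrad/honors/index.php/policies/sao/policies/overview/step-1-course-requirements.html",
    "/ugrad/honors/index.php/overview/sao/overview/sao/policies/noodle.html",
    "/ugrad/honors/index.php/policies/overview.html",  # assumed example without repeats
]
for p in paths:
    print(has_repeated_segment(p), p)
```

Because the capture includes both slashes, two identical segments that sit directly next to each other (such as /a/a/) share a slash and are not detected; that case would need a lookahead-based variant.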

Related

What regex in Google Analytics to use for this case?

I'm trying to figure out what landing page regex to use to show only URLs that have exactly two sub-folders, e.g. see image below: just show the green URLs but not the red ones, as they have 3+ subfolders. Any advice on how to do this in GA with regex?
Cheers
If you want to match a path having only two components, e.g.
/component1/component2/
Then you may use the following regex:
/[^/]+/[^/]+/
If your regex tool requires anchors, then add them:
^/[^/]+/[^/]+/$
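The logic of the anchored pattern can be checked quickly in Python. This is only a sketch of the pattern's behaviour: Google Analytics uses its own regex engine, and the helper name is_two_level is hypothetical.

```python
import re

# Exactly two non-empty components, each followed by a slash.
TWO_COMPONENTS = re.compile(r"/[^/]+/[^/]+/")


def is_two_level(path: str) -> bool:
    # fullmatch behaves like wrapping the pattern in ^...$
    return TWO_COMPONENTS.fullmatch(path) is not None


for path in ["/component1/component2/", "/a/b/c/", "/a/"]:
    print(path, is_two_level(path))
```

Without anchoring, the pattern would also match the first two components of a deeper path such as /a/b/c/, which is why the anchored form is the safer choice.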
Is this what you are looking for?
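^\/[!#$&-.0-;=?-\[\]_a-z~]+\/[!#$&-.0-;=?-\[\]_a-z~]+\/$
The two character classes contain the characters that are valid in a URL, with the square brackets escaped and the slash left out (a plain &-; range would include it). We're also forcing the regex to start with a slash, end with a slash, and have exactly one slash in between.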

Regex exclude .html from string

I am stuck with creation of regex matching my needs.
Regex should include beginning of the string but not the ending (.html)
Example:
company.html should convert into company
My attempt:
url(r'^(?P<page>.+\.html)$', some_view)
Is there any chance someone could advise me on this one? I need to prepare my Django URLs to expect over 50 company names, and this seems to be the easiest way to keep my code DRY.
Simply exclude it out of the capture group:
url(r'^(?P<page>.+)\.html$', some_view)
Here the capture group is (?P<page>.+), and \.html now sits outside it.
The parenthesized part that starts with (?P<var>...) is a named capture group: whatever that pattern matches is passed to the view as var.
You can add extra parts outside the capture group; they are required by the pattern but are not captured in the variable.
That being said, typically in Django apps, one does not add noise like extensions, etc. Why would you add weird characters to a URL that a non-technical person does not understand at all?
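The effect of moving \.html outside the group can be seen with Python's re module alone, without Django. A minimal sketch, where the helper name page_name is hypothetical:

```python
import re

# The extension is required by the pattern but sits outside the named group.
PAGE = re.compile(r"^(?P<page>.+)\.html$")


def page_name(filename):
    """Return the captured page name, or None if the pattern does not match."""
    match = PAGE.match(filename)
    return match.group("page") if match else None


print(page_name("company.html"))
print(page_name("company"))
```

In the URL pattern, Django passes the same named group to the view as the keyword argument page.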

JMeter Proxy exclusion patterns still being recorded

I am using JMeter to record traffic in my browser. In my URL Patterns to Exclude are:
.*\.jpg,
.*\.js,
.*\.png
Which looks like they should block these patterns (I've even tested it with a regex tester here)
Yet, I still see plenty of these files get pulled up. In a related forum someone had a similar issue, but his was caused by having additional url parameters afterwards (eg www.website.com/image.jpg?asdf=thisdoesntmatch). However this doesn't seem to be the case here. Can anyone point me in the right direction?
As already mentioned in the question comments, it is probably a problem with trailing characters. The pattern matcher is executed against the complete URL, including parameters.
So a URL such as http://example.com/layout.css?id=123 is not matched by the pattern .*\.css. The JMeter HTTP Request sampler separates the path and the parameters, so this might not be obvious when you look at the URL.
Solution: Change the pattern to allow trailing characters: .*\.css.*
Explained
.* Any sequence of characters (possibly empty)
\. A literal . (dot) character
css The character sequence css
.* Any sequence of characters (possibly empty)
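The difference can be reproduced outside JMeter. The sketch below uses Python's re.fullmatch to stand in for a matcher that must cover the whole URL, which is the behaviour described above; it is an approximation, not JMeter's own engine, and the helper name is_excluded is hypothetical.

```python
import re


def is_excluded(url: str, pattern: str) -> bool:
    """Treat a URL as excluded only if the pattern matches all of it."""
    return re.fullmatch(pattern, url) is not None


url = "http://example.com/layout.css?id=123"

print(is_excluded(url, r".*\.css"))    # pattern stops at ".css"
print(is_excluded(url, r".*\.css.*"))  # pattern tolerates trailing characters
```

The first pattern fails only because of the ?id=123 suffix; the same pattern does cover http://example.com/layout.css.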
Maybe you can do the opposite: leave URL Patterns to Exclude blank and negate those patterns in the URL Patterns to Include box:
(?!.*\.(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.*)

I want a regular expression that only matches domain names with one period in them

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete, it just doesn't sift out the multi-domain urls:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to sift out on multiple periods?
Instead of .{2,3}, you want something like this: [^.]{2,3} - this excludes the period (no need to escape as it has no special meaning in this context in a regular expression) from that particular match. Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that....
Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the second-to-last group matches a forward slash, a question mark, or the end of the string. It works together with the group before it, which does not allow a '.' in the match.
The final effect is that it only matches URLs with a two-part domain such as 'domain.com', with no limit on the length of either part.
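The same idea can be sketched in Python. The pattern below is a tightened variant written for illustration, not the exact expression above: it requires a scheme, exactly one dot in the host, and then optionally a path or query string. The helper name has_two_part_domain is hypothetical.

```python
import re

# Hypothetical tightened variant: scheme, "://", host with exactly one dot,
# then optionally "/" or "?" followed by anything.
ONE_DOT_HOST = re.compile(r"^[^:/]+://[^./?]+\.[^./?]+(?:[/?].*)?$")


def has_two_part_domain(url: str) -> bool:
    return ONE_DOT_HOST.match(url) is not None


for url in [
    "http://somedomain.com/folder/path",
    "http://domain.sub.other.com",
    "http://example.info",
]:
    print(url, has_two_part_domain(url))
```

Dots are still allowed after the first slash, so a URL such as http://somedomain.com/file.html is accepted.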

Regex for excluding URL

I am working with an email company that has a feature where they spider your site in order to provide custom content. I have the ability to have the spider ignore URLs based on the regex patterns I provide.
For this system a pattern starts and ends with a "/".
What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html
I would have thought the pattern below would work since it does not have a trailing slash but no luck.
Any ideas?
/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/
Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/
If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
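The behaviour of the last pattern can be checked in Python, as a rough sketch. The surrounding / delimiters belong to the spider's configuration syntax and are dropped here, and whether that system supports lookaheads and \b is an assumption; the helper name is_ignored is hypothetical.

```python
import re

# Month archive URL: four-digit year, two-digit month, then a word boundary
# and no slash directly afterwards.
ARCHIVE = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)")


def is_ignored(url: str) -> bool:
    return ARCHIVE.search(url) is not None


for url in [
    "http://www.website.com/2011/10",
    "http://www.website.com/2011/10/title-of-page.html",
    "http://www.website.com/2011/100",
]:
    print(url, is_ignored(url))
```

Only the first URL is ignored: the second is rejected by the negative lookahead, and the third by the word boundary.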
It depends on the regexp engine, but you can probably use either $ (if the URL is tokenised beforehand) or a match for whitespace and delimiters after the two digits.