JMeter Proxy exclusion patterns still being recorded - regex

I am using JMeter to record traffic in my browser. My URL Patterns to Exclude are:
.*\.jpg
.*\.js
.*\.png
These look like they should block those file types (I've even tested them with a regex tester).
Yet I still see plenty of these files being recorded. In a related forum thread someone had a similar issue, but his was caused by additional URL parameters (e.g. www.website.com/image.jpg?asdf=thisdoesntmatch). However, that doesn't seem to be the case here. Can anyone point me in the right direction?

As already mentioned in the question comments, it is probably a problem with trailing characters. The pattern matcher is executed against the complete URL, including parameters.
So a URL like http://example.com/layout.css?id=123 is not matched by the pattern .*\.css. The JMeter HTTP Request Sampler separates the Path and the Parameters, so this might not be obvious when you look at the URL.
Solution: change the pattern to allow trailing characters: .*\.css.*
Explained:
.*   any characters (zero or more)
\.   a literal . (dot)
css  the character sequence css
.*   any characters (zero or more)
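To illustrate the difference, here is a minimal Python sketch (the URL is just an example) of how a pattern that is anchored against the complete URL behaves:

```python
import re

url = "http://example.com/layout.css?id=123"

# JMeter applies exclude patterns against the complete URL,
# effectively requiring a full match.
print(re.fullmatch(r".*\.css", url))    # None -> not excluded, still recorded
print(re.fullmatch(r".*\.css.*", url))  # match -> excluded
```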

Maybe you can do the opposite: leave the URL Patterns to Exclude blank and negate those patterns in the URL Patterns to Include box:
(?!.*\.(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.*)
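A quick way to sanity-check that include pattern (a sketch with invented URLs, again assuming the matcher sees the complete URL):

```python
import re

include = re.compile(r"(?!.*\.(bmp|css|js|gif|ico|jpe?g|png|swf|woff))(.*)")

for url in ("http://example.com/page.html",
            "http://example.com/logo.png",
            "http://example.com/app.js?v=2"):
    # The negative lookahead rejects any URL that contains a dot
    # followed by one of the listed static-file extensions.
    print(url, "->", "include" if include.fullmatch(url) else "skip")
```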

Related

What regex in Google Analytics to use for this case?

I'm trying to figure out which landing-page regex to use to show only the URLs that have exactly two subfolders, and hide the ones that have 3+ subfolders. Any advice on how to do this in GA with regex?
Cheers
If you want to match a path having only two components, e.g.
/component1/component2/
Then you may use the following regex:
/[^/]+/[^/]+/
If your regex tool requires anchors, then add them:
^/[^/]+/[^/]+/$
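A minimal sketch of the anchored version in Python (the sample paths are made up):

```python
import re

pattern = re.compile(r"^/[^/]+/[^/]+/$")

for path in ("/shop/shoes/",      # two subfolders -> match
             "/shop/shoes/red/",  # three subfolders -> no match
             "/shop/"):           # one subfolder -> no match
    print(path, bool(pattern.match(path)))
```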
Is this what you are looking for?
^\/[!#$&-;=?-\[\]_a-z~]+\/[!#$&-;=?-\[\]_a-z~]+\/$
The two character classes cover the characters that are valid in a URL. We're also forcing the regex to start with a slash, end with a slash, and have exactly one slash in between.

How can I use regular expression to match urls starting with https and ending with #?

I'm very much a newbie with regex and having a hard time figuring this one out. I have an HTML document and I want to clear out a ton of URLs that are inside it. All of the URLs begin with https:// and they all end with a pound sign #.
Any help would be greatly appreciated. I'm using Sublime Text as my editor, in case that matters.
A basic way to do it:
\bhttps://[^\s#]+#
free-spaced:
\b        // word boundary
https://  // the literal scheme
[^\s#]+   // followed by anything but whitespace and '#'
#         // the closing pound sign
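In Sublime Text you can enable regex mode in Find → Replace and use the pattern directly; the same idea as a Python sketch (the sample text is invented):

```python
import re

text = "Link: https://example.com/page# and https://foo.bar/x?y=1# end."

# Remove every URL that starts with https:// and runs up to a '#'.
cleaned = re.sub(r"\bhttps://[^\s#]+#", "", text)
print(cleaned)  # -> "Link:  and  end."
```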
If you truly want to clear everything from https:// through to the # then you can use:
^(https)+(.)*(#)+$
But you may want to be more specific about what you are filtering out. If this comes from a database query you should be OK, since you can assume the URL will be the content of the field(s) returned, and you will be running the regex through a code loop of some kind.
BTW you can hone your scripts using something like http://regexpal.com/

RegEx to find specific URL structure

I have the following URLs
http://mysite/us/product.aspx
http://mysite/de/support.aspx
http://mysite/spaces/product-space
http://mysite/spaces/product-space/forums/this is my topic
http://mysite/spaces/product-space/forums/here is another topic
http://mysite/spaces/support-zone
http://mysite/spaces/support-zone/forums/yet another topic
http://mysite/spaces/internal
http://mysite/spaces/internal/forums/final topic
http://mysite/support/product/default.aspx
I want to add a Crawl Rule (This is SharePoint 2010 search related) using RegEx that excludes the URLs that don't include /forums/*, leaving only the forum topic URLs.
I want a rule that excludes the URLs for ../spaces/space1 and ../spaces/space2 but leaves all others intact, including the URLs containing /forums/
i.e. here are the results I want to identify with the regex (which will be used in an 'exclude' rule in SharePoint Search):
http://mysite/spaces/product-space
http://mysite/spaces/support-zone
http://mysite/spaces/internal
leaving these results not matched by the regex (and therefore not excluded by this rule)
http://mysite/us/product.aspx
http://mysite/de/support.aspx
http://mysite/spaces/product-space/forums/this is my topic
http://mysite/spaces/product-space/forums/here is another topic
http://mysite/spaces/support-zone/forums/yet another topic
http://mysite/spaces/internal/forums/final topic
http://mysite/support/product/default.aspx
Can someone help me out? I've been looking at this all morning and my head is starting to hurt - I can't explain it, I just don't get regular expression structures.
Thanks
Kevin
You can use a lookahead to assert that /forums/ is in the URL (matches if present):
^(?=.*/forums/)
Or a negative lookahead to assert that it's not present:
^(?!.*/forums/)
Update:
This regex will match the URLs you have in the "exclude" list:
^(?!.*/forums/).*/spaces/(?:space1|space2)
In short, we exclude all URLs containing /forums/ using a negative lookahead, then we match anything containing /spaces/space1 or /spaces/space2.
Some systems require you to match the entire line however, in which case you would need to add a .* at the end:
^(?!.*/forums/).*/spaces/(?:space1|space2).*
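A quick sanity check in Python (the URLs are illustrative stand-ins using the placeholder space names):

```python
import re

exclude = re.compile(r"^(?!.*/forums/).*/spaces/(?:space1|space2).*")

urls = [
    "http://mysite/spaces/space1",                  # space root -> excluded
    "http://mysite/spaces/space1/forums/my topic",  # forum topic -> kept
    "http://mysite/us/product.aspx",                # unrelated -> kept
]
for url in urls:
    print(url, "->", "exclude" if exclude.match(url) else "keep")
```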
In multi-line mode (assuming one URL per line), this did the trick for me:
(.*?\/forums\/.*?)$
Hope this helps
UPDATE:
Given your comment, the pattern to use could be:
.*/spaces/(?!.*/).*
This basically says: match lines that have /spaces/ but no further / after it (the criterion you stated in your comment).
Using @rvalvik's regex suggestion (a different way that is also very nice), your answer would look like:
^(?!.*/forums/).*/spaces/.*

Regex for excluding URL

I'm working with an email company that has a feature where they spider your site in order to provide custom content. I have the ability to have the spider ignore URLs based on the regex patterns I provide.
For this system a pattern starts and ends with a "/".
What I'm trying to do is ignore http://www.website.com/2011/10 BUT allow http://www.website.com/2011/10/title-of-page.html
I would have thought the pattern below would work, since it does not have a trailing slash, but no luck.
Any ideas?
/http:\/\/www\.website\.com\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]/
Your regex matches a part of the URL, so you need to tell it not to allow a slash to follow it:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9](?!\/)/
If you want to also avoid other partial matches like in http://www.website.com/2011/100, then an additional word boundary might help:
/http:\/\/www\.website\.com\/[0-9]{4}\/[0-9][0-9]\b(?!\/)/
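Here is a small Python check of both variants against the asker's two URLs plus the partial-match case (illustrative only):

```python
import re

basic   = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9](?!/)")
bounded = re.compile(r"http://www\.website\.com/[0-9]{4}/[0-9][0-9]\b(?!/)")

urls = [
    "http://www.website.com/2011/10",                     # should be ignored
    "http://www.website.com/2011/10/title-of-page.html",  # should be allowed
    "http://www.website.com/2011/100",                    # partial-match trap
]
for url in urls:
    # basic matches the trap URL; bounded does not.
    print(url, bool(basic.search(url)), bool(bounded.search(url)))
```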
It depends on the regex engine, but you can probably either use $ (if the URL is tokenised beforehand) or match against whitespace and delimiters.

A URL that contains all valid characters to test my regex pattern?

First of all, I created my own regex to find all URLs in a text, because:
When I searched SO and Google, I only found regexes for specific URL constructions, like images, etc.
I found a pretty complete regex in PHP's manual itself (see the "splattermania at freenet dot de 01-Oct-2009 12:01" post at http://php.net/manual/en/function.preg-match.php) that can find almost anything that resembles a URL, even something as short as "bit.ly".
This pattern has a few errors and constraints, so I'm fixing and enhancing it.
Now the pattern structure seems right, but I'm not sure all valid characters are covered. Please post sample URLs to test my pattern against. It might be laziness, but I don't want to read pages and pages of references to find all of them; I need to focus on the development. If you have a summary of the valid characters for username, password, path, query and anchor that you can share, that would be very helpful.
Best Regards!
The pattern you linked to does indeed match a lot of URLs, both valid and invalid. It's not really a surprise since nearly everything in that regex is optional; as you wrote yourself, it even matches bit.ly, so it's easy to see how it would match lots of non-URL stuff.
It doesn't take new Unicode domain names into account, for one (e.g., http://www.müller.de).
It doesn't match valid URLs like
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
It doesn't match relative paths (might not be necessary, though) like /cgi-bin/version.pl.
It doesn't match mailto: links.
It doesn't match URLs like http://1.2.3.4. Don't even ask about IPv6 :)
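To see how easily such patterns break, here is a small sketch that runs a deliberately naive URL regex (an illustrative stand-in, not the pattern from the PHP manual) against a few of the cases above:

```python
import re

# A deliberately naive URL matcher, for illustration only.
naive = re.compile(r"https?://[A-Za-z0-9.-]+(?:/[A-Za-z0-9._/-]*)?")

tests = [
    "http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx",  # parentheses
    "http://www.müller.de",                                          # Unicode domain
    "http://1.2.3.4",                                                # IP address
]
for url in tests:
    m = naive.match(url)
    # Shows what the naive pattern actually captures vs. the full URL.
    print(repr(url), "->", repr(m.group(0)) if m else None)
```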
All in all, regular expressions are NOT the right tool to reliably match or validate URLs. This is a job for a parser. If you can live with many false positives and false negatives, then regexes are fine.
Please read Jan Goyvaerts' excellent essay on this subject: Detecting URLs in a block of text.