Match a url that does not contain certain word - regex

I need some help with a regular expression that does not match URLs like this one:
/Common/Download.php?file=/path/to/file.pdf
but does match static URLs like this one:
/path/to/file.pdf
I have read many posts (also on this site), but nothing seems to work as expected.
Thanks for your help.
Lorenzo.
UPDATE
Sorry if this post was incomplete. Here is more information so I can get better help.
The regular expression must work with the Apache module mod_rewrite (and, if possible, also with the rewrite module of IIS, which may not be its exact name but is, as far as I know, compatible with the Apache one). It must redirect the matching static URLs (only the second type above) to a specific page.
Thanks again.
Lorenzo.

Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
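A quick way to check the pattern against the URLs from the question is a small script. This is only a sketch in Python (the question targets mod_rewrite, whose regex flavor may differ in details), and the helper name is_static_url is made up for illustration:

```python
import re

# Pattern from the answer above: a leading slash, one or more directories,
# then a file name with a three- or four-character extension.
STATIC_URL = re.compile(r"^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$")


def is_static_url(url: str) -> bool:
    """Return True when the whole URL looks like a static file path."""
    return STATIC_URL.match(url) is not None


if __name__ == "__main__":
    for url in (
        "/path/to/file.pdf",
        "/Common/Download.php?file=/path/to/file.pdf",
        "/path/to/some/really/buried/file.html",
        "/file.pdf",
    ):
        print(url, is_static_url(url))
```

Note that the pattern requires at least one directory, so a file directly under the root such as /file.pdf is not matched.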
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.

Another option could be to put the pattern of a forward slash followed by lowercase characters in a non-capturing group and repeat that group. Then match the file extension .pdf
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Non-capturing group (?:
Match a forward slash and one or more lowercase characters /[a-z]+
Close the non-capturing group and repeat it 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static url and maybe longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf you could for example use:
^(?:/\w+)+\.\w+$
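To compare the three variants side by side, here is a minimal Python sketch. The extra sample paths (such as /path/to/file_2.pdf) and the helper name matches are invented for illustration, and mod_rewrite may differ from Python's re module in details:

```python
import re

# The three patterns from this answer.
FIXED_DEPTH = re.compile(r"^(?:/[a-z]+){3}\.pdf$")          # exactly three lowercase segments
WORD_FILENAME = re.compile(r"^(?:/[a-z]+){2}/\w+\.pdf$")    # file name may use \w characters
ANY_DEPTH = re.compile(r"^(?:/\w+)+\.\w+$")                 # any depth, any extension


def matches(pattern: re.Pattern, url: str) -> bool:
    """Return True when the pattern matches the whole URL."""
    return pattern.match(url) is not None


if __name__ == "__main__":
    samples = [
        "/path/to/file.pdf",
        "/path/file.pdf",
        "/path/to/file_2.pdf",
        "/Common/Download.php?file=/path/to/file.pdf",
    ]
    for url in samples:
        print(url, [matches(p, url) for p in (FIXED_DEPTH, WORD_FILENAME, ANY_DEPTH)])
```

The last pattern is the most permissive of the three, but it still rejects the Download.php URL because of the query string after the extension.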

Related

Can I use negative lookahead and other conditions together in regex group?

I'm trying to match some URLs against another table using regex and - because the original source wasn't put together properly, I'm using a regex to clean them within the SQL.
As an example, the URLs might be /this-is-my-test-string/ or /this-is-my-test-string and the reference table is always of the form /this-is-my-test-string so using this regex works well to capture the matching part.
(\/[^\/)]*)\/?
However I've now come across some others with the form /this-is-my-test-string- and /this-is-my-test-string-/ which aren't as straightforward - I can't just add - to the exclusion as it's present in the rest of the string. From reading around - regex is not something I use regularly - a lookahead would seem to be the answer, but I can't work out how to include this in the expression.
Any help would be gratefully received.
You can use $ to anchor the end of the string, and use a non-greedy quantifier *? on the non-slash character set to allow -? to match a - from (or near) the end of the string:
(\/[^\/)]*?)-?\/?$
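As a rough check of how the lazy quantifier and the optional trailing - and / interact, here is a small Python sketch. The SQL engine from the question is unknown, so its regex flavor may behave differently, and the helper name clean_slug is hypothetical:

```python
import re

# Pattern from the answer: lazily capture the slug, then allow an optional
# trailing "-" and an optional trailing "/" right before the end of the string.
SLUG = re.compile(r"(\/[^\/)]*?)-?\/?$")


def clean_slug(url: str):
    """Return the captured slug, or None when nothing matches."""
    m = SLUG.search(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    for url in (
        "/this-is-my-test-string",
        "/this-is-my-test-string/",
        "/this-is-my-test-string-",
        "/this-is-my-test-string-/",
    ):
        print(url, "->", clean_slug(url))
```

All four forms reduce to /this-is-my-test-string, because the lazy group stops as soon as the remaining optional pieces can reach the end of the string.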

Regular expression to exclude tag groups or match only (.*) in between tags

I am struggling with this regex for a while now.
I need to match the text which is in between the <ns3:OutputData> data</ns3:OutputData>.
Note: after ns there could be 1 or 2 digits
Note: the data is in one line just as in the example
Note: the ... preceding and ending is just to mention there are more tags nested
My regex so far: (ns\d\d?:OutputData>)\b(.*)(\/\1)
Sample text:
...<ns3:OutputData>foo bar</ns3:OutputData>...
I have tried (?:(ns\d\d?:OutputData>)\b)(.*)(?:(\/\1)) in an attempt to exclude group 1 and 3.
I want to exclude the tags which are matched, so that only the text between them is returned.
Any help is much appreciated.
EDIT
There might be some regex interpretation issue with Grep Console for IntelliJ, in which I intend to use the regex.
Your regex is almost there. All you need to do is to make the inside-matcher non-greedy. I.e. instead of (.*) you can write (.*?).
Another, xml-specific alternative is the negated character-class: ([^<]*).
So, this is the regex: (ns\d\d?:OutputData>)\b(.*?)(\/\1)
Update
To make sure that the only group is the one that matches the text, you have to make it work without backreferences: (?:ns\d\d?:OutputData>)\b(.*?)<
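To see the difference between the greedy and the non-greedy version, here is a minimal Python sketch. The second element in the sample line (ns12) is an invented example, not taken from the question, and Grep Console may behave differently from Python's re module:

```python
import re

# Hypothetical input: two OutputData elements on a single line.
LINE = "<ns3:OutputData>foo</ns3:OutputData><ns12:OutputData>bar</ns12:OutputData>"

# Same pattern twice; only the quantifier inside the capture group differs.
GREEDY = re.compile(r"(?:ns\d\d?:OutputData>)\b(.*)<")
LAZY = re.compile(r"(?:ns\d\d?:OutputData>)\b(.*?)<")


def first_content(pattern: re.Pattern, text: str):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print("greedy:", first_content(GREEDY, LINE))
    print("lazy:  ", first_content(LAZY, LINE))
    print("all:   ", LAZY.findall(LINE))
```

The greedy version runs on to the last < on the line, while the lazy version stops at the first one, which is what the question asks for.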
Update 2
It's possible to match only the required parts, using lookbehind:
(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)
Explanation:
The two alternatives are almost identical. The only difference is the number of digits. This is important because some flavors support only fixed-length lookbehinds.
Checking alternative one, we put the starting tag into one lookbehind (?<=...) so it won't be included into the full match.
Then we match every character that is not < greedily: [^<]*. This will stop matching at the first closing tag.
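Python's re module is one of the flavors that only accepts fixed-length lookbehinds, so it makes a convenient test bed for the two-alternative form. A minimal sketch, with the capture groups dropped so the full match is the content; the ns12 sample and the helper name extract_output are made up:

```python
import re

# One fixed-length lookbehind per digit count, as in the regex above.
CONTENT = re.compile(
    r"(?<=ns\d:OutputData>)\b[^<]*"      # one digit after "ns"
    r"|(?<=ns\d\d:OutputData>)\b[^<]*"   # two digits after "ns"
)


def extract_output(text: str):
    """Return the text between the OutputData tags, or None if not found."""
    m = CONTENT.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(extract_output("...<ns3:OutputData>foo bar</ns3:OutputData>..."))
    print(extract_output("...<ns12:OutputData>baz</ns12:OutputData>..."))
```

Because of the \b, content that does not start with a word character (or is empty) would not be matched by this form.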
Essentially, you need a look behind and a look ahead with a back reference to match just the content, but variable length look behinds are not allowed. Fortunately, you have only 2 variations, so an alternation deals with that:
(?<=<(ns\d:OutputData>)).*?(?=<\/\1)|(?<=<(ns\d\d:OutputData>)).*?(?=<\/\2)
The entire match is the target content between the tags, which may contain anything (including left angle brackets etc).
Note also the reluctant quantifier .*?, so the match stops at the next matching end tag, rather than greedy .* that would match all the way to the last matching end tag.
This was the answer in my case:
(?<=(ns\d:OutputData)>)(.*?)(?=<\/\1)
The answer is based on the three solutions given by @WiktorStribiżew (in the comments).
The last one worked, and I have made a slight modification to it.
Thanks to all for the effort, and especially to @WiktorStribiżew!
EDIT
OK, yes @Bohemian, it does not match 2 digits. I forgot to update:
(?<=(ns\d{0,2}:OutputData)>)(.*?)(?=<\/\1)

Regex Extraction for Google Analytics Content Grouping

I'm attempting to set up Content Groupings using Extraction within Google Analytics.
I have URLs of the form http://www.ehattons.com/52674/Bachmann_Branchline_37_671_Pack_of_3_14_Ton_tank_wagons_in_Fina_livery_weathered/StockDetail.aspx
I wish to use Regex to say that only in cases where a URL contains /StockDetail.aspx, extract everything before the first underscore, excluding any digits. e.g. 'Bachmann'.
I've managed to source the following regex to return everything before the first underscore
^[^_]+(?=_).
However, that's as far as I can get with my limited understanding. Anyone know what regex will do the trick here?
Many thanks,
Well, you're halfway there.
Think about it this way: you want to extract something that is followed by an underscore but does not itself contain one, and only when the string contains /StockDetail.aspx. You know that this part of the string will always come after your first underscore.
So you start with a character that is not an underscore: [^_]
Then you create the group you want to match with ([a-zA-Z]*) (you cannot use \w since it includes the underscore and digits). Your string has to be followed by an underscore, so you add _ after your group. And finally, somewhere in the URL you've got /StockDetail.aspx. Your regex should look like this:
[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)
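To check what the capture group picks up, here is a small Python sketch. It only exercises the pattern logic; Google Analytics uses its own regex engine, and the second sample URL and the helper name extract_brand are invented for illustration:

```python
import re

# Pattern from the answer: a non-underscore character, then letters captured
# up to the first underscore, and /StockDetail.aspx somewhere later in the URL.
PRODUCT_BRAND = re.compile(r"[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)")


def extract_brand(url: str):
    """Return the first capture group, or None when the URL does not match."""
    m = PRODUCT_BRAND.search(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    stock_url = (
        "http://www.ehattons.com/52674/"
        "Bachmann_Branchline_37_671_Pack_of_3_14_Ton_tank_wagons_in_Fina_livery_weathered"
        "/StockDetail.aspx"
    )
    other_url = "http://www.ehattons.com/52674/Bachmann_Branchline/Other.aspx"  # hypothetical
    print(extract_brand(stock_url))
    print(extract_brand(other_url))
```

The leading [^_] consumes the slash in front of the name, so the digits in /52674/ never end up in the group.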

Regular expression in Express 4 to capture all requests to static assets for file extension .gz

I am looking to do something like this:
app.get('*.gz', function(req, res, next) {
  res.set('Content-Encoding', 'gzip');
  next();
});
but I don't think the regex I am using is correct. As the example suggests, I am looking for middleware that captures all requests to static assets that have an extension of .gz. (.gz being files zipped by gzip). Is my example correct?
Also, if someone could mention what type of regular expressions Express uses, that would help me look up reference material. To date, I have never read anywhere whether they are standard JS regexps or Perl style or what.
I think you could try app.use instead of app.get and check whether the static file you're looking for is matched, something like this:
app.use("*.gz", function(req, res, next) {
  console.log(req.originalUrl);
  next();
});
You have to put this code before the middleware that serves static files ( app.use(express.static(...)) )
(.*)\.gz$
This will capture the filename without the extension. If you need the extension, move the closing parenthesis to the right of gz. This might give you some false positives depending on your data structure, which you haven't mentioned in the question, btw. This will avoid any filenames that look like this: filename.gz.tar.
Here's a breakdown of the code from regex101.com:
1st Capturing group (.*)
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times,
as many times as possible, giving back as needed [greedy]
\. matches the character . literally
gz matches the characters gz literally (case sensitive)
$ assert position at end of the string
g modifier: global. All matches (don't return on first match)
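The behaviour described above can be checked with a few sample paths. The sketch below uses Python's re module purely to exercise the pattern; it does not show how Express itself interprets '*.gz', and the paths and the helper name gz_basename are invented:

```python
import re

# Pattern from the answer: capture everything before a final ".gz".
GZ_FILE = re.compile(r"(.*)\.gz$")


def gz_basename(path: str):
    """Return the path without its trailing .gz, or None if it has none."""
    m = GZ_FILE.match(path)
    return m.group(1) if m else None


if __name__ == "__main__":
    for path in ("/static/app.js.gz", "/static/filename.gz.tar", "/static/app.js"):
        print(path, "->", gz_basename(path))
```

Only the first path matches, since the $ anchor requires .gz to be the very last thing in the string.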
The regular expression would be '.*\.gz' to get all file names with the extension .gz
I have worked with front-end JS a little, and there the regexes were enclosed in forward slashes rather than single quotes. Please check whether this is the right way to pass the regex.
You are currently using the wildcard "*" technique. Various languages support this, for example VB (Visual Basic), where I would do exactly what you are doing when selecting files with the same extension.

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest simply prepending ^[^\/]* to your current regex and moving the \. out of the capture group.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase characters, then a / and five digits).
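The effect of the anchor can be seen by running both expressions over the two URLs from the question. This is a sketch in Python rather than PCRE, so minor flavor differences are possible, and the helper name first_tld is hypothetical:

```python
import re

# The expression from the question and the anchored version from this answer.
ORIGINAL = re.compile(r"(\.[a-z]{2,4})/\d{5}")
ANCHORED = re.compile(r"^[^\/]*\.([a-z]{2,4})/\d{5}")


def first_tld(url: str):
    """Return the TLD (without the dot) only if it sits before the first slash."""
    m = ANCHORED.match(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    unwanted = "domain.net/stuff/stuff=www.google.com/12345"
    wanted = "domain.net/12345/stuff=www.google.com/12345"
    print(ORIGINAL.search(unwanted).group(1))  # the original still finds .com
    print(first_tld(unwanted))
    print(first_tld(wanted))
```

Since [^\/]* cannot cross a slash, the only TLD the anchored pattern can ever reach is the one in front of the first slash.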
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/