Ignoring URL if contains specified word in URL GET parameters - regex

I'm working on a script that flags potentially dangerous HTTP requests, but I don't know how to filter the URI in an HTTP request correctly. The idea is to check whether any URL appears in the GET parameters, but to ignore URLs passed in a GET parameter with a specified name (for example, a GET parameter named goto may contain any URL). So if the request starts with a line like this ...
GET /check/request?first=1&second=http://domain.tld/something&third=3 HTTP/1.1
... there must be a match. If the request instead starts with a line like ...
GET /check/request?goto=http://domain.tld/something HTTP/1.1
... this one should be ignored.
The base regex, which matches any line containing a URL, is:
^(GET|POST).*\?.*\=http\:\/\/.* HTTP\/.*$
I tried to modify it, but my version only matches lines which contain the word goto in the URL itself, not as a parameter:
^(GET|POST).*\?.*(?!.*goto)\=http\:\/\/.* HTTP\/.*$
Any help would be appreciated.

UPDATE
^(GET|POST).*\?.*(?<!goto)\=http\:\/\/.* HTTP\/.*$
Check here

You probably meant a negative lookbehind placed right before =http://, rather than a negative lookahead placed after .*:
^(GET|POST).*\?.*(?<!goto)\=http\:\/\/
Please see an example on regex101.
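As a quick check outside regex101, the lookbehind version can be run against the two request lines from the question. This is a minimal Python sketch: the names REQUEST_RE, lines and flagged are made up for illustration, and the unnecessary backslashes before =, : and / are dropped.

```python
import re

# Negative lookbehind: the "=" in front of http:// must not be preceded by "goto".
REQUEST_RE = re.compile(r"^(GET|POST).*\?.*(?<!goto)=http://")

lines = [
    "GET /check/request?first=1&second=http://domain.tld/something&third=3 HTTP/1.1",
    "GET /check/request?goto=http://domain.tld/something HTTP/1.1",
]

# Keep only the request lines the pattern flags as suspicious.
flagged = [line for line in lines if REQUEST_RE.search(line)]
print(flagged)  # only the first line is flagged
```

Note that the lookbehind only inspects the four characters before the =, so a parameter whose name merely ends in goto (such as nogoto) would be skipped as well.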

Related

Consolidated RegEx to parse syslog data

Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the previous log messages. The third RegEx is my attempt at consolidating these both into a single RegEx that will parse data from either log message. My attempt was to use a conditional statement that would basically check for the existence of http(s) and if found, then to parse the URL to a named group. If http(s) was not found, then it would parse out everything until the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s) despite this token being set as optional (i.e. using the ? quantifier). However, if I remove the ? quantifier, it does find http(s) and then parses the URL as desired. However, without the quantifier, the RegEx does not work with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);.+https?:\/{2}(?P<entryUrl>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Parse entries without URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Attempt at consolidating RegEx
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+)(?<=[a-z]);.+(https?:\/{2})?(?(5)(?P<entryUrl>.+)|.+)to=\<(?P<destEm>.+)>.+$
I'm sure the issue is my misunderstanding as to how the conditional statements and the ? quantifier works.
Looking at your patterns, the email address for to: is between tags < and > but due to the formatting in the question they are not shown.
Parts of your pattern like .+ first match all the way to the end of the string, and then backtrack to try to match the rest of the pattern.
You can make the pattern a bit more performant by making the parts whose format you know more specific.
For the datetime, you can make the pattern match the specified format instead of .+ using ^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})
For (?P<blkList>[^;]+) and (?P<entryUrl>[^;]+) you can use a negated character class matching any char except ;
For the group (?P<destEm>[^<>\s]+) you can exclude the angle brackets and whitespace.
To match the URL, instead of using a conditional you can make the group optional using ?
For example
^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>
See a regex demo.
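The same pattern can be checked against both log entries from the question with Python's re module, which supports the (?P<name>...) syntax. This is only a sketch: SYSLOG_RE, WITH_URL, WITHOUT_URL and parse are illustrative names, and the escapes before / and . inside the character class are dropped because Python does not need them.

```python
import re

# The consolidated pattern, split across lines for readability.
SYSLOG_RE = re.compile(
    r"^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})"
    r" host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9.]+)\]:"
    r".*? blocked using (?P<blkList>[^;]+);"
    r"(?:.+?https?://(?P<entryUrl>[^;]+);)?"  # optional URL section
    r"\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>"
)

WITH_URL = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 "
    "Service unavailable; Client host [83.59.180.178] blocked using "
    "b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> "
    "proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)
WITHOUT_URL = (
    "Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: "
    "RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service "
    "unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; "
    "from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> "
    "proto=ESMTP helo=<mail1.sendersrv.com>"
)


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = SYSLOG_RE.match(line)
    return match.groupdict() if match else None


for entry in (WITH_URL, WITHOUT_URL):
    fields = parse(entry)
    print(fields["blkList"], fields["entryUrl"], fields["destEm"])
```

For the entry without a URL, the optional group does not participate in the match, so entryUrl comes back as None while the other groups are still filled.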
Have you tried testing your regex on a page like regex101?
to=\<(?P<destEm>.+)> doesn't seem to match your examples. You should either remove <> or replace to with helo. Be careful to make your quantifier lazy after blkList otherwise you might catch too much text.
You can then make your url optional with ? and it should work in both cases:
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+?);(.+https?:\/{2}(?P<entryUrl>.+);\s)?.+\sto=(?P<destEm>.+?)\s.*$
One approach would be to replace in the first regex .+https?:\/{2}(?P<entryUrl>.+); with (?:.+https?:\/{2}(?P<entryUrl>.+);)? where ?: indicates that it is a non-capturing group and the ? at the end means that it is optional.
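However, it still does not work, because .+ is greedy: in (?P<blkList>.+); it runs on to the last ; in the line, swallowing the URL, so the optional group matches nothing. Use the lazy .+? instead.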
Final regex:
^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:.+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=\<(?P<destEm>.+?)>.+?$
https://regex101.com/r/QkmXWz (to see it in action)
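The difference between the greedy and the lazy variant can be seen by running both against the entry that contains a URL. The sketch below uses Python's re module; GREEDY_RE, LAZY_RE and WITH_URL are illustrative names, and redundant escapes are dropped.

```python
import re

# Optional URL group added, but every .+ is still greedy.
GREEDY_RE = re.compile(
    r"^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)"
    r"\[(?P<srcIp>[0-9.]+)\]:.+blocked using (?P<blkList>.+);"
    r"(?:.+https?:/{2}(?P<entryUrl>.+);)?\s.+\sto=<(?P<destEm>.+)>.+$"
)
# Same pattern with every .+ replaced by the lazy .+?
LAZY_RE = re.compile(
    r"^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)"
    r"\[(?P<srcIp>[0-9.]+)\]:.+?blocked using (?P<blkList>.+?);"
    r"(?:.+?https?:/{2}(?P<entryUrl>.+?);)?\s.+?\sto=<(?P<destEm>.+?)>.+?$"
)

WITH_URL = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 "
    "Service unavailable; Client host [83.59.180.178] blocked using "
    "b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> "
    "proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)

greedy = GREEDY_RE.match(WITH_URL).groupdict()
lazy = LAZY_RE.match(WITH_URL).groupdict()

# Greedy: blkList swallows the URL and entryUrl stays empty.
print(greedy["blkList"], greedy["entryUrl"])
# Lazy: blkList stops at the first ";" and entryUrl is captured.
print(lazy["blkList"], lazy["entryUrl"])
```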

Django Url pattern regex for tokens

I need to pass tokens like b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00' in my Django URL pattern. I cannot find a regex that matches them, so I get a "Page not found" error.
My URL will be like /api/users/0/"b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00'"/
I have tried the following regex:
url(r'^api/users/(?P<username>[\w\-]+)/(?P<paging_state>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})/$', views.getUserPagination),
Pass the token in a request header or in the request body instead, and read it from there in your view.
Your URL has some static, predictable elements:
api/users/
/" before b
"/ at the end after '
So I can see the URL being matched in either of the two ways below, each with its regex:
1. api/users/(word characters, digits or hyphens)/"(any characters except newline)"/
REGEX: ^api\/users\/([\w\d\-]+)\/"(.*)"\/$
URL: url(r'^api\/users\/([\w\d\-]+)\/"(.*)"\/$', views.getUserPagination),
2. api/users/(word characters, digits or hyphens)/"(one letter, such as b)'//(any number of word characters or digits)#(any number of word characters or digits).(any number of word characters or digits)(any number of word characters, digits or forward slashes)'"/
REGEX: ^api\/users\/([\w\d\-]+)\/"([a-g]'\/\/[\w\d]*#[\w\d]*\.[\w\d]*[\/\w\d]*')"\/$
URL (triple-quoted, because the pattern itself contains a single quote): url(r'''^api\/users\/([\w\d\-]+)\/"([a-g]'\/\/[\w\d]*#[\w\d]*\.[\w\d]*[\/\w\d]*')"\/$''', views.getUserPagination),
You should be able to use either of the above two. There are multiple ways to match the token part of your URL, so unless security is a big concern, you can go with the simplest approach, described in point 1.
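The simpler pattern from point 1 can be checked against the sample URL with plain Python, without a Django project. This is a sketch under two assumptions: Django matches patterns against the request path without its leading slash, and the quote characters reach the server unescaped. TOKEN_URL_RE and path are illustrative names.

```python
import re

# Pattern from point 1, with the redundant escapes removed.
TOKEN_URL_RE = re.compile(r'^api/users/([\w-]+)/"(.*)"/$')

# Sample URL from the question, without the leading slash.
path = "api/users/0/\"b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00'\"/"

match = TOKEN_URL_RE.match(path)
username, paging_state = match.groups()
print(username)      # 0
print(paging_state)  # b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00'
```

In a view, the two captured groups would arrive as positional arguments, since the pattern uses unnamed groups.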

Extract last part of url without query string or jsessionid

I want a regex that will always return the last part of a URL before the query string parameters, and without the jsessionid if present.
Here are some example URLs:
http://www.somesite.com/some/path/test.action;jsessionid=000063vCmvJAn7VWyymA_dPsHZs:16u9pglit?sort=2&param1=1&param2=2
http://www.somesite.com/some/path/test;jsessionid=000063vCmvJAn7VWyymA_dPsHZs:16u9pglit?sort=2&param1=1&param2=2
http://www.somesite.com/some/path/test.action?sort=2&param1=1&param2=2
http://www.somesite.com/some/path/test?sort=2&param1=1&param2=2
Here's my regex so far:
.*http://.*/some/path.*/(.*);?.*\?.*
It works for the URLs that do not contain jsessionid, but returns test;jsessionid=... if it is present.
To test: http://regex101.com/r/fM0mE2
I would use this regex:
.*http:\/\/.*\/some\/path.*\/([^;\?]+);?.*\?.*
                              ^^^^^^
The character class [^;\?]+ matches anything that isn't ; or ?, so the capture stops before the jsessionid or the query string. I think the regex can be shortened to:
.*http:\/\/.*\/some\/path.*\/([^;\?]+)
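The shortened regex can be tried on the four example URLs from the question. This is a small Python sketch; LAST_PART_RE, urls and last_parts are illustrative names, and the backslashes before / and ? are dropped because Python does not need them there.

```python
import re

# Capture everything after the last "/" up to the first ";" or "?".
LAST_PART_RE = re.compile(r".*http://.*/some/path.*/([^;?]+)")

urls = [
    "http://www.somesite.com/some/path/test.action;jsessionid=000063vCmvJAn7VWyymA_dPsHZs:16u9pglit?sort=2&param1=1&param2=2",
    "http://www.somesite.com/some/path/test;jsessionid=000063vCmvJAn7VWyymA_dPsHZs:16u9pglit?sort=2&param1=1&param2=2",
    "http://www.somesite.com/some/path/test.action?sort=2&param1=1&param2=2",
    "http://www.somesite.com/some/path/test?sort=2&param1=1&param2=2",
]

last_parts = [LAST_PART_RE.match(url).group(1) for url in urls]
print(last_parts)  # ['test.action', 'test', 'test.action', 'test']
```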

Ignore backslash in django url patterns

In my django project, I have a url pattern like
(r'^survey/u2=([^/]+)/u3=([^/]+)/$',SurveyView.as_view()).
When I try to open the below url
http://www.sample.com/survey/u2=rc57S4/jyTJBz+==/u3=/U5pKfrV8X1MjUU2tI0AhqTF5PGR8g=/
[where u2 and u3 are values encrypted using internal keys]
I'm getting a "page not found" error. This is because the sample URL does not match the URL pattern on the server, as the URL parameters contain the '/' (forward slash) character.
Right now I'm not in a position to change the sample URL by encoding the parameters, since this URL has already been mailed to customers. However, if a customer opens the link, I should not show an error message.
How can I handle these special characters on the server when matching the URL pattern?
Instead of passing the values as path segments in the URL, pass them as GET query parameters, separated by the ? and & characters.
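A sketch of what that could look like for newly generated links, using only Python's standard library. The u2 and u3 values are taken from the sample URL in the question; the /survey/ endpoint with a query string is an assumption, and this does not help with links that were already mailed out. The values have to be percent-encoded, because a raw + in a query string would otherwise be decoded as a space.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Encrypted values from the sample URL in the question.
u2 = "rc57S4/jyTJBz+=="
u3 = "/U5pKfrV8X1MjUU2tI0AhqTF5PGR8g="

# urlencode percent-encodes "/", "+" and "=" inside the values.
url = "http://www.sample.com/survey/?" + urlencode({"u2": u2, "u3": u3})
print(url)

# Decoding the query string gives back the original values.
params = parse_qs(urlsplit(url).query)
print(params["u2"][0] == u2, params["u3"][0] == u3)  # True True
```

In a Django view, the decoded values would then be read from request.GET rather than from captured URL groups.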

How to rewrite url containing plus and special characters?

We've got some incoming URLs that need to be redirected, but we are having trouble with URLs that contain pluses (+).
For example, any incoming URL like this must be redirected to the home page of the new site:
/eng/news/2005+01+01.htm
It should be redirected to the home page of the new site:
/en/
Using UrlRewriter.net we've set up a rule which works with 'normal' URLs but does not work for the above
<redirect url="~/eng/(.+)" to="/en/index.aspx" />
However, it works fine if I change the incoming URL to
/eng/news/2005-01-01.htm
What's the problem and can anyone help?
I don't know UrlRewriter.net, and I'm not sure which regex syntax it uses, so here are some hints based on Perl regex.
What is the ~ at the beginning? Perhaps you mean ^, i.e. the beginning of the string.
(.+) matches any character repeated one or more times; the + here is a quantifier, not a literal plus sign (a literal plus is written \+).
This is one way to write a (Perl) regex matching URLs that start with the string /eng/ and contain a + sign:
^\/eng\/.*\+.*
I hope this helps.
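The pattern above can be tried on the two paths from the question. This sketch uses Python's re module rather than UrlRewriter.net, whose regex dialect and URL decoding may behave differently; PLUS_URL_RE, ANY_ENG_RE and paths are illustrative names.

```python
import re

# Paths under /eng/ that contain a literal plus sign.
PLUS_URL_RE = re.compile(r"^/eng/.*\+.*")
# The original rule's pattern, anchored with ^ instead of ~.
ANY_ENG_RE = re.compile(r"^/eng/(.+)")

paths = ["/eng/news/2005+01+01.htm", "/eng/news/2005-01-01.htm"]

for path in paths:
    print(path, bool(PLUS_URL_RE.match(path)), bool(ANY_ENG_RE.match(path)))
```

In Python's engine the dot also matches a plus character, so (.+) on its own accepts both paths, while the \+ pattern accepts only the first. Whether UrlRewriter.net ever sees the raw + is not something this sketch can show.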