Regex HTTP header parsing


I'm trying to use Regex to get a bit of HTTP header parsing done. I'd like to use groups to organize some of the information:
Let's say I have this:
Content-Disposition: form-data; name="item1"
I'd like the result of my regex to create two groups:
contentdisposition : form-data
name : item1
I've tried several methods, but I can't seem to figure out how to do this. If name= doesn't exist then only one group should be created, but the regex should not error out.
Any ideas?

/Content-Disposition: (.*?);(?: name="(.*?)")?/ might be what you're looking for. It uses an optional greedy quantifier to get the name unless that would cause the match to fail.
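A minimal Python sketch of the same idea follows. Note that the pattern above requires a semicolon after the disposition type, so a bare "Content-Disposition: form-data" line would not match it; the variant below stops the first group at a semicolon or at the end of the line instead. The function name, the dictionary keys, and the adjusted pattern are assumptions for illustration, not part of the original answer.

```python
import re

# Assumed variant of the pattern above: group 1 stops at ';' or end of line,
# and the name="..." parameter sits in an optional non-capturing group.
HEADER_RE = re.compile(r'Content-Disposition:\s*([^;\r\n]+)(?:;\s*name="([^"]*)")?')


def parse_content_disposition(line):
    """Return the disposition type and the optional name, or None if no match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    # group(2) is None when the name parameter is absent; no error is raised.
    return {"contentdisposition": match.group(1).strip(), "name": match.group(2)}


print(parse_content_disposition('Content-Disposition: form-data; name="item1"'))
print(parse_content_disposition('Content-Disposition: form-data'))
```

When name= is missing, the second group simply stays unset, which is the "only one group, but no error" behaviour asked for in the question.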

Related

Consolidated RegEx to parse syslog data

Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the previous log messages. The third RegEx is my attempt at consolidating these both into a single RegEx that will parse data from either log message. My attempt was to use a conditional statement that would basically check for the existence of http(s) and if found, then to parse the URL to a named group. If http(s) was not found, then it would parse out everything until the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s) despite this token being set as optional (i.e. using the ? quantifier). However, if I remove the ? quantifier, it does find http(s) and then parses the URL as desired. However, without the quantifier, the RegEx does not work with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);.+https?:\/{2}(?P<entryUrl>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Parse entries without URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Attempt at consolidating RegEx
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+)(?<=[a-z]);.+(https?:\/{2})?(?(5)(?P<entryUrl>.+)|.+)to=\<(?P<destEm>.+)>.+$
I'm sure the issue is my misunderstanding of how conditional statements and the ? quantifier work.

Looking at your patterns, the email address for to: is between tags < and > but due to the formatting in the question they are not shown.
The parts in your pattern like .+ first match until the end of the string, and will then backtrack and try to match the rest of the pattern.
You can make the pattern a bit more performant by making the parts whose format you already know more specific.
For the datetime, you can make the pattern match the specified format instead of .+ using ^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})
For (?P<blkList>[^;]+) and (?P<entryUrl>[^;]+) you can use a negated character class matching any char except ;
For the group (?P<destEm>[^<>\s]+) you can exclude the angle brackets and whitespace.
To match the URL, instead of using a conditional you can make the group optional using ?
For example
^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>
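As a quick check, the pattern from this answer can be run against both log entries from the question. The sketch below uses Python's re module, which accepts the (?P<name>...) syntax as written; only the wrapping code and the variable names are added here.

```python
import re

# Pattern copied from the answer above; only the Python wrapping is added.
LOG_RE = re.compile(
    r"^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? "
    r"RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);"
    r"(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>"
)

WITH_URL = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; "
    "Client host [83.59.180.178] blocked using b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP "
    "helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)
WITHOUT_URL = (
    "Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from "
    "mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; "
    "Client host [185.3.229.125] blocked using bl.spamcop.net; "
    "from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP "
    "helo=<mail1.sendersrv.com>"
)

for entry in (WITH_URL, WITHOUT_URL):
    match = LOG_RE.search(entry)
    # entryUrl is None when the optional group did not take part in the match.
    print(match.groupdict() if match else None)
```

Both entries match with the same pattern; the only difference is that entryUrl is unset for the entry without a URL.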

Have you tried testing your regex on a page like regex101?
to=\<(?P<destEm>.+)> doesn't seem to match your examples. You should either remove <> or replace to with helo. Be careful to make your quantifier lazy after blkList otherwise you might catch too much text.
You can then make your url optional with ? and it should work in both cases:
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+?);(.+https?:\/{2}(?P<entryUrl>.+);\s)?.+\sto=(?P<destEm>.+?)\s.*$

One approach would be to replace in the first regex .+https?:\/{2}(?P<entryUrl>.+); with (?:.+https?:\/{2}(?P<entryUrl>.+);)? where ?: indicates that it is a non-capturing group and the ? at the end means that it is optional.
However, it still does not work because .+ is greedy, so use lazy .+? instead.
Final regex:
^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:.+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=\<(?P<destEm>.+?)>.+?$
https://regex101.com/r/QkmXWz (to see it in action)
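The effect of greedy versus lazy matching in front of an optional group can be seen on a much smaller input. The sample text and the two shortened patterns below are hypothetical; they only illustrate the backtracking behaviour described above, using Python's re module.

```python
import re

# Hypothetical sample text, shortened so the backtracking is easy to follow.
text = "a; http://x.test/p; b"

greedy = re.match(r"(?P<head>.+);(?:\s*https?://(?P<url>.+?);)?", text)
lazy = re.match(r"(?P<head>.+?);(?:\s*https?://(?P<url>.+?);)?", text)

# Greedy .+ runs to the last ';', so the optional URL group has nothing left to match.
print(greedy.group("head"), greedy.group("url"))
# Lazy .+? stops at the first ';', which lets the optional group capture the URL.
print(lazy.group("head"), lazy.group("url"))
```

In both cases the overall match succeeds, which is why the greedy version fails silently: an optional group is allowed to match nothing.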

Ignoring URL if contains specified word in URL GET parameters

I'm working on a script that would show potentially dangerous HTTP requests, but I don't know how to filter the URI in an HTTP request correctly. The idea is to check whether any URL is contained in the GET parameters, but to ignore URLs that are passed in a GET parameter with a specified name (for example, a GET parameter named goto can contain any URL). So if the starting line of the request looks like this ...
GET /check/request?first=1&second=http://domain.tld/something&third=3 HTTP/1.1
... there must be a match. If the request's starting line looks like this instead ...
GET /check/request?goto=http://domain.tld/something HTTP/1.1
... this one should be ignored.
Base regex which matches any line with URL is:
^(GET|POST).*\?.*\=http\:\/\/.* HTTP\/.*$
I tried to modify it, but my version only matches lines that contain the word goto in the URL itself, not as a parameter name:
^(GET|POST).*\?.*(?!.*goto)\=http\:\/\/.* HTTP\/.*$
Any help would be appreciated.

UPDATE
^(GET|POST).*\?.*(?<!goto)\=http\:\/\/.* HTTP\/.*$

You probably meant a negative lookbehind placed directly before =http:// rather than a negative lookahead on .*:
^(GET|POST).*\?.*(?<!goto)\=http\:\/\/
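A small Python check of the lookbehind pattern against the two request lines from the question might look like this. The variable names are made up for the example; the pattern itself is the one given in the answer.

```python
import re

# Pattern from the answer above: the negative lookbehind rejects "=http://"
# when the four characters just before the '=' are "goto".
SUSPICIOUS_RE = re.compile(r"^(GET|POST).*\?.*(?<!goto)\=http\:\/\/")

requests = [
    "GET /check/request?first=1&second=http://domain.tld/something&third=3 HTTP/1.1",
    "GET /check/request?goto=http://domain.tld/something HTTP/1.1",
]

for line in requests:
    # True means the line should be flagged as potentially dangerous.
    print(bool(SUSPICIOUS_RE.search(line)), line)
```

Note that the lookbehind only inspects the characters immediately before the equals sign, so a parameter whose name merely ends in goto is skipped as well.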

Regex HTTP Response Body Message

I use JMeter for REST testing.
I have made an HTTP request, and this is the response data:
{"id":11,"name":"value","password":null,"status":"ACTIVE","lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}
I need just the ID (which is 11) in
{"id":11,....
I use the regex below:
([0-9].+?)
It works perfectly, but it will be a problem if my ID has more than 2 digits. I would need to change the regex to:
([0-9][0-9].+?)
Is there a dynamic regex for my problem? Thank you for your attention.
Regards,
Stefio

If you want any integer between {"id": and , use the following Regular Expression:
{"id":(\d+),
However, a smarter way of dealing with JSON data could be the JSON Path Extractor (available via JMeter Plugins); going forward, this option can be much easier to use against complex JSON.
See Using the XPath Extractor in JMeter guide (scroll down to "Parsing JSON") to learn more on syntax and use cases.
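Outside JMeter, the two approaches can be compared with a short Python sketch. The json module is used here only as a stand-in for a JSON path extractor; it is not the JMeter plugin itself, and the variable names are assumptions.

```python
import json
import re

response = ('{"id":11,"name":"value","password":null,"status":"ACTIVE",'
            '"lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}')

# Regex approach from the answer: one or more digits between {"id": and the comma.
match = re.search(r'\{"id":(\d+),', response)
print(match.group(1))

# Parsing the JSON instead: this does not depend on key order or spacing.
print(json.loads(response)["id"])
```

The regex returns the ID as a string, while the parsed JSON yields an integer.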

I suggest using the following regular expression:
"id":([^,]*),
This will first find "id": and then look for anything that is not a comma until it finds a comma. Note the character grouping is only around the value of the ID.
This will work for ANY length ID.
Edit:
The same concept works for almost any JSON data, for example where the value is quoted:
"key":"([^"]*)"
That regular expression will extract the value from given key, as long as value is quoted and does not contain quotes. It first finds "key": and then matches anything that is not a quote until the next quote.
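Both patterns from this answer can be tried in Python as follows. The helper quoted_value is hypothetical and only wraps the "key":"([^"]*)" pattern for an arbitrary key.

```python
import re

response = ('{"id":11,"name":"value","password":null,"status":"ACTIVE",'
            '"lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}')


def quoted_value(key, text):
    """Hypothetical helper: return the quoted value stored under key, or None."""
    match = re.search('"' + re.escape(key) + r'":"([^"]*)"', text)
    return match.group(1) if match else None


# Unquoted value: everything after "id": up to the next comma.
id_match = re.search(r'"id":([^,]*),', response)
print(id_match.group(1))

print(quoted_value("status", response))
# password is null (not quoted), so the quoted-value pattern finds nothing.
print(quoted_value("password", response))
```

As the answer says, the quoted-value pattern only applies when the value is a quoted string without embedded quotes; unquoted values such as null or numbers need the first pattern.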

You can use the quantifier like this:
([0-9]{2,}.+?)
It matches 2 or more digits followed by any character, 1 or more times (because .+? is lazy and nothing follows it, it consumes only one character). If you want to allow no other characters after the digits, use * instead of +:
([0-9]{2,}.*?)

HTTP Response Header Regex

I plan to use the JMeter Regex Extractor to get a session ID from an HTTP response header. This is an example of the HTTP header:
HTTP/1.1 200 OK
x-powered-by: yoke
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST,GET,DELETE,PUT
Access-Control-Allow-Headers: X-Requested-With,jsessionid,Origin,Accept,Content-Type
Access-Control-Expose-Headers: Content-Size,Message,Total-Pages,Total-Count,Current-Page,jsessionid,Origin,Total-Outstanding,Content-Range
content-type: application/json
jsessionid: 10838d69-f9ac-4c70-b1f7-9447a7a6a463
Content-Length: 106
All I need to get is :
10838d69-f9ac-4c70-b1f7-9447a7a6a463
I use this REGEX :
jsessionid: [^\n]+
But I get :
jsessionid: 10838d69-f9ac-4c70-b1f7-9447a7a6a463
Can you help me with it?
Thank you
Best Regards,
Stefio

Use the regex expression
jsessionid: ([^\n]+)
and Template
$1$
Your issue has to do with regex grouping. Group 0 is the entire match, which is the default of the JMeter Regex Extractor. Group 1 is what was matched by the regex inside the first set of parentheses. Template $1$ says to use the contents of group 1 as your result. Regex grouping can get much more complicated, so read tutorials if you want to grab multiple values from a regex expression.
Jmeter Regex Extractor user manual
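The difference between group 0 and group 1 is the same in most regex engines. The following Python sketch uses a shortened copy of the headers from the question; the $1$ template is JMeter-specific and corresponds to group(1) here.

```python
import re

# Shortened copy of the response headers from the question.
headers = (
    "HTTP/1.1 200 OK\n"
    "Access-Control-Allow-Headers: X-Requested-With,jsessionid,Origin,Accept,Content-Type\n"
    "content-type: application/json\n"
    "jsessionid: 10838d69-f9ac-4c70-b1f7-9447a7a6a463\n"
    "Content-Length: 106\n"
)

match = re.search(r"jsessionid: ([^\n]+)", headers)
print(match.group(0))  # whole match, including the "jsessionid: " prefix
print(match.group(1))  # only the text captured by the parentheses
```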

Look into lookaround for regex; for your particular case it would be a lookbehind. This ought to work, untested though:
(?<=jsessionid:\s).+
The (?<=jsessionid:\s) part means: literally match jsessionid: followed by one whitespace character, but don't include it in the result.
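For reference, the lookbehind pattern does behave as described in Python's re module, as the sketch below shows. Whether the regex engine behind the JMeter extractor accepts lookbehind is not confirmed here, so treat this only as a demonstration of the technique.

```python
import re

# Shortened copy of the response headers from the question.
headers = (
    "HTTP/1.1 200 OK\n"
    "content-type: application/json\n"
    "jsessionid: 10838d69-f9ac-4c70-b1f7-9447a7a6a463\n"
    "Content-Length: 106\n"
)

# Fixed-width lookbehind: the prefix must be present but is not part of the match.
match = re.search(r"(?<=jsessionid:\s).+", headers)
# '.' does not match a newline by default, so the match ends at the end of the line.
print(match.group(0))
```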

JSESSIONID is basically a cookie so I don't think you need to extract it with regex.
I can think of 2 scenarios why you might need it:
1. You need to pass it with the next request as a header
2. You need its value for something else
In both cases you can use HTTP Cookie Manager
For the 1st option: it handles cookies automatically
For the 2nd option: given CookieManager.save.cookies property set to true you should be able to access the cookie value as ${COOKIE_JSESSIONID}

Excluding in Live HTTP Headers plugin for Firefox

I am trying to exclude Gmail's requests from Live HTTP Headers, but I can't seem to get the exclude regex to work.
My exclude regex is this: .gif$|.jpg$|.ico$|.css$|.js$|.*mail.google.com.*
Any ideas/suggestions?

I had the same problem and the solution was stupidly simple:
Have you enabled the check box ("exclude URL by RegExp" or similar; I only have the German version)?
Hint: you do not need to add .* at the start and end of your expression, because the request will be excluded if it contains the pattern (it does not have to match the complete URL).

I think you should use "\." to match a literal dot. A dot without the backslash matches any character.
Like this:
\.gif$|\.jpg$|\.ico$|\.css$|\.js$|.*mail\.google\.com.*
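The difference an escaped dot makes can be checked with a few sample URLs. The sketch below assumes the plugin applies the pattern with "contains" semantics (modelled here with re.search); the URLs and variable names are hypothetical.

```python
import re

# Exclude pattern with every literal dot escaped (including the one before css).
EXCLUDE_RE = re.compile(r"\.gif$|\.jpg$|\.ico$|\.css$|\.js$|.*mail\.google\.com.*")
# The pattern from the question, where an unescaped '.' matches any character.
LOOSE_RE = re.compile(r".gif$|.jpg$|.ico$|.css$|.js$|.*mail.google.com.*")

urls = [
    "http://example.test/static/app.js",
    "http://example.test/api/nojs",
    "https://mail.google.com/mail/u/0/",
]

for url in urls:
    # First column: escaped pattern, second column: unescaped pattern.
    print(bool(EXCLUDE_RE.search(url)), bool(LOOSE_RE.search(url)), url)
```

The unescaped pattern also excludes the second URL, because ".js$" matches any character followed by "js" at the end of the string.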
