Splunk regex to match part of a URL string

I'm trying to use Splunk to search for all base path instances of a specific url (and maybe plot it on a chart afterwards).
Here are some example urls and the part I want to match for:
http://some-url.com/first/ # match "first"
http://some-url.com/first/second/ # match "first"
http://some-url.com/first/second/third/ # match "first"
Here's the regex I'm using, which works fine:
http:\/\/some-url\.com\/(.*?)\/
What should my Splunk search be to extract the desired text? Is this even possible in Splunk?

Assuming the domain always ends in .com, you can use rex:
index=... | rex field=_raw "\.com/(?<match>\w+)/" | table match
Change field=_raw to the field that holds the URL if it is not in _raw.

To match any URL (.com or not), you can use the following command.
index=... | rex field=_raw "http(s)?://[^/]+/(?<match>[^/]+)"
This will match things such as
http://splunk.com/first/
https://simonduff.net/first/
https://splunk.com/first/middle/last
https://splunk.com/first
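To check the pattern outside Splunk, here is a minimal Python sketch of the same idea. It assumes Python's re module, which spells named groups (?P<name>...) where rex accepts (?<name>...); the helper name first_segment is made up for illustration.

```python
import re

# Same pattern as the rex command above, with Python's (?P<name>...) spelling.
FIRST_SEGMENT = re.compile(r"https?://[^/]+/(?P<match>[^/]+)")

def first_segment(url):
    """Return the first path segment of url, or None if there is none."""
    m = FIRST_SEGMENT.search(url)
    return m.group("match") if m else None

for url in [
    "http://splunk.com/first/",
    "https://simonduff.net/first/",
    "https://splunk.com/first/middle/last",
    "https://splunk.com/first",
]:
    print(url, "->", first_segment(url))
```

Because [^/]+ needs at least one character, a URL with nothing after the host's slash yields no match rather than an empty value.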

Related

Regular expression to extract different parts of URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc. without using any javascript logic as I need to use it in a classification rule builder tool.
I tried this URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5 which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But, this is incorrect. Can anyone help me with the correct regex for this?
You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this capture optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
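As a rough way to try this substitution outside a regex tester, the sketch below applies the same pattern with Python's re.sub. It assumes Python 3.5 or later (where a group that did not take part in a match substitutes as an empty string), writes the replacement as \1:\2 instead of $1:$2, and uses a made-up helper name, to_colon_form.

```python
import re

# The pattern above without the JavaScript /.../g delimiters; re.sub replaces
# every match, which plays the role of the g flag.
SEGMENT = re.compile(r"(?:https?://([^/?\s#]+))?/([^/?\s#]*)(?:[?#].*)?")

def to_colon_form(url):
    # Group 1 (the host) only participates in the first match; for every
    # later match it is unset and substitutes as an empty string.
    return SEGMENT.sub(r"\1:\2", url)

print(to_colon_form("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(to_colon_form("http://stackoverflow.com/v2/summary/saas?test=123"))
```

Each match consumes one "/segment" pair, so the output is the host followed by one ":segment" per path segment, with any query string or fragment swallowed by the last match.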
Update
If you can't use the g flag for substitution, there's no better way than to brute-force all the combinations.
For each segment of the URL path, you need to add one \/([^\/?#\s]+) to the regex and one more group reference (:$2, :$3, and so on) to the substitution:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
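Since the rules above differ only in how many times the segment group is repeated, they can be generated rather than typed by hand. The following Python sketch shows one way to do that; build_rule is a hypothetical helper, and the substitution uses Python's \1 syntax where the rule builder tool would use $1.

```python
import re

def build_rule(n_segments):
    """Build the anchored pattern and substitution for a fixed number of path segments."""
    segment = r"/([^/?#\s]+)"
    pattern = (
        r"^https?://(?:www\.)?([^/?#\s]+)"  # group 1: host without "www."
        + segment * n_segments              # one capture group per path segment
        + r"/?(?:[#?].*)?$"                 # optional trailing slash, query, fragment
    )
    # Group 1 is the host, groups 2..n+1 are the path segments.
    substitution = ":".join("\\%d" % i for i in range(1, n_segments + 2))
    return pattern, substitution

pattern, substitution = build_rule(3)
print(re.sub(pattern, substitution, "http://stackoverflow.com/v2/summary/saas?test=123"))
```

A rule built for n segments only matches URLs with exactly n path segments, which is why one rule per depth is needed.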

Extract Splunk domain from payload_printable field with regex

I'm trying to extract a domain from the Splunk payload_printable field (source is Suricata logs) and found this regex is working fine for most of the cases:
source="*suricata*" alert.signature="ET JA3*"
| rex field=payload_printable "(?<dom>[a-zA-Z0-9\-\_]{1,}\.[a-zA-Z0-9\-\_]{2,}\.[a-zA-Z0-9\-\_]{2,})"
| table payload_printable, dom
The regular expression is:
(?<dom>[a-zA-Z0-9\-\_]{1,}\.[a-zA-Z0-9\-\_]{2,}\.[a-zA-Z0-9\-\_]{2,})
For example, if my payload_printable looks like this:
...........^aO+.t....]......$.....mT*l.......&.,.+.0./.$.#.(.'.
...........=.<.5./.
...].........activity.windows.com..........
.................
.......................#...........
The domain "activity.windows.com" is successfully extracted. Now, it doesn't work for such a payload, because the regex matches another part that does not correspond to the domain:
...........^aO+]v;.~........:.Y.zORw._I..K>..&.,.+.0./.$.#.(.'.
...........=.<.5./.
...].........activity.windows.com..........
.................
.......................#...........
It extracts "Y.zORw._I".
Another example:
...........^h.'`.o2...
.y.k>..e.ef...]..8.G..&.,.+.0./.$.#.(.'.
...........=.<.5./.
...p.........arc.msn.com..........
.................
.......................#.........h2.http/1.1...................
I don't know how to fix this. Thank you for your help.
This regex will match domain names and correctly matches the two examples you gave:
"(?<dom>(?:[a-z0-9](?:[a-z0-9-_]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-_]{0,61}[a-z0-9])"

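To try the suggested domain pattern outside Splunk, here is a small Python sketch run against two of the payloads as transcribed in the question. The assumptions: (?<dom>...) is respelled (?P<dom>...) for Python, the hyphen is moved to the end of each character class (same set of characters), and extract_domain is a made-up helper name.

```python
import re

# The suggested pattern, lowercase-only as written in the answer.
DOMAIN = re.compile(
    r"(?P<dom>(?:[a-z0-9](?:[a-z0-9_-]{0,61}[a-z0-9])?\.)+"
    r"[a-z0-9][a-z0-9_-]{0,61}[a-z0-9])"
)

def extract_domain(payload):
    """Return the first (leftmost) domain-like match, or None."""
    m = DOMAIN.search(payload)
    return m.group("dom") if m else None

# Second payload from the question (the one that produced "Y.zORw._I").
payload_2 = (
    "...........^aO+]v;.~........:.Y.zORw._I..K>..&.,.+.0./.$.#.(.'.\n"
    "...........=.<.5./.\n"
    "...].........activity.windows.com..........\n"
)
# Third payload from the question, up to the line holding the domain.
payload_3 = (
    "...........^h.'`.o2...\n"
    ".y.k>..e.ef...]..8.G..&.,.+.0./.$.#.(.'.\n"
    "...........=.<.5./.\n"
    "...p.........arc.msn.com..........\n"
)

print(extract_domain(payload_2))
print(extract_domain(payload_3))
```

On the text as transcribed here, the second payload yields activity.windows.com, but the third yields e.ef from its second line before arc.msn.com is reached, so the pattern may still need tightening for payloads like that one.
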
Regex to remove everything after -i- (with -i-)

I'm trying to find a solution to my problem.
Input: prd-abcd-efgh-i-0dflnk55f5d45df
Output: prd-abcd-efgh
Tried Splunk Query : index=aws-* (host=prd-abcd-efgh*) | rex field=host "^(?<host>[^.]+)"| dedup host | stats count by host,methodPath
I want to remove everything that comes after "-i-" (along with the "-i-" itself) using a simple regex. I tried the regex "^(?<host>[^.]+)" listed here:
https://answers.splunk.com/answers/77101/extracting-selected-hosts-with-regex-regex-hosts-with-exceptions.html
Please help me to solve it.
replace(host, "(?<=-i-).*", "")
Example here: https://regex101.com/r/blcCcQ/2
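The (?<=-i-) is a lookbehind, so the match starts right after -i- and the -i- itself is kept. To drop the -i- as well, as in the desired output, match it directly: replace(host, "-i-.*", "")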
I have no knowledge of Splunk, but the normal way to do that would be to match the part you don't want and replace it with an empty string.
The regex for doing that could be:
-i-.*
Then replace the match with an empty string.
Something simple like this should work:
([a-z-]+)-i-.+
The first capture group will return only the part preceding -i-.
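The three suggestions above behave slightly differently, which a quick Python sketch makes visible. This uses Python's re module rather than Splunk, and the function names are made up for illustration.

```python
import re

def strip_instance_suffix(host):
    """Remove "-i-" and everything after it."""
    return re.sub(r"-i-.*", "", host)

def strip_after_marker(host):
    """Lookbehind variant: removes only what follows "-i-", keeping the marker."""
    return re.sub(r"(?<=-i-).*", "", host)

def prefix_before_marker(host):
    """Capture-group variant: return group 1, the text before "-i-"."""
    m = re.match(r"([a-z-]+)-i-.+", host)
    return m.group(1) if m else None

host = "prd-abcd-efgh-i-0dflnk55f5d45df"
print(strip_instance_suffix(host))
print(strip_after_marker(host))
print(prefix_before_marker(host))
```

For the sample host, the first and third variants give prd-abcd-efgh, while the lookbehind variant leaves the trailing -i- in place.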

splunk: Get the first three numbers from ip address

I'm trying to get the first three sets of numbers of an IP address which is in this format: 10.10.10.10
Desired value would be 10.10.10
Try this regex: ^(.+)(?=\.\d+$)
Next time, please also post what you have tried and how you planned to reach the solution.
Regex to match a valid IPv4 address:
/^(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])$/
Regex to match the first three blocks of a valid IPv4 address:
/^(([01]?\d?\d|2[0-4]\d|25[0-5])\.){2}([01]?\d?\d|2[0-4]\d|25[0-5])$/
or, if it is acceptable to also match the dot after the third block:
/^(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}$/
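A short Python sketch of the validating patterns above, for experimenting with inputs. It assumes Python's re module, factors the repeated octet alternation into a variable named OCTET (my naming), and uses non-capturing groups since nothing is extracted here.

```python
import re

# One block: 0-199 (with optional leading digits), 200-249, or 250-255.
OCTET = r"(?:[01]?\d?\d|2[0-4]\d|25[0-5])"

# Four blocks separated by dots, anchored to the whole string.
FULL_IPV4 = re.compile(r"^(?:%s\.){3}%s$" % (OCTET, OCTET))

# Exactly three blocks, anchored to the whole string.
THREE_BLOCKS = re.compile(r"^(?:%s\.){2}%s$" % (OCTET, OCTET))

for candidate in ["10.10.10.10", "256.10.10.10", "10.10.10"]:
    print(candidate, bool(FULL_IPV4.match(candidate)), bool(THREE_BLOCKS.match(candidate)))
```

Because both patterns are anchored with ^ and $, the three-block pattern accepts a string that is already three blocks long; it does not pull the first three blocks out of a four-block address.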
I was able to get it this way:
rex field=IP "(?<first_three>\d+\.\d+\.\d+)\.\d+"
Another method:
..| rex field=ip_addr "(?<split_ip>.+)\.[0-9]+"
Where,
ip_addr - field name
split_ip - variable under which the split IP address will be stored
Example:
Splunk Query:
| stats count | eval ip = "115.124.35.123" | rex field=ip "(?<split_ip>.+)\.[0-9]+" | table split_ip
Output:
115.124.35
Below works for me.
rex field=_raw "(?<ip_address>^\d+\.\d+\.\d+\.\d+)"|timechart count by ip_address
Use the regex below:
^(?P<result>.+(?=\.\d+))
Demo: https://regex101.com/r/bO4tY5/3
https://regex101.com/ is a super useful tool for this kind of stuff. It lets you write your regex and test it for different strings in real time.
Once you've got what you need, stick it into your Splunk search query with the rex command.
To answer your exact problem:
The regex code, where MY_FIELD_NAME_HERE is the name of the extracted field:
(?<MY_FIELD_NAME_HERE>\d+\.\d+\.\d+)\.\d+
The regex with examples from regex101:
https://regex101.com/r/qTTf4e/2
The command required for the Splunk query language, where ORIGINAL_FIELD is your original field holding 10.10.10.10 and MY_FIELD_NAME_HERE is the extracted field:
... | rex field="ORIGINAL_FIELD" "(?<MY_FIELD_NAME_HERE>\d+\.\d+\.\d+)\.\d+"
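For a quick check of the extraction pattern before putting it into a search, the same regex can be run in Python. This is only a sketch: the group is named first_three and spelled (?P<...>) as Python requires, and first_three_octets is a made-up helper.

```python
import re

# Capture the first three dot-separated numbers; the fourth must be present
# for the pattern to match but is left outside the named group.
FIRST_THREE = re.compile(r"(?P<first_three>\d+\.\d+\.\d+)\.\d+")

def first_three_octets(ip):
    """Return the first three blocks of a dotted address, or None if there is no match."""
    m = FIRST_THREE.search(ip)
    return m.group("first_three") if m else None

print(first_three_octets("10.10.10.10"))
print(first_three_octets("115.124.35.123"))
```

Note that \d+ does not validate the blocks as 0-255; it only splits on the dots.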

Exclude regular expression match if it contains a string

I'm still learning regular expressions and I seem to be stuck.
I wanted to write a reg exp that matches URL paths like these that contain "bulk":
/bulk-category_one/product
/another-category/bulk-product
to only get the product pages, but not the category pages like:
/bulk-category_one/
/another-category/
So I came up with:
[/].*(bulk).*[/].+|[/].*[/].*(bulk).*
But there's pagination, so when I put the reg exp in Google Analytics, I'm finding stuff like:
/bulk-category/_/showAll/1/
All of them have
/_/
and I don't want any URL paths that contain
/_/
and I can't figure out how to exclude them.
I would go about it this way:
/[^/\s]*bulk[^/]*/[^/\s]+(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*(?!/)
first part:
/ - match the slash
[^/\s]* - match everything that's not a slash and not a whitespace
bulk - match bulk literally
[^/]* - match everything that's not a slash
/ - match the slash
[^/\s]+ - match everything that's not a slash and not a whitespace
(?!/) - ensure there is not a slash afterwards (i.e. url has two parts)
The second part is more of the same, but this time 'bulk' is expected in the second part of the url not the first one.
If you need the word 'product' specifically in the second part of the url one more alternative would be required:
/[^/\s]*bulk[^/]*/[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*product[^/\s]*bulk[^/\s]*(?!/)
If I apply this simple regex to a file FILE
egrep ".*bulk.*product" FILE
which contains your examples above, it only matches the 2 lines with bulk and product. We can, additionally, exclude '/_/':
egrep ".*bulk.*product" FILE | egrep -v "/_/"
Two invocations are often much easier to define and understand than one big catch-all expression.
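The same two-step idea, sketched in Python for anyone who wants to test it on a list of paths. The function name product_pages is hypothetical, and the last path in the sample list is an invented one added only to exercise the /_/ exclusion.

```python
import re

def product_pages(paths):
    """Keep paths with "bulk" followed by "product", then drop any containing "/_/"."""
    matched = [p for p in paths if re.search(r"bulk.*product", p)]  # first egrep
    return [p for p in matched if "/_/" not in p]                    # egrep -v "/_/"

paths = [
    "/bulk-category_one/product",
    "/another-category/bulk-product",
    "/bulk-category_one/",
    "/another-category/",
    "/bulk-category/_/showAll/1/",
    "/bulk-category/_/bulk-product",  # hypothetical path, not from the question
]
print(product_pages(paths))
```

Keeping the include and exclude steps separate means either one can be changed without re-deriving a combined expression.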