Regular expression to extract part of a file path using the logstash grok filter - regex

I am new to regular expressions but I think people here may give me valuable inputs. I am using the logstash grok filter in which I can supply only regular expressions.
I have a string like this
/app/webpf04/sns882A/snsdomain/logs/access.log
I want to use a regular expression to extract the sns882A part of the string, that is, the segment after the third "/". How can I do that?
I am restricted to regex as grok only accepts regex. Is it possible to use regex for this?

Yes, you can use a regular expression to get what you want via grok:
/[^/]+/[^/]+/(?<field1>[^/]+)/

For your path, this regex works:
/\w*\/\w*\/(\w*)\/
You can test it at:
http://www.regextester.com/
Searching for "regex tester" turns up other tools with different interfaces.

If you are indeed using Perl then you should use the File::Spec module like this
use strict;
use warnings;
use File::Spec;
my $path = '/app/webpf04/sns882A/snsdomain/logs/access.log';
my @path = File::Spec->splitdir($path);
print $path[3], "\n";
output
sns882A

This is how I would do it in Perl:
my ($name) = ($fullname =~ m{^(?:/.*?){2}/(.*?)/});
EDIT:
If your framework does not support Perl-style non-capturing groups (?:xyz), this regex should work instead:
^/.*?/.*?/(.*?)/
If you are concerned about performance of .*?, this works as well:
^/[^/]+/[^/]+/([^/]+)/
One more note: all of the regexes above match the string /app/webpf04/sns882A/.
But the matched string is different from the first capture group, which is sns882A in all three cases.
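To make the difference between the whole match and the first capture group concrete, here is a minimal Python sketch. It assumes Python's re module behaves like the Perl and grok engines for this simple pattern; the variable names are only illustrative.

```python
import re

path = "/app/webpf04/sns882A/snsdomain/logs/access.log"

# Anchored pattern from the answer above: skip two segments, capture the third.
pattern = re.compile(r"^/[^/]+/[^/]+/([^/]+)/")

m = pattern.search(path)
whole_match = m.group(0)    # everything the pattern consumed
third_segment = m.group(1)  # only the captured segment

print(whole_match)    # /app/webpf04/sns882A/
print(third_segment)  # sns882A
```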

Same answer, with a small bug fix. If you don't anchor the pattern with ^, it can also match further along the path (try longer inputs with more / separators). To fix this, add ^ at the start; ^ means the start of the input line. Group 1 is then your answer.
^/[^/]+/[^/]+/([^/]+)/
If you are working with URI paths, use the pattern below (it handles both paths and URIs).
^.*?/[^/]+/[^/]+/([^/]+)/

Related

Using RegEx with Alteryx to replace string

I have a simple issue: Using Alteryx, I want to take a string, match a certain pattern and return the matched pattern.
This is my current approach:
Regex_replace("CP:ConsumerProducts&Retail</td><td><strong><fontcl","[^\<]+","$1")
According to various sources and tools like regex101, the first matched sequence should be "CP:ConsumerProducts&Retail". However, Alteryx returns
<<<<
Alteryx uses the Perl RegEx Syntax (https://help.alteryx.com/2018.2/boost/syntax_perl.html), therefore, it should have no problem with the pattern itself.
I believe I am missing something obvious but I cannot figure it out.
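I have received a reply through a different forum. A solution that works for me is to use the following pattern: ([^\<]+).*
The original pattern has no capture group, so $1 is empty, and every run of non-"<" characters is replaced with nothing, which leaves only the four "<" characters. The new pattern captures the first run in group 1 and lets .* consume the rest, so the whole string is replaced by the captured text.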
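As a rough analogue outside Alteryx, the same behaviour can be reproduced with Python's re.sub. This is only a sketch: Python does not use the $1 replacement syntax, so an empty replacement stands in for the empty $1 of the first attempt, and \1 stands in for $1 in the second.

```python
import re

s = "CP:ConsumerProducts&Retail</td><td><strong><fontcl"

# No capture group: every run of non-"<" characters is replaced by nothing,
# so only the "<" characters survive.
without_group = re.sub(r"[^<]+", "", s)

# Capture the first run and let ".*" consume the rest of the string,
# so the entire input is replaced by group 1.
with_group = re.sub(r"([^<]+).*", r"\1", s)

print(without_group)  # <<<<
print(with_group)     # CP:ConsumerProducts&Retail
```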

Regex: Extract string between two strings

This is the issue I face
The String
nt/sign-in?wa=wsignin1.0&wtre
The Need
From that string I need to extract the following
wsignin1.0
The Attempts
So far I have tried the following Regex
wa=(.*?)(?=&amp)
This returns:
wa=wsignin1.0
The "wa=" is not supposed to be there
Perhaps with a look behind?
(?<=wa\=)(.+)(?=\&wtre)
wsignin1.0
JMeter uses Perl5-style regular expressions therefore the regex you are looking for might be as simple as:
wa=(.+?)&wtre
Use $1$ as the "Template" in your Regular Expression Extractor.
See How to Debug your Apache JMeter Script for more details on JMeter tests troubleshooting.
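The capture-group pattern can also be checked outside JMeter. The following Python sketch assumes that Python's re module treats this pattern the same way as JMeter's Perl5-style engine; group 1 here plays the role of the $1$ template.

```python
import re

response = "nt/sign-in?wa=wsignin1.0&wtre"

m = re.search(r"wa=(.+?)&wtre", response)
whole = m.group(0)  # the full match, including "wa=" and "&wtre"
value = m.group(1)  # only the text between the two delimiters

print(whole)  # wa=wsignin1.0&wtre
print(value)  # wsignin1.0
```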
=([\w.]++)
will capture it in the first capture group. Otherwise I think @jivan has a good idea with the lookbehind. A little tweak to it:
(?<==)[\w.]++
Put this in your Regular Expression extractor:
nt/sign-in\?wa=([a-zA-Z0-9.]*)&wtre
Note that the literal ? must be escaped; otherwise it makes the preceding n optional. I hope this helps.

Matching a substring using a regular expression in PowerShell

Regular expressions are really not my forte and I am trying to learn. Struggling with this one at the moment.
<fraglink id="230681395" resid="1057000484">
I have a file with loads of text in it, and every now and then bits like the above appear in it. I want to get the number in between the quotes after resid=.
Is some sort of look ahead / behind required here?
It looks like you want a regex like:
resid="([0-9]+)"
And to grab $1.
Since your content looks like XML, you should probably not use a regular expression to grab your desired value. If you share your whole file we will show you how to select the value properly using XPath for example.
However, if you want to use a regular expression for training purposes, try this:
$content = Get-Content 'your_file_path' -raw
[regex]::Match($content, '\bresid="([^"]+)').Groups[1].Value
Using a lookbehind ((?<=pattern)), your pattern could look like:
(?<=resid=")\d+
With Regex.Match():
$id = [regex]::Match($inputString,'(?<=resid=")\d+').Value
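For comparison, here is the same pair of approaches as a small Python sketch, assuming the sample line from the question. Both the capture group and the lookbehind return the same number; the lookbehind simply keeps resid=" out of the match itself.

```python
import re

line = '<fraglink id="230681395" resid="1057000484">'

# Capture group: match the attribute, read group 1.
via_group = re.search(r'resid="([0-9]+)"', line).group(1)

# Lookbehind: the match itself is only the digits.
via_lookbehind = re.search(r'(?<=resid=")\d+', line).group(0)

print(via_group)       # 1057000484
print(via_lookbehind)  # 1057000484
```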

regular expression matching filename with multiple extensions

Is there a regular expression to match the some.prefix part of both of the following filenames?
xyz can be any character of [a-z0-9-_\ ]
some.prefix part can be any character in [a-zA-Z0-9-_\.\ ].
I intentionally included a . in some.prefix.
some.prefix.xyz.xyz
some.prefix.xyz
I have tried many combinations. For example:
(?P<prefix>[a-zA-Z0-9-_\.]+)(?:\.[a-z0-9]+\.gz|\.[a-z0-9]+)
It works with abc.def.csv, capturing abc.def, but fails to capture it in abc.def.csv.gz.
I primarily use Python, but I thought the regex itself should apply to many languages.
Update: It's not possible, see discussion with @nowox below.
I think your regex works pretty well. I recommend trying regex101 with your example:
https://regex101.com/r/dV6cE8/3
The expression
^(?i)[ \w-]+\.[ \w-]+
Should work in your case:
som e.prefix.xyz.xyz
^^^^^^^^^^^
some.prefix.xyz
^^^^^^^^^^^
abc.def.csv.gz
^^^^^^^
And in Python you can use:
import re
text = """som e.prefix.xyz.xyz
some.prefix.xyz
abc.def.csv.gz"""
# Python 3.11+ requires global inline flags such as (?i) at the very start of the pattern.
print(re.findall(r'(?i)^[ \w-]+\.[ \w-]+', text, re.MULTILINE))
Which will display:
['som e.prefix', 'some.prefix', 'abc.def']
I think you may be a bit confused about your requirement. To summarize, you have path names made of characters and dots, such as:
foo.bar.baz.0
foobar.tar.gz
f.o.o.b.a.r
How would you separate these strings into a base name and an extension? We recognize some known patterns: .tar.gz is definitely an extension, but is .bar.baz.0 the extension, or is it only .0?
The answer is not easy, and no regex in the world can guess the correct answer 100% of the time without some hints.
For example, you can list the acceptable extensions and set some criteria:
An extension matches the regex \.\w{1,4}$
Several extensions may be concatenated: (\.\w{1,4}){1,4}$
The remainder is called the basename
From this you can build this regular expression:
(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$
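A minimal Python sketch of the criteria above shows what this expression actually returns. The helper name split_name is hypothetical, and the fallback for names without an extension is an assumption added for illustration.

```python
import re

# Lazy basename, then one to four short dot-separated extensions up to the end.
SPLIT = re.compile(r"(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$")

def split_name(filename):
    """Return (basename, extension) according to the heuristic regex."""
    m = SPLIT.match(filename)
    if m is None:
        # Assumption: a name with no recognizable extension is all basename.
        return filename, ""
    return m.group("basename"), m.group("extension")

for name in ["foobar.tar.gz", "foo.bar.baz.0", "some.prefix.xyz.xyz", "abc.def.csv.gz"]:
    print(name, "->", split_name(name))
```

With these rules, some.prefix.xyz.xyz splits into some.prefix and .xyz.xyz only because "prefix" is longer than four characters, while abc.def.csv.gz splits into abc and .def.csv.gz. This is the ambiguity described above: the regex needs extra hints to know that .def belongs to the base name.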
Try this: [a-z0-9-_\\]+\.[a-z0-9-_\\]+[a-zA-Z0-9-_\.\\]+

Notepad++ replace with reg expression?

I have a big list with links and other data in it. I want to filter out all the other data and end up with a list of just the links.
Example of the current list:
32,2012-01-04 06:44:44,http://link.com/link
33,2012-01-04 06:44:45,http://link.com/link,{Text|textext|text},http://link.com/link|http://link.com/link|http://link.com/link
Notepad++ offers find replace functionality using RegEx. You can access this feature by using Ctrl+H.
If you're actually asking for a regular expression to do this, you can use something like this to match URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
which I found here.
Additionally you can test out changes to your regex easily at http://gskinner.com/RegExr/
Using the input you provided, here's a pattern you can use on http://www.regexr.com/
You'll need to make sure the global (/g) flag is on
Expression:
.*?(http.*?)[,|\n]
Input:
32,2012-01-04 06:44:44,http://link.com/link1
33,2012-01-04 06:44:45,http://link.com/link2,{Text|textext|text},http://link.com/link3|http://link.com/link4|http://link.com/link5
Substitution:
$1\n
Output:
http://link.com/link1
http://link.com/link2
http://link.com/link3
http://link.com/link4
http://link.com/link5
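The same find-and-replace can be sketched in Python for a quick check outside the editor. This assumes the input ends with a newline (so the last link has a terminator to match) and that Python's re.sub, which replaces all matches by default, stands in for the global flag.

```python
import re

text = (
    "32,2012-01-04 06:44:44,http://link.com/link1\n"
    "33,2012-01-04 06:44:45,http://link.com/link2,{Text|textext|text},"
    "http://link.com/link3|http://link.com/link4|http://link.com/link5\n"
)

# Inside [...] the "|" is a literal pipe, so a link ends at a comma,
# a pipe, or a newline. Everything before the link is discarded.
result = re.sub(r".*?(http.*?)[,|\n]", r"\1\n", text)
links = result.splitlines()

for link in links:
    print(link)
```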