Regular Expression for retrieving File Extension in HTTP url - regex

I am working on the ELK stack, and as part of a Logstash data transformation I am transforming data from Apache access logs.
One of the metrics needed is a breakdown of requests by content type (aspx, php, gif, etc.).
From the log file I am trying to retrieve the request URL and then deduce the file type. For example, if /c/dataservices/online.jsp?callBack is the request, I get .jsp using the regular expression
\.\w{3,4}
My regular expression won't work for a request such as /etc/designs/design/libs.min.1253.css; it returns .min as the extension.
I am trying to get the last extension, but it is not working. Please suggest other approaches.

You need to anchor the match to the end of the string or the beginning of a query param ?. Try:
\.\w{3,4}($|\?)
Play with it here: https://regex101.com/r/iV3iM1/1
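A minimal sketch of the anchored pattern, using Python's re module purely for illustration (the question is about Logstash, so the function name file_extension and the capture group around the extension are assumptions added here, not part of the answer):

```python
import re

# Same idea as \.\w{3,4}($|\?): the extension must be followed by the
# end of the string or by the "?" that starts the query string.
# The capture group (an addition for this sketch) drops the leading dot.
EXT_RE = re.compile(r"\.(\w{3,4})(?:$|\?)")

def file_extension(request_path):
    """Return the last 3-4 character extension, or None if there is none."""
    m = EXT_RE.search(request_path)
    return m.group(1) if m else None

print(file_extension("/c/dataservices/online.jsp?callBack"))
print(file_extension("/etc/designs/design/libs.min.1253.css"))
```

Because .min and .1253 are each followed by another dot rather than by the end of the string or a ?, they are skipped and only the final extension is returned.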

You're going to need a much fancier Regex.
Try this one.
([/.\w]+)([.][\w]+)([?][\w./=]+)?
This uses three capture groups. The first ([/.\w]+) matches your path up to the last .
The second ([.][\w]+) matches the final extension, and you can use the capture group to read it out.
The third ([?][\w./=]+)? matches the query string, which is optional.
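To see what each of the three groups captures, here is a small sketch in Python (chosen only for illustration; split_request is a hypothetical helper name, not something from the answer):

```python
import re

# The three-group pattern from the answer, unchanged.
URL_PARTS = re.compile(r"([/.\w]+)([.][\w]+)([?][\w./=]+)?")

def split_request(request):
    """Return (path_without_extension, extension, query_or_None), or None."""
    m = URL_PARTS.match(request)
    return m.groups() if m else None

print(split_request("/etc/designs/design/libs.min.1253.css"))
print(split_request("/c/dataservices/online.jsp?callBack"))
```

The first group is greedy, so it backtracks only as far as the last dot that is followed by word characters, which is why the second group ends up holding the final extension.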

Related

Regex to match url path - Golang with Gorilla Mux

I'm setting up an api endpoint which after parsing the url I would get a path such as /profile/<username> e.g. /profile/markzuck
The username is optional, though, as this endpoint returns the authenticated user's profile if the username is blank
Rule set:
I'm not the best at regex, but I've created an expression that requires /profile. If a / follows (e.g. /profile/), then a <username> matching (\w){1,15} must come next. I also want it to allow anything after a further /, e.g. /profile/<username>/<if preceding "/" then anything else>
Although I'm not 100% sure my expression is correct this seems to work in JavaScript
/^\/(profile)(\/(?=(\w){1,15}))?/
Gorilla Mux is different, though: it requires the route matching string to always start with a slash, and it has some other rules I don't understand, such as only accepting non-capturing groups
(I found this out by getting this error: panic: route /{_dummy:profile/([a-zA-Z_])?} contains capture groups in its regexp. Only non-capturing groups are accepted: e.g. (?:pattern) instead of (pattern))
I tried using the same expression I used for JavaScript, which didn't work here. I created a more forgiving expression, handlerFunc("/{_dummy:profile\/[a-zA-Z_].*}"), which does work; however, it doesn't really follow the rule set I'm using in my JavaScript expression.
I came up with my working expression from another SO post.
Gorilla Mux's docs also talk a little about how their regex works when explaining how to use the package in the intro section.
My question is: what is a similar or equivalent expression to the rule set I described that will work in Gorilla Mux HandlerFunc()?
If you're doing this with mux, then I believe what you need is not regex, but multiple paths.
For the first case, use a path "/profile". For the one containing a user name, use another path "/profile/{userName}". If you really want to use a regex, you can do "/profile/{username:}" to validate the user name. If you need to process anything that comes after username, either register separate paths (/profile/{username}/otherstuff), or register a pathPrefix "/profile/{username}/" and process the remaining part of the URL manually.
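For reference, the rule set from the question can be written with non-capturing groups only. The sketch below checks that expression with Python's re module; it is an assumption that a similar pattern could be adapted for a mux route, and it has not been tested against Gorilla Mux itself. The names PROFILE_RE and is_profile_path are hypothetical.

```python
import re

# /profile, optionally followed by /<username> (1-15 word characters),
# optionally followed by / and anything else. Only (?:...) groups are used.
PROFILE_RE = re.compile(r"^/profile(?:/\w{1,15}(?:/.*)?)?$")

def is_profile_path(path):
    """True if the path follows the rule set described in the question."""
    return PROFILE_RE.match(path) is not None

for path in ["/profile", "/profile/markzuck", "/profile/", "/profile/markzuck/a/b"]:
    print(path, is_profile_path(path))
```

Registering separate paths, as suggested above, avoids having to encode all of this in a single expression.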

How to dismiss the end of the url parameters with regex?

I have a script that is supposed to trigger when a certain page path is open.
The issue: the page path contains multiple parameters including the parameter "returnUrl", returning the previous page visited.
Here is the url I want to check :
/cxsSearchApply?positionId=a0w0X000004IceYQAS&lang=en&returnUrl=https://example.com/cxsrec__cxsSearchDetail?id=a0w0X000004IceYQAS&lang=en&returnUrl=https://example.com/cxsrec__cxsSearch&lang=en
I initially used this regex code to get triggered on this page :
(cxsSearchApply.*)
But I have other regexes like:
(cxsSearchSearchDetail.*)
And they also trigger, because their page path is included in the URL...
What regex should I use to match the first part of the URL but nothing after "returnUrl"?
So you want to match cxsSearchApply on the text before &returnUrl. You could use a lookahead:
(cxsSearchApply.*)(?=returnUrl=)
However, what you really want is to match everything before the first &returnUrl. So you need a non-greedy operator:
(cxsSearchApply.*?)(?=returnUrl=)
Likewise, for your other search, it should no longer match because it is also only looking at the first part:
(cxsSearchSearchDetail.*?)(?=returnUrl=)
I believe that will get you what you want.
Nothing after "returnUrl"
If this is literally what you want, you can simply do (.*?)(&returnUrl=.*) and take the first capture group as your result. The lazy .*? stops at the first &returnUrl=, whereas a greedy .* would run on to the last one.
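The difference between the greedy and the lazy lookahead can be checked on the URL from the question. This is a sketch in Python (the language the script runs in is not stated, so that is an assumption), and before_return_url, first_part and triggers are hypothetical helper names. One caveat it shows: because these patterns are not anchored, a pattern for the detail page can still find a match inside the first returnUrl value whenever another returnUrl= follows it, so the sketch also includes one possible workaround that cuts the URL at the first &returnUrl= before testing.

```python
import re

URL = (
    "/cxsSearchApply?positionId=a0w0X000004IceYQAS&lang=en"
    "&returnUrl=https://example.com/cxsrec__cxsSearchDetail?id=a0w0X000004IceYQAS&lang=en"
    "&returnUrl=https://example.com/cxsrec__cxsSearch&lang=en"
)

def before_return_url(pattern, url):
    """Return group 1 of the first match, or None."""
    m = re.search(pattern, url)
    return m.group(1) if m else None

# Greedy: runs up to the LAST "returnUrl=", so it still contains the detail page.
greedy = before_return_url(r"(cxsSearchApply.*)(?=returnUrl=)", URL)
# Lazy: stops right before the FIRST "returnUrl=".
lazy = before_return_url(r"(cxsSearchApply.*?)(?=returnUrl=)", URL)

def first_part(url):
    """Everything before the first &returnUrl= (hypothetical helper)."""
    return url.split("&returnUrl=", 1)[0]

def triggers(page_name, url):
    """Test the page name against the first part of the URL only."""
    return page_name in first_part(url)

print(lazy)
print(triggers("cxsSearchApply", URL), triggers("cxsSearchDetail", URL))
```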

Google Analytics Regex Code

I'm having trouble figuring out the last part of my regex code for Google Analytics. I want to be able to grab any URL from my site that fits the following pattern:
www.site.com/hotel/[any text]/rooms?[any text]
So the URLs will always begin with /hotel/ and will always end with /rooms? followed by any possible text string, with any possible text between "hotel/" and "/rooms?".
I have this much: ^/hotel/([^/])+/rooms([^\?])
But I'm not sure how to finish this so that it will only capture URLs that have text after the "?"
This works. You may want to tighten up the allowed text in the path parameter and query parameter.
^www\.site\.com/hotel/[^/]+/rooms\?.+$
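A quick way to sanity-check a pattern like this before pasting it into Google Analytics is to run it against a few sample URLs. The sketch below uses Python's re module, whose syntax may differ in places from the regex dialect Google Analytics accepts, and the sample URLs and the name is_rooms_url are made up for illustration:

```python
import re

# Host dots escaped; exactly one path segment between /hotel/ and /rooms;
# at least one character required after the "?".
ROOMS_RE = re.compile(r"^www\.site\.com/hotel/[^/]+/rooms\?.+$")

def is_rooms_url(url):
    return ROOMS_RE.match(url) is not None

for url in [
    "www.site.com/hotel/paris/rooms?date=1",  # text after "?": should match
    "www.site.com/hotel/paris/rooms?",        # nothing after "?": should not
    "www.site.com/hotel/paris/rooms",         # no "?" at all: should not
]:
    print(url, is_rooms_url(url))
```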

Matching URL containing one word AND another word using Regex

I am trying to write a regular expression to be used in a Google Analytics goal that will match URLs containing
?package=whatever
and also
/success
The user will first visit a page like
www.website.com/become-client/?package=greatpackage
and if they purchase they will be led to this page
www.website.com/become-client/?package=greatpackage/success
So based on this I could use the following regex
\?package\=greatpackage/success
This should match the correct destination and I would be able to use this in the goal settings in Analytics to create a goal for purchases of the greatpackage package.
But sometimes the website will use other parameters in addition to ?package. Like ?type, ?media and so on.
?type=business
Resulting in URLs like this
www.website.com/become-client/?package=greatpackage?type=business
and if they purchase they will be led to this page
www.website.com/become-client/?package=greatpackage?type=business/success
Now the /success part is moved away from the ?package part. My question is: how do I write a regex that will still match this URL no matter what other parameters there may be in between the parts?
---update----
@jonarz proposed the following and it works like a charm.
\?package\=greatpackage(.*?)/success
But what if there are two products with nearly the same name. For example greatpackage and greatpackageULTRA. The code above will select both. If changing the product names is impossible, how can I then select only one of them?
The regex that would solve the problem introduced in the edit would be:
\?package\=greatpackage((\?|\/)(.*?))?\/success(\/|\b)
Here is a test: https://regex101.com/r/jS4cH5/1 and it seems to suit your needs.
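The behaviour can also be checked locally. This is a sketch in Python, assuming its re module treats this pattern the same way as the regex101 test; the sample URLs follow the ones in the question, and is_greatpackage_purchase is a hypothetical name:

```python
import re

# After "greatpackage" the pattern only allows "?" or "/" (or nothing)
# before "/success", so "greatpackageULTRA" cannot slip through.
GOAL_RE = re.compile(r"\?package\=greatpackage((\?|\/)(.*?))?\/success(\/|\b)")

def is_greatpackage_purchase(url):
    return GOAL_RE.search(url) is not None

for url in [
    "www.website.com/become-client/?package=greatpackage/success",
    "www.website.com/become-client/?package=greatpackage?type=business/success",
    "www.website.com/become-client/?package=greatpackageULTRA/success",
]:
    print(url, is_greatpackage_purchase(url))
```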
If you want to match a URL like this one:
www.website.com/become-client/?package=greatpackage?type=business?other=nada/success
With a group to extract your package type :
.*\?package=([^\/?]+).*\/success
Without a group (just matching the URL if it contains package=greatpackage and success):
.*\?package=greatpackage.*\/success
Without group and matching for any package type :
.*\?package=[^\/?]+.*\/success
You just need to add .* to match any character (except newlines). The [^\/?]+ part is there to be sure your package type isn't empty (i.e. the first character is neither a / nor a ?).
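As an illustration of the capture-group variant, here is a short Python sketch (purchased_package is a hypothetical helper name; the URLs are the ones used in this thread):

```python
import re

# Group 1 captures the package name: one or more characters that are
# neither "/" nor "?", taken right after "?package=".
PACKAGE_RE = re.compile(r".*\?package=([^\/?]+).*\/success")

def purchased_package(url):
    """Return the package name for a success URL, or None otherwise."""
    m = PACKAGE_RE.match(url)
    return m.group(1) if m else None

print(purchased_package(
    "www.website.com/become-client/?package=greatpackage?type=business?other=nada/success"
))
```

Since the group captures the whole name, greatpackage and greatpackageULTRA come back as different values and can be told apart after the match.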

Grabbing specific query string parameters from URL with regex

We have an implementation of Liferay portal and I'm just getting started with using Google Analytics with it. I'm noticing a lot of duplicate entries in GA, mainly because of the query strings in the URI, for example:
/web/home-community/search-and-help?p_p_id=mytcdirectory_WAR_mytcdirectory&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-3&p_p_col_count=4&_mytcdirectory_WAR_mytcdirectory_action=getResults
I'm playing around with the Search and Replace filters in GA (using regex) and my goal is to try to pull out the ?p_p_id and &*_action parameters from the URI, and disregard the rest. I'm getting close with the following regex:
^([^\?]+)([\?\&]p_p_id=[^\&]+)?.*(\&[^\&]+_action=[^\&]+)?.*$
But that last grouping isn't working correctly. If I remove the ? from the end of the last grouping it matches, but the problem with that approach is that not all URIs contain that query string so it needs to be optional. But if I keep it in, it won't grab that last parameter. My regex fiddle is located here:
http://regex101.com/r/qQ2dE4/13
Thank you all in advance for any help.
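One way to see what is happening: the greedy .* in front of the last group consumes the rest of the string first, and since the group is optional the engine is satisfied with matching it zero times. A possible workaround, sketched here in Python and not verified against the regex dialect of GA's Search and Replace filters, is to move a lazy .*? inside an optional non-capturing group so the engine has to look for the _action parameter before giving up. ORIGINAL, REVISED and the sample URIs are illustrative names and values.

```python
import re

URI = (
    "/web/home-community/search-and-help?p_p_id=mytcdirectory_WAR_mytcdirectory"
    "&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-3"
    "&p_p_col_count=4&_mytcdirectory_WAR_mytcdirectory_action=getResults"
)

# Pattern from the question: group 3 stays empty.
ORIGINAL = re.compile(r"^([^\?]+)([\?\&]p_p_id=[^\&]+)?.*(\&[^\&]+_action=[^\&]+)?.*$")

# Sketch of a fix: the lazy .*? and the _action group are optional together.
REVISED = re.compile(
    r"^([^\?]+)([\?\&]p_p_id=[^\&]+)?(?:.*?(\&[^\&]+_action=[^\&]+))?.*$"
)

print(ORIGINAL.match(URI).group(3))
print(REVISED.match(URI).groups())
# A URI without an _action parameter still matches; group 3 is just None.
print(REVISED.match("/web/home?p_p_id=abc&p_p_state=normal").groups())
```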