Regex PCRE capture multiple occurrences of query string in URL - regex

I am trying to capture multiple occurrences of utm tags in a URL and append them when rewriting the URL. However, I only want the utm key/value pairs and want to skip the others.
This is a sample URL
https://example.com/dl/?screen=page&title=SABC&page_id=4063&myvalue=Noidea&utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&test=value&utm_content=contentTest19
I tried this:
(\?.*)(page_id=([^&]*))(\?|&)(.*[&?]utm_[a-z]+=([^&]+).*)
and unfortunately, it doesn't produce the result I expect.
I need to capture both the page ID and the utm tags. I do not want test=value or myvalue=Noidea; only the query parameters with utm tags should be kept.
Expected Result is the URL below:
https://example.com/dl/page_id/4063?utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&utm_content=contentTest19
one group with page_id=<some number/text>
one group with all utm tags with key and value
Help will be appreciated.

You can use a regex like this to match each wanted key=value pair, with the key captured in group 1:
(?:(page_id|utm_[a-z]+)=[A-Za-z0-9]+)
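As a rough illustration of the capture-based approach, here is a minimal Python sketch that collects the page_id and utm_* pairs and rebuilds the URL in the shape the question asks for. The helper name rewrite_url and the value pattern [^&#]* are assumptions made for this example, and Python's re module stands in for PCRE.

```python
import re

# Hypothetical pattern: a wanted key at the start of the query or after "&",
# followed by its value (anything up to the next "&" or "#").
PAIR = re.compile(r"(?:^|&)(page_id|utm_[a-z]+)=([^&#]*)")

def rewrite_url(url):
    """Keep only page_id and utm_* parameters and move page_id into the path."""
    base, _, query = url.partition("?")
    pairs = PAIR.findall(query)  # list of (key, value) tuples, in order
    page_id = next((v for k, v in pairs if k == "page_id"), None)
    utm = "&".join(f"{k}={v}" for k, v in pairs if k != "page_id")
    result = base.rstrip("/")
    if page_id is not None:
        result += f"/page_id/{page_id}"
    if utm:
        result += f"?{utm}"
    return result
```

Calling rewrite_url on the sample URL from the question should give the expected URL, with test=value and myvalue=Noidea dropped.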

You can instead replace any parameter that does not match the desired ones with the empty string. The pattern for this is
(?:[?&](?!(?:page_id|utm_[^=&]++)=)[^&]*+)++$|(?<=[?&])(?!(?:page_id|utm_[^=&]++)=)[^&]*+(?:&|$)
Here's a working proof: https://regex101.com/r/L5xcl4/2 It has an extra \s only so it works on the multiline input in the tester, but you shouldn't need it as you'll be working on a string that contains only a URL without whitespace.
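The removal approach can be tried outside a PCRE engine too. The sketch below is an adaptation, not the pattern above verbatim: Python's re only supports possessive quantifiers from version 3.11, so plain greedy quantifiers are used instead. The names KEEP, UNWANTED and strip_unwanted are made up for this example.

```python
import re

# Keys that must survive; everything else in the query string is removed.
KEEP = r"(?:page_id|utm_[^=&]+)="

# First alternative: a run of unwanted parameters at the very end of the URL,
# including the separator in front of each. Second alternative: one unwanted
# parameter elsewhere, together with the "&" that follows it.
UNWANTED = re.compile(
    rf"(?:[?&](?!{KEEP})[^&]*)+$"
    rf"|(?<=[?&])(?!{KEEP})[^&]*(?:&|$)"
)

def strip_unwanted(url):
    """Replace every parameter that is not page_id or utm_* with the empty string."""
    return UNWANTED.sub("", url)
```

Note that this only filters the query string; turning page_id=4063 into the path segment page_id/4063 would still need a separate rewrite step.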

Related

Why did my regex not give the desired result

I have a string from which I need to extract specific URLs that end in an image extension, and I have the following regex:
ITEMIMAGEURL\d+=(http://.*?)(,|$|\n)
and the string that I've to extract from is:
ITEMIMAGEURL0 = http://images.example.com/xyz/l/dasda/test-image-6af8af8afa9.jpg,
ITEMIMAGEURL1 = http://images.example.com/xyz/l/dasda/test-image-,
ITEMIMAGEURL2 = http://images.example.com/abc/as/test/test-image-abrd23lg9.jpg
My regex works fine but I want to extract only the url with .jpg|.gif or any other image extension so I've tried
ITEMIMAGEURL\d+=(http://.*?(?(?=.[a-zA-Z]{3,4})))(,|$|\n)
But it didn't work as expected
My expected result is
http://images.example.com/xyz/l/dasda/test-image-6af8af8afa9.jpg
http://images.example.com/abc/as/test/test-image-abrd23lg9.jpg
You can use this regex to extract image URLs:
ITEMIMAGEURL\d+=(http://[^,\s]+?\.(?:jpe?g|gif|png))
Your image URL is captured in group #1. This assumes your URL doesn't contain a comma character.
If commas are allowed in image URLs, then use this regex with a negative lookahead:
ITEMIMAGEURL\d+=(http://(?:(?!,ITEMIMAGEURL\d).)+\.(?:jpe?g|gif|png))
ITEMIMAGEURL\d+=(http:\/(?:\/[\w\.-]+)+\.(?:jpe?g|gif|png),?\s?)?
I assume you know the basics of regex, so just one note: (?:\/[\w\.-]+) is a pattern for a valid URL path segment. It is not the only valid one; you could choose any you like, e.g. (?:\/[^\s,]+).
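For a quick check of the comma-free variant, here is a small Python sketch run against the sample data from the question. One deviation is an assumption on my part: the sample has spaces around the equals sign, so \s* is added on both sides of it. The names IMAGE_URL, SAMPLE and extract_image_urls are invented for this example.

```python
import re

# Same idea as the first pattern above, with optional whitespace around "=".
IMAGE_URL = re.compile(
    r"ITEMIMAGEURL\d+\s*=\s*(http://[^,\s]+?\.(?:jpe?g|gif|png))"
)

SAMPLE = """ITEMIMAGEURL0 = http://images.example.com/xyz/l/dasda/test-image-6af8af8afa9.jpg,
ITEMIMAGEURL1 = http://images.example.com/xyz/l/dasda/test-image-,
ITEMIMAGEURL2 = http://images.example.com/abc/as/test/test-image-abrd23lg9.jpg"""

def extract_image_urls(text):
    """Return the image URLs (group 1) for every ITEMIMAGEURL entry that has one."""
    return IMAGE_URL.findall(text)
```

The second entry has no image extension, so it is skipped and only the first and third URLs come back.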

Using regex to filter a URL list

I'm trying to use regex to filter a list of sites so that it keeps only those that don't include a specific word.
For example, from the list below, I want to filter out all sites containing the word test, as well as empty strings, so that the final output is http://example.com. I tried ^((?!test).)* but that doesn't filter out empty strings. Maybe there is a better way to filter them? Thanks.
http://test1.com
http://test2.com
*empty string*
http://example.com
You need to use a negative lookahead and .+ in your regex, like this:
^(?!.*test).+
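To see the effect on the list from the question, a minimal Python sketch could look like this (the names NO_TEST and filter_sites are hypothetical). The lookahead rejects any line that contains test anywhere, and .+ requires at least one character, which is what rules out the empty string.

```python
import re

# Negative lookahead for "test" anywhere in the line, then at least one character.
NO_TEST = re.compile(r"^(?!.*test).+")

def filter_sites(sites):
    """Keep only non-empty entries that do not contain the word 'test'."""
    return [s for s in sites if NO_TEST.match(s)]
```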

Simple regex to replace first part of URL

Given
http://localhost:3000/something
http://www.domainname.com/something
https://domainname.com/something
How do I select whatever is before the /something and replace it with staticpages?
The input URL is the result of a request.referer, but since you can't render request.referer (and I don't want a redirect_to), I'm trying to manually construct the appropriate template using controller/action where action is always the route, and I just need to replace the domain with the controller staticpages.
You could use a regex like this:
(https?://)(.*?)(/.*)
Working demo
As you can see in the Substitution section, you can use the capturing groups and concatenate the strings you need to generate the URLs.
The idea of the regex is to capture the parts before and after the domain and use \1 + staticpages + \3.
If you want to change the protocol to ftp, you could use different capturing group indexes with this replacement string:
ftp://\2\3
So, you would have:
ftp://localhost:3000/something
ftp://www.domainname.com/something
ftp://domainname.com/something
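The same substitutions can be sketched in Python. The helper names swap_host and to_ftp are invented for this example; a function is used as the replacement in swap_host so that the group reference is not confused with the literal text that follows it.

```python
import re

# Group 1: protocol, group 2: host (and port), group 3: path.
URL_PARTS = re.compile(r"(https?://)(.*?)(/.*)")

def swap_host(url, replacement="staticpages"):
    """Rebuild the URL as group 1 + replacement + group 3."""
    return URL_PARTS.sub(lambda m: m.group(1) + replacement + m.group(3), url)

def to_ftp(url):
    """Keep host and path (groups 2 and 3) but switch the protocol to ftp."""
    return URL_PARTS.sub(r"ftp://\2\3", url)
```

If only staticpages/something is wanted, without the protocol, dropping m.group(1) from the concatenation should be enough.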

Extract last part of url without query string or jsessionid

I want a regex that will always return the last part of a URL before the query string parameters, and without the jsessionid if present.
Here's some url examples:
http://www.somesite.com/some/path/test.action;jsessionid=000063vCmvJAn7VWyymA_dPsHZs:16u9pglit?sort=2&param1=1&param2=2
http://www.somesite.com/some/path/test;jsessionid=000063vCmvJAn7VWyymA_dPsHZs:16u9pglit?sort=2&param1=1&param2=2
http://www.somesite.com/some/path/test.action?sort=2&param1=1&param2=2
http://www.somesite.com/some/path/test?sort=2&param1=1&param2=2
Here's my regex so far:
.*http://.*/some/path.*/(.*);?.*\?.*
It works for the URLs that do not contain jsessionid, but returns test;jsessionid=... if it is present.
To test: http://regex101.com/r/fM0mE2
I would use this regex:
.*http:\/\/.*\/some\/path.*\/([^;\?]+);?.*\?.*
The change is the capture group ([^;\?]+), which matches anything that isn't ; or ?. I think the regex might also be shortened to:
.*http:\/\/.*\/some\/path.*\/([^;\?]+)
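A short Python sketch of the shortened pattern, checked against the four URLs from the question. The names LAST_PART and last_segment are hypothetical, and the forward slashes are left unescaped because Python's re does not need them escaped.

```python
import re

# Capture the last path segment, stopping at the first ";" or "?".
LAST_PART = re.compile(r".*http://.*/some/path.*/([^;?]+)")

def last_segment(url):
    """Return the final path segment without jsessionid or query string, or None."""
    m = LAST_PART.match(url)
    return m.group(1) if m else None
```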

Regex Assistance for a url filepath

Can someone assist in creating a Regex for the following situation:
I have about 2000 records on which I need to do a search/replace, replacing a known item in each record that looks like this:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?
You can match the constant parts and use grouping to put them back:
(<li>View Product Information</li>)
Then replace the match with $1your_replacement$2, where $1 is the first matching group and $2 the second (in Python, for instance, you would call Match.group(1) and Match.group(2)).
You would have to escape \ chars if you're using Java instead.
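The two-group idea might look like this in Python. The markup in the question appears to have lost its link, so the record shape used here (an a element with an href inside the li) is purely an assumption for illustration, as are the names RECORD and replace_path.

```python
import re

# ASSUMED record shape: <li><a href="FILEPATH/FILE">View Product Information</a></li>
# Group 1 is the constant text before the path, group 2 the constant text after it.
RECORD = re.compile(r'(<li><a href=")[^"]*(">View Product Information</a></li>)')

def replace_path(record, new_path):
    """Put both constant groups back around the new path."""
    return RECORD.sub(lambda m: m.group(1) + new_path + m.group(2), record)
```

If the real records use a different attribute or tag, only the two constant groups need to change; the variable middle part stays [^"]* as long as it sits inside double quotes.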