Why did my regex not give the desired result - regex

I have a string from which I need to extract specific url that consists of an image extension and the following regex:
ITEMIMAGEURL\d+=(http://.*?)(,|$|\n)
and the string that I've to extract from is:
ITEMIMAGEURL0 = http://images.example.com/xyz/l/dasda/test-image-6af8af8afa9.jpg,
ITEMIMAGEURL1 = http://images.example.com/xyz/l/dasda/test-image-,
ITEMIMAGEURL2 = http://images.example.com/abc/as/test/test-image-abrd23lg9.jpg
My regex works fine but I want to extract only the url with .jpg|.gif or any other image extension so I've tried
ITEMIMAGEURL\d+=(http://.*?(?(?=.[a-zA-Z]{3,4})))(,|$|\n)
But it didn't work as expected
My expected result is
http://images.example.com/xyz/l/dasda/test-image-6af8af8afa9.jpg
http://images.example.com/abc/as/test/test-image-abrd23lg9.jpg

You can use this regex to extract image URLs:
ITEMIMAGEURL\d+=(http://[^,\s]+?\.(?:jpe?g|gif|png))
RegEx Demo
Your image URL is captured in group #1. This assumes your URL doesn't contain comma character.
If comma is allowed in image URLs then use this regex with negative lookahead:
ITEMIMAGEURL\d+=(http://(?:(?!,ITEMIMAGEURL\d).)+\.(?:jpe?g|gif|png))
RegEx Demo 2

ITEMIMAGEURL\d+=(http:\/(?:\/[\w\.-]+)+\.(?:jpe?g|gif|png),?\s?)?
I think you know basics of RegExp. So one one: (?:\/[\w\.-]+) this is a pattern of valid url path. This is not only valid one, you could choose any you like, e.g. (?:\/[^\s,]+).
Demo

Related

regex extract username from 2 types of url

I'm currently using this regex (?<=\/movie\/)[^\/]+, but it only matches the username from the second url, i know i could make a if (contains /movie/): use this regex, else: use another regex on my code, but i'm trying to do this directly on regex.
http://example.com:80/username/token/30000
http://example.com:80/movie/username/token/30000.mp4
To complete the Tensibai's answer, if you have not a port in url, you can use the last dot in url to start your regex :
\.[^\/\.]+\/(?:movie\/)?([^\/]+)
(demo)
You can use something like this to make the movie/ optional and have the username in a named capture group (Live exemple):
\d[/](?:movie\/)?(?<username>[^/]+)[/]
using \d/ to anchor the start of match at after the url.

Regex PCRE capture multiple occurences query string in URL

I am trying to capture multiple occurence of utm tag in a URL and append when re-writing the url. However i just want utm key values and skip others.
This is a sample URL
https://example.com/dl/?screen=page&title=SABC&page_id=4063&myvalue=Noidea&utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&test=value&utm_content=contentTest19
I tried this:
(\?.*)(page_id=([^&]*))(\?|&)(.*[&?]utm_[a-z]+=([^&]+).*)
and unfortunately, it doesn't produce the result I expect.
I need to capture PAGE ID and utm tags both, but do not want test=value, myvalue=Noidea and only want query strings with utm tags.
Expected Result is the URL below:
https://example.com/dl/page_id/4063?utm_source=sourceTest19&utm_medium=mediumTest19&utm_campaign=campaignTest19&utm_term=termTest19&utm_content=contentTest19
one group with pageid=<somenumber/text>
one group with all utm tags with key and value
Help will be appreciated.
You can make regex like this to get group result:
(?:(page_id|utm_[a-z]+)=[A-z0-9]+)(?:^\&)?
You can instead replace any parameter that does not match the desired ones with the empty string. The pattern for this is
(?:[?&](?!(?:page_id|utm_[^=&]++)=)[^&]*+)++$|(?<=[?&])(?!(?:page_id|utm_[^=&]++)=)[^&]*+(?:&|$)
Here's a working proof: https://regex101.com/r/L5xcl4/2 It has an extra \s only so it works on the multiline input in the tester, but you shouldn't need it as you'll be working on a string that contains only a URL without whitespace.

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?
Sample URL
Trying to use RegEx to pull, CaptialForecasting_Datasheet.pdf
The REGEXP_EXTRACT in Google Data Studio seems unique. Tried the suggestion but kept getting "could not parse" error. I was able to strip out the first part of the url with the following. Event Label is where I store URL of downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now trying to determine how do I pull out everything after the? where the version data is, so as to extract just the Filename.pdf.
You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
It basically says: get everything in between the last / and the first ? after, and put it in group 1.
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches all non-/ characters, [^\/], immediately preceded by /, (?<=\/) and immediately followed by ?, (?=\?). The first parentheses is a positive lookbehind, and the second expression in parentheses is a positive lookahead.
This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.
Please try the following regex
[A-Za-z\_]*.pdf
I have tried it online at https://regexr.com/. Attaching the screenshot for reference
Please note that this only works for .pdf files
Following regex will extract file name with .pdf extension
(?:[^\/][\d\w\.]+)(?<=(?:.pdf))
You can add more extensions like this,
(?:[^\/][\d\w\.]+)(?<=(?:.pdf)|(?:.jpg))
Demo

Regex ignore first 12 characters from string

I'm trying to create a custom filter in Google Analytic to remove the query parts of the url which I don't want to see. The url has the following structure
[domain]/?p=899:2000:15018702722302::NO:::
I would like to create a regex which skips the first 12 characters (that is until:/?p=899:2000), and what ever is going to be after that replace it with nothing.
So I made this one: https://regex101.com/r/Xgbfqz/1 (which could be simplified to .{0,12}) , but I actually would like to skip those and only let the regex match whatever is going to be after that, so that I'll be able to tell in Google Analytics to replace it with "".
The part in the url that is always the same is
?p=[3numbers]:[0-4numbers]
Thank you
Your regular expression:
\/\?p=\d{3}\:\d{0,4}(.*)
Tested in Golang RegEx 2 and RegEx101
It search for /p=###:[optional:####] and capture the rest of the right side string.
(extra) JavaScript:
paragraf='[domain]/?p=899:2000:15018702722302::NO:::'
var regex= /\/\?p=\d{3}\:\d{0,4}(.*)/;
var match = regex.exec(paragraf);
alert('The rest of the right side of the string: ' + match[1]);
Easily use "[domain]/?p=899:2000:15018702722302::NO:::".substr(12)
You can try this:
/\?p\=\d{3}:\d{0,4}
Which matches just this: ?p=[3numbers]:[0-4numbers]
Not sure about replacing though.
https://regex101.com/r/Xgbfqz/1

Conditional Regex to match url

I am trying to make a if/then condition to match the url, but I can't seem to get it to work. I am trying to match URLs and then capture the non-optional group. So - if a url comes in like this:
/en/testing.aspx
I want to capture /testing.aspx
if the url comes in like this:
/testing.aspx
I want to capture /testing.aspx
Is there an easy way to do this using regex?
EDIT:
The Url can be multi-part url, like /en/sub1/sub2/testing.aspx - I essentially want everything after "/en/".
use regex \/en(\/.+)$
Check this out
edited
https://regex101.com/r/lwowhi/6
If there is "/en/" in the URL and you still want to capture /testing.aspx then here is an edit (?:\/en)*(\/.+)$
https://regex101.com/r/lwowhi/8
You can use a greedy regex which will consume everything up until the final forward slash. Then, capture everything which comes after that point.
^.*?(?:\/en)?(\/.*)$
Demo
Guessing all pages are .aspx then use group.
regex: .(/..aspx)
this will match "/testing.aspx" in all bellow samples
/testing.aspx or
/en/testing.aspx or
www.abc.com/en-us/testing.aspx