How to allow / match any path or query param for a dynamic link - regex

I've created a dynamic link for my project that has the following default allowed regex:
^https{0,1}:\/\/myapp\.page\.link([\/#\?].*){0,1}$
For dev purposes, I'd like to edit it to allow anything as long as it has the myapp.page.link domain: any path(s) and any query param(s). I'm having a tough time figuring this out. I googled around and found a pattern like this, but it doesn't seem to work: ^https://myapp\.page\.link/.*$

If all you want is to match any string that has myapp.page.link in it, you can just use myapp\.page\.link as the regex. This matches it as a substring anywhere in the string, so it also accepts URLs on other hosts that merely contain that text.
If you want to get more advanced and account for more URL constructs, you might use this: ^https?:\/\/myapp\.page\.link(/|/\S+)?
This reads "match 'http', an optional 's', a colon, two '/' characters, 'myapp.page.link' literally, followed by an optional trailing slash or trailing slash with non-space characters". Without a trailing $, the pattern only checks the start of the string; append $ if the whole URL must fit this shape.
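A quick way to see what each pattern accepts is to run it against a few sample URLs. The sketch below uses Python's re module purely for illustration (the engine that evaluates the dynamic link allowlist may behave differently), the sample URLs are made up, and it assumes a trailing $ has been appended to the second pattern.

```python
import re

# Default allowlist pattern from the question, with {0,1} written as ?.
DEFAULT = re.compile(r"^https?://myapp\.page\.link([/#?].*)?$")

# Pattern from the answer, with a trailing $ appended (an assumption).
ANSWER = re.compile(r"^https?://myapp\.page\.link(/|/\S+)?$")

# Hypothetical sample URLs, chosen only to exercise the patterns.
SAMPLES = [
    "https://myapp.page.link",
    "https://myapp.page.link/a/b?x=1",
    "https://myapp.page.link?x=1",
    "https://myapp.page.link.example.com/",
]


def accepts(pattern, url):
    """Return True if the compiled pattern matches the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in SAMPLES:
        print(url, accepts(DEFAULT, url), accepts(ANSWER, url))
```

In Python, at least, the default pattern already accepts any path, query string, or fragment on the domain, while the second pattern rejects a query string that directly follows the domain without a slash.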

Related

How to configure thank you page in Google Analytics with REGEX?

I would like to track the thank you page below in Google Analytics with a regex. The page varies depending on the title of the downloaded file (title-content). What kind of regex do I need in order to count the conversion regardless of the title of the file being downloaded?
/en/media/download/title-content/thanks
Taking a wild guess, you're trying to match the path you listed.
Something like this might work:
^/en/media/download/[^/]+/thanks$
Here I'm using [^/]+ instead of .+ because we want to match only characters which are not the slash, since I'm assuming ^/en/media/download/blah-blah/not/thanks$ should not match.
It depends on the environment you're targeting, though. This answer could vary depending on whether you are doing this in an Apache config file, a PHP script, JavaScript, or something else.
Also, the "anchors" (^ and $) force the match to start at the start of the string, and to end at the end, so foo/en/media/download/bar/thanks/baz won't match. Remove either or both as appropriate.
Also, depending on your regex engine, the / characters may need to be escaped, replacing each one with \/ instead (apparently not needed for Google Analytics!)
Also, depending on your regex engine, you might need to put a delimiter character at the beginning and end of the regex, and you might need to quote it. For example, in PHP you'd use something like: if (preg_match('#^/en/media/download/[^/]+/thanks$#', $_SERVER['REQUEST_URI'])) { /* do stuff here. */ } (also apparently not needed for Google Analytics!)
Thanks to a comment by @kbelder, you might want to consider something like this, if you're comparing against the whole request URI:
^/en/media/download/[^/]+/thanks/?([#?].*)?$
/? - An optional trailing slash.
[#?] - A hash or question mark.
.* - Any number of extra characters.
([#?].*)? - Optional GET parameters or anchor id.
This would mean it'd match:
/en/media/download/blah/thanks/?name=fred
/en/media/download/blah/thanks#credits
but not:
/en/media/download/blah/thanks/name/fred
/en/media/download/blah/thanks/credits
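The match and non-match lists above can be checked mechanically. Here is a minimal sketch in Python (used only for illustration, since Google Analytics evaluates the regex itself); the function name is hypothetical.

```python
import re

# Final pattern from the answer above.
THANKS = re.compile(r"^/en/media/download/[^/]+/thanks/?([#?].*)?$")


def is_thanks_page(path):
    """Return True if the request path is a download thank-you page."""
    return THANKS.match(path) is not None


if __name__ == "__main__":
    for path in [
        "/en/media/download/blah/thanks/?name=fred",
        "/en/media/download/blah/thanks#credits",
        "/en/media/download/blah/thanks/name/fred",
        "/en/media/download/blah/thanks/credits",
    ]:
        print(path, is_thanks_page(path))
```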

Match a url that does not contain certain word

I need some help with a regular expression that does not match URLs like this one:
/Common/Download.php?file=/path/to/file.pdf
and instead matches static URLs like this one:
/path/to/file.pdf
I have read many posts (including on this site), but nothing seems to work as expected.
Thanks for your help.
Lorenzo.
UPDATE
Sorry if this post was incomplete. Here is more information so I can get better help.
The regular expression must work with Apache's mod_rewrite module and, if possible, with the IIS rewrite module (this may not be its exact name), which as far as I know is compatible with Apache's. It must redirect the matching static URLs (only the second type from my post) to a specific page.
Thanks again.
Lorenzo.
Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.
Another option is to put the pattern "forward slash followed by lowercase characters" in a non-capturing group, repeat that group, and then match the file extension .pdf:
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Non capturing group (?:
Match a forward slash and one or more lowercase characters /[a-z]+
Close the non capturing group and repeat it 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static url and maybe longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf you could for example use:
^(?:/\w+)+\.\w+$
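To compare the candidate patterns side by side, a small harness such as the following can help. It is a sketch in Python rather than a mod_rewrite configuration, and the dictionary keys and function name are made up for illustration.

```python
import re

# Three of the patterns suggested above, keyed by illustrative names.
PATTERNS = {
    "generic": re.compile(r"^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$"),
    "three_segments_pdf": re.compile(r"^(?:/[a-z]+){3}\.pdf$"),
    "any_depth": re.compile(r"^(?:/\w+)+\.\w+$"),
}


def matching_patterns(url):
    """Return the names of the patterns that match the given URL."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(url)]


if __name__ == "__main__":
    for url in [
        "/path/to/file.pdf",
        "/path/file.pdf",
        "/Common/Download.php?file=/path/to/file.pdf",
        "/Common/Download.php",
    ]:
        print(url, matching_patterns(url))
```

One thing to keep in mind: mod_rewrite's RewriteRule tests only the URL path, without the query string, so there the input would be /Common/Download.php rather than the full URL, and the broader patterns do match that path. An extra condition or an earlier rule would likely be needed to exclude it.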

Regex expression for parsing URLs my way

I have a question about how to parse URLs my way.
Here's my regex:
[^\s]+?\.(com|net|org|edu...ALL_DOMAIN_EXTENSIONS)([^\s\w\d][^\s]{1,})?
My rationale is that I want to accept
mail.google.com (as long as there's a .com, .net etc)
However, if anything follows the .com, it must be a symbol and not an alphanumeric character. With this way of checking, though, this URL will fail:
www.company.com
However, I can't use a greedy repetition to search for a .com, as in this case:
developer.google.com/appid=com.company.apppackage
How do I check for the first occurrence of a '.com' without an alphanumeric character following it, while keeping what follows optional in case it's just
Google.com
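Use $ as an alternative to match the end of the string, and drop the trailing ? so that the group is no longer optional (otherwise the engine can still skip the group and stop at the first .com it finds):
[^\s]+?\.(com|net|org|edu...ALL_DOMAIN_EXTENSIONS)([^\s\w\d][^\s]+|$)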
BTW, trying to match all top-level domains will drive you crazy, since anyone can now register a TLD, so they change frequently.
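Here is a minimal sketch of the $ alternative in Python, with the trailing group made mandatory so the engine cannot skip it. The four-entry TLD list is only a stand-in for the elided full list, the function name is hypothetical, and each input is assumed to be a single URL-like string rather than running text.

```python
import re

# Short, illustrative TLD list (an assumption); the original elides the full list.
TLDS = "com|net|org|edu"

# The trailing group is mandatory: either a non-word character followed by
# more non-space characters, or the end of the string.
URL = re.compile(r"[^\s]+?\.(" + TLDS + r")([^\s\w\d][^\s]+|$)")

# Same pattern with the group left optional, to show the partial match.
OPTIONAL = re.compile(r"[^\s]+?\.(" + TLDS + r")([^\s\w\d][^\s]+|$)?")


def find_url(pattern, text):
    """Return the text matched by the pattern, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for text in [
        "www.company.com",
        "developer.google.com/appid=com.company.apppackage",
        "Google.com",
    ]:
        print(text, find_url(URL, text), find_url(OPTIONAL, text))
```

With the optional group, the lazy prefix stops at the "com" inside "company", which is the partial match the question describes.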

Regex to "ignore" not "exclude"

I'm totally lost. I need a regular expression that
can detect any of the 4 URL prefixes below:
^(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)$
It should also detect:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
And, importantly, it
should ignore, but NOT exclude, the following exact string (whether or not it's present in the page):
http://www.w3.org
This is complicated for me, because I still need to include it in the regex line
even if it's ignored; otherwise, it will match and be found by
(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)
My aim is to find/match any URL besides
http://www.w3.org
whether or not it's present in the page.
So if there's only this in the page:
http://www.w3.org
and no other URL, then it shouldn't match.
Thanks Tyler, but my regex knowledge is almost zero; I only know what commands do when I right-click on them to choose actions in tools like Regulazy or RegExr.
So I updated my command according to the URL I provided to you:
href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom
& it works:
https?(://|%3A%2F%2F)(?!www.w3.org)(.*)
But because of my lack of knowledge, I don't understand how to do what is described below:
"What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs"
I tried to add this, but it doesn't work:
(www.)
All I'm missing now is detection of URLs starting with www:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
OK so try this:
/\bhttps?(:\/\/|%3A%2F%2F)(?!www\.w3\.org)(.*)\b/g
Test here: http://regexr.com?38jp5
That test link uses JavaScript-style regex syntax, but the pattern should work in other engines that support lookaheads.
The important part is the second half - a negative lookahead, that checks what follows is not the exact text www.w3.org
I compressed what you had: mine matches http then an optional s then either :// or %3A%2F%2F.
I wrapped the whole thing in word boundaries; you could change that to quotes or whatever you need. The global flag lets you match multiple items.
In regards to OP's questions:
"D%22 could appear before http or https. This one is missing & should match: href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom"
If this matters, just remove the word boundary \b before and after the regex, so the http can match anywhere.
The regex command should detect: (any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
This regex would fail to match a link like http://google.com - looking for www is really not a good way to check for a link on its own. What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs
Edit #2:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
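As I mention above, what you are describing will not match a URL like http://google.com, but if that is what you want, use this (\S+ already stops at a space or the end of the line; a $ inside a character class would be a literal dollar sign):
(\W|^)[wW]{3}\.\S+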
Instead of that, what I think you want is this, which is a combination of my first answer, and the link to a different post above.
((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*
You'll want to use this regex with the global and case-insensitive flags.
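As a rough check of the combined pattern, here is a sketch in Python, where finditer plays the role of the global flag and re.IGNORECASE the role of the insensitive flag. The function name and sample strings are illustrative only.

```python
import re

# Combined pattern from the answer above, compiled case-insensitively.
LINK = re.compile(
    r"((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))"
    r"(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*",
    re.IGNORECASE,
)


def find_links(text):
    """Return every link the pattern finds; w3.org links are skipped by the lookahead."""
    return [m.group(0) for m in LINK.finditer(text)]


if __name__ == "__main__":
    print(find_links("http://www.w3.org"))
    print(find_links("see http://www.w3.org and http://google.com"))
    print(find_links("href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom"))
```

The negative lookahead is what makes a page containing only http://www.w3.org produce no match, while any other link on the same page is still found.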

Parsing text for URLs with trailing commas

I'm looking at a JSON feed from Twitter and trying to make URLs clickable using a regular expression.
The problem is that there are URLs in the text with trailing commas. A comma can legally be part of a URL, but in this case they're just punctuation inserted by the user.
Is there any way around this? Am I missing something?
You are not missing anything; there is no foolproof way of determining the "intended" URL when it is embedded in plain text. Your best bet is to make an educated guess.
A common approach is to check if the punctuation mark(s) in question is followed by a whitespace or is the terminator of the string. If it is, do not interpret it as part of the URL; otherwise, include it.
Keep in mind this problem isn't limited to commas or a single character (consider the ellipsis, ...).
You could ignore the last character if it is punctuation (so that punctuation in the middle of a url doesn't affect it).
For example, the regex could be something like:
`([a-z/A-Z0-9.,]*?)([.,]?)\s`
Warning: the first part of the regex doesn't cover every character that can appear in a URL, so you still need to fix that. But essentially, we have ([a-z/A-Z0-9.,]*?), which matches the main part of the URL. The * allows many characters, but we add ? so that it isn't greedy.
Then we use ([.,]?) to match a possible trailing punctuation, and \s to match a space or whitespace.
The first subexpression is therefore the url, and you can turn it into a link.
If you have access to the internet, you could try accessing the resource to see if it returns a 404 to decide whether the trailing punctuation is part of the URL or actual punctuation.
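The "punctuation followed by whitespace or end of string" heuristic could be sketched like this in Python. The URL pattern here is a deliberately loose assumption (anything starting with http:// or https:// up to the next whitespace), not a full URL grammar, and the punctuation set and function name are illustrative.

```python
import re

# Illustrative pattern (an assumption): a lazy URL body, then any run of
# trailing punctuation, which must be followed by whitespace or the end of
# the string. Punctuation in the middle of the URL stays in group 1.
URL = re.compile(r"(https?://\S+?)([.,!?;:]*)(?=\s|$)")


def extract_urls(text):
    """Return URLs found in text with trailing punctuation stripped."""
    return [m.group(1) for m in URL.finditer(text)]


if __name__ == "__main__":
    print(extract_urls("See http://example.com/a,b, and http://foo.com."))
    print(extract_urls("Wait for it http://foo.com/x..."))
```

This is still an educated guess: a URL that genuinely ends in a comma or period would be truncated.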