Regex captures all occurrences but the last of certain characters - regex

I want to exclude common punctuation from my URL Regex detector when my clients type a sentence with a URL in it. A common scenario would be the URL example.com?q=this (which obviously needs to include the ?) versus a sentence saying
What do you think of example.com?
This expression suits my needs just fine:
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?
However it includes all punctuation at the end, so I am iterating through each match to find and use this captured group to exclude said punctuation:
(.*?)[?,!.;:]+$
However, I'm not sure how to leverage the "end of string" technique when scanning the entire block of text which may have multiple URLs. Was hoping there'd be a way to capture the right blocks from the get-go without the extra work.

Just require non-whitespace after the punctuation instead of making it optional.
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?
You will of course lose a valid trailing delimiter: example.com/ will be captured as example.com, but as far as I know the two are equivalent.
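A quick way to see what this change does is to run it over a sample sentence. The sketch below uses Python's re module as a stand-in engine and made-up sample text; the escaped \:\/\/ is written as :// because Python does not need those escapes.

```python
import re

# Pattern from the answer above: the query/fragment/path part must contain
# at least one non-whitespace character after the leading ?, # or /.
URL_RE = re.compile(r"(?:https?://)?(?:\w+\.)+\w{2,}(?:[?#/]\S+)?")

# Hypothetical sample text, not taken from the question.
text = "What do you think of example.com? Try example.com?q=this instead."
matches = URL_RE.findall(text)
print(matches)  # ['example.com', 'example.com?q=this']

# Limitation: punctuation that trails a longer URL is still captured by \S+.
trailing = URL_RE.findall("See example.com/docs, please")
print(trailing)  # ['example.com/docs,']
```

So the change handles punctuation that directly follows the domain, but a comma or period after a path or query string would still need to be trimmed separately.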

Related

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italic parts (here date, time and name) are variable; they might contain spaces, or a new line might start at any of these points.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to match any letter, number, space or line break, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches a much bigger text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Adding "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more realistic example, with HTML tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex to the text above. Although this is barely possible in HTML, as the user would never write the matching tag himself, I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of the text starts with, no matter where it is in the entire text; I know what it ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part is also allowed to be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
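[\s\S] matches any character, because it is the union of two complementary sets; it behaves like . with the s option (dot matches newlines). Quantifiers are greedy by default, and even a lazy quantifier only controls where a match ends: the engine still reports the leftmost match, so an earlier occurrence of the start word moves the start of the match up.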
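The effect is easy to reproduce. The sketch below uses Python's re module rather than VB.NET, and a made-up sample text with an extra "On" above the header line; both are assumptions for illustration only.

```python
import re

# Hypothetical sample text: an extra "On" appears before the real header line.
text = "On Monday I replied.\nOn 1 May at 10:00 person Bob has written:"

lazy = re.compile(r"On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:")
m = lazy.search(text)

# The lazy quantifiers keep the end of the match as early as possible,
# but the match still begins at the first "On" in the text.
print(m.start())           # 0
print(m.group(0) == text)  # True
```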
In the "correct" link, the token right after the shortest match must be geschreven. Another, more flexible way to write this without lazy quantifiers is to put a negative lookahead in front of the repeated character set, inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group that simply wraps the negative lookahead and the . (which can be replaced by [\s\S]).
(?! ) inside it is the negative lookahead, which ensures that the current position is not the start of the end token before the next character is consumed.
Following the comments: you can state explicitly what must not appear in each repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
These variants show what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this;
in the second, neither From nor this can appear between From and this;
in the third, neither this nor point can appear between this and point;
etc.
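As a rough illustration of the difference, the sketch below compares the lazy pattern with the second variant above. It uses Python's re module (not VB.NET), case-sensitive matching, and a made-up sample text, all of which are assumptions.

```python
import re

# Hypothetical sample: a stray "From" precedes the block we actually want.
text = "From the start.\nFrom this\npoint on, everything: rest"

naive = re.compile(r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:")
tempered = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:"
)

naive_match = naive.search(text)
tempered_match = tempered.search(text)

# The lazy version starts at the stray "From"; the lookahead version cannot
# step over a second "From", so it starts at the one nearest to "this".
print(naive_match.start())             # 0
print(repr(tempered_match.group(0)))   # 'From this\npoint on, everything:'
```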

Regex for Google Analytics Goals

I've searched all the other Regex on Google Analytics questions but I can't use the answers as this is pretty specific to my problem.
I want to set a goal but use Regex to flag it as a goal IF string includes
/client-thank-you/ AND anything EXCEPT hire
so in other words
/client-thank-you/hire is not correct
/client-thank-you/anything/else is correct
Each of the following regexes will match any string that contains /client-thank-you/ and does not contain hire, depending on what assumption(s) you make about where "hire" is in the string.
Solution
Where can "hire" be located in the string?
Anywhere:
((?!hire).)*?/client-thank-you/((?!hire).)*
Only following the "/client-thank-you/":
.*?/client-thank-you/((?!hire).)*
Only immediately following the "/client-thank-you/":
.*?/client-thank-you/(?!hire).*
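The three variants differ only in where "hire" is forbidden. The sketch below checks them against a few made-up paths using Python's re module with whole-string matching; Google Analytics uses a different engine, so this only illustrates the logic.

```python
import re

anywhere = re.compile(r"((?!hire).)*?/client-thank-you/((?!hire).)*")
after = re.compile(r".*?/client-thank-you/((?!hire).)*")
immediately_after = re.compile(r".*?/client-thank-you/(?!hire).*")

def accepts(pattern, path):
    """True if the pattern matches the entire path."""
    return pattern.fullmatch(path) is not None

# Hypothetical paths; each value is (anywhere, after, immediately_after).
paths = [
    "/client-thank-you/anything/else",
    "/client-thank-you/hire",
    "/client-thank-you/we/hire",
    "/hire/client-thank-you/ok",
]
table = {
    p: (accepts(anywhere, p), accepts(after, p), accepts(immediately_after, p))
    for p in paths
}
for path, row in table.items():
    print(path, row)
# /client-thank-you/anything/else (True, True, True)
# /client-thank-you/hire (False, False, False)
# /client-thank-you/we/hire (False, False, True)
# /hire/client-thank-you/ok (False, True, True)
```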
Notes
Optimization:
Each of these regexes will match the entire string. If your tool lets you determine if a string contains a substring match (rather than naively attempting to match the entire string), then you could optimize the second and third regexes by removing the leading .*?. Likewise, the third regex could be further optimized by removing the trailing .* as well.
Positively require "anything":
Note that all of these regexes assume that a string that ends with "/client-thank-you/" (with nothing after it) is valid. If this assumption is incorrect (i.e. the string .*/client-thank-you/$ is not a match), then change the trailing * on every regex to +. This would also mean that you have to keep the last .* on the third regex as a .+ (i.e. don't optimize that away).
EDIT:
The above will not work since GA uses a very limited version of regex (that does not include lookaround). If there is no other GA tool (other than a single regex) that you can use that meets your needs, then you could use the following as a last-ditch effort:
([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]|.{1,3}$)
And in expanded form for illustration purposes only:
(
    [-._~!$&'()*+,;=:#/0-9A-Za-gi-z]
  | h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]
  | hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]
  | hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]
  | .{1,3}$
)
This regex will match 1-4 characters that do not form "hire". It does so by matching the minimum number of characters necessary to verify that the match is neither "hire" nor can serve as a prefix of "hire". It takes into account end-of-line (e.g. "hir" is valid if there is nothing else after it). The characters that it matches are all valid characters that can occur in the path component of a URL as specified in RFC 3986.
You use this regex by substituting it for every ((?!hire).) in any of the solutions given above. For example:
.*?/client-thank-you/([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]|.{1,3}$).*
This matches any url that contains "/client-thank-you/" but not "/client-thank-you/hire".
Do be careful, though. Doubled "h"s will make this workaround fail (e.g. "hhire"). However, if "hire" will only ever follow a path delimiter (i.e. /hire/), then that shouldn't be a problem.
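The workaround and its doubled-"h" caveat can be tried out as follows. This is a sketch in Python's re module with made-up paths; the names GROUP, PREFIX and accepts are only for illustration, and the "repeated" variant is my reading of substituting the group into the ((?!hire).)* solutions.

```python
import re

# Lookaround-free stand-in for ((?!hire).), copied from the answer above.
GROUP = (
    r"([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]"
    r"|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]"
    r"|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]"
    r"|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]"
    r"|.{1,3}$)"
)
PREFIX = r".*?/client-thank-you/"

immediately_after = re.compile(PREFIX + GROUP + r".*")  # the example above
repeated = re.compile(PREFIX + GROUP + r"*")            # group used in a loop
with_lookahead = re.compile(PREFIX + r"((?!hire).)*")   # lookahead original

def accepts(pattern, path):
    """True if the pattern matches the entire path."""
    return pattern.fullmatch(path) is not None

print(accepts(immediately_after, "/client-thank-you/anything/else"))  # True
print(accepts(immediately_after, "/client-thank-you/hire"))           # False
print(accepts(immediately_after, "/client-thank-you/hir"))            # True
print(accepts(repeated, "/client-thank-you/a/hire"))                  # False

# The caveat: "hh" is consumed as one unit, so the "hire" that starts at the
# second "h" slips through the loop, unlike with the lookahead version.
print(accepts(repeated, "/client-thank-you/hhire"))        # True
print(accepts(with_lookahead, "/client-thank-you/hhire"))  # False
```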
If you can't use a lookahead like Travis suggested, then I suggest setting the goal to fire on an event instead of a pageview.
If you're using Google Tag Manager, you'll have the ability to write a more advanced regex, or at least set a blocking rule for the event that prevents it from firing when 'hire' is in the page URL.

Mistaken Squid Proxy regex? → ^.*stackoverflow\.*

I have several proxy rule files for Squid, and all contain rules like:
acl blacklisted dstdom_regex ^.*facebook\.* ^.*youtube\.* ^.*games.yahoo.com\.*
The patterns match against the domain name: dstdom_regex means destination (server) regular expression pattern matching.
The objective is to block some websites, but I don't know by what method: domain name, keywords in the domain name, ...
Let's expand/describe the pattern:
^.*stackexchange\.* The whole pattern
^ String beginning
.* Match anything (greedy quantifier, I presume)
stackexchange Keyword to match
\.* Any number of dots (.)
Totally legitimate matches:
stackexchange.com: The Stack Exchange website.
stackoverflow.stackexchange: The imaginary Stack Exchange gTLD.
But these possible matches make it seem more like a keyword block:
stackexchange
stackexchanger
notstackexchange
not-stackexchange
some-website.stackexchange
some-website.stackexchange-tld
And the pattern seems to contain a bug, since it allows the following invalid cases to match, thanks to the \.* at the end, although they never naturally occur:
stackexchange.
stackexchange...
stackexchange..........
stackexchange.......com
stackexchange.com
stackexchangecom
you get the idea.
Anything containing stackexchange, even if separated by dots from everything else, is still a valid match.
So now, the question itself:
This all means that this is simply a match for stackexchange! (I'm assuming the original author didn't intend to match infinite dots.)
So why not just use the pattern stackexchange? Wouldn't it be faster and give the same results, except for the "bug" (\.*)?
I.e., isn't ^.*stackexchange equivalent to stackexchange?
Edit: Just to clarify, I didn't write those proxy rule files.
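The equivalence asked about above can be checked on a handful of host names. The sketch below assumes that Squid applies the pattern as an unanchored search, and uses Python's re.search as a stand-in for Squid's regex engine; the host list mixes names from the question with two made-up non-matches.

```python
import re

original = re.compile(r"^.*stackexchange\.*")
keyword = re.compile(r"stackexchange")

hosts = [
    "stackexchange.com",
    "stackoverflow.stackexchange",
    "stackexchanger",
    "notstackexchange",
    "some-website.stackexchange-tld",
    "stackexchange.......com",
    "example.com",     # hypothetical non-match
    "stack.exchange",  # hypothetical non-match
]

# Under search semantics both patterns accept exactly the same host names.
same = all(bool(original.search(h)) == bool(keyword.search(h)) for h in hosts)
matched = [h for h in hosts if keyword.search(h)]
print(same)          # True
print(len(matched))  # 6
```

The leading ^.* and the trailing \.* only change how much text the match covers, not whether a host name matches.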
I don't understand why you use \.* to match all the following dots.
However, to work around your problem you can try this:
^[^\.]*\.stackexchange\.*
[^\.]* matches anything except a dot
\. then you match the dot
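To see which host names this pattern accepts, here is a small check using Python's re.search as a stand-in for Squid's engine, with made-up host names. Both are assumptions for illustration.

```python
import re

suggested = re.compile(r"^[^\.]*\.stackexchange\.*")

# Hypothetical host names.
hosts = [
    "chat.stackexchange.com",
    "stackexchange.com",
    "notstackexchange.com",
    "a.b.stackexchange.com",
]
accepted = [h for h in hosts if suggested.search(h)]
print(accepted)  # ['chat.stackexchange.com']
```

In this sketch the pattern no longer matches keyword-style names such as notstackexchange.com, but it also requires exactly one label before .stackexchange, so the bare domain and deeper subdomains are not matched.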

Regex to "ignore" not "exclude"

I'm totally lost. I need a regular expression that
can detect any of the 4 starting urls like below
^(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)$
It should also detect:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
And, importantly, it should ignore, but NOT exclude, the following exact string (whether or not it is present in the page):
http://www.w3.org
This is complicated for me, because I still need to include it in the regex line
even if it's ignored; otherwise, it will match and be found in
(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)
And my aim is to find/match any url besides
http://www.w3.org
even if it's in the page, Or if it's not present.
so if there's only this in the page:
http://www.w3.org
& no other url.. then it shouldn't match.
Thanks Tyler, but my regex knowledge is almost zero; I only know what commands do when I right-click on them to choose actions, as in Regulazy or RegExr.
So i updated my command according to the url i provided to you:
href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom
& it works:
https?(://|%3A%2F%2F)(?!www.w3.org)(.*)
But because of my lack of knowledge, i don't understand how to do that below
"What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs"
I tried to add this, but it doesn't work:
(www.)
All i'm missing now is detection of urls starting with www
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
OK so try this:
/\bhttps?(://|%3A%2F%2F)(?!www\.w3\.org)(.*)\b/g
Test here: http://regexr.com?38jp5
That test link uses javascript-style regex, but should work elsewhere.
The important part is the second half - a negative lookahead, that checks what follows is not the exact text www.w3.org
I compressed what you had: mine matches http then an optional s then either :// or %3A%2F%2F.
I wrapped the whole thing in word boundaries, you could change that to quotes or whatever you need. The global flag lets you match multiple items.
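A short check of how the lookahead and the word boundaries behave, sketched in Python's re module instead of JavaScript (so the /.../g delimiters and flag are dropped). The variable names are only for illustration; the encoded string is the one from the question.

```python
import re

with_boundary = re.compile(r"\bhttps?(://|%3A%2F%2F)(?!www\.w3\.org)(.*)\b")
without_boundary = re.compile(r"https?(://|%3A%2F%2F)(?!www\.w3\.org)(.*)")

encoded = "href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom"

# The negative lookahead rejects the one URL that should be ignored.
w3_found = with_boundary.search("http://www.w3.org") is not None
other_found = with_boundary.search("http://example.com") is not None
print(w3_found, other_found)  # False True

# "%22http" has no word boundary between "2" and "h", so the \b version
# skips the encoded link; dropping \b lets it match.
encoded_with = with_boundary.search(encoded) is not None
encoded_without = without_boundary.search(encoded) is not None
print(encoded_with, encoded_without)  # False True
```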
In regards to OP's questions:
D%22
could appear before http or https
this one is missing & should match:
href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom
If this matters, just remove the word boundary \b before and after the regex, so the http can match anywhere.
The regex command should detect: (any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
This regex would fail to match a link like http://google.com - looking for www is really not a good way to check for a link on its own. What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs
Edit #2:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
As I mention above, what you are describing will not match a url like http://google.com - but if that is what you want, use this:
(\W|^)[wW]{3}\.\S+
Instead of that, what I think you want is this, which is a combination of my first answer, and the link to a different post above.
((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*
You'll want to use this regex with the Global and Insensitive flags
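Here is a small test of that combined pattern, sketched in Python's re module: finditer plays the role of the Global flag and re.IGNORECASE the Insensitive flag. The sample sentence is made up.

```python
import re

URL_RE = re.compile(
    r"((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))"
    r"(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*",
    re.IGNORECASE,
)

# Hypothetical sample text.
text = (
    "Visit www.example.com or http://google.com/search?q=1 "
    "but not http://www.w3.org/TR"
)

# group(0) is the whole URL; the numbered groups are only used for grouping.
found = [m.group(0) for m in URL_RE.finditer(text)]
print(found)  # ['www.example.com', 'http://google.com/search?q=1']
```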

Parsing text for URLs with trailing commas

I'm looking at a JSON feed from Twitter and trying to make URLs clickable using a regular expression.
The problem is that there are URLs in the text with trailing commas. A comma can legally be part of a URL, but in this case they're just punctuation inserted by the user.
Is there any way around this? Am I missing something?
You are not missing anything; there is no foolproof way of determining the "intended" URL if it is provided as, and surrounded by, plain text. Your best bet is to make an educated guess.
A common approach is to check if the punctuation mark(s) in question is followed by a whitespace or is the terminator of the string. If it is, do not interpret it as part of the URL; otherwise, include it.
Keep in mind this problem isn't limited to commas or a single character (consider the ellipsis, ...).
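One way to express that heuristic is a lazy match followed by a lookahead. This is a minimal sketch in Python, assuming URLs start with http:// or https:// and that the punctuation set is .,;:!? (both are assumptions, not part of the answer above); the tweet text is made up.

```python
import re

# Match as little as possible, and stop as soon as everything up to the next
# whitespace (or the end of the text) is just punctuation.
URL_RE = re.compile(r"https?://\S+?(?=[.,;:!?]*(?:\s|$))")

# Hypothetical tweet text.
tweet = "Read http://example.com/a,b, then http://example.org... Done."
urls = URL_RE.findall(tweet)
print(urls)  # ['http://example.com/a,b', 'http://example.org']
```

The comma inside the first URL is kept because it is followed by more URL characters, while the trailing comma and the ellipsis are dropped because only whitespace follows them.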
You could ignore the last character if it is punctuation (so that punctuation in the middle of a url doesn't affect it).
eg. Regex could be something like:
`([a-z/A-Z0-9.,]*?)([.,]?)\s`
Warning: the first part of the regex doesn't cover every character that can appear in a URL, so you still need to fix that. But essentially, we have ([a-z/A-Z0-9.,]*?), which matches the main part of the URL. The * allows many characters, but we add ? so that it isn't greedy.
Then we use ([.,]?) to match a possible trailing punctuation, and \s to match a space or whitespace.
The first subexpression is therefore the url, and you can turn it into a link.
If you have access to the internet, you could try accessing the resource to see if it returns a 404 to decide whether the trailing punctuation is part of the URL or actual punctuation.