Parsing text for URLs with trailing commas - regex

I'm looking at a JSON feed from Twitter and trying to make URLs clickable using a regular expression.
The problem is that there are URLs in the text with trailing commas. A comma can legally be part of a URL, but in this case they're just punctuation inserted by the user.
Is there any way around this? Am I missing something?

You are not missing anything; there is no foolproof way of determining the "intended" URL when it is embedded in plain text. Your best bet is to make an educated guess.
A common approach is to check whether the punctuation mark(s) in question are followed by whitespace or by the end of the string. If so, do not interpret them as part of the URL; otherwise, include them.
Keep in mind this problem isn't limited to commas or to a single character (consider the ellipsis, ...).
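A minimal Python sketch of that heuristic, assuming URLs start with http:// or https:// and that the characters .,;:!? count as sentence punctuation (both are assumptions for illustration, not part of the answer). Trailing punctuation is left out of the match only when whitespace or the end of the string follows it:

```python
import re

# Lazily consume non-whitespace, stopping at the first point where only
# punctuation stands between the URL and whitespace / end of string.
URL_RE = re.compile(r"https?://\S+?(?=[.,;:!?]*(?:\s|$))")

def find_urls(text):
    """Return URLs in text, without punctuation that merely ends a clause."""
    return URL_RE.findall(text)

print(find_urls("See http://example.com/a,b, and http://example.org..."))
```

The comma inside /a,b survives because a non-whitespace character follows it, while the comma and the ellipsis at the ends are dropped. It remains a guess: a URL that really ends in a comma would be truncated.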

You could ignore the last character if it is punctuation (so that punctuation in the middle of a URL doesn't affect it).
E.g., the regex could be something like:
`([a-z/A-Z0-9.,]*?)([.,]?)\s`
Warning: the first part of the regex doesn't cover every character that can appear in a URL, so you still need to fix that. But essentially, we have ([a-z/A-Z0-9.,]*?), which matches the main part of the URL. The * allows many characters, but we follow it with ? so that it isn't greedy.
Then we use ([.,]?) to match a possible trailing punctuation mark, and \s to match a space or other whitespace.
The first subexpression is therefore the URL, and you can turn it into a link.
If you have access to the internet, you could try accessing the resource to see if it returns a 404 to decide whether the trailing punctuation is part of the URL or actual punctuation.

Related

Custom email validation regex pattern not working properly

So I've got /.+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(.{1})\w{2,}/ pattern I want to use for email validation on client-side, which doesn't work as expected.
I know that my pattern is simple and doesn't cover every standard possibility, but it's part of my regex training.
Local part of address should be valid only when it has at least one digit [0-9] or letter [a-zA-Z] and can be mixed with comma or plus sign or underscore (or all at once) and then # sign, then domain part, but no IP address literals, only domain names with at least one letter or digit, followed by one dot and at least two letters or two digits.
In test string form it doesn't validate a#b.com and does validate baz_bar.test+private#e-mail-testing-service..com, which is wrong - it should be vice versa - validate a#b.com and not validate baz_bar.test+private#e-mail-testing-service..com
What specific error have I made, and where?
I can't locate it, sorry.
You need to change your regex
From: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(\.{1})\w{2,}
To: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]?\#[\w+-]+(\.{1})\w{2,}
Notice that I added a ? before the # sign and removed the ? from the first "group" after the # sign. Adding that ? tells the regex that the whole "group" is not mandatory.
See it working here: https://regex101.com/r/iX5zB5/2
You're requiring the local part (before #) to be at least two characters with the .+ followed by the character class [^...]. It's looking for any character followed by another character not in the list of exclusions you specify. That explains why "a#b.com" doesn't match.
The second problem is partly caused by the character class range +-? which includes the . character. I think you wanted [-\w+?]+. (Do you really want question marks?) And then later I think you wanted to look for a literal . character but it really ends up matching the first character that didn't match the previous block.
Between the regex provided and the explanatory text I'm not sure what rules you intend to implement though. And since this is an exercise it's probably better to just give hints anyway.
You will also want to use the ^ and $ anchors to make sure the entire string matches.
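To see the effect of the +-? range, here is a small Python check. The anchored pattern at the end is only one guess at the intended rules (letters, digits, underscore, plus and dot in the local part; a single dot before the final label), since the question leaves them open:

```python
import re

# Inside a character class, +-? is a range from '+' (0x2B) to '?' (0x3F),
# so it also covers ',', '-', '.', '/', digits, ':', ';', '<', '=', '>'.
print(bool(re.fullmatch(r"[+-?]", ".")))  # True: the dot is in the range

# Hypothetical anchored alternative; the exact rules are an assumption.
EMAIL_RE = re.compile(r"^[\w+.]+#[\w-]+\.\w{2,}$")

def is_valid(address):
    """Check an address against the assumed rules, whole string only."""
    return EMAIL_RE.match(address) is not None

print(is_valid("a#b.com"))
print(is_valid("baz_bar.test+private#e-mail-testing-service..com"))
```

With the anchors in place and the dot escaped, the double dot in the second address can no longer be absorbed by the character class.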

Regex captures all occurrences but the last of certain characters

I want to exclude common punctuation from my URL Regex detector when my clients type a sentence with a URL in it. A common scenario would be the URL example.com?q=this (which obviously needs to include the ?) versus a sentence saying
What do you think of example.com?
This expression suits my needs just fine:
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?
However it includes all punctuation at the end, so I am iterating through each match to find and use this captured group to exclude said punctuation:
(.*?)[?,!.;:]+$
However, I'm not sure how to leverage the "end of string" technique when scanning the entire block of text which may have multiple URLs. Was hoping there'd be a way to capture the right blocks from the get-go without the extra work.
Just require non-whitespace after the punctuation instead of making it optional.
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?
You will of course lose some valid URL endings, such as a trailing slash (example.com/ will become example.com), but as far as I know there is no difference between the two.
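A quick way to try the adjusted pattern is a short Python script. The sample sentence and the find_urls helper are made up for illustration:

```python
import re

# Pattern from the answer: the optional tail needs at least one
# non-whitespace character after ?, # or /.
URL_RE = re.compile(r"(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?")

def find_urls(text):
    """Return every URL-like match in text."""
    return URL_RE.findall(text)

text = "What do you think of example.com? Try example.com?q=this today."
print(find_urls(text))
```

Note that punctuation placed directly after a query string (for example example.com?q=this, with a comma) is still captured, because \S+ consumes every non-whitespace character.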

Regex match anything that is not sub-pattern

I have cookies in my HTTP header like so:
Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; ZDEDebuggerPresent=php,phtml,php3
and I need to extract the 26 character string that comes after frontend (e.g. ovsu0p8khivgvp29samlago1q0). The following regular expression matches that for me:
(?<=frontend=)(.*)(?=;)
However, I am using Varnish Cache and can only use a regex replace. Therefore, to extract that cookie value (26 character frontend string) I need to match all characters that do not match that pattern (so I can replace them with '').
I've done a fair bit of Googling but so far have drawn a blank. I've tried the following
Match characters that do not match the pattern I want: [^((?<=frontend=)[A-Za-z0-9]{26}(?=;))] which matches random characters, including the ones I want to preserve
I'd be grateful if someone could point me in the right direction, or note where I might have gone wrong.
The Set-Cookie response header is a bit magical in Varnish, since backends tend to send multiple headers with the same name. This is prohibited by the RFC, but it is the de facto way to do it.
If you are using Varnish 3.0 you can use the Header VMOD, it can parse the response and extract what you need:
https://github.com/varnish/libvmod-header
Use regex pattern
^Set-Cookie:.*?\bfrontend=([^;]*)
and the "26 character string that comes after frontend" will be in group 1 (usually referred to in the replacement string as $1)
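As a rough illustration in Python (not Varnish configuration), the capture-group pattern can be turned into a replace-only extraction by also consuming the rest of the header. The trailing .*$ is an addition to the pattern above, and frontend_value is a hypothetical helper name:

```python
import re

header = ("Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; "
          "adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; "
          "ZDEDebuggerPresent=php,phtml,php3")

def frontend_value(header):
    """Replace the whole header with group 1 (the frontend cookie value)."""
    return re.sub(r"^Set-Cookie:.*?\bfrontend=([^;]*).*$", r"\1", header)

print(frontend_value(header))
```

Because the pattern matches the entire header, a single replacement leaves only the captured value, with no need to match "everything that is not the value".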
Do you have control over the replacement string? If so, you can go with Ωmega's answer, and use $1 in your replacement string to write the frontend value back.
Otherwise, you could use this:
^Set-Cookie:.*(?!frontend=)|(?<=frontend=.{26}).*$
This will match everything from the start of the string, until frontend= is encountered. Or it will match everything that has frontend= followed by exactly 26 characters to the left of it, up until the end of the string. If those 26 characters were of variable length, it would get significantly more complicated, because only .NET supports variable-length lookbehinds.
For your last question. Let's have a look at your regex:
[^((?<=frontend=)[A-Za-z0-9]{26}(?=;))]
Well, firstly, the negated character class [^...] you tried to surround your pattern with doesn't really work like this. It is still a character class, so it matches only a single character that is not inside that class. But it gets even more complicated (and I wonder why it matches at all). The character class should be closed by the first ]. This character class matches anything that is not (, ?, <, =, ), a letter or a digit. Then the {26} is applied to that, so we are trying to find 26 of those characters. Then comes the (?=;), which asserts that those 26 characters are followed by ;. Now comes what should not work. The closing ) should actually throw an error. And the final ] would just be interpreted as a literal ].
There are some regex flavors which allow for nesting of character classes (Java does). In this case, you would simply have a character class equivalent to [^a-zA-Z0-9(){}?<=;]. But as far as I could google it, Varnish uses PCRE, and in PCRE your regex should simply not compile.

Parse with Regex without trailing characters

How can I parse the text below so that I extract just
To: User <test#test.com>
and
To: <test#test.com>
When I try to parse the text below with
/To:.*<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>/mi
It grabs
Message-ID <CC2E81A5.6B9%test#test.com>,
which I don't want in my answer.
I have tried using $ and \z and neither works. What am I doing wrong?
Information to parse
To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>
To:
<test#test.com>
This is my parsing information in Rubular http://rubular.com/r/DQMQC4TQLV
Since you haven't specified exactly what your tool/language is, assumptions must be made.
In general regex pattern matching tends to be aggressive, matching the longest possible pattern. Your pattern starts off with .*, which means that you're going to match the longest possible string that ENDS WITH the remainder of your pattern <[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>, which was matched with <CC2E81A5.6B9%test#test.com> from the Message-ID.
Both Apalala's and nhahtdh's comments give you something to try. Avoid the all-inclusive .* at the start and use something that's a bit more specific: match leading spaces, or match anything EXCEPT the first part of what you're really interested in.
You need to make the wildcard match non greedy by adding a question mark after it:
To:.*?<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>
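The difference can be sketched in Python. The input line below is hypothetical (two well-formed addresses on one line), and the third pattern follows the "match anything EXCEPT" suggestion by using a negated class, which also lets it cross the line break in the second sample:

```python
import re

TAIL = r"<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>"
greedy = re.compile(r"To:.*" + TAIL, re.I)
lazy = re.compile(r"To:.*?" + TAIL, re.I)
# Negated class instead of the dot; unlike the dot, it matches newlines.
negated = re.compile(r"To:[^<]*" + TAIL, re.I)

# Hypothetical input, not taken from the question.
line = "To: User <alice#example.com> Cc: Bob <bob#example.org>"
print(greedy.search(line).group())  # runs through to the last address
print(lazy.search(line).group())    # stops at the first address
print(repr(negated.search("To:\n<test#test.com>").group()))
```

The lazy quantifier fixes the over-matching on a single line, while the negated class additionally covers the case where the address sits on the line after To:.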

Need assistance regex matching a single quote, but do not include the quote in the result

I'm trying to find out a way to match the following test string:
token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';
The data I want returned is...
1866FB352F4DF76BCB92C3482DB7D7B4F562
I've tried the following. The closest I have is this, but it's including the single quote at the end:
(?!token = ')(\w+)';
Now, another one, which works closely, but it's including the last single quote:
'([^']+)'
Anyone want to take a stab at this?
Update: After looking at what I need to parse, I found the same value in the html, in the form area, which looks like it might be easier to grab:
name="token" value="482CD1FE037F68D5A36F4C961A6D57D9"
Again, I just need the contents within value="*"
However, the regex will have to parse the entire html source, so I assume I will need to search for name="token" value= but not include that in the result set.
If your regex engine supports lookaround, you can use
(?<=')\w+(?=')
This matches an alphanumeric word if it's surrounded by single quotes, without making those quotes a part of the actual match. If you only want to match hexadecimal numbers, use
(?i)(?<=')[0-9A-F]+(?=')
EDIT:
Since you have now added that you're using JMeter, and because JMeter doesn't support lookbehind assertions for reasons incomprehensible to me (because Java itself does support it just fine), you can possibly cheat like this:
\b[0-9A-F]+(?=')
only checks whether an entire hex number occurs right before a ' character. It does not check for the presence of an opening quote, but chances are that this won't matter.
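Both variants can be tried in Python, whose re module supports fixed-width lookbehind. This is only a sketch against the test string from the question, not JMeter configuration:

```python
import re

line = "token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';"

# Lookbehind and lookahead assert the quotes without consuming them.
with_lookaround = re.search(r"(?<=')[0-9A-F]+(?=')", line, re.I)
# Fallback without lookbehind: only the closing quote is checked.
without_lookbehind = re.search(r"\b[0-9A-F]+(?=')", line)

print(with_lookaround.group())
print(without_lookbehind.group())
```

In both cases the match itself contains only the hex digits, so no capture group is needed to strip the quotes.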