Regex for matching any URL character - regex

I came across a specification that described a field as:
Any URL char
I wanted to validate it on my side with a regex.
I searched a bit and found this great SO question, which contains every piece of information I needed. Still, it seemed a pity that no question asks precisely for the regex, so here I am.
What would be a proper regex matching any URL character?
Edit
I extracted the following regex from my understanding of the specification:
[\w\-.~:/?#\[\]@!$&'()*+,;=%]
So, is this regex right and exhaustive, or did I miss anything?
After reading the specification, I guess it is simply "all ASCII characters".

See the Characters section:
A URI is composed from a limited set of characters consisting of
digits, letters, and a few graphic symbols. A reserved subset of
those characters may be used to delimit syntax components within a
URI while the remaining characters, including both the unreserved set
and those reserved characters not acting as delimiters, define each
component's identifying data.
Although the text indicates that only digits, letters, and some symbols are supported, Appendix B, "Parsing a URI Reference with a Regular Expression", suggests a regex for parsing a URI that may actually match pretty much any character:
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
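As a rough illustration of the quoted expression, here is a minimal Python sketch that applies it and names the numbered groups. The function name split_uri is made up for this example; the mapping of groups 2, 4, 5, 7 and 9 to scheme, authority, path, query and fragment is the one given in RFC 3986, Appendix B.

```python
import re

# The Appendix B expression quoted above, copied verbatim.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI reference into its five components (hypothetical helper).

    Every part of the expression is optional, so match() always succeeds;
    a component that is absent from the input comes back as None.
    """
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Note that the expression only splits a reference into parts. Because every part is optional and the path group accepts nearly anything, it does not validate which characters are allowed.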
The [\w.~:/?#\[\]@!$&'()*+,;=%-] pattern you put together is too restrictive unless \w is Unicode-aware (a URI may contain any Unicode letter); if it is, the pattern might work more or less for you.
If you plan to match just ASCII URLs, use ^[\x00-\x7F]+$ (one or more ASCII characters) or ^[!-~]+$ (only visible ASCII characters).
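To compare the two approaches, here is a small Python sketch. The helper names are invented for this example, and the strict character set is an assumption based on the unreserved and reserved characters of RFC 3986 plus "%" for percent-encoding, with ASCII letters and digits spelled out instead of \w.

```python
import re

# Assumed strict set: unreserved + reserved characters of RFC 3986, plus "%".
URL_CHARS_RE = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

# The looser alternative from the answer: any visible ASCII character.
VISIBLE_ASCII_RE = re.compile(r"[!-~]+")

def only_url_chars(s):
    """True if s is non-empty and uses only the assumed URL character set."""
    return URL_CHARS_RE.fullmatch(s) is not None

def only_visible_ascii(s):
    """True if s is non-empty and uses only visible ASCII (0x21 to 0x7E)."""
    return VISIBLE_ASCII_RE.fullmatch(s) is not None

print(only_url_chars("https://example.com/a?b=c#d"))  # strict check
print(only_url_chars("a<b"), only_visible_ascii("a<b"))  # "<" is visible ASCII only
```

The strict set rejects characters such as "<", the space and non-ASCII letters, while the visible-ASCII check accepts "<" but still rejects the space.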

Related

Regex to validate Unique Transaction Identifier

I'm trying to write a regex pattern to validate Unique Transaction Identifiers (UTI). See description: here
The UTI consists of two concatenated parts, the prefix and the transaction identifier. Here is a summary of the rules I'm trying to take into account:
The prefix is exactly 10 alphanumeric characters.
The transaction identifier is 1-32 characters long.
The transaction identifier is alphanumeric, however the following special characters are also allowed: . : _ -
The special characters cannot appear at the beginning or end of the transaction identifier.
It is not allowed to have two special characters in a row.
So far I have constructed a pattern that validates the UTI against the first 4 of these points (matched case-insensitively):
^[A-Z0-9]{11}((\w|[:\.-]){0,30}[A-Z0-9])?$
However I'm struggling with the last point (no two special characters in a row). I readily admit to being a bit of a novice when it comes to regex and I was thinking there might be some more advanced technique that I'm not familiar with to accomplish this. Any regex gurus out there care to enlighten me?
Solved: Thanks to user Bohemian for helping me find the pattern I was looking for. My final solution looks like this:
^[a-zA-Z0-9]{11}((?!.*[.:_-]{2})[a-zA-Z0-9.:_-]{0,30}[a-zA-Z0-9])?$
I will leave the question open for follow-up answers in case anyone has any further suggestions for improvements.
Try this:
^[A-Z0-9]{11}(?!.*[.:_-]{2})[A-Z0-9.:_-]{0,30}[A-Z0-9]$
The secret sauce is the negative lookahead (?!.*[.:_-]{2}), which asserts (without consuming input) that the following text does not contain two consecutive "special" characters from .:_-.
Note that your attempt, which uses \w, allows lowercase letters and underscores too, because
\w is the same as [a-zA-Z0-9_]
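A quick way to check the lookahead approach against the five rules is a small Python sketch like the one below. It uses the asker's final pattern from the "Solved" note; the function name is_valid_uti and the sample identifiers are made up for illustration.

```python
import re

UTI_RE = re.compile(
    r"^[a-zA-Z0-9]{11}"        # 10-character prefix + first character of the identifier
    r"((?!.*[.:_-]{2})"        # the rest must not contain two special characters in a row
    r"[a-zA-Z0-9.:_-]{0,30}"   # up to 30 further characters, specials allowed
    r"[a-zA-Z0-9])?$"          # last character must be alphanumeric
)

def is_valid_uti(candidate):
    """True if candidate satisfies the UTI rules listed in the question."""
    return UTI_RE.match(candidate) is not None

# Hypothetical sample values
for sample in ["ABCDEFGHIJ1", "ABCDEFGHIJ1.2", "ABCDEFGHIJ1..2", "ABCDEFGHIJ1."]:
    print(sample, is_valid_uti(sample))
```

The optional group is what allows a transaction identifier of a single character: the first 11 characters cover the prefix plus that one character, and everything after it may be absent.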

PHP and XSD Regular expression for currency and location names

I am trying to find a suitable regular expression that allows letters, digits, spaces and other characters that can be used in names, such as . ' -_. I need to implement the expression as a PHP preg_match and as an XSD pattern. Currently I have the following for PHP:
/^[a-zA-Z0-9 '-.]
Which allows the characters I want (unless there are any other special characters you could kindly recommend I use). The issue with this is that it allows special characters to be used one after the other, allowing values such as .-- . I need it so that this can't happen, only allowing a special character if a letter or digit comes before it.
I would also like the equivalent for an XSD pattern but everything I have tried so far has been inadequate. I am currently using
[\w\d '-.']*[\w\d][\w\d '-.]*
In addition, the length must be between 3-50 (which works in all cases currently).
Any guidance would be fantastic as I have searched high and low for an answer.
Valid names could be:
Netherlands Antillean guilde
Timor-Leste
Cote d'Ivoire
Including letters with accents.
Invalid names could be:
Tes''t
Test (space before)
_test (special character before)
-'.
...
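One possible way to express the "special character only after a letter or digit" rule, sketched in Python rather than PHP or XSD, is shown below. The pattern, the function name is_valid_name and the choice to treat the space as a separator are assumptions made for this example, not a tested PHP or XSD solution. The length limit is checked outside the regex to keep the pattern simple.

```python
import re

# A letter or digit ([^\W_] is \w without the underscore), optionally followed
# by ONE separator, repeated. A separator can therefore never come first and
# never follow another separator. In Python 3, \w also covers accented letters.
NAME_RE = re.compile(r"(?:[^\W_][ '._-]?)+")

def is_valid_name(name):
    """True if name is 3 to 50 characters long and follows the separator rule."""
    return 3 <= len(name) <= 50 and NAME_RE.fullmatch(name) is not None

for sample in ["Timor-Leste", "Cote d'Ivoire", "Tes''t", " Test", "_test", "..."]:
    print(repr(sample), is_valid_name(sample))
```

This sketch allows a trailing separator (for example a name ending in a period), since the question only forbids a separator that is not preceded by a letter or digit.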

Wildcard in Word 2013 to match zero or more whitespaces

What is the analog of regular expression's * modifier in Word 2013 wildcards?
In Word 2013 Find tool with wildcards enabled, apparently 0 is not a valid number as the number of matches. For example, if you type in the search box
fe{1,2}d
it will match fed and feed. However,
fe{0,2}d
will just produce an error message. What is the correct expression to match fd, fed, feed, feeed, etc.?
My motivation is to match a specific text when it stands alone in a paragraph (i.e., surrounded by paragraph marks ^13), possibly followed by whitespace:
^13hello world {0,}^13
which just produces an error message. I did not find any solution without enabling wildcards, but even with wildcards enabled I can't get it working.
Similarly,
^13hello world @^13
matches one or more spaces, but I need zero or more.
I don't believe Word has ever had an equivalent for the zero-or-more operator, so while I haven't checked in Word 2013, I wouldn't expect to see it there either. (This page is old, but as far as I know it's still pretty authoritative on wildcard searching in Word: http://word.mvps.org/faqs/general/usingwildcards.htm)
In general, I would suggest doing two searches, one without the character and one using the 1-or-more operator.
ETA: Removed bad wildcard search.

Regex match anything that is not sub-pattern

I have cookies in my HTTP header like so:
Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; ZDEDebuggerPresent=php,phtml,php3
and I need to extract the 26 character string that comes after frontend (e.g. ovsu0p8khivgvp29samlago1q0). The following regular expression matches that for me:
(?<=frontend=)(.*)(?=;)
However, I am using Varnish Cache and can only use a regex replace. Therefore, to extract that cookie value (26 character frontend string) I need to match all characters that do not match that pattern (so I can replace them with '').
I've done a fair bit of Googling but so far have drawn a blank. I've tried the following
Match characters that do not match the pattern I want: [^((?<=frontend=)[A-Za-z0-9]{26}(?=;))] which matches random characters, including the ones I want to preserve
I'd be grateful if someone could point me in the right direction, or note where I might have gone wrong.
The Set-Cookie response header is a bit magical in Varnish, since backends tend to send multiple headers with the same name. This is prohibited by the RFC, but it is the de facto way to do it.
If you are using Varnish 3.0 you can use the Header VMOD, it can parse the response and extract what you need:
https://github.com/varnish/libvmod-header
Use regex pattern
^Set-Cookie:.*?\bfrontend=([^;]*)
and the "26 character string that comes after frontend" will be in group 1 (usually referred to in the replacement string as $1)
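As an illustration of the capture-group idea, here is a minimal Python sketch. The helper names are made up, and the second function assumes a replace-only setting like the one described in the question: the pattern is extended with .*$ so that it consumes the whole header, and the replacement writes back only group 1.

```python
import re

HEADER = ("Set-Cookie: frontend=ovsu0p8khivgvp29samlago1q0; "
          "adminhtml=6df3s767g199d7mmk49dgni4t7; external_no_cache=1; "
          "ZDEDebuggerPresent=php,phtml,php3")

# The pattern from the answer: group 1 captures everything up to the next ";".
FRONTEND_RE = re.compile(r"^Set-Cookie:.*?\bfrontend=([^;]*)")

def extract_frontend(header):
    """Return the frontend cookie value, or None if it is absent."""
    m = FRONTEND_RE.search(header)
    return m.group(1) if m else None

def extract_by_replace(header):
    """Replace-only variant: match the whole line and keep just group 1."""
    return re.sub(r"^Set-Cookie:.*?\bfrontend=([^;]*).*$", r"\1", header)

print(extract_frontend(HEADER))
print(extract_by_replace(HEADER))
```

If the header contains no frontend cookie, the replace-only variant leaves the input unchanged, so a caller may want to test for a match first.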
Do you have control over the replacement string? If so, you can go with Ωmega's answer, and use $1 in your replacement string to write the frontend value back.
Otherwise, you could use this:
^Set-Cookie:.*?(?=frontend=)|(?<=frontend=.{26}).*$
This will match everything from the start of the string until frontend= is encountered. Or it will match everything that has frontend= exactly 26 characters to the left of it, up until the end of the string. If the value had a variable length, it would get significantly more complicated, because only .NET supports variable-length lookbehinds.
For your last question. Let's have a look at your regex:
[^((?<=frontend=)[A-Za-z0-9]{26}(?=;))]
Well, firstly, the negated character class [^...] you tried to surround your pattern with doesn't really work like this. It is still a character class, so it matches only a single character that is not inside that class. But it gets even more complicated (and I wonder why it matches at all). The character class is closed by the first ]. This character class matches anything that is not (, ?, <, =, ), a letter or a digit. Then the {26} is applied to that, so we are trying to find 26 of those characters. Then comes the (?=;), which asserts that those 26 characters are followed by ;. Now comes what should not work: the closing ) should actually throw an error. And the final ] would just be interpreted as a literal ].
There are some regex flavors which allow for nesting of character classes (Java does). In this case, you would simply have a character class equivalent to [^a-zA-Z0-9(){}?<=;]. But as far as I could google it, Varnish uses PCRE, and in PCRE your regex should simply not compile.

Regex match, quite simple:

I'm looking to match Twitter syntax with a regex.
How can I match anything that is "@______", that is, begins with an @ symbol and is followed by no spaces, just letters and numbers until the end of the word? (For tweeters: I want to match someone's name in a reply.)
Go for
/@(\w+)/
to get the matching name extracted as well.
@\w+
That simple?
It should be noted that Twitter no longer allows usernames longer than 15 characters, so you can also match with:
@\w{1,15}
There are still apparently a few people with usernames longer than 15 characters, but testing on 15 would be better if you want to exclude likely false positives.
There are apparently no rules regarding whether underscores can be used at the beginning or end of usernames, multiple underscores, etc., and there are accounts with single-letter names, as well as someone with the username "_".
@[\d\w]+
\d for a digit character
\w for a word character
[] to denote a character class
+ to represent one or more instances of the character class
Note that these specifiers for word and digit characters are language dependent. Check the language specification to be sure.
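To make the language dependence concrete, here is a small Python sketch that pulls mentions out of a tweet, assuming the mention prefix is @ as on Twitter. The function name find_mentions is made up. The re.ASCII flag, the (?<!\w) lookbehind (to skip e-mail addresses) and the trailing \b (to reject names longer than 15 characters instead of truncating them) are additions for this example, not part of the answers above.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_]; without it, Python 3 also matches
# non-ASCII letters and digits.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})\b", re.ASCII)

def find_mentions(text):
    """Return the usernames mentioned in text, without the leading @."""
    return MENTION_RE.findall(text)

print(find_mentions("@alice thanks, cc @bob_99!"))
print(find_mentions("mail me at a@b.com"))
```

Because the pattern has exactly one capture group, findall returns the captured names directly rather than the full matches.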
There is a very extensive API for how to get valid twitter names, mentions, etc. The Java version of the API provided by Twitter can be found on github twitter-text-java. You may want to take a look at it to see if this is something you can use.
I have used it to validate Twitter names and it works very well.