regex for links - help to understand it - regex

how do you read this regex?
#(http|https|ftp)://([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+):?(d+)?/?#i
this is a regex for links, but i'm having trouble to understand it
Thanks

Depending on what language you're in, regexes need a delimiter. Seems the # (pound sign or hash) is used here. So,
#...actual regex goes here...#
In javascript you need forward slashes (/..../).
Some regex engines allow you to pass flags that influence matching process. These appear after the closing delimiter:
#...actual regex goes here...#..flags go here..
In your example, there is one flag, the i and I am guessing that means: "case insensitive" (i for insensitive). Depending on the regex engine you can have flags that influence the syntax you can use for the actual regex (for example, the dot can match either any character or any character except newlines depending upon wheter a flag was passed), flags that influence how the matching is done (for example, in javascript a g indicates the global flag, and that means matching anywhere inside the string is done, and state is preserved), flags that determine whether whitespace is allowed as indentation inside the regex. And some have a m flag indicating whether the regex will be applied on a line by line basis, or on the entire text. There is AFAIK no standard set of flags, check your regex engine documentation.
If you have multiple flags, you just concatenate them together to a string of flags and put them after the closing delimiter.
Now for the actual regex. First, you start with a parenthesized expression:
(...group...)
This is also called a group. In many regex engines, these groups have special meaning, because when a match is found you can access the bits of text that matched the expression inside the group using a special variable (or sometimes, the match is returned as an array, where each element represents a group). If you can access the bits inside groups, it is called a "capturing group".
In this particular case the group uses "alternation" or "choice" and this is indicated by the | (pipe). The pipe is part of the regex syntax and means "or". So,
(http|https|ftp)
means: match "http", or if that doesn't match, "https", of if that doesn't match, "ftp". This also brings up another reason for using parenthesis: of all special regex syntax operators, the pipe has the lowest precedence, so the parenthesis would not have been there, it would have meant: match "http" or "https" or "ftp://...etc"
So far, we've seen these "special characters": | (pipe) and ( and ). After that we get
://
These are not special characters, and any non-special characters simply match themselves.
We then get another group, which makes up almost the rest of the regex:
([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+)
Inside it, we see a bracketed expression:
[A-Z0-9]
The brackets [ and ] are special, and indicate a "character class". There are other ways to denote character classes, but in all cases a character class matches a single character. Which character depends on the nature of the class. In this case, the class is defined using two ranges:
A-Z
means characters A thru Z (and anything in between) and
0-9
means characters 0 thru 9 (and anything in between).
Basically, [A-Z0-9] matches any alpha-numeric character.
Note that the dash between the boundaries of the range is only a special character inside these bracketed expressions. Paradoxically, a dash inside the brackets can also simply mean a dash if it cannot be interpreted as a range.
This is folllowed by yet another character class:
[A-Z0-9_-]
Almost the same as the previous on, it just adds the underscore and the dash. This last dash cannot be interpreted as a range separator, so it simply means a dash. This character class will match any alpha-numeric character as well as underscore and dash.
This class is followed by a * (asterisk) and this is a special character indicating a cardinality. Cardinalities specify how often the immediately preceding element may occur. These are the common cardinalities:
* (asterisk) means zero or more times.
? (question mask) means zero or once.
+ (plus) means one or more times.
Now the entire bit starts to make sense:
[A-Z0-9][A-Z0-9_-]*
means: a sequence starting with one alphanumeric caracter, optionally followed by a string of "word" characters (that is, alphanumeric, dash and underscore).
The following bit of the regex is this:
(?:.[A-Z0-9][A-Z0-9_-]*)+
I think this is trying to match the domain parts. So that if you have say:
https://mail.google.com
The .google and .com bits would be matched by this part. The initial (?: bit is meant to tell the regex engine to not create a "backreference". This is not really my stronghold, maybe someone else can explain. But the rest of that group is quite clear and resembles what we saw before. I think there is a mistake though: the dot (.) that appears immediately before the bracketed character class usually means "match any character" or "match any non-newline character", not "match a literal dot". Typically if you want a literal dot, you need to escape it. This would be the syntax in javascript and I think perl:
(\.[A-Z0-9][A-Z0-9_-]*)+
(note the backslash immediately before the dot to indicate a literal dot)
The final bits of the regex seem an attempt to match a port number:
:?(d+)?
However, the d+ bit is probably wrong: right now it matches "one or more d's". It should probably be:
:?(\d+)?
meaning: optionally match a colon (:), optionally followed by a bunch of digits. The \d is also a character class, but a predefined one. I think most regex engines use \d to denote a digit, but you should check the documentation of your engine to see the exact convention. So in say:
http://domain.server.extension:8080/
this part of the regex would match :8080 (provided you fix the d+ thing).
Finally, we see
/?
Meaning the entire thing can be followed optionally by a forward slash.
So, all in all, I don;'t think this matches a "link", rather it matches the inital part of a URL. To match an entire url, you would need a bit more, at least I don't see any expression that could match the path, resource, hash and query bits that may occur in a proper URL.

When you say you have trouble understanding it, it means you tried something and are stuck somewhere?
Please ask more specific questions.
I can give you some keywords that you can lookup them more easy, a good place for that is regular-expressions.info
(http|https|ftp) is an alternation
[A-Z0-9] is a character class
*, + and ? are quantifiers
(...) is a (capturing) group, (?:...) is a non capturing group
The # at the start and end are regex delimiters, the i at the very end is a modifier/option (match case independent).
The (d+)? at the end would match one or more (optional) letters "d". This is quite strange. I assume it should be (\d+)? that would be one or more (optional) digits.

Related

How to find a sequence of formatted digits in Apache Nifi using a regular expression?

I want to find using Apache Nifi this kind of text in a CSV with lots of text:
nnnn?nn
where n is a digit between 0 and 9, and ? is a literal question mark.
A real example is:
8764?23
It always has 4 digits before ? and 2 digits after.
How can this be done?
Starting off simple:
\d{4}\?\d{2}
But this would also match 8764?23 within a longer string such as 98764?23 or 8764?234.
If you need to find exact matches as individual values within the CSV, a more complex regular expression is needed:
(?:^|,)\s*(\d{4}\?\d{2})\s*(?:,|$)
This may look a bit strange at first sight so let's break it down:
(?:^|,) uses the (something|something else) syntax to allow a choice of two different things - here it is allowing either the very start of the string ^ or a comma ,. The ?: at the start excludes this expression from being included as a capturing group.
\s* allows any amount of whitespace (i.e. zero or more spaces, tabs etc.) to appear before the matched expression.
(\d{4}\?\d{2}) specifies exactly 4 digits \d{4} followed by a question mark \? (which needs to be escaped to distinguish it from the regex ? meaning 0 or 1 occurrences), followed by 2 more digits \d{2}. The surrounding brackets () are used to specify this as a capturing group.
\s* allows more whitespace after the matched expression.
(?:,|$) allows either a comma , or the end of the string $ and ?: excludes this from being a capturing group.
Demo
https://regex101.com/r/X0Ic4v/1
Usage
The above can be used with Nifi's ExtractText to get the first capturing group for each match. Since it is only the capturing group that is of interest and not the rest of the match, "Include Capture Group 0" can be set to false. Presumably both "Enable Multiline Mode" and "Enable repeating capture group" should be set to true.
Further considerations
The above assumes that 8764?23 appears exactly like that as a value in a CSV string. But maybe you need to allow "8764?23"? Or possibly others such as '8764?23', _8764?23_ or even ABC8764?23DEF? There are too many possible variants for here a one size fits all so please reply in the comments to state the requirements if the above doesn't fit your needs.
Here is your regular expression: \d\d\d\d\?\d\d and tool where you can use it (and here more complicated version)
This is the Regex required for your needs.
(\d{4}\?\d\d)

RegExp space character

I have this regular expression: ^[a-zA-Z]\s{3,16}$
What I want is to match any name with any spaces, for example, John Smith and that contains 3 to 16 characters long..
What am I doing wrong?
Background
There are a couple of things to note here. First, a quantifier (in this case, {3,16}) only applies to the last regex token. So what your current regex really is saying is to "Match any string that has a single alphabetical character (case-insensitive) followed by 3 to 16 whitespace characters (e.g. spaces, tabs, etc.)."
Second, a name can have more than 2 parts (a middle name, certain ethnic names like "De La Cruz") or include special characters such as accented vowels. You should consider if this is something you need to account for in your program. These things are important and should be considered for any real application.
Assumptions and Answer
Now, let's just assume you only want a certain format for names that consists of a first name, a last name, and a space. Let's also assume you only want simple ASCII characters (i.e. no special characters or accented characters). Furthermore, both the first and last names should start with a capital character followed by only lower-case characters. Other than that, there are no restrictions on the length of the individual parts of the name. In this case, the following regex would do the trick:
^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$
Notes
The first token after the ^ character is what is called a positive lookahead. Basically a positive look ahead will match the regex between the opening (?= and closing ) without actually moving the position of the cursor that is matching the string.
Notice I removed the \s token, since you usually want only a (space). The space can be replaced with the \s token, if tabs and other whitespace is desired there.
I also added a restriction that a name must start with a capital letter followed by only lower-case letters.
Crude English Translation
To help your understanding, here is a simple English translation of what the regex is really doing. The part in italics is just copied from the first part of the English translation of the regex.
"Match any string that has 3-16 characters and starts with a capital alphabetical character followed by one or more (+) alphabetical characters followed by a single space followed by a capital alphabetical character followed by one or more (+) alphabetical characters and ends with any lowercase letter."
Tools
There are a couple of tools I like to use when I am trying to tackle a challenging regex. They are listed below in no particular order:
https://regex101.com/ - Allows you to test regex expressions in real time. It also has a nifty little library to help you along.
http://www.regular-expressions.info/ - Basically a repository of knowledge on regex.
Edit/Update
You mentioned in your comments that you are using your regex in JavaScript. JavaScript uses a forward slash surrounding the regex to determine what is a regex. For this simple case, there are 2 options for using a regex to match a string.
First, use String's match method as follows
"John Smith".match(/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/);
Second, create a regex and use its test() method. For example,
/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/.test("John Smith");
The latter is probably what you want as it simply returns true or false depending on whether the regex actually matches the string or not.

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=**Not**NeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)|(?<=**Not**NeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we capture it and dont use it any further. We could capture it with parenthesis and use it as another group, but that's out of scope here
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often times this is a file extension (i.e .c, .ts, or .json), or a top level domain (i.e. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point it is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. Its a variance because the amount of white-space that needs to be match against varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often times when working with regular expressions its necessary to work in reverse.
We will start at the end of the problem described above, and work backwards, henceforth; we are going to start at the The Before Variance (or #3)
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespce with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAS Regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine to not be concerned with the case of the letter, in other words; the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as: /\s+(?=\.com)/gi &/or /\s+(?=.COM)/gi. Everyone of the "Just Listed" regular expressions are equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead's assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though, only one match was returned in the end, and it was the correct match, when implemented in a program, or website, that's running in production, you can pretty much guarantee that the regex is not only going to fail, but its going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

Here a word is a string of letters, preceded and followed by nonletters

I asked his question earlier but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line.Here a word is a string of letters, preceded and followed by nonletters.
I tried every single example that was given below, but i get "syntax error"s when using them.
Does anyone know how I can solve this??
Thanks
if your regex flavor support it you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non letter characters (all character except letters)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
In order to be unicode compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, like whitespaces, certain punctuations or terminal positions. Your requirement suggests you want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look for what precedes or follows the alphabets. This already captures all alphabets and only alphabets. The greedy requirement (+) ensures that nothing is left out from a capture.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at terminals (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Test here. Note that this uses negated positive lookarounds, and not Negative lookarounds as they would end up matching at the string terminals (which are, to the regex engine as much as to me, non-alphabets).
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
When talking about regex, you need to be extremely specific and accurate in your requirements.

Regex code question

I'm new to this site and don't know if this is the place to ask this question here?
I was wondering if someone can explain the 3 regex code examples below in detail?
Thanks.
Example 1
`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i
Example 2
\\1
Example 3
`[^a-z0-9]`i','`[-]+`
The first regex looks like it'll match the HTML entities for accented characters (e.g., é is é; ø is ø; æ is æ; and  is Â).
To break it down, & will match an ampersand (the start of the entity), ([a-z]{1,2}) will match any lowercase letter one or two times, (acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig) will match one of the terms in the pipe-delimited list (e.g., circ, grave, cedil, etc.), and ; will match a semicolon (the end of the entity). I'm not sure what the i character means at the end of the line; it's not part of the regex.
All told, it will match the HTML entities for accented/diacritic/ligatures. Compared, though, to this page, it doesn't seem that it matches all of the valid entities (although ti does catch many of them). Unless you run in case-insensitive mode, the [a-z] will only match lowercase letters. It will also never match the entities ð or þ (ð, þ, respectively) or their capital versions (Ð, Þ, also respectively).
The second regex is simpler. \1 in a regex (or in regex find-replace) simply looks for the contents of the first capturing group (denoted by parentheses ()) and (in a regex) matches them or (in the replace of a find) inserts them. What you have there \\1 is the \1, but it's probably written in a string in some other programming language, so the coder had to escape the backslash with another backslash.
For your third example, I'm less certain what it does, but I can explain the regexes. [^a-z0-9] will match any character that's not a lowercase letter or number (or, if running in case-insensitive mode, anything that's not a letter or a number). The caret (^) at the beginning of the character class (that's anything inside square brackets []) means to negate the class (i.e., find anything that is not specified, instead of the usual find anything that is specified). [-]+ will match one or more hyphens (-). I don't know what the i',' between the regexes means, but, then, you didn't say what language this is written in, and I'm not familiar with it.