Regex code question - regex

I'm new to this site and don't know if this is the place to ask this question here?
I was wondering if someone can explain the 3 regex code examples below in detail?
Thanks.
Example 1
`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i
Example 2
\\1
Example 3
`[^a-z0-9]`i','`[-]+`

The first regex looks like it'll match the HTML entities for accented characters (e.g., é is é; ø is ø; æ is æ; and  is Â).
To break it down, & will match an ampersand (the start of the entity), ([a-z]{1,2}) will match any lowercase letter one or two times, (acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig) will match one of the terms in the pipe-delimited list (e.g., circ, grave, cedil, etc.), and ; will match a semicolon (the end of the entity). I'm not sure what the i character means at the end of the line; it's not part of the regex.
All told, it will match the HTML entities for accented/diacritic/ligatures. Compared, though, to this page, it doesn't seem that it matches all of the valid entities (although ti does catch many of them). Unless you run in case-insensitive mode, the [a-z] will only match lowercase letters. It will also never match the entities ð or þ (ð, þ, respectively) or their capital versions (Ð, Þ, also respectively).
The second regex is simpler. \1 in a regex (or in regex find-replace) simply looks for the contents of the first capturing group (denoted by parentheses ()) and (in a regex) matches them or (in the replace of a find) inserts them. What you have there \\1 is the \1, but it's probably written in a string in some other programming language, so the coder had to escape the backslash with another backslash.
For your third example, I'm less certain what it does, but I can explain the regexes. [^a-z0-9] will match any character that's not a lowercase letter or number (or, if running in case-insensitive mode, anything that's not a letter or a number). The caret (^) at the beginning of the character class (that's anything inside square brackets []) means to negate the class (i.e., find anything that is not specified, instead of the usual find anything that is specified). [-]+ will match one or more hyphens (-). I don't know what the i',' between the regexes means, but, then, you didn't say what language this is written in, and I'm not familiar with it.

Related

RegExp space character

I have this regular expression: ^[a-zA-Z]\s{3,16}$
What I want is to match any name with any spaces, for example, John Smith and that contains 3 to 16 characters long..
What am I doing wrong?
Background
There are a couple of things to note here. First, a quantifier (in this case, {3,16}) only applies to the last regex token. So what your current regex really is saying is to "Match any string that has a single alphabetical character (case-insensitive) followed by 3 to 16 whitespace characters (e.g. spaces, tabs, etc.)."
Second, a name can have more than 2 parts (a middle name, certain ethnic names like "De La Cruz") or include special characters such as accented vowels. You should consider if this is something you need to account for in your program. These things are important and should be considered for any real application.
Assumptions and Answer
Now, let's just assume you only want a certain format for names that consists of a first name, a last name, and a space. Let's also assume you only want simple ASCII characters (i.e. no special characters or accented characters). Furthermore, both the first and last names should start with a capital character followed by only lower-case characters. Other than that, there are no restrictions on the length of the individual parts of the name. In this case, the following regex would do the trick:
^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$
Notes
The first token after the ^ character is what is called a positive lookahead. Basically a positive look ahead will match the regex between the opening (?= and closing ) without actually moving the position of the cursor that is matching the string.
Notice I removed the \s token, since you usually want only a (space). The space can be replaced with the \s token, if tabs and other whitespace is desired there.
I also added a restriction that a name must start with a capital letter followed by only lower-case letters.
Crude English Translation
To help your understanding, here is a simple English translation of what the regex is really doing. The part in italics is just copied from the first part of the English translation of the regex.
"Match any string that has 3-16 characters and starts with a capital alphabetical character followed by one or more (+) alphabetical characters followed by a single space followed by a capital alphabetical character followed by one or more (+) alphabetical characters and ends with any lowercase letter."
Tools
There are a couple of tools I like to use when I am trying to tackle a challenging regex. They are listed below in no particular order:
https://regex101.com/ - Allows you to test regex expressions in real time. It also has a nifty little library to help you along.
http://www.regular-expressions.info/ - Basically a repository of knowledge on regex.
Edit/Update
You mentioned in your comments that you are using your regex in JavaScript. JavaScript uses a forward slash surrounding the regex to determine what is a regex. For this simple case, there are 2 options for using a regex to match a string.
First, use String's match method as follows
"John Smith".match(/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/);
Second, create a regex and use its test() method. For example,
/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/.test("John Smith");
The latter is probably what you want as it simply returns true or false depending on whether the regex actually matches the string or not.

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=**Not**NeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)|(?<=**Not**NeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we capture it and dont use it any further. We could capture it with parenthesis and use it as another group, but that's out of scope here
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often times this is a file extension (i.e .c, .ts, or .json), or a top level domain (i.e. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point it is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. Its a variance because the amount of white-space that needs to be match against varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often times when working with regular expressions its necessary to work in reverse.
We will start at the end of the problem described above, and work backwards, henceforth; we are going to start at the The Before Variance (or #3)
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespce with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAS Regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine to not be concerned with the case of the letter, in other words; the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as: /\s+(?=\.com)/gi &/or /\s+(?=.COM)/gi. Everyone of the "Just Listed" regular expressions are equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead's assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though, only one match was returned in the end, and it was the correct match, when implemented in a program, or website, that's running in production, you can pretty much guarantee that the regex is not only going to fail, but its going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

Using regex to match multiple comma separated words

I am trying to find the appropriate regex pattern that allows me to pick out whole words either starting with or ending with a comma, but leave out numbers. I've come up with ([\w]+,) which matches the first word followed by a comma, so in something like:
red,1,yellow,4
red, will match, but I am trying to find a solution that will match like like the following:
red, 1 ,yellow, 4
I haven't been able to find anything that can break strings up like this, but hopefully you'll be able to help!
This regex
,?[a-zA-Z][a-zA-Z0-9]*,?
Matches 'words' optionally enclose with commas. No spaces between commas and the 'word' are permitted and the word must start with an alphanumeric.
See here for a demo.
To ascertain that at least one comma is matched, use the alternation syntax:
(,[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*,)
Unfortunately no regex engine that i am aware of supports cascaded matching. However, since you usually operate with regexen in the context of programming environments, you could repeatedly match against a regex and take the matched substring for further matches. This can be achieved by chaining or iterated function calls using speical delimiter chars (which must be guaranteed not to occur in the test strings).
Example (Javascript):
"red, 1 ,yellow, 4, red1, 1yellow yellow"
.replace(/(,?[a-zA-Z][a-zA-Z0-9]*,?)/g, "<$1>")
.replace(/<[^,>]+>/g, "")
.replace(/>[^>]+(<|$)/g, "> $1")
.replace(/^[^<]+</g, "<")
In this example, the (simple) regex is tested for first. The call returns a sequence of preliminary matches delimted by angle brackets. Matches that do not contain the required substring (, in this case) are eliminated, as is all intervening material.
This technique might produce code that is easier to maintain than a complicated regex.
However, as a rule of thumb, if your regex gets too complicated to be easily maintained, a good guess is that it hasn't been the right tool in the first place (Many engines provide the x matching modifier that allows you to intersperse whitespace - namely line breaks and spaces - and comments at will).
The issue with your expression is that:
- \w resolves to this: [a-zA-Z0-9_]. This includes numeric data which you do not want.
- You have the comma at the end, this will match foo, but not ,foo.
To fix this, you can do something like so: (,\s*[a-z]+)|([a-z]+\s*,). An example is available here.

regex for links - help to understand it

how do you read this regex?
#(http|https|ftp)://([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+):?(d+)?/?#i
this is a regex for links, but i'm having trouble to understand it
Thanks
Depending on what language you're in, regexes need a delimiter. Seems the # (pound sign or hash) is used here. So,
#...actual regex goes here...#
In javascript you need forward slashes (/..../).
Some regex engines allow you to pass flags that influence matching process. These appear after the closing delimiter:
#...actual regex goes here...#..flags go here..
In your example, there is one flag, the i and I am guessing that means: "case insensitive" (i for insensitive). Depending on the regex engine you can have flags that influence the syntax you can use for the actual regex (for example, the dot can match either any character or any character except newlines depending upon wheter a flag was passed), flags that influence how the matching is done (for example, in javascript a g indicates the global flag, and that means matching anywhere inside the string is done, and state is preserved), flags that determine whether whitespace is allowed as indentation inside the regex. And some have a m flag indicating whether the regex will be applied on a line by line basis, or on the entire text. There is AFAIK no standard set of flags, check your regex engine documentation.
If you have multiple flags, you just concatenate them together to a string of flags and put them after the closing delimiter.
Now for the actual regex. First, you start with a parenthesized expression:
(...group...)
This is also called a group. In many regex engines, these groups have special meaning, because when a match is found you can access the bits of text that matched the expression inside the group using a special variable (or sometimes, the match is returned as an array, where each element represents a group). If you can access the bits inside groups, it is called a "capturing group".
In this particular case the group uses "alternation" or "choice" and this is indicated by the | (pipe). The pipe is part of the regex syntax and means "or". So,
(http|https|ftp)
means: match "http", or if that doesn't match, "https", of if that doesn't match, "ftp". This also brings up another reason for using parenthesis: of all special regex syntax operators, the pipe has the lowest precedence, so the parenthesis would not have been there, it would have meant: match "http" or "https" or "ftp://...etc"
So far, we've seen these "special characters": | (pipe) and ( and ). After that we get
://
These are not special characters, and any non-special characters simply match themselves.
We then get another group, which makes up almost the rest of the regex:
([A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+)
Inside it, we see a bracketed expression:
[A-Z0-9]
The brackets [ and ] are special, and indicate a "character class". There are other ways to denote character classes, but in all cases a character class matches a single character. Which character depends on the nature of the class. In this case, the class is defined using two ranges:
A-Z
means characters A thru Z (and anything in between) and
0-9
means characters 0 thru 9 (and anything in between).
Basically, [A-Z0-9] matches any alpha-numeric character.
Note that the dash between the boundaries of the range is only a special character inside these bracketed expressions. Paradoxically, a dash inside the brackets can also simply mean a dash if it cannot be interpreted as a range.
This is folllowed by yet another character class:
[A-Z0-9_-]
Almost the same as the previous on, it just adds the underscore and the dash. This last dash cannot be interpreted as a range separator, so it simply means a dash. This character class will match any alpha-numeric character as well as underscore and dash.
This class is followed by a * (asterisk) and this is a special character indicating a cardinality. Cardinalities specify how often the immediately preceding element may occur. These are the common cardinalities:
* (asterisk) means zero or more times.
? (question mask) means zero or once.
+ (plus) means one or more times.
Now the entire bit starts to make sense:
[A-Z0-9][A-Z0-9_-]*
means: a sequence starting with one alphanumeric caracter, optionally followed by a string of "word" characters (that is, alphanumeric, dash and underscore).
The following bit of the regex is this:
(?:.[A-Z0-9][A-Z0-9_-]*)+
I think this is trying to match the domain parts. So that if you have say:
https://mail.google.com
The .google and .com bits would be matched by this part. The initial (?: bit is meant to tell the regex engine to not create a "backreference". This is not really my stronghold, maybe someone else can explain. But the rest of that group is quite clear and resembles what we saw before. I think there is a mistake though: the dot (.) that appears immediately before the bracketed character class usually means "match any character" or "match any non-newline character", not "match a literal dot". Typically if you want a literal dot, you need to escape it. This would be the syntax in javascript and I think perl:
(\.[A-Z0-9][A-Z0-9_-]*)+
(note the backslash immediately before the dot to indicate a literal dot)
The final bits of the regex seem an attempt to match a port number:
:?(d+)?
However, the d+ bit is probably wrong: right now it matches "one or more d's". It should probably be:
:?(\d+)?
meaning: optionally match a colon (:), optionally followed by a bunch of digits. The \d is also a character class, but a predefined one. I think most regex engines use \d to denote a digit, but you should check the documentation of your engine to see the exact convention. So in say:
http://domain.server.extension:8080/
this part of the regex would match :8080 (provided you fix the d+ thing).
Finally, we see
/?
Meaning the entire thing can be followed optionally by a forward slash.
So, all in all, I don;'t think this matches a "link", rather it matches the inital part of a URL. To match an entire url, you would need a bit more, at least I don't see any expression that could match the path, resource, hash and query bits that may occur in a proper URL.
When you say you have trouble understanding it, it means you tried something and are stuck somewhere?
Please ask more specific questions.
I can give you some keywords that you can lookup them more easy, a good place for that is regular-expressions.info
(http|https|ftp) is an alternation
[A-Z0-9] is a character class
*, + and ? are quantifiers
(...) is a (capturing) group, (?:...) is a non capturing group
The # at the start and end are regex delimiters, the i at the very end is a modifier/option (match case independent).
The (d+)? at the end would match one or more (optional) letters "d". This is quite strange. I assume it should be (\d+)? that would be one or more (optional) digits.

Regular expression (alphanumeric)

I need a regular expression to allow the user to enter an alphanumeric string that starts with a letter (not a digit).
This should work in any of the Regular Expression (RE) engines. There is a nicer syntax in the PCRE world but I prefer mine to be able to run anywhere:
^[A-Za-z][A-Za-z0-9]*$
Basically, the first character must be alpha, followed by zero or more alpha-numerics. The start and end tags are there to ensure that the whole line is matched. Without those, you may match the AB12 of the "###AB12!!!" string.
Full explanation:
^ start tag.
[A-Za-z] any one of the upper/lower case letters.
[A-Za-z0-9] any one of the upper/lower case letters or digits,
* repeated zero or more times.
$ end tag
Update:
As Richard Szalay rightly points out, this is ASCII only (or, more correctly, any encoding scheme where the A-Z, a-z and 0-9 groups are contiguous) and only for the "English" letters.
If you want true internationalized REs (only you know whether that is a requirement), you'll need to use one of the more appropriate RE engines, such as the PCRE mentioned above, and ensure it's compiled for Unicode mode. Then you can use "characters" such as \p{L} and \p{N} for letters and numerics respectively. I think the RE in that case would be:
^\p{L}[\pL\pN]*$
but I'm not certain. I've never used REs for our internationalized software. See here for more than you ever wanted to know about PCRE.
I think this should do the work:
^[A-Za-z][A-Za-z0-9]*$
You're looking for a pattern like this:
^[a-zA-Z][a-zA-Z0-9]*$
That one requires one letter and any number of letters/numbers after that. You may want to adjust the allowed lengths.