Optimal UUID RegEx with optional delimeters - regex

I have already found a question for UUID regular expressions here, but these expressions do not account missing delimeters.
I have come up with the following expression, but is there a more optimal RegEx?
/\b([0-9a-f]{8}-?([0-9a-f]{4}-?){3}[0-9a-f]{12})\b/i

I am assuming by optimal that you mean a shorter expression. I simplified your regex down to the following:
/[\da-f]{8}-?([\da-f]{4}-?){3}[\da-f]{12}/i, which you can see in action here.
I removed the outer parentheses and \b because everything was correctly matched without them.
I was also able to shave three characters off by replacing [0-9a-f] with [\da-f].
I originally had [0-F], but after examining the ASCII sequence, I realized that matched
0123456789:;<=>?#ABCDEF, which includes some extra symbols that we do not want to match.
In conclusion, my expression is synonymous with yours but contains nine fewer characters.

Related

E-mail address validation using Regular Expressions

I'm writing a simple, small app that allows me to share information. I have a question on using regx to validate email address.
I'm kind learning on my own. But when it comes to real-world examples, such that strings that can be validated with regular expressions, I'm kind stuck.
Exercise:
Untangle the following regular expression that validates an email address:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
It looks like a jumble of characters.
Can someone please explain to me how does this work?
I try to use this online resources by by Jan Goyvaerts.
Any help I will appreciate it.
First of all, there is a good thread about totally the same thing:
Using a regular expression to validate an email address
Then, below there is the explanation of your regular expression:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
- The square brackets represent the symbol class, containing all the symbols which are in the square brackets. The plus sign ('+') is a quantifier, which means that the sequence of symbols, represented by this symbol class must be at least one character long.
Also, the '+' is greedy, and, therefore, this part of the pattern will match the symbol sequence of the maximal possible length.
Talking about the square brackets contents, 'a-z' means any symbol in a range, which could be described mathematically as [a, z], and '0-9' is similar. All the other symbols are just symbols in this case.
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
- In Regular Expressions, the brackets represent grouping, and the asterisk ('*') is a greedy quantifier, which means "occurs zero or more times". So here we are not sure if we are going to find the brackets content, but we do not rule out the possibility.
Then, inside the brackets, we see the ?: character combination, which, being put inside brackets tells us that the symbol group inside should not be captured as a sub-string for the further reference.
Going further, \. means just a usual dot (see Escape sequence), since a dot symbol is a meta-symbol in Regex.
After the dot we see again the character of symbols, explained above.
#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
- Here we see the at symbol ('#'), which is just a symbol here, then there is a non-capturing symbol group, which will occur one or more times (because of + after it), and which includes a single symbol of [a-z0-9] class and another non-capturing group of symbols, which contents you can totally describe using my explanations above except for a question mark sign ('?'), which means "either once or not at all" in this context (i.e. if it is used as a quantifier).
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
- This last part is similar to what is found in a symbol group, explained above, so I believe you have now enough information to understand it.
More on quantifier types here: Greedy vs. Reluctant vs. Possessive Quantifiers.
A good Regular Expressions reference: Regular Expression Language - Quick Reference
Some information on capturing in Regular Expressions: Regex Tutorial - Parentheses for Grouping and Capturing
About special characters: Regex Tutorial - Literal Characters and Special Characters
Regex statements can be a fun yet tricky to follow. There are 5 parts to this statement.
One valid characters for a username
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
check for a single '.' and any additional amount of characters
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
The '#' symbol
Valid second / lower level domain
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
A valid top level domain
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
I recommend http://www.ultrapico.com/expresso.htm. It will break the statement down for you.
I've found a remarkable tool for visualizing regular expressions here: http://regexper.com
It shows me that your regular expression breaks down like this. Hopefully this helps explain it.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
This looks for at least one of of the characters given here (a-z, 0-9, and those special characters).
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)
This looks for the same as above, but only when it stands after a dot. This part is optional and can be repeated indefinitely. It prevents dots at the end of the name.
#
Matches the # symbol
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
This matches a-z, 0-9 ending with a dot and optional - in the middle ending with a dot. This has to be matched at least once.
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
This looks for a-z or 0-9, optionally followed by a-z, 0-9, -, but it cant end with a - again.
Two Suggestions I have for you.
Escaping special characters is messy. 2. Email addresses are complicated. I probably recommend you to study this post if you are really interested. Please check out this other posts: Validation in Regex and Regex Help.
See this answer. The problem is probably too difficult to solve. Two problems you have here. 1. RegEx are not easy. 2. Escaping special characters is messy. Finally, Email addresses are complicated. I probably recommend you to study this post if you are really interested.

Regex for extracting qmake variables

I'm trying to write the QRegExp for extracting variable names from qmake project code (*.pro files).
The syntax of variable usage have two forms:
$$VAR
$${VAR}
So, my regular expression must handle both cases.
I'm trying to write expression in this way:
\$\$\{?(\w+)\}?
But it does not work as expected: for string $$VAR i've got $$V match, with disabled "greeding" matching mode (QRegExp::setMinimal (true)). As i understood, gready-mode can lead to wrong results in my case.
So, what am i doing wrong?
Or maybe i just should use greedy-mode and don't care about this behavior :)
P.S. Variable name can't contains spaces and other "special" symbols, only letters.
You do not need to disable greedy matching. If greedy matching is disabled, the minimal match that satisfies your expression is returned. In your example, there's no need to match the AR, because $$V satisfies your expression.
So turn the minimal mode back on, and use
\$\$(\w+|\{\w+\})
This matches two dollar signs, followed by either a bunch of word characters, or by a bunch of word characters between braces. If you can trust your data not to contain any non-matching braces, your expression should work just as well.
\w is equal to [A-Za-z0-9_], so it matches all digits, all upper and lowercase alphabetical letters, and the underscore. If you want to restrict this to just the letters of the alphabet, use [A-Za-z] instead.
Since the variable names can not contain any special characters, there's no danger of matching too much, unless a variable can be followed directly by more regular characters, in which case it's undecidable.
For instance, if the data contains a string like Buy our new $$Varbuster!, where $$Var is supposed to be the variable, there is no regular expression that will separate the variable from the rest of the string.

Regex to check if a string contains at least A-Za-z0-9 but not an &

I am trying to check if a string contains at least A-Za-z0-9 but not an &.
My experience with regexes is limited, so I started with the easy part and got:
.*[a-zA-Z0-9].*
However I am having troubling combining this with the does not contain an & portion.
I was thinking along the lines of ^(?=.*[a-zA-Z0-9].*)(?![&()]).* but that does not seem to do the trick.
Any help would be appreciated.
I'm not sure if this what you meant, but here is a regular expression that will match any string that:
contains at least one alpha-numeric character
does not contain a &
This expression ensures that the entire string is always matched (the ^ and $ at beginning and end), and that none of the characters matched are a "&" sign (the [^&]* sections):
^[^&]*[a-zA-Z0-9][^&]*$
However, it might be clearer in code to simply perform two checks, if you are not limited to a single expression.
Also, check out the \w class in regular expressions (it might be the better solution for catching alphanumeric chars if you want to allow non-ASCII characters).

Different regex evaluation in collections or patterns

I am experiencing a strange behaviour when searching for a regular expression in vim:
I attempt to clean up superfluous whitespace in a file and want to use the substitute command for it.
When I use the following regular expression with collections, vim matches single whitespaces as well:
\%[\s]\{2,}
When I use the same regular expression with patterns instead of collections vim correctly matches only 2 or more whitespaces:
\%(\s\)\{2,}
I know that I do not need to use a collection, but if I try the expression in a online regular expression parser (e.g. Rubular) it works with a collection as well.
Can anyone explain why these expression are not evaluated in the same way?
Because \%[...] and \%(...\) are completely different patterns.
\%[...] means a sequence of optional atoms.
For example, r\%[ead] matches "read", "rea", "re" and "r".
While \%(...\) treats the enclosed atoms as a single atom.
For example, r\%(ead\) matches only "read".
So that,
\%[\s]\{2,} can be interpreted as \(\s\|\)\{2,}, then \(\s\|\)\(\s\|\)\|\(\s\|\)\(\s\|\)\(\s\|\)\|....
Here \(\s\|\)\(\s\|\), the minimum pattern, can be interpreted as \(\)\(\), \(\)\(\s\), \(\s\)\(\) or \(\s\)\(\s\).
It matches 1 whitespace character too.
\%(\s\)\{2,} can be interpreted as \s\{2,}, then \s\s\|\s\s\s\|....
It matches only 2 or more whitespace characters.
does this answer your question?
http://vimdoc.sourceforge.net/htmldoc/pattern.html#/\%[]
A sequence of optionally matched atoms. This always matches.
It matches as much of the list of atoms it contains as possible.
Thus it stops at the first atom that doesnt match.
For example:
/r\%[ead]
matches "r", "re", "rea" or "read". The longest that matches is used.
The problem is it always match and override the quantifier {2,} at the back.
it is rarely used, but interesting nevertheless.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part