Difference between "(\S+)\.|" and "(\S+) |" in Perl [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I have no idea what the difference between (\S+) | and (\S+)\.| is.

(\S+)\.| will match and capture one or more non-whitespace characters, followed by a literal dot character.
(\S+) | will match and capture one or more non-whitespace characters, followed by a space character (assuming the regular expression isn't modified with the /x flag, which makes unescaped whitespace in the pattern insignificant).
In both cases, these constructs appear to be one component of an alternation.
Breaking it down:
(....) : Group and capture.
\S : Non-whitespace character.
+ : One or more.
\. : A dot character (without the backslash escape, the dot has special meaning).
" " : Just an ordinary single space.
| : Alternation (similar to logical or).
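To see the difference on real input, here is a minimal sketch in Python rather than Perl (these particular constructs behave the same way in both). The sample string is made up for illustration, and the trailing | from the question is left out because the other branches of the alternation aren't shown.

```python
import re

# Hypothetical sample input, chosen so the two patterns capture different text.
text = "www.example.com is here"

# (\S+)\.  : one or more non-whitespace characters, then a literal dot.
# \S+ is greedy, so it backtracks to the LAST dot it can still match before.
dot_match = re.match(r"(\S+)\.", text)

# (\S+) followed by a space: non-whitespace characters, then a literal space.
space_match = re.match(r"(\S+) ", text)

print(dot_match.group(1))    # www.example
print(space_match.group(1))  # www.example.com
```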
See perlretut for a crash course in Perl's regular expressions. Also perlintro is a good starting point for learning Perl, and perlre is the canonical explanation of Perl's regular expressions. There are many other useful documents in Perl's documentation, but these would get you moving in the right direction.
If you want to learn everything there was to know in 2005 about common regular expression flavors, Mastering Regular Expressions, 3rd Edition is unparalleled. And despite being a few years old, it's still one of the best resources anywhere on regular expressions.

Related

What is the purpose of [.] in regular expressions? [duplicate]

I found [.] in a regular expression in the notmuch man page:
notmuch search 'from:"/bob#.*[.]example[.]com/"'
It seemed useless at first, because brackets denote a list of characters and this one contains only a single character, but I eventually learned that it matches a literal dot.
So why use it rather than \.? Does this form have any advantages?
At first I thought this was to avoid double escaping, but on further consideration I think it is because a dot inside a character set ([]) is treated differently than usual. It makes sense that a dot in a character set matches only a literal dot: the whole point of a set is to match specific characters, so a wildcard in the set would make no sense.
So [.,;:] may be used to match punctuation marks.
Once you take that into account, it's obvious that [.] just matches a dot.
Whether to use \. or [.] is left as an aesthetic decision.
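A quick way to convince yourself is to compare the three spellings side by side. This is a small Python sketch; the host names are invented for the example and have nothing to do with notmuch itself.

```python
import re

host = "mail.example.com"
lookalike = "mailXexampleYcom"   # same shape, but no dots at all

escaped = re.compile(r"mail\.example\.com")
bracketed = re.compile(r"mail[.]example[.]com")
unescaped = re.compile(r"mail.example.com")   # bare dot = any character

results = {
    name: (bool(rx.fullmatch(host)), bool(rx.fullmatch(lookalike)))
    for name, rx in [("escaped", escaped),
                     ("bracketed", bracketed),
                     ("unescaped", unescaped)]
}
print(results)

# Inside a character class the dot is just one more literal character.
punctuation = re.findall(r"[.,;:]", "a.b,c;d:e")
print(punctuation)  # ['.', ',', ';', ':']
```

Only the unescaped version accepts the lookalike. \. and [.] behave identically here, so the choice comes down to readability and to how many layers of quoting the pattern has to survive.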

negation classes regex [duplicate]

I wrote this regex to tokenize a text: "\b\w+\b"
but someone suggested that I change it to \b[^\W\d_]+\b
Can anyone explain to me why this second way (using negation) is better?
Thanks
The first one matches all letters, digits and the underscore. Depending on the regex engine, this may include Unicode letters and digits. (The word boundaries are superfluous in this case, by the way.)
The second regex matches only letters: the negated class excludes non-word characters, digits and the underscore. Because of the word boundaries, it will only match a run of letters if that run is surrounded by non-word characters or the start/end of the string.
If your regex engine supports it, you might want to use [[:alpha:]] or \p{L} (or [A-Za-z] in the non-Unicode case) instead to make your intent clearer.
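Here is a small Python sketch of the difference, using a made-up input string. Note how the word boundaries interact with the letters-only class: a run of letters glued to a digit or underscore is dropped entirely, not truncated.

```python
import re

text = "foo bar_1 baz42 qux"   # hypothetical sample text

# \w covers letters, digits and the underscore.
word_tokens = re.findall(r"\b\w+\b", text)

# [^\W\d_] = word characters minus digits minus underscore = letters only.
# With \b on both sides, "bar" in "bar_1" is rejected because there is no
# word boundary between "r" and "_".
letter_tokens = re.findall(r"\b[^\W\d_]+\b", text)

# Without the boundaries, the letter runs inside mixed tokens are kept.
letter_runs = re.findall(r"[^\W\d_]+", text)

print(word_tokens)    # ['foo', 'bar_1', 'baz42', 'qux']
print(letter_tokens)  # ['foo', 'qux']
print(letter_runs)    # ['foo', 'bar', 'baz', 'qux']
```

So the second form is not "better" in general; it just tokenizes differently, and which result you want depends on whether identifiers like bar_1 should count as tokens.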

How to build a regular expression which prohibits hyphens from appearing at the start and end of a string? [duplicate]

This question already has answers here:
RegEx for allowing alphanumeric at the starting and hyphen thereafter
(4 answers)
Closed 5 years ago.
I want to build a regular expression which only matches [A-Za-z0-9\-] with an additional rule that hyphens (-) are not allowed to appear at the start and at the end.
For example:
my-site is matched.
m is matched.
mysite- is not matched.
-mysite is not matched.
Currently, I've come up with ^[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]+$.
But this doesn't match m.
How can I change my regular expression so that it fits my needs?
Use lookarounds:
^(?!-)[A-Za-z0-9-]*(?<!-)$
The reason this works is that lookarounds don't consume input, so the lookahead and the lookbehind can both assert on the same character.
Note that you don't need to escape the dash within the character class if it's the first or last character.
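The pattern can be checked against the examples from the question with a short Python sketch. One caveat I'd point out: because of the * quantifier it also accepts the empty string. The second pattern below is my own lookaround-free alternative, not something from the answer above, in case empty input must be rejected.

```python
import re

# Pattern from the answer: lookahead/lookbehind guard both ends.
lookaround = re.compile(r"^(?!-)[A-Za-z0-9-]*(?<!-)$")

# Assumed alternative (not from the answer): one alphanumeric, optionally
# followed by more characters that must end in an alphanumeric.
explicit = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")

cases = ["my-site", "m", "mysite-", "-mysite", ""]

lookaround_results = {s: bool(lookaround.match(s)) for s in cases}
explicit_results = {s: bool(explicit.match(s)) for s in cases}

for s in cases:
    print(repr(s), lookaround_results[s], explicit_results[s])
```

Both patterns agree on the four examples from the question; they differ only on the empty string, which the lookaround version matches and the explicit version rejects.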

? character in a regular expression [duplicate]

I have the following regex :
.*(?:(?:(?<!a)cc|string).*number).*
And I am trying to understand what the ? at the beginning of the string between the parentheses means. I know that a? means the preceding character 'a' can occur zero or one time. But what does it mean when it appears at the beginning of a string?
The answer requires a little history lesson. When Larry Wall wanted to add new features to regexes in Perl, he couldn't just change the meaning of existing metacharacters, or assign special meanings to characters that didn't have them. That would have broken a lot of regexes that had been working. Instead, he had to look for character sequences that would never appear in a regex.
There was only one kind of group originally: what we now call capturing groups. The opening parenthesis was a metacharacter, so it would make no sense to follow it with a quantifier. You could match a literal open-paren zero or one time with \(?, or you could match (and capture) a literal question mark with (\?), but if you tried to use (? in a regex it would throw an exception.
Larry changed the rule so (? could appear in a regex, but it must form the beginning of a special-group construct, which requires at least one more character. So, to answer your question, the string doesn't start with ?. The sequence (?: forms a single token, representing the beginning of a non-capturing group. We also have (?= and (?! for positive and negative lookaheads, (?<= and (?<! for lookbehinds, and so on.
(?:) is a non-capturing group. It groups part of the pattern for matching only; it doesn't capture anything.
(?<!) is a negative lookbehind.
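Both constructs can be observed in a few lines of Python, which uses the same (?: and (?<! syntax for them. The test strings are invented for illustration.

```python
import re

pattern = re.compile(r".*(?:(?:(?<!a)cc|string).*number).*")

# Non-capturing groups add nothing to the group count...
print(pattern.groups)                          # 0
# ...whereas ordinary parentheses would.
print(re.compile(r"(cc|string)").groups)       # 1

# (?<!a)cc matches "cc" only when it is NOT immediately preceded by "a".
ok_cc = bool(pattern.match("xcc then number"))          # "cc" preceded by "x"
blocked_cc = bool(pattern.match("acc then number"))     # "cc" preceded by "a"
via_string = bool(pattern.match("a string, then a number"))

print(ok_cc, blocked_cc, via_string)  # True False True
```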

How can I write a regex which matches non greedy? [duplicate]

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
How do I match any character across multiple lines in a regular expression?
(26 answers)
What is the difference between .*? and .* regular expressions?
(3 answers)
RegEx: Smallest possible match or nongreedy match
(3 answers)
Closed 3 years ago.
I need help with non-greedy regular expression matching.
The match pattern is:
<img\s.*>
The text to match is:
<html>
<img src="test">
abc
<img
src="a" src='a' a=b>
</html>
I tested on http://regexpal.com
This expression matches all text from <img to last >. I need it to match with the first encountered > after the initial <img, so here I'd need to get two matches instead of the one that I get.
I tried all combinations of non-greedy ?, with no success.
The non-greedy ? works perfectly fine. You just need to enable the "dot matches all" option in the regex engine you are testing with (regexpal, the one you used, has this option too). This is because regex engines generally don't match line breaks with . unless told to; you have to request explicitly that . match line breaks as well.
For example,
<img\s.*?>
works fine!
Also, read about how dot behaves in various regex flavours.
The ? modifier after a quantifier makes the match non-greedy: .* is greedy while .*? isn't. So you can use something like <img.*?> to match the whole tag, or <img[^>]*>.
But remember that the whole set of HTML can't be actually parsed with regular expressions.
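The three variants discussed so far can be compared on the text from the question. This is a sketch in Python, where the "dot matches all" option is spelled re.DOTALL; other engines name the flag differently.

```python
import re

html = """<html>
<img src="test">
abc
<img
src="a" src='a' a=b>
</html>"""

# Greedy, with dot matching newlines: runs from the first <img to the last >.
greedy = re.findall(r"<img\s.*>", html, flags=re.DOTALL)

# Non-greedy, with dot matching newlines: stops at the first > after each <img.
lazy = re.findall(r"<img\s.*?>", html, flags=re.DOTALL)

# Negated class: needs no flag, because [^>] already matches newlines.
negated = re.findall(r"<img\s[^>]*>", html)

print(len(greedy), len(lazy), len(negated))  # 1 2 2
print(lazy == negated)                       # True
```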
The other answers here presuppose that you have a regex engine which supports non-greedy matching, which is an extension introduced in Perl 5 and widely copied to other modern languages; but it is by no means ubiquitous.
Many older or more conservative languages and editors only support traditional regular expressions, which have no mechanism for controlling greediness of the repetition operator * - it always matches the longest possible string.
The trick then is to limit what it's allowed to match in the first place. Instead of .* you seem to be looking for
[^>]*
which still matches as many of something as possible; but the something is not just . "any character", but instead "any character which isn't >".
Depending on your application, you may or may not want to enable an option to permit "any character" to include newlines.
Even if your regular expression engine supports non-greedy matching, it's better to spell out what you actually mean. If this is what you mean, you should probably say so, instead of relying on non-greedy matching to (hopefully, probably) Do What I Mean.
For example, a regular expression with a trailing context after the wildcard like .*?><br/> will jump over any nested > until it finds the trailing context (here, ><br/>) even if that requires straddling multiple > instances and newlines if you let it, where [^>]*><br/> (or even [^\n>]*><br/> if you have to explicitly disallow newline) obviously can't and won't do that.
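The trailing-context pitfall is easy to reproduce. In this Python sketch the input string is made up, and the patterns are the ones from the paragraph above with <img put in front so there is something to anchor on.

```python
import re

# Hypothetical input: only the SECOND image is directly followed by <br/>.
text = '<img src="x"> text <img src="y"><br/>'

# Non-greedy: starts at the first <img and keeps expanding, straight across
# the intervening >, until "><br/>" finally appears.
lazy = re.search(r"<img.*?><br/>", text)

# Negated class: can never cross a >, so the first <img is rejected and the
# match starts at the second one.
negated = re.search(r"<img[^>]*><br/>", text)

print(lazy.group(0))     # <img src="x"> text <img src="y"><br/>
print(negated.group(0))  # <img src="y"><br/>
```

Non-greedy means "as short as possible from this starting point", not "the shortest match anywhere in the string", which is why the first pattern swallows both tags.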
Of course, this is still not what you want if you need to cope with <img title="quoted string with > in it" src="other attributes"> and perhaps <img title="nested tags">, but at that point, you should finally give up on using regular expressions for this like we all told you in the first place.