What's the difference between * and ? in regular expressions?


Both seem to mean "match 0 or more characters". I don't understand the difference between them, or when to use ? and when to use *. Some examples would help.

In the formal definition, the regular expression operators are:
. : Concatenation. a.b.c matches the text abc. Sometimes concatenation is indicated simply by writing two symbols back to back.
* : Matches the preceding symbol 0 or more times. (abc)* matches the empty string, abc, abcabc, abcabcabc, but not abcaabc. Known as the Kleene star.
+ : Matches either the left-hand side or the right-hand side. (abc + def) matches abc or def. This is the union operator; the | symbol is also used for it.
These are applied to a set of symbols sigma, the alphabet of your language. Two other special symbols are epsilon, which denotes the empty string, and null, which denotes the empty set (no strings at all). For details see [3].
These are the formal definitions.
When you use applications that accept the POSIX regular expression syntax, the meanings of the operators are as follows:
These are the POSIX Basic regular expression operations
. : The dot '.' matches any single character. a.c matches abc, axc, amc, aoc and so on.
^ : Indicates the start of a line. ^abc matches abc only at the start of a line; abc appearing in the middle of a line is not matched.
$ : Indicates the end of a line. abc$ matches abc only at the end of a line; abc appearing in the middle of a line is not matched.
* : Matches the last symbol preceding the '*' 0 or more times. So ab*c would match ac, abc, abbc, abbbc, abbbbbc, abbbbbbbbc etc.
{m,n} : Matches the preceding symbol at least 'm' times and at most 'n' times. (In strict POSIX basic syntax the braces are escaped, \{m,n\}; the unescaped form shown here is the extended syntax.)
ab{2,4}c would not match 'abc', but would match 'abbc', 'abbbc', 'abbbbc', and would not match 'abbbbbc'. So it matches if the number of 'b's is >= 2 and <= 4.
{m,} : Matches the preceding symbol at least 'm' times, with no upper limit (note the comma).
{n} : Matches the preceding symbol exactly 'n' times, so ab{3}c would only match 'abbbc'.
[symbols] : Matches any one of the symbols inside the square brackets. a[xyz]c would match 'axc', 'ayc' and 'azc' and no other strings.
[^symbols] : Matches any one symbol that is not inside the square brackets. a[^xyz]c would match any string 'a.c' with the '.' being any symbol except x, y, z.
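The operators listed so far can be tried out quickly. The sketch below uses Python's re module, which is not a POSIX engine, on the assumption that *, {m,n}, [...] and [^...] behave the same way there; the helper name matches is made up for this illustration.

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# '*' : zero or more of the preceding symbol
star = [s for s in ["ac", "abc", "abbbc", "adc"] if matches(r"ab*c", s)]

# '{m,n}' : between m and n repetitions of the preceding symbol
bounded = [s for s in ["abc", "abbc", "abbbbc", "abbbbbc"] if matches(r"ab{2,4}c", s)]

# '[...]' and '[^...]' : one symbol from the set, or one symbol not in the set
in_set = [s for s in ["axc", "ayc", "abc"] if matches(r"a[xyz]c", s)]
not_in_set = [s for s in ["axc", "ayc", "abc"] if matches(r"a[^xyz]c", s)]

print(star)        # ['ac', 'abc', 'abbbc']
print(bounded)     # ['abbc', 'abbbbc']
print(in_set)      # ['axc', 'ayc']
print(not_in_set)  # ['abc']
```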
These are the POSIX Extended regular expression operators (needs grep -E)
? : Will match the preceding symbol 0 or at most 1 time. so ab?c would match 'ac' and 'abc' only.
+ : Will match the preceding symbol at least 1 time, with no upper bound. ab+c would match abc, abbbc, abbbbbc, abbbbbbbbc, etc., but would not match 'ac'.
| : Will match either the expression on the left side of the '|' or the expression on the right side, like (ab+c)|(xy*z).
Also have a look at the POSIX character classes, such as [:alpha:], which represents all alphabetic characters, and [:punct:], which denotes all punctuation characters.
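The extended operators ?, + and | can be checked the same way. This is a small sketch in Python's re module (an assumption on my part; the answer itself talks about grep -E), with a made-up helper named matches.

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# '?' : zero or one of the preceding symbol
optional = [s for s in ["ac", "abc", "abbc"] if matches(r"ab?c", s)]

# '+' : one or more of the preceding symbol
one_or_more = [s for s in ["ac", "abc", "abbbc"] if matches(r"ab+c", s)]

# '|' : either the left or the right expression
either = [s for s in ["abbc", "xz", "xyyz", "ac"] if matches(r"(ab+c)|(xy*z)", s)]

print(optional)     # ['ac', 'abc']
print(one_or_more)  # ['abc', 'abbbc']
print(either)       # ['abbc', 'xz', 'xyyz']
```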
Wildcard Characters / Globs
If you are using * and ? as wildcards, then the interpretations are as below:
* : Matches any number of any characters at this position. *.c means all strings ending with '.c' (here . has no special interpretation). Test with ls *.c or ls *.doc
? : Matches exactly one character at this position. file??.txt would match 'fileab.txt', 'file00.txt' etc., with exactly two characters in place of the ??. Test with ls *.??? which will list all the files having a three-character extension.
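To see the glob meanings without touching the file system, here is a sketch using Python's standard fnmatch module. The file names are invented for the example, and fnmatch is only an approximation of shell globbing (for instance, it does not treat leading dots specially).

```python
from fnmatch import fnmatchcase

# Hypothetical file names, made up for this illustration.
names = ["main.c", "notes.doc", "fileab.txt", "file00.txt", "file1.txt", "archive.tar.gz"]

# '*' stands for any run of characters; the '.' is a literal dot here.
c_files = [n for n in names if fnmatchcase(n, "*.c")]

# Each '?' stands for exactly one character.
two_char = [n for n in names if fnmatchcase(n, "file??.txt")]
three_ext = [n for n in names if fnmatchcase(n, "*.???")]

print(c_files)    # ['main.c']
print(two_char)   # ['fileab.txt', 'file00.txt']
print(three_ext)  # ['notes.doc', 'fileab.txt', 'file00.txt', 'file1.txt']
```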
I hope this answers your question. You might also want to read through some text on the formal definitions, POSIX, and maybe the Perl-style regular expressions to get a clear idea.
References:
[1] Wikipedia page
[2] grep manual, regular expression section
[3] Introduction to the Theory of Computation by Michael Sipser
Note: This answer was reconstructed

? means zero or one of. * means any number of. So this:
^ab?$
would match a and ab, but not abb. This:
^ab*$
would match not only a and ab, but also abb, abbb, and a with any number of bs following it.
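A quick way to confirm this, sketched in Python (the candidate strings are just test inputs I picked):

```python
import re

candidates = ["a", "ab", "abb", "abbb", "b"]

# '?' allows zero or one 'b' after the 'a'
optional_b = [s for s in candidates if re.search(r"^ab?$", s)]

# '*' allows any number of 'b's after the 'a'
any_bs = [s for s in candidates if re.search(r"^ab*$", s)]

print(optional_b)  # ['a', 'ab']
print(any_bs)      # ['a', 'ab', 'abb', 'abbb']
```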

For the sake of completeness, *? is also used in some regex flavours. In this case it is a lazy (non-greedy) match and will match as few characters as possible before matching the next token.
For example:
a.*a
would match the whole of abaaba
while
a.*?a
would match aba twice (aba, aba)
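The same comparison as a small Python sketch, assuming a flavour that supports lazy quantifiers (Python's re does; plain POSIX grep does not):

```python
import re

text = "abaaba"

# Greedy: '.*' grabs as much as it can, so there is a single long match.
greedy = re.findall(r"a.*a", text)

# Lazy: '.*?' grabs as little as it can, so there are two short matches.
lazy = re.findall(r"a.*?a", text)

print(greedy)  # ['abaaba']
print(lazy)    # ['aba', 'aba']
```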

Related

Do parentheses change the length of a regular expression?

Let Sigma = {a,b}. The regular expression RE = (ab)(ab)*(aa|bb)*b over Sigma.
Give a string of length 5 in the set denoted by RE.
Correct answer: abaab
My answer: (ab)aab
I placed the parentheses there because they are in the RE. I understand why I don't need to, but is my answer incorrect? I tested it using RegEx, and the expression (ab)aab matched the text abaab, but it did not match when I reversed this.

() is part of regex syntax and has its own semantic meaning.
As with ^ and other reserved characters in regex, parentheses need special handling (escaping) if you want to match them literally; see, for example, the question "Regex to Match Symbols: !$%^&*()_+|~-=`{}[]:";'<>?,./".
Also, specifically in the context of your question, () should not appear as part of the string because it is not in the character set (alphabet) {a,b}. And the string you provided has a length of 7 instead of 5, so it is correct to say it is wrong.

Your answer is wrong because the parentheses do not belong to your set of symbols. The string (ab)aab cannot be generated using only symbols present in the {a,b} set.
Moreover, you were asked to provide a string of 5 symbols, but (ab)aab has length 7.
Parentheses have special meaning in regex. They create sub-regexps and capturing groups. For example, (ab)* means ab can be matched any number of times, including zero. Without parentheses, ab* means the regex matches one a followed by any number of bs. That's a different expression.
For example:
the regular expression (ab)* matches the empty string (ab zero times), ab, abab, ababab, abababab and so on;
the regular expression ab* matches a (followed by zero bs), ab, abb, abbb, abbbb and so on.
The first set of parentheses in your example is useless if you are looking only for sub-regexps. Both (ab) and ab expressions match only the ab string. But they can be used to capture the matched part of the string and re-use it either with back references or for replacement.
When parentheses are used for sub-expressions in regular expressions, they are meta-characters and do not match anything in the string. In order to match an open parenthesis character ( found in the string, you have to escape it in the regex: \(.
Several strings that match the regular expression (ab)(ab)*(aa|bb)*b over Sigma = { 'a', 'b' }: abb, ababb, abababababb, ababababaabbaaaabbb.
The last string (ababababaabbaaaabbb) matches the regex pieces as follows:
ab - (ab)
ababab - (ab)* - ('ab' 3 times)
aabbaaaabb - (aa|bb)* - ('aa' or 'bb', 5 times in total)
b - b
A regex that matches the string (ab)aab is \(ab\)(ab)*(aa|bb)*b but in this case
Sigma = { 'a', 'b', '(', ')' }
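The claims above can be checked mechanically. This is only a sketch: it uses Python's re module as a stand-in for the formal regular expression, and the helper name in_language is made up.

```python
import re

RE = re.compile(r"(ab)(ab)*(aa|bb)*b")

def in_language(s):
    """True if the entire string s is matched by RE."""
    return RE.fullmatch(s) is not None

results = {s: in_language(s)
           for s in ["abaab", "(ab)aab", "abb", "ababb", "ababababaabbaaaabbb"]}

# With escaped parentheses, the literal characters '(' and ')' are matched.
escaped = re.fullmatch(r"\(ab\)(ab)*(aa|bb)*b", "(ab)aab") is not None

print(results["abaab"], len("abaab"))      # True 5
print(results["(ab)aab"], len("(ab)aab"))  # False 7
print(escaped)                             # True
```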

Regular expression replace, back referencing replaced characters sublime text 3

I have a file with the following lines:
A 123
B 323
Each line starts with either A or B, and is followed by a blank and a number.
I am trying to convert this into
'C [a-z]*A 123'
for each line. I use a regex in Find and replace. The regex [AB] [0-9]* selects all the lines without a problem. I'm trying to replace it with 'C [a-z]*$1' that does not print $1 in the replaced string, and returns:
'C [a-z]*'
What am I missing?

Your regex - [AB] [0-9]* - has no round brackets (i.e. no capturing groups, which must be present if you wish to reference the captured subtexts later in the replacement string), and thus you do not get the expected result.
You can use
(?m)^[AB][ ]([0-9]{3})
Or, if the digits are optional, use the * quantifier, which means "match 0 or more occurrences of the preceding subpattern":
(?m)^[AB][ ]([0-9]*)
And replace with
'C [a-z]*$1'
See demo
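To illustrate how the position of the capturing group decides what the back-reference inserts, here is a sketch in Python rather than Sublime Text. Note the assumption: Python's re.sub writes the reference as \1 where Sublime Text uses $1. The variant that wraps the whole line in the group is my own addition, shown because the question asks for the letter to be kept as well.

```python
import re

text = "A 123\nB 323"

# Group 1 wraps the whole line, so \1 re-inserts "A 123" / "B 323".
whole_line = re.sub(r"(?m)^([AB] [0-9]+)$", r"'C [a-z]*\1'", text)

# Group 1 wraps only the digits, as in the first pattern suggested above.
digits_only = re.sub(r"(?m)^[AB][ ]([0-9]{3})", r"'C [a-z]*\1'", text)

print(whole_line)
# 'C [a-z]*A 123'
# 'C [a-z]*B 323'
print(digits_only)
# 'C [a-z]*123'
# 'C [a-z]*323'
```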

What does the RegEx (a+b)^n(c+d)^m match?

I'm unsure of what this RegEx matches:
(a+b)^n(c+d)^m
I know that the + metacharacter means "one or more times the preceding pattern". So, a+ would match one or more as whereas a* also includes the empty string.
But I think that in this case, the RegEx means a or b to the nth time concatenated with c or d to the mth time, so it'd match strings like these:
aaaacc (n=4, m=2)
bbbbbdddd (n=5, m=4)
aaaddddd (n=3, m=5)
bc (n=1, m=1)
aaaaaaaaaaaaccccc (n=12, m=5)
...
Is this correct? If it's not, can anyone provide examples of what this RegEx does match?

It doesn't look like a valid regular expression, given the incorrect use of ^.
^ should either be inside []'s like this [^a], or at the very start of the regular expression.
+ just means 1 or more occurrences of the preceding character.
If ^n means "can be repeated n times", then these would be matches:
aaaaaabccccccccd,
aaaaaabaaaaaabaaaaaabccccccccdccccccccd

Apparently (a+b)^n(c+d)^m is formal-language notation, where + is union and ^n is n-fold repetition, so it means "n slots for unordered a's and b's followed by m slots for unordered c's and d's".
For example, one string in (a+b)^10(c+d)^5 would be: aaaababbbadcccd
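Under that reading, the formal expression can be translated into practical regex syntax as [ab]{n}[cd]{m}. A sketch in Python, where formal_power_regex is a hypothetical helper written only for this illustration:

```python
import re

def formal_power_regex(n, m):
    """Build a pattern for (a+b)^n (c+d)^m, reading '+' as union
    and '^n' as n-fold repetition."""
    return re.compile(rf"[ab]{{{n}}}[cd]{{{m}}}")

pattern = formal_power_regex(10, 5)

ok = pattern.fullmatch("aaaababbbadcccd") is not None  # 10 a/b symbols, then 5 c/d symbols
too_short = pattern.fullmatch("aaaacc") is not None    # only fits n=4, m=2

print(ok, too_short)  # True False
print(formal_power_regex(4, 2).fullmatch("aaaacc") is not None)  # True
```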

If you're using Perl regular expressions with the 'm' option, e.g. /(a+b)^n(c+d)^m/m, the
'^' will match an internal beginning of line. So...
/
(a+b) # Match one or more as followed by b
^n # Match the beginning of a line followed by a literal n.
(c+d) # Match one or more cs followed by d
^m # Match the beginning of a line followed by a literal m.
/mx
(a+b) and (c+d) would be available in $1 and $2.

Regex Expression Differences

I would like to understand the difference between the following 3 regular expressions.
I wanted to display all the lines in a file that consist only of lowercase letters.
Here are the 3 regular expressions I tried:
cat filename.txt | grep ^[a-z]*
Regex Description: This will display all the lines starting with 0 or more lowercase letters. So, it will match any of the following:
zapato
113078
OLIVIA
Not exactly, what we wanted.
cat filename.txt | grep ^[a-z]*$
Regex Description: This will display all the lines starting with 0 or more lowercase letters till the end of the line. This matches the following:
fubuki
BALLIN
Kristine
This time there were no results with digits in them.
cat filename.txt | grep ^[a-z]*[a-z]$
Regex Description: This one works well for me. It searches for all the lines starting with 0 or more lowercase letters and it matches it till it finds another lowercase letter. For some reason, this works for me. However, I want to know how this is different from the previous regular expressions.
tonia
ecurby
totonno
Also, when the asterisk (*) in the regular expression means 0 or more, then it should include all the results when I write ^[a-z]

Short explanations of your regular expressions:
^[a-z]*
Match string starting with 0 or more characters from [a-z].
Matches empty string and every string starting with character of set [a-z].
^[a-z]*$
Match string containing nothing but 0 or more characters from [a-z].
Matches empty string and every string containing only characters of set [a-z].
^[a-z]*[a-z]$
Match string starting with 0 or more characters from [a-z] followed by exactly one last character from [a-z].
Matches every non-empty string containing only characters of set [a-z].
Use this instead of your current third option:
^[a-z]+$
It is semantically equivalent but simpler.
The expression x*x (or xx*) is equivalent to x+ in regular expressions (with x being any expression). The latter is basically just syntactic sugar for either of the former more verbose expressions.
Or put differently: while * means 0 or more, + means 1 or more.
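The three patterns can be compared side by side. This is a sketch in Python's re module rather than grep, with a few sample lines borrowed from the question plus two I made up ("abc123" and the empty line); the function name grep is just a label for the helper.

```python
import re

lines = ["zapato", "113078", "OLIVIA", "tonia", "abc123", ""]

def grep(pattern, lines):
    """Return the lines in which `pattern` is found, like a minimal grep."""
    return [line for line in lines if re.search(pattern, line)]

prefix_only = grep(r"^[a-z]*", lines)      # zero letters is allowed, so every line matches
whole_line = grep(r"^[a-z]*$", lines)      # lowercase-only lines, including the empty line
non_empty = grep(r"^[a-z]*[a-z]$", lines)  # lowercase-only lines with at least one letter
plus_form = grep(r"^[a-z]+$", lines)       # same result, simpler pattern

print(prefix_only == lines)    # True
print(whole_line)              # ['zapato', 'tonia', '']
print(non_empty == plus_form)  # True
```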

Regular Expression to match set of arbitrary codes

I am looking for some help on creating a regular expression that would work with a unique input in our system. We already have some logic in our keypress event that will only allow digits, and will allow the letter A and the letter M. Now I need to come up with a RegEx that can match the input during the onblur event to ensure the format is correct.
I have some examples below of what would be valid. The letter A represents an age, so it is always followed by up to 3 digits. The letter M can only occur at the end of the string.
Valid Input
1-M
10-M
100-M
5-7
5-20
5-100
10-20
10-100
A5-7
A10-7
A100-7
A10-20
A5-A7
A10-A20
A10-A100
A100-A102
Invalid Input
a-a
a45
4

This matches all of the samples.
/A?\d{1,3}-A?\d{0,3}M?/
Not sure if 10-A10M should or shouldn't be legal, or even if M can appear with numbers. If M can only appear on its own, without numbers:
/A?\d{1,3}-(A?\d{1,3}|M)/

Use the brute force method if you have a small amount of well defined patterns so you don't get bad corner-case matches:
^(\d+-M|\d+-\d+|A\d+-\d+|A\d+-A\d+)$
Here are the individual regexes broken out:
\d+-M <- matches anything like '1-M'
\d+-\d+ <- 5-7
A\d+-\d+ <- A5-7
A\d+-A\d+ <- A10-A20

/^[A]?[0-9]{1,3}-[A]?[0-9]{1,3}[M]?$/
Matches anything of the form:
A(optional)[1-3 numbers]-A(optional)[1-3 numbers]M(optional)

^A?\d+-(?:A?\d+|M)$
An optional A followed by one or more digits, a dash, and either another optional A and some digits or an M. The '(?: ... )' notation is a Perl 'non-capturing' set of parentheses around the alternatives; it means there will be no '$1' after the regex matches. Clearly, if you wanted to capture the various bits and pieces, you could - and would - do so, and the non-capturing clause might not be relevant any more.
(You could replace the '+' with '{1,3}' as JasonV did to limit the numbers to 3 digits.)

^A?\d{1,3}-(M|A?\d{1,3})$
^ -- the match must be done from the beginning
A? -- "A" is optional
\d{1,3} -- between one and 3 digits; [0-9]{1,3} also works
- -- A "-" character
(...|...) -- Either one of the two expressions
(M|...) -- Either "M" or...
(...|A?\d{1,3}) -- an optional "A" followed by at least one and at most three digits
$ -- the match should be done to the end
Some consequences of changing the format. If you do not put "^" at the beginning, the match may ignore an invalid beginning. For example, "MAAMA0-M" would be matched at "A0-M".
If, likewise, you leave $ out, the match may ignore an invalid trail. For example, "A0-MMMMAAMAM" would match "A0-M".
Using \d is usually preferred, as is \w for alphanumerics, \s for spaces, \D for non-digit, \W for non-alphanumeric or \S for non-space. But you must be careful that \d is not being treated as an escape sequence. You might need to write it \\d instead.
{x,y} means the last match must occur between x and y times.
? means the last match must occur once or not at all.
When using (), it is treated as one match. (ABC)? will match ABC or nothing at all.
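The pattern and the two anchor examples above can be exercised against the sample inputs from the question. The question is about a browser onblur handler, so treat this Python version as a sketch of the matching logic only; the names CODE and is_valid are made up.

```python
import re

# Pattern from the answer above, copied unchanged.
CODE = re.compile(r"^A?\d{1,3}-(M|A?\d{1,3})$")

valid = ["1-M", "10-M", "100-M", "5-7", "5-20", "5-100", "10-20", "10-100",
         "A5-7", "A10-7", "A100-7", "A10-20", "A5-A7", "A10-A20",
         "A10-A100", "A100-A102"]
invalid = ["a-a", "a45", "4"]

def is_valid(code):
    """True if `code` matches the anchored pattern."""
    return CODE.match(code) is not None

all_valid_accepted = all(is_valid(c) for c in valid)
all_invalid_rejected = not any(is_valid(c) for c in invalid)

# Without the anchors, a valid fragment inside a bad string is still found.
unanchored = re.search(r"A?\d{1,3}-(M|A?\d{1,3})", "MAAMA0-M").group(0)

print(all_valid_accepted, all_invalid_rejected)  # True True
print(unanchored)                                # A0-M
print(is_valid("MAAMA0-M"), is_valid("A0-MMMMAAMAM"))  # False False
```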

I’d use this regular expression:
^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$
This matches either:
<number>-M or <number>-<number>
A<number>-<number> or A<number>-A<number>
Additionally <number> must not begin with a 0.
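A short check of what this stricter pattern accepts, sketched in Python. The test codes other than those taken from the question ("05-7", "A0-7", "10-A10", "A5-M") are my own, chosen to probe the leading-zero rule and the two alternatives.

```python
import re

# Pattern from the answer above, copied unchanged.
NO_LEADING_ZERO = re.compile(
    r"^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$"
)

def accepted(code):
    """True if `code` matches the stricter pattern."""
    return NO_LEADING_ZERO.match(code) is not None

checks = {code: accepted(code)
          for code in ["5-100", "A10-A20", "10-M", "05-7", "A0-7", "10-A10", "A5-M"]}

for code, ok in checks.items():
    print(code, ok)
```

With this pattern, 5-100, A10-A20 and 10-M are accepted, while 05-7 and A0-7 fail the leading-zero rule, and 10-A10 and A5-M fail because neither alternative allows that combination.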
