I have a problem in solving the following exercise and I'd appreciate any help.
Let Σ = {a,b}. I need to give a regular expression for all strings containing an odd number of a's.
Thank you for your time
b*(ab*ab*)*ab*
The main part of it is (ab*ab*)*, which enumerates all possibilities of an even number of a's. Then, at the end, one extra a has to exist to make the count odd.
Notice that this regular expression is equivalent to:
b*a(b*ab*a)*b*
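Both constructs have the shape described by the pumping lemma for regular languages, that is, a fixed part wrapped around a block that can be repeated any number of times: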
http://en.wikipedia.org/wiki/Pumping_lemma
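A brute-force check is one way to gain some confidence in both expressions. The sketch below is in Python and is not part of the original answer; the helper name first_mismatch and the length bound of 8 are assumptions made for illustration. It compares each pattern against a direct count of a's for every string over {a, b} up to that length.

```python
import re
from itertools import product

# The two expressions from the answer, compiled once.
FIRST = re.compile(r"b*(ab*ab*)*ab*")
SECOND = re.compile(r"b*a(b*ab*a)*b*")


def first_mismatch(max_len=8):
    """Return the first string where a pattern disagrees with a direct count, else None."""
    for length in range(max_len + 1):
        for chars in product("ab", repeat=length):
            s = "".join(chars)
            expected = s.count("a") % 2 == 1
            for pattern in (FIRST, SECOND):
                # fullmatch requires the whole string to match, like ^...$ anchors.
                if (pattern.fullmatch(s) is not None) != expected:
                    return s
    return None


print(first_mismatch())
```

A result of None only means no counterexample exists up to the chosen length; it is a sanity check, not a proof.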
UPDATE:
@MahanteshMAmbi raised the concern that the regular expression matches the case aaabaaa. In fact, it doesn't. If we run grep, we can see clearly what is matched.
$ echo aaabaaa | grep -P -o 'b*(ab*ab*)*ab*'
aaabaa
a
The -o option of grep prints each match on its own line. In this case, as we can see, the regular expression matches twice: one match contains 5 a's, the other contains 1 a. The apparent error in my comment below was caused by an improper test case, not by an error in the regular expression.
To make it rigorous enough for real-life use, it's better to add anchors to the expression to force a complete string match:
^b*(ab*ab*)*ab*$
therefore:
$ echo aaabaaa | grep -P -q '^b*(ab*ab*)*ab*$'
$ echo $?
1
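The same distinction between "matches somewhere in the string" and "matches the whole string" can be reproduced outside grep. This is a minimal Python sketch, assuming Python's re module as a stand-in for grep -P; the variable names are illustrative only.

```python
import re

PATTERN = r"b*(ab*ab*)*ab*"
TEXT = "aaabaaa"

# Like grep -o: collect every non-overlapping match found while scanning the string.
partial_matches = [m.group(0) for m in re.finditer(PATTERN, TEXT)]

# Like the anchored ^...$ form: the whole string has to match.
whole_match = re.fullmatch(PATTERN, TEXT)

print(partial_matches)  # ['aaabaa', 'a']
print(whole_match)      # None
```

Each partial match contains an odd number of a's, while the string as a whole, with its six a's, is rejected.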
^[^a]*a(?=(?:[^a]*a[^a]*a)*[^a]*$).*$
This will match only strings with an odd number of a's, for any generic string. See demo.
https://regex101.com/r/eS7gD7/22
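A minimal Python sketch for checking a lookahead pattern of this shape against a direct count. The pattern in the code keeps [^a]* inside the repeated group, so that other characters may sit between one pair of a's and the next. The names ODD_A, has_odd_a and first_disagreement, the alphabet "abc" and the length bound are assumptions made for illustration.

```python
import re
from itertools import product

# First a, then a lookahead requiring an even number of a's in the rest.
ODD_A = re.compile(r"^[^a]*a(?=(?:[^a]*a[^a]*a)*[^a]*$).*$")


def has_odd_a(s):
    """True if the pattern accepts s."""
    return ODD_A.match(s) is not None


def first_disagreement(alphabet="abc", max_len=6):
    """Return the first string where the pattern and a direct count disagree, else None."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if has_odd_a(s) != (s.count("a") % 2 == 1):
                return s
    return None


print(has_odd_a("banana"), has_odd_a("data"), has_odd_a("aaabaa"))
print(first_disagreement())
```

The string aaabaa is a useful test case because a b falls between the second and third pair boundary; a group written as (?:a[^a]*a) with no room for such characters would reject it.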
I tried this regex:
ab(cd|c)*d
in the regex101 and RegExr websites. It matched this text completely:
abcdcdd
Now let's swap "cd" and "c" in the regex:
ab(c|cd)*d
When I try this regex in the websites, I see this regex does not completely match the same text.
Why doesn't the regex engine recognize that ab(cd|c)*d and ab(c|cd)*d are the same, and how can I persuade ab(c|cd)*d to match the longest string?
REGEX: ab(cd|c)*d
Complete text matched in 13 steps: abcdcdd
REGEX: ab(c|cd)*d
Partial text matched in 9 steps: abcdcdd
@MurrayW's answer is excellent, but I would like to add some background information.
Regex as Finite State Automata
When I first learned regular expressions in university, we learned to convert them to finite state automata, essentially compiling them into graphs that were then processed to match the string. When you do that, (cd|c) and (c|cd) get compiled into the same graph, in which case both of your regular expressions would match the whole string. This is what grep actually does:
Both
echo abcdcdd | grep --color -E 'ab(c|cd)*d'
and
echo abcdcdd | grep --color -E 'ab(cd|c)*d'
color the whole string in red.
Patterns we call "regular expressions"
True finite state automata have many limitations that programmers don't like, such as the inability to capture matching groups or to reuse those groups later in the pattern, and other limitations I forget, so the regular expression libraries that we use in most programming languages implement more complex formalisms. I don't remember what they are exactly, maybe push-down automata, but we have memory, we have backtracking, and all sorts of good stuff we use without thinking about it.
At the risk of seeming pedantic, the patterns we use are not "regular" at all. I know, the difference is usually not relevant, we just want our code to work, but once in a while it matters.
So, while the regular expressions (cd|c) and (c|cd) would be compiled into the same finite state automaton, those two (non-regular) patterns are instead turned into logic that says try the variants from left to right, and backtrack only if the rest of the pattern fails to match later, hence the results you observed.
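The left-to-right behaviour can be observed directly in any backtracking engine. Here is a minimal sketch using Python's re module, chosen only as a convenient example of such an engine (the question used regex101 and RegExr); the variable names are illustrative.

```python
import re

TEXT = "abcdcdd"

# Unanchored: the engine stops at the first overall success it finds.
cd_first = re.match(r"ab(cd|c)*d", TEXT).group(0)
c_first = re.match(r"ab(c|cd)*d", TEXT).group(0)

# Requiring the whole string to match turns the short match into a failure,
# so the engine backtracks into the alternation and tries cd.
c_first_whole = re.fullmatch(r"ab(c|cd)*d", TEXT).group(0)

print(cd_first)       # abcdcdd
print(c_first)        # abcd
print(c_first_whole)  # abcdcdd
```

So one way to get the longer match out of ab(c|cd)*d is to anchor the pattern (for example with ^ and $), which forces the engine to keep trying alternatives until the end of the string is reached.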
Speed
While the patterns our "regular expression" libraries support offer us lots of goodies we like, those come at a performance cost. True regular expressions are blazingly fast, while our patterns, though usually fast, can sometimes be very expensive. Search for "catastrophic backtracking" on this site for many examples of patterns that take exponential time to fail. The same patterns, used with grep, would be compiled into a graph that is applied in linear time to the string to match no matter what.
Because the | character performs an or operation by testing the left-most alternative first. If that matches, the other alternatives are not tested unless the rest of the pattern fails later and the engine has to backtrack. If it fails, the next alternative is tested, and so on.
Using regex pattern ab(cd|c)*d, you can see that the cd part of (cd|c)* matches in your string and is repeated: ab, cd, cd, and then the final d.
However, in pattern ab(c|cd)*d, the c alternative matches the first c of abcdcdd, so cd isn't tested at that position. Then the d at the end of the pattern matches the d after that first c and the pattern stops, having matched only abcd.
As previously answered in the comments, they are not the same patterns. The alternation in the first one tries to match cd first, the second one c first.
First pattern
abcdcdd
^^^^
||
||
ab(cd|c)*d
Second pattern
abcdcdd
^^____
| |
| |
ab(c|cd)*d
If the d is optional, you can omit the pipe for the alternation and make the d optional.
ab(cd?)*d.
Regex demo
Note that this way you repeat the capturing group which will hold the value of the last iteration.
If you are not interested in the value of the group and non-capturing groups are supported, you could use ab(?:cd?)*d.
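A short Python sketch of both variants, assuming Python's re module as the engine; it shows that the optional d makes the whole string match and that a repeated capturing group keeps only its last iteration.

```python
import re

TEXT = "abcdcdd"

capturing = re.match(r"ab(cd?)*d", TEXT)
non_capturing = re.match(r"ab(?:cd?)*d", TEXT)

print(capturing.group(0))      # abcdcdd (the whole string is matched)
print(capturing.group(1))      # cd (value of the last iteration only)
print(non_capturing.group(0))  # abcdcdd
print(non_capturing.groups())  # () because nothing is captured
```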
Regex is always a left to right proposition.
The only way a regex engine will ignore a previous alternation construct
is if it has to satisfy a term on the right side of the alternation group
that cannot be satisfied otherwise.
The regex rule is that the pattern is traversed from left to right,
but is controlled by the target string being traversed from left to right.
The symbiosis ..
Given the target string was matched like so "abcdcdd"
it's easy to assume that the regex subset of the full regex
ab
( c | cd )* # (1)
d
is clearly
ab
c*
d
where the cd term of the alternation to the right was never needed
for a successful match.
This shows that backtracking regex engines have a left-to-right bias.
Given a bash variable holding the following string:
INPUT="Cookie: cf_clearance=foo; __cfduid=bar;"
Why is the substitution ${INPUT/cf_clearance=[^;]*;/} producing the output: Cookie: instead of what I'd expect: Cookie: __cfduid=bar;
Testing the same regex in online regex validators confirms that cf_clearance=[^;]*; should match cf_clearance=foo; only, and not the rest of the string.
What am I doing wrong here?
Use the actual regular-expression matching features instead of parameter expansion, which works with patterns.
[[ $INPUT =~ (.*)(cf_clearance=[^;]*;)(.*) ]]
ans=${BASH_REMATCH[1]}${BASH_REMATCH[3]}
You can also use an extended pattern, which is equivalent to a regular expression in power:
shopt -s extglob
echo "${INPUT/cf_clearance=*([^;]);/}"
Use sed:
INPUT=$(sed 's/cf_clearance=[^;]*;//' <<< "$INPUT")
Like you have been told in comments, bash parameter substitution only supports glob patterns, not regular expressions. So the problem is really with your expectation, not with your code per se.
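To see why the glob reading removes so much, it can help to translate both readings into regular expressions. The Python sketch below is only an approximation of what bash does: it assumes that the glob * behaves like the regex .* and that the substitution replaces the longest match, and the variable names are illustrative.

```python
import re

INPUT = "Cookie: cf_clearance=foo; __cfduid=bar;"

# Glob reading: [^;] is exactly one character, then * means "any string",
# and the longest possible match is taken.
glob_like = re.sub(r"cf_clearance=[^;].*;", "", INPUT, count=1)

# Regex reading: [^;]* means "zero or more characters other than ;".
regex_like = re.sub(r"cf_clearance=[^;]*;", "", INPUT, count=1)

print(repr(glob_like))   # 'Cookie: '
print(repr(regex_like))  # 'Cookie:  __cfduid=bar;'
```

Under the glob reading the match runs to the last semicolon in the string, which is the output the question observed.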
If you know that the expression can be anchored to the beginning of the string, you can use the ${INPUT#prefix} parameter substitution to grab the shortest possible match, and add back the Cookie: in front:
echo "Cookie: ${INPUT#Cookie: cf_clearance=*;}"
If you don't have this guarantee, something very similar can be approximated with a pair of parameter substitutions. Find which part precedes cf_clearance, find which part follows after the semicolon after cf_clearance; glue them together.
head=${INPUT%cf_clearance=*}
tail=${INPUT#*cf_clearance=*;}
echo "$head$tail"
(If you are not scared of complex substitutions, the temporary variables aren't really necessary or useful.
echo "${INPUT%cf_clearance=*}${INPUT#*cf_clearance=*;}"
This is a little dense even for my sophisticated taste, though.)
I need a regular expression (grep -e "__") that matches all lines containing if and just one = (ignoring lines containing ==).
I tried this:
grep -e "if.*=[^=]"
but = is not a character class, so it doesn't work.
The problem is .* may contain an =.
I'd suggest
grep -e "if[^=]*=[^=]"
If your goal is to find lines of code with an if containing an erroneous assignment instead of a comparison, I'd suggest using a linter (which would be based on a robust parser instead of just regexes). The linter to use depends on the language of the code, of course (for example I use this one in JavaScript).
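The difference between .* and [^=]* in the two grep patterns can be checked on a few sample lines. This is a minimal Python sketch; the sample lines are made up for illustration, and Python's re.search stands in for grep's line matching.

```python
import re

# Hypothetical sample lines, not taken from the question.
LINES = [
    "if (a = b) {",
    "if (a == b) {",
    "while (a = b) {",
]

# .* may run past the first =, so the second = of == followed by a space matches.
with_dot_star = [line for line in LINES if re.search(r"if.*=[^=]", line)]

# [^=]* cannot cross an =, so the first = after if must be followed by a non-=.
with_negated_class = [line for line in LINES if re.search(r"if[^=]*=[^=]", line)]

print(with_dot_star)       # ['if (a = b) {', 'if (a == b) {']
print(with_negated_class)  # ['if (a = b) {']
```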
I have spent the past few hours trying to get a regular expression right and have had no luck. Its purpose is to search through a file list and pull out the entries that contain any of the following: OL####, DE####, DEA####, OLA####. Thus far I have gotten the following to sort of work.
grep "\<[DE\b|DEA\b|OL\b|OLA\b]\+[0-9]"
However it still finds things such as "E1" and pulls those lines out. What am I missing? I am very new to regular expressions and am trying to learn as I go.
Try this:
grep -oE '\b(OL|DE|DEA|OLA)[0-9]+\b' file
You can't use alternation inside of a character class. A character class defines a set of characters and means "match one character from this set". Use a grouping construct instead, as in the command above.
I would try the following to match the lines:
grep -E '\b(DEA?|OLA?)[0-9]+'
If you only want the substring, use the following:
grep -Eo '\b(DEA?|OLA?)[0-9]+'
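The difference between a character class and a group can be seen on a small sample. In this Python sketch the sample text is made up, and CLASS_PATTERN is only a rough Python approximation of the bracket expression from the question (in a POSIX bracket expression the backslash, b and | are ordinary members of the set).

```python
import re

# Hypothetical sample text containing valid codes and two near misses.
TEXT = "OL1234 DE5678 DEA0001 OLA4321 E1 XDE12"

# A set of single characters: any run of D, E, A, O, L, b, | or \ before a digit.
CLASS_PATTERN = r"\b[DEAOLb|\\]+[0-9]"

# A group of alternatives: DE or DEA, OL or OLA, then digits.
GROUP_PATTERN = r"\b(?:DEA?|OLA?)[0-9]+\b"

print(re.search(CLASS_PATTERN, "E1") is not None)  # True: the class accepts a lone E
print(re.findall(GROUP_PATTERN, TEXT))
```

With the group, E1 and XDE12 are no longer picked up, because the prefix has to be one of the listed alternatives and has to start at a word boundary.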
You need to replace your square brackets with round ones and remove the +:
grep -P "<(DE|DEA|OL|OLA)[0-9]"
Also note that in GNU grep \< is a start-of-word anchor, whereas an unescaped < matches a literal < character. I'm assuming you intended to match a literal < there, since it's not in your example strings; if not, drop it.
I want to increase a value. For example:
Jerry1
Jerry2
Jerry3
Jerry4
I want to change that to:
Jerry2
Jerry3
Jerry4
Jerry5
How can I change it?
Don't try to abuse regular expressions for everything.
By design, regular expressions are not meant to support counting. The reason is simple: for that you need at least a type-2 language, and processing those is significantly more complex than for type-3 ("regular") languages.
See Wikipedia for details: https://en.wikipedia.org/wiki/Chomsky_hierarchy
So by the definition, once you fully support counting it probably no longer is a regular language.
There are extensions around, for example Perl extended regular expressions, that do allow solving this particular problem. But essentially, they are no longer regular expressions; they invoke an external function to do the work.
The following perl extended regular expression should do what you want:
s/(-?\d+)/$1 + 1/eg
but essentially, only the matching part is a regular expression; the substitution is Perl, and so Turing complete. The e flag indicates that the right-hand side should be evaluated as Perl code, not as a regexp substitution string.
You can of course do this trick in pretty much any other regular expression engine. Match, then compute the increment, then substitute the match with the new value.
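As an illustration of that match, compute, substitute approach in another engine, here is a minimal Python sketch. The function name increment_numbers is made up for this example; re.sub accepts a function that computes the replacement for each match.

```python
import re


def increment_numbers(text):
    """Replace every (optionally negative) integer in text with its value plus one."""
    # The regular expression only finds the numbers; the arithmetic is plain Python.
    return re.sub(r"-?\d+", lambda m: str(int(m.group(0)) + 1), text)


print(increment_numbers("Jerry1"))  # Jerry2
print(increment_numbers("Test 123 test 0 Banana9 -17 3 route66"))
```

As in the Perl version, only the matching is done by a regular expression; the increment itself is ordinary code.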
Full perl filter demo:
> echo 'Test 123 test 0 Banana9 -17 3 route66' | perl -pe 's/(-?\d+)/$1+1/eg'
Test 124 test 1 Banana10 -16 4 route67
The -p flag makes perl read standard input and apply the program to each line, then output the result. That is why the actual script consists of the substitution only. This is what makes Perl so popular for Unix scripting. You can even mass-apply this filter to a whole set of files (see -i for in-place modification, and the perlrun manual page). Note that the backup suffix must be attached directly to -i, with no space. So in order to modify a whole set of files in place (backups will be postfixed with .bak):
perl -p -i.bak -e 's/(-?\d+)/$1+1/eg' <filenames>