Regular expression theory

I have a little problem with RE theory.
Given the alphabet {0, 1}, I have to create a regular expression that matches all strings that do NOT contain the substring 111.
I can't work out how to do it, even for a simpler substring like 00.
Edit: The solution must contain only the three standard operations: concatenation, alternation, and Kleene star, as you can see in the wiki link.
Thank you.

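As far as I understand, the language you want to regexify is not allowed to contain three or more consecutive 1's. Such a regexp could be (0|10|110)*(ε|1|11): every block ends in a 0 preceded by at most two 1's, and the string may finish with up to two trailing 1's.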

How about this:
{ε|1}{ε|1}{ε|{0{ε|1}{ε|1}}*}
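One way to gain confidence in an expression like this is to compare it with a brute-force check over all short strings. Here is a rough sketch in Python, where ε is written as an empty alternative and parentheses replace the braces used for grouping above. The names BRACE_FORM, BLOCK_FORM and agrees_with_spec are made up for illustration. BLOCK_FORM is another way to write the same language: blocks that each end in 0, followed by at most two trailing 1's.

```python
import re
from itertools import product

# The brace expression above, with ( ) for grouping and an empty alternative for ε.
BRACE_FORM = re.compile(r"(|1)(|1)(|(0(|1)(|1))*)")
# Block form: blocks that each end in 0, then at most two trailing 1's.
BLOCK_FORM = re.compile(r"(0|10|110)*(|1|11)")


def agrees_with_spec(pattern, max_len=10):
    """Return True if pattern accepts exactly the strings without '111', up to max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            accepted = pattern.fullmatch(s) is not None
            if accepted != ("111" not in s):
                return False
    return True
```

Calling agrees_with_spec on either pattern only tests strings up to max_len, so it is a sanity check rather than a proof.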

Back in the days when we didn't have the ?! negative lookahead facility I would use a negation match. So for grep I would
grep -v (pattern I'm searching for) someFile.txt
which would give the lines in the file that don't contain the pattern.
In Perl I would use the
!~
negation matcher rather than the usual
=~
matcher.
I don't know which regex variant you are using, but I'm struggling to see a way to solve your problem without either an overall negation or a ?! negative lookahead.

Related

PERL Regex Negation Issue

I have written a regex to pick files of the format
(ABC.*\.DAT) in perl.
How to write a negation for the above regex?
I already tried expressions like (?!ABC.*)\.DAT or (?!(ABC.*\.DAT))
Any help is appreciated.

(?s:(?!ABC).)*\.DAT
You can try this negation based regex. See demo.
The above can be safely embedded into a larger pattern. For example,
/^(?:(?!ABC).)*\.DAT\z/s
If you are trying to match the whole input, and if ABC doesn't end with ., .D, .DA or .DAT, then the following will be faster:
/^(?!.*ABC).*\.DAT\z/s
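For anyone who wants to try the idea outside Perl, here is a small sketch of the same "tempered dot" technique in Python. It uses fullmatch in place of the ^ and \z anchors, and the helper name is_dat_without_abc is just made up for the example.

```python
import re

# Tempered dot: consume one character at a time, but never at a position where "ABC" starts.
NO_ABC_DAT = re.compile(r"(?:(?!ABC).)*\.DAT", re.DOTALL)


def is_dat_without_abc(name):
    """True if the whole name ends in .DAT and 'ABC' does not occur before that suffix."""
    return NO_ABC_DAT.fullmatch(name) is not None
```

With this, a name like XYZ.DAT is accepted, while ABC123.DAT and fooABC.DAT are rejected.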

regular expression and substitution

In LaTeX, I have a lot of math expressions with subscripts written with the digits 1, 2 and 3; now I need to change them to \alpha, \beta and \gamma.
For example:
$E_{223}$ to $E_{\beta\beta\gamma}$
and
$E_{31}$ to $E_{\gamma\alpha}$
However, I also have exponents, which are not supposed to be altered: $E^3_{112}$ should be changed to $E^3_{\alpha\alpha\beta}$.
Is there a way to use regular expressions to make this task easier? I know some regular expressions from Unix and Perl, but that seems inadequate for this problem.
Thank you for anything!

I'm not 100% familiar with Latex, but typical regex would look like this:
(?<!\^)#
Where the # is 1, 2 or 3. Then, in your replace, you would replace the matches with \alpha, \beta and \gamma. The (?<!\^) is a negative look-behind that says to only replace instances of that number when they aren't preceded by a ^ character (your power indicator).
If typical regex doesn't permit, I'll delete my answer.

In Perl you could do things like:
$text =~ s#\$\w[^${\s]*_{\K([123]+)(?=}\$)#
local $_ = $1;
s/1/\\alpha/g; s/2/\\beta/g; s/3/\\gamma/g;
$_
#ge;
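The same approach (match the whole subscript group, then translate its digits in a callback) can be sketched in Python. This is only an illustration: it assumes every subscript is written as _{...} and contains nothing but the digits 1 to 3, and GREEK, SUBSCRIPT and greekify are made-up names.

```python
import re

GREEK = {"1": r"\alpha", "2": r"\beta", "3": r"\gamma"}

# Only digit runs inside _{...} are matched, so exponents like ^3 are never touched.
SUBSCRIPT = re.compile(r"_\{([123]+)\}")


def greekify(tex):
    """Replace each digit of a _{...} subscript with its Greek-letter macro."""
    def translate(match):
        letters = "".join(GREEK[d] for d in match.group(1))
        return "_{" + letters + "}"
    return SUBSCRIPT.sub(translate, tex)
```

Because the replacement is a function, the backslashes in the macros are inserted literally and need no extra escaping.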

Try these:
replace (?<!\^\d|\d{2}|\d{3}|\d{4})1 with \alpha
replace (?<!\^\d|\d{2}|\d{3}|\d{4})2 with \beta
replace (?<!\^\d|\d{2}|\d{3}|\d{4})3 with \gamma
Edit: These regexes make sure that a digit in an exponent is not replaced. You may have to tweak them to allow an optional - if you have negative exponents.
Edit 2: @QTax pointed out that you can't use variable-length lookbehinds:
Subexp of look-behind must be fixed character length.
But different character length is allowed in top level alternatives only.
Reference: http://tacosw.com/latexian/help/find/regex.html

I don't know what editor or regex engine you're using for this, but here's the basic idea I'd go with in Perl-ish regex:
Replace this:
(?<=\{\d*)1(?=\d*\})
With this:
\\alpha
I think you'll want to set the g flag as well.
Not sure if I have the right escaping syntax (it's been a while since I touched Perl) but I think so.
Repeat as necessary for \beta, \gamma, etc.

Regex: Does not have/include pattern

I have a regex pattern to match an HTML script tag. How can I change this script tag pattern so that the pattern means "input string DOES NOT MATCH" the script tag pattern?
In other words, given a pattern, what is the alteration needed to change the meaning of the pattern to "does not match this pattern"?
For example, if I have a pattern: \d{3}-\d{3}-\d{4}, what is the equivalent pattern for this that means "does not match \d{3}-\d{3}-\d{4}"?

You can negate a regex pattern by using a negative lookahead. This is slightly different than simply negating the regex though. Negative lookahead would look like the following in Java (and many other languages):
(?!\d{3}-\d{3}-\d{4})
It should be noted that this doesn't exactly answer the question. Writing a regular expression for the complement of a regular language is not an easy task (I don't think). A much easier way to solve the problem would be to invert the program logic:
Instead of:
if (string.matches(yourRegex))
Do:
if (!string.matches(yourRegex))

That is not easily achievable for arbitrary patterns. In practice, it's almost always easier to do what you want in the surrounding code than in the pattern itself. For instance, instead of
grep '\d{3}-\d{3}-\d{4}' file
you could use
grep -v '\d{3}-\d{3}-\d{4}' file
Or in a program you could change something like
if (pattern.matches()) {
    foo();
}
into something like
if (!pattern.matches()) {
    foo();
}
In a more tedious approach, you would have to enumerate all possible values that should match instead of what should not match. So, say you want to match everything but the string <html>, you could write a regex like so:
([^<]|<([^h]|h([^t]|t([^m]|m([^l]|l[^>])))))
Reading that regex is like saying: "Okay, you can match any character but '<', or you could match '<' but then you can't match an 'h' after that... or you do match an 'h' after that but then you can't match a 't' after that..." and so on.
It's butt ugly, but then again, for simple string matches, you can easily write a recursive function that transforms any given term into a pattern like the above.
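A recursive function of that kind could look roughly like the Python sketch below. The name anything_but is invented here, and the term is assumed to be non-empty. Like the hand-written pattern, the result describes a single chunk of input; turning it into a whole-string test (repetition, input that ends midway through the term) is left out.

```python
import re


def anything_but(term):
    """Build a pattern for one chunk of input that cannot complete `term`."""
    head = re.escape(term[0])
    if len(term) == 1:
        return "[^" + head + "]"
    # Either the first character differs, or it matches and the rest must differ somewhere.
    return "([^" + head + "]|" + head + anything_but(term[1:]) + ")"
```

Compiling anything_but("<html>") gives a pattern that accepts chunks such as a, <b or <htmx, but not <html>.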

easier to just negate the test surely? eg...
if (!regex.test(str)) ...
(javascript example)
Negating a character class is easy with ^ but a whole regex will get much more convoluted.

What language are you using? The easiest solution to the specific problem you stated is to simply prepend a negation operator (usually "!") to the match.

I definitely agree with the other answers saying you should negate testing for a match, but this should do what you want using just a regex:
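^(?!.*\d{3}-\d{3}-\d{4})
This is a negative lookahead anchored at the start of the string. Because no characters are placed outside the lookahead, the regex basically means "fail on any string that starts with any number of characters (.*) followed by the regex \d{3}-\d{3}-\d{4}". The ^ matters when the engine searches rather than matches from the start: without it, the lookahead can still succeed at a later position in a string that does contain the pattern.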
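To compare the two approaches side by side, here is a small Python sketch. It assumes the lookahead is anchored at the start of the string and that . should also cross newlines (hence DOTALL). PHONE, NO_PHONE and the two function names are made up for the example.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")
# Anchored negative lookahead: succeeds only if the pattern occurs nowhere in the text.
NO_PHONE = re.compile(r"^(?!.*\d{3}-\d{3}-\d{4})", re.DOTALL)


def lacks_phone_by_negated_test(text):
    """Negate the ordinary search in the surrounding code."""
    return PHONE.search(text) is None


def lacks_phone_by_lookahead(text):
    """Let the pattern itself express 'does not contain'."""
    return NO_PHONE.match(text) is not None
```

Both functions should give the same answer for any input. The first is usually easier to read and to maintain.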

Is it the right regular expression

I have the following regular expression, but it's not working properly: it accepts only three characters after the # sign, but I want it to accept any length.
"/^[a-zA-Z0-9_\.\-]+\#([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9]{2,4}$/"
this#thi This is validated
this#this It is not validating this expression
Can you please tell me what's the problem with the expression...
Thanks

If you want your regex to match "any number length" then why are you using {2,4}?
I think a better example of the strings you're trying to match might give others a better idea of what you want, because based on your regex it is a bit confusing what you're looking for.

Try this:
^[a-zA-Z0-9_.-]+#([a-zA-Z0-9-]+\.)+[a-zA-Z0-9]{2,4}$
The main problem is that you didn't escape the dot: \.. In a regular expression an unescaped dot matches (almost) any character, making your regex quite liberal.
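To see what the {2,4} quantifier does in practice, here is a short Python sketch. It assumes that "any length" refers to the final label after the last dot, which is only a guess at what the asker wants. The test addresses and the names BOUNDED and UNBOUNDED are invented for illustration.

```python
import re

LOCAL_AND_DOMAIN = r"^[a-zA-Z0-9_.-]+#([a-zA-Z0-9-]+\.)+"

# {2,4} caps the final label at four characters; {2,} only sets a minimum.
BOUNDED = re.compile(LOCAL_AND_DOMAIN + r"[a-zA-Z0-9]{2,4}$")
UNBOUNDED = re.compile(LOCAL_AND_DOMAIN + r"[a-zA-Z0-9]{2,}$")
```

With these, user#example.info passes both patterns, user#example.museum passes only UNBOUNDED, and this#this passes neither because it contains no dot.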

How do I properly match Regular Expressions?

I have a list of objects output from ldapsearch as follows:
dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=HGRANGER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=RWEASLEY,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=DMALFOY,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=SSNAPE,ou=FACULTY,ou=HOGWARTS,o=SCHOOL
dn: cn=ADUMBLED,ou=FACULTY,ou=HOGWARTS,o=SCHOOL
So far, I have the following regex:
/\bcn=\w*,/g
Which returns results like this:
cn=HPOTTER,
cn=HGRANGER,
cn=RWEASLEY,
cn=DMALFOY,
cn=SSNAPE,
cn=ADUMBLED,
I need a regex that returns results like this:
HPOTTER
HGRANGER
RWEASLEY
DMALFOY
SSNAPE
ADUMBLED
What do I need to change in my regex so the pattern (the cn= and the comma) is not included in the results?
EDIT: I will be using sed to do the pattern matching, and piping the output to other command line utilities.

You will have to perform a grouping. This is done by modifying the regex to:
/\bcn=\(\w*\),/g
This will then capture your result in a group. How you extract that value depends on your language. (For you, with sed, the variable will be \1.)
Note that in most regex flavors you don't have to escape the parentheses (), but since you're using sed you will need to, as shown above.
For an excellent resource on Regular Expressions I suggest: Mastering Regular Expressions

OK, the place where you asked the more specific question was closed as "exact duplicate" of this, so I'm copying my answer from there to here:
If you want to use sed, you can use something like the following:
sed -e 's/dn: cn=\([^,]*\),.*$/\1/'
You have to use [^,]* because in sed, .* is "greedy" meaning it will match everything it can before looking at any following character. That means if you use \(.*\), in your pattern it will match up to the last comma, not up to the first comma.
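Python's re module is greedy in the same way, so the difference is easy to demonstrate there. A quick sketch on one of the lines from the question (the variable names are just for illustration):

```python
import re

LINE = "dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL"

# Greedy .* gives characters back only until the LAST comma lets the rest match.
greedy = re.sub(r"dn: cn=(.*),.*$", r"\1", LINE)
# [^,]* cannot cross a comma, so the capture stops at the FIRST one.
bounded = re.sub(r"dn: cn=([^,]*),.*$", r"\1", LINE)

print(greedy)   # HPOTTER,ou=STUDENTS,ou=HOGWARTS
print(bounded)  # HPOTTER
```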

Check out Expresso. I have used it in the past to build my regexes, and it is good for learning too.

The quick and dirty method is to use submatches assuming your engine supports it:
/\bcn=(\w*),/g
Then you would want to get the first submatch.

Without knowing what language you're using, we can't tell for sure, but in most regular expression parsers, if you use parentheses, such as
/\bcn=(\w*),/g
then you'll be able to get the first matching pattern (often \1) as exactly what you are searching for. To be more specific, we need to know what language you are using.

If your regex supports Lookaheads and Lookbehinds then you can use
/(?<=\bcn=)\w*(?=,)/g
That will match
HPOTTER
HGRANGER
RWEASLEY
DMALFOY
SSNAPE
ADUMBLED
But not the cn= or the , on either side. The comma and cn= still have to be there for the match, it just isn't included in the result.
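For comparison, here is how the capture-group and lookaround variants behave in Python, as a small sketch. LDIF is a shortened copy of the sample data, and the variable names are made up.

```python
import re

LDIF = """\
dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=SSNAPE,ou=FACULTY,ou=HOGWARTS,o=SCHOOL
"""

# With one capture group, findall returns the group rather than the whole match.
by_group = re.findall(r"\bcn=(\w*),", LDIF)
# With lookarounds, the whole match is already just the name.
by_lookaround = re.findall(r"(?<=\bcn=)\w*(?=,)", LDIF)

print(by_group)       # ['HPOTTER', 'SSNAPE']
print(by_lookaround)  # ['HPOTTER', 'SSNAPE']
```

Either form works here. The capture group has the advantage of also working in engines without lookbehind support.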

Sounds more like a simple parsing problem and not regex. An ANTLR grammar would sort this out in no time.
