regular expression and substitution - regex

In Latex, I had a lot of math expressions with subscriptions in terms of 123, now, I need to change them to \alpha \beta \gamma instead of 123.
for example:
$E_{223}$ to $E_{\beta\beta\gamma}$
and
$E_{31}$ to $_{\gamma\alpha}$
However, I also have power index which is not supposed to be altered, such as $E^3_{112}$ should be change to $E^3_{\alpha\alpha\beta}$.
Is there a way to use regular expression to make this task easier? I know some regular expression from unix and perl, but seems inadequate for this problem.
thank you for anything!

I'm not 100% familiar with Latex, but typical regex would look like this:
(?<\^)#
Where the # is 1, 2 or 3. Then, in your replace, you would replace the matches with \alpha, \beta and \gamma. The (?<\^) is a negative look-behind that says to only replace instances of that number when they aren't preceded by a ^ character (your power indicator).
If typical regex doesn't permit, I'll delete my answer.

In Perl you could do things like:
$text =~ s#\$\w[^${\s]*_{\K([123]+)(?=}\$)#
local $_ = $1;
s/1/\\alpha/g; s/2/\\beta/g; s/3/\\gamma/g;
$_
#ge;

Try these:
replace (?<!\^\d|\d{2}|\d{3}|\d{4})1 with \alpha
replace (?<!\^\d|\d{2}|\d{3}|\d{4})2 with \beta
replace (?<!\^\d|\d{2}|\d{3}|\d{4})3 with \gamma
Edit: These regexes make sure that it won't replace a number from an exponent. You may have to tweak them to check for optional - if you have negative exponents.
Edit 2: #QTax pointed out that you can't use a variable length lookbehinds.
Subexp of look-behind must be fixed character length.
But different character length is allowed in top level
alternatives only.
Reference: http://tacosw.com/latexian/help/find/regex.html

I don't know what editor or regex engine you're using for this, but here's the basic idea I'd go with in Perl-ish regex:
Replace this:
(?<=\{\d*)1(?=\d*\})
With this:
\\alpha
I think you'll want to set the g flag as well.
Not sure if I have the right escaping syntax (it's been a while since I touched Perl) but I think so.
Repeat as necessary for \beta, \gamma, etc.

Related

Regex for nested matches

Consider the string
cos(t(2))+t(51)
Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like
variable or function name + opening_parenthesis + contents + closing_parenthesis,
where contents can be any expression that has an equal number of opening and closing parentheses.
I'm using [a-zA-Z]+\([\W\w]*\) which returns cos(t(2)))+t(51), which of course is not the desired result.
Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".
Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.
You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.
This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)
In Perl and PCRE, you can use this short regex:
(?=(\b\w+\((?:\d+|(?1))\)))
This will match:
cos(t(2))
t(2)
t(51)
For instance, in php, you could use this code (see the results at the bottom of the online demo).
$regex = "~(?=(\b\w+\((?:\d+|(?1))\)))~";
$string = "cos(t(2))+t(51)";
$count = preg_match_all($regex,$string,$matches);
print_r($matches[1]);
How does it work?
To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos
In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see x. After matching this assertion, it tries to match it again starting from the next position in the string.
The expression in the lookahead (which describes what we're looking for) is almost very simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.

Capture followed by Digits: Replace Syntax? (Dreamweaver)

When you address a regex capture, things can get tricky when digits follow the capture. In PCRE, I can write
${1}000
to substitute the capture of Group 1 followed by three zeroes.
Does anyone know the equivalent syntax in Dreamweaver replace operations, if any?
If we had a series of "A"s instead of zeroes, we could use:
$1AAAA
But these:
$10000
${1}0000
do not work.
I believe the regex flavor is ECMAScript. Just cannot find the information.
This may not be addressed in the syntax. If so, that would be good to know.
Thank you!
Edit: I should add that this is not matter of life and death as I have a number of grep tools at my fingertips. I would just like to know.
Dreamweaver's regular expression find and replace is supposed to be based on JavaScript's implementation of RegExp. You should be able to just use $1000 in the replacement text. However, like you've found, the replacement groups ($ + group number) are not properly recognized when the replacement text has digits immediately after the grouping token.
FWIW: I've logged a bug on this at http://adobe.ly/DWwish

Regular expression theory

I have a little problem with RE theory.
Given an alphabet {0, 1}, I have to create a regular expression that matches all string that does NOT contain the substring 111.
I'm not able to get the point, also for simplier substring like 00.
Edit: The solution must contains only the three standard operation: concatenation, alternation, kleene star, as you can see in the wiki link
Thank you.
As far as I understand, the language you want to regexify is not allowed to contain three or more consecutive 1's. Such a regexp could be (110|10|0*)*|1|11|0*1|0*11
How about this:
{ε|1}{ε|1}{ε|{0{ε|1}{ε|1}}*}
Back in the days when we didn't have the ?! negative lookahead facility I would use a negation match. So for grep I would
grep -v (pattern I'm searching for) someFile.txt
which would give the lines in the file that don't contain the pattern.
In perl I would use the
!~
negation matcher rather than the usual
=~
I don't know which regex variant you are using, but I'm struggling to see a way to solve your problem without either an overall negation or a ?! negative lookahead.
matcher.

Regex: Does not have/include pattern

I have a regex pattern to match an HTML script tag. How can I change this script tag pattern so that the patterns means "input string DOES NOT MATCH" the script tag pattern?
In other words, given a pattern, what is the alteration needed to change the meaning of the pattern to "does not match this pattern"?
For example, if I have a pattern: \d{3}-\d{3}-\d{4}, what is the equivalent pattern for this that means "does not match \d{3}-\d{3}-\d{4}"?
You can negate a regex pattern by using a negative lookahead. This is slightly different than simply negating the regex though. Negative lookahead would look like the following in Java (and many other languages):
(?!\d{3}-\d{3}-\d{4})
It should be noted that this doesn't exactly answer the question. Finding the inverse of a regular language is not an easy task using a regular expression (I don't think). A much easier way to solve the problem would be to inverse the program logic:
Instead of:
if (string.matches(yourRegex))
Do:
if (!string.matches(yourRegex))
That is not easily achievable for arbitrary patterns. In practice, it's almost always easier to do what you want in the surrounding code than in the pattern itself. For instance, instead of
grep '\d{3}-\d{3}-\d{4}' file
you could use
grep -v '\d{3}-\d{3}-\d{4|' file
Or in a program you could change something like
if (pattern.matches()) {
foo();
}
into something like
if (!pattern.matches()) {
foo();
}
In a more tedious approach, you would have to enumerate all possible values that should match instead of what should not match. So, say you want to match everything but the string <html>, you could write a regex like so:
([^<]|<([^h]|h([^t]|t([^m]|m([^l]|l[^>])))))
Reading that regex is like saying: "Okay, you can match any character but '<', or you could match '<' but then you can't match an 'h' after that... or you do match an 'h' after that but then you can't match a 't' after that... and so on.
It's butt ugly, but then again, for simple string matches, you can easily write a recursive function that transforms any given term into a pattern like the above.
easier to just negate the test surely? eg...
if (!regex.test(str)) ...
(javascript example)
Negating a character class is easy with ^ but a whole regex will get much more convoluted.
What language are you using? The easiest solution to the specific problem you stated is to simply prepend a negation operator (usually "!") to the match.
I definitely agree with the other answers saying you should negate testing for a match, but this should do what you want using just a regex:
(?!.*\d{3}-\d{3}-\d{4})
This is a negative lookahead, by not placing any characters outside of the lookahead the regex basically means "fail on any string that starts with any number of characters (.*) followed by the regex \d{3}-\d{3}-\d{4}".

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile('(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character" - alphanumeric characters. This is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice. Although strictly speaking, looking ahead is not a regular expression as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line =~ !/Andrea/);