Why do I need to use double curly Brackets in my RegEx? - regex

I'm running a little regular expression in one of my xsl-transformations (xsl:analyze-string) and came across this effect that made me rather uncomfortable because I didn't really find any explanation...
I was searching for Non-Breaking-Spaces and En-Spaces, so I used the \p{Z} construct. According to many examples in the XSLT 2.0 Programmers Reference by Michael Kay, this should work. RegExBuddy also approves :)
Now my SaxonHE9.4N tells me
Error in regular expression: net.sf.saxon.trans.XPathException: expected ({)
After several trials and errors I simply doubled the Brackets \p{{Z}} ... and it worked!? But this time RegExBuddy disapproves!
Can someone give me an explanation of this effect? I couldn't find anything satisfying on the internet...
Thanks in advance!
Edit: I tried the same thing inside of a replace() function and the double bracket version didn't work. I had to do it with single brackets!

In an attribute value template, curly braces are special syntax indicating an XPath expression to be evaluated. If you want literal curly braces, you have to escape them by doubling:
An attribute value template consists of an alternating sequence of
fixed parts and variable parts. A variable part consists of an XPath
expression enclosed in curly brackets ({}). A fixed part may contain
any characters, except that a left curly bracket must be written as {{
and a right curly bracket must be written as }}.
Note:
An expression within a variable part may contain an unescaped curly
bracket within a StringLiteral or within a comment.
Not all attributes are AVTs, but the regex attribute of analyze-string is:
Note:
Because the regex attribute is an attribute value template, curly
brackets within the regular expression must be doubled. For example,
to match a sequence of one to five characters, write regex=".{{1,5}}".
For regular expressions containing many curly brackets it may be more
convenient to use a notation such as
regex="{'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'}", or to use a variable.
(Emphasis added, in both quotes.)
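As a rough illustration of the doubling rule, here is a sketch in Python rather than XSLT. The helper name escape_for_avt is hypothetical and is not part of Saxon or any XSLT library; it simply doubles every curly brace so that a plain regex can be pasted into an attribute value template such as the regex attribute of xsl:analyze-string.

```python
def escape_for_avt(regex: str) -> str:
    """Double every curly brace so an AVT treats it as a literal character."""
    return regex.replace("{", "{{").replace("}", "}}")


print(escape_for_avt(r"\p{Z}"))    # \p{{Z}}
print(escape_for_avt(r".{1,5}"))   # .{{1,5}}
```

This would also be consistent with the edit in the question: replace() is presumably called inside an ordinary XPath expression (for example a select attribute), which is not an attribute value template, so the braces must not be doubled there.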

Related

Regular expression for word *not* in specific latex command

I am looking for a regular expression which will match all occurrences of foo unless it is in a \ref{..} command. So \ref{sec:foo} should not match.
I will probably want to add more commands for which the arguments should be excluded later so the solution should be extendable in that regard.
There are some similar questions trying to detect when something is in parenthesis, etc.
Regex to match only commas not in parentheses?
Split string by comma if not within square brackets or parentheses
The first has an interesting solution using alternatives: \(.*?\)|(,) the first alternative matches the unwanted versions and the second has the match group. However, since I am using this in a search&replace context, I cannot really use match groups.
Finding a regex for exactly what you want requires variable-length lookbehind, which only a few regex engines support (for example .NET/C# and the third-party regex module for Python from PyPI), so you will have to settle for a less-than-perfect solution. You can use this regex to match foo that is not within curly braces, as long as the curly braces are not nested:
\bfoo\b(?![^{}]*})
This will not match a foo inside \ref{sec:foo}, or even in somethingelse{sec:foo}: as you can see, it skips any foo contained in curly braces, whatever command precedes them. If you need a precise solution, it will require variable-length lookbehind support which, as I said, is available in very few regex engines.
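A minimal sketch of how the pattern behaves in a search-and-replace, using Python's built-in re module. The sample text and the replacement word "bar" are made up for illustration, and the closing brace is written as \} for readability.

```python
import re

# Match "foo" as a whole word unless a "}" follows before any other brace,
# i.e. unless we are inside an (un-nested) {...} group.
NOT_IN_BRACES = re.compile(r"\bfoo\b(?![^{}]*\})")

text = r"foo and \ref{sec:foo} and foo"
print(NOT_IN_BRACES.sub("bar", text))  # bar and \ref{sec:foo} and bar
```

Because the lookahead only checks for a closing brace, the foo inside any {...} argument is skipped, not only the one after \ref.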

Match repetition with regexp in ocamllex

I'm trying to write a lexer with ocamllex for some special native language (that is a bit modified for my purposes). Some words should be matched by their first character, which is doubled. But I can't find any way to express this repetition of the first character. I can't use the regex syntax
(['a'-'z'])\1['a'-'z']+
with that "\1". Ocamllex says "illegal escape sequence \1.", and I think that is correct given the syntax of escape sequences, but it is certainly not what I wanted. Nor can I use the repetition syntax with curly braces in any way (though this wouldn't solve the problem anyway):
['a'-'z']{2}['a'-'z']+
I think there is a conflict with the OCaml code in the curly braces after the regexp.
Does anybody have an idea for that?
thank you very much.
Ocamllex's regular expressions have neither counted repetition ({n}) nor backreferences. The available regex syntax is exactly what is listed in the reference manual:
http://caml.inria.fr/pub/docs/manual-ocaml-4.01/lexyacc.html#sec274
I think you can manually list all the possible repetitions, as below:
("aa"|"bb"|"cc"|"dd"|"ee"|"ff"| ..............)['a'-'z']+
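Typing all 26 alternatives by hand is error-prone, so one option is to generate the alternation with a small script and paste the result into the lexer. The sketch below uses Python for that; the function name doubled_letter_alternation is made up for illustration.

```python
import string


def doubled_letter_alternation() -> str:
    """Build the ocamllex alternation ("aa"|"bb"|...|"zz") for a doubled first letter."""
    pairs = ['"{0}{0}"'.format(c) for c in string.ascii_lowercase]
    return "(" + "|".join(pairs) + ")"


# Paste the printed text into the lexer rule.
print(doubled_letter_alternation() + "['a'-'z']+")
```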

In gedit, highlight a function call within the parenthesis

I'm currently editing my javascript.lang file to highlight function names.
Here is my expression for gtksourceview that I am currently using.
<define-regex id="function-regex" >
(?<=([\.|\s]))
([a-z]\w*)
(?=([\(].*))(?=(.*[\)]))
</define-regex>
here's the regex by itself
(?<=([\.|\s]))([a-z]\w*)(?=([\(].*))(?=(.*[\)]))
It appears to work for situations such as, foo(A) which I am satisfied with.
But where I am having trouble is if I want it to highlight a function name within the parentheses of another function call.
foo(bar(A))
or to put it more rigorously
foo{N}(foo{N-1}(...(foo{2}(foo{1}(A))...))
So with the example,
foo(bar(baz(A)))
my goal is for it to highlight foo, bar, baz and nothing else.
I don't know how to handle the bar function. I have read about a way of doing regex recursively with (?R) or (?0) but I have not had any success using that to highlight functions recursively in gedit.
P.S.
Here are the tests that I am currently using to determine success.
initialDrawGraph(toBeSorted);
$(element).removeClass(currentclass);
myFrame.popStack();
context.outputCurrentSortOrder(V);
myFrame.nextFunction = sorter.Sort.;
context.outputToDivConsole(formatStr(V),1);
Matching balanced parentheses is beyond what true regular expressions can do: the language of balanced parentheses is not regular, because recognizing it needs memory (See: Can regular expressions be used to match nested patterns?). Some implementations, however, support recursion in regular expressions:
Matching Balanced Constructs
The main purpose of recursion is to match balanced constructs or
nested constructs. The generic regex is b(?:m|(?R))*e where b is
what begins the construct, m is what can occur in the middle of the
construct, and e is what can occur at the end of the construct. For
correct results, no two of b, m, and e should be able to match
the same text. You can use an atomic group instead of the
non-capturing group for improved performance: b(?>m|(?R))*e.
A common real-world use is to match a balanced set of parentheses.
\((?>[^()]|(?R))*\) matches a single pair of parentheses with any
text in between, including an unlimited number of parentheses, as long
as they are all properly paired. If the subject string contains
unbalanced parentheses, then the first regex match is the leftmost
pair of balanced parentheses, which may occur after unbalanced opening
parentheses. If you want a regex that does not find any matches in a
string that contains unbalanced parentheses, then you need to use a
subroutine call instead of recursion. If you want to find a sequence
of multiple pairs of balanced parentheses as a single match, then you
also need a subroutine call.
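Python's built-in re module has no (?R), so the "memory" that balancing needs has to be kept by hand there. The following sketch (first_balanced_group is a hypothetical helper name) uses a depth counter to return the leftmost balanced pair of parentheses, which is roughly what the recursive pattern quoted above finds as its first match.

```python
from typing import Optional


def first_balanced_group(text: str) -> Optional[str]:
    """Return the leftmost balanced (...) span in text, or None if there is none."""
    for start, ch in enumerate(text):
        if ch != "(":
            continue
        depth = 0  # number of parentheses currently open
        for end in range(start, len(text)):
            if text[end] == "(":
                depth += 1
            elif text[end] == ")":
                depth -= 1
                if depth == 0:
                    return text[start:end + 1]
        # This "(" is never closed; try the next opening parenthesis.
    return None


print(first_balanced_group("foo(bar(baz(A)))"))  # (bar(baz(A)))
print(first_balanced_group("((a(b)"))            # (b)
```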
Ok, looks like I was making this more complicated than it needed to be.
I was able to achieve what I needed with this simpler regex. I just told it to stop looking for the close parenthesis.
([a-zA-Z0-9][a-zA-Z0-9]*)(?=\()
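A quick way to check this simpler pattern outside gedit is to run it in Python against the test lines from the question. This is only a sketch: the pattern is written as [a-zA-Z0-9]+, which is equivalent to the form above without the capture group, and gtksourceview's regex engine may differ from Python's in other respects.

```python
import re

# An identifier immediately followed by "(" is treated as a function name.
FUNCTION_NAME = re.compile(r"[a-zA-Z0-9]+(?=\()")

print(FUNCTION_NAME.findall("foo(bar(baz(A)))"))
# ['foo', 'bar', 'baz']
print(FUNCTION_NAME.findall("context.outputToDivConsole(formatStr(V),1);"))
# ['outputToDivConsole', 'formatStr']
```

Nesting is no problem here because each name is matched on its own; the lookahead only needs to see the opening parenthesis, not the matching close.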
The following regex works for nested functions (Note: This is the Python version of regex. You may or may not need to make some syntax tweaks. Hopefully, you'll get the idea):
[OBSOLETED] '(\w+\()+[^\)]*\)+'
[UPDATED] (Should Work. Hopefully)
(\w+\()+([^\)]*\)+)*

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
ex.
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
try this
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range. A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parentheses. To capture a part of a regular expression, put it in parentheses. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
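The steps above can be sketched in Python by assembling the pattern from its named pieces. The constant names are made up for illustration, and the input is the sample from the question.

```python
import re

QUOTE = '"'          # a literal quote matches itself
NOT_QUOTE = '[^"]'   # any single character other than a quote
SOME = "*"           # zero or more of the preceding item

# quote, (captured: some non-quote characters), quote
pattern = re.compile(QUOTE + "(" + NOT_QUOTE + SOME + ")" + QUOTE)

match = pattern.search('<A "SystemTemperatureOutOfSpec" >')
print(pattern.pattern)  # "([^"]*)"
print(match.group(1))   # SystemTemperatureOutOfSpec
```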
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string", you might be better off using an XML parser.

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceded or followed by a double or single quote. So it only handles constructs like the ones you mentioned; something:firstValue would not be matched.
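A small check of this pattern against the examples from the question, sketched with Python's re module (which supports these fixed-width look-around assertions):

```python
import re

# A colon that is preceded by quote+colon, or followed by a quote.
COLON = re.compile(r""":(?:(?<=["']:)|(?=["']))""")

samples = (
    "something:'firstValue':'secondValue'",
    "something:\"firstValue\":'secondValue'",
    "something:'no:match'",
    "something:firstValue",
)
for s in samples:
    print(s, len(COLON.findall(s)))  # counts: 2, 2, 1, 0
```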
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
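Such a parser can be very small. The sketch below (in Python; unquoted_colon_positions is a hypothetical name) walks the string once and remembers which quote character, if any, is currently open. It assumes that quotes are never escaped inside a quoted section.

```python
def unquoted_colon_positions(text: str) -> list:
    """Return the indices of colons that are outside '...' and "..." spans."""
    positions = []
    open_quote = None  # the quote character we are currently inside, if any
    for index, ch in enumerate(text):
        if open_quote is not None:
            if ch == open_quote:
                open_quote = None   # closing quote found
        elif ch in "'\"":
            open_quote = ch         # opening quote found
        elif ch == ":":
            positions.append(index)
    return positions


print(unquoted_colon_positions("something:'no:match'"))  # [9]
```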
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
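# Remove every quoted sequence, including any colons inside it.
$string =~ s/['"][^'"]*['"]//g;
# Count the remaining colons; "= () =" forces list context so we get a count.
my $match_count = () = $string =~ /:/g;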
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
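For readers who do not use Perl, the same two-step idea can be sketched in Python (count_unquoted_colons is a made-up name). Working on a copy is automatic here because Python strings are immutable.

```python
import re


def count_unquoted_colons(text: str) -> int:
    # Step 1: delete every quote-delimited run, taking any colons inside with it.
    stripped = re.sub(r"""['"][^'"]*['"]""", "", text)
    # Step 2: count the colons that are left.
    return stripped.count(":")


print(count_unquoted_colons("something:\"firstValue\":'secondValue'"))  # 2
print(count_unquoted_colons("something:'no:match'"))                    # 1
```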
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
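As one example of the CSV-parser route, Python's standard csv module accepts a custom delimiter and quote character. This is only a sketch: split_fields is a hypothetical wrapper, and the csv module handles a single quote character per reader, so input that mixes ' and " as quote characters would still need the hand-written character-by-character approach.

```python
import csv


def split_fields(line: str, quotechar: str = "'") -> list:
    """Split on colons, keeping colons inside quotechar-delimited fields."""
    return next(csv.reader([line], delimiter=":", quotechar=quotechar))


print(split_fields("something:'no:match'"))  # ['something', 'no:match']
```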
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
Oops ... missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (the .NET implementation, for example, has an extension that can do it, but it's a bit complicated).
You can use negated character groups to do this.
[^'"]:[^'"]
You can further wrap the quotes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use assertion.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right, using * within a look behind assertion is not allowed.
You can try to catch the strings within the quotes
/(?<q>'|")([\w ]+)(\k<q>)/m
The first group defines the allowed quote types; the second group takes word characters (letters, digits, underscore) and spaces.
The nice thing about this solution is that it ONLY matches strings whose opening and closing quotes are the same.
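The same idea can be tried in Python, with one syntax change: Python's re module spells named groups as (?P<q>...) and the backreference as (?P=q) instead of \k<q>. The sample text below is made up for illustration.

```python
import re

# (?P<q>...) names the opening quote; (?P=q) requires the same character to close.
QUOTED = re.compile(r"""(?P<q>['"])([\w ]+)(?P=q)""")

text = "'abc' \"de f\" 'mis\""
print([m.group(2) for m in QUOTED.finditer(text)])  # ['abc', 'de f']
```

The last item, which opens with ' and closes with ", is not matched because the backreference demands the same quote character at both ends.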
Try it at regex101.com