Regular expression for word *not* in specific latex command - regex

I am looking for a regular expression which will match all occurrences of foo unless it is in a \ref{..} command. So \ref{sec:foo} should not match.
I will probably want to add more commands for which the arguments should be excluded later so the solution should be extendable in that regard.
There are some similar questions about detecting whether something is inside parentheses, etc.:
Regex to match only commas not in parentheses?
Split string by comma if not within square brackets or parentheses
The first has an interesting solution using alternation: \(.*?\)|(,). The first alternative matches the unwanted occurrences and the second holds the capture group. However, since I am using this in a search-and-replace context, I cannot really use capture groups.

A regex for exactly what you want needs variable-length lookbehind, which only a few regex engines support (for example .NET, as used from C#, and the third-party regex module from PyPI for Python), so you will have to settle for a less-than-perfect solution. You can use this regex to match foo that is not within curly braces, as long as you don't have nested curly braces:
\bfoo\b(?![^{}]*})
This will not match a foo inside \ref{sec:foo}, nor one inside somethingelse{sec:foo}: it only matches a foo that is not contained in curly braces. If you need a precise solution, it will require variable-length lookbehind support which, as I said, is available in only a few engines.
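As a rough illustration in Python (whose built-in re module has no variable-length lookbehind), the sketch below applies the pattern above and, for comparison, the alternation idea mentioned in the question combined with a replacement callback. The command list, the helper name replace_outside_commands and the sample text are assumptions made up for this example, and nested braces are still not handled.

```python
import re

# Pattern from the answer: "foo" that is not followed by a closing brace
# without another brace in between (i.e. not inside a flat {...} group).
OUTSIDE_BRACES = re.compile(r"\bfoo\b(?![^{}]*\})")

# Hypothetical list of commands whose arguments should be left alone.
EXCLUDED_COMMANDS = ("ref", "label")

# Alternation: the first branch swallows an excluded command as a whole,
# the second branch captures a "foo" that appears anywhere else.
SKIP_OR_FOO = re.compile(
    r"\\(?:%s)\{[^{}]*\}|(\bfoo\b)" % "|".join(EXCLUDED_COMMANDS)
)


def replace_outside_commands(text, replacement):
    """Replace foo everywhere except inside the excluded commands."""
    def keep_or_replace(match):
        # Group 1 is only set when the second branch (a bare foo) matched.
        if match.group(1) is not None:
            return replacement
        return match.group(0)

    return SKIP_OR_FOO.sub(keep_or_replace, text)


sample = r"foo in \ref{sec:foo} but \emph{foo}"
print(OUTSIDE_BRACES.sub("bar", sample))        # leaves every braced foo alone
print(replace_outside_commands(sample, "bar"))  # leaves only \ref and \label alone
```

The callback variant is easier to extend with further commands, but it only helps where the search-and-replace tool accepts a function or script as the replacement.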

Related

Regular expression to match a string outside class or function block

Here's the reduced case of PHP code:
use Package;
use Package2;
class {
use Trait;
function fn() {
function() use ($var) {
}
}
}
I'd like to match only the use before Package; and Package2; not use Trait nor use ($var)
Neither negative lookahead nor negative lookbehind seems to work. I tried the approach from "Regular Expression, match characters outside curly braces { }".
Obviously doesn't work: https://regex101.com/r/L6N4Ye/1
Using the PCRE interpreter.
Regex might not be the best choice here, but you could use one if you have control over the format of the code you are parsing. Otherwise, using a PHP parser would be the best idea.
With that in mind, how about checking whether the use is at the beginning of a line (^, with the multiline flag)?
^use\s+(?![^{]*\})
I am not familiar with PHP syntax, so please forgive any syntactical considerations I have missed.
Since in this particular case you are sure that all the use statements you are interested in lie before the class boundary, I think it may help to look for every use that is not preceded by a {, which can be achieved with the following regex, which uses a negative lookbehind for {:
(?<!\{\s{0,100})\s*use\s*(?<pkg>.*);
After applying this to the entire source code, you may look for the groups named pkg in the matched substrings.
However, the not-so-good part of the negative lookbehind is the \s{0,100}, which I have included only to allow whitespace after the opening brace. There must be a better way to do this. I had to do it because negative lookbehinds need a calculable maximum length, which is why \s* will not work.
My assumptions on the syntax:
use is always lowercase
A use package statement necessarily ends with ;
Whitespace is allowed freely between tokens as in the case of Java
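Since both answers lean on assumptions about the code layout, here is a minimal sketch of a different technique: tracking brace depth line by line instead of using lookarounds. It is written in Python rather than PHP, the function name top_level_use_statements is made up for this example, and it assumes that braces never appear inside strings or comments and that each use statement sits on its own line.

```python
import re

# A use statement occupying the start of a (stripped) line, ending with ;
USE_STATEMENT = re.compile(r"use\s+([^;]+);")


def top_level_use_statements(source):
    """Return the names imported by use statements outside any braces."""
    names = []
    depth = 0  # number of currently open { ... } blocks
    for line in source.splitlines():
        if depth == 0:
            match = USE_STATEMENT.match(line.strip())
            if match:
                names.append(match.group(1).strip())
        # Update the depth after inspecting the line itself.
        depth += line.count("{") - line.count("}")
    return names


php_sample = """use Package;
use Package2;
class {
use Trait;
function fn() {
function() use ($var) {
}
}
}"""
print(top_level_use_statements(php_sample))
```

For anything beyond such a reduced case, a real PHP parser remains the safer option, as noted above.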

Matching Word() when word is not (some word)

Specifically, I want to match functions in my Javascript code that are not in a set of common standard Javascript functions. In other words, I want to match user defined functions. I'm working with vim's flavour of regexp, but I don't mind seeing solutions for other flavours.
As I understand it, regexp crawls through a string character by character, so thinking in terms of sets of characters can be problematic even when a problem seems simple. I've tried negative lookahead, and as you might expect, all that does is prevent the first character of the functions I don't want from being matched (i.e., onsole.log instead of console.log).
(?(?!(if)|(console\.log)|(function))\w+)\(.*\)
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
https://regexr.com/
I would like to be able to crawl through a function and see where it is calling other usermade functions. Will I need to do post-processing (ie mapping with another regexp) on the matches to reject matches I don't want, or is there a way to do this in one regexp?
The basic recipe for a regular expression that matches all words except foo (in Vim's regular expression syntax) is:
/\<\%(foo\>\)\#!\k\+\>/
Note how the negative lookahead (\#!) needs an end assertion of its own (here: \>); otherwise it would also exclude anything that merely starts with the expression!
Applied to your examples (excluding if, potentially followed by whitespace, console.log, and function, each followed by an opening parenthesis), that gives:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\(\k\|\.\)\+\>(.*)
As you seem to want to include the entire object chain (so JSON.parse instead of just parse), the actual match includes both keyword characters (\k) and the period. There's one complication with that: the pattern will still latch onto the log() in console.log(), because the leading keyword boundary assertion (\<) matches there as well. We can disallow that match by also excluding a period just before the function, i.e. by placing \.\#<! in between:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\.\#<!\(\k\|\.\)\+\>(.*)
That will highlight just the following calls:
JSON.parse(localStorage["beards"])
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)
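Since the question also welcomes other flavours, here is a rough Python translation of the same idea: a negative lookbehind keeps the match from starting in the middle of a dotted name, and a negative lookahead rejects the ignored names. The helper name user_defined_calls and the ignore list are assumptions for illustration; like the Vim pattern, it only sees names that are directly followed by an opening parenthesis.

```python
import re

# Names to ignore; this list is an assumption taken from the question's examples.
STANDARD_NAMES = ("if", "console.log", "function")


def user_defined_calls(source, ignored=STANDARD_NAMES):
    """Return dotted call names in source that are not in the ignored list."""
    excluded = "|".join(re.escape(name) for name in ignored)
    pattern = re.compile(
        r"(?<![\w.])"                       # not in the middle of a dotted name
        r"(?!(?:" + excluded + r")\s*\()"   # not an ignored name followed by (
        r"([\w.]+)\("                       # dotted name directly followed by (
    )
    return pattern.findall(source)


js_sample = "\n".join([
    "function(meep, boop, doo,do)",
    'JSON.parse(localStorage["beards"])',
    'console.log("sldkfjls" + dododo);',
    "if (beepboop) {",
    "BLAH.blah.somefunc(arge, arg,arg);",
    "foo.log(asdf)",
])
print(user_defined_calls(js_sample))
```

Because the lookahead requires the opening parenthesis after the ignored name, names that merely start with an ignored word are still reported.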

In gedit, highlight a function call within the parenthesis

I'm currently editing my javascript.lang file to highlight function names.
Here is my expression for gtksourceview that I am currently using.
<define-regex id="function-regex" >
(?<=([\.|\s]))
([a-z]\w*)
(?=([\(].*))(?=(.*[\)]))
</define-regex>
here's the regex by itself
(?<=([\.|\s]))([a-z]\w*)(?=([\(].*))(?=(.*[\)]))
It appears to work for situations such as foo(A), which I am satisfied with.
But where I am having trouble is if I want it to highlight a function name within the parentheses of another function call.
foo(bar(A))
or to put it more rigorously
foo{N}(foo{N-1}(...(foo{2}(foo{1}(A))...))
So with the example,
foo(bar(baz(A)))
my goal is for it to highlight foo, bar, baz and nothing else.
I don't know how to handle the bar function. I have read about a way of doing regex recursively with (?R) or (?0) but I have not had any success using that to highlight functions recursively in gedit.
P.S.
Here are the tests that I am currently using to determine success.
initialDrawGraph(toBeSorted);
$(element).removeClass(currentclass);
myFrame.popStack();
context.outputCurrentSortOrder(V);
myFrame.nextFunction = sorter.Sort.;
context.outputToDivConsole(formatStr(V),1);
Balanced parentheses do not form a regular language, since matching them needs memory (see: Can regular expressions be used to match nested patterns?). Some implementations, however, support recursion in regular expressions:
Matching Balanced Constructs
The main purpose of recursion is to match balanced constructs or
nested constructs. The generic regex is b(?:m|(?R))*e where b is
what begins the construct, m is what can occur in the middle of the
construct, and e is what can occur at the end of the construct. For
correct results, no two of b, m, and e should be able to match
the same text. You can use an atomic group instead of the
non-capturing group for improved performance: b(?>m|(?R))*e.
A common real-world use is to match a balanced set of parentheses.
\((?>[^()]|(?R))*\) matches a single pair of parentheses with any
text in between, including an unlimited number of parentheses, as long
as they are all properly paired. If the subject string contains
unbalanced parentheses, then the first regex match is the leftmost
pair of balanced parentheses, which may occur after unbalanced opening
parentheses. If you want a regex that does not find any matches in a
string that contains unbalanced parentheses, then you need to use a
subroutine call instead of recursion. If you want to find a sequence
of multiple pairs of balanced parentheses as a single match, then you
also need a subroutine call.
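To make the "needs memory" point concrete, here is a small sketch in Python that checks balance with an explicit depth counter instead of a recursive pattern (Python's built-in re module has no (?R)). The function name is_balanced is made up for this example.

```python
def is_balanced(text):
    """Return True if every ( in text has a matching ) in the right order."""
    depth = 0  # the "memory": how many parentheses are currently open
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opening one.
                return False
    return depth == 0


print(is_balanced("foo(bar(baz(A)))"))
print(is_balanced("foo(bar("))
```

The counter can grow without bound, which is exactly what a pattern without recursion or a similar extension cannot express.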
Ok, looks like I was making this more complicated than it needed to be.
I was able to achieve what I needed with this simpler regex. I just told it to stop looking for the close parenthesis.
([a-zA-Z0-9][a-zA-Z0-9]*)(?=\()
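As a quick check outside gedit, the same pattern can be tried in Python, which is only a stand-in here and not the gtksourceview engine. The sample strings come from the question; the variable names are made up.

```python
import re

# The simplified pattern from above: a name immediately followed by "(".
FUNCTION_NAME = re.compile(r"([a-zA-Z0-9][a-zA-Z0-9]*)(?=\()")

nested = "foo(bar(baz(A)))"
chained = "context.outputToDivConsole(formatStr(V),1);"

print(FUNCTION_NAME.findall(nested))
print(FUNCTION_NAME.findall(chained))
```

Because the lookahead only inspects the single character after the name, nesting depth never matters, so no recursion is needed for highlighting.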
The following regex works for nested functions (Note: this is the Python version of the regex. You may or may not need to make some syntax tweaks. Hopefully, you'll get the idea):
[OBSOLETED] '(\w+\()+[^\)]*\)+'
[UPDATED] (Should Work. Hopefully)
(\w+\()+([^\)]*\)+)*

Why do I need to use double curly Brackets in my RegEx?

I'm running a little regular expression in one of my xsl-transformations (xsl:analyze-string) and came across this effect that made me rather uncomfortable because I didn't really find any explanation...
I was searching for Non-Breaking-Spaces and En-Spaces, so I used the \p{Z} construct. According to many examples in the XSLT 2.0 Programmers Reference by Michael Kay, this should work. RegExBuddy also approves :)
Now my SaxonHE9.4N tells me
Error in regular expression: net.sf.saxon.trans.XPathException: expected ({)
After some trial and error I simply doubled the brackets, \p{{Z}} ... and it worked!? But this time RegExBuddy disapproves!
Can someone give me an explanation of this effect? I couldn't find anything satisfying on the internet...
Thanks in advance!
Edit: I tried the same thing inside of a replace() function and the double bracket version didn't work. I had to do it with single brackets!
In an attribute value template, curly braces are special syntax indicating an XPath expression to be evaluated. If you want literal curly braces, you have to escape them by doubling:
An attribute value template consists of an alternating sequence of
fixed parts and variable parts. A variable part consists of an XPath
expression enclosed in curly brackets ({}). A fixed part may contain
any characters, except that a left curly bracket must be written as {{
and a right curly bracket must be written as }}.
Note:
An expression within a variable part may contain an unescaped curly
bracket within a StringLiteral XP or within a comment.
Not all attributes are AVTs, but the regex attribute of analyze-string is:
Note:
Because the regex attribute is an attribute value template, curly
brackets within the regular expression must be doubled. For example,
to match a sequence of one to five characters, write regex=".{{1,5}}".
For regular expressions containing many curly brackets it may be more
convenient to use a notation such as
regex="{'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'}", or to use a variable.
(Emphasis added, in both quotes.)
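If such regex attributes are generated by a script, the doubling rule quoted above can be applied mechanically. The helper below is a hypothetical Python sketch (the name escape_for_avt is made up); it assumes the whole string is a fixed part of the attribute value template, with no XPath expression embedded in it.

```python
def escape_for_avt(regex):
    """Double every curly bracket so the regex survives AVT processing."""
    return regex.replace("{", "{{").replace("}", "}}")


print(escape_for_avt(r"\p{Z}"))
print(escape_for_avt(r".{1,5}"))
```

Outside an attribute value template the original, single-bracket form is the one to use.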

Do not include the condition itself in regex

Here's the regexp:
/\.([^\.]*)/g
But for string name.ns1.ns2 it catches .ns1 and .ns2 values (which does make perfect sense). Is it possible only to get ns1 and ns2 results? Maybe using assertions, nuh?
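You already have the capturing group; use its value, however that is done in your language.
JavaScript example:
var list = Array.from("name.ns1.ns2".matchAll(/\.([^.]+)/g), function (m) { return m[1]; });
// list now contains 'ns1' and 'ns2' (a plain .match() with the g flag would return '.ns1' and '.ns2')
If you can use lookbehinds (most modern regex flavors; JavaScript only since ES2018), you can use this expression: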
(?<=\.)[^.]+
In Perl you can also use \K like so:
\.\K[^.]+
I'm not 100% sure what you're trying to do, but let's go through some options.
Your regex: /\.([^\.]*)/g
(Minor note: you don't need the backslash in front of the . inside a character class [..], because a . loses its special meaning there already.)
First: matching against a regular expression is, in principle, a Boolean test: "does this string match this regex". Any additional information you might be able to get about what part of the string matched what part of the regex, etc., is entirely dependent upon the particular implementation surrounding the regular expression in whatever environment you're using. So, your question is inherently implementation-dependent.
However, in the most common case, a match attempt does provide additional data. You almost always get the substring that matched the entire regular expression (in Perl 5, it shows up in the $& variable). In Perl 5-compatible regular expressions, if you surround part of the regular expression with unquoted parentheses, you will additionally get the substrings that matched each set of those as well (in Perl 5, they are placed in $1, $2, etc.).
So, as written, your regular expression will usually make two separate results available to you: ".ns1", ".ns2", etc. for the entire match, and "ns1", "ns2", etc. for the subgroup match. You shouldn't have to change the expression to get the latter values; just change how you access the results of the match.
However, if you want, and if your regular expression engine supports them, you can use certain features to make sure that the entire regular expression matches only the part you want. One such mechanism is lookbehind. A positive lookbehind will only match after something that matches the lookbehind expression:
/(?<=\.)([^.]*)/
That will match any sequence of non-periods but only if they come after a period.
Can you use something like string splitting, which allows you to break a string into pieces around a particular string (such as a period)?
It's not clear what language you're using, but nearly every modern language provides a way to split up a string, e.g., in JavaScript:
var myString = "bill.the.pony";
var brokenString = myString.split("."); // ["bill", "the", "pony"]
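For comparison, the three approaches suggested in this thread (capturing group, lookbehind, and splitting) can be put side by side. This is a minimal sketch in Python, chosen only for illustration; the variable names are made up.

```python
import re

text = "name.ns1.ns2"

# 1. Capturing group: findall returns the group, not the whole match.
with_group = re.findall(r"\.([^.]+)", text)

# 2. Positive lookbehind: the period is checked but is not part of the match.
with_lookbehind = re.findall(r"(?<=\.)[^.]+", text)

# 3. Plain string splitting, dropping the part before the first period.
with_split = text.split(".")[1:]

print(with_group, with_lookbehind, with_split)
```

All three give the same list here; splitting is the simplest when the separator is a fixed string.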