I'm currently editing my javascript.lang file to highlight function names.
Here is my expression for gtksourceview that I am currently using.
<define-regex id="function-regex" >
(?<=([\.|\s]))
([a-z]\w*)
(?=([\(].*))(?=(.*[\)]))
</define-regex>
here's the regex by itself
(?<=([\.|\s]))([a-z]\w*)(?=([\(].*))(?=(.*[\)]))
It appears to work for situations such as, foo(A) which I am satisfied with.
But where I am having trouble is if I want it to highlight a function name within the parentheses of another function call.
foo(bar(A))
or to put it more rigorously
foo{N}(foo{N-1}(...(foo{2}(foo{1}(A))...))
So with the example,
foo(bar(baz(A)))
my goal is for it to highlight foo, bar, baz and nothing else.
I don't know how to handle the bar function. I have read about a way of doing regex recursively with (?R) or (?0) but I have not had any success using that to highlight functions recursively in gedit.
P.S.
Here are the tests that I am currently using to determine success.
initialDrawGraph(toBeSorted);
$(element).removeClass(currentclass);
myFrame.popStack();
context.outputCurrentSortOrder(V);
myFrame.nextFunction = sorter.Sort.;
context.outputToDivConsole(formatStr(V),1);
Balancing parentheses is not a regular expression, since it needs memory (See: Can regular expressions be used to match nested patterns?). For some implementations, there is an implementation for recursion in regular expressions:
Matching Balanced Constructs
The main purpose of recursion is to match balanced constructs or
nested constructs. The generic regex is b(?:m|(?R))*e where b is
what begins the construct, m is what can occur in the middle of the
construct, and e is what can occur at the end of the construct. For
correct results, no two of b, m, and e should be able to match
the same text. You can use an atomic group instead of the
non-capturing group for improved performance: b(?>m|(?R))*e.
A common real-world use is to match a balanced set of parentheses.
\((?>[^()]|(?R))*\) matches a single pair of parentheses with any
text in between, including an unlimited number of parentheses, as long
as they are all properly paired. If the subject string contains
unbalanced parentheses, then the first regex match is the leftmost
pair of balanced parentheses, which may occur after unbalanced opening
parentheses. If you want a regex that does not find any matches in a
string that contains unbalanced parentheses, then you need to use a
subroutine call instead of recursion. If you want to find a sequence
of multiple pairs of balanced parentheses as a single match, then you
also need a subroutine call.
Ok, looks like I was making this more complicated than it needed to be.
I was able to achieve what I needed with this simpler regex. I just told it to stop looking for the close parenthesis.
([a-zA-Z0-9][a-zA-Z0-9]*)(?=\()
The following regex works for nested functions (Note: This is the python version of regex. You may or may not need to make some syntax tweaks. Hopefull, you'll get the idea):
[OBSOLETED] '(\w+\()+[^\)]*\)+'
[UPDATED] (Should Work. Hopefully)
(\w+\()+([^\)]*\)+)*
Related
I am looking for a regular expression which will match all occurrences of foo unless it is in a \ref{..} command. So \ref{sec:foo} should not match.
I will probably want to add more commands for which the arguments should be excluded later so the solution should be extendable in that regard.
There are some similar questions trying to detect when something is in parenthesis, etc.
Regex to match only commas not in parentheses?
Split string by comma if not within square brackets or parentheses
The first has an interesting solution using alternatives: \(.*?\)|(,) the first alternative matches the unwanted versions and the second has the match group. However, since I am using this is in a search&replace context I cannot really use match groups.
Finding a regex for what you want will need variable length look behind which is only available with very limited languages (like C# and PyPi in Python) hence you will have to settle with a less than perfect. You can use this regex to match foo that is not within curly braces as long as you don't have curly nested curly braces,
\bfoo\b(?![^{}]*})
This will not match a foo inside \ref{sec:foo} or even in somethingelse{sec:foo} as you can see it only doesn't match a foo that isn't contained in a curly braces. If you need a precise solution, then it will need variable length look behind support which as I said, is available in very limited languages.
Regex Demo
Specifically, I want to match functions in my Javascript code that are not in a set of common standard Javascript functions. In other words, I want to match user defined functions. I'm working with vim's flavour of regexp, but I don't mind seeing solutions for other flavours.
As I understand it, regexp crawls through a string character by character, so thinking in terms of sets of characters can be problematic even when a problem seems simple. I've tried negative lookahead, and as you might expect all the does is prevent the first character of the functions I don't want from being matched (ie, onsole.log instead of console.log).
(?(?!(if)|(console\.log)|(function))\w+)\(.*\)
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
https://regexr.com/
I would like to be able to crawl through a function and see where it is calling other usermade functions. Will I need to do post-processing (ie mapping with another regexp) on the matches to reject matches I don't want, or is there a way to do this in one regexp?
The basic recipe for a regular expression that matches all words except foo (in Vim's regular expression syntax) is:
/\<\%(foo\>\)\#!\k\+\>/
Note how the negative lookahead (\#!) needs an end assertion (here: \>) on its own, to avoid that it also excludes anything that just starts with the expression!
Applied to your examples (excluding if (potentially with whitespace), console.log, and function, ending with (), that gives:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\(\k\|\.\)\+\>(.*)
As you seem to want to include the entire object chain (so JSON.parse instead of just parse), the actual match includes both keyword characters (\k) and the period. There's one complication with that: The negative lookahead will latch onto the log() in console.log(), because the leading keyword boundary assertion (\<) matches there as well. We can disallow that match by also excluding a period just before the function; i.e. by placing \.\#<! in between:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\.\#<!\(\k\|\.\)\+\>(.*)
That will highlight just the following calls:
JSON.parse(localStorage["beards"])
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)
This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill either because they restrict their question to a very specific example, or to a specific usercase (e.g: single chars only) or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g: C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a Tennant of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, its not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned it in your question when you have an efficient pattern you use with a match method, to obtain the complementary, the more easy (and efficient) way is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
if this way is very simple to write (if the lookahead is available), it has the inconvenient to be slow since each positions of the string are tested with the lookahead.
The amount of lookahead tests can be reduced if you can use the first character of the pattern (lets say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Lets say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way becomes quickly fastidious when the number of characters is important, but it can be useful for regex flavors that haven't many features like POSIX regex.
Here's the regexp:
/\.([^\.]*)/g
But for string name.ns1.ns2 it catches .ns1 and .ns2 values (which does make perfect sense). Is it possible only to get ns1 and ns2 results? Maybe using assertions, nuh?
You have the capturing group, use its value, however you do it in your language.
JavaScript example:
var list = "name.ns1.ns2".match(/\.([^.]+)/g);
// list now contains 'ns1' and 'ns2'
If you can use lookbehinds (most modern regex flavors, but not JS), you can use this expression:
(?<=\.)[^.]+
In Perl you can also use \K like so:
\.\K[^.]+
I'm not 100% sure what you're trying to do, but let's go through some options.
Your regex: /\.([^\.]*)/g
(Minor note: you don't need the backslash in front of the . inside a character class [..], because a . loses its special meaning there already.)
First: matching against a regular expression is, in principle, a Boolean test: "does this string match this regex". Any additional information you might be able to get about what part of the string matched what part of the regex, etc., is entirely dependent upon the particular implementation surrounding the regular expression in whatever environment you're using. So, your question is inherently implementation-dependent.
However, in the most common case, a match attempt does provide additional data. You almost always get the substring that matched the entire regular expression (in Perl 5, it shows up in the $& variable). In Perl5-compatible regular expressions, f you surround part of the regular expression with unquoted parentheses, you will additiionally get the substrings that matched each set of those as well (in Perl 5, they are placed in $1, $2, etc.).
So, as written, your regular expression will usually make two separate results available to you: ".ns1", ".ns2", etc. for the entire match, and "ns1", "ns2", etc. for the subgroup match. You shouldn't have to change the expression to get the latter values; just change how you access the results of the match.
However, if you want, and if your regular expression engine supports them, you can use certain features to make sure that the entire regular expression matches only the part you want. One such mechanism is lookbehind. A positive lookbehind will only match after something that matches the lookbehind expression:
/(?<\.)([^.]*)/
That will match any sequence of non-periods but only if they come after a period.
Can you use something like string splitting, which allows you to break a string into pieces around a particular string (such as a period)?
It's not clear what language you're using, but nearly every modern language provides a way to split up a string. e.g., this pseudo code:
string myString = "bill.the.pony";
string[] brokenString = myString.split(".");
I have some problem with my regular expression. I need to find all functions in text. I have this regular expression \w*\([^(]*\). It works fine until text does not contais brackets without function name. For example for this string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my c# application to parse passed to my method string. Can anybody help me?
Thanks!
Programming languages have a hierarchical structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use an LR-parser. If you simply want to apply a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.
Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match will need to contain atleast one text character.
Hope that helps
Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).
It largely depends on the definition of "function name". For example, based on your description you only want to filter out the "empty"names, and not want to find all valid names.
If your current solution is largely enough, and you have problems with this empty names, then try to change the * to a +, requiring at least one word character right before the bracket.
\w+([^(]*)
OR
\w\w*([^(]*)
Depending on your regexp application's syntax.
(\w+)\(
regex groups would have the names of variables without any parentesis, you can add them later if you want, i supposed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy regex (it would match nested functions calls)
or...
\w+\([^)]*\)
for a non-greedy regex (doesn't match nested function calls, will match only the inner one)