How to create regular expression to get all functions from code - regex

I have some problem with my regular expression. I need to find all functions in text. I have this regular expression \w*\([^(]*\). It works fine until text does not contais brackets without function name. For example for this string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my c# application to parse passed to my method string. Can anybody help me?
Thanks!

Programming languages have a hierarchical structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use an LR-parser. If you simply want to apply a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.

Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match will need to contain atleast one text character.
Hope that helps

Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).

It largely depends on the definition of "function name". For example, based on your description you only want to filter out the "empty"names, and not want to find all valid names.
If your current solution is largely enough, and you have problems with this empty names, then try to change the * to a +, requiring at least one word character right before the bracket.
\w+([^(]*)
OR
\w\w*([^(]*)
Depending on your regexp application's syntax.

(\w+)\(
regex groups would have the names of variables without any parentesis, you can add them later if you want, i supposed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy regex (it would match nested functions calls)
or...
\w+\([^)]*\)
for a non-greedy regex (doesn't match nested function calls, will match only the inner one)

Related

Efficient regex to insert missing apostrophes?

I'm trying to build a regular expression and replacement string that I can use to insert missing apostrophes. Examples:
Dont -> Don't
Ill -> I'll
I can get this working with capture groups, but I'm trying to only have to call .Replace one time. Right now I have something like:
$apostropheregex = '\b((didn|won|ain|don)(t)|(i)(ll|m))\b'
$apostrophereplacement='$2$4''$3$5'
But it feels ugly to be mashing together both prefix groups and both postfix groups with the assumption that we only matched one or the other (either a "ll" or a "t" match)
Does anyone have any suggestions? Is there a better way to approach this problem? Should I indeed treat these as two separate scenarios and run replace twice with separate regexes and replacement strings?
Update: To clarify, I'm aware that this could have unintended consequences, replacing strings that shouldn't be replaced since English grammatical context is not considered. I'm running this manually after reviewing strings first and I still think this is an interesting question.
Just a note: This is ill suited for... ill suited, which becomes i'll suited.
But you asked for a better regex and you shall receive. I would use:
\b(?|(don)(t)|(won)(t)|(you)(re))\b
The replacement will be $1'$2.
The main advantage of this regex is legibility. You should easily be able to add new alterations.
It works by using the branch reset group (?| ). This means that each alternation uses $1 and $2 (instead of 2n+1 and 2n+2).

In gedit, highlight a function call within the parenthesis

I'm currently editing my javascript.lang file to highlight function names.
Here is my expression for gtksourceview that I am currently using.
<define-regex id="function-regex" >
(?<=([\.|\s]))
([a-z]\w*)
(?=([\(].*))(?=(.*[\)]))
</define-regex>
here's the regex by itself
(?<=([\.|\s]))([a-z]\w*)(?=([\(].*))(?=(.*[\)]))
It appears to work for situations such as, foo(A) which I am satisfied with.
But where I am having trouble is if I want it to highlight a function name within the parentheses of another function call.
foo(bar(A))
or to put it more rigorously
foo{N}(foo{N-1}(...(foo{2}(foo{1}(A))...))
So with the example,
foo(bar(baz(A)))
my goal is for it to highlight foo, bar, baz and nothing else.
I don't know how to handle the bar function. I have read about a way of doing regex recursively with (?R) or (?0) but I have not had any success using that to highlight functions recursively in gedit.
P.S.
Here are the tests that I am currently using to determine success.
initialDrawGraph(toBeSorted);
$(element).removeClass(currentclass);
myFrame.popStack();
context.outputCurrentSortOrder(V);
myFrame.nextFunction = sorter.Sort.;
context.outputToDivConsole(formatStr(V),1);
Balancing parentheses is not a regular expression, since it needs memory (See: Can regular expressions be used to match nested patterns?). For some implementations, there is an implementation for recursion in regular expressions:
Matching Balanced Constructs
The main purpose of recursion is to match balanced constructs or
nested constructs. The generic regex is b(?:m|(?R))*e where b is
what begins the construct, m is what can occur in the middle of the
construct, and e is what can occur at the end of the construct. For
correct results, no two of b, m, and e should be able to match
the same text. You can use an atomic group instead of the
non-capturing group for improved performance: b(?>m|(?R))*e.
A common real-world use is to match a balanced set of parentheses.
\((?>[^()]|(?R))*\) matches a single pair of parentheses with any
text in between, including an unlimited number of parentheses, as long
as they are all properly paired. If the subject string contains
unbalanced parentheses, then the first regex match is the leftmost
pair of balanced parentheses, which may occur after unbalanced opening
parentheses. If you want a regex that does not find any matches in a
string that contains unbalanced parentheses, then you need to use a
subroutine call instead of recursion. If you want to find a sequence
of multiple pairs of balanced parentheses as a single match, then you
also need a subroutine call.
Ok, looks like I was making this more complicated than it needed to be.
I was able to achieve what I needed with this simpler regex. I just told it to stop looking for the close parenthesis.
([a-zA-Z0-9][a-zA-Z0-9]*)(?=\()
The following regex works for nested functions (Note: This is the python version of regex. You may or may not need to make some syntax tweaks. Hopefull, you'll get the idea):
[OBSOLETED] '(\w+\()+[^\)]*\)+'
[UPDATED] (Should Work. Hopefully)
(\w+\()+([^\)]*\)+)*

How would I use a regular expression to find calls to functions?

Suppose I want find all function calls in my listing (a vb.net listing), and I have the function name.
first I thought I could do a regular expression such as:
myfunc\( .* \)
That should work even if the function spans multiple lines, assuming that the dot is interpreted as including newlines (there is an option to do this in dot-net)
but then I realized that some of my arguments themselves could be function calls.
in other words:
myfunc( a,b,c,d(),e ),
which means that the parentheses don't match up.
so I thought that since the main function call usually is the first item on a line, I could do this:
^myfunc( .* \) $
The idea is that the function is the first item on a line (^) and the last paren is the last item on a line ($). but that doesn't work either.
What am I doing wrong?
You can't. By design, regular expressions cannot deal with recursion which is needed here.
For more information, you might want to read the first answer here: Can regular expressions be used to match nested patterns?
And yes, I know that some special "regular expressions" do allow for recursion. However, in most cases, this means that you are doing something horrible. It is much better to use something that can actually understand the syntax of your language.
This is not a direct answer to your question, but if you want to find all uses of your function you can use Visual Studio. Just right click on the function, and then select Find All References:
Visual Studio will show you the results. You can then double click on each line and Visual Studio will take you there.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

Is stringing together multiple regular expressions with "or" safe?

We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with case to break this but still fee uneasy about the implementation.
Is this safe to do?
Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.
As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.
It's as safe as anything else in regular expressions!
As far as regexes go , Google code search provides regexes for searches so ... it's possible to have safe regexes
I don't see any possible problem too.
I guess by saying 'Safe' you mean that it will match as you needed (because I've never heard of RegEx security hole). Safe or not, we can't tell from this. You need to give us more detail like what the full regex is. Do you wrap it with group and allow multiple? Do you wrap it with start and end anchor?
If you want to match a few class file name make sure you use start and end anchor to be sure the matching is done from start til end. Like this "^(file1|file2)\.class$". Without start and end anchor, you may end up matching 'my_file1.class too'
The answer is that yes this is safe, and the reason why this is safe is that the '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would end up with positional matches whereas the first would not, however the two would match exactly the same input. Or to put it another way, using the Java parlance:
String.matches("regexpa|regexpb|regexpc");
is equivalent to
String.matches("regexpa") | String.matches("regexpb") | String.matches("regexpc");