ctags and R regex - regex

I'm trying to use ctags with R. Using this answer I've added
--langdef=R
--langmap=r:.R.r
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^f][^u][^n][^c][^t][^i][^o][^n]/\1/g,GlobalVars/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^f][^u][^n][^c][^t][^i][^o][^n]/\1/v,FunctionVariables
to my .ctags file. However when trying to generate tags with the following R file
x <- 1
foo <- function () {
y <- 2
return(y)
}
Only the function foo is recognized. I want to generate tags also for variables (i.e for x and y in my code above). Should I change the regex in the ctags file? Thanks in advance.

Those patterns don't appear to be correct, because a variable is only identified if it is assigned a value that has the same length as "function" but does not share any of its characters. E.g.:
x <- aaaaaaaa
The following ctags configuration should work properly:
--langdef=R
--langmap=R:.R.r
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$/\1/g,GlobalVars/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$/\1/v,FunctionVariables/
The idea here is that a function must be followed by a parenthesis while the variable declarations do not. Given the limitations of a regex parser this parenthesis must appear in the same line as the keyword "function" for this configuration to work.
Update: R support will be included in Universal Ctags, so give it a try and report bugs and missing features.

The problem with the original regex is that it does not capture variable declarations / assignments that are shorter than 8 characters (since the regex requires 8 characters that are NOT f-u-n-c-t-i-o-n in that specific order).
The problem with the regex updated by Vitor is that, it does not capture variable declarations / assignments which incorporate parantheses, a situation which is quite common in R.
And a forgotton issue is that, neither of them captures superassigned objects with <<- (only local assignment with <- is covered).
As I checked the Universal Ctags repo, although pcre regex is planned to be supported as also raised in Issue 519 and a commented out pcre flag exists in the configuration file, there is not yet support for pcre type positive/negative lookahead or lookbehind expressions in ctags unfortunately. When that support starts, things will be much easier.
My solution, first of all, takes into account superassignment by "<{1,2}-" and also the fact that the right side of assignment can include:
either a sequence of 8 characters which are not f-u-n-c-t-i-o-n followed by any or no characters (.*)
OR a sequence of at most 7 any characters (.{1,7})
My proposed regex patterns are as such:
--langdef=R
--langmap=R:.R.r
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<{1,2}-[ \t]([^f][^u][^n][^c][^t][^i][^o][^n].*|.{1,7}$)/\1/g,GlobalVars/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<{1,2}-[ \t]([^f][^u][^n][^c][^t][^i][^o][^n].*|.{1,7}$)/\1/v,FunctionVariables/
And my tests show that, it captures much more objects than the previous ones.
Edit:
Even these do not capture the variables that are declared as arguments to functions. Since ctags regex makes a non-greedy search and non-capturing groups do not work, the result is still limited. But we can at least capture the first arguments to functions and define them as an additional type as a, Arguments:
--regex-R=/.+function[ ]*\([ \t]*(,*([^,= \(\)]+)[,= ]*)/\2/a,Arguments/

Related

Regex Notepad++ Function List

I am trying to create a regex expression that will create a function list for a propietary programming language inside of notepad++ using the functionList.xml file. The regex expression needs to capture all instances of any Method, Function or Macro following this syntax:
Method Blah1(pParam1, pParam2)
[New] dynamicVar = "string"
[New] dynamicVar2 = 4
[New] result = Bar2(dynamicVar, dynamicVar2)
End Method
Function Bar2(pParam3, pParam2)
Foo3
Return pParam3 && pParam2
End Function
Macro Foo3()
End Macro
So the regexp should capture 3 instances for the above example.
Link to regexr: http://www.regexr.com/3fcd7
functionList.xml
<parser
displayName="MOX"
id ="mox_function"
commentExpr="(?s:/\*.*?\*/)|(?m-s://.*?$)"
>
<function
mainExpr="^(s|Method|Macro|Function)"
>
<functionName>
<nameExpr expr="[A-Za-z_]\w*\s*[=:]|[A-Za-z_]?\w*\s*\(" />
<nameExpr expr="[A-Za-z_]?\w*" />
</functionName>
<className>
<nameExpr expr="([A-Za-z_]\w*\.)*[A-Za-z_]\w*\." />
<nameExpr expr="([A-Za-z_]\w*\.)*[A-Za-z_]\w*" />
</className>
</function>
</parser>
Try this.
(?m)^(Method|Macro).*\(.*\)\n.*\n?^\s?End.*$
https://regex101.com/r/pAkFDU/
I've never been able to use FunctionList. It's been known to have stability issues for a long time and is even not listed, by default, in the plugin manager.
It's possible highly likely that a too-inclusive expression won't properly capture nested groups (unless FunctionList recurses through matches, or supports Recursion in Regex). Just something you should be aware of.
Too-inclusive regexes frequently stumble on scenarios like this
Function foo()
Function refoo()
End Function // 1
End Function // 2
Function solo(stuff,thing)
End Function
Function bar()
sample = Function (x,y)
return x & y
End Function
End Function // 3
Either failing at points 1 or 3. It can be quite difficult to properly capture the function, unless you have recursion available. (Function|Method|Macro)[\s\S]*?End \1 captures too little; take the ? away and it captures too much.
Truly recursive ((?R)) Regular Expressions (or Balanced Pairs), which aren't widely supported, can properly capture to the right point, but without code to recurse through it, won't flag the nested functions.
(Function|Method|Macro) (\w+)\((.*)\)((?:(?R)|[\s\S]*?)+)End \1 would correctly stop at point 2.
You can use a similar expression with a lookahead: (Function|Method|Macro) (\w+)\((.*)\)(?=(?:(?R)|[\s\S]*?)+End \1) but it's also easy to fool. All it basically cares about is that End \1, follows it anywhere in the file.
If FunctionList does support recursive regex and doesn't recurse (code-side) through the results (I don't know, the plugin won't work for me), I would look at a simpler expression like below. Honestly, even if it does, you'll get the best performance out of a simple expression. It will work faster and you'll know exactly what it is and is not indicating. Anything else has too much overhead for relatively little gain.
^\s*(Function|Method|Macro) (\w+)\((.+)\)
Output: \2 Args: \3
^\s*(Function|Method|Macro) (\w+)\(\)
Output: \2
The outputs are just examples, the plugin I use lets you customize the output. Others may not. Also, prefixing with ^\s* allows you to find functions not assigned to variables or commented-out.
Or both above can be captured in one expression by changing the first expression's + to a *. It won't tell you if there's a matching End, but it will find the part you're interested in for a document outline.
Personally, I've always had good luck with an a similar plugin ambiguously named "SourceCookifier" but it only parses the file as it was last saved/opened. If I add a new function, it's not listed until save/open.
The interface for adding new languages isn't very intuitive, packing a lot of features in a small area, but it works and it works well. Like FunctionList, and probably most other alternatives, you'll need to add rules for "Mox".
I'm not saying you should switch to this particular utility, but FunctionList hasn't been updated in 7 years.

Regex expression to recognize XdY+Z OR XdY

I've been trying to develop a program that will be used for DMing in an MMORPG but I'm having trouble parsing for the actual regex expression I need.
To quote myself from another thread on a less active forum:
I've officially taken over the DiceRoller addon from years and years ago and I've reworked it a lot since I've taken it over and done a lot of testing in game. While I haven't uploaded anything yet, I've been struggling on a piece of regex expression that is currently crucial to the design of the addon.
Some background: the newest iteration of the DiceRoller addon makes it so you can type "!XdY" (where X is the number of dice, Y is the dice value) into raid chat and the DM who has the addon will go through some logic in the addon (random number lua protocol) and then spit out an input after adding up the dice.
It is as follows:
local count, size = string.match(message, "^!(%d+)[dD](%d+)$")
Now the functionality I need it to do is parse for both "!XdY" OR "XdY+Z", but it seems as if I can't get close to "XdY+Z" no matter which regex expression I use since I need it to do both expressions. I can provide more source code context if necessary.
This is the closest I've ever gotten:
http://i.imgur.com/eMhPHQB.png
and this is with the regex expression:
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)+?(%d+)$")
As you can see, with the modifier it will work just fine. However, remove the modifier the regex expression still thinks that it is "XdY+Z" and so with "1d20" it think it is "1d2+0". It will think 1d200 is "1d20+0", etc. I've tried moving around the optional character "?" but it just causes the expression to not work at all. If I do !1d2 it doesn't work. It's almost as if the optional character NEEDS to be there?
Thanks for the help ahead of time, I've always struggled with regex.
local function dice(input)
local count, size, modifier = input:match"^!(%d+)[dD](%d+)%+?(%d*)$"
if count then
return tonumber(count), tonumber(size), tonumber("0"..modifier)
end
end
for _, input in ipairs{"!1d6", "!1d24", "!1d200", "!1d2+4", "!1d20+24"} do
print(input, dice(input))
end
Output:
!1d6 1 6 0
!1d24 1 24 0
!1d200 1 200 0
!1d2+4 1 2 4
!1d20+24 1 20 24
Lua regular expressions are very limited. You would need to use ^!(%d+)[dD](%d+)(?:+(%d+))?$ but this wouldn't be supported because of (?:+(%d+))? that uses a non-capturing group and a modifier on a group, both are not supported by Lua Patterns.
Consider using a regex library like this one that allows you to use PCRE, PHP regex engine, one of the most complete engine. But that would be overkill if you only want to use it for this regex. You can do it by code then, wouldn't be so hard for a simple task like this.
While Lua patterns are not powerful enough to parse this with one expression (as they don't support optional groups), there is an easy option to handle it with two expressions:
-- check the longer expression first
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)+(%d+)$")
if not count then
count, size = string.match(message, "^!(%d+)[dD](%d+)$")
end

Apply different scopes to the same character in a Sublime Text syntax definition file

I'm trying to create a new .tmLanguage file for Sublime Text 2 that defines special fenced code blocks for knitr and Markdown. These blocks take this form:
```{r example_chunk, echo=true}
x <- rnorm(100)
y <- rnorm(100)
plot(y ~ x, pch=20)
```
There are two sections: (1) the parameters ({r ...}) and (2) the actual embedded code (between the closing } and the ``` at the end). There are four scopes that need to be applied to delineate the two sections:
punctuation.definition.parameters.begin.knitr
punctuation.definition.parameters.end.knitr
punctuation.section.embedded.begin.knitr
punctuation.section.embedded.end.knitr
Using regular expressions to peg these scopes to parts of code is easy enough (partial code available here). However, two of these scopes need to be applied to the same single character: the final } in the parameters section, which ends the parameters and signifies the beginning of the fenced/embedded code.
However, it appears to be impossible to assign two scopes to the same character in a .tmLanguage file. It's not possible to end the parameters section and begin the embedded section. The first scope defined takes precedence, thus breaking the syntax highlighting.
Is there a way to use a .tmLanguage syntax definition to apply two different scopes to the same character in Sublime Text? If not, is there some way I can peg punctuation.definition.parameters.end.knitr and punctuation.section.embedded.begin.knitr to two different somethings instead of the single {? (Keeping in mind that I can't add additional characters to the code block.)
As it turns out, it appears that it's impossible to use two punctuation definitions on the same character (unless there's some way to use some crazy system of nesting to get it to work).
But I fortunately figured out a workaround: a scope can be assigned to a \n. So I can use the following regexes to simulate the overlapping scopes:
(\}) will capture the closing brace, which can be used as punctuation.section.embedded.begin.knitr
(?<=\})(.*)(\n) will capture the \n following a closing brace as the second found group, allowing me to assign it to punctuation.section.embedded.begin.knitr
Tricky, but it works.
There is an easy way I've found out. Example code in JSON:
{
"match": "((\w+))",
"captures": {
"1": {"name": "scope.one"},
"2": {"name": "scope.two"}
}
}
Should work to my knowledge.
On the other hand you can do a better job by ending the first scope with
"match": "(?=(?:\}))" or just live with the ST3 language spec. :)
- match: '((what you want to match))'
captures:
1: first.scope.name
2: second.scope.name

Regex for converting file path to package/namespace

Given the following file path:
/Users/Lawrence/MyProject/some/very/interesting/Code.scala
I would like to generate the following using a single regex replace (the root can be a constant):
some.very.interesting
This is for the purpose of generating a snippet for Sublime Text which can automatically insert the correct package/namespace header for my scala/java classes :)
Sublime Text uses the following syntax for their regex replace patterns (aka 'substitutions'):
{input/regex/replace/flags}
Hence why an iterative approach cannot be taken - it has to be done in one pass! Also, substitutions cannot be nested :(
If you know the maximum number of nested folders.You can specify that in your regex.
For 1 to 3 nested folders
Regex:/Users/Lawrence/MyProject/(\w+)/?(\w+)?/?(\w+)?/[^/]+$
Replace:$1.$2.$3
For 1 to 5 nested folders
Regex:/Users/Lawrence/MyProject/(\w+)/?(\w+)?/?(\w+)?/?(\w+)?/?(\w+)?/[^/]+$
Replace:$1.$2.$3.$4.$5
Given the constraints this is only thing you can do
Input
/Users/Lawrence/MyProject/some/very/interesting/Code.scala
Regex
^/Users/Lawrence/MyProject/[^/]+/[^/]+/[^/]+/Code.scala
or
^/[^/]+/[^/]+/[^/]+/([^/]+)/([^/]+)/([^/]+)/
Replace
\1.\2.\3
Update
This gets you closer, but not exactly it:
Regex
(^/Users/Lawrence/MyProject/|/Code\.scala$|/)
Replacement
.
Output would be:
.some.very.interesting.
Without multiple replacements in a single line and without recursive back references it's going to be hard.
You might have to do a second replacement, replacing something like this with an empty string (if you can):
(^\.|\.$)

RegEx to match C# Interface file names only

In the Visual Studio 2010 "Productivity Power Tools" plugin (which is great), you can configure file tabs to be color coded based on regular expressions.
I have a RegEx to differentiate the tab color of Interface files (IMyInterface.cs) from regular .cs files:
[I]{1}[A-Z]{1}.*\.cs$
Unfortunately this also color codes any file that starts with a capital "I" (Information.cs, for example).
How could this RegEx be modified to only include files where the first letter is "I" and the second letter is not lowercase?
Your regexp should work as it is. It is possible that it is executed in ignore case mode. Try to disable that mode inside your regexp with (?-i):
(?-i)[I]{1}[A-Z]{1}.*\.cs$
Try this
"(?-i)^I[A-Z].*\.cs$"
Sets case insensitve off first.
Regular Expression Options
Filenames in Windows are not case-sensitive, so obviously Power Tools will be using case-insensitive matching.
How about this:
^I([A-Z][A-Za-z0-9]*){1}\.cs$
so
IMyInterface.cs // matches, MyInterface
IB.cs // B
IBa.cs // Ba
IC1.cs // C1
I.cs // don't
Information.cs // don't
Prooflink
I based mine off the default patterns placed in there and used ^I[A-Z].*\.cs[ ]*(\[read only\])?$ - I think that there is a precedence question, though, so that if you leave the default .cs pattern matcher in there and add yours to the end, you might have yours hidden, because it matched the general one first.
And you can't re-order or delete them, so it's a little fiddly to get the ordering working well ...
FWIW, I don't think the case-sensitivity question ((?-i) makes any difference.