How to do a case-insensitive query in tree-sitter

I'm working on creating a tree-sitter grammar and using it in a language server I am implementing, in order to support features like finding all references to a variable. Given the grammar, I would be able to write a query to find all of the references to a variable with a specific name (e.g. myVar). However, the language I am writing the language server for uses case-insensitive variable names (e.g. myVar can be referenced as MYVAR, MyVaR, myvar, etc.).
How would I be able to write a tree-sitter query to match a pattern where a token must case-insensitively match a particular string?
I could write the query to not filter by the variable name and implement my own filtering of the results, but I was wondering if there was a way to handle this within the query itself rather than implementing custom filtering code.
Example
Here is a simplified example case to show what I mean.
Given the following grammar, I want to query for all of the set_statements that set a new value to the variable myVar.
module.exports = grammar({
name: 'mylang',
rules: {
source_file: $ => repeat($._statement),
_statement: $ => choice(
$.set_statement,
),
set_statement: $ => seq(
'set',
field("variable", $.identifier),
field("value", $._expression),
),
_expression: $ => choice(
$.integer_literal
),
identifier: $ => /[a-zA-Z0-9]+/,
integer_literal: $ => /[0-9]+/,
}
});
Normally I would be able to do this with a query like the following.
(
(set_statement
variable: (identifier) @variable)
(#eq? @variable "myVar")
)
However, as we can see with the following example of running the query, this only picks up on the references to myVar that use the same casing as the query.
$ cat set_testing.txt
set myVar 0
set MYVAR 23
set myVar2 72
set MyVaR 14
$ tree-sitter query find_variable.query set_testing.txt
set_testing.txt
pattern: 0
capture: variable, start: (0, 4), text: "myVar"
I want to create a query that would instead find:
$ tree-sitter query find_variable.query set_testing.txt
set_testing.txt
pattern: 0
capture: variable, start: (0, 4), text: "myVar"
pattern: 0
capture: variable, start: (1, 4), text: "MYVAR"
pattern: 0
capture: variable, start: (3, 4), text: "MyVaR"

Change your query to use a regular expression that matches every upper/lower-case combination of the identifier, in this case myvar.
For example, change find_variable.query to use the #match? predicate with a character class for each letter:
(
(set_statement
variable: (identifier) @variable)
(#match? @variable "^[mM][yY][vV][aA][rR]$")
)
Now running tree-sitter query find_variable.query set_testing.txt returns:
set_testing.txt
pattern: 0
capture: variable, start: (0, 4), text: "myVar"
pattern: 0
capture: variable, start: (1, 4), text: "MYVAR"
pattern: 0
capture: variable, start: (3, 4), text: "MyVaR"
Tree-sitter does not support case-insensitive regular expression searches (see Issue #261), so the regular expression has to spell out both cases of every letter and ends up a little longer.
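Writing the character classes by hand gets tedious for longer names, so the pattern can be generated. The following is a minimal sketch, not part of tree-sitter itself: caseInsensitivePattern and buildSetQuery are hypothetical helper names, and the sketch assumes identifiers are plain ASCII alphanumerics, as in the example grammar.

```javascript
// Hypothetical helper (not part of tree-sitter): expand a name into a
// pattern with one [lowerUPPER] character class per letter.
function caseInsensitivePattern(name) {
  const body = Array.from(name, (ch) => {
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    if (lower !== upper) {
      return `[${lower}${upper}]`; // letter: accept both cases
    }
    // Non-letter (e.g. a digit): keep it, escaping regex metacharacters.
    return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  }).join('');
  return `^${body}$`;
}

// Hypothetical helper: build the query text shown above for any variable name.
function buildSetQuery(name) {
  return [
    '(',
    '  (set_statement',
    '    variable: (identifier) @variable)',
    `  (#match? @variable "${caseInsensitivePattern(name)}")`,
    ')',
  ].join('\n');
}

console.log(caseInsensitivePattern('myVar')); // ^[mM][yY][vV][aA][rR]$
console.log(buildSetQuery('myVar'));
```

The generated text can be written to find_variable.query or passed to whatever query API the language server uses.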

Related

How to extract parameter names and values using regular expressions

I would like to know how to extract the values of all these parameters.
My regular expression:
([\w]+)(\s*=\s*)(['|"|\w])(.+)['|"|\w]
Parameter names and values that should match:
name='John Doe'
name=John Doe
organization=Acme Widgets Inc.
server=192.0.2.62
port='143'
file="payroll.dat"
DOS=HIGH,UMB
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE RAM
DEVICEHIGH=C:\DOS\ANSI.SYS
FILES=30
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P
When I run my expression on regex101.com, it only finds the first parameter that matches, in this case name='John Doe'.
Desired output is name John Doe
I am having extra trouble understanding how to find and extract parameter names and values without parentheses and equals signs.
Try this:
(\w+)\s*=\s*['"]?([^'"\n]+)
The keyword will be in capture group 1 (there's no need for [] around \w).
There's no need for a capture group around the equal sign.
['"]? allows an optional quote after the equal sign. There's no need to put it in a capture group.
([^'"\n]+) then matches everything that isn't another quote or newline. So it will capture everything until either a quote or the end of the line. This value will be put into group 2.
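As a rough illustration of the explanation above, here is the same pattern applied in JavaScript. This is a sketch under assumptions: the names pairPattern, sample and pairs are made up for the example, and the g flag is added so that every line is matched rather than only the first.

```javascript
// The pattern from the answer above, with the g flag so matchAll finds every pair.
const pairPattern = /(\w+)\s*=\s*['"]?([^'"\n]+)/g;

// String.raw keeps the backslashes in the DOS paths literal.
const sample = String.raw`name='John Doe'
port='143'
file="payroll.dat"
DEVICE=C:\DOS\HIMEM.SYS
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P`;

// Group 1 is the parameter name, group 2 is the value without quotes.
const pairs = Array.from(sample.matchAll(pairPattern), (m) => [m[1], m[2]]);

for (const [key, value] of pairs) {
  console.log(key, value); // first line printed: name John Doe
}
```

On regex101.com the equivalent of the g flag is the "global" option, which may be why only the first parameter was reported.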
I hope this will be useful for you:
/^.*?=.*?$/gm
I test the pattern in: https://regexr.com/
var str = String.raw`
name='John Doe'
name=John Doe
organization=Acme Widgets Inc.
server=192.0.2.62
port='143'
file="payroll.dat"
DOS=HIGH,UMB
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE RAM
DEVICEHIGH=C:\DOS\ANSI.SYS
FILES=30
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P`;
console.log( str.match(/^.*?=.*?$/gm).map(str => str.replace(/("|')/g, '').replace(/=/g, ' ') ) )

Single RegEx to catch multiple options and replace with their corresponding replacements

The problem goes like this:
value match: 218\d{3}(\d{4})#domain.com replace with 10\1 to get 10 followed by last 4 digits
for example 2181234567 would become 104567
value match: 332\d{3}(\d{4})#domain.com replace with 11\1 to get 11 followed by last 4 digits
for example 3321234567 would become 114567
value match: 420\d{3}(\d{4})#domain.com replace with 12\1 to get 12 followed by last 4 digits
..and so on
for example 4201234567 would become 124567
Is there a better way to catch different values and replace with their corresponding replacements in a single RegEx than creating multiple expressions?
Like matching (218|332|420)\d{3}(\d{4})#domain.com, replacing with something like (10\4|11\4|12\4), and getting just the corresponding result for whichever prefix matched.
Edit: Didn't specify the use case: It's for my PBX, that just uses RegEx to match patterns and then replace it with the values I want it to go out with. No code. Just straight up RegEx in the GUI.
Also for personal use, if I can get it to work with Notepad++
Ctrl+H
Find what: (?:(218)|(332)|(420))\d{3}(\d{4})(?=#domain\.com)
Replace with: (?{1}10$4)(?{2}11$4)(?{3}12$4)
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
(?: # non capture group
(218) # group 1, 218
| # OR
(332) # group 2, 332
| # OR
(420) # group 3, 420
) # end group
\d{3} # 3 digits
(\d{4}) # group 4, 4 digits
(?=#domain\.com) # positive lookahead, make sure we have "#domain.com" after
# that allows to keep "#domain.com"
# if you want to remove it from the result, just put "#domain\.com"
# without lookahead.
Replacement:
(?{1} # if group 1 exists
10 # insert "10"
$4 # insert content of group 4
) # endif
(?{2}11$4) # same as above
(?{3}12$4) # same as above
I don't think you can use a single regular expression to conditionally replace text as per your example. You either need to chain multiple search & replace, or use a function that does a lookup based on the first captured group (first three digits).
You did not specify the language used, and regular expression features vary by language. Here is a JavaScript code snippet that uses the function-with-lookup approach:
var str1 = '2181234567#domain.com';
var str2 = '3321234567#domain.com';
var str3 = '4201234567#domain.com';
var strMap = {
'218': '10',
'332': '11',
'420': '12'
// add more as needed
};
function fixName(str) {
var re = /(\d{3})\d{3}(\d{4})(?=\#domain\.com)/;
var result = str.replace(re, function(m, p1, p2) {
return strMap[p1] + p2;
});
return result;
}
var result1 = fixName(str1);
var result2 = fixName(str2);
var result3 = fixName(str3);
console.log('str1: ' + str1 + ', result1: ' + result1);
console.log('str2: ' + str2 + ', result2: ' + result2);
console.log('str3: ' + str3 + ', result3: ' + result3);
Output:
str1: 2181234567#domain.com, result1: 104567#domain.com
str2: 3321234567#domain.com, result2: 114567#domain.com
str3: 4201234567#domain.com, result3: 124567#domain.com
@Toto has a nice answer, and there is another method if the operator (?{1}...) is not available (but thanks, Toto, I did not know this feature of Notepad++).
More details on my answer here: https://stackoverflow.com/a/63676336/1287856
Append to the end of the doc:
,218=>10,332=>11,420=>12
Search for:
(218|332|420)\d{3}(\d{4})(?=#domain.com)(?=[\s\S]*,\1=>([^,]*))
Replace with
\3\2
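To see why this works, the same trick can be sketched in JavaScript, whose regex engine also allows a backreference and a capture group inside a lookahead. The variable names below are made up for the example, and the dot in domain\.com is escaped here, which the search expression above does not do.

```javascript
// Input lines followed by the appended lookup table (prefix=>replacement).
const text = [
  '2181234567#domain.com',
  '3321234567#domain.com',
  '4201234567#domain.com',
  ',218=>10,332=>11,420=>12',
].join('\n');

// Group 1: the prefix. Group 2: the last four digits.
// Group 3: the replacement the second lookahead finds in the table by
// searching ahead for ",<prefix>=>" and capturing up to the next comma.
const lookupPattern =
  /(218|332|420)\d{3}(\d{4})(?=#domain\.com)(?=[\s\S]*,\1=>([^,]*))/g;

const replaced = text.replace(lookupPattern, '$3$2');
console.log(replaced);
// 104567#domain.com
// 114567#domain.com
// 124567#domain.com
// ,218=>10,332=>11,420=>12
```

The table line itself is left untouched by the replacement, so it has to be deleted afterwards.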

Groovy regex PatternSyntaxException when parsing GString-style variables

Groovy here. I'm being given a String with GString-style variables in it like:
String target = 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
Keep in mind, this is not intended to be used as an actual GString!!! That is, I'm not going to have 3 string variables (animal, role and bodyPart, respectively) that Groovy will be resolving at runtime. Instead, I'm looking to do 2 distinct things to these "target" strings:
I want to be able to find all instances of these variable refs ("${*}") in the target string, and replace each one with a ?; and
I also need to find all instances of these variable refs and obtain a list (allowing dupes) of their names (which in the above example would be [animal,role,bodyPart])
My best attempt thus far:
class TargetStringUtils {
private static final String VARIABLE_PATTERN = "\${*}"
// Example input: 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
// Example desired output: 'How now brown ?. The ? has oddly-shaped ?.'
static String replaceVarsWithQuestionMarks(String target) {
target.replaceAll(VARIABLE_PATTERN, '?')
}
// Example input: 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
// Example desired output: [animal,role,bodyPart] (a list of strings)
static List<String> collectVariableRefs(String target) {
target.findAll(VARIABLE_PATTERN)
}
}
...produces a PatternSyntaxException any time I go to run either method:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal repetition near index 0
${*}
^
Any ideas where I'm going awry?
The issue is that you have not escaped the pattern properly, and findAll will only collect all matches, while you need to capture a subpattern inside the {}.
Use
def target = 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
println target.replaceAll(/\$\{([^{}]*)\}/, '?') // => How now brown ?. The ? has oddly-shaped ?.
def lst = new ArrayList<>();
def m = target =~ /\$\{([^{}]*)\}/
(0..<m.count).each { lst.add(m[it][1]) }
println lst // => [animal, role, bodyPart]
Inside a /\$\{([^{}]*)\}/ slashy string, you can use single backslashes to escape the special regex metacharacters, and the whole regex pattern looks cleaner.
\$ - will match a literal $
\{ - will match a literal {
([^{}]*) - Group 1 capturing any characters other than { and }, 0 or more times
\} - a literal }.
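The pattern is not Groovy-specific. Purely as an illustration of the breakdown above, here is the same regex in JavaScript; the variable names are invented for the sketch, and the g flag is what makes both the replacement and the collection cover every match.

```javascript
// Single quotes, so JavaScript does not treat ${...} as template interpolation.
const target = 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.';

// \$ and \{ and \} are literals; group 1 captures the name between the braces.
const variableRef = /\$\{([^{}]*)\}/g;

// Task 1: replace every variable ref with a question mark.
const masked = target.replace(variableRef, '?');

// Task 2: collect the captured names, duplicates included.
const names = Array.from(target.matchAll(variableRef), (m) => m[1]);

console.log(masked); // How now brown ?. The ? has oddly-shaped ?.
console.log(names);  // [ 'animal', 'role', 'bodyPart' ]
```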

In emacs, can I use alternation in the regexp for align-regexp?

For example, I have the following snippet:
'abc' => 1,
'abcabc' =>2,
'abcabcabc' => 3,
And I want to format it to:
'abc'       => 1,
'abcabc'    => 2,
'abcabcabc' => 3,
I know there are easier ways to do it, but here I just want to practice my understanding of align-regexp. I've tried this command, but it does not work:
C-u M-x align-regexp \(\s-+\)=\|\(>\s-*\)\d 1 1 y
Where am I going wrong?
Thanks.
So the question is: with \(\s-+\)=\|\(>\s-*\)\d matching \(\s-+\)= or \(>\s-*\)\d [1], can we use align-regexp to align on each of those alternatives throughout a line?
The answer is no -- align-regexp modifies one specific matched group of the regexp. In this case it was group 1, and group 1 is the \(\s-+\) at the beginning. Group 1 of the regexp does not vary depending on what was actually matched, and so it never refers to \(>\s-*\) [2].
If you can express your regexp such that it really is a single group of the regexp which should be replaced for every match throughout the line, you can get the effect you want, however.
e.g. >?\(\s-*\)[0-9=] would -- at least for the data shown -- give the desired result.
[1] In Emacs \d matches d. That should be [0-9].
[2] You generally don't want any non-whitespace in the alignment group, as Emacs replaces the content of that group.

.NET regex with quote and space

I'm trying to create a regex to match this:
/tags/ud617/?sort=active&page=2" >2
So basically, "[number]" is the only dynamic part:
/tags/ud617/?sort=active&page=[number]" >[number]
The closest I've been able to get (in PowerShell) is:
[regex]::matches('/tags/ud617/?sort=active&page=2" >2','/tags/ud617/\?sort=active&page=[0-9]+')
But this doesn't provide me with a full match of the dynamic string.
Ultimately, I'll be creating a capture group:
/tags/ud617/?sort=active&page=([number])
Seems easy enough:
$regex = '/tags/ud617/\?sort=active&page=(\d+)"\s>2'
'/tags/ud617/?sort=active&page=2" >2' -match $regex > $null
$matches[1]
2
[regex]::matches('/tags/ud617/?sort=active&page=3000 >2','/tags/ud617/\?sort=active&page=(\d+) >(\d+)')
Outputs:
Groups : {/tags/ud617/?sort=active&page=3000 >2, 3000, 2}
Success : True
Captures : {/tags/ud617/?sort=active&page=3000 >2}
Index : 0
Length : 41
Value : /tags/ud617/?sort=active&page=3000 >2
This captures both the page value (3000) and the number after the greater-than sign (2).
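Neither snippet above matches the original input exactly as posted, with the double quote present and both numbers dynamic. A minimal sketch that does, written in JavaScript for illustration only (the variable names are invented; the .NET pattern would be the same minus the escaped forward slashes):

```javascript
// The string from the question, including the double quote before " >".
const html = '/tags/ud617/?sort=active&page=2" >2';

// \? escapes the literal question mark; the quote is matched literally;
// \s* tolerates the space before ">"; both numbers are captured.
const pagePattern = /\/tags\/ud617\/\?sort=active&page=(\d+)"\s*>(\d+)/;

const m = html.match(pagePattern);
console.log(m[0]);       // the full match
console.log(m[1], m[2]); // 2 2
```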