Regex to match tokens in a string using string.gmatch - regex

I need a regex to use in string.gmatch that matches sequences of alphanumeric characters and non-alphanumeric characters (quotes, brackets, colons and the like) as separate, single matches, so basically:
str = [[
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
]]
for token in str:gmatch(regex) do
print(token)
end
Should print:
function
test
(
arg1
,
arg2
)
{
dosomething
(
0x12f
,
"
String
"
)
;
}
How can I achieve this? In standard regex I've found that ([a-zA-Z0-9]+)|([\{\}\(\)\";,]) works for me, but I'm not sure how to translate this to Lua's patterns.

local str = [[
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
]]
for p, w in str:gmatch"(%p?)(%w*)" do
if p ~= "" then print(p) end
if w ~= "" then print(w) end
end

You need a workaround involving a temporary char that is not used in your code. E.g., use a § to insert it after the alphanumeric and non-alphanumeric characters:
str = str:gsub("%s*(%w+)%s*", "%1§") -- Trim chunks of 1+ alphanumeric characters and add a temp char after them
str = str:gsub("(%W)%s*", "%1§") -- Right trim the non-alphanumeric char one by one and add the temp char after each
for token in str:gmatch("[^§]+") do -- Match chunks of chars other than the temp char
print(token)
end
See this Lua demo
Note that %w in Lua is an equivalent of JS [a-zA-Z0-9], as it does not match an underscore, _.
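For comparison, the alternation idea from the question can be checked outside Lua. The following is a minimal sketch in Python (Python and its re module are not part of the original answers; the names TOKEN_RE and tokenize are hypothetical). It treats a token as either a run of ASCII alphanumerics or a single character that is neither alphanumeric nor whitespace:

```python
import re

# One alternation: a run of ASCII alphanumerics, or a single character that is
# neither alphanumeric nor whitespace (quotes, brackets, commas, ...).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def tokenize(source):
    """Return alphanumeric runs and single punctuation characters, in order."""
    return TOKEN_RE.findall(source)

src = '''
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
'''
tokens = tokenize(src)
print("\n".join(tokens))
```

Lua patterns have no alternation operator, which is why the Lua answers above rely on two captures or on a temporary separator character instead.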

Related

Get the word before & after '_-_' with REGEX PowerShell

I am trying to get the Word before and decimal string following a non guaranteed string that looks like ' - '.
Consider this string
"some str (targetWord - 12434 trailing string)"
this string is not guaranteed to have spaces before or after the '-'
so it could look like one of the following
"some str (targetWord-12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
"some str (targetWord -12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
So far I have the following
$allServices = (Get-Service "Known Service Prefix*").DisplayName
foreach ($service in $allServices){
$service = $service.split('\((.*?)\)')[1] #esc( 'Match any non greedy' esc)
if($service.split()[0] -Match '-'){
$arr_services += $service.split('( - )')[0..1]
}else{
$arr_services += ($service -replace '-','').split()[0..1]
}
}
This works to handle the simple cases of ' - ' & '-', but can't handle anything else. I feel like this is the kind of problem that could be handled by one line of REGEX, or at most two.
What I want to end up with is an array of strings, where the evens (including zero) are the targetWord, and the odd values are the decimal strings.
My issue isn't that I can't make this happen, it's that it looks like crap...
what I mean is my goal is to try and use REGEX to get each word, ignore the '-', and push out to a growing array the targetWord & decimalString.
I see this as more of a puzzle than anything and am trying to use this to improve my REGEX skills. Any help is appreciated!
A single regex passed to the -match operator should suffice:
$arr_services = $allServices | ForEach-Object {
if ($_ -match '\((?<word>\w+) *- *(?<number>\d+)') {
# Output the word and number consecutively.
$Matches.word, $Matches.number
}
}
# Output the resulting array.
$arr_services
Note how the pipeline output can be directly collected in a variable as an array ($arr_services = ...) - no need to iteratively "add" to an array. If you need to ensure that $arr_services is always an array - even if the pipeline outputs only one object, use [array] $arr_services = ...
With your sample strings, the above yields (a flat array of consecutive word-number pairs):
targetWord
12434
targetWord
12434
targetWord
12434
targetWord
12434
As for the regex:
\( matches a literal (
\w+ matches a nonempty run (+) of word characters (\w - letters, digits, _), captured in named capture group word ((?<word>...)).
 *- * matches a literal - surrounded by any number of spaces - including none (*).
\d+ matches a nonempty run of digits (\d), captured in named group number.
If the -match operator finds a match, the results are reflected in the automatic $Matches variable, a hashtable that enables accessing named capture groups directly by name.
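The pattern breakdown above can be exercised in isolation. The following is a small Python sketch, not the PowerShell answer itself: Python spells named groups (?P<name>...), and the names PAIR_RE, extract_pairs and samples are hypothetical. It builds the same flat word/number list:

```python
import re

# Same structure as the -match pattern: "(", a word, optional spaces around
# "-", then digits. Python spells named groups (?P<name>...).
PAIR_RE = re.compile(r"\((?P<word>\w+) *- *(?P<number>\d+)")

def extract_pairs(names):
    """Return a flat list: word, number, word, number, ..."""
    flat = []
    for name in names:
        m = PAIR_RE.search(name)
        if m:  # names without a "(word - digits" part are skipped
            flat.extend([m.group("word"), m.group("number")])
    return flat

samples = [
    "some str (targetWord - 12434 trailing string)",
    "some str (targetWord-12434 trailing string)",
    "some str (targetWord- 12434 trailing string)",
    "some str (targetWord -12434 trailing string)",
    "no parenthesised pair here",
]
print(extract_pairs(samples))
```

The last sample is there only to show that non-matching names contribute nothing to the result.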
here's one way to handle the data set you posted. it presumes all the strings will have the same general format that you posted. that means it WILL FAIL if your sample data set is not realistic. [grin]
$InStuff = @(
'some str (targetWord - 12434 trailing string)'
'some str (targetWord-12434 trailing string)'
'some str (targetWord- 12434 trailing string)'
'some str (targetWord -12434 trailing string)'
'some str (targetWord- 12434 trailing string)'
)
$Results = foreach ($IS_Item in $InStuff)
{
$Null = $IS_Item -match '.+\((?<Word>.+) *- *(?<Number>\d{1,}) .+\)'
[PSCustomObject]@{
Word = $Matches.Word.Trim()
Number = $Matches.Number
}
}
$Results
output ...
Word Number
---- ------
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434

Kotlin .split() with multiple regex

Input: """aaaabb\\\\\cc"""
Pattern: ["""aaa""", """\\""", """\"""]
Output: [aaa, abb, \\, \\, \, cc]
How can I split Input to Output using patterns in Pattern in Kotlin?
I found that Regex("(?<=cha)|(?=cha)") keeps the patterns in the result after splitting, so I tried to use a loop, but some of the patterns like '\' and '[' require an escaping backslash, so I'm not able to use a loop for splitting.
EDIT:
val temp = mutableListOf<String>()
for (e in Input.split(Regex("(?<=\\)|(?=\\)"))) temp.add(e)
This is what I've been doing, but it does not work for multiple regexes, and it adds an extra "" at the end of temp if Input ends with "\"
You may use the function I wrote for some previous question that splits by a pattern keeping all matched and non-matched substrings:
private fun splitKeepDelims(s: String, rx: Regex, keep_empty: Boolean = true) : MutableList<String> {
var res = mutableListOf<String>() // Declare the mutable list var
var start = 0 // Define var for substring start pos
rx.findAll(s).forEach { // Looking for matches
val substr_before = s.substring(start, it.range.first()) // Substring before match start
if (substr_before.length > 0 || keep_empty) {
res.add(substr_before) // Adding substring before match start
}
res.add(it.value) // Adding match
start = it.range.last()+1 // Updating start pos of next substring before match
}
if ( start != s.length ) res.add(s.substring(start)) // Adding text after last match if any
return res
}
You just need to build a dynamic pattern from your Pattern list items by joining them with |, the alternation operator, while remembering to escape all the items:
val Pattern = listOf("aaa", """\\""", "\\") // Define the list of literal patterns
val rx = Pattern.map{Regex.escape(it)}.joinToString("|").toRegex() // Build a pattern, \Qaaa\E|\Q\\\E|\Q\\E
val text = """aaaabb\\\\\cc"""
println(splitKeepDelims(text, rx, false))
// => [aaa, abb, \\, \\, \, cc]
See the Kotlin demo
Note that between \Q and \E, all chars in the pattern are considered literal chars, not special regex metacharacters.
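The escape-then-join step can also be sketched in Python. This is an illustration under assumptions, not the Kotlin code: re.escape stands in for Regex.escape, a capturing group makes re.split keep the delimiters, and the names split_keep_delims, literals and parts are hypothetical:

```python
import re

def split_keep_delims(text, literals):
    """Split text on any of the literal delimiters and keep the delimiters."""
    # re.escape plays the role of Regex.escape; the capturing group makes
    # re.split return the delimiters along with the pieces between them.
    pattern = "(" + "|".join(re.escape(lit) for lit in literals) + ")"
    # Drop the empty strings re.split produces between adjacent delimiters.
    return [piece for piece in re.split(pattern, text) if piece]

text = "aaaabb" + "\\" * 5 + "cc"     # aaaabb, five backslashes, cc
literals = ["aaa", "\\" * 2, "\\"]    # the longer backslash literal comes first
parts = split_keep_delims(text, literals)
print(parts)
```

The order of the literals matters: alternation tries the alternatives left to right, so listing the single backslash before the double one would split every backslash individually.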

Split with a multicharacter regex pattern and keep delimiters

I have the following string and regex for splitting it:
val str = "this is #[loc] sparta"
val regex = "((?<=( #\\[\\w{3,100}\\] ))|(?=( #\\[\\w{3,100}\\] )))"
print(str.split(Regex(regex)))
//print - [this is, #[loc] , sparta]
Works fine. But during development I did not realize that the #[***] block can contain more than word characters (\w): it also has "-" and digits (a UUID), so my actual blocks look like this:
val str = "this is #[loc_75acca83-a39b-4df1-8c3c-b690df00db62]"
and in this case the regex doesn't work.
How do I change this part, "\w{3,100}", for the new requirements?
I tried changing it to match any character, "\.{3,100}", but that does not work.
To fix your issue, you may replace your regex with
val regex = """((?<=( #\[[^\]\[]{3,100}] ))|(?=( #\[[^\]\[]{3,100}] )))"""
The \w can be replaced with [^\]\[] that matches any char but [ and ].
Note the use of a raw string literal, """...""", that allows the use of a single backslash as a regex escape.
See the Kotlin online demo.
Alternatively, you may use the following method to split and keep delimiters:
private fun splitKeepDelims(s: String, rx: Regex, keep_empty: Boolean = true) : MutableList<String> {
var res = mutableListOf<String>() // Declare the mutable list var
var start = 0 // Define var for substring start pos
rx.findAll(s).forEach { // Looking for matches
val substr_before = s.substring(start, it.range.first()) // Substring before match start
if (substr_before.length > 0 || keep_empty) {
res.add(substr_before) // Adding substring before match start
}
res.add(it.value) // Adding match
start = it.range.last()+1 // Updating start pos of next substring before match
}
if ( start != s.length ) res.add(s.substring(start)) // Adding text after last match if any
return res
}
Then, just use it like
val str = "this is #[loc_75acca83-a39b-4df1-8c3c-b690df00db62] sparta"
val regex = """#\[[^\]\[]+]""".toRegex() // note the ^: 1+ chars other than [ and ]
print(splitKeepDelims(str, regex))
// => [this is , #[loc_75acca83-a39b-4df1-8c3c-b690df00db62],  sparta]
See the Kotlin demo.
The \[[^\]\[]+] pattern matches
\[ - a [ char
[^\]\[]+ - 1+ chars other than [ and ]
] - a ] char.
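The bracket pattern described above can be tried on its own. The following is a minimal Python sketch (an assumption for illustration, not the Kotlin answer; BLOCK_RE, s and pieces are hypothetical names) that splits on the #[...] block and keeps it:

```python
import re

# "#[", then 1+ chars other than "[" and "]", then "]".
# The outer capturing group makes re.split keep the matched block.
BLOCK_RE = re.compile(r"(#\[[^\]\[]+\])")

s = "this is #[loc_75acca83-a39b-4df1-8c3c-b690df00db62] sparta"
pieces = [p for p in BLOCK_RE.split(s) if p]
print(pieces)
```

Because the negated class excludes only the two bracket characters, the hyphens, digits and underscore of the UUID-style identifier are all accepted.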

Exact matching with Question mark in Perl

I want to find the string ?Allen in an array of strings, but the question mark in the keyword causes some problems.
I wrote this code to find the string in the array:
@arr = ("My name is ?Allen",
"My name is ?Allens",
"My name is s?Allen",
"My name is s?Allens",
"My name is ?allen");
$keyword = "?Allen";
for (my $i=0; $i <= 4; $i++){
if ($arr[$i] =~ /\b$keyword\b/){
print "str $i = match\n";
}else{
print "str $i = no\n";
}
}
finally I get this result
str 0 = match
str 1 = no
str 2 = match
str 3 = no
str 4 = no
but I want only the first array element to be reported as a match, like this:
str 0 = match
str 1 = no
str 2 = no
str 3 = no
str 4 = no
Note that your regex contains non-word special chars that you need to quote before using them in the actual pattern. Also, the fact that the special chars can appear at the leading/trailing positions means you cannot expect \b to always work the same (since its meaning is context dependent). Thus, you may fix the code with
/(?<!\S)\Q$keyword\E(?!\S)/
where
(?<!\S) - requires a whitespace char or start of string before
\Q$keyword\E - a literal search string (see Quoting Metacharacters)
(?!\S) - that should be followed with a whitespace or end of string.
Another alternative for \Q...\E (mentioned by Dave Cross) is using quotemeta:
This is the internal function implementing the \Q escape in double-quoted strings.
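The whitespace-boundary approach can be checked against the five sample strings. The following is a Python sketch under assumptions (re.escape stands in for \Q...\E or quotemeta; has_exact_token is a hypothetical name), not the Perl code itself:

```python
import re

def has_exact_token(text, keyword):
    """True if keyword occurs delimited by whitespace or the string edges."""
    # (?<!\S): no non-whitespace char right before the keyword.
    # (?!\S):  no non-whitespace char right after the keyword.
    pattern = r"(?<!\S)" + re.escape(keyword) + r"(?!\S)"
    return re.search(pattern, text) is not None

arr = [
    "My name is ?Allen",
    "My name is ?Allens",
    "My name is s?Allen",
    "My name is s?Allens",
    "My name is ?allen",
]
for i, line in enumerate(arr):
    print(f"str {i} = {'match' if has_exact_token(line, '?Allen') else 'no'}")
```

Only the first string is reported as a match: the others have a non-whitespace character adjacent to the keyword or differ in case.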

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
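The anchoring points made in the answers above can be demonstrated with a short Python sketch. Python is an assumption here, not part of the answers: re.fullmatch takes the role of \A...\z, and is_valid_dna is a hypothetical name:

```python
import re

def is_valid_dna(seq):
    """True only if seq consists entirely of A, C, G, T (any case), length >= 1."""
    # fullmatch requires the whole string to match, like \A ... \z in Perl.
    return re.fullmatch(r"[ACGT]+", seq, flags=re.IGNORECASE) is not None

print(is_valid_dna("AACGGGTTA"))   # True
print(is_valid_dna("AAYYGGTTA"))   # False
print(is_valid_dna("ACGT\n"))      # False: the trailing newline is rejected
# By contrast, "$" also matches just before a trailing newline:
print(re.search(r"^[ACGT]+$", "ACGT\n") is not None)   # True
```

The last line shows why the stricter end anchor is worth using when the input may carry a trailing newline.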