Exact matching with Question mark in Perl - regex

I want to find the string ?Allen in an array of strings, but the question mark in the keyword causes problems.
I wrote this code to search the array:
@arr = ("My name is ?Allen",
        "My name is ?Allens",
        "My name is s?Allen",
        "My name is s?Allens",
        "My name is ?allen");
$keyword = "?Allen";
for (my $i=0; $i <= 4; $i++){
    if ($arr[$i] =~ /\b$keyword\b/){
        print "str $i = match\n";
    }else{
        print "str $i = no\n";
    }
}
Finally, I get this result:
str 0 = match
str 1 = no
str 2 = match
str 3 = no
str 4 = no
but I want only the first array element (index 0) to match, like this:
str 0 = match
str 1 = no
str 2 = no
str 3 = no
str 4 = no

Note that your keyword contains a special regex character (the ?) that you need to quote before using it in the actual pattern; unquoted, /\b$keyword\b/ becomes /\b?Allen\b/, where ? merely makes the first \b optional. Also, because special chars can appear at the leading/trailing positions of the keyword, you cannot expect \b to always work the same way (its meaning is context dependent: it only matches between a word char and a non-word char). Thus, you may fix the code with
/(?<!\S)\Q$keyword\E(?!\S)/
where
(?<!\S) - requires a whitespace char or start of string before
\Q$keyword\E - a literal search string (see Quoting Metacharacters)
(?!\S) - requires a whitespace char or end of string after.
An alternative to \Q...\E (mentioned by Dave Cross) is the quotemeta function. Per its documentation, this is the internal function implementing the \Q escape in double-quoted strings.
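For comparison, here is a minimal sketch of the same idea ported to Python (the thread itself is about Perl). It assumes re.escape as the counterpart of \Q...\E / quotemeta; the helper name matches_keyword is made up for illustration.

```python
import re

def matches_keyword(text: str, keyword: str) -> bool:
    """Return True if keyword occurs in text delimited by whitespace or the string edges."""
    # re.escape plays the role of Perl's \Q...\E here: the "?" is matched literally.
    pattern = r"(?<!\S)" + re.escape(keyword) + r"(?!\S)"
    return re.search(pattern, text) is not None

strings = [
    "My name is ?Allen",
    "My name is ?Allens",
    "My name is s?Allen",
    "My name is s?Allens",
    "My name is ?allen",
]

results = [matches_keyword(s, "?Allen") for s in strings]
for i, ok in enumerate(results):
    print(f"str {i} = {'match' if ok else 'no'}")
```

Only the first string reports a match, because the lookarounds reject a keyword that is glued to other non-whitespace characters on either side.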

Get the word before & after '_-_' with REGEX PowerShell

I am trying to get the word before, and the decimal string after, a separator that looks like ' - ' but whose surrounding spaces are not guaranteed.
Consider this string
"some str (targetWord - 12434 trailing string)"
this string is not guaranteed to have spaces before or after the '-'
so it could look like one of the following
"some str (targetWord-12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
"some str (targetWord -12434 trailing string)"
"some str (targetWord- 12434 trailing string)"
So far I have the following
$allServices = (Get-Service "Known Service Prefix*").DisplayName
foreach ($service in $allServices){
    $service = $service.split('\((.*?)\)')[1] #esc( 'Match any non greedy' esc)
    if($service.split()[0] -Match '-'){
        $arr_services += $service.split('( - )')[0..1]
    }else{
        $arr_services += ($service -replace '-','').split()[0..1]
    }
}
This handles the simple cases of ' - ' & '-', but can't handle anything else. I feel like this is the kind of problem that could be handled by one line of REGEX or at most two.
What I want to end up with is an array of strings, where the evens (including zero) are the targetWord, and the odd values are the decimal strings.
My issue isn't that I can't make this happen, it's that it looks like crap...
what I mean is my goal is to try and use REGEX to get each word, ignore the '-', and push out to a growing array the targetWord & decimalString.
I see this as more of a puzzle than anything and am trying to use this to improve my REGEX skills. Any help is appreciated!
A single regex passed to the -match operator should suffice:
$arr_services = $allServices | ForEach-Object {
    if ($_ -match '\((?<word>\w+) *- *(?<number>\d+)') {
        # Output the word and number consecutively.
        $Matches.word, $Matches.number
    }
}
# Output the resulting array.
$arr_services
Note how the pipeline output can be directly collected in a variable as an array ($arr_services = ...) - no need to iteratively "add" to an array. If you need to ensure that $arr_services is always an array - even if the pipeline outputs only one object, use [array] $arr_services = ...
With your sample strings, the above yields (a flat array of consecutive word-number pairs):
targetWord
12434
targetWord
12434
targetWord
12434
targetWord
12434
As for the regex:
\( matches a literal (
\w+ matches a nonempty run (+) of word characters (\w - letters, digits, _), captured in named capture group word ((?<word>...)).
 *- * matches a literal - surrounded by any number of spaces - including none (*).
\d+ matches a nonempty run of digits (\d), captured in named group number.
If the -match operator finds a match, the results are reflected in the automatic $Matches variable, a hashtable that enables accessing named capture groups directly by name.
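As a cross-check of the pattern explained above, here is a hedged sketch of the same regex in Python (the answer itself is PowerShell). Python spells named groups (?P<name>...); extract_pairs and samples are illustrative names, not part of the original answer.

```python
import re

# Same structure as the PowerShell pattern: "(", a word, optional spaces around "-", digits.
PAIR_RE = re.compile(r"\((?P<word>\w+) *- *(?P<number>\d+)")

def extract_pairs(lines):
    """Return a flat list: word, number, word, number, ... for every line that matches."""
    flat = []
    for line in lines:
        m = PAIR_RE.search(line)
        if m:
            flat.extend([m.group("word"), m.group("number")])
    return flat

samples = [
    "some str (targetWord - 12434 trailing string)",
    "some str (targetWord-12434 trailing string)",
    "some str (targetWord- 12434 trailing string)",
    "some str (targetWord -12434 trailing string)",
]

print(extract_pairs(samples))
```

Lines without a parenthesized word-number pair are skipped, so the result keeps the even/odd layout the question asks for.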
here's one way to handle the data set you posted. it presumes all the strings will have the same general format that you posted. that means it WILL FAIL if your sample data set is not realistic. [grin]
$InStuff = @(
    'some str (targetWord - 12434 trailing string)'
    'some str (targetWord-12434 trailing string)'
    'some str (targetWord- 12434 trailing string)'
    'some str (targetWord -12434 trailing string)'
    'some str (targetWord- 12434 trailing string)'
    )
$Results = foreach ($IS_Item in $InStuff)
    {
    $Null = $IS_Item -match '.+\((?<Word>.+) *- *(?<Number>\d{1,}) .+\)'
    [PSCustomObject]@{
        Word = $Matches.Word.Trim()
        Number = $Matches.Number
        }
    }
$Results
output ...
Word       Number
----       ------
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434
targetWord 12434

Regex to match tokens in a string using string.gmatch

I need a regex to use in string.gmatch that matches sequences of alphanumeric characters and non alphanumeric characters (quotes, brackets, colons and the like) as separated, single, matches, so basically:
str = [[
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
]]
for token in str:gmatch(regex) do
print(token)
end
Should print:
function
test
(
arg1
,
arg2
)
{
dosomething
(
0x12f
,
"
String
"
)
;
}
How can I achieve this? In standard regex I've found that ([a-zA-Z0-9]+)|([\{\}\(\)\";,]) works for me, but I'm not sure how to translate this to Lua's patterns.
local str = [[
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
]]
for p, w in str:gmatch"(%p?)(%w*)" do
    if p ~= "" then print(p) end
    if w ~= "" then print(w) end
end
One workaround is to use a temporary char that does not occur in your code. E.g., insert a § after each alphanumeric chunk and after each non-alphanumeric char, then collect the pieces between the temp chars:
str = str:gsub("%s*(%w+)%s*", "%1§") -- Trim chunks of 1+ alphanumeric characters and add a temp char after them
str = str:gsub("(%W)%s*", "%1§") -- Right trim the non-alphanumeric char one by one and add the temp char after each
for token in str:gmatch("[^§]+") do -- Match chunks of chars other than the temp char
    print(token)
end
Note that %w in Lua is an equivalent of JS [a-zA-Z0-9], as it does not match an underscore, _.
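For reference, the alternation the question calls "standard regex" can be sketched in Python, where | is available (Lua patterns have no alternation, which is why the answers above need two captures or a temp char). This is only an illustration: TOKEN_RE and tokenize are made-up names, and the punctuation branch is widened to "any single char that is neither alphanumeric nor whitespace".

```python
import re

# One alphanumeric run, or one single non-alphanumeric, non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def tokenize(source: str):
    """Split source into alphanumeric chunks and single punctuation characters."""
    return TOKEN_RE.findall(source)

code = '''
function test(arg1, arg2) {
dosomething(0x12f, "String");
}
'''

tokens = tokenize(code)
for token in tokens:
    print(token)
```

As with %w in Lua, the alphanumeric class here excludes the underscore, so an identifier such as my_var would be split into three tokens.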

In DOORS DXL, how do I use a regular expression to determine whether a string starts with a number?

I need to determine whether a string begins with a number - I've tried the following to no avail:
if (matches("^[0-9].*)", upper(text))) str = "Title"""
I'm new to DXL and Regex - what am I doing wrong?
You need the caret character to anchor the match at the start of the string. I added the plus character to match the whole run of leading digits, although you might not need it for your situation. If you only care whether the string starts with a digit, and not about anything that follows, you don't need anything more.
string str1 = "123abc"
string str2 = "abc123"
string strgx = "^[0-9]+"
Regexp rgx = regexp2(strgx)
if(rgx(str1)) { print str1[match 0] "\n" } else { print "no match\n" }
if(rgx(str2)) { print str2[match 0] "\n" } else { print "no match\n" }
The code block above will print:
123
no match
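The same check can be sketched outside DXL; the Python version below is only meant to illustrate the anchoring idea, and leading_number is a hypothetical helper name. Python's match() already anchors at the start of the string, so no explicit ^ is needed.

```python
import re

LEADING_DIGITS = re.compile(r"[0-9]+")

def leading_number(text: str):
    """Return the run of digits at the start of text, or None if it does not start with a digit."""
    m = LEADING_DIGITS.match(text)  # match() only succeeds at position 0
    return m.group(0) if m else None

for s in ("123abc", "abc123"):
    found = leading_number(s)
    print(found if found is not None else "no match")
```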
@mrhobo is correct, you want something like this:
Regexp numReg = "^[0-9]"
if(numReg text) str = "Title"
You don't need upper since you are just looking for numbers. Also matches is more for finding the part of the string that matches the expression. If you just want to check that the string as a whole matches the expression then the code above would be more efficient.
Good luck!
At least judging from an example I found, this should work:
Regexp plural = regexp "^([0-9].*)$"
if plural "15systems" then print "yes"
Resource:
http://www.scenarioplus.org.uk/papers/dxl_regexp/dxl_regexp.htm

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
    $validEntry = 1;
    print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
    $validEntry = 1;
    print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
    $validEntry = 1;
    print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (it doesn't make a difference whether you use that or ^ in this case, but using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
    $validEntry = 1;
    print "Valid\n";
}
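The three points above (no | inside a character class, anchoring, and the trailing-newline caveat) can be demonstrated with a small Python sketch; it is a port for illustration, not the Perl code itself. re.fullmatch stands in for the \A ... \z anchors, and is_valid_dna is a made-up name.

```python
import re

DNA_RE = re.compile(r"[ACGT]+", re.IGNORECASE)

def is_valid_dna(sequence: str) -> bool:
    """True only if the whole string consists of one or more of A, C, G, T (any case)."""
    # fullmatch requires the pattern to cover the entire string, including any newline.
    return DNA_RE.fullmatch(sequence) is not None

# The unanchored class from the question accepts any string containing one valid letter.
loose_hit = re.search(r"[A|C|G|T]", "AAYYGGTTA", re.IGNORECASE) is not None

for seq in ("AACGGGTTA", "AAYYGGTTA", "ACGT\n"):
    print(repr(seq), "Valid" if is_valid_dna(seq) else "Invalid")
print("unanchored pattern matched the bad string:", loose_hit)
```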

Regular expression for removing white spaces but not those inside ""

I have the following input string:
key1 = "test string1" ; key2 = "test string 2"
I need to convert it to the following without tokenizing
key1="test string1";key2="test string 2"
You'd be far better off NOT using a regular expression.
What you should be doing is parsing the string. The problem you've described is a mini-language, since each point in that string has a state (eg "in a quoted string", "in the key part", "assignment").
For example, what happens when you decide you want to escape characters?
key1="this is a \"quoted\" string"
Move along the string character by character, maintaining and changing state as you go. Depending on the state, you can either emit or omit the character you've just read.
As a bonus, you'll get the ability to detect syntax errors.
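A minimal sketch of such a character-by-character scanner, written in Python under a few assumptions of my own: quotes are double quotes, a backslash inside a quoted string escapes the next character, and an unterminated quote is the only syntax error reported. The function name strip_unquoted_spaces is hypothetical.

```python
def strip_unquoted_spaces(text: str) -> str:
    """Drop whitespace outside double-quoted strings; keep quoted content verbatim."""
    out = []
    in_quotes = False   # state: currently inside "..."
    escaped = False     # state: previous char inside quotes was a backslash
    for ch in text:
        if in_quotes:
            out.append(ch)          # everything inside quotes is emitted
            if escaped:
                escaped = False     # this char was escaped, it cannot close the string
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
            out.append(ch)
        elif not ch.isspace():
            out.append(ch)          # outside quotes: omit whitespace, emit the rest
    if in_quotes:
        raise ValueError("unterminated quoted string")
    return "".join(out)

print(strip_unquoted_spaces('key1 = "test string1" ; key2 = "test string 2"'))
```

The two boolean flags are the whole state here; a fuller parser would add states for the key and assignment parts to catch more errors.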
Using ERE, i.e. extended regular expressions (which are more clear than basic RE in such cases), assuming no quote escaping and having global flag (to replace all occurrences) you can do it this way:
s/ *([^ "]*) *("[^"]*")?/\1\2/g
sed:
$ echo 'key1 = "test string1" ; key2 = "test string 2"' | sed -r 's/ *([^ "]*) *("[^"]*")?/\1\2/g'
C# code:
using System.Text.RegularExpressions;
Regex regex = new Regex(" *([^ \"]*) *(\"[^\"]*\")?");
String input = "key1 = \"test string1\" ; key2 = \"test string 2\"";
String output = regex.Replace(input, "$1$2");
Console.WriteLine(output);
Output:
key1="test string1";key2="test string 2"
Escape-aware version
On second thought, I've concluded that leaving out an escape-aware version of the regexp may lead to wrong conclusions, so here it is:
s/ *([^ "]*) *("([^\\"]|\\.)*")?/\1\2/g
which in C# looks like:
Regex regex = new Regex(" *([^ \"]*) *(\"(?:[^\\\\\"]|\\\\.)*\")?");
String output = regex.Replace(input, "$1$2");
Please do not go blind from those backslashes!
Example
Input: key1 = "test \\ " " string1" ; key2 = "test \" string 2"
Output: key1="test \\ "" string1";key2="test \" string 2"
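To check the escape-aware pattern against the example above, here is a hedged Python port (the answer shows sed and C#). It relies on re.sub replacing an unmatched optional group with an empty string, which Python does since 3.5; squeeze and ESCAPE_AWARE are illustrative names.

```python
import re

# Optional spaces, an unquoted chunk, optional spaces, then an optional quoted string
# in which a backslash escapes the following character.
ESCAPE_AWARE = re.compile(r' *([^ "]*) *("(?:[^\\"]|\\.)*")?')

def squeeze(text: str) -> str:
    """Remove spaces outside double-quoted strings using the escape-aware regex."""
    return ESCAPE_AWARE.sub(r"\1\2", text)

plain = 'key1 = "test string1" ; key2 = "test string 2"'
tricky = r'key1 = "test \\ " " string1" ; key2 = "test \" string 2"'

print(squeeze(plain))
print(squeeze(tricky))
```

Raw string literals keep the backslashes in tricky exactly as they appear in the example input.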