Match everything except for characters within quotes - regex

After some playing around I came up with a way to capture characters within single/double quotes:
['"](?:[^'"]*?(?:\\")*)*["']
Not sure if this is entirely correct. In any event, I am now trying to match everything BUT these.
Example:
'stringA' '\"stringB\"' variableA variableB
The above regex matches: 'stringA' '\"stringB\"'
I would like to match variableA variableB
Is there a way I can achieve this with Perl? I was trying to use negative/positive lookahead/behinds but I encountered issues as my lookbehind had \s* which was not allowed.
Thanks for your help.

Use the backtracking control verbs (*SKIP)(*F), available in PCRE and in Perl 5.10 and later:
['"](?:[^'"]*?(?:\\")*)*["'](*SKIP)(*F)|\S+
['"](?:[^'"]*?(?:\\")*)*["'] Matches strings within double or single quotes.
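(*SKIP)(*F) Forces the match of the preceding quoted-string pattern to fail and stops the regex engine from retrying inside the text it just consumed, so matching resumes after the quoted string with the alternative on the right side of the | operator.
\S+ Matches one or more non-space characters in what remains, that is, outside the double- or single-quoted strings.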
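For engines without (*SKIP)(*F), the same skip-the-quoted-part idea can be sketched with a plain alternation: consume quoted strings in one branch, capture bare words in another, and keep only the captures. The following is a minimal Python sketch (Python's re module has no backtracking control verbs, so this is a swapped-in technique, not the answer's method). The function name unquoted_words and the stricter quoted-string sub-patterns are assumptions made for this illustration.

```python
import re

# Either a quoted string (matched but not captured) or a bare token (group 1).
TOKEN = re.compile(r"""
    '(?:[^'\\]|\\.)*'     # single-quoted string, backslash escapes allowed
  | "(?:[^"\\]|\\.)*"     # double-quoted string, backslash escapes allowed
  | ([^\s'"]+)            # group 1: a bare word outside quotes
""", re.VERBOSE)


def unquoted_words(text):
    """Return the words that are not inside single or double quotes."""
    return [m.group(1) for m in TOKEN.finditer(text) if m.group(1) is not None]


sample = r"""'stringA' '\"stringB\"' variableA variableB"""
print(unquoted_words(sample))
```

Because the quoted branches come first in the alternation, a quote character always starts a quoted match and its contents never reach the capturing branch.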

You could use a long complicated regex like the following:
my @words = split m{
' (?: [^'\\]* | \\. )* ' (*SKIP)(*F)
|
" (?: [^"\\]* | \\. )* " (*SKIP)(*F)
|
\s+
}x, $_;
However, I would recommend using Text::ParseWords:
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::ParseWords;
while (<DATA>) {
my @words = parse_line(qr{\s+}, 0, $_);
print "$_\n" for @words;
}
__DATA__
'stringA' '\"stringB\"' variableA variableB
Outputs:
stringA
\"stringB\"
variableA
variableB
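For comparison, the same "use a library tokenizer instead of a hand-written regex" approach can be sketched in Python with the standard shlex module. This is only an analogue: shlex follows POSIX shell quoting rules, which are not guaranteed to agree with Text::ParseWords on every input, so treat it as an illustration of the idea and not as a drop-in equivalent.

```python
import shlex

line = r"""'stringA' '\"stringB\"' variableA variableB"""

# POSIX rules: single quotes keep their contents literally, so the
# backslashes inside the second token survive; the quotes are stripped.
words = shlex.split(line)

for word in words:
    print(word)
```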


String split in windows powershell

Can you please help me get the desired output? In the string below, SIT is the environment and properties is the file type. I need to remove the environment and the extension from the string.
# $string = "<ENV>.<can have multiple period>.properties"
$string = "SIT.com.local.test.stack.properties"
$b = $string.split('.')
$b[0].Substring(1)
Required output : com.local.test.stack //can have multiple period
This should do it:
$string = "SIT.com.local.test.stack.properties"
# capture anything up to the first period, and in between first and last period
if($string -match '^(.+?)\.(.+)\.properties$') {
$environment = $Matches[1]
$properties = $Matches[2]
# ...
}
You may use
$string -replace '^[^.]+\.|\.[^.]+$'
This will remove the first 1+ chars other than a dot and then a dot, and the last dot followed with any 1+ non-dot chars.
Details
^ - start of string
[^.]+ - 1+ chars other than .
\. - a dot
| - or
\. - a dot
[^.]+ - 1+ chars other than .
$ - end of string.
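The pattern is not PowerShell-specific. As a quick cross-check, here is a minimal Python sketch applying the same regex with re.sub; the helper name strip_env_and_extension is made up for this example.

```python
import re


def strip_env_and_extension(name):
    # Remove "<first segment>." at the start or ".<last segment>" at the end.
    return re.sub(r'^[^.]+\.|\.[^.]+$', '', name)


result = strip_env_and_extension("SIT.com.local.test.stack.properties")
print(result)
```

Both alternatives are anchored, so dots in the middle of the string are never touched.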
You can use -match to capture your desired output using regex
$string ="SIT.com.local.test.stack.properties"
$string -match "^.*?\.(.+)\.[^.]+$"
$Matches.1
You can do this with the Split operator also.
($string -split "\.",2)[1]
Explanation:
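You split on the literal . character with the regex \.. The ,2 argument tells PowerShell to return at most 2 substrings, so only the first period acts as a separator. The [1] index selects the second element of the returned array; [0] is the first substring (SIT in this case). Note that this removes only the environment: the result is com.local.test.stack.properties, so the .properties extension still has to be stripped separately.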
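The limited-split idea can be sketched in Python as well. One difference to keep in mind: PowerShell's ,2 is the maximum number of substrings, while Python's second argument is the maximum number of splits, so the equivalent value is 1. The variable names below are illustrative only.

```python
name = "SIT.com.local.test.stack.properties"

# Split once from the left: drops the environment prefix.
after_env = name.split(".", 1)[1]

# Split once from the right: drops the file extension.
without_ext = after_env.rsplit(".", 1)[0]

print(after_env)
print(without_ext)
```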

Bash regex matching "0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."

In a Bash script I'm writing, I need to capture the /path/to/my/file.c and 93 in this line:
0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).
0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).
With the help of regex101.com, I've managed to create this Perl regex:
^(?:\S+\s){1,5}\((\S+):(\d+)\)
but I hear that Bash doesn't understand \d or ?:, so I came up with this:
^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)
But when I try it out:
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[0]}
I don't get any match. What am I doing wrong? How can I write a Bash-compatible regex to do this?
You are right, Bash uses POSIX ERE and does not support \d shorthand character class, nor does it support non-capturing groups. See more regex features unsupported in POSIX ERE/BRE in this post.
Use
.*\((.+):([0-9]+)\)
Or even (if you need to grab the first (...) substring in a string):
\(([^()]+):([0-9]+)\)
Details
.* - any 0+ chars, as many as possible (may be omitted, only necessary if there are other (...) substrings and you only need to grab the last one)
\( - a ( char
(.+) - Group 1 (${BASH_REMATCH[1]}): any 1+ chars as many as possible
: - a colon
([0-9]+) - Group 2 (${BASH_REMATCH[2]}): 1+ digits
\) - a ) char.
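To see how the greedy .* picks the last parenthesized part, here is a small Python sketch that runs the first pattern against both sample lines from the question. Python is used only for illustration (the pattern avoids anything Python-specific), and the helper name path_and_line is hypothetical.

```python
import re

LAST_PAREN = re.compile(r'.*\((.+):([0-9]+)\)')


def path_and_line(text):
    """Return (path, line_number) from the last '(path:line)' in text, or None."""
    m = LAST_PAREN.search(text)
    return (m.group(1), int(m.group(2))) if m else None


samples = [
    "0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).",
    "0xffffffc0006e0584 is in another_function(char *arg1, int arg2) "
    "(/path/to/my/other_file.c:94).",
]
for sample in samples:
    print(path_and_line(sample))
```

In the second sample the argument list also sits in parentheses, but it contains no colon followed by digits, so only the final group can satisfy the pattern.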
A Bash example:
test='0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'
reg='.*\((.+):([0-9]+)\)'
# reg='\(([^()]+):([0-9]+)\)' # This also works for the current scenario
if [[ $test =~ $reg ]]; then
echo "${BASH_REMATCH[1]}"
echo "${BASH_REMATCH[2]}"
fi
Output:
/path/to/my/file.c
93
In the first pattern you use \S+, which matches one or more non-whitespace characters. That is a broad match and will also match, for example, /, which the second pattern does not account for.
The second pattern starts with [:alpha:], but the first character of the line is a 0. You could use [:alnum:] instead. Since the repetition should also match _, that can be added as well. Note also that a POSIX class must appear inside a bracket expression, as in [[:alpha:]]; a bare [:alpha:] is just a bracket expression matching the characters :, a, l, p and h.
Note that when a capturing group is quantified, the group holds only the value of its last iteration. With {1,5} the group is therefore useful only for the repetition itself; its value here would be some_function followed by a space.
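The "last iteration wins" behaviour of a quantified capturing group is easy to check. The sketch below uses Python's re module purely as a convenient test bench; the behaviour is the same in POSIX ERE and Perl. The character class is spelled out as [A-Za-z0-9_] to mirror [[:alnum:]_].

```python
import re

line = "0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."

# The group is repeated up to five times, but only one value is remembered.
m = re.match(r'([A-Za-z0-9_]+\s){1,5}', line)

whole = m.group(0)   # everything consumed by all iterations
last = m.group(1)    # only what the final iteration captured

print(repr(whole))
print(repr(last))
```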
You might use:
^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Your code could look like
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[4]}
Result
/path/to/my/file.c
93
Or a slightly shorter version using \S, where the values are in groups 2 and 3 (\S is not part of POSIX ERE, so whether it works depends on the system's regex library):
^([[:alnum:]_]+[[:space:]]){1,5}\((\S+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Explanation
^ Start of string
([[:alnum:]_]+[[:space:]]){1,5} Repeat 1-5 times what is captured in group 1
\( match (
(\S+\.[[:alpha:]]) Capture group 2 Match 1+ non whitespace chars, . and an alphabetic character
: Match :
([[:digit:]]+) Capture group 3 Match 1+ digits
\)\. Match ).
$ End of string
See this page about bracket expressions

Perl Non-greedy Matching -- Is the "?" character used correctly?

I am trying to match the parameter name of a parameter declaration line such as below:
parameter BWIDTH = 32;
The Perl regular expression used is:
$line =~ /(\w+)\s*=/
where the parameter name, BWIDTH, is captured into $1. Most parameters I encountered are declared in such a way that the name precedes the equal sign, "=", which is the reason the regular expression is designed with the "=" in it (/(\w+)\s*=/).
However there are special cases where the parameter is declared:
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
In this case, the parameter name that I am trying to capture is PORT_WIDTH. Revising the regular expression to match this instance does not capture PORT_WIDTH successfully, although it does capture BWIDTH fine.
$line =~ /(\w+)(\s*\[.*?\])*\s*=/
where (\s*\[.*?\])* matches reg [31:0] PORT_WIDTH [BWIDTH-1:0] which is greedy matching.
I am baffled as to why the metacharacter ? does not halt the greedy matching? How should I revise the regular expression?
Replace the .*? with [^][]* to match 0+ chars other than ] and [:
/(\w+)(\s*\[[^][]*])*\s*=/
            ^^^^^^
You may also turn the second capturing group into a non-capturing one if you are not using that value.
Pattern details:
(\w+) - Group 1: one or more word chars
(\s*\[[^][]*])* - a capturing group (add ?: after ( to make it non-capturing) zero or more occurrences of:
\s* - 0+ whitespaces
\[ - a literal [
[^][]* - a negated character class matching zero or more chars other than ] and [
] - a literal ]
\s* - zero or more whitespaces
= - an equal sign.
Greediness vs. non-greediness affects where a match ends, but it still starts as early as possible. Basically, a greedy match is the leftmost-longest possible match, while non-greedy is leftmost-shortest. But non-greedy is still leftmost, not rightmost.
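The leftmost rule can be observed directly. In this sketch (Python's re, which backtracks the same way as Perl for these patterns) the lazy version captures reg, because a match starting at reg is possible once .*? is allowed to stretch across the first closing bracket. The negated class cannot cross a bracket, so the match has to start later. The brackets inside the class are escaped here only to keep Python's parser unambiguous.

```python
import re

line = "parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;"

# Lazy quantifier: still expands as far as needed for an overall match.
lazy = re.search(r'(\w+)(\s*\[.*?\])*\s*=', line)

# Negated class: each [...] block is confined to a single pair of brackets.
negated = re.search(r'(\w+)(?:\s*\[[^\]\[]*\])*\s*=', line)

print(lazy.group(1))
print(negated.group(1))
```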
To get what you want, I would use a more explicit description of what I want matched: /(\w+)(\s*\[[^]]*\])?\s*=/ In English, that's a word (\w+), optionally followed by some text in square brackets ((\s*\[[^]]*\])?), and then optional whitespace and an equals sign. Note that I used a negated character class ([^]]) instead of a non-greedy match for what's inside the brackets - IMO, negated character classes are generally a better option than non-greedy matching.
Results with this regex:
$ perl -E '$x = q(parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;); $x =~ /(\w+)(\s*\[[^]]*\])?\s*=/; say $1;'
PORT_WIDTH
$ perl -E '$x = q(parameter BWIDTH = 32;); $x =~ /(\w+)(\s*\[[^]]*\])?\s*=/; say $1;'
BWIDTH
You have information available to you which you are choosing not to use. You know the basic structure of each statement you are trying to parse. The statements have mandatory and optional parts. So, put the information you have in to the match. For example:
#!/usr/bin/env perl
use strict;
use warnings;
my $stuff_in_square_brackets = qr{ \[ [^\]]+ \] }x;
my $re = qr{
^
parameter \s+
(?: reg \s+)?
(?: $stuff_in_square_brackets \s+)?
(\w+) \s+
(?: $stuff_in_square_brackets \s+)?
= \s+
(\w+) ;
$
}x;
while (my $line = <DATA>) {
if (my($p, $v) = ($line =~ $re)) {
print "'$p' = '$v'\n";
}
}
__DATA__
parameter BWIDTH = 32;
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
Output:
'BWIDTH' = '32'
'PORT_WIDTH' = '32'

Why does adding the Perl /x switch stop my regex from matching?

I'm trying to match:
JOB: fruit 342 apples to get
The code matches:
$line =~ /^JOB: fruit (\d+) apples to get/
But, when I add the /x switch in:
$line =~ /^JOB: fruit (\d+) apples to get/x
It does not match.
I looked into the /x switch, and it says it just lets you do comments. I don't know why adding /x stops my regex from matching.
The /x modifier tells Perl to ignore most whitespace that isn't escaped in the regex.
For example, let's just focus on apples to get. You could match it with:
$line =~ /apples to get/
But if you try:
$line =~ /apples to get/x
then Perl will ignore the spaces. So it would be like trying to match applestoget.
You can read more about it in perlre. They have this nice example of how you can use the modifier to make the code more readable.
# Delete (most) C comments.
$program =~ s {
/\* # Match the opening delimiter.
.*? # Match a minimal number of characters.
\*/ # Match the closing delimiter.
} []gsx;
They also mention how to match whitespace or # again while using the /x modifier.
Use of /x means that if you want real whitespace or # characters in
the pattern (outside a bracketed character class, which is unaffected
by /x), then you'll either have to escape them (using backslashes or
\Q...\E ) or encode them using octal, hex, or \N{} escapes.
Part of allowing comments is also ignoring literal white space. Use \s or [ ] for spaces you wish to match.
For example
$line =~ /^ #beginning of string
JOB:[ ]fruit[ ] #some literal text
(\d+) #capture digits to $1
[ ]apples[ ]to[ ]get #more literal text
/x
Notice all those spaces before the beginning of the comments. It would stink if they counted....
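The same pitfall exists in other engines that offer an extended mode. As an illustration, Python's re.VERBOSE flag behaves like /x for this purpose: unescaped whitespace in the pattern is dropped, and a space inside a character class is kept. The variable names below are only for the example.

```python
import re

line = "JOB: fruit 342 apples to get"

# Without the flag, spaces in the pattern are literal.
plain = re.search(r'^JOB: fruit (\d+) apples to get', line)

# With the flag, the pattern effectively becomes ^JOB:fruit(\d+)applestoget.
verbose_broken = re.search(r'^JOB: fruit (\d+) apples to get', line, re.VERBOSE)

# Spaces written as [ ] survive extended mode.
verbose_fixed = re.search(r'''
    ^                      # beginning of string
    JOB:[ ]fruit[ ]        # some literal text
    (\d+)                  # capture digits in group 1
    [ ]apples[ ]to[ ]get   # more literal text
''', line, re.VERBOSE)

print(plain.group(1), verbose_broken, verbose_fixed.group(1))
```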

How can I highlight consecutive duplicate words with a Perl regular expression?

I want a Perl regular expression that will match duplicated words in a string.
Given the following input:
$str = "Thus joyful Troy Troy maintained the the watch of night..."
I would like the following output:
Thus joyful [Troy Troy] maintained [the the] watch of night...
This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:
$str = 'This is Goethe the the the their sentence';
$str =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
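To check what the "one or more" quantifier buys, here is the same pattern run through Python's re.sub as a small test bench (the regex syntax used is common to both engines; the function name bracket_duplicates is made up for this sketch).

```python
import re

DUPES = re.compile(r'\b((\w+)(?:\s+\2\b)+)')


def bracket_duplicates(text):
    """Wrap each run of a consecutively repeated word in square brackets."""
    return DUPES.sub(r'[\1]', text)


print(bracket_duplicates("This is Goethe the the the their sentence"))
print(bracket_duplicates("Thus joyful Troy Troy maintained the the watch of night..."))
```

The trailing \b after the backreference is what keeps "their" out of the run of "the".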
The features I'm about to use are described either in perlre, when they apply to a pattern, or in perlop, when they affect how the substitution operator does its work.
If you like the /x flag to add insignificant whitespace and comments:
$str =~ s/
\b
(
(\w+)
(?:
\s+
\2
\b
)+
)
/[$1]/xg;
I don't like that \2 though because I hate counting relative positions. I can use the relative backreferences in Perl 5.10. The \g{-1} refers to the immediately preceding capture group:
use 5.010;
$str =~ s/
\b
(
(\w+)
(?:
\s+
\g{-1}
\b
)+
)
/[$1]/xg;
Counting isn't all that great either, so I can use labeled matches:
use 5.010;
$str =~ s/
\b
(
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$1]/xg;
I can label the first capture ($1) and access its value in %+ later:
use 5.010;
$str =~ s/
\b
(?<dups>
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$+{dups}]/xg;
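Named groups work along the same lines in other engines, with different spelling. As a rough Python counterpart of the labeled version above: (?P<name>...) defines a group, (?P=name) is the backreference inside the pattern, and \g<name> refers to it in the replacement. This is a sketch for comparison, not Perl syntax.

```python
import re

NAMED_DUPES = re.compile(r'''
    \b
    (?P<dups>              # the whole run of repeated words
        (?P<word>\w+)      # the word being repeated
        (?:
            \s+
            (?P=word)      # the same word again
            \b
        )+
    )
''', re.VERBOSE)


def bracket_duplicates_named(text):
    return NAMED_DUPES.sub(r'[\g<dups>]', text)


print(bracket_duplicates_named("Thus joyful Troy Troy maintained the the watch of night..."))
```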
I shouldn't really need that first capture though since it's really just there to refer to everything that matched. Sadly, it looks like ${^MATCH} isn't set early enough for me to use it in the replacement side. I think that's a bug. This should work but doesn't:
$str =~ s/
\b
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
/[${^MATCH}]/pgx; # DOESN'T WORK
I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.
This works:
$str =~ s/\b((\w+)\s+\2)\b/[$1]/g;
You can try:
$str = "Thus joyful Troy Troy maintained the the watch of night...";
$str =~s{\b(\w+)\s+\1\b}{[$1 $1]}g;
print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...
Regex used: \b(\w+)\s+\1\b
Explanation:
\b: word boundary
\w+: a word
(): to remember the above word
\s+: whitespace
\1: the remembered word
It effectively finds two full words separated by whitespace and places [ ] around them.
EDIT:
If you want to preserve the amount of whitespace between the words you can use:
$str =~s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
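Try the following (note that it collapses each run of duplicates into a single bracketed word, for example [Troy] rather than [Troy Troy]):
$str =~ s/\b(\S+)\b(\s+\1\b)+/[$1]/g;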