Perl regex for a character NOT within string characters - regex

I am writing a perl script that 'compiles' shell code. One thing I need to do is detect ; characters and deal with them (things like multiple commands on one line), but only when they are not escaped (by \ ), or within a string. For example, we shouldn't match 'some ; text ;' , but we should match the semicolons in between the two echo statements in echo ";ignore; inside ;" ; echo 'something;' \; 'else';
In the above example, exactly TWO semicolons should have been matched.
I have tried this with a regex loop
while ($_ =~ /('[^']+')*?("[^"]+")*?(?<!\\)(?<match>;)/g)
{
print "semicolon: $+{match}\n";
# process the match . . .
}
Whilst this works for some examples, there are some cases where it doesn't properly detect the semicolon is 'inside' two strings; as it can't match a PAIR of them before the current match. How would I go about ensuring that we only match semicolons outside a string?
Thanks in advance.

I agree with the other commenters that there are much better ways to develop a parser like this.
Nevertheless, I want to suggest two proposals:
while(/\G((?:[^;'"\\]++|'[^']*+'|"[^"]*+"|\\.)*;)/gx){
print " command: $1\n";
# process the match . . .
}
\G is a zero-width assertion that matches the position where the previous m//g left off, see perlop#\G-assertion. (In the docs there is also an example of a lex-like scanner that might be of interest.)
The non-capturing group contains harmless chars, quoted strings, and escaped characters.
Note the use of possessive quantifiers in order to avoid performance issues due to backtracking.
I removed the negative assertion (?<!\\), because this would fail in cases such as echo \\;
This code will work with your given examples. However, e.g. bash allows escaping double-quotes inside of double-quoted string such as echo "\"".
If your shell should accept such a code, too, then the regexp has to be expanded:
while(/\G( # anchor for beginning
(?:[^;'"\\]++ # harmless chars
|'[^']*+' # or single-quoted string
|"(?: # or double-quoted string,
[^"\\]++ # containing harmless chars
|\\. # or an escaped char
)*+" # with arbitrary many repetitions
|\\. # or an escaped char
)*+ # with arbitrary many repetitions
;) # end with semi-colon
/gx){
print " command: $1\n";
# process the match . . .
}
Such pure regexp solutions are very error-prone. The more exceptions you find that have to be treated, the more complicated the pattern gets and the more difficult the code becomes to debug.
some tests:
use strict;
use warnings;
use Test::More tests => 16;
my $samples = [
{"'some ; text ;'" => []},
{'echo;' => ['echo;']},
{'echo ";ignore; inside ;" ; echo \'something;\' \; \'else\';' => [
'echo ";ignore; inside ;" ;', ' echo \'something;\' \; \'else\';']},
{'echo moep; echo moep;' => [ 'echo moep;', ' echo moep;']},
{'echo \a ; echo moep;' => [ 'echo \a ;', ' echo moep;']},
{'echo \\a ; echo moep;' => [ 'echo \\a ;', ' echo moep;']},
{'echo \\\a ; echo moep;' => [ 'echo \\\a ;', ' echo moep;']},
{'echo \; echo moep;' => [ 'echo \; echo moep;']},
{'echo \\; echo moep;' => [ 'echo \; echo moep;']}, # '\\;' eq '\;' !
{'echo \\\; echo moep;' => [ 'echo \\\;', ' echo moep;']},
{'echo ";\';\';"; echo moep;' => [ 'echo ";\';\';";', ' echo moep;']},
{'echo "\";"; echo moep;' => [ 'echo "\";";', ' echo moep;']},
{'echo ";\""; echo moep;' => [ 'echo ";\"";', ' echo moep;']},
{'echo "\";\""; echo moep;' => [ 'echo "\";\"";', ' echo moep;']},
{'echo ";\\\\"; echo moep;' => [ 'echo ";\\\\";', ' echo moep;']},
{'echo "\\\\\";\""; echo moep;' => [ 'echo "\\\\\";\"";', ' echo moep;']},
];
for my $sample (@$samples){
while(my ($line, $test) = each %$sample){
my @result = $line =~ /\G((?:[^;'"\\]++|'[^']*+'|"(?:[^"\\]++|\\.)*+"|\\.)*+;)/g;
is_deeply(\@result, $test, $line);
}
}
Still, you can easily find many false positive/negative samples. For example, I did not handle parentheses. Handling them would make the above pattern much more complicated, because it would require recursive subpatterns.

Related

How to store each occurrence of multiline string in array using bash regex

Given a text file test.txt with contents:
hello
someline1
someline2
...
world1
line that shouldn't match
hello
someline1
someline2
...
world2
How can I store both of these multiline matches in separate array indexes?
I'm currently trying to use regex="hello.*world[12]"
Unfortunately I can only use native Bash, so Perl etc is off the table. Thanks
Bash's regex support has nothing like Python's findall() function, so we need to capture the matched substrings one at a time in a loop.
Would you please try the following:
#!/bin/bash
str=$(<test.txt)
regex="hello.world[12]"
while [[ $str =~ ($regex)(.*) ]]; do
ary+=( "${BASH_REMATCH[1]}" ) # store the match into an array
str="${BASH_REMATCH[2]}" # remaining substring
done
for i in "${!ary[@]}"; do # see the result
echo "[$i] ${ary[$i]}"
done
Output:
[0] hello
world1
[1] hello
world2
[Edit]
If there are lines between "hello" and "world", we need to change the approach, because bash's regex engine does not support non-greedy (shortest) matching. Then how about:
regex1="hello"
regex2="world"
while IFS= read -r line; do
if [[ $line =~ $regex1 ]]; then
str="$line"$'\n'
f=1
elif (( f )); then
str+="$line"$'\n'
if [[ $line =~ $regex2 ]]; then
ary+=("$str")
f=0
fi
fi
done < test.txt
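The flag-based loop above can be wrapped into a reusable function. The following is only a sketch: the function name collect_blocks, the global array name blocks, and the inline sample text are assumptions made for illustration, and the input is read from stdin instead of test.txt.

```shell
#!/bin/bash
# Hypothetical helper: collect every block that starts at a line matching
# the regex in $1 and ends at the next line matching the regex in $2.
# Input comes from stdin; results land in the global array "blocks".
collect_blocks() {
    local start_re=$1 end_re=$2 line block='' inside=0
    blocks=()
    while IFS= read -r line; do
        if [[ $line =~ $start_re ]]; then
            block=$line$'\n'           # a start line (re)opens a block
            inside=1
        elif (( inside )); then
            block+=$line$'\n'
            if [[ $line =~ $end_re ]]; then
                blocks+=("$block")     # end line closes and stores the block
                inside=0
            fi
        fi
    done
}

# Made-up sample input, shaped like the file in the question.
collect_blocks 'hello' 'world[12]' <<'EOF'
hello
someline1
world1
line that shouldn't match
hello
someline2
world2
EOF

printf 'found %d blocks\n' "${#blocks[@]}"
```

Because the function is fed by a here-document rather than a pipe, it runs in the current shell and the array is still set after it returns.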
I would use awk and mapfile (bash version >= 4.4, which added mapfile -d)
#!/bin/bash
mapfile -d '' arr < <(
awk '/hello/{f=1} f; /world[12]/ && f {f=0; printf "\000"}' test.txt
)
arr=([0]=$'hello\nsomeline1\nsomeline2\n...\nworld1\n' [1]=$'hello\nsomeline1\nsomeline2\n...\nworld2\n')
notes:
awk '/hello/{f=1} f; /world[12]/ && f{f=0; printf "\000"}'
. when encountering hello, set the flag to true
. for each line, print it if the flag is true
. when encountering world[12] and the flag is true, set the flag to false and print a null-byte delimiter
mapfile -d '' arr
split the input into an array in which each element was delimited by a null-byte (instead of \n)
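To see the NUL-delimited splitting in isolation, here is a minimal sketch that replaces awk with a printf producing two made-up records. The array name records and the sample data are assumptions; it needs a bash whose mapfile accepts -d.

```shell
#!/bin/bash
# Two made-up multi-line records, each terminated by a NUL byte.
# printf '%s\0' emits every argument followed by NUL.
mapfile -d '' records < <(printf '%s\0' $'hello\nworld1\n' $'hello\nworld2\n')

printf 'count: %d\n' "${#records[@]}"
declare -p records        # shows each element with its embedded newlines
```

Each array element keeps its embedded newlines because only the NUL byte is treated as a record boundary.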
version for older bash:
#!/bin/bash
arr=()
while IFS='' read -r -d '' block
do
arr+=( "$block" )
done < <(
awk '/hello/{f=1} f; /world[12]/ && f{f=0; printf "\000"}' test.txt
)

Bash regex to match quoted string

I’m trying to come up with a regular expression I can use to match strings surrounded by either single or double quotation marks. The regex should match all of the following strings:
"ABC&VAR#"
'XYZ'
"ABC.123"
'XYZ&VAR#123'
Here is what I have so far:
^([\x22\x27]?)[\w.&#]+\1$
\x22 represents the " character, and \x27 is the ' character.
This works in RegExr, but not in Bash comparisons using the =~ operator. What am I overlooking?
Update: The problem was that my regex uses two features of PCRE syntax that Bash does not support: the \w atom, and backreferences. Thanks to Inian for reminding me of this. I decided to use grep -oP instead of Bash’s built-in =~ operator, so that I can take advantage of PCRE niceties. See my comment below.
Bash regex doesn't support backreferences. In Bash you can do this instead:
arr=('"ABC&VAR#"' "'XYZ'" '"ABC.123"' "'XYZ&VAR#123'" "'foobar\"")
re="([\"']).*(['\"])"
for s in "${arr[@]}"; do
[[ $s =~ $re && ${BASH_REMATCH[1]} = ${BASH_REMATCH[2]} ]] && echo "matched $s"
done
The additional check ${BASH_REMATCH[1]} = ${BASH_REMATCH[2]} makes sure that the opening and closing quotes are the same.
Output:
matched "ABC&VAR#"
matched 'XYZ'
matched "ABC.123"
matched 'XYZ&VAR#123'
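A stricter variant might also anchor the pattern and restrict the characters between the quotes, which is closer to the regex in the question. This is a sketch only: the function name is_quoted is hypothetical, and [[:alnum:]_] is used as an assumed stand-in for PCRE's \w.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if $1 is wrapped in one matching pair of
# quotes and contains nothing but letters, digits, "_", ".", "&" and "#".
is_quoted() {
    local re="^([\"'])[[:alnum:]_.&#]+([\"'])\$"
    # Bash has no backreferences, so compare the two captured quotes by hand.
    [[ $1 =~ $re && ${BASH_REMATCH[1]} == "${BASH_REMATCH[2]}" ]]
}

for s in '"ABC&VAR#"' "'XYZ'" "'foobar\"" 'ABC'; do
    if is_quoted "$s"; then
        echo "matched $s"
    else
        echo "rejected $s"
    fi
done
```

The regex is kept in a variable and expanded unquoted on the right-hand side of =~, so bash treats it as a pattern rather than a literal string.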
You can use the regexp (\"|\').*(\"|\') with egrep.
Here is an example of how it works:
a="\"ABC&VAR#\""
b="'XYZ'"
c="\"ABC.123\""
d="'XYZ&VAR#123'"
echo "Line correct: ${a} and ${b} and ${c} and ${d}"
if [ `echo "${a}" | egrep "(\"|\').*(\"|\')"` -o `echo "${b}" | egrep "(\"|\').*(\"|\')"` -o `echo "${c}" | egrep "(\"|\').*(\"|\')"` -o `echo "${d}" | egrep "(\"|\').*(\"|\')"` ]
then
echo "Found"
else
echo "Not Found"
fi
Output:
Line correct: "ABC&VAR#" and 'XYZ' and "ABC.123" and 'XYZ&VAR#123'
Found
To avoid such a long if expression, put your values in an array, for example.
In that case you will have something like this:
a="\"ABC&VAR#\""
b="'XYZ'"
c="\"ABC.123\""
d="'XYZ&VAR#123'"
arr=( "\"ABC&VAR#\"" "'XYZ'" "\"ABC.123\"" "'XYZ&VAR#123'" )
for line in "${arr[@]}"
do
[ `echo "${line}" | egrep "(\"|\').*(\"|\')"` ] && echo "Found match" || echo "Matches not found"
done

Trying to write a regex in bash

I am new to regex and I am trying to write a regex in a bash script .
I am trying to match line with a regex which has to return the second word in the line .
regex = "commit\s+(.*)"
line = "commit 5456eee"
if [$line =~ $regex]
then
echo $2
else
echo "No match"
fi
When I run this I get the following error:-
man.sh: line 1: regex: command not found
man.sh: line 2: line: command not found
I am new to bash scripting .
Can anyone please help me fix this .
I just want to write a regex to capture the word that follows commit
You don't want a regex, you want parameter expansion/substring extraction:
line="commit 5456eee"
first="${line% *}"
regex="${line#* }"
if [[ $line =~ $regex ]]
then
echo "$regex" # the second word; $2 would be the script's second argument
else
echo "No match"
fi
$first == 'commit', $regex == '5456eee'. Bash provides all the tools you need.
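The two expansions can be tried on their own. The sketch below is an illustration under assumptions: the variable names rest and second are made up, and %% is used instead of % so that a line with more than two words still yields only the first word.

```shell
#!/bin/bash
line="commit 5456eee"

first=${line%% *}     # remove the longest suffix that starts at a space
rest=${line#* }       # remove the shortest prefix that ends at a space
second=${rest%% *}    # first word of what is left

echo "first word:  $first"
echo "second word: $second"
```

No regex engine or external process is involved; the shell does the splitting itself.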
If you really only need the second word you could also do it with awk
line="commit 5456eee"
echo "$line" | awk '{ print $2 }'
or if you have a file:
awk '{ print $2 }' filename
Even though this is not a bash-only solution, awk should be present on most Linux systems.
You should remove the spaces around the equals sign, otherwise bash thinks you want to execute a command named regex with = and "commit\s+(.*)" as arguments.
Then use [[ ... ]] for the condition, keeping the spaces around the brackets and around =~, and leave the regex unquoted (a quoted right-hand side is matched as a literal string):
$ regex="commit[[:space:]]+(.*)"
$ line="commit 5456eee"
$ if [[ $line =~ $regex ]]
> then
> echo "Match"
> else
> echo "No match"
> fi
Match
maybe you didn't start your script with the
#!/bin/sh
or
#!/bin/bash
to define the language you're using... ?
It must be your first line.
Then be careful: spaces are significant in bash. In your "if" statement, it should be:
if [[ $line =~ $regex ]]
check this out and tell us more about the errors you get
If you save this script to a file such as test.sh
and execute it like this:
test.sh commit aaa bbb ccc
$0 $1 $2 $3 $4
you can read the arguments easily as $0, $1, and so on.
A simple way to get the capture group that was matched (if there is one) is to use BASH_REMATCH, which holds the match results in its own array:
regex="commit (.*)"
line="commit 5456eee"
if [[ $line =~ $regex ]]
then
match=${BASH_REMATCH[1]}
echo $match
else
echo "No match"
fi
Since you have only one capture group, it is available as BASH_REMATCH[1]. In the above example I've assigned the value of BASH_REMATCH[1] to the variable match, and echoing it prints:
5456eee
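As a sketch of the same idea in function form, the helper below (the name commit_id is hypothetical) anchors the pattern, tolerates any amount of whitespace after "commit", and captures only the next token. The POSIX class [[:space:]] is used here as an assumption about what the asker meant by \s.

```shell
#!/bin/bash
# Hypothetical helper: print the token that follows "commit", or return 1.
commit_id() {
    local re='^commit[[:space:]]+([^[:space:]]+)'
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"   # index 0 would be the whole match
}

commit_id "commit   5456eee"
commit_id "tree 1234" || echo "No match"
```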

Bash regex match spanning multiple lines

I'm trying to create a bash script that validates files. One of the requirements is that there has to be exactly one "2" in the file.
Here's my code at the moment:
regex1="[0-9b]*2[0-9b]*2[0-9b]*"
# This regex will match if there are at least two 2's in the file
if [[ ( $(cat "$file") =~ $regex1 ) ]]; then
# stuff to do when there's more than 1 "2"
fi
#...
regex2="^[013456789b]*$"
# This regex will match if there are at least no 2's in the file
if [[ ( $(cat "$file") =~ $regex2 ) ]]; then
# stuff to do when there are no 2's
fi
What I'm trying to do is match the following pieces:
654654654654
254654845845
845462888888
(because there are 2 2's in there, it should be matched)
987886546548
546546546848
654684546548
(because there are no 2's in there, it should be matched)
Any idea how I make it search all lines with the =~ operator?
I'm trying to create a bash script that validates files. One of the
requirements is that there has to be exactly one "2" in the file.
Try using grep
#!/bin/bash
file='input.txt'
n=$(grep -o '2' "$file" | wc -l)
# echo $n
if [[ $n -eq 1 ]]; then
echo 'Valid'
else
echo 'Invalid'
fi
How about this:
twocount=$(tr -dc '2' < input.txt | wc -c)
if (( twocount != 1 ))
then
# there was either no 2, or more than one 2
:
else
# exactly one 2
:
fi
Using anchors as you've been, match a string of non-2s, a 2, and another string of non-2s.
^[^2]*2[^2]*$
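Applied with bash's =~ operator, this pattern works across lines, because a bracket expression such as [^2] also matches newline characters and the anchors refer to the start and end of the whole string. A minimal sketch, with a hypothetical function name and made-up sample data:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the text in $1 contains exactly one "2".
has_exactly_one_2() {
    local re='^[^2]*2[^2]*$'
    [[ $1 =~ $re ]]
}

one=$'654654654654\n254654845845\n845466888888'   # made-up sample, one "2"
none=$'987886546548\n546546546848'                # made-up sample, no "2"

has_exactly_one_2 "$one"  && echo "one:  valid"
has_exactly_one_2 "$none" || echo "none: invalid"
```

For a real file the content could be passed as "$(<"$file")", which reads the file without spawning cat.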
Multiline regex match is indeed possible using awk with null record separator.
Consider below code:
awk '$0 ~ /^.*2.*2/ || $0 ~ /^[013456789]*$/' RS= file
654654654654
254654845845
845462888888
Take note of RS= (an empty record separator), which puts awk in paragraph mode: consecutive lines are joined into a single record $0 until a blank line is reached.

How to check if string contains characters in regex pattern in shell?

How do I check if a variable contains characters (regex) other than 0-9a-z and - in pure bash?
I need a conditional check. If the string contains characters other than the accepted characters above simply exit 1.
One way of doing it is using the grep command, like this:
grep -qv "[^0-9a-z-]" <<< "$STRING"
Then you ask for the grep returned value with the following:
if [ ! $? -eq 0 ]; then
echo "Wrong string"
exit 1
fi
As @mpapis pointed out, you can simplify the above to:
grep -qv "[^0-9a-z-]" <<< "$STRING" || exit 1
Also you can use the bash =~ operator, like this:
if [[ ! "$STRING" =~ [^0-9a-z-] ]] ; then
echo "Valid";
else
echo "Not valid";
fi
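The same check can be phrased positively, which reads more directly as "only these characters". The sketch below is a variation and differs from the snippet above in one respect: it also rejects the empty string. The function name is_valid is hypothetical, and forcing LC_ALL=C is an assumption made so that a-z means exactly the ASCII lowercase letters.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if $1 is non-empty and consists solely
# of the characters 0-9, a-z and "-".
is_valid() {
    local LC_ALL=C                  # assumption: keep a-z to the ASCII range
    local re='^[0-9a-z-]+$'
    [[ $1 =~ $re ]]
}

for s in 'abc-123' 'abc_123' ''; do
    if is_valid "$s"; then
        echo "valid:     '$s'"
    else
        echo "not valid: '$s'"
    fi
done
```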
case also supports pattern matching (the +( ) form needs extglob):
shopt -s extglob
case "$string" in
(+([0-9a-z-])) true ;;
(*) exit 1 ;;
esac
The format is not pure regexp, but it is faster than spawning a separate grep process, which matters if you have many checks.
Using Bash's substitution engine to test if $foo contains $bar
bar='[^0-9a-z-]'
if [ -n "$foo" -a -z "${foo/*$bar*}" ] ; then
echo exit 1
fi
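A related use of the substitution engine, not the approach shown above, is to delete every allowed character and look at what remains. This sketch assumes a hypothetical function name, invalid_chars, and made-up sample input; it has the side benefit of reporting which characters were rejected.

```shell
#!/bin/bash
# Hypothetical helper: print the characters of $1 that are not 0-9, a-z or "-".
invalid_chars() {
    local LC_ALL=C                     # assumption: keep a-z to the ASCII range
    printf '%s' "${1//[0-9a-z-]/}"     # strip every allowed character
}

foo='abc_12 3'                         # made-up sample input
bad=$(invalid_chars "$foo")
if [ -n "$bad" ]; then
    echo "invalid characters: '$bad'"
fi
```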