GREP: variable in regular expression - regex

If I want to look whether a string is alphanumeric and shorter than a certain value, say 10, I would do like this (in BASH+GREP):
if grep '^[0-9a-zA-Z]\{1,10\}$' <<<$1 ; then ...
(BTW: I'm checking for $1, i.e. the first argument)
What if I want the value 10 to be written on a variable, e.g.
UUID_LEN=10
if grep '^[0-9a-zA-Z]\{1,$UUID_LEN\}$' <<<$1 ; then ...
I tried all sort of escapes, braces and so on, but could not avoid the error message
grep: Invalid content of \{\}
After googling and reading bash and grep tutorials I'm pretty convinced it can't be done. Am I wrong? Any way to go around this?

You need to use double quotes so that the shell expands the parameter before passing the resulting argument to grep:
if grep "^[0-9a-zA-Z]\{1,$UUID_LEN\}$" <<<$1 ; then ...
bash can perform regular expression matching itself, without having to start another process to run grep:
if [[ $1 =~ ^[0-9a-zA-Z]{1,$UUID_LEN}$ ]]; then

Related

Perl Regex Command Line Issue

I'm trying to use a negative lookahead in perl in command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but its just returning me:
![0-9]+\.[0-9]+\.: event not found
i've tried it in regex101 (PCRE) and it works fine, so im not sure why it doesn't work here
In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilĂ :
1.41.2
I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2
Please see the demo here.
If you want to "increment" a number then you can't hard-code the new value but need to capture what is there and increment that
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here /e modifier makes it so that the replacement side is evaluated as code, and we can +1 the captured number, what is then substituted. The \K drops previous matches so we don't need to put them back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes used around the Perl program "invite" the shell to look at what's inside whereby it interprets the ! as history expansion and runs that, as explained in ruakh's post.
As an alternate to lookahead, we can use capture groups, e.g. the following will capture the version number into 3 capture groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question, it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/\1.\2.2/'
1.14.2
$

Sed | Variable containing regex causes invalid reference error

I'm having problems with sed and the back-referencig when using variables containing regexes.
It is a parser written in bash. At a very earlier point, I want to use sed to clean every line into the needed data: the indentation, a key and a value (colon separated). The data is similar to yaml but using an equals.
A basic example of the data:
overview = peparing 2016-10-22
license= sorted 2015-11-01
The function I'm having problems with does the logic in a while loop:
function prepare_parsing () {
local file=$1
# regex components:
local s='[[:space:]]*' \
w='[a-zA-Z0-9_]*' \
fs=':'
# regexes(NoQuotes, SingleQuotes, DoubleQuotes):
local searchNQ='^('$s')('$w')'$s'='$s'(.*)'$s'$' \
searchSQ='^('$s')('$w')'$s'='$s\''(.*)'\'$s'\$' \
searchDQ='^('$s')('$w')'$s'='$s'"(.*)"'$s'\$' \
replace="\1$fs\2$fs\3"
while IFS="$fs" read -r indentation key value; do
...
SOME CUSTOM LOGIC
...
done < <(sed -n "s/${searchNQ}/${replace}/p" $file)
}
When trying to call the function, I receive the known invalid reference error into \3: invalid reference \3 on s' command's RHS
To debug this, after the vars definition, I've printed their values using the printf and the %q option.
printf "%q\n" $searchNQ $searchSQ $searchDQ $replace
Getting these values:
\^\(\[\[:space:\]\]\*\)\(\[a-zA-Z0-9_\]\*\)\[\[:space:\]\]\*=\[\[:space:\]\]\*\(.\*\)\[\[:space:\]\]\*\$
\^\(\[\[:space:\]\]\*\)\(\[a-zA-Z0-9_\]\*\)\[\[:space:\]\]\*=\[\[:space:\]\]\*\'\(.\*\)\'\[\[:space:\]\]\*\\\$
\^\(\[\[:space:\]\]\*\)\(\[a-zA-Z0-9_\]\*\)\[\[:space:\]\]\*=\[\[:space:\]\]\*\"\(.\*\)\"\[\[:space:\]\]\*\\\$
$'\\1\034\\2\034\\3'
And maybe here's the problem, the excessive escape sequences when the shell (bash) expand the variables (for example, it seems to be escaping the *, the [], ...).
If I pass the -r option to sed, it works perfectly, but I have to avoid this since the system that will execute the script won't have this sed implementation: I have to use basic sed.
Do you have any idea on how to store the regex into variables and make them usable for the backreferencing on the RHS?
It works in these two cases:
When using a plain regex string:
sed -n "s/^\([[:space:]]*\)\([a-zA-Z0-9_]*\)[[:space:]]*=[[:space:]]*\(.*\)[[:space:]]*\$/\1:\2:\3/p" $file
And when I use just the vars s, w and fs:
sed -n "s/^\($s\)\($w\)$s=$s\(.*\)$s\$/\1$fs\2$fs\3/p" $file
Many thanks for the help!
perl that supports extended RegExps may be used instead of sed, like
perl -n -e "s/${searchNQ}/${replace}/; print"

Bash string replacement with regex repetition

I have a file: filename_20130214_suffix.csv
I'd like replace the yyyymmdd part in bash. Here is what I intend to do:
file=`ls -t /path/filename_* | head -1`
file2=${file/20130214/20130215}
#this will not work
#file2=${file/[0-9]{8}/20130215/}
The problem is that parameter expansion does not use regular expressions, but patterns or globs(compare the difference between the regular expression "filename_..csv" and the glob "filename_.csv"). Globs cannot match a fixed number of a specific string.
However, you can enable extended patterns in bash, which should be close enough to what you want.
shopt -s extglob # Turn on extended pattern support
file2=${file/+([0-9])/20130215}
You can't match exactly 8 digts, but the +(...) lets you match one or more of the pattern inside the parentheses, which should be sufficient for your use case.
Since all you want to do in this case is replace everything between the _ characters, you could also simply use
file2=${file/_*_/_20130215_}
[[ $file =~ ^([^_]+_)[0-9]{8}(_.*) ]] && file2="${BASH_REMATCH[1]}20130215${BASH_REMATCH[2]}"

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
local pattern=$1
local string=$2
local result=${string/${pattern}*/}
[ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The goto options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
if [[ "$var" =~ "$regex" ]]; then
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]
do
echo "capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source the oobash. A string lib written in bash with oo-style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many "methods" more to work with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill