Bash script grep for pattern in variable of text

Bash script grep for pattern in variable of text - regex

I have a variable which contains text; I can echo it to stdout so I think the variable is fine. My problem is trying to grep for a pattern in that variable of text. Here is what I am trying:
ERR_COUNT=`echo $VAR_WITH_TEXT | grep "ERROR total: (\d+)"`
When I echo $ERR_COUNT the variable appears to be empty, so I must be doing something wrong.
How to do this properly? Thanks.
EDIT - Just wanted to mention that testing that pattern on the example text I have in the variable does give me something (I tested with: http://rubular.com)
However the regex could still be wrong.
EDIT2 - Not getting any results yet, so here's the string I'm working with:
ALERT line125: Alert: Cannot locate any description for 'asdf' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../hgfd.controls) ALERT line126: Alert: Cannot locate any description for 'zxcv' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../dfhg.controls) ALERT line127: Alert: Cannot locate any description for 'rtyu' in the qwer.xml hierarchy. (due to (?i-xsm:\balert?\b) ALERT in ../kjgh.controls) [1] 22280 IGNORE total: 0 WARN total: 0 ALERT total: 3 ERROR total: 23 [1] + Done /tool/pandora/bin/gvim -u NONE -U NONE -nRN -c runtime! plugin/**/*.vim -bg ...
That's the string, so hopefully there should be no ambiguity anymore... I want to extract the number "23" (after "ERROR total: ") into a variable and I'm having a hard time haha.
Cheers

You can use bash's =~ operator to extract the value.
[[ $VAR_WITH_TEXT =~ ERROR\ total:\ ([0-9]+) ]]
Note that you have to escape the spaces, or only only quote
the fixed parts of the regular expression:
[[ $VAR_WITH_TEXT =~ "ERROR total: "([0-9]+) ]]
since quoting any of the metacharacters causes them to be treated
literally.
You can also save the regex in a variable:
regex="ERROR total: ([0-9]+)"
[[ $VAR_WITH_TEXT =~ $regex ]]
In any case, once the expression matches, the parenthesized expression
can be found in BASH_REMATCH array.
ERR_COUNT=${BASH_REMATCH[1]}
(The zeroth element contains the entire matched regular expression; the parenthesized subexpressions are found in the remaining elements in the order they appear in the full regex.)
If you want to use grep, you'll need a version that can accept Perl-style regexes.
ERR_COUNT=$( echo "$VAR_WITH_TEXT" | grep -Po "(?<=ERROR total: )\d+" )
As long as you need to use Perl-style regexes to enable the look-behind assertion, you can replace [0-9] with \d.

Your error is in the pattern: (\d+) matches:
'('
a digit
'+'
')'
According to your comment, what you want is \(\d\+\), which:
defines a sub-pattern by \( ... \)
Inside it matches at least one (\+) digit (\d).
In this case, if you don't need a sub-pattern, you can just drop the \( and \).
Note: if your grep doesn't understand \d, you can replace it by [0-9]. Easiest way is to write grep '\d' and test it by writing a couple test lines.

# setting example data
test="adfa\nfasetrfaqwe\ndsfa ERROR total: 32514235dsfaewrf"
one solution:
echo $(sed -n 's/^.*ERROR total: \([0-9]*\).*$/\1/p' < <(echo $test))
32514235
other solution:
# throw away everything up to "ERROR total: "
test=${test##*ERROR total: }
# cut from behind assuming number contains no spaces and is
# separated by space
test=${test%% *}
echo $test
32514235

The \d is probably only recognized as a digit in perl regex mode, you probably want to use grep -P.
If you only want the number you could try:
ERR_COUNT=$(echo $VAR_WITH_TEXT | perl -pe "s/.*ERROR total: (\d+).*/\1/g")
or:
ERR_COUNT=$(echo $VAR_WITH_TEXT | sed -n "s/.*ERROR total: ([0-9]+).*/\1/gp")

Related

extract substring with SED

I have the next strings:
for example:
input1 = abc-def-ghi-jkl
input2 = mno-pqr-stu-vwy
I want extract the first word between "-"
for the fisrt string I want to get: def
if the input is the second string, I want to get: pqr
I want to use the command SED, Could you help me please?

Use
sed 's,^[^-]*-\([^-]*\).*,\1,' file
The string after the first - will be captured up to the second - and the rest will be matched, then the matched line will be replaced with the group text.

With bash:
var='input1 = abc-def-ghi-jkl'
var=${var#*-} # remove shortest prefix `*-`, this removes `input1 = abc-`
echo "${var%%-*}" # remove longest suffix `-*`, this removes `-ghi-jkl`
Or with awk:
awk -F'-' '{print $2}' <<<'input1 = abc-def-ghi-jkl'
Use - as input field separator and print the second field.
Or with cut:
cut -d'-' -f2 <<<'input1 = abc-def-ghi-jkl'

When you want to use sed, you can choose between solutions like
# Double processing
echo "$input1" | sed 's/[^-]*-//;s/-.*//'
# Normal approach
echo "$input1" | sed -r 's/^[^-]*-([^-]*)|-.*)/\1/g'
# Funny alternative
echo "$input1" | sed -r 's/(^[^-]*-|-.*)//g'
The obvious "external" tool would be cut. You can also look at a Bash builtin solution like
[[ ${input1} =~ ([^-]*)-([^-]*) ]] && printf %s "${BASH_REMATCH[2]}"

grep solution (in my opinion this is the most natural approach, as you are only trying to find matches to a regular expression - you are not looking to edit anything, so there should be no need for the more advanced command sed)
grep -oP '^[^-]*-\K[^-]*(?=-)' << EOF
> abc-qrs-bobo-the-clown
> 123-45-6789
> blah-blah-blah
> no dashes here
> mahi-mahi
> EOF
Output
qrs
45
blah
Explanation
Look at the inputs first, included here for completeness as a heredoc (more likely you would name your file as the last argument to grep.) The solution requires at least two dashes to be present in the string; in particular, for mahi-mahi it will find no match. If you want to find the second mahi as a match, you can remove the lookahead assertion at the end of the regular expression (see below).
The regular expression does this. First note the command options: -o to return only the matched substring, not the entire line; and -P to use Perl extensions. Then, the regular expression: start from the beginning of the line (^); look for zero or more non-dash characters followed by dash, and then (\K) discard this part of the required match from the substrings found to match the pattern. Then look for zero or more non-dash characters again - this will be returned by the command. Finally, require a dash following this pattern, but do not include it in the match. This is done with a lookahead (marked by (?= ... )).

Regular Expression: Capture character pattern zero or one positions from start of string

I have a series of entries, which can be represented by this string:
my_string="-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz"
For each entry, I need to return whether it starts with 'R' or 'D'. In order to do this, I need to ignore any character that comes before it. So, I wrote this regular expression:
for i in $my_string; do echo $i | grep -E -o "^*?[RD]"; done
However, this is only returning R or D for entries which are not preceded by a character.
How do I get this regex to return the R or D value in every case, whether there is a character in front of it or not? Keep in mind that the only thing which can be 'hard-coded' into the expression is the pattern to be matched.

It will be easy if you use sed:
sed -r 's/^.?([RD]).*$/\1/'
i.e.
for i in $my_string; do echo $i | sed -r 's/^.?([RD]).*$/\1/'; done
Update:
Here is what each part of the command means:
-r : extended regular expression, although I think -e should work but
turns out that during my testing, in order to use capturing group
in regex, I need -r. Anyway, not the main point
The script can be read as:
s/XXXX/YYYY/ : substitude from XXXX to YYYY
The "from" pattern (XXXX) means:
^ : start with
.? : zero or one occurence of any character
( : start of group
[RD] : either R or D
) : end of group (which means, the group will contains either R or D
.* : any number of any character
$ : till the end
the "to" pattern (YYYY):
\1 : content of capture group 1 in the "from" pattern (which is the "R or D")

Use a parameter expansion to remove the prefix before using grep:
for i in $my_string; do echo ${i#[^RD]} | grep -o "^[RD]" ; done
or use a simple test without grep (since you already know that each item starts with a R or a D):
for i in $my_string; do
if [[ $i =~ ^[^D]?R ]] ; then
echo 'R'
else
echo 'D'
fi
done

This regex worked in my local tests. Please have a try:
^.?[RD]
I can't think of a way to ONLY return the letter you want. I'd have a command after to detect whether the returned string is greater than 1 character long, and if so, I'd return only the second character.

I'm not 100% sure of what you are asking ( i understood you want to match only R and D at the beginning of a filename, whatever the character before it, if there is one ), but I think you should use lookbehind, in php you would do
$re = "/(?<=^\S|\s\S|\s)[RD]/";
$str = "-D-K4_NNNN_M116_R1_001.gz _D-K4_NNNN_M56_R1_001.gz R-K4_NNNN_KQ9_R1_001.gz D-K4_NNNN_M987_R1_001.gz _R-K4_NNNN_M987_R1_001.gz";
preg_match_all($re, $str, $matches);
You can see the output here.
To use Perl syntax in bash you must enable it. https://unix.stackexchange.com/questions/84477/forcing-bash-to-use-perl-regex-engine
You can test your regexp here if you need https://regex101.com/r/vV3nS3/1

This does it when using the modifier 'g' for global: (^| ).?(R|D)
See the regex101 here

Shell script linux, validating integer

This code is for check if a character is a integer or not (i think). I'm trying to understand what this means, I mean... each part of that line, checking the GREP man pages, but it's really difficult for me. I found it on the internet. If anyone could explain me the part of the grep... what means each thing put there:
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!

Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9
Here -E is for extended regex support and -q is for quiet in grep command.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"

I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful, you could easily analyze the
'^(+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: ESC character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetation
[0-9]: range of numbers from 0-9
+: one or more repetation
$: end of line (no more characters allowed)
so it accepts like: -312353243 or +1243 or 5678
but do not accept: 3 456 or 6.789 or 56$ (as dollar sign).

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done

Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.

This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.

After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.

echo "$var" | pcregrep -o "(?<=yet more )text"

Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"

Grep regular expression for digits in character string of variable length

I need some way to find words that contain any combination of characters and digits but exactly 4 digits only, and at least one character.
EXAMPLE:
a1a1a1a1 // Match
1234 // NO match (no characters)
a1a1a1a1a1 // NO match
ab2b2 // NO match
cd12 // NO match
z9989 // Match
1ab26a9 // Match
1ab1c1 // NO match
12345 // NO match
24 // NO match
a2b2c2d2 // Match
ab11cd22dd33 // NO match

to match a digit in grep you can use [0-9]. To match anything but a digit, you can use [^0-9]. Since that can be any number of , or no chars, you add a "*" (any number of the preceding). So what you'll want is logically
(anything not a digit or nothing)* (any single digit) (anything not a digit or nothing)* ....
until you have 4 "any single digit" groups. i.e. [^0-9]*[0-9]...
I find with grep long patterns, especially with long strings of special chars that need to be escaped, it's best to build up slowly so you're sure you understand whats going on. For example,
#this will highlight your matches, and make it easier to understand
alias grep='grep --color=auto'
echo 'a1b2' | grep '[0-9]'
will show you how it's matching. You can then extend the pattern once you understand each part.

I'm not sure about all the other input you might take (i.e. is ax12ax12ax12ax12 valid?), but this will work based on what you posted:
%> grep -P "^(?:\w\d){4}$" fileWithInput

With grep:
grep -iE '^([a-z]*[0-9]){4}[a-z]*$' | grep -vE '^[0-9]{4}$'
Do it in one pattern with Perl:
perl -ne 'print if /^(?!\d{4}$)([^\W\d_]*\d){4}[^\W\d_]*$/'
The funky [^\W\d_] character class is a cosmopolitan way to spell [A-Za-z]: it catches all letters rather than only the English ones.

If you don't mind using a little shell as well, you could do something like this:
echo "a1a1a1a1" |grep -o '[0-9]'|wc -l
which would display the number of digits found in the string. If you like, you could then test for a given number of matches:
max_match=4
[ "$(echo "a1da4a3aaa4a4" | grep -o '[0-9]'|wc -l)" -le $max_match ] || echo "too many digits."

Assuming you only need ASCII, and you can only access the (fairly primitive) regexp constructs of grep, the following should be pretty close:
grep ^[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*$ | grep [a-zA-Z]

You might try
[^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*
But this will match 1234. why doesn't that match your criteria?

The regex for that is:
([A-Za-z]\d){4}
[A-Za-z] - for character class
\d - for number
you wrapp them in () to group them indicating the format character follow by number
{4} - indicating that it must be 4 repetitions

you can use normal shell script, no need complicated regex.
var=a1a1a1a1
alldigits=${var//[^0-9]/}
allletters=${var//[0-9]/}
case "${#alldigits}" in
4)
if [ "${#allletters}" -gt 0 ];then
echo "ok: 4 digits and letters: $var"
else
echo "Invalid: all numbers and exactly 4: $var"
fi
;;
*) echo "Invalid: $var";;
esac

thanks for your answers
finaly i wrote some script and it work perfect:
. /P ab2b2 cd12 z9989 1ab26a9 1ab1c1 1234 24 a2b2c2d2
#!/bin/bash
echo "$#" |tr -s " " "\n"s >> sorting
cat sorting | while read tostr
do
l=$(echo $tostr|tr -d "\n"|wc -c)
temp=$(echo $tostr|tr -d a-z|tr -d "\n" | wc -c)
if [ $temp -eq 4 ]; then
if [ $l -gt 4 ]; then
printf "%s " "$tostr"
fi
fi
done
echo

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Bash script grep for pattern in variable of text - regex

The \d is probably only recognized as a digit in perl regex mode, you probably want to use grep -P. If you only want the number you could try: ERR_COUNT=$(echo $VAR_WITH_TEXT | perl -pe "s/.ERROR total: (\d+)./\1/g") or: ERR_COUNT=$(echo $VAR_WITH_TEXT | sed -n "s/.ERROR total: ([0-9]+)./\1/gp")

Related

extract substring with SED

Regular Expression: Capture character pattern zero or one positions from start of string

Shell script linux, validating integer

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Grep regular expression for digits in character string of variable length

Categories

Resources