SED: Number of returned lines - regex

Given a file jungle.txt with the following text ...
A lion sleeps in the jungle
A lion sleeps tonight
A tiger awakens in the swamp
The parrot observes
Wimoweh, wimoweh, wimoweh, wimoweh
... one could perform a grep search ...
$ grep lion jungle.txt
... or a sed search ...
$ sed -n "/lion/p" jungle.txt
... to find occurrences of a pattern ("lion" in this case).
Is there some easy way to get the number of returned lines? Or at least to know whether more than one was found? As always, I googled a lot first, but surprisingly found no answer.
Thanks!

grep can count matching lines:
grep -c 'lion' file
Output:
2
Syntax:
-c: Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines. (-c is specified by POSIX.)
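For instance, combining it with -v on the sample jungle.txt above should count the three non-matching lines instead:
$ grep -vc lion jungle.txt
3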

This might work for you (GNU sed):
sed '/lion/!d' file | sed '$=;d'
or if you prefer:
sed -n '/lion/p' file | sed -n '$='
N.B. if the file is empty or the first sed command finds nothing, the result of the second sed command is blank.
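For instance, the second form prints 2 for the sample jungle.txt:
$ sed -n '/lion/p' jungle.txt | sed -n '$='
2
With a pattern that matches nothing (say, a hypothetical /elephant/), the output is blank rather than 0, as noted above.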

You can use awk
awk '/lion/ {a++} END {print a+0}' file
2
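The a+0 in the END block is what turns an empty result into 0; for instance, with a hypothetical pattern that matches nothing in the sample jungle.txt:
$ awk '/elephant/ {a++} END {print a+0}' jungle.txt   # /elephant/ is a made-up non-matching pattern
0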
But I would say that the best solution is the one posted by Cyros using grep -c 'lion' file

Just pipe the grep output to the wc -l command to count the number of returned lines:
$ grep 'lion' file | wc -l
2
From wc --help
-l, --lines print the newline counts
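Note that both grep -c and grep | wc -l count matching lines, not individual matches. To count every occurrence of a pattern that can repeat within one line (for example, wimoweh on the sample's last line), grep -o prints each match on its own line and the matches can then be counted:
$ grep -io 'wimoweh' jungle.txt | wc -l
4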

Related

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'
It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). I forget if grep -o is POSIX-required behaviour.
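If you actually want both fields on one line, as in the desired output, a sketch along the same lines (two capture groups, still GNU sed -r) could be:
sed -nr "s/.*('current_count': u'[0-9]+').*('total_count': u'[0-9]+').*/\1, \2/p" file
# expected output for the sample line: 'current_count': u'2', 'total_count': u'3'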
To me this looks like some sort of serialized Python data. Ideally I would try to find out the origin of that data and parse it properly.
However, hackish as it is, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt

Selectively extract number from file name

I have a list of files with names in the format AA13_11BB, CC290_23DD, EE92_34RR. I need to extract only the numbers after the _ character, not the ones before. For those three file names, I would like to get 11, 23, 34 as output and, after each extraction, store the number in a variable.
I'm very new to bash and regex. Currently, from AA13_11BB, I am able to either obtain 13_11:
for imgs in $DIR; do
LEVEL=$(echo $imgs | egrep -o [_0-9]+);
done
or two separate numbers 13 and 11:
LEVEL=$(echo $imgs | egrep -o [0-9]+)
May I please have some advice how to obtain my desired output? Thank you!
Use egrep with sed:
LEVEL=$(echo $imgs | egrep -o '_[0-9]+' | sed 's/_//' )
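For example, with one of the names from the question:
$ echo "AA13_11BB" | egrep -o '_[0-9]+' | sed 's/_//'
11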
To complement the existing helpful answers, using the core of hjpotter92's answer:
The following processes all filenames in $DIR in a single command and reads all extracted tokens into an array:
IFS=$'\n' read -d '' -ra levels < \
<(printf '%s\n' "$DIR"/* | egrep -o '_[0-9]+' | sed 's/_//')
IFS=$'\n' read -d '' -ra levels splits the input into lines and stores them as elements of the array ${levels[@]}.
<(...) is a process substitution that allows the output from a command to act as an (ephemeral) input file.
printf '%s\n' "$DIR"/* uses pathname expansion to output each filename on its own line.
egrep -o '_[0-9]+' | sed 's/_//' is the same as in hjpotter92's answer - it works equally on multiple input lines, as is the case here.
To process the extracted tokens later, use:
for level in "${levels[@]}"; do
echo "$level" # work with $level
done
You can do it in one sed using the regex .*_([0-9]+).* (escape it properly for sed):
sed "s/.*_\([0-9]\+\).*/\1/" <<< "AA13_11BB"
It replaces the line with the first captured group (the sub-regex inside the ()), outputting:
11
In your script:
LEVEL=$(sed "s/.*_\([0-9]\+\).*/\1/" <<< $imgs)
Update: as suggested by @mklement0, in both BSD sed and GNU sed you can shorten the command using the -E parameter:
LEVEL=$(sed -E "s/.*_([0-9]+).*/\1/" <<< $imgs)
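For example, with another of the sample names:
$ sed -E "s/.*_([0-9]+).*/\1/" <<< "CC290_23DD"
23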
Using grep with the -P flag:
for imgs in $DIR
do
LEVEL=$(echo $imgs | grep -Po '(?<=_)[0-9]{2}')
echo $LEVEL
done
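If the number after the underscore can be longer than two digits, a slightly more general sketch replaces the fixed {2} quantifier with +:
$ echo "CC290_23DD" | grep -Po '(?<=_)[0-9]+'
23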

bash search for multiple patterns on different lines in a file

I have a number of files and I want to filter out the ones that contain 2 patterns. However, these patterns are on different lines. I've tried grep and awk, but in both cases they only seem to match the patterns on the same line. I know grep is line-based, but I'm less familiar with awk. Here's what I came up with, but it only prints lines that match both strings:
awk '/string1/ && /string2/' file
Grep will easily handle this using xargs:
grep -l string1 * | xargs grep -l string2
Use this command in the directory where the files are located, and the names of the matching files will be displayed.
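If the file names may contain spaces, a NUL-delimited variant (GNU grep and xargs; offered as a sketch) is safer:
grep -lZ string1 * | xargs -0 grep -l string2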
Depending on whether you really want to search for regexps:
gawk -v RS='^$' '/regexp1/ && /regexp2/ {print FILENAME}' file
or for strings:
gawk -v RS='^$' 'index($0,"string1") && index($0,"string2") {print FILENAME}' file
The above uses GNU awk for multi-char RS to read the whole file as a single record.
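Since the goal is to filter a number of files, the same command can be handed all of them at once (the file names here are placeholders):
gawk -v RS='^$' '/regexp1/ && /regexp2/ {print FILENAME}' file1 file2 file3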
You can do it with find
find -type f -exec bash -c "grep -q string1 {} && grep -q string2 {} && echo {}" ";"
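Leaving out the starting directory relies on GNU find defaulting to the current directory; it can also be spelled out explicitly:
find . -type f -exec bash -c "grep -q string1 {} && grep -q string2 {} && echo {}" ";"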
You could do it like this with GNU awk:
awk '/foo/{seenFoo++} /bar/{seenBar++} seenFoo&&seenBar{print FILENAME;seenFoo=seenBar=0;nextfile}' file*
That says: if you see foo, increment the variable seenFoo; likewise, if you see bar, increment seenBar. If, at any point, you have seen both foo and bar, print the name of the current file, reset the flags so the next file starts with neither foo nor bar seen, and skip to the next input file, ignoring all remaining lines in the current one.

Grabbing a substring from text with bash

I am trying to extract a substring from some text and I am struggling to find the correct sed or regex that will do it for me.
My input text could be one of the following
feature/XXX-9999-SomeOtherText
develop
feature/XXX-99999-SomeMoreText
bugfix/XXX-9999
feature/XXXX-9999
XXX-9999
and I want to pull out just the XXX-9999, but there can be any number of Xs and 9s. Where there are no Xs or 9s (as per the second example), I would like to return an empty value.
I have tried several ways using sed and the closest I got was
echo "feature/XXX-9999-SomeOtherText" | sed 's/.*\([[:alpha:]]\{3\}-[[:digit:]]\{4\}\).*/\1/'
which works if there are 3 Xs and 4 9s but anything else gives the full input string.
You can use grep and its -o option:
grep -o 'X\+-9\+'
If you want non-matching lines to result in empty lines you can add || echo ''.
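As a sketch of that fallback on the one sample line that has no match:
grep -o 'X\+-9\+' <<< "develop" || echo ''
Here grep finds nothing and exits non-zero, so the echo supplies the empty value.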
You can use this sed:
sed 's#\(^\|.*/\)\([a-zA-Z0-9]\+-[0-9]\+\).*#\2#g; /[a-zA-Z0-9]\+-[0-9]\+/!s#.*##g' yourfile
echo "feature/XXX-9999-SomeOtherText\nnoX nor 9" | sed 's/.*\([[:alpha:]]\{1,\}-[[:digit:]]\{1,\}\).*/\1/
t
s/.*//'
you use a count that is fixed in your test {3} so any number of X equal or greater succeed but not less. Change it to a minimum {1,} (equivalent to the + for GNU sed).
I also add the non container to empty line (not removing the line), if not needed, remove fom t until last /
Run on your posted sample input file:
$ sed -r -n 's/[^X]*(X+-9+).*/\1/p' file
XXX-9999
XXX-99999
XXX-9999
XXXX-9999
XXX-9999
$ sed -r -n 's/[^X]*(X+-9+)?.*/\1/p' file
XXX-9999
XXX-99999
XXX-9999
XXXX-9999
XXX-9999
The above IMHO shows a couple of the most likely interpretations of "where there are no Xs or 9s (as per the second example) I would like to return an empty value".
If your sed doesn't support -r then this would work with any sed:
sed -n 's/[^X]*\(XX*-99*\).*/\1/p' file
sed -n 's/[^X]*\(XX*-99*\)*.*/\1/p' file

find the first match of a regex in a file, and print it

I have a collection of words on one side, and a file on the other. I need their intersection, i.e. the words that appear at least once in the file.
I am able to extract the matching lines with
sed -rn 's/(word1|word2|blablabla|wordn)/\1/p' myfile.txt
but I cannot go forward.
Thank-you for helping, Olivier
Perhaps grep may work here?
grep -o -E 'word1|word2|word3' file.txt | sort -u
You can do it using grep and sort:
grep -o 'word1\|word2\|word3' myfile.txt | sort -u
The -o switch makes grep output only the matching string, not the complete line. sort -u sorts the matching words and removes duplicates.
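For instance, with a hypothetical myfile.txt in which word2 occurs twice and word1 once:
$ grep -o 'word1\|word2\|word3' myfile.txt | sort -u
word1
word2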
If I understood you correctly, you just need to pipe the sed results to uniq:
sed -rn 's/.*(word1|word2|blablabla|wordn).*/\1/p' myfile.txt | uniq
Also, you need to match the whole line in sed in order to get just the desired words as output. That's why I've placed .* at the front and at the end of the pattern.
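One caveat: uniq only collapses adjacent duplicates, so if the same word can reappear after other matches in between, piping through sort -u instead (as in the grep answers above) is more robust:
sed -rn 's/.*(word1|word2|blablabla|wordn).*/\1/p' myfile.txt | sort -u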