Grep pattern matching omitting parts in output - regex

Is it possible using just one grep and regexp combination to achieve the following. Say I have a file like so:
$ cat f.txt
line 1 foo
line 2 boo
no match
line 3 blank
line X no match
I want to match all the lines that start with the word line and followed by a number but only display the what come after that, so the part that is matched by (.*).
$ grep -E '^line [0-9]+(.*)' f.txt
line 1 foo
line 2 boo
line 3 blank
Can you say match but don't display this part ^line [0-9]+ like doing the inverse of grep -o '^line [0-9]+'
So my expected output would look like this
$ grep -E ***__magic__*** f.txt
foo
boo
blank

You can use sed
~$ cat 1.txt
line 1 foo
line 2 boo
no match
line 3 blank
line X no match
$ grep -E '^line [0-9]' 1.txt | sed 's/^line [0-9] //'
foo
boo
blank
UPDATED
...or without using sed
$ grep -E '^line [0-9]' 1.txt | grep -oE '[a-z]*$'
foo
boo
blank

Given your example file:
$ cat cat_1.txt
line 1 foo
line 2 boo
no match
line 3 blank
line X no match
This is easy with Perl:
perl -lne 'print $1 if /^line \d+ (.*)/' cat_1.txt
Or with sed:
sed -En 's/^line [0-9]+ (.*)/\1/p' cat_1.txt
Either case, prints:
foo
boo
blank

Related

Grep match a word but not another word

I want to match a line with foo not followed by bar, e.g.
foo 123 <-- match
foo bar <-- not match
Using the following regex does not work:
echo "foo 123" | grep -E 'foo.*(?!bar).*'
Any idea?
On systems that don't have grep -P like OSX you can use this awk command:
awk -F 'foo' 'NF>1{s=$0; $1=""; if (!index($0,"bar")) print s}' file
Script Demo
You could try the below grep command which uses -P(perl-regexp) parameter,
grep -P 'foo(?:(?!bar).)*$' file
Example:
$ cat file
foo 123
foo bar
$ grep -P 'foo(?:(?!bar).)*$' file
foo 123
Or
use only negative lookahead to check whether the string bar is after to foo without matching any following character.
$ grep -P 'foo(?!.*bar)' file
foo 123
You can use -v to invert the match:
grep -v 'foo.*bar' file

How to match and keep the first number in a line using sed?

Question
Let's say I have one line of text with a number placed somewhere (it could be at the beginning, in the middle or at the end of the line).
How to match and keep the first number found in a line using sed?
Minimal example
Here is my attempt (following this page of a tutorial on regular expressions) and the output for different positions of the number:
$echo "SomeText 123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$echo "123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$echo "SomeText 123" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
As you can only the last digit is kept in the process whereas the desired output should be 123...
Using sed:
echo "SomeText 123SomeText 456" | sed -r 's/^[^0-9]*([0-9]+).*$/\1/'
123
You can also do this in gnu awk:
echo "SomeText 123SomeText 456" | awk '{print gensub(/^[^0-9]*([0-9]+).*$/, "\\1", $0)}'
123
To complement the sed solutions, here's an awk alternative (assuming that the goal is to extract the 1st number on each line, if any (i.e., ignore lines without any numbers)):
awk -F'[^0-9]*' '/[0-9]/ { print ($1 != "" ? $1 : $2) }'
-F'[^0-9]*' defines any sequence of non-digit chars. (including the empty string) as the field separator; awk automatically breaks each input line into fields based on that separator, with $1 representing the first field, $2 the second, and so on.
/[0-9]/ is a pattern (condition) that ensures that output is only produced for lines that contain at least one digit, via its associated action (the {...} block) - in other words: lines containing NO number at all are ignored.
{ print ($1!="" ? $1 : $2) } prints the 1st field, if nonempty, otherwise the 2nd one; rationale: if the line starts with a number, the 1st field will contain the 1st number on the line (because the line starts with a field rather than a separator; otherwise, it is the 2nd field that contains the 1st number (because the line starts with a separator).
You can also use grep, which is ideally suited to this task. sed is a Stream EDitor, which is only going to indirectly give you what you want. With grep, you only have to specify the part of the line you want.
$ cat file.txt
SomeText 123SomeText
123SomeText
SomeText 123
$ grep -o '[0-9]\+' file.txt
123
123
123
grep -o prints only the matching parts of a line, each on a separate line. The pattern is simple: one or more digits.
If your version of grep is compatible with the -P switch, you can use Perl-style regular expressions and make the command even shorter:
$ grep -Po '\d+' file.txt
123
123
123
Again, this matches one or more digits.
Using grep is a lot simpler and has the advantage that if the line doesn't match, nothing is printed:
$ echo "no number" | grep -Po '\d+' # no output
$ echo "yes 123number" | grep -Po '\d+'
123
edit
As pointed out in the comments, one possible problem is that this won't only print the first matching number on the line. If the line contains more than one number, they will all be printed. As far as I'm aware, this can't be done using grep -o.
In that case, I'd go with perl:
perl -lne 'print $1 if /.*?(\d+).*/'
This uses lazy matching (the question mark) so only non-digit characters are consumed by the .* at the start of the pattern. The $1 is a back reference, like \1 in sed. If there are more than one number on the line, this only prints the first. If there aren't any at all, it doesn't print anything:
$ echo "no number" | perl -ne 'print "$1\n" if /.*?(\d+).*/'
$ echo "yes123number456" | perl -lne 'print $1 if /.*?(\d+).*/'
123
If for some reason you still really want to use sed, you can do this:
sed -n 's/^[^0-9]*\([0-9]\{1,\}\).*$/\1/p'
unlike the other answers, this is compatible with all version of sed and will only print lines that contain a match.
Try this sed command,
$echo "SomeText 123SomeText" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123
Another example,
$ echo "SomeText 123SomeText 456" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123 456
It prints all the numbers in a file and the captured numbers are separated by spaces while printing.

Bash Regex Matching digits

I am try to match only the digits from a text file that is in this format:
> 1234
I'm running thru the file with a loop. Every line store in $i
$i | grep "\d{4}"
Output looks like this:
>
1234
>
5678
Why is it still outputing the >? I want to remove those.
From man grep
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a
separate output line.
Try grep -o ....
Test:
i="> 1234"
$ echo "$i"
> 1234
$ echo "$i" | grep -oP "\d{4}"
1234

How can i display the second matched regex in sed

Suppose I have this text
The code for 233-CO is the main reason for 45-DFG and this 45-GH
Now I have this regexp \s[0-9]+-\w+ which matches 233-CO, 45-DFG and 45-GH.
How can I display just the third match 45-GH?
sed -re 's/\s[0-9]+-\w+/\3/g' file.txt
where \3 should be the third regexp match.
Is it mandatory to use sed? You could do it with grep, using arrays:
text="The code for 233-CO is the main reason for 45-DFG and this 45-GH"
matches=( $(echo "$text" | grep -o -m 3 '\s[0-9]\+-\w\+') ) # store first 3 matches in array
echo "${matches[0]} ${matches[2]}" # prompt first and third match
To find the last occurence of your pattern, you can use this:
$ sed -re 's/.*\s([0-9]+-\w+).*/\1/g' file
45-GH
if awk is accepted, there is an awk onliner, you give the No# of match you want to grab, it gives your the matched str.
awk -vn=$n '{l=$0;for(i=1;i<n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' file
test
kent$ echo $STR #so we have 7 matches in str
The code for 233-CO is the main reason for 45-DFG and this 45-GH,foo 004-AB, bar 005-CC baz 006-DDD and 007-AWK
kent$ n=6 #now I want the 6th match
#here you go:
kent$ awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' <<< $STR
006-DDD
This might work for you (GNU sed):
sed -r 's/\b[0-9]+-[A-Z]+\b/\n&\n/3;s/.*\n(.*)\n.*/\1/' file
s/\b[0-9]+-[A-Z]+\b/\n&\n/3 prepend and append \n (newlines) to the third (n) pattern in question.
s/.*\n(.*)\n.*/\1/ delete the text before and after the pattern
With grep for matching and sed for printing the occurrence:
$ egrep -o '\b[0-9]+-\w+' file | sed -n '1p'
233-CO
$ egrep -o '\b[0-9]+-\w+' file | sed -n '2p'
45-DFG
$ egrep -o '\b[0-9]+-\w+' file | sed -n '3p'
45-GH
Or with a little awk passing the occurrence to print using the variable o:
$ awk -v o=1 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
233-CO
$ awk -v o=2 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-DFG
$ awk -v o=3 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-GH

Using sed to delete all lines between two matching patterns

I have a file something like:
# ID 1
blah blah
blah blah
$ description 1
blah blah
# ID 2
blah
$ description 2
blah blah
blah blah
How can I use a sed command to delete all lines between the # and $ line? So the result will become:
# ID 1
$ description 1
blah blah
# ID 2
$ description 2
blah blah
blah blah
Can you please kindly give an explanation as well?
Use this sed command to achieve that:
sed '/^#/,/^\$/{/^#/!{/^\$/!d}}' file.txt
Mac users (to prevent extra characters at the end of d command error) need to add semicolons before the closing brackets
sed '/^#/,/^\$/{/^#/!{/^\$/!d;};}' file.txt
OUTPUT
# ID 1
$ description 1
blah blah
# ID 2
$ description 2
blah blah
blah blah
Explanation:
/^#/,/^\$/ will match all the text between lines starting with # to lines starting with $. ^ is used for start of line character. $ is a special character so needs to be escaped.
/^#/! means do following if start of line is not #
/^$/! means do following if start of line is not $
d means delete
So overall it is first matching all the lines from ^# to ^\$ then from those matched lines finding lines that don't match ^# and don't match ^\$ and deleting them using d.
$ cat test
1
start
2
end
3
$ sed -n '1,/start/p;/end/,$p' test
1
start
end
3
$ sed '/start/,/end/d' test
1
3
In general form, if you have a file with contents of form abcde, where section a precedes pattern b, then section c precedes pattern d, then section e follows, and you apply the following sed commands, you get the following results.
In this demonstration, the output is represented by => abcde, where the letters show which sections would be in the output. Thus, ae shows an output of only sections a and e, ace would be sections a, c, and e, etc.
Note that if b or d appear in the output, those are the patterns appearing (i.e., they're treated as if they're sections in the output).
Also don't confuse the /d/ pattern with the command d. The command is always at the end in these demonstrations. The pattern is always between the //.
sed -n -e '/b/,/d/!p' abcde => ae
sed -n -e '/b/,/d/p' abcde => bcd
sed -n -e '/b/,/d/{//!p}' abcde => c
sed -n -e '/b/,/d/{//p}' abcde => bd
sed -e '/b/,/d/!d' abcde => bcd
sed -e '/b/,/d/d' abcde => ae
sed -e '/b/,/d/{//!d}' abcde => abde
sed -e '/b/,/d/{//d}' abcde => ace
Another approach with sed:
sed '/^#/,/^\$/{//!d;};' file
/^#/,/^\$/: from line starting with # up to next line starting with $
//!d: delete all lines except those matching the address patterns
I did something like this long time ago and it was something like:
sed -n -e "1,/# ID 1/ p" -e "/\$ description 1/,$ p"
Which is something like:
-n suppress all output
-e "1,/# ID 1/ p" execute from the first line until your pattern and p (print)
-e "/\$ description 1/,$ p" execute from the second pattern until the end and p (print).
I might be wrong with some of the escaping on the strings, so please double check.
The example below removes lines between "if" and "end if".
All files are scanned, and lines between the two matching patterns are removed ( including them ).
IFS='
'
PATTERN_1="^if"
PATTERN_2="end if"
# Search for the 1st pattern in all files under the current directory.
GREP_RESULTS=(`grep -nRi "$PATTERN_1" .`)
# Go through each result
for line in "${GREP_RESULTS[#]}"; do
# Save the file and line number where the match was found.
FILE=${line%%:*}
START_LINE=`echo "$line" | cut -f2 -d:`
# Search on the same file for a match of the 2nd pattern. The search
# starts from the line where the 1st pattern was matched.
GREP_RESULT=(`tail -n +${START_LINE} $FILE | grep -in "$PATTERN_2" | head -n1`)
END_LINE="$(( $START_LINE + `echo "$GREP_RESULT" | cut -f1 -d:` - 1 ))"
# Remove lines between first and second match from file
sed -e "${START_LINE},${END_LINE}d;" $FILE > $FILE
done