How to use grep to extract multiple groups - regex

Say I have this file data.txt:
a=0,b=3,c=5
a=2,b=0,c=4
a=3,b=6,c=7
I want to use grep to extract 2 columns corresponding to the values of a and c:
0 5
2 4
3 7
I know how to extract each column separately:
grep -oP 'a=\K([0-9]+)' data.txt
0
2
3
And:
grep -oP 'c=\K([0-9]+)' data.txt
5
4
7
But I can't figure out how to extract both groups at once. I tried the following, which didn't work:
grep -oP 'a=\K([0-9]+),.+c=\K([0-9]+)' data.txt
5
4
7

I am also curious whether grep can do this. \K discards everything matched so far from the reported match, so using it twice in the same expression does not help: only the text after the last \K is shown. Hence, it has to be done differently.
In the meantime, I would use sed:
sed -r 's/^a=([0-9]+).*c=([0-9]+)$/\1 \2/' file
It captures the digits after a= and c= on lines that start with a= and have nothing after the c= digits.
For your input, it returns:
0 5
2 4
3 7
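A minimal sketch of the same sed approach, assuming a sed that accepts -E (GNU and BSD sed both do). The function name extract_a_c and the extra non-matching input line are illustrative, not from the question. Adding -n and the p flag prints only lines where the substitution succeeded, so lines without both fields are skipped instead of being echoed unchanged:

```shell
# Hypothetical helper: reads lines on stdin and prints "A C" for each
# line that starts with a=<digits> and ends with c=<digits>.
extract_a_c() {
    sed -nE 's/^a=([0-9]+).*c=([0-9]+)$/\1 \2/p'
}

# Sample data from the question plus one line that should be skipped.
printf '%s\n' 'a=0,b=3,c=5' 'a=2,b=0,c=4' 'a=3,b=6,c=7' 'b=1' | extract_a_c
```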

You could try the grep command below. Note, however, that grep prints each match on a separate line, so you won't get the format you mentioned in the question.
$ grep -oP 'a=\K([0-9]+)|c=\K([0-9]+)' file
0
5
2
4
3
7
To get that format, pipe the output of grep to paste or a similar command.
$ grep -oP 'a=\K([0-9]+)|c=\K([0-9]+)' file | paste -d' ' - -
0 5
2 4
3 7

Use this:
awk -F'[=,]' '{print $2" "$6}' data.txt
This uses = and , as field separators and splits each line on them, so the value of a is field 2 and the value of c is field 6.
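The positional approach relies on a, b and c always appearing in the same order. As a sketch under the assumption that every field has the form name=value, the following looks the values up by name instead. The function name extract_by_name and the reordered second input line are made up for illustration:

```shell
# Hypothetical helper: prints the values of a and c for each line,
# regardless of the order in which the name=value pairs appear.
extract_by_name() {
    awk -F',' '{
        split("", value)                  # clear the per-line lookup table
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")           # position of the first "="
            if (eq > 0)
                value[substr($i, 1, eq - 1)] = substr($i, eq + 1)
        }
        print value["a"], value["c"]
    }'
}

printf '%s\n' 'a=0,b=3,c=5' 'c=4,a=2,b=0' | extract_by_name
```

A missing name simply yields an empty column here; whether that is acceptable depends on your data.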

Related

Grep everything before a specific character [duplicate]

I have a file, my_file.
The contents of the file look like this:
4: something
5: something
7: another thing
I want to print out the following:
4
5
7
Basically I want to get all the numbers before the character :
Here is what I tried:
grep -i "^[0-9]+(?=(:)" my_file
This returned nothing. How can I change this command to make it work?
This is a use-case for awk:
$ awk -F":" '{print $1}' < inputfile
because you're using : as a field delimiter.
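A small variation on the awk idea, as a sketch: it prints the first field only when that field consists entirely of digits, so lines without a leading number are skipped. The function name leading_field and the extra input line are assumptions for illustration:

```shell
# Hypothetical helper: prints the text before the first ":" when it is a number.
leading_field() {
    awk -F':' '$1 ~ /^[0-9]+$/ { print $1 }'
}

printf '%s\n' '4: something' '5: something' 'no colon here' '7: another thing' | leading_field
```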
Try this:
grep -Eo "^[0-9]+" my_file # -E (extended) or -P (Perl-compatible) both work here
-o prints only the matched part of each line.
We also need a regex flavour in which + means "one or more"; in grep's default basic regular expressions a plain + is a literal character.
Both of the following will work:
-E extended regular expressions
-P Perl-compatible regular expressions
Breakdown:
^ anchors the match at the start of the line
[0-9] matches a digit
+ matches one or more of the preceding [0-9]
Output:
4
5
7
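To illustrate the point about regex flavours, here is a sketch with two hypothetical helpers: one uses -E with +, the other stays in basic regular expressions by spelling "one or more digits" as [0-9][0-9]*. It assumes a grep that supports -o (GNU and BSD grep do):

```shell
# Extended regex: + means "one or more".
leading_digits() {
    grep -Eo '^[0-9]+'
}

# Basic regex equivalent: one digit followed by zero or more digits.
leading_digits_bre() {
    grep -o '^[0-9][0-9]*'
}

printf '%s\n' '4: something' '5: something' '7: another thing' | leading_digits
printf '%s\n' '4: something' '5: something' '7: another thing' | leading_digits_bre
```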
Using grep
grep -oE '^[0-9]+:' my_file | tr -d ':'
Using sed
sed 's#:.*$##g' my_file
Demo :
$ cat test.txt
4: something
5: something
7: another thing
$ sed 's#:.*$##g' test.txt
4
5
7
$ grep -oE '^[0-9]+:' test.txt | tr -d ':'
4
5
7

Deleting n lines in both directions and the match in sed?

Deleting the match and two lines before it works:
sed -i.bak -e '/match/,-2d' someCommonName.txt
Deleting the match and two lines after it works:
sed -i.bak -e '/match/,+2d' someCommonName.txt
But deleting the match, the two lines after it and the two lines before it does not work:
sed -i.bak -e '/match/-2,+2d' someCommonName.txt
sed: -e expression #1 unknown command: `-'
Why is that?
sed operates on a range of addresses. That means either one or two expressions, not three.
/match/ is an address which matches a regex.
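-2 is meant as an address which specifies two lines before (note that GNU sed documents only the addr1,+N form, so a negative offset may be rejected by your sed)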
+2 is an address which specifies two lines after
Therefore:
/match/,-2 is a range which runs from the line matching match to two lines before it.
/match/-2,+2d, on the other hand, includes three addresses, and thus makes no sense.
To delete two lines before and after a pattern, I would recommend something like this (modified from this answer):
sed -n "1N;2N;/\npattern$/{N;N;d};P;N;D"
This keeps 3 lines in the buffer and reads through the file. When the pattern is found in the last line, it reads two more lines and deletes all 5. Note that this will not work if the pattern is in the first two lines of the file, but it is a start.
sed -i .bak '/match/,-2 {/match/!d;};/match/,+2d' YourFile
Try the command above (I cannot test it here, as -2 is not available in my sed version).
I don't have a complete solution but an outline: sed is a pretty simple tool which doesn't do two things at once. My approach would be to run sed once deleting the two lines after the pattern but keeping the pattern itself. The result can then be piped to sed again to remove the pattern and the two lines before.
FWIW this is how I'd really do the job (just change the b and a values to delete different numbers of lines before/after match is found):
$ cat file
1
2
3
4
5 match
6
7
8
9
$ awk -v b=2 -v a=2 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file file
1
2
8
9
$ awk -v b=3 -v a=1 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file file
1
7
8
9
Note that the above assumes that when 2 "match"s appear within a removal window you want to base the deletions on the original occurrence, not what would happen after the first match being found causes the 2nd match to be deleted:
$ cat file2
1
2
3
4 match
5
6 match
7
8
9
$ awk -v b=2 -v a=2 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file2 file2
1
9
as opposed to the output being:
1
7
8
9
since deleting the 2 lines after the first match would delete the 2nd match and so the 2 lines after THAT would not be deleted since they no longer are within 2 lines after a match.
Something else to consider:
$ diff --changed-group-format='%<' --unchanged-group-format='' file <(grep -A2 -B2 match file)
1
2
8
9
$ diff --changed-group-format='%<' --unchanged-group-format='' file2 <(grep -A2 -B2 match file2)
1
9
That uses bash and GNU diff 3.2; I don't know whether other shells or diff implementations support those constructs and options.
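For input that arrives on a pipe and cannot be read twice, the same skip-table idea can be sketched in a single pass by buffering the lines in memory. This is only a sketch: the function name delete_around is hypothetical, and it assumes the input is small enough to hold in an awk array. Like the two-pass version, it bases every deletion window on the original line numbers:

```shell
# Hypothetical helper: delete_around REGEX BEFORE AFTER
# Reads stdin and drops each matching line plus BEFORE lines above it
# and AFTER lines below it.
delete_around() {
    awk -v pat="$1" -v b="$2" -v a="$3" '
        { line[NR] = $0 }                                    # remember every line
        $0 ~ pat { for (i = NR - b; i <= NR + a; i++) skip[i] }
        END { for (i = 1; i <= NR; i++) if (!(i in skip)) print line[i] }
    '
}

printf '%s\n' 1 2 3 4 '5 match' 6 7 8 9 | delete_around match 2 2
```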

How to delete lines before a match while preserving it?

I have the following script to remove all lines before a line that matches a word:
str='
1
2
3
banana
4
5
6
banana
8
9
10
'
echo "$str" | awk -v pattern=banana '
print_it {print}
$0 ~ pattern {print_it = 1}
'
It returns:
4
5
6
banana
8
9
10
But I want to include the first match too. This is the desired output:
banana
4
5
6
banana
8
9
10
How could I do this? Do you have any better idea with another command?
I've also tried sed '0,/^banana$/d', but it seems to work only with files, and I want to use it with a variable.
And how could I get all lines before a match using awk?
That is, with banana as the regex, this would be the output:
1
2
3
This awk should do:
echo "$str" | awk '/banana/ {f=1} f'
banana
4
5
6
banana
8
9
10
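The question also asks for the opposite: all lines before the first match. A minimal sketch of both directions follows; the function names are made up, and the pattern is passed with -v, so awk treats it as a dynamic regex:

```shell
# Print from the first matching line (inclusive) to the end of input.
from_first_match() {
    awk -v pat="$1" '$0 ~ pat { found = 1 } found'
}

# Print everything before the first matching line, then stop reading.
before_first_match() {
    awk -v pat="$1" '$0 ~ pat { exit } { print }'
}

printf '%s\n' 1 2 3 banana 4 banana 5 | from_first_match banana
printf '%s\n' 1 2 3 banana 4 banana 5 | before_first_match banana
```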
sed -n '/^banana$/,$p'
This should do what you want. -n instructs sed to print nothing by default, and the p command prints every addressed line. It works on a stream. It differs from the awk solution in that it requires the entire line to be exactly 'banana', whereas your awk solution merely requires 'banana' to appear somewhere in the line; I am copying the regex from your sed example.
I'm not sure what you mean by "use it with a variable". If you mean that the string 'banana' should come from a variable, you can use sed -n "/$variable/,\$p" (note the double quotes and the escaped $), sed -n "/^$variable\$/,\$p", or sed -n "/^$variable"'$/,$p'. You can also run echo "$str" | sed -n '/banana/,$p', just as you do with awk.
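The quoting can be sketched as a small function. The name print_from_exact_line is hypothetical, and the sketch assumes the word contains no regex metacharacters and no slashes, because it is pasted into the sed script unchanged:

```shell
# Hypothetical helper: print from the first line that is exactly WORD
# to the end of input.
print_from_exact_line() {
    # \$ stops the shell from treating $ as the start of an expansion,
    # so sed receives the script /^WORD$/,$p
    sed -n "/^$1\$/,\$p"
}

word='banana'
printf '%s\n' 1 'banana split' banana 4 banana 5 | print_from_exact_line "$word"
```

The line "banana split" is not printed, because the whole line has to equal the word.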
Just invert the commands in the awk:
echo "$str" | awk -v pattern=banana '
$0 ~ pattern {print_it = 1} # if the line matches, set the flag
print_it {print} # if the flag is set, print the line
'
The print_it flag is set when the pattern is found. From that moment on (including that line), lines are printed while the flag is on. Previously, the print happened before the check.
cat in.txt | awk "/banana/,0"
If you don't want to keep the matched line, you can use (the 0,/regex/ address is a GNU sed extension):
cat in.txt | sed "0,/banana/d"

grep line containing a pattern to line containing other pattern

Say the input is:
">"1aaa
2
3
4
">"5bbb
6
7
">"8ccc
9
">"10ddd
11
12
I want this output (for example, for the matching pattern "bbb"):
">"5bbb
6
7
I had tried with grep:
grep -A 2 -B 0 "bbb" file.txt > results.txt
This works. However, the number of lines between ">"5bbb and ">"8ccc is variable. Does anyone know how to achieve that using Unix command line tools?
With awk you could simply use a flag, like so:
$ awk '/^">"/{f=0}/bbb/{f=1}f' file
">"5bbb
6
7
You could also parametrize the pattern like so:
$ awk '/^">"/{f=0}$0~pat{f=1}f' pat='aaa' file
">"1aaa
2
3
4
Explanation:
/^">"/ # Regular expression that matches lines starting ">"
{f=0} # If the regex matched unset the print flag
/bbb/ # Regular expression to match the pattern bbb
{f=1} # If the regex matched set the print flag
f # If the print flag is set then print the line
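One possible refinement, offered as a sketch rather than as the answer's method: test the pattern only on header lines, so that a body line which happens to contain the pattern does not switch printing on. The function name print_record is hypothetical:

```shell
# Hypothetical helper: print the record whose header line matches REGEX.
print_record() {
    awk -v pat="$1" '
        /^">"/ { printing = ($0 ~ pat) }   # header line: is this record wanted?
        printing                           # print header and body of a wanted record
    '
}

printf '%s\n' '">"1aaa' 2 3 4 '">"5bbb' 6 7 '">"8ccc' 9 | print_record bbb
```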
Something like this should do it:
sed -ne '/bbb/,/^"/ { /bbb/p; /^[^"]/p; }' file.txt
That is:
for the range of lines between matching /bbb/ and /^"/
if the line matches /bbb/ print it
if the line doesn't start with " print it
otherwise nothing else is printed
This might work for you (GNU sed):
sed '/^"/h;G;/\n.*bbb/P;d' file
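Here is my reading of how that one-liner works, written as a commented sketch. It uses -n instead of the trailing d, which should be equivalent here, and the function name print_record_sed is made up:

```shell
# Hypothetical helper: print the record whose header contains bbb.
print_record_sed() {
    sed -n '
        # On a header line, remember it in the hold space.
        /^"/h
        # Append the remembered header to the current line.
        G
        # If the remembered header contains bbb, print the current line only.
        /\n.*bbb/P
    '
}

printf '%s\n' '">"1aaa' 2 3 4 '">"5bbb' 6 7 '">"8ccc' 9 | print_record_sed
```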

delete lines with specific pattern

Hi, I have to delete some lines in a file:
file 1
1 2 3
4 5 6
file 2
1 2 3 6
5 7 8 7
4 5 6 9
I have to delete every line of file 2 in which a line of file 1 appears:
output
5 7 8 7
I used sed:
for sample_index in $(seq 1 3)
do
sample=$(awk 'NR=='$sample_index'' file1)
sed "/${sample}/d" file2 > tmp
done
But it doesn't work: it doesn't print anything. Do you have any idea? It gives me the error 'sed: -e expression #1, char 0: precedent regular expression needed'
This could be a start:
$ grep -vf file1 file2
5 7 8 7
One potential pitfall here is that the output won't change if you put 5 6 9 as the second line of file1, because the patterns match anywhere in a line. I'm not sure whether you want that. If not, you can anchor each pattern to the start of the line:
grep -vf <(sed 's/^/^/' file1) file2
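The process substitution above needs bash. A sketch of the same anchoring idea for a plain POSIX shell, using a temporary pattern file, might look like this. The function name remove_prefixed is hypothetical, and it assumes file1 contains no empty lines (an empty line would become the pattern ^, which matches everything) and no regex metacharacters:

```shell
# Hypothetical helper: remove_prefixed FILE1 FILE2
# Prints the lines of FILE2 that do not start with any line of FILE1.
remove_prefixed() (
    patterns=$(mktemp) || exit 1
    sed 's/^/^/' "$1" > "$patterns"      # anchor every pattern at line start
    grep -v -f "$patterns" "$2"
    status=$?
    rm -f "$patterns"
    exit "$status"
)

# Demo with the data from the question, plus 5 6 9 as a second pattern.
work=$(mktemp -d)
printf '%s\n' '1 2 3' '5 6 9' > "$work/file1"
printf '%s\n' '1 2 3 6' '5 7 8 7' '4 5 6 9' > "$work/file2"
remove_prefixed "$work/file1" "$work/file2"
rm -rf "$work"
```

With the anchor in place, 4 5 6 9 is kept, because it contains 5 6 9 but does not start with it.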
This should work if your real data has 3 columns:
awk 'NR==FNR{a[$1,$2,$3]++;next}!(($1,$2,$3) in a)' file{1,2}
For variable columns:
awk 'NR==FNR{a[$0]++;next}{for(x in a) if(index($0,x)>0) next}1' file{1,2}
And the code for GNU sed
sed -r 's#(.*)#/\1/d#' file1 | sed -f - file2