How to remove rows that match one of several regex patterns? - regex
I have a tab-delimited text file and wish to efficiently remove whole rows that fulfil either of the following criteria:
values in the ALT column that are equal to .
values in the NA00001 column and subsequent columns that have the same digit before and after either of the two delimiters, | or /, e.g. 0|0, 1|1, 2/2 etc.
An example input file is below:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 0|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1110696 rs6040360 A . 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
Example output file is:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
Your example doesn't appear to include any lines that meet the "values in the ALT column that are equal to ." criterion, or lines that don't meet the second criterion (except the header line). So I added some lines of my own to your example for testing; I hope I've understood your criteria.
The first criterion is easily matched by testing the particular field, if we're using something like awk: $5 == "." {next} in an awk script would skip that line. Just using a regular expression is pretty simple too: ^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I, where ^I is a tab character, matches lines with just "." in the fifth (ALT) field.
With strict regular expressions you can't express "the same digit before and after [a delimiter]" directly. You have to do it with alternation of sub-expressions with specific values: 0[|/]0|1[|/]1|2[|/]2... But there are only 10 digits, so this isn't particularly burdensome. So, for example, you can do this filtering with one long egrep command line:
egrep -v '^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I|0[|/]0|1[|/]1|2[|/]2|3[|/]3|4[|/]4|5[|/]5|6[|/]6|7[|/]7|8[|/]8|9[|/]9' input-file
Obviously that's not something you'd want to type by hand on a regular basis, and isn't ideal for maintenance. A little awk script is better:
#! /usr/bin/awk -f
# Skip lines with "." in the fifth (ALT) field
$5 == "." {next}
# Skip lines with the same digit before and after the delimiter in any field
/0[|/]0/ {next}
/1[|/]1/ {next}
/2[|/]2/ {next}
/3[|/]3/ {next}
/4[|/]4/ {next}
/5[|/]5/ {next}
/6[|/]6/ {next}
/7[|/]7/ {next}
/8[|/]8/ {next}
/9[|/]9/ {next}
# Copy all other lines to the output
{print}
I've put the individual digit checks as separate awk statements for readability.
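The ten per-digit rules can also be generated in a loop. This is only a sketch of the same filter, assuming the default whitespace field splitting used above; the function name drop_same_digit is made up for illustration. Like the script above, it drops a line if the pattern appears anywhere on it.

```shell
# Hypothetical helper: same filtering as the awk script above,
# with the per-digit patterns built in a loop.
drop_same_digit() {
  awk '
    # Skip lines with "." in the fifth (ALT) field
    $5 == "." { next }
    {
      # Build "0[|/]0", "1[|/]1", ... as dynamic regexes
      for (d = 0; d <= 9; d++)
        if ($0 ~ (d "[|/]" d)) next
      # Copy all other lines to the output
      print
    }
  ' "$@"
}
```

It can be called as drop_same_digit input-file, or fed from a pipe.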
With a regex dialect that supports back-references, you can express "same character before and after the delimiter" directly. Note that POSIX extended regular expressions (EREs) do not define back-references; Perl-style regexes do. Back-references should be used with caution, since they can create pathological performance characteristics; and, of course, you'll have to use a language that supports them, such as Perl. POSIX awk and GNU awk (gawk) don't. Here's a Perl one-liner that handles the second criterion:
LINE: while (<STDIN>) { next LINE if /(\d)[|\/]\g1/; print }
That's probably not very good Perl - I almost never use the language - but it works in my testing. The (\d) matches and remembers the digit before the delimiter, and the \g1 matches the remembered digit after the delimiter.
perl -alnE '$F[4] eq "." and
$F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
say'
Update: Sorry, the OP asked for the opposite...
perl -alnE 'say unless (
$F[4] eq "." or
( $F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
$F[11] =~ m!(\d)[|/]\1!
)
)'
or equivalent
perl -ane 'next if ( $F[4] eq ".");
next if ( $F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
$F[11] =~ m!(\d)[|/]\1! );
print '
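For comparison, the same per-column logic can be written in plain awk without back-references. This is a sketch under assumptions: the file is tab-delimited with one header line, the genotype is the first ":"-separated part of each sample column, and a row is dropped only when every sample column (column 10 onward) has equal alleles, which is what the expected output in the question suggests. The function name filter_vcf is hypothetical.

```shell
# Hypothetical helper: drop rows whose ALT is "." or whose sample
# columns (10..NF) ALL have the same allele on both sides of | or /.
filter_vcf() {
  awk -F'\t' '
    NR == 1   { print; next }      # keep the header line
    $5 == "." { next }             # ALT column equals "."
    {
      all_hom = (NF >= 10)
      for (i = 10; i <= NF && all_hom; i++) {
        split($i, part, ":")                  # genotype is part[1]
        n = split(part[1], allele, "[|/]")    # split on either delimiter
        if (n != 2 || allele[1] != allele[2]) all_hom = 0
      }
      if (!all_hom) print
    }
  ' "$@"
}
```

Unlike the Perl versions, this does not hard-code three sample columns, so it should also cope with files that have more or fewer samples.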
Related
capture repeating regex pattern as one group, sed in bash script
I wrote a working expression that extracts two pieces of data from valid lines of text. The first capture group is the numerical section including periods. The second is the remaining characters of the line as long as the line is valid. A line is invalid if the numerical section ends with a period or the line ends with a number.
1.1 the quick 1-1 (no match due to ending hyphen and number)
11.2 brown fox jumped (should return '11.2' and 'brown fox jumped')
1.41.1 over the lazy (should return '1.41.1' and 'over the lazy')
2.1. dog (no match due to numerical section trailing period)
The expression ^((?:[0-9]+\.)+[0-9]+) (.*)[^0-9]$ works when tested on various regex testing sites.
My issue is that I have failed to adapt this expression to work with sed from a bash script that loops through lines of text ($L).
IFS=$'\t' read -r NUM STR < <(sed 's#^\(\(?:[0-9]\+\.\)\+[0-9]\+\) \(.*)[^0-9]$#\1\t\2#p;d' <<< $L )
What does work is below, where I replaced the capturing of repeating groups with repeating digits and periods. I would prefer not to do this because it could match lines starting with periods and multiple periods in a row. It also loses the last character of the captured string, but I expect I can figure that part out.
FS=$'\t' read -r NUM STR < <(sed 's#^\([0-9\.]\+[0-9]\+\) \(.*[^0-9]\)$#\1\t\2#p;d' <<< $L )
Please help me understand what I'm doing wrong. Thank you.
An ERE for that would be:
^([0-9]+(\.[0-9]+)*) (.*[^0-9])$
with \1 and \3 being the capture groups of interest.
But I'm not sure that using sed + read is the best approach for capturing the data in variables; you could just use bash builtins instead:
#!/bin/bash
while IFS=' ' read -r num str
do
    [[ $num =~ ^([0-9]+(\.[0-9]+)*)$ && $str =~ [^0-9]$ ]] || continue
    declare -p num str
done < input.txt
There's a side effect with this solution though: the read will strip the leading and trailing spaces of the line, as well as the first run of spaces in the middle (the one that separates num from str). If you need those spaces then you can match the whole line instead:
#!/bin/bash
regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'
while IFS='' read -r line
do
    [[ $line =~ $regex ]] || continue
    num=${BASH_REMATCH[1]}
    str=${BASH_REMATCH[3]}
    declare -p num str
done < input.txt
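If sed is still preferred, the ERE above can be used with sed -E. Part of the trouble with the original attempt is that POSIX regexes have no (?:...) non-capturing groups, so the inner group has to be a normal capture group and the wanted groups become \1 and \3. A minimal sketch follows; the function name split_line is made up, and a literal tab is spliced into the replacement because \t in the replacement text is not portable across sed implementations.

```shell
# Hypothetical helper: print "NUM<TAB>STR" for valid lines only.
split_line() {
  tab=$(printf '\t')
  # -n with the p flag prints only the lines where the substitution succeeded
  sed -nE "s/^([0-9]+(\.[0-9]+)*) (.*[^0-9])\$/\1${tab}\3/p" "$@"
}
```

Its output can then be consumed with IFS=$'\t' read -r NUM STR as in the question.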
replace strings with certain format in bash
I have a file like this. It is a 7-column tabular file whose separator is one space (sep=" "). However, the 4th column is a string of words that itself contains spaces. The last 3 columns are numbers.
test_find.txt
A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007
G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015
I want to replace spaces with underscores (e.g. replace "Sterile alpha motif domain" with "Sterile_alpha_motif_domain"). First, find the pattern that starts with letters and ends with "|", then treat it as one string and replace all spaces with "_". Then move to the next line and find the next pattern. (Is there any easier way to do it?)
I was able to use
sed -i -e 's/Sterile alpha motif domain/Sterile_alpha_motif_domain/g' test_find.txt
on only the first row, but cannot generalize it. I tried to find all patterns using
sed -n 's/^[^[a-z]]*{\(.*\)\\[^\|]*$/\1/p' test_find.txt
but it doesn't work. Can anyone help me? I want output like this:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Thank you!!!!
We'll need two-step processing: first extract the 4th column, which may contain spaces; next replace the spaces in the 4th column with underscores. With GNU awk:
gawk '{
    if (match($0, /^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$/, a)) {
        gsub(/ /, "_", a[3])
        print a[1] a[3] a[4]
    }
}' test_find.txt
Output:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
The regex ^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$ matches a whole line and captures each submatch. The 3rd argument of match() (a GNU awk extension), a, names an array that receives the capture groups. a[1] holds the 1st-3rd columns, a[3] holds the 4th column, and a[4] holds the 5th-7th columns. The gsub function replaces the spaces in a[3] with underscores. Then the columns are concatenated and printed.
Assuming you have a special character at the end, before the final column with numbers, you can try this sed:
$ sed -E 's~([[:alpha:]/]+) ~\1_~g;s/_([[:punct:]])/ \1/g' input_file
0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Without making any assumptions on the content of each field, you can 'brute force' the expected result by counting the number of characters in each field (+ the number of field separators) for the beginning of the line and the end of the line, and use this to manipulate the '4th column', e.g.
awk '{start=length($1)+length($2)+length($3)+4; end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6; text=substr($0, start, end); gsub(" ", "_", text); print $1, $2, $3, text, $(NF-2), $(NF-1), $NF}' test.txt
'Neat' version:
awk '{
    start=length($1)+length($2)+length($3)+4
    end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6
    text=substr($0, start, end)
    gsub(" ", "_", text)
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Breakdown:
awk '{
    # Position where column 4 begins: length of the first 3 fields
    # + 3 field separators + 1 (substr positions start at 1)
    start=length($1)+length($2)+length($3)+4;
    # How many characters there are in column 4 (despite its name, "end" is a length):
    # total - (first 3 fields + last 3 fields + 6 field separators)
    end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6;
    # Use the substr function to define column 4
    text=substr($0, start, end);
    # Replace spaces with underscores in column 4
    gsub(" ", "_", text);
    # Print everything
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
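Another way to look at the same idea: since awk has already split the line on spaces, the '4th column' is simply fields 4 through NF-3, so those fields can be glued back together with underscores. This is only a sketch, assuming single-space separators and at least 7 fields per line; the function name join_col4 is hypothetical.

```shell
# Hypothetical helper: join fields 4..NF-3 with "_" and keep the
# first three and last three fields unchanged.
join_col4() {
  awk '{
    out = $1 " " $2 " " $3 " " $4
    for (i = 5; i <= NF - 3; i++) out = out "_" $i
    print out, $(NF-2), $(NF-1), $NF
  }' "$@"
}
```

This works in any POSIX awk, since it does not rely on the three-argument match() of GNU awk.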
Bash regex overwrite line if multiple match
I have a bash script with 3 regular expressions. Using a conditional if, I would like to find the matches of the first pattern in the file. If there is a match, then look for a match of the second pattern, but only among the lines that have matched the first pattern. Finally, check the third pattern only against the lines that have matched the second pattern (which are also the ones that had already matched the first pattern).
I have the following code, but I don't know how to say that, if there is a match, the "line" value should be overwritten so that the total number of lines is reduced to only the matching ones.
#!/bin/bash
pattern1= egrep '^([^,]*,){31}[1-9][0-9].*'
pattern2= egrep '^([^,]*,){16}[0-1].[3-9].*'
pattern3= egrep '^([^,]*,){32}[2-9][0-9].*'
while read line
do
if [[$line == $pattern1]];then
newline == $pattern1
if [[$newline == $pattern2 ]];then
newline2 == $pattern2
if [[$newline2 == $pattern3 ]]; then
echo $pattern3
fi
done < mj1.csv #this is the input file
I will call this script like ./b1.sh <filename>.
Some input data:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,10,10,11/15/1984,21,272,21.74469541,CHI,1,BOS,0,-20,1,33,12,24,0.5,0,1,0,3,3,1,0,2,2,2,2,1,1,4,27,17.1
1985,11,11,11/17/1984,21,274,21.75017112,CHI,1,PHI,0,-9,1,44,4,17,0.235,0,0,,8,8,1,0,5,5,7,5,2,4,5,16,12.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,14,14,11/23/1984,21,280,21.76659822,CHI,0,SEA,1,19,1,30,9,13,0.692,0,0,,5,6,0.833,0,4,4,3,4,1,4,4,23,19.5
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,16,16,11/27/1984,21,284,21.77754962,CHI,0,GSW,0,-6,1,24,6,10,0.6,0,0,,1,1,1,0,2,2,3,3,2,4,1,13,11.1
1985,17,17,11/29/1984,21,286,21.78302533,CHI,0,PHO,0,-5,1,30,9,17,0.529,1,1,1,3,4,0.75,1,2,3,2,2,0,2,5,22,14
1985,18,18,11/30/1984,21,287,21.78576318,CHI,0,LAC,1,4,1,37,9,15,0.6,0,0,,2,4,0.5,2,3,5,5,3,0,4,4,20,15.5
1985,19,19,12/2/1984,21,289,21.79123888,CHI,0,LAL,1,1,1,42,7,13,0.538,0,0,,6,8,0.75,2,0,2,3,1,1,4,3,20,12.9
1985,20,20,12/4/1984,21,291,21.79671458,CHI,1,NJN,1,15,1,35,7,13,0.538,0,0,,6,6,1,1,2,3,6,1,0,3,3,20,16
1985,21,21,12/7/1984,21,294,21.80492813,CHI,1,NYK,1,2,1,43,8,16,0.5,0,1,0,5,7,0.714,1,1,2,3,2,0,6,5,21,9.3
1985,22,22,12/8/1984,21,295,21.80766598,CHI,1,DAL,1,2,1,35,10,23,0.435,0,0,,0,0,,4,3,7,2,0,2,2,3,20,11.2
1985,23,23,12/11/1984,21,298,21.81587953,CHI,1,DET,0,-7,1,37,13,28,0.464,0,1,0,1,3,0.333,1,7,8,6,2,0,3,4,27,16.2
1985,24,24,12/12/1984,21,299,21.81861739,CHI,0,DET,0,-7,1,30,6,17,0.353,0,2,0,9,10,0.9,0,1,1,2,2,1,1,5,21,12.5
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,26,26,12/15/1984,21,302,21.82683094,CHI,1,PHI,0,-12,1,27,7,16,0.438,0,0,,0,0,,1,1,2,2,1,0,1,2,14,7.2
1985,27,27,12/18/1984,21,305,21.83504449,CHI,1,HOU,0,-8,1,45,8,20,0.4,0,1,0,2,4,0.5,1,2,3,8,3,0,1,2,18,14.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
To make things easier: pattern1 matches all rows where column PTS is higher than 10, pattern2 matches the rows where column FG_PCT is higher than 0.3, and pattern3 matches all rows where column GmSc is higher than 19.
While an awk solution is going to be a bit faster ... we'll focus on a bash solution per OP's request.
First issue: regex matching uses the =~ operator, not the == operator.
Second issue: to keep a row only if all 3 regexes match, we want to and (&&) the results of all 3 regex matches.
Third issue: OP's current code has some basic syntax problems (e.g., a space is needed after [[ and before ]]; the regex patterns are improperly assigned to the pattern* variables).
One bash idea:
pattern1='^([^,]*,){31}[1-9][0-9].*'
pattern2='^([^,]*,){16}[0-1].[3-9].*'
pattern3='^([^,]*,){32}[2-9][0-9].*'
head -1 mj1.csv > mj1.new.csv
while read -r line
do
    if [[ "${line}" =~ $pattern1 && "${line}" =~ $pattern2 && "${line}" =~ $pattern3 ]]
    then
        # do whatever with $line, eg:
        echo "${line}"
    fi
done < mj1.csv >> mj1.new.csv
This generates:
$ cat mj1.new.csv
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
NOTE: OP hasn't (yet) provided the expected output so at this point I have to assume OP's regexes are correct
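For reference, the awk route mentioned at the top might look something like the sketch below. It is not equivalent to the regexes: it compares the columns numerically against the thresholds stated in the question (PTS > 10, FG_PCT > 0.3, GmSc > 19), whereas, for example, the third regex only matches values from 20 upward. The column positions 17, 32 and 33 are counted from the header line in the question, and the function name filter_games is made up for illustration.

```shell
# Hypothetical helper: keep the header plus rows where
# FG_PCT ($17) > 0.3, PTS ($32) > 10 and GmSc ($33) > 19.
filter_games() {
  awk -F, 'NR == 1 || ($32 > 10 && $17 > 0.3 && $33 > 19)' "$@"
}
```

A call such as filter_games mj1.csv > mj1.new.csv would then replace the whole loop.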
awk: Use gensub to substitute multiple lines from a paragraph record
I have an input file with multiple paragraphs separated by at least two newlines (\n\n), and I want to extract fields from lines within certain paragraphs. I think the processing will be simplest if I can get gensub to work as I'm hoping. Consider the following input file:
[Record R1]
Var1=0
Var2=20
Var3=5

[Record R2]
Var1=10
Var3=9
Var4=/var/tmp/
Var2=12

[Record R3]
Var1=2
Var3=5
Var5=19
I want to print only the value of Var2 from records R1 and R3 (in R3, Var2 doesn't actually exist). I can easily group all of the variables into their corresponding record by setting RS="\n\n"; then they are all contained within $0. But since I don't know ahead of time where Var2 will appear in the list, I want to use something like gensub to extract it. This is what I have going:
awk '
BEGIN {
RS="\n\n"
}
/Record R1/ || /Record R3/ {
print gensub(/[\n.]*Var2=(.*)[\n.]*/, "\\1", "g", $0)
}
' /tmp/input.txt
But instead of only printing 20 (the value of Var2 from R1), it prints the following:
[Record R1]
Var1=0
20
Var3=5
[Record R3]
Var1=2
Var3=5
Var5=19
The intent is that the regex in the gensub command would capture all characters (newlines: \n; and non-newlines: .) before and after Var2=XX and replace everything with XX. But instead, it's only capturing the characters on the same line as Var2=XX. Can awk's gensub do this kind of multi-line substitution? I know an alternative would be to loop over all the fields in the record, then split the field that matches Var2= on the = sign, but that feels less efficient as I scale this out to multiple variables.
I don't understand what it is you're trying to do with gensub(), but to do what you seem to be trying to do, in any awk:
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[12]$/) print f["Var2"]; delete f}' file
20
12
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[13]$/) print f["Var2"]; delete f}' file
20
gensub() doesn't care if the string it's operating on is one line or many lines, btw: \n is just one more character, no different from any other character.
Oh, hang on, now I see what you're thinking with that gensub(). Your problems are:
[\n.]* means zero or more newlines or periods (inside a bracket expression, . is a literal period). You don't have any periods in your input, so it's the same as \n*, and you don't have any newlines immediately before a Var2.
Var2 doesn't exist in your 2nd record, so the regexp can't match it.
The (.*) will match everything to the end of the record (leftmost longest matches).
The "g" is misleading since you only expect 1 match.
So using gensub() on multi-line text isn't an issue; your regexp is just wrong.
Another awk:
$ awk -v RS= '/\[Record R[13]\]/{for(i=2;i<=NF;i++) {v=sub(/ *Var2=/,"",$i); if(v) print $i}}' file
20
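Since the question mentions scaling this out to multiple variables, here is a sketch of the same paragraph-mode approach with the record names and the variable name passed in as parameters. It assumes the layout shown in the question (a [Record NAME] line followed by unindented Name=value lines, with records separated by blank lines); the function name get_var and its argument order are made up for illustration.

```shell
# Hypothetical helper.
# Usage: get_var 'R1|R3' Var2 [file...]
get_var() {
  rec=$1; var=$2; shift 2
  awk -v rec="$rec" -v var="$var" '
    # Paragraph mode: blank lines separate records, each line is a field
    BEGIN { RS = ""; FS = "\n" }
    $1 ~ ("^\\[Record (" rec ")\\]$") {
      for (i = 2; i <= NF; i++) {
        eq = index($i, "=")                 # position of the first "="
        if (eq && substr($i, 1, eq - 1) == var)
          print substr($i, eq + 1)          # everything after the "="
      }
    }
  ' "$@"
}
```

Records that lack the variable simply print nothing, so no special case is needed for R3.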
Search for Pattern in Text String, then Extract Matched Pattern
I am trying to match and then extract a pattern from a text string. I need to extract any pattern that matches the following in the text string:
10289 20244
Text File:
KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009
I am trying to achieve this using the following bash code:
Bash Code:
cat text_file | grep -Eow '\s10[0-9].*\s' | head -n 4 | awk '{print $1}'
The above code attempts to search for any group of five numeric characters that begins with 10 followed by three numeric characters. After matching this pattern, the code prints out the rest of the text string, capturing the second group of five numeric characters, beginning with 20. I need a better, more reliable way to accomplish this because currently this code fails. The numeric groups I need are separated by a space. I have attempted to account for this by inserting \s into the grep portion of the code.
grep solution:
grep -Eow '10[0-9]{3}\b.*\b20[0-9]{3}' text_file
The output:
10289 20244
[0-9]{3} - matches 3 digits
\b - word boundary
awk '{print $(NF-2),$(NF-1)}' text_file
10289 20244
Prints the next-to-last field and the one before it.
awk '$17 ~ /^10[0-9]{3}$/ && $18 ~ /^20[0-9]{3}$/ { print $17, $18 }' text_file
This will check field 17 for "10xxx" and field 18 for "20xxx", and when BOTH match, print them.
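If the two groups are not always in fields 17 and 18, a position-independent variant is to scan for any "10xxx" field that is immediately followed by a "20xxx" field. A sketch, with the made-up function name extract_groups; the digits are spelled out rather than written as {3} on the assumption that some older awk implementations lack interval expressions.

```shell
# Hypothetical helper: print every adjacent pair of fields
# of the form 10ddd 20ddd, wherever it occurs on the line.
extract_groups() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^10[0-9][0-9][0-9]$/ && $(i+1) ~ /^20[0-9][0-9][0-9]$/)
        print $i, $(i+1)
  }' "$@"
}
```

It is called the same way as the commands above, e.g. extract_groups text_file.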