grep regex: how to get only results with one preceding word? - regex

My string is:
www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
I am trying to get only the results with exactly "one" word before texas.com. Expected output of the regex grep:
mail.texas.com
www2.texas.com
So mail and www2 are the "one" word that I'm talking about. I tried:
grep "*.texas.com", but I get all of them in the results. Can someone please help?

You can use
grep '^[^.]*\.texas\.com'
Details:
^ - start of string
[^.]* - zero or more chars other than a . char
\.texas\.com - .texas.com string (literal . char must be escaped in the regex pattern).
See the online demo:
#!/bin/bash
s='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com'
grep '^[^.]*\.texas\.com' <<< "$s"
Output:
mail.texas.com
www2.texas.com
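The pattern above is anchored only at the start of the line, so a line with extra text after texas.com would still match. A minimal sketch of a stricter variant, assuming a whole-line match is what is wanted: -x makes grep match the entire line, and [^.]+ (with -E) requires at least one character before the dot. The last sample line is hypothetical and not part of the question.

```shell
# Question samples plus one hypothetical line with extra text after texas.com.
hosts='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
mail.texas.com.example.org'

# -E: extended regex, -x: the whole line must match the pattern.
one_label=$(printf '%s\n' "$hosts" | grep -Ex '[^.]+\.texas\.com')
printf '%s\n' "$one_label"
```

With these samples, only mail.texas.com and www2.texas.com should remain.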

With awk:
awk 'BEGIN{FS=OFS="."} /\.texas\.com$/ && NF==3' file
Output:
mail.texas.com
www2.texas.com
Set one dot as input and output field separator, check for texas.com at the end ($) of your line and check for three fields.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

With your shown samples, please try following awk code.
awk -F'.' 'NF==3 && $2=="texas" && $3=="com"' Input_file
Explanation: -F'.' sets the field separator to a literal dot for every line. The main program then checks three conditions: NF==3 (the current line has exactly three fields), the 2nd field is texas, and the 3rd field is com. When all three conditions are met, awk prints the line.
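Because the fields are compared as exact strings, near-miss hostnames are rejected without any regex escaping. A small sketch of that behaviour, using the question's samples plus one hypothetical line (mail.xtexas.com) that is not in the original data:

```shell
# Question samples plus one hypothetical near-miss line.
hosts='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
mail.xtexas.com'

# Exact string comparison on fields 2 and 3; NF==3 allows one label only.
by_fields=$(printf '%s\n' "$hosts" | awk -F'.' 'NF==3 && $2=="texas" && $3=="com"')
printf '%s\n' "$by_fields"
```

The near-miss line has three fields, but its second field is xtexas, so it is not printed.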

Related

How to check last 3 chars of a string are alphabets or not using awk?

I want to check whether the last 3 characters in column 1 are letters and print those rows. What am I doing wrong?
My code:
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output :
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script:
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're trying to use shell variable syntax $str instead of awk str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
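For comparison, here is a sketch of how the original substr-based attempt might look with those four points addressed. This is one possible repair under that reading, not the only one; the data is the question's sample file passed in on standard input.

```shell
data='12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731'

tail_letters=$(printf '%s\n' "$data" | awk -F'|' '
{
    # Last three characters of column 1, taken once per input line.
    str = substr($1, length($1) - 2)
    # Plain awk variable (no $), anchored at both ends, using a character class.
    if (str ~ /^[[:alpha:]][[:alpha:]][[:alpha:]]$/) print
}')
printf '%s\n' "$tail_letters"
```

The test runs in the main block rather than in END, so every line is checked, and the anchors make sure all three trailing characters are letters.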
Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the string.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
sed -n '/^[^|]*[[:alpha:]][[:alpha:]][[:alpha:]]|/p' file
grep '^[^|]*[[:alpha:]][[:alpha:]][[:alpha:]]|' file
{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

How to match and cut the string with different conditions using sed?

I want to extract the string that comes after WORK=, and skip it if parentheses come after that string.
The text looks like this:
//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU
So, the desired output should contain only:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
So far, I could only match and cut what comes before WORK=, but could not remove WORK= itself:
sed -E 's/(.*)(WORK=.*)/\2/'
I am not sure how to continue. Can anyone help, please?
You can use
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' file > newfile
Details:
-n - suppresses the default line output
/WORK=.*([^()]*)/! - if a line contains WORK= followed by any text and then a (...) substring, the line is skipped
s/.*WORK=\([^,]*\).*/\1/p - otherwise, removes everything up to and including WORK=, captures into Group 1 any zero or more chars other than a comma, and removes the rest of the line; p prints the result.
See the sed demo:
s='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU'
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' <<< "$s"
Output:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
Could you please try the following awk, written and tested with the shown samples in GNU awk.
awk '
match($0,/WORK=[^,]*/){
val=substr($0,RSTART+5,RLENGTH-5)
if(val!~/\([a-zA-Z]+\)/){ print val }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/WORK=[^,]*/){ ##Using the match function to match WORK= and everything up to the next comma.
val=substr($0,RSTART+5,RLENGTH-5) ##Storing in val the matched text without the leading WORK=.
if(val!~/\([a-zA-Z]+\)/){ print val } ##Printing val only if it does not contain ( letters ).
}
' Input_file ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -n '/.*WORK=\([^,]\+\).*/{s//\1/;/(.*)/!p}' file
Extract the string following WORK= and if that string does not contain (...) print it.
This will work if there is only zero or one occurrence of WORK= and that the exclusion depends only on the (...) occurring within that string and not other following fields.
For a global solution with the same stipulations for parens:
sed -n '/WORK=\([^,]\+\)/{s//\n\1\n/;s/[^\n]*\n//;/(.*).*\n/!P;D}' file
N.B. This prints each such string on a separate line and excludes empty strings.
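The stipulation about where the (...) occurs can be seen with a line whose parentheses sit in a later field. The sketch below adapts the first, single-occurrence command of this answer; [^,][^,]* replaces the GNU-only \+ as an assumption about portability, and the last input line is hypothetical rather than taken from the question.

```shell
# Question samples plus one hypothetical line with parentheses in a later field.
lines='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU
//X1 WORK=AB.CD,NOTE=(Z)'

# s//\1/ reuses the address regex and keeps only the value after WORK=.
# /(.*)/!p then prints the value unless it still contains parentheses.
work_vals=$(printf '%s\n' "$lines" | sed -n '/.*WORK=\([^,][^,]*\).*/{s//\1/;/(.*)/!p;}')
printf '%s\n' "$work_vals"
```

Here AB.CD is printed as well, because the parentheses test runs only on the extracted value and not on the rest of the line.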

Filter (or 'cut') out column that begins with 'OS=abc'

My .fasta file consists of this repeating pattern.
>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|Q00812|TRHBN_NOSCO Group 1 truncated hemoglobin GlbN OS=Nostoc commune OX=1178 GN=glbN PE=3 SV=1
asdfadfasdfaasdfasdfasdfasd
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa
I want to keep only the lines that contain '>' and then extract the string after 'OS=' and before 'OX=' (for example, line 1 gives Ctenodactylus gundi).
The first part ('>') is easy enough:
grep '>' my.fasta | cut -d " " -f 3 >> species.txt
The problem is that the number of fields is not constant BEFORE 'OS='.
But the number of column/fields between 'OS=' and 'OX=' is 2.
You can use the -P option to enable PCRE-based regex matching, and use lookaround patterns to ensure that the match is enclosed between OS= and OX=:
grep '>' my.fasta | grep -oP '(?<=OS=).*(?= OX=)'
The space inside the lookahead keeps the trailing space before OX= out of the match. Note that the -P option is available only in GNU grep, which may not be installed by default in some environments.
IMHO awk is a better fit here, since it can handle both the regex match and the conditional printing in one command. Could you please try the following.
awk '/^>/ && match($0,/OS=.* OX=/){print substr($0,RSTART+3,RLENGTH-7)}' Input_file
Output will be as follows.
Ctenodactylus gundi
Nostoc commune
Gallus gallus
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
/^>/ && match($0,/OS=.* OX=/){ ##Checking that the line starts with > AND matches the regex OS=.* OX=, i.e. everything from OS= through the space and OX=.
print substr($0,RSTART+3,RLENGTH-7) ##Then printing the substring that starts at RSTART+3 (just past OS=) and is RLENGTH-7 characters long (the match minus OS= and the trailing " OX=").
}
' Input_file ##Mentioning Input_file name here.
Using any awk in any shell on every UNIX box:
$ awk -F' O[SX]=' '/^>/{print $2}' file
Ctenodactylus gundi
Nostoc commune
Gallus gallus
sed solution:
$ sed -nE '/>/ s/^.*OS=(.*) OX=.*$/\1/p' .fasta
Ctenodactylus gundi
Nostoc commune
Gallus gallus
-n so that the pattern space is not printed unless requested; -E (extended regular expressions) so that we can use subexpressions and backreferences. The p flag to the s command means "print the pattern space".
The regular expression is supposed to match the entire line, singling out in a subexpression the fragment we must extract. I assumed OX is preceded by exactly one space, which must not appear in the output; that can be adjusted if/as needed.
This assumes that all lines that begin with > will have an OS= ... fragment immediately followed by an OX= ... fragment; if not, that can be added to the />/ filter before the s command. (By the way - can there be some OT= ... fragment between OS=... and OX= ...?)
Question though - wouldn't you rather include some identifier (perhaps part of the "label" at the beginning of each line) for each line of output? You have the fragments you requested - but do you know where each one of them comes from?
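A minimal sketch of that idea, assuming the identifier wanted is the second pipe-delimited field of the header (for example P20855); that choice is an assumption for illustration, not something the question specifies. The input is a shortened copy of the question's sample.

```shell
fasta='>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa'

labelled=$(printf '%s\n' "$fasta" | awk -F'|' '
/^>/ {
    species = $0
    sub(/^.* OS=/, "", species)   # drop everything through " OS="
    sub(/ OX=.*$/, "", species)   # drop " OX=" and everything after it
    print $2 "\t" species         # $2 is the second pipe-delimited field
}')
printf '%s\n' "$labelled"
```

Each output line then carries the identifier and the species name separated by a tab.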

Search for Pattern in Text String, then Extract Matched Pattern

I am trying to match and then extract a pattern from a text string. I need to extract any pattern that matches the following in the text string:
10289 20244
Text File:
KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009
I am trying to achieve this using the following bash code:
Bash Code:
cat text_file | grep -Eow '\s10[0-9].*\s' | head -n 4 | awk '{print $1}'
The above code attempts to search for any group of approximately five numeric characters that begin with 10 followed by three numeric characters. After matching this pattern, the code prints out the rest of text string, capturing the second group of five numeric characters, beginning with 20.
I need a better, more reliable way to accomplish this because currently, this code fails. The numeric groups I need are separated by a space. I have attempted to account for this by inserting \s into the grep portion of the code.
grep solution:
grep -Eow '10[0-9]{3}\b.*\b20[0-9]{3}' text_file
The output:
10289 20244
[0-9]{3} - matches 3 digits
\b - word boundary
awk '{print $(NF-2),$(NF-1)}' text_file
10289 20244
Prints the third-to-last and second-to-last fields of each line.
awk '$17 ~ /^10[0-9]{3}$/ && $18 ~ /^20[0-9]{3}$/ { print $17, $18 }' text_file
This will check field 17 for "10xxx" and field 18 for "20xxx", and when BOTH match, print them.
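Both awk commands above depend on the two groups sitting at a known position in the line. If that position can vary, one option is to scan the fields for the pair; the following is a rough sketch under that assumption, using the line from the question.

```shell
report='KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009'

pair=$(printf '%s\n' "$report" | awk '{
    for (i = 1; i < NF; i++) {
        # Look for a 10xxx field immediately followed by a 20xxx field.
        if ($i ~ /^10[0-9][0-9][0-9]$/ && $(i + 1) ~ /^20[0-9][0-9][0-9]$/) {
            print $i, $(i + 1)
            break
        }
    }
}')
printf '%s\n' "$pair"
```

The digits are spelled out as [0-9][0-9][0-9] rather than [0-9]{3} so the sketch does not depend on interval-expression support in the awk being used.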

Grep Regex: List all lines except

I'm trying to automagically remove all lines from a text file that contains a letter "T" that is not immediately followed by a "H". I've been using grep and sending the output to another file, but I can't come up with the magic regex that will help me do this.
I don't mind using awk, sed, or some other linux tool if grep isn't the right tool to be using.
That should do it:
grep -Ev 'T([^H]|$)'
-v : print lines not matching
-E : use extended regular expressions
T([^H]|$) : matches a T followed by any character but H, or a T at the end of the line
You can do:
grep -Ev 'T([^H]|$)' input
-v is grep's inverse-match option: it does not list the lines that match the pattern.
The regex used is T([^H]|$) (with -E for extended syntax), which matches any line that has a T followed by any character other than an H, or a T at the very end of the line.
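One edge case worth checking is a T at the very end of a line: a pattern such as T[^H] needs some character after the T, so it cannot match there. The sketch below compares the two patterns on hypothetical sample lines that are not from the question.

```shell
sample='THIS THING
TOP
CAT
plain
THAT'

# T[^H] alone misses a T that is the last character of the line.
loose=$(printf '%s\n' "$sample" | grep -v 'T[^H]')
# Adding the end-of-line alternative also removes lines that end in T.
strict=$(printf '%s\n' "$sample" | grep -Ev 'T([^H]|$)')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

With these samples, the loose form keeps CAT and THAT, while the strict form keeps only THIS THING and plain.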
To read lines from a file while excluding empty lines and lines starting with #:
grep -v '^$\|^#' folderlist.txt
folderlist.txt
# This is list of folders
folder1/test
folder2
# This is comment
folder3
folder4/backup
folder5/backup
Results will be:
folder1/test
folder2
folder3
folder4/backup
folder5/backup
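The \| alternation inside a basic regular expression is a GNU extension. A sketch of the same filter written with -E, where | is standard alternation; the sample list is shortened from the one above, with an empty line added for illustration.

```shell
list='# This is list of folders
folder1/test
folder2

# This is comment
folder3'

# Drop lines that are empty or that start with #.
kept=$(printf '%s\n' "$list" | grep -Ev '^(#|$)')
printf '%s\n' "$kept"
```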
Adding 2 awk solutions to the mix here.
1st solution (simpler): works with any version of awk.
awk '!/T/ || /TH/' Input_file
It checks 2 conditions:
If a line doesn't contain T, OR
If a line contains TH.
If either condition is TRUE, the line is printed. Note that a line containing TH is kept even when it also has another T that is not followed by H.
2nd solution (GNU awk specific): uses the match function with the regex (T)(.|$) and GNU awk's ability to store the capture groups in an array.
awk '
!/T/{
print
next
}
match($0,/(T)(.|$)/,arr) && arr[1]=="T" && arr[2]=="H"
' Input_file
Explanation: First, if a line doesn't contain T, print it and move on to the next line. Otherwise, use awk's match function to match the first T followed by any character OR the end of the line. The two capture groups are stored in the array arr, so if its 1st element is T and its 2nd element is H, the line is printed.
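If the intent is that every T in a line must be followed by H, a single negated regex may be closer to it than either solution. A minimal sketch on hypothetical sample lines (not from the question), shown next to the 1st solution for comparison:

```shell
sample='THIS THING
THE TOP
CAT
plain'

# 1st solution: keeps any line that has no T or contains TH somewhere.
loose=$(printf '%s\n' "$sample" | awk '!/T/ || /TH/')
# Stricter sketch: drop a line if any T is followed by a non-H or ends the line.
strict=$(printf '%s\n' "$sample" | awk '!/T([^H]|$)/')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

Here THE TOP survives the 1st solution because it contains TH, but the stricter pattern drops it because of the T in TOP.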