Incorporate egrep regexps with awk? - regex

I've been trying to understand how awk can work with egrep regular expressions.
I have the following example:
John,Milanos
Anne,Silverwood
Tina,Fastman
Adrian,Thomassonn
I'm looking to use egrep regexps to process the second column (the last names in this scenario) while printing the entire line for output.
The closest I've come to an answer was using:
$ awk -F ',' '{print $2}' file | egrep '([a-z])\1.*([a-z])\2'
Thomassonn
I would then take "Thomassonn" and egrep back into my initial list of full names to get the full record. However, I've encountered plenty of errors and false positives once I used other filters.
Desired result:
Adrian,Thomassonn

awk does not support back-references within a regex. egrep, however, is sufficient to achieve your desired result:
$ egrep ',.*([a-z])\1.*([a-z])\2' file
Adrian,Thomassonn
Variations
If there are three or more columns and you want to search only the second:
egrep '^[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2' file
If you want to search the third column:
egrep '^[^,]*,[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2' file
If you want to search the first of any number of columns:
egrep '^[^,]*([a-z])\1[^,]*([a-z])\2' file
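The column variations above all follow one pattern: every column skipped before the target adds one [^,]*, prefix. A minimal sketch of that idea follows, assuming GNU grep (back-references in grep -E are an extension rather than POSIX). The helper name col_regex is hypothetical and not part of the answer.

```shell
# Hypothetical helper: print an ERE that applies the "two doubled
# lowercase letters" test to comma-separated column number $1.
col_regex() {
    n=$1
    prefix='^'
    i=1
    # Skip one whole column ("[^,]*,") for every column before the target.
    while [ "$i" -lt "$n" ]; do
        prefix="${prefix}[^,]*,"
        i=$((i + 1))
    done
    # "\\1" in the printf format prints a literal \1 back-reference.
    printf '%s[^,]*([a-z])\\1[^,]*([a-z])\\2' "$prefix"
}

# Sample data from the question, searched in column 2.
matches=$(printf '%s\n' 'John,Milanos' 'Anne,Silverwood' 'Tina,Fastman' 'Adrian,Thomassonn' |
    grep -E "$(col_regex 2)")
echo "$matches"
```

Calling col_regex 1 or col_regex 3 should reproduce the first-column and third-column patterns shown above.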

awk doesn't support backreferences. Here's one way to do what you want instead:
$ cat tst.awk
BEGIN{ FS="," }
{
    numMatches = 0
    fld = $2
    for (charNr=1; charNr <= length($2); charNr++) {
        char = substr($2,charNr,1)
        if (char ~ /[a-z]/)
            numMatches += gsub(char"{2}"," ",fld)
    }
}
numMatches >= 2
$
$ awk -f tst.awk file
Adrian,Thomassonn
If you want to match sequences of 3 or any other number of repeated chars, just change {2} to {3} or whatever number you like.
By the way, for portability to all locales you should use [[:lower:]] instead of [a-z] if that's what you really mean.

Related

Print everything before relevant symbol and keep 1 character after relevant symbol

I'm trying to find a one-liner to print everything before the relevant symbol and keep just 1 character after it:
Input:
thisis#atest
thisisjust#anothertest
just#testing
Desired output:
thisis#a
thisisjust#a
just#t
awk -F"#" '{print $1 "#" }' will almost give me what I want but I need to find a way to print the second character as well. Any ideas?
You can substitute what's after the first character after # with nothing with sed:
sed 's/\(#.\).*/\1/'
You could use grep:
$ grep -o '[^#]*#.' infile
thisis#a
thisisjust#a
just#t
This matches a sequence of characters other than #, followed by # and any character. The -o option retains only the match itself.
With the special RT variable in GNU's awk, you can do:
awk 'BEGIN{RS="#.|\n"}RT!="\n"{print $0 RT}'
Get the index of the '#', then pull out the substring.
$ awk '{print substr($0,1,index($0,"#")+1);}' in.txt
thisis#a
thisisjust#a
just#t
1st Solution: Could you please try following.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH)}' Input_file
The above prints only the lines that contain #, trimmed as you asked, and skips lines that do not. If you also want to print the lines without # unchanged, use the following:
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH);next} 1' Input_file
2nd solution:
awk 'BEGIN{FS=OFS="#"} {print $1,substr($2,1,1)}' Input_file
Some small variations of the 2nd solution above:
awk -F# '{print $1"#"substr($2,1,1)}' file
awk -F# '{print $1FS substr($2,1,1)}' file
Another grep variation (shortest posted so far):
grep -oP '.+?#.' file
o print only matching
P Perl regex (due to +?)
. any character
+ and more
? but stop with:
#
. plus one more character
If we do not add the ?, the line test#one#two becomes test#one#t instead of test#o, due to the greedy +
If you want to use awk, the cleanest way to do this is with index, which finds the position of a character:
awk 'n=index($0,"#") { print substr($0,1,n+1) }' file
There are, however, shorter and more dedicated tools for this. See the other answers.
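The answers in this thread behave differently when a line contains more than one #. A small sketch of the difference, using the test#one#two line mentioned earlier; the variable names are only for illustration:

```shell
line='test#one#two'

# grep -o prints every non-overlapping match, so a second "#" yields a second line.
by_grep=$(printf '%s\n' "$line" | grep -o '[^#]*#.')

# sed and awk act once per line and stop at the first "#".
by_sed=$(printf '%s\n' "$line" | sed 's/\(#.\).*/\1/')
by_awk=$(printf '%s\n' "$line" | awk 'n = index($0, "#") { print substr($0, 1, n + 1) }')

printf 'grep:\n%s\n' "$by_grep"
printf 'sed: %s\nawk: %s\n' "$by_sed" "$by_awk"
```

If the input can contain several # characters per line, the sed or awk forms are probably the safer choice, since they produce exactly one output line per input line.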

How to display words as per given number of letters?

I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required (or it still needs some more logic).
Here, it should print only 2 letter words but with this it is giving different output
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command is just searching for lines that consist of exactly the character 2.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n="$var" 'length() == n' file
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting of exactly the number 2, nothing else. Of course this does not return anything, since /usr/share/dict/words has words and not numbers (as far as I know).
If you want to print those lines consisting of two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use a quantifier {} like (note the usage of \ to have sed's BRE understand properly):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that the variable is placed outside the single quotes, inside its own double quotes, for safety (thanks Ed Morton in comments for the reminder).
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
    [[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex of the form ^..$ (the number of dots is variable), in 2 steps:
create a string of the desired length with printf. Given no arguments, a format such as %2s prints only the padding, i.e. 2 spaces
but we have a variable var, therefore %${var}s
replace all spaces in the string with .
But don't use this solution. It is too slow, and there are better utilities for this; the best is, IMHO, grep.
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
Try awk-
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
GNU awk and mawk at least (due to empty FS):
$ awk -F '' 'NF==2' /usr/share/dict/words #| head -5
aa
Ab
ad
ae
Ah
Empty FS separates each character on its own field so NF tells the record length.
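To compare the awk, sed and grep approaches from this thread without depending on /usr/share/dict/words, here is a small sketch that runs them on an inline word list. The list and the variable names are made up for illustration.

```shell
# Hypothetical sample list standing in for /usr/share/dict/words.
words='aa
cat
Ab
horse
x'
var=2

# awk: compare the record length with the value passed in via -v.
by_awk=$(printf '%s\n' "$words" | awk -v n="$var" 'length($0) == n')

# sed (BRE): the variable sits between the two single-quoted parts.
by_sed=$(printf '%s\n' "$words" | sed -n '/^.\{'"$var"'\}$/p')

# grep (ERE): the braces need no backslashes; \$ keeps the anchor literal.
by_grep=$(printf '%s\n' "$words" | grep -E "^.{$var}\$")

echo "$by_awk"
```

All three should select the same lines here; they start to differ only when a line holds characters that are not letters, since . and length() count any character.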

How do I grep filter a column by word count?

I'm trying to design a grep filter that selects lines with 2 or fewer words. Oddly enough, I'm coming up blank when searching for an answer.
Something like:
cat someFile.txt | grep count(\w) < 3
Does this functionality even exist?
With grep, you could match on a pattern that matches exactly 1 or 2 words:
grep -E '^\w+(\s+\w+)?$' someFile.txt
(Note that this assumes you either don't have any blank lines, or don't want to select those anyway.)
With awk you could just use the number of fields condition:
awk 'NF < 3' someFile.txt
Just use awk instead of grep for this like this:
awk 'NF < 3' file
NF stands for number of fields.
Grep
grep -E '^$|^\S+(\s+\S+)?$' file
\S is non-space character;
? makes the preceding pattern optional (repeating zero or one times).
| is the alternation operator (the result is true, if either of the patterns match);
^$ matches empty line;
The same pattern will work with -P option (Perl-compatible regular expressions) as well.
GNU Sed:
sed -nr '/^$|^\S+(\s+\S+)?$/ p' file
where
p is a command that prints the current pattern space (the current line, in particular), if the preceding pattern matches the line;
-n turns off automatic printing of the pattern space.
The pattern is the same as for the grep command above.
Perl
perl -C -F'/\s+/' -ane 'print if scalar @F < 3' < file
where
-C enables Unicode support;
-F specifies pattern for -a switch (autosplit mode that splits the input into @F array);
-n causes the script specified by -e to run for each line from the input;
scalar @F returns the number of items in @F, i.e. the number of fields.
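The awk and grep answers do not select exactly the same lines. A rough sketch of the difference on made-up input, assuming GNU grep (\w and \s are GNU extensions):

```shell
# Hypothetical sample input: one blank line and one line with an apostrophe.
sample=$(printf '%s\n' 'one' '' 'one two' 'one two three' "don't stop")

# NF < 3 counts whitespace-separated fields, so it also accepts the blank line (NF is 0).
awk_count=$(printf '%s\n' "$sample" | awk 'NF < 3 { c++ } END { print c + 0 }')

# \w+ only accepts letters, digits and underscore, so "don't" is rejected,
# and so is the blank line.
grep_count=$(printf '%s\n' "$sample" | grep -c -E '^\w+(\s+\w+)?$')

echo "awk: $awk_count, grep: $grep_count"
```

Which behavior is wanted depends on what counts as a word; the \S-based pattern above sits closer to the awk result.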

Using regex to match a pattern in the middle of a string with awk, sed, grep ... something linux-y

I have a file with ID numbers and a bunch of patterns that represent gene trees
ex:
021557 (sfra,(pdep,snud),((spal,sint),(sdro,(hpul,(sprp,afra)))));
005852 (snud,sfra,(pdep,(hpul,((afra,sprp),(sint,(spal,sdro))))));
023685 (sfra,snud,(pdep,(hpul,((sprp,(afra,spal)),(sdro,sint)))));
022020 (sfra,snud,(pdep,(hpul,(afra,(sprp,(sdro,(sint,spal)))))));
028284 (sfra,snud,(pdep,(hpul,(sprp,((sdro,sint),(spal,afra))))));
I am interested in a certain sister taxon grouping of (spal,afra). I want to print the IDs from another column if the tree contains (spal,afra).
Output if it was only run on the data above should be:
023685
028284
I was going to do something like:
awk '{if ($2 == "(spal,afra)") { print $1 } }'
but I realize that the part that I'm trying to match is within a bunch of other characters, and at no predictable location...
So I need to search for
any number of lowercase letters or parentheses or commas
(spal,afra)
any number of lowercase letters or parentheses or commas or ;
Also, I guess I want to know of occurrences in the other order (afra,spal). But I was going to run separate matches, combine the output and do something with sort and uniq -c if I remember right... I can probably figure that out by myself later.
I'm kind of new to this and I've already spent a couple of hours trying to figure something out. Thank you!
You seem to have this as an input file
$ cat file
021557 (sfra,(pdep,snud),((spal,sint),(sdro,(hpul,(sprp,afra)))));
005852 (snud,sfra,(pdep,(hpul,((afra,sprp),(sint,(spal,sdro))))));
023685 (sfra,snud,(pdep,(hpul,((sprp,(afra,spal)),(sdro,sint)))));
022020 (sfra,snud,(pdep,(hpul,(afra,(sprp,(sdro,(sint,spal)))))));
028284 (sfra,snud,(pdep,(hpul,(sprp,((sdro,sint),(spal,afra))))));
Using awk
To print the first column for any line that contains (spal,afra):
$ awk '/[(]spal,afra[)]/{print $1}' file
028284
The condition /[(]spal,afra[)]/ selects lines that contain (spal,afra) and print $1 prints the first field on those lines.
In awk regular expressions, parens are active characters. Since we want to match literal parens, we put them in square brackets like [(] and [)].
Using sed
$ sed -n '/(spal,afra)/ s/\t.*//p' file
028284
sed -n will not print anything unless we explicitly ask it to. /(spal,afra)/ selects lines containing (spal,afra). s/\t.*//p removes everything after the first tab and then prints what remains.
By default, sed uses basic regular expressions. This means that ( and ) are not active. Consequently, we do not need to escape them.
Using grep and cut
$ grep '(spal,afra)' file | cut -f1
028284
grep '(spal,afra)' file selects lines that contain (spal,afra) and cut -f1 selects the first field from those lines.
Like sed, grep defaults to using basic regular expressions. This means that ( and ) are both treated as literal characters and there is no need to escape them.
Alternative: Looking for either (spal,afra) or (afra,spal)
If we want to look for (afra,spal) in addition to (spal,afra), then we need to update the regular expressions. Taking awk for example:
$ awk '/[(](spal,afra|afra,spal)[)]/{print $1}' file
023685
028284
Here, the vertical bar, |, separates choices. The regex accepts either what is before or after the bar.
You can use this non-regex search in awk:
awk 'index($0, "(spal,afra)") || index($0, "(afra,spal)") {print $1}' file
023685
028284
This should work (sed with extended regex):
sed -nr 's/([^[:space:]]*)[^;]*(\(spal,afra\)|\(afra,spal\)).*/\1/p' file
Output:
023685
028284
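Since the search strings are fixed text, another way to sidestep the paren-escaping question is grep -F, which treats each pattern as a literal string. This is a sketch rather than one of the posted answers; it uses awk to pick the ID so that it works whether the columns are separated by spaces or tabs.

```shell
# The five sample trees from the question.
trees=$(printf '%s\n' \
    '021557 (sfra,(pdep,snud),((spal,sint),(sdro,(hpul,(sprp,afra)))));' \
    '005852 (snud,sfra,(pdep,(hpul,((afra,sprp),(sint,(spal,sdro))))));' \
    '023685 (sfra,snud,(pdep,(hpul,((sprp,(afra,spal)),(sdro,sint)))));' \
    '022020 (sfra,snud,(pdep,(hpul,(afra,(sprp,(sdro,(sint,spal)))))));' \
    '028284 (sfra,snud,(pdep,(hpul,(sprp,((sdro,sint),(spal,afra))))));')

# -F: patterns are literal strings; each -e adds one alternative.
ids=$(printf '%s\n' "$trees" |
    grep -F -e '(spal,afra)' -e '(afra,spal)' |
    awk '{ print $1 }')
echo "$ids"
```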

Using awk to grab only numbers from a string

Background:
I have a column that should get user input in form of "Description text ref12345678". I have existing scripts that grab the reference number but unfortunately some users add it incorrectly so instead of "ref12345678" it can be "ref 12345678", "RF12345678", "abcd12345678" or any variation. Naturally the wrong formatting breaks some of the triggered scripts.
For now I can't control the user input to this field, so I want to make the scripts later in the pipeline just to get the number.
At the moment I'm stripping the letters with awk '{gsub(/[[:alpha:]]/, "")}; 1', but substitution seems like an inefficient solution. (I know I can do this also with sed -n 's/.*[a-zA-Z]//p' and tr -d '[[:alpha:]]' but they are essentially the same and I want awk for additional programmability).
The question is, is there a way to set awk to either print only numbers from a string, or set delimits to numeric items in a string? (or is substitution really the most efficient solution for this problem).
So in summary: how do I use awk for $ echo "ref12345678" to print only "12345678" without substitution?
if awk is not a must:
grep -o '[0-9]\+'
example:
kent$ echo "ref12345678"|grep -o '[0-9]\+'
12345678
with awk for your example:
kent$ echo "ref12345678"|awk -F'[^0-9]*' '$0=$2'
12345678
You can also try the following with awk, assuming the string starts with the number:
awk '{print ($0+0)}'
This forces awk to treat the entire string as a number. awk's string-to-number conversion uses only the leading numeric part of the string and ignores the rest, so a string that starts with letters, such as "ref12345678", yields 0. Thus for example:
echo "19 trees"|awk '{print ($0+0)}'
will produce:
19
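A short sketch of where the numeric conversion works and where it does not, followed by a match()-based alternative that extracts the first run of digits without any substitution. The variable names are only for illustration.

```shell
# A leading number survives the numeric conversion...
leading=$(echo "19 trees" | awk '{ print ($0 + 0) }')

# ...but a string that starts with letters converts to 0.
trailing=$(echo "ref12345678" | awk '{ print ($0 + 0) }')

# match() sets RSTART and RLENGTH for the first run of digits;
# substr() then copies it out, so nothing is substituted.
digits=$(echo "ref 12345678" | awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }')

echo "$leading $trailing $digits"
```

The match() form also copes with the "ref 12345678" and "RF12345678" variants from the question, since it ignores whatever precedes the digits.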
In AWK you can specify multiple conditions like:
($3~/[[:digit:]]+/ && $3 !~/[[:alpha:]]/ && $3 !~/[[:punct:]]/ ) {print $3}
This prints $3 only when it contains digits and no letters or punctuation.
The !~ operator means "does not match".
grep works perfectly :
$ echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'
300
9
1.1
Step by step explanation:
-E
Use extended regex.
-o
Return only the matches, not the context
[+-]?[0-9]+([.][0-9]+)?
Match numbers which are identified as:
[+-]?
An optional leading sign
[0-9]+
One or more digits
([.][0-9]+)?
An optional period followed by one or more digits.
It is convenient to put the output in an array:
arr=($(echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'))
and then use it like this
Tin=${arr[0]}
maxl=${arr[1]}
etc..
Another option (assuming GNU awk) involves specifying a non-numeric regular expression as a separator
awk -F '[^0-9]+' '{OFS=" "; for(i=1; i<=NF; ++i) if ($i != "") print($i)}'