delimiter inside regular expression Awk - regex

I have a term like aa-and-bb in the 10th column of a tab-delimited file, file.tsv.
I can get aa-and-bb as
cat file.tsv | awk 'BEGIN{FS="\t"};{print $10}'
How do I further get aa from aa-and-bb?

You can use split().
split( $10, arr, "-" ); print arr[ 1 ];
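Put together as a full command, the split() fragment above might look like the sketch below. The sample row is made up for illustration (nine filler columns followed by aa-and-bb); it stands in for a line of file.tsv.

```shell
# Hypothetical sample row: ten tab-separated columns, the 10th being aa-and-bb.
row=$(printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\taa-and-bb')

# Split column 10 on "-" into arr and keep the first piece.
first=$(printf '%s\n' "$row" | awk -F'\t' '{ split($10, arr, "-"); print arr[1] }')
echo "$first"   # prints: aa
```

On a real file the same awk program would be run as awk -F'\t' '...' file.tsv.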

If you can guarantee no other -s in fields 1-9, you can add - as a separator:
awk -F'\t|-' '{print $10}'

I am guessing that all three terms, aa, and, and bb are variable, and you want only the first term.
cat file.tsv | awk 'BEGIN{FS="\t"};{print $10}' | sed 's/-.*$//'

$ awk -F'\t' '{sub(/-.*$/, "", $10);print $10}' file.tsv
aa
But it is not 100% clear how your data looks, so we are just guessing here that you want to split on the dash.

Related

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is the most elegant and simple method to achieve this: sed, awk, or just Unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
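A quick way to check both cases from the question is to wrap the cut call in a small function. keep_two_cut is a hypothetical helper name, not something from the original answer. By default cut passes lines that contain no delimiter through unchanged, which is what makes the single-word case work.

```shell
# Keep at most the first two dash-separated fields of the argument.
keep_two_cut() {
    printf '%s\n' "$1" | cut -d'-' -f1,2
}

keep_two_cut 'After-u-math-how-however'   # prints: After-u
keep_two_cut 'After'                      # prints: After
```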
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
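As a rough sketch of what those portable variants could look like (the function names are hypothetical, and this is one possible spelling rather than the answerer's own): the sed version uses \( \) for grouping and [^-][^-]* in place of [^-]+, and the awk version uses match() with substr().

```shell
# POSIX sed: basic regular expressions, \( \) grouping, no + operator.
keep_two_sed() {
    printf '%s\n' "$1" | sed 's/\([^-][^-]*-[^-]*\).*/\1/'
}

# POSIX awk: match() sets RSTART and RLENGTH; substr() cuts out the match.
# Lines without a dash do not match and are printed unchanged by the final 1.
keep_two_awk() {
    printf '%s\n' "$1" |
        awk 'match($0, /[^-]+-[^-]*/) { $0 = substr($0, RSTART, RLENGTH) } 1'
}

keep_two_sed 'After-u-math-how-however'   # prints: After-u
keep_two_awk 'After'                      # prints: After
```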
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
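For shells without arrays, a similar fork-free result can be sketched with POSIX parameter expansion. Note that this swaps in a different technique from the array slice shown above, and keep_two is a hypothetical helper name. It also avoids changing IFS for the rest of the session.

```shell
# Keep everything up to (not including) the second dash, using only
# built-in parameter expansion. Strings with fewer than two dashes
# are returned unchanged.
keep_two() {
    s=$1
    case $s in
        *-*-*)
            rest=${s#*-}       # drop the first field and its dash
            rest=${rest#*-}    # drop the second field and its dash
            printf '%s\n' "${s%-"$rest"}"   # strip "-<rest>" from the end
            ;;
        *)
            printf '%s\n' "$s"
            ;;
    esac
}

keep_two 'After-u-math-how-however'   # prints: After-u
keep_two 'After'                      # prints: After
```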
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'

AWK regex convert 3 letter word beginning with 'a' to uppercase

I have my regex expression to find 3 letter words beginning with "a"...
\b[aA][a-z]{2}\b
(seems to work, according to this! check it out: http://rubular.com/r/Jil0E4WZnW)
Now I need to know how to take that result and replace the lowercase word with the three letter word in uppercase.
Thanks!
call toupper function in awk:
echo "Abc" | awk '{print toupper($0)}'
gets you:
ABC
You can make use of Perl's uc($string) function.
You can do it with Sed like this:
echo 'Ass ass ant Ant' | sed -re 's/\ba[a-z]{2}\b/\U&/gI'
(with your example string)
Another way is to use tr:
echo "Abc" | tr 'a-z' 'A-Z'
This solution "cheats" because it uses a loop and sub instead of gsub, but it is in awk and it works.
echo "abc Ape baaa ab abcd ant" | awk '{for (i=1;i<=NF;i++) if (length($i)==3){sub(/[aA][a-z]{2}/,toupper($i),$i)};print}'
perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;' your_file
tested:
> echo "Abc ab Ab" | perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;'
ABC ab Ab
>
Taken from here
Here is the awk version:
awk '{for(i=1;i<=NF;i++)
if((length($i)==3) && $i~/[aA][a-zA-Z][a-zA-Z]/)
$i=toupper($i)
}1' your_file
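A variant of the same loop, offered as a sketch, anchors the pattern to the whole field so the separate length() test is not needed. It assumes, as in the question's regex, that the second and third letters are lowercase, and it spells out [a-z][a-z] rather than {2} for awks that lack interval expressions. upper_a3 is a hypothetical function name.

```shell
# Uppercase every whitespace-separated word that is exactly three letters
# long and starts with a or A; leave all other words alone.
upper_a3() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[aA][a-z][a-z]$/)
                $i = toupper($i)
        print
    }'
}

echo "abc Ape baaa ab abcd ant" | upper_a3   # prints: ABC APE baaa ab abcd ANT
```

Because the fields are assigned to, awk rebuilds the line with single spaces between words.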

Grep matches only of multiple separated strings

I have a file with lines containing this format:
fieldA=value1, fieldB=value2, fieldC=value3, fieldD=value4, fieldE=value5
I am interested in fieldA, fieldB, fieldD. However, fieldC may or may not be present, therefore I cannot use something like:
grep "field" * | awk -F"," '{print $1, $2, $4}'
My end goal is to have output like this, all in one line:
fieldA=value1, fieldB=value2, fieldD=value4
I tried using grep -E, but it outputs those fields in different lines, and the association between the fields breaks.
grep -o -E "field1_=\w*|field2_=\w*|field3_=\w*"
If you know the field names (A, B, D), grep and xargs can do the job (awk or sed could certainly do it too).
grep -Po "fieldA=[^,]*|fieldB=[^,]*|fieldD=[^,]*" file|xargs -n3
that gives you:
fieldA=value1 fieldB=value2 fieldD=value4
if you want the comma in output:
grep -Po "fieldA=[^,]*,|fieldB=[^,]*,|fieldD=[^,]*" file|xargs -n3
Is a sed solution acceptable?
sed 's/^\([^ ]* [^ ]*\).*\(fieldD=[^,]*\).*/\1 \2/' filename
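An awk take on the same problem, as a sketch: split each line on ", " and keep only the fields whose names are wanted, so it does not matter whether fieldC is present. pick_fields is a hypothetical name, and the sketch assumes the separator is always a comma followed by one space, as in the sample line.

```shell
# Print fieldA, fieldB and fieldD (when present) on one line,
# joined by ", " as in the desired output.
pick_fields() {
    awk -F', ' '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^field[ABD]=/)
                out = out (out == "" ? "" : ", ") $i
        if (out != "") print out
    }'
}

printf '%s\n' 'fieldA=value1, fieldB=value2, fieldC=value3, fieldD=value4, fieldE=value5' | pick_fields
# prints: fieldA=value1, fieldB=value2, fieldD=value4
```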

how to get sub-expression value of regExp in awk?

I am analyzing logs that contain information like the following:
y1e","email":"","money":"100","coi
I want to fetch the value of money. I used awk like this:
grep pay action.log | awk '/"money":"([0-9]+)"/' ,
How can I then get the value of the sub-expression ([0-9]+)?
If you have GNU AWK (gawk):
awk '/pay/ {match($0, /"money":"([0-9]+)"/, a); print substr($0, a[1, "start"], a[1, "length"])}' action.log
If not:
awk '/pay/ {match($0, /"money":"([0-9]+)"/); split(substr($0, RSTART, RLENGTH), a, /[":]/); print a[5]}' action.log
The result of either is 100. And there's no need for grep.
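Another POSIX-only possibility, sketched here as an alternative to the split() step: since the text around the digits has a fixed length, the offsets can be computed directly from RSTART and RLENGTH. get_money is a hypothetical name, and the /pay/ filter is left out because the sample line from the question does not contain that word.

```shell
# The match covers "money":"<digits>" including all four quotes.
# The prefix "money":" is 9 characters and the closing quote is 1,
# so the digits start at RSTART + 9 and are RLENGTH - 10 long.
get_money() {
    awk 'match($0, /"money":"[0-9]+"/) {
        print substr($0, RSTART + 9, RLENGTH - 10)
    }'
}

printf '%s\n' 'y1e","email":"","money":"100","coi' | get_money   # prints: 100
```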
Offered as an alternative, assuming the data format stays the same once the lines are grep'ed, this will extract the money field, not using a regular expression:
awk -v FS=\" '{print $9}' data.txt
assuming data.txt contains
y1e","email":"","money":"100","coin.log
yielding:
100
I.e., your field separator is set to " and you print out field 9
You need to reference group 1 of the regex
I'm not fluent in awk but here are some other relevant questions
awk extract multiple groups from each line
GNU awk: accessing captured groups in replacement text
Hope this helps
If money can appear at different positions, then it may not be a good idea to hard-code the positional parameter.
You can try something like this -
$ awk -v FS=[,:\"] '{ for (i=1;i<=NF;i++) if($i~/money/) print $(i+3)}' inputfile
grep pay action.log | awk -F "\n" 'm=gensub(/.*money":"([0-9]+)".*/, "\\1", "g", $1) {print m}'

Get a Specified Match Within a String

I'm trying to match the contents of a string that contains sequences of quotes using a shell script. So far, the furthest I got was this:
et="\"He\" \"llo\""
echo $et | sed -e '/\"(.*?)\"/g'
Which returns this:
"He" "llo"
But I don't want the quote marks to appear on the result, also how can I echo only the first, or the second, or the third, etc. match?
sed -e 's/"\([^"]*\)"/\1/g' will remove quotes around balanced " quotes. To only show the first, second match etc with sed you probably have to make different capture groups.
$ echo '"1" "2" "3"' | sed -e 's/"\([^"]*\)" "\([^"]*\)" "\([^"]*\)"/\2/g'
2
$
Provided that what is wanted is only the text between the first pair of quotes, here is a solution with perl:
echo $et | perl -ne '/"([^"]+)"/ and print "$1\n";'
This will also handle quotes within quotes if they are preceded by a backslash:
echo $et | perl -ne '/"([^"\\]+(\\.[^"]*)*)"/ and print "$1\n";'
This is much simpler with awk since you can specify the double-quote to be the field separator.
$ et='"He" "llo"'
$ awk -F'"' '{print $2}' <<<$et
He
$ awk -F'"' '{print $4}' <<<$et
llo
Note: This also scales: with " as the separator, the quoted strings land in the even-numbered fields, i.e. $2, $4, $6, etc.
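To pick the n-th quoted string without editing the program each time, the field number can be computed from a variable. This is a sketch; nth_quoted is a hypothetical helper name, and it assumes the quotes in the input are balanced and unescaped.

```shell
# Print the n-th double-quoted string of each input line.
# With " as the field separator, the n-th quoted string is field 2*n.
nth_quoted() {
    awk -F'"' -v n="$1" '{ print $(2 * n) }'
}

et='"He" "llo"'
printf '%s\n' "$et" | nth_quoted 1   # prints: He
printf '%s\n' "$et" | nth_quoted 2   # prints: llo
```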
You can also do something like this:
[srikanth@myhost ~]$ echo "\"He\" \"llo\"" | awk ' { match($0,/([A-Za-z]+)[" ]+([A-Za-z]+)/,a); print a[1]","a[2]} '
He,llo