awk with joined field - regex

I am trying to extract data from one file, based on another.
The substring from file1 serves as an index to find matches in file2.
Everything works when the string to be searched for in file2 is isolated or between spaces, but when it is joined to other fields awk cannot find it. Is there a way to have awk match any part of the strings in file2?
awk -vv1="$Var1" -vv2="$var2" '
NR==FNR {
if ($4==v1 && $5==v2) {
s=substr($0,4,8)
echo $s
a[s]++
}
next
}
!($1 in a) {
print
}' /tmp/file1 /tmp/file2
example that works:
file1:
1 554545352014-01-21 2014-01-21T16:18:01 FS 14001 1 1.10
1 554545362014-01-21 2014-01-21T16:18:08 FS 14002 1 5.50
file2:
55454535 11 17 102 850Sande Fiambre 1.000
55454536 11 17 17 238Pesc. Dourada 1.000
example that does not work:
file2:
5545453501/21/20142 1716:18 1 1 116:18
5545453601/21/20142 1716:18 1 1 216:18
The search string, for instance 55454535, finds a match in the working example but not in the second one.

You probably want to replace this:
!($1 in a) {
print
}
with this (or similar - your requirements are unclear):
{
found = 0
for (s in a) {
if ($1 ~ "^"s) {
found = 1
}
}
if (!found) {
print
}
}
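If the keys may contain regex metacharacters, a plain string prefix test with index() avoids treating them as a pattern. A minimal sketch, assuming the keys are already isolated one per line (the keys file and the temporary paths are hypothetical stand-ins for file1 and file2):

```shell
# Hypothetical sample data: one key, plus the "joined" lines from the question.
tmpdir=$(mktemp -d)
printf '%s\n' '55454535' > "$tmpdir/keys"
printf '%s\n' '5545453501/21/20142 1716:18 1 1 116:18' \
              '5545453601/21/20142 1716:18 1 1 216:18' > "$tmpdir/file2"

# Print lines of file2 whose first field does NOT start with any key.
result=$(awk 'NR==FNR { a[$1]; next }
{
  found = 0
  for (s in a)
    if (index($1, s) == 1) { found = 1; break }   # literal prefix test
  if (!found) print
}' "$tmpdir/keys" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the line starting with 55454536 is printed, because the other one begins with a stored key.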

Use a regex comparison (~) instead of ==, e.g.:
if ($4 ~ v1 && $5 ~ v2)
Prepend ^ to v1/v2 if the field must begin with the string, and append $ if it must end with it.
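To illustrate the difference anchoring makes, here is a small sketch with made-up input lines (the XFSX value is hypothetical). Parenthesising the concatenation keeps the dynamic regex unambiguous:

```shell
# Two hypothetical records; only the first has $4 exactly equal to "FS".
input=$(printf '%s\n' '1 x y FS 14001' '1 x y XFSX 14001')

# Unanchored: matches any $4 that contains the value of v1.
loose=$(printf '%s\n' "$input" | awk -v v1="FS" '$4 ~ v1')

# Anchored on both sides: $4 must be exactly the value of v1.
strict=$(printf '%s\n' "$input" | awk -v v1="FS" '$4 ~ ("^" v1 "$")')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose form returns both records, the strict form only the first.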

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
Addendum: missing numbers are best left unchanged (1/4/2/3 = apple/pear/orange/3) to prevent loss of information.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.
You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that numbers from file2 that don't exist in the key array are left unchanged, because of the ($i in key) check.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.
EDIT: In case a value in file2 has no match in file1 and you want to keep the original value as is, try the following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Read file1 first, creating array arr indexed by the 1st field with the 2nd field as its value. Then set the field separator to / and traverse each field of file2, appending its mapped value to val, which is printed at the end of each line.
As @Sundeep notes in the comments, you can't use a backreference as an array index. You could combine match and gensub (well, I'm using sub below). Not that this is a recommended method anywhere, but just as an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.
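One way to sidestep that loop is to consume the input left to right and build the output separately, so replaced text is never rescanned. A rough sketch, with rest, out and tok as names introduced here for illustration and deliberately awkward sample data (k[1] = "word1"):

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '1 word1' '2 orange' > "$tmpdir/file1"   # replacement contains a digit
printf '%s\n' '1/2/3'              > "$tmpdir/file2"   # 3 has no mapping

result=$(awk 'NR==FNR { k[$1] = $2; next }
{
  rest = $0; out = ""
  while (match(rest, /[0-9]+/)) {
    tok = substr(rest, RSTART, RLENGTH)
    # copy text before the number, then the mapped value (or the number itself)
    out = out substr(rest, 1, RSTART - 1) ((tok in k) ? k[tok] : tok)
    rest = substr(rest, RSTART + RLENGTH)   # continue after the match
  }
  print out rest
}' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

This prints word1/orange/3: the loop terminates because rest shrinks on every iteration, and unmapped numbers are kept.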
With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.
Another approach using awk without loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

Using gawk to extract rows with string in a column

I was trying to extract rows from a tab separated file, if it contained a certain word in the 4th column. For example, if input file test.txt is:
chr 8 1234 abc ; xyz
chr 8 1255 abc
chr 8 987 xyz
chr 8 5467 jxyzm
The following code correctly outputs only the 1st and 3rd line:
gawk -F"\t" ' { if($4 ~ /\<xyz\>/) print $0 } ' test.txt >> test.out
However, when I try to run this in a loop, in a bash script, my output file is blank. the code I am using is:
while read id
do
OFILE=${ODIR}/${id}.txt
gawk -v id="$id" -F"\t" ' { if($4 ~ /\<id\>/) print $0 } ' ${IFILE} >> ${OFILE}
done < ${GFILE}
The file ${GFILE} has one word per line, e.g.:
xyz
fg45
tre2y
What am I doing wrong?
thanks!
Edited to:
Add fourth row in input file
Added -v id="$id" to command...script still doesn't work!
You can use gawk to read the search words from one file and find whole-word matches in the other like this:
gawk -F '\t' '
NR == FNR {
    words[$1]
    next
}
{
    for (w in words)
        if ($4 ~ ("\\<" w "\\>")) {
            print > (w ".txt")
            break
        }
}' "$GFILE" "$IFILE"
Then check output:
cat xyz.txt
chr 8 1234 abc ; xyz
chr 8 987 xyz
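If you really want to fix your shell script, here it is. The original failed because id inside /.../ is literal text rather than the awk variable; build the regex from a string instead:
while read -r id; do
    gawk -F '\t' -v id="$id" '$4 ~ ("\\<" id "\\>")' "$IFILE" > "$id.txt"
done < "$GFILE"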
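If gawk's \< and \> are not available, a word match can be approximated in any POSIX awk by building the regex from the variable. A rough sketch, assuming a "word" is a run of letters, digits and underscores and that id contains no regex metacharacters:

```shell
# Sample rows from the question, tab separated.
result=$(printf 'chr\t8\t1234\tabc ; xyz\nchr\t8\t1255\tabc\nchr\t8\t987\txyz\nchr\t8\t5467\tjxyzm\n' |
  awk -F '\t' -v id="xyz" '
    # id must be preceded by start-of-field or a non-word character,
    # and followed by a non-word character or end-of-field
    $4 ~ ("(^|[^A-Za-z0-9_])" id "([^A-Za-z0-9_]|$)")')

printf '%s\n' "$result"
```

The rows with "abc ; xyz" and "xyz" in column 4 are printed; "jxyzm" is not.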

How can I group unknown (but repeated) words to create an index?

I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following? (If the sort order matters, pipe the output to sort.)
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Creating array name indexed by $2 whose value is $1, appended to any existing value for the same index.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
Something like this might be what you need; it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!

I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
.
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
.
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (Plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

awk to count and sum total using matching string from file

I am trying to get the total length of each matching string and the count of each match in a file using awk. The matching string in $5 is the count and the sum of each $3 - $2 is the total length. Hopefully the awk below is a good start. Thank you :).
input
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
desired output
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119
awk
awk '{count[$5]++}
END {
for (word in count)
print $1,$2,$3,$4,word, count[word]
}' input > count |
awk 'print $1,$2,$3,$4,word, count[word]
}
{ $6 = $3 - $2 }
1' count.txt > length
edit
SCNN1D 1 119
GABRD 3 240
TAS1R3 4 223
You can do:
awk '{c1[$5]++; c2[$5]+=($3-$2)}
END{for (e in c1) print e, c1[e], c2[e]}' input
Note that the order of the records may be different than the order in the original file.
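If the original order matters, the first appearance of each key can be recorded in a second array. A sketch under that assumption (the order array and counter n are hypothetical additions), run on the sample input from the question:

```shell
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
EOF

result=$(awk '
  !($5 in c1) { order[++n] = $5 }          # remember first-seen order of keys
  { c1[$5]++; c2[$5] += $3 - $2 }          # count and cumulative length
  END { for (i = 1; i <= n; i++) print order[i], c1[order[i]], c2[order[i]] }
' "$tmpfile")

printf '%s\n' "$result"
rm -f "$tmpfile"
```

Unlike the prev-based approach, this does not require rows for the same key to be adjacent.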
With awk, it's possible to do the entire thing in a single script,
by keeping a running count of both the cumulative length, and the number of instances for each word.
Try this (yet untested):
awk '{
offset1=$2; offset2=$3; word=$5
TotalLength[word]+=offset2 - offset1 # or just $3-$2
count[word]++}
END {
for (word in count)
print word, count[word], TotalLength[word]
}' input
The original script had three errors.
The second awk chunk had a conflicting input specification: it is fed by a pipe but is also given a file argument (count.txt). When a file argument is present, awk reads the file and ignores the piped input.
In an END section, the numbered fields will only refer to the fields of the last line/record read. This is not what you want.
Finally, the second awk script is missing the opening brace { for the print statement.
$ cat tst.awk
$5 != prev { if (NR>1) print prev, cnt, sum; prev=$5; cnt=sum=0 }
{ cnt++; sum+=($3-$2) }
END { print prev, cnt, sum }
$ awk -f tst.awk file
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119

sed print from match until other match NOT inclusive

I want to print all lines from a match up to a second match, not including that second match.
What I have so far does everything and does too much, in that it prints the second match as well.
Specifically, let's say I want to print everything starting on a line containing 'test', up to, but not including, the first line starting with a number or an open bracket '['.
This goes some way, but not all the way:
sed -n '/test/,/^[0-9]\|^\[/p' file
It is much easier to do this via awk:
awk '/test/{p=1} /^([0-9]|\[)/{p=0} p' file
Using awk:
awk 'p && /^[0-9]|^\[/ { exit }; /test/{ p = 1 } p' file
Example:
$ cat temp.txt
4
1
2
3
4
5
$ awk 'p && /4/ { exit }; /2|1/{ p = 1 } p' temp.txt
1
2
3
Notice how it skipped 4 when /2|1/ wasn't found yet.
sed -n '/test/,/^[0-9[]/ {
/test/ {
h;b
}
x;p
$ {
x
/^[^0-9[]/ p
}
}' YourFile
This should work, but it is not elegant.
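A shorter sed variant, sketched under the assumption that the line containing 'test' does not itself start with a digit or '[': select the inclusive range, then suppress any line in it that matches the end pattern. The input lines are made up for the demonstration:

```shell
result=$(printf '%s\n' 'foo' 'a test line' 'alpha' 'beta' '3 stop' 'gamma' |
  sed -n '/test/,/^[0-9[]/{
/^[0-9[]/!p
}')

printf '%s\n' "$result"
```

This prints the 'test' line through 'beta' and leaves out '3 stop'. If a later line contains 'test' again, the range starts over, which may or may not be what you want.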