Command line to match lines with matching first field (sed, awk, etc.) - regex

What is a fast and succinct way to match lines from a text file that share the same first field?
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output, alternative:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx 0.5 GB.
There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.

Here's a method where you only have to remember the previous line (so it requires the input file to be sorted on the first field):
awk -F \| '
$1 == prev_key {print prev_line; matches ++}
$1 != prev_key {
if (matches) print prev_line
matches = 0
prev_key = $1
}
{prev_line = $0}
END { if (matches) print prev_line }
' filename
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Alternate output
awk -F \| '
$1 == prev_key {
if (matches == 0) printf "%s", $1
printf "%s%s", FS, prev_value
matches ++
}
$1 != prev_key {
if (matches) printf "%s%s\n", FS, prev_value
matches = 0
prev_key = $1
}
{prev_value = $2}
END {if (matches) printf "%s%s\n", FS, prev_value}
' filename
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
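If the input is not already sorted on the first field, one option is to sort it first and keep the streaming awk stage. This is only a minimal sketch: it assumes a sort that supports -s (stable ordering, a GNU/BSD extension), and the five-line sample is made up to stand in for the real file.

```shell
# Hypothetical unsorted sample standing in for the real input file.
grouped=$(printf '%s\n' 'd|amet' 'b|ipsum' 'a|lorem' 'd|consectetur' 'b|dolor' |
  sort -t '|' -k1,1 -s |   # -t: delimiter, -k1,1: sort on field 1 only, -s: keep input order within a key
  awk -F '|' '
    $1 == prev { rec = rec RS $0; n++; next }               # same key: extend the current group
    { if (n > 1) print rec; rec = $0; n = 1; prev = $1 }    # new key: flush the group if it had 2+ lines
    END { if (n > 1) print rec }                            # flush the last group
  ')
printf '%s\n' "$grouped"
```

Only one group is held in memory at a time, so the awk stage stays streaming; the sort stage carries the cost of ordering the file.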

For fixed-width fields you can use uniq (-D and -w are GNU options):
$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
If you don't have fixed-width fields, here are two awk solutions (both keep the whole file in memory, and the output order of for (k in a) is unspecified):
awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

Using awk:
awk -F '|' '!($1 in a){a[$1]=$2; next} {b[$1]=($1 in b ? b[$1] : FS a[$1]) FS $2}
END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor

This might work for you (GNU sed):
sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file
This reads two lines into the pattern space and checks whether the keys of both lines are the same. If so, it removes the second key and repeats. If not, it checks whether the first line has more than two fields; if it does, it prints that line and then deletes it, otherwise it just deletes the first line.
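To see the steps described above on a small input, here is a sketch that runs the same script on six sample lines. It assumes GNU sed (the \n inside the bracket expression and the ; after the label are GNU behaviour), and it uses -E, which GNU sed accepts as a synonym for -r.

```shell
# Six sample lines from the question; b and d each occur twice.
merged=$(printf '%s\n' 'a|lorem' 'b|ipsum' 'b|dolor' 'c|sit' 'd|amet' 'd|consectetur' |
  sed -E ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D')
printf '%s\n' "$merged"
```

Lines whose key occurs only once (a and c here) never gain a second | and are dropped by D without being printed.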

$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
Addendum: numbers missing from file1 are best left unchanged (1/4/2/3 = apple/pear/orange/3) to prevent loss of information.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.
You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that numbers from file2 that don't exist in the key array are left unchanged, because the assignment only happens when $i is in key.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.
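The effect of placing FS='/' between the two file names can be checked with a small sketch. The temporary directory and the extra key 3 (which has no entry in file1) are assumptions added for illustration.

```shell
# Hypothetical sample files: the question's data plus one unmapped key (3).
tmp=$(mktemp -d)
printf '%s\n' '1 apple' '2 orange' '4 pear' > "$tmp/file1"
printf '%s\n' '1/4/2/3' > "$tmp/file2"

mapped=$(awk -v OFS='/' '
  NR == FNR { key[$1] = $2; next }     # file1 is split on whitespace (default FS)
  {
    for (i = 1; i <= NF; ++i)
      if ($i in key) $i = key[$i]      # replace only the keys that are known
    print
  }
' "$tmp/file1" FS='/' "$tmp/file2")    # the assignment is applied just before file2 is opened
printf '%s\n' "$mapped"                # apple/pear/orange/3
rm -r "$tmp"
```

Because the assignment is processed as an operand in sequence with the file names, file1 is still read with the default separator.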
EDIT: If a value in file2 has no match in file1 and you want to keep the original value as it is, try the following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples, please try the following (values with no match in file1 come out empty):
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Read Input_file1 first and create array arr, indexed by the 1st field with the 2nd field as its value. Then set the field separator to / and traverse each field of file2, collecting the mapped values in val and printing it at the end of each line.
As @Sundeep notes in the comments, you can't use a backreference as an array index. You could combine match and gensub (well, I'm using sub below). This is not a recommended method, just an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.
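One way to avoid that never-ending loop is to consume the input as it is scanned, so replaced text is never matched again. The following is a sketch only; the lookup table in BEGIN and the names out, rest and tok are hypothetical.

```shell
remapped=$(printf '%s\n' '1/4/2/9' | awk '
  BEGIN { k[1] = "word1"; k[2] = "orange"; k[4] = "pear" }   # hypothetical map; note the digit in "word1"
  {
    out = ""; rest = $0
    while (match(rest, /[0-9]+/)) {
      tok = substr(rest, RSTART, RLENGTH)                    # the matched number
      out = out substr(rest, 1, RSTART - 1) ((tok in k) ? k[tok] : tok)
      rest = substr(rest, RSTART + RLENGTH)                  # continue after the match only
    }
    print out rest
  }')
printf '%s\n' "$remapped"   # word1/pear/orange/9
```

Unknown keys (9 here) are copied through unchanged rather than replaced with an empty string.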
With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.
Another approach using awk without loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

Dynamically generated regex for gsub not working

I have an input CSV file:
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
I need to create a string out of this as follows:
;5,1,3,5;6,2,6,7;7,4
Each group between the ; separators starts with a value of field $2, followed by the row numbers at which that value appears in the middle field; for example, ;5,1,3,5 means that 5 occurs at rows 1, 3 and 5.
I've been trying to use awk with gsub, trying to create the string MYSTR dynamically.
The regex inside the gsub is not working. I need a regex that will match ;$3 (the value of $3, which can be a two digit number) and replace it with ;$3,RowNO, if the pattern is not matched then add ;$3 at the end of the string.
This is what I have so far:
awk -F',' '{
print NR, $3;
noofchars=gsub(/;$3/,";"$3","NR,MYSTR);
print noofchars;
if ( noofchars == 1 )
;
else
MYSTR=MYSTR";"$3","NR;
print NR, $3;
print MYSTR;
}
END{print MYSTR;}' $1
The regex doesn't work because $3 isn't interpreted as the field #3 value but is seen as the anchor $ (that matches the end of the line) and a literal 3.
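To illustrate the difference, a dynamic regex can be built as a string expression from the field value instead of being written between slashes. This is a sketch with a made-up string s; it uses $2 (the middle field) and assumes the field value contains no regex metacharacters.

```shell
result=$(printf '%s\n' '1,5,3' | awk -F, '{
  s = ";5,1;6,2"                          # hypothetical string built so far
  n_const = gsub(/;$2/, "X", s)           # regex constant: $2 is not expanded, so nothing matches
  n_dyn   = gsub(";" $2 ",", "[&]", s)    # dynamic regex ";5," built from the field value
  print n_const, n_dyn, s
}')
printf '%s\n' "$result"   # 0 1 [;5,]1;6,2
```

Including the trailing comma in the dynamic pattern keeps ;5 from also matching a group such as ;55.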
You can do it without gsub:
awk -F, '{a[$2]=a[$2]","NR}END{for (i in a){printf(";%d%s",i,a[i])}}'
Input
$ cat file
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
Output
$ awk -F, '{gsub(/[ ]+/,"",$3);a[$2] = ($2 in a ? a[$2]:$2) FS $3 }END{for(i in a)printf("%s%s",";",a[i]); print ""}' file
;5,1,3,5;6,2,6,7;7,4
More readable version
awk -F, '
{
gsub(/[ ]+/,"",$3); # suppress space char in third field
a[$2] = ($2 in a ? a[$2]:$2) FS $3 # array a where index being field2 and value will be field3, if index exists before append string with existing value
}
END{
for(i in a) # loop through array a and print values
printf("%s%s",";",a[i]);
print ""
}
' file
@vsshekhar: Try the following too. It outputs the values in the same order in which they ($2) appear in Input_file.
awk -F, '{A[++i]=$2;B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR} END{for(j=1;j<=i;j++){if(B[A[j]]){printf(";%s,%s",A[j],B[A[j]]);delete B[A[j]]}};print ""}' Input_file
Here is the same solution in multi-line form:
awk -F, '{
A[++i]=$2;
B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR
}
END{
for(j=1;j<=i;j++){
if(B[A[j]]){
printf(";%s,%s",A[j],B[A[j]]);
delete B[A[j]]
}
};
print ""
}
' Input_file

Creating matching brackets- awk :sed

I have a data set that has three patterns:
First:
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
Second:
inaccurate in:prefix<>accurate:stem
inactive in:prefix<>active:stem
Third:
incommunicable in:prefix<>communicate:stem<>able:suffix
incompatibility in:prefix<>compatible:stem<>ity:suffix
I need to convert the above to the following form, with matching brackets in the style of the Penn Treebank (http://languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf):
First:
abrasion ((abrade:stem) ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
Second:
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
Third:
incommunicable (in:prefix ((communicate:stem)able:suffix))
incompatibility (in:prefix ((compatible:stem)ity:suffix))
The code I am working on uses awk:
{
n = gsub(/<>/,")",$2)
s = sprintf("%*s",n,"")
gsub(/ /,"(",s)
print "(" $1, s "((" $2 "))"
}
EDIT
More complex forms
nationalistic national: stem <>ism:suffix<>ist:suffix<>ic:suffix
to:
nationalistic ((((national: stem) ism:suffix)ist:suffix)ic:suffix)
It is not producing the expected output shown in the examples.
This should be general enough as it takes into account :stem, :prefix, and :suffix for matching:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
b=gensub(/(\([a-zA-Z]*:stem\))<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
c=gensub(/([a-zA-Z]*:prefix)<>(.*)/,"(\\1\\2)", "g", b);
print c;}' testfile
Demo here: https://ideone.com/U3ux91
EDIT
This should take care of multiple suffixes and prefixes:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
while ( a ~ /stem\)<>.*:suffix/) {
a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
}
while ( a ~ /<>/) {
a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(\\1\\2)", "g", a);
}
print a;}' test
Demo here: https://ideone.com/U7LYXi
(sorry if antinationalistic is not a word, but for testing sake....)
The expected output for pattern 1 may have a problem: the brackets are not paired. I guess these were typos and it should be:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
I wrote this awk script:
awk -v d="<>" '{$2="("$2")"}
$1~/^ab/{sub(d,")",$2);$2="(" $2}
$1~/^ina/{sub(d,"(",$2);$2=$2")"}
$1~/^inc/{sub(d,"((",$2);sub(d,")",$2);$2=$2")"}7' file
With your three example patterns in the same file, it gives:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
awk -F'<>| ' -v OFS= '{
$1 = $1 " "
for (i=2; i<=NF; i++) {
if ($i ~ /prefix$/) { $i = "(" $i; $NF = $NF ")" }
if ($i ~ /stem\)?$/) { stem = i; $i = "(" $i ")" }
if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem } }
} { print }'
awk to the rescue!
$ awk 'function wrap(v) {return "("v")"; }
{n=split($2,a,"<>");
if(n==3) w=wrap(a[1] wrap(wrap(a[2]) a[3]));
else if(a[1]~/:prefix/) w=wrap(a[1] wrap(a[2]));
else w=wrap(wrap(a[1]) a[2]);
print $1, w}' stems
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
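The wrap idea can be generalized to any number of prefixes and suffixes by locating the stem and wrapping outward from it. This is a sketch under the assumption that every line has exactly one part ending in :stem and no spaces inside the second column; the nationalistic line is the question's example with its stray spaces removed.

```shell
bracketed=$(printf '%s\n' \
  'abrasion abrade:stem<>ion:suffix' \
  'inactive in:prefix<>active:stem' \
  'incommunicable in:prefix<>communicate:stem<>able:suffix' \
  'nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix' |
awk '
  function wrap(v) { return "(" v ")" }
  {
    n = split($2, a, "<>")
    for (s = 1; s <= n; s++) if (a[s] ~ /:stem$/) break   # locate the stem
    w = wrap(a[s])
    for (i = s + 1; i <= n; i++) w = wrap(w a[i])          # suffixes, innermost first
    for (i = s - 1; i >= 1; i--) w = wrap(a[i] w)          # prefixes, innermost first
    print $1, w
  }')
printf '%s\n' "$bracketed"
```

Suffixes are attached before prefixes, which matches the nesting in the expected output for the third pattern.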

Bash command to match n line

I have an index HTML file with file/dir listing. It is just a usual filebrowser like :
...content here...
<td>20120011/</td>
<td>20120111/</td>
<td>20120211/</td>
<td>20120411/</td>
...content here...
I don't understand how to extract the 2nd line from the bottom.
1) I downloaded HTML with curl
content=$(curl -sL "http://path-to-html")
2) then used
dir=$(echo $content | sed '/.*href="\([0-9]*\/\)".*/!d;s//\1/;q')
which gives me the last match : 20120411.
But how to get the previous one ?
I don't know the total count of items.
This awk program will print the penultimate line:
echo "${content}" | awk '{ pen = ult; ult = $0 } END { print pen }'
This will print the penultimate matching line:
echo "${content}" | awk '/href="([0-9]{8}\/)"/ { pen = ult; ult = $0 } END { print pen }'
If you just want to extract the first capture group (the three-argument form of match is a gawk extension):
echo "${content}" | awk 'match($0, /href="([0-9]{8}\/)"/, a) { pen = ult; ult = a[1] } END { print pen }'
Putting it all together:
bash-4.2$ dir=$(curl -sL http://www.arteetmarte.no/tmp/index.html |
awk 'match($0, /href="([0-9]{8}\/)"/, a) {
pen = ult
ult = a[1]
}
END {
print pen
}
')
bash-4.2$ echo ${dir}
20130918/
Tested with: GNU Awk 4.1.0, API: 1.0
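For awk implementations without the three-argument match, the same penultimate-match idea can be sketched with RSTART and RLENGTH. The listing below is hypothetical; its href attributes are assumed from the pattern used in the question.

```shell
# Hypothetical listing; the real page would come from curl.
content='<td><a href="20120011/">20120011/</a></td>
<td><a href="20120111/">20120111/</a></td>
<td><a href="20120211/">20120211/</a></td>
<td><a href="20120411/">20120411/</a></td>'

# Quoting "$content" keeps the newlines, so awk sees one record per line.
dir=$(printf '%s\n' "$content" | awk '
  match($0, /href="[0-9]+\/"/) {
    pen = ult                                   # the previous match becomes the penultimate
    ult = substr($0, RSTART + 6, RLENGTH - 7)   # drop the leading href=" and the closing quote
  }
  END { print pen }')
printf '%s\n' "$dir"   # 20120211/
```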
Maybe a bit easier with awk:
dir=$(echo "$content"|awk '/href=/{x=p;p=$0}END{sub(/.*">/,"",x);sub(/<.*/, "",x); print x}')
dir=$(echo "$content" | sed -n '/href="\([0-9]\{1,\}\/\)"/ {s|.*href="\([0-9]\{1,\}/\)".*|-\1-|;H;}
$ {x;s|.*-\([0-9]\{1,\}/\)-\(\n-[0-9]\{1,\}/-\)\{1\}$|\1|p;}')
The 1 in \{1\}$ specifies how many lines are skipped from the end.

compare fields after removing spaces in two files with Regex tools

I am new to awk and need to find a statement to compare two fields in the files below.
The columns are separated by commas (,).
1.csv
_________
1space, aspace
2,b
space3space,c
2.csv
____________
1space,spacea
space2,bspace
3,spacecspace
The statement below works fine if there are no leading or trailing spaces in the fields of either 1.tsv or 2.tsv:
nawk -F, 'NR==FNR{a[$1,$2]++;next} !(a[$1,$2])' 2.tsv 1.tsv
Kindly let me know how the above statement can be modified to trim leading and trailing spaces before comparing. Thanks for the help.
awk -F, '
{ key=$1; gsub(/^[[:space:]]+|[[:space:]]+$/,"",key) }
NR==FNR { a[key]; next }
!(key in a)
' 2.tsv 1.tsv
Do this:
awk '
BEGIN {FS=OFS=","}
NR==FNR {
gsub(/^ *| *$/,"",$1)
a[$1]++
next
}
{
gsub(/^ *| *$/,"",$1);
if (!($1 in a)) {
print
}
}' 2.tsv 1.tsv
Code for GNU sed:
sed -r 's#\s*(\S+)\s*,\s*(\S+)\s*#/\\s*\1\\s*,\\s*\2\\s*/p#' file1|sed -f - file2
$cat file1
1 , a
2,b
3 ,c
$cat file2
1 ,a
2,b
3,c
$sed -r 's#\s*(\S+)\s*,\s*(\S+)\s*#/\\s*\1\\s*,\\s*\2\\s*/d#' file1|sed -nf - file2
You need to trim all the spaces from $1 before trying to locate it in array a:
awk -F"," 'NR==FNR{f1=$1; gsub(/ /, "", f1); a[f1]++; next} {f1=$1; gsub(/ /, "", f1);
if (!a[f1]) print}' 2.tsv 1.tsv
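Since the original statement compares both columns, here is a sketch that trims both fields before building the key. The trim helper, the sample rows and the temporary directory are assumptions for illustration; the extra row 4,d exists only in 1.tsv.

```shell
# Hypothetical sample data with stray spaces around the fields.
tmp=$(mktemp -d)
printf '%s\n' '1 , a' '2,b' ' 3 ,c' '4,d' > "$tmp/1.tsv"
printf '%s\n' '1 ,a' ' 2,b ' '3, c ' > "$tmp/2.tsv"

only_in_first=$(awk -F, '
  function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }   # strip leading/trailing blanks
  { k = trim($1) SUBSEP trim($2) }     # key built from both trimmed fields
  NR == FNR { seen[k]; next }          # first file (2.tsv): remember the keys
  !(k in seen)                         # second file (1.tsv): print rows not seen
' "$tmp/2.tsv" "$tmp/1.tsv")
printf '%s\n' "$only_in_first"         # 4,d
rm -r "$tmp"
```

The rows are printed as they appear in 1.tsv, since only the key is trimmed and not the record itself.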