joining lines with a specific pattern in a text file - regex

I'm trying to join rows based on the first value in a row. My file looks like this:
the structure is: ID, KEY, VALUE
1 1 Joe
1 2 Smith
1 3 30
2 2 Doe
2 1 John
2 3 20
The KEY stands for some kind of attribute of the ID, in this case KEY '1' is first name, '2' is surname and '3' is age.
The output should look like this:
1 Joe Smith 30
2 John Doe 20
I know that this can be done by a fairly simple awk script, but I'm having trouble finding it on SO or with Google.

{
    a[$1,$2]=$3
    if ($1>m) {m=$1}
}
END {
    for(i=1;i<=m;i++)
    {
        j=1
        printf i " "
        while (a[i,j] != "")
        {
            printf a[i,j] " "
            j++
        }
        printf "\n"
    }
}
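The script above can be dropped into a file and checked against the sample data (the file names join.awk and data.txt are hypothetical):

```shell
# The question's script, saved verbatim
cat > join.awk <<'EOF'
{
    a[$1,$2]=$3
    if ($1>m) {m=$1}
}
END {
    for(i=1;i<=m;i++)
    {
        j=1
        printf i " "
        while (a[i,j] != "")
        {
            printf a[i,j] " "
            j++
        }
        printf "\n"
    }
}
EOF

# Recreate the sample input
printf '%s\n' '1 1 Joe' '1 2 Smith' '1 3 30' \
              '2 2 Doe' '2 1 John' '2 3 20' > data.txt

# Note: each output line ends with a trailing space, because the
# script prints a space after every value
awk -f join.awk data.txt
```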

This awk command should work (it relies on each ID having exactly three rows):
awk '$2==1{fn=$3} $2==2{ln=$3} $2==3{age=$3} NR>1 && NR%3==0 {print $1,fn,ln,age}' file
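A quick sanity check of that one-liner (data.txt is a hypothetical file name). The fn/ln/age variables are filled in as each group's rows stream past, and the print fires on every third record, when all three are known:

```shell
# Recreate the sample input from the question
printf '%s\n' '1 1 Joe' '1 2 Smith' '1 3 30' \
              '2 2 Doe' '2 1 John' '2 3 20' > data.txt

# Works even when KEYs arrive out of order within a group (as for ID 2)
awk '$2==1{fn=$3} $2==2{ln=$3} $2==3{age=$3} NR>1 && NR%3==0 {print $1,fn,ln,age}' data.txt
```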

One way with GNU awk (length(a) on an array is a gawk extension); sorting numerically by ID and then KEY first puts the values in the right order:
awk '{a[$1]=(a[$1])?a[$1]FS$3:$3}END{for(;x<length(a);)print ++x,a[x]}' <(sort -n -k1,1 -k2,2 file)

Related

Getting the last column of a grep match for each line

Let's say I have
this is a test string
this is a shest string
this est is another example of sest string
I want the character position in the string of the last "t" IN THE WORDS [tsh]EST. How do I get it? (In bash)
EDIT2: I can get the wanted substring with [tsh]*est if I'm not wrong.
I cannot rely on the first match (awk where=match(regex,$0) ) since it gives the first character position but the size of the match is not always the same.
EDIT: Expected output ->
last t of [tsh]*est at char number: 14
last t of [tsh]*est at char number: 15
last t of [tsh]*est at char number: 35
Hope I was clear; I think I edited the question too many times, sorry!
What you got wrong
where=match(regex,$0)
The syntax of match is wrong: it's the string followed by the regex, that is, match($0, regex).
Correction
$ awk '{print match($0, "t[^t]*$")}' input
17
18
38
EDIT
To get the position of the last "t" in the words matching [tsh]EST:
$ awk '{match($0, "(t|sh|s)est"); print RSTART+RLENGTH-1}' input
14
15
35
OR
a much simpler version
$ awk 'start=match($0, "(t|sh|s)est")-1{$0=start+RLENGTH}1' input
14
15
35
Thanks Jidder for the suggestion
EDIT
To use the same regex as the OP provided:
$ awk '{for(i=NF; match($i, "(t|sh|s)*est") == 0 && i > 0; i--); print index($0,$i)+RLENGTH-1;}' input
14
15
35
You can use this awk, using the same regex as provided by the OP:
awk -v re='[tsh]*est' '{
    i=0;
    s=$0;
    while (p=match(s, re)) {
        p+=RLENGTH;
        i+=p-1;
        s=substr(s, p)
    }
    print i;
}' file
14
15
35
Try:
awk '{for (i=NF;i>=0;i--) { if(index ($i, "t") != 0) {print i; break}}}' myfile.txt
This will print the number of the last column whose word contains a t.
awk '{s=0;for (i=1;i<=NF;i++) if ($i~/t/) s=i;print s}' file
5
5
8
awk '{s=w=0;for (i=1;i<=NF;i++) if ($i~/t/) {s=i;w=$i};print "last t found in word="w,"column="s}' file
last t found in word=string column=5
last t found in word=string column=5
last t found in word=string column=8

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it in to this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n-lines at a time, and only a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF<3 || /^(#|[[:blank:]]*$)/{next} !a[$1]{b[++k]=$1; a[$1]=$3; next}
{a[$1] = a[$1] OFS $3} END{for(i=1; i<=k; i++) print b[i], a[b[i]]}' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/{next}{a[$1] = a[$1] $3 "\t"}END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before the print if you don't like having a trailing tab. Another solution is to check whether a[$1] already has a value, and use that to decide whether to append to a previous value or not. It complicates the code a bit, though.
Using bash >= 4.0 (for associative arrays):
declare -A array
while read -r line
do
    if [[ $line && $line != \#* ]];then
        c=$(echo "$line" | cut -f 1 -d ' ')
        value=$(echo "$line" | cut -f 3 -d ' ')
        array[$c]="${array[$c]} $value"
    fi
done < myFile.txt

for k in "${!array[@]}"
do
    echo "$k${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array and, in each iteration, appends the corresponding value to it.
In paragraph mode (RS=) each blank-line-separated chunk is one record, and with -F'\n' each line of the chunk is a field; field 1 is the header line, so the loop starts at i=2:
$ awk -v RS= -F'\n' '{ for (i=2;i<=NF;i++) {split($i,f,/[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3]} } END{ for (key in map) print key map[key]}' file
a 1234 8765
b 3333 1212
c 1357 9999

awk with joined field

I am trying to extract data from one file, based on another.
The substring from file1 serves as an index to find matches in file2.
Everything works when the string to be searched in file2 is between spaces or isolated, but when it is joined to other fields awk cannot find it. Is there a way to have awk match any part of the strings in file2?
awk -v v1="$Var1" -v v2="$var2" '
NR==FNR {
    if ($4==v1 && $5==v2) {
        s=substr($0,4,8)
        print s
        a[s]++
    }
    next
}
!($1 in a) {
    print
}' /tmp/file1 /tmp/file2
example that works:
file1:
1 554545352014-01-21 2014-01-21T16:18:01 FS 14001 1 1.10
1 554545362014-01-21 2014-01-21T16:18:08 FS 14002 1 5.50
file2:
55454535 11 17 102 850Sande Fiambre 1.000
55454536 11 17 17 238Pesc. Dourada 1.000
example that does not work:
file2:
5545453501/21/20142 1716:18 1 1 116:18
5545453601/21/20142 1716:18 1 1 216:18
The string to be searched, for instance 55454535, finds a match in the working example, but it doesn't in the bottom one.
You probably want to replace this:
!($1 in a) {
    print
}
with this (or similar - your requirements are unclear):
{
    found = 0
    for (s in a) {
        if ($1 ~ "^"s) {
            found = 1
        }
    }
    if (!found) {
        print
    }
}
Use a regex comparison ~ instead of ==
ex. if ($4 ~ v1 && $5 ~ v2)
Prepend v1/v2 with ^ if you want the word to only begin with the string, and append $ if you want it to only end with it.
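A minimal sketch of that anchored variant (the sample values are hypothetical): with ^ and $ on both ends, ~ behaves like an exact whole-field match, while the unanchored form matches any substring of the field.

```shell
printf '%s\n' '55454535 x' '5545453501/21/2014 y' > f2.txt

# Unanchored: matches both lines (substring match)
awk -v v1='55454535' '$1 ~ v1 {print $2}' f2.txt

# Anchored on both ends: matches only the isolated field
awk -v v1='55454535' '$1 ~ "^" v1 "$" {print $2}' f2.txt
```

Caveat: this is still a regex comparison, so any metacharacters inside v1 keep their special meaning.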

select lines with duplicate columns by a specific value

I have an input like this:
LineA parameter1 parameter2 56
LineB parameter1 parameter2 87
LineB parameter1 parameter2 56
LineB parameter1 parameter2 90
LineC parameter1 parameter2 40
I want to print each line but, if the first column ($1) is duplicated, only print the line with the highest value in the last column ($4).
So the output should look like this:
LineA parameter1 parameter2 56
LineB parameter1 parameter2 90
LineC ...
Try the below (assuming field 4 is >= 0 throughout).
Array b is used to track the highest value in field 4 for unique values in field 1. Array a (keyed by field 1) contains the corresponding record. As each record is processed, the record is added to array a and field 4 is added to array b if
1. a value is encountered in field 1 for the first time, or
2. the value in field 4 exceeds the existing value in b for the value in field 1.
Finally, array a is printed out.
awk '$4 > b[$1] {a[$1] = $0; b[$1] = $4}
END{for (x in a) {print a[x]}}' file
Code for GNU awk:
awk 'BEGIN {SUBSEP=OFS} $4>a[$1,$2,$3] {a[$1,$2,$3]=$4} END {for (i in a) {print i,a[i]}}' file
Another way in awk (it assumes lines with the same $1 are adjacent, as in the sample):
awk '
fld1!=$1 && NR>1 {print line}
fld1==$1 {line=(fld4>$4)?line:$0;next}
{line=$0;fld1=$1;fld4=$4;next}
END{print line}' file

Separate string of digits into 3 columns using awk/sed

I have a string of digits in rows as below:
6390212345678912011012112121003574820069121409100000065471234567810
6390219876543212011012112221203526930428968109100000065478765432196
That I need to split into 6 columns as below:
639021234567891,201101211212100,3574820069121409,1000000,654712345678,10
639021987654321,201101211222120,3526930428968109,1000000,654787654321,96
Conditions:
Field 1 = 15 Char
Field 2 = 15 Char
Field 3 = 15 or 16 Char
Field 4 = 7 Char
Field 5 = 12 Char
Field 6 = 2 Char
Final Output:
639021234567891,3574820069121409,654712345678
639021987654321,3526930428968109,654787654321
It's not clear how to detect whether field 3 should have 15 or 16 chars, but as a draft for the first 3 fields you could use something like this:
echo 63902910069758520110121121210035748200670169758510 |
awk '{ printf("%s,%s,%s\n",substr($1,1,15),substr($1,16,15),substr($1,31,16)); }'
Or with sed:
echo $NUM | sed -r 's/^([0-9]{15})([0-9]{15})([0-9]{15,16}) ...$/\1,\2,\3, .../'
This will use 15 or 16 for the length of field 3 based on the length of the whole string.
If you're using gawk:
gawk -v f3w=16 'BEGIN {OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"} {print $1, $3, $5}'
Do you know ahead of time what the width of Field 3 should be? Do you need it to be programmatically determined? How? Based on the total length of the line? Does it change line-by-line?
Edit:
If you don't have gawk, then this is a similar approach:
awk -v f3w=16 'BEGIN {OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"; n=split(FIELDWIDTHS,fw," ")} { p=1; r=$0; for (i=1;i<=n;i++) { $i=substr(r,p,fw[i]); p += fw[i]}; print $1,$3,$5}'
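A quick check of the portable variant against the question's sample lines (nums.txt is a hypothetical file name); it should reproduce the "Final Output" from the question:

```shell
printf '%s\n' \
  '6390212345678912011012112121003574820069121409100000065471234567810' \
  '6390219876543212011012112221203526930428968109100000065478765432196' > nums.txt

# Split the FIELDWIDTHS string manually and carve each field out of the
# saved record with substr(), then print fields 1, 3 and 5
awk -v f3w=16 'BEGIN {OFS=","; FIELDWIDTHS="15 15 " f3w " 7 12 2"; n=split(FIELDWIDTHS,fw," ")}
     { p=1; r=$0; for (i=1;i<=n;i++) { $i=substr(r,p,fw[i]); p += fw[i]}; print $1,$3,$5}' nums.txt
```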