Awk to extract and format a highly variable text file - regex

I'm dealing with a text file that's just a mess. It's the service record for a used RV that I'm buying, and it's a regex lover's nightmare.
It has both inconsistent field separators and an inconsistent number of fields, with the lines being one of two types:
Type 1 (11 columns):
UNIT Mile GnHr R.O. Ln Service Description Mechanic Hours $ Amt
7-9918;57878 1698 01633 021;0502-00C ENG OIL/ FILTERT IF NEEDED;M02 JOSE A. SANCHEZ;0.80;80.00
Type 2 (10 columns):
UNIT Mile GnHr R.O. Ln Service Description Hours $ Amt
7-9918;55007 1641 [9564 007;ELE-BAT-BAT-0-0AAA;BATTERY AAA ALL BRANDS;2;31.12
I've stripped out all the headings, but put them back here just for reference. In Type 2 lines, the Mechanic field is missing.
I replaced all occurrences of multiple spaces with semicolons, so what I have now is a file where some lines have 10 fields, some lines have 11 fields, and sometimes the field separator is a space, and in other cases it's a semicolon, and some fields have legitimate embedded spaces (Description and Mechanic).
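For reference, that multiple-spaces-to-semicolon step might have looked something like this (a sketch only; raw.txt and file are placeholder names):
$ awk '{ gsub(/  +/, ";") } 1' raw.txt > file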
I'm trying to find a way with awk to:
1) Extract each field and be able to print it out with a uniform OFS (semicolon is preferred)
2) If the Mechanic field is missing, insert it and print N/A or -- for the Mechanic
I can deal with column headings and stuff myself, I just can't crack the code for how to deal with the FS problem and variable number of columns in this file. I can grep out specific information that I need, but would be thrilled to get it into a form where I can import it into a spreadsheet or DB.

Your input file's not so bad. Assuming your input file is semicolon-separated:
Replace all blank chars in $2 with a ; to split it up into separate fields for output; then
if there's a blank in $3, replace the first blank with a ; (since $3 contains both the service and the description, they need to be separated); otherwise
this is a format of line that has no mechanic specified, so add the empty-mechanic text after $4 (the description),
and then just print the line:
$ awk 'BEGIN{FS=OFS=";"} {gsub(/ /,OFS,$2)} !sub(/ /,OFS,$3){$4=$4 OFS "N/A"} 1' file
7-9918;57878;1698;01633;021;0502-00C;ENG OIL/ FILTERT IF NEEDED;M02 JOSE A. SANCHEZ;0.80;80.00
7-9918;55007;1641;[9564;007;ELE-BAT-BAT-0-0AAA;BATTERY AAA ALL BRANDS;N/A;2;31.12
and if you'd like to do anything with the individual fields:
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ gsub(/ /,OFS,$2) }
!sub(/ /,OFS,$3) { $4 = $4 OFS "N/A" }
{
$0 = $0
print
for (i=1; i<=NF; i++) {
print NR, i, $i
}
print ""
}
$ awk -f tst.awk file
7-9918;57878;1698;01633;021;0502-00C;ENG OIL/ FILTERT IF NEEDED;M02 JOSE A. SANCHEZ;0.80;80.00
1;1;7-9918
1;2;57878
1;3;1698
1;4;01633
1;5;021
1;6;0502-00C
1;7;ENG OIL/ FILTERT IF NEEDED
1;8;M02 JOSE A. SANCHEZ
1;9;0.80
1;10;80.00
7-9918;55007;1641;[9564;007;ELE-BAT-BAT-0-0AAA;BATTERY AAA ALL BRANDS;N/A;2;31.12
2;1;7-9918
2;2;55007
2;3;1641
2;4;[9564
2;5;007
2;6;ELE-BAT-BAT-0-0AAA
2;7;BATTERY AAA ALL BRANDS
2;8;N/A
2;9;2
2;10;31.12

A friend of mine also sent me this solution, done in perl:
#!/usr/bin/env perl
use strict;
use warnings;
#                                                                                                     1         1         1         1         1
#          1         2         3         4         5         6         7         8         9         0         1         2         3         4
# 012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
# Type 1:
# 7-9918  55007 1641 [9564 022            0211     INTERIOR MISC.                  M02 JOSE A. SANCHEZ        0.00       0.00
# Type 2:
# 7-9918  57878 1698 01633 001            FUE-LPG-LPG-S-GAS      PROPANE GAS BULK PURCHASE                    5          24.00
my $delim="\t";
while (<STDIN>) {
    # print $_;
    # Both formats are the same at this point
    print substr($_, 0, 6) . $delim;
    print substr($_, 8, 5) . $delim;
    print substr($_, 14, 4) . $delim;
    print substr($_, 19, 5) . $delim;
    print substr($_, 25, 3) . $delim;
    my $qty = substr($_, 109, 11);
    $qty =~ s/^\s*//;
    $qty =~ s/\s*$//;
    if ($qty =~ /^\d+\.\d{2}$/) {
        # Type 1: decimal hours, so a mechanic column is present
        print substr($_, 40, 9) . $delim;
        print substr($_, 49, 32) . $delim;
        # print substr($_, 81, 32) . $delim; # Technician name
        print $qty . $delim;
    } elsif ($qty =~ /^[-]?\d+$/) {
        # Type 2: integer quantity, no mechanic column
        print substr($_, 40, 23) . $delim;
        print substr($_, 63, 46) . $delim;
        print $qty . $delim;
    }
    print sprintf("%.2f", substr($_, 120, 11)) . "\n";
}
1;
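Since the script reads standard input, it would be run along these lines (the script and file names here are placeholders):
$ perl extract.pl < service_record.txt > service_record.tsv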

Related

How can I group unknown (but repeated) words to create an index?

I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked about words inside square brackets, and I tried to adapt its code, but I am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
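(For the record, the error happens because the unquoted < and > are taken by the shell as redirections, with the glob .* ending up as a redirect target; quoting the pattern avoids that, e.g. something like
grep -on '<.*>' index.txt
though the answers below use a tighter pattern that also strips the brackets.)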
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following (if you are not worried about sorting order; in case you need it sorted, append sort to the following code).
awk '
BEGIN{
  FS=":"
}
{
  name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
  for(key in name){
    print key": "name[key]
  }
}
' Input_file
Explanation: here is a detailed, commented version of the above code.
awk '                                      ##Starting awk program from here.
BEGIN{                                     ##Starting BEGIN section from here.
  FS=":"                                   ##Setting field separator as : here.
}
{
  name[$2]=($2 in name?name[$2] OFS:"")$1  ##Creating array name, indexed by $2; each $1 keeps getting appended to the value for that index.
}
END{                                       ##Starting END block of this code here.
  for(key in name){                        ##Traversing through the name array here.
    print key": "name[key]                 ##Printing the key, a colon, and the array value for that key.
  }
}
' Input_file                               ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings between angle brackets with GNU grep, you may consider a PCRE-based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
Something like this might be what you need; it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
A shorter version of the same, suggested by RavinderSingh13:
awk -F: '{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

Extract journal title from Genbank file using perl without using $1, $2 etc

This is a part of my input Genbank file:
LOCUS       AC_000005              34125 bp    DNA     linear   VRL 03-OCT-2005
DEFINITION  Human adenovirus type 12, complete genome.
ACCESSION   AC_000005 BK000405
VERSION     AC_000005.1  GI:56160436
KEYWORDS    .
SOURCE      Human adenovirus type 12
  ORGANISM  Human adenovirus type 12
            Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE   1  (bases 1 to 34125)
  AUTHORS   Davison,A.J., Benko,M. and Harrach,B.
  TITLE     Genetic content and evolution of adenoviruses
  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
   PUBMED   14573794
And I want to extract the journal title, for example J. Gen. Virol. (not including the issue number and pages).
This is my code, and it doesn't give any result, so I am wondering what goes wrong. I did use parentheses and $1, $2, etc., and though that worked, my tutor told me to try without using that method and to use substr instead.
foreach my $line (@lines) {
    if ( $line =~ m/JOURNAL/g ) {
        $journal_line = $line;
        $character = substr( $line, $index, 2 );
        if ( $character =~ m/\s\d/ ) {
            print substr( $line, 12, $index - 13 );
            print "\n";
        }
        $index++;
    }
}
Another way to do this is to take advantage of BioPerl, which can parse GenBank files:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $io = Bio::SeqIO->new(-file => 'AC_000005.1.gb', -format => 'genbank');
my $seq = $io->next_seq;
foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
    print $annotation->location . "\n";
}
If you run this script with AC_000005.1 saved in a file called AC_000005.1.gb, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
Rather than matching and using substr, it is much easier to use a single regex to capture the whole JOURNAL line and use capturing parentheses to grab the text representing the journal information:
foreach my $line (@lines) {
    if ($line =~ /JOURNAL\s+(.+)/) {
        print "Journal information: $1\n";
    }
}
The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters in the line.
To get the text without using $1, I think you're trying to do something like this:
if ($line =~ /JOURNAL/) {
    my $ix = length('JOURNAL');
    # variable containing the journal name
    my $j_name;
    # while the journal name is not defined...
    while (! $j_name) {
        # starting with $ix = the length of the word JOURNAL, get character $ix in the string
        if (substr($line, $ix, 1) =~ /\s/) {
            # if it is whitespace, increase $ix by one
            $ix++;
        }
        else {
            # if it isn't whitespace, we've found the text!!!!!
            $j_name = substr($line, $ix);
        }
    }
}
If you already know how many characters there are in the left-hand column, you can just do substr($line, 12) (or whatever) to retrieve a substring of $line starting at character 12:
foreach my $line (@lines) {
    if ($line =~ /JOURNAL/) {
        print "Journal information: " . substr($line, 12) . "\n";
    }
}
You can combine the two techniques to eliminate the issue number and dates from the journal data:
if ($line =~ /JOURNAL/) {
    my $j_name;
    my $digit;
    my $indent = 12;   # the width of the left-hand column
    my $ix = $indent;  # we'll use this to track the characters in our loop
    while (! $digit) {
        # starting with $ix = the length of the indent,
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\d/) {
            # if it is a digit, we've found the number of the journal
            # we can stop looping now. Whew!
            $digit = $ix;
            # set j_name:
            # get a substring of $line starting at $indent going to $digit
            # (i.e. of length $digit - $indent)
            $j_name = substr($line, $indent, $digit - $indent);
        }
        $ix++;
    }
    print "Journal information: $j_name\n";
}
I think it would have been easier just to get the data from the Pubmed API! ;)

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it in to this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n lines at a time, and only of a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column.
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF<3 || /^(#|[[:blank:]]*$)/{next} !a[$1]{b[++k]=$1; a[$1]=$3; next}
{a[$1] = a[$1] OFS $3} END{for(i=1; i<=k; i++) print b[i], a[b[i]]}' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/{next}{a[$1] = a[$1] $3 "\t"}END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before printif uf you don't like having trailing spaces. Another solution is to check if a[$1] already has a value where you decide if you have append to a previous value or not. It complicates the code a bit though.
Using bash >= 4.0 (for associative arrays):
declare -A array
while read line
do
    if [[ $line && $line != \#* ]]; then
        c=$( echo $line | cut -f 1 -d ' ')
        value=$( echo $line | cut -f 3 -d ' ')
        array[$c]="${array[$c]} $value"
    fi
done < myFile.txt

for k in "${!array[@]}"
do
    echo "$k ${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array and, in each iteration, appends the corresponding value to it.
$ awk -v RS= -F'\n' '{ for (i=2;i<=NF;i++) {split($i,f,/[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3]} } END{ for (key in map) print key map[key]}' file
a 1234 8765
b 3333 1212
c 1357 9999
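For readability, the same one-liner written out as a commented script (saved as, say, tst.awk, and run with awk -f tst.awk file; the behavior should be identical):
BEGIN { RS=""; FS="\n" }             # paragraph mode: blank-line-separated records, one line per field
{
    for (i=2; i<=NF; i++) {          # field 1 is the header line, so skip it
        split($i, f, /[[:space:]]+/) # split the data line into its columns
        map[f[1]] = map[f[1]] " " f[3]  # append column 3 under the column-1 key
    }
}
END { for (key in map) print key map[key] }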

awk if statement and pattern matching

I have the following input file:
##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
My desired output file:
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
Explanation: I want to print in my output file all the lines beginning with "#", all the "unique" lines with respect to column 1, and, when there are repeated hits in column 1: first, take the number in $2 and add the length of $5 (on the same line); if the result is smaller than the $2 of the next line, print both lines; BUT if the result is bigger than the $2 of the next line, compare the DP values and only print the line with the best DP.
What I've tried:
awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt
I'm new to the awk world... I think it may be simpler to write a little script with multiple lines... or maybe a bash solution would be better.
Thanks in advance
As requested, an awk solution. I have commented the code heavily, so hopefully the comments will serve as explanation. As a summary, the basic idea is to:
Match comment lines, print them, and go to the next line.
Match the first line (done by checking whether we have started remembering col1 yet).
On all subsequent lines, check values against the remembered values from the previous line. The "best" record, ie. the one that should be printed for each unique ID, is remembered each time and updated depending on conditions set forth by the question.
Finally, output the last "best" record of the last unique ID.
Code:
# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }

# Set up variables on the first line of input and go to next line.
! col1 {                    # If col1 is unset:
    col1 = $1;
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0; # Note dp is turned into an int here by +0
    best = $0;
    next;
}

# For all other lines of input:
{
    # If col1 is the same as on the previous line:
    if ($1 == col1) {
        # Check col2
        if (len5 + col2 < $2)            # Previous len5 + col2 < current $2:
            print best;                  # Print previous record.
        # Check DP
        else if (substr($6, 4) + 0 < dp) # Current dp < previous dp:
            next;                        # Go to next record, do not update variables.
    }
    else { # Different ids: print best line from previous id and update id.
        print best;
        col1 = $1;
    }
    # Update variables to current record.
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0;
    best = $0;
}

# Print the best record of the last id.
END { print best }
Note: dp is calculated by taking the sub-string of $6 starting at index 4 and going to the end. The + 0 is added to force the value to be converted to an integer, to ensure the comparison will work as expected.
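To see why the + 0 matters: awk compares two strings as strings, not as numbers, so forcing both operands numeric can change the result. A minimal illustration:
$ awk 'BEGIN { a = "9"; b = "19"; print (a < b), (a+0 < b+0) }'
0 1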
Perl solution. You might need to fix the border cases, as you didn't provide data to test them. @last remembers the last line, @F is the current line.
#!/usr/bin/perl
use warnings;
use strict;

my (@F, @last);
while (<>) {
    @F = split;
    print and next if /^#/ or not @last;
    if ($last[0] eq $F[0]) {
        if ($F[1] + length $F[4] > $last[1] + length $last[4]) {
            print "@last\n";
        } else {
            my $dp_l = $last[5];
            my $dp_f = $F[5];
            s/DP=// for $dp_l, $dp_f;
            if ($dp_l > $dp_f) {
                @F = @last;
            }
        }
    } else {
        print "@last\n" if @last;
    }
} continue {
    @last = @F;
}
print "@last\n";

Removing Extra commas from Comma delimited file

I have a comma delimited file with 12 columns.
There is a problem with the 5th and 6th columns, which contain extra commas (the text in the 5th and 6th columns is identical, but may have extra commas inside it).
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
So in the above example "Hey There,How are you" should not have a comma.
I need to remove extra commas in 5th and 6th column.
If you always want to remove the 5th comma, try
sed 's/,//5' input.txt
But you are saying it may have extra commas, so you have to provide the logic for determining whether extra commas are present or not.
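Since a well-formed line has exactly 12 fields, the presence of extra commas can at least be detected from the field count alone, along these lines:
$ awk -F, 'NF > 12' input.txt
which prints only the lines that need repair.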
If you know the number of commas, you can use awk. This has proven to be quite an exercise; I am sure someone else will come up with a more elegant solution, but I'll share mine anyway:
awk -f script.awk input.txt
with script.awk:
BEGIN{
    FS=","
}
NF<=12{
    print $0
}
NF>12{
    for (i=1; i<=4; i++) printf "%s%s", $i, FS
    for (j=0; j<2; j++){
        for (i=0; i<=(NF-12)/2; i++){
            printf "%s", $(i+5)
            if (i<(NF-12)/2) printf "_"
            else printf "%s", FS
        }
    }
    for (i=NF-5; i<=NF; i++) printf "%s%s", $i, FS
    printf "\n"
}
First we set the field separator to ,. If we count 12 or fewer fields, everything's fine and we simply print the whole line. If there are more than 12 fields, we first print the first 4 fields (again with the field separator), and then we print field 5 (and field 6) twice, but instead of printing the , between them, we exchange it for _. In the end we print the remaining fields.
As I said, there is probably a more elegant solution to this. I wonder what other people will come up with.
If all other fields are numeric, you can try to save the useful commas by that criterion: turn every comma that touches a digit into a semicolon, delete the remaining commas (the extra ones inside the text), then split the doubled text field in half:
sed -r 's/,([0-9])/;\1/g; s/([0-9]),/\1;/g; s/,//g' a | awk -F\; '{ print $1 "," $2 "," $3 "," $4 "," substr($5, 1, length($5)/2) "," substr($5, length($5)/2 + 1) "," $6 "," $7 }'
2011,123456,1234567,12345678,Hey ThereHow are you,Hey ThereHow are you,882864309037,ABC ABCDLABACD
Note the limitations: a comma between two non-numeric fields (ABC ABCD,LABACD here) is also lost, and anything past the 7th resulting field is dropped by the print.
You can try with perl and its Text::CSV_XS module:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;

my (@columns);
open my $fh, '<', shift or die;
my $csv = Text::CSV_XS->new or die;

while ( my $row = $csv->getline( $fh ) ) {
    undef @columns;
    if ( @$row <= 12 ) {
        @columns = @$row;
        next;
    }
    my $extra_columns = ( @$row - 12 ) / 2;
    # Each of the two duplicated text columns spans $extra_columns + 1 fields,
    # so the untouched tail starts right after both of them.
    my $post_columns_index = 4 + 2 * ( $extra_columns + 1 );
    @columns = (
        @$row[0..3],
        (join( '', @$row[4..(4+$extra_columns)] )) x 2,
        @$row[$post_columns_index..$#$row]
    );
}
continue {
    $csv->print( \*STDOUT, \@columns );
    printf "\n";
}
Assuming an input file (infile) with three lines, where the first one has an additional comma in each of the two text columns, the second one has two additional commas in each, and the third one is correct:
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
Run the script like:
perl script.pl infile
That yields:
2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
Note that it adds some quotes, but that's correct per the CSV specification and easier to handle than the previous state.