I have to create a shell script that indexes a book (a text file) by taking any words that are enclosed in angle brackets (<>) and building an index file from them. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that dealt with words inside square brackets and tried to adapt its code, but I am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of angle brackets, and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
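For reference: the error came from the shell parsing the unquoted < and > as redirections (and the .* glob expanding to more than one file). Quoting the pattern avoids it; something along these lines should work:

grep -on '<[^<>]*>' index.txt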
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following (if you are not worried about sorting order; in case you need it sorted, append sort to the code below).
awk '
BEGIN{
  FS=":"
}
{
  name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
  for(key in name){
    print key": "name[key]
  }
}
' Input_file
Explanation: a detailed explanation of the above code.
awk '                                      ##Starting awk program from here.
BEGIN{                                     ##Starting BEGIN section from here.
  FS=":"                                   ##Setting field separator as : here.
}
{
  name[$2]=($2 in name?name[$2] OFS:"")$1  ##Creating an array named name, indexed by $2; each $1 is appended to the value stored at that index.
}
END{                                       ##Starting END block of this code here.
  for(key in name){                        ##Traversing through the name array here.
    print key": "name[key]                 ##Printing key, a colon, and the array value stored at index key.
  }
}
' Input_file                               ##Mentioning Input_file name here.
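For example, with the sample index.txt from the question (and sort appended, as mentioned above), a run should look like this:

$ awk 'BEGIN{FS=":"} {name[$2]=($2 in name?name[$2] OFS:"")$1} END{for(key in name) print key": "name[key]}' index.txt | sort
big: 1 3 9
but: 2
sun: 4 6 7 8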
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
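For example, on a small two-line sample tagged the way the question describes, the command prints the line number and the bare word:

$ printf '%s\n' 'the <big> dog' 'a <sun>ny day' | grep -oPn '<\K[^<>]+(?=>)'
1:big
2:sun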
Something like this might be what you need; it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
    for (lineNr=1; lineNr<=NF; lineNr++) {
        line = $lineNr
        idx = 1
        while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
            word = substr(line,idx+RSTART,RLENGTH-2)
            locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
            idx += (RSTART + RLENGTH - 1)   # resume right after the closing ">" so adjacent tags are not skipped
        }
    }
}
END {
    for (word in locs) {
        print word, locs[word]
    }
}
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
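To see what the sed is massaging, this is what the intermediate datamash output should look like before the sed stage:

$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1
big:1,3,9
but:2
sun:4,6,7,8

The sed then just widens the first : and turns the commas into spaces.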
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort
I need help, please, with a script (a regex with sed, for example) to fix a big text file under Linux. My records look like:
1373350|Doe, John|John|Doe|||B|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3|Moe corp|...
I need to check whether the 7th column consists of a single character (it could be a letter or a number) and, if so, append the second column to it without the comma, I mean:
1373350|Doe, John|John|Doe|||B Doe John|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3 Simpson Homer|Moe corp|...
Any help? Thanks!
Awk is better suited for this job:
awk -F '|' 'BEGIN { OFS = FS } length($7) == 1 { x = $2; sub(/,/, "", x); $7 = $7 " " x } 1' filename
That is:
BEGIN { OFS = FS } # output separated the same way as the input
length($7) == 1 { # if the 7th field is one character long
x = $2 # make a copy of the second field
sub(/,/, "", x) # remove comma from it
$7 = $7 " " x # append it to seventh field
}
1 # print line
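For instance, with the question's two records saved as filename, a run should produce:

$ awk -F '|' 'BEGIN { OFS = FS } length($7) == 1 { x = $2; sub(/,/, "", x); $7 = $7 " " x } 1' filename
1373350|Doe, John|John|Doe|||B Doe John|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3 Simpson Homer|Moe corp|...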
I have the following input file:
##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
My desired output file:
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
Explanation: I want to print to my output file all lines beginning with "#" and all lines that are unique with respect to column 1. When there are repeated hits in column 1: first, take the number in $2 and add the length of $5 (on the same line); if the result is smaller than the $2 of the next line, print both lines; BUT if the result is bigger than the $2 of the next line, compare the DP values and print only the line with the best DP.
What I've tried:
awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt
I'm new to the awk world... I think it's simpler to do a little script with multiple lines... or maybe a bash solution is better.
Thanks in advance
As requested, an awk solution. I have commented the code heavily, so hopefully the comments will serve as explanation. As a summary, the basic idea is to:
Match comment lines, print them, and go to the next line.
Match the first line (done by checking whether we have started remembering col1 yet).
On all subsequent lines, check values against the remembered values from the previous line. The "best" record, ie. the one that should be printed for each unique ID, is remembered each time and updated depending on conditions set forth by the question.
Finally, output the last "best" record of the last unique ID.
Code:
# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }

# Set up variables on the first line of input and go to next line.
! col1 {                        # If col1 is unset:
    col1 = $1;
    col2 = $2;
    len5 = length($5);
    dp   = substr($6, 4) + 0;   # Note dp is turned into an int here by +0
    best = $0;
    next;
}

# For all other lines of input:
{
    # If col1 is the same as on the previous line:
    if ($1 == col1) {
        # Check col2
        if (len5 + col2 < $2)              # Previous len5 + col2 < current $2:
            print best;                    # Print previous record.
        # Check DP
        else if (substr($6, 4) + 0 < dp)   # Current dp < previous dp:
            next;                          # Go to next record, do not update variables.
    }
    else {  # Different ids: print best line from previous id and update id.
        print best;
        col1 = $1;
    }
    # Update variables to current record.
    col2 = $2;
    len5 = length($5);
    dp   = substr($6, 4) + 0;
    best = $0;
}

# Print the best record of the last id.
END { print best }
Note: dp is calculated by taking the sub-string of $6 starting at index 4 and going to the end. The + 0 is added to force the value to be converted to an integer, to ensure the comparison will work as expected.
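With the rules above saved as script.awk (name assumed), running them against the question's file should reproduce the desired output exactly:

$ awk -f script.awk file.txt
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41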
Perl solution. You might need to fix the border cases as you didn't provide data to test them.
@last remembers the last line, @F is the current line.
#!/usr/bin/perl
use warnings;
use strict;

my (@F, @last);
while (<>) {
    @F = split;
    print and next if /^#/ or not @last;
    if ($last[0] eq $F[0]) {
        if ($F[1] + length $F[4] > $last[1] + length $last[4]) {
            print "@last\n";
        } else {
            my $dp_l = $last[5];
            my $dp_f = $F[5];
            s/DP=// for $dp_l, $dp_f;
            if ($dp_l > $dp_f) {
                @F = @last;
            }
        }
    } else {
        # Only print the remembered line if it was a data line, not a "#" header.
        print "@last\n" if @last and $last[0] !~ /^#/;
    }
} continue {
    @last = @F;
}
print "@last\n";
I have a comma delimited file with 12 columns.
There is a problem with the 5th and 6th columns: they contain extra commas (the text in the 5th and 6th columns is identical, but may have extra commas embedded in it).
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
So in the above example "Hey There,How are you" should not have a comma.
I need to remove extra commas in 5th and 6th column.
If you always want to remove the 5th comma, try
sed 's/,//5' input.txt
But you are saying it may have extra commas; you have to provide the logic for finding out whether the extra commas are there or not.
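For instance, if you knew there were exactly one extra comma in each of the two identical columns, you could chain two substitutions (the second occurrence number accounts for the comma already removed):

$ sed 's/,//5; s/,//6' input.txt
2011,123456,1234567,12345678,Hey ThereHow are you,Hey ThereHow are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two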
If you know the number of commas, you can use awk. This has proven to be quite an exercise; I am sure someone else will come up with a more elegant solution, but I'll share mine anyway:
awk -f script.awk input.txt
with script.awk:
BEGIN{
  FS=","
}
NF<=12{
  print $0
}
NF>12{
  for (i=1; i<=4; i++) printf $i FS
  for (j=0; j<2; j++){
    for (i=0; i<=(NF-12)/2; i++){
      printf $(i+5)
      if (i<(NF-12)/2) printf "_"
      else printf FS
    }
  }
  for (i=NF-5; i<=NF; i++) printf $i FS
  printf "\n"
}
First we set the field separator to ,. If we count 12 or fewer fields, everything is fine and we simply print the whole line. If there are more than 12 fields, we print the first 4 fields (again with the field separator), then print the 5th column twice (it is identical to the 6th), joining its pieces with _ instead of ,. In the end we print the remaining fields.
As I said, there is probably a more elegant solution to this. I wonder what other people will come up with.
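For what it's worth, running the script on the question's sample line (saved as input.txt) should produce the following; the extra commas become _ and there is a trailing separator you may want to trim:

$ awk -f script.awk input.txt
2011,123456,1234567,12345678,Hey There_How are you,Hey There_How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two,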
If all the other fields are numeric, you can try to preserve the useful commas by that criterion (using back-references so the digits themselves are kept):
sed -r 's/,([0-9])/;\1/g' a | sed -r 's/([0-9]),/\1;/g' | sed 's/,//g' | awk -F\; '{ print $1 "," $2 "," $3 "," $4 "," substr($5, 1, length($5)/2) "," substr($5, length($5)/2 + 1, length($5)/2) "," $6 "," $7 }'
2011,123456,1234567,12345678,Hey ThereHow are you,Hey ThereHow are you,882864309037,ABC ABCDLABACD
Note that the comma between ABC ABCD and LABACD is removed as well, since neither of its neighbours is a digit.
You can try with perl and its Text::CSV_XS module:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;

my (@columns);
open my $fh, '<', shift or die;
my $csv = Text::CSV_XS->new or die;
while ( my $row = $csv->getline( $fh ) ) {
    undef @columns;
    if ( @$row <= 12 ) {
        @columns = @$row;
        next;
    }
    my $extra_columns = ( @$row - 12 ) / 2;
    # Index of the first field after the two comma-split name columns.
    my $post_columns_index = 4 + 2 * ( $extra_columns + 1 );
    @columns = (
        @$row[0..3],
        (join( '', @$row[4..(4+$extra_columns)] )) x 2,
        @$row[$post_columns_index..$#$row]
    );
}
continue {
    $csv->print( \*STDOUT, \@columns );
    printf "\n";
}
Assuming an input file (infile) with three lines, where the first one has one extra comma in each of the duplicated columns, the second one has two extra commas in each, and the third one is correct:
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
Run the script like:
perl script.pl infile
That yields:
2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
Note that it adds some quotes, but that is correct per the CSV specification and easier to handle than the previous state.
Possible Duplicate:
Finding the Nth Occurrence of a Match line
I have multi-line text with:
ship
plane
ship
car
I want to find the 1st occurrence of "ship" in a line and output:
The ship is a 1'st one.
I want to find the 2nd occurrence of "ship" in a line and output:
The ship is a 2'nd one.
use strict;
use warnings;
use Lingua::EN::Inflect 'ORD';

my $text = "ship plane\n ship\n car";
my @ships = $text =~ /(ship)/g;

for ( my $i = 0; $i < @ships; $i++ ) {
    my $ship_num = ORD($i + 1);
    print "The ship is the $ship_num one\n";
}
&ORD will take care of your ordinal suffixes.
$n = 1;
while (<>)
{
    if (m%ship%)
    {
        print "The ship is the $n th one\n"; # You can do the st, nd etc
        ++$n;
    }
}
Should work
perl -lne'
  if (/^ship$/) {
    ++$seen;
    print "The ship is a 1'\''st one" if $seen == 1;
    print "The ship is a 2'\''nd one" if $seen == 2;
  }
' file
Or if you want more than the first two:
perl -lne'
  print "The ship is a ", ++$seen, " one" if /^ship$/;
' file
There's probably a module to handle finding the right suffix. I'll let you look into that.
Is this what you're looking for?
my $th = ( ( int( $n % 100 / 10 ) != 1 )
           && [ qw<th st nd rd> ]->[ $n % 10 ]
         )
         || 'th';

print "The ship is the $n$th one.\n";