Find specific words and replace with capitals - regex

I need to use Unix and create an awk script. The first part of the script is to find the words "Ant", "Ass", and "Ape" in a text file and replace each with the same word in all caps.
Do I use gsub to find each occurrence? If I do:
{gsub(/Ass/, "ASS"); print}
{gsub(/Ape/, "APE"); print}
{gsub(/Ant/, "ANT"); print}
it just prints every line of the file 3 or 4 times... how can I search and replace these three words and then print out only the modified line?
The second part of the program is to track the number of lines with matches to Ass, Ape, or Ant and the number of substitutions made.
Thanks for your help!

Do all the substitutions in a single action block (each of your three blocks runs its own print, which is why every line came out three times):
{subs += gsub(/Ass/, "ASS"); subs += gsub(/Ape/, "APE"); subs += gsub(/Ant/, "ANT"); print; }
END { print "Total substitutions:", subs; }
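If you also need the count of lines that had at least one match (the second part of the question), the same idea extends naturally; the sample lines below are made up for illustration:

```shell
printf '%s\n' 'An Ant met an Ape' 'no match here' 'Ass and Ant' |
awk '{
  n = gsub(/Ass/, "ASS") + gsub(/Ape/, "APE") + gsub(/Ant/, "ANT")  # substitutions on this line
  subs += n                  # running total of substitutions
  if (n) lines++             # count lines with at least one match
  print
}
END { print "Matching lines:", lines; print "Total substitutions:", subs }'
```

gsub returns the number of substitutions it made, so summing its return values gives both totals for free.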

sed 's/Ant/ANT/g; s/Ass/ASS/g; s/Ape/APE/g'

Another way (GNU awk, since IGNORECASE is gawk-specific):
awk '
BEGIN {IGNORECASE=1}
{
s = 1
while (match(substr($0, s), /ass|ape|ant/) > 0) {
c = substr($0, s + RSTART - 1, RLENGTH)
sub(c, toupper(c))
s += RSTART + RLENGTH - 1
}
print
}' input


Match strings in two files using awk and regexp

I have two files.
File 1 includes various types of SeriesDescriptions
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
...
File 2 contains information with only one SeriesDescription
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
...
I want to
compare the two files and find the lines that match for "SeriesDescription" and
print the line number of the matched text from File 1.
Expected Output:
"SeriesDescription": "Type_*" 24 11 (the correct line numbers in my files)
"SeriesDescription" will always be found on line 11 of File 2. I am having trouble matching given the * and have also tried changing it to .* without luck.
Code I have tried:
grep -nf File1.txt File2.txt
Successfully matches, but I want the line number from File1
awk 'FNR==NR{l[$1]=NR; next}; $1 in l{print $0, l[$1], FNR}' File2.txt File1.txt
This finds a match and prints the line number from both files; however, it matches on the first column only, and it always reports the last line of File 1 as the match (every line of File 1 has the same first column, so each new line overwrites the stored line number).
awk 'FNR==NR{l[$2]=$3;l[$2]=NR; next}; $2 in l{print $0, l[$2], FNR}' File2.txt File1.txt
Does not produce a match.
I have also tried various settings of FS=":" without luck. I am not sure if the trouble is coming from the regex or the use of "" in the files or something else. Any help would be greatly appreciated!
With your shown samples, please try the following. Written and tested in GNU awk; it should work in any awk.
awk '
{ val="" }
match($0,/^[^_]*_/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
if(val){
arr[val]=$0 OFS FNR
}
next
}
(val in arr){
print arr[val] OFS FNR
}
' SeriesDescriptions file2
With your shown samples output will be:
"SeriesDescription": "Type_*" 1 3
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{ val="" } ##Nullifying val here.
match($0,/^[^_]*_/){ ##Using match to match value till 1st occurrence of _ here.
val=substr($0,RSTART,RLENGTH) ##Creating val which has sub string of above matched regex.
gsub(/[[:space:]]+/,"",val) ##Globally substituting spaces with NULL in val here.
}
FNR==NR{ ##This will execute when first file is being read.
if(val){ ##If val is NOT NULL.
arr[val]=$0 OFS FNR ##Create arr with index of val, which has value of current line OFS and FNR in it.
}
next ##next will skip all further statements from here.
}
(val in arr){ ##Checking if val is present in arr then do following.
print arr[val] OFS FNR ##Printing arr value with OFS, FNR value.
}
' SeriesDescriptions file2 ##Mentioning Input_file name here.
Bonus solution: if the above works for you and the match occurs only once in your file2, you can exit after printing to speed things up; in that case use the above code in the following form.
awk '
{ val="" }
match($0,/^[^_]*_/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
if(val){
arr[val]=$0 OFS FNR
}
next
}
(val in arr){
print arr[val] OFS FNR
exit
}
' SeriesDescriptions file2
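For reference, a self-contained run with assumed miniature versions of the two files, laid out like the samples in the question:

```shell
# Assumed miniature sample files mirroring the question's layout.
cat > SeriesDescriptions <<'EOF'
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
EOF
cat > file2 <<'EOF'
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
EOF
# Key = text up to the first "_", with whitespace squeezed out, so
# "Type_*" and "Type_(Joe_text)" normalize to the same string.
awk '
{ val="" }
match($0,/^[^_]*_/){ val=substr($0,RSTART,RLENGTH); gsub(/[[:space:]]+/,"",val) }
FNR==NR{ if(val) arr[val]=$0 OFS FNR; next }
(val in arr){ print arr[val] OFS FNR }
' SeriesDescriptions file2
# prints: "SeriesDescription": "Type_*" 1 3
```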

How can I group unknown (but repeated) words to create an index?

I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
That part has been answered. I also now need to take my index and reformat it, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following (if you are not worried about sorting order; if you need sorted output, pipe the result to sort).
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Creating array named name with index of $2 and value of $1 which is keep appending to its same index value.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
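For example, on an assumed three-line file (GNU grep is required, since -P enables PCRE):

```shell
# Assumed sample file with words in angle brackets.
printf '%s\n' 'a <big> day' 'no brackets' 'the <sun> rose' > index.txt
# -o prints only the match, -n prefixes the line number.
grep -oPn '<\K[^<>]+(?=>)' index.txt
# 1:big
# 3:sun
```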
Something like this might be what you need; it outputs the paragraph number, the line number within the paragraph, and the character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

Search text for matching strings to print line AND matching string

I am using awk to search a large text for matching strings. My goal is to print the matching string, the line number, and the matching line. I haven't been able to achieve the first part (i.e., printing the matching string).
Currently I have:
awk '/string1/string2/string3/{ print NR, $0 }' file_to_search.txt
This produces the line number and matching line, but not the matching string.
Any help is appreciated.
Your question isn't clear, but it sounds like this might be what you want:
awk 'match($0,/regexp/){ print substr($0,RSTART,RLENGTH), NR, $0 }' file_to_search.txt
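match() only reports the first match on each line; if a line can contain several of the strings, a loop over the remainder of the line reports them all. Shown here on made-up input:

```shell
printf '%s\n' 'foo string1 bar string2' 'none here' |
awk '{
  line = $0
  while (match(line, /string1|string2|string3/)) {
    print substr(line, RSTART, RLENGTH), NR, $0   # matched string, line number, full line
    line = substr(line, RSTART + RLENGTH)         # keep searching the remainder
  }
}'
# string1 1 foo string1 bar string2
# string2 1 foo string1 bar string2
```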
With grep, you can get the line number and either the matching string or the complete line, but (as far as I know) not both at once:
grep -wnoF "string1
string2
string3" infile
With awk, you can get everything you asked for:
awk '
function findmatch(s, i) {
for (i=1;i<=NF;i++)
{if ($i == s)
{print "find string = "s,"on line number = "FNR,"Complete line = "$0}};
}
s1{findmatch(s1)}
s2{findmatch(s2)}
s3{findmatch(s3)}
' s1='string1' s2='string2' s3='string3' infile

Remove string and add sequential number to file headers using awk or sed

I have following input:
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*
And I would like to
1. remove the linebreaks in the lines after each header starting with >
2. remove the asterisk
3. change the fasta header
I could do 1. and 2.
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }'
sed "s/\*//g"
and I can also add a sequential number to the end of the header line:
awk '/^>/{$0=$0"_"(++i)}1'
but I am failing at the last step with the replacing/removing and adding a sequential number:
desired output
>TM0001|hypothetical_protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein_of_unknown_function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease_P_protein_component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
According to your desired output, a gawk solution:
awk 'BEGIN{ RS=">"; FS="[|\\]\\[]" }!$0{ next }
{ gsub(/^ */,"",$3); gsub(/[*[:space:]]/,"",$5); printf(">TM%04d|%s\n%s\n",++c,$3,$5)
}' yourfile
The output:
>TM0001|hypothetical protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein of unknown function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease P protein component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
Details:
RS=">" - considering > as record separator
FS="[|\\]\\[]" - field separator, any of characters |[]
!$0{ next } - skip empty records
gsub(/^ */,"",$3) - remove leading spaces in the 3rd field
gsub(/[*[:space:]]/,"",$5) - replace/remove asterisk * and whitespace characters within the 5th field
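The question's desired output additionally replaces the spaces inside the description with underscores; a variant of the same program that also does that, run here on an assumed one-record sample:

```shell
# Assumed one-record FASTA sample, shaped like the question's input.
cat > prot.fa <<'EOF'
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEF
CPMRTEGFEE*
EOF
awk 'BEGIN{ RS=">"; FS="[|\\]\\[]" } !$0{ next }
{ gsub(/^ *| *$/,"",$3); gsub(/ /,"_",$3)   # trim, then spaces -> underscores
  gsub(/[*[:space:]]/,"",$5)                # join sequence lines, drop the *
  printf(">TM%04d|%s\n%s\n",++c,$3,$5) }' prot.fa
# >TM0001|hypothetical_protein
# LIAPTMILRIRLTEFCPMRTEGFEE
```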

sed: replace whitespace characters with a single comma, except inside quotes

This line is from a car dataset (https://archive.ics.uci.edu/ml/datasets/Auto+MPG)
looking like this:
15.0 8. 429.0 198.0 4341. 10.0 70. 1. "ford galaxie 500"
How would one replace each run of whitespace (the line has both spaces and tabs) with a single comma, but not inside the quotes, preferably using sed, to turn the dataset into a real CSV? Thanks!
Do it with awk:
awk -F'"' 'BEGIN { OFS="\"" } { for(i = 1; i <= NF; i += 2) { gsub(/[ \t]+/, ",", $i); } print }' filename.csv
Using " as the field separator, every odd-numbered field is a part of the line outside the quotes, where spaces should be replaced. Then:
BEGIN { OFS = FS } # output should also be separated by "
{
for(i = 1; i <= NF; i += 2) { # in every second field
gsub(/[ \t]+/, ",", $i) # replace spaces with commas
}
print # and print the whole shebang
}
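Applied to the sample line (assumed here with a mix of spaces and a trailing tab before the quoted field):

```shell
printf '15.0   8.   429.0  198.0  4341.  10.0   70.  1.\t"ford galaxie 500"\n' |
awk -F'"' 'BEGIN { OFS = FS } { for (i = 1; i <= NF; i += 2) gsub(/[ \t]+/, ",", $i); print }'
# 15.0,8.,429.0,198.0,4341.,10.0,70.,1.,"ford galaxie 500"
```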
This might work for you (GNU sed):
sed 's/\("[^"]*"\|[0-9.][0-9.]*\)\s\s*/\1,/g' file
This takes a quoted string or a decimal number followed by white space and replaces the white space with a comma, throughout each line. (The number branch must match at least one character; a bare [0-9.]* could match the empty string and would turn the spaces inside the quotes into commas.)
To be less specific use (as per comments):
sed -E 's/("[^"]*"|\S+)(\s+|$)/\1,/g; s/,$//' file
The (\s+|$) alternative and the final s/,$// keep a quoted field at the very end of the line intact; with a bare \s+ the \S+ branch would split it.
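A quick check of an end-of-line-safe variant on the sample line (GNU sed: \s, \S, and -E as used here are GNU extensions); the (\s+|$) alternative accepts a quoted field with no trailing whitespace, and the final s/,$// drops the comma that the $ branch adds:

```shell
printf '15.0   8.   429.0  198.0  4341.  10.0   70.  1.  "ford galaxie 500"\n' |
sed -E 's/("[^"]*"|\S+)(\s+|$)/\1,/g; s/,$//'
# 15.0,8.,429.0,198.0,4341.,10.0,70.,1.,"ford galaxie 500"
```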