I have the following code:
function replaceappend() {
awk -v old="^$2" -v new="$3" '
sub(old,new) { replaced=1 }
{ print }
END { if (!replaced) print new }
' "$1" > /tmp/tmp$$ &&
mv /tmp/tmp$$ "$1"
}
replaceappend "/etc/ssh/sshd_config" "Port" "Port 222"
It works perfectly, but I am looking to modify it so that it replaces the entire line's contents rather than just the matching text.
At the moment it would do this:
Port 1234 -> Port 222 1234
I want it to work like this:
Port 1234 -> Port 222
The closest code I can find to do this is found here:
awk 'NR==4 {$0="different"} { print }' input_file.txt
This would replace the entire line of the match with the new content. How can I implement this into my existing code?
Just change:
sub(old,new) { replaced=1 }
to:
$0~old { $0=new; replaced=1 }
or:
sub(".*"old".*",new) { replaced=1 }
If you want to replace the entire line you can simplify your function. To avoid problems with metacharacters in the variables you pass to awk, I would suggest using a simple string search too:
awk -vold="$2" -vnew="$3" 'index($0,old)==1{f=1;$0=new}1;END{if(!f)print new}' "$1"
index returns the character position of the string that you are searching for, starting at 1. If the string old is at the start of the line, then the line is changed to the value of new. The 1 after the block is always true so every line is printed (this is a common shorthand for an unconditional {print} block).
As mklement0 has pointed out in the comments, the variables you pass to awk are still subject to some interpretation: for example, the string \n will be interpreted as a newline character, \t as a tab character, etc. However, this issue is much less significant than it would be using regular expressions, where things like a . would match any character.
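A minimal sketch of how the index-based approach could be folded back into the original function. The name replaceappend_literal is made up for illustration, and passing the strings through ENVIRON instead of -v is an assumption on my part about how to sidestep the escape-sequence issue; it is not part of the answer above.

```shell
# Hypothetical variant of replaceappend (the name is illustrative only).
# Replaces any line that starts with the literal string $2 by $3,
# or appends $3 if no line matched.
replaceappend_literal() {
    rl_tmp=$(mktemp) || return 1
    # Pass the strings through the environment: unlike -v assignments,
    # ENVIRON values are not scanned for escapes such as \n or \t.
    old=$2 new=$3 awk '
        BEGIN { old = ENVIRON["old"]; new = ENVIRON["new"] }
        index($0, old) == 1 { $0 = new; found = 1 }   # literal prefix match
        { print }
        END { if (!found) print new }                 # nothing matched: append
    ' "$1" > "$rl_tmp" &&
    cat "$rl_tmp" > "$1"      # overwrite in place, keeping the file's permissions
    rl_status=$?
    rm -f "$rl_tmp"
    return "$rl_status"
}
```

Called as replaceappend_literal /etc/ssh/sshd_config "Port" "Port 222", it should turn "Port 1234" into "Port 222" and leave every other line alone.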
Again, use a regular expression for that which you want to replace:
replaceappend port.txt "Port.*" "Port 222"
Here you are replacing Port (if it starts the line, as per your function definition) plus whatever follows until the end of the line with "Port 222".
EDIT: To make this part of the function instead of requiring it in the call, modify it to
function replaceappend() {
awk -v old="^$2.*" -v new="$3" '
sub(old,new) { replaced=1 }
{ print }
END { if (!replaced) print new }
' "$1" > /tmp/tmp$$ &&
mv /tmp/tmp$$ "$1"
}
I have to create a shellscript that indexes a book (text file) by taking any words that are encapsulated in angled brackets (<>) and making an index file out of that. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are encapsulated within angled brackets.
I found a similar question that was asked but required words inside of square brackets and tried to manipulate their code but am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of the angled brackets and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following? It does not guarantee any output order; if you need the result sorted, pipe the output through sort.
awk '
BEGIN{
FS=":"
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
for(key in name){
print key": "name[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Appending $1 to the entry for $2 (separated by OFS), or starting a new entry if $2 has not been seen yet.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
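Where grep has no -P option (it is a GNU extension), the same extraction can be sketched with a match() loop in awk. The function name extract_bracketed is made up for this example; the output imitates the line:word format of grep -on.

```shell
# Print "lineNumber:word" for every <word> on every input line.
extract_bracketed() {
    awk '{
        rest = $0
        while (match(rest, /<[^<>]+>/)) {
            # RSTART points at "<", so the word is the match minus both brackets
            print NR ":" substr(rest, RSTART + 1, RLENGTH - 2)
            rest = substr(rest, RSTART + RLENGTH)   # continue after the ">"
        }
    }' "$@"
}

printf '%s\n' 'a <big> b <sun>' 'none' '<but>' | extract_bracketed
```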
Something like this might be what you need, it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
.
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
for (lineNr=1; lineNr<=NF; lineNr++) {
line = $lineNr
idx = 1
while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
word = substr(line,idx+RSTART,RLENGTH-2)
locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
idx += (RSTART + RLENGTH - 1)
}
}
}
END {
for (word in locs) {
print word, locs[word]
}
}
.
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data (Plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '
{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort
I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with > and followed by a line with a long string of characters -AGTCNR. The two lines make a complete record.
Here is what a simple file looks like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------NNNN
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
--------NNNTCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAANNN-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA--NNAGTNNNNNNNNNNNNNNNAATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
NNNNNNNNNNNTCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
With the following code I can search the second line of each record (the sequence string) and extract the records that have at most a given number of -, N or n characters at the beginning of the line (the start_Ns variable) and at the end of the line (the end_Ns variable); this is done in the thread here:
start_Ns=10
end_Ns=10
awk -v start_N=$start_Ns -v end_N=$end_Ns ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs.aln
Now I need to search for occurrences of -, N or n characters in the region between the leading and trailing runs of the second line of every record, and filter out records that have more than a given maximum number of such characters there. The code below does it, but I need to use a variable that I can easily reset:
start_Ns=10
end_Ns=10
awk -v start_N=10 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N && match($0,/N{0,11}/) {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
As for a variable, I tried the following but it failed:
awk -v start_N=10 -v mid_N=11 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/N{0,mid_N}/) && match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
Expected results:
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAANNN-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
I would suggest the following logic in order not to overcomplicate things.
Search the beginning part, remove it from the string
Search the end part, remove it from the string
Search the middle part in the remainder:
awk -v start_N=10 -v mid_N=11 -v end_N=10 '
/^>/{hdr=$0; next}
{ seq=$0 }
match(seq,/^[-Nn]*/) && RLENGTH > start_N { next }
{ seq=substr(seq,RSTART+RLENGTH) }
match(seq,/[-Nn]*$/) && RLENGTH > end_N { next }
{ seq=substr(seq,1,RSTART-1) }
{ while (match(seq,/[-Nn]+/)) {
if(RLENGTH>mid_N) next
seq=substr(seq,RSTART+RLENGTH)
}
}
{ print hdr; print $0 }' file
An alternative method is to build extended regular expressions with interval expressions ({n,}) from the variables:
awk -v start_N=10 -v mid_N=11 -v end_N=10 '
(FNR==1) { ere_start = "^[-Nn]{" (start_N+1) ",}"
ere_end = "[-Nn]{" (end_N+1) ",}$"
ere_mid = "[^-Nn][-Nn]{" (mid_N+1) ",}[^-Nn]" }
/^>/{hdr=$0; next}
{ seq=$0 }
match(seq,ere_start) { next }
match(seq,ere_end) { next }
match(seq,ere_mid) { next }
{ print hdr; print $0 }' file
You can use a string as the second argument to match and then the regular string interpolation operators in Awk work fine.
awk -v start_N=10 -v mid_N=11 -v end_N=10 ' /^>/ {
hdr=$0; next }
match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,"N{0," mid_N "}") &&
match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }'
Just to spell this out a bit, if you use /regex/ then the text between the slashes is immediately interpreted as a regular expression, but if you use "regex" or a piece of code which evaluates to a string, the regular Awk string-handling functions are processed first, and only then is the resulting string interpreted as a regular expression.
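To illustrate the difference, here is a small sketch that builds the regex from a string at run time. The function name leading_run and its parameter are hypothetical; the example deliberately avoids {n,m} intervals, which some older awk implementations do not support.

```shell
# Print the length of the leading run of "gap" characters on each line.
# $1: the characters to treat as gaps, spliced into a bracket expression.
leading_run() {
    awk -v chars="$1" '
        BEGIN { re = "^[" chars "]*" }       # ordinary string concatenation
        { match($0, re); print RLENGTH }     # the string is used as a regex here
    '
}

printf '%s\n' '---NNACGT' 'ACGT--' | leading_run '-Nn'   # prints 5, then 0
```

Had the pattern been written as /^[chars]*/, awk would look for the literal letters c, h, a, r and s, because nothing inside the slashes is treated as a variable.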
Thanks for your question. In my humble opinion, you should rephrase your question a bit and make sure your objective is 100% clear to all potential readers of this thread.
With regards to having a variable inside a construct in which awk doesn't allow the use of a variable, there is a standard trick that would apply whichever scripting tool you would use (e.g. sed or even some more complex stuff in perl or Python): interrupt your awk script by breaking the single-quote construct, and you insert in there a variable expansion that is performed by the shell, not by awk. For instance, here, you would define mid_N in Bash and then use "${mid_N}" in the middle of your awk script, with a closing single quote immediately before and a (re-)opening single quote immediately after. Like so:
mid_N=11
awk -v start_N=10 -v end_N=10 ' /^>/ {
hdr=$0; next }; match($0,/^[-Nn]*/) && RLENGTH<=start_N &&
match($0,/N{0,'"${mid_N}"'}/) && match($0,/[-Nn]*$/) && RLENGTH<=end_N {
print hdr; print }' infile.aln > without_shortseqs_mids.aln
That's a minimal-edit solution to the specific issue you mentioned below your "As for a variable i tried the following but failed:"
I have three unescaped adversarial shell variables.
$mystring
$old
$new
Remember, all three strings are adversarial. They will contain special characters. They will contain everything possible to mess up the replace. If there is a loophole in your replace, the strings will exploit it.
What is the simplest function to replace $old with $new in $mystring?
(I couldn't find any solution in stack overflow for a generic substitution that will work in all cases).
There's nothing fancy here -- the only thing you need to do to ensure that your values are treated as literals in a parameter expansion is to ensure that you're quoting the search value, as described in the relevant section of BashFAQ #21:
result=${mystring/"$old"/$new}
Without the double quotes on the inside, $old would be interpreted as a fnmatch-style glob expression; with them, it's literal.
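A short sketch of the quoting rule in action; it needs bash, since ${var/pattern/replacement} is not POSIX sh. The function name replace_first_literal is made up for the example. Quoting the replacement as well is an extra precaution of mine, not something the answer above requires: as far as I know, bash 5.2 and later can expand an unquoted & in the replacement to the matched text when the patsub_replacement option is on.

```shell
#!/usr/bin/env bash
# Replace the first literal occurrence of $2 in $1 with $3.
replace_first_literal() {
    local mystring=$1 old=$2 new=$3 result
    # "$old" quoted: glob characters such as * ? [ are matched literally.
    # "$new" quoted: assumption, guards against & handling in newer bash.
    result=${mystring/"$old"/"$new"}
    printf '%s\n' "$result"
}

replace_first_literal 'price: $5 * 2 [approx]' '* 2 [approx]' '& up'
```

Without the quotes around $old, the leading * in the search string would act as a wildcard and swallow everything from the start of the string.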
To operate on streams instead, consider gsub_literal, also described in BashFAQ #21:
# usage: gsub_literal STR REP
# replaces all instances of STR with REP. reads from stdin and writes to stdout.
gsub_literal() {
# STR cannot be empty
[[ $1 ]] || return
# string manip needed to escape '\'s, so awk doesn't expand '\n' and such
awk -v str="${1//\\/\\\\}" -v rep="${2//\\/\\\\}" '
# get the length of the search string
BEGIN {
len = length(str);
}
{
# empty the output string
out = "";
# continue looping while the search string is in the line
while (i = index($0, str)) {
# append everything up to the search string, and the replacement string
out = out substr($0, 1, i-1) rep;
# remove everything up to and including the first instance of the
# search string from the line
$0 = substr($0, i + len);
}
# append whatever is left
out = out $0;
print out;
}
'
}
some_command | gsub_literal "$search" "$rep"
...which can also be used for in-place replacement on files using techniques from the following (yet again taken from the previously-linked FAQ):
# Using GNU tools to preserve ownership/group/permissions
gsub_literal "$search" "$rep" < "$file" > tmp &&
chown --reference="$file" tmp &&
chmod --reference="$file" tmp &&
mv -- tmp "$file"
I am trying to write an AWK script to parse a file of the form
> field1 - field2 field3 ...
lineoftext
anotherlineoftext
anotherlineoftext
and I am checking using regex if the first line is correct (begins with a > and then has something after it) and then print all the other lines. This is the script I wrote but it only verifies that the file is in a correct format and then doesn't print anything.
#!/bin/bash
# FASTA parser
awk ' BEGIN { x = 0; }
{ if ($1 !~ />.*/ && x == 0)
{ print "Not a FASTA file"; exit; }
else { x = 1; next; }
print $0 }
END { print " - DONE - "; }'
Basically you can use the following awk command:
awk 'NR==1 && /^>./ {p=1} p' file
On the first row NR==1 it checks whether the line starts with a > followed by "something" (/^>./). If that condition is true the variable p will be set to one. The p at the end checks whether p evaluates true and prints the line in that case.
If you want to print the error message, you need to invert the logic a bit:
awk 'NR==1 && !/^>./ {print "Not a FASTA file"; exit 1} 1' file
In this case the program prints the error message and exits if the first line does not start with a > followed by at least one character. Otherwise all lines get printed, because 1 always evaluates to true.
Taking the question literally (check only the first line):
awk 'NR==1{p=$0~/^>/}p' YourFile
# shorter version, as suggested by @EdMorton
awk 'NR==1{p=/^>/}p' YourFile
To print from the first line that starts with > onward (that line included):
awk '!p{p=$0~/^>/}p' YourFile
# shorter version, as suggested by @EdMorton
awk '!p{p=/^>/}p' YourFile
Since all you care about is the first line, you can just check that, then exit.
awk 'NR > 1 { exit (0) }
! /^>/ { print "Not a FASTA file" >"/dev/stderr"; exit (1) }' file
As noted in comments, the >"/dev/stderr" is a nonportable hack which may not work for you. Regard it as a placeholder for something slightly more sophisticated if you want a tool which behaves as one would expect from a standard Unix tool (run silently if no problems; report problems to standard error).
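Putting the pieces above together, here is a sketch of what a slightly more complete check might look like: it validates the header, reports problems on standard error, and prints only the lines after the header, which is what the question asked for. The function name fasta_body is hypothetical, and piping to cat 1>&2 is one commonly used portable way to reach stderr from awk.

```shell
# Validate the first line, then print every line after it.
fasta_body() {
    awk '
        NR == 1 {
            if ($0 !~ /^>./) {
                print "Not a FASTA file" | "cat 1>&2"   # portable write to stderr
                exit 1
            }
            next                                         # header is valid: skip it
        }
        { print }                                        # every later line
    ' "$@"
}

printf '%s\n' '> field1 - field2 field3' 'lineoftext' 'anotherlineoftext' | fasta_body
```

The exit status is non-zero when the header check fails, so a calling script can test it with if or &&.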
Please Help - I'm very rusty with my sed/awk/grep and I'm attempting to process a file (an export of a PDF that is around 4700 pages long).
Here is what I'm attempting to do: search/print line matching pattern 1, search for line matching pattern 2 and print that line and all lines from it until pattern 3 (if it includes/prints the line with pattern 3, I'm ok with it at this point), and search/print lines matching pattern 4.
All of the above patterns should occur in order (pattern 1,2,3,4) several hundred times in the file and I need to keep them in order.
Pattern 1: lines beginning with 1-5 and a whitespace (this is specific enough despite it seeming vague)
Pattern 2: lines beginning with (all caps) SOLUTION:
Pattern 3: lines beginning with (all caps) COMPLIANCE:
Pattern 4: lines beginning with an IP address
Here's what I've cobbled together, but it's clearly not working:
#!/bin/bash
#
sed '
/^[1-5]\s/p {
/^SOLUTION/,/^COMPLIANCE/p {
/^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/p }
}' sample.txt
To use p in sed you also need -n, and -r to enable extended regular expressions.
Here is how it should look:
sed -r -n '{
/^[1-5] /p
/^SOLUTION/,/^COMPLIANCE/p
/^([0-9]{1,3}[\.]){3}[0-9]{1,3}/p
}' sample.txt
You probably want something like this, untested since you didn't provide any sample input or expected output:
awk '
BEGIN { state = 0 }
/^[1-5] / { if (state ~ /[01]/) { block = $0; state = 1 } }
/^SOLUTION/ { state = (state ~ /[12]/ ? 2 : 0) }
state == 2 { block = block ORS $0 }
/^COMPLIANCE/ { state = (state == 2 ? 3 : state) }
/^([0-9]{1,3}\.){3}[0-9]{1,3}/ { if (state == 3) { print block ORS $0; state = 0 } }
' file
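For comparison, here is a streaming sketch that prints lines as it goes instead of buffering a block. It does not enforce the 1,2,3,4 ordering the way the state machine above does. The function name extract_findings and the sample input are invented for illustration, and [0-9]+ stands in for {1,3} because some older awk implementations lack interval expressions.

```shell
# Print pattern-1 lines, SOLUTION..COMPLIANCE blocks, and IP-address lines.
extract_findings() {
    awk '
        !inblock && /^[1-5][ \t]/ { print; next }     # pattern 1
        /^SOLUTION:/              { inblock = 1 }     # pattern 2 opens a block
        inblock                   { print }           # every line inside the block
        /^COMPLIANCE:/            { inblock = 0; next }   # pattern 3 closes it
        !inblock && /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/ { print }   # pattern 4
    ' "$@"
}

# Made-up sample input, only to show the shape of the output.
printf '%s\n' \
    '1 Weak SSH configuration' \
    'Some description text' \
    'SOLUTION: Disable root login' \
    'and restart sshd' \
    'COMPLIANCE: none' \
    'ignored line' \
    '10.0.0.5 (tcp/22)' | extract_findings
```

On this sample, the description line and the "ignored line" are dropped and the other five lines are printed in their original order.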