sed/awk specify ranges in regex

I have a file list output like this:
/path/F201405151800
/path/F201405151900
/path/F201405152000
/path/F201405152100
I piped this output to sed and used the following syntax:
sed -n '/F.\{8\}'$var1'/,/F.\{8\}'$var2'/p'
$var1 and $var2 are user inputs and as it can be seen, they refer to hours of the day in my files list. The above syntax works perfectly if $var1 and $var2 values are found. But if the value of $var1 is 16, and $var2 is 19 sed will not output anything because 16 will not be found in the above file list range.
A solution to this was:
sed -n '/F.\{8\}1[6-9]/p'
...which works, but now I need to specify double-digit ranges in order to cover something like 16-20. I tried brace expansion between the single-quoted parts (as I do with the variables), like this:
sed -n '/F.\{8\}'{16..21}'/p'
...but the output I get is:
sed: can't read /F.\{8\}16/p: No such file or directory
sed: can't read /F.\{8\}17/p: No such file or directory
sed: can't read /F.\{8\}18/p: No such file or directory
sed: can't read /F.\{8\}19/p: No such file or directory
sed: can't read /F.\{8\}20/p: No such file or directory
sed: can't read /F.\{8\}21/p: No such file or directory
I don't strictly need sed. I explored some options with awk but could not get what I want; the main issue is that I can't figure out how to specify a regex RS so that the hour block becomes an awk field and I can apply conditions like
'$2 > 16 && $2 < 21 {print}'

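You can use this awk, which strips the leading F plus the eight date digits and the trailing two minute digits, leaving only the hour in h:
awk -F'/' '{h=$3; gsub(/^F[0-9]{8}|[0-9]{2}$/, "", h)} h+0 >= 16 && h+0 <= 21' file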
/path/F201405151800
/path/F201405151900
/path/F201405152000
/path/F201405152100

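To match a double-digit range you can use an alternation. Alternation with ( | ) needs extended regular expressions (-E), where the interval is written without backslashes:
sed -n -E '/F.{8}(1[6-9]|2[01])/p'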
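When the bounds come from user input, as with $var1 and $var2 in the question, the alternation can be generated instead of written by hand. The following is a minimal sketch and not part of the original answers: hours_regex and filter_hours are hypothetical helper names, and it assumes a sed that accepts -E and bounds given without leading zeros.

```shell
# Hypothetical helper: build an alternation such as "16|17|18|19|20"
# from two inclusive hour bounds (given without leading zeros).
hours_regex() {
    h=$1
    out=
    while [ "$h" -le "$2" ]; do
        # Zero-pad so that hour 8 becomes "08", matching the file names.
        out="${out}${out:+|}$(printf '%02d' "$h")"
        h=$((h + 1))
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: print only the lines whose hour field is in range.
filter_hours() {
    sed -n -E "/F.{8}($(hours_regex "$1" "$2"))/p"
}

printf '%s\n' /path/F201405151800 /path/F201405151900 \
    /path/F201405152000 /path/F201405152100 | filter_hours 16 20
```

With the four sample paths this prints the 1800, 1900 and 2000 entries. Neither bound has to occur in the input, which was the problem with the /start/,/end/ address range.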

Try to add -e:
sed -n -e '/F.\{8\}'"$var1"'/,/F.\{8\}'"$var2"'/p'
And always quote the variables to prevent word splitting.
Update:
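awk -v A=17 -v B=21 -F/ '!/\/F[0-9]{12}$/{next} {h = substr($NF, 10, 2) + 0} h >= A && h <= B'
Adding 0 makes h a number, so the comparison is numeric and single-digit bounds such as A=9 also work.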

Related

sed & regex expression

I'm trying to add a 'chr' prefix to the lines that don't already have it. This is needed only for lines that do not start with '#'.
At first I used grep and sed, as follows, but I want the command to overwrite the original file.
grep -v "^#" 5b110660bf55f80059c0ef52.vcf | grep -v 'chr' | sed 's/^/chr/g'
So, to edit the file in place, I wrote this:
sed -i -E '/^#.*$|^chr.*$/ s/^/chr/' 5b110660bf55f80059c0ef52.vcf
This is the content of the vcf file.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="#ref plus strand,#ref minus strand, #alt plus strand, #alt minus strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 24430-0009S21_GM17-12140
1 955597 95692 G T 1382 PASS VARTYPE=1;BGN=0.00134309;ARL=150;DER=53;DEA=55;QR=40;QA=39;PBP=1091;PBM=300;TYPE=SNP;DBXREF=dbSNP:rs115173026,g1000:0.2825,esp5400:0.2755,ExAC:0.2290,clinvar:rs115173026,CLNSIG:2,CLNREVSTAT:mult,CLNSIGLAB:Benign;SGVEP=AGRN|+|NM_198576|1|c.45G>T|p.:(p.Pro15Pro)|synonymous GT:DP:AD:DP4 0/1:125:64,61:50,14,48,13
chr1 957898 82729935 G T 1214 off_target VARTYPE=1;BGN=0.00113362;ARL=149;DER=50;DEA=55;QR=38;QA=40;PBP=245;PBM=978;NVF=0.53;TYPE=SNP;DBXREF=dbSNP:rs2799064,g1000:0.3285;SGVEP=AGRN|+|NM_198576|2|c.463+56G>T|.|intronic GT:DP:AD:DP4 0/1:98:47,51:9,38,10,41

If I understand your expected result correctly, try:
sed -ri '/^(#|chr)/! s/^/chr/' file

Your question isn't clear and you didn't provide the expected output, so we can't test a potential solution. But if all you want is to add chr to the start of lines that neither start with chr already nor start with #, then that's just:
awk '!/^(#|chr)/{$0="chr" $0} 1' file
To overwrite the original file using GNU awk would be:
awk -i inplace '!/^(#|chr)/{$0="chr" $0} 1' file
and with any awk:
awk '!/^(#|chr)/{$0="chr" $0} 1' file > tmp && mv tmp file
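Before overwriting a real file, the one-liner can be tried on a few sample lines. This is only a sketch: add_chr is a hypothetical wrapper name, and the input lines are made-up, shortened stand-ins for VCF records.

```shell
# Hypothetical wrapper around the awk one-liner, so it can be tried on
# sample data before touching a real file.
add_chr() {
    awk '!/^(#|chr)/{$0="chr" $0} 1' "$@"
}

# Two header lines, one bare record, one record that already has the prefix.
printf '%s\n' '##fileformat=VCFv4.1' '#CHROM POS ID' \
    '1 955597 95692' 'chr1 957898 82729935' | add_chr
```

Only the third line changes, becoming chr1 955597 95692; the header lines and the already-prefixed record pass through untouched.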

This can be done with a single sed invocation. The script itself is something like the following.
If you have an input of format
$ echo -e '#\n#\n123chr456\n789chr123\nabc'
#
#
123chr456
789chr123
abc
then prepending chr to the non-commented lines that do not contain chr can be done as
$ echo -e '#\n#\n123chr456\n789chr123\nabc' | sed '/^#/ {p
d
}
/chr/ {p
d
}
s/^/chr/'
which prints
#
#
123chr456
789chr123
chrabc
(Note the multiline sed script.)
Now you only need to run this script on a file in-place (-i in modern sed versions.)
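Since p followed by d (with default printing on) amounts to passing the line through unchanged, the same logic can be written more compactly with b, which branches to the end of the script. A sketch of that variant follows; prepend_chr is a hypothetical name, and separate -e options are used so that each b command ends its own script chunk.

```shell
# Hypothetical helper: skip comment lines and lines containing chr,
# prefix everything else.
prepend_chr() {
    sed -e '/^#/b' -e '/chr/b' -e 's/^/chr/' "$@"
}

printf '%s\n' '#' '#' '123chr456' '789chr123' 'abc' | prepend_chr
```

Only abc is changed, to chrabc. Note that /chr/ matches chr anywhere in the line; for data where only a leading chr counts, /^chr/ is probably the safer address.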

How to remove a space between matching words?

I've read a lot of questions about how to replace spaces from a file but I have the following problem:
I have a file like so:
<foo>"crazy foo"</foo> <bar>dull-bar</bar>
and I'm trying to remove spaces between > < and only those ones so the file would be like:
`<foo>"crazy foo"</foo><bar>dull-bar</bar>`
So far I've tried to remove them using sed and tr. sed is not working at all, and tr '> <' '><' outputs:
<foo>"crazy foo"</foo><<bar>dull-bar</bar>

sed -i -e "s/> *</></g" YourFile
-i means YourFile is modified in place. Remove this option to test the command and display the result on standard output.
' *' (a space followed by *) matches zero or more spaces.
The g at the end of the sed expression means "replace all occurrences".
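A quick way to check the substitution without touching a file is to feed it the sample line from the question. The sketch also hints at why tr could not work here: tr maps single characters to single characters and has no notion of the sequence '> <'. squeeze_tags is a hypothetical helper name.

```shell
# Hypothetical helper: delete any run of spaces that sits between > and <.
squeeze_tags() {
    sed -e 's/> *</></g' "$@"
}

printf '%s\n' '<foo>"crazy foo"</foo> <bar>dull-bar</bar>' | squeeze_tags
```

The space inside "crazy foo" survives because it is not directly between a > and a <.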

You could try something like this
echo '<foo>"crazy foo"</foo> <bar>dull-bar</bar>' | sed 's/>[[:space:]]*</></g'

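Alternatively, split each line on the double quotes and strip spaces only from the part after the quoted text:
awk -F'"' -v OFS='"' 'NF >= 3 {gsub(/ /, "", $3)} 1' file.txt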

Pipe awk's results to sed (deletion)

I am using an awk command (someawkcommand) that prints these lines (awkoutput):
>Genome1
ATGCAAAAG
CAATAA
and then, I want to use this output (awkoutput) as the input of a sed command. Something like that:
someawkcommand | sed 's/awkoutput//g' file1.txt > results.txt
file1.txt:
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The final objective is to delete all lines in a file (file1.txt) containing the exact pattern found previously by awk.
The file results.txt contains (output of sed):
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
How should I write the sed command? Is there any simple way that sed will recognize the output of awk as its input?

Using GNU awk for multi-char RS:
$ cat file1
>Genome1
ATGCAAAAG
CAATAA
$ cat file2
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
$ gawk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0;next} {sub(rmv,"")} 1' file1 file2
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The parts that might be non-obvious to newcomers, but are very common awk idioms:
-v RS='^$' tells awk to read the whole file as one string (instead of its default, one line at a time).
-v ORS= sets the Output Record Separator to the null string (instead of its default, a newline) so that when the file is printed as a string awk doesn't add a newline after it.
NR==FNR is a condition that is only true for the first input file.
1 is a true condition invoking the default action of printing the current record.
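One caveat with sub(rmv, "") is that rmv is treated as a regular expression, so characters such as . or * in the block would be interpreted rather than matched literally. A possible variant for any POSIX awk, sketched here under the assumption that both files fit in memory and the first file is not empty, collects each file into a string and removes the first literal occurrence with index() and substr(). remove_block is a hypothetical name.

```shell
# Hypothetical helper: remove_block BLOCKFILE DATAFILE
# Prints DATAFILE with the first literal occurrence of BLOCKFILE's
# contents removed.
remove_block() {
    awk '
        NR == FNR { rmv = rmv $0 "\n"; next }   # first file: text to remove
        { text = text $0 "\n" }                 # second file: text to edit
        END {
            p = index(text, rmv)                # literal search, no regex
            if (p)
                text = substr(text, 1, p - 1) substr(text, p + length(rmv))
            printf "%s", text
        }
    ' "$1" "$2"
}

# Usage: remove_block file1 file2
```

index() finds the block wherever it occurs, including in the middle of a line, so this is only as strict as the block itself is distinctive.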

Here is a possible sed solution:
someawkcommand | sed -n 's_.*_/&/d;_;H;${x;s_\n__g p}' | sed -f - file1.txt
First sed command turns output from someawkcommand into a sed expression.
Concretely, it turns
>Genome1
ATGCAAAAG
CAATAA
into:
/>Genome1/d;/ATGCAAAAG/d;/CAATAA/d;
(in sed language: delete lines containing those patterns; mind that you will have to escape /,[,],*,^,$ in your awk output if there are some, with another substitution for instance).
Second sed command reads it as input expression (-f - reads sed commands from file -, i.e. gets it from pipe) and applies to file file1.txt.
Remark for other readers:
OP wants to use sed but, as noted in the comments, it may not be the easiest way to solve this question. Deleting lines with awk could be simpler. Another (easy) solution is to use grep with the -v (invert match) and -f (read patterns from a file) options, like this:
someawkcommand | grep -v -f - file1.txt
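Note that grep treats each pattern as a regular expression and removes every line containing a match, so a short pattern such as CAA would also delete longer lines that contain it. If whole-line, literal matching is what is wanted, -x and -F can be added. A sketch, with drop_lines as a hypothetical name and the patterns read from a file argument instead of the pipe:

```shell
# Hypothetical helper: drop_lines PATTERNFILE DATAFILE
# Removes only the lines that are exactly equal to one of the pattern lines.
drop_lines() {
    grep -v -x -F -f "$1" "$2"
}

# Usage: drop_lines patterns.txt file1.txt
```

This still works line by line, not stanza by stanza: a line in another stanza that is identical to one of the patterns would be removed as well.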
Edit: Following @rici's comments, here is a new command that takes the output from awk as a single multiline pattern.
Disclaimer: It gets dirty. Kids, don't try this at home. Grown-ups are strongly encouraged to consider avoiding sed for that.
someawkcommand | \
sed -n 'H;${x;s_\n__;s_\n_\\n_g;s_.*_H;${x;s/\\n//;s/&//g p}_ p}' | \
sed -n -f - file1.txt
Output from inner sed is:
H;${x;s/\n//;s/>Genome1\nATGCAAAAG\nCAATAA//g p}
Additional drawback: it will add an empty line instead of removed pattern. Can't fix it easily (problems if pattern is at beginning/end of file). Add a substitution to remove it if you really feel like it.

This can more easily be done in awk, but the usual "eliminate duplicates" code is not correct. As I understand the question, the goal is to remove entire stanzas from the file.
Here's a possible solution which assumes that the first awk script outputs a single stanza:
awk 'NR == FNR {stanza[nstanza++] = $0; next}
$0 == stanza[i] {++i; next}
/^>/ && i == nstanza {i=0}
i {for (j=0; j<i; ++j) print stanza[j]; i=0}
{print $0;}
' <(someawkcommand) file1.txt

This might work for you (GNU sed):
sed '1{h;s/.*/:a;$!{N;ba}/p;d};/^>/!{H;$!d};x;s/\n/\\n/g;s|.*|s/&\\n*//g|p;$s|.*|s/\\n*$//|p;x;h;d' file1 |
sed -f - file2
This builds a script from file1 and then runs it against file2.
The script slurps in file2 and then does one or more global substitutions using the contents of file1. Finally, it removes any blank lines at the end of the file caused by the deletion.
To see the script produced from file1, remove the pipe and the second sed command.
An alternative way would be to use diff and sed:
diff -e file2 file1 | sed 's/d/p/g' | sed -nf - file2

Bash - how to put each line within quotation

I want to put each line within quotation marks, such as:
abcdefg
hijklmn
opqrst
convert to:
"abcdefg"
"hijklmn"
"opqrst"
How to do this in Bash shell script?

Using awk
awk '{ print "\""$0"\""}' inputfile
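Using pure bash (IFS= and -r keep leading whitespace and backslashes intact; printf avoids echo's escape handling)
while IFS= read -r line; do
    printf '"%s"\n' "$line"
done < inputfile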
where inputfile would be a file containing the lines without quotes.
If your file has empty lines, awk is definitely the way to go:
awk 'NF { print "\""$0"\""}' inputfile
NF tells awk to only execute the print command when the Number of Fields is more than zero (line is not empty).

I use the following command:
xargs -I{lin} echo \"{lin}\" < your_filename
xargs takes standard input (redirected from your file), passes one line at a time to the {lin} placeholder, and then runs the command that follows, in this case an echo with escaped double quotes.
With GNU xargs you can use the -i option to omit the name of the placeholder (it then defaults to {}), although -i is deprecated in favor of -I{}:
xargs -i echo \"{}\" < your_filename
In both cases xargs still applies its own quote and backslash processing to each input line, so lines that contain quotes or backslashes may not come through unchanged.

This sed should work for ignoring empty lines as well:
sed -i.bak 's/^..*$/"&"/' inFile
or
sed 's/^.\{1,\}$/"&"/' inFile

Use sed:
sed -e 's/^\|$/"/g' file
More effort needed if the file contains empty lines.

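I think sed and awk are the best solutions, but if you want to use just the shell, here is a small script for you.
#!/bin/bash
chr="\""
file="file.txt"
# Keep a backup copy before rewriting the file.
cp "$file" "${file}._backup"
# IFS= and -r preserve leading whitespace and backslashes in each line.
while IFS= read -r line
do
echo "${chr}${line}${chr}"
done < "$file" > newfile
mv newfile "$file"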

paste -d\" /dev/null your-file /dev/null
(not the nicest looking, but probably the fastest)
Now, if the input may contain quotes, you may need to escape them with backslashes (and then escape backslashes as well) like:
sed 's/["\]/\\&/g; s/.*/"&"/' your-file
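To see what the escaping does, the command can be run on lines that contain a double quote and a backslash. A small sketch, with quote_lines as a hypothetical wrapper name:

```shell
# Hypothetical wrapper: backslash-escape " and \, then wrap the line in quotes.
quote_lines() {
    sed 's/["\]/\\&/g; s/.*/"&"/' "$@"
}

printf '%s\n' 'say "hi"' 'a\b' | quote_lines
```

The first line comes out as "say \"hi\"" and the second as "a\\b".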

This answer worked for me in mac terminal.
$ awk '{ printf "\"%s\",\n", $0 }' your_file_name
It should be noted that the text in double quotes and commas was printed out in terminal, the file itself was unaffected.

I used sed with two expressions to replace start and end of line, since in my particular use case I wanted to place HTML tags around only lines that contained particular words.
So I searched the text file inputfile for lines containing the word held in the bla variable, and replaced the beginning of each such line with <P> and the end with </P> (well, actually I did some longer HTML tagging in the real thing, but this will serve fine as an example).
Similar to:
$ bla=foo
$ sed -e "/${bla}/s#^#<P>#" -e "/${bla}/s#\$#</P>#" inputfile
<P>foo</P>
bar
$
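The two expressions can also be folded into one by matching the whole line and reusing it with & in the replacement. A sketch of that variant follows; wrap_p is a hypothetical name, and like the original it treats the search word as a regular expression and assumes the word contains no / character.

```shell
# Hypothetical helper: wrap_p WORD [FILE...]
# Wraps every line matching WORD in <P>...</P>.
wrap_p() {
    bla=$1
    shift
    sed -e "/${bla}/s#.*#<P>&</P>#" "$@"
}

printf '%s\n' foo bar | wrap_p foo
```

This prints <P>foo</P> followed by an unchanged bar, the same result as the two-expression form.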

With sed or awk, how do I match from the end of the current line back to a specified character?

I have a list of file locations in a text file. For example:
/var/lib/mlocate
/var/lib/dpkg/info/mlocate.conffiles
/var/lib/dpkg/info/mlocate.list
/var/lib/dpkg/info/mlocate.md5sums
/var/lib/dpkg/info/mlocate.postinst
/var/lib/dpkg/info/mlocate.postrm
/var/lib/dpkg/info/mlocate.prerm
What I want to do is use sed or awk to read from the end of each line until the first forward slash (i.e., pick the actual file name from each file address).
I'm a bit shaky on syntax for both sed and awk. Can anyone help?

$ sed -e 's!^.*/!!' locations.txt
mlocate
mlocate.conffiles
mlocate.list
mlocate.md5sums
mlocate.postinst
mlocate.postrm
mlocate.prerm
Regular-expression quantifiers are greedy, which means .* matches as much of the input as possible. Read a pattern of the form .*X as "the last X in the string." In this case, we're deleting everything up through the final / in each line.
I used bangs rather than the usual forward-slash delimiters to avoid a need for escaping the literal forward slash we want to match. Otherwise, an equivalent albeit less readable command is
$ sed -e 's/^.*\///' locations.txt
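The same reasoning can be turned around to keep the directory part instead: anchor at the end of the line and delete the last / together with everything after it. A short sketch on one of the sample paths:

```shell
path=/var/lib/dpkg/info/mlocate.list

# .* is greedy, so this deletes everything up to and including the last /.
printf '%s\n' "$path" | sed -e 's!^.*/!!'

# [^/]*$ can only match the final component, so this keeps the directory.
printf '%s\n' "$path" | sed -e 's!/[^/]*$!!'
```

The first command prints mlocate.list and the second prints /var/lib/dpkg/info.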

Use the basename command:
$ basename /var/lib/mlocate
mlocate

I am for "basename" too, but for the sake of completeness, here is an awk one-liner:
awk -F/ 'NF>0{print $NF}' <file.txt

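There's really no need to use sed or awk here; simply use basename. Reading the list line by line and quoting the variable keeps paths with spaces or glob characters intact:
while IFS= read -r file; do
    basename -- "$file"
done < filelist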
If you want the directory part instead use dirname.

Pure Bash:
while read -r line
do
[[ ${#line} != 0 ]] && echo "${line##*/}"
done < files.txt
Edit: Excludes blank lines.
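The ${line##*/} expansion removes the longest prefix matching */, which is why only the file name is left. Its counterpart ${line%/*} removes the shortest suffix matching /* and leaves the directory. A small sketch (the variable names are arbitrary):

```shell
p=/var/lib/dpkg/info/mlocate.conffiles
file_name=${p##*/}   # longest match of */ removed from the front
dir_name=${p%/*}     # shortest match of /* removed from the end
printf '%s\n' "$file_name" "$dir_name"
```

This prints mlocate.conffiles and then /var/lib/dpkg/info, without starting any external process.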

This would do the trick too, if file contains the list of paths (-d and -a are GNU xargs options):
$ xargs -d '\n' -n 1 -a file basename

This is a less-clever, plodding version of gbacon's:
sed -e 's/^.*\/\([^\/]*\)$/\1/'

@OP, you can use awk
awk -F"/" 'NF{ print $NF }' file
NF is the number of fields; $NF is the value of the last field.
or with the shell
while read -r line
do
line=${line##*/} # means longest match from the front till the "/"
[ -n "$line" ] && echo "$line"
done <"file"
NB: if you have big files, use awk.
