delete characters in lines starting with a unique pattern - regex

I have a file consisting of many entries that look like this:
>1761420406686363113470.1
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
i.e. a header line starting with > followed by one or more lines of sequence, then the next header line, and so on.
I am trying to write a sed script that goes to only the lines that start with > (not the sequence lines) and deletes all but the first 10 numbers.
There are a lot of similar questions to this, but I can't figure it out. I've been trying variations on this code:
sed 's/^>..........*/^>........../' input.fasta
but clearly I am not doing it right.

This might work for you (GNU sed):
sed -r 's/^(>.{10}).*/\1/p;d' file
This deletes all lines except those on which a substitution was made. If you want to retain the sequence lines as well, then
sed -r 's/^(>.{10}).*/\1/' file
should fit the bill.
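For example, run against the sample entry from the question (assuming it is saved as input.fasta), the second command should give:
$ sed -r 's/^(>.{10}).*/\1/' input.fasta
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA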

You have to capture the part you want to keep (the > plus the first 10 characters) in parentheses:
sed -e 's/^\(>..........\).*/\1/'
Which can be shortened to
sed -e 's/^\(>.\{10\}\).*/\1/'

as an alternative to sed, use cut
$ echo ">1761420406686363113470.1" | cut -c1-11
>1761420406
To operate only on lines starting with a >, wrap it in a bash while loop
$ while read -r line; do if [[ $line == \>* ]]; then cut -c1-11 <<< "$line"; else echo "$line"; fi; done < input
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
or using awk:
$ awk '{if ($0 ~ /^>/){print substr($0,1,11)}else{print}}' input
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA

Since good sed answers are already posted, here is a GNU awk solution.
gawk '/^>/{print gensub(/(.{11}).*/,"\\1","G",$1);next }1' inputFile
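As a rough check (assuming the same input.fasta as above), gensub keeps 11 characters because the leading > counts as one of them:
$ gawk '/^>/{print gensub(/(.{11}).*/,"\\1","G",$1);next }1' input.fasta
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA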


Deleting lines matching a pattern from a Unix file

I have a file containing strings of the following format:
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
05|DELETE|REDEFINES|VARIABLE.
05|KEEP2|REDEFINES|VARIABLE2
|PIC|9(5).
I want to be able to use something like sed or awk to delete lines containing the word REDEFINES, but NOT if the word PIC is also in there, or if there is no full stop at the end of the line, as that means the string has been split over 2 lines. So out of the 4 lines (3 strings) stated above, I would only want to delete 05|DELETE|REDEFINES|VARIABLE.
I thought you might be able to use some kind of negation or lookahead, but these don't seem to be available, or I can't get them to work.
Using awk, this deletes any string containing REDEFINES that follows the pattern in the example above:
awk '!/[[:print:]]*\REDEFINES[[:print:]]*\./'
Similarly using sed:
sed '/[[:print:]]*|REDEFINES[[:print:]]*\./d'
I just can't work out how to extend it to do what I need. Is this possible in sed or awk or do I need another tool?
Any help greatly appreciated.
Using awk
awk -v RS= '!/REDEFINES/ || /PIC/' file
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
05|KEEP2|REDEFINES|VARIABLE2
|PIC|9(5).
Using sed (with older input data):
sed -i.bak '/REDEFINES/{/PIC/!d;}' file
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
You can try the command below: print the line if it contains PIC or if it does not contain REDEFINES. It is maintainable because it is not tricky and can be understood without much effort.
cat input.txt | awk '{if ($0 ~ /PIC/ || $0 !~ /REDEFINES/){print $0}}'
Why don't you just use grep? Negating the conditions in your question, here is what I understood:
keep the lines terminated with a full-stop, containing both REDEFINES and PIC.
So grep seems easy:
$ grep -E 'REDEFINES.*\.$' file | grep PIC
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
Hope this helps.
This might work for you (GNU sed):
sed -r '/REDEFINES/{/PIC|[^.]$/!d}' file
or perhaps more easily:
sed '/PIC/b;/REDEFINES.*\.$/d' file
or if you prefer:
sed '/PIC/!{/REDEFINES.*\.$/d}' file
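Applied to the four sample lines from the question, all three commands should leave everything except the 05|DELETE line, e.g.:
$ sed -r '/REDEFINES/{/PIC|[^.]$/!d}' file
05|KEEP|REDEFINES|NO_TYPE|PIC|9.
05|KEEP2|REDEFINES|VARIABLE2
|PIC|9(5).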

Match a string that could have a newline anywhere in it - bash

I have a string containing a number that is represented as follows:
\S2=number_goes_here\
The number could be anything from 0.00000 and up. However, there could be a newline anywhere in that string, and I am not entirely sure how to go about matching that. Ultimately, I just want the number from this. Importantly, this string is amidst a large chunk of text that can be represented by this sample (S2 is found on the last line there):
1.454187\H,0,0.719618,3.525801,1.633708\H,0,-0.454651,2.80328,2.23844\
Ru,0,0.025774,1.557599,-0.253913\\Version=EM64L-G09RevD.01\State=6-A\H
F=-1238.5377983\S2=8.75446\S2-1=0.\S2A=8.750006\RMSD=2.314e-09\Dipole=
I'm open to bash, sed, awk, gawk; whatever thoughts you have to address this.
EDIT:
Here is an example; the first answer below does not seem to work correctly for it. It only prints "2."
.631441,-2.132979\H,0,0.20151,-1.464802,-2.95553\H,0,0.377883,-2.50668
5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.
053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769
grep -Po '(?<=S2=)[\d.]+' <(tr -d '\n' < file)
gives
8.75446
You can use perl, read the whole file in slurp mode, remove newline characters and search it using a regular expression:
perl -0777 -nE '
$_ = join q||, split /\n/;
printf qq|%s\n|, $1 if m/\\S2=([\d.]+)/
' infile
It yields:
8.75446
Also possible using just bash, though this won't work so well for very large files.
#!/bin/bash
IFS=$'\n'
string=$(<"test.txt")
var=$(echo $string) # word-splitting will replace each newline with a space here
while IFS= read -r word; do
[[ $word =~ '\S2='([0-9]*\.[0-9]*)'\' ]] && echo ${BASH_REMATCH[1]}
done <<< "$var"
e.g.
> ./abovescript
8.75446
Here is a GNU awk version (required because RS contains multiple characters):
awk -F'\' 'NR==2 {print $1}' RS="S2=" file
8.75446
A version that works with most awks:
awk -F\\ '{for (i=1;i<=NF;i++) if ($i~/S2=/) {split($i,a,"=");print a[2]}}' file
8.75446

Bash - how to put each line within quotation marks

I want to put each line within quotation marks, such as:
abcdefg
hijklmn
opqrst
convert to:
"abcdefg"
"hijklmn"
"opqrst"
How to do this in Bash shell script?
Using awk
awk '{ print "\""$0"\""}' inputfile
Using pure bash
while IFS= read -r FOO; do
echo "\"$FOO\""
done < inputfile
where inputfile would be a file containing the lines without quotes.
If your file has empty lines, awk is definitely the way to go:
awk 'NF { print "\""$0"\""}' inputfile
NF tells awk to only execute the print command when the Number of Fields is more than zero (line is not empty).
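For instance, with a hypothetical input that has a blank line in the middle, the blank line is simply skipped:
$ printf 'abcdefg\n\nhijklmn\n' | awk 'NF { print "\""$0"\""}'
"abcdefg"
"hijklmn"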
I use the following command:
xargs -I{lin} echo \"{lin}\" < your_filename
xargs takes standard input (redirected from your file), passes one line at a time to the {lin} placeholder, and then executes the command that follows, in this case an echo with escaped double quotes.
You can use the -i option of xargs to omit the name of the placeholder, like this:
xargs -i echo \"{}\" < your_filename
In both cases, your IFS must be at its default value, or at least contain '\n'.
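With the three sample lines from the question in your_filename, either form should print:
$ xargs -I{lin} echo \"{lin}\" < your_filename
"abcdefg"
"hijklmn"
"opqrst"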
This sed should work for ignoring empty lines as well:
sed -i.bak 's/^..*$/"&"/' inFile
or
sed 's/^.\{1,\}$/"&"/' inFile
Use sed:
sed -e 's/^\|$/"/g' file
More effort needed if the file contains empty lines.
I think sed and awk are the best solutions, but if you want to use just the shell, here is a small script for you.
#!/bin/bash
chr="\""
file="file.txt"
cp $file $file."_backup"
while read -r line
do
echo "${chr}$line${chr}"
done <$file > newfile
mv newfile $file
paste -d\" /dev/null your-file /dev/null
(not the nicest looking, but probably the fastest)
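Assuming your-file holds the three sample lines, this should produce exactly the desired output:
$ paste -d\" /dev/null your-file /dev/null
"abcdefg"
"hijklmn"
"opqrst"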
Now, if the input may contain quotes, you may need to escape them with backslashes (and then escape backslashes as well) like:
sed 's/["\]/\\&/g; s/.*/"&"/' your-file
This worked for me in the macOS Terminal:
$ awk '{ printf "\"%s\",\n", $0 }' your_file_name
Note that the quoted text (with trailing commas) is printed to the terminal; the file itself is unaffected.
I used sed with two expressions to replace the start and the end of the line, since in my particular use case I wanted to place HTML tags around only those lines that contained particular words.
So I searched for the lines containing the word held in the bla variable within the text file inputfile and replaced the beginning with <P> and the end with </P> (well, actually I did some longer HTML tagging in the real thing, but this will serve fine as an example).
Similar to:
$ bla=foo
$ sed -e "/${bla}/s#^#<P>#" -e "/${bla}/s#\$#</P>#" inputfile
<P>foo</P>
bar
$

Using sed to find and replace within matched substrings

I'd like to use sed to process a property file such as:
java.home=/usr/bin/java
groovy-home=/usr/lib/groovy
workspace.home=/build/me/my-workspace
I'd like to replace the .'s and -'s with _'s, but only up to the = token. The output would be
java_home=/usr/bin/java
groovy_home=/usr/lib/groovy
workspace_home=/build/me/my-workspace
I've tried various approaches including using addresses but I keep failing. Does anybody know how to do this?
What about...
$ echo foo.bar=/bla/bla-bla | sed -e 's/\([^-.]*\)[-.]\([^-.]*=.*\)/\1_\2/'
foo_bar=/bla/bla-bla
This won't work for the case where you have more than 1 dot or dash on the left, though. I'll have to think about it further.
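For instance, with a hypothetical key containing two dots, only the last separator before the = is replaced:
$ echo foo.bar.baz=/bla/bla-bla | sed -e 's/\([^-.]*\)[-.]\([^-.]*=.*\)/\1_\2/'
foo.bar_baz=/bla/bla-bla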
awk makes life easier in this case:
awk -F= -vOFS="=" '{gsub(/[.-]/,"_",$1)}1' file
here you go:
kent$ echo "java.home=/usr/bin/java
groovy-home=/usr/lib/groovy
workspace.home=/build/me/my-workspace"|awk -F= -vOFS="=" '{gsub(/[.-]/,"_",$1)}1'
java_home=/usr/bin/java
groovy_home=/usr/lib/groovy
workspace_home=/build/me/my-workspace
if you really want to do it with sed (GNU sed):
sed -r 's/([^=]*)(.*)/echo -n \1 \|sed -r "s:[-.]:_:g"; echo -n \2/ge' file
same example:
kent$ echo "java.home=/usr/bin/java
groovy-home=/usr/lib/groovy
workspace.home=/build/me/my-workspace"|sed -r 's/([^=]*)(.*)/echo -n \1 \|sed -r "s:[-.]:_:g"; echo -n \2/ge'
java_home=/usr/bin/java
groovy_home=/usr/lib/groovy
workspace_home=/build/me/my-workspace
In this case I would use AWK instead of sed:
awk -F"=" '{gsub("\\.|-","_",$1); print $1"="$2;}' <file.properties>
Output:
java_home=/usr/bin/java
groovy_home=/usr/lib/groovy
workspace_home=/build/me/my-workspace
This might work for you (GNU sed):
sed -r 's/=/\n&/;h;y/-./__/;G;s/\n.*\n//' file
"You wait ages for a bus..."
This works with any number of dots and hyphens in the line and does not require GNU sed:
sed 'h; s/.*=//; x; s/=.*//; s/[.-]/_/g; G; s/\n/=/' < data
Here's how:
h: save a copy of the line in the hold space
s: throw away everything up to and including the equal sign in the pattern space (keeping the value)
x: swap the pattern and hold
s: remove the = and everything after it in the pattern (keeping the key)
s: replaces dots and hyphens with underscores
G: join the pattern and hold with a newline
s: replace that newline with an equal to glue it all back together
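Tracing the last sample line through those steps (showing the pattern space after each command) may make the flow clearer:
start:        workspace.home=/build/me/my-workspace
h:            workspace.home=/build/me/my-workspace   (hold space now holds a copy)
s/.*=//:      /build/me/my-workspace
x:            workspace.home=/build/me/my-workspace   (hold space now holds /build/me/my-workspace)
s/=.*//:      workspace.home
s/[.-]/_/g:   workspace_home
G:            workspace_home\n/build/me/my-workspace
s/\n/=/:      workspace_home=/build/me/my-workspace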
Another way using sed:
sed -re 's/(.*)([.-])(.*)=(.*)/\1_\3=\4/g' temp.txt
Output
java_home=/usr/bin/java
groovy_home=/usr/lib/groovy
workspace_home=/build/me/my-workspace
In case there is more than one . or - on the left-hand side, then use this:
sed -re ':a; s/^([^.-]+)([\.-])(.*)=/\1_\3=/1;t a' temp.txt
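For example, with a hypothetical key that has several dots and dashes, the loop keeps substituting until none remain before the =:
$ echo "my-workspace.home.dir=/build/me/my-workspace" | sed -re ':a; s/^([^.-]+)([\.-])(.*)=/\1_\3=/1;t a'
my_workspace_home_dir=/build/me/my-workspace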

With sed or awk, how do I match from the end of the current line back to a specified character?

I have a list of file locations in a text file. For example:
/var/lib/mlocate
/var/lib/dpkg/info/mlocate.conffiles
/var/lib/dpkg/info/mlocate.list
/var/lib/dpkg/info/mlocate.md5sums
/var/lib/dpkg/info/mlocate.postinst
/var/lib/dpkg/info/mlocate.postrm
/var/lib/dpkg/info/mlocate.prerm
What I want to do is use sed or awk to read from the end of each line until the first forward slash (i.e., pick the actual file name from each file address).
I'm a bit shaky on the syntax for both sed and awk. Can anyone help?
$ sed -e 's!^.*/!!' locations.txt
mlocate
mlocate.conffiles
mlocate.list
mlocate.md5sums
mlocate.postinst
mlocate.postrm
mlocate.prerm
Regular-expression quantifiers are greedy, which means .* matches as much of the input as possible. Read a pattern of the form .*X as "the last X in the string." In this case, we're deleting everything up through the final / in each line.
I used bangs rather than the usual forward-slash delimiters to avoid a need for escaping the literal forward slash we want to match. Otherwise, an equivalent albeit less readable command is
$ sed -e 's/^.*\///' locations.txt
Use the basename command:
$ basename /var/lib/mlocate
mlocate
I am for "basename" too, but for the sake of completeness, here is an awk one-liner:
awk -F/ 'NF>0{print $NF}' <file.txt
There's really no need to use sed or awk here; simply use basename:
IFS=$'\n'
for file in $(cat filelist); do
basename $file;
done
If you want the directory part instead, use dirname.
Pure Bash:
while read -r line
do
[[ ${#line} != 0 ]] && echo "${line##*/}"
done < files.txt
Edit: Excludes blank lines.
This would do the trick too, if the file contains the list of paths:
$ xargs -d '\n' -n 1 -a file basename
This is a less-clever, plodding version of gbacon's:
sed -e 's/^.*\/\([^\/]*\)$/\1/'
#OP, you can use awk
awk -F"/" 'NF{ print $NF }' file
NF means the number of fields; $NF means the value of the last field.
or with the shell
while read -r line
do
line=${line##*/} # remove the longest match of */ from the front, i.e. keep everything after the last / (see the example below)
[ ! -z "$line" ] && echo $line
done <"file"
NB: if you have big files, use awk.