sed one-liner to remove all single newlines? - regex

So for example,
A paragraph's newlines would be removed let's say
it contained only single
newlines.
Then the things I would want it to skip:
However.
Our previous pair of newlines wouldn't be removed.

It’s not a sed solution — although you can always run any sed through s2p of course — but a very easy solution using perl is:
% perl -i.orig -ne 'print unless /^$/' file1 file2 file3
That has the advantage of being extensible to any whitespace on the otherwise blank lines, like spaces and tabs:
% perl -i.orig -ne 'print unless /^\s*$/' file1 file2 file3
In the event that you have files with various line endings, like CR or CRLF, you could also do this, assuming you are running perl 5.10 or better:
% perl -0777 -i.orig -pe 's/\R+/\n/g' file1 file2 file3
which will normalize all sequences of one or more Unicode line separators into single newlines.
If you have UTF-8 files that might have (for example) U+00A0 NO-BREAK SPACE in them on otherwise empty lines, you can handle them by telling perl that they are UTF-8 using the -CSD command-line switch:
% perl -CSD -i.orig -ne 'print unless /^\s*$/' file1 file2 file3
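For instance, on a hypothetical sample.txt containing an empty line and a whitespace-only line (just a sketch to show the effect; the -i.orig switch keeps a backup of the original):
% printf 'one\n\n   \ntwo\n' > sample.txt
% perl -i.orig -ne 'print unless /^\s*$/' sample.txt
% cat sample.txt
one
two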
UPDATE
I’m really unclear what you mean by removing a paragraph. I think you just mean joining up lines in a paragraph.
If so — if what you want to do is squeeze newlines from paragraphs, then you want to do this:
% perl -i.orig -00 -ple 's/\s*\n\s*/ /g' file1 file2 file3
It may not look like it works, but it does: try it.

Here's a sed solution.
$ sed -n -e '1{${p;b};h;b};/^$/!{H;$!b};x;s/\(.\)\n/\1 /g;p' 5751270.txt
A paragraph would be removed let's say it contained only single newlines.
However.
Our previous pair of newlines wouldn't.
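For comparison, awk's paragraph mode does the same joining with less hold-space juggling: setting RS to the empty string makes every blank-line-separated block a single record, whose embedded newlines can then be turned into spaces. This is an extra sketch, not part of the answer above, but it should print the same joined paragraphs:
awk 'BEGIN{RS=""} {gsub(/\n/," "); print}' 5751270.txt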

You can try this bash script:
#!/bin/bash
exec 8<"file"
while read -r line <&8
do
    if (( ${#line} > 0 )); then
        read -r next <&8
        if (( ${#next} > 0 )); then
            continue
        else
            echo "$line"
            echo "$next"
        fi
    fi
done
exec 8<&-


Sed replace pattern with file contents

I would like to use sed (I think?) to replace a pattern with file contents.
Example
File 1 (primary file)
Hello <<CODE>>
Goodbye.
File 2 (contents)
Anonymous person,
I am very nice.
File 3 (target)
Hello Anonymous person,
I am very nice.
Goodbye.
Right now, I am using this command:
sed "/<<CODE>>/{r file2
:a;n;ba}" file1 | \
sed "s/<<CODE>>//g" > \
file3
But this outputs:
Hello
Anonymous person,
I am very nice.
Goodbye.
(note the newline after Hello)
How can I do this without getting that newline?
(note that file2 may contain all sorts of things: brackets, newlines, quotes, ...)
Much simpler to use awk:
awk 'FNR==NR{s=(!s)?$0:s RS $0;next} /<<CODE>>/{sub(/<<CODE>>/, s)} 1' file2 file1
Hello Anonymous person,
I am very nice.
Goodbye.
Explanation:
FNR==NR - Execute this block only for the first file on the command line, i.e. file2
s=(!s)?$0:s RS $0 - Concatenate the whole file's content into the string s
next - Skip to the next line; this repeats until EOF on the first file
/<<CODE>>/ - If a line containing <<CODE>> is found, execute that block
sub(/<<CODE>>/, s) - Replace <<CODE>> with the string s (the contents of file2)
1 - Print the line
EDIT: Non-regex way:
awk 'FNR==NR{s=(!s)?$0:s RS $0; next}
i=index($0, "<<CODE>>"){$0=substr($0, 1, i-1) s substr($0, i+8)} 1' file2 file1
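The index()/substr() version also sidesteps a subtle trap: in sub(), an & in the replacement string is expanded to the matched text. Suppose, hypothetically, that file2 contained the single line Tom & Jerry; then the two versions would differ:
$ awk 'FNR==NR{s=(!s)?$0:s RS $0;next} /<<CODE>>/{sub(/<<CODE>>/, s)} 1' file2 file1
Hello Tom <<CODE>> Jerry
Goodbye.
$ awk 'FNR==NR{s=(!s)?$0:s RS $0; next} i=index($0, "<<CODE>>"){$0=substr($0, 1, i-1) s substr($0, i+8)} 1' file2 file1
Hello Tom & Jerry
Goodbye.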
The awk, though it might be harder to read, is probably the right way to go.
Just for comparison's sake, here's a ruby version:
ruby -ne 'BEGIN{@body=File.open("file2").read}; puts gsub(/<<CODE>>/,@body);' < file1
not too bad in Perl:
cat file1 | perl -e "open FH, qq(file2); \$f2=join '', <FH>; chomp \$f2; map {s/<<CODE>>/\$f2/g; print \$_} <STDIN>" > file3
(Maybe I am not the best perl coder.)
It's straightforward: read the whole of file2 in, substitute, then print.
You could do it simply with:
awk '{gsub("<<CODE>>", filetwo)}1' filetwo="$(<file2)" file1
I am thinking about a solution where we use Bash to evaluate a cat process outside of the single quotes delimiting the sed script, but unfortunately the following doesn't work as soon as your file2 contains a newline:
sed 's/<<CODE>>/'"$(cat file2)"'/' file1
It accepts spaces in file2, but not newlines, as you can see in the following piece of code that works well, whatever the content of file2:
sed 's/<<CODE>>/'"$(cat file2 | tr -d '\n')"'/' file1
But, obviously, this modifies the content of the file before inclusion, which is simply bad. :-(
So if you want to play, you can first tr the newlines to some unusual character that should not appear in the data, and translate it back after sed has done its work:
sed 's/<<CODE>>/'"$(cat file2 | tr '\n' '\3')"'/' file1 | tr '\3' '\n'

Match a string that could have a newline anywhere in it - bash

I have a string containing a number that is represented as follows:
\S2=number_goes_here\
The number could be anything from 0.00000 and up. However, there could be a newline anywhere in that string, and I am not entirely sure how to go about matching that. Ultimately, I just want the number from this. Importantly, this string is amidst a large chunk of text that can be represented by this sample (S2 is found on the last line there):
1.454187\H,0,0.719618,3.525801,1.633708\H,0,-0.454651,2.80328,2.23844\
Ru,0,0.025774,1.557599,-0.253913\\Version=EM64L-G09RevD.01\State=6-A\H
F=-1238.5377983\S2=8.75446\S2-1=0.\S2A=8.750006\RMSD=2.314e-09\Dipole=
I'm open to bash, sed, awk, gawk; whatever thoughts you have to address this.
EDIT:
Here is an example; the first answer below does not seem to work correctly for it. It only prints "2."
.631441,-2.132979\H,0,0.20151,-1.464802,-2.95553\H,0,0.377883,-2.50668
5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.
053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769
grep -Po '(?<=S2=)[\d.]+' <(tr -d '\n' < file)
gives
8.75446
The tr -d '\n' removes every newline first, so a number that was split across lines is joined back together, and the Perl-style lookbehind (?<=S2=) (enabled by -P) makes -o print only the run of digits and dots that follows S2=.
You can use perl, read the whole file in slurp mode, remove newline characters and search it using a regular expression:
perl -0777 -nE '
$_ = join q||, split /\n/;
printf qq|%s\n|, $1 if m/\\S2=([\d.]+)/
' infile
It yields:
8.75446
Also possible using just bash, though this won't work so well for very large files.
#!/bin/bash
IFS=$'\n'
string=$(<"test.txt")
var=$(echo $string) # word-splitting will replace each newline with a space here
while IFS= read -r word; do
    [[ $word =~ '\S2='([0-9]*\.[0-9]*)'\' ]] && echo "${BASH_REMATCH[1]}"
done <<< "$var"
e.g.
> ./abovescript
8.75446
Here is a GNU awk version (GNU awk is needed because RS contains more than one character):
awk -F'\' 'NR==2 {print $1}' RS="S2=" file
8.75446
A version that works with most awk implementations:
awk -F\\ '{for (i=1;i<=NF;i++) if ($i~/S2=/) {split($i,a,"=");print a[2]}}' file
8.75446

delete characters in lines starting with a unique pattern

I have a file consisting of many entries that look like this:
>1761420406686363113470.1
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
i.e. a header line starting with > and many lines of sequence, followed by a header line.
I am trying to write a sed script that goes to only the lines that start with > (not the sequences lines) and deletes all but the first 10 numbers.
There are a lot of similar questions to this, but I can't figure it out. I've been trying variations on this code:
sed 's/^>..........*/^>........../' input.fasta
but clearly I am not doing it right.
This might work for you (GNU sed):
sed -r 's/^(>.{10}).*/\1/p;d' file
This deletes all lines except those that are substituted. If you want to retain the sequence lines:
sed -r 's/^(>.{10}).*/\1/' file
should fit the bill.
You have to capture the first 10 characters in parentheses:
sed -e 's/^\(>..........\).*/\1/'
Which can be shortened to
sed -e 's/^\(>.\{10\}\).*/\1/'
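Applied to the sample input.fasta from the question, this should give:
$ sed -e 's/^\(>.\{10\}\).*/\1/' input.fasta
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA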
As an alternative to sed, use cut:
$ echo ">1761420406686363113470.1" | cut -c1-11
>1761420406
To operate only on lines starting with a >, wrap it in a bash while loop:
$ while read line; do if [[ $line == \>* ]]; then cut -c1-11 <<< $line; else echo $line; fi; done < input
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
or using awk:
$ awk '{if ($0 ~ /^>/){print substr($0,1,11)}else{print}}' input
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA
Since good sed answers are already posted, here is a GNU awk solution.
gawk '/^>/{print gensub(/(.{11}).*/,"\\1","G",$1);next }1' inputFile
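Applied to the same sample (as inputFile), it should print:
$ gawk '/^>/{print gensub(/(.{11}).*/,"\\1","G",$1);next }1' inputFile
>1761420406
CAAGATTCTGAGATAATCGCGGTTTAAAGTTTCAAATTTGTTTCGGCCGATTCGAAGTCA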

How to seek forward and replace selected characters with sed

Can I use sed to replace selected characters, for example H => X, 1 => 2, but first seek forward so that characters in the first groups are not replaced?
Sample data:
"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";
How it should be after sed:
"Hello World";"Number 1 is there";"tX2s-Xas,2,XXunKnownData";
What I have tried:
Nothing really; I would try, but everything I know about sed expressions seems to be wrong.
OK, I have tried to capture ([^;]+) and "skip" the first groups separated by ; (putting them back with \1\2...). That part works fine, but then comes the problem: if I use capturing I have to select the whole group, and if I don't use capturing I lose data.
This is possible with sed, but it is kinda tedious. To do the translation in field number $FIELD you can use the following:
sed 's/\(\([^;]*;\)\{'$((FIELD-1))'\}\)\([^;]*;\)/\1\n\3\n/;h;s/[^\n]*\n\([^\n]*\).*/\1/;y/H1/X2/;G;s/\([^\n]*\)\n\([^\n]*\)\n\([^\n]*\)\n\([^\n]*\)/\2\1\4/'
Or, reducing the number of brackets with GNU sed:
sed -r 's/(([^;]*;){'$((FIELD-1))'})([^;]*;)/\1\n\3\n/;h;s/[^\n]*\n([^\n]*).*/\1/;y/H1/X2/;G;s/([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)/\2\1\4/'
Example:
$ FIELD=3
$ echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' | sed -r 's/(([^;]*;){'$((FIELD-1))'})([^;]*;)/\1\n\3\n/;h;s/[^\n]*\n([^\n]*).*/\1/;y/H1/X2/;G;s/([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)/\2\1\4/'
"Hello World";"Number 1 is there";"tX2s-Xas,2,XXunKnownData";
$ FIELD=2
$ echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' | sed -r 's/(([^;]*;){'$((FIELD-1))'})([^;]*;)/\1\n\3\n/;h;s/[^\n]*\n([^\n]*).*/\1/;y/H1/X2/;G;s/([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)/\2\1\4/'
"Hello World";"Number 2 is there";"tH1s-Has,1,HHunKnownData";
There may be a simpler way that I didn't think of, though.
If awk is ok for you:
awk -F";" '{gsub("H","X",$3);gsub("1","2",$3);}1' OFS=";" file
Using -F, each line is split with the semicolon as delimiter, and hence the 3rd field ($3) is the one of interest. The gsub function substitutes all occurrences of H with X in the 3rd field, and likewise 1 with 2.
1 is to print every line.
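As a quick check (not part of the original answer), running it on the sample line from the question should give exactly the requested output:
$ echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' |
  awk -F";" -v OFS=";" '{gsub("H","X",$3);gsub("1","2",$3);}1'
"Hello World";"Number 1 is there";"tX2s-Xas,2,XXunKnownData";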
[UPDATE]
(I just realized that it could be shorter. Perl has an auto-split mode):
$F[2] =~ s/H/X/g; $F[2] =~ s/1/2/g; $_=join(";",@F)
Perl is not known for being particularly readable, but in this case I suspect the best you can get with sed might not be as clear as with Perl:
echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' |
perl -F';' -ape '$F[2] =~ s/H/X/g; $F[2] =~ s/1/2/g; $_=join(";",@F)'
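which should print:
"Hello World";"Number 1 is there";"tX2s-Xas,2,XXunKnownData";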
Taking apart the Perl code:
# your groups are in @F, accessed as $F[$i]
$F[2] =~ s/H/X/g; # Do whatever you want with your chosen (Nth) group.
$F[2] =~ s/1/2/g;
$_ = join(";", #F) # Put them back together.
perl -pe is like sed. (sort of.)
and perl -F';' -ape means: use auto-splitting (-a) and set the field separator to ';'. Then your groups are accessible via $F[$i], so it works slightly like awk, too.
So it would also work like perl -F';' -ape '/*your code*/' < inputfile
I know you asked for a sed solution - I often find myself switching to Perl (though I do still like sed) for one-liners.
awk -F";" '{gsub("H","X",$3);gsub("1","2",$3);}1' Your_file
This might work for you (GNU sed):
sed 's/H/X/2g;s/1/2/2g' file
This changes all but the first occurrence of H or 1 on each line to X or 2 respectively.
If it's by fields separated by ;'s, use:
sed 's/H[^;]*;/&\n/;h;y/H/X/;H;g;s/\n.*\n//;s/1[^;]*;/&\n/;h;y/1/2/;H;g;s/\n.*\n//' file
This can be mutated to cater for many values, so:
echo -e "H=X\n1=2"|
sed -r 's|(.*)=(.*)|s/\1[^;]*;/\&\\n/;h;y/\1/\2/;H;g;s/\\n.*\\n//|' |
sed -f - file
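If you are curious what sed -f - ends up executing, drop the last stage of the pipeline and look at the generated script; it is simply the one-liner above split into one block per H=X / 1=2 pair:
$ echo -e "H=X\n1=2"|
  sed -r 's|(.*)=(.*)|s/\1[^;]*;/\&\\n/;h;y/\1/\2/;H;g;s/\\n.*\\n//|'
s/H[^;]*;/&\n/;h;y/H/X/;H;g;s/\n.*\n//
s/1[^;]*;/&\n/;h;y/1/2/;H;g;s/\n.*\n//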

With sed or awk, how do I match from the end of the current line back to a specified character?

I have a list of file locations in a text file. For example:
/var/lib/mlocate
/var/lib/dpkg/info/mlocate.conffiles
/var/lib/dpkg/info/mlocate.list
/var/lib/dpkg/info/mlocate.md5sums
/var/lib/dpkg/info/mlocate.postinst
/var/lib/dpkg/info/mlocate.postrm
/var/lib/dpkg/info/mlocate.prerm
What I want to do is use sed or awk to read from the end of each line until the first forward slash (i.e., pick the actual file name from each file address).
I'm a bit shaky on the syntax for both sed and awk. Can anyone help?
$ sed -e 's!^.*/!!' locations.txt
mlocate
mlocate.conffiles
mlocate.list
mlocate.md5sums
mlocate.postinst
mlocate.postrm
mlocate.prerm
Regular-expression quantifiers are greedy, which means .* matches as much of the input as possible. Read a pattern of the form .*X as "the last X in the string." In this case, we're deleting everything up through the final / in each line.
I used bangs rather than the usual forward-slash delimiters to avoid a need for escaping the literal forward slash we want to match. Otherwise, an equivalent albeit less readable command is
$ sed -e 's/^.*\///' locations.txt
Use the basename command:
$ basename /var/lib/mlocate
mlocate
I am for "basename" too, but for the sake of completeness, here is an awk one-liner:
awk -F/ 'NF>0{print $NF}' <file.txt
There's really no need to use sed or awk here; simply use basename:
IFS=$'\n'
for file in $(cat filelist); do
    basename "$file"
done
If you want the directory part instead use dirname.
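For example, with one of the paths from the question:
$ basename /var/lib/dpkg/info/mlocate.list
mlocate.list
$ dirname /var/lib/dpkg/info/mlocate.list
/var/lib/dpkg/info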
Pure Bash:
while read -r line
do
    [[ ${#line} != 0 ]] && echo "${line##*/}"
done < files.txt
Edit: Excludes blank lines.
This would do the trick too if file contains the list of paths:
$ xargs -d '\n' -n 1 -a file basename
This is a less-clever, plodding version of gbacon's:
sed -e 's/^.*\/\([^\/]*\)$/\1/'
@OP, you can use awk:
awk -F"/" 'NF{ print $NF }' file
NF means the number of fields; $NF is the value of the last field.
or with the shell
while read -r line
do
    line=${line##*/}   # strip the longest prefix matching */, i.e. everything up to the last "/"
    [ ! -z "$line" ] && echo "$line"
done <"file"
NB: if you have big files, use awk.