Using sed to match regex

I don't know much about sed, nor regex. I want to replace every line that contains only tabs by the string '0'. There are also lines in my file that contain only '\n'.
Basically I want to use the regular expression ^\h+$ and replace the matches with 0.
I tried:
sed -i 's/^\h+$/0/' file.txt
But it doesn't work

You can use:
sed -i.bak -E 's/^[[:blank:]]+$/0/' file
The POSIX character class [[:blank:]] matches a space or a tab, which is the same as \h in PCRE.
-i.bak keeps the original file as file.bak, in case you want to restore it.
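To check the behavior without touching a real file, the same substitution can be run on a few lines fed in from printf. This is only a sketch, and the sample input is made up for illustration:

```shell
# Hypothetical sample: a tab-only line, an empty line, a line with text,
# and a line mixing spaces and tabs.
out=$(printf '\t\t\n\nfoo\tbar\n \t \n' | sed -E 's/^[[:blank:]]+$/0/')
printf '%s\n' "$out"
```

The empty line is left alone because + requires at least one blank character.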

In GNU sed a tab is written \t. Without -E (basic regex syntax), the one-or-more quantifier needs a backslash, \+:
sed -i -e 's/^\t\+$/0/' file.txt

Related

Replace spaces with new lines if part of a specific pattern using sed and regex with extended syntax

so I have a text file with multiple instances looking like this:
word.  word or words [something:'else]
I need to replace with a new line the double space after every period followed by a sequence of words and then a "[", like so:
word.\nword or words [something:'else]
I thought about using the sed command in bash with extended regex syntax, but nothing has worked so far... I've tried different variations of this:
sed -E 's/(\.)(  )(.*)(.\[)/\1\n\3\4/g' old.txt > new.txt
I'm an absolute beginner at this, so I'm not sure at all about what I'm doing 😳
This might work for you (GNU sed):
sed -E 's/\.  ((\w+ )+\[)/\.\n\1/g' file
Globally replace a period followed by two spaces, one or more space-separated words, and an opening square bracket with a period, a newline, and the text captured by the first group.
Your sed command is almost correct (but contains some redundancies):
sed -E 's/(\.)(  )(.*)(.\[)/\1\n\3\4/' old.txt > new.txt
#                                   ^
# You forgot to terminate the s command with this final /
But you don't need to capture everything. A simpler one could be
sed -E 's/\.  (.*\[)/.\n\1/' old.txt > new.txt
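As a rough check of the simpler command, here it is applied to one made-up line with two spaces after the period. It assumes GNU sed, since \n in the replacement is not understood by every sed:

```shell
# Made-up sample line; note the two spaces after the period.
line="word.  word or words [something:'else]"
# GNU sed assumed: \n in the replacement inserts a newline.
out=$(printf '%s\n' "$line" | sed -E 's/\.  (.*\[)/.\n\1/')
printf '%s\n' "$out"
```

Because .* is greedy, a line holding several such patterns would only be split at the first period, so the \w+ variant may be the safer choice there.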

sed from constant regex

I tried to remove the unwanted symbols
%H1256
*+E1111
*;E2311
+-'E3211
{E4511
DE4513
so I tried by using this command
sed 's/+E[0-9]/E/g'
but it won't remove the blank spaces, and the digits need to be preserved.
expected:
H1256
E1111
E2311
E3211
E4511
E4513
EDIT
Special thanks to https://stackoverflow.com/users/3832970/wiktor-stribiżew, who saved my day:
sed -n 's/.*\([A-Z][0-9]*\).*/\1/p' file or grep -oE '[A-Z][0-9]+' file
You may use either sed:
sed -n 's/.*\([[:upper:]][[:digit:]]*\).*/\1/p' file
or grep:
grep -oE '[[:upper:]][[:digit:]]+' file
See the online demo
Basically, the patterns match an uppercase letter ([[:upper:]]) followed by digits ([[:digit:]]* matches 0 or more digits in the POSIX BRE sed solution, and [[:digit:]]+ matches 1 or more digits in the POSIX ERE grep solution).
While the sed solution extracts a single value (the last one) from each line, grep extracts every value it finds on every line.
This strips any leading non-alphanumeric characters (note that it leaves DE4513 unchanged, because D is alphanumeric):
sed -E 's/^[^[:alnum:]]+//' file
Or, if it is only the last 5 characters you need:
sed -E 's/.*(.{5})$/\1/' file
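The grep and sed extractions above can be compared on the sample lines from the question. This is a small sketch rather than a definitive test, and the variable names are only for illustration:

```shell
# Sample lines copied from the question.
input=$(printf '%s\n' '%H1256' '*+E1111' '*;E2311' "+-'E3211" '{E4511' 'DE4513')

# grep -o prints every match on its own line.
grep_out=$(printf '%s\n' "$input" | grep -oE '[[:upper:]][[:digit:]]+')

# sed -n ... p prints only lines where the substitution succeeded.
sed_out=$(printf '%s\n' "$input" | sed -n 's/.*\([[:upper:]][[:digit:]]*\).*/\1/p')

printf '%s\n' "$grep_out"
```

On this input both commands should give the same six values, including E4513 for the DE4513 line.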

How to add a line break before and after a regex in a text file?

This is an excerpt from the file I want to edit:
>chr1|-|9|S|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG >chr1|+|9|Y|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
I would like a new text file in which a line break is added before ">" and after "somatic" or "germline". How can I do this in R or Unix?
Expected output:
>chr1|-|9|S|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
>chr1|+|9|Y|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
By the looks of your input, you could simply replace spaces with newlines:
tr -s ' ' '\n' <infile >outfile
(Some tr dialects don't like \n. Try '\012' or a literal newline: opening quote, newline, closing quote.)
If that won't work, you can easily do this in sed. If somatic is static, just hard-code it:
sed -e 's/somatic */&\n/g' -e 's/ >/\n>/g' file >newfile
The usual caveats about different sed dialects apply. Some versions don't like \n for newline, some want a newline or a semicolon instead of multiple -e arguments.
On Linux, you can modify the file in-place:
sed -i 's/somatic */&\
/g
s/ >/\
>/g' file
(For variation, I'm showing how to do this if your sed doesn't recognize \n but allows literal newlines, and how to put the script in a single multi-line string.)
On *BSD (including MacOS) you need to add an argument to -i always; sed -i '' ...
If somatic is variable, but you always want to replace the first space after a wedge, try something like
sed 's/\(>[^ ]*\) /\1\n/g'
>[^ ]* matches a wedge followed by zero or more non-space characters. The parentheses capture the matched string into \1. Again, some sed variants don't want backslashes in front of the parentheses, or are otherwise just ... different.
If you have very long lines, you might bump into a sed which has problems with that. Maybe try Perl instead. (Luckily, no dialects to worry about!)
perl -i -pe 's/(>[^ ]*) /$1\n/g;s/ >/\n>/g' file
(Skip the -i option if you don't want to modify the input file. Then output will be to standard output.)
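For a quick sanity check of the capture-group approach, the two sed rules can be tried on a shortened, made-up record line. GNU sed is assumed here because of \n in the replacement:

```shell
# Shortened, made-up records in the same shape as the excerpt.
line='>chr1|-|9|S|somatic ACCA >chr1|+|9|Y|germline GGTT'

# Rule 1: newline after each header (from > up to the first space).
# Rule 2: newline before each remaining " >".
out=$(printf '%s\n' "$line" |
  sed -e 's/\(>[^ ]*\) /\1\n/g' -e 's/ >/\n>/g')
printf '%s\n' "$out"
```

Since the first rule keys on the wedge rather than on the word somatic, the germline record is handled the same way.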
(\bsomatic\b|\bgermline\b)|(?=>)
Try this. See demo. Replace by $1\n
http://regex101.com/r/tF5fT5/53
If there's no support for lookahead, then try
(\bsomatic\b|\bgermline\b)
Replace by $1\n. See demo.
http://regex101.com/r/tF5fT5/50
and
(>)
Replace by \n$1. See demo.
http://regex101.com/r/tF5fT5/51
Thank you everyone!
I used:
tr -s ' ' '\n' <infile >outfile
as suggested by tripleee and it worked perfectly!

Insert space after period using sed

I've got a bunch of files that have sentences ending like this: \#.Next sentence. I'd like to insert a space after the period.
Not every occurrence of \#. is missing the space, however, so my regex checks whether the character after the period is a capital letter.
Because I'm matching one character past the period, I can't simply replace \#. with \#. plus a space, and because I don't know which character follows the period, I'm stuck.
My command currently:
sed -i .bak -E 's/\\#\.[A-Z]/<SOMETHING IN HERE>/g' *.tex
How can I grab the last letter of the matching string to use in the replacement regex?
EDIT: For the record, I'm using a BSD version of sed (I'm using OS X) - from my previous question regarding sed, apparently BSD sed (or at least, the Apple version) doesn't always play nice with GNU sed regular expressions.
The right command should be this:
sed -i.bak -E "s/\\\#.(\S)/\\\#. \1/g" *.tex
With it, you match any \#. followed by a non-whitespace character (\S) and insert a space, by replacing the whole match with '\#. ' plus the non-whitespace character just captured.
Use this sed command:
sed -i.bak -E 's/(\\#\.)([A-Z])/\1 \2/g' *.tex
OR better:
sed -i.bak -E 's/(\\#\.)([^ \t])/\1 \2/g' *.tex
which will insert space if \#. is not followed by any white-space character (not just capital letter).
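A minimal sketch of the capture-group idea on a made-up line, one place missing the space and one already correct. It swaps [^ \t] for [^[:space:]], which is an assumption made here for portability, not something the answer specified:

```shell
# Made-up LaTeX-like input: the first \#. lacks a space, the second has one.
input='End one\#.Next sentence. End two\#. Already spaced.'

# Group 1 is the literal \#. and group 2 is the non-space character after it;
# both are put back with a space between them.
out=$(printf '%s\n' "$input" | sed -E 's/(\\#\.)([^[:space:]])/\1 \2/g')
printf '%s\n' "$out"
```

The second occurrence is untouched because the character after its period is already a space.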
This might work for you:
sed -i .bak -E 's/\\#\. ?/\\#. /g' *.tex
Explanation:
If there's a space there replace it with a space, otherwise insert a space.
I think the following would be correct:
s/\\#\.\([^[:space:]]\)/\\#. \1/g
It only replaces the expression if it is not followed by whitespace, and the captured character is put back after the inserted space.

Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to make a vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.
For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
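The two spellings of the quantifier can be compared side by side. This sketch assumes GNU sed (for \+ in basic syntax and \n in the replacement) and uses made-up input:

```shell
# Made-up input: two spaces, then a tab, between the words.
input=$(printf 'alpha  beta\tgamma')

# Extended syntax: a bare + is the one-or-more quantifier.
ere=$(printf '%s\n' "$input" | sed -E -e 's/[[:blank:]]+/\n/g')

# Basic syntax: the quantifier is spelled \+ (a GNU extension).
bre=$(printf '%s\n' "$input" | sed -e 's/[[:blank:]]\+/\n/g')

printf '%s\n' "$ere"
```

Both forms should put each word on its own line, with the run of two spaces treated as a single separator.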
Perl Compatible Regexes
For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
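Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*). Strictly speaking, POSIX treats a backslash inside a bracket expression as a literal character, so a sed without GNU's \t extension needs a literal tab character typed inside the brackets, or [[:blank:]].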
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ at the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error-prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
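A variant of the portable command that avoids \t inside brackets is sketched below. Using [[:blank:]] here is a substitution of mine, not the original answer's spelling, and the input is made up:

```shell
# Made-up input with one space and one tab as separators.
# The replacement is a backslash followed by a literal newline, so no \n
# escape is needed; [[:blank:]] avoids relying on \t inside brackets.
out=$(printf 'one two\tthree\n' | sed -e 's/[[:blank:]][[:blank:]]*/\
/g')
printf '%s\n' "$out"
```

The backslash must be the very last character on its line for this to work.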
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.
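To see what sed would actually receive, the same script text can be passed to printf under both kinds of quotes. The script string here is only an illustration:

```shell
# The same characters quoted two ways; printf '%s' prints the argument
# exactly as the shell hands it over.
single=$(printf '%s' 's/\\#\./X/')
double=$(printf '%s' "s/\\#\./X/")
printf 'single: %s\ndouble: %s\n' "$single" "$double"
```

Inside double quotes the shell collapses \\ to a single backslash, so sed would see \# instead of \\# and would no longer be matching a literal backslash.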
The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (In sed scripts, commands are normally terminated by newlines.)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.
All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
Option 1 (this collapses all whitespace, so the words end up on a single line separated by single spaces)
echo $(cat testfile)
Option 2
tr ' ' '\n' < testfile
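The tr approach can be extended to cover tabs and repeated spaces as well. The following is a sketch on made-up input, assuming a tr that understands the \t and \n escapes:

```shell
# Made-up input: two spaces, then a tab, between the words.
input=$(printf 'alpha  beta\tgamma')

# Map both space and tab to newline; -s squeezes runs of newlines into one.
out=$(printf '%s\n' "$input" | tr -s ' \t' '\n\n')
printf '%s\n' "$out"
```

Without -s, the two adjacent spaces would leave an empty line between the first two words.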
This should do the work:
sed -E -e 's/[ \t]+/\n/g'
[ \t] means a space OR a tab. If you want any kind of whitespace, you could also use \s.
[ \t]+ means one or more spaces OR tabs (-E is needed for the bare + to act as a quantifier).
s/x/y/ means replace the pattern x with y (here \n is a newline).
The g at the end means the replacement is repeated for every match on each line.
You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence
You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new
Using gawk:
gawk '{$1=$1}1' OFS="\n" file