using backreferences regex in sed - regex

I would like to replace each run of multiple spaces in a file with a single character.
Example
cat kill rat
dog kill cat
I used the following regex, which seemed to match at http://www.regexpal.com/ but didn't work in sed.
([^ ])*([ ])*
I used the sed command like so:
sed s/\(\[\^\ \]\)*\(\[\ \]\)*/\$1\|/g < inputfile
I expect,
cat|kill|rat
dog|kill|cat
But I couldn't get it to work. Any help would be much appreciated. Thanks.
Edit:
Kindly note that cat/dog could be any characters other than whitespace.

sed writes backreferences with backslashes, so use \1 instead of $1.
Surround your expressions with quotes:
sed 's/match/replace/g' < inputfile
Manpages are the best invention in Linux world: man sed
Watch out for *, it can actually match NOTHING.
If you want to replace multiple spaces with a '|', use this RE:
sed -r 's/ +/\|/g'
From man sed:
-r, --regexp-extended
use extended regular expressions in the script.
You don't need any backreferences if you just want to replace all spaces.
Replace (space) by \s if you want to match tabs too.
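As a quick check of the points above, here is a minimal sketch. The function name squeeze_spaces and the sample input are made up for illustration, and -E is used in place of -r (GNU sed accepts both). The last command shows why * needs care: it also matches the empty string.

```shell
# Hypothetical helper: replace each run of one or more spaces with a single '|'.
squeeze_spaces() {
  sed -E 's/ +/|/g'
}

# Prints cat|kill|rat and dog|kill|cat on separate lines.
printf 'cat   kill    rat\ndog kill  cat\n' | squeeze_spaces

# x* also matches the empty string, so the replacement is inserted at
# every position even though the input contains no "x". Prints -a-b-c-
printf 'abc\n' | sed 's/x*/-/g'
```

With + (one or more) the pattern can never match the empty string, which is why the one-liner above behaves as expected.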

I know the OP asked for sed and that the question is old, but what about tr -s ' ' '|' < input? (tr reads standard input, and -s squeezes each run of the translated character into one.)

What about (with extended regular expressions enabled):
sed -r 's/\s+/\|/g' inputfile

You can use:
sed -e 's/[[:blank:]]\{1,\}/|/g' < inputfile
where [:blank:] matches space and tab characters, and \{1,\} makes the pattern match a whole run of them rather than each one separately.

Related

Bash regex with sed

Trying to get only words that come before "/", I wrote:
T='He/She is a very handsome/beautiful man/woman indeed.'
echo "$T" | sed -E 's#\b(.*)/(.*)\b#\1#g'
However, I only got to make it work in the last occurrence (although I'm using "g" in sed sentence):
He/She is a very handsome/beautiful man.
My desired output is: "He is a very handsome man indeed."
Any ideas? Thank you.
A POSIX compliant one:
T='He/She is a very handsome/beautiful man/woman indeed.'
echo "$T" | sed 's:/[^ ]*::g'
To remove non-space characters after every slash you can use:
$ sed 's:/\S*::g' <<<'He/She is a very handsome/beautiful man/woman indeed.'
He is a very handsome man indeed.
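The pattern, :/\S*:, matches a slash followed by zero or more non-space characters. The replacement string, ::, is empty, and the g flag applies it to every match on the line. The <<< is a here-string that passes input to sed. Note that \S is a GNU extension; the POSIX-compliant command above uses [^ ] instead.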
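The attempt in the question fails because .* is greedy: a single match swallows the line up to the last slash, so the g flag has nothing left to repeat on. Restricting the match to non-space characters avoids that. A minimal sketch using the POSIX form (the function name strip_after_slash is made up for illustration):

```shell
# Hypothetical helper: delete each slash together with the non-space
# characters that follow it.
strip_after_slash() {
  sed 's:/[^ ]*::g'
}

T='He/She is a very handsome/beautiful man/woman indeed.'
# Prints: He is a very handsome man indeed.
printf '%s\n' "$T" | strip_after_slash
```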

Replace some dots(.) with commas(,) with RegEx and awk or sed

I want to replace dots with commas for some but not all matches:
hostname_metric (Index: 1) to hostname;metric (avg);22.04.2015 13:40:00;3.0000;22.04.2015 02:05:00;2.0000;22.04.2015 02:00:00;650.7000;2.2594;
The outcome should look like this:
hostname_metric (Index: 1) to hostname;metric (avg);22.04.2015 13:40:00;3,0000;22.04.2015 02:05:00;2,0000;22.04.2015 02:00:00;650,7000;2,2594;
I was able to identify the RegEx which should work to find the correct dots.
;[0-9]{1,}\.[0-9]{4}
But how can I replace them with a comma with awk or sed?
Thanks in advance!
Adding some capture groups to the regex in your question, you can use this sed one-liner:
sed -r 's/(;[0-9]{1,})\.([0-9]{4})/\1,\2/g' file
This matches and captures the part before and after the . and uses them in the replacement string.
On some versions of sed, you may need to use -E instead of -r to enable Extended Regular Expressions. If your version of sed doesn't understand either switch, you can use basic regular expressions and add a few escape characters:
sed 's/\(;[0-9]\{1,\}\)\.\([0-9]\{4\}\)/\1,\2/g' file
sed 's/\(;[0-9]\+\)\.\([0-9]\{4\}\)/\1,\2/g' should do the trick.
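To see which dots are affected, here is a small sketch run on a shortened version of the sample line. The function name decimal_commas is made up for illustration, and -E is assumed to be available. The dots inside the dates survive because they are never followed by exactly the four digits the pattern requires.

```shell
# Hypothetical helper: turn ";<digits>.<4 digits>" into ";<digits>,<4 digits>".
decimal_commas() {
  sed -E 's/(;[0-9]+)\.([0-9]{4})/\1,\2/g'
}

# Prints: h;m (avg);22.04.2015 13:40:00;3,0000;650,7000;2,2594;
printf '%s\n' 'h;m (avg);22.04.2015 13:40:00;3.0000;650.7000;2.2594;' | decimal_commas
```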

How to add a line break before and after a regex in a text file?

This is an excerpt from the file I want to edit:
>chr1|-|9|S|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG >chr1|+|9|Y|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
I would like a new text file in which a line break is added before ">" and after "somatic" or "germline". How can I do this in R or Unix?
Expected output:
>chr1|-|9|S|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
>chr1|+|9|Y|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
By the looks of your input, you could simply replace spaces with newlines:
tr -s ' ' '\n' <infile >outfile
(Some tr dialects don't like \n. Try '\012' or a literal newline: opening quote, newline, closing quote.)
If that won't work, you can easily do this in sed. If somatic is static, just hard-code it:
sed -e 's/somatic */&\n/g' -e 's/ >/\n>/g' file >newfile
The usual caveats about different sed dialects apply. Some versions don't like \n for newline, some want a newline or a semicolon instead of multiple -e arguments.
On Linux, you can modify the file in-place:
sed -i 's/somatic */&\
/g
s/ >/\
>/g' file
(For variation, I'm showing how to do this if your sed doesn't recognize \n but allows literal newlines, and how to put the script in a single multi-line string.)
On *BSD (including MacOS) you need to add an argument to -i always; sed -i '' ...
If somatic is variable, but you always want to replace the first space after a wedge, try something like
sed 's/\(>[^ ]*\) /\1\n/g'
>[^ ]* matches a wedge followed by zero or more non-space characters. The parentheses capture the matched string into \1. Again, some sed variants don't want backslashes in front of the parentheses, or are otherwise just ... different.
If you have very long lines, you might bump into a sed which has problems with that. Maybe try Perl instead. (Luckily, no dialects to worry about!)
perl -i -pe 's/(>[^ ]*) /$1\n/g;s/ >/\n>/g' file
(Skip the -i option if you don't want to modify the input file. Then output will be to standard output.)
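For reference, the two sed substitutions can be combined in one script that writes to standard output. This is only a sketch: the function name split_records is made up, the sequences are shortened, and the literal-newline form is used so that it does not depend on \n being understood in the replacement.

```shell
# Hypothetical helper: break the line after each ">header" field and
# before each subsequent ">". The continuation lines must start in
# column 0, otherwise the indentation becomes part of the replacement.
split_records() {
  sed 's/\(>[^ ]*\) /\1\
/g
s/ >/\
>/g'
}

printf '%s\n' '>chr1|-|9|S|somatic ACCA >chr1|+|9|Y|somatic ACCA' | split_records
```

Each header and each sequence should end up on its own line, four lines in total for this input.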
(\bsomatic\b|\bgermline\b)|(?=>)
Try this. Replace by $1\n. See demo.
http://regex101.com/r/tF5fT5/53
If there's no support for lookahead then try
(\bsomatic\b|\bgermline\b)
Try this. Replace by $1\n. See demo.
http://regex101.com/r/tF5fT5/50
and
(>)
Replace by \n$1. See demo.
http://regex101.com/r/tF5fT5/51
Thank you everyone!
I used:
tr -s ' ' '\n' <infile >outfile
as suggested by tripleee and it worked perfectly!

Printing a matched regexp with sed

So I'm trying to match a regexp with any string in the middle of it and then print out just that string. The syntax is sort of like this...
sed -n 's/<title>.*<\/title>/"what do I put here"/p' input.file
and I just want to print out whatever .* is where I typed "what do I put here". I'm not very comfortable with sed at this point so this is likely a very simple answer and I'm having trouble finding one in any of the other questions. Thanks in advance!
Capture the pattern you want to extract within \(...\), and then you can refer to it as \1 in the replacement string:
sed -n 's/<title>\(.*\)<\/title>/\1/p' input.file
You can have multiple \(...\) expressions, and refer to them with \1, \2, \3, and so on.
If you have the GNU version of sed, or gsed, then you could simplify a bit:
sed -rn 's/<title>(.*)<\/title>/\1/p' input.file
With the -r flag, sed can use "extended regular expressions", which lets you write (...) instead of \(...\), + instead of \+, and other goodies. In both commands the slash in the closing tag is escaped, because / is also the delimiter of the s command.
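A small sketch of the capture-and-print approach, using | as the s delimiter so that the slash in the closing tag needs no escaping. The function name extract_title and the sample HTML are made up for illustration.

```shell
# Hypothetical helper: print only the text captured between the title tags.
# -n suppresses normal output; the p flag prints lines where a substitution
# happened.
extract_title() {
  sed -n 's|<title>\(.*\)</title>|\1|p'
}

# Prints: My Page
printf '<html>\n<title>My Page</title>\n<body>\n' | extract_title
```

Any text outside the tags on the same line would be kept, since only the matched part of the line is replaced.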

PCRE regex to sed regex

First of all, sorry for my bad English. I'm German.
The code given below is working fine in PHP:
$string = preg_replace('/href="(.*?)(\.|\,)"/i','href="$1"',$string);
Now I need the same for sed. I thought it should be:
sed 's/href="(.*?)(\.|\,)"/href="{$\1}"/g' test.htm
But that gives me this error:
sed: -e expression #1, char 36: invalid reference \1 on `s' command's RHS
sed does not support non-greedy regex match.
sed -e 's|href=\"\(.[^"][^>]*\)\([.,]\)\">|href="\1">|g' file
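You need a backslash in front of the parentheses you want to reference. Unescaped parentheses and ? are literal characters in basic regular expressions, and the replacement takes a plain \1, thus
sed 's/href="\([^"]*\)[.,]"/href="\1"/g' test.htm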
You have to escape the grouping characters ( and ) as follows, and write the alternation between . and , as a character class:
sed 's/href="\([^"]*\)\([.,]\)"/href="\1"/g' test.htm
Here is a solution. It is not perfect: it only deals with a single extra "," or ".".
sed -r -e 's/href="([^"]*)([.,]+)"/href="\1"/g' test.htm
If you want to match a literal ".", you need to escape it or use it in a character class. As an alternative to slashing the capturing parentheses (which you need to do with basic REs), you can use the -E option to tell sed to use extended REs. Lastly, the REs used by sed use \N to refer to subpatterns, where N is a digit.
sed -E "s/href=([\"'])([^\"']*)[.,]\1/href=\1\2\1/i"
This has its own issue that will prevent matches of href attributes that use both types of quotes.
man sed and man re_format will give more information on REs as used in sed.
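Since sed has no lazy .*?, a common workaround is a negated character class that cannot run past the closing quote. The sketch below assumes double-quoted attributes and a single trailing "." or ","; the function name trim_href and the example.com URLs are made up for illustration.

```shell
# Hypothetical helper: drop one trailing '.' or ',' inside a double-quoted
# href. [^"]* cannot cross the closing quote, which stands in for the
# non-greedy .*? of the PCRE version.
trim_href() {
  sed -E 's/href="([^"]*)[.,]"/href="\1"/g'
}

# Prints: <a href="http://example.com/page">x</a> <a href="http://example.com/ok">y</a>
printf '%s\n' '<a href="http://example.com/page.">x</a> <a href="http://example.com/ok">y</a>' | trim_href
```

The second link is left untouched because nothing in it ends with a dot or comma directly before the closing quote.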