Problem changing an occurrence in a file with sed - regex

I have a file with several lines :
OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
OTU0125 UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
I want to remove all OTUXXXX occurrences (there are always 4 digits after the "OTU") that appear in the file. I used sed but it didn't work. The OTUXXXX always appears at the beginning of the lines.
sed 's/OTU[0-9]{4} //g' my_file.txt
I put a space after OTU[0-9]{4} because I want the UniRef90 IDs to be at the beginning of each line.
Edit :
sed -r 's/OTU[0-9]{4} //g' my_file.txt works. But I get another problem,
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
Some lines still begin with a white space. I tried sed 's/^ *//' my_file.txt and it does not work. I want the second line of my file to start like the other two lines, without any space.

You may use
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b[[:space:]]*//g' file > newfile
Or, if the matches can be found anywhere, not only at the string start:
sed -r 's/[[:space:]]*\bOTU[0-9]{4,}\b//g' file | sed 's/[[:space:]]*$//' > newfile
The whitespaces after the OTU<digits> won't get matched with the second snippet, so a piped sed command is necessary.
Details
[[:space:]]* - 0+ whitespace chars
\b a word boundary
OTU[0-9]{4,} - OTU and 4 or more digits
\b - a word boundary
[[:space:]]* - 0+ whitespace chars.
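As a quick check of the first command against the sample lines from the question, here is a minimal sketch. It assumes GNU sed (needed for \b), and the variable names input and cleaned are illustrative only.

```shell
# Sample input copied from the question; "input" and "cleaned" are illustrative names.
input='OTU3055 UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
OTU0856 OTU53699 UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
OTU0125 UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2'

# {4,} accepts four or more digits, so the five-digit OTU53699 is removed too.
# -E is the same option as -r in GNU sed.
cleaned=$(printf '%s\n' "$input" |
  sed -E 's/[[:space:]]*\bOTU[0-9]{4,}\b[[:space:]]*//g')

printf '%s\n' "$cleaned"
```

Every line of the result should start directly with a UniRef90 ID, with no leading space left behind.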

There is no explanation for your posted actual output given your posted input and the command you ran, but if you want to match 4 or more digits, and the space after the OTU* strings could be a tab or some other whitespace that is not a blank character, then this is what you need, using GNU or OSX/BSD sed for -E:
$ sed -E 's/(OTU[0-9]{4,}[[:space:]]+)+//' file
UniRef90_A0A0F7KBB1 UniRef90_A0A1Z9IPT2
UniRef90_D6PC25 UniRef90_D6PCA5 UniRef90_D6PCG3
UniRef90_A0A075FUN0 UniRef90_A0A075G8Q1 UniRef90_A0A075GDT2
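On why the command in the question only started matching once -r was added: without -r or -E, sed uses basic regular expressions, where an unescaped {4} is taken literally. The sketch below compares the three spellings on one sample line; the variable names are illustrative.

```shell
line='OTU3055 UniRef90_A0A0F7KBB1'

# Basic regex (no -E/-r): unescaped braces are literal characters, so nothing matches.
bre_plain=$(printf '%s\n' "$line" | sed 's/OTU[0-9]{4} //')

# Basic regex with escaped braces: \{4\} is the interval operator.
bre_escaped=$(printf '%s\n' "$line" | sed 's/OTU[0-9]\{4\} //')

# Extended regex: braces act as an interval without escaping.
ere=$(printf '%s\n' "$line" | sed -E 's/OTU[0-9]{4} //')

printf '%s\n' "$bre_plain" "$bre_escaped" "$ere"
```

Only the first variant leaves the line unchanged; the other two strip the OTU prefix.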

Related

Back-reference when prepending with the sed i command (Linux)

I'm trying to prepend the first character of "monkey" using this command:
echo monkey | sed -E '/(.)onkey/i \1'
But when I use it like this, the output shows
1
monkey
I actually hope to see:
m
monkey
But the back-reference doesn't work. Could someone please tell me whether it is possible to use a back-reference such as \1 here? Thanks in advance.
You may use this sed:
echo 'monkey' | sed -E 's/(.)onkey/\1\n&/'
m
monkey
Here:
\1: is back-reference for group #1
\n: inserts a line break
&: is back-reference for full match
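If the \n escape in the replacement is not available (it is a GNU extension, and this is an assumption about the sed you may be running elsewhere), a backslash followed by a real newline should give the same result. A minimal sketch, with result as an illustrative variable name:

```shell
# A backslash followed by a literal newline inside the replacement inserts a
# line break without relying on the \n escape.
result=$(echo 'monkey' | sed -E 's/(.)onkey/\1\
&/')

printf '%s\n' "$result"
```

The captured first character ends up on its own line above the unchanged word.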
With any version of awk you can try the following solution, written and tested with the shown samples. It searches for the regex ^.onkey, then uses the sub function to replace the first letter with itself, a newline, and itself again, and prints the result.
echo monkey | awk '/^.onkey/{sub(/^./,"&\n&")} 1'
This might work for you (GNU sed):
sed -E '/monkey/i m' file
Insert the line containing m only above a line containing monkey.
Perhaps a more generic solution would be to insert the first character of a word above that word:
sed -E 'h;s/\B.*//;G' file
Make copy of the word.
Remove all but the first character of the word.
Append the original word delimited by a newline.
Print the result.
N.B. \B starts a match between characters of a word. \b represents the start or end of a word (as does \< and \> separately).
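To see the generic hold-space version at work on more than one word, here is a small sketch. It assumes GNU sed (for \B) and one word per input line; the function name first_char_above is made up for illustration.

```shell
# Hypothetical helper wrapping the command from the answer above.
first_char_above() {
  # h: copy the line to the hold space
  # s/\B.*//: drop everything after the first character
  # G: append a newline plus the saved original line
  sed -E 'h;s/\B.*//;G'
}

out=$(printf '%s\n' monkey zebra | first_char_above)
printf '%s\n' "$out"
```

Each input word is preceded by a line holding only its first character.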

Sed handle patterns over multiple lines

For example if I have a file like
yo#gmail.com, yo#
gmail.com yo#gmail
.com
And I want to replace the string yo#gmail.com.
If the file had the target string in a single line then we could've just used
sed "s/yo#gmail.com/e#email.com/g" file
So, is it possible for me to catch patterns that are spread across multiple lines without replacing the \n?
Something like this
e#email.com, e#
email.com e#email
.com
Thank you.
You can do this:
tr -d '\n' < file | sed 's/yo#gmail.com/e#email.com/g'
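A short sketch of what this pipeline does to the sample from the question, fed through printf instead of a file. Note that it joins everything onto a single line, so the original line breaks are lost, which the question wanted to avoid. The dot is escaped here so that it only matches a literal period; the variable name joined is illustrative.

```shell
# The three sample lines from the question, piped in instead of read from a file.
joined=$(printf '%s\n' 'yo#gmail.com, yo#' 'gmail.com yo#gmail' '.com' |
  tr -d '\n' |
  sed 's/yo#gmail\.com/e#email.com/g')

printf '%s\n' "$joined"
```

All three addresses are replaced, but the output is one line rather than three.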
This might work for you (GNU sed):
sed -E 'N;s/yo#gmail\.com/e#email.com/g
h;s/(\S+)\n(\S+)/\1\2\n\1/;/yo#gmail\.com/!{g;P;D}
s//\ne#email.com/;:a;s/\n(\S)(\S+)\n\S/\1\n\2\n/;ta
s/(.*\n.*)\n/\1/;P;D' file
Append the following line to the pattern space.
Replace all occurrences of matching email address in both the first and second lines.
Make a copy of the pattern space.
Concatenate the last word of the first line with the first word of the second and keep the second line as is. If there is no match with the email address, revert the line, print/delete the first line and repeat.
Otherwise, replace the match and re-insert the newline as of the length of the first word of the second line (deleting the first word of the second line too).
Remove the newline used for scaffolding, print/delete the first line and repeat.
N.B. The lines will not be of the same length as the originals if the replacement string length does not match the matching string length. Also there has been no attempt to break the replacement string in the same relative split if the match and replacement strings are not the same length.
Alternative:
echo $'yo#gmail.com\ne#email.com' |
sed -E 'N;s#(.*)\n(.*)#s/\\n\1/\\n\2/g#
:a;\#\\n([^/])(.*)\\n(.)?(.*/g)#{s//\1\\n\2\3\\n\4/;H;ba}
x;s/.//;s#\\n/g$#/g#gm;s#\\n/#/#;s/\./\\./g' |
sed -e 'N' -f - -e 'P;D' file
or:
echo 's/yo#gmail.com/e#email.com/' |
sed -E 'h;s#/#/\\n#g;:a;H;s/\\n([^/])/\1\\n/g;ta;x;s/\\n$//mg;s/\./\\./g' |
sed -zf - /file
N.B. With the last alternative solution, the last sed invocation can be swapped for the first alternative solution's last sed invocation.

Replace spaces with new lines if part of a specific pattern using sed and regex with extended syntax

so I have a text file with multiple instances looking like this:
word.  word or words [something:'else]
I need to replace with a new line the double space after every period followed by a sequence of words and then a "[", like so:
word.\nword or words [something:'else]
I thought about using the sed command in bash with extended regex syntax, but nothing has worked so far... I've tried different variations of this:
sed -E 's/(\.)(  )(.*)(.\[)/\1\n\3\4/g' old.txt > new.txt
I'm an absolute beginner at this, so I'm not sure at all about what I'm doing 😳
This might work for you (GNU sed):
sed -E 's/\.  ((\w+ )+\[)/\.\n\1/g' file
Globally replace a period followed by two spaces, one or more space-separated words, and an opening square bracket with a period, a newline, and the text matched by the first capture group.
Your sed command is almost correct (but contains some redundancies)
sed -E 's/(\.)(  )(.*)(.\[)/\1\n\3\4/' old.txt > new.txt
#                                   ^
#                    You forgot to terminate the s command here
But you don't need to capture everything. A simpler one could be
sed -E 's/\.  (.*\[)/.\n\1/' old.txt > new.txt
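A quick check of the substitution on the sample line from the question. This is a sketch that assumes GNU sed (for \w and for \n in the replacement) and two literal spaces after the period; the variable names are illustrative.

```shell
# Sample line from the question, with two spaces after the period.
sample="word.  word or words [something:'else]"

# Replace ". " + space + words + "[" with ".", a newline, and the captured words.
split=$(printf '%s\n' "$sample" | sed -E 's/\.  ((\w+ )+\[)/.\n\1/g')

printf '%s\n' "$split"
```

The period stays on the first line and the words up to the bracket move to the next one.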

How to trim spaces after every delimiter in a text file

I need help removing the spaces that follow each delimiter in a text file.
I have a text file with the data below.
eg.
ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
I want to remove the spaces between each delimiter and the first letter of the following word.
Is there a regex or Unix script that can do this? I am looking for output like the following:
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU||||||2013-09-19 14:48:49.609000|
Any help will be appreciated.
awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)} 1' file
Using a perl one-liner to remove the spacing around every field. Assumes no embedded delimiters:
perl -i -lpe 's/^\s+|\s+$//g; s/\s*\|\s*/|/g' file.txt
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-l: Enable line ending processing
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
The perl code below removes the spaces at the start of a line and the spaces after the delimiter |:
$ perl -pe 's/(?<=\|) +|^ +//g' file
ADDRESS_ID|COUNTRY_TP_CD|RESIDENCE_TP_CD|PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE
885637959852960985.0|76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|
To save the changes made to that file,
perl -i -pe 's/(?<=\|) +|^ +//g' file
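If no field contains internal spaces, you can simply delete every space (this would also turn 169 Park lane into 169Parklane):
sed 's/ //g' input.txt > output.txt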
With sed:
sed -r -e 's/(^|\|)\s+/\1/g' -e 's/\s+$//' filename
In the first expression:
(^|\|) matches the beginning of the line or a | character, and saves this in capture group 1.
\s+ matches a sequence of whitespace characters after that.
The replacement \1 substitutes capture group 1, so this deletes the whitespace at the beginning of the line and after the delimiter.
The g modifier makes it operate on all the matches in the line.
In the second expression:
\s+ again matches a sequence of whitespace
$ matches the end of the line
The replacement replaces the whole thing with an empty string, thus removing trailing spaces.
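The two expressions explained above can be tried on the sample rows from the question. This sketch swaps \s for [[:space:]] (an assumption made here for portability) and wraps the command in a made-up helper named trim_fields.

```shell
# Hypothetical helper: strip whitespace at line start, after each |, and at line end.
trim_fields() {
  sed -E -e 's/(^|\|)[[:space:]]+/\1/g' -e 's/[[:space:]]+$//'
}

header='ADDRESS_ID| COUNTRY_TP_CD| RESIDENCE_TP_CD| PROV_STATE_TP_CD|ADDR_LINE_ONE|P_ADDR_LINE_ONE'
row='885637959852960985.0| 76.0|||169 Park lane||Scottish||lane||KU|||||||2013-09-19 14:48:49.609000|'

trimmed_header=$(printf '%s\n' "$header" | trim_fields)
trimmed_row=$(printf '%s\n' "$row" | trim_fields)

printf '%s\n' "$trimmed_header" "$trimmed_row"
```

Spaces inside a field, as in 169 Park lane, are left alone because they do not directly follow a delimiter.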
For POSIX sed (for GNU sed, add --posix):
sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g' YourFile
This uses two substitutions (POSIX basic regular expressions have no OR (|) operator):
Remove leading whitespace by replacing whitespace at the start of the line (^[[:space:]]*) with nothing.
Replace every pipe followed by any whitespace (|[[:space:]]*) with a pipe.
[[:space:]] can be replaced by a single space character if the text only contains spaces (ASCII 32).
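A small sketch of the two-substitution form, using a made-up line that mixes a leading space, a tab after one delimiter, and a space inside a field. The sample line and variable name are illustrative.

```shell
# printf expands \t to a tab, so the second field starts with a tab.
posix_trimmed=$(printf ' a|\tb| c d\n' |
  sed 's/^[[:space:]]*//;s/|[[:space:]]*/|/g')

printf '%s\n' "$posix_trimmed"
```

Because the pattern uses [[:space:]]* rather than a literal space, the tab is removed as well, while the space inside "c d" is kept.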

Insert space after period using sed

I've got a bunch of files that have sentences ending like this: \#.Next sentence. I'd like to insert a space after the period.
Not all occurrences of \#. lack a space, however, so my regex checks whether the character after the period is a capital letter.
Because I'm checking one character after the period, I can't just replace \#. with \#. plus a space, and because I don't know which character follows the period, I'm stuck.
My command currently:
sed -i .bak -E 's/\\#\.[A-Z]/<SOMETHING IN HERE>/g' *.tex
How can I grab the last letter of the matching string to use in the replacement regex?
EDIT: For the record, I'm using a BSD version of sed (I'm using OS X) - from my previous question regarding sed, apparently BSD sed (or at least, the Apple version) doesn't always play nice with GNU sed regular expressions.
The right command should be this:
sed -i.bak -E "s/\\\#.(\S)/\\\#. \1/g" *.tex
With it, you match any \#. followed by a non-whitespace character (\S) and insert a space (by replacing the whole match with '\#. ' plus the non-whitespace character just found).
Use this sed command:
sed -i.bak -E 's/(\\#\.)([A-Z])/\1 \2/g' *.tex
OR better:
sed -i.bak -E 's/(\\#\.)([^ \t])/\1 \2/g' *.tex
which will insert space if \#. is not followed by any white-space character (not just capital letter).
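A sketch of the capture-and-reinsert idea on a made-up sample line, without -i so that no file is modified. It uses [^[:space:]] in place of [^ \t], on the assumption that the character class is the more portable spelling; the variable names are illustrative.

```shell
# One occurrence lacks the space after \#. and one already has it.
sample='First\#.Next sentence\#. Last'

# Group 1 keeps \#. and group 2 keeps the following character, so nothing is lost.
spaced=$(printf '%s\n' "$sample" | sed -E 's/(\\#\.)([^[:space:]])/\1 \2/g')

printf '%s\n' "$spaced"
```

Only the occurrence that lacked a space is changed; the one that already had a space is left as it was.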
This might work for you:
sed -i .bak -E 's/\\#\. ?/\\#. /g' *.tex
Explanation:
If there's a space there replace it with a space, otherwise insert a space.
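I think the following would be correct (with -E):
s/(\\#\.)([^[:space:]])/\1 \2/g
This only replaces the expression if it is not followed by whitespace. The character after the period is captured and put back, so it is not lost.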