Extract text between two markers and replace a character - regex

I want to change
>lcl|ORF183:9482:8118 unnamed protein product
into
>ORF183:9482-8118
Keep everything after | and before 'white space', plus replacing second : to -
So far I'm doing it with the following code:
sed -e '/^>/s/ .*//' -e '/^>/s/|/ /' -e '/^>/s/lcl //' -e '/^>/s/\(.*\):/\1-/'
but wish to do it in a simpler one-line code.

This could work:
sed -e 's/\(^.*|\)\(.*\):\(.*\):\(.*\)[[:space:]]\(unnamed.*$\)/>\2:\3-\4/'

Here's some improvements based on code you've tried
$ sed -e '/^>/s/ .*//' -e '/^>/s/lcl|//' -e '/^>/s/:/-/2' ip.txt
>ORF183:9482-8118
-e '/^>/s/|/ /' -e '/^>/s/lcl //' can be simplified to -e '/^>/s/lcl|//'
use s/>[^|]*|/>/ if you wish to match any text between > and |
sed allows to specify which occurrence of the match you want to replace, s/:/-/2 means replace the 2nd : to -
If your sed implementation allows grouping, you can group all the commands (separated by ;) inside {} for a particular address
$ sed '/^>/{s/ .*//; s/lcl|//; s/:/-/2}' ip.txt
>ORF183:9482-8118
Please visit https://stackoverflow.com/tags/sed/info for learning resources and other goodies

Related

Delete any special character using Sed

I have yet another list of subdomain. I want to remove any Wildcard subdomain which include these special characters:
()!&$#*+?
Mostly, the data are prefixly random. Also, could be middle. Here's some sample of output data
(www.imgur.com
***************diet.blogspot.com
*-1.gbc.criteo.com
------------------------------------------------------------i.imgur.com
This has been quite an inconvenience while scanning through the list. As always, I'm trying sed to fix it:
sed -i "/[!()#$&?+]/d" foo.txt ###Didn't work
sed -i "/[\!\(\)\#\$\&\?\+]/d" ###Escaping char didn't work
Performing commands above still result in an unchanged list and the file still on original state. I'm thinking that; to fix this is to pipe series of sed command in order to remove it one by one:
cat foo.txt | sed -e "/!/d" -e "/#/d" -e "/\*/d" -e "/\$/d" -e "/(/d" -e "/)/d" -e "/+/d" -e "/\'/d" -e "/&/d" >> foo2.txt
cat foo.txt | sed -e "/\!/d" | sed -e "/\#/d" | sed -e "/\*/d" | sed -e "/\$/d" | sed -e "/\+/d" | sed -e "/\'/d" | sed -e "/\&/d" >> foo2.txt
If escaping all special char doesn't work, it must've been my false logic. Also tried with /g still doesn't increase my luck.
As a side note: I don't want - to be deleted as some valid subdomain can have - character:
line-apps.com
line-apps-beta.com
line-apps-rc.com
line-apps-dev.com
Any help would be cherished.
Using sed
$ sed '/[[:punct:]]/d' input_file
This should delete all lines with special characters, however, it would help if you provided sample data.
To do what you're trying to do in your answer (which adds [ and ] and more to the set of characters in your question) would be:
sed '/[][!?+,#$&*() ]/d'
or just:
grep -v '[][!?+,#$&*() ]'
Per POSIX to include ] in a bracket expression it must be the first character otherwise it indicates the end of the bracket expression.
Consider printing lines you want instead of deleting lines you do not want, though, e.g.:
grep '^[[:alnum:]_.-]$' file
to print lines that only contain letters, numbers, underscores, dashes, and/or periods.

how to regex replace before colon?

this is my original string:
NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
I want to only add back slash to all the spaces before ':'
so, this is what I finally want:
NetworkManager/system\ connections/Wired\ 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
I need to do this in bash, so, sed, awk, grep are all ok for me.
I have tried following sed, but none of them work
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/ .*\(:.*$\)/\\ .*\1/g'
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/\( \).*\(:.*$\)/\\ \1.*\2/g'
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/ .*\(:.*$\)/\\ \1/g'
echo NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1 | sed 's/\( \).*\(:.*$\)/\\ \1\2/g'
thanks for answering my question.
I am still quite newbie to stackoverflow, I don't know how to control the format in comment.
so, I just edit my original question
my real story is:
when I do grep or use cscope to search keyword, for example "address1" under /etc folder.
the result would be like:
./NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
if I use vim to open file under cursor, suppose my vim cursor is now at word "NetworkManager",
then vim will understand it as
"./NetworkManager/system"
that's why I want to add "\" before space, so the search result would be more vim friendly:)
I did try to change cscope's source code, but very difficult to fully achieve this. so have to do a post replacement:(
If you only want to do the replacements if there is a : present in the string, you can check if there are at least 2 columns, setting the (output)field separator to a colon.
Data:
cat file michaelvandam#Michaels-MacBook-Pro
NetworkManager/system connections/Wired 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
NetworkManager/system connections/Wired 1.nmconnection 14 address1=10.1.10.71/24,10.1.10.1%
Example in awk:
awk 'BEGIN {FS=OFS=":"}{if(NF>1)gsub(" ","\\ ",$1)}1' file
Output
NetworkManager/system\ connections/Wired\ 1.nmconnection:14 address1=10.1.10.71/24,10.1.10.1
NetworkManager/system connections/Wired 1.nmconnection 14 address1=10.1.10.71/24,10.1.10.1
This could be simply done in awk program, with your shown samples, please try following.
awk 'BEGIN{FS=OFS=":"} {gsub(/ /,"\\\\&",$1)} 1' Input_file
Explanation: Simple explanation would be, setting field separator and output field separator as : for this program. Then in main program using gsub(Global substitution) function of awk. Where substituting space with \ in 1st field only(as per OP's remarks it should be done before :) and printing line then.
An idea for a perl one liner in bash to use \G and \K (similar #CarySwoveland's comment).
perl -pe 's/\G[^ :]*\K /\\ /g' myfile
See this demo at tio.run or a pattern demo at regex101.
This might work for you (GNU sed):
sed -E ':a;s/^([^: ]*) /\1\n/;ta;s/\n/\\ /g' file
Replace spaces before : by newlines then replace newlines by \ 's.
Alternative using the hold space:
sed -E 's/:/\n:/;h;s/ /\\ /g;G;s/\n.*\n//' file
Split the line on the first :.
Amend the front section, remove the middle and append the unadulterated back section.
My answer is ugly and I think RavinderSingh13's answer is THE ONE, but I already took the time to write mine and it works (It's written step by step, but it's a one line command):
I got inspired by HatLess answer:
first get the text before the : with cut (I put the string in a file to make it easy to read, but this works on echo):
cut -d':' -f1 infile
Then replace spaces using sed:
cut -d':' -f1 infile | sed 's/\([a-z]\) /\1\\ /g'
Then echo the output with no new line:
echo -n "$(cut -d':' -f1 infile | sed -e 's/\([a-z]\) /\1\\ /g')"
Add the missing : and what comes after it:
echo -n "$(cut -d':' -f1 infile | sed -e 's/\([a-z]\) /\1\\ /g')" | cat - <(echo -n :) | cat - <(cut -d':' -f2 infile)

sed: struggling with substitution and regex for ^*=

I am running a linux bash script. From stout lines like: /gpx/trk/name=MyTrack1, I want to keep only the end of line after =.
I am struggling to understand why the following sed command is not working as I expect:
echo "/gpx/trk/name=MyTrack1" | sed -e "s/^*=//"
(I also tried)
echo "/gpx/trk/name=MyTrack1" | sed -e "s/^*\=//"
The return is always /gpx/trk/name=MyTrack1 and not MyTrack1
An even simpler way if this is the only structure you are concerned about:
echo "/gpx/trk/name=MyTrack1" | cut -d = -f 2
Simply try:
echo "/gpx/trk/name=MyTrack1" | sed 's/.*=//'
Solution 2nd: With another sed.
echo "/gpx/trk/name=MyTrack1" | sed 's/\(.*=\)\(.*\)/\2/'
Explanation: As per OP's request adding explanation for this code here:
s: Means telling sed to do substitution operation.
\(.*=\): Creating first place in memory to keep this regex's value which tells sed to keep everything in 1st place of memory from starting to till = so text /gpx/trk/name= will be in 1 place.
\(.*\): Creating 2nd place in memory for sed telling it to keep everything now(after the match of 1st one, so this will start after =) and have value in it as MyTrack1
/\2/: Now telling sed to substitute complete line with only 2nd memory place holder which is MyTrack1
Solution 3rd: Or with awk considering that your Input_file is same as shown samples.
echo "/gpx/trk/name=MyTrack1" | awk -F'=' '{print $2}'
Solution 4th: With awk's match.
echo "/gpx/trk/name=MyTrack1" | awk 'match($0,/=.*$/){print substr($0,RSTART+1,RLENGTH-1)}'
$ echo "/gpx/trk/name=MyTrack1" | sed -e "s/^.*=//"
MyTrack1
The regular expression ^.*= matches anything up to and including the last = in the string.
Your regular expression ^*= would match the literal string *= at the start of a string, e.g.
$ echo "*=/gpx/trk/name=MyTrack1" | sed -e "s/^*=//"
/gpx/trk/name=MyTrack1
The * character in a regular expression usually modifies the immediately previous expression so that zero or more of it may be matched. When * occurs at the start of an expression on the other hand, it matches the character *.
Not to take you off the sed track, but this is easy with Bash alone:
$ echo "$s"
/gpx/trk/name=MyTrack1
$ echo "${s##*=}"
MyTrack1
The ##*= pattern removes the maximal pattern from the beginning of the string to the last =:
$ s="1=2=3=the rest"
$ echo "${s##*=}"
the rest
The equivalent in sed would be:
$ echo "$s" | sed -E 's/^.*=(.*)/\1/'
the rest
Where #*= would remove the minimal pattern:
$ echo "${s#*=}"
2=3=the rest
And in sed:
$ echo "$s" | sed -E 's/^[^=]*=(.*)/\1/'
2=3=the rest
Note the difference in * in Bash string functions vs a sed regex:
The * in Bash (in this context) is glob like - itself means 'any character'
The * in a regex refers to the previous pattern and for 'any character' you need .*
Bash has extensive string manipulation functions. You can read about Bash string patterns in BashFAQ.

How to ignore word delimiters in sed

So I have a bash script which is working perfectly except for one issue with sed.
full=$(echo $full | sed -e 's/\b'$first'\b/ /' -e 's/ / /g')
This would work great except there are instances where the variable $first is preceeded immediately by a period, not a blank space. In those instances, I do not want the variable removed.
Example:
full="apple.orange orange.banana apple.banana banana";first="banana"
full=$(echo $full | sed -e 's/\b'$first'\b/ /' -e 's/ / /g')
echo $first $full;
I want to only remove the whole word banana, and not make any change to orange.banana or apple.banana, so how can I get sed to ignore the dot as a delimiter?
You want "banana" that is preceded by beginning-of-string or a space, and followed by a space or end-of-string
$ sed -r 's/(^|[[:blank:]])'"$first"'([[:blank:]]|$)/ /g' <<< "$full"
apple.orange orange.banana apple.banana
Note the use of -r option (for bsd sed, use -E) that enables extended regular expressions -- allow us to omit a lot of backslashes.

sed: mix explicit and regex phrases

I'm trying to write a sed command to remove a specific string followed by two digits. So far I have:
sed -e 's/bizzbuzz\([0-9][0-9]\)//' file.txt
but I cant seem to get the syntax right. Any suggestions?
sed -re 's/bizzbuzz[0-9]{2}//' file.txt
and
sed -re 's/\bbizzbuzz[0-9]{2}\b//' file.txt
if the searched string have word boundary
sed -e 's/bizzbuzz[0-9]\{2\}//' file.txt
if you don't have GNU sed
Your current approach seems like it should work fine:
$ echo 'FOO bizzbuzz56 BAR' | sed -e 's/bizzbuzz\([0-9][0-9]\)//'
FOO BAR
As said in other answer, the syntax seems to be fine (with unnecesary parenthesis).
But may be you want to replace all the strings found in each line ? In that case, you should add a 'g' at the end of the 's' command:
sed -e 's/bizzbuzz\([0-9][0-9]\)//g' file.txt