sed backreference regex to support spaces and tabs - regex

I am trying to write a small sed substitution one-liner to add a hostname to an existing entry in an /etc/hosts file. For example, given this line:
10.80.0.4 host.compute.internal
I want to inject othername into the line like this
10.80.0.4 othername host.compute.internal
I've managed to put the following together using backreferences, and it works fine as long as the items are separated by a space:
cat /etc/hosts |
sed 's/\(^.\+\) \(.\+compute.internal\)/\1 othername \2/g'
However, I need this regex to support tab-separated items as well. Does anyone have any ideas how I can amend the above to support both tabs and spaces?
Any help would be greatly appreciated.
Thanks

This may do what you want:
sed -E 's/(^.+)[[:blank:]](.+compute.internal)/\1 othername \2/'
See Character Classes and Bracket Expressions (sed, a stream editor) for an explanation of [[:blank:]].
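As a quick check of the command above, the sketch below feeds it one space-separated and one tab-separated line. The function name add_othername and the sample lines are made up for the demo. Note that the replacement writes a single space before othername, so a tab separator in the input comes out as a space.

```shell
# Wrap the answer's command in a function (the name is hypothetical).
add_othername() {
  sed -E 's/(^.+)[[:blank:]](.+compute.internal)/\1 othername \2/'
}

# One line separated by a space, one separated by a tab.
space_line=$(printf '10.80.0.4 host.compute.internal\n' | add_othername)
tab_line=$(printf '10.80.0.4\thost.compute.internal\n' | add_othername)

printf '%s\n' "$space_line" "$tab_line"
```

Both lines should come out in the same space-separated form.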

Related

Unable to make the mentioned regular expression to work in sed command

I am trying to make the following regular expression work in a sed command in bash.
^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*))[^>]?$
I know the regular expression is correct and works as I expect, so no help is needed with that. I tested it on an online regular expression tester and it behaves as I expect.
Please find the demo of the above regex in here.
My requirement:
I want to enclose every URL in <>. If the URL is already enclosed, it should be passed through to the result unchanged, as can be seen in the above regex link.
Sample Input: (in the file named websites.txt)
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
Expected Output:(in the file named output.txt)
<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>
What I tried in sed:
Since I'm not well-versed in bash commands, I was previously unable to capture the group properly in sed, but after reading this answer I figured out that the parentheses need to be escaped for capturing to work.
Somewhere I read that look-arounds are not supported in sed (GNU based), so I removed the look-arounds too, but that also didn't work. Without look-arounds, I used this regex and it served my purpose.
Then, this is my latest try with the sed command:
sed 's#^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()#:%_\+.~#?&/=]*))[^>]?$#<\1>#gm;t;d' websites.txt > output.txt
My exact problem:
How can I make the above command work properly? If you run the sample command attached above, you'll see that it does not replace the contents properly: it just dumps the contents of websites.txt to output.txt. In the regex demo linked above, however, it works properly, i.e. it encloses all the unenclosed websites in <>. Any suggestions would be helpful. I would prefer sed, but if possible, could the above command also be converted to awk? If you can help me with that too, I'll be highly obliged. Thanks
After working on it for a long time, I got my sed command to work. Below is the command that worked.
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t' websites.txt > output.txt
You can find the sample implementation of the command in here.
Since the regex already fulfils the requirement of the person I'm writing this for, I only needed help with the command syntax (although any improvements are heartily welcome); I want the command to work with the same regular expression pattern.
Things which I was unaware previously and learnt now:
I didn't know anything about the -E flag. Now I know that -E uses POSIX "extended" syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.
I wasn't clear on whether sed supports look-around. Now I know that it doesn't. Thanks to @dmitri-chubarov. Further reading.
I didn't know that sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.
I didn't know about GNU sed as a specific command-line tool. Thanks to @oguzismail for this. Further reading.
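The difference the -E flag makes to capture groups can be seen in a small sketch. The input string abc and the variable names are only illustrative: without -E, grouping parentheses must be escaped; with -E, they are written plainly.

```shell
# Basic syntax (no -E): capture groups need escaped parentheses.
bre=$(printf 'abc\n' | sed 's/\(b\)/[\1]/')

# Extended syntax (-E): plain parentheses capture.
ere=$(printf 'abc\n' | sed -E 's/(b)/[\1]/')

printf '%s\n' "$bre" "$ere"
```

Both commands should print a[b]c.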
With respect to the command in your answer:
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
Here's a few notes:
Your posted sample input has one URL per line, so AFAIK the gm;t at the end of your sed command is doing nothing useful; either your input is inadequate or your script is wrong.
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits, then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to use a locale-independent character class or specify the locale you need on your command line, depending on your requirements for which characters to match in your regexp.
Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]?.
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match) then we can help you BUT based on a quick google of what a valid URL is it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed so you might want to ask about that in a question tagged with url or similar (with the tags you currently have we can help you implement your regexp but there may be better people to help with defining your regexp).
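The note about [^<]? versus <? can be checked with a deliberately simplified pattern. The sketch below uses http://[a-z.]+ as a stand-in for the full URL regex, which is an assumption made only to keep the example short; the variable names are also made up.

```shell
# Simplified stand-in pattern (http://[a-z.]+), not the full URL regex from the question.

# [^<]? accepts any single leading character, so the stray x is silently dropped.
loose=$(printf 'xhttp://foo.bar\n' | sed -E 's,^[^<]?(http://[a-z.]+)[^>]?$,<\1>,')

# <? only accepts an optional literal <, so the line does not match and is left alone.
strict=$(printf 'xhttp://foo.bar\n' | sed -E 's,^<?(http://[a-z.]+)>?$,<\1>,')

printf '%s\n' "$loose" "$strict"
```

The first command should rewrite the line to <http://foo.bar>, while the second should print the input unchanged.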
If the input file is just a comment followed by a list of URLs, try:
sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt
Output:
<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>
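To try that command without touching a real file, the sample input from the question can be recreated in a temporary file. This is only a sketch; the use of mktemp and the variable names are choices made for the demo.

```shell
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
EOF

# 1d drops the comment line; the two substitutions add a missing < or >.
wrapped=$(sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' "$input")
rm -f "$input"

printf '%s\n' "$wrapped"
```

Lines that already start with < and end with > match neither substitution, so they pass through unchanged.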

Replace last occurrence of space with sed

I need to replace the last occurrence of space in an input file, using sed.
What I came up with is
sed "s/([ ])[0-9]*$/,/g"
However, it does not seem to want to remember the space which it's supposed to replace. Running the command without round brackets works fine (for what it's supposed to do - replace the space and the chain of numbers). When I add the brackets, it does nothing.
Yes, I am aware of this solution; however, when I try to pass \1 to sed, it screams that "\1 not defined in the RE".
Would anyone care to help? It seems to be a simple issue, and I'd be glad to know the solution.
This seemed to work "the first time" (yay) ...
$ sed -e 's/ \([^ ][^ ]*\)$/,\1/' /etc/hosts
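The attempt in the question fails because, in sed's default (basic) syntax, unescaped parentheses are literal characters, so no group is defined for \1. A minimal sketch of both spellings follows; the sample line and variable names are made up for illustration.

```shell
line='alpha beta 123'

# Basic syntax: escape the parentheses to make a capture group.
bre=$(printf '%s\n' "$line" | sed 's/ \([^ ][^ ]*\)$/,\1/')

# Extended syntax (-E): plain parentheses, and + replaces the doubled bracket.
ere=$(printf '%s\n' "$line" | sed -E 's/ ([^ ]+)$/,\1/')

printf '%s\n' "$bre" "$ere"
```

Only the last space on the line is replaced with a comma, because the pattern is anchored at the end of the line.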

Remove all hyperlinks in a text file, linux scripting

I am very new to scripting, but I want to learn it.
What I have to do is remove all occurrences of something like http://* from a text file. I want to do it with the sed command and regular expressions.
Here is what I have come up with so far:
sed 's/http:\/\/.*/ /' < input.txt > output.txt
This code replaces each hyperlink with a space. But the problem is that it also removes the rest of the line.
How can I fix this problem? I have tried adding a space, "http://.* ", or an end of word, "http://.*\>", or other tricks that I found on the internet, but they didn't work.
And is there a better way to do this than using sed?
Sed is a fine way to do this. Try changing your regex to s!http://[^[:space:]]*! !g.
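A small sketch of that substitution on a made-up line with two links: [^[:space:]]* stops at the first whitespace character, so the text after each URL survives, and the g flag handles every URL on the line.

```shell
line='visit http://example.com/page?x=1 now, then http://foo.bar/ later'

# Replace each URL (everything up to the next whitespace) with a single space.
cleaned=$(printf '%s\n' "$line" | sed 's!http://[^[:space:]]*! !g')

printf '%s\n' "$cleaned"
```

Because each URL is replaced by a space while the spaces around it stay in place, the result has runs of several spaces where the links used to be.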

Remove all spaces before a specific symbol/character in a text file using notepad++

Okay, I'm trying to set up a set of tag lists for a translation script, but I've run into a problem: I need to remove the spaces in one of the languages using regex.
The lines are set up like "JP JP":"ENG ENG" and I would like them to be "JPJP":"ENG ENG".
I'm new to regex, so I'm out of ideas for what to try.
Thanks!
Try this expression, which uses a positive lookahead:
\s+(?=\w+":")
You can also use the Unix utility sed:
sed -i 's/old_name/new_name/g' your-file
In your example this should be:
sed -i 's/"JP JP":"ENG ENG"/"JPJP":"ENG ENG"/g' your-file
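sed has no lookahead, so the expression from the first answer cannot be used there as written. A possible sed sketch for the general case, assuming every line starts with the quoted key, is to loop a substitution that removes one run of spaces from the key at a time. The label name a and the sample line are only illustrative.

```shell
line='"JP JP":"ENG ENG"'

# :a marks a loop start; the substitution removes one run of spaces from the
# first quoted string; ta jumps back to :a for as long as a substitution happened.
fixed=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/^("[^" ]*) +([^"]*":)/\1\2/' -e 'ta')

printf '%s\n' "$fixed"
```

The second quoted string is never touched, because the pattern is anchored at the start of the line and stops at the first ":.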

Repeating a regex pattern

First, I don't know if this is actually possible but what I want to do is repeat a regex pattern.
The pattern I'm using is:
sed 's/[^-\t]*\t[^-\t]*\t\([^-\t]*\).*/\1/' films.txt
An input of
250. 7.9 Shutter Island (2010) 110,675
Will return:
Shutter Island (2010)
I'm matching all non-tabs (250.), then a tab, then all non-tabs (7.9), then a tab. Next I backreference the film title, then match all remaining characters (110,675).
It works fine, but I'm learning regex and this looks ugly: the regex [^-\t]*\t is repeated right after itself. Is there any way to repeat it, like you can repeat a character with a{2,2}?
I've tried ([^-\t]*\t){2,2} (and variations), but I'm guessing that is trying to match [^-\t]*\t\t?
Also, if there is any way to make my above code shorter and cleaner, any help would be greatly appreciated.
This works for me:
sed 's/\([^\t]*\t\)\{2\}\([^\t]*\).*/\2/' films.txt
If your sed supports -r you can get rid of most of the escaping:
sed -r 's/([^\t]*\t){2}([^\t]*).*/\2/' films.txt
Change the first 2 to select different fields (0-3).
This will also work:
sed 's/[^\t]\+/\n&/3;s/.*\n//;s/\t.*//' films.txt
Change the 3 to select different fields (1-4).
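The \t escape inside a sed pattern is an extension that not every sed supports. A more portable sketch of the repeated-group idea puts a literal tab into a shell variable and interpolates it into the pattern. The variable names are made up, and the sample line is rebuilt with printf so the tabs are real.

```shell
# A literal tab character, produced by printf.
tab=$(printf '\t')

# The sample line from the question, with real tabs between the fields.
line=$(printf '250.\t7.9\tShutter Island (2010)\t110,675')

# Skip two "non-tabs followed by a tab" groups, then capture the third field.
title=$(printf '%s\n' "$line" | sed -E "s/^([^$tab]*$tab){2}([^$tab]*).*/\2/")

printf '%s\n' "$title"
```

Changing the {2} selects a different field, as described above.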
To use repetition braces and grouping parentheses with sed properly, you may have to escape them with backslashes, like this:
sed 's/\([^-\t]*\t\)\{3\}.*/\1/' films.txt
Yes, this command will work properly with your example.
If the escaping annoys you, you can add the -r option, which enables extended regex mode, and drop the backslash escapes on the brackets:
sed -r 's/([^-\t]*\t){3}.*/\1/' films.txt
I found that this is almost the same as Dennis Williamson's answer, but I'm leaving it because it's a shorter expression that does the same thing.
I think you might be going about this the wrong way. If you simply want to extract the name of the film and its release year, then you could try this regex:
(?:\t)[\w ()]+(?:\t)
As seen in place here:
http://regexr.com?2sd3a
Note that it matches a tab character at the beginning and end of the actual desired string, but doesn't include them in the matching group.
You can repeat things by putting them in parentheses, like this:
([^-\t]*\t){2,2}
And the full pattern to match the title would be this:
([^-\t]*\t){2,2}([^-\t]+).*
You said you tried it. I'm not sure what is different, but the above worked for me on your sample data.
why are you doing things the hard way??
$ awk '{$1=$2=$NF=""}1' file
Shutter Island (2010)
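Clearing fields this way leaves their separators behind, so the printed line carries leading and trailing spaces. If the file really is tab-delimited, a hedged alternative is to tell awk to split on tabs and print the third field directly; the sample line below is rebuilt with printf so the tabs are real.

```shell
# Split on tabs only, so the spaces inside the title stay within one field.
title=$(printf '250.\t7.9\tShutter Island (2010)\t110,675\n' |
  awk -F'\t' '{ print $3 }')

printf '%s\n' "$title"
```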
If this is a tab separated file with a regular format I'd use cut instead of sed
cut -d' ' -f3 films.txt
Note there's a single tab between the quotes after the -d which can be typed at the shell prompt by typing ctrl+v first, i.e. ctrl+v ctrl+i
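Since tab is already the default delimiter for cut, the -d option can also be left out altogether. A minimal sketch on the sample line, rebuilt with printf so the tabs are real:

```shell
# cut splits on tabs by default, so -f3 alone selects the title field.
title=$(printf '250.\t7.9\tShutter Island (2010)\t110,675\n' | cut -f3)

printf '%s\n' "$title"
```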