find, replace and escape string linux - replace

I'm trying find all instances of a string and replace them, the original string looks like this:
<li>Some Text Here</li>
the replacement looks like this:
<li>Something new</li>
What would be a good way to do this in the CLI
Thanks

I think the sed command would do the job nicely, provided your onclick handler and the "Some Text Here" don't include any nested HTML tags that the regex might confuse for the closing tags of the replacement string.

Searching and replacing in HTML is a guaranteed headache. At some point someone will pass you malformed HTML and even the most careful crafted regexp will fail horribly.
I'd definitely work with the HTML at the highest possible level of abstraction, preferably a homegrown tool that uses DOM or SAX.
For a quick fix
A command line tool using XSL/XSLT

Standard way of searching and replacing in linux command line is sed, eg:
sed -i yourfilename -e "s%textyouwantoreplace%newtext%g"
The only thing is, sed uses regular expressions, so you'll need to escape stuff that might be a wildcard, by putting a \ before it, eg write \$ instead of $.
-i means: edit the file in-place
-e means: the next thing in the commandline is an expression to evaluate, ie the whole thing in quotes after it
's' means 'substitute'
'g' means 'global' substitution

Related

Unable to make the mentioned regular expression to work in sed command

I am trying to make the following regular expressions to work in sed command in bash.
^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*))[^>]?$
I know the regular expression is correct and it is working as I expected. So; there is no help needed with that. I tested it on online regular expressions tester and it is working as per my expectations.
Please find the demo of the above regex in here.
My requirement:
I want to enclose every url inside <>. If the url is already enclosed; then append it to the result as can be seen in the above regex link.
Sample Input:(in file named website.txt)
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
Expected Output:(in the file named output.txt)
<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>
What I tried in sed:
Since I'm not well-versed in bash commands; so previously I was not able to capture the group properly in sed but after reading this answer; I figured out that we need to escape the parenthesis to be able to capture it.
Somewhere; I read that look-arounds are not supported in sed(GNU based) so I removed lookarounds too; but that also didn't worked. If it doesn't support look-arounds then I used this regex and it served my purpose.
Then; this is my latest try with sed command:
sed 's#^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()#:%_\+.~#?&/=]*))[^>]?$#<\1>#gm;t;d' websites.txt > output.txt
My exact problem:
How can I make the above command to work properly. If you'll run the command sample I attached above in point-3; you'd see it is not replacing the contents properly. It is just dumping the contents of websites.txt to output.txt. But in regex demo; attached above it is working properly i.e. enclosing all the unenclosed websites inside <>. Any suggestions would be helpful. I preferably want it in sed but if it is possible can I convert the above command in awk also? If you can please help me with that too; I'll be highly obliged. Thanks
After working for long, I made my sed command to work. Below is the command which worked.
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t' websites.txt > output.txt
You can find the sample implementation of the command in here.
Since, the regex has already fulfilled the requirement of the person for whom I'm writing this requirement for; I needed to get help only regarding the command syntax (although any improvements are heartily welcomed); I want the command to work with the same regular expression pattern.
Things which I was unaware previously and learnt now:
I didn't knew anything about -E flag. Now I know; that -E uses POSIX "extended" syntax ("ERE"). Thanks to #GordonDavisson and #Sundeep. Further reading.
I didn't know with clarity that sed doesn't supports look-around. But now I know sed doesn't support look-around. Thanks to #dmitri-chubarov. Further reading
I didn't knew sed doesn't support non-capturing groups too. Thanks to #Sundeep for solving this part. Further Reading
I didn't knew about GNU sed as a specific command line tool. Thanks to #oguzismail for this. Further reading.
With respect to the command in your answer:
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
Here's a few notes:
Your posted sample input has 1 URL per line so AFAIK the gm;t at the end of your sed command is doing nothing useful so either your input is inadequate or your script is wrong.
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to use a locale-independent character class or specify the locale you need on your command line depending in your requirements for which characters to match in your regexp.
Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]?.
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match) then we can help you BUT based on a quick google of what a valid URL is it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed so you might want to ask about that in a question tagged with url or similar (with the tags you currently have we can help you implement your regexp but there may be better people to help with defining your regexp).
If the input file is just a comment followed by a list of URLs, try:
sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt
Output:
<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>

Sed script to to rewrite certain strings

I'm dealing with a body of XML files containing unstructured texts with semantic markup for personal names.
For reasons to do with the stylesheet that will eventually show them via a web application, I need to replace:
<persName>Fred</persName>'s
<persName>Wilma</persName>'s
with
<persName>Fred's</persName>
<persName>Wilma's</persName>
I have a single line in a shell script, being run in Gitbash for Windows, below. It runs OK, but has no effect. I suppose I'm missing something obvious, perhaps to do with escaping characters, but any help appreciated.
sed -i "s/<\/persName>\'s/\'s<\/persName>/g" test.xml
You may use
sed -i "s,</persName>'s,'s</persName>,g" test.xml
Details
s - we want to replace
, - a delimiter
</persName>'s - this string to find
, - delimiter
's</persName> - replace with this string
, - delimiter
g - multiple times if more than one is found
The -i option makes the replacements directly in the file.
Note that you do not have to escape ' when defining the sed command inside a double quoted string.
It is a good idea to use a delimiter char other than the common / if there are / chars inside the regex or/and replacement pattern.
The comment on your question suggests an easier solution, but I guess, that there might be names where the suffix 's differs, like names ending with an s. So I chose a solution where you grab what's right and put it in the middle.
As separator for the search and replace command in sed you can choose whatever you want. I've chosen #, so you don't have to escape the backslashes in the text. The escaped parantheses store what's inside in variables \1 and \2.
sed 's#<persName>\(.*\)</persName>\(.*\)#<persName>\1\2</persName>#g' testfile
Result:
<persName>Fred's</persName>
<persName>Wilma's</persName>
If you want to replace it in file, you can use the -i parameter. But be sure to check the result first.

Bash sed how to mask brackets

I want to search for 'start') in the file /etc/init.d/fhem, and write code which I read from a textfile to that file after the statement above. At the moment I get a message that I have to close the bracket from 'start'). I think I have to mask it properly, but so far no luck with trying that. May someone give me the missing link?
CocConf=$(<COC.txt)#Reading Cod from File to insert in other file
sed -r "\'start\')/a $CocConf" /etc/init.d/fhem #Inserting said Code
You're missing the / before the regular expression. And there's no need to escape single quotes inside double quotes. But when you use extended regexps, you need to escape parentheses. The a command also requires a backslash after it, and the text to be added must be on the next line.
sed -r "/start\)/a\
$CocConf" /etc/init.d/fhem

sed regex stop at first match

I want to replace part of the following html text (excerpt of a huge file), to update old forum formatting (resulting from a very bad forum porting job done 2 years ago) to regular phpBB formatting:
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>
this should be filtered into:
[quote=User]
I used the following regex in sed
s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g
this works on the given example, but in the actual file, several quotes like this can be in one line. In that case sed is too greedy, and places everything between the first and the last match in the [quote=...] tag. I cannot seem to make it replace every occurance of this pattern in the line... (I don't think there's any nested quotes, but that would make it even more difficult)
You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.
The easiest way to do this is simply to use Perl in the first place.
If you have an existing sed script, you can translate it into Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. In most cases the \1 is grandfathered, but in general you want $1 there:
s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;
Notice I have removed the backslash from the front of the parens. Another advantage of using Perl is that it uses the sane egrep-style regexes (like awk), not the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes all over the place.
Another advantage to using Perl is you can use paired, nestable delimiters to avoid ugly backslashes. For example:
s{<blockquote.*?written by (.*?)</i>}
{[quote=$1]}g;
Other advantage include that Perl gets along excellently well with UTF-8 (now the Web’s majority encoding form), and that you can do multiline matches without the extreme pain that sed requires for that. For example:
$ perl -CSD -00 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...
The -CSD makes it treat stdin, stdout, and files as UTF-8. The -00 makes it read the entire file in one fell slurp, and the /s makes the dot cross newline boundaries as need be.
I don't think sed supports non-greedy match. You can try perl though:
perl -pe 's/<blockquote.*?written by \(.*\)<\/i>/[quote=\1]/g' filename
This might work for you:
sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g' file
Explanation:
If a line doesn't contain the pattern then skip it. /<blockquote.*written by .*<\/i>/!b
Change the front of the pattern into a newline globally throughout the line. s/<blockquote/\n/g
Globally replace the newline followed by the remaining pattern using a [^\n]* instead of .*. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
Revert those newlines not changed to the original front pattern. s/\n/\<blockquote/g

Regular expression with sed

I'm having hard time selecting from a file using a regular expression. I'm trying to replace a specific text in the file which is full of lines like this.
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is the i don't know how to select only the bolded text since i need to use .wav in my regexp and the filename and the location of the file is also going to be different.
Hope you can help
Best regards,
Jökull
This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
sed processes lines one at a time, so you can omit the global option and it will only change the first 'train' on each line
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'
You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of that work and the \> tells it to match the end of the word.
The g at the end means global so it performs the match and replace on the entire line and does not stop after the first successful match as it would normally do without g.