"sed" special characters handling - regex

we have an sed command in our script to replace the file content with values from variables
for example..
export value="dba01upc\Fusion_test"
sed -i "s%{"sara_ftp_username"}%$value%g" /home_ldap/user1/placeholder/Sara.xml
the sed command ignores the special characters like '\' and replacing with string "dba01upcFusion_test" without '\'
It works If I do the export like export value='dba01upc\Fusion_test' (with '\' surrounded with ‘’).. but unfortunately our client want to export the original text dba01upc\Fusion_test with single/double quotes and he don’t want to add any extra characters to the text.
Can any one let me know how to make sed to place the text with special characters..
Before Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account >
<ser:description/>
<ser:static-account>
<con:username>{sara_ftp_username}</con:username>
</ser:static-account>
</ser:service-account>
After Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account>
<ser:description/>
<ser:static-account>
<con:username>dba01upcFusion_test</con:username>
</ser:static-account>
</ser:service-account>
Thanks in advance

You cannot robustly solve this problem with sed. Just use awk instead:
awk -v old="string1" -v new="string2" '
idx = index($0,old) {
$0 = substr($0,1,idx-1) new substr($0,idx+length(old))
}
1' file
Ah, #mklement0 has a good point - to stop escapes from being interpreted you need to pass in the values in the arg list along with the file names and then assign the variables from that, rather than assigning values to the variables with -v (see the summary I wrote a LONG time ago for the comp.unix.shell FAQ at http://cfajohnson.com/shell/cus-faq-2.html#Q24 but apparently had forgotten!).
The following will robustly make the desired substitution (a\ta -> e\tf) on every search string found on every line:
$ cat tst.awk
BEGIN {
old=ARGV[1]; delete ARGV[1]
new=ARGV[2]; delete ARGV[2]
lgthOld = length(old)
}
{
head = ""; tail = $0
while ( idx = index(tail,old) ) {
head = head substr(tail,1,idx-1) new
tail = substr(tail,idx+lgthOld)
}
print head tail
}
$ cat file
a\ta a a a\ta
$ awk -f tst.awk 'a\ta' 'e\tf' file
e\tf a a e\tf
The white space in file is tabs. You can shift ARGV[3] down and adjust ARGC if you like but it's not necessary in most cases.

Update with the benefit of hindsight, to present options:
Update 2: If you're intent on using sed, see the - somewhat cumbersome, but now robust and generic - solution below.
If you want a robust, self-contained awk solution that also properly handles both arbitrary search and replacement strings (but cannot incorporate regex features such as word-boundary assertions), see Ed Morton's answer.
If you want a pure bash solution and your input files are small and preserving multiple trailing newlines is not important, see Charles Duffy's answer.
If you want a full-fledged third-party templating solution, consider, for instance, j2cli, a templating CLI for Jinja2 - if you have Python and pip, install with sudo pip install j2cli.
Simple example (note that since the replacement string is provided via a file, this may not be appropriate for sensitive data; note the double braces ({{...}})):
value='dba01upc\Fusion_test'
echo "sara_ftp_username=$value" >data.env
echo '<con:username>{{sara_ftp_username}}</con:username>' >tmpl.xml
j2 tmpl.xml data.env # -> <con:username>dba01upc\Fusion_test</con:username>
If you use sed, careful escaping of both the search and the replacement string is required, because:
As Ed Morton points out in a comment elsewhere, sed doesn't support use of literal strings as replacement strings - it invariably interprets special characters/sequences in the replacement string.
Similarly, the search string literal must be escaped in a way that its characters aren't mistaken for special regular-expression characters.
The following uses two generic helper functions that perform this escaping (quoting) that apply techniques explained at "Is it possible to escape regex characters reliably with sed?":
#!/usr/bin/env bash
# SYNOPSIS
# quoteRe <text>
# DESCRIPTION
# Quotes (escapes) the specified literal text for use in a regular expression,
# whether basic or extended - should work with all common flavors.
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# '
# SYNOPSIS
# quoteSubst <text>
# DESCRIPTION
# Quotes (escapes) the specified literal string for safe use as the substitution string (the 'new' in `s/old/new/`).
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
# The search string.
search='{sara_ftp_username}'
# The replacement string; a demo value with characters that need escaping.
value='&\1%"'\'';<>/|dba01upc\Fusion_test'
# Use the appropriately escaped versions of both strings.
sed "s/$(quoteRe "$search")/$(quoteSubst "$value")/g" <<<'<el>{sara_ftp_username}</el>'
# -> <el>&\1%"';<>/|dba01upc\Fusion_test</el>
Both quoteRe() and quoteSubst() correctly handle multi-line strings.
Note, however, given that sed reads a single line at at time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
quoteRe() is always safe to use with a command substitution ($(...)), because it always returns a single-line string (newlines in the input are encoded as '\n').
By contrast, if you use quoteSubst() with a string that has trailing newlines, you mustn't use $(...), because the latter will remove the last trailing newline and therefore break the encoding (since quoteSubst() \-escapes actual newlines, the string returned would end in a dangling \).
Thus, for strings with trailing newlines, use IFS= read -d '' -r escapedValue < <(quoteSubst "$value") to read the escaped value into a separate variable first, then use that variable in the sed command.

This can be done with bash builtins alone -- no sed, no awk, etc.
orig='{sara_ftp_username}' # put the original value into a variable
new='dba01upc\Fusion_test' # ...no need to 'export'!
contents=$(<Sara.xml) # read the file's content into
new_contents=${contents//"$orig"/$new} # use parameter expansion to replace
printf '%s' "$new_contents" >Sara.xml # write new content to disk
See the relevant part of BashFAQ #100 for information on using parameter expansion for string substitution.

Related

BASH escaping double quotes within single quotes

I'm trying to write a bash function that would escape all double quotes within single quotes, eg:
'I need to escape "these" quotes with backslashes'
would become
'I need to escape \"these\" quotes with backslashes'
My take on it was:
Find pairs of single quotes in the input and extract them with grep
Pipe into sed, escape double quotes
Sed again the whole input and replace grep match with sedded match
I managed to get it working to the part of having correctly escaped quotes section, but replacing it in the whole input fails.
The script code copypaste:
# $1 - Full name, $2 - minified name
adjust_quotes ()
{
SINGLE_QUOTES=`grep -Eo "'.*'" $2`
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
sed -r "s|'.*'|$ESCAPED_QUOTES|g" "$2" > "$2.escaped"
mv "$2.escaped" $2
echo "Quotes escaped within single quotes on $2"
}
Random additional questions:
In the console, escaping the quote with only two backslashes works, but when code is put in the script - I need four. I'd love to know
Could I modify this code into a loop to escape all pairs of single quotes, one after another until EOF?
Thanks!
P.S. I know this would probably be easier to do in eg. python, but I really need to keep it in bash.
Using BASH string replacement:
s='I need to escape "these" quotes with backslashes'
r="${s//\"/\\\"}"
echo "$r"
I need to escape \"these\" quotes with backslashes
Here's a pure bash solution, which does the transformation on stdin, printing to stdout. It reads the entire input into memory, so it won't work with really enormous files.
escape_enclosed_quotes() (
IFS=\'
read -d '' -r -a fields
for ((i=1; i<${#fields[#]}; i+=2)); do
fields[i]=${fields[i]//\"/\\\"}
done
printf %s "${fields[*]}"
)
I deliberately enclosed the body of the function in parentheses rather than braces, in order to force the body to run in a subshell. That limits the modification of IFS to the body, as well as implicitly making the variables used local.
The function uses the read builtin to read the entire input (since the line delimiter is set to NUL with -d '') into an array (-a) using a single quote as the field separator (IFS=\'). The result is that the parts of the input surrounded with single quotes are in the odd positions of the array, so the function loops over the odd indices to do the substitution only for those fields. I use bash's find-and-replace syntax instead of deferring to an external utility like sed.
This being bash, there are a couple of gotchas:
If the file contains a NUL, the rest of the file will be ignored.
If the last line of the file does not end with a newline, and the last character of that line is a single quote, it will not be output.
Both of the above conditions are impossible in a portable text file, so it's probably OK. All the same, worth taking note.
The supplementary question: why are the extra backslashes needed in
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
Answer: It has nothing to do with that line being in a script. It has to do with your use of backticks (...) for command substitution, and the idiosyncratic and often unpredictable handling of backslashes inside backticks. This syntax is deprecated. Do not use it. (Not even if you see someone else using it in some random example on the internet.) If you had used the recommended $(...) syntax for command substitution, it would have worked as expected:
ESCAPED_QUOTES=$(echo $SINGLE_QUOTES | sed 's|"|\\"|g')
(More information is in the Bash FAQ linked above.)

Grep invert on string matched, not line matched

I'll keep this explanation of why I need help to a mimimum. One of my file directories got hacked through XSS and placed a long string at the beginning of all php files. I've tried to use sed to replace the string with nothing but it won't work because the pattern to match includes many many characters that would need to be escaped.
I found out that I can use fgrep to match a fixed string saved in a pattern file, but I'd like to replace the matched string (NOT THE LINE) in each file, but grep's -v inverts the result on the line, rather than the end of the matched string.
This is the command I'm using on an example file that contains the hacked
fgrep -v -f ~/hacked-string.txt example.php
I need the output to contain the <?php that's at the end of the line (sometimes it's a <style> tag), but the -v option inverts at the end of that line, so the output doesn't contain the <?php at the beginning.
NOTE
I've tried to use the -o or --only-matching which outputs nothing instead:
fgrep -f ~/hacked-string.txt example.php --only-matching -v
Is there another option in grep that I can use to invert on the end of the matched pattern, rather than the line where the pattern was matched? Or alternatively, is there an easier option to replace the hacked string in all .php files?
Here is a small snippet of what's in hacked-string.txt (line breaks added for readability):
]55Ld]55#*<%x5c%x7825bG9}:}.}-}!#*<%x55c%x7825)
dfyfR%x5c%x7827tfs%x5c%x7c%x785c%x5c%x7825j:^<!
%x5c%x7825w%x5c%x7860%x5c%x785c^>Ew:25tww**WYsb
oepn)%x5c%x7825bss-%x5c%x7825r%x5c%x7878B%x5c%x
7825h>#]y3860msvd},;uqpuft%x5c%x7860msvd}+;!>!}
%x5c%x7827;!%x5c%x7825V%x5c%x7827{ftmfV%x5e56+9
9386c6f+9f5d816:+946:ce44#)zbssb!>!ssbnpe_GMFT%
x5c5c%x782f#00#W~!%x5c%x7825t2w)##Qtjw)#]82#-#!
#-%x5c%x7825tmw)%x5c%x78w6*%x5c%x787f_*#fubfsdX
k5%x5c%xf2!>!bssbz)%x5c%x7824]25%x5c%x7824-8257
-K)fujs%x5c%x7878X6<#o]o]Y%x5c%x78257;utpI#7>-1
-bubE{h%x5c%x7825)sutcvt)!gj!|!*bubEpqsut>j%x5c
%x7825!*72!%x5c%x7827!hmg%x5c%x78225>2q%x5c%x7
Thanks in advance!
I think what you are asking is this:
"Is it possible to use the grep utility to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
In that case, the answer is "No".
What I think you wanted to ask was:
"What is the easiest way to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
Here's one reasonably simple solution:
delete_string() {
awk -v s="$the_string" '{while(i=index($0,s))$0=substr($0,1,i-1)substr($0,i+length(s))}1'
}
delete_string 'some_hideous_string_with*!"_inside' < original_file > new_file
The shell syntax is slightly fragile; it will break if the string contains an apostrophe ('). However, you can read a raw string from stdin into a variable with:
$ IFS= read -r the_string
absolutely anything here
which will work with any string which doesn't contain a newline or a NUL character. Once you have the string in a variable, you can use the above function:
delete_string "$the_string" < original_file > new_file
Here's another possible one liner, using python:
delete_string() {
python -c 'import sys;[sys.stdout.write(l.replace(r"""'"$1"'""","")) for l in sys.stdin]'
}
This won't handle strings which have three consecutive quotes (""").
Is the hacked string the same in every file?
If the length of hacked string in chars was 1234 then you can use
tail -c +1235 file.php > fixed-file.php
for each infected file.
Note that tail c +1235 tells to start output at 1235th character of the input file.
With perl:
perl -i.hacked -pe "s/\Q$(<hacked-string.txt)\E//g" example.php
Notes:
The $(<file) bit is a bash shortcut to read the contents of a file.
The \Q and \E bits are from perl, they treat the stuff in between as plain characters, ignoring regex metachars.
The -i.hacked option will edit the file in-place, creating a backup "example.php.hacked"

How to swap text based on patterns at once with sed?

Suppose I have 'abbc' string and I want to replace:
ab -> bc
bc -> ab
If I try two replaces the result is not what I want:
echo 'abbc' | sed 's/ab/bc/g;s/bc/ab/g'
abab
So what sed command can I use to replace like below?
echo abbc | sed SED_COMMAND
bcab
EDIT:
Actually the text could have more than 2 patterns and I don't know how many replaces I will need. Since there was a answer saying that sed is a stream editor and its replaces are greedily I think that I will need to use some script language for that.
Maybe something like this:
sed 's/ab/~~/g; s/bc/ab/g; s/~~/bc/g'
Replace ~ with a character that you know won't be in the string.
I always use multiple statements with "-e"
$ sed -e 's:AND:\n&:g' -e 's:GROUP BY:\n&:g' -e 's:UNION:\n&:g' -e 's:FROM:\n&:g' file > readable.sql
This will append a '\n' before all AND's, GROUP BY's, UNION's and FROM's, whereas '&' means the matched string and '\n&' means you want to replace the matched string with an '\n' before the 'matched'
sed is a stream editor. It searches and replaces greedily. The only way to do what you asked for is using an intermediate substitution pattern and changing it back in the end.
echo 'abcd' | sed -e 's/ab/xy/;s/cd/ab/;s/xy/cd/'
Here is a variation on ooga's answer that works for multiple search and replace pairs without having to check how values might be reused:
sed -i '
s/\bAB\b/________BC________/g
s/\bBC\b/________CD________/g
s/________//g
' path_to_your_files/*.txt
Here is an example:
before:
some text AB some more text "BC" and more text.
after:
some text BC some more text "CD" and more text.
Note that \b denotes word boundaries, which is what prevents the ________ from interfering with the search (I'm using GNU sed 4.2.2 on Ubuntu). If you are not using a word boundary search, then this technique may not work.
Also note that this gives the same results as removing the s/________//g and appending && sed -i 's/________//g' path_to_your_files/*.txt to the end of the command, but doesn't require specifying the path twice.
A general variation on this would be to use \x0 or _\x0_ in place of ________ if you know that no nulls appear in your files, as jthill suggested.
Here is an excerpt from the SED manual:
-e script
--expression=script
Add the commands in script to the set of commands to be run while processing the input.
Prepend each substitution with -e option and collect them together. The example that works for me follows:
sed < ../.env-turret.dist \
-e "s/{{ name }}/turret$TURRETS_COUNT_INIT/g" \
-e "s/{{ account }}/$CFW_ACCOUNT_ID/g" > ./.env.dist
This example also shows how to use environment variables in your substitutions.
This might work for you (GNU sed):
sed -r '1{x;s/^/:abbc:bcab/;x};G;s/^/\n/;:a;/\n\n/{P;d};s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/;ta;s/\n(.)/\1\n/;ta' file
This uses a lookup table which is prepared and held in the hold space (HS) and then appended to each line. An unique marker (in this case \n) is prepended to the start of the line and used as a method to bump-along the search throughout the length of the line. Once the marker reaches the end of the line the process is finished and is printed out the lookup table and markers being discarded.
N.B. The lookup table is prepped at the very start and a second unique marker (in this case :) chosen so as not to clash with the substitution strings.
With some comments:
sed -r '
# initialize hold with :abbc:bcab
1 {
x
s/^/:abbc:bcab/
x
}
G # append hold to patt (after a \n)
s/^/\n/ # prepend a \n
:a
/\n\n/ {
P # print patt up to first \n
d # delete patt & start next cycle
}
s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/
ta # goto a if sub occurred
s/\n(.)/\1\n/ # move one char past the first \n
ta # goto a if sub occurred
'
The table works like this:
** ** replacement
:abbc:bcab
** ** pattern
Tcl has a builtin for this
$ tclsh
% string map {ab bc bc ab} abbc
bcab
This works by walking the string a character at a time doing string comparisons starting at the current position.
In perl:
perl -E '
sub string_map {
my ($str, %map) = #_;
my $i = 0;
while ($i < length $str) {
KEYS:
for my $key (keys %map) {
if (substr($str, $i, length $key) eq $key) {
substr($str, $i, length $key) = $map{$key};
$i += length($map{$key}) - 1;
last KEYS;
}
}
$i++;
}
return $str;
}
say string_map("abbc", "ab"=>"bc", "bc"=>"ab");
'
bcab
May be a simpler approach for single pattern occurrence you can try as below:
echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
My output:
~# echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
bcab
For multiple occurrences of pattern:
sed 's/\(ab\)\(bc\)/\2\1/g'
Example
~# cat try.txt
abbc abbc abbc
bcab abbc bcab
abbc abbc bcab
~# sed 's/\(ab\)\(bc\)/\2\1/g' try.txt
bcab bcab bcab
bcab bcab bcab
bcab bcab bcab
Hope this helps !!
echo "C:\Users\San.Tan\My Folder\project1" | sed -e 's/C:\\/mnt\/c\//;s/\\/\//g'
replaces
C:\Users\San.Tan\My Folder\project1
to
mnt/c/Users/San.Tan/My Folder/project1
in case someone needs to replace windows paths to Windows Subsystem for Linux(WSL) paths
If replacing the string by Variable, the solution doesn't work.
The sed command need to be in double quotes instead on single quote.
#sed -e "s/#replacevarServiceName#/$varServiceName/g" -e "s/#replacevarImageTag#/$varImageTag/g" deployment.yaml
Here is an awk based on oogas sed
echo 'abbc' | awk '{gsub(/ab/,"xy");gsub(/bc/,"ab");gsub(/xy/,"bc")}1'
bcab
I believe this should solve your problem. I may be missing a few edge cases, please comment if you notice one.
You need a way to exclude previous substitutions from future patterns, which really means making outputs distinguishable, as well as excluding these outputs from your searches, and finally making outputs indistinguishable again. This is very similar to the quoting/escaping process, so I'll draw from it.
s/\\/\\\\/g escapes all existing backslashes
s/ab/\\b\\c/g substitutes raw ab for escaped bc
s/bc/\\a\\b/g substitutes raw bc for escaped ab
s/\\\(.\)/\1/g substitutes all escaped X for raw X
I have not accounted for backslashes in ab or bc, but intuitively, I would escape the search and replace terms the same way - \ now matches \\, and substituted \\ will appear as \.
Until now I have been using backslashes as the escape character, but it's not necessarily the best choice. Almost any character should work, but be careful with the characters that need escaping in your environment, sed, etc. depending on how you intend to use the results.
Every answer posted thus far seems to agree with the statement by kuriouscoder made in his above post:
The only way to do what you asked for is using an intermediate
substitution pattern and changing it back in the end
If you are going to do this, however, and your usage might involve more than some trivial string (maybe you are filtering data, etc.), the best character to use with sed is a newline. This is because since sed is 100% line-based, a newline is the one-and-only character you are guaranteed to never receive when a new line is fetched (forget about GNU multi-line extensions for this discussion).
To start with, here is a very simple approach to solving your problem using newlines as an intermediate delimiter:
echo "abbc" | sed -E $'s/ab|bc/\\\n&/g; s/\\nab/bc/g; s/\\nbc/ab/g'
With simplicity comes some trade-offs... if you had more than a couple variables, like in your original post, you have to type them all twice. Performance might be able to be improved a little bit, too.
It gets pretty nasty to do much beyond this using sed. Even with some of the more advanced features like branching control and the hold buffer (which is really weak IMO), your options are pretty limited.
Just for fun, I came up with this one alternative, but I don't think I would have any particular reason to recommend it over the one from earlier in this post... You have to essentially make your own "convention" for delimiters if you really want to do anything fancy in sed. This is way-overkill for your original post, but it might spark some ideas for people who come across this post and have more complicated situations.
My convention below was: use multiple newlines to "protect" or "unprotect" the part of the line you're working on. One newline denotes a word boundary. Two newlines denote alternatives for a candidate replacement. I don't replace right away, but rather list the candidate replacement on the next line. Three newlines means that a value is "locked-in", like your original post way trying to do with ab and bc. After that point, further replacements will be undone, because they are protected by the newlines. A little complicated if I don't say so myself... ! sed isn't really meant for much more than the basics.
# Newlines
NL=$'\\\n'
NOT_NL=$'[\x01-\x09\x0B-\x7F]'
# Delimiters
PRE="${NL}${NL}&${NL}"
POST="${NL}${NL}"
# Un-doer (if a request was made to modify a locked-in value)
tidy="s/(\\n\\n\\n${NOT_NL}*)\\n\\n(${NOT_NL}*)\\n(${NOT_NL}*)\\n\\n/\\1\\2/g; "
# Locker-inner (three newlines means "do not touch")
tidy+="s/(\\n\\n)${NOT_NL}*\\n(${NOT_NL}*\\n\\n)/\\1${NL}\\2/g;"
# Finalizer (remove newlines)
final="s/\\n//g"
# Input/Commands
input="abbc"
cmd1="s/(ab)/${PRE}bc${POST}/g"
cmd2="s/(bc)/${PRE}ab${POST}/g"
# Execute
echo ${input} | sed -E "${cmd1}; ${tidy}; ${cmd2}; ${tidy}; ${final}"

Egrep expression: how to unescape single quotes when reading from file?

I need to use egrep to obtain an entry in an index file.
In order to find the entry, I use the following command:
egrep "^$var_name" index
$var_name is the variable read from a var list file:
while read var_name; do
egrep "^$var_name" index
done < list
One of the possible keys comes usually in this format:
$ERROR['SOME_VAR']
My index file is in the form:
$ERROR['SOME_VAR'] --> n
Where n is the line where the variable is found.
The problem is that $var_name is automatically escaped when read. When I enable the debug mode, I get the following command being executed:
+ egrep '^$ERRORS['\''SELECT_COUNTRY'\'']' index
The command above doesn't work, because egrep will try to interpret the pattern.
If I don't use the extended version, using grep or fgrep, the command will work only if I remove the ^ anchor:
grep -F "$var_name" index # this actually works
The problem is that I need to ensure that the match is made at the beginning of the line.
Ideas?
set -x shows the command being executed in shell notation.
The backslashes you see do not become part of the argument, they're just printed by set -x to show the executed command in a copypastable format.
Your problem is not too much escaping, but too little: $ in regex means "end of line", so ^$ERROR will never match anything. Similarly, [ ] is a character range, and will not match literal square brackets.
The correct regex to match your pattern would be ^\$ERROR\['SOME VAR'], equivalent to the shell argument in egrep "^\\\$ERROR\['SOME_VAR']".
Your options to fix this are:
If you expect to be able to use regex in your input file, you need to include regex escapes like above, so that your patterns are valid.
If you expect to be able to use arbitrary, literal strings, use a tool that can match flexibly and literally. This requires jumping through some hoops, since UNIX tools for legacy reasons are very sloppy.
Here's one with awk:
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 0, length(var)) == var' index
done < list
It passes the string in through the environment (because -v is sloppy) and then matches literally against the string from the start of the input.
Here's an example invocation:
$ cat script
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 0, length(var)) == var' index
done < list
$ cat list
$ERRORS['SOME_VAR']
\E and \Q
'"'%##%*'
$ cat index
hello world
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
etc
$ bash script
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
You can use printf "%q":
while read -r var_name; do
egrep "^$(printf "%q\n" "$var_name")" index
done < list
Update: You can also do:
while read -r var_name; do
egrep "^\Q$var_name\E" index
done < list
Here \Q and \E are used to make string in between a literal string removing all special meaning of regex symbols.

sed - Include newline in pattern

I am still a noob to shell scripts but am trying hard. Below, is a partially working shell script which is supposed to remove all JS from *.htm documents by matching tags and deleting their enclosed content. E.g. <script src="">, <script></script> and <script type="text/javascript">
find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done
The problem with this is script is that because sed reads text input line-by-line, this script will not work as expected with new-lines. Running:
<script>
//Foo
</script>
will remove the first script tag but will omit the "foo" and closing tag which I don't want.
Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?
Assuming that you have <script> tags on different lines, e.g. something like:
foo
bar
<script type="text/javascript">
some JS
</script>
foo
the following should work:
sed '/<script/,/<\/script>/d' inputfile
This awk script will look for the <script*> tag, set the in variable and then read the next line. When the closing </script*> tag is found the variable is set to zero. The final print pattern outputs all lines if the in variable is zero.
awk '/<script.*>/ { in=1; next }
/<\/script.*>/ { if (in) in=0; next }
{ if (!in) print; } ' $1
As you mentioned, the issue is that sed processes input line by line.
The simplest workaround is therefore to make the input a single line, e.g. replacing newlines with a character which you are confident doesn't exist in your input.
One would be tempted to use tr :
… |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n'
However "currently tr fully supports only single-byte characters", and to be safe you probably want to use some improbable character like ˇ, for which tr is of no help.
Fortunately, the same thing can be achieved with sed, using branching.
Back on our <script>…</script> example, this does work and would be (according to the previous link) cross-platform :
… |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'
Or in a more condensed form if you use GNU sed and don't need cross-platform compatibility :
… |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'
Please refer to the linked answer under "using branching" for details about the branching part (:a;N;$!ba;). The remaining part is straightforward :
s/\n/ˇ/g replaces all newlines with ˇ ;
s~<script>.*</script>~~g removes what needs to be removed (beware that it requires some securing for actual use : as is it will delete everything between the first <script> and the last </script> ; also, note that I used ~ instead of / to avoid escaping of the slash in </script> : I could have used just about any single-byte character except a few reserved ones like \) ;
s/ˇ/\n/g readds newlines.