Escape dollar sign in regexp for sed - regex

I will introduce what my question is about before actually asking - feel free to skip this section!
Some background info about my setup
To update files manually in a software system, I am creating a bash script to remove all files that are not present in the new version, using diff:
for i in $(diff -r old new 2>/dev/null | grep "Only in old" | cut -d "/" -f 3- | sed "s/: /\//g"); do echo "rm -f $i" >> REMOVEOLDFILES.sh; done
This works fine. However, apparently my files often have a dollar sign ($) in the filename, this is due to some permutations of the GWT framework. Here is one example line from the above created bash script:
rm -f var/lib/tomcat7/webapps/ROOT/WEB-INF/classes/ExampleFile$3$1$1$1$2$1$1.class
Executing this script would not remove the wanted files, because bash reads these as argument variables. Hence I have to escape the dollar signs with "\$".
My actual question
I now want to add a sed-Command in the aforementioned pipeline, replacing this dollar sign. As a matter of fact, sed also reads the dollar sign as special character for regular expressions, so obviously I have to escape it as well.
But somehow this doesn't work and I could not find an explanation after googling a lot.
Here are some variations I have tried:
echo "Bla$bla" | sed "s/\$/2/g" # Output: Bla2
echo "Bla$bla" | sed 's/$$/2/g' # Output: Bla
echo "Bla$bla" | sed 's/\\$/2/g' # Output: Bla
echo "Bla$bla" | sed 's/#"\$"/2/g' # Output: Bla
echo "Bla$bla" | sed 's/\\\$/2/g' # Output: Bla
The desired output in this example should be "Bla2bla".
What am I missing?
I am using GNU sed 4.2.2
EDIT
I just realized, that the above example is wrong to begin with - the echo command already interprets the $ as a variable and the following sed doesn't get it anyway... Here a proper example:
Create a textfile test with the content bla$bla
cat test gives bla$bla
cat test | sed "s/$/2/g" gives bla$bla2
cat test | sed "s/\$/2/g" gives bla$bla2
cat test | sed "s/\\$/2/g" gives bla2bla
Hence, the last version is the answer. Remember: when testing, first make sure your test is correct, before you question the test object........

The correct way to escape a dollar sign in regular expressions for sed is double-backslash. Then, for creating the escaped version in the output, we need some additional slashes:
cat filenames.txt | sed "s/\\$/\\\\$/g" > escaped-filenames.txt
Yep, that's four backslashes in a row. This creates the required changes: a filename like bla$1$2.class would then change to bla\$1\$2.class.
This I can then insert into the full pipeline:
for i in $(diff -r old new 2>/dev/null | grep "Only in old" | cut -d "/" -f 3- | sed "s/: /\//g" | sed "s/\\$/\\\\$/g"); do echo "rm -f $i" >> REMOVEOLDFILES.sh; done
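The number of backslashes above comes from two layers of unescaping: bash processes the double-quoted string first, then sed processes what is left. A minimal sketch, assuming the sed script is single-quoted so that bash passes it through untouched (the file name is a made-up sample):

```shell
# Hypothetical sample name; single quotes keep bash from expanding $1 and $2
name='bla$1$2.class'

# Inside single quotes sed receives the script exactly as written:
#   \$   in the pattern matches a literal dollar sign
#   \\$  in the replacement yields a backslash followed by a dollar sign
escaped=$(printf '%s\n' "$name" | sed 's/\$/\\$/g')

printf '%s\n' "$escaped"   # bla\$1\$2.class
```

With single quotes only sed's own escaping rules apply, which halves the backslashes compared with the double-quoted form.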
Alternative to solve the background problem
chepner posted an alternative that solves the background problem by simply adding single quotes around the filenames in the output. This way, bash does not read the $ signs as variables when executing the script, and the files are removed properly:
for i in $(diff -r old new 2>/dev/null | grep "Only in old" | cut -d "/" -f 3- | sed "s/: /\//g"); do echo "rm -f '$i'" >> REMOVEOLDFILES.sh; done
(note the changed echo "rm -f '$i'" in that line)

There are other problems with your script, but file names containing $ are not a problem if you properly quote the argument to rm in the resulting script.
echo "rm -f '$i'" >> REMOVEOLDFILES.sh
or using printf, which makes quoting a little nicer and is more portable:
printf "rm -f '%s'\n" "$i" >> REMOVEOLDFILES.sh
(Note that I'm addressing the real problem, not necessarily the question you asked.)
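The quoting approach can be checked end to end in a scratch directory. This is only a sketch: the file name is a made-up sample, and it assumes mktemp -d is available. Names that themselves contain a single quote would still break the generated script.

```shell
# Work in a scratch directory so nothing real is touched
tmp=$(mktemp -d)
cd "$tmp" || exit 1

name='ExampleFile$3$1.class'   # hypothetical file name containing $
: > "$name"                    # create the empty file

# Generate the removal script; the \n ends each generated line
printf "rm -f '%s'\n" "$name" > REMOVEOLDFILES.sh
sh REMOVEOLDFILES.sh

# Report whether the generated script removed the file
if [ -e "$name" ]; then status="still there"; else status="removed"; fi
echo "$status"
```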

There is already a nice answer directly in the edited question that helped me a lot - thank you!
I just want to add a bit of curious behavior that I stumbled across: matching against a dollar sign at the end of lines (e.g. when modifying PS1 in your .bashrc file).
As a workaround, I match for additional whitespace.
$ DOLLAR_TERMINATED="123456 $"
$ echo "${DOLLAR_TERMINATED}" | sed -e "s/ \\$/END/"
123456END
$ echo "${DOLLAR_TERMINATED}" | sed -e "s/ \\$$/END/"
sed: -e expression #1, char 13: Invalid back reference
$ echo "${DOLLAR_TERMINATED}" | sed -e "s/ \\$\s*$/END/"
123456END
Explanation to the above, line by line:
Defining DOLLAR_TERMINATED - I want to replace the dollar sign at the end of DOLLAR_TERMINATED with "END"
It works if I don't check for the line ending
It won't work if I match for the line ending as well (adding one more $ on the left side)
It works if I additionally match for (non-present) whitespace
(My sed version is 4.2.2 from February 2016, bash is version 4.3.48(1)-release (x86_64-pc-linux-gnu), in case that makes any difference)
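A likely explanation for the failing variant: inside double quotes bash expands $$ to the process ID of the shell, so sed never sees two dollar signs but a backslash followed by a number, which it reads as a back reference. A short sketch, assuming single quotes are acceptable for the sed script:

```shell
DOLLAR_TERMINATED='123456 $'

# Show what sed actually receives from the double-quoted version:
# the $$ is replaced by the shell's PID, giving something like s/ \12345/END/
printf '%s\n' "s/ \\$$/END/"

# In single quotes the script reaches sed unchanged:
# \$ matches the literal dollar sign, the final $ anchors at end of line
result=$(printf '%s\n' "$DOLLAR_TERMINATED" | sed -e 's/ \$$/END/')
echo "$result"   # 123456END
```

The whitespace workaround works for the same reason: with \s* in between, the two dollar signs are no longer adjacent, so bash has no $$ to expand.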

Related

Delete any special character using Sed

I have yet another list of subdomains. I want to remove any wildcard subdomain that includes one of these special characters:
()!&$#*+?
Mostly the unwanted characters appear as a random prefix, but they can also occur in the middle. Here is a sample of the data:
(www.imgur.com
***************diet.blogspot.com
*-1.gbc.criteo.com
------------------------------------------------------------i.imgur.com
This has been quite an inconvenience while scanning through the list. As always, I'm trying sed to fix it:
sed -i "/[!()#$&?+]/d" foo.txt ###Didn't work
sed -i "/[\!\(\)\#\$\&\?\+]/d" ###Escaping char didn't work
Both commands leave the list unchanged; the file stays in its original state. My next idea was to chain a series of sed commands and remove the characters one by one:
cat foo.txt | sed -e "/!/d" -e "/#/d" -e "/\*/d" -e "/\$/d" -e "/(/d" -e "/)/d" -e "/+/d" -e "/\'/d" -e "/&/d" >> foo2.txt
cat foo.txt | sed -e "/\!/d" | sed -e "/\#/d" | sed -e "/\*/d" | sed -e "/\$/d" | sed -e "/\+/d" | sed -e "/\'/d" | sed -e "/\&/d" >> foo2.txt
If escaping every special character doesn't work, my logic must be wrong. I also tried adding /g, with no more luck.
As a side note: I don't want - to be deleted as some valid subdomain can have - character:
line-apps.com
line-apps-beta.com
line-apps-rc.com
line-apps-dev.com
Any help would be cherished.
Using sed
$ sed '/[[:punct:]]/d' input_file
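This should delete all lines that contain a punctuation character. Note that [[:punct:]] also matches . and -, so on a plain list of domain names it would delete every line; it would help if you provided sample data.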
To do what you're trying to do in your answer (which adds [ and ] and more to the set of characters in your question) would be:
sed '/[][!?+,#$&*() ]/d'
or just:
grep -v '[][!?+,#$&*() ]'
Per POSIX to include ] in a bracket expression it must be the first character otherwise it indicates the end of the bracket expression.
Consider printing lines you want instead of deleting lines you do not want, though, e.g.:
grep -E '^[[:alnum:]_.-]+$' file
to print lines that only contain letters, numbers, underscores, dashes, and/or periods.
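A small sketch of the whitelist approach on the sample data from the question. The extra leading [[:alnum:]] is my own assumption, added so that lines starting with a run of dashes are rejected as well:

```shell
# Sample lines from the question (the dash run is shortened), plus two valid names
kept=$(printf '%s\n' \
  '(www.imgur.com' \
  '***************diet.blogspot.com' \
  '*-1.gbc.criteo.com' \
  '------i.imgur.com' \
  'line-apps.com' \
  'line-apps-beta.com' |
  grep -E '^[[:alnum:]][[:alnum:]_.-]*$')

# Only the two line-apps names remain
printf '%s\n' "$kept"
```

Listing the allowed characters tends to be more robust than listing the forbidden ones, because unexpected junk characters are rejected by default.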

Extract strings from text files and rename them accordingly in bash

I have a lot of randomly named text files (something like 70000 files); all I know is that somewhere in the first 30 lines there is a line of the format Author: Samuel Richardson and another line Title: Clarissa, Volume 5 (of 9). I am not sure of the case of these two lines.
I want to extract the title and the author and rename the file accordingly, something like "Clarissa, Volume 5 (of 9) ,___, Samuel Richardson.txt" (I use ,___, so that there is a valid separator between author and title).
My code is
for filename in *.txt; do
title=$(head -n 30 $filename.txt | grep -i 'Title:' | sed -n 's/^.*Title: //p')
author=$(head -n 30 $filename.txt | grep -i 'Author:' | sed -n 's/^.*Author: //p')
new_name="$title ,___, $author"
mv $filename $new_name.txt
done
It is not working as expected. The subcode
echo "title: $title _"
echo "author: $author _"
new_name="$title ,___, $author"
echo $new_name
prints as output the following
_tle: Clarissa, Volume 5 (of 9)
_thor: Samuel Richardson
,___, Samuel Richardson)
Moreover, I don't know how to save the first 30 lines extracted by the head command in a variable firstlines, so that they do not have to be recomputed.
The code
firstlines=$(head -n 30 randomname.txt)
and the use of title=$($firstlines | grep -i 'Title:' | sed -n 's/^.*Title: //p')
prints out the error command not found.
@Poshi's comment about line endings is correct, and @B.Shefter's answer is on the right track but has a number of problems (unquoted variable references, relying on nonstandard features of echo and sed), so I thought I'd rewrite it with (hopefully) the problems fixed.
Also, I'll repeat the recommendation I gave in a comment: use mv -n or mv -i to avoid overwriting files if anything goes wrong, and make a backup first. (You have a backup anyway, right? You should always have a backup of anything you don't want to lose.)
Anyway, here's my take on it:
#!/bin/bash
for filename in *.txt; do
    # Grab the first 30 lines with carriage returns removed:
    firstlines=$(head -n 30 "$filename" | tr -d '\r')
    # Capture the title and author. Note that sed doesn't have case-insensitive
    # patterns, so use e.g. [Tt] to manually make them case-insensitive. Also, use
    # [[:blank:]]* to allow any number of spaces and/or tabs after the ":".
    title=$(echo "$firstlines" | sed -n 's/^.*[Tt][Ii][Tt][Ll][Ee]:[[:blank:]]*//p')
    if [ -z "$title" ]; then
        echo "Unable to find Title: in $filename; skipping" >&2
        continue
    fi
    author=$(echo "$firstlines" | sed -n 's/^.*[Aa][Uu][Tt][Hh][Oo][Rr]:[[:blank:]]*//p')
    if [ -z "$author" ]; then
        echo "Unable to find Author: in $filename; skipping" >&2
        continue
    fi
    new_name="$title ,___, $author.txt"
    # Note: the filenames here will contain spaces, so double-quoting is *critical*
    mv -i "$filename" "$new_name"
done
@Poshi's right: your main issue is line endings. It looks as if each line ending includes a carriage return (\r). By itself, \r just moves the cursor back to the beginning of the line. When coupled with \n it works fine--because it moves to the beginning of the next line--but by itself it causes what you're seeing: some text, followed by the cursor going back to the beginning of the line, followed by more text overwriting what was there originally.
EDIT: It would probably help if I included a solution for that. Something like this should work, inserted before the assignment to new_name:
title=$(echo -e $title | sed 's/\r//')
author=$(echo -e $author | sed 's/\r//')
As for your second problem, the reason you're getting command not found is that the first word in the variable $firstlines isn't a command. You want something like:
title=$(echo "$firstlines" | grep -i 'Title:' | sed -n 's/^.*Title: //p')
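The carriage-return effect can be reproduced without any input files. The following sketch simulates a single header line with Windows (CRLF) line endings; the title text is a shortened sample:

```shell
# Simulate one header line from a file with CRLF line endings
title=$(printf 'Title: Clarissa\r\n' | sed -n 's/^.*Title: //p')
echo "length with CR: ${#title}"      # 9: the invisible \r is still attached

# Strip the carriage return, as the answers above suggest
clean=$(printf '%s' "$title" | tr -d '\r')
echo "length without CR: ${#clean}"   # 8
```

Command substitution strips the trailing newline but not the \r, which is why the stray character ends up inside the variable and later inside the new file name.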

Get all strings after the 4th occurrence of the pattern is found in bash

Starting with a string like:
String=1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS.
I need a regular expression that matches everything after the 4th colon ":" so that I can assign it to a variable in a shell script, like:
var_result="Searching done for the string:SUCCESS."
Using shell (bash or POSIX)
$ string="1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS."
$ echo "${string#*:*:*:*:}"
Searching done for the string:SUCCESS.
${string#*:*:*:*:} is an example of prefix removal. It removes a prefix consisting of four colon-separated strings.
The output can be saved in a shell variable:
$ var_result=${string#*:*:*:*:}
$ echo "$var_result"
Searching done for the string:SUCCESS.
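If the number of fields to skip is not fixed, the same prefix removal can be applied in a loop. A minimal sketch in POSIX shell; after_nth is a hypothetical helper name, not something from the answer above:

```shell
# after_nth STRING N: print what follows the Nth colon in STRING
# (hypothetical helper; stops early if there are fewer than N colons)
after_nth() {
    s=$1
    n=$2
    while [ "$n" -gt 0 ]; do
        case $s in
            *:*) s=${s#*:} ;;   # drop the shortest prefix ending in a colon
            *)   break ;;
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$s"
}

string="1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS."
var_result=$(after_nth "$string" 4)
echo "$var_result"   # Searching done for the string:SUCCESS.
```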
Using cut
cut works for this:
$ string="1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS."
$ cut -d: -f 5- <<<"$string"
Searching done for the string:SUCCESS.
The above selects the fifth field and all succeeding fields where fields are separated by colons. More specifically, -d: tells cut to use : as the field separator and -f 5- tells it to select field 5 and everything after.
To save the output in a variable, we use command substitution:
$ var_result=$(cut -d: -f 5- <<<"$string")
$ echo "$var_result"
Searching done for the string:SUCCESS.
If you just have a POSIX shell, not bash, then we need to use echo:
$ var_result=$(echo "$string" | cut -d: -f 5-)
$ echo "$var_result"
Searching done for the string:SUCCESS.
Or, safer still, printf:
$ var_result=$(printf "%s" "$string" | cut -d: -f 5-)
$ echo "$var_result"
Searching done for the string:SUCCESS.
Using sed
The following uses sed to remove the first four fields defined by colons:
$ sed -E 's/([^:]*:){4}//' <<<"$string"
Searching done for the string:SUCCESS.
More specifically:
[^:] matches any character except :.
[^:]*: matches any number of non-colons followed by a colon.
([^:]*:){4} matches exactly four colon separated fields.
s/([^:]*:){4}// is a substitute command which looks for the first four colon-separated columns and replaces them with an empty string.
The following is the same but saves the result in a variable:
$ var_result=$(sed -E 's/([^:]*:){4}//' <<<"$string")
$ echo "$var_result"
Searching done for the string:SUCCESS.
The following is the same but good also for POSIX shells:
$ var_result=$(printf '%s' "$string" | sed -E 's/([^:]*:){4}//')
$ echo "$var_result"
Searching done for the string:SUCCESS.
The following awk solution may help as well.
Let's say the variable has the following value:
var="1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS."
echo "$var"
1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS.
echo "$var" | awk -F":" '{$1=$2=$3=$4="";sub(/^:+/,"");print $0}' OFS=":"
Searching done for the string:SUCCESS.
With bash regex you can say:
String="1973251922:197325192278:abcdefgh:0xfff689990:Searching done for the string:SUCCESS."
if [[ $String =~ ^([^:]*:){4}(.+)$ ]]; then
    echo "${BASH_REMATCH[2]}"
fi

How to remove special characters like a single quote from a string?

I tried this with sed, but it did not work.
Basically, I have a string like this:
Input:
'http://www.google.com/photos'
Output required:
http://www.google.com
I tried using sed, but I could not escape the '.
What I did was:
sed 's/\'//' | sed 's/photos//'
The sed for photos worked, but the one for ' did not.
Please suggest a solution.
Escaping ' in sed is possible via a workaround:
sed 's/'"'"'//g'
# |^^^+--- bash string with the single quote inside
# | '--- return to sed string
# '------- leave sed string and go to bash
But for this job you should use tr:
tr -d "'"
Perl substitutions use a syntax nearly identical to sed's. Perl is installed by default on almost every system and behaves the same way on all machines (portability):
$ echo "'http://www.google.com/photos'" |perl -pe "s#\'##g;s#(.*//.*/)(.*$)#\1#g"
http://www.google.com/
Mind that this solution will keep only the domain name with http in front, discarding all words following http://www.google.com/
If you want to do it with sed , you can use sed "s/'//g" as advised by Wiktor Stribiżew in comments.
PS: I sometimes refer to special chars with their ascii hex code of the special char as advised by man ascii, which is \x27 for '
So for sed you can do it:
$ echo "'http://www.google.com/photos'" |sed -r "s#'##g; s#(.*//.*/)(.*$)#\1#g;"
http://www.google.com/
# sed "s#\x27##g" will also remove the single quote using hex ascii code.
$ echo "'http://www.google.com/photos'" |sed -r "s#'##g; s#(.*//.*)(/.*$)#\1#g;"
http://www.google.com #Without the last slash
If your string is stored in a variable, you can achieve above operations with pure bash, without the need of external tools like sed or perl like this:
$ a="'http://www.google.com/photos'" && a="${a:1:-1}" && echo "$a"
http://www.google.com/photos
# This removes 1st and last char of the variable , whatever this char is.
$ a="'http://www.google.com/photos'" && a="${a:1:-1}" && echo "${a%/*}"
http://www.google.com
#This deletes every char from the end of the string up to the first found slash /.
#If you need the last slash you can just add it to the echo manually like echo "${a%/*}/" -->http://www.google.com/
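The ${a:1:-1} form needs a reasonably recent bash. A sketch of the same idea using only POSIX parameter expansion, assuming the string has at most one quote at each end:

```shell
str="'http://www.google.com/photos'"

# Strip one leading and one trailing single quote, if present
str=${str#"'"}
str=${str%"'"}

# Drop the shortest suffix that starts with a slash
base=${str%/*}
echo "$base"   # http://www.google.com
```

Unlike the substring form, this leaves the string untouched when no quotes surround it.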
It's unclear whether the ' are actually around your string, but this should take care of it:
str="'http://www.google.com/photos'"
echo "$str" | sed s/\'//g | sed 's/\/photos//g'
Combined:
echo "$str" | sed -e "s/'//g" -e 's/\/photos//g'
Using tr:
echo "$str" | sed -e "s/\/photos//g" | tr -d \'
Result:
http://www.google.com
If the single quotes are not around your string it should work regardless.

Replace string with another string based on backreference with sed

I'm trying to replace a predefined string %c#, where # can be some number, with another string. The catch is that the other string must be truncated to # characters.
Ideally this set of commands would work:
FORMAT="%c10"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
echo $FORMAT | sed "s/%c\([0-9]\+\)/${LAST_COMMIT:0:\1}/g"
but clearly there is a syntax error on the \1. You can replace it with a number to see what I'm trying to get as output.
I'm open to using some other program other than sed to achieve this but ideally it should be programs that are pretty much native to most linux installations.
Thanks!
This is my idea.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c//')
Extract the number with sed, then take that many leading characters with head.
EDIT1
This might be better.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c\([0-9]\+\)/\1/')
EDIT2
I put it into a script, because the one-liner is too hard to follow. Please try this.
$ cat sample.sh
#!/bin/bash
FORMAT="%b-%t-%c10-%c5"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
## List numbers
lengths=$(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g")
## Substitute %cXX to first XX characters of LAST_COMMIT
for n in ${lengths}
do
    to_str=$(echo ${LAST_COMMIT:0:${n}})
    FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/")
done
## Print result
echo ${FORMAT}
This is the result.
$ ./sample.sh
%b-%t-5189e42b14-5189e
The same thing as a one-liner (identical logic, but long and hard to read):
for n in $(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g"); do to_str=$(echo ${LAST_COMMIT:0:${n}}); FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/"); done; echo ${FORMAT}
The value of $LAST_COMMIT gets interpolated before sed runs, so there is no backreference to refer back to yet. There is an /e extension in GNU sed which would support something like this, but I would simply use a slightly more capable tool.
perl -e '$fmt = shift; $fmt=~ s/%c(\d+)/%.$1s/g; printf("$fmt\n", @ARGV)' '%c10' "$LAST_COMMIT"
Of course, if you can let go of your own ad-hoc format string specifier, and switch to a printf-compatible format string altogether, just use the printf shell command straight off.
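To illustrate the printf route: the precision in %.10s truncates a string to ten characters, so a %cN specifier only has to be rewritten into that form. A sketch, assuming the format contains a single %cN and no other % sequences (printf would try to interpret them):

```shell
FORMAT='%c10'
LAST_COMMIT='5189e42b14797b1e36ffb7fc5657c7eea08f1c0f'

# Rewrite %cN into the printf precision form %.Ns
fmt=$(printf '%s\n' "$FORMAT" | sed 's/%c\([0-9][0-9]*\)/%.\1s/g')

# Let printf do the truncation
short=$(printf "$fmt" "$LAST_COMMIT")
echo "$short"   # 5189e42b14
```

With several %cN specifiers in one format, the commit hash would have to be passed once per specifier.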
length=$(echo $FORMAT | sed "s/%c\([0-9]\+\)/\1/g")
echo "${LAST_COMMIT:0:$length}"