sed not terminating at the end of this line (issue with escaping?) - regex

The SQL applications that I'm using isn't properly escaping all of the strings that I have, so I'm trying to use sed to replace these instances. The issue is I'll have this:
`some string of characters that may include hyphens'
and the quote at the end won't get escaped (yes that's supposed to be a ` not a quote).
My plan was to use this:
sed 's/[^\\]\'[^,]/&\\\'&/g' testfile.txt
Logic: anything that isn't a backslash followed by a quote, then anything that isn't a comma will be replaced by the same text with with a backslash and a quote.
I would like for testfile.txt to have all instances of ' replaced with \', but I just keep getting > as if it isn't done the line

I try this using gnu sed,
$ cat d
already escaped quote \' won't be escaped
$ sed -E "s/([^\\]|^)'([^,]|$)/\1\\\'\2/" d
already escaped quote \' won\'t be escaped

What you're looking for are called lookaround assertions, where you match any ' not preceded by a \ or followed by an end of line. Unfortunately, sed doesn't support these. But you can use Perl:
perl -pe 's/(?<!\\)'\''(?!$)/\\'\''/g' testfile.txt
In unescaped form, this would look like s/(?<!\\)'(?!$)/\\'/g but we have to make allowances for the shell. No escapes are recognized in single quoted strings, so your original problem was \' not being recognized, and the string terminating early.
See here for example and detailed regexp breakdown: https://regex101.com/r/k8sonu/1

Related

How does escaping a literal dot work for sed

The following sed command
echo '.' | sed "s/\\./foo/"
substitutes . with foo, as expected. However, if we escape the non-alphanumeric . in the above command
echo '.' | sed "s/\\\./foo/"
prints barely ., whereas foo is expected. sed should match the character . literally, but it doesn't. I cannot understand what is happening with the dot. I believe that I should simply put a backslash in front of every non-alphanumeric character in bash, if a string is double-quoted. A dot is a non-alphanumeric character so what is wrong about escaping it and why does it produce a different result?
This is because how backslashes work in bash double quote " escaping.
echo "\\"
\
echo "\."
\.
echo "s/\\./foo/"
s/\./foo/
echo "s/\\\./foo/"
s/\\./foo/
From man bash:
Within double quotes, the backslash retains its special meaning only when
followed by one of the following characters: $, `, ", \,or <newline>.
So in the first case, sed gets s/\./foo/ and interprets it as "replace a dot with foo". In the second case sed gets s/\\./foo/ and interprets it as "replace a backslash and one other character with foo.
You better use single quote escaping in this case:
echo 's/./foo/'
s/./foo/
echo 's/\./foo/'
s/\./foo/
which is probably what you wanted.
In your second one, the . is still a literal dot, but the regular expression only replaces the two-character sequence \. (not . by itself) with foo:
$ echo '=\.=' | sed "s/\\\./foo/"
=foo=

substitute single quotes in sed and perl

Could someone please explain what was happening with these two commands? Why do sed and perl give different results running the same regular expression pattern:
# echo "'" | sed -e "s/\'/\'/"
''
# echo "'" | perl -pe "s/\'/\'/"
'
# sed --version
sed (GNU sed) 4.5
You're using GNU sed, right? \' is an extension that acts as an anchor for end-of-string in GNU's implementation of basic regular expressions. So you're seeing two quotes in the output because the s matches the end of the line and adds a quote, after the one that was already in the line.
To make it a bit more obvious:
echo foo | sed -e "s/\'/#/"
produces
foo#
Documented here, and in the GNU sed manual
Edit: The equivalent in perl is \Z (or maybe \z depending on how you want to handle a trailing newline). Since \' isn't a special sequence in perl regular expressions, it just matches a literal quote. As mentioned in the other answer and comments, escaping a single quote inside a double quoted string isn't necessary, and as you've found, can potentially result in unintended behavior.

What do I need to quote in sed command lines?

There are many questions on this site on how to escape various elements for sed, but I'm looking for a more general answer. I understand that I might want to escape some characters to avoid shell expansion:
Bash:
Single quoted [strings] ('') are used to preserve the literal value of each character enclosed within the quotes. [However,] a single quote may not occur between single quotes, even when preceded by a backslash.
The backslash retains its meaning [in double quoted strings] only when followed by dollar, backtick, double quote, backslash or newline. Within double quotes, the backslashes are removed from the input stream when followed by one of these characters. Backslashes preceding characters that don't have a special meaning are left unmodified for processing by the shell interpreter.
sh: (I hope you don't have history expansion)
Single quoted string behaviour: same as bash
Enclosing characters in double quotes preserves the literal value of
all characters within the quotes, with the exception of dollar, single quote, backslash, and,
when history expansion is enabled, exclamation mark.
The characters dollar and single quote retain their special meaning within double quotes.
The backslash retains its special meaning only when followed by one of the following characters: $, ', ", \, or newline. A double quote may be quoted within double
quotes by preceding it with a backslash.
If enabled, history expansion will be performed unless an exclamation mark appearing in double quotes is escaped using a backslash. The backslash preceding the ! is not removed.
...but none of that explains why this stops working as soon as you remove any escaping:
sed -e "s#\(\w\+\) #\1\/#g" #find a sequence of characters in a line
# why? ↑ ↑ ↑ ↑ #replace the following space with a slash.
None of (, ), / or + (or [, or ]...) seem to have any special meaning that requires them to be escaped in order to work. Hell, even calling the command directly through Python makes sed not work properly, although the manpage doesn't seem to spell out anything about this (not when I search for backslash, anyway.)
$ lvdisplay -C --noheadings -o vg_name,name > test
$ python
>>> import os
>>> #Python requires backslash escaping of \1, even in triple quotes
>>> #lest \1 is read to mean "byte with value 0x01".
>>> output = os.execl("/bin/sed", "-e", "s#(\w+) #\\1/#g", "test")
(Output remains unchanged)
$ python
>>> import os
>>> output = os.execl("/bin/sed", "-e", "s#\(\w\+\) #\\1\/#g", "test")
(Correct output)
$ WHAT THE HELL
Have you tried using jQuery? It's perfect and it does all the things.
If I understood you right, your problem is not about bash/sh, it is about the regex flavour sed uses by default: BRE.
The other [= anything but dot, star, caret and dollar] BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these.
Grouping (..) should be escaped to give it special meaning. same as + otherwise sed will try to match them as they are literal strings/chars. That's why your s#\(\w\+\) #...# should be escaped. The replacement part doesn't need escaping, so:
sed 's#\(\w\+\) #\1 /#'
should work.
sed has usually option to use extended regular expressions (now with ?, +, |, (), {m,n}); e.g. GNU sed has -r, then your one-liner could be:
sed -r 's#(\w+) #\1 /#'
I paste some examples here that may help you understand what's going on:
kent$ echo "abcd "|sed 's#\(\w\+\) #\1 /#'
abcd /
kent$ echo "abcd "|sed -r 's#(\w+) #\1 /#'
abcd /
kent$ echo "(abcd+) "|sed 's#(\w*+) #&/#'
(abcd+) /
What you're observing is correct. Certain characters like ?, +, (, ), {, } need to be escaped when using basic regular expressions.
Quoting from the sed manual:
The only difference between basic and extended regular expressions is
in the behavior of a few characters: ‘?’, ‘+’, parentheses, and braces
(‘{}’). While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them to match a
literal character.
(Emphasis mine.) These don't need to be escaped, though, when using extended regexps, except when matching a literal character (as mentioned in the last line quoted above.)
If you want a general answer,
Shell metacharacters need to be quoted or escaped from the shell;
Regex metacharacters need to be escaped if you want a literal interpretation;
Some regex constructs are formed by a backslash escape; depending on context, these backslashes may need quoting.
So you have the following scenarios;
# Match a literal question mark
echo '?' | grep \?
# or equivalently
echo '?' | grep "?"
# or equivalently
echo '?' | grep '?'
# Match a literal asterisk
echo '*' | grep \\\*
# or equivalently
echo '*' | grep "\\*"
# or equivalently
echo '*' | grep '\*'
# Match a backreference: any character repeated twice
echo 'aa' | grep \\\(.\\\)\\1
# or equivalently
echo 'aa' | grep "\(.\)\\1"
# or equivalently
echo 'aa' | grep '\(.\)\1'
As you can see, single quotes probably make the most sense most of the time.
If you are embedding into a language which requires backslash quoting of its own, you have to add yet another set of backslashes, or avoid invoking a shell.
As others have pointed out, extended regular expressions obey a slightly different syntax, but the general pattern is the same. Bottom line, to minimize interference from the shell, use single quotes whenever you can.
For literal characters, you can avoid some backslashitis by using a character class instead.
echo '*' | grep \[\*\]
# or equivalently
echo '*' | grep "[*]"
# or equivalently
echo '*' | grep '[*]'
FreeBSD sed, which is also used on Mac OS X, uses -E instead of -r for extended regular expressions.
Therefore, to have it portable, use basic regular expressions. + in extended-regular-expression mode, for example, would have to be replaced with \{1,\} in basic-regular-expression mode.
In basic- as well as extended-regular-expression mode, FreeBSD sed does not seem to recognize \w which has to be replaced with [[:alnum:]_] (cf. man re_format).
# using FreeBSD sed (on Mac OS X)
# output: Hello, world!
echo 'hello world' | sed -e 's/h/H/' -e 's/ \{1,\}/, /g' -e 's/\([[:alnum:]_]\{1,\}\)$/\1!/'
echo 'hello world' | sed -E -e 's/h/H/' -e 's/ +/, /g' -e 's/([[:alnum:]_]+)$/\1!/'
echo 'hello world' | sed -E -e 's/h/H/' -e 's/ +/, /g' -e 's/(\w+)$/\1!/' # does not work
# find a sequence of characters in a line
# replace the following space with a slash
# output: abcd+/abcd+/
echo 'abcd+ abcd+ ' > test
python
import os
output = os.execl('/usr/bin/sed', '-e', 's#\([[:alnum:]_+]\{1,\}\) #\\1/#g', 'test')
To use a single quote as part of a sed regular expression while keeping your outer single quotes for the sed regular expression, you can concatenate three separate strings each enclosed in single quotes to avoid possible shell expansion.
# man bash:
# "A single quote may not occur between single quotes, even when preceded by a backslash."
# cf. http://stackoverflow.com/a/9114512 & http://unix.stackexchange.com/a/82757
# concatenate: 's/doesn' + \' + 't/does not/'
echo "sed doesn't work for me" | sed -e 's/doesn'\''t/does not/'

using sed to replace ^[(s3B with blank space

I'm trying to use sed with perl to replace ^[(s3B with an empty string in several files.
s/^[(s3B// isn't working though, so I'm wondering what else I could try.
You need to quote the special characters:
$ echo "^[(s3B AAA ^[(s3B"|sed 's/\^\[[(]s3B//g'
AAA
$ echo "^[(s3B AAA ^[(s3B" >file.txt
$ perl -p -i -e 's/\^\[[(]s3B//g' file.txt
$ cat file.txt
AAA
The problem is that there are several characters that have a special meaning in regular expressions. ^ is a start-of-line anchor, [ opens a character class, and ( opens a capture.
You can escape all non-alphanumerics in a Perl string by preceding it with \Q, so you can safely use
s/\Q^[(s3B//
which is equivalent to, and more readable than
s/\^\[\(s3B//
If you're dealing with ANSI sequences (xterm color sequences, escape sequences), then ^[ is not '^' followed by '[' but rather an unprintable character ESC, ASCII code 0x1B.
To put that character into a sed expression you need to use \x1B in GNU sed, or see http://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/ . You can also insert special characters directly into your command line using ctrl+v in Bash line editing.
In regex "^", "[" and "(" (and many others) are special characters used for special regex features, if you are referencing the characters themselves you should preceed them with "\".
The correct substitution reges would be:
$string =~ s/\^\[\(3B//g
if you want to replace all occurences.

Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.
For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or
a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
Perl Compatible Regexes
For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ and the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.
The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.
All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
option 1
echo $(cat testfile)
Option 2
tr ' ' '\n' < testfile
This should do the work:
sed -e 's/[ \t]+/\n/g'
[ \t] means a space OR an tab. If you want any kind of space, you could also use \s.
[ \t]+ means as many spaces OR tabs as you want (but at least one)
s/x/y/ means replace the pattern x by y (here \n is a new line)
The g at the end means that you have to repeat as many times it occurs in every line.
You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence
You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new
Using gawk:
gawk '{$1=$1}1' OFS="\n" file