Perl regular expression matching on large Unicode code points

I am trying to replace various characters with either a single quote or double quote.
Here is my test file:
# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal
# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick
I'm trying to do this...
perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt
But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.
Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.
$ awk -F\ '{print $1}' test.txt | \
perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'
U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "
U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `
Why isn't my regular expression matching?

It isn’t matching because you forgot the -CSAD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.
You need:
$ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt
I do mention this sort of thing in this answer a couple of times.

With use utf8;, you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.
With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.
You did not decode your input, so your input consists solely of bytes, and most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and of course, encode your output). Using -CSD will likely do this.
perl -CSD -i -pe'
s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' text.txt
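The byte-versus-character distinction can be reproduced outside Perl. The following is a minimal Python sketch, not the answerer's code: normalize_quotes is a hypothetical helper, and decoding as Latin-1 is used only to imitate what Perl sees with undecoded input (one character per byte).

```python
import re

# Same character classes as the Perl one-liner above.
SINGLE = re.compile("[\u2018\u201A\u201B\uFF07\u2019\u0060]")
DOUBLE = re.compile("[\uFF02\u201C\u201D\u201E]")

def normalize_quotes(text: str) -> str:
    """Replace typographic quotes with their ASCII equivalents."""
    return DOUBLE.sub('"', SINGLE.sub("'", text))

raw = "\u201Cleft\u201D and \u2018x\u2019 and `tick`".encode("utf-8")

# Undecoded view: every byte becomes its own character, so U+201C is
# seen as the three characters E2 80 9C and never matches the class.
as_bytes = raw.decode("latin-1")
# Decoded view: each multi-byte sequence becomes a single code point.
as_text = raw.decode("utf-8")

# Only the ASCII backtick is replaced in the undecoded view.
print(normalize_quotes(as_bytes) == as_bytes.replace("`", "'"))  # True
print(normalize_quotes(as_text))  # "left" and 'x' and 'tick'
```

This mirrors the symptom in the question: the backtick is a single byte, so it matches either way, while the larger code points match only after decoding.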

Related

substitute single quotes in sed and perl

Could someone please explain what was happening with these two commands? Why do sed and perl give different results running the same regular expression pattern:
# echo "'" | sed -e "s/\'/\'/"
''
# echo "'" | perl -pe "s/\'/\'/"
'
# sed --version
sed (GNU sed) 4.5

You're using GNU sed, right? \' is an extension that acts as an anchor for end-of-string in GNU's implementation of basic regular expressions. So you're seeing two quotes in the output because the s matches the end of the line and adds a quote, after the one that was already in the line.
To make it a bit more obvious:
echo foo | sed -e "s/\'/#/"
produces
foo#
Documented here, and in the GNU sed manual
Edit: The equivalent in perl is \Z (or maybe \z depending on how you want to handle a trailing newline). Since \' isn't a special sequence in perl regular expressions, it just matches a literal quote. As mentioned in the other answer and comments, escaping a single quote inside a double quoted string isn't necessary, and as you've found, can potentially result in unintended behavior.
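For comparison, here is a small sketch in Python, whose re module treats \' the way Perl does (a literal quote). It is only an analogy to the sed/Perl difference described above; note that Python's \Z anchors at the very end of the string, which corresponds to Perl's \z rather than Perl's \Z.

```python
import re

# \' is not a special sequence here, so it matches a literal single quote.
literal_quote = re.sub(r"\'", "#", "it's")

# \Z matches the empty string at the very end of the input.
end_anchor = re.sub(r"\Z", "#", "foo")

print(literal_quote)  # it#s
print(end_anchor)     # foo#
```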

How do I correctly escape this search string for Perl pie

I use this classic perl one liner to replace strings in multiple files recursively
perl -pi -e 's/oldstring/newstring/g' `grep -irl oldstring *`
But this has failed me as I want to find the string:
'$user->primaryorganisation->id'
and replace with
$user->primaryorganisation->id
I can't seem to escape the string correctly for the line to run successfully.
Any help gratefully received!

Try this one. Lots of escapes. Better to go with TLP's suggestion and use a source file.
perl -pi -e "s/'\\\$user->primaryorganisation->id'/\\\$user->primaryorganisation->id/g" `grep -irl "'\$user->primaryorganisation->id'" *`
Explanation:
three backslashes: the first two tell the shell to produce a literal backslash; the third one escapes the $ for the shell; that makes \$ for Perl, which needs the backslash to prevent variable interpolation
double quotes " to put single quotes ' inside them
one backslash and a dollar \$ for grep so the shell passes on a literal dollar sign

When you want to represent a single quote in a Perl one-liner but can't because the one-liner is itself wrapped in single quotes, you can use \047, the octal code for the single quote. So, this should work:
s/\047(\$user->primaryorganisation->id)\047/$1/g
I recommend Minimal Perl by Maher for more-than-you-wanted-to-know about the art of one-lining perl.

To produce
...'...
you can generically use
'...'\''...'
As such,
s/'(\$user->primaryorganisation->id)'/$1/g
becomes
's/'\''(\$user->primaryorganisation->id)'\''/$1/g'
so
find -type f \
-exec perl -i -pe's/'\''(\$user->primaryorganisation->id)'\''/$1/g' {} +
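To check what Perl actually receives after the shell has removed the quoting, one option is Python's shlex module, which follows POSIX rules for single quotes and unquoted backslashes. This is only a sketch for inspecting the '\'' idiom; shlex performs no variable expansion, which does not matter here because nothing sits inside double quotes.

```python
import shlex

# The single shell word from the command above: three single-quoted
# pieces joined by two backslash-escaped quotes.
shell_word = r"""'s/'\''(\$user->primaryorganisation->id)'\''/$1/g'"""

# shlex.split applies POSIX quote removal and returns the resulting words.
perl_code = shlex.split(shell_word)[0]

print(perl_code)  # s/'(\$user->primaryorganisation->id)'/$1/g
```

The result is exactly the substitution the answer started from, with the literal single quotes restored.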

What do I need to quote in sed command lines?

There are many questions on this site on how to escape various elements for sed, but I'm looking for a more general answer. I understand that I might want to escape some characters to avoid shell expansion:
Bash:
Single quoted [strings] ('') are used to preserve the literal value of each character enclosed within the quotes. [However,] a single quote may not occur between single quotes, even when preceded by a backslash.
The backslash retains its meaning [in double quoted strings] only when followed by dollar, backtick, double quote, backslash or newline. Within double quotes, the backslashes are removed from the input stream when followed by one of these characters. Backslashes preceding characters that don't have a special meaning are left unmodified for processing by the shell interpreter.
sh: (I hope you don't have history expansion)
Single quoted string behaviour: same as bash
Enclosing characters in double quotes preserves the literal value of
all characters within the quotes, with the exception of dollar, backtick, backslash, and,
when history expansion is enabled, exclamation mark.
The characters dollar and backtick retain their special meaning within double quotes.
The backslash retains its special meaning only when followed by one of the following characters: $, `, ", \, or newline. A double quote may be quoted within double
quotes by preceding it with a backslash.
If enabled, history expansion will be performed unless an exclamation mark appearing in double quotes is escaped using a backslash. The backslash preceding the ! is not removed.
...but none of that explains why this stops working as soon as you remove any escaping:
sed -e "s#\(\w\+\) #\1\/#g" #find a sequence of characters in a line
# why? ↑ ↑ ↑ ↑ #replace the following space with a slash.
None of (, ), / or + (or [, or ]...) seem to have any special meaning that requires them to be escaped in order to work. Hell, even calling the command directly through Python makes sed not work properly, although the manpage doesn't seem to spell out anything about this (not when I search for backslash, anyway.)
$ lvdisplay -C --noheadings -o vg_name,name > test
$ python
>>> import os
>>> #Python requires backslash escaping of \1, even in triple quotes
>>> #lest \1 is read to mean "byte with value 0x01".
>>> output = os.execl("/bin/sed", "-e", "s#(\w+) #\\1/#g", "test")
(Output remains unchanged)
$ python
>>> import os
>>> output = os.execl("/bin/sed", "-e", "s#\(\w\+\) #\\1\/#g", "test")
(Correct output)
$ WHAT THE HELL
Have you tried using jQuery? It's perfect and it does all the things.

If I understood you right, your problem is not about bash/sh, it is about the regex flavour sed uses by default: BRE.
The other [= anything but dot, star, caret and dollar] BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these.
In BRE, grouping parentheses must be written \(...\) to get their special meaning, and the same goes for \+; left unescaped, sed matches them as literal characters. That's why the pattern in your s#\(\w\+\) #...# needs the backslashes. The replacement part doesn't need escaping, so:
sed 's#\(\w\+\) #\1 /#'
should work.
sed usually has an option to use extended regular expressions (where ?, +, |, (), {m,n} are special without a backslash); e.g. GNU sed has -r, so your one-liner could be:
sed -r 's#(\w+) #\1 /#'
I paste some examples here that may help you understand what's going on:
kent$ echo "abcd "|sed 's#\(\w\+\) #\1 /#'
abcd /
kent$ echo "abcd "|sed -r 's#(\w+) #\1 /#'
abcd /
kent$ echo "(abcd+) "|sed 's#(\w*+) #&/#'
(abcd+) /
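The "extended" convention is the one most scripting languages use. As a rough comparison only (Python's re module is neither BRE nor ERE, so this is an analogy, not a sed equivalent), bare parentheses and + are special and the backslashed forms are literal:

```python
import re

# Bare ( ) and + are special here, as with sed -r / sed -E.
slashed = re.sub(r"(\w+) ", r"\1/", "abcd efgh ")

# With backslashes they match the literal characters instead.
literal = re.sub(r"\(\w+\+\) ", "<match>", "(abcd+) tail")

print(slashed)  # abcd/efgh/
print(literal)  # <match>tail
```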

What you're observing is correct. Certain characters like ?, +, (, ), {, } need to be escaped when using basic regular expressions.
Quoting from the sed manual:
The only difference between basic and extended regular expressions is
in the behavior of a few characters: ‘?’, ‘+’, parentheses, and braces
(‘{}’). While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them to match a
literal character.
(Emphasis mine.) These don't need to be escaped, though, when using extended regexps, except when matching a literal character (as mentioned in the last line quoted above.)

If you want a general answer,
Shell metacharacters need to be quoted or escaped from the shell;
Regex metacharacters need to be escaped if you want a literal interpretation;
Some regex constructs are formed by a backslash escape; depending on context, these backslashes may need quoting.
So you have the following scenarios;
# Match a literal question mark
echo '?' | grep \?
# or equivalently
echo '?' | grep "?"
# or equivalently
echo '?' | grep '?'
# Match a literal asterisk
echo '*' | grep \\\*
# or equivalently
echo '*' | grep "\\*"
# or equivalently
echo '*' | grep '\*'
# Match a backreference: any character repeated twice
echo 'aa' | grep \\\(.\\\)\\1
# or equivalently
echo 'aa' | grep "\(.\)\\1"
# or equivalently
echo 'aa' | grep '\(.\)\1'
As you can see, single quotes probably make the most sense most of the time.
If you are embedding into a language which requires backslash quoting of its own, you have to add yet another set of backslashes, or avoid invoking a shell.
As others have pointed out, extended regular expressions obey a slightly different syntax, but the general pattern is the same. Bottom line, to minimize interference from the shell, use single quotes whenever you can.
For literal characters, you can avoid some backslashitis by using a character class instead.
echo '*' | grep \[\*\]
# or equivalently
echo '*' | grep "[*]"
# or equivalently
echo '*' | grep '[*]'
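When the command is built from another language, the extra layer of backslashes can often be avoided. A minimal Python sketch, assuming the sed script from the question: a raw string keeps the backslashes as written, an argument list bypasses the shell entirely, and shlex.quote adds the single quotes if a shell command line is needed after all. Nothing is executed here; argv is only the list one might hand to something like subprocess.run.

```python
import shlex

# A raw string keeps every backslash, so \1 is not read as byte 0x01.
sed_script = r"s#\(\w\+\) #\1/#g"
assert sed_script == "s#\\(\\w\\+\\) #\\1/#g"

# Without a shell: each list element reaches the program unchanged.
argv = ["sed", "-e", sed_script, "test"]

# With a shell: quote each argument; only the script needs quotes.
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)  # sed -e 's#\(\w\+\) #\1/#g' test

# Parsing the command line back yields the original arguments.
assert shlex.split(command_line) == argv
```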

FreeBSD sed, which is also used on Mac OS X, uses -E instead of -r for extended regular expressions.
Therefore, to have it portable, use basic regular expressions. + in extended-regular-expression mode, for example, would have to be replaced with \{1,\} in basic-regular-expression mode.
In basic- as well as extended-regular-expression mode, FreeBSD sed does not seem to recognize \w which has to be replaced with [[:alnum:]_] (cf. man re_format).
# using FreeBSD sed (on Mac OS X)
# output: Hello, world!
echo 'hello world' | sed -e 's/h/H/' -e 's/ \{1,\}/, /g' -e 's/\([[:alnum:]_]\{1,\}\)$/\1!/'
echo 'hello world' | sed -E -e 's/h/H/' -e 's/ +/, /g' -e 's/([[:alnum:]_]+)$/\1!/'
echo 'hello world' | sed -E -e 's/h/H/' -e 's/ +/, /g' -e 's/(\w+)$/\1!/' # does not work
# find a sequence of characters in a line
# replace the following space with a slash
# output: abcd+/abcd+/
echo 'abcd+ abcd+ ' > test
python
import os
# os.execl never returns; the first argument after the path is argv[0]
os.execl('/usr/bin/sed', 'sed', '-e', 's#\([[:alnum:]_+]\{1,\}\) #\\1/#g', 'test')
To use a single quote as part of a sed regular expression while keeping your outer single quotes for the sed regular expression, you can concatenate three separate strings each enclosed in single quotes to avoid possible shell expansion.
# man bash:
# "A single quote may not occur between single quotes, even when preceded by a backslash."
# cf. http://stackoverflow.com/a/9114512 & http://unix.stackexchange.com/a/82757
# concatenate: 's/doesn' + \' + 't/does not/'
echo "sed doesn't work for me" | sed -e 's/doesn'\''t/does not/'

using sed to replace ^[(s3B with blank space

I'm trying to use sed with perl to replace ^[(s3B with an empty string in several files.
s/^[(s3B// isn't working though, so I'm wondering what else I could try.

You need to quote the special characters:
$ echo "^[(s3B AAA ^[(s3B"|sed 's/\^\[[(]s3B//g'
AAA
$ echo "^[(s3B AAA ^[(s3B" >file.txt
$ perl -p -i -e 's/\^\[[(]s3B//g' file.txt
$ cat file.txt
AAA

The problem is that there are several characters that have a special meaning in regular expressions. ^ is a start-of-line anchor, [ opens a character class, and ( opens a capture.
You can escape all non-alphanumerics in a Perl string by preceding it with \Q, so you can safely use
s/\Q^[(s3B//
which is equivalent to, and more readable than
s/\^\[\(s3B//
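Other regex libraries offer a comparable helper. As an illustration in Python (an analogy to \Q, not Perl code), re.escape backslash-escapes the metacharacters of a literal string so that it can be used as a pattern:

```python
import re

literal = "^[(s3B"
pattern = re.escape(literal)  # escapes ^, [ and ( for use in a regex

cleaned = re.sub(pattern, "", "^[(s3B AAA ^[(s3B")
print(repr(cleaned))  # ' AAA '
```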

If you're dealing with ANSI sequences (xterm color sequences, escape sequences), then ^[ is not '^' followed by '[' but rather an unprintable character ESC, ASCII code 0x1B.
To put that character into a sed expression you need to use \x1B in GNU sed, or see http://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/ . You can also insert special characters directly into your command line using ctrl+v in Bash line editing.
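If the data really contains the ESC byte, the pattern has to name that byte rather than a caret and a bracket. A short Python sketch of the idea, assuming the sequence is ESC followed by the literal characters (s3B; strip_sequence is a hypothetical helper name:

```python
import re

# ESC (0x1B) followed by the literal characters "(s3B".
ESCAPE_SEQUENCE = re.compile(r"\x1b\(s3B")

def strip_sequence(text: str) -> str:
    """Remove every occurrence of the escape sequence."""
    return ESCAPE_SEQUENCE.sub("", text)

sample = "\x1b(s3BAAA\x1b(s3B"
print(strip_sequence(sample))  # AAA
```

A string containing the two visible characters ^ and [ is left untouched by this pattern, which is a quick way to tell the two cases apart.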

In regex "^", "[" and "(" (and many others) are special characters used for special regex features; if you are referring to the characters themselves you should precede them with "\".
The correct substitution regex would be:
$string =~ s/\^\[\(s3B//g
if you want to replace all occurrences.

Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to make a vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.

For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or
a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
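The same substitution can be sketched in Python for readers who want to test the pattern away from sed. This is an illustration only; one_word_per_line is a hypothetical helper, and [ \t]+ stands in for [[:blank:]]+ because Python's re has no POSIX bracket classes.

```python
import re

def one_word_per_line(text: str) -> str:
    """Replace each run of spaces or tabs with a single newline."""
    return re.sub(r"[ \t]+", "\n", text)

print(one_word_per_line("τέχνη βιβλίο\tγη   κήπος"))
```

Each run of blanks collapses to one newline, so consecutive spaces do not produce empty lines.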
Perl Compatible Regexes
For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ at the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.

The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.

All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
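A rough Python counterpart of the last pipeline, for comparison only: str.split() with no argument splits on any run of whitespace, and sorted(set(...)) removes duplicates. Unlike sort -u, the ordering here is by code point and does not depend on the locale.

```python
text = "γη τέχνη\tβιβλίο γη\nκήπος τέχνη"

# Split on whitespace runs, drop duplicates, sort by code point.
unique_words = sorted(set(text.split()))

print("\n".join(unique_words))
```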

option 1
echo $(cat testfile)
Option 2
tr ' ' '\n' < testfile

This should do the work:
sed -E -e 's/[ \t]+/\n/g'
[ \t] means a space OR a tab. If you want any kind of whitespace, you could also use \s.
[ \t]+ means one or more spaces OR tabs. The -E option is needed for the bare + to act as a quantifier; without it, GNU sed expects \+.
s/x/y/ means replace the pattern x by y (here \n is a new line)
The g at the end means the substitution is repeated for every match on each line.

You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence

You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new

Using gawk:
gawk '{$1=$1}1' OFS="\n" file
