Unicode characters stuffing regex quantifiers in perl 5 [duplicate]

This question already has answers here:
Perl regular expression matching on large Unicode code points
(2 answers)
Closed 2 years ago.
I'm new to Perl and am having trouble with regex quantifiers on multibyte Unicode characters (UTF-8) in Perl 5. I expect each such character to count as one character, but the quantifier counts every byte that composes it.
For example, I expect .{1} to match é and .{2} not to match, but this is what I see:
$ echo 'begin é end' | perl -wnl -e '/begin .{1} end/s and print'
$ echo 'begin é end' | perl -wnl -e '/begin .{2} end/s and print'
begin é end
It is clearly due to "é" being a multibyte character because when I replace it by a simple "e" I get what I expect :
$ echo 'begin e end' | perl -wnl -e '/begin .{1} end/s and print'
begin e end
$ echo 'begin e end' | perl -wnl -e '/begin .{2} end/s and print'
Using some character set modifier (/d /u /a and /l) does not change anything.
When I use another PCRE regex tool it works :
regex101 : https://regex101.com/r/a1Lb9g/1/
php 7 (with u modifier to enable unicode support) :
$ echo 'begin é end' | php7 -r 'var_dump(preg_match("/begin .{1} end/su", file_get_contents("php://stdin")));'
Command line code:1:
int(1)
My TTY uses UTF-8 charset, "é" is encoded c3a9 :
$ echo 'begin é end' | xxd
00000000: 6265 6769 6e20 c3a9 2065 6e64 0a begin .. end.
$ echo 'begin é end' | base64
YmVnaW4gw6kgZW5kCg==
I have tested on several OS and perl versions and I see the same behavior everywhere :
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-msys-thread-multi-64int (Windows 7)
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-msys-thread-multi (Windows 10)
This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi (Ubuntu 16.04)
How can I make Perl regex quantifiers count a Unicode character as a single character?

You need to tell Perl that the input is encoded in UTF-8. That's done by -CI. Add O to encode the output, too:
echo 'begin é end' | perl -CIO -wnl -e '/begin .{1} end/s and print'
begin é end

Related

Shell - Refactoring a string regex that join numbers

I am trying to refactor my script to make it readable while keeping it usable on a single line.
My script does the following:
runs a regex on a string (GXXRXXCXX) that collects all matched numbers into an array
converts every string in the array to a number (0X -> X)
joins all the numbers with a '.' delimiter
finally, adds a 'v' at the start of the string
The part I am struggling with most is turning the array of numbers (3 2 1) into a joined string (3.2.1) without using any tmp variable.
code :
GOROCO=G03R02C01
version=v$(tmp=( $(grep -Eo '[[:digit:]]+' <<< $GOROCO | bc) ); echo "${tmp[@]}" | sed 's/ /./g')
process :
G03R02C01
03 02 01
3 2 1
3.2.1
v3.2.1

Using a single sed you can do this:
GOROCO='G03R02C01'
version=$(sed -E 's/[^0-9]+0*/./g; s/^\./v/' <<< "$GOROCO")
# version=v3.2.1
Details:
-E: Enables extended regex mode in sed
s/[^0-9]+0*/./g: Replace each run of non-digits, together with any zeros that follow it, with a single dot
s/^\./v/: Replace first dot by a letter v
As an academic exercise here is a pure bash equivalent of doing same:
shopt -s extglob
version="${GOROCO//+([!0-9])*(0)/.}"
version="v${version#.}"
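Both versions above strip leading zeros textually, so a field made only of zeros (for example R00) would vanish. As an alternative sketch, not the answer's method, the fields can be pulled out with bash's =~ operator and normalized with base-10 arithmetic. The function name goroco_to_version is hypothetical.

```shell
#!/usr/bin/env bash
# goroco_to_version is a hypothetical name used only for this sketch.

goroco_to_version() {
  local rest=$1 out='' re='([0-9]+)(.*)'
  # Peel off one run of digits per iteration.
  while [[ $rest =~ $re ]]; do
    # 10#NN forces base 10, so "08" is not parsed as invalid octal.
    out+="${out:+.}$((10#${BASH_REMATCH[1]}))"
    rest=${BASH_REMATCH[2]}
  done
  printf 'v%s\n' "$out"
}

goroco_to_version 'G03R02C01'   # prints v3.2.1
```

This still uses no external tools and stays short enough to paste on one line if needed.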

You're looking for paste
$ grep -Eo '[[:digit:]]+' <<< $GOROCO | bc | paste -s -d"."
3.2.1

how to use sed delete Unicode in some range?

I want to remove Unicode in some range, e.g.:
echo "abcＡＢＣ123" | sed 's/[\uff21-\uff3b]//g'
expect "abc123", but get:
sed: -e expression #1, char 20: Invalid range end
or use:
echo "abcＡＢＣ123" | sed 's/[Ａ-Ｚ]//g'
get:
sed: -e expression #1, char 14: Invalid collation character

Unicode support in sed is not well defined. You may be better off using command line perl:
echo "abcＡＢＣ123" | perl -CS -pe 's/[\x{FF21}-\x{FF3B}]+//g'
abc123
It is important to use -CS flags here to be able to get correct UTF8 encodings for input/output/error.

Not sure why sed is not working, but you can use tr instead
$ echo 'abcＡＢＣ123' | tr -d 'Ａ-Ｚ'
abc123
From man tr
tr - translate or delete characters
-d, --delete
delete characters in SET1, do not translate

Why \d\+ or \d+ is not equal to \d* here?

Bash on Debian.
I want to match the port number at the end of a log line.
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
echo $s | sed 's/\(.*\):\(\d*\)/\2/'
26215
Let's match it with \d\+ or \d+ in sed.
echo $s | sed 's/\(.*\):\(\d\+\)/\2/'
echo $s | sed 's/\(.*\):\(\d+\)/\2/'
Both of them output the whole string:
2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215
Neither of them matches the port number at the end. Why?

There is an easier sed pattern to use:
$ echo "$s" | sed -nE 's/.*:([^:]*)/\1/p'
26215
As stated in comments, regular sed does not have Perl metacharacters such as \d. You need to use the POSIX character class [[:digit:]] instead.
Explanation:
sed -nE 's/.*:([^:]*)/\1/p'
-n with the trailing p: only print if there is a match
-E: use ERE, so you don't need to escape the parens
.*: matches everything up to the rightmost :
([^:]*) captures all characters except :
Or, if you want to be more specific you want only digits:
$ echo "$s" | sed -nE 's/.*:([[:digit:]]+$)/\1/p'
26215
Note + to make sure there is at least one digit and $ to match only at the end of the line.
There is a summary of different regex flavors HERE. With -E sed is using ERE the same as egrep.

\d is a PCRE extension not present in BRE or ERE syntax (as used by standard UNIX tools).
In this particular case, there's no need to use any tools not built into bash for this purpose at all:
s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
echo "Port is ${s##*:}"
This is a parameter expansion; when dealing with small amounts of data, such built-in capabilities are much more efficient than running external tools.
There's also native ERE support built into the shell, as follows:
re=':([[:digit:]]+)$'
[[ $s =~ $re ]] && echo "Port is ${BASH_REMATCH[1]}"
BashFAQ #100 also goes into detail on bash string manipulation.
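The =~ approach above can be wrapped in a small function that also reports failure when the line does not end in a port. This is only a sketch; the name extract_port is hypothetical.

```shell
#!/usr/bin/env bash
# extract_port is a hypothetical helper name for this sketch.

extract_port() {
  local re=':([[:digit:]]+)$'
  # Return non-zero when the line does not end in ":<digits>".
  [[ $1 =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]}"
}

s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
extract_port "$s"   # prints 26215
```

Unlike ${s##*:}, this rejects input whose last colon is not followed by digits only.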

All you need is this:
echo ${s##*:}
Learn your shell string operators.

s="2017-04-17 08:16:14 INFO connecting lh3.googleusercontent.com:443 from 111.111.111.111:26215"
1.grep
echo $s |grep -Po '\d+$'
2.ack
echo $s |ack -o '\d+$'
3.sed
echo $s |sed 's/.*\://'
4.awk
echo $s |awk -F: '{print $NF}'

Self-answer by OP moved from question to community wiki answer, per consensus on meta:
There is no expression \d to stand for numbers in sed.
To get it with awk, simply use:
echo $s |awk -F: '{print $NF}'
26215

Ignore all letters except for capitals

I have an output like Johny-Smith, Juarez-Hugo, etc. and I need instead S, H, etc. Basically, I need the last uppercase letter in a string and that's it. If this is possible in any built in Linux tools (ex awk, sed, grep, etc.) it would be greatly appreciated.

Is this what you need?
echo "Johny-Smith" | sed 's/^.*\([A-Z]\)[^A-Z]*$/\1/g'
Test:
$ echo "Johny-Smith-Hello Johny-Smith" | sed 's/.*\([A-Z]\)[^A-Z]*/\1/g'
S

With GNU grep and if PCRE option is available
$ echo 'Johny-Smith' | grep -oP '.*\K[A-Z]'
S
$ echo 'Juarez-Hugo' | grep -oP '.*\K[A-Z]'
H
-o prints only matched portion
-P Perl regular expression
.*\K matches greedily, then \K drops everything matched so far from the output (similar in effect to a variable-length lookbehind)
[A-Z] any uppercase character
with perl, see perldoc for command line options explanation
$ # prints the string within captured group
$ echo 'Johny-Smith' | perl -lne 'print /.*([A-Z])/'
S
$ echo 'Juarez-Hugo' | perl -lne 'print /.*([A-Z])/'
H

In Bash:
$ var="Johny-Smith-Hello Johny-Smith"; var="${var//[^[:upper:]]/}";echo "${var: -1}"
S
${var//[^[:upper:]]/} remove all non-upper case letter chars
echo ${var: -1} output the last one
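The two expansions above can be combined into a reusable function. A minimal sketch, where the name last_upper is hypothetical and a non-zero return signals that the string contains no uppercase letter:

```shell
#!/usr/bin/env bash
# last_upper is a hypothetical helper name for this sketch.

last_upper() {
  # Delete every character that is not an uppercase letter.
  local caps="${1//[^[:upper:]]/}"
  [[ -n $caps ]] || return 1
  # The space before -1 is required; ":-" would mean "use default value".
  printf '%s\n' "${caps: -1}"
}

last_upper 'Johny-Smith'   # prints S
last_upper 'Juarez-Hugo'   # prints H
```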

sed one-liner to convert all uppercase to lowercase?

I have a textfile in which some words are printed in ALL CAPS. I want to be able to just convert everything in the textfile to lowercase, using sed. That means that the first sentence would then read, 'i have a textfile in which some words are printed in all caps.'

With tr:
# Converts upper to lower case
$ tr '[:upper:]' '[:lower:]' < input.txt > output.txt
# Converts lower to upper case
$ tr '[:lower:]' '[:upper:]' < input.txt > output.txt
Or, with GNU sed (but not BSD or macOS sed, which don't support \L or \U):
# Converts upper to lower case
$ sed -e 's/\(.*\)/\L\1/' input.txt > output.txt
# Converts lower to upper case
$ sed -e 's/\(.*\)/\U\1/' input.txt > output.txt

If you have GNU extensions, you can use sed's \L (lowercase the rest of the replacement, or until \U [upper] or \E [end, toggles case conversion off] is reached), like so:
sed 's/.*/\L&/' <input >output
Note: '&' means the full match pattern.
As a side note, GNU extensions include \U (upper), \u (upper next character of match), \l (lower next character of match). For example, if you wanted to capitalize every word of a sentence:
$ sed -E 's/\w+/\u&/g' <<< "Now is the time for all good men..." # Capitalize each word
Now Is The Time For All Good Men...
Note: Since the assumption is we have GNU extensions, we can use sequences such as \w (match a word character) and the -E (extended regex) option, which relieves you of having to escape the one-or-more quantifier (+) and certain other special regex characters.

You also can do this very easily with awk, if you're willing to consider a different tool:
echo "UPPER" | awk '{print tolower($0)}'

Here are many solutions:
To uppercase with perl, tr, sed and awk
perl -ne 'print uc'
perl -npe '$_=uc'
perl -npe 'tr/[a-z]/[A-Z]/'
perl -npe 'tr/a-z/A-Z/'
tr '[a-z]' '[A-Z]'
sed y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
sed 's/\([a-z]\)/\U\1/g'
sed 's/.*/\U&/'
awk '{print toupper($0)}'
To lowercase with perl, tr, sed and awk
perl -ne 'print lc'
perl -npe '$_=lc'
perl -npe 'tr/[A-Z]/[a-z]/'
perl -npe 'tr/A-Z/a-z/'
tr '[A-Z]' '[a-z]'
sed y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
sed 's/\([A-Z]\)/\L\1/g'
sed 's/.*/\L&/'
awk '{print tolower($0)}'
Complicated bash to lowercase :
while read v;do v=${v//A/a};v=${v//B/b};v=${v//C/c};v=${v//D/d};v=${v//E/e};v=${v//F/f};v=${v//G/g};v=${v//H/h};v=${v//I/i};v=${v//J/j};v=${v//K/k};v=${v//L/l};v=${v//M/m};v=${v//N/n};v=${v//O/o};v=${v//P/p};v=${v//Q/q};v=${v//R/r};v=${v//S/s};v=${v//T/t};v=${v//U/u};v=${v//V/v};v=${v//W/w};v=${v//X/x};v=${v//Y/y};v=${v//Z/z};echo "$v";done
Complicated bash to uppercase :
while read v;do v=${v//a/A};v=${v//b/B};v=${v//c/C};v=${v//d/D};v=${v//e/E};v=${v//f/F};v=${v//g/G};v=${v//h/H};v=${v//i/I};v=${v//j/J};v=${v//k/K};v=${v//l/L};v=${v//m/M};v=${v//n/N};v=${v//o/O};v=${v//p/P};v=${v//q/Q};v=${v//r/R};v=${v//s/S};v=${v//t/T};v=${v//u/U};v=${v//v/V};v=${v//w/W};v=${v//x/X};v=${v//y/Y};v=${v//z/Z};echo "$v";done
Simple bash to lowercase :
while read v;do echo "${v,,}"; done
Simple bash to uppercase :
while read v;do echo "${v^^}"; done
Note that ${v,} and ${v^} only change the first letter.
You should use it that way :
(while read v;do echo "${v,,}"; done) < input_file.txt > output_file.txt
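The plain `while read v` loop strips leading whitespace, interprets backslashes, and skips a last line that has no trailing newline. A slightly more defensive sketch, assuming bash 4 or later for the ${var,,} expansion; the function name to_lower is hypothetical.

```shell
#!/usr/bin/env bash
# to_lower is a hypothetical helper name for this sketch; needs bash 4+.

to_lower() {
  local line
  # IFS= keeps leading/trailing blanks, -r keeps backslashes literal,
  # and the || test handles a final line without a newline.
  while IFS= read -r line || [[ -n $line ]]; do
    printf '%s\n' "${line,,}"
  done
}

printf 'I Have ALL CAPS\n' | to_lower   # prints i have all caps
```

It is used the same way as the loop above: to_lower < input_file.txt > output_file.txt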

I like some of the answers here, but there is a sed command that should do the trick on any platform:
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
Anyway, it's easy to understand. And knowing about the y command can come in handy sometimes.

If you have GNU sed (likely on Linux, but not on *BSD or macOS):
echo "Hello MY name is SUJIT " | sed 's/./\L&/g'
Output:
hello my name is sujit

If you are using POSIX sed:
To match a pattern in any case, first convert the search pattern with this sed command, then use the converted pattern as a regex in the command you want:
MyNewPattern=$(echo "${MyOrgPattern}" | sed "s/[aA]/[aA]/g;s/[bB]/[bB]/g;s/[cC]/[cC]/g;s/[dD]/[dD]/g;s/[eE]/[eE]/g;s/[fF]/[fF]/g;s/[gG]/[gG]/g;s/[hH]/[hH]/g;s/[iI]/[iI]/g;s/[jJ]/[jJ]/g;s/[kK]/[kK]/g;s/[lL]/[lL]/g;s/[mM]/[mM]/g;s/[nN]/[nN]/g;s/[oO]/[oO]/g;s/[pP]/[pP]/g;s/[qQ]/[qQ]/g;s/[rR]/[rR]/g;s/[sS]/[sS]/g;s/[tT]/[tT]/g;s/[uU]/[uU]/g;s/[vV]/[vV]/g;s/[wW]/[wW]/g;s/[xX]/[xX]/g;s/[yY]/[yY]/g;s/[zZ]/[zZ]/g")
YourInputStreamCommand | egrep "${MyNewPattern}"
To convert to lowercase:
sed "s/[aA]/a/g;s/[bB]/b/g;s/[cC]/c/g;s/[dD]/d/g;s/[eE]/e/g;s/[fF]/f/g;s/[gG]/g/g;s/[hH]/h/g;s/[iI]/i/g;s/[jJ]/j/g;s/[kK]/k/g;s/[lL]/l/g;s/[mM]/m/g;s/[nN]/n/g;s/[oO]/o/g;s/[pP]/p/g;s/[qQ]/q/g;s/[rR]/r/g;s/[sS]/s/g;s/[tT]/t/g;s/[uU]/u/g;s/[vV]/v/g;s/[wW]/w/g;s/[xX]/x/g;s/[yY]/y/g;s/[zZ]/z/g"
For uppercase, do the same but replace the lowercase letter between the slashes with its uppercase equivalent.
Have fun

short, sweet and you don't even need redirection :-)
perl -p -i -e 'tr/A-Z/a-z/' file

Instead of typing this long expression:
sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' input
One could use this:
sed 'y/'$(printf "%s" {A..Z} "/" {a..z} )'/' input
