How did [a-z] match é?

Wow, this actually matched an é. What happened here? I would like it not to match anything other than the typical lowercase letters.
$ echo "frappé"|egrep -E "^[a-z]+$"
frappé
egrep (GNU grep) 2.16 on Ubuntu 14.04

Your locale setting tells egrep/grep -E how to collate the [a-z] character range.
$ export LC_COLLATE=C
$ echo "frappé" | egrep '^[a-z]+$'
# no match
$ export LC_COLLATE=en_US.utf8
$ echo "frappé" | egrep '^[a-z]+$'
frappé
Named character classes are governed by LC_CTYPE rather than LC_COLLATE, so they can still match characters with diacritics when the collation order is C:
$ export LC_COLLATE=C
$ echo "frappé" | egrep '^[[:lower:]]+$'
frappé
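If changing the locale for the whole session is undesirable, the variable can be set for a single command instead. A minimal sketch, assuming a POSIX shell; the function name ascii_lower_only is made up for illustration, and LC_ALL is used because it overrides both LC_COLLATE and LC_CTYPE:

```shell
# Hypothetical helper: keep only lines made up of ASCII a-z.
# LC_ALL=C applies to this one grep invocation, not to the rest of the shell.
ascii_lower_only() {
  LC_ALL=C grep -E '^[a-z]+$'
}

printf 'frappé\nfrappe\n' | ascii_lower_only
# prints: frappe
```

In the C locale every byte is its own character and the range a-z covers only the 26 ASCII letters, so the bytes that encode é fall outside it.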

Related

Why doesn't this sed expression remove lines with Korean as expected?

I combined these two answers to produce this sed command:
sed '/[\u3131-\uD79D]/d' text.txt # Remove all lines with Korean characters
However it outputs only the lines with Korean characters:
$ cat text.txt
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the
$ sed '/[\u3131-\uD79D]/d' text.txt # Korean characters pattern fails
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
$ sed '/Hello/d' text.txt # Simple pattern works
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
$ sed '/[0-9]/d' text.txt # Simple range works
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the
$ sed --version # Git Bash for Windows 2.33.0.windows.2
sed (GNU sed) 4.8
Is this a bug with sed? I was able to use the equivalent command in gVim successfully:
:g/[\u3131-\uD79D]/d

It has to do with the collation order of the expression in the bracket, because sed follows POSIX. You need a collation order that sorts by numeric Unicode code point, such as C.UTF-8, and then you need to encode the range endpoints in UTF-8. There is an explanation of the details here.
This is how you apply it to your range in a bash shell (I used Linux to test it):
$ # first get octal representation of range unicode code points
$ # iconv is to convert to utf-8 in case your locale is not utf-8
$ printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1
343 204 261 355 236 235
$ # format it as a sed range
$ printf '\o%s\o%s\o%s-\o%s\o%s\o%s' $(printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1); echo
\o343\o204\o261-\o355\o236\o235
$ # use the range in sed
$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
...
$
Here is the output:
$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
1
00:00:00,000 --> 00:00:05,410
Hello, today we're going to explain how to use the
$ sed '/[\u3131-\uD79D]/d' text.txt # Korean characters pattern fails
$ sed '/Hello/d' text.txt # Simple pattern works
1
00:00:00,000 --> 00:00:05,410
$ sed '/[0-9]/d' text.txt # Simple range works
Hello, today we're going to explain how to use the
$
EDIT: helper script/functions
This bash script or its functions can be used to obtain a sed unicode range:
#!/bin/bash
# sur - sed unicode range
#
# Converts a unicode range into an octal utf-8 range suitable for sed
#
# Usage:
# sur \\u452 \\u490
#
# sur \\u3131 \\uD79D
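# Print the UTF-8 bytes of the escape in $1 as \oNNN octal escapes
to_octal() {
    printf "$1" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
}
# Join the two endpoints into a bracket-expression range
sur () {
    echo "$(to_octal "$1")-$(to_octal "$2")"
}
sur "$1" "$2"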
To use the script, make sure it is executable and in your PATH. Here is an example of how to use the functions. I just copied and pasted them into a bash shell:
$ to_octal() {
> printf "$1" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
> }
$
$ sur () {
> echo "$(to_octal $1)-$(to_octal $2)"
> }
$
$ sur \\u3131 \\uD79D
\o343\o204\o261-\o355\o236\o235
$ sur \\u452 \\u490
\o321\o222-\o322\o220
$
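Once the range has been generated, it can be wrapped in a small function. This is only a sketch: strip_korean is a made-up name, GNU sed is assumed for the \o escapes, and the C.UTF-8 locale is assumed to be installed. The octal range is the one computed above for \u3131-\uD79D.

```shell
# Range from the answer: UTF-8 bytes of U+3131 and U+D79D as GNU sed \o escapes.
range='\o343\o204\o261-\o355\o236\o235'

# Hypothetical helper: delete every line containing a character in the range.
# Assumes GNU sed and an available C.UTF-8 locale; reads files or stdin.
strip_korean() {
  LC_ALL=C.UTF-8 sed "/[$range]/d" "$@"
}

printf '안녕하세요\nHello\n' | strip_korean
```

Passing "$@" through lets the same function be called as strip_korean text.txt or at the end of a pipeline.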

grep for lines containing "standard" US characters only

I'm trying to figure out how to grep for lines that are made up of A-Z and a-z exclusively, that is, the "American" alphabet of letters. I would expect this to work, but it does not:
$ echo -e "Jutland\nJastrząb" | grep -x '[A-Za-z]*'
Jutland
Jastrząb
I want this to only print "Jutland", because ą is not a letter in the American alphabet. How can I achieve this?

You need to add LC_ALL=C before grep:
printf '%b\n' "Jutland\nJastrząb" | LC_ALL=C grep -x '[A-Za-z]*'
Jutland
You may also use -i switch to ignore case and reduce regex:
printf '%b\n' "Jutland\nJastrząb" | LC_ALL=C grep -ix '[a-z]*'
LC_ALL=C avoids locale-dependent effects; otherwise your current locale may treat ą as part of the range [a-zA-Z].

You can use perl regex:
$ echo -e "Jutland\nJastrząb" | grep -P '^[[:ascii:]]+$'
Jutland
It's experimental though:
-P, --perl-regexp
Interpret the pattern as a Perl-compatible regular expression (PCRE). This is experimental and
grep -P may warn of unimplemented features.
EDIT
For letters only, use [A-Za-z]:
$ echo -e "L'Egyptienne\nJutland\nJastrząb" | grep -P '^[A-Za-z]+$'
Jutland

Ignore all letters except for capitals

I have output like Johny-Smith, Juarez-Hugo, etc., and I need S, H, etc. instead. Basically, I need the last uppercase letter in a string and that's it. If this is possible with any built-in Linux tools (e.g. awk, sed, grep), it would be greatly appreciated.

Do you need something like this?
echo "Johny-Smith" | sed 's/^.*\([A-Z]\)[^A-Z]*$/\1/g'
Test:
$ echo "Johny-Smith-Hello Johny-Smith" | sed 's/.*\([A-Z]\)[^A-Z]*/\1/g'
S

With GNU grep, if the PCRE option is available:
$ echo 'Johny-Smith' | grep -oP '.*\K[A-Z]'
S
$ echo 'Juarez-Hugo' | grep -oP '.*\K[A-Z]'
H
-o prints only matched portion
-P Perl regular expression
.*\K works like a variable-length lookbehind: everything matched before \K is dropped from the output
[A-Z] any uppercase character
With perl (see perldoc for an explanation of the command-line options):
$ # prints the string within captured group
$ echo 'Johny-Smith' | perl -lne 'print /.*([A-Z])/'
S
$ echo 'Juarez-Hugo' | perl -lne 'print /.*([A-Z])/'
H

In Bash:
$ var="Johny-Smith-Hello Johny-Smith"; var="${var//[^[:upper:]]/}";echo "${var: -1}"
S
${var//[^[:upper:]]/} remove all non-upper case letter chars
echo ${var: -1} output the last one
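The same two expansions can be wrapped in a function so they can be applied to several strings. A small sketch, assuming bash (the substitution and substring expansions are not POSIX sh); last_upper is a made-up name:

```shell
# Hypothetical helper: print the last uppercase letter of its argument, if any.
last_upper() {
  local s=${1//[^[:upper:]]/}              # drop everything that is not an uppercase letter
  [ -n "$s" ] && printf '%s\n' "${s: -1}"  # print the last remaining character
}

for name in Johny-Smith Juarez-Hugo; do
  last_upper "$name"
done
# prints:
# S
# H
```

The function prints nothing and returns a non-zero status when the string contains no uppercase letter.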

Sed replace asterisk symbols

I am trying to replace a series of asterisk symbols in a text file with -999.9 using sed. However, I can't figure out how to properly escape the wildcard symbol.
e.g.
$ echo "2006.0,1.0,************,-5.0" | sed 's/************/-999.9/g'
sed: 1: "s/************/-999.9/g": RE error: repetition-operator operand invalid
Doesn't work. And
$ echo "2006.0,1.0,************,-5.0" | sed 's/[************]/-999.9/g'
2006.0,1.0,-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9,-5.0
puts a -999.9 for every * which isn't what I intended either.
Thanks!

Use this:
echo "2006.0,1.0,************,-5.0" | sed 's/[*]\+/-999.9/g'
Test:
$ echo "2006.0,1.0,************,-5.0" | sed 's/[*]\+/-999.9/g'
2006.0,1.0,-999.9,-5.0

Any of these (and more) is a regexp that will modify that line as you want:
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\**/999.9/g'
2006.0,1.0,999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\+/999.9/g'
2006.0,1.0,999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed -r 's/\*+/999.9/g'
2006.0,1.0,999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\{12\}/999.9/g'
2006.0,1.0,999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed -r 's/\*{12}/999.9/g'
2006.0,1.0,999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\{1,\}/999.9/g'
2006.0,1.0,999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed -r 's/\*{1,}/999.9/g'
2006.0,1.0,999.9,-5.0
sed operates on regular expressions, not strings, so you need to learn regular expression syntax if you're going to use sed. In particular, learn the difference between BREs (which sed uses by default), EREs (which some seds can be told to use instead) and PCREs (which sed never uses but some other tools and "regexp checkers" do). The first solution above is a BRE that will work on all seds on all platforms; the \{12\} and \{1,\} interval forms are also standard POSIX BRE, while \+ and -r are extensions that not every sed supports. Google is your friend.
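For reuse, the portable form can be put in a small filter. A sketch only: stars_to_missing is a made-up name, and -999.9 is the replacement value the question asked for. It uses the first form listed above (\*\**, one literal asterisk followed by zero or more), which needs no extensions.

```shell
# Hypothetical helper: replace every run of asterisks with -999.9.
# \*\** is plain BRE: one literal '*' followed by zero or more literal '*'.
stars_to_missing() {
  sed 's/\*\**/-999.9/g'
}

echo "2006.0,1.0,************,-5.0" | stars_to_missing
# prints: 2006.0,1.0,-999.9,-5.0
```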

* is a regex symbol that needs to be escaped.
You can even use Bash string replacement:
s="2006.0,1.0,************,-5.0"
echo "${s/\**,/-999.9,}"
2006.0,1.0,-999.9,-5.0
Using sed:
sed 's/\*\+/999.9/g' <<< "$s"
2006.0,1.0,999.9,-5.0

Yes, * is a special metacharacter that repeats the previous token zero or more times. Escape * in order to match literal * characters.
sed 's/\*\*\*\*\*\*\*\*\*\*\*\*/-999.9/g'

When this possibility was introduced into gawk I have no idea!
gawk -F, '{sub(/************/,"-999.9",$3)}1' OFS=, file
2006.0,1.0,-999.9,-5.0

Why [^\d\w\s,] matches "leonardo,davinci"?

I can't understand why the regexp:
[^\d\s\w,]
Matches the string:
"leonardo,davinci"
That is my test:
$ echo "leonardo,davinci" | egrep '[^\d\w\s,]'
leonardo,davinci
While this works as expected:
$ echo "leonardo,davinci" | egrep '[\S\W\D]'
$
Thanks very much

It's because egrep doesn't have the predefined sets \d, \w, \s. Inside a bracket expression, putting a backslash in front of them just matches the characters literally, so [^\d\w\s,] matches any character other than \, d, w, s and the comma:
leonardo,davinci
echo "leonardo,davinci" | egrep '[^a-zA-Z0-9 ,]'
will indeed not match.
If you have it installed, you can use pcregrep instead:
echo "leonardo,davinci" | pcregrep '[^\w\s,]'
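If pcregrep is not available, POSIX character classes give a rough ERE equivalent: [[:alnum:]_] stands in for \w (which already covers \d) and [[:space:]] for \s. A sketch, with has_unexpected_char as a made-up helper name:

```shell
# Hypothetical helper: succeed if stdin contains a character that is not
# a letter, digit, underscore, whitespace or comma.
has_unexpected_char() {
  grep -Eq '[^[:alnum:]_[:space:],]'
}

echo "leonardo,davinci" | has_unexpected_char || echo "clean"
echo "leonardo;davinci" | has_unexpected_char && echo "unexpected character found"
# prints:
# clean
# unexpected character found
```

Which characters count as [[:alnum:]] depends on LC_CTYPE, so prefix the grep with LC_ALL=C if only ASCII letters should be accepted.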
