Remove lines that contain non-english (Ascii) characters from a file

I have a text file with characters from different languages (Chinese, Latin, etc.).
I want to remove all lines that contain these non-English characters. I want to keep all English characters (a-z), numbers (0-9) and all punctuation.
How can I do it using Unix tools like awk or sed?

Perl supports an [:ascii:] character class.
perl -nle 'print if m{^[[:ascii:]]+$}' inputfile

You can use Awk, provided you force the use of the C locale:
LC_CTYPE=C awk '! /[^[:alnum:][:space:][:punct:]]/' my_file
The environment variable LC_CTYPE=C (or LC_ALL=C) forces the use of the C locale for character classification. It changes the meaning of the character classes ([:alnum:], [:space:], etc.) to match only ASCII characters.
The /[^[:alnum:][:space:][:punct:]]/ regex matches lines with any non-ASCII character. The ! before the regex inverts the condition, so only lines without any non-ASCII characters will match. Then, as no action is given, the default action (print) is used for matching lines.
EDIT: This can also be done with grep:
LC_CTYPE=C grep -v '[^[:alnum:][:space:][:punct:]]' my_file
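As a quick check of the filter above, here is a minimal sketch that wraps the grep variant in a shell function and runs it on three sample lines. The function name ascii_lines and the sample text are made up for illustration. It uses LC_ALL=C rather than LC_CTYPE=C so that an LC_ALL already set in the environment cannot override it.

```shell
# Hypothetical helper: print only the lines made up entirely of ASCII
# letters, digits, whitespace and punctuation (reads stdin or files).
ascii_lines() {
    LC_ALL=C grep -v '[^[:alnum:][:space:][:punct:]]' "$@"
}

# Sample input: the second line contains a non-ASCII character.
printf '%s\n' 'hello, world!' 'fish 鱼' 'abc 123' | ascii_lines
```

With a reasonably recent grep this should print only the first and third lines.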

With GNU grep, which supports perl compatible regular expressions, you can use:
grep -P '^[[:ascii:]]+$' file

You can use egrep -v to return only lines not matching the pattern and use something like [^ a-zA-Z0-9.,;:'"?!-] as pattern (include more punctuation as needed, but keep the hyphen last so that it is not read as a range).
Hm, thinking about it, a double negation (-v and the inverted character class) is probably not that good. Another way might be ^[ a-zA-Z0-9.,;:'"?!-]*$.
You can also just filter for ASCII:
egrep -v "[^ -~]" foo.txt

Related

How to match only items preceded by a-z, A-Z, space, or the start of a line when searching with grep?

I need to display all lines in file.txt containing the character "鱼", but only those where "鱼" is immediately preceded by a-z, A-Z, a space, or a line break.
I tried using grep, like this:
grep "[a-zA-Z\s\n]鱼" file.txt
The regular expression [a-zA-Z\s\n] does not appear to work. How can I search for this character, when appearing after a-z, A-Z, a space, or a line break?

If you want to match a space with grep, use a space:
grep "[a-zA-Z ]鱼" file.txt
If you want to match any whitespace, you can use the Posix standard character class:
grep "[a-zA-Z[:space:]]鱼" file.txt
("Any whitespace" is space, newline, carriage return, form feed, tab and vertical tab. If you just want to match space and tab, you can use [:blank:].)
You might also want to use a standard class for letters. Unless you are in the Posix or "C" locale, the meanings of character ranges like A-Z are unpredictable.
grep "[[:alpha:][:space:]]鱼" file.txt
grep works line by line, so it will never see a newline. But using an "extended" pattern, you can also match at the beginning of the line:
egrep "(^|[[:alpha:][:space:]])鱼" file.txt
(You can use grep -E instead of egrep if you prefer. But you need one or the other for the above regular expression to work.)
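To see how the last pattern behaves, here is a small sketch with made-up sample lines (the function name match_fish and the input are assumptions for illustration). It deliberately uses A-Za-z under LC_ALL=C instead of [:alpha:], because in a UTF-8 locale [:alpha:] may also match non-ASCII letters such as 大. In the C locale the bytes of 鱼 are simply matched literally.

```shell
# Hypothetical helper: lines where 鱼 is at the start of the line or
# directly preceded by an ASCII letter, a space or a tab.
match_fish() {
    LC_ALL=C grep -E '(^|[A-Za-z[:blank:]])鱼' "$@"
}

# Only the first three sample lines satisfy the condition.
printf '%s\n' '鱼 at start' 'a鱼' 'big 鱼' '1鱼' '大鱼' | match_fish
```

The last two lines should be dropped: in one 鱼 follows a digit, in the other it follows a non-ASCII character.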

Grep does not support this by default
$ man grep | grep '\\s'
But awk does
$ man awk | grep '\\s'
\s Matches any whitespace character.
So perhaps use
awk '/[a-zA-Z\s\n]鱼/' file.txt

Use awk:
awk '/[A-Za-z \t]鱼/ || (NR > 1 && /^鱼/)' file
This prints a line if 鱼 comes after [A-Za-z \t], or if the line is not the first one and 鱼 is at its beginning: NR > 1 && /^鱼/.
If you simply want 鱼 at the beginning of a line or preceded by [A-Za-z \t], you can do this:
awk '/(^|[A-Za-z \t])鱼/' file
Or
grep -E '(^|[A-Za-z[:blank:]])鱼' file

Try this one:
^[a-zA-Z \n]{1,}鱼
{1,} ensures that 鱼 has at least one of these characters before it.
What is more, I suggest using awk in this particular case.

Escaping given list of characters in bash

I was hoping someone could help me out with a one-liner in bash, using anything standard like sed, awk, etc., to take a string and insert a backslash before any characters that match a given list of characters.
For example, the input string abj:"si8'h4# should become abj\:\"si8\'h4\# if the list of characters I want to escape is "':#
Any help is appreciated!

Here's a simple perl version.
echo 'abj:"si8'\''h4#' | perl -pe 's/(["'\'':#])/\\\1/g'
abj\:\"si8\'h4\#
The problem is how you want to input the list of characters to escape. If you're inputting them yourself, this solution will work fine, though you have to be careful with escaping single quotes. If you want to avoid escaping issues, perhaps you'd want to read the list of characters to escape from a file?

sed
echo 'abj:"si8'"'h4#" | sed 's/["\x27:#]/\\&/g'
abj\:\"si8\'h4\#

The set of characters that you want to escape as a variable:
$ esc="\"':#"
The string that you want to escape:
$ str=abj:\"si8\'h4\#
Escaping with sed:
$ echo "$str" | sed 's/['"$esc"']/\\&/g'
Output:
abj\:\"si8\'h4\# <--- Output
abj\:\"si8\'h4\# <--- Correct string for comparison
Yup, that seems to do it.
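The same idea can be wrapped in a small function. This is only a sketch: the name escape_chars is made up, and it assumes the character list contains none of ], ^, -, \ or /, which would need special placement or escaping inside the sed bracket expression.

```shell
# Hypothetical helper: backslash-escape every character of $2 found in $1.
# Assumes $2 contains no ']', '^', '-', '\' or '/'.
escape_chars() {
    printf '%s\n' "$1" | sed 's/['"$2"']/\\&/g'
}

# Escape the characters " ' : # in the example string from the question.
escape_chars 'abj:"si8'\''h4#' "\"':#"
```

For the question's example this should give abj\:\"si8\'h4\#.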

using sed to replace ^[(s3B with blank space

I'm trying to use sed with perl to replace ^[(s3B with an empty string in several files.
s/^[(s3B// isn't working though, so I'm wondering what else I could try.

You need to quote the special characters:
$ echo "^[(s3B AAA ^[(s3B"|sed 's/\^\[[(]s3B//g'
AAA
$ echo "^[(s3B AAA ^[(s3B" >file.txt
$ perl -p -i -e 's/\^\[[(]s3B//g' file.txt
$ cat file.txt
AAA

The problem is that there are several characters that have a special meaning in regular expressions. ^ is a start-of-line anchor, [ opens a character class, and ( opens a capture.
You can escape all non-alphanumerics in a Perl string by preceding it with \Q, so you can safely use
s/\Q^[(s3B//
which is equivalent to, and more readable than
s/\^\[\(s3B//

If you're dealing with ANSI sequences (xterm color sequences, escape sequences), then ^[ is not '^' followed by '[' but rather an unprintable character ESC, ASCII code 0x1B.
To put that character into a sed expression you need to use \x1B in GNU sed, or see http://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/ . You can also insert special characters directly into your command line using ctrl+v in Bash line editing.
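If \x1B is not available in your sed, one possible workaround is to let printf produce the ESC byte and splice it into the expression. A minimal sketch, with strip_seq as a made-up name:

```shell
# Hypothetical helper: delete the escape sequence ESC ( s 3 B from the input.
strip_seq() {
    esc=$(printf '\033')            # a literal ESC byte (0x1B)
    sed "s/${esc}(s3B//g" "$@"      # '(' is literal in a basic regex
}

# The sample line contains one real escape sequence between "AAA " and "BBB".
printf 'AAA \033(s3BBBB\n' | strip_seq
```

This should print AAA BBB. Note that it only matches a real ESC byte, not the two visible characters ^ and [.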

In regex, "^", "[" and "(" (and many others) are special characters used for special regex features; if you are referring to the characters themselves, you should precede them with "\".
The correct substitution regex would be:
$string =~ s/\^\[\(s3B//g
if you want to replace all occurrences.

Perl regular expression matching on large Unicode code points

I am trying to replace various characters with either a single quote or double quote.
Here is my test file:
# Replace all with double quotes
＂ fullwidth
“ left
” right
„ low
" normal
# Replace all with single quotes
' normal
‘ left
’ right
‚ low
‛ reverse
` backtick
I'm trying to do this...
perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' test.txt
But only the backtick character gets replaced properly. I think it has something to do with the other code points being too large, but I cannot find any documentation on this.
Here I have a one-liner which dumps the Unicode code points, to verify they match my regular expression.
$ awk -F' ' '{print $1}' test.txt | \
perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'
U+FF02 ＂
U+201C “
U+201D ”
U+201E „
U+0022 "
U+0027 '
U+2018 ‘
U+2019 ’
U+201A ‚
U+201B ‛
U+0060 `
Why isn't my regular expression matching?

It isn’t matching because you forgot the -CSAD in your call to Perl, and don’t have $PERL_UNICODE set in your environment. You have only said -Mutf8 to announce that your source code is in that encoding. This does not affect your I/O.
You need:
$ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt
I do mention this sort of thing in this answer a couple of times.

With use utf8;, you told Perl your source code is UTF-8. This is useless (though harmless) since you've limited your source code to ASCII.
With /u, you told Perl to use the Unicode definitions of \s, \d, \w. This is useless (though harmless) since you don't use any of those patterns.
You did not decode your input, so your input consists solely of bytes, so most of the characters in your class (e.g. \x{2018}) can't possibly match anything. You need to decode your input (and of course, encode your output). Using -CSD will likely do this.
perl -CSD -i -pe'
s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
' text.txt
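To try the decoding point without touching a file, here is a sketch that runs the same two substitutions as a filter on standard input. The function name ascii_quotes and the sample line are made up, and it assumes the input is UTF-8.

```shell
# Hypothetical helper: normalise curly/fullwidth quotes to ASCII quotes.
# -CSD makes Perl decode its input and encode its output as UTF-8.
ascii_quotes() {
    perl -CSD -pe '
        s/[\x{2018}\x{2019}\x{201A}\x{201B}\x{FF07}\x{60}]/\x27/g;
        s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
    ' "$@"
}

# Sample line with left/right double and single curly quotes.
printf '%s\n' '“left” and ‘right’' | ascii_quotes
```

This should print "left" and 'right' with plain ASCII quotes. Dropping -CSD from the perl call should leave the curly quotes unchanged, which is the behaviour described in the question.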

Pattern matching digits does not work in egrep?

Why can't I match the string
"1234567-1234567890"
with the given regular expression
\d{7}-\d{10}
with egrep from the shell like this:
egrep \d{7}-\d{10} file
?

egrep doesn't recognize \d shorthand for digit character class, so you need to use e.g. [0-9].
Moreover, while it's not absolutely necessary in this case, it's good habit to quote the regex to prevent misinterpretation by the shell. Thus, something like this should work:
egrep '[0-9]{7}-[0-9]{10}' file
See also
egrep mini tutorial
References
regular-expressions.info/Flavor comparison
Flavor note for GNU grep, ed, sed, egrep, awk, emacs
Lists the differences between grep vs egrep vs other regex flavors
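Both points can be checked from the shell. The sketch below first shows what an unquoted \d{7}-\d{10} turns into before grep ever sees it, then matches with a quoted bracket expression. The function name match_ids and the sample lines are made up for illustration.

```shell
# Unquoted, the shell removes the backslashes, so this prints d{7}-d{10}:
printf '%s\n' \d{7}-\d{10}

# Hypothetical helper using a quoted, portable bracket expression for digits.
match_ids() {
    grep -E '[0-9]{7}-[0-9]{10}' "$@"
}

# Only the first sample line has 7 digits, a hyphen, then 10 digits.
printf '%s\n' '1234567-1234567890' '123-456' | match_ids
```

So even with a grep that understood \d, the unquoted command from the question would have searched for runs of the letter d.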

For completeness:
Egrep does in fact have support for character classes. The classes are:
[:alnum:]
[:alpha:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
Example (note the double brackets):
egrep '[[:digit:]]{7}-[[:digit:]]{10}' file

You can use \d if you pass grep the "perl regex" option (-P), e.g.:
grep -P "\d{9}"

Use [0-9] instead of \d. egrep doesn't know \d.

Try this one:
egrep '([0-9]{7}-[0-9]{10})' file
