How to replace all unicode characters except for Spanish ones? - regex

I am trying to remove all Unicode characters from a file except for the Spanish characters.
Matching the accented vowels has not been an issue: áéíóúÁÉÍÓÚ are kept by the following regex while all other Unicode appears to be replaced:
perl -pe 's/[^áéíóúÁÉÍÓÚ[:ascii:]]//g;' filename
But when I add the inverted question mark ¿ or exclamation mark ¡ to the regex, other Unicode characters that I would like removed are also excluded from the substitution:
perl -pe 's/[^áéíóúÁÉÍÓÚ¡¿[:ascii:]]//g;' filename does not replace the following (some are not printable):
³ � �
­
Am I missing something obvious here? I am also open to other ways of doing this on the terminal.

You have a UTF-8 encoded file and work with Unicode characters, so you need to pass a specific set of options to let Perl know that.
Add -Mutf8 so that Perl recognizes the UTF-8-encoded characters used directly in your Perl code.
You also need to pass -CSD (equivalent to -CIOED) so that your input is decoded and your output re-encoded. This switch is encoding dependent; it works for UTF-8.
perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g;' filename
Do not forget about Ü and ü.
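A quick sanity check at the command line (a made-up sample string; assumes a UTF-8 shell and terminal):
printf 'Año ¡qué lío!³\n' | perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g'
# -> Año ¡qué lío!   (the superscript ³ is stripped, the Spanish characters survive)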

sed -e : is it possible to match everything between two quotation marks?

Sed, is it possible to match everything between two chars?
In a script that I have to use there is a bug.
The script has to replace the value of #define MAPPING. The line containing the bug is the one below:
sed -i -e "s/#define MAPPING \"\"/#define MAPPING \"$string\"/1" file.hpp
Since in file.hpp MAPPING is defined as:
#define MAPPING ""
the script works, but if I call the script again after MAPPING has already been redefined, sed no longer matches #define MAPPING "" and thus does not overwrite anything.
I'm not a sed expert, and with a quick search couldn't find the way to let it match
#define MAPPING "<everything>".
Is it possible to achieve this?
This does what you want:
sed -Ei 's/(#define MAPPING ")[^"]*(")/\1'"$string\2/" file.hpp
[^"]* means zero or more non double quote characters.
I used backreferences instead of repeating the same text; that part is up to you.
The 1 at the end of your example means replace the first occurrence. However, this is the default, so it can be removed.
Be aware: if $string contains sequences like &, \5, or \\, they won't be passed literally, and can even cause an error. Also, C escapes like \t for tab are expanded by many sed implementations (so you'll end up with a literal tab in the file, instead of \t).
For what it's worth, this sed command does the same thing, but is more accommodating of varied whitespace:
sed -Ei 's/(^[[:space:]]*#[[:space:]]*define[[:space:]]+MAPPING[[:space:]]+")[^"]*(")/\1'"$string\2/" file.hpp
You can also try:
sed -i -e "s/#define MAPPING \".*\"/#define MAPPING \"$string\"/1" file.hpp
The dot matches any single character and the star means zero or more repetitions, so .* accepts any sequence of characters, including an empty string.
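A quick way to convince yourself that either form survives a second run (a throwaway file; assumes GNU sed, whose -i works as used in the question):
printf '#define MAPPING ""\n' > file.hpp
string=foo
sed -i -e "s/#define MAPPING \".*\"/#define MAPPING \"$string\"/" file.hpp
string=bar
sed -i -e "s/#define MAPPING \".*\"/#define MAPPING \"$string\"/" file.hpp
cat file.hpp
# -> #define MAPPING "bar"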

Find and replace curly quotes inside a character class

I'm getting strange results when I try to find and replace curly quotes inside a character class, with another character:
sed -E "s/[‘’]/'/g" in.txt > out.txt
in.txt: ‘foo’
out.txt: '''foo'''
If you use the letter a as the replacement instead, you'll get aaafooaaa. But this is only an issue when the curly quotes are inside a character class. This works:
sed -E "s/(‘|’)/'/g" in.txt > out.txt
in.txt: ‘foo’
out.txt: 'foo'
Can anyone explain what's going on here? Can I still use a character class for curly quotes?
Your string is using a multibyte encoding, specifically UTF-8; the curly quotes are three bytes each. But your sed implementation is treating each byte as a separate character. This is probably due to your locale settings. I can reproduce your problem by setting my locale to "C" (the old default POSIX locale, which assumes ASCII):
$ LC_ALL=C sed -E "s/[‘’]/'/g" <<<'‘foo’' # C locale, single-byte chars
'''foo'''
But in my normal locale of en_US.UTF-8 ("US English encoded with UTF-8"), I get the desired result:
$ LC_ALL=en_US.UTF-8 sed -E "s/[‘’]/'/g" <<<'‘foo’' # UTF-8 locale, multibyte chars
'foo'
The way you're running it, sed doesn't see [‘’] as a sequence of four characters but as a sequence of eight bytes. So each of the six bytes between the brackets (or at least, each of the four unique byte values found among them) is considered a member of the character class, and each matching byte is separately replaced by the apostrophe. Which is why your three-byte curly quotes are getting replaced by three apostrophes each.
The version that uses alternation works because each alternate can be more than one character; even though sed is still treating ‘ and ’ as three-character sequences instead of individual characters, that treatment doesn't change the result.
So make sure your locale is set properly for your text encoding and see if that resolves your issue.
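If you're not sure which UTF-8 locales are installed, something like this helps (locale names vary by system; en_US.UTF-8 is just an example):
locale -a | grep -iE 'utf-?8'
export LC_ALL=en_US.UTF-8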

How to negate two specific words in a regex?

I have a file containing words, like these.
Good words (to keep):
művész-ként
luisz-ként
gravid-ként
chips-ként
bizottság-kent
Pannon-ként
Nagyostobafalva-kent
Words to remove:
font-size
line-height
X-Faktor
Calais-nál
What I need is to remove the words that contain a hyphen where the part after the hyphen is not 'ként' or 'kent'. The file also contains other, unhyphenated words that I have to keep (like "keresztül", "kod", ...).
This could work, but it also eliminates the words that do not contain a hyphen:
grep -vE "\w+-(kent|ként) " file.txt
Perl's look-around assertions might simplify the solution:
perl -Mutf8 -CS -ne 'print unless /-(?!k[eé]nt)/' < file
-Mutf8 turns on UTF-8 in the source (i.e. makes the é work in the regex)
-CS turns UTF-8 on for the input and output
The regex says: a dash not followed by kent or ként.
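A quick check with some of the sample words (assumes a UTF-8 terminal):
printf '%s\n' művész-ként bizottság-kent font-size keresztül | perl -Mutf8 -CS -ne 'print unless /-(?!k[eé]nt)/'
# keeps művész-ként, bizottság-kent and keresztül; font-size is dropped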
Using grep, you can do:
grep -E '^(\w+-k[eé]nt|[^-]*)$' file
This will find hyphenated words ending with kent or ként or words with no hyphen.
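The same kind of quick check for the grep version (GNU grep; \w in an ERE is a GNU extension, and a UTF-8 locale is assumed so \w covers the accented letters):
printf '%s\n' művész-ként bizottság-kent font-size keresztül | grep -E '^(\w+-k[eé]nt|[^-]*)$'
# keeps művész-ként, bizottság-kent and keresztül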

How to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line (i.e. the word has '-\n' inserted into it).
I would like to rejoin all such split words in a text file (in a Linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor on Windows that did regex search/replace with newlines in the search expression, but am unaware of such a tool on Linux.
Make sure to back up ocr_file before running this, as the command modifies the contents of ocr_file in place (the -i~ switch does keep the original in a backup with a ~ suffix):
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
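A small before/after check with a made-up two-line sample:
printf 'exam-\nple text\n' > ocr_file
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
cat ocr_file
# example
#  text   (the line break is moved to just after the rejoined word)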
This answer is relevant because I want the words joined together, not just a removal of the dash character.
cat file | perl -CS -pe 's/-\n//' | fmt -w52
is the short answer, but it uses fmt to re-flow the paragraphs after they were mangled by Perl.
Without fmt, you can do:
#!/usr/bin/perl
use open qw(:std :utf8);    # read and write STDIN/STDOUT as UTF-8
undef $/;                   # slurp mode: read the whole file at once
$_ = <>;
s/-\n(\w+\W+)\s*/$1\n/sg;   # rejoin the split word and move the line break to after it
print;
Also, if you're doing OCR, you can use this Perl one-liner to convert Unicode dashes to ASCII dash characters. Note the -CS option to tell Perl about UTF-8.
# U+2009 and U+2010 - U+2015 (thin space and various Unicode dashes) to an ASCII dash
perl -CS -pe 'tr/\x{2009}\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
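A quick check (assumes a UTF-8 terminal; the sample string contains a literal en dash, U+2013):
printf 'foo–bar\n' | perl -CS -pe 'tr/\x{2009}\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
# -> foo-bar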
cat file | perl -p -e 's/-\n//'
If the file has Windows line endings, you'll need to catch the CR-LF with something like:
cat file | perl -p -e 's/-\s\n//'
Hey this is my first answer post, here goes:
The '-\n', I suspect, is made up of the line-feed characters. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (null, zero). So you use 's' to substitute, a forward slash, then the unwanted string (more on that in a sec), then a forward slash again, then nothing (what it's being substituted with), then a forward slash, and then a flag controlling how many occurrences get replaced. In this case I will use 'g', which stands for global, i.e. every occurrence on every line. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets a little confusing here because the backslash in the string needs to be escaped with another backslash. So the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//' file_with_hyphens
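The -z switch is GNU sed; it splits the input on NUL bytes instead of newlines, so the \n in the pattern can actually match a line break. A quick check:
printf 'exam-\nple text\n' | sed -z 's/-\n//'
# -> example text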

Regexp doesn't work for specific special characters in Perl

I can't get rid of the special characters ¤ and ❤ in a string:
$word = 'cɞi¤r$c❤u¨s';
$word =~ s/[^a-zöäåA-ZÖÄÅ]//g;
printf "$word\n";
On the second line I try to remove any non-alphabetic characters from the string $word. I would expect to get the word circus printed out, but instead I get:
ci�rc�us
The öäå and ÖÄÅ in the expression are just normal characters in the Swedish alphabet that I need included.
If the characters are in your source code, be sure to use utf8. If they are being read from a file, use binmode $FILEHANDLE, ':utf8'.
Be sure to read perldoc perlunicode.
Short answer: add use utf8; to make sure the literal strings in your source code are interpreted as UTF-8; that includes the content of the test string and the content of the regexp.
Long answer:
#!/usr/bin/env perl
use warnings;
use Encode;
my $word = 'cɞi¤r$c❤u¨s';
# Print each "character" of the test string as its ordinal value
foreach my $char (split //, $word) {
    print ord($char) . Encode::encode_utf8(":$char ");
}
my $allowed_chars = 'a-zöäåA-ZÖÄÅ';
print "\n";
# Print each "character" of the allowed set the same way
foreach my $char (split //, $allowed_chars) {
    print ord($char) . Encode::encode_utf8(":$char ");
}
print "\n";
$word =~ s/[^$allowed_chars]//g;
printf Encode::encode_utf8("$word\n");
Executing it without utf8:
$ perl utf8_regexp.pl
99:c 201:É 158: 105:i 194:Â 164:¤ 114:r 36:$ 99:c 226:â 157: 164:¤ 117:u 194:Â 168:¨ 115:s
97:a 45:- 122:z 195:Ã 182:¶ 195:Ã 164:¤ 195:Ã 165:¥ 65:A 45:- 90:Z 195:Ã 150: 195:Ã 132: 195:Ã 133:
ci¤rc¤us
Executing it with utf8:
$ perl -Mutf8 utf8_regexp.pl
99:c 606:ɞ 105:i 164:¤ 114:r 36:$ 99:c 10084:❤ 117:u 168:¨ 115:s
97:a 45:- 122:z 246:ö 228:ä 229:å 65:A 45:- 90:Z 214:Ö 196:Ä 197:Å
circus
Explanation:
The non-ASCII characters that you are typing into your source code are represented by more than one byte, since your input is UTF-8 encoded. In a pure ASCII or Latin-1 terminal the characters would have been one byte each.
When the utf8 pragma is not used, Perl thinks that each and every byte you are inputting is a separate character, as you can see when splitting and printing the individual characters. With the utf8 pragma, it correctly treats a multi-byte sequence as one character, according to the rules of the UTF-8 encoding.
As you can see, by coincidence some of the bytes that the Swedish characters are made up of match some of the bytes in the characters of your test string, and those bytes are kept. Namely: the ä, which in UTF-8 consists of 195:Ã 164:¤. That 164 ends up being one of the byte values you allow, so it passes through.
The solution is to tell perl that your strings are supposed to be considered as utf-8.
The encode_utf8 calls are there to avoid warnings about wide characters being printed to the terminal. As always, you need to decode input and encode output according to the character encoding that the input or output is supposed to use.
Hope this made it clearer.
As pointed out by choroba, adding this at the beginning of the Perl script solves it:
use utf8;
binmode(STDOUT, ":utf8");
where use utf8 lets you use the special characters correctly in the regular expression and binmode(STDOUT, ":utf8") lets you output the special characters correctly in the shell.
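Putting it together, a minimal corrected version of the original snippet (a sketch; it assumes the script file itself is saved as UTF-8 and the terminal expects UTF-8):
#!/usr/bin/env perl
use utf8;                    # string and regexp literals in this file are UTF-8
binmode(STDOUT, ':utf8');    # encode output for a UTF-8 terminal
my $word = 'cɞi¤r$c❤u¨s';
$word =~ s/[^a-zöäåA-ZÖÄÅ]//g;
print "$word\n";             # prints: circus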