Regexp doesn't work for specific special characters in Perl - regex

I can't get rid of the special character ¤ and ❤ in a string:
$word = 'cɞi¤r$c❤u¨s';
$word =~ s/[^a-zöäåA-ZÖÄÅ]//g;
printf "$word\n";
On the second line I try to remove any non-alphabetic characters from the string $word. I would expect to get the word circus printed out, but instead I get:
ci�rc�us
The öäå and ÖÄÅ in the expression are just normal characters in the Swedish alphabet that I need included.

If the characters are in your source code, be sure to use utf8. If they are being read from a file, use binmode $FILEHANDLE, ':utf8'.
Be sure to read perldoc perlunicode.

Short answer: add use utf8; to make sure the literal strings in your source code are interpreted as UTF-8; that includes the contents of the test string and of the regexp.
Long answer:
#!/usr/bin/env perl
use warnings;
use Encode;
my $word = 'cɞi¤r$c❤u¨s';
foreach my $char (split //, $word) {
print ord($char) . Encode::encode_utf8(":$char ");
}
my $allowed_chars = 'a-zöäåA-ZÖÄÅ';
print "\n";
foreach my $char (split //, $allowed_chars) {
print ord($char) . Encode::encode_utf8(":$char ");
}
print "\n";
$word =~ s/[^$allowed_chars]//g;
printf Encode::encode_utf8("$word\n");
Executing it without utf8:
$ perl utf8_regexp.pl
99:c 201:É 158: 105:i 194:Â 164:¤ 114:r 36:$ 99:c 226:â 157: 164:¤ 117:u 194:Â 168:¨ 115:s
97:a 45:- 122:z 195:Ã 182:¶ 195:Ã 164:¤ 195:Ã 165:¥ 65:A 45:- 90:Z 195:Ã 150: 195:Ã 132: 195:Ã 133:
ci¤rc¤us
Executing it with utf8:
$ perl -Mutf8 utf8_regexp.pl
99:c 606:ɞ 105:i 164:¤ 114:r 36:$ 99:c 10084:❤ 117:u 168:¨ 115:s
97:a 45:- 122:z 246:ö 228:ä 229:å 65:A 45:- 90:Z 214:Ö 196:Ä 197:Å
circus
Explanation:
The non-ASCII characters that you are typing into your source code are represented by more than one byte, because your input is UTF-8 encoded. In a pure ASCII or Latin-1 terminal the characters would have been one byte each.
When the utf8 pragma is not in use, perl treats every byte of your input as a separate character, as you can see when splitting the string and printing each individual character. With use utf8, perl correctly combines several bytes into one character according to the rules of the UTF-8 encoding.
As you can see, by coincidence some of the bytes that make up the Swedish characters match some of the bytes that make up characters in your test string, and those bytes are kept. Namely: the ä, which in UTF-8 consists of the bytes 195:Ã 164:¤. The 164 is also the second byte of ¤ (194:Â 164:¤), so it ends up as one of the bytes you allow and passes through.
The solution is to tell perl that your strings are supposed to be considered as utf-8.
The encode_utf8 calls are there to avoid warnings about wide characters being printed to the terminal. As always, you need to decode input and encode output according to the character encoding that the input or output is supposed to be in.
Hope this made it clearer.

As pointed out by choroba, adding this in the beginning of the perl script solves it:
use utf8;
binmode(STDOUT, ":utf8");
where use utf8 lets you use the special characters correctly in the regular expression and binmode(STDOUT, ":utf8") lets you output the special characters correctly on the shell.

Related

How to replace all unicode characters except for Spanish ones?

I am trying to remove all Unicode characters from a file except for the Spanish characters.
Matching the different vowels has not been any issue and áéíóúÁÉÍÓÚ are not replaced using the following regex (but all other Unicode appears to be replaced):
perl -pe 's/[^áéíóúÁÉÍÓÚ[:ascii:]]//g;' filename
But when I add the inverted question mark ¿ or exclamation mark ¡ to the regex other Unicode characters are also being matched and excluded that I would like to be removed:
perl -pe 's/[^áéíóúÁÉÍÓÚ¡¿[:ascii:]]//g;' filename does not replace the following (some are not printable):
³ � �
­
Am I missing something obvious here? I am also open to other ways of doing this on the terminal.
You have a UTF-8 encoded file and work with Unicode characters, so you need to pass a specific set of options to let Perl know that.
You should add -Mutf8 to let Perl recognize the UTF-8-encoded characters used directly in your Perl code.
You also need to pass -CSD (equivalent to -CIOED) in order to have your input decoded and your output re-encoded. This switch is encoding-specific: it works for UTF-8.
perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g;' filename
Do not forget about Ü and ü.

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around Stack Overflow and other websites, but all the answers I found were specific to different scenarios and I wasn't able to adapt them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line.
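A sketch of both passes combined on the sample line. Note the limitations this approach has: http://url.picture.whatever survives because it starts with a letter, and the second pass leaves a leading space behind — which is why this only gets you half way:

```shell
printf '#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag\n' |
  sed -E -e 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' \
         -e 's/^[^a-zA-Z0-9[:space:]][^[:space:]]*//g'
# prints:  I was there and it was awesome! http://url.picture.whatever
```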

bash substitutions don't appear to work when matching newline characters

I have an executable file called test.script containing this simple bash script:
#!/bin/bash
temp=${1//$'\n'/}
output=${temp//$'\r'/}
printf "$output" > output.txt
When I run
sudo ./test.script "^\r\r\n\n\r\n\r\n\r\n\r\n\n\r\n\rHello World\n\n\r\r\r\n\n\r\r\r$"
in the same directory as test.script, I expect to end up with an output.txt looking like this:
^Hello World$
but when I take a look I instead see this:
^
Hello World
$
Clearly I have a misunderstanding about regex in bash.
Please explain to me what I am missing, then show me how to write the bash so that all newline characters are removed from the string before said string is written to a file. Thanks in advance.
You can "fix" your script like this (although I must say this isn't typical usage of printf):
#!/bin/bash
temp=${1//'\n'/}
output=${temp//'\r'/}
printf "$output"
The argument to your script $1 doesn't contain real newlines or carriage returns, which is what $'\n' and $'\r' are for. Instead, it looks like you just want to remove the literal strings '\n' and '\r'.
To elaborate on my point about printf: normally two (or more) arguments are passed, the format specifier and the variables to be inserted. For example, to print a single string you would use something like printf '%s' "$output". In your script, the variable $output is being treated as the format specifier; you're relying on printf to expand your \n and \r into newlines and carriage returns.
You're not actually using regular expressions here by the way; the syntax ${var//match/replace} is a substring replacement, where // means that all occurrences of the substring match in $var are replaced. As you haven't specified anything to replace the substring with, the substring is replaced with nothing (i.e. removed).
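A sketch of the difference between the two patterns, using a short made-up string:

```shell
s='a\nb'$'\n''c'       # the two literal chars backslash and n, then a real newline
echo "${s//'\n'/}"     # removes the literal \n: prints "ab", then "c" on the next line
echo "${s//$'\n'/}"    # removes the real newline: prints a\nbc on one line
```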

Perl, substitute hexa text by the corresponding chr-ed character

I have a file which contains long lines made of text mixed with encoded character.
%255D%252C%2522actualPage%2522%253A1%252C%2522rowPerPage%2522%253A50%257D%255D
Each encoded character is %25xx, where xx is the hex value of the ASCII char (e.g. %2540 = @)
I tried the following but without success:
perl -pe 's/%25([0-9A-F][0-9A-F])/\x$1/' myfile.txt
perl -pe 's/%25([0-9A-F][0-9A-F])/chr($1)/' myfile.txt
Do you have any clue for me ?
TIA, Peyre
Perhaps what you want is URI::Encode. It is a better idea to use a module for this than a regex.
perl -MURI::Encode -nle'$u=URI::Encode->new(); print $u->decode($u->decode($_));'
Output is this for your input string:
],"actualPage":1,"rowPerPage":50}]
As you'll notice, the string had to be decoded twice, because it had been encoded twice (%25 is apparently the percent sign %). The interim output was
%5D%2C%22actualPage%22%3A1%2C%22rowPerPage%22%3A50%7D%5D
perl -MURI::Escape -ne'print uri_unescape(uri_unescape($_))'
perl -pe 's/%25([0-9A-F][0-9A-F])/chr hex $1/ge' myfile.txt
output
],"actualPage":1,"rowPerPage":50}]

Using sed to convert text file to a C string

I would like to use sed to replace newlines, tabs, quotes and backslashes in a text file so that I can use it as a char constant in C, but I'm lost at the start. It would be nice to keep the newlines in the output as well: add a '\n', then a double quote to close the text line, a CRLF, and another double quote to reopen the next line. For example:
line1
line2
would become
"line1\n"
"line2\n"
Can anybody at least point me in the right direction?
Thanks
Try this as a sed command file:
s/\\/\\\\/g
s/"/\\"/g
s/ /\\t/g
s/^/"/
s/$/\\n"/
NB: there's an embedded tab in the third line, if using vi insert by pressing ^v <tab>
s/\\/\\\\/g - escape back slashes
s/"/\\"/g - escape quotes
s/ /\\t/g - convert tabs
s/^/"/ - prepend quote
s/$/\\n"/ - append \n and quote
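A quick sketch of those rules in action (the tab rule is left out here, since a literal tab is awkward to show inline):

```shell
printf 'line1\nsay "hi"\n' |
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' -e 's/^/"/' -e 's/$/\\n"/'
# prints:
# "line1\n"
# "say \"hi\"\n"
```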
Better still:
'"'`printf %s "$your_text" | sed -n -e 'H;$!b' -e 'x;s/\\/\\\\/g;s/"/\\"/g;s/ /\\t/g;s/^\n//;s/\n/\\n" \n "/g;p'`'"'
Unlike the one above, this handles newlines correctly as well. I'd just about call it a one-liner.
In Perl: file input from stdin, output to stdout. It hexifies every character, so there are no worries about escaping anything. It doesn't remove tabs, etc. The output is a static C string.
use strict;
use warnings;
my @c = <STDIN>;
print "static char* c_str = {\n";
foreach (@c) {
my @s = split('');
print qq(\t");
printf("\\x%02x", ord($_)) foreach @s;
print qq("\n);
}
print "};\n";