Perl regex that captures substring between tick marks

I am trying to find a solution in Perl that captures the filename in the following string, i.e. the text between the tick marks.
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])‘/;
print $results if $results;
I need to end up with wapenc?T=mavodi-7-13b-2b-3-96-1e3431a

The final tick seems to be different in your regex than in the input string - char 8217 (RIGHT SINGLE QUOTATION MARK U+2019) versus 8216 (LEFT SINGLE QUOTATION MARK U+2018). Also, when using Unicode characters in the source, be sure to include
use utf8;
and save the file UTF-8 encoded.
After fixing these two issues, the code worked for me:
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])’/;
print $results if $results;

Your tick characters aren't in the 7-bit ASCII character set, so there is a whole character-encoding rabbit hole to go down here. But the quick and dirty solution is to capture everything between the non-ASCII characters.
($result) = $str =~ /[^\0-\x7f]+(.*?)[^\0-\x7f]/;
[^\0-\x7f] matches any character whose value is not between 0 and 127, i.e., anything outside 7-bit ASCII (the ASCII range itself includes newlines, tabs, and other control characters). This regular expression will work whether your input is UTF-8 encoded or has already been decoded, and may work for other character encodings, too.
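If the delimiters are known to be exactly U+2018 and U+2019, a more explicit variant is to name the two quotation marks by code point, so the encoding of the source file no longer matters. This is only a sketch, assuming the string has already been decoded to characters; the sub name between_quotes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between U+2018 and U+2019, or undef.
sub between_quotes {
    my ($text) = @_;
    my ($inner) = $text =~ /\x{2018}(.*?)\x{2019}/;
    return $inner;
}

# The quotes are written as escapes, so "use utf8" is not needed here.
my $str  = "Saving to: \x{2018}wapenc?T=mavodi-7-13b-2b-3-96-1e3431a\x{2019}";
my $name = between_quotes($str);
print "$name\n" if defined $name;
```

Testing with defined rather than plain truth also keeps a captured filename of "0" from being silently skipped.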

Related

perl substitute string characters using a hash and tr

I need an efficient way to replace all chars from a string with another char based on a hash map
Currently I am using regex s/// and that is working fine. Can I use tr instead, since I just need character-by-character conversion?
This is what I am trying:
my %map = ( a => 9 , b => 4 , c => 8 );
my $str = 'abc';
my $str2 = $str;
$str2 =~ s/(.)/$map{$1}/g; # $str2 =~ tr /(.)/$map{$1}/ Does not work
print " $str => $str2\n";
If you need to replace exactly 1 character by 1 character, tr is ideal for you:
#!/usr/bin/perl
use strict;
use warnings;
my $str = 'abcd';
my $str2 = $str;
$str2 =~ tr /abc/948/;
print " $str => $str2\n";
Note that this does not delete "d", whereas the code from your question would, because "d" has no entry in the hash. Output:
abcd => 948d
No, one cannot do that with tr. That tool is very different from regex.
Its entry in Quote Like Operators in perlop says
Transliterates all occurrences of the characters found (or not found if the /c modifier is specified) in the search list with the positionally corresponding character in the replacement list [...]
and further down it adds
Characters may be literals, or (if the delimiters aren't single quotes) any of the escape sequences accepted in double-quoted strings. But there is never any variable interpolation, so "$" and "#" are always treated as literals. [...]
So one surely can't have a hash evaluated, nor match on a regex pattern in the first place.
The lack of even basic variable interpolation is explained at the very end
... the transliteration table is built at compile time, ...
We are then told of using eval, with an example, if we must use variables with tr.
In this case you'd need to first build variables, one a sequence of characters to replace (keys) and the other a sequence of their replacement characters (values), and then use them in tr via eval like in the docs. Better yet, you'd build a sub with it as in ikegami's comment. Here is a related page.
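A minimal sketch of that eval approach, assuming every key and value in the hash is a single character. The %map hash is the one from the question; the names $search, $replace and $translate are made up here.

```perl
use strict;
use warnings;

my %map = ( a => 9, b => 4, c => 8 );

# Build the search and replacement lists in matching order.
my @from    = sort keys %map;
my $search  = join '', map { quotemeta($_) } @from;
my $replace = join '', map { quotemeta($map{$_}) } @from;

# tr/// does no interpolation, so compile it once with a string eval.
my $translate = eval "sub { (my \$s = shift) =~ tr/$search/$replace/; return \$s }"
    or die "could not build transliterator: $@";

print $translate->('abcd'), "\n";   # prints 948d
```

Because the eval runs once and returns a sub, the transliteration table is built a single time however many strings are processed afterwards.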
But this is the opposite of finding an approach simpler than that basic regex, the question's point.

Strip string but allow umlauts

#!/usr/bin/perl -T
use strict;
use warnings;
use utf8;
my $s = shift || die;
$s =~ s/[^A-Za-z ]//g;
print "$s\n";
exit;
> ./poc.pl "El Guapö"
El Guap
Is there a way to modify this Perl code so that various umlauts and character accents are not stripped out? Thanks!
For the direct question, you may simply need \p{L} (Letter) Unicode Character Property
However, more importantly, decode all input and encode output.
use warnings;
use strict;
use feature 'say';
use utf8; # allow non-ascii (UTF-8) characters in the source
use open ':std', ':encoding(UTF-8)'; # for standard streams
use Encode qw(decode_utf8); # @ARGV escapes the above
my $string = 'El Guapö';
if (@ARGV) {
    $string = join ' ', map { decode_utf8($_) } @ARGV;
}
say "Input: $string";
$string =~ s/[^\p{L} ]//g;
say "Processed: $string";
When run as   script.pl 123 El Guapö=_
Input: 123 El Guapö=_
Processed: El Guapö
I've used the "blanket" \p{L} property (Letter), as specific description is lacking; adjust if/as needed. The Unicode properties provide a lot, see the link above and the complete list at perluniprops.
The space between 123 El remains, perhaps strip leading (and trailing) spaces in the end.
Note that there is also \P{L}, where the capital P indicates negation.
The above simple-minded \pL won't work with Combining Diacritical Marks, as the mark will be removed as well. Thanks to jm666 for pointing this out.
This happens when an accented "logical" character (extended grapheme cluster, what appears as a single character) is written using separate characters for its base and for non-spacing mark(s) (combining accents). Often a single character for it with its codepoint also exists.
Example: in niño the ñ is U+00F1 but it can also be written as "n\x{303}".
To keep accents written this way add \p{Mn} (\p{NonspacingMark}) to the character class
my $string = "El Guapö=_ ni\N{U+00F1}o.* nin\x{303}o+^";
say $string;
(my $nodiac = $string) =~ s/[^\pL ]//g; #/ naive, accent chars get removed
say $nodiac;
(my $full = $string) =~ s/[^\pL\p{Mn} ]//g; # add non-spacing mark
say $full;
Output
El Guapö=_ niño.* niño+^
El Guapö niño nino
El Guapö niño niño
So you want s/[^\p{L}\p{Mn} ]//g in order to keep the combining accents.
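Put together with the earlier remark about leftover spaces, the whole clean-up could be sketched as a small sub. The name keep_letters is hypothetical, and the input is assumed to be already decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: keep letters, non-spacing marks and spaces,
# then trim the spaces left at either end.
sub keep_letters {
    my ($text) = @_;
    $text =~ s/[^\p{L}\p{Mn} ]//g;
    $text =~ s/^ +| +$//g;
    return $text;
}

# ö as a single code point, ñ as base letter plus combining tilde.
my $clean = keep_letters("123 El Guap\x{F6}=_ nin\x{303}o+^");

binmode STDOUT, ':encoding(UTF-8)';
print "$clean\n";
```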

Using string variables containing literal escapes in a Perl substitution

I'm new to Perl and I found behaviour which I don't understand and can't solve.
I'm making a small find and replace program and there are some things I need to do. I have bunch of files that I need to process. Then I have a list of find / replace rules in an external text file. In replacing there I need three special things:
Replacing utf-8 characters (Czech diacritics)
Work with adding/removing lines (so working in a slurp mode)
Use a regular expressions
I want a program that works alone, so I wrote it so that it takes three arguments:
The file to work on
What to find
What to replace.
I'm sending parameters in a loop from a bash script which parse the rules list and loads other files.
My problem is when I have a "\n" string in a rules list and I send it to the Perl script. If it's in the first part of replacement (in the find section) it looks for a newline correctly, but when it's in the second part (the replace section) it just prints \n instead of a newline.
I tried hardcoding "\n" to the string right into the variable instead of passing it from the list and then it works fine.
What's the reason Perl doesn't interpret the "\n" string there, and how can I make it work?
This is my code:
list.txt - One line from the external replacement list
1\. ?\\n?NÁZEV PŘÍPRAVKU;\\n<<K1>> NÁZEV PŘÍPRAVKU;
farkapitoly.sh - The bash script for parsing list.txt and cycling through all of the files and calling the Perl script
...
FILE="/home/tmp.txt"
while read LINE
do
FIND=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $1}'`
REPLACE=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $2}'`
perl -CA ./pathtiny.pl "$FILE" "$FIND" "$REPLACE"
done < list.txt
...
pathtiny.pl - The Perl script for find and replace
#!/usr/bin/perl
use strict;
use warnings;
use Modern::Perl;
use utf8; # Enable typing Unicode in Perl strings
use open qw(:std :utf8); # Enable Unicode to STDIN/OUT/ERR and filehandles
use Path::Tiny;
my $file = path("$ARGV[0]");
my $searchStr = "$ARGV[1]";
my $replaceStr = "$ARGV[2]";
# $replaceStr="\n<<K1>> NÁZEV PRÍPRAVKU"; # if I hardcode it here \n is replaced right away
print("Search String:", "$searchStr", "\n");
print("Replace String:", "$replaceStr", "\n\n");
my $guts = $file->slurp_utf8;
$guts =~ s/$searchStr/$replaceStr/gi;
$file->spew_utf8($guts);
If it's important, I'm using Linux Mint 13 64-bit on VirtualBox (under Win 8.1) and I have Perl v5.14.2. Every file is UTF-8 with Linux endings.
Example files can be found on pastebin. this should end up like this.
But the examples vary a lot. I need a universal way to write a newline in a replacement string so that it is replaced correctly.
The problem is that the replacement string is read literally from the file, so if your file contains
xx\nyy
then you will read exactly those six characters. Also, the replacement part of a substitution is evaluated as if it was in double quotes. So your replacement string is "$replaceStr" which interpolates the variable and goes no further, so you will again have xx\nyy in the new string. (By the way, please avoid using capital letters in local Perl identifiers as in practice they are reserved for globals such as Module::Names.)
The answer lies in using eval, or its equivalent - the /e modifier on the substitution.
If I write
my $str = '<b>';
my $r = 'xx\nyy';
$str =~ s/b/$r/;
then the replacement string is interpolated to xx\nyy, as you have experienced.
A single /e modifier evaluates the replacement as an expression instead of just a double-quoted string, but of course $r as an expression is xx\nyy again.
What you need is a second /e modifier, which does the same evaluation as a single /e and then does an additional eval of the result on top. For this it is cleanest if you use qq{ .. } as you need two levels of quotation.
If you write
$str =~ s/b/qq{"$r"}/ee
then perl will evaluate qq{"$r"} as an expression, giving "xx\nyy", which, when evaluated again will give you the string you need - the same as the expression 'xx' . "\n" . 'yy'.
Here's a full program
use strict;
use warnings;
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/qq{"$r"}/ee;
print $s;
output
<xx
yy>
But don't forget that, if your replacement string contains any double quotes, like this
my $r = 'xx\n"yy"'
then they must be escaped before putting the string through the substitution, as the expression itself also uses double quotes.
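One way to do that escaping is sketched below. It assumes the replacement string contains no already-escaped quotes and no $ or @ characters, which would still be interpolated by the second evaluation; the variable $quoted is made up for the example.

```perl
use strict;
use warnings;

my $s = '<b>';
my $r = 'xx\n"yy"';

# Escape the double quotes so they cannot end the string that /ee evaluates.
(my $quoted = $r) =~ s/"/\\"/g;

$s =~ s/b/qq{"$quoted"}/ee;
print "$s\n";
```

This is still an eval of whatever is in $r, so it should only be used with replacement strings you trust.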
All of this is quite hard to grasp, so you may prefer the String::Escape module which has an unbackslash function that will change a literal \n (and any other escapes) within a string to its equivalent character "\n". It's not a core module so you probably will need to install it.
The advantage is that you no longer need a double evaluation, as the replacement string can be just unbackslash $r, which gives the right result when it is evaluated as an expression. It also handles double quotes in $r without any problem, as the expression doesn't use double quotes itself.
The code using String::Escape goes like this
use strict;
use warnings;
use String::Escape 'unbackslash';
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/unbackslash $r/e;
print $s;
and the output is identical to that of the previous code.
Update
Here is a refactoring of your original program that uses String::Escape. I have removed Path::Tiny as I believe it is best to use Perl's built-in inplace-edit extension, which is documented under the General Variables section of perlvar.
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use 5.010;
use open qw/ :std :utf8 /;
use String::Escape qw/ unbackslash /;
our @ARGV;
my ($file, $search, $replace) = @ARGV;
print "Search String: $search\n";
print "Replace String: $replace\n\n";
@ARGV = ($file);
$^I = '';
while (<>) {
s/$search/unbackslash $replace/eg;
print;
}
Your string contains \n as two characters (a backslash followed by n), not as a single newline.
Perl interprets \n as a newline only when it appears in a string literal (i.e. in your code).
The quick-fix would be:
my $replaceStr=eval qq("$ARGV[2]"); # evaluating the string makes Perl interpret \n as it would in a literal
or, if you don't like eval, you can use the String::Escape CPAN module (the unbackslash function).
You're wanting a literal string to be treated as if it were a double quoted string. To do that you'll have to translate any backslash followed by another character.
The other experts have shown you how to do that over the entire string (which is risky since it uses eval with unvalidated data). Alternatively, you could use a module, String::Escape, which requires an install (not a high bar, but too high for some).
However, the following does a translation of the return value string itself in a safe way, and then it can be used like a normal value in your other search and replace:
use strict;
use warnings;
my $r = 'xx\nyy';
$r =~ s/(\\.)/qq{"$1"}/eeg; # Translate \. as a double quoted string would
print $r;
Outputs:
xx
yy
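If you would rather avoid eval altogether, the same translation can be sketched with a lookup table. The set of escapes below is an assumption chosen for illustration, not everything a double-quoted string accepts, and the names %escape and unescape are made up.

```perl
use strict;
use warnings;

# Escapes this sketch understands; anything else is left untouched.
my %escape = ( n => "\n", t => "\t", r => "\r", '\\' => '\\', '"' => '"' );

# Hypothetical helper: replace each known backslash escape with its character.
sub unescape {
    my ($text) = @_;
    $text =~ s{\\(.)}{ exists $escape{$1} ? $escape{$1} : "\\$1" }ges;
    return $text;
}

print unescape('xx\nyy'), "\n";
```

Nothing from the input is ever executed as code here, at the cost of having to list the escapes you want to support.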

Convert multiple Unicode in a string to character

Problem -- I have a string, say 'Buna$002C_TexasBuna$002C_Texas', where each $ is followed by a four-digit hexadecimal Unicode code point. I want to replace each of these with the character it represents.
In Perl, if a code point is written in the form "\x{002C}", it is converted to its respective Unicode character. Below is the sample code.
#!/usr/bin/perl
my $string = "Hello \x{263A}!\n";
@arr = split //, $string;
print "@arr";
I am processing a file which contains 10 million records, so I have these strings in a scalar variable. To do the same as above, I am substituting $4_digit_unicode with \x{4_digit_unicode} as below.
$str = 'Buna$002C_TexasBuna$002C_Texas';
$str =~s/\$(.{4})/\\x\{$1\}/g;
$str = "$str"
It gives me
Buna\x{002C}_TexasBuna\x{002C}_Texas
It is because at $str = "$str", line $str is being interpolated, but not its value. So \x{002C} is not being interpolated by Perl.
Is there a way to force Perl so that it will also interpolate the contents of $str too?
OR
Is there another method to achieve this? I do not want to take out each of the Unicodes then pack it using pack "U4",0x002C and then substitute it back. But something in one line (like the below unsuccessful attempt) is OK.
$str =~ s/\$(.{4})/pack("U4",$1)/g;
I know the above is wrong; but can I do something like above?
For the input string $str = 'Buna$002C_TexasBuna$002C_Texas', the desired output is Buna,_TexasBuna,_Texas.
This gives the desired result:
use strict;
use warnings;
use feature 'say';
my $str = 'Buna$002C_TexasBuna$002C_Texas';
$str =~s/\$(.{4})/chr(hex($1))/eg;
say $str;
The main interesting item is the e in s///eg. The e means to treat the replacement text as code to be executed. The hex() converts a string of hexadecimal characters to a number. The chr() converts a number to a character. The replace line might be better written as below to avoid trying to convert a dollar followed by non-hexadecimal characters.
$str =~s/\$([0-9a-f]{4})/chr(hex($1))/egi;
You can execute statements such as pack in the replacement string, you just have to use the e regular expression modifier.
Or you can do this
$str =~ s/\$(.{4})/@{[ pack("U", hex($1)) ]}/g;
If those two options don't work please let me know, take a look at this Stackoverflow question for more information.
"\x{263A}" (quotes included) is a string literal, a piece of code that produces a string containing the lone character 263A when it's evaluated by the interpreter (by being part of the script passed to perl to be evaluated).
"\\x\{$1\}" (quotes included), on the other hand, produces a string consisting of \, x, {, the contents of $1, and }.
The latter is the string you are producing. You appear to be attempting to produce Perl code, but it's not valid Perl code -- it's missing the quotes -- and you never have the code interpreted by perl.
$str =~ s/\$(.{4})/\\x\{$1\}/g;
is short for
$str =~ s/\$(.{4})/ "\\x\{$1\}" /eg;
which is completely different than
$str =~ s/\$(.{4})/ "\x{263A}" /eg;
It looks like you were going for the following:
$str =~ s/\$(.{4})/ eval qq{"\\x\{$1\}"} /eg;
But there are much simpler ways of producing the desired string, such as
$str =~ s/\$(.{4})/ pack "U", hex $1 /eg;
or better yet,
$str =~ s/\$(.{4})/ chr hex $1 /eg;

How do I omit lines that contain Unicode NULL (U+0000)?

I am reading a file and am wondering how to skip lines that have Unicode NULL, U+0000? I have tried everything below, but none works:
if($line)
chomp($line)
$line =~ s/\s*$//g;
Your list of "everything" does not seem to include the obvious $line =~ m/\000/.
Because you asked about Unicode NULL (identical to ASCII NUL when encoded in UTF-8), let’s use the \N{U+...} form, described in the perlunicode documentation.
Unicode characters can also be added to a string by using the \N{U+...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces, after the U. For instance, a smiley face is \N{U+263A}.
You can also match against \N{U+...} in regexes. See below.
#! /usr/bin/env perl
use strict;
use warnings;
my $contents =
"line 1\n" .
"\N{U+0000}\n" .
"foo\N{U+0000}bar\n" .
"baz\N{U+0000}\n" .
"\N{U+0000}quux\n" .
"last\n";
open my $fh, "<", \$contents or die "$0: open: $!";
while (defined(my $line = <$fh>)) {
next if $line =~ /\N{U+0000}/;
print $line;
}
Output:
$ ./filter-nulls
line 1
last
Perl strings can contain arbitrary data, including NUL characters. Your if only checks for true or false (where "" and "0" are the two false strings, everything else being true including a string containing a single NUL "\x00"). Your chomp only removes the line separator, not NULs. A NUL character is not whitespace, so doesn't match \s.
You can explicitly match a NUL character by specifying it in a regex using octal or hex notation ("\000" or "\x00", respectively).
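A short sketch of the hex form used as a line filter; the sub name has_nul and the sample lines are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line holds at least one NUL character.
sub has_nul {
    my ($line) = @_;
    return $line =~ /\x00/ ? 1 : 0;
}

my @lines = ("line 1\n", "foo\x00bar\n", "\000\n", "last\n");
my @kept  = grep { !has_nul($_) } @lines;
print @kept;
```

Only "line 1" and "last" are printed; the two lines containing a NUL are dropped.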