Custom Character Class in Perl 5.018

I have a parsing environment (Marpa::R2::Scanless) that needs to use single Perl regexp character classes to control tokenizing. I've got something to tokenize that doesn't seem to fit any of the existing character classes. So, after digging around in the perlunicode docs, I've come up with the following code, except it doesn't work as expected. I expect to see a row of dots interspersed with all the non-alphanumerics (except parens). Instead, I get a runtime error about not being able to find the character class.
#!/usr/bin/env perl
use 5.018;
use utf8;
local $| = 1;
for my $i (map { chr($_) } 32 .. 127) {
    if ($i =~ /\p{Magic::Wow}/) {
        print $i;
    }
    else {
        print ".";
    }
}
package Magic;
sub Wow {
    return <<'MAGIC';
+utf8::Assigned
-utf8::Letter
-utf8::Number
-0028
-0029
MAGIC
}
1;
Any hints, tips, tricks, or suggestions?

Name the sub IsWow and the property Magic::IsWow.
Quoting User-Defined Character Properties in perlunicode:
You can define your own binary character properties by defining subroutines whose names begin with "In" or "Is".
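A minimal sketch of the renamed property, assuming the parentheses are excluded through a second user-defined property (the helper name IsParen is hypothetical) rather than through the "-0028" lines from the question. perlunicode documents "-" only for excluding an existing property:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

package Magic;

# Hypothetical helper property: just the two ASCII parentheses,
# given as hexadecimal code points, one per line.
sub IsParen {
    return "0028\n0029\n";
}

# Every assigned code point that is not a letter, a number, or a parenthesis.
sub IsWow {
    return <<'END';
+utf8::Assigned
-utf8::Letter
-utf8::Number
-Magic::IsParen
END
}

package main;

# Build a row with each matching character kept and everything else as ".".
my $row = '';
for my $char (map { chr($_) } 32 .. 126) {
    $row .= $char =~ /\p{Magic::IsWow}/ ? $char : '.';
}
print "$row\n";
```

The property has to be referenced by its fully qualified name, Magic::IsWow, because the match happens outside package Magic.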

Related

Perl switch/case Fails on Literal Regex String Containing Non-Capturing Group '?'

I have text files containing lines like:
2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:
$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'
But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:
syntax error at ./report-dates-amounts line 28, near "}continue "
If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .
My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.
#!/usr/local/bin/perl
use Switch;
# Edited for demo
switch($format)
{
# Format A eg:
# 2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
# 3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
# 3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
# 4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
#
case /^(?:april|snow)$/i
{ # This is where the ? character breaks compilation:
$pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';
# WORKS:
# $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
}
# Format B
case /^(?:umberto|petro)$/i
{
$pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
}
}

Don't use Switch. As mentioned by @choroba in the comments, Switch uses a source filter, which leads to mysterious and hard-to-debug errors, as you observed.
The module's documentation itself says:
In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.
However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:
for ($format) {
    if (/^(?:april|snow)$/i) {
        ...
    }
    elsif (/^(?:umberto|petro)$/i) {
        ...
    }
}
It might look weird at first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of these options and just do:
sub pattern_from_format {
    my $format = shift;
    if ($format =~ /^(?:april|snow)$/i) {
        return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
    }
    elsif ($format =~ /^(?:umberto|petro)$/i) {
        return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
    }
    # Some error handling here maybe
}
If, for some reason, you still want to use Switch: use m/.../ instead of /.../.
I have no idea why this bug is happening, however, the documentation says:
Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.
Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.

Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:
#!/usr/local/bin/perl
my %switch = (
    '^(?:april|snow)$'    => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
    '^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);
for my $re (keys %switch) {
    if ($format =~ /$re/i) {
        $pat = $switch{$re};
        last;
    }
}
For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.
This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.
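The ordering caveat can also be handled without Tie::IxHash by keeping the conditions in an array of pairs. A minimal sketch, assuming a hypothetical helper named pattern_for and reusing the two formats from the question. The coderefs here only return a pattern, but they could run arbitrary code:

```perl
use strict;
use warnings;

# Ordered dispatch table: each entry pairs a compiled regex with a coderef.
# Entries are tried top to bottom, so the first matching condition wins.
my @dispatch = (
    [ qr/^(?:april|snow)$/i,
      sub { return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/ } ],
    [ qr/^(?:umberto|petro)$/i,
      sub { return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/ } ],
);

# Hypothetical helper: returns the line pattern for a format name,
# or undef when no condition matches.
sub pattern_for {
    my ($format) = @_;
    for my $entry (@dispatch) {
        my ($re, $code) = @$entry;
        return $code->() if $format =~ $re;
    }
    return;
}

my $pat = pattern_for('April');
print defined $pat ? "format recognised\n" : "unknown format\n";
```

Because the table is an array, a format that could satisfy two conditions always resolves to the earlier entry.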

How can I use regex with sed (or equivalent unix command line tool) to fix title case in LaTeX headings?

regular expression attempt
(\\section\{|\\subsection\{|\\subsubsection\{|\\paragraph[^{]*\{)(\w)\w*([ |\}]*)
search text
\section{intro to installation of apps}
\subsection{another heading for \myformatting{special}}
\subsubsection{good morning, San Francisco}
\paragraph{installation of backend services}
desired output
All initial characters are capitalized except prepositions, conjunctions, and the usual parts of speech that are made upper case on titles.
I suppose I should really narrow this down, so let me borrow from the U.S. Government Printing Office Style Manual:
The articles a, an, and the; the prepositions at, by, for, in, of, on, to, and up; the conjunctions and, as, but, if, or, and nor; and the second element of a compound numeral are not capitalized.
Page 41
\subsection{Installation guide for the server-side app \myapp{webgen}}
changes to
\subsection{Installation Guide for the Server-side App \myapp{Webgen}}
OR
\subsection{Installation Guide for the Server-side App \myapp{webgen}}
How would you name this type of string modification?
Applying REGEX to a string between strings?
Applying REGEX to a part of a string when that part falls between two other strings of characters?
Applying REGEX to a substring that occurs between two
other substrings within a string?
<something else>
problem
I match each LaTeX heading command, including the {. This means that my expression does not match more than the first word in the actual heading text. I cannot surround the whole heading code with an "OR space" because then I will find nearly every word in the document. Also, I have to be careful of formatting commands within the headings themselves.
other helpful related questions
Uppercasing First Letter of Words Using SED
https://superuser.com/questions/749164/how-to-use-regex-to-capitalise-the-first-letter-of-each-word-in-a-sentence
Using Sed to capitalize the first letter of each word
Capitalize first letter of each word in a selection using vim

So it seems to me as if you need to implement pseudo-code like this:
Are we on the first word? If yes, capitalize it and move on.
Is the current word "reserved"? If yes, lower it and move on.
Is the current word a numeral? If yes, lower it and move on.
Are we still in the list? If yes, print the line verbatim and move on.
One other helpful rule might be to leave fully upper-case words as they are, just in case they're acronyms.
The following awk script might do what you need.
#!/usr/bin/awk -f
function toformal(subject) {
return toupper(substr(subject,1,1)) tolower(substr(subject,2))
}
BEGIN {
# Reserved word list gets split into an array for easy matching.
reserved="a an the at by for in of on to up and as but if or nor";
split(reserved,a_reserved," "); for(i in a_reserved) r[a_reserved[i]]=1;
# Same with the list of compound numerals. If this isn't what you mean, say so.
numerals="hundred thousand million billion";
split(numerals,a_numerals," "); for(i in a_numerals) n[a_numerals[i]]=1;
}
# This awk condition matches the lines we're interested in modifying.
/^\\(section|subsection|subsubsection|paragraph)[{]/ {
# Separate the particular section and the text, then split text to an array.
section=$0; sub(/\\/,"",section); sub(/[{].*/,"",section);
text=$0; sub(/^[^{]*[{]/,"",text); sub(/[}].*/,"",text);
size=split(text,atext,/[[:space:]]/);
# First word...
newtext=toformal(atext[1]);
for(i=2; i<=size; i++) {
# Reserved word...
if (r[tolower(atext[i])]) { newtext=newtext " " atext[i]; continue; }
# Compound numerals...
if (n[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
# # Acronyms maybe...
# if (atext[i] == toupper(atext[i])) { newtext=newtext " " atext[i]; continue; }
# Everything else...
newtext=newtext " " toformal(atext[i]);
}
printf("\\%s{%s}\n", section, newtext);
next;
}
# Print the line if we get this far. This is a non-condition with
# a print-only statement.
1

Here is an example of how you could do it in Perl using the module Lingua::EN::Titlecase and recursive regular expressions :
use strict;
use warnings;
use Lingua::EN::Titlecase;
my $tc = Lingua::EN::Titlecase->new();
my $data = do {local $/; <> };
my ($kw_regex) = map { qr/$_/ }
join '|', qw(section subsection subsubsection paragraph);
$data =~ s/(\\(?: $kw_regex))(\{(?:[^{}]++|(?2))*\})/title_case($tc,$1,$2)/gex;
print $data;
sub title_case {
    my ($tc, $p1, $p2) = @_;
    $p2 =~ s/^\{//;
    $p2 =~ s/\}$//;
    if ($p2 =~ /\\/ ) {
        while ($p2 =~ /\G(.*?)(\\.*?)(\{(?:[^{}]++|(?3))*\})/ ) {
            my $next_pos = $+[0];
            substr($p2, $-[1], $+[1] -$-[1], $tc->title($1));
            substr($p2, $-[3], $+[3] -$-[3], title_case($tc,'',$3));
            pos($p2) = $next_pos;
        }
        $p2 =~ s/\G(.+)$/$tc->title($1)/e;
    }
    else {
        $p2 = $tc->title($p2);
    }
    return $p1 . '{' . $p2 . '}';
}

evaluate pattern stored in variable perl regexp

I am trying to find out if basket has apple [simplified version of a big problem]
$check_fruit = "\$fruit =~ \/has\/apple\/";
$fruit="basket/has/mango/";
if ($check_fruit) {
print "apple found\n";
}
The $check_fruit variable holds the statement that evaluates the regexp.
However, $check_fruit is always true, so the script always prints "apple found" :(
Can somebody help me here if I am missing something?
Goal to accomplish:
Okay so let me explain:
1. I have a file with a pattern clause defined on each line, similar to:
Line1: $fruit_origin=~/europe\\/finland/ && $fruit_taste=~/sweet/
Line2: similar stuff that can contain ~10 pattern checks separated by && or || with metacharacters too
2. I have a list of fruit attributes from a Perl hash containing many such fruits.
3. I want to categorize each fruit to see how many fruits fall into the category defined by each line of the file separately.
Sort of a fruit count/profile per line. Is there an easier way to accomplish this? Thanks a lot

if ($check_fruit) returns true because $check_fruit is defined, not empty and not zero. If you want to evaluate its content, use eval. But a subroutine would serve better:
sub check_fruit {
    my $fruit = shift;
    return $fruit =~ m(has/apple);
}
if (check_fruit($fruit)) {
    print "Apple found\n";
}

Why is there a need to store the statement in a variable? If you're sure the value isn't set by a user, then you can do
if (eval $check_fruit) {
but this isn't safe if the user can set anything in that expression.

Put the pattern (and only the pattern) into the variable, use the variable inside the regular expression matching delimiters m/.../. If you don't know the pattern in advance then use quotemeta for escaping any meta characters.
It should look like this:
my $check_fruit = '/has/apple/'; # here no quotemeta is needed
my $fruit = 'basket/has/mango/';
if ($fruit =~ m/$check_fruit/) {
# do stuff!
}
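For the case where the pattern is not known in advance, here is a small sketch of what quotemeta (or its inline form \Q...\E) changes. The strings are made-up sample data:

```perl
use strict;
use warnings;

# Hypothetical input: a literal search string containing regex metacharacters.
my $needle = 'price (USD) $2.00';
my $text   = 'total price (USD) $2.00 today';

# \Q...\E applies quotemeta to the interpolated value, so "(", ")", "$"
# and "." match themselves instead of acting as metacharacters.
my $found_quoted = $text =~ /\Q$needle\E/ ? 1 : 0;

# Without quoting, the parentheses form a group and "$" is an anchor,
# so the string does not even match itself.
my $found_raw = $text =~ /$needle/ ? 1 : 0;

print "quoted: $found_quoted, raw: $found_raw\n";    # prints "quoted: 1, raw: 0"
```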

$check_fruit is nothing but a variable holding string data. If you want to execute the code it contains, you have to use eval.
There were also some other errors in your code related to string quoting/escaping. This fixes that as well:
use strict;
use warnings;
my $check_fruit = '$apple =~ m|/has/mango|';
my $apple="basket/has/mango/";
if (eval $check_fruit) {
print "apple found\n";
}
However, this is not usually a good design. At the very least, it makes for confusing code. It is also a huge security hole if $check_fruit is coming from the user. You can put a regex into a variable, which is preferable:
Edit: note that a regex that comes from user input can be a security problem as well, but it is more limited in scope.
my $check_fruit = qr|/has/mango|;
my $apple="basket/has/mango/";
if ($apple =~ /$check_fruit/) {
print "apple found\n";
}
There are other things you can do to make your Perl code more dynamic, as well. The best approach would depend on what you are trying to accomplish.
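One such approach for the per-line fruit counting described in the question is to store each clause as data instead of code. This is only a sketch under assumptions: the rule layout, the attribute names, and the helpers matches_rule and count_per_rule are all hypothetical, and only "&&" clauses are modelled:

```perl
use strict;
use warnings;

# Hypothetical data: each rule is a list of [attribute, pattern] pairs
# that must ALL match (an "&&" clause); "||" would need a richer structure.
my @rules = (
    [ [ origin => qr{europe/finland} ], [ taste => qr/sweet/ ] ],
    [ [ taste  => qr/sour/ ] ],
);

my @fruits = (
    { origin => 'europe/finland', taste => 'sweet' },
    { origin => 'europe/spain',   taste => 'sour'  },
    { origin => 'europe/finland', taste => 'sour'  },
);

# True when every condition of the rule matches the fruit's attributes.
sub matches_rule {
    my ($fruit, $rule) = @_;
    for my $cond (@$rule) {
        my ($attr, $re) = @$cond;
        return 0 unless defined $fruit->{$attr} && $fruit->{$attr} =~ $re;
    }
    return 1;
}

# Count the fruits that fall into each rule's category.
sub count_per_rule {
    my ($rules, $fruits) = @_;
    return map {
        my $rule = $_;
        scalar grep { matches_rule($_, $rule) } @$fruits;
    } @$rules;
}

my @counts = count_per_rule(\@rules, \@fruits);
print "rule $_: $counts[$_] fruit(s)\n" for 0 .. $#counts;
```

Nothing from the rules file is executed as code here, so this avoids the security concern raised above for eval.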

What's the best way to clear regex matching variables?

What's the best way to clear/reset all regex matching variables?
Example how $1 isn't reset between regex operations and uses the most recent match:
$_="this is the man that made the new year rumble";
/ (is) /;
/ (isnt) /;
say $1; # outputs "is"
Example how this may be problematic when working with loops:
foreach (...){
/($some_value)/;
&doSomething($1) if $1;
}
Update: I didn't think I'd need to do this, but Example-2 is only an example. This question is about resetting matching variables, not the best way to implement them.
Regardless, originally my coding style was more in line with being explicit and using if-blocks. Coming back to this (Example 2) now, it is much more concise, and when reading many lines of code I find this syntax faster to comprehend.

You should use the return from the match, not the state of the group vars.
foreach (...) {
doSomething($1) if /($some_value)/;
}
$1, etc. are only guaranteed to reflect the most recent match if the match succeeds. You shouldn't be looking at them other than right after a successful match.

Regex captures* are reset by a successful match. To reset regex captures, one would use a trivial match operation that's guaranteed to match.
"a" =~ /a/; # Reset captures to undef.
Yeah, it looks weird, but you asked to do something weird.
If you fix your code, you don't need weird-looking workarounds. Fixing your code even reveals a bug!
Fixes:
$_ = "this is the man that made the new year rumble";
if (/ (is) / || / (isnt) /) {
say $1;
} else{
... # You're currently printing something random.
}
and
for (...) {
if (/($some_pattern)/) {
do_something($1);
}
}
* — Backrefs are regex patterns that match previously captured text. e.g. \1, \k<foo>. You're actually talking about "regex capture buffers".

You should test whether the match succeeded. For example:
foreach (...){
/($some_value)/ or next;
doSomething($1) if $1;
}
foreach (...){
doSomething($1) if /($some_value)/ and $1;
}
foreach (...){
if (/($some_value)/) {
doSomething($1) if $1;
}
}
Depending on what $some_value is, and how you want to handle matching the empty string and/or 0, you may or may not need to test $1 at all.
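To illustrate the point about 0, here is a small sketch (the sample strings are made up) that tests the match itself by assigning the capture in list context, so a captured "0" is not skipped the way `if $1` would skip it:

```perl
use strict;
use warnings;

# Collect the captured ids; "0" is a valid capture even though it is false.
my @ids;
for my $s ('id=0', 'id=42', 'no id here') {
    # In list context a failed match returns an empty list, so the
    # assignment counts 0 elements and the block is skipped.
    if (my ($id) = $s =~ /id=(\d+)/) {
        push @ids, $id;
    }
}
print "@ids\n";    # prints "0 42"
```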

To complement the existing, helpful answers (and the sensible recommendation to normally test the result of a matching operation in a Boolean context and take action only if the test succeeds notwithstanding):
Depending on your scenario, you can approach the problem differently:
Disclaimer: I'm not an experienced Perl programmer; do let me know if there are problems with this approach.
Enclosing the matching operation in a do { ... } block scopes all regex-related special variables ($&, $1, ...) to that block.
Thus, you can use a do { ... } to prevent these special variables from getting set in the first place (although the ones from a previous regex operation outside the block will obviously remain in effect); for instance:
$_="this is the man that made the new year rumble";
# Match in current scope; -> $&, $1, ... *are* set.
/ (is) /;
# Match inside a `do` block; the *new* $&, $1, ... values
# are set only *inside* the block;
# `&& $1` passes out the block's version of `$1`.
$do1 = do { / (made) / && $1 };
print "\$1 == '$1'; \$do1 == '$do1'\n"; # -> $1 == 'is'; $do1 == 'made'
The advantage of this approach is that none of the current scope's special regex variables are set or altered; the accepted answer, by contrast, alters variables such as $&, and $'.
The disadvantage is that you must explicitly pass out variables of interest; you do get the result of the matching operation by default, however, and if you're only interested in the contents of capture buffers, that will suffice.

You should do it this way:
foreach (...) {
someFnc($1) if /.../;
}
But if you want to stick with your style, then check this as an idea:
$_ = "this is the man that made the new year rumble";
$m = /(is)/ ? $1 : undef;
$m = /(isnt)/ ? $1 : undef;
print $m, "\n" if defined $m;

Assigning captures to a list behaves closer to what it sounds like you want.
for ("match", "fail") {
my ($fake_1) = /(m.+)/;
doSomething($fake_1) if $fake_1;
}

Using perl to split a line that may contain whitespace

Okay, so I'm using perl to read in a file that contains some general configuration data. This data is organized into headers based on what they mean. An example follows:
[vars]
# This is how we define a variable!
$var = 10;
$str = "Hello thar!";
# This section contains flags which can be used to modify module behavior
# All modules read this file and if they understand any of the flags, use them
[flags]
Verbose = true; # Notice the errant whitespace!
[path]
WinPath = default; # Keyword which loads the standard PATH as defined by the operating system. Append with additonal values.
LinuxPath = default;
Goal: Using the first line as an example "$var = 10;", I'd like to use the split function in perl to create an array that contains the characters "$var" and "10" as elements. Using another line as an example:
Verbose = true;
# Should become [Verbose, true] aka no whitespace is present
This is needed because I will be outputting these values to a new file (which a different piece of C++ code will read) to instantiate dictionary objects. Just to give you a little taste of what it might look like (just making it up as I go along):
define new dictionary
name: [flags]
# Start defining keys => values
new key name: Verbose
new value val: 10
# End dictionary
Oh, and here is the code I currently have along with what it is doing (incorrectly):
sub makeref($)
{
    my @line = (split (/=/)); # Produces ["Verbose", " true"];
}
To answer one question, why I am not using Config::Simple, is that I originally did not know what my configuration file would look like, only what I wanted it to do. Making it up as I went along - at least what seemed sensible to me - and using perl to parse the file.
The problem is I have some C++ code that will load the information in the config file, but since parsing in C or C++ is :( I decided to use perl. It's also a good learning exercise for me since I am new to the language. So that's the thing, this perl code is not really a part of my application, it just makes it easier for the C++ code to read the information. And, it is more readable (both the config file, and the generated file). Thanks for the feedback, it really helped.

If you're doing this parsing as a learning exercise, that's fine. However, CPAN has several modules that will do a lot of the work for you.
use Config::Simple;
Config::Simple->import_from( 'some_config_file.txt', \my %conf );

split splits on a regular expression, so you can simply put the whitespace around the = sign into its regex:
split (/\s*=\s*/, $line);
You obviously do not want to remove all whitespace, or such a line would be produced (whitespace missing in the string):
$str="Hellothere!";
I guess that only removing whitespace from the beginning and end of the line is sufficient:
$line =~ s/^\s*(.*?)\s*$/$1/;
A simpler alternative with two statements:
$line =~ s/^\s+//;
$line =~ s/\s+$//;
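Putting these pieces together for the sample file, a minimal sketch might look like the following. The helper name parse_line is hypothetical, and it assumes "#" never appears inside a value, which the question does not state:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (key, value) for an assignment line,
# or an empty list for blank lines, comments, and [section] headers.
# Assumption: "#" never appears inside a value.
sub parse_line {
    my ($line) = @_;
    $line =~ s/#.*//;     # drop a trailing comment
    $line =~ s/^\s+//;    # trim leading whitespace
    $line =~ s/\s+$//;    # trim trailing whitespace
    $line =~ s/;$//;      # drop the terminating semicolon
    return if $line eq '' || $line =~ /^\[/;
    return split /\s*=\s*/, $line, 2;
}

my ($key, $value) = parse_line('   Verbose    =    true; # Notice the errant whitespace!');
print "[$key, $value]\n";    # prints "[Verbose, true]"
```

The limit of 2 passed to split keeps any later "=" inside the value intact.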

Seems like you've got it. Strip the whitespaces before splitting.
sub makeref($)
{
    s/\s+//g;
    my @line = (split(/=/)); # gets ["verbose", "true"]
}

This code does the trick (and is more efficient without reversing).
for (@line) {
    s/^\s+//;
    s/\s+$//;
}

You probably have it all figured out, but I thought I'd add a little. If you
sub makeref($)
{
    my @line = (split(/=/));
    foreach (@line)
    {
        s/^\s+//g;
        s/\s+$//g;
    }
}
then you will remove the whitespace before and after both the left and right side. That way something like:
this is a parameter = all sorts of stuff here
will not have crazy spaces.
!!Warning: I probably don't know what I'm talking about!!
