Match string with escape characters or backslashes - regex

The following perl script and TestData simulate the situation where I can only find 2 instead of 4 expected. (to match all support.tier.1 with backslash in between).
How can I modify this perl regex here? thanks
my #TestData(
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\#example.com"),
"support.tier.2\#example.com",
"support\.tier\.1\#example\.com",
"support\.tier\.2\#example\.com",
"sales\#example\.com"
);
Here is the code to be changed:
my $count = 0;
foreach my $tier(#TestData){
if($tier =~ m/support.tier.1/){
print "$count: $tier\n";
}
$count++;
}
I only get 2 matches while the expected is 4:
0: support.tier.1
6: support.tier.1#example.com

Update
Since it seems that you may indeed be getting strings containing backslashes, I suggest that you use String::Unescape to remove those backslashes before testing your strings. You will probably have to install it as it isn't a core module
Your code would look like this
use strict;
use warnings;
use String::Unescape;
my #tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\#example.com"),
"support.tier.2\#example.com",
"support\.tier\.1\#example\.com",
"support\.tier\.2\#example\.com",
"sales\#example\.com",
);
my $count = 0;
for my $tier ( #tiers ) {
my $plain = String::Unescape->unescape($tier);
if ( $plain =~ /support\.tier\.1/ ) {
printf "%d: %s\n", ++$count, $tier;
}
}
output
1: support.tier.1
2: "support\.tier\.1"
3: support\.tier\.1\#example\.com
4: support.tier.1#example.com
Note that there is a bug in the String::Unescape module that prevents it from exporting the unescape function. It just means you have to use String::Unescape::unescape or String::Unescape->unescape all the time. Or you could import it manually with *unescape = \&String::Unescape::unescape
The #tiers array contains these exact strings
support.tier.1
support.tier.2
"support\.tier\.1"
support.tier.2
support\.tier\.1\#example\.com
support.tier.2#example.com
support.tier.1#example.com
support.tier.2#example.com
sales#example.com
Can you see that only items 1 and 7 contain the string support.tier.1? The other two that I imagine you expected to match are 3 and 5, which contain spurious backslashes
It's not clear, but it seems unlikely that you will be getting data in this format. If you really want to match support.tier.1 where either dot may be preceded by a backslash character then you need /support\\?\.tier\\?\.1/, but I think you are misunderstanding the way Perl strings work

I may not fully understand, but if I do I agree with the answer that Matt has already attempted to give you. Regex definitely can handle your request if you are saying that the escape character may or may not be before each period in support.tier.1.
A single backslash is \\ and ? means essentially "one or zero:"
use strict;
use warnings;
my #tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\#example.com"),
"support.tier.2\#example.com",
"support\.tier\.1\#example\.com",
"support\.tier\.2\#example\.com",
"sales\#example\.com",
);
my $count = 0;
foreach my $tier (#tiers) {
if ($tier =~ /support\\?.tier\\?.1/) {
print "$count: $tier\n";
}
$count++;
}
On an unrelated note, for the purpose of creating an easy-to-follow example, I included a suggestion on how you might better format your sample data instead of using the $str and pushes.
If this works, I'd recommend you ask Matt to post his comment responses as an answer and accept it.

Related

Perl switch/case Fails on Literal Regex String Containing Non-Capturing Group '?'

I have text files containing lines like:
2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:
$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'
But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:
syntax error at ./report-dates-amounts line 28, near "}continue "
If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .
My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.
#!/usr/local/bin/perl
use Switch;
# Edited for demo
switch($format)
{
# Format A eg:
# 2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
# 3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
# 3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
# 4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
#
case /^(?:april|snow)$/i
{ # This is where the ? character breaks compilation:
$pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';
# WORKS:
# $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
}
# Format B
case /^(?:umberto|petro)$/i
{
$pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
}
}
Don't use Switch. As mentionned by #choroba in the comments, Switch uses a source filter, which leads to mysterious and hard to debug errors, as you constated.
The module's documentation itself says:
In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.
However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:
for ($format) {
if (/^(?:april|snow)$/i) {
...
}
elsif (/^(?:umberto|petro)$/i) {
...
}
}
It might look weird a first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of this options and just do:
sub pattern_from_format {
my $format = shift;
if ($format =~ /^(?:april|snow)$/i) {
return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
}
elsif ($format =~ /^(?:umberto|petro)$/i) {
return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
}
# Some error handling here maybe
}
If, for some reason, you still want to use Switch: use m/.../ instead of /.../.
I have no idea why this bug is happening, however, the documentation says:
Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.
Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.
Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:
#!/usr/local/bin/perl
my %switch = (
'^(?:april|snow)$' => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
'^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);
for my $re (keys %switch) {
if ($format =~ /$re/i) {
$pat = $switch{$re};
last;
}
}
For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.
This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.

Why isn't my regex matching my input data?

I have a column of values (strings) that look like this:
arg123ala
arg345ala_r
thr567por thr789pro
pro1ala,thr2leu
I am trying to identify those values where the following pattern is met only once and no extra text is present:
three letters-some numbers-three letters
In the previous example, this would match the first value, but not the other three, because they have extra bits of text or there are two instances of the pattern separated by blank spaces or commas.
I tried using something like this in Perl:
if ( $value =~ /^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$) {
$qualifier = "ok";
}
else {
$qualifier = "needs cleaning";
}
And actually checked the regular expression in regexplanet.com, where it worked beautifully. However, when I used it in my code it wasn't matching any of the values I listed above, missing even the first one. Any idea why this could be happening? Any advice on an alternative for this?
It works fine. Here it is fixed (you didn't terminate your regex) and incorporated into a working program
use strict;
use warnings;
use v5.10;
while ( my $value = <DATA> ) {
my $qualifier;
if ( $value =~ /^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$/ ) {
$qualifier = "ok";
}
else {
$qualifier = "needs cleaning";
}
say $qualifier;
}
__DATA__
arg123ala
arg345ala_r
thr567por thr789pro
pro1ala,thr2leu
output
ok
needs cleaning
needs cleaning
needs cleaning
Looks like topic starter forgot final / in regexp.
I would use expression like this: /^[a-z]{3}\d+[a-z]{3}$/

Use regular expressions to find host name

so I have little problem, because I need to print host name which is bettwen "(?# )", for example:
Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu
And I need to print "researchscan425.eecs.umich.edu".
I tried something like:
if(my ($test) = $linelist =~ /\b\(\?\#(\S*)/)
{
print "$test\n";
}
But it doesn't print me anything.
You can use this regex:
\(\?#(.*?)\)
researchscan425.eecs.umich.edu will be captured into Group 1.
See demo
Sample code:
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?#(.*?)\)/)
{
print "$test\n";
}
How about:
if(my ($test) = $linelist =~ /\(\?\#([^\s)]+)/)
You need to remove the \b which exists before (. Because there isn't a word boundary exists before ( (non-word character) and after space (non-word charcater).
my $linelist = 'Apr 17 23:39:02 test pure-ftpd: (?#researchscan425.eecs.umich.edu) [INFO] New connection from researchscan425.eecs.umich.edu';
if(my ($test) = $linelist =~ /\(\?\#([^)]*)/)
{
print "$test\n";
}
The problem here is the definition of \b.
It's "word boundary" - on regex101 that means:
(^\w|\w$|\W\w|\w\W)
Now, why this is causing you problems - ( is not a word character. So the transition from space to bracket doesn't trigger this pattern.
Switch your pattern to:
\s\(\?\#(\S+)
And it'll work. (Note - I've changed * to + because you probably want one or more, not zero or more).
It's amazing what you can do with logging tools or with perl as part of the logging service itself (c.f. Ubic), but even if you're just writing a "quick script" to parse logs for reporting (i.e. something you or someone else won't look at again for months or years) it helps to make them easy to maintain.
One approach to doing this is to process the lines of your log file lines with Regexp::Common. One advantage is that RX::Common matches practically "self document" what you are doing. For example, to match on specific "RFC compliant" definitions of what constitutes a "domain" using the $linelist you posted:
use Regexp::Common qw /net/;
if ( $line =~ /\?\#$RE{net}{domain}{-keep}/ ) { say $1 }
Then, later, if you need you can add other matches e.g "numeric" IPv4 or IPv6 addresses, assign them for use later in the script, etc. (Perl6::Form and IO::All used for demonstration purposes only - try them out!):
use IO::All ;
use Regexp::Common qw/net/;
use Perl6::Form;
my $purelog = io 'logfile.lines.txt' ;
sub _get_ftphost_names {
my #hosts = () ;
while ($_ = $purelog->getline) {
/\(\?\#$RE{net}{IPv6}{-sep => ":" }{-keep}/ ||
/\(\?\#$RE{net}{IPv4}{-keep}/ ||
/\(\?\#$RE{net}{domain}{-keep}/ and push #hosts , $1 ;
}
return \#hosts ;
}
sub _get_bytes_transfered {
... ;
}
my #host_list = _get_ftphost_names ;
print form
"{[[[[[[[[[[(30+)[[[[[[[[[[[[[}", #host_list ;
One of the great things about Regexp::Common (besides stealing regexp ideas from the source) is that it also makes it fairly easy to roll your own matches, You can use those to capture other parts of the file in an easily understandable way adding them piece by piece. Then, as what was supposed to be your four line script grows and transforms itself into a ITIL compliant corporate reporting tool, you and your career can advance apace :-)

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, i see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern , based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of more than 4 characters
$suffix =~ s/(a|the|an|of)//ig; # strip a few common cases
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
The first two "l" parts can be simplified by [the first "l" part here]{2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
$Sq = $_;
$Sq =~ s/[.\s]//g; # squish away . and " " in the temporary save version
$Sq = lc($Sq);
/^llc$/ and $_ = 'L.L.C.'; # try to match, if so save the canonical version
/^ibm/ and $_ = 'IBM'; # a different match
print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my #strings = (
"l.l.c",
"llc",
"LLC",
"lLc",
"l,l,c",
"L . L C ",
"l W c"
);
my #seps = ('.',',','\s');
my $sep_regex = '[' . join('', #seps) . ']*';
my $regex_def = join '', (
'[lL]',
$sep_regex,
'[lL]',
$sep_regex,
'[cC]'
);
print "definition: $regex_def\n";
foreach my $str (#strings) {
if ( $str =~ /$regex_def/ ) {
print "$str matches\n";
} else {
print "$str doesn't match\n";
}
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', #seps) . ']*';
foreach my $str (#strings) {
my $copy = $str;
$copy =~ s/$sep_regex//g;
$copy = lc $copy;
print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.

what do I use to match MS Word chars in regEx

I need to find and delete all the non standard ascii chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that unicode? I know how to do it ham-handed [a-z etc. etc.] but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.
Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach, the character class
[\x80-\x9F]
this works because the problem with "Word chars" for me is the ones which are illegal in Unicode, and I've got no way of sanitising user input.
Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported ewquivalents. Google for "demoronizer" or "AsciiDammit".
I usually use a JEdit macro that replaces the most common of them with a more ascii-friendly version, i.e.:
hyphens and dashes to minus sign;
suspsension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
It is easily adaptable to Word/Openoffice/whatever, and of course modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers
What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
for my $character (grep { ord($_) > 127 } split //) {
$seen{$character}++;
}
}
print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl
use strict;
use warnings;
my %map = (
chr(128) => "foo",
#etc.
);
while (<>) {
s/([\x{80}-\x{FF}])/$map{$1}/;
print;
}
What I would do is, use AutoHotKey, or python SendKeys or some sort of visual basic that would send me all possible keys (also with shift applied and unapplied) to a Word document.
In SendKeys it would be a script of the form
chars = ''.join([chr(i) for i in range(ord('a'),ord('z'))])
nums = ''.join([chr(i) for i in range(ord('0'),ord('9'))])
specials = ['-','=','\','/',','.',',','`']
all = chars+nums+specials
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text, and use it as a database for all displable keys in your keyboard layout (you might want to replace the default input language more than once to receive absolutely all displayable characters).
If the char is in the result text document - it is displayable, otherwise not. No need for regexp. You can of course afterward embed the characters range within a script or a program.