How do I implement IPv6 zero compression in Powershell? - regex

For example: here is an IP address : fc06:0000:0002:0760:e000:7201:7003:06b7.
But i'd like to replace ":0" or :and more zeros to ":". So it should look like this: fc06::2:760:e000:7201:7003:6b7
The code:
if (Test-Path $file){
foreach ($line in Get-Content $file){
$line=$line.Replace(":0+",":")
Write-Output $line
}
}
I know, that the problem is with :0+. So: how can I say, that one or more from the last character (now: zero)? Because simply :0+ doesn't works.
Thanks!

Use the built-in [System.Net.IPAddress] class:
$ip = [System.Net.IPAddress]"fc06:0000:0002:0760:e000:7201:7003:06b7"
$ip.IPAddressToString
or
$ip = [System.Net.IPAddress]::Parse("fc06:0000:0002:0760:e000:7201:7003:06b7")
$ip.IPAddressToString
I recommend this over using string replacement because it will deal with the issue of multiple groups of zeroes in different parts of the address, preventing you from creating an invalid IPv6 address. It will not collapse the first consecutive zeros into :: though.
Edit
Actually, this method will use :: if there are more than 1 group of consecutive zeroes, for example:
$ip = [System.Net.IPAddress]::Parse("fc06:0000:0000:0760:e000:7201:0000:06b7")
$ip.IPAddressToString
Will give you fc06::760:e000:7201:0:6b7.
Edit 2
Here's a way to get exactly what you want:
$ip = [System.Net.IPAddress]::Parse("fc06:0000:0002:0760:e000:7201:7003:06b7")
$ip.IPAddressToString -replace '(?<!(?:^|:)0?:.*?)(?:^|:)0:','::'
This lets the class do the heavy lifting, and in case there's a single zero that could be collapsed, it takes care of it.
Output: fc06::2:760:e000:7201:7003:6b7
(also tested with other addresses that have different sizes of zero blocks in different positions).
Edit 3:
As #Ron Maupin points out, RFC 5952 section 4.2.2 states:
The symbol "::" MUST NOT be used to shorten just one 16-bit 0 field.
For example, the representation 2001:db8:0:1:1:1:1:1 is correct, but
2001:db8::1:1:1:1:1 is not correct.
Given this, I strongly recommend that you don't use the code in Edit 2, or otherwise attempt to collapse a single 16 bit field.
Instead, use the built-in class's zero compression which appears to already be RFC compliant.

The string .replace() method uses literal string arguments, not regular expressions.
Switch to the -replace operator:
$line=$line -Replace ':0+',':'

Related

Perl switch/case Fails on Literal Regex String Containing Non-Capturing Group '?'

I have text files containing lines like:
2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:
$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'
But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:
syntax error at ./report-dates-amounts line 28, near "}continue "
If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .
My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.
#!/usr/local/bin/perl
use Switch;
# Edited for demo
switch($format)
{
# Format A eg:
# 2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
# 3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
# 3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
# 4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
#
case /^(?:april|snow)$/i
{ # This is where the ? character breaks compilation:
$pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';
# WORKS:
# $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
}
# Format B
case /^(?:umberto|petro)$/i
{
$pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
}
}
Don't use Switch. As mentionned by #choroba in the comments, Switch uses a source filter, which leads to mysterious and hard to debug errors, as you constated.
The module's documentation itself says:
In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.
However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:
for ($format) {
if (/^(?:april|snow)$/i) {
...
}
elsif (/^(?:umberto|petro)$/i) {
...
}
}
It might look weird a first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of this options and just do:
sub pattern_from_format {
my $format = shift;
if ($format =~ /^(?:april|snow)$/i) {
return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
}
elsif ($format =~ /^(?:umberto|petro)$/i) {
return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
}
# Some error handling here maybe
}
If, for some reason, you still want to use Switch: use m/.../ instead of /.../.
I have no idea why this bug is happening, however, the documentation says:
Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.
Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.
Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:
#!/usr/local/bin/perl
my %switch = (
'^(?:april|snow)$' => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
'^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);
for my $re (keys %switch) {
if ($format =~ /$re/i) {
$pat = $switch{$re};
last;
}
}
For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.
This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.

Including regex on variable before matching string

I'm trying to find and extract the occurrence of words read from a text file in a text file. So far I can only find when the word is written correctly and not munged (a changed to # or i changed to 1). Is it possible to add a regex to my strings for matching or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
my $fileName = $_[0];
my $stringToCount = $_[1];
my $numberOfOccurrences = 0;
my #wordArray = wordsInFileToArray ($fileName);
foreach (#wordArray)
{
my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
$numberOfOccurrences += $numberOfNewOccurrences;
}
return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by an user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible change is completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
'#' => 'a',
'1' => 'i',
...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for #wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See docs for quotemeta.
There are parts of the problem which aren't specified precisely enough (yet).
Some of the roll-your-own approaches, that depend on the details, are
If user defined substitutions are global (replace every occurrence of a character in every string) the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled, words, if found). There may be false positives so also plan on some post-processing
If the user can supply a list of substitutions along with words that they apply to (the mangled or the corresponding unmangled ones) then we can have a more targeted run
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my #text = qw(that h#cker was a russ1#anh#ck3r);
my #matches = amatch($target, ["25%"], #text);
say for #matches; #==> russ1#anh#ck3r
See documentation for what the module avails us, but at least two comments are in place.
First, note that the second argument in amatch specifies the percentile-deviation from the target string that is acceptable. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches which then need be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
I've ended solving the problem by including the regex directly on the variable that I'll use to match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
my $fileName = $_[0];
my $mungedWordToCount = $_[1];
my $numberOfOccurrences = 0;
open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
while (my $currentLine = <$inputFile>)
{
chomp ($currentLine);
$numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
}
close ($inputFile) or die "Can't open file: $!";
return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
Is just one of the substitutions that are needed and others can be added similarly.
I didn't know that Perl would just interpret the regex inside of the variable since I've tried that before and could only get the wanted results defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.

Need to add prefix to captured names, based on given lists

I have two arrays #Mister and #Mrs and need to add prefix based on the values.
#Mister = qw(Parasuram Raghavan Srivatsan);
#Mrs = qw(Kalai Padmini Maha);
my $str = "I was invited the doctor Parasuram and Kalai and civil Engineer Raghavan and Padmini and finally Advocate Srivatsan and Maha";
#Mr. Parasuram Mr. Raghavan Mr. Srivatsan
if(grep ($_ eq $str), #Mister)
{ $str=~s/($_)/Mr. $1/g; }
#Mrs. Kalai Mrs. Padmini Mrs. Maha`
if(grep ($_ eq $str), #Mrs)
{ $str=~s/($_)/Mrs. $1/g; }
Output Should be:
I was invited the doctor Mr. Parasuram and Mrs. Kalai and civil Engineer Mr. Raghavan and Mrs. Padmini and finally Advocate Mr. Srivatsan and Mrs. Maha
Could someone simplify the way I am doing and whats wrong in this code.
A simple take
my $mr_re = join '|', #Mister;
my $mrs_re = join '|', #Mrs;
$str =~ s/\b($mr_re)\b/Mr. $1/g;
$str =~ s/\b($mrs_re)\b/Ms. $1/g;
(note that I used the neutral Ms above instead of Mrs)
However, when we consider the bewildering complexity of names, the \b doesn't take care of all ways for a name to contain another. An easy example: the - is readily found in names and \b is an anchor between \w and \W, where \w does not include -.
Thus Name-Another would be matched by Name alone as well.
If there are characters other than alphanumeric (plus _) that can be inside names consider
my $w_re = /[a-z-]/i; # list all characters that can be in a name
$str =~ s/(?<!$w_re)($mr_re)(?!$w_re)/Mr. $1/g; # same for Ms.
where negative lookarounds ?<! and ?! are assertions that match your non-name characters (those not listed in $w_re) but do not consume them. Thus they delimit acceptable names.
The same holds for accents, and yet many other characters used in names in various cultures. The task of forming a satisfactory $w_re may be a tricky one even for one particular use case.
If names can come in multiple words (with spaces), in order to handle names within others you would have to parse them in general. That is a complex task; seek modules as little regex won't cut it.
A simple fix would be to preprocess lists to check for names with multiple words that contain other names from your lists, and to handle that case by case.
For your example with hard coded and verifiable names the above works. However, in general, when assembling a regex from strings make sure that all (ascii) non-word chars are escaped so that you actually have the intended literal characters without a special meaning
my $mr_re = join '|', map { quotemeta } #Mister;
my $mrs_re = join '|', map { quotemata } #Mrs;
See quotemeta; inside a regex use \Q, see it in perlbackslash and in perlre.
Note that this problem critically relies on sensible input.
If names are duplicated in lists the problem is ill-posed: If they repeat in the sentence it is unknown which is which, if they don't it is unknown whether it is Mr. or Ms. Thus the name lists should be first checked for duplicates.
"Could someone simplify the way I am doing and whats wrong in this code."
The first part is addressed by zdim in a way I would do it too, but the "what's wrong" part could get some more addressing, in my opinion (just nitpicking, but maybe useful for someone):
if(grep ($_ eq $str), #Mister)
{ $str =~ s/($_)/Mr. $1/g; }
Your list entries will never equal the $str, I think you meant $str =~/$_/
Either use an additional pair of parenthesis around both condition and #list or use the block form of grep (grep { $str =~ /$_/ } #Mister) - otherwise grep will miss the list as argument, since it takes the one existing pair as limiter for it's argument list right now.
the $_ used in the grep command is not available outside of the command, so the $str-substitution would use whatever the value of $_ is currently. In the example it would most likely be undef, so that between each character in the former $str 'Mr. ' is inserted.
Like I said: A perfectly good solution to your problem is given in zdim's answer, but you also asked "what's wrong in this code".
#ssr1012 and other readers: Be careful! It's tempting to think there is a universal solution for this problem. But, unfortunately, even #zdim's approach will give undesirable results if the same name appears in both arrays, and it is still tricky if a name in one array is the same as a name in the other array except for a few additional characters at the start or end.
Here's your example using slightly different names:
my #Mister = qw(Parasuram Mahan Srivatsan);
my #Mrs = qw(Kalai Padmini Maha);
...
# I was invited the doctor Mr. Parasuram and Ms. Kalai and civil Engineer Mr. Ms. Mahan and Ms. Padmini and finally Advocate Mr. Srivatsan and Ms. Maha
See the "Mr. Ms. Mahan"? You don't have enough information for a universal solution. This is only reliable if your names are hard-coded and checked first to avoid collisions.
Even if you added first names, you might not have enough information - guessing gender from first names is unreliable in many language cultures.

RegEx for an invoice format

I'm quite new to regular expressions and I'm trying to create a regex for the validation of an invoice format.
The pattern should be:
JjYy (all 4 characters are legit), used 0, 2 or 4 times
e.g. no Y's at all is valid, YY is valid, YYYY is valid, but YYY should fail.
Followed by a series of 0's repeating 3 to 10 times.
The whole should never exceed 10 characters.
examples:
JyjY000000 is valid (albeit quite strange)
YY000 is valid
000000 is valid
jjj000 is invalid
jjjj0 is invalid
I learned some basics from here, but my regex fails when it shouldn't. Can someone assist in improving it?
My regex so far is: [JjYy]{0}|[JjYy]{2}|[JjYy]{4}[0]{3,10}.
The following failed also: [JjYy]{0|2|4}[0]{3,10}
As you need the total length to never exceed 10 characters I think you have to handle the three kinds of prefixes separately:
0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}
How about:
^([JjYy]{2}){0,2}0{3,10}$
To check the length is ten characters or less, use a string length function rather than a regular expression - don't hammer nails with a screwdriver, and so forth.
Test:
#!perl
use warnings;
use strict;
my $re = qr/^([JjYy]{2}){0,2}0{3,10}$/;
my %tests = qw/JyjY000000 valid
YY000 valid
000000 valid
jjj000 invalid
jjjj0 invalid/;
for my $k (keys %tests) {
print "$k is ";
if ($k =~ /$re/) {
print "valid";
} else {
print "invalid";
}
print " and it should be $tests{$k}.\n";
}
Produces
jjjj0 is invalid and it should be invalid.
YY000 is valid and it should be valid.
JyjY000000 is valid and it should be valid.
jjj000 is invalid and it should be invalid.
000000 is valid and it should be valid.
([jJyY]{2}){0,2}0{3,10}
If the total length limit is inclusive of the jJyY part, you can check it with a negative look ahead to make sure there are no more than 10 characters in the string to begin with (?![jJyY0]{11,})
\b(?![jJyY0]{11,})([jJyY]{2}){0,2}0{3,10}\b
It may depend on what you are using to implement the regular expression. For example I found out the other day that Notepad++ only supports a few basic operators. Things like the pipe are not part of the core regex standard.
I'd suggest something like this:
([JjYy]{2}([JjYy]{2})?)?[0]{3,10}
If you're using a programming language, you'll need to use a string length function to validate the length.
EDIT: actually, you should be able to validate the length by separating the different situations:
([0]{3,10})|([JjYy]{2}[0]{3,8})|([JjYy]{4}[0]{3,6})
You want to limit the string to 10 characters. So in order to do this you have to consider what valid combinations will make up 10 characters.
Valid combinations therefore would be:
0000000000
000
cc00000000
cc000
cccc000000
cccc000
So, an expression to include all of these would be:
/0{3,10}|[JY]{2}0{3,8}|[JY]{4}0{3,6}/i
A case insensitive match would suffice, although you do get additional performance from some regular expression engines by explicitly saying /[JjYy]/ instead of /[JY]/i.

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, i see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern , based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of more than 4 characters
$suffix =~ s/(a|the|an|of)//ig; # strip a few common cases
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
The first two "l" parts can be simplified by [the first "l" part here]{2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
$Sq = $_;
$Sq =~ s/[.\s]//g; # squish away . and " " in the temporary save version
$Sq = lc($Sq);
/^llc$/ and $_ = 'L.L.C.'; # try to match, if so save the canonical version
/^ibm/ and $_ = 'IBM'; # a different match
print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my #strings = (
"l.l.c",
"llc",
"LLC",
"lLc",
"l,l,c",
"L . L C ",
"l W c"
);
my #seps = ('.',',','\s');
my $sep_regex = '[' . join('', #seps) . ']*';
my $regex_def = join '', (
'[lL]',
$sep_regex,
'[lL]',
$sep_regex,
'[cC]'
);
print "definition: $regex_def\n";
foreach my $str (#strings) {
if ( $str =~ /$regex_def/ ) {
print "$str matches\n";
} else {
print "$str doesn't match\n";
}
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', #seps) . ']*';
foreach my $str (#strings) {
my $copy = $str;
$copy =~ s/$sep_regex//g;
$copy = lc $copy;
print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.