Regex to capture a variable number of items?

Regex to capture a variable number of items? - regex

I am trying to use a regex to capture values from SPACE delimited items. Yes, I know that I could use [string]::Split() or -split. The goal is to use a regex in order fit it into the regex of another, larger regex.
There are a variable number of items in the string. In this example there are four (4). The resulting $Matches variable has the full string for all Value members. I also tried the regex '^((.*)\s*)+', but that resulted in '' for all except the first .\value.txt
How can I write a regex to capture a variable number of items.
PS C:\src\t> $s = 'now is the time'
PS C:\src\t> $m = [regex]::Matches($s, '^((.*)\s*)')
PS C:\src\t> $m
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 15
Value : now is the time
ValueSpan :
PS C:\src\t> $m.Groups.Value
now is the time
now is the time
now is the time
PS C:\src\t> $PSVersionTable.PSVersion.ToString()
7.2.2

You can use [regex]::Match() to find the first matching substring, then call NextMatch() to advance through the input string until no further matches can be made.
I've taken the liberty of simplifying the expression to \S+ (consecutive non-whitespace characters):
$string = 'now is the time'
$regex = [regex]'\S+'
$match = $regex.Match($string)
while($match.Success){
Write-Host "Match at index [$($match.Index)]: '$($match.Value)'"
# advance to the next match, if any
$match = $match.NextMatch()
}
Which will print:
Match at index [0]: 'now'
Match at index [4]: 'is'
Match at index [7]: 'the'
Match at index [11]: 'time'

Mathias' answer shows an iterative approach to retrieving all matches, which may or may not be needed.
Building on your own attempt to use [regex]::Matches(), the solution is as simple as:
$s = 'now is the time'
[regex]::Matches($s, '\S+').Value # -> #('now', 'is', 'the', 'time')
As noted, \S+ matches any non-empty run (+) of non-whitespace characters (\S).
Thanks to member-access enumeration, accessing the .Value property on the method call's result, which is a collection of System.Text.RegularExpressions.Match instances, each instance's .Value property is returned, which in the case of two or more instances results in an array of values.

I guess the following will work for you
[^\s]+
[^\s] means "not a space"
+ means 1 or more characters

Related

Change 3rd octet of IP in string format using PowerShell

Think I've found the worst way to do this:
$ip = "192.168.13.1"
$a,$b,$c,$d = $ip.Split(".")
[int]$c = $c
$c = $c+1
[string]$c = $c
$newIP = $a+"."+$b+"."+$c+"."+$d
$newIP
But what is the best way? Has to be string when completed. Not bothered about validating its a legit IP.

Using your example for how you want to modify the third octet, I'd do it pretty much the same way, but I'd compress some of the steps together:
$IP = "192.168.13.1"
$octets = $IP.Split(".") # or $octets = $IP -split "\."
$octets[2] = [string]([int]$octets[2] + 1) # or other manipulation of the third octet
$newIP = $octets -join "."
$newIP

You can simply use the -replace operator of PowerShell and a look ahead pattern. Look at this script below
Set-StrictMode -Version "2.0"
$ErrorActionPreference="Stop"
cls
$ip1 = "192.168.13.123"
$tests=#("192.168.13.123" , "192.168.13.1" , "192.168.13.12")
foreach($test in $tests)
{
$patternRegex="\d{1,3}(?=\.\d{1,3}$)"
$newOctet="420"
$ipNew=$test -replace $patternRegex,$newOctet
$msg="OLD ip={0} NEW ip={1}" -f $test,$ipNew
Write-Host $msg
}
This will produce the following:
OLD ip=192.168.13.123 NEW ip=192.168.420.123
OLD ip=192.168.13.1 NEW ip=192.168.420.1
OLD ip=192.168.13.12 NEW ip=192.168.420.12
How to use the -replace operator?
https://powershell.org/2013/08/regular-expressions-are-a-replaces-best-friend/
Understanding the pattern that I have used
The (?=) in \d{1,3}(?=.\d{1,3}$) means look behind.
The (?=.\d{1,3}$ in \d{1,3}(?=.\d{1,3}$) means anything behind a DOT and 1-3 digits.
The leading \d{1,3} is an instruction to specifically match 1-3 digits
All combined in plain english "Give me 1-3 digits which is behind a period and 1-3 digits located towards the right side boundary of the string"
Look ahead regex
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
CORRECTION
The regex pattern is a look ahead and not look behind.

If you have PowerShell Core (v6.1 or higher), you can combine -replace with a script block-based replacement:
PS> '192.168.13.1' -replace '(?<=^(\d+\.){2})\d+', { 1 + $_.Value }
192.168.14.1
Negative look-behind assertion (?<=^(\d+\.){2}) matches everything up to, but not including, the 3rd octet - without considering it part of the overall match to replace.
(?<=...) is the look-behind assertion, \d+ matches one or more (+) digits (\d), \. a literal ., and {2} matches the preceding subexpression ((...)) 2 times.
\d+ then matches just the 3rd octet; since nothing more is matched, the remainder of the string (. and the 4th octet) is left in place.
Inside the replacement script block ({ ... }), $_ refers to the results of the match, in the form of a [MatchInfo] instance; its .Value is the matched string, i.e. the 3rd octet, to which 1 can be added.
Data type note: by using 1, an implicit [int], as the LHS, the RHS (the .Value string) is implicitly coerced to [int] (you may choose to use an explicit cast).
On output, whatever the script block returns is automatically coerced back to a string.
If you must remain compatible with Windows PowerShell, consider Jeff Zeitlin's helpful answer.

For complete your method but shortly :
$a,$b,$c,$d = "192.168.13.1".Split(".")
$IP="$a.$b.$([int]$c+1).$d"

function Replace-3rdOctet {
Param(
[string]$GivenIP,
[string]$New3rdOctet
)
$GivenIP -match '(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})' | Out-Null
$Output = "$($matches[1]).$($matches[2]).$New3rdOctet.$($matches[4])"
Return $Output
}
Copy to a ps1 file and dot source it from command line, then type
Replace-3rdOctet -GivenIP '100.201.190.150' -New3rdOctet '42'
Output: 100.201.42.150
From there you could add extra error handling etc for random input etc.

here's a slightly different method. [grin] i managed to not notice the answer by JeffZeitlin until after i finished this.
[edit - thanks to JeffZeitlin for reminding me that the OP wants the final result as a string. oops! [*blush*]]
what it does ...
splits the string on the dots
puts that into an [int] array & coerces the items into that type
increments the item in the targeted slot
joins the items back into a string with a dot for the delimiter
converts that to an IP address type
adds a line to convert the IP address to a string
here's the code ...
$OriginalIPv4 = '1.1.1.1'
$TargetOctet = 3
$OctetList = [int[]]$OriginalIPv4.Split('.')
$OctetList[$TargetOctet - 1]++
$NewIPv4 = [ipaddress]($OctetList -join '.')
$NewIPv4
'=' * 30
$NewIPv4.IPAddressToString
output ...
Address : 16908545
AddressFamily : InterNetwork
ScopeId :
IsIPv6Multicast : False
IsIPv6LinkLocal : False
IsIPv6SiteLocal : False
IsIPv6Teredo : False
IsIPv4MappedToIPv6 : False
IPAddressToString : 1.1.2.1
==============================
1.1.2.1

Arithmetic Calculation in Perl Substitute Pattern Matching

Using just one Perl substitute regular expression statement (s///), how can we write below:
Every success match contains just a string of Alphabetic characters A..Z. We need to substitute the match string with a substitution that will be the sum of character index (in alphabetical order) of every character in the match string.
Note: For A, character index would be 1, for B, 2 ... and for Z would be 26.
Please see example below:
success match: ABCDMNA
substitution result: 38
Note:
1 + 2 + 3 + 4 + 13 + 14 + 1 = 38;
since
A = 1, B = 2, C = 3, D = 4, M = 13, N = 14 and A = 1.

I will post this as an answer, I guess, though the credit for coming up with the idea should go to abiessu for the idea presented in his answer.
perl -ple'1 while s/(\d*)([A-Z])/$1+ord($2)-64/e'
Since this is clearly homework and/or of academic interest, I will post the explanation in spoiler tags.
- We match an optional number (\d*), followed by a letter ([A-Z]). The number is the running sum, and the letter is what we need to add to the sum.
- By using the /e modifier, we can do the math, which is add the captured number to the ord() value of the captured letter, minus 64. The sum is returned and inserted instead of the number and the letter.
- We use a while loop to rinse and repeat until all letters have been replaced, and all that is left is a number. We use a while loop instead of the /g modifier to reset the match to the start of the string.

Just split, translate, and sum:
use strict;
use warnings;
use List::Util qw(sum);
my $string = 'ABCDMNA';
my $sum = sum map {ord($_) - ord('A') + 1} split //, $string;
print $sum, "\n";
Outputs:
38

Can you use the /e modifier in the substitution?
$s = "ABCDMNA";
$s =~ s/(.)/$S += ord($1) - ord "#"; 1 + pos $s == length $s ? $S : ""/ge;
print "$s\n"

Consider the following matching scenario:
my $text = "ABCDMNA";
my $val = $text ~= s!(\d)*([A-Z])!($1+ord($2)-ord('A')+1)!gr;
(Without having tested it...) This should repeatedly go through the string, replacing one character at a time with its ordinal value added to the current sum which has been placed at the beginning. Once there are no more characters the copy (/r) is placed in $val which should contain the translated value.

Or an short alternative:
echo ABCDMNA | perl -nlE 'm/(.)(?{$s+=-64+ord$1})(?!)/;say$s'
or readable
$s = "ABCDMNA";
$s =~ m/(.)(?{ $sum += ord($1) - ord('A')+1 })(?!)/;
print "$sum\n";
prints
38
Explanation:
trying to match any character what must not followed by "empty regex". /.(?!)/
Because, an empty regex matches everything, the "not follow by anything", isn't true ever.
therefore the regex engine move to the next character, and tries the match again
this is repeated until is exhausted the whole string.
because we want capture the character, using capture group /(.)(?!)/
the (?{...}) runs the perl code, what sums the value of the captured character stored in $1
when the regex is exhausted (and fails), the last say $s prints the value of sum
from the perlre
(?{ code })
This zero-width assertion executes any embedded Perl code. It always
succeeds, and its return value is set as $^R .
WARNING: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform
identically from version to version due to the effect of future
optimisations in the regex engine. For more information on this, see
Embedded Code Execution Frequency.

Convert string with preg_replace in PHP

I have this string
$string = "some words and then #1.7 1.7 1_7 and 1-7";
and I would like that #1.7/1.7/1_7 and 1-7 to be replaced by S1E07.
Of course, instead of "1.7" is just an example, it could be "3.15" for example.
I managed to create the regular expression that would match the above 4 variants
/\#\d{1,2}\.\d{1,2}|\d{1,2}_\d{1,2}|\d{1,2}-\d{1,2}|\d{1,2}\.\d{1,2}/
but I cannot figure out how to use preg_replace (or something similar?) to actually replace the matches so they end up like S1E07

You need to use preg_replace_callback if you need to pad 0 if the number less than 10.
$string = "some words and then #1.7 1.7 1_7 and 1-7";
$string = preg_replace_callback('/#?(\d+)[._-](\d+)/', function($matches) {
return 'S'.$matches[1].'E'.($matches[2] < 10 ? '0'.$matches[2] : $matches[2]);
}, $string);

You could use this simple string replace:
preg_replace('/#?\b(\d{1,2})[-._](\d{1,2})\b/', 'S${1}E${2}', $string);
But it would not yield zero-padded numbers for the episode number:
// some words and then S1E7 S1E7 S1E7 and S1E7
You would have to use the evaluation modifier:
preg_replace('/#?\b(\d{1,2})[-._](\d{1,2})\b/e', '"S".str_pad($1, 2, "0", STR_PAD_LEFT)."E".str_pad($2, 2, "0", STR_PAD_LEFT)', $string);
...and use str_pad to add the zeroes.
// some words and then S01E07 S01E07 S01E07 and S01E07
If you don't want the season number to be padded you can just take out the first str_pad call.

I believe this will do what you want it to...
/\#?([0-9]+)[._-]([0-9]+)/
In other words...
\#? - can start with the #
([0-9]+) - capture at least one digit
[._-] - look for one ., _ or -
([0-9]+) - capture at least one digit
And then you can use this to replace...
S$1E$2
Which will put out S then the first captured group, then E then the second captured group

You need to put brackets around the parts you want to reuse ==> capture them. Then you can access those values in the replacement string with $1 (or ${1} if the groups exceed 9) for the first group, $2 for the second one...
The problem here is that you would end up with $1 - $8, so I would rewrite the expression into something like this:
/#?(\d{1,2})[._-](\d{1,2})/
and replace with
S${1}E${2}
I tested it on writecodeonline.com:
$string = "some words and then #1.7 1.7 1_7 and 1-7";
$result = preg_replace('/#?(\d{1,2})[._-](\d{1,2})/', 'S${1}E${2}', $string);

Regex to match the longest repeating substring

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.
What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.
As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match

In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2}))
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101

You can do it in a single regex, you just have to pick the longest match from the list of results manually.
def longestrepeating(strg):
regex = re.compile(r"(?=(.+)\1)")
matches = regex.findall(strg)
if matches:
return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
chomp;
print "$_ : ";
my $longest = foo($_);
if ($longest) {
say $longest;
} else {
say "No matches found";
}
}
sub foo {
my $num = shift;
my #hits;
for my $i (0 .. length($num)) {
my $part = substr $num, $i;
push #hits, $part =~ /(.+)(?=\1)/g;
}
my $long = shift #hits;
for (#hits) {
if (length($long) < length) {
$long = $_;
}
}
return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

Not sure if anyone's thought of this...
my $originalstring="pdxabababqababqh1234112341";
my $max=int(length($originalstring)/2);
my #result;
foreach my $n (reverse(1..$max)) {
#result=$originalstring=~m/(.{$n})\1/g;
last if #result;
}
print join(",",#result),"\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
my $n;
foreach (1..$max) {
unless (#result=$originalstring=~m/(.{$_,})\1/g) {
$n=--$_;
last;
}
}
#result=$originalstring=~m/(.{$n})\1/g;
print join(",",#result),"\n";

Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'

2-step regular expression matching with a variable in Perl

I am looking to do a 2-step regular expression look-up in Perl, I have text that looks like this:
here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words
I also have a hash with the following values:
%my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
Here's what I need to do:
Find a number
Look up the text code for the number using my_hash
Check if the text code appears within 50 characters of the identified number, and if true print the result
So the output I'm looking for is:
Found 9337, matches 'AA'
Found 2214, matches 'BB'
Found 1190, no matches
Found 8790, no matches
Here's what I have so far:
while ( $text =~ /(\d+)(.{1,50})/g ) {
$num = $1;
$text_after_num = $2;
$search_for = $my_hash{$num};
if ( $text_after_num =~ /($search_for)/ ) {
print "Found $num, matches $search_for\n";
}
else {
print "Found $num, no matches\n";
}
This sort of works, except that the only correct match is 9337; the code doesn't match 2214. I think the reason is that the regular expression match on 9337 is including 50 characters after the number for the second-step match, and then when the regex engine starts again it is starting from a point after the 2214. Is there an easy way to fix this? I think the \G modifier can help me here, but I don't quite see how.
Any suggestions or help would be great.

You have a problem with greediness. The 1,50 will consume as much as it can. Your regex should be /(\d+)(.+?)(?=($|\d))/
To explain, the question mark will make the multiple match non-greedy (it will stop as soon as the next pattern is matched - the next pattern gets precedence). The ?= is a lookahead operator to say "check if the next element is a digit. If so, match but do not consume." This allows the first digit to get picked up by the beginning of the regex and be put into the next matched pattern.
[EDIT]
I added an optional end value to the lookahead so that it wouldn't die on the last match.

Just use :
/\b\d+\b/g
Why match everything if you don't need to? You should use other functions to determine where the number is :
/(?=9337.{1,50}AA)/
This will fail if AA is further than 50 chars away from the end of 9337. Of course you will have to interpolate your variables to match your hashe's keys and values. This was just an example for your first key/value pair.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js