Advanced pattern matching in Powershell - regex

Hope you can help me with something. Thanks to @mklement0 I've gotten a great script matching the most basic, initial pattern for words in alphabetical order. However, what's missing is a full-text search and select.
An example of the current script's output, given a small sample of words in a Words.txt file:
App
Apple
Apply
Sword
Swords
Word
Words
Becomes:
App
Sword
Word
This is great, as it really narrows things down to a basic pattern per line! However, because it goes line by line, there is still a pattern that can be narrowed down further, namely "Word" (capitalization not important), so ideally the output should be:
App
Word
And "Sword" is removed as it falls in more basic pattern prefixed as "Word".
Would you have any suggestion on how to achieve this? Keep in mind this will be a dictionary list of about 250k words, so I would not know what I am looking for ahead of time
CODE (from a related post, handles prefix matching only):
$outFile = [IO.File]::CreateText("C:\Temp\Results.txt") # Output file location
$prefix = '' # Initialize the prefix pattern
foreach ($line in [IO.File]::ReadLines('C:\Temp\Words.txt')) # Input file name
{
    if ($line -like $prefix)
    {
        continue # Same prefix, skip
    }
    $line # Visual output of new unique prefix
    $prefix = "$line*" # Save the new prefix pattern
    $outFile.WriteLine($line) # Write to the configured output file
}
$outFile.Close() # Close the writer so buffered output is flushed

You can try a two-step approach:
Step 1: Find the list of unique prefixes in the alphabetically sorted word list. This is done by reading the lines sequentially, and therefore only requires holding the unique prefixes (not the whole word list) in memory.
Step 2: Sort the resulting prefixes by length and iterate over them, checking in each iteration whether the word at hand is already represented in the result list by one of its substrings.
The result list starts out empty, and whenever the word at hand has no substring in the result list, it is appended to the list.
The result list is implemented as a regular expression with alternation (|), to enable matching against all already-found unique words in a single operation.
You'll have to see if the performance is good enough; for best performance, .NET types are used directly as much as possible.
# Read the input file and build the list of unique prefixes, assuming
# alphabetical sorting.
$inFilePath = 'C:\Temp\Words.txt' # Be sure to use a full path.
$prefix = '' # Initialize the prefix pattern.
$uniquePrefixWords =
    foreach ($word in [IO.File]::ReadLines($inFilePath)) {
        if ($word -like $prefix) { continue }
        $word
        $prefix = "$word*"
    }
# Sort the prefixes by length in ascending order (shorter ones first).
# Note: This is a more time- and space-efficient alternative to:
# $uniquePrefixWords = $uniquePrefixWords | Sort-Object -Property Length
[Array]::Sort($uniquePrefixWords.ForEach('Length'), $uniquePrefixWords)
# Build the result list of unique shortest words with the help of a regex.
# Skip later - and therefore longer - words if they are already represented
# in the result list of words by a substring.
$regexUniqueWords = ''; $first = $true
foreach ($word in $uniquePrefixWords) {
    if ($first) { # first word
        $regexUniqueWords = $word
        $first = $false
    } elseif ($word -notmatch $regexUniqueWords) {
        # New unique word found: add it to the regex as an alternation (|)
        $regexUniqueWords += '|' + $word
    }
}
# The regex now contains all unique words, separated by "|".
# Split it into an array of individual words, sort the array again...
$resultWords = $regexUniqueWords.Split('|')
[Array]::Sort($resultWords)
# ... and write it to the output file.
$outFilePath = 'C:\Temp\Results.txt' # Be sure to use a full path.
[IO.File]::WriteAllLines($outFilePath, $resultWords)

Reducing arbitrary substrings is a bit more complicated than prefix matching, as we can no longer rely on alphabetical sorting.
Instead, you could sort by length, and then keep track of patterns that can't be satisfied by a shorter one, by using a hash set:
function Reduce-Wildcard
{
    param(
        [string[]]$Strings,
        [switch]$SkipSort
    )

    # Create set containing all patterns, removes all duplicates
    $Patterns = [System.Collections.Generic.HashSet[string]]::new($Strings, [StringComparer]::CurrentCultureIgnoreCase)

    # Now that we only have unique terms, sort them by length
    $Strings = $Patterns | Sort-Object -Property Length

    # Start from the shortest possible pattern
    for ($i = 0; $i -lt ($Strings.Count - 1); $i++) {
        $current = $Strings[$i]
        if (-not $Patterns.Contains($current)) {
            # Check that we haven't eliminated current string already
            continue
        }

        # There's no reason to search for this substring
        # in any of the shorter strings
        $j = $i + 1
        do {
            $next = $Strings[$j]
            if ($Patterns.Contains($next)) {
                # Do we have a substring match?
                if ($next -like "*$current*") {
                    # Eliminate the superstring
                    [void]$Patterns.Remove($next)
                }
            }
            $j++
        } while ($j -lt $Strings.Count)
    }

    # Return the substrings we have left
    return $Patterns
}
Then use like:
$strings = [IO.File]::ReadLines('C:\Temp\Words.txt')
$reducedSet = Reduce-Wildcard -Strings $strings
Now, this is definitely not the most space-efficient way of reducing your patterns, but the good news is that you can easily divide-and-conquer a large set of inputs by merging and reducing the intermediate results:
Reduce-Wildcard @(
    Reduce-Wildcard -Strings @('App','Apple')
    Reduce-Wildcard -Strings @('Sword', 'Words')
    Reduce-Wildcard -Strings @('Swords', 'Word')
)
Or, in case of multiple files, you can chain successive reductions like this:
$patterns = @()
Get-ChildItem dictionaries\*.txt | ForEach-Object {
    $patterns = Reduce-Wildcard -Strings @(
        $_ | Get-Content
        $patterns
    )
}

My two cents:
Using -like or regex might get expensive in the long run: because they are used in the inner loop of the selection, the number of invocations grows roughly quadratically with the size of the word list. Besides, the patterns used with the -like and regex operations might need to be escaped (especially for regex, where e.g. a dot . has a special meaning; I suspect that this question has something to do with checking for password complexity).
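If the patterns do need to be matched literally, they can be escaped first; a minimal sketch (not part of the original answer, shown only to illustrate the escaping concern):
$word = 'a.b*c'
[regex]::Escape($word)           # -> 'a\.b\*c'  (safe to use with -match / -replace)
[WildcardPattern]::Escape($word) # -> 'a.b`*c'   (safe to use with -like)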
Presuming that it doesn't matter whether the output list is in lower case, I would use the String.Contains() method. Otherwise, if the case of the output does matter, you might prepare a hash table like $List[$Word.ToLower()] = $Word and use that to restore the actual case at the end.
# Remove empty words, sort by word length and change everything to lowercase,
# knowing that .Contains() is case-sensitive (and therefore presumably a little faster)
$Words = $Words | Where-Object {$_} | Sort-Object Length | ForEach-Object {$_.ToLower()}
# Start with a list of the smallest words (I guess this is a list of all the words with 3 characters)
$Result = [System.Collections.ArrayList]@($Words | Where-Object Length -eq $Words[0].Length)
# Add the word to the list if it doesn't contain any of the already listed words
ForEach ($Word in $Words) {
    If (!$Result.Where({$Word.Contains($_)}, 'First')) { $Null = $Result.Add($Word) }
}
2020-04-23: updated the script with the suggestion from @Mathias:
You may want to use Where({$Word.Contains($_)},'First') to avoid comparing against all of $Result every time
which is about twice as fast.
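For completeness, here is a minimal sketch of the case-restoring hash table mentioned above; the $CaseMap and $RawWords names are illustrative (not part of the original answer), and the map must be built from the raw words before the lowercasing step:
# Remember the original spelling of each word before lowercasing
# (if two words differ only in case, the last one read wins).
$CaseMap = @{}
foreach ($w in $RawWords) { $CaseMap[$w.ToLower()] = $w }
# ... run the reduction on the lowercased $Words as shown above ...
# Map the reduced (lowercase) results back to their original casing.
$Result = $Result | ForEach-Object { $CaseMap[$_] }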

Related

Match a cycle of letters in a string in Perl

Let's say I have a string 'abc'. How do I match all 3 or more occurrences of 'abc' and its cycles ('bca', 'cab') in a large string.
Right now I am using individual entries as regex to match, but a) It is taking too long because the string is very large, and b) I'm getting the same regions in subsequent matches. For example, if my input is:
dabcabcabcabgyklagkbcabcabcahkgljla
^-------^ ^-------^
I want my output to be two matches:
1. abcabcabc position 2
2. bcabcabca position 20
Right now I'm getting 4 lines of output:
1. abcabcabc position 2
2. bcabcabca position 3
3. cabcabcab position 4
4. bcabcabca position 20
I hope I explained my problem. I got the desired output in another complicated way by doing a multi regex matching using all possible combinations in a single regex like this:
while ($str =~ /(abc){3,}|(bca){3,}|(cab){3,}/g) {
print "$1\tposition $-[0]\n";
}
But it was a serious performance hit, and given the size of my input, it is taking forever to run. Please help me with a more efficient algorithm. Really sorry if this was asked earlier, but I couldn't find any page that helped me.
Thanks in advance
I suggest you use just /(abc){2,}/ preceded by nothing, c, or bc and followed by nothing, a, or ab, so
/ ( (?:b?c)? (?:abc){2,} (?:ab?)? ) /xg
The idea is to break down any sequence, like bcabcabcabcabca, into a number of abcs, possibly preceded by c or (here) bc and possibly followed by (here) a or ab, like this.
bc abcabcabcabc a
so that the regex engine doesn't have to check for three different strings at every point.
Doing it that way may find sequences up to three characters shorter than you require, but it should be faster and you can add an additional filter on length. Like this
use strict;
use warnings;
my $seq = 'dabcabcabcabgyklagkbcabcabcahkgljla';
while ($seq =~ / ( (?:b?c)? (?:abc){2,} (?:ab?)? ) /xg) {
    next unless length $1 >= 9;
    my $subseq = $1;
    chop $subseq while length($subseq) % 3;
    print "$subseq\tposition $-[0]\n";
}
output
abcabcabc position 1
bcabcabca position 19
I've used the data you posted and have found a variation of my original solution that runs about four to five times faster than your original. Unfortunately the sequence you posted is only 225KB and there is only a single occurrence of one of the SSRs in it, so I don't know how representative it is.
Essentially, instead of looking for a sequence of four rotations of the pattern, it looks only for repetitions of the core SSR, with an optional prefix and suffix that lets the overall sequence start anywhere within the SSR, like this
/ (?:AAT|AT|T|) (?:AAAT){3,} (:?AAA|AA|A|) /x
All of this regex is built automatically.
use strict;
use warnings;
use autodie;
open my $fh, '<', 'chr1.txt';
my $seq = <$fh>;
close $fh;
my @ssrs = qw( AAAT AAAC AACC AACG );
retrieve_ssr('Sample', $seq, \@ssrs);

sub retrieve_ssr {
    my ($name, $seq, $ssr_list) = @_;
    for my $ssr (@$ssr_list) {
        my $len = length $ssr;
        my $n = $len == 5 ? 3 : 12 / $len;
        $n = 1;
        my $prefix = join '', map { substr($ssr, -$_) . '|' } 1 .. $len-1;
        my $suffix = join '', map { substr($ssr, 0, $_) . '|' } reverse 1 .. $len-1;
        my $re = qr/ (?:$prefix) (?:$ssr){$n,} (?:$suffix) /x;
        while ($seq =~ /$re/g) {
            my $start = $-[0] + 1;
            my $length = $+[0] - $-[0];
            my $excess = $length % $len;
            pos($seq) -= $excess;
            $length -= $excess;
            my $seq = substr $seq, $-[0], $length;
            print "$start\t$+[0]\t$length\t$seq\n";
        }
    }
}
output
23738 23752 12 TAAATAAATAAA
It strikes me that you don't need to have 3 separate regexes, you really just need one regex like this:
perl -ne 'print "$1\tposition $-[0]\n" while /(b?c?(abc){1,}a?b?)/g' mydata.txt
The idea is that the core pattern abc is matched as needed, and then you just need to account for the potential prefix of "b?c?" and a potential suffix of "a?b?" (if the prefix or suffix were longer then it would be matched by the main regex in the center).
As given this expression will find matches of 3 chars or longer, but you can obviously up the minimum length by changing the value inside {1,}
This solution does risk a few false positives in the prefix and suffix however, as it would match "babc", so you could run a 2nd slow search on the results for complete accuracy.

PowerShell multiple string replacement efficiency

I'm trying to replace 600 different strings in a very large text file (30 MB+). I'm currently building a script that does this, following this question:
Script:
$string = gc $filePath
$string | % {
    $_ -replace 'something0','somethingelse0' `
       -replace 'something1','somethingelse1' `
       -replace 'something2','somethingelse2' `
       -replace 'something3','somethingelse3' `
       -replace 'something4','somethingelse4' `
       -replace 'something5','somethingelse5' `
       ...
       (600 More Lines...)
       ...
}
$string | ac "C:\log.txt"
But as this will check each line 600 times, and there are well over 150,000 lines in the text file, this means there's a lot of processing time.
Is there a better alternative to doing this that is more efficient?
Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:
# Build hashtable of search and replace values.
$replacements = @{
    'something0' = 'somethingelse0'
    'something1' = 'somethingelse1'
    'something2' = 'somethingelse2'
    'something3' = 'somethingelse3'
    'something4' = 'somethingelse4'
    'something5' = 'somethingelse5'
    'X:\Group_14\DACU' = '\\DACU$'
    '.*[^xyz]' = 'oO{xyz}'
    'moresomethings' = 'moresomethingelses'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
    # Return replacement value for each matched value.
    $matchedValue = $matchInfo.Groups[0].Value
    $replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
Get-Content $filePath |
    foreach { $r.Replace( $_, $matchEval ) } |
    Add-Content 'C:\log.txt'
So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?
Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.
The Method:
Construct a hash where the keys are the somethings and the values are the somethingelses.
Join the keys of the hash with the | symbol, and use it as a match group in the regex.
In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
The Problem:
Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.
In Perl, you can do this, for example:
$string =~ s/(1|2|3)/@{[$1 + 5]}/g;
This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
However, in PowerShell, both of these fail:
$string -replace '(1|2|3)',"$($1 + 5)"
[regex]::replace($string,'(1|2|3)',"$($1 + 5)")
In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
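For illustration, a small sketch (not from the original answer) of the backtick-escaped form just described, which turns $1 into a plain backreference rather than an expression:
# The backtick keeps $1 out of string interpolation, so the regex engine sees a literal backreference:
'abc123' -replace '(\d+)', "`$1!"   # -> 'abc123!' (the captured digits, followed by '!')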
The Solution:
[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]
If using another language is acceptable to you, the following Perl script works like a charm:
$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, "< $filePath";
open OUTPUT, '> C:\log.txt';
%replacements = (
    'something0' => 'somethingelse0',
    'something1' => 'somethingelse1',
    'something2' => 'somethingelse2',
    'something3' => 'somethingelse3',
    'something4' => 'somethingelse4',
    'something5' => 'somethingelse5',
    'X:\Group_14\DACU' => '\\DACU$',
    '.*[^xyz]' => 'oO{xyz}',
    'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
    push @strings, qr/\Q$_\E/;
    $replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
    s/($pattern)/$replacements{$1}/g;
    print OUTPUT;
}
close INPUT;
close OUTPUT;
It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.
while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
I also have no idea how to solve this in PowerShell, but I do know how to solve it in Bash, and that is by using a tool called sed. Luckily, there is also Sed for Windows. If all you want to do is replace "something#" with "somethingelse#" everywhere, then this command will do the trick for you
sed -i "s/something([0-9]+)/somethingelse\1/g" c:\log.txt
In Bash you'd actually need to escape a couple of those characters with backslashes, but I'm not sure you need to in windows. If the first command complains you can try
sed -i "s/something\([0-9]\+\)/somethingelse\1/g" c:\log.txt
I would use the powershell switch statement:
$string = gc $filePath
$string | % {
    switch -regex ($_) {
        'something0' { 'somethingelse0' }
        'something1' { 'somethingelse1' }
        'something2' { 'somethingelse2' }
        'something3' { 'somethingelse3' }
        'something4' { 'somethingelse4' }
        'something5' { 'somethingelse5' }
        'pattern(?<a>\d+)' { $matches['a'] } # sample of more complex logic
        ...
        (600 More Lines...)
        ...
        default { $_ }
    }
} | ac "C:\log.txt"

Powershell: Replacing regex named groups with variables

Say I have a regular expression like the following, but I loaded it from a file into a variable $regex, and so have no idea at design time what its contents are, but at runtime I can discover that it includes the "version1", "version2", "version3" and "version4" named groups:
"Version (?<version1>\d),(?<version2>\d),(?<version3>\d),(?<version4>\d)"
...and I have these variables:
$version1 = "3"
$version2 = "2"
$version3 = "1"
$version4 = "0"
...and I come across the following string in a file:
Version 7,7,0,0
...which is stored in a variable $input, so that ($input -match $regex) evaluates to $true.
How can I replace the named groups from $regex in the string $input with the values of $version1, $version2, $version3, $version4 if I do not know the order in which they appear in $regex (I only know that $regex includes these named groups)?
I can't find any references describing the syntax for replacing a named group with the value of a variable by using the group name as an index to the match - is this even supported?
EDIT:
To clarify - the goal is to replace templated version strings in any kind of text file where the version string in a given file requires replacement of a variable number of version fields (could be 2, 3, or all 4 fields). For example, the text in a file could look like any of these (but is not restricted to these):
#define SOME_MACRO(4, 1, 0, 0)
Version "1.2.3.4"
SomeStruct vs = { 99,99,99,99 }
Users can specify a file set and a regular expression to match the line containing the fields, with the original idea being that the individual fields would be captured by named groups. The utility has the individual version field values that should be substituted in the file, but has to preserve the original format of the line that will contain the substitutions, and substitute only the requested fields.
EDIT-2:
I think I can get the result I need with substring calculations based on the position and extent of each of the matches, but was hoping Powershell's replace operation was going to save me some work.
EDIT-3:
So, as Ansgar correctly and succinctly describes below, there isn't a way (using only the original input string, a regular expression about which you only know the named groups, and the resulting matches) to use the "-replace" operation (or other regex operations) to perform substitutions of the captures of the named groups, while leaving the rest of the original string intact. For this problem, if anybody's curious, I ended up using the solution below. YMMV, other solutions possible. Many thanks to Ansgar for his feedback and options provided.
In the following code block:
$input is a line of text on which substitution is to be performed
$regex is a regular expression (of type [string]) read from a file that has been verified to contain at least one of the supported named groups
$regexToGroupName is a hash table that maps a regex string to an array of group names ordered according to the order of the array returned by [regex]::GetGroupNames(), which matches the left-to-right order in which they appear in the expression
$groupNameToVersionNumber is a hash table that maps a group name to a version number.
Constraints on the named groups within $regex are only (I think) that the expression within the named groups cannot be nested, and should match at most once within the input string.
# This will give us the index and extent of each substring
# that we will be replacing (the parts that we will not keep)
$matchResults = ([regex]$regex).Match($input)
# This will hold substrings from $input that were not captured
# by any of the supported named groups, as well as the replacement
# version strings, properly ordered, but will omit substrings captured
# by the named groups
$lineParts = @()
$startingIndex = 0
foreach ($groupName in $regexToGroupName.$regex)
{
    # Excise the substring leading up to the match for this group...
    $lineParts = $lineParts + $input.Substring($startingIndex, $matchResults.Groups[$groupName].Index - $startingIndex)
    # Instead of the matched substring, we'll use the substitution
    $lineParts = $lineParts + $groupNameToVersionNumber.$groupName
    # Set the starting index of the next substring that we will keep...
    $startingIndex = $matchResults.Groups[$groupName].Index + $matchResults.Groups[$groupName].Length
}
# Keep the end of the original string (if there's anything left)
$lineParts = $lineParts + $input.Substring($startingIndex, $input.Length - $startingIndex)
$newLine = ""
foreach ($part in $lineParts)
{
    $newLine = $newLine + $part
}
$input = $newLine
Simple Solution
In the scenario where you simply want to replace a version number found somewhere in your $input text, you could simply do this:
$input -replace '(Version\s+)\d+,\d+,\d+,\d+',"`$1$Version1,$Version2,$Version3,$Version4"
Using Named Captures in PowerShell
Regarding your question about named captures, that can be done by using curly brackets. i.e.
'dogcatcher' -replace '(?<pet>dog|cat)','I have a pet ${pet}. '
Gives:
I have a pet dog. I have a pet cat. cher
Issue with multiple captures & solution
You can't replace multiple values in the same replace statement, since the replacement string is used for everything. i.e. if you did this:
'dogcatcher' -replace '(?<pet>dog|cat)|(?<singer>cher)','I have a pet ${pet}. I like ${singer}''s songs. '
You'd get:
I have a pet dog. I like 's songs. I have a pet cat. I like 's songs. I have a pet . I like cher's songs.
...which is probably not what you're hoping for.
Rather, you'd have to do a match per item:
'dogcatcher' -replace '(?<pet>dog|cat)','I have a pet ${pet}. ' -replace '(?<singer>cher)', 'I like ${singer}''s songs. '
...to get:
I have a pet dog. I have a pet cat. I like cher's songs.
More Complex Solution
Bringing this back to your scenario, you're not actually using the captured values; rather you're hoping to replace the spaces they were in with new values. For that, you'd simply want this:
$input = 'I''m running Programmer''s Notepad version 2.4.2.1440, and am a big fan. I also have Chrome v 56.0.2924.87 (64-bit).'
$version1 = 1
$version2 = 3
$version3 = 5
$version4 = 7
$v1Pattern = '(?<=\bv(?:ersion)?\s+)\d+(?=\.\d+\.\d+\.\d+)'
$v2Pattern = '(?<=\bv(?:ersion)?\s+\d+\.)\d+(?=\.\d+\.\d+)'
$v3Pattern = '(?<=\bv(?:ersion)?\s+\d+\.\d+\.)\d+(?=\.\d+)'
$v4Pattern = '(?<=\bv(?:ersion)?\s+\d+\.\d+\.\d+\.)\d+'
$input -replace $v1Pattern, $version1 -replace $v2Pattern, $version2 -replace $v3Pattern,$version3 -replace $v4Pattern,$version4
Which would give:
I'm running Programmer's Notepad version 1.3.5.7, and am a big fan. I also have Chrome v 1.3.5.7 (64-bit).
NB: The above could be written as a 1 liner, but I've broken it down to make it simpler to read.
This takes advantage of regex lookarounds; a way of checking the content before and after the string you're capturing, without including those in the match. i.e. so when we select what to replace we can say "match the number that appears after the word version" without saying "replace the word version".
More info on those here: http://www.regular-expressions.info/lookaround.html
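As a minimal standalone illustration of the lookbehind idea (a sketch, not part of the original answer):
# Replace only the digits that follow 'version ', leaving the word itself untouched.
'Notepad version 2' -replace '(?<=version )\d+', '9'   # -> 'Notepad version 9'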
Your Example
Adapting the above to work for your example (i.e. where versions may be separated by commas or dots, and there's no consistency to their format beyond being 4 sets of numbers):
$input = @'
#define SOME_MACRO(4, 1, 0, 0)
Version "1.2.3.4"
SomeStruct vs = { 99,99,99,99 }
'@
$version1 = 1
$version2 = 3
$version3 = 5
$version4 = 7
$v1Pattern = '(?<=\b)\d+(?=\s*[\.,]\s*\d+\s*[\.,]\s*\d+\s*[\.,]\s*\d+\b)'
$v2Pattern = '(?<=\b\d+\s*[\.,]\s*)\d+(?=\s*[\.,]\s*\d+\s*[\.,]\s*\d+\b)'
$v3Pattern = '(?<=\b\d+\s*[\.,]\s*\d+\s*[\.,]\s*)\d+(?=\s*[\.,]\s*\d+\b)'
$v4Pattern = '(?<=\b\d+\s*[\.,]\s*\d+\s*[\.,]\s*\d+\s*[\.,]\s*)\d+\b'
$input -replace $v1Pattern, $version1 -replace $v2Pattern, $version2 -replace $v3Pattern,$version3 -replace $v4Pattern,$version4
Gives:
#define SOME_MACRO(1, 3, 5, 7)
Version "1.3.5.7"
SomeStruct vs = { 1,3,5,7 }
Regular expressions don't work that way, so you can't. Not directly, that is. What you can do (short of using a more appropriate regular expression that groups the parts you want to keep) is to extract the version string and then in a second step replace that substring with the new version string:
$oldver = $input -replace $regexp, '$1,$2,$3,$4'
$newver = $input -replace $oldver, "$Version1,$Version2,$Version3,$Version4"
Edit:
If you don't even know the structure, you must extract that from the regular expression as well.
$version = @($version1, $version2, $version3, $version4)
$input -match $regexp
$oldver = $regexp
$newver = $regexp
for ($i = 1; $i -le 4; $i++) {
    $oldver = $oldver -replace "\(\?<version$i>\\d\)", $matches["version$i"]
    $newver = $newver -replace "\(\?<version$i>\\d\)", $version[$i-1]
}
$input -replace $oldver, $newver

split first from rest of list using regex substitution

I need to split a list between its first item and the rest of its items using regex substitution only.
The lists of items are input as strings using '##' as a separator, e.g.:
''
'one'
'one##two'
'one##two##three'
'one##two words##three'
My Perl attempt doesn't really work:
my $sampleText = 'one##two words##three';
my $first = $sampleText;
my $rest = $sampleText;
$first =~ s/(.+?)(##.*)?/$1/g;
$rest =~ s/(.?+)(##)?(.*)/$3/g;
print "sampleText = '$sampleText', first = '$first', rest = '$rest'\n";
sampleText = 'one##two words##three', first = 'one', rest = 'ne##two words##three'
Please note the constraints:
the separator is a multi-character string
only regex substitutions are allowed (1)
I could "chain" regex substitutions if necessary
The expected end result is two strings: the first element, and the initial string with the first element cut off (2)
the list may have from 0 to n items, each being any string not containing the separator.
(1) I work with this rather large Perl system where at some point lists of items are processed using provided operations. One of them is a regex substitution. None of the other ones are applicable. Solving the problem using full Perl code is easy, but that would mean modifying the system, which is not an option at this time.
(2) the context is the Unimarc bibliographic format, where authors of a publication are to be split into the standard Unimarc fields 700$a for the first author, and 701$a for any remaining authors.
I assume point (1) means you cannot use the split builtin? It would be easy using split's optional third parameter, which lets you specify the maximum number of items.
my( $first, $rest ) = split( '##', $sampleText, 2 );
But if it has to be a regex replace, then yours is almost right; however, using .+? won't work when there are no separators (because it will just take the first character). You can fix this by anchoring the end, with something like:
my $sampleText = 'one##two words##three';
my $first = $sampleText;
my $rest = $sampleText;
$first =~ s/(.+?)(|##(.*))$/$1/g;
$rest =~ s/(.+?)(|##(.*))$/$3/g;
print "sampleText = '$sampleText', first = '$first', rest = '$rest'\n";
Whatever is the matter with :
my ( $first, $rest ) = split /##/, $sampleText, 2;
?
try
my ($first, $rest) = /(.+?)\#\#(.*)/;
// (or, m//) is matching; you don't need to use s/// for substitution. It returns the matches (here, to $first, $rest), or you can capture them later using $1, $2, &c.
You have reversed the quantifiers ? and + in the second regex, it should be:
$rest =~ s/(.+?)(##)?(.*)/$3/g;
___^^
or more concise:
$rest =~ s/.+?##(.*)/$1/;
I'd just match, not substitute:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    m{([^#]*?)##(.*)} and print "[$1][$2]\n";
}
__DATA__
''
'one'
'one##two'
'one##two##three'
'one##two words##three'

Using Regex/Grep to grab lines from an array using an array of patterns

After searching everywhere on the web, and being a noob to perl, for a solution to this I have decided to post on Stack.
What I am looking to do is loop through array1, which contains the required matches (they will be different each time and could contain lots of patterns, i.e. strings that need to be matched, but I'm using this example so I can understand the problem), then test each element with a grep against array2, which contains some strings, and then print out the lines that grep found to match the patterns used.
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw( strftime );
my (@regressions, @current_test_summary_file, @regression_links);
@regressions = ("test", "table");
@current_test_summary_file = ("this is the line for test \n", "this is the line for table \n", "this is the line for to\n");
foreach (@regressions)
{
    print $_ . "\n";
    @regression_links = grep(/$_/, @current_test_summary_file);
}
foreach (@regression_links)
{
    print $_ . "\n";
}
So I would like to pick up only the first two elements instead of all three, which is what is happening now.
Hopefully I've explained my problem properly. I've tried quite a few things (using qq, for example) but have only really used grep for this (I'm unsure how I could approach it differently). If someone can point me in the right direction (and tell me whether I should be using grep at all to solve this problem, for that matter) I would be very grateful. I just tried the code below but still only get the second element; any ideas, anyone? (Sorry, I can't reply to your comment for some reason, but so you know, axeman, your second idea worked.)
foreach my $regression (@regressions)
{
    print $regression . "\n";
    @regression_links = grep(/$regression/, @current_test_summary_file);
}
Inside of grep, $_ refers to the list element involved in the test. Also, /abc/ is short for $_ =~ /abc/, so you're effectively testing $_ =~ /$_/. Guess what the answer is likely to be (with no metacharacters)?
So you're passing all values into @regression_links.
What you need to do is save the value of $_. But since you're not using the simple print statement, you could just as easily reserve the $_ variable for the grep, like so:
foreach my $reg ( @regressions ) {
    print "$reg\n";
    @regression_links = grep(/$reg/, @current_test_summary_file );
}
However, you're resetting @regression_links with each loop iteration, and a push would work better:
push @regression_links, grep(/$reg/, @current_test_summary_file );
However, a for loop is a bad choice for this anyway, because you could get duplicates and you might not want them. Since you're matching by regex, one alternative with multiple criteria is to build a regex alternation. But in order to get a proper alternation, we need to sort it by length of string descending and then by alphabetic order (cmp).
# create the alternation expression
my $filter
    = join( '|'
          , sort { length( $b ) <=> length( $a )
                       || $a cmp $b
                 }
            @regressions
          );
@regression_links = grep( /$filter/, @current_test_summary_file );
Or, rather than concatenating a regex, if you wanted to test them separately, the better way would be with something like List::MoreUtils::any:
@regression_links
    = grep {
          my $c = $_; # save $_
          any { /$c/ } @regressions;
      } @current_test_summary_file
    ;
Axeman is correct, and localising $_ with $reg will solve your problem. But as for pulling out matches, I would naively push all matches onto @regression_links, producing a list of (probably) multiple matches. You can then use List::MoreUtils::uniq to trim down the list. If you don't have List::MoreUtils installed you can just copy the function (it's 2 lines of code).
# Axeman's changes
foreach my $reg (@regressions) {
    print "regression: $reg\n";
    # Push all matches.
    push @regression_links, grep(/$reg/, @current_test_summary_file);
}
# Trim down the list once matching is done.
use List::MoreUtils qw/uniq/;
foreach ( uniq(@regression_links) ) {
    print "$_\n";
}