Output Substring to Newline from a Raw Text String using Regex - regex

I have a name delimiter that I want to use to extract the whole line where it is found.
[string]$testString = $null
# broken text string of text & newlines which simulates $testString = Get-Content -Raw
$testString = "initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text"
# test1
# simulate text string before(?<content>.*)text string after - this returns "initial text" only (no newline or anything after)
# $testString -match "(?<BOURKE>.*)"
# test2
# this returns all text, including the newlines, so that $testString outputs exactly as it is defined
$testString -match "(?s)(?<BOURKE>.*)"
#test3
# I want just the line with BOURKE
$result = $matches['BOURKE']
$result
#Test1 finds the match but only prints to the newline. #Test2 finds the match and includes all newlines. I would like to know what is the regex pattern that forces the output to begin 001 BOURKE ...
Any suggestions would be appreciated.

Note:
I'm assuming you're looking for the whole line on which BOURKE appears as a substring.
In your own attempts, (?<BOURKE>...) simply gives the regex capture group a self-chosen name (BOURKE), which is unrelated to what the capture group's subexpression (...) actually matches.
For the use case at hand, there's no strict need to use a (named) capture group at all, so the solutions below make do without one, which, when the -match operator is used, means that the result of a successful match is reported in index [0] of the automatic $Matches variable, as shown below.
If your multiline input string contains only Unix-format LF newlines (\n), use the following:
if ($multiLineStr -match '.*BOURKE.*') { $Matches[0] }
Note:
To match case-sensitively, use -cmatch instead of -match.
If you know that the substring is preceded / followed by at least one char., use .+ instead of .*
If you want to search for the substring verbatim and it contains, or may contain, regex metacharacters (e.g. . ), apply [regex]::Escape() to it; e.g., [regex]::Escape('file.txt') yields file\.txt (\-escaped metacharacters).
If necessary, add additional constraints for disambiguation, such as requiring that the substring start or end only at word boundaries (\b)
If there's a chance that Windows-format CRLF newlines (\r\n) are present, use:
if ($multiLineStr -match '.*BOURKE[^\r\n]*') { $Matches[0] }
For an explanation of the regexes and the ability to experiment with them, see this regex101.com page for .*BOURKE.*, and this one for .*BOURKE[^\r\n]*
In short:
By default, . matches any character except \n, which obviates the need for line-specific anchors (^ and $) altogether, but with CRLF newlines requires excluding \r so as not to capture it as part of the match.[1]
Two asides:
PowerShell's -match operator only ever looks for one match; if you need to find all matches, you currently need to use the underlying [regex] API directly; e.g., [regex]::Matches($multiLineStr, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value. GitHub issue #7867 suggests bringing this functionality directly to PowerShell in the form of a -matchall operator.
If you want to anchor the substring to find, i.e. if you want to stipulate that it either occur at the start or at the end of a line, you need to switch to multi-line mode ((?m)), which makes ^ and $ match on each line; e.g., to only match if BOURKE occurs at the very start of a line:
if ($multiLineStr -match '(?m)^BOURKE[^\r\n]*') { $Matches[0] }
If line-by-line processing is an option:
Line-by-line processing has the advantage that you needn't worry about differences in newline formats (assuming the utility handling the splitting into lines can handle both newline formats, which is true of PowerShell in general).
If you're reading the input text from a file, the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches, additionally offers streaming processing, i.e. it reads lines one by one, without the need to read the whole file into memory.
(Select-String -LiteralPath file.txt -Pattern BOURKE).Line
Add -CaseSensitive for case-sensitive matching.
The following example simulates the above (-split '\r?\n' splits the multiline input string into individual lines, recognizing either newline format):
(
@'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
Select-String -Pattern BOURKE
).Line
Output:
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
[1] Strictly speaking, the [^\r\n]* would also stop matching at a \r character in isolation (i.e., even if not directly followed by \n). If ruling out that case is important (which seems unlikely), use a (simplified version of) the regex suggested by Mathias R. Jessen in a comment on the question: .*BOURKE.*?(?=\r?\n)

I find it best to have the match consume everything up to what is not needed, i.e. the \r\n. A negated character set does that: [^\r\n]+ consumes characters until it reaches either a \r or a \n, so it matches everything on the line except the line terminator.
To do that, use:
$testString -match "(?<Bourke>\d\d\d\s[^\r\n]+)"
Also, try to avoid * when you know there will be text to match. Both * and + are greedy, but * also allows zero occurrences, so the engine has to consider empty matches that are plainly not plausible here. Using + (one or more) rules those out and limits the backtracking the engine has to do.

Related

Powershell regex missing ones with CR etc

I'm working on a regular expression to extract a map of key and associated string.
For some reason, it's working for lines that don't show a line split, but misses where there are line splits.
This is what I'm using:
$errorMap = [ordered]@{}
# process the lines one-by-one
switch -Regex ($fileContent -split ';') {
'InsertCodeInfo\(([\w]*), "(.*)"' { # key etc., followed by string like "Media size cassette missing"
$key,$value = ($matches[1,2])|ForEach-Object Trim
$errorMap[$key] = $value
}
}
This is an example of $fileContent:
InsertCodeInfo(pjlWarnCommunications,
"communications error");
InsertCodeInfo(pjlNormalOnline,
"Online");
InsertCodeInfo(pjlWarnOffline,
"offline");
InsertCodeInfo(pjlNormalAccessing, "Accessing"); #this is first match :(
InsertCodeInfo(pjlNormalArrive, "Normal arrive");
InsertCodeInfo(pljNormalProcessing, "Processing");
InsertCodeInfo(pjlNormalDataInBuffer, "Data in buffer");
It's returning the pairs from pjlNormalAccessing down, where it doesn't have a line split. I thought that using the semicolon to split the regex content would fix it, but it didn't help. I was formerly splitting regex content with
'\r?\n'
I thought maybe there was something going on with VSCode so I have exited and re-opened it, and re-running the script had the same result. Any idea how to get it to match every InsertCodeInfo through the semicolon line with the key-value pair?
This is using VSCode and Powershell 5.1.
Update:
Someone asked how $fileContent is created:
I call my method with the filenamepath ($FileHandler), and from/to strings/methodNames ($matchFound2 becomes $fileContent later as a method parameter):
$matchFound2 = Get-MethodContents -codePath $FileHandler -methodNameToReturn "OkStatusHandler::PopulateCodeInfo" -followingMethodName "OkStatusHandler::InsertCodeInfo"
Function Get-MethodContents{
[cmdletbinding()]
Param ( [string]$codePath, [string]$methodNameToReturn, [string]$followingMethodName)
Process
{
$contents = ""
Write-Host "In GetMethodContents method File:$codePath method:$methodNameToReturn followingMethod:$followingMethodName" -ForegroundColor Green
$contents = Get-Content $codePath -Raw #raw gives content as single string instead of a list of strings
$null = $contents -match "($methodNameToReturn[\s\S]*)$followingMethodName" #| Out-Null
return $Matches.Item(1)
}#End of Process
}#End of Function
You can use
InsertCodeInfo\((\w+),\s*"([^"]*)
See the online regex demo.
Details:
InsertCodeInfo\( - a literal InsertCodeInfo( text
(\w+) - Group 1: one or more word chars (letters, digits, diacritics or underscores (connector punctuation))
, - a comma
\s* - zero or more whitespaces
" - a " char
([^"]*) - Group 2: zero or more chars other than a " char.
This regular expression seems to be catching all lines, including ones with a newline in the middle. Thanks for the suggestion, @WiktorStribizew. I tweaked your suggestion, and it helped.
InsertCodeInfo\(([\w]*),[\s]*"([^"]*)
It might not be the most succinct, but it's catching all lines. Feel free, as always, to post alternative suggestions. This is why I didn't accept my own answer.

Perl in-place edit: Find and replace in X12 850 formatted file

I am new to Perl and cannot figure this out. I have a file called Test:
ISA^00^ ^00^ ^01^SupplyScan ^01^NOVA ^180815^0719^U^00204^000000255^0^P^^
GS^PO^SupplyScan^NOVA^20180815^0719^00000255^X^002004
ST^850^00000255
BEG^00^SA^0000000059^^20180815
DTM^097^20180815^0719
N1^BY^^92^
N1^SE^^92^1
N1^ST^^92^
PO1^1^4^BX^40.000^^^^^^^^IN^131470^^^1^
PID^F^^^^CATH 6FR .070 MPA 1 100CM
REF^
PO1^2^4^BX^40.000^^^^^^^^IN^131295^^^1^
PID^F^^^^CATHETER 6FR XB 3.5
REF^
PO1^3^2^EA^48.000^^^^^^^^IN^132288^^^1^
PID^F^^^^CATH 6FR AL-1 SH
REF^
PO1^4^2^BX^48.000^^^^^^^^IN^131297^^^1^
PID^F^^^^CATHETER 6FR .070 JL4SH 100CM
REF^
CTT^4^12
SE^20^00000255
GE^1^00000255
IEA^1^00000255
What I am trying to do is an in-place edit, dropping any value in the N1^SE segment after the 92^. I tried this but I can't seem to make it work:
perl -i -pe 's/^N1\^SE\^\^92\^\d+$/N1^SE^^92^/g' Test
The final result should include the N1^SE segment looking like this:
N1^SE^^92^
It worked when I just had the one line in the file: N1^SE^^92^1. But when I try to substitute globally across the entire file, it doesn't work.
Thanks.
You may have missed some hidden characters or spaces when copying the data here. Those may well be at the end of the line, so try
perl -i -pe 's/^N1\^SE\^\^92\^\K.*//' Test
The \K is a special form of the "positive lookbehind": it keeps everything matched before it out of the reported match, so only the .* after it (the rest of the line) is removed by the substitution. †
This takes seriously the requirement "dropping any value ... after", as it matches lines with things other than the sole \d from the question's example.
Or use \Q...\E sequence to escape special characters (see quotemeta)
perl -i -pe 's/^\QN1^SE^^92^\E\K.*//' Test
per Borodin's comment.
Another take is to specifically match \d as in the question
s/^N1\^SE\^\^92\^\K\d+//
per ikegami's comment. This stays true to your patterns and it also doesn't remove whatever may be hiding at the end of the line.
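The effect of a hidden trailing character can be reproduced in a few lines. This is only a sketch: it assumes the file uses CRLF line endings (the question does not confirm this), and strip_n1_se is a hypothetical helper name. It shows the question's anchored pattern failing on such a line, while a \K-based substitution that stops at the line terminator still works.

```perl
use strict;
use warnings;

# Hypothetical helper: drop whatever follows "N1^SE^^92^" on a line,
# leaving any line terminator (\r, \n) in place.
sub strip_n1_se {
    my ($line) = @_;
    # \Q...\E quotes the carets; \K keeps the prefix out of the replaced text.
    $line =~ s/^\QN1^SE^^92^\E\K[^\r\n]*//;
    return $line;
}

# Assumed input: a CRLF-terminated line, as a Windows-edited file would have.
my $crlf_line = "N1^SE^^92^1\r\n";

# The question's pattern: $ only matches at the end or before a final \n,
# so the \r sitting between the digit and the \n makes it fail.
my $matched = $crlf_line =~ /^N1\^SE\^\^92\^\d+$/ ? 1 : 0;
print "original pattern matched: $matched\n";    # prints 0

my $fixed = strip_n1_se($crlf_line);
print "value removed\n" if $fixed eq "N1^SE^^92^\r\n";
```

Lines that do not start with N1^SE^^92^ pass through the helper unchanged.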
† The term "lookbehind" for \K is from documentation but, while \K clearly "looks behind," it has marked differences from how the normal lookbehind assertions behave.
Here is a striking example from ikegami. Compare
perl -le'print for "abcde" =~ /(?<=\w)\w/g' # prints lines: b c d e
and
perl -le'print for "abcde" =~ /\w\K\w/g' # prints lines: b d

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                        Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
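The difference Note 1 describes can be seen by running the same match in both contexts. This is a minimal sketch using the sentence from the question; the variable names $in_scalar and $in_list are only for illustration.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar context: the match only reports success (1) or failure.
my $in_scalar = ( $words =~ /anywhere\s+(\S+)/ );

# List context (parentheses around the target): the match returns the captures.
my ($in_list) = ( $words =~ /anywhere\s+(\S+)/ );

print "scalar context: $in_scalar\n";    # prints 1
print "list context:   $in_list\n";      # prints except
```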
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to put parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also put parentheses around the =~ binding operator to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the word character class (\w, the shorthand for the POSIX-style [[:word:]]):
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience, because you can write my ($a, $b, $c, @rest) as the equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables, as in (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between the two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Perl regex weird behavior: works with smaller strings, fails with longer repetitions of the smaller ones

Here is a regex in Perl that I use to identify strings matching this pattern: any number of occurrences of any character except a single quote ' or a backslash \; only escaped occurrences of ' or \ are allowed, i.e. \' and \\ respectively; and finally the string has to end with a (non-escaped) single quote '.
foo.pl
#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
foreach $line( <STDIN> ) {
$line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
$matchString = $1;
print "matchString:$matchString\n"
}
}
It seems to work fine for strings like :
./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
Then I create a file with the following recurring pattern :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG
By creating a string by repeating this pattern one or more times and adding a single quote ' at the end should match the reg exp. I created a file called zz3 with 16 repetitions of the above pattern. I created then a file called ZZ6 with 18 repetitions of zz3 and another one called ZZ7 with the contents of ZZ6 + one additional instance of zz3, hence 19 repetitions of zz3.
By adding a single quote at the end of zz3 it results in a match. By adding a single quote at the end of ZZ6 it also results in a match as expected.
Now here is the tough part, by adding a single quote at the end of ZZ7 does not result in a match!
here is a link to the 3 files :
https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing
The Perl version I am using is v5.16.3 on FreeBSD, but I tried various versions on both FreeBSD and Linux with identical results. It seems to me that either Perl has a problem with the size going from 34274 bytes (ZZ6) to 36178 bytes (ZZ7), or I am missing something badly.
Your regular expression leads to catastrophic backtracking because you have nested quantifiers.
If you change it to
(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))
(using possessive quantifiers to avoid backtracking), it should work.
I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.
Following this discussion, it is clear that perl cannot match a repetition of a non-dot (.) pattern more than 32766 (= 32K-2) times.
The solution is to mask the \\ and \' sequences with characters that are certain not to appear in the input, such as Device Control 1 (\x11) and Device Control 2 (\x12) (shown as ^Q and ^R in vi, respectively):
$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;
then try to match non greedily any input till the first single quote.
$dataField =~ m/(^.*?\')/s;
$matchString = $1;
and finally substitute the above Ctrl chars back to their initial values
$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;
This is very fast. Another solution would be to scan to the first single quote and count the backslashes immediately before it. If the count is even, the quote is not escaped, so we have found the end of the desired match. Otherwise the quote is an escaped one and is part of the text, so we keep what we have, move on to the next single quote and repeat the same logic, appending each piece to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes.
Perl regexes seem to be much faster than the equivalent Perl code.
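For comparison, the backslash-counting alternative described above can be sketched as follows. This is an illustration rather than the code from the original program: match_to_unescaped_quote is a hypothetical name, and unlike the regex it does not reject a lone backslash in front of an ordinary character.

```perl
use strict;
use warnings;

# Hypothetical helper: return the prefix of $s up to and including the first
# single quote that is preceded by an even number of backslashes
# (i.e. the first quote that is not escaped). Returns nothing if there is none.
sub match_to_unescaped_quote {
    my ($s) = @_;
    my $pos = 0;
    while ( ( my $q = index( $s, "'", $pos ) ) >= 0 ) {
        # Count the backslashes immediately before the quote.
        my $n = 0;
        $n++ while $q - $n - 1 >= 0 && substr( $s, $q - $n - 1, 1 ) eq "\\";
        return substr( $s, 0, $q + 1 ) if $n % 2 == 0;    # even: not escaped
        $pos = $q + 1;                                    # odd: escaped, keep going
    }
    return;
}

# In this sample the first quote is escaped, the second one is not.
my $sample = "ab\\'cd'ef";
print match_to_unescaped_quote($sample), "\n";    # prints ab\'cd'
```

Because it uses no regex repetition, this version is not subject to a repetition limit, at the cost of the per-quote loop the answer describes as slow.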

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:
which matches the text before the colon. But I want the part after the email address, up to the colon.
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last@site.domain.com> - Email address in format FirstName.LastName@sub.domain.com
1234, 4657 - database primary keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in the example, 1234 and 4657 are two separate IDs) and query the tables.
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
You can fudge the parts you don't care about to make the expression easier, say by just 'globbing' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two capture groups, the (1234) and the (1234, 4657); I can only assume from your pattern that the second one means "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return } # bare return: empty list in list context, undef in scalar context
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
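As a usage sketch, the loose regex can be run against the two sample lines from the question. The version of extract_nums below is a self-contained restatement for illustration: it copies $1 into a lexical before matching against it and uses a bare return on failure, both of which are choices made here rather than part of the answer above.

```perl
use strict;
use warnings;

# Return the database IDs found between the <...> part and the colon.
sub extract_nums {
    my ($string) = @_;
    return unless $string =~ /<[^>]*> *([\d, ]+):/;
    my $ids = $1;             # copy before running another match
    return $ids =~ /\d+/g;    # list of the digit strings
}

# Sample lines from the question.
my @lines = (
    '(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf',
    '(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf',
);

for my $line (@lines) {
    my @ids = extract_nums($line);
    print join( '|', @ids ), "\n";    # prints 1224, then 1234|4657
}
```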
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar@domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar@domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.