Extract part of text in PowerShell - regex

This is my input file. The values are random: the digits can be any number, not just 9999, and the letters can be any letters.
The format below always comes after a line containing a - (dash).
-
9999 99AKDSLY9ZWSRK99999
9999 99BGRPOE99FTRQ99999
Expected output:
AKDSLY9ZSRK
BGRPOE99TRQ
So I need to remove the first part of each line, always numbers:
9999 99
9999 99
Then remove the not-required characters:
99AKDSLY9ZW → in this case is the W but could be any letter
99BGRPOE99F → in this case is the F but could be any letter
And finally remove the last 5 digits, always numbers:
99999
99999
What I'm trying to use, regex (first time using it):
$result = [regex]::Matches($InputFile, '(^\d{4}\s\d{2}[A-Z0-9]\d{5}$)') -replace '\d{4}\s\d{2}', ''
$result
It's not giving me an error message but it's not showing me the characters I'm expecting to see at $result.
I was expecting to see something in $result to then start the formatting, deleting the characters I don't need.
What could be missing here, please?

Try something like this:
$str = (Get-Content ... -Raw) -replace '\r'
$cb = {
$args[0].Groups[1].Value -replace '(?m)^.{7}' -replace '(?m).(.{3}).{5}$', '$1'
}
$re = [regex]'(?m)^(?<=-\n)((?:\d{4}\s\d{2}[^\n]*\d{5}(?:\n|$))+)'
$re.Replace($str, $cb)
The regular expression $re matches multiline substrings made of one or more lines with your digit/letter combinations, provided they directly follow a line containing a hyphen. The (?<=...) is a positive lookbehind assertion: it ensures that you only get a match when those lines are preceded by a hyphen and a newline, without making the hyphen line part of the actual match.
The scriptblock $cb is an anonymous callback function that the Regex.Replace() method calls on each match. For each line in a match it removes the first 7 characters from the beginning of the line, and replaces the last 9 characters of the line with the 2nd through 4th of those characters. That drops both the unwanted letter and the trailing 5 digits.
For simplicity the sample code removes carriage return characters (CR, \r) from the string, so that all newlines are linefeed characters (LF, \n) instead of the default CR-LF.
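For comparison, here is a minimal Python sketch of the same two-step idea: find the block that follows the dash line, then trim each line. It is not part of the PowerShell answer. The name extract_codes and the sample string are made up for illustration, and the block is captured with a group rather than a lookbehind.

```python
import re

# A dash on its own line, followed by one or more data lines.
BLOCK_RE = re.compile(r"(?m)^-\n((?:\d{4}\s\d{2}[^\n]*\d{5}(?:\n|$))+)")


def extract_codes(text):
    """Return the trimmed code from every data line that follows a dash line."""
    text = text.replace("\r", "")  # normalise CR-LF to LF, as the answer does
    results = []
    for block in BLOCK_RE.finditer(text):
        for line in block.group(1).splitlines():
            trimmed = line[7:]  # drop the leading "9999 99"
            # Replace the last 9 characters with the 2nd..4th of them.
            results.append(re.sub(r".(.{3}).{5}$", r"\1", trimmed))
    return results


sample = "header\n-\n9999 99AKDSLY9ZWSRK99999\n9999 99BGRPOE99FTRQ99999\n"
print(extract_codes(sample))  # ['AKDSLY9ZSRK', 'BGRPOE99TRQ']
```

Lines that are not preceded by a dash line are ignored, which mirrors what the lookbehind does in the PowerShell version.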

Related

Parse SWIFT(Financial) message string with REGEX in Powershell

I am working on a PowerShell script to parse SWIFT messages (text based) into a database. I am using regex to find the appropriate strings in the file and extract them. I have now run into the issue that one of the data fields can contain CR/LF characters. In the example below I would need to extract the second line as well.
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
I tested the regex pattern (:61:.*[\r\n].*) in RegExr, and it treats the [\r\n] characters as required for a match. My plan was therefore to have two expressions, one with and one without CR/LF characters, to identify both kinds of message (with and without a line break). However, the code below returns all matches regardless of whether a line break is included; it seems that PowerShell stops evaluating strings after CR/LF.
$transaction = $swift | select-string ':61:.*[\r\n].*' -AllMatches | % { $_.Matches } | % { $_.Value }
Can I use REGEX for this task or do I have to create a function to read the entire string and check for the next line tag to determine the end of this string?
Describe the first line more accurately, then whatever is left is necessarily the message:
$swift = @'
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
'@
$swift |Select-String -Pattern '(?m):\d+:[^,]+,[^/]+//\d+MT\d+[\s\r\n]+.*$'
The regex pattern breaks down as follows:
(?m) # Multi-line mode, this will make `$` match end-of-line positions as well as end-of-string
:\d+: # 1 or more digits, surrounded by colons, matches `:61:`
[^,]+, # 1 or more non-commas followed by a comma, matches `2111261126D12000,`
[^/]+// # 1 or more non-slashes, followed by 2 slashes, matches `00NTRF11000004217657P//`
\d+MT\d+ # 1 or more digits followed by `MT` and more digits, matches `03MT211124101166`
[\s\r\n]+ # 1 or more white-space/CR/LF characters
.*$ # everything until the end of the current line, matches `JANE DOE 1232`
Since we're using [\s\r\n]+ to describe the potential line break, it'll still work when the linebreak is replaced with other whitespace characters.
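One thing worth checking is whether the pattern ever sees the line break at all. Select-String matches each input string separately, so if $swift holds an array of lines (for example from Get-Content without -Raw), no pattern can span two lines. The Python sketch below illustrates the difference between matching line by line and matching the whole text; the variable names are made up for illustration.

```python
import re

# Same pattern as in the answer above.
FIELD_61 = re.compile(r"(?m):\d+:[^,]+,[^/]+//\d+MT\d+[\s\r\n]+.*$")

message = (
    ":61:2111261126D12000,00NTRF11000004217657P//03MT211124101166\n"
    "JANE DOE 1232\n"
)

# Line by line: the pattern can never reach the second line.
per_line = []
for line in message.splitlines():
    found = FIELD_61.search(line)
    if found:
        per_line.append(found.group(0))

# Whole text as one string: the line break is visible to the pattern.
whole = FIELD_61.search(message).group(0)

print(per_line)     # []
print(repr(whole))  # both lines, joined by "\n"
```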

PowerShell Regular Expression match Y or Z

I am trying to match some strings using a regular expression in PowerShell, but I am running into difficulty because the original strings I'm extracting from come in differing formats. I admittedly am not very strong at creating regular expressions.
I need to extract the numbers from each of these strings. They can vary in length, but in both cases they will be preceded by FOO
PC1-FOO1234567
PC2-FOO1234567/FOO98765
This works for the second example:
'PC2-FOO1234567/FOO98765' -match 'FOO(.*?)\/FOO(.*?)\z'
It lets me access the matched strings using $matches[1] and $matches[2] which is great.
It obviously doesn't work for the first example. I suspect I need some way to match on either / or the end of the string but I'm not sure how to do this and end up with my desired match.
Suggestions?
You may use
'FOO(.*?)(?:/FOO(.*))?$'
It will match FOO, then capture any 0 or more chars as few as possible into Group 1 and then will attempt to optionally match a sequence of patterns: /FOO, any 0 or more chars as many as possible captured into Group 2 and then the end of string position should follow.
See the regex demo
Details
FOO - literal substring
(.*?) - Group 1: any zero or more chars other than newline, as few as possible
(?:/FOO(.*))? - an optional non-capturing group matching 1 or 0 repetitions of:
/FOO - a literal substring
(.*) - Group 2: any 0+ chars other than newline as many as possible (* is greedy)
$ - end of string.
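The optional-group pattern can be tried outside PowerShell as well. Below is a small Python sketch, with the helper name foo_numbers made up for illustration. Note one difference: PowerShell's -match is case-insensitive by default, while Python's re is case-sensitive unless re.IGNORECASE is passed.

```python
import re

FOO_RE = re.compile(r"FOO(.*?)(?:/FOO(.*))?$")


def foo_numbers(text):
    """Return the captured numbers, skipping an optional group that did not take part."""
    found = FOO_RE.search(text)
    if found is None:
        return []
    return [group for group in found.groups() if group is not None]


print(foo_numbers("PC1-FOO1234567"))           # ['1234567']
print(foo_numbers("PC2-FOO1234567/FOO98765"))  # ['1234567', '98765']
```

When the /FOO part is absent, Group 2 simply does not participate in the match, so the caller has to allow for it being empty (None in Python, no $Matches[2] entry in PowerShell).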
[edit - removed the unneeded pipe to Where-Object. thanks to mklement0 for that! [*grin*]]
this is a somewhat different approach. it splits on the foo, then replaces the unwanted / with nothing, and finally filters out any string that contains letters.
the pure regex solutions others offered will likely be faster, but this may be slightly easier to understand - and therefore to maintain. [grin]
# fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
PC1-FOO1234567
PC2-FOO1234567/FOO98765
'@ -split [environment]::NewLine
$InStuff -split 'foo' -replace '/' -notmatch '[a-z]'
output ...
1234567
1234567
98765
To offer a more concise alternative with the -split operator, which obviates the need to access $Matches afterwards to extract the numbers:
PS> 'PC1-FOO1234568', 'PC2-FOO1234567/FOO98765' -split '(?:^PC\d+-|/)FOO' -ne ''
1234568 # single match from 1st input string
1234567 # first of 2 matches from 2nd input string
98765
Note: -split always returns a [string[]] array, even if only 1 string is returned; result strings from multiple input strings are combined into a single, flat array.
^PC\d+-|/ matches PC followed by 1 or more (+) digits (\d) at the start of the string (^) or (|) a / char., which matches both PC2-FOO at the beginning and /FOO.
(?:...), a non-capturing subexpression, must be used to prevent -split from including what the subexpression matched in the results array.
-ne '' filters out the empty elements that result from the input strings starting with a separator.
To learn more about the regex-based -split operator and in what ways it is more powerful than the string literal-based .NET String.Split() method, see this answer.
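The role of the non-capturing group is easy to see in other regex engines too. Python's re.split behaves like -split in this respect: text matched by a capturing group is included in the result, text matched by (?:...) is not. This is a sketch only; split_numbers is a made-up helper name.

```python
import re

SEP_RE = re.compile(r"(?:^PC\d+-|/)FOO")


def split_numbers(strings):
    """Split every input on the separator and drop the empty leading pieces."""
    numbers = []
    for text in strings:
        numbers.extend(piece for piece in SEP_RE.split(text) if piece != "")
    return numbers


print(split_numbers(["PC1-FOO1234568", "PC2-FOO1234567/FOO98765"]))
# ['1234568', '1234567', '98765']

# With a capturing group the separators leak into the result:
leaky = re.split(r"(^PC\d+-|/)FOO", "PC2-FOO1234567/FOO98765")
print(leaky)  # ['', 'PC2-', '1234567', '/', '98765']
```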

Matching numbers with non-digits embedded

I am trying to match strings of digits that contain non-digits within them. Using the default text in http://regexr.com/, the following should match:
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
The following should not match:
0123456789
12345
I tried:
/[^\n\ ]{1,}\d+\S+\d/g
But it would not match +42 and it incorrectly matched 0123456789 and 12345, and it treated "555.123.4567 +1-(800)-555-2468" as one string.
I tried to alleviate it by putting a $ at the end but that matched nothing. Not sure what I am doing wrong.
You can use this regex to match any text with at least one non-digit:
/^\d*[^\d\n]+\d.*$/mg
RegEx Demo
RegEx Breakup:
^ - Start
\d* - Match 0 or more digits
[^\d\n]+ - Match 1 or more of any character that is not a digit and not a newline
\d - Match a digit
.* - Match 0 or more of any character
$ - End
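A quick way to check this pattern is to run it over the sample strings from the question, one per line. The Python sketch below does that; the variable names are illustrative, and the /mg flags are expressed as re.MULTILINE plus findall.

```python
import re

PATTERN = re.compile(r"^\d*[^\d\n]+\d.*$", re.MULTILINE)

text = "\n".join([
    "v2.1", "-98.7", "3.141", ".6180", "9,000", "+42",
    "555.123.4567", "+1-(800)-555-2468",
    "0123456789", "12345",
])

# findall returns the whole match because the pattern has no groups.
matched = PATTERN.findall(text)
print(matched)
```

The two digit-only lines are the only ones missing from the output, because [^\d\n]+ needs at least one non-digit before a digit.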
Try this:
^(?=.*\d)(?=.*[^\d\s])\S+$
This means "at least one digit and one non-digit and no whitespace".
See live demo.
If there were no newlines in your input, you could use the slightly simpler:
^(?=.*\d)(?=.*\D)\S+$
Aren't you over-thinking this massively? What's wrong with using /\D/ to match a string that contains a non-digit?
I'm not sure what your exact requirements are, but if you're looking for a string that contains at least one digit and at least one non-digit, then the easiest approach is to use two regex matches - /\d/ && /\D/.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
while (<DATA>) {
chomp;
say "$_: " . (/\d/ && /\D/ ? 'matches' : 'doesn\'t match');
}
__DATA__
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
0123456789
12345
Looks like you want to dodge strings made up entirely of digits, or entirely of letters. So you can exclude those. That will also let in strings without any numbers, so also require a number.
my $exclude = qr/(?: [0-9]+ | [A-Za-z]+ )/x;
my @res = grep { not /^$exclude$/ and /\d/ } @strings;
If any other characters need be excluded (underscore?), add it to the list.
It is not clear how your input is coming, this takes a list of ready strings. Add word boundaries and/or /s, depending on the input. Or parse the input into a list of strings for this.
If input comes as a multi-line string, my @strings = split '\n|\s+', $text;.
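The same exclude-then-require idea can be sketched in Python. The helper name mixed_tokens is made up, and splitting on any whitespace is an assumption about how the input arrives.

```python
import re

# Tokens made up entirely of digits, or entirely of letters, are rejected.
EXCLUDE = re.compile(r"[0-9]+|[A-Za-z]+")


def mixed_tokens(text):
    """Keep tokens that contain a digit but are not purely digits or purely letters."""
    return [
        token
        for token in text.split()
        if not EXCLUDE.fullmatch(token) and re.search(r"\d", token)
    ]


print(mixed_tokens("v2.1 -98.7 hello 12345 +42\n0123456789 9,000"))
# ['v2.1', '-98.7', '+42', '9,000']
```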

Perl multiline regex for first 3 individual items

I am trying to parse a text format with a regex in Perl. Sometimes a record is a single line, but sometimes the same format is spread over 3 lines.
For the below single line format I can regex as
/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/
to get the first 3 individual items in line
Hi There FirstName.LastName 10 3/23/2011 2:46 PM
Below is the multi-line format I see. I am trying to use something like
/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m
to get individual items but it doesn't seem to work.
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Any suggestions? Is multi-line regex possible?
NOTE: In the same output I can see single-line records, multi-line records, or both, so the output can look like the below
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM
Hello Line2
Line2FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
You can for sure apply regex over multiple lines.
I've used the negated word class \W+ between words to match spaces and newlines between words (\W is equivalent to [^a-zA-Z0-9_]).
The chat is treated as a repeated \w+\W+ block.
If you provide a more specific input/output case, I can refine the example code:
#!/usr/bin/env perl
my $input = <<'__END__';
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
__END__
my ($chat,$username,$chars,$timestamp) = $input =~ m/(?im)^\s*((?:\w+\W+)+)(\w+[-,\.]\w+)\W+(\d+)\W+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)/;
$chat =~ s/\s+$//; #remove trailing spaces
print "chat -> ${chat}\n";
print "username -> ${username}\n";
print "chars -> ${chars}\n";
print "timestamp -> ${timestamp}\n";
Legend
m/^.../ match regex (not substitute type) starting from start of line
(?im): case insensitive search and multiline (^/$ match start/end of line also)
\s* match zero or more whitespace chars (matches spaces, tabs, line breaks or form feeds)
((?:\w+\W+)+): (match group $chat) match one or more repetitions of a pattern made of a single word \w+ (letters, digits, '_') followed by non-word characters \W+ (everything that is not \w, including the newline \n). Trailing whitespace is removed from this group afterwards.
(\w+[-,\.]\w+): (match group $username) this is the weak point. If the username is not made of two regex words separated by a dash '-', a comma ',' or (UPDATE) a dot '.', the entire regex cannot work properly (I took both variants from your question, since the format is not specified directly).
(\d+): (match group $chars) a number made of one or more digits
([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m): (match group $timestamp) this is longer than the others, so let's split it up:
[0-1]?\d\/[0-3]?\d\/[1-2]\d{3} match a date composed by month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
[0-2]?\d:[0-5]?\d\s?[ap]m match the time: hour:minutes,optional space and 'pm,PM,am,AM,Am,Pm...' thanks to the case insensitive modifier above
You can test it online here
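For readers who want to try the pattern outside Perl, here is a rough Python port run against the three-line sample only. The variable names are made up, and no claim is made about how the pattern behaves on input containing several records.

```python
import re

RECORD_RE = re.compile(
    r"(?im)^\s*((?:\w+\W+)+)"   # chat: words separated by non-word characters
    r"(\w+[-,.]\w+)\W+"         # username: two words joined by '-', ',' or '.'
    r"(\d+)\W+"                 # chars: one or more digits
    r"([0-1]?\d/[0-3]?\d/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)"  # timestamp
)

sample = (
    "Hi There\n"
    "FirstName-LastName 8 7/17/2015 1:15 PM\n"
    "Testing - 12323232323 Hello There\n"
)

found = RECORD_RE.search(sample)
chat, username, chars, timestamp = found.groups()
chat = chat.rstrip()  # remove the trailing newline captured by \W+

print(chat)       # Hi There
print(username)   # FirstName-LastName
print(chars)      # 8
print(timestamp)  # 7/17/2015 1:15 PM
```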
Your regex says:
^\s*(.*)\n*\n* # line starts with optional space followed by anything
| # or
\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$ # spaces followed by any words followed by spaces, digits, spaces, anything at the end of the line
Consider this:
/^From|To$/
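Alternation has the lowest precedence of all regex operators, so each alternative extends as far as it can in both directions, anchors included.
The above is really saying: find a line that starts with 'From', or a line that ends with 'To'. The ^ applies only to the first alternative and the $ only to the second.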
Compare to this:
/^(From|To)$/
Above will find lines that only have 'From' or 'To'
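The difference between the two forms is easy to verify. Alternation binds more loosely than concatenation and anchors, so without the parentheses ^ belongs only to the first alternative and $ only to the second. The Python sketch below uses made-up sample strings; the behaviour is the same in Perl.

```python
import re

loose = re.compile(r"^From|To$")     # (^From) or (To$)
strict = re.compile(r"^(From|To)$")  # the whole string is From or To

samples = ["From", "To", "Fromage", "Reply To", "Fro"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['From', 'To', 'Fromage', 'Reply To']
print(strict_hits)  # ['From', 'To']
```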

What does this regular expression try to match?

These days I am learning regular expressions, but they seem a little hard to me. I am reading some code in TCL, but what does it want to match?
regexp ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]" $input
If you un-escape the characters, you get the following:
.* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]
The term [\d]{x} would match x number of consecutive digits. Therefore, the portion inside the parentheses would match something of the form ###:###:###?##### (where # can be any digit and ? can be any character). The parentheses themselves aren't matched, they're just used for specifying what part of the input to "capture" and return to the caller. Following this sequence is a single dot ., which matches a single character (which can be anything). The trailing [^\n] will match a single character that is anything except a newline (a ^ at the start of a bracketed expression inverts the match). The .* term at the very beginning matches a sequence of characters of any length (even zero), followed by a space.
With all of this taken into account, it appears that this regular expression extracts a series of digits from the middle of a line. Given the format of the numbers, it may be looking for a timestamp in the hours:minutes:seconds.milliseconds format (although if that is the case, {1,3} and {1,5} should be used instead). The trailing .[^\n] term looks like it could be trying to exclude timestamps that are at or near the end of a line. Timestamped logs often have a timestamp followed by some sort of delimiting character (:, >, a space, etc). A regular expression like this might be used to extract timestamps from the log while ignoring "blank" lines that have a timestamp but no message.
Update:
Here's an example using TCL 8.4:
% set re ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]"
% regexp $re "TEST: 123:456:789:12345> sample log line"
1
% regexp $re " 111:222:333.44444 foo"
1
% regexp $re "111:222:333.44444 foo"
0
% regexp $re " 111:222:333.44444 "
0
% regexp $re " 10:44:56.12344: "
0
%
% regexp $re "TEST: 123:456:789:12345> sample log line" match data
1
% puts $match
TEST: 123:456:789:12345>
% puts $data
123:456:789:12345
The first two examples match the expression. The third fails because it lacks the space character before the first number sequence. The fourth fails because it doesn't have a non-newline character at the end after the trailing space. The fifth fails because the numerical sequences don't have enough digits. By passing parameters after the input, you can store the part of the input that matched the expression as well as the data that was "captured" by using parentheses. See the TCL wiki for details on the regexp command.
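The un-escaped pattern can also be checked in another engine. The Python sketch below reruns the five inputs from the session above; it is only an illustration, with made-up variable names, and Python's regex dialect is not identical to Tcl's (for these inputs the differences do not matter).

```python
import re

TIMESTAMP_RE = re.compile(r".* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]")

inputs = [
    "TEST: 123:456:789:12345> sample log line",
    " 111:222:333.44444 foo",
    "111:222:333.44444 foo",   # no space before the digits
    " 111:222:333.44444 ",     # nothing after the trailing space
    " 10:44:56.12344: ",       # too few digits per field
]

results = [TIMESTAMP_RE.search(line) is not None for line in inputs]
print(results)  # [True, True, False, False, False]

first = TIMESTAMP_RE.search(inputs[0])
print(repr(first.group(0)))  # 'TEST: 123:456:789:12345> '
print(first.group(1))        # 123:456:789:12345
```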
The interesting part with TCL is that you have to escape the [ character but not the ], while both the { and } need escaping.
.* ==> match junk part of the input
( ==> start capture
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}. ==> match 3 digits followed by any character
\[\\d]\{5\} ==> match 5 digits
). ==> close capture and match any character
\[^\\n] ==> match a character that is not a newline