Parse SWIFT(Financial) message string with REGEX in Powershell - regex

I am working on a Powershell script to parse SWIFT messages (text based) into a database. I am using REGEX to find the appropriate strings in the file and extract them. I now run into the issue that one of the data fields can have CR/LF characters in the string - in the example below I would need to extract the second line as well.
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
I tested the regex pattern (:61:.*[\r\n].*) in RegExr, and it treats the [\r\n] characters as required for a match. My plan was to use two expressions, one with and one without the CR/LF characters, to identify messages with and without a line break. However, the code below returns all matches regardless of whether a line break is included; it seems that PowerShell stops evaluating strings after CR/LF.
$transaction = $swift | select-string ':61:.*[\r\n].*' -AllMatches | % { $_.Matches } | % { $_.Value }
Can I use REGEX for this task or do I have to create a function to read the entire string and check for the next line tag to determine the end of this string?

Describe the first line more accurately, then whatever is left is necessarily the message:
$swift = @'
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
'@
$swift |Select-String -Pattern '(?m):\d+:[^,]+,[^/]+//\d+MT\d+[\s\r\n]+.*$'
The regex pattern breaks down as follows:
(?m) # Multi-line mode, this will make `$` match end-of-line positions as well as end-of-string
:\d+: # 1 or more digits, surrounded by colons, matches `:61:`
[^,]+, # 1 or more non-commas followed by a comma, matches `2111261126D12000,`
[^/]+// # 1 or more non-slashes, followed by 2, matches `00NTRF11000004217657P//`
\d+MT\d+ # 1 or more digits followed by `MT` and more digits, matches `03MT211124101166`
[\s\r\n]+ # 1 or more white-space/CR/LF characters
.*$ # everything until the end of the current line, matches `JANE DOE 1232`
Since we're using [\s\r\n]+ to describe the potential line break, it'll still work when the linebreak is replaced with other whitespace characters.
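For the alternative raised in the question (reading up to the next tag), here is a minimal sketch in Python rather than PowerShell. It assumes every field starts with a `:tag:` at the beginning of a line and that continuation lines never start with a colon. The sample message, the `:86:` line, and the name `extract_61` are made up for illustration.

```python
import re

# Hypothetical sample: a :61: field with one continuation line,
# followed by another (made-up) tagged field.
message = (
    ":61:2111261126D12000,00NTRF11000004217657P//03MT211124101166\r\n"
    "JANE DOE 1232\r\n"
    ":86:SOME OTHER FIELD\r\n"
)

# The field runs from its tag up to, but not including, the next line
# that starts with ':' (assumption: that is how the next tag begins).
FIELD_61 = re.compile(r"^:61:([^\r\n]*(?:\r?\n(?!:)[^\r\n]+)*)", re.MULTILINE)

def extract_61(text):
    """Return each :61: field as a list of its lines (tag removed)."""
    return [m.group(1).splitlines() for m in FIELD_61.finditer(text)]

for lines in extract_61(message):
    print(lines)
```

The negative lookahead `(?!:)` is what stops the match at the next tag, so the same expression covers fields with and without a second line.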

Related

Regex POSIX - How can i find if the start of a line contains a word from a word that appears later in line

I have a UNIX passwd file and I need to find, using egrep, whether the first 7 characters of the GECOS field appear inside the username. I want to check if the username (jkennedy) contains the word "kennedy" from the GECOS.
I was planning to use back-references, but the username comes before the GECOS, so I don't know how to implement it.
For example the passwd file contains this line:
jkennedy:x:2473:1067:kennedy john:/root:/bin/bash
As per my original comment, the regex below works for me.
See it in use here - note this regex differs slightly as it's more used for display purposes. The regex below is the POSIX version of this and removes non-capture groups and the unneeded capture group around the backreference.
^[^:]*([^:]{7})([^:]*:){4}\1.*$
^ assert position at the start of the line
[^:]* match any character except : any number of times
([^:]{7}) capture exactly seven of any character except :
([^:]*:){4} match the following exactly four times
[^:]*: match any character except : any number of times, followed by : literally
\1 match the backreference; matches what was previously matched by the first capture group
.* match any character (except newline characters) any number of times
$ assert position at the end of the line
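The breakdown above can be tried outside egrep as well. This is a small sketch in Python, whose `re` module also supports `\1` backreferences; the helper name `gecos_prefix_in_username` and the second and third sample lines are invented for the example.

```python
import re

# Same expression as above: capture 7 characters somewhere in the username,
# skip four colon-terminated fields, then require the capture at the start
# of the GECOS field.
GECOS_IN_USERNAME = re.compile(r"^[^:]*([^:]{7})([^:]*:){4}\1.*$")

def gecos_prefix_in_username(passwd_line):
    """True if the first 7 GECOS characters occur inside the username."""
    return GECOS_IN_USERNAME.match(passwd_line) is not None

samples = [
    "jkennedy:x:2473:1067:kennedy john:/root:/bin/bash",
    "jkennedy:x:2473:1067:smith john:/root:/bin/bash",
    "jsmith:x:2474:1067:kennedy john:/root:/bin/bash",
]
for line in samples:
    print(gecos_prefix_in_username(line), line)
```

Because `[^:]` cannot cross a colon, the captured seven characters are forced to come from the first field, which is why the backreference can sit after the text it refers to.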
Assuming you do NOT want case sensitivity to foul your matching -
declare -l tmpUsr tmpName   # -l lowercases values on assignment
while IFS=: read -r usr x x x name x
do tmpUsr="$usr"; tmpName="$name"
(( ${#name} )) && [[ "$tmpUsr" =~ ${tmpName:0:7} ]] &&
printf '%s (%s<%s>)\n' "$usr" "$name" "${tmpName:0:7}"   # data passed as arguments, not as the format
done </etc/passwd

Extract part of text in PowerShell

This is my input file which is random, can be any number not just 9999 and any letters:
The below format will always come after a - (dash).
-
9999 99AKDSLY9ZWSRK99999
9999 99BGRPOE99FTRQ99999
Expected output:
AKDSLY9ZSRK
BGRPOE99TRQ
So I need to remove the first part of each line, always numbers:
9999 99
9999 99
Then remove the not-required characters:
99AKDSLY9ZW → in this case is the W but could be any letter
99BGRPOE99F → in this case is the F but could be any letter
And finally remove the last 5 digits, always numbers:
99999
99999
What I'm trying to use, regex (first time using it):
$result = [regex]::Matches($InputFile, '(^\d{4}\s\d{2}[A-Z0-9]\d{5}$)') -replace '\d{4}\s\d{2}', '')
$result
It's not giving me an error message but it's not showing me the characters I'm expecting to see at $result.
I was expecting to see something in $result to then start the formatting, deleting the characters I don't need.
What could be missing here, please?
Try something like this:
$str = (Get-Content ... -Raw) -replace '\r'
$cb = {
$args[0].Groups[1].Value -replace '(?m)^.{7}' -replace '(?m).(.{3}).{5}$', '$1'
}
$re = [regex]'(?m)^(?<=-\n)((?:\d{4}\s\d{2}[^\n]*\d{5}(?:\n|$))+)'
$re.Replace($str, $cb)
The regular expression $re matches multiline substrings that start with a hyphen and a newline, followed by one or more lines with your digit/letter combinations. The (?<=...) is a positive lookbehind assertion to ensure that you only get a match when the lines with the digit/letter combinations are preceded by a line with a hyphen (without making that line part of the actual match).
The scriptblock $cb is an anonymous callback function that the Regex.Replace() method calls on each match. For each line in a match it removes the first 7 characters from the beginning of the line, and replaces the last 9 characters from the end of the line with the 2nd through 4th of those characters.
For simplicity reasons the sample code removes carriage return characters (CR, \r) from the string, so that all newlines are linefeed characters (LF, \n) instead of the default CR-LF.
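The same match-then-callback idea can be sketched in Python, where `re.sub` accepts a function as the replacement. This is an illustration only: the inline sample text stands in for the input file, the names `BLOCK`, `trim_block`, and `extract` are invented here, and it assumes LF-only newlines just like the code above.

```python
import re

# Sample input from the question, with LF line endings.
text = "-\n9999 99AKDSLY9ZWSRK99999\n9999 99BGRPOE99FTRQ99999\n"

# One or more data lines that directly follow a line ending in "-".
BLOCK = re.compile(
    r"^(?<=-\n)((?:\d{4}\s\d{2}[^\n]*\d{5}(?:\n|$))+)",
    re.MULTILINE,
)

def trim_block(match):
    """Callback: clean every line of one matched block."""
    block = match.group(1)
    # Drop the first 7 characters of each line ("9999 99").
    block = re.sub(r"^.{7}", "", block, flags=re.MULTILINE)
    # Of the last 9 characters keep only the 2nd through 4th.
    return re.sub(r".(.{3}).{5}$", r"\1", block, flags=re.MULTILINE)

def extract(raw):
    return BLOCK.sub(trim_block, raw)

print(extract(text))
```

As with `Regex.Replace()`, text outside the matched block (here the line with the hyphen) is returned unchanged.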

Vimgrep before any empty line

I have a lot of files which starts with some tags I defined.
Example:
=Title
#context
!todo
#topic
#subject
#etc
And some text (notice the blank line just before this text).
Foo
Bar
I'd like to write a Vim search command (with vimgrep) to match something before an empty line.
How do I grep only in the lines before the first blank line? Would that make the grep faster? Please, no need to mention :grep or binaries like Ag (the silver searcher).
I know \_.* matches everything including EOL. I know the negation [^foo]. I managed to match everything but empty lines with /[^^$]. But I didn't manage to compose my :vimgrep command. Thank you for your help!
If you want a general solution that works for any file content, then AFAIK you can't get one with text in that form. You may ask why.
Explanation:
vimgrep requires a pattern argument to do the search line by line which behaves exactly as the :global cmd.
For your case we need to get the first part preceding the first blank line. (It can be extended to: Get the first non blank text)
Let's call:
A :Every block of text not containing any single blank line inside
x :Blank lines
Only with these 5 forms of file content can you get the first A block with vimgrep (cases 2, 4 and 5 are obvious):
1 | 2 | 3 | 4 | 5
x | x | A | x | A
A | A | x | A | x
x | x | A |   |
A |   |   |   |
Looking to your file, it is having this form:
A
x
A
x
A
The middle block causes a problem: you cannot isolate the first A unless it is delimited by some known, unique text.
So the only solution I can come up with for those 5 cases is:
:vimgrep /\_.\{-}\(\(\n\s*\n\)\+\)\@=/ %
AFAIK the most you can do with :vimgrep is use the \%<XXl atom to restrict the search to lines above a specific line number:
:vim /\%<20lfunction/ *.vim
That command will find all instances of function above line 20 in the given files.
See :help \%l.
[...] always matches a single character. [^^$] matches a character that is not ^ or $. This is not what you want.
One of the things you can do is:
/\%^\%(.\+\n\)\{-}.\{-}\zsfoo/
This matches
\%^ - the beginning of the file
\%( \) - a non-capturing group
\{-} - ... repeated 0 or more times (as few as possible)
.\+ - 1 or more non-newline characters
\n - a newline
.\{-} - 0 or more non-newline characters (as few as possible)
\zs - the official start of the match
This will find the first occurrence of foo, starting from the beginning of the file, searching only non-empty lines. But that's all it does: You can't use it to find multiple matches.
Alternatively:
/\%(^\n\_.*\)\@<!foo/
\%( \) - a non-capturing group
\@<! - not-preceded-by modifier
^ - beginning of line
\n - newline
\_.* - 0 or more of any character
This matches every occurrence of foo that is not preceded anywhere by an empty line (i.e. a beginning-of-line / newline combo).
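Outside Vim, the same restriction ("only look at lines before the first empty line") can be sketched procedurally. The Python below is not a vimgrep replacement, just an illustration of the rule the patterns above encode; the function name and the sample text are invented, and an "empty line" is taken to mean a line with no characters at all.

```python
import re

def matches_before_first_blank_line(text, pattern):
    """Return (line_number, line) pairs matching `pattern` in the header block."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line == "":          # first empty line ends the header block
            break
        if re.search(pattern, line):
            hits.append((number, line))
    return hits

sample = "=Title\n#context\n!todo\n#topic\n\nFoo #topic\nBar\n"
print(matches_before_first_blank_line(sample, r"#topic"))
```

Stopping at the first empty line also answers the speed question for this sketch: the body of the file is never scanned.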

Perl multiline regex for first 3 individual items

I am trying to read a regex format in Perl. Sometimes instead of a single line I also see the format in 3 lines.
For the below single line format I can regex as
/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/
to get the first 3 individual items in line
Hi There FirstName.LastName 10 3/23/2011 2:46 PM
Below is the multi-line format I see. I am trying to use something like
/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m
to get the individual items, but it doesn't seem to work.
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Any suggestions? Is multi-line regex possible?
NOTE: In the same output I can see single-line entries, multi-line entries, or both, so the output can look like this:
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM
Hello Line2
Line2FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
You can for sure apply regex over multiple lines.
I've used \W+ (the negation of \w) between words to match the spaces and newlines that separate them (\W is equal to [^a-zA-Z0-9_]).
The chat is viewed as a repeated \w+\W+ block.
If you provide more specific input/output cases, I can refine the example code:
#!/usr/bin/env perl
my $input = <<'__END__';
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
__END__
my ($chat,$username,$chars,$timestamp) = $input =~ m/(?im)^\s*((?:\w+\W+)+)(\w+[-,\.]\w+)\W+(\d+)\W+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)/;
$chat =~ s/\s+$//; #remove trailing spaces
print "chat -> ${chat}\n";
print "username -> ${username}\n";
print "chars -> ${chars}\n";
print "timestamp -> ${timestamp}\n";
Legend
m/^.../ match regex (not substitute type) starting from start of line
(?im): case insensitive search and multiline (^/$ match start/end of line also)
\s* match zero or more whitespace chars (matches spaces, tabs, line breaks or form feeds)
((?:\w+\W+)+) (match group $chat) match one or more a pattern composed by a single word \w+ (letters, numbers, '_') followed by not words \W+(everything that is not \w including newline \n). This is later filtered to remove trailing whitespaces
(\w+[-,\.]\w+): (match group $username) this is our weak point. If the username is not composed of two regex words separated by a dash '-', a comma ',' or (UPDATE) a dot '.', the entire regex cannot work properly (I've taken both possibilities from your question, since the format is not directly specified).
(\d+): (match group $chars) a number composed by one or more digits
([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m): (match group $timestamp) this is longer than the others, so let's split it up:
[0-1]?\d\/[0-3]?\d\/[1-2]\d{3} match a date composed by month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
[0-2]?\d:[0-5]?\d\s?[ap]m match the time: hour:minutes,optional space and 'pm,PM,am,AM,Am,Pm...' thanks to the case insensitive modifier above
You can test it online here
Your regex says:
^\s*(.*)\n*\n* # line starts with optional space followed by anything
| # or
\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$ # spaces followed by any words followed by spaces, digits, spaces, anything at the end of the line
Consider this:
/^From|To$/
Alternation has the lowest precedence, so it splits the whole pattern at the |.
The above really says: find a line that starts with 'From', or a line that ends with 'To'.
Compare to this:
/^(From|To)$/
Above will find lines that only have 'From' or 'To'
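A quick way to see the difference is to run both patterns over a few strings. The sketch below uses Python's `re` module, which treats `|` and the anchors the same way for this purpose; the sample strings and the names `UNGROUPED` and `GROUPED` are made up.

```python
import re

UNGROUPED = re.compile(r"^From|To$")     # (^From) or (To$)
GROUPED = re.compile(r"^(From|To)$")     # the whole line is From or To

samples = ["From: me", "Reply To", "To", "Subject"]

# For each sample: does the ungrouped pattern match, does the grouped one?
results = {
    s: (bool(UNGROUPED.search(s)), bool(GROUPED.search(s)))
    for s in samples
}

for s, (ungrouped, grouped) in results.items():
    print(f"{s!r}: ungrouped={ungrouped} grouped={grouped}")
```

Only the grouped form applies both anchors to both alternatives, which is the fix the multi-line regex in the question needs as well.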

Regular Expressions - Greedy but stop before a string match

I have some data and I'd like to convert it into a table format.
Here's the input data
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following
^(\d{1,3}-)[^\d]*
but it seems to match only till the digit 1 in 1st.
I'd like to be able to stop matching once I find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.
This is for vim, and uses zerowidth positive-lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\@=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\@= followed by a newline/linefeed ONLY if it is followed
by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.
SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file use Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could match the \r and (?!\d+-) could match the \n.
Tested in EditPad Pro, but it should work in Lite as well.
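The search/replace pair above translates to most regex engines. Here is a hedged sketch in Python; it is an adaptation, not the EditPad expression itself: the possessive `[\r\n]++` is swapped for a plain `\n`, on the assumption that line endings have already been normalised to LF. The names `ENTRY` and `wrap_entries` are invented.

```python
import re

# An entry starts with digits and a dash at the beginning of a line and
# continues over every following line that does not start a new entry.
ENTRY = re.compile(r"^\d+-.*(?:\n(?!\d+-).*)*", re.MULTILINE)

def wrap_entries(text):
    """Wrap each numbered entry, continuation lines included, in <td> tags."""
    return ENTRY.sub(r"<td>\g<0></td>", text)

data = (
    "1- This is the 1st line with a\n"
    "newline character\n"
    "2- This is the 2nd line"
)
print(wrap_entries(data))
```

The negative lookahead `(?!\d+-)` does the "stop before the next number" work, so the greedy `.*` never runs into the following entry.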
You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter, and slash is most
common, but as that will require that we escape slashes in the command
I chose to use the number sign)
\v - very magic mode, lets us use fewer backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
[\n\r] - newline (in any format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group
(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)
You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
Would it be good for you to do it in 3 steps?
(these are perl regex):
Replace the first:
$input =~ s/^(\d{1,3})/<td>$1/;
Replace the rest
$input =~ s/\n(\d{1,3})/<\/td>\n<td>$1/gm;
Add the last:
$input .= '</td>';