Perl multiline regex for first 3 individual items - regex

I am trying to read a regex format in Perl. Sometimes instead of a single line I also see the format in 3 lines.
For the below single line format I can regex as
/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/
to get the first 3 individual items in line
Hi There FirstName.LastName 10 3/23/2011 2:46 PM
Below is the multi-line format I see. I am trying to use something like
/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m
to get individual items but don’t seem to work.
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Any suggestions? Is multi-line regex possible?
NOTE: In the same output i can see either Single line or Multi line or both so output can be like below
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM
Hello Line2
Line2FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

You can for sure apply regex over multiple lines.
I've used the negated word \W+ between words to match space and newlines between words (actually \W is equal to [^a-zA-Z0-9_]).
The chat is viewed as a repetead \w+\W+ block.
If you provide more specific input / output case i can refine the example code:
#!/usr/bin/env perl
my $input = <<'__END__';
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
__END__
my ($chat,$username,$chars,$timestamp) = $input =~ m/(?im)^\s*((?:\w+\W+)+)(\w+[-,\.]\w+)\W+(\d+)\W+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)/;
$chat =~ s/\s+$//; #remove trailing spaces
print "chat -> ${chat}\n";
print "username -> ${username}\n";
print "chars -> ${chars}\n";
print "timestamp -> ${timestamp}\n";
Legenda
m/^.../ match regex (not substitute type) starting from start of line
(?im): case insensitive search and multiline (^/$ match start/end of line also)
\s* match zero or more whitespace chars (matches spaces, tabs, line breaks or form feeds)
((?:\w+\W+)+) (match group $chat) match one or more a pattern composed by a single word \w+ (letters, numbers, '_') followed by not words \W+(everything that is not \w including newline \n). This is later filtered to remove trailing whitespaces
(\w+[-,\.]\w+): (match group $username) this is our weak point. If the username is not composed by two regex words separated by a dash '-' or a comma ',' (UPDATE) or a dot '.' the entire regex cannot work properly (i've extracted both the possibilities from your question, is not directly specified).
(\d+): (match group $chars) a number composed by one or more digits
([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s[ap]m): (match group $timestamp) this is longer than the others split it up:
[0-1]?\d\/[0-3]?\d\/[1-2]\d{3} match a date composed by month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
[0-2]?\d:[0-5]?\d\s?[ap]m match the time: hour:minutes,optional space and 'pm,PM,am,AM,Am,Pm...' thanks to the case insensitive modifier above
You can test it online here

Your regex says:
^\s*(.*)\n*\n* # line starts with optional space followed by anything
| # or
\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$ # spaces followed by any words followed by spaces, digits, spaces, anything at the end of the line
Consider this:
/^From|To$/
Alternation sticks as close to the sequences.
Above is really saying to find a line starting with 'Fro' followed by 'm' or 'T', followed by 'o', followed by the end of line
Compare to this:
/^(From|To)$/
Above will find lines that only have 'From' or 'To'

Related

Autohotekey: How to extract text between two words with multiple occurrences in a large text document

Using Autohotkey, I would like to copy a large text file to the clipboard, extract text between two repeated words, delete everything else, and paste the parsed text. I am trying to do this to a large text file with 80,000+ lines of text where the start and stop words repeat 100s of times.
Any help would be greatly appreciated!
Input Text Example
Delete this text
De l e te this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text
Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text
Delete this text
Desired Output Text
Apples Oranges
Pears Grapes
Peas Carrots
Peas Carrots
I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.
!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return
You can start the match at StartWord, and then match all lines that do not start with either StartWord or StopWord
^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
^ Start of string
StartWord\s*\K Match StartWord, optional whitespace chars and then clear forget what is matched so far using \K
(?: Non capture group to repeat as a whole
\R Match a newline
(?!StartWord|StopWord).* Negative lookahead, assert that the line does not start with Start or Stopword
)+ Close the non capture group and repeat 1 or more times to match at least a single line
See a regex demo.
This is only slightly different than #Thefourthbird's solution.
You can match the following regular expression with general, multiline and dot-all flags set1:
^StartWord\R+\K.*?\R(?=\R*^StopWord\R)
Demo
The regular expression can be broken down as follows:
^StartWord # match 'StartWord' at the beginning of a line
\R+ # match >= 1 line terminators to avoid matching empty lines
# below
\K # reset start of match to current location and discard
# all previously-matched characters
.*? # match >= 0 characters lazily
\R # match a line terminator
(?= # begin a positive lookahead
\R* # match >= 0 line terminators to avoid matching empty lines
# above
^StopWord\R # Match 'StopWord' at the beginning of a line followed
# by a line terminator
) # end positive lookahead
1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.

Parse SWIFT(Financial) message string with REGEX in Powershell

I am working on a Powershell script to parse SWIFT messages (text based) into a database. I am using REGEX to find the appropriate strings in the file and extract them. I now run into the issue that one of the data fields can have CR/LF characters in the string - in the example below I would need to extract the second line as well.
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
I tested this regex pattern (:61:.*[\r\n].*) in RegExr and it recognizes the [\r\n] characters as requirement to be valid, so my plan was to have two expressions - one with and one without CR/LF characters to identify both messages - either with line break or without - however the code below will return all matches no matter whether a line break in included or not - it seems that PS stops evaluation strings after CR/LF.
$transaction = $swift | select-string ‘:61:.*[\r\n].*’ -AllMatches | % { $_.Matches } | % { $_.Value }
Can I use REGEX for this task or do I have to create a function to read the entire string and check for the next line tag to determine the end of this string?
Describe the first line more accurately, then whatever is left is necessarily the message:
$swift = #'
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
'#
$swift |Select-String -Pattern '(?m):\d+:[^,]+,[^/]+//\d+MT\d+[\s\r\n]+.*$'
The regex pattern breaks down as follows:
(?m) # Multi-line mode, this will make `$` match end-of-line positions as well as end-of-string
:\d+: # 1 or more digits, surrounded by colons, matches `:61:`
[^,]+, # 1 or more non-commas followed by a comma, matches `2111261126D12000,`
[^/]+// # 1 or more non-slashes, followed by 2, matches `00NTRF11000004217657P//`
\d+MT\d+ # 1 or more digits followed by `MT` and more digits, matches `03MT211124101166`
[\s\r\n]+ # 1 or more white-space/CR/LF characters
.*$ # everything until the end of the current line, matches `JANE DOE 1232`
Since we're using [\s\r\n]+ to describe the potential line break, it'll still work when the linebreak is replaced with other whitespace characters.

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters whoich have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re: The capitalised text each letter and the letters placement has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # the twenty first chararcters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,20} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s

Regex matching a text after a specific string until another specific string

If I have the following example:
X-FileName: pallen (Non-Privileged).pst
Here is our forecast
Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>
How can I select the text
Here is our forecast
after "X-FileName .... \n" until "Message-ID" execluded?
I read about lookahead and behind and tried this but didn't work:
(?<=X-FileName:(\n)+$).+(?=Message-ID:)
This should do it:
(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:) (group #1 is the match)
Demo
Explanation:
(?:X-FileName:[^\n]+) matches X-Filename: followed by any number of characters that aren't newlines, without capturing it (?:).
\n+ matches any number of consecutive newlines.
([^\n]+) matches and captures any number of consecutive characters that aren't newlines.
\n+, again, matches any number of consecutive newlines.
(?:Message-ID:) matches Message-ID: without capturing it (?:).
Edit: as #WiktorStribiżew mentioned though, splitting your text into lines may be an easier/cleaner way to retrieve what you want.
There are two approaches here, and they depend on the broader context. If your expected substring is the second paragraph, just split with \n\n (or \r\n\r\n) and get the second item from the resulting list.
If it is a text inside some larger text, use a regex.
See a Python demo:
import re
s='''X-FileName: pallen (Non-Privileged).pst
Here is our forecast
Message-ID: <15464986.1075855378456.JavaMail.evans#thyme>'''
# Non-regex way for the string in the exact same format
print(s.split('\n\n')[1])
# Regex way to get some substring in a known context
m = re.search(r'X-FileName:.*[\r\n]+(.+)', s)
if m:
print(m.group(1))
The regex means:
X-FileName: - a literal substring
.* - any 0+ chars other than line break chars
[\r\n]+ - 1 or more CR or LF chars
(.+) - Group 1: one or more chars other than line break chars, as many as possible.
See the regex demo.

Regular Expressions - Greedy but stop before a string match

I have the some data and i'd like to convert it into a table format.
Here's the input data
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following
^(\d{1,3}-)[^\d]*
but it seems to match only till the digit 1 in 1st.
I'd like to be able to stop matching after i find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.
This is for vim, and uses zerowidth positive-lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\#=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\#= followed by a newline/linefeed ONLY if it is followed
by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.
SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file use Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could match the \r and (?!\d+-) could match the \n.
Tested in EditPad Pro, but it should work in Lite as well.
You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter, and slash is most
common, but as that will require that we escape slashes in the command
I chose to use the number sign)
\v - very magic mode, let's us use less backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
[\n\r] - newline (in any format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group
(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)
You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
Would it be good for you to do it in 3 steps?
(these are perl regex):
Replace the first:
$input =~ s/^(\d{1,3})/<td>\1/;
Replace the rest
$input =~ s/\n(\d{1,3})/<\/td>\n<td>\1/gm;
Add the last:
$input .= '</td>';