Multiline C# Regex to match after a blank line - regex

I'm looking for a multiline regex that will match occurrences after a blank line. For example, given a sample email below, I'd like to match "From: Alex". ^From:\s*(.*)$ works to match any From line, but I want it to be restricted to lines in the body (anything after the first blank line).
Received: from a server
Date: today
To: Ted
From: James
Subject: [fwd: hi]
fyi
----- Forwarded Message -----
To: James
From: Alex
Subject: hi
Party!

I'm not sure of the syntax of C# regular expressions but you should have a way to anchor to the beginning of the string (not the beginning of the line such as ^). I'll call that "\A" in my example:
\A.*?\r?\n\r?\n.*?^From:\s*([^\r\n]+)$
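Make sure you turn on both the single-line option, so that "." also matches \n, and the multiline option, so that ^ and $ match at line boundaries. In .NET these are RegexOptions.Singleline and RegexOptions.Multiline, and \A is the start-of-string anchor there as well.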
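As a rough check of the pattern above, here is a minimal sketch in Python rather than C#, purely for illustration. Python's re module also supports \A; re.DOTALL lets "." match newlines and re.MULTILINE lets ^ and $ match at line boundaries. The sample text assumes the blank lines between headers and body are present, and the name first_body_sender is made up for this example.

```python
import re

# Sample message. The blank lines separating headers from body are assumed here.
EMAIL = (
    "Received: from a server\n"
    "Date: today\n"
    "To: Ted\n"
    "From: James\n"
    "Subject: [fwd: hi]\n"
    "\n"
    "fyi\n"
    "----- Forwarded Message -----\n"
    "To: James\n"
    "From: Alex\n"
    "Subject: hi\n"
    "\n"
    "Party!\n"
)

# DOTALL lets "." cross line breaks; MULTILINE lets ^ and $ match per line.
BODY_FROM = re.compile(
    r"\A.*?\r?\n\r?\n.*?^From:\s*([^\r\n]+)$",
    re.DOTALL | re.MULTILINE,
)


def first_body_sender(text):
    """Return the value of the first From: line after the first blank line, or None."""
    match = BODY_FROM.search(text)
    return match.group(1) if match else None


print(first_body_sender(EMAIL))  # Alex
```

A message with no blank line at all yields None, because the \r?\n\r?\n part can never match.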

Writing complicated regular expressions for such jobs is a bad idea IMO. It's better to combine several simple queries. For example, first search for "\r\n\r\n" to find the start of the body, then run the simple regex over the body.
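The two-step idea could look roughly like this. It is a sketch in Python rather than C#; the function name body_senders is hypothetical, and the blank line is accepted as either \r\n\r\n or \n\n, which is an assumption about the input.

```python
import re


def body_senders(message):
    """Return the values of all From: lines found after the first blank line."""
    # Step 1: split headers from body at the first blank line (CRLF or LF).
    parts = re.split(r"\r?\n\r?\n", message, maxsplit=1)
    if len(parts) < 2:
        return []  # no blank line, so there is no body to search
    body = parts[1]
    # Step 2: run the simple per-line regex from the question on the body only.
    found = re.findall(r"^From:\s*(.*)$", body, re.MULTILINE)
    # "." also matches "\r", so trim a trailing carriage return from CRLF input.
    return [value.rstrip("\r") for value in found]


sample = "To: Ted\nFrom: James\n\nfyi\nTo: James\nFrom: Alex\nSubject: hi\n"
print(body_senders(sample))  # ['Alex']
```

Each step stays simple enough to test on its own, which is the point of the suggestion.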

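This is using a look-behind assertion. Group 1 will give you the "From" line, and group 2 will give you the actual value ("Alex", in your example). It needs both the Singleline and Multiline options, so that "." can cross line breaks and $ can match at the end of a line.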
(?<=\n\n).*(From:\s*(.*?))$
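A small sketch of how this look-behind pattern behaves, written in Python for illustration (not C#). It assumes both DOTALL and MULTILINE are enabled, since otherwise "." cannot cross line breaks and $ only matches at the very end. Because the leading .* is greedy, it picks the last From: line in the body if there are several. The sample text is a shortened, assumed version of the email.

```python
import re

# Shortened sample; the blank line after the headers is assumed to be present.
EMAIL = (
    "To: Ted\n"
    "From: James\n"
    "Subject: [fwd: hi]\n"
    "\n"
    "fyi\n"
    "To: James\n"
    "From: Alex\n"
    "Subject: hi\n"
)

# (?<=\n\n) only allows the match to start right after a blank line.
AFTER_BLANK = re.compile(r"(?<=\n\n).*(From:\s*(.*?))$", re.DOTALL | re.MULTILINE)

match = AFTER_BLANK.search(EMAIL)
from_line = match.group(1)   # the whole "From" line
sender = match.group(2)      # just the value
print(from_line)  # From: Alex
print(sender)     # Alex
```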

\s{2,}.+?(.+?From:\s(?<Sender>.+?)\s)+?
The \s{2,} matches at least two whitespace characters, meaning your first From: James won't hit. Then it's just a matter of looking for the next "From:" and capturing from there.
Use this with RegexOptions.Singleline and RegexOptions.ExplicitCapture. With ExplicitCapture, unnamed groups such as the outer one do not capture, so only the named Sender group is returned.

Related

Regular Expression to Capture First Two Lines That Don't Include String

I am struggling to find a method to extract the first two lines of an address using a regular expression, where it doesn't include the word "Account".
If we take this address:
Company Name
Some Road
Some Town
I can use the regular expression (?:.*\s*){2} to return
Company Name
Some Road
Which is great.
However, if there is an extra line at the top, making the address become:
Accounts Payable
Company Name
Some Road
Some Town
Then it no longer picks up those two lines that I want.
I have tried the method here: Regular expression to match a line that doesn't contain a word? without success, and have also tried combinations of using things like (?!Account.*)(?:.*\s*){3}, but am having little success.
The Microsoft website https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference has masses of characters etc to use, but I haven't managed to get a combination working yet.
The closest I've got was using [^Account.*](?:.*\s*){3} which returns
s Payable
Company Name
Some Road
I just can't get it to remove the rest of that line! Any help would be appreciated. Thanks.
You may use a ^ with multiline mode on:
(?m)^(?!Accounts)(?:.*\n?){2}
Or (a bit more efficient and following best practices):
(?m)^(?!Accounts).*(?:\n.*)?
When (?m) is added to the pattern, ^ matches at the start of each line. The parts of the two patterns are:
^ - start of a line
(?!Accounts) - a negative lookahead that fails the match if the line starts with Accounts
(?:.*\n?){2} - (first pattern) two occurrences of any 0+ chars other than line break chars, each followed by an optional newline
.*(?:\n.*)? - (second pattern) a line and an optional subsequent line.
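To see the second pattern in action, here is a minimal sketch in Python (used for illustration only; the question targets .NET, and the inline (?m) flag is written the same way in both). The helper name first_two_lines is made up for this example.

```python
import re

# (?m) makes ^ match at each line start; the lookahead skips a leading "Accounts" line.
ADDRESS_LINES = re.compile(r"(?m)^(?!Accounts).*(?:\n.*)?")


def first_two_lines(address):
    """Return the first two lines, starting at the first line that does not begin with Accounts."""
    match = ADDRESS_LINES.search(address)
    return match.group(0) if match else None


plain = "Company Name\nSome Road\nSome Town"
prefixed = "Accounts Payable\nCompany Name\nSome Road\nSome Town"

print(first_two_lines(plain) == "Company Name\nSome Road")     # True
print(first_two_lines(prefixed) == "Company Name\nSome Road")  # True
```

The lookahead only guards the first of the two lines, so an Accounts line in second position would still be returned.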

regex challenge

I've got a little challenge that's been bothering me for the past 2 days.
I have to check whether "From:" and "X-Sender:" have the same value using RegEx.
Problem:
From: some text
<someone@mail.com>
X-Sender: notthatmail.com
How could a RegEx check whether those two addresses match?
This is actually a mail where I have to check the MIME headers for consistency.
You can use this:
From: .+?<(.+?)>.+?X-Sender: \1\b
If it matches, the two emails are the same.
Note that this requires the single line option to be on. If your regex flavour does not have a single line option, you can replace all the . with [\s\S] to achieve the same effect.
How this works:
It first finds the email address in the <> brackets and captures it into group 1. Then it continues to look for the word X-Sender:. Finally it asserts that whatever is in group 1 (\1) must appear right after X-Sender:.
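A short sketch of this backreference check, in Python for illustration. re.DOTALL plays the role of the single line option, the addresses are made-up examples, and senders_match is a hypothetical helper name. The comparison is case-sensitive as written.

```python
import re

# \1 must repeat exactly what group 1 captured between < and >.
SAME_SENDER = re.compile(r"From: .+?<(.+?)>.+?X-Sender: \1\b", re.DOTALL)


def senders_match(headers):
    """Return True if the address in From: <...> reappears after X-Sender:."""
    return SAME_SENDER.search(headers) is not None


consistent = "From: some text\n<someone@mail.com>\nX-Sender: someone@mail.com\n"
inconsistent = "From: some text\n<someone@mail.com>\nX-Sender: other@mail.com\n"

print(senders_match(consistent))    # True
print(senders_match(inconsistent))  # False
```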

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so it ignores words that contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
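The effect of \b can be seen in isolation with a small sketch. This is Python for illustration, reduced to just the alternation from the question; the sample strings are assumptions, not the asker's real data.

```python
import re

LOOSE = re.compile(r"(Street|St|Drive|Dr)")        # matches inside longer words
STRICT = re.compile(r"\b(Street|St|Drive|Dr)\b")   # whole words only

heading = "Tax Invoice/Statement"
address = "12 Example St  Suburb"

print(LOOSE.search(heading) is not None)   # True, "St" inside "Statement"
print(STRICT.search(heading) is None)      # True, no whole-word hit
print(STRICT.search(address).group(1))     # St
```

\b is a zero-width assertion between a word character and a non-word character, so "St" followed by "a" in "Statement" is rejected.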
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings, so it ignores words that contain those words (the first match, for example)?
However, it seems like there are additional issues with your regex.

How can I match the last two words in a sentence in PostgreSQL?

Have been trying for a while, to match the last word of a sentence:
select regexp_matches('My name is Harry Potter', '[^ ]+$');
returned {Potter}
to try to match the last two words:
select regexp_matches('My name is Harry Potter', '[^ ]\s+[^ ]+$');
failed.
select regexp_matches('My name is Harry Potter', '(.*?)\s+(.*?)$');
Did not work as intended either.
Any insights?
Instead of using REGEXP_MATCHES which returns an array of matches, you may be better off using SUBSTRING which will give you the match as TEXT directly.
Using the correct pattern, as @Abelisto shared, you can do this:
SELECT SUBSTRING('My name is Harry Potter' FROM '\w+\W+\w+$')
This returns Harry Potter as opposed to {"Harry Potter"}.
Per @Hambone's comment, if either of the words at the end contains punctuation, like an apostrophe, you would want to consider using the following pattern:
SELECT SUBSTRING('My name is Danny O''neal' FROM '\S+\s+\S+$')
The above would correctly return Danny O'neal as opposed to just O'neal
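The difference between the two patterns is easy to reproduce outside the database. The sketch below uses Python's re module as a stand-in for illustration only (it is not PostgreSQL, and last_two_words is a hypothetical helper), but \w, \W, \s and \S mean the same thing for these inputs.

```python
import re


def last_two_words(sentence, pattern=r"\S+\s+\S+$"):
    """Return the text matched at the end of the sentence, or None."""
    match = re.search(pattern, sentence)
    return match.group(0) if match else None


print(last_two_words("My name is Harry Potter"))                  # Harry Potter
print(last_two_words("My name is Danny O'neal"))                  # Danny O'neal
# With \w/\W the apostrophe counts as the separator, so only the surname is found.
print(last_two_words("My name is Danny O'neal", r"\w+\W+\w+$"))   # O'neal
```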
You should use double escaping in the pattern since it seems the standard_conforming_strings parameter of your PostgreSQL instance is turned off. See PostgreSQL 9.5.3 Documentation:
standard_conforming_strings (boolean)
This controls whether ordinary string literals ('...') treat backslashes literally, as specified in the SQL standard. Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).
Thus, you need to use
'[^ ]+\\s+[^ ]+$'
      ^^
or
'\\S+\\s+\\S+$'
Here,
[^ ]+ - 1 or more characters other than a space (any non-whitespace if \\S is used)
\\s+ - 1 or more whitespaces
[^ ]+ - 1 or more characters other than a space (any non-whitespace if \\S is used)
$ - end of string anchor.
Don't know how the regex works for postgres, but
online regex testers tell me that .*\s(.+)\s+(.*?)$ might do the trick.
I'm not 100% clear on what you're trying to do, but this regex matches the last two words of a sentence, and it's similar to your initial regex: "[^ ]+\s+[^ ]+$" (I just added a '+'.)
For further testing, I suggest going to https://regex101.com/ It's one of the best online regex helpers I've found, and it even breaks down the regex for you. (I'm not involved with the site in any way - it's a recommendation, not a plug)

Regex to match "Warm Regards"-type email signatures

I am an absolute regex noob and have been banging my head against the wall trying to write a regex to remove email signatures from a string that look like this:
Hi There, this is an email.
Warm Regards,
Joe Bloggs
Thus far, I’ve tried variations on:
/^[\w |][R|r]egards,/
The regex should:
look at the beginning of the line (what I was aiming for with the ^),
cover variations like “Warm Regards”, “Kind Regards”, “Best Regards”, and plain old “Regards” (which I was hoping to accomplish with the [\w |] to match any word or blank and the [R|r] to cover Regards/regards),
be OK with mixed case like “warm regards” or “Warm Regards”, and
only pickup lines that are [word] Regards or just regards, so that we don’t grab email body that has the word “regards” somewhere in it.
This seems elementary, but I just can’t nail it, and I seem to err on broadening my regex too much such that any line that contains “regards” gets picked up. I’m doing this in Node.js combined with the string.search function if that matters.
This seems to fit all your requirements:
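^(\w*\s)?[rR]egards,?
It has to start on a new line, then can have any word followed by a space and the word regards, or just the word regards, with the comma also being optional. Note that the character class is [rR], not [r|R]: inside square brackets | is a literal pipe, not an "or".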
If you want to wipe out the rest of the regards line and the line after it as well, you can add \s*.* (the character class is [rR] because | inside square brackets is a literal pipe):
^(\w*\s)?[rR]egards,?\s*.*
If you are trying to remove everything from the Warm Regards line on, this should do it (using [Rr], since | inside square brackets is a literal pipe):
^[^<]*?(?=(.*)[Rr]egards)
Try the following regular expression
^\w* ?regards,?
with the case insensitive & global flag specified.
You can see the regular expression explanation and what it matches here: http://regex101.com/r/vR3zG5
The regular expression that matches signatures as defined in requirements #1-#4 is the following:
/^(\w+ +)?regards,? *$/im
How it works:
"^" at the beginning means the start of a line
"(\w+ +)?" means optional segment that contains exactly one word followed by at least one space
"regards" is just a simple match
",?" optional comma at the end
" *" - the line may contain trailing spaces (it may be useful to put the same match after ^)
"$" - end of line
/.../i - means that the expression is case-insensitive
/.../m - means that ^ and $ match at line breaks
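A rough sketch of how that expression can be used to cut off the signature. It is written in Python for illustration, although the question uses Node.js; re.IGNORECASE and re.MULTILINE correspond to the i and m flags, and strip_signature is a made-up helper name.

```python
import re

# Same pattern as above; IGNORECASE and MULTILINE stand in for the /i and /m flags.
SIGN_OFF = re.compile(r"^(\w+ +)?regards,? *$", re.IGNORECASE | re.MULTILINE)


def strip_signature(text):
    """Drop everything from the first sign-off line onward; leave other text untouched."""
    match = SIGN_OFF.search(text)
    if match is None:
        return text
    return text[:match.start()].rstrip()


email = "Hi There, this is an email.\nWarm Regards,\nJoe Bloggs"
print(strip_signature(email))  # Hi There, this is an email.

# A body line that merely contains the word is not treated as a sign-off.
body_only = "Please pass on my regards to Joe.\nThanks"
print(strip_signature(body_only) == body_only)  # True
```

Anchoring with both ^ and $ is what keeps ordinary sentences containing "regards" from matching.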