regex challenge - regex

I've got a little challenge that's been bothering me for the past 2 days.
I have to check whether "From:" and "X-Sender:" have the same value using RegEx.
Problem:
From: some text
<someone#mail.com>
X-Sender: notthatmail.com
How could a RegEx check whether those two email addresses match?
This is actually an email in which I have to check the consistency of the MIME headers.

You can use this:
From: .+?<(.+?)>.+?X-Sender: \1\b
If it matches, the two emails are the same.
Note that this requires the single line option to be on. If your regex flavour does not have a single line option, you can replace all the . with [\s\S] to achieve the same effect.
How this works:
It first finds the email address in the <> brackets and captures it into group 1. It then continues to look for the text X-Sender:. Finally, it asserts that whatever is in group 1 (\1) must follow X-Sender:.
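For anyone who wants to try this out, here is a minimal sketch in Python, where the single line option is spelled re.DOTALL. The function name headers_match and the two sample messages are made up for illustration; the obfuscated address from the question is reused as-is.

```python
import re

# Pattern from the answer; re.DOTALL is Python's name for the "single line"
# option, so "." can also cross line breaks.
PATTERN = re.compile(r'From: .+?<(.+?)>.+?X-Sender: \1\b', re.DOTALL)

def headers_match(message):
    """Return True if the address in <> after From: reappears after X-Sender:."""
    return PATTERN.search(message) is not None

# Hypothetical sample messages for illustration only.
same = "From: some text\n<someone#mail.com>\nX-Sender: someone#mail.com\n"
different = "From: some text\n<someone#mail.com>\nX-Sender: notthatmail.com\n"

print(headers_match(same))
print(headers_match(different))
```

The first call should report a match and the second should not, because \1 has to repeat exactly what group 1 captured.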

Related

Regular Expression to Capture First Two Lines That Don't Include String

I am struggling to find a method to extract the first two lines of an address using a regular expression, where it doesn't include the word "Account".
If we take this address:
Company Name
Some Road
Some Town
I can use the regular expression (?:.*\s*){2} to return
Company Name
Some Road
Which is great.
However, if there is an extra line at the top, making the address become:
Accounts Payable
Company Name
Some Road
Some Town
Then it no longer picks up those two lines that I want.
I have tried the method here: Regular expression to match a line that doesn't contain a word? without success, and have also tried combinations of using things like (?!Account.*)(?:.*\s*){3}, but am having little success.
The Microsoft website https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference has masses of characters etc to use, but I haven't managed to get a combination working yet.
The closest I've got was using [^Account.*](?:.*\s*){3} which returns
s Payable
Company Name
Some Road
I just can't get it to remove the rest of that line! Any help would be appreciated. Thanks.
You may use a ^ with multiline mode on:
(?m)^(?!Accounts)(?:.*\n?){2}
Or (a bit more efficient and following best practices):
(?m)^(?!Accounts).*(?:\n.*)?
See the regex demo and this regex demo.
When (?m) is added to the pattern, ^ matches start of a line, and the whole pattern matches
^ - start of a line
(?!Accounts) - with no Accounts as the first word
(?:.*\n?){2} - two occurrences of any 0+ chars other than line break chars followed with an optional newline
.*(?:\n.*)? - matches a line and an optional subsequent line.
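The question is about .NET, but the second pattern can be sanity-checked in Python, whose re module handles (?m), ^ and the negative lookahead the same way for this input. This is only a sketch; the helper name first_two_lines is an assumption, and the addresses are the ones from the question with one item per line.

```python
import re

# Second pattern from the answer: a line that does not start with "Accounts",
# plus an optional following line.
PATTERN = re.compile(r'(?m)^(?!Accounts).*(?:\n.*)?')

def first_two_lines(address):
    """Return the first two lines, skipping a leading line that starts with 'Accounts'."""
    m = PATTERN.search(address)
    return m.group(0) if m else None

plain = "Company Name\nSome Road\nSome Town"
with_extra = "Accounts Payable\nCompany Name\nSome Road\nSome Town"

print(first_two_lines(plain))
print(first_two_lines(with_extra))
```

Both calls should give the same two lines, since the lookahead makes the match start at the first line that does not begin with Accounts.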

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can only describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of Regex it is. They believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one in the live system (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
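If the tool really is Python-based and a replace step is available (the question later says it may not be), the two operations could be sketched like this. The function name find_codes and the sample text are made up for the example.

```python
import re

def find_codes(text):
    """Strip the ?NN positioners first, then match T followed by 7 digits."""
    cleaned = re.sub(r'\?[0-9]{2}', '', text)   # operation 1: remove positioners
    return re.findall(r'T[0-9]{7}', cleaned)    # operation 2: match the codes

sample = "T123?214567\nT?211234567\nT0000001"
print(find_codes(sample))
```

With the sample above this should list T1234567 twice, followed by T0000001.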
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
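As a rough Python illustration of the concatenation step (assuming Python, which the question could not confirm), the two groups can be joined like this. The helper name join_around_positioner is hypothetical.

```python
import re

# Pattern from the answer: group 1 is the part before the positioner,
# group 2 is the part after it.
PATTERN = re.compile(r'(T.*?)\?\d{2}(.*)')

def join_around_positioner(line):
    """Concatenate the text before and after a single ?NN positioner."""
    m = PATTERN.match(line)
    if m is None:
        return line  # no positioner found, leave the line unchanged
    return m.group(1) + m.group(2)

for line in ["T123?214567", "T?211234567", "T0000001"]:
    print(join_around_positioner(line))
```

As the answer says, this only handles one positioner per line.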
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Then you can use this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (\1).
Finally, replace # with ?21 or anything else you like.
A Python script might look like this:
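import re
ss="""T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
# Step 1: pattern for the positioner that gets masked with '#'
rexpre = re.compile(r'\?21')
# Step 2: T + 7 digits, or T + 8 characters that are digits or '#',
# followed by whitespace (so the last line, which has no trailing newline, is skipped)
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Same matches, with '#' turned back into the original positioner
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))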
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
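A minimal sketch of that replacement in Python, assuming re.sub is available. The function name strip_positioner is made up for the example; the pattern and the \1\2 replacement are the ones described above.

```python
import re

# Group 1: T plus any digits before the optional positioner.
# Group 2: the digits after it.
PATTERN = re.compile(r'\b(T\d*)(?:\?\d{2})?(\d+)\b')

def strip_positioner(text):
    """Rebuild each code from group 1 and group 2, dropping the ?NN part."""
    return PATTERN.sub(r'\1\2', text)

for code in ["T0000001", "T123?214567", "T?211234567"]:
    print(strip_positioner(code))
```

A code without a positioner is matched as well and simply replaced by itself, so it comes out unchanged.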
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get the whole contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so that it ignores words that merely contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
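To see the effect of \b without the original document, here is a small Python sketch that runs both patterns over a made-up line. The sample text and the helper name keywords are assumptions for illustration, not the asker's data.

```python
import re

# Pattern from the question (no word boundaries) and from the answer (with \b).
WITHOUT_B = re.compile(r'(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})')
WITH_B = re.compile(r'(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})')

# Hypothetical text: columns separated by runs of two or more spaces.
sample = "Tax Invoice/Statement   12 Example Street   Total"

def keywords(pattern, text):
    """Return what group 1 captured for every match of the pattern."""
    return [m.group(1) for m in pattern.finditer(text)]

print(keywords(WITHOUT_B, sample))
print(keywords(WITH_B, sample))
```

Without \b the St inside Statement counts as a hit; with \b only the standalone word Street does.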
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings, so that it ignores words that merely contain those words (the first match, for example)?
However, it seems like there are additional issues with your regex.

RegExp , Notepad++ Replace / remove several values

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the tags, and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
Also, all the other data has to go; I want to run a regex that just gives me a list of URLs.
I'd prefer to do it in Notepad++, but I'm not good with regex. Any help would be welcome.
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search for and replace every Id tag together with the text preceding it. You can then remove the leftover text after the last Id manually.
Make sure ". matches newline" is selected.
The regex is straightforward; the only slightly tricky part is that it uses +? and *?, which are non-greedy. This prevents the whole file from being matched. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want a regex that also replaces the last part, use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky, it uses:
?:. A non-capturing group.
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id> so everything before </id> goes in the group. On the last match it will match the end of file \Z.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
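Outside Notepad++, the same capture group can be used in a few lines of Python. This is only a sketch: the function name build_urls and the second record (OTHERID) are made up, and lowercasing the Id is an assumption based on the example URL in the question.

```python
import re

def build_urls(xml_text):
    """Collect every <Id> value and turn it into a review URL."""
    ids = re.findall(r'<Id>([^<]*)</Id>', xml_text)
    # Lowercased because the question's example URL uses "how2sing".
    return ["http://www.reviews.%s.domain.com" % value.lower() for value in ids]

# Shortened sample; OTHERID is a hypothetical second record.
sample = """<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
</Site>
<Id>OTHERID</Id>
<PopularityRank>2</PopularityRank>
</Site>"""

for url in build_urls(sample):
    print(url)
```

Because only the captured values are kept, all the other tags are dropped without needing a second replace pass.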
[UPDATE]
To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace by first captured group \1. Note for the regex match, since there are multiple lines, you will need to have the DOTALL or /s flag activated to also match newlines.
Hope that helps.

Multiline C# Regex to match after a blank line

I'm looking for a multiline regex that will match occurrences after a blank line. For example, given a sample email below, I'd like to match "From: Alex". ^From:\s*(.*)$ works to match any From line, but I want it to be restricted to lines in the body (anything after the first blank line).
Received: from a server
Date: today
To: Ted
From: James
Subject: [fwd: hi]
fyi
----- Forwarded Message -----
To: James
From: Alex
Subject: hi
Party!
I'm not sure of the syntax of C# regular expressions but you should have a way to anchor to the beginning of the string (not the beginning of the line such as ^). I'll call that "\A" in my example:
\A.*?\r?\n\r?\n.*?^From:\s*([^\r\n]+)$
Make sure you turn on the option that makes "." match \n (RegexOptions.Singleline in .NET), and also the multiline option (RegexOptions.Multiline) so that ^ and $ match at line boundaries.
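The pattern can be tried in Python, which also supports \A. In this sketch re.DOTALL lets "." cross line breaks and re.MULTILINE lets ^ and $ match at line boundaries. The positions of the blank lines in the sample email are an assumption, since they are not visible in the question as posted.

```python
import re

# Pattern from the answer, with both options switched on.
PATTERN = re.compile(
    r'\A.*?\r?\n\r?\n.*?^From:\s*([^\r\n]+)$',
    re.DOTALL | re.MULTILINE,
)

# Sample email from the question; blank-line positions are assumed.
email = (
    "Received: from a server\n"
    "Date: today\n"
    "To: Ted\n"
    "From: James\n"
    "Subject: [fwd: hi]\n"
    "\n"
    "fyi\n"
    "\n"
    "----- Forwarded Message -----\n"
    "To: James\n"
    "From: Alex\n"
    "Subject: hi\n"
    "\n"
    "Party!\n"
)

m = PATTERN.search(email)
print(m.group(1) if m else None)
```

The header line From: James is skipped because the pattern must first get past the first blank line.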
Writing complicated regular expressions for such jobs is a bad idea IMO. It's better to combine several simple queries. For example, first search for "\r\n\r\n" to find the start of the body, then run the simple regex over the body.
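A sketch of that two-step approach, written in Python rather than C# to keep it short. The function name body_from_values is made up, and the blank-line positions in the sample email are an assumption, since they are not visible in the question as posted.

```python
import re

def body_from_values(message):
    """Return the values of all From: lines that appear after the first blank line."""
    # Step 1: split headers from body at the first blank line (LF or CRLF).
    parts = re.split(r'\r?\n\r?\n', message, maxsplit=1)
    if len(parts) < 2:
        return []  # no blank line, so no body
    body = parts[1]
    # Step 2: run the simple per-line regex over the body only.
    return re.findall(r'^From:[ \t]*([^\r\n]*)', body, flags=re.MULTILINE)

email = (
    "Received: from a server\n"
    "Date: today\n"
    "To: Ted\n"
    "From: James\n"
    "Subject: [fwd: hi]\n"
    "\n"
    "fyi\n"
    "\n"
    "----- Forwarded Message -----\n"
    "To: James\n"
    "From: Alex\n"
    "Subject: hi\n"
    "\n"
    "Party!\n"
)

print(body_from_values(email))
```

Each step stays simple enough to check on its own, which is the point the answer is making.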
This is using a look-behind assertion. Group 1 will give you the "From" line, and group 2 will give you the actual value ("Alex", in your example).
(?<=\n\n).*(From:\s*(.*?))$
\s{2,}.+?(.+?From:\s(?<Sender>.+?)\s)+?
The \s{2,} matches at least two whitespace characters, meaning your first From: James won't hit. Then it's just a matter of looking for the next "From:" and start capturing from there.
Use this with RegexOptions.Singleline and RegexOptions.ExplicitCapture; the latter means the unnamed outer group won't be captured.