How to use regex to include linebreaks in extracted results - regex

I am processing a text file of messages that resembles this (though a lot longer):
13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you
Hello
13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message
where someone added a line break
13/09/18, 4:10 pm - Fred Dag: Here is another message
The following regex works to extract the data into Date, Time, Name and Message except where the Message includes a line break:
(?<date>(?:[0-9]{1,2}\/){2}[0-9]{1,2}),\s(?<time>(?:[0-9]{1,2}:)[0-9]{2}\s[a|p]m)\s-\s(?<name>(?:.*)):\s(?<message>(?:.+))
Using preg_match_all and the regex above in PHP 7.4, I generated the following array:
Array
(
[0] => Array
(
[date] => 13/09/18
[time] => 4:14 pm
[name] => Fred Dag
[message] => Jackie, please could you send to me too? ‚ thank you
)
[1] => Array
(
[date] => 13/09/18
[time] => 4:45 pm
[name] => Jackie Johnson
[message] => Here is yet another message
)
[2] => Array
(
[date] => 13/09/18
[time] => 4:10 pm
[name] => Fred Dag
[message] => Here is another message
)
)
But the array is missing the lines caused by the line breaks which should be appended to the previous Message. I get the same result when playing in regex101.com.
I tried including the single-line modifier for the message, like this: (?<message>(?s:.+)), but that then selected everything from the start of the first message to the end of the file.
I tried playing with greedy vs non-greedy but I couldn't get that to work.
I tried using a reverse lookup, but I don't seem to have enough understanding to get that to work and ended up just randomly pasting code off the internet, which did nothing but get me frustrated.
I think I have exhausted my knowledge of regex and reached the end of Google with the terms I know to search with :) Could anyone point me in the right direction?

Your immediate problem is that the dot you are using to match the message content does not match across newlines. That can be fixed with the /s (dot-all) flag in your PHP regex. On its own, though, that lets a greedy .+ run to the end of the file, so the pattern also needs to stop at the start of the next message. I suggest the following pattern, which uses a lazy .*? and a lookahead:
\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}.*?(?=\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}|$)
This pattern matches a message from its starting date, across newlines, until reaching either the start of the next message or the end of the input.
Sample script:
$input = "13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you\nHello\n13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message\nwhere someone added a line break\n13/09/18, 4:10 pm - Fred Dag: Here is another message";
preg_match_all("/\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}.*?(?=\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}|$)/s", $input, $matches);
print_r($matches[0]);
This prints:
Array
(
[0] => 13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you
Hello
[1] => 13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message
where someone added a line break
[2] => 13/09/18, 4:10 pm - Fred Dag: Here is another message
)
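If you also want to keep the named groups from the question, the same lazy-match-plus-lookahead idea can be combined with them. The following is only a sketch in Python rather than PHP. The group names come from the question; the [ap]m class, the [^:\n]+ name pattern, and the HEADER helper are my own assumptions, and the sample text is abbreviated.

```python
import re

# Abbreviated sample from the question; real input would be read from the file.
text = (
    "13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too?\n"
    "Hello\n"
    "13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message\n"
    "where someone added a line break\n"
    "13/09/18, 4:10 pm - Fred Dag: Here is another message"
)

# Prefix that marks the start of a message: date, time, then " - ".
HEADER = r"\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2} [ap]m - "

MESSAGE = re.compile(
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2}), "
    r"(?P<time>\d{1,2}:\d{2} [ap]m) - "
    r"(?P<name>[^:\n]+): "
    # Lazy match across newlines, stopping before the next header or at end of input.
    r"(?P<message>.*?)(?=\n" + HEADER + r"|\Z)",
    re.DOTALL,
)

messages = [m.groupdict() for m in MESSAGE.finditer(text)]
for entry in messages:
    print(entry)
```

Because the lookahead requires a newline followed by a full date/time header, a continuation line that merely contains digits or a colon should not end the message early.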

Related

Regex to find a specific word between two other specific words

When inspecting the content of an email body, I want to detect when a distribution list name contains "DL" in the "To" field or the "CC" field, but not in the subject.
Basically, I want my text (DL) detected when it is found between the closest "To:" and the closest "Subject".
The best I can do is the following, but it detects everything from the very first instance of "To:" with a subsequent DL until the very last instance of "Subject":
(?<=To: )(?s:.)*?( DL | DL-)(?s:.)*?(?=Subject:)
Expected results: "DL-" from DL-Musketeers, but not the "DL" in the subject line if the distribution list wasn't present.
From: Mouse, Mickey <JMouse#Disney.com<mailto:JMouse#Disney.com>>
Sent: Thursday, May 26, 2022 8:14 AM
To: Mouse, Minnie <DMouse#Disney.com<mailto:DMouse#Disney.com>>
Cc: Disney, Joseph R <JDisney#Disney.com<mailto:JDisney#Disney.com>> DL-Musketeers#Disney.com
Subject: RE: DL commission
Thanks in advance.
I was able to find a solution with help from @Barmar.
What I'm using is:
(?<=To:)(.)*?( DL | DL-)(?s:.)*?(?=Subject:)|(?<=Cc:)(.)*?( DL | DL-)(?s:.)*?(?=Subject:)
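An alternative that avoids one large pattern is to do it in two steps: first isolate each block that runs from a "To:" line up to the next "Subject:" line, then look for the DL token only inside that block. This is a rough Python sketch, not the method used above; the names TO_SUBJECT, DL_TOKEN and find_dl are hypothetical, and the sample addresses are shortened.

```python
import re

# Header block from the question (addresses shortened; "#" kept as in the sample).
email = (
    "From: Mouse, Mickey <JMouse#Disney.com>\n"
    "Sent: Thursday, May 26, 2022 8:14 AM\n"
    "To: Mouse, Minnie <DMouse#Disney.com>\n"
    "Cc: Disney, Joseph R <JDisney#Disney.com> DL-Musketeers#Disney.com\n"
    "Subject: RE: DL commission\n"
)

# Everything from a line starting with "To:" up to the next line starting with "Subject:".
TO_SUBJECT = re.compile(r"^To:(.*?)^Subject:", re.DOTALL | re.MULTILINE)
# "DL" followed by "-" or a space, not preceded by a word character or hyphen.
DL_TOKEN = re.compile(r"(?<![\w-])DL[- ]")

def find_dl(text):
    """Return DL tokens found in To/Cc headers, ignoring the subject line."""
    hits = []
    for block in TO_SUBJECT.finditer(text):
        hits.extend(m.group(0) for m in DL_TOKEN.finditer(block.group(1)))
    return hits

print(find_dl(email))  # ['DL-']
```

Since the subject text is never part of the captured block, a "DL" that appears only in the subject is not reported.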

Regular Expression for Splunk - extract between two phrases across multiple lines.

I am trying to extract log data in Splunk, and my current use case is more complicated than what the "regex builder" will allow for. Consider the example below: I would like to extract all the text between two phrases. I can get small, one-line samples to work between two words, but I've not been able to get this to work at all. The separate line breaks are not helping either.
Thanks for any help you can provide!
Phrase1: Stuff.Applications.Business.StuffApi.Common.Exceptions.ValidationException:
Phrase2:
at Stuff.Applications.Business.StuffApi.Web.Controllers.Stuff.Things
Example Data:
02/26/2018 02:17:08 PM
LogName=Stuff
SourceName=StuffApi
EventCode=400
EventType=2
Type=Error
ComputerName=Stuff.things.Words
TaskCategory=%1
OpCode=Info
RecordNumber=3129
Keywords=Classic
Message=2018-02-26 14:17:08,767 [63] ERROR Things [(null)] - Something Number: ; Something Number: 9999999999 ; Source Application: ABCD ; Error Type: Validation ; Response Status Code: 400
Stuff.Applications.Business.StuffApi.Common.Exceptions.ValidationException: Validation Errors: Error:ErrorInfo.Error cannot be greater than the current date: 2/26/2018 12:00:00 AM, Incoming Value:2/27/2018 12:00:00 AM;
at Stuff.Applications.Business.StuffApi.Web.Controllers.Stuff.Things(SomeRequest request) in f:\Builds\348\Policy Systems\V.12_Release.Applications.Business.Things\src\src\Web\Controllers\Stuff.cs:line 288
at lambda_method(Closure , Object , Object[] )
at System.Web.Http.Controllers.ReflectedHttpActionDescriptor.ActionExecutor.<>c__DisplayClass10.<GetExecutor>b__9(Object instance, Object[] methodParameters)
at System.Web.Http.Controllers.ReflectedHttpActionDescriptor.ExecuteAsync(HttpControllerContext controllerContext, IDictionary`2 arguments, CancellationToken cancellationToken)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
Try this regex
Stuff\.Applications\.Business\.StuffApi\.Common\.Exceptions\.ValidationException:(?<text>[\s\S]+)at Stuff\.Applications\.Business\.StuffApi\.Web\.Controllers\.Stuff\.Things
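To see how the capture behaves outside Splunk, here is a small Python sketch of the same idea. It is an illustration only: the log text is shortened from the example data, and I use a lazy [\s\S]+? (my own change) so the capture stops at the first occurrence of the second phrase rather than the last.

```python
import re

# Shortened version of the example event; the message wording is taken from the question.
log = (
    "Message=2018-02-26 14:17:08,767 [63] ERROR Things [(null)] - Response Status Code: 400\n"
    "Stuff.Applications.Business.StuffApi.Common.Exceptions.ValidationException: "
    "Validation Errors: Error:ErrorInfo.Error cannot be greater than the current date: "
    "2/26/2018 12:00:00 AM, Incoming Value:2/27/2018 12:00:00 AM;\n"
    "at Stuff.Applications.Business.StuffApi.Web.Controllers.Stuff.Things(SomeRequest request)\n"
    "at lambda_method(Closure , Object , Object[] )\n"
)

# re.escape takes care of the literal dots in the two phrases.
START = re.escape("Stuff.Applications.Business.StuffApi.Common.Exceptions.ValidationException:")
END = re.escape("at Stuff.Applications.Business.StuffApi.Web.Controllers.Stuff.Things")

# [\s\S] matches any character including newlines, so no dot-all flag is needed.
pattern = re.compile(START + r"(?P<text>[\s\S]+?)" + END)

match = pattern.search(log)
extracted = match.group("text").strip() if match else None
print(extracted)
```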

How to match a group of lines that match a pattern

I am trying to filter out a group of lines that match a pattern using a regexp but am having trouble getting the correct regexp to use.
The text file contains lines like this:
transaction 390134; promote; 2016/12/20 01:17:07 ; user: build
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
# some commit comment
/./som/file/path 11745/409 (22269/257)
# merged
version 22269/257 (22269/257)
ancestor: (22133/182)
transaction 390136; promote; 2016/12/20 01:17:08 ; user: najmi
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
/./some/other/file/path 11745/1 (22269/1)
version 22269/1 (22269/1)
ancestor: (none - initial version)
type: dir
I would like to filter out the lines that start with "transaction" and contain "user: build", all the way until the next line that starts with "transaction".
The idea is to end up with transaction lines where user is not "build".
Thanks for any help.
If you want only the transaction lines for all users except build:
grep '^transaction ' test_data| grep -v 'user: build$'
If you want the whole transaction record for such users:
awk '/^transaction /{ p = !/user: build$/};p' test_data
OR
perl -lne 'if(/^transaction /){$p = !/user: build$/}; print if $p' test_data
grep's -A and -v options would have done the trick if all transaction records had the same number of lines.
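The awk and perl one-liners both rely on a flag that is updated on every "transaction" line and then decides whether the following lines are printed. The same logic, written out as a Python sketch (the function name is hypothetical, and the sample is a trimmed copy of the data in the question):

```python
import re

def drop_build_transactions(lines):
    """Yield only the records whose transaction line does not end in 'user: build'."""
    keep = False
    for line in lines:
        if line.startswith("transaction "):
            # Re-evaluate the flag at the start of every record.
            keep = not re.search(r"user: build\s*$", line)
        if keep:
            yield line

sample = """transaction 390134; promote; 2016/12/20 01:17:07 ; user: build
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
# some commit comment
transaction 390136; promote; 2016/12/20 01:17:08 ; user: najmi
to: DEVELOPMENT ; from: DEVELOPMENT_BUILD
/./some/other/file/path 11745/1 (22269/1)
""".splitlines()

kept = list(drop_build_transactions(sample))
print("\n".join(kept))
```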

RegEx for value 0.1 to 100.00

Looking at the XML file created by HitmanPro, I can see numerous entries like this one:
[Item type="Malware" malwareName="Trojan" score="0.0" status="None"]
These are the false positives.
I would like to replace the existing regex query that I use in a script (LabTech) with one that would look for anything like:
score="5.1" up to score="999.0"
I am new to regex queries, and I am having trouble building the search for digits inside the string score=" ".
Any help would be much appreciated. Below is a sample XML from hitmanPro
regards,
Oscar Romero
<br>
HitmanPro Scan Completed Successfully.
Threats Found!
<hr>
Scan Date: 2015-10-17T15:16:31<BR>
<p>"
[Log computer="computer name" windows="6.1.1.7601.X64/12" scan="Normal" version="3.7.9.246" date="2015-10-17T15:16:31" timeSpentInSecs="125" filesProcessed="15922"]
[Item type="Malware" malwareName="Malware" score="90.0" status="None"]
[Scanners]
[Scanner id="Bitdefender" name="Gen:Variant.Kazy.751212" /]
[/Scanners]
[File path="C:\Program Files (x86)\ESET\ESET Remote Administrator\Server\era.exe" hash="F7BB46D48B994539AFD400641CE8E4F85114FC7BA05A1BAA0D092F3A92817F13" /]
[Startup]
[Key path="HKLM\SYSTEM\CurrentControlSet\Services\ERA_SERVER\" /]
[/Startup]
[/Item]
[/Log]
"</p>
There must be a shorter version than this, but this should work.
score="(0\.[1-9]|[1-9]\.[0-9]|[1-9][0-9]\.[0-9]|[1-9][0-9][0-9]\.[0-9])"
Matches:
0.1
1.0
10.4
100.9
100.0
999.9
99.9
9.9
(etc.)
Does Not Match
0.0
0
(etc.)
Is regex the way to go?
As for whether regex is the right tool for the job, I probably agree with @Makoto that it isn't, unless you're doing a quick scan of the results as an FYI rather than filtering results as part of a larger tool or application. In other words, except for the simplest cases, I agree with @Makoto that you want an XML parsing tool.
I am not familiar with LabTech.
Anyway, here is a regex query that you can use:
\sscore="((?:5\.[1-9])|(?:[6-9]\.[0-9])|(?:[1-9]{1}[0-9]{1,2}\.[0-9]))"\s
or
\sscore="(5\.[1-9]|[6-9]\.[0-9]|[1-9]{1}[0-9]{1,2}\.[0-9])"\s
if you prefer without the (?: ... )
UPDATE:
Okay, I made further changes to support the 5.1 minimum, and max 999.9
PS: This is my first answer on StackOverflow
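If the scan runs somewhere a scripting language is available, encoding the numeric range in the regex can be avoided altogether: capture the number and compare it as a number. This is only a Python sketch under that assumption (I don't know whether LabTech allows it); the function name and the default bounds of 5.1 and 999.0 are taken from the question, not from any tool.

```python
import re

# Capture the numeric value of every score="..." attribute.
SCORE = re.compile(r'\bscore="(\d+(?:\.\d+)?)"')

def scores_in_range(log, low=5.1, high=999.0):
    """Return the scores found in the log text that fall within [low, high]."""
    values = (float(s) for s in SCORE.findall(log))
    return [v for v in values if low <= v <= high]

sample = (
    '[Item type="Malware" malwareName="Trojan" score="0.0" status="None"]\n'
    '[Item type="Malware" malwareName="Malware" score="90.0" status="None"]\n'
    '[Item type="Malware" malwareName="Other" score="5.0" status="None"]\n'
    '[Item type="Malware" malwareName="Other" score="5.1" status="None"]\n'
)

found = scores_in_range(sample)
print(found)  # [90.0, 5.1]
```

Changing the threshold then means changing a number instead of rewriting the alternation.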

Scala Regex help on UCI data set

Hi guys, I'm trying to parse some data from http://kdd.ics.uci.edu/databases/20newsgroups/20_newsgroups.tar.gz using Scala regex.
Here's the text that I'm trying to process:
val inputData = """xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
newsgroups: alt.atheism,soc.motss,rec.scouting
path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!watson.ibm.com!strom
from: strom#watson.ibm.com (rob strom)
subject: re: [soc.motss, et al.] "princeton axes matching funds for boy scouts"
sender: #watson.ibm.com
message-id: <1993apr05.180116.43346#watson.ibm.com>
date: mon, 05 apr 93 18:01:16 gmt
distribution: usa
references: <c47efs.3q47#austin.ibm.com> <1993mar22.033150.17345#cbnewsl.cb.att.com> <n4hy.93apr5120934#harder.ccr-p.ida.org>
organization: ibm research
lines: 15
in article <n4hy.93apr5120934#harder.ccr-p.ida.org>, n4hy#harder.ccr-p.ida.org (bob mcgwier) writes:
|> [1] however, i hate economic terrorism and political correctness
|> worse than i hate this policy.
|> [2] a more effective approach is to stop donating
|> to any organizating that directly or indirectly supports gay rights issues
|> until they end the boycott on funding of scouts.
can somebody reconcile the apparent contradiction between [1] and [2]?
--
rob strom, strom#watson.ibm.com, (914) 784-7641
ibm research, 30 saw mill river road, p.o. box 704, yorktown heights, ny 10598"""
Here's the output that I need:
in article <n4hy.93apr5120934#harder.ccr-p.ida.org>, n4hy#harder.ccr-p.ida.org (bob mcgwier) writes:
|> [1] however, i hate economic terrorism and political correctness
|> worse than i hate this policy.
|> [2] a more effective approach is to stop donating
|> to any organizating that directly or indirectly supports gay rights issues
|> until they end the boycott on funding of scouts.
can somebody reconcile the apparent contradiction between [1] and [2]?
Here's what I tried:
val docParser = """([\\s\\S]+\\lines: \\d*)([\\s\\S]*\\n\\n)([\\s\\S]*)""".r
val docParser(metadata, content, footer) = inputText
But I'm getting the following error:
scala.MatchError: [Ljava.lang.String;@62f8fff1 (of class [Ljava.lang.String;)
Online regex builder seems to work though:
Any ideas? :)
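I have never programmed in Scala before, but from what I can see in http://www.tutorialspoint.com/scala/scala_regular_expressions.htm
backslashes such as the one in \d have to be escaped twice when the pattern is written in an ordinary double-quoted string.
So \d would become \\d there, and so on. Inside a triple-quoted string ("""...""") backslashes are taken literally, so \d stays \d.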