Regex - Match Exact (and only exact) string/phrase - regex

I'm having some trouble using Regex to match an exact string (and only that string, must not contain prefix or suffix) out of a sample with only slight variations.
I've looked over every "duplicate" this has been compared to, and none of the solutions seem to apply to the problem I'm trying to accomplish. If it is indeed a duplicate, I'd love to see how!
Phrase to Match:
Metallica - Master Of Puppets
Sample Text:
Metallica - Master Of Puppets (instrumental)
Metallica - Master Of Puppets
- Metallica - Master Of Puppets
I've tried a few different approaches with this.
There's "the starting point": ^(Metallica - Master Of Puppets)$
The "slightly more involved": ^((?!Lamb of God - Laid to Rest).)*$
The "I'm getting desperate": /(?<=\||\A|\n)(Metallica -
.Master.of.Puppets)(?=\||Z|\n)/g
and the "im out of ideas, so why the hell not": (?=^\s*Metallica - Master Of
Puppets).{29}
None of which will match the correct (the second option, in bold) string. I've dedicated the better part of my free-time from last night on this one little string rather than coding out a new app I've been working on (I really do hate to give up), and am, at this point, out of ideas, examples and patience. Nonetheless, I really would like to get to the bottom of what this seemingly simple Regex need will take to accomplish, both for the app and for my mental well being (I hate Regex, but love a good challenge).
Note: I DO need this to be done in Regex (not grep, not java etc). Sorry to throw such a seemingly menial question up, but my 15 or so months in the programming world only gets me so far. Looking forward to a solution, thanks!5

I believe your first approach was correct (in agreement with #Dmitry Egorov), although you are probably missing the multi-line flag. This sets it so that ^ and $ are set at the beginning and end of each line of your string or file.
In PHP/Js you'll want to use
/^Metallica - Master Of Puppets$/gm
the g flag is 'global' and finds all instances, the mflag is the aforementioned multi-line flag.
Other languages will have similar flags or options for multi-line support.

Related

POSIX ERE Regex - Creating Efficient Regex

I'm working to create some regex entries that are well-formed, and efficient. I'll place an emphasis on efficient, as these regex entries can see thousands of logs per second. Inefficient regex entries can cause severe performance impacts.
Question: Does regex101 (through one flavor) support POSIX ERE Regex? Googling shows that PCRE2 should support BRE+ERE and more.
Regex Type: POSIX ERE
Syslog App: rsyslog (EL7)
Sample Payload (Well formed - Sensitive Information Stripped):
Jul 10 00:00:00 Firewall-Name-Removed CEF:0|Fortinet|FortiGate-removed|1.2.3,build1111 (GA)|0000000013|forward traffic accept|5|start=Jul 10 2022 00:00:00 logver=604091966 deviceExternalId=FG9A9A9A9999999 dvchost=Firewall-Name-Removed ad.vd=root ad.eventtime=1111111111111111111 ad.tz=-9999 ad.logid=0000000013 cat=traffic ad.subtype=forward deviceSeverity=notice src=1.1.1.1 shost=RandomHost1 spt=62119 deviceInboundInterface=DII-Out ad.srcintfrole=lan ad.srcssid=SSID Has Been Removed ad.apsn=ABC123D ad.ap=CHL-07 ad.channel=157 ad.radioband=802.11ac n-only ad.signal=-40 ad.snr=55 dst=2.2.2.2 dpt=53 deviceOutboundInterface=DOI-Out ad.dstintfrole=undefined ad.srccountry=Reserved ad.dstcountry=CountryRemoved externalID=123456789 proto=00 act=accept ad.policyid=000 ad.policytype=policy ad.poluuid=UUID-Removed ad.policyname=policy_name_removed app=DNS ad.trandisp=noop ad.appid=16195 ad.app=DNS ad.appcat=Network.Service ad.apprisk=elevated ad.applist=UTM Name - Removed ad.duration=180 out=0 in=205 ad.sentpkt=0 ad.rcvdpkt=1 ad.utmaction=allow ad.countdns=1 ad.osname=Windows ad.srcswversion=10 ad.mastersrcmac=MAC removed ad.srcmac=MAC removed ad.srcserver=0 tz="-9999"
What I'm attempting to do is remove specific logs that are not required. Normally I'd do this at a SIEM level through something like routing rules (where I can utilize fields), but this isn't possible for the foreseeable future. In this particular case: I'm trying to exclude on the following pieces of information.
Source IP: Is in a specific range
deviceOutboundInterface: is DOI-Out
Current Regex: "\bsrc=1.1.1[4-5]{0,1}.[0-9]{0,3}\b.*?\bdeviceOutboundInterface=DOI-Out\b" (Regex101 link in PCRE2). If that is matched, the log is rejected (through the stop call). Otherwise, it moves onto the other entries to check for unnecessary logs.
Most of my regex entries are in the low double-digits because they're a lot simpler. Is there a better way to make the more complex regex more efficient?
Thank you for any insight you can offer.
You might be able to cut some time with:
src=1\.1\.1[4-5]{0,1}\.[0-9]{0,3}.*?deviceOutboundInterface=DOI-Out
changes:
remove word boundaries
change the . to . in IP address
regex101 has the original efficiency at 383 steps, new is 301 so a potential savings of ~21%. Not terrible but you'll want to make sure any removals were OK.
to be honest, what you have looks pretty good to me.
This RE reduces the number of steps on Reg101 from 383 to 270 (~ -29.5%):
src=1\.1\.1[45]?\.\d{0,3}.*?O[boundIter]*?face=DOI-Out
The original RE already is quite simple, only matching one pattern and one literal string which makes it difficult to optimize. But we can do if we know (from the documentation of the text in question, here the Log Message manual) that an even simpler pattern will not lead to ambiguities.
Changes:
matching literal text whereever possible
replacing range '4-5' with simple elements
instead of matching the long 'deviceOutboundInterface=', use a pattern which will just barely match this string but would possibly match other words if they ever occurred in log messages - but we know they don't.

how to match a sentence having particular word in different patterns

We have a problem here...
We have a text having different patterns of sentences.
We want to get the sentence having a particular word.
Eg:
One further point, by way of providing another model. The analysis in
the second paragraph could lead in the following direction. 'The
Destructors' deals with, obviously, destruction, whilst the book of
Genesis deals with creation. The vocabulary is similar: Blackie
notices that 'chaos had advanced', an ironic reversal of God's
imposing of form on a void. Furthermore, the phrase 'streaks of light
came in through the closed shutters where they worked with the
seriousness of creators', used in the context of destruction, also
parodies the creation of light and darkness in the early passages of
the Biblical book. Greene's ironic use of the vocabulary of the Bible
might be making the point that, for him, the Second World War
signalled the end of a particular Christian era. Now, it is perfectly
arguable that the rise of fascism is linked to this, or that it is the
cause. The cult of personality and secular leadership has, for Greene,
taken over from the key role of the church in Western societies. In
this way the two main themes identified above - the tension between
individual and community, and religion - are linked. In terms of essay
writing this link could well be made after the discussion of the theme
of the individual and the community, and its links with the theme of
leadership. This might be the general conclusion to the essay. After
thoughtful consideration and interpretation a student may well decide
that this is what the (destructors.)' boils down to: Greene is making a
clear link between the rise of fascism and the decline of the Church's
influence. Despite the fact that fascism has been recently defeated,
Greene sees the lack of any contemporary values which could provide
social cohesion as providing the potential for its reappearance.
In the above text, we have bold words (destructors). We want to get the sentences which are having the word "destructors".
The word "destructors" can be present in different formats. Eg: (destructors), (DesTrucTors), (Des.tructors), DESTRUCTORS, destructors, des-tructors.
When we tried writing a regex to match the sentences, we are failing to get the sentences at some conditions(like we are getting half sentences, etc.,).
Could you please help us with this.
If this information doesn't help you to solve, please let us know. Will update it.
Thank you...
I'm not too sure about Python, but I believe this might work:
for match in re.finditer(r"[^.]*destructors[^.]*\.[^\w\s]*", subject, re.IGNORECASE):
# match start: match.start()
# match end (exclusive): match.end()
# matched text: match.group()
In any case, I think the regex you want is:
[^.]*destructors[^.]*\.[^\w\s]*
with the case insensitive and global flags set.
It will be helpful if you could provide the regex pattern which you have tried with so far. The best I can come up with is,
str_text='your text here containing DESTRUCTORS'
match=re.search('pass all the destructors combination here', str_text, flags=re.IGNORECASE)
Try for more patterns available for string formatting with regex here,https://docs.python.org/3/library/re.html

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame but I would like to specify the search to exclude three strings a game/a shame or , A game/A shame and a/an/A/An WORD game or a/an/A/An WORD shame , where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or a/an and shame is most commonly great and real. So even excluding these two, would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I was trying to use ?\w in the lookbehind grep, but it just wouldn't work - the grep below without ? runs and it still excludes examples such as a shame, but it still returns the undesired examples such as a great shame or a crying shame - see concordance lines (3) and (4) in the sample text below:
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ 3844 steps (Demo)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ 4762 steps (Demo)

Controlling Word Wrap in a container

I have a peculiar problem. I have an email group that pipes emails to a message board. The word wrap of the emails varies. In yahoo, the messages tend to fill the entire container on the message board. But in all other mail clients, only part of the container width is filled, because the original mail was wrapped. I want all of the email messages to fill the entire width of the container. I've thought of two possible solutions: CSS, or a Regex that eliminates line breaks. Because I am only a garage mechanic (at these sorts of things), I simply cannot get the job done. Any help out there?
Here is a link that shows the issue: http://seanwilson.org/forum/index.php?t=msg&th=1729&start=0&S=171399e41f2c10c4357dd9b217caaa3f
(compare the message of "sean" with that of "rob." One fills the container, the other not).
Can any of you suggest how to get all the mail to fill the container?
You gave too little information - what programming language are you using - PHP/Javascript/anything different?
I think you only need to replace \n, \r and \r\n with whitespace. PHP code for that:
$nowrap = str_replace('\r\n', ' ', $nowrap);
$nowrap = str_replace('\r', ' ', $nowrap);
$nowrap = str_replace('\n', ' ', $nowrap);
You can do that analogically in other languages (for JS see string.replace method: http://www.tizag.com/javascriptT/javascript-string-replace.php).
Depending on the situation (people always seem to add 2 linebreaks between paragraphs), you could say the problem is: replace all newlines not directly preceded or followed by a newline with a space.
//just to be sure, remove \r's
$string = str_replace("\r",'',$string);
$string = preg_replace('/(?<!\n)\n(?!\n)/',' ',$string);
While allowing \r's:
$string = preg_replace('/(?<!\r|\n)\r?\n(?!\r|\n)/',' ',$string);
Edit: nevermind: do not use: while people tend to write their email text in paragraphs, you will break their signature / signoff with this regex. One could fiddle around with a minimum linelength before deeming it 'breakable' (i chose 63), but fiddly it will be:
$string = preg_replace('/([^\r\n]{63,})\r?\n(?!\r|\n)/','$1 ',$string);
The problem is: there are no assurance the linebreak wasn't intended. With a fiddleable line-length you could base it on average users, but the question is: what do they mind more: the differences between breaking & non-breaking paragraphs, or the breaking of their signatures?
Thanks for getting back so quickly!
The discussion board uses php (and also CSS). The only trouble is that I am somewhat limited in my ability to tinker with its programing. If I am to do this at my current level of skilty, I have only one of two options.
using a preg-replace in php. The discussion board allows us to do this from a control panel. So If I could do it with one preg-replace statement, it should work.
Would Wrikken's solution work if I do not remove \r's? Because that seems to be spot on. (could the \r's be added to the preg-replace?)
I had hoped the solution could come through a css property of some sort. I guess that isn't possible.
Thanks so much for your help!
[NOTE: thanks so much for your help! The solution worked!!! I changed the number to 53 or so. It needed to be a little smaller. I don't care that a rare, long signature lines may lose its carriage return. That's a small price to pay for a full message box! You easily saved me several days of learning something that was bound to be moderately frustrating, Thanks so much for that quick fix. I am joyous at the help I received here.]

RegEx - Not match inside a text?

I am working with iCal entries:
BEGIN:VEVENT
UID:944f660b-01f8-4e09-95a9-f04a352537d2
ORGANIZER;CN=******
DTSTART;TZID="America/Chicago":20100802T080000
DTEND;TZID="America/Chicago":20100822T170000
STATUS:CONFIRMED
CLASS:PRIVATE
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
TRANSP:OPAQUE
X-MICROSOFT-DISALLOW-COUNTER:TRUE
DTSTAMP:20100802T212130Z
SEQUENCE:0
END:VEVENT
BEGIN:VEVENT
UID:aa132e2b-8a8d-4ffc-9e54-b75249e78c72
RRULE:FREQ=DAILY;COUNT=12;INTERVAL=1
SUMMARY:***********
X-ALT-DESC;FMTTYPE=text/html:<html><body><div style='font-family:Times New R
oman\; font-size: 12pt\; color: #000000\;'></div></body></html>
LOCATION:Map Room
ORGANIZER;CN=*********
DTSTART;TZID="America/Chicago":20100730T080000
DTEND;TZID="America/Chicago":20100730T170000
STATUS:CONFIRMED
CLASS:PUBLIC
X-MICROSOFT-CDO-INTENDEDSTATUS:BUSY
TRANSP:OPAQUE
X-MICROSOFT-DISALLOW-COUNTER:TRUE
DTSTAMP:20100727T025231Z
SEQUENCE:0
EXDATE;TZID="America/Chicago":20100810T080000
EXDATE;TZID="America/Chicago":20100807T080000
BEGIN:VALARM
ACTION:DISPLAY
TRIGGER;RELATED=START:-PT5M
DESCRIPTION:*********
END:VALARM
END:VEVENT
I need to parse out starting and ending times. I have a comparison function that determines if the passed in event is between the two times. Due to the increased complexity in calculating the times I plan on not supporting the recurrance series. I would like to play the safe side and make sure my code only reads the first event as a match and not the second. So I have the following RegEx with the single-line option:
BEGIN:VEVENT.+?
DTSTART;.+?([0-9]{8})T([0-9]{6})
DTEND;.+?([0-9]{8})T([0-9]{6}).+?
END:VEVENT
This gets me the start and end times of both entries. My thought was to only match ones that don't have FREQ= between the BEGIN:VEVENT and DTSTART. I don't understand how to do this, however. I was wondering if someone could help me out here?
I realize at a certain point a full blown parser is a better option, but I am unskilled with parsers and I am under a slight time constraint. I have tried using the !? operator without success.
It's harder to write a regex to match for things you don't want then to match the things you do want. Usually when I run into this situation, I find it easier and faster to do things in two steps. In this case, I'd probably find all events that do contains FREQ=, remove those events, then continue matching on the result for the start and end times I want. Could you post the regex you tried with !?, because maybe it's easy to fix... Also, I assume this is in Objective-C, and I'm guessing the environment you're using does support !? (but not all of them do)...
UPDATE
Ok, try this one:
BEGIN:VEVENT.+?
(?<!FREQ=.+)DTSTART;.+?([0-9]{8})T([0-9]{6})
DTEND;.+?([0-9]{8})T([0-9]{6}).+?
END:VEVENT
Why not use a PHP iCalendar parser?
http://www.phpclasses.org/browse/file/16660.html