I am struggling to find a solution for matching two successive whole words using Regular Expression. I have a text box where the user can type in their search criteria, enclosed by quotations for exact matches. The quotes and space (if any) are then replaced by RegEx expressions. Here is an example:
User enters: "Apple Orange"
Converted to:
\bApple\W+(?:\w+\W+){1,6}?Orange\b
Then, my RegEx match would be based on this converted criteria. The instructions are from www.regular-expressions.info/near.html
Maybe I am going about this entirely the wrong way? I am using visual studio. Any help is appreciated.
If you want an exact match when the user uses quotes, you should just remove the quotes and do a straight string comparison (equality, not contains).
update:
Based on comments below, you would just do the same thing as with a single word match:
Single word:
\bApple\b
Double word:
\bApple Orange\b
The idea is that the user enters the search term and you match exactly that, so you would not do pattern matching on the term itself, only on its boundaries (the \b wrapped around it). There is no reason to touch the search term itself (all that stuff between Apple and Orange that you were trying to do), because even the space between the two words is part of the search. The exception is if you want to make it a bit flexible: for example, if the user enters "Apple[lots of space here]Orange" and you want that to count as a single space, then you could do
\bApple\s+Orange\b
..but then you're kind of deviating from the whole "exact match" theme...
Sidenote: You said in your comment that for "CrabApple OrangeCrush" you did not want "Apple Orange" to match. Which is why you use the \b word boundaries. But IMO if it were me, I would allow for that to match. Or at least, offer some kind of option to search for it in that manner.
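The boundary-only approach described above can be sketched in Python. This is only an illustration: the question uses Visual Studio, and Python stands in here because \b and \s+ are written the same way in .NET. The helper name phrase_pattern and its flexible_spaces flag are hypothetical.

```python
import re

def phrase_pattern(term, flexible_spaces=False):
    """Build a whole-phrase pattern from a quoted search term (hypothetical helper)."""
    phrase = term.strip('"')
    if flexible_spaces:
        # Any run of whitespace between the words counts as one separator.
        body = r"\s+".join(re.escape(word) for word in phrase.split())
    else:
        # Escape the phrase so it is matched literally, spaces included.
        body = re.escape(phrase)
    # Only the boundaries are pattern-matched; the term itself is literal.
    return re.compile(r"\b" + body + r"\b")

pattern = phrase_pattern('"Apple Orange"')
print(bool(pattern.search("One Apple Orange pie")))    # True
print(bool(pattern.search("CrabApple OrangeCrush")))   # False
```

Escaping the term matters once users type characters such as . or + in their search.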
I'm logging all user search queries in a model like this:
from django.db import models

class SearchLog(models.Model):
    query = models.CharField(max_length=512)
    datetime = models.DateTimeField(auto_now_add=True, db_index=True)
To get all queries that have at most one word, I use this queryset:
SearchLog.objects.exclude(query__contains=" ")
I want to get the queries that have at most two words. Is there any way to do this, even with raw SQL?
You can use a regular expression (regex) for this: a textual pattern that describes the strings you want to match.
For example to match at most two words, a regex could look like:
^\S+(\s+\S+)?$
(but depending on the situation, you might have to alter it a bit).
The \S stands for a non-space character (i.e. no space, tab, new line, etc.). We repeat such characters one or more times (with the + quantifier). Next we optionally allow a second word (that is the meaning of the question mark ? at the end). This second word consists of one or more consecutive whitespace characters (\s+) followed by one or more non-space characters (\S+). The caret (^) and dollar ($) anchors denote the beginning and end of the string (without them, the regex would match anything that has at least one word). As said before, one of the problems might be what you regard as a word, so depending on that specification you might have to change the regex a bit.
In case for example queries with no words at all should be matched as well, we have to change it to ^(\S+(\s+\S+)?)?$ but then strings with only spacing are still not matched. You see it can be difficult to get the pattern completely right, since it basically depends on what you see as a "match" and what not.
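To see how the two patterns behave before putting them into a query, here is a small Python check. The sample strings are made up for illustration. Note that Django's __regex lookup is evaluated by the database backend, so the supported regex syntax may differ from Python's re module.

```python
import re

AT_MOST_TWO = re.compile(r'^\S+(\s+\S+)?$')
ALLOW_EMPTY = re.compile(r'^(\S+(\s+\S+)?)?$')

# Made-up sample queries, including the empty string and a whitespace-only one.
samples = ["apple", "apple orange", "apple  orange", "apple orange pear", "", "   "]

for s in samples:
    # Print each sample with the verdict of both patterns.
    print(repr(s), bool(AT_MOST_TWO.search(s)), bool(ALLOW_EMPTY.search(s)))
```

Only the empty string is treated differently by the two patterns; the whitespace-only string is rejected by both, as noted above.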
You can test the regex with regex101. The strings that match are lines that are highlighted. The lines with three or more words are not highlighted, hence a regex would exclude those. You can use this tool to test the regex, and change it, until it perfectly matches your requirements.
So we can filter with:
SearchLog.objects.filter(query__regex=r'^\S+(\s+\S+)?$')
Regexes can perform rather advanced matching. In computer science, however, the famous "pumping lemma for regular languages" can be used to show that certain families of patterns cannot be written as regular expressions (in fact, there are families of patterns that cannot be matched by any program at all). That does not matter much here (I think), but it means a regex cannot necessarily match every pattern a programmer has in mind.
I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italic parts are variable: they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, [\s\S]+? is supposed to match any letter, number, space or line break, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches a much bigger piece of text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Adding "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more realistic example, with HTML tags, follows. There, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex to the text above. In HTML this is barely possible, as the user would never write the matching tag himself, but I do want to make sure my regex is correct just in case.
These things are certain: I know what the part of the text starts with, no matter where it sits in the entire text; I know what it ends with; and there are specific fixed words that might make the regex more reliable, although they can be omitted. Any text below the searched part may also be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches every character because it is the union of two complementary sets; it is like . with the /s option (dot matches newlines). Quantifiers are greedy by default, so the longest possible match is returned. A lazy quantifier only shortens the match at its end; it does not move the start, because the regex engine always reports the leftmost match.
Following the "correct" link, the token just after the shortest match must be geschreven. Another way to write this without lazy quantifiers, and one that is more flexible, is to put a negative lookahead in front of the repeated character inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead which ensures current position before next character is not the beginning of end token.
Following the comments, you can state explicitly what must not appear in each repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
To understand what the technique (?:(?!tokens)[\s\S])+ does:
in the first pattern, this cannot appear between From and this;
in the second, neither From nor this can appear between From and this;
in the third, additionally, neither this nor point can appear between this and point;
etc.
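The difference between the lazy version and the lookahead version can be checked with a short Python sketch. The question targets VB.NET, so Python is only a stand-in here; the lookahead syntax is the same in both. The sample text is made up for illustration.

```python
import re

# Made-up text with an extra "From" above the sentence we want.
TEXT = "A note From Bob.\n\nFrom - This - Point - Everything: see below"

# Lazy version from the question: still starts at the first "From".
lazy = re.compile(r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:", re.IGNORECASE)

# Lookahead version: "From" and "this" may not occur between From and this.
tempered = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:",
    re.IGNORECASE,
)

print(repr(lazy.search(TEXT).group(0)))
print(repr(tempered.search(TEXT).group(0)))
```

The lazy pattern drags in "From Bob." because the match has to begin at the leftmost possible "From". The lookahead pattern fails at that position and succeeds at the second "From".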
I am trying to match a specific string, but only when it's not part of a couple specific literal strings. I wish to exclude results falling within the literal strings <span class='highlight'> and </span>. So if I search for "light", "high", "pan", "an", etc. I want to match any other occurrences that are not part of those two literals.
I'm not trying to parse full HTML, only those two strings listed, which will never change. The class value will never change from 'highlight'.
I have tried all manners of lookarounds, capturing groups, non-capturing groups, etc that I can think of and have come up with nothing. Lookarounds don't seem to be working, I'm betting because the position(s) of the string in relation to the cases to be excluded are not guaranteed to be in a certain order.
Is this possible with only regex?
Would this method work for you?
Search-and-replace those two tags with the empty string:
s/(<span class='highlight'>|<\/span>)//g
Search for your string
Of course, your search string might end up "around" one of those bits, e.g. searching for abcd and matching ab</span>cd. You could get around that by replacing the tags with some character sequence that you are sure cannot be searched for.
You'll also lose the context of the situation of the string you're looking for relative to those tags, but not knowing what you're trying to achieve exactly, it's difficult to say whether that is important for you or not.
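A minimal Python sketch of this strip-then-search idea, assuming the two tag strings are exactly as given in the question. The names TAGS and count_outside_tags are hypothetical, and the NUL character is an assumed placeholder that can never be part of a search term.

```python
import re

# The two literal strings from the question.
TAGS = re.compile(r"<span class='highlight'>|</span>")

def count_outside_tags(text, needle, placeholder="\x00"):
    """Count occurrences of needle, ignoring the two tag strings (hypothetical helper)."""
    # Replacing each tag with a placeholder keeps text on either side
    # of a tag from joining into an accidental match.
    stripped = TAGS.sub(placeholder, text)
    return len(re.findall(re.escape(needle), stripped))

sample = "a <span class='highlight'>light</span> span"
print(count_outside_tags(sample, "span"))   # 1
```

Passing placeholder="" gives the plain search-and-replace with the empty string, including the ab</span>cd caveat.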
Oops, I thought I was properly simplifying my question, but it turns out I was wrong. I inherited code that was taking a string and doing a regex replace on a list of search terms by looping through them one at a time and wrapping matches in <span class="highlight"></span>. That resulted in a phrase like "Look into the light" ending up looking incorrect if you searched for "the light". "the" was matched and replaced, then "light" was matched, but would match the newly replaced tag for "the". The trick wasn't to fix the regex that got run on each individual word, but to change it into a regex that processed all of them together. Rather than regex replace using the, then light, the regex just needed to be the|light.
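The single-pass idea can be sketched in Python as follows. The function name highlight is hypothetical, the original code base is not shown here, and the list of terms is assumed to be non-empty.

```python
import re

def highlight(text, terms):
    """Wrap every search term in a highlight span in one pass (hypothetical helper)."""
    # Longest terms first, so a longer term wins over a shorter one
    # that starts at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    # One substitution pass: inserted tags are never scanned again.
    return pattern.sub(
        lambda m: '<span class="highlight">' + m.group(0) + "</span>", text
    )

print(highlight("Look into the light", ["the", "light"]))
```

Because the text is scanned only once, the "light" inside the freshly inserted class="highlight" attribute is never seen by the regex.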
I'm trying to find names of people and companies (everything that is capitalized but not in the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.
This is what I've come up with so far:
[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
It has two problems:
It selects two characters too many in front of the hit.
In the sentence "Is this Beetle ugly?" it finds s Beetle which complicates the subsequent tagging.
When a capitalized word is preceded with an apostrophe or a colon, it isn't found. If possible I'd like to limit what characters are used for determining a sentence to just !?.
Here's the sample text I'm using to test it out:
John Adams is my hero. There's just no limits to his imagination! Is
this Beetle ugly? It sings at the: La Scala opera house. I have a
dream that I will find work at' Frame Store but not in the USA! This
way ILM could do whatever they pleased. ILM was very sweet. Visual
Effects did a good job... Neither did Animatronix?
I'm using jEdit (http://jedit.org) since I need something that works on both Windows and OS X.
Update: this now avoids matching at the start of the string.
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures it is not preceded by one of the !?. and a space OR by the start of a new row.
I tested it with jEdit.
Update to cover Names consisting of multiple words
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)
I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words that start with an uppercase letter. I also changed the + to a * to match the A in your example My company's called A Few Good Men. However, this change now causes the regex to match I as a name.
See tchrist's comment. Names are not a simple thing, and it gets really difficult if you want to cover the more complex cases.
This is also working
(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
But \p{P} covers all punctuation, and I understood that this is not what you want. Maybe you can find a property that fits your needs on regular-expressions.info/unicode.html.
Another mistake in your expression is the | in the character class. It's not needed: you are just adding this character to your class, so it will match words like U|S|A. Just remove it:
(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
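For readers who want to try the idea outside jEdit, here is a Python approximation run against the sample text from the question. Two things differ from the patterns above and are assumptions of this sketch. Python's built-in re module has no \p{Lu} or \p{Ll}, so ASCII classes stand in for them (the third-party regex module does support \p{...}). Python also only accepts fixed-width lookbehinds, so the start-of-string check is a separate lookbehind.

```python
import re

NAME = re.compile(
    r"(?<![!?.]\s)(?<!^)"          # not right after "! ", "? " or ". ", nor at the very start
    r"\b[A-Z][A-Za-z]+\b"          # a capitalized word of at least two letters
    r"(?:\s\b[A-Z][A-Za-z]+\b)*"   # optional further capitalized words
)

SAMPLE = (
    "John Adams is my hero. There's just no limits to his imagination! Is "
    "this Beetle ugly? It sings at the: La Scala opera house. I have a "
    "dream that I will find work at' Frame Store but not in the USA! This "
    "way ILM could do whatever they pleased. ILM was very sweet. Visual "
    "Effects did a good job... Neither did Animatronix?"
)

names = [m.group(0) for m in NAME.finditer(SAMPLE)]
print(names)
```

Requiring at least two letters per word keeps the pronoun I from being reported, at the cost of missing one-letter name parts such as the A in A Few Good Men.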
I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For an example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your initial version:
^(.*?(?=[:lower:])).*?$
The reluctant quantifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex tests, I just picked on Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
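As a rough cross-check in another engine, Alan's expression with the capturing group can be tried in Python on the four-line example from the question. Python's re module does not support POSIX classes such as [[:lower:]], so [a-z] stands in for it here; that substitution is an assumption of this sketch and only covers ASCII letters.

```python
import re

# [a-z] replaces [[:lower:]]; re.MULTILINE lets ^ and $ match at line breaks.
PAIR = re.compile(r"^(?!.*[a-z]).+\n((?=.*[a-z]).*)$", re.MULTILINE)

DATA = '''("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")'''

# Each match spans two lines: one without lowercase, then one with lowercase.
pairs = [m.group(0).split("\n") for m in PAIR.finditer(DATA)]
for first, second in pairs:
    print(first)
    print(second)
```

On this data the only match is the pair formed by the 3.2 line and the 3.3.1 line, i.e. the middle two lines.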
I still don't have a good solution for matching multiple lines.