QRegExp match lines containing N words all at once, but regardless of order (i.e. Logical AND) - c++

I have a file containing many lines of text, and I want to match only those lines that contain a number of words. All words must be present in the line, but they can come in any order.
So if we want to match one, two, three, the first 2 lines below would be matched:
three one four two <-- match
four two one three <-- match
one two four five
three three three
Can this be done using QRegExp (without splitting the text and testing each line separately for each word)?

Yes it is possible. Use a lookahead. That will check the following parts of the subject string, without actually consuming them. That means after the lookahead is finished the regex engine will jump back to where it started and you can run another lookahead (of course in this case, you use it from the beginning of the string). Try this:
^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$
The negated character classes [^\r\n] make sure that we can never look past the end of the line. Because the lookaheads don't actually consume anything for the match, we add the [^\r\n]* at the end (after the lookaheads) and $ for the end of the line. In fact, you could leave out the $, due to greediness of *, but I think it makes the meaning of the expression a bit more apparent.
Make sure to use this regex with multi-line mode (so that ^ and $ match the beginning of a line).
EDIT:
Sorry, QRegExp apparently does not support multi-line mode m:
QRegExp does not have an equivalent to Perl's /m option, but this can be emulated in various ways for example by splitting the input into lines or by looping with a regexp that searches for newlines.
It even recommends splitting the string into lines, which is what you want to avoid.
Since QRegExp also does not support lookbehinds (which would help emulating m), other solutions are a bit more tricky. You could go with
(?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*)
Then the line you want should be in capturing group 1. But I think splitting the string into lines might make for more readable code than this.

You can use the MultilineOption PatternOption from the new Qt5 QRegularExpression like:
QRegularExpression("\\w+", QRegularExpression::MultilineOption)

Related

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=**Not**NeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)|(?<=**Not**NeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we capture it and dont use it any further. We could capture it with parenthesis and use it as another group, but that's out of scope here
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often times this is a file extension (i.e .c, .ts, or .json), or a top level domain (i.e. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point it is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. Its a variance because the amount of white-space that needs to be match against varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often times when working with regular expressions its necessary to work in reverse.
We will start at the end of the problem described above, and work backwards, henceforth; we are going to start at the The Before Variance (or #3)
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespce with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAS Regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine to not be concerned with the case of the letter, in other words; the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as: /\s+(?=\.com)/gi &/or /\s+(?=.COM)/gi. Everyone of the "Just Listed" regular expressions are equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead's assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though, only one match was returned in the end, and it was the correct match, when implemented in a program, or website, that's running in production, you can pretty much guarantee that the regex is not only going to fail, but its going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

Using regex to match multiple comma separated words

I am trying to find the appropriate regex pattern that allows me to pick out whole words either starting with or ending with a comma, but leave out numbers. I've come up with ([\w]+,) which matches the first word followed by a comma, so in something like:
red,1,yellow,4
red, will match, but I am trying to find a solution that will match like like the following:
red, 1 ,yellow, 4
I haven't been able to find anything that can break strings up like this, but hopefully you'll be able to help!
This regex
,?[a-zA-Z][a-zA-Z0-9]*,?
Matches 'words' optionally enclose with commas. No spaces between commas and the 'word' are permitted and the word must start with an alphanumeric.
See here for a demo.
To ascertain that at least one comma is matched, use the alternation syntax:
(,[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*,)
Unfortunately no regex engine that i am aware of supports cascaded matching. However, since you usually operate with regexen in the context of programming environments, you could repeatedly match against a regex and take the matched substring for further matches. This can be achieved by chaining or iterated function calls using speical delimiter chars (which must be guaranteed not to occur in the test strings).
Example (Javascript):
"red, 1 ,yellow, 4, red1, 1yellow yellow"
.replace(/(,?[a-zA-Z][a-zA-Z0-9]*,?)/g, "<$1>")
.replace(/<[^,>]+>/g, "")
.replace(/>[^>]+(<|$)/g, "> $1")
.replace(/^[^<]+</g, "<")
In this example, the (simple) regex is tested for first. The call returns a sequence of preliminary matches delimted by angle brackets. Matches that do not contain the required substring (, in this case) are eliminated, as is all intervening material.
This technique might produce code that is easier to maintain than a complicated regex.
However, as a rule of thumb, if your regex gets too complicated to be easily maintained, a good guess is that it hasn't been the right tool in the first place (Many engines provide the x matching modifier that allows you to intersperse whitespace - namely line breaks and spaces - and comments at will).
The issue with your expression is that:
- \w resolves to this: [a-zA-Z0-9_]. This includes numeric data which you do not want.
- You have the comma at the end, this will match foo, but not ,foo.
To fix this, you can do something like so: (,\s*[a-z]+)|([a-z]+\s*,). An example is available here.

positive look ahead and replace

Recently I'm writing/testing regexps on https://regex101.com/.
My question is: Is it possible to do a positive look-ahead AND a replacement in the same "replacement"? Or just limited kind of replacement is possible.
Input is several lines with phone numbers. Let's say the correct phone number where the number of "numbers" are 11. No matter how the numbers are divided/group together with - / characters, no matter if starts with + 00 or it is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
Positive look-ahead able to check if from the beginning until the end of line there are only 11 digits, regardless how many other, above specified character separating them. This works perfectly.
Where the positive look-ahead check is fine, I would like to delete every character but numbers. The replacement works fine until I'm not involving look-ahead.
Checking the regexp itself working perfectly ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
Checking the replace part works perfectly (replace to nothing):
[^\d\n]
Put this into look-ahead, after the deletion of non new-line and non-digit characters from the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even I put the ^ $ into look-ahead, seems the replacement working only from beginning of the lines until the very first digit.
I know in real life the replacement and the check should/would go separate ways, however I'm curious if I could mix look-ahead/look-behind with string operations like replace, delete, take the string apart and put together as I like.
UPDATE: This is what would do the trick, however I feel this one "ugly" a bit. Is there any prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version which I asked originally, where only digits remains: regex101.com/r/yT5dA4/3
You cannot replace/delete text with regex. Regex is just a tool for matching certain strings and then taking certain action depending on the matching text, eg. perform a substitution, retrieve the second capture group.
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
If you want to read up more on conditionals in regex you can do so here.

"Multiple pass" RegEx for fixing space indentations

Searched and found some apparently similar questions, that weren't quite.
I often find myself needing to replace leading 4-space indentations with tabs. I always do this with RegEx ^(\t*) {4}, replacing with $1\t. And then I just do multiple passes to catch nested indents. It works, it's easy. But I'm wondering, is it possible to write a RegEx that can do this in one pass (to handle nested indents)?
EDIT
Apologies for lack of input/output examples, I was in a hurry. Here's an example, let s mean space and t mean tab:
SMA
ssssRTP
ssssssssATR
ssssssssOLN
ssssOWH
ssssERE
TOGO
Output:
SMA
tRTP
ttATR
ttOLN
tOWH
tERE
TOGO
Essentially, the RegEx would need to allow for arbitrarily deeply nested chunks of 4 spaces. It does not need to allow for tabs following spaces in the initial input.
PCRE
(^\t*|\G) {4} replace with $1\t or (^|\G)( {4}|\t) replace with \t. You should use multiline mode.
How this works:
^\t* — this match start of string followed by any numbers of tabs.
\G — this match end of previous match.
​ {4} — this match four spaces.
So this regular expression match four spaces at the start of string or four spaces following four spaces already matched by this regular expression.
Tested this with .NET's regex engine. JavaScript's (at least Mozilla's) won't work, though; it relies on lookbehind, which isn't available. PCRE wants fixed-length lookbehinds, so this won't work there either, unfortunately.
(?<=^( {4}|\t)*) {4}
Basic idea is to match four spaces preceded by the beginning of a line plus all the spots where a previous match would naturally go. Since replacement is done atomically, there's no chance of missing such a spot; all such matches are gathered at once. Then make sure you're using Multiline flag and replace with a single tab character and you're good to go.
Test data, which is just random pseudocode in a vaguely Pythonesque style:
def a:
return true
# comment with embedded spaces etc.

Matching line without and with lower-case letters

I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For a example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, I'd assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your inital version:
^(.*?(?=[:lower:])).*?$
The reluctant qualifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex tests, I just picked on Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.