Find lines with same characters set

I have a situation like this:
Car Driver
Cat Mouse
Door House
Driver Car
I need help with a regex that finds all lines containing the same set of characters or words, no matter how they are ordered within the line. In this example it should find:
Car Driver
Driver Car
Edited list:
A0JLS3 Q9NUA2 <
A0JLT2 Q9Y3C7
A0N0L5 P26441
A0N0Q1 O00626
A0N0Q1 P35626
A0PJF8 P27361
Q9NUA2 A0JLS3 <

EDIT: after taking a look at your file, it seems that there is one tab character after the first word and a variable number of tab characters after the second, so you must change the pattern to:
^(\w+)\h+(\w+)\h*$(?=(?>\R.*)*?\R(?:\1\h+\2|\2\h+\1)\h*$)
where \h stands for a horizontal whitespace character.
Since you seem to have huge files, and I don't see how to avoid a reluctant quantifier in the lookahead assertion, you can try this modified pattern, in which all quantifiers are possessive (where possible) and all groups are atomic. It seems to be a little faster:
^(\w++)\h++(\w++)\h*+$(?=(?>\R.*+)*?\R(?>\1\h++\2|\2\h++\1)\h*+$)
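If regex speed remains a problem on very large files, a single non-regex pass is another option. The sketch below is in Python, which is an assumption (the thread itself only discusses regex patterns), and the function name lines_sharing_words is hypothetical. It keys each line by its sorted words, so word order and the number of tabs between words do not matter.

```python
from collections import defaultdict

def lines_sharing_words(text):
    """Group lines that contain the same words, in any order."""
    groups = defaultdict(list)
    for line in text.splitlines():
        words = line.split()  # splits on any run of spaces or tabs
        if words:
            # Sorting makes "A B" and "B A" produce the same key.
            groups[tuple(sorted(words))].append(line)
    # Keep only the keys that were seen on more than one line.
    return [group for group in groups.values() if len(group) > 1]

sample = "A0JLS3\tQ9NUA2\nA0JLT2\tQ9Y3C7\nQ9NUA2\tA0JLS3\t\t\n"
for group in lines_sharing_words(sample):
    print(group)
```

Unlike the lookahead pattern, this reads each line once, at the cost of keeping one dictionary entry per distinct word set in memory.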
Previous answer:
You can use this pattern:
^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$)
This will find lines that are followed, later in the text, by a "duplicate line" containing the same two words. If you want to use it to remove duplicates, keep in mind that this preserves the last occurrence and removes the first.
pattern details:
^(\w+) (\w+)$ : this describes a whole line (note the anchors for the start ^ and end $ of the line) and puts each word in a capturing group (group 1 and group 2).
The second part of the pattern checks whether there is a "similar line" (a line with the same words) later in the text. Since it is embedded in a lookahead assertion ((?=...), i.e. "followed by"), this part isn't included in the match result.
(?>\R.*)*? : the lines up to the duplicate. \R stands for a newline sequence such as CRLF or LF, and .* matches all characters except newlines. The group is repeated with a lazy quantifier so that it stops before the first duplicate line. (Note that this works with a greedy quantifier too; the best choice depends on what your document looks like. For example, if duplicates are often at the end of the document, a greedy quantifier is the better choice.)
(?:\1 \2|\2 \1) describes the two possibilities using backreferences to group 1 and 2.
$ is added to ensure that the last word is whole. (otherwise something like A0N0L5 P26441 ... A0N0L5 P26441XXX will succeed)
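To see the single-space pattern at work outside a text editor, here is a small Python port run against the sample from the question. It is a sketch under two assumptions: Python's re module has no \R, so the port uses \n and expects LF line endings, and it uses a plain non-capturing group instead of the atomic group, which older Python versions do not support.

```python
import re

# Port of ^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$) for LF-only text.
PATTERN = re.compile(
    r"^(\w+) (\w+)$(?=(?:\n.*)*?\n(?:\1 \2|\2 \1)$)",
    re.MULTILINE,  # let ^ and $ match at every line boundary
)

def lines_with_later_duplicate(text):
    """Return each line that has the same two words on a later line."""
    return [match.group(0) for match in PATTERN.finditer(text)]

text = "Car Driver\nCat Mouse\nDoor House\nDriver Car"
print(lines_with_later_duplicate(text))  # ['Car Driver']
```

Only the earlier line of each pair is reported, which is why removing the matches keeps the last occurrence.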

I'm not sure exactly what you are trying to achieve. If you're looking for all lines containing both of the words Car and Driver, you can mark all lines containing this regular expression:
Car Driver|Driver Car
Here's a guide on regular expressions in Notepad++: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
And consider taking a look at the Stack Overflow Regular Expressions FAQ for some more useful information.

Related

How to pick files based on a certain word in a string in NIFI using file filter regex [duplicate]

How do I create a regular expression to match a word at the beginning of a string?
We are looking to match stop at the beginning of a string and anything can follow it.
For example, the expression should match:
stop
stop random
stopping

If you wish to match only lines beginning with stop, use
^stop
If you wish to match lines beginning with the word stop followed by a space:
^stop\s
Or, if you wish to match lines beginning with the word stop, but followed by either a space or any other non-word character you can use (your regex flavor permitting)
^stop\W
On the other hand, the following matches a word at the beginning of a string in most regex flavors (in these flavors \w matches the opposite of \W):
^\w+
If your flavor does not have the \w shortcut, you can use
^[a-zA-Z0-9]+
Be wary that this second idiom will only match letters and numbers, no symbols whatsoever.
Check your regex flavor's manual to know which shortcuts are allowed and what exactly they match (and how they deal with Unicode).
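The difference between the first three patterns is easiest to see by running them against the three sample strings from the question. This is a minimal sketch using Python's re module (an assumption, since the question does not name a flavor); the helper matching is hypothetical.

```python
import re

samples = ["stop", "stop random", "stopping"]

def matching(pattern):
    """Return the samples in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]

for pattern in (r"^stop", r"^stop\s", r"^stop\W"):
    print(pattern, matching(pattern))
```

With these inputs, ^stop accepts all three strings, while ^stop\s and ^stop\W accept only "stop random", because both require one more character after stop.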

Try this:
/^stop.*$/
Explanation:
/ characters delimit the regular expression (i.e. they are not part of the regex per se)
^ means match at the beginning of the line
. followed by * means match any character (.), any number of times (*)
$ means to the end of the line
If you would like to enforce that stop be followed by a whitespace, you could modify the RegEx like so:
/^stop\s+.*$/
\s means any whitespace character
+ following the \s means there has to be at least one whitespace character following after the stop word
Note: Also keep in mind that the RegEx above requires that the stop word be followed by a space! So it wouldn't match a line that only contains: stop

If you want to match stop and everything after it on the line, and not only at the start of the line, you may use \bstop.*\b (the word followed by the rest of the line).
If you want to match only the word itself, anywhere in the string, use \bstop[a-zA-Z]* (only words starting with stop).
To match it only at the start of a line, use ^stop[a-zA-Z]* (the first word only).
For the whole line, use ^stop.* (without a multiline flag, this matches only on the first line of the string).
And if you want to match an entire string starting with stop, including newlines, use /^stop.*/s (the s flag lets . match newlines).
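How far ^stop.* reaches depends on the flags. The sketch below uses Python's re module as an assumed flavor, where re.DOTALL plays the role of the /s flag and re.MULTILINE lets ^ match at every line start.

```python
import re

text = "stop here\nand continue\nstop again"

# No flags: . stops at the first newline, ^ matches only at the string start.
first_line = re.match(r"^stop.*", text).group()

# DOTALL: . also matches newlines, so the match runs to the end of the string.
whole_string = re.match(r"^stop.*", text, re.DOTALL).group()

# MULTILINE: ^ matches at each line start, giving one match per "stop" line.
every_line = re.findall(r"^stop.*", text, re.MULTILINE)

print(repr(first_line))
print(repr(whole_string))
print(every_line)
```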

As @SharadHolani said, this won't match every word beginning with "stop", only one at the beginning of a line, as in "stop going".
@Waxo gave the right answer:
This one is slightly better if you want to match any word beginning with "stop" and containing nothing but letters from A to Z:
\bstop[a-zA-Z]*\b
This would match all
stop (1)
stop random (2)
stopping (3)
want to stop (4)
please stop (5)
But
/^stop[a-zA-Z]*/
would only match (1) through (3), but not (4) and (5).
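The claim can be checked directly against the five samples. A minimal sketch, assuming Python's re module as the flavor:

```python
import re

samples = ["stop", "stop random", "stopping", "want to stop", "please stop"]

anywhere = re.compile(r"\bstop[a-zA-Z]*\b")  # a word starting with "stop"
at_start = re.compile(r"^stop[a-zA-Z]*")     # only at the start of the string

hits_anywhere = [s for s in samples if anywhere.search(s)]
hits_at_start = [s for s in samples if at_start.search(s)]

print(hits_anywhere)
print(hits_at_start)
```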

If you want to match anything that starts with "stop" including "stop going", "stop" and "stopping" use:
^stop
If you want to match the word stop followed by anything as in "stop going", "stop this", but not "stopped" and not "stopping" use:
^stop\W

/stop[a-zA-Z]*/
will match any form of the word stop (stop, stopped, stopping, etc.), wherever it occurs.
However, if you just want to match "stop" at the start of a string
/^stop/
will do :D

If you want the word to start with "stop", you can use the following pattern.
"^stop.*"
This will match strings starting with stop, followed by anything.

/^stop.*$/i
The i flag makes the match case-insensitive.

I'd advise against a simple regular expression approach to this problem. There are too many words that are substrings of other unrelated words, and you'll probably drive yourself crazy trying to overadapt the simpler solutions already provided.
You'll want at least a naive stemming algorithm (try the Porter stemmer; there's available, free code in most languages) to process text first. Keep this processed text and the preprocessed text in two separate space-split arrays. Make sure each non-alphabetical character also gets its own index in this array. Whatever list of words you're filtering, stem them also.
The next step would be to find the array indices which match to your list of stemmed 'stop' words. Remove those from the unprocessed array, and then rejoin on spaces.
This is only slightly more complicated, but will be much more reliable an approach. If you've got any doubts on the value of a more NLP-oriented approach, you might want to do some research into clbuttic mistakes.
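The steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the language (Python), the function names, and above all naive_stem, which is a crude stand-in for a real Porter stemmer and only strips a few suffixes.

```python
import re

def naive_stem(word):
    """Very rough stand-in for a real stemmer such as Porter's."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final letter: "stopp" -> "stop".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def remove_words(text, unwanted):
    """Drop every token whose stem matches the stem of an unwanted word."""
    unwanted_stems = {naive_stem(w) for w in unwanted}
    # Words and individual non-word characters each get their own index.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    stems = [naive_stem(t) for t in tokens]  # parallel, processed array
    kept = [t for t, s in zip(tokens, stems) if s not in unwanted_stems]
    return " ".join(kept)

print(remove_words("Please stop stopping, unstoppable one", ["stopped"]))
```

Because tokens are compared by stem rather than by substring, "unstoppable" is left alone while "stop" and "stopping" are removed. Rejoining on spaces does leave a space before punctuation.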

You can try this:
https://regex101.com/r/P3qfKG/1
reg = /stop(\w+| [^ ]+|$)/gm
It matches stop on its own, a word that starts with stop, and stop together with the word that follows it.

Regular Expression in sas, not matching a word after a matching word

Maybe this is easy, but I could not find a solution.
I am working in SAS 9.3 with Perl regular expressions.
I am looking for a regular expression that matches certain words only when they are not followed by a specific other word. For example, it should match every text that contains "the car" with no "not" anywhere after it. (Case can be ignored, because I upcase everything in my code.)
Should match
This is not the car i want
The car is green
should not match
The car is not green
This is the car i want, but its not available
One solution would be to split it into two matches:
prxmatch("/The car/",mytext) > 0 and prxmatch("/The car.+not/",mytext)=0
But I have to use this logic many times, also in more complex cases, so I don't want to use two prxmatch calls every time; I would rather combine the logic in a single prxmatch.
I read a lot about lookaheads and tried some examples, but they did not work correctly, e.g.:
"/The Car.+[^(not)]/"
or
"/The Car.+(?!not)/"
or
"/^(?!.*not.*).*?The car.*$/"
The first and second return all four texts; the third returns no result at all.
So can somebody provide a solution for this: a simple "not" operator for a word, or a correct lookahead/lookbehind approach?

You can use
(?im)^.*\bthe car\b(?!.*\bnot\b).*
Pattern breakdown:
(?im)- enable case-insensitive and multiline matching modes
^ - start of a line (since (?m) is used)
.* - match 0+ any characters but a newline
\bthe car\b - 2 whole words "the car" (a sequence of 2 words)
(?!.*\bnot\b) - a negative lookahead that fails the match if there is a whole word "not" somewhere to the right of the car
.* - the rest of the line up to the newline or end of string
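The pattern can be checked against the four sentences from the question. The sketch below runs it in Python rather than SAS (an assumption made only to keep the example runnable); the pattern itself is unchanged.

```python
import re

pattern = re.compile(r"(?im)^.*\bthe car\b(?!.*\bnot\b).*")

texts = [
    "This is not the car i want",                     # "not" comes before
    "The car is green",                               # no "not" at all
    "The car is not green",                           # "not" comes after
    "This is the car i want, but its not available",  # "not" comes after
]

results = [bool(pattern.search(t)) for t in texts]
for text, matched in zip(texts, results):
    print(matched, text)
```

The first two sentences match and the last two do not, because the lookahead only inspects what follows "the car".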

Vim multiline regex gives overlapping matches

I was surprised when I noticed that my greedy multiline regex was giving overlapping matches in Vim. The regex is designed to match an entire block of text, or consecutive non-blank lines.
The regex apparently matched everything I expected it to (highlight looked correct), but when using n to skip to the next match instead of skipping to the next block, it went to the next line in the current block.
Here is the regex I was using (equivalent to (.+\n){1,} for most regex engines):
\(.\+\n\)\{1,}
This should match at least one non-empty line, and as many consecutive non-empty lines as possible, here is an example text file:
block 1
some stuff
more stuff
block 2
foo bar
baz qux
After applying this regex (typing /\(.\+\n\)\{1,} and pressing Enter), the two blocks are highlighted correctly, and I expect only two matches of the regex, one for each block. However, when I press n to advance to the next match, it appears that each non-empty line matches the regex: my cursor starts on the first line, n takes it to the second line, then the third, then to the start of block 2, and so on.
How can I change my regex so that I see the expected behavior of each block being a single match so that n advances to the next block, instead of the next line?
I am also interested in knowing if this behavior is in the documentation somewhere, or if there is an option to change this behavior. Note that when using the same regex in a search/replace the behavior is what I expect (replacement would only be applied twice, once for each block).

The following regex seems to work:
\(\%^\|^\n\)\zs\(.\+\n\)\+
Explanation:
\( # start of group
\%^ # beginning of file
\| # OR
^\n # a blank line
\) # end of group
\zs # start matching here
\(.\+\n\)\+ # at least one non-blank line
By using the very magic option the length can be reduced a bit:
\v(%^|^\n)\zs(.+\n)+
Looking forward to seeing if anyone can come up with a shorter solution!
zigdon's answer helped me to understand better why the behavior is the way it is. When n is used to jump to the next match it searches for the first match of the regex from the cursor's current position, even if the next matching position was included in the previous match. This is why anchoring the regex to the start of the block appears to be necessary.
Thanks to Nolen Royalty for helping me get rid of an unnecessary lookahead in the first group.
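For comparison, a scan that resumes after the end of each match, as the question reports for search/replace, yields exactly one match per block. The sketch below shows this with Python's re.finditer, which is only an analogy and not Vim code; the sample text assumes a blank line between the two blocks.

```python
import re

text = (
    "block 1\nsome stuff\nmore stuff\n"
    "\n"
    "block 2\nfoo bar\nbaz qux\n"
)

# Python equivalent of (.+\n){1,}: one or more consecutive non-empty lines.
block = re.compile(r"(?:.+\n)+")

# finditer resumes after the end of each match, so matches never overlap.
blocks = [m.group(0) for m in block.finditer(text)]

for b in blocks:
    print(repr(b))
```

Here the unanchored pattern is enough because each search starts where the previous match ended. Vim's n starts from the cursor instead, which is why the pattern above anchors the start of the block.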

Since your match says "match one or more non-empty lines", it can certainly match multiple times within the same paragraph. To fix this, you can specify that the cursor should be placed at the end of the match; this means the next match will start from the end of the paragraph. You can do this with the \zs zero-width character, available in Vim:
\zs Matches at any position, and sets the start of the match there: The
next char is the first char of the whole match. |/zero-width|
So your match will become:
\(.\+\n\)\{1,}\zs

Regex: optimal syntax for optional combined expression?

I want to match a combination of expressions that is optional. In this specific example, I want to match on the word through. Also, if the word jump or swim precedes through (separated by whitespace), then the whole phrase should match. So the combination of expressions preceding through must be optional.
I want all the following lines to be positive matches:
swim through <-- match entire phrase
jump through <-- match entire phrase
hike through <-- match only the word "through"
To do this, I can use the following expression:
(jump\W|swim\W)?through
However, is it possible to accomplish the same thing without having to add \W after jump and swim? I was trying something like this:
(jump|swim)?\W?through
But that wasn't working properly because it would include the space that precedes through on the 3rd example. I only want the word through, not the whitespace around it.

What about this one: (?:(jump|swim)\W)?through
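A quick check of that pattern against the three lines from the question, sketched in Python (an assumed flavor; the question does not name one):

```python
import re

pattern = re.compile(r"(?:(jump|swim)\W)?through")

lines = ["swim through", "jump through", "hike through"]
matches = [pattern.search(line).group() for line in lines]

print(matches)  # ['swim through', 'jump through', 'through']
```

Because \W sits inside the optional group, the separator is consumed only together with jump or swim, so the third line yields the bare word without its leading space.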
