Finding a substring using regex

Disclaimer: This question comes more from curiosity and a wish to learn a bit more about regex; I know it can be achieved with other methods.
I have a string that represents a list, like so: "egg,eggplant,orange,egg", and I want to search for all the instances of the item egg in this list.
I can't search for the substring egg, because it would also return eggplant.
So, I tried to write a regex to solve this and arrived at ((?:^|\w+,)egg(?:$|,\w+))+
Basically, it searches for the word egg at the beginning of the string, the end of the string and in-between commas (while making sure those aren't trailing commas).
And it works fine, except this edge case: "egg,eggplant,egg"
Based on this site, I can see that the first egg is matched, but then the regex engine continues until the last comma. For the last egg, the remaining string is ,egg, which doesn't match…
So, what can I do to fix the expression and find all the instances of a word in a string that represents a list?

You can use
(?<![^,])egg(?![^,])
Or its less efficient equivalent:
(?<=,|^)egg(?=,|$)
See the regex demo. Details:
(?<![^,]) - a negative lookbehind that requires start of string or comma to appear immediately to the left of the current location
egg - a word
(?![^,]) - a negative lookahead that requires end of string or comma to appear immediately to the right of the current location.
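As a quick check of the lookaround pattern above, here is a minimal Python sketch. The function name find_item_positions and the use of re.escape to plug in an arbitrary item are illustrative assumptions, not part of the original answer. Only the negative-lookaround form is used, because Python's re module requires fixed-width lookbehinds and would reject (?<=,|^).

```python
import re

def find_item_positions(text, item):
    """Return the start offsets of `item` where it is a whole entry in a comma-separated string."""
    # (?<![^,]) : no non-comma character immediately to the left
    # (?![^,])  : no non-comma character immediately to the right
    pattern = re.compile(r"(?<![^,])" + re.escape(item) + r"(?![^,])")
    return [m.start() for m in pattern.finditer(text)]

print(find_item_positions("egg,eggplant,egg", "egg"))         # [0, 13]
print(find_item_positions("egg,eggplant,orange,egg", "egg"))  # [0, 20]
```

Because the lookarounds consume no characters, the comma between two entries is never "used up" by one match, which is what broke the original expression on the edge case.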

Related

Get Text Starting From Last Occurrence of Certain Substring Leading to Match

Given a long string that generally follows this syntax:
/C=US/foo=bar/var=1/CN=JONES.FRED.R.0123456789:xxj31ZMTZzkVA
/C=US/foo=pop/var=2/CN=BLAKE.DAPHNE.P.1234567890:xxj31ZMTZzkVA
/C=US/foo=bit/var=8/CN=BINKLEY.VELMA.W.2345678901:xxj31ZMTZzkVA
/C=US/foo=hat/var=17/CN=ROGERS.SHAGGY.N.3456789012:xxj31ZMTZzkVA
/C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123:xxj31ZMTZzkVA
I want to capture what follows the previous occurrence of "/C=US/" that leads up to the last name + dot + first name that follows "CN=", and finally the text that precedes the colon (:). The last name, dot, and first name are not hard-coded but rather passed in from a variable.
For example, given "DOO.SCOOBY", I want to extract this text:
/C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123
Here is the Regex I am using:
(?<=\/C=US\/)(.*?)(?=DOO.SCOOBY)+(.*?)+:
The problem is, it extracts ALL of the text preceding the match of "DOO.SCOOBY" to the colon, except for the very first "/C=US/". So, I nearly get the entire string back. It's also important to note there are no linebreaks or spaces in this string; it is all bunched together. How can I get text that only goes back as far as the previous "/C=US/"? I've searched plenty on regexes and specifically this scenario, but can't seem to find anything. It looks like I need to implement the positive lookbehind correctly.
You can use
\/C=US\/(?:(?!\/C=US\/).)*?DOO\.SCOOBY[^:]*
See the regex demo.
Details:
\/C=US\/ - a /C=US/ string
(?:(?!\/C=US\/).)*? - any single char, other than line break chars, zero or more but as few as possible occurrences, that does not start a /C=US/ substring
DOO\.SCOOBY - a DOO.SCOOBY string
[^:]* - zero or more chars other than :.
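Since the name is passed in from a variable, the pattern can be assembled at run time. The following Python sketch assumes a hypothetical helper extract_entry and a shortened two-entry sample string; re.escape takes care of the dot in the name.

```python
import re

def extract_entry(text, name):
    """Return the /C=US/ entry whose CN contains `name`, up to (not including) the colon."""
    pattern = (
        r"/C=US/"               # start of an entry
        r"(?:(?!/C=US/).)*?"    # any chars, as long as no new /C=US/ begins
        + re.escape(name)       # the literal name, with the dot escaped
        + r"[^:]*"              # the rest of the CN, up to the colon
    )
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Shortened sample: two entries run together with no line breaks.
data = (
    "/C=US/foo=bar/var=1/CN=JONES.FRED.R.0123456789:xxj31ZMTZzkVA"
    "/C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123:xxj31ZMTZzkVA"
)
print(extract_entry(data, "DOO.SCOOBY"))
# /C=US/foo=jam/var=39/CN=DOO.SCOOBY.D.4567890123
```

The tempered part (?:(?!/C=US/).)*? is what stops the match from reaching back past the nearest /C=US/.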

Regex to delete everything behind first letter

I have a regex \b\d+\K[a-z] Replace with: \u$0
This capitalizes the first letter that follows the numbers, for example:
123host
1643domain
into
123Host
1643Domain
What I need to figure out now is how can I delete the numbers.
So I need:
123host
to become
host
and so on. All entries have numbers in front of them, like this:
6410james
599stacks
Into
james
stacks
I tried \b\d+\K[a-z] replaced with nothing, but it just deletes the first letter. I'm a total noob, and any help would be appreciated.
You can simply find \d+ or [0-9]+ and replace it with an empty string, if all samples have the digits at the start. ^\d+ or ^[0-9]+ would also work for these cases; however, it would not remove digits that appear after the letters.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
The pattern you probably want to search for is:
^[^a-zA-Z]*
and then replace with an empty string. This is a literal translation of the requirement to remove every non-letter character from the start of the string.
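For comparison, here is a small Python sketch of the suggested replacements (the function names are illustrative). Note that \K is not supported by Python's re module; in engines that do support it, \K discards the digits from the reported match, which is why replacing \b\d+\K[a-z] with nothing removes only the letter.

```python
import re

def strip_leading_non_letters(entry):
    """Remove everything before the first ASCII letter."""
    return re.sub(r"^[^a-zA-Z]*", "", entry)

def strip_leading_digits(entry):
    """Remove digits only at the start of the string."""
    return re.sub(r"^\d+", "", entry)

def strip_all_digits(entry):
    """Remove every run of digits, wherever it appears."""
    return re.sub(r"\d+", "", entry)

for entry in ["123host", "1643domain", "6410james", "599stacks"]:
    print(strip_leading_non_letters(entry))

# The anchored and unanchored digit patterns differ once digits follow the letters:
print(strip_leading_digits("123host4"))  # host4
print(strip_all_digits("123host4"))      # host
```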

How to match text which the part of it is already matched previous?

I have a string like aaa**b***c****ddd, and I want to get the sequence of matches for the pattern [^*]\*+[^*], which I think should be [a**b, b***c, c***d]. However, when I test this in a text editor like Vim or Emacs, the second one (b***c) is not matched.
aaa**b***c***ddd
  |--|   |---|
  first  third
     |---|
     second, which I think should be matched but is not
How should I modify the regular expression to match the second?
Yes, you can. The trick is to put the whole pattern in a capturing group inside a lookahead, which allows overlapping results:
(?=([^*]\*+[^*]))
But you can't use this to do replacements, since the pattern itself matches only an empty string (unless you can get the capture group length and the current offset).
EDIT:
In Vim, it seems possible to obtain the capture group length with strlen(submatch(1)).
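To see the difference between the plain pattern and the lookahead-wrapped one, here is a minimal Python sketch (Python rather than Vim or Emacs, purely for illustration), run on the string from the diagram in the question:

```python
import re

text = "aaa**b***c***ddd"

# Plain pattern: each match consumes its last character, so "b" cannot start the next match.
plain = re.findall(r"[^*]\*+[^*]", text)

# Lookahead with a capturing group: the match itself is empty, so nothing is consumed
# and every starting position is tried; findall returns the contents of group 1.
overlapping = re.findall(r"(?=([^*]\*+[^*]))", text)

print(plain)        # ['a**b', 'c***d']
print(overlapping)  # ['a**b', 'b***c', 'c***d']

# Offsets of each overlapping hit are available from the group, not from the match.
spans = [m.span(1) for m in re.finditer(r"(?=([^*]\*+[^*]))", text)]
print(spans)        # [(2, 6), (5, 10), (9, 14)]
```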
@CommuSoft is correct. One way to approach this problem is to match the whole string against this regex and then, the second time around, match the regex against the substring that starts at (index_of_first_previous_match + 1) and runs to the end of the string. Hope that is clear.
So if the index of your first match above (a**b) was 2, the substring that you match against the regex the second time should run from index 3 to the end of the string. This will give you the two results.
However, Casimir's answer is much simpler.
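The restart-one-past-the-previous-match approach described in this answer could be sketched in Python as follows. The helper name overlapping_matches is hypothetical, and the loop simply repeats the restart until no further match is found.

```python
import re

def overlapping_matches(pattern, text):
    """Collect matches by restarting the search one character after each match start."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while True:
        m = compiled.search(text, pos)
        if m is None:
            break
        results.append(m.group(0))
        pos = m.start() + 1  # index_of_previous_match + 1
    return results

print(overlapping_matches(r"[^*]\*+[^*]", "aaa**b***c***ddd"))
# ['a**b', 'b***c', 'c***d']
```

Unlike the lookahead version, each result here is a real, non-empty match, at the cost of driving the search loop by hand.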

Find lines with same characters set

I have situation like this.
Car Driver
Cat Mouse
Door House
Driver Car
I need help with a regex to find all lines that contain the same set of characters or words, no matter how they are placed in the line.
Car Driver
Driver Car
Edited list:
A0JLS3 Q9NUA2 <
A0JLT2 Q9Y3C7
A0N0L5 P26441
A0N0Q1 O00626
A0N0Q1 P35626
A0PJF8 P27361
Q9NUA2 A0JLS3 <
EDIT: after taking a look at your file, it seems that there is one tab character after the first word and a variable number of tab characters after the second, so you must change the pattern to:
^(\w+)\h+(\w+)\h*$(?=(?>\R.*)*?\R(?:\1\h+\2|\2\h+\1)\h*$)
where \h stands for a horizontal whitespace character.
Since you seem to have huge files and I don't see how to avoid a reluctant quantifier in the lookahead assertion, you can try this modified pattern, where all the quantifiers are possessive (when possible) and all groups are atomic. It seems to be a little faster:
^(\w++)\h++(\w++)\h*+$(?=(?>\R.*+)*?\R(?>\1\h++\2|\2\h++\1)\h*+$)
Previous answer:
You can use this pattern:
^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$)
This will find lines that have a "duplicate line" with the same two words later in the text. If you want to use it to remove duplicates, keep in mind that this will preserve the last occurrence and remove the earlier ones.
pattern details:
^(\w+) (\w+)$ : this describes a whole line (note the anchors for start ^ and end $ of the line) and put each word in a capturing group (group 1 and group 2)
The second part of the pattern checks whether there is a "similar line" (a line with the same words) later on. Since it is embedded in a lookahead assertion ((?=...), i.e. "followed by"), this part isn't included in the match result.
(?>\R.*)*?: the lines up to the duplicate. \R stands for a line break (such as CRLF or LF), and .* matches all characters except newlines. The group is repeated with a lazy quantifier to stop at the first duplicate line. (Note that this works with a greedy quantifier too; the best choice depends on what your document looks like. For example, if duplicates are often at the end of the document, a greedy quantifier is a better choice.)
(?:\1 \2|\2 \1) describes the two possibilities using backreferences to groups 1 and 2.
$ is added to ensure that the last word is whole (otherwise something like A0N0L5 P26441 ... A0N0L5 P26441XXX would match).
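The same idea can be tried outside the editor. The sketch below is a Python adaptation, not the original pattern: Python's re module has no \R or \h, so it assumes "\n" line endings and uses [ \t] for horizontal whitespace, and it uses a plain non-capturing group instead of an atomic one.

```python
import re

# A two-word line, followed (lookahead only) by a later line with the same two words in either order.
PAIR = re.compile(
    r"^(\w+)[ \t]+(\w+)[ \t]*$"
    r"(?=(?:\n.*)*?\n(?:\1[ \t]+\2|\2[ \t]+\1)[ \t]*$)",
    re.MULTILINE,
)

def lines_with_later_duplicate(text):
    """Return (word1, word2) for each line whose word pair appears again further down."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]

sample = "Car Driver\nCat Mouse\nDoor House\nDriver Car"
print(lines_with_later_duplicate(sample))  # [('Car', 'Driver')]
```

As in the original, only the earlier line of each pair is reported, because the lookahead searches forward only. On very large inputs the repeated forward scan may be slow.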
I'm not sure exactly what you are trying to achieve. If you're looking for all lines containing both of the words Car and Driver, you can mark all lines containing this regular expression:
Car Driver|Driver Car
Here's a guide on regular expressions in Notepad++: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
And consider taking a look at the Stack Overflow Regular Expressions FAQ for some more useful information.

Regular Expression for matching a single digit followed by a word exactly in Notepad++

:Statement
Say we have the following three records and want to match only the first one, i.e. exactly one digit followed by a specific word. What regular expression can be used to do this (in Notepad++)?
2Cups
11Cups
222Cups
The expressions I tried and their problems are:
Proposal 1:\d{1}Cups
it will find the "1Cups" and "2Cups" substrings in the second and third record respectively, which is what we do not want here
Proposal 2:[^0-9]+[0-9]Cups
same as the above
(PS: the records can be "XX 2Cups", "YY22Cups" and "XYZ 333Cups", i.e., no assumption on the position of the matchable parts)
Any suggestions?
:Reference
[1] The reg definition in NotePad++ (Same as SciTe)
As mentioned in Searching for a complex Regular Expression to use with Notepad++, it is: http://www.scintilla.org/SciTERegEx.html
[2] Matching exact number of digits
Here is an example: regular expression to match exactly 5 digits.
However, we do not want to find the match-able substring in longer records here.
If the string actually has the numbered sequence (1. 2Cups 2. 11Cups), you can use the white space that follows it:
\s\d{1}Cups
If there isn't the numbered list before, but the string will be at the beginning of the line, you can anchor it there:
^\d{1}Cups
Tested in Notepad++ v6.5.1 (Unicode).
It sounds like you want to match the digit only at the start of the string or if it has a space before it, so this would work:
(^|\b)\dCups
Explanation:
(^|\b) Match the start of the string or beginning of a word (technically, word break)
\d Match a digit ({1} is redundant)
Cups Match Cups
This will work:
\b\dCups
If "Cups" must be a whole word (i.e. not matching 2Cupsizes):
\b\dCups\b
Note that \b matches even if at start or end of input.
I found one possible solution:
Using ^\d{1}Cups to match the "starts with one digit + Cups" cases, as suggested by Ken, Cottrell and Bohemian.
Using [^\d]\dCups to match the other cases.
However, I haven't found a solution that uses just one regex yet.
Have a try with:
(?:^|\D)\dCups
This matches a single digit followed by Cups only when no digit comes before it.
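To compare the suggestions on the sample records, here is a small Python sketch (Notepad++ uses a different engine, so treat this as an illustration only). The (?<!\d) lookbehind variant is an assumption added here for comparison and does not come from the answers above.

```python
import re

records = ["2Cups", "11Cups", "222Cups", "XX 2Cups", "YY22Cups", "XYZ 333Cups"]

consuming = re.compile(r"(?:^|\D)(\dCups)")  # the preceding non-digit becomes part of the match
lookbehind = re.compile(r"(?<!\d)\dCups")     # assumed variant: zero-width "no digit before"
word_boundary = re.compile(r"\b\dCups")       # requires a word boundary before the digit

def matching_records(pattern, items):
    """Return the records in which `pattern` is found."""
    return [item for item in items if pattern.search(item)]

print(matching_records(consuming, records))   # ['2Cups', 'XX 2Cups']
print(matching_records(lookbehind, records))  # ['2Cups', 'XX 2Cups']

# With (?:^|\D) the match includes the character before the digit; group 1 isolates "2Cups".
m = consuming.search("XX 2Cups")
print(repr(m.group(0)), repr(m.group(1)))     # ' 2Cups' '2Cups'

# \b differs when a letter directly precedes the digit: there is no boundary between "Y" and "2".
print(word_boundary.search("YY2Cups"))        # None
print(lookbehind.search("YY2Cups").group(0))  # 2Cups
```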