Regular expression in Vim to match group capture - regex

I want to find the words which contain the same string repeated twice.
(e.g. wookokss(ok/ok), ccsssscc(ss/ss)).
I think the expression is \(\w*\)\0.
Another try is to find the words which consist of the same string repeated twice. My answer is \<\(\w*\)\0\>. (word beginning + grouping(word) + group capture + word ending)
But they don't work. Could anybody help me?

To find a string repeated twice in a word, which is longer than two characters, you can use
/\(\w\{2,}\)\1
To match a whole word which contains beforementioned string, you can use
/\<\w\{-}\(\w\{2,}\)\1\w\{-}\>
Little bit of explanation
\1 - matches the same string that was matched by the first sub-expression in \( and \) (\0 matches the whole matched pattern)
\{n,} - matches at least n of the preceding atom, as many as possible
\{-} - matches 0 or more of the preceding atom, as few as possible
\w - the word character ([0-9A-Za-z_])
\< - the beginning of a word
\> - the end of a word
More in :help pattern

1.) words which contain the same string repeated twice. (e.g. wookokss(ok/ok),
To find words containing two or more repeated word characters try
\(\w\{2,}\)\1
\1 matches what's captured in first group.
2.) find the words which consist of the same string repeated twice...
To capture \w\+ one or more word characters followed by \1 what's captured in first group
\<\(\w\+\)\1\>
should be about it. Have a look at this tutorial.

For the first one use (.{2,})\1 example here: https://regex101.com/r/gK0mM2/2
That is assuming that you only look for duplicate strings that have more than 1 character.
and for the second one ^(.{2,})\1$ example here: https://regex101.com/r/lC2yT7/2
Edit: changed the second expression, it now also looks for strings with at least 2 characters

Related

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
since i am using a character range it is matching one character at a time and only ignoring the last character before the : .
How do i ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead match and finds a match whenver there are at least two letters before a colon, returning the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed with :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12,, but not in 12:, and also 12c and 12_
\d+(?![\d:]) matches 12 in 12, and 12c and 12_, not in 12: only.
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Regex for replacing anything other than characters, more than one spaces and number only in end with empty char

I want to replace anything other than character, spaces and number only in end with empty string or in other words: we replace any number or spaces comes in-starting or in-middle of the string replace with empty string.
Example
**Input** **Output**
Ndd12 Ndd12
12Ndd12 Ndd12
Ndd 12 Ndd 12
Nav G45up Nav Gup
Attempted Code
regexp_replace(df1[col_name]), "(^[A-Za-z]+[0-9 ])", ""))
You may use:
\d+(?!\d*$)|[^\w\n]+(?!([A-Z]|$))
RegEx Demo
Explanation:
\d+(?!\d*$): Match 1+ digits that are not followed by 0+ digits and end of line
|: OR
[^\w\n]+(?!([A-Z]|$)): Match 1+ non-word characters that are not followed by an uppercase letter or and end of line
if you use python, you can use regular expressions.
You can use the re module.
import re
new_string = re.sub(r"[^a-zA-Z0-9]","",s)
Where ^ means exclusion.
Regular expressions exist in other languages. So it would be helpful to find a regular expression.
I came up with this regex to capture all characters that you want to remove from the string.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+
Do
regexp_replace(df1[col_name]), "^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+", ""))
Regex Demo
Explanation:
^\d+ - captures all digits in a sequence from the start.
(?<=\w)\d+(?![\d\s]) - Positive look behind for a word character with a negative look ahead for a number followed by space and capturing a sequence of digits in the middle. (Captures digits in G45up)
(?<=\s)\s+ - positive look behind for a space followed by one or more spaces, capturing all additional spaces.
Note : This regex could be inefficient when matching large strings as it uses expensive look-arounds.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+|(?<=\w)\W|\W(?=\w)|(?<!\w)\W|\W(?!\w)

Capturing uppercase words in text with regex

I'm trying to find words that are in uppercase in a given piece of text. The words must be one after the other to be considered and they must be at least 4 of them.
I have a "almost" working code but it captures much more: [A-Z]*(?: +[A-Z]*){4,}. The capture group also includes spaces at the start or the end of those words (like a boundary).
I have a playground if you want to test it out: https://regex101.com/r/BmXHFP/2
Is there a way to make the regex in example capture only the words in the first sentence? The language I'm using is Go and it has no look-behind/ahead.
In your regex, you just need to change the second * for a +:
[A-Z]*(?: +[A-Z]+){4,}
Explanation
While using (?: +[A-Z]*), you are matchin "a space followed by 0+ letters". So you are matching spaces. When replacing the * by a +, you matches spaces if there are uppercase after.
Demo on regex101
Replace the *s by +s, and your regex only matches the words in the first sentence.
.* also matches the empty string. Looking at you regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between every now and then.
You had to mark at least 1 upper case as [A-Z]*(?: +[A-Z]+){4,} see updated regex.
A better Regex will allow non spaces as [A-Z]*(?: *[A-Z]+){4,}.see better regex
* After will indicate to allow at least upper case even without spaces.

How can I replace only the first 2 matches per line, using regex in Notepad++

I'm trying to parse a list of filenames to a CSV file by converting the first 2 - characters per line into a |. The problem is that the filenames themselves also contain the character I'm searching for.
My raw data looks something like this:
12055371-1-Florence - BW Letter of Intent HB Comments 9-4-14-2.DOCX
12057668-2-EB-DUE-M- SBuxbaum FHA Benefit Plans-2.DOCX
12058210-1-Redline Letter of Intent-2.PDF
12058029-3-Florence Hospital--Order Establishing Bid Procedures-HB 9-23-14-2.DOCX
12058020-10-Florence - BW Letter of Intent 10,10,14 Revisions-2.DOCX
Using Notepadd++ to replace on the fly, but I'm not sure what regex will work to identify and replace these items.
Don't match -, match the beginning of the lines up to the second - :
match ^(.*?)-(.*?)-
replace by \1|\2|
Explanation :
^ matches the beginning of the line (0-width match).
(.*?) matches any character in a non-greedy way : if the next token of the regex can match, it will let it do so. The result is grouped so it can be referenced later.
\1 and \2 are back-references and refers to the two (.*?) groups.
Note : for efficiency you could replace the non-greedy matches by the negated class [^\-], which means every character but -, the - being escaped because it's a special character in this context. The groups would then become ([^\-]*). Of course it really does not matter if it's a one-time operation.

Finding a match one after another

How do I find multiple matches that are (and can only be) separated from each other by whitespaces?
I have this regular expression:
/([0-9]+)\s*([A-Za-z]+)/
And I want each of the matches (not groups) to be surrounded by a whitespace or another match. If the condition is not fullfilled, the match should not be returned.
This is valid: 1min 2hours 3days
This is not: 1min, 2hours 3days (1min and 2hours should not be returned)
Is there a simpler way of finding a continuous sequence of matches (in Java preferably) than repeating the whole regex before and after the main one, checking if there is a whitespace, start/end of the string or another match?
I believe this pattern will meet your requirements (provided that only a single space character separates your alphanumeric tokens):
(?<=^|[\w\d]\s)([\w\d]+)(?=\s|$)
^^^^^^^^^^ ^^^^^^^ ^^^^
(2) (1) (3)
A capture group that contains an alphanumeric string.
A look-behind assertion: To the left of the capture group must be a) the beginning of the line or b) an alphanumeric character followed by a single space character.
A look-ahead assertion: To the right of the capture group must be a) a space character or b) the end of the line.
See regex101.com demo.
Here is some sample data that I included in the demo. Each bolded alphanumeric string indicates a successful capture:
1min 2hours 3days
1min, 2hours 3days
42min 4hours 2days
String text = "1min 2hours 3days";
boolean match = text.matches("(?:\\s*[0-9]+\\s*[A-Za-z]+\\s*)*");
This is basically looking for a pattern on your example. Then using * after the pattern its looking for zero or more occurrence of the pattern in text. And ?: means doesn't capture the group.
This will will also return true for empty string. If you don't want the empty string to be true, then change * into +
I've mananged to solve my problem by splitting the string using string.split("\\s+") and then matching the results to the pattern /([0-9]+)\s*([A-Za-z]+)/.
There is an error here the '' will match all characters and ignore your rest
/([0-9]+)\s([A-Za-z]+)/
Change to
/(\d+)\s+(\w+)/g
This will return an array of matches either digits or word characters. There is no need to always write '[0-9]' or '[A-Za-z]' the same thing can be said as '\d' match any 0 to 9 more can be found at this cheat sheet regular expression cheat sheet