Adjustment to this code to stop after finding two words - regex

In my haste to get this working, I failed to ask in my original post (Grab first 4 characters of two words RegEx) how to stop after the second word.
If I have Awesome Sauce Today, I would like to get AwesSauc.
The code in my first post captures the first 4 characters of every word and combines them, so Awesome Sauce Today becomes AwesSaucToda. I want it to stop capturing after the second word. In my example, Today would be ignored, but the first 4 characters of the first two words encountered would still be captured to create the new word AwesSauc.

You may still use the Replace Text action and use
Pattern: (?s)^\P{L}*(\p{L}{1,4})\p{L}*\P{L}+(\p{L}{1,4}).*
Replacement text: $1$2
See the regex demo.
This solution differs from the previous one in three ways. The pattern is anchored at the start of the string with ^. Instead of \W (which matches any non-word char), it uses \P{L}, which matches any non-letter char (adjust as you see fit). And to match the beginnings of the first and second words, it uses two capturing groups, (\p{L}{1,4})...(\p{L}{1,4}), hence the two backreferences in the replacement pattern. The (?s) modifier makes . match any char, including a newline. The .* at the end consumes the rest of the string after the two captured prefixes, so the replacement removes it.
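For anyone who wants to try the idea outside the Replace Text action, here is a minimal Python sketch. Python's built-in re module does not support \p{L}, so the sketch assumes [^\W\d_] as a stand-in for "letter" and [\W\d_] for "non-letter"; the function name first_two_prefixes is made up for illustration.

```python
import re

# Python's re has no \p{L}; these classes are an approximation (assumption):
LETTER = r"[^\W\d_]"      # a word char that is neither a digit nor "_"
NON_LETTER = r"[\W\d_]"   # everything else

TWO_WORDS = re.compile(
    rf"^{NON_LETTER}*({LETTER}{{1,4}}){LETTER}*{NON_LETTER}+({LETTER}{{1,4}}).*",
    re.DOTALL,  # same effect as the inline (?s) modifier
)

def first_two_prefixes(text: str) -> str:
    """Join up to four leading letters of each of the first two words."""
    # Strings with fewer than two words do not match and come back unchanged.
    return TWO_WORDS.sub(r"\1\2", text, count=1)

print(first_two_prefixes("Awesome Sauce Today"))  # AwesSauc
```

Note that Python writes backreferences in the replacement as \1\2 rather than $1$2.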

Related

Deleted everything before the dot

How can I use regex in Notepad++ to do the following?
I have a list of domains and subdomains, such as
web1.com
test.web2.com
www.test.web3.com
I want to filter it so that at most the last three words remain, giving something like this:
web1.com
test.web2.com
test.web3.com
I was able to delete everything so that only the domain remains, but this is not what I want:
^(?:.+\.)?([^.\r\n]+\.[^.\r\n]+)$
One idea is to match lazily up to the point where the last three parts begin, and to capture those parts.
^.*?\.([\w-]+\.[\w-]+\.[\w-]+)$
Replace with $1 (what was captured by the first group)
.*? lazily matches any number of characters (other than newline)
[\w-]+ is a character class that matches one or more word characters or hyphens
See this demo at regex101 (more explanation on the right side)
In Notepad++ be sure to have unchecked: [ ] dot matches newline
Another take uses a positive lookahead to assert that three "words" follow on the right, where each word consists of non-whitespace chars other than a dot, [^\s.].
In the replacement, use an empty string.
^\S+?\.(?=[^\s.]+\.[^\s.]+\.[^\s.]+$)
See a regex demo.
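Both approaches can be checked outside Notepad++ with a short Python sketch. It assumes the list is a single multi-line string, so re.MULTILINE is used to let ^ and $ work per line; the function names are made up for illustration.

```python
import re

# Capture approach: replace the whole line with the captured last three labels.
CAPTURE = re.compile(r"^.*?\.([\w-]+\.[\w-]+\.[\w-]+)$", re.MULTILINE)

# Lookahead approach: delete the prefix that is followed by exactly three labels.
LOOKAHEAD = re.compile(r"^\S+?\.(?=[^\s.]+\.[^\s.]+\.[^\s.]+$)", re.MULTILINE)

def trim_with_capture(text: str) -> str:
    return CAPTURE.sub(r"\1", text)

def trim_with_lookahead(text: str) -> str:
    return LOOKAHEAD.sub("", text)

hosts = "web1.com\ntest.web2.com\nwww.test.web3.com"
print(trim_with_capture(hosts))
print(trim_with_lookahead(hosts))
```

Lines with three or fewer labels do not match either pattern, so they are left as they are.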

RegEx to match all sets of items that have part of specific value

I'm trying to use RegEx to filter all sets of items that have part of a specific value in a capture group that I have defined.
I have to check if the fifth capture group contains at least part of a specific text.
My string:
First Item;Second Item;Third Item;Fourth Item;First Word;Sixth Item?First Item;Second Item;Third Item;Fourth Item;Second Word;Sixth Item?First Item;Second Item;Third Item;Fourth Item;Can't Capture This Set;Sixth Item
RegEx that works for exact word:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(Second Word);([^;\?$]+)
The problem is that I need this RegEx to match when the fifth group contains only part of the text.
Not working:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;\?$]+)
Thanks!
Use [^;]* instead of .* because you have semi-colons as field delimiters:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)
See proof. ([^;]*Word[^;]*) matches zero or more characters other than semi-colons, then Word, then zero or more characters other than semi-colons, so the match can never run past the field delimiter.
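A small Python sketch makes the difference visible. It assumes the sample is one single-line string with ? between records; the variable names are made up for illustration.

```python
import re

SAMPLE = (
    "First Item;Second Item;Third Item;Fourth Item;First Word;Sixth Item"
    "?First Item;Second Item;Third Item;Fourth Item;Second Word;Sixth Item"
    "?First Item;Second Item;Third Item;Fourth Item;Can't Capture This Set;Sixth Item"
)

# Fifth field restricted to non-semicolon characters around "Word".
FIELD_SAFE = re.compile(
    r"(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)"
)
# Same pattern, but with the greedy .* from the non-working attempt.
GREEDY = re.compile(
    r"(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;?]+)"
)

fifth_fields = [m.group(5) for m in FIELD_SAFE.finditer(SAMPLE)]
greedy_fifth = GREEDY.search(SAMPLE).group(5)

print(fifth_fields)         # one entry per record whose fifth field contains "Word"
print(";" in greedy_fifth)  # the greedy version swallows field delimiters
```

With .* the fifth group runs across semi-colons and record separators, which is why the delimiter-aware class is the safer choice here.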

Capturing uppercase words in text with regex

I'm trying to find words that are in uppercase in a given piece of text. The words must be one after the other to be considered and they must be at least 4 of them.
I have "almost" working code, but it captures much more: [A-Z]*(?: +[A-Z]*){4,}. The match also includes spaces at the start or the end of those words (like a boundary).
I have a playground if you want to test it out: https://regex101.com/r/BmXHFP/2
Is there a way to make the regex in the example capture only the words in the first sentence? The language I'm using is Go, which has no look-behind or look-ahead.
In your regex, you just need to change the second * for a +:
[A-Z]*(?: +[A-Z]+){4,}
Explanation
With (?: +[A-Z]*), you are matching "one or more spaces followed by zero or more uppercase letters", so runs of spaces alone can match. After replacing the * with a +, spaces are matched only if uppercase letters follow them.
Demo on regex101
Replace the *s with +s, and your regex only matches the words in the first sentence.
A * quantifier also matches the empty string. Looking at your regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between the runs of spaces.
You have to require at least one uppercase letter, as in [A-Z]*(?: +[A-Z]+){4,} (see updated regex).
A better regex also allows the spaces to be absent: [A-Z]*(?: *[A-Z]+){4,} (see better regex).
The * after the space makes the space optional, so uppercase letters are matched even without spaces between them.
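The question is about Go, but the patterns involved use no lookarounds, so they can be compared in any engine. Below is a minimal Python sketch; the sample sentences are invented for illustration.

```python
import re

# The question's pattern: every [A-Z]* may be empty, so spaces alone can match.
ORIGINAL = re.compile(r"[A-Z]*(?: +[A-Z]*){4,}")
# The suggested fix with both * replaced by +.
FIXED = re.compile(r"[A-Z]+(?: +[A-Z]+){4,}")

text = "THIS IS A LOUD HEADLINE followed by Normal text AND SO"  # made-up sample
runs = FIXED.findall(text)

spaces_only = "one     two"  # five spaces between two lowercase words
original_hit = ORIGINAL.search(spaces_only)
fixed_hit = FIXED.search(spaces_only)

print(runs)
print(repr(original_hit.group()), fixed_hit)
```

One thing to keep in mind: with a non-empty first word, {4,} asks for four more words after it, so five in total. If runs of exactly four words should also count, {3,} would be the quantifier to try.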

RegEx for capturing everything except numbers and one word

I am quite stuck with a regex I can't get to work. It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
I have tried something like (?!\d|fiktiv).* on my sample string 123456788daswqrt fiktiv
https://regex101.com/r/kU8mF3/1
However this does match the fiktiv at the end as well.
One possibility would be to use a negated character class, which you get by putting a ^ at the start of a [] bracket expression. So you basically say: match as many non-digits as you can get, up to the whitespace that precedes the word fiktiv.
The matched text is saved in capturing group 1 for later use.
([^\d]+)\s+fiktiv
Testing could be done here:
https://regex101.com/
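As a quick check of what group 1 ends up holding, here is a minimal Python sketch run on the sample string from the question; the variable names are made up for illustration.

```python
import re

sample = "123456788daswqrt fiktiv"

# Group 1 takes the run of non-digits that precedes the whitespace and "fiktiv".
match = re.search(r"([^\d]+)\s+fiktiv", sample)
captured = match.group(1) if match else None

print(captured)  # daswqrt
```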
It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
So, you want to remove any character that is not a digit (that is, \D or [^0-9] pattern) and not a fiktiv char sequence.
You may use a regex with a capturing group and alternation:
(fiktiv)|[^0-9]
and replace with the $1 backreference. When the match is fiktiv, Group 1 holds it and $1 restores it in the result; when the match is any other non-digit char, Group 1 is empty and the char is removed.
See the regex demo
C# implementation:
Regex.Replace(input, "(fiktiv)|[^0-9]", "$1")
Also, see Use RegEx in SQL with CLR Procs.
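The same capture-and-restore trick can be sketched in Python. The function name strip_noise is made up for illustration, and a replacement function is used instead of a backreference string so that the empty-group case is handled explicitly.

```python
import re

KEEP_OR_DROP = re.compile(r"(fiktiv)|[^0-9]")

def strip_noise(text: str) -> str:
    """Remove every non-digit char except the literal word 'fiktiv'."""
    # group(1) is "fiktiv" when the first alternative matched, otherwise None.
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", text)

print(strip_noise("123456788daswqrt fiktiv"))  # 123456788fiktiv
```

Digits are never matched by either alternative, so they pass through untouched.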

Regex - Matching Strings with a Single Character

I'm fairly new to regex. I'm looking for an expression which will return results which meet the following criteria:
The First word must be 3 letters or more
The last word must be 3 characters or more
If any word or words in-between the first and last word contains ONLY 1 letter, then return that phrase
Every other word in-between the first and last word (apart from the single-letter words) must be 3 letters or more
I would like it to return phrases like:
'Therefore a hurricane shall arrive' and 'However I know I like Michael Smith'
There should be a space between each word.
So far I have:
^([A-Za-z]{3,})*$( [A-Za-z])*$( [A-Za-z]{3,})*$
Any help would be appreciated. Is it something to do with the spacing? I'm using an application called 'Oracle EDQ'.
In a normal regex world you'd use a \b, a word boundary.
^[a-zA-Z]{3,}(\s+|\b([a-zA-Z]|[a-zA-Z]{3,})\b)*\s+[a-zA-Z]{3,}$
(the two \b tokens around the inner group are the word boundaries)
See demo
And perhaps use non-capturing groups (as anubhava shows).
From what I see, Oracle EDQ regex syntax has no word boundaries (nor non-capturing groups). You should rely on the \s pattern, which matches whitespace.
So, make the whitespace obligatory, either with
^[a-zA-Z]{3,}(\s+|\s([a-zA-Z]|[a-zA-Z]{3,}))*\s+[a-zA-Z]{3,}$
(note the \s before the inner group)
OR
^[a-zA-Z]{3,}(\s+|([a-zA-Z]|[a-zA-Z]{3,})\s)*\s*[a-zA-Z]{3,}$
(note the \s after the inner group and the \s* that follows the outer group)
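You can use this regex:
^[a-zA-Z]{3,}(?:\s+|(?:[a-zA-Z]|[a-zA-Z]{3,}))*\s+[a-zA-Z]{3,}$
RegEx Demo
As printed here, this pattern has neither \b nor \s around the middle alternatives, so a two-letter word such as "is" would still match as two consecutive single letters; the \b and \s variants above avoid that.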
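The word-boundary variant and the first whitespace-only variant can be compared with a short Python sketch. Python is only a stand-in here; whether Oracle EDQ behaves the same way is an assumption, and the rejected phrase is invented for illustration.

```python
import re

# Variant that relies on \b around each middle word.
WITH_BOUNDARY = re.compile(
    r"^[a-zA-Z]{3,}(\s+|\b([a-zA-Z]|[a-zA-Z]{3,})\b)*\s+[a-zA-Z]{3,}$"
)
# Variant that relies only on \s, for engines without \b.
WITH_SPACE = re.compile(
    r"^[a-zA-Z]{3,}(\s+|\s([a-zA-Z]|[a-zA-Z]{3,}))*\s+[a-zA-Z]{3,}$"
)

accepted = [
    "Therefore a hurricane shall arrive",
    "However I know I like Michael Smith",
]
rejected = "Therefore is coming"  # made-up phrase with a two-letter middle word

for pattern in (WITH_BOUNDARY, WITH_SPACE):
    print([bool(pattern.search(p)) for p in accepted], bool(pattern.search(rejected)))
```

Neither pattern checks that a single-letter word is actually present, so phrases made only of longer words also match.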