Regex that selects everything after first consecutive capitalized words

I'd like to select everything after the first few consecutive capitalized words. For example:
Terry Smith is a good school teacher. She works tirelessly.
would become:
is a good school teacher. She works tirelessly.
So far this doesn't work:
(^[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)([\s\S]*)
I'm using it in Drupal's feeds tamper plugin with the "find replace regex" feature in order to replace everything after "Terry Smith" with blank space.

The following expression will match all consecutive capitalized words at the beginning of the sentence.
^(?:(?:[A-Z][a-z]+)(?>\s*))+
If you want to remove that part from the sentence, all you have to do is replace it with the empty string.
If you want to replace the part that comes after it then you can use the following expression:
^((?:(?:[A-Z][a-z]+)(?>\s*))+)([\s\S]+)
and use a replacement string of $1, or whatever your language uses to reference the first captured group.
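As a rough illustration of both replacements, here is a minimal JavaScript sketch. JavaScript is only an assumption for the demonstration (the question is about Drupal's Feeds Tamper), and because JavaScript regexes have no atomic groups, (?>\s*) is written as a plain \s* here. The variable names are made up.

```javascript
// Sample sentence from the question.
const teacherSentence = "Terry Smith is a good school teacher. She works tirelessly.";

// Leading run of capitalized words; (?>\s*) from the answer becomes \s* in JavaScript.
const nameRun = /^(?:[A-Z][a-z]+\s*)+/;

// Replacing the run with the empty string leaves the rest of the text.
const restOnly = teacherSentence.replace(nameRun, "");

// Capturing the run and replacing the whole match with $1 keeps only the run.
// The run still includes the whitespace that follows the last capitalized word.
const nameOnly = teacherSentence.replace(/^((?:[A-Z][a-z]+\s*)+)([\s\S]+)/, "$1");

console.log(JSON.stringify(restOnly));
console.log(JSON.stringify(nameOnly));
```

Note that the kept name ends with a space, since \s* sits inside the captured group.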

This will find the capital words:
[A-Z][a-z]+(?=\b)\s*
You might want to replace the + with * after [a-z] to also match single-character capital words.
To get all capitalized words at the beginning of the string, add ^( and )+ around it:
^([A-Z][a-z]+(?=\b)\s*)+
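To see what swapping + for * changes, here is a small JavaScript sketch (the language, the sample title and the variable names are assumptions for illustration only). It runs both forms against a string whose first word is a single capital letter.

```javascript
// Hypothetical input whose first word is a single capital letter.
const bookTitle = "A Tale Of two cities";

// Original form: each word needs at least one lowercase letter after the capital.
const strictCapsRun = /^([A-Z][a-z]+(?=\b)\s*)+/;

// Variant with * so that single-letter words such as "A" also count.
const looseCapsRun = /^([A-Z][a-z]*(?=\b)\s*)+/;

// The strict form cannot match at the start, because "A" has no lowercase tail.
const strictResult = bookTitle.match(strictCapsRun);

// The loose form matches the leading run, including the trailing space.
const looseResult = bookTitle.match(looseCapsRun)[0];

console.log(strictResult);                 // null
console.log(JSON.stringify(looseResult));  // "A Tale Of "
```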

Related

Regex equation selects surrounding characters

I have the following text
Cool Title Here 12345
Other title here 13455
That I want to turn into this using Atom's find and replace
Cool Title Here, 12345
Other title here, 13455
My goal is to select the space between the end of a word and the start of a number. My first instinct is this pattern:
[A-Za-z][\s][0-9]
However, that also selects the last letter and the first digit, which is no good for this replacement because I would lose data.
How would I find the space in between the two sections using pure regex?
You can capture the letter and the digit, then use backreferences in the replacement to put them back.
Specify the pattern:
([A-Za-z]) ([0-9])
In the replacement:
$1, $2
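The same capture-and-restore idea can be tried outside the editor. This is a minimal JavaScript sketch, not Atom itself; the variable names are made up, and the g flag stands in for "replace all".

```javascript
// The two sample lines from the question.
const titleLines = "Cool Title Here 12345\nOther title here 13455";

// Group 1 keeps the letter, group 2 keeps the digit; only the space between them changes.
const withCommas = titleLines.replace(/([A-Za-z]) ([0-9])/g, "$1, $2");

console.log(withCommas);
```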
I am not familiar with the specifics of Atom regular expression processing but some Googling suggests these general regex techniques should work:
You could use \b to identify the word boundary of the preceding word (without capturing it).
You can use (?=\d) to look ahead to the digit without capturing.
So for your example:
\b\s(?=\d)
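A quick way to check the lookahead version is a small JavaScript sketch (an assumption for testing purposes, since Atom's own engine is not confirmed here; the variable names are hypothetical). Because only the whitespace is matched, the replacement has to be a comma plus a space.

```javascript
// The two sample lines from the question.
const rawTitles = "Cool Title Here 12345\nOther title here 13455";

// \b and (?=\d) are zero-width, so the match consists of the single whitespace character.
const commaInserted = rawTitles.replace(/\b\s(?=\d)/g, ", ");

console.log(commaInserted);
```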

Notepad++ regex to capitalise the first word of selected sentence

Relates to customising Notepad++. I know TextFX 'Sentence case.' exists but I wanted to control this using my own regex/macro.
Testing against: hello my name is john. hello my name is john. hello my name is john.
Currently I have this which works fine when nothing is selected/highlighted with the mouse.
Find what: ((?<=^|(?<=[.!?]\s)))(\w)
Replace with: \u$0
However, when I select/highlight the second (middle) sentence only (starting at the h and finishing on the period .), the regex does nothing. Note: I have 'Use selection' ticked in N++ and am using 'Replace All'.
This makes sense because the regex is looking for the start of a line or the char pattern .!? followed by a space.
My question is how to alter the regex so that it works when selecting/highlighting any sentence, no matter if it isn't at the beginning of a line as per my example.
I have tried adding in a negative lookbehind to match when no characters are found but I only managed to uppercase the first word of every sentence.
The ^ matches the start of a line, but your selected region does not begin at a line start. You may replace it with \A, which matches the start of the string being searched. Because \A will match at each selected region, a single \w is not enough: add + after it so that the whole word is consumed and the following word characters are not each turned to upper case.
Use
(?<=\A|(?<=[.!?]\s))(\w+)
and replace with \u$0.
An alternative is to use capturing groups (this also lets you match cases where there is more than one whitespace character between the !, ? or . and the next word character):
(\A|[.!?]\s+)(\w+)
and replace with $1\u$2.
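The capturing-group variant can be approximated in JavaScript, with two stated substitutions: JavaScript has neither \A nor the \u replacement escape, so this sketch uses ^ without the m flag (which then matches only at the start of the string) and a replacer function that upper-cases the first letter. The function name is hypothetical.

```javascript
// Upper-case the first word of the text and of every sentence after ., ! or ?.
function capitalizeSentences(text) {
  return text.replace(/(^|[.!?]\s+)(\w+)/g, (match, lead, word) =>
    // lead is the punctuation plus whitespace (or empty at the start), word is the first word.
    lead + word.charAt(0).toUpperCase() + word.slice(1));
}

// The full test line from the question, and the middle sentence on its own,
// which stands in for a selected region that does not start at a line start.
const wholeLine = "hello my name is john. hello my name is john. hello my name is john.";
const selectionOnly = "hello my name is john.";

console.log(capitalizeSentences(wholeLine));
console.log(capitalizeSentences(selectionOnly));
```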

Regular expression to remove parenthesis and space before it

I'm trying to write a regular expression (inside a Google Spreadsheet) to remove the parentheses, the text inside them, and the space before them. In other words, I'm trying to extract only the name from the text. For example, I'd like the string "A.J. Smith (iOS Developer, San Francisco)" to become "A.J. Smith".
So far I've gotten both =REGEXEXTRACT(D2,"[^()]*") and =REGEXEXTRACT(D2,"^[^(]+") to extract "A.J. Smith " but it leaves that last space at the end. This is probably a really easy problem to solve, I'm just not great with regex.
Just use a word boundary.
=REGEXEXTRACT(D2,"^[^(]+\b")
^[^(]+ greedily matches all the characters up to the first ( symbol, including the space before the (. Then, because of the \b in the regex, it backtracks to the last word boundary in the matched string.
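The backtracking to the last word boundary can be observed with a short JavaScript sketch. JavaScript here is only a stand-in for the spreadsheet function, on the assumption that \b behaves the same way in both engines for this ASCII input; the variable names are made up.

```javascript
// Sample cell value from the question.
const profileCell = "A.J. Smith (iOS Developer, San Francisco)";

// [^(]+ first grabs "A.J. Smith " (with the space), then \b forces it back to just after "h".
const trimmedName = profileCell.match(/^[^(]+\b/)[0];

console.log(JSON.stringify(trimmedName)); // "A.J. Smith"
```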
Try this instead:
=REGEXREPLACE(D2,"\s\(.*","")
What I'm doing is replacing everything from a space next to a parenthesis to the end of the string with nothing.
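For comparison, the same replacement can be sketched in JavaScript (used here only as a test bed, not as the spreadsheet itself; the variable names and the second sample value are hypothetical).

```javascript
// Sample cell value from the question.
const jobCell = "A.J. Smith (iOS Developer, San Francisco)";

// Remove the whitespace character before "(" and everything after it.
const nameBeforeParen = jobCell.replace(/\s\(.*/, "");

// A value without parentheses is left untouched, because the pattern does not match.
const plainCell = "Jane Doe".replace(/\s\(.*/, "");

console.log(JSON.stringify(nameBeforeParen)); // "A.J. Smith"
console.log(JSON.stringify(plainCell));       // "Jane Doe"
```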
I used https://regoio.herokuapp.com/ to help build a regex to match. This regex captures this example without the trailing space: ^(.+)\s\(
The regex works like this: the ^ matches the beginning of the string, and the parentheses capture whatever expression is inside them that you want to use, in this case .+, which matches any character one or more times. The \s matches a whitespace character and \( matches the opening parenthesis.
If you want a regex that removes whitespace at the beginning of the string and any before the parenthesis this should work: ^[\s]*(.+)[\s]+\(
With this regex you can extract all the text you wanted in a single REGEXEXTRACT instead of using multiple ones:
=REGEXEXTRACT(D2,"^[\s]*(.+)[\s]+\(")
I found that =REGEXEXTRACT(D2,"(.*)\s\(") also worked for me.
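The capture-group patterns above can be checked with a small JavaScript sketch, where index 1 of the match result stands in for the captured group that the formulas rely on. The language, the padded sample value and the variable names are assumptions for illustration.

```javascript
// Hypothetical cell value with leading whitespace added to exercise ^[\s]*.
const paddedCell = "  A.J. Smith (iOS Developer, San Francisco)";

// Leading whitespace is skipped, then (.+) backtracks until whitespace plus "(" follows it.
const trimmedCapture = paddedCell.match(/^[\s]*(.+)[\s]+\(/)[1];

// The shorter pattern from the last answer, applied to the unpadded value.
const simpleCapture = "A.J. Smith (iOS Developer, San Francisco)".match(/(.*)\s\(/)[1];

console.log(JSON.stringify(trimmedCapture)); // "A.J. Smith"
console.log(JSON.stringify(simpleCapture));  // "A.J. Smith"
```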
This should work to remove all parentheses and white space before:
=REGEXEXTRACT(D2,"\s|\(|\)|\[|]|{|}|")
Feel free to play around with this on Rubular.

Regex for a string of capitalized words?

I want to write a regex that will pull out the phrases of capitalized [a-z] words. So if it sees this phrase it should pull out "America Fremerica" and "King" from
there is a land called America Fremerica where regex is King
I have a regex ([A-Z][a-z]+ ?){1,} that pulls out Fremerica and King.
I want it to pick out America Fremerica. Why doesn't it pick out America? Is that why it does not pick out the phrase?
Your regex works, but it's not capturing all of the words. The regex (a)+ will match the string aaa but it will only capture the last a. To capture all three a's, you'd need to write (a+), with the quantifier inside the parentheses.
So put another set of parentheses around the whole thing. You want to capture the repetitions. You can also change {1,} to +, which is equivalent.
((?:[A-Z][a-z]+ ?)+)
?: stops the inner set of parentheses from being a capture group. It's not necessary, but it's nice to have.
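The difference between (a)+ and (a+) is easy to see in a short JavaScript sketch (variable names are made up). It then applies the corrected pattern to the sentence from the question and collects group 1 of every match.

```javascript
// A repeated capture group keeps only its last repetition.
const lastRepetition = "aaa".match(/(a)+/);  // full match "aaa", group 1 "a"
// With the quantifier inside the parentheses, the group holds the whole run.
const wholeRepeat = "aaa".match(/(a+)/);     // full match "aaa", group 1 "aaa"

// The sentence from the question, with the outer capture group added.
const landPhrase = "there is a land called America Fremerica where regex is King";
const capturedPhrases = [...landPhrase.matchAll(/((?:[A-Z][a-z]+ ?)+)/g)].map(m => m[1]);

console.log(lastRepetition[1], wholeRepeat[1]);
console.log(JSON.stringify(capturedPhrases)); // ["America Fremerica ","King"]
```

The first phrase still carries a trailing space, because the optional space is part of the repeated group.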
Your regex captures the trailing space. This regex captures a capitalized word followed by 0-n more such words (either as the whole match or group 1 - they are the same), which captures just "America Fremerica" (not "America Fremerica ")
([A-Z][a-z]+(?: [A-Z][a-z]+)*)
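A minimal JavaScript check of the version without the trailing space (variable names are hypothetical). With the g flag, match() returns the full matches, which here are the same as group 1.

```javascript
// The sentence from the question.
const kingdomPhrase = "there is a land called America Fremerica where regex is King";

// One capitalized word, then zero or more " Word" repetitions; the space is only
// consumed when another capitalized word follows it.
const tidyPhrases = kingdomPhrase.match(/([A-Z][a-z]+(?: [A-Z][a-z]+)*)/g);

console.log(JSON.stringify(tidyPhrases)); // ["America Fremerica","King"]
```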
It appears to work as expected in JavaScript. See this fiddle: http://jsfiddle.net/9X83F/2/
HTML
<p id="result"></p>
JavaScript
var phrase = "there is a land called America Fremerica where regex is King";
// With the g flag, match() returns every full match rather than the capture groups.
var matches = phrase.match(/([A-Z][a-z]+ ?){1,}/g);
// Assigning the array converts it to a comma-separated string.
document.getElementById('result').innerHTML = matches;

How to delete words containing a specific character in RegEx?

Using Notepad++'s find and replace feature using regular expressions, I want to get rid of every word with the number symbol (#) attached to it, particularly in the beginning of the word in my case.
For example, how do I make:
The #7kfe dog likes #9kea to eat pizza
into:
The dog likes to eat pizza
Any help would be greatly appreciated. Thank you.
Most editors that support find-and-replace with regular expressions work similarly: in the 'find' field, look for #\w* and leave the 'replace' field empty. This will leave double spaces (the space that was before your word and the space that was after it). You can either tweak the expression to something like #\w* ? (so that the space is optional, in case the word in question is the last word of the line), or do a second search-and-replace that collapses multiple spaces into one.
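The three options described above can be compared in a small JavaScript sketch. JavaScript is only a stand-in for Notepad++ here, the g flag plays the role of "Replace All", and the variable names are made up.

```javascript
// Sample sentence from the question.
const dogSentence = "The #7kfe dog likes #9kea to eat pizza";

// Plain #\w* removes the words but leaves both surrounding spaces behind.
const leavesDoubleSpaces = dogSentence.replace(/#\w*/g, "");

// Adding an optional trailing space removes one of the two spaces as well.
const swallowsSpace = dogSentence.replace(/#\w* ?/g, "");

// Alternatively, collapse runs of spaces in a second pass.
const collapsed = leavesDoubleSpaces.replace(/ {2,}/g, " ");

console.log(JSON.stringify(leavesDoubleSpaces)); // "The  dog likes  to eat pizza"
console.log(JSON.stringify(swallowsSpace));      // "The dog likes to eat pizza"
console.log(JSON.stringify(collapsed));          // "The dog likes to eat pizza"
```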
Most other responses will give you words starting with #, which seems to be what you want, but to fit the wording of your question ("particularly in the beginning"), this will select every word with a # anywhere in it:
/(\w*#\w+|\w+#\w*)/
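To check that the two alternatives cover a # at the start, middle and end of a word, here is a short JavaScript sketch with a made-up sample string and hypothetical variable names.

```javascript
// Hypothetical text with # in the middle, at the start, and at the end of a word.
const taggedText = "foo#bar and #baz and qux# end";

// First alternative needs word characters after #, second needs them before it.
const hashWords = taggedText.match(/(\w*#\w+|\w+#\w*)/g);

console.log(JSON.stringify(hashWords)); // ["foo#bar","#baz","qux#"]
```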
find: (\W)#\w+
replace: \1
(obviously also set it to regex mode)
The \W looks for a non-word character, to ensure the # is at the beginning of a word. The \1 in the replace puts that character back.
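The effect of the leading \W can be tried in JavaScript, where $1 takes the place of \1 in the replacement (the language and the extra sample strings are assumptions for illustration). The sketch also shows two cases the pattern deliberately or incidentally skips.

```javascript
// Sample sentence from the question.
const pizzaSentence = "The #7kfe dog likes #9kea to eat pizza";

// The captured non-word character (here a space) is put back, so both spaces remain.
const hashWordsRemoved = pizzaSentence.replace(/(\W)#\w+/g, "$1");

// A # inside a word is preceded by a word character, so nothing is removed.
const midWordHash = "issue#42 stays".replace(/(\W)#\w+/g, "$1");

// A # at the very start of the text has no preceding character for \W to match.
const leadingHash = "#first word".replace(/(\W)#\w+/g, "$1");

console.log(JSON.stringify(hashWordsRemoved)); // "The  dog likes  to eat pizza"
console.log(JSON.stringify(midWordHash));      // "issue#42 stays"
console.log(JSON.stringify(leadingHash));      // "#first word"
```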
Use this regex:
#\w*
It will match each # together with the word that follows it.