Regex for a string of capitalized words? - regex

I want to write a regex that will pull out the phrases of capitalized [a-z] words. So if it sees this phrase it should pull out "America Fremerica" and "King" from
there is a land called America Fremerica where regex is King
I have a regex ([A-Z][a-z]+ ?){1,} that pulls out Fremerica and King.
I want it to pick out America Fremerica. Why doesn't it pick out America? Is that why it does not pick out the phrase?

Your regex works, but it's not capturing all of the words. The regex (a)+ will match the string aaa, but it will only capture the last a. To capture all three a's you'd need to write (a+), with the quantifier inside the parentheses.
So put another set of parentheses around the whole thing. You want to capture the repetitions. You can also change {1,} to +, which is equivalent.
((?:[A-Z][a-z]+ ?)+)
?: stops the inner set of parentheses from being a capture group. It's not necessary, but it's nice to have.
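To see the difference concretely, here is a minimal sketch using Python's re module (the choice of Python and the variable names are assumptions for illustration; the question does not name a language). It shows that a repeated capturing group keeps only its last iteration, while wrapping the whole repetition captures everything.

```python
import re

phrase = "there is a land called America Fremerica where regex is King"

# A repeated capturing group keeps only its last iteration.
last_only = re.fullmatch(r"(a)+", "aaa").group(1)
all_three = re.fullmatch(r"(a+)", "aaa").group(1)
print(last_only, all_three)  # a aaa

# findall returns group 1 of each match when the pattern has exactly one group.
original = re.findall(r"([A-Z][a-z]+ ?){1,}", phrase)
wrapped = re.findall(r"((?:[A-Z][a-z]+ ?)+)", phrase)

print(original)  # ['Fremerica ', 'King']
print(wrapped)   # ['America Fremerica ', 'King']
```

Note that the wrapped version still includes the trailing space after "Fremerica".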

Your regex captures the trailing space. The regex below matches a capitalized word followed by zero or more such words, each preceded by a single space. The whole match and group 1 are the same, so it captures just "America Fremerica" (not "America Fremerica "):
([A-Z][a-z]+(?: [A-Z][a-z]+)*)
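As a quick check of the pattern without the trailing space, here is a small sketch in Python (an assumed language; any engine with non-capturing groups should behave the same way for this input):

```python
import re

phrase = "there is a land called America Fremerica where regex is King"

# One capitalized word, then zero or more " Capitalized" continuations.
phrases = re.findall(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)", phrase)

print(phrases)  # ['America Fremerica', 'King']
```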

It appears to work as expected in Javascript. See this fiddle: http://jsfiddle.net/9X83F/2/
HTML
<p id="result"></p>
Javascript
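var phrase = "there is a land called America Fremerica where regex is King";
// With the g flag, match() returns the full matches, not the capture groups
var matches = phrase.match(/([A-Z][a-z]+ ?){1,}/g);
// The array is converted to a comma-separated string when assigned
document.getElementById('result').innerHTML = matches;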

Related

Extra groups in regex

I'm building a regex to be able to parse addresses and am running into some blocks. An example address I'm testing against is:
5173B 63rd Ave NE, Lake Forest Park WA 98155
I am looking to capture the house number, street name(s), city, state, and zip code as individual groups. I am new to regex and am using regex101.com to build and test against, and ended up with:
(^\d+\w?)\s((\w*\s?)+).\s(\w*\s?)+([A-Z]{2})\s(\d{5})
It matches all the groups I need and matches the whole string, but there are extra groups that are null value according to the match information (3 and 4). I've looked but can't find what is causing this issue. Can anyone help me understand?
Your regex was almost right:
(^\d+\w?)\s([\w\s]+).\s([\w\s]+)\s([A-Z]{2})\s(\d{5})
What I changed are the second and third groups: in both you used a group inside a group, ((\w*\s?)+), whereas a character class inside a group, ([\w\s]+), matches the same text and gives you the proper group content. Note that * and ? are literal characters inside a character class, so they are left out of the class.
With your previous syntax, the inner group would be able to match an empty substring, since both quantifiers allow for a zero-length match (* is 0 to unlimited matches and ? is zero or one match). Since this group was repeated one or more times with the +, the last occurrence would match an empty string and only keep that.
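A rough way to check the character-class approach is the following Python sketch (Python is an assumption here; the question was tested on regex101). It uses the class [\w\s] for the street and city parts and prints the five groups.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Character classes instead of nested repeated groups: no extra groups appear.
pattern = re.compile(r"(^\d+\w?)\s([\w\s]+).\s([\w\s]+)\s([A-Z]{2})\s(\d{5})")
groups = pattern.search(address).groups()

print(groups)
# ('5173B', '63rd Ave NE', 'Lake Forest Park', 'WA', '98155')
```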
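To avoid the extra groups, you can use non-capturing groups, which have the form (?:regex), in the places where you currently see your "null results". Note that in the regex below the city part is not captured at all; wrap it in its own capturing group if you need it. This gives you the regex: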
(^\d+\w?)\s((?:\w*\s?)+).\s(?:\w*\s?)+([A-Z]{2})\s(\d{5})
Here is a basic example of the difference between a capturing group and a non-capturing group: ([^s]+) (?:[^s]+)
The first group is captured into "Group 1", while the second one is not captured at all.
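The same comparison can be sketched in Python (an assumed language; the sample text "foo bar" is made up for illustration). Both patterns match the same text, but only capturing groups show up in the result.

```python
import re

text = "foo bar"  # hypothetical sample input

both = re.match(r"([^s]+) ([^s]+)", text)    # two capturing groups
one = re.match(r"([^s]+) (?:[^s]+)", text)   # second group is non-capturing

print(both.group(0), both.groups())  # foo bar ('foo', 'bar')
print(one.group(0), one.groups())    # foo bar ('foo',)
```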
Matching an address can be difficult due to the different formats.
If you can rely on the comma to be there, you can capture the part before it using a negated character class:
^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$
Or take the part before the comma that ends with a word of 2 or more uppercase characters, and then match optional non-word characters using \W* to get to the first word character after the comma:
^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$
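Both anchored patterns can be tried against the sample address with a short Python sketch (Python and the variable names are assumptions for illustration). For this particular input they produce the same five groups; other address formats may behave differently.

```python
import re

address = "5173B 63rd Ave NE, Lake Forest Park WA 98155"

# Variant 1: rely on the comma, using a negated character class.
by_comma = re.compile(
    r"^(\d+[A-Z]?)\s+([^,]+?)\s*,\s*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)
# Variant 2: the street part ends with a word of 2+ uppercase letters.
by_upper = re.compile(
    r"^(\d+[A-Z]?)\s+(.*?\b[A-Z]{2,}\b)\W*(.+?)\s+([A-Z]{2})\s(\d{5})$"
)

groups_comma = by_comma.match(address).groups()
groups_upper = by_upper.match(address).groups()

print(groups_comma)
# ('5173B', '63rd Ave NE', 'Lake Forest Park', 'WA', '98155')
```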

RegEx to match all sets of items that have part of specific value

I'm trying to use RegEx to filter all sets of items that have part of a specific value in a capture group that I have defined.
I have to check if the fifth capture group contains at least part of a specific text.
My string:
First Item;Second Item;Third Item;Fourth Item;First Word;Sixth
Item?First Item;Second Item;Third Item;Fourth Item;Second Word;Sixth
Item?First Item;Second Item;Third Item;Fourth Item;Can't Capture This
Set;Sixth Item
RegEx that works for exact word:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(Second Word);([^;\?$]+)
The problem is that I need this RegEx to work to capture only part of the word.
Not Working:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;\?$]+)
Thanks!
Use [^;]* instead of .* because you have semi-colons as field delimiters:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)
([^;]*Word[^;]*) matches zero or more characters other than semicolons, then the literal text Word, then zero or more characters other than semicolons, so the fifth group cannot spill over into a neighboring field.
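Here is a minimal Python sketch of the difference (Python is an assumption, and the sample string is rebuilt as one line on the assumption that the line breaks in the question are just wrapping). With .* the fifth group swallows several fields; with [^;]* it stays inside one field.

```python
import re

# Rebuild the sample string from the question as a single line (assumed).
prefix = "First Item;Second Item;Third Item;Fourth Item;"
text = "?".join([
    prefix + "First Word;Sixth Item",
    prefix + "Second Word;Sixth Item",
    prefix + "Can't Capture This Set;Sixth Item",
])

greedy = re.compile(
    r"(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;\?$]+)"
)
fixed = re.compile(
    r"(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)"
)

# The greedy version runs across the semicolon delimiters.
greedy_fifth = greedy.search(text).group(5)
# The fixed version yields one clean fifth field per matching set.
fixed_fifth = [m.group(5) for m in fixed.finditer(text)]

print(";" in greedy_fifth)  # True
print(fixed_fifth)          # ['First Word', 'Second Word']
```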

Make regex match only the capturing group

Due to the technology I'm currently working with (PySpark API), I need to adjust a regex so that the full match corresponds to the capturing group.
I want to use it as the delimiter pattern in a split function.
This function splits the input string on the matched substring, not on the capturing group.
That is why I need the full match to be just the \s+ characters (which I currently only capture).
Here is my current regex: (\s)+(?:\d*\s*)(?=RUE|BOULEVARD|AVENUE)
I tried to extend the positive lookahead to allow for an optional \d+\s+ before the street type, so that a different \s is matched. It hasn't worked so far.
The split function's output I wish to obtain is the following:
[7 BOULEVARD LAPIN BLANC,AVENUE MR LIEVRE,18 RUE PIERRE LAPIN]
I don't know PySpark, but assuming it supports lookarounds, you can split on whitespace that is not preceded by a digit and is followed by an optional number and then the type of street:
(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))
In the demo I use a substitution with \n, which simulates the split.
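The idea can be sketched with Python's re.split as a stand-in for the PySpark split function. This is only an approximation: PySpark uses Java regular expressions, and the input string below is an assumption reconstructed from the desired output, since the question does not show it.

```python
import re

# Assumed input, reconstructed from the desired output in the question.
line = "7 BOULEVARD LAPIN BLANC AVENUE MR LIEVRE 18 RUE PIERRE LAPIN"

# Whitespace not preceded by a digit, followed by an optional house
# number and then a street type. Only the whitespace itself is consumed.
delimiter = r"(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))"

parts = re.split(delimiter, line)

print(parts)
# ['7 BOULEVARD LAPIN BLANC', 'AVENUE MR LIEVRE', '18 RUE PIERRE LAPIN']
```

Because the pattern contains no capturing groups, the split result holds only the address pieces.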

Extract with regex when the same special character is used

I've been trying to use Regex tools online, but none seem to be working. I am close but not sure what I'm missing.
Here is the Text:
Valencia, Los Angeles, California - Map
I want to extract the first 2 letters of the state (so between "," and "-"), in this case "CA"
What I've done so far is:
[,/](.*)[-/]
$1
The output is:
Los Angeles, California
If anything I thought I would at least just get the state.
,\s*(\w\w)[^,]*-
will capture Ca in group 1.
, comma
\s* whitespace
(\w\w) capture the first two characters
[^,]* make sure there's no comma up to the next dash
- a literal dash
,\s*(\S{2})[^,]*-
You're going to want to take just the first match.
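A small Python sketch of these two patterns (Python is an assumption; the question does not state a language). Note that group 1 holds "Ca" as written in the text, so an upper-casing step is needed to get "CA".

```python
import re

text = "Valencia, Los Angeles, California - Map"

# [^,]* forbids another comma before the dash, so only the last
# comma-separated field before "-" can match.
word_chars = re.search(r",\s*(\w\w)[^,]*-", text).group(1)
non_space = re.search(r",\s*(\S{2})[^,]*-", text).group(1)

state = word_chars.upper()

print(word_chars, non_space, state)  # Ca Ca CA
```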
I assume you use JavaScript.
Your regex fails this particular case because there are two commas in your input.
One possible fix is to change the middle capture from . (any character) to [^,] (any character except a comma). This forces the regex to capture only California (with its surrounding spaces, which you can trim).
So, try [,/]([^,]*)[-/].
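To illustrate why the original pattern returned too much, here is a short Python sketch (an assumed language, used only for illustration) comparing the greedy .* capture with the [^,]* capture.

```python
import re

text = "Valencia, Los Angeles, California - Map"

# .* starts at the first comma and runs up to the last dash.
too_much = re.search(r"[,/](.*)[-/]", text).group(1)
# [^,]* cannot cross a comma, so only the last field is captured.
just_state = re.search(r"[,/]([^,]*)[-/]", text).group(1)

abbreviation = just_state.strip()[:2].upper()

print(repr(too_much))    # ' Los Angeles, California '
print(repr(just_state))  # ' California '
print(abbreviation)      # CA
```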
You can use this regex:
.*?,\s(\w\w)[^,]*-
$1 is the first two letters you're looking for.

Regex that selects everything after first consecutive capitalized words

I'd like to select everything after the first few consecutive capitalized words. For example:
Terry Smith is a good school teacher. She works tirelessly.
would become:
is a good school teacher. She works tirelessly.
So far this doesn't work:
(^[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)([\s\S]*)
I'm using it in Drupal's feeds tamper plugin with the "find replace regex" feature in order to replace everything after "Terry Smith" with blank space.
The following expression will match all consecutive capitalized words at the beginning of the sentence.
^(?:(?:[A-Z][a-z]+)(?>\s*))+
If you want to remove that part from the sentence, all you have to do is replace it with the empty string.
If you want to remove the part that comes after it instead, you can use the following expression:
^((?:(?:[A-Z][a-z]+)(?>\s*))+)([\s\S]+)
and use a replacement string of $1, or whatever your language uses to reference the first captured group.
This will find the capital words:
[A-Z][a-z]+(?=\b)\s*
You might want to replace the + after [a-z] with * to also match single-letter capitalized words such as "A" or "I".
To get all capitalized words at the beginning of the string, add ^( and )+ around it:
^([A-Z][a-z]+(?=\b)\s*)+
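As a rough check of this last pattern, here is a Python sketch (Python is an assumption; the question uses Drupal's Feeds Tamper plugin, which is PHP-based and may differ in details). It removes the leading capitalized words by replacing the match with an empty string.

```python
import re

sentence = "Terry Smith is a good school teacher. She works tirelessly."

# Leading run of capitalized words, each followed by optional whitespace.
leading_caps = re.compile(r"^([A-Z][a-z]+(?=\b)\s*)+")

names = leading_caps.match(sentence).group(0)
rest = leading_caps.sub("", sentence)

print(repr(names))  # 'Terry Smith '
print(rest)         # is a good school teacher. She works tirelessly.
```

Because the pattern is anchored with ^, the later capitalized word "She" is left alone.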