Capturing uppercase words in text with regex

Capturing uppercase words in text with regex - regex

I'm trying to find words that are in uppercase in a given piece of text. The words must be one after the other to be considered and they must be at least 4 of them.
I have a "almost" working code but it captures much more: [A-Z]*(?: +[A-Z]*){4,}. The capture group also includes spaces at the start or the end of those words (like a boundary).
I have a playground if you want to test it out: https://regex101.com/r/BmXHFP/2
Is there a way to make the regex in example capture only the words in the first sentence? The language I'm using is Go and it has no look-behind/ahead.

In your regex, you just need to change the second * for a +:
[A-Z]*(?: +[A-Z]+){4,}
Explanation
While using (?: +[A-Z]*), you are matchin "a space followed by 0+ letters". So you are matching spaces. When replacing the * by a +, you matches spaces if there are uppercase after.
Demo on regex101

Replace the *s by +s, and your regex only matches the words in the first sentence.
.* also matches the empty string. Looking at you regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between every now and then.

You had to mark at least 1 upper case as [A-Z]*(?: +[A-Z]+){4,} see updated regex.
A better Regex will allow non spaces as [A-Z]*(?: *[A-Z]+){4,}.see better regex
* After will indicate to allow at least upper case even without spaces.

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to RegEx and I am having difficulties creating an expression though I beleive it should be somewhat simple. The expression should pick up words that have two or more capitalised letter. The expression should also be able to pick up words where a dash have been used in-between and report the whole word (both before and after the dash). If numbers are also present they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I wish for it to only give me words that have atleast two or more capitalised letters. I understand that it will also "mistakenly" take words as "Abc-Abc" but I don't believe there is a way to avoid these.

If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times the optionally the allowed characters and an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also not not match when there are hyphens surrounding the characters you can use negative lookarounds asserting not a hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
since i am using a character range it is matching one character at a time and only ignoring the last character before the : .
How do i ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1

The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead match and finds a match whenver there are at least two letters before a colon, returning the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed with :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12,, but not in 12:, and also 12c and 12_
\d+(?![\d:]) matches 12 in 12, and 12c and 12_, not in 12: only.

Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Regex - Matching Strings with a Single Character

I'm fairly new to regex. I'm looking for an expression which will return results which meet the following criteria:
The First word must be 3 letters or more
The last word must be 3 characters or more
If any word or words in-between the first and last word contains ONLY 1 letter, then return that phrase
Every other word in-between the first and last character that (apart from the single letter words) must be 3 letters or more
I would like it to return phrases like:
'Therefore a hurricane shall arrive' and 'However I know I like Michael Smith'
There should be a space between each word.
So far I have:
^([A-Za-z]{3,})*$( [A-Za-z])*$( [A-Za-z]{3,})*$
Any help would be appreciated. Is it something to do with the spacing? I'm using an application called 'Oracle EDQ'.

In a normal regex world you'd use a \b, a word boundary.
^[a-zA-Z]{3,}(\s+|\b([a-zA-Z]|[a-zA-Z]{3,})\b)*\s+[a-zA-Z]{3,}$
^^ ^^
See demo
And perhaps, non-capturing groups (as anubhava shows).
From what I see, there are no word boundaries in Oracle EDQ regex syntax (as well as non-capturing groups). You should rely on the \s pattern, matching whitespace.
So, make it obligatory, either with
^[a-zA-Z]{3,}(\s+|\s([a-zA-Z]|[a-zA-Z]{3,}))*\s+[a-zA-Z]{3,}$
^^
OR
^[a-zA-Z]{3,}(\s+|([a-zA-Z]|[a-zA-Z]{3,})\s)*\s*[a-zA-Z]{3,}$
^^ ^

You can use this regex:
^[a-zA-Z]{3,}(?:\s+|(?:[a-zA-Z]|[a-zA-Z]{3,}))*\s+[a-zA-Z]{3,}$
RegEx Demo

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD

See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].

Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$

Regular expression to allow spaces between words

I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?

tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, " ".
A string that leads and / or trails with spaces, " Hello World ".
A string that contains multiple spaces in between words, "Hello World".
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use #stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote #stema regardless.)
Some things to note about this (and #stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
Still not working?
Check what dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s aren't defined, so write them out with character classes, e.g. [a-zA-Z0-9_] and [\f\n\p\r\t], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.

One possibility would be to just add the space into you character class, like acheong87 suggested, this depends on how strict you are on your pattern, because this would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w this is in most regex flavours the same than [a-zA-Z0-9_] (in some it is Unicode based)
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string

This one worked for me
([\w ]+)

Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
"whitespace"? - allow whitespace after word, set is as optional

I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: If you intentionally worded your regex that it also allows empty Strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.

This regular expression
^\w+(\s\w+)*$
will only allow a single space between words and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string

Just add a space to end of your regex pattern as follows:
[a-zA-Z0-9_ ]

This does not allow space in the beginning. But allowes spaces in between words. Also allows for special characters between words. A good regex for FirstName and LastName fields.
\w+.*$

For alphabets only:
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$

If you are using JavaScript then you can use this regex:
/^[a-z0-9_.-\s]+$/i
For example:
/^[a-z0-9_.-\s]+$/i.test("") //false
/^[a-z0-9_.-\s]+$/i.test("helloworld") //true
/^[a-z0-9_.-\s]+$/i.test("hello world") //true
/^[a-z0-9_.-\s]+$/i.test("none alpha: ɹqɯ") //false
The only drawback with this regex is a string comprised entirely of spaces. "       " will also show as true.

It was my regex: #"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex before *
#"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now string is allowed to have spaces.

This regex allow only alphabet and spaces:
^[a-zA-Z ]*$

Try with this one:
result = re.search(r"\w+( )\w+", text)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Capturing uppercase words in text with regex - regex

In your regex, you just need to change the second * for a +: [A-Z](?: +[A-Z]+){4,} Explanation While using (?: +[A-Z]), you are matchin "a space followed by 0+ letters". So you are matching spaces. When replacing the * by a +, you matches spaces if there are uppercase after. Demo on regex101

Replace the s by +s, and your regex only matches the words in the first sentence. . also matches the empty string. Looking at you regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between every now and then.

You had to mark at least 1 upper case as [A-Z](?: +[A-Z]+){4,} see updated regex. A better Regex will allow non spaces as [A-Z](?: [A-Z]+){4,}.see better regex After will indicate to allow at least upper case even without spaces.

Related

Using regex to find abbreviations

How to overcome multiple matches within same sentence (regex) [duplicate]

Regex - Matching Strings with a Single Character

Cleaning up a regular expression which has lots of repetition

Regular expression to allow spaces between words

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Capturing uppercase words in text with regex - regex

In your regex, you just need to change the second * for a +: [A-Z]*(?: +[A-Z]+){4,} Explanation While using (?: +[A-Z]*), you are matchin "a space followed by 0+ letters". So you are matching spaces. When replacing the * by a +, you matches spaces if there are uppercase after. Demo on regex101

Replace the *s by +s, and your regex only matches the words in the first sentence. .* also matches the empty string. Looking at you regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between every now and then.

You had to mark at least 1 upper case as [A-Z]*(?: +[A-Z]+){4,} see updated regex. A better Regex will allow non spaces as [A-Z]*(?: *[A-Z]+){4,}.see better regex * After will indicate to allow at least upper case even without spaces.

Related

Using regex to find abbreviations

How to overcome multiple matches within same sentence (regex) [duplicate]

Regex - Matching Strings with a Single Character

Cleaning up a regular expression which has lots of repetition

Regular expression to allow spaces between words

Categories

Resources

In your regex, you just need to change the second * for a +: [A-Z](?: +[A-Z]+){4,} Explanation While using (?: +[A-Z]), you are matchin "a space followed by 0+ letters". So you are matching spaces. When replacing the * by a +, you matches spaces if there are uppercase after. Demo on regex101

Replace the s by +s, and your regex only matches the words in the first sentence. . also matches the empty string. Looking at you regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between every now and then.

You had to mark at least 1 upper case as [A-Z](?: +[A-Z]+){4,} see updated regex. A better Regex will allow non spaces as [A-Z](?: [A-Z]+){4,}.see better regex After will indicate to allow at least upper case even without spaces.