Regex: capturing capital word with nothing in front of it

I'm trying to match all proper nouns in some given text.
So far I've got (?<![.?!]\s|^)(?<!\“)[A-Z][a-z]+ which ignores capitalized words preceded by ., ? or ! and a whitespace character, as well as words directly after an opening quotation mark (“). Can be seen here.
But it doesn't catch capital words at the beginning of sentences. So given the text:
Alec, Prince, so Genoa and Lucca are now just family estates of the “What”. He said no. He, being the Prince.
It successfully catches Prince, Genoa, Lucca but not Alec.
So I'd like some help modifying it, if possible, to match any capitalized word with nothing behind it. (I'm not sure how to define "nothing".)

You can put the “ as the second alternative in the lookbehind instead of ^ which asserts the start of the string.
Then you can omit (?<!\“)
(?<![.?!]\s|“)[A-Z][a-z]+
Explanation
(?<! Negative lookbehind, assert that what is directly to the left of the current position is not
[.?!]\s Match any of . ? ! followed by a whitespace char
| Or
“ Match literally
) Close lookbehind
[A-Z][a-z]+ Match an uppercase char A-Z and 1+ chars a-z
See a regex demo.
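For anyone trying this in Python, here is a minimal sketch. One assumption to flag: Python's built-in re module rejects a lookbehind whose alternatives differ in width, so the single lookbehind above is split into two fixed-width ones here. Engines that accept alternatives of different lengths in a lookbehind (PCRE, for example) can use the combined form as written. The variable names are only illustrative.

```python
import re

text = ("Alec, Prince, so Genoa and Lucca are now just family estates "
        "of the “What”. He said no. He, being the Prince.")

# Two fixed-width negative lookbehinds: not after ".", "?" or "!" plus a
# whitespace char, and not directly after an opening curly quote.
pattern = re.compile(r"(?<![.?!]\s)(?<!“)[A-Z][a-z]+")

matches = pattern.findall(text)
print(matches)
```

With the sample text from the question, Alec is now matched along with Prince, Genoa and Lucca, while What and the two sentence-initial He are skipped.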

The thing you're looking for is called a "word boundary", which is denoted as \b in a lot of regex languages.
Try \b[A-Z][a-z]*\b.
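A quick Python check of this suggestion against the sample text from the question (a sketch, not a full solution): \b only requires a transition between a word character and a non-word character, so the pattern picks up every capitalized word, including the quoted one and those at the start of a sentence.

```python
import re

text = ("Alec, Prince, so Genoa and Lucca are now just family estates "
        "of the “What”. He said no. He, being the Prince.")

# \b matches between a word char and a non-word char (or the string edge).
boundary_matches = re.findall(r"\b[A-Z][a-z]*\b", text)
print(boundary_matches)
```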

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am fairly new to regex and am having difficulty creating an expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words where a dash has been used in between, and report the whole word (both before and after the dash). If numbers are present, they should be reported with the word as well.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" pick up words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally the allowed characters, then an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also not match when there are hyphens surrounding the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
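Since the question is about Python, here is a small sketch that runs both patterns over the words listed in the question. The names basic, strict, wanted and unwanted are just illustrative.

```python
import re

# First pattern: at least two uppercase letters, hyphen-separated parts allowed.
basic = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)
# Second pattern: same, but refuses a hyphen directly before or after the word.
strict = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

wanted = ["ABC", "AbC", "ABc", "A-ABC", "a-ABC", "ABC-a",
          "ABC123", "ABC-123", "123-ABC"]
unwanted = ["A-bc", "a-b-c"]
sample = ", ".join(wanted + unwanted)

found_basic = basic.findall(sample)
found_strict = strict.findall(sample)
print(found_basic)
print(found_strict)

# The two patterns only differ when hyphens surround the word.
print(basic.findall("-ABC-"), strict.findall("-ABC-"))
```

Both patterns return exactly the wanted words for the sample. On "-ABC-" the first one still reports ABC, while the second one reports nothing.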

Regex - How to prevent any string that starts with "de" but cannot use lookahead or lookbehind?

I have a regex
[a-zA-Z][a-z]
I have to change this regex so that it does not accept strings that start with "de", "DE", "dE" or "De". I cannot use lookbehind or lookahead because my system does not support them.
There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would write:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, a string like diZ would not match what is before the | (the alternation), and would therefore fall into group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, well, there will be no groups to get at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
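A minimal Python sketch of this "honeypot" idea, assuming the pattern from the question stands in for <your-regex>. The names YOUR_REGEX and accepts are made up for the example, and the way you read group 1 will differ in other engines.

```python
import re

# Hypothetical stand-in for <your-regex>: the pattern from the question.
YOUR_REGEX = r"[a-zA-Z][a-z]"

# The first alternative is the honeypot; only the second one fills group 1.
pattern = re.compile(rf"^(?:[dD][eE].*|({YOUR_REGEX}))$")

def accepts(candidate: str) -> bool:
    """Return True only when the wanted branch (group 1) took part in the match."""
    m = pattern.match(candidate)
    return m is not None and m.group(1) is not None

for s in ["de", "Desk", "ab", "Zz", "a1"]:
    print(s, accepts(s))
```

Here "de" and "Desk" do match, but through the honeypot, so group 1 is empty and they are rejected. "a1" does not match at all.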
Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+
This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
See live demo.
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.
You can use an alternation excluding matching the d and D from the first character, or exclude matching the e as the second character.
Note that the pattern [a-zA-Z][a-z] matches at least 2 characters, so will the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase char a-z without e
) Close non capture group
.* Match 0+ times any char except a newline
Regex demo
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
Regex demo
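To see how the anchored alternation behaves, here is a short Python sketch. It uses no lookarounds, so the same pattern should carry over to engines without them. The sample strings are made up for the test.

```python
import re

# Either the first char is a letter other than d/D, or the second char is a
# lowercase letter other than e.
pattern = re.compile(r"^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*")

samples = ["desk", "De", "dog", "apple", "eel", "DEal", "dEal"]
results = {s: bool(pattern.match(s)) for s in samples}
print(results)
```

Strings starting with d are still accepted when the second letter is not e (dog). DEal and dEal are rejected because the second character class only allows lowercase letters, just like the original [a-zA-Z][a-z].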

RegEx negative lookahead on pattern

I want to find all expressions that don't end with ":"
I tried to do it like that:
[a-z]{2,}(?!:)
On this text:
foobar foobaz:
foobaz
foobaz:
The problem is that it just drops the last character before the ":" instead of rejecting the whole match.
Here is the example: https://regex101.com/r/jtLRvz/1
How can I get the negative lookahead work for the whole regular expression?
When [a-z]{2,}(?!:) is applied to baz:, [a-z]{2,} first grabs all the lowercase ASCII letters at once (baz), and the negative lookahead (?!:) then checks the char immediately to the right. It is :, so the lookahead fails and the engine looks for a different way to match the string. Since {2,} is also satisfied with two chars instead of the three currently matched, it backtracks to ba, where the next char is z rather than :, and finds a valid match.
Add a-z to the lookahead pattern to make sure the char right after 2 or more lowercase ASCII letters is not a letter and not a colon:
[a-z]{2,}(?![a-z:])
             ^^^
See the regex demo
If your regex engine supports possessive quantifiers or atomic groups, you may use them to prevent backtracking into the [a-z]{2,} subpattern:
[a-z]{2,}+(?!:)
(?>[a-z]{2,})(?!:)
See another regex demo.
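The backtracking described in this answer can be reproduced with a few lines of Python. This is only a sketch using the text from the question. It sticks to the lookahead fix, because possessive quantifiers and atomic groups are only available in Python's re module from version 3.11 on.

```python
import re

text = "foobar foobaz:\nfoobaz\nfoobaz:"

# Original attempt: backtracking gives up the last letter to satisfy (?!:).
naive = re.findall(r"[a-z]{2,}(?!:)", text)

# Fixed: the next char may be neither a colon nor another letter,
# so a shorter, backtracked match is rejected as well.
fixed = re.findall(r"[a-z]{2,}(?![a-z:])", text)

print(naive)
print(fixed)
```

The first call still reports fooba for both lines ending in a colon. The second one only returns foobar and foobaz.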

Regex not capturing group that contains period

I'm working on a regex to match anything starting with a letter in a string similar to G71P100Q110U0W0F.01. I've come up with ([A-Z].*?)(?=[A-Z]), which works fine until I reach F.01, where it stops matching. From what I've read, the .*? should match anything lazily, but it doesn't. What do I need to add to include the period?
Edit:
Desired matches for the string G71P100Q110U0W0F.01 would be G71, P100, Q110, U0, W0, and F.01. I can iterate through the matches easily enough in VBA.
You can delete the lookahead: (?=[A-Z]). I.e., your regex would be simplified to ([A-Z].*?)
This lookahead requires an uppercase letter right after the end of .*?. You also match an uppercase letter at the beginning of your regex: ([A-Z]...). So every match needs a second uppercase letter after it, and the last token, F.01, has none.
Unfortunately, I don't understand the rules on what you want and don't want to match. It would be cool to have more examples both for matching and not matching strings.
Probably this regex would be good for you:
([A-Z].*?)\.[0-9]+
It makes sure that your text:
starts with a capital letter
ends with a dot, and then one or more numbers
Demo here.
What you are trying to do is:
[A-Z][^A-Z]*
Match an uppercase letter then anything but an uppercase letter.
Live demo
From what I've read, the .*? should match anything lazily...
and it's the exact thing that's happening. It stops right after it finds following character is an uppercase letter.
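For comparison, here is a small sketch of both patterns. It is written in Python purely for illustration (the question uses VBA, where the calls will look different). It shows why the lookahead version drops the last token and what the negated character class returns instead.

```python
import re

code = "G71P100Q110U0W0F.01"

# Original attempt: every token must be followed by another uppercase letter,
# so the final token (F.01) is never matched.
with_lookahead = re.findall(r"([A-Z].*?)(?=[A-Z])", code)

# Suggested pattern: an uppercase letter, then everything up to the next one.
tokens = re.findall(r"[A-Z][^A-Z]*", code)

print(with_lookahead)
print(tokens)
```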
Try this:
[A-Z]\.?[0-9]+
Period must be escaped.
I assume you are looking for a regex pattern that matches a sequence of non-space character(s) starting with a letter:
\b[a-zA-Z]\S*
[A-Z][^A-Z\s]+
[A-Z] match a single uppercase letter
[^A-Z\s]+ match anything that's not whitespace or an uppercase letter
Run code sample for demo
var input = "G71P100Q110U0W0F.01"
console.log(input.match(/[A-Z][^A-Z\s]+/g))

Regex to match characters to the right of a colon

I'm stuck on a regex. I'm trying to match words in any language to the right of a colon without matching the colon itself.
The basic rule:
For a line to be valid, it must not begin with or contain any characters outside of [a-z0-9_] until after :.
Any characters to the right of : should match as long as the line begins with the set of characters defined above.
For instance, given a string such as these:
this string should not match
bob_1:Hi. I'm Bob. I speak русский and this string should match
alice:Hi Bob. I speak 한국어 and this string should also match
http://example.com - would prefer to not match URLs
This string:should not match because no spaces or capital letters are allowed left of the colon
Only 2 of the 5 strings above need to match. And only to the right of the colon.
Hi. I'm Bob. I speak русский and this string should match
Hi Bob. I speak 한국어 and this string should also match
I'm currently using (^[a-z0-9_]+(?=:)) to match characters to the left of :. I just can't seem to reverse the logic.
The closest I have at the moment is (?!(?!:)).+. This seems to match everything to right of the colon as well as the colon itself. I just can't figure out how to not include : in the match.
Can one of you regex wizards help me out? If anything is unclear please let me know.
Short regex pattern (case insensitive):
^\w+:(\w.*)
\w - matches any word character (equal to [a-zA-Z0-9_])
https://regex101.com/r/MZhqSL/6
As you marked pcre, here's the pattern you need (only to the right of the colon):
^\w+:\K\w.*
\K - resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
https://regex101.com/r/E1yHVY/1
You can use this regex:
^[a-z0-9_]+:\K(?!//).*
RegEx Demo
RegEx Breakup:
^: Start
[a-z0-9_]+: Match 1+ of [a-z0-9_] characters
:: Match a colon
\K: Reset matched info so far
(?!//): Negative lookahead to disallow // right after colon to avoid matching potential URLs
.*: Match anything until end
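If \K is not available, a capture group can play the same role. The sketch below assumes Python, whose built-in re module does not support \K, and uses the five sample lines from the question. The names lines and right_sides are only illustrative.

```python
import re

lines = """this string should not match
bob_1:Hi. I'm Bob. I speak русский and this string should match
alice:Hi Bob. I speak 한국어 and this string should also match
http://example.com - would prefer to not match URLs
This string:should not match because no spaces or capital letters are allowed left of the colon"""

# Group 1 stands in for the \K reset: it holds only the text after the colon.
# The negative lookahead (?!//) skips URL-like lines such as http://...
pattern = re.compile(r"^[a-z0-9_]+:(?!//)(.*)", re.MULTILINE)

right_sides = pattern.findall(lines)
for side in right_sides:
    print(side)
```

Only the bob_1 and alice lines produce a result, and in both cases the captured text starts right after the colon.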
You can use the regex: ^.*?:(.*)$
^.*?: - from the beginning of the line, any character until the colon (non-greedy) included
(.*)$ - use a matching group to anything that follows it till the end of the line
Link to DEMO