Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to RegEx and I am having difficulties creating an expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. It should also pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are present, they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However, this also picks up these unwanted words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" match words such as "Abc-Abc", but I don't believe there is a way to avoid these.

If lookaheads are supported and you don't want to match a double hyphen (--), you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally the allowed characters, followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ Match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also avoid a match when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
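For what it's worth, here is a minimal Python sketch that runs both patterns against the sample words from the question. The names loose, strict, wanted and unwanted are only for illustration and are not part of the question or answer.

```python
import re

# First pattern: a (possibly hyphenated) word with at least two uppercase letters.
loose = re.compile(r'\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b')
# Second pattern: same, but rejects a word that touches a stray hyphen.
strict = re.compile(r'\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)')

wanted = ["ABC", "AbC", "ABc", "A-ABC", "a-ABC", "ABC-a", "ABC123", "ABC-123", "123-ABC"]
unwanted = ["A-bc", "a-b-c"]

# Put all sample words in one string, as they would appear in running text.
text = ", ".join(wanted + unwanted)
found = loose.findall(text)
print(found)

# The two patterns differ only when a hyphen is left dangling.
print(loose.findall("ABC- and -ABC"))   # ['ABC', 'ABC']
print(strict.findall("ABC- and -ABC"))  # []
```

Since the patterns contain only non-capturing groups, findall returns the whole matched words.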

Related

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex that matches any sequence of letters, unless the sequence is followed by a :, in which case the match should be ignored. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1

The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead, and finds a match whenever there are at least two letters before the colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding the implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed by a colon, you can explicitly add one more condition that is implied: and not followed by another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
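To see the backtracking effect in action, here is a small Python sketch (the question does not say which language is used, so Python and the names naive and fixed are my assumptions):

```python
import re

# Original attempt: the engine gives back the last letter and still matches.
naive = re.compile(r'[a-zA-Z]+(?!:)')
# Fixed: a shortened match is rejected too, because a letter follows it.
fixed = re.compile(r'[A-Za-z]+(?![A-Za-z:])')

for text in ["lame:joker", "real:c", "c:real"]:
    print(text, naive.findall(text), fixed.findall(text))
```

With lame:joker the first pattern returns lam and joker, while the second returns only joker.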
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letters, digits and underscores). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_", and fails only on "12:".
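The difference between the two number patterns is easy to check with a short Python sketch (Python and the variable names are my assumptions, not part of the answer):

```python
import re

samples = ["12,", "12:", "12c", "12_"]

with_boundary = re.compile(r'\d+\b(?!:)')   # needs a word boundary after the digits
with_class = re.compile(r'\d+(?![\d:])')    # only forbids a digit or colon after them

for s in samples:
    print(repr(s), with_boundary.findall(s), with_class.findall(s))
```

The word boundary version rejects 12c and 12_ because c and _ are word characters, so there is no boundary after the 2.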

Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

RegEx more than multiple characters before number

I really don't use RegEx that much. You could say I am a RegEx n00b. I have been working on this issue for half a day.
I am trying to write a pattern that looks backward from a number character. For example:
1. bob1 => bob
2. cat3 => cat
3. Mary34 => Mary
So far I have this (?![A-Z][a-z]{1,})([A-Za-z_])
It only matches individual characters, but I want all the characters before the number character. I tried adding ^ and $ to my pattern and using an online simulator. I am unsure where to put the ^ and $.
NOTE: I am using RegEx for the .NET Framework

You may use a regex like
[\p{L}_]+(?=\d)
or
[\w-[\d]]+(?=\d)
See the regex demo
Pattern details
[\p{L}_]+ - any 1 or more letters (both lower- and uppercase) and/or _
OR
[\w-[\d]]+ - 1 or more word chars except digits (the -[] inside a character class is a character class subtraction construct)
(?=\d) - a positive lookahead that requires a digit to appear immediately to the right of the current location
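Both patterns above rely on .NET syntax. Should you ever need the same idea in Python, note that the standard re module supports neither \p{L} nor character class subtraction. A rough equivalent of [\w-[\d]] is the double negative [^\W\d]. This is only a sketch of a Python port, which is my assumption and not part of the .NET answer:

```python
import re

# [^\W\d] = "not a non-word char and not a digit", i.e. letters and underscore.
before_digit = re.compile(r'[^\W\d]+(?=\d)')

for s in ["bob1", "cat3", "Mary34"]:
    m = before_digit.search(s)
    print(s, "=>", m.group() if m else None)
```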

If we break down your RegEx, we see:
(?![A-Z][a-z]{1,}) which says "look ahead and make sure the next characters are NOT one uppercase letter followed by one or more lowercase letters" and ([A-Za-z_]) which says "match one letter or underscore". This ends up matching any single letter or underscore that does not start an uppercase-then-lowercase sequence, which is why you only get individual characters.
If I understand what you want to achieve, then you want all of the letters before a number. I would write something like that as:
\b([a-zA-Z]+)[0-9]
This will start at a word boundary \b, match one or more letters, and require a digit right after the matched string.
(The syntax I used seems to match this document about .NET RegEx: https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions)
In light of Wiktor Stribizew's comment, here is a pure match RegEx:
\b[a-zA-Z_]+(?=[0-9])
This matches the pattern and then looks ahead for the digit. This is better than my first lookahead attempt. (Thank you Wiktor.)
http://www.rexegg.com/regex-lookarounds.html
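The practical difference between the capturing version and the lookahead version is where the wanted text ends up. Both patterns also happen to be valid in Python, so here is a quick sketch using Python's re module (an assumption on my part, since the question is about .NET):

```python
import re

capturing = re.compile(r'\b([a-zA-Z]+)[0-9]')     # digit is consumed, letters are in group 1
lookahead = re.compile(r'\b[a-zA-Z_]+(?=[0-9])')  # digit is only checked, not consumed

m1 = capturing.search("Mary34")
m2 = lookahead.search("Mary34")
print(m1.group(0), m1.group(1))  # Mary3 Mary
print(m2.group(0))               # Mary
```

With the lookahead, the whole match is already the text you want, so no group is needed.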

Regex, match anything unless just numbers

I have got the following regex expression so far:
used-cars\/((?:\d+[a-z]|[a-z]+\d)[a-z\d]*)
This is sort of working, but I need it to match basically ANYTHING apart from JUST numbers after used-cars/
Match:
used-cars/page-1
used-cars/1eeee
used-cars/page-1?*&_-
Not Match:
used-cars/2
used-cars/400
Can someone give me a hand? I've been trying to get this working for a while now!

There are a few shortcomings in your regex used-cars\/((?:\d+[a-z]|[a-z]+\d)[a-z\d]*).
It checks for used-cars/ followed by either one or more digits and then one character in a-z, OR one or more characters in a-z and then one digit.
[a-z\d]* then matches only letters or digits, and is optional.
So it cannot match characters such as - or ?, which your examples require.
Try the following regex.
Regex: ^used-cars\/(?!\d+$)\S*$
Explanation:
used-cars\/ searches for literal used-cars/
(?!\d+$) is a negative lookahead for only digits until the end. If only digits are present then it won't be a match.
\S* matches zero or more characters other than whitespace.
Regex101 Demo
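As a quick sanity check, here is a minimal Python sketch running the pattern over the examples from the question. The question does not name a language, so Python is my assumption. The / is left unescaped because Python patterns have no delimiters.

```python
import re

not_only_digits = re.compile(r'^used-cars/(?!\d+$)\S*$')

should_match = ["used-cars/page-1", "used-cars/1eeee", "used-cars/page-1?*&_-"]
should_not_match = ["used-cars/2", "used-cars/400"]

for url in should_match + should_not_match:
    print(url, bool(not_only_digits.search(url)))
```

Note that \S* can match zero characters, so a bare used-cars/ is accepted as well. Use \S+ instead if that should be rejected.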

Regex to match 4 letters in a string

I am trying to write some regex that will match a string that contains 4 or more letters in it that are not necessarily in sequence.
The input string can have a mix of upper and lowercase letters, numbers, non-alpha chars etc, but I only want it to pass the regex test if it contains at least 4 upper or lowercase letters.
An example of what I would like to be a valid input can be seen below:
a124Gh0st
I have currently written this piece of regex:
(?(?=[a-zA-Z])([a-zA-Z])| )
Which returns 5 matches successfully, but it will currently always pass as long as I have more than 1 letter in the input string. If I add {4,} to the end of it then it works, but only in situations where there are 4 letters in a row.
I am using the following website to test what I have been doing: regex101
Any help on this would be greatly appreciated.

You may use
(?s)^([^a-zA-Z]*[A-Za-z]){4}.*
or
^([^a-zA-Z]*[A-Za-z]){4}[\s\S]*
See the regex demo.
Details:
^ - start of string
([^a-zA-Z]*[A-Za-z]){4} - exactly 4 sequences of:
[^a-zA-Z]* - 0+ chars other than ASCII letters
[A-Za-z] - an ASCII letter
[\S\s]* - any 0+ chars (same as .* if the DOTALL modifier is enabled).
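Here is a small Python sketch of the second pattern, assuming Python's re module (the question only mentions regex101) and a made-up name four_letters:

```python
import re

four_letters = re.compile(r'^([^a-zA-Z]*[A-Za-z]){4}[\s\S]*')

# The last sample has a newline between its letters.
for s in ["a124Gh0st", "a1b2c3", "1234", "ab\ncd"]:
    print(repr(s), bool(four_letters.match(s)))
```

The negated class [^a-zA-Z] also matches newlines, so the letters may be spread over several lines.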

Why don't you just match the zero or more characters between each letter? For example,
(?:[A-Za-z].*){4}
You'll recognize the [A-Za-z]. The . matches any character, so .* is a run of any number (including zero) of any character. The group of a letter followed by any number of any characters is repeated four times, so this pattern matches if and only if at least four letters appear in the string. (Note that the trailing .* of the fourth repeat of the pattern is mostly inconsequential, since it can match zero characters).
If you are using a regex language that supports reluctant quantifiers, then using them will make this pattern considerably more efficient. For example, in Java or Perl, one might prefer to use
(?:[A-Za-z].*?){4}
The .*? still matches any number of any character, but the matching algorithm will match as few characters as possible with each such run. This will reduce the amount of backtracking it needs to perform. For this particular pattern, it will reduce the needed backtracking to zero.
If you do not have reluctant quantifiers in your regex dialect, then you can achieve the same desirable effect a bit more verbosely:
(?:[A-Za-z][^A-Za-z]*){4}
There, only non-letters are matched for the runs between letters.
Even with this, the pattern uses some regex features not present in all regex flavors -- non-capturing groups, enumerated quantifiers -- but these are present in your original regex. For a maximally-compatible form, you might write
[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]
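To compare the variants side by side, here is a hedged Python sketch. Python, the helper has_four_letters and the dictionary keys are my own choices for illustration.

```python
import re

variants = {
    "greedy":   r'(?:[A-Za-z].*){4}',
    "lazy":     r'(?:[A-Za-z].*?){4}',
    "negated":  r'(?:[A-Za-z][^A-Za-z]*){4}',
    "unrolled": r'[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]',
}

def has_four_letters(pattern, s):
    """Return True if the pattern finds a match anywhere in s."""
    return re.search(pattern, s) is not None

for name, pattern in variants.items():
    print(name,
          has_four_letters(pattern, "a124Gh0st"),   # five letters
          has_four_letters(pattern, "a1b2c3"),      # only three letters
          has_four_letters(pattern, "a\nb\nc\nd"))  # letters split by newlines
```

One caveat worth knowing: without a DOTALL flag the dot does not match a newline, so the two dot-based variants miss letters spread over several lines, while the negated-class variants still find them.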

Regex matching on word boundary OR non-digit

I'm trying to use a Regex pattern (in Java) to find a sequence of 3 digits and only 3 digits in a row. 4 digits doesn't match, 2 digits doesn't match.
The obvious pattern to me was:
"\b(\d{3})\b"
That matches against many source string cases, such as:
">123<"
" 123-"
"123"
But it won't match against a source string of "abc123def" because the c/1 boundary and the 3/d boundary don't count as a "word boundary" match that the \b class is expecting.
I would have expected the solution to be adding a character class that includes both non-Digit (\D) and the word boundary (\b). But that appears to be illegal syntax.
"[\b\D](\d{3})[\b\D]"
Does anybody know what I could use as an expression that would extract "123" for a source string situation like:
"abc123def"
I'd appreciate any help. And yes, I realize that in Java one must double-escape the codes like \b to \\b, but that's not my issue and I didn't want to limit this to Java folks.

You should use lookarounds for those cases:
(?<!\d)(\d{3})(?!\d)
This means: match 3 digits that are neither preceded nor followed by a digit.
Working Demo
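The lookaround syntax is the same in Python as in Java, so the behaviour can be sketched quickly in Python (my choice of language here, along with the name exactly_three). In a Java string literal each backslash would need to be doubled.

```python
import re

exactly_three = re.compile(r'(?<!\d)(\d{3})(?!\d)')

for s in ["abc123def", ">123<", "1234", "12", "a1234b567c"]:
    print(repr(s), exactly_three.findall(s))
```

Only runs of exactly three digits are returned. In the last sample 1234 is skipped and 567 is found.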

Lookarounds can solve this problem, but I personally try to avoid them because not all regex engines fully support them. Additionally, I wouldn't say this issue is complicated enough to merit the use of lookarounds in the first place.
You could match this: (?:\b|\D)(\d{3})(?:\b|\D)
Then return: \1
Or if you're performing a replacement and need to match the entire string: (?:\b|\D)+(\d{3})(?:\b|\D)+
Then replace with: \1
As a side note, the reason \b wasn't working as part of a character class was because within brackets, [\b] actually has a completely different meaning--it refers to a backspace, not a word boundary.
Here's a Working Demo.
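A short Python sketch of this approach (Python is my assumption, as the question is about Java) shows the capture group in use and one trade-off to be aware of. Because \D consumes the neighbouring character, two groups separated by a single non-digit cannot both be found in one pass, whereas the lookaround pattern from the other answer finds both.

```python
import re

consuming = re.compile(r'(?:\b|\D)(\d{3})(?:\b|\D)')
lookaround = re.compile(r'(?<!\d)(\d{3})(?!\d)')

m = consuming.search("abc123def")
print(m.group(0), m.group(1))          # c123d 123

# The "a" is consumed by the first match, so 456 has nothing left in front of it.
print(consuming.findall("123a456"))    # ['123']
print(lookaround.findall("123a456"))   # ['123', '456']

# Inside a character class, \b means backspace (\x08), not a word boundary.
print(re.fullmatch(r'[\b]', '\x08') is not None)  # True
```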
