Finding all possible Acronyms - regex

I have created a script using VBA that goes through a Word document to find every word that could possibly be an acronym, but I found that my regex pattern does not find all of them.
The regEx pattern I am using is "([A-Z]{2,})(-([A-Z]{2,})[A-Za-z0-9])"
With this pattern I am able to find
AA
AAA
AA-BB
AA-BBB
AAA-BB
AAA-BBB
AAA-1234
AAA-BBB-1234
but it does not find these words
B2B
B2B-1234
B2B-A1A-1234
The expectation is that a matching word starts with a letter and contains at least two uppercase letters and at least one number. In addition, if there are dashes in the word, then the characters before each dash must also meet this expectation.
Is there a way to extend the regex pattern above so that it also includes the letter-digit-letter acronyms?

Milco, welcome to StackOverflow. I think that the following regex will work for you:
([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*
This regex accommodates digits and an optional number of hyphenated terms and matches each of your cases above. I tested it out at regextesteronline.com - I'm assuming that VB.net regexes are the same as VBA, which they should be, at least for basic regexes.
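As a quick sanity check, here is a minimal sketch that runs the pattern against every case listed in the question. It uses Python's re module rather than VBA (an assumption made for convenience; the pattern only uses basic syntax, but it was not run in VBA here), and the helper name is_acronym is made up for illustration.

```python
import re

# The pattern from the answer: a leading uppercase letter, then uppercase
# letters or digits, then any number of hyphenated groups.
ACRONYM = re.compile(r"([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*")

def is_acronym(word):
    """Return True if the whole word matches the acronym pattern."""
    return ACRONYM.fullmatch(word) is not None

samples = [
    "AA", "AAA", "AA-BB", "AA-BBB", "AAA-BB", "AAA-BBB",
    "AAA-1234", "AAA-BBB-1234", "B2B", "B2B-1234", "B2B-A1A-1234",
]

for word in samples:
    print(word, is_acronym(word))
```

Note that fullmatch tests a single word. To scan running text you would search instead, and probably wrap the pattern in \b word boundaries.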

Related

Regex to replace first lowercase character in a line into uppercase

I have a very large file containing thousands of sentences. In all of them, the first word of each sentence begins with lowercase, but I need them to begin with uppercase.
I looked through the site trying to find a regex to do this but I was unable to. I learned a lot about regex in the process, which is always a plus for my job, but I was unable to find specifically what I am looking for.
I tried to find a way of compiling the code from several answers, including the following:
Convert first lowercase to uppercase and uppercase to lowercase (regex?)
how to change first two uppercase character to lowercase character on each line in vim
Regex, two uppercase characters in a string
Convert a char to upper case using regular expressions (EditPad Pro)
But for different reasons none of them served my purpose.
I am working with a translation-specific application which accepts regex.
Do you think this is possible at all? It would save me hours of tedious work.
You can use this regex to search for the first letters of sentences:
(?<=[\.!?]\s)([a-z])
It matches a lowercase letter [a-z], following the end of a previous sentence (which might end with one of the following: [\.!?]) and a space character \s.
Then make a substitution with \U$1.
The only case it doesn't handle is the very first sentence. I intentionally kept the regex simple, because it's easy to capitalize the very first letter manually.
Working example: https://regex101.com/r/hqwK26/1
UPD: If your software doesn't support \U, you might want to copy your text to Notepad++ and make the replacement there. Notepad++ fully supports \U; I just checked.
UPD2: According to the comments, the task is slightly different, and just the first letters of each line should be capitalized.
There is a simple regex for that: ^([a-z]), with the same substitution pattern.
Here is a working example: https://regex101.com/r/hqwK26/2
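If the tool at hand has no \U in its replacement syntax, the same two patterns can be applied from a script. The following is a minimal sketch in Python, whose re module has no \U either, so a replacement function does the uppercasing. The function names are made up for illustration.

```python
import re

def capitalize_sentences(text):
    """Uppercase a lowercase letter that follows '.', '!' or '?' plus one whitespace character."""
    return re.sub(r"(?<=[.!?]\s)([a-z])", lambda m: m.group(1).upper(), text)

def capitalize_lines(text):
    """Uppercase a lowercase letter at the start of each line."""
    return re.sub(r"^([a-z])", lambda m: m.group(1).upper(), text, flags=re.MULTILINE)

print(capitalize_sentences("hello there. how are you? fine! ok"))
print(capitalize_lines("first line\nsecond line"))
```

As in the regex-only version, capitalize_sentences leaves the very first letter of the text untouched.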
Taking Ildar's answer and combining both of his patterns should work with no compromises.
(?<=[\.!?]\s)([a-z])|^([a-z])
This basically says: match the first pattern OR the second pattern. But because you're now technically capturing two groups instead of one, you'll have to refer to the second group as $2. That should be fine, because only one of the two alternatives can match at a time.
So your substitution pattern would then be as follows...
\U$1$2
Here's a working example, again based on Ildar's answer...
https://regex101.com/r/hqwK26/13
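For comparison, here is a rough Python sketch of the combined pattern. It assumes multiline mode, so that ^ matches at the start of every line, and it mirrors the $1$2 idea by taking whichever group took part in the match. The function name is hypothetical.

```python
import re

# Either a lowercase letter after sentence-ending punctuation and whitespace,
# or a lowercase letter at the start of a line.
COMBINED = re.compile(r"(?<=[.!?]\s)([a-z])|^([a-z])", re.MULTILINE)

def capitalize_all(text):
    """Uppercase whichever of the two groups matched."""
    return COMBINED.sub(lambda m: (m.group(1) or m.group(2)).upper(), text)

print(capitalize_all("first line. second sentence\nthird line! fourth"))
```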

Using a Regex Pattern that finds Abbreviations

I am looking through volumes of data and need to identify certain patterns one of which is abbreviations. The basic rules to identify them in the content I am going through is
They are all in capital letters.
They are separated by dots.
They may consist of one or more letters.
They may or may not end with a dot.
I am looking at individual words therefore looking for multiple occurrences in the string is not required.
Examples
U.S., U.S, U.S.S.R., V.
Can someone help construct a regex search pattern for me?
Many thanks
MS
You can use this regex:
^([A-Z]\.)*[A-Z]\.?$
This should do the trick:
\b(?:\p{Lu}\.)*\p{Lu}\b\.?
I've used \p{Lu} (unicode uppercase letters) since you want to match any alphabet.
If you can't make \b unicode aware in your dialect, here's an alternative:
(?<!\p{L})(?:\p{Lu}\.)*\p{Lu}(?!\p{L})\.?
This will work. It also matches the trailing dot.
\b([A-Z]\.)*[A-Z]\b\.?
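A small sketch of how the ASCII-only patterns behave, using Python's re module (which does not support \p{Lu}, so the Unicode variants are not shown). The anchored pattern is checked against single words, and the \b variant is used to pull abbreviations out of a sentence. The group is made non-capturing here so that findall returns whole matches, and the sample sentence is made up.

```python
import re

# Anchored form: the whole word must be an abbreviation.
WHOLE_WORD = re.compile(r"^([A-Z]\.)*[A-Z]\.?$")

# Word-boundary form, for searching inside longer text.
IN_TEXT = re.compile(r"\b(?:[A-Z]\.)*[A-Z]\b\.?")

for word in ["U.S.", "U.S", "U.S.S.R.", "V.", "US", "u.s."]:
    print(word, bool(WHOLE_WORD.match(word)))

found = IN_TEXT.findall("The U.S.S.R. and U.S were mentioned")
print(found)
```

Keep in mind that the \b form also matches any standalone capital letter, such as the pronoun "I".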

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the input before attempting the match. It looks for between 3 and 10 characters from the class [\d-], followed by a dash and a single digit. After that comes the actual match, which confirms that the format of your string really is digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
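To reproduce that list, here is a minimal sketch that runs the look-ahead pattern over the sample text from the question. It uses Python's re module, which is an assumption; the original was tested on RegExr.

```python
import re

PATTERN = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

text = """22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
"""

# The pattern has no capturing groups, so findall returns the full matches.
matches = PATTERN.findall(text)
print(matches)
```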
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
The patterns are also built to fail fast:
I chose to start them with a single digit \d, so that any line start or word boundary not followed by a digit is discarded immediately. Also, by consuming exactly one digit, I know how much of the string may remain.
Then I test the upper limit of the string length with a negative lookahead that checks whether there is one more character than the maximum length (if there are 12 characters after this position, the string has at least 13 characters). There is no need for anything more descriptive than the dot metacharacter here; the goal is to test the length quickly.
Finally, I describe the rest of the string without doing anything special. That is probably the slowest part of the pattern, but it doesn't matter, since the overwhelming majority of unnecessary positions have already been discarded.
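One way to convince yourself that the length trick is equivalent to the stated rules is to compare the pattern with a plain restatement of those rules. The sketch below does that in Python for the sample strings. follows_rules is a hypothetical helper written for this comparison, not part of any answer.

```python
import re

FAIL_FAST = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)

def follows_rules(s):
    """Plain restatement of the question's rules, without regex."""
    parts = s.split("-")
    if len(parts) != 3:                                       # exactly two hyphens
        return False
    if not all(p.isascii() and p.isdigit() for p in parts):   # digits only, no empty part
        return False
    if len(parts[2]) != 1:                                    # last number is one digit
        return False
    return sum(len(p) for p in parts) <= 10                   # at most 10 digits overall

samples = """22-22-1 22-22-22 333-333-1 333-4444-1 4444-4444-1 4444-55555-1
55555-4444-1 666666-7777777-1 88888888-88888888-1 1-1-1
88888888-88888888-22 22-333- 333-22""".split()

for s in samples:
    print(s, FAIL_FAST.fullmatch(s) is not None, follows_rules(s))
```

With two hyphens in place, a total of at most 12 characters is the same thing as at most 10 digits, which is why a single length check is enough.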

Regex word count - matching words with apostrophe

I'm trying to count words using Regex, with the following pattern:
@"\\w+"
This works; however, it matches "it's" as:
it
s
Is there a better way to match words that contain punctuation?
Also, words surrounded by punctuation, for example 'word', should also be matched (without the ').
One way to handle such cases is:
@"\\w+(?:'\\w+)?"
So it will match both its and it's, but only its in its'.
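A quick way to see this behaviour is to run the pattern over a short string. This is a minimal sketch in Python (an assumption; the question does not use Python, and the doubled backslashes of the original string literal become single ones in a raw string).

```python
import re

# A run of word characters, optionally followed by an apostrophe and more word characters.
WORD = re.compile(r"\w+(?:'\w+)?")

words = WORD.findall("it's its its' 'word'")
print(words)
print(len(words))
```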
I find this style readable. This version handles hyphenated words:
'?([a-zA-Z'-]+)'?
And this one is without hyphenation:
'?([a-zA-Z']+)'?
If you want quick and dirty regex testing with visual feedback, you can use one of the many online regex testing tools. I like rubular.com (even for non-Ruby regex testing).

Regexp matching a string pattern surrounded by capital letters

I would need one or more regular expressions to match some invalid urls of a website, that have uppercase letters before OR after a certain pattern.
These are the structure rules to match the invalid URLs:
a defined website
zero or more uppercase letters (at least one if there are none after the pattern)
a pattern
zero or more uppercase letters (at least one if there are none before the pattern)
To be explicit with examples:
http://website/uppeRcase/pattern/upperCase // match it, uppercase before and after pattern
http://otherweb/WhatevercAse/pattern/whatevercase // do not match, no website
http://website/lowercase/pattern/lowercase // do not match, no uppercase before or after pattern
http://website/lowercase/pattern/uppercasE // match it, uppercase after pattern
http://website/Uppercase/pattern/lowercase // match it, uppercase before pattern
http://website/WhatevercAse/asdasd/whatEveRcase // do not match it, no pattern
Thanks in advance for your help!
Mario
I'd advise against doing the two things you are describing with a regular expression in one step. Use a URL parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons. There can be some surprising stuff in the host portion of the URL that can throw you off. For instance, the hostname of
http://website@otherweb/uppeRcase/pattern/upperCase
is actually otherweb, and should be excluded, even though the URL begins with website. Similarly:
http://website/actual/path/component?uppeRcase/pattern/upperCase
should be excluded, even though the url has the pattern, surrounded by upper case path components, because the matching region is not part of the path.
http://website/uppe%52case/%70attern/upper%43ase
is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.
Once you've extracted and converted the escape sequences of just the path component, though, a regex is probably a great tool to use.
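The parse-first approach could look roughly like the following sketch. It uses Python's urllib.parse as the URL library (an assumption, since the question names no language) and treats "website" and "pattern" as literal placeholders taken from the examples. The function name is made up.

```python
import re
from urllib.parse import urlsplit, unquote

def is_invalid_url(url, site="website", pattern="pattern"):
    """Return True if the URL is on `site` and has uppercase before or after /pattern/."""
    parts = urlsplit(url)
    if parts.hostname != site:           # userinfo such as "website@" is not the host
        return False
    path = unquote(parts.path)           # decode %XX escapes; the query string is excluded
    before, sep, after = path.partition("/" + pattern + "/")
    if not sep:                          # the pattern is not in the path at all
        return False
    return bool(re.search(r"[A-Z]", before) or re.search(r"[A-Z]", after))

for url in [
    "http://website/uppeRcase/pattern/upperCase",
    "http://website@otherweb/uppeRcase/pattern/upperCase",
    "http://website/actual/path/component?uppeRcase/pattern/upperCase",
    "http://website/uppe%52case/%70attern/upper%43ase",
]:
    print(url, is_invalid_url(url))
```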
To match uppercase letters you simply need [A-Z]. Then build the rest of your rules around that. Without knowing exactly what you mean by "website" and "pattern", it is difficult to give better guidance.
This expression will match if uppercase characters are both between "website" and "pattern" as well as after "pattern"
^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$
This expression will match if there are uppercase characters before the pattern, after it, or both
^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*|.*[A-Z]+.*/pattern/.*|.*/pattern/.*[A-Z]+.*)$
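As a rough check, the second expression can be run against the six example URLs from the question. The sketch below does so with Python's re module (an assumption about the regex flavour) and works on the raw URL strings, so it does not handle the escaping or userinfo cases raised in the other answer.

```python
import re

EITHER_SIDE = re.compile(
    r"^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*"
    r"|.*[A-Z]+.*/pattern/.*"
    r"|.*/pattern/.*[A-Z]+.*)$"
)

examples = [
    "http://website/uppeRcase/pattern/upperCase",
    "http://otherweb/WhatevercAse/pattern/whatevercase",
    "http://website/lowercase/pattern/lowercase",
    "http://website/lowercase/pattern/uppercasE",
    "http://website/Uppercase/pattern/lowercase",
    "http://website/WhatevercAse/asdasd/whatEveRcase",
]

results = [EITHER_SIDE.match(url) is not None for url in examples]
for url, matched in zip(examples, results):
    print(url, matched)
```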
UPDATE:
To @TokenMacGuy's point, regex parsing of URLs can be very tricky. If you want to break a URL into parts and then validate them, you can start with this expression, which should match and group most* URLs.
(?<protocol>(http|ftp|https|ftps):\/\/)?(?<site>[\w\-_\.]+\.(?<tld>([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|arpa|asia|coop|info|jobs|mobi|museum|name|travel))+(?<port>:[0-9]+)?\/?)((?<resource>[\w\-\.,#^%:/~\+#]*[\w\-\#^%/~\+#])(?<queryString>(\?[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)+(&[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)*)?)?
*it worked in all my tests, but I can't claim I was exhaustive.