Using a Regex Pattern that Finds Abbreviations - regex

I am looking through volumes of data and need to identify certain patterns, one of which is abbreviations. The basic rules for identifying them in the content I am going through are:
They are all in capital letters.
They are separated by dots.
They may consist of one or more letters.
They may or may not end with a dot.
I am looking at individual words, so looking for multiple occurrences in a string is not required.
Examples
U.S., U.S, U.S.S.R., V.
Can someone help construct a regex search pattern for me?
Many thanks
MS

You can use this regex:
^([A-Z]\.)*[A-Z]\.?$
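As a quick check of the anchored pattern above, here is a minimal Python sketch that runs it against the examples from the question. The helper name is_abbreviation and the extra negative samples are assumptions added for illustration.

```python
import re

# Pattern from the answer above: dot-separated capitals with an optional final dot.
ABBREVIATION = re.compile(r"^([A-Z]\.)*[A-Z]\.?$")

def is_abbreviation(word: str) -> bool:
    """Return True if the whole word looks like an abbreviation."""
    return ABBREVIATION.match(word) is not None

for word in ["U.S.", "U.S", "U.S.S.R.", "V.", "US", "u.s.", "Hello"]:
    print(f"{word!r}: {is_abbreviation(word)}")
```

Because the dotted group may repeat zero times, a bare single capital such as V is accepted as well.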

This should do the trick:
\b(?:\p{Lu}\.)*\p{Lu}\b\.?
I've used \p{Lu} (Unicode uppercase letters) since you want to match letters from any alphabet.
If you can't make \b Unicode-aware in your dialect, here's an alternative:
(?<!\p{L})(?:\p{Lu}\.)*\p{Lu}(?!\p{L})\.?
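Python's built-in re module does not understand \p{Lu}, so the sketch below swaps the regex for a plain whole-word check based on unicodedata. It is meant to accept the same words as an anchored form of the pattern above. The function name is an assumption made for illustration.

```python
import unicodedata

def is_unicode_abbreviation(word: str) -> bool:
    # Drop at most one trailing dot, mirroring the optional final dot in the pattern.
    if word.endswith("."):
        word = word[:-1]
    # Every dot-separated piece must be exactly one uppercase letter (category Lu).
    pieces = word.split(".")
    return all(len(p) == 1 and unicodedata.category(p) == "Lu" for p in pieces)

for word in ["U.S.", "Э.Я.", "Ä", "u.s.", "U..S"]:
    print(f"{word!r}: {is_unicode_abbreviation(word)}")
```

If a regex is preferred, the third-party regex package accepts \p{Lu} directly.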

This will work. It also matches the trailing dot.
\b([A-Z]\.)*[A-Z]\b\.?
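Since this version uses \b instead of ^ and $, it can also scan running text. The sketch below assumes a non-capturing group, so that findall returns whole matches. The function name and the sample sentence are made up for illustration.

```python
import re

# Same pattern as above, but with a non-capturing group so findall yields full matches.
ABBREVIATION = re.compile(r"\b(?:[A-Z]\.)*[A-Z]\b\.?")

def find_abbreviations(text: str) -> list[str]:
    return ABBREVIATION.findall(text)

print(find_abbreviations("He moved from the U.S.S.R. to the U.S. in 1990"))
# ['U.S.S.R.', 'U.S.']
```

One thing to watch for: a standalone capital such as the pronoun I also matches, so some filtering may be needed on ordinary prose.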

Related

How to find capital letter using RegExp?

I need a simple solution. I have a text that is improperly punctuated and in many places a comma is followed by a capital letter. Example: Here you are, You sicko. A comma followed by a cap. Any string to find these? ,\w doesn't work. I only want caps.
I only know basic regex. I'll use it to search in Notepad++
Thank you.
Try this one:
, [A-Z]
In the general case, for any punctuation:
[.,!?\\-]+ [A-Z]+
Link: https://regex101.com/r/BrGZmF/1
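The question is about Notepad++, but the same idea can be tried in a script. The Python sketch below finds each comma followed by a capital and, as an optional extra step, lowercases that capital. The function names are assumptions for illustration.

```python
import re

# A comma, a space, then one capital letter (captured so it can be rewritten).
COMMA_CAPITAL = re.compile(r", ([A-Z])")

def find_comma_capitals(text: str) -> list[str]:
    return [m.group(0) for m in COMMA_CAPITAL.finditer(text)]

def lowercase_after_comma(text: str) -> str:
    return COMMA_CAPITAL.sub(lambda m: ", " + m.group(1).lower(), text)

sample = "Here you are, You sicko."
print(find_comma_capitals(sample))    # [', Y']
print(lowercase_after_comma(sample))  # Here you are, you sicko.
```

A blanket replacement like this would also lowercase proper nouns after a comma, so reviewing the matches first is probably safer.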

Finding all possible Acronyms

I have created a script using VBA that goes through a Word document to find all words that could possibly be acronyms, but I found that my regEx pattern does not find all of them.
The regEx pattern I am using is "([A-Z]{2,})(-([A-Z]{2,})[A-Za-z0-9])"
With this pattern I am able to find
AA
AAA
AA-BB
AA-BBB
AAA-BB
AAA-BBB
AAA-1234
AAA-BBB-1234
but it does not find these words
B2B
B2B-1234
B2B-A1A-1234
The expectation of the word match is that the first character is a letter and the word must contain at least two uppercase letters and at least one number. In addition, if there are dashes in the word, then the characters before the dash must match the expectation of the word match.
Is there a way to extend the regEx pattern above to also include the letter-digit-letter acronyms?
Milco, welcome to StackOverflow. I think that the following regex will work for you:
([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*
This regex accommodates digits and an optional number of hyphenated terms and matches each of your cases above. I tested it out at regextesteronline.com - I'm assuming that VB.net regexes are the same as VBA, which they should be, at least for basic regexes.
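To see how the suggested pattern behaves on the words listed in the question, here is a small Python sketch. Python stands in for VBA here, on the assumption that this basic syntax behaves the same in both. The helper name is_acronym and the rejected samples are made up for illustration.

```python
import re

# Pattern from the answer: a leading capital, more capitals or digits,
# then any number of hyphenated groups of at least two characters.
ACRONYM = re.compile(r"([A-Z][A-Z0-9]+)(-[A-Z0-9]{2,})*")

def is_acronym(word: str) -> bool:
    # fullmatch requires the entire word to fit the pattern.
    return ACRONYM.fullmatch(word) is not None

samples = ["AA", "AAA-BBB-1234", "B2B", "B2B-1234", "B2B-A1A-1234", "Word", "A", "1AB"]
for word in samples:
    print(f"{word}: {is_acronym(word)}")
```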

Regex for extracting each word between hyphens

I am learning regex and trying to write a pattern that exactly matches each of the strings without '-', so that I can iterate over the groups and print the respective strings.
I have a string that looks like "Abcd001-wd2s-vwe1-20180e3103.txt"
I was able to write a regex for extracting Abcd001, wd2s and .txt from above text as shown below
(\A[^-]+)=> Abcd001
(-[^-]+-)=> wd2s
(\..*)=>.txt
However, I was unable to come up with the correct pattern for extracting the exact strings vwe1 and 20180e3103
It would be really helpful if you could guide me on this, or tell me if there is a better approach to achieve this.
Please note: [^-.]+ may give me all the words separately, but I am looking for an option where I have a group defined for each of these strings so that it's a one-to-one mapping.
Thanks!
To get vwe1 or 20180e3103 from the example data, you might use a quantifier {2} or {3} to repeat matching one or more word characters followed by a hyphen: (?:\w+-){2}.
Then you could capture in a group ([^-.]+) matching not a hyphen or a dot.
(?:\w+-){2}([^-.]+)
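A minimal Python sketch of the quantifier approach described above, applied to the sample file name. The helper nth_segment and the combined ALL_PARTS pattern are assumptions for illustration. The latter is one possible way to get a separate group for each piece in a single match.

```python
import re

name = "Abcd001-wd2s-vwe1-20180e3103.txt"

def nth_segment(text: str, skip: int):
    # Skip `skip` hyphen-terminated runs of word characters, then capture the next piece.
    pattern = r"(?:\w+-){" + str(skip) + r"}([^-.]+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(nth_segment(name, 2))  # vwe1
print(nth_segment(name, 3))  # 20180e3103

# Hypothetical single pattern with one group per piece plus the extension.
ALL_PARTS = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-([^-.]+)(\..*)$")
print(ALL_PARTS.match(name).groups())
```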
Try the below regex
/\-([^\)]+)\-/gmi;
Also check the similar implementation:
https://stackoverflow.com/a/50336050/8179245

Regular Expression to match two words near each other on a single line

Hi I am trying to construct a regular expression (PCRE) that is able to find two words near each other but which occur on the same line. The near examples generally provided are insufficient for my requirements as the "\W" obviously includes new lines. I have spent quite a bit of time trying to find an answer to this and have thus far been unsuccessful. To exemplify what I have so far, please see below:
(?i)(?:\b(tree)\b)\W+(?:\w+\W+){0,5}?\b(house)\b.*
I want this to match on:
here is a tree with a house
But not match on
here is a tree
with a house
Any help would be greatly appreciated!
How about
\btree\b[^\n]+\bhouse\b
Just add a negative lookahead so that \W matches any non-word character except a newline character.
(?i)(?:\b(tree)\b)(?:(?!\n)\W)+(?:\w+\W+){0,5}?\b(house)\b.*
Dot matches anything except newlines, so just:
(?i)\btree\b.{1,5}\bhouse\b
Note it is impossible for there to be zero characters between the two words, because then they wouldn't be two words - they would be the one word and the \b wouldn't match.
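One thing worth checking with this shorter pattern is that .{1,5} counts characters rather than words. The Python sketch below tries it on a few strings. The wider bound of 40 in the second pattern is an arbitrary assumption, not something from the answer.

```python
import re

# The pattern as given above: between 1 and 5 arbitrary non-newline characters.
NEAR_CHARS = re.compile(r"(?i)\btree\b.{1,5}\bhouse\b")
# Hypothetical wider bound, chosen arbitrarily for illustration.
NEAR_CHARS_WIDE = re.compile(r"(?i)\btree\b.{1,40}\bhouse\b")

def found(pattern: re.Pattern, text: str) -> bool:
    return pattern.search(text) is not None

print(found(NEAR_CHARS, "a tree house"))                      # True
print(found(NEAR_CHARS, "here is a tree with a house"))       # False: 8 characters in between
print(found(NEAR_CHARS_WIDE, "here is a tree with a house"))  # True
print(found(NEAR_CHARS_WIDE, "here is a tree\nwith a house")) # False: dot stops at the newline
```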
Just replace \W with [^\w\r\n] in your regex:
(?i)(?:\b(tree)\b)[^\w\r\n]+(?:\w+[^\w\r\n]+){0,5}?\b(house)\b.*
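To compare the original \W version with the [^\w\r\n] replacement, here is a small Python sketch that runs both against the two inputs from the question. The names of the compiled patterns and the helper are assumptions for illustration.

```python
import re

# Original pattern from the question: \W also matches newlines.
ORIGINAL = re.compile(r"(?i)(?:\b(tree)\b)\W+(?:\w+\W+){0,5}?\b(house)\b.*")
# Replacement from the answer: non-word characters excluding CR and LF.
SAME_LINE = re.compile(r"(?i)(?:\b(tree)\b)[^\w\r\n]+(?:\w+[^\w\r\n]+){0,5}?\b(house)\b.*")

def found(pattern: re.Pattern, text: str) -> bool:
    return pattern.search(text) is not None

one_line = "here is a tree with a house"
two_lines = "here is a tree\nwith a house"

print(found(ORIGINAL, two_lines))   # True: the unwanted cross-line match
print(found(SAME_LINE, one_line))   # True
print(found(SAME_LINE, two_lines))  # False
```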
To get the closest matches of both words on the same line, an option is to use a negative lookahead:
(?i)(\btree\b)(?>(?!(?1)).)*?\bhouse\b
By default, the dot does not match a newline (it does only with the s (DOTALL) modifier).
(?>(?!(?1)).)*? matches as few characters as possible, none of which may begin another match of \btree\b.
(?1) reuses the pattern of the first capturing group.
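Python's built-in re module has no (?1) subroutine call, so the sketch below spells the sub-pattern out again and uses a plain non-capturing group in place of the atomic group. It is an adaptation of the PCRE pattern above, not the same expression. The function name is made up for illustration.

```python
import re

# Each consumed character must not begin another whole-word "tree",
# so the match always starts at the tree closest to "house".
CLOSEST = re.compile(r"\btree\b(?:(?!\btree\b).)*?\bhouse\b", re.IGNORECASE)

def closest_pair(text: str):
    match = CLOSEST.search(text)
    return match.group(0) if match else None

print(closest_pair("tree one tree two house"))  # tree two house
print(closest_pair("a tree\nwith a house"))     # None
```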
Maybe this helps, found here https://www.regular-expressions.info/near.html
\bword1\W+(?:\w+\W+){1,6}?word2\b

Here a word is a string of letters, preceded and followed by nonletters

I asked this question earlier, but none of the responses solved the problem. Here is the full question:
Give a single UNIX pipeline that will create a file file1 containing all the words in file2, one word per line. Here a word is a string of letters, preceded and followed by nonletters.
I tried every single example that was given below, but I get syntax errors when using them.
Does anyone know how I can solve this?
Thanks
If your regex flavor supports it, you can use lookarounds:
(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])
(?<!..): not preceded by
(?!..): not followed by
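A short Python sketch of the lookaround pattern above, using the re module as one flavor that supports both lookbehind and lookahead. The function name and sample string are assumptions for illustration.

```python
import re

# A run of letters that is neither preceded nor followed by another letter.
WORD = re.compile(r"(?<![a-zA-Z])[a-zA-Z]+(?![a-zA-Z])")

def words(text: str) -> list[str]:
    return WORD.findall(text)

print(words("1example2 foo-bar, baz"))  # ['example', 'foo', 'bar', 'baz']
```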
If it is not the case you can use capturing groups and negated character classes:
(^|[^a-zA-Z])([a-zA-Z]+)($|[^a-zA-Z])
where the result is in group 2
^|[^a-zA-Z]: start of the string or a non-letter character (any character except a letter)
$: end of the string
or the same with one capturing group and two non capturing groups:
(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])
(result in group 1)
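One caveat with the capturing-group variants is that the trailing delimiter is consumed by the match. When scanning for all words, a word separated from the previous one by a single character can therefore be skipped. The Python sketch below shows this, together with a possible workaround that turns only the trailing group into a lookahead. The workaround is an assumption added here, not part of the answer.

```python
import re

# Pattern from the answer: both delimiters are consumed by the match.
CONSUMING = re.compile(r"(?:^|[^a-zA-Z])([a-zA-Z]+)(?:$|[^a-zA-Z])")
# Hypothetical variant: the trailing delimiter is only looked at, not consumed.
LOOKAHEAD = re.compile(r"(?:^|[^a-zA-Z])([a-zA-Z]+)(?=$|[^a-zA-Z])")

text = "foo bar baz"
print(CONSUMING.findall(text))  # ['foo', 'baz']
print(LOOKAHEAD.findall(text))  # ['foo', 'bar', 'baz']
```

With the consuming form, the space after foo is used up, so bar no longer has a leading non-letter available when the scan resumes.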
In order to be Unicode-compatible, you could use:
(?:^|\PL)\pL+(?:\PL|$)
\pL stands for any letter in any language
\PL is the opposite of \pL
When your objective is to actually find words, the most natural way would be
\b[A-Za-z]+\b
However, this assumes normal word boundaries, like whitespaces, certain punctuations or terminal positions. Your requirement suggests you want to count things like the "example" in "1example2".
In that case, I would suggest using
[A-Za-z]+
Note that you don't actually need to look at what precedes or follows the letters. This already captures all letters and only letters. The greedy quantifier (+) ensures that nothing is left out of a capture.
Lookarounds etc should not be necessary because what you want to capture and what you want to exclude are exact inverses of each other.
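The question asks for a UNIX pipeline, which this is not. As a way to see what the plain [A-Za-z]+ pattern produces, here is a Python sketch that reads one file and writes one word per line to another. The function names are assumptions, and the file names in the usage note are taken from the question.

```python
import re
from pathlib import Path

def extract_words(text: str) -> list[str]:
    # Every maximal run of ASCII letters counts as one word.
    return re.findall(r"[A-Za-z]+", text)

def write_words(src: Path, dst: Path) -> None:
    # One word per line in the destination file.
    found = extract_words(src.read_text())
    dst.write_text("".join(word + "\n" for word in found))

print(extract_words("1example2 foo-bar"))  # ['example', 'foo', 'bar']
```

A call such as write_words(Path("file2"), Path("file1")) would then produce the file described in the question.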
[Edit: Given the new information in comments]
The methods below are similar to Casimir's, except that we exclude words at terminals (which we were explicitly trying to capture, because of your original description).
Lookarounds
(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])
Note that this uses positive lookarounds with negated character classes, not negative lookarounds, as those would also match at the string terminals (which are, to the regex engine as much as to me, non-alphabets).
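The difference between the two kinds of lookaround at the string terminals can be checked with a small Python sketch. The pattern names and the sample string are assumptions for illustration.

```python
import re

# Positive lookarounds on a negated class: a non-letter must actually be present.
INNER_ONLY = re.compile(r"(?<=[^A-Za-z])[A-Za-z]+(?=[^A-Za-z])")
# Negative lookarounds: also satisfied at the start and end of the string.
WITH_TERMINALS = re.compile(r"(?<![A-Za-z])[A-Za-z]+(?![A-Za-z])")

text = "foo 1bar2 baz"
print(INNER_ONLY.findall(text))      # ['bar']
print(WITH_TERMINALS.findall(text))  # ['foo', 'bar', 'baz']
```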
If lookarounds don't work for you, you'd need capturing groups.
Search as below, then take the first captured group.
[^A-Za-z]([A-Za-z]+)[^A-Za-z]
When talking about regex, you need to be extremely specific and accurate in your requirements.