Regex To Find Names in All Caps - regex

I am trying to come up with a Regular Expression that I can use to find lines in a txt file that contain names in ALL CAPS using Notepad++ or similar tool. Once I find a line that matches I want to add three line breaks.
I have various conditions since the lines are names. Some of the names are only two characters. Some have hyphens. Some have multiple names. Some don't have spaces after their last name and comma. Here are some examples:
DOE, JOHN L
DOE-SMITH, JOHN L
DO, JO L
DOE, JOHN BOB L
DOE,JOHN L
I can run this in other programs as well. Just trying to figure this out so I can get it finished.
EDIT: I was using [A-Z]+, [A-Z]+ but it didn't select the whole line and it didn't account for spaces and hyphens.
ANSWER: The following regex met my needs:
^(?!.*[a-z])(?!.*[0-9]).+$
Part 2 ANSWER: I also made an adjustment in order to do the second part of my request which was to add three line breaks ahead of the matched item.
^((?!.*[a-z\d]).+)$
I also made sure Match Case was selected. It was using Regular Expression. and replaced with the following:
\n\n\n\1
Thanks Everyone!

Use a negative look ahead for a lowercase char:
^(?!.*[a-z]).+$
This matches "any line that doesn't contain a lowercase letter".
To also disallow numbers:
^(?!.*[a-z\d]).+$

Use Extended Regular Expressions with POSIX Character Classes
This will work for your provided corpus using GNU grep. Adapt to suit any changes to your data.
$ grep \
--extended-regexp \
--only-matching \
--regexp='[[:upper:]-]+, ?[[:upper:]]+' \
/tmp/corpus
DOE, JOHN
DOE-SMITH, JOHN
DO, JO
DOE, JOHN
DOE,JOHN
Adding Newline Characters with GNU Sed
You can perform this operation with the append operation in GNU sed. For example:
$ sed \
--regexp-extended '/[[:upper:]-]+, ?[[:upper:]]+/a\\n\n\n' \
/tmp/corpus
DOE, JOHN L
DOE-SMITH, JOHN L
DO, JO L
DOE, JOHN BOB L
DOE,JOHN L

Related

Remove artificially generated words from a word list in Linux?

Hello everyone and good day!
I have the following question: I have a word list that consists of normal words as well as artificially generated words.
example:
Ford
09mKGmaePnCmjkxm
Opel
0AACyvG0FtRHAU7i
Audi
0AR6V7cCy2phgXcv
BMW
0bDOlBY5VGAe5Vai
Alfa-Romeo
Mercedes
Pegout-323
0BDTwSCCrCy4VgEc
0cmolI8g4CerXKaH
0dL2m36014PmOetH
0dqjCZU7ZeRuovFF
0ekelbAnWcGC1c7n
Lada 2109
Lada 2106
0ER4tS8jhESXuISp
0Gao8qHgbEyZ06Bh
0j1pjZBAW2avxU6Z
0j5zBVhdPDyaVoZL
Toyouta
0Jn0qoKdnM6neGdx
0KlzXttiw81AvU2C
0kXzuEtHxiWfECw7
mitsubisi
0l8qW9Uv0V1DZPei
0LJQxUNuEp42txme
jeep
0m8G1GUytcETbtWv
0MexVW3TQ2sRqLjr
I want to remove all artificially generated words from this list.
I have converted such words to REGEX and saved them in a new file "Generic.txt":
[0-9][0-9][a-z][A-Z][A-Z][a-z][a-z][a-z][A-Z][a-z][A-Z][a-z][a-z][a-z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][a-z][A-Z][0-9][A-Z][a-z][A-Z][A-Z][A-Z][A-Z][0-9][a-z]
[0-9][A-Z][A-Z][0-9][A-Z][0-9][a-z][A-Z][a-z][0-9][a-z][a-z][a-z][A-Z][a-z][a-z]
[0-9][a-z][A-Z][A-Z][a-z][A-Z][A-Z][0-9][A-Z][A-Z][A-Z][a-z][0-9][A-Z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][A-Z][a-z][A-Z][a-z][0-9][A-Z][a-z][A-Z][a-z]
[0-9][a-z][a-z][a-z][a-z][A-Z][0-9][a-z][0-9][A-Z][a-z][a-z][A-Z][A-Z][a-z][A-Z]
[0-9][a-z][A-Z][0-9][a-z][0-9][0-9][0-9][0-9][0-9][A-Z][a-z][A-Z][a-z][a-z][A-Z]
[0-9][a-z][a-z][a-z][A-Z][A-Z][A-Z][0-9][A-Z][a-z][A-Z][a-z][a-z][a-z][A-Z][A-Z]
[0-9][a-z][a-z][a-z][a-z][a-z][A-Z][a-z][A-Z][a-z][A-Z][A-Z][0-9][a-z][0-9][a-z]
[0-9][A-Z][A-Z][0-9][a-z][A-Z][0-9][a-z][a-z][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][a-z]
[0-9][A-Z][a-z][a-z][0-9][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][0-9][0-9][A-Z][a-z]
[0-9][a-z][0-9][a-z][a-z][A-Z][A-Z][A-Z][A-Z][0-9][a-z][a-z][a-z][A-Z][0-9][A-Z]
[0-9][a-z][0-9][a-z][A-Z][A-Z][a-z][a-z][A-Z][A-Z][a-z][a-z][A-Z][a-z][A-Z][A-Z]
[0-9][A-Z][a-z][0-9][a-z][a-z][A-Z][a-z][a-z][A-Z][0-9][a-z][a-z][A-Z][a-z][a-z]
[0-9][A-Z][a-z][a-z][A-Z][a-z][a-z][a-z][a-z][0-9][0-9][A-Z][a-z][A-Z][0-9][A-Z]
[0-9][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][A-Z][a-z][0-9]
[0-9][a-z][0-9][a-z][A-Z][0-9][A-Z][a-z][0-9][A-Z][0-9][A-Z][A-Z][A-Z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][a-z][A-Z][a-z][0-9][0-9][a-z][a-z][a-z][a-z]
[0-9][a-z][0-9][A-Z][0-9][A-Z][A-Z][a-z][a-z][a-z][A-Z][A-Z][a-z][a-z][A-Z][a-z]
[0-9][A-Z][a-z][a-z][A-Z][A-Z][0-9][A-Z][A-Z][0-9][a-z][A-Z][a-z][A-Z][a-z][a-z]
Now I want to delete from the word list "base.txt" all words that match this regex. They can also be larger than 16 characters!
I use the following command:
LC_ALL=C grep -F -f generic.txt base.txt > test.txt
Unfortunately I get no results, but also no error messages. What am I doing wrong?
Basically I want grep to check the file "base.txt" for every line from the file "generic.txt" and extract these lines into a new file.
The following list should remain at the end:
Ford
Opel
Audi
BMW
Alfa-Romeo
Mercedes
Pegout-323
Lada 2109
Lada 2106
Toyouta
mitsubisi
jeep
TIA
Sergio
The immediate error is that the -F option disables regular expressions entirely, and requires the text to match the pattern literally. (So for example [0-9] matches the literal string [0-9] and no other strings.)
Probably a better approach entirely is to try to generalize this absurd list of patterns to a single pattern, or a very small list of patterns. How did you come up with this list?
For example
grep -E '^[A-Za-z0-9]{16}$' base.txt
seems to extract only the (apparent) generated patterns in your example.
Problem is the definition of a "word", meaning why should Ford be a valid word while e.g. F0rd is not? That said, for your given list, you could use
^[a-zA-Z]+(?:[- ]\w+)?$
See a demo on regex101.com.
Another solution would be to emphasize that a word cannot start with a digit, thus anything that starts with a digit does not contain valid words:
^[0-9].{15}$(*SKIP)(*FAIL)|^.+
See another demo for this one on regex101.com.

Using REGEX to remove duplicates when entire line is not a duplicate

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using REGEX
but it requires the entire line to be a duplicate
However – what would I use if I want to detect and remove dups – when the entire line s a whole is not a dup – but just the first X characters
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group for checking eg the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Where {n} specifies the amount of the characters from linestart that should be dupe checked.
the whole line is captured to $1 which is also used as replacement
the second group is used to check for duplicate line starts with
See this demo at regex101
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line, if captured $1 is ahead in the next line.
Here is the demo at regex101
For also removing duplicate lines, that contain up to 10 characters, a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.
Be aware, that all these patterns (and your current regex) just target consecutive duplicate lines.

How to use zgrep to display all words of a x size from a wordlist?

I want to display all the words from my wordlist who start with a w and are 9 letters long. Yesterday I learnt a bit more on how to use zgrep so I came with :
zgrep '\(^w\)\(^.........$\)' a.gz
But this doesn't work and I think it's because I don't know how to do a AND between the two conditions. I found that it should be (?=expr)(?=expr) but I can't figure out how to build my command then
So how can I build my command using the (?=expr) ?
for example if I have a wordlist like this:
Washington
Sausage
Walalalalalaaaa --> shouldn't match
Wwwwwwwww --> should match
You may use
zgrep '^w[[:alpha:]]\{8\}$' a.gz
The POSIX BRE pattern will match a string that
^w - starts with w
[[:alpha:]]\{8\} - then has eight letters
$ - followed with with the end of string marker.
Also, see the 9.3 Basic Regular Expressions.

Convert MS Outlook formatted email addresses to names of attendees using RegEx

I'm trying to use Notepadd ++ to find and replace regex to extract names from MS Outlook formatted meeting attendee details.
I copy and pasted the attendee details and got names like.
Fred Jones <Fred.Jones#example.org.au>; Bob Smith <Bob.Smith#example.org.au>; Jill Hartmann <Jill.Hartmann#example.org.au>;
I'm trying to wind up with
Fred Jones; Bob Smith; Jill Hartmann;
I've tried a number of permutations of
\B<.*>; \B
on Regex 101.
Regex is greedy, <.*> matches from the first < to the last > in one fell swoop. You want to say "any character which is neither of these" instead of just "any character".
*<[^<>]*>
The single space and asterisk before the main expression consumes any spaces before the match. Replace these matches with nothing and you will be left with just the names, like in your example.
This is a very common FAQ.

Regex to match a few possible strings with possible leading and/or trailing spaces

Let's say I have a string:
John Smith (auth.), Mary Smith, Richard Smith (eds.), Richie Jack (ed.), Jack Johnny (eds.)
I would like to match:
John Smith(auth.),Mary Smith,Richard Smith(eds.),Richie Jack(ed.),Jack Johnny(eds.)
I have came up with a regex but I have a problem with the | (or character) because my string contains characters that have to be escaped like ().. This is what I'm not able deal with. My regex is:
\s+\((auth\.\)|\(eds\.\))?,\s+
EDIT: I think now that the most universal solution would be to assume that in () could be anything.
Try this:
\s*\((auth|eds?)?\.\)?,?\s*
\s+ means one or more
\s* means zero or more
Based on your comment, I modified the regex:
\s*((\([^)]*\))|,)\s*