Using REGEX to remove duplicates when entire line is not a duplicate - regex

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using REGEX
but it requires the entire line to be a duplicate
However – what would I use if I want to detect and remove dups – when the entire line s a whole is not a dup – but just the first X characters
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada

How about using a second group for checking eg the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Where {n} specifies the amount of the characters from linestart that should be dupe checked.
the whole line is captured to $1 which is also used as replacement
the second group is used to check for duplicate line starts with
See this demo at regex101
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line, if captured $1 is ahead in the next line.
Here is the demo at regex101
For also removing duplicate lines, that contain up to 10 characters, a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.
Be aware, that all these patterns (and your current regex) just target consecutive duplicate lines.

Related

Match first and then all equal occurrences with regex

Lets say we have the string:
one day, when Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.
is there a way with regex to match the first name, and then match and all other occurrences of the same name in the string?
Given the name-matching regex /[A-Z][a-z]+ (with /g maybe?), can the regex matcher be made to remember the first match, and then use that match EXACTLY for the rest of the string? Other subsequent matches to the name-matching regex should be ignored (except for Anne in the example).
The result would be (if matches are replaced with "Foo"):
one day, when Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.
Please ignore the fact that the sentence start uncapitalized, or add an example that also handles this.
Using a script to get the first match and then using that as input for a second iteration works of course, but that's outside the scope of the question (which is limited to ONE regex expression).
The only way I could think of is with non-fixed width lookbehinds. For example through Pypi's regex module, and maybe Javascript too? Either way, assuming a name is capture through [A-Z][a-z]+ as per your question try:
\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)
See an online demo
\b([A-Z][a-z]+)\b - A 1st capture group capturing a name between two word-boundaries;
(?<=^[^A-Z]*\b\1\b.*) - A non-fixed width positive lookbehind to match start of line anchor followed by 0+ characters other than uppercase followed by the content of the 1st capture group and 0+ characters.
Here is a PyPi's example:
import regex as re
s= 'Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.'
s_new = re.sub(r'\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)', 'Foo', s)
print(s_new)
Prints:
Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.

Convert MS Outlook formatted email addresses to names of attendees using RegEx

I'm trying to use Notepadd ++ to find and replace regex to extract names from MS Outlook formatted meeting attendee details.
I copy and pasted the attendee details and got names like.
Fred Jones <Fred.Jones#example.org.au>; Bob Smith <Bob.Smith#example.org.au>; Jill Hartmann <Jill.Hartmann#example.org.au>;
I'm trying to wind up with
Fred Jones; Bob Smith; Jill Hartmann;
I've tried a number of permutations of
\B<.*>; \B
on Regex 101.
Regex is greedy, <.*> matches from the first < to the last > in one fell swoop. You want to say "any character which is neither of these" instead of just "any character".
*<[^<>]*>
The single space and asterisk before the main expression consumes any spaces before the match. Replace these matches with nothing and you will be left with just the names, like in your example.
This is a very common FAQ.

Regex to match a few possible strings with possible leading and/or trailing spaces

Let's say I have a string:
John Smith (auth.), Mary Smith, Richard Smith (eds.), Richie Jack (ed.), Jack Johnny (eds.)
I would like to match:
John Smith(auth.),Mary Smith,Richard Smith(eds.),Richie Jack(ed.),Jack Johnny(eds.)
I have came up with a regex but I have a problem with the | (or character) because my string contains characters that have to be escaped like ().. This is what I'm not able deal with. My regex is:
\s+\((auth\.\)|\(eds\.\))?,\s+
EDIT: I think now that the most universal solution would be to assume that in () could be anything.
Try this:
\s*\((auth|eds?)?\.\)?,?\s*
\s+ means one or more
\s* means zero or more
Based on your comment, I modified the regex:
\s*((\([^)]*\))|,)\s*

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
Expanding on Bohemian's answer, you can use groupings to do a REGEXP REPLACE in alteryx. So:
REGEX_Replace([Field1], "(.*)(\VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.
I think this workflow will help you :

Regex: Match all, but ignore a specific word

Sample 1 String:
Aquaman Figure, XL DC Comics
Sample 2 String:
Rocket Raccoon, Mini Marvel
Regex:
/(DC Comics|Marvel)/
Match Sample 1:
DC Comics
Match Sample 2:
Marvel
Works perfectly in Regex101
How do I reverse this?
I want to match Aquaman Figure, XL and Rocket Raccoon, Mini only.
Edit:
/(.+)(?=Marvel)/ seems to do the job. It excludes Marvel from Rocket Raccon! How do I make this also work with DC comics?
/(.+)(?=Marvel)/ (or /(.+)(?=DC Comics|Marvel)/ for both) isn't going to work for something like:
John Marvel Bob
For which I presume you want the result to be:
John Bob
You'll only get John in the first match, and you'll get Marvel Bob in the second match (since look-ahead doesn't consume the looked-ahead characters).
Or something that doesn't contain either of the strings (since you require that the next characters match some given characters to get a match).
The easiest solution is probably just replacing the two desired sub-strings with empty strings. Replace:
DC Comics|Marvel
with:
(empty string)
Or you can repeatedly search for:
/(.*?)(DC Comics|Marvel|$)/
And just extract the first group (which will correspond to what matches .*, which is everything starting from the end of the last match up to just before "DC Comics", "Marvel" or the end of the string).
The reluctant quantifier ? is needed to prevent the .* from matching John Marvel Bob, rather than just John in John Marvel Bob Marvel.
re.findall(r"(.*)(?=Marvel|Comics)",input)
This does exactly what you are looking for.Its in python.input will be your string.