Regex deselect anything in the line before match - regex

I need to deselect anything in the line before a certain character. My example looks like this.
28.01.2023 11:00
GERMANY
OWVA2
07.01.2023 06:00
JAPAN
940-750
Laden
28.01.2023 12:00
FRANCE
ANDY
-
07.01.2023 09:30
NORWAY and SWEDEN
-
07.01.2023 09:30
SPAIN
I already have a Regex that selects anything in the line that follows a date (dd.MM.yyyy followed by HH:mm). Now I want to select what follows the date ONLY if the match isn't followed by - or NUMBER-NUMBER in the NEXT line. So basically deselect anything in the line before - or NUMBER-NUMBER. So in this case I want to select Germany, France and Spain. I don't want to select Japan and NORWAY and SWEDEN because they are followed by - or NUMBER-NUMBER.
My regex looks like this:
(?<=\d+.\d+.\d+ \d+:\d+[\r\n]+)([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})+(?![\r\n](\d+)?-(\d+)?)
Sorry for this weird example but that is the best I have. Thanks in advance

Your negative lookahead currently only stops the match from including the last character of the line: the engine backtracks by one character, and from there the match is no longer followed by a line break and NUMBER-NUMBER, so the lookahead succeeds.
You can fix that by placing the lookahead before the content you want to capture:
(?<=\d+.\d+.\d+ \d+:\d+[\r\n]+)(?!.*[\r\n](\d+)?-(\d+)?)([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})+
You can test it here.
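For anyone who wants to try the same idea in Python, here is a minimal sketch. Python's re module does not accept variable-length lookbehind, so this adaptation (my assumption, not the pattern above) consumes the date line and captures the following line in group 1. The negative lookahead still sits before the captured content, and I anchored it with $ so it only rejects a next line that is exactly - or NUMBER-NUMBER. The names SAMPLE, PATTERN and names_after_dates are made up for illustration.

```python
import re

SAMPLE = """\
28.01.2023 11:00
GERMANY
OWVA2
07.01.2023 06:00
JAPAN
940-750
Laden
28.01.2023 12:00
FRANCE
ANDY
-
07.01.2023 09:30
NORWAY and SWEDEN
-
07.01.2023 09:30
SPAIN"""

PATTERN = re.compile(
    r"^\d+\.\d+\.\d+ \d+:\d+\n"        # the date line (consumed instead of looked behind)
    r"(?!.*\n\d*-\d*$)"                 # reject if the line after the name is "-" or "N-N"
    r"([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})$",   # the name line, captured in group 1
    re.MULTILINE,
)

def names_after_dates(text):
    """Return the lines that follow a date and are not followed by - or N-N."""
    return PATTERN.findall(text)

print(names_after_dates(SAMPLE))  # ['GERMANY', 'FRANCE', 'SPAIN']
```

Because the lookahead is checked before any of the name is consumed, the engine cannot escape it by giving up the last character of the line.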

Related

Regex for putting comma before city name in address

Generally, addresses come comma-separated and can be split with a simple regex, e.g.
123 Main St, Los Angeles, CA, 90210
We can apply a regex here and split on the commas. But in my database, addresses are stored without commas, e.g.
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241
And I want to put comma before the city. Something like this:
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH ,CA 90803-4241
I was thinking about finding the last two-letter word from the end and putting a comma before it using a regex. But I also need to account for situations where the address is incomplete or the city and PIN code are missing. Is there a way this can be done? I only found solutions that split on commas, not the reverse.
I was thinking if we could select the last 2 words before numbers with something like [A-Za-z]{2} (don't know if this is correct). And at the same time if we can check to do this only if the string ends with numbers.
I tried
(\b(AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming)\b)
https://regex101.com/r/75fqO6/1
You can use
[a-zA-Z]+\s+\d(?:[\d-]*\d)?$
Replace with ,$0.
See the regex demo. Details:
[a-zA-Z]+ - one or more letters
\s+ - one or more whitespaces
\d - a digit
(?:[\d-]*\d)? - an optional substring of zero or more digits/hyphens and then a digit
$ - end of string.
The $0 in the replacement is a backreference to the whole match value, all text matched by the regex is put back where it was found with a prepended comma.
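As a quick check, the same replacement can be sketched in Python, assuming the re module, where the whole-match backreference is spelled \g<0> rather than $0. The function name add_comma_before_state is only for illustration.

```python
import re

# Same pattern as above: letters, whitespace, then a digit run that may contain hyphens
STATE_AND_ZIP = re.compile(r"[a-zA-Z]+\s+\d(?:[\d-]*\d)?$")

def add_comma_before_state(address):
    """Prepend a comma to the trailing 'STATE ZIP' part, if the string ends with one."""
    return STATE_AND_ZIP.sub(r",\g<0>", address)

addr = ("A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> "
        "STE 255<br/> LONG BEACH CA 90803-4241")
print(add_comma_before_state(addr))
```

A string that does not end with digits (an incomplete address, say) is returned unchanged, because the pattern is anchored at the end of the string.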

Using REGEX to remove duplicates when entire line is not a duplicate

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using a regex, but it requires the entire line to be a duplicate.
However, what would I use to detect and remove dups when the line as a whole is not a dup, only its first X characters?
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group for checking eg the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Here {n} specifies the number of characters from the line start that should be checked for duplicates.
The whole line is captured in $1, which is also used as the replacement.
The second group is used to check whether the following lines start with the same characters.
See this demo at regex101
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line, if captured $1 is ahead in the next line.
Here is the demo at regex101
For also removing duplicate lines, that contain up to 10 characters, a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.
Be aware, that all these patterns (and your current regex) just target consecutive duplicate lines.
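A minimal Python sketch of the first pattern, assuming the re module with the MULTILINE flag so that ^ matches at every line start. The helper name drop_prefix_duplicates and its parameter n are made up for illustration.

```python
import re

def drop_prefix_duplicates(text, n=10):
    """Collapse runs of consecutive lines sharing their first n characters,
    keeping the first line of each run."""
    # Group 1 = whole first line, group 2 = its first n characters
    pattern = re.compile(rf"^((.{{{n}}}).*)(?:\r?\n\2.*)+", re.MULTILINE)
    return pattern.sub(r"\1", text)

original = "\n".join([
    "12345 Dennis Yancey University of Miami",
    "12345 Dennis Yancey University of Milan",
    "12345 Dennis Yancey University of Rome",
    "12344 Ryan Gardner University of Spain",
    "12347 Smith John University of Canada",
])
print(drop_prefix_duplicates(original))
```

As noted above, only consecutive duplicates are collapsed. Lines that share a prefix but are separated by other lines are left alone.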

Regex to find capitalized words and code string?

I'm looking to change a simple dash to an em dash in some obituaries we receive. But it's only after the City of Death where that em dash should go.
The text looks like this:
#M_DeathNoticeHed:Alex <\n>Ornelas
#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday, Aug. <\n>16, 2016 at Alamo. Me<\h>morial Funeral Home of <\n>San Juan is in charge of ar<\h>rangements.
#M_DeathNoticeHed:Almaquire Cadena
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Almaquire <\n>Cadena , 87, died Tues<\h>day, Aug. 16, 2016 at Pax <\n>Villa Hospice, in McAllen, <\n>TX. Sanchez Funeral Home <\n> of Rio Grande City is in <\n>charge of arrangements.
#M_DeathNoticeHed:AnaRose <\n>Collazi
#M_DeathNoticeBod:MISSION <\!-> AnaRose <\n>Collazo , 44, died Wednes<\h>day, Aug. 17, 2016 at Mis<\h>sion Regional Medical Cen<\h>ter in Mission. Virgil Wilson <\n>Mortuary of Mission is in <\n>charge of arrangements.
#M_DeathNoticeHed:Andy Garza
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday, <\n>Aug. 16, 2016 at Chicago, <\n>IL. Rodriguez Funeral <\n>Home of Roma is in <\n>charge of arrangements.
Notice that after every "#M_DeathNoticeBod: CITY" is "<\!->" which symbolizes the dash that I need changed to an em dash.
My regex is not selecting the "<\!->" along with the preceding city and "#M_DeathNoticeBod:".
#M_DeathNoticeBod:([^A-Za-z]*?[A-Z][A-Za-z]*)([^A-Za-z]*?[A-Z][A-Za-z]*) [<\!->]
It is also not selecting cities with 3 names in it like "RIO GRANDE CITY". I'm selecting this because the dash appears in other spots in the file that I do not want replaced.
If I can select that section I can replace the dash here.
If the lines you care about always start with "#M_DeathNoticeBod:" followed by the city of death, followed by the "<\!->" you wish to replace, I think something simple would do the job:
(#M_DeathNoticeBod:.*)<\\!->
Capture group 1 will contain everything up until that first "<\!->", so if you're doing a search and replace you can just replace each occurrence of that regex with the contents of group 1 (usually denoted by '\1') followed by an em dash.
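Here is how that search and replace might look in Python, as a rough sketch. One deliberate difference, which is my assumption rather than part of the pattern above: I used a lazy .*? so that group 1 really stops at the first "<\!->" on the line, even if another one appears later. The names BODY_DASH and fix_body_dash are hypothetical.

```python
import re

EM_DASH = "\u2014"

# Lazy .*? stops at the first <\!-> after the body tag; \\ matches one literal backslash
BODY_DASH = re.compile(r"(#M_DeathNoticeBod:.*?)<\\!->")

def fix_body_dash(text):
    """Replace the <\\!-> that follows the city on each body line with an em dash."""
    return BODY_DASH.sub(lambda m: m.group(1) + EM_DASH, text)

notice = "\n".join([
    r"#M_DeathNoticeHed:Andy Garza",
    r"#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday",
])
print(fix_body_dash(notice))
```

Dashes elsewhere in the file are untouched, since a match must begin with "#M_DeathNoticeBod:".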
This regex should do:
#M_DeathNoticeBod:([A-Z ]*) (<\\!->)
I think this is what you are actually looking for:
(?<=#M_DeathNoticeBod:).+<\\!->
To explain: the first portion, (?<=#M_DeathNoticeBod:), is a positive lookbehind. It doesn't participate in the match but ensures that the trailing portion is always preceded by that expression.
The trailing portion, .+, matches any city name (any character sequence) up to your <\!-> delimiter, which is matched by the <\\!-> part of the regex.

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
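To see the pattern against the sample column, here is a small Python sketch (Python's re module rather than Alteryx, purely for illustration). The helper name suburb_only is made up.

```python
import re

# Letter-words from the start of the value, stopping before a state abbreviation
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

def suburb_only(value):
    """Return the leading suburb name, or an empty string if nothing matches."""
    m = SUBURB.match(value)
    return m.group(0) if m else ""

rows = ["CABRAMATTA", "CANLEY HEIGHTS", "ST JOHNS PARK",
        "Parramatta NSW 2150", "Claymore 2559", "CASULA"]
for row in rows:
    print(suburb_only(row))
```

The postcode drops out on its own, because [a-zA-Z]+ never matches digits.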
Expanding on Bohemian's answer, you can use groupings to do a REGEXP REPLACE in alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.

Regex find Proper Nouns or Phrases that are NOT first word in a sentence

I've found several questions that touch on this, but none that seem to answer it. I am trying to build a Regex that will allow me to identify Proper Nouns in a group of text.
I am defining a Proper Noun as follows: A word or group of words that begin with a capital letter, are longer than 1 character (to exclude things like I, A, etc), and are NOT the first word of a new sentence.
So, in the following text
"Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham"
I would want the following returned
Holiday Inn
Thursday
Tom
Shirley Temple
Green Eggs
Ham
Right now, [A-Z]{1,1}[a-z]*([\s][A-Z]{1,1}[a-z]*)* is what I have, but it's returning Susan Dow and She in addition to the ones listed above. How can I get my . look-up to work?
You can use:
(?<!^|\. |\.  )[A-Z][a-z]+
per this rubular
Update: Integrated the two negative looks using alternation. Also added check for two spaces between sentences. Note that repetition operators cannot be used in negative lookbehinds per notes in http://www.regular-expressions.info/lookaround.html
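For comparison, a rough Python sketch of the same idea. Python's re module rejects alternatives of different widths inside one lookbehind, so this adaptation (my assumption, not the pattern above) splits them into three separate negative lookbehinds, each of fixed width. The name CAPITALIZED is made up for illustration.

```python
import re

# Not at the start of the text, and not right after ". " or ".  " (period + 1 or 2 spaces)
CAPITALIZED = re.compile(r"(?<!^)(?<!\. )(?<!\. {2})[A-Z][a-z]+")

text = ("Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and "
        "Shirley Temple at the bar where they ordered Green Eggs and Ham")

words = CAPITALIZED.findall(text)
print(words)
# ['Dow', 'Holiday', 'Inn', 'Thursday', 'Tom', 'Shirley', 'Temple', 'Green', 'Eggs', 'Ham']
```

Note what this does and does not do: "Susan" and "She" are skipped as sentence starters, but "Dow" still matches because it is not the first word, and each word is returned on its own rather than grouped into phrases like "Holiday Inn".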