Regex to find capitalized words and code string?

I'm looking to change a simple dash to an em dash in some obituaries we receive. But it's only after the City of Death where that em dash should go.
The text looks like this:
#M_DeathNoticeHed:Alex <\n>Ornelas
#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday, Aug. <\n>16, 2016 at Alamo. Me<\h>morial Funeral Home of <\n>San Juan is in charge of ar<\h>rangements.
#M_DeathNoticeHed:Almaquire Cadena
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Almaquire <\n>Cadena , 87, died Tues<\h>day, Aug. 16, 2016 at Pax <\n>Villa Hospice, in McAllen, <\n>TX. Sanchez Funeral Home <\n> of Rio Grande City is in <\n>charge of arrangements.
#M_DeathNoticeHed:AnaRose <\n>Collazi
#M_DeathNoticeBod:MISSION <\!-> AnaRose <\n>Collazo , 44, died Wednes<\h>day, Aug. 17, 2016 at Mis<\h>sion Regional Medical Cen<\h>ter in Mission. Virgil Wilson <\n>Mortuary of Mission is in <\n>charge of arrangements.
#M_DeathNoticeHed:Andy Garza
#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday, <\n>Aug. 16, 2016 at Chicago, <\n>IL. Rodriguez Funeral <\n>Home of Roma is in <\n>charge of arrangements.
Notice that after every "#M_DeathNoticeBod: CITY" is "<\!->" which symbolizes the dash that I need changed to an em dash.
My regex is not selecting the "<\!->" along with the preceding city and "#M_DeathNoticeBod:".
#M_DeathNoticeBod:([^A-Za-z]*?[A-Z][A-Za-z]*)([^A-Za-z]*?[A-Z][A-Za-z]*) [<\!->]
It is also not selecting cities with 3 names in it like "RIO GRANDE CITY". I'm selecting this because the dash appears in other spots in the file that I do not want replaced.
If I can select that section I can replace the dash here.

If the lines you care about always start with "#M_DeathNoticeBod:" followed by the city of death, followed by the "<\!->" you wish to replace, I think something simple would do the job:
(#M_DeathNoticeBod:.*)<\\!->
Capture group 1 will contain everything up to the "<\!->" (the last one on the line, since .* is greedy; use .*? if a line could contain more than one), so if you're doing a search and replace you can replace each occurrence of that regex with the contents of group 1 (usually denoted by '\1') followed by an em dash.
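A minimal Python sketch of that search and replace, assuming the file content is available as a string. The non-greedy .*? is used so the match stops at the first "<\!->" on the line; the third sample line and the names SAMPLE, BODY_DASH and replace_city_dash are made up for illustration.

```python
import re

# Raw strings keep the backslashes literal. The first two lines come from the
# question; the third is a hypothetical line whose dash must stay untouched.
SAMPLE = "\n".join([
    r"#M_DeathNoticeHed:Andy Garza",
    r"#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Andy <\n>Garza , 21, died Tuesday",
    r"Some other text <\!-> left alone",
])

# Non-greedy .*? stops at the first <\!-> after the body tag.
BODY_DASH = re.compile(r"(#M_DeathNoticeBod:.*?)<\\!->")

def replace_city_dash(text):
    # \g<1> re-inserts the captured prefix; \u2014 is the em dash character.
    return BODY_DASH.sub("\\g<1>\u2014", text)

converted = replace_city_dash(SAMPLE)
print(converted)
```

Because . does not match a newline by default, the pattern never runs past the end of a body line, and dashes on lines without the "#M_DeathNoticeBod:" tag are not replaced.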

This regex should do:
#M_DeathNoticeBod:([A-Z ]*) (<\\!->)
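As a quick check that this pattern also copes with multi-word cities, here is a small Python sketch run against body lines shortened from the question. The names LINES, CITY and cities are illustrative only.

```python
import re

# Body lines shortened from the question; raw strings keep backslashes literal.
LINES = [
    r"#M_DeathNoticeBod:ALAMO <\!-> Alex Ornelas <\n>, 25, died Tuesday",
    r"#M_DeathNoticeBod:RIO GRANDE CITY <\!-> Almaquire <\n>Cadena , 87",
    r"#M_DeathNoticeBod:MISSION <\!-> AnaRose <\n>Collazo , 44",
]

# Group 1 is the city (capitals and spaces), group 2 is the dash marker.
CITY = re.compile(r"#M_DeathNoticeBod:([A-Z ]*) (<\\!->)")

cities = []
for line in LINES:
    m = CITY.search(line)
    if m:
        cities.append(m.group(1))

print(cities)
```

The character class [A-Z ] allows spaces inside the city name, which is what lets "RIO GRANDE CITY" through.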

I think this is what you are actually looking for:
(?<=#M_DeathNoticeBod:).+<\\!->
To explain: the first portion, (?<=#M_DeathNoticeBod:), is a positive lookbehind. It doesn't participate in the match but ensures the rest of the pattern is always preceded by that expression.
The trailing portion, .+, matches the city name (any character sequence) up to your <\!-> delimiter, which is matched by <\\!->. Note that the city name is part of the match, so a replacement has to put it back.

Related

Using regex to match groups that may not contain given text patterns

I'd like to use regex to extract birth dates and places, as well as (when they're defined) death dates and places, from a collection of encyclopedia entries. Here are some examples of such entries which illustrate the patterns that I'm trying to codify:
William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, England—died April 23, 1850, Rydal Mount, Westmorland), English poet...
Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane Morris-Goodall, (born April 3, 1934, London, England), British ethologist...
Kenneth Wartinbee Spence, (born May 6, 1907, Chicago, Illinois, U.S.—died January 12, 1967, Austin, Texas), American psychologist...
I was hoping that the following regex pattern would identify the desired capture groups:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(?:—died )?(\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
But sadly, it does not. (Use https://regexr.com/ to view the results.)
Note: When I restructure the pattern around the phrase —died in the following way, the pattern does produce the expected results for entries like the first and third above (those with given death dates/places), but it obviously does not work in all cases.
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?)\)
What am I missing?

In general, you can mark the whole "—died" section as optional:
\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)(—died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)
https://regexr.com/765fc
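A short Python sketch of the optional-group pattern applied to two of the entries from the question. The function name parse_entry and the dictionary keys are illustrative; \u2014 stands for the dash character that precedes "died".

```python
import re

# The pattern from the answer; the whole died section is one optional group.
ENTRY = re.compile(
    r"\(born (\w+ \d{1,2}, \d{4})(?:, )(.*?)"
    r"(\u2014died (\w+ \d{1,2}, \d{4})?(?:, )?(.*?))?\)"
)

def parse_entry(entry):
    m = ENTRY.search(entry)
    if m is None:
        return None
    # Groups 4 and 5 are None when the optional died section is absent.
    return {
        "born": m.group(1),
        "birthplace": m.group(2),
        "died": m.group(4),
        "deathplace": m.group(5),
    }

wordsworth = parse_entry(
    "William Wordsworth, (born April 7, 1770, Cockermouth, Cumberland, "
    "England\u2014died April 23, 1850, Rydal Mount, Westmorland), English poet..."
)
goodall = parse_entry(
    "Jane Goodall, in full Dame Jane Goodall, original name Valerie Jane "
    "Morris-Goodall, (born April 3, 1934, London, England), British ethologist..."
)
print(wordsworth)
print(goodall)
```

Group 3 (the whole optional section) is skipped in the result on purpose; only the inner date and place groups are of interest.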

You can use a negative lookahead, (?!pattern), to reject a match when "pattern" occurs at the current position; the lookahead itself consumes no characters. Note that "(?!a).*" only checks that the next character is not "a", so it can still match text that contains an "a" later on. To match only strings that contain no "a" at all, anchor the check and let the lookahead scan ahead: "^(?!.*a).*$" (or simply "^[^a]*$").
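A small Python check of how a negative lookahead actually behaves. It compares (?!a).*, which only tests the single position where the match starts, with the anchored form ^(?!.*a).*$, which rejects any string containing an "a". The variable names are illustrative.

```python
import re

# Only asserts that the NEXT character is not "a"; later ones are not checked.
naive = re.compile(r"(?!a).*")
# Scans the whole string for an "a" before matching anything.
strict = re.compile(r"^(?!.*a).*$")

naive_hit = naive.search("banana").group(0)
strict_banana = strict.search("banana")
strict_hello = strict.search("hello").group(0)

print(naive_hit, strict_banana, strict_hello)
```

The first pattern happily matches "banana" because the match starts at "b"; only the anchored version excludes it.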

Regex deselect anything in the line before match

I need to deselect anything in the line before a certain character. My example looks like this.
28.01.2023 11:00
GERMANY
OWVA2
07.01.2023 06:00
JAPAN
940-750
Laden
28.01.2023 12:00
FRANCE
ANDY
-
07.01.2023 09:30
NORWAY and SWEDEN
-
07.01.2023 09:30
SPAIN
I already have a Regex that selects anything in the line that follows a date (dd.MM.yyyy followed by HH:mm). Now I want to select what follows the date ONLY if the match isn't followed by - or NUMBER-NUMBER in the NEXT line. So basically deselect anything in the line before - or NUMBER-NUMBER. So in this case I want to select Germany, France and Spain. I don't want to select Japan and NORWAY and SWEDEN because they are followed by - or NUMBER-NUMBER.
My regex looks like this:
(?<=\d+.\d+.\d+ \d+:\d+[\r\n]+)([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})+(?![\r\n](\d+)?-(\d+)?)
Sorry for this weird example but that is the best I have. Thanks in advance

Currently your negative lookahead only prevents matching the last character of the line: once the engine gives up that character, the match is no longer followed by a line break and NUMBER-NUMBER, so the lookahead succeeds.
You can fix that by placing the lookahead before the content you want to capture:
(?<=\d+.\d+.\d+ \d+:\d+[\r\n]+)(?!.*[\r\n](\d+)?-(\d+)?)([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})+
You can test it here.
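For Python users, a hedged adaptation: the built-in re module rejects variable-width lookbehinds such as (?<=\d+...), so this sketch matches the date line normally and captures the following line in a group instead. It keeps the idea of putting the negative lookahead in front of the content. The names TEXT, PATTERN and selected are illustrative.

```python
import re

# Sample data from the question.
TEXT = """28.01.2023 11:00
GERMANY
OWVA2
07.01.2023 06:00
JAPAN
940-750
Laden
28.01.2023 12:00
FRANCE
ANDY
-
07.01.2023 09:30
NORWAY and SWEDEN
-
07.01.2023 09:30
SPAIN"""

PATTERN = re.compile(
    r"^\d+\.\d+\.\d+ \d+:\d+\n"          # date line, matched rather than looked behind
    r"(?!.*\n\d*-\d*)"                   # next line must not start with "-" or NUMBER-
    r"([ a-zA-ZäöüÄÖÜßé0-9'-]{3,})$",    # the line to select
    re.MULTILINE,
)

selected = PATTERN.findall(TEXT)
print(selected)
```

Because the lookahead is evaluated at the start of the candidate line, the engine cannot satisfy it by dropping the last character, which was the problem with the original placement.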

Capturing string parts in RegEx

I would like to map different parts of a string; some of them are optional and some are always present. I'm using Calibre's built-in function (based on Python regex), but it is a general question: how can I do it in regex?
Sample strings:
!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf
The strings' structure is the following:
[importance markings if any, it can be '!' or '!!'][title][ISBN-10 if available]by[author]([publication date and other metadata]).[file type]
Finally I created this regular expression, but it is not perfect: if an ISBN is present, the title contains the ISBN part too...
(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()
Here is my sandbox: https://regex101.com/r/K2FzpH/1
I really appreciate any help!

Instead of using an alternation, you could use:
^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()
^ Start of the string
!* Match optional exclamation marks
(?P<title>[A-Za-z0-9].+?) Named group title: match one char from the ranges in the character class, followed by as few chars as possible
(?:\s+(?P<isbn>[0-9]{10}))? Optionally match 1+ whitespace chars and the named group isbn, which matches 10 digits
\s+by\s+ Match by between 1 or more whitespace chars
(?P<author>[A-Z0-9][^(]+) Named group author: match either A-Z or 0-9, followed by 1+ chars other than (
(?=\s\() Positive lookahead to assert a whitespace char followed by ( directly to the right
Regex demo
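Since Calibre's function is based on Python regex, the pattern can be tried directly with Python's re module. A minimal sketch using two of the sample file names; the function name parse_filename is illustrative.

```python
import re

# The pattern from the answer: lazy title, optional ISBN, then the author.
FILENAME = re.compile(
    r"^!*(?P<title>[A-Za-z0-9].+?)"
    r"(?:\s+(?P<isbn>[0-9]{10}))?"
    r"\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()"
)

def parse_filename(name):
    m = FILENAME.search(name)
    # groupdict() maps each named group to its text, or None if it did not match.
    return m.groupdict() if m else None

with_isbn = parse_filename(
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf"
)
without_isbn = parse_filename(
    "!!Mixed Fortunes - An Economic History of China Russia and the West "
    "by Vladimir Popov (Jun 17, 2014 4_1).pdf"
)
print(with_isbn)
print(without_isbn)
```

The lazy .+? in the title is what keeps the ISBN out of it: the title grows one character at a time, and the optional ISBN group gets the first chance to claim the ten digits.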

Matching location with regular expressions | Python

I scraped several articles from a website. Now I am trying to extract the location of each news item. The locations are written either in capitals with just the capital city of the country (e.g. "BRUSSELS-") or in some cases along with the country (e.g. "BRUSSELS, Belgium-")
This is a sample of the articles:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...]
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Monday and wounded]
The regular expression I used is this one:
import re

# Read the scraped articles into a single string.
with open("Training_News_6.csv") as text_open:
    text_read = text_open.read()
pattern = r"[A-Z]{1,}\w+\s\—"
result = re.findall(pattern, text_read)
print(result)
I used the dash sign (-) because it is a recurring pattern that follows the location.
However, this regular expression manages to extract "BRUSSELS -", but for "KABUL, Afghanistan -" it extracts only the last part, namely "Afghanistan -".
In the second case I would like to extract the whole location: the capital and the country. Any ideas?

You may use
([A-Z]+(?:\W+\w+)?)\s*—
See the regex demo
Details:
([A-Z]+(?:\W+\w+)?) - Capture Group 1 (the contents of which will be returned as the result of re.findall) capturing
[A-Z]+ - 1 or more ASCII uppercase letters
(?:\W+\w+)? - 1 or 0 occurrences (due to ? quantifier) of 1+ non-word chars (\W+) and 1+ word chars (\w+)
\s* - 0+ whitespaces
— - a — symbol
Python demo:
import re
rx = r"([A-Z]+(?:\W+\w+)?)\s*—"
s = "|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] \n[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo"
print(re.findall(rx, s)) # => ['BRUSSELS', 'KABUL, Afghanistan']

One thing you could do is add , and \s to your first character class, and then strip all whitespace and commas from the left:
,[A-Z,\s]{1,}\w+\s\—
Or even something simpler, like this:
,(.+)\—
$1 would be your match, containing extra symbols. Another option which might work:
,\s*([A-Za-z]*[,\s]*[A-Za-z]*)\s\—
or a simplified version:
,\s*([A-Za-z,\s]*)\s\—
Once again $1 is your match.

Python Regex. Capturing text between matches issue

Quite new to regex and having trouble getting the right match.
I have the following string:
The AGM will be held at the Company's registered office at Unity House, Telford Road, Basingstoke, Hampshire, RG21 6YJ on 13 January 2016 at 10.00 a.m.
The Company announces that its 2016 Annual General Meeting will be held on 11 February 2016 at 10.00 a.m. at Hangar 89, London Luton Airport, Luton, Bedfordshire, LU2 9PF.
I am trying to extract the address from the last occurrence of 'at' till the postcode. So Unity House, Telford Road, Basingstoke, Hampshire, RG21 6YJ and Hangar 89, London Luton Airport, Luton, Bedfordshire, LU2 9PF
This is what I use (at)(?!.*at)(.*)\s([A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2})
it extracts only the second address. Any thoughts?
Thanks

Looks like you want to use ((?:(?!at).)*) instead of (?!.*at)(.*) so that the match cannot skip over another at:
(at)((?:(?!at).)*)\s([A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2})
See demo at regex101
If you use (at)(?!.*at)(.*) with the s flag, the lookahead only succeeds at the last at in the whole text, where no other at lies ahead, so it is expected that only the last address matches. (at)((?:(?!at).)*) instead stops at the next at rather than skipping over it.
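A short Python sketch of the suggested pattern applied to the two sentences from the question, each searched separately. The helper name extract_address is illustrative, and joining groups 2 and 3 back into one address string is a choice made for this example.

```python
import re

# (?:(?!at).)* consumes characters one by one but refuses to cross another "at".
ADDRESS = re.compile(
    r"(at)((?:(?!at).)*)\s"
    r"([A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2})"
)

def extract_address(sentence):
    m = ADDRESS.search(sentence)
    if m is None:
        return None
    # Group 2 is the street part (with a leading space), group 3 the postcode.
    return m.group(2).strip() + " " + m.group(3)

first = extract_address(
    "The AGM will be held at the Company's registered office at Unity House, "
    "Telford Road, Basingstoke, Hampshire, RG21 6YJ on 13 January 2016 at 10.00 a.m."
)
second = extract_address(
    "The Company announces that its 2016 Annual General Meeting will be held on "
    "11 February 2016 at 10.00 a.m. at Hangar 89, London Luton Airport, Luton, "
    "Bedfordshire, LU2 9PF."
)
print(first)
print(second)
```

Note that the pattern also tries the "at" inside words such as "that"; those attempts fail because no postcode follows before the next "at", so the search moves on.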
