Excluding text at the beginning of a string - regex

I'm new to using RegEx and I'm still stumbling around a bit, so I'm sorry if this is a basic question. I'm trying to extract a string from between two parentheses and I can't seem to figure out how to exclude the first part from my match.
This is my regex pattern:
(.+?)(?= -)
I want to extract a birth date, for example, excluding the "b." and the trailing "-". Here's a sample set:
( b. circa 1883 - d. Mar 03, 1960 )
( b. May 21, 1887 - d. Jan 24, 1979 )
( b. May 28, 1902 Zembin, BELARUS - d. Dec 22, 1998 Florida, USA )
( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA - d. May 17, 1969 New York, New York, USA )
My regex matches ( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA (for example), but it also includes the "( b. " prefix, which I want to exclude.
The regex also matches the following text, which I would like to exclude as well:
Husband of Sarah Wilder (August 2000
Also, I cannot get the following string to match, presumably because of the dot and space in St. Louis.
( b. Jun 28, 1920 St. Louis, Missouri, USA )
I've been banging my head for several hours and just can't quite get the rest of it. Any help or guidance would be very much appreciated. I've already gotten a lot of help from reading many of the posts here.
Thanks so much!

Assuming that your data always contains a hyphen followed by d., you can try this: (?<=b\. )(.*) - d\.
(?<=b\. ) matches the b. text without it being added to the matching text.
(.*) is a capturing group that contains the match. It captures everything until the terminating - d. is hit. Note that the . characters must be escaped to match correctly as they are regex special characters.
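As a rough illustration, here is the same pattern run in Python against three of the sample lines from the question. The helper name birth_part is made up for this sketch, and Python is only an assumption since the question does not name a language.

```python
import re

# Pattern from the answer: lookbehind for "b. ", capture everything up to " - d."
BIRTH_RE = re.compile(r"(?<=b\. )(.*) - d\.")

samples = [
    "( b. circa 1883 - d. Mar 03, 1960 )",
    "( b. May 21, 1887 - d. Jan 24, 1979 )",
    "( b. May 28, 1902 Zembin, BELARUS - d. Dec 22, 1998 Florida, USA )",
]


def birth_part(line):
    """Return the text between 'b. ' and ' - d.', or None when there is no match."""
    m = BIRTH_RE.search(line)
    return m.group(1) if m else None


for line in samples:
    print(birth_part(line))
```

Because the pattern requires the " - d." terminator, a line with only a birth entry (such as the St. Louis example) yields None with this approach.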

If it always starts with ( b. and ends with - d. <something> ), you can simply do
(?<=^\( b\. ).*(?= - d\..*\))
This matches any characters (.*) that have <start of line>( b. in front of them ((?<=^\( b\. )) and - d. <something>) after them ((?= - d\..*\))). https://regex101.com/r/vB2fmP/1
Or, if you don't mind using a capture group:
^\( b\. (.*) - d\..*\)$
^ start of line
\( b\. open parenthesis, space, b, dot, space
( ) capture group
.* any character, any number of occurrences
 - d\..*\) space, hyphen, space, d, dot,
then any character, any number of occurrences,
close parenthesis,
$ end of line
and capture group 1 is the value you need (personally I prefer this one instead).
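A minimal Python sketch comparing the two variants, assuming the "- d." form with no space between the d and the dot, as in the sample data. The names ANCHORED_RE and LOOKAROUND_RE are invented for the example.

```python
import re

# Capture-group variant: the wanted text is group 1.
ANCHORED_RE = re.compile(r"^\( b\. (.*) - d\..*\)$")
# Lookaround variant: the wanted text is the whole match.
LOOKAROUND_RE = re.compile(r"(?<=^\( b\. ).*(?= - d\..*\))")

line = (
    "( b. Jan 09, 1886 Philadelphia, Pennsylvania, USA"
    " - d. May 17, 1969 New York, New York, USA )"
)

via_group = ANCHORED_RE.match(line).group(1)
via_lookaround = LOOKAROUND_RE.search(line).group(0)

print(via_group)
print(via_lookaround)
```

Both variants give the same text here. The capture-group form also works in regex engines that do not support lookbehind.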

To prevent capturing the leading ( b. you could prefix your regex with \(\s*b\.\s* which will match the ( and the b. surrounded by zero or more whitespace characters \s*.
Then from that point you would capture your values in a group (.*?) and you could update your positive lookahead (?= (?:\-|\))) to include a whitespace with either a - or a ).
\(\s*b\.\s*(.*?)(?= (?:\-|\)))
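To check this pattern against the cases raised in the question, here is a small Python sketch. The function name birth_part is hypothetical and Python itself is an assumption.

```python
import re

# Pattern from the answer: skip "( b. ", then lazily capture until " -" or " )" follows.
BIRTH_RE = re.compile(r"\(\s*b\.\s*(.*?)(?= (?:\-|\)))")


def birth_part(text):
    """Return the captured birth text, or None when the line has no '( b.' prefix."""
    m = BIRTH_RE.search(text)
    return m.group(1) if m else None


print(birth_part("( b. circa 1883 - d. Mar 03, 1960 )"))
print(birth_part("( b. Jun 28, 1920 St. Louis, Missouri, USA )"))
print(birth_part("Husband of Sarah Wilder (August 2000"))
```

The St. Louis line matches because the lookahead accepts a closing parenthesis as well as a hyphen, and the "Husband of ..." line is rejected because it lacks the "b." marker.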

You can do this by making two passes through the search string. On the first pass you capture all text inside brackets, and on the second you clean up your results by removing the unwanted expressions. You don't say what language you are using, so I will use PHP.
$want = '/\(.+?\)/';     // Text from an opening bracket to the next closing bracket
$dontWant = '/(b\.|-)/'; // The "b." marker and the hyphen separator
$desiredResult = array();
$result = preg_match_all($want, $searchText, $matches); // Get all text inside brackets
if (count($matches[0]) > 0) { // $matches[0] holds all the matches
    foreach ($matches[0] as $match) { // Loop through the matches
        $desiredResult[] = preg_replace($dontWant, "", $match); // Remove unwanted text
    }
}
You can adjust this to whatever language you are using.

Related

Regular Expression to get substring from main string

I am trying to get a substring based on a pattern: I want to fetch the first number in the string, provided it does not start at the first character of the main string.
Strings:
BRUSPAZ 8MG
BRUSPAZ MG
BRUSPAZ 10 MG
BRUSPAZ10 MG
AVAS 40
AVAS 40 TEST 2TABS
MICROCEF CV 200 TABS
1CROCIN DS 240 MG / 5 ML SUSPENSION
My Regular Expression : /(\d+)( )?(MG)?/
Required Output:
This is the regex:
(?<!^)(\d+)(\s*MG)?
I changed the ( )? to \s* so as to account for other kinds of whitespace and more than one of them.
I added a (?<!^). This is a negative lookbehind, looking for ^ - the start of the string. Basically it says that "there should not be the start of the string before the digits".
If you run this regex line by line with the global modifier turned off, you will not match the 5 in the last line.
If you want to match decimals as well, use this:
(?<!^)(\d+(?:\.\d+)?)(\s*MG)?
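Here is a short Python sketch of the integer version of the pattern, run line by line over the sample strings. The helper first_number is invented for the example, and Python is an assumption because the question does not say which engine is in use.

```python
import re

# Negative lookbehind: the digits must not begin at the very start of the string.
FIRST_NUMBER_RE = re.compile(r"(?<!^)(\d+)(\s*MG)?")

lines = [
    "BRUSPAZ 8MG",
    "BRUSPAZ MG",
    "BRUSPAZ 10 MG",
    "BRUSPAZ10 MG",
    "AVAS 40",
    "AVAS 40 TEST 2TABS",
    "MICROCEF CV 200 TABS",
    "1CROCIN DS 240 MG / 5 ML SUSPENSION",
]


def first_number(line):
    """Return the first number that does not start at index 0, or None."""
    m = FIRST_NUMBER_RE.search(line)  # search() stops at the first match, like a non-global run
    return m.group(1) if m else None


results = {line: first_number(line) for line in lines}
for line, number in results.items():
    print(f"{line!r} -> {number}")
```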

Extract nested string from text column

I have following SQL result entries.
Result
---------
TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>
TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>
TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243
TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197
Now I am trying to extract a short description from this result and trying to migrate it to the another column in the same table.
Expected result
--------------
Due Date Updated
Priority Updated
Material Added
Labor Added
Ignore first 13 characters. For most of the cases it ends with 'updated'. Few ends with 'added'. It should be case insensitive.
Is there any way to get the expected result.
Solution with substring() using a regular expression. It skips the first 13 characters, then takes the string up to the first ' updated' or ' added', case-insensitive, with leading blank. Else NULL:
SELECT substring(result, '(?i)^.{13}(.*? (?:updated|added))')
FROM tbl;
The regexp explained:
(?i) .. meta-syntax to switch to case-insensitive matching
^ .. start of string
.{13} .. skip the first 13 characters
() .. capturing parenthesis (captures payload)
.*? .. any number of characters (non-greedy)
(?:) .. non-capturing parenthesis
(?:updated|added) .. 2 branches (string ends in 'updated' or 'added')
If we cannot rely on 13 leading characters, as you later commented, we need some other reliable definition instead. Your difficulty seems to be with hazy requirements more than with the actual implementation.
Say, we are dealing with 1 or more non-digits, followed by 1 or more digits, a space and then the payload as defined above:
SELECT substring(result, '(?i)^\D+\d+ (.*? (?:updated|added))') ...
\d .. class shorthand for digits
\D .. non-digits, the opposite of \d
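Both patterns can be tried outside the database. The sketch below uses Python's re module purely as a stand-in for Postgres substring(), which is an assumption: the two engines agree on these particular constructs, but they are not identical in general. The function name short_description is made up.

```python
import re

# Same two patterns as in the SQL above, compiled with Python's re module.
FIXED_PREFIX_RE = re.compile(r"(?i)^.{13}(.*? (?:updated|added))")
FLEX_PREFIX_RE = re.compile(r"(?i)^\D+\d+ (.*? (?:updated|added))")

rows = [
    "TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>",
    "TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>",
    "TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243",
    "TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197",
]


def short_description(row, pattern=FIXED_PREFIX_RE):
    """Return the captured payload, or None when the row does not match."""
    m = pattern.search(row)
    return m.group(1) if m else None


for row in rows:
    print(short_description(row), "|", short_description(row, FLEX_PREFIX_RE))
```

The captured text keeps the capitalization found in the data, so "updated" stays lowercase unless it is changed in a separate step.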

Matching location with regular expressions | Python

I scraped several articles from a website. Now I am trying to extract the location of each news item. The locations are written either in capitals with just the capital of the country (e.g. "BRUSSELS-") or in some cases along with the country (e.g. "BRUSSELS, Belgium-").
This is a sample of the articles:
|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded two police officers with a knife in Brussels around...]
[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016
, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Monday and wounded]
The regular expression I used is this one:
import re

text_open = open("Training_News_6.csv")
text_read = text_open.read()
pattern = r"[A-Z]{1,}\w+\s\—"
result = re.findall(pattern, text_read)
print(result)
The reason I used the dash (-) is that it is a recurring pattern that follows the location.
However, this regular expression manages to extract "BRUSSELS -", but when it comes to "KABUL, Afghanistan -" it only extracts the last part, namely "Afghanistan -".
In the second case I would like to extract the whole location: the capital and the country. Any ideas?
You may use
([A-Z]+(?:\W+\w+)?)\s*—
Details:
([A-Z]+(?:\W+\w+)?) - Capture Group 1 (the contents of which will be returned as the result of re.findall) capturing
[A-Z]+ - 1 or more ASCII uppercase letters
(?:\W+\w+)? - 1 or 0 occurrences (due to ? quantifier) of 1+ non-word chars (\W+) and 1+ word chars (\w+)
\s* - 0+ whitespaces
— - a — symbol
Python demo:
import re
rx = r"([A-Z]+(?:\W+\w+)?)\s*—"
s = "|[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016 \n, BRUSSELS — A man wounded two police officers with a knife in Brussels around...] \n[<p>Advertisement , By DAVID JOLLY FEB. 8, 2016 \n, KABUL, Afghanistan — A Taliban suicide bomber killed at least three people on Mo"
print(re.findall(rx, s)) # => ['BRUSSELS', 'KABUL, Afghanistan']
One thing you could do is add , and \s to your first character class, and then strip all whitespace and commas from the left:
,[A-Z,\s]{1,}\w+\s\—
Or even something simpler, like this:
,(.+)\—
$1 would be your match, containing extra symbols. Another option which might work:
,\s*([A-Za-z]*[,\s]*[A-Za-z]*)\s\—
or a simplified version:
,\s*([A-Za-z,\s]*)\s\—
Once again $1 is your match.

String Split AND Replace

I am trying to replace a string based on the split portion. This string is a date, where the year should be formatted as a superscript.
Eg. Jan 24, 2014 needs to be split at 2014 then replaced with Jan 24, ^2014^ where 2014 is the superscript.
Example pseudo:
mydate.Split(" ", 2).Replace("^2014^")
But, instead of replacing the new split string, it should be the original (or copy of original). I can't just edit based on index because the formatting may not always be the same, at times the date may be expanded to January 24th, 2014 which would then break the traditional replace by index.
You can try
(?<=[A-Z][a-z]{2} \d{2}, )(\d{4})
Replaced with ^$1^ or ^\1^
Here is online demo and tested it on regexstorm
If you want to match January 24th, 2014 as well then try
([A-Z][a-z]{2,9} \d{2}[a-z]{0,2}, )(\d{4})
Replaced with $1^$2^
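The question looks like .NET, but the same replacement can be sketched in Python, where \1 and \2 take the place of $1 and $2. The function name mark_year is invented for the example.

```python
import re

# Group 1: month name, two-digit day, optional ordinal suffix, comma and space.
# Group 2: the four-digit year to be wrapped in carets.
YEAR_AFTER_DATE_RE = re.compile(r"([A-Z][a-z]{2,9} \d{2}[a-z]{0,2}, )(\d{4})")


def mark_year(text):
    """Wrap the year of a 'Mon DD, YYYY' style date in ^ markers."""
    return YEAR_AFTER_DATE_RE.sub(r"\1^\2^", text)


print(mark_year("Jan 24, 2014"))
print(mark_year("January 24th, 2014"))
```

As written, the pattern expects a two-digit day, so a date such as "Jan 4, 2014" would be left unchanged.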
You can use a combination of lookarounds to achieve your result.
Regex.Replace(input, "(?<=\d{4})|(?=\d{4})", "^")
Explanation:
(?<= # look behind to see if there is:
\d{4} # digits (0-9) (4 times)
) # end of look-behind
| # OR
(?= # look ahead to see if there is:
\d{4} # digits (0-9) (4 times)
) # end of look-ahead
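The lookaround version can be tried in Python as well. This is a sketch that assumes Python 3.7 or later for its handling of empty matches in re.sub, and the function name wrap_four_digits is hypothetical.

```python
import re

# Zero-width matches: the position just after four digits, or just before four digits.
YEAR_EDGE_RE = re.compile(r"(?<=\d{4})|(?=\d{4})")


def wrap_four_digits(text):
    """Insert ^ at both edges of a four-digit run."""
    return YEAR_EDGE_RE.sub("^", text)


print(wrap_four_digits("Jan 24, 2014"))
print(wrap_four_digits("January 24th, 2014"))
```

This relies on the year being the only run of four or more digits in the input. A longer digit run would receive extra carets, because more than two positions satisfy the lookarounds.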
Normalize your date string by assigning it to a Date variable, then do the formatting from there.
Dim dt As Date = "Jan 24, 2014"
Dim s As String = dt.ToShortDateString.Replace("2014", "^2014^")
MsgBox(s)
' or '
s = dt.Month.ToString & "/" & dt.Day.ToString & "/^" & dt.Year.ToString & "^"
MsgBox(s)
IMO RegEx is write-once code and is difficult to debug and maintain.

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.
My data is like this: usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or space with only ONE letter following. I am trying to isolate the 2 numbers, the possible '-' or space, AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regexes that I've tried (they worked in the tester, but not in PostgreSQL):
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester, but not in PostgreSQL. I just keep getting the numbers returned, AND NOTHING AFTER THEM.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"
You're looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.
This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(?:[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(?: .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.
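For readers who want to experiment outside the database, here is a rough Python translation of that pattern. It is only an approximation and rests on assumptions: Python's re module has no [[:alpha:]] class or \M assertion, so [A-Za-z] and \b stand in for them, which limits the sketch to ASCII letters. The names CODE_RE and extract_code are invented.

```python
import re

# Approximate translation of the Postgres pattern:
#   [[:alpha:]] -> [A-Za-z]  (ASCII only, unlike the locale-aware original)
#   \M          -> \b        (after a letter, \b can only be an end-of-word boundary)
CODE_RE = re.compile(r"^[A-Za-z ]+(\d\d(?:[\s-]?[A-Za-z]\b)?)")

rows = [
    "AJ 50-R Busboys",
    "APPLES 30 F",
    "FOOBAR 30 Apple",
    "FOOBAR 30x Apple",
    "sadfgag30 D 66 X foo",
]


def extract_code(text):
    """Return the two digits plus an optional separator and single letter."""
    m = CODE_RE.search(text)
    return m.group(1) if m else None


for row in rows:
    print(f"{row!r} -> {extract_code(row)!r}")
```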
It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
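The first-group rule quoted from the documentation can be imitated in a few lines of Python. The helper pg_substring below is a hypothetical stand-in written for this sketch, not a PostgreSQL API, and it uses the asker's second expression because Python's re module does not understand [:space:].

```python
import re


def pg_substring(text, pattern):
    """Rough imitation of substring(text from pattern) as described in the quoted docs."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # If the pattern contains parentheses, return the first group; otherwise the whole match.
    return m.group(1) if m.re.groups else m.group(0)


bare = r"([0-9]+)-?([A-Za-z])?"
wrapped = r"(([0-9]+)-?([A-Za-z])?)"

print(pg_substring("AJ 50-R Busboys", bare))     # first group is only the digits
print(pg_substring("AJ 50-R Busboys", wrapped))  # first group now spans the whole pattern
```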
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-\s]? (thanks Matthew for pointing that out), which leads to the following possible REGEX:
(([0-9]+)[-\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.