Don't get the number if it is preceded by a month - Python regex

I'm pulling some data from the web using Python in a Jupyter notebook. I have pulled down the data, parsed it, and created the data frame. I have extracted a number out of a string that I have in a variable in the data frame. I'm using this regex to do it:
number = []
for note in df["person_notes"]:
    match = re.search(r'\d+', note)
    if match:
        number.append(note[match.start(): match.end()])
    else:
        number.append("")
df["number"] = number
Some strings are missing the number I'm looking for. For those cases, I would like to number.append(""). Those strings instead have a full date like "September 20, 2016", and my re.search() is pulling the number 20 out of that full date. If the string has a date like that, I want to ignore the 20 and number.append("") instead.
How can I modify the re.search() to ignore the number if the number is preceded by a month?

I suggest using the old JS regex trick: take the pattern you would otherwise put in a negative lookbehind and enclose it in an optional capturing group. If that group matched, discard the match (here, append a ""). Otherwise, grab the contents of the other capturing group (here, the digits).
See the Python demo:
import re
number = []
p = re.compile(r'((?:Jan|Febr)(?:uary)?|Ma(?:y|r(?:ch)?)|A(?:ug(?:ust)?|pr(?:il)?)|Ju(?:ne?|ly?)|Oct(?:ober)?|(?:Sept|Nov|Dec)(?:ember)?)? *(\d+)')
match = p.search('September 20, 2016')
if match and not match.group(1): # Did the string match and did Group 1 fail?
    number.append(match.group(2)) # Yes, then add digits
else:
    number.append("") # Else, add an empty value
print(number)
If you do not care about the shortened month names and want to keep the pattern readable, you may use a simpler regex:
p = re.compile(r'(January|February|March|April|May|June|July|August|September|October|November|December)? *(\d+)')
The regex matches:
((?:Jan|Febr)(?:uary)?|Ma(?:y|r(?:ch)?)|A(?:ug(?:ust)?|pr(?:il)?)|Ju(?:ne?|ly?)|Oct(?:ober)?|(?:Sept|Nov|Dec)(?:ember)?)? - Group 1 (optional): a month name (full or short)
 * (a space followed by *) - zero or more spaces
(\d+) - Group 2: one or more digits.
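To tie this back to the loop in the question, here is a minimal sketch that wraps the simpler pattern in a helper. The function name extract_number and the sample notes are made up for illustration; only the pattern and the group check come from the answer above.

```python
import re

# Simpler pattern from the answer: optional full month name (group 1),
# optional spaces, then the digits (group 2).
MONTH_THEN_NUMBER = re.compile(
    r'(January|February|March|April|May|June|July|August|'
    r'September|October|November|December)? *(\d+)'
)

def extract_number(note):
    """Return the first number in note, or "" if a month name precedes it."""
    match = MONTH_THEN_NUMBER.search(note)
    if match and not match.group(1):
        return match.group(2)
    return ""

# Hypothetical notes, standing in for df["person_notes"].
notes = ["Ticket 4521 reopened", "September 20, 2016", "no digits here"]
print([extract_number(n) for n in notes])  # ['4521', '', '']
```

With a data frame, the same helper could presumably be applied with df["person_notes"].apply(extract_number). Note that only the first candidate in each string is inspected, so a note that starts with a date yields "" even if another number follows later.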

Related

Match a number between two dates

I have a text and I need to extract a number that is between two dates. I can't show the full text, so I will only use the part I need, but keep in mind it's part of a bigger text.
12/14/2020 355345 12/14/2020
From that, I need to get '355345'. I currently don't have anything to show of what I was doing, because I was working on getting the text before a sentence until I realized this is the only place where the number is between two dates.
Thanks!
Here's a snippet that might help:
Suppose the input is this:
Imports System.Text
Imports System.Text.RegularExpressions
'...
Dim input As New StringBuilder
input.AppendLine("12/14/2020 355345 12/14/2020")
input.AppendLine("12/13/2020 425345 12/13/2020")
input.AppendLine("12/20/2020 93488557 12/20/2020")
input.AppendLine("12/21/2020 4 12/21/2020")
input.AppendLine("12/20/2020 3443 12/20/2020")
'...
Use RegEx to extract the numbers between the two dates as follows:
Dim patt = "(\d+\/\d+\/\d+)\s?(\d+)\s?(\d+\/\d+\/\d+)"
For Each m In Regex.
        Matches(input.ToString, patt, RegexOptions.Multiline).
        Cast(Of Match)
    Console.WriteLine(m.Groups(2).Value)
Next
This will capture three groups. Example for the first match:
m.Groups(1).Value : 12/14/2020 the first date.
m.Groups(2).Value : 355345 the number in between.
m.Groups(3).Value : 12/14/2020 the second date.
If you have no use for the captured dates, there is no need to group them; use the following pattern instead:
Dim patt = "\d+\/\d+\/\d+\s?(\d+)\s?\d+\/\d+\/\d+"
For Each m In Regex.
        Matches(input.ToString, patt, RegexOptions.Multiline).
        Cast(Of Match)
    Console.WriteLine(m.Groups(1).Value)
Next
And you will get the number between the two dates in Group 1.
The output of both is:
355345
425345
93488557
4
3443
Also, using quantifiers in the RegEx pattern is a good idea, as @AndrewMorton mentioned in his comments. They let the pattern skip things like 1234/239994/2293 in the input:
Dim patt = "\d{1,2}/\d{1,2}/\d{4}\s(\d{1,})\s\d{1,2}/\d{1,2}/\d{4}"
For Each m In Regex.
        Matches(input.ToString, patt, RegexOptions.Multiline).
        Cast(Of Match)
    Console.WriteLine(m.Groups(1).Value)
Next
The quantifiers-way test is here.
If you can safely check for numbers and slashes, then a pattern like this should work:
\d\d/\d\d/\d\d\d\d +(\d+) +\d\d/\d\d/\d\d\d\d
...where capture group 1 would hold the number being sought. If you need to validate that the values are actually dates, well... you can do it with regex to a degree, but the pattern becomes very difficult to read.
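As a rough illustration of keeping the regex simple and validating the dates separately, here is a sketch in Python (the question itself is VB.NET; the helper names is_date and numbers_between_dates are made up, and the MM/DD/YYYY format is an assumption based on the sample input). The regex only checks the shape, and datetime.strptime decides whether each token is a real calendar date.

```python
import re
from datetime import datetime

# Shape check only: date-like token, spaces, digits, spaces, date-like token.
PATTERN = re.compile(r"(\d{1,2}/\d{1,2}/\d{4}) +(\d+) +(\d{1,2}/\d{1,2}/\d{4})")

def is_date(text):
    """Return True if text parses as a real MM/DD/YYYY calendar date."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False

def numbers_between_dates(text):
    """Collect the numbers that sit between two valid dates."""
    return [
        m.group(2)
        for m in PATTERN.finditer(text)
        if is_date(m.group(1)) and is_date(m.group(3))
    ]

sample = (
    "12/14/2020 355345 12/14/2020\n"
    "12/21/2020 4 12/21/2020\n"
    "13/45/2020 99 13/45/2020"  # date-shaped, but not a real date
)
print(numbers_between_dates(sample))  # ['355345', '4']
```

Splitting the work this way keeps the pattern readable, at the cost of a second pass over each match.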

How to handle redundant cases in regex?

I have to parse file data into good and bad records. The data should be in the format
Patient_id::Patient_name (year of birth)::disease
The diseases are pipe separated and are selected from the following:
1.HIV
2.Cancer
3.Flu
4.Arthritis
5.OCD
Example: 23::Alex.jr (1969)::HIV|Cancer|flu
The regex expression I have written is
\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*
But it also accepts records with redundant entries, such as
24::Robin (1980)::HIV|Cancer|Cancer|HIV
How can I handle these kinds of records, and how can I write a better expression if the list of diseases is very large?
Note: I am using a Hadoop map-only job for parsing, so please answer in the context of Java.
What you might do is capture the last part with all the diseases in one group (a named capturing group, disease), then use split to get the individual ones, and then make the list unique.
^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$
For example:
String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
    String[] parts = matcher.group("disease").split("\\|");
    Set<String> uniqueDiseases = new HashSet<String>(Arrays.asList(parts));
    System.out.println(uniqueDiseases);
}
Result:
[HIV, Cancer]
Regex demo | Java demo
You need a negative lookahead.
Try using this regex: ^\d*::[^(]+?\s*\(\d{4}\)::(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$.
Explanation:
The initial part ^\d*::[^(]+?\s*\(\d{4}\):: is just a reworked prefix that matches the Alex.jr example (your version did not respect any non-alphabetic symbols in names).
The negative lookahead block (?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1) stands for "look ahead for any disease name that occurs twice, and reject the string if one is found". Its distinctive feature is the (?! ... ) signature.
Finally, ((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$ is a more compact version of your block (HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*, written to avoid listing the alternatives twice.
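The lookahead can be tried out quickly outside Hadoop. The sketch below uses Python's re module purely for illustration (the question asks for Java, where the same pattern should work with java.util.regex, though that is not tested here); the name is_unique_record is hypothetical.

```python
import re

# The answer's pattern, split over three lines for readability.
DUP_FREE = re.compile(
    r'^\d*::[^(]+?\s*\(\d{4}\)::'
    r'(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)'  # fail if a name recurs after a later |
    r'((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$'
)

def is_unique_record(line):
    """True if the line has the expected shape and no repeated disease."""
    return DUP_FREE.match(line) is not None

print(is_unique_record("23::Alex.jr (1969)::HIV|Cancer|flu"))        # True
print(is_unique_record("24::Robin (1980)::HIV|Cancer|Cancer|HIV"))   # False
```

One caveat worth checking against your data: because each entry may be followed by either | or the end of the string, this pattern also accepts a record whose disease list ends in a trailing |.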
Probably the easier method to maintain is to use a slightly modified regex, like the one below:
^\d*::[a-zA-Z.]+\s\(\d{4}\)::((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$
It contains:
^ and $ anchors (you want the entire string to match, not just part of it).
A capturing group, including a repeated non-capturing group (a container for alternatives). One of these alternatives is |, but with a negative lookahead for an immediately following | (this way you disallow 2 or more consecutive |).
Then, if this regex matched for a particular row, you should:
Split group No 1 by |.
Check the resulting string array for uniqueness (it should not contain repeating entries).
Only if this check succeeds should you accept the row in question.
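The match, split, and uniqueness steps can be sketched as follows. This is a variation rather than the answer's exact regex: the regex validates only the fixed prefix, and the disease names are checked against a set, which may scale better when the list is very large. It is written in Python for brevity; in Java the same steps would map onto Pattern, String.split, and a HashSet. The names DISEASES, RECORD, and is_good_record are made up.

```python
import re

# Hypothetical allow-list; in practice this could be loaded from a file.
DISEASES = {"HIV", "Cancer", "flu", "Arthritis", "OCD"}

# Validate the fixed prefix only and capture everything after the last "::".
RECORD = re.compile(r"^\d*::[a-zA-Z.]+\s\(\d{4}\)::(.+)$")

def is_good_record(line):
    """Accept a line only if every disease is known and none is repeated."""
    m = RECORD.match(line)
    if not m:
        return False
    parts = m.group(1).split("|")
    # An empty part (leading, trailing, or doubled |) is not in DISEASES.
    return all(p in DISEASES for p in parts) and len(set(parts)) == len(parts)

print(is_good_record("23::Alex.jr (1969)::HIV|Cancer|flu"))       # True
print(is_good_record("24::Robin (1980)::HIV|Cancer|Cancer|HIV"))  # False
```

Adding a disease then means adding one entry to the set, with no change to the regex.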

Get segment of string in between characters

I have a giant data set that includes lots of file names with various parts of strings that I need to grab.
I have this code segment currently:
def fps(data):
    for i in data:
        pattern = r'.(\d{4}).' # finds data in between the periods
        frames = re.findall(pattern, ' '.join(data)) #puts info into frames list
        frames.sort()
        for i in range(len(frames)): #Turns the str into integers
            frames[i] = int(frames[i])
    return frames
This is great and all, but it only returns exactly 4 characters between periods.
How would I grab the part of the string after a period and before the next period?
Preferably without using regular expressions, because they're a little too complex for a simpleton like me.
For example:
One string may look like this
string = ['filename.0530.extension']
while the others may look like this
string2 = ['filename.042.extension']
string3 = ['filename.045363.extension']
I would need to output the numbers in between the periods on the terminal so:
0530, 042, 045363
To match your example data you could match a dot, capture one or more digits \d+ in a group (instead of exactly 4 with \d{4}), and then match another dot:
\.(\d+)\.
If you want to match all between the dots you might use a negating character class [^.] to match not a dot:
\.([^.]+)\.
Note that if you want to match a literal dot you should escape it \.
Demo
To match the numbers between your periods in your example, you can use this:
^.*\.[^.\s]*?\.?(\d+)\..*$
Here's an online example
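For comparison, here is a small sketch of the \.(\d+)\. pattern next to a regex-free version, since the question asked for one. The function names are made up, and the split variant assumes every name has the form name.number.extension with exactly the frame number between the first two periods.

```python
import re

names = [
    'filename.0530.extension',
    'filename.042.extension',
    'filename.045363.extension',
]

def frames_regex(names):
    """Digits between two literal dots, via the escaped-dot pattern."""
    frames = []
    for name in names:
        m = re.search(r'\.(\d+)\.', name)
        if m:
            frames.append(m.group(1))
    return frames

def frames_split(names):
    """Regex-free variant: take the piece between the first two periods."""
    return [name.split('.')[1] for name in names if name.count('.') >= 2]

print(', '.join(frames_regex(names)))  # 0530, 042, 045363
print(', '.join(frames_split(names)))  # 0530, 042, 045363
```

Both keep the values as strings so that leading zeros survive; converting with int() as in the original function would turn 0530 into 530.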

Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n@doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group. Since .* greedily matches any 0+ chars other than line break chars, the only char that can land in Group 1 during backtracking is the char right before @.
If you can use partial matches, get rid of the .* and use
/[^<\s]+@[^\s>]+/
See the regex demo
Details
[^<\s]+ - 1 or more chars other than < and whitespace
@ - a @ char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
    res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* where .* will not gobble the name since it will stop before an obligatory <.
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially implemented in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+@[^\s>]+", value)[0]
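One thing to watch with the findall(...)[0] form is that indexing fails on cells with no address at all. Below is a minimal standalone sketch of a safer variant, assuming addresses use the usual @ separator. The function name first_email is made up, and it is written as ordinary Python rather than an OpenRefine expression, so inside OpenRefine the body would have to be adapted to work on value.

```python
import re

# Partial-match pattern from the answers: no <, > or whitespace around the @.
EMAIL = re.compile(r'[^<\s]+@[^\s>]+')

def first_email(value):
    """Return the first address-like token in value, or '' if there is none."""
    m = EMAIL.search(value)
    return m.group(0) if m else ''

print(first_email('john doe <john@doe.com>'))  # john@doe.com
print(first_email('john@doe.com'))             # john@doe.com
print(repr(first_email('no address here')))    # ''
```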

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are set up like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101
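Here is a rough Python adaptation of the same tempered-lookahead idea, for experimenting. Two things differ from the pattern above and are my assumptions: (?<!\S) replaces (?<=^|\s), because Python's re module only accepts fixed-width lookbehind, and the attribute name is matched with \w+ rather than .*?. The name parse_params is hypothetical.

```python
import re

# name = formula, where the formula stops before " <word>=" (the next parameter).
PARAM = re.compile(r'(?<!\S)(\w+)=((?:(?!\s\w+=)[^=])*)')

def parse_params(line):
    """Map each parameter name to its formula, with inner spaces removed."""
    return {m.group(1): m.group(2).replace(' ', '') for m in PARAM.finditer(line)}

line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"
params = parse_params(line)
print(params["par_nf"])  # (1)*(ng)
print(params["pd"])      # 2.0*(84e-9+(1.0*nf)*40e-9)
```

Capturing name and value directly avoids the separate split step, since the spaces inside a formula never end a match.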
Thanks to @Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
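A minimal sketch of that delimiter idea in Python follows. It departs from the pattern above in two labelled ways: it uses a lookahead, so the parameter name itself is not consumed by the replacement, and it matches the name with \w+ alone, because \w+[a-zA-Z_] as written needs at least two characters and would skip a name like m. The function name split_params and the choice of ; as separator are assumptions; the separator must not occur inside any formula.

```python
import re

def split_params(line, sep=';'):
    """Replace the whitespace before each name= with sep, then split on it."""
    marked = re.sub(r'\s+(?=\w+=)', sep, line)
    pairs = {}
    for token in marked.split(sep):
        name, _, value = token.partition('=')
        pairs[name] = value.replace(' ', '')  # drop spaces inside the formula
    return pairs

line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"
result = split_params(line)
print(result["par_nf"])  # (1)*(ng)
print(len(result))       # 6
```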
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea was to break the line into name/value pairs, using a negative lookahead on the next key to stop one match and start a new one.
var data = @"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = @"
(?<Key>[a-zA-Z_\s\d]+) # Key is any alpha, digit and _
= # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+) # Value is any combinations of text with space(s)
(\s|$) # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$) # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
     .OfType<Match>()
     .Select(mt => new
     {
         LHS = mt.Groups["Key"].Value,
         RHS = mt.Groups["Value"].Value
     });