Regular expression: Matching multiple occurrences in the same line

I have a string that I need to match using regex. It works perfectly fine when there is a single occurrence on a line; however, when there are multiple occurrences on the same line I'm not getting any matches. Can you please help?
Sample strings:
MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T
Regex that I tried:
(([A-Z]{2}[0-9]{8,9}[A-Z]{1})|([A-Z]{2}[0-9]{8,9}))

This seems to work fine:
a = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
import re
patterns = ['[A-Z]{2}[0-9]{8,9}[A-Z]{1}','[A-Z]{2}[0-9]{8,9}']
pattern = '({})'.format(')|('.join(patterns))
matches = re.findall(pattern, a)
print([match for sub in matches for match in sub if match])
#['MS17010314', 'MS00030208', 'IL00171198', 'IH09850115', 'IH99400409',
# 'IH99410409', 'IL01771010', 'IL01791002', 'IL01930907', 'IL02360907',
# 'CM00010904', 'IH09520115', 'MS00201285', 'MS19050708', 'MS00370489',
# 'MS19011285T']
I've added a way to combine all patterns.
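As a side note (my own addition, not part of the answer): the two alternatives differ only in an optional trailing capital letter, so a single pattern with [A-Z]? gives the same sixteen matches, reusing a and re from the snippet above:
# two capital letters, 8 or 9 digits, then an optional capital letter
print(re.findall(r'[A-Z]{2}[0-9]{8,9}[A-Z]?', a))
# ['MS17010314', 'MS00030208', ..., 'MS19011285T']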

I tried using Python and the following code worked:
import re
s = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
a = '[A-Z]{2}[0-9]{8,9}[A-Z]{1}'  # code ending in a letter
b = '[A-Z]{2}[0-9]{8,9}'          # code without the trailing letter
lst_of_regex = [a, b]
pattern = '|'.join(lst_of_regex)
print(re.findall(pattern, s))
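Note that the order of the alternatives matters here: because the pattern ending in [A-Z]{1} comes first in the joined expression, MS19011285T is matched in full rather than stopping after the digits, and the output is the same sixteen codes as in the first answer.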

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie, and I want to extract those names. In some cases, the actor's character name is also included, which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name comes either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we are at the start of the line with ^. Then we lazily match word characters and spaces, [\w ]+?, until we reach (?:(?= as )|$): either " as " or the end of the line.
In code:
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
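To illustrate why (a small sketch of my own, reusing example 1 with the variable renamed to cast):
import re
cast = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
# Without re.MULTILINE, ^ and $ only anchor at the start and end of the whole string,
# so only the first line's name is found.
print(re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', cast))
# ['Levan Gelbakhiani']
# With re.MULTILINE, ^ and $ anchor at every line boundary.
print(re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', cast, re.MULTILINE))
# ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']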
You can do this without using regular expressions as well.
Here is the code:
output = [x.split(' as')[0] for x in input.split('\n')]
I guess you can combine the values obtained from two regex matches:
output = re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Correction in Regex for unicode

I need help with a regex. My regex is not producing the desired results. Below is my code:
import re
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
regex=re.compile(r'[<u+\w+]+>')
txt=regex.findall(text)
print(txt)
Output
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', 'loved<u+2764>', '<u+fe0f>', 'spot<u+26a1>']
I know the regex is not correct. I want the output to be:
'<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>'
import re
regex = re.compile(r'<u\+[0-9a-f]+>')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', '<u+2764>', '<u+fe0f>', '<u+26a1>']
That is not exactly what you want, but it's almost there.
Now, to achieve what you are looking for, we make our regex more eager:
import re
regex = re.compile(r'((?:<u\+[0-9a-f]+>)+)')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>']
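One detail worth spelling out (my addition, not part of the original answer): findall returns the contents of the capturing group, so the group has to wrap the whole repeated run. Capturing the inner tag instead would keep only the last repetition of each run:
import re
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
# group around the repeated element: only its last repetition is returned
print(re.findall(r'(<u\+[0-9a-f]+>)+', text))
# ['<u+0001f48e>', '<u+fe0f>', '<u+fe0f>', '<u+26a1>']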
Why don't you add an optional second-tag search:
regex = re.compile(r'(<u\+\w+>(?:<u\+fe0f>)?)')
This one works fine with your example.

python regex to remove extra characters from papers' doi

I am new to regex and I have a list of some papers' DOIs. Some of the DOIs include extra characters or strings. I want to remove all those extras. Here is the sample data:
10.1038/ncomms3230
10.1111/hojo.12033
blog/uninews #invalid
article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852utm_source=feedburner&utm;_medium=feed&utm;_campaign=Feed%3A+plosone%2FPLoSONE+%28PLOS+ONE+Alerts%3A+New+Articles%29
#want to extract 10.1371/journal.pone.0076852
utm_source=feedburner&utm;_medium=feed&utm;_campaign=Feed%3A+plosone%2 #invalid
10.1002/dta.1578
enhanced/doi #invalid
doi/pgen.1005204
doi:10.2135/cropsci2014.11.0791 #want to remove "doi:"
10.1126/science.aab1052
gp/about-springer
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
Now, some of the entries don't have DOIs at all. I want to replace them with "".
Here is the regex that I came up with:
for doi in doi_lst:
    doi = re.sub(r"^[^10\.][^a-z0-9//\.]+", "", doi)
but it does nothing. I searched many other Stack Overflow questions but couldn't find one that fits my case. Kindly help me out here.
P.S. I am working with Python 3.
Assuming the pattern for DOIs is a substring starting with 10. followed by digits, then /, and then one or more word or . characters, you may convert the strings with urllib.parse.unquote first (to turn percent-encoded entities into literal characters) and then use re.search with the \b10\.\d+/[\w.]+\b pattern to extract each DOI from the list items:
import re, urllib.parse
doi_list = ["10.1038/ncomms3230", "10.1111/hojo.12033", "blog/uninews", "article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ", "utm_source=feedburner&utm;_medium=feed&utm;_campaign=Feed%3A+plosone%2",
            "10.1002/dta.1578", "enhanced/doi", "doi/pgen.1005204", "doi:10.2135/cropsci2014.11.0791", "10.1126/science.aab1052", "gp/about-springer", "10.1038/srep14556", "10.1002/rcm.7274", "10.1177/0959353515592899"]
new_doi_list = []
for doi in doi_list:
    doi = urllib.parse.unquote(doi)
    m = re.search(r'\b10\.\d+/[\w.]+\b', doi)
    if m:
        new_doi_list.append(m.group())
        print(m.group())  # DEMO
Output:
10.1038/ncomms3230
10.1111/hojo.12033
10.1371/journal.pone.0076852
10.1002/dta.1578
10.2135/cropsci2014.11.0791
10.1126/science.aab1052
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
To include empty items when there is no match, add an else: new_doi_list.append("") branch to the above code.
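Put together, that variant could look like this (a sketch reusing the imports, list, and pattern from the code above; "" stands in for entries with no DOI):
new_doi_list = []
for doi in doi_list:
    m = re.search(r'\b10\.\d+/[\w.]+\b', urllib.parse.unquote(doi))
    if m:
        new_doi_list.append(m.group())
    else:
        new_doi_list.append("")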

python regex search for elimination

I have the following case:
import re
target_regex = '^(?!P\-[5678]).*'
pattern = re.compile(target_regex, re.IGNORECASE)
mylists=['p-1.1', 'P-5']
target_object_is_found = pattern.findall(''.join(mylists))
print "target_object_is_found:", target_object_is_found
This will give
target_object_is_found: ['p-1.1P-5']
but what I need from my regex is p-1.1 alone, eliminating P-5.
You joined the items in mylists, so P-5 is no longer at the start of the string and the anchored lookahead cannot exclude it.
You may use
import re
target_regex = 'P-[5-8]'
pattern = re.compile(target_regex, re.IGNORECASE)
mylists=['p-1.1', 'P-5']
target_object_is_found = [x for x in mylists if not pattern.match(x)]
print("target_object_is_found: {}".format(target_object_is_found))
# => target_object_is_found: ['p-1.1']
Here, the P-[5-8] pattern is compiled with the re.IGNORECASE flag and is used to check each item inside mylists (see the list comprehension) with the compiled pattern's match method, which looks for a match at the start of the string only. The match result is negated (note the not after if), so all items are returned that do not start with the (?i)P-[5-8] pattern.
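If you prefer to keep the original negative lookahead, it works as well, as long as it is applied to each item rather than to the joined string (a small sketch of my own):
import re
pattern = re.compile(r'^(?!P-[5-8]).*', re.IGNORECASE)
mylists = ['p-1.1', 'P-5']
print([x for x in mylists if pattern.match(x)])
# => ['p-1.1']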

Splitting a string in Python based on a regex pattern

I have a bytes object that contains urls:
> body.decode("utf-8")
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
I need to split it into a list with each url as a separate element:
import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
urls = re.compile(pattern).split(body.decode("utf-8"))
What I get is a list of one element with all urls pasted together:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']
How do I split each url into a separate element?
Try splitting it with \s+
Try this sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.compile('\s+').split(s)
print(urls)
This outputs:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']
Does this result look OK? If not, we can refine it to get exactly what you want.
In case you don't want the empty string ('') in your result list (caused by the trailing \r\n), you can use findall to extract all the URLs from the string. Sample Python code for the same follows:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.findall('http.*?(?=\s+)', s)
print(urls)
This gives the following output:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
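For completeness (my addition, not part of the answers above): since the URLs are separated by runs of whitespace, plain str.split() with no arguments does the same job without a regex and never produces empty strings:
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
print(s.split())
# ['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']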