Python regex to remove extra characters from papers' DOIs

I am new to regex. I have a list of papers' DOIs, and some of them include extra characters or strings that I want to remove. Here is the sample data:
10.1038/ncomms3230
10.1111/hojo.12033
blog/uninews #invalid
article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2FPLoSONE+%28PLOS+ONE+Alerts%3A+New+Articles%29
#want to extract 10.1371/journal.pone.0076852
utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2 #invalid
10.1002/dta.1578
enhanced/doi #invalid
doi/pgen.1005204
doi:10.2135/cropsci2014.11.0791 # want to remove "doi:"
10.1126/science.aab1052
gp/about-springer
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
Now some of the entries don't have DOIs at all; I want to replace those with "".
Here is the regex I came up with:
for doi in doi_lst:
    doi = re.sub(r"^[^10\.][^a-z0-9//\.]+", "", doi)
but it does nothing. I searched many other Stack Overflow questions but couldn't find one that fits my case. Kindly help me out here.
P.S. I am working with Python 3.

Assuming the pattern for DOIs is a substring starting with 10. plus more digits, then /, then one or more word or . characters, you may convert the strings with urllib.parse.unquote first (to turn percent-encoded entities into literal characters) and then use re.search with the \b10\.\d+/[\w.]+\b pattern to extract each DOI from the list items:
import re, urllib.parse

doi_list = ["10.1038/ncomms3230", "10.1111/hojo.12033", "blog/uninews",
            "article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ",
            "utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2",
            "10.1002/dta.1578", "enhanced/doi", "doi/pgen.1005204",
            "doi:10.2135/cropsci2014.11.0791", "10.1126/science.aab1052",
            "gp/about-springer", "10.1038/srep14556", "10.1002/rcm.7274",
            "10.1177/0959353515592899"]
new_doi_list = []
for doi in doi_list:
    doi = urllib.parse.unquote(doi)
    m = re.search(r'\b10\.\d+/[\w.]+\b', doi)
    if m:
        new_doi_list.append(m.group())
        print(m.group())  # DEMO
Output:
10.1038/ncomms3230
10.1111/hojo.12033
10.1371/journal.pone.0076852
10.1002/dta.1578
10.2135/cropsci2014.11.0791
10.1126/science.aab1052
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
To keep an empty item when there is no match, add an else: new_doi_list.append("") branch to the code above.
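For reference, here is the same loop with that fallback in place (identical to the code above, plus the else branch):

import re, urllib.parse

new_doi_list = []
for doi in doi_list:
    doi = urllib.parse.unquote(doi)
    m = re.search(r'\b10\.\d+/[\w.]+\b', doi)
    if m:
        new_doi_list.append(m.group())
    else:
        # no DOI found in this entry, keep an empty placeholder
        new_doi_list.append("")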

Related

Tokenize a sentence where each word contains only letters using RegexTokenizer Scala

I am using Spark with Scala and trying to tokenize a sentence so that each word contains only letters. Here is my code:
def tokenization(extractedText: String): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  val textDataFrame = existingSparkSession.createDataFrame(Seq(
    (0, extractedText))).toDF("id", "sentence")
  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  val regexTokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W")
  val regexTokenized = regexTokenizer.transform(textDataFrame)
  regexTokenized.select("sentence", "words").show(false)
  return regexTokenized;
}
If I provide the sentence "I am going to school5", after tokenization it should contain only [i, am, going, to] and should drop school5. But with my current pattern it doesn't ignore digits within words. How am I supposed to drop words with digits?
You can use the settings below to get your desired tokenization. Essentially you extract words which only contain letters using an appropriate regex pattern.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setGaps(false)
  .setPattern("\\b[a-zA-Z]+\\b")
val regexTokenized = regexTokenizer.transform(textDataFrame)
regexTokenized.show(false)
+---+---------------------+------------------+
|id |sentence |words |
+---+---------------------+------------------+
|0 |I am going to school5|[i, am, going, to]|
+---+---------------------+------------------+
For the reason why I set gaps to false, see the docs:
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
You want to repeatedly match the regex, rather than splitting the text by a given regex.
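If it helps, the same split-vs-match distinction can be shown with plain Python regex functions (a minimal sketch, not Spark code; re.split plays the role of gaps=true and re.findall of gaps=false):

import re

s = "I am going to school5"
# gaps = true: the pattern marks the separators, so "school5" survives
print(re.split(r'\W+', s))  # ['I', 'am', 'going', 'to', 'school5']
# gaps = false: the pattern marks the tokens, so "school5" is dropped
print(re.findall(r'\b[a-zA-Z]+\b', s.lower()))  # ['i', 'am', 'going', 'to']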

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are newline characters, and we can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
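For reference, a runnable version using example 1 from the question (renaming input to cast so the builtin isn't shadowed):

import re

cast = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
names = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', cast, re.MULTILINE)
print(names)  # ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']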
You can do this without using regular expressions as well.
Here is the code:
output = [x.split(' as')[0] for x in input.split('\n')]
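Applied to example 1, this produces the same list:

text = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
print([x.split(' as')[0] for x in text.split('\n')])
# ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']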
I guess you can combine the values obtained from two regex alternatives:
output = re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem is that you are trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is based on a pattern that matches ONLY ONE key=value pair, applied by a function that finds all occurrences of the pattern. Every language supplies such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), together with a value. The value is designated by ([^|]+), that is, everything but a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
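A quick demonstration of that difference on the same text:

import re

txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
# search stops after the first key=value pair...
print(re.search(r'(\w+)=([^|]+)', txt).groups())  # ('key1', 'value1')
# ...while findall keeps scanning and returns every pair
print(re.findall(r'(\w+)=([^|]+)', txt))
# [('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]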
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you are working with doesn't require regex. All high-level languages supply a split function. You can split by the vertical bar, and then split each slice you get (except the first one) again by the equals sign.
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures method returns all the values captured by a group across all its iterations.
Not sure, but maybe a regular expression is unnecessary, and splitting along these lines,
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
Since temp is a dict:
temp = {"key": element[0], "value": element[1]}
it can be changed to whatever data structure you prefer.

How to make TfidfVectorizer only learn alphabetical characters as part of the vocabulary (exclude numbers)

I'm trying to extract a vocabulary of unigrams, bigrams, and trigrams using SkLearn's TfidfVectorizer. This is my current code:
max_df_param = .003
use_idf = True
vectorizer = TfidfVectorizer(max_df = max_df_param, stop_words='english', ngram_range=(1,1), max_features=2000, use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
unigrams = vectorizer.get_feature_names()
vectorizer = TfidfVectorizer(max_df = max_df_param, stop_words='english', ngram_range=(2,2), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
bigrams = vectorizer.get_feature_names()
vectorizer = TfidfVectorizer(max_df = max_df_param, stop_words='english', ngram_range=(3,3), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
trigrams = vectorizer.get_feature_names()
vocab = np.concatenate((unigrams, bigrams, trigrams))
However, I would like to avoid numbers and words that contain numbers; the current output contains terms such as:
0 101 110 12 15th 16th 180c 180d 18th 190 1900 1960s 197 1980 1b 20 200 200a 2d 3d 416 4th 50 7a 7b
I tried to include only words with alphabetical characters using the token_pattern parameter with the following regex:
vectorizer = TfidfVectorizer(max_df=max_df_param,
                             token_pattern=u'(?u)\b\^[A-Za-z]+$\b',
                             stop_words='english', ngram_range=(1,1),
                             max_features=2000, use_idf=use_idf)
but this returns: ValueError: empty vocabulary; perhaps the documents only contain stop words
I have also tried only removing numbers but I still get the same error.
Is my regex incorrect? or am I using the TfidfVectorizer incorrectly? (I have also tried removing max_features argument)
Thank you!
That's because your regex is wrong.
1) You are using ^ and $, which denote the start and end of the string. That means the pattern would only match a complete string consisting solely of letters (no numbers, no spaces, no other special characters). You don't want that, so remove them.
See the details about special characters here: https://docs.python.org/3/library/re.html#regular-expression-syntax
2) You are passing the pattern as a plain (non-raw) string without escaping the backslashes, so Python itself consumes each backslash before the regex engine ever sees it (\b becomes a backspace character rather than a word boundary). Either double the backslashes or use the r prefix.
3) The u prefix is for Unicode. Unless your regex pattern contains special Unicode characters, it is not needed.
See more about that here: Python regex - r prefix
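A quick way to see point 2: in a normal string literal \b is the backspace character, while the r prefix keeps the backslash intact for the regex engine:

print(len('\b'), repr('\b'))    # 1 '\x08'  (backspace, not a word boundary)
print(len(r'\b'), repr(r'\b'))  # 2 '\\b'   (backslash + b, a regex word boundary)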
So finally your correct token_pattern should be:
token_pattern=r'(?u)\b[A-Za-z]+\b'
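To confirm the fix, here is a minimal sketch on two made-up documents (the sample texts and variable names are just for illustration; on scikit-learn versions before 1.0, use get_feature_names() instead of get_feature_names_out()):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the 1980 model sold 200 units", "the 2d and 3d models sold well"]
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b[A-Za-z]+\b')
X = vectorizer.fit_transform(docs)
# purely numeric and digit-containing tokens (1980, 200, 2d, 3d) are excluded
print(vectorizer.get_feature_names_out())
# ['and' 'model' 'models' 'sold' 'the' 'units' 'well']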

Regular expression: Matching multiple occurrences in the same line

I have a string that I need to match using regex. It works perfectly fine when there is a single occurrence on a line; however, when there are multiple occurrences of the same string on a single line, I'm not getting any matches. Can you please help?
Sample strings:
MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T
Regex that I tried:
(([A-Z]{2}[0-9]{8,9}[A-Z]{1})|([A-Z]{2}[0-9]{8,9}))
This seems to work fine:
a = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
import re
patterns = ['[A-Z]{2}[0-9]{8,9}[A-Z]{1}','[A-Z]{2}[0-9]{8,9}']
pattern = '({})'.format(')|('.join(patterns))
matches = re.findall(pattern, a)
print([match for sub in matches for match in sub if match])
#['MS17010314', 'MS00030208', 'IL00171198', 'IH09850115', 'IH99400409',
# 'IH99410409', 'IL01771010', 'IL01791002', 'IL01930907', 'IL02360907',
# 'CM00010904', 'IH09520115', 'MS00201285', 'MS19050708', 'MS00370489',
# 'MS19011285T']
I've added a way to combine all patterns.
I tried using Python and the following code worked:
import re
s='''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
a = '[A-Z]{2}[0-9]{8,9}[A-Z]'
b = '[A-Z]{2}[0-9]{8,9}'
lst_of_regex = [a, b]
pattern = '|'.join(lst_of_regex)
print(re.findall(pattern,s))
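As a side note, since the two alternatives differ only in an optional trailing letter, a single pattern with [A-Z]? would also cover both cases here:

import re

s = 'MS00201285 MS19050708 MS00370489 MS19011285T'
print(re.findall(r'[A-Z]{2}[0-9]{8,9}[A-Z]?', s))
# ['MS00201285', 'MS19050708', 'MS00370489', 'MS19011285T']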