Conditionally extracting the beginning of a regex pattern - regex

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)

The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
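As a quick check, the pattern above can be wrapped in a small helper and run against both examples from the question. The name extract_actors is only illustrative, and the sketch assumes that names consist solely of word characters and spaces, as the character class [\w ] implies.

```python
import re

# Name at the start of each line, ending before " as " or at the end of the line.
ACTOR_RE = re.compile(r'^(?:[\w ]+?)(?:(?= as )|$)', re.MULTILINE)

def extract_actors(text):
    """Return the actor name found at the start of every line of `text`."""
    return ACTOR_RE.findall(text)

example_1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
example_2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'

print(extract_actors(example_1))
print(extract_actors(example_2))
```

Note that a line whose name contains a character outside [\w ], such as a hyphen, would not match at all, so the character class would need widening for that kind of data.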

You can do this without a regular expression as well.
Here is the code (splitting on ' as ' with both spaces, so that only the standalone word is treated as the separator):
output = [x.split(' as ')[0] for x in input.split('\n')]

I guess you can combine the values obtained from two regex alternatives:
re.findall(r'(?:\n)?(.+)(?:\W[a][s].*?)|(?:\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

regular expression

I am trying to find a regular expression which should satisfy the following needs.
It should identify all space(s) as separators until a colon has been passed twice. After that, it should continue to use spaces as separators until a third colon is found. This third colon should also be used as a separator, but the spaces directly before and after it should not count as separators of their own. Once this third colon has been found, no further separators should be recognized, whether they are spaces or colons.
2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController : z rest as async texting: json, special character, spacses.....
I would like the separators here to be identified as follows (each separator shown as X):
2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....
2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....
Exactly 8 separators are found here.
Any ideas how to do this with a regular expression?
My current approach, shown in the updates below, does not work.
Update:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s
That's my current syntax to identify the separators.
Update 2:
Regex above is faulty.
Update 3:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
This is the current regex. The goal is to call
df = pd.read_csv(file_name,
                 sep=r"(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
                 names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
                 engine='python',
                 nrows=100)
This could be extended later to dask, which gives me the chance to parse multiple log files into one dataframe.
The last column, line, is not identified correctly, for reasons I don't understand yet.
If that log format is sufficiently regular, you can take the lines apart much more easily with str.split.
The assumptions are that none of the first eight fields have an internal space, and that all of them are always present (or, if not all are present, that the last field, which starts after the colon, is also not present). You can then use the maxsplit argument to str.split in order to stop splitting when the ninth field starts:
def separate(logline):
    fields = logline.split(maxsplit=8)  # 8 space-separated fields + the rest
    if len(fields) > 8:
        # Fix up the ninth field. Perhaps you want to remove the colon:
        fields[8] = fields[8][1:]
        # or perhaps you want the text starting at the first non-whitespace
        # character after the colon:
        #
        # if fields[8][0] == ':':
        #     fields[8] = fields[8].split(maxsplit=1)[1]
        #
        # etc.
    return fields
>>> logline = ( "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
... + " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
... + " c.w.f.w.NiceController"
... + " : z rest as async texting: json, special character, spaces.....")
>>> separate(logline)
['2019-12-28', '13:00:00.112', 'DEBUG', 'n-somethingspecial.at', '---',
'[9999-118684]', '3894ß8349ß84930ßaa14e38eae18e3ebf',
'c.w.f.w.NiceController',
' z rest as async texting: json, special character, spaces.....']
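Building on the split above, here is a minimal sketch that labels the nine fields with the column names from the read_csv call in the question. The helper name parse_log_line is hypothetical, and the same assumptions apply: none of the first eight fields contains a space, and the message starts with the separating colon.

```python
# Column names taken from the `names=` argument in the question.
FIELD_NAMES = ['date', 'time', 'level', 'host', 'template',
               'threadid', 'logid', 'classmethods', 'line']

def parse_log_line(logline):
    """Split one log line into a dict keyed by FIELD_NAMES.

    Returns None when the line has fewer than nine fields.
    """
    fields = logline.split(maxsplit=8)
    if len(fields) < 9:
        return None
    message = fields[8]
    # Drop the separating colon, then the whitespace around the message.
    if message.startswith(':'):
        message = message[1:]
    fields[8] = message.strip()
    return dict(zip(FIELD_NAMES, fields))

sample = ("2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
          " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
          " c.w.f.w.NiceController"
          " : z rest as async texting: json, special character, spaces.....")

parsed = parse_log_line(sample)
print(parsed['level'], parsed['classmethods'])
print(parsed['line'])
```

A list of such dicts can be passed to pandas.DataFrame if a dataframe is the end goal.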
Solution
For now, my problem can be solved with the following regular expression.
(?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
Minor adaptations may still be needed, but for now it works pretty well.
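For comparison, a different technique is to match the whole line with one anchored pattern and named groups, instead of describing every separator with a lookbehind. The sketch below is only an illustration: the group names come from the names list in the question, the level alternatives are the four that appear in the regex above, and the bracketed thread id is assumed to contain no closing bracket.

```python
import re

LOG_LINE_RE = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+'
    r'(?P<level>DEBUG|INFO|WARN|ERROR)\s+'
    r'(?P<host>\S+)\s+'
    r'(?P<template>---)\s+'
    r'(?P<threadid>\[[^\]]*\])\s+'   # may contain spaces inside the brackets
    r'(?P<logid>\S+)\s+'
    r'(?P<classmethods>\S+)\s+:\s'
    r'(?P<line>.*)$'
)

def parse_with_regex(logline):
    """Return a dict of the nine fields, or None if the line does not match."""
    m = LOG_LINE_RE.match(logline)
    return m.groupdict() if m else None

sample = ("2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
          " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
          " c.w.f.w.NiceController"
          " : z rest as async texting: json, special character, spaces.....")

print(parse_with_regex(sample))
```

Because the thread id is matched as a bracketed unit, padding spaces inside the brackets do not shift the later columns.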

Entire text is matched but not able to group in named groups

I have the following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in the following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written the following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is based on a pattern that matches ONLY ONE key=value pair, applied by a function that finds all occurrences of the pattern. Most languages supply such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by = and a value. The value is designated by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all occurrences of the pattern within the text.
I tested it on regex101 and it worked.
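To get from the list of tuples to the structure asked for in the question, something like the following sketch might do. The helper name parse_record is made up for illustration, and it assumes the application name is simply everything before the first vertical bar.

```python
import re

# Matches exactly one key=value pair; findall applies it repeatedly.
PAIR_RE = re.compile(r'(\w+)=([^|]+)')

def parse_record(txt):
    """Return (app_name, pairs), where pairs is a list of key/value dicts."""
    # Everything before the first '|' is treated as the application name.
    app_name = txt.split('|', 1)[0]
    pairs = [{'key': k, 'value': v} for k, v in PAIR_RE.findall(txt)]
    return app_name, pairs

record = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
app_name, pairs = parse_record(record)
print(app_name)
print(pairs)
```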
I must comment, though, that the specific text pattern you are working on doesn't require a regex. All high-level languages supply a split function. You can split by the vertical bar, and then split each slice you get (except the first one) again by the equals sign.
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures property contains all the values captured into a group at all the iterations.
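For contrast, here is what the standard library re module does with the same kind of repeated group. This is only a demonstration of why the original attempt appears to "match everything but group nothing": with re, a group inside a repetition keeps just its last iteration. The named groups use Python's (?P<name>...) spelling.

```python
import re

txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pattern = r'(?P<appName>\w+)(?:\|(?P<key>\w+)=(?P<value>[^|]+))+'

m = re.fullmatch(pattern, txt)

# The whole string matches ...
print(m.group(0) == txt)
# ... but each named group only remembers its final iteration.
print(m.group('appName'), m.group('key'), m.group('value'))
```

So with plain re, the earlier pairs are consumed by the match but are not retrievable from the match object, which is the gap that captures() in the regex module fills.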
Not sure, but maybe regular expression might be unnecessary, and splitting similar to,
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
using dict:
temp = {"key":element[0],"value":element[1]}
temp can be changed to any other data structure that you would like to have.

python regex to remove extra characters from papers' doi

I am new to regex and I have a list of some papers' DOIs. Some of the DOIs include extra characters or strings, and I want to remove all of those extras. Here is the sample data:
10.1038/ncomms3230
10.1111/hojo.12033
blog/uninews #invalid
article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852utm_source=feedburner&utm;_medium=feed&utm;_campaign=Feed%3A+plosone%2FPLoSONE+%28PLOS+ONE+Alerts%3A+New+Articles%29
#want to extract 10.1371/journal.pone.0076852
utm_source=feedburner&utm;_medium=feed&utm;_campaign=Feed%3A+plosone%2 #invalid
10.1002/dta.1578
enhanced/doi #invalid
doi/pgen.1005204
doi:10.2135/cropsci2014.11.0791 #want to remove "doi:"
10.1126/science.aab1052
gp/about-springer
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
Some of the entries don't contain a DOI at all. I want to replace those with "".
Here is the regular expression that I came up with:
for doi in doi_lst:
    doi = re.sub(r"^[^10\.][^a-z0-9//\.]+", "", doi)
but it does nothing. I searched many other Stack Overflow questions but couldn't find one that covers my case. Kindly help me out here.
P.S. I am working with Python 3.
Assuming the pattern for DOIs is a substring starting with 10. and more digits, then /, and then one or more word or . characters, you may first convert the strings using urllib.parse.unquote (to turn percent-encoded sequences such as %2F into literal characters) and then use re.search with the \b10\.\d+/[\w.]+\b pattern to extract each DOI from the list items:
import re, urllib.parse
doi_list=["10.1038/ncomms3230", "10.1111/hojo.12033", "blog/uninews", "article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ", "utm_source=feedburner&utm;_medium=feed&utm;_campaign=Feed%3A+plosone%2",
"10.1002/dta.1578", "enhanced/doi", "doi/pgen.1005204", "doi:10.2135/cropsci2014.11.0791", "10.1126/science.aab1052", "gp/about-springer", "10.1038/srep14556","10.1002/rcm.7274", "10.1177/0959353515592899"]
new_doi_list = []
for doi in doi_list:
    doi = urllib.parse.unquote(doi)
    m = re.search(r'\b10\.\d+/[\w.]+\b', doi)
    if m:
        new_doi_list.append(m.group())
        print(m.group()) # DEMO
Output:
10.1038/ncomms3230
10.1111/hojo.12033
10.1371/journal.pone.0076852
10.1002/dta.1578
10.2135/cropsci2014.11.0791
10.1126/science.aab1052
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
To include empty items when there is no match, add an else: new_doi_list.append("") branch to the above code.
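Put together, the version that keeps one output item per input item could look roughly like this. The function name extract_doi is just for illustration, and the DOI shape is the same assumption as above (10., digits, a slash, then word characters or dots).

```python
import re
from urllib.parse import unquote

DOI_RE = re.compile(r'\b10\.\d+/[\w.]+\b')

def extract_doi(raw):
    """Return the first DOI-like substring of `raw`, or '' if there is none."""
    m = DOI_RE.search(unquote(raw))
    return m.group() if m else ''

samples = [
    '10.1038/ncomms3230',
    'blog/uninews',
    'article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ',
    'doi/pgen.1005204',
    'doi:10.2135/cropsci2014.11.0791',
]

cleaned = [extract_doi(s) for s in samples]
print(cleaned)
```

Using a list comprehension also sidesteps the issue in the question's loop, where reassigning doi inside the loop never changes the list itself.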

Matlab regexp; I would like to catch words between specific words

I would like to catch words between specific words in Matlab regular expression.
For example, if line = 'aaaa\bbbbb\ccccc....\wwwww.xyz' is given,
I would like to catch only wwwww.xyz.
The parts aaaa through wwwww.xyz are placeholders, not specific words or lengths: each can consist of any characters except a backslash and can be more than one character long. wwwww.xyz always comes after the last backslash. My problem is that regexp(line,'\\.+\.xyz','match') does not always work, since wwwww sometimes contains special characters such as '-'.
Any suggestion is appreciated.
If you must use a regex for this, this one should work:
[\\]?(?!.+\\)([^.]+\.[a-z]{3})
Working regex example:
http://regex101.com/r/fL5oS5
Example data:
aaaa\bbbbb\ccccc\ww%20-www.xyz
www-654_33.xyz
Matches:
1. ww%20-www.xyz
2. www-654_33.xyz
No solution provided here is likely to be 100% reliable unless you know that your data is carefully formatted (has the path string been escaped?). The question boils down to finding a word that is a valid path in a line of text. It is not so easy. We'll assume that all files have file extensions (this is not necessarily true in the context of paths). An arbitrary path might then look like any of the following:
'wwwww.x'
'wwwww.xyz'
'\wwwww.xyz'
'ccccc\wwwww.xyz'
'\ccccc\wwwww.xyz'
...
str = 'The quick brown fox aaaa\bbbbb\ccccc\wwwww.xyz jumped over the lazy dog.';
matches = regexp(str,'\s\\?([^.\s\\]+\\)*([^.\s]+\.\w+)\s','tokens');
file_name = matches{1}(2)
which returns (for all of the cases above, although the extension is slightly different for the first case)
file_name =
'wwwww.xyz'
If you know the filename extension is '.xyz', then you can use this instead:
matches = regexp(str,'\s\\?([^.\s\\]+\\)*([^.\s]+\.xyz)\s','tokens');
By the way, for a path, the fileparts function can be used:
str = 'aaaa\bbbbb\ccccc\wwwww.xyz'; % A Windows-only path
% str = 'aaaa/bbbbb/ccccc/wwwww.xyz'; % A UNIX or OS X path (works on Windows too)
[path_str,file_name,file_ext] = fileparts(str)
which returns
path_str =
aaaa\bbbbb\ccccc
file_name =
wwwww
file_ext =
.xyz
You can then get the filename with extension via
file_name_ext = [file_name file_ext];
Note also that path_str omits the trailing file separator.
Assuming that the only thing that your strings have in common is that there is a file path separator, and you are interested in everything "from the last file path separator to the first whitespace", then you could try
['[\' filesep ']([^\' filesep ']+?)(?:\s|$)']
which on Windows platform would reduce to
\\([^\\]+?)(?:\s|$)
Demo:
http://regex101.com/r/jW5tT1
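The Windows form of the pattern is not specific to MATLAB. Purely as an illustration (this is Python's re module, not regexp, and the helper name last_path_segment is made up), the same expression can be exercised like this:

```python
import re

# A backslash, then the shortest run of non-backslash characters that is
# followed by whitespace or the end of the string.
LAST_SEGMENT_RE = re.compile(r'\\([^\\]+?)(?:\s|$)')

def last_path_segment(text):
    """Return the text after the last backslash of the first path found."""
    m = LAST_SEGMENT_RE.search(text)
    return m.group(1) if m else None

print(last_path_segment(r'aaaa\bbbbb\ccccc\wwwww.xyz'))
print(last_path_segment(r'The quick brown fox aaaa\bbbbb\ccccc\ww%20-www.xyz jumped'))
```

A string without any backslash, such as www-654_33.xyz, gives None here, so a bare file name would need separate handling.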
If you want to match the extension literally (.xyz in your example), change it to
\\([^\\]+?\.xyz)(?:\s|$)
"Find a backslash followed by the fewest (+?) number of "not backslash" until literal .xyz followed by a white space or end of string"

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curncy
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curncy.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The month letter must be one of the letters listed in expr below, and the year can have one or two digits.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do that?
Assuming there is no leading or trailing whitespace and only uppercase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab, nor whether it supports Perl-compatible regular expressions. If it fails, try e.g. a literal space instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
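Since the expression itself is not tied to Matlab, here is a small sanity check of it in Python, offered only as an illustration of the three groups. The helper name split_ticker is hypothetical.

```python
import re

TICKER_RE = re.compile(
    r'^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$'
)

def split_ticker(ticker):
    """Return (root, month_year, yellow_key), or None if the ticker does not match."""
    m = TICKER_RE.match(ticker)
    return m.groups() if m else None

print(split_ticker('MCDZ3 Curncy'))
print(split_ticker('S H4 Comdty'))
```

Note that for a one-letter root the first group keeps the trailing space, so it may need stripping afterwards.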
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space [1].
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more generic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
[1] This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and/or trailing spaces, there is a very simple command for that:
monthyear = strtrim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach; basically, it locates the digits of the year and takes the letter just before them:
s = 'MCDZ3 Curncy'
p = regexp(s,'\d')
s(min(p)-1:max(p))