Get ISBN number with regex

ISBN numbers come with random dash positions
978-618-81543-7-7
9786-18-81-5437-7
97-86-18-81-5437-7
How could I get them every time without knowing the dash positions?

Just delete every - with your language of choice.
With Ruby :
"978-618-81543-7-7".delete('-')
#=> "9786188154377"
If you really want to use a regex :
"978-618-81543-7-7".gsub(/-/,'')
If you have multiple lines with isbns :
isbns = "978-618-81543-7-7
9786-18-81-5437-7
97-86-18-81-5437-7"
p isbns.scan(/\b[-\d]+\b/).map{|number_and_dash| number_and_dash.delete('-')}
#=> ["9786188154377", "9786188154377", "9786188154377"]

Pretty easy use of regex, google it to learn more. In Python:
import re
nums = re.findall(r'\d+', isbnstring)
This will give a list of the numbers. To join them to a string:
isbn=''.join(nums)
As per the comments below, if you're working with a file you could process it line by line:
with open(isbnfile) as desc:
    for isbnstring in desc:
        # Do the above and more.
        isbn = ''.join(re.findall(r'\d+', isbnstring))
That is just one example; there are a ton of ways to do this. From the command line, sed is a good choice as well:
sed 's/-//g' isbnfile > newisbnfile
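For completeness, a Python counterpart of the Ruby scan above might look like the sketch below. The helper name extract_isbns is made up for illustration, and the sketch assumes every run of digits and dashes in the text is an ISBN; it does not validate the length or the check digit.

```python
import re

def extract_isbns(text):
    """Return every digit-and-dash run in text with the dashes removed."""
    # \d[\d-]*\d: starts and ends with a digit, dashes allowed in between.
    candidates = re.findall(r"\b\d[\d-]*\d\b", text)
    return [candidate.replace("-", "") for candidate in candidates]

isbns = """978-618-81543-7-7
9786-18-81-5437-7
97-86-18-81-5437-7"""

print(extract_isbns(isbns))
```

All three sample lines should come out as the same 13-digit string, whatever the dash positions were.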

Python3 regex: Keep some Emojis, discard the rest

noob here. I have strings where I want to keep some emoji and to discard the rest.
INPUT:
This book is so funny❤️. This book 📖 is the bomb(AS IN THE BEST
IN THE WORLD 🌎 🌍🗺 )I love 💗 💕 💖 it!I definitely recommend it!'
DESIRED OUTPUT:
This book is so funny❤️. This book is the bomb(AS IN THE BEST
IN THE WORLD )I love 💗 💕 💖 it!I definitely recommend it!'
I have the re.compile that matches:
my emoji
all emoji Removing Emoticons from..... See David Mabodo answer
I don't know how to put it together in re.compile that excludes one from the other. Alternatively keep alphanumeric, punctuation, and my emoji, and substitute the rest to "".
import re
mytext = '''This book is so funny❤️. This book 📖 is the bomb(AS IN THE BEST
IN THE WORLD 🌎 🌍🗺 )I love 💗 💕 💖 it!I definitely recommend it!'''
# Desired output:
# u'This book is so funny❤️. This book is the bomb(AS IN THE BEST
# IN THE WORLD )I love 💗 💕 💖 it!I definitely recommend it!'
print ("Original text:")
print (mytext, "\n")
# Strip out emoticon modifiers, leaving a simplified emoticon to work with.
# https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
# https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
Emoji_Modifiers = re.compile(u'([\U0000FE00-\U0000FE0F])|([\U000E0100-\U000E01EF])')
mytext_mod_gone = Emoji_Modifiers.sub(r'', mytext)
print ("Modifiers Removed:")
print (mytext_mod_gone, "\n")
# All emoticons
find_regex = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
# Heart emoticons
#find_regex = re.compile(u"([\U00002619])|([\U00002661])|([\U00002665])|([\U00002763])|([\U00002764])|([\U00002765])|([\U00002766])|([\U00002767])|([\U00002E96])|([\U00002E97])|([\U00002F3C])|([\U0001F394])|([\U0001F48C])|([\U0001F48F])|([\U0001F491])|([\U0001F493])|([\U0001F494])|([\U0001F495])|([\U0001F496])|([\U0001F497])|([\U0001F498])|([\U0001F499])|([\U0001F49A])|([\U0001F49B])|([\U0001F49C])|([\U0001F49D])|([\U0001F49E])|([\U0001F49F])|([\U0001F4D6])|([\U0001F5A4])|([\U0001F60D])|([\U0001F618])|([\U0001F63B])|([\U0001F970])|([\U0001F9E1])")
# Alphanumeric + punctuation for an alternative solution
#find_regex = re.compile(r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?#\^_`{|}~\s]") #
mytext_emoji_gone = find_regex.sub(r'', mytext)
I am falling down at:
Negating unicode with a Negative Lookbehind (?<!...). I don't understand the operands well enough, and regex101.com only works with r', not u'.
Combining multiple regex together in a re.compile. Say if I wanted to keep alphanumeric and my emoji, it complains when I do re.compile(u'(\Uxxxx)' | r'(regex)' ). unsupported operand type(s) for |: 'str' and 'str', so a OR type statement does not work here...and an OR gives undesired results.
Could I have some help with either:
Ignoring a subset of emoticons and deleting the rest (my preferred solution)
Keeping (alphanumeric, punctuation, and my emoticons), and deleting the rest.
A specific question: Can you 'stack' re.compiles? IE create 2 different re.compiles to match (or not match) things, then join them together.
regex101 has a Unicode option, it is a flag you can turn on from the right side of the regex box.
I think the easiest thing to do is to find all the emojis in the string except for the ones you want to keep and replace them with an empty string like you wanted to do. To do that you can use a regex that will find any emoji (for this example I'll use [\U00010000-\U0010ffff] but I'm sure there are better ones out there so use one of those) and add a negative look ahead to ignore the emoji you wish to keep.
The final regex should look similar to this:
(?![\u2764])[\U00010000-\U0010ffff]
The first part (?![\u2764]) will make sure the match is not an emoji you wish to keep, and the second part [\U00010000-\U0010ffff] will make sure it's an emoji.
You can add all the other emojis you wish to keep in the square brackets: (?![\u2764 here ])
I went with:
find_regex = re.compile(u"(?![\U00002619])(?![\U00002661])(?![\U00002665])(?![\U00002763])(?![\U00002764])(?![\U00002765])(?![\U00002766])(?![\U00002767])(?![\U00002E96])(?![\U00002E97])(?![\U00002F3C])(?![\U0001F394])(?![\U0001F48C])(?![\U0001F48F])(?![\U0001F491])(?![\U0001F493])(?![\U0001F494])(?![\U0001F495])(?![\U0001F496])(?![\U0001F497])(?![\U0001F498])(?![\U0001F499])(?![\U0001F49A])(?![\U0001F49B])(?![\U0001F49C])(?![\U0001F49D])(?![\U0001F49E])(?![\U0001F49F])(?![\U0001F4D6])(?![\U0001F5A4])(?![\U0001F60D])(?![\U0001F618])(?![\U0001F63B])(?![\U0001F970])(?![\U0001F9E1])"r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?#\^_`{|}~\s]")
mytext_emoji_gone = find_regex.sub(r'', mytext)
which stripped out all other emoji, leaving only the heart and book emojis, and alphanumeric and punctuation.
As part of my original question, is there a way to stack those? Currently, that is one huge long line of code. Could we do something like:
regex = re.compile(a)
regex += re.compile(b)
That would use vertical real estate, but I am OK with that.
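On the "stacking" question: compiled pattern objects cannot be added together, but the pattern strings can be assembled piece by piece and compiled once at the end. The sketch below is only one way to lay that out. The KEEP list is a hypothetical selection of emoji to preserve, the helper name strip_other_emoji is made up, and the "any emoji" ranges are the ones from the question's "all emoticons" regex, not a complete emoji definition.

```python
import re

# Hypothetical keep-list: heavy black heart, open book, three heart emoji.
KEEP = [
    "\u2764",
    "\U0001F4D6",
    "\U0001F495",
    "\U0001F496",
    "\U0001F497",
]

# Each piece is a plain string; the pieces are joined before compiling.
pieces = [
    # Negative lookahead: the next character is not one we want to keep.
    "(?![" + "".join(KEEP) + "])",
    # Character class built from the ranges used in the question.
    "[\u2600-\u27BF"
    "\U0001F300-\U0001F64F"
    "\U0001F680-\U0001F6FF]",
]
find_regex = re.compile("".join(pieces))

def strip_other_emoji(text):
    """Delete every emoji in the ranges above except those in KEEP."""
    return find_regex.sub("", text)

print(strip_other_emoji("love \U0001F497 world \U0001F30E book \U0001F4D6"))
```

Keeping the emoji in a list means each one sits on its own line and a single lookahead with one character class replaces the long chain of (?![...]) groups.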

can regex be used to index/slice parts of string?

So I have a list of serial numbers in the following format:
Serial Number: CN073GTT74445714892L
I was wondering if regex can be used to extract just the last 6 chars?
So in this case, it is 14892L
Forgot to mention: there is other unrelated text in the document, so how would I make sure the match pattern always comes after "Serial Number: "?
EDIT - this worked (?<=\s.{29}).{6}$
You can do it with a regex:
.{6}$
But you can do it without a regex, and that is the advisable solution. E.g. in Ruby:
"CN073GTT74445714892L"[-6..-1]
in Python:
In [4]: "CN073GTT74445714892L"[-6:]
Out[4]: '14892L'
Regex is ideally used to identify patterns. If it's only the last 6 digits you're interested in, then a normal string manipulation will work too.
E.g. in Python, you could use (avoiding the name str, which would shadow the built-in type):
serial = "CN073GTT74445714892L"
serial[-6:]
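Since the follow-up asks for the match to come only after "Serial Number: " when other text is present, here is one possible sketch that combines the label with the last-six-characters idea. The sample document and the name SERIAL_TAIL are invented for illustration, and it assumes a serial number is a single run of non-whitespace characters at least six characters long.

```python
import re

# Label, optional whitespace, then lazily skip serial characters until exactly
# six remain before the next whitespace (or the end of the text).
SERIAL_TAIL = re.compile(r"Serial Number:\s*\S*?(\S{6})(?!\S)", re.IGNORECASE)

document = """Some unrelated text
Serial Number: CN073GTT74445714892L
more unrelated text 123456789
serial number: AB12XY99887766Z
"""

tails = SERIAL_TAIL.findall(document)
print(tails)
```

An equally reasonable route is to capture the whole serial with a simpler pattern and slice it with [-6:] afterwards, in line with the advice above.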

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings (the first group), though, are largely unpredictable. Most of them will match one of the patterns, but some will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index, and it must match according to word position, e.g. it should short-circuit if the first word of the input is not found among the first words of the patterns. It should also handle the match-anything placeholder, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and build a full-text index on it. This will not match according to word position, but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match, and the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
"Hi": { "there": {"%s": null}},
"What": {"a": {"lovely": {"day": {"today": null}}}},
"Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
"Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index descends recursively according to word position: search for the first word; if it is found, search for the next word within the object returned by the first, and so on. Identical words at a given level share a single key. You should also handle the match-anything case (%s). This should be blindingly fast in memory.
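The word-position index described above could be sketched in Python roughly as follows. Everything here is an illustrative assumption, not part of the original answer: the function names, the use of None as an end-of-pattern key, the tokenizer that drops punctuation, and the rule that %s stands for exactly one word.

```python
import re

WILDCARD = "%s"
END = None  # hypothetical marker key: a pattern ends at this node

def tokenize(text):
    # Split into words and drop punctuation; "%s" survives as its own token.
    return re.findall(r"%s|[A-Za-z0-9']+", text)

def build_index(patterns):
    """Build a nested dict keyed by word position, one level per word."""
    root = {}
    for pattern in patterns:
        node = root
        for word in tokenize(pattern):
            node = node.setdefault(word, {})
        node[END] = pattern
    return root

def lookup(node, words):
    """Return the pattern reached by following words, or None."""
    if not words:
        return node.get(END)
    head, rest = words[0], words[1:]
    # Try the literal word first, then fall back to the wildcard branch.
    for key in (head, WILDCARD):
        child = node.get(key)
        if child is not None:
            found = lookup(child, rest)
            if found is not None:
                return found
    return None

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
index = build_index(patterns)

def match(text):
    return lookup(index, tokenize(text))

print(match("Will you be meeting Linda today, John?"))
print(match("Goobledigoock"))
```

The lookup stops as soon as a word has neither a literal nor a wildcard branch, which is the short-circuit behaviour the answer asks for. Names made of several words would need a looser wildcard rule than this sketch uses.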
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re

def create_random_sentence():
    # Build a filler pattern of 4-10 random lowercase words.
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.ascii_lowercase) for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print(ret)
    return ret

# Literal . and ? are escaped so they are not treated as regex operators.
patterns = [r"Hi there, [a-zA-Z]+\.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
for i in range(95):
    patterns.append(create_random_sentence())
monster_pattern = "|".join("(%s)" % x for x in patterns)
print(monster_pattern)
print("--------------")
monster_regexp = re.compile(monster_pattern)
inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000
for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print(".", end=" ")
    else:
        print("x", end=" ")
I've created about a hundred patterns (4 real ones plus 95 random ones). That was the maximum number of groups the Python regexp library allowed at the time; the limit was removed in Python 3.5. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with one group per pattern: (group1)|(group2)|(group3)|.... That's the monster_regexp. You'll have to sanitise the patterns for characters that have a meaning in regular expressions (like ? etc.).
Testing one string against this tests it against all the patterns in a single shot. There are methods that fetch the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success it will be comparatively quick. Failures have to run through the whole regexp, so they will be slower. You can order the patterns by how frequently they occur in the input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . ? etc. have special meaning in regular expressions.
    # Escape the string to avoid interpreting them differently.
    # Swap %s for XXX first so that re.escape cannot touch the placeholder.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, returns the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = next(i for i, group in enumerate(m.groups()) if group is not None)
        return patterns[index]

# Testing with a couple of patterns
print(match("Hi there, John."))
print(match("Will you be meeting Linda today, John?"))
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern, making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performance.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Most regex engines have a "quoted string" option (\Q \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want to use a single regex to find out which pattern matched, you can put the patterns in groups, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable, though. So that's another option if you think you might get stuck in regex hell.
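Python's re module has no \Q...\E, but re.escape plays the same role. As a sketch of the one-regex-per-template idea from the answers above, the code below quotes the literal parts of each template, puts a capture group where each %s was, and anchors the match with fullmatch. The helper names template_to_regex and find_template are hypothetical, and the lazy (.+?) group is an assumption about what a placeholder may contain.

```python
import re

def template_to_regex(template):
    """Quote the literal parts and capture whatever fills each %s."""
    literal_parts = template.split("%s")
    body = "(.+?)".join(re.escape(part) for part in literal_parts)
    return re.compile(body)

templates = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
compiled = [(t, template_to_regex(t)) for t in templates]

def find_template(text):
    """Return (template, captured values), or None if nothing matches."""
    for template, regex in compiled:
        # fullmatch anchors both ends, like ^...$ in the examples above.
        m = regex.fullmatch(text)
        if m:
            return template, m.groups()
    return None

print(find_template("Will you be meeting Linda today, John?"))
print(find_template("Goobledigoock"))
```

Looping over separately compiled templates avoids any cap on groups per regex and also returns the values that filled the placeholders, at the cost of one match attempt per template.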

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead. What you are looking for is a regex positive lookbehind, which goes like this: (?<=expression)pattern. However, this feature is not supported by all flavors of regex; it depends on which language you are using.
More info is at regular-expressions.info; for a comparison of lookaround support, scroll down about 2/3 of the way on this page.
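The question does not say which language is in use, so purely as an illustration, here is how the lookbehind reads in Python, whose re module supports fixed-width lookbehind. The sample output is copied from the question; the variable names are made up. A plain capture group is shown next to it because it gives the same number in flavors without lookbehind.

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
"""

# Positive lookbehind: the match itself contains only the digits.
lookbehind_match = re.search(r"(?<=failures = )[0-9]+", output)
failures = int(lookbehind_match.group())

# Alternative without lookbehind: match the label, capture only the digits.
capture_match = re.search(r"failures = ([0-9]+)", output)
failures_from_group = int(capture_match.group(1))

print(failures, failures_from_group)
```

With the capture group there is no need to trim the label by hand: group(1) already holds just the number.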
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and such work, you might consider using Perl from within Matlab. It gives you access to the powerful regex engine of Perl and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
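This does not help with the MATLAB limitation reported in the update, but for comparison, the lookahead idea can be sketched in an engine that allows capture groups inside a lookahead, such as Python's re module. The log text is the sample from the question. The field groups use [^\n]* instead of .*? so that re.DOTALL only affects the skipped section; that is a choice made for this sketch.

```python
import re

log = (
    "%%%% 09-May-2009 04:10:29\n"
    "% Starting foo\n"
    "this is stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:10:50\n"
    "% Starting bar\n"
    "more stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:11:29\n"
)

period = re.compile(
    r"%{4} (?P<start>[^\n]*)\n"         # opening delimiter and start time
    r"% Starting (?P<name>[^\n]*)\n"    # name of the period
    r".*?"                              # lines to ignore (DOTALL spans newlines)
    r"(?=%{4} (?P<stop>[^\n]*)\n)",     # next delimiter: captured, not consumed
    re.DOTALL,
)

periods = [m.groupdict() for m in period.finditer(log)]
for p in periods:
    print(p["start"], p["name"], p["stop"])
```

Because the closing delimiter is matched only inside the lookahead, the next search starts at that same delimiter, so every period is found and not every other one.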