Python3 regex: Keep some Emojis, discard the rest - regex

noob here. I have strings where I want to keep some emoji and to discard the rest.
INPUT:
This book is so funny❤️. This book 📖 is the bomb(AS IN THE BEST
IN THE WORLD 🌎 🌍🗺 )I love 💗 💕 💖 it!I definitely recommend it!'
DESIRED OUTPUT:
This book is so funny❤️. This book is the bomb(AS IN THE BEST
IN THE WORLD )I love 💗 💕 💖 it!I definitely recommend it!'
I have the re.compile that matches:
my emoji
all emoji Removing Emoticons from..... See David Mabodo answer
I don't know how to put it together in re.compile that excludes one from the other. Alternatively keep alphanumeric, punctuation, and my emoji, and substitute the rest to "".
mytext = This book is so funny❤️. This book 📖 is the bomb(AS IN THE BEST
IN THE WORLD 🌎 🌍🗺 )I love 💗 💕 💖 it!I definitely recommend it!'
# Desired out put:
# u'This book is so funny❤️. This book is the bomb(AS IN THE BEST
IN THE WORLD )I love 💗 💕 💖 it!I definitely recommend it!'
print ("Original text:")
print (mytext, "\n")
# Strip out emoticon modifiers, leaving a simplified emoticon to work with.
# https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
# https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
Emoji_Modifiers = re.compile(u'([\U0000FE00-\U0000FE0F])|([\U000E0100-\U000E0100])')
mytext_mod_gone = Emoji_Modifiers.sub(r'', mytext)
print ("Modifiers Removed:")
print (mytext_mod_gone, "\n")
# All emoticons
find_regex = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
# Heart emoticons
#find_regex = re.compile(u"([\U00002619])|([\U00002661])|([\U00002665])|([\U00002763])|([\U00002764])|([\U00002765])|([\U00002766])|([\U00002767])|([\U00002E96])|([\U00002E97])|([\U00002F3C])|([\U0001F394])|([\U0001F48C])|([\U0001F48F])|([\U0001F491])|([\U0001F493])|([\U0001F494])|([\U0001F495])|([\U0001F496])|([\U0001F497])|([\U0001F498])|([\U0001F499])|([\U0001F49A])|([\U0001F49B])|([\U0001F49C])|([\U0001F49D])|([\U0001F49E])|([\U0001F49F])|([\U0001F4D6])|([\U0001F5A4])|([\U0001F60D])|([\U0001F618])|([\U0001F63B])|([\U0001F970])|([\U0001F9E1])")
# Alphanumeric + punctuation for an alternative solution
#find_regex = re.compile(r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?#\^_`{|}~\s]") #
mytext_emoji_gone = find_regex.sub(r'', mytext)
I am falling down at:
Negating unicode with a Negative Lookbehind (?<!...). I don't understand the operands well enough, and regex101.com only works with r', not u'.
Combining multiple regex together in a re.compile. Say if I wanted to keep alphanumeric and my emoji, it complains when I do re.compile(u'(\Uxxxx)' | r'(regex)' ). unsupported operand type(s) for |: 'str' and 'str', so a OR type statement does not work here...and an OR gives undesired results.
Could I have some help with either:
Ignoring a subset of emoticons and deleting the rest (my preferred solution)
Keeping (alphanumeric, punctuation, and my emoticons), and deleting the rest.
A specific question: Can you 'stack' re.compiles? IE create 2 different re.compiles to match (or not match) things, then join them together.

regex101 has a Unicode option, it is a flag you can turn on from the right side of the regex box.
I think the easiest thing to do is to find all the emojis in the string except for the ones you want to keep and replace them with an empty string like you wanted to do. To do that you can use a regex that will find any emoji (for this example I'll use [\U00010000-\U0010ffff] but I'm sure there are better ones out there so use one of those) and add a negative look ahead to ignore the emoji you wish to keep.
The finale regex should look similar to this:
(?![\u2764])[\U00010000-\U0010ffff]
The first part (?![\u2764]) will make sure the match is not an emoji you wish to keep and the second part [\U00010000-\U0010ffff] will make sure it's an emoji
You can add all the other emojis you wish to keep in the square brackets (?![\u2764
here ])

I went with:
find_regex = re.compile(u"(?![\U00002619])(?![\U00002661])(?![\U00002665])(?![\U00002763])(?![\U00002764])(?![\U00002765])(?![\U00002766])(?![\U00002767])(?![\U00002E96])(?![\U00002E97])(?![\U00002F3C])(?![\U0001F394])(?![\U0001F48C])(?![\U0001F48F])(?![\U0001F491])(?![\U0001F493])(?![\U0001F494])(?![\U0001F495])(?![\U0001F496])(?![\U0001F497])(?![\U0001F498])(?![\U0001F499])(?![\U0001F49A])(?![\U0001F49B])(?![\U0001F49C])(?![\U0001F49D])(?![\U0001F49E])(?![\U0001F49F])(?![\U0001F4D6])(?![\U0001F5A4])(?![\U0001F60D])(?![\U0001F618])(?![\U0001F63B])(?![\U0001F970])(?![\U0001F9E1])"r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?#\^_`{|}~\s]")
mytext_emoji_gone = find_regex.sub(r'', mytext)
which stripped out all other emoji, leaving only the heart and book emojis, and alphanumeric and punctuation.
As part of my original question, is there a way to stack those? Currently, that is one huge long line of code. Could we do something like:
regex = re.compile(a)
regex += re.compile(b)
That would use vertial real estate but I am ok with that

Related

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim r As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
Dim input As String = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} is known as a capture group, and is responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion, and it basically peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.

Complex regex situation

I have a results list that looks like this:
1lemon_king9mumu (2-1), YearofHell (2-0), kriswithak (2-1)0.44440.75000.4444
2mumu6lemon_king (1-2), MogwaiAC (2-0), Dathanja (2-1)0.66670.62500.5655
3MogwaiAC6Dathanja (2-0), mumu (0-2), Jebnarf (2-1)0.55560.57140.5417
4Jebnarf6YearofHell (2-1), kriswithak (2-0), MogwaiAC (1-2)0.44440.62500.4266
5YearofHell3Jebnarf (1-2), lemon_king (0-2), Mig82 (2-1)0.66670.37500.6012
6Dathanja3MogwaiAC (0-2), Mig82 (2-1), mumu (1-2)0.55560.37500.5417
7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750
8kriswithak0Jebnarf (0-2), lemon_king (1-2)0.83330.20000.6875
I want to be able to pull the username of the person AFTER the rank (first number) but it is mashed together with points gained by the player, as well as their first opponent.
For example, the first persons name is "Lemon_king", and his opponents were "Mumu", "YearofHell" and "Kriswithak". The numbers on the right are irrelevant for me, but the major problem I have is that the number of points won by the player is there. Lemon_King wins 9 points for first place. I would normally just get the name by looking for the string between 1 and 9, but players usernames can have a 9 in it as well.
Can anyone think of a good solution to this problem to be able to grab the persons username?
Thanks
I think you'd need a list of the usernames to compare against; it doesn't look like the results list is "regular" enough for a regular expression.
For example the line
7Mig823Bye, Dathanja
Could be "Mig82" 3 points vs "Bye, Dathanja", but it could also be "Mig8", 23 points, "Bye, Dathanja" or "Mig8", 2 points, "3Bye, Dathanja".
Is that correct? Because if it is, you aren't going to get away with a simple solution.
Edit: Wilson commented that getting the list of usernames might be an option. In that case, something like the following might work:
/^\d+?(username1|username2|username3)\d+?(username1|username2|username3)/
It will probably take some fiddling to get right.
Here's a plnkr demonstrating it with the data you provided: http://plnkr.co/edit/nJeGfbfHgvh5zJcTWRXS?p=preview
That said, a regex might not be the right tool for this job.
As far as I can tell, you want something like
(?x) # allow whitespace and comments just like
# any real programming language
^ # beginning of line
( \d+ ) # starts with one or more digits: CAPTURE 1
(?= \D ) # must have a non-digit following
( \w+ ) # capture one or more "word" characters: CAPTURE 2
( \d ) # next is a single digit: CAPTURE 3
(?= \D ) # must have a non-digit following
( \w+ ) # capture one or more "word" characters: CAPTURE 4
# now add things for the rest of the line if you want
Your username should now be in the second capture. I’ve been a tad more careful than strictly necessary, but if you end up munging this, you may need that. I’ve alos put all the captures in case you want to move stuff around or pull more stuff out.
Please provide a bit more information, if you want the thing between the first number and second number:
[0-9]+([^0-9])
The first group will contain the first username.
Please comment on this (so I check) an edit your question with more detail though.
I wouldnt use regex. It will be a pain to debug it, and you'll never be 100% certain you've covered all the edge cases.
Try doing 'manual' parsing using your language of choice's built in string manipulation functions.

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) is largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full text index. Also it must match according to the word position. eg. it should short circuit if the first word of the input is not found among the first word of the patterns. It should take care of the any match ie %s in the pattern.
One solution is to put the patterns in an in memory database (eg. redis) and do a full text index on it. (this will not match according to word position) but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in memory database. Also note that you are looking for the closest match. One or more words will not match. The highest number of matches is the pattern you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
"Hi": { "there": {"%s": null}},
"What: {"a": {"lovely": {"day": {"today": null}}}},
"Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
"Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index is recursive descending according to the word postion. So search for the first word, if found search for the next within the object returned by the first and so on. Same words at a given level will have only one key. You should also match the any case. This should be blinding fast in memory.
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re
def create_random_sentence():
nwords = random.randint(4, 10)
sentence = []
for i in range(nwords):
sentence.append("".join(random.choice(string.lowercase) for x in range(random.randint(3,10))))
ret = " ".join(sentence)
print ret
return ret
patterns = [ r"Hi there, [a-zA-Z]+.",
r"What a lovely day today!",
r"Lovely sunset today, [a-zA-Z]+, isn't it?",
r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
for i in range(95):
patterns.append(create_random_sentence())
monster_pattern = "|".join("(%s)"%x for x in patterns)
print monster_pattern
print "--------------"
monster_regexp = re.compile(monster_pattern)
inputs = ["Hi there, John.",
"What a lovely day today!",
"Lovely sunset today, John, isn't it?",
"Will you be meeting Linda today, John?",
"Goobledigoock"]*2000
for i in inputs:
ret = monster_regexp.search(i)
if ret:
print ".",
else:
print "x",
I've created a hundred patterns. This is the maximum limit of the python regexp library. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups. (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings 80% of which should match and 10% which will not. It short cirtcuits so if there's a success, it will be comparatively quick. Failures will have to run through the whole regexp so it will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re
patterns = [
"Hi there, %s.",
"What a lovely day today!",
"Lovely sunset today, %s, isn't it",
"Will you be meeting %s today, %s?"
]
def make_re_pattern(pattern):
# characters like . ? etc. have special meaning in regular expressions.
# Escape the string to avoid interpretting them as differently.
# The re.escape function escapes even %, so replacing that with XXX to avoid that.
p = re.escape(pattern.replace("%s", "XXX"))
return p.replace("XXX", "\w+")
# Join all the pattens into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))
def match(s):
"""Given an input strings, returns the matched pattern or None."""
m = rx.match(s)
if m:
# Find the index of the matching group.
index = (i for i, group in enumerate(m.groups()) if group is not None).next()
return patterns[index]
# Testing with couple of patterns
print match("Hi there, John.")
print match("Will you be meeting Linda today, John?")
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performances.
the problem isn't clear to me. Do you want to take the patterns and build regexes out of it?
Most regex engines have a "quoted string" option. (\Q \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
these will be regexes that match exactly the string you want outside your variables.
if you want to use a single regex to match just a single pattern, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
if you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
more infos at regular-expressions.info for comparison of Lookaround feature scroll down 2/3 on this page.
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

not autolinking all-numeric twitter hashtags in perl?

I'm producing HTML from twitter search results. Happily using the Net::Twitter module :-)
One of the rules in Twitter is that all-numeric hashtags are not links.
This allows to unambiguously tweet things like "ur not my #1 anymore", as in here: http://twitter.com/natarias2007/status/11246320622
The solution I came up with looks like:
$tweet =~ s{#([0-9]*[A-Za-z_]+[0-9]*)}{#$1}g;
It seems to work (let's hope), but I'm still curious... how would you do it?
EDIT: that regex i came up earlier was not correct!
see below for a better answer :-)
Your regexp wouldn't capture anchors that contain more than one letter separated by numbers, e.g. #a0a:
my #anchors = ($tweet =~ m/#(\w+)/g);
foreach my $anchor (#anchors)
{
next unless $anchor =~ m/[a-z]/i;
$tweet =~ s{#$anchor}{#$anchor}g;
}
e.g. consider my $tweet = "hello #123 hello #abc1a hello #a0a";
Your code produces hello #123 hello #abc1a hello #a0a
and mine produces hello #123 hello #abc1a hello #a0a
I didn't realize how complex twitter text is!
http://engineering.twitter.com/2010/02/introducing-open-source-twitter-text.html
I found these hashtag-related lines in the Ruby library that's linked in that blog post. Don't know much Ruby -- there may be more...
# Latin accented characters (subtracted 0xD7 from the range, it's a confusable multiplication sign. Looks like "x")
LATIN_ACCENTS = [(0xc0..0xd6).to_a, (0xd8..0xf6).to_a, (0xf8..0xff).to_a].flatten.pack('U*').freeze
REGEXEN[:latin_accents] = /[#{LATIN_ACCENTS}]+/o
# Characters considered valid in a hashtag but not at the beginning, where only a-z and 0-9 are valid.
HASHTAG_CHARACTERS = /[a-z0-9_#{LATIN_ACCENTS}]/io
REGEXEN[:auto_link_hashtags] = /(^|[^0-9A-Z&\/]+)(#|#)([0-9A-Z_]*[A-Z_]+#{HASHTAG_CHARACTERS}*)/io
I can't see a reason for handling `LATIN_ACCENTS' separately. If configured correctly, the \w shortcut should catch all those accented characters. Maybe it's different in Ruby... Maybe they had other reasons...
For now, I'm settling for something that looks like this
$tweet =~ s{#([0-9A-Z_]*[A-Z_]+\w+)}{#$1}gi
Can't say that it's solved yet...