How can I identify sentences within a text? - regex

I have text that looks like this:
"I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
Here, "ASP.NET" and "Node.js" are to be treated as words. Also, there is no space before "But I...", but it should be treated as a separate sentence.
The expected output is:
["I am an engineer"," I am skilled in ASP.NET","I also know Node.js","But I don't have much experience"]
Is there a way of doing this?

For your current input, you can use re.split() with the following regex pattern:
import re
s = "I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
result = re.split(r'\.(?=\s?[A-Z][^.]*? )', s)
print(result)
The output:
['I am an engineer', ' I am skilled in ASP.NET', ' I also know Node.js', "But I don't have much experience. "]
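(?=\s?[A-Z][^.]*? ) - a positive lookahead assertion; it ensures that the sentence delimiter . is followed by an optional whitespace character, an uppercase letter, and then non-period characters up to a space, i.e. the first word of the next sentence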
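A hedged follow-up sketch, not part of the original answer: the split above leaves leading spaces and the final period in place, so a small wrapper can normalize the pieces to the expected output. The helper name split_sentences is hypothetical, and the approach still assumes every sentence starts with an uppercase letter (a token such as "Node.JS " would be split).

```python
import re

# Same lookahead pattern as in the answer above.
SENTENCE_END = re.compile(r'\.(?=\s?[A-Z][^.]*? )')

def split_sentences(text):
    """Split on sentence-ending periods, then trim spaces and the trailing period."""
    parts = SENTENCE_END.split(text)
    cleaned = (p.strip().rstrip('.').strip() for p in parts)
    return [p for p in cleaned if p]

s = "I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
print(split_sentences(s))
```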

Related

Build regex with unknown chars/words in String between two known chars/words

I am relatively new to regex building and I am stuck on an example where I am not sure whether it is even possible. At least I didn't find a working solution with regex101 or a similar question on SO.
What I am trying to do:
> String1 = "NOTE: This is an examplary note about apples and other fruits."
> String2 = "NOTE: Here we have a simple note about bananas and other fruits."
> String3 = "NOTE: note about cherrys and other fruits."
For all these strings the regex should catch the specific fruit (apples, bananas, cherrys), which is not the problem.
The string always begins with the same word/chars (NOTE:), afterwards there could be any amount of chars until another group of words that is always the same (note about). Then the desired catch comes and another group of words that is always the same. So my problem is the regex for the "unknown" part between "NOTE:" and "note about ...". Is that even possible to put in a regex?
I hope you understand what I mean. Please respond if any clarification is needed here.
You could match the unknown part with  ?.* (an optional space followed by any run of characters; note that .* is greedy here, and the lazy form would be .*?). So the regex could look like:
NOTE: ?.* note about (\w+) and other fruits\.
https://regex101.com/r/HKkyOq/1
Let me know if this answered your question.
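For reference, a minimal Python sketch of the same idea applied to the three sample strings. It uses the lazy form .*? for the unknown part, which is a small variation on the regex above rather than the answerer's exact pattern; the names FRUIT, strings and fruits are made up for the illustration.

```python
import re

# Lazy .*? skips the unknown text between "NOTE:" and "note about".
FRUIT = re.compile(r"NOTE:.*?note about (\w+) and other fruits\.")

strings = [
    "NOTE: This is an examplary note about apples and other fruits.",
    "NOTE: Here we have a simple note about bananas and other fruits.",
    "NOTE: note about cherrys and other fruits.",
]

# group(1) holds the captured fruit name.
fruits = [FRUIT.search(s).group(1) for s in strings]
print(fruits)
```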

error: multiple repeat for regex in robot [duplicate]

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in other words, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group (in versions before Python 3.11, which added possessive quantifiers such as ++). In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the . character.
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
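One possible way to combine these fixes, as a rough sketch: re.escape for the term, an escaped period, and a character class for the single-character suffixes. The function name contains_term and the constant SUFFIX are hypothetical, and the suffix list simply mirrors the one in the question.

```python
import re

# Optional suffix: "'s", "s", a run of "!" or "?", or one punctuation character.
SUFFIX = r"""(?:'s|s|!+|\?+|[,.;:()"])?"""

def contains_term(term, text):
    """Return True if term appears in text with whitespace on both sides."""
    # re.escape makes characters such as + match literally.
    pattern = r"\s" + re.escape(term) + SUFFIX + r"\s"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(contains_term("google", "I love google!!! "))
print(contains_term("dog", "I love dogs "))
print(contains_term("c++", "i love c++ you"))
```

Because the pattern requires whitespace before the term, a term at the very start of the text is not found; whether that is wanted depends on the data.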
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python, simply write:
if term in string:
    pass  # do whatever
I have example_str = "i love you c++" and got the "multiple repeat" error when using it in a regex. The error occurs because the string contains "++", which the regex engine treats as special characters. My fix was to use re.escape(); here is my code:
import re
example_str = "i love you c++"
# word_filter and word_en are not defined in the original snippet; they are
# presumably the search term and the text being searched.
regex_word = re.search(rf'\b{re.escape(word_filter)}\b', word_en)
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However, if you want to match a literal word with a space before and after, try this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Find repeated pattern in a string of characters using R

I have a large text that contains expressions such as: "aaaahahahahaha that was a good joke". After processing, I want the "aaaahahahahaha" to disappear, or at least to be changed to simply "ha".
At the moment, I am using this:
gsub('(.+?)\\1', '', str)
This works when the string with the pattern is at the beginning of the sentence, but not when it is located anywhere else. So:
str <- "aaaahahahahaha that was a good joke"
gsub('(.+?)\\1', '', str)
#[1] "ha that was a good joke"
But
str <- "that was aaaahahahahaha a good joke"
gsub('(.+?)\\1', '', str)
#[1] "that was aaaahahahahaha a good joke"
This question might relate to this: find repeated pattern in python, but I can't find the equivalent in R.
I assume it is very simple and perhaps I am missing something trivial, but since regular expressions are not my strength and I have already tried a bunch of things that have not worked, I was wondering if someone could help me. The question is: How to find, and substitute, repeated patterns in a string of characters in R?
Thanks in advance for your time.
\b(\S+?)\1\S*\b
Use this. See the demo:
https://regex101.com/r/sJ9gM7/46
For R, use \\b(\\S+?)\\1\\S*\\b with the perl=TRUE option.
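To see what the pattern does, here is a small Python sketch of the same regex (Python rather than R, purely for illustration). The helper name drop_repeats is made up, and collapsing the leftover whitespace is an extra step that the answer does not mention.

```python
import re

# A word that starts with some chunk immediately repeated, plus the rest of the word.
REPEATED = re.compile(r'\b(\S+?)\1\S*\b')

def drop_repeats(text):
    """Remove words that begin with a repeated chunk, then tidy the spacing."""
    without = REPEATED.sub('', text)
    return ' '.join(without.split())

print(drop_repeats("that was aaaahahahahaha a good joke"))
print(drop_repeats("aaaahahahahaha that was a good joke"))
```

One caveat of this pattern: it also removes ordinary words that begin with a doubled chunk, such as "llama" or "mama", so it may need a stricter rule for real text.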

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings, though (the first group), are largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index, and it must match according to word position, e.g. it should short-circuit if the first word of the input is not found among the first words of the patterns. It should also handle the match-anything case, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and build a full-text index on it. This will not match according to word position, but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match, and the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index descends recursively according to word position: search for the first word; if found, search for the next word within the object returned by the first, and so on. Identical words at a given level share a single key. You should also handle the match-anything (%s) case. This should be blindingly fast in memory.
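A rough Python sketch of such a word-position index, under a few assumptions that are not in the answer: punctuation marks are kept as separate tokens, a %s placeholder matches exactly one token, and the end of a pattern is marked with a hypothetical "<end>" key instead of null. All names here (tokenize, build_index, find_pattern) are illustrative.

```python
import re

WILDCARD = "%s"
END = "<end>"  # hypothetical marker key; tokenize() never produces it

TOKEN = re.compile(r"%s|\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into words, %s placeholders and single punctuation marks."""
    return TOKEN.findall(text)

def build_index(patterns):
    """Build nested dicts keyed by the token at each position."""
    root = {}
    for pattern in patterns:
        node = root
        for token in tokenize(pattern):
            node = node.setdefault(token, {})
        node[END] = pattern  # remember which pattern ends here
    return root

def find_pattern(index, text):
    """Return the pattern that text matches, or None."""
    tokens = tokenize(text)

    def walk(node, i):
        if i == len(tokens):
            return node.get(END)
        # Try the literal token first, then the wildcard branch.
        for key in (tokens[i], WILDCARD):
            child = node.get(key)
            if child is not None:
                found = walk(child, i + 1)
                if found is not None:
                    return found
        return None

    return walk(index, 0)

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
index = build_index(patterns)
print(find_pattern(index, "Will you be meeting Linda today, John?"))
```

Lookup cost depends on the number of tokens in the input rather than on the number of patterns, which is the main appeal over trying each pattern in turn.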
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import re
import string

def create_random_sentence():
    """Build a random filler sentence of 4-10 lowercase words."""
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.ascii_lowercase)
                                for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print(ret)
    return ret

# The four real patterns; the literal . and ? are escaped.
patterns = [r"Hi there, [a-zA-Z]+\.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
# Pad the list with random sentences to stress performance.
for i in range(95):
    patterns.append(create_random_sentence())

# One big alternation with a capturing group per pattern.
monster_pattern = "|".join("(%s)" % x for x in patterns)
print(monster_pattern)
print("--------------")
monster_regexp = re.compile(monster_pattern)

inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000
for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print(".", end=" ")
    else:
        print("x", end=" ")
I've created a hundred patterns. This was the maximum number of groups the Python regexp library allowed at the time (the 100-group limit was lifted in Python 3.5). 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups: (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success, it will be comparatively quick. Failures will have to run through the whole regexp, so they will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . ? etc. have special meaning in regular expressions.
    # Escape the string to avoid interpreting them differently.
    # re.escape may escape % as well (depending on the Python version),
    # so %s is swapped for XXX before escaping.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, return the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = next(i for i, group in enumerate(m.groups()) if group is not None)
        return patterns[index]
    return None

# Testing with a couple of patterns
print(match("Hi there, John."))
print(match("Will you be meeting Linda today, John?"))
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and make it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the trie data structure.
This could be a job for sscanf; there is an implementation in JS: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performance.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Many regex engines (e.g. Perl, PCRE, Java) have a "quoted string" option (\Q ... \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want to use a single regex to match just a single pattern, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable, though. So that's another option if you think you might get stuck in regex hell.
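Python's re module does not support \Q...\E; re.escape plays a similar role. A minimal sketch of the same "quote the literal parts" idea, assuming each %s stands for a single word (\w+) and that the whole input must match; the helper names compile_template and find_template are hypothetical.

```python
import re

def compile_template(template):
    """Turn a printf-style template into a regex with one capture group per %s."""
    # Escape each literal piece, then join the pieces with a capturing wildcard.
    literal_parts = [re.escape(piece) for piece in template.split("%s")]
    return re.compile(r"(\w+)".join(literal_parts))

templates = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
# Compile once up front rather than on every lookup.
COMPILED = [(t, compile_template(t)) for t in templates]

def find_template(text):
    """Return (template, captured values) for the first full match, or None."""
    for template, rx in COMPILED:
        m = rx.fullmatch(text)
        if m:
            return template, m.groups()
    return None

print(find_template("Will you be meeting Linda today, John?"))
print(find_template("Goobledigoock"))
```

This checks the templates one at a time, so it trades speed for simplicity and also returns the values that filled each placeholder.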

Regex to pull out Suburbs from block of text

This may be an impossible task; I can't seem to find a useful answer on good old Google.
What I want to do is pull out the suburbs from a block of text. There is a general format, so I think it should be possible.
e.g. "Services in the areas of landsdale (WA) may be disrupted"
It is not always properly capitalised, may contain suburbs with multiple words (such as "South Coogee"), and may contain multiple suburbs. The suburbs always come after "area of" or "areas of" and always precede "(WA)".
I have very limited experience with regex, so I've got no idea where to even start. A solution would be great, but I am happy to be pointed in the right direction if no one here has the time/patience to develop a regex string query for this.
To be honest, Regex seems to me like overkill, so I wouldn't even bother, and just use native VBA string manipulation functions.
s = "Services in the area of landsdale (WA) may be disrupted"
prefix1 = "area of"
prefix2 = "areas of"
suffix = "(WA)"
' Is it "area" or "areas"?
If InStr(s, prefix1) > 0 Then
    prefix = prefix1
Else
    prefix = prefix2
End If
suburb = Trim(Mid(s, InStr(s, prefix) + Len(prefix) + 1, _
    InStr(s, suffix) - InStr(s, prefix) - Len(prefix) - 1))
Also, "the areas of landsdale (WA)" doesn't really make syntactical sense (why the plural?), which makes me suspect that you sometimes have phrases of the form: "the areas of landsdale (WA) and crumpetville (WA)" or "the areas of landsdale, crumpetville and metawan (WA)". But this is just speculation on my part.
I'd like to offer you the full-blown regex example for your reference. Personally I don't think it's very scary in this case :) I apologize that I'm not sure how this needs to be modified (if at all) for use in Outlook, but this is the function as it would be written in Excel.
Function ExtractSuburb(ByVal text As String)
    Dim RE As Object, allMatches As Object
    Set RE = CreateObject("vbscript.regexp")
    RE.pattern = "areas? of (.+) \(WA\)"
    RE.Global = True
    Set allMatches = RE.Execute(text)
    ExtractSuburb = allMatches.Item(0).submatches.Item(0)
End Function
Quite literally this pattern is telling the function to grab whatever is between "area/areas of " and " (WA)". I can see how the inner workings of Regex can be confusing, though, so hats off to Jean for offering a different solution.
Depending on your data you could probably ignore the first and last parts and only deal with "areas of landsdale (WA)". Using that the following regex works:
areas? of (.+?) \(WA\)
It matches 'area' or 'areas' of (the suburb) followed by '(WA)'.
I hope this helps and I can extend it to better fit your data if need be.
You are not indicating which regex dialect you want to use, but something like /areas? of (\w+(\s\w+)*?) \(WA\)/ should work in any reasonably Perl-flavored implementation. The *? selects as few repeated words as possible between "of" and "(WA)". If your text may have irregular spacing, you'll have to tweak for that.
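For comparison, a small Python sketch of the lazy pattern from the answers above, made case-insensitive. The list handling ("a, b and c" before a single "(WA)") is an assumption based on the speculation earlier in this thread, not a confirmed input format, and the names AREA and extract_suburbs are made up.

```python
import re

# Lazy capture of whatever sits between "area(s) of " and " (WA)".
AREA = re.compile(r"areas? of (.+?) \(WA\)", re.IGNORECASE)

def extract_suburbs(text):
    """Return the suburb names found after "area(s) of" and before "(WA)"."""
    suburbs = []
    for chunk in AREA.findall(text):
        # Assumed list format: names separated by commas and/or the word "and".
        for name in re.split(r",|\band\b", chunk):
            if name.strip():
                suburbs.append(name.strip())
    return suburbs

print(extract_suburbs("Services in the areas of landsdale (WA) may be disrupted"))
print(extract_suburbs("Services in the area of South Coogee (WA) may be disrupted"))
```

The \b around "and" keeps names such as "landsdale" intact, but a suburb whose name itself contains the word "and" would be split by this sketch.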