Regular Expression to match sentences

I'm trying to write a regular expression in Python that matches sentences. The one I've found that mostly works is [^\.\?\!].*?[\.\?\!], but it makes a few errors on the test sentences below. You can check this on https://regex101.com/. I'm looking for a regular expression that handles all the problem cases below, such as ellipses, honorifics, and abbreviations like i.e.
For performing tokenization in languages other than English, we can
load the respective language pickle file found in tokenizers/punkt and
then tokenize the text in another language, which is an argument of
the tokenize() function. For the tokenization of French text, we will
use the french.pickle file as follows: Mr. Smith bought cheapsite.com
for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam
Jones Jr. thinks he didn't. In any case, this isn't true... Well, with
a probability of .9 it isn't.
P.S. In case you're wondering, I got the sentences above from a natural language processing book and from another Stack Overflow question on the same subject.

The easiest way is to split the task into three operations.
Substitute i.e., ellipses, and whatever else you want to protect with markers that contain no dots, such as ###ie### and ###ellipsis###.
Match the sentences.
After that, restore i.e. and the ellipses.
Update: some code showing how to do it. You have to do the substitution for each dotted item you want to exclude from the sentence matcher.
import re
sentences = re.sub(r'i\.e\.', '###ie###', sentences)
# re.findall returns every match as a string; re.match would return only one match object, anchored at the start
matches = re.findall(r'[^.?!].*?[.?!]', sentences)
matches = [m.replace('###ie###', 'i.e.').strip() for m in matches]
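The three operations can be put together as in the minimal sketch below. The PLACEHOLDERS table, the marker names, and the split_sentences helper are hypothetical choices for illustration, and the lookahead that requires whitespace after the terminator is an addition that is not part of the answer above.

```python
import re

# Hypothetical placeholder table; extend it with every dotted token to protect.
PLACEHOLDERS = {
    "...": "###ellipsis###",
    "i.e.": "###ie###",
    "Mr.": "###mr###",
    "Jr.": "###jr###",
}

# A sentence starts at a non-space character and ends at the first . ? or !
# that is followed by whitespace or the end of the text, so the dots in
# "1.5" or "cheapsite.com" do not end a sentence.
SENTENCE = re.compile(r"\S.*?[.?!](?=\s|$)", re.DOTALL)

def split_sentences(text):
    # Step 1: hide the dotted tokens behind dot-free markers.
    for token, marker in PLACEHOLDERS.items():
        text = text.replace(token, marker)
    # Step 2: match the sentences.
    sentences = SENTENCE.findall(text)
    # Step 3: put the original tokens back.
    restored = []
    for sentence in sentences:
        for token, marker in PLACEHOLDERS.items():
            sentence = sentence.replace(marker, token)
        restored.append(sentence)
    return restored

sample = "Mr. Smith paid 1.5 million dollars, i.e. he paid a lot. Did he mind?"
print(split_sentences(sample))
# ['Mr. Smith paid 1.5 million dollars, i.e. he paid a lot.', 'Did he mind?']
```

One consequence of this approach is that a protected token never ends a sentence, so a sentence that finishes with an ellipsis or with "Jr." is merged with the one that follows it.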

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex, so that it matches only a subset of the original matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex (which could be quite a long operation) and then rewrite its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See the solution below, including @Nick's contribution. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
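As a quick check in Python, the lookahead can be prepended programmatically. This is a sketch only: the prefilter helper is a hypothetical name, and it wraps the original pattern in a non-capturing group (?:...) instead of the capturing parentheses shown above so that the group numbering of the original regex is left unchanged.

```python
import re

def prefilter(original, prefix):
    # (?=prefix) asserts that the match starts with the prefix without
    # consuming it; the original pattern then matches from the same position.
    return re.compile(r"(?=" + re.escape(prefix) + r")(?:" + original + r")")

words = ["cat", "car", "bat"]
pattern = prefilter(r"cat|car|bat", "ca")
print([w for w in words if pattern.fullmatch(w)])  # ['cat', 'car']

four_letters = prefilter(r"[a-z]{4}", "ca")
print(bool(four_letters.fullmatch("cake")))   # True
print(bool(four_letters.fullmatch("bake")))   # False
print(bool(four_letters.fullmatch("cakes")))  # False, five letters
```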
OK, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm using the great exrex tool to generate matching strings from this regex, and it produces strings that are 6 characters long rather than 4. This may be a limitation of exrex rather than of the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regex to find 1st part of string in 2nd part?

I'm trying to write a regex for use in Calibre (python) to find ebooks that have the series name in brackets in the title. I have a custom column with the series name and title separated by a "~", for example:
"The Series~The Book Title (The Series)"
Best I can come up with finds anything with at least one letter from the series name in brackets in the title:
(.+)~.*[\(\1\)].*
I only want to find those that have the whole of the first part of the string in brackets at the end of the second part. The title may also contain extra info.
Thanks.
This works in Notepad++:
(.+)~[^\(]*\(\1\).*
I'm not sure it will work the same in python, but regexp processors are usually very similar, so try it out.
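Your regex is pretty close. The outer square brackets in your pattern create a character class, which matches a single character rather than the whole sequence. You can change your regex slightly to this: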
(.+?)~.*[([]\1[)\]].*
This will match strings like:
The Series~The Book Title (The Series)
The Series~The Book Title [The Series]
However, if you only want to match the series name in parentheses, you can use:
(.+?)~.*[(]\1[)].*
or
(.+?)~.*\(\1\).*
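The last pattern can be checked with Python's re module, as in the short sketch below. The sample titles other than the one from the question are made up for illustration, and nothing here says how Calibre's own search applies the pattern.

```python
import re

# Group 1 captures the series name before "~"; \1 then requires the same
# text to appear again, wrapped in literal parentheses, in the title part.
pattern = re.compile(r"(.+?)~.*\(\1\).*")

titles = [
    "The Series~The Book Title (The Series)",
    "The Series~The Book Title (The Series) extra info",
    "The Series~The Book Title (Another Series)",
    "The Series~The Book Title",
]
for title in titles:
    print(bool(pattern.fullmatch(title)), title)
# True, True, False, False
```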
Thanks for the suggestions. They work perfectly in the python demo but for some unknown reason don't work in Calibre. Seems like one character is the most it will match from the capture group. Must be a limitation in the regex system Calibre uses.

Regex to match sentences with jumbled words but preserving sentence order

I want to match sentences in such a way that the words within a sentence can be in any order, but the sentences themselves must stay in the same order.
e.g.
My name is Sam. I love regex.
Acceptable input:
My Sam is name. regex I love.
name is My Sam. I regex love.
Invalid input:
I love regex. My name is Sam.
regex I love. is My name Sam.
Here is the regex I have come up with so far to solve the above problem:
^((?=.*\bMy\b)(?=.*\bSam\b)(?=.*\bis\b)(?=.*\bname\b))((?=.*\bregex\b)(?=.*\bI\b)(?=.*\blove\b)).*$
Which is not working as expected.
Can this problem be solved by regex? What would be the recommended approach to solve this?
Note: Please ignore the period (.). I am using it just for clarity.
I think you are looking for something other than regex. The most efficient way to do this would be to compare each sentence against an array of expected words and check that every one of them appears exactly once. How you do that depends entirely on the context you are working in. If you need a regex that literally finds what you stated in your example, you could use something like this:
/(My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam)\. (I|love|regex) (I|love|regex) (I|love|regex)\./g
But as you can see, this regex grows quickly as your sentences get longer, and it does not check that each word is used exactly once (it would also accept "My My My My."). It is also really inefficient compared to parsing the text with something else.
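The word-array comparison suggested above could look like the following sketch. It assumes the sentences are delimited by ., ? or ! and that the comparison is case-sensitive; the helper names are hypothetical.

```python
import re

def split_into_sentences(text):
    # Split on sentence terminators and drop empty pieces.
    return [part for part in re.split(r"[.?!]", text) if part.strip()]

def sorted_words(sentence):
    # Sorting makes the comparison independent of word order
    # while still counting repeated words.
    return sorted(re.findall(r"\w+", sentence))

def same_sentences_jumbled(expected, candidate):
    exp = split_into_sentences(expected)
    cand = split_into_sentences(candidate)
    if len(exp) != len(cand):
        return False
    # Sentences are compared pairwise, so their order must be preserved.
    return all(sorted_words(a) == sorted_words(b) for a, b in zip(exp, cand))

expected = "My name is Sam. I love regex."
print(same_sentences_jumbled(expected, "My Sam is name. regex I love."))  # True
print(same_sentences_jumbled(expected, "I love regex. My name is Sam."))  # False
```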
I couldn't achieve this with a single regex. Instead I did the following:
Virtually divided the text into multiple sentence blocks.
Maintained a sentence block -> regex configuration.
The regex configured for a block depends on the rule that applies to that block.
Applied each regex to the text to identify whether the corresponding block exists.
Finally, verified that the blocks appear in the configured order.
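One possible reading of these steps is sketched below. It is an assumption rather than the code actually used: each block's regex is built as an alternation of all permutations of its words, which grows factorially and is only practical for short blocks, and the names block_pattern, BLOCKS, and blocks_in_order are hypothetical.

```python
import re
from itertools import permutations

def block_pattern(words):
    # One alternative per word order, e.g. "My\s+name\s+is\s+Sam".
    alternatives = (
        r"\s+".join(re.escape(word) for word in order)
        for order in permutations(words)
    )
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

# Block -> regex configuration, listed in the required order.
BLOCKS = [
    block_pattern(["My", "name", "is", "Sam"]),
    block_pattern(["I", "love", "regex"]),
]

def blocks_in_order(text):
    position = 0
    for pattern in BLOCKS:
        # Each block must be found after the end of the previous one.
        match = pattern.search(text, position)
        if match is None:
            return False
        position = match.end()
    return True

print(blocks_in_order("My Sam is name. regex I love."))  # True
print(blocks_in_order("I love regex. My name is Sam."))  # False
```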

Intelligent pattern matching in string

Let's say I have filenames that are formatted differently. I want to be able to extract certain parts of each filename the way a human would, through pattern recognition.
Obviously I can brute-force my way through with regular expressions, but that's not what I'm after. Let's say I have these 4 strings:
[MAS] Hayate no Gotoku!! 20 [BD 720p] [21D138F8].mkv
[Leopard-Raws] Akatsuki no Yona - 05 RAW (MX 1280x720 x264 AAC).mp4
[BLAST] Wolf Girl and Black Prince - 05 [720p] [C1252A5E].mkv
[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv
As you can see, all these filenames follow a certain pattern but are not quite the same, so a silver-bullet regular expression wouldn't cut it. Instead I want to look at computational intelligence techniques such as ANNs, or another smart idea, to solve this problem.
Let's say we want to extract the titles from the filenames. Humans would return these values:
Hayate no Gotoku!!
Akatsuki no Yona
Wolf Girl and Black Prince
Mobile Suit Gundam AGE
Or episode numbers: 20, 05, 05, 36. You get where I'm going with this.
What suggested techniques would be useful to achieve the desired result, or is this something that is being researched at universities and still has no solution?
What you are looking for is called grammar induction, and it works by making a program figure out a regular expression (or some other type of pattern) that matches certain strings but not others. You have to give it the strings yourself, however. This is called a training set, and it consists of positive examples (strings that should be matched) and negative examples (strings that shouldn't be matched).
An interesting technique is called boosting, where you learn a lot of simple patterns that are precise (they do not match negative examples) but each match only a few positive examples; when combined, however, they match a large number of positive examples.
Since you want to extract substrings rather than just match strings, the way I would go about it is to take prefixes of the file names and try to match those. That way you'd know where the substring starts. Here's an example:
Positives:
[MAS]
[Leopard-Raws]
[BLAST]
[sage]_
Negatives:
[MAS] H
[Leopard-Raws] Akat
[BL
[sage]_Mobile_Suit_Gundam_AGE_
If done correctly, you should obtain a regular expression which you can use on prefixes of the file names. By growing the prefix one letter at a time you can know where the content of interest starts. Like this:
[ False
[s False
[sa False
[sag False
[sage False
[sage] True
[sage]_ True
[sage]_M False
What happened here is that I increased the prefix of the file name one character at a time until the regular expression I learnt matched it. But I also wanted to find the longest prefix that matches (because otherwise I would have missed the underscore since [sage] is an acceptable prefix as well) so I continued moving forward until the regular expression stopped matching. In this way I would know that the prefix before the actual content starts is "[sage]_". You can do the same for matching where it ends as well by using prefixes which include the content of interest.
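The prefix-growing loop described above can be sketched as follows. The LEARNED pattern is a hand-written, hypothetical stand-in for whatever regular expression the learning step would produce, and longest_matching_prefix is an illustrative helper name.

```python
import re

# Hypothetical stand-in for the learned expression: a bracketed group tag,
# optionally followed by one space or underscore.
LEARNED = re.compile(r"\[[^\]]+\][ _]?")

def longest_matching_prefix(name, pattern=LEARNED):
    best = None
    for end in range(1, len(name) + 1):
        prefix = name[:end]
        if pattern.fullmatch(prefix):
            # Keep going: a longer prefix may still match.
            best = prefix
        elif best is not None:
            # The pattern stopped matching, so the previous prefix was the longest.
            break
    return best

name = "[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv"
prefix = longest_matching_prefix(name)
print(repr(prefix))          # '[sage]_'
print(name[len(prefix):])    # content of interest starts here
```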
To learn about regular expression learning see this post. Keep in mind that automated learning will never be perfect but the more examples you use the more accurate it will be.

RegEx: Find Matches Excluding a Specific Phrase

I have tried a few of the answers to similar questions, but none work for what I am trying to do.
I am trying to find text that matches a specific phrase (with a wildcard), but exclude any text that also includes a second phrase.
Correct: John yawns.
Incorrect: John opens his mouth wide and yawns.
Essentially, I want to match on "(Someone) yawns." but not on "(Someone) opens his mouth wide and yawns." So "opens his mouth wide and" is the phrase that should exclude a match, but I can't seem to get it to work.
Sadly, I am working with a log parsing application so I do not know what language is being used.
You probably want a negative lookbehind, as in (?<!opens his mouth wide and )yawns. Beware that these can slow down the regex matching algorithm, and are not available everywhere.
You really should reduce your test case to simple patterns (e.g. match .*bc but not abc).
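The negative lookbehind suggested above behaves as follows in Python's re module, used here only as an illustration since the engine of the log parsing application is unknown and may not support lookbehind at all.

```python
import re

# "yawns." matches only when it is NOT immediately preceded by the excluded phrase.
pattern = re.compile(r"(?<!opens his mouth wide and )yawns\.")

print(bool(pattern.search("John yawns.")))                           # True
print(bool(pattern.search("John opens his mouth wide and yawns.")))  # False
```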