Intelligent pattern matching in string - regex

Let's say I have filenames which are formatted differently. I want to be able to extract certain aspects from said filename like a human would; pattern recognition.
Obviously I can bruteforce myself through with regular expressions but that's not what I'm after. Let's say I have these 4 strings:
[MAS] Hayate no Gotoku!! 20 [BD 720p] [21D138F8].mkv
[Leopard-Raws] Akatsuki no Yona - 05 RAW (MX 1280x720 x264 AAC).mp4
[BLAST] Wolf Girl and Black Prince - 05 [720p] [C1252A5E].mkv
[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv
As you can see, all these filenames contain certain patterns but are not quite the same, so a silver-bullet regular expression wouldn't cut it. Instead I want to look at computational intelligence techniques such as ANNs, or another smart idea, to solve this problem.
Let's say we want to extract the titles from the filenames. Humans would return these values:
Hayate no Gotoku!!
Akatsuki no Yona
Wolf Girl and Black Prince
Mobile Suit Gundam AGE
Or episode numbers: 20, 05, 05, 36. You get where I'm going with this.
What suggested techniques would be useful to achieve the desired result, or is this something that is being researched at universities and still has no solution?

What you are looking for is called grammar induction, and it works by making a program figure out a regular expression (or some other type of pattern) that matches certain strings but not others. You have to supply the strings yourself, however, as a training set with positive examples (strings that should be matched) and negative examples (strings that shouldn't be matched).
An interesting technique is boosting, where you learn many simple patterns that are precise (they do not match negative examples) but each match only a few positive examples; combined, they match a large number of positive examples.
Since you want to extract substrings rather than just match strings, the way I would go about it is to take prefixes of the file names and try to match those. In this way you'd know where the substring starts. Here's an example:
Positives:
[MAS]
[Leopard-Raws]
[BLAST]
[sage]_
Negatives:
[MAS] H
[Leopard-Raws] Akat
[BL
[sage]_Mobile_Suit_Gundam_AGE_
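As a rough illustration of what the training set is used for, the sketch below checks whether a candidate pattern is consistent with the examples above: it must match every positive and no negative. The candidate patterns are written by hand here purely as stand-ins for what a learner might propose, and the helper name is_consistent is hypothetical.

```python
import re

positives = ["[MAS]", "[Leopard-Raws]", "[BLAST]", "[sage]_"]
negatives = ["[MAS] H", "[Leopard-Raws] Akat", "[BL",
             "[sage]_Mobile_Suit_Gundam_AGE_"]

def is_consistent(pattern, positives, negatives):
    """Return True if pattern matches all positives and none of the negatives."""
    rx = re.compile(pattern)
    return (all(rx.fullmatch(p) for p in positives)
            and not any(rx.fullmatch(n) for n in negatives))

# Hand-written candidates (assumptions, not learned output).
print(is_consistent(r"\[[^\]]*\]_?", positives, negatives))  # True
print(is_consistent(r"\[.*", positives, negatives))          # False: too general
```

A learner would generate many such candidates and keep the ones that pass this kind of check.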
If done correctly, you should obtain a regular expression which you can use on prefixes of the file names. By growing the prefix one letter at a time you can know where the content of interest starts. Like this:
[ False
[s False
[sa False
[sag False
[sage False
[sage] True
[sage]_ True
[sage]_M False
What happened here is that I increased the prefix of the file name one character at a time until the regular expression I learnt matched it. But I also wanted to find the longest prefix that matches (because otherwise I would have missed the underscore since [sage] is an acceptable prefix as well) so I continued moving forward until the regular expression stopped matching. In this way I would know that the prefix before the actual content starts is "[sage]_". You can do the same for matching where it ends as well by using prefixes which include the content of interest.
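The prefix-growing procedure described above can be sketched in Python as follows. The pattern is a hand-written stand-in for whatever regular expression the learning step would produce (an assumption), and leading_tag is a hypothetical helper name.

```python
import re

# Stand-in for a learned pattern (assumption): a bracketed tag,
# optionally followed by one underscore.
PREFIX_RE = re.compile(r"\[[^\]]*\]_?")

def leading_tag(name, pattern=PREFIX_RE):
    """Grow the prefix one character at a time and return the longest
    prefix of the first run of prefixes that the pattern fully matches."""
    best = ""
    for end in range(1, len(name) + 1):
        prefix = name[:end]
        if pattern.fullmatch(prefix):
            best = prefix   # keep going: a longer prefix may still match
        elif best:
            break           # the pattern stopped matching, so we are done
    return best

name = "[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv"
tag = leading_tag(name)
print(tag)              # [sage]_
print(name[len(tag):])  # the rest, where the content of interest starts
```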
To learn about regular expression learning see this post. Keep in mind that automated learning will never be perfect but the more examples you use the more accurate it will be.

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
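Since the question mentions taking the original regex as an argument in Python, here is a minimal sketch of the lookahead approach with the standard re module. The function name prefilter is hypothetical, and the sketch assumes the original pattern does not begin with global inline flags such as (?i), which recent Python versions reject when they are not at the very start of the expression.

```python
import re

def prefilter(original, prefix):
    """Prepend a lookahead for a literal prefix to an arbitrary pattern.

    The original pattern is wrapped in a non-capturing group so that a
    top-level alternation (a|b|c) stays intact and group numbers do not shift.
    """
    return re.compile("(?=" + re.escape(prefix) + ")(?:" + original + ")")

pattern = prefilter("cat|car|bat", "ca")
print([w for w in ["cat", "car", "bat"] if pattern.fullmatch(w)])
# ['cat', 'car']

four = prefilter("[a-z]{4}", "ca")
print(bool(four.fullmatch("cake")))   # True: four letters, starts with "ca"
print(bool(four.fullmatch("cakes")))  # False: the lookahead consumes nothing
```

Because a lookahead is zero-width, the prefix is checked without adding to the length of the match, which is why the second pattern still matches exactly four letters.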
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Too Many Characters Included in Attempt to Parse a CSV File

Background
I am attempting to parse a CSV file using PCRE regular expressions. That is, making out (or extracting) the various different "cells" available in the CSV, to then put them in a somewhat nicely organized array containing all the parts that the process of parsing managed to make out.
The following regular expression is what I have come up with so far:
/(?:;|^)(?:(?:"(?:(?!"(;|$)).)*)|(?:([^;]*)))/g
I would highly recommend that you put this in a tester for regular expressions. Here is a slight bit of test data, that should match to a great extent.
"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"
The Problem
The regular expression manages to extract the correct number of parts. It is meant to retrieve the text inside each and every "cell" in the CSV (use the CSV provided above for reference), and it does so to some extent.
Here is the result of the groups in the regular expression above:
"There; \"be
;"but; someone spoke
hence the young man
hence the son
;"test;
As we can clearly see, for the cells that are quoted with double-quotation marks, the match includes the opening ", and sometimes even the preceding semi-colon. From my understanding, the group for the negative lookahead should not include those.
I have probably missed something very essential here. Perhaps someone can point me in the right direction towards a fix.
Edit and Potential Solution
It appears as though I might have managed to solve it. Contrary to what I said above, the negative lookahead does not actually appear to create a capture group, as I initially thought it did. As such, adding yet another group to the equation seems to parse out the segments I am after.
/(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))/g
I will, however, leave the question open for now, and will answer it myself if no other answer comes in. So as not to make it opinion-based, I would further ask whether there is a more efficient way, in terms of speed, than the one I am using above.
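One possible variant, sketched here in Python rather than PCRE, avoids the per-character negative lookahead by consuming a quoted cell as a run of "non-quote, non-backslash characters or backslash-escaped characters" and then matching the closing quote explicitly. This is an assumption about the format (semicolon delimiter, backslash-escaped quotes inside quoted cells), and no speed measurements back it up; the name split_cells is hypothetical.

```python
import re

CELL_RE = re.compile(
    r'(?:^|;)'                    # start of line or a delimiter
    r'(?:"((?:[^"\\]|\\.)*)"'     # group 1: quoted cell, escapes allowed
    r'|([^;]*))'                  # group 2: unquoted cell
)

def split_cells(line):
    """Return the cell contents of one semicolon-separated line."""
    cells = []
    for m in CELL_RE.finditer(line):
        quoted, plain = m.group(1), m.group(2)
        cells.append(quoted if quoted is not None else plain)
    return cells

line = r'"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"'
for cell in split_cells(line):
    print(cell)
# There; \"be
# but; someone spoke
# hence the young man
# hence the son
# test;
```

Escaped quotes are left as they appear in the input; unescaping them would be a separate step.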

Regular Expression to match sentences

I'm trying to make a regular expression in Python that matches sentences. The one I see that mostly works is [^\.\?\!].*?[\.\?\!], but with the test sentences below it has a few errors. You can see this using the site https://regex101.com/. I'm looking for a regular expression that handles all the problems below, such as ellipses, honorifics, and the i.e. thing.
For performing tokenization in languages other than English, we can
load the respective language pickle file found in tokenizers/punkt and
then tokenize the text in another language, which is an argument of
the tokenize() function. For the tokenization of French text, we will
use the french.pickle file as follows: Mr. Smith bought cheapsite.com
for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam
Jones Jr. thinks he didn't. In any case, this isn't true... Well, with
a probability of .9 it isn't.
p.s. If you're wondering, I got the above sentences from a natural language processing book and another Stack Overflow question on the same subject.
The easiest way is to split it into 3 operations:
1. Substitute i.e., ellipses, and whatever else you want with markers that contain no dots, like ###ie### and ###ellipsis###.
2. Match sentences.
3. After that, restore i.e. and the ellipses.
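Update: Some code showing how to do it. You have to do the substitution for each item with dots that you want to exclude from the sentence matcher.
import re
# Hide "i.e." behind a dot-free marker so its dots cannot end a sentence.
sentences = re.sub(r'i\.e\.', '###ie###', sentences)
# findall returns each sentence as a string (re.match returns one match object).
matches = re.findall(r'[^.?!].*?[.?!]', sentences)
matches = [m.replace('###ie###', 'i.e.') for m in matches]  # restore "i.e."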
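The three operations can also be wrapped in one function that handles several protected tokens at once. This is only a sketch: the PROTECTED list is an illustrative assumption, the function name split_sentences is hypothetical, and decimals such as 1.5 or domain names such as cheapsite.com would still need their own handling.

```python
import re

# Illustrative list only (an assumption); extend it for the text at hand.
PROTECTED = ["i.e.", "Mr.", "Jr."]

def split_sentences(text):
    # Step 1: swap each protected token for a dot-free marker.
    for i, token in enumerate(PROTECTED):
        text = text.replace(token, f"###{i}###")
    # Step 2: a sentence starts at a non-space character and runs up to
    # the next terminator.
    parts = re.findall(r"[^.?!\s][^.?!]*[.?!]", text)
    # Step 3: put the protected tokens back into each sentence.
    restored = []
    for part in parts:
        for i, token in enumerate(PROTECTED):
            part = part.replace(f"###{i}###", token)
        restored.append(part)
    return restored

text = "Mr. Smith paid a lot, i.e. too much. Did he mind? Adam Jones Jr. thinks not."
for sentence in split_sentences(text):
    print(sentence)
# Mr. Smith paid a lot, i.e. too much.
# Did he mind?
# Adam Jones Jr. thinks not.
```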

Extracting data using regex from bank feed

I am looking to extract some text from a raw credit card feed for a workflow. I have gotten almost where I want to but am struggling with the final piece of information I'm trying to extract.
An example of the raw feed is:
LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
I am looking to extract this from the above:
(ICGROUP,INC.MELBOURNE)June5UNITEDSTATESDOLLARAUD(50.07)includesconversioncommissionof
with the brackets representing the two groups I am after. The consistent parts across all instances of what I'm trying to extract are:
DIGITS (TEXT) DATE TEXT AMOUNT includesconversioncommissionof
I have been able to use the regex:
([A-Z][a-z]\d)[A-Z]AUD(\d\,?\d+?.\d*)includesconversioncommissionofAUD
to get the date and the amount. I am struggling to find a way to get the words ICGROUP,INC.MELBOURNE, as in the example above.
I have tried putting \d\d(.*) before the above regex but that doesn't work for some reason.
Would appreciate if anyone is able to help with what I'm after!
The closest I think we can get (PCRE) is something like:
/
[\d,.]+ # a currency value to bookend
(.+?) # capture everything in-between
[A-Z][a-z]+\d+ # a month followed by a day, e.g. "June5"
.+? # everything in-between
([\d,.]+) # capture a currency value
includesconversioncommissionof # our magic token to bookend
/x
The technique here is to pit greedy expressions against non-greedy expressions in a very deliberate way. Let me know if you have any questions about it. I would be extremely hesitant to put this in production—or even trust its output as an ad-hoc pass—without rigorous testing!
I'm using the pattern [\d,.] for currency, but you can replace that with something more sophisticated, especially if you expect weird formats and currency symbols. The biggest potential pitfall here is if the ICGROUP,INC.MELBOURNE token might start with a number. Then you'll definitely need a more sophisticated currency pattern!
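For anyone who wants to try this outside PCRE, the same pattern carries over to Python's re module with the re.VERBOSE flag (a translation made for illustration, not part of the original answer). Run against the single example from the question, it yields the two groups asked for; it has not been checked against any other feed lines.

```python
import re

FEED_RE = re.compile(
    r"""
    [\d,.]+                          # a currency value to bookend
    (.+?)                            # capture everything in between
    [A-Z][a-z]+\d+                   # a month followed by a day, e.g. "June5"
    .+?                              # everything in between
    ([\d,.]+)                        # capture a currency value
    includesconversioncommissionof   # the token that bookends the match
    """,
    re.VERBOSE,
)

feed = ("LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNE"
        "June5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionof"
        "AUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE")

match = FEED_RE.search(feed)
merchant, amount = match.groups()
print(merchant)  # ICGROUP,INC.MELBOURNE
print(amount)    # 50.07
```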
Here's what I've got (in php).
$string = "LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE";
$cleaned = preg_replace("/^(LEO'SFINEFOOD&WINEHARTWELL)([A-Za-z]{3,9})(\.|\d)*/", "", $string);
echo $cleaned;
what it returns is: ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
Which you can then use and run your own little regex on.
Explanation:
The [A-Za-z]{3,9} is used to remove the month, which may be 3-9 characters long. Then the (\.|\d)* removes the digits and dots. I'm thinking that we could parse the month/date better using your regex to extract that June 5 part, but from the example given it shouldn't be necessary.
However, it would be much more helpful if you could provide at least 3 examples, optimally 5, so we can get a good feel of the pattern. Otherwise this is the best I can do with what you've given.

RegEx: Find Matches Excluding a Specific Phrase

I have tried a few of the answers to similar questions, but none work for what I am trying to do.
I am trying to find text that matches a specific phrase (with wildcard), but excludes any that include a second phrase.
Correct: John yawns.
Incorrect: John opens his mouth wide and yawns.
Essentially, I want to match off "(Someone) yawns." but not off "(Someone) opens his mouth wide and yawns." So the "opens his mouth wide and" is the match for exclusion, but I can't seem to get it to work.
Sadly, I am working with a log parsing application so I do not know what language is being used.
You probably want a negative lookbehind, as in (?<!opens his mouth wide and )yawns. Beware that these can slow down the regex matching algorithm, and are not available everywhere.
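The log parsing application's regex flavour is unknown, so the following only shows how the lookbehind behaves in one engine, Python's re module, which supports lookbehind as long as the pattern inside it has a fixed width (true for a literal phrase like this one).

```python
import re

# "yawns." only when it is not directly preceded by the excluded phrase.
pattern = re.compile(r"(?<!opens his mouth wide and )yawns\.")

lines = [
    "John yawns.",
    "John opens his mouth wide and yawns.",
]
matched = [line for line in lines if pattern.search(line)]
print(matched)  # ['John yawns.']
```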
You really should reduce your test-case to simple patterns (e.g. .*bc but not abc)