regex: Match at least two search terms

I have a list of search terms and I would like to have a regex that matches all items that have at least two of them.
Terms: war|army|fighting|rebels|clashes
Match: The war between the rebels and the army resulted in several clashes this week. (4 hits)
Non-Match: In the war on terror, the obama administration wants to increase the number of drone strikes. (only 1 hit)
Background: I use tiny-tiny rss to collect and filter a large number of feeds for a news reporting project. I get 1000 - 2000 feed items per day and would like to filter them by keywords. By just using an | (OR) expression, I get too many false positives, so I figured I could just ask for two matches in a feed item.
Thanks!
EDIT:
I know very little about regex, so I have stuck with the simple | (OR) operator so far. I tried putting the search terms in parentheses, (war|fighting|etc){2,}, but that only matches if an item uses the same word twice.
EDIT2: sorry for the confusion, I'm new to regex and the like. Fact is: the regex queries a mysql database. It is entered in the tt-rss backend as a filter, which allows only one line (although theoretically unlimited number of characters). The filter is employed upon importing of the feed item into the mysql database.

(.*?\b(war|army|fighting|rebels|clashes)\b){2,}
If you need to avoid matching the same term twice, you can use:
.*?\b(war|army|fighting|rebels|clashes)\b.*?\b(?!\1)(war|army|fighting|rebels|clashes)\b
which matches a term, then uses a negative lookahead with a backreference to avoid matching that same term again.
In java:
Pattern multiword = Pattern.compile(
".*?(\\b(war|army|fighting|rebels|clashes)\\b)" +
".*?(\\b(?!\\1)(war|army|fighting|rebels|clashes)\\b)"
);
Matcher m;
for(String str : Arrays.asList(
"war",
"war war war",
"warm farmy people",
"In the war on terror rebels eating faces"
)) {
m = multiword.matcher(str);
if(m.find()) {
logger.info(str + " : " + m.group(0));
} else {
logger.info(str + " : no match.");
}
}
Prints:
war : no match.
war war war : no match.
warm farmy people : no match.
In the war on terror rebels eating faces : In the war on terror rebels

This isn't (entirely) a job for regular expressions. A better approach is to scan the text, and then count the unique match groups.
In Ruby, it would be very simple to branch based on your match count. For example:
terms = /war|army|fighting|rebels|clashes/
text = "The war between the rebels and the army resulted in..."
# The real magic happens here.
match = text.scan(terms).uniq
# Do something if your minimum match count is met.
if match.count >= 2
p match
end
This will print ["war", "rebels", "army"].

Regular expressions could do the trick, but the regular expression would be quite huge.
Remember, they are simple tools (based on finite-state automata) and hence don't have any memory that would let them remember what words were already seen. So such regex, even though possible, would probably just look like a huge lump of or's (as in, one "or" for every possible order of inputs or something).
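To make the "huge lump of or's" concrete, here is a minimal sketch that generates such a pattern, with one alternative per ordered pair of distinct terms. The function name two_distinct_terms_regex is made up for illustration, and the \b word boundaries follow Python's re syntax; other engines may spell word boundaries differently.

```python
import re
from itertools import permutations

def two_distinct_terms_regex(terms):
    """Build one alternative per ordered pair of distinct terms (a.*b|b.*a|...)."""
    return "|".join(
        r"\b{0}\b.*\b{1}\b".format(re.escape(first), re.escape(second))
        for first, second in permutations(terms, 2)
    )

TERMS = ["war", "army", "fighting", "rebels", "clashes"]
PAIR_PATTERN = re.compile(two_distinct_terms_regex(TERMS))

# Five terms already produce 5 * 4 = 20 alternatives.
print(len(PAIR_PATTERN.pattern.split("|")))
print(bool(PAIR_PATTERN.search("The war between the rebels and the army")))
print(bool(PAIR_PATTERN.search("war war war")))
```

The pattern needs no lookarounds or backreferences, but it grows quadratically with the number of terms, which is the size problem described above.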
I recommend doing the parsing yourself, for instance like this (pseudocode):
var searchTerms = set(yourWords);
int found = 0;
foreach (var x in words(input)) {
if (x in searchTerms) {
searchTerms.remove(x);
++found;
}
if (found >= 2) return true;
}
return false;
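The pseudocode above could be sketched in Python roughly as follows. The function name has_at_least and the choice to split the input into words with \w+ and compare case-insensitively are assumptions, not part of the original answer.

```python
import re

def has_at_least(text, terms, minimum=2):
    """Return True once `minimum` distinct search terms have been seen in `text`."""
    remaining = {term.lower() for term in terms}
    found = 0
    for word in re.findall(r"\w+", text.lower()):
        if word in remaining:
            # Remove the term so a repeated word is not counted twice.
            remaining.remove(word)
            found += 1
            if found >= minimum:
                return True
    return False

TERMS = ["war", "army", "fighting", "rebels", "clashes"]
print(has_at_least("The war between the rebels and the army resulted in several clashes this week.", TERMS))
print(has_at_least("In the war on terror, the obama administration wants to increase the number of drone strikes.", TERMS))
```

Removing each term from the set once it is found is what stops "war war war" from counting as three hits.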

If you want to do it all with a regex it's not likely to be easy.
You can however do something like this:
<?php
...
$string = "The war between the rebels and the army resulted in several clashes this week. (4 hits)";
preg_match_all("#(\b(war|army|fighting|rebels|clashes))\b#", $string, $matches);
$uniqueMatchingWords = array_unique($matches[0]);
if (count($uniqueMatchingWords) >= 2) {
//bingo
}

Related

Perl MongoDB API - regular expression in filters

I am working in Perl with a MongoDB. I have a collection with documents that have a big text field that I need to be able to find all rows that contain multiple strings in the field.
So for instance, if this is a database of movie quotes one row would have value:
We must totally destroy all spice production on Arrakis. The Guild and
the entire Universe depends on spice. He who can destroy a thing,
controls a thing.
I want to be able to match that row with terms "spice", "Arrakis", and "Guild" where ALL of those terms have to be in the text.
My current approach can only achieve matches if the terms provided happen to be in the correct order, i.e.:
$db->get_collection( 'quotes' )->find( { quote => qr/spice.*Arrakis.*Guild/i } );
That's a match, but
$db->get_collection( 'quotes' )->find( { quote => qr/Guild.*spice.*Arrakis/i } );
is not a match.
If I were working with a SQL database I could do:
... WHERE quote LIKE '%spice%' and quote LIKE '%Arrakis%' and quote LIKE '%Guild%'
but with the MongoDB interface you only get one shot per field.
Is there a way to match multiple words where all are required in one regex, or is there another way to get more than one crack at a field in the MongoDB interface?

One way: A bunch of positive lookahead assertions:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my @tests = ("The Guild demands that the spice must flow from Arrakis",
"House Atreides will be transported to Arrakis by the Guild.");
for my $test (@tests) {
if ($test =~ m/^(?=.*spice)
(?=.*Guild)
(?=.*Arrakis)/x) {
say "$test: matches";
} else {
say "$test: fails";
}
}
produces:
The Guild demands that the spice must flow from Arrakis: matches
House Atreides will be transported to Arrakis by the Guild.: fails
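The same lookahead idea can be built from a list of terms instead of being typed out by hand. This is a minimal Python sketch; the helper name contains_all is made up, and the case-insensitive and DOTALL flags are assumptions about how the quotes should be matched.

```python
import re

def contains_all(text, terms):
    """True if every term occurs somewhere in `text`, in any order."""
    # One positive lookahead per term, all anchored at the start of the text.
    lookaheads = "".join("(?=.*%s)" % re.escape(term) for term in terms)
    return re.match(lookaheads, text, re.IGNORECASE | re.DOTALL) is not None

terms = ["spice", "Guild", "Arrakis"]
print(contains_all("The Guild demands that the spice must flow from Arrakis", terms))
print(contains_all("House Atreides will be transported to Arrakis by the Guild.", terms))
```

Because each lookahead starts from the same position and consumes nothing, the order in which the terms appear in the text does not matter.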

Perl, regular expression, matching exactly 2 spaces does not work

Working on the parser for STA/SSTA timing reports. The following cases of "Arrival Time" occurrence are possible:
Arrival Time 3373.000
- Arrival Time 638.700 | 100.404
Arrival Time Report
The goal is to match the 1st and 2nd cases, but ignore the 3rd case.
I tried two matching patterns in my Perl code:
1) if (m/^-?\s{1,2}Arrival\sTime/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s+(.*)\s+$/ }
2) if (m/^-\sArrival\sTime/ || m/^\s{1,2}Arrival\sTime/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s+(.*)\s+$/ }
Both of them pick up the 3rd case as well. I do not understand why.
I defined specifically one or two space characters \s{1,2}, no more than that. As the 3rd line contains more than two whitespace character it should not match the pattern. How is this possible?

The data you have published is not the same as you used in your test.
This program checks both of the regex patterns against the data copied directly from an edit of your original post. Neither pattern matches any of the lines in your data:
use strict;
use warnings;
use 5.010;
my (%STA_DATA, $file, $path);
while ( <DATA> ) {
if ( /^-?\s{1,2}Arrival\sTime/ ) {
say 'match1';
$STA_DATA{$file}{$path}{Arrival_Time} = m/\sArrival\sTime\s+(.*)\s+$/
}
if ( /^-\sArrival\sTime/ or m/^\s{1,2}Arrival\sTime/ ) {
say 'match2';
$STA_DATA{$file}{$path}{Arrival_Time} = m/\sArrival\sTime\s+(.*)\s+$/
}
}
__DATA__
Arrival Time 3373.000
- Arrival Time 638.700 | 100.404
Arrival Time Report

Here is a possible workaround you can try:
if (m/^-?\s{1,2}Arrival\sTime\s{2,}/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s+(.*)\s+$/ }
You can match the string "Arrival Time " with two or more spaces after it, ruling out the string "Arrival Time Report"

Can you confirm your regex is inside a loop reading the input line by line?
In case $_ contains the whole text your observation would be expected because you anchored the extracting regex to the end of the text by using a $.

It should help to replace spaces in your data with Unicode U+2423 OPEN BOX that is commonly used to signify a space using a visible character.
␣␣␣␣␣␣Arrival␣Time␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣3373.000
␣␣␣␣-␣Arrival␣Time␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣638.700␣|␣100.404
␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣Arrival␣Time␣Report␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣

As rightfully requested by Borodin, for the benefit of others I'm going to explain the mistake I made and show the solution.
The mistake was the following:
I wrongly assumed that my matching pattern was being applied to the text as seen in the .rpt file.
The three cases (relevant for my matching pattern) that can occur in such a file are the following:
      Arrival Time                3373.000
    - Arrival Time                          638.700 | 100.404
                                 Arrival Time Report
But I had forgotten that somewhere in the code I had implemented the following line:
s/->//g; s/\s\S+\s[v\^]\s//g; s/\s+/ /g;
It is the last substitution in this series that changes the original text into:
Arrival Time 3373.000
- Arrival Time 638.700 | 100.404
Arrival Time Report
Therefore my matching patterns (presented in the question above) did not work.
Knowing this, the solution is simple. I have adjusted the matching pattern as follows:
if (m/^\-?\sArrival\sTime\s\d+/) { ($STA_DATA{$file}{$path}{Arrival_Time}) = m/\sArrival\sTime\s(.*)\s?$/ }
I appreciate all the help and feedback received, and I am truly sorry for wasting everyone's time with this ill-defined problem.
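For anyone hitting the same problem, the effect of the whitespace-collapsing substitution is easy to reproduce. The sketch below uses Python rather than Perl, and the amount of leading whitespace in the raw line is an assumption based on the visible-space rendering shown earlier in the thread.

```python
import re

# First pattern from the question: optional dash, one or two whitespace chars.
FIRST_PATTERN = re.compile(r"^-?\s{1,2}Arrival\sTime")

raw = " " * 33 + "Arrival Time Report"
# Equivalent of the Perl substitution s/\s+/ /g applied earlier in the script.
collapsed = re.sub(r"\s+", " ", raw)

print(bool(FIRST_PATTERN.search(raw)))        # many leading spaces: no match
print(bool(FIRST_PATTERN.search(collapsed)))  # one leading space: match
```

The pattern itself behaves as intended; it only matches the third case because the earlier substitution has already reduced every run of whitespace to a single space.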

Regex Pattern to Match, Excluding when... / Except between

--Edit-- The current answers have some useful ideas but I want something more complete that I can 100% understand and reuse; that's why I set a bounty. Also ideas that work everywhere are better for me than not standard syntax like \K
This question is about how I can match a pattern except some situations s1 s2 s3. I give a specific example to show my meaning but prefer a general answer I can 100% understand so I can reuse it in other situations.
Example
I want to match five digits using \b\d{5}\b but not in three situations s1 s2 s3:
s1: Not on a line that ends with a period like this sentence.
s2: Not anywhere inside parens.
s3: Not inside a block that starts with if( and ends with //endif
I know how to solve any one of s1 s2 s3 with a lookahead and lookbehind, especially in C# lookbehind or \K in PHP.
For instance
s1 (?m)(?!\d+.*?\.$)\d+
s3 with C# lookbehind (?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
s3 with PHP \K (?:(?:if\(.*?//endif)\D*)*\K\d+
But the mix of conditions together makes my head explode. Even more bad news is that I may need to add other conditions s4 s5 at another time.
The good news is, I don't care if I process the files using most common languages like PHP, C#, Python or my neighbor's washing machine. :) I'm pretty much a beginner in Python & Java but interested to learn if it has a solution.
So I came here to see if someone can think of a flexible recipe.
Hints are okay: you don't need to give me full code. :)
Thank you.

Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer—just trying to please. Let's start with some background.
First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.
Surprise
Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...
Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.
A Better-Known Variation
There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).
Thanks for all the background, zx81... But what's the recipe?
Key Fact
The method returns the match in Group 1 capture. It does not care at
all about the overall match.
In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.
The general recipe is
Not_this_context|Not_this_either|StayAway|(WhatYouWant)
This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.
In your case, with your digits and your three contexts to ignore, we can do:
s1|s2|s3|(\b\d+\b)
Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a | )
The whole expression can be written like this:
(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)
See this demo (but focus on the capture groups in the lower right pane.)
If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.
For flavors that support free-spacing, this reads particularly well.
(?mx)
### s1: Match line that ends with a period ###
^.*\.$
| ### OR s2: Match anything between parentheses ###
\([^\)]*\)
| ### OR s3: Match any if(...//endif block ###
if\(.*?//endif
| ### OR capture digits to Group 1 ###
(\b\d+\b)
This is exceptionally easy to read and maintain.
Extending the regex
When you want to ignore more situations s4 and s5, you add them in more alternations on the left:
s4|s5|s1|s2|s3|(\b\d+\b)
How does this work?
The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".
The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.
I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.
Debuggex Demo
Perl/PCRE Variation
In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as @CasimiretHippolyte and @HamZa. It is:
(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant
In your case:
(?m)(?:^.*\.$|\([^()]*\)|if\(.*?//endif)(*SKIP)(*F)|\b\d+\b
This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant
Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!)
demo for this version
Applications
Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.
How can I match foo except anywhere in a tag like <a stuff...>...</a>?
How can I match foo except in an <i> tag or a javascript snippet (more conditions)?
How can I match all words that are not on this black list?
How can I ignore anything inside a SUB... END SUB block?
How can I match everything except... s1 s2 s3?
How to Program the Group 1 Captures
You didn't ask for code, but, for completeness... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.
If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
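As one illustration, here is a minimal Python sketch of inspecting the Group 1 captures for the s1|s2|s3|(\b\d+\b) pattern. The helper name wanted_numbers and the sample text are made up for the example.

```python
import re

# Unwanted contexts first (s1, s2, s3); the wanted digits go into Group 1.
PATTERN = re.compile(
    r"^.*\.$|\([^)]*\)|if\(.*?//endif|(\b\d+\b)",
    re.MULTILINE,
)

def wanted_numbers(text):
    """Keep only the matches where Group 1 is set; ignore the overall matches."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

sample = (
    "cat 123 sat.\n"
    "I like 000 not (456) though 111 is fine\n"
    "222 if( //endif if(cat==789 stuff //endif 333"
)
print(wanted_numbers(sample))
```

The matches for the three unwanted contexts still occur, but Group 1 is empty for them, so the filter drops them.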
Alternatives
Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.
1. Replace then Match.
A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of ###. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive ### strings.
2. Lookarounds.
Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavors found in C#, VB.NET and Visual C++ for example, as well as the still-experimental regex module to replace re in Python, are the only two engines I know that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.
Recycling the regex you had for s3 in C#, the whole pattern would look like this.
(?!.*\.)(?<!\([^()]*(?=\d+[^)]*\)))(?<!if\(\D*(?=\d+.*?//endif))\b\d+\b
But by now you know I'm not recommending this, right?
Deletions
@HamZa and @Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instances of WhatYouWant, you change the regex to
(s1|s2|s3)|WhatYouWant
For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else, and is therefore deleted. See this demo, thank you @HamZa and @Jerry for suggesting this wonderful addition.
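A minimal Python sketch of the deletion trick, using only the s2 and s3 contexts to keep it short. The function name delete_numbers is made up, and a callback is used instead of a plain replacement string so that an unset Group 1 is handled explicitly.

```python
import re

# Contexts to keep are captured in Group 1; bare numbers match outside any group.
DELETE = re.compile(r"(\([^)]*\)|if\(.*?//endif)|\b\d+\b")

def delete_numbers(text):
    """Put protected contexts back unchanged; replace bare numbers with nothing."""
    return DELETE.sub(lambda m: m.group(1) or "", text)

print(delete_numbers("keep (123) drop 456 if(789//endif"))
```

Only 456 is removed here; the numbers inside the parentheses and inside the if(...//endif block are part of a Group 1 capture and are written back as they were.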
Replacements
This brings us to replacements, on which I'll touch briefly.
When replacing with nothing, see the "Deletions" trick above.
When replacing, if using Perl or PCRE, use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
In other flavors, within the replacement function call, inspect the match using a callback or lambda, and replace if Group 1 is set. If you need help with this, the article already referenced will give you code in various languages.
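For the callback case, a minimal Python sketch could look like this. The function name bracket_numbers and the choice of wrapping each wanted number in square brackets are made up for illustration, and only the s2 and s3 contexts are included.

```python
import re

PATTERN = re.compile(r"\([^)]*\)|if\(.*?//endif|(\b\d+\b)")

def bracket_numbers(text):
    """Replace only Group 1 captures; leave the ignored contexts untouched."""
    def replace(match):
        if match.group(1) is not None:
            return "[" + match.group(1) + "]"
        # An unwanted context matched: put the whole match back unchanged.
        return match.group(0)
    return PATTERN.sub(replace, text)

print(bracket_numbers("a 12 (34) if(56//endif 78"))
```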
Have fun!
No, wait, there's more!
Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.

Do three different matches and handle the combination of the three situations using in-program conditional logic. You don't need to handle everything in one giant regex.
EDIT: let me expand a bit because the question just became more interesting :-)
The general idea you are trying to capture here is to match against a certain regex pattern, but not when there are certain other (could be any number) patterns present in the test string. Fortunately, you can take advantage of your programming language: keep the regexes simple and just use a compound conditional. A best practice would be to capture this idea in a reusable component, so let's create a class and a method that implement it:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public class MatcherWithExceptions {
private string m_searchStr;
private Regex m_searchRegex;
private IEnumerable<Regex> m_exceptionRegexes;
public string SearchString {
get { return m_searchStr; }
set {
m_searchStr = value;
m_searchRegex = new Regex(value);
}
}
public string[] ExceptionStrings {
set { m_exceptionRegexes = from es in value select new Regex(es); }
}
public bool IsMatch(string testStr) {
return (
m_searchRegex.IsMatch(testStr)
&& !m_exceptionRegexes.Any(er => er.IsMatch(testStr))
);
}
}
public class App {
public static void Main() {
var mwe = new MatcherWithExceptions();
// Set up the matcher object.
mwe.SearchString = @"\b\d{5}\b";
mwe.ExceptionStrings = new string[] {
@"\.$"
, @"\(.*" + mwe.SearchString + @".*\)"
, @"if\(.*" + mwe.SearchString + @".*//endif"
};
var testStrs = new string[] {
"1." // False
, "11111." // False
, "(11111)" // False
, "if(11111//endif" // False
, "if(11111" // True
, "11111" // True
};
// Perform the tests.
foreach (var ts in testStrs) {
System.Console.WriteLine(mwe.IsMatch(ts));
}
}
}
So above, we set up the search string (the five digits), multiple exception strings (your s1, s2 and s3), and then try to match against several test strings. The printed results should be as shown in the comments next to each test string.

Your requirement that it's not inside parens is impossible to satisfy for all cases.
Namely, if you can somehow find a ( to the left and ) to the right, it doesn't always mean you are inside parens. Eg.
(....) + 55555 + (.....) - not inside parens yet there are ( and ) to left and right
Now you might think yourself clever and look for ( to the left only if you don't encounter ) before and vice versa to the right. This won't work for this case:
((.....) + 55555 + (.....)) - inside parens even though there are closing ) and ( to left and to right.
It is impossible to find out if you are inside parens using regex, as regex can't count how many parens have been opened and how many closed.
Consider this easier task: using regex, find out if all (possibly nested) parens in a string are closed, that is, for every ( you need to find a ). You will find that it's impossible to solve, and if you can't solve that with regex then you can't figure out whether a word is inside parens in all cases, since you can't tell at a given position in the string whether all preceding ( have a corresponding ).
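If nesting matters, the counting can be done in code and the regex left to find the digits. This is a minimal Python sketch under that assumption; the function name numbers_outside_parens is made up, and unbalanced closing parens are simply ignored.

```python
import re

def numbers_outside_parens(text):
    """Return five-digit numbers that sit at nesting depth zero."""
    depth = 0
    depth_at = []  # nesting depth recorded for every character position
    for ch in text:
        if ch == "(":
            depth += 1
        depth_at.append(depth)
        if ch == ")" and depth > 0:
            depth -= 1
    return [
        m.group()
        for m in re.finditer(r"\b\d{5}\b", text)
        if depth_at[m.start()] == 0
    ]

print(numbers_outside_parens("(....) + 55555 + (.....)"))
print(numbers_outside_parens("((.....) + 55555 + (.....))"))
```

The two inputs are the examples from above: the first has the number outside any parens, the second has it inside the outer pair.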

Hans if you don't mind I used your neighbor's washing machine called perl :)
Edited:
Below a pseudo code:
loop through input
if line contains 'if(' set skip=true
if skip= true do nothing
else
if line match '\b\d{5}\b' set s0=true
if line does not match s1 condition set s1=true
if line does not match s2 condition set s2=true
if s0,s1,s2 are true print line
if line contains '//endif' set skip=false
Given the file input.txt:
tiago@dell:~$ cat input.txt
this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345
And the script validator.pl:
tiago@dell:~$ cat validator.pl
#! /usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
sub validate_s0 {
my $line = $_[0];
if ( $line =~ /\d{5}/ ){
return "true";
}
return "false";
}
sub validate_s1 {
my $line = $_[0];
if ( $line =~ /\.$/ ){
return "false";
}
return "true";
}
sub validate_s2 {
my $line = $_[0];
if ( $line =~ /.*?\(.*\d{5}.*?\).*/ ){
return "false";
}
return "true";
}
my $skip = "false";
while (<>){
my $line = $_;
if( $line =~ /if\(/ ){
$skip = "true";
}
if ( $skip eq "false" ) {
my $s0_status = validate_s0 "$line";
my $s1_status = validate_s1 "$line";
my $s2_status = validate_s2 "$line";
if ( $s0_status eq "true"){
if ( $s1_status eq "true"){
if ( $s2_status eq "true"){
print "$line";
}
}
}
}
if ( $line =~ /\/\/endif/) {
$skip="false";
}
}
Execution:
tiago@dell:~$ cat input.txt | perl validator.pl
it should match 12345
it should match 12345
it should match 12345

Not sure if this would help you or not, but I am providing a solution considering the following assumptions -
You need an elegant solution to check all the conditions
Conditions can change in future and anytime.
One condition should not depend on others.
However I considered also the following -
The file given has minimal errors in it. If it does, then my code might need some modifications to cope with that.
I used Stack to keep track of if( blocks.
Ok here is the solution -
I used C# and with it MEF (Managed Extensibility Framework) to implement the configurable parsers. The idea is to use a single parser to parse and a list of configurable validator classes to validate each line and return true or false based on the validation. You can then add or remove any validator at any time, or add new ones if you like. So far I have implemented validators for the s1, s2 and s3 you mentioned; see the individual checker classes below. You will have to add classes for s4 and s5 if you need them in future.
First, Create the Interfaces -
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FileParserDemo.Contracts
{
public interface IParser
{
String[] GetMatchedLines(String filename);
}
public interface IPatternMatcher
{
Boolean IsMatched(String line, Stack<string> stack);
}
}
Then comes the file reader and checker -
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using FileParserDemo.Contracts;
using System.ComponentModel.Composition.Hosting;
using System.ComponentModel.Composition;
using System.IO;
using System.Collections;
namespace FileParserDemo.Parsers
{
public class Parser : IParser
{
[ImportMany]
IEnumerable<Lazy<IPatternMatcher>> parsers;
private CompositionContainer _container;
public void ComposeParts()
{
var catalog = new AggregateCatalog();
catalog.Catalogs.Add(new AssemblyCatalog(typeof(IParser).Assembly));
_container = new CompositionContainer(catalog);
try
{
this._container.ComposeParts(this);
}
catch
{
}
}
public String[] GetMatchedLines(String filename)
{
var matched = new List<String>();
var stack = new Stack<string>();
using (StreamReader sr = File.OpenText(filename))
{
String line = "";
while (!sr.EndOfStream)
{
line = sr.ReadLine();
var m = true;
foreach(var matcher in this.parsers){
m = m && matcher.Value.IsMatched(line, stack);
}
if (m)
{
matched.Add(line);
}
}
}
return matched.ToArray();
}
}
}
Then comes the implementation of individual checkers, the class names are self explanatory, so I don't think they need more descriptions.
using FileParserDemo.Contracts;
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
namespace FileParserDemo.PatternMatchers
{
[Export(typeof(IPatternMatcher))]
public class MatchAllNumbers : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("\\d+");
return regex.IsMatch(line);
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveIfBlock : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("if\\(");
if (regex.IsMatch(line))
{
foreach (var m in regex.Matches(line))
{
//push the if
stack.Push(m.ToString());
}
//ignore current line, and will validate on next line with stack
return true;
}
regex = new Regex("//endif");
if (regex.IsMatch(line))
{
foreach (var m in regex.Matches(line))
{
stack.Pop();
}
}
return stack.Count == 0; //if stack has an item then ignoring this block
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveWithEndPeriod : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("(?m)(?!\\d+.*?\\.$)\\d+");
return regex.IsMatch(line);
}
}
[Export(typeof(IPatternMatcher))]
public class RemoveWithInParenthesis : IPatternMatcher
{
public Boolean IsMatched(String line, Stack<string> stack)
{
var regex = new Regex("\\(.*\\d+.*\\)");
return !regex.IsMatch(line);
}
}
}
The program -
using FileParserDemo.Contracts;
using FileParserDemo.Parsers;
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace FileParserDemo
{
class Program
{
static void Main(string[] args)
{
var parser = new Parser();
parser.ComposeParts();
var matches = parser.GetMatchedLines(Path.GetFullPath("test.txt"));
foreach (var s in matches)
{
Console.WriteLine(s);
}
Console.ReadLine();
}
}
}
For testing I took @Tiago's sample file as Test.txt, which had the following lines -
this is a text
it should match 12345
if(
it should not match 12345
//endif
it should match 12345
it should not match 12345.
it should not match ( blabla 12345 blablabla )
it should not match ( 12345 )
it should match 12345
Gives the output -
it should match 12345
it should match 12345
it should match 12345
Don't know if this will help you or not, but I did have a fun time playing with it.... :)
The best part with it is that, for adding a new condition all you have to do is provide an implementation of IPatternMatcher, it will automatically get called and thus will validate.

Same as @zx81's (*SKIP)(*F), but using a negative lookahead assertion.
(?m)(?:if\(.*?\/\/endif|\([^()]*\))(*SKIP)(*F)|\b\d+\b(?!.*\.$)
DEMO
In Python, I would do it like this:
import re
string = """cat 123 sat.
I like 000 not (456) though 111 is fine
222 if( //endif if(cat==789 stuff //endif 333"""
for line in string.split('\n'):  # Split the input on newlines and iterate over the lines.
    if not line.endswith('.'):  # Skip any line that ends with a dot.
        for i in re.split(r'\([^()]*\)|if\(.*?//endif', line):  # Split the line on parenthesised groups and if(...//endif blocks, then iterate over the remaining pieces.
            for j in re.findall(r'\b\d+\b', i):  # Find all the numbers in each remaining piece.
                print(j)  # Print the numbers one by one.
Output:
000
111
222
333

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) is largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.

I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index, and it must match according to word position: e.g. it should short-circuit if the first word of the input is not found among the first words of the patterns. It should also handle the match-anything placeholder, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and build a full-text index on it. This will not match according to word position, but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match (the ones that fill a %s), so the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index descends recursively according to word position: search for the first word, and if it is found, search for the next word within the object returned by the first, and so on. Identical words at a given level share a single key. You should also handle the match-anything case (%s). This should be blindingly fast in memory.
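The word-position index described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions of mine: the helper names (`tokenize`, `build_index`, `lookup`) are made up, a `%s` stands for exactly one word, apostrophes are dropped so that "isn't" becomes "isnt" as in the example index, and the end of a pattern is marked with a `None` key that stores the original pattern instead of a bare null.

```python
import re

WILDCARD = "%s"

def tokenize(text):
    # Drop apostrophes, then keep only the %s wildcard and runs of word characters.
    return re.findall(r"%s|\w+", text.replace("'", ""))

def build_index(patterns):
    # Nested dicts keyed by word, one level per word position.
    root = {}
    for pattern in patterns:
        node = root
        for word in tokenize(pattern):
            node = node.setdefault(word, {})
        node[None] = pattern  # end-of-pattern marker (an assumption of this sketch)
    return root

def lookup(index, text):
    # Walk the index word by word; fall back to the wildcard when the literal word is absent.
    node = index
    for word in tokenize(text):
        if word in node:
            node = node[word]
        elif WILDCARD in node:
            node = node[WILDCARD]
        else:
            return None  # short-circuit: no pattern has this word at this position
    return node.get(None)

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
index = build_index(patterns)
print(lookup(index, "Will you be meeting Linda today, John?"))
```

Note that this walk never backtracks: a literal word is always preferred over the wildcard, which is fine for the four sample patterns but may need refining for a larger library.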

My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re

def create_random_sentence():
    # Build a sentence of 4-10 random lowercase "words" of 3-10 letters each.
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.ascii_lowercase) for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print(ret)
    return ret

# The four real patterns; the literal . and ? are escaped so they are not treated as metacharacters.
patterns = [r"Hi there, [a-zA-Z]+\.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]

# Pad the list with random sentences to stress the regexp engine.
for i in range(95):
    patterns.append(create_random_sentence())

# One big alternation with one capturing group per pattern.
monster_pattern = "|".join("(%s)" % x for x in patterns)
print(monster_pattern)
print("--------------")
monster_regexp = re.compile(monster_pattern)

inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000

for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print(".", end=" ")
    else:
        print("x", end=" ")
I've created about a hundred patterns (4 + 95 = 99), which is as many groups as the Python regexp library accepted when I wrote this; the 100-group cap was removed in Python 3.5. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with one group per pattern: (group1)|(group2)|(group3)|.... That's the monster_regexp. I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.).
Testing one string against this tests it against all the patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success, it will be comparatively quick. Failures will have to run through the whole regexp so they will be slower. You can order the patterns by how frequently they occur in the input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, running a string against such a large regexp and failing takes longer, so I changed the inputs to lots of randomly generated strings that won't match and tried again. With 10000 strings, none of which match the monster_regexp, I got this:
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
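As for fetching out the exact group that matched in a combined regexp like the one above: one option in Python is the match object's `lastindex` attribute. The following is a minimal sketch, assuming that no individual pattern contains capturing groups of its own (otherwise the group numbers no longer line up with the pattern list). The name `which_pattern` is made up for this example, and it uses `fullmatch` rather than `search` so the whole input has to match.

```python
import re

patterns = [
    r"Hi there, [a-zA-Z]+\.",
    r"What a lovely day today!",
    r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
    r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?",
]
# One capturing group per pattern, so group number N belongs to patterns[N - 1].
combined = re.compile("|".join("(%s)" % p for p in patterns))

def which_pattern(text):
    """Return the index into `patterns` of the alternative that matched, or None."""
    m = combined.fullmatch(text)
    if m is None:
        return None
    return m.lastindex - 1

print(which_pattern("Will you be meeting Linda today, John?"))
print(which_pattern("Goobledigoock"))
```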

Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . and ? have special meaning in regular expressions,
    # so escape the string to have them matched literally.
    # re.escape may escape % as well (older Python versions do), so swap %s
    # for the placeholder XXX before escaping and put \w+ in afterwards.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, return the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = next(i for i, group in enumerate(m.groups()) if group is not None)
        return patterns[index]

# Testing with a couple of patterns
print(match("Hi there, John."))
print(match("Will you be meeting Linda today, John?"))

Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and make it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the trie data structure.

This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performance.

The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Many regex engines (Perl, PCRE and Java, for example) have a "quoted string" option (\Q ... \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want to use a single regex to find which pattern matched, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell.
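Python's re module has no \Q...\E, but the same quoting idea can be sketched with `re.escape`: split each template on `%s`, escape the literal pieces, and join them with a capturing group. The function names and the choice of `(.+?)` for the placeholder are assumptions made for this illustration, not something taken from the answers above.

```python
import re

def template_to_regex(template, placeholder="%s"):
    # Escape every literal piece and put a lazy capturing group where each %s was.
    literal_parts = [re.escape(part) for part in template.split(placeholder)]
    return re.compile("(.+?)".join(literal_parts))

def find_template(text, templates):
    """Return (template, captured values) for the first template that matches, else None."""
    for template in templates:
        m = template_to_regex(template).fullmatch(text)
        if m:
            return template, m.groups()
    return None

templates = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
print(find_template("Will you be meeting Linda today, John?", templates))
print(find_template("Goobledigoock", templates))
```

With around 1,500 templates it would probably be worth compiling the regexes once up front rather than on every call as this sketch does.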

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing I want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T

pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
More info is at regular-expressions.info; for a comparison of lookaround support across regex flavors, scroll down about two thirds of the page.
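As an illustration, here is how the lookbehind reads in Python, whose re module supports fixed-width lookbehind. The sample text is the output quoted in the question; the capture-group variant is included as an alternative for flavors that lack lookbehind.

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

# Positive lookbehind: the match itself is only the digits.
lookbehind = re.search(r"(?<=failures = )[0-9]+", output)

# Capture group: match the whole phrase, then read only the captured digits.
captured = re.search(r"failures = ([0-9]+)", output)

print(lookbehind.group(0))
print(captured.group(1))
```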

If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"
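The same split-on-"=" idea could look like this in Python. It is a small sketch with a made-up function name (`failure_count`), and it assumes every relevant line has the form "text = number".

```python
def failure_count(output):
    """Return the number after '=' on the first line mentioning 'failures', or None."""
    for line in output.splitlines():
        if "failures" in line:
            # Take the last element of the split; int() ignores the surrounding spaces.
            return int(line.split("=")[-1])
    return None

sample = (
    "Number of assemblies processed = 1200\n"
    "Number of assemblies uninstalled = 1197\n"
    "Number of failures = 3"
)
print(failure_count(sample))
```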
