Regular expressions: How to combine AND & OR operators - regex

I'm trying to write a single regular expression that combines AND / OR operators, but I can't find a way to do it.
I'm using the PHP PCRE regex engine.
I have a text and I want to check whether it matches rules that allow / disallow words.
Example:
Rule allow: I want to allow all text which contains some
Rule disallow: I want to disallow if the text contains some page
Text 1: I found some trick.
Text 2: I found some pages in the book.
For the moment I'm building a single regex to check whether the text does not match:
//starting with not allowing
//continue with disallowing
/^(?=(?!(some.*)))(?=(some page)).*$/
This is my problem: if the allow rule contains the disallow rule, I can't get a valid regex. It never matches.
I've checked other regex operators but couldn't find a valid way to build my pattern.
I want to generate a single regex here to push the limits of the regex capabilities (as far as I know them for the moment :)).
This process works perfectly when the allow rule does not contain the disallow rule:
Rule allow: I want to allow if the text contains some page
Rule disallow: I want to disallow if the text contains some
In this order it works...

Could you just use a negative lookahead?
some(?! text)
https://regex101.com/r/bA7eV2/3
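One way to read the two rules from the question as a single pattern is to put the disallow check in a negative lookahead and the allow check in a positive lookahead, both anchored at the start of the text. Below is a minimal sketch using Python's re module (the question uses PHP PCRE, where this lookahead syntax is the same). build_rule is a hypothetical helper name, not something from the original post.

```python
import re

def build_rule(allow, disallow):
    """Hypothetical helper: text must contain `allow` AND must not contain `disallow`."""
    return re.compile(
        r"^(?!.*" + re.escape(disallow) + r")"  # NOT: the disallowed phrase appears nowhere
        r"(?=.*" + re.escape(allow) + r")"      # AND: the allowed word appears somewhere
        r".*$"
    )

rule = build_rule("some", "some page")

for text in ["I found some trick.", "I found some pages in the book."]:
    verdict = "allowed" if rule.search(text) else "rejected"
    print(f"{text!r}: {verdict}")
```

Because each lookahead scans the whole text independently, it no longer matters that the allow word is a prefix of the disallow phrase.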

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
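For illustration, the pattern described in Edit 2 could be generated along these lines. This is only a sketch in Python: the separator class is copied from the example above, build_loose_pattern is a made-up name, and the result is checked with Python's re module, whose syntax is not identical to the Lucene regex dialect that Azure Cognitive Search uses.

```python
import re

# Optional token-breaking character, copied from the example pattern in the question.
SEPARATOR = r"[-'\.! ]?"

def build_loose_pattern(query):
    """Hypothetical helper: allow one optional separator around every character."""
    # Remove separators the user typed, so "home-work" and "homework" give the same pattern.
    letters = re.sub(r"[-'\.! ]", "", query)
    return SEPARATOR + SEPARATOR.join(re.escape(ch) for ch in letters) + SEPARATOR

pattern = build_loose_pattern("homework")
print(pattern)
print(bool(re.fullmatch(pattern, "home-work")))
```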
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 456 G
ABC123DEF456G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G
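A rough sketch of both steps in Python, under a few assumptions: "special characters" means anything that is not an ASCII letter or digit, "cleaning" the input is reduced to trimming whitespace (a real query would also need the search syntax's reserved characters escaped), and the function names are invented for this example. The field name is the one used in the example query.

```python
import re

def strip_special(code):
    """Content processing: keep only ASCII letters and digits (assumed definition)."""
    return re.sub(r"[^A-Za-z0-9]", "", code)

def build_query(user_input):
    """Query processing: search the input as typed OR its stripped variation."""
    cleaned = user_input.strip()  # placeholder for real input cleaning
    return f"{cleaned} OR OrderingCodeVariation:{strip_special(cleaned)}"

print(build_query("ABC.123.DE.F.4.56G"))
```

The same strip_special function would be applied to the stored ordering codes when the index is populated, so that both sides are compared in the same normalized form.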
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired replacement, for example an empty string.

Writing valid RegEx for use in file/folder exclusion

I'm trying to write two expressions to use in the files/folder Exclusion List for Code42 CrashPlan backup. Their support won't help with RegEx expressions, they just point me to their KB article.
In their "File Exclusions" section, I'd like to:
exclude this folder specifically: S:\Google Drive\Temp
any file or folder containing the string Backup_Excluded anywhere in its name.
This is what I've got so far - but I have no way of knowing if they're correct:
(?i).*Google Drive\\Temp ...but since I really want to exclude a specific folder, not a pattern - do I need to escape the slashes and colon in the path of S:\Google Drive\Temp
(?i).*Backup_Excluded
Research disclaimer: I know there are RegEx resources out there, but am unsure which flavor/syntax to use, as I'd imagine there are many. I was hoping those with more RegEx familiarity could advise.
The link you posted says:
The Code42 app treats all file separators as forward slashes /.
So it seems you'd want to use / instead of \\ in your regular expressions.
Colon doesn't need escaping.
\ needs escaping because it's the escaping character itself.
/ normally needs escaping because it is the default delimiter around a regular expression. However, the examples in your link don't use delimiters at all, so only the pattern itself is expected and no escaping is needed.
Then you could probably use:
S:/Google Drive/Temp
or [A-Z]:/Google Drive/Temp (to allow any drive)
.*Backup_Excluded.*
I probably wouldn't use (?i), as the capitals in those strings are usually there, but that's your call.
Check out e.g. https://regex101.com/ to test your regular expressions (also in different flavours).
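If you want to try the patterns locally first, something like the following Python sketch might help. It assumes that paths are converted to forward slashes before matching (as the quoted KB sentence says) and that a pattern matching anywhere in the path counts as an exclusion; how the Code42 app actually anchors its patterns is not confirmed here, and the function names are made up.

```python
import re

FOLDER_RULE = re.compile(r"[A-Z]:/Google Drive/Temp")
NAME_RULE = re.compile(r".*Backup_Excluded.*")

def to_forward_slashes(path):
    """Mimic the documented behaviour of treating every separator as '/'."""
    return path.replace("\\", "/")

def is_excluded(path):
    """Assumption: a rule that matches anywhere in the normalized path excludes it."""
    normalized = to_forward_slashes(path)
    return bool(FOLDER_RULE.search(normalized) or NAME_RULE.search(normalized))

for sample in [r"S:\Google Drive\Temp\draft.txt",
               r"S:\Docs\report_Backup_Excluded.docx",
               r"S:\Google Drive\Work\notes.txt"]:
    print(sample, is_excluded(sample))
```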

Non printable regex

I need a regex that will match non-printable characters. I have a hailstorm spammer who is abusing my network and getting past my PCRE-based heuristic filter by obfuscating his subjects with non-printable characters. As a result, any text-based rules I create are bypassed because there is no match.
For example:
The regular text based subject: Reduce tech cоsts with clоud cоmputing
The obfuscated subject:
Reduce tech cоsts with clоud cоmputing
ReduÑe teÑh cоsts with Ñlоud Ñоmputing
ReduÑe teÑh Ñosts with Ñloud Ñomputing
Rеducе tеch cоsts with Ñlоud Ñоmputing
What I am looking for is a regex that I can modify to match all of the phrases that have been used and build a list of regexes.
Maybe, if I can get a regex that will match the subjects, I can meta them together with other matching header information that will thwart these messages.
Any help would be much appreciated.
You can use the following to match.
(Reduce|ReduÑe|Rеducе)\s*(tech|teÑh|tеch)\s*
(cоsts|Ñosts)\s*(with)\s*(clоud|Ñlоud|Ñloud)\s*
(cоmputing|Ñоmputing|Ñomputing)
You can add each new variant of a keyword to its group (reduce, tech, etc.), and the regex above then handles every combination of phrases that can be built from those variants.
For example, the regex above covers 3 x 3 x 2 x 1 x 3 x 3 = 162 ways of writing the spam subject with the given keywords.
EDIT: You can use [^\w\s."'\/\\=!##$%^&*(){}\[\]?><,+|`~-]+ for checking if subject contains characters that are not printable, and take actions on it. (If you are using this, you might need to add other regexes to handle spam phrases that can be created with printable characters)
Demo with PCRE
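A simpler variant of the character-class check, sketched in Python: instead of listing every allowed punctuation mark, flag anything outside the printable ASCII range (space through tilde). This is a different, stricter test than the class above, and it assumes legitimate subjects on your network are plain ASCII, which will not hold for international mail. The function name is invented for the example.

```python
import re

# Anything that is not a printable ASCII character (0x20 space .. 0x7E tilde).
OUTSIDE_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]+")

def suspicious_runs(subject):
    """Return every run of characters outside printable ASCII found in the subject."""
    return OUTSIDE_PRINTABLE_ASCII.findall(subject)

plain = "Reduce tech costs with cloud computing"
obfuscated = "Reduce tech c\u043ests with cl\u043eud computing"  # Cyrillic 'o' look-alikes

print(suspicious_runs(plain))
print(suspicious_runs(obfuscated))
```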

Google Analytics Regular Expressions

I'm kinda new to regular expressions and, for the benefit of learning, wanted to know how to do the following on one line:
page matching regular expression: .pdf/$
and page containing "somestring"
and page excluding "someotherstring"
I can obtain my desired output using the 3 rules above. My question is: can I put them all into one line using a regular expression? So the first line would be something like:
page matching reg exp: .pdf/$ somestring+ (then regex for does not contain in GA) someotherstring
Is it possible to put it all into one?
Lookahead lets you match multiple independent conditions in one expression, and it even lets you require that something does not match. In your case:
/^(?=.*somestring)(?!.*someotherstring).*\.pdf$/
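To see how the three conditions interact, here is a small check of that expression in Python (without the surrounding / delimiters, which Python does not use). The page paths are invented sample data, and whether a given Google Analytics field accepts lookaheads is not verified here.

```python
import re

# Ends in .pdf AND contains "somestring" AND does not contain "someotherstring".
PAGE_RULE = re.compile(r"^(?=.*somestring)(?!.*someotherstring).*\.pdf$")

pages = [
    "/docs/somestring/guide.pdf",             # all three conditions hold
    "/docs/somestring/someotherstring.pdf",   # excluded string present
    "/docs/guide.pdf",                        # required string missing
    "/docs/somestring/guide.html",            # wrong extension
]

for page in pages:
    print(page, bool(PAGE_RULE.search(page)))
```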

negative look ahead to exclude html tags

I'm trying to come up with a validation expression to prevent users from entering html or javascript tags into a comment box on a web page.
The following works fine for a single line of text:
^(?!.*(<|>)).*$
..but it won't allow any newline characters because of the dot(.). If I go with something like this:
^(?!.*(<|>))(.|\s)*$
it will allow multiple lines but the expression only matches '<' and '>' on the first line. I need it to match any line.
This works fine:
^[-_\s\d\w"'\.,:;#/&\$\%\?!#\+\*\\(\)]{0,4000}$
but it's ugly and I'm concerned that it's going to break for some users because it's a multi-lingual application.
Any ideas? Thanks!
Note that your RE prevents users from entering < and >, in any context. "2 > 1", for example. This is very undesirable.
Rather than trying to use regular expressions to match HTML (which they aren't well suited to do), simply escape < and > by transforming them to &lt; and &gt;. Alternatively, find a package for your language-of-choice that implements whitelisting to allow a limited subset of HTML, or that supports its own markup language (I hear markdown is nice).
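As an illustration of the escaping approach, here is a sketch in Python using the standard library's html.escape (the question is about ASP.NET, so this only shows the idea, not the API you would call there). render_comment is a made-up name.

```python
import html

def render_comment(raw_comment):
    """Escape markup characters so the comment is displayed as text, not parsed as HTML."""
    return html.escape(raw_comment)

print(render_comment("2 > 1"))
print(render_comment("<script>alert(1)</script>"))
```

With this in place the user can still type "2 > 1", and a pasted script tag is shown literally instead of being executed.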
As for "." not matching newline characters, some regexp implementations support a flag (usually "m" for "multi-line" and "s" for "single line"; the latter causes "." to match newlines) to control this behavior.
The first two are basically equivalent to /^[^<>]*$/, except this one works on multiline strings. Any reason why you didn't write the RE that way?
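The difference between the two points above can be checked quickly. The sketch below uses Python's re module, where the "single line" flag is called re.DOTALL; the lookahead is written with [<>] instead of (<|>), which matches the same characters. The helper name is made up.

```python
import re

DOT_DEFAULT = re.compile(r"^(?!.*[<>]).*$")            # "." stops at a newline
DOT_ALL = re.compile(r"^(?!.*[<>]).*$", re.DOTALL)     # "." also matches a newline
NEGATED_CLASS = re.compile(r"^[^<>]*$")                # a negated class matches newlines anyway

def accepts(pattern, text):
    """True when the whole input passes the validation pattern."""
    return pattern.match(text) is not None

clean = "line one\nline two"
tagged = "line one\n<b>line two</b>"

for name, pattern in [("default", DOT_DEFAULT), ("dotall", DOT_ALL), ("negated", NEGATED_CLASS)]:
    print(name, accepts(pattern, clean), accepts(pattern, tagged))
```

The negated character class needs no flag at all, which matters when the regex engine's options are not under your control.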
So, I looked into it and there is a .NET 'SingleLine' option for regular expressions that causes "." to also match the newline character. Unfortunately, this isn't available in the ASP.NET RegularExpressionValidator. As far as I can see, there's no way to make something like ^(?!.*(<\w+>)).*$ work on a multi-line textbox without doing server-side validation.
I took your advice and went the route of escaping the tags on the server side. This requires setting the validation page directive to 'false' but in this particular instance that isn't a big deal because the comment box is really the only thing to worry about.