I am in need of a regex that will match non-printable characters. The reason is that I have a hailstorm spammer who is abusing my network and getting past my PCRE-based heuristic filter by obfuscating his subjects with non-printable characters. Therefore, any text-based rules I create are bypassed because there is no match.
For example:
The regular text based subject: Reduce tech cоsts with clоud cоmputing
The obfuscated subject:
Reduce tech cоsts with clоud cоmputing
ReduÑe teÑh cоsts with Ñlоud Ñоmputing
ReduÑe teÑh Ñosts with Ñloud Ñomputing
Rеducе tеch cоsts with Ñlоud Ñоmputing
What I am looking for is a regex that I can modify to match all of the phrases that have been used and build a list of regexes.
Maybe, if I can get a regex that will match the subjects, I can meta them together with other matching header information that will thwart these messages.
Any help would be much appreciated.
You can use the following to match.
(Reduce|ReduÑe|Rеducе)\s*(tech|teÑh|tеch)\s*
(cоsts|Ñosts)\s*(with)\s*(clоud|Ñlоud|Ñloud)\s*
(cоmputing|Ñоmputing|Ñomputing)
You can add the unique keywords that have been used in the particular group (reduce, tech, etc) and the above regex handles the different combinations of phrases that can be made using the keywords.
For example, the above regex covers 3 x 3 x 2 x 1 x 3 x 3 = 162 ways of building the spam subject from the given keywords.
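As a rough sketch of how such a pattern could be generated instead of typed by hand, the Python snippet below joins per-word variant lists into one regex and counts the combinations it covers. The variant lists and the names `build_pattern` and `count_combinations` are illustrative assumptions, not part of the original rule set. The look-alike letters are written as explicit escapes so they stay visible.

```python
import re
from math import prod

# Hypothetical variant lists. \u043e is CYRILLIC SMALL LETTER O and \u0441 is
# CYRILLIC SMALL LETTER ES (it looks like a Latin "c"). Extend as needed.
WORD_VARIANTS = [
    ["Reduce", "Redu\u0441e"],
    ["tech", "te\u0441h"],
    ["costs", "c\u043ests", "\u0441osts"],
]


def build_pattern(word_variants):
    """Join each word's variants into a group, separated by optional whitespace."""
    groups = [
        "(?:" + "|".join(re.escape(v) for v in variants) + ")"
        for variants in word_variants
    ]
    return re.compile(r"\s*".join(groups))


def count_combinations(word_variants):
    """Number of distinct phrases the generated pattern covers."""
    return prod(len(variants) for variants in word_variants)


pattern = build_pattern(WORD_VARIANTS)
print(count_combinations(WORD_VARIANTS))
print(bool(pattern.search("Redu\u0441e te\u0441h c\u043ests")))
```

Adding a newly observed spelling then only means appending one string to the relevant list.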
EDIT: You can use [^\w\s."'\/\\=!@#$%^&*(){}\[\]?><,+|`~-]+ to check whether the subject contains characters that are not printable, and act on it. (If you use this, you might need to add other regexes to handle spam phrases that can be built from printable characters.)
Demo with PCRE
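For experimenting outside the mail filter, the same idea can be sketched in Python. This is only an approximation of the PCRE rule: the `re.ASCII` flag is an assumption made here so that `\w` and `\s` stay ASCII-only and look-alike letters from other alphabets fall into the negated class. `has_suspicious_chars` is a made-up helper name.

```python
import re

# Anything that is not an ASCII word character, whitespace, or common
# punctuation. re.ASCII keeps \w from matching Cyrillic or accented letters.
SUSPICIOUS = re.compile(r"""[^\w\s."'/\\=!@#$%^&*(){}\[\]?><,+|`~-]+""", re.ASCII)


def has_suspicious_chars(subject: str) -> bool:
    """Return True if the subject contains a character outside the allowed set."""
    return SUSPICIOUS.search(subject) is not None


print(has_suspicious_chars("Reduce tech costs with cloud computing"))
print(has_suspicious_chars("Reduce tech c\u043ests with cl\u043eud c\u043emputing"))
```

A hit only says that something unusual is present, so it is probably best combined with other header checks before a message is rejected.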
I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
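The construction described in Edit 2 could be generated along these lines. This is a minimal Python sketch that only illustrates how the pattern string is built. It is tested with Python's `re` module, and the Lucene regex syntax used by Azure Cognitive Search may need different escaping. `SEPARATORS` and `build_loose_pattern` are illustrative names.

```python
import re

# Characters that cause token breaks in this example (taken from the question).
SEPARATORS = "-'.! "
SEP_CLASS = "[" + re.escape(SEPARATORS) + "]?"


def build_loose_pattern(query: str) -> str:
    """Allow one optional separator before, between, and after the query's characters."""
    chars = [re.escape(ch) for ch in query if ch not in SEPARATORS]
    return SEP_CLASS + SEP_CLASS.join(chars) + SEP_CLASS


loose = build_loose_pattern("homework")
print(bool(re.fullmatch(loose, "home-work")))
print(bool(re.fullmatch(loose, "homework")))
```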
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 456 G
ABC123DEF456G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G, I would effectively search for
ABC.123.DE.F.4.56G OR OrderCodeVariation:ABC123DEF456G
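A minimal Python sketch of the two processing steps might look like this. The helper names `strip_special` and `build_query` are made up for illustration, the field name is passed in as a parameter, and a real query would also need the search service's own escaping rules applied to the verbatim part.

```python
import re


def strip_special(code: str) -> str:
    """Content processing: keep only letters and digits of an ordering code."""
    return re.sub(r"[^A-Za-z0-9]", "", code)


def build_query(user_input: str, field: str = "OrderCodeVariation") -> str:
    """Query processing: search the verbatim input OR its stripped variation."""
    cleaned = user_input.strip()
    return f"{cleaned} OR {field}:{strip_special(cleaned)}"


print(strip_special("ABC-123-DE_F-4.56G"))
print(build_query("ABC.123.DE.F.4.56G"))
```

Because both the indexed copy and the query go through the same stripping step, every spelling in the list above collapses to the same value.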
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired character, or with an empty string to remove them.
I want to submit a very specific performance problem that I would like to understand.
Goal
I'm trying to validate a custom syntax with a regex. I don't usually run into performance issues with regexes, so I like to use them.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid syntax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You can find the regex and a test text here:
https://regexr.com/3jama
I hope that is sufficient; I don't know how to explain what I want to match any better than with the regex itself ;-).
Issue
Applying the regex to valid text costs very little; it's almost instant.
But with certain invalid texts, the regexr app hangs. This is not specific to regexr, since I also saw dramatic slowdowns with my own Java and JavaScript code.
My need is to validate the text as the user types. I could even validate the text on click, but I cannot afford for the app to hang if the text submitted by the user is structured like the case below, or like any other input that produces the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text.
So the invalid text that triggers the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test, this one with no performance drop, could be:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'd be glad if a passing regex guru could explain what I'm doing wrong, or why my use case isn't suited to regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+, which opens the gate to catastrophic backtracking: a run of n identifier characters can be split across the repetitions in 2^(n-1) different ways, and the engine tries all of them before giving up.
I get what you are trying to do with this pattern: match an identifier, and maybe other identifiers separated by commas. The proper way of doing this, however, is ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there isn't an ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
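As one possible way to do that split, here is a Python sketch that assembles an unrolled pattern from named pieces. It follows the structure of the condensed regex (no \s* handling) and uses non-capturing groups throughout. The piece names (`IDENT`, `ITEM`, `SECTION`, and so on) are made up for readability, and the result is meant to be in the spirit of the pattern above, not a verified drop-in replacement.

```python
import re

# One identifier: anything except brackets, braces, parentheses and commas.
IDENT = r"[^\][{}(),]+"
# Comma-separated identifiers, unrolled so there is only one way to match.
IDENT_LIST = IDENT + r"(?:," + IDENT + r")*"
# An item: an identifier, optionally followed by a parenthesised synonym list.
ITEM = IDENT + r"(?:\(" + IDENT_LIST + r"\))?"
ITEM_LIST = ITEM + r"(?:," + ITEM + r")*"
# [item,item] and [[...][...]] with one or two inner lists.
INNER = r"\[" + ITEM_LIST + r"\]"
OUTER = r"\[(?:" + INNER + r"){1,2}\]"
# {Section} followed by any number of outer blocks.
SECTION = r"\{" + IDENT + r"\}(?:" + OUTER + r")*"
SYNTAX = re.compile(r"^(?:" + SECTION + r")+$")

valid = ("{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]]"
         "[[actor3,actor4(syno3, syno4)][expr3,expr4]]")
truncated = valid[:-2]  # the invalid case from the question

print(SYNTAX.match(valid) is not None)
print(SYNTAX.match(truncated) is not None)
```

Every identifier is followed by a character it cannot itself contain, so a failed match leaves the engine nothing to retry and the truncated input is rejected quickly.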
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.
So I read a lot about Negation in Regex but can't solve my problem in MS Word 2016.
How do I exclude a String, Word, Number(s) from being found?
Example:
<[A-Z]{2}[A-Z0-9]{9;11}> to search a String like XY123BBT22223
But how do I exclude a specific one, for example SEDWS12WW04?
Well it depends on what you need to achieve or is this a matter of curiosity... RegEx is not the same as the built-in Advanced Find with Wildcards; for that you need VBA.
Depending on your need, without using VBA, you could make use of space and return characters - something like this will work for the strings provided: [ ^13][A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,}[ ^13] (assuming you use normal carriage returns and spaces in your document)
Anyway, this is a good article on wildcard searches in MS Word: https://wordmvp.com/FAQs/General/UsingWildcards.htm
EDIT:
In light of your further comments you will probably want to look at section 8 of the linked article which explains grouping. For my proposed search you can use this to your advantage by creating 3 groups in your 'find' and only modifying the middle group, if indeed you do intend to modify. Using groups the search would look something like:
([ ^13])([A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,})([ ^13])
and the replace might look like this:
\1 SOMETHING \3
Note also: compared to a RegEx solution my suggestion is kinda lame, mainly because, compared to RegEx, MS Word's find and replace (good as it is, and really it is) is kinda lame... it's hacky, but it might work for you (although you might need to do a few searches).
BUT... if it really is REGEX that you want, well you can get access to this via VBA: How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
And... then you will be able to use proper RegEx for find and replace, well almost - I'm under the impression that the VBA RegEx still has some quirks...
As already noted by others, this is not possible in Microsoft Word's flavor of regular expressions.
Instead, you should use standard regular expressions. It is actually possible to use standard regular expressions in MS Word if you use a special tool that integrates into Microsoft Word called Multiple Find & Replace (see http://www.translatortools.net/products/transtoolsplus/word-multiplefindreplace). This tool opens as a pane to the right of the document window and works just like the Advanced Find & Replace dialog. However, in addition to Word's existing search functionality, it can use the standard regular expressions syntax to search and replace any text within a Word document.
In your particular case, I would use this:
\b[A-Z]{2}[A-Z0-9]{9,11}\b(?<!\bSEDWS12WW04)
To explain, this searches for a word boundary + ID + word boundary, and then it looks back to make sure that the preceding string does not match [word boundary + excluded ID]. In a similar vein, you can do something like
(?<!\bSEDWS12WW04|\bSEDWS12WW05|\bSEDWS12WW06)
to exclude several IDs.
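To try the lookbehind idea outside Word, the following Python sketch builds the same kind of pattern from a list of excluded IDs. This is an assumption-laden illustration: it runs in Python's `re` module, not in the Word add-in, and Python requires every lookbehind alternative to have the same length, which holds here because both sample IDs are 11 characters long. The sample text is invented.

```python
import re

# IDs that must not be reported. Python's lookbehind needs all alternatives
# to have the same width, so these must be equally long.
EXCLUDED = ["SEDWS12WW04", "SEDWS12WW05"]

lookbehind = "(?<!" + "|".join(r"\b" + re.escape(i) for i in EXCLUDED) + ")"
ID_PATTERN = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9,11}\b" + lookbehind)

text = "XY123BBT22223 SEDWS12WW04 AB123456789 SEDWS12WW05"
print(ID_PATTERN.findall(text))
```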
Multiple Find & Replace is quite powerful: you can add any number of expressions (either using regular expressions or using Word's standard search syntax) to a list and then search the document for all of them, replace everything, display all matches in a list and replace only specific matches, and a few more things.
I created this tool for translators and editors, but it is great for any advanced search/replace operations in Word, and I am sure you will find it very useful.
Best regards, Stanislav
I've got a list of devices in a database, such as Model X123 ABC. Another portion of the system accepts user input and needs to, as well as possible, match their entries to the existing devices. But the users have the ability to enter anything they want. They might enter the above model as Model 100 ABC X123 or Model X123.
Understand, this is a general example, and the permutations of available models and matching user entries is enormous, and I'm just trying to match as many as possible so that the manual corrections can be kept to a minimum.
The system is built in FileMaker, but has access to pretty much any plugin I wish, which means I have access to Groovy, PHP, JavaScript, etc. I'm currently working with Groovy using the ScriptMaster plugin for other simple regex pattern matching elsewhere, and I'm wondering if the most straightforward way to do this is to use regex.
My thinking with regex is that I'm looking for patterns, but I'm unsure if I can say, "Assign this grouping to this pattern regardless of where it is in the order of pattern groups." Above, I want to find if the string contains three patterns: (?i)\bmodel\b, (?i)\b[a-z]\d{3}\b, and (?i)\b[a-z]{3}\b, but I don't care about what order they come in. If all three are found, I want to place them in that specific order: first the word "model", capitalized, then the all-caps alphanumeric code and finally the pure alphabetical code in all-caps.
Can (should?) regex handle this?
I suggest tokenizing the input into words, matching each of them against the supported tokens, and assembling them into canonical categorized slots.
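A minimal sketch of that tokenizing approach, written in Python for brevity (the same logic should carry over to Groovy), could look like this. The slot names `model`, `code` and `suffix` and the function `canonicalize` are hypothetical. The three patterns are the ones from the question.

```python
import re

# One pattern per slot, tried in this order for every token.
SLOTS = [
    ("model", re.compile(r"model", re.IGNORECASE)),
    ("code", re.compile(r"[a-z]\d{3}", re.IGNORECASE)),
    ("suffix", re.compile(r"[a-z]{3}", re.IGNORECASE)),
]


def canonicalize(entry: str):
    """Return 'Model X123 ABC' style output, or None if a slot is missing."""
    found = {}
    for token in entry.split():
        for name, pattern in SLOTS:
            if name not in found and pattern.fullmatch(token):
                found[name] = token
                break  # a token fills at most one slot
    if len(found) != len(SLOTS):
        return None
    return f"Model {found['code'].upper()} {found['suffix'].upper()}"


print(canonicalize("abc model x123"))
print(canonicalize("Model 100 ABC X123"))
```

Tokens that fit no slot (such as `100` above) are simply ignored, and entries with a missing slot return `None` so they can be routed to manual correction.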
Even better would be to offer search suggestions when the user enters the information, and require the user to pick a suggestion.
But you could do it by (programmatically) constructing a monster regex pattern with all the permutations:
\b(?:(model)\s+([a-z]\d{3})\s+([a-z]{3})
|(model)\s+([a-z]{3})\s+([a-z]\d{3})
|([a-z]\d{3})\s+(model)\s+([a-z]{3})
|([a-z]\d{3})\s+([a-z]{3})\s+(model)
|([a-z]{3})\s+(model)\s+([a-z]\d{3})
|([a-z]{3})\s+([a-z]\d{3})\s+(model)
)\b
It'd have to use named capturing groups but I left that out in the hopes that the above might be close to readable.
I'm not sure I fully understand your underlying objective -- is this to be able to match up like products (e.g., products with the same model number)? If so, a word permutations function like this one could be used in a calculated field to create a multikey: http://www.briandunning.com/cf/1535
If you need partial matching in FileMaker, you could also use a redux search function like this one: http://www.fmfunctions.com/fid/380
Feel free to PM me if you have questions that aren't a good format to post here.
Short version:
How can I get a regex that matches a@a.aaaa but not a@a.aaaaa using CAtlRegExp?
Long version:
I'm using CAtlRegExp http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx to try to match email addresses. I want to use the regex
^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$
extracted from here.
But the syntax that CAtlRegExp accepts is different from the one used there. This regex returns the error REPARSE_ERROR_BRACKET_EXPECTED; you can check for yourself using this app: http://www.codeproject.com/KB/string/mfcregex.aspx
Using said app, I created this regex:
^[a-zA-Z0-9\._%\+\-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]$
But the problem is that this matches a@a.aaaaa as valid; I need it to match 4 characters maximum for the top-level domain.
So, how can I get a regex that matches a@a.aaaa but not a@a.aaaaa?
Try: ^[a-zA-Z0-9\._%\+\-]+@([a-zA-Z0-9-]+\.)+\c\c\c?\c?$
This expression replaces the [A-Z]{2,4} sequence which CAtlRegExp doesn't support with \c\c\c?\c?
\c serves as an abbreviation of [a-zA-Z]. The question marks after the 3rd and 4th \c's indicate they can match either zero or one characters. As a result, this portion of the expression matches 2, 3 or 4 characters, but neither more nor less.
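The same unrolling trick can be written as a small helper for any engine that lacks {m,n}. The Python sketch below is only an illustration: `expand_bounded` is a made-up name, and since \c is specific to CAtlRegExp, the letter class is spelled out as [a-zA-Z] so the result can be checked with Python's `re` module.

```python
import re


def expand_bounded(atom: str, low: int, high: int) -> str:
    """Rewrite atom{low,high} as `low` mandatory copies plus optional copies."""
    return atom * low + (atom + "?") * (high - low)


LETTER = "[a-zA-Z]"
TLD = expand_bounded(LETTER, 2, 4)
# '@' separates the local part from the domain.
EMAIL = re.compile(r"^[a-zA-Z0-9._%+\-]+@(?:[a-zA-Z0-9-]+\.)+" + TLD + "$")

print(TLD)
print(EMAIL.match("a@a.aaaa") is not None)
print(EMAIL.match("a@a.aaaaa") is not None)
```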
You are trying to match email addresses, a very widely used critical element of internet communication.
To which I would say that this job is best done with the most widely used most correct regex.
Since email address format rules are described by RFC822, it seems useful to do internet searches for something like "RFC822 email regex".
For Perl the answer seems to be easy: use Mail::RFC822::Address: regexp-based address validation
RFC 822 Email Address Parser in PHP
Thus, to achieve the most correct handling of email addresses, one should either locate the most precise regex that there is out somewhere for the particular toolkit (ATL in your case) or - in case there's no suitable existing regex yet - adapt a very precise regex of another toolkit (Perl above seems to be a very complete albeit difficult candidate).
If you're trying to match a specific sub part of email addresses (as seems to be the case given your question), then it probably still makes sense to start with the most up-to-date/correct/universal regex and specifically limit it to the parts that you require.
Perhaps I stated the obvious, but I hope it helped.