OpenFire Content Filter using REGEX

Hi, I am currently using the following regex to prevent users from submitting content that contains profanity, as listed in the regex:
(?i)(pecan|tie|shirt|hole|ontology|meme|pelagic|cock|duck|slot|anjing lo|Banting|Chiba|Screw|Screwing|fat|where|mother|peer|per|sock|socker|locker|ans|rect|anal|pickpocket|joker|muck)\b
I would like to improve the regex so that it also filters out credit card numbers (Mastercard, Visa, JCB, Amex and so on).
I have a regex for each card type, namely:
^4[0-9]{12}(?:[0-9]{3})?$ (Visa)
^5[1-5][0-9]{14}$ (Master)
^3[47][0-9]{13}$ (Amex)
^3(?:0[0-5]|[68][0-9])[0-9]{11}$ (Diners)
^6(?:011|5[0-9]{2})[0-9]{12}$ (Discover)
^(?:2131|1800|35\d{3})\d{11}$ (JCB)
However, when I combine the credit card regexes with the profanity filter like this:
(?i)(pecan|tie|shirt|hole|ontology|meme|pelagic|cock|duck|slot|anjing lo|Banting|Chiba|Screw|Screwing|fat|where|mother|peer|per|sock|socker|locker|ans|rect|anal|pickpocket|joker|muck)\b (?i)^4[0-9]{12}(?:[0-9]{3})?$\b (?i)^5[1-5][0-9]{14}$\b
it ignores the profanity filter.
Can anyone point me in the right direction?

Filtering profanity is a great example of when NOT to use regex! Anyone who wants to swear can easily get around your filter by typing "0" instead of "o", or inserting a "." in the middle of a word, or hundreds of other workarounds. There are much better alternatives out there, if you'd like to do some research. Anyway, ignoring that...
Firstly, do you really need to do this in a single regex pattern?! Your code would look much more readable and be more easily maintainable if you split this into multiple lines of code.
But if you really insist on doing it this way, your pattern is looking for a swear word, followed by a Visa number, followed by a Master number. You have not implemented any "OR" condition here.
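As a rough sketch of the "split it into multiple checks" idea, here is how it might look in Python (an assumption; the question does not say what language drives the filter). The word list is shortened to three entries taken from the question's pattern, and the `^...$` anchors on the card regexes are swapped for `\b` so they can match inside a longer message, which is a guess at the intent. The names `PROFANITY`, `CARD_PATTERNS`, `ANY_VIOLATION` and `violations` are made up for illustration.

```python
import re

# Shortened word list for illustration; the full alternation from the question would go here.
PROFANITY = re.compile(r"\b(?:duck|joker|muck)\b", re.IGNORECASE)

# Card patterns from the question, with ^...$ replaced by \b (assumption:
# the number may appear anywhere inside a message, not as the whole message).
CARD_PATTERNS = {
    "visa": re.compile(r"\b4[0-9]{12}(?:[0-9]{3})?\b"),
    "mastercard": re.compile(r"\b5[1-5][0-9]{14}\b"),
    "amex": re.compile(r"\b3[47][0-9]{13}\b"),
}

def violations(message):
    """Return the names of every rule the message trips, one check per rule."""
    found = []
    if PROFANITY.search(message):
        found.append("profanity")
    for name, pattern in CARD_PATTERNS.items():
        if pattern.search(message):
            found.append(name)
    return found

# If a single pattern is required, the pieces have to be joined with "|" (OR),
# not written one after another.
ANY_VIOLATION = re.compile(
    "|".join(p.pattern for p in (PROFANITY, *CARD_PATTERNS.values())),
    re.IGNORECASE,
)
```

Keeping one pattern per rule also tells you which rule fired, which the single combined pattern does not.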

This is one of the stupidest policy requirements I've ever seen. Your filter will miss a lot of profanities, and will trigger on non-profanities; see the Scunthorpe problem.
Then, your credit card regexes already exclude all possible swearwords because they allow only digits, out of which it is going to be difficult to construct a swearword.
But if your boss insists, make him happy with
(?i)^(?!.*(pecun|tai|shit|asshole|kontol|memek|pelacur|cock|dick|slut|anjing lo|bangsat|cibay|fuck|fucking|faggot|whore|motherfucker|peler|pler|suck|sucker|fucker|anus|rectum|anal|cocksucker|sucker|suck)\b)4[0-9]{12}(?:[0-9]{3})?$

Related

Improve regex that works

I'm not a regex expert, so please be nice :-)
I created this regex to verify whether a user submitted a day of the week (in Italian):
/((lun|mart|giov)e|mercol(e?)|vener)d(ì|i('?)|í)|sabato|domenica/
This regex works perfectly and it matches the following:
lunedi
lunedì
lunedí
lunedi'
martedi
martedì
martedí
martedi'
mercoledi
mercoledì
mercoledí
mercoledi'
mercoldi
mercoldì
mercoldí
mercoldi'
giovedi
giovedì
giovedí
giovedi'
venerdi
venerdì
venerdí
venerdi'
sabato
domenica
Now consider the first part of the regex and focus on venerdì: as you can see, I added an OR (|) just to handle venerdì, because of the presence of that “r”.
Everything works just fine, but I’m here to ask whether there is any way to start the regex this way:
(lun|mar|giov|ven)e
and then manage that “r” some way.
I read about backreferences and conditionals, but I’m not sure they can be of any help.
My idea is something like: “if the first group captured ‘ven’, then add an ‘r’ right after the ‘e’ that follows the group”.
Is this possible?
Don't "golf" your regex. If you want to improve it at all, make it more readable. While it is certainly worthwhile to use different cases for the different "i" variants, everything else should IMHO be kept as simple as possible.
How about something like this?
(lune|marte|mercole?|giove|vener)d(ì|i'?|í)|sabato|domenica
Don't use backreferences and other advanced features if you don't need them, just to make your regex a few chars shorter. Even if you would still understand what it means, think about your fellow co-developers -- or just yourself two months from now.
I just removed a few redundant (...) and the "shared e" part. Note how (besides the (...)) it is the same length, whether you use (lun|mart|giov)e or lune|marte|giove, but the latter is arguably more readable. Similarly, a backreference or some conditional would likely make your regex longer instead of shorter -- and considerably more complicated.
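For what it's worth, the simplified pattern can be checked quickly against the spellings listed in the question. This is only a sketch in Python (an assumption, since the question does not name a language); the helper name `is_weekday` is made up, and `fullmatch` is used so the whole input has to be a weekday, which is stricter than the unanchored original.

```python
import re

# The simplified pattern from the answer, with non-capturing groups.
WEEKDAY = re.compile(r"(?:lune|marte|mercole?|giove|vener)d(?:ì|i'?|í)|sabato|domenica")

def is_weekday(text):
    """True if the whole string is one of the accepted weekday spellings."""
    return WEEKDAY.fullmatch(text) is not None
```

Calling `is_weekday("venerdì")` or `is_weekday("mercoldi'")` should return `True`, while a misspelling such as `"venedì"` (missing the "r") should return `False`.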

Form Input Pattern & Validation for Full Name

I have a Google form where I am asking for the user to include their "full name" to keep the form short and sweet (without two inputs for first name/last name). You are able to validate answers in Google Forms using regular expressions, but I'm not sure where to start.
I want at minimum two words in the input, each with at least 2 characters, and I don't want to block any special characters (so that people with names like O'Leary can still write it). Basically, I just want to make sure there are two words included in a field, each with at least 2 letters.
I have no experience with regex or the patterns so any help is appreciated!
I built this regular expression to accept full names from a lot of countries:
^([a-zA-Z\-ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïòóôõöùúûüýÿ']{2,75} ?){2,5}$
You can test it on regex101.com. This site also helps you understand this regex with explanations on the top right.
Hope it helps.
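One thing worth checking before relying on it: the space in that pattern is optional, so a single word of four or more letters can be split into two "words" by the regex engine and still pass. The sketch below (Python, chosen only for illustration; Google Forms itself just takes the pattern) shows that behaviour and a stricter variant that requires a real space between words. `NAME_CHARS`, `ORIGINAL` and `STRICT` are hypothetical names, and the stricter pattern is a suggestion, not part of the answer above.

```python
import re

# Character class copied from the answer's pattern.
NAME_CHARS = r"a-zA-Z\-ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïòóôõöùúûüýÿ'"

# The pattern as posted: 2 to 5 groups, each optionally followed by a space.
ORIGINAL = re.compile(rf"^([{NAME_CHARS}]{{2,75}} ?){{2,5}}$")

# Stricter variant (assumption about the intent): a first word of at least
# 2 characters, then 1 to 4 more words, each preceded by a mandatory space.
STRICT = re.compile(rf"^[{NAME_CHARS}]{{2,75}}(?: [{NAME_CHARS}]{{2,75}}){{1,4}}$")

def passes(pattern, name):
    """True if the name is accepted by the given compiled pattern."""
    return pattern.match(name) is not None
```

With these definitions, `passes(ORIGINAL, "John")` is `True` (it is read as "Jo" + "hn"), whereas `passes(STRICT, "John")` is `False` and `passes(STRICT, "Sean O'Leary")` is `True`.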

Is rearranging words of text possible with RegEx?

I've got a list of devices in a database, such as Model X123 ABC. Another portion of the system accepts user input and needs to, as well as possible, match their entries to the existing devices. But the users have the ability to enter anything they want. They might enter the above model as Model 100 ABC X123 or Model X123.
Understand, this is a general example, and the number of permutations of available models and matching user entries is enormous, and I'm just trying to match as many as possible so that the manual corrections can be kept to a minimum.
The system is built in FileMaker, but has access to pretty much any plugin I wish, which means I have access to Groovy, PHP, JavaScript, etc. I'm currently working with Groovy using the ScriptMaster plugin for other simple regex pattern matching elsewhere, and I'm wondering if the most straightforward way to do this is to use regex.
My thinking with regex is that I'm looking for patterns, but I'm unsure if I can say, "Assign this grouping to this pattern regardless of where it is in the order of pattern groups." Above, I want to find if the string contains three patterns: (?i)\bmodel\b, (?i)\b[a-z]\d{3}\b, and (?i)\b[a-z]{3}\b, but I don't care about what order they come in. If all three are found, I want to place them in that specific order: first the word "model", capitalized, then the all-caps alphanumeric code and finally the pure alphabetical code in all-caps.
Can (should?) regex handle this?
I suggest tokenizing the input into words, matching each of them against the supported tokens, and assembling them into canonical categorized slots.
Even better would be to offer search suggestions when the user enters the information, and require the user to pick a suggestion.
But you could do it by (programmatically) constructing a monster regex pattern with all the permutations:
\b(?:(model)\s+([a-z]\d{3})\s+([a-z]{3})
|(model)\s+([a-z]{3})\s+([a-z]\d{3})
|([a-z]\d{3})\s+(model)\s+([a-z]{3})
|([a-z]\d{3})\s+([a-z]{3})\s+(model)
|([a-z]{3})\s+(model)\s+([a-z]\d{3})
|([a-z]{3})\s+([a-z]\d{3})\s+(model)
)\b
It'd have to use named capturing groups but I left that out in the hopes that the above might be close to readable.
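The tokenizing approach suggested at the top of this answer could look roughly like the sketch below. It is written in Python for brevity (the question mentions Groovy, PHP and JavaScript as available, so this is only an illustration), and the slot names `model`, `code` and `suffix` as well as the function `canonicalize` are invented for the example. Each whitespace-separated token is tested against the three patterns from the question, and the result is rebuilt in the canonical order.

```python
import re

# One pattern per slot, taken from the three patterns in the question.
SLOT_PATTERNS = [
    ("model", re.compile(r"model", re.IGNORECASE)),
    ("code", re.compile(r"[a-z]\d{3}", re.IGNORECASE)),
    ("suffix", re.compile(r"[a-z]{3}", re.IGNORECASE)),
]

def canonicalize(entry):
    """Return 'Model <CODE> <SUFFIX>' if all three parts are present, else None."""
    slots = {}
    for token in entry.split():
        for name, pattern in SLOT_PATTERNS:
            # The first token that fits an empty slot fills it; extra tokens are ignored.
            if name not in slots and pattern.fullmatch(token):
                slots[name] = token
                break
    if len(slots) < len(SLOT_PATTERNS):
        return None
    return f"Model {slots['code'].upper()} {slots['suffix'].upper()}"
```

For the inputs in the question, `canonicalize("Model 100 ABC X123")` gives `"Model X123 ABC"`, and `canonicalize("Model X123")` gives `None` because the three-letter part is missing.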
I'm not sure I fully understand your underlying objective -- is this to be able to match up like products (e.g., products with the same model number)? If so, a word permutations function like this one could be used in a calculated field to create a multikey: http://www.briandunning.com/cf/1535
If you need partial matching in FileMaker, you could also use a redux search function like this one: http://www.fmfunctions.com/fid/380
Feel free to PM me if you have questions that aren't a good format to post here.

Need to create a gmail like search syntax; maybe using regular expressions?

I need to enhance the search functionality on a page listing user accounts. Rather than have multiple search boxes for each possible field, or a drop down menu where the user can only search against one field, I'd like a single search box and to use a gmail like syntax. That's the best way I can describe it, and what I mean by a gmail like search syntax is being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into its separate parts, which will allow me to construct a SQL query. So for example, type:admin would form part of the WHERE clause so that it would find any record where the field type is equal to admin and the same for username. The text in quotes may be a free text search, but I'm not sure on that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with Regex, try the online REGex tester. Find a tutorial and play around, it's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
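A minimal sketch of the string-splitting approach, in Python (an assumption; the question does not say which language the page uses). It relies on the standard library's `shlex.split`, which splits on spaces but keeps a quoted phrase together, and then separates `field:value` pairs from free text. The function name `parse_search` is made up for the example.

```python
import shlex

def parse_search(query):
    """Split a search string into field filters and free-text terms."""
    fields = {}
    free_text = []
    # shlex.split keeps "made up plc" together as a single token.
    for token in shlex.split(query):
        key, sep, value = token.partition(":")
        if sep and key and value:
            fields[key] = value
        else:
            free_text.append(token)
    return fields, free_text
```

For the example in the question, `parse_search('username:bbaggins type:admin "made up plc"')` returns `({'username': 'bbaggins', 'type': 'admin'}, ['made up plc'])`. When building the SQL, the field names would still need to be checked against a list of known columns and the values passed as query parameters.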
Thanks for the answers... I did start doing it without regex and just wondered if a regex would be simpler. Sounds like it wouldn't though, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go to guy for any naming needs :-)
Cheers,
Adam

Regex misspellings

I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos, sometimes those writing instructions for their team in the game will misspell a building name and obviously the regex will then not pick it up (i.e. spelling "University" and "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load so I have as a last resort to algorithmically create versions of each word with a letter missing and then another with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
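Since the question asks for something that isn't PHP specific, here is a rough sketch of that idea in Python. The distance function is a plain dynamic-programming implementation of Levenshtein distance; the names `levenshtein` and `find_buildings`, the threshold of 2 and the sample building list in the usage note are all assumptions for illustration. It only handles single-word building names.

```python
import re

def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def find_buildings(text, buildings, max_distance=2):
    """Return (word, building) pairs where the word is within max_distance of a building name."""
    matches = []
    for word in re.split(r"\W+", text):
        if not word:
            continue
        for building in buildings:
            if levenshtein(word.lower(), building.lower()) <= max_distance:
                matches.append((word, building))
    return matches
```

With a building list of `["University", "Barracks"]`, the text `"Build a Unversity next"` yields `[("Unversity", "University")]`. Short building names need a smaller threshold, otherwise unrelated short words start matching.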
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex()? – Teifion
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.