Newbie regex question - detect spam

Newbie regex question - detect spam - regex

Here's my regex newbie questions:
How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
How can I detect also variations of those spam words like "v-iagra" or "v.iagra" ? (one additional character)

Regex doesn't seem like quite the right hammer for this particular nail. For your list, you can simply throw all of you blacklisted words in a sorted list of some kind, and scan each token against that list. Direct string operations are always faster than invoking the regular expression engine du jour.
For your variations ("v-iagra", et. al) I'd remove all non-characters (as #Kinopiko suggested) and then run them past your blacklist again. If you're wary of things like "viiagra", et cetera, I'd check out Aspell. It's a great library, and looks like CPAN has a Perl binding.

How can I check if a string has 3 spam words? (for example: viagra,pills and shop)
A regex to spot any one of those three words might look like this (Perl):
if ($string =~ /(viagra|pills|shop)/) {
# spam
}
If you want to spot all three, a regex alone isn't really enough:
my $bad_words = 0;
while ($string =~ /(viagra|pills|shop)/g) {
$bad_words++;
}
if ($bad_words >= 3) {
# spam
}
How can I detect also variations of those spam words like "v-iagra" or "v.iagra" ? (one additional character)
It's not so easy to do that with just a regex. You could try something like
$string =~ s/\W//g;
to remove all non-word characters like . and -, and then check the string using the test above. This would strip spaces too though.

Related

Writing a regex pattern for text if not 255/255

I'm currently writing a Perl program where I need a regex pattern, but there is something I don't know how to do.
Text
Reliability: 255/255
I need to write a regex where if reliability is 255/255 then it's okay (I have that one: Reliability:\s+(255\/255))
but if Reliability is different from 255/255 it's not.
I need to do it without an else. I just need the regex if possible.
Clarification
I send a command on a router : show interface $WAN_Int where $WAN_Int is an IP address. I put the result in a table. Then I read the table and I search for Reliability: 255/255. if it's 255, it's OK. If it's not, there is a problem.

my $ok = $string eq 'Reliability: 255/255';
or, if it must be a regex
my $ok = $string =~ m{^Reliability: 255/255$};
The $ok is 1 if $string is equal to the text or an empty string otherwise.
Both eq operator
and the regex's match operator in scalar context
return "true" (1) on success or false (an empty string) otherwise. So you can use them in an expression and assign their return (or decide based on it).
Please keep in mind the precedence table when doing this or use parenthesis liberally.
Given the clarification a regex is better here, to allow for flexibility in input as it comes from another interface which may have (or pick up) slight variations in the expected format.
Then allow for multiple spaces between words (\s+), for the phrase being only a part of the string (drop anchors ^ and $), perhaps for the word not being capitalized
my $ok = $string =~ m{\b[Rr]eliability:\s+255/255\b};
where \b is the word-boundary anchor, since trailing numbers after 255 may compromise this.

This may work for you:
Reliability:\s+(?!255)\d+/255\s*$
https://regex101.com/r/1JAsG6/5/
Also, for matching 255 on your original regexp, It can be simplified (you don't need classes and you can avoid the group)
Reliability:\s+255/255\s*$
https://regex101.com/r/LnfXVI/3/
NOTE: If you use / as a delimiter for the regexp, you may need to scape / on the regxes as \/
Extra: If you would want to also disallow strings like Reliability: 123456/255, use this instead: Reliability:\s+(?!255|0\d)\d{1,3}/255\s*$
https://regex101.com/r/1JAsG6/8/

Regex - skip over expressions and parse the rest

I use regular expressions for sorting data into groups. The lines look somewhat like:
testword test
test testword
tes.w. tes.
tes tes.w.
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
The word test is to be found as well as othertexttobefound and sometexttobefound.
Now I am trying to tell my parser that it is supposed to plainly ignore testword and its derivatives while searching and focus on the rest of my data entries. The "good words" and the "bad words" can be anywhere in each line.
I have tried [^w] which is fine for the beginning of strings, but in my versions not for the other cases. Also (?:w) didn't do the trick. I cannot use lookarounds as these would keep the whole line from being detected.
After long searches on the internet I am hoping for help here!
After much appreciated help from Naxos84, I am adding some German real life examples:
sozialabgabe sozialarbeiter
soz.abg. sozialarbeiter
sozarbeiter soz.abg.
sozialarbeiter otherirrelevantstuff
otherirrelevantstuff soz abg
otherirrelevantstuff sozabg
otherirrelevantstuff sozialabgabe
If I search with:
sozial["^\ab"]|soz["^\ab"]|sometexttobefound|othertexttobefound
Lines 6 and 7 get marked as well, but I don't want those.
What am I doing wrong?
A link:
regexr

To find all the matches you want: any occurence of "test" and "sometexttobefound" and "othertexttobefound you can try the following regex:
test[^\w]|sometexttobefound|othertexttobefound
This regex means:
Find every "test" that is not followed by a word OR sometexttobefound OR othertexttobefound
I tried this regex with the follow text (I added a few 'test's)
testword test
test testword
tes.w. testtes.
tes tes.w. test
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
at regexr (when using the global flag)
If you also want to find things like "tes" I guess you should add it. (I'm not a regex expert)
Like:
test[^\w]|tes[^\w]|sometexttobefound|othertexttobefound

If you want to get all words from the text except from some special words, you could use:
#words = grep{$_ ne 'testword'} split /\P{L}+/, $str;
(if $str is your complete string)
See perl docs for \P{...}. Instead of \P{L}, you could also use \W, but those are locale-dependent.
But if you need to use regexps only, then you could use
#words = $str =~ /\b(?!testword)\p{L}+\b/g;
But again, \b is locale-dependent again, so you might want to use \b{...} or rebuild the word boundary matches with \p{L}:
#words = $str =~ /
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
(?!testword)\p{L}+
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
/gx;

Force first letter of regex matched value to uppercase

I am trying to get better at regular expressions. I am using regex101.com. I have a regular expression that has two capturing groups. I am then using substitution to incorporate my captured values into another location.
For example I have a list of values:
fat dogs
thin cats
skinny cows
purple salamanders
etc...
and this captures them into two variables:
^([^\s]+)\s+([^\s;]+)?.*
which I then substitute into new sentences using $1 and $2. For example:
$1 animals like $2 are a result of poor genetics.
(obviously this is a silly example)
This works and I get my sentences made but I'm stumped trying to force $1 to have an uppercase first letter. I can see all sorts of examples on MATCHING uppercase or lowercase but not transforming to uppercase.
It seems I need to do some sort of "function" processing. I need to pass $1 to something that will then break it into two pieces...first letter and all the other letters....transform piece one to uppercase...then smash back together and return the result.
Add to that error checking...and while it is unlikely $1 will have numeric values we should still do a safety check of some sort.
So if someone can just point me to the reading material I would appreciate it.

A regular expression will only match what is there. What you are doing is essentially:
Match item
Display matches
but what you want to be doing is:
Match item
Modify matches
Display modified matches
A regular expression doesn't do any 'processing' on the matches, it is just a syntax for finding the matches in the first place.
Most languages have string processing, for instance, if you had you matches in the variables $1 and $2 as above, you would want to do something along the lines of:
$1 = upper(substring($1, 0, 1)) + substring($1, 1)
assuming the upper() function if you language's strung uppercasing function, and substring() returns a sub-string (zero indexed).

Put very simply, regex can only replace from what is in your original string. There is no capital F in fat dogs so you can't get Fat dogs as your output.
This is possible in Perl, however, but only because Perl processes the text after the regex substitution has finished, it is not a feature of the regex itself. The following is a short Perl program (sans regex) that performs case transformation if run from the command line:
#!/usr/bin/perl -w
use strict;
print "fat dogs\n"; # fat dogs
print "\ufat dogs\n"; # Fat dogs
print "\Ufat dogs\n"; # FAT DOGS
The same escape sequences work in regexs too:
#!/usr/bin/perl -w
use strict;
my $animal = "fat dogs";
$animal =~ s/(\w+) (\w+)/\u$1 \U$2/;
print $animal; # Fat DOGS
Let me repeat though, it is Perl doing this, not the regex.
Depending on your real world example you may not have to change the case of the letter. If your input is Fat dogs then you will get the desired result. Otherwise, you will have to process $1 yourself.
In PHP you can use preg_replace_callback() to process the entire match, including captured groups, before returning the substitution string. Here is a similar PHP program:
<?php
$animal = "fat dogs";
print(preg_replace_callback('/(\w+) (\w+)/', 'my_callback', $animal)); // Fat DOGS
function my_callback($match) {
return ucfirst($match[1]) . ' ' . strtoupper($match[2]);
}
?>

I think it can be very simple based on your language of choice. You can firs loop over the list of values and find your match then put the groups within your string by using a capitalize method for first matched :
for val in my_list:
m = match(^([^\s]+)\s+([^\s;]+)?.*,val)
print "%sanimals like %s are a result of poor genetics."%(m.group(1).capitalize(), m.group(1))
But if you want to dot it all with regex It's very unlikely to be possible because you need to modify your string and this is generally not a regex a suitable task for regex.

So in the end the answer is that you CAN'T use regex to transform...that's not it's job. Thanks to the input by others I was able to adjust my approach and still accomplish the objective of this self inflicted academic assignment.
First from the OP you'll recall that I had a list and I was capturing two words from that list into regex variables. Well I modified that regex capture to get three capture groups. So for example:
^(\S)(\S+)\s+_(\S)?.*
//would turn fat dogs into
//$1 = f, $2 = at, $3 = dogs
So then using Notepad++ I then replaced with this:
\u$1$2 animals like $3 are a result of poor genetics.
In this way I was able to transform the first letter to uppercase..but as others pointed out this is NOT regex doing the transform but another process. (In this case notepad ++ but could be your c#, perl, etc).
Thank You everyone for helping the newbie.

Simple regex - finding words including numbers but only on occasion

I'm really bad at regex, I have:
/(#[A-Za-z-]+)/
which finds words after the # symbol in a textbox, however I need it to ignore email addresses, like:
foo#things.com
however it finds #things
I also need it to include numbers, like:
#He2foo
however it only finds the #He part.
Help is appreciated, and if you feel like explaining regex in simple terms, that'd be great :D

/(?:^|(?<=\s))#([A-Za-z0-9]+)(?=[.?]?\s)/
#This (matched) regex ignores#this but matches on #separate tokens as well as tokens at the end of a sentence like #this. or #this? (without picking the . or the ?) And yes email#addresses.com are ignored too.
The regex while matching on # also lets you quickly access what's after it (like userid in #userid) by picking up the regex group(1). Check PHP documentation on how to work with regex groups.

You can just add 0-9 to your regex, like so:
/(#[A-Za-z0-9-]+)/
Don't think any more explanation is needed since you've been able to come this far by yourself. 0-9 is just like a-z (though numeric ofcourse).
In order to ignore emailaddresses you will need to provide more specific requirements. You could try preceding # with (^| ) which basically states that your value MUST be preceeded by either the start of the string (so nothing really, though at the start) or a space.
Extending this you can also use ($| ) on the end to require the value to be followed by the end of the string or a space (which means there's no period allowed, which is requirement for a valid emailaddress).
Update
$subject = "#a #b a#b a# #b";
preg_match_all("/(^| )#[A-Za-z0-9-]+/", $subject, $matches);
print_r($matches[0]);

Regex which ignores comments

being a regex beginner, I need some help writing a regex. It should match a particular pattern, lets say "ABC". But the pattern shouldn't be matched when it is used in comment (' being the comment sign). So XYZ ' ABC
shouldn't match. x("teststring ABC") also shouldn't match. But ABC("teststring ' xxx") has to match to end, that is xxx not being cut off.
Also does anybody know a free Regex application that you can use to "debug" your regex? I often have problems recognizing whats wrong with my tries. Thanks!

Some will swear by RegexBuddy. I've never used the debugger, but I advise you to steer away from the regex generator it provides. It's just a bad idea.
You may be able to pull this off with whatever regex flavor you're using, but in general I think you're going to find it easier and more maintainable to do this the "hard" way. Regular expressions are for regular languages, and nested anything usually means that regexes aren't a good idea. Modern extensions to regex syntax means it may be doable, but it's not going to be pretty, and you sure won't remember what happened in the morning. And one place where regular expressions fail quite spectacularly (even with modern non-regular extensions) is parsing nested structures - trying to parse any mixture comments, quoted strings, and parenthesis quickly devolves into an incomprehensible and unmaintainable mess. Don't get me wrong - I'm a fan of regular expressions in the right places. This isn't one of them.

On the topic of good regex tools, I really like RegexBuddy, but it's not free.
Other than that, a regex is the wrong tool for the job if you need to check inside string delimiters and all sorts too. You need a finite-state machine.

Odd that lots of people recommend their favorite tools, but nobody provides a solution for the problem at hand. (I'm the developer of RegexBuddy, so I'll refrain from recommending any tools.)
There's no good way of matching Y unless it's part of XYZ with a single regular expression. What you can do is write a regex that matches both Y and XYZ: Y|XYZ. Then use a bit of extra code to process the matches for Y, and ignore those for XYZ. One way to do that is with a capturing group: (Y)|XYZ. Now you can process the matches of the first capturing group. When XYZ matches, the capturing group doesn't match anything.
To do this for your VB-style comments, you can use the regex:
'.*|(ABC)
This regex matches a single quote and everything up to the end of the line, or ABC. This regex will match all comments (whether those include ABC or not). The capturing group will match all occurrences of ABC, except those in comments.
If you want your regex to both skip comments and strings, you can add strings to your regex:
'.*|"[^"\r\n]*"|(ABC)

I find the best 'debugger' for regexes is just messing around in an interactive environment trying lots of small bits out. For Python, ipython is great; for Ruby, irb, for command-line type stuff, sed...
Just try out little pieces at a time, make sure you understand them, then add an extra little bit. Rinse and repeat.

For NET development you might as well try RegexDesigner, this tool can generate code(VB/C#) for you. It is a very good tool for us Regex starters.
link text

Here is my solution to this problem:
1. Find a store all your comments in hash
2. Do your regexp replacement
3. Bring comments back to file
Save your time :-)
string fileTextWithComments = "Some tetx file contents";
Dictionary<string, string> comments = new Dictionary<string, string>();
// 1. Find a store all your comments in hash
Regex rc = new Regex("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");
MatchCollection matches = rc.Matches(fileTextWithComments);
int index = 0;
foreach (Match match in matches)
{
string key = string.Format("/*Comment#{0}*/", index++);
comments.Add(key, match.Value);
fileTextWithComments = fileTextWithComments.Replace(match.Value, key);
}
// 2. Do your regexp replacement
Regex r = new Regex("YOUR REGEXP PATTERN");
fileTextWithComments = r.Replace(fileTextWithComments, "NEW STRING");
// 3. Bring comments back to file :-)
foreach (string key in comments.Keys)
{
string comment = comments[key];
fileTextWithComments = fileTextWithComments.Replace(key, comment);
}

Could you clarify? I read it thrice, and I think you want to match a given pattern when it appears as a literal. As in not as part of a comment or a string.
What your asking for is pretty tricky to do as a single regexp. Because you want to skip strings. Multiple strings in one line would complicate matters.
I wouldn't even try to do it in one regexp. Instead, I'd pass each line through a filter first, to remove strings, and then comments in that order. And then try and match your pattern.
In Perl because of it's regexp processing power. Assuming #lines is a list of lines you want to match, and $pattern is the pattern you want to match.
#matches =[];
for (#lines){
$line = $_;
$line ~= s/"[^"]*?(?<!\)"//g;
$line ~= s/'.*//g;
push #matches, $_ if $line ~= m/$pattern/;
}
The first substitution finds any pattern that starts with a double quotation mark and ends with the first unescaped double quote. Using the standard escape character of a backspace.
The next strips comments. If the pattern still matches, it adds that line to the list of matches.
It's not perfect because it can't tell the difference between "a\\" and "a\" The first is usually a valid string, the later is not. Either way these substitutions will continue to look for another ", if one isn't found the string isn't thrown out. We could use another substitution to replace all double backslashes with something else. But this will cause problems if the pattern you're looking for contains a backslash.

You can use a zero width look-behind assertion if you only have single line comments, but if you're using multi-line comments, it gets a little trickier.
Ultimately, you really need to solve this kind of issue with some sort of parser, given that the definition of a comment is really driven by a grammar.
This answer to a different but related question looks good too...

If you have Emacs, there is a built-in regex tool called "regexp-builder". I don't really understand the specifics of your regex question well enough to suggest an answer to that.

RegEx1: (-user ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -user "test user"
Regex2: (-daterange ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -daterange "1/4/13 1/20/13"
RegEx3: (-date )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -date 1/4/13 -
RegEx4: (-day )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -day monday -
Search for the quoted value first if not found, search for the no quotes parameter. This expects only one occurrence of the parameter. It also expects the command to either; use quotes to encapsulate a string with no quotes inside, or; use any character other than a quote in the first position, have no occurrence of ' -' until the next parameter, and have a trailing ' -' (add it onto the string before the regex).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Newbie regex question - detect spam - regex

Here's my regex newbie questions: How can I check if a string has 3 spam words? (for example: viagra, pills and shop) How can I detect also variations of those spam words like "v-iagra" or "v.iagra" ? (one additional character)

Related

Writing a regex pattern for text if not 255/255

Regex - skip over expressions and parse the rest

Force first letter of regex matched value to uppercase

Simple regex - finding words including numbers but only on occasion

Regex which ignores comments

Categories

Resources