Regular Expression to exclude set of Keywords - regex

I want an expression that will fail when it encounters words such as "boon.ini" and "http". The goal is to take this expression and be able to construct one for any set of keywords.

^(?:(?!boon\.ini|http).)*$\r?\n?
(taken from RegexBuddy's library) will match any line that does not contain boon.ini and/or http. Is that what you wanted?
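Since the goal is to build this for any set of keywords, the pattern can be generated rather than written by hand. A minimal Python sketch, assuming the keywords are plain strings that should be matched literally; build_exclusion_pattern is a hypothetical helper name, not something from RegexBuddy, and the trailing \r?\n? is left out because each line is tested on its own:

```python
import re

def build_exclusion_pattern(keywords):
    """Compile a pattern that matches only lines containing none of the keywords."""
    # re.escape makes characters such as "." literal, as in boon\.ini above.
    alternation = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(r"^(?:(?!" + alternation + r").)*$")

pattern = build_exclusion_pattern(["boon.ini", "http"])
print(bool(pattern.match("an ordinary line")))      # True
print(bool(pattern.match("fetch http://example")))  # False
```

The list is assumed to be non-empty; with no keywords the lookahead would be empty and nothing would match.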

An alternative expression that could be used:
^(?!.*IgnoreMe).*$
^ = indicates start of line
$ = indicates the end of the line
(?! Expression) = indicates zero width look ahead negative match on the expression
The ^ at the front is needed; otherwise, when evaluated, the negative lookahead could start from somewhere within or beyond the 'IgnoreMe' text and make a match where you don't want it to.
e.g. If you use the regex:
(?!.*IgnoreMe).*$
With the input "Hello IgnoreMe Please", this will result in a match like "gnoreMe Please", because from the position just after the 'I' the negative lookahead no longer finds the complete string 'IgnoreMe' ahead of it.
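The effect of the missing anchor can be reproduced directly. A small Python sketch using the same input string; only the variable names are invented here:

```python
import re

text = "Hello IgnoreMe Please"

anchored = re.search(r"^(?!.*IgnoreMe).*$", text)
unanchored = re.search(r"(?!.*IgnoreMe).*$", text)

print(anchored)            # None: the line is rejected as intended
print(unanchored.group())  # gnoreMe Please
```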

Rather than negating the result within the expression, you should do it in your code. That way, the expression becomes pretty simple.
\b(boon\.ini|http)\b
Would return true if boon.ini or http was anywhere in your string. It won't match words like httpd or httpxyzzy because of the \b, or word boundaries. If you want, you could just remove them and it will match those too. To add more keywords, just add more pipes.
\b(boon\.ini|http|foo|bar)\b
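Negating in code could look roughly like this in Python. This is a sketch under assumptions: contains_keyword is a hypothetical helper, and the keywords are escaped so that they are matched literally:

```python
import re

def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text) is not None

keywords = ["boon.ini", "http", "foo", "bar"]
for line in ["load boon.ini", "start httpd", "nothing here"]:
    # The negation happens here, in ordinary code, not inside the regex.
    if not contains_keyword(line, keywords):
        print("passed:", line)
```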

You might be well served by writing a regex that succeeds when it encounters the words you're looking for, and then inverting the condition.
For instance, in Perl you'd use:
if (!/boon\.ini|http/) {
# the string passed!
}

^[^£]*$
The above expression rejects only strings that contain the pound symbol (£). It allows every character except that symbol.

Which language/regexp library? I thought your question was about ASP.NET, in which case you can see the "negative lookahead" section of this article:
http://msdn.microsoft.com/en-us/library/ms972966.aspx
Strictly speaking, the negation of a regular expression still defines a regular language, but very few libraries, languages, or tools let you express it directly.
Negative lookahead may serve the same purpose, but the actual syntax depends on what you are using. Tim's answer is an example with (?...)

I used this (based on Tim Pietzcker's answer) to exclude non-production subdomain URLs for Google Analytics profile filters:
^\w+-*\w*\.(?!(?:alpha(123)*\.|beta(123)*\.|preprod\.)domain\.com).*$
You can see the context here: Regex to Exclude Multiple Words

Related

RegEx Expression for Eclipse that searches for all items that have not been dealt with

To help stop SQL Injection attacks, I am going through about 2000 parameter requests in my code to validate them. I validate them by determining what type of value (e.g. integer, double) they should return and then applying a function to them to sanitize the value.
Any requests I have dealt with look like this
*SecurityIssues.*(request.getParameter
where * signifies any number of characters on the same line.
What RegExp expression can I use in the Eclipse search (CTRL+H) which will help me search for all the ones I have not yet dealt with, i.e. all the times that the text request.getParameter appears when it is not preceded by the word SecurityIssues?
Examples for matches
The regular expression should match each of the following e.g.
int companyNo = StringFunctions.StringToInt(request.getParameter("COMPANY_NO"))
double percentage = StringFunctions.StringToDouble(request.getParameter("MARKETSHARE"))
int c = request.getParameter("DUMMY")
But should not match:
int companyNo = SecurityIssues.StringToIntCompany(request.getParameter("COMPANY_NO"))
With inspiration and the links provided by @michaeak (thank you), as well as testing in https://regex101.com/, I appear to have found the answer:
^((?!SecurityIssues).)*(request\.getParameter)
The advantage of this answer is that I can blacklist the word SecurityIssues, as opposed to having to whitelist the formats that I do want.
Note that it is relatively slow, and it also slowed down my computer a lot when performing the search.
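Outside Eclipse, the same expression can be checked against the example lines from the question. A Python sketch; UNHANDLED and the surrounding list are illustrative names only:

```python
import re

UNHANDLED = re.compile(r"^((?!SecurityIssues).)*(request\.getParameter)")

lines = [
    'int companyNo = StringFunctions.StringToInt(request.getParameter("COMPANY_NO"))',
    'double percentage = StringFunctions.StringToDouble(request.getParameter("MARKETSHARE"))',
    'int c = request.getParameter("DUMMY")',
    'int companyNo = SecurityIssues.StringToIntCompany(request.getParameter("COMPANY_NO"))',
]

# Keep only the lines where request.getParameter is not preceded by SecurityIssues.
unhandled = [line for line in lines if UNHANDLED.search(line)]
for line in unhandled:
    print(line)
```

Only the first three lines are printed; the SecurityIssues line is skipped.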
Try e.g.
=\s*?((?!SecurityIssues).)*?(request\.getParameter)\(
Notes
Parentheses ( and ) are special characters used for grouping. To match them literally, escape them with \.
.* is greedy: it matches as much as it can, including characters that you don't want it to consume. The reluctant form .*? matches as little as possible instead, which helps when other items need to match after the wildcard.
There is a tutorial at https://docs.oracle.com/javase/tutorial/essential/regex/index.html ; I think all of these constructs should be available in Eclipse. You can then deal with generic replacement also.
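The notes on escaped parentheses and on greedy versus reluctant matching can be checked quickly. A small Python sketch; the sample string is invented for illustration:

```python
import re

sample = "f(a) + g(b)"

# \( and \) match literal parentheses; the inner ( ) is a capturing group.
greedy = re.search(r"\((.*)\)", sample).group(1)
reluctant = re.search(r"\((.*?)\)", sample).group(1)

print(greedy)     # a) + g(b
print(reluctant)  # a
```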
Problem
From reading Regular expression that doesn't contain certain string and Regular expression to match a line that doesn't contain a word? it seems quite difficult to create a regex matching anything but not to contain a certain word.

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end with a DOT FOLLOWED BY A NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, only a dot followed by nothing but digits at the end of the string is rejected... not that you specified that in your question.
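Running the edited pattern over the eight sample strings from the question confirms which ones it accepts. A Python sketch; the variable names are only for illustration:

```python
import re

pattern = re.compile(r"^((?!\.[\d]+$)[\w.])+$")

samples = ["abcd1", "abcdf12", "abcdf124", "abcd1.0",
           "abcd1.134", "abcdf12.13", "abcdf124.2", "abcdf124.21"]

matched = [s for s in samples if pattern.match(s)]
print(matched)  # ['abcd1', 'abcdf12', 'abcdf124']
```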
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*\.?[^0-9]*$
You may omit the anchoring metasymbols ^ and $ if you're using a function or tool that matches against the whole string.
Explanation: This regex allows any symbols except a dot until an (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in an improper format, as in the strings abcd1...3 or abcd1.fdfd2. It also won't work correctly for some strings with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambiguous).
Philosophical explanation: When using regular expressions, you often don't need to do exactly what your task seems to call for, so even a simple regex will do the job. An abstract example: you have a file whose lines are either numbers or some complicated names (without digits), and, say, you want to filter out all the numbers; then simple filtering by [^0-9] - grep '^[0-9]' will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script (say, in awk, python, perl or another language)? Or a short "hand-written" function, if you're implementing a stand-alone application. Regexes are cool, but they are often not the right tool to use.
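I would just use a simple negative look-behind anchored at the end (the backslashes are doubled as they would be inside a string literal, and the engine must support variable-length look-behind, as .NET does):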
.*(?<!\\.\\d+)$
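Not every engine accepts a look-behind of variable length; Python's re module, for example, requires a fixed-width look-behind. The sketch below swaps in a negative lookahead from the start of the string instead. This is a different technique from the look-behind above, chosen only so that the example runs in Python:

```python
import re

# Reject any string whose tail is a dot followed only by digits.
not_dot_number = re.compile(r"^(?!.*\.\d+$).*$")

for s in ["abcdf124", "abcdf124.21", "abcd1.fdfd2"]:
    print(s, bool(not_dot_number.match(s)))
```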

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
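Applied to the example string, the two expressions behave as follows. A Python sketch, with group(1) holding the captured part in each case:

```python
import re

s = "topics/install.xml#id_install"

before = re.search(r"^([^#]*)", s).group(1)
after = re.search(r"#(.*)$", s).group(1)

print(before)  # topics/install.xml
print(after)   # id_install
```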
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-collecting group to test the second half, and if it finds it, returns everything after the pound.
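The optional-fragment version can be sketched in Python as follows. split_at_hash is a hypothetical helper name, and the input is assumed to be a single line, since . does not match a newline:

```python
import re

SPLIT_AT_HASH = re.compile(r"([^\#]*)(?:\#(.*))?")

def split_at_hash(text):
    """Return (before, after); after is None when text contains no #."""
    match = SPLIT_AT_HASH.fullmatch(text)  # assumes text has no newline
    return match.group(1), match.group(2)

print(split_at_hash("topics/install.xml#id_install"))
print(split_at_hash("topics/install.xml"))
```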
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];
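The same idea without regular expressions, sketched in Python for comparison. str.partition is used here rather than a split because it also covers the case where the string has no # at all:

```python
example_string = "topics/install.xml#id_install"

first_string, separator, second_string = example_string.partition("#")
print(first_string)   # topics/install.xml
print(second_string)  # id_install

# When there is no "#", partition returns the whole string and two empty strings.
print("no-fragment.xml".partition("#"))  # ('no-fragment.xml', '', '')
```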

do we ever use regex to find regex expressions?

Let's say I have a very long string. The string has regular expressions at random locations. Can I use regex to find the regexes?
(Assuming that you are looking for a JavaScript regexp literal, delimited by /.)
It would be simple enough to just look for everything in between /, but that might not always be a regexp. For example, such a search would return /2 + 3/ of the string var myNumber = 1/2 + 3/4. This means that you will have to know what occurs before the regular expression. The regexp should be preceded by something other than a variable or number. These are the cases that I can think of:
/regex/;
var myVar = /regex/;
myFunction(/regex/,/regex/);
return /regex/;
typeof /regex/;
case /regex/;
throw /regex/;
void /regex/;
"global" in /regex/;
In some languages you can use lookbehind, which might look like this (untested!):
(?<=^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/
However, JavaScript did not support lookbehind when this was written (it was added in ES2018). I would recommend imitating lookbehind by putting the portion of the regexp designed to match the literal itself in a capturing group and accessing that. All cases of which I am aware can be matched by this regexp:
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)
NOTE: This regex sometimes results in false positives in comments.
If you want to also grab modifiers (e.g. /regex/gim), use
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)
If there are any reserved words I am missing that may be followed by a regexp literal, simply add this to the end of the first group: |\bkeyword
All that remains then is to access the capturing group, using code similar to the following:
var codeString = "function(){typeof /regex/;}";
var searchValue = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)/g;
// the global modifier is necessary!
var match = searchValue.exec(codeString); // "['typeof /regex/','/regex/']"
match = match[1]; // "/regex/"
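To collect every candidate rather than only the first, the capturing-group approach can be looped. The sketch below ports the modifier-aware expression to Python purely for illustration; find_regex_literals is a hypothetical name, and the same false positives (comments, for instance) apply:

```python
import re

REGEX_LITERAL = re.compile(
    r"(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*"
    r"(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)"
)

def find_regex_literals(code):
    """Return the text of each candidate regex literal found in code."""
    # group(1) is the literal itself; the prefix only provides context.
    return [m.group(1) for m in REGEX_LITERAL.finditer(code)]

print(find_regex_literals("var myVar = /ab+c/gi; var n = 1/2 + 3/4;"))
# ['/ab+c/gi']
```

The division in 1/2 + 3/4 is skipped because the slash is preceded by a digit, which none of the prefix alternatives accept.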
UPDATE
I just fixed an error with the regexp concerning escaped slashes that would have caused it to get only /\/ of a regexp like /\/hello/
UPDATE 4/6
Added support for void and in. You can't blame me too much for not including this at first, as even Stack Overflow doesn't, if you look at the syntax coloring in the first code block.
What do you mean by "regular expression"? aaaa is a valid regular expression. This is also a regular expression. If you mean a regular expression literal you might need something like this: /\/(?:[^\\\/]|\\.)*\// (adapted from here).
UPDATE
slebetman makes a good point; regular-expression literals don't need to start with /. In Perl or sed, they can start with whatever you want. Essentially, what you're trying to do is risky and probably won't work for all cases.
It's not the best way to go about this.
You can attempt to do so with some degree of confidence (using EOL to break the text up into substrings and finding ones that look like regular expressions, perhaps delimited by quotation marks); however, don't forget that a very long string CAN be a regex, so you will never have complete confidence using this approach.
Yes, if you know whether (and how!) your regex is delimited. Say, for example, that your string is something like
aaaaa...aaa/b/aaaaa
where 'b' is the 'regular expression' delimited by the character / (this is a near-basic scenario); what you have to do is scan the string for the expected delimiter, extract whatever is in between the delimiters (paying attention to escape chars) and you should be set.
This works if your delimiter is a known character and if you are sure that it appears an even number of times, or if you are willing to discard the rest (for example, which set of delimiters are you considering in the following string: aaa/b/aaa/c/aaa/d)?
If this is the case then you need to follow the same reasoning you'd do to find any substring in a given string. Once you've found the first regexp, keep parsing until you hit the end of the string or you find another regexp, and so on.
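The scan described above might be sketched like this in Python. extract_delimited is a hypothetical helper, the backslash escape character is an assumption, and an unterminated trailing section is simply discarded:

```python
def extract_delimited(text, delimiter="/", escape="\\"):
    """Return the substrings found between successive pairs of delimiter."""
    found = []
    current = None  # None while outside a delimited section
    i = 0
    while i < len(text):
        ch = text[i]
        if current is not None and ch == escape and i + 1 < len(text):
            # Keep the escape and the escaped character together.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == delimiter:
            if current is None:
                current = []  # opening delimiter
            else:
                found.append("".join(current))  # closing delimiter
                current = None
        elif current is not None:
            current.append(ch)
        i += 1
    return found

print(extract_delimited("aaaaa...aaa/b/aaaaa"))  # ['b']
print(extract_delimited("aaa/b/aaa/c/aaa/d"))    # ['b', 'c']
```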
I suspect, however, that you are looking for a 'general rule' to find any string that, once parsed, would result in a valid regular expression (say we're talking about POSIX regexp-- try man re_format if you're under *BSD). If that is the case you could try every possible substring of every length of the given string and feed it to a regexp parser for syntax correctness. Still, you have proven nothing of the validity of the regexp, i.e. on what they actually match.
If that is what you're trying to do I strongly recommend finding another way or explaining better what you are trying to accomplish here.

How to remove a small part of the string in the big string using RegExp

Hey guys, I don't know RegExp well yet. I know a little about it, but I'm not an experienced user.
Suppose that I run a RegExp match on a website and the matches are:
Data: Informations
Data: Liberty
Then I want to extract only Informations and Liberty, I don't want the Data: part.
Does Data: always appear at the beginning of a line?
Can there be multiple spaces between the : and the next word?
Do you know about groups?
What do you want: lazy matching vs greedy matching?
If so, you can use (with lazy matching):
^Data:\s+(.*?)$
With character classes:
^Data:\s+(\w+)$
if you know that it'll always be a word. Try this website.
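Both variants can be tried on the two sample lines. A Python sketch, assuming the matches arrive as one multi-line string; re.MULTILINE lets ^ and $ anchor at each line:

```python
import re

page = "Data: Informations\nData: Liberty"

lazy = re.findall(r"^Data:\s+(.*?)$", page, flags=re.MULTILINE)
words = re.findall(r"^Data:\s+(\w+)$", page, flags=re.MULTILINE)

print(lazy)   # ['Informations', 'Liberty']
print(words)  # ['Informations', 'Liberty']
```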
Can't be absolutely sure without knowing more about the potential matches, but this should be at least a good starting point:
Data: (.*)$
That will return everything after "Data: " to the end of the line.
Search for a regular expression like
Data: (.*)
Then use the "first submatch", which is often referred to by "$1" or "\1", depending on the language you are using.
Regular expression engines support what are commonly called "capturing groups". If you surround a pattern or part of a pattern with (), the part of the string matched by that part of the regular expression will be captured.
The command(s) you use to do the matching will determine how to get these captured values. They may be stored in special variables (eg: $1, $2) or you may be able to specify the names of the variables either embedded within the regular expression or as arguments to the regular expression command. Exactly how depends on what language you are using.
So, read up on the regexp commands for the language of your choice and look for the term "capturing groups" or maybe just "groups".
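As one concrete case of the above, Python exposes captured values through the match object, either by position or by a name embedded in the expression. A short sketch; the group name value is chosen here only for illustration:

```python
import re

match = re.search(r"Data: (?P<value>.*)", "Data: Liberty")

print(match.group(1))        # Liberty
print(match.group("value"))  # Liberty
```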