Go regex, Negative Look Ahead alternative - regex

I am trying to implement the regex (?<!\\{)\\[[a-zA-Z0-9_]+\\](?!\\}) with go regex.
Match value will be like [ua] and [ua_enc] and unmatched should be {[ua]} and {[ua_enc]}
As Negative lookahead is not supported in Go, what may be the alternative expression for this?

There is no alternative expression for this. Using plain (?:[^{]|^)(...)(?:[^}]|$) to capture the intended match and assert the previous and next characters are not braces will kind-of work: you will need to work with the first capture group instead of with the full match, and it will fail when there is only a single character between two matches (e.g. [foo]_[bar]). The best way, really, is to use FindAllStringSubmatchIndex and manually check the previous and next characters to make sure they are not braces outside of regexp.

Related

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

Regex everything after, but not including

I am trying to regex the following string:
https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop
I want only B00VU2BZRO.
This substring is always going to be a 10 characters, alphanumeric, preceded by dp/.
So far I have the following regex:
[d][p][\/][0-9B][0-9A-Z]{9}
This matches dp/B00VU2BZRO
I want to match only B00VU2BZRO with no dp/
How do I regex this?
Here is one regex option which would produce an exact match of what you want:
(?<=dp\/)(.*)(?=\/)
Demo
Note that this solution makes no assumptions about the length of the path fragment occurring after dp/. If you want to match a certain number of characters, replace (.*) with (.{10}), for example.
Depending on your language/method of application, you have a couple of options.
Positive look behind. This will make your regex more complicated, but will make it match what you want exactly:
(<=dp/)[0-9A-Z]{10}
The construct (<=...) is called a positive look behind. It will not consume any of the string, but will only allow the match to happen if the pattern between the parens is matched.
Capture group. This will make the regex itself slightly simpler, but will add a step to the extraction process:
dp/([0-9A-Z]{10})
Anything between plain parens is a capture group. The entire pattern will be matched, including dp/, but most languages will give you a way of extracting the portion you are interested in.
Depending on your language, you may need to escape the forward slash (/).
As an aside, you never need to create a character class for single characters: [d][p][\/] can equally well be written as just dp\/.

Regex matchin "java.util" but not "java.util.Collections"

I have a regex:
(abc|xyz|java\.util|)
However, I would like to ignore java.util.Collections. I'm stumped as to how to do this.
It's as simple as not matching a dot: [^.]
Of course, there might be other solutions that work better for you, depending on things like if that's the whole string, if a character is guaranteed to come after it, etc. If you give some more details, I can be more specific.
For example, if it's an import statement, you could just match a semicolon by putting a literal semicolon after it. If you plan to use the bit immediately afterwards, use a negative lookahead: (?!\.) If the string will end after the util, anchor it to the end with $.
If you want to fail on only java.util.Collections but accept anything else, then you want to use the specific negative lookahead (?!\.Collections). If you want to only allow one thing (say Random), you can add(?:\.Random)? immediately after java.util in your current regex.
You could use the end of line character
(abc|xyz|java\.util|)$
Or negative look-ahead
(abc|xyz|java\.util|)$(?!\.)
You can use a negative lookahead and use a regex like this:
java\.util(?!\.Collections)
Working demo
So, you can add the pattern to your regex and have:
(abc|xyz|java\.util(?!\.Collections)|)

Collapse and Capture a Repeating Pattern in a Single Regex Expression

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

How to get the inverse of a regular expression?

Let's say I have a regular expression that works correctly to find all of the URLs in a text file:
(http://)([a-zA-Z0-9\/\.])*
If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?
You could simply search and replace everything that matches the regular expression with an empty string, e.g. in Perl s/(http:\/\/)([a-zA-Z0-9\/\.])*//g
This would give you everything in the original text, except those substrings that match the regular expression.
If for some reason you need a regex-only solution, try this:
((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)
I expanded the set of of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but this is by no means meant to be exact or exhaustive.
The regex is a bit of a monster, so I'll try to break it down:
(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%])
The first potion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is sought but not captured. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without capturing that portion.
We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)
Both of those checks are put in parenthesis and OR'd together with the | character. After that, .+? matches the string we are trying to capture.
Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z. Similarly to a first big group, we wrap it in parenthesis and OR the two possibilities together.
The | symbol requires the parenthesis because its precedence is very low, so you have to explicitly state the boundaries of the OR.
This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.
Corrections welcome, of course!
If I understand the question correctly, you can use search/replace...just wildcard around your expression and then substitute the first and last parts.
s/^(.*)(your regex here)(.*)$/$1$3/
im not sure if this will work exactly as you intend but it might help:
Whatever you place in the brackets [] will be matched against. If you put ^ within the bracket, i.e [^a-zA-Z0-9/.] it will match everything except what is in the brackets.
http://www.regular-expressions.info/