Regex character interval with exception - regex

Say I have an interval of characters ['A'-'Z']. I want to match each of these characters except the letter 'F', and I need to do it with the ^ operator. That is, I don't want to split it into two different intervals.
How can I do it the best way? I want to write something like ['A'-'Z']^'F' (All characters between A-Z except the letter F). This site can be used as reference: http://regexr.com/
EDIT: The relation to ocaml is that I want to define a regular expression of a string literal in ocamllex that starts/ends with a doublequote ( " ) and takes allowed characters in a certain range. Therefore I want to exclude the doublequotes because it obviously ends the string. (I am not considering escaped characters for the moment)

Since it is very rare to find two regular expression libraries or processors with exactly the same regular expression syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, Ecmascript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
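As a quick sanity check of the two lookaround forms above, here is a minimal Python sketch (Python's re module supports both kinds of assertion). The helper name matched_chars is made up for illustration; it collects every printable ASCII character that a pattern accepts, so the patterns can be compared with the explicit two-interval class.

```python
import re
import string

# The two lookaround forms from the answer, plus the explicit two-interval class.
NEG_LOOKAHEAD = re.compile(r"(?!F)[A-Z]")
POS_LOOKAHEAD = re.compile(r"(?=[A-Z])[^F]")
EXPLICIT = re.compile(r"[A-EG-Z]")


def matched_chars(pattern):
    """Return the set of printable ASCII characters the pattern matches in full."""
    return {c for c in string.printable if pattern.fullmatch(c)}


expected = set(string.ascii_uppercase) - {"F"}
for pattern in (NEG_LOOKAHEAD, POS_LOOKAHEAD, EXPLICIT):
    # Each line prints the pattern and whether it accepts exactly A-Z minus F.
    print(pattern.pattern, matched_chars(pattern) == expected)
```

All three patterns should accept the same set here, which is what "effectively do a conjunction" means in this simple single-character case.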

The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']

(?!F)[A-Z] or ((?!F)[A-Z])*
This will match every uppercase character excluding 'F'

Use character class subtraction:
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
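The && intersection syntax above is not available everywhere. As a rough sketch for engines without it, assuming Python's re module (which has no && operator), a negative lookahead can stand in for the subtraction. The names CONSONANT and EXPLICIT are only illustrative.

```python
import re
import string

# Python's re has no && intersection, so a negative lookahead stands in for it.
CONSONANT = re.compile(r"(?![EIOU])[B-Z]")
# The hand-split version from the answer, used here only as a cross-check.
EXPLICIT = re.compile(r"[B-DF-HJ-NP-TV-Z]")

consonants = [c for c in string.ascii_uppercase if CONSONANT.fullmatch(c)]
explicit = [c for c in string.ascii_uppercase if EXPLICIT.fullmatch(c)]

print("".join(consonants))
print(consonants == explicit)
```

The lookahead form keeps the list of exceptions in one place, which is the same readability argument made for the && form.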

The regex below accomplishes what you want using ^ and without splitting into different intervals. It also resembles your original thought (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed, simply remove the i flag.
Demo

Related

Is there alternative regex syntax to avoid the error "look-around, including look-ahead and look-behind, is not supported"?

I tried to implement this regular expression for checking if a string ("username") has a length between 3 and 30, contains only letters (a-z), numbers (0-9), and periods (.) (not consecutive):
use regex::Regex; // 1.3.5
fn main() {
    Regex::new(r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$").unwrap();
}
When trying to compile the regex, I get this error:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$").unwrap();
^^^
error: look-around, including look-ahead and look-behind, is not supported
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is there an alternative regex or ways to validate strings with these requirements?
I could remove the length {3,30} and get string length as suggested, but for the second part (?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$ (prevent consecutive dots)?
The issue at hand is what is meant by "regular expression". Wikipedia has good information on this, but a simple summary is that a regular language is one defined with a few simple operations, including literal matches, alternation, and the Kleene star (match zero or more). Regex libraries have added features that don't extend this language, but make it easier to use (such as being able to say [a-z] instead of (a|b|c|d|e|f...|z)).
Then along came Perl, which implemented support for regular expressions. However, instead of using the commonly used NFA/DFA implementation for regular expressions, it implemented them using backtracking. There are two consequences of this: one, it allowed things beyond regular languages to be added, such as backreferences, and two, it can be really, really slow.
Many languages used these backtracking implementations of regular expressions, but there has been a somewhat recent resurgence of removing the features from the expressions that make them difficult to implement efficiently, specifically backreferences and look-around. Go has done this, and the RE2 library is a C++ implementation of this. And, as you've discovered, the regex crate also works this way. The advantage is that it always matches in linear time.
For your particular example, what you are trying to match is indeed still a regular language; it just has to be expressed differently. Let's start with the easy part: matching the characters, but not allowing consecutive dots. Instead of thinking of it this way, think of it as matching possibly a dot between the characters, but the characters themselves aren't optional. In other words, we can match with: [a-z0-9](\.?[a-z0-9])*. We first match a single character. If you want to allow this to start with a dot, you could remove this part. Then we need zero or more occurrences of an optional dot followed by a single non-dot character. You could append a \.? if you want to allow a dot at the end.
The second requirement, of 3-30 characters, would make this regex rather complicated, because our repeated sequence is of 1 or 2 characters. I would suggest, instead, just checking the length programmatically in addition to checking the regex. You could also make a second regex that checks the length, and check that both match (regular languages are closed under intersection, but standard regex syntax has no "and" operator).
You may also find that, depending on how you are matching, you have to anchor the match (putting a ^ at the start and a $ at the end).
A solution to the full problem:
use regex::Regex; // 1.3.5
fn main() {
    // Characters with at most one dot between them; no leading or trailing dot.
    let pat = Regex::new(r"^[a-z0-9](\.?[a-z0-9])*$").unwrap();
    let names = &[
        "valid123",
        "va.li.d.12.3",
        ".invalid",
        "invalid.",
        "double..dot",
        "ss",
        "really.long.name.that.is.too.long",
    ];
    for name in names {
        // len() counts bytes, which equals the character count for the ASCII-only names accepted here.
        let len = name.len();
        let valid = pat.is_match(name) && len >= 3 && len <= 30;
        println!("{:?}: {:?}", name, valid);
    }
}
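As a cross-check of the rewrite, the following Python sketch compares the original look-around pattern with the look-around-free pattern plus a length check. It uses Python only because its re module accepts look-around, so both versions can run side by side. The names LOOKAROUND, PLAIN and is_valid are assumptions made for this illustration.

```python
import re

# The pattern from the question; Python's backtracking engine accepts look-around.
LOOKAROUND = re.compile(r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$")
# The look-around-free rewrite; fullmatch() below supplies the anchoring.
PLAIN = re.compile(r"[a-z0-9](?:\.?[a-z0-9])*")

NAMES = [
    "valid123",
    "va.li.d.12.3",
    ".invalid",
    "invalid.",
    "double..dot",
    "ss",
    "really.long.name.that.is.too.long",
]


def is_valid(name):
    """Length check done in code, character rules done by the plain regex."""
    return 3 <= len(name) <= 30 and PLAIN.fullmatch(name) is not None


for name in NAMES:
    # Prints the name, the rewritten check, and the original look-around check.
    print(name, is_valid(name), LOOKAROUND.match(name) is not None)
```

On these sample names the two checks should agree, which supports the claim that the language is still regular and only the length test needs to move into code.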

Matching Word() when word is not (some word)

Specifically, I want to match functions in my Javascript code that are not in a set of common standard Javascript functions. In other words, I want to match user defined functions. I'm working with vim's flavour of regexp, but I don't mind seeing solutions for other flavours.
As I understand it, regexp crawls through a string character by character, so thinking in terms of sets of characters can be problematic even when a problem seems simple. I've tried negative lookahead, and as you might expect, all that does is prevent the first character of the functions I don't want from being matched (i.e., onsole.log instead of console.log).
(?(?!(if)|(console\.log)|(function))\w+)\(.*\)
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
https://regexr.com/
I would like to be able to crawl through a function and see where it is calling other usermade functions. Will I need to do post-processing (ie mapping with another regexp) on the matches to reject matches I don't want, or is there a way to do this in one regexp?
The basic recipe for a regular expression that matches all words except foo (in Vim's regular expression syntax) is:
/\<\%(foo\>\)\@!\k\+\>/
Note how the negative lookahead (\@!) needs an end assertion (here: \>) of its own; otherwise it would also exclude anything that merely starts with the expression!
Applied to your examples (excluding if (potentially followed by whitespace), console.log, and function, each followed by an opening parenthesis), that gives:
\<\%(\%(if *\|console\.log\|function\)(\)\@!\(\k\|\.\)\+\>(.*)
As you seem to want to include the entire object chain (so JSON.parse instead of just parse), the actual match includes both keyword characters (\k) and the period. There's one complication with that: the match will latch onto the log() in console.log(), because the leading keyword boundary assertion (\<) matches there as well. We can disallow that match by also excluding a period just before the function, i.e. by placing the negative lookbehind \.\@<! in between:
\<\%(\%(if *\|console\.log\|function\)(\)\@!\.\@<!\(\k\|\.\)\+\>(.*)
That will highlight just the following calls:
JSON.parse(localStorage["beards"])
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)
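For readers working outside Vim, here is a rough Python sketch of the same idea. It is not a translation of the Vim pattern; USER_CALL is a hypothetical pattern written for this illustration, and it captures only the dotted name before the opening parenthesis rather than the whole call.

```python
import re

# Hypothetical Python take on the idea, not the Vim pattern itself:
# (?<![\w.])  the name must not continue an identifier or follow a period
# (?!...\()   the name must not be one of the excluded words followed by "("
# ([\w.]+)\(  the dotted name itself, then the opening parenthesis
USER_CALL = re.compile(
    r"(?<![\w.])(?!(?:if\s*|console\.log|function)\()([\w.]+)\("
)

SOURCE = """\
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)
"""

# findall() returns the contents of the single capturing group for each match.
calls = USER_CALL.findall(SOURCE)
print(calls)
```

The lookbehind plays the same role as the \.\@<! step: without it, the search could restart inside console.log and report log on its own.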

Do not include the condition itself in regex

Here's the regexp:
/\.([^\.]*)/g
But for string name.ns1.ns2 it catches .ns1 and .ns2 values (which does make perfect sense). Is it possible only to get ns1 and ns2 results? Maybe using assertions, nuh?
You have the capturing group, use its value, however you do it in your language.
JavaScript example:
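var list = Array.from("name.ns1.ns2".matchAll(/\.([^.]+)/g), function (m) { return m[1]; });
// list now contains 'ns1' and 'ns2' (with the g flag, match() alone would return '.ns1' and '.ns2')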
If you can use lookbehinds (most modern regex flavors; JavaScript only since ES2018), you can use this expression:
(?<=\.)[^.]+
In Perl you can also use \K like so:
\.\K[^.]+
I'm not 100% sure what you're trying to do, but let's go through some options.
Your regex: /\.([^\.]*)/g
(Minor note: you don't need the backslash in front of the . inside a character class [..], because a . loses its special meaning there already.)
First: matching against a regular expression is, in principle, a Boolean test: "does this string match this regex". Any additional information you might be able to get about what part of the string matched what part of the regex, etc., is entirely dependent upon the particular implementation surrounding the regular expression in whatever environment you're using. So, your question is inherently implementation-dependent.
However, in the most common case, a match attempt does provide additional data. You almost always get the substring that matched the entire regular expression (in Perl 5, it shows up in the $& variable). In Perl5-compatible regular expressions, if you surround part of the regular expression with unquoted parentheses, you will additionally get the substrings that matched each set of those as well (in Perl 5, they are placed in $1, $2, etc.).
So, as written, your regular expression will usually make two separate results available to you: ".ns1", ".ns2", etc. for the entire match, and "ns1", "ns2", etc. for the subgroup match. You shouldn't have to change the expression to get the latter values; just change how you access the results of the match.
However, if you want, and if your regular expression engine supports them, you can use certain features to make sure that the entire regular expression matches only the part you want. One such mechanism is lookbehind. A positive lookbehind will only match after something that matches the lookbehind expression:
/(?<=\.)([^.]*)/
That will match any sequence of non-periods but only if they come after a period.
Can you use something like string splitting, which allows you to break a string into pieces around a particular string (such as a period)?
It's not clear what language you're using, but nearly every modern language provides a way to split up a string. e.g., this pseudo code:
string myString = "bill.the.pony";
string[] brokenString = myString.split(".");
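The three approaches discussed in these answers (reading the capturing group, using a lookbehind, and splitting the string) can be compared in a short Python sketch. The variable names are only illustrative. In Python, re.findall returns the group contents when the pattern has exactly one capturing group.

```python
import re

text = "name.ns1.ns2"

# 1. Keep the dot in the match but read only the capturing group.
by_group = re.findall(r"\.([^.]+)", text)

# 2. Use a positive lookbehind so the dot is never part of the match.
by_lookbehind = re.findall(r"(?<=\.)[^.]+", text)

# 3. Skip regular expressions: split on the dot and drop the first piece.
by_split = text.split(".")[1:]

print(by_group, by_lookbehind, by_split)
```

All three give the same list for this input, so the choice mostly depends on what the engine supports and on how the result is consumed.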

The Different Delimiters of Regex

When I look up regular expressions for various purposes, I see people using delimiters like /, #, !, and ~. Do these do anything different, or do they have the same effect?
They don't do anything different, they delimit the regular expression (in languages where it is needed).
The difference is: the behaviour of that character inside the regex does change. The regex delimiter becomes an additional special character and needs to be escaped (==> choose a delimiter that you don't need within the regex!).
Side note: In PHP you can even use a regex special character like + or | as the regex delimiter, but this works only when you don't need that character inside the regex (NOT recommended).
In some languages you can choose the delimiters, in others you can't.
You must escape that delimiter every time it appears in the regular expression. Choosing a delimiter that does not occur in the expression reduces the need for escaping, making the expression easier to read.
The following two regular expressions are identical, except that the first uses / as a delimiter, whereas the second uses #:
/http:\/\/example\.com\/.*\/foo\//
#http://example\.com/.*/foo/#
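Python is one of the languages where you cannot choose a delimiter, because a pattern is an ordinary string. The sketch below, with illustrative names and a made-up sample URL, shows that the slashes then need no escaping, and that escaping them anyway does not change what is matched.

```python
import re

# Python patterns are ordinary strings, so there is no delimiter to escape.
UNESCAPED = re.compile(r"http://example\.com/.*/foo/")
# Escaping the slashes anyway is harmless: \/ still matches a literal slash.
ESCAPED = re.compile(r"http:\/\/example\.com\/.*\/foo\/")

url = "http://example.com/a/b/foo/"  # sample input invented for this sketch

print(UNESCAPED.fullmatch(url) is not None)
print(ESCAPED.fullmatch(url) is not None)
```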

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceded or followed by a double or single quote. So it only handles constructs like the ones you mentioned; something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
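A minimal sketch of such a parser, written in Python and assuming that quotes are never escaped, could look like this. The function name unquoted_colons is made up for the example; it walks the input one character at a time and remembers which quote, if any, is currently open.

```python
def unquoted_colons(text):
    """Return the 0-based positions of colons that are outside any quotes."""
    positions = []
    open_quote = None  # the quote character that is currently open, if any
    for index, char in enumerate(text):
        if open_quote is not None:
            # Inside quotes: only the matching quote character closes them.
            if char == open_quote:
                open_quote = None
        elif char in "'\"":
            open_quote = char
        elif char == ":":
            positions.append(index)
    return positions


for sample in (
    "something:'firstValue':'secondValue'",
    "something:'no:match'",
):
    print(sample, unquoted_colons(sample))
```

Because the open quote character is remembered, a double quote inside single quotes (or the reverse) is treated as ordinary text.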
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = () = $string =~ /:/g;
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
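For comparison, the same strip-then-count idea can be sketched in Python. This is an illustration under one stated change: the pattern QUOTED below pairs each opening quote with the same kind of closing quote, which is slightly stricter than the Perl pattern above. The names QUOTED and count_unquoted_colons are assumptions.

```python
import re

# A single-quoted run or a double-quoted run, each closed by its own quote type.
QUOTED = re.compile("'[^']*'|\"[^\"]*\"")


def count_unquoted_colons(text):
    """Remove quoted sections from a copy of the text, then count colons."""
    stripped = QUOTED.sub("", text)
    return stripped.count(":")


for sample in (
    "something:'firstValue':'secondValue'",
    "something:\"firstValue\":'secondValue'",
    "something:'no:match'",
):
    print(sample, count_unquoted_colons(sample))
```

As with the Perl version, the quoted data is discarded along the way, so this is only suitable for counting, not for splitting into fields.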
Uppps ... missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (but the .NET implementation for example has an extension that can do it, but it's a bit complicated).
You can use negated character groups to do this.
[^'"]:[^'"]
You can further wrap the quotes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use assertion.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right, using * within a look behind assertion is not allowed.
You can try to catch the strings within the quotes
/(?<q>'|")([\w ]+)(\k<q>)/m
The first group defines the allowed quote types; the second group takes all word characters (letters, digits, underscore) and spaces.
The nice thing about this solution is that it matches ONLY strings whose opening and closing quotes are the same.
Try it at regex101.com
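The backreference approach carries over to Python with its own named-group syntax, (?P<q>...) to define the group and (?P=q) to refer back to it. The sketch below is an illustration only; QUOTED and quoted_strings are made-up names.

```python
import re

# (?P<q>['"]) remembers which quote opened the string; (?P=q) demands the same one.
QUOTED = re.compile(r"""(?P<q>['"])([\w ]+)(?P=q)""")


def quoted_strings(text):
    """Return the contents of every quoted run whose quotes match."""
    return [match.group(2) for match in QUOTED.finditer(text)]


print(quoted_strings("something:\"firstValue\":'secondValue'"))
print(quoted_strings("'mismatched\""))
```

Note that [\w ]+ does not include the colon, so a quoted value such as 'no:match' is not captured by this pattern.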