Regular expression to match a string outside class or function block - regex

Here's the reduced case of PHP code:
use Package;
use Package2;
class {
    use Trait;
    function fn() {
        function() use ($var) {
        }
    }
}
I'd like to match only the use before Package; and Package2;, not the use before Trait nor the one before ($var).
Neither negative lookahead nor negative lookbehind seems to work. I tried the approach from "Regular Expression, match characters outside curly braces { }".
Obviously doesn't work: https://regex101.com/r/L6N4Ye/1
Using the PCRE interpreter.

Regex might not be the best choice here, but you could use one if you have control over the format of the code you are parsing. Otherwise, using a PHP parser would be the best idea.
With that in mind, how about checking whether the use is at the beginning of a line (^ with the multiline flag)?
^use\s+(?![^{]*\})
see here
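To try this pattern outside regex101, here is a minimal Python sketch. It assumes the nested code is indented and that ^ is used with the multiline flag; the names PHP_SAMPLE and top_level_uses are made up for illustration and are not part of the answer.

```python
import re

# Hypothetical sample: the reduced PHP case, with the nested code indented.
PHP_SAMPLE = """use Package;
use Package2;
class {
    use Trait;
    function fn() {
        function() use ($var) {
        }
    }
}
"""

# The pattern from the answer; re.MULTILINE makes ^ match at each line start.
TOP_LEVEL_USE = re.compile(r"^use\s+(?![^{]*\})", re.MULTILINE)


def top_level_uses(code):
    """Return the text of each matched statement, up to the next semicolon."""
    results = []
    for m in TOP_LEVEL_USE.finditer(code):
        results.append(code[m.start():code.index(";", m.start())])
    return results


print(top_level_uses(PHP_SAMPLE))
```

Note that it is the ^ anchor that excludes use Trait here. An unindented use Trait; at the start of a line would still match, which is why control over the formatting matters.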

I am not familiar with PHP syntax, so please forgive me for any missed syntactical considerations.
Since in this particular case you are sure that all the use statements you are interested in lie before the class boundary, I think it may help to look for every use that is not preceded by a {. This can be achieved with the following regex, which uses a negative lookbehind for {:
(?<!\{\s{0,100})\s*use\s*(?<pkg>.*);
After applying this to the entire source code, you may look for the groups named pkg in the matched substrings.
However, the not-so-good part of the negative lookbehind is the \s{0,100}, which I have included only to allow whitespace after the opening brace. There must be a better way to do this. I had to do it because negative lookbehinds need a calculable maximum length, so \s* will not work.
My assumptions on the syntax:
use is always lowercase
A use statement for a package necessarily ends with ;
Whitespace is allowed freely between tokens, as in Java

Related

Is there alternative regex syntax to avoid the error "look-around, including look-ahead and look-behind, is not supported"?

I tried to implement this regular expression for checking if a string ("username") has a length between 3 and 30, contains only letters (a-z), numbers (0-9), and periods (.) (not consecutive):
use regex::Regex; // 1.3.5
fn main() {
    Regex::new(r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$").unwrap();
}
When trying to compile the regex, I get this error:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
r"^(?=.{3,30}$)(?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$").unwrap();
^^^
error: look-around, including look-ahead and look-behind, is not supported
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is there an alternative regex or ways to validate strings with these requirements?
I could remove the length {3,30} and check the string length separately as suggested, but what about the second part, (?!\.)(?!.*\.$)(?!.*?\.\.)[a-z0-9.]+$ (which prevents consecutive dots)?
The issue at hand is what is meant by "regular expression". Wikipedia has good information on this, but a simple summary is that a regular language is one defined with a few simple operations, including literal matches, alternation, and the Kleene star (match zero or more). Regex libraries have added features that don't extend this language, but make it easier to use (such as being able to say [a-z] instead of (a|b|c|d|e|f...|z)).
Then along came Perl, which implemented support for regular expressions. However, instead of using the common NFA/DFA implementation, it implemented them using backtracking. This had two consequences: first, it allowed features beyond regular languages to be added, such as backreferences; second, it can be really, really slow.
Many languages adopted these backtracking implementations, but there has been a somewhat recent resurgence of engines that drop the features that are difficult to implement efficiently, specifically those that require backtracking. Go has done this, and the RE2 library is a C++ implementation of the same idea. And, as you've discovered, the regex crate also works this way. The advantage is that it always matches in linear time.
For your particular example, what you are trying to match is indeed still a regular language; it just has to be expressed differently. Let's start with the easy part: matching the characters, but not allowing consecutive dots. Instead of thinking of it this way, think of it as matching possibly a dot between the characters, but the characters themselves aren't optional. In other words, we can match with: [a-z0-9](\.?[a-z0-9])*. We first match a single character. If you want to allow this to start with a dot, you could remove this part. Then we need zero or more occurrences of an optional dot followed by a single non-dot character. You could append a \.? if you want to allow a dot at the end.
The second requirement, of 3-30 characters, would make this regex rather complicated, because our repeated sequence is of 1 or 2 characters. I would suggest, instead, just checking the length programmatically in addition to checking the regex. You could also make a second regex that checks the length, and check that both match (regular expression syntax has no "and" operator).
You may also find that, depending on how you are matching, you have to anchor the match (putting a ^ at the start and a $ at the end).
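The two-regex variant mentioned above could look roughly like this in Python. This is only a sketch: SHAPE, LENGTH and is_valid_username are illustrative names, and Python's re module is used here instead of the regex crate.

```python
import re

# Shape: letters and digits, with single dots allowed only between them.
SHAPE = re.compile(r"[a-z0-9](\.?[a-z0-9])*")
# Length: 3 to 30 characters of any kind (the shape is checked separately).
LENGTH = re.compile(r".{3,30}")


def is_valid_username(name):
    """Emulate an 'and' of two regexes: both must match the whole string."""
    # fullmatch anchors at both ends, playing the role of ^ and $.
    return bool(SHAPE.fullmatch(name)) and bool(LENGTH.fullmatch(name))


for name in ["valid123", "va.li.d.12.3", ".invalid", "double..dot", "ss"]:
    print(name, is_valid_username(name))
```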
A solution to the full problem:
use regex::Regex; // 1.3.5
fn main() {
    // No look-around needed: a dot is only allowed between two non-dot characters.
    let pat = Regex::new(r"^[a-z0-9](\.?[a-z0-9])*$").unwrap();
    let names = &[
        "valid123",
        "va.li.d.12.3",
        ".invalid",
        "invalid.",
        "double..dot",
        "ss",
        "really.long.name.that.is.too.long",
    ];
    for name in names {
        // The 3 to 30 length limit is checked in code rather than in the regex.
        let len = name.len();
        let valid = pat.is_match(name) && len >= 3 && len <= 30;
        println!("{:?}: {:?}", name, valid);
    }
}

Regular expression for word *not* in specific latex command

I am looking for a regular expression which will match all occurrences of foo unless it is in a \ref{..} command. So \ref{sec:foo} should not match.
I will probably want to add more commands for which the arguments should be excluded later so the solution should be extendable in that regard.
There are some similar questions trying to detect when something is in parenthesis, etc.
Regex to match only commas not in parentheses?
Split string by comma if not within square brackets or parentheses
The first has an interesting solution using alternatives: \(.*?\)|(,). The first alternative matches the unwanted versions and the second has the match group. However, since I am using this in a search & replace context, I cannot really use match groups.
A regex that does exactly what you want needs variable-length lookbehind, which is available in only a few regex flavors (such as C#'s and the regex module from PyPI for Python), so you will have to settle for a less-than-perfect solution. You can use this regex to match foo that is not within curly braces, as long as you don't have nested curly braces:
\bfoo\b(?![^{}]*})
This will not match a foo inside \ref{sec:foo}, nor one in somethingelse{sec:foo}: as you can see, it only matches a foo that isn't contained in curly braces. If you need a precise solution, it will need variable-length lookbehind support which, as I said, is available in very few flavors.
Regex Demo
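In a search & replace context, the lookahead approach can be sketched in Python as follows. The helper name replace_outside_braces and the sample text are made up for illustration.

```python
import re

# foo as a whole word, not followed by a closing brace without an
# intervening opening or closing brace (that is, not inside {...}).
NOT_IN_BRACES = re.compile(r"\bfoo\b(?![^{}]*})")


def replace_outside_braces(text, replacement):
    """Replace every foo that the pattern considers outside braces."""
    return NOT_IN_BRACES.sub(replacement, text)


sample = r"foo and \ref{sec:foo} and \label{foo} and foo"
print(replace_outside_braces(sample, "bar"))
```

The nesting limitation is easy to trigger: in \cmd{foo {x}} the foo is followed by an opening brace before any closing one, so it is replaced even though it sits inside braces.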

How can I simulate a negative lookup in a regular expression

I have the following regular expression that includes a negative lookahead. Unfortunately, the tool that I'm using does not seem to support it, so I'm wondering if it's possible to achieve negative lookahead behaviour without actually using one.
Here is my regular expression:
(?<![ABCDEQ]|\[|\]|\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)
Here it is working with sample data on Regex101.com:
see expression on regex101.com
I'm using a tool called Alteryx. The documentation indicates that it uses Perl, however, for whatever reason the look ahead does not work.
Alteryx appears to use the Boost library for its regex support, and the Boost documentation says lookbehind expressions must have a fixed length. It's more restrictive than PHP (PCRE), which allows you to use alternation in a lookbehind, as long as each branch is fixed-length. But that's easy enough to get around: just use multiple lookbehinds:
(?<![ABCDEQ])(?<!\[)(?<!\])(?<!\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)
That regex works for me in a Boost-powered regex tester, where yours doesn't. I would compress it a little more by putting square brackets inside the character set:
(?<![][ABCDEQ])(?<!\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)
The right bracket is treated as a literal when it's the first character listed, and the left bracket is never special (though some other flavors have different rules).
Here's the updated demo.
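The same workaround can be tried in Python, whose re module also requires fixed-width lookbehinds. This is a sketch using Python as a stand-in, not Boost or Alteryx itself; compiles is a hypothetical helper and the sample string is invented.

```python
import re

# Original: one lookbehind whose branches have different widths (1 and 3).
ORIGINAL = r'(?<![ABCDEQ]|\[|\]|\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)'
# Workaround: one fixed-width lookbehind per branch.
SPLIT = r'(?<![ABCDEQ])(?<!\[)(?<!\])(?<!\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)'


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(ORIGINAL), compiles(SPLIT))
print(re.search(SPLIT, "x 12+ rest").group(1))
```

Because all the lookbehinds are negative and sit at the same position, chaining them is equivalent to the single alternation: the match is rejected if any one of them finds its text.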

Do not include the condition itself in regex

Here's the regexp:
/\.([^\.]*)/g
But for string name.ns1.ns2 it catches .ns1 and .ns2 values (which does make perfect sense). Is it possible only to get ns1 and ns2 results? Maybe using assertions, nuh?
You have the capturing group, use its value, however you do it in your language.
JavaScript example:
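var list = Array.from("name.ns1.ns2".matchAll(/\.([^.]+)/g), function (m) { return m[1]; });
// list now contains 'ns1' and 'ns2' (with the g flag, match() alone returns '.ns1' and '.ns2')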
If you can use lookbehinds (most modern regex flavors, including JavaScript since ES2018, but not older JS engines), you can use this expression:
(?<=\.)[^.]+
In Perl you can also use \K like so:
\.\K[^.]+
I'm not 100% sure what you're trying to do, but let's go through some options.
Your regex: /\.([^\.]*)/g
(Minor note: you don't need the backslash in front of the . inside a character class [..], because a . loses its special meaning there already.)
First: matching against a regular expression is, in principle, a Boolean test: "does this string match this regex". Any additional information you might be able to get about what part of the string matched what part of the regex, etc., is entirely dependent upon the particular implementation surrounding the regular expression in whatever environment you're using. So, your question is inherently implementation-dependent.
However, in the most common case, a match attempt does provide additional data. You almost always get the substring that matched the entire regular expression (in Perl 5, it shows up in the $& variable). In Perl5-compatible regular expressions, if you surround part of the regular expression with unquoted parentheses, you will additionally get the substrings that matched each set of those as well (in Perl 5, they are placed in $1, $2, etc.).
So, as written, your regular expression will usually make two separate results available to you: ".ns1", ".ns2", etc. for the entire match, and "ns1", "ns2", etc. for the subgroup match. You shouldn't have to change the expression to get the latter values; just change how you access the results of the match.
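As an illustration of the two results, here is a small Python sketch (Python is just one possible environment, and whole_and_group is a made-up helper name). It returns the entire match and the subgroup for each hit.

```python
import re

# The regex from the question, without the redundant backslash in the class.
PATTERN = re.compile(r"\.([^.]*)")


def whole_and_group(text):
    """Pair each entire match (group 0) with its first subgroup (group 1)."""
    return [(m.group(0), m.group(1)) for m in PATTERN.finditer(text)]


print(whole_and_group("name.ns1.ns2"))
```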
However, if you want, and if your regular expression engine supports them, you can use certain features to make sure that the entire regular expression matches only the part you want. One such mechanism is lookbehind. A positive lookbehind will only match after something that matches the lookbehind expression:
/(?<=\.)([^.]*)/
That will match any sequence of non-periods but only if they come after a period.
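A minimal Python sketch of the lookbehind variant follows. It uses [^.]+ rather than [^.]* so that empty matches are not reported, and after_dots is an illustrative name.

```python
import re

# Positive lookbehind: the dot must precede the match but is not part of it.
AFTER_DOT = re.compile(r"(?<=\.)[^.]+")


def after_dots(text):
    """Return every run of non-dot characters that follows a dot."""
    return AFTER_DOT.findall(text)


print(after_dots("name.ns1.ns2"))
```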
Can you use something like string splitting, which allows you to break a string into pieces around a particular string (such as a period)?
It's not clear what language you're using, but nearly every modern language provides a way to split up a string. For example, in pseudocode:
string myString = "bill.the.pony";
string[] brokenString = myString.split(".");
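In Python, for instance, the pseudocode could be sketched like this (the function names are illustrative). The second helper shows the regex-based form, where the dot has to be escaped because an unescaped . matches any character.

```python
import re


def split_on_dots(text):
    """Plain string split: the separator is taken literally."""
    return text.split(".")


def split_on_dots_regex(text):
    """Regex split: the dot must be escaped to be treated literally."""
    return re.split(r"\.", text)


print(split_on_dots("bill.the.pony"))
# Dropping the first piece leaves only the parts that follow a dot.
print(split_on_dots("name.ns1.ns2")[1:])
```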

How do I match a pattern with optional surrounding quotes?

How would one write a regex that matches a pattern that can contain quotes, but if it does, must have matching quotes at the beginning and end?
"?(pattern)"?
Will not work because it will allow patterns that begin with a quote but don't end with one.
"(pattern)"|(pattern)
Will work, but is repetitive. Is there a better way to do that without repeating the pattern?
You can get a solution without repeating by making use of backreferences and conditionals:
/^(")?(pattern)(?(1)\1|)$/
Matches:
pattern
"pattern"
Doesn't match:
"pattern
pattern"
This pattern is somewhat complex, however. It first looks for an optional quote, and puts it into backreference 1 if one is found. Then it searches for your pattern. Then it uses conditional syntax to say "if group 1 matched, match the same quote again; otherwise match nothing". The whole pattern is anchored (which means that it needs to appear by itself on a line) so that unmatched quotes won't be accepted (otherwise the pattern in pattern" would match).
Note that support for conditionals varies by engine and the more verbose but repetitive expressions will be more widely supported (and likely easier to understand).
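Python's re module is one engine that supports this conditional syntax, so the pattern can be checked with a short sketch (accepts is a hypothetical helper, and pattern stands in for the real sub-pattern):

```python
import re

# (?(1)\1|) means: if group 1 matched, require the same quote again;
# otherwise match the empty string.
QUOTED_OR_BARE = re.compile(r'^(")?(pattern)(?(1)\1|)$')


def accepts(text):
    """True when the text is either bare or wrapped in matching quotes."""
    return QUOTED_OR_BARE.match(text) is not None


for candidate in ['pattern', '"pattern"', '"pattern', 'pattern"']:
    print(candidate, accepts(candidate))
```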
Update: A much simpler version of this regex would be /^(")?(pattern)\1$/, which does not need a conditional. When I was testing this initially, the tester I was using gave me a false negative, which led me to discount it (oops!).
I'll leave the solution with the conditional up for posterity and interest, but this is a simpler version that is more likely to work in a wider variety of engines (backreferences are the only feature being used here which might be unsupported).
This is quite simple as well: (".+"|.+). Make sure the first match is with quotes and the second without.
Depending on the language you're using, you should be able to use backreferences. Something like this, say:
(["'])(pattern)\1|^(pattern)$
That way, you're requiring that either there are no quotes, or that the SAME quote is used on both ends.
This should work with recursive regex (which needs longer to get right). In the meantime: in Perl, you can build a self-modifying regex. I'll leave that as an academic example ;-)
my @stuff = ( '"pattern"', 'pattern', 'pattern"', '"pattern' );
foreach (@stuff) {
    print "$_ OK\n" if /^
        (")?
        \w+
        (??{defined $1 ? '"' : ''})
        $
    /x
}
Result:
"pattern" OK
pattern OK
Generally @Daniel Vandersluis's response would work. However, in some engines the optional group (") is left unset when it matches nothing, so the backreference \1 fails and the unquoted pattern is rejected.
To avoid this problem, a more robust solution would be:
/^("|)(pattern)\1$/
Then the first group always participates in the match (possibly as an empty string), so \1 always succeeds. This expression can also be modified if there is some prefix in the expression and you want to capture it first:
/^(key)=("|)(value)\2$/
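Python's re module happens to be one of the engines where the difference shows up, so the variants can be compared with a small sketch. The constant names are illustrative, and pattern, key and value stand in for real sub-patterns.

```python
import re

# Optional group: left unset for unquoted input, so \1 cannot match.
OPTIONAL_GROUP = re.compile(r'^(")?(pattern)\1$')
# Empty alternative: group 1 always matches, possibly the empty string.
EMPTY_ALTERNATIVE = re.compile(r'^("|)(pattern)\1$')
# Same idea with a captured prefix; the quote is now group 2.
KEY_VALUE = re.compile(r'^(key)=("|)(value)\2$')

for text in ['pattern', '"pattern"', '"pattern', 'pattern"']:
    print(
        text,
        OPTIONAL_GROUP.match(text) is not None,
        EMPTY_ALTERNATIVE.match(text) is not None,
    )
```

In engines where a backreference to an unset group matches the empty string, as in JavaScript, both forms accept the unquoted input.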