How do I match a string that is not in a comment? - regex

I use the following in my .vimrc to match capitalised strings and highlight them:
match Macro /\v<[A-Z|_]{2,}>/
However, I don't want to match comments (i.e. where a // precedes the text on the same line, or where the text is surrounded by /* and */).
How do I modify the above to achieve this?

I'm assuming that the | in your regex was supposed to mean "or." It doesn't: within brackets, no "or" is required. Your | refers to the actual character |.
This regular expression should do the trick about 98% of the time, maybe more:
\v(\/\/[^\n]*|\/\*(\_[^*]|\*\_[^/])*)@<!<[A-Z_]{2,}>
It uses negative lookbehind (@<! in Vim's very-magic syntax) to make sure that there is no // preceding the string on the same line and no /* preceding it that is not followed by a */. It fails in the following case:
if (string == "/*") { // Looks like the start of a block comment
return CONSTANT; // Won't be highlighted
}
If you want better results than this (that is, if you're worried that you'll obsess over the bug whenever you run into it) you could make this more sophisticated. How sophisticated depends on your language. In JavaScript, for example, you will need to worry about regex literals as well as strings:
// Looks like a comment after the "//" in the regex:
if (/\//.test(string)) return CONSTANT; // Won't be highlighted
If you want an idea of how complicated a regex to match a regex is, look at my answer here.
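Outside Vim, the same goal (find capitalised names, skip comments) can be approached without lookbehind. Here is a minimal Python sketch, not a translation of the Vim pattern: it matches comments first and only keeps the capitalised names found elsewhere. The names TOKEN and constants_outside_comments are made up for illustration, and it shares the weakness described above (a "/*" inside a string literal still looks like a comment start).

```python
import re

# Alternatives are tried left to right at each position, so comments are
# consumed whole before the identifier alternative gets a chance.
TOKEN = re.compile(
    r"//[^\n]*"             # line comment, up to end of line
    r"|/\*[\s\S]*?\*/"      # block comment, possibly spanning lines
    r"|\b([A-Z_]{2,})\b"    # capitalised identifier (the only capture)
)

def constants_outside_comments(source):
    """Return capitalised identifiers that are not inside a comment."""
    return [m.group(1) for m in TOKEN.finditer(source) if m.group(1) is not None]

sample = """\
return CONSTANT; // NOT_THIS one
/* NOR_THIS
   OR_THAT */
int x = MAX_LEN;
"""
print(constants_outside_comments(sample))  # ['CONSTANT', 'MAX_LEN']
```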

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix word (or a space) that must be present wherever this logic is checked? In the above case, if the prefix is update, the regex should pick test.table, but if the statement is something like select test.table, the regex should not pick up this combination. The same applies to the suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
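To see the pattern in action, here is a small Python check against the first example sentence from the question (Python is only my choice for the demo; the pattern itself is not Python-specific):

```python
import re

# Non-whitespace run, a literal dot, then another non-whitespace run.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

sentence = "I am testing something like test.test.test in sentence."
print(DOTTED.findall(sentence))  # ['test.test.test']
```

The trailing "sentence." is not matched because nothing but the end of the string follows its dot.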
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
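If you happen to be in Python, note that its re module requires lookbehinds to have a fixed width, so keywords of different lengths (say UPDATE and FROM) cannot go into one lookbehind. A rough sketch of a workaround is to match the keyword normally and capture only the dotted name. The helper name tables_after is made up for this example:

```python
import re

def tables_after(sql, keywords):
    """Return dotted names that directly follow one of the given keywords."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # The keyword is matched but not captured; findall returns group 1 only.
    pattern = re.compile(rf"\b(?:{alternation})\s+(\S+\.\S+)", re.IGNORECASE)
    return pattern.findall(sql)

example3 = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

print(tables_after(example3, ["INTO", "FROM"]))  # ['test.table', 'test.table']
```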
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches a run of word characters (\w, repeated between zero and unlimited times), followed by a dot, followed by another run of word characters repeated between zero and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java (the backslash must be doubled inside a string literal):
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also match something like '.word' or 'word..word', which I assume does not occur in your input.
See this screenshot for more details
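If inputs like '.word' or 'word..word' can occur after all, a stricter variant (my own suggestion, not part of the answer above) requires at least one word character on both sides of every dot. A quick Python sketch:

```python
import re

# One or more word characters, then one or more ".word" segments.
STRICT_DOTTED = re.compile(r"\w+(?:\.\w+)+")

print(STRICT_DOTTED.findall("like test.test.test in sentence."))  # ['test.test.test']
print(STRICT_DOTTED.findall(".word and word..word"))               # []
```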

Regex - find all instances of words that begin with # but do not contain 'administrator'

I am having a hard time getting my head around this regex. What I am trying to do is as follows:
Match any occurrence of words that begin with #. So, for example, if the code finds the following tags #jon, #james, #jill, then it should hide the text.
But if the code finds occurrences of the following tag: #ADMINISTRATOR, then it should display the text
In addition, if the code finds no occurrences of any words tagged with #, it should also display the text.
Essentially, I want to hide any comments that are hashed tagged with a user name other than ADMINISTRATOR.
So far, I have the following code:
if (mb_ereg_match(".*(#[^ADMINISTRATOR]){1,}.*", $comment))
{
$hideComment = true;
}else
{
$hideComment = false;
}
The above code works for the most part, except for when the text being searched contains any one of the following:
#A, #AD, #ADM, #ADMI, #ADMIN, etc.
then the code does not hide the comment, which is not what I want. I only want an exact match to '#ADMINISTRATOR' to display the comments. Plus, any comment that contains no tags should also be displayed.
Any idea what I am doing wrong?
This is a negative lookahead based regex that will work for you:
(?i)#(?!ADMINISTRATOR)\w+
Here is a Live Demo
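For illustration, here is how the hide/show decision could look in Python (the question uses PHP; this is only a sketch of the logic). The function name should_hide is made up, and the \b after ADMINISTRATOR is my own addition so that a longer tag such as #ADMINISTRATORS still counts as another user:

```python
import re

# A tag is "#" plus word characters, unless the word is exactly ADMINISTRATOR.
OTHER_TAG = re.compile(r"#(?!ADMINISTRATOR\b)\w+", re.IGNORECASE)

def should_hide(comment):
    """Hide the comment if it contains any tag other than #ADMINISTRATOR."""
    return OTHER_TAG.search(comment) is not None

print(should_hide("hello #jon"))            # True
print(should_hide("#ADMINISTRATOR says"))   # False
print(should_hide("no tags at all"))        # False
print(should_hide("partial #ADMIN tag"))    # True
```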
I've not used whatever program you're using to write your regex, but the syntax in general isn't doing what you think it is. When you use a set of [], you are saying that what lies within is a class of characters. Your regular expression states I'm looking for something that follows a #, but that something doesn't begin with an A, or any of the following characters.
What you want to use is another grouping. You can use () instead of [] to represent a specific sequence of characters. However, as you may notice, () is also what you use to capture part of your regex. Thus, you'll want to use a non-capturing group. In most regex flavours (Python's, for example), non-capturing groups look like this: (?:ADMINISTRATOR)
All put together, with the group turned into a negative lookahead (?!...) so that it excludes the word rather than matching it, your call might look something like this:
mb_ereg_match(".*#(?!ADMINISTRATOR\b)\w+", $comment)
A character class (bracket expression) in a regex will always match a single character, whether negated or not. [ADMINISTRATOR] will match either an A, a D, an M and so forth. [^ADMINISTRATOR] will match any single character that is not an A, D, M, etc.
If you want a regex that does not have a given string, I'd suggest using a negative lookahead instead, as anubhava suggested.
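A quick way to convince yourself of the single-character behaviour is to try the class on a few tags. This Python snippet is only a demonstration; the tag names are invented:

```python
import re

# "#" followed by ONE character that is not any of the letters A, D, M, I, N, S, T, R, O.
CHAR_CLASS = re.compile(r"#[^ADMINISTRATOR]")

print(bool(CHAR_CLASS.search("#jon")))    # True:  'j' is not in the class
print(bool(CHAR_CLASS.search("#ADMIN")))  # False: 'A' is in the class
print(bool(CHAR_CLASS.search("#TOM")))    # False: 'T' is in the class, although TOM is another user
```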

how to avoid to match the last letter in this regexp?

I have a question about regexp in Tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output maybe look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following regexp:
output: TIP_(/[0-9].*/[0-9])
but why does it not match 12.3.4 or 12?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
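For readers who are not using Tcl, the same pattern behaves the same way elsewhere. Here is a small Python sketch (the function name dotted_digits just mirrors the Tcl variable above); like the Tcl version, it assumes well-formed input and would also accept a stray trailing dot:

```python
import re

# Literal prefix, then capture one or more digits or periods.
VERSION = re.compile(r"output: TIP_([0-9.]+)")

def dotted_digits(line):
    """Return the dotted number after 'output: TIP_', or None if absent."""
    match = VERSION.search(line)
    return match.group(1) if match else None

print(dotted_digits("first output: TIP_12.3.4 %"))  # 12.3.4
print(dotted_digits("second output: TIP_12 %"))     # 12
```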
Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}

Regular Expression for comments but not within a "string" / not in another container

So I need a regular expression for finding single line and multi line comments, but not in a string. (eg. "my /* string")
for testing (# single line, /* & */ multi line):
# complete line should be found
lorem ipsum # from this to line end
/*
all three lines should be found
*/ but not here anymore
var x = "this # should not be found"
var y = "this /* shouldn't */ match either"
var z = "but" & /* this must match */ "_"
SO does the syntax display really well; I basically want all the gray text.
I don't care if its a single regex or two separates. ;)
EDIT: one more thing. the opposite would also satisfy me, searching for a string which is not in a comment
this is my current string matching: "[\s\S]*?(?<!\\)" (indeed: will not work with "\\")
EDIT2:
OK finally I wrote my own comment parser -.-
And if someone else is interested in the source code, grab it from here: https://github.com/relikd/CommentParser
Here's one possibility (it does have an Achilles heel that I'll get to):
(#[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]|/\*([^*]|\*(?!/))*?\*/)(?=[^"]*(?:"[^"]*"[^"]*)*$)
In action here
With the GLOBAL and DOTALL flags, but not the MULTILINE flag.
Explanation of the regex:
(
  #[^"\n\r]*                     Hash mark followed by non-" and non-end-of-line
  (?:"[^"\n\r]*"[^"\n\r]*)*      If any quotes in the comment, they must be balanced
  [\r\n]                         Followed by end-of-line ($ except we don't have multiline flag)
|                                OR
  /\*([^*]|\*(?!/))*?\*/         /* xxx */ sort of comment
)                                BOTH FOLLOWED BY
(?=[^"]*(?:"[^"]*"[^"]*)*$)      only a *balanced* number of quotes for the *rest of the code :O!*
However, this relies on balanced quotes being used throughout the text (it also doesn't take into account escaped quotes, but it's easy enough to modify the regex to take that into account).
If a user has a comment with a " in it that isn't balanced...boom. You're screwed!
Regex is generally not recommended for things like HTML/code parsing, but if you can rely on the fact that quotes have to balance when you define a string, etc., you can sometimes get away with it.
Since you are also parsing comments, which have no set structure (ie you are not guaranteed that quotes within comments will be balanced), you won't be able to find a regex solution that works here.
Anything you think up can be outwitted by an unbalanced quote in a comment somewhere (say the comment was # remove all the " marks), or by multiline strings (where on a given line there may be unbalanced quotes).
Bottom line - you can probably make a regex that will work in most cases, but not for all. To get something watertight you'll have to write some code.
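As a rough idea of what "some code" could look like, here is a minimal Python sketch of a left-to-right scanner. It is my own illustration, not the parser the asker ended up writing. Strings are matched first and thrown away, so comment markers inside them are never seen, and a quote inside a comment is harmless because the comment is consumed whole. It assumes double-quoted strings with backslash escapes; the names TOKEN and find_comments are made up:

```python
import re

TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string with backslash escapes (discarded)
    r'|(#[^\n]*'              # captured: "#" comment up to end of line
    r'|/\*[\s\S]*?\*/)'       # captured: /* ... */ comment, may span lines
)

def find_comments(source):
    """Return all comments that are not inside a double-quoted string."""
    return [m.group(1) for m in TOKEN.finditer(source) if m.group(1) is not None]

sample = '''# complete line should be found
lorem ipsum # from this to line end
/*
all three lines should be found
*/ but not here anymore
var x = "this # should not be found"
var y = "this /* shouldn't */ match either"
var z = "but" & /* this must match */ "_"
'''

for comment in find_comments(sample):
    print(repr(comment))
```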
I would use two regular expressions for this:
/(\/\*[\s\S]*?\*\/)|(#.+?$)/m to find all the comments; the "m" modifier makes $ match at the end of every line, and [\s\S] lets a block comment span several lines
/"[^"]*?"/ to find all the strings
If you apply the highlighting to the comments first and only after to the strings, the invalid comments should disappear.

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
The class [^;] matches any character except a ;, so the group captures everything between B= and the first ; that follows it.
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
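The difference between the quantifiers is easy to observe directly. This Python snippet is just an illustration; it drops the leading and trailing .+ and searches for the B=...; part only:

```python
import re

s = "A=abc;B=def_3%^123+-;C=123;"

# Greedy: .+ grabs as much as possible, then backs off to the LAST semicolon.
print(re.search(r"B=(.+);", s).group(1))      # def_3%^123+-;C=123

# Lazy: .+? grows one character at a time and stops at the FIRST semicolon.
print(re.search(r"B=(.+?);", s).group(1))     # def_3%^123+-

# Negated class: can never run past a semicolon, so no backtracking is needed.
print(re.search(r"B=([^;]+);", s).group(1))   # def_3%^123+-
```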
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
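import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        Pattern p = Pattern.compile("B=(.*?);");  // lazy: stop at the first ';'
        Matcher m = p.matcher(s);
        if (m.find()) {
            System.out.println(m.group(1));       // prints def_3%^123+-
        }
    }
}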
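The same find-and-read-the-group idea can be wrapped into the getInnerString helper from the question. Here is a sketch in Python (chosen only for brevity; the snake_case name is my adaptation of the asker's pseudocode). The tokens are escaped so that characters like + or . in them are taken literally:

```python
import re

def get_inner_string(in_str, start_token, end_token):
    """Return the text between start_token and the next end_token, or None."""
    # re.escape makes the tokens literal; (.*?) is lazy, so it stops at the
    # first end_token after start_token.
    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    match = re.search(pattern, in_str)
    return match.group(1) if match else None

my_string = "A=abc;B=def_3%^123+-;C=123;"
print(get_inner_string(my_string, "B=", ";"))  # def_3%^123+-
```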
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.