Regex Ignore Comments In Java - regex

I can't seem to find a concise answer on this and as I know very little about regex, I feel the easiest option is to ask.
I am trying to count lines of code in Eclipse, which I can do, but the count includes the comments.
Basically, my regex pattern is "\n"
Pretty basic, yes, but try as I might, I can't seem to figure out a way to ignore a line starting with "//"
I've tried [^(//)] but that seems to count every "/". I've tried the same thing without the delimiter: "\"
Any ideas? Even a pointer in the right direction would help; my Google searches didn't turn up anything useful.

Better to use negative lookahead here. For your case code like this will work:
String str = "Line1\n" +
"/Line2\n" +
"//Line3\n" +
"Line4\n" +
" // Line5\n" +
"Line6\n";
Pattern pt = Pattern.compile("^(?!\\s*//)", Pattern.MULTILINE);
Matcher matcher = pt.matcher(str);
int c=0;
while (matcher.find()) c++;
System.out.println("# of lines: " + c);
Output
# of lines: 4
(?!\\s*//) is a negative lookahead: it only allows a match if the line does not start with zero or more whitespace characters followed by //.
As you can see, 2 of the lines above start with the comment marker //, so they are not counted.
It is also important to use the Pattern.MULTILINE flag here, so that ^ matches at the start of every line rather than only at the start of the input.

Simplified regex, will improve if you need more:
^//
Anything in [] is a character class, which means "match one of the symbols within it". A ^ as the first character inside the brackets inverts the class; that is not the same as a ^ outside the brackets, which matches the beginning of the string (or of a line, in multiline mode).
You can also do something like:
^[^/][^/].*
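to match lines not starting with //. Note that this is stricter than it looks: it also rejects any line whose first or second character is a /, and any line shorter than two characters.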

There are better ways and other tools to count lines of code, for instance a test coverage tool or (I don't know if this still works with the newest version): http://metrics.sourceforge.net/
If you just ignore //, then you will still count the package and import declarations along with multi-line comments such as:
/**
* Javadoc
*/
or brackets that sit alone on lines like:
while(...)
{
...
}
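To illustrate the point, here is a minimal Java sketch of a line counter that also skips the cases listed above. It is only an illustration, not a replacement for a real metrics tool: the class and method names are made up, and it assumes that block comments start at the beginning of a line and that nothing of interest follows a closing */ on the same line.

```java
class CodeLineCounter {
    // Counts lines that are not blank, not comments, not lone braces,
    // and not package/import declarations (simplifying assumptions apply).
    static int countCodeLines(String source) {
        int count = 0;
        boolean inBlockComment = false;
        for (String raw : source.split("\\R")) {
            String line = raw.trim();
            if (inBlockComment) {
                // Assumption: nothing countable follows the closing */
                if (line.contains("*/")) inBlockComment = false;
                continue;
            }
            if (line.isEmpty() || line.startsWith("//")) continue;
            if (line.startsWith("/*")) {
                // Stay in comment mode unless the block closes on this line.
                if (line.indexOf("*/", 2) < 0) inBlockComment = true;
                continue;
            }
            if (line.equals("{") || line.equals("}")) continue;
            if (line.startsWith("package ") || line.startsWith("import ")) continue;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String src = "package demo;\nimport java.util.List;\n/**\n * Javadoc\n */\n"
                + "while (x)\n{\n    y(); // trailing\n}\n// done\n";
        System.out.println(countCodeLines(src)); // prints 2
    }
}
```

String literals that contain comment markers are not handled here; a real tool would need a proper lexer for that.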

I'm not familiar with eclipse, but if it has lookahead this should do it:
\n(?!//)

^\s*(?!//)\S.*$
Highlights any line that starts ("^") with any number of whitespace characters ("\s*"), then does not start a comment ("(?!//)"), then has a non-space character ("\S"), then has any number of any characters until the end of the line (".*$").
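As a rough check, this pattern can be tried in Java against the sample text from the first answer. This is only a sketch: the class and method names are invented for the example, and Pattern.MULTILINE is assumed so that ^ and $ work per line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCommentLineCount {
    // Matches a non-blank line whose first non-whitespace text is not "//".
    private static final Pattern CODE_LINE =
            Pattern.compile("^\\s*(?!//)\\S.*$", Pattern.MULTILINE);

    static int count(String text) {
        Matcher m = CODE_LINE.matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String str = "Line1\n/Line2\n//Line3\nLine4\n // Line5\nLine6\n";
        System.out.println("# of lines: " + count(str)); // prints "# of lines: 4"
    }
}
```

Because \s also matches line breaks, a blank line is absorbed into the match for the code line that follows it, so blank lines are not counted separately.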

Related

Regex to replace block comment with line comment

There are tons of examples to do the conversion from C-style line comment to 1-line block comment. But I need to do the opposite: find a regex to replace multi-line block comment with line comments.
From:
This text must not be touched
/*
This
is
random
text
*/
This text must not be touched
To
This text must not be touched
// This
// is
// random
// text
This text must not be touched
I was thinking if there's a way to represent "each line" concept in regex, then just add // in front of each line. Something like
\/\*\n(?:(.+)\n)+\*\/ -> // $1
But because a repeated capture group only keeps its last iteration, $1 just matches the last line before */. I know Perl and other languages have some advanced regex features like recursion, but I need to do this in a standard engine. Is there any trick to accomplish this?
EDIT: To clarify, I'm looking for pure regex solution, not involving any programming language. Should be testable on sites like https://regex101.com/.
If you are interested in a single regex pass in the modern JavaScript engine (and other regex engines supporting infinite length patterns in lookbehinds), you can use
/(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n)(?=[\s\S]*?^\*\/)|(?:\r?\n)?(?:^\/\*|^\*\/)/gm
Replace with $1$1, see the regex demo.
Details
(?<=^(\/)\*(?:(?!^\/\*)[\s\S])*?\r?\n) - a positive lookbehind that matches a location that is immediately preceded with
^(\/)\* - /* substring at the start of a line (with / captured into Group 1)
(?:(?!^\/\*)[\s\S])*? - any char, zero or more occurrences, as few as possible, not starting a /* char sequence that appears at the start of a line
\r?\n - a CRLF or LF ending
(?=[\s\S]*?^\*\/) - a positive lookahead that requires any 0 or more chars as few as possible followed with */ at the start of a line, immediately to the right of the current location
| - or
(?:\r?\n)? - an optional CRLF or LF linebreak
(?:^\/\*|^\*\/) - and then either /* or */ at the start of a line.
As usual in such cases, two regular expressions—the second applied to the matches of the first—can do what one cannot achieve.
const txt = `This text must not be touched
/*
This
is
random
text
*/
This text must not be touched`;
const to1line = str => str.replace(
/\/\*\s*(.*?)\s*\*\//gs,
(_, comment) => comment.replace(/^/mg, '// ')
);
console.log( to1line( txt ));
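For readers working on the JVM, the same two-pass idea could be sketched in Java roughly as follows. This assumes Java 9 or later (for Matcher.replaceAll with a function argument); the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockToLineComments {
    // First pass: locate each /* ... */ block and capture its body without
    // the surrounding whitespace. DOTALL lets "." cross line breaks.
    private static final Pattern BLOCK =
            Pattern.compile("/\\*\\s*(.*?)\\s*\\*/", Pattern.DOTALL);

    static String convert(String src) {
        // Second pass, applied to each captured body: prefix every line with "// ".
        // quoteReplacement keeps any $ or \ in the comment text literal.
        return BLOCK.matcher(src).replaceAll(mr ->
                Matcher.quoteReplacement(mr.group(1).replaceAll("(?m)^", "// ")));
    }

    public static void main(String[] args) {
        String txt = "This text must not be touched\n/*\nThis\nis\nrandom\ntext\n*/\n"
                + "This text must not be touched";
        System.out.println(convert(txt));
    }
}
```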

Look for any character that surrounds one of any character including itself

I am trying to write a regex code to find all examples of any character that surrounds one of any character including itself in the string below:
b9fgh9f1;2w;111b2b35hw3w3ww55
So ‘b2b’ and ‘111’ would be valid, but ‘3ww5’ would not be.
Could someone please help me out here?
Thanks,
Nikhil
You can use this regex, which matches three characters where the first and third are the same (via a backreference) and the middle one can be anything:
(.).\1
Demo
Edit:
The regex above only gives you non-overlapping matches. Since you also want the overlapping ones, you can use this positive-lookahead-based regex instead. It does not consume the next two characters; it captures them in group 2. To build your desired output, concatenate group 1 and group 2.
(.)(?=(.\1))
Demo with overlapping matches
Here is some Java code (I've never programmed in Ruby) demonstrating the idea; you can write the same logic in your favorite programming language.
String s = "b9fgh9f1;2w;111b2b35hw3w3ww55";
Pattern p = Pattern.compile("(.)(?=(.\\1))");
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(m.group(1) + m.group(2));
}
Prints all your intended matches,
111
b2b
w3w
3w3
w3w
Also, here is some Python code that may help if you know Python,
import re
s = 'b9fgh9f1;2w;111b2b35hw3w3ww55'
# Each result is a (group1, group2) tuple; join them to get the 3-character hit.
for m in re.findall(r'(.)(?=(.\1))', s):
    print(m[0] + m[1])
Prints all your expected matches,
111
b2b
w3w
3w3
w3w

i need help in regex

So I have some (Matlab) code, and one of the lines doesn't have a (;) at the end.
I want to find that line.
For a start:
sad= sdfsdf ; %this is comment
sad = awaww ;
n= sdfdsfd ;
m = (asd + adsf(asd,asd)) %this is comment
lets say i want to find the 4th line because it doesnt have (;) at the end of line ..
so far im stuck at this :
/(^[-a-zA-Z0-9]+\s*=[-a-zA-Z0-9#:%,_\+.()~#?&//= ]+)(?!;)$/gim
so this will work fine.. it will find the fourth line only
but what if i wanted (;) in middle of the line but not at end or before the comment .. ?
w=sss (;)aaa **;** % i dont want this line to be selected
w=sss (;)aaa %i want this line to be selected
http://regexr.com/3cfor
Well, let's find all lines which end with a semicolon:
^.+?;
optionally followed by horizontal whitespace:
^.+?;[ \t]*
and an optional comment:
^.+?;[ \t]*(?:%.*)?
This expression easily matches all the lines you don't want. So, invert it:
^(?!.+?;[ \t]*(?:%.*)?$).+
Unfortunately, that's too easy. It fails to match lines which contain a semicolon in a comment. We could replace .+? with [^%\r\n]+? but this would fail on lines containing a % in a string.
If you need a more robust pattern, you'll have to account for all of this.
So let's start the same way, by defining what a "correct" line should look like. I'll use the PCRE syntax for atomic grouping, so you'll have to use perl = TRUE.
A string is: '(?>[^']+|'')*'
Other code (except string, comments and semicolons) is covered by: [^%';\r\n]+
So "normal" code is:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?
Then, we add the required semicolon and optional comment:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$
Finally, we invert all of this:
^(?!(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$).+
And we have the final pattern. Demo.
You don't need to fully tokenize the input, you only have to recognize the different "lexer modes". I hope handling strings and comments is enough, but I didn't check the Matlab syntax thoroughly.
You could use this with other regex engines that do not support atomic groups by replacing (?> with (?: but you'll expose yourself to the catastrophic backtracking problem.
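As a quick way to try the final pattern, here is a sketch in Java, whose regex engine also supports atomic groups. The class and method names are invented, Pattern.MULTILINE is assumed so that ^ and $ apply per line, and the sample lines are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MissingSemicolonFinder {
    // The final pattern from the answer, with backslashes escaped for Java.
    private static final Pattern UNTERMINATED = Pattern.compile(
            "^(?!(?>[^%';\\r\\n]+|'(?>[^']+|'')*'|;)+?;[ \\t]*(?:%.*)?$).+",
            Pattern.MULTILINE);

    // Returns every line that lacks a terminating semicolon before its comment.
    static List<String> find(String source) {
        List<String> hits = new ArrayList<>();
        Matcher m = UNTERMINATED.matcher(source);
        while (m.find()) hits.add(m.group());
        return hits;
    }

    public static void main(String[] args) {
        String code = String.join("\n",
                "sad= sdfsdf ; %this is comment",
                "sad = awaww ;",
                "n= sdfdsfd ;",
                "m = (asd + adsf(asd,asd)) %this is comment",
                "w=sss (;)aaa ; % i dont want this line to be selected",
                "w=sss (;)aaa %i want this line to be selected");
        // Prints the 4th and 6th lines only.
        find(code).forEach(System.out::println);
    }
}
```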

Regexp x.get(y) -> x[y]

While porting many lines of code from one language to another, I must replace all array accesses written as a function call, x.get(y), with the square-bracket notation x[y]. There are a few text editors around that can do regular-expression-based replacement.
What should be typed in the "text to find" field and what should be typed in the "replace with" field in this situation? Both x and y can vary, so the original code can have lines like:
... state.get(1);
... text.get(i);
... result.get(line);
after conversion:
... state[1];
... text[i];
... result[line];
You can search for \.get\((\w+)\) and replace with [$1].
The above pattern assumes only alphanumeric characters between the parentheses, but there are other alternatives:
.* (without checking ". matches newline") should match until the end of the line.
[^)]* should match characters that are not ). Would work for new lines.
In both cases, you may want to include the ; in your pattern.
Note that this is very fragile either way - you might encounter code like state.get(a.get(3 + sin(6))), and probably get incorrect results.
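The same search and replace can be tried outside an editor. Below is a small Java sketch (the class and method names are made up) that applies the \w+ variant and also shows how a nested call comes out wrong; the nested input is a simplified variant of the example above, chosen for illustration.

```java
class GetToBrackets {
    static String convert(String line) {
        // \.get\((\w+)\) captures a simple word argument; $1 re-inserts it.
        return line.replaceAll("\\.get\\((\\w+)\\)", "[$1]");
    }

    public static void main(String[] args) {
        System.out.println(convert("... state.get(1);"));     // ... state[1];
        System.out.println(convert("... result.get(line);")); // ... result[line];
        // Nested call: only the inner get is rewritten, leaving mixed notation.
        System.out.println(convert("state.get(a.get(3))"));   // state.get(a[3])
    }
}
```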
For Notepad++, I would write in Find what: ([0-9a-zA-Z_-]+)\.get\(([0-9a-zA-Z_-]+)\)
replace with \1[\2]
Input:
x.get(1);
text.get(i);
result.get(line);
Output:
x[1];
text[i];
result[line];

Regex to match all lines starting with a specific string

I have this very long cfg file, where I need to find the latest occurrence of a line starting with a specific string. An example of the cfg file:
...
# format: - search.index.[number] = [search field]:element.qualifier
...
search.index.1 = author:dc.contributor.*
...
search.index.12 = language:dc.language.iso
...
jspui.search.index.display.1 = ANY
...
I need to be able to get the last occurrence of the line starting with search.index.[number] , more specific: I need that number. For the above snippet, that number would be 12.
As you can see, there are other lines too containing that pattern, but I do not want to match those.
I'm using Groovy as a programming/scripting language.
Any help is appreciated!
Have you tried:
def m = lines =~ /(?m)^search\.index\.(\d+)/
m[ -1 ][ 1 ]
Try this as your expression (in multiline mode, so that ^ matches at the start of each line):
(?m)^search\.index\.(\d+)
And then with Groovy you can get the number from the last match with:
matcher[-1][1]
Here is an explanation page.
I don't think you should go for it but...
If you can do a multi-line search (anyway you have to here), the only way would be to read the file backward. So first, eat everything with a .* (om nom nom)(if you can make the dot match all, (?:.|\s)* if you can't). Now match your pattern search\.index\.(\d+). And you want to match this pattern at the beginning of a line: (?:^|\n) (hoping you're not using some crazy format that doesn't use \n as new line character).
So...
(?:.|\s)*(?:^|\n)search\.index\.(\d+)
The number should be in the 1st matching group. (Test in JavaScript)
PS: I don't know groovy, so sorry if it's totally not appropriate.
Edit:
This should also work:
search\.index\.(\d+)(?!(?:.|\s)*?(?:^|\n)search\.index\.\d+)
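Since Groovy runs on the JVM, the "keep the last match" approach can also be sketched in plain Java. The class and method names below are invented for the example, and the sample text is the snippet from the question with the elided lines left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSearchIndex {
    // MULTILINE makes ^ match at the start of every line, so lines such as
    // "jspui.search.index.display.1" are not picked up.
    private static final Pattern INDEX_LINE =
            Pattern.compile("^search\\.index\\.(\\d+)", Pattern.MULTILINE);

    // Returns the number from the last matching line, or -1 if none matches.
    static int lastIndex(String cfg) {
        Matcher m = INDEX_LINE.matcher(cfg);
        int last = -1;
        while (m.find()) last = Integer.parseInt(m.group(1));
        return last;
    }

    public static void main(String[] args) {
        String cfg = "# format: - search.index.[number] = [search field]:element.qualifier\n"
                + "search.index.1 = author:dc.contributor.*\n"
                + "search.index.12 = language:dc.language.iso\n"
                + "jspui.search.index.display.1 = ANY\n";
        System.out.println(lastIndex(cfg)); // prints 12
    }
}
```

Iterating over all matches and keeping the last one avoids the heavy backtracking that the single-regex variants with (?:.|\s)* can cause on a very long file.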