Hello,
Background:
I'm using Checkstyle 4.4.2 with a RegExp checker module to detect when the file name in out java source headers do not match the file name of the class or interface in which they reside. This can happen when a developer copies a header from one class to another and does not modify the "File:" tag.
The regular expression use in the RexExp checker has been through many incarnations and (though it is possibly overkill at this point) looks like this:
File: (\w+)\.java\n(?:.*\n)*?(?:[\w|\s]*?(?: class | interface )\1)
The basic form of files I am checking (though greatly simplified) looks like this
/*
*
* Copyright 2009
* ...
* File: Bar.java
* ...
*/
package foo
...
import ..
...
/**
* ...
*/
public class Bar
{...}
The Problem:
When no match is found, (i.e. when a header containing "File: Bar.java" is copied into file Bat.java ) I receive a StackOverflowError on very long files (my test case is #1300 lines).
I have experimented with several visual regular expression testers and can see that in the non-matching case when the regex engine passes the line containing the class or interface name it starts searching again on the next line and does some backtracking which probably causes the StackOverflowError
The Question:
How to prevent the StackOverflowError by modifying the regular expression
Is there some way to modify my regular expression such that in the non-matching case (i.e. when a header containing "File: Bar.java" is copied into file Bat.java ) that the matching would stop once it examines the line containing the interface or class name and sees that "\1" does not match the first group.
Alternatively if that can be done, Is is possible minimize the searching and matching that takes place after it examines the line containing the interface or class thus minimizing processing and (hopefully) the StackOverflow error?
Try
File: (\w+)\.java\n.*^[\w \t]+(?:class|interface) \1
in dot-matches-all mode. Rationale:
[\w\s] (the | doesn't belong there) matches anything, including line breaks. This results in a lot of backtracking back up into the lines that the previous part of the regex had matched.
If you let the greedy dot gobble up everything up to the end of the file (quick) and then backtrack until you find a line that starts with words or spaces/tabs (but no newlines) and then class or interface and \1, then that doesn't require as much stack space.
A different, and probably even better solution would be to split the problem into parts.
First match the File: (\w+)\.java part. Then do a second search with ^[\w \t]+(?:class|interface) plus the \1 match from the first search on the same file.
Follow up:
I plugged in Tim Pietzcher's suggestion above and his greedy solution did indeed fail faster and without a StackOverflowError when no match was found. However, in the positive case, the StackOverflowError still occurred.
I took a look at the source code RegexpCheck.java. The classes pattern is constructed in multiline mode such that the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. Then it reads the entire class file into a string and does a recursive search for the pattern(see findMatch()). That is undoubtedly the source of the StackOverflowException.
In the end I didn't get it to work (and gave up) Since Maven 2 released the maven-checkstyle-plugin-2.4/Checkstyle 5.0 about 6 weeks ago we've decided to upgrade our tools. This may not solve the StackOverflowError problem, but it will give me something else to work on until someone decides that we need to pursue this again.
Related
I am trying to search a text file that will return a result if more than one word is found in that line. I don't see this explained in the documentation and I have tried various loops with no success.
What I would like to do is something similar to this:
$read(name.txt, s, word1|word2|word3)
or even something like this:
$read(name.txt, w, word1*|*word2*|*word3)
I don't know RegEx that well so I'm assuming this can be done with that but I don't know how to do that.
The documentation in the client self is good but I also recommend this site: http://en.wikichip.org/wiki/mirc. And with your problem there is a nice article : http://en.wikichip.org/wiki/mirc/text_files
All the info is taken from there. So credits to wikichip.
alias testForString {
while ($read(file.txt, nw, *test*, $calc($readn + 1))) {
var %line = $v1
; you can add your own words in the regex, seperate them with a pipe (|)
noop $regex(%line,/(word1|word2|word3|test)/))
echo -a Amount of results: $regml(0)
}
}
$readn is an identifier that returns the line that $read() matched. It is used to start searching for the pattern on the next line. Which is in this case test.
In the code above, $readn starts at 0. We use $calc() to start at line 1. Every match $read() will start searching on the next line. When no more matches are after the line specified $read will return $null - terminating the loop.
The w switch is used to use a wildcard in your search
The n switch prevents evaluating the text it reads as if it was mSL code. In almost EVERY case you must use the n switch. Except if you really need it. Improper use of the $read() identifier without the 'n' switch could leave your script highly vulnerable.
The result is stored in a variable named %line to use it later in case you need it.
After that we use a noop to execute a regex to match your needs. In this case you can use $regml(0) to find the amount of matches which are specified in your regex search. Using an if-statement you can see if there are two or more matches.
Hope you find this helpful, if there's anything unclear, I will try to explain it better.
EDIT
#cp022
I can't comment, so I'll post my comment here, so how does that help in any way to read content from a text file?
I have a text file in Notepad++ that contains about 66,000 words all in 1 line, and it is a set of 200 "lines" of output that are all unique and placed in 1 line in the basic JSON form {output:[{output1},{output2},...}]}.
There is a set of characters matching the RegEx expression "id":.........,"kind":"track" that occurs about 285 times in total, and I am trying to either single them out, or copy all of them at once.
Basically, without some super complicated RegEx terms, I am stuck because I can't figure out how to highlight all of them at once, and also the Remove Unbookmarked Lines feature does not apply because this is all in one line. I have only managed to be able to Mark every single occurrence.
So does this require a large number of steps to get the file into multiple lines and work from there, or is there something else I am missing?
Edit: I have come up with a set of Macro schemes that make the process of doing this manually work much faster. It's another alternative but still takes a few steps and quite some time.
Edit 2: I intended there to be an answer for actually just highlighting the different sections all at once, but I guess that it not possible. The answer here turns out to be more useful in my case, allowing me to have a list of IDs without everything else.
You seem to already have a regex which matches single instances of your pattern, so assuming it works and that we must use Notepad++ for this:
Replace .*?("id":.........,"kind":"track").*?(?="id".........,"kind":"track"|$) with \1.
If this textfile is valid JSON, this opens you up to other, non-notepad++ options, like using Python with the json module.
Edited to remove unnecessary steps
I am using the maven replacer plugin and I've run into a situation where I have a regular expression that matches across lines which I need to run on the input file until all matches have been replaced. The configuration for this expression looks like this:
<regexFlags>
<regexFlag>DOTALL</regexFlag>
</regexFlags>
<replacements>
<replacement>
<token>\#([^\n\r=\#]+)\#=([^\n\r]*)(.*)(\#default\.\1\#=[^\n\r]*)(.*)</token>
<value>#$1#=$2$3$5</value>
<replacement>
<replacements>
The input could look like this:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#default.a.b.c#=QQQ
#asdfasd.fasdfs.asdfa#=23423
#default.h.i.j#=234
#default.RR.TT#=393993
and I want the output to look like this:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#asdfasd.fasdfs.asdfa#=23423
#default.RR.TT#=393993
The intention is to re-write the file, but without the tokens with a #default prefix, where another token without the prefix has already been defined.
#default.a.b.c#=QQQ and #default.h.i.j#=234 have been removed from the output because other tokens already contains a.b.c and h.i.j.
The current problem I have is that the replacer plugin only replaces the first match, so my output looks like this:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#asdfasd.fasdfs.asdfa#=23423
#default.h.i.j#=234
#default.RR.TT#=393993
Here, #default.a.b.c=QQQ is gone, which is correct, but #default.h.i.j#=234 is still present.
If I were writing this in code, I think I could probably just loop while attempting to match on the entire output, and break when there are no matches. Is there a way to do this with the replacer plugin?
Edit: I may have over simplified my example. A more realistic one is:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#default.a.b.c#=QQQ
#asdfasd.fasdfs.asdfa#=23423
#default.h.i.j#=234
#default.RR.TT#=393993
#x.y.z#=0
#default.q.r.s#=1
#l.m.n#=8.3
#q.r.s#=78
#blah.blah.blah#=blah
This shows that it's possible for a default.x.x.x=y to precede a x.x.x=y token (as #default.q.r.s#=1 preceedes #q.r.s#=78`), my prior example wasn't clear about this. I do actually have an expression to capture this, it looks a bit like this:
\#default\.([^\n\r=#|]+)#=([^\n\r|]*)(.*)#\1#=([^\n\r|]*)(.*)
I know line separators are missing from this even though they were in the other one - I was experimenting with removing all line separators and treating it as a single line but that hasn't helped. I can resolve this problem simply by running each replacement multiple times by copying and pasting the configurations a few times, but that is not a good solution and will fail eventually.
I don't believe you could solve this problem as is, a work-around is to reverse the order of the file top to bottom, perform lookahead regex and then reverse the result order
pattern = #default\.(.*?)#[^\r\n]+(?=[\s\S]*#\1#) Demo
another way (depending on the capabilities of "Maven") is to run this pattern
#(.*)(#[\s\S]*)#default\.\1.*
and replace with #$1$2 Demo in a loop until there are no matches
then run this pattern
#default\.(.*)#.*(?=[\s\S]*\1)
and replace with nothing Demo in a loop until there are no matches
It doesn't look like the replacer plugin can actually do what I want. I got around this by using regular expressions to build multiple filter files, and then applying them to the resource files.
My original goal had been to use regular expressions to build a single, clean, and tidy filter file. In the end, I discovered that I was able to get away with just using multiple filters (not as clean or tidy) and apply them in the correct order.
Given the following file path:
/Users/Lawrence/MyProject/some/very/interesting/Code.scala
I would like to generate the following using a single regex replace (the root can be a constant):
some.very.interesting
This is for the purpose of generating a snippet for Sublime Text which can automatically insert the correct package/namespace header for my scala/java classes :)
Sublime Text uses the following syntax for their regex replace patterns (aka 'substitutions'):
{input/regex/replace/flags}
Hence why an iterative approach cannot be taken - it has to be done in one pass! Also, substitutions cannot be nested :(
If you know the maximum number of nested folders.You can specify that in your regex.
For 1 to 3 nested folders
Regex:/Users/Lawrence/MyProject/(\w+)/?(\w+)?/?(\w+)?/[^/]+$
Replace:$1.$2.$3
For 1 to 5 nested folders
Regex:/Users/Lawrence/MyProject/(\w+)/?(\w+)?/?(\w+)?/?(\w+)?/?(\w+)?/[^/]+$
Replace:$1.$2.$3.$4.$5
Given the constraints this is only thing you can do
Input
/Users/Lawrence/MyProject/some/very/interesting/Code.scala
Regex
^/Users/Lawrence/MyProject/[^/]+/[^/]+/[^/]+/Code.scala
or
^/[^/]+/[^/]+/[^/]+/([^/]+)/([^/]+)/([^/]+)/
Replace
\1.\2.\3
Update
This gets you closer, but not exactly it:
Regex
(^/Users/Lawrence/MyProject/|/Code\.scala$|/)
Replacement
.
Output would be:
.some.very.interesting.
Without multiple replacements in a single line and without recursive back references it's going to be hard.
You might have to do a second replacement, replacing something like this with an empty string (if you can):
(^\.|\.$)
This is my first post on stackoverflow, so please be gentle with me...
I am still learning regex - mostly because I have finally discovered how useful they can be and this is in part through using Sublime Text 2. So this is Perl regex (I believe)
I have done searching on this and other sites but I am now genuinely stuck. Maybe I am trying to do something that can't be done
I would like to find a regex (pattern) that will let me find the function or method or procedure etc that contains a given variable or method call.
I have tried a number of expressions and they seem to get part of the way but not all the way. Particularly when searching in Javascript I pick up multiple function declarations instead of the one nearest to the call/variable that I am looking for.
for example:
I am looking for the function that calls the method save data()
I have learnt, from this excellent site that I can use (?s) to switch . to include newlines
function.*(?=(?s).*?savedata\(\))
however, that will find the first instance of the word function and then all the text unto and including savedata()
if there are multiple procedures then it will start at the next function and repeat until it gets to savedata() again
function(?s).*?savedata\(\) does something similar
I have tried asking it to ignore the second function (I believe) by using something like:
function(?s).*?(?:(?!function).*?)*savedata\(\)
But that doesn't work.
I have done some investigation with look forwards and look backwards but either I am doing it wrong (highly possible) or they are not the right thing.
In summary (I guess), how do I go backwards, from a given word to the nearest occurrence of a different word.
At the moment I am using this to search through some javascript files to try and understand the structure/calls etc but ultimately I am hoping to use on c# files and some vb.net files
Many thanks in advance
Thanks for the swift responses and sorry for not added an example block of code - which I will do now (modified but still sufficient to show the issue)
if I have a simple block of javascript like the following:
function a_CellClickHandler(gridName, cellId, button){
var stuffhappenshere;
var and here;
if(something or other){
if (anothertest) {
event.returnValue=false;
event.cancelBubble=true;
return true;
}
else{
event.returnValue=false;
event.cancelBubble=true;
return true;
}
}
}
function a_DblClickHandler(gridName, cellId){
var userRow = rowfromsomewhere;
var userCell = cellfromsomewhereelse;
//this will need to save the local data before allowing any inserts to ensure that they are inserted in the correct place
if (checkforarangeofthings){
if (differenttest) {
InsSeqNum = insertnumbervalue;
InsRowID = arow.getValue()
blnWasInsert = true;
blnWasDoubleClick = true;
SaveData();
}
}
}
running the regex against this - including the second one that was identified as should be working Sublime Text 2 will select everything from the first function through to SaveData()
I would like to be able to get to just the dblClickHandler in this case - not both.
Hopefully this code snippet will add some clarity and sorry for not posting originally as I hoped a standard code file would suffice.
This regex will find every Javascript function containing the SaveData method:
(?<=[\r\n])([\t ]*+)function[^\r\n]*+[\r\n]++(?:(?!\1\})[^\r\n]*+[\r\n]++)*?[^\r\n]*?\bSaveData\(\)
It will match all the lines in the function up to, and including, the first line containing the SaveData method.
Caveat:
The source code must have well-formed indentation for this to work, as the regex uses matching indentations to detect the end of functions.
Will not match a function if it starts on the first line of the file.
Explanation:
(?<=[\r\n]) Start at the beginning of a line
([\t ]*+) Capture the indentation of that line in Capture Group 1
function[^\r\n]*+[\r\n]++ Match the rest of the declaration line of the function
(?:(?!\1\})[^\r\n]*+[\r\n]++)*? Match more lines (lazily) which are not the last line of the function, until:
[^\r\n]*?\bSaveData\(\) Match the first line of the function containing the SaveData method call
Note: The *+ and ++ are possessive quantifiers, only used to speed up execution.
EDIT:
Fixed two minor problems with the regex.
EDIT:
Fixed another minor problem with the regex.