Please suggest a tool that could automate replacements like:
Mutex staticMutex = Mutex(m_StaticMutex.Handle());
staticMutex.Wait();
to
boost::unique_lock<boost::mutex> lock(m_StaticMutex);
As you see, the arguments must be taken into account. Is there a way simpler than regular expressions?
If you can do this with a modest amount of manual work (even including "search and replace") then this answer isn't relevant.
If the code varies too much (indentation, comments, different variable names) and there are a lot of these, you might need a Program Transformation tool. Such tools tend to operate on program representations such as abstract syntax trees, and consequently are not bothered by layout or whitespace, or even by numbers that are spelled differently because of radix but have the same value.
Our DMS Software Reengineering Toolkit is one of these, and has a C++ Front End.
You'd need to give it a rewrite rule something like the following:
domain Cpp; -- tell DMS to use the C++ front end for parsing and prettyprinting
rule replace_mutex(i:IDENTIFIER):statements -> statements
"Mutex \i = Mutex(m_StaticMutex.Handle());
\i.Wait();" =>
"boost::unique_lock<boost::mutex> lock(m_StaticMutex);";
The use of the metavariable \i in both places will ensure that the rule only fires if the name is exactly the same in both places.
It isn't clear to me precisely what you are trying to accomplish; it sort of looks like you want to replace each private mutex with one global one, but I'm not a boost expert. If you tried to do that, I'd expect your program to behave differently.
If those lines appear frequently in your code, similarly formatted and just with different variable names, but not "too" frequently (fewer than 200-300 times), I would suggest you use an editor with record-replay capabilities (for example, Visual Studio under Windows). Record the steps to replace the two lines with the new one (but keep the variable name). Then repeat "search for Mutex" - "replay macro" as often as you need.
Of course, this specific case should also be solvable for all occurrences at once by any text editor with good "Find and Replace in Files" capabilities.
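For example, with an editor whose Find-and-Replace supports multi-line regular expressions, a capture group can carry the variable-name constraint across both lines, much as the DMS metavariable does above (a hypothetical pattern; adjust the whitespace handling to your editor's regex dialect):

Find:    Mutex (\w+) = Mutex\(m_StaticMutex\.Handle\(\)\);\s*\1\.Wait\(\);
Replace: boost::unique_lock<boost::mutex> lock(m_StaticMutex);

The backreference \1 makes the match fail whenever the two lines name different variables.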
I am looking to create a system that extracts blocks of code from C++ files. For example, if I wanted to extract every while loop, I would look for a pattern that begins with while and ends with }. The problem with that specific example is that while loops may contain other scope blocks, so I'd need to:
Find the string while - regex can easily do this
match braces starting with the open brace after the while and ending with its matching brace
Also match while loops that contain a single line and no braces
Handle as many special cases as possible, such as while loops appearing in comments, etc., as per #Cid's suggestion
I can do this with a parser and a lot of code, but I was wondering if anything existed that perhaps extends regex to this sort of document level query?
There are parser libraries and tools, even free open-source ones. Clang has one, for example. So does GCC. There are others.
It's a lot of code because C++ is hard to parse. But if someone else has written the code and it works, that's not a problem. The usual difficulty with using these products is finding good documentation, but you can always try asking specific questions here.
But just doing a lexical analysis of C++ is less difficult, and it would be sufficient for a crude analysis of program structure if you don't care that it will fail on corner cases. If you start with preprocessed code (or make the dubious assumption that preprocessing doesn't change the program structure) and don't worry about identifying template brackets (in particular, distinguishing between the right-shift operator and two consecutive closing angle brackets), you should be able to build a lexical analyzer from a reasonably short scanner-generator specification.
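To make the crude lexical approach concrete, here is a minimal C++ sketch (the function name is mine). It assumes preprocessed, brace-delimited while bodies; it skips string and character literals so braces inside them don't confuse the count, but it deliberately ignores comments, braceless bodies, and the question of whether the first { actually belongs to the while:

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> extract_while_blocks(const std::string& src) {
    std::vector<std::string> blocks;
    for (std::size_t i = src.find("while"); i != std::string::npos;
         i = src.find("while", i + 1)) {
        // Reject identifiers like "meanwhile" that merely contain the keyword.
        if (i > 0 && (std::isalnum(static_cast<unsigned char>(src[i - 1])) ||
                      src[i - 1] == '_'))
            continue;
        std::size_t open = src.find('{', i);  // assume this brace opens the body
        if (open == std::string::npos)
            break;
        int depth = 0;
        for (std::size_t j = open; j < src.size(); ++j) {
            char c = src[j];
            if (c == '"' || c == '\'') {          // skip string/char literals
                for (++j; j < src.size() && src[j] != c; ++j)
                    if (src[j] == '\\') ++j;      // honour escape sequences
            } else if (c == '{') {
                ++depth;
            } else if (c == '}' && --depth == 0) {
                blocks.push_back(src.substr(i, j - i + 1));
                break;
            }
        }
    }
    return blocks;
}

A real tool would put a proper tokenizer underneath this, but the sketch shows why plain regex brace-counting goes wrong on ordinary code.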
Is there any reason why the expression
(foo5 (foo4 (foo3 (foo2 (foo1 arg)))))
cannot be replaced with
(foo5 (foo4 (foo3 (foo2 (foo1 arg)-)
or the like, and then expanded back?
I know the lack of reader macros means that you cannot change the syntax, but can this expansion possibly be hardcoded into the Java?
I do this when I hand write code.
Yes, you could do this, even without reader macros (in fact, you can change Clojure's syntax with a bit of hacking).
But the question is, what would it gain you? Would it always expand to top level? But then cutting and pasting code would fail if you moved it to or from the top level. And, of course, all the various tools that operate on Clojure syntax would need to understand it.
Ultimately if you really dislike all the close parens why not use
(-> arg foo1 foo2 foo3 foo4)
instead?
Yes, this could be done, but I'm not sure it is the right solution and there are a number of negatives which will likely outweigh the benefits.
Suggestions like this are often the result of poor coding tools and a 'traditional' conceptual model for writing code. Selecting the right tools and looking at your code from a slightly different perspective will usually eliminate the cause that leads to this type of suggestion.
Most of the non-functional, non-lispy style languages are based around a token and line model of code. You tend to think of the code in terms of lines of tokens, and you tend to edit the code on this basis. There is typically less nesting of expressions, and lines are usually terminated with some marker, such as a semicolon. Likewise, tools such as your editor have features which have evolved to support token- and line-based editing. They are good at it.
The lisp style languages are less focused on lines of tokens. The emphasis here is on list forms. Lines of tokens are replaced with nested lists of symbols - the line is less relevant, and you typically have a lot more nesting of forms. This change means your standard line-oriented tools, like your editor, are less suitable. The typical mental model of the code as lines of tokens is also less useful.
With languages like Clojure, you're better off thinking in terms of list forms and not lines of code. Once you make this transition, you then start looking for tools which also model the code along these lines. For example, you either look for editors specifically designed to work with lists of data rather than lines of data, or you look for editors which have extensions that allow you to work with lists.
Once your editor understands that lists are the fundamental grouping unit, not lines, things like parentheses become largely irrelevant from a code writing/editing perspective. You don't worry about closing parentheses, counting parenthesis nesting levels, etc. This all gets managed by the editor automatically. You don't move by lines, you move by lists; you don't kill/delete a line, you kill a list; you don't cut and copy a block of lines, you cut and copy a list of lists; etc.
The good news is that in many respects, the structure of these list-based code representations is actually easier to manipulate than that of most line-based languages. This is primarily because there is less ambiguity and complexity. There are fewer exceptions to the rules, and the rules are inherently simple. As a consequence, many editors designed for programmers have support for this style of coding, as well as advanced features which are difficult to implement in less structured code.
My suspicion is that your suggestion to add a bit of syntactic sugar to avoid having to type multiple closing parentheses is actually a symptom of not having the right tools to write your code. Once you do, you will almost never need to enter a closing parenthesis or count opening parens to get the nesting right; this will be handled by the editor. Your biggest challenge will be in shifting your mental model to think in terms of lists and lists of lists. The parens will become largely invisible, and you will jump around in your code by list units rather than line units. The change is not easy, and it can take some time to re-train your brain and fingers, but once you do, you will likely be surprised at how quickly you begin to edit and manipulate your code.
If you're an Emacs user, I highly recommend extensions such as paredit and lispy. If you're using some other editor, look for paredit-type extensions. However, as these are extensions, you must also spend some time training yourself to use whatever key bindings the extension uses - there is no point having an extension with great list-based code navigation if you still just arrow around with the arrow keys (unless it is Emacs and you have re-bound those arrow keys to use the paredit navigation bindings).
Is there a way to create a rule for a keyword in your User Defined Language that states that in order for this to be a keyword, it must be at the beginning of the line... or at least be the first word on the line?
Note: I'm the guy who answered the question on SuperUser. The same answer:
I am afraid this is not possible. You can consult the UDL2 documentation to learn about User Defined Language capabilities. It is intentionally restricted in order to stay simple, a compromise between usability for ordinary users and efficiency.
Solution: The only thing I can advise beyond UDL2 is to create your own build of Notepad++. If you get the source, you can see that all built-in language highlighters are implemented procedurally using .lex files. You can create your own, and there you have unlimited highlighting possibilities. Then you need to add color definitions to the existing XML files, a menu item, and the necessary bindings, and you should be done. Hint: the built-in Batch language already highlights the first word on the line, so it may be a good point to start from.
Workaround: if highlighting the first word on a line is sufficient for you, just switch the language to Batch. :)
Another solution: In these cases, user RProgram always suggests that people switch from Notepad++ to the SynWrite editor. Its user-defined languages have much wider capabilities. Maybe this will be the fastest way to get to the desired result without going too deep.
Actually, the built-in 'INI File' language option already highlights the first words ('keys') up to the '=' sign (besides coloring 'section' names), but that's all. It may be useful in some cases but is certainly limited in applicability.
I want to add regular expression search capability to my public web page. Other than HTML encoding the output, do I need to do anything to guard against malicious user input?
Google searches are swamped by people solving the converse problem -- using regular expressions to detect malicious input -- which I'm not interested in. In my scenario, the user input is a regular expression.
I'll be using the Regex library in .NET (C#).
Denial‐of‐Service Concerns
The most common concern with regexes is a denial‐of‐service attack through pathological patterns that go exponential — or even super‐exponential! — and so appear to take forever to solve. These may blow up only on particular input data, but one can generally construct a pattern for which the input doesn't matter.
Which ones these are will depend somewhat on how smart the regex compiler you’re using happens to be, because some of these can be detected during compilation time. Regex compilers that implement recursion usually have a built‐in recursion‐depth counter for checking non‐progression.
Russ Cox’s excellent 2007 paper, Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...), talks about ways that most modern NFAs, which all seem to derive from Henry Spencer’s code, suffer severe performance degradation, but where a Thompson‐style NFA has no such problems.
If you only admit patterns that can be solved by DFAs, you can compile them up as such, and they will run faster, possibly much faster. However, it takes time to do this. The Cox paper mentions this approach and its attendant issues. It all comes down to a classic time–space trade‐off.
With a DFA, you spend more time building it (and allocating more states), whereas with an NFA you spend more time executing it, since it can be in multiple states at the same time, and backtracking can eat your lunch — and your CPU.
Denial‐of‐Service Solutions
Probably the most reasonable way to address these patterns that are on the losing end of a race with the heat‐death of the universe is to wrap them with a timer that effectively places a maximum amount of time allowed for their execution. Usually this will be much, much less than the default timeout that most HTTP servers provide.
There are various ways to implement these, ranging from a simple alarm(N) at the C level, to some sort of try {} block that catches alarm‐type exceptions, all the way to spawning off a new thread that’s specially created with a timing constraint built right into it.
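A minimal C++ sketch of the thread-based variant (the helper name is mine; since a detached thread cannot be killed, a timed-out match keeps burning CPU until it finishes on its own, which is why a separate killable process is the more robust design):

#include <chrono>
#include <future>
#include <optional>
#include <regex>
#include <string>
#include <thread>

// Run an untrusted pattern with a wall-clock budget; empty result = timed out.
std::optional<bool> search_with_timeout(std::string pattern, std::string input,
                                        std::chrono::milliseconds budget) {
    std::packaged_task<bool()> task(
        [p = std::move(pattern), s = std::move(input)] {
            std::regex re(p);                 // throws std::regex_error on a bad pattern
            return std::regex_search(s, re);  // backtracking engine: can go exponential
        });
    auto result = task.get_future();
    std::thread(std::move(task)).detach();    // let it run; we only wait on the future
    if (result.wait_for(budget) == std::future_status::timeout)
        return std::nullopt;                  // budget exceeded: reject the pattern
    return result.get();                      // rethrows regex_error if compilation failed
}

Feeding it a classic pathological pattern such as (a+)+$ against a long run of a's followed by a b should come back empty on typical backtracking implementations (some engines instead abort early with a complexity error, which get() rethrows).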
Code Callouts
In regex languages that admit code callouts, some mechanism for allowing or disallowing these from the string you’re going to compile should be provided. Even if code callouts are only to code in the language you are using, you should restrict them; they don’t have to be able to call external code, although if they can, you’ve got much bigger problems.
For example, in Perl one cannot have code callouts in regexes created from string interpolation (as these would be, since they’re compiled at run time) unless the special lexically‐scoped pragma use re "eval"; is active in the current scope.
That way nobody can sneak in a code callout to run system programs like rm -rf *, for example. Because code callouts are so security‐sensitive, Perl disables them by default on all interpolated strings, and you have to go out of your way to re‐enable them.
User‐Defined \P{roperties}
There remains one more security‐sensitive issue related to Unicode-style properties — like \pM, \p{Pd}, \p{Pattern_Syntax}, or \p{Script=Greek} — that may exist in some regex compilers that support that notation.
The issue is that in some of these, the set of possible properties is user‐extensible. That means you can have custom properties that are actually code callouts to named functions in some particular namespace, like \p{GoodChars} or \p{Class::Good_Characters}. How your language handles those might be worth looking at.
Sandboxing
In Perl, a sandboxed compartment via the Safe module would give control over namespace visibility. Other languages offer similar sandboxing technologies. If such devices are available, you might want to look into them, because they are specifically designed for limited execution of untrusted code.
Adding to tchrist's excellent answer: the same Russ Cox who wrote the "Regular Expression" page has also released code! re2 is a C++ library which guarantees O(length_of_regex) runtime and a configurable memory-use limit. It's used within Google so that you can type a regex into Google Code Search -- meaning that it's been battle-tested.
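Assuming the library is available, guarding a user-supplied pattern with re2 can be as simple as this sketch (the memory cap is an illustrative value):

#include <re2/re2.h>
#include <string>

// Compile the untrusted pattern with a memory cap; re2 rejects patterns it
// cannot handle within budget instead of ever backtracking.
bool safe_search(const std::string& pattern, const std::string& input) {
    RE2::Options opts;
    opts.set_max_mem(1 << 20);    // ~1 MiB budget for the compiled automaton
    opts.set_log_errors(false);   // report problems via ok(), not stderr
    RE2 re(pattern, opts);
    if (!re.ok())                 // invalid or over-budget pattern: just refuse it
        return false;
    return RE2::PartialMatch(input, re);
}

An invalid or over-budget pattern is rejected at compile time via ok(), so there is nothing pathological left to execute.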
Yes.
Regexes can be used to perform DoS attacks.
There is no simple solution.
You'll want to read this paper:
Insecure Context Switching: Inoculating Regular Expressions for Survivability. The paper is more about what can go wrong with regular expression engines (e.g., PCRE), but it may help you understand what you're up against.
You have to not only worry about the matching itself, but how you do the matching. For example, if your input goes through some sort of eval phase or command substitution on its way to the regular expression engine there could be code that gets executed inside the pattern. Or, if your regular expression syntax allows for embedded commands you have to be wary of that, too. Since you didn't specify the language in your question it's hard to say for sure what all the security implications are.
A good way to test your regexes for security issues (at least on Windows) is the SDL Regex Fuzzer tool released by Microsoft recently. It can help you avoid pathologically bad regex constructions.
Note: This is a follow up to this question.
I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example, if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.
The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches, so the strings need to be in memory. Also, there can be 100+ copies of this process running at the same time, so heavy I/O on startup could cause slow I/O for the other copies.
With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).
This program currently runs on Windows with Microsoft compiler but I'd like to keep the solution as cross-platform as possible so I don't think I want to use PE resource files or something.
Mmapping an external file might work, but then I have the issue of keeping the program version and the data version in sync, since one does not normally change without the other. Also, this requires some file "format", which adds a layer of complexity I'd rather not have.
So after all of this preamble, it seems like the best solution is to have a bunch of arrays of strings which I can then iterate over. This seems kind of messy, as I'm mixing code and data heavily, but with the above requirements, is there any better way to handle this sort of situation?
I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).
Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.
Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (Obviously it can't be too big, but it's another option.)
The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.
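As a rough illustration of that precompile idea (all names hypothetical), the generated file could be an ordinary header of const tables. Because the tables are read-only data baked into the binary, the OS maps those pages shared, so the 100+ concurrent instances pay for one physical copy:

#include <cstddef>

// generated_patterns.h -- emitted by the build-time script; never edited by hand.
static const char* const kGroup1[] = { "string1", "string2", "string3" };
static const char* const kGroup2[] = { "other1", "other2" };

struct PatternGroup {
    const char* const* patterns;  // one of the generated arrays above
    std::size_t count;            // number of entries in that array
};

static const PatternGroup kGroups[] = {
    { kGroup1, sizeof kGroup1 / sizeof *kGroup1 },
    { kGroup2, sizeof kGroup2 / sizeof *kGroup2 },
};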
If the strings that need to be matched can be locked down at compile time, you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it, lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is often used in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably avoid the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably generate the regular expressions from the string literals in your existing code easily enough, but it may take a good bit of work to rework your program to use the code that lex generates.
If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.
The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:
"string1"
"string2"
"string3"
and the input:
"string2"
The partial match for "string" would be done just once by a DFA (Deterministic Finite-state Automaton) regular expression system (like lex), which would probably speed up your system. Building these things does require a lot of work on lex's part, but all of the hard work is done up front.
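For reference, a complete flex specification for that three-literal example is tiny (the action code is a placeholder for whatever "do something" means in the real program):

%option noyywrap
%%
"string1"    { puts("matched string1"); }
"string2"    { puts("matched string2"); }
"string3"    { puts("matched string3"); }
.|\n         ;   /* ignore everything else */
%%
int main(void) { return yylex(); }

flex folds all three literals into a single DFA, so the shared prefix "string" is examined only once per input position, exactly as described above.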
Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory-mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try adjusting the working set size to see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (besides, your program needs quota privileges to adjust the working set size).
There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.
The bottom line is that you need to experiment to see what works best, and remember to measure after each change :)...