Is there an application to find identical parts in different files?

I have a legacy HTML website I need to add some features to. Just looking at it, I noticed there are many "common" parts in each HTML file - footer, some script blocks, header, etc. I would like to move all these pieces into separate files (and include them using SSI for now) - that will make the project much easier to understand. However, there are some blocks which look similar but are a bit different (different class names, for example). So a straightforward cut/paste will not work - I will have to carefully examine each piece I remove. And I do not want to do that - there are too many files. I'm wondering if there is an application which can compare a bunch of files and find identical blocks (not necessarily present in ALL files).
Thanks.

You want a clone detector.
Many clone detectors will find only identical lines of code, or identical sequences of tokens. Those won't work for you. You want a clone detector that understands how to detect parameterized clones.
Some of the token-based detectors will find clones with only very minor variations as parameters; e.g., if it is just a class name which is different, these may work for you. Such detectors often produce unstructured sequences of clones; the following is a clone from the perspective of a token-based detector:
} void foo(
and
}
void bar(
To avoid such clones, token detectors generally insist on very long sequences of tokens, which means they can miss modest-size but interesting clones.
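To make the distinction concrete, here is a minimal sketch (not any particular product) of what a purely line-based detector does; the file names and the block length K are placeholders:
// Minimal sketch of a purely line-based clone finder: it reports runs of K
// consecutive lines that appear verbatim in more than one file. Blocks that
// differ only in a class name will NOT be found -- exactly the limitation
// described above. The file names and K are placeholders.
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> files = {"index.html", "about.html"};  // assumed inputs
    const std::size_t K = 5;  // minimum block size, in lines
    std::map<std::string, std::vector<std::string>> seen;  // block text -> files containing it

    for (const auto& name : files) {
        std::ifstream in(name);
        std::vector<std::string> lines;
        for (std::string line; std::getline(in, line); ) lines.push_back(line);
        for (std::size_t i = 0; i + K <= lines.size(); ++i) {
            std::ostringstream block;
            for (std::size_t j = 0; j < K; ++j) block << lines[i + j] << '\n';
            auto& owners = seen[block.str()];
            if (owners.empty() || owners.back() != name) owners.push_back(name);
        }
    }
    for (const auto& [block, owners] : seen)
        if (owners.size() > 1)
            std::cout << "identical " << K << "-line block shared by " << owners.size()
                      << " files:\n" << block << '\n';
}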
Our CloneDR will find parameterized clones in which the parameter may be a complex structure. It does that by parsing the code of interest and comparing the abstract syntax trees, which represent the essential code minus all the layout and whitespace. Where the trees or sequences of trees are different, it can suggest a parameter representing the entire subtree (e.g., expression, HTML tag group, presence/absence of attributes, etc.). Because it operates on trees, it CANNOT propose the kind of clone above. That in turn means it can find modest-size clones that make sense, as well as large ones.
CloneDR operates from precise language descriptions, to produce clones that precisely match language structures. There is a version specifically for HTML.
(I'm the architect; you can see my technical paper on CloneDR at the Wikipedia page.)

Related

C++ parser and formatter using a single grammar declaration

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.
A parser generator (e.g. ANTLR) is able to generate the parser from a BNF grammar.
But is there a way to use the same grammar to generate the formatting code?
I just want to avoid having to manually keep the (generated) parsing code in sync with manually written formatting code, since the grammar is the same.
Could I use the abstract syntax tree?
boost::spirit? Metaprogramming?
Has anyone tried this?
Thanks.
It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out of scope for Stack Overflow), or for algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.
There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.
Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.
In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted and all the additional newlines and indents which must be inserted to conform with the operators, advancing until there is no newline between the fingers.
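To illustrate the flavour of the input Oppen's algorithm consumes, here is a deliberately simplified (greedy, not linear-time two-finger) sketch of rendering such an operator stream; the operator names and the width limit are my own choices, not Oppen's notation:
// Deliberately simplified (greedy) rendering of an Oppen-style operator stream:
// source tokens are interleaved with Begin/End (grouping) and Break (optional
// newline) operators, and the printer decides at each Break whether the next
// run of text still fits on the line. This is NOT Oppen's linear-time
// two-finger algorithm; it only shows the shape of its input.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

enum class Kind { Text, Begin, End, Break };
struct Op { Kind kind; std::string text; };

void print(const std::vector<Op>& ops, std::size_t width) {
    std::size_t col = 0, indent = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        switch (ops[i].kind) {
        case Kind::Begin: indent += 2; break;            // deeper group => deeper indent
        case Kind::End:   indent -= 2; break;
        case Kind::Text:
            std::cout << ops[i].text;
            col += ops[i].text.size();
            break;
        case Kind::Break: {
            std::size_t need = 0;                        // width of the text following this Break
            for (std::size_t j = i + 1; j < ops.size() && ops[j].kind == Kind::Text; ++j)
                need += ops[j].text.size();
            if (col + 1 + need > width) {                // will not fit: break the line
                std::cout << '\n' << std::string(indent, ' ');
                col = indent;
            } else {                                     // fits: emit a single space
                std::cout << ' ';
                ++col;
            }
            break;
        }
        }
    }
    std::cout << '\n';
}

int main() {
    // "if (x) { return compute(a, b); }" annotated with grouping and optional breaks
    std::vector<Op> ops = {
        {Kind::Text, "if (x) {"}, {Kind::Begin, ""}, {Kind::Break, ""},
        {Kind::Text, "return compute(a, b);"}, {Kind::End, ""}, {Kind::Break, ""},
        {Kind::Text, "}"},
    };
    print(ops, 20);  // a narrow width forces the group onto multiple, indented lines
}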
The on-line algorithm might not produce optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)
Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.
It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.
Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism are described by M.G.J. van den Brand in Pretty Printing in the ASF+SDF Meta-environment: Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)
One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:
// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;
Or the common case of trailing comments used to document variables:
/* Tracking the current allocation */
int needed_; // Bytes required.
int current_; // Bytes currently allocated.
// int adjustment_; // (TODO) Why is this needed?
/* Either points to the current allocation, or is 0 */
char* buffer_;
In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.
Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.
OP asks, Has anyone tried this?
Yes. Our DMS Software Reengineering Toolkit can do this: you give it just a grammar, and you get a parser that builds ASTs and a prettyprinter. We've used this on lots of parse/changeAST/unparse tasks for many languages over the last 20 years, preserving the meaning of the source program exactly.
The process is to parse according to the grammar, build an AST, and then walk the AST to carry out prettyprinting operations.
However, you don't get a good prettyprinter. Nice layout of the reformatted source code requires that language cues for block nesting (e.g., matching '{' ... '}', 'BEGIN' ... 'END' pairs, special keywords 'if', 'for', etc.) be used to drive the formatting and indentation. While one can guess what these elements are (as I just did), that's just a guess and in practice a human being needs to inspect the grammar to determine which things are cues and how to format each construct. (The default prettyprinter derived from the grammar makes such guesses).
DMS provides support for that problem in the form of prettyprinter declarations woven into the grammar, giving the formatter engineer quite a lot of control over the layout. (See this SO answer for detailed discussion: https://stackoverflow.com/a/5834775/120163) This produces (in our opinion) pretty good prettyprinters. And DMS does have an explicit grammar/formatter for full C++14. [EDIT Aug 2018: full C++17 in MS and GCC dialects]
EDIT: rici's answer suggests that comments are difficult to handle. He's right, in the sense that you must handle them, and yes, it is hard to handle them if they are removed as whitespace while parsing. The essence of the problem is "removed as whitespace"; it goes away if you don't do that. DMS provides means to capture the comments (rather than ignoring them as whitespace) and attach them (automatically) to AST nodes. The decision as to which AST node captures the comments is handled in the lexer by declaring comments as "pre" (appearing before a token) or "post"; this decision is a heuristic on the part of the grammar/lexer engineer, but it actually works pretty well. The token with comments is passed to the parser, which builds an AST node from it. With comments attached to AST nodes, the prettyprinter can regenerate them, too.
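For illustration only (this is not DMS's actual API), the idea of carrying comments as "pre" and "post" attachments on tokens might be sketched like this:
// Generic illustration (not DMS's actual API) of carrying comments through the
// pipeline: the lexer attaches each comment to a neighbouring token as "pre"
// (before the token) or "post" (after it); the parser copies the attachments
// onto AST nodes, and the prettyprinter re-emits them around the regenerated text.
#include <iostream>
#include <string>
#include <vector>

struct Token {
    std::string text;
    std::vector<std::string> pre;   // comments captured before this token
    std::vector<std::string> post;  // comments captured after this token
};

void emit(const Token& t) {
    for (const auto& c : t.pre) std::cout << c << '\n';
    std::cout << t.text;
    for (const auto& c : t.post) std::cout << "  " << c;
    std::cout << '\n';
}

int main() {
    Token decl{"int needed_;",
               {"/* Tracking the current allocation */"},
               {"// Bytes required."}};
    emit(decl);  // the regenerated declaration keeps both of its comments
}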

Read-mostly data structure to compress and search source code

Suppose you have a lot of source code (like 50GB+) in popular languages (Java, C, C++, etc).
The project needs are:
compressing source code to reduce disk use and disk I/O
indexing it in such a way that a particular source file can be extracted from the compressed source without decompressing the whole thing
compression time for the whole codebase is not important
search and retrieval time (and memory use when searching and retrieving) are important
This SO answer contains potential answers: What are the lesser known but useful data structures?
However, this is just a list of potentials - I do not know how those structures actually evaluate against requirements listed above.
Question: what are data structures (and their implementations) that would perform well according to the aforementioned requirements?
The main data-structure used for searching is the inverted list. Fortunately, you don't need to implement it yourself. Lucene is a widely used search tool which works with inverted lists internally.
Using Lucene you can create a document with multiple fields. The idea is that some of these fields will be searchable with standard keyword-type queries.
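Lucene implements the inverted list for you, but purely as an illustration of what it maintains internally, a hand-rolled index might look roughly like this (all file paths and identifiers here are made up):
// Hand-rolled illustration of the inverted list that Lucene maintains
// internally: each keyword maps to the set of documents (source files) that
// contain it, so a query only touches the postings for its terms instead of
// scanning the whole 50GB corpus.
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct Index {
    std::vector<std::string> paths;                 // docId -> file path
    std::map<std::string, std::set<int>> postings;  // term  -> docIds containing it

    void add(const std::string& path, const std::string& keywords) {
        int docId = static_cast<int>(paths.size());
        paths.push_back(path);
        std::istringstream in(keywords);
        for (std::string term; in >> term; ) postings[term].insert(docId);
    }
    std::set<int> query(const std::string& term) const {
        auto it = postings.find(term);
        return it == postings.end() ? std::set<int>{} : it->second;
    }
};

int main() {
    Index idx;
    idx.add("src/Allocator.java", "Allocator allocate grow buffer");
    idx.add("src/Pool.java",      "Pool allocate release buffer");
    for (int doc : idx.query("allocate"))
        std::cout << idx.paths[doc] << '\n';  // both files contain "allocate"
}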
I've implemented a source code search utility which I'll now describe briefly in the following paragraphs. The entire source code itself is stored as a non-indexable field named "code" (you can modify the source to store a compressed version).
For the retrieval part, note that the keywords that you are going to use for the search could be names of functions, classes, packages or variables. They could also be words from the comments and so on. In my implementation, I extracted this information from a Java abstract syntax tree (AST). You could do the same for other languages as well by making use of an appropriate parser to construct an AST.
Another possibility is the query-by-example (QBE) paradigm, where you could use a small snippet of code to search approximately similar snippets from your indexed code-base. This is particularly helpful for detecting source-code reuse and plagiarism, the main purpose for which I developed the tool.
The project page is here. I call it the YASOCS (Yet Another SOurce Code Searcher).
The search is very fast since it uses the inverted list. You could also use Luke (an open source Lucene index visualizer) to "see" the index yourself and execute test queries using the interface.

Design patterns for aggregating heterogeneous tabular data

I'm working on some C++ code that integrates information from several dozen CSV files. They all contain some time-stamped record data I want to extract, but the representation is somewhat different in each file. The differences between representations go beyond different column orderings and column names - for example, what's one row with multiple columns in one file may be multiple rows in a different file.
So I need some custom handling for each file to put together a unified data structure that includes the necessary information from all the files. My question is whether there's a preferred code pattern to keep the complexity manageable and the code elegant? Or if there's a good case study I should examine to see how this sort of complexity has been handled in the past.
(I realize something like this might be easier in a scripting language like Perl, but the project is in C++ for now. Also, my question is more about whether there's a code pattern to deal with this - so the answer doesn't have to be too language-specific.)
There are several phrases in your question that stick out to me: custom handling for each file, representation is somewhat different, complexity manageable. Given that you will need different variations of your parsing algorithm depending on the format of each CSV file, and that (from what I can tell) you want to loosely couple your parsing mechanism, I would recommend the strategy pattern.
The strategy pattern will decouple the parsing mechanism from the users of the data contained in the CSV file. The users of the data have no interest in what format the CSV file is in; they are only interested in the information within that file, which makes the strategy pattern an excellent choice. If there are similarities between your parsing mechanisms, you can use the template and strategy patterns together to reduce duplication and take advantage of inheritance.
By using the strategy pattern you can then extract strategy creation into a factory method or abstract factory as you see fit, further allowing clients to be decoupled from the parsing method.
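A minimal sketch of what that strategy arrangement might look like (class and field names are illustrative, not taken from your files):
// Sketch of the strategy arrangement: each CSV layout gets its own parsing
// strategy, and clients only ever see the unified Record type.
#include <string>
#include <vector>

struct Record {                        // the unified, layout-independent result
    std::string timestamp;
    double value = 0.0;
};

class ParseStrategy {                  // interface every file-specific parser implements
public:
    virtual ~ParseStrategy() = default;
    virtual std::vector<Record> parse(const std::string& path) const = 0;
};

class VendorAFormat : public ParseStrategy {
public:
    std::vector<Record> parse(const std::string& /*path*/) const override {
        // ... read one row per record, columns "time,value" ...
        return {};
    }
};

class VendorBFormat : public ParseStrategy {
public:
    std::vector<Record> parse(const std::string& /*path*/) const override {
        // ... combine several physical rows into one Record ...
        return {};
    }
};

// Clients receive a strategy (e.g. from a factory keyed on the file name) and
// never need to know which concrete layout they are dealing with.
std::vector<Record> load(const ParseStrategy& strategy, const std::string& path) {
    return strategy.parse(path);
}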
I am not quite sure what you want to do with the different files. If the idea is to use them like database tables and you have some keys with attached information scattered across multiple files, you might want to have a look at something like MapReduce, where you build up part of the information from each file first and aggregate the information sharing the same key in a second step.
As for data structures, it depends on the layout of your files. I would probably have a dedicated reader for each file type which would store the information in dedicated data structures representing the information in the file. You could attach a key to each information and use a reduce operation to merge all the information fragments using the same key and aggregate them in a proxy structure.
On the other hand, if the idea is to build identical objects from different serialization methods (i.e. the different files are independent but represent the same type of data with a different layout), without knowing in advance which serialization method has been employed, I am afraid the only solution left is to brute-force the deserialization. You can have a set of readers, one for each input type, and try them against the file: if one fails, the next one takes over, and so on, until you either discover a new file format or find the appropriate reader. I don't think there is any pattern covering this.
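A rough sketch of that brute-force reader chain, assuming hypothetical Reader and Record types:
// Rough sketch of the brute-force approach: one reader per known layout,
// tried in turn until one accepts the file.
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

struct Record { std::string timestamp; double value = 0.0; };

class Reader {
public:
    virtual ~Reader() = default;
    // Returns the parsed records, or nothing if the file is not in this layout.
    virtual std::optional<std::vector<Record>> tryParse(const std::string& path) const = 0;
};

std::vector<Record> parseWithAnyReader(const std::vector<std::unique_ptr<Reader>>& readers,
                                       const std::string& path) {
    for (const auto& reader : readers)
        if (auto records = reader->tryParse(path))  // first reader that succeeds wins
            return *records;
    throw std::runtime_error("no reader recognises the format of " + path);
}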

Efficient memory storage and retrieval of categorized string literals in C++

Note: This is a follow up to this question.
I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.
The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches so they need to be in memory. Also there can be 100+ copies of this process running at the same time so large I/O on startup could cause slow I/O for other copies.
With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).
This program currently runs on Windows with Microsoft compiler but I'd like to keep the solution as cross-platform as possible so I don't think I want to use PE resource files or something.
Mmapping an external file might work but then I have the issue of keeping program version and data version in sync, one does not normally change without the other. Also this requires some file "format" which adds a layer of complexity I'd rather not have.
So after all of this preamble, it seems like the best solution is to have a bunch of arrays of strings which I can then iterate over. This seems kind of messy as I'm mixing code and data heavily, but with the above requirements is there any better way to handle this sort of situation?
I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).
Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.
Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can't be too big, but it's another option)
The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.
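As a rough illustration of the precompiled-tree idea (using a plain trie rather than a compressed radix tree, and leaving out the save/load step), matching might look like this; the pattern strings are placeholders:
// Rough illustration of the precompiled-tree idea: the string list is compiled
// into a trie and walked from every position of the HTML chunk.
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> next;
    int patternId = -1;                          // >= 0 marks the end of a pattern
};

void insert(TrieNode& root, const std::string& pattern, int id) {
    TrieNode* node = &root;
    for (char c : pattern) {
        auto& child = node->next[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->patternId = id;
}

// Walks the trie from every position of the text and returns all pattern ids found.
std::vector<int> match(const TrieNode& root, const std::string& text) {
    std::vector<int> hits;
    for (std::size_t start = 0; start < text.size(); ++start) {
        const TrieNode* node = &root;
        for (std::size_t i = start; i < text.size(); ++i) {
            auto it = node->next.find(text[i]);
            if (it == node->next.end()) break;
            node = it->second.get();
            if (node->patternId >= 0) hits.push_back(node->patternId);
        }
    }
    return hits;
}

int main() {
    TrieNode root;
    insert(root, "add to cart", 0);
    insert(root, "checkout", 1);
    for (int id : match(root, "<a class=\"btn\">add to cart</a>"))
        std::cout << "matched pattern " << id << '\n';  // prints: matched pattern 0
}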
If the strings that need to be matched can be locked down at compile time, you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it, lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is often used in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably avoid the duplication of strings (or other data) in memory among the different instances of your program that you are worried about. You could probably easily generate the regular expressions from the string literals in your existing code, but it may take a good bit of work to rework your program to use the code that lex generated.
If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.
The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:
"string1"
"string2"
"string3"
and the input:
"string2"
The partial match for "string" would be done just once by a DFA (Deterministic Finite-state Automaton) regular expression system (like lex), which would probably speed up your system. Building these things does require a lot of work on lex's part, but all of the hard work is done up front.
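If the strings do have to change at run time, the compile-once idea can still be illustrated with a single alternation built from the string list; note that std::regex is typically a backtracking engine rather than the table-driven DFA that lex generates, so this only shows the approach, not lex's performance:
// Minimal illustration of the compile-once idea: all of the literal strings are
// joined into a single alternation and compiled up front, so the scan over the
// HTML happens in one pass instead of one strcmp per string.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> patterns = {"string1", "string2", "string3"};

    std::string alternation;                 // becomes "string1|string2|string3"
    for (const auto& p : patterns)
        alternation += (alternation.empty() ? "" : "|") + p;
    std::regex matcher(alternation);         // compiled once, reused for every page

    std::string html = "<div>string2</div>";
    std::smatch m;
    if (std::regex_search(html, m, matcher))
        std::cout << "matched: " << m.str() << '\n';   // prints: matched: string2
}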
Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory-mapped files to share a single copy of the file across the hundreds of instances of the program. Also, you may want to try adjusting the working set size to see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides, your program needs quota privileges to adjust the working set size).
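A minimal POSIX sketch of that memory-mapped approach (on Windows the equivalents are CreateFile/CreateFileMapping/MapViewOfFile); the data file name is hypothetical:
// Minimal POSIX sketch of the memory-mapped approach: every instance maps the
// same read-only file, so the OS keeps a single physical copy of the string
// data in the page cache.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstdio>

int main() {
    const char* path = "match_strings.dat";   // hypothetical data file holding the string lists
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { std::perror("fstat"); close(fd); return 1; }

    // PROT_READ + MAP_SHARED: every process mapping this file shares the same physical pages.
    void* data = mmap(nullptr, static_cast<std::size_t>(st.st_size), PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const char* strings = static_cast<const char*>(data);
    std::printf("first byte of shared string data: %c\n", strings[0]);

    munmap(data, static_cast<std::size_t>(st.st_size));
    close(fd);
    return 0;
}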
There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.
The bottom line is that you need to experiment to see what works best, and remember to measure after each change :)...

How should I go about building a simple LR parser?

I am trying to build a simple LR parser for a type of template (configuration) file that will be used to generate some other files. I've read and read about LR parsers, but I just can't seem to understand it! I understand that there is a parse stack, a state stack and a parsing table. Tokens are read onto the parse stack, and when a rule is matched, the tokens are shifted or reduced, depending on the parsing table. This continues recursively until all of the tokens are reduced and the parsing is then complete.
The problem is I don't really know how to generate the parsing table. I've read quite a few descriptions, but the language is technical and I just don't understand it. Can anyone tell me how I would go about this?
Also, how would I store things like the rules of my grammar?
http://codepad.org/oRjnKacH is a sample of the file I'm trying to parse with my attempt at a grammar for its language.
I've never done this before, so I'm just looking for some advice, thanks.
In your study of parser theory, you seem to have missed a much more practical fact: virtually nobody ever even considers hand writing a table-driven, bottom-up parser like you're discussing. For most practical purposes, hand-written parsers use a top-down (usually recursive descent) structure.
The primary reason for using a table-driven parser is that it lets you write a (fairly) small amount of code that manipulates the table and such, that's almost completely generic (i.e. it works for any parser). Then you encode everything about a specific grammar into a form that's easy for a computer to manipulate (i.e. some tables).
Obviously, it would be entirely possible to do that by hand if you really wanted to, but there's almost never a real point. Generating the tables entirely by hand would be pretty excruciating all by itself.
For example, you normally start by constructing an NFA, which is a large table -- normally, one row for each parser state, and one column for each possible input. At each cell, you encode the next state to enter when you start in that state and then receive that input. Most of these transitions are basically empty (i.e. they just say that input isn't allowed when you're in that state). (Note: since the valid transitions are so sparse, most parser generators support some way of compressing these tables, but that doesn't change the basic idea.)
You then step through all of those and follow some fairly simple rules to collect sets of NFA states together to become a state in the DFA. The rules are simple enough that it's pretty easy to program them into a computer, but you have to repeat them for every cell in the NFA table, and do essentially perfect book-keeping to produce a DFA that works correctly.
A computer can and will do that quite nicely -- for it, applying a couple of simple rules to every one of twenty thousand cells in the NFA state table is a piece of cake. It's hard to imagine subjecting a person to doing the same though -- I'm pretty sure under UN guidelines, that would be illegal torture.
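To make the recommended alternative concrete, here is the general shape of a small hand-written recursive descent parser for a toy expression grammar; it has nothing to do with your configuration format, it only shows how each grammar rule becomes one function:
// Shape of a typical hand-written recursive descent parser, for the toy grammar
//   expr := term (('+' | '-') term)*
//   term := NUMBER | '(' expr ')'
#include <cctype>
#include <iostream>
#include <stdexcept>
#include <string>

class Parser {
public:
    explicit Parser(std::string text) : text_(std::move(text)) {}
    int parse() { int v = expr(); expect_end(); return v; }

private:
    int expr() {                        // expr := term (('+'|'-') term)*
        int value = term();
        while (peek() == '+' || peek() == '-')
            value += (get() == '+') ? term() : -term();
        return value;
    }
    int term() {                        // term := NUMBER | '(' expr ')'
        if (peek() == '(') {
            get();                      // consume '('
            int value = expr();
            if (get() != ')') throw std::runtime_error("expected ')'");
            return value;
        }
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("expected number");
        int value = 0;
        while (std::isdigit(static_cast<unsigned char>(peek())))
            value = value * 10 + (get() - '0');
        return value;
    }
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }
    char get() { return pos_ < text_.size() ? text_[pos_++] : '\0'; }
    void expect_end() { if (pos_ != text_.size()) throw std::runtime_error("trailing input"); }

    std::string text_;
    std::size_t pos_ = 0;
};

int main() {
    std::cout << Parser("(1+2)-3+10").parse() << '\n';   // prints 10
}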
The classic solution is the lex/yacc combo:
http://dinosaur.compilertools.net/yacc/index.html
Or, as GNU calls them, flex/bison.
edit:
Perl has Parse::RecDescent, which is a recursive descent parser; it may work better for simple jobs.
You need to read about ANTLR.
I looked at the definition of your file format. While I am missing some of the context for why you specifically want an LR parser, my first thought was: why not use an existing format like XML or JSON? Going down the parser-generator route usually has a high start-up cost that will not pay off for the simple data that you are looking to parse.
As Paul said, lex/yacc are an option; you might also want to have a look at Boost::Spirit.
I have worked with neither; a year ago I wrote a much larger parser using QLALR from the Qt/Nokia people. When I researched parsers, this one, even though very under-documented, had the smallest footprint to get started (only one tool), but it does not support lexical analysis. IIRC I could not figure out C++ support in ANTLR at that time.
10,000 mile view: In general you are looking at two components: a lexer that takes the input symbols and turns them into higher-order tokens, and a grammar that works off those tokens. The grammar description states rules; usually you include some code with each rule, and this code is executed when the rule is matched. The compiler generator (e.g. yacc) will take your description of the rules and the code and turn it into compilable code. Unless you are doing this by hand, you would not be manipulating the tables yourself.
Well, you can't understand it like
"Function A1 does f to object B, then function A2 does g to D, etc."
It's more like
"Function A does action {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o or p, or no-op} and shifts/reduces a certain count to objects {1-1567} at stack head of type {B,C,D,E,F,or G} and its containing objects up N levels which may have types {H,I,J,K or L etc} in certain combinations according to a rule list"
It really does need a data table (or code generated from a data table like thing, like a set of BNF grammar data) telling the function what to do.
You CAN write it from scratch. You can also paint walls with eyelash brushes. You can interpret the data table at run-time. You can also put Sleep(1000); statements in your code every other line. Not that I've tried either.
Compilers are complex. Hence compiler generators.
EDIT
You are attempting to define the tokens in terms of content in the file itself.
I assume the reason you "don't want to use regexes" is that you want to be able to access line number information for different tokens within a block of text, and not just for the block of text as a whole. If line numbers for each word are unnecessary, and entire blocks are going to fit into memory, I'd be inclined to model the entire bracketed block as a token, as this may increase processing speed. Either way you'll need a custom yylex function. Start by generating one with lex, with fixed markers "[" and "]" for content start and end, then freeze it and modify it to take updated data about which markers to look for from the yacc code.