Code for identifying programming language in a text file [closed] - c++

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
i'm supposed to write code which when given a text file (source code) as input will output which programming language is it. This is the most basic definition of the problem. More constraints follow:
I must write this in C++.
A wide variety of languages should be recognized - html, php, perl, ruby, C, C++, Java, C#...
Amount of false positives (wrong recognition) should be low - better to output "unknown" than a wrong result. (it will be in the list of probabilities for example as unknown: 100%, see below)
The output should be a list of probabilities for each language the code knows, so if it knows C, Java and Perl, the output should be for example: C: 70%, Java: 50%, Perl: 30% (note there is no need to have the probabilities sum up to 100%)
It should have a good ratio of accuracy/speed (speed is a bit more favored)
It would be very nice if the code could be written in a way that adding new languages for recognition will be fairly easy and involve just adding "settings/data" for that particular language. I can use anything available - a heuristic, a neural network, black magic. Anything. I'am even allowed to use existing solutions, but: the solution must be free, opensource and allow commercial usage. It must come in form of easily integrable source code or as a static library - no DLL. However i prefer writing my own code or just using fragments of another solution, i'm fed up with integrating code of others. Last note: maybe some of you will suggest FANN (fast artificial neural network library) - this is the only thing i cannot use, since this is the thing we use ALREADY and we want to replace that.
Now the question is: how would you handle such a task, what would you do? Any suggestions how to implement this or what to use?
EDIT: based on the comments and answers i must emphasize some things i forgot: speed is very crucial, since this will get thousands of files and is supposed to answer fast, so looking at a thousand files should produce answers for all of them in a few seconds at most (the size of files will be small of course, a few kB each one). So trying to compile each one is out of question. The thing is, that i really want probabilities for each language - so i rather want to know that the file is likely to be C or C++ but that the chance it is a bash script is very low. Due to code obfuscation, comments etc. i think that looking for a 100% accurate code is a bad idea and in fact is not the goal of this.

You have a problem of document classification. I suggest you read about naive bayes classifiers and support vector machines. In the articles there are links to libraries which implement these algorithms and many of them have C++ interfaces.

One simple solution I could think of is that you could just identify the keywords used in different languages. Each identified word would have score +1. Then calculate ratio = identified_words / total_words. The language that gets most score is the winner. Off course there are problems like usage of comments e.t.c. But I think that is a very simple solution that should work in most cases.

If you know that the source files will conform to standards, file extensions are unique to just about every language. I assume that you've already considered this and ruled it out based on some other information.
If you can't use file extensions, the best way would be to find the things between languages that are most different and use those to determine filetype. For example, for loop statement syntax won't vary much between languages, but package include statements should. If you have a file including java.util.*, then you know it's a java file.

I'm sorry but if you have to parse thousands of files, then your best bet is to look at the file extension. Don't over engineer a simple problem, or put burdensome requirements on a simply task.
It sounds like you have thousands of files of source code and you have no idea what programming language they were written in. What kind of programming environment do you work in? (Ruling out the possibility of an artificial homework requirement) I mean one of the basics of software engineering that I can always rely on are that c++ code files have .cpp extension, that java code files have the .java extension, that c code files have the .c extension etc... Is your company playing fast and loose with these standards? If so I would be really worried.

As dmckee suggested, you might want to have a look at the Unix file program, whose source is available. The heuristics used by this utility might be a great source of inspiration. Since it is written in C, I guess that it qualifies for C++. :) You do not get confidence percentages directly, though; maybe are they used internally?

Take a look at nedit. It has a syntax highlighting recognition system, under Syntax Highlighting->Recognition Patterns. You can browse sample recognition patterns here, or download the program and check out the standard ones.
Here's a description of the highlighting system.

Since the list of languages is known upfront you know the syntax/grammar for each of them.
Hence you can, as an example, to write a function to extract reserved words from the provided source code.
Build a binary tree that will have all reserved words for all languages that you support. And then just walk that tree with the extracted reserved words from the previous step.
If in the end you only have 1 possibility left - this is your language.
If you reach the end of the program too soon - then (from where you stopped) - you can analyse your position on a tree to work out which languages are still the possibitilies.

You can maybe try to think about languages differences and model these with a binary tree, like "is feature X found ? " if yes, proceed in one direction, if not, proceed in another direction.
By constructing this search tree efficiently you could end with a rather fast code.

This one is not fast and may not satisfy your requirements, but just an idea. It should be easy to implement and should give 100% result.
You could try to compile/execute the input text with different compilers/interpreters (opensource or free) and check for errors behind the scene.

The Sequitur algorithm infers context-free grammars from sequences of terminal symbols. Perhaps you could use that to compare against a set of known production rules for each language.

Related

Cleanest data structure to use when interpreting data from neatly-structured user commands (in C++) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I would like to write a simple in-house program that parses user commands written in a language of our team's own invention (but based closely on another program we are already familiar with). The command parser that I am working on now will simply be the UI through which the user can run the other algorithms I have already written. (Those other algorithms, by the way, are used to generate the input files for a molecular dynamic simulation package called LAMMPS.) The only thing I really have left to do is just write this UI, but as it turns out, writing your own scripting language is almost an intractable challenge for a non software engineer to tackle on his own.
According to the answers I received, what I am try to make would be considered a Domain Specific Language, and it is not advisable to try to make one's own DSL due to the enormous amount of work required to make it useful and bug-free.
The best option then would actually be to use an existing scripting language like Lua or Python, and embed it in the program.
To do this, I will most likely use Lua because it seems most fitting for our needs. So at this point, the rest of this question is no longer relevant since the answer would be: "Don't do it yourself." But I'm still going to keep part of it here for other users to be able read and learn from the wonderful answers below.
Thanks again to everyone who replied!
Old Question:
I would like to write a program that parses a user text input and then
runs a function corresponding to that input. To do this I would need
to parse the string for relevant keywords. I believe there will be
less than 15 keywords when I'm done, so ideally I'd like this code
to be simple and short.
The problem is that I am currently using if-statements to parse the
strings. This is an extremely inconvenient way to parse commands
because even for a short 3 word commands the code explodes into nested-ifs
3 layers deep. So longer 8+ word sentences will become nested-ifs more than
8 layers deep.
This kind of programing approach quickly becomes unmanageable, especially
when I need to make any significant changes to a command.
My question is whether or not there exists a data structure in C++ that
can help me better manage my giant nested-ifs, or if anyone could suggest
a better way to parse a string for lots of different data types (i.e.
substings, ints, and floats) and output an error message when the expected
type is not found?
Here is an example of a short user session to show the kinds of commands
I would like to interpret:
load "Basis.Silicon" as material 1
add material 1 to layer 1
rotate layer 1 about x-axis by 45 degrees
translate layer 1 in x-axis by 10 nm
generate crystal
These commands are based on an already-existing program that our team
uses, but unfortunately the source code for this program has never been
publicly released so I am left guessing as to how it was actually
implemented.
One final note, unlike natural language processors, I know exactly what
the format of each line will be. So my issue isn't so much how to interpret
the text, but rather how to code the logic in a concise and manageable way.
Thanks everyone!
Your question is not clear. And your goals are more difficult than what you believe.
Either you consider that you want to somehow process human language sentences (e.g. in English). Then you want to study natural language processing, and you can find some libraries related to that field.
Or you consider that you want to interpret some formal programming or scripting language. Then you want to study interpreters and compilers. BTW, in that case, you might just embed an existing interpreter (like Lua, Guile, Python, etc....) in your program.
You could also think in terms of expert systems with a knowledge base made of rules (this approach could be viewed as in the middle between NLP and scripting language) You'll then need some inference engine (perhaps CLIPS). See also J.Pitrat's blog.
Notice that even coding a simple interpreter is more difficult than you believe. You absolutely need to represent abstract syntax trees, which you construct from textual input with a parsing phase.
BTW, All of NLP, expert systems, and interpreter design and implementation are difficult fields. You could get a PhD in all 3 fields (but you have to choose which).
If you go the embedded interpreter way: study the interpreters I mentioned (Guile, Lua, Python, Neko, etc...) and choose which one you want, to embed.
If for whatever reason, you want to make an interpreter from scratch: Learn several programming languages first (including scripting languages like Ruby, Python, Ocaml, Scheme, Lua, Neko, ...). Read books on Programming Language Pragmatics (by M.Scott) and Lisp In Small Pieces (by Queinnec). Read also text books on compilation and parsing, and on Garbage Collection and formal (e.g. denotational) semantics. All this may need a dozen years of work.
Notice that by experience embedding a software in an interpreter is a very structuring design. If you did not thought of that at the beginning you probably need to redesign and refactor a lot your existing application. For instance, when embedding a software in an interpreter, you cannot afford that bad input crashes the program. So error handling and memory management (interfacing to the GC of the interpreter) is challenging and gives new constraints. Hence you'll need to re-think your application.
If all this is new (and even if you don't choose e.g. Guile as the embedding interpreter): learn and practice a bit of Scheme -e.g. with Guile or PltScheme- (e.g. reading SICP), read a little bit about λ-calculus and closures, then read Queinnec's Lisp In Small Pieces book. Remember the halting problem (which is partly why interpreters are difficult to code).
BTW the syntax you are proposing (e.g. rotate mat 1 by x 90) is not very readable and looks COBOL-like. If possible, have a language which looks familiar to existing ones. Make it easy to read !
Start by reading all the wikipages I am referencing here.
FWIW, I am the main author of MELT, a domain specific language (inspired a lot by Scheme) to extend the GCC compiler. Some of the papers / documentations I wrote might inspire you (and contain valuable references).
Addenda (after question was reformulated)
You seems to invent some formal syntax like
add material 1 to layer 1
rotate layer 1 about x-axis by 90 degrees
translate layer 1 in x-axis by 10 inches
I can't guess what kind of language is it? Are you implementing a 3D printer? If yes, you should stick to some existing standard formal language in that domain.
I believe that such a COBOL-like syntax is really wrong. The point is that it is too verbose, and that you are wishing to implement some domain specific language. I find your example very bad-looking.
Is that syntax your invention, or is there some document specifying (and many thousands already existing lines coded in) your domain specific language. If you are just inventing it, please reconsider the syntax and the semantics.
First, you need to specify on paper the full syntax and semantics of your DSL.
Is your DSL Turing complete? (I guess that yes, because Turing completeness is reached very quickly - e.g. with variables and loops....). If yes, you are inventing a scripting language. Please don't invent scripting language without knowing several programming & scripting languages (then read Programming Language Pragmatics...). The point is that, if your scripting language will become successful, advanced users will soon or later write important programs in it (e.g. many thousand lines). Then, these advanced users will be programmers. In that case, it is very important (for social & economic reasons) to have a DSL well founded and looking familiar (if possible, an extension of some existing scripting language).
If your DSL already exists, stick to its specification on paper. If that specification is not good enough, improve it with formalization (e.g. by writing some BNF syntax, and some formal (e.g. denotational) semantics for it). Publish and discuss that formalization with existing users.
Several industries got some ad-hoc DSLs which became widely used but was ill designed
(e.g., in the French nuclear industry, the Gibiane DSL designed in the 1970s by nuclear physicists, not computer scientists; the US Boeing corporation is also rumored to have made similar mistakes). Then, maintaining and improving the many hundred thousands lines of DSL scripts is becoming a nightmare (and may means losing millions of dollars or euros). So you better stick to some existing scripting language. The advantages are that there exist some culture on it (e.g. you can find dozens of books on Python or Lua, and many trained engineers familiar with them), that the interpreter is widely used and tested, that the community working on them is improving the interpreters, so it has quite few uncorrected bugs.
You should not attempt to design and implement your own DSL if you are not a trained computer scientist. Stick to some existing scripting language (of course their syntax is not like you want it to be), and leverage on existing implementations and experiment.
As a counter-example, J.Ousterhout has invented the widely used Tcl scripting language, with the claim that scripts are always small (e.g. hundreds of line only) and won't grow to big code base; unfortunately, some of them did, and Tcl is known as a bad language to code many dozens of thousands of lines (even if Tcl is an easy and convenient language for tiny scripts). The moral of the story is that if a (turing complete) scripting language is becoming successful, some "crazy" advanced user will code hundred of thousands of script code. So you need that scripting language to be well designed from the start. Hence, you should adopt and adapt a good existing scripting language (and avoid inventing an unfamiliar syntax without having a good knowledge of several existing scripting languages)
later additions
PS: my criticism of Tcl is not entirely subjective: the point is that Tcl was designed for small scripts in mind (read J.Ousterhout's first papers about Tcl), but my point is that when you offer a Turing-complete scripting language, some "crazy" user will eventually write huge scripts for it. Hence, you need to anticipate such "crazy" usage by offering a scripting language which "scales up" to big scripts, so is built according to software engineering practices for large software code base.
NB. Lua is probably a good choice as a language to embed. It is small, has a nice implementation, is well documented, and has good performance. But be careful about memory management issues (and this advice holds for any scripting language).
EDIT: To be more clear, I would like to have a short list of key words
(<15). The order/presence of which would determine which function will
be run.
You can build a small ruleset engine (e.g. something that processes lists of words). You write that engine/function once and just pass the data structures to it.
As an alternative, a solution using regular expressions would be probably the fastest to code (the engine is ready for you), assuming you're familiar with the regexp syntax (if not, it's still a good investment).
You could build a table of keywords and function pointers:
typedef void (*Function_Pointer)(void);
struct table_entry
{
const char * keyword;
Function_Pointer p_function;
};
table_entry function_table[] =
{
{"car", Process_Car},
{"bike", Process_Bike},
};
Search the table for a keyword. If the keyword is found, dereference the function pointer.
The following snippet will execute the function for processing the word "car":
(function_table[0].p_function)();
There is a famous program, called Eliza, which parses sentences for keywords.
Examples can be found at: Eliza C++ examples

AI: What kind of process would sites like Wit use to train Natural language [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I am working on a project where I would like to achieve a sense of natural language understanding. However, I am going to start small and would like to train it on specific queries.
So for example, starting out I might tell it:
songs.
Then if it sees a sentence like "Kanye Wests songs" it can match against that.
BUT then I would like to give it some extra sentences that could mean the same thing so that it eventually learns to be able to predict unknown sentences into a set that I have trained it on.
So I might add the sentence: "Songs by
And of course would be a database of names it can match agains.
I came across a neat website, Wit.ai that does something like I talk about. However, they resolve their matches to an intent, where I would like to match it to a simplified query or BETTER a database like query (like facebook graph search).
I understand a context free grammar would work well for this (anything else?). But what are good methods to train several CFG that I say have similar meaning and then when it sees unknown sentences it can try and predict.
Any thoughts would be great.
Basically I would like to be able to take a natural language sentence and convert it to some form that can be run better understood to my system and presented to the user in a nice way. Not sure if there is a better stackexchange for this!
To begin with, I think SO is quite well-suited for this question (I checked Area 51, there is no stackexchange for NLP).
Under the assumption that you are already familiar with the usual training of PCFG grammars, I am going to move into some specifics that might help you achieve your goal:
Any grammar trained on a corpus is going to be dependent on the words in that training corpus. The poor performance on unknown words is a well-known issue in not just PCFG training, but in pretty much any probabilistic learning framework. What we can do, however, is to look at the problem as a paraphrasing issue. After all, you want to group together sentences that have the same meaning, right?
In recent research, detecting sentences or phrases that have the same (or similar) meaning have employed a technique known as as distributional similarity. It aims at improving probability estimation for unseen cooccurrences. The basic concept is
words or phrases that share the same distribution—the same set of words in the same context in a corpus—tend to have similar meanings.
You can use only intrinsic features (e.g. production rules in PCFG) or bolster such features with additional semantic knowledge (e.g. ontologies like FreeBase). Using additional semantic knowledge enables generation of more complex sentences/phrases with similar meanings, but such methods usually work well only for specific domains. So, if you want your system to work well only for music, it's a good idea.
Reproducing the actual distributional similarity algorithms will make this answer insanely long, so here's a link to an excellent article:
Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods by Madnani and Dorr.
For your work, you will only need to go through section 3.2: Paraphrasing Using a Single Monolingual Corpus. I believe the algorithm labeled as 'Algorithm 1' in this paper will be useful to you. I am not aware of any publicly available tool/code that does this, however.

Calling OpenOffice spell/grammar check from a C/C++ program

The problem is as follows: I'm writing a brute-force decrypter to crack some supersecret code (it's a contest, not a crime), which turned out to be impossible: just too many nodes in the tree that needs to be searched. To overcome this problem, I thought it might be helpful to check the intermediate 'solutions' to see if they produce (parts of) sentences. For example, I might get something like: "jvabaosajbgasgav..." or "lookslikeitsworking....". The first clearly is gibberish and in that case it wouldn't make any sense to continue cracking the code. The second one can easily be identified by eye as a valid English sentence.
I'm not planning on writing my own spell/grammar checker, so I thought it might be possible to call the spell checker from an open source project like OpenOffice or LibreOffice. I checked the openoffice.org website but I couldn't really find out what to do next. Like, how can I link against their libraries? Are these libraries in the SDK? What functions can I use?
The program I'm writing is in pure C, so I probably need to write a wrapper to call their C++ member-functions, right?
Any help is much appreciated!
I believe you'd be vastly more successful integrating with something written with such integration in mind, like the Aspell library.

Word language detection in C++

After searching on Google I don't know any standard way or library for detecting whether a particular word is of which language.
Suppose I have any word, how could I find which language it is: English, Japanese, Italian, German etc.
Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!
Simple language recognition from words is easy. You don't need to understand the semantics of the text. You don't need any computationally expensive algorithms, just a fast hash map. The problem is, you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask for each language, that will allow you to mark words like "the" as recognized in multiple languages. Then, read each language dictionary into your hash map. If the word is already present from a different language, just mark the current language also.
Suppose a given word is in English and French. Then when you look it up ex("commercial") will map to ENGLISH|FRENCH, suppose ENGLISH = 1, FRENCH=2, ... You'll find the value 3. If you want to know whether the words are in your lang only, you would test:
int langs = dict["the"];
if (langs | mylang == mylang)
// no other language
Since there will be other languages, probably a more general approach is better.
For each bit set in the vector, add 1 to the corresponding language. Do this for n words. After about n=10 words, in a typical text, you'll have 10 for the dominant language, maybe 2 for a language that it is related to (like English/French), and you can determine with high probability that the text is English. Remember, even if you have a text that is in a language, it can still have a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshhold, it will work quite well (and very, very fast).
Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.
To make this fast, you will need to preload the hash map, otherwise loading it up initially is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that block load the entire thing in efficiently.
I have found Google's CLD very helpful, it's written in C++, and from their web site:
"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."
Well,
Statistically trained language detectors work surprisingly well on single-word inputs, though there are obviously some cases where they can't possible work, as observed by others here.
In Java, I'd send you to Apache Tika. It has an Open-source statistical language detector.
For C++, you could use JNI to call it. Now, time for a disclaimer warning. Since you specifically asked for C++, and since I'm unaware of a C++ free alternative, I will now point you at a product of my employer, which is a statistical language detector, natively in C++.
http://www.basistech.com, the product name is RLI.
This will not work well one word at a time, as many words are shared. For instance, in several languages "the" means "tea."
Language processing libraries tend to be more comprehensive than just this one feature, and as C++ is a "high-performance" language it might be hard to find one for free.
However, the problem might not be too hard to solve yourself. See the Wikipedia article on the problem for ideas. Also a small support vector machine might do the trick quite handily. Just train it with the most common words in the relevant languages, and you might have a very effective "database" in just a kilobyte or so.
I wouldn't hold my breath. It is difficult enough to determine the language of a text automatically. If all you have is a single word, without context, you would need a database of all the words of all the languages in the world... the size of which would be prohibitive.
Basically you need a huge database of all the major languages. To auto-detect the language of a piece of text, pick the language whose dictionary contains the most words from the text. This is not something you would want to have to implement on your laptop.
Spell check first 3 words of your text in all languages (the more words to spell check, the better). The spelling with least number of spelling errors "wins". With only 3 words it is technically possible to have same spelling in a few languages but with each additional word it becomes less probable. It is not a perfect method, but I figure it would work in most cases.
Otherwise if there is equal number of errors in all languages, use the default language. Or randomly pick another 3 words until you have more clear result. Or expand the number of spell checked words to more than 3, until you get a more clear result as well.
As for the spell checking libraries, there are many, I personally prefer Hunspell. Nuspell is probably also good. It is a matter of personal opinion and/or technical capabilities which one to use.
I assume that you are working with text not with speech.
If you are working with UNICODE than it has provided slot for each languages.
So you can identify that all characters of particular word is fall in this language slot.
For more help about unicode language slot you can fine over here

Is there a working Semantic Diff tool for C++?

I've found only abandoned projects, theories and feature lists.
But I could not find a tool which can compare two C++ source files and show semantic differences.
Of course I would not compare two unrelated files, or changes accumulated during 10 years of development and maintenance. But changes usually discussed at a code review would be understood more easily, if presented together with proper meaning, instead of textual changes at char/word/line level
A true "semantic diff" is impossible. Closer would be a tool that compared two source files to see how one was derived from the other by abstract editing operations (delete, copy, move, rename) on program structures (variables, expressions, statements, methods).
See our Smart Differencer for a tool that does this for a few languages, with C++ coming soon.
March 2013: GCC/MS dialects of C++11 supported.
There is one for C used by linux kernel developers. Not for C++ though
http://coccinelle.lip6.fr
It's not possible, in general. One program cannot even tell whether another will terminate (the famous Halting Problem). For more limited programs (finitely bounded loops, and maybe other constraints) you can do more, but it's heavy-duty stuff, and the language would have to be, not C++ but something less tricky, a limited subset or a simpler language altogether. That's probably why you're finding nothing that's been more than started.
Yes, it is called "writing comments for commits" and "issue tracking". Consider it metadata for semantic analysis.