string kernels for GP/SVM regression - shogun

I want to solve a small regression problem where the inputs are variable-length strings from a small vocabulary. I'd like to use Gaussian Process regression with some kind of string kernel. (SVM regression also ok.)
I see from this page that shogun supports many kinds of string kernels - can someone please provide a high level summary (with references to papers) of how they work?
I'd also like to see a worked example (in Python), since I've never used shogun before. I found this post on Stack Overflow, but it's from 2014, and it's not clear whether the interface is up to date.
Thanks
Kevin

The documentation pages of string kernel classes contain the information you are looking for. For example:
http://www.shogun-toolbox.org/api/latest/classshogun_1_1CPolyMatchStringKernel.html contains a high-level summary.
http://www.shogun-toolbox.org/api/latest/classshogun_1_1CSalzbergWordStringKernel.html refers to the paper.
Quite likely not every class page will contain both pieces of information, but most contain one or the other.
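To give a flavour of how most of them work: many of shogun's string kernels boil down to counting (possibly weighted) substring matches between two strings, which yields a positive semi-definite Gram matrix that can be plugged into GP or SVM regression just like an RBF kernel. As a rough, shogun-independent illustration of that idea, here is a minimal sketch of the plain k-spectrum kernel; it is C++ rather than shogun's Python API, and the function names are made up for the example:

    #include <iostream>
    #include <map>
    #include <string>

    // Count every substring of length k (a "k-mer") occurring in s.
    std::map<std::string, int> spectrum(const std::string& s, std::size_t k) {
        std::map<std::string, int> counts;
        for (std::size_t i = 0; i + k <= s.size(); ++i)
            ++counts[s.substr(i, k)];
        return counts;
    }

    // k-spectrum kernel: inner product of the two k-mer count vectors.
    // Strings that share many length-k substrings get a large kernel value.
    double spectrum_kernel(const std::string& a, const std::string& b, std::size_t k) {
        const auto ca = spectrum(a, k), cb = spectrum(b, k);
        double dot = 0.0;
        for (const auto& kv : ca) {
            const auto it = cb.find(kv.first);
            if (it != cb.end()) dot += double(kv.second) * it->second;
        }
        return dot;
    }

    int main() {
        // The two strings share 5 distinct 3-mers, so the kernel value is 5.
        std::cout << spectrum_kernel("GATTACA", "GATTTACA", 3) << "\n";
    }

Roughly speaking, the comm/spectrum-type kernels (e.g. CommWordStringKernel) are weighted variants of this counting idea, while the weighted degree kernels only count substrings that match at corresponding positions; the class documentation linked above gives the exact definitions and paper references.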

Related

GCC ext headers — up-to-date documentation?

I'm quite confused by this paradox:
GCC ext apparently contains lots of broadly useful functionality. For example, ext/pb_ds/assoc_container.hpp lets you build an order statistic tree just by specifying particular template arguments, and ext/numeric contains the power(..) algorithm for O(lg N) exponentiation of a generic object to a non-zero integer power, an algorithm that gets written from scratch all the time. There is also the rope data structure, algorithms for random sampling, and many others. Not things you would use every day, but definitely things that would be handy every other year or so.
Almost nobody seems to be using them. There is very little discussion on the web. There are some bug reports, and posts like this one suggesting either that these things are buggy and unmaintained or that there is no definitive guide on how to use them properly.
Now, trying to find the documentation, I type gcc "ext" into Google, and get https://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/ as the first result. Going to Examples of Associative Containers gets me to another table of contents, but clicking on e.g. the link to basic_set.cc gives me a 404 page.
At this point I'm not even sure if this code has received enough testing to be able to rely on it for serious applications.
Is there any proper documentation for when and how to use #include <ext/numeric> and the like? Or at least examples and asymptotic complexity estimates?
Since it sounds like you've found a defect in the documentation, I'd suggest raising it on the libstdc++@gcc.gnu.org mailing list. I was able to find a mirror of the libstdc++ test suite on GitHub, which contains the examples you want. If you're looking for documentation for ext_numerics, it's at gcc.gnu.org/onlinedocs/libstdc++/manual/ext_numerics.html.
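In the meantime, here is a minimal sketch of the two ext facilities the question mentions, assuming a reasonably recent GCC (the pb_ds type names have shifted slightly across versions, e.g. null_type used to be null_mapped_type):

    #include <ext/pb_ds/assoc_container.hpp>
    #include <ext/pb_ds/tree_policy.hpp>
    #include <ext/numeric>   // __gnu_cxx::power
    #include <iostream>

    // A std::set-like container with order statistics:
    // find_by_order(i) and order_of_key(k), both O(log n).
    typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                             __gnu_pbds::rb_tree_tag,
                             __gnu_pbds::tree_order_statistics_node_update>
        ordered_set;

    int main() {
        ordered_set s;
        s.insert(5); s.insert(1); s.insert(9); s.insert(3);

        std::cout << *s.find_by_order(2) << "\n";  // 3rd smallest element -> 5
        std::cout << s.order_of_key(9) << "\n";    // keys strictly less than 9 -> 3

        // O(lg N) exponentiation of a generic object from <ext/numeric>:
        std::cout << __gnu_cxx::power(2, 10) << "\n";  // 1024
    }

As with everything under ext/, these interfaces are GCC-specific and may change between releases, so the test suite mirror mentioned above is the closest thing to authoritative worked examples.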

how to match soft aggregate features (eyes, nose, mouth) using some statistical method?

I know a little bit of ML and want to implement a learning system by myself, but I don't know how to go about it. Can anyone give me a demo, or suggest another method to compare faces?
Here is a related post: https://stackoverflow.com/questions/14079794/how-to-recognize-face-by-geometric-feature-such-as-eyes-nose-mouth.
One cannot reasonably answer this question based on the above information because of the sheer vastness of the subject.
For a start, you should know that these problems are usually solved using machine learning techniques like neural networks. You said you know a bit about ML, but IMHO you might want to read more or take an online course on machine learning.
There are some good courses on Coursera.org; one that I like is Machine Learning by Andrew Ng.
These methods are also described in the above-mentioned course, and there are some good assignments too, which will help you get a detailed idea of how machine learning works.

Automatically sort functions alphabetically in C++ code

I am aware of a similar question for C#. I downloaded and tried NArrange and UniversalIndentGUI, but neither sorts the functions in C++ code by name. Does anyone know a non-commercial tool that does this job?
Unless you're under orders to rearrange the code to conform to an arbitrary coding standard, my advice is do not do this. I've seen people do it before, and the results are not pretty. The file will look completely different after you're done, and you'll have effectively trashed all of the edit history in source control. Any diffs between this version and any version that came before it will look like a jumbled mess. In the long run, having a clear diff history is worth more to you and your team than some measure of code cleanliness.

Code for identifying programming language in a text file [closed]

I'm supposed to write code which, when given a text file (source code) as input, will output which programming language it is. This is the most basic definition of the problem. More constraints follow:
I must write this in C++.
A wide variety of languages should be recognized - html, php, perl, ruby, C, C++, Java, C#...
The number of false positives (wrong recognitions) should be low; it is better to output "unknown" than a wrong result. (It will appear in the list of probabilities, for example as unknown: 100%, see below.)
The output should be a list of probabilities for each language the code knows, so if it knows C, Java and Perl, the output could be, for example: C: 70%, Java: 50%, Perl: 30% (note that the probabilities do not need to sum to 100%).
It should have a good accuracy/speed ratio (speed is favored a bit more).
It would be very nice if the code could be written in a way that makes adding new languages for recognition fairly easy, ideally just a matter of adding "settings/data" for that particular language.
I can use anything available: a heuristic, a neural network, black magic. Anything. I'm even allowed to use existing solutions, but: the solution must be free, open source and allow commercial usage. It must come in the form of easily integrable source code or a static library; no DLLs. However, I prefer writing my own code or just using fragments of another solution; I'm fed up with integrating other people's code. Last note: maybe some of you will suggest FANN (fast artificial neural network library). This is the only thing I cannot use, since it is the thing we use ALREADY and we want to replace it.
Now the question is: how would you handle such a task? What would you do? Any suggestions on how to implement this, or what to use?
EDIT: Based on the comments and answers, I must emphasize some things I forgot: speed is very crucial, since this will be given thousands of files and is supposed to answer fast; looking at a thousand files should produce answers for all of them within a few seconds at most (the files will be small, of course, a few kB each). So trying to compile each one is out of the question. The thing is that I really want probabilities for each language, so I want to know, say, that a file is likely to be C or C++ but that the chance it is a bash script is very low. Due to code obfuscation, comments, etc., I think that aiming for 100% accuracy is a bad idea and in fact is not the goal here.
You have a document classification problem. I suggest you read about naive Bayes classifiers and support vector machines. The articles contain links to libraries which implement these algorithms, and many of them have C++ interfaces.
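If you want to see the rough shape of such a classifier before reaching for a library, here is a very small multinomial naive Bayes sketch over identifier-like tokens. The class names and training snippets are made up for illustration; a real system would train on a corpus of files per language:

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <regex>
    #include <set>
    #include <string>
    #include <vector>

    // Split source text into identifier-like tokens.
    std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> out;
        std::regex word("[A-Za-z_][A-Za-z0-9_]*");
        for (auto it = std::sregex_iterator(text.begin(), text.end(), word);
             it != std::sregex_iterator(); ++it)
            out.push_back(it->str());
        return out;
    }

    class NaiveBayes {
        std::map<std::string, std::map<std::string, int>> word_counts_; // lang -> word -> n
        std::map<std::string, int> total_words_;                        // lang -> total tokens
        std::map<std::string, int> docs_;                               // lang -> training files
        std::set<std::string> vocab_;
    public:
        void train(const std::string& lang, const std::string& source) {
            ++docs_[lang];
            for (const auto& w : tokenize(source)) {
                ++word_counts_[lang][w];
                ++total_words_[lang];
                vocab_.insert(w);
            }
        }
        // Log-probability score per language, with Laplace smoothing;
        // the highest score wins, and the gaps give a crude confidence measure.
        std::map<std::string, double> classify(const std::string& source) const {
            int total_docs = 0;
            for (const auto& d : docs_) total_docs += d.second;
            std::map<std::string, double> scores;
            for (const auto& d : docs_) {
                double s = std::log(double(d.second) / total_docs);  // prior
                const auto& counts = word_counts_.at(d.first);
                int total = total_words_.at(d.first);
                for (const auto& w : tokenize(source)) {
                    auto it = counts.find(w);
                    int c = (it == counts.end()) ? 0 : it->second;
                    s += std::log(double(c + 1) / (total + vocab_.size()));
                }
                scores[d.first] = s;
            }
            return scores;
        }
    };

    int main() {
        NaiveBayes nb;
        // Toy training data, one tiny snippet per language.
        nb.train("C++",  "#include <iostream>\nint main() { std::cout << 0; }");
        nb.train("Perl", "use strict; my $x = 0; print $x;");
        for (const auto& s : nb.classify("int main() { std::cout << 42; }"))
            std::cout << s.first << " log-score: " << s.second << "\n";
    }

The per-language log-scores can be turned into the kind of ranked "probability" list you describe, with "unknown" reported when no score stands out.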
One simple solution I can think of is to just identify the keywords used in the different languages. Each identified keyword scores +1 for its language. Then calculate ratio = identified_words / total_words; the language that gets the highest score is the winner. Of course there are problems, like the use of comments etc., but I think this very simple approach should work in most cases.
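A minimal sketch of that scoring idea, with a hypothetical, deliberately tiny keyword table (a real one would be loaded from the per-language "settings/data" mentioned in the question):

    #include <iostream>
    #include <map>
    #include <regex>
    #include <set>
    #include <string>

    // Hypothetical, deliberately tiny keyword table; in practice this would be
    // loaded from per-language configuration files.
    const std::map<std::string, std::set<std::string>> kKeywords = {
        {"C++",  {"include", "namespace", "template", "std", "class"}},
        {"Java", {"import", "package", "public", "extends", "class"}},
        {"Perl", {"use", "my", "sub", "foreach", "print"}},
    };

    // Score = identified keywords / total words, per language.
    std::map<std::string, double> score(const std::string& source) {
        std::regex word("[A-Za-z_][A-Za-z0-9_]*");
        int total = 0;
        std::map<std::string, int> hits;
        for (auto it = std::sregex_iterator(source.begin(), source.end(), word);
             it != std::sregex_iterator(); ++it, ++total)
            for (const auto& lang : kKeywords)
                if (lang.second.count(it->str())) ++hits[lang.first];

        std::map<std::string, double> result;
        for (const auto& lang : kKeywords)
            result[lang.first] = total ? 100.0 * hits[lang.first] / total : 0.0;
        return result;
    }

    int main() {
        const std::string source = "#include <iostream>\nint main() { std::cout << 1; }";
        for (const auto& r : score(source))
            std::cout << r.first << ": " << r.second << "%\n";  // C++ scores highest here
    }

As noted, comments and string literals will skew the ratios, so a real implementation would strip them before counting.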
If you know that the source files will conform to standards, file extensions are unique to just about every language. I assume that you've already considered this and ruled it out based on some other information.
If you can't use file extensions, the best way would be to find the things that differ most between languages and use those to determine the file type. For example, for-loop syntax won't vary much between languages, but package include statements should. If you have a file including java.util.*, then you know it's a Java file.
I'm sorry, but if you have to parse thousands of files, then your best bet is to look at the file extension. Don't over-engineer a simple problem, or put burdensome requirements on a simple task.
It sounds like you have thousands of files of source code and you have no idea what programming language they were written in. What kind of programming environment do you work in? (Ruling out the possibility of an artificial homework requirement.) I mean, one of the basics of software engineering that I can always rely on is that C++ code files have the .cpp extension, Java code files have the .java extension, C code files have the .c extension, etc... Is your company playing fast and loose with these standards? If so, I would be really worried.
As dmckee suggested, you might want to have a look at the Unix file program, whose source is available. The heuristics used by this utility might be a great source of inspiration. Since it is written in C, I guess that it qualifies for C++. :) You do not get confidence percentages directly, though; maybe they are used internally?
Take a look at nedit. It has a syntax highlighting recognition system, under Syntax Highlighting->Recognition Patterns. You can browse sample recognition patterns here, or download the program and check out the standard ones.
Here's a description of the highlighting system.
Since the list of languages is known upfront, you know the syntax/grammar of each of them.
Hence you can, as an example, write a function that extracts the reserved words from the provided source code.
Build a binary tree that contains all reserved words for all languages that you support, and then walk that tree with the reserved words extracted in the previous step.
If in the end you only have one possibility left, that is your language.
If you reach the end of the program too soon, then (from where you stopped) you can analyse your position in the tree to work out which languages are still possibilities.
You could maybe try to think about the differences between languages and model them with a binary decision tree, like "is feature X found?" If yes, proceed in one direction; if not, proceed in the other.
By constructing this decision tree efficiently you could end up with rather fast code.
This one is not fast and may not satisfy your requirements, but it's just an idea. It should be easy to implement and should give close to a 100% accurate result.
You could try to compile/execute the input text with different compilers/interpreters (open source or free) and check for errors behind the scenes.
The Sequitur algorithm infers context-free grammars from sequences of terminal symbols. Perhaps you could use that to compare against a set of known production rules for each language.

How to start modifying a big project

I have to make enhancements to an existing C++ project with over 100k lines of code.
My question is: how and where do I start with such a project?
The problem is compounded further if the code is not well documented.
Are there any automated tools for studying code flow with large projects?
Thanks,
Use Source Control before you touch anything!
There's a book for you: Working Effectively with Legacy Code
It's not about tools, but about various approaches, processes and techniques you can use to better understand and make changes to the code. It is even written from a mostly C++ perspective.
First study the existing interface well.
Write tests if they are absent, or expand already written ones.
Modify the source code.
Run the tests to check whether the modification breaks the existing behaviour.
There is another good book, currently freely available on the net, about object-oriented reengineering: http://www.iam.unibe.ch/~scg/OORP/
The book "Code Reading" by Diomidis Spinellis contains lots of advice about how to gain an overview and in-depth knowledge about larger, unknown projects.
Chapter 6 focuses solely on that topic (Tackling Large Projects). The chapters about tooling (Ch. 9) and architecture (Ch. 8) might also contain useful hints for you.
However, the book is about understanding the code by reading it; it does not directly tackle the maintenance step.
First thing I would do is try to find the product's requirements.
It's almost unthinkable that a product of this size would be developed without requirements.
By perusing the requirements, you'll be able to:
get a sense of what the product (and hence the code) is at least supposed to be doing
see just how well (or poorly) the code actually fulfills those requirements
Otherwise you're just looking at code, trying to divine the intention of the developers...
If you are able to run the code on a PC, you can try to build a call graph, usually from profiling output.
Cross-referencing tools like cscope, ctags, lxr, etc. can also help a lot.
Spending some time reading, building class diagrams, or even adding comments to the parts of the code that took you a long time to understand are all steps towards getting familiar with the codebase and getting ready to modify/extend it.
The first thing you need to do is understand how the code works. Read what documentation there is and then watch the program operate under a debugger. If you watch the main function/loop and then slowly work your way deeper into the program, you can gain a pretty good idea how things are operating. Make sure you write down your findings so others who follow after you have a better position to start from.
Run Doxygen with the EXTRACT_ALL option set to document all the relationships in the code base. It's not going to help you with the code flow, but hopefully it will shed some light on the structure and design of the entire application.
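For example, a minimal Doxyfile fragment along these lines (standard Doxygen option names; the graph options additionally require Graphviz):

    # Doxyfile fragment: document everything, even undocumented entities,
    # and draw class relationship graphs.
    EXTRACT_ALL    = YES
    EXTRACT_STATIC = YES
    RECURSIVE      = YES
    HAVE_DOT       = YES
    CLASS_GRAPH    = YES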
A very good Austrian programmer once told me that in order to understand a program you first have to understand the data structures that the program uses.