Online (as opposed to bulk processed) data mining packages [closed] - data-mining

By "bulk processed" I mean a static data set of facts (as in a CSV) processed all at once to extract knowledge. While "online", it uses a live backing store: facts are added as they happen ("X buys Y") and queries happen on this live data ("what would you reccomend to a person who is looking at y right now?").
I have (mis)used the term real-time, but I dont mean that results must come within a fixed time. ('''Edit: replaced real-time with online above''')
I have in mind a recommendation engine which uses live data. However all online resources (such as SO questions) I encountered make no distinction between real-time and bulk processing data mining packages. I had to search individually:
Carrot2 which reads from Lucene/Solr and other live datasets (online)
Knime which does scheduled execution on static files (bulk)
Mahout which runs on Hadoop (and Pregel-based Giraph in future) (online?)
a commercial package that integrates with Cassandra (online?)
What are the online data-mining packages?
Is there a reason why the literature makes no distinction between online and bulk processing packages? Or is all practical data-mining actually bulk operation in nature?

For some algorithms, online versions are available. For example, for LOF, the local outlier factor, there is an online variant. I believe there are also online variants of k-means (in fact, the original MacQueen version can be seen as "online", although most people turn it into an offline version by reiterating it until convergence), but see below for the problem with the k parameter.
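To make the "online k-means" idea concrete, here is a minimal C++ sketch of a MacQueen-style single-pass update; the class and function names are my own, purely illustrative. Each incoming point updates exactly one centroid, with no full reiteration over the data set.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative MacQueen-style online k-means update.
struct OnlineKMeans {
    std::vector<std::vector<double>> centroids; // k centroids, d dimensions each
    std::vector<std::size_t> counts;            // points assigned to each centroid so far

    explicit OnlineKMeans(std::vector<std::vector<double>> init)
        : centroids(std::move(init)), counts(centroids.size(), 0) {}

    void update(const std::vector<double>& x) {
        // Find the nearest centroid (squared Euclidean distance).
        std::size_t best = 0;
        double bestDist = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < centroids.size(); ++j) {
            double d = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) {
                double diff = x[i] - centroids[j][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = j; }
        }
        // Move only that centroid toward the new point: c += (x - c) / n.
        ++counts[best];
        for (std::size_t i = 0; i < x.size(); ++i)
            centroids[best][i] += (x[i] - centroids[best][i]) / counts[best];
    }
};
```

Note that k stays fixed for the whole run, which is exactly the limitation discussed below: a new cluster can never appear.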
However, online operation often comes at a significant performance cost, up to the point where it is faster to rerun the full algorithm on a snapshot every hour than to update the results continuously. Think of internet search engines: most large-scale search engines still do not offer truly "online" results; instead you query the last index that was built, probably a day or more ago.
Plus, online operation needs a significant amount of additional work. It is easy to compute a distance matrix; it is much harder to update it online by adding and removing columns and to keep all dependent results synchronized. In general, most data-mining results are just too complex to update this way. It is easy to compute the mean of a data stream, for example, but often there is simply no known way to update a result without rerunning the (expensive) process. In other situations you would even need to change the algorithm parameters: at some point a new cluster may form, but k-means is not designed to let new clusters appear. So essentially you can't just write an online version of k-means; it would be a different algorithm, as it would need to modify the input parameter k dynamically.
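As a contrast to the k-means case, here is how trivial the "mean of a data stream" example mentioned above is, including removal of an old value; the names are illustrative, not from any particular package.

```cpp
#include <cstddef>

// Running mean that supports both adding and removing values (illustrative sketch).
// The mean is one of the few results that is cheap to keep up to date online.
class RunningMean {
    double sum = 0.0;
    std::size_t n = 0;
public:
    void add(double x)    { sum += x; ++n; }
    void remove(double x) { sum -= x; --n; }  // only remove values that were added before
    double mean() const   { return n ? sum / n : 0.0; }
};
```

For a distance matrix or a clustering there is no comparably cheap update rule, which is exactly the problem described above.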
So usually an algorithm is inherently either online or offline, and a software package cannot turn an offline algorithm into an online one.

Online data-mining algorithms compute results in (near) real time, which usually implies that they are incremental: the model is updated each time a new training instance arrives, and no periodic retraining with a batch algorithm is needed. Many machine learning libraries, such as Weka, provide incremental versions of batch algorithms. Also check the MOA project and Spark Streaming. The literature does make a distinction between the two, although most of the "traditional" ML algorithms do not work in an online mode without infrastructure and computation optimizations.

Related

I want to import a CNN model trained in C++ (PyTorch framework) into VHDL to use it on a DE1-SoC FPGA, is there a way to do it? [closed]

So, I used a pre-trained AlexNet model and only changed the last layer's output and the weights of the fully connected layers. I did it in C++ using PyTorch. Now I want to use this model to predict what objects are on a webcam, and I need to use a DE1-SoC FPGA. Also, it should be processed on the FPGA itself.
My suggestion is to feed webcam images into this model when a button is pressed; the model will then give some number, and afterwards some simple procedures run on this number. So, the problem is how to import this model, or use this C++ model on the FPGA, or import it into the VHDL language with less pain?
Also, I found this video: https://www.youtube.com/watch?v=iyEZOx1YwmM&t=73s . Would it be helpful? I am new to FPGAs and VHDL, so I would appreciate any suggestions or examples with code.
So, the problem is how to import this model, or use this C++ model on the FPGA, or import it into the VHDL language with less pain?
"Less pain" makes the question already very subjective. However this is not the first time I see a request for converting'neural network' code to run on an FPGA.
Why an FPGA?
There seems to be the (false) perception that "on an FPGA things run much faster!"
An FPGA with moderately complex code can run at 100-200 MHz. Your CPU may have four cores running at 3.3 GHz, so you lose roughly a factor of 60 in raw speed (4 cores × 3.3 GHz is about 13.2 GHz of sequential throughput versus about 0.2 GHz).
In order for the FPGA to outperform your CPU you have to make up that speed-loss factor.
Where an FPGA can outperform a CPU is in parallel processing and pipelining. Thus your algorithm, which was compiled to run (sequentially) on four cores, must be rewritten to be split into at least 60 parts, and those must all run at the same time. So you need to build 60+ parallel engines, or 15 engines each consisting of four pipeline stages, or 4 engines with... (you get the gist)
To HDL
At the same time you have to convert your C++, Python or whatever code to HDL.
Some good progress is being made in that direction, but the input still has to adhere to many special rules and the results I have seen are not very good yet compared to manually written code. Most of it is big and slow. It is suited to cases where you are short on development time and have plenty of time/space resources on your FPGA.
Many articles have been written claiming that "an FPGA is well suited to implement a neural network". I suspect this is where the drive for generating FPGA code comes from. However, that does not mean your C++ PyTorch code can be automatically converted to a neural network in HDL code.
I/O.
I/O is often overlooked. In your case you want to process images from a webcam. So how do you plan to put those images into the FPGA? Believe me making a camera interface on an FPGA is not a trivial matter. Been there, done that! Even with pre-defined and tested cores it takes several weeks to get things running. If you want to pass data from e.g. a PC to the FPGA you will need some other interface. PCI-e comes to mind but add time for writing drivers. Then you have to learn how to get your data from such an FPGA PCI-e IP core and pass it into your neural network nodes.
Back to your question:
Can your code be converted to HDL and run on an FPGA? Maybe, I don't know. But I very much suspect you will have lots and lots of "pain" trying to do so.
Just a last remark: the video link you give shows how to run C++ code on an ARM processor. That ARM processor happens to be embedded alongside a lot of programmable gates, but none of those programmable gates are used. It does NOT convert "Hello world" into HDL.

How can I benchmark the performance of C++ code? [closed]

I am starting to study algorithms and data structures seriously, and I am interested in learning how to compare the performance of the different ways I can implement them.
For simple tests, I can get the time before/after something runs, run that thing 10^5 times, and average the running times. I can parametrize input by size, or sample random input, and get a list of running times vs. input size. I can output that as a csv file, and feed it into pandas.
I am not sure what caveats this approach has. I am also not sure what to do about measuring space complexity.
I am learning to program in C++. Are there humane tools to achieve what I am trying to do?
Benchmarking code is not easy. What I found most useful was the Google Benchmark library. Even if you are not planning to use it, it is worth reading some of its examples. It has many options for parametrizing tests, writing results to a file, and even reporting the Big O complexity of your algorithm (to name just a few). If you are at all familiar with the Google Test framework, I would recommend using it. It also gives you ways to keep compiler optimization under control, so you can be sure your code wasn't optimized away.
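For illustration, a minimal Google Benchmark micro-benchmark might look like the sketch below; the benchmarked function (summing a vector) is a made-up placeholder. benchmark::DoNotOptimize is the mechanism alluded to above for keeping the compiler from throwing the measured work away, and SetComplexityN/Complexity drive the Big O estimate.

```cpp
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

// Hypothetical code under test: sum a vector of the given size.
static void BM_VectorSum(benchmark::State& state) {
    std::vector<int> data(state.range(0), 1);
    for (auto _ : state) {
        int sum = std::accumulate(data.begin(), data.end(), 0);
        benchmark::DoNotOptimize(sum);   // keep the compiler from removing the work
    }
    state.SetComplexityN(state.range(0)); // lets the library estimate Big O
}
// Parametrize the input size, as the question asks for.
BENCHMARK(BM_VectorSum)->Range(1 << 10, 1 << 20)->Complexity();
BENCHMARK_MAIN();
```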
There is also a great talk about benchmarking code from CppCon 2015: Chandler Carruth, "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". It contains many insights into possible mistakes you can make (and it also uses Google Benchmark).
It is operating system and compiler specific (so implementation specific). You could use profiling tools, you could use timing tools, etc.
On Linux, see time(1), time(7), perf(1), gprof(1), pmap(1), mallinfo(3) and proc(5) and about Invoking GCC.
See also this. In practice, be sure that your runs are lasting long enough (e.g. at least one second of time in a process).
Be aware that optimizing compilers can drastically transform your program. See Matt Godbolt's CppCon 2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid".
Talking from an architecture point of view, you can also benchmark your C++ code using architectural tools such as Intel Pin or the perf tool. You can use these tools to study the architecture dependency of your code. For example, you can compile your code at different optimization levels and check the IPC/CPI, cache accesses and load-store accesses. You can even check whether your code suffers a performance hit due to library functions. The tools are powerful and can give you potentially huge insights into your code.
You can also try disassembling your code and study where your code spends most of the time and try and optimize that. You can look at different techniques to ensure that the frequently accessed data remains in the cache and thus ensure a high hit rate.
Say you realize that your code is heavily dominated by loops. You can then run it with different loop bounds and compare the metrics in the two cases. For example, set the loop bound to 100,000 and find the desired performance metric X, then set the loop bound to 200,000 and find the performance metric Y. Now calculate Y - X. This gives you much better insight into the behavior of the loops, because by subtracting the two metrics you have effectively removed the static effects of the code.
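A minimal sketch of that subtraction idea, using wall-clock time as the metric and a made-up kernel (all names illustrative):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical kernel whose per-iteration cost we want to isolate.
volatile std::uint64_t sink = 0;
void kernel(std::uint64_t bound) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < bound; ++i) acc += i * i;
    sink = acc;  // keep the work observable
}

double time_run(std::uint64_t bound) {
    auto t0 = std::chrono::steady_clock::now();
    kernel(bound);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    double x = time_run(100000);   // metric 'X'
    double y = time_run(200000);   // metric 'Y'
    // Y - X approximates the cost of the extra 100,000 iterations,
    // with the fixed (static) startup cost subtracted out.
    std::cout << "cost of the extra iterations: " << (y - x) << " s\n";
}
```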
Say you run your code ten times with different user input sizes. You can find the runtime per unit of input size, sort this new metric in ascending order, remove the first and last values (to remove outliers) and take the average. Finally, compute the coefficient of variation to understand how the run times behave.
On a side note, more often than not we use the term 'average' or 'arithmetic mean' rashly. Look at the metric you plan to average and consider harmonic, arithmetic and geometric means for each case. For example, taking the arithmetic mean of rates will give you incorrect answers, and simply taking the arithmetic mean of two events which do not occur equally often in time can also give incorrect results; use a weighted arithmetic mean instead.
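A tiny worked example of why the arithmetic mean is wrong for rates (the numbers are made up): driving two equal-distance legs at 30 km/h and 60 km/h gives an average speed of 40 km/h, the harmonic mean, not 45 km/h.

```cpp
#include <iostream>

int main() {
    // Two equal-distance legs driven at 30 km/h and 60 km/h (made-up numbers).
    double v1 = 30.0, v2 = 60.0;
    double arithmetic = (v1 + v2) / 2.0;             // 45 km/h -- wrong for rates
    double harmonic   = 2.0 / (1.0 / v1 + 1.0 / v2); // 40 km/h -- the actual average speed
    std::cout << "arithmetic mean: " << arithmetic << " km/h\n"
              << "harmonic mean:   " << harmonic   << " km/h\n";
}
```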

open source CRFs implementation for computer vision problems? [closed]

There are several open source implementations of conditional random fields (CRFs) in C++, such as CRF++, FlexCRF, etc. But from the manuals I can only understand how to use them for 1-D problems such as text tagging; it's not clear how to apply them to 2-D vision problems, assuming I have already computed the association potentials at each node and the interaction potentials at each edge.
Has anyone used these packages for vision problems, e.g. segmentation? Or can they simply not be used in this way?
All in all, is there any open source packages of CRFs for vision problems?
Thanks a lot!
The newest version of dlib has support for learning pairwise Markov random field models over arbitrary graph structures (including 2-D grids). It estimates the parameters in a max-margin sense (i.e. using a structural SVM) rather than in a maximum likelihood sense (i.e. CRF), but if all you want to do is predict a graph labeling then either method is just as good.
There is an example program that shows how to use this stuff on a simple example graph. The example puts feature vectors at each node and the structured SVM uses them to learn how to correctly label the nodes in the graph. Note that you can change the dimensionality of the feature vectors by modifying the typedefs at the top of the file. Also, if you already have a complete model and just want to find the most probable labeling then you can call the underlying min-cut based inference routine directly.
In general, I would say that the best way to approach these problems is to define the graphical model you want to use and then select a parameter learning method that works with it. So in this case I imagine you are interested in some kind of pairwise Markov random field model. In particular, the kind of model where the most probable assignment can be found with a min-cut/max-flow algorithm. Then in this case, it turns out that a structural SVM is a natural way to find the parameters of the model since a structural SVM only requires the ability to find maximum probability assignments. Finding the parameters via maximum likelihood (i.e. treating this as a CRF) would require you to additionally have some way to compute sums over the graph variables, but this is pretty hard with these kinds of models. For this kind of model, all the CRF methods I know about are approximations, while the SVM method in dlib uses an exact solver. By that I mean, one of the parameters of the algorithm is an epsilon value that says "run until you find the optimal parameters to within epsilon accuracy", and the algorithm can do this efficiently every time.
There was a good tutorial on this topic at this year's computer vision and pattern recognition conference. There is also a good book on Structured Prediction and Learning in Computer Vision written by the presenters.

Genetic programming in c++, library suggestions? [closed]

I'm looking to add some genetic algorithms to an operations research project I have been involved in. Currently we have a program that aids in optimizing some scheduling, and we want to add heuristics in the form of genetic algorithms. Are there any good libraries for generic genetic programming/algorithms in C++? Or would you recommend I just code my own?
I should add that while I am not new to C++, I am fairly new to doing this sort of mathematical optimization work in it; the group I worked with previously tended to use a proprietary optimization package.
We have a fitness function that is fairly computationally intensive to evaluate, and we have a cluster to run this on, so parallelized code is highly desirable.
So is C++ a good language for this? If not, please recommend some other ones, as I am willing to learn another language if it makes life easier.
thanks!
I would recommend rolling your own. 90% of the work in a GP is coding the genotype, how it gets operated on, and the fitness calculation. These are parts that change for every different problem/project. The actual evolutionary algorithm part is usually quite simple.
There are several GP libraries out there ( http://en.wikipedia.org/wiki/Symbolic_Regression#Implementations ). I would use these as examples and references though.
C++ is a good choice for GP because GP runs tend to be very computationally intensive. Usually the fitness function is the bottleneck, so it is worthwhile to at least make this part compiled/optimized.
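To back up the claim above that the evolutionary part itself is quite simple, here is a bare-bones generational GA loop in C++. The genotype, fitness function and operators are deliberately toy placeholders (a bit string and OneMax); in a real project those are where the work goes.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Toy genotype: a fixed-length bit string. Toy fitness: count of ones (OneMax).
using Genome = std::vector<int>;

int fitness(const Genome& g) {   // problem-specific in a real project
    return static_cast<int>(std::count(g.begin(), g.end(), 1));
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> bit(0, 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    const std::size_t popSize = 50, genomeLen = 32, generations = 100;
    std::vector<Genome> pop(popSize, Genome(genomeLen));
    for (auto& g : pop) for (auto& b : g) b = bit(rng);

    std::uniform_int_distribution<std::size_t> pick(0, popSize - 1);
    std::uniform_int_distribution<std::size_t> cut(1, genomeLen - 1);

    for (std::size_t gen = 0; gen < generations; ++gen) {
        std::vector<Genome> next;
        while (next.size() < popSize) {
            // Tournament selection of a parent (best of two random picks).
            auto parent = [&]() -> const Genome& {
                const Genome& a = pop[pick(rng)];
                const Genome& b = pop[pick(rng)];
                return fitness(a) > fitness(b) ? a : b;
            };
            // One-point crossover of two parents.
            Genome child = parent();
            const Genome& other = parent();
            std::size_t c = cut(rng);
            std::copy(other.begin() + c, other.end(), child.begin() + c);
            // Per-bit mutation.
            for (auto& b : child) if (unit(rng) < 0.01) b ^= 1;
            next.push_back(std::move(child));
        }
        pop = std::move(next);
    }
}
```

Swapping in a scheduling genotype, a real fitness function and cluster-parallel evaluation is where nearly all of the effort goes; the loop itself barely changes.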
I use GAUL.
It's a C library with everything you want:
(pthread/fork/OpenMP/MPI)
(various crossover/mutation functions)
(non-GA optimisation: hill climbing, Nelder-Mead simplex, simulated annealing, tabu search, ...)
Why build your own library when such powerful tools already exist?
I haven't used this personally yet, but the Age Layered Population Structure (ALPS) method has been used to generate human competitive results and has been shown to outperform several popular methods in finding optimal solutions in rough fitness landscapes. Additionally, the link contains source code in C++ FTW.
I have had similar problems. I once had a complicated problem where defining a solution in terms of a fixed-length vector was not desirable, and even a variable-length vector did not look attractive. Most libraries focus on cases where the cost function is cheap to calculate, which did not match my problem. Lack of parallelism is another of their pitfalls, and expecting the user to allocate memory for the library's use adds insult to injury. My cases were even more complicated because most libraries check nonlinear constraints before the evaluation, whereas I needed to check them during or after the evaluation, based on its result. It was also undesirable that I had to evaluate the solution to calculate its cost and then recalculate the solution to present it; in most cases I had to write the cost function twice, once for the GA and once for presentation.
Having all of these problems, I eventually, designed my own openGA library which is now mature.
This library is based on C++ and distributed under the free Mozilla Public License 2.0. This guarantees that using the library does not constrain your project, and it can be used for commercial or non-commercial purposes for free without asking for any permission. Not all libraries are transparent in this sense.
It supports three modes of single objective, multiple objective (NSGA-III) and Interactive Genetic Algorithm (IGA).
The solution is not mandated to be a vector. It can be any structure with any customized design containing any optional values with variable length. This feature makes this library suitable for Genetic Programming (GP) applications.
C++11 is used. Template feature allows flexibility of the solution structure design.
The standard library is enough to use this library. There is no dependency beyond that. The entire library is also a single header file for ease of use.
The library supports parallelism by default unless you turn it off. If you have an N-core CPU, the number of threads is set to N by default, but you can change that. You can also choose whether solution evaluations are distributed between threads equally or assigned to whichever thread has finished its job and is currently idle.
Solution evaluation is separated from calculation of the final cost. This means your evaluation function can simulate the system and keep a lot of information; your cost function is called later and reports the cost based on that evaluation. The evaluation results are kept so you can use them later, without having to recalculate them.
You can reject a solution at any time during the evaluation. No waste of time. In fact, the evaluation and constraint check are integrated.
The GA assist feature helps you produce the C++ code base from the information you provide.
If these features match what you need, I recommend having a look at the user manual and the examples of openGA.
The number of readers and citations of the related publication, as well as its GitHub stars, is increasing, and its usage keeps growing.
I suggest you have a look at the MATLAB optimization toolbox. It comes with GAs out of the box; you only have to code the fitness function (and possibly a function to generate the initial population), and I believe MATLAB has some C++ interoperability, so you could code your functions in C++. I am using it for my experiments, and a very nice feature is that you get all sorts of charts out of the box as well.
That said, if your aim is to learn about genetic algorithms you're better off coding one yourself, but if you just want to run experiments, MATLAB and C++ (or even just MATLAB) is a good option.

Any good C or C++ libraries out there for dealing with large point clouds? [closed]

Basically, I'm looking for a library or SDK for handling large point clouds coming from LIDAR or scanners, typically running into many millions of points of X, Y, Z, colour. What I'm after is as follows:
Fast display, zooming, panning
Point cloud registration
Fast low level access to the data
Regression of surfaces and solids (not as important as the others)
While I don't mind paying for a reasonable commercial library, I'm not interested in a very expensive library (e.g. in excess of about $5k) or one with a per user run-time license cost. Open source would also be good. I found a few possibilities via google, but they all tend to be too expensive for my budget.
Check out the Point Cloud Library (PCL). It is quite a complete toolkit for processing and manipulating point clouds. It also provides tools for point cloud visualisation, such as pcl::visualization::CloudViewer, which makes use of the VTK library and wxWidgets.
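A minimal sketch of loading and displaying a cloud with PCL's CloudViewer (the file name "scan.pcd" is a placeholder):

```cpp
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/visualization/cloud_viewer.h>

int main() {
    // Load a (hypothetical) PCD file of XYZRGB points.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZRGB>);
    if (pcl::io::loadPCDFile<pcl::PointXYZRGB>("scan.pcd", *cloud) == -1)
        return 1;

    // Simple interactive viewer: zooming and panning come for free.
    pcl::visualization::CloudViewer viewer("LIDAR cloud");
    viewer.showCloud(cloud);
    while (!viewer.wasStopped()) {}   // keep the window open
    return 0;
}
```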
Since 2011, a point cloud translation (read/write) and manipulation toolkit has also been developed: PDAL, the Point Data Abstraction Library.
I second the call for R which I interface with C++ all the time (using e.g. the Rcpp and RInside packages).
R prefers all data in memory, so you probably want a 64-bit OS and a decent amount of RAM for lots of data. The Task View on High-Performance Computing with R has some pointers on dealing with large data.
Lastly, for quick visualization, the hexbin package is excellent for visually summarizing large data sets. For the zooming etc. aspect, try the rgl package.
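For the C++/R bridge mentioned above, a minimal Rcpp sketch might look like this; the function name and the idea of computing a bounding box are purely illustrative. From R it can be compiled and called with Rcpp::sourceCpp("bbox.cpp") followed by boundingBox(cloud$x, cloud$y, cloud$z).

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// Illustrative helper: bounding box of a large point cloud passed in from R.
// [[Rcpp::export]]
NumericVector boundingBox(NumericVector x, NumericVector y, NumericVector z) {
    double xmin = min(x), xmax = max(x);
    double ymin = min(y), ymax = max(y);
    double zmin = min(z), zmax = max(z);
    return NumericVector::create(xmin, xmax, ymin, ymax, zmin, zmax);
}
```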
Why don't you have a look at the R programming language, which can link directly to C code, thereby forming a bridge? R was developed with statistical code in mind but can very easily help not only to handle large datasets but also to visualize them. Quite a number of atmospheric scientists use R in their work; I know because I work with them on exactly the stuff you're trying to do. Think of R as a poor man's MATLAB or IDL (though it soon won't be).
In the spirit of the R answers, ROOT also provides a good underlying framework for this kind of thing.
Possibly useful features:
C++ code base, with the CINT C++ interpreter as the working shell; Python bindings.
Can display three-dimensional point clouds.
A set of geometry classes (though I don't believe that they support all the operations that you need)
Developed by nuclear and particle physicists instead of by statisticians :p
Vortex by Pointools can go up to much higher numbers of points than the millions that you ask for:
http://www.pointools.com/vortex_intro.php
It can handle files of many gigabytes containing billions of points on modest hardware.