KNIME: How to achieve multi split by row - data-mining

I would like to split a data set into multiple data sets of 1,000 rows each. How is this possible?
The Row Splitter node has only two outputs. Let me know if there is any way to use a Java snippet for this requirement.

It is not entirely clear how you want to split the table, but there are two loop types that might do what you are looking for: Chunk Loop (Start) or Group Loop (Start). Your workflow would probably look like this:
[(Chunk/Group) Loop Start] --> Your processing nodes of the selected rows --> [Loop End]
In the part Your processing nodes of the selected rows you will only see the split parts you need.
The difference between the two nodes is the following: Chunk Loop Start collects rows into a group by their position (consecutive rows belong to the same group until the requested number of rows is consumed), while Group Loop Start collects rows with the same properties into the same collection for processing. (The Loop End node might not be the best fit depending on your processing requirements; in that case look for other Loop End nodes.)
In case these are not sufficient, you might try the parallel chunk loop nodes; as far as I remember, there are also bagging, ensemble, and cross-validation (X-Validation) nodes in some extensions. (For more complex workflows you can also use recursive loops.) There is also support for feature elimination.

Related

Map-reduce : work on multiple lines

I have a requirement where I need to work on multiple rows of input data: first sort the data, then subtract the value in one row from the value in the next row, and so on. Can we do this operation in MapReduce somehow?
You can write your own custom RecordReader that sends your desired number of records to each map task, and perform the calculations there.
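For illustration, here is a hedged sketch of such a RecordReader that wraps Hadoop's LineRecordReader and concatenates a fixed number of lines per record; the class name MultiLineRecordReader and the constant N_LINES are my own illustrative inventions, not standard Hadoop classes:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hands N consecutive lines to each map() call as a single Text value.
public class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
    private static final int N_LINES = 3; // illustrative: rows per map() call
    private final LineRecordReader lineReader = new LineRecordReader();
    private LongWritable key;
    private Text value;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        StringBuilder buffer = new StringBuilder();
        int read = 0;
        while (read < N_LINES && lineReader.nextKeyValue()) {
            if (read == 0) {
                // Key of the group = offset of its first line.
                key = new LongWritable(lineReader.getCurrentKey().get());
            }
            buffer.append(lineReader.getCurrentValue().toString()).append('\n');
            read++;
        }
        if (read == 0) {
            return false; // end of the split
        }
        value = new Text(buffer.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}

You would also need a small InputFormat subclass whose createRecordReader() returns this reader; the mapper can then sort and subtract within each multi-line value.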

Horizontal stretching in ListRenderer

I have a list that should display 7 items that each look like this:
Date Weekday Distance Time
Long text that may span many lines
two column text Distance Time
two column text Distance Time
two column text Distance Time
The last lines repeat a number of times depending on the data, i.e., there may be a different number of such lines for each list item.
I have tried implementing this with a ListCellRenderer that creates a table according to the requirements above, but I have a few problems with it:
The long text that may span many lines is implemented in a SpanLabel, but this text will not display more than one line anyway.
Each item in the list will get space for the same number of lines below the first two.
So it seems that items in a list must be of the same size.
Later I also want to be able to detect selection on the entire list item, not just individual fields of it.
Is there a better way to do this?
How do I ensure that the SpanLabel actually gets as much space as it needs?
How do I ensure that the unknown number of lines gets the space they need, depending on how many they are?
Don't use a list: https://www.codenameone.com/blog/deeper-in-the-renderer.html
Lists in Codename One assume every entry is exactly the same height and provide no flexibility here.
I suggest doing something like the property cross demo: https://www.udemy.com/learn-mobile-programming-by-example-with-codename-one/
There we use a Container with components within it to provide list-like behavior with the full flexibility that arbitrary components allow.
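As a rough sketch (assuming a reasonably recent Codename One API; the entry fields below are made up to match the question), each list entry becomes an ordinary Container, so it can be as tall as its content requires:

import com.codename1.components.SpanLabel;
import com.codename1.ui.Container;
import com.codename1.ui.Form;
import com.codename1.ui.Label;
import com.codename1.ui.layouts.BoxLayout;

public class EntryList {
    // Builds one "row"; since this is a plain Container, its height is
    // driven by its content, unlike a List cell.
    static Container createEntry(String header, String longText, String[] detailLines) {
        Container entry = new Container(BoxLayout.y());
        entry.add(new Label(header));          // Date Weekday Distance Time
        entry.add(new SpanLabel(longText));    // wraps over as many lines as it needs
        for (String line : detailLines) {
            entry.add(new Label(line));        // variable number of detail lines
        }
        // A pointer listener on 'entry' can treat the whole row as one
        // selectable unit.
        return entry;
    }

    public void show() {
        Form f = new Form("Entries", BoxLayout.y()); // scrollable column of entries
        for (int i = 0; i < 7; i++) {
            f.add(createEntry("2014-05-0" + (i + 1) + " Thu 10km 50min",
                    "Long text that may span many lines",
                    new String[] { "two column text 10km 50min" }));
        }
        f.show();
    }
}

Because each entry is measured independently, the SpanLabel and the variable number of detail lines get exactly the space they need.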

Storm and stop words

I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html).
I am testing my code locally, and I think it would perform better if I removed stop words, but I searched online and could not find any example of removing stop words in Storm.
If the stop-word list is small enough to fit in memory, the most straightforward approach is to simply filter the tuples with an implementation of a Storm Filter that knows that list. This Filter could poll the DB every so often to get the latest list of stop words, if the list evolves over time.
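As a minimal sketch of that in-memory approach, assuming a Trident topology (the field name "word" and the hard-coded stop-word set are illustrative; in practice you would load the set from your DB):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

public class StopWordFilter extends BaseFilter {
    // Illustrative hard-coded list; load from persistence in real code.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "of", "and"));

    @Override
    public boolean isKeep(TridentTuple tuple) {
        // Keep the tuple only if its word is not a stop word.
        return !STOP_WORDS.contains(tuple.getStringByField("word"));
    }
}

You would attach it with something like stream.each(new Fields("word"), new StopWordFilter()). (In newer Storm releases the package is org.apache.storm.trident.operation instead of storm.trident.operation.)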
If the stop-word list is bigger, then you can use a QueryFunction, called from your topology with the stateQuery function (sketched after the list below), which would:
receive a batch of tuples to check (say 10000 at a time)
build a single query from their content and look up corresponding stop words in persistence
attach a boolean to each tuple specifying what to do with it
then add a Filter right after that to filter based on that boolean.
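A hedged sketch of that batched lookup, using Trident's BaseQueryFunction (StopWordState and its bulk-lookup method are hypothetical placeholders for whatever persistence you use):

import java.util.ArrayList;
import java.util.List;
import storm.trident.operation.TridentCollector;
import storm.trident.state.BaseQueryFunction;
import storm.trident.state.State;
import storm.trident.tuple.TridentTuple;
import backtype.storm.tuple.Values;

// Hypothetical State backed by your stop-word store.
interface StopWordState extends State {
    List<Boolean> areStopWords(List<String> words); // one bulk query
}

public class StopWordLookup extends BaseQueryFunction<StopWordState, Boolean> {
    @Override
    public List<Boolean> batchRetrieve(StopWordState state, List<TridentTuple> inputs) {
        // One round trip to persistence for the whole batch of tuples.
        List<String> words = new ArrayList<String>();
        for (TridentTuple t : inputs) {
            words.add(t.getStringByField("word"));
        }
        return state.areStopWords(words);
    }

    @Override
    public void execute(TridentTuple tuple, Boolean isStopWord, TridentCollector collector) {
        // Attach the boolean as a new field; a Filter downstream drops
        // tuples where it is true.
        collector.emit(new Values(isStopWord));
    }
}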
And if you feel adventurous:
Another, faster approach would be to use a Bloom filter approximation. I have heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is, nor do I have any experience in practically plugging it into Storm (maybe Sunday if it's rainy...).
Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm, and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the list of stop words from persistence every time, so maybe that would work too. This also assumes the size of the stop-word list is relatively small, though.

C++, determine the part that have the highest zero crosses

I'm not a specialist in signal processing. I'm doing simple processing on a 1D signal using C++. I really want to know how I can determine the part that has the highest zero-crossing rate (i.e., the highest frequency). Is there a simple way or method to find the beginning and the end of this part?
This image illustrates the form of my signal, and this image shows what I need to do (the two indices of beginning and end).
Edited:
Actually I have no prior idea about the width of the beginning and the end; it is quite variable.
I can calculate the number of zero crossings, but I have no idea how to determine its range:
#include <vector>

// Count the sign changes between consecutive samples.
int calculateZC(const std::vector<double>& signals){
    int ZC_counter = 0;
    int size = (int) signals.size();
    for (int i = 0; i < size - 1; i++){
        // A zero crossing occurs when the sign flips between sample i and i+1.
        if ((signals[i] >= 0 && signals[i+1] < 0) || (signals[i] < 0 && signals[i+1] >= 0)){
            ZC_counter++;
        }
    }
    return ZC_counter;
}
Here is a fairly simple strategy which might give you a starting point. The outline of the algorithm is as follows:
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start and end point {t0,t1} of the region with the most zero crossings
I won't give any C++ code, but the method should be easy to implement. As an example, let us use the following function:
What we want is the region between about 480 and 600, where the zero-crossing density is higher than at the front. The first step of the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, you store the values of i at which you found a zero crossing.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here you can use whatever you like; in the simplest case, a moving average or a Gaussian filter. The larger you choose sigma, the bigger your look-around window becomes, which measures how many zero crossings are around a certain point. Here is the output of this filter together with the original zero positions; note that I used a Gaussian filter of size 10.
In the next step, you go through the filtered data and find the maximum value. In this case it is about 0.15. Now you choose your second parameter, which is some percentage of this maximum. Let's say p=0.6.
The final step is to go through the filtered data, and when the value is greater than p, you start recording a new region. As soon as the value drops below p, you end this region and remember its start and end point. Once you have finished walking through the data, you are left with a list of regions, each defined by a start and an end point. Now you choose the region with the biggest extent and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extent is the last one, {476,590}. The final result looks like this (with 1/2 filter-width padding):
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just a few loops (sketched after the list below):
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving average filter (or you use some library Gaussian filter). Here you can at the same time extract the maximum value
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step
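Here is a hedged sketch of those loops in Java (the question asks for C++, but the port is mechanical), using a simple moving average instead of a Gaussian filter:

public class ZeroCrossingRegion {
    // Returns {start, end} of the widest region whose smoothed
    // zero-crossing density exceeds p times the maximum density.
    static int[] densestRegion(double[] y, int sigma, double p) {
        int n = y.length;

        // Loop 1: mark every zero-crossing position with a 1.
        double[] marks = new double[n];
        for (int i = 0; i < n - 1; i++) {
            if ((y[i] >= 0 && y[i + 1] < 0) || (y[i] < 0 && y[i + 1] >= 0)) {
                marks[i] = 1.0;
            }
        }

        // Loop 2: moving-average filter of half-width sigma, tracking the maximum.
        double[] smooth = new double[n];
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            int count = 0;
            for (int j = Math.max(0, i - sigma); j <= Math.min(n - 1, i + sigma); j++) {
                sum += marks[j];
                count++;
            }
            smooth[i] = sum / count;
            max = Math.max(max, smooth[i]);
        }

        // Loops 3+4: extract regions above the threshold, keeping the widest.
        double threshold = p * max;
        int bestStart = 0, bestEnd = 0, start = -1;
        for (int i = 0; i < n; i++) {
            if (smooth[i] > threshold) {
                if (start < 0) start = i;          // region begins
            } else if (start >= 0) {
                if (i - start > bestEnd - bestStart) {
                    bestStart = start;
                    bestEnd = i;
                }
                start = -1;                         // region ends
            }
        }
        if (start >= 0 && n - start > bestEnd - bestStart) {
            bestStart = start;                      // region ran to the end
            bestEnd = n;
        }
        return new int[] { bestStart, bestEnd };
    }
}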

MapReduce - Emitting the top 20% occurring words in the document

I was reading about MapReduce here, and the first example they give is counting the number of occurrences of each word in a document. I was wondering: suppose you wanted to get the top 20% most frequently occurring words in the document, how can you achieve that? It seems unnatural, since each node in the cluster cannot see the whole file, just the list of all occurrences of a single word.
Is there a way to achieve that?
Yes, you certainly can achieve this: by forcing Hadoop to use just a single reducer (though with this approach you lose the advantage of distributed computing per se).
This can be done as follows:
// Configuring mapred to have just one reducer
conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1);
conf.setInt("mapred.reduce.tasks", 1);
Now, since you have just one reducer, you can keep track of the top 20% and emit them in the run() or cleanup() method of the reducer, as sketched below. See here for more.
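Here is a hedged sketch of such a reducer (class and variable names are my own; note that it buffers all counts in memory, which the single-reducer approach implies anyway):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopWordsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Buffers every word's total count instead of emitting it immediately.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        counts.put(key.toString(), sum);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Sort by descending count and emit only the top 20%.
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        sorted.sort((a, b) -> b.getValue() - a.getValue());
        int top = Math.max(1, sorted.size() / 5); // 20% of distinct words
        for (int i = 0; i < top; i++) {
            context.write(new Text(sorted.get(i).getKey()),
                    new IntWritable(sorted.get(i).getValue()));
        }
    }
}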