Algorithm/Programming for processing - mapreduce

I am using Spark Streaming (coding in Java), and I want to understand how to design an algorithm for the following problem. I am relatively new to map-reduce and need some assistance in designing the algorithm.
Here is the problem in detail.
Problem Details
Input:
The initial input text is:
(Pattern),(timestamp, message)
(3)(12/5/2014 01:00:01, message)
where 3 is a pattern type
I have converted this into a DStream with key = (P1, P2), where P1 and P2 are pattern classes for the input line, and value = (pattern_id, timestamp, message). Each tuple of the DStream is therefore as follows:
Template: (P1,P2),(pattern_id, timestamp, string)
Example: (3,4),(3, 12/5/2014 01:00:01, message)
Here 3 and 4 are pattern types which are paired together.
I have a model in which each pair has an associated time difference. The model is stored in a HashMap with key-value pairs such as:
Template: (P1,P2)(Time Difference)
Example: (3,4)(2:20)
Here the allowed time difference for the pair is 2:20: if two messages of pattern types 3 and 4 respectively appear in the stream, the program should output an anomaly when the difference between the two messages is greater than 2:20.
What would be the best way to model this in Spark Streaming?
What I have tried so far
I created a DStream as shown in step 2 from the input in step 1.
I created a broadcast variable for sending the map learnt in the model (step 3 above) to all the workers.
I am stuck trying to figure out how to generate the anomaly stream in Spark Streaming; in particular, I cannot see how to express the check as an associative function suitable for a stream operation.
Here is the current code: https://gist.github.com/nipunarora/22c8e336063a2a1cc4a9
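To make the check concrete, here is the per-key logic I am trying to express, written as untested Python pseudocode (my real code is in Java; this assumes the timestamps are already parsed to epoch seconds and that the model gap 2:20 has been converted to seconds, e.g. 140 if it means 2 minutes 20 seconds):

def find_anomalies(key, batch, model):
    # key   = (p1, p2) pattern pair, e.g. (3, 4)
    # batch = list of (pattern_id, timestamp, message) tuples for this key
    # model = {(p1, p2): allowed_gap_in_seconds}, e.g. {(3, 4): 140}
    allowed = model.get(key)
    if allowed is None:
        return []
    anomalies = []
    last_first = None  # timestamp of the most recent message of pattern key[0]
    for pattern_id, ts, message in sorted(batch, key=lambda m: m[1]):
        if pattern_id == key[0]:
            last_first = ts
        elif pattern_id == key[1] and last_first is not None and ts - last_first > allowed:
            anomalies.append((key, ts - last_first, message))
    return anomalies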

Related

Map-Reduce with a wait

The concept of map-reduce is very familiar to me. It seems like a great fit for a problem I'm trying to solve, but either it is missing something or I lack enough understanding of the concept.
I have a stream of items, structured as follows:
{
  "jobId": 777,
  "numberOfParts": 5,
  "data": "some data..."
}
I want to do a map-reduce on many such items.
My mapping operation is straightforward - take the jobId.
My reduce operation is irrelevant for this phase; all we need to know is that it takes multiple strings (the "some data..." parts) and somehow reduces them to a single object.
The only problem is - I need all five parts of this job to complete before I can reduce all the strings into a single object. Every item has a "numberOfParts" property which indicates the number of items I must have before I apply the reduce operation. The items are not ordered, therefore I don't have a "partId" field.
Long story short - I need some kind of waiting mechanism that waits for all parts of the job to arrive before initiating the reduce operation, and this waiting mechanism has to rely on a value that exists within the payload (therefore solutions like Kafka wouldn't work).
Is there a way to do that, hopefully using a single tool/framework?
I only want to write the map/reduce part and the "waiting" logic, the rest I believe should come out of the box.
**** EDIT ****
I'm currently in the design phase of the project and therefore not using any framework (such as Spark, Hadoop, etc.).
I asked this because I wanted to find out the best way to tackle this problem.
"Waiting" is not the correct approach.
Assuming your jobId is the key, and data contains some number of parts (zero or more), you need multiple reducers: one that gathers all parts of the same job, and another that processes only those jobs whose collection of parts is greater than or equal to numberOfParts, ignoring the others.
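Since you are still in the design phase, here is a minimal single-process sketch of that gather-then-check idea in Python (all names are illustrative, not from any particular framework):

from collections import defaultdict

parts_by_job = defaultdict(list)

def on_item(item, reduce_fn):
    # item looks like {"jobId": 777, "numberOfParts": 5, "data": "some data..."}
    job_id = item["jobId"]
    parts_by_job[job_id].append(item["data"])
    # The "waiting" is implicit: nothing happens until the count matches.
    if len(parts_by_job[job_id]) >= item["numberOfParts"]:
        reduce_fn(parts_by_job.pop(job_id))  # the second "reducer" fires once per complete job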

Android/Java: Is there a fast way to filter large data saved in a list? And how to get high-quality pictures with small storage space on the server?

I have two questions.
The first one is: I have a large amount of data coming from the server, which I save in a list. The customer can filter this data through 7 filters plus two text watchers, and each filtering operation is slow: it takes about 4 seconds every time.
I tried putting the filter keywords (like length or width) in one if statement with && between them, but it didn't give me a result; I also tried replacing the TextWatcher with a Spinner, but that wasn't useful either.
I'm using a single for loop.
So the question is: how can I apply multiple filters to a list containing up to 2000 rows with minimal or zero delay?
The second is: I store 2 to 8 pictures on the server in string form. The question is: when I get these pictures back from the server, how can I show them in high quality? When I show them now I can see the pixels, and this is not good for the customer. I don't want these pictures to take up much space on the server, but at the same time I want good quality when I retrieve them for display.
I'm using Android/Java.
Thank you.
The answer to my first question: if you want filtering (as when you use an online clothes shop and filter by lowest price), use a HashMap-based index rather than an ordinary list; it will be faster because a lookup replaces a full scan.
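For example, here is the indexing idea in language-neutral pseudocode (Python syntax; products, selected_category, min_length and min_width are placeholders, and in Java the index would be a HashMap<String, List<Product>> built the same way):

index = {}
for product in products:  # build the index once, after the data arrives from the server
    index.setdefault(product.category, []).append(product)

# A filter on category is now a lookup instead of a full scan, and the
# remaining predicates run in one pass over the much smaller bucket:
results = [p for p in index.get(selected_category, [])
           if p.length >= min_length and p.width >= min_width]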
The answer to my second question: if you want to store images in a database, save each one as a link (URL) to the image file, not as a string-encoded image or any other datatype.

Proper Python data structure for real-time analysis?

Community,
Objective: I'm running a Pi project (i.e. Python) that communicates with an Arduino to get data from a load cell once a second. What data structure should I use to log (and do real-time analysis) on this data in Python?
I want to be able to do things like:
Slice the data to get the value of the last logged datapoint.
Slice the data to get the mean of the datapoints for the last n seconds.
Perform a regression on the last n data points to get g/s.
Remove from the log data points older than n seconds.
Current Attempts:
Dictionaries: I have appended a new key with a rounded time to a dictionary (see below), but this makes slicing and analysis hard.
import time

log = {}

def log_data():
    # read_data() is the existing routine that polls the load cell
    log[round(time.time(), 4)] = read_data()
Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow their solution (i.e. storing in a dictionary and df.append()-ing in bulk every few seconds) because I want my rate calculations (regressions) to be in real time.
This question (ECG Data Analysis on a real-time signal in Python) seems to have the same problem as mine, but with no real solutions.
Goal:
So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something everyone would need to do, so I imagine there has to be pre-built functionality for this?
Thanks,
Michael
To start, I would question two assumptions:
You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?
Anyway, here's a very naive approach using a list. It satisfies your needs. Performance may become a problem depending on how many of the previous data points you need to store.
Also, you may not have thought of this, but do you need the full record of past data? Or can you just drop stuff?
import time

data = []  # list of (timestamp, value) tuples in arrival order

# new data comes in (timestamp and value come from the load cell reading;
# n below is the window length in seconds)
new_observation = (timestamp, value)
data.append(new_observation)

current_time = time.time()

# Slice the data to get the value of the last logged datapoint.
last_value = data[-1][1]

# Slice the data to get the mean of the datapoints for the last n seconds.
recent_values = [v for (t, v) in data if current_time - t < n]
mean_recent = sum(recent_values) / len(recent_values)

# Perform a regression on the last n data points to get g/s.
regression_function(data[-n:])  # regression_function is your own routine

# Remove from the log data points older than n seconds.
data = [(t, v) for (t, v) in data if current_time - t < n]
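Building on the first point above (one sample per second means the last N points are the last N seconds), a collections.deque with maxlen also answers the "can you just drop stuff?" question, since it discards old entries automatically. A small sketch, reusing the (timestamp, value) observations from above:

from collections import deque

window = deque(maxlen=60)          # keeps roughly the last 60 seconds of samples

window.append((timestamp, value))  # oldest entry falls off automatically when full

latest_value = window[-1][1]                 # last logged datapoint
values = [v for (t, v) in window]            # all values currently in the window
mean_recent = sum(values) / len(values)      # mean over the window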

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting their MFCC features and saving the features to a text file. I have 20 speech files (10 of each word) and I generate a text file per recording, so 20 text files containing the MFCC features. Each file is a 13x56 matrix (13 coefficients by 56 frames).
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well, but have not found a really good explanation of the concepts.
Any simpler way using LSTM's would also be welcome.
There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.
Theano is too low-level; you might try Keras instead, as described in its tutorial. You can run the tutorial as-is to understand how things work.
Then you need to prepare a dataset. You need to turn your data into sequences of data frames, and for every data frame in a sequence you need to assign an output label.
Keras supports two types of RNN layers - layers returning sequences and layers returning single values. You can experiment with both; in code you just set return_sequences=True or return_sequences=False.
To train with sequences you can assign a dummy label to all frames except the last one, where you assign the label of the word you want to recognize. You need to place the input and output labels into arrays, so it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and the word ID for the final frame.
To train with just labels you also place the input and output labels into arrays, but the output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that the output is vectorized (np_utils.to_categorical) to turn the labels into one-hot vectors instead of plain numbers.
Then you create the network architecture. You can have 13 floats for input and a vector for output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use too-big layers; start with small ones.
Then you feed this dataset into model.fit and it trains the model. You can estimate model quality on a held-out set after training.
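To make the "just labels" variant concrete, here is a minimal, untested Keras sketch. It assumes your 20 files are stacked into one array and that each 13x56 matrix is transposed to shape (56, 13), since Keras expects (timesteps, features); the random X here is only a stand-in for your real MFCC data:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.utils import np_utils

X = np.random.rand(20, 56, 13)         # stand-in for 20 files of 56 frames x 13 MFCCs
y = np.array([0] * 10 + [1] * 10)      # word IDs: 0 = "Hello", 1 = "Goodbye"
Y = np_utils.to_categorical(y, 2)      # one-hot vectors, as mentioned above

model = Sequential()
model.add(LSTM(32, input_shape=(56, 13)))  # return_sequences=False: one output per file
model.add(Dense(2, activation='softmax'))  # two classes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=20, batch_size=4)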
You will have a problem with convergence since you have just 20 examples. You need far more examples, preferably thousands, to train an LSTM; with this little data you will only be able to use very small models.

How to develop a video combiner/builder filter in directshow

I am trying to build a filter.
It should have 3 video inputs as well as 1 audio input and build a video file according to a fixed schema.
An example of this schema might be: "Play 3 seconds of the first source; then play 3 seconds of the second source; play 3 seconds of the third source; repeat"
There are some tutorials on the web on how to build filters, but I have some questions:
Is it correct to use a transform filter base class for this project?
Do I need to create custom pin classes?
In which function is the actual video from the source passed to the filter, where I can grab it?
How do I make some kind of synchronisation between the pins?
Assuming I had only one source: Could I just copy the value of the input sample to the output sample?
How do I send data to the output pin?
Is it correct to use a transform filter base class for this project?
No, explained here: DirectShow Filter: Transform
Do I need to create custom pin classes?
Most likely. You need media type checking, and then you will want to pass data to the filter class along with an identification of which pin it was received on.
In which function is the actual video from the source passed to the filter where i can grab it?
The earliest point where you have the data under your control is the IPin::Receive method on the input pin class.
How do I make some kind of synchronisation between the pins?
It's completely up to you: you are expected to implement some sort of input queues, then match data from the input queues to produce output. You are responsible for blocking execution on a pin if you want it to wait until the other input streams catch up and supply their data.
Assuming I had only one source: Could I just copy the value of the input sample to the output sample?
Input and output data come as media samples - objects that belong to allocators. The actual copying depends on whether the pin allocators are the same or different, and in the latter case on whether they are compatible. All in all, yes, you can copy the data.
How do I send data to the output pin?
CBaseOutputPin::Deliver gets you this (it actually calls IPin::Receive on the connected downstream pin).
Why do you need your own filter for this? With DirectShow Editing Services you have a complete infrastructure for building playlists and the like, but it works only for file sources.
To work with live sources, the best solution would be GMFBridge. There you create 1 to N graphs for your sources and one playback/capture graph; in the GMFBridge you can then switch the connection from a source graph to the playback graph.