Clustering Using MapReduce - mapreduce

I have unstructured twitter data which is retrieved by the apache flume and stored it into the HDFS. So now I want to convert this unstructured data into structured one using the mapreduce.
Task wanted to do using the mapreduce:
1. conversion Unstructured to structure one.
2. I just want the text part which contain tweet part.
3. I want to identify the tweets for particular topic and grouped according to their sub part.
e.g. I have tweets of samsung handset so i want to make a group according to their handsets like groups of Samsung Note 4, Samsung galaxy etc.
It is my college project so my guide suggested me to use k means algorithm, I search a lot on k means but failed to understand how to identifies the Centroid for this basically i failed to understand how to apply K means to this situation in MapReduce.
Please gude me if I am doing wrong as I am new to this concept

K-means is clustering algorithm. It cluster or group similar data and calculate the common centroid. You can create time-series for the above questions you have mention. Group the tweets according to the topic.
K-mean implementation in MapReduce.
https://github.com/himank/K-Means
Using K-means in Twitter datasets.
You can check the following links
https://github.com/JulianHill/R-Tutorials/blob/master/r_twitter_cluster.r
http://www.r-bloggers.com/cluster-your-twitter-data-with-r-and-k-means/
http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html

Related

AWS GroundTruth text labeling - hide columns in the data, and checking quality of answers

I am new to SageMaker. I have a large csv dataset which I would like labelled:
sentence_id
sentence
pre_agreed_label
148392
A sentence
0
383294
Another sentence
1
For each sentence, I would like a) a yes/no binary classification in response to a question, and b) on a scale of 1-3, how obvious the classification was. I need the sentence id to map to other parts of the dataset, and will use the pre-agreed labels to assess accuracy.
I have identified SageMaker GroundTruth labelling jobs as a possible way to do this. Is this the best way? In trying to set it up I have run into a few problems.
The first problem is I can't find a way to display only the sentence column to the labellers, hiding the sentence_id and pre_agreed_labels.
The second is that there is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels:
Select one for binary classification:
Yes
No
Select one for difficulty of classification:
Easy
Medium
Hard
It seems as though this can be done using custom HTML, but I don't know how to do this - the template it gives you doesn't even render
Finally, having not used mechanical turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to put in an obvious question to which we already have a 'pre_agreed_label' every nth question, and kick people off the task if they get it wrong? There also appears to be a maximum of $1.20 per task which seems odd.

Input Arrays Into Weka

I am fairly new to machine learning and I am trying to use WEKA (GUI) to implement a neural network on a sports data set. My issue is that I want my inputs to be Arrays (each Array is a contestant with stats such as speed, winrate, etc). I am wondering how I can tell WEKA that each input is an array of values.
You can define it in an .arff file. See this website for detailed information. As the figure below.
Or after opening your data in Weka, you can convert it with the help of some filters. I do not know the current format of your data. However, if you can open it in Weka, you can edit your data with many filters. Meanwhile, artificial neural networks only accept numerical values. Among these filters, there are those who convert nominal data to numerical data. I share an image from these filters below. If you are new to this area, I recommend you to watch videos of WekaMOOC (owned by Weka developers.). I think it will be very useful. Good luck.
Weka_filters_screen

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

Clear approach for assigning semantic tags to each sentence (or short documents) in python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that has product description. The values in this column can be very messy and would have a lot of other words that are not related to the product. I want to know which rows are about the same product, so I would need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative however could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick(/erase) words that you deem important(/unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.

How could I use graph mining method to get a multi-node graph?

I now use apriori algorithm to do a data mining project,and I get result such as:item1 <=> iteam2、item2 <=> item3.......
I want use graph mining to generate a graph containing many nodes and illustrating relation between these node like this:
I heard some data ming software--weka,rapidminer;I also heard some graph library--igraph,networkx;I also heard--tableau.But I'm still confused,could someone give me a illustration about detailed procedure?
I recommend using Prefuse tool kit for your problem. Take a look here http://prefuse.org/gallery/ . This contains an example of the graph that you need.
Loosely speaking, Prefuse also has a browser version called D3.js . If you want to display your graph in browser then use D3.js
I have used Prefuse as well as D3.js when I needed a desktop graph and a graph in the browser.
If by multi-node graph you are referring to this definition http://dl.acm.org/citation.cfm?id=1292799, then I would say that you could use Gephi to visualize your graph. Gephi is a powerful tool for network visualization/analysis, since you can annotate the vertices, apply clustering algorithms etc.
In your case, since multi-node graphs have multiple states, you can either use some annotation/coloring to show the different states of the nodes/edges, or even visualize these different states by importing different timestamps/versions of the network in Gephi. You can then observe the differences among them. Even if your graph is not multi-node, I would recommend Gephi for visualizing it.
If item1 <=> iteam2、item2 <=> item3 .. is your current data format, you can transform it to a format that Gephi recognizes, like adjacency list or edge list.