Map Reduce Removing Duplicates - mapreduce

I have been given a large text file and want to find the number of different words that start with each letter. I am trying to understand input and output values for map and reduce functions.
I understand a simpler problem which does not need to deal with duplicate words: determine the frequency with which each letter of the alphabet starts a word in the text using map reduce.
Map input: <0, “everyday i am city in tomorrow easy over school i iterate tomorrow city community”>
Map output: [<e,1>,<i,1>,<a,1>,<c,1>,<i,1>,<t,1>,<e,1>,<o,1>,<s,1>,<i,1>,<i,1>,<t,1>,<c,1>,<c,1>]
Reduce input: <a,[1]>,<c,[1,1,1]>,<e,[1,1]>,<i,[1,1,1,1]>,<o,[1]>,<s,[1]>,<t,[1,1]>
Reduce output: [<a,1>,<c,3>,<e,2>,<i,4>,<o,1>,<s,1>,<t,2>]
For the above problem, the words 'i', 'city' and 'tomorrow' appear more than once, so my final output should be:
Reduce output: [<a,1>,<c,2>,<e,2>,<i,3>,<o,1>,<s,1>,<t,1>]
I am unsure of how I would ensure duplicate words are removed in the above problem (would it be done in a pre-processing phase, or could it be implemented in either the map or reduce function?). If I could get help understanding the map and reduce outputs of the new problem, I would appreciate it.

You can do it in two map-reduce passes:
1. find all the distinct words, by using the word as the map output key and, in reduce, outputting each word once
2. the problem you already solved - find the frequency of each initial letter over the unique words.
Alternatively, since there are not many unique words, you could cache them in the mapper and output each one (or its first letter) only once, so the reduce would be identical to your simpler problem. Actually, no, that won't work on its own, because the same word can appear in different mappers. But you can still cache the words in the mapper in the first solution and output each word only once per mapper - a little less traffic between map and reduce.
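To make the two passes concrete, here is a rough sketch in plain Python, simulating the shuffle/group step with a dict (this is illustrative pseudocode for the dataflow, not a real Hadoop job):

```python
from collections import defaultdict

text = ("everyday i am city in tomorrow easy over school "
        "i iterate tomorrow city community")

# Pass 1 - deduplicate: map emits <word, 1>; the shuffle groups by word;
# reduce outputs each key exactly once.
pass1_groups = defaultdict(list)
for word in text.split():
    pass1_groups[word].append(1)
unique_words = list(pass1_groups)

# Pass 2 - your solved problem, run over the unique words only:
# map emits <first_letter, 1>; reduce sums the ones.
pass2_groups = defaultdict(list)
for word in unique_words:
    pass2_groups[word[0]].append(1)
counts = {letter: sum(ones) for letter, ones in pass2_groups.items()}

print(sorted(counts.items()))
# [('a', 1), ('c', 2), ('e', 2), ('i', 3), ('o', 1), ('s', 1), ('t', 1)]
```

This reproduces the corrected reduce output from the question: with duplicates removed, 'c' drops to 2, 'i' to 3, and 't' to 1.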

Maybe something like this would help,
let str = "everyday i am city in tomorrow easy over school i iterate tomorrow city community"
let duplicatesRemoved = Set(str.split(separator: " "))
Output:
["city", "community", "tomorrow", "easy", "everyday", "over", "in", "iterate", "i", "am", "school"]
And maybe you don't need those map statements and can achieve something like this,
Code
var varCount = [Character: Int]()
for subStr in duplicatesRemoved {
    if let firstChar = subStr.first {
        varCount[firstChar] = (varCount[firstChar] ?? 0) + 1
    }
}
Output
["i": 3, "t": 1, "e": 2, "c": 2, "s": 1, "a": 1, "o": 1]

Related

Word2Vec is it for word only in a sentence or for features as well?

I would like to ask more about Word2Vec:
I am currently trying to build a program that checks the embedding vectors for a sentence. At the same time, I am also building a feature extraction using scikit-learn to extract lemma 0, lemma 1 and lemma 2 from the sentence.
From my understanding;
1) Feature extractions : Lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded to each character (this can be achieved by using gensim word2vec(I have tried it))
More explanation:
Sentence = "I have a pen".
Word = token of the sentence, for example, "have"
1) Feature extraction
"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:
[[0,0,1],
[1,0,0],
[0,1,0]]
2) Word embedding(Word2vec)
"I have a pen" ---> "I", "have", "a", "pen"(tokenized) then word2vec from gensim will produced matrices for example if using window_size = 2 produced:
[[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345]
]
The floating-point and integer numbers are for explanation purposes only, and the original data will vary depending on the sentence; these are just dummy data to explain.
Questions:
1) Is my understanding about Word2Vec correct? If yes, what is the difference between feature extraction and word2vec?
2) I am curious whether I can use word2vec to get the feature-extraction embedding too since, from my understanding, word2vec only finds an embedding for each word and not for the features.
Hopefully someone could help me in this.
It's not completely clear what you're asking, as you seem to have many concepts mixed-up together. (Word2Vec gives vectors per word, not character; word-embeddings are a kind of feature-extraction on words, rather than an alternative to 'feature extraction'; etc. So: I doubt your understanding is yet correct.)
"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.
One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...
['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']
...then you have 7 unique case-flattened words...
['a', 'pen', 'will', 'need', 'ink', 'i', 'have']
...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:
[1, 1, 1, 1, 1, 0, 0] # A pen will need ink
[1, 1, 0, 0, 0, 1, 1] # I have a pen
Even with this simple encoding, you can now compare sentences mathematically: a euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', and those with many shared words will have a small 'distance'.
Other very-similar possible alternative feature-encodings of these sentences might involve counts of each word (if a word appeared more than once, a number higher than 1 could appear), or weighted-counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, and thus values scaled to be anywhere from 0.0 to values higher than 1.0).
Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector. That then isn't comparable to any other sentence. They all need to be converted to the same-dimensional-size vector, and in "one hot" (or other simple "bag of words") encodings, that vector is of dimensionality equal to the total vocabulary known among all sentences.
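To make the one-hot encoding concrete, here is a small sketch in plain Python (the helper code and the first-seen vocabulary order are just illustrative):

```python
import math

sentences = [
    ['A', 'pen', 'will', 'need', 'ink'],
    ['I', 'have', 'a', 'pen'],
]

# Build the case-flattened vocabulary in first-seen order.
vocab = []
for sentence in sentences:
    for word in sentence:
        w = word.lower()
        if w not in vocab:
            vocab.append(w)

# One-hot encode each sentence against the full vocabulary.
vectors = [[1 if w in [t.lower() for t in s] else 0 for w in vocab]
           for s in sentences]

print(vocab)       # ['a', 'pen', 'will', 'need', 'ink', 'i', 'have']
print(vectors[0])  # [1, 1, 1, 1, 1, 0, 0]
print(vectors[1])  # [1, 1, 0, 0, 0, 1, 1]

# Cosine distance between the two sentence vectors.
dot = sum(x * y for x, y in zip(*vectors))
norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
cosine_distance = 1 - dot / (norms[0] * norms[1])
```

A pair of sentences with no shared words would give a cosine distance of 1.0; identical bags of words would give 0.0.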
Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:
[0, 1, 0, 0, 0, 0, 0] # 'pen'
If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:
[0.236, -0.711] # 'pen'
All the 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):
[-0.101, 0.271] # 'a'
[0.236, -0.711] # 'pen'
[0.302, 0.293] # 'will'
[0.672, -0.026] # 'need'
[-0.198, -0.203] # 'ink'
[0.734, -0.345] # 'i'
[0.288, -0.549] # 'have'
If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:
[1, 1, 0, 0, 0, 1, 1] # I have a pen
...you'd get a single 2-dimensional dense vector like:
[ 0.28925, -0.3335 ] # I have a pen
And again different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.
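As a sketch of that averaging step, using the made-up 2-dimensional vectors from above (not real model output):

```python
# Made-up 2-d word vectors from the example above (not real Word2Vec output).
word_vecs = {
    'i':    [0.734, -0.345],
    'have': [0.288, -0.549],
    'a':    [-0.101, 0.271],
    'pen':  [0.236, -0.711],
}

def average_vector(words, vecs):
    """Average the word vectors to get one dense sentence vector."""
    dims = len(next(iter(vecs.values())))
    totals = [0.0] * dims
    for w in words:
        for d in range(dims):
            totals[d] += vecs[w][d]
    return [t / len(words) for t in totals]

sentence_vec = average_vector(['i', 'have', 'a', 'pen'], word_vecs)
print([round(x, 5) for x in sentence_vec])  # [0.28925, -0.3335]
```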
So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".
Which works best for your needs will depend on your data and ultimate goals. Often the most-simple techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.

Understanding; for i in range, x,y = [int(i) in i.... Python3

I am stuck trying to understand the mechanics behind this combined input(), loop & list-comprehension; from Codegaming's "MarsRover" puzzle. The sequence creates a 2D line, representing a cut-out of the topology in an area 6999 units wide (x-axis).
Understandably, my original question was put on hold for being too broad. I am trying to shorten and narrow the question: I understand list comprehension basically, and I'm reasonably experienced with for-loops.
Like list comp:
land_y = [int(j) for j in range(k)]
if k = 5; land_y = [0, 1, 2, 3, 4]
For-loops:
for i in range(4):
    a = 2*i
    ab.append(a)   # ab ends up as [0, 2, 4, 6]
But here, it just doesn't add up (in my head):
6999 points are created along the x-axis, from 6 points (x, y).
surface_n = int(input())
for i in range(surface_n):
    land_x, land_y = [int(j) for j in input().split()]
I do not understand where "i" makes a difference.
I do not understand how the data is "packaged" inside the input. I have split strings of integers in another task using almost exactly the same code, and I could easily create new lists and work with them, as I understood the structure I was unpacking (pretty simple, being one datatype with one purpose).
The fact that this line follows within the "game"-while-loop confuses me more, as it updates dynamically as the state of the game changes.
x, y, h_speed, v_speed, fuel, rotate, power = [int(i) for i in input().split()]
Maybe someone could give an example of how this could be written in javascript, haskell or c#? No need to be syntax-correct, I'm just struggling with the concept here.
input() takes a line from the standard input. So it’s essentially reading some value into your program.
The way that code works, it makes very hard assumptions on the format of the input strings. To the point that it gets confusing (and difficult to verify).
Let’s take a look at this line first:
land_x, land_y = [int(j) for j in input().split()]
You said you already understand list comprehension, so this is essentially equal to this:
inputs = input().split()
results = []
for j in inputs:
    results.append(int(j))
land_x, land_y = results
This is a combination of multiple things that happen here. input() reads a line of text into the program, split() separates that string into multiple parts, splitting it whenever a white space character appears. So a string 'foo bar' is split into ['foo', 'bar'].
Then the list comprehension happens, which essentially just iterates over every item in that split input and converts each item into an integer using int(j). So an input of '2 3' is first converted into ['2', '3'] (a list of strings), and then converted into [2, 3] (a list of ints).
Finally, the line land_x, land_y = results is evaluated. This is called iterable unpacking and essentially assumes that the iterable on the right has exactly as many items as there are variables on the left. If that’s the case then it’s just a nice way to write the following:
land_x = results[0]
land_y = results[1]
So basically, the whole list comprehension assumes that there is an input of two numbers separated by whitespace, it then splits those into separate strings, converts those into numbers and then assigns each number to a separate variable land_x and land_y.
Exactly the same thing happens again later with the following line:
x, y, h_speed, v_speed, fuel, rotate, power = [int(i) for i in input().split()]
It’s just that this time, it expects the input to have seven numbers instead of just two. But then it’s exactly the same.
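One way to experiment with this without standard input is to substitute a fixed string for input() (the numbers here are made up). Note also that i itself is never used inside the loop body; range(surface_n) just makes the read happen once per surface point.

```python
line = "5200 2500"                    # stands in for one line from input()

parts = line.split()                  # ['5200', '2500']
numbers = [int(j) for j in parts]     # [5200, 2500]
land_x, land_y = numbers              # iterable unpacking: 2 names, 2 values

print(land_x, land_y)  # 5200 2500
```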

Is there any way to find the index of everything found after using the count function

First question over here. I'm currently scanning through two lists. I need to use the count function to find how many times a specific word appears, then take the index of each of those occurrences and use it to look up a second list. I have a list with a bunch of products and a second list with the corresponding prices, so I'd like to be able to take the average of the prices and match it with each product. Any help would be appreciated.
I may not have fully understood your data structure (some code and data would be appreciated), but try:
list1 = [1, 2, 3, 4, 5]
list2 = ['a', 'b', 'c', 'a', 'b']
b = list2.count('a')
print(b)
print()
for i, j in enumerate(list2):
    if j == 'a':
        print(i, ":", list1[i])
Result is:
2
0 : 1
3 : 4
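For the actual goal - the average price per product - you don't need count() and indexes at all; pairing the two lists with zip and accumulating into dicts is simpler. A sketch with made-up products and prices:

```python
from collections import defaultdict

products = ['apple', 'banana', 'apple', 'banana', 'apple']
prices = [1.00, 0.50, 1.20, 0.70, 1.10]

totals = defaultdict(float)   # sum of prices per product
counts = defaultdict(int)     # number of prices per product
for product, price in zip(products, prices):
    totals[product] += price
    counts[product] += 1

averages = {p: totals[p] / counts[p] for p in totals}
print({p: round(v, 2) for p, v in averages.items()})  # {'apple': 1.1, 'banana': 0.6}
```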

lucene, search for sentences that contain one term twice

I have pairs of search strings, and I want to use Lucene to search for sentences that contain all the terms contained in these strings. So, for example, if I have the two search strings "white shark" and "fish", I want all sentences containing "white", "shark" and "fish". Apparently, with Lucene this can be done rather easily by means of a boolean query; this is how I do it in my code:
String search = str1+" "+ str2;
BooleanQuery booleanQuery = new BooleanQuery();
QueryParser queryParser = new QueryParser(...);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
booleanQuery.add(queryParser.parse(search), BooleanClause.Occur.MUST);
However, I also have pairs of search strings where one string is a subpart of the other, such as e.g. "timber wolf" and "wolf", and in these cases I would like to only get sentences that contain "wolf" at least twice (and "timber" at least once). Is there any way to achieve this with Lucene? Many thanks in advance for your answers!
Keep in mind that a doc that has both "timber wolf" and a separate "wolf" will be scored higher, all else being equal, since the term "wolf" occurs twice, giving it a higher tf score. In most cases like this, having those results come first is acceptable, usually even desirable.
That said, I believe you could get what you want using a phrase query with slop, and having the slop set adequately high. Something like:
"timber wolf wolf"~10000
Which is probably high enough for most cases. This would require two instances of wolf and one of timber.
However, if you need timber wolf to appear (that is, those two terms adjacent and in order), you'll need to abandon the query parser, and construct the appropriate queries yourself. SpanQueries, specifically.
SpanQuery wolfQuery = new SpanTermQuery(new Term("myField", "wolf"));
SpanQuery[] timberWolfSubQueries = {
new SpanTermQuery(new Term("myField", "timber")),
new SpanTermQuery(new Term("myField", "wolf"))
};
//arguments "0, true" mean 0 slop and in order (respectively)
SpanQuery timberWolfQuery = new SpanNearQuery(timberWolfSubQueries, 0, true);
SpanQuery[] finalSubQueries = {
wolfQuery, timberWolfQuery
};
//arguments "10000, false" mean 10000 slop and not (necessarily) in order
SpanQuery finalQuery = new SpanNearQuery(finalSubQueries, 10000, false);

map-reduce function in CouchDB

I have a Java program that reads all the words of a PDF file. I saved the words with their page numbers in a database (CouchDB). Now I want to write a map and a reduce function that list each word with the page numbers on which it occurs; but if a word occurs more than once on a page, I want just one entry. The result should be a row with the word and a second row with a list (a string separated with commas) of page numbers. Each word with its page number is a separate document in CouchDB.
How can I do this with a map-reduce function (filtering out identical page-number entries)?
Thanks for help.
Surely there is more than one way of doing it. I'd go for something simple. Let's say your documents look somewhat like this:
{ 'type': 'word-index', 'word': 'Great', 'page_number': 45 }
This is a result of finding the word 'Great' on page 45. Now your view index is created by a view function:
function map(doc) {
    if (doc.type == 'word-index') {
        emit([doc.word, doc.page_number], null);
    }
}
For reduce part just use the "_count" builtin.
Now, to get the list of all the occurrences of the word "Great" in your book, just query your view with startkey=["Great"] and endkey=["Great", {}]. The result would look somewhat like:
["Great", 45], 4
["Great", 70], 7
This means that the word "Great" appeared 4 times on page 45 and 7 times on page 70. You can extract the comma-separated list you need from it. The number of occurrences is a bonus.
--edit--
You also have to use group_level=2 in your query. If you don't, the result of the query will simply be a single row with the count of all the documents you have.
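To see what the map function plus the _count reduce at group_level=2 produce, here is a rough Python simulation of the view (CouchDB does this server-side; the documents here are made up):

```python
from collections import Counter

# Hypothetical word-index documents, one per word occurrence.
docs = [
    {'type': 'word-index', 'word': 'Great', 'page_number': 45},
    {'type': 'word-index', 'word': 'Great', 'page_number': 45},
    {'type': 'word-index', 'word': 'Great', 'page_number': 70},
    {'type': 'word-index', 'word': 'Small', 'page_number': 12},
]

# map: emit([word, page_number], null) for each matching document.
emitted = [(d['word'], d['page_number']) for d in docs if d['type'] == 'word-index']

# reduce "_count" with group_level=2: count rows per [word, page] key.
grouped = Counter(emitted)

# Query startkey=["Great"], endkey=["Great", {}]: only keys starting with "Great".
rows = sorted((k, v) for k, v in grouped.items() if k[0] == 'Great')
print(rows)  # [(('Great', 45), 2), (('Great', 70), 1)]

# The comma-separated page list the question asks for:
pages = ','.join(str(page) for (word, page), count in rows)
print(pages)  # 45,70
```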