I have questions when I'm trying to implementing PageRank with mapreduce.
I want to cite the codes here https://stackoverflow.com/a/5029780/1117436 to describe the problem.
map ((url,PR), out_links) //PR = random at start
for link in out_links
emit(link, ((PR/size(out_links)), url))
reduce(url, List[(weight, url)):
PR =0
for v in weights
PR = PR + v
Set urls = all urls from list
emit((url, PR), urls)
In the above process, it's clearly that the second parameter of the input of map procedure is the Out links of url but the second parameter of the output of reduce procedure seems to be the In links of url. So how can these codes work iteratively?
Then what I want to ask is how to write codes to make the pagerank alrorithm work properly?
UPDATE: I think this answer solves my problem.
https://stackoverflow.com/a/13568286/1117436
You can implement iterative algorithms using MapReduce, but it might not be the best and more efficient way (because you move stuff to HDFS/disk each iteration).
Having said that, if you are interested in looking at how one might implement something like PageRank using MapReduce have a look here:
https://github.com/castagna/mr-pagerank/
Start from the run() method in PageRank.java
If you are interested, you can have a look at a bunch of old (i.e. 2009) slides here:
https://github.com/castagna/mr-pagerank/blob/master/docs/pagerank-huguk-2009.pdf
Now, you can have much more fun at implementing/running PageRank with a Pregel clone such as Apache Giraph as Praveen already suggested to you.
There are already a couple of graph processing frameworks.
Look at Apache Giraph which can used for graph processing. Giraph is based on MR. GoldenOrb is in a very early stage. Also, take a look at Apache Hama which is an implementation of BSP, this has it's own computation engine and is not MR based, but uses HDFS for storage. Hama can also be used for graph processing.
Related
I would like to ask how tileLayers in detour work.
Can i layer any 2 tiles resulting in "cuts" being merged into bigger holes and in what manner, are there any limitations? How many of these layers can i have and how they interact when navigating. Can i have for example basic always unchanged layer with whole map and then multiple layers with user placed geometry etc...
I'am talking about dtNavMeshCreateParams::tileLayer from original lib. There is some implementation using it in TempObstackes example. But it is not well explained (it has many custom things inside but i'am interested only in function of those layers)
Thank you in advance, i cant find much info about this online so any help would really be appreceated.
So actually i finally (many hours) found my answer to what these layers are used for:
http://digestingduck.blogspot.com/2011/02/heightfield-layer-portals.html
(btw this seems to be blog of navmesh lib author, enjoy)
Is tf.py_func allowed at online prediction time?
If yes any examples of how to use it?
Does the answer change if I need to install additional pip packages?
My use-case: I work with text, I need to do word stemming (using porter stemmer), I know how to do it using python, tensorflow doesn't have Ops for that. I would like to use the same text processing at training and prediction time - thus I would like to encode it all into a tensorflow graph.
https://www.tensorflow.org/api_docs/python/tf/py_func comes with known limitations and I would like to know if it will work during training and online prediction before I invest more time into it.
Thanks
Unfortunately, no. Py_func can not be restored from a saved model. However, since your use case involves pre-processing, just invoke the py_func explicitly in all three (train, eval, serving) input functions. This won't work if the py_func is in the middle of your graph, but for stemming, it should work just fine.
I'm going to be running through live twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to look against, what's the best way to select the relevant tweets? This project is in it's infancy so I'm open to any looking into any solution (i.e. language agnostic.) Any help would be greatly appreciated.
Update: I'd be curious if anyone had any insight to how the Yahoo! Placemaker API, solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast. Maybe especially if you put lots of titles in one expression.
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
Working off what erickson suggested, the most feasible search is for the ("is better than" in your example), then checking for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
I'm working on a C++/Qt image retrieval system based on similarity that works as follows (I'll try to avoid irrelevant or off-topic details):
I take a collection of images and build an index from them using OpenCV functions. After that, for each image, I get a list of integer values representing important "classes" that each image belongs to. The more integers two images have in common, the more similar they are believed to be.
So, when I want to query the system, I just have to compute the list of integers representing the query image, perform a full-text search (or similar) and retrieve the X most similar images.
My question is, what's the best approach to permorm such a search?
I've heard about Lucene, Lemur and other indexing methods, but I don't know if this kind of full-text searchs are the best way, given the domain is reduced (only integers instead of words).
I'd like to know about the alternatives in terms of efficiency, accuracy or C++ friendliness.
Thanks!
It sounds to me like you have a vectorspace model, so Lucene or a similar product may work well for you. In general, an inverted-index model will be good if:
You don't know the number of classes in advance
There are a lot of classes relative to the number of images
If your problem doesn't fit these criteria, a normal relational DB might work better, as Thomas suggested. If it meets #1 but not #2, you could investigate one of the "column oriented" non-relational databases. I'm not familiar enough with these to tell you how well they would work, but my intuition is that you'll need to replicate a lot of the functionality in an IR toolkit yourself.
Lucene is written in Java and I don't know of any C++ ports. Solr exposes Lucene as a web service, so it's easy enough to access it that way from whatever language you choose.
I don't know much about Lemur, but it looks like it has a similar vectorspace model, and it's written in C++, so that might be easier for you to use.
You can take a look at Lucene for image retrieval (LIRE) here: http://www.semanticmetadata.net/2006/05/19/lire-lucene-image-retrieval-04-released/
If I'm mistaken, you are trying to implement a typical bag of words image retrieval am I correct? If so you are probably trying to build an inverted file index. Lucene on its own is not suitable as you probably have already realized as it index text instead of numbers. Using its classes for querying the index would also be a problem as it is not designed to "parse" (i.e. detect keypoints, extract descriptors then vector-quantize them) image into the query vector.
LIRE on the other hand have been modified to index feature vectors. However, it does not appear to work out of the box for bag of words model. Also, I think I've read on the author's website that it currently uses brute force matching rather than the inverted file index to retrieve the images but I would expect it to be easier to extend than Lucene itself for your purposes.
Hope this helps.
Does anyone know any more details about google's web-crawler (aka GoogleBot)? I was curious about what it was written in (I've made a few crawlers myself and am about to make another) and if it parses images and such. I'm assuming it does somewhere along the line, b/c the images in images.google.com are all resized. It also wouldn't surprise me if it was all written in Python and if they used all their own libraries for most everything, including html/image/pdf parsing. Maybe they don't though. Maybe it's all written in C/C++. Thanks in advance-
you can find a bit about how googlebot works here:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=158587
for example the "fetch as googlebot" tool lets you see a page as Googlebot sees it.
The crawler is very likely written in C or C++, at least backrub's crawler was written in one of these.
Be aware that the crawler only takes a snapshot of the page, then stores it in a temporary database for later processing. The indexing and other attached algorithms will extract the data, for example the image references.
Officially allowed languages at Google, I think, are Python/C++/Java.
The bot likely uses all 3 for different tasks.