Given billions of URLs, how to determine duplicate content [closed] - web-services

I was asked this question in a programming interview. I have described the question in detail below. It was an open-ended question.
Given billions of URLs (deep links), how do I determine which URLs point to duplicate content? The question was further extended to finding out, in cases of duplicate pages, which of them is the authentic one. This was the first part.
My approach (after stating some assumptions) was to bucket the URLs by domain and then compare the contents of the URLs within the same bucket.
In the second part, the interviewer narrowed down the question stating that:
Given just two URLs: URL1 is a wiki page about a celebrity (e.g. Brad Pitt), and URL2 contains information about many celebrities, including Brad Pitt.
How do we identify which one is authentic and which is the duplicate?
My answer was based on comparing the two pages on the basis of their citations.
The interviewer asked me to build the answer from scratch, and wanted me to assume that we don't have any prior information about duplicate content on the URLs.
Since it's an open-ended question, any lead would be helpful.

You might find this paper to be helpful: "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" by Monika Henzinger at Google, as this problem has attracted a fair amount of research. From the paper:
A naive solution is to compare all pairs of documents. Since this is prohibitively expensive on large datasets, Manber [11] and Heintze [9] proposed the first algorithms for detecting near-duplicate documents with a reduced number of comparisons. Both algorithms work on sequences of adjacent characters. Brin et al. [1] started to use word sequences to detect copyright violations. Shivakumar and Garcia-Molina [13, 14] continued this research and focused on scaling it up to multi-gigabyte databases [15]. Broder et al. [3] also used word sequences to efficiently find near-duplicate web pages. Later, Charikar [4] developed an approach based on random projections of the words in a document. Recently Hoad and Zobel [10] developed and compared methods for identifying versioned and plagiarised documents.
In other words, it's a complex problem with a variety of solutions of varying success, and not something with a 'right' answer. Most of the answers involve checking word or character sequences.

The above link did not work for me, but I found this page from Stanford, which has an interesting theorem involving shingles and the Jaccard coefficient.
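To make that idea concrete, here is a minimal Python sketch of shingling plus the Jaccard coefficient. The 4-word shingle size and the 0.9 threshold are arbitrary choices of mine, and at the scale of billions of URLs you would replace the pairwise comparison with a sketching scheme such as Broder's MinHash or Charikar's random projections rather than comparing every pair directly.

    # Toy sketch: k-word shingles and the Jaccard coefficient between two pages.
    # The shingle size and the threshold are illustrative, not prescribed values.
    def shingles(text, k=4):
        """Return the set of k-word shingles (adjacent word sequences) in text."""
        words = text.lower().split()
        return {' '.join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        """Jaccard coefficient: size of intersection over size of union."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    doc1 = "Brad Pitt is an American actor and film producer."
    doc2 = "Brad Pitt is an American actor and a producer of films."
    similarity = jaccard(shingles(doc1), shingles(doc2))
    print(f"Jaccard similarity: {similarity:.2f}")
    # Pages scoring above some threshold (say 0.9) would be flagged as near-duplicates.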


Is it recommended to remove duplicate words in word2vec algorithm? [closed]

I have data that consists of DNA sequences, where the words are represented as k-mers of length 6 and the sentences are the DNA sequences themselves. Each DNA sequence has 80 k-mers (words).
The list of k-mers I have is around 130,000, but after removing duplicate elements I would have only about 4,500 k-mers. This huge gap confuses me about whether or not to remove the duplicate k-mers. My question is: in this case, is it recommended to remove the duplicated k-mers for the word2vec algorithm?
Thanks.
Without an example, it's unclear what you mean by "removing the duplicate elements". (Does that mean, when the same token appears twice in a row? Or twice in one "sentence"? Or, as I'm not familiar with what your data looks like in this domain, something else entirely?)
That you say there are 130,000 tokens in the vocabulary, but then 4,500 later, is also confusing. Typically the "vocabulary" size is the number of unique tokens. Removing duplicate tokens couldn't possibly change the number of unique tokens encountered.
In the usual domain of word2vec, natural language, words don't often repeat one-after-another. To the extent they sometimes might – as in say the utterance "it's very very hot in here" – it's not really an important enough case that I've noticed anyone commenting about handling that "very very" differently than any other two words.
(If a corpus had some artificially-duplicated full-sentences, it might be the case that you'd want to try discarding the exact-duplicate-sentences. Word2vec benefits from a variety of different usage-examples. Repeating the same sentence 10 times essentially just overweights those training examples – it's not nearly as good as 10 contrasting, but still valid, examples of the same words' usage.)
You're in a different domain that's not natural language, with different co-occurrence frequencies, and different end-goals. Word2vec might prove useful, but it's unlikely any general rules-of-thumb or recommendations from other domains will be useful. You should test things both ways, evaluate the results on your ultimate task in a robust repeatable way, and choose based on what you discover.
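If it helps, here is a hedged sketch of "testing both ways" with gensim's Word2Vec. The parameter values, the toy k-mer sentences, and the consecutive-duplicate rule are all assumptions of mine, not recommendations; substitute your real sequences and your real downstream evaluation.

    # Train one model on the raw k-mer sentences and one on sentences with
    # consecutive duplicate k-mers collapsed, then compare on the real task.
    from gensim.models import Word2Vec

    def drop_consecutive_duplicates(sentence):
        """Collapse runs of the same k-mer into a single occurrence."""
        out = []
        for kmer in sentence:
            if not out or out[-1] != kmer:
                out.append(kmer)
        return out

    # Each "sentence" is a DNA sequence given as a list of 6-mers (tiny toy data here).
    sentences = [
        ['ATGCGT', 'TGCGTA', 'GCGTAC', 'GCGTAC', 'CGTACG'],
        ['TTGCAA', 'TGCAAT', 'TGCAAT', 'GCAATC', 'CAATCG'],
    ]

    # gensim >= 4.0 calls this parameter vector_size; older releases used size.
    model_raw = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    model_dedup = Word2Vec([drop_consecutive_duplicates(s) for s in sentences],
                           vector_size=100, window=5, min_count=1, workers=4)

    # Evaluate both models on your downstream task; a similarity query like this
    # is only a rough sanity check.
    print(model_raw.wv.most_similar('GCGTAC', topn=2))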

Ecology C++ Code [closed]

I'm new to C++ and stuck on how to start coding this problem, which is an ecology simulation: start with a cell containing plants, antelopes, and tigers. The populations change based on initial population, birth rates, food supply, die-off, and migration into other cells (once one cell is discovered, the population can expand further). I did some tests on paper and found that the plants are going to need a cap, because they multiply faster than the antelopes can eat them. I don't really know how to start this; if anyone can give me a starting point, I would be grateful.
Thank you.
It sounds like you're trying to build an individual-based, agent-based, or microscale model: these are subsets of the more general topic of discrete event simulation. Looking into those topics and reading some of the literature and books around them would be a good start.
One way to get started conceptually might be to play with SimPy. Once you think you understand how its pieces fit together and how to build a model, you'll be in a better position to move to a higher-performance language, like C++, where you'll need to build more of the components yourself.
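As a conceptual starting point, here is a toy SimPy sketch of a single cell. All of the rates, the plant cap, and the initial populations are made-up placeholders, not values from your assignment, and the update rules are deliberately crude.

    # Toy single-cell ecology sketch using SimPy's discrete-event clock.
    import simpy

    PLANT_CAP = 1000  # plants need a cap, as noted in the question

    def cell(env, pop):
        """Advance the cell's populations once per simulated season."""
        while True:
            yield env.timeout(1)
            # Plants multiply, get grazed by antelopes, and are capped.
            pop['plants'] = min(int(pop['plants'] * 1.5) - pop['antelopes'] * 2, PLANT_CAP)
            pop['plants'] = max(pop['plants'], 0)
            # Antelopes grow when food is plentiful, otherwise die off.
            if pop['plants'] > pop['antelopes'] * 2:
                pop['antelopes'] = int(pop['antelopes'] * 1.2)
            else:
                pop['antelopes'] = int(pop['antelopes'] * 0.7)
            # Tigers track the antelope supply.
            if pop['antelopes'] > pop['tigers'] * 5:
                pop['tigers'] += 1
            elif pop['tigers'] > 0:
                pop['tigers'] -= 1
            print(env.now, pop)

    env = simpy.Environment()
    populations = {'plants': 100, 'antelopes': 20, 'tigers': 3}
    env.process(cell(env, populations))
    env.run(until=10)

Migration could then be modelled as additional processes that move individuals from one cell's population dictionary to a neighbouring cell's once that cell is "discovered".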
You should also learn how to program. Having to ask a question as general as yours at the beginning of this endeavour should give you pause: people have devoted careers to figuring out how to do this the right way. That said, C++ is a decent choice of language because you'll need to run your model not just once, but tens of thousands of times, in order to get an idea of how variable your results are. Remembering that the number of interactions between variables grows exponentially with the number of variables, you'll also want to explore different combinations of environments with an eye to testing the strength of your assumptions.
All of this will also probably require the use of a high-performance environment: you'll want to learn about MPI, R's HPC packages, jug, or Spark: each of which would have to be tamed to work with your implementation of the model.
This paper I recently published has a relatively simple agent-based model, along with an analysis and source code, which might help you get started. It may also help you understand the enormity of the undertaking you propose.

Does This Method Have A Name? [closed]

Not sure if this method exists for data analysis... or even if I can word my question clearly:
If you took multiple transparencies, each with a map of the world on it, and placed a 'very light' dot of color at a place of interest (one dot on each transparency), then when you stacked all of the transparencies on top of each other (in any order, really), the 'very light' dots of color would combine to form 'darker' spots indicating increased interest in those locations. Likewise, the 'answer' would become readily apparent just by looking at the overlaid maps, with little to no calculation.
Does this sound like any established technique that you have heard of? And if so, what is its name?
Yes.
This technique is commonly known as a heat map (Wikipedia), and it is a standard visualization technique.
This is popularly used with multivariate density estimation (Wikipedia).
Picture from Wikipedia, see File:Old Faithful Geyser KDE with plugin bandwidth.png (CC by-sa-3.0 licensed).
I would not call this "data mining"; it is much, much older. It's a visualization technique popular in statistics, but not so much an "analysis" technique.
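If you want to reproduce the effect digitally, a minimal Python sketch using numpy and matplotlib follows; the coordinates are randomly generated stand-ins for the dots on your transparencies.

    # Binning the points is the digital equivalent of overlaying light dots:
    # cells hit by many "transparencies" come out darker on the heat map.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # Each transparency contributes one (x, y) point of interest.
    points = np.vstack([
        rng.normal(loc=(10, 50), scale=2, size=(200, 2)),   # one cluster of interest
        rng.normal(loc=(-75, 40), scale=3, size=(100, 2)),  # a second cluster
    ])

    plt.hist2d(points[:, 0], points[:, 1], bins=50, cmap='hot')
    plt.colorbar(label='number of overlapping dots')
    plt.xlabel('x (e.g. longitude)')
    plt.ylabel('y (e.g. latitude)')
    plt.show()

Smoothing the binned counts, as kernel density estimation does, gives the continuous version shown in the Wikipedia picture referenced above.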

Approach to data-analysis [closed]

I'm looking to write a reporting tool. The data resides in a ~6GB PostgreSQL database. The application is an online store/catalog application that has items and orders. The stakeholders are requesting a feature that will allow them to search for an item and get a count of all orders for that item in the last 2 years.
Some rows contain quantities and units of measure, which would require multiplying quantity by UoM for each row.
It's also possible that other reporting functions will be necessary in the future.
I have not delved much into the data analysis aspect of programming. I enjoy Clojure, so I would be thrilled to find a solution that uses Clojure, but only if Clojure offers competitive tools for my needs.
Here are some options I'm considering:
merely SQL
Clojure
core.reducers
a clojure hadoop library
Hadoop
Can anyone offer some insight into these kinds of problems? Are there articles that you would recommend?
Hadoop is likely overkill for this project. Simply using clojure-jdbc or Korma to read the data from the database and filter/reduce it in Clojure is likely to be fine. At work we routinely work with sequences of that size, though this depends on the expected response time; you may need to do some preprocessing and caching if instantaneous responses are expected.
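For scale, the "merely SQL" option might look something like the sketch below (written here in Python with psycopg2; the same query can be issued just as easily from clojure-jdbc). The table and column names are hypothetical, since the real schema isn't shown in the question.

    # A sketch of letting PostgreSQL do the aggregation itself.
    # orders, order_items, uom_factor, ordered_at etc. are placeholder names.
    import psycopg2

    QUERY = """
        SELECT COALESCE(SUM(oi.quantity * oi.uom_factor), 0) AS units_ordered,
               COUNT(DISTINCT o.id)                          AS order_count
        FROM order_items oi
        JOIN orders o ON o.id = oi.order_id
        WHERE oi.item_id = %s
          AND o.ordered_at >= now() - interval '2 years'
    """

    conn = psycopg2.connect("dbname=store")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute(QUERY, (42,))            # 42 = hypothetical item id
        units_ordered, order_count = cur.fetchone()
        print(order_count, "orders,", units_ordered, "units in the last 2 years")

At 6GB, an index on (item_id, ordered_at) plus a query like this will usually beat shipping the rows out of the database to reduce them elsewhere.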

How do you find a particular piece of functionality in a large codebase? [closed]

I was fascinated by the "press Tab to search site" feature in chromium, so naturally I wanted to see how exactly it was implemented in code.
A little background for anybody who isn't familiar with this: after navigating to some site, say Wikipedia, and doing a search, Chromium remembers the name of the query variable and will let you press Tab and search the site directly from the address bar. Neat!
The problem is that the Chromium codebase is huge, and I've had no luck finding the method/function that handles this.
How do you approach a large codebase when you are looking for the implementation of a particular piece of functionality? Any tricks for narrowing it down? Preferably it should not require building the software with debug symbols and following the flow through the program.
There is no one-size-fits-all approach to this sort of problem, but for this one I would try these:
If there are any unique messages associated with the operation, grep all the source files for that string. A common pitfall of this technique is that messages might be assembled from pieces within the application, so it is often helpful to grep for a unique short phrase—or even a single word—to identify the source of the message. Once the text is found, then finding what references it often requires more text searches.
Trace execution from an easy-to-find point, like the command processing and dispatch loop. I'd look for a Tab key case and follow where it leads.
Look at source code directory and filenames for hints. Software is often constructed rationally, with good engineers dividing and conquering in a sensible way.
A test coverage tool is a good way to do this: it tells you which parts of an application are exercised by a test. Instrument the application to collect coverage, execute the functionality you care about, and record what runs. Then execute something similar to, but not the same as, the functionality you want, and record that too. Take the set difference of the two coverage sets: the diff selects the code involved in the functionality of interest while excluding code common to the similar functionality (a sketch of this diff appears at the end of this answer).
Finally, ask the Chromium team. They don't give points or bronze pixels, but they're definitely the authority and the right people to ask this sort of question.
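To make the coverage-difference idea concrete, here is a small Python sketch. It assumes two lcov-style reports, feature.info from a run that exercised tab-to-search and baseline.info from a similar run without it; those file names are hypothetical, and how you generate the reports depends on your coverage tooling.

    # Diff two lcov-style coverage reports to isolate feature-specific code.
    def covered_lines(lcov_path):
        """Return the set of (source_file, line_number) pairs hit in an lcov report."""
        hits = set()
        current_file = None
        with open(lcov_path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith('SF:'):
                    current_file = line[3:]           # source file record
                elif line.startswith('DA:'):
                    line_no, count = line[3:].split(',')[:2]
                    if int(count) > 0:                # only lines that actually ran
                        hits.add((current_file, int(line_no)))
        return hits

    feature = covered_lines('feature.info')    # run that used tab-to-search
    baseline = covered_lines('baseline.info')  # similar run without it
    for source_file, line_no in sorted(feature - baseline):
        print(f'{source_file}:{line_no}')

The lines printed at the end are the ones executed only by the feature run, which is usually a short enough list to read by hand.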