store a high amount of HTML files - c++

i'm working on an academic project(a search engine), the main functions of this search engine are:
1/-crawling
2/-storing
3/-indexing
4/-page ranking
all the sites that my search engine will crawl are available locally which means it's an intranet search engine.
after storing the files found by the crawler, these files need to be served quickly for caching purpose.
so i wonder what is the fastest way to store and retrieve these file ?
the first idea that came up is to use FTP or SSH, but these protocols are connection based protocols, the time to connect, search for the file and get it is lengthy.
i've already read about google's anatomy, i saw that they use a data repository, i'd like to do the same but i don't know how.
NOTES: i'm using Linux/debian, and the search engine back-end is coded using C/C++. HELP !

Storing individual files is quite easy - wget -r http://www.example.com will store a local copy of example.com's entire (crawlable) content.
Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.
Another thing to consider is that maybe you don't really want to store all the pages yourself, but just forward to the site that actually contains the pages - that way, you only need to store a reference to what page contains what words, not the entire page. Since a lot of pages will have much repeated content, you only really need to store the unique words in your database and a list of pages that contain that word (if you also filter out words that occur on nearly every page, such as "if", "and", "it", "to", "do", etc, you can reduce the amount of data that you need to store. Do a count of the number of each word on each page, and then see compare different pages, to find the pages that are meaningless to search.

Well, if the program is to be constantly running during operation, you could just store the pages in RAM - grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
I gather from the question that the user is on a different machine from the search engine, and therefore cache. Perhaps I am overlooking something obvious here, but couldn't you just sent them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.

Related

How to store and efficiently acces a lot of files in a search engine (C++)?

I am currently building a little search engine as part of a university project.
The search engine has to be able to perform searches on quite a lot of documents. 500k-1M I would say. The documents are pure .txt files and are not big (max 1MB).
I am writing the search engine in C++, but I am not quite sure how to store and access the documents efficiently (in memory).
I am using an inverted index, which stores for every term a list of document ids (pure integers) in which that term occurs.
I thought about creating a document class which takes into the constructor the filename of the document and creates a document object.
For example
Document d("25.txt");
The document class would also hold other information about the document, which I would present to the user, if the user decides to take a look at the document.
However, if I search for a term, that occurs in quite a lot of documents, for example "apple", then I would probably have to create hundreds or thousands of document objects. And when the queries get longer, that would blow up the heap (I guess?).
And I really need all of the potentially relevant documents in order to create a ranking.
What would be the right way to go here? Maybe somehow with serialization? Or using a DB? Or something else?
Please note, I can not use something like solr or Lucene, as this task is part of a university course.

How to check if content of webpage has been changed?

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.
I'm thinking of comparing hashes, the problem with this is that if the page has changed a single byte or character, the hash would be different. So for example if the page display the current date on the page, every single time the hash would be different and tell me that the content has been updated.
So... How would you do this? Would you look at the Kb size of the HTML? Would you look at the string length and check if for example the length has changed more than 5%, the content has been "changed"? Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content has been changed?
About last-modified - unfortunately not all servers return this date correctly. I think it is not reliable solution. I think better way - combine hash and content length solution. Check hash, and if it changed - check string length.
There is no universal solution.
Use If-modifed-since or HEAD when possible (usually ignored by dynamic pages)
Use RSS when possible.
Extract last modification stamp in site-specific way (news sites have publication dates for each article, easily extractable via XPATH)
Only hash interesting elements of page (build site-specific model) excluding volatile parts
Hash whole content (useless for dynamic pages)
Safest solution:
download the content and create a hash checksum using SHA512 hash of content, keep it in the db and compare it each time.
Pros: You are not dependent to any Server headers and will detect any modifications.
Cons: Too much bandwidth usage. You have to download all the content every time.
Using Head
Request page using HEAD verb and check the Header Tags:
Last-Modified: Server should provide last time page generated or Modified.
ETag: A checksum-like value which is defined by server and should change as soon as content changed.
Pros: Much less bandwidth usage and very quick update.
Cons: Not all servers provides and obey following guidelines. Need to get real resource using GET request if you find data is need to fetch
Using GET
Request page using GET verb and using conditional Header Tags:
* If-Modified-Since: Server will check if resource modified since following time and return content or return 304 Not Modified
Pros: Still Using less bandwidth, Single trip to receive data.
Cons: Again not all resource support this header.
Finally, maybe mix of above solution is optimum way for doing such action.
If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.
Would you look at the Kb size of the HTML? Would you look at the string length and check if for example the length has changed more than 5%, the content has been "changed"?
That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.
Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content has been changed?
You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
Hope this helps.
store the html files -- two versions..
one was the html which was taken before an hour. -- first.html
second is the html which was taken now -- second.html
Run the command :
$ diff first.html second.html > diffs.txt
If the diffs has some text then the file is changed.
Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.
You can even tell git to ignore "trivial" changes, such as adding and removing of whitespace characters to further optimize the search.
Practically what this comes down to is parsing the output of git diff -b --numstat HEAD HEAD^; which roughly translates to "find me what has changed in all the files, ignoring any whitespace changes, between the current state, and the previous state"; which will result in output like this:
2 37 en/index.html
2 insertions were made, 37 deletions were made to en/index.html
Next you'll have to do some experimentation to find a "threshold" at which you would consider a change significant in order to process the files further; this will take time as you will have to train the system (you can also automate this part, but that is another topic all together).
Unless you have a very good reason to do so - don't use your traditional, relational database as a file system. Let the operating system take care of files, which its very good at (something a relational database is not designed to manage).
You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-modified" header in the response.
import requests
response = requests.head(url)
datetime_str = response.headers["last-modified"]
And keep checking if that field changes in a while loop and compare the datetime difference.
I did a little program on Python to do that:
https://github.com/javierdechile/check_updates_http

Replace strings in large file

I have a server-client application where clients are able to edit data in a file stored on the server side. The problem is that the file is too large in order to load it into the memory (8gb+). There could be around 50 string replacements per second invoked by the connected clients. So copying the whole file and replacing the specified string with the new one is out of question.
I was thinking about saving all changes in a cache on the server side and perform all the replacements after reaching a certain amount of data. After reaching that amount of data I would perform the update by copying the file in small chunks and replace the specified parts.
This is the only idea I came up with but I was wondering if there might be another way or what problems I could encounter with this method.
When you have more than 8GB of data which is edited by many users simultaneously, you are far beyond what can be handled with a flatfile.
You seriously need to move this data to a database. Regarding your comment that "the file content is no fit for a database": sorry, but I don't believe you. Especially regarding your remark that "many people can edit it" - that's one more reason to use a database. On a filesystem, only one user at a time can have write access to a file. But a database allows concurrent write access for multiple users.
We could help you to come up with a database schema, when you open a new question telling us how your data is structured exactly and what your use-cases are.
You could use some form of indexing on your data (in a separate file) to allow quick access to the relevant parts of this gigantic file (we've been doing this with large files successfully (~200-400gb), but as Phillipp mentioned you should move that data to a database, especially for the read/write access. Some frameworks (like OSG) already come with a database back-end for 3d terrain data, so you can peek there, how they do it.

url shortening algorithm in c/c++ - interview

Refer - https://stackoverflow.com/a/742047/161243
Above algo says that we use a DB to store the data. Now if interviewer says that you can't use a DB. Then in that case we can have a stucture:
struct st_short_url{
char * short_url;
char * url;
}
Then a hashtable - st_short_url* hashTable[N];
Now we can have an int id which is incremented each time or a random number generated id which is converted to base62.
Problem i see:
-- if this process terminates then i lose track of int id and complete hashTable from RAM. So do i keep writing the hashTable back to disk so that it is persisted? if yes, then a B-tree will be used? Also we need to write id to disk as well?
P.S. Hashtable+writing to disk is Database, but what if i can't use a DBMS? What if i need to come up with my own implementation?
Your thoughts please...
Another Question:
In general, How do we handle infinite redirects in URL shortening?
If you can't use a DB of any kind (i.e. no persistent storage; the file system is nothing but a primitive DB!), then the only way to do it which I see is lossless compression + encoding in allowed characters. The compression algorithm may employ knowledge about URLS (e.g. that it is very likely that they begin with either http:// or https://, quite a few go on with www. and the domain name most often ends in .com, .org or .net. Moreover you can always assume a slash after the host name (because http://example.org and http://example.org/ are equivalent). You also may assume that the URL only contains valid characters, and special-case some substrings which are very likely to occur in the URL (e.g. frequently linked domains, or known naming schemes for certain sites). Probaby the compression scheme should feature a version field so that you can update the algorithm when usage patterns change (e.g. a new web site gets popular and you want to special-case that as well, or a popular site changes its URL pattern which you special-cased) without risking the old links to go invalid.
Such a scheme could also be supported directly in the browser through an extension, saving server bandwidth (the server would still have to be there for those without a browser extension and as fallback if the extension doesn't yet have the newest compression data).
The requirement isn't practical, but you don't have to give a practical answer. Just use the file system and he won't realize that.
To store:
convert input URL to a string e.g. base64 conversion.
make a file of that name
return the inode number as the short url (e.g. ls -i filename ) or stat() etc.
To retrieve:
get the inode number from user.
find / -inum n -print or some other mechanism.
convert that back to a URL from filename.
A database is a data structure that supports insertion, removal and search of items. As has been pointed out in the comments to the OP, nearly everything is a database, so this constraint seems somewhat uninformed.
If you're not allowed to use an existing DBMS, you can resort to storing items on disk, making use of tmpnam() or a similar technique that doesn't suffer from race conditions. tmpnam() yields unique IDs, and you can use the associated file to store information.

How would I get a subset of Wikipedia's pages?

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML but its more like 1 or 2 gigs; I don't need that much.
I want to experiment with implementing a map-reduce algorithm.
Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.
Edit: Any that aren't torrents? I can't get those at work.
The stackoverflow database is available for download.
Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
One option is to download the entire Wikipedia dump, and then use only part of it. You can either decompress the entire thing and then use a simple script to split the file into smaller files (e.g. here), or if you are worried about disk space, you can write a something a script that decompresses and splits on the fly, and then you can stop the decompressing process at any stage you want. Wikipedia Dump Reader can by your inspiration for decompressing and processing on the fly, if you're comfortable with python (look at mparser.py).
If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
If you wanted to get a copy of the stackoverflow database, you could do that from the creative commons data dump.
Out of curiosity, what are you using all this data for?
You could use a web crawler and scrape 100MB of data?
There are a lot of wikipedia dumps available. Why do you want to choose the biggest (english wiki)? Wikinews archives are much smaller.
One smaller subset of Wikipedia articles comprises the 'meta' wiki articles. This is in the same XML format as the entire article dataset, but smaller (around 400MB as of March 2019), so it can be used for software validation (for example testing GenSim scripts).
https://dumps.wikimedia.org/metawiki/latest/
You want to look for any files with the -articles.xml.bz2 suffix.