url shortening algorithm in c/c++ - interview - c++

Refer - https://stackoverflow.com/a/742047/161243
The algorithm above uses a DB to store the data. Now suppose the interviewer says that you can't use a DB. In that case we can have a structure:
struct st_short_url {
    char *short_url;
    char *url;
};
Then a hashtable - st_short_url* hashTable[N];
Now we can have an int id which is incremented each time or a random number generated id which is converted to base62.
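The id-to-base62 step can be sketched as follows. This is a minimal Python illustration rather than the linked answer's exact code; the alphabet order (digits, then lowercase, then uppercase) and the function names are assumptions made here.

```python
import string

# Assumed alphabet order: digits, lowercase, uppercase (62 symbols in total).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(n):
    """Convert a non-negative integer id to its base62 string."""
    if n < 0:
        raise ValueError("id must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))


def decode_base62(s):
    """Convert a base62 string back to the integer id."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With an incrementing id, the short URL is simply encode_base62(id), and the lookup key is recovered with decode_base62.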
Problem i see:
-- If this process terminates, I lose track of the int id and the complete hashTable, because both live only in RAM. So do I keep writing the hashTable back to disk so that it is persisted? If yes, would a B-tree be used? And do we need to write the id to disk as well?
P.S. Hashtable + writing to disk is a database, but what if I can't use a DBMS? What if I need to come up with my own implementation?
Your thoughts please...
Another Question:
In general, how do we handle infinite redirects in URL shortening?

If you can't use a DB of any kind (i.e. no persistent storage; the file system is nothing but a primitive DB!), then the only way to do it that I see is lossless compression + encoding in allowed characters. The compression algorithm may employ knowledge about URLs (e.g. that it is very likely that they begin with either http:// or https://, that quite a few go on with www., and that the domain name most often ends in .com, .org or .net). Moreover, you can always assume a slash after the host name (because http://example.org and http://example.org/ are equivalent). You may also assume that the URL only contains valid characters, and special-case some substrings which are very likely to occur in the URL (e.g. frequently linked domains, or known naming schemes for certain sites). Probably the compression scheme should feature a version field so that you can update the algorithm when usage patterns change (e.g. a new web site gets popular and you want to special-case that as well, or a popular site changes a URL pattern which you special-cased) without risking that old links become invalid.
Such a scheme could also be supported directly in the browser through an extension, saving server bandwidth (the server would still have to be there for those without a browser extension and as fallback if the extension doesn't yet have the newest compression data).
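As a rough illustration of the "special-case likely substrings + version field" idea, here is a toy Python sketch. It only substitutes known prefixes and is not real compression; the prefix table, the code characters and the version value are all hypothetical choices made for this example.

```python
# Hypothetical table for version "1" of the scheme. A later version could
# add more entries without invalidating tokens issued under version "1".
VERSION = "1"
PREFIXES_V1 = ["https://www.", "http://www.", "https://", "http://"]
CODES = "ABCD"  # one code character per prefix, in the same order


def compress_url(url):
    """Replace a known prefix by one character and prepend the version."""
    # Longer prefixes come first in the table so they win over shorter ones.
    for code, prefix in zip(CODES, PREFIXES_V1):
        if url.startswith(prefix):
            return VERSION + code + url[len(prefix):]
    return VERSION + "_" + url  # "_" marks "no known prefix"


def expand_url(token):
    """Invert compress_url without any stored state."""
    version, code, rest = token[0], token[1], token[2:]
    if version != VERSION:
        raise ValueError("unknown scheme version")
    if code == "_":
        return rest
    return PREFIXES_V1[CODES.index(code)] + rest
```

A usable scheme would go much further (suffix tables, a character-level entropy coder, then encoding into URL-safe characters), but the stateless round trip and the role of the version field are the same.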

The requirement isn't practical, but you don't have to give a practical answer. Just use the file system and he won't realize that.
To store:
convert input URL to a string e.g. base64 conversion.
make a file of that name
return the inode number as the short URL (e.g. from ls -i filename, or from stat()).
To retrieve:
get the inode number from user.
find / -inum n -print or some other mechanism.
convert that back to a URL from filename.
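The store and retrieve steps above might look like this in Python. It is only a sketch of the trick: the function names and the directory argument are assumptions, lookup is a linear scan of one directory (in place of find / -inum), and most file systems limit file names to about 255 bytes, so long URLs would need extra handling.

```python
import base64
import os


def store_url(directory, url):
    """Create an empty file named after the URL and return its inode number."""
    name = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    path = os.path.join(directory, name)
    with open(path, "a"):
        pass  # the file content is irrelevant; only the name matters
    return os.stat(path).st_ino


def lookup_url(directory, inode):
    """Scan the directory for the inode and decode the file name back to the URL."""
    for name in os.listdir(directory):
        if os.stat(os.path.join(directory, name)).st_ino == inode:
            return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")
    return None
```

The returned inode number could then be converted to base62 to make the short URL more compact.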

A database is a data structure that supports insertion, removal and search of items. As has been pointed out in the comments to the OP, nearly everything is a database, so this constraint seems somewhat uninformed.
If you're not allowed to use an existing DBMS, you can resort to storing items on disk, making use of tmpnam() or, preferably, a similar function that doesn't suffer from race conditions, such as mkstemp(). Either one yields a unique file name that can serve as the ID, and you can use the associated file to store information.
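A minimal Python sketch of that idea, using tempfile.mkstemp (which creates the file atomically, so there is no race between choosing the name and creating the file). The function names and the storage directory are assumptions for illustration.

```python
import os
import tempfile


def save_url(directory, url):
    """Create a uniquely named file holding the URL; the file name is the id."""
    fd, path = tempfile.mkstemp(prefix="", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(url)
    return os.path.basename(path)


def load_url(directory, short_id):
    """Read the URL back from the file named by the id."""
    # Reject ids containing path separators so a lookup cannot escape the directory.
    if os.path.basename(short_id) != short_id:
        raise ValueError("invalid id")
    with open(os.path.join(directory, short_id), encoding="utf-8") as f:
        return f.read()
```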

Related

procmail: getting procmail to exclude hostname while saving Maildir format messages

How do I get procmail to save messages in my Maildir folder, but not include the hostname in the file (message name)? I get the following message names in my new/ sub-folder:
1464003587.H805375P95754.gator3018.hostgator.com, S=20238_2
I just want to eliminate the hostname. Is that possible to do using procmail? How? Separately, is it possible to replace the first time stamp with the time the message was sent? Is it possible to prescribe a file name format for procmail?
No, you can't override Maildir's filename format, not least because it's prescribed to be in a particular way for interoperability reasons. The format is guaranteed to be robust against clashes when multiple agents on multiple hosts concurrently write to the same message store. This can only work correctly if they all play by the same rules. An obvious part of those rules is the one which dictates that the host name where the agent is running must be included in the filename of each new message.
The Wikipedia Maildir article has a good overview of the format's design and history, and of course links to authoritative standards and other primary sources.
If you don't particularly require Maildir compatibility (with the tmp / new / cur subdirectories etc) you can simply create a unique mbox file on each run; if you can guarantee that it is unique, you don't need locking when you write to it.
For example, if you have a tool called uuid which generates a guaranteed unique identifier on each invocation, you can easily use that as the file name:
:0 # or maybe :0r
`uuid`
It should be easy to see how to supply your own tool instead, if you really think you can create your own solution for concurrent delivery. (Maildir solves concurrent and distributed delivery, so the requirements for that are stricter.)
The other formats supported by Procmail have their own hardcoded rules for how file names are generated, though perhaps the simple MH folder format, with a (basically serially incrementing) message number as the file name, would be worth investigating as well. The old mini-FAQ has a brief overview of the supported formats and how to select which one Procmail uses for delivery in each individual recipe.

How to check if content of webpage has been changed?

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.
I'm thinking of comparing hashes. The problem with this is that if the page has changed by a single byte or character, the hash will be different. So, for example, if the page displays the current date, the hash would be different every single time and tell me that the content has been updated.
So... How would you do this? Would you look at the Kb size of the HTML? Would you look at the string length and check if, for example, the length has changed by more than 5%, and then treat the content as "changed"? Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have changed?
About Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it is a reliable solution. I think a better way is to combine the hash and content-length approaches: check the hash, and if it has changed, check the string length.
There is no universal solution.
Use If-Modified-Since or HEAD when possible (usually ignored by dynamic pages)
Use RSS when possible.
Extract the last-modification stamp in a site-specific way (news sites have publication dates for each article, easily extractable via XPath)
Only hash interesting elements of page (build site-specific model) excluding volatile parts
Hash whole content (useless for dynamic pages)
Safest solution:
Download the content, compute a SHA512 hash of it, keep the hash in the db and compare it each time.
Pros: You do not depend on any server headers and will detect any modification.
Cons: Too much bandwidth usage. You have to download all the content every time.
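A minimal sketch of the hash-and-compare step in Python 3 (the question targets Python 2.7, where hashlib works the same way). Downloading the page and persisting the previous digest are left out; the function names are assumptions.

```python
import hashlib


def content_digest(body):
    """Return the SHA-512 hex digest of the downloaded page (bytes)."""
    return hashlib.sha512(body).hexdigest()


def has_changed(body, previous_digest):
    """Compare against the stored digest; return (changed, new_digest)."""
    digest = content_digest(body)
    return digest != previous_digest, digest
```

The caller would store new_digest after each check and pass it back in as previous_digest on the next one.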
Using HEAD
Request the page using the HEAD verb and check the response headers:
Last-Modified: The server should provide the last time the page was generated or modified.
ETag: A checksum-like value which is defined by the server and should change as soon as the content changes.
Pros: Much less bandwidth usage and very quick updates.
Cons: Not all servers provide these headers or follow the guidelines for them. You still need a GET request to fetch the actual resource once you find that it has changed.
Using GET
Request the page using the GET verb with a conditional request header:
* If-Modified-Since: The server checks whether the resource has been modified since the given time and returns either the content or 304 Not Modified
Pros: Still uses less bandwidth, and a single round trip is enough to receive the data.
Cons: Again, not all resources support this header.
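A conditional GET could be sketched like this with Python 3's standard library. The helper names are assumptions; If-None-Match is the standard companion header for ETag values and is added here even though the answer only names If-Modified-Since.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None, etag, last_modified
        raise
```

Servers that ignore the conditional headers simply answer 200 with the full body, so this degrades to a plain download and can be combined with the hash comparison.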
Finally, a mix of the above solutions is probably the best way to do this.
If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.
Would you look at the Kb size of the HTML? Would you look at the string length and check if for example the length has changed more than 5%, the content has been "changed"?
That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.
Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content has been changed?
You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
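The byte-frequency ordering described above could be sketched as follows in Python 3. The distance measure (sum of rank displacements) is one possible way to "diff" two orderings and is an assumption of this example, as are the function names.

```python
from collections import Counter


def frequency_signature(data):
    """Return the 256 byte values ordered from most to least frequent."""
    counts = Counter(data)  # iterating bytes yields ints in Python 3
    # Ties (including bytes that never occur) are broken by byte value,
    # so the signature is deterministic.
    return sorted(range(256), key=lambda b: (-counts[b], b))


def signature_distance(sig_a, sig_b):
    """Sum over all byte values of how far each moved in rank."""
    rank_b = {b: i for i, b in enumerate(sig_b)}
    return sum(abs(i - rank_b[b]) for i, b in enumerate(sig_a))
```

A distance of 0 means the ordering is identical; what distance counts as a "meaningful" change has to be tuned per site.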
An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
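A simplified sketch of per-slice hashing in Python 3. For brevity it slices plain text on blank lines, which is an assumption; slicing HTML by header, body and heading levels as suggested above would need an HTML parser.

```python
import hashlib


def slice_hashes(text):
    """Hash each non-empty, blank-line-separated paragraph separately."""
    slices = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in slices]


def changed_fraction(old_hashes, new_hashes):
    """Fraction of the new slices whose hash did not occur in the old version."""
    if not new_hashes:
        return 0.0 if not old_hashes else 1.0
    old = set(old_hashes)
    changed = sum(1 for h in new_hashes if h not in old)
    return changed / len(new_hashes)
```

With 30 slices, the "2 of 30" rule corresponds to treating the document as unchanged while changed_fraction stays at or below 2/30.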
You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
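For instance, volatile values can be replaced by placeholders before hashing. The two patterns below (HH:MM[:SS] times and ISO-style dates) are only assumptions for illustration; real pages need patterns tuned to their own formats.

```python
import hashlib
import re

# Illustrative patterns only: 10:15 or 10:15:00, and 2014-06-08.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def normalized_digest(html):
    """Hash the page after replacing dates and times with fixed placeholders."""
    text = DATE_RE.sub("<date>", html)
    text = TIME_RE.sub("<time>", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two downloads that differ only in such a timestamp then produce the same digest, while any other edit still changes it.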
You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
Hope this helps.
Store two versions of the HTML file:
the first is the HTML as it was fetched an hour ago -- first.html
the second is the HTML as fetched now -- second.html
Run the command:
$ diff first.html second.html > diffs.txt
If diffs.txt contains any text, the file has changed.
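The same comparison can be done from Python with the standard difflib module instead of shelling out to diff. This is a sketch; the function name is an assumption.

```python
import difflib


def changed_lines(old_html, new_html):
    """Return the lines removed ("- ...") or added ("+ ...") between versions."""
    diff = difflib.ndiff(old_html.splitlines(), new_html.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only edits.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

An empty result means the two versions are identical line for line.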
Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.
You can even tell git to ignore "trivial" changes, such as adding and removing of whitespace characters to further optimize the search.
Practically, this comes down to parsing the output of git diff -b --numstat HEAD^ HEAD, which roughly translates to "find me what has changed in all the files, ignoring any whitespace changes, between the previous state and the current state", and which results in output like this:
2 37 en/index.html
2 insertions were made, 37 deletions were made to en/index.html
Next you'll have to do some experimentation to find a "threshold" at which you would consider a change significant in order to process the files further; this will take time as you will have to train the system (you can also automate this part, but that is another topic altogether).
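Parsing the --numstat output and applying a threshold could look like this. It is a sketch only: the threshold value of 10 changed lines is an arbitrary placeholder, and the function names are assumptions. The real output separates the three columns with tabs.

```python
def parse_numstat(output):
    """Parse `git diff --numstat` output into (insertions, deletions, path)."""
    results = []
    for line in output.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files are reported with "-" in both count columns.
        results.append((None if added == "-" else int(added),
                        None if deleted == "-" else int(deleted),
                        path))
    return results


def is_significant(stats, threshold=10):
    """True if any file has at least `threshold` inserted plus deleted lines."""
    return any((a or 0) + (d or 0) >= threshold for a, d, _ in stats)
```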
Unless you have a very good reason to do so, don't use your traditional, relational database as a file system. Let the operating system take care of files, which it's very good at (and which a relational database is not designed to manage).
You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-modified" header in the response.
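import requests
url = "https://example.com/"  # placeholder; use the page you want to watch
response = requests.head(url)
# The header is optional, so fall back to None when the server omits it
datetime_str = response.headers.get("Last-Modified")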
Then keep checking in a while loop whether that field changes, and compare the datetime difference.
I wrote a small Python program that does this:
https://github.com/javierdechile/check_updates_http

What are the best practices for user uploads with S3?

I was wondering what you recommend for running a user upload system with S3. I plan on using MongoDB for storing metadata such as the uploader, size, etc. How should I go about storing the actual file in S3?
Here are some of my ideas, what do you think is the best? All of these examples would involve saving the metadata to MongoDB.
1. Should I just store all the files in a bucket?
2. Maybe organize them into dates (e.g. 6/8/2014/mypicture.png)?
3. Should I save them all in one bucket, but with an added string (such as d1JdaZ9-mypicture.png) to avoid duplicates?
4. Or should I generate a long string for a folder, and store the file in that folder (to retain the original file name)? e.g. sh8sb36zkj391k4dhqk4n5e4ndsqule6/mypicture.png
This depends primarily on how you intend to use the pictures and which objects/classes/modules/etc. in your code will actually deal with retrieving them.
If you find yourself wanting to do things like - "all user uploads on a particular day" - A simple naming convention with folders for the year, month and day along with a folder at the top level for the user's unique ID will solve the problem.
If you want to ensure uniqueness and avoid collisions in your bucket, you could generate a unique string too.
However, since you've got MongoDB which (I'm assuming) will actually handle these queries for user uploads by date, etc., the choice of bucket layout becomes more aesthetic than functional.
If all you're storing in MongoDB is the key/URL, it doesn't really matter what the actual structure of your bucket is. Nevertheless, it makes sense to still split this up in some coherent way - maybe group all of a user's uploads and give each a unique name (either generate a unique name or add a unique prefix to the file name).
That being said, do you think there might be a point when you might look at changing how your images are stored? You might move to a CDN. A third party might come up with an even cheaper/better product which you might want to try. In a case like that, simply storing the keys/URLs in your MongoDB is not a good idea since you'll have to update every entry.
To make this relatively future-proof, I suggest you give your uploads a definite structure. I usually opt for:
bucket_name/user_id/yyyy/mm/dd/unique_name.jpg
Your database then only needs to store the file name and the upload time stamp.
You can introduce a middle layer in your logic (a new class perhaps or just a helper function/method) which then generates the URL for a file based on this info. That way, if you change your storage method later, you only need to make a small change in this middle layer (after migrating your files of course) and not worry about MongoDB.
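Such a middle layer might be as small as the two helpers below. This is a sketch under assumptions: the function names and the base URL parameter are hypothetical, and the user id is taken to be available alongside the stored file name and upload time stamp.

```python
from datetime import datetime


def object_key(user_id, uploaded_at, unique_name):
    """Build the key user_id/yyyy/mm/dd/unique_name from the stored metadata."""
    return "{}/{:%Y/%m/%d}/{}".format(user_id, uploaded_at, unique_name)


def object_url(base_url, user_id, uploaded_at, unique_name):
    """Join the current storage base URL (bucket or CDN) with the key."""
    return base_url.rstrip("/") + "/" + object_key(user_id, uploaded_at, unique_name)
```

Moving to a CDN or another provider then only means passing a different base_url (or changing object_key), not rewriting the MongoDB documents.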

store a high amount of HTML files

I'm working on an academic project (a search engine). The main functions of this search engine are:
1/-crawling
2/-storing
3/-indexing
4/-page ranking
All the sites that my search engine will crawl are available locally, which means it's an intranet search engine.
After the files found by the crawler have been stored, they need to be served quickly for caching purposes.
So I wonder: what is the fastest way to store and retrieve these files?
The first idea that came up was to use FTP or SSH, but these are connection-based protocols, and the time needed to connect, search for the file and fetch it is lengthy.
I've already read about Google's anatomy and saw that they use a data repository. I'd like to do the same, but I don't know how.
NOTES: I'm using Linux/Debian, and the search engine back-end is coded in C/C++. HELP!
Storing individual files is quite easy - wget -r http://www.example.com will store a local copy of example.com's entire (crawlable) content.
Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.
Another thing to consider is that maybe you don't really want to store all the pages yourself, but just forward to the site that actually contains the pages - that way, you only need to store a reference to what page contains what words, not the entire page. Since a lot of pages will have much repeated content, you only really need to store the unique words in your database and a list of pages that contain each word (if you also filter out words that occur on nearly every page, such as "if", "and", "it", "to", "do", etc., you can reduce the amount of data that you need to store). Count the occurrences of each word on each page, and then compare different pages to find the words that are meaningless to search on.
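The word-to-pages structure described here is an inverted index. A small Python sketch of it follows (the question's back-end is C/C++, so this only illustrates the data structure); the stop-word set is just the handful of words named above, and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import defaultdict

# Only the example words from the text; a real list would be much longer.
STOP_WORDS = {"if", "and", "it", "to", "do"}


def build_index(pages):
    """Map each word to {url: occurrence count}; pages maps url -> page text."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word in STOP_WORDS:
                continue
            index[word][url] = index[word].get(url, 0) + 1
    return index
```

A query for a word is then a single dictionary lookup returning the pages that contain it, with the counts available as a crude relevance signal.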
Well, if the program is to be constantly running during operation, you could just store the pages in RAM - grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
I gather from the question that the user is on a different machine from the search engine, and therefore from the cache. Perhaps I am overlooking something obvious here, but couldn't you just send them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.

What is the best way to encrypt hardcoded strings in C++?

Warning: C++ noob
I've read multiple posts on StackOverflow about string encryption. However, they don't answer my doubts.
I must insert one or two hardcoded strings in my code, but I would like to make them difficult to read in plain text when debugging/reverse engineering. That's not all: my strings are URLs, so a simple packet analyzer (Wireshark) can read them.
I said difficult because I know that, when the code runs, the string is decrypted somewhere (in RAM?) as plain text and somebody can read it. So, assuming that it is not possible to completely secure my string, what is the best way of encrypting/decrypting it in C++?
I was thinking of something like this:
//I've omitted all the #include and main stuff of course...
string encryptedUrl = "Ajdu67gGHhbh34590Hb6vfu6gu"; // Encrypted url with some known algorithm
// decrypt() is a hypothetical helper that returns the plain-text URL as a std::string
URLDownloadToFile(NULL, decrypt(encryptedUrl).c_str(), "C:\\temp.txt", 0, NULL);
What about packet analysis? I'm sure there's no way to hide the URL, but maybe I'm missing something? Thank you, and sorry for my bad English!
Edit 1: What my application does?
It's a simple login script. My application downloads a text file from a URL. This file contains an encrypted string that is read using the fstream library. The string is then decrypted and used to log in to another site. It is very weak, because there's no database, no salt, no hashing. My goal is to ensure that neither the URL nor the login string is "easy" to read from a static analysis of the binary, and that both are as hard as possible to recover with a dynamic analysis (debugging, reverse engineering, etc.).
If you want to stymie packet inspectors, the bare minimum requirement is to use https with a hard-coded server certificate baked into your app.
There is no panacea for encrypting in-app content. A determined hacker with the right skills will get at the plain url, no matter what you do. The best you can hope for is to make it difficult enough that most people will just give up. The way to achieve this is to implement multiple diverse obfuscation and tripwire techniques. Including, but not limited to:
Store parts of the encrypted url and the password (preferably a one-time key) in different locations and bring them together in code.
Hide the encrypted parts in large strings of randomness that looks indistinguishable from the parts.
Bring the parts together piecemeal. E.g., Concatenate the first and second third of the encrypted url into a single buffer from one initialisation function, concatenate this buffer with the last third in a different unrelated init function, and use the final concatenation in yet another function, all called from different random places in your code.
Detect when the app is running under a debugger and have different functions trash the encrypted contents at different times.
Detection should be done at various call sites using different techniques, not by calling a single "DetectDebug" function or testing a global bool, both of which create a single point of attack.
Don't use obvious names, like, "DecryptUrl" for the relevant functions.
Harvest parts of the key from seemingly unrelated, but consistent sources. E.g., read the clock and only use a few of the high bits (high enough that they won't change for the foreseeable future, but low enough that they're not all zero), or use a random sampling of non-volatile results from initialisation code.
This is just the tip of the iceberg and will only throw novices off the scent. None of it is going to stop, or even significantly slow down, a skillful attacker, who will simply intercept calls to the SSL library using a stealth debugger. You therefore have to ask yourself:
How much is it worth to me to protect this url, and from what kind of attacker?
Can I somehow change the system design so that I don't need to secure the url?
Try XorStr. It's what I used to use when trying to hamper static analysis. Most search results will come from game cheat forums; there is an HTML generator too.
However as others have mentioned, getting the strings is still easy for anyone who puts a breakpoint on URLDownloadToFile. However, you will have made their life a bit harder if they are trying to do static analysis.
I am not sure what your URLs do, and what your goal is in all this, but XorStr + anti-debug + packing the binary will stop most amateurs from reverse engineering your application.
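The transformation behind XOR-based string hiding can be illustrated in a few lines of Python. This shows only the idea, not the XorStr code itself: C++ implementations typically perform the encoding at compile time so the plain string never appears in the binary. The key and URL below are made-up examples.

```python
def xor_bytes(data, key):
    """XOR every byte with the repeating key; applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# Hypothetical values for illustration only.
KEY = b"\x5a\xc3\x17"
plain = b"http://example.com/login.txt"
hidden = xor_bytes(plain, KEY)      # what would be embedded in the program
restored = xor_bytes(hidden, KEY)   # what the program computes at run time
```

This only hampers static inspection (e.g. running strings on the binary); the decoded value is still in memory at run time, as the answers above point out.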