server side technology for business logic - server-side

Say I want to develop a web application that will have registered users and will be registered as a twitter app (allowing users to give it permissions to view their timeline and post on their behalf). The sole function of the application will be to re-tweet tweets from users' timeline according to users' settings and desires.
I understand that the website for this app will use the common technologies like HTML, CSS and JS on the client-side. The server side (where the user defines what kind of tweets the application should retweet) will have to be coded in PHP/Python/Perl/... based on a DB MySQL/Postgre/...
What I don't understand, and would really appreciate your help with, is where the real "business logic" will be coded? For example, what technology should I use to code the function that will sit on my server: contacting Twitter server every 5 minutes, reading the timeline of every user I have, checking whether there are tweets worth retweeting (according to what the user has defined), and sending Tweeter the necessary commands to retweet the chosen tweets on behalf of my users.
All that will happen off-line for the user, and will be an on-going and cyclic process - but what technology should I use to code it?
Thanks!

I have heard about this API for PHP. It is actually the only one that I have heard of for PHP, though. I know that there are some good Python libraries out there, but I don't know about Perl.
I am actually working on a new API for C# (won't be a good fit for you, as you're clearly not using Windows Servers), and started building it while working on an enterprise web application that prompted several questions similar to your own.
Here is what you are going to have to do:
Before you start, you are going to have to get in touch with one of Twitter's data partners (I believe that you may contact Twitter for the reference)
The reason for this is that you are going to be requiring many more requests than you think
Twitter's time interval used for Twitter's recorded rate cap is 900 seconds (5 minutes)
With the general rate limit if you are querying a user's timeline only once every rate limit, you are limiting your number of visitors on your site to 300 at a time
Here's where it gets tricky - if every user makes one Tweet (meaning you send the Tweet - not rate limited - and then refresh the timeline - rate limited - so that they can see the updated tweet) you have now dropped your maximum number of active users at any given time to 150
Factor in the company's own timeline (-1 visitor), plus the number of visitors who leave their browsers open (now you need more logic, and you have to either kick them off or simply keep track of whose timeline you won't be refreshing), the number of users who make more than one tweet (-1 visitor for each Tweet), etc.
Moral of the story: contact one of their data partners and get yourself either unlimited requests, or at least a significant enough amount to accommodate your number of visitors/users (plus a bit of padding)
If you adhere to this advise, skip steps 2 and 3, otherwise, skip step 4
(Note: Steps 2 and 3 are only for rate-capped implementations) Using your desired language, make a service that runs on the server and makes the queries to Twitter
Based on the information that you gave, I suggest that you use Python to make this service
The service will run at all times and be on it's own clock to base the 5-minute intervals between the requests on
You will have to use a caching or a database system for storing the data
(Note: Steps 2 and 3 are only for rate-capped implementations) Add the necessary code to make a request to the service that you created for the data and perform these requests every 5 minutes
I suggest that the clock used for making these requests to the service be running a little bit behind the clock used for the service to account for instances of slow data transfer, etc.
You will also have to have to call some methods on the service for adding/removing users from the queue
(Note: Step 4 is only for unlimited request implementations) Forget about the service and simply include the request code directly in the page that the user is on.
The user's timeline will be updated based on when they visited the site or when their timeline was last refreshed (if a Tweet was made)
The only caveat to this implementation is that you will have to pay for the unlimited/larger data rate limit

Related

What is the cheapest way to send fast website update alerts to users as push notifications?

I'm trying to help an animal shelter deliver faster updates when a new pet is added to their website. This is likely to happen between 0-20 times a day.
The website is a simple data dump, animals are in tables with row delineation (easy to parse) and have unique IDs. When a new pet is added, ideally this would trigger a mobile notification to subscribed users (could also be an email message). The faster updates are sent, the better, but checking every 30 mins or so would be fine. Because this is for a charity, I want to spend as little as possible on resources (because I also want to be able to scale this up for other shelters that might want to use this).
For instance mobile notifications, Twitter seems to be a good candidate. It looks like my needs wont run into fees/restrictions.
The part that I'm stuck on is how best to ping the site for updates and publish those updates to twitter. The two options I've come up with are:
Build my own system. Use a web crawler like Scrapy to periodically crawl the site and check for new petIDs. Using AWS, I think I could get by with a nano instance (~$57 a year). Using dynamoDB to cache existing petIDs seems like a small additional cost. Use twitter API to post updates
Use an RSS feed generator like Feedity. These seem to be pretty expensive: Feedity is $180/year for hourly updates and $390 for 15 minute updates. Has API integrated with Twitter.
I'd like to know if there are any better/simpler/cheaper/more obvious options I may be overlooking. Thanks!

Designing a distributed web scraper

The Problem
Lately I've been thinking about how to go about scraping the contents of a certain big, multi-national website, to get specific details about the products the company offers for sale. The website has no API, but there is some XML you can download for each product by sending a GET request with the product ID to a specific URL. So at least that's something.
The problem is that there are hundreds of millions of potential product ID's that could exist (between, say, 000000001 and 500000000), yet only a few hundred thousand products actually exist. And it's impossible to know which product ID's are valid.
Conveniently, sending a HEAD request to the product URL yields a different response depending on whether or not the product ID is valid (i.e. the product actually exists). And once we know that the product actually exists, we can download the full XML and scrape it for the bits of data needed.
Obviously sending hundreds of millions of HEAD requests will take an ungodly amount of time to finish if left to run on a single server, so I'd like to take the opportunity to learn how to develop some sort of distributed application (totally new territory for me). At this point, I should mention that this particular website can easily handle a massive amount of incoming requests per second without risk of DOS. I'd prefer not to name the website, but it easily gets millions of hits per day. This scraper will have a negligible impact on the performance of the website. However, I'll immediately put a stop to it if the company complains.
The Design
I have no idea if this is the right approach, but my current idea is to launch a single "coordination server", and some number of nodes to communicate with that server and perform the scraping, all running as EC2 instances.
Each node will launch some number of processes, and each process will be designated a job by the coordination server containing a distinct range of potential product ID's to be scraped (e.g. product ID 00001 to 10000). These jobs will be stored in a database table on the coordination server. Each job will contain info about:
Product ID start number
Product ID end number
Job status (idle, in progress, complete, expired)
Job expiry time
Time started
Time completed
When a node is launched, a query will be sent to the coordination server asking for some configuration data, and for a job to work on. When a node completes a job, a query will be sent updating the status of the job just completed, and another query requesting a new job to work on. Each job has an expiry time, so if a process crashes, or if a node fails for any reason, another node can take over an expired job to try it again.
To maximise the performance of the system, I'll need to work out how many nodes should be launched at once, how many processes per node, the rate of HTTP requests sent, and which EC2 instance type will deliver the most value for money (I'm guessing high network performance, high CPU performance, and high disk I/O would be the key factors?).
At the moment, the plan is to code the scraper in Python, running on Ubuntu EC2 instances, possibly launched within Docker containers, and some sort of key-value store database to hold the jobs on the coordination server (MongoDB?). A relational DB should also work, since the jobs table should be fairly low I/O.
I'm curious to know from more experienced engineers if this is the right approach, or if I'm completely overlooking a much better method for accomplishing this task?
Much appreciated, thanks!
You are trying to design a distributed workflow system which is, in fact, a solved problem. Instead of reinventing the wheel, I suggest you look at AWS's SWF, which can easily do all state management for you, leaving you free to only worry about coding your business logic.
This is how a system designed using SWF will look like (Here, I'll use SWF's standard terminologies- you might have to go through the documentation to understand those exactly):
Start one workflow per productID.
1st activity will check whether this productID is valid, by making a HEAD request as you mentioned.
If it isn't, terminate workflow. Otherwise, 2nd activity will fetch relevant XML content, by making the necessary GET request, and persist it, say, in S3.
3rd activity will fetch the S3 file, scrape the XML data and do whatever with it.
You can easily change the design above to have one workflow process a batch of product IDs.
Some other points that I'd suggest you keep in mind:
Understand the difference between crawling and scraping: crawling means fetching relevant content from the website, scraping means extracting necessary data from it.
Ensure that what you are doing is strictly legal!
Don't hit the website too hard, or they might blacklist your IP ranges. You have two options:
Add delay between two crawls. This too can be easily achieved in SWF.
Use anonymous proxies.
Don't rely too much on XML results from some undocumented API, because that can change anytime.
You'll need high network performance EC2 instances. I don't think high CPU or memory performance would matter to you.

Using geobytes web service to get cities list

I wanted a free web service to get cities list and found geobytes. Its good. I wanted to know What is the meaning of 50000 request? On every key pressed it makes a HTTP request.So do they count this way?
but if you expect to be performing more than 50,000 requests per day (your average unique visitors X 5), then please tell us
Anyone who has used this please help.
I would imagine what it means is that going over 50,000 requests can be penalized in someway. A key press is not a request - but entering a city and fetching that cities' details would constitute 1 of the 50,000 requests.
Hope this helps.
I am the author and administrator of Goebytes'AutoCompleteCity API and there is now no practical limit to genuine use, and the reference to 50,000 lookup per day has been removed from the web site. I say practical, because it does have DOS attack prevention measures, but as the API is intended to be called from the browser (as opposed to a server - for that you would use the GetCityDetails API) its DOS protection measure of "1024 look ups per IP Address, per hour", should never cut in under any circumstances that I can imagine.

Facebook Graph API-Account suspension

I have a .Net application that uses list of names/email addresses and finds there match on Facebook using the graph API. During testing, my list had 900 names...I was checking facebook matches for each name in in a loop...The process completed...After that when I opened my Facebook page...it gave me message that my account has been suspended due to suspicious activities?
What am I doing wrong here? Doesn't facebook allow to search large number requests to their server? And 900 doesn't seem to be a big number either..
per the platform policies: https://developers.facebook.com/policy/ this may be the a suspected breach of their "Principals" section.
See Policies I.5
If you exceed, or plan to exceed, any of the following thresholds
please contact us by creating confidential bug report with the
"threshold policy" tag as you may be subject to additional terms: (>5M
MAU) or (>100M API calls per day) or (>50M impressions per day).
Also IV.5
Facebook messaging (i.e., email sent to an #facebook.com address) is
designed for communication between users, and not a channel for
applications to communicate directly with users.
Then the biggie, V. Enforcement. No surprise, it's both automated and also monitored by humans. So maybe seeing 900+ requests coming from your app.
What I'd recommend doing:
Storing what you can client side (in a cache or data store) so you make fewer calls to the API.
Put logging on your API calls so you, the developer, can see exactly what is happening. You might be surprise at what you find there.

Speeding up page loads while using external API's

I'm building a site with django that lets users move content around between a bunch of photo services. As you can imagine the application does a lot of api hits.
for example: user connects picasa, flickr, photobucket, and facebook to their account. Now we need to pull content from 4 different apis to keep this users data up to date.
right now I have a a function that updates each api and I run them all simultaneously via threading. (all the api's that are not enabled return false on the second line, no it's not much overhead to run them all).
Here is my question:
What is the best strategy for keeping content up to date using these APIs?
I have two ideas that might work:
Update the apis periodically (like a cron job) and whatever we have at the time is what the user gets.
benefits:
It's easy and simple to implement.
We'll always have pretty good data when a user loads their first page.
pitfalls:
we have to do api hits all the time for users that are not active, which wastes a lot of bandwidth
It will probably make the api providers unhappy
Trigger the updates when the user logs in (on a pageload)
benefits:
we save a bunch of bandwidth and run less risk of pissing off the api providers
doesn't require NEARLY the amount of resources on our servers
pitfalls:
we either have to do the update asynchronously (and won't have
anything on first login) or...
the first page will take a very long time to load because we're
getting all the api data (I've measured 26 seconds this way)
edit: the design is very light, the design has only two images, an external css file, and two external javascript files.
Also, the 26 seconds number comes from the firebug network monitor running on a machine which was on the same LAN as the server
Personally, I would opt for the second method you mention. The first time you log in, you can query each of the services asynchronously, showing the user some kind of activity/status bar while the processes are running. You can then populate the page as you get the results back from each of the services.
You can then cache the results of those calls per user so that you don't have to call the apis each time.
That lightens the load on your servers, loads your page fast, and provides the user with some indication of activity (along with incrimental updates to the page as their content loads). I think those add up to the best User Experience you can provide.