File web service architecture - web-services

I need to implement a web service that can provide requested files to other internal applications or components running on different networks. The files are dispersed across different servers in different locations and can be as big as a few gigabytes.
I am thinking of creating a RESTful web service that locates the file, redirects the HTTP request to another web service at a different location, and sends the file via HTTP.
Is it a good idea to send the file via HTTP, or would it be better for the web service to copy the file to a location where the requesting component can access it?

The biggest problem with distributing large files over HTTP is that you will come across all sorts of limits that prevent it. As a simple example, WCF allows you to configure the maximum payload size, but you can only configure it up to 2 GB. You will likely run into issues like this in all layers of your stack. I doubt any of them are insurmountable (to work around the above limitation you can stream chunks of the file rather than the entire file, although that introduces its own problems), but you will likely have lots of timeouts and random failures, which are fixed by tweaking the configuration of this or that service or client.
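The chunking workaround mentioned above can be sketched roughly as follows. This is a minimal Python illustration, not WCF code: the name `iter_chunks` and the 1 MiB default chunk size are assumptions made for the example.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks so the whole file is never held in memory."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        yield chunk


if __name__ == "__main__":
    # An in-memory stream stands in for a multi-gigabyte file on disk.
    fake_file = io.BytesIO(b"x" * 2500)
    print([len(c) for c in iter_chunks(fake_file, chunk_size=1000)])
```

Each chunk can then be sent as its own request or written to a streaming response; the "own problems" this introduces are mainly tracking which chunks arrived and reassembling them in order.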
Also, when dealing with large files, you have to carefully consider how you deal with the inevitable failures during transfer (e.g. the network drops out). Depending on the specific technologies you use, they may have some "resume" functionality, but you will want to be sure this is reliable before committing to it.
One possibility would be to do what Facebook does when distributing large binaries: use BitTorrent. Your web service then serves a torrent of the file, not the file itself. The big advantages of BitTorrent are that it is very robust and scales well. It's worth considering, but it will depend a lot on your environment and specific workload.

If the files you are going to serve change rarely or not at all, you have several options: the one suggested by RB, or plain HTTP, which supports partial (range) requests; see RFC 2616.
Depending on your usage scenario, I would also suggest taking a look at Amazon Web Services S3 (Simple Storage Service), which probably already does what you are trying to do. It is cheap and highly available.
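As a rough illustration of the partial-data support mentioned above: a client that wants to resume an interrupted download sends a `Range` header and checks the `Content-Range` header of the `206 Partial Content` reply. The helper names below are hypothetical; only the header formats come from the HTTP specification.

```python
import re
from typing import Dict, Optional, Tuple


def resume_headers(bytes_on_disk: int) -> Dict[str, str]:
    """Headers asking the server to continue after the bytes we already have."""
    if bytes_on_disk < 0:
        raise ValueError("bytes_on_disk must not be negative")
    if bytes_on_disk == 0:
        return {}  # nothing downloaded yet: plain GET for the whole file
    return {"Range": f"bytes={bytes_on_disk}-"}


_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'bytes first-last/total'; total is None when the server sends '*'."""
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), (None if total == "*" else int(total))
```

Before appending the received bytes, the client should confirm that the first offset in `Content-Range` equals the number of bytes it already holds; a server that ignores `Range` replies with `200` and the full body instead.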

Related

Webmachine on different machines

I have a Webmachine REST API server running on one machine. In anticipation of more traffic than this machine can handle, I will need to expand to other nodes on other CPUs. Is there a way of configuring this?
If not, what is the right way to distribute the load here? Would I need to do it manually through OTP, with concurrent workers and supervisors, where a worker is spawned and ships the request to the neighboring machine?
It depends on your use case. The best approach is to observe where you actually experience problems and react accordingly.
You could look at your application as three separate parts. The first is the REST interface, the second is the business logic (more on that later), and the third is the data itself (the resources; let's call it the data store, though it could just as well be another service).
Data
This one is the simplest. I assume you are using a separate service for this (like a Riak cluster), which you can scale separately. One thing to look into is making sure the connection between Webmachine and your data store can scale enough for your needs.
Interface
If your server just cannot handle enough requests, put another one next to it. You can use a router to dispatch requests to both of them, and since they will use the same data store, they will stay in "sync".
REST, being based on HTTP, assumes stateless communication. That means any two requests (from the same user or from two different ones) don't share any resources and can be handled by different applications (so you also don't have to share anything between your Webmachine instances).
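Because the instances share nothing, the dispatching described above can be as simple as round-robin. The following is a toy Python sketch of that idea, with a hypothetical `RoundRobinRouter` class and made-up backend addresses; a real deployment would normally use an off-the-shelf HTTP load balancer rather than custom code.

```python
import itertools
from typing import List


class RoundRobinRouter:
    """Hand out backends in turn; valid only because requests are stateless."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(list(backends))

    def pick(self) -> str:
        # Any backend can serve any request, so no session affinity is needed.
        return next(self._cycle)


if __name__ == "__main__":
    router = RoundRobinRouter(["10.0.0.1:8080", "10.0.0.2:8080"])  # assumed addresses
    for _ in range(4):
        print(router.pick())
```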
Domain logic
In theory you should not have any of this in your REST API server, but let's still discuss it a little.
Some of your requests might require a little more work than just serving content. You might be doing some computation (like serving statistics that need to be generated). You might be updating a resource in a way that requires changing more than one piece of data at once (one could think of it as a transaction). This could call for more computational power, or for state synchronization, which would make scaling harder.
The way around this is to separate REST from such logic, in particular by introducing micro-services, which you can scale up or down independently of Webmachine itself.
In Erlang you could introduce separate applications inside the Erlang VM. Those, in turn, could be scaled up with Distributed Erlang and a pool of workers (for example, poolboy). I would recommend this approach to start with, since it is the easiest to implement, and because of Erlang's asynchronous nature it can always be ported to an external micro-service later.
PS. System resources
You should also check whether your box can handle such traffic. One of the most common mistakes is not raising the maximum number of file descriptors in production. But again, first observe such problems, and then react to them. Premature optimization in most cases doesn't pay off.
PS2 What and when
You can monitor your applications and system resources with tools like Exometer or the more out-of-the-box WombatOAM.
And you can (and should) stress test your application with tools like Tsung or basho_bench.

Rest based services in back-end

We are building an online community of 30M+ users, with RESTful services in its back-end and a front-end that uses them.
My concern is: is it OK to use REST as an internal data transfer protocol, or will it significantly reduce performance compared with Java's binary serialization protocol (which is language dependent)? What other approaches or protocols can be used to keep it language independent and as fast as possible?
The REST approach can be quite OK, but the HTTP layer can slow things down.
If you use REST in the back-end, you should make sure that the connection between your back-end and front-end is kept open and not reopened with every request.
More details about http keep-alive can be found here: http://en.wikipedia.org/wiki/HTTP_persistent_connection
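A minimal Python sketch of the keep-alive advice above, using only the standard library. The function name `fetch_all` is hypothetical; the point it illustrates is that one `HTTPConnection` object is reused for every request instead of being created per call.

```python
import http.client
from typing import List, Tuple


def fetch_all(host: str, port: int, paths: List[str]) -> List[Tuple[int, bytes]]:
    """Issue several GET requests over a single persistent connection."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    results: List[Tuple[int, bytes]] = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read completely before the connection can be reused.
            body = response.read()
            results.append((response.status, body))
    finally:
        conn.close()
    return results
```

Whether the socket really stays open also depends on the server: it has to speak HTTP/1.1 and send a `Content-Length` (or chunked) body. The client may reconnect transparently if the server closes the connection, so reuse is best verified on the server side or with a packet capture.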
One advantage that REST gives you between front- and back-end layers is the flexibility to add a layer of HTTP caching in between to boost performance without needing to modify either of the existing layers. The same holds true for load-balancing to scale out the back-end, since HTTP load-balancers are very well understood and easy to deploy.
These two benefits of REST can result in a major benefit over more traditional RPC serialization techniques, depending on the situation, especially if you have "slow" back-end processes that can benefit from caching or being load-balanced.
The other place REST wins out is if you need to expand the client base using the back-end services (which I think you hinted at with the desire for language independence). Not only does a REST-based service layer allow you to intermingle client languages freely, but it also allows you to easily open up your API to 3rd-party developers with almost no extra effort. Having a platform for others to build on has proven to be wildly successful as a business model, and it never hurts to keep your development as open and flexible as possible.
This is something that you will have to measure and compare before making decisions. It depends on what information is transferred, how often etc. Serialization may not be the bottleneck. But it will be a good idea to consider Protocol Buffers at this scale.
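To make "measure and compare" concrete, here is a small Python sketch that compares the payload size of JSON with a fixed-layout binary encoding. Note the assumptions: the `(id, score)` record layout is invented, and the standard-library `struct` module is used only as a stand-in for a compact binary format; it is not Protocol Buffers and has none of its schema-evolution features.

```python
import json
import struct
from typing import List, Tuple

Record = Tuple[int, float]  # hypothetical (id, score) record

# Little-endian 32-bit int followed by a 64-bit float: 12 bytes per record.
_RECORD = struct.Struct("<id")


def encode_json(rows: List[Record]) -> bytes:
    return json.dumps([{"id": i, "score": s} for i, s in rows]).encode("utf-8")


def encode_binary(rows: List[Record]) -> bytes:
    return b"".join(_RECORD.pack(i, s) for i, s in rows)


def decode_binary(payload: bytes) -> List[Record]:
    return [
        _RECORD.unpack_from(payload, offset)
        for offset in range(0, len(payload), _RECORD.size)
    ]


if __name__ == "__main__":
    rows = [(i, i * 1.5) for i in range(1000)]
    print("json bytes:  ", len(encode_json(rows)))
    print("binary bytes:", len(encode_binary(rows)))
```

Size is only one axis; encode and decode time (for example with `timeit`) and network latency should be measured on representative data before drawing conclusions.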

sftp versus SOAP call for file transfer

I have to transfer some files to a third party. We can invent the file format, but want to keep it simple, like CSV. These won't be big files - a few 10s of MB at most and there won't be many - 3 files per night.
Our preference for the protocol is sftp. We've done this lots in the past and we understand it well.
Their preference is to do it via a web service/SOAP/https call.
The reason they give is reliability, mainly around knowing that they've fully received the file.
I don't buy this as a killer argument. You can easily build something into your file transfer process using sftp to make sure the transfer has completed, e.g. use headers/footers in the files, or move the file between directories, etc.
The only other argument I can think of is that over http(s), ports 80/443 will be open, so there might be less firewall work for our infrastructure guys.
Can you think of any other arguments either way on this? Is there a consensus on what would be best practice here?
Thanks in advance.
File completeness is a common issue in "managed file transfer". If you went for a compromise "best practice", you'd end up running either AS/2 (a web service-ish way to transfer files that incorporates non-repudiation via signed integrity checks) or AS/3 (same thing over FTP or FTPS).
One of the problems with file integrity and SFTP is that you can't arbitrarily extend the protocol like you can FTP and FTPS. In other words, you can't add an XSHA1 command to your SFTP transfer just because you want to.
Yes, there are other workarounds (like transactional files that contain hashes of files received), but at the end of the day someone's going to have to do some work...but it really shouldn't be this hard.
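One of the workarounds mentioned above, a companion file containing a hash of the data file, can be sketched in Python like this. The `.sha256` suffix and the function names are assumptions for illustration; the sender would upload the data file first and the manifest last, and the receiver would only process files whose manifest matches.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_path(data_file: Path) -> Path:
    return data_file.with_name(data_file.name + ".sha256")


def write_manifest(data_file: Path) -> Path:
    """Sender side: record the hash next to the data file."""
    manifest = manifest_path(data_file)
    manifest.write_text(f"{sha256_of(data_file)}  {data_file.name}\n", encoding="utf-8")
    return manifest


def is_complete(data_file: Path) -> bool:
    """Receiver side: True only if a manifest exists and the hash matches."""
    manifest = manifest_path(data_file)
    if not manifest.exists():
        return False
    parts = manifest.read_text(encoding="utf-8").split()
    return bool(parts) and parts[0] == sha256_of(data_file)
```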
If the third party you're talking to really doesn't have a non-web service call to accept large files, you might be their guinea pig as they try to navigate a brand new world. (Or they may have just fired all their transmissions folks and are now just realizing that the world doesn't operate on SOAP...yet. I've seen that happen too.)
Either way, unless they GIVE you the magic code/utility/whatever to do the file-to-SOAP transaction for them (and that happens too), I'd stick to your sftp guns until they find the right guy on their end to talk bulk data transmissions.
SFTP is a protocol for secure file transfer; SOAP is an API protocol, which can also be used to send files, either as attachments (i.e. MIME attachments) or as Base64-encoded data.
SFTP adds additional potential complexity around separate processes for encrypting/decrypting files (at-rest, if they contain sensitive data), file archiving, data latency, coordinating job scheduling, and setting-up FTP service accounts.
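For the Base64 option mentioned above, it is worth knowing the size cost: standard padded Base64 turns every 3 input bytes into 4 output characters. A short Python check of that arithmetic (the helper name `base64_size` is made up for this example):

```python
import base64


def base64_size(raw_size: int) -> int:
    """Length in bytes of standard, padded Base64 output for raw_size input bytes."""
    if raw_size < 0:
        raise ValueError("raw_size must not be negative")
    return 4 * ((raw_size + 2) // 3)


if __name__ == "__main__":
    payload = b"x" * 1000
    encoded = base64.b64encode(payload)
    print(len(payload), "->", len(encoded), "predicted:", base64_size(len(payload)))
```

So a file of a few tens of MB grows by roughly a third when embedded in a SOAP body this way, before any XML envelope overhead.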

Cost of serialization in web service

My next project involves the creation of a data API within an enterprise framework. The data will be consumed by several applications running on different software platforms. While my colleagues generally favour SOAP, I would like to use a RESTful architecture.
Most of the applications will only need a few objects at every call. Other applications will however sometimes need to make several sequential calls each involving thousands of records. I'm concerned about performance. Serialization/deserialization & network usage are where I fear to find a bottleneck. If each request involves a large delay, all of the enterprise's applications will be sluggish.
Are my fears realistic? Will serialization to a voluminous format like XML or JSON be a problem? Are there alternatives?
In the past, we've had to do these large data transfers using a "flatter"/leaner file format such as CSV for performance. How can I hope to achieve the performance I need using a web service?
While I'd prefer replies specific to REST, I'm interested in hearing how SOAP users might deal with this as well.
One advantage of REST is that you are free to use whatever media type you like. Why not continue to use text/csv? You could also enable HTTP compression to further reduce bandwidth consumption.
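A minimal Python sketch of this idea: the same rows are rendered either as `text/csv` or as `application/json`, and the CSV body is then gzip-compressed. The record layout, `FIELDS`, `sample_rows` and `render` are all invented for the example; a real service would pick the media type from the request's `Accept` header.

```python
import csv
import gzip
import io
import json
from typing import Dict, List

FIELDS = ["id", "name", "score"]  # hypothetical record layout


def sample_rows(count: int) -> List[Dict[str, object]]:
    return [{"id": i, "name": f"user{i}", "score": i % 100} for i in range(count)]


def render(rows: List[Dict[str, object]], media_type: str) -> bytes:
    """Serialize the same data according to the requested media type."""
    if media_type == "text/csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue().encode("utf-8")
    if media_type == "application/json":
        return json.dumps(rows).encode("utf-8")
    raise ValueError(f"unsupported media type: {media_type}")


if __name__ == "__main__":
    rows = sample_rows(5000)
    csv_body = render(rows, "text/csv")
    json_body = render(rows, "application/json")
    print("csv:", len(csv_body), "json:", len(json_body))
    print("csv gzipped:", len(gzip.compress(csv_body)))
```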
REST services are great for taking advantage of all different kinds of data formats. Whatever format fits your scenario best.
We offer both XML and JSON. The rendering time you mention really can be an issue. On the server side we use JAXB, whose standard Sun implementation is somewhat slow at marshalling XML. XML has the disadvantage of verbosity, but it is good for interoperability and has schemas and explicit versioning.
We compensated for the verbosity in several ways (mainly by limiting the result set):
If you have a container with items in it, offer paging in your XML response (both page size and page number, e.g. /items?page=0&size=3). The client can then reduce the response size itself by reducing the page size.
Offer collapsing of elements. For instance, several clients may be interested in only one data field of the whole item. Do this with a parameter (e.g. /items?select=name), so that only the nested element 'name' is included in your item element. This dramatically decreases size.
In general, give clients the power to limit the result set. They will definitely use it, because it speeds up response time on their side too :)
Also use compression; it shrinks verbose XML enormously (in our case the payload got 10 times smaller). On the client side you can request it with the header 'Accept-Encoding: gzip'. If you use Apache, the server configuration is also straightforward.
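The paging and `select` parameters described above could be handled along these lines. This is a hedged Python sketch: `page_items` and the shape of the returned envelope are assumptions, not the answerer's actual implementation, and the items are plain dictionaries rather than XML elements.

```python
from typing import Any, Dict, List, Optional


def page_items(
    items: List[Dict[str, Any]],
    page: int = 0,
    size: int = 50,
    select: Optional[str] = None,
) -> Dict[str, Any]:
    """Return one page of items, optionally collapsed to a single field."""
    if page < 0 or size <= 0:
        raise ValueError("page must be >= 0 and size must be > 0")
    window = items[page * size:(page + 1) * size]
    if select is not None:
        # Collapse each item to just the requested field, as with ?select=name.
        window = [{select: item[select]} for item in window]
    return {"page": page, "size": size, "total": len(items), "items": window}


if __name__ == "__main__":
    catalog = [{"id": i, "name": f"item{i}", "price": i * 2} for i in range(7)]
    # Equivalent of /items?page=0&size=3&select=name
    print(page_items(catalog, page=0, size=3, select="name"))
```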
I'd like to offer three guidelines:
one is the observation that there are many SOAP Web services out there (especially built with .NET 2.0 "ASMX" technology) that send down their data transfer objects serialized in XML. There are of course many RESTful services that send down XML or JSON. XML serialization/deserialization is rarely the constraining factor.
one common cause of bottlenecks in Web services is an interface that encourages client applications to get data by making those thousands of sequential calls (there is a term for it: a chatty interface). This is what you should avoid when you design your Web service's interface, regardless of what four-letter acronym you decide to go ahead with.
one thing to remember about REST is that it (partially) stands for a transfer of state, which may be ill-suited to some operations where you don't want to transfer the state of a business object from the server to a client application. In those cases, a SOAP Web service (as suggested by your colleagues) is more appropriate; or perhaps a combination of SOAP and REST services, where the REST services would take care of operations where the state transfer is appropriate, and the SOAP services would implement the rest (pun unintended :-)) of the operations.
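The chatty-interface guideline above can be illustrated with a toy Python model that simply counts round trips. `FakeService`, `get_one` and `get_many` are hypothetical names; in a real system each round trip would also pay network latency and per-request overhead.

```python
from typing import Dict, Iterable, List


class FakeService:
    """Stand-in for a remote service that counts how often it is called."""

    def __init__(self, records: Dict[int, str]) -> None:
        self._records = dict(records)
        self.round_trips = 0

    def get_one(self, key: int) -> str:
        self.round_trips += 1  # one network round trip per record
        return self._records[key]

    def get_many(self, keys: Iterable[int]) -> List[str]:
        self.round_trips += 1  # one round trip for the whole batch
        return [self._records[k] for k in keys]


if __name__ == "__main__":
    data = {i: f"record{i}" for i in range(1000)}

    chatty = FakeService(data)
    for key in data:
        chatty.get_one(key)

    batched = FakeService(data)
    batched.get_many(data.keys())

    print("chatty round trips: ", chatty.round_trips)
    print("batched round trips:", batched.round_trips)
```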

Web Service vs. Shared Library

This question has been asked a few times on SO from what I found:
When should a web service not be used?
Web Service or DLL?
The answers helped but they were both a little pointed to a particular scenario. I wanted to get a more general thought on this.
When should a Web Service be considered over a Shared Library (DLL) and vice versa?
Library Advantages:
Native code = higher performance
Simplest thing that could possibly work
No risk of centralized service going down and impacting all consumers
Service Advantages:
Everyone gets upgrades immediately and transparently (unless a versioned API is offered)
Consumers cannot decompile the code
Can scale service hardware separately
Technology agnostic. With a shared library, consumers must utilize a compatible technology.
More secure. The UI tier can call the service which sits behind a firewall instead of directly accessing the DB.
My thought on this:
A Web Service was designed for machine interop and to reach an audience
easily by using HTTP as the means of transport.
A strong point is that by publishing the service you also open it up to an audience
that is potentially vast (over the web, or at least throughout the entire company) and/or
largely outside of your control, influence, or communication channels, and you either
don't mind this or actually want it. Using the service is also much easier: clients
simply need an internet connection to consume it, whereas distributing a library
may not be so easy (though it can be done). The usage of the service is largely open: you are making it available to whoever feels they could use it, however they choose to use it.
However, a web service is in general slower and is dependent on an internet connection.
It's in general harder to test than a code library.
It may be harder to maintain. Much of that depends on your maintenance and coding practices.
I would consider a web service if several of the above features are desired or at least one of them
is considered paramount and the downsides were acceptable or a necessary evil.
What about a Shared Library?
What if you are far more in "control" of your environment or want to be? You know who will be using the code
(interface isn't a problem to maintain), you don't have to worry about interop. You are in a situation where
you can easily achieve sharing without a lot of work / hoops to jump through.
Examples in my mind of when to use:
You have many applications in your control all hosted on the same server or two that will use the library.
Not so good example, you have many applications but all hosted on a dozen or so servers. Web Service may be a better choice.
You are not sure who or how your code could be used but know it is of good value to many. Web Service.
You are writing something only used by a limited set of applications, perhaps some helper functions. Library.
You are writing something highly specialized that is not suited for consumption by many, such as an API for your
line-of-business application that no one else will ever use. Library.
All things being equal, it is easier to start with a shared library and turn it into a web service than the other way around.
There are many more but these are some of my thoughts on it...
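The point about starting with a library and later turning it into a web service can be sketched as follows. In this Python illustration the "library" is a plain function and the "service" is a thin adapter that only parses input and formats output; `rate_of_return`, `handle_query` and the query parameter names are all invented for the example.

```python
import json
from typing import Tuple
from urllib.parse import parse_qs


# --- library: plain, directly importable code -------------------------------
def rate_of_return(start_value: float, end_value: float) -> float:
    if start_value == 0:
        raise ValueError("start_value must not be zero")
    return (end_value - start_value) / start_value


# --- service adapter: added later, contains no business logic ---------------
def handle_query(query_string: str) -> Tuple[int, str]:
    """Map a query string such as 'start=100&end=150' to (status, JSON body)."""
    params = parse_qs(query_string)
    try:
        start = float(params["start"][0])
        end = float(params["end"][0])
        result = rate_of_return(start, end)
    except (KeyError, ValueError) as exc:
        return 400, json.dumps({"error": str(exc)})
    return 200, json.dumps({"rate_of_return": result})


if __name__ == "__main__":
    print(rate_of_return(100, 150))           # used as a library
    print(handle_query("start=100&end=150"))  # used behind an HTTP endpoint
```

Going the other way, extracting a clean library from code written around HTTP requests and responses, usually means untangling transport concerns from the logic first.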
Based on multiple sources...
Common Shared Library
Should provide a set of well-known operations that perform common tasks (e.g., String parsing, numerical manipulations, builders)
Should Encapsulate common reusable code
Have minimal dependencies on other libraries
Provide stable interfaces
Services
Should provide reusable application-components
Provide common business services (e.g., rate-of-return calculations, performance reports, or transaction history services)
May be used to connect existing software from disparate systems or exchange data between applications
Here are 5 options and reasons to use them.
Service
has persistent state
you need to release updates often
solves major business problem and owns data related to it
need security: user can't see your code, user can't access your storage
need an agnostic interface like REST (you can auto-generate shallow REST clients for client languages easily)
need to scale separately
Library
you simply need a collection of reusable code
needs to run on client side
can't tolerate any downtime
can't tolerate even few milliseconds of latency
simplest solution that could possibly work
need to ship code to data (high throughput or map-reduce)
First provide library. Then service if need arises.
agile approach: you start with the simplest solution, then expand
needs might evolve and become more like "Service" cases
Library that starts local service.
many apps on the host need to connect to it and send some data to it
Neither
you can't seriously justify even the library case
business value is questionable
Ideally, if I wanted the advantages of both, I would need a portable library with technology-agnostic interface glue that is updated automatically and is either obfuscated (hard to decompile) or confined to a secure in-house environment.
It might be possible to make this viable by using a web service and a library together.