Upload & download a large file over the internet - web services

I have a requirement where I need to upload a large file (it can be 10 GB) to a shared Windows space from one application (say APP1). We have a separate application (say APP2) on a different network, and I need to download the same file in that second application via the internet.
My approach is to create a web service to upload the document to the shared space, and then expose another web service to the outside world to download the document.
My question is: how can I manage such huge file uploads/downloads through a web service?
Please suggest if someone has any idea. I have flexibility to use any third-party APIs, but the applications can talk only through web services.

From your question it's not really clear which development platform you mean, .NET, Java, etc.
Also it's important to know how interoperable your services should be, your security requirements, etc. Anyway, I'll try to come up with a couple of solutions which you might research in more detail if you find them useful.
.NET
It's relatively easy to build such a web service with WCF. It supports streaming, which can be interoperable, reliable, and secure to some extent. You can read more on this here. This approach implies you have a huge disk to store files and a good recovery plan in case it goes down or just dies.
.NET, Java, etc. - cloud based
There are a lot of vendors who provide cloud storage and APIs to work with it. It's an ideal solution for a quick start. They take care of data availability, redundancy, etc. All you have to do is use their API to upload and download files, and pay them for it :) In many cases it's worth it. Personally, I've worked with Amazon S3. Their API is simple to use and there are plenty of docs for it.
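As a rough illustration, here is a minimal sketch of what that looks like with the AWS SDK for Python (boto3). The bucket name and file paths are placeholders; the SDK transparently splits large transfers into parallel multipart chunks once they cross the configured threshold.

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Placeholder bucket/key names -- substitute your own.
    BUCKET = "my-shared-space"
    KEY = "uploads/huge-file.bin"

    s3 = boto3.client("s3")

    # Files above 64 MiB are split into 16 MiB parts and
    # transferred in parallel (multipart upload/download).
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,
        multipart_chunksize=16 * 1024 * 1024,
        max_concurrency=8,
    )

    # APP1 side: upload the 10 GB file.
    s3.upload_file("huge-file.bin", BUCKET, KEY, Config=config)

    # APP2 side: download it again over the internet.
    s3.download_file(BUCKET, KEY, "huge-file-copy.bin", Config=config)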
EDIT:
Amazon S3 provides a simple web-services interface that can be used to
store and retrieve any amount of data, at any time, from anywhere on
the web.
I think you should take a look at Amazon S3 overview here.
It also provides APIs for a number of different platforms - Java, .NET, Node.js, etc. You can find the full list here.
Hope it helps!

Related

Evaluating Google Cloud Storage (GCS) API options

I have been tasked to develop an interface layer that will be used internally by other developers as a means to access Google Cloud Storage (GCS). For me the process began by reading the online documentation. I am now at the point of making a decision as to which API we'll use. There are a couple of outstanding questions and that documentation mentions posting questions to SO, so here I am.
A quick tidbit of background seems in order. We are predominantly a C house, though we do have the means to build and invoke C++ methods from C. The internal users will write C code that invokes my C code, which in turn must provide access to GCS. So that's the high-level call stack. The question then becomes how to provide that access in the best manner, with performance as the top criterion.
To answer that, I began by reading the online documentation regarding the different GCS options. There is an XML and a JSON REST API to be had. We have an internal HTTP mechanism that would enable the C code to invoke those JSON/XML API methods directly. The documentation states the following:
"...JSON API is RESTful...and is specifically intended to be used with the Google Cloud Client Libraries."
What are Google Cloud Client Libraries, I wondered. From reading the documentation I gleaned that they are libraries written in a particular language that can be used to access GCS. The documentation seems to steer you in the direction of using one of these client libraries versus the JSON/XML ones.
So the first question I had was: what does the "specifically intended" business mean, exactly? After more reading, I arrived at the notion that these client libraries are an abstraction over the RESTful interfaces. These libraries invoke the JSON API methods, just as you could do yourself through the JSON API directly. Is that correct? If so, they seem like a convenience for interacting with GCS. The documentation even states that the cloud libraries "...provide better performance and usability" versus the JSON HTTP interface.
In the end, I see two paths forward:
Invoke the JSON API directly from C using our HTTP mechanism
Use the bridge to invoke C++ methods from C. Those C++ methods then invoke methods in the client libraries, which ultimately (if the above is correct) invoke the JSON API.
Note that I've already written some C code in-house that uses the aforementioned internal HTTP mechanism to interact with the Apache WebHDFS API, which is also RESTful and for which there are no client libraries. Thus I could leverage a fair portion of that code to re-use in this new development. For me it boils down to a question of performance. The second option above seems rather circuitous in comparison to the first. Thus the first would seem to yield some performance improvement over the second. However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that? The documentation states that the client libraries handle all of the low level communication with the server, including authentication. Is this part of the reason?
And so I am posing this question to those who are more experienced with GCS (and perhaps GCP for that matter): which route would provide better performance in your opinion? Invoking the JSON API directly or using the client libraries (C++ in our case)?
Thanks!
Disclaimer: I work on the C++ client libraries for Google Cloud, specifically the C++ GCS library.
However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that?
The libraries are tuned to use the RPCs "well". They have good defaults for retry and backoff strategies, they pick the right RPC when two alternatives are available, they can recover efficiently from an interrupted download or upload, and they may implement higher-level abstractions to improve performance.
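To make the retry-and-backoff point concrete, here is a generic sketch of the strategy in Python (an illustration of the idea, not the libraries' actual code):

    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_delay=0.5):
        """Retry a flaky call with exponential backoff and jitter."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                # Real code would only retry errors known to be transient.
                if attempt == max_attempts - 1:
                    raise
                # Sleep 0.5s, 1s, 2s, ... plus jitter so that many
                # clients do not retry in lockstep.
                time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))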
If you are asking "how can running their code to call an RPC be more efficient than me calling the same RPC", then the answer is "probably it cannot". I do not think that is the interesting problem. For example, if you want to upload a large file (say multiple GiB) the most efficient way in GCS is to split the file in chunks, use several parallel uploads to different GCS objects, then compose the objects, and delete the intermediate objects. That would be several times faster than uploading the file sequentially. Could you do the same? Sure! Do you want to? Maybe not.
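For the curious, here is a hedged sketch of that chunk-compose-delete approach using the google-cloud-storage Python client. Bucket and object names are placeholders, and a real implementation would also need retry logic and handling for compose's 32-source-object limit:

    import os
    from concurrent.futures import ThreadPoolExecutor

    from google.cloud import storage

    def parallel_composite_upload(bucket_name, dest_name, local_path,
                                  chunk_size=256 * 1024 * 1024):
        """Upload a large file as parallel chunks, then compose them."""
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        size = os.path.getsize(local_path)
        num_chunks = (size + chunk_size - 1) // chunk_size

        def upload_chunk(index):
            # Each worker reads only its own slice of the file.
            with open(local_path, "rb") as f:
                f.seek(index * chunk_size)
                data = f.read(chunk_size)
            blob = bucket.blob(f"{dest_name}.part-{index}")
            blob.upload_from_string(data)
            return blob

        # Upload every chunk to its own temporary object, in parallel.
        with ThreadPoolExecutor(max_workers=8) as pool:
            parts = list(pool.map(upload_chunk, range(num_chunks)))

        # Stitch the intermediate objects together, then delete them.
        bucket.blob(dest_name).compose(parts)
        for blob in parts:
            blob.delete()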
We also worry about resuming an interrupted download correctly. And verifying the integrity of your data as you upload and download it (you can turn off those features, of course).
Finally, we can change the protocol under the hood if there is a more efficient way to do things. For example, we already switch between XML and JSON when both are equivalent and one seems to perform better than the other. We are also working on supporting gRPC instead of REST, which is showing good improvements over the REST baseline (unfortunately it is not GA yet, and I do not have an estimated date).
HTH

File web service architecture

I need to implement a web service which could provide requested files to other internal applications or components running on different networks. The files are dispersed across different servers in different locations and can be as big as a few gigabytes.
I am thinking of creating a RESTful web service which will discover the requested file, redirect the HTTP request to another web service in the file's location, and send the file via HTTP.
Is it a good idea to send the file via HTTP or will it be better for the web service to copy the file to the location where requester component could access it?
The biggest problem with distributing large files over HTTP is that you will come across all sorts of limits that prevent it. As a simple example, WCF allows you to configure a maximum payload size, but you can only configure it up to 2 GB. You will likely run across issues like this in all layers of your stack. I doubt any of them are insurmountable (to work around the above limitation you can stream chunks of the file rather than the entire file, although that introduces its own problems), but you will likely have lots of timeouts and random failures, which get fixed by tweaking the configuration of this or that service or client.
Also, when dealing with large files, you have to carefully consider how you deal with the inevitable failures during transfer (e.g. the network drops out). Depending on the specific technologies you use, they may have some "resume" functionality, but you will want to be sure this is reliable before committing to it.
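One common workaround is to upload the file as independently retried chunks, so a dropped connection only costs you the current chunk. Below is a client-side sketch in Python; the endpoint and chunk-index headers are hypothetical, not part of any standard:

    import os

    import requests

    # Hypothetical endpoint and headers -- a real service would
    # define its own chunk protocol.
    UPLOAD_URL = "https://files.example.com/upload"
    CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per request

    def upload_in_chunks(path):
        size = os.path.getsize(path)
        total = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
        with open(path, "rb") as f:
            for index in range(total):
                data = f.read(CHUNK_SIZE)
                # Retry just this chunk instead of restarting the
                # whole multi-gigabyte transfer.
                for attempt in range(3):
                    try:
                        resp = requests.post(
                            UPLOAD_URL,
                            data=data,
                            headers={"X-Chunk-Index": str(index),
                                     "X-Chunk-Total": str(total)},
                            timeout=60,
                        )
                        resp.raise_for_status()
                        break
                    except requests.RequestException:
                        if attempt == 2:
                            raise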
One possibility would be to do what Facebook does when distributing large binaries - use BitTorrent. So, your web-service serves a torrent of the file, not the file itself. The big advantages of BitTorrent are it is very robust, and can scale well. It's worth considering, but it will depend a lot on your environment and specific workload.
If the files you are going to serve do not change often, or do not change at all, you could use many strategies, such as the one advised by RB, or use pure HTTP, which supports partial data operations; see RFC 2616 (a sketch of this follows below).
But depending on your usage scenario, I would also suggest you take a look at Amazon Web Services - S3 (Simple Storage Service), which probably already does what you are trying to do; it's cheap and has high availability.
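As a small illustration of those partial data operations, here is a sketch (URL and paths are placeholders) of resuming an interrupted download with an HTTP Range request, assuming the server honors them:

    import os

    import requests

    def resume_download(url, dest):
        """Continue a partial download from where it left off."""
        pos = os.path.getsize(dest) if os.path.exists(dest) else 0
        headers = {"Range": f"bytes={pos}-"} if pos else {}
        with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
            if pos and resp.status_code != 206:
                # Server ignored the Range header; start over.
                pos = 0
            with open(dest, "ab" if pos else "wb") as f:
                for chunk in resp.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)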

Service Oriented Architecture suggestions

For personal and university research reasons, I am thinking of building a simple CRM using a service-oriented architecture. Its purpose is just to demonstrate the architecture itself, not commercial use.
I was thinking of implementing a CRM that offers a simple analytics service and customer care (user storing, personal comments, and few other things).
The architecture that I'm designing defines:
- WebGUI (a client of the other services)
- AnalyticsService (a service that receives data, analyzes and collect it)
- CustomerCareService (a service that uses RESTful APIs to apply CRUD operations).
Each service has its own database, being completely independent from the others. Each exposes a public interface. The interface must of course provide some sort of authentication, to deny unauthorized requests.
The advantage I'd like to demonstrate with this kind of architecture is the possibility of having everything independent, plus the ability to combine services to offer new ones (for example, if there were an OrderService to handle orders, it would be easy to combine it with the customers using the public APIs). The big advantage to me is that it'd be easy enough to build other clients that use these services.
I don't know what a good authentication method would be that is also easy to implement. I'm also not sure how to design these APIs (use XML, or plain REST APIs with GET/POST data). I've worked with Amazon, PayPal, and other companies' APIs; they seem to use REST services (PayPal uses an ugly _cmd GET parameter while Amazon uses nicer URIs) to decide what to do, but reading about SOAs it appears that people also use XML. Of course, I also need to take into account that the web interface must be able to recognize the logged-in user, get the permissions (a token or whatever else), and use them with the services to show information.
So I'm not sure SOA is the kind of architecture I'm really building up... is it SaaS instead of SOA?
I think it would be better to use RESTful applications, with JSON or something like that to implement it (I'm not a big fan of XML, I find it to be too verbose).
For clarity I'm listing here my questions:
Is this kind of architecture called SOA or SaaS (or both)?
What is a good implementation for what I want to achieve? (Please explain in as much detail as possible.)
What sort of authentication is more suitable for a client (user token vs. OAuth or similar)?
Do you have some suggestion for this kind of project?
I have about 3 months to do it, so I cannot do anything really complex (besides the fact that it would not be realistic for a single programmer).
I know Python (WSGI frameworks), Ruby on Rails, C/C++ and other languages (.NET excluded), and I'd like to develop it in a Linux environment (MySQL or Postgres, or even a NoSQL store if you have a suggestion for the right choice). I could also combine several languages, since these services are independent programs.
What I'd like here is to have some good point of view and some good suggestion.
Thanks!
I would define SaaS as a business model rather than an architecture; however, like all business domain requirements, it will influence the systems architecture, even though it itself is not one. What you have defined is essentially a Service Oriented Architecture.
Your statement "independent and the ability to combine them to offer new services" is the essential non-functional design requirement that suggests SOA.
Good implementation of SOA is about having well-defined and flexible interfaces, with a very clear delineation of responsibilities. However, it is a difficult subject to be prescriptive about; the proof is in the eating: does it provide that flexible reuse? My suggestion is to spend time reading SOA design pattern resources and understand the defining characteristics with regard to the appropriate context for use, then apply the Single Responsibility principle at the appropriate level of abstraction. Compare (Domain) Space Based Architecture, which is a kind of SOA meta-pattern.
In regard to authorisation, I recommend following the service approach: use a distributed directory service such as OpenLDAP, and note that it is entirely reasonable for service providers and users to have their own credentials; you can use public-private keys for signing messages.
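As a toy illustration of message signing, here is the shape of the idea in Python using a shared-secret HMAC (a simpler symmetric variant of the public-private key signing mentioned above):

    import hashlib
    import hmac

    # Shared secret issued to one service consumer; in a public-key
    # setup this would be replaced by the consumer's private key.
    SECRET = b"per-consumer-secret"

    def sign(body: bytes) -> str:
        """Compute a signature the service can verify independently."""
        return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

    def verify(body: bytes, signature: str) -> bool:
        # Constant-time comparison avoids timing attacks.
        return hmac.compare_digest(sign(body), signature)

The client sends the signature alongside the request (e.g. in an X-Signature header), and the service rejects any request that fails verify().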
The main suggestion is study and learn from experience of others:
http://www.soapatterns.org/
http://martinfowler.com/eaaCatalog/
SOA doesn't force you to use XML.
Currently web technologies dominate and are defining the future, so in my company we selected JSON RESTful services as our foundation, and SOA as our principles.
There is no sense in suggesting languages, because the purpose of SOA, and of a good implementation, is to enable any language or framework to be used.
(FYI, we use Java with Spring MVC-based web services, Node.js, and PHP.)
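To make the JSON-over-REST suggestion concrete, here is a minimal sketch of what the asker's CustomerCareService CRUD interface could look like in Python with Flask. The token check and in-memory store are stand-ins for a real auth scheme and database:

    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    # Stand-ins for a real token store and database.
    VALID_TOKENS = {"secret-token-for-webgui"}
    customers = {}
    next_id = 1

    @app.before_request
    def require_token():
        # Deny unauthorized requests, as the question requires.
        token = request.headers.get("Authorization", "")
        if token.removeprefix("Bearer ") not in VALID_TOKENS:  # Python 3.9+
            abort(401)

    @app.post("/customers")
    def create_customer():
        global next_id
        customers[next_id] = request.get_json()
        next_id += 1
        return jsonify(id=next_id - 1), 201

    @app.get("/customers/<int:cid>")
    def read_customer(cid):
        if cid not in customers:
            abort(404)
        return jsonify(customers[cid])

    @app.delete("/customers/<int:cid>")
    def delete_customer(cid):
        customers.pop(cid, None)
        return "", 204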

Little known or useful Web Services we all should know about

Web services and web APIs have managed to increase the accessibility of the information stored and catalogued on the internet. They have also opened up a vast array of enterprise power functionality for smaller thin client applications.
By tapping into these services, developers can provide functionality that would have taken them months, perhaps years, to set up. They can combine them into single applications that make life generally easier for their users.
Whether displaying information about the music being played, finding items of interest in the locale of the user or just simply tweeting and blogging from the same application - the possibilities are growing everyday.
I want to know about the most interesting or useful services that are out there, especially ones that most of us may not have heard about yet. Do you maintain an API or service? Or do you have a clever mash-up that provides even more benefits than the originals?
YQL - Yahoo provides a tool that lets you query many different APIs across the web, even for sites that don't provide an API as such.
From the site:
The Yahoo! Query Language is an expressive SQL-like language that lets you query, filter, and join data across Web services.
...
With YQL, developers can access and shape data across the Internet through one simple language, eliminating the need to learn how to call different APIs.
The World Bank API is pretty cool. Google uses it in search results. My favourite implementations are the cartograms at worldmapper.
It's very niche, but I happen to think the OpenCongress API is amazing.
Less niche: Google Translate has an API which will guess the language of something. You'd be AMAZED how frequently this comes in handy (even though it's not as tweakable as you'd like and is not trained on small samples).
I was just about to have a stab at using the SoundCloud API.
I know many people who already use it for sharing their musical masterpieces, and it's a pretty good site. Hopefully the API will be as good!
I like the RESTful API for weather.com. It's free and very useful for the new age of location-aware apps: https://registration.weather.com/ursa/xmloap/step1
It does require registration, but they don't spam you or anything - it's just to provide you a key to use the API.
Ah yes - here's another one I've been meaning to check out but haven't tried yet.
The BBC offer a bunch of APIs/feeds that look very promising:
http://ideas.welcomebackstage.com/data
They include APIs for accessing schedule data for both TV and radio listings, along with all kinds of news searches. It even looks like they'll be offering some sort of geo-location service soon, so it will be interesting to see what that has to offer.
Another interesting one for liberal Brits! ;)
The Guardian newspaper has its own API:
http://www.guardian.co.uk/open-platform
MusicBrainz
An excellent service for music mashups.
Not many know that Last.fm's initial database was scraped from this service.
The United States Postal Service offers a web service that does address standardization. Quite useful in reducing clutter and cleaning data before it gets put into your database.

Web Service vs. Shared Library

This question has been asked a few times on SO from what I found:
When should a web service not be used?
Web Service or DLL?
The answers helped, but they were both somewhat tailored to a particular scenario. I wanted to get a more general take on this.
When should a Web Service be considered over a Shared Library (DLL) and vice versa?
Library Advantages:
Native code = higher performance
Simplest thing that could possibly work
No risk of centralized service going down and impacting all consumers
Service Advantages:
Everyone gets upgrades immediately and transparently (unless a versioned API is offered)
Consumers cannot decompile the code
Can scale service hardware separately
Technology agnostic. With a shared library, consumers must utilize a compatible technology.
More secure. The UI tier can call the service which sits behind a firewall instead of directly accessing the DB.
My thought on this:
A Web Service was designed for machine interop and to reach an audience easily by using HTTP as the means of transport.
A strong point is that by publishing the service you are also opening its use to an audience that is potentially vast (over the web, or at least throughout the entire company) and/or largely outside of your control / influence / communication channel, and you don't mind, or this is desired. Usage of the service is much easier, as clients simply need an internet connection to consume it, unlike a library, which may not be shared so easily (but can be). Usage of the service is largely open: you are making it available to whoever feels they could use it, however they feel like using it.
However, a web service is in general slower and is dependent on an internet connection. It's in general harder to test than a code library. It may be harder to maintain; much of that depends on your maintenance and coding practices.
I would consider a web service if several of the above features are desired, or at least one of them is considered paramount and the downsides are acceptable or a necessary evil.
What about a Shared Library?
What if you are far more in "control" of your environment, or want to be? You know who will be using the code (the interface isn't a problem to maintain), and you don't have to worry about interop. You are in a situation where you can easily achieve sharing without a lot of work / hoops to jump through.
Examples in my mind of when to use each:
You have many applications in your control, all hosted on the same server or two, that will use the library.
Not so good example: you have many applications, but all hosted on a dozen or so servers. A web service may be a better choice.
You are not sure who or how your code could be used, but you know it is of good value to many. Web service.
You are writing something only used by a limited set of applications, perhaps some helper functions. Library.
You are writing something highly specialized, not suited for consumption by many, such as an API for your line-of-business application that no one else will ever use. Library.
All other things being equal, it is easier to start with a shared library and turn it into a web service than vice versa.
There are many more, but these are some of my thoughts on it...
Based on multiple sources...
Common Shared Library
Should provide a set of well-known operations that perform common tasks (e.g., String parsing, numerical manipulations, builders)
Should Encapsulate common reusable code
Have minimal dependencies on other libraries
Provide stable interfaces
Services
Should provide reusable application-components
Provide common business services (e.g., rate-of-return calculations, performance reports, or transaction history services)
May be used to connect existing software from disparate systems or exchange data between applications
Here are 5 options and reasons to use them.
Service
has persistent state
you need to release updates often
solves a major business problem and owns the data related to it
need security: users can't see your code, users can't access your storage
need an agnostic interface like REST (you can auto-generate shallow REST clients for client languages easily)
need to scale separately
Library
you simply need a collection of reusable code
needs to run on the client side
can't tolerate any downtime
can't tolerate even a few milliseconds of latency
simplest solution that could possibly work
need to ship code to data (high throughput or map-reduce)
First provide a library. Then a service if the need arises (see the sketch after these lists).
agile approach: you start with the simplest solution, then expand
needs might evolve and become more like the "Service" cases
Library that starts a local service.
many apps on the host need to connect to it and send some data to it
Neither
you can't seriously justify even the library case
business value is questionable
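As a tiny illustration of the "library first, service if the need arises" option above, here is a sketch in Python (the names are made up): the logic lives in a plain library function, and the later service is just a thin transport wrapper, so existing library consumers keep working while remote consumers get a web service:

    from flask import Flask, jsonify, request

    # In reality this function would live in the shared library module
    # that other applications already import directly.
    def rate_of_return(start_value: float, end_value: float) -> float:
        """Simple calculation reused by every consumer."""
        return (end_value - start_value) / start_value

    # The service added later only translates HTTP to library calls;
    # all logic stays in the library, usable with or without it.
    app = Flask(__name__)

    @app.get("/rate-of-return")
    def rate_endpoint():
        start = float(request.args["start"])
        end = float(request.args["end"])
        return jsonify(rate=rate_of_return(start, end))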
Ideally, if I want both sets of advantages, I'll need a portable library with the agnostic interface glue, automatically updated, with obfuscated (hard-to-decompile) code or a secure in-house environment.
It may be possible to make that viable by using both a web service and a library.