Evaluating Google Cloud Storage (GCS) API options - C++

I have been tasked with developing an interface layer that will be used internally by other developers as a means to access Google Cloud Storage (GCS). For me the process began by reading the online documentation. I am now at the point of deciding which API we'll use. There are a couple of outstanding questions, and the documentation mentions posting questions to SO, so here I am.
A quick tidbit of background seems in order. We are predominantly a C house, though we do have the means to build and invoke C++ methods from C. The internal users will write C code that invokes my C code, which in turn must provide access to GCS. So that's the high-level call stack. The question then becomes: how do I provide that access in the best manner, with performance as the top criterion?
To answer that, I began by reading the online documentation about the different GCS options. There are both an XML and a JSON REST API. We have an internal HTTP mechanism that would enable the C code to invoke those JSON/XML API methods directly. The documentation states the following:
"...JSON API is RESTful...and is specifically intended to be used with the Google Cloud Client Libraries."
What are Google Cloud Client Libraries, I wondered. From the documentation I gleaned that they are libraries, written in a particular language, that can be used to access GCS. The documentation seems to steer you toward using one of these client libraries rather than calling the JSON/XML APIs directly.
So the first question I had was: what does the "specifically intended" business mean, exactly? After more reading, I arrived at the notion that these client libraries are an abstraction over the RESTful interfaces. These libraries invoke the JSON API methods, just as you could do yourself through the JSON API directly. Is that correct? If so, they seem like a convenient means of interacting with GCS. The documentation even states that the client libraries "...provide better performance and usability" versus the JSON HTTP interface.
In the end, I see two paths forward:
1. Invoke the JSON API directly from C using our HTTP mechanism (sketched below).
2. Use the bridge to invoke C++ methods from C. Those C++ methods then invoke methods in the client libraries, which ultimately (if the above is correct) invoke the JSON API (also sketched below, after the note).
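For concreteness, I imagine option 1 looking roughly like this, using libcurl as a stand-in for our internal HTTP mechanism. The endpoint shape comes from the GCS JSON API documentation; get_oauth_token() is a hypothetical placeholder for whatever our auth layer provides:

    // Option 1 sketch: download an object via the GCS JSON API directly.
    // get_oauth_token() is hypothetical; libcurl stands in for our own
    // internal HTTP mechanism.
    #include <curl/curl.h>
    #include <cstdio>
    #include <string>

    extern std::string get_oauth_token();  // placeholder for our auth layer

    bool download_object(std::string const& bucket, std::string const& object,
                         std::FILE* out) {
      // "object" must already be URL-encoded.
      std::string url = "https://storage.googleapis.com/storage/v1/b/" +
                        bucket + "/o/" + object + "?alt=media";
      std::string auth = "Authorization: Bearer " + get_oauth_token();

      CURL* curl = curl_easy_init();
      curl_slist* headers = curl_slist_append(nullptr, auth.c_str());
      curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
      curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);  // default callback fwrite()s
      CURLcode rc = curl_easy_perform(curl);

      curl_slist_free_all(headers);
      curl_easy_cleanup(curl);
      return rc == CURLE_OK;
    }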
Note that I've already written some C code in-house that uses the aforementioned internal HTTP mechanism to interact with the Apache WebHDFS API, which is also RESTful and for which there are no client libraries. I could therefore reuse a fair portion of that code in this new development. For me it boils down to a question of performance. The second option above seems rather circuitous compared to the first, so the first would seem to yield some performance improvement over the second. However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that? The documentation states that the client libraries handle all of the low-level communication with the server, including authentication. Is this part of the reason?
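And for completeness, here is roughly how I picture option 2's bridge: a thin extern "C" shim over the C++ client library. This is just a sketch assuming the google-cloud-cpp storage library; gcs_download and its buffer-based signature are made up for illustration:

    // Option 2 sketch: hypothetical C-callable shim over the C++ client.
    #include "google/cloud/storage/client.h"

    #include <cstddef>
    #include <cstring>
    #include <iterator>
    #include <string>

    namespace gcs = google::cloud::storage;

    extern "C" int gcs_download(char const* bucket, char const* object,
                                char* buf, std::size_t buf_size,
                                std::size_t* n_read) {
      // The client is thread-safe; credentials are picked up from the
      // environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
      static gcs::Client client;
      auto reader = client.ReadObject(bucket, object);
      std::string contents{std::istreambuf_iterator<char>{reader}, {}};
      if (!reader.status().ok() || contents.size() > buf_size) return -1;
      std::memcpy(buf, contents.data(), contents.size());
      *n_read = contents.size();
      return 0;
    }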
And so I am posing this question to those who are more experienced with GCS (and perhaps GCP for that matter): which route would provide better performance in your opinion? Invoking the JSON API directly or using the client libraries (C++ in our case)?
Thanks!

Disclaimer: I work on the C++ client libraries for Google Cloud, specifically the C++ GCS library.
However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that?
The libraries are tuned to use the RPCs "well". They have good defaults for retry and backoff strategies, and they pick the right RPC when two alternatives are available. They can recover efficiently from an interrupted download or upload, and they may implement higher-level abstractions to improve performance.
If you are asking "how can running their code to call an RPC be more efficient than me calling the same RPC?", then the answer is "probably it cannot". I do not think that is the interesting problem. For example, if you want to upload a large file (say multiple GiB), the most efficient way in GCS is to split the file into chunks, use several parallel uploads to different GCS objects, then compose the objects, and delete the intermediate objects. That would be several times faster than uploading the file sequentially. Could you do the same? Sure! Do you want to? Maybe not.
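(For what it's worth, the C++ client library wraps that split/compose/cleanup dance behind a single helper. A rough sketch; the file, bucket, and prefix names below are placeholders:)

    // Sketch: parallel composite upload with google-cloud-cpp.
    #include "google/cloud/storage/client.h"
    #include "google/cloud/storage/parallel_upload.h"

    #include <iostream>

    namespace gcs = google::cloud::storage;

    int main() {
      gcs::Client client;
      // Shards the file, uploads the shards in parallel, composes them
      // into the final object, and deletes the intermediate objects.
      auto metadata = gcs::ParallelUploadFile(
          client, "/data/large-file.bin", "my-bucket", "large-file.bin",
          "tmp-shard-", /*ignore_cleanup_failures=*/false);
      if (!metadata) {
        std::cerr << "upload failed: " << metadata.status() << "\n";
        return 1;
      }
      std::cout << "uploaded " << metadata->name()
                << " size=" << metadata->size() << "\n";
    }

The prefix names the temporary shard objects so they can be composed into the destination object and cleaned up afterwards.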
We also worry about resuming an interrupted download correctly. And verifying the integrity of your data as you upload and download it (you can turn off those features, of course).
Finally, we can change the protocol under the hood if there is a more efficient way to do things. For example, we already switch between XML and JSON when the two are equivalent and one seems to perform better than the other. We are also working on supporting gRPC instead of REST, which is showing good improvements over the REST baseline (unfortunately it is not GA yet, and I do not have an estimated date).
HTH

Related

Avro message for Google Cloud Pub-Sub?

What is the best data format for publishing to and consuming from Pub/Sub? I am looking at the Avro message format due to its binary encoding.
The use case is real-time microservice applications publishing Avro messages to Pub/Sub. Given that Avro is best suited to batching up messages (with a schema attached to the binary payload) before publishing, would it be a suitable format for this microservice use case?
The Google Cloud documentation contains some JSON examples, but when looking for efficiency the main suggestion is to use the available client libraries, unless your needs aren't met by what the client libraries offer or you are running on the Google App Engine standard environment, in which case using the two REST APIs is suggested.
In fact, the most important factor for efficiency is using the gRPC API instead of the REST API (which the client libraries do by default). As mentioned here:
There are two major factors at work here: more efficient data encoding and HTTP/2. gRPC keeps data in binary both in client memory and on the wire by building on HTTP/2 and Protocol Buffers. This eliminates processing and space required for string encoding schemes such as Base64 or JSON. In addition, HTTP/2 itself makes things go faster with multiplexed requests over a single connection and header compression.
I did not find any explicit mention of data formats. I suggest you use your preferred language for the message, for example Python. The client library is described here and there is sample code here.
Based on this StackOverflow post, you can optimize your Pub/Sub system by:
Making sure you are using gRPC
Batching where possible, to reduce the number of calls and the per-message latency overhead.
Only compressing when needed and after benchmarking (implies extra logic in your application)
Finally, if you intend to deploy a robust Pub/Sub system, have a look at this post by Anusha Ramesh. She is a project manager at Google and elaborates on three tips:
Don't underestimate the importance of capacity planning.
Make sure your pub/sub system is fault-tolerant.
NSM: Never Stop Monitoring.
There isn't going to be one correct answer for the best format to use for the messages for all use cases. Avro is certainly a popular choice. Protocol buffers would be another possibility, as would Thrift. For Pub/Sub, the data is all just bytes and it is up to the publisher and the subscriber to determine the interpretation of this data. People have run comparisons on the different data formats, so you may want to make the decision based on your needs in terms of performance and message sizes.
Pub/Sub itself uses Protocol buffers for defining its data types. With regard to batching, the Cloud Pub/Sub client libraries do batching themselves for publish, so you don't necessarily have to worry about that on your own. You can control the batch settings to optimize throughput and latency based on your use case by calling, for example, setBatchSettings in the Publisher.Builder for Java (other languages have an equivalent as well). You may decide to do your own batching if you want to associate some metadata with a set of messages instead of with each individual message or you have very specific needs in terms of how messages are batched together. Otherwise, depending on the client library to do the batching is probably the correct decision.
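For example, here is a rough sketch of the equivalent batching knobs in the C++ Pub/Sub client library (the project and topic names are placeholders, and the exact option names may vary between library versions):

    // Sketch: configuring publisher-side batching in the C++ client.
    #include "google/cloud/pubsub/publisher.h"

    #include <chrono>
    #include <iostream>

    namespace pubsub = google::cloud::pubsub;

    int main() {
      // Batch settings are fixed when the connection is created: flush a
      // batch after 100 messages or 10ms, whichever comes first.
      auto publisher = pubsub::Publisher(pubsub::MakePublisherConnection(
          pubsub::Topic("my-project", "my-topic"),
          google::cloud::Options{}
              .set<pubsub::MaxBatchMessagesOption>(100)
              .set<pubsub::MaxHoldTimeOption>(std::chrono::milliseconds(10))));

      auto id = publisher
                    .Publish(pubsub::MessageBuilder{}.SetData("payload").Build())
                    .get();  // blocks until the batch holding this message is sent
      if (!id) std::cerr << "publish failed: " << id.status() << "\n";
    }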

Upload & download large files over the internet

I have a requirement to upload a large file (possibly 10 GB) to a shared Windows space from one application (say APP1). We have a separate application (say APP2) on a different network, and I need to download the same file in that second application via the internet.
My approach is to create a web service to upload the document to the shared space, and then expose a web service to the outside world to download the document.
My question is: how can I manage huge file uploads/downloads through a web service?
Please suggest if you have any ideas. I have the flexibility to use any third-party APIs, but the applications can talk only through web services.
From your question it's not really clear which development platform you mean: .NET, Java, etc.
It's also important to know how interoperable your services should be, the security requirements, and so on. Anyway, I will try to come up with a couple of solutions that you might research in more detail if you find them useful.
.NET
It's relatively easy to build such a web service with WCF. It supports streaming, which can be interoperable, reliable, and secure to some extent. You can read more on this here. This approach implies you have a large disk to store the files, and a good recovery plan in case it goes down or just dies.
.NET, Java, etc. - cloud based
There are a lot of vendors who provide cloud storage and APIs to work with it. It's an ideal solution for a quick start. They take care of data availability, redundancy, etc. All you have to do is use their API to upload and download files, and pay them for it :) In many cases it's worth it. Personally, I used to work with Amazon S3. Their API is simple to use and there's plenty of documentation for it.
EDIT:
Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
I think you should take a look at the Amazon S3 overview here.
This also provides APIs for a number of different platforms - Java, .NET, Node.js, etc. You can find the full list here.
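To give a flavor of what these APIs look like, here is a rough sketch of an upload using the AWS SDK for C++ (bucket, key, and file names are placeholders; for truly huge files you would look at the SDK's multipart-upload support instead):

    // Sketch: simple S3 upload with the AWS SDK for C++.
    #include <aws/core/Aws.h>
    #include <aws/s3/S3Client.h>
    #include <aws/s3/model/PutObjectRequest.h>

    #include <fstream>
    #include <iostream>

    int main() {
      Aws::SDKOptions options;
      Aws::InitAPI(options);
      {
        Aws::S3::S3Client s3;  // credentials/region come from the environment

        Aws::S3::Model::PutObjectRequest request;
        request.SetBucket("my-bucket");
        request.SetKey("backups/big-file.bin");

        // Stream the file from disk instead of loading it into memory.
        auto body = Aws::MakeShared<Aws::FStream>(
            "upload-tag", "big-file.bin",
            std::ios_base::in | std::ios_base::binary);
        request.SetBody(body);

        auto outcome = s3.PutObject(request);
        if (!outcome.IsSuccess()) {
          std::cerr << "upload failed: "
                    << outcome.GetError().GetMessage() << "\n";
        }
      }
      Aws::ShutdownAPI(options);
    }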
Hope it helps!

File web service architecture

I need to implement a web service that provides requested files to other internal applications or components running on different networks. The files are dispersed across different servers in different locations and can be as big as a few gigabytes.
I am thinking of creating a RESTful web service that will discover the file, redirect the HTTP request to another web service at the appropriate location, and send the file via HTTP.
Is it a good idea to send the file via HTTP, or would it be better for the web service to copy the file to a location where the requesting component could access it?
The biggest problem with distributing large files over HTTP is that you will come across all sorts of limits that prevent it. As a simple example, WCF allows you to configure the maximum payload size, but only up to 2 GB. You will likely run across issues like this in all layers of your stack. I doubt any of them are insurmountable (to work around the above limitation you can stream chunks of the file rather than the entire file, although that introduces its own problems), but you will likely have lots of timeouts and random failures, fixed by tweaking the configuration of this or that service or client.
Also, when dealing with large files, you have to carefully consider how you deal with the inevitable failures during transfer (e.g. the network drops out). Depending on the specific technologies you use, they may have some "resume" functionality, but you will want to be sure this is reliable before committing to it.
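To illustrate what resuming looks like at the HTTP level, here is a rough sketch using libcurl (URL and file path are placeholders); under the hood it is just an HTTP/1.1 Range request:

    #include <curl/curl.h>
    #include <cstdio>

    // Rough sketch: resume an interrupted download by asking the server
    // only for the bytes we are missing (HTTP Range requests, RFC 2616).
    bool resume_download(char const* url, char const* path) {
      std::FILE* out = std::fopen(path, "ab");  // append to the partial file
      if (out == nullptr) return false;
      std::fseek(out, 0, SEEK_END);
      curl_off_t offset = std::ftell(out);      // bytes we already have

      CURL* curl = curl_easy_init();
      curl_easy_setopt(curl, CURLOPT_URL, url);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);  // default callback fwrite()s
      // Sends "Range: bytes=<offset>-", so the server skips what we have.
      curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, offset);
      CURLcode rc = curl_easy_perform(curl);

      curl_easy_cleanup(curl);
      std::fclose(out);
      return rc == CURLE_OK;
    }

Note that a production version would check that the server actually replied 206 Partial Content; a 200 response means the Range header was ignored, and appending the body would corrupt the file.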
One possibility would be to do what Facebook does when distributing large binaries - use BitTorrent. Your web service serves a torrent of the file, not the file itself. The big advantages of BitTorrent are that it is very robust and scales well. It's worth considering, but it will depend a lot on your environment and specific workload.
If the files you are going to serve do not change often, or do not change at all, you can use many strategies, such as the one advised by RB, or use pure HTTP, which supports partial data operations; see RFC 2616.
But depending on your usage scenario, I would also suggest you take a look at Amazon Web Services S3 (Simple Storage Service), which probably already does what you are trying to do; it's cheap and has high availability.

REST vs. SOAP: does REST have better performance?

I read some questions already posted here regarding SOAP and REST, and I didn't find the answer I am looking for. We have a system that was built using SOAP web services. The system is not very performant, and it is under discussion to replace all the SOAP web services with REST web services. Somebody has argued that REST has better performance. I don't know if this is true. (That was my first question.)
Assuming that it is true, is there any disadvantage to using REST instead of SOAP? (Are we losing something?)
Thanks in advance.
Performance is a broad topic.
If you mean the load on the server, REST performs a bit better because it bears minimal overhead on top of HTTP. Usually SOAP brings with it a stack of different (generated) handlers and parsers. Anyway, the performance difference itself is not that big, but a RESTful service is easier to scale up since you don't have any server-side sessions.
If you mean the performance of the network (i.e. bandwidth), REST performs much better. Basically, it's just HTTP, with next to no protocol overhead of its own. So if your service runs on top of HTTP anyway, you can't get much leaner than REST. Furthermore, if you encode your representations in JSON (as opposed to XML), you'll save many more bytes.
In short, I would say 'yes', you'll be more performant with REST. Also, it will (in my opinion) make your interface easier for your clients to consume. So not only does your server become leaner, but the client does too.
However, here are a couple of things to consider (since you asked 'what will you lose?'):
RESTful interfaces tend to be a bit more "chatty", so depending on your domain and how you design your resources, you may end up doing more HTTP requests.
SOAP has very wide tool support. For example, consultants love it because they can use tools to define the interface and generate the WSDL file, and developers love it because they can use another set of tools to generate all the networking code from that WSDL file. Moreover, XML as a representation has schemas and validators, which in some cases may be a key issue. (JSON and REST have similar things in the works, but the tool support is far behind.)
SOAP requires an XML message to be parsed, and all that <ridiculouslylongnamespace:mylongtagname>extra</ridiculouslylongnamespace:mylongtagname> stuff to be sent and received.
REST usually uses something much more terse and easily parsed, like JSON.
However, in practice the difference is not that great.
Building a DOM from XML is usually done by a super-fast, super-optimised piece of code like Xerces in C++ or Java, whereas most JSON parsers are in the roll-your-own or interpreted category.
In a fast network environment (LAN or broadband) there is not much difference between sending one or two KB versus 10 to 15 KB.
You phrase the question as if REST and SOAP were somehow interchangeable in an existing system. They are not.
When you use SOAP (a technology), you usually have a system that is defined in 'methods', since in effect you are dealing with RPC.
When you use REST (an architectural style, not a technology) then you are creating a system that is defined in terms of 'resources' and not at all in methods. There is no 1:1 mapping between SOAP and REST. The system architecture is fundamentally different.
Or are you merely talking about "RPC via URI", which often gets confused with REST?
I'm definitely not an expert when it comes to SOAP vs REST, but the only performance difference I know of is that SOAP has a lot of overhead when sending/receiving packets, since it's XML-based, requires a SOAP header, etc. REST uses the URL plus query string to make a request, and thus doesn't send as many kilobytes over the wire.
I'm sure there are other people here on SO who can give you better and more detailed answers, but at least I tried ;)
One thing the other answers seem to overlook is REST's support for caching and other benefits of HTTP. While SOAP uses HTTP, it does not take advantage of HTTP's supporting infrastructure. The SOAP 1.1 binding only defines the use of the POST verb. This was fixed in version 1.2 with the introduction of GET bindings; however, this may be an issue if you are using the older version or not using the appropriate bindings.
Security is another key performance concern. REST applications typically use TLS or other session-layer security mechanisms. TLS is much faster than application-level security mechanisms such as WS-Security (which also suffers from security flaws).
However, I think these are mostly minor issues when comparing SOAP- and REST-based services. You can find workarounds for either SOAP's or REST's performance issues. My personal opinion is that neither SOAP nor REST (by REST I mean HTTP-based REST services) is appropriate for services requiring high throughput and low latency. For those types of services, you probably want to go with something like Apache Thrift, 0MQ, or the myriad of other binary RPC protocols.
It all depends. REST doesn't really have a (good) answer for situations where the request data may become large. I feel this point is sometimes overlooked when hyping REST.
Let's imagine a service that allows you to request informational data for thousands of different items.
The SOAP developer would define a method that allows you to retrieve the information for one or as many items as you like ... in a single call.
The REST developer would be concerned that his URI would become too long, so he would define a GET method that takes a single item as a parameter. You would then have to call this multiple times, once for each item, in order to get your data. Clean and easy to understand ... but.
In this case there would be a lot more round-trips required for the REST service to accomplish what can be done with a single call on the SOAP service.
Yes, I know there are workarounds for handling large request data in the REST scenario. For example, you can pack stuff into the body of your request. But then you have to define carefully (on both the server and the client side) how it is to be interpreted. In these situations you start to feel the pain that REST is not really a standard (like SOAP) but more a way of doing things.
For situations where only relatively limited amount of data is exchanged REST is a very good choice. At the end of the day this is the majority of use cases.
Just to add a little to wuher's answer.
HTTP header bytes when requesting this page using the Chrome web browser: 761.
Bytes required for the sample SOAP message in the Wikipedia article: 299.
My conclusion: it is not the size in bytes on the wire that allows REST to perform well.
It is highly unlikely that simply converting a SOAP service over to REST is going to gain any significant performance benefits. The advantage REST has is that if you follow the constraints then you can take advantage of the mechanisms that HTTP provides for producing scalable systems. Caching and partitioning are the tools in your toolbelt.
In general, a REST-based web service is preferred due to its simplicity, performance, scalability, and support for multiple data formats. SOAP is favored where a service requires comprehensive support for security and transactional reliability.
The answer really depends on the functional and non-functional requirements. Asking the questions listed below will help you choose.
Ref: http://java-success.blogspot.ca/2012/02/java-web-services-interview-questions.html
Does the service expose data or business logic? (REST is a better choice for exposing data, SOAP WS might be a better choice for logic).
Do the consumers and the service providers require a formal contract? (SOAP has a formal contract via WSDL)
Do we need to support multiple data formats?
Do we need to make AJAX calls? (REST can use the XMLHttpRequest)
Is the call synchronous or asynchronous?
Is the call stateful or stateless? (REST is suited for stateless CRUD operations)
What level of security is required? (SOAP WS has better support for security)
What level of transaction support is required? (SOAP WS has better support for transaction management)
Do we have limited bandwidth? (SOAP is more verbose)
What’s best for the developers who will build clients for the service? (REST is easier to implement, test, and maintain)
You don't have to make the choice; modern frameworks allow you to expose data in both formats with minimal changes. Follow your business requirements, and load-test the specific implementation to understand the throughput; there is no correct answer to this question without a proper load test of the specific system.

Web Service vs. Shared Library

This question has been asked a few times on SO from what I found:
When should a web service not be used?
Web Service or DLL?
The answers helped, but they were both somewhat specific to a particular scenario. I wanted to get a more general take on this.
When should a Web Service be considered over a Shared Library (DLL) and vice versa?
Library Advantages:
Native code = higher performance
Simplest thing that could possibly work
No risk of centralized service going down and impacting all consumers
Service Advantages:
Everyone gets upgrades immediately and transparently (unless a versioned API is offered)
Consumers cannot decompile the code
Can scale service hardware separately
Technology agnostic. With a shared library, consumers must utilize a compatible technology.
More secure. The UI tier can call the service which sits behind a firewall instead of directly accessing the DB.
My thought on this:
A web service is designed for machine interop and to reach an audience easily by using HTTP as the means of transport.
A strong point is that by publishing the service you are opening it up to an audience that is potentially vast (over the web, or at least throughout the entire company) and/or largely outside of your control, influence, or communication channel, and you don't mind, or this is desired. Using the service is easy: clients simply need an internet connection to consume it. This is unlike a library, which may not be shared so easily (though it can be done). The usage of the service is largely open: you are making it available to whoever feels they could use it, however they feel like using it.
However, a web service is in general slower and is dependent on an internet connection. It's in general harder to test than a code library, and it may be harder to maintain; much of that depends on your maintenance and coding practices.
I would consider a web service if several of the above features are desired, or at least one of them is considered paramount, and the downsides are acceptable or a necessary evil.
What about a shared library?
What if you are far more in "control" of your environment, or want to be? You know who will be using the code (the interface isn't a problem to maintain), and you don't have to worry about interop. You are in a situation where you can easily achieve sharing without a lot of work or hoops to jump through.
Examples in my mind of when to use:
You have many applications under your control, all hosted on the same server or two, that will use the library. Library.
A less good fit: you have many applications, but hosted across a dozen or so servers. A web service may be a better choice.
You are not sure who will use your code or how, but you know it is of good value to many. Web service.
You are writing something only used by a limited set of applications, perhaps some helper functions. Library.
You are writing something highly specialized that is not suited for consumption by many, such as an API for your line-of-business application that no one else will ever use. Library.
All things being equal, it is easier to start with a shared library and turn it into a web service than vice versa.
There are many more but these are some of my thoughts on it...
Based on multiple sources...
Common Shared Library
Should provide a set of well-known operations that perform common tasks (e.g., String parsing, numerical manipulations, builders)
Should encapsulate common reusable code
Have minimal dependencies on other libraries
Provide stable interfaces
Services
Should provide reusable application-components
Provide common business services (e.g., rate-of-return calculations, performance reports, or transaction history services)
May be used to connect existing software from disparate systems or exchange data between applications
Here are 5 options and reasons to use them.
Service
has persistent state
you need to release updates often
solves a major business problem and owns the data related to it
need security: the user can't see your code, the user can't access your storage
need an agnostic interface like REST (you can easily auto-generate shallow REST clients for client languages)
need to scale separately
Library
you simply need a collection of reusable code
needs to run on the client side
can't tolerate any downtime
can't tolerate even a few milliseconds of latency
simplest solution that could possibly work
need to ship code to the data (high throughput or map-reduce)
First provide a library, then a service if the need arises.
agile approach: you start with the simplest solution, then expand
needs might evolve and become more like the "Service" cases
Library that starts a local service
many apps on the host need to connect to it and send some data to it
Neither
you can't seriously justify even the library case
business value is questionable
Ideally, if I want both sets of advantages, I need a portable library with the agnostic interface glue, automatically updated, and either obfuscated (hard to decompile) or run in a secure in-house environment. Using both a web service and a library together could make that viable.