SOA/Web Service Pagination - web-services

In SOA we should not be building or holding state (or designing dependencies) between client and server. This is understood. But what patterns can be followed in the case that a client wants to consume a real-time service that may return an open ended number of 'rows'?
Web applications, which are similar to SOA but allow for state (sessions), have solved this with pagination. Pagination requires (in most cases, especially with SQL) that the server hold the data and that the client request the data in chunks.
If we were to consider pagination-like scenarios for web services, what patterns would these follow that would still allow the tenets of SOA to be adhered to (or as closely as possible)?
Some rules for the thinkers:
1) Backed by a SQL database (therefore there is no concept of a row number in a select set)
2) It is important to not skip a row or duplicate a row in a set during pagination
3) Data may be inserted and deleted at any time into the database by other clients
4) There is no need to consider the dataset a live (update-able) dataset
Personally, I think that 1 and 2 above already spell out the solution by constraining the solution space with the requirements.
My proposed solution would have the data (as much as is selected) be stored in a read-only store/cache where it can be assigned a row number within the result set, allowing pagination to occur on this data snapshot. I would have infrastructure to store snapshots (servers, external caches, memcached or ehcache - this must scale quite large). The result of such a query would be a snapshot ID, and clients could retrieve the data from the snapshot using a snapshot API (web services) and the snapshot ID. Results would be processed in a read-only, forward-only manner, x records at a time, where x is something reasonable.
Competing thoughts and ideas, criticisms or accolades would be greatly appreciated.

Paginated results in a Web Service are actually quite easy to achieve.
All you have to do is add two parameters to the web service call: Page Size, Page Number.
Page Size is the number of results to include in a page. Page Number is the number of the page of results you are looking for.
Your web service then goes back to the database (or cache), retrieves the results, figures out which results fit on the requested page, and returns only those results.
The client then has to make a single request per page of results they want from the service.
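A minimal sketch of the page size / page number approach described above, using Python's built-in sqlite3 module as a stand-in for the real database. The table name `items`, its columns, and the function name `fetch_page` are assumptions made for illustration only.

```python
import sqlite3

def fetch_page(conn, page_size, page_number):
    """Return one page of rows; page_number is 1-based (hypothetical helper)."""
    if page_size < 1 or page_number < 1:
        raise ValueError("page_size and page_number must be >= 1")
    offset = (page_number - 1) * page_size
    # A stable ORDER BY is needed, otherwise pages are not well defined.
    cur = conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()

# Illustrative data: seven rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1, 8)],
)

# One request per page: here the client asks for the second page of three rows.
page_2 = fetch_page(conn, page_size=3, page_number=2)
```

Note that with this approach, rows inserted or deleted by other clients between two requests shift the offsets, so a row can be skipped or returned twice across pages.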

What you propose with memcached will also work with a caching table. The first service call would (1) INSERT results INTO the caching table with a snapshot ID (2) return the first page from the caching table and the snapshot ID. Subsequent calls would return pages based on page size and page number by querying the caching table using the snapshot ID.
I should think this could also be optimized by using an in-memory caching table, but that depends on whether your database supports INSERT-INTO from a disk table to an in-memory table. That might get complicated in a clustered environment though.
Such a cache is stateful by its very nature if you are retaining a client-specific copy between requests, whether storage is in a session object, database table or memcached data store. Given the requirements though, you have no choice but to cache results in some form or another; otherwise you risk returning deleted or no-longer-relevant records as legitimate results.
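The caching-table idea can be sketched roughly as follows, again with sqlite3 standing in for the real database. The table names `items` and `snapshot_rows`, the column layout, and both function names are hypothetical; row numbers are assigned in Python here rather than with a single INSERT INTO ... SELECT statement.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE snapshot_rows (
        snapshot_id TEXT NOT NULL,
        row_num     INTEGER NOT NULL,
        item_id     INTEGER NOT NULL,
        name        TEXT,
        PRIMARY KEY (snapshot_id, row_num)
    );
""")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1, 6)],
)

def create_snapshot(conn):
    """Copy the current result set into the caching table and return its ID."""
    snapshot_id = uuid.uuid4().hex
    rows = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
    conn.executemany(
        "INSERT INTO snapshot_rows (snapshot_id, row_num, item_id, name) "
        "VALUES (?, ?, ?, ?)",
        [(snapshot_id, n, item_id, name)
         for n, (item_id, name) in enumerate(rows, start=1)],
    )
    return snapshot_id

def fetch_snapshot_page(conn, snapshot_id, page_size, page_number):
    """Read one page from the frozen snapshot by row number."""
    first = (page_number - 1) * page_size + 1
    last = first + page_size - 1
    return conn.execute(
        "SELECT item_id, name FROM snapshot_rows "
        "WHERE snapshot_id = ? AND row_num BETWEEN ? AND ? ORDER BY row_num",
        (snapshot_id, first, last),
    ).fetchall()

snapshot_id = create_snapshot(conn)
# Simulate another client deleting a row after the snapshot was taken.
conn.execute("DELETE FROM items WHERE id = 2")
first_page = fetch_snapshot_page(conn, snapshot_id, page_size=3, page_number=1)
```

The snapshot still contains the deleted row, which is the trade-off: pages are consistent with each other, but not necessarily with the live table. A real service would also need to expire old snapshots.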

SOA is not meant for such low level functionality.
SOA is meant to glue together business areas, not frontends to backends. Your application does not become a "SOA" application just because it talks to the back end using web services. That is nonsense, since SOA is meaningless in the context of one isolated system.
From that point of view, it is clear that, in SOA, the caller should not know about the SQL table you are paginating; that's an implementation detail that SOA should hide. On the other hand, the server should not know about the client's state, because it should be agnostic to the details of its clients in order to be really open.
So, just understand that pagination is not SOA. Do as you wish, but understand that the web service you are using to paginate is an internal artifact of your application, not to be used by external clients on a SOA bus. Also remember that it cannot be transactionally consistent without state on the server. Probably the problem is that you have only one service layer for both the application's UI and the SOA bus; you need to separate them.
Using this web service on a SOA bus would be bad. It cannot be consistent as the user paginates, and as other applications hang on to it they become tied to the specific SQL.
... then you might as well have granted direct SQL access to the table for all that matters.
SOA is for business messages between systems, not to glue an application's frontend to the backend.

Same problem, resolved using the Navision approach.
$ws->getList($first_record_id, $limit)
This returns a page of at most $limit elements that come after the passed id:
select * from collection where collection.id > $first_record_id order by collection.id ASC limit $limit
Navision uses a key (each element has a key), but in MySQL an auto-increment id is better.
In this case pagination is intended for handling large result sets, not for frontend pagination...
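A rough sketch of how a client might walk the whole collection with this id-based approach, written in Python with sqlite3 as a stand-in for MySQL. The function `get_list` mirrors the `getList` call above; the `name` column and the sample data are assumptions.

```python
import sqlite3

def get_list(conn, first_record_id, limit):
    """Return up to `limit` rows whose id is greater than first_record_id."""
    cur = conn.execute(
        "SELECT id, name FROM collection WHERE id > ? ORDER BY id ASC LIMIT ?",
        (first_record_id, limit),
    )
    return cur.fetchall()

# Illustrative data: seven rows with auto-assigned ids 1..7.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO collection (name) VALUES (?)",
    [(f"row-{i}",) for i in range(1, 8)],
)

# The client remembers only the last id it has seen; the server keeps no state.
pages = []
last_id = 0
while True:
    page = get_list(conn, last_id, 3)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]
```

Because each request continues from the last id seen rather than from an offset, rows inserted or deleted elsewhere do not cause already-delivered rows to be repeated or later rows to be skipped.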

I am not sure if SOA is of concern here. The problem you have seems to be with paginating your APIs. I will point you to how Twitter handles their pagination: dev.twitter.com/rest/public/timelines

Related

Handling multiple users concurrently populating a PostgreSQL database

I'm currently trying to build a web app that would allow many users to query an external API (I cannot retrieve all the data served by this API at regular intervals to populate my PostgreSQL database for various reasons). I've read several things about ACID and MVCC but still, I'm not sure there won't be any problem if several users are populating/reading my PostgreSQL database at the very same time. So here I'm asking for advice (I'm very new to this field)!
Let's say my users query the external API to retrieve articles. They make their search via a form, the back end gets it, queries the api, populates the database, then query the database to return some data to the front end.
Would it be okay to simply create a unique table to store the articles returned by the API when users are querying it ?
Shall I rather store the articles returned by the API and associate each of them to the user that requested it (the Article model will contain a foreign key mapping to a User model)?
Or shall I give each user a table (data isolation would be good but that sounds very inefficient)?
Thanks for your help !
Would it be okay to simply create a unique table to store the articles returned by the API when users are querying it ?
Yes. If the articles have unique keys (doi?) you could use INSERT...ON CONFLICT DO NOTHING to handle the (presumably very rare) case that an article is requested by two people nearly simultaneously.
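As a small illustration of that statement, here is a sketch using Python's sqlite3 module, which accepts the same ON CONFLICT ... DO NOTHING clause (SQLite 3.24 or later is assumed). With PostgreSQL the SQL would be the same in spirit, run through whatever driver you use. The `articles` table, its columns, and the sample DOI are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The unique key (here a DOI) is what the conflict clause relies on.
conn.execute("CREATE TABLE articles (doi TEXT PRIMARY KEY, title TEXT)")

def store_article(conn, doi, title):
    """Insert the article unless a row with the same DOI already exists."""
    conn.execute(
        "INSERT INTO articles (doi, title) VALUES (?, ?) "
        "ON CONFLICT (doi) DO NOTHING",
        (doi, title),
    )

# Two users request the same article almost simultaneously.
store_article(conn, "10.1000/example", "Example article")
store_article(conn, "10.1000/example", "Example article")

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

The second insert is silently skipped instead of raising a unique-constraint error, so only one copy of the article is stored.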
Shall I rather store the articles returned by the API and associate each of them to the user that requested it (the Article model will contain a foreign key mapping to a User model)?
Do you want to? Is there a reason to? Do you care who requested each article? It sounds like you anticipate storing only the first person to request each article, and not every request?
Or shall I give each user a table (data isolation would be good but that sounds very inefficient)?
Right, you would be hitting the API a lot more often (assuming some large fraction of articles are requested more than once) and storing a lot of duplicates. It might not even solve the problem, if one person hits "submit" twice in a row, or has multiple tabs open, or writes a bot to hit your service in parallel.

PostgreSQL with Django: should I store static JSON in a separate MongoDB database?

Context
I'm making a Django web application that depends on scraped API data.
The workflow:
A) I retrieve data from external API
B) Insert structured, processed data that I need in my PostgreSQL database (about 5% of the whole JSON)
I would like to add a third step, (before or after the "B" step) which will store the whole external API response in my databases. For three reasons:
1) I want to "freeze" data, as an "audit trail" in case the API changes the content (it has happened before)
2) API calls in my business are expensive, and often limited to 6 months of history.
3) I might decide to integrate more data from the API later.
Calling the external API again when data is needed is not possible because of 2) and 3)
Please note that the stored API responses will never be updated and read performance is not really important. Also, being able to query the stored API responses would be really nice, to perform exploratory analysis.
To provide additional context, there are a few thousand API calls a day, which represent around 50GB of data a year.
Here comes my question(s)
Should I store the raw JSON in the same PostgreSQL database I'm using for the Django web application, or in a separate datastore (MongoDB or some other NoSQL database)?
If I go with storing the raw JSON in my PostgreSQL database, I fear that my web application performance will decrease due to the database being "bloated" (50MB of parsed SQL data in my Django database is equivalent to 2GB of raw JSON from the external API, so integrating the full API responses in my database will multiply its size by 40).
What about cost, as all this is hosted on a DBaaS? I understand that the cost will increase greatly (due to the increase in database size), but is either of the two options more cost effective?

Migrating a relational DB into AWS services

I have a terabyte size SQL Server DB table which has only two columns:
Id,
HTML Content
There are few applications that call this Table to retrieve the HTML content by providing the Id of the row.
The DB resides on-premises, and its maintenance cost and size keep getting higher and higher. I am thinking of moving this DB into AWS DynamoDB. The reason I have chosen DynamoDB is the cost and the performance I have read about.
Are there any concerns I should know about before choosing DynamoDB?
Are there any other services in AWS that I could possibly use over DynamoDB?
I understand that SQL Server is a Relational DB, while DynamoDB is no sql. And it seems a No Sql DB could be a potential solution for this scenario. I have no kind of joins nor transactions against that Table. All I am doing with the table is to Insert, and Select.
Are there any concerns I should know about before choosing DynamoDB?
Like many NoSQL databases, DynamoDB is "eventually consistent" by default, so if your application writes and then immediately reads the same record, you should expect occasional stale reads (inconsistencies) unless you explicitly request a strongly consistent read.
I'm not familiar with "Prem" and assuming you mean that you're working with your private servers I feel obligated to provide the following warning: working in the cloud is very different from working with your own servers: requests fail more often, latency pattern is different and you should architect your software to handle these sort of issues. If you're planning on moving to the cloud I'd start with migrating your application and leave the DB to be last.
If you really need real-time updates of your data, you should reconsider moving to Dynamo. Also, Dynamo is useful when you need a dynamic number of columns for each row. So apart from the cost, I don't see any benefits here.
If you don't need real-time updates, you can look into AWS Redshift or Google BigQuery, and these will be cheaper solutions compared to Dynamo.
As you mentioned, you have just two columns, so take a look at Redis as well. A plain key-value structure will help performance. But since Redis stores everything in physical memory, the cost will be high, and you'll still need permanent storage such as a SQL database. So in terms of performance, yes, you'll be able to see a huge difference, but you'll pay more than your current cost.
How about AWS Aurora? AWS at least claims 1/10th of the cost compared to other SQL/MySQL instances. It is also backward compatible.

How to handle different structure of same Entity in Hibernate L2 cache with Coherence for caching

I am using Hibernate L2 cache with Coherence for caching in two different web services.
Scenario
First web service has an entity class Employee with 5 fields
Second web service has the same entity class Employee but with 3 fields.
Both are pointing to same table/schema and the package hierarchy is also same.
Now when a fresh request for employeeId=1 comes to the second web service, it fetches the value from the database and caches the 3 columns, keeping the other 2 as null.
Now when a request for employeeId=1 comes from the first web service, it directly fetches from cache by providing 3 columns and returns the other 2 as null, even though in the database the 2 columns have non-null values.
Is there a way by which I can force it get these column from database?
Approaches already tried
If I keep the columns the same in both web services the problem goes away, but this is not an acceptable solution in my scenario.
I tried adding a different serialVersion but it doesn't work.
Keeping the fully qualified name different works, but this forces us to take on the overhead of performing manual eviction.
You should be able to use the Evolvable interface for this, which will allow you to insert an object into the grid that is both forward and backward compatible. You just need to ensure that Second Webservice sets a lower version than First.

RESTful API design question - how should one allow users to create new resource instances?

I'm working in a research group where we intend to publish implementations of some of the algorithms we develop on the web via a RESTful API. Most of these algorithms work on small to medium size datasets, and in many cases, a user of our services might want to run multiple queries (with different parameters) on the same dataset, so for me it seems reasonable to allow users to upload their datasets in advance and refer to them in their queries later. In this sense, a dataset could be a resource in my API, and an algorithm could be another.
My question is: how should I let the users upload their own datasets? I cannot simply let users upload their data to /dataset/dataset_id as letting the users invent their own dataset_ids might result in ID collision and users overwriting each other's datasets by accident. (I believe one of the most frequently used dataset ID would be test). I think an ideal way would be to have a dedicated URL (like /dataset/upload) where users can POST their datasets and the response would contain a unique ID under which the dataset was stored, but I'm not sure that it does not violate the basic principles of REST. What is the preferred way of dealing with such scenarios?
According to this you should not have a dedicated URI, and should rather handle POST to /dataset/ as creation.
Your idea is not violating the principles of REST :)
The preferred way is to use POST and return the path to the newly created resource in the Location header.
In your case, the client POSTs to /dataset. The server generates an identifier and returns a reference to the dataset in the Location header:
Location: /dataset/1234
The response status should be 201 (Created).
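A framework-independent sketch of that server-side logic in Python. The function name `post_dataset`, the in-memory `datasets` store, and the use of a UUID as the identifier are all assumptions for illustration; a real service would plug this into its web framework's POST handler for /dataset.

```python
import uuid

# Hypothetical in-memory store: dataset id -> uploaded bytes.
datasets = {}

def post_dataset(body: bytes):
    """Handle POST /dataset: store the upload under a server-generated id."""
    dataset_id = uuid.uuid4().hex  # the server, not the client, picks the id
    datasets[dataset_id] = body
    headers = {"Location": f"/dataset/{dataset_id}"}
    return 201, headers  # 201 Created plus the path of the new resource

status, headers = post_dataset(b"x,y\n1,2\n")
```

Since the identifier is generated on the server, two users who both think of their upload as "test" end up with different dataset URLs and cannot overwrite each other.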