Cronjob: Web Service query

I have a cronjob that runs every hour and parses 150,000+ records. Each record is summarized individually in MySQL tables. I use two web services to retrieve the user information:
User demographics (IP, country, city, etc.)
Phone information (whether landline or cell phone, and if a cell phone, the carrier)
Every time I process a record I check whether I already have the information, and if not I call these web services. After tracing my code I found that each of these calls takes 2 to 4 seconds, which makes my cronjob very slow and means I can't compile the statistics on time.
Is there a way to make these web service calls faster?
Thanks

Simple: get the data locally and use Melissa Data:
for ip: http://w10.melissadata.com/dqt/websmart/ip-locator.htm
for phone: http://www.melissadata.com/fonedata.html
You can also cache the results using memcache or APC, which will make things faster since the job does not have to request the data from the API or the database every time.
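To illustrate the caching idea, here is a minimal sketch in Python using pymemcache; the lookup function, key scheme, and one-day TTL are assumptions, and the same pattern applies to memcache/APC from PHP:

import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def lookup_ip(ip, fetch_from_webservice):
    # Check the cache first; only hit the slow web service on a miss.
    key = "ip:" + ip
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    info = fetch_from_webservice(ip)  # the 2-4 second call
    cache.set(key, json.dumps(info), expire=86400)  # keep for a day
    return info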

A couple of ideas... if the same users are returning, caching the data in another table would be very helpful... you would only look it up once and have it for returning users. Upon re-reading the question it looks like you are doing that.
Another option would be to spawn new threads when you need to do the look-ups. This could be a new thread for each request, or if this is not feasible you could have n service threads ready to do the look-ups and update the results.
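As a rough sketch of that worker-thread idea, here is one way to keep several lookups in flight at once using Python's concurrent.futures; the lookup functions, record fields, and worker count are assumptions:

from concurrent.futures import ThreadPoolExecutor

def enrich_all(records, lookup_demographics, lookup_phone, workers=20):
    # Keep up to `workers` web service calls in flight instead of one at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        demo_results = pool.map(lambda r: lookup_demographics(r["ip"]), records)
        phone_results = pool.map(lambda r: lookup_phone(r["phone"]), records)
        for record, demo, phone in zip(records, demo_results, phone_results):
            record["demographics"] = demo
            record["phone_info"] = phone
    return records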

Related

Cannot get data (100k+ rows) for a dashboard

I'm pretty new to DynamoDB and AWS in general; it's very exciting, but I feel the learning curve is a bit steep. Anyway, here is my situation and my problem.
We have a mobile React Native app which stores one row in a DynamoDB table each time a user performs a search (the table is a search history with a UUID and then the search criteria). On average a few thousand new searches are added to the table every day. The table has just a primary key, which is the search id.
The app is quite new, but we are already reaching a few hundred thousand rows in the table and can expect a million in the coming months. The data is plain and simple: a unique id, with strings and numbers in the other attributes. No connections, no relationships, etc. That is when I started to feel DynamoDB may not have been the best choice, but I read everywhere that it can be suitable for anything if properly managed.
Next to this there is a web app dashboard which, through a REST API backed by Node.js Lambdas, queries DynamoDB to compute statistics about the searches: how many searches per day, a list of the last searches, and so on. The problem is that DynamoDB is not really suited to querying hundreds of thousands of items (the 1 MB limit, query limitations, credits...).
When I do a scan I get only about 3,000 searches. I tried to loop over the scan using the last evaluated key, but after a few tests I stopped getting data back and I hit the maximum throughput. It seems really clear that I don't have the right approach for bringing all these searches to my web app. So what would be the right approach? My ideas are the following, but I am open to more experienced ones:
Switching to a SQL database (using the AWS migration service?). Will it really be easier then?
Creating Lambdas that run scheduled jobs every night to compute the daily statistics, so that I don't have to query the full table all the time but only some of the most recent searches and the statistics rows? Is it doable? Any Node.js/Lambda tutorial you may know regarding this? (A sketch of this idea appears after the answer below.)
Better management of indexes? I am still very lost regarding those.
Looking forward to your opinions.
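For reference, looping over a scan is normally driven by LastEvaluatedKey/ExclusiveStartKey. A minimal sketch with boto3 (Python shown here, though the same pattern applies to the Node.js SDK; the table name and region are assumptions):

import boto3

# Follow LastEvaluatedKey until it disappears; each scan call returns at most 1 MB of items.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")  # region is an assumption
table = dynamodb.Table("search_history")                        # table name is an assumption

items = []
scan_kwargs = {}
while True:
    response = table.scan(**scan_kwargs)
    items.extend(response.get("Items", []))
    last_key = response.get("LastEvaluatedKey")
    if not last_key:
        break
    scan_kwargs["ExclusiveStartKey"] = last_key

print("Fetched", len(items), "searches")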
Add another layer to take care of full-text search.
For example Elasticsearch, Algolia, or other similar services.
Notes:
Elasticsearch may cost you a lot compared to DynamoDB.
Reference:
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
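On the asker's second idea (nightly scheduled jobs), a Lambda on a CloudWatch/EventBridge schedule could precompute each day's counts into a small stats item so the dashboard never scans the raw table. A rough sketch; the table names, the created_at attribute, and the stats key are all assumptions:

import boto3
from datetime import datetime, timedelta, timezone

dynamodb = boto3.resource("dynamodb")
searches = dynamodb.Table("search_history")  # assumed source table
stats = dynamodb.Table("search_stats")       # assumed pre-created stats table keyed by stat_date

def handler(event, context):
    # Count yesterday's searches, assuming each item carries an ISO 'created_at' attribute.
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).date().isoformat()
    count = 0
    scan_kwargs = {
        "FilterExpression": "begins_with(created_at, :d)",
        "ExpressionAttributeValues": {":d": yesterday},
        "Select": "COUNT",
    }
    while True:
        response = searches.scan(**scan_kwargs)
        count += response["Count"]
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break
        scan_kwargs["ExclusiveStartKey"] = last_key
    # One small row per day is cheap for the dashboard to read back.
    stats.put_item(Item={"stat_date": yesterday, "search_count": count})
    return {"date": yesterday, "count": count}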

Managing identifiers when offline in an event sourced system

I have an event sourced system that runs on a server, with clients that need to work offline from time to time. To make this work I stream the domain events from the server to the client while it is online, so that the local database is up to date if the client goes offline. This works just fine.
When offline the user might add a new customer with the following sequence...
Add new customer command.
Customer aggregate added.
Customer aggregate creates initial appointment aggregate.
Query of read data returns new appointment details.
Command used to modify the appointment.
When back online I cannot replay the events to the server. Adding the new customer is fine, but the resulting new appointment has an identifier I do not know about. So trying to replay the appointment update command fails, because I have no idea what the correct appointment id should be.
Any ideas?
You need to review Greg Young's talk CQRS, not just for server systems.
Also see the Stack Overflow question Occasionally Connected CQRS Systems, and the dddcqrs topic Merging Events in Occasionally Connected Clients.
I have no idea what the correct appointment id should be
Generate the ids when you generate the commands; you'll know what the appointment id is going to be, because you told the customer aggregate what id to use when creating the appointment.
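A minimal sketch of that idea in Python, with hypothetical command and field names: the client allocates the ids itself, so the same ids remain valid offline and when the commands are later replayed against the server.

import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class AddCustomer:
    customer_id: uuid.UUID     # chosen by the client, not the server
    appointment_id: uuid.UUID  # id the customer aggregate must use for the initial appointment
    name: str

@dataclass(frozen=True)
class ModifyAppointment:
    appointment_id: uuid.UUID
    new_time: str

# Offline, the client generates the ids itself...
customer_id = uuid.uuid4()
appointment_id = uuid.uuid4()
commands = [
    AddCustomer(customer_id, appointment_id, "Jane Doe"),
    # ...so the follow-up command can reference the appointment it already knows about.
    ModifyAppointment(appointment_id, "2016-03-02T10:00"),
]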

API Gateway generating 11 sql queries per second on REG_LOG

We have sysdig running on our WSO2 API Gateway machine, and we notice that it fires a large number of SQL queries at the database for a minute, then waits a minute and repeats.
Every minute it goes wild, waits for a minute, and goes wild again with requests of the following format:
SELECT REG_PATH, REG_USER_ID, REG_LOGGED_TIME, REG_ACTION, REG_ACTION_DATA
FROM REG_LOG
WHERE REG_LOGGED_TIME>'2016-02-29 09:57:54'
AND REG_LOGGED_TIME<'2016-03-02 11:43:59.959' AND REG_TENANT_ID=-1234
There is no load on the server. What is causing this? What can we do to avoid this?
[Screenshot: sysdig output for the API Gateway process]
This particular query is the result of the registry indexing task that runs in the background. The REG_LOG table is queried periodically to retrieve the latest registry actions. The indexing task cannot be stopped; however, you can configure its frequency through the following parameter in registry.xml. See [1] for more information.
indexingFrequencyInSeconds
If this table has filled up, you can clean the data with a simple SQL query. However, when deleting records, be careful not to delete all the data: the latest record for each resource path should be left in the REG_LOG table, since reindexing requires at least one reference to each resource path.
Also, if required, you can take a dump of the data before clearing the REG_LOG table in case you do not want to lose old records. Hope this answer provides the information you require.
[1] - https://docs.wso2.com/display/Governance510/Configuration+for+Indexing

How can I limit database query time during web requests?

We've got a pretty typical Django app running on PostgreSQL 9.0. We've recently discovered some DB queries that have run for over 4 hours, due to inefficient searches in the admin interface. While we plan to fix these queries, as a safeguard we'd like to artificially constrain database query time to 15 seconds, but only in the context of a web request; batch jobs and Celery tasks should not be bounded by this constraint.
How can we do that? Or is it a terrible idea?
The best way to do this would be to set up a role/user that is only used to run the web requests, then set the statement_timeout on that role.
ALTER ROLE role_name SET statement_timeout = 15000
All other roles will use the global setting of statement_timeout (which is disabled in a stock install).
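If a separate database role is not convenient, a similar effect can be achieved from the Django side by setting the timeout on the web processes' own connections. A sketch of the DATABASES entry, assuming psycopg2; the database name and user are placeholders, and this settings module would be used only by the web processes, not by batch or Celery workers:

# settings used only by the web processes (batch jobs and Celery keep their own settings)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "mydb",        # placeholder
        "USER": "webuser",     # placeholder
        # Ask PostgreSQL to abort any statement on these connections after 15 seconds.
        "OPTIONS": {"options": "-c statement_timeout=15000"},
    }
}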
You will need to handle this manually, that is, by checking for the 15-second rule and killing the queries that violate it.
Query pg_stat_activity to find the violators and issue calls to pg_terminate_backend(procpid) to kill the offenders.
Something like this in a loop:
SELECT pg_terminate_backend(pg_stat_activity.procpid)
FROM pg_stat_activity
WHERE pg_stat_activity.datname = 'TARGET_DB'
AND usename = 'WEBUSERNAME'
AND (now()-query_start) > '00:00:15';
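If you prefer to run that check outside the database, a small watchdog script can execute the same query on a schedule. A rough sketch with psycopg2; the connection details, target database, and role name are placeholders:

import time
import psycopg2

KILL_SQL = """
    SELECT pg_terminate_backend(pg_stat_activity.procpid)
    FROM pg_stat_activity
    WHERE pg_stat_activity.datname = %s
      AND usename = %s
      AND (now() - query_start) > interval '15 seconds';
"""

conn = psycopg2.connect(dbname="postgres", user="admin")  # placeholder credentials
conn.autocommit = True

while True:
    with conn.cursor() as cur:
        # Terminate any web-role query that has run longer than 15 seconds.
        cur.execute(KILL_SQL, ("TARGET_DB", "WEBUSERNAME"))
    time.sleep(5)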
As far as the timing goes, you could pass all of your queries through a class which, on instantiation, spawns two threads: one for the query, and one for a timer. If the timer reaches 15 seconds, then kill the thread with the query.
As for figuring out whether the query originates from a web request, I don't know enough about Django to be able to help you. Simplistically, in the class that handles your database calls, an optional constructor parameter could be something like context, which could be "http" in the event of a web request and "" for anything else.

SOA/Web Service Pagination

In SOA we should not be building or holding state (or designing dependencies) between client and server. This is understood. But what patterns can be followed when a client wants to consume a real-time service that may return an open-ended number of 'rows'?
Web applications, which are similar to SOA but allow for state (sessions), have solved this with pagination. Pagination requires (in most cases, especially with SQL) that the server hold the data and that the client request the data in chunks.
If we were to consider pagination-like scenarios for web services, what patterns would these follow that would still allow the tenets of SOA to be adhered to (or as close as possible)?
Some rules for the thinkers:
1) Backed by a SQL database (therefore there is no concept of a row number in a select set)
2) It is important to not skip a row or duplicate a row in a set during pagination
3) Data may be inserted and deleted at any time into the database by other clients
4) There is no need to consider the dataset a live (update-able) dataset
Personally, I think that 1 and 2 above already spell out the solution by constraining the solution space with the requirements.
My proposed solution would have the data (as much as is selected) stored in a read-only store/cache where it can be assigned a row number within the result set, allowing pagination to occur on this data snapshot. I would have infrastructure to store snapshots (servers, external caches, memcached or ehcache - this must scale quite large). The result of such a query would be a snapshot ID, and clients could retrieve the data from the snapshot using a snapshot API (web services) and the snapshot ID. Results would be processed in a read-only, forward-only manner for x records at a time, where x is something reasonable.
Competing thoughts and ideas, criticisms or accolades would be greatly appreciated.
Paginated results in a web service are actually quite easy to achieve.
All you have to do is add two parameters to the web service call: Page Size, Page Number.
Page Size is the number of results to include in a page. Page Number is the number of the page of results you are looking for.
Your web service then goes back to the database (or cache), retrieves the results, figures out which results fit on the requested page, and returns only those results.
The client then has to make a single request per page of results they want from the service.
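A minimal sketch of that interface in Python; SQLite stands in for the real database, and the table, column names, and SQL are purely illustrative:

import sqlite3

def get_page(conn, page_size, page_number):
    # Translate (page_size, page_number) into LIMIT/OFFSET against the backing table.
    offset = (page_number - 1) * page_size
    cur = conn.execute(
        "SELECT id, payload FROM results ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()

# The web service exposes page_size and page_number as request parameters
# and hands back only the rows for the requested page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO results (payload) VALUES (?)", [("row %d" % i,) for i in range(25)])
print(get_page(conn, page_size=10, page_number=2))  # rows 11..20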
What you propose with memcached will also work with a caching table. The first service call would (1) INSERT results INTO the caching table with a snapshot ID (2) return the first page from the caching table and the snapshot ID. Subsequent calls would return pages based on page size and page number by querying the caching table using the snapshot ID.
I should think this could also be optimized by using an in-memory caching table, but that depends on whether your database supports INSERT-INTO from a disk table to an in-memory table. That might get complicated in a clustered environment though.
Such a cache is stateful by its very nature if you are retaining a client-specific copy between requests, whether storage is in a session object, database table or memcached data store. Given the requirements though, you have no choice but to cache results in some form or another, except you risk the chance of returning deleted or no-longer-relevant records as legitimate results.
SOA is not meant for such low level functionality.
SOA is meant to glue together business areas, not front ends to back ends. Your application is not a "SOA" application just because it talks to the back end using web services; that is nonsense, since SOA is meaningless in the context of one isolated system.
From that point of view it is clear that, in SOA, the caller should not know about the SQL table you are paginating; that's an implementation detail that SOA should hide. On the other hand, the server should not know about the client's state, because to be truly open it should be agnostic to the details of its clients.
So, just understand that pagination is not SOA. Do as you wish, just understand that the web service you are using to paginate is an internal artifact of your application, not to be used by external clients on a SOA bus. Also remember that it cannot be transactionally consistent without state on the server. Probably the problem is that you have only one service layer for both the application's UI and the SOA bus; you need to separate them.
Using this web service on a SOA bus would be bad. It cannot be consistent as the user paginates, and as other applications hook into it they become tied to the specific SQL.
... then you might as well have granted direct SQL access to the table for all that matters.
SOA is for business messages between systems, not to glue an application's frontend to the backend.
Same problem, resolved using the Navision approach.
$ws->getList($first_record_id, $limit)
This returns a page of $limit elements starting from the passed id:
SELECT * FROM collection WHERE collection.id > $first_record_id ORDER BY id ASC LIMIT $limit
Navision uses a Key (each element has a key), but in MySQL an auto-increment id works well.
In this case pagination is intended for handling large result sets and not for front-end pagination...
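A small Python sketch of that keyset approach, again with SQLite standing in for MySQL and assumed table/column names; each call passes back the last id it saw, so pages do not shift as new rows are appended:

import sqlite3

def get_list(conn, first_record_id, limit):
    # Keyset pagination: fetch the next `limit` rows after the given id.
    cur = conn.execute(
        "SELECT id, payload FROM collection WHERE id > ? ORDER BY id ASC LIMIT ?",
        (first_record_id, limit),
    )
    return cur.fetchall()

# The caller remembers the last id of the previous page and passes it back in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collection (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
conn.executemany("INSERT INTO collection (payload) VALUES (?)", [("row %d" % i,) for i in range(30)])
first_page = get_list(conn, first_record_id=0, limit=10)
second_page = get_list(conn, first_record_id=first_page[-1][0], limit=10)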
I am not sure SOA is the concern here. The problem you have seems to be with paginating your APIs. I will point you to how Twitter handles its pagination: dev.twitter.com/rest/public/timelines