I will be using AppFabric caching for an application we are going to build.
We will be maintaining two different caches using the technology.
So there needs to be proper communication between them when something changes in either cache.
I have read about notification-based caching, which raises notifications when items or regions are added, removed, or replaced. I have tried it and it works fine.
What I am looking for is how to get notified on cache invalidation.
I read on MSDN that there are two ways to invalidate the cache: 1. timeout-based and 2. notification-based.
I am looking for some sample code for this.
Can anyone help me out with this?
I don't think there is any notification mechanism that covers invalidation in the main cache. However, notifications are available for local cache invalidation.
Currently our organization is using the Akamai Fast Purge v3 API to invalidate cache records by cache tag. The problem I'm running into is that some of our lower environments are configured with a TTL of 0 seconds, apparently to facilitate business user testing.
As a result, I'm finding it strangely difficult to manually test the new purge system we have in place, because Akamai isn't actually caching anything. We're working with the business to set this to more closely match production environments, but in the meantime I'm wondering whether there are any debug headers that can be used to figure out if and when an invalidation/deletion occurred.
I have of course googled and found this somewhat useful, though IMO strangely half-baked, article from Akamai themselves that discusses the debug cache headers and their meaning, but it seems very incomplete.
As far as I can tell, X-Cache is my best option from this article, but as mentioned, because of the 0-second TTL I will always get a MISS.
Are there any additional debug headers that could be useful in determining if my purge logic is effective, despite the 0 second TTL?
The better resource for Akamai's Pragma headers is the documentation site.
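For example, you can send the Pragma debug values on a request and inspect the cache-related response headers. A rough sketch with Java's built-in HTTP client — the URL is a placeholder, and which Pragma values the edge actually honors depends on your property configuration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AkamaiDebugHeaders {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/some/path"))   // placeholder URL
                // Ask the edge to echo cache diagnostics back in the response.
                .header("Pragma",
                        "akamai-x-cache-on, akamai-x-cache-remote-on, "
                      + "akamai-x-check-cacheable, akamai-x-get-cache-key")
                .GET()
                .build();

        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());

        // X-Cache (TCP_HIT / TCP_MISS), X-Check-Cacheable and X-Cache-Key are
        // the headers of interest here.
        response.headers().map().forEach((name, values) -> {
            String lower = name.toLowerCase();
            if (lower.startsWith("x-cache") || lower.equals("x-check-cacheable")) {
                System.out.println(name + ": " + values);
            }
        });
    }
}
```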
That said, there is no response header that will tell you when a cache purge was issued to the platform. Instead, you would have to look at the Control Center Event Viewer, where you can see and filter all events, including filtering down to just the Fast Purge calls.
My use case is as follows:
We have about 500 servers running in an autoscaling EC2 cluster that need to access the same configuration data (laid out in a key/value fashion) several million times per second.
The configuration data isn't very large (1 or 2 GBs) and doesn't change much (a few dozen updates/deletes/inserts per minute during peak time).
Latency is critical for us, so the data needs to be replicated and kept in memory on every single instance running our application.
Eventual consistency is fine. However, we need to make sure that every update will be propagated at some point (knowing that the servers can be shut down at any time).
Update propagation across the servers should be reliable and easy to set up (we can't have static IPs for our servers, and we don't want to go the route of "faking" multicast on AWS, etc.).
Here are the solutions we've explored in the past:
Using regular Java maps and our custom-built system to propagate updates across the cluster (obviously, it doesn't scale that well).
Using EhCache and its replication feature. But setting it up on EC2 is very painful and somewhat unreliable.
Here are the solutions we're thinking of trying out:
Apache Ignite (https://ignite.apache.org/) with a REPLICATED strategy.
Hazelcast's Replicated Map feature. (http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#replicated-map)
Apache Geode on every application node. (http://geode.apache.org/)
I would like to know whether each of those solutions would work for our use case, and if so, what issues I'm likely to face with each of them.
Here is what I found so far:
Hazelcast's Replicated Map is somewhat recent and still a bit unreliable (async updates can be lost when scaling down).
It seems like Geode became "stable" fairly recently (even though it has supposedly been in development since the early 2000s).
Ignite looks like it could be a good fit, but I'm not too sure how their S3-based discovery system will work out if we keep adding/removing nodes regularly.
Thanks!
Geode should work for your use case. You should be able to use a Geode Replicated region on each node. You can choose to do synchronous OR asynchronous replication. In case of failures, the replicated region gets an initial copy of the data from an existing member in the system, while making sure that no in-flight operations are lost.
In terms of configuration, you will have to start a few member discovery processes (Geode locators) and point each member to these locators. (We recommend that you start one locator per AZ and use 3 AZs to protect against network partitioning.)
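A minimal sketch of a peer node with a replicated region, using Geode's Java API (the locator addresses and region name below are placeholders):

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class ConfigCacheNode {
    public static void main(String[] args) {
        // Each application node joins the cluster as a peer via the locators
        // (one locator per AZ, as suggested above).
        Cache cache = new CacheFactory()
                .set("locators", "10.0.1.10[10334],10.0.2.10[10334],10.0.3.10[10334]")
                .create();

        // REPLICATE keeps a full copy of the region in memory on every member.
        Region<String, String> config = cache
                .<String, String>createRegionFactory(RegionShortcut.REPLICATE)
                .create("config");

        config.put("feature.flag", "on");          // propagated to all members
        String value = config.get("feature.flag"); // served from local memory
    }
}
```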
Geode/GemFire has been stable for a while, powering low-latency, high-scalability reservation systems at the Indian and Chinese railways, among other users, for a very long time.
Disclosure: I am a committer on Geode.
Ignite provides native AWS integration for discovery over S3 storage: https://apacheignite-mix.readme.io/docs/amazon-aws. It solves the main issue: you don't need to change the configuration when instances are restarted. In a nutshell, any node that successfully joins the topology writes its coordinates to a bucket (and removes them when it fails or leaves). When you start a new node, it reads this bucket and connects to one of the listed addresses.
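A rough sketch of an Ignite node combining S3-based discovery with a REPLICATED cache (requires the ignite-aws module; the bucket name, credentials, and cache name are placeholders):

```java
import com.amazonaws.auth.BasicAWSCredentials;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder;

public class IgniteS3Node {
    public static void main(String[] args) {
        // Nodes register their addresses in an S3 bucket instead of a static IP list.
        TcpDiscoveryS3IpFinder ipFinder = new TcpDiscoveryS3IpFinder();
        ipFinder.setBucketName("my-ignite-discovery-bucket");   // placeholder bucket
        ipFinder.setAwsCredentials(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);

        // REPLICATED mode keeps a full copy of the cache on every node.
        CacheConfiguration<String, String> cacheCfg = new CacheConfiguration<>("config");
        cacheCfg.setCacheMode(CacheMode.REPLICATED);

        IgniteConfiguration cfg = new IgniteConfiguration()
                .setDiscoverySpi(discovery)
                .setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);
        IgniteCache<String, String> config = ignite.getOrCreateCache("config");
        config.put("feature.flag", "on");
    }
}
```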
Hazelcast's Replicated Map will not work for your use case. Note that it is a map that is replicated across all of the cluster's nodes, not onto the client nodes/servers. Also, as you said, it is not fully reliable yet.
Here is the Hazelcast solution:
Create a Hazelcast cluster with a set of nodes depending upon the size of data.
Create a distributed map (IMap) and tweak the count & eviction configurations based on the size/number of key/value pairs. The data gets partitioned across all the nodes.
Set up the backup count based on how critical the data is and how much time it takes to pull the data from the actual source (DB/files). Distributed maps have 1 backup by default.
On the client side, set up a Near Cache and attach it to the distributed map. This Near Cache holds the key/value pairs on the local/client side itself, so get operations complete within milliseconds.
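A minimal client-side sketch of the Near Cache setup described above (Hazelcast 3.x-style API; the map name, member addresses, and TTL are placeholder values):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ConfigClient {
    public static void main(String[] args) {
        // The Near Cache keeps recently read entries inside the client JVM itself.
        NearCacheConfig nearCacheConfig = new NearCacheConfig("config")
                .setInMemoryFormat(InMemoryFormat.OBJECT)
                .setInvalidateOnChange(true)     // cluster pushes invalidations to clients
                .setTimeToLiveSeconds(300);      // safety net in case an invalidation is missed

        ClientConfig clientConfig = new ClientConfig();
        clientConfig.addNearCacheConfig(nearCacheConfig);
        clientConfig.getNetworkConfig().addAddress("10.0.1.20:5701", "10.0.1.21:5701");

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<String, String> config = client.getMap("config");

        config.put("feature.flag", "on");            // stored partitioned across the cluster
        String value = config.get("feature.flag");   // served from the Near Cache after the first read
    }
}
```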
Things to consider with NearCache solution:
The first get operation will be slower, as it has to go over the network to get the data from the cluster.
Cache invalidation is not fully reliable, as there will be a delay in synchronization with the cluster, and you may end up reading stale data. Again, this is the case with all of the cache solutions.
It is the client's responsibility to set up timeouts and invalidation of Near Cache entries, so that future reads get fresh data from the cluster. This depends on how often the data gets refreshed or a value is replaced for a key.
I have a Java web server and am currently using the Guava library to handle my in-memory caching, which I use heavily. I now need to expand to multiple servers (2+) for failover and load balancing. In the process, I switched from an in-process cache to Memcache (an external service) instead. However, I'm not terribly impressed with the results: now, for nearly every call, I have to make an external call to another server, which is significantly slower than the in-memory cache.
I'm thinking instead of getting the data from Memcache, I could keep using a local cache on each server, and use RabbitMQ to notify the other servers when their caches need to be updated. So if one server makes a change to the underlying data, it would also broadcast a message to all other servers telling them their cache is now invalid. Every server is both broadcasting and listening for cache invalidation messages.
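Roughly what I have in mind, as a sketch only (assuming a fanout exchange named cache.invalidation, the RabbitMQ Java client, and a Guava cache keyed by entity id):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.rabbitmq.client.BuiltinExchangeType;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

public class CacheInvalidator {
    private static final String EXCHANGE = "cache.invalidation";   // assumed exchange name

    private final Cache<String, Object> localCache =
            CacheBuilder.newBuilder().maximumSize(100_000).build();
    private final Channel channel;

    public CacheInvalidator(Connection connection) throws Exception {
        channel = connection.createChannel();
        // Fanout exchange: every server receives every invalidation message.
        channel.exchangeDeclare(EXCHANGE, BuiltinExchangeType.FANOUT, true);
        // Each server gets its own auto-named, auto-deleted queue bound to the exchange.
        String queue = channel.queueDeclare().getQueue();
        channel.queueBind(queue, EXCHANGE, "");
        channel.basicConsume(queue, true, (consumerTag, delivery) -> {
            String key = new String(delivery.getBody(), StandardCharsets.UTF_8);
            localCache.invalidate(key);   // drop the stale entry from the local cache
        }, consumerTag -> { });
    }

    // Called by the server that changed the underlying data.
    public void publishInvalidation(String key) throws Exception {
        channel.basicPublish(EXCHANGE, "", null, key.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-host");                       // placeholder host
        CacheInvalidator invalidator = new CacheInvalidator(factory.newConnection());
        invalidator.publishInvalidation("user:42");             // all servers drop this key
    }
}
```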
Does anyone know any potential pitfalls of this approach? I'm a little nervous because I can't find anyone else that is doing this in production. The only problems I see would be that each server needs more memory (in-memory cache), and it might take a little longer for any given server to get the updated data. Anything else?
I am a little bit confused about your problem here, so I am going to restate in a way that makes sense to me, then answer my version of your question. Please feel free to comment if I am not in line with what you are thinking.
You have a web application that uses a process-local memory cache for data. You want to expand to multiple nodes and keep this same structure for your program, rather than rely upon a 3rd party tool (memcached, Couchbase, Redis) with built-in cache replication. So, you are thinking about rolling your own using RabbitMQ to publish the changes out to the various nodes so they can update the local cache accordingly.
My initial reaction is that what you want to do is best done by moving to one of the above-mentioned tools. Besides sparing you the obvious development and rigorous testing involved in rolling your own, Couchbase, Memcached, and Redis were all designed to solve exactly the problem that you have.
Also, in theory you would run out of available memory in your application nodes as you scale horizontally, and then you will really have a mess. Once you get to the point when this limitation makes your app infeasible, you will end up using one of the tools anyway at which point all your hard work to design a custom solution will be for naught.
The only exception to this I can think of is if your app is heavily compute-intensive and does not use much memory. In that case, I think a RabbitMQ-based solution is easy to implement, but you would need to have some sort of procedure in place to synchronize the caches between the servers on occasion, should messages be missed in RMQ. You would also need a way to handle node startup and shutdown.
Edit
In consideration of your statement in the comments that you are seeing access times in the hundreds of milliseconds, I'm going to advise that you first examine your setup. Typical read times for a single item in the cache from a Memcached (or Couchbase, or Redis, etc.) instance are sub-millisecond (somewhere around .1 milliseconds if I remember correctly), so your "problem child" of a cache server is several orders of magnitude from where it should be in terms of performance. Start there, then see if you still have the same problem.
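As a quick sanity check, you can time raw gets against your Memcached instance from one of the application servers. A rough sketch using the spymemcached client (host, port, and iteration counts are placeholders):

```java
import net.spy.memcached.MemcachedClient;

import java.net.InetSocketAddress;

public class MemcachedLatencyCheck {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
        client.set("probe-key", 3600, "probe-value").get();   // wait for the write to complete

        // Warm up, then time a batch of gets to estimate per-read latency.
        for (int i = 0; i < 1_000; i++) {
            client.get("probe-key");
        }
        int reads = 10_000;
        long start = System.nanoTime();
        for (int i = 0; i < reads; i++) {
            client.get("probe-key");
        }
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("avg get latency: %.3f ms%n", elapsedNanos / 1_000_000.0 / reads);

        client.shutdown();
    }
}
```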
We're using something similar for data which is read-only and doesn't need to be updated every time. I doubt that this is a good plan for you. Just imagine: you would have one more additional service on each instance, which would monitor the queue and apply changes to the in-memory storage. This is very hard to test.
Are you sure that most of the time is spent on communication between your servers? Maybe you are making multiple calls?
As some of you are aware, Django has multi-db support. This can be achieved by writing a db router to send writes to the master database and all reads to the slave, but as stated in the Django Docs for Master/Slave Configuration:
The master/slave configuration described is also flawed – it doesn’t provide any solution for handling replication lag (i.e., query inconsistencies introduced because of the time taken for a write to propagate to the slaves). It also doesn’t consider the interaction of transactions with the database utilization strategy.
How can I account for replication lag and the query inconsistencies due to the time it takes for the write to propagate to the slaves? Is there any sort of code I can implement for this?
If you're writing to MASTER and immediately reading from SLAVE, you will always run the risk of inconsistencies. You can mitigate the risk but you can never avoid it.
Doing everything possible to profile, tune and minimize the replication lag will help. Delaying the read query until the last possible moment will help. But it can't be avoided entirely if you're intent on never reading from MASTER.
If you time the replication lag under typical usage, you could configure something like django-multidb-router to read from MASTER for a specified period of time after a write. It's still not 100% safe, but you can configure it to be 99.9% safe for your setup.
Note that in MySQL 5.6 they have introduced a "Semi-Sync" mode (http://dev.mysql.com/doc/refman/5.6/en/replication-semisync.html), which guarantees a synchronous slave when used in a simple Master-slave (and only one slave) setup. This avoids that very specific problem, but what you gain in consistency you lose in some added transaction time.
I want to know whether I can use Infinispan for cached data synchronization with an Oracle database. This is my scenario: I have two main applications; one is a highly concurrent app, and the second is used as an admin module. Since the first is highly concurrent, I want to reduce database connections (load the entities into a read/write cache and serve them from there without calling the database). But at the same time I want to update the database according to the cache changes, because the admin module uses the database directly. Can that update process (cache to database) be handled at the entity level without involving the application? Please let me know whether Infinispan supports this scenario; if it does, please share ideas.
Yes, it is possible. Infinispan supports this use case.
This should be a simple configuration "problem" only. All you need is a properly configured CacheStore with passivation disabled. It will keep your cache (used by the highly concurrent application) synchronized with the database.
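As a rough illustration only (the builder API varies across Infinispan versions, and the table/column names, connection settings, and cache name below are placeholders), a programmatic configuration with a JDBC store and passivation disabled might look roughly like this:

```java
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.persistence.jdbc.configuration.JdbcStringBasedStoreConfigurationBuilder;

public class CacheStoreSetup {
    public static void main(String[] args) {
        ConfigurationBuilder builder = new ConfigurationBuilder();

        JdbcStringBasedStoreConfigurationBuilder store = builder.persistence()
                .passivation(false)   // every put/remove is written through to the store
                .addStore(JdbcStringBasedStoreConfigurationBuilder.class)
                .shared(true);        // all nodes share the same Oracle schema

        store.table()
             .tableNamePrefix("ISPN_ENTRIES")
             .idColumnName("ID").idColumnType("VARCHAR2(255)")
             .dataColumnName("DATA").dataColumnType("BLOB")
             .timestampColumnName("TS").timestampColumnType("NUMBER(19)");

        store.connectionPool()
             .connectionUrl("jdbc:oracle:thin:@//dbhost:1521/ORCL")
             .driverClass("oracle.jdbc.OracleDriver")
             .username("app_user")
             .password("app_password");

        DefaultCacheManager manager = new DefaultCacheManager();
        manager.defineConfiguration("entities", builder.build());
        manager.getCache("entities").put("42", "cached value");   // also written to Oracle
    }
}
```

Note that a JDBC string-based store writes marshalled entries; if the admin module needs to read real entity rows, Infinispan's JPA cache store is the entity-level variant of the same idea.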
What exactly does disabling passivation do?
When passivation is disabled, whenever an element is modified, added or removed, then that modification is persisted in the backend store via the cache loader. There is no direct relationship between eviction and cache loading. If you don't use eviction, what's in the persistent store is basically a copy of what's in memory.
Here, "memory" means the cache.
If you want to know even more about this and about other interesting options please see: https://docs.jboss.org/author/display/ISPN/Cache+Loaders+and+Stores#CacheLoadersandStores-cachepassivation
Maybe it's also worth considering the aforementioned eviction, and whether to enable or disable it; that depends mainly on the load generated by your highly concurrent application.
Actually, this only works when the admin module uses Infinispan in the same cluster. If you load A into memory with Infinispan and then change A to something else directly in the database with the admin module, Infinispan will not know that A has been updated.