AppFabric Regions and Local Cache

(1) I'm wondering, if I use regions in my AppFabric caching, is it possible for the regions to exist in the local cache? Or do regions only exist on the cluster?
(2) As a separate question, how can I tell whether my data is coming from the cluster or from my local cache? Is there some kind of AppFabric tool that I can use to analyse where the data is coming from?
I am using configuration in code to set up my local cache properties and put my items into the local cache, like so:
// Local cache: up to 10,000 objects, invalidated by timeout (localTimeout is a TimeSpan defined elsewhere).
DataCacheLocalCacheProperties localCacheConfig =
    new DataCacheLocalCacheProperties(10000, localTimeout, DataCacheLocalCacheInvalidationPolicy.TimeoutBased);

// Set up the DataCacheFactory configuration.
DataCacheFactoryConfiguration factoryConfig = new DataCacheFactoryConfiguration();
factoryConfig.Servers = servers;                        // list of DataCacheServerEndpoint for the cluster hosts
factoryConfig.LocalCacheProperties = localCacheConfig;

// ...code to put items in the cache, etc.
Do I need to do anything special on the 'Get' or is it smart enough to get it from local cache if it exists there?

A region is a logical concept in AppFabric. It lets you query items by tag, but all items in a region are stored on a single host (which limits both distribution and high availability).
The local cache is not bound to regions; it is just a local copy of the items returned by your most recent calls to the cache cluster.
The local cache is enabled on the client side, so you can leave it disabled if you want.
The lifetime of an object in the local cache depends on several factors, such as the maximum number of objects in the local cache and the invalidation policy (timeout-based or notification-based).
You cannot tell whether data is coming from the local cache or the distributed cache. That's why the local cache is recommended for reference data.

1. Regions do not exist in the local cache; regions are specific to the cache cluster only.
2. You can get statistics for a particular named cache by executing the command below in the AppFabric PowerShell console:
Get-CacheStatistics <CacheName>
It returns a result set that includes ReadRequestCount; if a request is served from the local cache, this count will not increase for that request.
This way you can check that the local cache is working.

Regions can't exist in a local cache, they only exist inside a cache on the cluster.
Because local cache is optional, client code should be completely agnostic about whether a cached item is coming from the cluster or from the local cache. To enable that, when a requested item is in the local cache it will be returned from there (avoiding a cross-server call to the cluster).
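As a minimal sketch (reusing the factoryConfig from the question; the cache name and key are hypothetical), the Get call is exactly the same whether or not the local cache is enabled, and you can watch ReadRequestCount from Get-CacheStatistics to see which calls actually reach the cluster:
using Microsoft.ApplicationServer.Caching;

DataCacheFactory factory = new DataCacheFactory(factoryConfig);
DataCache cache = factory.GetCache("MyNamedCache");    // hypothetical named cache

cache.Put("customer:42", "Alice");                     // written to the cluster
object first = cache.Get("customer:42");               // cluster read; a copy is kept in the local cache
object second = cache.Get("customer:42");              // served locally; ReadRequestCount does not increase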

AWS S3 Block Size to calculate total number of mappers for Hive Workload

Does S3 store the data in the form of blocks? If yes, what is the default block size? Is there a way to alter the block size?
Block Size is not applicable to Amazon S3. It is an object storage system, not a virtual disk.
There is believed to be some partitioning of uploaded data into the specific blocks it was uploaded in, and if you knew those values then readers might get more bandwidth. But the open-source Hive/Spark/MapReduce applications certainly don't know the API calls to find this information out or look at these details. Instead, the S3 connector takes a configuration option (for s3a: fs.s3a.block.size) to simulate blocks.
It also wouldn't be so beneficial to work out the real block size if it took an HTTP GET request against each file to determine the partitioning: that would slow down the (sequential) query planning before tasks on split files were farmed out to the worker nodes. HDFS lets you get the listing and the block partitioning plus locations in one API call (listLocatedStatus(path)); S3 only has a list call that returns the (objects, timestamps, etags) under a prefix (S3 List API v2), so that extra check would slow things down. If someone could fetch that data and show there would be benefits, maybe it would be useful enough to implement. For now, calls to S3AFileSystem.listLocatedStatus() against S3 just get a made-up list of block splits, partitioned by the fs.s3a.block.size value, with the location set to localhost. All the apps know that location == localhost means "whatever".

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago we activated Cloud CDN for storage buckets. Our storage data is regularly changed via a backend, so to invalidate the cached version we added a query param with the changedDate to the URL that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks) Google seems to have changed that and is now ignoring the query string when caching content from storage buckets.
First part: does anyone know why this was changed and why no one was notified about it?
Second part: how can you invalidate the cache for a particular object in a storage bucket without sending a cache-invalidation request (which you shouldn't) every time?
I don't like the idea of deleting the old file and uploading a new file with a changed filename every time something is uploaded...
EDIT:
For clarification: the official documentation (cloud.google.com/cdn/docs/caching) already states that they now ignore query strings for storage buckets:
For backend buckets, the cache key consists of the URI without the query string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were affected by this too. When we contacted Google Support, they confirmed this is a permanent change. The recommended workaround is to either use versioning in the object name, or use cache invalidation. The latter sounds a bit odd, as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states. The bucket does not evaluate the query string, but the CDN should still do that. I could reproduce this scenario, and it is currently still possible to use a query string as a cache buster.
It seems the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs, and higher latency. The only recommended workarounds for now are to create new objects by incorporating the version into the object's name (which, it seems, is not a valid option in your case), or to use cache invalidation.
Invalidating the cache for a particular object requires an explicit invalidation request. A Cache-Control header that only lets such objects be cached for a limited time may be your workaround: the Cloud CDN cache has an expiration time defined by the Cache-Control: s-maxage, Cache-Control: max-age, and/or Expires headers.
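If going that route, a rough sketch with the .NET client library (Google.Cloud.Storage.V1; the bucket and object names below are hypothetical) would be to set a short Cache-Control lifetime on the object so stale copies age out of the CDN quickly:
using Google.Cloud.Storage.V1;

var storage = StorageClient.Create();
var obj = storage.GetObject("my-bucket", "images/cat.jpg");   // hypothetical bucket/object
obj.CacheControl = "public, max-age=300";                     // let the CDN cache it for at most 5 minutes
storage.UpdateObject(obj);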
According to the docs, when using a backend bucket as the origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Maybe using the query string to identify different versions of cached content is not a best practice promoted by GCP, but for some legacy reasons it has to be done.
So, one way to work around this is to expose the backend bucket as a static website (do NOT enable CDN here), then use a custom origin (Cloud CDN backed by an internet network endpoint group backend service) that points to that static website.
For backend services, the query string IS part of the cache key:
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, it is tedious, but it works!

Difference between CloudFront TTL expiry and Invalidation?

What are the practical differences between when CloudFront expires objects at an origin via the CloudFront TTL setting versus when one calls invalidate?
The general idea is that you use TTLs to set the policy that CloudFront uses to determine the maximum amount of time each individual object can potentially be served from the CloudFront cache with no further interaction with the origin.
Default TTL: the maximum time an object can persist in a CloudFront cache without being considered stale, when no relevant Cache-Control directive is supplied by the origin. No Cache-Control header is added to the response by CloudFront.
Minimum TTL: If the origin supplies a Cache-Control: s-maxage value (or, if not present, then a Cache-Control: max-age value) smaller than this, CloudFront ignores it and assumes it can retain the object in the cache for not longer than this. For example, if Minimum TTL is set to 900, but the response contains Cache-Control: max-age=300, CloudFront ignores the 300 and may cache the object for up to 900 seconds. The Cache-Control header is not modified, and is returned to the viewer as received.
Maximum TTL: If the origin supplies a Cache-Control directive indicating that the object can be cached longer than this, CloudFront ignores the directive and assumes that the object must not continue to be served from cache for longer than Maximum TTL.
See Specifying How Long Objects Stay in a CloudFront Edge Cache (Expiration) in the Amazon CloudFront Developer Guide.
So, these three values control what CloudFront uses to determine whether a cached response is still "fresh enough" to be returned to subsequent viewers. It does not mean CloudFront purges the cached object after the TTL expires. Instead, CloudFront may retain the object, but will not serve it beyond the expiration without first sending a request to the origin to see if the object has changed.
CloudFront does not proactively check the origin for new versions of objects that have expired -- it only checks when they are requested again while still in the cache and are then found to have expired. When it does this, it usually sends a conditional request, using directives like If-Modified-Since. This gives the origin the option of responding 304 Not Modified, which tells CloudFront that the cached object is still usable.
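To illustrate that exchange, here is a rough, hypothetical origin handler (ASP.NET Core minimal API; the route and file layout are invented for the example) that honors If-Modified-Since with a 304:
var app = WebApplication.CreateBuilder(args).Build();

app.MapGet("/assets/{name}", (HttpRequest request, string name) =>
{
    string path = Path.Combine("wwwroot/assets", name);        // hypothetical file layout
    DateTime lastModified = File.GetLastWriteTimeUtc(path);

    // CloudFront revalidates an expired cache entry with a conditional request;
    // a 304 tells it the cached copy is still usable.
    if (DateTime.TryParse(request.Headers["If-Modified-Since"], out DateTime since) &&
        lastModified <= since.ToUniversalTime())
    {
        return Results.StatusCode(304);
    }

    return Results.Bytes(File.ReadAllBytes(path), "application/octet-stream");
});

app.Run();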
A misunderstanding that sometimes surfaces is that the TTL directs CloudFront how long to cache the objects. That is not what it does. It tells CloudFront how long it is allowed to cache the response with no revalidation against the origin. Cache storage inside CloudFront has no associated charge, and caches by definition are ephemeral, so, objects that are rarely requested may be purged from the cache before their TTL expires.
If an object in an edge location isn't frequently requested, CloudFront might evict the object—remove the object before its expiration date—to make room for objects that have been requested more recently.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
On the next request, CloudFront will request the object from the origin again.
Another misunderstanding is that CloudFront's cache is monolithic. It isn't. Each of the global edges has its own independent cache, caching objects in edges through which they are being requested. Each global edge also has an upstream regional cache (in the nearest EC2 region; there may be more than one per region, but this isn't documented) where the object will also be stored, allowing other nearby global edges to find the object in the nearest regional cache, but CloudFront does not search any further, internally, for cached objects. For performance, it just goes to the origin on a cache miss.
See How CloudFront Works with Regional Edge Caches.
Invalidation is entirely different, and is intended to be used sparingly -- only the first 1000 invalidation paths submitted each month by an AWS account are free. (A path can match many files, and the path /* matches all files in the distribution).
An invalidation request has a timestamp of when the invalidation was created, and sends a message to all regions, directing them to do something along these lines (the exact algorithm isn't documented, but this accurately describes the net effect):
Delete any files matching ${path} from your cache, if they were cached prior to ${timestamp} and
Meanwhile, since that could take some time, if you get any requests for files matching ${path} that were cached prior to ${timestamp}, don't use the cached files because they are no longer usable.
The invalidation request is considered complete as soon as the entire network has received the message. Invalidations are essentially an idempotent action, in the sense that it is not an error to invalidate files that don't actually exist, because an invalidation is telling the edges to invalidate such files if they exist.
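For completeness, submitting an invalidation programmatically looks roughly like this (a sketch using the AWS SDK for .NET; the distribution ID and path are placeholders):
using Amazon.CloudFront;
using Amazon.CloudFront.Model;

var cloudFront = new AmazonCloudFrontClient();
await cloudFront.CreateInvalidationAsync(new CreateInvalidationRequest
{
    DistributionId = "E1EXAMPLE12345",                    // placeholder distribution ID
    InvalidationBatch = new InvalidationBatch
    {
        CallerReference = Guid.NewGuid().ToString(),      // must be unique per invalidation request
        Paths = new Paths { Quantity = 1, Items = new List<string> { "/pics/*" } }
    }
});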
From this, it should be apparent that the correct course of action is not to choose one or the other, but to use both as appropriate. Set your TTLs (or select "use origin cache headers" and configure your origin server always to return them with appropriate values) and then use invalidations only as necessary to purge your cache of selected or all content, as might be necessary if you've made an error, or made significant changes to the site.
The best practice, however, is not to count on invalidations but instead to use cache-busting techniques when an object changes. Cache busting means changing the actual object being requested. In the simplest implementation, for example, this might mean you change /pics/cat1.png to /pics/cat2.png in your HTML rather than saving a new image as /pics/cat1.png when you want a new image. The problem with replacing one file with another at the same URL is that the browser also has a cache, and may continue displaying the old image.
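One sketch of how that can be automated (the hashing scheme and paths here are purely illustrative) is to derive the published file name from the file's content, so any change produces a new URL:
using System.Security.Cryptography;

static string VersionedName(string path)
{
    // Short content hash in the file name: "pics/cat1.png" -> "pics/cat1.3f9a01bc.png".
    using var sha = SHA256.Create();
    string hash = Convert.ToHexString(sha.ComputeHash(File.ReadAllBytes(path)))[..8].ToLowerInvariant();
    string dir = Path.GetDirectoryName(path) ?? "";
    return Path.Combine(dir,
        $"{Path.GetFileNameWithoutExtension(path)}.{hash}{Path.GetExtension(path)}");
}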
See also Invalidating Objects.
Note also that the main TTLs are not used for error responses. By default, responses like 404 Not Found are cached for 5 minutes. This is the Error Caching Minimum TTL, designed to relieve your origin server from receiving requests that are likely to continue to fail, but only for a few minutes.
If we are looking at practical differences:
CloudFront TTL: You can control how long your objects stay in a CloudFront cache before CloudFront forwards another request to your origin.
Invalidate: Invalidate the object from edge caches. The next time a viewer requests the object, CloudFront returns to the origin to fetch the latest version of the object.
So the main difference is speed. If you deploy a new version of your application you might want to invalidate immediately.

Cloudfront decision to fetch data from cache or server?

From Amazon CloudFront:
Amazon CloudFront is a web service that speeds up distribution of your static and dynamic web content, such as .html, .css, .php, and image files, to your users. CloudFront delivers your content through a worldwide network of data centers called edge locations.
Per my understanding, CloudFront must be caching the content with the URL as the key. URLs can serve both static and dynamic content. Say I have 100 web URLs, of which 30 serve static content and 70 serve dynamic content (user-specific data). I have one question each on static and dynamic content.
Dynamic content:
Say user_A accesses his data through url_A from the US. That data has been cached. He updates his first name. Now the same user accesses the data again, from the same location in the US or from another location in the UK. Will he see the data prior to the update? If yes, how will the edge location come to know that the data needs to be fetched from the server and not from the cache?
Does the edge location continue to serve the data from the cache for a configurable amount of time, and fetch it from the server once that time has passed?
Does CloudFront allow configuring specific URLs that must always be fetched from the server instead of the cache?
Static content:
There is a chance that even static data changes with each release. How will CloudFront know that the cached static content is stale and needs to be fetched from the server?
Amazon CloudFront uses an expiration period (or Time To Live - TTL) that you specify.
For static content, you can set the default TTL for the distribution or you can specify the TTL as part of the headers. When the TTL has expired, the CloudFront edge location will check to see whether the Last Modified timestamp on the object has changed. If it has changed, it will fetch the updated copy. If it is not changed, it will continue serving the existing copy for the new time period.
If static content has changed, your application must send an Invalidation Request to tell CloudFront to reload the object even when the TTL has not expired.
For dynamic content, your application will normally specify zero as the TTL. Thus, that URL will always be fetched from the origin, allowing the server to modify the content for the user.
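In practice that often just means the origin marks user-specific responses as non-cacheable; a hypothetical ASP.NET Core handler might do something like this:
var app = WebApplication.CreateBuilder(args).Build();

app.MapGet("/profile", (HttpContext ctx) =>
{
    // Tell CloudFront (and browsers) not to reuse this response for other viewers.
    ctx.Response.Headers["Cache-Control"] = "private, no-cache, max-age=0";
    return Results.Json(new { firstName = "user-specific data" });   // placeholder payload
});

app.Run();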
A half-and-half method is to use parameters (eg xx.cloudfront.net/info.html?user=foo). When configuring the CloudFront distribution, you can specify whether a different parameter (eg user=fred) should be treated as a separate object or whether it should be ignored.
Also, please note that each CloudFront edge location has its own cache. So, if somebody accessed a page from the USA, that would not cause it to be cached in the UK.
See the documentation: Specifying How Long Objects Stay in a CloudFront Edge Cache (Expiration)

LocalInsertRemoteInsert error in SyncFramework 2.1 when changing directions

I am using Sync Framework 2.1.
What I am doing is changing the direction of sync continuously.
For example: first I set bidirectional, then maybe upload, and then download.
I am creating new scopes whenever any change happens and deprovisioning the existing scope.
Now, after I set bidirectional and then the upload direction, upload does not work at all.
When I then change it to bidirectional, all the local changes are overridden by the server.
While uploading, all the records are conflicted with LocalInsertRemoteInsert.
There is also no scope overlapping; as far as I found, there are no scopes for this table in scope_info.
I also referred to this: LocalInsertRemoteInsert conflicts on initial sync
Any help is appreciated
If you previously provisioned and synched the databases, then each copy already contains data. When you deprovision, Sync Fx removes the sync metadata, including the information on what was previously synched, but not the data itself.
So when you reprovision and try to sync, since the previous information on what was synched was already wiped out by deprovisioning, Sync Fx has no idea that the replicas already contain the same set of rows.
When you sync, it will try to send the rows from one replica to the other; since the data already exists on the other side, you get a conflict (a duplicate PK error when inserting rows).
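If you need to get past those conflicts after reprovisioning, one rough sketch (assuming a SqlSyncProvider on the destination side; the handler simply lets the incoming row win) is to resolve LocalInsertRemoteInsert in the ApplyChangeFailed event:
using Microsoft.Synchronization.Data;

destinationProvider.ApplyChangeFailed += (sender, e) =>
{
    if (e.Conflict.Type == DbConflictType.LocalInsertRemoteInsert)
    {
        // Both replicas already contain a row with this PK; force-write the
        // incoming row instead of failing the batch.
        e.Action = ApplyAction.RetryWithForceWrite;
    }
};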