General guidance around Bigtable performance - google-cloud-platform

I'm using a single node Bigtable cluster for my sample application running on GKE. Autoscaling feature has been incorporated within the client code.
Sometimes I experience slowness (>80ms) for the GET calls. In order to investigate it further, I need some clarity around the following Bigtable behaviour.
I have cached the Bigtable table object to ensure faster GET calls. Is the table object persistent on GKE? I have learned that objects are not persistent on Cloud Function. Do we expect any similar behaviour on GKE?
I'm using service account authentication but how frequently auth tokens get refreshed? I have seen frequent refresh logs for gRPC Java client. I think Bigtable won't be able to serve the requests over this token refreshing period (4-5 seconds).
What if client machine/instance doesn't scale enough? Will it cause slowness for GET calls?
Bigtable client libraries use connection pooling. How frequently connections/channels close itself? I have learned that connections are closed after minutes of inactivity (>15 minutes or so).
I'm planning to read only needed columns instead of entire row. This can be achieved by specifying the rowkey as well as column qualifier filter. Can I expect some performance improvement by not reading the entire row?

According to GCP official docs you can get here the cause of slower performance of Bigtable. I would like to suggest you to go through the docs that might be helpful. Also you can see Troubleshooting performance issues.

Related

AWS RDS Performance Insights not showing SQL Queries

I enabled Performance Insights on an existing SQL Server database (MySql 5.6.46) in AWS RDS.
But still, it shows 0 sessions and “No active sessions in the selected time range” no matter what duration I've select from the top list.
Is there some condition I need to meet in order to have my query get recorded in Performance Insights? What're the criteria? How can I troubleshoot this?
I created AWS Support case where AWS Engineer explained to me:
Unfortunately, this is a known issue from our end where Performance Insights does not get enabled when it is issued in the same API call as engine version upgrade as RDS follows a priority in executing multiple requests that have been submitted as part of the same API call - for example in this case, request to enable Performance Insights and request to upgrade the instance to 11.1 version. Performance Insights call is evaluated first followed by the engine upgrade. This means that when Performance Insights request was being considered, the instance was still on the previous incompatible version, hence the request did not go through successfully.
The workaround to resolve this issue is to disable Performance insights, wait a few minutes and then re-enable Performance Insights.
Enabling/disabling Performance Insights does not cause an outage/downtime. The Performance Insights agent is designed to stay out of your database workloads' way. When Performance Insights detects heavy load or depleted resources, it backs off, still collecting data, but only when it is safe to do so.

Apache SuperSet is very slow

Any recommendation on how to make superset faster?
Cache seems to load full data from the cache, I thought it load only old data from the cache, and real-time data from the database, isn't it like this?
What about some parallel processing?
This answer is valid as of Superset 0.37.0.
At the moment, dashboard performance is affected by a few different factors. I'll enumerate them below along with methods to improve performance:
Database concurrency limits can have an impact on dashboard performance. Dashboards load their information in parallel via concurrent web requests. Make sure that the database user provided allows enough concurrency that queries aren't being queued at the database layer.
Cache performance your caching layer should be able to return multiple results, if not in parallel, extremely quickly. We've had success leveraging S3 for our cache.
Cache hit percentage Superset will hit the cache only for queries that exactly match one that has been run recently. Otherwise the full query will fall through to the underlying analytical DB (Druid in this case). You can reduce the query load on Druid by using a less granular resolution on your dashboard - if it's possible to have it update less frequently, say a couple of times a day rather than in real-time, this can hit cache for all requests other than the first request in the new period under consideration.
Python Web Process Concurrency Limits make sure that your web application server can handle enough parallel requests. The browser will request multiple charts' data at the same time, and the system will need to be able to handle these requests in parallel.
Chart Query Performance As data is frequently requested, especially for real-time data from a database like Druid, optimizing the queries run by the charts can be very useful. I'd take a look at any virtual datasources that are being leveraged to see if they can be materialized or made more efficient.
Web browser concurrent request limits By default most web browsers limit the number of concurrent requests that can be made to the same FQDN. If you have more than 6 charts on the same dashboard, it can be helpful to balance requests across multiple FQDNs running Superset to get around this browser limitation. There's more information on the approach to that in the issue history on Github, but Superset does support this type of configuration.
The community is very interested in improving performance over time, and as such there have been recommendations to move all analytical queries to Celery as well as making other architectural changes to improve performance. I hope this description helps and that something in here will help you track down the issue!

How can I implement Amazon EMR to read data from my API calls?

All the examples i've seen are with Java programs?
I want to be able to track the a user's behaviour while navigating my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also for example want to check all the keywords passed to my search API to have a list of most search terms.
I thought about using Oozie but does anyone have any other suggestions ?
There are several option for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (eg daily or weekly), you could launch an EMR cluster to perform analysis. Please note that this is a powerful but rather complex toolset and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.

How to relieve a rate-limited API?

We run a website which heavily relies on the Amazon Product Advertising API (APAA). What happens is that when we experience a sudden spike in users it happens that we hit the rate-limit and all functions relying on the APAA shut down for a while. What can we do so that doesn't happen?
So, obviously we have some basic caching in place, but the APAA doesn't allow us to cache data for a very long time, and APAA queries can vary a lot so there may not be any cached data at all to query.
I think that your only option is to retry the API calls until they work — but do so in a smart way. Unfortunately, that's what everybody that gets throttled does and AWS expects people to handle that themselves.
You can implement an exponential backoff and add jitter to prevent cluster calls. AWS has a great blog post about solutions for this kind of problem: https://www.awsarchitectureblog.com/2015/03/backoff.html

Sync Framework 2.1 - Large data over WCF

I have implemented data sync using MS Sync framework 2.1 over WCF to sync multiple SQL Express databases with a central SQL server. Syncing is happening every three minutes through a windows service. Recently, we noticed that huge amounts of data is being exchanged over the network (~100 MB every 15 minutes). When I checked using Fiddler, the client calls the service with a GetKnowledge request four times in a session and each response is around 6 MB in size, although there are no changes at all in either database. This does not seem to be normal? How do I optimize the system to reduce such heavy traffic? Please help.
I have defined two scopes with first one having 15 tables all download only. The second one has 3 tables with upload only direction.
The XML response has a very huge number of <range> tags under coreFragments/coreFragment/ranges tag which is the major portion contributing to the response size.
Let me know if any additional information is required.
must be the sync knowledge. do you do lots of deletes? or do you have lots of replicas? try running a metadata cleanup and see if it compacts the sync knowledge.
Creating one to one scopes and re-provisioning fixed the issue. I am not still sure what caused the original issue.
Do you happen to have any join tables and use any ORM. If you do, then this post might help.
https://kumarkrish.wordpress.com/2015/01/07/microsoft-sync-frameworks-heavy-traffic/