Cannot set consistency level when querying Amazon Keyspaces service from DataGrip

I'm trying to perform inserts on Amazon's Managed Cassandra service from JetBrains' DataGrip IDE, but I receive the following error:
Consistency level LOCAL_ONE is not supported for this operation. Supported consistency levels are: LOCAL_QUORUM
This is due to Amazon using the LOCAL_QUORUM consistency level for writes.
I tried to set the consistency level with CONSISTENCY LOCAL_QUORUM; before running other queries but it returned the following error:
line 1:0 no viable alternative at input 'CONSISTENCY' ([CONSISTENCY])
From my understanding, this is because CONSISTENCY is a cqlsh command and not a CQL command.
I cannot find any way to set the consistency level from within DataGrip so that I can run scripts and populate my tables.
Ultimately, I will use plain cqlsh if I cannot find a solution, but I was hoping to use DataGrip, as I find it useful and have many databases already configured. I hope someone can shed some light on the issue; this seems like it should be a basic feature.

I am Max from the DataGrip team, and the correct answer is:
It could be a JDBC driver issue where the desired method hasn't been implemented yet, since you're trying to run a pure cqlsh command as SQL. Follow the issue DBE-10638.

It's a DataGrip bug, see https://youtrack.jetbrains.com/issue/DBE-10182 :
Cassandra 'CONSISTENCY' command is not supported
So upvote that bug, and maybe add a comment that it makes DataGrip useless for writing to Amazon Managed Cassandra

Amazon Keyspaces (for Apache Cassandra)
I am using DataGrip version 2020.1.3 (licensed) and ran into the same problem:
I cannot change the CONSISTENCY from ONE to LOCAL_QUORUM.
I have already opened an issue and am waiting for the investigation.
In the meantime I tried many tools and found that DBeaver works:
the CONSISTENCY can be selected in the configuration GUI.
https://dbeaver.com/download

Related

When I try to fetch data from Amazon Keyspaces with PySpark, I get an "Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner" error

I'm not experienced in Java or the Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces using the spark-cassandra-connector from DataStax, and I'm using PySpark to fetch data from Cassandra. I can successfully connect to the Keyspaces/Cassandra cluster, but it fails when I try to fetch data from it:
df = spark.sql("SELECT * FROM cass.tutorialkeyspace.tutorialtable")
print ("Table Row Count: ")
print (df.count())
I get this error:
Unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner
Yes, the keyspace and table exist and contain data. How can I fix or work around this? Thanks!
As an FYI, Keyspaces now supports using the RandomPartitioner, which enables reading and writing data in Apache Spark by using the open-source Spark Cassandra Connector.
Docs: https://docs.aws.amazon.com/keyspaces/latest/devguide/spark-integrating.html
Launch announcement: https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-keyspaces-read-write-data-apache-spark/
The Spark Cassandra Connector relies on a specific partitioner implementation to define data splits, etc. There is no workaround for this problem right now, until somebody adds an implementation of the corresponding TokenFactory to this code. It shouldn't be very complex; it just needs to be done by someone who is interested in it.
Thank you for the feedback. At this time, you can write to Keyspaces using the Spark Cassandra Connector. Reading requires support for token ranges. Please see the following doc page for the list of supported APIs: https://docs.aws.amazon.com/keyspaces/latest/devguide/cassandra-apis.html.
Although we don't have timelines to share at the moment, we prioritize our roadmap based on customer feedback. We are releasing new features all the time. To learn more about our roadmap and upcoming features please contact your AWS Account manager.

How to connect pgBadger to Google Cloud SQL

I have a database on a Google Cloud SQL instance. I want to connect it to pgBadger, which is used to analyse queries. I have looked at various methods, but they all ask for the log file location.
I believe there are 2 major limitations preventing an easy set up that would allow you to use pgBadger with logs generated by a Cloud SQL instance.
The first is that Cloud SQL logs are processed by Stackdriver and can only be accessed through it. It is possible to export logs from Stackdriver, but the output format and destination will still not meet pgBadger's requirements, which leads to the second major limitation.
Cloud SQL does not allow changes in all required configuration directives. The major one is the log_line_prefix, which currently does not follow the required format and it is not possible to change it. You can actually see what flags are supported in Cloud SQL in the Supported flags documentation.
In order to use pgBadger you would need to reformat the log entries, while exporting them to a location where pgBadger could do its job. Stackdriver can stream the logs through Pub/Sub, so you could develop an app to process and store them in the format you need.
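A minimal sketch of the reformatting step such an app would perform, written in Python. It assumes each exported entry arrives as a JSON object with the LogEntry fields `timestamp` (UTC, RFC 3339) and `textPayload`, and that `textPayload` holds the PostgreSQL message starting at its severity; both points should be checked against your actual export. The helper name `to_pgbadger_line` is hypothetical, and because the process ID is not assumed to be present in the export, a placeholder is written instead.

```python
import json
from datetime import datetime


def to_pgbadger_line(entry_json: str, pid: int = 0) -> str:
    """Turn one exported log entry (JSON text) into a '%t [%p]: message' style line."""
    entry = json.loads(entry_json)
    # Assumption: 'timestamp' is UTC in RFC 3339 form; keep only whole seconds,
    # which also sidesteps nanosecond fractions that strptime cannot parse.
    ts = datetime.strptime(entry["timestamp"][:19], "%Y-%m-%dT%H:%M:%S")
    # Assumption: 'textPayload' is the raw PostgreSQL message, e.g. "LOG:  duration: ...".
    return f"{ts:%Y-%m-%d %H:%M:%S} UTC [{pid}]: {entry['textPayload']}"


sample = json.dumps({
    "timestamp": "2019-05-01T12:00:00.123456789Z",
    "textPayload": "LOG:  duration: 12.5 ms  statement: SELECT 1",
})
print(to_pgbadger_line(sample, pid=42))
```

The lines produced this way would be appended to a file, and pgBadger would then need to be told which prefix format that file uses.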
I hope this helps.

How can I fix PostgreSQL canceling statement error on Google SQL?

We have PostgreSQL instances (1 master + 1 read replica) on Google SQL. Our Django (1.11.12) application uses these databases via the PostGIS engine. When we try to use the database, we see this error message:
django.db.utils.OperationalError: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
When I search for a solution, they generally say that I need to change hot_standby_feedback flag. But as you know Google SQL service has some restrictions about settings. I can't set the flag.
How can I fix this?
If “Google SQL” allows that, you can set max_standby_streaming_delay to -1 so that replication is delayed if a conflict is detected.
Then the query will not be canceled, but replication may lag if applying changes would cause a conflict.
Consider getting an “unfettered” PostgreSQL.
If you would like to set hot_standby_feedback = on, I suggest that you register your interest in the open feature request on Google Cloud Platform's Public Issue Tracker. That way someone can look into the query conflict issues your Cloud SQL PostgreSQL instance encountered.
I've also been monitoring an open thread in the Issue Tracker about making max_standby_archive_delay and max_standby_streaming_delay flags available to users to set. You can track it there as well. Hope this helps!

Trouble configuring Presto's memory allocation on AWS EMR

I am really hoping to use Presto in an ETL pipeline on AWS EMR, but I am having trouble configuring it to fully utilize the cluster's resources. This cluster would exist solely for this one query, and nothing more, then die. Thus, I would like to claim the maximum available memory for each node and the one query by increasing query.max-memory-per-node and query.max-memory. I can do this when I'm configuring the cluster by adding these settings in the "Edit software settings" box of the cluster creation view in the AWS console. But the Presto server doesn't start, reporting in the server.log file an IllegalArgumentException, saying that max-memory-per-node exceeds the useable heap space (which, by default, is far too small for my instance type and use case).
I have tried to use the session setting set session resource_overcommit=true, but that only seems to override query.max-memory, not query.max-memory-per-node, because in the Presto UI, I see that very little of the available memory on each node is being used for the query.
Through Google, I've been led to believe that I need to also increase the JVM heap size by changing the -Xmx and -Xms properties in /etc/presto/conf/jvm.config, but it says here (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that it is not possible to alter the JVM settings in the cluster creation phase.
To change these properties after the EMR cluster is active and the Presto server has been started, do I really have to manually ssh into each node and alter jvm.config and config.properties, and restart the Presto server? While I realize it'd be possible to manually install Presto with a custom configuration on an EMR cluster through a bootstrap script or something, this would really be a deal-breaker.
Is there something I'm missing here? Is there not an easier way to make Presto allocate all of a cluster to one query?
As advertised, increasing query.max-memory-per-node, and also by necessity the -Xmx property, indeed cannot be achieved on EMR until after Presto has already started with the default options. To increase these, the jvm.config and config.properties found in /etc/presto/conf/ have to be changed, and the Presto server restarted on each node (core and coordinator).
One can do this with a bootstrap script using commands like
sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
sudo restart presto-server
and similarly for /etc/presto/conf/jvm.config. The only caveats are that the bootstrap action needs logic to execute only after Presto has been installed, and that the server on the coordinator node needs to be restarted last (and possibly with different settings if the master node's instance type differs from that of the core nodes).
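For anyone who prefers to script the edit rather than use sed, here is a minimal Python sketch of the same substitution. The helper name `set_property` is hypothetical; it works on the text of config.properties only, so reading and writing the file and restarting presto-server are left out. Unlike the sed pattern above, it replaces the value whatever its unit and appends the property if it is missing.

```python
import re


def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with `key` set to `value`, appending the line if absent."""
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(text):
        # Replace every existing assignment of this exact key.
        return pattern.sub(lambda _match: line, text)
    # Key not present: append it, adding a newline first only if one is needed.
    separator = "" if (not text or text.endswith("\n")) else "\n"
    return f"{text}{separator}{line}\n"


config = "coordinator=false\nquery.max-memory-per-node=1GB\nquery.max-memory=30GB\n"
print(set_property(config, "query.max-memory-per-node", "20GB"))
```

Because the key is matched up to the `=` sign, setting query.max-memory does not touch query.max-memory-per-node, and running the helper twice gives the same result as running it once.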
You might also need to change resources.reserved-system-memory from the default by specifying a value for it in config.properties. By default, this value is .4*(Xmx value), which is how much memory is claimed by Presto for the system pool. In my case, I was able to safely decrease this value and give more memory to each node for executing the query.
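To make the arithmetic concrete, here is a small Python sketch. It assumes that the memory left for a query on a node is simply the heap (-Xmx) minus the reserved system memory; that is how I read the behaviour described above, not something taken from the Presto source. The 30 GB heap and 4 GB reservation are made-up example numbers, and the function name is hypothetical.

```python
from typing import Optional


def query_memory_ceiling_mb(xmx_mb: int, reserved_mb: Optional[int] = None) -> int:
    """Heap left for queries once the system pool is set aside (all values in MB)."""
    if reserved_mb is None:
        # Default described above: 0.4 * Xmx is claimed for the system pool.
        reserved_mb = xmx_mb * 4 // 10
    return xmx_mb - reserved_mb


# Example: a 30 GB heap with the default reservation, then with 4 GB reserved.
print(query_memory_ceiling_mb(30 * 1024))            # 18432 MB, i.e. 18 GB
print(query_memory_ceiling_mb(30 * 1024, 4 * 1024))  # 26624 MB, i.e. 26 GB
```

Under that assumption, lowering the reservation is what frees up the extra per-node memory for the query.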
As a matter of fact, there are configuration classifications available for Presto in EMR. However, please note that these may vary depending on the EMR release version. For a complete list of the available configuration classifications per release version, please see [1] (make sure to switch between the tabs according to your desired release version). Regarding jvm.config properties specifically, you will see in [2] that these are not currently configurable via configuration classifications. That being said, you can always edit the jvm.config file manually as needed.
[1] Amazon EMR 5.x Release Versions
[2] Considerations with Presto on Amazon EMR - Some Presto Deployment Properties not Configurable

Installing the Kmeans PostgreSQL extension on Amazon RDS

I take part in a Django project, and we use geo data (with GeoDjango).
I have installed PostGIS as described in the AWS docs.
We have a lot of points (markers) on the map, and we need to cluster them.
I found a library called anycluster. It requires the PostgreSQL extension kmeans-postgresql to be installed on the database.
But my database is hosted on Amazon RDS, and I can't connect to it over SSH to install an extension.
Does anybody know how I can install the kmeans-postgresql extension on my Amazon RDS database?
Or can you suggest other ways of clustering?
K-means is a computationally expensive algorithm that is useful for data mining and cluster analysis (see the Wikipedia page https://en.wikipedia.org/wiki/K-means_clustering). It becomes very costly when it has to deal with many points. The K-means extension for PostgreSQL http://pgxn.org/dist/kmeans/doc/kmeans.html is written in C and compiled on the database machine, which gives better performance than a procedure written in plpgsql. Unfortunately, as @estevao_lucas answered, this extension is not enabled on Amazon RDS.
If you really need the k-means result, I took the implementation created by Joni Salonen at http://jonisalonen.com/2012/k-means-clustering-in-mysql/ and translated it to plpgsql: https://gist.github.com/thiagomata/a9737c3455d6248bef9f. This function uses a temporary table. It is possible to change it to use only arrays of pins, if you want to.
But if you only need to show some pins on a map, you will probably be happy with a much faster and simpler function that groups the results into an [x,y] matrix. I created such a function because the kmeans function was taking too much time to process my database (well over 400K elements). This implementation is much faster, but does not have all the features you would expect from the K-means module. Still, this grid function https://gist.github.com/thiagomata/18ea14853998468c1a1d returns very good results when the goal is to show a large number of pins on a map.
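To illustrate the grid idea, here is a minimal Python sketch. It is not the plpgsql function from the gist, only the general technique as I understand it: each point is assigned to a square cell by dividing its coordinates by a cell size, and every non-empty cell becomes one pin placed at the average position of its points. The names `grid_cluster` and `cell_size` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


def grid_cluster(points: Iterable[Point], cell_size: float) -> List[Tuple[float, float, int]]:
    """Group points into square cells; return (centroid_x, centroid_y, count) per non-empty cell."""
    cells: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for x, y in points:
        # Floor division maps each coordinate to the index of the cell containing it.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    clusters = []
    for members in cells.values():
        count = len(members)
        centroid_x = sum(p[0] for p in members) / count
        centroid_y = sum(p[1] for p in members) / count
        clusters.append((centroid_x, centroid_y, count))
    return clusters


pins = [(0.5, 0.5), (1.5, 1.5), (10.0, 10.0)]
print(sorted(grid_cluster(pins, cell_size=2.0)))
```

This takes a single pass over the points, which is why it scales so much better than k-means, at the cost of clusters that follow the grid rather than the data.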
You can only install supported extensions on Amazon RDS, and kmeans isn't one of them.
ERROR: Extension "kmeans" is not supported by Amazon RDS
DETAIL: Installing the extension "kmeans" failed, because it is not on the list of extensions supported by Amazon RDS.
HINT: Amazon RDS allows users with rds_superuser role to install supported extensions. See: SHOW rds.extensions;
alexandria_development=> SHOW rds.extensions
RDS extensions:
btree_gin,
btree_gist,
chkpass,
citext,
cube,
dblink,
dict_int,
dict_xsyn,
earthdistance,
fuzzystrmatch,
hstore,
intagg,
intarray,
isn,
ltree,
pgcrypto,
pgrowlocks,
pg_prewarm,
pg_stat_statements,
pg_trgm,
plcoffee,
plls,
plperl,
plpgsql,
pltcl,
plv8,
postgis,
postgis_tiger_geocoder,
postgis_topology,
postgres_fdw,
sslinfo,
tablefunc,
test_parser,
tsearch2,
unaccent,
uuid-ossp