Installing the Kmeans PostgreSQL extension on Amazon RDS - django

I'm working on a Django project that uses geo data (with GeoDjango).
I have installed PostGIS as described in the AWS docs.
We have a lot of points (markers) on the map, and we need to cluster them.
I found the library anycluster. It requires the PostgreSQL extension kmeans-postgresql to be installed on the database.
But my database is hosted on Amazon RDS, and I can't connect to it over SSH to install an extension...
Does anybody know how I can install the kmeans-postgresql extension on my Amazon RDS database?
Or can you suggest other ways of clustering?

K-means is a fairly expensive computation that is useful for data mining and cluster analysis (you can read more about it on the Wikipedia page https://en.wikipedia.org/wiki/K-means_clustering), and its cost grows quickly with the number of points. The kmeans extension for PostgreSQL (http://pgxn.org/dist/kmeans/doc/kmeans.html) is written in C and compiled on the database machine, which gives it much better performance than a procedure written in PL/pgSQL. Unfortunately, as @estevao_lucas answered, this extension is not enabled on Amazon RDS.
If you really need the k-means result, I translated the implementation created by Joni Salonen in http://jonisalonen.com/2012/k-means-clustering-in-mysql/ into PL/pgSQL: https://gist.github.com/thiagomata/a9737c3455d6248bef9f. This function uses a temporary table; it could be changed to use only arrays of pins if you want.
But if you only need to show some pins on a map, you will probably be happy with a much faster and simpler function that groups the results into an [x, y] grid. I created such a function because the k-means function was taking too long on my database (well over 400K elements). This implementation is much faster, but it does not have all the features you would expect from the k-means module. That said, this grid function https://gist.github.com/thiagomata/18ea14853998468c1a1d returns very good results when the goal is to show a large number of pins on a map.
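The linked gist is PL/pgSQL, but the same grid idea is easy to sketch on the Python side. This is a hedged illustration, not the gist itself; the cell_size parameter and the (x, y) tuple format are assumptions. Snap each point to a grid cell, then return one marker per cell with a centroid and a count:

from collections import defaultdict

def grid_cluster(points, cell_size=0.05):
    """Group (x, y) points into grid cells and return one marker per cell.

    points: iterable of (x, y) tuples (e.g. longitude/latitude).
    cell_size: cell width/height, in the same units as the coordinates (assumed value).
    """
    cells = defaultdict(list)
    for x, y in points:
        # Snap the point to the grid cell it falls into.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))

    clusters = []
    for members in cells.values():
        # Represent each cell by its centroid and the number of pins it contains.
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        clusters.append({"x": cx, "y": cy, "count": len(members)})
    return clusters

For example, grid_cluster([(10.01, 20.02), (10.02, 20.01), (30.0, 40.0)]) collapses the first two pins into one marker with count 2 and keeps the third as its own marker.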

You can only install supported extensions on Amazon RDS, and kmeans isn't one of them.
ERROR: Extension "kmeans" is not supported by Amazon RDS
DETAIL: Installing the extension "kmeans" failed, because it is not on the list of extensions supported by Amazon RDS.
HINT: Amazon RDS allows users with rds_superuser role to install supported extensions. See: SHOW rds.extensions;
alexandria_development=> SHOW rds.extensions
RDS extensions:
btree_gin,
btree_gist,
chkpass,
citext,
cube,
dblink,
dict_int,
dict_xsyn,
earthdistance,
fuzzystrmatch,
hstore,
intagg,
intarray,
isn,
ltree,
pgcrypto,
pgrowlocks,
pg_prewarm,
pg_stat_statements,
pg_trgm,
plcoffee,
plls,
plperl,
plpgsql,
pltcl,
plv8,
postgis,
postgis_tiger_geocoder,
postgis_topology,
postgres_fdw,
sslinfo,
tablefunc,
test_parser,
tsearch2,
unaccent,
uuid-ossp

Related

How to copy an AWS RDS database, NOT the entire instance

I am somewhat new to AWS RDS and their terminology. What they call a "database" seems to me to be an SQL Server instance. I have a database (as defined by SSMS -- with tables, data, stored procedures, etc.) on RDS named "prod", and I want to duplicate it, with all its content, as a database named "test" for testing purposes, leaving "prod" as-is.
All the instructions I've found through many, many searches seem to relate to duplicating the entire instance. Can someone help me with instructions on how to create a duplicate of just the (SSMS-term) database?
Thanks in advance for any help!
P.S. What does AWS/RDS call the object that is equivalent to an SSMS database?
I've found multiple posts here about duplicating an entire instance. It could be that I don't fully understand the terminology because I know this must be a common task but I am not understanding how to do it.
This is a production environment so I am proceeding very cautiously. I do have nightly snapshots made so I know I could recover but would rather do it right the first time.
I usually use a command like this to backup a single database to s3:
exec msdb.dbo.rds_backup_database @source_db_name='<mydatabasename>', @S3_arn_to_backup_to='arn:aws:s3:::<mys3objectname>', @type='FULL'
There is a bit of one-time configuration you need to do first (see the link below), and then it's as simple as executing commands from SSMS to back up a database to S3 and then restore it from S3 - maybe not exactly what you are looking for, but it works great.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/SQLServer.Procedural.Importing.html
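If you'd rather drive that same flow from code instead of SSMS, a rough Python sketch using pyodbc might look like the following; the connection string, database names and S3 ARNs are placeholders, and rds_restore_database / rds_task_status are the companion procedures described in the linked AWS docs.

import pyodbc

# Placeholders: adjust the connection string, database names and S3 ARNs to your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myinstance.xxxxxxxx.us-east-1.rds.amazonaws.com,1433;"
    "DATABASE=msdb;UID=admin;PWD=secret",
    autocommit=True,
)
cur = conn.cursor()

# Back up "prod" to S3 (the same procedure as the SSMS command above).
cur.execute(
    "exec msdb.dbo.rds_backup_database "
    "@source_db_name='prod', "
    "@s3_arn_to_backup_to='arn:aws:s3:::my-bucket/prod.bak', "
    "@type='FULL'"
)

# The backup runs asynchronously; check progress with rds_task_status and wait
# until it reports success before restoring.
cur.execute("exec msdb.dbo.rds_task_status @db_name='prod'")
print(cur.fetchall())

# Once the backup has finished, restore it under a new name to get the "test" copy.
cur.execute(
    "exec msdb.dbo.rds_restore_database "
    "@restore_db_name='test', "
    "@s3_arn_to_restore_from='arn:aws:s3:::my-bucket/prod.bak'"
)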
What you are looking for can't really be done via any RDS commands/tools/interface. RDS only concerns itself with the database server itself, and isn't really even aware of the different databases, schemas, tables, etc. you may have created on the server.
You will need to use the tools for the DBMS you are using, in this case it sounds like you are using Microsoft SQL Server, so you will need to use MS SQL Server tools (perhaps running on an EC2 instance) to dump a single database to a file, and then load it into another database.
P.S. What does AWS/RDS call the object that is equivalent to an SSMS database?
An RDS database is a "database server"; you might also see it called a "database instance". AWS/RDS calls the object equivalent to an SSMS database simply a database. The terminology is confusing because both the physical server/computer running the database software and the logical grouping of tables/logs/data inside the software are commonly referred to as a "database".

Cannot set consistency level when querying Amazon Keyspaces service from DataGrip

I'm trying to perform inserts on Amazon's managed Cassandra service from IntelliJ's DataGrip IDE; however, I receive the following error:
Consistency level LOCAL_ONE is not supported for this operation. Supported consistency levels are: LOCAL_QUORUM
This is due to Amazon using the LOCAL_QUORUM consistency level for writes.
I tried to set the consistency level with CONSISTENCY LOCAL_QUORUM; before running other queries but it returned the following error:
line 1:0 no viable alternative at input 'CONSISTENCY' ([CONSISTENCY])
From my understanding, this is because CONSISTENCY is a cqlsh command and not a CQL command.
I cannot find any way to set the consistency level from within DataGrip so that I can run scripts and populate my tables.
Ultimately, I will use plain cqlsh if I cannot find a solution, but I was hoping to use DataGrip as I find it useful and have many databases already configured. I hope someone can shed some light on the issue; this seems like it should be a basic feature.
I am Max from the DataGrip team, and the correct answer is:
It could be a JDBC driver issue, and the desired method hasn't been implemented yet, since you're trying to run a pure cqlsh command as SQL. Follow the issue DBE-10638.
It's a DataGrip bug, see https://youtrack.jetbrains.com/issue/DBE-10182 :
Cassandra 'CONSISTENCY' command is not supported
So upvote that bug, and maybe add a comment that it makes DataGrip useless for writing to Amazon Managed Cassandra
Amazon Keyspaces (Apache Cassandra)
I'm using DataGrip version 2020.1.3 (paid license) and ran into the same problem: I could not change the consistency level from ONE to LOCAL_QUORUM.
I have already opened an issue and am waiting for the investigation.
After trying many tools, I found that DBeaver works; the consistency level can be selected in the connection configuration GUI.
https://dbeaver.com/download
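If the goal is simply to get the inserts running while the IDE issue is open, one option outside DataGrip is the Python cassandra-driver, which lets you set the consistency level on an execution profile. This is a hedged sketch: the endpoint/port follow the Keyspaces documentation, while the certificate path, keyspace/table names and service-specific credentials are placeholders.

import ssl
from cassandra import ConsistencyLevel
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Keyspaces requires TLS; the Starfield root certificate path is a placeholder.
ssl_context = ssl.create_default_context(cafile="sf-class2-root.crt")

# Service-specific credentials for your IAM user (placeholders).
auth = PlainTextAuthProvider(username="my-user-at-123456789012", password="my-password")

# The key part: Keyspaces only accepts LOCAL_QUORUM for writes.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)

cluster = Cluster(
    ["cassandra.us-east-1.amazonaws.com"],  # regional Keyspaces endpoint
    port=9142,
    ssl_context=ssl_context,
    auth_provider=auth,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
session.execute("INSERT INTO my_keyspace.my_table (id, name) VALUES (1, 'example')")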

Saving images on files or on database when using docker

I'm using Docker for a project; the main reason for using it is to keep the application available even if one of the nodes (it's a 6-node cluster with Docker Swarm) goes down.
The application is basically a Django app that can save images from users, among other models. I'm currently saving the images as files, but since I have to specify a volume locally on a single machine, I would like to know whether it would be better to save the images in the database cluster, so they would stay available even if the whole node goes down. Or is there another way?
Edit: note that the cluster runs locally and doesn't have internet access.
The two options are to perform the file sharing via the database or via the file system.
For file system sharing, you can use something like GlusterFS, so for each container it seems like they are mounting a host-local volume, but it's actually shared via GlusterFS between the hosts.
To my mind, if it's your own application (i.e. you can modify it at will), saving the files in the database would be the easier approach for most developers.
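As a rough illustration of the database approach (a sketch only; the model, field and helper names are made up for this example), the image bytes can be stored in a regular Django model so every node in the swarm sees the same data through the database:

from django.db import models

class StoredImage(models.Model):
    """Keeps the image bytes in the database instead of on a node-local volume."""
    name = models.CharField(max_length=255)
    content_type = models.CharField(max_length=100)
    data = models.BinaryField()  # raw image bytes

def save_uploaded_image(uploaded_file):
    """Store an uploaded file (e.g. request.FILES['image']) in the database."""
    return StoredImage.objects.create(
        name=uploaded_file.name,
        content_type=uploaded_file.content_type,
        data=uploaded_file.read(),
    )

Serving the images back then means a small view that returns the bytes with the stored content type; the trade-off is that large binary rows grow the database quickly, which is why the file-system/GlusterFS option above is often preferred for big volumes of images.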
The best solution is often to go for a hosted option (such as MongoDB Atlas). Making a database resilient and highly available is really hard, and unless you are an expert on docker and mongo I would strongly recommend you to go for a hosted option.

Machine Learning (NLP) on AWS. Cloud9? SageMaker? EC2-AMI?

I have finally arrived in the cloud to put my NLP work to the next level, but I am a bit overwhelmed with all the possibilities I have. So I am coming to you for advice.
Currently I see three possibilities:
SageMaker
  + Jupyter Notebooks are great
  + It's quick and simple
  + Saves a lot of time spent on managing everything; you can very easily get the model into production
  - Costs more
  - No version control
Cloud9
EC2(-AMI)
Well, that's where I am for now. I really like SageMaker, although I don't like the lack of version control (at least I haven't found anything so far).
Cloud9 seems to be just an IDE in front of an EC2 instance. I haven't found any comparisons of Cloud9 vs. SageMaker for machine learning, maybe because Cloud9 is not advertised as an ML solution. But it seems to be an option.
What is your take on this? What have I missed? What would you advise me to go for? What is your workflow and why?
I am looking for an easy work environment where I can quickly test my models. And it won't be only me working on it; it's a team effort.
Since you are working as a team, I would recommend using SageMaker with custom Docker images. That way you have complete freedom over your algorithm. The Docker images are stored in ECR, where you can upload many versions of the same image and tag them to keep track of the different versions (which you build from a Git repo).
SageMaker also passes the execution role into the Docker image, so you still have full access to other AWS resources (if the execution role has the right permissions).
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
In my opinion this is a good example to start with, because it shows how SageMaker interacts with your image.
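The linked notebook walks through building and pushing the image; once it is in ECR, launching a training job against it is only a few lines with the SageMaker Python SDK. This is a hedged sketch using the v2 argument names; the image URI, role ARN, S3 paths and instance types are placeholders.

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Placeholders: your versioned ECR image, execution role and S3 locations.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:1.0",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
    sagemaker_session=session,
)

# Train against data already uploaded to S3; SageMaker runs your container for you.
estimator.fit({"train": "s3://my-bucket/train"})

# Optionally deploy the trained model behind a real-time inference endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")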
Some notes on other solutions:
The problem with every other solution you posted is that you build and execute on the same machine. Sure, you can do this, but keep in mind that GPU instances are expensive, so you might want to switch to the cloud only when the code is ready to run.
Some other notes:
Jupyter Notebooks in general are not made for collaborative programming. I think they want to change this with JupyterLab, but that is still in development and SageMaker only uses the classic notebook at the moment.
EC2 is cheaper than SageMaker, but you have to do more work, especially if you want to run your models as Docker images. Also, with SageMaker you can easily build an endpoint for model inference, which would be much more complex to set up on EC2.
Cloud9: I have never used this service, but at first glance it seems good to develop on. The question remains whether you want to do this on a GPU machine; because it runs on an EC2 instance, you have the same advantages/disadvantages.
One thing I'd like to call out first is that the SageMaker notebook is not the only IDE environment from which you can interact with other components of SageMaker, such as training and hosting. In fact, you can make API calls to SageMaker training/hosting from Cloud9, from any IDE you've installed on EC2, or even from your laptop, as long as you have the AWS SDK or the SageMaker Python SDK installed.
Regarding the choice of the IDE, it's really up to your particular needs. SageMaker notebook is Jupyter based (now also supports JupyterLab beta), ML focused, and fully managed. Hundreds of Python packages that are commonly used in ML, as well as Tensorflow, Keras, MxNet, SageMaker Python SDK, etc., are preinstalled and automatically maintained for you. It also integrates more closely with other components of SageMaker as one can imagine.
Cloud9 is a managed IDE too, but it is general purpose rather than ML specific. If you want to use Jupyter on Cloud9, it requires extra work on your side, and it does not preinstall and maintain the versions of common ML/DL packages the way the SageMaker notebook does.
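To make that first point concrete, here is a tiny sketch (region and credentials are assumed to be configured through the usual AWS mechanisms) showing the same SageMaker APIs being called from any machine with boto3 installed:

import boto3

# Works from Cloud9, an EC2 box or a laptop, as long as AWS credentials are configured.
sm = boto3.client("sagemaker", region_name="us-east-1")

# List recent training jobs - the same control-plane API the SageMaker notebook uses.
for job in sm.list_training_jobs(MaxResults=5)["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])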

jar containing org.apache.hadoop.hive.dynamodb

I was trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online of how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse engineer the process.
Unfortunately, I couldn't find that jar either :(.
Could someone answer the following for me (listed in order of priority)?
A Java example that loads a DynamoDB table into HDFS (that can be passed to a mapper as a table input format).
The jar containing org.apache.hadoop.hive.dynamodb.
Thanks!
It's in hive-bigbird-handler.jar. Unfortunately, AWS doesn't provide any source, or even Javadoc, for it. But you can find the jar on any node of an EMR cluster:
/home/hadoop/.versions/hive-0.8.1/auxlib/hive-bigbird-handler-0.8.1.jar
You might want to check out this article:
Amazon DynamoDB Part III: MapReducin’ Logs
Unfortunately, Amazon haven’t released the sources for hive-bigbird-handler.jar, which is a shame considering its usefulness. Of particular note, it seems it also includes built-in support for Hadoop’s Input and Output formats, so one can write straight MapReduce jobs, writing directly into DynamoDB.
Tip: search for hive-bigbird-handler.jar to get to the interesting parts... ;-)
1- I am not aware of any such example, but you might find this library useful. It provides InputFormats, OutputFormats, and Writable classes for reading and writing data to Amazon DynamoDB tables.
2- I don't think they have made it available publicly.