ElasticSearch Migrate Data - amazon-web-services

I want to migrate data from AWS Elasticsearch version 2.3 to 5.1 and have created a data snapshot in S3. Now how do I copy those dump files from S3 to ES 5.1?

You're in luck; according to the Elasticsearch documentation:
A snapshot of an index created in 2.x can be restored to 5.x.
Also, Elasticsearch has:
repository-s3 for S3 repository support
which should be very useful in your case, I assume.
The full guide on how to do that is there. I will quote some of it:
Before any snapshot or restore operation can be performed, a snapshot
repository should be registered in Elasticsearch. The repository
settings are repository-type specific. See below for details.
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    ... repository specific settings ...
  }
}
After several other steps, like verifying the snapshot, you can restore it:
A snapshot can be restored using the following command:
POST /_snapshot/my_backup/snapshot_1/_restore
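Putting the two steps together for the AWS managed service, here is a minimal Python sketch; the endpoint, bucket, IAM role ARN and snapshot name are assumptions, and requests-aws4auth is used because the managed service expects signed requests for repository registration:

# Minimal sketch (endpoint, bucket, region, role ARN and snapshot name are assumed)
# of registering an S3 snapshot repository on the 5.1 domain and restoring into it.
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"                                                # assumed
es_endpoint = "https://my-es-51-domain.us-east-1.es.amazonaws.com"  # assumed 5.1 endpoint

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, "es",
                   session_token=credentials.token)

# Register the S3 bucket that already holds the 2.3 snapshot as a repository.
repo_body = {
    "type": "s3",
    "settings": {
        "bucket": "my-snapshot-bucket",                             # assumed
        "region": region,
        "role_arn": "arn:aws:iam::123456789012:role/es-snapshots",  # assumed IAM role the domain can assume
    },
}
r = requests.put(f"{es_endpoint}/_snapshot/my_backup", auth=awsauth, json=repo_body)
r.raise_for_status()

# Restore the snapshot taken on the 2.3 cluster into the 5.1 cluster.
r = requests.post(f"{es_endpoint}/_snapshot/my_backup/snapshot_1/_restore", auth=awsauth)
r.raise_for_status()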

Related

Elasticsearch 6.3 (AWS) snapshot restore progress ERROR: "/_recovery is not allowed"

I take manual snapshots of an Elasticsearch index
These are stored in a snapshot repo on S3
I have created a new ES cluster, also version 6.3
I have connected the new cluster to the S3 snapshot repo via the Python script method mentioned in this blog post: https://medium.com/docsapp-product-and-technology/aws-elasticsearch-manual-snapshot-and-restore-on-aws-s3-7e9783cdaecb
I have confirmed that the new cluster has access to the snapshot repo via the GET /_snapshot/manual-snapshot-repo/_all?pretty command
I have initiated a snapshot restore to this new cluster via:
POST /_snapshot/manual-snapshot-repo/snapshot_name/_restore
{
  "indices": "reports",
  "ignore_unavailable": false,
  "include_global_state": false
}
It is clear that this operation has at least partially succeeded, as the cluster status has gone from "green" to "yellow" and a GET request to /_cluster/health yields information that suggests actions are occurring on an otherwise empty cluster... not to mention storage is starting to be utilized (when viewing cluster health on AWS).
I would very much like to monitor the progress of the restore operation.
Elasticsearch docs suggest using the Recovery API. Docs link: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/indices-recovery.html
It is clear from the docs that GET /_recovery?human or GET /my_index/_recovery?human should yield restore progress.
However, I encounter the following error:
"Message": "Your request: '/_recovery' is not allowed."
I get the same message when attempting the GET command in the following ways:
Via Kibana dev tools
Via the Chrome address bar (it's just a GET operation, after all)
Via Advanced REST Client (a Chrome app)
I have not been able to locate any other mention of this particular error message.
How can I utilize the GET /_recovery?human command on my Elasticsearch 6.3 clusters?
Thank you!
The Amazon managed Elasticsearch service does not have all the endpoints available.
For version 6.3 you can check this link for the available endpoints; _recovery is not on the list, which is why you get that message.
Without the _recovery endpoint you will need to rely on _cluster/health.
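As a rough substitute, you can poll _cluster/health until the initializing shards drain. A minimal sketch, with the domain endpoint assumed (add request signing if your access policy requires it):

# Minimal sketch (endpoint assumed) of polling _cluster/health to follow restore
# progress when /_recovery is not allowed on the managed domain.
import time
import requests

es_endpoint = "https://my-es-63-domain.us-east-1.es.amazonaws.com"  # assumed

while True:
    health = requests.get(f"{es_endpoint}/_cluster/health").json()
    print(health["status"],
          "initializing:", health["initializing_shards"],
          "unassigned:", health["unassigned_shards"],
          "active %:", health["active_shards_percent_as_number"])
    # Done once the cluster is green again and no shards are still initializing.
    if health["status"] == "green" and health["initializing_shards"] == 0:
        break
    time.sleep(10)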

AWS Kinesis Data Analytics application (Flink): change a property originally located in flink-conf.yaml

As the runtime for my Flink application I use managed Flink by AWS (Kinesis Data Analytics application).
I added functionality (a sink) to write processed events from the Kinesis queue to S3 in Parquet format.
Locally everything works for me, but when I try to run the application in the cloud I get the following exception:
"throwableInformation": [
"com.esotericsoftware.kryo.KryoException: Error constructing instance of class: org.apache.avro.Schema$LockableArrayList",
"Serialization trace:",
"types (org.apache.avro.Schema$UnionSchema)",
"schema (org.apache.avro.Schema$Field)",
"fieldMap (org.apache.avro.Schema$RecordSchema)",
While looking for a solution to the problem, I found that I need to change the following property (checked this on a local cluster):
classloader.resolve-order: child-first -> classloader.resolve-order: parent-first
Is it possible to change this configuration in any way when using AWS managed Flink (not EMR, but Kinesis Data Analytics applications)?
AWS Support's answer: No. This property cannot be changed.

Is there a way to get info at runtime about the Spark metrics configuration

I added a metrics.properties file to the resources directory (Maven project) with a CSV sink. Everything is fine when I run the Spark app locally - metrics appear. But when I submit the same fat jar to Amazon EMR, I do not see any attempt to put metrics into the CSV sink. So I want to check at runtime what the loaded settings for the Spark metrics subsystem are. Is there any possibility to do this?
I looked into SparkEnv.get.metricsSystem but didn't find anything.
That is basically because Spark on EMR is not picking up your custom metrics.properties file from the resources dir of the fat jar.
For EMR the preferred way to configure things is through the EMR Configurations API, in which you need to pass the classification and properties in an embedded JSON.
For the Spark metrics subsystem, here is an example that modifies a couple of metrics settings:
[
  {
    "Classification": "spark-metrics",
    "Properties": {
      "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
      "*.sink.csv.period": "1"
    }
  }
]
You can use this JSON when creating the EMR cluster using the Amazon Console or through the SDK, for example as sketched below.
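A minimal boto3 sketch; the cluster name, release label, instance types and IAM roles are assumptions, only the Configurations entry comes from the answer above:

# Minimal sketch of passing the spark-metrics classification when creating the
# cluster through the SDK. Names, release label, instances and roles are assumed.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region
emr.run_job_flow(
    Name="spark-with-csv-metrics",                  # assumed cluster name
    ReleaseLabel="emr-5.30.0",                      # assumed release
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark-metrics",
            "Properties": {
                "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
                "*.sink.csv.period": "1",
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",              # assumed default roles
    ServiceRole="EMR_DefaultRole",
)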

Connect IntelliJ to Amazon Redshift

I'm using the latest version of IntelliJ and I've just created a cluster in Amazon Redshift. How do I connect IntelliJ to Redshift so that I can query it from my favorite IDE?
1. Download a JDBC driver:
   http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver
2. In IntelliJ: View | Tool Windows | Database
3. Click on "Data Source Properties"
4. Click Add (+) and select "Database Driver"
5. Uncheck "JDBC drivers", add the JDBC driver, select a class from the dropdown, and select the PostgreSQL dialect.
6. Add a new connection, and use this data source for your connection (+ | Data Source | Redshift).
7. Set URL templates:
   jdbc:redshift://[{host::localhost}[:{port::5439}]][/{database::postgres}?][\?<&,user={user:param},password={password:param},{:identifier}={:param}>]
   jdbc:redshift://\[{host:ipv6:\:\:1}\][:{port::5439}][/{database::postgres}?][\?<&,user={user:param},password={password:param},{:identifier}={:param}>]
   jdbc:redshift:{database::postgres}[\?<&,user={user:param},password={password:param},{:identifier}={:param}>]
You can connect IntelliJ to Redshift by using the JDBC driver supplied by Amazon. In the Redshift Console, go to "Connect Client" to get the driver.
Then, in the IntelliJ Data Source window, add the JAR as a Driver file, and use the following settings:
Class: com.amazon.redshift.jdbc41.Driver
URL template: jdbc:redshift://{host}:{port}/{database}
Common Pitfalls:
If the driver file is not readable or marked as in quarantine by OS X, you won't be able to select the driver class.
For a more detailed guide, see this blog post: Connecting IntelliJ to Redshift
Note: There is no native Redshift support in IntelliJ yet. IntelliJ Issue DBE-1459
Update for 2019: I've just created a PostgreSQL connection and then filled the usual Redshift settings (don't forget port: 5439), no need to download Amazon's JDBC driver.
The only small issue is that the syntax check doesn't know Redshift specifics such as AS and some functions, but queries execute correctly.
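This works because Redshift speaks the PostgreSQL wire protocol on port 5439, so any PostgreSQL client can reach it. A minimal Python sketch, with the endpoint, database and credentials assumed:

# Minimal sketch (endpoint, database, user and password are assumed) showing that a
# plain PostgreSQL driver can connect to Redshift on port 5439.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # assumed endpoint
    port=5439,
    dbname="dev",            # assumed database
    user="awsuser",          # assumed user
    password="my-password",  # assumed password
)
with conn.cursor() as cur:
    cur.execute("SELECT current_database(), version();")
    print(cur.fetchone())
conn.close()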
Update for 2020: PyCharm (and possibly all other JetBrains IDEs) now supports connecting to Redshift through IAM AWS credentials without manual driver installation.
Here are the detailed setup instructions:
Grant a redshift:GetClusterCredentials permission to your AWS user. Either create and attach a new policy (docs) or use an existing one such as AmazonRedshiftFullAccess (not recommended: too permissive).
Create an AWS access key (access key id + secret access key pair) for your user (docs).
Create a text configuration file ~/.aws/credentials (no extension) with the following content (docs):
[default] # arbitrary profile name, will be used later
region = <your region>
aws_access_key_id = <your access key id> # created on the previous step
aws_secret_access_key = <your secret access key>
Create a new PyCharm database connection of type Amazon Redshift and set it up (docs):
Choose connection type = IAM cluster/region (right under the «General» tab of the connection settings window).
Authentication = AWS Profile
User = {your AWS login}
Profile = default or the one you have used in credentials file.
The credentials can possibly be provided through the AccessKeyID/SecretAccessKey connection settings on the «Advanced» tab, but it did not work for me (due to a NullPointerException if the Profile field is empty).
Database = {your database}; choose an existing one to avoid non-descriptive errors from the driver.
Region = {your region}
Cluster = {cluster name}, get it from Redshift AWS console.
Set up the connection:
Check necessary databases in the «Schemas» tab.
«Advanced» tab: AutoCreate = true (literal lowercase true as the setting value). This will automatically create a new database user with your AWS login.
Test connection.
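If the connection fails, the underlying GetClusterCredentials call can be tested outside the IDE. A minimal boto3 sketch, with the cluster identifier, database and DB user assumed:

# Minimal sketch (cluster id, database and DB user are assumed) of the
# GetClusterCredentials call that the IAM-based connection relies on.
import boto3

session = boto3.Session(profile_name="default")  # the profile from ~/.aws/credentials
redshift = session.client("redshift")

creds = redshift.get_cluster_credentials(
    DbUser="my_aws_login",           # assumed; with AutoCreate, Redshift creates this user
    DbName="my_database",            # assumed existing database
    ClusterIdentifier="my-cluster",  # cluster name from the Redshift console
    AutoCreate=True,
)
print(creds["DbUser"], creds["Expiration"])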

CKAN: automatically delete datastore tables when a resource is removed

I have a ckan instance configured with the filestore, datastore and datapusher plugins enabled.
When I create a new resource, the datapusher plugin correctly adds a new table to the datastore database and populates it with the data.
If I update the resource, a new datapusher task is executed and everything updates correctly. (On another CKAN instance with a resource linked to it, I have to run the task manually, but everything works OK.)
The problem comes if I delete the resource. The datastore tables are still available, and even the link to the file is still active.
Is there some way to configure it to automatically remove every trace of the resource? I mean, remove the files from the filestore, the tables from the datastore, the API, the links, etc.
I partially confirmed this behaviour with http://demo.ckan.org, which is currently ckan_version: "2.4.1"
Create a resource
Query resource via data pusher
Delete resource
Query resource via the datastore_search API -> still works, can query.
Attempt to access resource file -> 404 - not found.
Will file as bug.
Perhaps use this to delete? http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_delete
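A minimal sketch of calling that action over the API; the CKAN URL, API key and resource UUID are assumptions:

# Minimal sketch (CKAN URL, API key and resource id are assumed) of removing the
# DataStore table behind a resource via the datastore_delete action.
import requests

CKAN_URL = "https://my-ckan-instance"                  # assumed
API_KEY = "my-api-key"                                 # assumed: a key with edit rights
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"   # assumed resource UUID

resp = requests.post(
    f"{CKAN_URL}/api/3/action/datastore_delete",
    headers={"Authorization": API_KEY},
    json={"resource_id": RESOURCE_ID, "force": True},  # force=True allows deleting a read-only resource's table
)
resp.raise_for_status()
print(resp.json()["success"])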
This is possible through CLI:
sudo -u postgres psql datastore_default
(assumes the datastore was installed from a package using these Datastore Extension settings, the database name is datastore_default, and postgres is the superuser).
THEN (OPTIONAL, TO FIND ALL RESOURCE UUIDs):
\dt to list all tables
THEN:
DROP TABLE "{RESOURCE ID}";
(Replace {RESOURCE ID} with resource UUID)