Mapping geo_point data when importing data to AWS Elasticsearch

I have a set of data in DynamoDB that I am importing into AWS Elasticsearch using this tutorial: https://medium.com/@vladyslavhoncharenko/how-to-index-new-and-existing-amazon-dynamodb-content-with-amazon-elasticsearch-service-30c1bbc91365
I need to change the mapping of one field in that data to geo_point.
I tried creating the mapping before importing the data:
PUT user
{
"mappings": {
"_doc": {
"properties": {
"grower_location": {
"type": "geo_point"
}
}
}
}
}
When I do this, the data is not imported, although I don't receive an error.
If I import the data first, I can search it, but the grower_location: { lat: #, lon: # } object is mapped as an integer and I cannot run geo_distance queries.
How can I get this field mapped as geo_point?

I was able to fix this by importing the data once with the Python script from the tutorial, then running:
GET user/_mappings
I copied the auto-generated mapping to the clipboard, then deleted the index:
DELETE user/
Then I pasted the copied mapping into a new index definition and changed the type of the geo_point field:
PUT user/
{
"mappings": {
"user_type": {
"properties": {
...
"grower_location": {
"type": "geo_point"
}
...
}
}
}
}
Finally, I re-imported the data using the Python script from the tutorial.
Everything is imported and ready to be searched using geo_point!
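
The copy-and-edit step described above can also be scripted. The following is a minimal sketch, not the tutorial's code: it assumes the response of GET user/_mappings has the shape {index: {"mappings": {type: {"properties": {...}}}}}, and the helper name with_geo_point and the sample fields are hypothetical. It only builds the body for the PUT request; sending it to the cluster is left out.

```python
import copy


def with_geo_point(mapping_response, index, field):
    """Return a PUT body that reuses the auto-generated mapping but maps `field` as geo_point.

    `mapping_response` is assumed to look like the output of GET <index>/_mappings:
    {index: {"mappings": {type_name: {"properties": {...}}}}}.
    """
    # Work on a copy so the original response is left untouched.
    mappings = copy.deepcopy(mapping_response[index]["mappings"])
    for type_mapping in mappings.values():
        properties = type_mapping.get("properties", {})
        if field in properties:
            # Replace the inferred definition with an explicit geo_point type.
            properties[field] = {"type": "geo_point"}
    return {"mappings": mappings}


# Hypothetical auto-generated mapping, for illustration only.
auto_generated = {
    "user": {
        "mappings": {
            "user_type": {
                "properties": {
                    "name": {"type": "text"},
                    "grower_location": {
                        "properties": {
                            "lat": {"type": "long"},
                            "lon": {"type": "long"},
                        }
                    },
                }
            }
        }
    }
}

body = with_geo_point(auto_generated, "user", "grower_location")
```

The resulting body keeps the type name and all other fields from the auto-generated mapping, so only the geo_point field differs when the index is recreated.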

Related

Removed link from json file while generating json from Informatica

I am generating a JSON file from Informatica. What I have done so far:
configured the basic entity in Informatica MDM Hub
created a base object in the provisioning tool
Then I accessed the URL predefined by Informatica, which looks like
"server:port"/cmx/cs/"databaseid"/"baseobjectname"/id.json.
By default, Informatica places a link attribute in the JSON for parent/child/self references, if any.
Is there any way to remove the link attribute?
I am getting the output below:
{
"link": [
{
"href": "serveraddress//1.json?depth=2",
"rel": "children"
},
{
"href": "serveraddress//1.json",
"rel": "self"
}
],
"rowidObject": "2"
}
Expected:
{
"rowidObject": "2"
}

I was finally able to solve this by adding suppressLinks=true to the URL call:
http://:/cmx/cs//?q=''&depth=3&suppressLinks=true
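
If the request URL is built in code, the parameter can be appended without disturbing the existing query string. This is only a sketch using Python's standard library; the helper name add_suppress_links and the host, database, and entity names in the example URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_suppress_links(url):
    """Return `url` with suppressLinks=true set, keeping any other query parameters."""
    parts = urlsplit(url)
    # Drop an existing suppressLinks value so the parameter is not duplicated.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "suppressLinks"
    ]
    params.append(("suppressLinks", "true"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical URL, for illustration only.
example = add_suppress_links(
    "http://mdm.example.invalid:8080/cmx/cs/db/Person/1.json?depth=3"
)
```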

Confused about GCP Dataproc softwareConfig values

I'm attempting to modify Airflow's Dataproc operator to add Anaconda and Jupyter to the cluster.
I'm overriding DataprocClusterCreateOperator to include optionalComponents.
After reading the Google docs, I understand that I need to pass an enum.
Every time I run this task I get either invalid value errors or TypeError: Object of type 'EnumMeta' is not JSON serializable.
I'd really appreciate it if someone could tell me how to pass this field correctly.
cluster_data = {
'projectId': self.project_id,
'clusterName': self.cluster_name,
'config': {
'gceClusterConfig': {
},
'masterConfig': {
'numInstances': self.num_masters,
'machineTypeUri': master_type_uri,
'diskConfig': {
'bootDiskType': self.master_disk_type,
'bootDiskSizeGb': self.master_disk_size
}
},
'workerConfig': {
'numInstances': self.num_workers,
'machineTypeUri': worker_type_uri,
'diskConfig': {
'bootDiskType': self.worker_disk_type,
'bootDiskSizeGb': self.worker_disk_size
}
},
'secondaryWorkerConfig': {},
'softwareConfig': {
# I've tried the following:
'optionalComponents': 'ANACONDA,JUPYTER'
#from google.cloud.dataproc_v1 import enums
'optionalComponents': [enums.Component.ANACONDA.value]
},
}
}

Use a JSON list of strings there: ['ANACONDA', 'JUPYTER'].
As general guidance for figuring out how to structure such fields, you can create a cluster with gcloud and then run:
gcloud dataproc clusters describe my-cluster --format json
The --format json flag is the key: the resulting output should be directly copy-pastable.
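
To illustrate why a list of plain strings works where an enum does not, here is a small sketch. The Component class below is a hypothetical stand-in defined locally for the example (it is not the Google client library's enum, and its values are arbitrary); the point is only that member names are strings, which json can serialize.

```python
import json
from enum import Enum


class Component(Enum):
    """Hypothetical stand-in enum; the values are arbitrary placeholders."""

    ANACONDA = 1
    JUPYTER = 2


def software_config(components):
    """Build a softwareConfig fragment whose optionalComponents are plain strings."""
    # Use each member's name so the payload contains JSON-serializable strings.
    return {"optionalComponents": [component.name for component in components]}


config = software_config([Component.ANACONDA, Component.JUPYTER])
payload = json.dumps({"softwareConfig": config})
```

Passing the enum class itself (rather than member names) is what produces the "not JSON serializable" TypeError, because json has no encoding for it.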

Google Cloud Vision API only returns "name"

I am trying to use Google Cloud Vision API.
I am using the REST API in this link.
POST https://vision.googleapis.com/v1/files:asyncBatchAnnotate
My request is
{
"requests": [
{
"inputConfig": {
"gcsSource": {
"uri": "gs://redaction-vision/pdf_page1_employment_request.pdf"
},
"mimeType": "application/pdf"
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
],
"outputConfig": {
"gcsDestination": {
"uri": "gs://redaction-vision"
}
}
}
]
}
But the response is always only "name" like below:
{
"name": "operations/a7e4e40d1e1ac4c5"
}
My "gs" location is valid: when I put a wrong path in "gcsSource", I get a 404 Not Found error.
Why does the response contain only the operation name?

This is expected: the API does not send the output in the HTTP response. To see what the API did, go to your destination bucket and look for a file named like "xxxxxxxxoutput-1-to-1.json". You also need to specify the name of the object in your gcsDestination section, for example: gs://redaction-vision/test.

Since asyncBatchAnnotate is an asynchronous operation, it doesn't return the result; instead, it returns the name of the operation. You can use that unique name to call GetOperation to check the status of the operation.
Note that there can be more than one output file for your PDF if it has more pages than batchSize, and the output JSON file names change depending on the number of pages, so it isn't safe to always append "output-1-to-1.json".
Make sure that the URI prefix you put in the output config is unique, because you have to do a wildcard search in GCS on the prefix you provide to find all of the JSON files that were created.
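
Once the object names under the destination bucket have been listed (by whatever client you use; that part is not shown), picking out the result files can look like the sketch below. It assumes the names follow the "<prefix>output-<first>-to-<last>.json" pattern seen in the example file name above; the helper name output_shards and the sample object names are hypothetical.

```python
import re

# Matches the page-range suffix of an output file, e.g. "output-1-to-2.json".
_SHARD = re.compile(r"output-(\d+)-to-(\d+)\.json$")


def output_shards(object_names, prefix):
    """Return the output files written under `prefix`, ordered by first page."""
    shards = []
    for name in object_names:
        match = _SHARD.search(name)
        if name.startswith(prefix) and match:
            shards.append((int(match.group(1)), int(match.group(2)), name))
    # Sorting the tuples orders the files by page range rather than by name.
    return [name for _, _, name in sorted(shards)]


# Hypothetical bucket listing, for illustration only.
names = [
    "testoutput-3-to-4.json",
    "pdf_page1_employment_request.pdf",
    "testoutput-1-to-2.json",
    "otheroutput-1-to-1.json",
]
shards = output_shards(names, "test")
```

Sorting numerically matters once a document has ten or more pages, since a plain string sort would place "output-11-to-12" before "output-3-to-4".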

Terraform load json object from AWS S3

I need to load data from a non-public S3 bucket. Using this JSON, I want to be able to loop over lists within Terraform.
Example:
{
"info": [
"10.0.0.0/24",
"10.1.1.0/24",
"10.2.2.0/24"
]
}
I can retrieve the JSON fine using the following:
data "aws_s3_bucket_object" "config" {
bucket = "our-bucket"
key = "global.json"
}
What I cannot do is use this as a map or list within Terraform. Any ideas?

After a good deal of trial and error I figured out a solution. Note that for this to work, the JSON source apparently needs to be simple, meaning no nested values such as lists or maps (the external data source expects a JSON object whose values are all strings).
{
"foo1": "my foo1",
"foo2": "my foo2",
"foo3": "my foo3"
}
data "aws_s3_bucket_object" "config-json" {
bucket = "my-bucket"
key = "foo.json"
}
data "external" "config-map" {
program = ["echo", "${data.aws_s3_bucket_object.config-json.body}"]
}
output "foo" {
value = ["${values(data.external.config-map.result)}"]
}
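
For JSON that does contain lists, such as the one in the question, one possible workaround is to have the external program emit each nested value as a JSON-encoded string. The sketch below shows only that conversion step in Python; the function name to_string_map is hypothetical, and the Terraform side would still have to split or decode the string, which is not shown here.

```python
import json


def to_string_map(document):
    """Convert a JSON object into a flat map whose values are all strings."""
    result = {}
    for key, value in document.items():
        # Strings pass through unchanged; anything else is stored as its JSON text.
        result[key] = value if isinstance(value, str) else json.dumps(value)
    return result


source = '{"info": ["10.0.0.0/24", "10.1.1.0/24", "10.2.2.0/24"]}'
flat = to_string_map(json.loads(source))
```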

How to upload large amounts of stopwords into AWS Elasticsearch

Is it possible to upload a stopwords.txt onto AWS Elasticsearch and specify it as a path by stop token filter?

If you're using AWS Elasticsearch, the only way to do this is through the Elasticsearch REST APIs.
To import large data sets, you can use the bulk API.

Edit: You can now upload "packages" to AWS Elasticsearch service, which lets you add custom lists of stopwords etc. See https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/custom-packages.html
Original answer: No, it isn't possible to upload a stopwords.txt file to the hosted AWS Elasticsearch service.
What you will have to do is specify the stopwords in a custom analyzer. More details on how to do that can be found in the official documentation.
The official documentation then says to "close and reopen" the index, but again, AWS Elasticsearch doesn't allow that, so you will then have to reindex.
Example:
1. Create an index with your stopwords listed inline within a custom analyzer, e.g.
PUT /my_new_index
{
"settings": {
"analysis": {
"analyzer": {
"english_analyzer": {
"type": "english",
"stopwords": ["a", "the", "they", "and"]
}
}
}
}
}
2. Reindex
POST _reindex
{
"source": {
"index": "my_index"
},
"dest": {
"index": "my_new_index"
}
}
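
With a long stopword list, the settings body for step 1 can be generated from the lines of a local stopwords.txt instead of being typed by hand. This is a minimal sketch under that assumption; the helper name analyzer_settings is hypothetical, and sending the body to the cluster is left out.

```python
import json


def analyzer_settings(stopword_lines, analyzer_name="english_analyzer"):
    """Build index settings with the stopwords listed inline in a custom analyzer."""
    # One stopword per line; blank lines and surrounding whitespace are dropped.
    words = [line.strip() for line in stopword_lines if line.strip()]
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {"type": "english", "stopwords": words}
                }
            }
        }
    }


# In practice the lines would come from open("stopwords.txt", encoding="utf-8").
body = analyzer_settings(["a\n", "the\n", "\n", "they\n", "and\n"])
request_body = json.dumps(body, indent=2)
```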

Yes, it is possible by setting stopwords_path when defining your stop token filter.
stopwords_path => A path (either relative to config location, or
absolute) to a stopwords file configuration. Each stop word should be
in its own "line" (separated by a line break). The file must be UTF-8
encoded.
Here is how I did it.
I copied the stopwords.txt file into the config folder of my Elasticsearch home path.
I created a custom token filter with the path set in stopwords_path:
PUT /testindex
{
"settings": {
"analysis": {
"filter": {
"teststopper": {
"type": "stop",
"stopwords_path": "stopwords.txt"
}
}
}
}
}
I verified that the filter was working as expected with the _analyze API:
GET testindex/_analyze
{
"tokenizer" : "standard",
"token_filters" : ["teststopper"],
"text" : "this is a text to test the stop filter",
"explain" : true,
"attributes" : ["keyword"]
}
The tokens 'a', 'an', 'the', 'to', 'is' were filtered out since I had added them to the config/stopwords.txt file.
For more info:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.2/_explain_analyze.html
