How to upload large amounts of stopwords into AWS Elasticsearch

Is it possible to upload a stopwords.txt file to AWS Elasticsearch and reference its path in a stop token filter?

If you're using AWS Elasticsearch, the only option is to go through the Elasticsearch REST APIs.
To import large data sets, you can use the bulk API.
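For example, a minimal _bulk request pairs an action line with a source line (the index name and documents here are hypothetical):
POST _bulk
{ "index": { "_index": "my_index", "_id": "1" } }
{ "title": "first document" }
{ "index": { "_index": "my_index", "_id": "2" } }
{ "title": "second document" }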

Edit: You can now upload "packages" to the AWS Elasticsearch service, which lets you add custom lists of stopwords etc. See https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/custom-packages.html
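With packages, the associated file is referenced by its package ID rather than a file path; a sketch based on that page (the index name and F-prefixed ID below are placeholders for your own):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop_filter": {
          "type": "stop",
          "stopwords_path": "analyzers/F000000000"
        }
      }
    }
  }
}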
No, it isn't possible to upload a stopwords.txt file to the hosted AWS Elasticsearch service.
What you will have to do is specify the stopwords in a custom analyzer. More details on how to do that can be found in the official documentation.
The official documentation then says to "close and reopen" the index, but again, AWS Elasticsearch doesn't allow that, so you will then have to reindex.
Example:
1. Create an index with your stopwords listed inline within a custom analyzer, e.g.
PUT /my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "english",
          "stopwords": ["a", "the", "they", "and"]
        }
      }
    }
  }
}
2. Reindex
POST _reindex
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_new_index"
  }
}

Yes, it is possible, by setting stopwords_path when defining your stop token filter.
stopwords_path => A path (either relative to the config location, or absolute) to a stopwords file configuration. Each stop word should be on its own line (separated by a line break). The file must be UTF-8 encoded.
Here is how I did it.
Copied the stopwords.txt file into the config folder of my Elasticsearch home path.
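The file itself is just one stop word per line; based on the tokens filtered in the test below, mine contained:
a
an
the
is
to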
Created a custom token filter with stopwords_path pointing at the file:
PUT /testindex
{
  "settings": {
    "analysis": {
      "filter": {
        "teststopper": {
          "type": "stop",
          "stopwords_path": "stopwords.txt"
        }
      }
    }
  }
}
Verified that the filter was working as expected with the _analyze API.
GET testindex/_analyze
{
  "tokenizer": "standard",
  "token_filters": ["teststopper"],
  "text": "this is a text to test the stop filter",
  "explain": true,
  "attributes": ["keyword"]
}
The tokens 'a', 'an', 'the', 'to', 'is' were filtered out since I had added them to the config/stopwords.txt file.
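For reference, the relevant part of the explain response looked roughly like this (an abridged sketch; other fields omitted):
{
  "detail": {
    ...
    "tokenfilters": [
      {
        "name": "teststopper",
        "tokens": [
          { "token": "this" },
          { "token": "text" },
          { "token": "test" },
          { "token": "stop" },
          { "token": "filter" }
        ]
      }
    ]
  }
}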
For more info:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.2/_explain_analyze.html

Related

GCP recommendation data format for catalog

I am currently working on Recommendations AI. Since I am new to GCP recommendations, I have been struggling with the data format for the catalog. I read the documentation, and it says each product item's JSON should be on a single line.
I understand this, but it would be really great to see what the JSON format looks like in practice, because the one in the documentation is ambiguous to me, and I am trying to use the console to import data.
I tried to import data that looks like the example below, but I got an "invalid JSON format" error 100 times, with lots of reasons such as "unexpected token" and "something should be there", and so on.
[
  {
    "id": "1",
    "title": "Toy Story (1995)",
    "categories": [
      "Animation",
      "Children's",
      "Comedy"
    ]
  },
  {
    "id": "2",
    "title": "Jumanji (1995)",
    "categories": [
      "Adventure",
      "Children's",
      "Fantasy"
    ]
  },
  ...
]
Maybe it was because each item was not on a single line, but I am also wondering whether the above is enough for importing. I am not sure whether the data should be wrapped in another property, like:
{
  "inputConfig": {
    "productInlineSource": {
      "products": [
        {
          "id": "1",
          "title": "Toy Story (1995)",
          "categories": [
            "Animation",
            "Children's",
            "Comedy"
          ]
        },
        {
          "id": "2",
          "title": "Jumanji (1995)",
          "categories": [
            "Adventure",
            "Children's",
            "Fantasy"
          ]
        }
      ]
    }
  }
}
I can see the above in the documentation, but it says it is for importing inline, which uses a POST request. It does not mention anything about importing with the console. I guess the format is also used for the console, but I am not 100% sure; that is why I am asking.
Is there anyone who can show me the entire data format for importing data using the console?
Problem Solved
For those who might have the same question: the exact data format to import using the GCP console looks like:
{"id":"1","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
{"id":"2","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
No square brackets wrapping all the items.
No commas between items.
Each item on its own single line.
Posting this Community Wiki for better visibility.
The OP edited the question and added the solution:
The exact data format to import using the GCP console looks like:
{"id":"1","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
{"id":"2","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
No square brackets wrapping all the items.
No commas between items.
Each item on its own single line.
However, I'd like to elaborate a bit.
There are a few ways of importing catalog information:
Importing catalog data from Merchant Center
Importing catalog data from BigQuery
Importing catalog data from Cloud Storage
I guess this is what the OP used, as I was able to import a catalog using the UI and GCS with the JSON file below.
{
  "inputConfig": {
    "catalogInlineSource": {
      "catalogItems": [
        {"id":"111","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
        {"id":"222","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
        {"id":"333","title":"Test Movie (2020)","categories":["Adventure","Children's","Fantasy"]}
      ]
    }
  }
}
Importing catalog data inline
At the bottom of the Importing catalog information documentation you can find this note:
The line breaks are for readability; you should provide an entire catalog item on a single line. Each catalog item should be on its own line.
It means you should use something similar to NDJSON, a convenient format for storing or streaming structured data that may be processed one record at a time.
If you would like to try the inline method, you should use this format; each item is logically a single line, shown here with breaks for readability.
data.json file
{
  "inputConfig": {
    "catalogInlineSource": {
      "catalogItems": [
        {
          "id": "1212",
          "category_hierarchies": [ { "categories": [ "Animation", "Children's" ] } ],
          "title": "Toy Story (1995)"
        },
        {
          "id": "5858",
          "category_hierarchies": [ { "categories": [ "Adventure", "Fantasy" ] } ],
          "title": "Jumanji (1995)"
        },
        {
          "id": "321123",
          "category_hierarchies": [ { "categories": [ "Comedy", "Adventure" ] } ],
          "title": "The Lord of the Rings: The Fellowship of the Ring (2001)"
        }
      ]
    }
  }
}
Command
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @./data.json \
  "https://recommendationengine.googleapis.com/v1beta1/projects/[your-project]/locations/global/catalogs/default_catalog/catalogItems:import"
{
  "name": "import-catalog-default_catalog-1179023525XX37366024",
  "done": true
}
Please keep in mind that the above method requires service-account authentication, otherwise you will get a PERMISSION_DENIED error:
"message" : "Your application has authenticated using end user credentials from the Google Cloud SDK or Google Cloud Shell which are not supported by the translate.googleapis.com. We recommend that most server applications use service accounts instead. For more information about service accounts and how to use them in your application, see https://cloud.google.com/docs/authentication/.",
"status" : "PERMISSION_DENIED"

How to automate the creation of Elasticsearch index patterns for all days?

I am using a CloudWatch subscription filter which automatically sends logs to AWS Elasticsearch, and then I use Kibana from there. The issue is that every day CloudWatch creates a new index, so I have to manually create a new index pattern in Kibana each day. Accordingly, I would also have to create new monitors and alerts in Kibana each day. I have to automate this somehow. Also, if there is a better option to go forward with, that would be great. I know Datadog is one good option.
A typical workflow looks like this (there are other methods):
Choose a pattern when creating an index, like staff-202001, staff-202002, etc.
Add each index to an alias, like staff.
This can be achieved in multiple ways; the easiest is to create a template with an index pattern, alias, and mapping.
Example: any new index whose name matches the pattern staff-* will be assigned the given mapping and attached to the alias staff, so we can query staff instead of individual indices and set up alerts against it, as in the sketch below.
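Manually attaching an existing index to the alias would look like this (a sketch using the staff naming above):
POST _aliases
{
  "actions": [
    { "add": { "index": "staff-202001", "alias": "staff" } }
  ]
}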
For the CloudWatch logs in this question, we can use cwl--aws-containerinsights-eks-cluster-for-test-host as the alias to run queries:
POST _template/cwl--aws-containerinsights-eks-cluster-for-test-host
{
  "index_patterns": [
    "cwl--aws-containerinsights-eks-cluster-for-test-host-*"
  ],
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "firstName": {
        "type": "text"
      },
      "lastName": {
        "type": "text"
      }
    }
  },
  "aliases": {
    "cwl--aws-containerinsights-eks-cluster-for-test-host": {}
  }
}
Note: If unsure of the mapping, we can remove the mappings section.
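Once daily indices match the template, queries and Kibana objects can target the alias instead of each day's index; a minimal sketch:
GET cwl--aws-containerinsights-eks-cluster-for-test-host/_search
{
  "query": {
    "match_all": {}
  }
}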

Google Cloud Vision API only returns "name"

I am trying to use the Google Cloud Vision API.
I am using the REST API in this link.
POST https://vision.googleapis.com/v1/files:asyncBatchAnnotate
My request is:
{
  "requests": [
    {
      "inputConfig": {
        "gcsSource": {
          "uri": "gs://redaction-vision/pdf_page1_employment_request.pdf"
        },
        "mimeType": "application/pdf"
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ],
      "outputConfig": {
        "gcsDestination": {
          "uri": "gs://redaction-vision"
        }
      }
    }
  ]
}
But the response always contains only "name", like below:
{
  "name": "operations/a7e4e40d1e1ac4c5"
}
My "gs" location is valid.
When I write the wrong path in "gcsSource", 404 not found error is coming.
Who knows why my response is weird?
This is expected; it will not send you the output as an HTTP response. To see what the API did, you need to go to your destination bucket and check for a file named "xxxxxxxxoutput-1-to-1.json". Also, you need to specify the name of the object in your gcsDestination section, for example: gs://redaction-vision/test.
Since asyncBatchAnnotate is an asynchronous operation, it won't return the result directly; it instead returns the name of the operation. You can use that unique name to call GetOperation to check the status of the operation.
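Polling the operation is a plain GET on the returned name; a sketch using the operation name from the response above:
curl -X GET \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  "https://vision.googleapis.com/v1/operations/a7e4e40d1e1ac4c5"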
Note that there could be more than one output file for your PDF if it has more pages than batchSize, and the output JSON file names change depending on the number of pages, so it isn't safe to always append "output-1-to-1.json".
Make sure that the uri prefix you put in the output config is unique, because you have to do a wildcard search in GCS on the prefix you provide to get all of the JSON files that were created.
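For example, with a hypothetical prefix of gs://redaction-vision/test, a wildcard listing would be:
gsutil ls "gs://redaction-vision/test*"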

Mapping geo_point data when importing data to AWS Elasticsearch

I have a set of data inside DynamoDB that I am importing to AWS Elasticsearch using this tutorial: https://medium.com/@vladyslavhoncharenko/how-to-index-new-and-existing-amazon-dynamodb-content-with-amazon-elasticsearch-service-30c1bbc91365
I need to change the mapping of a part of that data to geo_point.
I have tried creating the mapping before importing the data with:
PUT user
{
  "mappings": {
    "_doc": {
      "properties": {
        "grower_location": {
          "type": "geo_point"
        }
      }
    }
  }
}
When I do this, the data doesn't import, although I don't receive an error.
If I import the data first, I am able to search it, although the grower_location: { lat: #, lon: # } object is mapped as an integer and I am unable to run geo_distance queries.
Please help.
I was able to fix this by importing the data once with the Python script in the tutorial, then running:
GET user/_mappings
Copying the auto-generated mappings to the clipboard, then:
DELETE user/
Then pasting the copied mappings into a new mapping, changing the type of the geo_point field:
PUT user/
{
  "mappings": {
    "user_type": {
      "properties": {
        ...
        "grower_location": {
          "type": "geo_point"
        }
        ...
      }
    }
  }
}
Then re-importing the data using the Python script in the tutorial.
Everything is imported and ready to be searched using geo_point!
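A geo_distance query against the re-mapped field then looks something like this (a sketch with made-up coordinates and distance):
GET user/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "50km",
          "grower_location": {
            "lat": 40.7,
            "lon": -74.0
          }
        }
      }
    }
  }
}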

How to use Google Cloud Vision to extract text in multiple languages in Android Studio

I am trying to build an Android application (in Android Studio) which extracts text in different languages from an image using Google Cloud Vision, but I am having trouble getting started.
I don't know how to use the Google Cloud files. Which files do I need to create or download, and how do I direct my API call to extract multiple languages?
I have the API and this source code:
POST https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY
{
  "requests": [
    {
      "image": {
        "content": "/9j/7QBEUGhvdG9zaG9...base64-encoded-image-content...fXNWzvDEeYxxxzj/Coa6Bax//Z"
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ]
    }
  ]
}
import com.google.cloud.vision.v1.AnnotateImageRequest;
import com.google.cloud.vision.v1.AnnotateImageResponse;
import com.google.cloud.vision.v1.BatchAnnotateImagesResponse;
import com.google.cloud.vision.v1.EntityAnnotation;
import com.google.cloud.vision.v1.Feature;
import com.google.cloud.vision.v1.Feature.Type;
import com.google.cloud.vision.v1.Image;
import com.google.cloud.vision.v1.ImageAnnotatorClient;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

public static void detectText(String filePath, PrintStream out) throws IOException {
    List<AnnotateImageRequest> requests = new ArrayList<>();

    // Read the image file and wrap it in a Vision API Image.
    ByteString imgBytes = ByteString.readFrom(new FileInputStream(filePath));
    Image img = Image.newBuilder().setContent(imgBytes).build();

    // Ask for text detection on this image.
    Feature feat = Feature.newBuilder().setType(Type.TEXT_DETECTION).build();
    AnnotateImageRequest request =
        AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
    requests.add(request);

    try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
        BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
        List<AnnotateImageResponse> responses = response.getResponsesList();
        for (AnnotateImageResponse res : responses) {
            if (res.hasError()) {
                out.printf("Error: %s\n", res.getError().getMessage());
                return;
            }
            // For full list of available annotations, see http://g.co/cloud/vision/docs
            for (EntityAnnotation annotation : res.getTextAnnotationsList()) {
                out.printf("Text: %s\n", annotation.getDescription());
                out.printf("Position : %s\n", annotation.getBoundingPoly());
            }
        }
    }
}
I would suggest trying your image data on the Cloud Vision API Explorer [1]. You can try the API directly in the web browser with OAuth2 authentication. Follow the steps below:
Enable the API in Google Cloud Console -> APIs and Services -> Libraries.
Set the scopes in the API Explorer checkbox:
https://www.googleapis.com/auth/cloud-platform
https://www.googleapis.com/auth/cloud-vision
{
  "requests": [
    {
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ],
      "image": {
        "source": {
          "imageUri": "http://dps.usc.edu/files/2015/07/text-alerts.png"
        }
      },
      "imageContext": {
        "languageHints": [
          "en"
        ]
      }
    }
  ]
}
Set the "imageContext". There you can set language hints, but the API might detect the language automatically. Check this [2] for available language hints.
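Since you want multiple languages, you can pass several hints in that block; a sketch (the codes below are examples, pick yours from [2]) that slots into the request above:
"imageContext": {
  "languageHints": ["en", "ar", "hi"]
}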
In the source you could use an image from your Google Cloud Storage bucket by replacing "imageUri" with "gcsImageUri": "gs://your-bucket/text-alerts.png" as your image uri. Note the change in protocol.
You are using "content" instead of "source"; that is for passing a base64-encoded image string. You can encode an image with Base64, copy the encoding as plain text, and try it in the API Explorer to check that the encoding is correct and works. Be careful when copying, as you may pick up noise like \n, \t, and other characters that might break your b64 encoding. I share a Python script that does the job:
import base64

# Read the image and base64-encode its bytes.
with open("text-alerts.png", "rb") as f:
    encoded = base64.b64encode(f.read())
print(encoded)

# Save the encoding so it can be pasted into the request body.
with open("content.b64", "wb") as fw:
    fw.write(encoded)
In your request:
{ "requests": [ { "image": { "content": "/9j/7QBEUGhvdG9zaG9...base64-encoded-image-content...fXNWzvDEeYxxxzj/Coa6Bax//Z" }, "features": [ { "type": "TEXT_DETECTION" } ] } ] }
The content tag is the image string in Base64:
"content": "/9j/7QBEUGhvdG9zaG9...base64-encoded-image-content...fXNWzvDEeYxxxzj/Coa6Bax//Z"
You can use a web tool to do the same and check that your Base64 works. You can then load the file in Android Studio.
Here [4] you can find a sample for Android, with a README that explains how to configure your app. You need to create your API key here [3], and in the MainActivity there is a variable that must be set to your API key, which is then used for the request.
private static final String CLOUD_VISION_API_KEY = "YOUR_API_KEY";
The sample loads an image and converts it to Base64 before sending the request [5]. See the method callCloudVision; inside it an AsyncTask retrieves the image and converts it to Base64 before sending the request.
[1] https://cloud.google.com/vision/docs/quickstart
[2] https://cloud.google.com/vision/docs/languages
[3] https://console.cloud.google.com/apis/credentials?project=your-project-id
[4] https://github.com/GoogleCloudPlatform/cloud-vision/tree/master/android
[5] https://github.com/GoogleCloudPlatform/cloud-vision/blob/master/android/CloudVision/app/src/main/java/com/google/sample/cloudvision/MainActivity.java#L192