Enabling regex support on AWS Managed ElasticSearch in painless scripts - amazon-web-services

I am trying to upload templates to my AWS managed ElasticSearch.
ElasticSearch responds with a 500 error complaining that I need to set script.painless.regex.enabled to true. I know that you cannot edit the elasticsearch.yml file directly, but is there any way to enable regex support in Painless scripts on AWS managed ES?

There is currently no way to enable regex on an AWS ES cluster.
As a workaround, you can use StringTokenizer, as in the following example.
Example value:
doc['your_str_field.keyword'].value = '{"xxx":"123213","yyy":"123213","zzz":"123213"}'
Painless script:
{
"script": {
"lang": "painless",
"inline": "String xxx = doc['your_str_field.keyword'].value; xxx = xxx.replace('{','').replace('}','').replace('\"','').replace(' ','');StringTokenizer tokenizer = new StringTokenizer(xxx, ',');tokenizer.nextToken();tokenizer.nextToken();StringTokenizer tokenizer_v = new StringTokenizer(tokenizer.nextToken(),':');tokenizer_v.nextToken();return tokenizer_v.nextToken();"
}
}
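To check the tokenizing logic without a cluster, the same steps can be mirrored in plain Python. This is only a sketch of what the Painless script above does to the example value; the function name third_value is hypothetical, and Python's split stands in for StringTokenizer (which, unlike split, skips empty tokens; that makes no difference for this input).

```python
def third_value(raw: str) -> str:
    """Mirror the Painless script: return the value of the third key/value pair."""
    # Strip braces, double quotes and spaces, as the replace() chain does.
    cleaned = raw
    for ch in ('{', '}', '"', ' '):
        cleaned = cleaned.replace(ch, '')
    # Split into "key:value" pairs and keep the third one
    # (the script calls nextToken() three times).
    third_pair = cleaned.split(',')[2]
    # Split the pair on ':' and return the value part.
    return third_pair.split(':')[1]


if __name__ == "__main__":
    example = '{"xxx":"123213","yyy":"123213","zzz":"123213"}'
    print(third_value(example))  # prints 123213
```

Like the script, this assumes the field always holds at least three flat key/value pairs and that the values contain no commas or colons.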
Also, I needed to increase script.max_compilations_rate:
PUT /_cluster/settings
{
"transient": {
"script.max_compilations_rate": "500/1m"
}
}


Get generated API key from AWS AppSync API created with CDK

I'm trying to access data from my stack, in which I create an AppSync API. I want to use the deployed stack's URL and apiKey, but I'm running into issues with them being encoded/tokenized.
In my stack I'm setting some fields to the outputs of the deployed stack:
this.ApiEndpoint = graphAPI.url;
this.Authorization = graphAPI.graphqlApi.apiKey;
When trying to access these properties I get something like ${Token[TOKEN.209]} and not the values.
If I'm trying to resolve the token like so: this.resolve(graphAPI.graphqlApi.apiKey) I instead get { 'Fn::GetAtt': [ 'AppSyncAPIApiDefaultApiKey537321373E', 'ApiKey' ] }.
But I would like to retrieve the key itself as a string, like da2-10lksdkxn4slcrahnf4ka5zpeemq5i.
How would I go about actually extracting the string values for these properties?
The actual values of such Tokens are available only at deploy-time. Before then you can safely pass these token properties between constructs in your CDK code, but they are opaque placeholders until deployed. Depending on your use case, one of these options can help retrieve the deploy-time values:
If you define CloudFormation Outputs for a variable, CDK will (apart from creating the output in CloudFormation) print its value to the console after cdk deploy, and optionally write it to a JSON file that you pass with the --outputs-file flag.
// AppsyncStack.ts
new cdk.CfnOutput(this, 'ApiKey', {
value: this.api.apiKey ?? 'UNDEFINED',
exportName: 'api-key',
});
// at deploy-time, if you use a flag: --outputs-file cdk.outputs.json
{
"AppsyncStack": {
"ApiKey": "da2-ou5z5di6kjcophixxxxxxxxxx",
"GraphQlUrl": "https://xxxxxxxxxxxxxxxxx.appsync-api.us-east-1.amazonaws.com/graphql"
}
}
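Once the outputs file exists, reading a value back is a plain JSON lookup. A minimal sketch, assuming the file layout shown above (stack name at the top level, output names beneath it); the function name read_stack_output is hypothetical:

```python
import json
from pathlib import Path


def read_stack_output(outputs_file: str, stack_name: str, output_key: str) -> str:
    """Return one output value from a file written by `cdk deploy --outputs-file`."""
    outputs = json.loads(Path(outputs_file).read_text(encoding="utf-8"))
    # Raises KeyError if the stack or the output name is not in the file.
    return outputs[stack_name][output_key]


# Usage, after `cdk deploy --outputs-file cdk.outputs.json` has run:
#   api_key = read_stack_output("cdk.outputs.json", "AppsyncStack", "ApiKey")
```

The file only exists after a deploy, so this suits post-deploy scripts rather than code that runs during synthesis.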
Alternatively, you can write a script to fetch the data post-deploy using the listGraphqlApis and listApiKeys commands from the appsync JS SDK client. You can run the script locally or, for advanced use cases, wrap the script in a CDK Custom Resource construct for deploy-time integration.
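Thanks to @fedonev I was able to extract the API key and URL like so:
import { AppSyncClient, ListApiKeysCommand, ListGraphqlApisCommand } from "@aws-sdk/client-appsync";
import { flatMap } from "lodash"; // assumption: flatMap is lodash's
const client = new AppSyncClient({ region: "eu-north-1" });
// Fetch the first GraphQL API in the region
const command = new ListGraphqlApisCommand({ maxResults: 1 });
const res = await client.send(command);
if (res.graphqlApis) {
  // List the API keys that belong to that API
  const apiKeysCommand = new ListApiKeysCommand({
    apiId: res.graphqlApis[0].apiId,
  });
  const apiKeyResponse = await client.send(apiKeysCommand);
  // uris is a map of endpoint type to URL; flatMap collects its values
  const urls = flatMap(res.graphqlApis[0].uris);
  if (apiKeyResponse.apiKeys && res.graphqlApis[0].uris) {
    // sendSlackMessage is a helper defined elsewhere (not shown)
    sendSlackMessage(urls[1], apiKeyResponse.apiKeys[0].id || "");
  }
}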

Automate AWS Marketplace publishing through CLI

I have my product uploaded to AWS as an AMI through HashiCorp's Packer. Now I'd like to automate the last step, publishing it to the marketplace. The product already exists; it's only about adding a revision.
After reading this article, the API_StartChangeSet doc, this add revisions user guide & fiddling with the marketplace console, I think I just have to
aws marketplace-catalog start-change-set --catalog AWSMarketplace --change-set-name "$VERSION" --change-set '[ {"ChangeType": "AddRevisions", "Entity": {"Identifier": "REDACTED@29","Type": "ServerProduct@1.0"}, "Details": "{\"DataSetArn\": \"?????\", \"RevisionArns\": [\"?????\"] }"} ]'
I'm having a hard time coming up with the "Details" part. I have my AMI ID; I guess that goes in RevisionArns? What should I put in DataSetArn: the "EntityArn" from the output of aws marketplace-catalog describe-entity --catalog AWSMarketplace --entity-id REDACTED?
The Details facet here is just a product-type-specific facet, encoded as a JSON string. For the AMI that you are offering in the AWS Marketplace, it could include support information, region availability, or any other descriptive information about your change. For example:
"Details": "{\"Description\":{}, \"PromotionalResources\":{}, \"RegionAvailability\":{}, \"SupportInformation\":{}}",
The example you found does not necessarily mean that you have to supply EntityArn and RevisionArns. The Details facet carries the information that describes your change.
Check here.
It turns out I hadn't found the right documentation: my last link was about AWS Data Exchange, whose "Details" field contents were confusing.
Here is the relevant documentation: Marketplace catalog AMI add version, and here's the snippet I was looking for:
"Details": "{
\"Version\": {
\"VersionTitle\": \"*My new title*\",
\"ReleaseNotes\": \"*My new Release notes*\"
},
\"DeliveryOptions\": [
{
\"Details\": {
\"AmiDeliveryOptionDetails\": {
\"AmiSource\": {
\"AmiId\": \"ami-1234567890abcdef\",
\"AccessRoleArn\": \"arn:aws:iam::12345678901:role/AwsMarketplaceAmiIngestion\",
\"UserName\": \"ec2-user\",
\"OperatingSystemName\": \"AMAZONLINUX\",
\"OperatingSystemVersion\": \"Amazon Linux 2 AMI 2.0.20210126.0 x86_64 HVM gp2\"
},
\"UsageInstructions\": \"Easy to use AMI\",
\"RecommendedInstanceType\": \"m4.xlarge\",
\"SecurityGroups\": [
{
\"IpProtocol\": \"tcp\",
\"FromPort\": 443,
\"ToPort\": 443,
\"IpRanges\": [
\"0.0.0.0/0\"
]
}
]
}
}
}
]
}"
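Escaping that nested JSON by hand is error-prone, so it can help to build the change set as ordinary data and let a JSON library encode Details. A minimal sketch, not the documented procedure: the function name build_change_set is hypothetical, and the ChangeType and entity Type strings in the usage lines are assumptions to be checked against the add-version documentation mentioned above.

```python
import json


def build_change_set(change_type: str, entity_type: str, entity_id: str, details: dict) -> list:
    """Build a --change-set payload whose Details is a JSON-encoded string."""
    return [
        {
            "ChangeType": change_type,
            "Entity": {"Type": entity_type, "Identifier": entity_id},
            # json.dumps produces the escaped string that the API expects here.
            "Details": json.dumps(details),
        }
    ]


details = {
    "Version": {"VersionTitle": "My new title", "ReleaseNotes": "My new release notes"},
    "DeliveryOptions": [
        {
            "Details": {
                "AmiDeliveryOptionDetails": {
                    "AmiSource": {
                        "AmiId": "ami-1234567890abcdef",
                        "AccessRoleArn": "arn:aws:iam::12345678901:role/AwsMarketplaceAmiIngestion",
                        "UserName": "ec2-user",
                        "OperatingSystemName": "AMAZONLINUX",
                        "OperatingSystemVersion": "Amazon Linux 2 AMI 2.0.20210126.0 x86_64 HVM gp2",
                    },
                    "UsageInstructions": "Easy to use AMI",
                    "RecommendedInstanceType": "m4.xlarge",
                    "SecurityGroups": [
                        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": ["0.0.0.0/0"]}
                    ],
                }
            }
        }
    ],
}

# Assumed values: verify the change type and entity type in the documentation.
change_set = build_change_set("AddDeliveryOptions", "AmiProduct@1.0", "REDACTED", details)
print(json.dumps(change_set))
```

The printed string can then be passed as the --change-set argument of aws marketplace-catalog start-change-set.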

How to create an AWS SSM Document Package using Terraform

Using Terraform, I am trying to create an AWS SSM Document Package for Chrome so I can install it on various EC2 instances I have. I define these steps via terraform:
Upload zip containing Chrome installer plus install and uninstall powershell scripts.
Add that ZIP to an SSM package.
However, when I execute terraform apply I receive the following error...
Error updating SSM document: InvalidParameterValueException:
AttachmentSource not provided in the input request.
status code: 400, request id: 8d89da70-64de-4edb-95cd-b5f52207794c
The contents of my main.tf are as follows:
# 1. Add package zip to s3
resource "aws_s3_bucket_object" "windows_chrome_executable" {
bucket = "mybucket"
key = "ssm_document_packages/GoogleChromeStandaloneEnterprise64.msi.zip"
source = "./software-packages/GoogleChromeStandaloneEnterprise64.msi.zip"
etag = md5("./software-packages/GoogleChromeStandaloneEnterprise64.msi.zip")
}
# 2. Create AWS SSM Document Package using zip.
resource "aws_ssm_document" "ssm_document_package_windows_chrome" {
name = "windows_chrome"
document_type = "Package"
attachments_source {
key = "SourceUrl"
values = ["/path/to/mybucket"]
}
content = <<DOC
{
"schemaVersion": "2.0",
"version": "1.0.0",
"packages": {
"windows": {
"_any": {
"x86_64": {
"file": "GoogleChromeStandaloneEnterprise64.msi.zip"
}
}
}
},
"files": {
"GoogleChromeStandaloneEnterprise64.msi.zip": {
"checksums": {
"sha256": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
}
}
}
DOC
}
If I change the file from a zip to a vanilla msi I do not receive the error message, however, when I navigate to the package in the AWS console it tells me that the install.ps1 and uninstall.ps1 files are missing (since obviously they weren't included).
Has anyone experienced the above error and do you know how to resolve it? Or does anyone have reference to a detailed example of how to do this?
Thank you.
I ran into this same problem. To fix it, I added a trailing slash to the source URL value:
attachments_source {
key = "SourceUrl"
values = ["/path/to/mybucket/"]
}
My best guess is it appends the filename from the package spec to the value provided in the attachments source value so it needs the trailing slash to build a valid path to the actual file.
This is the way it should be set up for an attachment in s3:
attachments_source {
key = "S3FileUrl"
values = ["s3://packer-bucket/packer_1.7.0_linux_amd64.zip"]
name = "packer_1.7.0_linux_amd64.zip"
}
I realized that in the above example there was no way terraform could identify a dependency between the two resources i.e. the s3 object needs to be created before the aws_ssm_document. Thus, I added in the following explicit dependency inside the aws_ssm_document:
depends_on = [
aws_s3_bucket_object.windows_chrome_executable
]
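The package manifest in the question also needs a sha256 checksum for the zip. A minimal sketch for computing it, assuming the zip is available locally; the function name sha256_of_file is hypothetical:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large installers are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage:
#   sha256_of_file("./software-packages/GoogleChromeStandaloneEnterprise64.msi.zip")
```

The resulting hex string is what goes into the "sha256" field under "checksums" in the document content.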

Confused about GCP Dataproc softwareConfig values

I'm attempting to modify Airflow's Dataproc operator to add Anaconda and Jupyter to the cluster.
I'm overriding DataprocClusterCreateOperator to include optionalComponents.
After reading the Google docs I understand that I need to pass an enum.
Every time I try to run this task I get either invalid value errors or TypeError: Object of type 'EnumMeta' is not JSON serializable
I'd really appreciate it if someone can tell me how to correctly pass in this field.
cluster_data = {
'projectId': self.project_id,
'clusterName': self.cluster_name,
'config': {
'gceClusterConfig': {
},
'masterConfig': {
'numInstances': self.num_masters,
'machineTypeUri': master_type_uri,
'diskConfig': {
'bootDiskType': self.master_disk_type,
'bootDiskSizeGb': self.master_disk_size
}
},
'workerConfig': {
'numInstances': self.num_workers,
'machineTypeUri': worker_type_uri,
'diskConfig': {
'bootDiskType': self.worker_disk_type,
'bootDiskSizeGb': self.worker_disk_size
}
},
'secondaryWorkerConfig': {},
'softwareConfig': {
# I've tried the following:
'optionalComponents': 'ANACONDA,JUPYTER'
#from google.cloud.dataproc_v1 import enums
'optionalComponents': [enums.Component.ANACONDA.value]
},
}
}
You want to use a JSON list there: ['ANACONDA', 'JUPYTER'].
As general guidance for figuring out how to structure things, you can create a cluster with gcloud and then run:
gcloud dataproc clusters describe my-cluster --format json
That --format json is the key. The result should be directly copy-pastable.
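If the component values start out as enum members, converting them to their names before building the request body keeps it JSON serializable. A minimal sketch: the Component class below is a stand-in for the client library's enum (its numeric values are made up), and software_config is a hypothetical helper.

```python
import json
from enum import Enum


class Component(Enum):
    """Stand-in for the client library's component enum (values are made up)."""
    ANACONDA = 1
    JUPYTER = 2


def software_config(components) -> dict:
    """Build a softwareConfig dict whose optionalComponents is a list of names."""
    # Enum members become their names; plain strings pass through unchanged.
    names = [c.name if isinstance(c, Enum) else str(c) for c in components]
    return {"optionalComponents": names}


config = software_config([Component.ANACONDA, "JUPYTER"])
print(json.dumps(config))  # prints {"optionalComponents": ["ANACONDA", "JUPYTER"]}
```

The resulting dict can be used as the 'softwareConfig' entry of cluster_data. An 'EnumMeta' error like the one in the question typically means an enum class itself, rather than a string, ended up in the request body.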

How do you full text search an Amazon S3 bucket?

I have a bucket on S3 containing a large number of text files.
I want to search for some text within those files. They contain raw data only.
Each text file has a different name.
For example, the bucket contains:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
and I want to search for text like "I am human" in the above files.
How can I achieve this? Is it even possible?
The only way to do this will be via CloudSearch, which can use S3 as a source. It works using rapid retrieval to build an index. This should work very well but thoroughly check out the pricing model to make sure that this won't be too costly for you.
The alternative is as Jack said - you'd otherwise need to transfer the files out of S3 to an EC2 and build a search application there.
Since October 1st, 2015, Amazon has offered another search service, Amazon Elasticsearch Service. In much the same vein as CloudSearch, you can stream data into it from Amazon S3 buckets.
It works with a Lambda function: any new data sent to the S3 bucket triggers an event notification to the Lambda, which updates the ES index.
All the steps are detailed in the Amazon documentation, with Java and JavaScript examples.
At a high level, setting up to stream data to Amazon ES requires the following steps:
Creating an Amazon S3 bucket and an Amazon ES domain
Creating a Lambda deployment package.
Configuring a Lambda function.
Granting authorization to stream data to Amazon ES.
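The Lambda side of those steps can be sketched in Python as follows. This only covers reading the S3 event notification; index_document is a hypothetical placeholder for fetching the object and sending it to the ES domain, which depends on your domain endpoint and authorization setup.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event: dict) -> list:
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (a space becomes '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def index_document(bucket: str, key: str) -> None:
    """Hypothetical placeholder: fetch the object and send it to the search index."""
    print(f"would index s3://{bucket}/{key}")


def handler(event, context):
    """Lambda entry point: index every object named in the incoming event."""
    for bucket, key in objects_from_s3_event(event):
        index_document(bucket, key)
```

Keeping the event parsing separate from the indexing call makes the handler easy to test locally with a hand-written event.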
Although not an AWS native service, there is Mixpeek, which runs text extraction like Tika, Tesseract and ImageAI on your S3 files then places them in a Lucene index to make them searchable.
You integrate it as follows:
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
aws_access_key_id=aws['aws_access_key_id'],
aws_secret_access_key=aws['aws_secret_access_key'],
region_name='us-east-2',
mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
api_key=mixpeek_api_key
)
# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
{
"_id": "REDACTED",
"api_key": "REDACTED",
"highlights": [
{
"path": "document_str",
"score": 0.8759502172470093,
"texts": [
{
"type": "text",
"value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
},
{
"type": "hit",
"value": "Heartgard"
},
{
"type": "text",
"value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
}
]
}
],
"metadata": {
"date_inserted": "2021-10-07 03:19:23.632000",
"filename": "prescription.pdf"
},
"score": 0.13313256204128265
}
]
Then you parse the results.
You can use Filestash (disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. Give it a bit of time to index everything if you have a lot of data, and you should be good.
If you have an EMR cluster, create a Spark application and run the search there. We did this; it works as a distributed search.
I know this is really old, but hopefully someone finds my solution handy.
This is a Python script using boto3.
import boto3
def search_word(info, search_for):
    # True if the search string occurs anywhere in the text
    return search_for in info
aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'
client = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking@emailaddress.com'
search_results = []
search_results_keys = []
# Paginate, because a single list_objects_v2 call returns at most 1000 keys
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix):
    # 'Contents' is absent when no object matches the prefix
    for i in page.get('Contents', []):
        key = i['Key']
        obj = client.get_object(Bucket=bucket_name, Key=key)
        body = obj['Body'].read().decode("utf-8")
        if search_word(body, search_for):
            search_results.append({key: body})
            search_results_keys.append(key)
# You can print either the matching keys (file name/directory), or a list of maps
# where the key is the file name/directory and the value is the text of the file
print(search_results)
print(search_results_keys)
There is a serverless and cheaper option available.
Use AWS Glue to convert the text files into a table.
Then use AWS Athena to run SQL queries on top of it.
I would recommend storing the data as Parquet on S3; this makes the data on S3 much smaller and queries much faster.