AWS ElasticSearch with Lambda and S3 doesn't add documents to index

I have a bit of a mysterious issue: I have a Lambda function which transports data from an S3 bucket to an AWS ES cluster.
My lambda function runs correctly and reports the following:
All 6 log records added to ES
However, the added documents do not appear in the AWS Elasticsearch index:
/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open logs 3N2O9CqhSwCP6sj1QK5EQw 5 1 0 0 1.2kb 1.2kb
I'm using this lambda function https://github.com/aws-samples/amazon-elasticsearch-lambda-samples/blob/master/src/s3_lambda_es.js
The Lambda function's role has full permissions on the ES cluster and the S3 bucket. It can definitely access the S3 bucket, because I can print the bucket contents to the Lambda console log.
Any ideas for further debugging are much appreciated!
Cheers

There can be many reasons for this. Since you are asking for debugging ideas, here are a couple of them:
Add a console.log call in the postDocumentToES method of the Lambda that shows exactly where it connects.
Try extracting the code from the Lambda and running it locally, just to make sure it succeeds in sending data to Elasticsearch, so that you at least know the code is correct (see the sketch after this list).
Make sure there are no "special restrictions" on the index (such as a TTL of a couple of minutes, or anything else that prevents inserting into the index).
How many ES servers do you have? Maybe there is a cluster of them and replication is not configured correctly, so the ES node you query doesn't actually have the documents while another node does.
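For the second point, a minimal standalone sketch of such a local test could look like this (Python, assuming boto3, requests and requests-aws4auth are installed; the domain endpoint is a placeholder and the logs index name is taken from the question). It is not the Lambda's own code, just a quick way to confirm that your credentials can write to the cluster and to see the raw response ES returns:

    import json
    import boto3
    import requests
    from requests_aws4auth import AWS4Auth

    region = "us-east-1"  # assumption: the region of your ES domain
    endpoint = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder endpoint

    # Sign the request with your local AWS credentials (SigV4), the same way
    # the Lambda's role signs its requests.
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                       region, "es", session_token=credentials.token)

    doc = {"message": "test log line", "@timestamp": "2020-01-01T00:00:00Z"}

    # Index a single document into the same index the Lambda writes to, then
    # print the full response so any rejection reason is visible. On older ES
    # versions the path includes the mapping type instead of _doc.
    resp = requests.post(endpoint + "/logs/_doc", auth=awsauth,
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(doc))
    print(resp.status_code, resp.text)

A 201 with "result": "created" means the cluster is reachable and writable with those credentials; a 403 or a mapping error here usually points at the access policy or the index configuration rather than at the Lambda itself.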


Trigger a Custom Function Every X Hours in AWS

I am looking to trigger code every 1 Hour in AWS.
The code should: Parse through a list of zip codes, fetch data for each of the zip codes, store that data somewhere in AWS.
Is there a specific AWS service I would use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS Service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every x amount of time, but that's where I was struggling to know whether I could still use Lambda. I also wasn't sure where my options were for storing the data; I saw that Glue might be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every hour, for example:
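As a rough illustration (the console works just as well), a schedule for an existing function can be created with boto3 along these lines; the function and role ARNs below are placeholders for your own Lambda and for an IAM role that EventBridge Scheduler is allowed to assume to invoke it:

    import boto3

    scheduler = boto3.client("scheduler")

    # Placeholder ARNs: substitute your own function and scheduler role.
    lambda_arn = "arn:aws:lambda:us-east-1:123456789012:function:fetch-zip-data"
    role_arn = "arn:aws:iam::123456789012:role/scheduler-invoke-lambda"

    scheduler.create_schedule(
        Name="fetch-zip-data-hourly",
        ScheduleExpression="rate(1 hour)",      # run every hour
        FlexibleTimeWindow={"Mode": "OFF"},     # fire at the exact scheduled time
        Target={"Arn": lambda_arn, "RoleArn": role_arn},
    )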
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, noSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still run SQL-like queries on the data using Amazon Athena.
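To make the Lambda-plus-S3 combination concrete, here is a minimal sketch of a handler, assuming a hypothetical data API, a placeholder bucket name and a hard-coded zip code list; adapt the fetch step and the storage format to your actual data source:

    import urllib.request
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "my-zip-data-bucket"                # placeholder bucket name
    ZIP_CODES = ["10001", "94105", "60601"]      # or load the list from S3/DynamoDB
    API_URL = "https://example.com/data?zip={}"  # placeholder for your data API

    def handler(event, context):
        # One timestamped prefix per run keeps the hourly snapshots separate.
        run_prefix = datetime.now(timezone.utc).strftime("runs/%Y-%m-%dT%H%M")
        for zip_code in ZIP_CODES:
            with urllib.request.urlopen(API_URL.format(zip_code), timeout=10) as resp:
                payload = resp.read()
            s3.put_object(
                Bucket=BUCKET,
                Key="{}/{}.json".format(run_prefix, zip_code),
                Body=payload,
                ContentType="application/json",
            )
        return {"zip_codes_processed": len(ZIP_CODES)}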

Pushing data to AWS Elasticsearch from CloudWatch logs without a schema

Our setup is this: AWS services produce and publish logs to CloudWatch. From there we use the standard Lambda function to publish the logs to AWS Elasticsearch.
The Lambda function pushes the logs to ES using the index name format cloudwatch-logs-<date>, which creates a new index every day.
We have an issue with the mapping of the data. For example, when a service (e.g. Aurora DB) publishes its first set of logs and the CPU field's value is 0, ES maps that field as a long. When the same service publishes a second set of logs with CPU set to 10.5, ES rejects that data with the error mapper cannot change type [long] to [float].
We have a lot of services publishing logs with a lot of data sets. Is the best way to resolve this for the Lambda to push the logs with the format cloudwatch-logs, so only one index is created, and then manually fix the mapping issue for that index? Or is there a better way to resolve this?
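One common way to handle this kind of dynamic-mapping conflict (not necessarily the best fit for every setup) is to declare the problematic fields explicitly in an index template that matches the daily indices, so every new cloudwatch-logs-* index maps, say, CPU as a float before any documents arrive. A hedged sketch follows, with a placeholder endpoint; the mapping body is written for an ES version without mapping types, and older clusters need an extra type level under mappings:

    import json
    import boto3
    import requests
    from requests_aws4auth import AWS4Auth

    region = "us-east-1"  # assumption: the region of your ES domain
    endpoint = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder endpoint

    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                       region, "es", session_token=credentials.token)

    # Legacy index template: applied automatically to every new index whose name
    # matches cloudwatch-logs-*, so CPU starts out as a float instead of whatever
    # type the first document happens to suggest.
    template = {
        "index_patterns": ["cloudwatch-logs-*"],
        "mappings": {
            "properties": {
                "CPU": {"type": "float"}
            }
        }
    }

    resp = requests.put(endpoint + "/_template/cloudwatch-logs",
                        auth=awsauth,
                        headers={"Content-Type": "application/json"},
                        data=json.dumps(template))
    print(resp.status_code, resp.text)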

AWS CloudFront returns 503 for regions other than us-east-1

I am trying to configure a CloudFront distribution with a Lambda@Edge function linked to the origin request event. The Lambda@Edge function returns a very basic HTML page (the code is based on this example: Serving Static Content (Generated Response)). Once deployed, the distribution works as expected in locations close to the North Virginia region, but fails in other locations, returning the following error:
503: The Lambda function associated with the CloudFront distribution
was throttled. We can't connect to the server for this app or website
at this time. There might be too much traffic or a configuration
error. Try again later, or contact the app or website owner. If you
provide content to customers through CloudFront, you can find steps to
troubleshoot and help prevent this error by reviewing the CloudFront
documentation.
I already tried looking at the logs, but nothing is logged in CloudWatch when the 503 error is thrown, and the logs from the CF distribution show the lambdalimitExceeded error.
I have been jumping around between different locations using a VPN, and I find it strange that it only works in places close to the us-east-1 region. I am creating all the resources using a federated account; I don't know if it could be related to IAM permissions.
Another thing to point out is that everything works as expected if I reproduce the same scenario using another AWS account and a regular user.
If you're seeing lambdalimitExceeded, then you need to review the following for your Lambda@Edge function:
The number of function executions exceeded one of the quotas (formerly known as limits) that Lambda sets to throttle executions in an AWS Region (concurrent executions or invocation frequency).
The function exceeded the Lambda function timeout quota.
Remember that Lambda@Edge is executed closer to the user; if you try to retrieve resources external to the region, you may time out due to geographical latency. Can you increase the timeout to account for this?
Do you have other Lambdas running in the regions where it is executing? If you view the CloudWatch logs for one of the regions closest to the user's edge location, you will see these Lambda logs and hopefully be able to identify the root cause. If not, then add more debugging.
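Lambda@Edge writes its CloudWatch logs in the region where the replica actually executed, under a log group named /aws/lambda/us-east-1.<function-name>, so a quick sweep like the sketch below (the function name is a placeholder) can show which regions have logs worth inspecting for throttling or timeout errors:

    import boto3

    FUNCTION_NAME = "my-edge-function"  # placeholder Lambda@Edge function name
    LOG_GROUP = "/aws/lambda/us-east-1." + FUNCTION_NAME

    # Enumerate regions, then look for the edge function's log group in each one.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

    for region in regions:
        logs = boto3.client("logs", region_name=region)
        groups = logs.describe_log_groups(logGroupNamePrefix=LOG_GROUP)["logGroups"]
        if groups:
            print(region, "has", LOG_GROUP)  # check these regions for errors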

AWS WAF - Auto Save Web Application Firewall logs in S3

How do you route AWS Web Application Firewall (WAF) logs to an S3 bucket? Is this something I can quickly do through the AWS Console? Or, would I have to use a lambda function (invoked by a CloudWatch timer event) to query the WAF logs every n minutes?
UPDATE:
I'm interested in the ACL logs (source IP, URI, matched rule, request headers, action, time, etc.).
UPDATE (05/15/2017)
AWS doesn't provide an easy way to view/parse these logs. You can get a "random sample" via the get-sampled-requests command, which isn't acceptable...
Gets detailed information about a specified number of requests--a
sample--that AWS WAF randomly selects from among the first 5,000
requests that your AWS resource received during a time range that you
choose. You can specify a sample size of up to 500 requests, and you
can specify any time range in the previous three hours.
http://docs.aws.amazon.com/cli/latest/reference/waf/get-sampled-requests.html
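For reference, the same sampled data can also be pulled programmatically; a small sketch using boto3's classic (global) WAF client, with placeholder web ACL and rule IDs taken from list_web_acls() and list_rules():

    from datetime import datetime, timedelta, timezone
    import boto3

    waf = boto3.client("waf")  # classic global WAF; use "waf-regional" for regional ACLs

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)

    resp = waf.get_sampled_requests(
        WebAclId="example-web-acl-id",   # placeholder
        RuleId="example-rule-id",        # placeholder
        TimeWindow={"StartTime": start, "EndTime": end},
        MaxItems=500,                    # the documented per-call maximum
    )

    for sample in resp["SampledRequests"]:
        req = sample["Request"]
        print(sample["Timestamp"], sample["Action"], req["ClientIP"], req["URI"])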
Also, I'm not the only one experiencing this issue either:
https://forums.aws.amazon.com/thread.jspa?threadID=220202
I was looking for this functionality today and stumbled across the referenced thread. It was, coincidentally, updated today:
Hello,
Thanks for your input. I have submitted a feature request on your
behalf to export WAF events to S3 for long term analysis.
Best Regards, albertpataws
The lack of this feature strikes me as being almost as odd as the fact that I can't change timezones for graphs.

All my AWS datapipelines have stopped working with Validation error

I use AWS data pipelines to automatically back up dynamodb tables to S3 on a weekly basis.
All of my data pipelines stopped working two weeks ago.
After some investigation, I see that EMR fails with "validation error" and "Terminated with errors: No active keys found for user account". As a result, all the jobs time out.
Any ideas what this means?
I ruled out changes to the list of instance types that are allowed to be used with EMR.
Also, I tried to read the EMR logs, but it looks like it doesn't even get to the point of creating logs (or I am looking for them in the wrong place).
The AWS account used to launch EMR has keys (an access key and a secret key). Could you check whether those keys were deleted? You need to log in to the AWS console and check that the keys exist for your account.
If not, re-create the keys and use them in your code that launches EMR.
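If you want to check this without clicking through the console, a small sketch along these lines shows whether root access keys exist at all and lists the keys of a given IAM user (the user name is a placeholder):

    import boto3

    iam = boto3.client("iam")

    # The account summary reports whether root access keys exist for the account.
    summary = iam.get_account_summary()["SummaryMap"]
    print("Root access keys present:", summary.get("AccountAccessKeysPresent"))

    # If the pipeline/EMR code runs with an IAM user's credentials instead,
    # list that user's keys and their status.
    user_name = "pipeline-user"  # placeholder user name
    for key in iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]:
        print(user_name, key["AccessKeyId"], key["Status"])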
Basically @Sandesh Deshmane answered my question correctly.
For future reference and clarity, I'll explain the situation here too:
What happened was that originally I used the root account and console to create the pipelines. Later I decided to follow the best practices and removed my root account keys.
A few days later (my pipelines are scheduled to run weekly) when they all failed I did not make the connection and thought of other problems.
I think one good way to avoid this (if you want to use the console) is to log in to the console with an IAM account and create the pipelines.
Or you can use the command line tools to create them with an IAM user's credentials.
The real solution now (I think it was not available when the console was first introduced) is to assign the correct IAM role on the first page when you are creating your pipeline in the console. In the "security/access" section, change it from default to custom and select the correct roles there.