Kinesis Stream not getting the logs - amazon-web-services

I am receiving CloudTrail logs in a Kinesis data stream. I am invoking a stream-processing Lambda function as described here. The final result that gets returned to the stream is then stored in an S3 bucket. As of now, the processing fails, with the following error file created in the S3 bucket:
{"attemptsMade":4,"arrivalTimestamp":1619677225356,"errorCode":"Lambda.FunctionError","errorMessage":"Check your function and make sure the output is in required format. In addition to that, make sure the processed records contain valid result status of Dropped, Ok, or ProcessingFailed","attemptEndingTimestamp":1619677302684,
Adding in the Python lambda function here for reference:
import base64
import gzip
import json
import logging

# Setup logging configuration
logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

def unpack_kinesis_stream_records(event):
    # Decode and decompress each base64 encoded data element
    return [gzip.decompress(base64.b64decode(k["data"])).decode('utf-8') for k in event["records"]]

def decode_raw_cloud_trail_events(cloudTrailEventDataList):
    # Convert raw event data list
    eventList = [json.loads(e) for e in cloudTrailEventDataList]
    # Filter out non-DATA_MESSAGE records
    filteredEvents = [e for e in eventList if e["messageType"] == 'DATA_MESSAGE']
    # Convert each individual log event message
    events = []
    for f in filteredEvents:
        for e in f["logEvents"]:
            events.append(json.loads(e["message"]))
    logger.info("{0} Event Logs Decoded".format(len(events)))
    return events

def handle_request(event, context):
    # Log raw Kinesis stream records
    # logger.debug(json.dumps(event, indent=4))
    # Unpack Kinesis stream records
    kinesisData = unpack_kinesis_stream_records(event)
    # [logger.debug(k) for k in kinesisData]
    # Decode and filter events
    events = decode_raw_cloud_trail_events(kinesisData)
    ####### INTEGRATION CODE GOES HERE #########
    return f"Successfully processed {len(events)} records."

def lambda_handler(event, context):
    return handle_request(event, context)
Can anyone help me understand the problem here?

I believe you are using the Kinesis Data Firehose service and not a Kinesis data stream. The code you are using reads directly from a Kinesis data stream and processes the CloudTrail events.
A Kinesis Data Firehose data transformation Lambda function works differently. Firehose sends the received CloudTrail events to the Lambda function; the Lambda processes/transforms the events and must send them back to Firehose, so that Firehose can deliver them to the destination S3 bucket.
Your Lambda function should return records in exactly the format Firehose expects, and each record must carry a result status of Dropped, Ok, or ProcessingFailed. You can read more in the AWS docs.
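For illustration, a transformation handler in the shape Firehose expects could look roughly like the sketch below. It reuses the question's decoding logic and assumes each Firehose record carries the same base64 encoded, gzip compressed CloudWatch Logs payload; it is not taken from the linked documentation, so treat it as a starting point rather than a drop-in fix.
import base64
import gzip
import json

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Firehose hands each record base64 encoded; the question's code shows the
        # payload is additionally gzip compressed (CloudWatch Logs subscription format).
        payload = gzip.decompress(base64.b64decode(record['data'])).decode('utf-8')
        message = json.loads(payload)

        if message.get('messageType') != 'DATA_MESSAGE':
            # Control messages are dropped, but the recordId (and data) are still returned
            output.append({
                'recordId': record['recordId'],
                'result': 'Dropped',
                'data': record['data']
            })
            continue

        # Flatten the CloudTrail events into newline-delimited JSON for delivery to S3
        transformed = ''.join(
            json.dumps(json.loads(e['message'])) + '\n' for e in message['logEvents']
        )
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(transformed.encode('utf-8')).decode('utf-8')
        })
    return {'records': output}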

Related

Getting Error when pre-processing data from Kinesis with Lambda

I have a use case where I have to filter incoming data from Kinesis Firehose based on the type of the event. I should write only certain events to S3 and ignore the rest. I am using a Lambda function to filter the records, with the following Python code:
def lambda_handler(event, context):
    # TODO implement
    output = []
    for record in event['records']:
        payload = base64.b64decode(record["data"])
        payload_json = json.loads(payload)
        event_type = payload_json["eventPayload"]["operation"]
        if event_type == "create" or event_type == "update":
            output_record = {
                'recordId': record['recordId'],
                'result': 'Ok',
                'data': base64.b64encode(payload)}
            output.append(output_record)
        else:
            output_record = {
                'recordId': record['recordId'],
                'result': 'Dropped'}
            output.append(output_record)
        return {'records': output}
I am only trying to process "create" and "update" events and drop the rest. I took the sample code from the AWS docs and built on it.
This is giving the following error:
{"attemptsMade":1,"arrivalTimestamp":1653289182740,"errorCode":"Lambda.MissingRecordId","errorMessage":"One or more record Ids were not returned. Ensure that the Lambda function returns all received record Ids.","attemptEndingTimestamp":1653289231611,"rawData":"some data","lambdaArn":"arn:$LATEST"}
I am not able to work out what this error means or how to fix it.
Bug: the return statement needs to be outside of the for loop; this is the cause of the error. The function is processing multiple recordIds, but only one recordId is returned. Unindent the return statement.
The data key must also be included in output_record, even if the event is being dropped. You can base64 encode the original payload with no transformations.
Additional context: event['records'] and output must be the same length (length validation), and each dictionary in output must have a recordId key whose value equals a recordId value in a dictionary in event['records'] (recordId validation).
From AWS documentation:
The record ID is passed from Kinesis Data Firehose to Lambda during the invocation. The transformed record must contain the same record ID. Any mismatch between the ID of the original record and the ID of the transformed record is treated as a data transformation failure.
Reference: Amazon Kinesis Data Firehose Data Transformation
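Applying both fixes to the question's handler gives something along these lines (a sketch of the corrected shape, with the Dropped branch echoing back the original data):
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        payload = base64.b64decode(record["data"])
        payload_json = json.loads(payload)
        event_type = payload_json["eventPayload"]["operation"]
        if event_type in ("create", "update"):
            output.append({
                'recordId': record['recordId'],
                'result': 'Ok',
                'data': base64.b64encode(payload).decode('utf-8')
            })
        else:
            # Dropped records still need their recordId and (unchanged) data returned
            output.append({
                'recordId': record['recordId'],
                'result': 'Dropped',
                'data': record['data']
            })
    # Return once, after every record has been handled, so all recordIds come back
    return {'records': output}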

Can an AWS Athena query read from S3 through an S3 Object Lambda?

I understand I can write an S3 Object Lambda which can transparently alter the returned S3 contents on the fly during retrieval, without modifying the original object.
I understand also that Athena can read json or csv files from S3.
My question is, can both of these capabilities be combined, so that Athena queries would read data which is transparently altered on the fly via S3 Object Lambda prior to being parsed by Athena?
SOME CODE
Suppose I write a CSV file:
hello
world
Then I write an S3 Object lambda:
import boto3
import requests

def lambda_handler(event, context):
    print(event)
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    original_object = response.content.decode('utf-8')

    # Transform object
    transformed_object = original_object.upper()

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)

    return {'status_code': 200}
(astute readers will notice this is the example from aws docs)
Now suppose I create an Athena EXTERNAL table and write this query:
SELECT * from hello
How can I ensure that the Athena query will return WORLD instead of world in this scenario?

AWS Kinesis Data Firehose outputs files to an S3 bucket. How/where to transform the data?

I have configured an EventBridge rule to output (target) to Kinesis Data Firehose, and Data Firehose eventually delivers to an S3 bucket. The data is in JSON format.
Data is getting delivered to the S3 bucket with no issues.
I have created a Glue Crawler pointing to the S3 bucket, which creates the table schema/metadata in the Glue Catalog so the data can be queried in AWS Athena.
I am currently facing two issues:
1. Data written to the S3 bucket by Firehose is written as single-line JSON, which means that if there are 5 records in the JSON, an Athena query only returns the top record because the records are not delimited by a newline (\n). When records aren't separated by a newline character (\n), Athena will always return 1 record, as per the AWS Athena docs.
https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/
How can I transform these records to one record per line, and where should I do this? In Firehose?
2. In the JSON data, there are some columns with special characters, and the AWS Athena docs say a column name cannot have special characters (forward slash, etc.). How can I remove the special characters or rename the column names, and where should I transform this data? In a Glue ETL job, after the crawler has created the table schema?
I understand that in Kinesis Firehose we can transform source records with an AWS Lambda function, but I am not able to solve the above 2 issues with it.
Update:
I was able to resolve the first issue by creating a Lambda function and calling it from Data Firehose. What logic can I include in the same Lambda to remove special characters from the key names? Code below:
Example - in the JSON, if the value is like "RelatedAWSResources:0/type": "AWS::Config::ConfigRule", remove the special characters from the key name only, making it "RelatedAWSResources0type": "AWS::Config::ConfigRule".
import json
import boto3
import base64

output = []

def lambda_handler(event, context):
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        print('payload:', payload)

        row_w_newline = payload + "\n"
        print('row_w_newline type:', type(row_w_newline))
        row_w_newline = base64.b64encode(row_w_newline.encode('utf-8'))

        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': row_w_newline
        }
        output.append(output_record)

    print('Processed {} records.'.format(len(event['records'])))
    return {'records': output}
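One possible way to handle the second issue in the same Lambda (not from the original thread, just a rough sketch assuming each record's payload is a single JSON object) is to parse the payload, recursively rewrite the key names, and re-serialize before adding the newline:
import re

def sanitize_keys(obj):
    # Recursively rebuild dicts, keeping only letters, digits and underscores in key names;
    # values are left untouched, so "RelatedAWSResources:0/type" becomes "RelatedAWSResources0type".
    if isinstance(obj, dict):
        return {re.sub(r'[^0-9A-Za-z_]', '', k): sanitize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_keys(v) for v in obj]
    return obj

# Illustrative usage inside the existing loop, instead of using the raw payload:
#     cleaned = sanitize_keys(json.loads(payload))
#     row_w_newline = json.dumps(cleaned) + "\n"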

AWS Lambda multiple triggers

I have three S3 buckets that invoke a Lambda function whenever there is a change in the content of specific objects inside the buckets.
Does anyone know if it is possible, using boto3, to retrieve those objects that have triggered the function?
Thanks!
UPDATE
I would like to get the objects that triggered the Lambda function from the response contents. I have tried to get them from the response of the get_function method of the Lambda client, but to no avail:
import boto3
lam = boto3.client('lambda')
response = lam.get_function(FunctionName='mylambdafunction')
Here's some sample code to retrieve the object that triggered the AWS Lambda function invocation:
import urllib.parse
import boto3

# Connect to S3
s3_client = boto3.client('s3')

# This handler is executed every time the Lambda function is triggered
def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    localFilename = '/tmp/foo.txt'

    # Download the file from S3 to the local filesystem
    s3_client.download_file(bucket, key, localFilename)

    # Do other stuff here
Basically, it extracts the Bucket and Key (filename) from the event data that is passed to the function, then calls download_file().

Re-process DLQ events in Lambda

I have an AWS Lambda function 'A' with an SQS DeadLetterQueue configured. When the Lambda fails to process an event, the event is correctly sent to the DLQ. Is there a way to re-process events that ended up in the DLQ?
I found two solutions, but they both have drawbacks:
Create a new Lambda function 'B' that reads from the SQS queue and then sends the events one by one to the previous Lambda 'A'. -> Here I have to write new code and deploy a new function.
Trigger Lambda 'A' again whenever an event arrives in the SQS queue. -> This looks dangerous, as I could end up with looping executions.
My ideal solution would be to re-process the discarded events on demand with Lambda 'A', without creating a new Lambda 'B' from scratch. Is there a way to accomplish this?
In the end, I didn't find any solution from AWS to reprocess the DLQ events of a Lambda function, so I created my own custom Lambda function (I hope this will be helpful to other developers with the same issue):
import boto3

lamb = boto3.client('lambda')
sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='my_dlq_name')

def lambda_handler(event, context):
    for _ in range(100):
        messages_to_delete = []
        for message in queue.receive_messages(MaxNumberOfMessages=10):
            payload_bytes_array = bytes(message.body, encoding='utf8')
            # print(payload_bytes_array)
            lamb.invoke(
                FunctionName='my_lambda_name',
                InvocationType="Event",  # Event = invoke the function asynchronously
                Payload=payload_bytes_array
            )
            # Add message to delete
            messages_to_delete.append({
                'Id': message.message_id,
                'ReceiptHandle': message.receipt_handle
            })
        # If you don't receive any notifications, the messages_to_delete list will be empty
        if len(messages_to_delete) == 0:
            break
        # Delete messages to remove them from the SQS queue; handle any errors
        else:
            deleted = queue.delete_messages(Entries=messages_to_delete)
            print(deleted)
Part of the code is inspired by this post