Sending EMR Logs to CloudWatch - amazon-web-services

Is there a way to send EMR logs to CloudWatch instead of S3? We would like to have all our services' logs in one location. It seems the only thing you can do is set up alarms for monitoring, but that doesn't cover logging.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
Would I have to install the CloudWatch agent on the nodes in the cluster? https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html

You can install the CloudWatch agent via EMR's bootstrap configuration and configure it to watch the log directories. It then pushes the logs to Amazon CloudWatch Logs.
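As a rough sketch of what the agent configuration for this might look like, the snippet below builds the `logs` section of a CloudWatch agent configuration file in Python and prints it as JSON. The log paths and the log group name are assumptions chosen for illustration; check where the logs actually live on your cluster nodes. A bootstrap action would still have to install the agent and start it with this file.

```python
import json

def build_agent_config(log_paths, log_group):
    """Build the 'logs' section of a CloudWatch agent config for the given files."""
    collect_list = [
        {
            "file_path": path,
            "log_group_name": log_group,
            # {instance_id} is a placeholder that the agent expands on each node
            "log_stream_name": "{instance_id}",
        }
        for path in log_paths
    ]
    return {"logs": {"logs_collected": {"files": {"collect_list": collect_list}}}}

# Hypothetical paths and group name, for illustration only.
config = build_agent_config(
    ["/var/log/hadoop-yarn/*.log", "/var/log/spark/*.log"],
    "emr-cluster-logs",
)
print(json.dumps(config, indent=2))
```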

You can read the logs from S3 and push them to CloudWatch using boto3, then delete them from S3 if you no longer need them. In some use cases the stdout.gz log needs to be in CloudWatch for monitoring purposes.
boto3 documentation on put_log_events
import boto3
import botocore.session
import logging
import time
import datetime
import gzip

# Module-level logger used by the helper functions below
logger = logging.getLogger(__name__)

def get_session(service_name):
    # Resolve credentials and region via botocore, then build a boto3 client
    session = botocore.session.get_session()
    aws_access_key_id = session.get_credentials().access_key
    aws_secret_access_key = session.get_credentials().secret_key
    aws_session_token = session.get_credentials().token
    region = session.get_config_variable('region')
    return boto3.client(
        service_name = service_name,
        region_name = region,
        aws_access_key_id = aws_access_key_id,
        aws_secret_access_key = aws_secret_access_key,
        aws_session_token = aws_session_token
    )
def get_log_file(s3, bucket, key):
    # Download the gzipped log object and return its decompressed bytes
    log_file = None
    try:
        obj = s3.get_object(Bucket=bucket, Key=key)
        compressed_body = obj['Body'].read()
        log_file = gzip.decompress(compressed_body)
    except Exception as e:
        logger.error(f"Error reading from bucket : {e}")
        raise
    return log_file
def create_log_events(logs, batch_size):
    # Split the log into batches of at most batch_size events
    log_event_batch = []
    log_event_batch_collection = []
    try:
        for line in logs.splitlines():
            log_event = {'timestamp': int(round(time.time() * 1000)), 'message':line.decode('utf-8')}
            if len(log_event_batch) < batch_size:
                log_event_batch.append(log_event)
            else:
                log_event_batch_collection.append(log_event_batch)
                log_event_batch = []
                log_event_batch.append(log_event)
    except Exception as e:
        logger.error(f"Error creating log events : {e}")
        raise
    log_event_batch_collection.append(log_event_batch)
    return log_event_batch_collection
def create_log_stream_and_push_log_events(logs, log_group, log_stream, log_event_batch_collection, delay):
    response = logs.create_log_stream(logGroupName=log_group, logStreamName=log_stream)
    seq_token = None
    try:
        for log_event_batch in log_event_batch_collection:
            log_event = {
                'logGroupName': log_group,
                'logStreamName': log_stream,
                'logEvents': log_event_batch
            }
            if seq_token:
                log_event['sequenceToken'] = seq_token
            response = logs.put_log_events(**log_event)
            # nextSequenceToken may be absent from the response, so avoid a KeyError
            seq_token = response.get('nextSequenceToken')
            time.sleep(delay)
    except Exception as e:
        logger.error(f"Error pushing log events : {e}")
        raise
The caller function
def main():
    s3 = get_session('s3')
    logs = get_session('logs')
    BUCKET_NAME = 'Your_Bucket_Name'
    KEY = 'logs/emr/Path_To_Log/stdout.gz'
    BATCH_SIZE = 10000 #According to boto3 docs
    PUSH_DELAY = 0.2 #According to boto3 docs
    LOG_GROUP='test_log_group' #Destination log group
    LOG_STREAM='{}-{}'.format(time.strftime('%Y-%m-%d'),'logstream.log')
    log_file = get_log_file(s3, BUCKET_NAME, KEY)
    log_event_batch_collection = create_log_events(log_file, BATCH_SIZE)
    create_log_stream_and_push_log_events(logs, LOG_GROUP, LOG_STREAM, log_event_batch_collection, PUSH_DELAY)
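Besides the 10,000-event limit, put_log_events also limits a batch to 1,048,576 bytes, counted as the UTF-8 size of every message plus 26 bytes per event. The following is a minimal sketch of a batching helper that respects both limits and never emits an empty batch; `batch_log_events` is a hypothetical name and is not part of boto3.

```python
MAX_EVENTS_PER_BATCH = 10_000
MAX_BATCH_BYTES = 1_048_576
EVENT_OVERHEAD_BYTES = 26

def batch_log_events(events, max_events=MAX_EVENTS_PER_BATCH, max_bytes=MAX_BATCH_BYTES):
    """Yield lists of events that stay within both the count and the size limit."""
    batch, batch_bytes = [], 0
    for event in events:
        size = len(event["message"].encode("utf-8")) + EVENT_OVERHEAD_BYTES
        # Close the current batch if adding this event would break a limit
        if batch and (len(batch) >= max_events or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch

# Small limits are used here only to make the behaviour visible.
events = [{"timestamp": 0, "message": "x" * 10} for _ in range(5)]
batches = list(batch_log_events(events, max_events=2))
print([len(b) for b in batches])
```

Each yielded list could then be passed as `logEvents` in place of the batches built by create_log_events.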

Related

Botocore Stubber - Unable to locate credentials

I'm working on unit tests for my Lambda, which gets some files from S3, processes them, and loads their data into DynamoDB. I created botocore stubbers that are used during the tests, but I get botocore.exceptions.NoCredentialsError: Unable to locate credentials.
My Lambda handler code:
import boto3

s3_client = boto3.client('s3')
ddb_client = boto3.resource('dynamodb', region_name='eu-west-1')

def lambda_handler(event, context):
    for record in event['Records']:
        s3_event = record.get('s3')
        bucket = s3_event.get('bucket', {}).get('name', '')
        file_key = s3_event.get('object', {}).get('key', '')
        file = s3_client.get_object(Bucket=bucket, Key=file_key)
And the tests file:
import unittest
from unittest.mock import ANY

import boto3
import botocore.session
from botocore.stub import Stubber

import s3_to_ddb_handler  # module under test

class TestLambda(unittest.TestCase):
    def setUp(self) -> None:
        self.session = botocore.session.get_session()
        # S3 Stubber Set Up
        self.s3_client = self.session.create_client('s3', region_name='eu-west-1')
        self.s3_stubber = Stubber(self.s3_client)
        # DDB Stubber Set Up
        self.ddb_resource = boto3.resource('dynamodb', region_name='eu-west-1')
        self.ddb_stubber = Stubber(self.ddb_resource.meta.client)

    def test_s3_to_ddb_handler(self) -> None:
        event = {}
        with self.s3_stubber:
            with self.ddb_stubber:
                response = s3_to_ddb_handler.s3_to_ddb_handler(event, ANY)
The issue seems to be that an actual call to AWS is made, which shouldn't happen because the stubber should be used. How can I force that?
You need to call .activate() on your Stubber instances: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/stubber.html#botocore.stub.Stubber.activate
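One more thing worth checking in the test above: the handler module creates its own `s3_client` at import time, so stubbing a different client object in `setUp` would not affect the client the handler actually uses. A minimal sketch of one way around this, assuming you are free to change the handler's signature, is to pass the client in so the test can hand over a stubbed client (or, as below, a plain stand-in object). `FakeS3Client` and `process_records` are hypothetical names used only for illustration.

```python
class FakeS3Client:
    """Stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def get_object(self, Bucket, Key):
        self.calls.append((Bucket, Key))
        return {"Body": b"file contents"}

def process_records(event, s3_client):
    """Fetch every object referenced by an S3 event using the given client."""
    results = []
    for record in event.get("Records", []):
        s3_event = record.get("s3", {})
        bucket = s3_event.get("bucket", {}).get("name", "")
        key = s3_event.get("object", {}).get("key", "")
        results.append(s3_client.get_object(Bucket=bucket, Key=key))
    return results

fake = FakeS3Client()
event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "data.csv"}}}]}
process_records(event, fake)
print(fake.calls)
```

The same idea applies with botocore: build the client in the test, wrap it in a Stubber, and pass that same client object to the handler.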

Copying S3 objects from one account to other using Lambda python

I'm using boto3 to copy files from an S3 bucket in one account to a bucket in another. I need functionality similar to aws s3 sync. Please see my code below. My company has decided to 'PULL' from the other S3 bucket (source account). Please don't suggest replication, S3 batch, an S3-triggered Lambda, etc. We have gone through all these options and my management does not want to do any configuration on the source side. Can you please review this code and let me know whether it will work for thousands of objects? The source bucket has nearly 10000 objects. We will create this Lambda function in the destination account and create a CloudWatch event to trigger it once a day.
I am checking the ETag so that modified files are copied across when this function is triggered.
Edit: I simplified my code just to see whether pagination works. It works if I don't add client.copy(). If I add this line in the for loop, then after reading 3 or 4 objects it throws "errorMessage": "2021-08-07T15:29:07.827Z 82757747-7b72-4f29-ae9f-22e95f969d6c Task timed out after 3.00 seconds". Please advise. Please note that the 'test/' folder in my source bucket has around 1100 objects.
import os
import logging
import boto3
import botocore

logger = logging.getLogger()
logger.setLevel(os.getenv('debug_level', 'INFO'))
client = boto3.client('s3')

def handler(event, context):
    main(event, logger)

def main(event, logger):
    try:
        SOURCE_BUCKET = os.environ.get('SRC_BUCKET')
        DEST_BUCKET = os.environ.get('DST_BUCKET')
        REGION = os.environ.get('REGION')
        prefix = 'test/'
        # Create a reusable Paginator
        paginator = client.get_paginator('list_objects_v2')
        print ('after paginator')
        # Create a PageIterator from the Paginator
        page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET,Prefix = prefix)
        print ('after page iterator')
        index = 0
        for page in page_iterator:
            for obj in page['Contents']:
                index += 1
                print ("I am looking for {} in the source bucket".format(obj['ETag']))
                copy_source = {'Bucket': SOURCE_BUCKET, 'Key': obj['Key']}
                client.copy(copy_source, DEST_BUCKET, obj['Key'])
        logger.info("number of objects copied {}:".format(index))
    except botocore.exceptions.ClientError as e:
        raise
This version is working fine if I increase the Lambda timeout to 15 min and memory to 512MB. This checks if the source object already exists in destination before copying.
import boto3
import os
import logging
import botocore
from botocore.client import Config

logger = logging.getLogger()
logger.setLevel(os.getenv('debug_level', 'INFO'))
config = Config(connect_timeout=5, retries={'max_attempts': 0})
client = boto3.client('s3', config=config)
#client = boto3.client('s3')

def handler(event, context):
    main(event, logger)

def main(event, logger):
    try:
        DEST_BUCKET = os.environ.get('DST_BUCKET')
        SOURCE_BUCKET = os.environ.get('SRC_BUCKET')
        REGION = os.environ.get('REGION')
        prefix = ''
        # Create a reusable Paginator
        paginator = client.get_paginator('list_objects_v2')
        print ('after paginator')
        # Create a PageIterator from the Paginator
        page_iterator_src = paginator.paginate(Bucket=SOURCE_BUCKET,Prefix = prefix)
        page_iterator_dest = paginator.paginate(Bucket=DEST_BUCKET,Prefix = prefix)
        print ('after page iterator')
        index = 0
        for page_source in page_iterator_src:
            for obj_src in page_source['Contents']:
                flag = "FALSE"
                for page_dest in page_iterator_dest:
                    for obj_dest in page_dest['Contents']:
                        # checks if source ETag already exists in destination
                        if obj_src['ETag'] in obj_dest['ETag']:
                            flag = "TRUE"
                            break
                    if flag == "TRUE":
                        break
                if flag != "TRUE":
                    index += 1
                    client.copy_object(Bucket=DEST_BUCKET, CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj_src['Key']}, Key=obj_src['Key'],)
                    print ("source ETag {} and destination ETag {}".format(obj_src['ETag'],obj_dest['ETag']))
                    print ("source Key {} and destination Key {}".format(obj_src['Key'],obj_dest['Key']))
        print ("Number of objects copied{}".format(index))
        logger.info("number of objects copied {}:".format(index))
    except botocore.exceptions.ClientError as e:
        raise
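The nested loops above appear to walk the destination listing again for every source object, which is likely why the function needs the longer timeout. A minimal sketch of an alternative, assuming both listings fit in memory: index the destination once by key, then compare ETags with a dictionary lookup. `keys_to_copy` is a hypothetical helper, and the sample listings stand in for the 'Contents' entries returned by list_objects_v2.

```python
def keys_to_copy(source_objects, dest_objects):
    """Return source keys that are missing from, or differ in, the destination."""
    dest_etags = {obj["Key"]: obj["ETag"] for obj in dest_objects}
    return [
        obj["Key"]
        for obj in source_objects
        if dest_etags.get(obj["Key"]) != obj["ETag"]
    ]

# Sample listings, for illustration only.
source = [
    {"Key": "a.txt", "ETag": '"111"'},
    {"Key": "b.txt", "ETag": '"222"'},
    {"Key": "c.txt", "ETag": '"333"'},
]
dest = [
    {"Key": "a.txt", "ETag": '"111"'},
    {"Key": "b.txt", "ETag": '"999"'},
]
print(keys_to_copy(source, dest))
```

Note that an ETag is not always a plain content hash (for example, for multipart uploads), so treat this comparison as a heuristic rather than a guarantee.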

Dump lambda output to csv and have it email as an attachment

I have a lambda function that generates a list of untagged buckets in AWS environment. Currently I send the output to a slack channel directly. Instead I would like to have my lambda dump the output to a csv file and send it as a report. Here is the code for it, let me know if you need any other details.
import boto3
from botocore.exceptions import ClientError
import urllib3
import json

http = urllib3.PoolManager()

def lambda_handler(event, context):
    #Printing the S3 buckets with no tags
    s3 = boto3.client('s3')
    s3_re = boto3.resource('s3')
    buckets = []
    print('Printing buckets with no tags..')
    for bucket in s3_re.buckets.all():
        s3_bucket = bucket
        s3_bucket_name = s3_bucket.name
        try:
            response = s3.get_bucket_tagging(Bucket=s3_bucket_name)
        except ClientError:
            buckets.append(bucket)
            print(bucket)
    for bucket in buckets:
        data = {"text": "%s bucket has no tags" % (bucket)}
        r = http.request("POST", "https://hooks.slack.com/services/~/~/~",
                         body = json.dumps(data),
                         headers = {"Content-Type": "application/json"})
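A minimal sketch of the CSV-plus-attachment part, using only the standard library: it writes the bucket names to an in-memory CSV and attaches it to an email message. The addresses, the subject, and the `build_report_email` name are placeholders, and the function assumes it receives bucket names as strings (for example `bucket.name` for each entry collected above).

```python
import csv
import io
from email.message import EmailMessage

def build_report_email(bucket_names, sender, recipient):
    """Return an EmailMessage with the bucket names attached as a CSV file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["bucket_name"])
    for name in bucket_names:
        writer.writerow([name])

    msg = EmailMessage()
    msg["Subject"] = "Untagged S3 buckets"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The attached CSV lists buckets without tags.")
    msg.add_attachment(
        buffer.getvalue().encode("utf-8"),
        maintype="text",
        subtype="csv",
        filename="untagged_buckets.csv",
    )
    return msg

# Placeholder addresses, for illustration only.
message = build_report_email(["bucket-a", "bucket-b"], "sender@example.com", "team@example.com")
print(message.get_content_type())
```

One way to send the result would be Amazon SES, passing `message.as_bytes()` as the `Data` of `RawMessage` to `send_raw_email`; that assumes SES is set up and the sender address is verified.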

How to get boto3 to display _all_ RDS instances?

Trying to get all RDS instances with boto3 - does not return all RDS instances.
When I look at my RDS instances in Oregon (us-west-2), I see the following:
However, if I run the below Python3 script, I only get one result:
$ python3 ./stackoverflow.py
RDS instances in Oregon
------------------------------
aurora-5-7-yasmin.cazdggrmkpt1.us-west-2.rds.amazonaws.com qa test db.t2.small aurora-5-7-yasmin
$
Can you suggest a way to get boto3 to display all RDS instances?
$ cat ./stackoverflow.py
import collections
import boto3
import datetime
import pygsheets

REGIONS = ('us-west-2',)
REGIONS_H = ('Oregon',)
currentDT = str(datetime.datetime.now())

def create_spreadsheet(outh_file, spreadsheet_name = "AWS usage"):
    client = pygsheets.authorize(outh_file=outh_file, outh_nonlocal=True)
    client.list_ssheets(parent_id=None)
    spread_sheet = client.create(spreadsheet_name)
    return spread_sheet

def rds_worksheet_creation(spread_sheet):
    for i in range(len(REGIONS)):
        region = REGIONS[i]
        region_h = REGIONS_H[i]
        print()
        print("{} instances in {}".format("RDS", region_h))
        print("------------------------------")
        client = boto3.client('rds', region_name=region)
        db_instances = client.describe_db_instances()
        for i in range(len(db_instances)):
            j = i - 1
            try:
                DBName = db_instances['DBInstances'][j]['DBName']
                MasterUsername = db_instances['DBInstances'][0]['MasterUsername']
                DBInstanceClass = db_instances['DBInstances'][0]['DBInstanceClass']
                DBInstanceIdentifier = db_instances['DBInstances'][0]['DBInstanceIdentifier']
                Endpoint = db_instances['DBInstances'][0]['Endpoint']
                Address = db_instances['DBInstances'][0]['Endpoint']['Address']
                print("{} {} {} {} {}".format(Address, MasterUsername, DBName, DBInstanceClass,
                      DBInstanceIdentifier))
            except KeyError:
                continue

if __name__ == "__main__":
    spread_sheet = create_spreadsheet(spreadsheet_name = "AWS usage", outh_file = '../client_secret.json')
    spread_sheet.link(syncToCloud=False)
    rds_worksheet_creation(spread_sheet)
$ cat ../client_secret.json
{"installed":{"client_id":"362799999999-uml0m2XX4v999999mr2s03XX9g8l9odi.apps.googleusercontent.com","project_id":"amiable-shuttle-198516","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://accounts.google.com/o/oauth2/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"XXXXxQH434Qg-xxxx99_n0vW","redirect_uris":["urn:ietf:wg:oauth:2.0:oob","http://localhost"]}}
$
Edit 1:
Following Michael's comment, I changed the script to the following, but even though one more related line appeared, most of the RDS instances are still not returned:
$ python3 ./stackoverflow.py
RDS instances in Oregon
------------------------------
aurora-5-7-yasmin.cazdggrmkpt1.us-west-2.rds.amazonaws.com qa +++ DBName gave KeyError +++ db.t2.small aurora-5-7-yasmin
aurora-5-7-yasmin.cazdggrmkpt1.us-west-2.rds.amazonaws.com qa test db.t2.small aurora-5-7-yasmin
$
$ cat ./stackoverflow.py
import collections
import boto3
import datetime
import pygsheets

REGIONS = ('us-west-2',)
REGIONS_H = ('Oregon',)
currentDT = str(datetime.datetime.now())

def create_spreadsheet(outh_file, spreadsheet_name = "AWS usage"):
    client = pygsheets.authorize(outh_file=outh_file, outh_nonlocal=True)
    client.list_ssheets(parent_id=None)
    spread_sheet = client.create(spreadsheet_name)
    return spread_sheet

def rds_worksheet_creation(spread_sheet):
    for i in range(len(REGIONS)):
        region = REGIONS[i]
        region_h = REGIONS_H[i]
        print()
        print("{} instances in {}".format("RDS", region_h))
        print("------------------------------")
        client = boto3.client('rds', region_name=region)
        db_instances = client.describe_db_instances()
        for i in range(len(db_instances)):
            j = i - 1
            try:
                DBName = db_instances['DBInstances'][j]['DBName']
            except KeyError:
                DBName = "+++ DBName gave KeyError +++"
            MasterUsername = db_instances['DBInstances'][0]['MasterUsername']
            DBInstanceClass = db_instances['DBInstances'][0]['DBInstanceClass']
            DBInstanceIdentifier = db_instances['DBInstances'][0]['DBInstanceIdentifier']
            Endpoint = db_instances['DBInstances'][0]['Endpoint']
            Address = db_instances['DBInstances'][0]['Endpoint']['Address']
            print("{} {} {} {} {}".format(Address, MasterUsername, DBName, DBInstanceClass,
                  DBInstanceIdentifier))

if __name__ == "__main__":
    spread_sheet = create_spreadsheet(spreadsheet_name = "AWS usage", outh_file = '../client_secret.json')
    spread_sheet.link(syncToCloud=False)
    rds_worksheet_creation(spread_sheet)
You have an error in your original code but if you want this code to scale to a large number of instances (it is unlikely you'll need this) then you'll want to use something like the following:
import boto3

available_regions = boto3.Session().get_available_regions('rds')
for region in available_regions:
    rds = boto3.client('rds', region_name=region)
    paginator = rds.get_paginator('describe_db_instances').paginate()
    for page in paginator:
        for dbinstance in page['DBInstances']:
            print("{DBInstanceClass}".format(**dbinstance))
You can get rid of the paginator and just use the first loop if you know each region will have fewer than 100 instances:
for region in available_regions:
    rds = boto3.client('rds', region_name=region)
    # describe_db_instances() returns a dict; the instances are under 'DBInstances'
    for dbinstance in rds.describe_db_instances()['DBInstances']:
        print("{DBInstanceClass}".format(**dbinstance))
Additionally you can provide a simple
dbinstance.get('DBName', 'No Name Set')
instead of excepting around the KeyError.
Your for loop range gets the value 2 because db_instances is a dict, so len(db_instances) counts its top-level keys rather than the instances.
Instead of
for i in range(len(db_instances)):
it should be
for i in range(len(db_instances['DBInstances'])):
db_instances['DBInstances'] is a list, so this gives the correct length for the loop.
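The difference is easy to see on a hand-written dictionary shaped like a describe_db_instances() response. The sample values below are made up for illustration; only the two top-level keys and the per-instance field names follow the real response.

```python
# Made-up sample shaped like a describe_db_instances() response.
response = {
    "DBInstances": [
        {"DBInstanceIdentifier": "db-one", "DBInstanceClass": "db.t2.small", "DBName": "test"},
        {"DBInstanceIdentifier": "db-two", "DBInstanceClass": "db.t2.medium"},
        {"DBInstanceIdentifier": "db-three", "DBInstanceClass": "db.t2.small"},
    ],
    "ResponseMetadata": {"HTTPStatusCode": 200},
}

print(len(response))                 # counts the top-level keys
print(len(response["DBInstances"]))  # counts the instances

# Iterate over the list directly and tolerate a missing DBName.
rows = [
    (db["DBInstanceIdentifier"], db["DBInstanceClass"], db.get("DBName", "No Name Set"))
    for db in response["DBInstances"]
]
for row in rows:
    print(*row)
```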
This code will list the RDS instances present in the account.
Try this working code:
#!/usr/bin/env python
import boto3

client = boto3.client('rds')
response = client.describe_db_instances()
for i in response['DBInstances']:
    db_name = i.get('DBName')  # DBName is not set on every instance
    db_instance_name = i['DBInstanceIdentifier']
    db_type = i['DBInstanceClass']
    db_storage = i['AllocatedStorage']
    db_engine = i['Engine']
    print(db_instance_name, db_type, db_storage, db_engine)
FYI, the more Pythonic way to do loops in this case would be:
for instance in db_instances['DBInstances']:
    MasterUsername = instance['MasterUsername']
    DBInstanceClass = instance['DBInstanceClass']
etc.
This avoids the need for i-type iterators.

AWS Lambda error No module named 'StringIO' or No module named 'StringIO'

I'm trying to use AWS Lambda for mass email sending, using the code from the link below:
https://aws.amazon.com/cn/premiumsupport/knowledge-center/mass-email-ses-lambda/
from __future__ import print_function
import StringIO
import csv
import json
import os
import urllib
import zlib
from time import strftime, gmtime
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import boto3
import botocore
import concurrent.futures
__author__ = 'Said Ali Samed'
__date__ = '10/04/2016'
__version__ = '1.0'
# Get Lambda environment variables
region = os.environ['us-east-1']
max_threads = os.environ['10']
text_message_file = os.environ['email_body.txt']
html_message_file = os.environ['email_body.html']
# Initialize clients
s3 = boto3.client('s3', region_name=region)
ses = boto3.client('ses', region_name=region)
send_errors = []
mime_message_text = ''
mime_message_html = ''
def current_time():
    return strftime("%Y-%m-%d %H:%M:%S UTC", gmtime())

def mime_email(subject, from_address, to_address, text_message=None, html_message=None):
    msg = MIMEMultipart('alternative')
    msg['Subject'] = subject
    msg['From'] = from_address
    msg['To'] = to_address
    if text_message:
        msg.attach(MIMEText(text_message, 'plain'))
    if html_message:
        msg.attach(MIMEText(html_message, 'html'))
    return msg.as_string()
def send_mail(from_address, to_address, message):
    global send_errors
    try:
        response = ses.send_raw_email(
            Source=from_address,
            Destinations=[
                to_address,
            ],
            RawMessage={
                'Data': message
            }
        )
        if not isinstance(response, dict): # log failed requests only
            send_errors.append('%s, %s, %s' % (current_time(), to_address, response))
    except botocore.exceptions.ClientError as e:
        send_errors.append('%s, %s, %s, %s' %
                           (current_time(),
                            to_address,
                            ', '.join("%s=%r" % (k, v) for (k, v) in e.response['ResponseMetadata'].iteritems()),
                            e.message))
def lambda_handler(event, context):
    global send_errors
    global mime_message_text
    global mime_message_html
    try:
        # Read the uploaded csv file from the bucket into python dictionary list
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')
        response = s3.get_object(Bucket=bucket, Key=key)
        body = zlib.decompress(response['Body'].read(), 16+zlib.MAX_WBITS)
        reader = csv.DictReader(StringIO.StringIO(body),
                                fieldnames=['from_address', 'to_address', 'subject', 'message'])
        # Read the message files
        try:
            response = s3.get_object(Bucket=bucket, Key=text_message_file)
            mime_message_text = response['Body'].read()
        except:
            mime_message_text = None
            print('Failed to read text message file. Did you upload %s?' % text_message_file)
        try:
            response = s3.get_object(Bucket=bucket, Key=html_message_file)
            mime_message_html = response['Body'].read()
        except:
            mime_message_html = None
            print('Failed to read html message file. Did you upload %s?' % html_message_file)
        if not mime_message_text and not mime_message_html:
            raise ValueError('Cannot continue without a text or html message file.')
        # Send in parallel using several threads
        e = concurrent.futures.ThreadPoolExecutor(max_workers=max_threads)
        for row in reader:
            from_address = row['from_address'].strip()
            to_address = row['to_address'].strip()
            subject = row['subject'].strip()
            message = mime_email(subject, from_address, to_address, mime_message_text, mime_message_html)
            e.submit(send_mail, from_address, to_address, message)
        e.shutdown()
    except Exception as e:
        print(e.message + ' Aborting...')
        raise e
    print('Send email complete.')
    # Remove the uploaded csv file
    try:
        response = s3.delete_object(Bucket=bucket, Key=key)
        if 'ResponseMetadata' in response.keys() and response['ResponseMetadata']['HTTPStatusCode'] == 204:
            print('Removed s3://%s/%s' % (bucket, key))
    except Exception as e:
        print(e)
    # Upload errors if any to S3
    if len(send_errors) > 0:
        try:
            result_data = '\n'.join(send_errors)
            logfile_key = key.replace('.csv.gz', '') + '_error.log'
            response = s3.put_object(Bucket=bucket, Key=logfile_key, Body=result_data)
            if 'ResponseMetadata' in response.keys() and response['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('Send email errors saved in s3://%s/%s' % (bucket, logfile_key))
        except Exception as e:
            print(e)
            raise e
    # Reset publish error log
    send_errors = []

if __name__ == "__main__":
    json_content = json.loads(open('event.json', 'r').read())
    lambda_handler(json_content, None)
But it fails. When I choose Python 2.7, the error is
module initialization error 'us-east-1'
When I choose Python 3.6, the error is
Unable to import module 'lambda_function': No module named 'StringIO'
Can anyone tell me what the problem is?
In Python 3, the StringIO module is gone. Instead, import the io module and use io.StringIO.
The problem with the Python 2.7 version is presumably that the following statement is failing:
region = os.environ['us-east-1']
This will result in a KeyError if us-east-1 is not an available environment variable. Instead use AWS_REGION or AWS_DEFAULT_REGION. See the full list of Lambda environment variables.
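For the Python 3 import error, here is a minimal sketch of the CSV-reading step rewritten with io.StringIO. It assumes the uploaded file is gzip-compressed UTF-8 text, as the original script does; `read_recipients` is a hypothetical helper name. The script uses other Python 2 idioms that would also need changing under Python 3, such as urllib.unquote_plus (now urllib.parse.unquote_plus), .iteritems() and e.message.

```python
import csv
import gzip
import io
from urllib.parse import unquote_plus

FIELDNAMES = ['from_address', 'to_address', 'subject', 'message']

def read_recipients(compressed_body):
    """Decompress a gzipped CSV payload and return its rows as dictionaries."""
    # io.StringIO needs text, so decode the decompressed bytes first
    text = gzip.decompress(compressed_body).decode('utf-8')
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDNAMES)
    return list(reader)

# Sample payload, for illustration only.
sample = gzip.compress(b"from@example.com,to@example.com,Hello,ignored\n")
rows = read_recipients(sample)
print(rows[0]['to_address'])

# Python 3 replacement for urllib.unquote_plus(...).decode('utf8')
print(unquote_plus('uploads/my+file.csv.gz'))
```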
Please set the environment variables as described in step 4 of the article:
"Configure Lambda environment variables appropriate to your usage scenario. For example, the following variables would be valid for a given use case:
REGION=us-east-1, MAX_THREADS=10, TEXT_MESSAGE_FILE=email_body.txt, HTML_MESSAGE_FILE=email_body.html."
What was done (as per the code provided in the question) is replacing names of environment variables with their values, which means that python is looking for e.g. 'us-east-1' environment variable which isn't there...
This is the original code
# Get Lambda environment variables
region = os.environ['REGION']
max_threads = os.environ['MAX_THREADS']
text_message_file = os.environ['TEXT_MESSAGE_FILE']
html_message_file = os.environ['HTML_MESSAGE_FILE']
You can also hard-code the values, like below:
# Get Lambda environment variables
region = 'us-east-1'
max_threads = '10'
text_message_file = 'email_body.txt'
html_message_file = 'email_body.html'
but I'd suggest setting the environment variables instead (and using the version of the script provided by the article's author). For setting environment variables in Lambda, see this article.
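A middle ground between the two variants is to read the environment variables but fall back to defaults when they are missing. This is a sketch only: the variable names come from the article's step 4, the default values are the article's example values, and `load_settings` is a hypothetical helper. MAX_THREADS is converted to an integer because environment variables are always strings, while a thread pool's max_workers expects a number.

```python
import os

def load_settings(env=os.environ):
    """Read the Lambda settings from a mapping, falling back to example defaults."""
    return {
        # Prefer the custom REGION variable, then the AWS_REGION variable set by Lambda
        "region": env.get("REGION", env.get("AWS_REGION", "us-east-1")),
        "max_threads": int(env.get("MAX_THREADS", "10")),
        "text_message_file": env.get("TEXT_MESSAGE_FILE", "email_body.txt"),
        "html_message_file": env.get("HTML_MESSAGE_FILE", "email_body.html"),
    }

# A plain dict stands in for os.environ here, for illustration only.
settings = load_settings({"REGION": "eu-west-1", "MAX_THREADS": "4"})
print(settings)
```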