How can I create a custom metric watching EFS metered size in AWS Cloudwatch? - amazon-web-services

Title pretty much says it all - since the EFS metered size (usage) is not a metric that I can use in CloudWatch, I need to create a custom metric watching the last metered file size in EFS.
Is there any possibility to do so? Or is there maybe an even better way to monitor the size of my EFS?

I would recommend using a Lambda, running every hour or so and sending the data into CloudWatch.
This code gathers all the EFS file systems and sends their size (in kilobytes) to CloudWatch along with the file system name. Modify it to suit your needs:
import boto3

region = "us-east-1"

def push_efs_size_metric(region):
    efs_names = []
    efs = boto3.client('efs', region_name=region)
    cw = boto3.client('cloudwatch', region_name=region)
    efs_file_systems = efs.describe_file_systems()['FileSystems']
    for fs in efs_file_systems:
        # 'Name' is only present when the file system has a Name tag,
        # so fall back to the file system ID
        name = fs.get('Name', fs['FileSystemId'])
        efs_names.append(name)
        cw.put_metric_data(
            Namespace="EFS Metrics",
            MetricData=[
                {
                    'MetricName': 'EFS Size',
                    'Dimensions': [
                        {
                            'Name': 'EFS_Name',
                            'Value': name
                        }
                    ],
                    'Value': fs['SizeInBytes']['Value'] / 1024,
                    'Unit': 'Kilobytes'
                }
            ]
        )
    return efs_names

def cloudtrail_handler(event, context):
    response = push_efs_size_metric(region)
    print({
        'EFS Names': response
    })
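Once the metric is being published, you can watch it like any other CloudWatch metric, for example by putting an alarm on it. A minimal sketch, assuming the namespace and dimension from the code above (the alarm name, threshold and SNS topic ARN are placeholders):

import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')
cw.put_metric_alarm(
    AlarmName='efs-size-above-threshold',                      # placeholder name
    Namespace='EFS Metrics',
    MetricName='EFS Size',
    Dimensions=[{'Name': 'EFS_Name', 'Value': 'my-efs'}],      # placeholder file system name
    Statistic='Maximum',
    Period=3600,                                               # matches the hourly Lambda
    EvaluationPeriods=1,
    Threshold=50 * 1024 * 1024,                                # 50 GB, expressed in kilobytes
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:my-topic']  # placeholder SNS topic
)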
I'd also suggest reading up on the reference below for more details on creating custom metrics.
References
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html

Related

How do I use AWS Lambda to trigger Comprehend with S3?

I'm currently using AWS Lambda to trigger an Amazon Comprehend job, but the code only runs one piece of text through sentiment analysis.
import boto3

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    bucket = "bucketName"
    key = "textName.txt"
    file = s3.get_object(Bucket=bucket, Key=key)
    analysisdata = str(file['Body'].read())
    comprehend = boto3.client("comprehend")
    sentiment = comprehend.detect_sentiment(Text=analysisdata, LanguageCode="en")
    print(sentiment)
    return 'Sentiment detected'
I want to run a file where each line in the text file is a new piece of text to analyze with sentiment analysis (it's an option if you manually enter text into Comprehend). Is there a way to alter this code to do that, and have the output sentiment analysis file placed into that same S3 bucket? Thank you in advance.
It looks like you can use start_sentiment_detection_job():
response = client.start_sentiment_detection_job(
    InputDataConfig={
        'S3Uri': 'string',
        'InputFormat': 'ONE_DOC_PER_FILE'|'ONE_DOC_PER_LINE',
        'DocumentReaderConfig': {
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT'|'TEXTRACT_ANALYZE_DOCUMENT',
            'DocumentReadMode': 'SERVICE_DEFAULT'|'FORCE_DOCUMENT_READ_ACTION',
            'FeatureTypes': [
                'TABLES'|'FORMS',
            ]
        }
    },
    OutputDataConfig={
        'S3Uri': 'string',
        'KmsKeyId': 'string'
    },
    ...
)
It can read from an object in Amazon S3 (S3Uri) and store the output in an S3 object.
It looks like you could use 'InputFormat': 'ONE_DOC_PER_LINE' to meet your requirements.
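As a rough sketch of what that could look like for the per-line case (the bucket, prefix, job name and IAM role ARN below are placeholders; the job also needs a DataAccessRoleArn that Comprehend can assume to read the input and write the output in S3):

import boto3

comprehend = boto3.client('comprehend')

response = comprehend.start_sentiment_detection_job(
    JobName='line-by-line-sentiment',                                        # placeholder
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendS3Access',   # placeholder
    InputDataConfig={
        'S3Uri': 's3://bucketName/textName.txt',
        'InputFormat': 'ONE_DOC_PER_LINE'       # treat each line as a separate document
    },
    OutputDataConfig={
        'S3Uri': 's3://bucketName/comprehend-output/'   # results land back in the same bucket
    }
)
print(response['JobId'])

The job runs asynchronously; you can poll it with describe_sentiment_detection_job, and the results are written back under the output S3Uri.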

Export DynamoDb metrics logs to S3 or CloudWatch

I'm trying to use DynamoDB metrics logs in an external observability tool.
To do that, I'll need to get this log data from S3 or CloudWatch log groups (not from Insights or CloudTrail).
For this reason, if there isn't a way to use CloudWatch, I'll need to export these metric logs from DynamoDB to S3 and, from there, either export them to CloudWatch or read the data directly from S3.
Do you know if this is possible?
You could try using Logstash; it has input plugins for CloudWatch and S3:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-cloudwatch.html
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-s3.html
AWS publishes DynamoDB metrics (at the table-operation, table, and account level) to CloudWatch, and you can also create as many custom metrics as you need. If you use Python, you can read them with boto3; the CloudWatch client has a get_metric_data method.
Try this with your metrics:
import boto3
from datetime import date, datetime, timedelta

cloudwatch_client = boto3.client('cloudwatch')

yesterday = date.today() - timedelta(days=1)
today = date.today()

response = cloudwatch_client.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'some_request',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/DynamoDB',   # AWS-provided DynamoDB metrics live in this namespace
                    'MetricName': 'metric_name',   # e.g. ConsumedReadCapacityUnits
                    'Dimensions': []               # e.g. [{'Name': 'TableName', 'Value': 'my-table'}]
                },
                'Period': 3600,
                'Stat': 'Sum',
            }
        },
    ],
    StartTime=datetime(yesterday.year, yesterday.month, yesterday.day),
    EndTime=datetime(today.year, today.month, today.day),
)
print(response)

Creating a CloudWatch Metrics from the Athena Query results

My Requirement
I want to create a CloudWatch-Metric from Athena query results.
Example
I want to create a metric like user_count of each day.
In Athena, I will write an SQL query like this
select date,count(distinct user) as count from users_table group by 1
In the Athena editor I can see the result, but I want to see these results as a metric in CloudWatch.
CloudWatch-Metric-Name ==> user_count
Dimensions ==> Date,count
If I have this CloudWatch metric and these dimensions, I can easily create a monitoring dashboard and send alerts.
Can anyone suggest a way to do this?
You can use CloudWatch custom widgets, see "Run Amazon Athena queries" in Samples.
It's somewhat involved, but you can use a Lambda for this. In a nutshell:
Setup your query in Athena and make sure it works using the Athena console.
Create a Lambda that:
Runs your Athena query
Pulls the query results from S3
Parses the query results
Sends the query results to CloudWatch as a metric
Use EventBridge to run your Lambda on a recurring basis (a sketch of this step follows the Lambda code below)
Here's an example Lambda function in Python that does step #2. Note that the Lambda function will need IAM permissions to run queries in Athena, read the results from S3, and then put a metric into CloudWatch.
import time
import boto3

query = 'select count(*) from mytable'
DATABASE = 'default'
bucket = 'BUCKET_NAME'
path = 'yourpath'

def lambda_handler(event, context):
    # Run the query in Athena
    client = boto3.client('athena')
    output = "s3://{}/{}".format(bucket, path)

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )

    # The S3 file name uses the QueryExecutionId, so
    # grab it here so we can pull the S3 file.
    qeid = response["QueryExecutionId"]

    # Occasionally Athena hasn't written the file
    # before the Lambda tries to pull it out of S3, so pause a few seconds.
    # Note: you are charged for the time the Lambda is running.
    # A more elegant but more complicated solution would try to get the
    # file first and then sleep.
    time.sleep(3)

    # Get the query result from S3.
    s3 = boto3.client('s3')
    objectkey = path + "/" + qeid + ".csv"

    # Load the object as a file
    file_content = s3.get_object(
        Bucket=bucket,
        Key=objectkey)["Body"].read()

    # Split the file on line breaks
    lines = file_content.decode().splitlines()

    # Get the second line in the file (the first line is the CSV header)
    count = lines[1]

    # Remove double quotes
    count = count.replace("\"", "")

    # Convert the string to an int since CloudWatch wants a numeric value
    count = int(count)

    # Post the query result as a CloudWatch metric
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.put_metric_data(
        MetricData=[
            {
                'MetricName': 'MyMetric',
                'Dimensions': [
                    {
                        'Name': 'DIM1',
                        'Value': 'dim1'
                    },
                ],
                'Unit': 'None',
                'Value': count
            },
        ],
        Namespace='MyMetricNS'
    )
    return response
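For step 3, a minimal sketch of wiring up the recurring schedule through the CloudWatch Events/EventBridge API (the rule name, schedule rate and Lambda ARN below are placeholders; this can also be done in the console):

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

lambda_arn = 'arn:aws:lambda:us-east-1:123456789012:function:athena-to-cloudwatch'  # placeholder

# Create (or update) a rule that fires once per hour
rule = events.put_rule(
    Name='athena-metric-schedule',          # placeholder rule name
    ScheduleExpression='rate(1 hour)'
)

# Point the rule at the Lambda function
events.put_targets(
    Rule='athena-metric-schedule',
    Targets=[{'Id': 'athena-metric-lambda', 'Arn': lambda_arn}]
)

# Allow EventBridge to invoke the Lambda
lambda_client.add_permission(
    FunctionName='athena-to-cloudwatch',    # placeholder function name
    StatementId='allow-eventbridge-schedule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)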

AWS Lambda - Copy monthly snapshots to another region

I am trying to run a lambda that will kick off on a schedule to copy all snapshots taken the day prior to another region for DR purposes. I have a bit of code but it seems to not work as intended.
Symptoms:
It's grabbing the same snapshots multiple times and copying them
It always errors out on 2 particular snapshots; I don't know enough coding to write a log to figure out why. These snapshots work if I copy them manually, though.
import boto3
from datetime import date, timedelta

SOURCE_REGION = 'us-east-1'
DEST_REGION = 'us-west-2'

ec2_source = boto3.client('ec2', region_name=SOURCE_REGION)
ec2_destination = boto3.client('ec2', region_name=DEST_REGION)

snaps = ec2_source.describe_snapshots(OwnerIds=['self'])['Snapshots']

yesterday = date.today() - timedelta(days=1)
yesterday_snaps = [s for s in snaps if s['StartTime'].date() == yesterday]

for yester_snap in yesterday_snaps:
    DestinationSnapshot = ec2_destination.copy_snapshot(
        SourceSnapshotId=yester_snap['SnapshotId'],
        SourceRegion=SOURCE_REGION,
        Encrypted=True,
        KmsKeyId='REMOVED FOR SECURITY',
        DryRun=False
    )
    DestinationSnapshotID = DestinationSnapshot['SnapshotId']
    ec2_destination.create_tags(
        Resources=[DestinationSnapshotID],
        Tags=yester_snap['Tags']
    )
    waiter = ec2_destination.get_waiter('snapshot_completed')
    waiter.wait(
        SnapshotIds=[DestinationSnapshotID],
        DryRun=False,
        WaiterConfig={'Delay': 10, 'MaxAttempts': 123}
    )
Debugging
You can debug by simply putting print() statements in your code.
For example:
for yester_snap in yesterday_snaps:
    print('Copying:', yester_snap['SnapshotId'])
    DestinationSnapshot = ec2_destination.copy_snapshot(...)
The logs will appear in CloudWatch Logs. You can access the logs via the Monitoring tab in the Lambda function. Make sure the Lambda function has AWSLambdaBasicExecutionRole permissions so that it can write to CloudWatch Logs.
Today/Yesterday
Be careful about your definition of yesterday. Amazon EC2 instances run in the UTC timezone, so your concept of today and yesterday might not match what is happening.
It might be better to add a tag to snapshots after they are copied (e.g. 'copied') rather than relying on dates to figure out which ones to copy; a rough sketch of that idea follows.
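A rough sketch of that idea, reusing the variable names from the question's code (the tag key 'Copied' is just an illustrative choice):

import boto3

SOURCE_REGION = 'us-east-1'
ec2_source = boto3.client('ec2', region_name=SOURCE_REGION)

snaps = ec2_source.describe_snapshots(OwnerIds=['self'])['Snapshots']

# Only copy snapshots that have not been tagged as copied yet
to_copy = [
    s for s in snaps
    if not any(t['Key'] == 'Copied' for t in s.get('Tags', []))
]

for snap in to_copy:
    # ... copy_snapshot / create_tags in the destination region, as in the question ...

    # Mark the source snapshot so it is never picked up again
    ec2_source.create_tags(
        Resources=[snap['SnapshotId']],
        Tags=[{'Key': 'Copied', 'Value': 'true'}]
    )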
CloudWatch Events rule
Rather than running this program once per day, an alternative method would be:
Create an Amazon CloudWatch Events rule that triggers on Snapshot creation:
{
  "source": [
    "aws.ec2"
  ],
  "detail-type": [
    "EBS Snapshot Notification"
  ],
  "detail": {
    "event": [
      "createSnapshot"
    ]
  }
}
Configure the rule to trigger an AWS Lambda function
In the Lambda function, copy the Snapshot that was just created
This way, the snapshots are copied immediately after they are created, and there is no need to search for them or figure out which snapshots to copy; a rough sketch of such a Lambda follows
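A minimal sketch of such a Lambda, assuming the EBS event detail carries the new snapshot's ARN in snapshot_id as shown in the AWS event examples (the regions and KMS key are placeholders):

import boto3

SOURCE_REGION = 'us-east-1'   # placeholder
DEST_REGION = 'us-west-2'     # placeholder

ec2_destination = boto3.client('ec2', region_name=DEST_REGION)

def lambda_handler(event, context):
    # The event detail carries the snapshot ARN, e.g.
    # "arn:aws:ec2::us-east-1:snapshot/snap-0123456789abcdef0"
    snapshot_arn = event['detail']['snapshot_id']
    snapshot_id = snapshot_arn.split('/')[-1]

    response = ec2_destination.copy_snapshot(
        SourceSnapshotId=snapshot_id,
        SourceRegion=SOURCE_REGION,
        Encrypted=True,
        KmsKeyId='REPLACE_WITH_KMS_KEY_ID',   # placeholder
        Description='DR copy of ' + snapshot_id
    )
    print('Copied', snapshot_id, 'to', response['SnapshotId'])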

How do you full text search an Amazon S3 bucket?

I have a bucket on S3 in which I have a large number of text files.
I want to search for some text within a text file. It contains raw data only.
And each text file has a different name.
For example, I have files like:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
and I want to search for text like "I am human" in the above text files.
How to achieve this? Is it even possible?
One way to do this is via Amazon CloudSearch, which can use S3 as a source. It works by building an index for rapid retrieval. This should work very well, but thoroughly check out the pricing model to make sure that it won't be too costly for you.
The alternative is, as Jack said, to transfer the files out of S3 to an EC2 instance and build a search application there.
Since October 1st, 2015, Amazon offers another search service, Amazon Elasticsearch Service, in more or less the same vein as CloudSearch; you can stream data to it from Amazon S3 buckets.
It works with a Lambda function: any new data sent to an S3 bucket triggers an event notification to the Lambda, which then updates the ES index.
All the steps are well detailed in the Amazon documentation, with Java and JavaScript examples.
At a high level, setting up to stream data to Amazon ES requires the following steps (a rough Python sketch of the Lambda follows the list):
Creating an Amazon S3 bucket and an Amazon ES domain
Creating a Lambda deployment package.
Configuring a Lambda function.
Granting authorization to stream data to Amazon ES.
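A rough Python sketch of such a Lambda along those lines, assuming the elasticsearch and requests-aws4auth packages are bundled in the deployment package and that the function is triggered by s3:ObjectCreated event notifications (the domain endpoint and index name are placeholders):

import boto3
from requests_aws4auth import AWS4Auth
from elasticsearch import Elasticsearch, RequestsHttpConnection

region = 'us-east-1'                                          # placeholder
host = 'search-mydomain-abc123.us-east-1.es.amazonaws.com'    # placeholder ES endpoint

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, 'es', session_token=credentials.token)

es = Elasticsearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Triggered by an s3:ObjectCreated:* event notification
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

        # Index the file's text so it becomes full-text searchable
        es.index(index='s3-files', id=key,
                 body={'bucket': bucket, 'key': key, 'content': body})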
Although not an AWS native service, there is Mixpeek, which runs text extraction like Tika, Tesseract and ImageAI on your S3 files then places them in a Lucene index to make them searchable.
You integrate it as follows:
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
    aws_access_key_id=aws['aws_access_key_id'],
    aws_secret_access_key=aws['aws_secret_access_key'],
    region_name='us-east-2',
    mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
    api_key=mixpeek_api_key
)
# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
  {
    "_id": "REDACTED",
    "api_key": "REDACTED",
    "highlights": [
      {
        "path": "document_str",
        "score": 0.8759502172470093,
        "texts": [
          {
            "type": "text",
            "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
          },
          {
            "type": "hit",
            "value": "Heartgard"
          },
          {
            "type": "text",
            "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
          }
        ]
      }
    ],
    "metadata": {
      "date_inserted": "2021-10-07 03:19:23.632000",
      "filename": "prescription.pdf"
    },
    "score": 0.13313256204128265
  }
]
Then you parse the results.
You can use Filestash (disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. Give it a bit of time to index the entire thing if you have a whole lot of data, and you should be good.
If you have an EMR cluster, you can create a Spark application and do the search there. We did this; it works as a distributed search.
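A rough PySpark sketch of that idea, to be submitted to the EMR cluster with spark-submit (the bucket, prefix and search phrase are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, input_file_name

spark = SparkSession.builder.appName("s3-text-search").getOrCreate()

# Read every text file under the prefix; each line becomes a row in the 'value' column
lines = spark.read.text("s3://my-bucket/myfolder/*.txt") \
             .withColumn("file", input_file_name())

# Keep only the lines that contain the phrase, along with the file they came from
matches = lines.filter(col("value").contains("I am human"))
matches.select("file", "value").show(truncate=False)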
I know this is really old, but hopefully someone finds my solution handy.
This is a python script, using boto3.
import boto3

def search_word(info, search_for):
    # True if the search string appears anywhere in the file contents
    return search_for in info

aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'

client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key
)

bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking#emailaddress.com'

search_results = []
search_results_keys = []

# Note: list_objects_v2 returns at most 1000 keys per call;
# use a paginator if the prefix holds more objects than that.
response = client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=bucket_prefix
)

for i in response['Contents']:
    obj = client.get_object(
        Bucket=bucket_name,
        Key=i['Key']
    )
    body = obj['Body'].read().decode("utf-8")
    key = i['Key']
    if search_word(body, search_for):
        search_results.append({key: body})
        search_results_keys.append(key)

# YOU CAN EITHER PRINT THE KEY (FILE NAME/DIRECTORY), OR A MAP WHERE THE KEY IS
# THE FILE NAME/DIRECTORY AND THE VALUE IS THE TEXT OF THE FILE
print(search_results)
print(search_results_keys)
There is a serverless and cheaper option available:
Use AWS Glue to convert the txt files into a table.
Use AWS Athena to run SQL queries on top of it (a rough sketch follows below).
I would also recommend storing the data as Parquet on S3; this makes the data size on S3 very small and the queries super fast!
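As a rough sketch of the Athena side, once Glue has produced a table (the database, table, column and output location below are placeholders for whatever your Glue crawler creates):

import boto3

athena = boto3.client('athena')

# Full-text-ish search: scan the Glue table for lines containing the phrase
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_text_table WHERE line LIKE '%I am human%'",  # placeholder table/column
    QueryExecutionContext={'Database': 'my_glue_database'},                    # placeholder database
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'}   # placeholder output location
)
print(response['QueryExecutionId'])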