Question
What is the S3 extended destination configuration, and where in the AWS documentation is it clearly explained what it is for?
As the name suggests, it must be related to the S3 destination. However, the S3 destination section of the AWS documentation makes no mention of it:
Choose Amazon S3 for Your Destination
If there are articles or blogs with a clear explanation, please provide pointers.
I have been looking for a clue in the documentation below, but as is often the case with AWS documentation, it is not clear. It looks partly related to input record format conversion or record processing.
Amazon Kinesis Data Firehose API Reference - ExtendedS3DestinationConfiguration
Describes the configuration of a destination in Amazon S3.
Amazon Kinesis Data Firehose Developer Guide PDF - Converting Input Record Format (API)
If you want Kinesis Data Firehose to convert the format of your input data from JSON
to Parquet or ORC, specify the optional DataFormatConversionConfiguration element in
ExtendedS3DestinationConfiguration ...
AWS CloudFormation - AWS::KinesisFirehose::DeliveryStream ExtendedS3DestinationConfiguration
The ExtendedS3DestinationConfiguration property type configures an Amazon S3 destination for an Amazon Kinesis Data Firehose delivery stream.
Extended S3 Destination
resource "aws_kinesis_firehose_delivery_stream" "extended_s3_stream" {
name = "terraform-kinesis-firehose-extended-s3-test-stream"
destination = "extended_s3"
extended_s3_configuration {
role_arn = "${aws_iam_role.firehose_role.arn}"
bucket_arn = "${aws_s3_bucket.bucket.arn}"
processing_configuration {
enabled = "true"
processors {
type = "Lambda"
parameters {
parameter_name = "LambdaArn"
parameter_value = "${aws_lambda_function.lambda_processor.arn}:$LATEST"
}
}
}
}
}
The Terraform documentation is the best at showing the difference between S3 and Extended S3 destinations: https://www.terraform.io/docs/providers/aws/r/kinesis_firehose_delivery_stream.html
Extended S3 inherits the S3 destination configuration parameters and adds extra ones such as data_format_conversion_configuration and error_output_prefix.
I am afraid the Kinesis Data Firehose documentation is so poorly written that I wonder how anyone is supposed to figure out how to use Firehose from the documentation alone.
It looks like Firehose originally just relayed data to the S3 bucket with no built-in transformation mechanism, which is why the S3 destination configuration has no processing configuration (see AWS::KinesisFirehose::DeliveryStream S3DestinationConfiguration).
Then, as described in Amazon Kinesis Firehose Data Transformation with AWS Lambda, a mechanism to transform records was introduced (seemingly around early 2017), and AWS::KinesisFirehose::DeliveryStream ExtendedS3DestinationConfiguration was added to support it.
Apparently people struggle to work out how to configure it:
Does Amazon Kinesis Firehose support Data Transformations programmatically?
Well, I figured it out after much effort and documentation scrounging. Who could figure it out just by reading the AWS documentation?
Amazon Kinesis Data Firehose Data Transformation
Firehose extended S3 configurations for lambda transformation
I could not figure it out from the AWS documentation, but after looking at actual implementations on the Internet, the required configuration appears to be as below.
Update
As per the suggestion by Kevin Eid:
Resource: aws_kinesis_firehose_delivery_stream
s3_configuration - (Optional) Required for non-S3 destinations. For S3 destination, use extended_s3_configuration instead.
The extended_s3_configuration object supports the same fields from s3_configuration as well as the following:
data_format_conversion_configuration - (Optional) Nested argument for the serializer, deserializer, and schema for converting data from the JSON format to the Parquet or ORC format before writing it to Amazon S3. More details given below.
error_output_prefix - (Optional) Prefix added to failed records before writing them to S3. This prefix appears immediately following the bucket name.
processing_configuration - (Optional) The data processing configuration. More details are given below.
s3_backup_mode - (Optional) The Amazon S3 backup mode. Valid values are Disabled and Enabled. Default value is Disabled.
s3_backup_configuration - (Optional) The configuration for backup in Amazon S3. Required if s3_backup_mode is Enabled. Supports the same fields as s3_configuration object.
I believe s3_configuration is still there only for compatibility/legacy reasons, so you only need to use extended_s3_configuration; the AWS documentation does not explain this properly. It is such a pity that the AWS documentation does not serve as the source of truth.
First off, the ExtendedS3DestinationConfiguration property type configures an Amazon S3 destination for an Amazon Kinesis Data Firehose delivery stream.
See:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-extendeds3destinationconfiguration.html
Thanks.
This little screenshot shows the new components in ExtendedS3DestinationConfiguration compared to S3DestinationConfiguration:
The API documentation also shows what the extended S3 configuration is and how it is defined:
{
    "RoleARN": "string",
    "BucketARN": "string",
    "Prefix": "string",
    "ErrorOutputPrefix": "string",
    "BufferingHints": {
        "SizeInMBs": integer,
        "IntervalInSeconds": integer
    },
    "CompressionFormat": "UNCOMPRESSED"|"GZIP"|"ZIP"|"Snappy",
    "EncryptionConfiguration": {
        "NoEncryptionConfig": "NoEncryption",
        "KMSEncryptionConfig": {
            "AWSKMSKeyARN": "string"
        }
    },
    "CloudWatchLoggingOptions": {
        "Enabled": true|false,
        "LogGroupName": "string",
        "LogStreamName": "string"
    },
    "ProcessingConfiguration": {
        "Enabled": true|false,
        "Processors": [
            {
                "Type": "Lambda",
                "Parameters": [
                    {
                        "ParameterName": "LambdaArn"|"NumberOfRetries"|"RoleArn"|"BufferSizeInMBs"|"BufferIntervalInSeconds",
                        "ParameterValue": "string"
                    }
                    ...
                ]
            }
            ...
        ]
    },
    "S3BackupMode": "Disabled"|"Enabled",
    "S3BackupConfiguration": {
        "RoleARN": "string",
        "BucketARN": "string",
        "Prefix": "string",
        "ErrorOutputPrefix": "string",
        "BufferingHints": {
            "SizeInMBs": integer,
            "IntervalInSeconds": integer
        },
        "CompressionFormat": "UNCOMPRESSED"|"GZIP"|"ZIP"|"Snappy",
        "EncryptionConfiguration": {
            "NoEncryptionConfig": "NoEncryption",
            "KMSEncryptionConfig": {
                "AWSKMSKeyARN": "string"
            }
        },
        "CloudWatchLoggingOptions": {
            "Enabled": true|false,
            "LogGroupName": "string",
            "LogStreamName": "string"
        }
    },
    "DataFormatConversionConfiguration": {
        "SchemaConfiguration": {
            "RoleARN": "string",
            "CatalogId": "string",
            "DatabaseName": "string",
            "TableName": "string",
            "Region": "string",
            "VersionId": "string"
        },
        "InputFormatConfiguration": {
            "Deserializer": {
                "OpenXJsonSerDe": {
                    "ConvertDotsInJsonKeysToUnderscores": true|false,
                    "CaseInsensitive": true|false,
                    "ColumnToJsonKeyMappings": {"string": "string" ...}
                },
                "HiveJsonSerDe": {
                    "TimestampFormats": ["string", ...]
                }
            }
        },
        "OutputFormatConfiguration": {
            "Serializer": {
                "ParquetSerDe": {
                    "BlockSizeBytes": integer,
                    "PageSizeBytes": integer,
                    "Compression": "UNCOMPRESSED"|"GZIP"|"SNAPPY",
                    "EnableDictionaryCompression": true|false,
                    "MaxPaddingBytes": integer,
                    "WriterVersion": "V1"|"V2"
                },
                "OrcSerDe": {
                    "StripeSizeBytes": integer,
                    "BlockSizeBytes": integer,
                    "RowIndexStride": integer,
                    "EnablePadding": true|false,
                    "PaddingTolerance": double,
                    "Compression": "NONE"|"ZLIB"|"SNAPPY",
                    "BloomFilterColumns": ["string", ...],
                    "BloomFilterFalsePositiveProbability": double,
                    "DictionaryKeyThreshold": double,
                    "FormatVersion": "V0_11"|"V0_12"
                }
            }
        },
        "Enabled": true|false
    }
}
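For illustration, here is a minimal boto3 sketch that creates a delivery stream with an ExtendedS3DestinationConfiguration including a Lambda processor. The stream name, role ARN, bucket ARN and Lambda ARN are placeholders, and only a subset of the fields above is used:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# All names and ARNs below are placeholders; substitute your own resources.
firehose.create_delivery_stream(
    DeliveryStreamName="my-extended-s3-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose_delivery_role",
        "BucketARN": "arn:aws:s3:::my-destination-bucket",
        "Prefix": "data/",
        "ErrorOutputPrefix": "errors/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
        # The "extended" part: a processing configuration that invokes a
        # Lambda function to transform each record before delivery to S3.
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "Lambda",
                    "Parameters": [
                        {
                            "ParameterName": "LambdaArn",
                            "ParameterValue": "arn:aws:lambda:us-east-1:123456789012:function:my-transformer:$LATEST",
                        }
                    ],
                }
            ],
        },
    },
)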
Related
I've got logstash processing logs and uploading to an opensearch instance running on AWS as a service.
I've added a geoip filter to my logstash pipeline to process IPs into geographic data. According to the docs, the geoip filter should generate a location field that contains lon and lat, and that field should be recognised as a geo_point type which can then be used to populate map visualisations.
I've been trying for a couple of hours now, but OpenSearch always splits the location field into the numbers location.lon and location.lat instead of recognising location as a geo_point, so I cannot use it for map visualisations.
Here is my logstash config:
input {
  file {
    ...
    codec => json {
      target => "[log_message]"
    }
  }
}

filter {
  ...
  geoip {
    source => "[log_message][forwarded_ip_address]"
  }
}

output {
  ...
  opensearch {
    ...
    ecs_compatibility => disabled
  }
}
The template on my opensearch instance is the standard one, so it does contain this:
"geoip": {
"dynamic": true,
"properties": {
"ip": {
"type": "ip"
},
"latitude": {
"type": "half_float"
},
"location": {
"type": "geo_point"
},
"longitude": {
"type": "half_float"
}
}
},
I am not sure if this is relevant, but AWS OpenSearch requires ECS compatibility to be set to disabled, which I did.
Has somebody managed to do this successfully on AWS OpenSearch?
Have you tried to set the location field as geo_point type in the index before ingesting the data? I don't think opensearch detects the geo_point type automatically.
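For example, here is a minimal sketch of creating the index with an explicit geo_point mapping before ingesting, using Python requests. The endpoint, credentials and index name are placeholders, and the field path assumes the geoip filter writes to a top-level geoip field (as in the template above):

import json
import requests

# Placeholders: replace with your OpenSearch endpoint, credentials and index name.
OPENSEARCH_URL = "https://my-domain.eu-west-1.es.amazonaws.com"
INDEX = "logstash-2022.06.01"

# Explicit mapping so geoip.location is stored as a geo_point
# instead of being dynamically mapped as two numeric sub-fields.
mapping = {
    "mappings": {
        "properties": {
            "geoip": {
                "properties": {
                    "location": {"type": "geo_point"}
                }
            }
        }
    }
}

resp = requests.put(
    f"{OPENSEARCH_URL}/{INDEX}",
    auth=("master-user", "master-password"),  # placeholder credentials
    headers={"Content-Type": "application/json"},
    data=json.dumps(mapping),
    timeout=30,
)
resp.raise_for_status()
print(resp.json())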
I am designing and implementing a backup plan to restore my client API keys. How should I go about this?
To speed up the recovery process, I am trying to create a backup plan for the client API keys, probably in S3 or locally. I have been scratching my head for the past 2 days on how to achieve this. Maybe a Python script or something that takes the values from API Gateway and dumps them into a new S3 bucket, but I am not sure how to implement this.
You can get the list of all API Gateway API keys using apigateway get-api-keys. Here is the full AWS CLI command.
aws apigateway get-api-keys --include-values
Remember, --include-values is required; otherwise the actual API key values will not be included in the result.
It will display the result in the below format.
"items": [
{
"id": "j90yk1111",
"value": "AAAAAAAABBBBBBBBBBBCCCCCCCCCC",
"name": "MyKey1",
"description": "My Key1",
"enabled": true,
"createdDate": 1528350587,
"lastUpdatedDate": 1528352704,
"stageKeys": []
},
{
"id": "rqi9xxxxx",
"value": "Kw6Oqo91nv5g5K7rrrrrrrrrrrrrrr",
"name": "MyKey2",
"description": "My Key 2",
"enabled": true,
"createdDate": 1528406927,
"lastUpdatedDate": 1528406927,
"stageKeys": []
},
{
"id": "lse3o7xxxx",
"value": "VGUfTNfM7v9uysBDrU1Pxxxxxx",
"name": "MyKey3",
"description": "My Key 3",
"enabled": true,
"createdDate": 1528406609,
"lastUpdatedDate": 1528406609,
"stageKeys": []
}
}
]
To get the details of a single API key, use the AWS CLI command below.
aws apigateway get-api-key --include-value --api-key lse3o7xxxx
It should display the below result.
{
    "id": "lse3o7xxxx",
    "value": "VGUfTNfM7v9uysBDrU1Pxxxxxx",
    "name": "MyKey3",
    "description": "My Key 3",
    "enabled": true,
    "createdDate": 1528406609,
    "lastUpdatedDate": 1528406609,
    "stageKeys": []
}
As with the get-api-keys call, --include-value is required here; otherwise the actual API key value will not be included in the result.
Now you need to convert the output into a format that can be saved to S3 and later imported back into API Gateway.
You can import keys with import-api-keys
aws apigateway import-api-keys --body <value> --format <value>
--body (blob)
The payload of the POST request to import API keys. For the payload
format
--format (string)
A query parameter to specify the input format to imported API keys.
Currently, only the CSV format is supported. --format csv
The simplest format has two fields only, e.g. Key,name:
Key,name
apikey1234abcdefghij0123456789,MyFirstApiKey
You can see the full detail of formats from API Gateway API Key File Format.
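Putting this together, here is a rough boto3 sketch that exports all API keys in the simple key,name CSV format shown above and uploads the file to S3. The bucket name and object key are placeholders:

import csv
import io

import boto3

# Placeholders: choose your own backup bucket and object key.
BACKUP_BUCKET = "my-apikey-backup-bucket"
BACKUP_KEY = "backups/api-keys.csv"

apigateway = boto3.client("apigateway")
s3 = boto3.client("s3")

# Collect all API keys, including their values (equivalent of --include-values).
keys = []
paginator = apigateway.get_paginator("get_api_keys")
for page in paginator.paginate(includeValues=True):
    keys.extend(page["items"])

# Write them in the simple "key,name" CSV format accepted by import-api-keys.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["key", "name"])
for key in keys:
    writer.writerow([key["value"], key["name"]])

s3.put_object(Bucket=BACKUP_BUCKET, Key=BACKUP_KEY, Body=buf.getvalue())
print(f"Backed up {len(keys)} API keys to s3://{BACKUP_BUCKET}/{BACKUP_KEY}")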
I have implemented it in Python using a Lambda for backing up API keys, with boto3 calls similar to the above answer.
However, I am looking for a way to trigger the Lambda on an "API key added/removed" event :-)
I am trying to create a CloudFormation template to configure WAF with a geo location condition. I couldn't find the right template yet. Any pointers would be appreciated.
http://docs.aws.amazon.com/waf/latest/developerguide/web-acl-geo-conditions.html
Unfortunately, the actual answer (as of this writing, July 2018) is that you cannot create geo match sets directly in CloudFormation. You can create them via the CLI or SDK, then reference them in the DataId field of a WAFRule's Predicates property.
Creating a GeoMatchSet with one constraint via CLI:
aws waf-regional get-change-token
aws waf-regional create-geo-match-set --name my-geo-set --change-token <token>
aws waf-regional get-change-token
aws waf-regional update-geo-match-set --change-token <new_token> --geo-match-set-id <id> --updates '[ { "Action": "INSERT", "GeoMatchConstraint": { "Type": "Country", "Value": "US" } } ]'
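If you prefer to script it, roughly the same sequence with boto3 (the set name and country value are just example values):

import boto3

waf = boto3.client("waf-regional", region_name="us-east-1")

# Each classic WAF write call needs a fresh change token.
token = waf.get_change_token()["ChangeToken"]
geo_set = waf.create_geo_match_set(Name="my-geo-set", ChangeToken=token)
geo_set_id = geo_set["GeoMatchSet"]["GeoMatchSetId"]

# Insert a country constraint into the new geo match set.
token = waf.get_change_token()["ChangeToken"]
waf.update_geo_match_set(
    GeoMatchSetId=geo_set_id,
    ChangeToken=token,
    Updates=[
        {
            "Action": "INSERT",
            "GeoMatchConstraint": {"Type": "Country", "Value": "US"},
        }
    ],
)

# geo_set_id is what goes into the DataId field of the CloudFormation rule.
print(geo_set_id)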
Now reference that GeoMatchSet id in the CloudFormation:
"WebAclGeoRule": {
"Type": "AWS::WAFRegional::Rule",
"Properties": {
...
"Predicates": [
{
"DataId": "00000000-1111-2222-3333-123412341234" // id from create-geo-match-set
"Negated": false,
"Type": "GeoMatch"
}
]
}
}
There is no documentation for it, but it is possible to create the geo match set in Serverless/CloudFormation.
I used the following in Serverless:
Resources:
  Geos:
    Type: "AWS::WAFRegional::GeoMatchSet"
    Properties:
      Name: geo
      GeoMatchConstraints:
        - Type: "Country"
          Value: "IE"
This translates to the following in CloudFormation:
"Geos": {
"Type": "AWS::WAFRegional::GeoMatchSet",
"Properties": {
"Name": "geo",
"GeoMatchConstraints": [
{
"Type": "Country",
"Value": "IE"
}
]
}
}
That can then be referenced when creating a rule:
(serverless) :
Resources:
  MyRule:
    Type: "AWS::WAFRegional::Rule"
    Properties:
      Name: waf
      Predicates:
        - DataId:
            Ref: "Geos"
          Negated: false
          Type: "GeoMatch"
(cloudformation) :
"MyRule": {
"Type": "AWS::WAFRegional::Rule",
"Properties": {
"Name": "waf",
"Predicates": [
{
"DataId": {
"Ref": "Geos"
},
"Negated": false,
"Type": "GeoMatch"
}
]
}
}
I'm afraid that your question is too vague to elicit a helpful response. The CloudFormation User Guide (pdf) defines many different WAF / CloudFront / R53 resources that perform various forms of geo match / geo blocking. The link you provided seems to cover a subset of Web Access Control Lists (Web ACLs) - see AWS::WAF::WebACL on page 2540.
I suggest you have a look and if you are still stuck, actually describe what it is you are trying to achieve.
Note that the term you used: "geo location condition" doesn't directly relate to an AWS capability that I'm aware of.
Finally, if you are referring to https://aws.amazon.com/about-aws/whats-new/2017/10/aws-waf-now-supports-geographic-match/, then the latest CloudFormation User Guide doesn't seem to have been updated yet to reflect this.
The command I use:
aws s3api put-bucket-notification-configuration --bucket bucket-name --notification-configuration file:///Users/chris/event_config.json
Works fine if I take out the "Filter" key. As soon as I add it in, I get:
Parameter validation failed:
Unknown parameter in NotificationConfiguration.LambdaFunctionConfigurations[0]: "Filter", must be one of: Id, LambdaFunctionArn, Events
Here's my JSON file:
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:000000000:function:name",
      "Events": [
        "s3:ObjectCreated:*"
      ],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "prefix",
              "Value": "images/"
            }
          ]
        }
      }
    }
  ]
}
When I look at the command's docs (http://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-notification-configuration.html), I don't see any mistake. I've tried copy/pasting, carefully looking over, etc... Any help would be greatly appreciated!
You need to be running at least version 1.7.46 of aws-cli, released 2015-08-20.
This release adds Amazon S3 support for event notification filters and fixes some issues.
https://aws.amazon.com/releasenotes/CLI/3585202016507998
The aws-cli utility contains a lot of built-in intelligence and validation logic. New features often require the code in aws-cli to be updated, and Filter on S3 event notifications is a relatively recent feature.
See also: https://aws.amazon.com/blogs/aws/amazon-s3-update-delete-notifications-better-filters-bucket-metrics/
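For reference, the same notification configuration can also be applied with boto3 (which likewise needs to be recent enough to know about the Filter key). A quick sketch, reusing the bucket name placeholder and Lambda ARN from the question:

import boto3

s3 = boto3.client("s3")

# Bucket name is a placeholder; the Lambda ARN and filter mirror the JSON above.
s3.put_bucket_notification_configuration(
    Bucket="bucket-name",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:000000000:function:name",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "images/"}
                        ]
                    }
                },
            }
        ]
    },
)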
I successfully managed to get a data pipeline to transfer data from a set of tables in Amazon RDS (Aurora) to a set of .csv files in S3 with a "copyActivity" connecting the two DataNodes.
However, I'd like the .csv file to have the name of the table (or view) that it came from. I can't quite figure out how to do this. I think the best approach is to use an expression in the filePath parameter of the S3 DataNode.
But, I've tried #{table}, #{node.table}, #{parent.table}, and a variety of combinations of node.id and parent.name without success.
Here's a couple of JSON snippets from my pipeline:
"database": {
"ref": "DatabaseId_abc123"
},
"name": "Foo",
"id": "DataNodeId_xyz321",
"type": "MySqlDataNode",
"table": "table_foo",
"selectQuery": "select * from #{table}"
},
{
"schedule": {
"ref": "DefaultSchedule"
},
"filePath": "#{myOutputS3Loc}/#{parent.node.table.help.me.here}.csv",
"name": "S3_BAR_Bucket",
"id": "DataNodeId_w7x8y9",
"type": "S3DataNode"
}
Any advice you can provide would be appreciated.
I see that you have #{table} (did you mean #{myTable}?). If you are using a parameter to pass the name of the DB table, you can use that in the S3 filepath as well like this:
"filePath": "#{myOutputS3Loc}/#{myTable}.csv",