I have manually installed Confluent Kafka Connect S3 using the standalone method and not through Confluent's process or as part of the whole platform.
I can successfully launch the connector from the command line with the command:
./kafka_2.11-2.1.0/bin/connect-standalone.sh connect.properties s3-sink.properties
Topic CDC offsets from AWS MSK can be seen being consumed. No errors are thrown. However, in AWS S3, no folder structure is created for new data and no JSON data is stored.
Questions
Should the connector dynamically create the folder structure as it sees the first JSON packet for a topic?
Other than configuring awscli credentials, connect.properties, and s3-sink.properties, are there any other settings that need to be set to properly connect to the S3 bucket?
Are there any recommendations for install documentation more comprehensive than the standalone docs on the Confluent website (linked above)?
connect.properties
bootstrap.servers=redacted:9092,redacted:9092,redacted:9092
plugin.path=/plugins/kafka-connect-s3
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
s3-sink.properties
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=database_schema_topic1,database_schema_topic2,database_schema_topic3
s3.region=us-east-2
s3.bucket.name=databasekafka
s3.part.size=5242880
flush.size=1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
schema.generator.class=io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=NONE
Should the connector dynamically create the folder structure as it sees the first JSON packet for a topic?
Yes, and you can control this path (directory structure) using the "topics.dir" and "path.format" parameters.
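As a sketch, assuming a time-based layout is wanted, the output path can be shaped like this in s3-sink.properties (the "raw-data" prefix is made up; note that "path.format" only takes effect with a time-based partitioner, not the DefaultPartitioner used above):

```properties
# Hypothetical additions to s3-sink.properties: write under a custom prefix,
# partitioned by wall-clock time. The resulting S3 key layout becomes
# <topics.dir>/<topic>/<path.format>/<files>
topics.dir=raw-data
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Wallclock
```

With the DefaultPartitioner from your config, the connector instead writes under topics/&lt;topic&gt;/partition=&lt;n&gt;/ once records actually reach it.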
Other than configuring awscli credentials, connect.properties, and s3-sink.properties, are there any other settings that need to be set to properly connect to the S3 bucket?
By default, the S3 connector picks up AWS credentials (access key ID and secret access key) from environment variables or the credentials file. You can change this by setting the "s3.credentials.provider.class" parameter; its default value is "DefaultAWSCredentialsProviderChain".
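For example (a sketch; the key values below are placeholders), exporting the standard environment variables before starting the worker is enough for the default provider chain to find them:

```shell
# Placeholder credentials (replace with real values): the default
# DefaultAWSCredentialsProviderChain checks these environment
# variables before falling back to ~/.aws/credentials.
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
```

Start connect-standalone.sh from that same shell so the worker process inherits the variables; alternatively, keep the keys in ~/.aws/credentials.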
Are there any recommendations for install documentation more comprehensive than the standalone docs on the Confluent website (linked above)?
I recommend going with distributed mode, as it provides high availability for your Connect cluster and the connectors running on it.
You can go through the documentation below to configure a Connect cluster in distributed mode.
https://docs.confluent.io/current/connect/userguide.html#connect-userguide-dist-worker-config
Related
Problem:
We need to perform a task in which we have to transfer all files (CSV format) stored in an AWS S3 bucket to an on-premises LAN folder using Lambda functions. This will be a scheduled task, carried out every hour, and the files will again be transferred from S3 to the on-premises LAN folder, replacing the existing ones. These files are not large (preferably under a few MBs).
I am not able to find any AWS managed service to accomplish this task.
I am a newbie to AWS; any solution to this problem is most welcome.
Thanks,
Actually, I am looking for a solution by which I can push S3 files to an on-premises folder automatically.
For that you need to make the on-premises network visible to the logic (Lambda, whatever...) "pushing" the content. The default solution is an AWS site-to-site VPN.
There are multiple options for setting up the VPN; you can choose based on your needs.
Then the on-premises network will look just like another subnet.
However, a VPN has its complexity and cost. In most cases it is much easier to "pull" data from the on-premises environment.
There are multiple options for syncing the data. For a managed service, I could point out the S3 Gateway, which based on your description sounds like insane overkill.
Maybe you could start with a simple cron job (or a scheduled task if working with Windows) and run a CLI command to sync the S3 content or just copy specified files.
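For instance, a crontab entry along these lines (the bucket name and target folder are made up) would pull the bucket contents down every hour:

```
# Hypothetical crontab entry: sync s3://my-bucket/exports to /data/s3-mirror
# at the top of every hour. The AWS CLI must be installed and configured
# (credentials + region) for the user owning this crontab.
0 * * * * /usr/local/bin/aws s3 sync s3://my-bucket/exports /data/s3-mirror
```

`aws s3 sync` only copies new or changed objects, so repeated runs are cheap for files that rarely change.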
Check out S3 Sync, I think it will help you accomplish this task: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
To run any AWS CLI command on your computer, you will need to set up credentials, and the configured account/role must have permissions for the task (e.g. access to S3).
Check out AWS CLI setup here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
I am using a CentOS 7 machine, Bacula community edition 11.0.5, and a PostgreSQL database.
Bacula is used to take full and incremental backups.
I followed the document below to store the backup in an Amazon S3 bucket:
https://www.bacula.lat/community/bacula-storage-in-any-cloud-with-rclone-and-rclone-changer/?lang=en
I configured the storage daemon as shown in the link above. The backup succeeds and the backed-up file is stored in the given path /mnt/vtapes/tapes, but the backup file is not moving from /mnt/vtapes/tapes to the AWS S3 bucket.
The document above mentions that we need to create scheduled routines to the cloud to move the backup file from /mnt/vtapes/tapes to the Amazon S3 bucket.
I am not aware of what cloud scheduled routines are in AWS. Is that a Lambda function or something else?
Is there any S3 cloud driver which supports Bacula backups, or any other way to store Bacula community backup files on Amazon S3 other than S3FS-FUSE and libs3?
The link you shared is for Bacula Enterprise; we are using Bacula Community. Is there any related document you would recommend for the community edition?
Bacula Community includes an AWS S3 cloud driver starting from version 9.6.0. Check https://www.bacula.org/11.0.x-manuals/en/main/main.pdf (Chapter 3, New Features in 9.6.0; additionally 4.0.1, New Commands, Resource, and Directives for Cloud). This is the exact same driver available in the Enterprise version.
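As a rough sketch of what that driver's configuration looks like in the storage daemon (all names, keys, and sizes below are placeholders; treat the manual linked above as the authoritative directive list), you define a Cloud resource and point a cloud-type Device at it:

```
# bacula-sd.conf sketch: S3 Cloud resource plus a cloud-backed device.
# The local cache directory replaces the rclone-based staging area.
Cloud {
  Name = AWSS3Cloud                  # placeholder name
  Driver = "S3"
  HostName = "s3.amazonaws.com"
  BucketName = "my-bacula-bucket"    # placeholder bucket
  AccessKey = "your-access-key-id"
  SecretKey = "your-secret-access-key"
  Protocol = HTTPS
  UriStyle = VirtualHost
}

Device {
  Name = S3CloudDevice
  Device Type = Cloud
  Cloud = AWSS3Cloud
  Archive Device = /mnt/vtapes/tapes   # local part cache
  Maximum Part Size = 10000000
  Media Type = CloudType
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
  RemovableMedia = no
  AlwaysOpen = no
}
```

With the native driver the storage daemon uploads volume parts to S3 itself, so no external schedule or rclone step is needed to move data out of /mnt/vtapes/tapes.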
In my Glue job, I have enabled Spark UI and specified all the necessary details (s3 related etc.) needed for Spark UI to work.
How can I view the DAG/Spark UI of my Glue job?
You need to set up an EC2 instance that can host the history server.
The below documentation has links to CloudFormation templates that you can use.
https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
You can access the history server via the EC2 instance (default port 18080). You need to configure the networks and ports suitably.
EDIT: There is also an option to set up the Spark UI locally. This requires downloading the Docker image from the aws-glue-samples repo and setting the AWS credentials and S3 location there. This server consumes the event-log files that the Glue job generates. The files are about 4 MB large.
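A rough sketch of that local option (the repo path, image tag, and bucket are assumptions based on the aws-glue-samples layout; the log directory must match the job's Spark event-logs path):

```shell
# Build the history-server image from the aws-glue-samples repo
# (assumed directory and tag), then run it against the Glue
# event-log location with placeholder credentials.
git clone https://github.com/aws-samples/aws-glue-samples.git
cd aws-glue-samples/utilities/Spark_UI
docker build -t glue/sparkui:latest .

docker run -it -p 18080:18080 \
  -e SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=s3a://my-bucket/spark-event-logs/ \
    -Dspark.hadoop.fs.s3a.access.key=your-access-key-id \
    -Dspark.hadoop.fs.s3a.secret.key=your-secret-access-key" \
  glue/sparkui:latest \
  "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
```

Once it is up, the DAGs are at http://localhost:18080.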
I am submitting a Spark application job JAR to EMR, and it uses a property file. I can put the file in S3 and, while creating the EMR cluster, download it and copy it to some location in the EMR box at bootstrap time. If that is the best way, how can I do this while creating the EMR cluster itself, at bootstrapping time?
Check the following snapshot.
In Edit software settings you can add your own configuration or a JSON file (stored in an S3 location), and using this setting you can pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
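For example, the JSON you paste (or reference from S3) under Edit software settings is a list of classification objects; a hypothetical snippet overriding a couple of spark-defaults properties looks like this:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "4g",
      "spark.executor.memory": "4g"
    }
  }
]
```

The same JSON can be passed on the command line via `aws emr create-cluster --configurations file://configurations.json ...`, which keeps the settings versioned alongside your job.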
Hope this will help you.
I have a Postgres dump in an AWS S3 bucket. What is the most convenient way to restore it into AWS RDS?
AFAIK, there is no native AWS way to manually push data from S3 to anywhere else. The dump stored on S3 needs to first be downloaded and then restored.
You can use the link posted above (http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Procedural.Importing.html), however that doesn't help you download the data.
The easiest way to get something off of S3 is to simply go to the S3 console, point and click your way to the file, right-click it, and click Download. If you need to restore FROM an EC2 instance (e.g. because your RDS does not have a public IP), then install and configure the AWS CLI (http://docs.aws.amazon.com/cli/latest/userguide/installing.html).
Once you have the CLI configured, download with the following command:
aws s3 cp s3://<<bucket>>/<<folder>>/<<folder>>/<<key>> dump.gz
NOTE: the above command may need some additional tweaking depending on whether you have multiple AWS profiles configured on the machine, whether the dump is one file or many, etc.
From there restore to RDS just like you would a normal Postgres server following the instructions in the AWS link.
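For a plain-SQL gzipped dump, the restore can be streamed straight into RDS without unpacking to disk first (a sketch; the endpoint, user, and database names below are placeholders):

```shell
# Stream the downloaded plain-format dump into the RDS instance.
# Hostname, username, and dbname are placeholders; psql will prompt
# for the password (or read it from ~/.pgpass).
gunzip -c dump.gz | psql \
  --host=mydb.xxxxxxxx.us-east-1.rds.amazonaws.com \
  --port=5432 \
  --username=masteruser \
  --dbname=mydb
```

If the dump was made with `pg_dump -Fc` (custom format) rather than plain SQL, use `pg_restore` with the same connection options instead of piping into psql.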
Hope that helps!