Amazon Managed Airflow DAGs won't sync - amazon-web-services

Has anyone tried the new Amazon managed service for Airflow (MWAA)? I did all the steps mentioned and could open the Airflow link, but none of the DAGs are synced and I don't see any error. I even tried updating the environment with a different S3 bucket.

Are you uploading the DAGs to the dags folder in your S3 bucket?
Your DAG file (e.g. dags.py) needs to be inside the dags folder of the S3 bucket to get synced; see the sketch below.
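For reference, a minimal sketch of what MWAA expects: a Python DAG file uploaded under the dags/ prefix of the environment's S3 bucket. The bucket name and dag_id below are placeholders, and the import path assumes Airflow 1.10.x (MWAA's initial version).

# example_dag.py -- upload to s3://<your-mwaa-bucket>/dags/example_dag.py
# Placeholder names throughout; MWAA only picks up files under the dags/ prefix.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.10-style import

with DAG(
    dag_id="example_dag",            # should appear in the Airflow UI once synced
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,          # trigger manually while testing the sync
    catchup=False,
) as dag:
    DummyOperator(task_id="noop")

If the file sits anywhere other than the dags/ prefix configured for the environment, it will never show up in the UI.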

Related

CodeDeploy or CodeCommit in AWS is changing File Type in S3 Bucket during Push?

I am trying to connect the content of a static website in an S3 bucket to a CodeCommit repo via CodeDeploy.
However, when I set up a repo via CodeCommit and a CodeDeploy pipeline and push changes to my HTML file, the static HTML page doesn't load; instead my browser either briefly flashes or downloads the HTML file.
I know I have the S3 bucket configured correctly because when I test my .html file via its public URL, it loads as expected.
Additionally, when I download my HTML file from my S3 bucket BEFORE I push commit changes, the same file downloads fine. However, when I download the newly committed HTML file from S3, it's corrupted, which makes me think it's an issue with how I've configured CodeDeploy, but I can't figure it out.
I believe I have the header information configured correctly.
The S3 bucket policy allows reading of objects. CodePipeline successfully pushes my repo changes to my S3 bucket. But for some reason, even though S3 still reports the file type as HTML, it isn't served as such after a push from CodeDeploy. Additionally, when I download the newly pushed HTML file and open it, the HTML code is all jumbled.
Any help or insights are appreciated.
Eureka! I found the solution (by accident).
Just in case others run into this problem: when configuring a deployment pipeline in CodePipeline, if you don't select "Extract file before deploy" in your deployment configuration step, CodePipeline will deploy any CodeCommit HTML files (and I assume other file types as well) as "octet-streams". Enabling "Extract file before deploy" fixed this problem.
Now I will be able to finally sleep tonight!
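If anyone else hits the same symptom, a quick way to confirm the diagnosis is to inspect the object's Content-Type and, if needed, rewrite it in place. A hedged boto3 sketch; the bucket and key names are placeholders, not from the original post.

import boto3

s3 = boto3.client("s3")
bucket, key = "my-static-site-bucket", "index.html"  # placeholders

# An object served as binary/octet-stream downloads instead of rendering in the browser.
current_type = s3.head_object(Bucket=bucket, Key=key)["ContentType"]
print(f"{key} is currently served as {current_type}")

if current_type != "text/html":
    # Copy the object onto itself, replacing its metadata with the correct type.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        ContentType="text/html",
        MetadataDirective="REPLACE",
    )

This only patches objects already in the bucket; enabling "Extract file before deploy" remains the proper fix for future deployments.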

Monitoring a S3 bucket and downloading any new files continuously

Is there any way I can monitor an S3 bucket for any new files added to it using boto3? Once a new file is added to the S3 bucket, it needs to be downloaded.
My Python code needs to run on an external VMC server, which is not hosted on an AWS EC2 instance. Whenever a vendor pushes a new file to our public S3 bucket, I need to download those files to this VMC server for ingestion into our on-prem databases/servers. I can't access the VMC server from AWS either, and there is no webhook available.
I have written the code for downloading the files; however, how can I monitor an S3 bucket for any new files?
Take a look at S3 Event Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
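Since the VMC server can make outbound calls to AWS but can't receive webhooks, one common pattern is to point the bucket's event notifications at an SQS queue and poll that queue from the external server. A rough boto3 sketch, assuming a queue (placeholder name s3-new-files) is already subscribed to the bucket's s3:ObjectCreated:* events:

import json
from urllib.parse import unquote_plus

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
queue_url = sqs.get_queue_url(QueueName="s3-new-files")["QueueUrl"]  # placeholder queue name

while True:
    # Long-poll for new-object notifications.
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            s3.download_file(bucket, key, key.replace("/", "_"))  # save locally for ingestion
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

Plain periodic polling with list_objects_v2 also works if wiring up event notifications isn't an option.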

How to use Spark to read data from one AWS account and write to another AWS account?

I have Spark jobs running on an EKS cluster to ingest AWS logs from S3 buckets.
Now I have to ingest logs from another AWS account. I have managed to use the settings below to successfully read data across accounts with the Hadoop AssumedRoleCredentialProvider.
But how do I save the dataframe back to an S3 bucket in my own AWS account? There seems to be no way to switch the Hadoop S3 config back to my own AWS account.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.external.id","****")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.arn","****")
val data = spark.read.json("s3a://cross-account-log-location")
data.count
//change back to InstanceProfileCredentialsProvider not working
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
data.write.parquet("s3a://bucket-in-my-own-aws-account")
As per the Hadoop documentation, different S3 buckets can be accessed with different S3A client configurations by using per-bucket configuration keys that include the bucket name, as in the sketch after the link below.
E.g.: fs.s3a.bucket.<bucket name>.access.key
Check the below URL: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
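A hedged PySpark rendering of that idea (the snippet in the question is Scala): scope the assumed-role settings to the cross-account bucket only, so writes to your own bucket fall back to the default instance-profile credentials. Bucket names and the role ARN are placeholders.

# Per-bucket S3A settings: only the cross-account bucket uses the assumed role;
# every other bucket, including your own, keeps the default credential chain.
hconf = spark.sparkContext._jsc.hadoopConfiguration()  # assumes an existing SparkSession `spark`

cross_bucket = "cross-account-log-location"  # placeholder bucket name
hconf.set(f"fs.s3a.bucket.{cross_bucket}.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
hconf.set(f"fs.s3a.bucket.{cross_bucket}.assumed.role.arn",
          "arn:aws:iam::123456789012:role/log-reader")  # placeholder ARN
hconf.set(f"fs.s3a.bucket.{cross_bucket}.assumed.role.external.id", "****")
hconf.set(f"fs.s3a.bucket.{cross_bucket}.assumed.role.credentials.provider",
          "com.amazonaws.auth.InstanceProfileCredentialsProvider")

data = spark.read.json(f"s3a://{cross_bucket}/")           # read with the assumed role
data.write.parquet("s3a://bucket-in-my-own-aws-account/")  # write with instance-profile creds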

How to put custom application logs in s3 bucket from AWS EMR

I have a custom application/service for AWS EMR that I'm installing via a bootstrap action on all nodes. I want to put the log files of that application in the same S3 bucket that I'm using for EMR logs.
Can anyone suggest where I have to configure my log path in logpusher so that the logs land in the S3 bucket at a fixed interval, the same as for a Hadoop application?
You can configure it in /etc/logpusher/hadoop.config and restart logpusher on all nodes.
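If editing the logpusher config isn't practical (its format is undocumented and can change between EMR releases), a plain workaround is a small uploader script that the same bootstrap action drops onto each node and cron runs at a fixed interval. A hedged sketch; the log directory, bucket, and prefix are placeholders.

import os

import boto3

# Placeholders: adjust to the application's log directory and your EMR log bucket/prefix.
LOG_DIR = "/var/log/myapp"
BUCKET = "my-emr-logs-bucket"
PREFIX = "custom-app-logs"

s3 = boto3.client("s3")

def push_logs():
    """Upload every file under LOG_DIR, keyed by its path relative to LOG_DIR."""
    for root, _, files in os.walk(LOG_DIR):
        for name in files:
            path = os.path.join(root, name)
            key = f"{PREFIX}/{os.path.relpath(path, LOG_DIR)}"
            s3.upload_file(path, BUCKET, key)

if __name__ == "__main__":
    push_logs()  # e.g. schedule via cron: */5 * * * * python3 /usr/local/bin/push_logs.py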

Jenkins S3 plugin unable to delete files in S3

My Jenkins setup works as follows:
Takes a checkout of the code from GitHub
Publishes the code successfully to my S3 bucket.
The IAM user I have configured for it has full access permissions to S3.
But the problem occurs when I delete a file/directory: the plugin updates all the files in my S3 bucket but doesn't remove the deleted files/directories. Is deleting files/directories not possible with the Jenkins S3 plugin?
The S3 plugin removes files on the onDelete event, which Jenkins creates when it removes a build from history (due to rotation or something like that). Uploading works only as uploading; it does not mirror deletions.
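Since the plugin only ever uploads, a common workaround is to mirror the workspace yourself after the build, for example with aws s3 sync --delete in a shell step, or with a hedged boto3 sketch like the one below (the workspace path, bucket, and prefix are placeholders).

import os

import boto3

# Placeholders: the Jenkins workspace directory and the deployment bucket/prefix.
WORKSPACE = "/var/lib/jenkins/workspace/my-job"
BUCKET = "my-deploy-bucket"
PREFIX = "site/"

s3 = boto3.client("s3")

# Keys that should exist, derived from the files currently in the workspace.
wanted = set()
for root, _, files in os.walk(WORKSPACE):
    for name in files:
        rel = os.path.relpath(os.path.join(root, name), WORKSPACE)
        wanted.add(PREFIX + rel.replace(os.sep, "/"))

# Delete anything under the prefix that is no longer present locally.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    stale = [{"Key": obj["Key"]} for obj in page.get("Contents", []) if obj["Key"] not in wanted]
    if stale:
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": stale})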