CloudWatch - Delete logs after they are transferred - amazon-web-services

I have CloudWatch set up on my EC2 instance to transfer logs to specific log groups.
Over time, those logs can grow quite large, so I want to delete them, for example, on a weekly basis.
I was wondering if there is any option for setting up automatic cleanup of the transferred logs on the EC2 instance using CloudWatch?
What would be the best way to achieve that?

To remove the logfiles from EC2 running Linux, you have two choices:
If you're using logfiles that already rotate based on time or some other value, you can use the auto_removal option to delete them after the log agent is finished with them (a sketch of the relevant agent configuration is shown after these two options). See docs.
If you're using a file that's constantly updated, you'll need to use logrotate, which is a program invoked by cron that will rename, compress, and delete old files. There's a good intro doc here.
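For the first option, here is a minimal sketch of the relevant piece of the unified CloudWatch agent configuration, assuming the agent reads rotated files matching a wildcard path (the file path, log group, and stream name below are just placeholders; check the agent docs for the full schema):
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/app.log.*",
            "log_group_name": "my-app-logs",
            "log_stream_name": "{instance_id}",
            "auto_removal": true
          }
        ]
      }
    }
  }
}
As I understand the docs, with auto_removal set to true the agent deletes each file once it has finished reading it to the end, so rotated files are cleaned up without any extra cron job.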
If you use logrotate, here's an example config that I've found useful for high-volume log sources. It performs a rotation if the file reaches 100 megabytes, rather than just rotating every day (you'll need to run it from cron.hourly to make that useful). Most importantly, it enables copytruncate, which truncates the file in place, allowing the program to continue writing to it.
/var/log/filename.log {
    rotate 7
    daily
    maxsize 100M
    nodateext
    missingok
    notifempty
    copytruncate
    compress
    delaycompress
}

How to break the streams in Docker awslog driver?

I have an EC2 instance with a Docker container running on it. Currently this container uses the awslogs driver to push logs to CloudWatch. If I go to the CloudWatch console, I see a very large log stream (with the container id as its name) which contains all logs from the last 16 days (since I created the container). It seems that if I keep this container running for a year, the log stream will hold a full year of logs. I am not quite sure what the maximum size of a CloudWatch log stream is, but presumably there is some limit.
So my questions are:
How can I chunk this huge log stream? Ideally by current date, something like {{.ContainerId}}{{.CurrentDate}}
What is the maximum size limit of a CloudWatch log stream?
Is it a good practice to append onto a single huge log stream?
The following is the definition of a CloudWatch log stream, as given in the docs, here:
Log streams
A log stream is a sequence of log events that share the same source. More specifically, a log stream is generally intended to represent the sequence of events coming from the application instance or resource being monitored. For example, a log stream may be associated with an Apache access log on a specific host. When you no longer need a log stream, you can delete it using the aws logs delete-log-stream command.
Unfortunately, what you want is not possible at the moment. I'm not sure what exactly your use case is, but you can filter log events by time, so separating the streams is not really necessary. See start-time and end-time in filter-log-events.
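As a rough sketch (the log group name, stream name, and epoch-millisecond timestamps below are placeholders):
aws logs filter-log-events \
    --log-group-name my-container-logs \
    --log-stream-names my-container-id \
    --start-time 1510617600000 \
    --end-time 1510704000000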
You might want to define the following awslogs driver option to get a better stream name.
awslogs-stream-prefix (see docs)
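Note that awslogs-stream-prefix is mainly exposed through ECS task definitions; if you are launching the container with plain docker run, the driver's tag option gives a comparable result by templating the stream name. A sketch, with made-up group and image names:
docker run \
    --log-driver=awslogs \
    --log-opt awslogs-region=us-east-1 \
    --log-opt awslogs-group=my-app-logs \
    --log-opt tag="{{.Name}}/{{.ID}}" \
    my-image
As far as I know, the available template fields (container name, ID, image, etc.) do not include the current date, so this only makes the stream name more descriptive; it does not split the stream by day.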

Compose Google Storage Objects without headers via CLI

I was wondering if it would be possible to compose Google Cloud Storage objects (specifically CSV files) without headers (i.e. without the row of column names) while using gsutil.
Currently, I can do the following:
gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv
However, I will be unable to ingest test-composition-files.csv into Google BigQuery because compose blindly appends the files (including the header rows with the column names).
One possible solution would be to download the file locally and process it with pandas, but this is not ideal for large files.
Is there any way to do this via the CLI? I could not find anything in the docs.
Reading the comments, I think you are spending effort in the wrong direction. As I understand it, you want to load your files into BigQuery, but the large number of files prevents you from doing this (too many API calls), and Dataflow is too slow.
Maybe you can think about it differently. I have 2 solutions to propose:
If you need "near real time" ingestion, and if the file size is below 1.5 GB, the best way is to build a Cloud Function that reads the file and performs a streaming write to BigQuery. The function is triggered by a Cloud Storage event. If several files arrive at the same time, several function instances will be spawned. Be careful: streaming writes to BigQuery are not free.
If you can wait up to 2 minutes after a file arrives, I recommend building a Cloud Function triggered every 2 minutes. This function reads the file names in a bucket, moves them to a subdirectory, and performs a load job on all the files in that subdirectory. You are limited to 1,000 load jobs per day (and per table), and a day contains 1,440 minutes, so batching every 2 minutes keeps you within the limit. Load jobs are free.
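As a rough CLI sketch of what that second function would do (the bucket path, dataset, and table names are made up; a real function would use the client libraries, and the load assumes the destination table already exists or that you add --autodetect):
gsutil -m mv "gs://bucket/incoming/*.csv" gs://bucket/batch-001/
bq load --source_format=CSV --skip_leading_rows=1 \
    mydataset.mytable "gs://bucket/batch-001/*.csv"
As a side note, --skip_leading_rows=1 drops the header row of each file during the load, which should also work around the header problem from the original question without composing the files at all.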
Are these acceptable alternatives?

How to sync a number between multiple google cloud instances using google cloud storage?

I am trying to sync an operation between multiple instances in google cloud.
In the home folder of an image from where I create new instances, I have several files that are named like this: 1.txt, 2.txt, 3.txt,... 50000.txt.
I have another file in a Google Cloud Storage bucket named gs://bucket/current_file.txt that contains a number on a single line, indicating the latest file that is being processed by all the running Google Cloud instances. Initially this file looks like this:
0
Now I am creating multiple google instances one by one. The instances have a startup script like this:
gsutil cp gs://bucket/current_file.txt /home/ubuntu/;
past_file=`tail /home/ubuntu/current_file.txt`;
current_file=$((past_file+1));
echo $current_file > /home/ubuntu/current_file.txt;
gsutil cp /home/ubuntu/current_file.txt gs://bucket/;
process.py /home/ubuntu/$current_file.txt;
So this script downloads the number of the current file that is being processed by another instance, increments it by 1, and starts processing the corresponding file. gs://bucket/current_file.txt is also updated so that other instances know the name of the next file they can start processing. When I have only 1 instance running, gs://bucket/current_file.txt is updated properly, but when I am running multiple instances, the value in gs://bucket/current_file.txt sometimes climbs and then erratically falls back to a lower value.
My assumption is that two different instances are somehow trying to upload the same file at the same time, and this messes up the integer value inside the text file.
Is it possible to lock the file so that other instances wait while one instance overwrites gs://bucket/current_file.txt?
If not, can someone suggest another mechanism through which the current_file number can be updated once a file is picked up by one instance and then communicated to the other instances, so that they can start processing the following files when they finish the file at hand?
You are correct. In your architecture, you need some mechanism to lock your current-file counter so that only one process at a time is able to change its value. You want to be able to apply a mutex or lock to the file, when one process opens it to increment it, so that another process is unable to increment it concurrently.
I recommend you consider alternative approaches.
Even if you are able to lock the counter, your "workers" will block, waiting their turn to increment this variable when they should be able to continue processing files. You also limit processing to one file at a time when it may be more efficient for your processes to grab batches of files at a time.
There are various approaches for you to consider.
If your set of files is predetermined, i.e. you always have 50k, then when you start you could decide how many workers you wish to use and give each of them part of the problem to solve. If you chose 1000 workers, the first may be assigned 1.txt..50.txt, the 2nd 51.txt..100.txt, etc. If there are gaps in the files, the worker would skip the missing files.
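A minimal sketch of that static split, assuming each worker is handed a WORKER_ID (for example via instance metadata or its startup script) and a fixed slice of 50 files:
WORKER_ID=7              # hypothetical: injected via metadata or the startup script
FILES_PER_WORKER=50
START=$(( (WORKER_ID - 1) * FILES_PER_WORKER + 1 ))
END=$(( WORKER_ID * FILES_PER_WORKER ))
for i in $(seq "$START" "$END"); do
    # skip gaps in the numbering
    [ -f "/home/ubuntu/$i.txt" ] && process.py "/home/ubuntu/$i.txt"
done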
In a more complex scenario, when files are created in the bucket randomly and on an ongoing basis, a common practice is to queue the processing. Have a look at Task Queues and Cloud Pub/Sub. In this approach, you track files as they arrive, and for each file you enqueue a job to process it. With both Task Queues and Pub/Sub you can create push or pull queues.
In either approach, you would write a worker that accepts jobs (files) from the queue, processes them, and does something with the processed file. This approach has 2 advantages over the simpler case: the first is that you can dynamically increase or reduce the number of workers based on the queue depth (number of files to be processed). The second is that, if a worker fails, the job is not removed from the queue, so another worker can replace it and complete the file processing.
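For the Pub/Sub variant, here is a rough sketch using only gcloud (the topic and subscription names are made up; a real worker would more likely use the client libraries and acknowledge messages only after processing succeeds):
# one-time setup
gcloud pubsub topics create file-jobs
gcloud pubsub subscriptions create file-workers --topic=file-jobs
# whatever detects new files enqueues one job per file
gcloud pubsub topics publish file-jobs --message="123.txt"
# each worker repeatedly pulls a job and processes the named file
gcloud pubsub subscriptions pull file-workers --limit=1 --auto-ack
Note that --auto-ack is only for the sketch: acknowledging after processing, rather than on pull, is what gives you the redelivery-on-failure behaviour described above.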
You could move processed files to a "processed" bucket to track completion. This way, if your job fails, you need only restart with the files that have not yet been processed.
Lastly, rather than creating instances one-by-one, have a look at auto-scaling using Managed Instance Groups or perhaps consider using Kubernetes. Both these technologies help you clone many similar processes from a single template. While neither of these solutions solves your coordination problem, either would help you manage all the workers.

"Realtime" syncing of large numbers of log files to S3

I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' cli command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day, and it uses a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine which files are new, then send them all at once.
A realtime solution would be beneficial in 2 ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically whenever a new object is saved to your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like to wait until you have, say, 1,000 files in order to sync in batches, you could use AWS SQS and the following workflow (using 2 Lambda functions, 1 CloudWatch rule, and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are waiting in SQS to be synced. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process.
Keep in mind that Lambda has a hard timeout of 5 minutes. If your sync job takes too long, you'll need to break it into smaller chunks.
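For the scheduled check, the "how many files are waiting" query is essentially the following (the queue URL is a placeholder); if the count is at or above your threshold, the function would drain the queue with receive-message/delete-message calls and kick off the sync for those file names:
aws sqs get-queue-attributes \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/pending-log-files \
    --attribute-names ApproximateNumberOfMessages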
You could set the bucket up to log HTTP requests to a separate bucket, then parse the log to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for the multipart upload operations, which are a sequence of POSTs. It's best to log for a few days to see what gets created before putting any effort into this approach.

Pointing multiple projects' log sinks to one bucket

I have a few GCP projects with log sinks to different storage buckets. I'd like to combine them into a single bucket. But the stackdriver export doesn't add any distinguishing information to the object names it creates; they all look like cloudaudit.googleapis.com/activity/2017/11/14/00:00:00_00:59:59_S0.json
What will happen if I start pushing them all to a single bucket? Will the different project sinks overwrite each other's objects? Is there any way to distinguish which project created the logs just from the object?
If not, I guess I should switch to pubsub sinks, and then write some code that produces objects with more desirable names. Are there any established patterns or examples for doing this?
Update: I filed https://issuetracker.google.com/issues/69371200 for this issue.
To enable this, just select a custom destination on the sink and point it to the bucket with this format: storage.googleapis.com/[BUCKET_ID].
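The gcloud equivalent would be roughly the following, run once per project (the sink name, bucket, and filter are placeholders; you also need to grant each sink's writer identity, printed by the command, write access to the bucket):
gcloud logging sinks create combined-audit-sink \
    storage.googleapis.com/my-central-log-bucket \
    --log-filter='logName:"cloudaudit.googleapis.com"'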
I've just enabled this in a couple of my projects, as I'm curious to see the results when exporting to a bucket. However, I have been using a single BQ sink for all my projects, and the tables created have all the logs mixed together, so no logs are lost when using a single BQ sink.
I'm assuming a GCS sink will work the same way, but I'll report back in a couple of days.
If a single bucket sink does not work, you can always use a single BQ sink (that will help in analyzing the logs), and when you no longer want to have them in BQ, export them and store the files wherever you want.
Also, since you'll be writing to your sink constantly, you can't use nearline or coldline, so the storage pricing is better in BQ than a regional bucket (0.02 USD/GB in BQ vs somewhere between 0.02 and 0.35 USD/GB for regional storage, depending on the region; BQ has 10GB free monthly, GCS 5GB).
I would generally recommend using a BQ sink, but I'll tell you what happens with my bucket logs.
Update:
A few hours later, I've verified that shared bucket sinks work pretty much as you would expect: logs are concatenated chronologically regardless of the project of origin, and only a single file is created for each time window. Hope this helps! (I still prefer BQ as a log sink...)
Update 2:
For the behavior you seek in the feature request, I would use BQ, but you could just as easily grep the project ID and separate the logs:
grep '"logName":"projects/<your-project-id>/' mixed-log.json > single-project-log.json
Or just get a Cloud Function triggered by bucket updates (so, every time you receive a log file in the sink) to run this for you.
Or namespace your buckets and have a Cloud Function move the files to wherever you need as soon as they are written.
The possibilities are endless!
If you have an organization or folder which includes all the projects that you want to collect logs from, then you can create a sink that collects from all projects in that org/folder.
Unfortunately, you cannot do this from the Cloud Console. Instead you must use gcloud with the --organization or --folder option, or the API.
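A sketch of the aggregated sink (the organization ID, sink name, bucket, and filter are placeholders):
gcloud logging sinks create org-wide-audit-sink \
    storage.googleapis.com/my-central-log-bucket \
    --organization=123456789012 \
    --include-children \
    --log-filter='logName:"cloudaudit.googleapis.com"'
Without --include-children the sink only exports logs from the organization resource itself, not from the projects underneath it.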