I'm running the awslogs agent on a server, and when I look in CloudWatch logs in the AWS console, the logs are about 60 minutes behind. Our server produces about 650MB of data per hour, and it appears that the agent is not able to keep up.
Here is our abbreviated config file:
[var/output/logs/application.json.log]
datetime_format = %Y-%m-%d %H:%M:%S
time_zone = UTC
file = var/output/logs/application.json.log*
log_stream_name = {hostname}
initial_position = start_of_file
log_group_name = ApplicationLog

[var/output/logs/service.json.log]
datetime_format = %Y-%m-%dT%H:%M:%S
time_zone = UTC
file = var/output/logs/service.json.log*
log_stream_name = {hostname}
initial_position = start_of_file
log_group_name = ServiceLog
Is there a common way to speed up the awslogs agent?
The amount of data (> 0.2MB/s) is not an issue for the agent. The agent has a capacity of about 3MB/s per log file. However, if you're using the same log stream for multiple log files, the agents write to the same stream, and end up blocking each other. The throughput more than halves when you share a stream between log files.
Also, there are a few properties that can be configured that may have an impact on performance:
buffer_duration = <integer>
batch_count = <integer>
batch_size = <integer>
To solve my issue, I did two things:
Drastically increase the batch size (defaults to 32768 bytes)
Use a different log stream for each log file
And the agent had no problems keeping up. Here's my final config file:
[var/output/logs/application.json.log]
datetime_format = %Y-%m-%d %H:%M:%S
time_zone = UTC
file = var/output/logs/application.json.log*
log_stream_name = {hostname}-app
initial_position = start_of_file
log_group_name = ApplicationLog
batch_size = 524288

[var/output/logs/service.json.log]
datetime_format = %Y-%m-%dT%H:%M:%S
time_zone = UTC
file = var/output/logs/service.json.log*
log_stream_name = {hostname}-service
initial_position = start_of_file
log_group_name = ServiceLog
batch_size = 524288
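A small lint for the two problems the answer above describes (several sections feeding the same log stream, and a batch_size outside what the agent accepts). This is an added example, not code from the question or answer; the two config texts are constructed to mirror the "before" and "after" files above, and the 1048576-byte maximum is taken from the CloudWatch PutLogEvents limit rather than from this thread.

```python
import configparser
from collections import defaultdict

BATCH_SIZE_DEFAULT = 32768    # the agent default quoted in the answer
BATCH_SIZE_MAX = 1048576      # PutLogEvents payload limit (assumed constant, from the CloudWatch docs)

# Constructed config modelled on the question's "before" file: both
# sections stream to {hostname}, and one batch_size is too large.
BEFORE_CONF = """
[general]
state_file = /var/awslogs/state/agent-state

[application]
datetime_format = %Y-%m-%d %H:%M:%S
file = /var/output/logs/application.json.log*
log_stream_name = {hostname}
initial_position = start_of_file
log_group_name = ApplicationLog

[service]
datetime_format = %Y-%m-%dT%H:%M:%S
file = /var/output/logs/service.json.log*
log_stream_name = {hostname}
initial_position = start_of_file
log_group_name = ServiceLog
batch_size = 2097152
"""

# The answer's fix: one stream per file and a larger, but legal, batch_size.
AFTER_CONF = """
[general]
state_file = /var/awslogs/state/agent-state

[application]
file = /var/output/logs/application.json.log*
log_stream_name = {hostname}-app
log_group_name = ApplicationLog
batch_size = 524288

[service]
file = /var/output/logs/service.json.log*
log_stream_name = {hostname}-service
log_group_name = ServiceLog
batch_size = 524288
"""


def load_conf(text):
    # Values hold '%' and '{...}' placeholders, so interpolation must be off.
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(text)
    return cp


def lint(text):
    """Return a list of human-readable findings; an empty list means nothing found."""
    cp = load_conf(text)
    findings = []
    sections_by_stream = defaultdict(list)
    for name in cp.sections():
        if name == "general":
            continue
        sec = cp[name]
        if "file" not in sec:
            findings.append(f"{name}: no 'file' key")
            continue
        sections_by_stream[sec.get("log_stream_name", "")].append(name)
        raw = sec.get("batch_size")
        if raw is None:
            findings.append(f"{name}: batch_size not set (default {BATCH_SIZE_DEFAULT} bytes)")
        elif int(raw) > BATCH_SIZE_MAX:
            findings.append(f"{name}: batch_size {raw} exceeds the {BATCH_SIZE_MAX}-byte maximum")
    # Sections that resolve to the same stream name contend for it, as the answer reports.
    for stream, names in sections_by_stream.items():
        if len(names) > 1:
            findings.append(f"log_stream_name {stream!r} shared by sections {', '.join(names)}")
    return findings


if __name__ == "__main__":
    for line in lint(BEFORE_CONF):
        print("before:", line)
    print("after:", lint(AFTER_CONF) or "clean")
```

Run it against /var/awslogs/etc/awslogs.conf by reading the file into `lint()`; the check on shared stream names is by literal value, so two sections both using `{hostname}` are flagged even when their log groups differ, which is exactly the case the answer hit.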
The awslogs agent supports log rotation, so this:
file = var/output/logs/application.json.log*
would pick up too many files? Use
file = var/output/logs/application.json.log
to speed up the process.
I want to create separate log streams for each file with the pattern sync-*.log (for example filename sync-site1.log, stream name server-sync-site1). There are a lot of log files with the sync-*.log pattern, so doing it manually for each one is a bad option for me. How can I do it? I'm using an older agent.
Here is my current config:
[/home/ec2-user/log/sync/sync-*.log]
datetime_format = %Y-%m-%dT%H:%M:%S,%f
file = /home/ec2-user/log/sync/sync-*.log
buffer_duration = 5000
log_stream_name = server-sync
initial_position = end_of_file
multi_line_start_pattern = {datetime_format}
log_group_name = server
There is a well-defined API call for this purpose, CreateLogStream; the corresponding API call in boto3 is create_stream.
import boto3

# Group and stream names taken from the question's config.
client = boto3.client("logs")
response = client.create_log_stream(
    logGroupName="server",
    logStreamName="server-sync-site1",
)
There is a terraform resource as well named aws_cloudwatch_log_stream
resource "aws_cloudwatch_log_group" "yada" {
  name = "Yada"
}

resource "aws_cloudwatch_log_stream" "foo" {
  name           = "SampleLogStream1234"
  log_group_name = aws_cloudwatch_log_group.yada.name
}
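The question asks for one stream per sync-*.log file without writing each stanza by hand. The older agent reads its streams from awslogs.conf, so one approach is to generate the stanzas from the files on disk. This is an added example, not from the question or answer; the stanza template copies the question's settings, the file names in the test are constructed, and `render_from_glob` simply expands the question's glob on the host.

```python
import configparser
import glob
import os
import re

# One stanza per file, following the question's config (older agent syntax).
STANZA = """[{section}]
datetime_format = %Y-%m-%dT%H:%M:%S,%f
file = {path}
buffer_duration = 5000
log_stream_name = {stream}
initial_position = end_of_file
multi_line_start_pattern = {{datetime_format}}
log_group_name = server
"""


def stream_name_for(path, prefix="server-"):
    """sync-site1.log -> server-sync-site1, the naming the question asks for.
    ':' and '*' are not allowed in CloudWatch stream names, so replace them."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return prefix + re.sub(r"[:*]", "_", stem)


def render_stanzas(paths):
    """Render a stanza per path; the stream name doubles as the section name,
    which keeps section names unique as well."""
    return "\n".join(
        STANZA.format(section=stream_name_for(p), path=p, stream=stream_name_for(p))
        for p in sorted(paths)
    )


def render_from_glob(pattern="/home/ec2-user/log/sync/sync-*.log"):
    """Expand the glob on this host and render stanzas for every match."""
    return render_stanzas(glob.glob(pattern))


def streams_in(conf_text):
    """Parse rendered config back into {file: stream}, as a self-check."""
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(conf_text)
    return {cp[s]["file"]: cp[s]["log_stream_name"] for s in cp.sections()}


if __name__ == "__main__":
    print(render_from_glob())
```

Append the output below the existing [general] section (the one holding state_file) and restart the agent; files created after that need a re-run, since the agent only learns streams from the config at startup.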
I am trying to use AWS CloudWatch to maintain the application logs in an Ubuntu EC2 instance. I have installed the awslogs agent using the following command, as suggested in their documentation, to monitor the file application.log and push any new entries in the file to CloudWatch.
Setup command - sudo python3 ./awslogs-agent-setup.py --region ap-south-1
It was working fine for a day when I tested it out after setting it up, then it stopped working from the next day. I can see that the changes in the log files are being detected by the AWS Agent, as there is an entry in the awslogs.log file as soon as there is a new entry in the application.log file. However, the same updates are not being pushed/reflected in the CloudWatch console.
What might have gone wrong here?
Entry in /var/log/awslogs.log
2020-02-27 12:19:03,376 - cwlogs.push.reader - WARNING - 1388 - Thread-4 - Fall back to previous event time: {'end_position': 10483213, 'timestamp': 1582261391000, 'start_position': 10483151}, previousEventTime: 1582261391000, reason: timestamp could not be parsed from message.
2020-02-27 12:19:07,437 - cwlogs.push.publisher - INFO - 1388 - Thread-3 - Log group: branchpayout-python-pilot, log stream: ip-172-27-99-136_application.log, queue size: 0, Publish batch: {'fallback_events_count': 2, 'source_id': 'c0bd7124acf1c35ede963da6b8ec9882', 'num_of_events': 2, 'first_event': {'end_position': 10483151, 'timestamp': 1582261391000, 'start_position': 10482278}, 'skipped_events_count': 0, 'batch_size_in_bytes': 985, 'last_event': {'end_position': 10483213, 'timestamp': 1582261391000, 'start_position': 10483151}}
Configuration in /var/awslogs/etc/awslogs.conf
[/home/ubuntu/application-name/application.log]
file = /home/ubuntu/application-name/application.log
datetime_format = %Y-%m-%d %H:%M:%S,%f
log_stream_name = {hostname}_application.log
buffer_duration = 5000
log_group_name = branchpayout-python-pilot
initial_position = end_of_file
multi_line_start_pattern = {datetime_format}
Check your log format and update your awslogs.conf accordingly.
For me the nginx access log format in access.log was "%d/%b/%Y:%H:%M:%S %z", hence my config file contains:
datetime_format = %d/%b/%Y:%H:%M:%S %z
Below are some examples.

Log file               Sample                        datetime_format
Nginx error.log        2017/08/12 05:04:00           %Y/%m/%d %H:%M:%S
Nginx access.log       12/Aug/2017:06:19:17 +0900    %d/%b/%Y:%H:%M:%S %z
php-fpm error.log      12-Aug-2017 05:24:38          %d-%b-%Y %H:%M:%S
php-fpm www-error.log  10-Aug-2017 23:40:46 UTC      %d-%b-%Y %H:%M:%S
messages               Aug 12 06:13:36               %b %d %H:%M:%S
secure                 Aug 11 04:03:33               %b %d %H:%M:%S
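The "timestamp could not be parsed from message" warning in the question is what happens when datetime_format and the log lines disagree, and the answer's table is the fix. The checker below lets you try a format against sample lines before restarting the agent. It is an added example, not code from the thread or the agent: the sample lines are constructed around the table's timestamps, and the token-window search is my own approximation of the agent's behaviour (strptime accepts the same % directives the agent uses).

```python
from datetime import datetime

MAX_TOKENS = 4   # a timestamp rarely spans more than a few whitespace-separated tokens


def find_timestamp(line, fmt):
    """Return the first datetime that `fmt` parses from a run of tokens in `line`,
    or None. Every window of up to MAX_TOKENS tokens is tried, with surrounding
    brackets stripped, so a timestamp sitting mid-line (nginx access.log) is found too."""
    tokens = line.split()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + MAX_TOKENS, len(tokens)) + 1):
            candidate = " ".join(tokens[i:j]).strip("[]")
            try:
                return datetime.strptime(candidate, fmt)
            except ValueError:
                continue
    return None


def check_format(fmt, lines):
    """Map each line to its parsed timestamp; None means the agent would have to
    fall back to the previous event time, as in the question's awslogs.log."""
    return {line: find_timestamp(line, fmt) for line in lines}


def unparsed(fmt, lines):
    """Just the lines the format does not match."""
    return [line for line, ts in check_format(fmt, lines).items() if ts is None]


# Constructed log lines built around the samples in the table above.
SAMPLES = {
    "%Y/%m/%d %H:%M:%S": "2017/08/12 05:04:00 [error] 1234#0: *1 open() failed",
    "%d/%b/%Y:%H:%M:%S %z": '10.0.0.1 - - [12/Aug/2017:06:19:17 +0900] "GET / HTTP/1.1" 200 612',
    "%d-%b-%Y %H:%M:%S": "[12-Aug-2017 05:24:38] NOTICE: fpm is running, pid 1",
    "%b %d %H:%M:%S": "Aug 12 06:13:36 ip-10-0-0-1 sshd[812]: Accepted publickey for ec2-user",
    "%Y-%m-%d %H:%M:%S,%f": "2020-02-27 12:19:03,376 INFO worker started",
}

# A line that lacks the milliseconds the question's format demands.
NO_MILLIS = "2020-02-27 12:19:03 INFO worker started"


if __name__ == "__main__":
    for fmt, line in SAMPLES.items():
        print(f"{fmt:28} -> {find_timestamp(line, fmt)}")
    print("unparsed:", unparsed("%Y-%m-%d %H:%M:%S,%f", [NO_MILLIS]))
```

Formats without a year (%b %d %H:%M:%S) parse to 1900, which is fine for this check; the agent fills in the year itself. Feed `unparsed()` the first few hundred lines of the real file: any hits are the lines that will produce the fallback warning.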
I have two log files with multi-line log statements. Both of them have the same datetime format at the beginning of each log statement. The configuration looks like this:
[general]
state_file = /var/lib/awslogs/agent-state

[/opt/logdir/log1.0]
datetime_format = %Y-%m-%d %H:%M:%S
file = /opt/logdir/log1.0
log_stream_name = /opt/logdir/logs/log1.0
initial_position = start_of_file
multi_line_start_pattern = {datetime_format}
log_group_name = my.log.group

[/opt/logdir/log2-console.log]
datetime_format = %Y-%m-%d %H:%M:%S
file = /opt/logdir/log2-console.log
log_stream_name = /opt/logdir/log2-console.log
initial_position = start_of_file
multi_line_start_pattern = {datetime_format}
log_group_name = my.log.group
The CloudWatch Logs agent is sending log1.0 logs correctly to my log group on CloudWatch; however, it's not sending logs for log2-console.log.
awslogs.log says:
2016-11-15 08:11:41,308 - cwlogs.push.batch - WARNING - 3593 - Thread-4 - Skip event: {'timestamp': 1479196444000, 'start_position': 42330916L, 'end_position': 42331504L}, reason: timestamp is more than 2 hours in future.
2016-11-15 08:11:41,308 - cwlogs.push.batch - WARNING - 3593 - Thread-4 - Skip event: {'timestamp': 1479196451000, 'start_position': 42331504L, 'end_position': 42332092L}, reason: timestamp is more than 2 hours in future.
Though the server time is correct. Another weird thing is that the positions mentioned in start_position and end_position do not exist in the actual log file being pushed.
Anyone else experiencing this issue?
I was able to fix this.
The state of awslogs was broken. The state is stored in a sqlite database in /var/awslogs/state/agent-state. You can access it via
sudo sqlite3 /var/awslogs/state/agent-state
sudo is needed to have write access.
List all streams with
select * from stream_state;
Look up your log stream and note the source_id which is part of a json data structure in the v column.
Then, list all records with this source_id (in my case it was 7675f84405fcb8fe5b6bb14eaa0c4bfd) in the push_state table
select * from push_state where k='7675f84405fcb8fe5b6bb14eaa0c4bfd';
The resulting record has a JSON data structure in the v column which contains a batch_timestamp, and this batch_timestamp seems to be wrong. It was in the past, and any newer (more than 2 hours) log entries were not processed anymore.
The solution is to update this record. Copy the v column, replace the batch_timestamp with the current timestamp and update with something like
update push_state set v='... insert new value here ...' where k='7675f84405fcb8fe5b6bb14eaa0c4bfd';
Restart the service with
sudo /etc/init.d/awslogs restart
I hope it works for you!
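The same repair as the answer above, done with the JSON parsed rather than pasted by hand, so only batch_timestamp changes and the rest of the record survives intact. This is an added example, not the answerer's commands: the two k/v tables and their rows are a constructed stand-in for the real /var/awslogs/state/agent-state, so compare the column names against the output of `select * from stream_state;` before pointing `open_state()` at the real file (and stop the service and back the file up first).

```python
import json
import sqlite3
import time

SOURCE_ID = "7675f84405fcb8fe5b6bb14eaa0c4bfd"   # the source_id quoted in the answer


def open_state(path=":memory:"):
    """Open agent-state (default: an in-memory stand-in for the example)."""
    return sqlite3.connect(path)


def seed_constructed_state(conn):
    """Constructed data: one stream whose push_state carries a stale batch_timestamp.
    Column names k/v follow the answer; the JSON field names are assumptions."""
    conn.executescript("""
        CREATE TABLE stream_state (k TEXT PRIMARY KEY, v TEXT);
        CREATE TABLE push_state (k TEXT PRIMARY KEY, v TEXT);
    """)
    conn.execute("INSERT INTO stream_state VALUES (?, ?)", (
        "my.log.group:/opt/logdir/log2-console.log",
        json.dumps({"source_id": SOURCE_ID,
                    "group_name": "my.log.group",
                    "stream_name": "/opt/logdir/log2-console.log"})))
    stale = 1479196444000 - 10 * 86400 * 1000      # ten days before the skipped event
    conn.execute("INSERT INTO push_state VALUES (?, ?)", (
        SOURCE_ID,
        json.dumps({"batch_timestamp": stale, "sequence_token": "constructed"})))
    conn.commit()


def find_source_id(conn, group, stream):
    """Step 1 of the answer: locate the stream's source_id inside stream_state.v."""
    for (v,) in conn.execute("SELECT v FROM stream_state"):
        rec = json.loads(v)
        if rec.get("group_name") == group and rec.get("stream_name") == stream:
            return rec["source_id"]
    return None


def batch_timestamp(conn, source_id):
    """Step 2: read batch_timestamp from push_state for that source_id."""
    row = conn.execute("SELECT v FROM push_state WHERE k = ?", (source_id,)).fetchone()
    return json.loads(row[0])["batch_timestamp"] if row else None


def reset_batch_timestamp(conn, source_id, now_ms=None):
    """Step 3: rewrite only batch_timestamp (ms since epoch); return (old, new)."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    row = conn.execute("SELECT v FROM push_state WHERE k = ?", (source_id,)).fetchone()
    if row is None:
        raise KeyError(source_id)
    rec = json.loads(row[0])
    old = rec["batch_timestamp"]
    rec["batch_timestamp"] = now_ms
    conn.execute("UPDATE push_state SET v = ? WHERE k = ?", (json.dumps(rec), source_id))
    conn.commit()
    return old, now_ms


def would_skip(event_ms, batch_ms, window_ms=2 * 3600 * 1000):
    """The answer's reading of the warning: an event more than 2 hours ahead of
    the stale batch_timestamp is skipped as 'in the future'."""
    return event_ms - batch_ms > window_ms


if __name__ == "__main__":
    conn = open_state()
    seed_constructed_state(conn)
    sid = find_source_id(conn, "my.log.group", "/opt/logdir/log2-console.log")
    print("source_id:", sid, "stale batch_timestamp:", batch_timestamp(conn, sid))
    print("reset:", reset_batch_timestamp(conn, sid))
```

Restart the service afterwards, as the answer says; the agent re-reads the state file on startup.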
We had the same issue and the following steps fixed it.
If log groups are not updating with the latest events, run these steps:
Stopped the awslogs service.
Deleted the file /var/awslogs/state/agent-state.
Updated the /var/awslogs/etc/awslogs.conf configuration from hostname to instance ID, e.g.
log_stream_name = {hostname} to log_stream_name = {instance_id}
Started the awslogs service.
I was able to resolve this issue on Amazon Linux by:
sudo yum reinstall awslogs
sudo service awslogs restart
This method retained my config files in /var/awslogs/, though you may wish to back them up before a reinstall.
Note: In my troubleshooting, I had also deleted my Log Group via the AWS Console. The restart fully reloaded all historical logs, but at the present timestamp, which is of less value. I'm unsure whether deleting the Log Group was necessary for this method to work. You might want to look at setting the initial_position config to end_of_file before you restart.
I found the reason. The time zone in my docker container was inconsistent with the time zone of my host computer. After setting the two time zones to be consistent, the problem was solved.
I am trying to use Flume-ng to grab 128MB of log information and put it into a file in HDFS. But the HDFS rolling options are not working: Flume-ng writes a new log file every second. How can I fix my flume.conf file?
agent01.sources = avroGenSrc
agent01.channels = memoryChannel hdfsChannel
agent01.sinks = fileSink hadoopSink
# For each one of the sources, the type is defined
agent01.sources.avroGenSrc.type = avro
agent01.sources.avroGenSrc.bind = dev-hadoop03.ncl
agent01.sources.avroGenSrc.port = 3333
# The channel can be defined as follows.
agent01.sources.avroGenSrc.channels = memoryChannel hdfsChannel
# Each sink's type must be defined
agent01.sinks.fileSink.type = file_roll
agent01.sinks.fileSink.sink.directory = /home1/irteam/flume/data
agent01.sinks.fileSink.sink.rollInterval = 3600
agent01.sinks.fileSink.sink.batchSize = 100
#Specify the channel the sink should use
agent01.sinks.fileSink.channel = memoryChannel
agent01.sinks.hadoopSink.type = hdfs
agent01.sinks.hadoopSink.hdfs.useLocalTimeStamp = true
agent01.sinks.hadoopSink.hdfs.path = hdfs://dev-hadoop04.ncl:9000/user/hive/warehouse/raw_logs/year=%Y/month=%m/day=%d
agent01.sinks.hadoopSink.hdfs.filePrefix = AccessLog.%Y-%m-%d.%Hh
agent01.sinks.hadoopSink.hdfs.fileType = DataStream
agent01.sinks.hadoopSink.hdfs.writeFormat = Text
agent01.sinks.hadoopSink.hdfs.rollInterval = 0
agent01.sinks.hadoopSink.hdfs.rollSize = 134217728
agent01.sinks.hadoopSink.hdfs.rollCount = 0
#Specify the channel the sink should use
agent01.sinks.hadoopSink.channel = hdfsChannel
# Each channel's type is defined.
agent01.channels.memoryChannel.type = memory
agent01.channels.hdfsChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent01.channels.memoryChannel.capacity = 100000
agent01.channels.memoryChannel.transactionCapacity = 10000
agent01.channels.hdfsChannel.capacity = 100000
agent01.channels.hdfsChannel.transactionCapacity = 10000
I found this solution. A dfs.replication mismatch causes this problem.
In my hadoop conf (hadoop-2.7.2/etc/hadoop/hdfs-site.xml)
I have 2 data nodes so I changed it to
and I added this config in flume.conf:
agent01.sinks.hadoopSink.hdfs.minBlockReplicas = 2
Thanks to:
Flume HDFS sink keeps rolling small files
My .s3cfg with GPG encryption passphrase and other security settings. Would you recommend other security hardening?
[default]
access_key = $USERNAME
access_token =
add_encoding_exts =
add_headers =
bucket_location = eu-central-1
ca_certs_file =
cache_file =
check_ssl_certificate = True
check_ssl_hostname = True
cloudfront_host = cloudfront.amazonaws.com
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date =
expiry_days =
expiry_prefix =
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/local/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = $PASSPHRASE
guess_mime_type = True
host_base = s3.amazonaws.com
host_bucket = %(bucket)s.s3.amazonaws.com
human_readable_sizes = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
kms_key =
limitrate = 0
list_md5 = False
log_target_prefix =
long_listing = False
max_delete = -1
mime_type =
multipart_chunk_size_mb = 15
multipart_max_chunks = 10000
preserve_attrs = True
progress_meter = True
proxy_host =
proxy_port = 0
put_continue = False
recursive = False
recv_chunk = 65536
reduced_redundancy = False
requester_pays = False
restore_days = 1
secret_key = $PASSWORD
send_chunk = 65536
server_side_encryption = False
signature_v2 = False
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 300
stats = False
stop_on_error = False
storage_class =
urlencoding_mode = normal
use_https = True
use_mime_magic = True
verbosity = WARNING
website_endpoint = http://%(bucket)s.s3-website-%(location)s.amazonaws.com/
website_error =
website_index = index.html
I use this command to upload/sync my local folder to Amazon S3.
s3cmd -e -v put --recursive --dry-run /Users/$USERNAME/Downloads/ s3://dgtrtrtgth777
INFO: Compiling list of local files...
INFO: Running stat() and reading/calculating MD5 values on 15957 files, this may take some time...
INFO: [1000/15957]
INFO: [2000/15957]
INFO: [3000/15957]
INFO: [4000/15957]
INFO: [5000/15957]
INFO: [6000/15957]
INFO: [7000/15957]
INFO: [8000/15957]
INFO: [9000/15957]
INFO: [10000/15957]
INFO: [11000/15957]
INFO: [12000/15957]
INFO: [13000/15957]
INFO: [14000/15957]
INFO: [15000/15957]
I tested the encryption with Transmit GUI S3 Client and didn't get plain text files.
But I see the original filename. I wish to change the filename to a random value, but keep the original filename locally (a mapping?). How can I do this?
What are downsides doing so if I need to restore the files? I use Amazon S3 only as a backup, in addition to my TimeMachine backup.
If you use "random" names, then it isn't sync.
If your only record on the filenames/mapping is local, it will be impossible to restore your backup in case of a local failure.
If you don't need all versions of your files I'd suggest putting everything in a (possibly encrypted) compressed tarball before uploading it.
Otherwise, you will have to write a small script that lists all files and individually does an s3cmd put specifying a random destination, where the mapping is appended to a log file, which should be the first thing you s3cmd put to your server. I don't recommend this for something as crucial as storing your backups.
A skeleton showing how this could work:
# Save all files in backupX.sh where X is the version number
# (-type f skips directories; srand() seeds awk's generator, which otherwise
# repeats the same "random" sequence on every run; quoting $0 survives spaces)
find /Users/$USERNAME/Downloads/ -type f | awk 'BEGIN { srand() } { print "s3cmd -e -v put \"" $0 "\" s3://dgtrshitcrapola/" int(rand() * 1000000) }' > backupX.sh
# Upload the mapping file
s3cmd -e -v put backupX.sh s3://dgtrshitcrapola/
# Upload the actual files
sh backupX.sh
# Add cleanup code here
However, you will need to handle filename collisions, failed uploads, versioning clashes, ... why not use an existing tool that backs up to S3?
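The random-name scheme the answer sketches, written out with the mapping kept as a manifest and the collision case it warns about handled explicitly. This is an added example, not the answerer's script: the bucket name is constructed, the manifest format (a JSON object of local path to object key) is my own choice, and the functions only produce s3cmd command lines rather than running them.

```python
import json
import os
import secrets
import shlex

BUCKET = "example-backup-bucket"   # constructed; replace with your own bucket


def build_manifest(paths, token_source=secrets.token_hex, nbytes=8):
    """Map each local path to a random object key, redrawing on collision
    (the 'filename collisions' the answer says you must handle). token_source
    is injectable so the collision path can be exercised deterministically."""
    mapping, used = {}, set()
    for path in paths:
        key = token_source(nbytes)
        while key in used:          # 64-bit keys rarely collide, but never silently overwrite
            key = token_source(nbytes)
        used.add(key)
        mapping[path] = key
    return mapping


def manifest_json(mapping):
    """The mapping file to upload first; with s3cmd -e it is GPG-encrypted like the rest."""
    return json.dumps(mapping, indent=2, sort_keys=True)


def load_manifest(text):
    """Inverse of manifest_json, used at restore time."""
    return json.loads(text)


def upload_commands(mapping, bucket=BUCKET, manifest_name="manifest.json"):
    """s3cmd lines in the order the answer recommends: mapping first, then the files."""
    cmds = [f"s3cmd -e -v put {manifest_name} s3://{bucket}/"]
    cmds += [f"s3cmd -e -v put {shlex.quote(p)} s3://{bucket}/{k}" for p, k in mapping.items()]
    return cmds


def restore_commands(mapping, dest_root, bucket=BUCKET):
    """Fetch each random key back to its original name under dest_root. Without the
    manifest this step is impossible, which is the downside the answer points out."""
    return [
        f"s3cmd get s3://{bucket}/{k} {shlex.quote(os.path.join(dest_root, p.lstrip('/')))}"
        for p, k in mapping.items()
    ]


if __name__ == "__main__":
    # Constructed file list standing in for the question's Downloads folder.
    files = ["/Users/me/Downloads/tax return 2019.pdf", "/Users/me/Downloads/notes.txt"]
    manifest = build_manifest(files)
    print(manifest_json(manifest))
    print("\n".join(upload_commands(manifest)))
    print("\n".join(restore_commands(load_manifest(manifest_json(manifest)), "/tmp/restore")))
```

Keep a second copy of the manifest somewhere other than the machine being backed up: as the answer says, if the only mapping is local, a local failure leaves the bucket full of unnamed blobs.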