Logstash skipping too many log lines

I set up Logstash on my Windows box and run 7 instances of it, each with its own folder of log files as input. I ran them all at the same time and pointed them at an AWS Elasticsearch cluster with 7 r3.xlarge data nodes and 3 r3.xlarge master nodes. All input files combined are around 9 GB. After all the Logstash instances finished, I had only 6 million events in Elasticsearch; there should be around 30 million. I went back to one of my Logstash cmd windows and looked at the last event it printed. It did not correspond to the last log line of the file it was read from; it was roughly the 50th line from the bottom. The second-to-last event in the same window did not correspond to the line right before that one either; I found it about 30 log lines further up in the file. So it is apparent that Logstash is skipping log lines.
Now I checked my Elasticsearch thread pool stats and they show all zeros, so nothing got dropped? (I looked at bulk.rejected in particular.)
_cat/thread_pool?v
Is this data cumulative or does it get refreshed?
Which brings me to my second question: if Logstash itself dropped the log lines for some reason, where and how can I troubleshoot that? I know that none of my Logstash instances crashed. All I know is that it happily dropped 70% of my logs, and I have no error log or clue to go on as to what happened.
Edit:
My logstash configuration:
(It is as if it is ignoring all my logs for Friday, Saturday and Sunday, and only processing Monday's (3/21).)
input {
  file {
    type => "apache_logs"
    path => "D:/logs/apache_logs/all/ssl_access.*"
    start_position => "beginning"
    sincedb_path => "NUL"
  }
}
filter {
  grok {
    match => ["message","%{IPORHOST:client_ip} (?<username>[-]) (?<password>[-]) \[(?<timestamp>\d{2}[/][a-zA-Z]{3}[/]\d{4}:\d{2}:\d{2}:\d{2}\s-\d{0,4})\] \"%{GREEDYDATA:request}\" %{NOTSPACE:obssocookie} %{NOTSPACE:ps_sso_uid_in} %{NOTSPACE:ps_sso_uid_out} (?<status>[0-9]{3}) (?<bytes>[0-9]{1,}|-) %{NOTSPACE:protocol} %{NOTSPACE:ciphers} \"%{GREEDYDATA:referrer}\" \"%{GREEDYDATA:user_agent}\""]
    match => [ "path", "(?<app_node>webpr[0-9]{2}[a-z]{0,1})" ]
    add_field => { "server_node" => "%{app_node}" }
    break_on_match => false
  }
  mutate {
    gsub => ["obssocookie","^.*=",""]
  }
  mutate {
    gsub => ["ps_sso_uid_in","^.*=",""]
  }
  mutate {
    gsub => ["ps_sso_uid_out","^.*=",""]
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    remove_field => "timestamp"
  }
  geoip {
    source => "client_ip"
  }
  if [geoip] {
    mutate {
      add_field => {
        "ip_type" => "public"
      }
    }
  } else {
    mutate {
      add_field => {
        "ip_type" => "private"
      }
    }
  }
}
output {
  stdout { codec => rubydebug }
  amazon_es {
    hosts => ["apache-logs-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-west-2.es.amazonaws.com"]
    region => "us-west-2"
    aws_access_key_id => 'xxxxxxxxxxxxxxxxxx'
    aws_secret_access_key => 'xxxxxxxxxxxxxxxxxxxxx'
    index => "logstash-apache-friday"
  }
}
How can I find out how many events Logstash itself dropped, as opposed to how many Elasticsearch rejected? I already checked through the API and bulk.rejected = 0.

Found my culprit. I have to include the following in my file input; it looks like the plugin skips any files older than 24 hours by default:
ignore_older => 0
Kind of surprising; I would expect to add settings only when I want to narrow my input, and otherwise Logstash should process any file, older than 24 hours or not. Really not something that was obvious.
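For reference, a sketch of the file input with that fix applied, using the same paths and settings as the configuration above; per the asker's finding, ignore_older => 0 stops the file input from skipping files last modified more than 24 hours ago:
input {
  file {
    type => "apache_logs"
    path => "D:/logs/apache_logs/all/ssl_access.*"
    start_position => "beginning"
    sincedb_path => "NUL"
    # read files regardless of age; by default the plugin ignores files older than 24 hours
    ignore_older => 0
  }
}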

Related

GCS bucket update frequency, how do I set it?

I am trying to use Logstash to push messages to GCS using the output plugin below. I am able to see the messages in the bucket, however they appear every hour and not in real time. Where can I change the frequency of the upload?
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-google_cloud_storage.html
P.S.: I tried adding this to my config file, but to no avail:
flush_interval_secs => 2
My config looks something like this:
input {
  kafka {
    zk_connect => "xxxxxxxxxxxxxxxxxxxxxx"
    group_id => "yyyyyyyyyyyyyyyyyyyyyyyyyyyy"
    topic_id => "zzzzzzzzzzzzzzzzz"
    reset_beginning => true
    auto_offset_reset => "smallest"
  }
}
output {
  google_cloud_storage {
    bucket => "aaaa/bbb"
    flush_interval_secs => 15
  }
  stdout {
    codec => rubydebug
  }
}
From the documentation, for uploader_interval_secs:
Uploader interval when uploading new files to GCS. Adjust time based on your time pattern (for example, for hourly files, this interval can be around one hour).
Default value is 60.
Example:
output {
  google_cloud_storage {
    bucket => "my_bucket"               # required
    date_pattern => "%Y-%m-%dT%H:00"    # optional
    uploader_interval_secs => 60        # optional
  }
}
Additionally, you can set date_pattern, which is the time pattern used for the log file names.
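Applied to the asker's config, the output block might look roughly like this; the bucket name is the asker's placeholder and the exact interval values are assumptions:
output {
  google_cloud_storage {
    bucket => "aaaa/bbb"
    flush_interval_secs => 2            # how often the local temp file is flushed to disk
    uploader_interval_secs => 60        # the documented setting above: how often finished files are uploaded to GCS
    date_pattern => "%Y-%m-%dT%H:00"    # a new log object is started per time bucket
  }
}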

Get AWS CloudTrail log to Kibana

Is there any better solution to get AWS CloudTrail logs into Kibana? Here I am using the Elasticsearch Service from AWS.
Here's the Logstash input that I use with 1.4.2. It works well, though I suspect it is noisy (it requires a lot of S3 GET/HEAD/LIST requests).
input {
  s3 {
    bucket => "bucketname"
    delete => false
    interval => 60 # seconds
    prefix => "cloudtrail/"
    type => "cloudtrail"
    codec => "cloudtrail"
    credentials => "/etc/logstash/s3_credentials.ini"
    sincedb_path => "/opt/logstash_cloudtrail/sincedb"
  }
}
filter {
  if [type] == "cloudtrail" {
    mutate {
      gsub => [ "eventSource", "\.amazonaws\.com$", "" ]
      add_field => {
        "document_id" => "%{eventID}"
      }
    }
    if ![ingest_time] {
      ruby {
        code => "event['ingest_time'] = Time.now.utc.strftime '%FT%TZ'"
      }
    }
    ruby {
      code => "event.cancel if (Time.now.to_f - event['@timestamp'].to_f) > (60 * 60 * 24 * 1)"
    }
    ruby {
      code => "event['ingest_delay_hours'] = (Time.now.to_f - event['@timestamp'].to_f) / 3600"
    }
    # drop events more than a day old, we're probably catching up very poorly
    if [ingest_delay_hours] > 24 {
      drop {}
    }
    # example of an event that is noisy and I don't care about
    if [eventSource] == "elasticloadbalancing" and [eventName] == "describeInstanceHealth" and [userIdentity][userName] == "deploy-s3" {
      drop {}
    }
  }
}
The credentials.ini format is explained on the s3 input page; it's just this:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
I also have a search that sends results to our #chatops but I'm not posting that here.
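The filter above tags each event with a document_id built from eventID; a minimal output sketch, not part of the original answer, that reuses it as the Elasticsearch document ID so re-read CloudTrail files update existing documents instead of creating duplicates (host and index name are placeholders; with the AWS-hosted service you may need the amazon_es output shown in the first question instead):
output {
  elasticsearch {
    protocol => "http"
    host => "localhost"                   # placeholder; point at your cluster
    index => "cloudtrail-%{+YYYY.MM.dd}"
    document_id => "%{document_id}"       # dedupe on the CloudTrail eventID
  }
}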
If you haven't tried it already, you can use CloudTrail and CloudWatch Logs together. Then use a CloudWatch Logs subscription to send the CloudTrail data to Elasticsearch.
Once that is done you should be able to define a time-based Kibana index pattern that starts with cwl*.
Cheers-

GROK Pattern Works with GROK Debugger but not in Logstash GROK

I have a grok pattern I am trying to use in Logstash that works within the Grok Debugger website but not within Logstash. I've tried different configurations with no success. I'm hoping someone can help me identify why this is not working.
Input: 2015-04-15 12:43:23.788 1883 AUDIT nova.compute.resource_tracker [-] Free disk (GB): -7
Search Pattern: Free disk \(GB\): \-%{INT:auth_method}
I want to extract the value 7
Thanks for your help!!!!
Hate to say it, OP, but it works for me:
input {
  stdin {}
}
filter {
  grok {
    match => [ "message", "Free disk \(GB\): \-%{INT:auth_method}" ]
  }
}
output {
  stdout { codec => rubydebug }
}
Gives you this:
2015-04-15 12:43:23.788 1883 AUDIT nova.compute.resource_tracker [-] Free disk (GB): -7
{
        "message" => "2015-04-15 12:43:23.788 1883 AUDIT nova.compute.resource_tracker [-] Free disk (GB): -7",
       "@version" => "1",
     "@timestamp" => "2015-04-16T15:57:17.229Z",
           "host" => "0.0.0.0",
    "auth_method" => "7"
}
Check for extra spaces at the end of your pattern, perhaps?

Logstash conf file for parsing django exceptions

I have been trying to use Logstash, Elasticsearch, and Kibana to monitor my Django server.
I have set up the conf file as given below:
input {
  tcp { port => 5000 codec => json }
  udp { port => 5000 type => syslog }
}
output {
  elasticsearch_http {
    host => "127.0.0.1"
    port => 9200
  }
  stdout { codec => rubydebug }
}
But the messages logged are too lengthy and I could not find a way to parse them.
Any help is appreciated.
As far as I can tell, there is not a pattern or built-in that will directly parse Django exceptions.
You need to tell the forwarding agent to target the Django log files that you're generating, marking them as "type": "django".
Then, on the Logstash server, you can use the following:
pattern:
DJANGO_LOGLEVEL (DEBUG|INFO|ERROR|WARNING|CRITICAL)
DJANGO_LOG %{DJANGO_LOGLEVEL:log_level}\s+%{TIMESTAMP_ISO8601:log_timestamp}\s+%{TZ:log_tz}\s+%{NOTSPACE:logger}\s+%{WORD:module}\s+%{POSINT:proc_id}\s+%{GREEDYDATA:content}
filter:
filter {
  if [type] == "django" {
    grok {
      match => ["message", "%{DJANGO_LOG}" ]
    }
    date {
      # the DJANGO_LOG pattern above captures the time into log_timestamp
      match => [ "log_timestamp", "ISO8601", "YYYY-MM-dd HH:mm:ss,SSS" ]
      target => "@timestamp"
    }
  }
}
If you don't want to add a pattern file, you can expand the DJANGO_LOGLEVEL pattern inline in place of %{DJANGO_LOGLEVEL:log_level} and put the rest of the DJANGO_LOG expression directly into the grok match string.
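For illustration, a sketch of that inlined variant, assuming the same log layout the patterns above describe (no custom pattern file needed):
filter {
  if [type] == "django" {
    grok {
      # DJANGO_LOGLEVEL expanded into a named capture; the rest is the DJANGO_LOG body
      match => ["message", "(?<log_level>DEBUG|INFO|ERROR|WARNING|CRITICAL)\s+%{TIMESTAMP_ISO8601:log_timestamp}\s+%{TZ:log_tz}\s+%{NOTSPACE:logger}\s+%{WORD:module}\s+%{POSINT:proc_id}\s+%{GREEDYDATA:content}" ]
    }
    date {
      match => [ "log_timestamp", "ISO8601", "YYYY-MM-dd HH:mm:ss,SSS" ]
      target => "@timestamp"
    }
  }
}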

good resources for grok patterns for python log file

I want to use Logstash for parsing Python log files; where can I find resources that help me do that? For example:
20131113T052627.769: myapp.py: 240: INFO: User Niranjan Logged-in
From this I need to capture the time information and also some of the data.
I had exactly the same problem/need. I couldn't really find a solution to this; no available grok patterns really matched the Python logging output, so I simply went ahead and wrote a custom grok pattern which I've added naively into patterns/grok-patterns.
DATESTAMP_PYTHON %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND},%{INT}
The logstash configuration I wrote gave me nice fields.
@timestamp
level
message
I added an extra field which I called pymodule, which should show you the Python module that produced the log entry.
My Logstash configuration file looks like this (ignore the sincedb_path; it is simply a way of forcing Logstash to read the entire log file every time you run it):
input {
  file {
    path => "/tmp/logging_file"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  grok {
    match => [ "message", "%{DATESTAMP_PYTHON:timestamp} - %{DATA:pymodule} - %{LOGLEVEL:level} - %{GREEDYDATA:logmessage}" ]
  }
  mutate {
    rename => [ "logmessage", "message" ]
  }
  date {
    timezone => "Europe/Luxembourg"
    locale => "en"
    match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss,SSS" ]
  }
}
output {
  stdout {
    codec => json
  }
}
Please note that I give absolutely no guarantee that this is the best or even a slightly acceptable solution.
Our Python log file has a slightly different format:
[2014-10-08 19:05:02,846] (6715) DEBUG:Our debug message here
So I was able to create a configuration file without any need for special patterns:
input {
  file {
    path => "/path/to/python.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => [ "message", "\[%{TIMESTAMP_ISO8601:timestamp}\] \(%{DATA:pyid}\) %{LOGLEVEL:level}\:%{GREEDYDATA:logmessage}" ]
  }
  mutate {
    rename => [ "logmessage", "message" ]
  }
  date {
    timezone => "Europe/London"
    locale => "en"
    match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss,SSS" ]
  }
}
output {
  elasticsearch {
    host => "localhost"
  }
  stdout {
    codec => rubydebug
  }
}
And this seems to work fine.
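Neither answer matches the exact layout shown in the original question (20131113T052627.769: myapp.py: 240: INFO: ...). As a rough sketch for that compact-timestamp format, with field names made up purely for illustration, a filter might look like:
filter {
  grok {
    # <compact timestamp>: <source file>: <line number>: <level>: <message>
    match => [ "message", "(?<timestamp>%{YEAR}%{MONTHNUM2}%{MONTHDAY}T%{HOUR}%{MINUTE}%{SECOND}): %{DATA:pyfile}: %{INT:lineno}: %{LOGLEVEL:level}: %{GREEDYDATA:logmessage}" ]
  }
  date {
    # parse 20131113T052627.769 with a quoted literal T
    match => [ "timestamp", "yyyyMMdd'T'HHmmss.SSS" ]
  }
}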