GCS bucket update frequency, how do I set it? - google-cloud-platform

I am trying to use Logstash to push messages to GCS using the output plugin below. I can see the messages in the bucket, but they appear once an hour rather than in real time. Where can I change how often the logs are sent?
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-google_cloud_storage.html
P.S.: I tried adding this to my config file, but it made no difference:
flush_interval_secs => 2
my config looks something like this:
input{
kafka {
zk_connect => "xxxxxxxxxxxxxxxxxxxxxx"
group_id => "yyyyyyyyyyyyyyyyyyyyyyyyyyyy"
topic_id => "zzzzzzzzzzzzzzzzz"
reset_beginning => true
auto_offset_reset => "smallest"
}
}
output
{
google_cloud_storage {
bucket => "aaaa/bbb"
flush_interval_secs => 15
}
stdout
{
codec => rubydebug
}
}

From the documentation (uploader_interval_secs):
Uploader interval when uploading new files to GCS. Adjust time based on your time pattern (for example, for hourly files, this interval can be around one hour).
Default value is 60.
Example:
output {
google_cloud_storage {
bucket => "my_bucket" # required
date_pattern => "%Y-%m-%dT%H:00" # optional
uploader_interval_secs => 60 # optional
}
}
You can also set date_pattern, which is the time pattern used for the log file.
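To see why files show up hourly with the example pattern, it may help to look at how date_pattern groups events. The following is a minimal Python sketch, not the plugin's code: it only applies the same strftime pattern to a handful of made-up event times. The helper name file_stamp, the sample times, and the per-minute pattern "%Y-%m-%dT%H:%M" are assumptions for illustration.

```python
from datetime import datetime, timedelta


def file_stamp(ts: datetime, date_pattern: str) -> str:
    """Return the time stamp a log file name would carry for this event time."""
    return ts.strftime(date_pattern)


# Six hypothetical events, 20 minutes apart: 10:00 through 11:40.
start = datetime(2016, 3, 21, 10, 0, 0)
events = [start + timedelta(minutes=20 * i) for i in range(6)]

# Hourly pattern from the documentation example: one stamp per hour.
hourly = {file_stamp(t, "%Y-%m-%dT%H:00") for t in events}

# Assumed finer pattern: one stamp per minute.
per_minute = {file_stamp(t, "%Y-%m-%dT%H:%M") for t in events}

print(sorted(hourly))
print(len(per_minute))
```

If the plugin only uploads a file once its time window has closed (which is how I read the documentation quoted above), then a finer date_pattern together with a short uploader_interval_secs should make objects appear in the bucket more often, at the cost of many small files.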

Related

Unable to implement Logstash pipeline with Kafka as input and S3 as output, with each message persisted as an individual file

How can I create a Logstash (https://www.elastic.co/logstash/) pipeline that writes each Kafka message to its own file in an AWS S3 bucket, with the file name taken from one of the attributes of the Kafka message?
I am able to set up a simple pipeline using the following.
I am using s3 output plugin:
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-s3.html
and Kafka input plugin :
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
input {
kafka {
bootstrap_servers => "mykafkaserver:9092"
topics => "document"
group_id => "xLogAna1"
auto_offset_reset => "earliest"
max_poll_records => 1
fetch_max_bytes=>10
}
}
output {
s3{
access_key_id => "XXXXXXXXXXXXXXX"
secret_access_key => "SSSSSSSSSSSSSSS"
region => "eu-west-1"
bucket => "<my-documnt-bucket>"
size_file => 1
time_file => 5
codec => "plain"
}
}
Since I want individual messages in individual files, I tweaked the max_poll_records and fetch_max_bytes parameters to fetch one message at a time, but it did not help. The resulting S3 files each contain numerous Kafka messages.

Logstash skipping too many log lines

I set up Logstash on my Windows box and run 7 instances of it, each with its own folder of log files as input. I ran them all at the same time, pointed at an AWS ES cluster running 7 instances of r3.xlarge and 3 master nodes (r3.xlarge). The input files total around 9GB. After all the Logstash instances stopped running, I had only 6 million events in Elasticsearch, where there should be around 30 million. I went back to one of the cmd windows where I ran Logstash and looked at the last event. It did not correspond to the last line of the log file it came from; it was roughly the 50th line from the bottom. The second-to-last event in the same window did not correspond to the line right before that one either; I found it about 30 lines further up in the log file. So it appears my Logstash is skipping log lines.
Now I checked Elasticsearch and it shows all zeros, so nothing got dropped? (I looked at bulk.rejected in particular.)
_cat/thread_pool?v
Is this data cumulative or does it get refreshed?
Which brings me to my second question: if Logstash itself dropped the log lines for some reason, where and how can I troubleshoot it? I know that none of my Logstash instances crashed. All I know is that it happily dropped 70% of my logs, and I have no error log or other clue as to what happened.
Edit:
My logstash configuration:
(It is as if it is ignoring all my logs for Friday, Saturday and Sunday, and just processing Monday (3/21).)
input {
file {
type => "apache_logs"
path => "D:/logs/apache_logs/all/ssl_access.*"
start_position => "beginning"
sincedb_path => "NUL"
}
}
filter {
grok {
match => ["message","%{IPORHOST:client_ip} (?<username>[-]) (?<password>[-]) \[(?<timestamp>\d{2}[/][a-zA-Z]{3}[/]\d{4}:\d{2}:\d{2}:\d{2}\s-\d{0,4})\] \"%{GREEDYDATA:request}\" %{NOTSPACE:obssocookie} %{NOTSPACE:ps_sso_uid_in} %{NOTSPACE:ps_sso_uid_out} (?<status>[0-9]{3}) (?<bytes>[0-9]{1,}|-) %{NOTSPACE:protocol} %{NOTSPACE:ciphers} \"%{GREEDYDATA:referrer}\" \"%{GREEDYDATA:user_agent}\""]
match => [ "path", "(?<app_node>webpr[0-9]{2}[a-z]{0,1})" ]
add_field => { "server_node" => "%{app_node}" }
break_on_match => false
}
mutate {
gsub => ["obssocookie","^.*=",""]
}
mutate {
gsub => ["ps_sso_uid_in","^.*=",""]
}
mutate {
gsub => ["ps_sso_uid_out","^.*=",""]
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
remove_field => "timestamp"
}
geoip {
source => "client_ip"
}
if [geoip] {
mutate {
add_field => {
"ip_type" => "public"
}
}
} else {
mutate{
add_field => {
"ip_type" => "private"
}
}
}
}
output {
stdout{ codec => rubydebug}
amazon_es {
hosts => ["apache-logs-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-west-2.es.amazonaws.com"]
region => "us-west-2"
aws_access_key_id => 'xxxxxxxxxxxxxxxxxx'
aws_secret_access_key => 'xxxxxxxxxxxxxxxxxxxxx'
index => "logstash-apache-friday"
}
}
How can I find out how many events Logstash itself dropped, as opposed to how many Elasticsearch rejected? I already checked through the API and bulk.rejected=0.
Found my culprit. I had to add this to my file input; it looks like it skips any file older than 24 hours by default:
ignore_older => 0
Kind of surprising. I would expect to add settings when I want to narrow my input; otherwise Logstash should process all files, older than 24 hours or not. This was really not obvious.
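The age check described here can be modelled in a few lines. This is only a Python sketch of the behaviour the poster reports, not the file input's actual implementation: the function name is_skipped is hypothetical, and treating 0 as "check disabled" is an assumption taken from the fix above.

```python
import time

DAY_SECONDS = 24 * 60 * 60  # the 24-hour default the poster ran into


def is_skipped(mtime: float, now: float, ignore_older: float = DAY_SECONDS) -> bool:
    """Model of the age check: skip files last modified more than ignore_older seconds ago.

    A value of 0 is assumed to disable the check, as in the poster's fix.
    """
    if ignore_older == 0:
        return False
    return (now - mtime) > ignore_older


now = time.time()
friday_file = now - 3 * DAY_SECONDS  # last modified three days ago
monday_file = now - 60               # last modified a minute ago

print(is_skipped(friday_file, now))                  # older than the default window
print(is_skipped(monday_file, now))                  # recent file
print(is_skipped(friday_file, now, ignore_older=0))  # check disabled
```

Under that model, the Friday to Sunday files fall outside the default window and only Monday's file is read, which matches the symptom in the question.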

Writing a single file to multiple s3 buckets with gulp-awspublish

I have a simple single-page app, that is deployed to an S3 bucket using gulp-awspublish. We use inquirer.js (via gulp-prompt) to ask the developer which bucket to deploy to.
Sometimes the app may be deployed to several S3 buckets. Currently, we only allow one bucket to be selected, so the developer has to gulp deploy for each bucket in turn. This is dull and prone to error.
I'd like to be able to select multiple buckets and deploy the same content to each. It's simple to select multiple buckets with inquirer.js/gulp-prompt, but not simple to generate arbitrary multiple S3 destinations from a single stream.
Our deploy task is based upon generator-webapp's S3 recipe. The recipe suggests gulp-rename to rewrite the path to write to a specific bucket. Currently our task looks like this:
gulp.task('deploy', ['build'], () => {
// get AWS creds
if (typeof(config.awsCreds) !== 'object') {
return console.error('No config.awsCreds settings found. See README');
}
var dirname;
const publisher = $.awspublish.create({
key: config.awsCreds.key,
secret: config.awsCreds.secret,
bucket: config.awsCreds.bucket
});
return gulp.src('dist/**/*.*')
.pipe($.prompt.prompt({
type: 'list',
name: 'dirname',
message: 'Using the ‘' + config.awsCreds.bucket + '’ bucket. Which hostname would you like to deploy to?',
choices: config.awsCreds.dirnames,
default: config.awsCreds.dirnames.indexOf(config.awsCreds.dirname)
}, function (res) {
dirname = res.dirname;
}))
.pipe($.rename(function(path) {
path.dirname = dirname + '/dist/' + path.dirname;
}))
.pipe(publisher.publish())
.pipe(publisher.cache())
.pipe($.awspublish.reporter());
});
It's hopefully obvious, but config.awsCreds might look something like:
awsCreds: {
dirname: 'default-bucket',
dirnames: ['default-bucket', 'other-bucket', 'another-bucket']
}
Gulp-rename rewrites the destination path to use the correct bucket.
We can select multiple buckets by using "checkbox" instead of "list" for the gulp-prompt options, but I'm not sure how to then deliver it to multiple buckets.
In a nutshell, if $.prompt returns an array of strings instead of a string, how can I write the source to multiple destinations (buckets) instead of a single bucket?
Please keep in mind that gulp.dest() is not used -- only gulp.awspublish() -- and we don't know how many buckets might be selected.
Never used S3, but if I understand your question correctly a file js/foo.js should be renamed to default-bucket/dist/js/foo.js and other-bucket/dist/js/foo.js when the checkboxes default-bucket and other-bucket are selected?
Then this should do the trick:
// additionally required modules
var path = require('path');
var through = require('through2').obj;
gulp.task('deploy', ['build'], () => {
if (typeof(config.awsCreds) !== 'object') {
return console.error('No config.awsCreds settings found. See README');
}
var dirnames = []; // array for selected buckets
const publisher = $.awspublish.create({
key: config.awsCreds.key,
secret: config.awsCreds.secret,
bucket: config.awsCreds.bucket
});
return gulp.src('dist/**/*.*')
.pipe($.prompt.prompt({
type: 'checkbox', // use checkbox instead of list
name: 'dirnames', // use different result name
message: 'Using the ‘' + config.awsCreds.bucket +
'’ bucket. Which hostname would you like to deploy to?',
choices: config.awsCreds.dirnames,
default: config.awsCreds.dirnames.indexOf(config.awsCreds.dirname)
}, function (res) {
dirnames = res.dirnames; // store array of selected buckets
}))
// use through2 instead of gulp-rename
.pipe(through(function(file, enc, done) {
dirnames.forEach((dirname) => {
var f = file.clone();
f.path = path.join(f.base, dirname, 'dist',
path.relative(f.base, f.path));
this.push(f);
});
done();
}))
.pipe(publisher.publish()) // upload the renamed clones, as in the original task
.pipe(publisher.cache())
.pipe($.awspublish.reporter());
});
Notice the comments where I made changes from the code you posted.
What this does is use through2 to clone each file passing through the stream. Each file is cloned as many times as there were bucket checkboxes selected and each clone is renamed to end up in a different bucket.
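The path arithmetic in the through2 step can be checked on its own, without gulp or S3. Below is a small Python sketch that mirrors the path.relative / path.join calls using posixpath; the function name fan_out and the sample paths are made up for illustration, and it says nothing about how gulp-awspublish itself handles the files.

```python
import posixpath


def fan_out(base: str, file_path: str, dirnames: list[str]) -> list[str]:
    """Return one rewritten path per selected bucket directory.

    Mirrors: path.join(f.base, dirname, 'dist', path.relative(f.base, f.path))
    """
    relative = posixpath.relpath(file_path, base)
    return [posixpath.join(base, dirname, "dist", relative) for dirname in dirnames]


selected = ["default-bucket", "other-bucket"]
for new_path in fan_out("dist", "dist/js/foo.js", selected):
    # Printed relative to the base, this is the name each clone ends up with.
    print(posixpath.relpath(new_path, "dist"))
```

Each input file yields as many output paths as there are selected checkboxes, so an empty selection produces no clones at all.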

Get AWS CloudTrail log to Kibana

Is there a better way to get AWS CloudTrail logs into Kibana? I am using the Elasticsearch Service from AWS.
Here's the Logstash input that I use with 1.4.2. It works well, though I suspect it is noisy (it requires a lot of S3 GET/HEAD/LIST requests).
input {
s3 {
bucket => "bucketname"
delete => false
interval => 60 # seconds
prefix => "cloudtrail/"
type => "cloudtrail"
codec => "cloudtrail"
credentials => "/etc/logstash/s3_credentials.ini"
sincedb_path => "/opt/logstash_cloudtrail/sincedb"
}
}
filter {
if [type] == "cloudtrail" {
mutate {
gsub => [ "eventSource", "\.amazonaws\.com$", "" ]
add_field => {
"document_id" => "%{eventID}"
}
}
if ! [ingest_time] {
ruby {
code => "event['ingest_time'] = Time.now.utc.strftime '%FT%TZ'"
}
}
ruby {
code => "event.cancel if (Time.now.to_f - event['@timestamp'].to_f) > (60 * 60 * 24 * 1)"
}
ruby {
code => "event['ingest_delay_hours'] = (Time.now.to_f - event['@timestamp'].to_f) / 3600"
}
# drop events more than a day old, we're probably catching up very poorly
if [ingest_delay_hours] > 24 {
drop {}
}
# example of an event that is noisy and I don't care about
if [eventSource] == "elasticloadbalancing" and [eventName] == "describeInstanceHealth" and [userIdentity][userName] == "deploy-s3" {
drop {}
}
}
}
The credentials.ini format is explained on the s3 input page; it's just this:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
I also have a search that sends results to our #chatops but I'm not posting that here.
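The delay arithmetic in the ruby filters above is simple enough to check outside Logstash. Here is a hedged Python sketch of the same calculation; the function names and the sample times are assumptions for illustration, and the 24-hour cutoff is the one used in the config.

```python
from datetime import datetime, timedelta, timezone

MAX_DELAY_HOURS = 24  # same cutoff as the config above


def ingest_delay_hours(event_time: datetime, now: datetime) -> float:
    """Hours between the event's own timestamp and the moment it is ingested."""
    return (now - event_time).total_seconds() / 3600.0


def should_drop(event_time: datetime, now: datetime,
                max_delay_hours: float = MAX_DELAY_HOURS) -> bool:
    """True when the event is older than the cutoff and would be dropped."""
    return ingest_delay_hours(event_time, now) > max_delay_hours


now = datetime(2016, 3, 21, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=2)
stale = now - timedelta(hours=30)

print(ingest_delay_hours(fresh, now), should_drop(fresh, now))
print(ingest_delay_hours(stale, now), should_drop(stale, now))
```

An event exactly 24 hours old is kept in this sketch, since the comparison is strictly greater-than, as in the config.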
If you haven't tried it already, you can use cloudtrail and cloudwatch logs together. Then use cloudwatch logs to create a subscription to send the cloudtrail data to elasticsearch.
Once that is done you should be able to define a kibana index that starts with cwl* that is time based.
Cheers-

good resources for grok patterns for python log file

I want to use Logstash to parse Python log files. Where can I find resources that help with that? For example:
20131113T052627.769: myapp.py: 240: INFO: User Niranjan Logged-in
From this I need to capture the timestamp and some of the data fields.
I had exactly the same problem/need. I couldn't really find a solution to this. No available grok patterns really matched the python logging output, so I simply went ahead and wrote a custom grok pattern which I've added naively into patterns/grok-patterns.
DATESTAMP_PYTHON %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND},%{INT}
The logstash configuration I wrote gave me nice fields.
@timestamp
level
message
I also added an extra field, pymodule, which shows the Python module that produced the log entry.
My Logstash configuration file looks like this (ignore the sincedb_path; it is simply a way of forcing Logstash to read the entire log file every time you run it):
input {
file {
path => "/tmp/logging_file"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
grok {
match => [
"message", "%{DATESTAMP_PYTHON:timestamp} - %{DATA:pymodule} - %{LOGLEVEL:level} - %{GREEDYDATA:logmessage}" ]
}
mutate {
rename => [ "logmessage", "message" ]
}
date {
timezone => "Europe/Luxembourg"
locale => "en"
match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss,SSS" ]
}
}
output {
stdout {
codec => json
}
}
Please note that I give absolutely no guarantee that this is the best or even a slightly acceptable solution.
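For reference, the grok pattern above expects lines of the form "timestamp - module - LEVEL - message". A Python logging format string that produces that shape, plus a rough regex stand-in for the grok pattern, might look like the sketch below. The logger name "myapp", the helper emit_sample_line, and the regex itself are assumptions; the regex only approximates what DATESTAMP_PYTHON, DATA, LOGLEVEL and GREEDYDATA would match.

```python
import io
import logging
import re

# Rough regex equivalent of the grok expression (an approximation, not grok itself).
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+)"
    r" - (?P<pymodule>.*?) - (?P<level>[A-Za-z]+) - (?P<logmessage>.*)$"
)


def emit_sample_line() -> str:
    """Log one message with the assumed format and return the formatted line."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger = logging.getLogger("myapp")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the sample out of any root handlers
    logger.addHandler(handler)
    try:
        logger.info("User Niranjan Logged-in")
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue().rstrip("\n")


line = emit_sample_line()
fields = LINE_RE.match(line).groupdict()
print(fields["pymodule"], fields["level"], fields["logmessage"])
```

Note that the sample line in the question (20131113T052627.769: myapp.py: 240: INFO: ...) uses a different layout and would not match this pattern as written; either the logging format or the grok expression would need adjusting.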
Our Python log file has a slightly different format:
[2014-10-08 19:05:02,846] (6715) DEBUG:Our debug message here
So I was able to create a configuration file without any need for special patterns:
input {
file {
path => "/path/to/python.log"
start_position => "beginning"
}
}
filter {
grok {
match => [
"message", "\[%{TIMESTAMP_ISO8601:timestamp}\] \(%{DATA:pyid}\) %{LOGLEVEL:level}\:%{GREEDYDATA:logmessage}" ]
}
mutate {
rename => [ "logmessage", "message" ]
}
date {
timezone => "Europe/London"
locale => "en"
match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss,SSS" ]
}
}
output {
elasticsearch {
host => localhost
}
stdout {
codec => rubydebug
}
}
And this seems to work fine.
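A quick way to sanity-check the field split and the date format before wiring this into Logstash is to parse the sample line in plain Python. This is only a sketch: the regex approximates the grok expression, parse_line is a hypothetical helper, and the strptime format is the Python counterpart of "yyyy-MM-dd HH:mm:ss,SSS".

```python
import re
from datetime import datetime

# Approximation of: \[%{TIMESTAMP_ISO8601}\] \(%{DATA}\) %{LOGLEVEL}\:%{GREEDYDATA}
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \((?P<pyid>.*?)\) (?P<level>[A-Za-z]+):(?P<logmessage>.*)$"
)


def parse_line(line: str):
    """Split one log line into fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # %f accepts the three-digit millisecond part after the comma.
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y-%m-%d %H:%M:%S,%f")
    return fields


sample = "[2014-10-08 19:05:02,846] (6715) DEBUG:Our debug message here"
parsed = parse_line(sample)
print(parsed["timestamp"].isoformat(), parsed["pyid"], parsed["level"], parsed["logmessage"])
```

Lines that do not follow the bracketed layout return None here, which roughly corresponds to a grok parse failure in Logstash.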