Upload log4j2 JSON formatted log files to AWS CloudWatch with correct timestamp

I'm trying to upload log files to AWS CloudWatch. The application is outputting log4j2 style JSON into a file:
https://logging.apache.org/log4j/2.x/manual/layouts.html#JSONLayout
AWS provides two CloudWatch Logs agents for this task: an 'older' agent and a 'unified' agent:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_GettingStarted.html
I have tried both agents but run into the following problems, mainly related to parsing timestamps and the fact that the agent does a regex match on the entire log line rather than parsing it as JSON.
An example log message (additional fields omitted):
{"message": "Processing data for previous day: 2019-06-17T02:01:00", "timestamp": "2019-06-18T17:16:19.338000+0100"}
The older agent threw an exception because it was attempting to use my configured timestamp format to parse the timestamp in the message, not the one in the timestamp field.
The unified agent is unable to parse timestamps with sub-second precision which results in problems when attempting to combine log streams from multiple sources.
So, is there a better tool/strategy to upload JSON formatted log4j2 files to CloudWatch?
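For what it's worth, the timestamp field in these lines parses cleanly in plain Python, which is one way a custom uploader could sidestep the agents' format limitations. A minimal sketch (not a full uploader), using the example line above:

import json
from datetime import datetime

line = '{"message": "Processing data for previous day: 2019-06-17T02:01:00", "timestamp": "2019-06-18T17:16:19.338000+0100"}'
record = json.loads(line)

# %f handles the sub-second part, %z handles the +0100 offset
parsed = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%S.%f%z")
epoch_millis = int(parsed.timestamp() * 1000)  # CloudWatch expects milliseconds since the epoch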

Related

Where to find detailed logs of BigQuery Data Transfer

I am using BQ Data Transfer to move some zipped JSON data from S3 to BQ.
I am receiving the following error and I'd like to dig deeper into it.
"jsonPayload": {"message": "Job xyz (table ybz) failed with error INVALID_ARGUMENT: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. File: gs://some-file-name; JobID: PID"},
When I try to open that URL (replacing the gs:// part with https://storage.googleapis.com/), I get
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
That object can't be found in any of my GCP Storage buckets.
I suspect there is badly formatted JSON, but without a clear look at the logs and errors I can't go back to the S3 bucket owner with relevant information.
You can refer to this document on the BigQuery Data Transfer Service for Amazon S3.
When you load JSON files into BigQuery, note the following:
JSON data must be newline delimited. Each JSON object must be on a separate line in the file.
If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed JSON data into BigQuery is slower than loading uncompressed data.
You cannot include both compressed and uncompressed files in the same load job.
The maximum size for a gzip file is 4 GB.
BigQuery supports the JSON type even if schema information is not known at the time of ingestion. A field that is declared as JSON type is loaded with the raw JSON values.
For more information about the limitations of Amazon S3 transfers, you can refer to this document.
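To illustrate the newline-delimited requirement from the first note above, here is a minimal Python sketch (file names are hypothetical) that turns a JSON array into newline-delimited JSON:

import json

# Hypothetical file names, just to illustrate the newline-delimited requirement.
with open("records.json") as src:
    records = json.load(src)  # a JSON array, e.g. [{"a": 1}, {"a": 2}]

with open("records.ndjson", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")  # one JSON object per line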
To view logs of BigQuery Data Transfer runs in the Logs Explorer, you can use this filter:
resource.type="bigquery_dts_config"
labels.run_id="transfer_run_id"

Google BigQuery cannot read some ORC data

I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and am facing the error below. The data files were copied via the hadoop distcp command from an on-prem cluster's Hive instance, version 1.2. Most of the ORC files load successfully, but a few do not. There is no problem when I read this data from Hive.
Command I used:
$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17
Upload complete.
Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE
BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message:
The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory
Indeed, there is no such file, and I believe there shouldn't be.
Googling this error message turns up nothing, but I've found a similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers: create a symlink to /usr/share/zoneinfo/GMT-00:00. But I'm in the cloud.
Additionally, I found that if I extract the data from the ORC file into JSON format via orc-tools, I'm able to load that JSON file into BigQuery. So I suspect that the problem is not in the data itself.
Has anybody come across such a problem?
The official Google support position is below. In short, BigQuery doesn't understand some time zone descriptions, and the suggestion was to change them in the data. Our workaround was to convert the ORC data to Parquet and then load that into the table (a Python sketch of the conversion follows the quoted reply).
Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:
select timestamp('2020-01-01 00:00:00 GMT-00:00')
you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:
“2020-01-01 00:00:00 GMT-00:00” -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef” -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00” -- correct timestamp string literal
In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you were able to get the “GMT-00:00” string with preceding space replaced with just “-00:00” that would be the correct name of the time zone. Can you change the configuration of the system which generated the file into having a proper time zone string?
Creating a symlink is only masking the problem and not solving it properly. In case of BigQuery it is not possible.
Best regards,
Google Cloud Support
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones
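For reference, a rough sketch of the ORC-to-Parquet conversion mentioned above. The question doesn't say which tool was used for it; one option is pyarrow run locally (where, per the linked ARROW issue, the zoneinfo symlink workaround is available if the local reader hits the same error). File names are taken from the question.

import pyarrow.orc as orc
import pyarrow.parquet as pq

# Read the problematic ORC file into an Arrow table...
table = orc.ORCFile("part-v006-o000-r-00000_a_17").read()
# ...and write it back out as Parquet before loading it into BigQuery
pq.write_table(table, "part-v006-o000-r-00000_a_17.parquet")

The resulting Parquet file can then be loaded with bq load --source_format PARQUET.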

Upload multi-lined JSON log to AWS CloudWatch Log

The put-log-events command [1] expects the JSON file to be wrapped in [ and ],
e.g.
# aws logs put-log-events --log-group-name my-logs --log-stream-name 20150601 --log-events file://events
[
{
"timestamp": long,
"message": "string"
}
...
]
However, my JSON file is in a multi-line format like
{"timestamp": xxx, "message": "xxx"}
{"timestamp": yyy, "message": "yyy"}
Is it possible to upload without writing my own program?
[1] https://docs.aws.amazon.com/cli/latest/reference/logs/put-log-events.html#examples
An easy way to publish the batch without any coding would be to use jq to do the necessary transformation on the file. jq is a command-line utility for JSON processing.
cat events | jq -s '.' > events-formatted.json
aws logs put-log-events --log-group-name my-logs --log-stream-name 20150601 --log-events file://events-formatted.json
With this, the data should be formatted correctly and can be ingested into CloudWatch.
If you want to keep those lines as a single event, you can treat the lines as strings, join them with \n, and send them that way.
Since the lines look like self-sufficient JSON themselves, sending them as an array of events (hence the [...]) might not be that bad, since they will end up in the same log group and will be easy to find as a batch.
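For completeness, a minimal boto3 sketch of the array-of-events approach (log group, stream name, and the events file are taken from the question; it assumes the timestamp values in the file are already epoch milliseconds, which is what put-log-events expects):

import json
import boto3

logs = boto3.client("logs")  # assumes credentials and region are configured

# One JSON object per line, as in the question.
events = []
with open("events") as f:
    for line in f:
        record = json.loads(line)
        events.append({
            "timestamp": int(record["timestamp"]),  # epoch milliseconds
            "message": line.rstrip("\n"),            # keep the original line as the message
        })

events.sort(key=lambda e: e["timestamp"])  # events in a batch must be in chronological order

# Note: older versions of the API also required a sequenceToken on subsequent calls.
logs.put_log_events(
    logGroupName="my-logs",
    logStreamName="20150601",
    logEvents=events,
)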
You will need to escape it as suggested and remove the newlines. Even though a lot of JSON is used as a consumer format these days, it isn't a great raw representation when it comes to logs, because logs can get truncated.
Try parsing truncated JSON, no fun at all!
You also don't want timestamps embedded in your log messages; that will break the filter and search abilities you get with CloudWatch.
You can stream a raw format to CloudWatch Logs and then use streams to parse, format, and filter that raw data into a service such as Elasticsearch. I would recommend streaming to the Elasticsearch service on AWS if you want to do more with your logs than what CloudWatch gives you, and you can use your embedded timestamp format there as well if you wish.

Why doesn't my Kinesis Analytics Application Schema Discovery work?

I am sending comma-separated data to my kinesis stream, and I want my kinesis analytics app to recognize that there are two columns (both bigints). But when I populate my stream with some records and click "Discover Schema", it always gives me a schema of one column! Here's a screenshot:
I have tried many different delimiters to indicate columns, including comma, space, and comma-space, but none of these causes AWS to detect my schema properly. At one point I gave up and edited the schema manually, which caused this error:
While I know that I have the option to keep the schema as a single column and use string and date-time manipulation to structure my data, I prefer not to do it this way... Any suggestions?
While I wasn't able to get the schema discovery tool to work at first, I realized that I am able to manually edit my schema and it works fine. I was getting that error because I had only populated the stream initially and was not continuously sending data.
Schema discovery required me to send data to my input Kinesis stream while the discovery was running. To do this for my proof-of-concept application I used the AWS CLI:
# emittokinesis.sh
JSON='{
"messageId": "31c14ee7-9bde-484d-af05-03509c2c33aa",
"myTest": "myValue"
}'
echo "$JSON"
# Quote the variable and strip any line wrapping from the base64 output
JSONBASE64=$(echo "${JSON}" | base64 | tr -d '\n')
echo 'aws kinesis put-record --stream-name logstash-input-test --partition-key 1 --data "'${JSONBASE64}'"'
aws kinesis put-record --stream-name logstash-input-test --partition-key 1 --data "${JSONBASE64}"
I clicked the "Run Schema Discovery" button in the AWS UI and then quickly ran my shell script in a CMD window.
Once my initial schema was discovered, I could manually edit the schema, but it mostly matched what I expected based on my input JSON.
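If you prefer boto3 over the CLI, a rough equivalent that keeps sending records while discovery runs (stream name taken from the script above, payload values are just placeholders) might look like this:

import json
import time
import uuid
import boto3

kinesis = boto3.client("kinesis")  # assumes credentials and region are configured

# Send a record every second for ~30 seconds while "Discover Schema" is running.
for _ in range(30):
    payload = {"messageId": str(uuid.uuid4()), "myTest": "myValue"}
    kinesis.put_record(
        StreamName="logstash-input-test",
        Data=json.dumps(payload).encode("utf-8"),  # the SDK handles the base64 encoding on the wire
        PartitionKey="1",
    )
    time.sleep(1)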

CloudWatch Logs prepending a timestamp to each line

We have the CloudWatch Logs agent set up, and the streamed logs have a timestamp prepended to the beginning of each line, which we can see after export.
2017-05-23T04:36:02.473Z "message"
Is there any configuration in the CloudWatch Logs agent setup that prevents this timestamp from being prepended to each log entry?
Is there a way to export only the messages of the log events from CloudWatch Logs? We don't want the timestamp in our exported logs.
Thanks
Assume that you are able to retrieve those logs using your Lambda function (Python 3.x).
Then you can use a regular expression to identify the timestamp and write a function to strip it from each log event.
^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\t
The above will identify the following timestamp: 2019-10-10T22:11:00.123Z
Here is a simple Python function:
import re

def strip(eventLog):
    # Match a leading timestamp such as 2019-10-10T22:11:00.123Z followed by a tab
    timestamp = r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\t'
    result = re.sub(timestamp, "", eventLog)
    return result
I don't think it's possible. I needed the exact same behavior you are asking for, and it looks like it's not possible unless you implement a man-in-the-middle processor that removes the timestamp from every log message, as suggested in the other answer.
Looking at the CloudWatch Logs client API in the first place, you are required to send a timestamp with every log message you send to CloudWatch Logs (API reference).
And the export-logs-to-S3 task API also has no parameter to control this behavior (API reference).
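If pulling the logs through the API (instead of the export task) is an option, only the message field needs to be kept; a minimal boto3 sketch with a hypothetical log group name:

import boto3

logs = boto3.client("logs")  # assumes credentials and region are configured

# Hypothetical log group name; the timestamp is returned as a separate field,
# so printing only event["message"] drops it.
paginator = logs.get_paginator("filter_log_events")
for page in paginator.paginate(logGroupName="/my/app/log-group"):
    for event in page["events"]:
        print(event["message"])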