I'm looking into OpenSearch and trying out the dashboards & data querying but am having some difficulty with some specific queries I'd like to run.
The data I have streaming into OpenSearch is a set of custom error logs where the #message field contains JSON such as: ...{"code":400,"message":"Bad request","detail":"xxxx"}... (the entire message field is not valid JSON, as it also contains other data).
I can query for the code field specifically with "code:400", but I would like a more generic query that matches all 4XX codes. Adding any kind of wildcard or range breaks the query, and the surrounding quotes are required, otherwise the results match code OR 400.
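For example, attempts roughly like these (with or without the surrounding quotes) do not return the expected hits:

code:4*
code:[400 TO 499]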
Is there any way to achieve this with the kind of data I have or is this a limitation in the querying syntax?
I went through the whole boto3 documentation, and it seems there is no way to retrieve the execution details of a specific query. The only way I can see is to get the execution IDs of all queries with list_query_executions() and then use either get_query_execution() or batch_get_query_execution(). However, even in the execution details there is no field that lets you map an execution back to a specific query (you can use the query string, but that is not ideal, as the query string is subject to change). So I can only see a way to get all execution details, but no way to map them to a query (e.g. by query ID or name).
Am I missing something here? Is it somehow possible to cleanly map a query to all of its executions?
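For reference, a minimal sketch of the only approach I can see with those calls (client setup and the target SQL string are made up):

import boto3

athena = boto3.client("athena")

# Collect every execution ID (the listing API is paginated)
execution_ids = []
for page in athena.get_paginator("list_query_executions").paginate():
    execution_ids.extend(page["QueryExecutionIds"])

# Fetch execution details in batches of up to 50 IDs per call
details = []
for i in range(0, len(execution_ids), 50):
    resp = athena.batch_get_query_execution(QueryExecutionIds=execution_ids[i:i + 50])
    details.extend(resp["QueryExecutions"])

# The only "mapping" I can do is matching on the query string itself,
# which breaks as soon as the saved query text changes
target_sql = "SELECT * FROM my_table"  # hypothetical query text
matches = [d for d in details if d.get("Query") == target_sql]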
Thank you!
I'm interested in using a streaming pipeline from Google Pub/Sub to BigQuery, and I want to know how it would handle the case where an updated JSON object is sent with fields/branches missing that already exist in the BigQuery table/schema. For example, will it set the values in the table to empty/null, retain what's in the table and only update the fields/branches that are present, or simply fail because the sent object does not match the schema one to one?
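To make the scenario concrete, here is a sketch of the kind of mismatch I mean (field names are made up):

# Row streamed earlier, matching the full table schema
original = {"id": 1, "name": "abc", "address": {"city": "NYC", "zip": "10001"}}

# Updated message for the same record, with the "address" branch missing entirely
updated = {"id": 1, "name": "abc-renamed"}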
I ran a Glue Crawler over a nested JSON data source on S3, and I tried to query nested fields via Redshift Spectrum as per the documentation:
select c.id, c.my_nested_column.MyField
from my_external_schema.my_table c;
But as per the title, I was getting the error message
[42703] ERROR: column "my_nested_column" does not exist
which doesn't really make sense, as I can see from the metadata that the field exists. Because of this, I'm unable to unnest fields from "my_nested_column".
How to fix this?
After some investigation, I noticed that one of the fields in the JSONs being parsed contained a colon within the field name, something like my:field.
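Roughly this shape (field names are illustrative):

{"id": 1, "my_nested_column": {"MyField": "some value", "my:field": "the colon in this key name caused the problem"}}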
The colon clashes with how the field name is parsed and doesn't work well. I removed this field from the Glue Catalog, and afterwards I was able to query the nested field correctly.
The initial error message I was getting really didn't help, but the problem was caused by a malformed JSON field name.
I have enabled logging on my GCP PostgreSQL 11 Cloud SQL database. The logs are being redirected to a bucket in the same project and they are in a JSON format.
The logs contain the queries that were executed on the database. Is there a way to create a decent report from these JSON logs with a few fields from the log entries? Currently the log files are in JSON and not very reader-friendly.
Additionally, if a multi-line query is run, that many log entries are created for the one query. If there is also a way to recognize log entries which belong to the same query, that would be helpful, too!
I guess the easiest way is using BigQuery.
BigQuery will import those JSONL files properly and will assign sensible field names to the JSON data.
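For example, a rough load sketch with the Python BigQuery client (bucket path and table names are made up; autodetect is one way to let BigQuery assign the field names):

from google.cloud import bigquery

client = bigquery.Client()

# Load the exported JSONL log files from the bucket into a table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer column names from the JSON keys
)
load_job = client.load_table_from_uri(
    "gs://my-log-bucket/cloudsql.googleapis.com/postgres.log/*.json",
    "my_project.my_dataset.postgres_logs",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish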
When you have multi-line queries, you'll see that they appear as multiple log entries in the JSON files.
It looks like all entries from a multi-line query have the same receiveTimestamp (which makes sense, since they were produced at the same time).
Also, the insertId field has an 's=xxxx' subfield that does not change for lines belonging to the same statement. For example:
insertId: "s=6657e04f732a4f45a107bc2b56ae428c;i=1d4598;b=c09b782e120c4f1f983cec0993fdb866;m=c4ae690400;t=5b1b334351733;x=ccf0744974395562-0#a1"
The strategy to extract the statements in the right line order is (a sketch follows after the list):
Sort by the 's' field in insertId
Then sort by receiveTimestamp ascending (to get all the lines sent at once to the syslog agent in the cloudsql service)
And finally sort by timestamp ascending (to get the line ordering right)
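A rough sketch of that ordering as a query run from Python (the table name is made up, and the field names assume the auto-detected LogEntry columns insertId, receiveTimestamp, timestamp and textPayload):

from google.cloud import bigquery

client = bigquery.Client()

# Sort lines by the 's=' subfield of insertId, then receiveTimestamp, then timestamp
sql = """
SELECT
  REGEXP_EXTRACT(insertId, r"s=([^;]+)") AS statement_id,
  receiveTimestamp,
  `timestamp`,
  textPayload
FROM `my_project.my_dataset.postgres_logs`
ORDER BY statement_id, receiveTimestamp, `timestamp`
"""
for row in client.query(sql).result():
    print(row.statement_id, row.textPayload)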
I am trying to split logs per customer. I think I understand the Query DSL of Elasticsearch.
For filtering the logs I use a domain name as filter parameter.
For now, let's call them
bh250.example.com
bh500.example.com
Now I have managed to filter the logs so that the owner of the domain bh250.example.com can only see his log files.
But when I want to sort them based on the timestamp, it "breaks" the filter and shows both the bh250 and bh500 logs.
from elasticsearch_dsl import Q, Search

q = Q("match", domainname=domein)
q1 = Q("match", status="404")  # defined but not used below
search = Search(using=dev_client, index="access-logs").query(q).filter("term", status="200").sort("-#timestamp")[0:100]
Now, without the sort function it shows the correct logs, but in a different order. With the sort function I get both sets of records on screen (bh250 and bh500).
I have also looked at whether the mappings could be the issue, but I'm not quite sure why the sort function breaks my "filter".
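For completeness, this is roughly how I inspected the mapping (same client and index as above):

# Dump the field mappings for the index to check how #timestamp and status are mapped
print(dev_client.indices.get_mapping(index="access-logs"))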