Regular expression to split two events in JSON file - regex

I have a JSON log file and I'm looking for a regex to split the events within it. I have written one regex, but it reads all the events as one group.
Log file:
[ {
"name" : "CounterpartyNotional",
"type" : "RiskBreakdown",
"duration" : 20848,
"count" : 1,
"average" : 20848.0
}, {
"name" : "CounterpartyPreSettlement",
"type" : "RiskBreakdown",
"duration" : 15370,
"count" : 1,
"average" : 15370.0
} ]
[ {
"name" : "TraderCurrency",
"type" : "Formula",
"duration" : 344,
"count" : 1,
"average" : 344.0
} ]
PS: I will be using this regex for a Splunk tool.

Your regex does not read all the events together. Above the regex (on the linked page) it says "2 matches", which means your regex has already split the log; you just need to iterate through the matches (i.e. the events) in the language that runs the regex matching.
For example, in Python 3 (if you don't mind, I've simplified the regex):
import re
log = """[ {
"name" : "CounterpartyNotional",
"type" : "RiskBreakdown",
"duration" : 20848,
"count" : 1,
"average" : 20848.0
}, {
"name" : "CounterpartyPreSettlement",
"type" : "RiskBreakdown",
"duration" : 15370,
"count" : 1,
"average" : 15370.0
} ]
[ {
"name" : "TraderCurrency",
"type" : "Formula",
"duration" : 344,
"count" : 1,
"average" : 344.0
} ]"""
event = re.compile(r'{[^}]*?"RiskBreakdown"[^}]*}')
matches = event.findall(log)
print(matches)
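Running this prints a list containing the two "RiskBreakdown" events; the third event (of type "Formula") does not match and is left out.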
And yes, it is true that this is not valid JSON, but it renders fine on the linked page, so maybe it's a typo.

Related

Unable to execute Dataflow Pipeline : "Failed to create job with prefix beam_bq_job_LOAD_textiotobigquerydataflow"

I'm trying to run a Dataflow batch job using the template "Text file on Cloud Storage to BigQuery". The first three steps work, but the last stage fails with this error:
Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_textiotobigquerydataflow0releaser1025091627592969dd_1a449a94623645758e91dcba53a86498_fc44bdad405c2c80860231502c18eb1e_00001_00000, reached max retries: 3, last failed job: { "configuration" : { "jobType" : "LOAD", "labels" : { "beam_job_id" : "2022-11-10_02_06_07-15255037958352274885" }, "load" : { "createDisposition" : "CREATE_IF_NEEDED", "destinationTable" : { "datasetId" : "minerals_test_dataset", "projectId" : "jio-big-data-poc", "tableId" : "mytable01" }, "ignoreUnknownValues" : false, "sourceFormat" : "NEWLINE_DELIMITED_JSON", "useAvroLogicalTypes" : false, "writeDisposition" : "WRITE_APPEND" } }, "etag" : "LHqft9L/H4XBWTNZ7BSRXA==", "id" : "jio-big-data-poc:asia-south1.beam_bq_job_LOAD_textiotobigquerydataflow0releaser1025091627592969dd_1a449a94623645758e91dcba53a86498_fc44bdad405c2c80860231502c18eb1e_00001_00000-2", "jobReference" : { "jobId" : "beam_bq_job_LOAD_textiotobigquerydataflow0releaser1025091627592969dd_1a449a94623645758e91dcba53a86498_fc44bdad405c2c80860231502c18eb1e_00001_00000-2", "location" : "asia-south1", "projectId" : "jio-big-data-poc" }, "kind" : "bigquery#job", "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/jio-big-data-poc/jobs/beam_bq_job_LOAD_textiotobigquerydataflow0releaser1025091627592969dd_1a449a94623645758e91dcba53a86498_fc44bdad405c2c80860231502c18eb1e_00001_00000-2?location=asia-south1", "statistics" : { "creationTime" : "1668074949767", "endTime" : "1668074949869", "startTime" : "1668074949869" }, "status" : { "errorResult" : { "message" : "Provided Schema does not match Table jio-big-data-poc:minerals_test_dataset.mytable01. Cannot add fields (field: marks)", "reason" : "invalid" }, "errors" : [ { "message" : "Provided Schema does not match Table jio-big-data-poc:minerals_test_dataset.mytable01. Cannot add fields (field: marks)", "reason" : "invalid" } ], "state" : "DONE" }, "user_email" : "49449455496-compute#developer.gserviceaccount.com", "principal_subject" : "serviceAccount:49449455496-compute#developer.gserviceaccount.com" }. org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:200) org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:153) org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:378)
I tried running the same job with other datasets' CSV files, and the JavaScript UDF and JSON schema follow the documentation, but the job fails at the same stage. What could be the solution to this error?
The JSON schema you've given doesn't match the BigQuery schema of your table:
"Provided Schema does not match Table jio-big-data-poc:minerals_test_dataset.mytable01. Cannot add fields (field: marks)", "reason" : "invalid" }, "errors" : [ { "message" : "Provided Schema does not match Table jio-big-data-poc:minerals_test_dataset.mytable01. Cannot add fields (field: marks)", "reason" : "invalid" } ]
There is a field called marks that doesn't seem to exist in the BigQuery table.
If you update your BigQuery schema so it matches the fields of your input JSON lines exactly, that will solve the issue.
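As a sketch of that fix using the bq CLI (assuming marks should be, say, a nullable INTEGER; pick the type your data actually has), you can pull the current schema, append the missing field, and push it back:
bq show --schema --format=prettyjson jio-big-data-poc:minerals_test_dataset.mytable01 > schema.json
# edit schema.json and append the new column, e.g.:
#   {"name": "marks", "type": "INTEGER", "mode": "NULLABLE"}
bq update jio-big-data-poc:minerals_test_dataset.mytable01 schema.json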

Regexp query seems to be ignored in elasticsearch

I have the following query:
{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "dog cat",
"analyzer" : "standard",
"default_operator" : "AND",
"fields" : ["title", "content"]
}
},
{
"range" : {
"dateCreate" : {
"gte" : "2018-07-01T00:00:00+0200",
"lte" : "2018-07-31T23:59:59+0200"
}
}
},
{
"regexp" : {
"articleIds" : {
"value" : ".*?(2561|30|540).*?",
"boost" : 1
}
}
}
]
}
}
}
The fields title, content, and articleIds are of type text; dateCreate is of type date. The articleIds field contains some IDs (comma-separated).
OK, what happens now? I execute the query and get two results: both documents contain the words "dog" and "cat" in the title or in the content. So far that's correct.
But the second result has the number 3507 in the articleIds field, which doesn't match my query. It seems that the regexp is ignored because title and content already match. What is wrong here?
And here's the document that should not match my query but does:
{
"_index" : "example",
"_type" : "doc",
"_id" : "3007780",
"_score" : 21.223656,
"_source" : {
"dateCreate" : "2018-07-13T16:54:00+0200",
"title" : "",
"content" : "Its raining cats and dogs.",
"articleIds" : "3507"
}
}
What I'm expecting is that this document should not be in the results, because it contains 3507, which is not part of my query...

Elasticsearch regexp query finds no results

I'm having trouble building the correct query. I have an index with a field "ids" with the following mapping:
"ids" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
A sample content could look like this:
10,20,30
It's a list of IDs. Now I want to make a query with multiple possible IDs as a disjunction (OR), so I decided to use a regexp:
{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "Test"
}
},
{
"regexp" : {
"ids" : {
"value" : "10031|20|10038",
"boost" : 1
}
}
}
]
}
},
"size" : 10,
"from" : 0
}
The query executes successfully but returns no results. I expected to find 3 results.
If you want to match 10031 or 20 or 10038, you need to add parentheses.
Change "10031|20|10038" => "(10031|20|10038)"
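Applied to the query above, the amended clause would be (only the value changes):
{
"regexp" : {
"ids" : {
"value" : "(10031|20|10038)",
"boost" : 1
}
}
}
Keep in mind that Elasticsearch regexps are anchored to the whole token, so depending on how ids is analyzed you may also need a leading and trailing .*, as in the first question above (".*(10031|20|10038).*").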

How do I replace a field in logstash

I'm doing an insert from Logstash into ElasticSearch. My problem is that I used a template in ES to lay out the data types, and I am sometimes getting values from Logstash that are null values (or dashes) when I've declared in ES that they should be doubles.
So sometimes, ES is getting a '-' instead of something like "2342", and it is rejecting it and causing an error. Now, if I can replace the '-' with the word 'null', ES works fine.
How do I do this? I assume it works with the ruby filter. I need to be able to replace the '-' fields with null when appropriate.
EDIT:
I was asked for sample configs.
So, for example, say the config below is the Logstash side, which will then send data to ES:
filter {
if [type] == "transaction" {
grok {
match => ["message", "%{BASE16FLOAT:ts}\t%{IP:orig_ip}\t%{NOTSPACE:orig_port}" ]
}
}
}
Now my ES template is saying:
"transaction" : {
"properties" :
{
"ts" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"orig_ip" : {
"type" : "ip"
},
"orig_port" : {
"type" : "long"
}
}
}
So if I throw a data set like either of these, it passes:
{"ts" : "123456789.123234", "orig_ip" : "10.0.0.1", "orig_port" : "2342" }
{"ts" : "123456789.123234", "orig_ip" : "10.0.0.1", "orig_port" : null }
I get a success. But, the following [obviously] fails:
{"ts" : "123456789.123234", "orig_ip" : "10.0.0.1", "orig_port" : "-" }
How can I ensure that the "-" (with quotes) gets changed to a null?
If you amend your template by specifying "ignore_malformed": true in your orig_port long field, it should work.
"transaction" : {
"properties" :
{
"ts" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"orig_ip" : {
"type" : "ip"
},
"orig_port" : {
"type" : "long"
"ignore_malformed": true <---- add this line
}
}
}
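Alternatively, if you'd rather clean the value on the Logstash side (the filter route the question mentions), a minimal sketch, assuming the field is named orig_port as in the config above, is to drop the field whenever it holds a dash, so Elasticsearch never sees the bogus value:
filter {
if [orig_port] == "-" {
mutate {
remove_field => [ "orig_port" ]
}
}
}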

Using grep to replace every instance of a pattern after the first in bbedit

So I've got a really long txt file that follows this pattern:
},
"303" :
{
"id" : "4k4hk2l",
"color" : "red",
"moustache" : "no"
},
"303" :
{
"id" : "4k52k2l",
"color" : "red",
"moustache" : "yes"
},
"303" :
{
"id" : "fask2l",
"color" : "green",
"moustache" : "yes"
},
"304" :
{
"id" : "4k4hf4f4",
"color" : "red",
"moustache" : "yes"
},
"304" :
{
"id" : "tthj2l",
"color" : "red",
"moustache" : "yes"
},
"304" :
{
"id" : "hjsk2l",
"color" : "green",
"moustache" : "no"
},
"305" :
{
"id" : "h6shgfbs",
"color" : "red",
"moustache" : "no"
},
"305" :
{
"id" : "fdh33hk7",
"color" : "cyan",
"moustache" : "yes"
},
and I'm trying to format it into a proper JSON object with the following structure...
"303" :
{ "list" : [
{
"id" : "4k4hk2l",
"color" : "red",
"moustache" : "no"
},
{
"id" : "4k52k2l",
"color" : "red",
"moustache" : "yes"
},
{
"id" : "fask2l",
"color" : "green",
"moustache" : "yes"
}
]
}
"304" :
{ "list" : [
etc...
Meaning: I look for all patterns of ^"\d\d\d" : and keep the first unique one, but remove all the subsequent ones (for example, keep the first instance of "303" : but completely remove the rest of them; then keep the first instance of "304" : but completely remove the rest of them, etc.).
I've been attempting to do this within the bbedit application, which has a grep option for search/replace. My pattern matching fu is too weak to accomplish this. Any ideas? Or a better way to accomplish this task?
You can't capture a repeating capturing group: the capture will always contain only the last match of the group. So there's no way to do this with a single search/replace, short of dumbly repeating your group in the pattern. And even that is only a solution if you know the maximum count of elements in the resulting groups.
Say we have a string that is a simplified version of your data:
1a;1b;1c;1d;1e;2d;2e;2f;2g;3x;3y;3z;
We see that the maximum count of elements is 5, so after the first element we repeat the optional capturing group four times (covering up to 5 elements).
/([0-9])([a-z]*);?(\1([a-z]);)?(\1([a-z]);)?(\1([a-z]);)?(\1([a-z]);)?/
And replace that with
\1:\2\4\6\8\10;
Then we get desired result:
1:abcde;2:defg;3:xyz;
You can apply this technique to your data if you're in a great hurry (and after 2 days I suppose you're not), but using some scripting language will be a better and cleaner solution; see the Python sketch at the end.
For my simplified example, you would iterate through the matches of /([0-9])[a-z];?(\1[a-z];?)*/. Those will be:
1a;1b;1c;1d;1e;
2d;2e;2f;2g;
3x;3y;3z;
And there you can capture all the values and bind them to the respective key, of which there is only one per iteration.
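For the actual data, a minimal Python sketch along those lines (the input file name is an assumption; the regex relies on the records being flat "NNN" : { ... } blocks with no nested braces, as shown above) could look like this:
import re
from collections import OrderedDict

with open("data.txt") as f:  # hypothetical input file
    text = f.read()

# Group every `"NNN" : { ... }` record under its three-digit key,
# preserving the order in which keys first appear.
records = OrderedDict()
for key, body in re.findall(r'"(\d\d\d)"\s*:\s*(\{[^}]*\})', text):
    records.setdefault(key, []).append(body)

# Emit one `"NNN" : { "list" : [ ... ] }` entry per key.
chunks = []
for key, bodies in records.items():
    chunks.append('"%s" :\n{ "list" : [\n%s\n]\n}' % (key, ",\n".join(bodies)))
print(",\n".join(chunks))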