Nifi - Extracting Key Value pairs into new fields - regex

With Nifi I am trying to use the ReplaceText processor to extract key value pairs.
The relevant part of the JSON file is the 'RuleName':
"winlog": {
"channel": "Microsoft-Windows-Sysmon/Operational",
"event_id": 3,
"api": "wineventlog",
"process": {
"pid": 1640,
"thread": {
"id": 4452
}
},
"version": 5,
"record_id": 521564887,
"computer_name": "SERVER001",
"event_data": {
"RuleName": "Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043"
},
"provider_guid": "{5790385F-C22A-43E0-BF4C-06F5698FFBD9}",
"opcode": "Info",
"provider_name": "Microsoft-Windows-Sysmon",
"task": "Network connection detected (rule: NetworkConnect)",
"user": {
"identifier": "S-1-5-18",
"name": "SYSTEM",
"domain": "NT AUTHORITY",
"type": "Well Known Group"
}
},
Within the ReplaceText processor I have this configuration
ReplaceText
"winlog.event_data.RuleName":"MitreRef=(.*),Technique=(.*),Tactic=(.*),Alert=(.*)"
"MitreRef":"$1","Technique":"$2","Tactic":"$3","Alert":"$4"
The first problem is that the new fields MitreRef etc. are not created.
The second thing is that the fields may appear in any order in the original JSON, e.g.
"RuleName": "Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043"
or,
MitreRef=1043,Tactic=Command and Control,Technique=Commonly Used Port
Any ideas on how to proceed?

Welcome to StackOverflow!
As your question is quite ambiguous, I'll try to guess what you aimed for.
Replacing string value of "RuleName" with JSON representation
I assume that you want to replace the entry
"RuleName": "Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043"
with something along the lines of
"RuleName": {
"Technique": "Commonly Used Port",
"Tactic": "Command and Control",
"MitreRef": "1043"
}
In this case you can grab basically the whole line and assume you have three groups of characters, each consisting of
A number of characters that are not the equals sign: ([^=]+)
The equals sign =
A number of characters that are not the comma sign: ([^,]+)
These groups in turn are separated by a comma: ,
Based on these assumptions you can write the following RegEx inside the Search Value property of the ReplaceText processor:
"RuleName"\s*:\s*"([^=]+)=([^,]+),([^=]+)=([^,]+),([^=]+)=([^,]+)"
With this, you grab the whole line and build a group for every important data point.
Based on the groups you may set the Replacement Value to:
"RuleName": {
"${'$1'}": "${'$2'}",
"${'$3'}": "${'$4'}",
"${'$5'}": "${'$6'}"
}
Resulting in the above mentioned JSON object.
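To try the pattern outside NiFi, here is a minimal Python sketch of the same search and replacement. It assumes Python's re module behaves like NiFi's Java regex for this simple pattern; the names SEARCH, REPLACEMENT and expand_rule_name are made up for the illustration, and Python writes back-references as \1 where NiFi uses $1.

```python
import re

# Same pattern as the Search Value above: three key=value pairs separated by commas.
SEARCH = re.compile(
    r'"RuleName"\s*:\s*"([^=]+)=([^,]+),([^=]+)=([^,]+),([^=]+)=([^,]+)"'
)

# Python back-references (\1..\6) stand in for NiFi's $1..$6.
REPLACEMENT = r'"RuleName": {"\1": "\2", "\3": "\4", "\5": "\6"}'


def expand_rule_name(text: str) -> str:
    """Replace every matching RuleName string with a JSON object."""
    return SEARCH.sub(REPLACEMENT, text)


# The key names are captured rather than hard-coded, so both orderings work.
for line in (
    '"RuleName": "Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043"',
    '"RuleName": "MitreRef=1043,Tactic=Command and Control,Technique=Commonly Used Port"',
):
    print(expand_rule_name(line))
```

Because the keys are captured instead of being spelled out in the pattern, the order in which MitreRef, Technique and Tactic appear does not matter.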
Some remarks
The RegEx assumes that the entry is on a single line and does NOT work when it is split across multiple lines, e.g.
"RuleName":
"Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043"
The RegEx assumes there are exactly three "items" inside the value of RuleName and does NOT work with a different number of "items".
If your JSON file can grow large, try to avoid the Entire text evaluation mode, as this loads the content into a buffer and routes the FlowFile to the failure output if the file is too large. In that case I recommend the Line-by-Line mode.
Allowing a fourth additional value
In case there might be a fourth additional value, you may adjust the RegEx in the Search Value property.
You can add (,([^=]+)=([^,]+))? to the previous expression, which roughly translates to:
( )? - match what is in the bracket zero or one times
, - match the character comma
([^=]+)=([^,]+) - followed by the group of characters as explained above
The whole RegEx will look like this:
"RuleName"\s*:\s*"([^=]+)=([^,]+),([^=]+)=([^,]+),([^=]+)=([^,]+)(,([^=]+)=([^,]+))?"
To allow the new value to be used you have to adjust the replacement value as well.
You can use the Expression Language available in most NiFi processor properties to decide whether to add another item to the JSON object or not.
${'$7':isEmpty():ifElse(
'',
${literal(', "'):append(${'$8'}):append('": '):append('"'):append(${'$9'}):append('"')}
)}
This expression will look if the seventh RegEx group exists or not and either append an empty string or the found values.
With this modification included the whole replacement value will look like the following:
"RuleName": {
"${'$1'}": "${'$2'}",
"${'$3'}": "${'$4'}",
"${'$5'}": "${'$6'}"
${'$7':isEmpty():ifElse(
'',
${literal(', "'):append(${'$8'}):append('": '):append('"'):append(${'$9'}):append('"')}
)}
}
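The optional fourth pair can be checked outside NiFi with a short Python sketch. This is only an analogue, not NiFi Expression Language: a replacement callback (the hypothetical build_object below) plays the role of the isEmpty():ifElse(...) expression by testing whether group 7 matched.

```python
import json
import re

# Same Search Value as above, including the optional fourth pair (groups 7-9).
SEARCH = re.compile(
    r'"RuleName"\s*:\s*"([^=]+)=([^,]+),([^=]+)=([^,]+),([^=]+)=([^,]+)(,([^=]+)=([^,]+))?"'
)


def build_object(match: re.Match) -> str:
    """Emit the three mandatory pairs, plus the fourth only if group 7 matched."""
    pairs = [(match.group(i), match.group(i + 1)) for i in (1, 3, 5)]
    if match.group(7):  # stands in for ${'$7':isEmpty():ifElse(...)}
        pairs.append((match.group(8), match.group(9)))
    body = ", ".join(f'"{key}": "{value}"' for key, value in pairs)
    return f'"RuleName": {{{body}}}'


def expand_rule_name(text: str) -> str:
    """Replace every RuleName string (three or four pairs) with a JSON object."""
    return SEARCH.sub(build_object, text)


sample = (
    '{"event_data": {"RuleName": "Technique=Commonly Used Port,'
    'Tactic=Command and Control,MitreRef=1043,Foo=Bar"}}'
)
# The result parses as JSON, with RuleName now a nested object.
print(json.loads(expand_rule_name(sample)))
```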
Regarding multiple occurrences
The ReplaceText processor replaces all occurrences it finds where the RegEx matches. Using the settings provided in the last paragraph given the following example input
{
"event_data": {
"RuleName": "Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043,Foo=Bar"
},
"RuleName": "Technique=Commonly Used Port,Tactic=Command and Control,MitreRef=1043"
}
will result in the following:
{
"event_data": {
"RuleName": {
"Technique": "Commonly Used Port",
"Tactic": "Command and Control",
"MitreRef": "1043",
"Foo": "Bar"
}
},
"RuleName": {
"Technique": "Commonly Used Port",
"Tactic": "Command and Control",
"MitreRef": "1043"
}
}
Example template
You may download a template I created that includes the above processor from gist.

Related

How to match a string exactly OR exact substring from beginning using Regular Expression

I'm trying to build a regex query for a database and it's got me stumped. If I have a string with a varying number of elements that has an ordered structure how can I find if it matches another string exactly OR some exact sub string when read from the left?
For example I have these strings
Canada.Ontario.Toronto.Downtown
Canada.Ontario
Canada.EasternCanada.Ontario.Toronto.Downtown
England.London
France.SouthFrance.Nice
They are structured by most general location to specific, left to right. However, the number of elements varies with some specifying a country.region.state and so on, and some just country.town. I need to match not only the words but the order.
So if I want to match "Canada.Ontario.Toronto.Downtown" I would want to both get #1 and #2 and nothing else. How would I do that? Basically running through the string and as soon as a different character comes up it's not a match but still allow a sub string that ends "early" to match like #2.
I've tried making groups and using "?" like (canada)?.?(Ontario)?.? etc but it doesn't seem to work in all situations since it can match nothing as well.
Edit as requested:
Mongodb Database Collection:
[
{
"_id": "doc1",
"context": "Canada.Ontario.Toronto.Downtown",
"useful_data": "Some Data"
},
{
"_id": "doc2",
"context": "Canada.Ontario",
"useful_data": "Some Data"
},
{
"_id": "doc3",
"context": "Canada.EasternCanada.Ontario.Toronto.Downtown",
"useful_data": "Some Data"
},
{
"_id": "doc4",
"context": "England.London",
"useful_data": "Some Data"
},
{
"_id": "doc5",
"context": "France.SouthFrance.Nice",
"useful_data": "Some Data"
},
{
"_id": "doc6",
"context": "",
"useful_data": "Some Data"
}
]
User provides "Canada", "Ontario", "Toronto", and "Downtown" values in that order and I need to use that to query doc1 and doc2 and no others. So I need a regex pattern to put in here: collection.find({"context": {$regex: <pattern here>}}) If it's not possible I'll just have to restructure the data and use different methods of finding those docs.
At each dot, start a nested optional group for the next term, and add start and end anchors:
^Canada(\.Ontario(\.Toronto(\.Downtown)?)?)?$
See live demo.
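Since the user supplies the terms, the pattern has to be built at query time. A minimal Python sketch of that step follows; prefix_pattern is a hypothetical helper, it assumes a non-empty list of terms, and it uses Python's re only to check the result against the sample contexts (the resulting string is what would go into $regex).

```python
import re


def prefix_pattern(terms: list[str]) -> str:
    """Build an anchored pattern of nested optional groups from ordered terms."""
    pattern = ""
    # Wrap from the innermost (last) term outwards.
    for term in reversed(terms[1:]):
        pattern = rf"(\.{re.escape(term)}{pattern})?"
    return f"^{re.escape(terms[0])}{pattern}$"


contexts = [
    "Canada.Ontario.Toronto.Downtown",
    "Canada.Ontario",
    "Canada.EasternCanada.Ontario.Toronto.Downtown",
    "England.London",
    "France.SouthFrance.Nice",
    "",
]

pattern = prefix_pattern(["Canada", "Ontario", "Toronto", "Downtown"])
print(pattern)
print([c for c in contexts if re.search(pattern, c)])
```

Note that the first term is mandatory here, so the empty context never matches, while a bare "Canada" would.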

Regular Expression If 2nd parameter is Enrollment

I have below response
{
"id": "3452",
"enrollable_id": "3452",
"enrollable_type": "Enrollment"
}
{
"id": "3453",
"enrollable_id": "3453",
"enrollable_type": "Task"
}
{
"id": "3454",
"enrollable_id": "3454",
"enrollable_type": "Enrollment"
}
{
"id": "3455",
"enrollable_id": "3455",
"enrollable_type": "Task"
}
I would like to get id [3452 and 3454] only if enrollable_type= Enrollment. This is for jmeter regex extractor so it would be great if I can just use one liner regex to fetch 3452 and 3454.
The RegEx you are looking for is:
_id":\s*"([^"]+(?=[^\0}]+_type":\s*"E))
Try it online!
Explanation
_id":\s*" Finds the place where the enrollable_id is
[^"]+(?= Matches the ID if:
[^\0}]+_type":\s* Finds the place where enrollable_type is
"E Checks if the enrollable type begins with an uppercase E
) End if
( ) Captures the ID
It's important to note that this RegEx matches only the entries whose type begins with an uppercase E, and that the ID itself is in the capture group. This means you will need to read each match's capture rather than the whole match.
Disclaimer
The above RegEx contains backslashes, which you will need to escape if using the RegEx as a string literal.
This is the RegEx with all necessary-to-escape characters escaped:
_id":\\s*"([^"]+(?=[^\\0}]+_type":\\s*"E))
It's usually a bad idea to parse structured data with just a regex, but if you're intent on going this route then here you go:
"(\d+)"\s*,\s*(?="enrollable_type":\s*"Enrollment")
This assumes that enrollable_type always follows enrollable_id and that everything is quoted consistently, with a little allowance for variance in white space. You should be able to handle a little more variance if necessary, such as if you're unsure whether you can depend on keys or data being quoted (["']?). However, if you cannot depend on the order of the properties (such as if the type comes before the id) then you should abandon using a regex.
Here's a sample working in JavaScript
const text = `{ "id": "3452", "enrollable_id": "3452", "enrollable_type": "Enrollment" } { "id": "3453", "enrollable_id": "3453", "enrollable_type": "Task" } { "id": "3454", "enrollable_id": "3454", "enrollable_type": "Enrollment" } { "id": "3455", "enrollable_id": "3455", "enrollable_type": "Task" }`;
const re = /"(\d+)"\s*,\s*(?="enrollable_type":\s*"Enrollment")/g;
var match;
while(match = re.exec(text)) {
console.log(match[1]);
}
Your response seems to be a JSON one (however it's malformed). If this is the case and it's really JSON - I would recommend going for JSON Extractor instead as regular expressions are fragile, sensitive to markup change, new lines, order of elements, etc. while JSON Extractor looks only into the content.
The relevant JSON Path query would be something like:
$..[?(@.enrollable_type == 'Enrollment')].enrollable_id
More information: JMeter's JSON Path Extractor Plugin - Advanced Usage Scenarios
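To illustrate why filtering on parsed content is less fragile than a regex, here is a small Python sketch of the same filter, run outside JMeter. It assumes the response is a run of back-to-back JSON objects, as in the question; enrollment_ids is a hypothetical helper name.

```python
import json


def enrollment_ids(payload: str) -> list[str]:
    """Decode back-to-back JSON objects and keep the ids of type Enrollment."""
    decoder = json.JSONDecoder()
    ids, pos = [], 0
    while pos < len(payload):
        # raw_decode does not skip leading whitespace, so step over it here.
        if payload[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(payload, pos)
        if obj.get("enrollable_type") == "Enrollment":
            ids.append(obj["enrollable_id"])
    return ids


response = """
{ "id": "3452", "enrollable_id": "3452", "enrollable_type": "Enrollment" }
{ "id": "3453", "enrollable_id": "3453", "enrollable_type": "Task" }
{ "enrollable_type": "Enrollment", "id": "3454", "enrollable_id": "3454" }
"""
print(enrollment_ids(response))
```

The third object lists its properties in a different order, which the order-dependent regexes above would miss.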
You can extract the data in 2 ways
Using Json Extractor.
To extract data using json extractor response data should follow json syntax rules,
To extract data use the following JSON path in json extractor
$..[?(@.enrollable_type=="Enrollment")].id
and use match no -1 as shown below
To extract data using regular expression extractor use the following regex
id": "(.+?)",\s*(.+?)\s*"enrollable_type": "Enrollment
template : $1$2$3$4$
Match no -1
as shown below
you can see the variables stored using debug sampler
More information
extract variables

Regexp starts with not working Elasticsearch 6.*

I have trouble understanding the regexp mechanism in Elasticsearch. I have documents that represent property units:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
The DailyAvailability field encodes the availability of a property by day for the next two years from today. 'A' means available, 'U' unavailable, 'I' can check in, 'O' can check out. How can I write a regexp filter to get all units that are available on particular dates?
I tried to find the 'A' substring with particular length and offset in DailyAvailability field. For example to find units that would be available for 7 days in 7 days from today:
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
}
]
}
}
}
This query returns, for instance, a unit whose DailyAvailability starts with "UUUUUUUUUUUUUUUUUUUIAA" but contains a suitable sequence somewhere inside the field. How can I anchor the regexp to the entire source string? The ES docs say that Lucene regexes are anchored by default.
P.S. I have tried '^.{7}a{7}.*$'. Returns empty set.
It looks like you are using text datatype to store Unit.DailyAvailability (which is also the default one for strings if you are using dynamic mapping). You should consider using keyword datatype instead.
Let me explain in a bit more detail.
Why does my regex match something in the middle of a text field?
What happens with text datatype is that the data gets analyzed for full-text search. It does some transformations like lowercasing and splitting into tokens.
Let's try to use the Analyze API against your input:
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
As you can see, Elasticsearch has split your input into three tokens and lowercased them. This looks unexpected, but if you think that it actually tries to facilitate search for words in human language, it makes sense - there are no such long words.
That's why now regexp query ".{7}a{7}.*" will match: there is a token that actually starts with a lot of a's, which is an expected behavior of regexp query.
...Elasticsearch will apply the regexp to the terms produced by the
tokenizer for that field, and not to the original text of the field.
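The effect can be imitated in a few lines of Python. This is only a simulation under stated assumptions: it lowercases the value and cuts it every 255 characters, which is what the Analyze API response above shows for this input, and it uses Python's re.fullmatch as a stand-in for Lucene's anchored matching. The function names are made up for the sketch.

```python
import re

# Assumption: terms are cut every 255 characters, as in the response above.
MAX_TOKEN_LENGTH = 255


def simulate_text_terms(value: str) -> list[str]:
    """Lowercase the value and cut it into chunks of at most 255 characters."""
    lowered = value.lower()
    return [
        lowered[start:start + MAX_TOKEN_LENGTH]
        for start in range(0, len(lowered), MAX_TOKEN_LENGTH)
    ]


def regexp_hits_text_field(value: str, pattern: str) -> bool:
    """A regexp query on a text field succeeds if any single term fully matches."""
    return any(re.fullmatch(pattern, term) for term in simulate_text_terms(value))


# 262 unavailable days followed by 20 available ones.
value = "U" * 262 + "A" * 20
print(regexp_hits_text_field(value, ".{7}a{7}.*"))     # matched against each term
print(re.fullmatch(".{7}A{7}.*", value) is not None)   # matched against the whole value
```

The second term begins 255 characters into the original string, so an offset counted from the start of a term no longer means an offset counted from today.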
How can I make regexp query consider the entire string?
It is very simple: do not apply analyzers. The type keyword stores the string you provide as is.
With a mapping like this:
PUT my_regexes
{
"mappings": {
"doc": {
"properties": {
"Unit": {
"properties": {
"DailyAvailablity": {
"type": "keyword"
}
}
}
}
}
}
}
You will be able to do a query like this that will match the document from the post:
POST my_regexes/doc/_search
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
}
]
}
}
}
Note that the query became case-sensitive because the field is not analyzed.
This regexp won't return any results anymore: ".{12}a{7}.*"
This will: ".{12}A{7}.*"
So what about anchoring?
The regexes are anchored:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
The reason why it looked like the anchoring was wrong was most likely because tokens got split in an analyzed text field.
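With a keyword field, the pattern for "available for N nights starting in M days" can be generated rather than typed by hand. The following Python sketch shows one way to do that; availability_pattern and is_available are hypothetical helpers, and re.fullmatch is used only to imitate Lucene's implicit anchoring when checking the pattern locally.

```python
import re


def availability_pattern(offset_days: int, nights: int) -> str:
    """Build a pattern such as .{7}A{7}.* (uppercase A: keyword fields are case-sensitive)."""
    return f".{{{offset_days}}}A{{{nights}}}.*"


def is_available(daily_availability: str, offset_days: int, nights: int) -> bool:
    """Check the whole string, the way an anchored regexp query would."""
    pattern = availability_pattern(offset_days, nights)
    return re.fullmatch(pattern, daily_availability) is not None


free_next_week = "U" * 7 + "A" * 7 + "OUU"
free_much_later = "U" * 19 + "I" + "A" * 10

print(availability_pattern(7, 7))
print(is_available(free_next_week, 7, 7))
print(is_available(free_much_later, 7, 7))
# An unanchored search would accept the second string as well.
print(re.search(availability_pattern(7, 7), free_much_later) is not None)
```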
Just an addition to the brilliant and helpful answer of Nikolay Vasiliev. In my case I was forced to go further to make it work with NEST in .NET. I added attribute mapping to DailyAvailability:
[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still didn't work, and I got this mapping:
"DailyAvailability":{"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
My field contained about 732 characters, so it exceeded ignore_above and was not indexed. I tried:
[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
It didn't make any difference on mapping. And only after adding manual mappings it started working properly:
var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
.Mappings(ms => ms.Map<Unit>(m => m
.Properties(ps => ps
.Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
)
)
));
The point is that:
ignore_above - Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please however note that default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.
So use explicit mapping for long keyword fields to set ignore_above if you need to filter them with regexp.
In case it is useful to anyone: the Elasticsearch regexp syntax does not support the \d and \w shorthand classes, so you should write them out as character classes such as [0-9] and [a-z].

Elasticsearch aggregation to extract pattern and occurrences

I have trouble formulating what I'm looking for so I'll use an example:
You put 3 documents in elasticsearch all with a field "name" containing these values: "test", "superTest51", "stvv".
Is it possible to extract a regular expression like pattern with the occurrences? In this case:
"xxxx": 2 occurrences
"x{5}Xxxx99": 1 occurrence
I've read some things about analyzers, but I don't think that's what I'm looking for.
Edit: To make the question clearer: I don't want to search for a regex pattern, I want to do an aggregate on a regular expression replaced field. For example replace [a-z] with x. Is the best way really to do the regular expression replace outside of elasticsearch?
Based on the wording of your request I am not sure this is what you are looking for, but assuming you mean to search based on a regex,
the following should help:
wildcard and regexp queries
Do take note that the behavior differs depending on whether the targeted field is analyzed or not.
If you went with the vanilla setup of Elasticsearch, as most people do to start, your field will likely be analyzed; you can check the mapping in your indices to confirm that.
Based on your example and assuming you have a not_analyzed name field:
GET _search
{
"query": {
"regexp": {
"name": "[a-z]{4}"
}
}
}
GET _search
{
"query": {
"regexp": {
"name": "[a-z]{5}[A-Z][a-z]{3}[0-9]{2}"
}
}
}
Based on your update, and a quick search (am not that familiar with aggregations), could be something like the following would match your expectations:
GET _search
{
"size": 0,
"aggs": {
"regmatch": {
"filters": {
"filters": {
"xxxx": {
"regexp": {
"name": "[a-z]{4}"
}
},
"x{5}Xxxx99": {
"regexp": {
"name": "[a-z]{5}[A-Z][a-z]{3}[0-9]{2}"
}
}
}
}
}
}
}
This will give you 3 counts:
- total number of events
- number of first regex match
- number of second regex match
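If the patterns are not known in advance, the filters aggregation cannot enumerate them, and doing the replacement outside Elasticsearch (as the question's edit suggests) becomes the simpler route. A minimal Python sketch of that idea, assuming the names have already been fetched into a list; shape is a hypothetical helper, and it produces the uncompressed form "xxxxxXxxx99" rather than "x{5}Xxxx99".

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Replace lowercase letters with x, uppercase letters with X, digits with 9."""
    value = re.sub(r"[a-z]", "x", value)
    value = re.sub(r"[A-Z]", "X", value)
    return re.sub(r"[0-9]", "9", value)


names = ["test", "superTest51", "stvv"]

# Count how many names share each shape.
occurrences = Counter(shape(name) for name in names)
for pattern, count in occurrences.most_common():
    print(pattern, count)
```

The same shape could also be computed once at index time and stored in its own keyword field, so that an ordinary terms aggregation on that field yields the counts.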

How to extract everything between 2 characters from JSON response?

I'm using the regex in Jmeter 2.8 to extract some values from JSON responses.
The response is like that:
{
"key": "prod",
"id": "p2301d",
"objects": [{
"id": "102955",
"key": "member",
...
}],
"features":"product_features"
}
I'm trying to get everything except the text between [{....}] with one regex.
I've tried this one "key":([^\[\{.*\}\],].+?) but I'm always getting the other values between [{...}] (in this example: member)
Do you have any clue?
Thanks.
You can try the custom JSON utilities for JMeter (JSON Path Assertion, JSON Path Extractor, JSON Formatter); in this case, the JSON Path Extractor.
Add ATLANTBH jmeter-components to jmeter: https://github.com/ATLANTBH/jmeter-components#installation-instructions.
Add JSON Path Extractor (from Post Processors components list) as child to the sampler which returns json response you want to process:
(I've used Dummy Sampler to emulate your response, you will have your original sampler)
Add as many extractors as there are values you want to extract (3 in this case: "key", "id", "features").
Configure each extractor: define variable name to store extracted value and JSONPath query to extract corresponding value:
for "key": $.key
for "id": $.id
for "features": $.features
Further in the script you can refer to the extracted values using JMeter variables (the variable name set in the "Name" field of the JSON Path Extractor settings): e.g. ${jsonKey}, ${jsonID}, ${$.features}.
Perhaps it may be not the most optimal way but it works.
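The three JSONPath queries above each select one top-level property and never descend into the "objects" array. The same selection can be sketched in Python, outside JMeter, to see what the extractors would return. The response below is a simplified stand-in for the one in the question (the elided fields are left out), and top_level_scalars is a hypothetical helper name.

```python
import json

# Simplified stand-in for the response in the question; elided fields are omitted.
RESPONSE = """
{
  "key": "prod",
  "id": "p2301d",
  "objects": [{"id": "102955", "key": "member"}],
  "features": "product_features"
}
"""


def top_level_scalars(payload: str) -> dict:
    """Keep top-level properties only, skipping nested arrays and objects."""
    document = json.loads(payload)
    return {
        name: value
        for name, value in document.items()
        if not isinstance(value, (list, dict))
    }


print(top_level_scalars(RESPONSE))
```

The nested "key": "member" never shows up, because the parser distinguishes it from the top-level "key" by structure rather than by text position.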
My solution to the problem was to turn the JSON into an object so that I can extract just the value that I want, and not the values in the {...}.
Here you can see my code:
var JSON={"itemType":"prod","id":"p2301d","version":"10","tags":[{"itemType":"member","id":"p2301e"},{"itemType":"other","id":"prod10450"}],"multiPrice":null,"prices":null};
//Transformation into an object:
obj = eval(JSON );
//write in the Jmeter variable "itemtype", the content of obj.itemType:prod
vars.put("itemtype", obj.itemType);
For more information: http://www.havecomputerwillcode.com/blog/?p=500.
A general solution: DEMO
Regex: (\[{\n\s*(?:\s*"\w+"\s*:\s*[^,]+,)+\n\s*}\])
Explanation: your regex does not consume the whitespace it needs to. Each line begins with spaces, and they must be consumed before the match can continue; that is why your regex isn't working. You don't need to escape the { character.