How to capture a filename with or without an extension - regex

I'm trying to capture and replace a filename like
000035 ZSMS_1.mp3
but also a file like
000035 1OMNA
(I'm basically trying to reorder them so they look like e.g., ZSMS_1(000035).mp3).
I've tried
^(\d+) (.*)(\..*$)?
^(\d+) (.*?)(\..*$)?
What I expect to happen:
000035 ZSMS_1.mp3:
[
{
"groups": [
"000035",
"ZSMS_1",
".mp3"
],
"match": "000035 ZSMS_1.mp3"
}
]
000035 1OMNA:
[
{
"groups": [
"000035",
"1OMNA",
""
],
"match": "000035 1OMNA"
}
]
What happens:
^(\d+) (.*)(\..*$)?
000035 ZSMS_1.mp3:
[
{
"groups": [
"000035",
"ZSMS_1.mp3",
""
],
"match": "000035 ZSMS_1.mp3"
}
]
000035 1OMNA:
[
{
"groups": [
"000035",
"1OMNA",
""
],
"match": "000035 1OMNA"
}
]
^(\d+) (.*?)(\..*$)?
000035 ZSMS_1.mp3:
[
{
"groups": [
"000035",
"",
""
],
"match": "000035 "
}
]
000035 1OMNA:
[
{
"groups": [
"000035",
"",
""
],
"match": "000035 "
}
]

You may use
^(\d+)\h+(.*?)(\.[^.]*)?$
See the regex demo
Details
^ - start of string
(\d+) - Group 1: one or more digits
\h+ - 1+ horizontal whitespaces (for better regex engine cross-compatibility, you may use [^\S\r\n]+ or just [ \t]+ to match a tab or space)
(.*?) - Group 2: zero or more chars other than linebreak chars, as few as possible
(\.[^.]*)? - an optional capturing group #3: a dot and then 0 or more chars other than . as many as possible
$ - end of string.
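As a minimal sketch of how the pattern above could drive the reordering from the question, here it is in Python. Python's re module does not support \h, so the [ \t]+ alternative mentioned above is used instead. The helper name reorder is hypothetical and only for illustration.

```python
import re

# Same pattern as above, with [ \t]+ in place of \h+ (Python's re has no \h).
PATTERN = re.compile(r"^(\d+)[ \t]+(.*?)(\.[^.]*)?$")

def reorder(name):
    """Turn '000035 ZSMS_1.mp3' into 'ZSMS_1(000035).mp3'."""
    m = PATTERN.match(name)
    if m is None:
        return name  # leave names that do not fit the scheme untouched
    number = m.group(1)
    stem = m.group(2)
    ext = m.group(3) or ""  # group 3 is None when there is no extension
    return f"{stem}({number}){ext}"

print(reorder("000035 ZSMS_1.mp3"))
print(reorder("000035 1OMNA"))
```

The lazy (.*?) only works here because the trailing $ sits outside the optional group, which forces the engine to keep extending group 2 until the rest of the string is either an extension or empty.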

You could try the following regex:
^(\d+)\s*(?:(\w+)?)(?:(\.\w+)?)$
Details:
^ - start of line
(\d+) - Group 1: one or more digits
\s* - zero or more whitespace characters separating groups 1 and 2
(?:(\w+)?) - Group 2 (optional): one or more word characters
(?:(\.\w+)?) - Group 3 (optional): a literal . followed by one or more word characters
$ - end of line
Demo

Related

Split log message on space for grok pattern

I am two days new to grok and ELK.
I am struggling to split the log messages on spaces so that the parts appear as separate fields in Logstash.
My input pattern is:
2022-02-11 11:57:49 - app - INFO - function_name=add elapsed_time=0.0296 input_params=6_3
I would like to see separate fields in Logstash/Kibana for function_name, elapsed_time and input_params.
At the moment, I have the following .conf:
input{
file{
path => "/path/to/log/file"
start_position => "beginning"
}
}
filter{
grok{
match => {"message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log-level} %{(?<function_name>[^.]*)\.(?<elapsed_time>[^.]*)\.(?<input>[^.]*)}"}
}
date {
match => ["timestamp", "ISO8601"]
}
function_name {
match => ["function_name", "DATA"]
}
elapsed_time {
match => ["elapsed_time", "BASE16FLOAT"]
}
input {
match => ["input", "DATA"]
}
}
output{
elasticsearch{
hosts => ["localhost:9200"]
index => "math_apis"
}
stdout{codec => rubydebug}
}
But this only produces the following message in Logstash:
{
"host" => "hostname",
"@timestamp" => 2022-02-11T06:27:49.404Z,
"message" => "2022-02-11 11:57:49 - app - INFO - function_name=add elapsed_time=0.0296 input_params=6_3",
"path" => "path/to/log/file",
"@version" => "1",
"tags" => [
[0] "_grokparsefailure"
]
}
You can use the following pattern:
%{TIMESTAMP_ISO8601:timestamp} - \S+ - %{LOGLEVEL:log_level} - function_name=%{NOTSPACE:function_name} elapsed_time=%{NOTSPACE:elapsed_time} input_params=%{NOTSPACE:input}
Details:
%{TIMESTAMP_ISO8601:timestamp} - timestamp field
- - a literal string
\S+ - any one or more non-whitespace chars
- - a literal string
%{LOGLEVEL:log_level} - LOGLEVEL pattern
- function_name= - a literal string
%{NOTSPACE:function_name} - function_name field of one or more non-whitespace chars
elapsed_time= - space and elapsed_time= string
%{NOTSPACE:elapsed_time} - elapsed_time field of one or more non-whitespace chars
input_params= - literal string
%{NOTSPACE:input} - input field of one or more non-whitespace chars.
See more about Grok patterns here.
Test output:
{
"timestamp": [
[
"2022-02-11 11:57:49"
]
],
"YEAR": [
[
"2022"
]
],
"MONTHNUM": [
[
"02"
]
],
"MONTHDAY": [
[
"11"
]
],
"HOUR": [
[
"11",
null
]
],
"MINUTE": [
[
"57",
null
]
],
"SECOND": [
[
"49"
]
],
"ISO8601_TIMEZONE": [
[
null
]
],
"log_level": [
[
"INFO"
]
],
"function_name": [
[
"add"
]
],
"elapsed_time": [
[
"0.0296"
]
],
"input": [
[
"6_3"
]
]
}
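To check the field layout outside Logstash, the same structure can be approximated with a plain Python regex. This is only a sketch: the timestamp and log-level sub-patterns are simplified stand-ins for TIMESTAMP_ISO8601 and LOGLEVEL, not the real grok definitions, and parse_line is a hypothetical helper name.

```python
import re

# Rough stand-in for the grok pattern above. \S+ plays the role of NOTSPACE.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - \S+ - "
    r"(?P<log_level>[A-Za-z]+) - "
    r"function_name=(?P<function_name>\S+) "
    r"elapsed_time=(?P<elapsed_time>\S+) "
    r"input_params=(?P<input>\S+)$"
)

def parse_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

sample = ("2022-02-11 11:57:49 - app - INFO - "
          "function_name=add elapsed_time=0.0296 input_params=6_3")
print(parse_line(sample))
```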

Elasticsearch pattern regex start with

I would like to ask whether there is any documentation that describes how to work with Elasticsearch regex patterns.
I need to write a Pattern Capture Token Filter that keeps only tokens starting with a specific word. For example, given the input token stream ("abcefgh", "abc123", "aabbcc", "abc", "abdef"), my filter should return only the tokens abcefgh, abc123 and abc, because those tokens start with "abc".
Can someone help me achieve this?
Thanks.
I suggest something like this:
"analysis": {
"analyzer": {
"my_trim_keyword_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim",
"generate_tokens",
"eliminate_tokens",
"remove_empty"
]
}
},
"filter": {
"eliminate_tokens": {
"pattern": "^(?!abc)\\w+$",
"type": "pattern_replace",
"replacement": ""
},
"generate_tokens": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": [
"(([a-z]+)(\\d*))"
]
},
"remove_empty": {
"type": "stop",
"stopwords": [""]
}
}
}
If your tokens are the result of a pattern_capture filter, you need to add, after that filter, the one called eliminate_tokens in my example. Its pattern matches tokens that don't start with abc, and every token that matches is replaced by an empty string ("replacement": "").
After this, to remove the empty tokens, I added the remove_empty filter, which is a stop filter whose only stopword is "" (the empty string).
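The effect of the lowercase, trim, eliminate_tokens and remove_empty steps can be sketched in plain Python to see which tokens survive. This only imitates the logic and is not how Elasticsearch runs it; the generate_tokens step is left out, and keep_abc_tokens is a hypothetical name.

```python
import re

# Same regex as in eliminate_tokens: matches whole tokens NOT starting with "abc".
ELIMINATE = re.compile(r"^(?!abc)\w+$")

def keep_abc_tokens(tokens):
    kept = []
    for token in tokens:
        token = token.strip().lower()     # "trim" and "lowercase"
        token = ELIMINATE.sub("", token)  # "eliminate_tokens": blank out non-abc tokens
        if token:                         # "remove_empty": drop empty strings
            kept.append(token)
    return kept

print(keep_abc_tokens(["abcefgh", "abc123", "aabbcc", "abc", "abdef"]))
```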

How to use regex inside in query using morphia?

MongoDB allows regular expressions of the form /pattern/ without using a $regex expression.
http://docs.mongodb.org/manual/reference/operator/query/in/
How can I do this using Morphia?
If I give a field criteria with the in operator and a value of type java.util.regex.Pattern, the generated query is
$in:[$regex: 'given pattern'], which doesn't return the expected results at all.
Expectation: $in :[ /pattern1 here/,/pattern2 here/]
Actual using 'Pattern' object : $in : [$regex:/pattern1 here/,$regex:/pattern 2 here/]
I'm not entirely sure what to make of your code examples, but here's a working Morphia code snippet:
Pattern regexp = Pattern.compile("^" + email + "$", Pattern.CASE_INSENSITIVE);
mongoDatastore.find(EmployeeEntity.class).filter("email", regexp).get();
Note that this is really slow. It can't use an index and will always require a full collection scan, so avoid it at all cost!
Update: I've added a specific code example. The $in is not required to search inside an array. Simply use /^I/ as you would in string:
> db.profile.find()
{
"_id": ObjectId("54f3ac3fa63f282f56de64bd"),
"tags": [
"India",
"Australia",
"Indonesia"
]
}
{
"_id": ObjectId("54f3ac4da63f282f56de64be"),
"tags": [
"Island",
"Antigua"
]
}
{
"_id": ObjectId("54f3ac5ca63f282f56de64bf"),
"tags": [
"Spain",
"Mexico"
]
}
{
"_id": ObjectId("54f3ac6da63f282f56de64c0"),
"tags": [
"Israel"
]
}
{
"_id": ObjectId("54f3ad17a63f282f56de64c1"),
"tags": [
"Germany",
"Indonesia"
]
}
{
"_id": ObjectId("54f3ad56a63f282f56de64c2"),
"tags": [
"ireland"
]
}
> db.profile.find({ tags: /^I/ })
{
"_id": ObjectId("54f3ac3fa63f282f56de64bd"),
"tags": [
"India",
"Australia",
"Indonesia"
]
}
{
"_id": ObjectId("54f3ac4da63f282f56de64be"),
"tags": [
"Island",
"Antigua"
]
}
{
"_id": ObjectId("54f3ac6da63f282f56de64c0"),
"tags": [
"Israel"
]
}
{
"_id": ObjectId("54f3ad17a63f282f56de64c1"),
"tags": [
"Germany",
"Indonesia"
]
}
Note: The position in the array makes no difference, but the search is case sensitive. Use /^I/i if this is not desired or Pattern.CASE_INSENSITIVE in Java.
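The matching rule described in the note can be imitated in plain Python to make it concrete: a document matches when any element of its tags array matches the regex. This is only a sketch over in-memory data, not a MongoDB call, and find_by_tag is a hypothetical helper.

```python
import re

# The tags arrays from the shell session above (ObjectIds omitted).
profiles = [
    {"tags": ["India", "Australia", "Indonesia"]},
    {"tags": ["Island", "Antigua"]},
    {"tags": ["Spain", "Mexico"]},
    {"tags": ["Israel"]},
    {"tags": ["Germany", "Indonesia"]},
    {"tags": ["ireland"]},
]

def find_by_tag(docs, pattern, flags=0):
    """Keep documents where at least one tag matches, at any array position."""
    rx = re.compile(pattern, flags)
    return [doc for doc in docs if any(rx.search(tag) for tag in doc["tags"])]

case_sensitive = find_by_tag(profiles, r"^I")                   # like /^I/
case_insensitive = find_by_tag(profiles, r"^I", re.IGNORECASE)  # like /^I/i
print(len(case_sensitive), len(case_insensitive))
```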
Single RegEx Filter
Use .filter(), .criteria(), or .field():
query.filter("email", Pattern.compile("reg.*exp"));
// or
query.criteria("email").contains("reg.*exp");
// or
query.field("email").contains("reg.*exp");
Morphia converts this into:
find({"email": { $regex: "reg.*exp" } })
Multiple RegEx Filters
query.or(
query.criteria("email").contains("reg.*exp"),
query.criteria("email").contains("reg.*exp.*2"),
query.criteria("email").contains("reg.*exp.*3")
);
Morphia converts this into:
find({"$or" : [
{"email": {"$regex": "reg.*exp"}},
{"email": {"$regex": "reg.*exp.*2"}},
{"email": {"$regex": "reg.*exp.*3"}}
]
})
Unfortunately, as the MongoDB Manual 3.4 states:
"You cannot use $regex operator expressions inside an $in."
Otherwise, we could do:
Pattern[] patterns = new Pattern[] {
Pattern.compile("reg.*exp"),
Pattern.compile("reg.*exp.*2"),
Pattern.compile("reg.*exp.*3"),
};
query.field().in(patterns);
Hopefully, one day Morphia will support that :)

How to match array of sub string with array of string using mongo?

I have the following collection structure:
{
"_id": ObjectId("54c784d71e14acf9ae833f9f"),
"vms": [
{
"name": "ABC",
"ids": [
"abc.60a980004270457730244662385a4f69",
"abc.60a980004270457730244662385a4f6d"
]
},
{
"name": "PQR",
"ids": [
"abc.6d867d9c7acd60001aed76eb2c70bd53",
"abc.60a980004270457730244662385a4f6d"
]
},
{
"name": "XYZ",
"ids": [
"abc.600605b00237d91016cdc38f376bd31d",
"abc.600605b00237d91016cdc38f376cd32f"
]
}
]
}
I have an array which contains substrings of ids. Here is the array for reference:
myArray = [ "4270457730244662385a4f69","4270457730244662385a4f6d" , "4270457730244662385a4f6b"]
I want to find the vms entries whose ids contain none of the elements of myArray as a substring, using MongoDB.
Currently I am only able to match a single element using a regex.
In the above example, I want the output to be:
[
{
"name": "XYZ",
"ids": [
"abc.600605b00237d91016cdc38f376bd31d",
"abc.600605b00237d91016cdc38f376cd32f"
]
}
]
How do I find substrings in an array using MongoDB?
It is possible to do this using a regex. You can match a string against multiple substrings using the "or" operator, which is | in regex. Search for 'Boolean "or"' on Wikipedia.
MongoDB query using aggregation:
db.collection_name.aggregate([
{$unwind: "$vms"},
{$match: {
"vms.ids": {$not: /.*(4270457730244662385a4f69|4270457730244662385a4f6d|4270457730244662385a4f6b).*/}}
}
])
Output will be
{
"_id" : ObjectId("54c784d71e14acf9ae833f9f"),
"vms" : {
"name" : "XYZ",
"ids" : [
"abc.600605b00237d91016cdc38f376bd31d",
"abc.600605b00237d91016cdc38f376cd32f"
]
}
}
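When the substrings come from a variable rather than being typed into the query, the alternation can be built programmatically. The sketch below shows the idea in plain Python over in-memory data. It is not a MongoDB driver call, and vms_without_substrings is a hypothetical helper. Escaping each substring is an added precaution in case an id fragment contains regex metacharacters.

```python
import re

vms = [
    {"name": "ABC", "ids": ["abc.60a980004270457730244662385a4f69",
                            "abc.60a980004270457730244662385a4f6d"]},
    {"name": "PQR", "ids": ["abc.6d867d9c7acd60001aed76eb2c70bd53",
                            "abc.60a980004270457730244662385a4f6d"]},
    {"name": "XYZ", "ids": ["abc.600605b00237d91016cdc38f376bd31d",
                            "abc.600605b00237d91016cdc38f376cd32f"]},
]
my_array = ["4270457730244662385a4f69",
            "4270457730244662385a4f6d",
            "4270457730244662385a4f6b"]

def vms_without_substrings(entries, substrings):
    """Keep entries where no id contains any of the substrings."""
    if not substrings:
        return list(entries)  # an empty alternation would match everything
    rx = re.compile("|".join(re.escape(s) for s in substrings))
    return [e for e in entries if not any(rx.search(i) for i in e["ids"])]

print([e["name"] for e in vms_without_substrings(vms, my_array)])
```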

Splitting a JSON file: extracting data between curly braces

I have a JSON file that I want to split into separate parts.
The file's content is shown below.
I want to split the content based on the curly brackets {}:
"1010320": {
"abc": [
"1012220",
"hiiiiiiiii."
],
"xyz": "Describe"
},
"1012757": {
"pqr": [
"1013757",
"x"
]
},
"1014220": {
"abc": [
"1018420",
"sooooo"
],
"answer": "4th"
},
"1019660": {
"abc": [
"1031920",
"welcome"
],
"xyz": "Describing&Interpreting"
},
"1034280": {
"abc": [
"1040560",
"Ok..."
],
"nop": "Student Question"
},
The output should be:
1) "abc": [
"1012220",
"hiiiiiiiii."
],
"xyz": "Describe"
2) "pqr": [
"1013757",
"x"
]
3) "abc": [
"1018420",
"sooooo"
],
"answer": "4th"
Please help.
I think this will be useful for you:
(?<=\{)\n\s+((?:[\n]+|.*)+?)\n\}
regex demo here : http://regex101.com/r/rS3wI5
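As an alternative to the regex, and only if the file is valid JSON, a JSON parser can do the split. Note that this swaps the technique: it is not the regex approach above. The sketch assumes the fragment from the question is wrapped in an outer { } with no trailing comma, which the question does not show, and split_top_level is a hypothetical helper.

```python
import json

# Assumed shape of the file: the question's fragment wrapped in outer braces.
raw = """
{
    "1010320": {"abc": ["1012220", "hiiiiiiiii."], "xyz": "Describe"},
    "1012757": {"pqr": ["1013757", "x"]},
    "1014220": {"abc": ["1018420", "sooooo"], "answer": "4th"}
}
"""

def split_top_level(text):
    """Return one pretty-printed JSON string per top-level value, in file order."""
    data = json.loads(text)
    return [json.dumps(value, indent=4) for value in data.values()]

parts = split_top_level(raw)
for n, part in enumerate(parts, start=1):
    print(f"{n}) {part}")
```

Unlike the desired output in the question, each part keeps its own enclosing braces, so every piece is still valid JSON on its own.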