Regex excluding specific character - regex

I wonder if it is possible to exclude a character, or more specifically, to skip it if it exists. I have a JSON file like:
{
"key1": "value1",
"key2": "value2",
"array1": [
{
"key3":"value3",
"key4":"value4"
},
{
"key5":"value5",
"key6":"value6"
},
{
"key7":"value7",
"key8":"value8"
}
]
}
And I want the regex to produce output like:
{"key3":"value3","key4":"value4"}
{"key5":"value5","key6":"value6"}
{"key7":"value7","key8":"value8"}
My first version of the regex is:
"array1":\[(\{[A-za-z0-9%,:."]+})+
I don't know how to skip the "," character if it exists.
To simplify things, I work on the JSON without whitespace:
{"key1":"value1","key2":"value2","array1":[{"key3":"value3","key4":"value4"},{"key5":"value5","key6":"value6"},{"key7":"value7","key8":"value8"}]}
So I don't know if it's possible to do what I want with a regex, or whether I should just return array1 and process it with, for example, Java to split it into strings.
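For completeness, here is a minimal JavaScript sketch of the regex approach on the minified JSON above (a real parser such as JSON.parse remains the more robust choice):

```javascript
// Extract the contents of "array1" first, then match each {...} object.
// Works only for this flat, minified structure; nested objects would break it.
const str = '{"key1":"value1","key2":"value2","array1":[{"key3":"value3","key4":"value4"},{"key5":"value5","key6":"value6"},{"key7":"value7","key8":"value8"}]}';
const inner = str.match(/"array1":\[(.*?)\]/)[1]; // everything between [ and ]
const objects = inner.match(/\{[^{}]*\}/g);       // each object; commas between them are skipped
objects.forEach(o => console.log(o));
```

The `\{[^{}]*\}` pattern sidesteps the "," problem entirely: each match stops at the closing brace, so the separators never need to be part of the object pattern.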

Try replacing the unwanted character:
const str = '{"key1":"value1","key2":"value2","array1":[{"key3":"value3","key4":"value4"},{"key5":"value5","key6":"value6"},{"key7":"value7","key8":"value8"}]}';
const parts = str.split('[');
const newStr = parts[1].replace(/},/g, '}\n').replace('}]', '');
console.log(newStr);

Regular expressions are not the best tool for this job.
A simple PHP script that decodes the JSON and operates on the data always produces the correct result, and it is also able to detect invalid input:
$json = <<<END
{
"key1": "value1",
"key2": "value2",
"array1": [
{
"key3":"value3",
"key4":"value4"
},
{
"key5":"value5",
"key6":"value6"
},
{
"key7":"value7",
"key8":"value8"
}
]
}
END;
$data = json_decode($json, TRUE);
foreach ($data['array1'] as $value) {
echo json_encode($value), "\n";
}
Validation is not included in the code above, but it can easily be added by checking the value returned by json_decode() against NULL.
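The same decode-then-re-encode approach works in JavaScript as well, shown here as a sketch for comparison:

```javascript
// Parse the JSON properly, then re-serialize each element of array1 on its own line.
const json = '{"key1":"value1","key2":"value2","array1":[{"key3":"value3","key4":"value4"},{"key5":"value5","key6":"value6"},{"key7":"value7","key8":"value8"}]}';
const data = JSON.parse(json); // throws on invalid input, so validation comes for free
const lines = data.array1.map(value => JSON.stringify(value));
lines.forEach(line => console.log(line));
```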

Related

Mongodb conditional query search under an array

I have data containing an array, and under that array there are many objects. I am including the raw data so that anyone can see the structure:
{
_id: ObjectId(dfs45sd54fgds4gsd54gs5),
content: [
{
str: "Hey",
isDelete: false
},
{
str: "world",
isDelete: true
}
]
}
So I want to search for any string that matches, and I have to search inside the array.
So my query is like this:
let searchTerm = req.body.key;
db.collection.find(
{
'content.str': {
$regex: `.*\\b${searchTerm}\\b.*`,
$options: 'i',
}
}
)
So this will return the data. Now, for some reason, I have to search the data only where isDelete: false.
Right now it returns the data whether isDelete is true or false, because I have not specified the condition.
Can anyone help me get the data through that condition? I want this as a MongoDB query only.
Any help is really appreciated.
The $elemMatch operator matches documents that contain an array field with at least one element that matches all the specified query criteria:
db.collection.find({
content: {
$elemMatch: {
isDelete: false,
str: {
$regex: `.*\\b${searchTerm}\\b.*`,
$options: "i"
}
}
}
},
{
"content.$": 1
})
Working Playground: https://mongoplayground.net/p/VkdWMnYtGA3
You can add another condition there as below:
db.test2.find({
$and: [
{
"content.str": {
$regex: "hey",
$options: "i",
}
},
{
"content.isDelete": false
}
]
},
{
'content.$':1 //Projection - to get only matching array element
})
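The two answers are not quite equivalent: $elemMatch requires a single array element to satisfy both conditions, while separate dot-notation conditions can be satisfied by different elements. A small in-memory sketch of the difference:

```javascript
// One document, as in the question: "world" is deleted, "Hey" is not.
const doc = { content: [{ str: "Hey", isDelete: false }, { str: "world", isDelete: true }] };

const matches = (term) => {
  const re = new RegExp(`\\b${term}\\b`, "i");
  // $elemMatch semantics: both conditions on the SAME element
  const elemMatch = doc.content.some(e => e.isDelete === false && re.test(e.str));
  // $and with dot notation: conditions may be met by DIFFERENT elements
  const dotNotation = doc.content.some(e => re.test(e.str)) &&
                      doc.content.some(e => e.isDelete === false);
  return { elemMatch, dotNotation };
};
```

Searching for "world" shows the difference: the dot-notation form matches the document (the string is on one element, isDelete: false on another), while $elemMatch correctly rejects it.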

mongodb aggregate - match $nin array regex values

Must work in mongo version 3.4.
As part of aggregating relevant tags, I would like to return tags whose script_url is not contained in the whiteList array.
The thing is, I want to compare script_url against the array values treated as regexes.
I have this projection:
{
"script_url" : "www.analytics.com/path/file-7.js",
"whiteList" : [
null,
"www.analytics.com/path/*",
"www.analytics.com/path/.*",
"www.analytics.com/path/file-6.js",
"www.maps.com/*",
"www.maps.com/.*"
]
}
This $match compares script_url to exact whiteList values, so the document given above passes when it shouldn't, since it has www.analytics.com/path/.* in whiteList:
{
"$match": {
"script_url": {"$nin": ["$whiteList"]}
}
}
How do i match script_url with regex values of whiteList?
Update:
I was able to reach this stage in my aggregation:
{
"script_url" : "www.asaf-test.com/path/file-1.js",
"whiteList" : [
"http://sd.bla.com/bla/878/676.js",
"www.asaf-test.com/path/*"
],
"whiteListRegex" : [
"/http:\/\/sd\.bla\.com\/bla\/878\/676\.js/",
"/www\.asaf-test\.com\/path\/.*/"
]
}
But $match is not filtering out this script_url as it is supposed to, because it compares literal strings and does not cast the array values to regex values.
Is there a way to convert the array values to regex values in $map using v3.4?
I know you specifically mentioned v3.4, but I can't find a solution that works with v3.4.
So for others who have fewer restrictions and are able to use v4.2, this is one solution.
For version 4.2 or later only
The trick is to use $filter on whiteList with $regexMatch (available from v4.2); if the filtered array is empty, script_url doesn't match anything in whiteList:
db.collection.aggregate([
{
$match: {
$expr: {
$eq: [
{
$filter: {
input: "$whiteList",
cond: {
$regexMatch: { input: "$script_url", regex: "$$this" }
}
}
},
[]
]
}
}
}
])
Mongo Playground
It's also possible to use $reduce instead of $filter:
db.collection.aggregate([
{
$match: {
$expr: {
$not: {
$reduce: {
input: "$whiteList",
initialValue: false,
in: {
$or: [
{
$regexMatch: { input: "$script_url", regex: "$$this" }
},
"$$value"
]
}
}
}
}
}
}
])
Mongo Playground
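For readers stuck on v3.4, which has neither $regexMatch nor $expr comparisons like these, one pragmatic fallback is to do the regex comparison in application code after a coarser query. A Node.js sketch (docs stands in for whatever your query returns):

```javascript
// Keep only documents whose script_url matches none of their whiteList patterns.
const docs = [
  {
    script_url: "www.asaf-test.com/path/file-1.js",
    whiteList: ["http://sd.bla.com/bla/878/676.js", "www.asaf-test.com/path/.*"]
  },
  { script_url: "www.evil.com/x.js", whiteList: ["www.maps.com/.*"] }
];

const notWhitelisted = docs.filter(
  d => !d.whiteList.some(pattern => new RegExp(pattern).test(d.script_url))
);
```

This trades database-side filtering for simplicity; it is only reasonable when the candidate set is small enough to pull into the application.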

How to exclude substring in Elasticsearch regexp

I'm trying to write an Elasticsearch regexp that excludes elements that have a key containing a substring, let's say in the title of books.
The Elasticsearch docs suggest that a substring can be excluded with the following snippet:
@&~(foo.+) # anything except string beginning with "foo"
However, in my case, I've tried to create such a filter and failed.
{
query: {
constant_score: {
filter: {
bool: {
filter: query_filters,
},
},
},
},
size: 1_000,
}
def query_filters
[
{ regexp: { title: "#&~(red)" } },
# goal: exclude titles that start with "Red"
]
end
I've used other regexp in the same query filter that have worked, so I don't think there's a bug in the way the regexp is being passed to ES.
Any ideas? Thanks in advance!
Update:
I found a workaround: I can add a must_not clause to the filter.
{
query: {
constant_score: {
filter: {
bool: {
filter: query_filters,
must_not: must_not_filters,
},
},
},
},
size: 1_000,
}
def must_not_filters
[ { regexp: { title: "red.*" } } ]
end
Still curious if there's another idea for the original regex, though.
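One alternative worth trying (an assumption based on Lucene's regexp syntax, whose optional operators Elasticsearch enables by default): the complement operator ~ lets a single regexp express the exclusion directly, without a must_not clause. A sketch of building such a query body:

```javascript
// Build a regexp query matching any title term EXCEPT ones starting with the prefix.
// Lucene's "~(prefix.*)" is the complement of "prefix.*".
function titleExcluding(prefix) {
  return { regexp: { title: `~(${prefix}.*)` } };
}

const query = titleExcluding("red");
```

Note that this still operates on analyzed terms, so whether it behaves as expected depends on how the title field is tokenized.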

search elements in array using regex, kibana

I am searching for records which contain the array field payload.params.
I would like to display all the fields which contain the string aabb.
Example: payload.params = [3raabb, 44aabb66, grgeg]
Display: 3raabb, 44aabb66
How do I use a regex on arrays?
{
"query": {
"regexp": {
"payload.params": "aabb"
}
}
}
returns no results.
See the Elasticsearch regex documentation:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, use:
{
"query": {
"regexp": {
"payload.params": ".*aabb.*"
}
}
}
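To see why the unanchored-looking "aabb" query found nothing, the implicit anchoring can be emulated in JavaScript by wrapping the pattern in ^...$ (a sketch):

```javascript
// Lucene patterns must match the ENTIRE token, as if wrapped in ^ and $.
const params = ["3raabb", "44aabb66", "grgeg"];
const anchored = (pattern) => new RegExp(`^(?:${pattern})$`);

const bare = params.filter(p => anchored("aabb").test(p));        // no token IS "aabb"
const wrapped = params.filter(p => anchored(".*aabb.*").test(p)); // substring match works
```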

ElasticSearch and Regex queries

I am trying to query for documents that have dates within the body of the "content" field.
curl -XGET 'http://localhost:9200/index/_search' -d '{
"query": {
"regexp": {
"content": "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}'
Getting closer maybe?
curl -XGET 'http://localhost:9200/index/_search' -d '{
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"regexp":{
"content" : "^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((19|20)\\d\\d)$"
}
}
}
}'
My regex seems to have been off. This regex has been validated on regex101.com. The following query still returns nothing from the 175k documents I have.
curl -XPOST 'http://localhost:9200/index/_search?pretty=true' -d '{
"query": {
"regexp":{
"content" : "/[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{4}|[0-9]{2}/[0-9]{2}/[0-9]{4}|[0-9]{4}/[0-9]{2}/[0-9]{2}/g"
}
}
}'
I am starting to think that my index might not be set up for such a query. What type of field do you have to use to be able to use regular expressions?
mappings: {
doc: {
properties: {
content: { type: string },
title: { type: string },
host: { type: string },
cache: { type: string },
segment: { type: string },
query: {
properties: {
match_all: { type: object }
}
},
digest: { type: string },
boost: { type: string },
tstamp: { format: dateOptionalTime, type: date },
url: { type: string },
fields: { type: string },
anchor: { type: string }
}
}
}
I want to find any record that has a date and graph the volume of documents by that date. Step 1 is to get this query working; step 2 will be to pull the dates out and group documents by them accordingly. Can someone suggest a way to get the first part working, as I know the second part will be really tricky.
Thanks!
You should read Elasticsearch's Regexp Query documentation carefully, you are making some incorrect assumptions about how the regexp query works.
Probably the most important thing to understand here is what the string you are trying to match is. You are trying to match terms, not the entire string. If this is being indexed with StandardAnalyzer, as I would suspect, your dates will be separated into multiple terms:
"01/01/1901" becomes tokens "01", "01" and "1901"
"01 01 1901" becomes tokens "01", "01" and "1901"
"01-01-1901" becomes tokens "01", "01" and "1901"
"01.01.1901" actually will be a single token: "01.01.1901" (Due to decimal handling, see UAX #29)
You can only match a single, whole token with a regexp query.
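A rough sketch of that token split (illustrative only; the real StandardAnalyzer follows the UAX #29 word-boundary rules, so this simplified splitter is just an approximation):

```javascript
// Split on anything that is not alphanumeric or a dot, roughly mimicking how
// the dates above tokenize: "01.01.1901" survives as one token, the rest split.
const tokenize = (s) => s.split(/[^0-9A-Za-z.]+/).filter(Boolean);
```

For example, tokenize("01-01-1901") yields ["01", "01", "1901"], while tokenize("01.01.1901") yields the single token ["01.01.1901"]; a regexp query can then only ever see one of those tokens at a time.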
Elasticsearch (and lucene) don't support full Perl-compatible regex syntax.
In your first couple of examples, you are using anchors, ^ and $. These are not supported. Your regex must match the entire token to get a match anyway, so anchors are not needed.
Shorthand character classes like \d (or \\d) are also not supported. Instead of \\d\\d, use [0-9]{2}.
In your last attempt, you are using /{regex}/g, which is also not supported. Since your regex needs to match the whole string, the global flag wouldn't even make sense in context. Unless you are using a query parser which uses them to denote a regex, your regex should not be wrapped in slashes.
(By the way: How did this one validate on regex101? You have a bunch of unescaped /s. It complains at me when I try it.)
To support this sort of query on such an analyzed field, you'll probably want to look to span queries, and particularly Span Multiterm and Span Near. Perhaps something like:
{
"span_near" : {
"clauses" : [
{ "span_multi" : {
"match": {
"regexp": {"content": "0[1-9]|[12][0-9]|3[01]"}
}
}},
{ "span_multi" : {
"match": {
"regexp": {"content": "0[1-9]|1[012]"}
}
}},
{ "span_multi" : {
"match": {
"regexp": {"content": "(19|20)[0-9]{2}"}
}
}}
],
"slop" : 0,
"in_order" : true
}
}
For newer Elasticsearch versions (tested on 8.5), we can use the .keyword sub-field, which matches against the whole string rather than individual tokens:
{
"size": 10,
"_source": [
"load",
"unload"
],
"query": {
"bool": {
"should": [
{
"regexp": {
"load.keyword": {
"value": ".*Search Term.*",
"flags": "ALL"
}
}
}
]
}
}
}