Using grep to replace every instance of a pattern after the first in bbedit - regex

So I've got a really long txt file that follows this pattern:
},
"303" :
{
"id" : "4k4hk2l",
"color" : "red",
"moustache" : "no"
},
"303" :
{
"id" : "4k52k2l",
"color" : "red",
"moustache" : "yes"
},
"303" :
{
"id" : "fask2l",
"color" : "green",
"moustache" : "yes"
},
"304" :
{
"id" : "4k4hf4f4",
"color" : "red",
"moustache" : "yes"
},
"304" :
{
"id" : "tthj2l",
"color" : "red",
"moustache" : "yes"
},
"304" :
{
"id" : "hjsk2l",
"color" : "green",
"moustache" : "no"
},
"305" :
{
"id" : "h6shgfbs",
"color" : "red",
"moustache" : "no"
},
"305" :
{
"id" : "fdh33hk7",
"color" : "cyan",
"moustache" : "yes"
},
and I'm trying to format it as a proper JSON object with the following structure:
"303" :
{ "list" : [
{
"id" : "4k4hk2l",
"color" : "red",
"moustache" : "no"
},
{
"id" : "4k52k2l",
"color" : "red",
"moustache" : "yes"
},
{
"id" : "fask2l",
"color" : "green",
"moustache" : "yes"
}
]
}
"304" :
{ "list" : [
etc...
In other words, I look for all lines matching ^"\d\d\d" : and keep the first occurrence of each unique key, but remove all the subsequent ones (for example, keep the first instance of "303" : but completely remove the rest of them, then keep the first instance of "304" : but completely remove the rest of them, etc.).
I've been attempting to do this within the bbedit application, which has a grep option for search/replace. My pattern matching fu is too weak to accomplish this. Any ideas? Or a better way to accomplish this task?

You can't capture a repeated capturing group: the capture will always contain only the last match of the group. So there's no way to do this with a single search/replace, short of crudely repeating the group in the pattern. And even that is a solution only if you know the maximum number of elements in the resulting groups.
Say we have a string that is a simplified version of your data:
1a;1b;1c;1d;1e;2d;2e;2f;2g;3x;3y;3z;
We see that the maximum number of elements per key is 5, so the pattern has to cover five elements (the first one plus four optional repeats of the capturing group):
/([0-9])([a-z]*);?(\1([a-z]);)?(\1([a-z]);)?(\1([a-z]);)?(\1([a-z]);)?/
And replace that with
\1:\2\4\6\8\10;
Then we get the desired result:
1:abcde;2:defg;3:xyz;
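To check the pattern and replacement above outside the editor, here is a minimal sketch in Python. Using Python's re module is an assumption made for illustration (BBEdit's grep dialect may differ in details), and the names data, pattern and result are hypothetical. The replacement uses \g<N> references, which are Python's unambiguous spelling of \1, \2, ... \10.

```python
import re

# Simplified data from the example above.
data = "1a;1b;1c;1d;1e;2d;2e;2f;2g;3x;3y;3z;"

# First element, then up to four optional repeats that share the same digit (\1).
pattern = re.compile(
    r"([0-9])([a-z]*);?"
    r"(\1([a-z]);)?(\1([a-z]);)?(\1([a-z]);)?(\1([a-z]);)?"
)

# Groups 2, 4, 6, 8 and 10 hold the letters; unmatched optional groups
# are substituted as empty strings (Python 3.5 and later).
result = pattern.sub(r"\g<1>:\g<2>\g<4>\g<6>\g<8>\g<10>;", data)
print(result)
```

If a key ever had more than five elements, the extra ones would start a new match, which is why the maximum count has to be known in advance.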
You can apply this technique to your data if you're in a great hurry (and after 2 days I suppose you aren't), but using a scripting language will be a better and cleaner solution.
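As a rough sketch of the scripting route for the original data, assuming every entry is a flat object (no nested braces, no braces inside strings) preceded by its quoted numeric key, something like the following Python would collect the entries under the first occurrence of each key. The shortened sample text and the names raw, entry and grouped are made up for illustration.

```python
import json
import re

# Shortened sample in the same shape as the question's file (assumed layout).
raw = '''},
"303" :
{
"id" : "4k4hk2l",
"color" : "red",
"moustache" : "no"
},
"303" :
{
"id" : "4k52k2l",
"color" : "red",
"moustache" : "yes"
},
"304" :
{
"id" : "4k4hf4f4",
"color" : "red",
"moustache" : "yes"
},
'''

# A quoted numeric key, a colon, then one flat {...} object.
entry = re.compile(r'"(\d+)"\s*:\s*(\{[^{}]*\})')

grouped = {}
for m in entry.finditer(raw):
    key = m.group(1)
    obj = json.loads(m.group(2))  # each flat object is valid JSON on its own
    # Create {"list": []} the first time a key is seen, then append to it.
    grouped.setdefault(key, {"list": []})["list"].append(obj)

print(json.dumps(grouped, indent=2))
```

Because the result is built as a dictionary and serialized with json.dumps, the commas and closing brackets come out right without any further search/replace passes.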
For my simplified example you have to iterate through the matches of /([0-9])[a-z];?(\1[a-z];?)*/. Those will be:
1a;1b;1c;1d;1e;
2d;2e;2f;2g;
3x;3y;3z;
And there you can capture all the values and bind them to the respective key, of which there is only one per iteration.
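In Python, that iteration could be sketched as follows. The choice of language and the names data, run and grouped are assumptions for illustration; the repeated group is written as non-capturing here because only the key needs to be captured by the outer pattern.

```python
import re

data = "1a;1b;1c;1d;1e;2d;2e;2f;2g;3x;3y;3z;"

# One match per run of elements that share the same leading digit.
run = re.compile(r"([0-9])[a-z];?(?:\1[a-z];?)*")

grouped = {}
for m in run.finditer(data):
    key = m.group(1)
    # Second pass over the matched run: pull out every value letter.
    grouped[key] = re.findall(r"[0-9]([a-z])", m.group(0))

print(grouped)
```

The second, inner findall is what gets around the "only the last capture survives" limitation: each run is matched once as a whole and then scanned again for its individual values.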

Related

Case Insensitive Search With Regex

I'm trying to implement a case-insensitive search with regex.
Example: /^sanford/i (searching for anything starting with "sanford", disregarding case sensitivity).
For case-insensitive queries, the documentation recommends creating indexes with a custom collation. This works fine as long as no regex is involved.
The problem: when searching with a regex (in this case "starts with"), the case-insensitive search does NOT take the index into account.
This is stated in the documentation multiple times and is also reproducible with an explain command.
To sum it up: it works, but without effectively using the index. I'd be glad to get any hints; I can't get rid of the feeling that I'm missing something fundamentally important here.
Inserting the string with toLowerCase and then searching only with lower cased strings is not an option.
I can't use a text index because there can only be one per collection.
For an example from the documentation, see here (the green info box at the bottom).
@D.SM: The index is used, but it scans all documents.
https://docs.atlas.mongodb.com/schema-suggestions/case-insensitive-regex/
Example document:
{
"name": [{
"family": "Test",
"given": "Name",
}],
}
Index with collation:
{ "name" : "name_family", "key" : { "name.family" : 1 }, "host" : "noneofyourbusiness.com", "accesses" : { "ops" : NumberLong(114), "since" : ISODate("2020-07-30T20:25:59.079Z") }, "shard" : "shA", "spec" : { "v" : 2, "key" : { "name.family" : 1 }, "name" : "name_family", "ns" : "noneofyourbusiness.somethingwithaname", "collation" : { "locale" : "de", "caseLevel" : false, "caseFirst" : "off", "strength" : 1, "numericOrdering" : false, "alternate" : "non-ignorable", "maxVariable" : "punct", "normalization" : false, "backwards" : false, "version" : "57.1" } } }

Regexp query seems to be ignored in elasticsearch

I have the following query:
{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "dog cat",
"analyzer" : "standard",
"default_operator" : "AND",
"fields" : ["title", "content"]
}
},
{
"range" : {
"dateCreate" : {
"gte" : "2018-07-01T00:00:00+0200",
"lte" : "2018-07-31T23:59:59+0200"
}
}
},
{
"regexp" : {
"articleIds" : {
"value" : ".*?(2561|30|540).*?",
"boost" : 1
}
}
}
]
}
}
}
The fields title, content and articleIds are of type text; dateCreate is of type date. The articleIds field contains some IDs (comma-separated).
OK, what happens now? I execute the query and get two results: both documents contain the words "dog" and "cat" in the title or in the content. So far it's correct.
But the second result has the number 3507 in the articleIds field, which doesn't match my query. It seems that the regexp is ignored because title and content already match. What is wrong here?
And here's the document that should not match my query but does:
{
"_index" : "example",
"_type" : "doc",
"_id" : "3007780",
"_score" : 21.223656,
"_source" : {
"dateCreate" : "2018-07-13T16:54:00+0200",
"title" : "",
"content" : "Its raining cats and dogs.",
"articleIds" : "3507"
}
}
And what I'm expecting is that this document should not be in the results because it contains 3507 which is not part of my query...

Regular expression to split two events in JSON file

I have one JSON log file and I am looking for a regex to split the events within it. I have written one regex but it is reading all events as one group.
Log file:
[ {
"name" : "CounterpartyNotional",
"type" : "RiskBreakdown",
"duration" : 20848,
"count" : 1,
"average" : 20848.0
}, {
"name" : "CounterpartyPreSettlement",
"type" : "RiskBreakdown",
"duration" : 15370,
"count" : 1,
"average" : 15370.0
} ]
[ {
"name" : "TraderCurrency",
"type" : "Formula",
"duration" : 344,
"count" : 1,
"average" : 344.0
} ]
PS: I will be using this regex for a Splunk tool.
Your regex does not read all events together. The line above the regex (on the linked page) says "2 matches", which means your regex has split the log; you just need to know how to iterate through the matches (i.e. the events) in the language that runs the regex matching.
For example, in Python 3 (I have simplified the regex, if you don't mind):
import re
log = """[ {
"name" : "CounterpartyNotional",
"type" : "RiskBreakdown",
"duration" : 20848,
"count" : 1,
"average" : 20848.0
}, {
"name" : "CounterpartyPreSettlement",
"type" : "RiskBreakdown",
"duration" : 15370,
"count" : 1,
"average" : 15370.0
} ]
[ {
"name" : "TraderCurrency",
"type" : "Formula",
"duration" : 344,
"count" : 1,
"average" : 344.0
} ]"""
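# Match each {...} object that mentions "RiskBreakdown"; [^}]* keeps a match inside one object.
event = re.compile(r'\{[^}]*?"RiskBreakdown"[^}]*\}')
matches = event.findall(log)  # one string per matching event
print(matches)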
And yes, it is true, this is not valid JSON, but on the linked page it is OK, so maybe it's a typo.
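To make the "iterate through the matches" step concrete, here is a small sketch that walks over every event with finditer and parses each one into a dictionary. It assumes each event is a flat object (no nested braces and no "}" inside a string value); the broader pattern without the "RiskBreakdown" filter and the names events and names are illustrative choices, not part of the original regex.

```python
import json
import re

log = """[ {
"name" : "CounterpartyNotional",
"type" : "RiskBreakdown",
"duration" : 20848,
"count" : 1,
"average" : 20848.0
}, {
"name" : "CounterpartyPreSettlement",
"type" : "RiskBreakdown",
"duration" : 15370,
"count" : 1,
"average" : 15370.0
} ]
[ {
"name" : "TraderCurrency",
"type" : "Formula",
"duration" : 344,
"count" : 1,
"average" : 344.0
} ]"""

# Every flat {...} object is treated as one event.
event = re.compile(r"\{[^}]*\}")

# Each match is valid JSON by itself, even though the whole log is not.
events = [json.loads(m.group(0)) for m in event.finditer(log)]
names = [e["name"] for e in events]

for e in events:
    print(e["name"], e["type"], e["duration"])
```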

Elasticsearch regexp query finds no results

I have a problem building the correct query. I have an index with a field "ids" with the following mapping:
"ids" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
A sample content could look like this:
10,20,30
It's a list of ids. Now I want to query for multiple possible ids as a disjunction (OR), so I decided to use a regexp:
{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "Test"
}
},
{
"regexp" : {
"ids" : {
"value" : "10031|20|10038",
"boost" : 1
}
}
}
]
}
},
"size" : 10,
"from" : 0
}
The query is executed successfully but with no results. I expected to find 3 results.
If you want to get 10031 or 20 or 10038, you need to add parentheses.
Change "10031|20|10038" => "(10031|20|10038)"

combining regex and embedded objects in mongodb queries

I am trying to combine regex and embedded object queries and failing miserably. I am either hitting a limitation of MongoDB or just getting something slightly wrong; maybe someone out there has encountered this. The documentation certainly doesn't cover this case.
data being queried:
{
"_id" : ObjectId("4f94fe633004c1ef4d892314"),
"productname" : "lightbulb",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
},
{
"country" : "USA",
"storeCode" : "xzy-6784"
},
{
"country" : "USA",
"storeCode" : "abc-3454"
},
{
"country" : "CANADA",
"storeCode" : "abc-6845"
}
]
}
Assume the collection contains only one record.
This query returns 1:
db.testCol.find({"availability":{"country" : "USA","storeCode":"xzy-6784"}}).count();
This query returns 1:
db.testCol.find({"availability.storeCode":/.*/}).count();
But, this query returns 0:
db.testCol.find({"availability":{"country" : "USA","storeCode":/.*/}}).count();
Does anyone understand why? Is this a bug?
thanks
You are referencing the embedded storeCode incorrectly: you are referencing it as an embedded object when in fact what you have is an array of objects. Compare these results:
db.testCol.find({"availability.0.storeCode":/x/});
db.testCol.find({"availability.0.storeCode":/a/});
Using your sample doc above, the first one will not return anything, because the first storeCode ("abc-1234") does not have an x in it; the second will return the document. That's fine for the case where you are looking at a single element of the array and pass in its position. In order to search all of the objects in the array, you want $elemMatch.
As an example, I added this second example doc:
{
"_id" : ObjectId("4f94fe633004c1ef4d892315"),
"productname" : "hammer",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
},
]
}
Now, have a look at the results of these queries:
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/a/}}}).count();
2
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/x/}}}).count();
1
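To illustrate the matching rule behind those counts, here is a plain-Python imitation: a document matches when at least one element of the array satisfies every condition. This is only a sketch of the semantics described above, not MongoDB's implementation; the helper elem_match and the in-memory docs list are hypothetical stand-ins for the collection.

```python
import re

# In-memory stand-ins for the two sample documents (ObjectIds omitted).
docs = [
    {"productname": "lightbulb", "availability": [
        {"country": "USA", "storeCode": "abc-1234"},
        {"country": "USA", "storeCode": "xzy-6784"},
        {"country": "USA", "storeCode": "abc-3454"},
        {"country": "CANADA", "storeCode": "abc-6845"},
    ]},
    {"productname": "hammer", "availability": [
        {"country": "USA", "storeCode": "abc-1234"},
    ]},
]


def elem_match(doc, field, conditions):
    """Return True if some element of doc[field] satisfies all conditions.

    A compiled regex is searched within the value (like /x/ in the shell);
    any other condition is compared for equality.
    """
    def satisfies(elem, key, cond):
        value = elem.get(key)
        if isinstance(cond, re.Pattern):
            return isinstance(value, str) and cond.search(value) is not None
        return value == cond

    return any(
        all(satisfies(elem, key, cond) for key, cond in conditions.items())
        for elem in doc.get(field, [])
    )


count_a = sum(elem_match(d, "availability", {"storeCode": re.compile("a")}) for d in docs)
count_x = sum(elem_match(d, "availability", {"storeCode": re.compile("x")}) for d in docs)
# Both conditions must hold on the same array element.
count_usa_x = sum(
    elem_match(d, "availability", {"country": "USA", "storeCode": re.compile("x")})
    for d in docs
)
print(count_a, count_x, count_usa_x)
```

The last call shows the combination the question was after: a fixed country plus a regex on storeCode, both evaluated against the same element of the array.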