Using sed to print only some strings from each line - regex

I have a file with the following data. I just want the ownerId numbers and the profileId values, separated by :.
My file:
ObjectId("57a046a06f858a9c73b3468a"), "ownerId" : "923003345778", "profileId" : "FreeBundles,LBCNorthParentOffer", "instanceId" : null, "queuedFor" : "unassigned", "state" : "active", "createDateTime" : 1470121632, "startDateTime" : 1470121632, "expireDateTime" : 1485673632, "removeDateTime" : 1487747232, "extensionDateTime" : null, "cancelled" : false, "mode" : "onceOff", "nextMode" : "none", "profileData" : { "serviceProfileId" : "ecs19", "counter" : 1 } }
ObjectId("57a046a06f858a9c73b34688"), "cancelled" : false, "createDateTime" : 1470121632, "expireDateTime" : 1557514799, "extensionDateTime" : null, "instanceId" : null, "mode" : "onceOff", "nextMode" : "none", "ownerId" : "923003345778", "profileData" : { "serviceProfileId" : "ecs19", "counter" : 1 }, "profileId" : "Prov3G,HLRProv", "queuedFor" : "unassigned", "removeDateTime" : 1557514799, "startDateTime" : 1470121632, "state" : "active" }
ObjectId("56d48bd38a8b93baa708fcfa"), "ownerId" : "923003309452", "profileId" : "DiscountOnUsage,Segment04", "instanceId" : null, "queuedFor" : "unassigned", "state" : "active", "createDateTime" : 1456770003, "startDateTime" : 1456770003, "expireDateTime" : null, "removeDateTime" : null, "extensionDateTime" : null, "cancelled" : false, "mode" : "onceOff", "nextMode" : "none", "profileData" : { "serviceProfileId" : "ecs19", "counter" : 1 } }
ObjectId("560ed95f6ca6e0703cf26fcc"), "cancelled" : false, "createDateTime" : 1443813727, "expireDateTime" : 1544381999, "extensionDateTime" : null, "instanceId" : null, "mode" : "onceOff", "nextMode" : "none", "ownerId" : "923003309452", "profileData" : { "serviceProfileId" : "ecs19", "counter" : 1 }, "profileId" : "Prov3G,HLRProv", "queuedFor" : "unassigned", "removeDateTime" : 1544381999, "startDateTime" : 1443813727, "state" : "active" }
Output:
923003345778 : FreeBundles,LBCNorthParentOffer
923003345778 : Prov3G,HLRProv
923003309452 : DiscountOnUsage,Segment04
923003309452 : Prov3G,HLRProv
Please also explain the answer in detail, if anyone knows.

$ sed 's/.*"ownerId" *: *"\([^"]*\).*"profileId" *: *"\([^"]*\).*/\1 : \2/' file
923003345778 : FreeBundles,LBCNorthParentOffer
923003345778 : Prov3G,HLRProv
923003309452 : DiscountOnUsage,Segment04
923003309452 : Prov3G,HLRProv
I really don't think any explanation is needed as it's very straight forward but let me know if you have any questions.

This is a rather awkward situation you've managed to put yourself into.
As a rule, you do not want to handle structured data with plain-text tools like sed. Any solution you come up with will be brittle in the face of formatting changes (such as spaces or newlines between JSON fields), and certain corner cases (such as JSON strings with quotation marks in them) are awkward to handle with it. If you have JSON, you want to use a JSON tool to handle it.
However, you don't exactly have JSON there. This is a textual representation of BSON (likely from MongoDB) that has already had some parts chopped off.
What you really want to do
A sane way to solve this problem is to make MongoDB give you JSON and let something like jq do the formatting. Once you have a proper JSON file, this will be as simple as
jq -r '"\(.ownerId) : \(.profileId)"' file.json
mongoexport may be your friend here, or putting JSON.stringify() around your query in the MongoDB shell [1]; it depends on how you got this data in the first place. This approach will require you to save the unchopped data, but anyway I suspect that whatever made you chop the BSON into pieces should be replaced with something similar to improve reliability.
[1] If you got the data from the MongoDB shell, you may want to consider doing the formatting there, though.
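If jq is not at hand, the same idea can be sketched in Python. This is only an illustration under assumptions: it supposes the export was saved as one JSON object per line (as mongoexport typically writes it), the sample records are trimmed-down stand-ins for the data in the question, and the function name owner_profile_pairs is made up for this example.

```python
import json

def owner_profile_pairs(lines):
    """Yield 'ownerId : profileId' for each JSON object, one object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # a real parser, so field order does not matter
        yield f'{record["ownerId"]} : {record["profileId"]}'

# Hypothetical sample: shortened versions of the records in the question.
sample = [
    '{"ownerId": "923003345778", "profileId": "FreeBundles,LBCNorthParentOffer", "state": "active"}',
    '{"state": "active", "profileId": "Prov3G,HLRProv", "ownerId": "923003345778"}',
]

for pair in owner_profile_pairs(sample):
    print(pair)
```

Unlike the sed hack, malformed input raises an error here instead of being silently dropped.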
How to hack yourself deeper into this mess with sed
However, since you don't currently have proper JSON, you may want to try to hack yourself out of this mess with sed. This is a terrible idea, and I cannot stress enough that you never ever want to do this in a production environment. If you do, you'll be in a deeper mess than before, and that sort of vicious cycle is not a happy place to be.
So, what I'm about to show you is the sort of thing that you do as a one-off in a hurry and are never going to use again because you promise yourself to do it properly next time. You want to check the results carefully. Here goes:
sed 'h;/^.*"profileId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/!d;s//\1/;x;/^.*"ownerId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/!d;s//\1/;G;s/\n/ : /' file.bsonish
This makes the following assumptions about the input data:
One full object per line. Newlines in the wrong place will break this.
No " in either the ownerId or the profileId field.
Furthermore, it will not recognize broken data, and recognizing broken data is always a nice feature to have. On the upside, it does not require the ownerId and profileId fields to appear in any particular order.
It works as follows:
# Save a copy of the input data; we'll isolate the fields separately.
h
# See if there's a profileId field. If not, the line is silently dropped.
/^.*"profileId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/!d
# Isolate that profileId field. // in this context means: reuse the last
# regex (the big one)
s//\1/
# Now swap in the saved input data. We'll get ownerId next.
x
# Isolate ownerId as before. If there is no ownerId field, drop line silently.
/^.*"ownerId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/!d
s//\1/
# append profileId field in hold buffer to what we have
G
# Replace the newline between the two with a colon and some spaces.
s/\n/ : /
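For readers who find the hold-space juggling hard to follow, the same logic can be sketched in Python: look for each field independently, drop the line if either is missing, and join the two values. This is a rough equivalent and not a translation of the sed script; the function name extract and the shortened sample line are assumptions made for the example, and the same caveats about brittleness apply.

```python
import re

# The same two patterns as in the sed script, written for Python's re module.
OWNER = re.compile(r'"ownerId"\s*:\s*"([^"]*)"')
PROFILE = re.compile(r'"profileId"\s*:\s*"([^"]*)"')

def extract(line):
    """Return 'ownerId : profileId', or None if either field is missing."""
    owner = OWNER.search(line)
    profile = PROFILE.search(line)
    if owner is None or profile is None:
        return None  # mirrors the silent drop done by !d in the sed script
    return f"{owner.group(1)} : {profile.group(1)}"

# Shortened version of the second record in the question.
line = ('ObjectId("57a046a06f858a9c73b34688"), "cancelled" : false, '
        '"ownerId" : "923003345778", '
        '"profileData" : { "serviceProfileId" : "ecs19", "counter" : 1 }, '
        '"profileId" : "Prov3G,HLRProv", "state" : "active" }')

print(extract(line))
```

Because each field is searched for separately, the order of ownerId and profileId on the line does not matter, just as with the sed version.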

Related

How to extract a double quote field in regex that start with a specific string

I'd like to extract the name content (David) and the url content (www.stackoverflow.com) from the following json file.
I have several questions:
How to extract a string that starts with " and ends with " ?
How to force the regular expression to start after an expression that is not itself part of the match?
{
"id" : "1234",
"name" : "David",
"request" : {
"url" : "www.stackoverflow.com",
"method" : "POST",
"bodyPatterns" : [ {
"matchesXPath" : "example"
}, {
"matchesXPath" : "example/123"
}, {
"matchesXPath" : {
"expression" : "example/123/123/text()",
"equalTo" : "bbbb"
}
} ]
}
}
Note: a proper parser is the recommended way to do this in the long term. For a simple, occasional situation, a regex might do.
This regex does the job:
"name"\s*:\s*"(?'name'[^"]+)".*"url"\s*:\s*"(?'url'[^"]+)"
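The groups name and url contain your data. Because the two fields sit on different lines, the regex engine must be set to let . match newlines (single-line or "dotall" mode).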
I do not recommend solving this with a regular expression. Such ad-hoc parsing solutions tend to be error-prone, overly complicated, hard to extend and turn on you when you least expect it.
Instead, I recommend using a proper json parser, depending on the language you use. For plain shell, jq is a good choice. With that, specifying the path to the property becomes trivial:
jq -r '.name, .request.url' file.json
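If you are working in a programming language and not in the shell, the same lookup is a plain dictionary access once the text is parsed. Here is a minimal Python sketch; the document string is a shortened copy of the JSON from the question, and the variable names are only illustrative.

```python
import json

# Shortened copy of the JSON from the question.
document = """
{
  "id": "1234",
  "name": "David",
  "request": {
    "url": "www.stackoverflow.com",
    "method": "POST"
  }
}
"""

data = json.loads(document)

# Nested values are reached by following the path key by key.
name = data["name"]
url = data["request"]["url"]

print(name, url)
```

No quoting rules or line breaks need special handling here, which is the main argument for a parser over a regex.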

How to write regex in elasticsearch "URI Search" style

I want to query Elasticsearch using the "URI Search" format (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html#search-uri-request) with a regex, but I cannot find out how to deal with regex special characters such as \s and the plain space.
Let's say I have the term [ apple computer ] stored in my index (keyword analyzer used).
The term will be found with:
curl -XGET "http://es:9200/myindex/mytype/_search?q=name:/.*comp.*/&pretty"
curl -XGET "http://es:9200/myindex/mytype/_search?q=name:/.*appl.*/&pretty"
curl -XGET "http://es:9200/myindex/mytype/_search?q=name:/.*pple.*/&pretty"
But what syntax should I use (in curl, or with another tool) to query using these regexes:
/.*pple\s+compu.*/
/.*le +compu.*/
I think I've found the answer to my question:
First, with my index settings as shown below, I need to use name.keyword so that the regex is matched against the whole, unanalyzed value:
{
"myindex" : {
"aliases" : { },
"mappings" : {
"mytype" : {
"properties" : {
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
...
Then, when querying in the "URI Search" format, I have to apply the usual URL encoding:
a space should be written as +
+ should be written as %2b
any other special character in a URL should be written as its percent-encoded (%XX) equivalent
So it turns out my regular expression /.*le +compu.*/ must be queried like this:
curl -XGET "http://es:9200/myindex/mytype/_search?q=name.keyword:/.*le+%2bcompu.*/&pretty"
Finally, I can't find any mention in the regexp docs or in Lucene of \s as a shorthand for whitespace, but that's not a big deal, as it can be rewritten using regexp sub-patterns.
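The encoding rules above do not have to be applied by hand. As a small sketch, Python's standard urllib.parse.quote_plus performs the same conversion; the helper name encode_regex_query is made up for this example, and keeping / and * unescaped is a readability choice, not a requirement.

```python
from urllib.parse import quote_plus, unquote_plus

def encode_regex_query(field, regex):
    """Build the value of the q= parameter for a regex query on one field."""
    # A space becomes "+", a literal "+" becomes "%2B";
    # "/" and "*" are left as they are so the regex stays readable.
    return f"{field}:{quote_plus(regex, safe='/*')}"

query = encode_regex_query("name.keyword", "/.*le +compu.*/")
print(query)

# Decoding the encoded part gives back the original regex.
print(unquote_plus(query.split(":", 1)[1]))
```

The resulting string goes after q= in the URL; the host and index names are whatever your cluster uses.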

Finding documents through regex in Mongo

I have a document structure like this one.
> db.urls.find()
{
"_id" : ObjectId("53d79c7020ba271c80b78b6c"),
"url" : "http://www.newstoday.com.bd?option=details&news_id=2368296&date=2014-01-27///",
"priority" : 0.25,
"date" : ISODate("2014-07-29T13:06:58.745Z"),
"seen" : 1
}
To find some document using regex I used the following,
> db.urls.find({url: { $regex: 'http://www.newstoday.com.bd?option='} })
>
This returned nothing. I need some help with the proper regex to use here.
(?=.*?http:\/\/www\.newstoday\.com\.bd\?.*)(.*)
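This will return the document based on the url, if that is what you are looking for. Your original pattern finds nothing because the unescaped ? makes the preceding d optional instead of matching a literal question mark; here the ? and the dots are escaped.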
See Demo.
http://regex101.com/r/wE3dU7/1
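The effect of the unescaped ? can be reproduced outside MongoDB. The sketch below uses Python's re module, which is not the engine MongoDB uses, but which treats ? and . the same way in this pattern; the variable names are only illustrative.

```python
import re

url = "http://www.newstoday.com.bd?option=details&news_id=2368296&date=2014-01-27///"
prefix = "http://www.newstoday.com.bd?option="

# Used as a regex, "d?" means "an optional d", so the literal "?" in the
# URL is never matched and the search fails.
unescaped = re.search(prefix, url)

# Escaping the metacharacters turns the prefix into a literal pattern.
escaped_pattern = re.escape(prefix)
escaped = re.search(escaped_pattern, url)

print(unescaped)            # no match
print(escaped_pattern)      # the prefix with ? and . escaped
print(escaped is not None)  # match found
```

The escaped pattern, possibly anchored with a leading ^, is what would go into the $regex value of the query.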

sed - trying to replace first occurrence after a match

I am facing a situation that drives me nuts.
I am setting up an update server which uses a json file.
Don't ask why or how, it sucks and is my only possibility to achieve it.
I have been trying and researching for HOURS (many) because I went ballistic and wanted to crack this on my own. But I have to realize I got stuck and need help.
So sorry for this chunk but I think it is somewhat important to see...
The file is a one liner and repeating the following sequence with changing values (of course).
"plugin_name_foo_bar": {"buildDate": "bla", "dependencies": [{"name": "bla", "optional": true, "version": "1.00"}], "developers": [{"developerId": "bla", "email": "bla#gmail.com", "name": "Bla bla2nd"}], "excerpt": "some text {excerpt} !bla.png|thumbnail,border=1! ", "gav": "bla", "labels": ["report", "scm-related"], "name": "plugin_name_foo_bar", "previousTimestamp": "bla", "previousVersion": "1.0", "releaseTimestamp": "bla", "requiredCore": "1", "scm": "github.com", "sha1": "ynnBM2jWo25ZLDdP3ybBOnV/Pio=", "title": "bla", "url": "http://bla.org", "version": "1.0", "wiki": "https://bla.org"}, "Exclusion": {"buildDate": "bla", "dependencies": [],
and the next plugin block is glued straight afterwards.
What I now want to do is to search for "plugin_name_foo_bar": {" as this is the unique identifier for a new plugin description block.
I want to replace the first sha1 value occurring afterwards. That's where I keep failing. I always grab the first, last or any occurrence in the entire file and not the one in the block :(
"title" is the unique identifier after the sha1 value.
So I tried to make the .* less greedy, but it isn't working out.
My last attempt was heading towards:
sed -i 's/("name": "plugin_name_foo_bar.*sha1": ")([a-zA-Z0-9!##\$%^&*()\[\]]*)(", "title"\)/\1blablabla\2/1' default.json
to find the sha1 value of that plugin but still no joy. I hope someone knows - preferably a simpler approach - before I now continue with trial and error until I have to puke and freakout.
I am working with SED on Windows, so Unix approach might help me to figure out how to achieve this in batch but please make it as one-liner if possible. Scripts are a real pain to convert.
And I just need SED and no other solution with other tools like AWK. That is absolutely out of discussion.
Any help is appreciated :)
Cheers
Jan
Don't use regex (sed) to parse JSON; use a proper JSON parser instead, or JavaScript directly, as I do:
Using JavaScript and Node.js in a script:
File /tmp/file.json is :
{
"plugin_name_foo_bar" : {
"excerpt" : "some text {excerpt} !bla.png|thumbnail,border=1! ",
"dependencies" : [
{
"name" : "bla",
"version" : "1.00",
"optional" : true
}
],
"title" : "bla",
"previousTimestamp" : "bla",
"releaseTimestamp" : "bla",
"sha1" : "ynnBM2jWo25ZLDdP3ybBOnV/Pio=",
"labels" : [
"report",
"scm-related"
],
"buildDate" : "bla",
"version" : "1.0",
"previousVersion" : "1.0",
"name" : "plugin_name_foo_bar",
"scm" : "github.com",
"url" : "http://bla.org",
"gav" : "bla",
"developers" : [
{
"email" : "bla#gmail.com",
"developerId" : "bla",
"name" : "Bla bla2nd"
}
],
"wiki" : "https://bla.org",
"requiredCore" : "1"
},
"Exclusion" : {
"dependencies" : [],
"buildDate" : "bla"
}
}
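The script script.js:
// Load and parse the JSON file.
var js = require('/tmp/file.json')
// Replace the sha1 value of this one plugin only.
js.plugin_name_foo_bar.sha1 = "xxx"
// Print the result as JSON; a bare console.log(js) would abbreviate deeply nested values.
console.log(JSON.stringify(js, null, 2))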
Usage :
nodejs script.js
As sputnick points out parsing is a little beyond what sed's meant for. Still, sed's Turing-complete and bludgeoning it into doing what you want can satisfy that {sad,masoch}istic urge so many of us feel from time to time.
This one's even easy.
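sed '
# Turn every "sha1": key into a newline; the file is a single line, so no other newlines exist.
s/"sha1": /\n/g
# Starting at the name of the plugin, skip to the first newline marker and replace the value after it.
s/\("name": "plugin_name_foo_bar"[^\n]*\n"\)[^"]*/\1thenewsha/
# Turn the newline markers back into "sha1": keys.
s/\n/"sha1": /g
' default.json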
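For the Windows command line, with escaped quotes, editing in place and using extended regular expressions:
sed -i -r "s/(plugin_name_foo_bar.+?sha1\": \")[^\"]+\"/\1abcdefghijkl\"/" default.json
Be aware that sed has no lazy quantifiers, so .+? here is still greedy; if several plugin blocks follow, check that the right sha1 was replaced.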
sed -r "s/(plugin_name_foo_bar[^!]+sha1.: .)[^\"]+/\1abcdefghijkl/" file
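To see why greediness is the crux of this problem, here is a small Python sketch. It is not a sed solution: Python's re module supports the lazy quantifier .*?, which sed does not, so this only illustrates the difference in behaviour. The shortened text and the function name replace_sha1 are assumptions made for the example.

```python
import re

# Shortened, hypothetical version of the one-line file: two plugin blocks.
text = (
    '"plugin_name_foo_bar": {"name": "plugin_name_foo_bar", "sha1": "old1", "title": "bla"}, '
    '"Exclusion": {"name": "Exclusion", "sha1": "old2", "title": "bla"}'
)

def replace_sha1(text, plugin, new_sha1, lazy=True):
    """Replace a sha1 value found after the plugin's "name" entry."""
    gap = ".*?" if lazy else ".*"  # lazy stops at the first sha1, greedy at the last
    pattern = r'("name": "' + re.escape(plugin) + '"' + gap + r'"sha1": ")[^"]*'
    return re.sub(pattern, lambda m: m.group(1) + new_sha1, text, count=1)

# Lazy: the sha1 inside the plugin_name_foo_bar block is replaced.
print(replace_sha1(text, "plugin_name_foo_bar", "thenewsha"))

# Greedy: the match runs on to the last sha1 on the line, which is the
# behaviour described in the question.
print(replace_sha1(text, "plugin_name_foo_bar", "thenewsha", lazy=False))
```

The newline-marker trick and the negated character classes in the sed answers above are ways of getting the lazy behaviour out of a tool that only offers greedy matching.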

Mongodb; use of the index in queries with a multiple condition

I use MongoDB. I created a database (500,000,000 documents) with a collection in it for testing. All documents look like the one below:
{
"_id" : ObjectId("50c1fbcda8cf8e11c43ea8ce"),
"sql_id" : 8311,
"text" : "WD7TYIM0H H3Q 3874 000 VFBF6H",
"xml" : "<root> <tag_0>WD7TYIM0H</tag_0> <tag_1>H3Q</tag_1> <tag_2>3874</tag_2><tag_3>000</tag_3><tag_4>VFBF6H</tag_4></root>",
"tags" : [
"WD7TYIM0H",
"H3Q",
"3874",
"000",
"VFBF6H"
]
}
I created an index on the "tags" field and want to run a query with multiple regexp conditions that uses that index. Is this possible?
I tried:
> db.items.find({ "$and" : [{ "tags" : /^AAA/ }, { "tags" : /^BBB/ }] })
> db.items.find({ "tags" : { "$all" : [/^AAA/, /^BBB/] } })
Both times, Mongo went down.
If I search with a single condition, the result comes back very quickly.
Thanks!
In my test it works fine. I'm using db version v1.6.5. Since I have only a few documents in my DB, your problem may be performance-related.