Extracting message content from JSON using regular expressions - C++

I need to extract the content of messages from JSON. I am not allowed to use a JSON parser, so I tried regular expressions, but I got stuck on extracting the message content. I am using C++.
Here's an example of the JSON file:
{
"id":"776752463986294785",
"type":0,
"content":"\"",
"channel_id":"762106839054811176",
"author":{
"id":"487706666905894923",
"username":"Emzak",
"avatar":"a70859ecda1355dfd55bddcfd0194458",
"discriminator":"6235",
"public_flags":0
},
"attachments":[
],
"embeds":[
],
"mentions":[
],
"mention_roles":[
],
"pinned":false,
"mention_everyone":false,
"tts":false,
"timestamp":"2020-11-13T10:16:58.777000+00:00",
"edited_timestamp":null,
"flags":0
}
As I said, I need the content field. My current regex is:
"content"[ :]+(\"[^"]*\")
This works unless there are quotation marks in the content. If there are any, they are always escaped, but I haven't found a way to get past them. With quotation marks, my current regex gives me this string:
"content": "\"
This would be a problem if there were any message text after that quotation mark. I would like to get this string:
"content": "\""
Any help would be appreciated. Thanks :)

You can make it handle escaped quotes (\") as follows:
"content"[ :]+(\"(?:\\.|[^"])*\")
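The non-capturing group (?:\\.|[^"]) matches either a backslash together with the character that follows it, or any character other than a quotation mark (the original [^"] criterion). Because the backslash alternative is tried first, an escaped \" is consumed as a pair and no longer ends the match.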
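
Since the question is about C++, here is a minimal sketch of how the pattern above could be used with std::regex. The function name extract_content and the choice to return an empty string when nothing matches are assumptions made for this illustration, not part of the original answer.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the quoted "content" value, including its
// surrounding quotes, or an empty string if the field is not found.
std::string extract_content(const std::string& json) {
    // Raw string literal, so the regex backslashes need no extra escaping.
    // Group 1 is the quoted value; \\. consumes a backslash plus the
    // character after it, so an escaped quote does not end the match.
    static const std::regex re(R"re("content"[ :]+("(?:\\.|[^"])*"))re");
    std::smatch m;
    if (std::regex_search(json, m, re)) {
        return m[1].str();
    }
    return std::string();
}
```

Calling extract_content on the example message should yield the four characters "\"" (quote, backslash, quote, quote). Note that this still is not a JSON parser: it only looks for the first occurrence of the literal text "content" followed by a quoted string.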

Related

Extracting a value from JSON Response using Regular Expression Extractor in Jmeter

I have a JSON response from which I want to extract the "transaction_id" value (i.e. 3159184 in this case) and use it in my next sampler. Can somebody give me a regular expression to extract it? I have looked at some solutions, but they don't seem to work.
{
"lock_release_date": "2021-04-03T16:16:59.7800000+00:00",
"party_id": "13623162",
"reservation_id": "reserve-1-81b70981-f766-4ca7-a423-1f66ecaa7f2b",
"reservation_line_items": [
{
"extended_properties": null,
"inventory_pool": "available",
"lead_type": "Flex",
"line_item_id": "1",
"market_id": 491759,
"market_key": "143278|CA|COBROKE|CITY|FULL",
"market_name": "143278",
"market_state_id": "CA",
"product_name": "Local Expert",
"product_size": "SOV30",
"product_type": "Postal Code",
"reserved_quantity": 0,
"transaction_id": 3159174
}
],
"reserved_by": "user1@abc.com"
}
Here's what I'm trying in JMeter
If you really want the regular expression it would be something like:
"transaction_id"\s?:\s?(\d+)
where:
\s? stands for an optional whitespace character - this is why your expression doesn't work
\d+ stands for one or more digits
See Regular Expressions chapter of JMeter User Manual for more details.
Be aware that parsing JSON with regular expressions is not the best idea; consider using the JSON Extractor instead. It fetches "interesting" values from JSON using simple JsonPath queries, which are easier to write and read, and more robust. The relevant JSON Path query would be:
$.reservation_line_items[0].transaction_id
More information: API Testing With JMeter and the JSON Extractor
Use the JSON Extractor for JSON responses rather than the Regular Expression Extractor.
Use a JSON Path expression such as $..transaction_id
The simplest regular expression for extracting the value above is:
transaction_id": (.+)
Where:
() is used for creating capture group.
. (dot) matches any character except line breaks.
+ (plus) matches 1 or more of the preceding token.
(.+?) could be used to stop as soon as the rest of the pattern can match.
That is, ? makes the preceding quantifier lazy, so it matches as few characters as possible. By default, quantifiers are greedy and match as many characters as possible.
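
To try the "transaction_id"\s?:\s?(\d+) pattern from the first answer outside JMeter, a minimal C++ sketch with std::regex could look like the following. The function name extract_transaction_id and the empty-string fallback are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the digits that follow "transaction_id",
// or an empty string if the key is not present.
std::string extract_transaction_id(const std::string& body) {
    // \s? tolerates an optional whitespace character on either side of the
    // colon; (\d+) captures one or more digits as group 1.
    static const std::regex re(R"re("transaction_id"\s?:\s?(\d+))re");
    std::smatch m;
    return std::regex_search(body, m, re) ? m[1].str() : std::string();
}
```

In JMeter itself, the same capture group would be referenced through the Regular Expression Extractor template (for example $1$); the C++ wrapper here is only a way to check the pattern.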

Regex: getting all the hashtags and mentions used in all my documents

I'm using the Kibana console to run these queries (they are separate: one for the hashtags, one for the mentions). The documents are blog entries with a textContent field, which may contain user mentions like @theUserName @AnotherOne or hashtags like #helloWorld and #hello2. The queries look like the following one:
GET /xblog/_search
{
"source": [
"id",
"textContent"
],
"query": {
"regexp": {
"textContent": {
"value": "@([^-A-Za-z0-9])",
"flags": "ALL"
}
}
}
}
But the problem is that it also returns documents that do not contain a @userMention. I think the @ in the regex is being treated as a special symbol, but reading the documentation I couldn't find how to escape it.
In the docs [1], the authors say that you can escape any symbol with double quotes, so I tested:
""@""
But I got nothing.
I also tested expressions I'm used to, like:
/\s([@#][\w_-]+)/g
But that produces multiple errors in Kibana. I tried replacing some parts according to the documentation, but it's still not working.
Can you point me in the right direction?
Thanks in advance.
You enabled the ALL flag, which makes @ match any entire string; see the Elasticsearch regex documentation:
If you enable optional features (see below) then these characters may also be reserved:
# @ & < > ~
Then, in the Any string section:
The at sign "@" matches any string in its entirety.
Enabled with the ANYSTRING or ALL flags.
Since you do not need any special behavior here, you may simply tell the engine to use a "simple" regex by passing "flags": "NONE", or escape the @, as in "\\@([^-A-Za-z0-9])":
Any reserved character can be escaped with a backslash "\*" including a literal backslash character: "\\"
And since the regexp query must match the whole string, you may need to add .* on both ends (to match strings that contain your match):
"query": {
"regexp": {
"textContent": {
"value": ".*@[^-A-Za-z0-9].*",
"flags": "NONE"
}
}
}
Or
"query": {
"regexp": {
"textContent": {
"value": ".*\\@[^-A-Za-z0-9].*",
"flags": "ALL"
}
}
}

Remove an object from JSON using RegEx

I have JSON objects in this format:
{
"1f626": {
"name": "frowning face with open mouth",
"ascii": [],
"code_points": {
"base": "1f626",
"default_matches": [
"1f626"
],
"greedy_matches": [
"1f626"
],
"decimal": ""
}
}
}
I have to remove the code_points object using Regular Expressions.
I have tried using this RegEx:
(("code\w+)(.*)(}))
But it is only selecting the first line.
I have to select everything up to the closing curly bracket in order to fully get rid of the code_points object.
How can I do that?
Note: I have to remove it using Regular Expressions and not JavaScript.
Please don't post any JavaScript answers or mark this as a possible duplicate of a JavaScript-based question.
Alternatively, at the command line, if you can use jq:
jq "del(.[].code_points)" <monster.json >smaller_monster.json
This deletes the code_points key inside each 2nd-level object.
It took my machine about 5 seconds on a 60MB document.
It's not a regular expression but it's not JavaScript, either. So, it meets half of your non-functional requirements.
("code_points")([\s\S]*?)(})
The problem you had is that . matches any character except \n, so in this case I usually use [\s\S], which means any whitespace or non-whitespace character (so it really is any character). You should also make the * quantifier lazy by adding ?.
Remember that this regular expression won't work properly if code_points contains an inner object (another pair of {}), because the lazy match stops at the first }.
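
As an illustration of the [\s\S]*? approach, here is a minimal C++ sketch using std::regex_replace. It is a variant of the pattern above, not the answer's exact expression: it additionally consumes the comma before "code_points" (so the preceding member is not left with a trailing comma) and escapes the braces. Both changes, and the function name remove_code_points, are assumptions made for this example. The same limitation applies: it breaks if code_points contains a nested object.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: strips every `, "code_points": { ... }` member.
// Assumes code_points is never the first member of its parent object and
// never contains a nested { } pair.
std::string remove_code_points(const std::string& json) {
    // [\s\S]*? lazily matches any characters, including newlines, up to
    // the first closing brace.
    static const std::regex re(
        R"re(,\s*"code_points"\s*:\s*\{[\s\S]*?\})re");
    return std::regex_replace(json, re, "");
}
```

Because the arrays inside code_points use [ ] rather than { }, the first } the lazy match reaches in the sample document is the one that closes code_points.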

GROK (regular expressions), field with backslash, space and a long

I'm using Logstash to get some text out of a string and create a field.
The string of the message is:
"\"07/12/2016 16:21:24.652\",\"13.99\",\"1467351040\""
I can't figure out how to get three results. The first:
07/12/2016 16:21:24.652
The second
13.99
The third
1467351040
match => {
"message"=> [
"\\"%{DATESTAMP:a}\\",\\"%{NUMBER:b}\\",\\"%{NUMBER:c}\\""
]
}
To help the next time you have to craft a grok pattern:
GrokConstructor, to test your pattern
The main patterns
Grok filter documentation
That's the correct line indeed.
I had to remove one backslash for my own config. Thanks very much. Saves me a lot of time and stuff.
grok{ match => { "message"=> [ "\"%{DATESTAMP:a}\",\"%{NUMBER:b}\",\"%{NUMBER:c}\"" ]} }

logstash grok filter regular expression works in debug tool but failed in actual execution

I'm trying to extract a field out of a log line. I used http://grokdebug.herokuapp.com/ to debug my regular expression:
(?<action>(?<=action=).*(?=\&))
With input text like this:
/event?id=123&action={"power":"on"}&package=1
I was able to get a result like this:
{
"action": [
"{"power":"on"}"
]
}
But when I copy this into my Logstash config file:
input { stdin{} }
filter {
grok {
match => { "message" => "(?<action>(?<=action=).*(?=\&))"}
}
}
output { stdout {
codec => 'json'
}}
the output says that matching failed:
{"message":" /event?id=123&action={\"power\":\"on\"}&package=1","@version":"1","@timestamp":"2016-01-05T10:30:04.714Z","host":"xxx","tags":["_grokparsefailure"]}
I'm using logstash-2.1.1 in Cygwin.
Any idea why this happens?
You might be running into an issue caused by the greedy dot-matching subpattern .*. Since you are only interested in the text after action= up to the next & or the end of the string, you'd better use a negated character class [^&].
So, use
[?&]action=(?<action>[^&]*)
The [?&] matches either a ? or & and works as a boundary here.
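
The same idea can be checked outside Logstash with a minimal C++ sketch. std::regex does not support named groups such as (?<action>...), so this version uses a plain numbered group instead; the function name extract_action and the empty-string fallback are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the value of the "action" query parameter,
// or an empty string if the parameter is absent.
std::string extract_action(const std::string& url) {
    // [?&] requires the parameter name to start right after ? or &, so a
    // name that merely ends in "action" is not matched.
    // [^&]* stops at the next & or at the end of the string.
    static const std::regex re(R"re([?&]action=([^&]*))re");
    std::smatch m;
    return std::regex_search(url, m, re) ? m[1].str() : std::string();
}
```

With the sample input from the question, the captured value is {"power":"on"}; a URL such as /event?reaction=1 yields no match, which is the boundary effect of [?&].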
It doesn't answer your regexp question, but...
Parse the query string to a separate field and use the kv{} filter on it.