Regular expression to extract JSON string from text

I'm looking for a regex to extract JSON strings from text.
I have the text below, which contains JSON strings (with the keys mTitle, mPoster, mYear, mDate) like this:
{"999999999":"138138138","020202020202":{"846":{"mTitle":"\u0430","mPoster":{"
small":"\/upload\/ms\/b_248.jpg","middle":"600.jpg","big":"400.jpg"},"mYear"
:"2013","mDate":"2014-01-01"},"847":{"mTitle":"\u043a","mPoster":"small":"\/upload\/ms\/241.jpg","middle":"600.jpg","big":"
138.jpg"},"mYear":"2013","mDate":"2013-12-26"},"848":{"mTitle":"\u041f","mPoster":{"small":"\/upload\/movies\/2
40.jpg","middle":"138.jpg","big":"131.jpg"},"mYear":"2013","mDate":"2013-12-19"}}}
To parse a JSON string, I first need to extract it from the text.
So my question is: how can I get only the JSON strings from the text?
I've tried this regular expression with no success:
{"mTitle":(\w|\W)*"mDate":(\w|\W)*}

The following regex should work:
\{\s*"mTitle"\s*:\s*(.+?)\s*,\s*"mPoster":\s*(.+?)\s*,\s*"mYear"\s*:\s*(.+?)\s*,\s*"mDate"\s*:\s*(.+?)\s*\}
The main difference from your regex is the .+? part, which, broken down, means:
Match any character (.)
One or more times (+)
As little as possible (?)
The ? after the + is very important here: without it, the first .+ (in \{\s*"mTitle"\s*:\s*(.+)) would be greedy and run on to the last "mPoster" in the text instead of stopping at the first one, which is what you want.
Notice it is just a more complicated version of \{"mTitle":(.+?),"mPoster":(.+?),"mYear":(.+?),"mDate":(.+?)\} (with \s* to match spaces, allowed by the JSON notation).
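A minimal Python sketch of the lazy-versus-greedy difference described above. The sample text is a shortened, hypothetical stand-in for the text in the question, and the names LAZY, GREEDY, and records are illustrative only.

```python
import json
import re

# Shortened, hypothetical sample in the same shape as the text in the question.
text = (
    '{"846":{"mTitle":"\\u0430","mPoster":{"small":"a.jpg","big":"b.jpg"},'
    '"mYear":"2013","mDate":"2014-01-01"},'
    '"847":{"mTitle":"\\u043a","mPoster":{"small":"c.jpg","big":"d.jpg"},'
    '"mYear":"2013","mDate":"2013-12-26"}}'
)

# The regex from the answer, with lazy quantifiers (.+?).
LAZY = re.compile(
    r'\{\s*"mTitle"\s*:\s*(.+?)\s*,\s*"mPoster":\s*(.+?)\s*,'
    r'\s*"mYear"\s*:\s*(.+?)\s*,\s*"mDate"\s*:\s*(.+?)\s*\}'
)
# Same regex with greedy quantifiers (.+), for comparison only.
GREEDY = re.compile(LAZY.pattern.replace(".+?", ".+"))

# Each lazy match is one complete object, so it can be handed to a JSON parser.
records = [json.loads(m.group(0)) for m in LAZY.finditer(text)]

# The greedy variant swallows everything from the first object to the end.
greedy_matches = [m.group(0) for m in GREEDY.finditer(text)]

print(len(records), len(greedy_matches))
```

With the lazy quantifiers each movie entry is matched on its own; with the greedy ones the single match spans both entries and is no longer one valid JSON object.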

Related

Regular Expression - Starting and ending with, and contains specific string in the middle

I would like to write a regex that satisfies the following condition:
The string "EVENT" is contained within an XML tag called "SHEM-HAKOVETZ".
For example, the following string should be a match:
<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
I think you want something like this ^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Regular expression
^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Parts of the regular expression
^ From the beginning of the line
<SHEM-HAKOVETZ> Starting tag
.* Any character - zero or more
EVENT Middle part
<\/SHEM-HAKOVETZ>$ Ending part of the match
If you want to match this line, you could use this regex:
<SHEM-HAKOVETZ>.*EVENTS.*(?=<\/SHEM-HAKOVETZ>)
However, I would not recommend using regex on XML-based data, because there may be problems with whitespace handling in XML. I would suggest using an actual XML parser and then applying the regex to the extracted text to be sure about your results.
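A rough Python sketch of the parser-first approach suggested above, using the standard library's xml.etree.ElementTree. The <root> wrapper and the second element are hypothetical; only the first SHEM-HAKOVETZ element comes from the question.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical wrapper document around the element from the question.
xml_text = """<root>
  <SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
  <SHEM-HAKOVETZ>104000514813450OTHER0001dfd0.DAT</SHEM-HAKOVETZ>
</root>"""

root = ET.fromstring(xml_text)

# Let the parser find the elements, then apply the regex to their text only.
matches = [
    el.text.strip()
    for el in root.iter("SHEM-HAKOVETZ")
    if el.text and re.search(r"EVENT", el.text)
]

print(matches)
```

The parser deals with whitespace, attributes, and nesting, so the regex only has to describe the value itself.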
Here is a solution to only match the "value" part ignoring the XML tags:
(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=<\/SHEM-HAKOVETZ>)
You can check it out in action at: https://regex101.com/r/4XiRch/1
It uses a lookbehind and a lookahead to make sure it only matches when the surrounding tags are correct, while the match itself contains only the content, which is convenient for further processing.
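As an illustration, the lookaround pattern can be tried in Python like this. This is a sketch: the variable names are made up, and the second input is a hypothetical non-matching line.

```python
import re

# Lookbehind and lookahead assert the tags without including them in the match.
pattern = re.compile(r"(?<=<SHEM-HAKOVETZ>).*EVENTS.*(?=</SHEM-HAKOVETZ>)")

line = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"
other = "<OTHER-TAG>104000514813450EVENTS0001dfd0.DAT</OTHER-TAG>"  # hypothetical

m = pattern.search(line)
content = m.group(0) if m else None

# A line with different tags should not match at all.
no_match = pattern.search(other)

print(content, no_match)
```

Python's re module only accepts fixed-width lookbehinds, which is fine here because the opening tag is a literal string.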

Perl, replace multiple matches in string

So, I'm parsing XML and ran into a problem. The XML has objects containing a script that looks roughly like this:
return [
['measurement' : org.apache.commons.io.FileUtils.readFileToByteArray(new File('tab_2_1.png')),
'kpi' : org.apache.commons.io.FileUtils.readFileToByteArray(new File('tab_2_2.png'))]]
I need to replace every filename while keeping the file extension, and to do so for every occurrence of the pattern, because a string can look like this:
['measurement' : org.apache.commons.io.FileUtils.readFileToByteArray(new File('tab_2_1.png'))('tab_2_1.png'))('tab_2_1.png')),
and I still need to replace every image name before .png.
I used this regexp: .*\(\'(.*)\.png\'\),
but it catches only the last match in each line (before \n), not every match in the whole string.
Can you help me correct this regexp?
The problem is that .* is greedy: it matches everything it can. So .*x matches everything up to the very last x in the string, even if the text in between contains other xs. You need the non-greedy version:
s/\('(.*?)\.png/('$replacement.png/g;
where the ? makes .* match only up to the first .png. The \(' is needed to anchor the pattern at the start of the filename. This correctly replaces the filenames in the shown examples.
Another way to do this is \('([^.]*)\.png, where [^.] is a negated character class, matching anything that is not a dot (.). With the * quantifier it again matches everything up to the first .png.
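The same two substitutions can be sketched in Python (the answer itself uses Perl). The input line is the one from the question; replacement, swap, and the result names are hypothetical.

```python
import re

# Line taken from the question; "new_name" is a hypothetical replacement.
line = (
    "['measurement' : org.apache.commons.io.FileUtils.readFileToByteArray("
    "new File('tab_2_1.png'))('tab_2_1.png'))('tab_2_1.png')),"
)
replacement = "new_name"


def swap(match):
    # Rebuild the matched prefix with the new base name, keeping ".png".
    return "('" + replacement + ".png"


# Non-greedy version: .*? stops at the first ".png" after each "('".
lazy_result, lazy_count = re.subn(r"\('(.*?)\.png", swap, line)

# Negated character class version: [^.]* cannot run past a dot.
negated_result, negated_count = re.subn(r"\('([^.]*)\.png", swap, line)

print(lazy_count, lazy_result)
```

re.subn replaces every non-overlapping match, which plays the role of the /g flag in the Perl substitution.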
The question doesn't say how exactly you are "parsing an XML", but I dearly hope that it is with libraries like XML::LibXML or XML::Twig. Please do not attempt that with regex. The tool is just not fully adequate for the job, and you'll find that out the hard way. A lot has been written about this over the years; search Stack Overflow.

How to extract big mgrs using regex

I have an input json:
{"id":12345,"mgrs":"04QFJ1234567890","code":"12345","user":"db3e1a-3c88-4141-bed3-206a"}
I would like to use a regular expression to extract the MGRS at 1,000-meter (1 km) precision; for my example the result should be: 04QFJ1267
The first 2 symbols are always digits, the next 3 are always letters, and the rest are always digits. The MGRS always has a fixed length of 15 characters.
Is it possible?
Thanks.
All you really need to do is remove characters 8-10 and 13-15. If you want or need to do that using regex, you could use a replace method with the following regex (which also discards the rest of the string):
.*?(\w{7})\d{3}(\d{2})\d+.*
and replacement string:
$1$2
I see now you are using Java. So the relevant code line might look like:
resultString = subjectString.replaceAll(".*?(\\w{7})\\d{3}(\\d{2})\\d+.*", "$1$2");
The above assumes all your strings look like what you showed, and there is no need to test to be sure that "mgrs" is in the string.
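For comparison, here is a hedged Python sketch of both routes: the same replace-based regex, and a stricter variant that parses the JSON first and validates the field layout described in the question (2 digits, 3 letters, 10 digits). The names MGRS_1M and to_1km are made up for this example.

```python
import json
import re

raw = '{"id":12345,"mgrs":"04QFJ1234567890","code":"12345","user":"db3e1a-3c88-4141-bed3-206a"}'

# Route 1: the replace-based regex from the answer, applied to the whole string.
by_replace = re.sub(r".*?(\w{7})\d{3}(\d{2})\d+.*", r"\1\2", raw)

# Route 2: parse the JSON, then validate and trim the mgrs value.
# Assumes uppercase letters, as in the example input.
MGRS_1M = re.compile(r"(\d{2}[A-Z]{3})(\d{5})(\d{5})")


def to_1km(mgrs):
    """Keep the prefix plus the first two digits of easting and northing."""
    m = MGRS_1M.fullmatch(mgrs)
    if m is None:
        raise ValueError("not a 15-character MGRS string: " + repr(mgrs))
    prefix, easting, northing = m.groups()
    return prefix + easting[:2] + northing[:2]


by_parse = to_1km(json.loads(raw)["mgrs"])

print(by_replace, by_parse)
```

The second route fails loudly on malformed input, whereas the replace-based regex returns the input unchanged when nothing matches.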

Regular Expression to unmatch a particular string

I am trying to use a regular expression in JMeter in which I need to unmatch a particular string. Here is my input test string: <activationCode>insvn</activationCode>
I need to extract the code insvn from it. I tried using the expression :
[^/<activationCode>]\w+, but it does not yield the required code. I am new to regular expressions and need help with this.
Can you use a look-behind assertion in JMeter? If so, you can use this regex, which will give you the word that follows <activationCode>:
(?<=\<activationCode\>)\w+
If your input string is encoded (e.g for HTML), use:
(?<=&lt;activationCode&gt;)\w+
When designing a regular expression in any language for something like this you can match your input string as three groups: (the opening tag, the content, and the closing tag) then select the content from the second group.
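The three-group idea can be sketched in Python as follows. This only illustrates the grouping; it is not JMeter configuration, and the variable names are hypothetical.

```python
import re

text = "<activationCode>insvn</activationCode>"

# Group 1: opening tag, group 2: content, group 3: closing tag.
pattern = re.compile(r"(<activationCode>)(\w+)(</activationCode>)")

m = pattern.search(text)
code = m.group(2) if m else None

print(code)
```

Selecting group 2 gives the content without needing a look-behind, which helps in engines where look-behind support is uncertain.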

Regex with Tab delimited text containing \x09

I've got a tough one.
I've got tab-delimited text to match with a regex.
My regex looks like:
^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$
and an example source text is (tabs converted to \t for clarity):
JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"\tNone
However, the problem is that in my source text, the 6th field contains a regex string. Therefore, it can contain \x09, which naturally blows up the regex since it's seen as a tab as well.
Is there any way to tell the regex engine, "Match on \t but not on the text \x09"? My guess is no, since they're the same thing.
If not, is there any character that could be safely used for delimiting text that contains a regex string?
I would recommend encoding all of the characters in the pcre string prior to running the regular expression against it.
Seems like a problem with the test case. A regex might have tabs in it, but your sample above doesn't. Your string in Java would look like:
String testString = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/\"\tNone";
If you look at this string in the debugger you'll have \x09 as 4 characters instead of as 1 (the tab).
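The same point can be checked with a short Python sketch (the answer uses Java). A raw string keeps \x09 as four literal characters, so the tab-delimited regex from the question still matches; the names FIELDS, pcre, and broken_line are illustrative.

```python
import re

# The regex from the question, unchanged.
FIELDS = re.compile(
    r"^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$"
)

# Raw string: every \xNN below is a backslash followed by three characters.
pcre = r'pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"'

# Real tab characters separate the seven fields.
line = "\t".join(
    ["JJ", "345", "0", "Test", "Some test text", "more text: " + pcre, "None"]
)
m = FIELDS.match(line)

# If the literal text \x09 were turned into a real tab, the line would have
# eight fields and the regex would no longer match.
broken_line = line.replace(r"\x09", "\t")
broken = FIELDS.match(broken_line)

print(m is not None, broken is None)
```

So the regex only breaks if something upstream decodes the escape sequences before the line is split into fields.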