regular expression: how to ignore rest of the line - regex

I have an input like this (a JSON format)
{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}
Now, I want to pick up the first "id" value, which is 1BCDEFGHIJKLM, using regular expressions. I have got as far as
[^({"location":[?{"id":")].{0,12} but this is incomplete. Could someone explain how to ignore the rest of the line after the value 1BCDEFGHIJKLM?

Regex isn't the way to do this. Whatever platform you are using, it must have a JSON parser.
That will be your best error-free solution.
Assuming you must use regex, you can grab all the id's using "id":"(.*?)", and take the first match.
I found the following article, which might help you.
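To make the parser suggestion concrete, here is a minimal sketch in Python (the question does not name a platform, so the language is an assumption, and the input is shortened for readability). The parser handles quoting and nesting, so nothing after the first "id" needs to be ignored by hand.

```python
import json

# Shortened version of the input from the question (an assumption for brevity).
raw = (
    '{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd"},'
    '{"id":"7823XYZHMOPRE","somename":"abcd Junction"}]}'
)

data = json.loads(raw)                # parse the whole document once
first_id = data["location"][0]["id"]  # first object in the "location" array
print(first_id)
```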

While messy, how is your regex incomplete?
It could be shortened to ("id":"([^"]+)"), which is more readable and doesn't limit the ID to twelve characters, if that is beneficial.
If your problem is getting more than one result, most languages have a "g" global switch.
In JavaScript, the following would return "1BCDEFGHIJKLM":
var firstID = str.match(/"id":"([^"]+)"/)[1]
This works because match() returns an array in which [0] is the entire matched string and [1] is the first parenthesized group.
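For comparison, a rough Python equivalent of the same idea, assuming the input is held in a string (shortened here). re.search stops at the first match, which is what "ignoring the rest of the line" amounts to, while findall returns every match.

```python
import re

# Shortened input; "someid" is kept to show that it is not picked up.
text = (
    '{"location":[{"id":"1BCDEFGHIJKLM","someid":"1BCDEFGHIJKLM"},'
    '{"id":"7823XYZHMOPRE","someid":"7823XYZHMOPRE"}]}'
)

ID_PATTERN = re.compile(r'"id":"([^"]+)"')

first = ID_PATTERN.search(text)           # stops at the first match
first_id = first.group(1) if first else None
all_ids = ID_PATTERN.findall(text)        # every "id" value, in order

print(first_id)
print(all_ids)
```

The leading quote in the pattern is what keeps keys such as "someid" from matching.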

You don't have to use regex. In your favourite language, split on commas. Then go through each item, check for "id", split it on the colon (:), and take the last element. E.g. in Python:
>>> s
'{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}'
>>> for i in s.split(","):
...     if '"id"' in i:
...         print(i.split(":")[-1])
...         break
...
"1BCDEFGHIJKLM"
Of course, ideally, you should use a dedicated JSON parser.

Related

Preg_match_all with nested matches

I'm developing a template system and running into some issues.
The plan is to create HTML documents with [#tags] in them.
I could just use str_replace (I can loop through all possible replacements), but I want to push this a little further ;-)
I want to allow nested tags, and allow parameters with each tag:
[#title|You are looking at article [#articlenumber] [#articlename]]
I would like to get the following results with preg_match_all:
[0] title|You are looking at article [#articlenumber] [#articlename]
[1] articlenumber
[2] articlename
My script will split the | for parameters.
The output from my script will be something like:
<div class='myTitle'>You are looking at article 001 MyProduct</div>
The problem I'm having is that I'm not experienced with regex. All my patterns produce almost what I want, but have problems with the nested params.
\[#(.*?)\]
Will stop at the ] from articlenumber.
\[#(.*?)(((?R)|.)*?)\]
Is more like it, but it doesn't catch the articlenumber; https://regex101.com/r/UvH7zi/1
Hope someone can help me out! Thanks in advance!
You cannot do this using general Python regular expressions. You are looking for a feature similar to "balancing groups" available in the .NET RegEx's engine that allows nested matches.
Take a look at PyParsing that allows nested expression:
from pyparsing import nestedExpr
import pyparsing as pp
text = '{They {mean to {win}} Wimbledon}'
print(pp.nestedExpr(opener='{', closer='}').parseString(text))
The output is:
[['They', ['mean', 'to', ['win']], 'Wimbledon']]
Unfortunately, this does not work very well with your example. You need a better grammar, I think.
You can experiment with a QuotedString definition, but still.
import pyparsing as pp
single_value = pp.QuotedString(quoteChar="'", endQuoteChar="'")
parser = pp.nestedExpr(opener="[", closer="]",
                       content=single_value,
                       ignoreExpr=None)
example = "['#title|You are looking at article' ['#articlenumber'] ['#articlename']]"
print(parser.parseString(example, parseAll=True))
I'm typing this on my phone so there might be some mistakes, but what you want can be quite easily achieved by incorporating a lookahead into your expression:
(?=\\[(#(?:\\[(?1)\\]|.)*)\\])
Edit: Yup, it works, here you go: https://regex101.com/r/UvH7zi/4
Because (?=) consumes no characters, the pattern looks for and captures the contents of all "[#*]" substrings in the subject, recursively checking that the contents themselves contain balanced groups, if any.
here is my code:
#\w+\|[\w\s]+\[#(\w+)]\s+\[#(\w+)]
https://regex101.com/r/UvH7zi/3
For now I've created a parser:
- Get all opening tags and put their strpos in an array
- Loop through all start positions of the opening tags
- Look for the next closing tag; is it before the next opening tag? Then the tag is complete
- If the closing tag was after an opening tag, skip that one and look for the next (and keep checking for opening tags in between)
That way I could find all complete tags and replace them.
But that took about 50 lines of code and multiple loops, so one preg_match would be better ;-)
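The hand-written parser described above can be condensed into a single pass with a stack. This is only a sketch in Python rather than PHP, the function name find_tags is made up for illustration, and it assumes that "]" appears in the template only as a tag closer.

```python
def find_tags(text):
    """Return the contents of every balanced [#...] tag, outermost first."""
    stack = []   # start offsets of tags that are still open
    found = []   # (start offset, tag content) pairs
    i = 0
    while i < len(text):
        if text.startswith("[#", i):
            stack.append(i)          # remember where this tag opened
            i += 2
        elif text[i] == "]" and stack:
            start = stack.pop()      # closes the most recently opened tag
            found.append((start, text[start + 2:i]))
            i += 1
        else:
            i += 1
    # Inner tags close first, so sort by start offset to list outer tags first.
    return [content for _, content in sorted(found)]


template = "[#title|You are looking at article [#articlenumber] [#articlename]]"
tags = find_tags(template)
for tag in tags:
    print(tag)
```

Each closing bracket is paired with the most recent unclosed opener, which is the same pairing the multi-loop version computes, and unclosed openers are simply left out of the result.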

Regex HTTP Response Body Message

I use JMeter for REST testing.
I have made a HTTP Request, and this is the response data:
{"id":11,"name":"value","password":null,"status":"ACTIVE","lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}
I need just the ID (which is 11) in
{"id":11,....
I use the regex below:
([0-9].+?)
It works perfectly, but it will be a problem if my ID has more than 2 digits. I would need to change the regex to:
([0-9][0-9].+?)
Is there a dynamic regex for my problem? Thank you for your attention.
Regards,
Stefio
If you want any integer between {"id": and , use the following Regular Expression:
{"id":(\d+),
However, the smarter way of dealing with JSON data could be the JSON Path Extractor (available via JMeter Plugins). Going forward, this option can be much easier to use against complex JSON.
See Using the XPath Extractor in JMeter guide (scroll down to "Parsing JSON") to learn more on syntax and use cases.
I suggest using the following regular expression:
"id":([^,]*),
This will first find "id": and then look for anything that is not a comma until it finds a comma. Note the character grouping is only around the value of the ID.
This will work for ANY length ID.
Edit:
The same concept works for almost any JSON data, for example where the value is quoted:
"key":"([^"]*)"
That regular expression will extract the value from given key, as long as value is quoted and does not contain quotes. It first finds "key": and then matches anything that is not a quote until the next quote.
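The two patterns can be checked outside JMeter. Below is a small sketch in Python (the helper name extract is made up for illustration) that runs them against the response from the question and compares the result with a real JSON parse.

```python
import json
import re

response = (
    '{"id":11,"name":"value","password":null,"status":"ACTIVE",'
    '"lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}'
)


def extract(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


numeric_id = extract(r'"id":([^,]*),', response)      # unquoted value
status = extract(r'"status":"([^"]*)"', response)     # quoted value
parsed_id = json.loads(response)["id"]                # parser, for comparison

print(numeric_id, status, parsed_id)
```

Note that the regex yields the string "11", whereas the parser yields the integer 11.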
You can use the quantifier like this:
([0-9]{2,}.+?)
It will catch 2 or more digits, and then any symbol, 1 or more times. If you want to allow no other characters after the digits, use * instead of +:
([0-9]{2,}.*?)
Regex demo

Extract all tokens from string using regex in Scala

I have a string like "httpx://__URL__/__STUFF__?param=value"
This sample is a url by convention...it could be anything with zero or more __X__ tokens in it.
I want to use a regex to extract a list of all the tokens, so output here would be List("__URL__","__STUFF__"). Remember, I don't know beforehand how many (if any) tokens may be in the input string.
I've been struggling and have been unable to come up with a regex that will do the trick.
Something like this did not work:
(?:.?(__[a-zA-Z0-9]+__).?)+
Scala Regex, which is just a wrapper around Java Regex, will never return multiple subgroups for repetitions.
The only way about it is to have a regex for the token, and then find it multiple times. You pretty much already have everything you want:
"__[a-zA-Z0-9]+__".r findAllIn "httpx://__URL__/__STUFF__?param=value"
That returns an Iterator. Use .toSeq or similar to convert into a collection.
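The same "match one token, find it repeatedly" approach looks like this in Python, shown only for comparison with the Scala one-liner above. re.findall plays the role of findAllIn and returns an empty list when the input has no tokens.

```python
import re

TOKEN = re.compile(r"__[a-zA-Z0-9]+__")

tokens = TOKEN.findall("httpx://__URL__/__STUFF__?param=value")
no_tokens = TOKEN.findall("plain text without tokens")

print(tokens)
print(no_tokens)
```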
Greg, have you tried a simple
_+[^_]+_+
This will match all the __TOKENS__
It doesn't do any check for any __TOKENLIKE__ string after the ?params, but you have mentioned you are not only using that for urls. If you need some refinement, please let us know.
Combine a regex with split:
def urlPathComponents(s: String): Option[Array[String]] =
  """(?<=http(s?)://)[^?]+""".r findFirstIn s map (_.split("/"))

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
After suggestions, I tried to test the following regex, which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
solution:
i think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
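As a sketch of the "collect everything, then remove duplicates yourself" idea: the pattern below is an assumption that loosens the word part to "anything up to the next slash" and accepts page numbers from 2 upwards, and unique_links is a made-up helper name. dict.fromkeys keeps the first occurrence of each link in order (Python 3.7+).

```python
import re

# Digits, a dash, any words, then /page-N with N >= 2.
HREF = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+-[^"/]+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.I,
)


def unique_links(html):
    """Return matching links once each, in order of first appearance."""
    return list(dict.fromkeys(HREF.findall(html)))


sample = (
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-12">12</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-1">1</a>'
)
links = unique_links(sample)
for link in links:
    print(link)
```

The duplicate page-4 link is returned once and page-1 is skipped because the page number must be at least 2.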

Article spinner with 2 tiers

I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using c#.
Here is the pseudo code of how it might look like:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I can not split these up normally
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(@"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); //How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.
That is not parsable using a regular expression, instead you have to use a recursive descent parser. Map it to JSON by replacing:
{ with [ and } with ]
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.
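The mapping described above can be tried out in a few lines. This is a sketch in Python rather than PHP or C#, the function name spintax_to_lists is made up, and it assumes every alternative is a single \w+ word, as in the example; phrases with spaces or punctuation would need a different quoting step.

```python
import json
import re


def spintax_to_lists(spintax):
    """Convert {a|b} style text into nested Python lists via JSON."""
    # Quote every word first, while braces and pipes are still untouched.
    quoted = re.sub(r"\w+", lambda m: json.dumps(m.group(0)), spintax)
    as_json = quoted.replace("{", "[").replace("}", "]").replace("|", ",")
    return json.loads(as_json)


nested = spintax_to_lists("{{word1|word2}|{word3|word4}}")
print(nested)
```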
I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
@"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.
s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print spin(s)
If you want to use the [square|brackets|syntax] use this line in the process function:
'/\[(((?>[^\[\]]+)|(?R))*)\]/x',
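The spin function called above is not shown in the thread. One possible way to write it, offered only as a sketch in Python, is to resolve the innermost {a|b} groups repeatedly until none are left, which handles any nesting depth without a recursive regex. The rng parameter is an addition for illustration so that the choice can be made repeatable.

```python
import random
import re

# A group with no braces inside it, i.e. an innermost {a|b|c}.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def spin(text, rng=random):
    """Replace every {a|b} group with one randomly chosen alternative."""
    while True:
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), text
        )
        if count == 0:      # no groups left to resolve
            return text


result = spin("{{word1|word2}|{word3|word4}}", random.Random(0))
print(result)
```

Passing a seeded random.Random instance makes the output repeatable; with the default argument the module-level generator is used.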