Regex Lookaround with only one match where multiple possible matches available

I have this input:
<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>
I want to match only: something two
I tried: (?<=<TAG>)(.*two.*)(?=<\/TAG>)
but got:
something one</TAG><TAG>something two</TAG><TAG>something three
Here is another example:
RECORDsomething betweenRECORD RECORDanything betweenRECORD etc.
I want to get the words between the RECORD markers.

You can use
<TAG>.+?<TAG>(.*?)</TAG>
Your "something two" is in capture group $1 of the first match.

Try this:
(?<=</TAG><TAG>)[^<]*(?=</TAG><TAG>)
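To see how the two patterns suggested above behave on the sample input, here is a small Python check. It is only a sketch: the question does not name a regex flavor, so Python's `re` module is an assumption, and the variable names are made up for illustration.

```python
import re

s = "<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>"

# Lazily skip the first tag pair, then capture the content of the next one.
first = re.search(r"<TAG>.+?<TAG>(.*?)</TAG>", s)
print(first.group(1))

# Match text that has a closing/opening tag pair on both sides.
# The lookbehind has a fixed width, which Python's re requires.
middle = re.findall(r"(?<=</TAG><TAG>)[^<]*(?=</TAG><TAG>)", s)
print(middle)
```

Both approaches return "something two" for this input. The second one only finds content that is neither in the first nor in the last tag pair.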

As already said, parsing HTML using regular expressions is discouraged! There are plenty of HTML parsers for doing this. But if you want a regex at all costs, here is how I would do it in Python:
In [1]: import re
In [2]: s = '<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>'
In [3]: re.findall(r'(?<=<TAG>).*?(?=</TAG>)', s)[1]
Out[3]: 'something two'
However, this solution only works if you always want to extract the content of the second tag pair. But as I said, don't do this.

If you know that the TAG is not the first and not the last, you can use the following (it needs an engine that supports variable-length lookbehind, such as .NET):
(?<=.+<TAG>)(.*two.*)(?=<\/TAG>.+)
Of course, it's much better to match the tags as well and use a capturing group:
.*<TAG>(.*two.*?)<\/TAG>
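The attempt in the question overshoots because the greedy `.*` is free to run across tag boundaries. One way to keep the match inside a single tag pair is to replace `.*` with `[^<]*`, which cannot cross a `<`. The sketch below assumes Python's `re` and that the tag content never contains a `<`; the variable names are illustrative.

```python
import re

s = "<TAG>something one</TAG><TAG>something two</TAG><TAG>something three</TAG>"

# The attempt from the question: greedy .* spans several tag pairs.
greedy = re.search(r"(?<=<TAG>)(.*two.*)(?=</TAG>)", s).group(1)

# [^<]* stops at the next "<", so the match stays inside one tag pair.
bounded = re.search(r"(?<=<TAG>)[^<]*two[^<]*(?=</TAG>)", s).group(0)

print(greedy)
print(bounded)
```

This version needs only fixed-width lookbehind, so it does not depend on engine-specific features.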

Related

RegEx remove part of string and replace another part

I have a challenge getting the desired result with RegEx (using C#) and I hope that the community can help.
I have a URL in the following format:
https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1
I want to make two modifications, specifically:
1) Remove everything after 'value' e.g. '&ida=0&idb=1'
2) Replace 'category' with e.g. 'newcategory'
So the result is:
https://somedomain.com/subfolder/newcategory/?abc=text:value
I can handle 1) with e.g. ^[^&]+, which matches everything before the first '&' in the URL above, but I have been unable to figure out how to replace the 'category' substring.
Any help or guidance would be much appreciated.
Thank you in advance.

Use the following:
Find: /(category/.+?value)&.+
Replace: /new$1 or /new\1 depending on your regex flavor
Update according to comment.
If the new name is completely_different_name, use the following:
Find: /category(/.+?value)&.+
Replace: /completely_different_name$1
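As a rough check of the two find/replace pairs, here they are applied with Python's `re.sub`. The question uses C#, so Python is an assumption made for illustration; in Python the backreference is written `\1`.

```python
import re

url = "https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"

# Keep "category/...value", drop everything from the first "&" after it,
# and prepend "new" to the kept part.
prefixed = re.sub(r"/(category/.+?value)&.+", r"/new\1", url)

# Capture only the part after "category" so the name can be replaced entirely.
renamed = re.sub(r"/category(/.+?value)&.+", r"/completely_different_name\1", url)

print(prefixed)
print(renamed)
```

Both patterns rely on the literal text "value" appearing before the first "&", as it does in the sample URL.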

You haven't specified a language here; I mainly work in Python, so the solution is in Python (value holds the original URL).
import re
url = re.sub('category', 'newcategory', re.search('^https.*value', value).group(0))
Explanation
re.sub(a, b, c) replaces every match of pattern a with b in string c.
re.search looks for the first match of a pattern in a string, and group(0) is the whole match. In the code above, re.search matches everything from "https" through "value", which drops the trailing parameters.

Using Python and only built-in string methods (there is no need for regular expressions here):
url = r"https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"
new_url = (url.split('value')[0] + "value").replace("category", 'newcategory')
print(new_url)
Outputs:
https://somedomain.com/subfolder/newcategory/?abc=text:value

How to get string following a certain pattern?

I have a URL:
https://fakedomain.com/2017/07/01/the-string-i-want-to-get/
I can recognize the 2017/07/01/ via this pattern:
(\d{4}/\d{2}/\d{2}/)
But what I want, is the string that comes after it: the-string-i-want-to-get/.
How do I achieve that?

Depending on the language you're using, you might find a library that does that for you (instead of writing your own regex). Anyway, if you want to achieve this by regex, you can use:
\d{4}\/\d{2}\/\d{2}\/(.*)\/
Because .* is greedy, the group captures everything after the date up to the last "/". To stop at the next "/" instead, use ([^/]*) in place of (.*).
You can also use a positive lookbehind:
(?<=\d{4}\/\d{2}\/\d{2}\/)(.*)\/
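A short Python sketch of the difference between a greedy group and a negated character class after the date. Python's `re` is an assumption here since the question names no language, and `deeper` is a made-up URL with one extra path segment.

```python
import re

url = "https://fakedomain.com/2017/07/01/the-string-i-want-to-get/"
deeper = url + "extra/"  # hypothetical URL with one more path segment

# Greedy (.*) runs up to the last "/" in the string.
greedy = re.search(r"\d{4}/\d{2}/\d{2}/(.*)/", deeper).group(1)

# ([^/]*) stops at the first "/" after the date.
segment = re.search(r"\d{4}/\d{2}/\d{2}/([^/]*)/", deeper).group(1)

# The lookbehind variant works in Python because the date part has a fixed width.
lookbehind = re.search(r"(?<=\d{4}/\d{2}/\d{2}/)[^/]+", url).group(0)

print(greedy)
print(segment)
print(lookbehind)
```

On the original URL all three give the same slug; they only differ once more path segments follow.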

I suggest this regex, which matches 2017/07/01/ in the first group and the-string-i-want-to-get/ in the second group:
(\d{4}/\d{2}/\d{2}/)(.*/)
Here is an implementation example in Python3:
import re
url = 'https://fakedomain.com/2017/07/01/the-string-i-want-to-get/'
m = re.search(r'(\d{4}/\d{2}/\d{2}/)(.*/)', url)
print(m.group(1)) # 2017/07/01/
print(m.group(2)) # the-string-i-want-to-get/

Preg_match_all with nested matches

I'm developing a template system and running into some issues.
The plan is to create HTML documents with [#tags] in them.
I could just use str_replace (I can loop through all possible replacements), but I want to push this a little further ;-)
I want to allow nested tags, and allow parameters with each tag:
[#title|You are looking at article [#articlenumber] [#articlename]]
I would like to get the following results with preg_match_all:
[0] title|You are looking at article [#articlenumber] [#articlename]
[1] articlenumber
[2] articlename
My script will split the | for parameters.
The output from my script will be something like:
<div class='myTitle'>You are looking at article 001 MyProduct</div>
The problem I'm having is that I'm not experienced with regex. All my patterns give almost what I want, but they have problems with the nested parameters.
\[#(.*?)\]
stops at the ] from articlenumber.
\[#(.*?)(((?R)|.)*?)\]
is closer, but it doesn't catch the articlenumber; https://regex101.com/r/UvH7zi/1
Hope someone can help me out! Thanks in advance!

You cannot do this using general Python regular expressions. You are looking for a feature similar to the "balancing groups" available in the .NET regex engine, which allows nested matches.
Take a look at PyParsing, which allows nested expressions:
import pyparsing as pp
text = '{They {mean to {win}} Wimbledon}'
print(pp.nestedExpr(opener='{', closer='}').parseString(text))
The output is:
[['They', ['mean', 'to', ['win']], 'Wimbledon']]
Unfortunately, this does not work very well with your example. You need a better grammar, I think.
You can experiment with a QuotedString definition, but still.
import pyparsing as pp
single_value = pp.QuotedString(quoteChar="'", endQuoteChar="'")
parser = pp.nestedExpr(opener="[", closer="]",
                       content=single_value,
                       ignoreExpr=None)
example = "['#title|You are looking at article' ['#articlenumber'] ['#articlename']]"
print(parser.parseString(example, parseAll=True))

I'm typing this on my phone so there might be some mistakes, but what you want can be quite easily achieved by incorporating a lookahead into your expression:
(?=\\[(#(?:\\[(?1)\\]|.)*)\\])
Edit: Yup, it works, here you go: https://regex101.com/r/UvH7zi/4
Because (?=) consumes no characters, the pattern looks for and captures the contents of all "[#*]" substrings in the subject, recursively checking that the contents themselves contain balanced groups, if any.

here is my code:
#\w+\|[\w\s]+\[#(\w+)]\s+\[#(\w+)]
https://regex101.com/r/UvH7zi/3

For now I've created a parser:
- get all opening tags, and put their strpos in an array
- loop through all start positions of the opening tags
- look for the next closing tag; is it before the next opening tag? Then the tag is complete
- if the closing tag comes after an opening tag, skip that one and look for the next (and keep checking for opening tags in between)
That way I could find all complete tags and replace them.
But that took about 50 lines of code and multiple loops, so one preg_match would be better ;-)
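A scan like the one described in the list above can be kept short by tracking open tags on a stack. The following is a minimal sketch in Python rather than PHP (an assumption made for illustration); `find_tags` is a hypothetical helper name, and it treats every `]` as a closing tag whenever a tag is open.

```python
def find_tags(text):
    """Return the inner text of every balanced [#...] tag, outermost first."""
    found = []  # (start offset, inner text) pairs
    stack = []  # start offsets of tags that are currently open
    i = 0
    while i < len(text):
        if text.startswith("[#", i):
            stack.append(i)  # remember where this tag opened
            i += 2
        elif text[i] == "]" and stack:
            start = stack.pop()  # closes the most recently opened tag
            found.append((start, text[start + 2:i]))
            i += 1
        else:
            i += 1
    # Sort by start offset so outer tags come before the tags nested in them.
    return [inner for _, inner in sorted(found)]


template = "[#title|You are looking at article [#articlenumber] [#articlename]]"
tags = find_tags(template)
for tag in tags:
    print(tag)
```

For the sample template this yields the full title tag first, then articlenumber and articlename, which is the order asked for in the question. Splitting on "|" for parameters is left to the caller.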

regular expression : get super scripted text

I would like to get the superscripted text from the following HTML string.
testing to <sup>supers</sup>cript o<sup>n</sup>e
The result I would like to get is shown below:
supers
n
This is what I tried right now
But the result is not what I want.
<sup>supers
<sup>n
Could anyone give me a suggestion, please?

You can use lookbehind in your regex:
(?<=<sup>)[^<]*

Use this if there may be other HTML tags between <sup> and </sup>:
(?<=<sup>)(.*?)(?=<\/sup>)

You were close, just not capturing your match.
Updated regex:
(?:<sup>)([^<]*)
I just added a capture group around your match.

(?<=<sup>)([^<]*?)(?=<\/)
This should work.
See demo.
http://regex101.com/r/sA7pZ0/13
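For comparison, here is how the capture-group and lookbehind approaches from the answers above look in Python. The question does not state a language, so Python's `re` is an assumption, and the variable names are illustrative.

```python
import re

html = "testing to <sup>supers</sup>cript o<sup>n</sup>e"

# Capture group: findall returns only the group, not the surrounding tags.
captured = re.findall(r"<sup>(.*?)</sup>", html)

# Lookbehind: the tag is asserted but not part of the match.
looked = re.findall(r"(?<=<sup>)[^<]*", html)

print(captured)
print(looked)
```

Both give the same list for this input. The lazy capture-group form also tolerates other tags nested inside `<sup>`, while `[^<]*` stops at the first `<`.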

Regex for URL with GET variable

I'm trying to write a regex for any string that matches view.php with the GET variable file, whose value matches [a-zA-Z0-9_]*. FYI, I need to rewrite this URL to /file/value
What I tried, but it didn't work: ^view.php\?.*?(&|\?)file=([a-zA-Z0-9_]*).*$
What does work?

The leading ^ means your entire string must begin with view.php, which is probably not true.
Also, your regex assumes that file is the last GET variable, etc.
This regex should match the value of file in any string:
view\.php\?.*?\bfile=(\w*)\b

view.php\? here you accept viewaphp?. Is that ok? You probably mean view\.php. Also, you enforce a question mark at the end, whereas:
(&|\?) here you again enforce either an ampersand or a question mark. Hence, you require something like view.php??file=... or view.php?.*&file=...
What you want is probably something like this (untested; note the + to disallow empty filenames, and that both alternatives sit inside one group so the | does not split the whole pattern):
^view\.php\?(?:file=|.*&file=)([a-zA-Z0-9_]+)(?:&|$)
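Where the `|` sits matters a lot in a pattern like this, because alternation binds more loosely than concatenation. The Python sketch below compares an ungrouped and a grouped form of the alternation; the sample query strings are made up for illustration.

```python
import re

# Ungrouped: the top-level "|" splits the pattern into two separate alternatives,
# and the first alternative contains no capture group at all.
ungrouped = re.compile(r"^view\.php\?(?:file=)|(?:.*&file=)([a-zA-Z0-9_]+)(?:&|$)")

# Grouped: the "|" only chooses between "file=" and ".*&file=".
grouped = re.compile(r"^view\.php\?(?:file=|.*&file=)([a-zA-Z0-9_]+)(?:&|$)")

first_param = "view.php?file=report_1&x=2"
later_param = "view.php?foo=bar&file=blablah123&anotherfoo=anotherbar"

print(ungrouped.search(first_param).group(1))  # the group did not take part in the match
print(grouped.search(first_param).group(1))
print(grouped.search(later_param).group(1))
```

With the ungrouped form, a URL whose first parameter is file matches through the left alternative only, so the capture group stays empty.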

As @Lindrian asked, will this be run against a string starting with view.php or against an entire URL? For the former case, this simple regex should work fine in my opinion (using Python here):
In [1]: import re
In [2]: s = 'view.php?foo=bar&file=blablah123&anotherfoo=anotherbar'
In [3]: re.sub(r'view\.php\?.*\bfile=(\w+).*', r'/file/\g<1>', s)
Out[3]: '/file/blablah123'

We Keep Coding
