I have strings that look like this:
"Grand Theft Auto V (5)" border="0" src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" title="Grand... (the string continues for a while here)
I want to use regex to grab this: /product_images/Gaming/Playstation4 Software/5026555416986_s.jpg
Basically, everything in src="..."
At the moment I produce a list using re.findall(r'"([^"]*)"', line) and grab the appropriate one, but there are a lot of quotes in the full string and I'd like to be more efficient.
Can anyone help me put together an expression for this please?
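Try this (the quantifier is lazy so the match stops at the first closing quote instead of the last one in the string):
(?<=src=").+?(?=" )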
Use this as the regex:
src="(.+?)"
This will return the result you want:
re.findall(r'src="(.+?)"', text_to_search_from)
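For completeness, here is a small sketch of how that could look as a helper. It uses [^"]* instead of a lazy .+? (a swapped-in variant, not the pattern from the answers above), since a negated class cannot run past the closing quote. The names SRC_RE and extract_src are made up for illustration, and the sample line is a trimmed version of the string in the question.

```python
import re

# Trimmed sample based on the question; the real string continues after src="...".
line = ('"Grand Theft Auto V (5)" border="0" '
        'src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg"')

# [^"]* matches everything up to (not including) the next double quote.
SRC_RE = re.compile(r'src="([^"]*)"')


def extract_src(text):
    """Return the value of the first src="..." in text, or None if absent."""
    match = SRC_RE.search(text)
    return match.group(1) if match else None


print(extract_src(line))
```

Using search() rather than findall() stops at the first hit, which avoids building a list when only one src value is expected per line.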
I am trying to write a regex which will strip away the rest of the path after a particular folder name.
If Input is:
/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs
Output should be:
/Repository/Framework/PITA/branches/ChangePack-6a7B6
Some constraints:
ChangePack- will be followed by the change pack ID, which is a mix of digits and letters (a-z or A-Z only) in any order. There is no limit on the length of the change pack ID.
ChangePack- is a constant. It will always be there.
The text before ChangePack can also vary. For example, it can be:
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces
My regex-fu is bad. What I have come up with till now is:
^(.*?)\-6a7B6
I need to make this generic.
Any help will be much appreciated.
Below regex can do the trick.
^(.*?ChangePack-[\w]+)
Input:
/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces
Output:
/Repository/Framework/PITA/branches/ChangePack-6a7B6
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6
Check out the live regex demo here.
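In case it helps, a minimal Python sketch of the same idea, assuming the pattern is applied to one path at a time. It uses [A-Za-z0-9]+ rather than \w+ so that underscores are not accepted, matching the stated constraint. CHANGE_PACK_RE and trim_to_change_pack are hypothetical names.

```python
import re

# Lazily consume the leading folders, then require ChangePack- plus an
# alphanumeric ID of any length. Everything after the ID is left unmatched.
CHANGE_PACK_RE = re.compile(r'^(.*?ChangePack-[A-Za-z0-9]+)')


def trim_to_change_pack(path):
    """Return the path up to and including the ChangePack folder, or None."""
    match = CHANGE_PACK_RE.match(path)
    return match.group(1) if match else None


paths = [
    "/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs",
    "/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces",
]
for p in paths:
    print(trim_to_change_pack(p))
```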
^(.*?ChangePack-[a-zA-Z0-9]+)
Try this. Instead of replacing, grab the match with $1 or \1. See the demo:
https://regex101.com/r/iY3eK8/17
Will you always have '/Repository/Framework/PITA/branches/' at the beginning? If so, this will do the trick:
/Repository/Framework/PITA/branches/\w+-\w*
Instead of regex you could use the split and join functions. Example in Python:
path = "/a/b/c/d/e"
folders = path.split("/")
newpath = "/".join(folders[:3]) #trims off everything from the third folder over
print(newpath) #prints "/a/b"
If you really want regex, try something like ^.*\/folder\/ where folder is the name of the directory you want to match.
I have the following html string:
F.V.Adamian, G.G.Akopian
I want to form a single plain text string with the author names so that it looks something like (I can fine tune the punctuation later):
F.V.Adamian, G.G.Akopian.
I'm trying to use 'regexp' in Matlab. When I do the following:
regexpi(htmlstring,'">.*</a>','match')
I get:
">F.V.Adamian</a>, G.G.Akopian,
Why? I'm trying to get it to continuously output (hence I did not use the 'once' option) all characters between "> and </a>, which is the author's name. It works fine for the first one but not for the second. I am happy to truncate the "> and </a> with a regexprep(regexpstring,'','') later.
I see that regexprep(htmlstr, '<.*?>','') works and does what I want. But I don't get it...
In .*? the ? tells the .* to be lazy as opposed to greedy. By default, .* will try to match the largest thing it can, so your pattern runs from the first "> all the way to the last </a>. When you add the ? it instead goes for the smallest thing it can.
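The difference is easy to see side by side. This is a Python sketch rather than Matlab, and the markup is hypothetical: the question's original HTML is not shown, so the href values below are invented purely for illustration.

```python
import re

# Hypothetical markup; the href values are made up for this example.
html = '<a href="/authors/1">F.V.Adamian</a>, <a href="/authors/2">G.G.Akopian</a>'

# Greedy: .* runs to the LAST </a>, so both authors end up in one match.
greedy = re.findall(r'">.*</a>', html)

# Lazy: .*? stops at the FIRST </a> after each ">, one match per author.
lazy = re.findall(r'">(.*?)</a>', html)

# Lazy tag stripping, the same idea as regexprep(htmlstr, '<.*?>', '').
plain = re.sub(r'<.*?>', '', html)

print(greedy)
print(lazy)
print(plain)
```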
I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using c#.
Here is pseudocode for how it might look:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I cannot split these up normally.
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(@"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); //How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.
That is not parsable using a regular expression; instead you have to use a recursive descent parser. Map it to JSON by replacing:
{ with [
} with ]
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.
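A rough Python sketch of that mapping, using json.loads in place of json_decode. It assumes the whole input is a single brace group and that every option matches \w+ (no spaces or punctuation inside words); spintax_to_lists is a made-up name.

```python
import json
import re


def spintax_to_lists(text):
    """Turn {{a|b}|{c|d}} into nested Python lists via a JSON round trip."""
    # Quote every word first, while the text still has no JSON punctuation.
    quoted = re.sub(r'\w+', lambda m: json.dumps(m.group(0)), text)
    # Then swap the spintax delimiters for their JSON counterparts.
    as_json = quoted.replace('{', '[').replace('}', ']').replace('|', ',')
    return json.loads(as_json)


print(spintax_to_lists("{{word1|word2}|{word3|word4}}"))
```

Once the structure is a nested list, picking a random option at each level is a short recursive function.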
I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
@"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.
s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print(spin(s))
If you want to use the [square|brackets|syntax] use this line in the process function:
'/\[(((?>[^\[\]]+)|(?R))*)\]/x',
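The spin function called above is not shown, so here is a minimal Python sketch of one way it might be written. Python's built-in re module has no (?R) recursion, so this swaps in a different technique: it repeatedly replaces the innermost {...} groups until none are left. INNERMOST and the rng parameter are names invented for this example.

```python
import random
import re

# An innermost group is a pair of braces with no other braces inside it.
INNERMOST = re.compile(r'\{([^{}]*)\}')


def spin(text, rng=random):
    """Resolve nested {a|b|{c|d}} groups from the inside out."""
    while True:
        # Replace every innermost group with one randomly chosen option.
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split('|')), text)
        if count == 0:
            # No groups left (or only unbalanced braces remain).
            return text


print(spin("{{word1|word2}|{word3|word4}} is {fun|enjoyable}!"))
```

Passing a seeded random.Random instance as rng makes the output repeatable, which is handy for testing.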
I need some help with this regex magic.
I have this:
delete
and this:
(<a)*([^>]*>)[^<]*(</a>)
$1 = <a
$2 = href="/en/node/1032/delete?destination=node%2F5%2Fblog">
$3 = </a>
I need some additional strings:
1032
href="/en/ (the "en" part is dynamic!)
How can I get these strings?
This is used in PHP.
Your sample could be captured with
(<a)\b.*?((href="/en/).*?(?<=/)(\d+)/.*?").*?>.*?(</a>)
...but perhaps replacing the "en" with something broader, depending on what you want to capture.
HOWEVER, and I want to emphasize this, don't use regex to parse HTML. The above regex won't work for certain HTML-valid input, and due to the limitations of regex it cannot be refined to work for every possible case. You'll get better, more correct results with an HTML or XML parser.
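To illustrate the parser route, here is a sketch in Python (not PHP) using the standard library's html.parser. It assumes the links look like the sample, /<lang>/node/<id>/..., which is a guess based on the one href shown; LinkCollector, collect_links and HREF_RE are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Assumed URL layout: /<language>/node/<numeric id>/...
HREF_RE = re.compile(r'^/([^/]+)/node/(\d+)/')


class LinkCollector(HTMLParser):
    """Collect href, language prefix and node id from every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        match = HREF_RE.match(href)
        if match:
            self.links.append(
                {'href': href, 'lang': match.group(1), 'node_id': match.group(2)})


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<a href="/en/node/1032/delete?destination=node%2F5%2Fblog">delete</a>'
print(collect_links(sample))
```

The parser deals with attribute order, quoting and whitespace, so the regex only has to understand the URL itself.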
Use ([^/ ]+). That will give you
href="
en
node
1032
Let's say I need to get a string inside some h1, h2, or h3 tags:
/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/
This works great if the user decides to take a sane approach to headers:
<h1>My Header</h1>
But knowing my users, they want bold, italic, underlined h1's. And they have that coding quagmire TinyMCE to help them do it. TinyMCE would output:
<h1><b><span style='text-decoration: underline'><i>My Hideous Header</i></span></b></h1>
So my question is:
How do I get a string inside h1, h2, or h3, and then inside any number of other surrounding tags as well?
Thanks,
Joe
/<(h[1-3])[^>]*>(?:.*?>)?([^<]+)(?:<.*?)?<\/\1>/i
It will not be too hard to make cases that break it hideously, since (as I'm sure people will tell you) parsing HTML is a job for an HTML parser, not a regex, but it works for your given case and various similar ones.
If you're in PHP you can use your regex:
/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/
then pass the captured result through the strip_tags() function to get rid of all the insanity inside.
If you are not on PHP you can pass the result through a regex replace that removes tags. Something like replacing
/<\/?[^>]+?>/
with an empty string.
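The two-step approach (capture the header body, then strip any tags inside it) could be sketched in Python like this. The header pattern is a slight variation on the one above: it uses a backreference so the closing tag has to match the opening one. HEADER_RE, TAG_RE and header_texts are names made up for the example.

```python
import re

# Step 1: capture everything between <hN ...> and the matching </hN>.
HEADER_RE = re.compile(r'<(h[1-3])\b[^>]*>(.*?)</\1>', re.IGNORECASE | re.DOTALL)
# Step 2: the tag-removal pattern, applied to the captured body only.
TAG_RE = re.compile(r'</?[^>]+?>')


def header_texts(html):
    """Return the plain text of every h1-h3 header found in html."""
    return [TAG_RE.sub('', m.group(2)).strip() for m in HEADER_RE.finditer(html)]


messy = ("<h1><b><span style='text-decoration: underline'>"
         "<i>My Hideous Header</i></span></b></h1>")
print(header_texts(messy))
```

As with the other answers, this is fine for editor output like the sample but is not a substitute for a real HTML parser on arbitrary input.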
If you only want to capture the innermost text, you could skip over all the tags inside the header tag with:
/<([hH][1-3])[^>]*>(?:<[^>]*>)*([^<]+).*?<\/\1>/
Untested, but I think it should work.