How to find and replace in a regex code - regex

I am trying to find and replace in a regex code
<div class="gallery-image-container">
<div jstcache="1116"
class="gallery-image-high-res loaded"
style="width: 396px;
height: 264px;
background-image: url("https://lh5.googleusercontent.com/p/AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no");
background-size: 396px 264px;"
jsan="7.gallery-image-high-res,7.loaded,5.width,5.height,5.background-image,5.background-size">
</div>
</div>
In the code above I used this regex
(https:\/\/[^&]*)
to extract this URL:
https://lh5.googleusercontent.com/p/AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no
I used the regex s\d{3} to get s396.
Now I want to replace s396 with s1000 in the URL.
I am stuck and don't know how to go about it.
Is there any way to do all of this with a single regex instead of several?

I would suggest using an HTML parser, but I understand sometimes that is not possible. Here is a little example in python.
import re
data = '''
<div class="gallery-image-container">
<div jstcache="1116"
class="gallery-image-high-res loaded"
style="width: 396px;
height: 264px;
background-image: url("https://lh5.googleusercontent.com/p/AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no");
background-size: 396px 264px;"
jsan="7.gallery-image-high-res,7.loaded,5.width,5.height,5.background-image,5.background-size">
</div>
</div>
'''
match = re.search("(https?://[^&]+)", data)
url = match.group(1)
url = re.sub(r"s\d{3}", "s1000", url)
print(url)
The key part is the regex
(https?://[^&]+)
It uses a negated character class: look for http with an optional s, followed by ://, and then one or more characters that are not &. You can use this site to play around with regexes:
https://regex101.com/r/b0APFA/1
I'm sure you could do a clever 1 liner nested regex to find and replace all at once, but it's going to be easier to troubleshoot if you have a few lines.
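For reference, here is a minimal sketch of the single-pattern variant mentioned above. It assumes the style attribute contains literal double quotes around the URL (not &quot; entities) and that the size marker always appears as =s followed by digits; the variable names are made up for illustration.

```python
import re

# Sample input: assumes the attribute holds literal double quotes around the URL.
style = ('background-image: url("https://lh5.googleusercontent.com/p/'
         'AF1QipMcTfMPZj_d5iip9WKtN2SQB9Je5U4rRB0nT_t8=s396-k-no");')

# Group 1: the URL up to and including "=s"; \d+: the old size; group 2: the rest.
pattern = re.compile(r'(https?://[^"&\s]+?=s)\d+([^"&\s]*)')

match = pattern.search(style)
# \g<1> keeps the group reference separate from the literal digits 1000.
new_url = match.expand(r'\g<1>1000\2') if match else None
print(new_url)
```

The pattern both locates the URL and marks the part to swap, so one search plus one expand replaces the two-step search and sub. It is harder to debug than the multi-line version, which is the trade-off described above.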

Related

Match only specific url via regex

I want to match only this specific url
https://www.facebook.com/princessaustine.alcantara.3/about?lst=100002159119314%3A100022260619396%3A1507039852
Here's the source code
<div class="hidden_elem"><code id="u_0_17"><!-- <div class="fbTimelineTopSectionBase _6-d _529n"><div class="_5h60" id="pagelet_above_header_timeline" data-referrer="pagelet_above_header_timeline"></div><div id="above_header_timeline_placeholder"></div><div class="fbTimelineSection fbTimelineTopSection"><div id="fbProfileCover"><div class="cover" id="u_0_13"><a class="coverWrap coverImage" data-referrerid="100022260619396" href="https://www.facebook.com/photo.php?fbid=118243868927633&set=a.117907638961256.1073741827.100022260619396&type=3" rel="theater" ajaxify="https://www.facebook.com/photo.php?fbid=118243868927633&set=a.117907638961256.1073741827.100022260619396&type=3&size=1440%2C1080&source=10&player_origin=profile&referrer_profile_id=100022260619396" data-ploi="https://scontent.fmnl4-1.fna.fbcdn.net/v/t31.0-8/22136852_118243868927633_2950847275004458372_o.jpg?oh=fbcb3c8abc2023b35a5a36fb2989d850&oe=5A821DA8" title="Cover Photo" id="u_0_12" data-cropped="1"><img class="coverPhotoImg photo img" src="https://scontent.fmnl4-1.fna.fbcdn.net/v/t31.0-8/c0.81.851.315/p851x315/22136852_118243868927633_2950847275004458372_o.jpg?oh=7d0222f3c38b31acb33a7b1ffba2ac9e&oe=5A797385" style="top:0px;width:100%" data-fbid="118243868927633" alt="Cover Photo, Image may contain: 1 person, sitting" /><div class="coverBorder"></div><img class="coverChangeThrobber img" src="https://static.xx.fbcdn.net/rsrc.php/v3/yk/r/LOOn0JtHNzb.gif" alt="" width="16" height="16" /></a><div class="_2nlj _2xc6"><h1 class="_2nlv"><a class="_2nlw" href="https://www.facebook.com/princessaustine.alcantara.3"><span id="fb-timeline-cover-name" data-testid="profile_name_in_profile_page">Princess Austine Alcantara</span></a><span class="_2nly"></span></h1></div></div><div id="fbTimelineHeadline" class="clearfix"><div class="_50zj"><div class="actions _70j"><div class="_5h60 actionsDropdown" id="pagelet_timeline_profile_actions" data-referrer="pagelet_timeline_profile_actions"></div></div></div><div 
class="_70k"><ul class="_6_7 clearfix" data-referrer="timeline_light_nav_top" id="u_0_14"><li><a class="_6-6 _6-7" href="https://www.facebook.com/princessaustine.alcantara.3?lst=100002159119314%3A100022260619396%3A1507039852" data-tab-key="timeline">Timeline<span class="_513x"></span></a></li><li><a class="_6-6" href="https://www.facebook.com/princessaustine.alcantara.3/about?lst=100002159119314%3A100022260619396%3A1507039852" data-tab-key="about">About<span class="_513x"></span></a></li><li><a class="_6-6" href="https://www.facebook.com/princessaustine.alcantara.3/friends?lst=100002159119314%3A100022260619396%3A1507039852&source_ref=pb_friends_tl" data-tab-key="friends">Friends<span class="_gs6"><span id="u_0_10">7 Mutual</span></span><span class="_513x"></span></a></li><li><a class="_6-6" href="https://www.facebook.com/princessaustine.alcantara.3/photos?lst=100002159119314%3A100022260619396%3A1507039852&source_ref=pb_friends_tl" data-tab-key="photos">Photos<span class="_513x"></span></a></li><li><div class="_6a uiPopover _6-6 _9rx" id="u_0_15"><a class="_9ry _p" href="#" aria-haspopup="true" aria-expanded="false" rel="toggle" role="button" id="u_0_16">More<i class="_bxy img sp_AWfL8SqGWNa sx_41c408"></i></a></div></li></ul></div><div class="name"><div class="photoContainer"><div><a class="profilePicThumb" href="https://www.facebook.com/photo.php?fbid=116140922471261&set=a.116141002471253.1073741826.100022260619396&type=3&source=11&referrer_profile_id=100022260619396" rel="theater" id="u_0_11"><img class="profilePic img" alt="Princess Austine Alcantara's Profile Photo, Image may contain: 1 person, smiling, closeup" src="https://scontent.fmnl4-1.fna.fbcdn.net/v/t1.0-1/c0.0.160.160/p160x160/22050231_116140922471261_8103110572544919612_n.jpg?oh=d942ae339c7c9dc7c8add2e3dd34f6c4&oe=5A413CB6" /></a></div><meta 
content="https://scontent.fmnl4-1.fna.fbcdn.net/v/t1.0-1/p50x50/22050231_116140922471261_8103110572544919612_n.jpg?oh=e43d8f6e5cfb1387f1a5d864b7947225&oe=5A3CC115" itemprop="image" /></div></div></div></div></div>
I tried the regex below, but it also matches other items inside the markup. How can I match only that specific URL? Thanks.
The class is dynamic.
(?i)(?<=a class=".+" href=").*?(?=" data-tab-key="about)
If you want to match the href, you can use [^"]+ for its value; that way your regex will not capture more than you need, because the match stops at the closing ".
You can then build something like href="([^"]*?)" data-tab-key="about".
I'd suggest avoiding regex for matching HTML, though.
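As a rough illustration of the capture-group pattern suggested above, here is a short Python sketch. The html string is a trimmed copy of two links from the question, and the variable names are only for this example.

```python
import re

# Two of the tab links from the question, trimmed for readability.
html = (
    '<li><a class="_6-6 _6-7" href="https://www.facebook.com/princessaustine.alcantara.3'
    '?lst=100002159119314%3A100022260619396%3A1507039852" data-tab-key="timeline">Timeline</a></li>'
    '<li><a class="_6-6" href="https://www.facebook.com/princessaustine.alcantara.3/about'
    '?lst=100002159119314%3A100022260619396%3A1507039852" data-tab-key="about">About</a></li>'
)

# [^"]* cannot cross a double quote, so the capture stays inside one href value.
pattern = re.compile(r'href="([^"]*)" data-tab-key="about"')

match = pattern.search(html)
about_url = match.group(1) if match else None
print(about_url)
```

Because the class attribute is not part of the pattern, it does not matter that the class names are dynamic.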
Try..
(?i)a class=".+" href="\K.*?(?=" data-tab-key="about)
I believe you are struggling to get a variable-length lookbehind to work, which is
(?<=a class=".+" href=")
The .+ above is not valid there, because it makes the lookbehind variable-length. Most regex engines, PCRE included, do not support this (.NET is one exception).
That said, to emulate a variable-length lookbehind you can use the \K escape, which resets the starting point of the match to the current position (thereby dropping everything matched before it from the final match).
Demo regex is here.
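To see the lookbehind restriction in practice, here is a small sketch using Python's built-in re module, which is one of the engines that rejects variable-width lookbehind and also has no \K. The workaround shown is a plain capture group, swapped in as a substitute for \K; the sample link is shortened from the question.

```python
import re

# Python's re module refuses to compile a lookbehind that contains .+
try:
    re.compile(r'(?<=a class=".+" href=").*?(?=" data-tab-key="about)')
    lookbehind_supported = True
except re.error as exc:
    lookbehind_supported = False
    print("compile failed:", exc)

# Substitute technique: match the prefix normally and capture only the URL.
link = '<a class="_6-6" href="https://www.facebook.com/princessaustine.alcantara.3/about" data-tab-key="about">'
match = re.search(r'a class="[^"]*" href="([^"]*)" data-tab-key="about', link)
captured = match.group(1) if match else None
print(captured)
```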

Regex to get src of iframe element

I am trying to retrieve part of the src from different iframes from an HTML input.
So far, I've tried different methods but none of them works for all iframes. What I've tried so far:
<iframe(.*?)><\/iframe>
<iframe src="(.+?)".+</iframe>
<iframe.+?src=[\"'](.+?)[\"'].*?>
And here is a sample of iframe tags that I have:
<iframe src="http://www.youtube.com/embed/NM51qOpwcIM?modestbranding=1;rel=0;showinfo=0;autoplay=0;autohide=1;yt:stretch=16:9;wmode=transparent;?wmode=transparent" allowfullscreen="" style="width: 640px; height: 361.057px;" frameborder="0"></iframe>
<iframe src="https://www.youtube.com/embed/VASywEuqFd8?feature=oembed" allowfullscreen="" width="660" height="371" frameborder="0"></iframe>
Ideally, I would like to retrieve the src from the beginning and just before the first question mark (?) as such:
http://www.youtube.com/embed/NM51qOpwcIM
This can be achieved using
(?<=src=").*?(?=[\?"])
See working example on Regex101
Explanation
(?<=src=") Preceded by src="
.*? Lazy match any token
(?=[\?"]) Until either a ? or " would be the next token
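If the URL might not contain a ?, the pattern above still works, because the lookahead also stops at the closing ". The variant below stops at a literal * or " instead, so it returns the whole src value, query string included: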
(?<=src=").*?(?=[\*"])
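A quick way to check the first pattern against the two sample iframes is the Python sketch below. The lookbehind here has a fixed width (src="), so Python's re module accepts it; the iframe strings are shortened copies of the samples in the question.

```python
import re

iframes = (
    '<iframe src="http://www.youtube.com/embed/NM51qOpwcIM?modestbranding=1;rel=0" '
    'allowfullscreen="" frameborder="0"></iframe>\n'
    '<iframe src="https://www.youtube.com/embed/VASywEuqFd8?feature=oembed" '
    'allowfullscreen="" width="660" height="371" frameborder="0"></iframe>'
)

# Lazy match after src=" that stops before the first ? or closing quote.
pattern = re.compile(r'(?<=src=").*?(?=[?"])')

sources = pattern.findall(iframes)
print(sources)
```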

RegExp replace all but selected

I'm trying to erase everything except the matched text in this 1900-line document with Notepad++ RegExp Find/Replace, so that only the file names remain, which should shorten it to under about 1000 lines. I know the pattern that selects the text, (?<=/images/item/)(.*)(?=" a), but I don't know how to make it erase everything that doesn't match. Here's a portion of the document.
Using Notepad++, it would find and select abyssal-scepter.gif, aegis-of-the-legion.gif, etc.
<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br> <div id="id_77" class="tier-wrapper drag-items health magic-resist health-regen champ-box float-left ajax-tooltip {t:'Item',i:'77'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-advanced filter-bonus-aura filter-category-health filter-category-magic-resist filter-category-health-regen ui-draggable ui-draggable-handle">
<img src="/images/item/aegis-of-the-legion.gif" alt="LoL Item: Aegis of the Legion"><br> <div id="id_235" class="tier-wrapper drag-items ability-power movement champ-box float-left ajax-tooltip {t:'Item',i:'235'} filter-tier-advanced filter-bonus-unique-passive filter-category-ability-power filter-category-movement ui-draggable ui-draggable-handle">
<img src="/images/item/aether-wisp.gif" alt="LoL Item: Aether Wisp"><br>
<div class="info">
<div class="champ-name">Aether Wisp</div>
<div class="champ-sub">
<img src="/images/gold.png" alt="Item Cost" style="width:16px; vertical-align:middle;"> 850 / 415
</div>
</div>
</div>
<div id="id_21" class="tier-wrapper drag-items ability-power champ-box float-left ajax-tooltip {t:'Item',i:'21'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-basic filter-category-ability-power ui-draggable ui-draggable-handle">
<img src="/images/item/amplifying-tome.gif" alt="LoL Item: Amplifying Tome"><br>
<div class="info">
<div class="champ-name">Amplifying Tome</div>
<div class="champ-sub">
I'm not familiar with RegExp, so to summarize, I need it to look like this at the end of it.
abyssal-scepter.gif
aegis-of-the-legion.gif
aether-wisp.gif
amplifying-tome.gif
Thank you for your time
A Notepad++ solution:
Find what : .*?/images/item/(.*?)"|.*
Replace with : $1\n
Search mode : Regular expression (with ". matches newline" checked)
The result will have an extra linefeed at the end.
But that shouldn't pose a problem I suppose.
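If running a script is an option, the same extraction can be sketched outside Notepad++ in a few lines of Python. This is only an illustration under that assumption; the html string is a shortened sample of the document from the question.

```python
import re

html = '''<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br>
<img src="/images/gold.png" alt="Item Cost" style="width:16px;"> 850 / 415
<img src="/images/item/aether-wisp.gif" alt="LoL Item: Aether Wisp"><br>'''

# Capture whatever follows /images/item/ up to the closing quote of the src value.
file_names = re.findall(r'/images/item/([^"]+)"', html)

# One file name per line, like the desired output in the question.
print("\n".join(file_names))
```

Collecting the matches directly avoids having to describe "everything else" in the pattern, which is what the |.* alternative does in the Find/Replace version.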
Maybe this can help, or maybe not, since you dropped the JavaScript tag from your original post.
<script type="text/javascript">
var thestring = "<img src=\"/images/item/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
var thestring2 = "<img src=\"/images/otherstuff/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
// Return the file name that follows /images/item/, or "" if there is none.
function ParseIt(incomingstring) {
    // [^"]* stops at the closing quote, so the capture cannot run past the src value.
    var pattern = /"\/images\/item\/([^"]*)"/;
    var match = pattern.exec(incomingstring);
    return match ? match[1] : "";
}
</script>
Calling ParseIt(thestring) returns "aegis-of-the-legion.gif".
Calling ParseIt(thestring2) returns "".
Since you are doing this in NP++, here is what works for me. In cases like this, where speed and results matter more than a specific technique, I usually run several regexes. First, search for > and replace it with >\n, which puts each tag on its own line for simpler processing. Then replace ^>*<.*?".*?/?([\w\d\-_]+\.\w{2,4})?".*>.*$ with $1 to extract the file names from the tags and remove the unneeded text. Next, to clear the tags that had no file name in them, replace <.*> with an empty string. Finally, use Edit > Line Operations > Remove Empty Lines, and you'll have the result you're looking for. It's not a 100% regex solution, but this is a one-time action that only needs a simple result.

How can I use non-ASCII characters?

I am using Scrapy and XPath to parse a web site in Russian.
In this topic, alecxe showed me how to construct the XPath expression to get the values. However, I don't understand how to handle the case when Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't state the question properly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the web site I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file. Cyrillic characters cannot be represented in latin-1, so save the file as UTF-8 and declare utf-8. See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import html
markup = """<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a Unicode string, the u at the beginning of the XPath is necessary in Python 2 (in Python 3, every string literal is already Unicode), same as @warwaruk's comment on your question.
Let us know if this helps.
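For readers on Python 3, here is a small sketch of the same idea using only the standard library. It swaps in xml.etree.ElementTree for lxml, so it is not the code from this answer: ElementTree has no following-sibling axis, and the text after the label is read from the element's tail instead. The markup is a shortened, well-formed version of the snippet, and the [.='text'] predicate assumes Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the snippet with a Cyrillic label.
markup = """<div class="obj-params-col">
<p><b>Некий текст</b>" Param1_value"</p>
<p><strong>Param2_name</strong>" Param2_value"</p>
</div>"""

root = ET.fromstring(markup)

# The non-ASCII label goes straight into the path expression; no u prefix is needed.
label = root.find(".//b[.='Некий текст']")

# The text that follows the <b> element is stored in its tail.
value = label.tail.strip() if label is not None and label.tail else None
print(value)
```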
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
# Each listing lives in a div with class obj-left.
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
    name = div.xpath("./h3/span/a/text()")[0]
    # The first parameter column holds the room, square, and floor values.
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]
    # Python 2 print statements, as in the first example.
    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
These don't all print cleanly on my end (I get some [Decode error - output not utf-8]). However, I believe that, encoding aside, this approach is much better scraping practice overall.
Let us know what you think.

Matching text that is not html tags with regular expression

So I am trying to create a regular expression that matches text inside different kinds of html tags. It should match the bold text in both of these cases:
<div class="username_container">
<div class="popupmenu memberaction">
<a rel="nofollow" class="username offline " href="http://URL/surfergal.html" title="Surfergal is offline"><strong><!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end --></strong></a>
</div>
<div class="username_container">
<span class="username guest"><b><a>**Advertisement**</a></b></span>
</div>
I have tried with the following regular expression without any result:
/<div class="username_container">.*?((?<=^|>)[^><]+?(?=<|$)).*?<\/div>/is
This is my first time posting here on stackoverflow so if I am doing something incredibly stupid I can only apologize.
Using regex to parse html is.. hard. See the links in the comments to your question.
What do you plan to do with these matches? Here's a quick jQuery script that logs the results in the console:
var a = [];
$('strong, b').each(function () {
    a.push($(this).html());
});
console.log(a);
results:
["<!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end -->", "<a>**Advertisement**</a>"]
http://jsfiddle.net/Mk7xf/
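If jQuery is not available, a parser-based approach can also be sketched in Python with the standard library's html.parser. This is an alternative illustration, not the script above: the class name BoldTextCollector is made up for this example, and the sample markup is a shortened copy of the question's HTML. Unlike .html(), it collects only the text nodes, so the comments and the inner <a> tag are left out.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collect the text found inside <strong> or <b> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <strong>/<b> elements are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("strong", "b") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Comments arrive through handle_comment, so they never reach this method.
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())


sample = """<div class="username_container">
<a class="username offline" href="http://URL/surfergal.html"><strong><!-- google_ad_section_start(weight=ignore) -->Surfergal<!-- google_ad_section_end --></strong></a>
</div>
<div class="username_container">
<span class="username guest"><b><a>Advertisement</a></b></span>
</div>"""

collector = BoldTextCollector()
collector.feed(sample)
print(collector.texts)
```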