iMacros: extracting data from a range of elements

Hi, here is what my page looks like:
<div class="Bango 1 Beamer Beamer-1"> Beamer </div>
<div class ="menu1"> menu1 </div>
<div class ="menu2"> menu2 </div>
<div class ="menu3"> menu3 </div>
<div class ="menu4"> menu4 </div>
<div class="Bango 1 Beamer Beamer-2"> Beamer2 </div>
<div class ="menu1"> menu21 </div>
<div class ="menu2"> menu22 </div>
<div class ="menu3"> menu23 </div>
<div class ="menu4"> menu24 </div>
<div class="Bango 1 Beamer Beamer-3"> Beamer3 </div>
<div class ="menu1"> menu31 </div>
<div class ="menu2"> menu32 </div>
<div class ="menu3"> menu33 </div>
<div class ="menu4"> menu34 </div>
How can I extract only the elements under Beamer-1? Note that the number of elements in this group may vary from time to time. Thanks.

I suggest solving this issue with a number of pseudo-URLs:
' get bounds
URL GOTO=javascript:{var<SP>doc=window.document;var<SP>els=doc.getElementsByTagName("div");for(i=0;i<els.length;i++){var<SP>b=(els[i].outerHTML.match("Beamer-1"))<SP>?<SP>(i+1)<SP>:<SP>b;var<SP>e=(els[i].outerHTML.match("Beamer-2"))<SP>?<SP>i<SP>:<SP>e;}}
' set extract
URL GOTO=javascript:{var<SP>ext="";for(i=b;i<e;i++){ext+=els[i].innerHTML.trim()+((i==e-1)<SP>?<SP>""<SP>:<SP>"[EXTRACT]");}undefined;}
' create dummy element
URL GOTO=javascript:{var<SP>elt=doc.createElement("input");elt.type="hidden";elt.id="myHiddenExtract";elt.value=ext;doc.getElementsByTagName("html")[0].appendChild(elt);undefined;}
' get extract
TAG POS=1 TYPE=INPUT ATTR=ID:myHiddenExtract EXTRACT=TXT
' remove dummy element
URL GOTO=javascript:{doc.getElementsByTagName("html")[0].removeChild(doc.getElementById("myHiddenExtract"));undefined;}
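For anyone who wants to check the bounds logic outside iMacros, here is a minimal Python sketch of the same idea: walk the divs in document order, start collecting after the Beamer-1 marker and stop at the next Beamer-N marker. The names DivTextCollector and extract_group are hypothetical helpers, the sample page is a shortened copy of the question's markup, and the sketch assumes the divs are flat (not nested), as they are in the question.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect [class attribute, text] pairs for every <div>, in document order."""

    def __init__(self):
        super().__init__()
        self.divs = []      # one [class_attr, text] entry per <div>
        self._open = False  # True while we are inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append([dict(attrs).get("class") or "", ""])
            self._open = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.divs[-1][1] += data


def extract_group(html, start_marker, group_prefix="Beamer-"):
    """Return the text of the divs between start_marker and the next group marker."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    result, inside = [], False
    for class_attr, text in parser.divs:
        classes = class_attr.split()
        if start_marker in classes:
            inside = True          # the group starts after this div
        elif inside and any(c.startswith(group_prefix) for c in classes):
            break                  # the next group header ends the range
        elif inside:
            result.append(text.strip())
    return result


# Shortened copy of the page from the question (assumed flat structure).
page = """
<div class="Bango 1 Beamer Beamer-1"> Beamer </div>
<div class="menu1"> menu1 </div>
<div class="menu2"> menu2 </div>
<div class="Bango 1 Beamer Beamer-2"> Beamer2 </div>
<div class="menu1"> menu21 </div>
"""

print(extract_group(page, "Beamer-1"))
```

Because the loop stops at whatever group header comes next, it does not depend on how many menu divs the group contains, and it also works for the last group on the page.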


Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I tried to get it with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got was None.
The problem is that part of the class (_3X58t) keeps changing.
This is likely due to the ^ anchor: the class string begins with the changing token (_3X58t), so a pattern anchored at the start cannot match. We could drop the anchor:
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
or we might try this expression for the divs:
(.+?selectable-text invisible-space copyable-text)
I would first check whether a single class from the compound class list could be used, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes with an attribute "ends with" selector
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')
rather than resorting to regex.
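To see the anchor problem in isolation, here is a small sketch using only the standard re module. It assumes the pattern ends up being tested against the individual class tokens and against the full space-joined class string, which is my understanding of how Beautiful Soup treats a compiled pattern passed as class_; the variable names are made up for the illustration.

```python
import re

# Class attribute of the inner <div> from the question; the first token
# (_3X58t) is the part that changes.
class_attr = "_3X58t selectable-text invisible-space copyable-text"

anchored = re.compile(r"^selectable-text invisible-space copyable-text")
unanchored = re.compile(r"selectable-text invisible-space copyable-text")
suffix = re.compile(r"selectable-text invisible-space copyable-text$")

# The anchored pattern fails on the whole string, because the string
# starts with the changing token rather than with "selectable-text".
print(anchored.search(class_attr) is None)

# It also fails on every single token, since no token contains spaces.
print(any(anchored.search(token) for token in class_attr.split()))

# Dropping the anchor, or anchoring at the end instead, both match.
print(unanchored.search(class_attr) is not None)
print(suffix.search(class_attr) is not None)
```

The end-anchored pattern is the regex counterpart of the [class$="..."] selector: both only require the stable classes to sit at the end of the attribute.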

Extract information from all matching nodes without looping xpath

<ul class="products-grid">
<li class="item">
<div class="product-block">
<div class="product-block-inner">
<img src="#/producta.jpg">
<h2 class="product-name">Product A</h2>
<div class="price-box">
<span class="regular-price" id="#">
<span class="price">Rs 1,849</span>
</span>
</div>
</div>
</div>
</li>
<li class="item">
<div class="product-block">
<div class="product-block-inner">
<img src="#/productb.jpg">
<h2 class="product-name">Product B</h2>
<div class="price-box">
<span class="regular-price" id="#">
<span class="price">Rs 1,849</span>
</span>
</div>
</div>
</div>
</li>
</ul>
At the moment I am scraping the items in a loop:
products = response.xpath('//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]').extract()
After getting the product-block-inner node, I save it into products and then I will have to loop like
for product in products:
    # parse the div.product-block-inner further down
    # to get name, price, image, etc.,
    # then save it to a dict and yield it
    pass
Is it possible to get the text and href for all div.product-block-inner nodes in the final list without looping?
Yes, but the result is confusing. For example, you could try this:
products = response.xpath(
    '//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]'
).css(
    '.product-name a::attr(href), .product-name a::text, .price::text'
).extract()
But I would suggest always looping (by the way, why call extract() when assigning to products? It returns plain strings, which cannot be queried further inside the loop):
products = response.xpath(
    '//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]'
)
for product in products:
    yield {
        'name': product.css('.product-name a::text').extract_first(),
        'url': product.css('.product-name a::attr(href)').extract_first(),
        'price': product.css('.price::text').extract_first(),
    }
(I've used css selectors in this case because the equivalent xpaths are longer, but the same can also be achieved using xpath)
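To illustrate why the flat extraction gets confusing, here is a rough sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (so this is not Scrapy's API). The markup is a trimmed copy of the question's HTML with two deliberate changes: the img tags are self-closed so the XML parser accepts them, and the second product's price is removed on purpose to show what happens when a field is missing.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup. Product B has no
# price here (a hypothetical case, not in the original page).
markup = """
<ul class="products-grid">
  <li class="item"><div class="product-block"><div class="product-block-inner">
    <img src="#/producta.jpg"/>
    <h2 class="product-name">Product A</h2>
    <div class="price-box"><span class="regular-price"><span class="price">Rs 1,849</span></span></div>
  </div></div></li>
  <li class="item"><div class="product-block"><div class="product-block-inner">
    <img src="#/productb.jpg"/>
    <h2 class="product-name">Product B</h2>
  </div></div></li>
</ul>
"""

root = ET.fromstring(markup)

# Flat extraction: one list per field. The lists have different lengths,
# so there is no way to tell which product the single price belongs to.
names = [h2.text for h2 in root.findall(".//h2[@class='product-name']")]
prices = [span.text for span in root.findall(".//span[@class='price']")]

# Per-node extraction: query each product block separately, so a missing
# field simply becomes None for that product.
products = [
    {
        "name": block.findtext("h2[@class='product-name']"),
        "image": block.find("img").get("src"),
        "price": block.findtext(".//span[@class='price']"),
    }
    for block in root.findall(".//div[@class='product-block-inner']")
]

print(names, prices)
print(products)
```

The per-node version still iterates, but each dict is built from a single product block, which is the property the loop in the answer gives you.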

jQuery regex: get several keys, not only one

I would like to get
PA-1400-11PA ADP-40PH ABA
Here is the HTML:
</div>
<div class="ref">
<h2 id='affiche_sous_titre'>eee :</h2> <p>
<a href='eee' title='PA-1400-11PA' class='lien_menu'>PA-1400-11PA</a> - <a href='uuu' title='ADP-40PH ABA' class='lien_menu'>ADP-40PH ABA</a> </p>
</div>
<div class="modele_tout">
</div>
<div class="star-customer">
Here is my regex code:
line=line.replace(/[\"\']lien_menu[\"\']>(.*?)<\/a>/ig,"$1\n")
But I only get
ADP-40PH ABA
What is the problem? I don't understand.
Thanks for your help.
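One way to sidestep the replace step entirely is to collect every capture instead of rewriting the string. The sketch below shows that idea in Python rather than in the JavaScript used in the question, reusing the question's pattern unchanged; it assumes the whole HTML fragment is available as one string, and it does not try to explain why the original replace call returned only one value.

```python
import re

# Fragment copied from the question.
html = """<div class="ref">
<h2 id='affiche_sous_titre'>eee :</h2> <p>
<a href='eee' title='PA-1400-11PA' class='lien_menu'>PA-1400-11PA</a> - <a href='uuu' title='ADP-40PH ABA' class='lien_menu'>ADP-40PH ABA</a> </p>
</div>"""

# Same pattern as in the question: the link text that follows the
# quoted lien_menu class, matched lazily up to the closing </a>.
pattern = re.compile(r"""["']lien_menu["']>(.*?)</a>""", re.IGNORECASE)

# findall returns the captured group of every match, in document order.
refs = pattern.findall(html)
print(" ".join(refs))
```

Collecting the captures leaves the surrounding markup out of the result, whereas a replace keeps everything that the pattern did not match.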

Diazo rule to append element before closing tag

I'm trying to find a diazo rule to append a new element in a container before its closing tag. For example:
Case 1
<div class="some-A">
<div class="some-B">1</div>
<div class="some-B">2</div>
<div class="some-B">3</div>
</div>
Case 1 - after rule applied
<div class="some-A">
<div class="some-B">1</div>
<div class="some-B">2</div>
<div class="some-B">3</div>
<div class="some-B">NEW</div>
</div>
Case 2
<div class="some-A">
</div>
Case 2 - after rule applied
<div class="some-A">
<div class="some-B">NEW</div>
</div>
I need to have it working for each case - with and without content in container.
None of these are ok:
<replace css:theme=".some-A">
<div class="some-A">
<div class="some-B">NEW</div>
</div>
</replace>
because it replaces everything.
<before css:theme=".some-A">
<div class="some-B">NEW</div>
</before>
because it inserts before my container.
<after css:theme=".some-A">
<div class="some-B">NEW</div>
</after>
because it inserts after it.
<after css:theme-children=".some-A">
<div class="some-B">NEW</div>
</after>

Nested floats do not work in CFDOCUMENT css

The HTML below was placed inside a <cfdocumentitem type="header"> block, but the output is empty.
<div class="grid">
<div class="span5">
<div class="span5">
Label1
</div>
<div class="span5">
Data1
</div>
</div>
<div class="span5">
<div class="span5">
Label2
</div>
<div class="span5">
Data2
</div>
</div>
<div style="clear:both"></div>
</div>
But when I remove the nested class="span5" divs and put some content there directly, it works fine. Is there a problem with nested floats in cfdocument?
Unfortunately, CSS support in CFDOCUMENT is kind of hit or miss.
Two rules to follow that might help:
Make sure your HTML validates as XHTML 1.0 Transitional.
Import your style sheets using:
<style type="text/css" media="screen">@import "style.css";</style>
This same information can be found here: http://rip747.wordpress.com/2007/09/10/cfdocument-it-works-if-you-know-how/