imacros extraction from a range of data

Hi, here is what my page looks like:
<div class="Bango 1 Beamer Beamer-1"> Beamer </div>
<div class ="menu1"> menu1 </div>
<div class ="menu2"> menu2 </div>
<div class ="menu3"> menu3 </div>
<div class ="menu4"> menu4 </div>
<div class="Bango 1 Beamer Beamer-2"> Beamer2 </div>
<div class ="menu1"> menu21 </div>
<div class ="menu2"> menu22 </div>
<div class ="menu3"> menu23 </div>
<div class ="menu4"> menu24 </div>
<div class="Bango 1 Beamer Beamer-3"> Beamer3 </div>
<div class ="menu1"> menu31 </div>
<div class ="menu2"> menu32 </div>
<div class ="menu3"> menu33 </div>
<div class ="menu4"> menu34 </div>
How can I extract only the elements under Beamer-1? Note that the number of elements in this group may vary from time to time. Thanks.

I suggest solving this issue with a number of pseudo-URLs:
' get bounds
URL GOTO=javascript:{var<SP>doc=window.document;var<SP>els=doc.getElementsByTagName("div");for(i=0;i<els.length;i++){var<SP>b=(els[i].outerHTML.match("Beamer-1"))<SP>?<SP>(i+1)<SP>:<SP>b;var<SP>e=(els[i].outerHTML.match("Beamer-2"))<SP>?<SP>i<SP>:<SP>e;}}
' set extract
URL GOTO=javascript:{var<SP>ext="";for(i=b;i<e;i++){ext+=els[i].innerHTML.trim()+((i==e-1)<SP>?<SP>""<SP>:<SP>"[EXTRACT]");}undefined;}
' create dummy element
URL GOTO=javascript:{var<SP>elt=doc.createElement("input");elt.type="hidden";elt.id="myHiddenExtract";elt.value=ext;doc.getElementsByTagName("html")[0].appendChild(elt);undefined;}
' get extract
TAG POS=1 TYPE=INPUT ATTR=ID:myHiddenExtract EXTRACT=TXT
' remove dummy element
URL GOTO=javascript:{doc.getElementsByTagName("html")[0].removeChild(doc.getElementById("myHiddenExtract"));undefined;}
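The index logic of the macro (start just after the Beamer-1 div, stop at the Beamer-2 div) may be easier to follow outside iMacros. Below is a minimal Python sketch of the same idea, not iMacros code: the helper name extract_group and the regex-based div scan are assumptions made for illustration, and the sketch only handles a flat list of one-line divs like the sample page. Unlike the macro, it falls back to the end of the list when the end marker is missing.

```python
import re

# Sample markup copied from the question: a flat list of sibling divs.
HTML = """
<div class="Bango 1 Beamer Beamer-1"> Beamer </div>
<div class ="menu1"> menu1 </div>
<div class ="menu2"> menu2 </div>
<div class ="menu3"> menu3 </div>
<div class ="menu4"> menu4 </div>
<div class="Bango 1 Beamer Beamer-2"> Beamer2 </div>
<div class ="menu1"> menu21 </div>
<div class ="menu2"> menu22 </div>
<div class ="menu3"> menu23 </div>
<div class ="menu4"> menu24 </div>
<div class="Bango 1 Beamer Beamer-3"> Beamer3 </div>
<div class ="menu1"> menu31 </div>
<div class ="menu2"> menu32 </div>
<div class ="menu3"> menu33 </div>
<div class ="menu4"> menu34 </div>
"""


def extract_group(html, start_marker, end_marker):
    """Return the text of the divs located between two marker divs."""
    # (class attribute, trimmed inner text) for every div, in document order.
    divs = re.findall(r'<div class\s*="([^"]*)">\s*(.*?)\s*</div>', html)
    classes = [cls.split() for cls, _ in divs]
    # b: index of the first div after the start marker.
    b = next(i for i, c in enumerate(classes) if start_marker in c) + 1
    # e: index of the end marker, or the end of the list if it is absent.
    e = next(
        (i for i, c in enumerate(classes) if i >= b and end_marker in c),
        len(divs),
    )
    return [text for _, text in divs[b:e]]


menus = extract_group(HTML, "Beamer-1", "Beamer-2")
# Join with the same separator the macro uses between extracted values.
print("[EXTRACT]".join(menus))
```

Because the bounds are computed from the marker positions, the group can contain any number of menu divs.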

Related

Why doesn't this regexp work for this HTML?

<div class="_1zGQT _2ugFP message-in">
<div class="-N6Gq">
<div class="copyable-text" data-pre-plain-text="[18:09, 3.6.2019] Лера сестра: ">
<div class="_12pGw">
<div class="_3X58t selectable-text invisible-space copyable-text">
<span class="_2ZDCk">
<img crossorigin="anonymous" src="URL" alt="😆" draggable="false" class="_298rb _2FANH selectable-text invisible-space copyable-text" data-plain-text="😆" style="visibility: visible;">
</span>
</div>
</div>
</div>
</div>
</div>
I tried to get the div with this code:
soup.find('div', class_=re.compile('^selectable-text invisible-space copyable-text'))
All I got was None.
The problem is that part of the class (_3X58t) keeps changing.

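This is most likely due to the ^ anchor: the class attribute starts with the changing part (_3X58t), so the pattern can never match at the start. We could drop the anchor:
soup.find('div', class_=re.compile('selectable-text invisible-space copyable-text'))
or we might try this expression for the divs: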
(.+?selectable-text invisible-space copyable-text)
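The effect of the anchor can be checked with the standard re module alone. The sketch below assumes that BeautifulSoup applies the compiled pattern with search() to each individual class and then to the whole class string; the matches helper is a hypothetical stand-in for that behaviour, not a BeautifulSoup API.

```python
import re

# Class attribute of the target div, copied from the question.
class_attr = "_3X58t selectable-text invisible-space copyable-text"
target = "selectable-text invisible-space copyable-text"

anchored = re.compile("^" + target)    # pattern from the question
unanchored = re.compile(target)        # pattern with the anchor removed
suffix = re.compile(target + "$")      # anchored at the end instead


def matches(pattern, attr):
    """Assumed model: search each class token, then the joined string."""
    candidates = attr.split() + [attr]
    return any(pattern.search(c) for c in candidates)


print(matches(anchored, class_attr))    # False: the string starts with _3X58t
print(matches(unanchored, class_attr))  # True
print(matches(suffix, class_attr))      # True: the stable part is at the end
```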

I would first check whether a single class from the compound class list is enough, e.g.
soup.select_one('.selectable-text')
Otherwise, combine the classes with an attribute selector that matches the end of the class value:
soup.select_one('[class$="selectable-text invisible-space copyable-text"]')
Either option avoids resorting to regex.

Extract information from all matching nodes without looping xpath

<ul class="products-grid">
<li class="item">
<div class="product-block">
<div class="product-block-inner">
<img src="#/producta.jpg">
<h2 class="product-name">Product A</h2>
<div class="price-box">
<span class="regular-price" id="#">
<span class="price">Rs 1,849</span>
</span>
</div>
</div>
</div>
</li>
<li class="item">
<div class="product-block">
<div class="product-block-inner">
<img src="#/productb.jpg">
<h2 class="product-name">Product B</h2>
<div class="price-box">
<span class="regular-price" id="#">
<span class="price">Rs 1,849</span>
</span>
</div>
</div>
</div>
</li>
</ul>
I am currently scraping the items in a loop.
products = response.xpath('//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]').extract()
After getting the product-block-inner nodes, I save them into products and then loop like this:
for product in products:
    # parse the div.product-block-inner further down
    # to get name, price, image etc.
    # and save it to a dict and yield
    pass
Is it possible to get the text and href for all div.product-block-inner nodes in the final list without looping?

Yes, but the result is confusing because all values end up in one flat list. For example, you could try this:
products = response.xpath(
    '//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]'
).css(
    '.product-name a::attr(href), .product-name a::text, .price::text'
).extract()
but I would suggest always looping (by the way, why do you call extract() when you assign the result to products?):
products = response.xpath(
    '//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]'
)
for product in products:
    yield {'name': product.css('.product-name a::text').extract_first(),
           'url': product.css('.product-name a::attr(href)').extract_first(),
           'price': product.css('.price::text').extract_first()}
(I've used CSS selectors in this case because the equivalent XPath expressions are longer, but the same can be achieved with XPath.)
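The per-node loop can be tried without a running spider. The sketch below swaps Scrapy's selectors for the standard-library xml.etree.ElementTree, so it illustrates the pattern rather than the Scrapy API. The markup is a copy of the question's sample with the img tags self-closed so that it parses as XML, and parse_products is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's markup (img tags self-closed for XML).
MARKUP = """
<ul class="products-grid">
  <li class="item">
    <div class="product-block">
      <div class="product-block-inner">
        <img src="#/producta.jpg" />
        <h2 class="product-name">Product A</h2>
        <div class="price-box">
          <span class="regular-price" id="#">
            <span class="price">Rs 1,849</span>
          </span>
        </div>
      </div>
    </div>
  </li>
  <li class="item">
    <div class="product-block">
      <div class="product-block-inner">
        <img src="#/productb.jpg" />
        <h2 class="product-name">Product B</h2>
        <div class="price-box">
          <span class="regular-price" id="#">
            <span class="price">Rs 1,849</span>
          </span>
        </div>
      </div>
    </div>
  </li>
</ul>
"""


def parse_products(markup):
    """Select every product block once, then query each block relatively."""
    root = ET.fromstring(markup.strip())
    items = []
    for block in root.findall(".//div[@class='product-block-inner']"):
        # Each lookup is relative to the current block, so values stay paired.
        items.append({
            "name": block.findtext(".//h2[@class='product-name']"),
            "price": block.findtext(".//span[@class='price']"),
            "image": block.find(".//img").get("src"),
        })
    return items


items = parse_products(MARKUP)
for item in items:
    print(item)
```

Querying relative to each block keeps a product's name, price, and image together, which is what the flat single-query list loses.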

jquery regex get several key not only one

I would like to get
PA-1400-11PA ADP-40PH ABA
Here is the HTML code:
</div>
<div class="ref">
<h2 id='affiche_sous_titre'>eee :</h2> <p>
<a href='eee' title='PA-1400-11PA' class='lien_menu'>PA-1400-11PA</a> - <a href='uuu' title='ADP-40PH ABA' class='lien_menu'>ADP-40PH ABA</a> </p>
</div>
<div class="modele_tout">
</div>
<div class="star-customer">
Here is my regex code:
line=line.replace(/[\"\']lien_menu[\"\']>(.*?)<\/a>/ig,"$1\n")
But I only get
ADP-40PH ABA
What is the problem? I don't understand.
Thanks for your help.

Diazo rule to append element before closing tag

I'm trying to find a Diazo rule that appends a new element inside a container, just before its closing tag. For example:
Case 1
<div class="some-A">
<div class="some-B">1</div>
<div class="some-B">2</div>
<div class="some-B">3</div>
</div>
Case 1 - after rule applied
<div class="some-A">
<div class="some-B">1</div>
<div class="some-B">2</div>
<div class="some-B">3</div>
<div class="some-B">NEW</div>
</div>
Case 2
<div class="some-A">
</div>
Case 2 - after rule applied
<div class="some-A">
<div class="some-B">NEW</div>
</div>
I need it to work in both cases, with and without existing content in the container.
None of these work:
<replace css:theme=".some-A">
<div class="some-A">
<div class="some-B">NEW</div>
</div>
</replace>
because it replaces the whole container, including its existing children.
<before css:theme=".some-A">
<div class="some-B">NEW</div>
</before>
because it inserts the element before my container.
<after css:theme=".some-A">
<div class="some-B">NEW</div>
</after>
because it inserts the element after the container.

<after css:theme-children=".some-A">
<div class="some-B">NEW</div>
</after>

Nested floats do not work in CFDOCUMENT css

The HTML below is placed inside a <cfdocumentitem type="header"> block, but the output is empty.
<div class="grid">
<div class="span5">
<div class="span5">
Label1
</div>
<div class="span5">
Data1
</div>
</div>
<div class="span5">
<div class="span5">
Label2
</div>
<div class="span5">
Data2
</div>
</div>
<div style="clear:both"></div>
</div>
But when I remove the nested 'class="span5"' divs and put some content there, it works fine. Is there a problem with nested floats in cfdocument?

Unfortunately, CSS support in CFDOCUMENT is hit or miss.
Two rules to follow that might help:
Make sure your HTML validates as XHTML 1.0 Transitional.
Import your style sheets using
<style type="text/css" media="screen">@import "style.css";</style>
The same information can be found here: http://rip747.wordpress.com/2007/09/10/cfdocument-it-works-if-you-know-how/
