lxml - unable to replace children of children - python-2.7

I'm using lxml as a solution for XML parsing in my application.
I understand that lxml's .replace() can only replace an immediate child of the element it is called on, not elements nested any deeper.
Example XML:
<root>
  <dc>
    <batman alias='dark_knight' />
  </dc>
</root>
I have a modified tag in a string like so
<batman alias='not_dark_knight' />
I need some help with replacing the original XML using xpath '/root/dc/batman'.
from lxml import etree
original_xml = "<root><dc><batman alias='dark_knight' /></dc></root>"
modified_tag = "<batman alias='not_dark_knight' />"
x_path = '/root/dc/batman'
original_obj = etree.fromstring(original_xml)
modified_obj = etree.fromstring(modified_tag)
original_obj.replace(original_obj.xpath(x_path)[0], modified_obj)
This throws a ValueError: Element is not a child of this node.
Is there a way I can replace the element cleanly?
Please understand that I would like a solution using the lxml library only.

As you already know, you should be calling replace() on the parent of the element you want to replace. You can use .getparent() to get to that parent dynamically:
batman = original_obj.xpath(x_path)[0]
batman.getparent().replace(batman, modified_obj)
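Putting the question's snippet and the fix together, a complete runnable sketch:

```python
from lxml import etree

original_xml = "<root><dc><batman alias='dark_knight' /></dc></root>"
modified_tag = "<batman alias='not_dark_knight' />"

original_obj = etree.fromstring(original_xml)
modified_obj = etree.fromstring(modified_tag)

# Locate the target anywhere in the tree, then replace it via its own parent,
# not via the root element.
batman = original_obj.xpath('/root/dc/batman')[0]
batman.getparent().replace(batman, modified_obj)

print(etree.tostring(original_obj))
```

The serialized tree now carries the updated alias attribute.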

Related

Python xpath returns an empty list

I need to extract some of the href attributes under the "ARTICLES" section on this page.
I am using the following code
from lxml import html
import requests
page = requests.get('http://www.dlib.org/dlib/november14/11contents.html')
tree = html.fromstring(page.content)
result = tree.xpath('/html/body/form/table[3]/tbody/tr/td/table[5]/tbody/tr/td/table/tbody/tr/td[2]/p[6]/@href')
print result
I know for sure that the XPath is correct but when I run the script it prints
[]
I've tried with some other elements on the page and it works as expected.
Any idea?
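One frequent cause of an empty result with paths copied from browser developer tools is that the browser inserts tbody elements that are not in the HTML actually served, so every tbody step in the absolute path matches nothing. A relative expression is usually more robust. A minimal sketch against a stand-in snippet (the real page's markup is an assumption here):

```python
from lxml import html

# Stand-in for the fetched page; the real article listing may differ.
snippet = '<html><body><p><a href="/dlib/november14/a.html">Article</a></p></body></html>'
tree = html.fromstring(snippet)

# A relative query ending in @href returns the attribute values directly,
# with no hard-coded tbody steps to go stale.
hrefs = tree.xpath('//p/a/@href')
print(hrefs)
```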

How to properly use xpath & regexp extractor in jmeter?

I have the following text in the HTML response:
<input type="hidden" name="test" value="testValue">
I need to extract the value from the above input tag.
I've tried both regexp and xpath extractor, but neither is working for me:
regexp pattern
input\s*type="hidden"\s*name="test"\s*value="(.+)"\s*>
xpath query
//input[@name="test"]/@value
The above xpath gives an error at the XPath Assertion Listener: "No node matched".
I tried a lot and concluded that the xpath works only if I use it as //input[@name].
The moment I add an actual name it gives the error "No node matched".
Could anyone please suggest me how to resolve the above issue?
Please take a look at my previous answer:
https://stackoverflow.com/a/11452267/169277
The relevant part for you would be step 3:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String html = prev.getResponseDataAsString(); // get response from your sampler
Document doc = Jsoup.parse(html);
Element inputElement = doc.select("input[name=test]").first();
String inputValue = inputElement.attr("value");
vars.put("inputTextValue", inputValue);
Update
So you don't get tangled up with the code, I've created a jMeter post-processor called Html Extractor; here is the GitHub url:
https://github.com/c0mrade/Html-Extractor
Since you are using the XPath Extractor to parse an HTML (not XML) response, ensure that the "Use Tidy (tolerant parser)" option is CHECKED in the XPath Extractor's control panel.
Your xpath query looks fine; check the option mentioned above and try again.
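For comparison outside JMeter, the same hidden-input extraction can be sketched with lxml in Python (this is just an illustration of the corrected XPath, not a JMeter post-processor):

```python
from lxml import html

# The hidden input from the question, wrapped in a minimal document.
response = '<html><body><form><input type="hidden" name="test" value="testValue"></form></body></html>'
doc = html.fromstring(response)

# The question's query, with @ (not #) marking the attribute axis.
value = doc.xpath('//input[@name="test"]/@value')[0]
print(value)
```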

Scraperwiki scrape query: using lxml to extract links

I suspect this is a trivial query but hope someone can help me with a query I've got using lxml in a scraper I'm trying to build.
https://scraperwiki.com/scrapers/thisisscraper/
I'm working line-by-line through tutorial 3 and have got as far as trying to extract the next-page link. I can use cssselect to identify the link, but I can't work out how to isolate just the href attribute rather than the whole anchor tag.
Can anyone help?
def scrape_and_look_for_next_link(url):
    html = scraperwiki.scrape(url)
    print html
    root = lxml.html.fromstring(html)  # turn the HTML into an lxml object
    scrape_page(root)
    next_link = root.cssselect('ol.pagination li a')[-1]
    attribute = lxml.html.tostring(next_link)
    attribute = lxml.html.fromstring(attribute)
    # works up until this point
    attribute = attribute.xpath('/@href')
    attribute = lxml.etree.tostring(attribute)
    print attribute
CSS selectors can select elements that have an href attribute (e.g. a[href]), but they cannot extract the attribute value by themselves.
Once you have the element from cssselect, you can use next_link.get('href') to get the value of the attribute.
link = link.attrib['href']
should work
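A runnable sketch of the corrected function against a stand-in page (scraperwiki.scrape and the site's real pagination markup are assumptions here; the XPath mirrors the cssselect call in case the optional cssselect package is not installed):

```python
import lxml.html

# Stand-in for the scraped page; the real pagination markup is an assumption.
html_text = ('<ol class="pagination">'
             '<li><a href="/page/1">1</a></li>'
             '<li><a href="/page/2">next</a></li>'
             '</ol>')
root = lxml.html.fromstring(html_text)

# Same selection as root.cssselect('ol.pagination li a'); once you have the
# element, .get() reads a single attribute from it.
next_link = root.xpath('//ol[@class="pagination"]/li/a')[-1]
href = next_link.get('href')
print(href)
```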

ActionScript htmlText: removing a table tag from dynamic HTML text

AS 3.0 / Flash
I am consuming XML which I have no control over.
The XML has HTML in it, which I am styling and displaying in an HTML text field.
I want to remove all the HTML except the links.
Strip all HTML tags except links
is not working for me.
Does anyone have any tips? RegEx?
The following removes tables:
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
but now I realize I need to keep the content that is in the tables, and I also need the links.
Thanks!!!
cp
Probably shouldn't use regex to parse HTML, but if you don't care, something simple like this:
find /<table\s+[^>]*>.*?<\/table\s*>/
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree from within AS:
var xml : XML = <page>
        <p>Other elements</p>
        <table><tr><td>1</td></tr></table>
        <p>won't</p>
        <div>
            <table><tr><td>2</td></tr></table>
        </div>
        <p>be</p>
        <table><tr><td>3</td></tr></table>
        <p>removed</p>
        <table><tr><td>4</td></tr></table>
    </page>;
removeTables(xml);
trace(xml.toXMLString()); // will output everything but the tables
function removeTables(xml : XML) : void {
    xml.replace("table", "");
    for each (var child:XML in xml.elements("*")) removeTables(child);
}
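The question asks for ActionScript, but for readers doing the same cleanup in Python, lxml's etree.strip_tags does exactly this: it removes the named tags while keeping their text and children, so table contents and the links inside them survive. A minimal sketch:

```python
from lxml import etree, html

# Hypothetical fragment: a table wrapping a link we want to keep.
fragment = '<div><p>keep</p><table><tr><td><a href="/x">link</a></td></tr></table></div>'
root = html.fromstring(fragment)

# Remove the table markup itself, but not its content.
etree.strip_tags(root, 'table', 'tbody', 'tr', 'td')
print(html.tostring(root))
```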

Umbraco copy-of not showing xml markup

I’m using Umbraco 4.7.0
My goal is to get the image path from a hard coded media node id of 4191. If I create a new macro with the code:
<xsl:copy-of select="umbraco.library:GetMedia(4191, false())"/>
I get the output:
/media/17675/my image.jpg50033618497jpg
I was expecting some well-formed XML; however, it appears I'm missing all the tags, so I cannot reference the path for the image directly.
Am I missing something really simple here?
EDIT
I discovered how to get the raw xml output from my copy-of statement. I needed to wrap it in a <textarea> tag:
<textarea>
<xsl:copy-of select="umbraco.library:GetMedia(4191, false())"/>
</textarea>
This should do it:
<xsl:copy-of select="umbraco.library:GetMedia(4191, 0)/umbracoFile"/>
See also http://our.umbraco.org/wiki/reference/umbracolibrary/getmedia