How can I ignore libxml HTML parsing errors - c++

I'm using libxml to parse and build a tree from HTML:
htmlDoc = htmlParseDoc((xmlChar*)s.c_str(), "windows-1252");
s contains the HTML as a string; I'm using curl to retrieve the HTML and store it in s. The functionality works exactly as I want; I just want to stop the HTML parsing errors from being printed to the console:
HTML parser error : htmlParseStartTag: misplaced <body> tag
<html><head><title>UPC-E Home Page</title></head><body topmargin=8 leftmargin=8
I get these errors because every once in a while curl will time out and won't retrieve the entirety of the HTML. That's fine; I just want to ignore it and try again. However, I don't want the user to see these errors. How can I stop them from being shown to the user?
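One way to silence these messages (a sketch, not verified against your libxml version) is to switch from htmlParseDoc to htmlReadMemory, which accepts parser option flags that suppress error and warning output while still recovering a tree from truncated input:
#include <libxml/HTMLparser.h>
#include <string>

// Sketch: parse without console noise. HTML_PARSE_NOERROR and
// HTML_PARSE_NOWARNING silence the parser's output, and
// HTML_PARSE_RECOVER keeps building a tree from truncated HTML.
// parseQuietly is a name made up for this example.
htmlDocPtr parseQuietly(const std::string &s)
{
    return htmlReadMemory(s.c_str(), (int)s.size(),
                          NULL,            // no base URL
                          "windows-1252",  // same encoding as before
                          HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
}
If you'd rather keep htmlParseDoc, xmlSetGenericErrorFunc can install a no-op error handler instead.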

Related

How to use filepond with Django

As the title suggests.
I have searched Google and Stack Overflow; so far I haven't found any tutorial that doesn't involve django-drf-filepond (https://github.com/ImperialCollegeLondon/django-drf-filepond).
While that library seems maintained, at 68 stars it feels too risky, and I'd prefer to do without it.
What I tried
When you use a FilePond input tag with the classes file-uploader file-uploader-grid, the browser compiles it and generates a div tag.
The issue is that the id from the input ends up on that generated div instead of on the input tag.
Without that id, when the form is submitted, self.request.FILES is an empty dictionary.
So I tried writing some JavaScript to add the id to the input tag, which unfortunately didn't work.
Has anyone done this successfully in Django without an additional library? Thanks.
The generated input is only there to catch files; the actual data is either stored in hidden input fields (if you use the server property) or encoded in those fields (if you use the File Encode plugin).
You can set storeAsFile to true to have FilePond update the fileList property of the file field. That doesn't work on older versions of iOS, though; see the link in the property description:
https://pqina.nl/filepond/docs/api/instance/properties/
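For illustration, a minimal sketch of that approach (the selector below is an assumption, not from the original post):
// storeAsFile keeps the chosen file attached to the underlying
// <input type="file">, so a plain Django form post surfaces it in
// request.FILES with no server-side FilePond library involved.
const input = document.querySelector('input.file-uploader');
FilePond.create(input, {
    storeAsFile: true  // not supported on older iOS versions
});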

scraping the text from source code using python

I'm trying to scrape Google search results using Python and Selenium. I'm able to get only the first search result. Here is the code I'm using:
driver.get(url)
res = driver.find_elements_by_css_selector('div.g')
link = res[0].find_element_by_tag_name("a")
href = link.get_attribute("href")
How can I get all the search results?
Try to get the list of links as below (from the first page only; if you need to scrape more pages, you have to click the "Next" button in a loop and append the results from the following pages, as in the sketch after this answer):
href = [link.get_attribute("href") for link in driver.find_elements_by_css_selector('div.g a')]
P.S. You might also use the solutions from this question to get the results as a GET request response with the requests lib.
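A rough sketch of that pagination loop (the pnnext id for the "Next" button is an assumption about Google's markup at the time, which changes often):
from selenium.common.exceptions import NoSuchElementException

all_links = []
for _ in range(3):  # first three result pages
    # collect every result link on the current page
    all_links.extend(a.get_attribute("href")
                     for a in driver.find_elements_by_css_selector("div.g a"))
    try:
        driver.find_element_by_id("pnnext").click()  # the "Next" button
    except NoSuchElementException:
        break  # no further pages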

scrapy: xpath not returning the full url for @href

Performing a scrape using XPath with Scrapy, I don't get the full URL.
Here is the URL I am looking at, using scrapy shell:
scrapy shell "http://www.ybracing.com/omp-ia01854-omp-first-evo-race-suit.html"
I perform the following XPath select from the shell:
sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")
and get only half of the href:
[<Selector xpath="//*[@id='Thumbnail-Image-Container']/li[1]/a//@href" data=u'http://images.esellerpro.com/2489/I/160/'>]
Here's the snippet of HTML I am looking at in a browser:
<li><a data-medimg="http://images.esellerpro.com/2489/I/160/260/1/medIA01854-GALLERY.jpg" href="http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg" class="cloud-zoom-gallery Selected" title="OMP FIRST EVO RACE SUIT" rel="useZoom: 'MainIMGLink', smallImage: 'http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg'"><img src="http://images.esellerpro.com/2489/I/160/260/1/smIA01854-GALLERY.jpg" alt="OMP FIRST EVO RACE SUIT Thumbnail 1"></a></li>
and here it is from wget:
<li><a data-medimg="http://images.esellerpro.com/2489/I/513/0/medIA01838_GALLERY.JPG" href="http://images.esellerpro.com/2489/I/513/0/lrgIA01838_GALLERY.JPG" class="cloud-zoom-gallery Selected" title="OMP DYNAMO RACE SUIT" rel="useZoom: 'MainIMGLink', smallImage: 'http://images.esellerpro.com/2489/I/513/0/lrgIA01838_GALLERY.JPG'"><img src="http://images.esellerpro.com/2489/I/513/0/smIA01838_GALLERY.JPG" alt="OMP DYNAMO RACE SUIT Thumbnail 1" /></a></li>
I have tried varying my XPath to pull the same data, but I still get the same result.
What is causing this, and what can I do to work around it? I would like to understand, rather than just have someone correct my XPath for me.
Some thoughts on the page itself: I disabled JavaScript to see whether JS was generating half the URL, but it isn't. I also downloaded the page with wget to confirm the URLs are complete in the original HTML.
I haven't tested any other builds, but I'm using Scrapy 1.2.1 with Python 2.7 on CentOS 7.
I've googled and only found people who can't grab the data because JavaScript generates it on the fly, but my data is there in the HTML.
By using
sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")
you get a list of Selector instances, whose data field shows only the first few bytes of the content (since it might be very long).
To retrieve the content as a string (instead of a Selector instance), you need to use something like .extract() or .extract_first():
>>> print(sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href").extract_first())
http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg
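As a follow-up, .extract() (rather than .extract_first()) returns every matching href as a list of full strings, e.g. all the gallery thumbnails at once:
>>> for url in sel.xpath("//*[@id='Thumbnail-Image-Container']/li/a/@href").extract():
...     print(url)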

Test XSS vulnerability of a website which is using smartgwt

I am testing whether I can inject script code into a website that uses SmartGWT, and then query the input string back out so that the script runs.
I first entered the following string into a text field on a web page and submitted it:
"<script>alert(1)</script>" (without the double quotes),
then I queried the input string back out; it is displayed through a SmartGWT table component.
With an HTML debugging tool, I can see that the input string was placed inside a <nobr> tag inside a <td> tag. The HTML characters in the input string weren't encoded, but the alert(1) code doesn't execute and no popup is shown. Does SmartGWT handle XSS automatically, or is there another reason the script isn't executed?
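One plausible explanation (an assumption about how the grid renders its cells, not confirmed SmartGWT behaviour): browsers parse but do not execute <script> elements that are inserted through innerHTML, as a standalone snippet shows:
// A <script> injected via innerHTML ends up in the DOM, but the
// HTML spec says the browser must not run it, so no popup appears.
var cell = document.createElement('td');
cell.innerHTML = '<nobr><script>alert(1)<\/script></nobr>';
document.body.appendChild(cell);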

Trouble parsing remote RSS feed using ColdFusion

I'm having a vexing time displaying a remote RSS feed on an intranet site. I'm using MM_XSLTransform.cfc version 0.6.2 to pull in the feed and a basic XSL to output it. The feed URL is www.fedsources.com/FedsourcesNet/RssFeeds/RSS_MarketFlash.aspx. If you open it in a browser, you'll see it appears to be an ordinary RSS feed. But when I try to display it in CF, I get the following MM_XSLTransform error:
www.fedsources.com/FedsourcesNet/RssFeeds/RSS_MarketFlash.aspx is not a valid XML document.
Parsing www.fedsources.com/FedsourcesNet/RssFeeds/RSS_MarketFlash.aspx
An error occured while Parsing an XML document.
Content is not allowed in prolog.
(The actual error included http:// in the URLs; the feed is then dumped as part of the error message.)
What's especially frustrating is if I view the source of the RSS and copy and paste it into a text file, then parse that text file, it displays fine.
Running CF version 7.
I tried changing the charset from UTF-8 to windows-1252, but that added some weird characters at the beginning and didn't help. I also tried stripping out everything between <channel> and <item>, but that didn't help either.
I've successfully parsed other RSS feeds outside our firewall using the same code. Is there something about the .aspx extension that's causing the error? Any thoughts? Anyone?
Thanks.
What's the exact code that you're using to parse the XML document? This particular error normally happens if you have some data before the <?xml?> tag in the document; even a single space can cause a problem.
I'm not familiar with the particular CFC you mentioned, so I can't troubleshoot that one for you, but make sure that you use the Trim() function around any XML content you're going to parse.
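For example, a one-line sketch of that suggestion (assuming the feed was fetched with cfhttp):
<cfset xmlDoc = XmlParse(Trim(cfhttp.FileContent)) />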
UPDATE: A quick Google search led me to this post from Ben Nadel: http://www.bennadel.com/blog/1206-Content-Is-Not-Allowed-In-Prolog-ColdFusion-XML-And-The-Byte-Order-Mark-BOM-.htm
You need to remove the Byte Order Mark (BOM) from the feed. This code works without an error:
<cfhttp method="get" url="http://www.fedsources.com/FedsourcesNet/RssFeeds/RSS_MarketFlash.aspx" />
<!--- Strip everything before the first "<" (including the BOM), then parse --->
<cfset xmlResult = XmlParse(REReplace( cfhttp.FileContent, "^[^<]*", "", "all" )) />
<cfdump var="#xmlResult#" />