HTML parsing using pugixml or an actual HTML parser

I'm interested in using pugixml to parse HTML documents, but HTML has some optional closing tags. Here is an example: <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
Pugixml stops reading the HTML as soon as it encounters a tag that's not closed, but in HTML missing a closing tag does not necessarily mean that there is a start-end tag mismatch.
A simple test of parsing the HTML documentation of pugixml fails because the meta tag is the second line of the HTML document: http://pugixml.googlecode.com/svn/tags/latest/docs/quickstart.html
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<title>pugixml 1.0</title>
<link rel="stylesheet" href="pugixml.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
<link rel="home" href="quickstart.html" title="pugixml 1.0">
</head>
<!--- etc... -->
A lot of HTML documents in the wild would fail if I tried to parse them with pugixml. Is there a way to avoid that? If there is no way to "fix" it, is there another HTML parsing tool that's about as fast as pugixml?
Update
It would also be great if the HTML parser supported XPath.

I ended up taking pugixml and converting it into an HTML parser, and I created a GitHub project for it: https://github.com/rofldev/pugihtml
For now it's not fully compliant with the HTML specifications, but it parses HTML well enough for my use. I'm working on making it compliant.

One way to address this is to pre-process the HTML into XHTML. The result is then "officially" XML and usable with XML tools. If you want to go that route, see this question: How to convert HTML to XHTML?
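A minimal sketch of that pre-processing idea, written in Python with only the standard library rather than in C++. It only self-closes void elements such as meta and link and re-quotes attributes; it does not repair general tag soup, and it drops comments and the doctype. The names XhtmlWriter and html_to_xhtml are made up for this illustration. A dedicated converter would be the more robust choice for documents in the wild.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML (the HTML "void elements").
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class XhtmlWriter(HTMLParser):
    """Hypothetical helper: re-emits HTML with void elements self-closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _attrs(self, attrs):
        # Valueless attributes (e.g. `disabled`) get their own name as value.
        return "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            self.parts.append(f"<{tag}{self._attrs(attrs)} />")
        else:
            self.parts.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:  # ignore stray </br>, </meta>, ...
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))


def html_to_xhtml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.parts)


SAMPLE = """<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<title>pugixml 1.0</title>
<link rel="stylesheet" href="pugixml.css" type="text/css">
</head>
</html>"""

xhtml = html_to_xhtml(SAMPLE)
root = ET.fromstring(xhtml)  # the raw SAMPLE would raise a ParseError here
print(root.find("./head/title").text)
```

Once the output is well-formed, any XML parser can load it, and ElementTree's find/findall accept a limited XPath subset, which covers simple queries like the one above.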

Related

AWS Glue won't Classify my Data

I have an HTML file which is structured like this:
<!doctype html public "-//w3c//dtd html 4.0transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Author" content="ERA">
<LINK REL=STYLESHEET TYPE="text/css" HREF="Style_Sheets/ERA_Internet_Printer.css">
</head>
<body>
<pre>
<font face="courier new" size=-4> 14V-IG-TEST-DATA - SERVC - EXEC# 4515
[11| Blubb,abcons, Port: 18 For: abcons
For period : GE 08/04/18 AND LE 11/04/18 OR GE 11/04/18 AND LE 11/05/18
01:45:40 11-04-18 - Page # 1
Serial#........................ 564561215
Make Desc...................... VW
Carline........................ MUX
Year........................... 2015
Cust# ........................ 512
License#....................... 78365HH
Open RO........................ R25625
EOR............................ EOR
Serial#........................ 2151512315
Make Desc...................... VOLKSWAGEN
Carline........................ VOLKSWAGEN
Year........................... 2017
Cust# ........................ 552
License#....................... DPA2151
Open RO........................ T52165
EOR............................ EOR
2 records listed.
</pre>
</body>
</html>
I want to extract the information from the file as "Key.......... Value" pairs.
So I've created a custom classifier in AWS Glue with Grok to extract it.
The classifier's Grok pattern is configured as follows:
%{KEY:mykey}%{GREEDYDATA:myvalue}
with the custom Pattern:
KEY ([a-zA-Z# 1-9]+\.+ )
Every online Grok debugger (like https://grokdebug.herokuapp.com/) extracts the information from the data structure with this configuration. But when I start the crawler in Glue with the custom classifier, it doesn't find any tables or structures.
What am I doing wrong?

I think you're running into the problem I answered here: https://github.com/aws-samples/aws-glue-samples/issues/4
There's a buried sentence in the AWS documentation that states: "To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier."
Simply updating the classifier and re-running the same crawler does not apply the updated classifier.
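Before recreating the crawler, the pattern itself can be sanity-checked locally. The following is a rough sketch that translates the Grok expression from the question into a Python regular expression, assuming GREEDYDATA is equivalent to `.*`. Grok uses a different regex engine, so this only checks the expression against a few lines from the sample file; it says nothing about how Glue applies the classifier. The helper name parse_line is hypothetical.

```python
import re

# Python translation of %{KEY:mykey}%{GREEDYDATA:myvalue} with the custom
# pattern KEY ([a-zA-Z# 1-9]+\.+ ); GREEDYDATA is assumed to be `.*`.
LINE_RE = re.compile(r"(?P<mykey>[a-zA-Z# 1-9]+\.+ )(?P<myvalue>.*)")

# A few lines copied from the sample file in the question.
SAMPLE_LINES = [
    "Serial#........................ 564561215",
    "Make Desc...................... VW",
    "Cust# ........................ 512",
    "2 records listed.",
]


def parse_line(line):
    """Return (key, value) for a 'Key..... Value' line, or None otherwise."""
    # re.match anchors at the start of the line (an assumption of this sketch).
    match = LINE_RE.match(line)
    if match is None:
        return None
    # Trim the dot leader and padding that the KEY pattern captures.
    key = match.group("mykey").rstrip(". ")
    return key, match.group("myvalue").strip()


records = [pair for pair in map(parse_line, SAMPLE_LINES) if pair is not None]
for key, value in records:
    print(f"{key} = {value}")
```

Note that the character class `1-9` in the KEY pattern excludes the digit 0, which does not matter for the keys in this sample but would for a key containing a zero.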

Parse HTML with BeautifulSoup replaces existing HTML tag

I am using BeautifulSoup v4 to parse out a string of HTML that looks like this:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head></head>
<body><p>Hello, world</p></body>
</html>
Here is how I am parsing it:
soup = BeautifulSoup(html)
Where html is the pasted HTML above. For whatever reason, BS keeps replacing the <html> tag with a standard tag without the extra namespace attributes. Is there any way I can tell BS not to do this?

I was able to figure it out by passing in html5lib as the HTML parser to BS. But now it keeps inserting a stray HTML comment in place of the DOCTYPE:
<!--<!DOCTYPE HTML-->

Ractive / Moustache Template for Head Tag

I have been searching for this all over, but cannot find an answer or example.
Can a Ractive template be used to construct head elements that are consistent across pages, and can it be loaded from a separate file?
For example: all html, head, and title tag info is loaded via a referenceable template from an external file into an index page.
+html+
+head+
+title+
+/title+
+/head+
And if so, how do you do it? As I try to wrap my head around it, jquery and ractive.js would need to load. Is there a different/better solution?

It is possible. But it's not practical and it raises other issues.
Here's a basic implementation that shows how <head> can be templated but without concentrating on putting the template in an external file. This works for me in Chrome and IE.
<html>
<head id="output"></head>
<script id="template" type="text/html">
<title>{{ title }}</title>
</script>
<script type="text/javascript" src="ractive.min.js"></script>
<script type="text/javascript">
var ractive = new Ractive({
template: "#template",
el: "#output",
data: {
title: "This is the title"
}
});
</script>
<body>
...
</body>
</html>
You'll run into problems with this approach because the head elements won't be loaded until after the page has loaded and Ractive kicks in. This may cause the following problems:
Search engines might not be able to read the page title and meta tags
Any javascript you need to load into <head> may not work (I tried some simple examples and was able to get the javascript to run but it failed to reference any elements in the body. Maybe it's a context issue and maybe Ractive has support to overcome this but this is an area I'm unfamiliar with.)
If you require valid HTML, this probably won't work for you because script tags can't be direct children of <html>, and <head> is supposed to have <title> as a direct child.
You're better off using a server-side solution to template <head>.

how to test existence of context variable using gulp-file-include

I am using gulp-file-include to build my HTML pages from partials and templates. By using context variables, I can customize the meta headers of each page. However, I don't know how to include a line only if a context variable exists, as an "##else" statement doesn't seem to exist.
My parent HTML looks like:
##include ('_header.html', {
"title":"my page",
"description": "description",
"canonical":"http://www.sourcefromquote.com" })
<body>
A wonderful Page
##include ('_footer.html')
</body></html>
I was thinking of using a _header.html along these lines:
<html>
<head>
<title>##title</title>
<meta name="description" content="##description">
##if (canonical) { <link rel="canonical" href="##canonical" /> }
</head>
If the "canonical" variable is not set in the parent HTML, it throws an error (canonical is not defined).
I guess I could put the full tag in a variable and forget about the ##if, but that would not be as clean as I'd like.
Any ideas?
Thank you in advance.

In the head, you enter:
##if (context.canonical) {<link rel="canonical" href="##canonical" />}
In the file that includes the header you enter:
##include('_head.html', {
"canonical" : "https://www.website.com/canonical-link.html"
})

Go language strange behavior by handling templates

Hello!
I'm learning Go now and trying to port some simple web code (Laravel 4).
Everything went well until I tried to reproduce my Blade templates as text templates.
I found that Go can load my CSS and JavaScript files only from a directory named "bootstrap".
Here is the directory tree I tried to use:
start-catalog
bootstrap (link to bootstrap-3.3.1)
bootstrap-3.3.1
css
bootstrap.min.css
js
bootstrap.min.js
jquery
jquery (link to jquery-2.1.1.min.js)
jsquery-2.1.1.min.js
go_prg.go
Here are my templates:
base_js.tmpl
{{define "base_js"}}
{{template "login_1"}}
<script src = "/bootstrap/js/jquery"></script>
<script src = "/bootstrap/js/bootstrap.min.js"></script>
{{end}}
base_header.tmpl
{{define "base_header"}}
<head>
<title>PAGE TITLE</title>
<meta name = "viewport" content = "width=device-width, initial-scale=1.0">
<meta charset="utf-8">
<link href = "/bootstrap/css/bootstrap.min.css" rel = "stylesheet">
</head>
{{end}}
If the directory name differs from "bootstrap", Go or Firefox can't load the files referenced in the templates above: bootstrap.min.css, bootstrap.min.js, jquery.
If I use the directory name "bootstrap-3.3.1" directly instead of the link, the files don't load either.
If all required files are moved under "bootstrap", I get the results I expected (exactly the same as in Laravel 4).
I launched the Go code with the command go run go_prg.go.
Environment: Ubuntu 14.04, go-1.3.3, Firefox 31.
Who's wrong: Go, Firefox or me?
Any help will be highly appreciated!

The problem described was caused by
http.Handle("/bootstrap/", http.StripPrefix("/bootstrap/", http.FileServer(http.Dir("bootstrap"))))
before any template was handled. It allowed access only to files under the directory 'bootstrap'.
The problem was fixed by changing to
http.Handle( , http.StripPrefix(, http.FileServer(http.Dir("."))))
and adding to paths for CSS and JavaScript files. Like so
/bootstrap/js/jquery">.
