Parsing HTML in Swift other than using regex - regex

Below is the HTML code that I want to parse through in Swift:
<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a>
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>
I have read that regex is not a good way to parse HTML, but nevertheless I have written an expression that captures what I want (the text between the span tags): yī and yǎn
Regex expression:
/pinyin.+<span.+>(.+)<\/.+<span.+>(.+)<\//Us
I was wondering how to implement it in Swift so that I can capture both yī and yǎn at the same time and save them into an array. Also, I was wondering whether there is another way to do this without regex.
EDIT:
I ended up using TFHpple as suggested by Rob. It took me a long time to figure out how to import it into Swift, so I thought it would be helpful to post the steps here for convenience:
1. Open your project and drag the TFHpple files into it
2. At this point Xcode will probably prompt you to create a bridging header file if you haven't included any Obj-C code in your current project. In this bridging header file you should add:
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"
3. Select the target and, under General, in Linked Frameworks and Libraries (just scroll down when you are in the General tab and you will see it), add libxml2.2.dylib and libxml2.dylib
4. Under Build Settings, in Header Search Paths, add $(SDKROOT)/usr/include/libxml2
WARNING: be sure that it isn't User Header Search Paths as this is not the same
5. Under Build Settings, in Other Linker Flags, add -lxml2
Enjoy!

You can use the typical iOS HTML parser, TFHpple:
let data = NSData(contentsOfFile: path)
let doc = TFHpple(HTMLData: data)
if let elements = doc.searchWithXPathQuery("//td[@class='pinyin']/a/span") as? [TFHppleElement] {
for element in elements {
println(element.content)
}
}
Or you can use NDHpple:
let data = NSData(contentsOfFile: path)!
let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
let doc = NDHpple(HTMLData: html)
if let elements = doc.searchWithXPathQuery("//td/a/span") {
for element in elements {
println(element.children?.first?.content)
}
}
I have more mileage with TFHpple, so I'm personally more comfortable with it. NDHpple looks like it could be an alternative in theory, though I'm not as keen on it (e.g. why does the HTMLData parameter take a string and not NSData? why do I have to navigate through children to get the contents of the //td/a/span results? the [@class='pinyin'] qualifier doesn't appear to work, etc.). But try both and see which you prefer.
Both require a bridging header: TFHpple requires TFHpple.h in the bridging header, and NDHpple requires the libxml headers there. See the documentation for each for more information.

As you've said, you shouldn't use regex to parse HTML, it will go wrong (obligatory link). Just wrap yī within another <span> and you'll see why.
Instead, you should use a full-blown HTML parser. Make sure to check out How to Parse HTML on iOS for a detailed tutorial.
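To make the "parser instead of regex" idea concrete, here is a minimal, language-neutral sketch written in Go rather than Swift. It walks the markup token by token and collects the text of each span inside td class="pinyin" into a slice. It relies on the standard encoding/xml tokenizer, which only works because the snippet in the question happens to be well-formed XML; real-world HTML usually is not, so a proper HTML parser (such as TFHpple above) is still the right tool. The function name spanTexts is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// spanTexts (hypothetical helper) returns the text of every <span>
// found inside a <td class="pinyin">. Assumes well-formed markup.
func spanTexts(markup string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	var out []string
	inPinyin, inSpan := false, false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "td":
				// Only cells with class="pinyin" are of interest.
				inPinyin = false
				for _, a := range t.Attr {
					if a.Name.Local == "class" && a.Value == "pinyin" {
						inPinyin = true
					}
				}
			case "span":
				inSpan = inPinyin
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "td":
				inPinyin = false
			case "span":
				inSpan = false
			}
		case xml.CharData:
			if inSpan {
				if s := strings.TrimSpace(string(t)); s != "" {
					out = append(out, s)
				}
			}
		}
	}
}

func main() {
	markup := `<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a>
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>`
	texts, err := spanTexts(markup)
	if err != nil {
		panic(err)
	}
	fmt.Println(texts)
}
```

This simple state tracking does not handle a span nested inside another span, which is the same case that breaks the regex; a query-based parser handles that for you.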

Related

JSDuck Guides: How to Generate Subsection Navigation?

I am using the Guides feature of JSDuck where I specify README.md files in the guides.json file.
The problem I have is that I can't specify an anchor in a README.md file in guides.json.
For example, my README.md file has an H1 heading, then five H2 headings. Assume that above each H2 heading I put an anchor, e.g., <a name="h2_1">, <a name="h2_2">, <a name="h2_3">, etc.
I want the title: (param) I enter in guides.json to appear in the JSDuck-generated navigation on the left of the page. So, assume I entered these parameters:
{
"name": "foo-section-h2_1",
"title": "This is Header2"
}
The problem is that the -section tail, which is a valid link reference in JSDuck, causes the parser to fail to render the README.md into the target README.js in a directory "foo".
Does anyone have any suggestions? It is a huge hindrance if one can't express subsections in the main navigation of a guide.
Unfortunately this is not supported by JSDuck. It expects each entry name in the guides.json to reference a directory name, to which it appends "/README.md" and expects to find the guide file in there. There is no workaround that I know of, and I'm the author of this whole thing.
The whole guides feature is full of quirks and unresolved corner cases like this. It's largely a bolt-on feature of JSDuck. My main suggestion is that when you want to do anything more than the most basic guides, you should look for an alternative solution.
Or you could try patching this. But I took a look around the code, and it's not a simple fix to make.

MVC - Strip unwanted text from rss feed

I've got the following code in my RSS consumer (Vandelay Industries RemoteRSS) in my Orchard CMS implementation:
@using System.Xml.Linq
@{
var feed = Model.Feed as XElement;
}
<ul>
@foreach(var item in feed
.Element("channel")
.Elements("item")
.Take((int)Model.ItemsToDisplay))
{
<li>@T(item.Element("description").Value)</li>
}
</ul>
The RSS feed I'm using is from Pinterest, and it bundles the image, link, and a short description all inside the 'description' elements of the feed.
<description><a href="/pin/215609900882251703/"><img src="http://media-cache-ec2.pinterest.com/upload/88664686384961121_UIyVRN8A_b.jpg"></a>How to install Orchard CMS on IIS Server</description>
My issue is that I don't want the text bits, and I also need to prefix the 'href=' links with 'http://www.pinterest.com'.
I've managed to edit the original code with my newbie skills into the above, which essentially displays the images as links that are only relative and thus point locally to my server. These images are then followed by the short description.
So to summarise, I need a way to prefix all links with 'http://pinterest.com' and then to remove the free text after the image/links.
Any pointers will be greatly appreciated, Thanks.
You should probably parse the description with something like http://htmlagilitypack.codeplex.com/, and then tweak it to add the prefix. Or you can learn regular expressions and do without a library, although that could be a little trickier and more error-prone.
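Whichever parser extracts the href values, the prefixing step itself is safer done as URL resolution than as string concatenation, because absolute links are then left untouched. A minimal sketch of that step, shown in Go for illustration (the thread itself is C#; absoluteHref is a hypothetical helper name, and the base URL is the one from the question):

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteHref (hypothetical helper) resolves href against base.
// Relative links gain the base's scheme and host; absolute links
// are returned unchanged.
func absoluteHref(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := absoluteHref("http://www.pinterest.com", "/pin/215609900882251703/")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs)
}
```

In .NET the same idea is available through the Uri class, which can combine a base URI with a relative one.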

Whitelist tags exempt from escaping using Go's html/template

Pass a []byte into a template as the body of a message post on a forum-style web app. In the template, call a method to convert to string and along the way, switch out all newlines for line breaks:
<p>{{.BodyString}}</p>
...
func (p *Post) BodyString() string {
nl := regexp.MustCompile(`\n`)
return nl.ReplaceAllString(string(p.Body), `<br>`)
}
What you'll end up with:
paragraphs <br> <br>in <br> <br>this <br> <br>post
I don't want to pass the entire post in with HTML(p.Body), as it represents third party data from potentially untrustworthy sources. Is there a way to whitelist only some tags for formatting purposes using the vanilla Go1 template package?
I do think you want to parse the HTML. The HTML parser in exp/html was deemed incomplete and so removed from Go 1, although the exp tree is still in the Go source tree and can be accessed by weekly tag, for example. I don't know exactly what is incomplete. I used it for a simple task once and it met my needs.
Also, of course, check the dashboard and see the related SO post, Any smart method to get exp/html back after Go1?, mostly for the recommendation of http://code.google.com/p/go-html-transform/
I'm afraid the template package can't help much with this. If you want to remove specific (blacklisted) tags (that is, the subtree enclosed by such tags), or to let only specific (whitelisted) tags through, then I think nothing less than parsing and rewriting the HTML AST can be a good solution. That said, one sees here and there some crazy REs trying to do the same, but I don't consider that a "good solution", and I doubt they can be a "correct" solution in the general case of spec-conforming HTML, including its several legal irregularities, since the problem probably falls outside the class of regular grammars.

Customizing Containable Content in Orchard CMS

I am currently trying to understand a bit more about how Orchard handles Lists of Custom Content Types and I have run into a bit of an issue.
I created a Content Type named Story, which has the following parts:
Body
Common
Containable
Route
I created a list that holds these items, and all I am attempting to do is style them in such a way:
Story Title
Story Description (Basically a truncated version of the body?)
However, I cannot seem to figure out how to do the following:
Get the Title to actually appear (Currently all that appears is the body and a more link)
Remove the "more" link (and change this to be the actual Title)
I have looked into changing the Placement.info, and have looked all over in an attempt to find where the "more" link is added in each of the items. Any help would be greatly appreciated.
I finally managed to figure it out - Thanks to the Designer Tools Module, which made it very simple to go look into what was going on behind the scenes during Page Generation.
Basically - all that was necessary to accomplish this was to make some minor changes to the Parts.Common.Body.Summary.cshtml file. (found via ../Core/Common/Views/)
Which initially resembles the following:
@{
[~.ContentItem] contentItem = Model.ContentPart.ContentItem;
string bodyHtml = Model.Html.ToString();
var body = new HtmlString(Html.Excerpt(bodyHtml, 200).ToString()
.Replace(Environment.NewLine,"</p>"+Environment.NewLine+"<p>"));
}
<p>@body @Html.ItemDisplayLink(T("more").ToString(), contentItem)</p>
However, by making a few changes (using the Designer Tools), I changed it into the following:
@{
[~.ContentItem] contentItem = Model.ContentPart.ContentItem;
string bodyHtml = Model.Html.ToString();
string title = Model.ContentPart.ContentItem.RoutePart.Title;
string summary = Html.Excerpt(bodyHtml, 100) + "...";
}
<div class='story'>
<p>
@Html.ItemDisplayLink(title, contentItem)
</p>
<summary>
@summary
</summary>
</div>
Although it could easily be shortened a bit, it does make the styling quite a bit easier to handle. Anyway, I hope this helps :)
Alternatively, you can use the placement.info file in your theme to assign different fields to your Summary and Detail views. It's much simpler.
http://orchardproject.net/docs/Understanding-placement-info.ashx
But, I used the same method you did till I discovered the .info file as well. It works and gives you a good understanding of how the system works, but the placement.info file seems easier.
Also, you probably don't want to be editing the view files in Core. I think you're meant to override views in your theme directory.

How do I encode html leaving out the safe html

My data coming from the database might contain some html. If I use
string dataFromDb = "Some text<br />some more <br><ul><li>item 1</li></ul>";
HttpContext.Current.Server.HtmlEncode(dataFromDb);
Then everything gets encoded and I see the safe Html on the screen.
However, I want to be able to execute the safe html as noted in the dataFromDb above.
I think I am trying to create white list to check against.
How do I go about doing this?
Is there some Regex already out there that can do this?
Check out this article. The AntiXSS library is also worth a look.
You should use the Microsoft AntiXSS library. I believe the latest version is available here. Specifically, you'll want to use the GetSafeHtmlFragment method.
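If a library is not an option, the white-list idea from the question can be approximated without regex by encoding everything and then restoring only an exact, attribute-free set of tags. This is a different and much cruder technique than AntiXSS's sanitizer, sketched here in Go for illustration; the allowed list and the function name encodeExceptAllowed are assumptions, not part of any library.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// allowed is an assumed white list of attribute-free tags.
var allowed = []string{"br", "ul", "ol", "li", "b", "i", "p"}

// encodeExceptAllowed (hypothetical helper) HTML-encodes s, then
// restores only the exact forms <tag>, </tag> and <tag /> for the
// allowed tags. Anything with attributes stays encoded.
func encodeExceptAllowed(s string) string {
	escaped := html.EscapeString(s)
	var pairs []string
	for _, tag := range allowed {
		pairs = append(pairs,
			"&lt;"+tag+"&gt;", "<"+tag+">",
			"&lt;/"+tag+"&gt;", "</"+tag+">",
			"&lt;"+tag+" /&gt;", "<"+tag+" />",
		)
	}
	return strings.NewReplacer(pairs...).Replace(escaped)
}

func main() {
	dataFromDb := "Some text<br />some more <br><ul><li>item 1</li></ul><script>alert(1)</script>"
	fmt.Println(encodeExceptAllowed(dataFromDb))
}
```

Because the encoding runs first, text that already contains an encoded tag such as &lt;br&gt; is double-encoded and never restored, so only tags literally present in the input can come back. Tags that need attributes (links, images) are out of scope for this approach and are where a real sanitizer earns its keep.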