XHTML5 and HTML4 character entities

Does XHTML5 support character entities such as &nbsp; and &mdash;? At work we can require specific software to access the admin side of the site, and people are demanding multi-file upload. For me this is an easy justification to require migrating to FF 3.6+, so I'll be doing it soonish. We currently use XHTML 1.1, and upon moving to HTML5, the only issues I'm having are with character entity names... Does anyone have a doc on this?
I see there is a list in the WHATWG spec, but I'm not sure whether it applies to files served as application/xhtml+xml. In any case, the two entities mentioned trigger errors in both a Chromium nightly and FF 3.6.

There is no DTD for XHTML5, so an XML parser will see no entity definitions (other than the predefined ones). If you wanted to use an entity you would have to define it for yourself in the internal subset.
<!DOCTYPE html [
<!ENTITY mdash "—">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
... &mdash; ...
</html>
(Of course using the internal subset is likely to trip browsers up if you serve it to them as text/html. Sending an internal subset in a non-XHTML HTML5 document is disallowed.)
The HTML5 wiki currently recommends:
Do not use entity references in XHTML (except for the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;)
And I agree with this advice not just for XHTML5 but for XML and HTML in general. There's little reason to be using HTML entities for anything today. Unicode characters typed directly are far more readable for everyone, and &#...; character references are available for those sad cases when you can't guarantee an 8-bit/encoding-clean transport. (Since HTML entities are not defined for the majority of Unicode characters, you are going to need those anyway.)
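If you do find yourself on one of those transports, converting everything outside ASCII to numeric character references is only a few lines of code. Here is a minimal sketch in Java (my own illustration, not part of the original answer; the class and method names are made up):
public class NumericRefs {
    // Emit numeric character references for anything outside ASCII,
    // for the cases where the transport is not 8-bit/encoding-clean.
    static String toNumericRefs(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);  // plain ASCII passes through untouched
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericRefs("em dash: \u2014"));  // prints: em dash: &#x2014;
    }
}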

I needed to validate, as XML, documents that might be HTML5. HTML 4 and XHTML only had a modest 250 or so entities, while the current draft (January 2012) has more than 2000.
GET 'http://www.w3.org/TR/html5-author/named-character-references.html' |
xmllint --html --xmlout --format --noent - |
egrep '<code|<span.*glyph' | # get only the bits we're interested in
sed -e 's/.*">/__/' | # Add some "__" markers to make e.g. whitespace
sed -e 's/<.*/__/' | # entities work with xargs
sed 's/"/\&quot;/' | # xmllint output contains " which messes up xargs
sed "s/'/\&apos;/" | # ditto apostrophes. Make them HTML entities instead.
xargs -n 2 echo | # Put the entity names and values on one line
sed 's/__/<!ENTITY /' | # Make a DTD
sed 's/;__/ /' |
sed 's/ __/"/' |
sed 's/__$/">/' |
egrep -v '\bapos\b|\bquot\b|\blt\b|\bgt\b|\bamp\b' # remove XML entities.
You end up with a file containing 2114 entities.
<!ENTITY AElig "Æ">
<!ENTITY Aacute "Á">
<!ENTITY Abreve "Ă">
<!ENTITY Acirc "Â">
<!ENTITY Acy "А">
<!ENTITY Afr "𝔄">
Plugging this into an XML parser should allow it to resolve these character entities.
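For example, here is a rough Java sketch (my addition, not the original answer's, and untested against the full entity set) that splices the generated declarations into the internal subset before parsing; html5-entities.ent is just an assumed name for wherever you saved the output:
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityAwareParse {
    public static void main(String[] args) throws Exception {
        // read the generated <!ENTITY ...> declarations and use them as the internal subset
        String decls = new String(Files.readAllBytes(Paths.get("html5-entities.ent")), StandardCharsets.UTF_8);
        String doctype = "<!DOCTYPE html [\n" + decls + "\n]>\n";
        String body = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>&mdash; and &hellip;</body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(doctype + body)));

        // the named entities have been resolved to literal characters by now
        System.out.println(doc.getElementsByTagName("body").item(0).getTextContent());
    }
}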
Update October 2012: Since the working draft now has a JSON file (yes, I'm still using regular expressions) I worked it down to a single sed:
curl -s 'http://www.w3.org/TR/html5-author/entities.json' |
sed -n '/^ "&/s/"&\([^;"]*\)[^0-9]*\[\([0-9]*\)\].*/<!ENTITY \1 "\&#\2;">/p' |
uniq
Of course a JavaScript equivalent would be a lot more robust, but not everyone has Node installed. Everyone has sed, right? Random sample output:
<!ENTITY subsetneqq "⫋">
<!ENTITY subsim "⫇">
<!ENTITY subsub "⫕">
<!ENTITY subsup "⫓">
<!ENTITY succapprox "⪸">
<!ENTITY succ "≻">
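If you would rather skip the regular expressions entirely, the same conversion is straightforward with a real JSON parser. A Java sketch (mine, untested; it assumes the Jackson databind library is on the classpath):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URL;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class EntitiesJsonToDtd {
    public static void main(String[] args) throws Exception {
        JsonNode root = new ObjectMapper()
                .readTree(new URL("http://www.w3.org/TR/html5-author/entities.json"));
        Set<String> seen = new LinkedHashSet<>();  // the file lists both "&amp" and "&amp;"; emit each name once
        for (Iterator<Map.Entry<String, JsonNode>> it = root.fields(); it.hasNext(); ) {
            Map.Entry<String, JsonNode> entry = it.next();
            String name = entry.getKey().replaceAll("^&|;$", "");
            if (!seen.add(name)) continue;
            StringBuilder value = new StringBuilder();
            for (JsonNode cp : entry.getValue().get("codepoints")) {
                value.append("&#").append(cp.asInt()).append(';');
            }
            System.out.println("<!ENTITY " + name + " \"" + value + "\">");
        }
    }
}
As with the sed version, you may still want to filter out amp, lt, gt, quot and apos before using the result as a DTD.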

My best advice is to not upgrade to HTML5 or XHTML5 until support for character entity names is provided.
Anyone who thinks that &#12345; makes more sense than &mdash; needs a brain upgrade. Most people can't remember huge tables of numbers.
Those of us who have to remain with older operating systems to be compatible with existing scientific, real-time, or point-of-sale hardware (or government networks) can't just type the character or pick it from a list. It won't save correctly in the file.
The reason this has been imposed on us is that the W3C no longer wants the expense of serving DTD files, so we must go back to the stone age.
A facility like this, once provided, should never be deprecated.

The right answer (the modern way)
I asked this question five years ago. Now every browser supports UTF-8, and every character that has a named entity can be written directly in UTF-8. The right solution today is not to use named entities at all, but to serve only UTF-8 (strict) and to use the actual characters in it.
This is a list of all XML entities. All of these have UTF-8 character alternatives -- and that's how they'd normally be rendered anyway.
For instance, take
U+1D6D8, MATHEMATICAL BOLD SMALL CHI, b.chi
I suppose in some variant of XML you could have &b.chi; or something similar. Searching for MATHEMATICAL BOLD SMALL CHI, you'll find a page on fileformat.info which lists the 𝛘 character.
Alternatively, in Windows you can type Alt + 1 D 6 D 8 (the 1D6D8 comes from the table of XML entities), or in Linux Ctrl + Shift + u 1 D 6 D 8.
This will put the character into your document the right way.
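If the document is being generated programmatically rather than typed, the same code point can simply be emitted from code. A tiny Java illustration (mine, not the answer's):
// U+1D6D8 is outside the Basic Multilingual Plane, so in a UTF-16 string it
// becomes a surrogate pair; Character.toChars takes care of that.
String chi = new String(Character.toChars(0x1D6D8));
System.out.println(chi);                                   // prints 𝛘
System.out.println(chi.length());                          // 2 UTF-16 code units
System.out.println(chi.codePointCount(0, chi.length()));   // 1 code point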

Using the following answer: https://stackoverflow.com/a/9003931/689044, I created the file and posted it as a Gist on GitHub (https://gist.github.com/cerkit/c2814d677854308cef57) for those of you who need the entities in a file.
I used it successfully with ASP.NET MVC by loading the text file into the Application object and prepending that value to my (well-formed) HTML before parsing it into a System.Xml.XmlDocument.
XmlDocument doc = new XmlDocument();
// load the HTML entities into the document and add a root element so it will load
// The HTML entity declarations are required, or the document won't load if it uses any entities (ex: &ndash;)
doc.LoadXml(string.Format("{0}<root>{1}</root>", Globals.HTML_ENTITIES, control.HtmlText));
var childNodes = doc.SelectSingleNode("//root").ChildNodes;
// do your work here
foreach (XmlNode node in childNodes)
{
    // or here
}
Globals.HTML_ENTITIES is a static property that loads the entities from the text file and stores them in the Application object, or it uses the values if they're already loaded in the Application object.
public static class Globals
{
    public static readonly string APPLICATION_KEY_HTML_ENTITIES = "HTML_ENTITIES";

    public static string HTML_ENTITIES
    {
        get
        {
            string retVal = null;
            // load the HTML entities from a text file if they're not in the Application object
            if (HttpContext.Current.Application[APPLICATION_KEY_HTML_ENTITIES] != null)
            {
                retVal = HttpContext.Current.Application[APPLICATION_KEY_HTML_ENTITIES].ToString();
            }
            else
            {
                using (StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath("~/Content/HtmlEntities/RootHtmlEntities.txt")))
                {
                    retVal = sr.ReadToEnd();
                    HttpContext.Current.Application[APPLICATION_KEY_HTML_ENTITIES] = retVal;
                }
            }
            return retVal;
        }
    }
}
I tried creating a long string to hold the values, but it kept crashing Visual Studio, so I decided that the best route would be to load the text file at runtime and store it in the Application object.

Related

Matching and storing part of a string in a variable using JScript.NET

I am fiddling with a script for Fiddler, which uses JScript.NET. I have a string of the format:
{"params":{"key1":"somevalue","key2":"someothervalue","key3":"whatevervalue", ...
I want to match and show "key2":"someothervalue" where someothervalue could be any value but the key is static.
Using good old sed and bash I can replace the part I am looking for with:
$ a='{"params":{"key1":"somevalue","key2":"someothervalue","key3":"whatevervalue", ...'
$ echo $a | sed -r 's/"key2":"[^"]+"/replaced/g'
{"params":{"key1":"somevalue",replaced,"key3":"whatevervalue", ...
Now. Instead of replacing it, I want to extract that part into a variable using JScript.NET. How can that be done?
The most graceful way is to use a JSON parser. My personal preference is to import IE's JSON parser using the htmlfile COM object.
import System;
var str:String = '{"params":{"key1":"foo","key2":"bar","key3":"baz"}}',
    htmlfile = new ActiveXObject('htmlfile');

// force htmlfile COM object into IE9 compatibility
htmlfile.IHTMLDocument2_write('<meta http-equiv="x-ua-compatible" content="IE=9" />');

// clone JSON object and methods into familiar syntax
var JSON = htmlfile.parentWindow.JSON,
    // deserialize your JSON-formatted string
    obj = JSON.parse(str);

// access JSON values as members of a hierarchical object
Console.WriteLine("params.key2 = " + obj.params.key2);

// beautify the JSON
Console.WriteLine(JSON.stringify(obj, null, '\t'));
Compiling, linking, and running results in the following console output:
params.key2 = bar
{
    "params": {
        "key1": "foo",
        "key2": "bar",
        "key3": "baz"
    }
}
Alternatively, there are also at least a couple of .NET namespaces which provide methods to serialize objects into a JSON string and to deserialize a JSON string into objects. Can't say I'm a fan, though. The ECMAScript notation of JSON.parse() and JSON.stringify() is certainly a lot easier and profoundly less alien than whatever neckbeard madness is going on at Microsoft.
And while I certainly don't recommend scraping JSON (or any other hierarchical markup, if it can be helped) as complicated text, JScript.NET will handle a lot of familiar JavaScript methods and objects, including regex objects and regex replacements on strings.
sed syntax:
echo $a | sed -r 's/("key2"):"[^"]*"/\1:"replaced"/g'
JScript.NET syntax:
print(a.replace(/("key2"):"[^"]*"/, '$1:"replaced"'));
JScript.NET, just like JScript and JavaScript, also allows for calling a lambda function for the replacement.
print(
    a.replace(
        /"(key2)":"([^"]*)"/,
        // $0 = full match; $1 = (key2); $2 = ([^"]*)
        function($0, $1, $2):String {
            var replace:String = $2.toUpperCase();
            return '"' + $1 + '":"' + replace + '"';
        }
    )
);
... Or to extract the value of key2 using the RegExp object's exec() method:
var extracted:String = /"key2":"([^"]*)"/.exec(a)[1];
print(extracted);
Just be careful with that, though: exec() returns null when there is no match, so taking element [1] of the result will throw. You might want to guard it with if (/"key2":/.test(a)) or add a try...catch. Or better yet, just do what I said earlier and deserialize your JSON into an object.

Annotating a document with JAPE

I have been searching for a solution to this for weeks. I have some documents (about 95) that I am trying to classify using GATE. I have put them in one corpus that I called training_corpus. However, after ANNIE has annotated the corpus, I have to go back into each file, select all tokens in the document, and create an annotation called Mention, with a feature type whose value is the class for the document. For example:
type Start End id Features
Mention 0 70000 2588 {type=neg}
Is there any way to automatically do this with JAPE? Basically, I want to select all tokens and create a new annotation with a feature (type=class). Also, the class is part of the document name. Since there are many documents, can JAPE extract the class from the document name and set it as the value of the Mention's type feature? For example, for a document named neg_data1.txt, the annotation would be Mention.type = neg.
Any help will be greatly appreciated. Thanks
I think you answered your question yourself. If the class assignment is based on just a token being present in the text, why not simply process the text outside of GATE?
For example, create an XML file that wraps the text with the class as markup, and then use it in the training process.
Also, you can create a simple JAPE rule which will:
a) take the text within the document boundaries (see the gate.Utils.length methods, AFAIR)
b) based on the presence of your token, create a new Annotation instance with the necessary features.
An abstract example:
Phase: Instance
Input: Token
Options: control = once

Rule: Instance
(
    {Token}
):instance
-->
{
    // look for a previously created annotation of interest
    AnnotationSet instances = outputAS.get("INSTANCE_ANNOTATION");
    FeatureMap featureMap = Factory.newFeatureMap();
    if (instances != null && !instances.isEmpty()) {
        // features to set when the annotation is present in the doc
        featureMap.put("type", "pos");
    } else {
        // features to set when the annotation is not in the doc
        featureMap.put("type", "neg");
    }
    try {
        // one Mention annotation spanning the whole document
        outputAS.add(0L, doc.getContent().size(), "Mention", featureMap);
    } catch (InvalidOffsetException e) {
        throw new GateRuntimeException(e);
    }
}
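The question also asks whether the class can be taken from the document name (neg_data1.txt giving neg). A JAPE rule with a Java right-hand side could do that too; the following is only a sketch of the idea (my own, untested), assuming the document name is available via doc.getName():
Phase: MentionFromDocName
Input: Token
Options: control = once

Rule: MentionFromDocName
(
    {Token}
):span
-->
{
    // derive the class from a document name like "neg_data1.txt" -> "neg"
    String docName = doc.getName();
    String docClass = docName.contains("_")
            ? docName.substring(0, docName.indexOf('_'))
            : docName;

    FeatureMap fm = Factory.newFeatureMap();
    fm.put("type", docClass);
    try {
        // one Mention annotation spanning the whole document
        outputAS.add(0L, doc.getContent().size(), "Mention", fm);
    } catch (InvalidOffsetException e) {
        throw new GateRuntimeException(e);
    }
}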

How to replace text in a content control after XML binding using docx4j

I am using docx4j 2.8.1 with content controls in my .docx file. I can replace the CustomXML part by injecting my own XML and then calling BindingHandler.applyBindings after supplying the input XML. I can add a token in my XML, such as ¶, that I would then like to replace in the MainDocumentPart. But with that approach, when I iterate through the content of the MainDocumentPart with this (link) method, none of the text from my XML is even in the collection extracted from the MainDocumentPart. I am thinking that even after binding, the XML remains separate from the MainDocumentPart (??)
I haven't tried this with anything more than a little test doc yet. My token is the Pilcrow: ¶. Since it's a single character, it won't be split in separate runs. My code is:
private void injectXml(WordprocessingMLPackage wordMLPackage) throws JAXBException {
    MainDocumentPart part = wordMLPackage.getMainDocumentPart();
    String xml = XmlUtils.marshaltoString(part.getJaxbElement(), true);
    xml = xml.replaceAll("¶", "</w:t><w:br/><w:t>");
    Object obj = XmlUtils.unmarshalString(xml);
    part.setJaxbElement((Document) obj);
}
The pilcrow character comes from the XML and is injected by applying the XML bindings to the content controls. The problem is that the content from the XML does not seem to be in the MainDocumentPart so the replace doesn't work.
(Using docx4j 2.8.1)

MediaWiki: is there a way to automatically create redirect pages that redirect to the current page?

My hobby is writing up stuff on a personal wiki site: http://comp-arch.net.
Currently it uses MediaWiki (although I often regret having chosen it, since I need per-page access control).
Often I create pages that define several terms or concepts on the same page. E.g. http://semipublic.comp-arch.net/wiki/Invalidate_before_writing_versus_write_through_is_the_invalidate.
Oftentimes such "A versus B" pages provide the only definitions of A and B. Or at least the only definitions that I have so far gotten around to writing.
Sometimes I will define many more than two topics on the same page.
If I create such an "A vs B" or other page containing multiple definitions D1, D2, ... DN, I would like to automatically create redirect pages, so that I can say [[A]] or [[B]] or [[D1]] .. [[DN]] in other pages.
At the moment the only way I know of to create such pages is manually. It's hard to keep up.
Furthermore, at the time I create such a page, I would like to provide some page text - typically a category.
Here's another example: variant page names. I often find that I want to create several variants of a page name, all linking to the same place. For example
[[multithreading]],
[[multithreading (MT)]],
[[MT (multithreading)]],
[[MT]]
Please don't tell me to use piped links. That's NOT what I want!
TWiki has plugins such as TOPICCREATE, which automatically creates topics or attaches files at topic save time.
More than that, I remember a TWiki plugin, whose name I cannot remember or google up, that included the text of certain subpages within your current page. You could then edit all of these pages together and save, and the text would be extracted and distributed as needed. (By the way, if you can remember the name of that package, please remind me. It had certain problems, particularly wrt file locking (IIRC it only locked the top file for editing, not the sub-topics, so you could lose stuff.))
But this last, in combination with parameterized templates, would be almost everything I need.
Q: does MediaWiki have something similar? I can't find it.
I suppose that I could write my own robot to perform such actions.
It's possible to do this, although I don't know whether such extensions exist already. If you're not averse to a bit of PHP coding, you could write your own using the ArticleSave and/or ArticleSaveComplete hooks.
Here's an example of an ArticleSaveComplete hook that will create redirects to the page being saved from all section titles on the page:
$wgHooks['ArticleSaveComplete'][] = 'createRedirectsFromSectionTitles';
function createRedirectsFromSectionTitles( &$page, &$user, $text ) {
    // do nothing for pages outside the main namespace:
    $title = $page->getTitle();
    if ( $title->getNamespace() != 0 ) return true;

    // extract section titles:
    // XXX: this is a very quick and dirty implementation;
    // it would be better to call the parser
    preg_match_all( '/^(=+)\s*(.*?)\s*\1\s*$/m', $text, $matches );

    // create a redirect for each title, unless they exist already:
    // (invalid titles and titles outside ns 0 are also skipped)
    foreach ( $matches[2] as $section ) {
        $nt = Title::newFromText( $section );
        if ( !$nt || $nt->getNamespace() != 0 || $nt->exists() ) continue;

        $redirPage = WikiPage::factory( $nt );
        if ( !$redirPage ) continue; // can't happen; check anyway

        // initialize some variables that we can reuse:
        if ( !isset( $redirPrefix ) ) {
            $redirPrefix = MagicWord::get( 'redirect' )->getSynonym( 0 );
            $redirPrefix .= '[[' . $title->getPrefixedText() . '#';
        }
        if ( !isset( $reason ) ) {
            $reason = wfMsgForContent( 'editsummary-auto-redir-to-section' );
        }

        // create the page (if we can; errors are ignored):
        $redirText = $redirPrefix . $section . "]]\n";
        $flags = EDIT_NEW | EDIT_MINOR | EDIT_DEFER_UPDATES;
        $redirPage->doEdit( $redirText, $reason, $flags, false, $user );
    }
    return true;
}
Note: Much of this code is based on bits and pieces of the pagemove redirect creating code from Title.php and the double redirect fixer code, as well as the documentation for WikiPage::doEdit(). I have not actually tested this code, but I think it has at least a decent chance of working as is. Note that you'll need to create the MediaWiki:editsummary-auto-redir-to-section page on your wiki to set a meaningful edit summary for the redirect edits.

How to use regex in selenium locators

I'm using selenium RC and I would like, for example, to get all the links elements with attribute href that match:
http://[^/]*\d+.com
I would like to use:
sel.get_attribute( '//a[regx:match(@href, "http://[^/]*\d+.com")]/@name' )
which would return a list of the name attribute of all the links that match the regex.
(or something like it)
thanks
The answer above is probably the right way to find ALL of the links that match a regex, but I thought it'd also be helpful to answer the other part of the question, how to use regex in Xpath locators. You need to use the regex matches() function, like this:
xpath=//div[matches(@id,'che.*boxes')]
(this, of course, would click the div with 'id=checkboxes', or 'id=cheANYTHINGHEREboxes')
Be aware, though, that the matches function is not supported by all native browser implementations of Xpath (most conspicuously, using this in FF3 will throw an error: invalid xpath[2]).
If you have trouble with your particular browser (as I did with FF3), try using Selenium's allowNativeXpath("false") to switch over to the JavaScript Xpath interpreter. It'll be slower, but it does seem to work with more Xpath functions, including 'matches' and 'ends-with'. :)
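For example, with the Java client this might look something like the following (a small illustration of mine, not tested):
// fall back to the JavaScript XPath engine so functions like matches() are available
selenium.allowNativeXpath("false");
selenium.click("xpath=//div[matches(@id, 'che.*boxes')]");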
You can use the Selenium command getAllLinks to get an array of the IDs of the links on the page, which you could then loop through, checking the href using getAttribute, which takes the locator followed by an @ and the attribute name. For example, in Java this might be:
String[] allLinks = selenium.getAllLinks();
List<String> matchingLinks = new ArrayList<String>();
for (String linkId : allLinks) {
    String linkHref = selenium.getAttribute("id=" + linkId + "@href");
    if (linkHref.matches("http://[^/]*\\d+\\.com")) {
        matchingLinks.add(linkId);
    }
}
A possible solution is to use sel.get_eval() and write a JS script that returns a list of the links. Something like the following answer:
selenium: Is it possible to use the regexp in selenium locators
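For instance, something along these lines with the Java client (my own rough, untested illustration of that idea):
// run JavaScript in the page and bring back the names of links whose href matches the pattern
String js =
    "var out = [];" +
    "var links = this.browserbot.getCurrentWindow().document.getElementsByTagName('a');" +
    "for (var i = 0; i < links.length; i++) {" +
    "  if (/http:\\/\\/[^\\/]*\\d+\\.com/.test(links[i].href)) { out.push(links[i].name); }" +
    "}" +
    "out.join(',');";
String[] matchingNames = selenium.getEval(js).split(",");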
Here are some alternative methods as well for Selenium RC. These aren't pure Selenium solutions; they allow interaction between your programming language's data structures and Selenium.
You can also get the HTML page source, then run regular expressions against the source to return a matching set of links. Use regex grouping to separate out URLs, link text/IDs, etc., and you can then pass them back to Selenium to click on or navigate to.
Another method is to get the HTML page source or the innerHTML (via DOM locators) of a parent/root element, then convert the HTML to XML as a DOM object in your programming language. You can then traverse the DOM with the desired XPath (with regular expressions or not) and obtain a node set of only the links of interest. From there, parse out the link text/ID or URL and pass it back to Selenium to click on or navigate to.
Upon request, I'm providing examples below. It's mixed languages since the post didn't appear to be language specific anyway. I'm just using what I had available to hack together examples. They aren't fully tested, but I've worked with bits of the code before in other projects, so these are proof-of-concept examples of how you'd implement the solutions I just mentioned.
//Example of element attribute processing by page source and regex (in PHP)
$pgSrc = $sel->getPageSource();
//simple hyperlink extraction via regex below, replace with better regex pattern as desired
preg_match_all("/<a.+href=\"(.+)\"/",$pgSrc,$matches,PREG_PATTERN_ORDER);
//$matches is a 2D array, $matches[0] is array of whole string matched, $matches[1] is array of what's in parenthesis
//you either get an array of all matched link URL values in parenthesis capture group or an empty array
$links = count($matches) >= 2 ? $matches[1] : array();
//now do as you wish, iterating over all link URLs
//NOTE: these are URLs only, not actual hyperlink elements
//Example of XML DOM parsing with Selenium RC (in Java)
String locator = "id=someElement";
String htmlSrcSubset = sel.getEval("this.browserbot.findElement(\""+locator+"\").innerHTML");
//using JSoup XML parser library for Java, see jsoup.org
Document doc = Jsoup.parse(htmlSrcSubset);
/* once you have this document object, can then manipulate & traverse
it as an XML/HTML node tree. I'm not going to go into details on this
as you'd need to know XML DOM traversal and XPath (not just for finding locators).
But this tutorial URL will give you some ideas:
http://jsoup.org/cookbook/extracting-data/dom-navigation
the example there seems to indicate first getting the element/node defined
by content tag within the "document" or source, then from there get all
hyperlink elements/nodes and then traverse that as a list/array, doing
whatever you want with an object oriented approach for each element in
the array. Each element is an XML node with properties. If you study it,
you'd find this approach gives you the power/access that WebDriver/Selenium 2
now gives you with WebElements but the example here is what you can do in
Selenium RC to get similar WebElement kind of capability
*/
Selenium's By.Id and By.CssSelector methods do not support Regex, and By.XPath only does so where XPath 2.0 is enabled. If you want to use Regex, you can do something like this:
// requires: using System.Collections.Generic; using System.Text.RegularExpressions; using OpenQA.Selenium;
void MyCallingMethod(IWebDriver driver)
{
    // Search by ID:
    string attrName = "id";
    // Regex = 'a number that is 1-10 digits long'
    string attrRegex = "[0-9]{1,10}";
    SearchByAttribute(driver, attrName, attrRegex);
}

IEnumerable<IWebElement> SearchByAttribute(IWebDriver driver, string attrName, string attrRegex)
{
    List<IWebElement> elements = new List<IWebElement>();
    // Allows spaces around equal sign. Ex: id = 55
    string searchString = attrName + "\\s*=\\s*\"" + attrRegex + "\"";
    // Search page source
    MatchCollection matches = Regex.Matches(driver.PageSource, searchString, RegexOptions.IgnoreCase);
    // iterate over matches
    foreach (Match match in matches)
    {
        // Get exact attribute value
        Match innerMatch = Regex.Match(match.Value, attrRegex);
        string cssSelector = "[" + attrName + "='" + innerMatch.Value + "']";
        // Find element by exact attribute value
        elements.Add(driver.FindElement(By.CssSelector(cssSelector)));
    }
    return elements;
}
Note: this code is untested. Also, you can optimize this method by figuring out a way to eliminate the second search.