How to convert HTML-mixed Markdown to HTML/docx/PDF?

I'm working in the Azure DevOps wiki to create specifications and other software documentation.
I have to create tables and, in particular, some bulleted lists inside them. This is possible in GitHub Flavored Markdown (specifically in Azure DevOps):
#header1
|TableHeader1|TableHeader2|
|--|--|
|Text1|Details 1|
|ListCell|<ul><li>FirstBullet</li><li>SecondBullet</li></ul>|
(Screenshot of the rendered HTML output.)
I tried Pandoc first, but the list falls out of the table.
Any ideas for converting this to HTML/docx?
Regards,
Andras

You probably can't. As the Pandoc documentation warns:
Because pandoc’s intermediate representation of a document is less
expressive than many of the formats it converts between, one should
not expect perfect conversions between every format and every other.
Pandoc attempts to preserve the structural elements of a document, but
not formatting details such as margin size. And some document
elements, such as complex tables, may not fit into pandoc’s simple
document model. While conversions from pandoc’s Markdown to all
formats aspire to be perfect, conversions from formats more expressive
than pandoc’s Markdown can be expected to be lossy.
HTML is certainly more expressive than Markdown. Therefore, Pandoc does not guarantee that HTML source will be preserved. That said, a simple list is something that can be expressed in Markdown just fine, so one would think that would not be lossy.
However, the table complicates things. Pandoc actually supports four different table formats, but only two of them (multiline and grid tables) support cells which contain block-level elements.
You appear to be using pipe_tables, which do not support block-level elements within table cells. As the documentation states:
The cells of pipe tables cannot contain block elements like paragraphs and lists, and cannot span multiple lines.
While all of the above extensions (table formats) are supported by Pandoc's markdown format, only pipe_tables is supported by the gfm format (see Markdown Variants). Therefore, you might consider using the markdown format instead. However, that will only help if your table actually uses the proper syntax for grid or multiline tables.
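For illustration, here is a rough sketch of the same table rewritten as a Pandoc grid table (the cell contents and file names are only placeholders); read with Pandoc's own markdown reader, the list stays inside the cell:

+--------------+------------------+
| TableHeader1 | TableHeader2     |
+==============+==================+
| Text1        | Details 1        |
+--------------+------------------+
| ListCell     | - FirstBullet    |
|              | - SecondBullet   |
+--------------+------------------+

pandoc -f markdown -t docx -o spec.docx spec.md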
Unfortunately, grid and multiline tables are only supported by Pandoc; I'm not aware of any other Markdown implementation which supports them. Therefore, a table with block-level elements cannot be written in a syntax that both Pandoc and other implementations will parse.
So why does the other implementation you are using work fine with a raw HTML list within a table cell? Presumably the parser is not very smart and is blindly passing the raw HTML through unaltered. Any more sophisticated parser which attempts to understand the raw HTML would not work for you. And, of course, if you want to convert the document to another (non-HTML) format, then the parser needs to understand the raw HTML.
Maybe you could find some random parser which does what you want, but it is not likely. A better solution might be to take the HTML output of your other Markdown tool and use Pandoc (or another tool) to convert that to docx/pdf.
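As a rough sketch of that last suggestion (the file names are placeholders), assuming you can save or export the rendered wiki page as HTML:

pandoc -f html -t docx -o spec.docx wiki-export.html

For PDF output, Pandoc additionally needs a PDF engine (by default a LaTeX installation) on the system.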

Related

eXist DB and Xquery : xincludes or collections (TEI-XML)?

I have a corpus in TEI-XML which uses a 'master' corpus XML document that then contains, via xi:include, thousands of other documents. Each of these documents itself contains xi:includes pointing to master lists of named entities (people, places, etc., linked by xml:ids). All of this works very well in XSLT (and in my IDE, Oxygen, for fast encoding).
I am now embarking on building a website using eXist-DB applications. I am rewriting everything directly in XQuery (to replace XSLT), and I have hit upon an unexpected decision. I am used to using xi:includes to traverse the corpus and the various XML files. But reading the documentation of eXist-DB, it seems that the encouraged practice is to use collections and query them directly, instead of navigating via xi:includes. It also seems that eXist-DB does not support the full implementation of xi:include anyway and requires some workarounds?
I am looking for guidance as to best practices of eXist-DB/Xquery in this context.
Many thanks in advance.
Correct, eXist's XInclude implementation is focused on output (i.e., serialization) rather than on querying or indexing. As eXist's documentation page on XInclude states:
The XInclude processor is implemented as a filter in between the serializer's output event stream and the receiver... XInclude processing is therefore applied whenever eXist-db serializes an XML fragment, whether it's a document, the result of an XQuery or an XSLT stylesheet.
Thus, if you use XInclude to assemble your corpus and you want to query/traverse this corpus, you could do so by (1) writing a query to read your XInclude and following it like a map to find the component documents, (2) pre-serializing your data into a new document and then querying the resulting document directly, or (3) placing the documents into collections that facilitate the kinds of queries you want to do.
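As a rough sketch of option (1) (the database paths are placeholders, and the hrefs are assumed to be relative to the master document's collection), you could follow the xi:include elements yourself:

declare namespace xi = "http://www.w3.org/2001/XInclude";
let $master := doc("/db/apps/corpus/master.xml")
for $inc in $master//xi:include
return doc(concat("/db/apps/corpus/", $inc/@href))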
Depending on the size of those thousands of documents, traversing the XIncludes when running XQueries tends to be slow and quite memory-intensive. In my experience, Joe's option 3 is usually the way to go.
Unlike with straight-up XSLT, in eXist-db you can define indexes. E.g. you might have a <listPerson> element as a wrapper for thousands of xi:includes pointing to <person> elements that are each the root of their own document.
If you have defined an index for <person>, you can use e.g. ft:query() to query the index directly, irrespective of where in the tree of sub-collections and documents the element is located. This tends to be orders of magnitude faster than traversing the whole document starting at the master and resolving the XIncludes.
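A rough sketch of what that might look like (the element names, TEI namespace binding, and collection paths are assumptions for illustration): a Lucene index declared in the collection's collection.xconf, and a query against it:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <lucene>
            <text qname="tei:person"/>
        </lucene>
    </index>
</collection>

declare namespace tei = "http://www.tei-c.org/ns/1.0";
for $p in collection("/db/apps/corpus/data")//tei:person[ft:query(., "Smith")]
return $p

After storing the collection.xconf (typically under /db/system/config/, mirroring the path of the data collection) and re-indexing, the query hits the index instead of walking the whole tree.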
As for validation, you will need to decide if a full validation run of the whole expanded document is really always necessary. This requires some fiddling, but there isn't much general advice I can offer, without seeing the actual files and code.
You can find more information about indexing in the eXist documentation.

Generate inline rather than list-style footnotes in Pandoc Markdown output?

When converting from some format (say, HTML or Docx) to Markdown in Pandoc, is it possible to render all footnotes in the inline style ("this is the main text^[this is a footnote]") rather than as numbered references with a corresponding list at the end of the document? I want to work on my Markdown documents (converted from a Docx of my thesis) as master texts, but now if I add a new footnote it messes up the numbering.
Alternatively, is there another convenient way (i.e. not Pandoc) that this could be done? Cutting text in one part of a file and adding corresponding text in another part seems a bit beyond a simple regex.
Thanks in advance for any help.
EDIT: I've just hacked up an extremely simple Python script to do this, in case anyone else has the same issue.
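A minimal sketch of such a script (not the original; for illustration it only handles footnote definitions that fit on a single line) might look like this:

import re
import sys

def inline_footnotes(text):
    # Collect single-line footnote definitions of the form "[^id]: text"
    # and remove them from the document.
    defs = {}
    def collect(match):
        defs[match.group(1)] = match.group(2).strip()
        return ""
    text = re.sub(r"^\[\^([^\]]+)\]:[ \t]*(.*)$", collect, text, flags=re.MULTILINE)
    # Rewrite each remaining reference "[^id]" as an inline note "^[text]".
    return re.sub(r"\[\^([^\]]+)\]",
                  lambda match: "^[" + defs.get(match.group(1), "") + "]",
                  text)

if __name__ == "__main__":
    sys.stdout.write(inline_footnotes(sys.stdin.read()))

It could be run as, for example, python inline_notes.py < thesis.md > thesis-inline.md (the file names are hypothetical).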
Pandoc's Markdown syntax is quite flexible about footnotes:
The footnotes themselves need not be placed at the end of the document. They may appear anywhere except inside other block elements (lists, block quotes, tables, etc.).
Like:
Here is a footnote reference[^1] and some more text.
[^1]: Here is the footnote.
Here's the next paragraph.
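Written as an inline note instead, the same thing would read: Here is a footnote reference^[Here is the footnote.] and some more text.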
However, the Markdown Writer (the module that generates markdown files, as opposed to reading them) currently simply places all of them at the end of the document. But this could be implemented behind a flag, similar to the --reference-links flag. Feel free to submit an issue or pull request!
Inline footnotes and references are quite nice for writing and editing markdown documents, but cumbersome for reading them.
I used ltrgoddard's inliner with success to process several files that I use with pandoc and latexmk to produce PDF. inliner works well for transforming end-style references to inline style references in an already-written document.
Cross references to other questions and clues for posterity:
Convert markdown links from inline to reference
Vim plugin for adding external links
Also see http://drbunsen.github.io/formd/
and https://instant-thinking.de/2014/02/20/markdown-footnotes-with-vim/ for more info re: formd, which should work for converting inline references to end-style references, and vice versa.
Note that formd works on URLs and ignores footnotes, so this may be seen as a similar project (with different goals) but not an alternative.

Is there a way to count tags on a physical (PDF) page using XSL-FO?

Here is the scenario. I have an XML document which contains tags. I want to create a transform that does this
<tag>content A</tag> 1. content A
<tag>content B</tag> ----> 2. content B
<tag>content C</tag> 3. content C
but only if the tag contents appear on the same physical page. The numbering should restart on each new page. Is there any way to do this using XSL-FO? I know that with LaTeX the only way to accomplish something like this is to run LaTeX twice, with the interim document used to determine content page placement.
As far as I can tell (and as confirmed by the Antenna House tech support team), there is no way to do this using standard XSL-FO. Antenna House offers <axf:footnote*/> extensions which include the ability to set an axf:footnote-number-reset="page" attribute, and as suggested in the comments, RenderX offers a generic mechanism which might be used for this purpose, but both of these involve vendor-specific extensions to the language.
This points to a number of shortcomings in XSL-FO that really should have been addressed a long time ago with a 2.0 version of the specification. A W3C committee to develop an XSL-FO 2.0 spec was formed and then disbanded quite some time ago; I have no idea why, as I find the tool indispensable for a large class of document-to-PDF conversions.

Regex to exclude elements in an XML file

I am comparing two XML files using WinMerge. The files are deployment files, and I'm looking for variation between the environments. The main issue is that the XML files are littered with tags that indicate a change in an underlying id (e.g. <tableId>123</tableId>), but this is unimportant for the comparison.
I want to create a regex that I can use in WinMerge to exclude those elements and compare only the interesting elements, e.g. the <name> element in the example below.
Environment 1
<table>
<tableInfo>
<tableId>293</tableId>
<name>Table Name New</name>
<repositoryId>0</repositoryId>
Environment 2
<table>
<tableInfo>
<tableId>965</tableId>
<name>Table Name Old</name>
<repositoryId>0</repositoryId>
Please note that the application producing the XML spits these out in line-by-line order, so it is not a true XML compare.
I would not recommend using a regex for this... to do it accurately, you would really need to parse the XML, which is not something you want to do with a regex.
WinMerge is a line-based diff tool, which isn't necessarily wholly effective for XML. I would recommend trying an XML-based diff tool, which has more of a concept of XML's tree structure. Most XML-based diff tools appear to be commercial products, but there is diffxml, which is open source and may be worth a look.
An XML-based diff of the files should inherently be more accurate, since it is not wholly line-based and takes the tree structure into account. You could then delve further into the diffs using an XML parser, such as ElementTree in Python, specifically targeting the tags you consider interesting and comparing them to see whether they differ.
If diffxml proves to be too unwieldy, it may be worth just doing the parsing using ElementTree or similar (e.g. lxml) and doing the comparison yourself against the two different sources, targeting just the tags in which you are interested.
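If you go that route, a rough sketch with ElementTree might look like the following (the file names and the set of interesting tags are placeholders, and the exports are assumed to be well-formed XML):

import xml.etree.ElementTree as ET

# Tags whose values we actually care about; everything else (e.g. tableId) is ignored.
INTERESTING = {"name", "repositoryId"}

def interesting_values(path):
    """Return (tag, text) pairs for the interesting elements, in document order."""
    root = ET.parse(path).getroot()
    return [(el.tag, (el.text or "").strip())
            for el in root.iter()
            if el.tag in INTERESTING]

env1 = interesting_values("environment1.xml")
env2 = interesting_values("environment2.xml")

for (tag1, val1), (tag2, val2) in zip(env1, env2):
    if val1 != val2:
        print(f"{tag1}: {val1!r} != {val2!r}")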
In short, I think XML parsers, perhaps in combination with an XML-aware diff tool, will be more useful than pure regexes in this case.

C++ Logger - Should I use an ordinary XML parser?

I'm working on a logging system for my 2D engine, and I'm confused about how I should go about creating/editing the file, and how I should output that file.
I've learned that XML is more of a data carrier than a data displayer like HTML is. I've read that I can use XML-to-HTML converters. One method I've thought about is writing characters to a file in HTML.
Clarity on these matters is what I ask of you, Stack Overflow.
Creating an XML (or HTML) file doesn't need any special library. Straightforward string concatenation is usually good enough; you may have to escape some special characters (e.g. > into &gt;).
But as Owen says, plain text is a lot more common for log files. One reasonable compromise is comma-separated values in a text file; this gives you a little bit of structure without much overhead. For example, the Windows web server (IIS) uses this format by default, and if you have fields that are output on every line, such as a timestamp or source filename and line number, this makes it easy to separate them out again.
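For instance (the field layout here is only an illustration, not a fixed format), a line in such a log might look like:

2024-05-01 12:03:17,INFO,renderer.cpp,128,texture cache initialized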
Just about every log I've ever worked with has been pure text delimited by newlines. If you're going to depart from that, you may want to ask yourself what it is about your logging needs that you want to accomplish with markup.
If you must go the way of markup, I would suggest an XML format that contains a minimal set of markup that would be useful in your situation. You could use XML to capture structure in your log entries (timestamp, severity, and operational code, for example) that would be inconvenient to code for in HTML.
Note that you could also go hybrid and embed some XHTML tags in an XML element whose purpose is to capture displayable text, if you want.
The problem with XML or HTML files is that you cannot simply append to them at any time. You have to close the final tag (the document tag) properly at the end of writing.
Therefore, they are not popular formats for logging.
For logging, I suggest using one of the existing log engines, such as the Apache logger or John Torjo's Boost log candidate. They will support log levels, runtime configuration, etc.
If you are considering writing logs in XML files, please, stop.
Log files should be simple plain-text files; XML-izing them introduces needless complexity. They are not structured data; they are meant to be read by people, not by automated tools.
It all starts with XML logs, and then it goes downhill from there.