XSL-FO: Need to disable page break at page sequence level - xslt

My requirement is, essentially, to produce a page WITHIN a page. The XSL defines a page one third the size of an A4 sheet, and up to 3 of them must print on one sheet. The page is a standard header/body/footer layout with a 'Page X of Y' on it, and producing it on its own is simple. If there are 4 of these "pages", only 2 sheets should be printed: 3 pages on the first and 1 on the second. Instead, 4 sheets are printed, because the output PDF "tells" the printer that each page is a complete sheet. So what I want to do is either:
code the FO so it does not page break after it finishes a page (something like page-break-after="avoid", but at the page-sequence level)
OR
generate a page sequence within a page sequence, the outer one being defined as A4 size, the inner 1/3 of that.
I've tried the 2nd directly in a simple way, i.e.,
<fo:page-sequence master-reference="A4">
<fo:page-sequence master-reference="one-third_A4">
...
</fo:page-sequence>
</fo:page-sequence>
..and the processor definitely does NOT like it.
Are there any instructions I can apply that could do either of these? I looked in places like schemacentral and w3schools.com and cannot find anything.

Not sure if I understood your needs (a drawing would help), but can't you simply put your content into a series of fo:block-container elements with a specified height? They would come out stacked vertically.
If you need more complex geometry, check out the flow-maps feature of XSL 1.1: three body regions per page, connected sequentially in a <fo:flow-target-list>, and a single fo:flow providing content for them.
Having fo:page-sequence within another fo:page-sequence is not valid according to the spec (what would a page number mean then?), and every fo:page-sequence starts a new physical page by definition.
fo:page-sequence-wrapper won't help you at the sub-page level either.
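A rough sketch of the block-container idea, written as a small Python script that emits the FO fragment. The 99mm height (one third of A4's 297mm), the helper names build_slips and sheets_needed, and the placeholder texts are assumptions for illustration. It also assumes the region-body of the A4 page master spans the full page height, so that exactly three containers fit on a sheet.

```python
import math
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

SLIPS_PER_SHEET = 3                       # assumption: three slips per A4 sheet
SLIP_HEIGHT_MM = 297 // SLIPS_PER_SHEET   # A4 is 297mm tall, so 99mm per slip


def build_slips(texts):
    """Return an fo:flow holding one fixed-height fo:block-container per text."""
    flow = ET.Element(f"{{{FO_NS}}}flow", {"flow-name": "xsl-region-body"})
    for text in texts:
        container = ET.SubElement(
            flow, f"{{{FO_NS}}}block-container", {"height": f"{SLIP_HEIGHT_MM}mm"}
        )
        block = ET.SubElement(container, f"{{{FO_NS}}}block")
        block.text = text
    return flow


def sheets_needed(slip_count):
    """Number of A4 sheets used when the slips stack three to a sheet."""
    return math.ceil(slip_count / SLIPS_PER_SHEET)


flow = build_slips(["slip 1", "slip 2", "slip 3", "slip 4"])
fo_markup = ET.tostring(flow, encoding="unicode")
print(fo_markup)
print(sheets_needed(4))  # 2
```

The generated fo:flow would go inside a single fo:page-sequence that references the A4 page master; the per-slip header, footer and 'Page X of Y' would then have to be ordinary blocks inside each container rather than static content.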

Multiple page sequences and page numbers

I am using XSLT and XSL-FO for document creation.
I need to introduce multiple sequences of page numbering, i.e. a main section numbered from 1 to its end page, which is interrupted by two subsequences for the corresponding subdocuments, each numbered from 1 to the end of its section, after which the main sequence continues.
The problem I can't overcome is that the subsequences are counted in the main sequence: the page number of the main section after the continuation is just the last page number of the previous subsequence plus one.
So I get e.g.
main section: 1, 2, 3
subsection: 1, 2
subsection: 1, 2
continuation of main: 3, 4, 5...
And I want
main section: 1, 2, 3
subsection: 1, 2
subsection: 1, 2
continuation of main: 4, 5, 6...
How can I achieve it?
In an FO file, the fo:page-sequence elements define a flat list of contents to be paginated, so there is no concept of a "main section" containing "subsections".
The initial-page-number property controls how page numbers are computed for a page sequence.
In particular, when a page sequence ends and a new one starts the only available options concerning page numbering are:
continue the enumeration with the next page number (optionally, the next odd one or even one)
restart from a specified number
There is no way to refer to a different page sequence and say "continue from that".
That said, if your document is not much more complicated than your example you can achieve what you want with a workaround.
This is probably the easiest (and dirtiest) one:
create the pdf output without changing anything; note down the page number that the "continuation of main" section should start from (4, in your example)
modify the FO file (or the XSLT), inserting initial-page-number="4" in the appropriate fo:page-sequence
recreate the output file
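The arithmetic behind that second step can be sketched in Python. The function name, the 'main'/'sub' labels and the page counts are hypothetical; the counts are assumed to have been read off the first-pass PDF by hand.

```python
def initial_page_numbers(sequences):
    """Compute an initial-page-number value for each page sequence.

    sequences: list of (kind, page_count) tuples in document order,
    where kind is 'main' or 'sub' (labels invented for this sketch).
    Main parts continue the main numbering; every subsection restarts at 1.
    """
    result = []
    next_main = 1
    for kind, page_count in sequences:
        if kind == "main":
            result.append(next_main)
            next_main += page_count
        else:
            result.append(1)
    return result


# Page counts taken from the example in the question.
layout = [("main", 3), ("sub", 2), ("sub", 2), ("main", 3)]
print(initial_page_numbers(layout))  # [1, 1, 1, 4]
```

The last value is the one to put in initial-page-number on the "continuation of main" page sequence.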
Slightly more complicated:
modify your XSLT so that it produces the fo:page-sequence elements in the order: (main section) - (continuation of main) - (subsection) - (subsection)
there is no need to specify initial-page-number for the "continuation of main" section; for subsections, use initial-page-number="1" to restart the page numbers (but they are already like this, from your example)
once you have the pdf, use some tool to move the pages around and have them in the order you desire
A cleaner and fully-automated solution would require using FOP's intermediate format:
create the intermediate format output
modify it to fix the page numbers as needed
create the final output from the modified intermediate file
A final note: you did not specify why your document needs such a peculiar page numbering and what degree of control you have on this requirement, but I cannot help wondering whether it is confusing for a reader to see page numbers restart that way, and difficult for them to find what they need ("I am told the information I need is on page 2; but which page 2?").

XPath How to optimize performance over "preceding" axis?

I am using XSLT to transform XML files and this XPath is a very small part of it. My main concern is a performance issue. First, I will describe the context:
Part of the transformation is a complex grouping operation, used to group up a sequence of similar elements, in the order they appear. This is a small sample from the data:
<!-- potentially a lot more data-->
<MeaningDefBlock>
<!-- potentially a lot more data-->
<MeaningSegment>
<Meaning>
<value> or </value>
</Meaning>
</MeaningSegment>
<MeaningSegment>
<MeaningInsert>
<OpenBracket>
<value>(</value>
</OpenBracket>
<Meaning>
<value>ex.: </value>
</Meaning>
<IllustrationInsert>
<value>ita, lics</value>
</IllustrationInsert>
<ClosedBracket>
<value>)</value>
</ClosedBracket>
</MeaningInsert>
</MeaningSegment>
<!-- potentially a lot more data-->
</MeaningDefBlock>
<!-- potentially a lot more data-->
There are only parent elements (ex.: MeaningInsert) and elements that only contain a value element, which contains text (ex.: IllustrationInsert).
The text from the input gets grouped into elements that hold these text segments: " or (ex.: ", "ita, lics" and ")" (in this case, the "ita, lics" segment separates the groups that would otherwise all be one). The main point is that elements from different levels can be grouped. XPath is used to identify groups via the preceding segments, and the result is used as a key in the XSL. The whole key is very complicated and not the object of the question (but I still provide it for context):
<xsl:key name="leavesInGroupL4" match="MeaningSegment//*[value]" use="generate-id(((preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[not(boolean(self::IllustrationInsert|self::LatinName)=boolean(current()/self::IllustrationInsert|current()/self::LatinName))]|ancestor-or-self::MeaningDefBlock)[last()])"/>
The important part being:
(preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[...]
From the context of an element with a value child (like Meaning or OpenBracket), this XPath selects the previous siblings and all the elements with values from the preceding siblings of the parent/ancestor MeaningSegment. In practice, it basically selects all the text that came before it (or, rather, the grandparent of each text node).
I have later realized that there might be even further complications with layers and differing depth of the elements with values. I might need to select all such preceding elements regardless of their parent and siblings but still in the same block. I have substituted "the important part" with a somewhat simpler XPath expression:
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock) = generate-id(current()/ancestor-or-self::MeaningDefBlock)]
This only checks that it's in the same block and it works! It successfully selects the preceding segments of text in the block, even if elements with values and parent elements are mixed together. Example input fragment:
...
<OpenBracket>
<value>(</value>
</OpenBracket>
<SomeParentElement>
<LatinName>
<value>also italics</value>
</LatinName>
</SomeParentElement>
<ClosedBracket>
<value>)</value>
</ClosedBracket>
...
This is not something the first approach could do because the brackets and the LatinName are not siblings.
However, the new approach with preceding::* is extremely slow! On a real document, the XSL transformation takes up to 5 minutes instead of the usual 3 seconds that the original approach takes (including overhead), which is a 100x increase in time taken. Of course, that is because preceding checks nearly every earlier node in the whole document each time it is evaluated (and it is evaluated a lot of times). The document has a lot of MeaningDefBlock blocks (nearly 2000), each with a couple of segments of text (usually single-digit) and a bunch of other straightforward elements/nodes unrelated to said text (usually in the low hundreds per block). It is quite easy to see how this all adds up to preceding trashing performance compared with preceding-sibling.
I was wondering if this could be optimized somehow. In XSL, keys have greatly improved performance multiple times in our project but I'm not sure if preceding and keys can be combined or if the XPath needs to be more complex and tailored to my specific case, perhaps enumerating the elements it should look at (and hopefully ignoring everything else).
Since the input will currently always work with the first approach, I have conceded and rolled back the change (and would probably rather take the 5 min hit every time than trying optimization myself).
I use XSLT 1.0 and XPath 1.0
I guess you've probably already worked out that
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock)
= generate-id(current()/ancestor-or-self::MeaningDefBlock)]
is going to search back to the beginning of the document; it's not smart enough to know that it only needs to search within the containing MeaningDefBlock element.
One answer to that would be to change it to something like this:
ancestor-or-self::MeaningDefBlock//*[value][. << current()]
The << operator requires XPath 2.0 and for a problem as complex as this, you really ought to consider moving forwards. However you can simulate the operator in 1.0 with an expression like generate-id(($A|$B)[1]) = generate-id($A).
There's no guarantee this will be faster, but unlike your existing solution it should be independent of how many MeaningDefBlock elements there are in the document.
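To show why scoping the search to the block helps, here is the same selection sketched outside XSLT, in Python with the standard-library ElementTree. This is only an illustration of the idea, not a replacement for the key; the sample document and the function name are made up, and it assumes MeaningDefBlock elements are not nested.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Doc>
  <MeaningDefBlock>
    <MeaningSegment><Meaning><value> or </value></Meaning></MeaningSegment>
    <MeaningSegment><MeaningInsert>
      <OpenBracket><value>(</value></OpenBracket>
      <SomeParentElement><LatinName><value>also italics</value></LatinName></SomeParentElement>
      <ClosedBracket><value>)</value></ClosedBracket>
    </MeaningInsert></MeaningSegment>
  </MeaningDefBlock>
  <MeaningDefBlock>
    <MeaningSegment><Meaning><value>other block</value></Meaning></MeaningSegment>
  </MeaningDefBlock>
</Doc>"""


def preceding_leaves_by_block(root):
    """Map each element with a <value> child to the value-bearing elements
    that precede it in document order inside the same MeaningDefBlock."""
    result = {}
    for block in root.iter("MeaningDefBlock"):
        seen = []
        # iter() walks the block in document order, so 'seen' always holds
        # exactly the earlier value-bearing elements of this block.
        for elem in block.iter():
            if elem.find("value") is not None:
                result[elem] = list(seen)
                seen.append(elem)
    return result


root = ET.fromstring(SAMPLE)
table = preceding_leaves_by_block(root)
closed = root.find(".//ClosedBracket")
print([e.tag for e in table[closed]])  # ['Meaning', 'OpenBracket', 'LatinName']
```

Each lookup only ever touches nodes of its own block, so the work per element is bounded by the block size rather than by everything that came earlier in the document.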

How to count number of citations/references in wikipedia raw text?

I'm building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of ~30,000 hand-graded articles and their corresponding quality grades.). Nonetheless, I am trying to figure out a way to algorithmically count the number of citations that appear on the page.
As a quick example: here is an excerpt from a raw Wiki page:
'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>
So far, I've concluded that I can find the number of images by counting the number of [[Image: occurrences. I was hoping I could do something similar for references. In fact, after comparing raw Wiki pages and their corresponding live pages, I think I was able to determine that </ref> corresponds to the end notation of a reference on a Wiki page. For example, in the excerpt above you can see that the author makes a statement at the end of the paragraph and references Hammond, 58–9 within <ref> {text} </ref>.
If somebody is familiar with Wiki's raw data and can shed some light on this, please let me know! Also, if you know a better way to do this, please tell me that, too!
Many thanks in advance!
A ref does not always contain a link to a source. Sometimes it contains explanatory notes and so on.
You must count not only <ref>...</ref>, but also footnote templates.
If you need the count of unique refs, then you must exclude grouped refs (refs with a name="xxx" parameter, or automatically grouped footnote templates with the same content).
Sorry for my English.
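A minimal sketch of counting on the raw text along these lines, using only Python's re module. The function names and the sample string are made up for illustration. It treats refs that share a name as one unique reference and does not handle footnote templates at all, so treat the result as an approximation.

```python
import re

# <ref ...>body</ref>; the lookbehind skips self-closing <ref ... /> tags.
REF_WITH_BODY = re.compile(
    r"<ref\b(?P<attrs>[^>]*)(?<!/)>(?P<body>.*?)</ref>", re.DOTALL | re.IGNORECASE
)
# <ref name="x" /> reuses an earlier named reference.
REF_SELF_CLOSING = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
# name="x", name='x' or name=x
NAME_ATTR = re.compile(r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE)


def ref_name(tag_text):
    """Return the value of the name attribute, or None if there is none."""
    match = NAME_ATTR.search(tag_text)
    if match is None:
        return None
    return next(g for g in match.groups() if g is not None).strip()


def count_refs(wikitext):
    """Return (total ref tags, unique references) for raw wiki markup."""
    names = set()
    unnamed = 0
    total = 0
    for match in REF_WITH_BODY.finditer(wikitext):
        total += 1
        name = ref_name(match.group("attrs"))
        if name is None:
            unnamed += 1
        else:
            names.add(name)
    for match in REF_SELF_CLOSING.finditer(wikitext):
        total += 1
        name = ref_name(match.group(0))
        if name is not None:
            names.add(name)
    return total, unnamed + len(names)


sample = ('A.<ref name="h">Hammond, 58–9</ref> B.<ref>Note</ref> '
          'C.<ref name="h" /> D.<ref name=h/> <references />')
print(count_refs(sample))  # (4, 2)
```

The \b after ref keeps the <references /> tag from being counted as a citation.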
Counting reference tags in wiki markup isn't necessarily accurate, as references can be reused, so that two </ref> would only show up as one reference in the list at the end. There is an API that should give a list of the articles, but for some reason it's deactivated. BeautifulSoup makes this pretty simple, though. I haven't tested this to check that it counts all articles correctly, but it works:
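from bs4 import BeautifulSoup
import requests

# Fetch the rendered article: the reference list is in the HTML, not in the raw wikitext
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
soup = BeautifulSoup(page.content, 'html.parser')

# Each entry in the reference list is rendered as <span class="reference-text">
count = len(soup.find_all('span', attrs={'class': 'reference-text'}))
print(count)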

XSLFO no page-break on first page

I have an XSLFO document with a couple of block elements that have page-break-inside="avoid". There is also a title element before each block element with keep-with-next.within-page="always".
So basically I have paragraphs with a title and title and paragraph should always be on the same page and there shouldn't be a page-break inside the paragraphs.
The problem is that there are some blocks that have too much content for one page. If the content only overflows the region-body but not the entire page, no page-break occurs, so the block is still on one page.
However, there are blocks where the text overflows the entire page and in that case, there is a page-break-before. One such block element with too much content should be on the first page of the document. However, there is a page-break and it is on the second page of the document.
So in essence, my problem is that there should be no page-breaks within the block-elements (the paragraphs), title and paragraph should always be on the same page AND there should be no page-break before the very first block-element, even if it overflows the entire page. The content should always start on the first page and there should be no empty pages at all.
Thanks for your help and suggestions!
The spec says:
Keep conditions are imposed by the "within-page", "within-column", and "within-line" components of the "keep-with-previous", "keep-with-next", and "keep-together" properties. The refined value of each component specifies the strength of the keep condition imposed, with higher numbers being stronger than lower numbers and the value "always" being stronger than all numeric values.
Have you tried different values? Maybe you could change page-break-inside="avoid" into keep-together="<your value here>" to use tuned values.
Edit: See spec http://www.w3.org/TR/2006/REC-xsl11-20061205/#keepbreak

How to set page numbers in XSLT having multiple simple-page-master? This is for dynamically inserting page numbers in PDF

I have an XSLT file with multiple simple-page-master elements. They are used to set different heights and widths.
The issue I am facing is that the page numbers for each simple-page-master start at 1.
I used <fo:page-number/> to generate the page numbers dynamically.
I also want to get the total number of pages, since I have to write the page numbers as
Page 1 of 20
I need the page numbers in one continuous sequence.
How can I solve this?
Mmmh without more samples, it's hard to try an answer...
I gamble one but quite blindly :
You should have some somewhere. Add a pagenumber param in it. And in your add .
You should be able to output the page number via the param. To get the total number of simple-page-master, add another param or use the XPath count(//simple-page-master) (better to count once and use params).