I am using bookdown for documentation that is rendered with bookdown::gitbook and bookdown::pdf_book.
In my Rmd files, I wrap notes and warnings in a div that is styled with a CSS file. For example:
<div class="note">
This is a note.
</div>
Obviously, the HTML and CSS are ignored when generating the PDF file. I was wondering whether there is a way to "inject" a small script that would replace the div with, for example, a simple prefix text.
Or is there another way to have it formatted in both HTML and PDF without littering my file with something lengthy every time, like:
if (knitr::is_html_output(excludes = 'epub')) {
  cat('
<div class="note">
This is a note.
</div>
')
} else {
  cat('Note: This is a note.')
}
I could also style blockquotes as described here, but that is not an option because I still need regular blockquotes.
The appropriate way to do this is to use a fenced div rather than inserting HTML directly into your markdown and trying to parse it later with Lua. Pandoc already lets you define custom styles and processes them for both output formats. In other words, it will take care of creating the appropriate HTML and LaTeX for you, and then you just need to style each of them. The Bookdown documentation references this here, but it simply points to further documentation here, and here.
This method will create both your custom classed div in html and apply the same style name in the LaTeX code.
So, for your example, it would look like this:
::: {.note data-latex=""}
This is a note.
:::
The output in HTML will be identical to yours:
<div class="note">
<p>This is a note.</p>
</div>
And you've already got the CSS you want to style that.
The LaTeX code will be as follows:
\begin{note}
This is a note.
\end{note}
To style that you'll need to add some code to your preamble.tex file, which you've already figured out as well. Here's a very simple example of some LaTeX that would simply indent the text from both the left and right sides:
\newenvironment{note}[0]{\par\leftskip=2em\rightskip=2em}{\par\medskip}
I found this answer on tex.stackexchange.com, which put me on the right track to solve my problem.
Here is what I am doing.
Create boxes.lua with the following function:
function Div(element)
  -- function based on https://tex.stackexchange.com/a/526036
  if
    element.classes[1] == "note"
    or element.classes[1] == "side-note"
    or element.classes[1] == "warning"
    or element.classes[1] == "info"
    or element.classes[1] == "reading"
    or element.classes[1] == "exercise"
  then
    -- get latex environment name from class name, e.g. "side-note" -> "DivSideNote"
    -- (declared local so it does not leak into the global scope)
    local div = element.classes[1]:gsub("-", " ")
    div = div:gsub("(%l)(%w*)", function(a, b) return string.upper(a)..b end)
    div = "Div"..div:gsub(" ", "")
    -- insert element in front
    table.insert(
      element.content, 1,
      pandoc.RawBlock("latex", "\\begin{"..div.."}"))
    -- insert element at the back
    table.insert(
      element.content,
      pandoc.RawBlock("latex", "\\end{"..div.."}"))
  end
  return element
end
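To make the naming convention easier to follow, here is a minimal Python sketch of the same class-to-environment mapping that the three gsub calls perform. It is only an illustration of the string logic, not part of the filter; the function name latex_env_name and the BOX_CLASSES set are made up for this example.

```python
import re

# Class names handled by the filter above (copied from its if-condition).
BOX_CLASSES = {"note", "side-note", "warning", "info", "reading", "exercise"}


def latex_env_name(css_class):
    """Mirror the Lua gsub steps: 'side-note' -> 'side note' -> 'Side Note' -> 'DivSideNote'."""
    words = css_class.replace("-", " ")
    # Upper-case the first lowercase letter of each alphanumeric run,
    # like the Lua pattern "(%l)(%w*)".
    capitalised = re.sub(
        r"([a-z])([A-Za-z0-9]*)",
        lambda m: m.group(1).upper() + m.group(2),
        words,
    )
    return "Div" + capitalised.replace(" ", "")


for name in sorted(BOX_CLASSES):
    print(name, "->", latex_env_name(name))
```

Each name this produces has to match an environment defined in preamble.tex, otherwise LaTeX will complain about an undefined environment.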
Add pandoc_args to _output.yml:
bookdown::pdf_book:
  includes:
    in_header: latex/preamble.tex
  pandoc_args:
    - --lua-filter=latex/boxes.lua
  extra_dependencies: ["float"]
Create environments in preamble.tex (which is also configured in _output.yml):
I am using tcolorbox instead of mdframed:
\usepackage{xcolor}
\usepackage{tcolorbox}
\definecolor{notecolor}{RGB}{253, 196, 0}
\definecolor{warncolor}{RGB}{253, 70, 0}
\definecolor{infocolor}{RGB}{0, 183, 253}
\definecolor{readcolor}{RGB}{106, 50, 253}
\definecolor{taskcolor}{RGB}{128, 252, 219}
\newtcolorbox{DivNote}{colback=notecolor!5!white,colframe=notecolor!75!black}
\newtcolorbox{DivSideNote}{colback=notecolor!5!white,colframe=notecolor!75!black}
\newtcolorbox{DivWarning}{colback=warncolor!5!white,colframe=warncolor!75!black}
\newtcolorbox{DivInfo}{colback=infocolor!5!white,colframe=infocolor!75!black}
\newtcolorbox{DivReading}{colback=readcolor!5!white,colframe=readcolor!75!black}
\newtcolorbox{DivExercise}{colback=taskcolor!5!white,colframe=taskcolor!75!black}
Because I also have images and tables within the boxes, I ran into "LaTeX Error: Not in outer par mode." I was able to solve this issue by adding the following chunk to my Rmd file:
```{r, echo = FALSE}
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")
```
When I build the book, index.html gains "automatic contributions" that I did not ask for. An example of the "contribution":
</div>
<div id="prerequisites" class="section level1">
<h1><span class="header-section-number"> 1</span> Prerequisites</h1>
<p>This is a <em>sample</em> book written in <strong>Markdown</strong>. You can use anything that
Pandoc’s Markdown supports, e.g., a math equation <span class="math inline">\(a^2 + b^2 = c^2\)
</span>
...
</p>class="uri">https://yihui.org/tinytex/</a>.</p>
This "gift" shows up in the EPUB output, but not in the HTML output.
It seems improper to have to edit index.html to exclude this "addition".
Has anyone been able to avoid this effect?
Thank you
I am currently creating my own pandoc template for Rmarkdown (outputting html). I want my report to show a footer containing the title and first author's name.
Reading the pandoc manual, I saw that it is possible to use a pipe to get the first element of an array (var/first). So having the header:
title: "My Report"
author:
- "Jane Doe"
- "John Doe"
I tried to do the following in the template:
<div class="footer">
<div class="footer-content">
<div class="footer-title">
<h3>$title$</h3>
</div>
<div class="footer-author">
<h3>$author/first$</h3>
</div>
</div>
</div>
Then I got the following error on the author line:
"template" (line 96, column 24):
unexpected "/"
expecting "." or "$"
Please note that the pandoc version used by rmarkdown is 2.3.1.
Is there a different way to accomplish that?
You could write a small Lua filter to get just the first author:
function Meta (meta)
  local firstauthor = meta.author.t == 'MetaList'
    and meta.author[1]
    or meta.author
  meta.firstauthor = firstauthor
  return meta
end
You can then use $firstauthor$ in your template. See here for a brief discussion of Lua filters and how to use them.
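If you would rather not use Lua, the same idea can be sketched as a pandoc JSON filter. The following is a minimal, hedged example in Python using only the standard library; the function name add_first_author and the hand-written sample document are assumptions for illustration, and the sample only imitates the shape of pandoc's JSON metadata.

```python
import json


def add_first_author(doc):
    """Copy the first entry of meta.author into meta.firstauthor (pandoc JSON AST)."""
    meta = doc.get("meta", {})
    author = meta.get("author")
    if author is None:
        return doc  # no author field, nothing to do
    if author.get("t") == "MetaList":
        if author["c"]:
            meta["firstauthor"] = author["c"][0]
    else:
        # A single author is not wrapped in a list; reuse it as-is.
        meta["firstauthor"] = author
    return doc


# Hand-written sample imitating the metadata for the YAML header above.
sample = {
    "meta": {
        "author": {
            "t": "MetaList",
            "c": [
                {"t": "MetaInlines",
                 "c": [{"t": "Str", "c": "Jane"}, {"t": "Space"}, {"t": "Str", "c": "Doe"}]},
                {"t": "MetaInlines",
                 "c": [{"t": "Str", "c": "John"}, {"t": "Space"}, {"t": "Str", "c": "Doe"}]},
            ],
        }
    },
    "blocks": [],
}

result = add_first_author(sample)
print(json.dumps(result["meta"]["firstauthor"]))
```

To run it as an actual filter passed to pandoc with --filter, the script would read the document with json.load(sys.stdin) and write the result with json.dump(..., sys.stdout) instead of using the sample. The Lua filter is the simpler choice when your pandoc version supports it.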
I noticed that in HTML produced by knitr from my RMarkdown document, sections are marked up thus:
<div id="chunk_id" class="section level2">
<h2>...</h2>
<p>...</p>
</div>
and so on. I think it's best practice to use a <section> element rather than a <div> here (reference 1, reference 2), so I forked the RMarkdown code to see if I could make a change and a PR. In the code I found the following:
#' @param section_divs Wrap sections in <div> tags (or <section> tags in HTML5),
#' and attach identifiers to the enclosing <div> (or <section>) rather than the
#' header itself.
so it seems like there is no need for a change to RMarkdown - it will already use <section> in the way I want, if it is told to output HTML5.
My question is: how do you tell knitr to output HTML5? I have
output:
  html_document:
    section_divs: true
but no idea how to "switch on" HTML5.
I am using Scrapy and XPath to parse a website in Russian.
In this topic, alecxe showed me how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't state the question clearly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the website that I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file as latin-1. See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from lxml import html
markup = """<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a unicode string, the u at the beginning of the XPath is necessary, same as @warwaruk's comment on your question.
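As a rough illustration of what following-sibling::text() picks up here, below is a minimal Python 3 sketch that uses only the standard library's xml.etree.ElementTree instead of lxml or Scrapy (an assumption made so that it runs without extra packages). The helper name value_after_label and the shortened, well-formed markup are made up for this example. ElementTree exposes the text directly after an element as its tail, which plays the role of the first following text node. In Python 3 all strings are Unicode, so no u prefix is needed.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the snippet above (illustrative only).
markup = """<div class="obj-params">
  <p><b>Некий текст</b> Param1_value</p>
  <p><strong>Param2_name</strong> Param2_value</p>
</div>"""


def value_after_label(root, label):
    """Return the text that directly follows the element whose text equals label."""
    for el in root.iter():
        if el.text == label:
            # .tail holds the text between this element's end tag and the next tag.
            return (el.tail or "").strip()
    return None


root = ET.fromstring(markup)
print(value_after_label(root, "Некий текст"))  # prints: Param1_value
```

With lxml or Scrapy you would keep the XPath expression shown above; this sketch only shows that a Cyrillic label can be matched like any other string once everything is Unicode.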
Let us know if this helps.
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
    name = div.xpath("./h3/span/a/text()")[0]
    # first parameter column of the listing
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    # the value is the last text node of each <p>, after the bold label
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]
    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
This doesn't print everything cleanly on my end (I get some [Decode error - output not utf-8]). However, encoding aside, I believe this approach is much better scraping practice overall.
Let us know what you think.