Getting the first author in a pandoc template (Rmarkdown)

I am currently creating my own pandoc template for Rmarkdown (outputting html). I want my report to show a footer containing the title and first author's name.
Reading the pandoc manual, I saw that it is possible to use a pipe to get the first element of an array (var/first). So having the header:
title: "My Report"
author:
- "Jane Doe"
- "John Doe"
I tried to do the following in the template:
<div class="footer">
<div class="footer-content">
<div class="footer-title">
<h3>$title$</h3>
</div>
<div class="footer-author">
<h3>$author/first$</h3>
</div>
</div>
</div>
I then got this error on the author line:
"template" (line 96, column 24):
unexpected "/"
expecting "." or "$"
Please note that the pandoc version used by rmarkdown is 2.3.1.
Is there a different way to accomplish that?

You could write a small Lua filter to get just the first author:
function Meta (meta)
  -- If author is a list, take its first entry; otherwise use it as-is.
  local firstauthor = meta.author.t == 'MetaList'
    and meta.author[1]
    or meta.author
  meta.firstauthor = firstauthor
  return meta
end
You can then use $firstauthor$ in your template. See here for a brief discussion of Lua filters and how to use them.
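For comparison, the same idea can be sketched against pandoc's JSON AST in Python. This is only a minimal sketch, not the method from the answer above: it assumes the document has been serialized with `pandoc -t json`, where metadata values are tagged objects such as `{"t": "MetaList", "c": [...]}`. The function name `add_first_author` and the sample document are hypothetical.

```python
import json


def add_first_author(doc):
    """Copy the first entry of meta.author into meta.firstauthor.

    `doc` is a pandoc JSON AST (a dict with a "meta" key). If author is a
    single value rather than a non-empty list, it is copied unchanged.
    """
    meta = doc.get("meta", {})
    author = meta.get("author")
    if author is None:
        return doc  # nothing to do when no author is set
    if author.get("t") == "MetaList" and author.get("c"):
        meta["firstauthor"] = author["c"][0]
    else:
        meta["firstauthor"] = author
    return doc


# Hypothetical sample: the metadata of a document with two authors.
sample = {
    "meta": {
        "author": {
            "t": "MetaList",
            "c": [
                {"t": "MetaInlines",
                 "c": [{"t": "Str", "c": "Jane"}, {"t": "Space"}, {"t": "Str", "c": "Doe"}]},
                {"t": "MetaInlines",
                 "c": [{"t": "Str", "c": "John"}, {"t": "Space"}, {"t": "Str", "c": "Doe"}]},
            ],
        }
    },
    "blocks": [],
}

result = add_first_author(sample)
print(json.dumps(result["meta"]["firstauthor"]))
```

To run this as an actual filter, the script would read the AST with `json.load(sys.stdin)`, write the result with `json.dump` to `sys.stdout`, and be passed to pandoc through `--filter`.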

Related

How to keep only the first paragraph of $product.description_short in a listing?

I want to keep only the first paragraph of product description in categories.
Example: <p>This is a pretty good description.</p><p>The rest of the description, even if it's cool to I don't want it.</p>
To : <p>This is a pretty good description.</p>
This is the default code in product-list.tpl Prestashop (1.6):
<p class="product-desc" itemprop="description">
{$product.description_short|strip_tags:'UTF-8'|truncate:360:'...'}
</p>
This is what I tried in product-list.tpl:
<p class="product-desc" itemprop="description">
{assign var $newdescription = $product.description_short|strip_tags:'UTF-8'|truncate:360:'...'}
{preg_replace('(?<=<\/p>)\s+<p>.*','',$newdescription)}
</p>
{$_shorten = explode('</p>', $product.description_short)}
{* with valid HTML tags *}
{$_shorten.0|cat:'</p>'}
{* or, if you want to strip the tags: *}
{$_shorten.0|strip_tags}
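The split-on-`</p>` idea is easy to check outside a template. Below is a minimal Python sketch of the same logic (it is not Smarty code). `first_paragraph` is a hypothetical helper name, and the sketch assumes the description begins with a well-formed `<p>...</p>`.

```python
import re


def first_paragraph(html_text, keep_tags=True):
    """Return everything up to and including the first closing </p>.

    With keep_tags=False the markup is dropped and only the text is kept,
    mirroring the strip_tags variant.
    """
    head, sep, _rest = html_text.partition("</p>")
    if keep_tags:
        return head + sep
    # Naive tag removal; adequate for simple markup like the example.
    return re.sub(r"<[^>]*>", "", head)


description = (
    "<p>This is a pretty good description.</p>"
    "<p>The rest of the description.</p>"
)
print(first_paragraph(description))
print(first_paragraph(description, keep_tags=False))
```

Unlike the template version, `partition` only re-adds `</p>` when one was actually found, so text without any paragraph tags is returned unchanged.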

How can I specify HTML5 output from RMarkdown to get semantic elements like <section>?

I noticed that in HTML produced by knitr from my RMarkdown document, sections are marked up thus:
<div id="chunk_id" class="section level2">
<h2>...</h2>
<p>...</p>
</div>
and so on. I think it's best practice to use a <section> element rather than a <div> here (reference 1, reference 2), so I forked the RMarkdown code to see if I could make a change and a PR. In the code I found the following:
#' @param section_divs Wrap sections in <div> tags (or <section> tags in HTML5),
#' and attach identifiers to the enclosing <div> (or <section>) rather than the
#' header itself.
so it seems like there is no need for a change to RMarkdown - it will already use <section> in the way I want, if it is told to output HTML5.
My question is: how do you tell knitr to output HTML5? I have
output:
  html_document:
    section_divs: TRUE
but no idea how to "switch on" HTML5.

How can I use non-ASCII characters?

I am using Scrapy and XPath to parse a website in Russian.
In this topic, alecxe showed me how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I did not state the question properly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. I am therefore providing more detailed information about the website I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file as latin-1. See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from lxml import html
markup = """<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a Unicode string, the u prefix at the beginning of the XPath expression is necessary, as @warwaruk noted in a comment on your question.
Let us know if this helps.
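In Python 3 every string literal is already Unicode, so a Cyrillic label can be compared directly without the `u` prefix. The sketch below illustrates this using only the standard library's `html.parser`, which is a substitute for lxml/XPath chosen here so that the example has no dependencies. The class name `LabelValueParser` and the shortened markup are hypothetical.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect the text that follows a <b>/<strong> label inside the same <p>."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_label = False   # currently inside <b> or <strong>
        self.matched = False    # the last label seen equals self.label
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.in_label = True

    def handle_endtag(self, tag):
        if tag in ("b", "strong"):
            self.in_label = False
        elif tag == "p":
            self.matched = False  # a value never crosses a paragraph boundary

    def handle_data(self, data):
        if self.in_label:
            self.matched = (data == self.label)
        elif self.matched:
            self.values.append(data)


markup = """<div class="obj-params-col">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name</strong>" Param2_value"</p>
</div>"""

parser = LabelValueParser("Некий текст")
parser.feed(markup)
parser.close()
print(parser.values)
```

The same comparison works inside an XPath string in Python 3, because the expression is an ordinary `str`.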
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html

url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)

# One div with class obj-left per listing.
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
    name = div.xpath("./h3/span/a/text()")[0]
    # The first parameter column is read as room, square, and floor, in that order.
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]
    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
This does not print everything cleanly on my end (I get some [Decode error - output not utf-8] messages). Encoding aside, though, I believe this approach is much better scraping practice overall.
Let us know what you think.

How to filter the html markups when render a template with jinja2?

I am building a Django project that uses Jinja2 for templates. Some page content is submitted by the client through a WYSIWYG editor, and the detail pages work fine.
However, the list pages go wrong when I show a slice of the content.
My code:
<div class="summary ">
<div class="content">{{ question.content[:200]|e}}...</div>
</div>
But the output is:
<p>what i want to show here is raw text without markups</p>...
The expected result is that HTML tags such as <p></p> and <section> are removed (filtered out) and only the raw text is shown.
So how can I fix it? Thanks in advance!
Use the striptags filter:
striptags(value)
Strip SGML/XML tags and replace adjacent whitespace
by one space.
<div class="content">{{ question.content|striptags}}...</div>
The Jinja2 test for the striptags filter will also help you understand how it works.
Hope that helps.
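The order of operations matters here: slicing the raw HTML first can cut a tag in half, so the tags should be stripped before the text is shortened. The following is a rough standard-library sketch of that order. It only approximates what `striptags` does and does not call Jinja2 itself; `strip_tags` and `summary` are hypothetical helper names.

```python
import re
from html import unescape


def strip_tags(html_text):
    """Remove comments and tags, then collapse runs of whitespace."""
    no_comments = re.sub(r"<!--.*?-->", "", html_text, flags=re.S)
    no_tags = re.sub(r"<[^>]*>", "", no_comments)
    return " ".join(unescape(no_tags).split())


def summary(html_text, limit=200):
    """Strip markup first, then cut the plain text to `limit` characters."""
    text = strip_tags(html_text)
    if len(text) > limit:
        return text[:limit] + "..."
    return text


content = "<p>what i want to show here is raw text without markups</p>"
print(summary(content))
print(summary("<p>" + "a" * 300 + "</p>", limit=10))
```

In the template, the equivalent is to apply `striptags` before shortening, for instance by chaining Jinja2's `truncate` filter after it.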

Matching text that is not html tags with regular expression

So I am trying to create a regular expression that matches text inside different kinds of html tags. It should match the bold text in both of these cases:
<div class="username_container">
<div class="popupmenu memberaction">
<a rel="nofollow" class="username offline " href="http://URL/surfergal.html" title="Surfergal is offline"><strong><!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end --></strong></a>
</div>
<div class="username_container">
<span class="username guest"><b><a>**Advertisement**</a></b></span>
</div>
I have tried with the following regular expression without any result:
/<div class="username_container">.*?((?<=^|>)[^><]+?(?=<|$)).*?<\/div>/is
This is my first time posting here on stackoverflow so if I am doing something incredibly stupid I can only apologize.
Using regex to parse HTML is... hard. See the links in the comments to your question.
What do you plan to do with these matches? Here's a quick jQuery script that logs the results in the console:
var a = [];
$('strong, b').each(function(){
  a.push($(this).html());
});
console.log(a);
results:
["<!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end -->", "<a>**Advertisement**</a>"]
http://jsfiddle.net/Mk7xf/
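If the pages are processed outside the browser, an HTML parser can do the same job without regex. Here is a minimal Python sketch using the standard library's `html.parser`, which reports comments separately from text, so the ad-section markers are skipped automatically. `BoldTextParser` is a hypothetical name, and the markup is a shortened copy of the example from the question.

```python
from html.parser import HTMLParser


class BoldTextParser(HTMLParser):
    """Collect text found anywhere inside <strong> or <b> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of open <strong>/<b> elements
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("strong", "b") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so only real text is stored.
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())


markup = """
<div class="username_container">
<a class="username offline" href="http://URL/surfergal.html"><strong><!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end --></strong></a>
</div>
<div class="username_container">
<span class="username guest"><b><a>**Advertisement**</a></b></span>
</div>
"""

parser = BoldTextParser()
parser.feed(markup)
parser.close()
print(parser.texts)
```

This sketch collects bold text from the whole document; restricting it to `username_container` blocks would need one more piece of state in `handle_starttag`.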