References not sorted properly by author-year - r-markdown

I'm working on a markdown Rmd document with references in several .bib BibTeX databases. The yaml header includes:
---
title: "title"
author: "me"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  bookdown::word_document2:
    reference_docx: StylesTemplate.docx
    number_sections: false
bibliography:
  - "`r system('kpsewhich graphics.bib', intern=TRUE)`"
  - "`r system('kpsewhich statistics.bib', intern=TRUE)`"
  - "`r system('kpsewhich timeref.bib', intern=TRUE)`"
  - "`r system('kpsewhich Rpackages.bib', intern=TRUE)`"
csl: apa.csl
---
I am stymied in how to get the following references to sort in the correct author-year order. The first two are out of order.
I am aware that pandoc-citeproc with a .csl file attempts to disambiguate authors when there are different spellings, but I checked my .bib files and all of these Tukey publications have one of:
author = {John W. Tukey}
author = {Tukey, John W.}
so they should be considered the same.
The first 4 references in my BibTeX files are:
@InProceedings{Tukey:1975:picturing,
author = {John W. Tukey},
booktitle = {Proceedings of the International Congress of Mathematicians, Vancouver},
title = {Mathematics and the picturing of data},
year = {1975},
pages = {523--531},
volume = {2},
}
@Techreport{Tukey:1993:TR,
author = "John W. Tukey",
title = "Exploratory Data Analysis: Past, Present, and Future",
institution = "Department of Statistics, Princeton University",
year = "1993",
number = "No. 302",
month = apr,
url = "https://apps.dtic.mil/dtic/tr/fulltext/u2/a266775.pdf",
}
@Article{Tukey:59,
author = {John W. Tukey},
journal = {Technometrics},
title = {A Quick, Compact, Two Sample Test to {Duckworth's} Specifications},
year = {1959},
pages = {31--48},
volume = {1},
doi = {10.2307/1266308},
url = {https://www.jstor.org/stable/1266308},
}
@article{Tukey:1962,
Author = {John W. Tukey},
Journal = {The Annals of Mathematical Statistics},
Number = {1},
Pages = {1--67},
Publisher = {Institute of Mathematical Statistics},
Title = {The Future of Data Analysis},
Url = {http://www.jstor.org/stable/2237638},
Volume = {33},
Year = {1962},
}
I see minor differences in formatting, but these should not affect pandoc-citeproc sorting.
Is this perhaps a bug in pandoc-citeproc or is there something I can do in my .bib files to avoid this?
I'm running R 4.1.3 under R Studio 2022.02.1, with pandoc 2.17.1.1
Update
I re-ran this using the chicago-author-date.csl style. All the references now sort correctly, so there must be something peculiar with the apa.csl style. I'd still prefer apa.csl style, so it would be of interest to understand why the difference.

Related

How do I remove DOI from R-markdown bibliography?

I would like to remove the DOI from the bibliographic references in my markdown script. Is there a way I can do this?
Here is my markdown file:
---
title: "my paper"
author: "name"
date: \today
header-includes:
output:
pdf_document:
number_sections: yes
toc: yes
keep_tex: yes
fig_caption: yes
word_document:
toc: yes
latex_engine: xelatex
indent: yes
bibliography: library.bib
references:
link-citations: yes
linkcolor: blue
hyperfootnotes: yes
---
I would like to remove the DOI from this reference @Wallace2005
# Bibliography {-}
::: {#refs}
:::
The output of this file is the following:
And here is the .bib file
@article{Wallace2005,
abstract = {Constantly evolving, and with far-reaching implications, European Union policy-making is of central importance to the politics of the European Union. From defining the processes, institutions and modes through which policy-making operates, the text moves on to situate individual policies within these modes, detail their content, and analyse how they are implemented, navigating policy in all its complexities. The first part of the text examines processes, institutions, and the theoretical and analytical underpinnings of policy-making, while the second part considers a wide range of policy areas, from economics to the environment, and security to the single market. Throughout the text, theoretical approaches sit side by side with the reality of key events in the EU, including enlargement, the ratification of the Lisbon Treaty, and the financial crisis and resulting euro area crisis, exploring what determines how policies are made and implemented. In the final part, the editors consider trends in EU policy-making and look at the challenges facing the EU. Exploring the link between the modes and mechanisms of EU policy-making and its implementation at national level, Policy-Making in the European Union helps students to engage with the key issues related to policy. Written by experts, for students and scholars alike, this is the most authoritative and in-depth guide to policy in the European Union.},
author = {Wallace, Helen and Wallace, William and Pollack, Mark A},
doi = {10.1177/0010414013516917},
file = {:Users/aguasti/Desktop/Mendely Organized Library/Wallace, Wallace, Pollack/Wallace, Wallace, Pollack - 2005 - Policy-Making in the European Union.pdf:pdf},
isbn = {0199689679},
issn = {0010-4140},
pages = {574},
pmid = {130137987},
title = {{Policy-Making in the European Union}},
url = {https://books.google.com/books?id=w6SbBQAAQBAJ&pgis=1},
year = {2005}
}
If anyone knows how I could remove the DOI from the bibliographic reference, I would be extremely grateful.
I am assuming that you want to have this done on the fly while knitting the PDF.
The way the references are rendered is controlled by the applied citation styles.
So, one way would be to change the citation style in the YAML header to a style that does not include the DOI (note that for the PDF output you would need to add the natbib line).
bibliography: library.bib
citation_package: natbib
csl: somethingelse.csl
Alternatively, if you have to stick to a certain style, you could [modify the CSL file](https://www.zotero.org/support/dev/citation_styles/style_editing_step-by-step).
Example for elsevier-harvard.csl
You could just comment out the relevant line in the CSL file:
<if variable="DOI">
<!--<text variable="DOI" prefix="https://doi.org/"/> -->
</if>
Save this under a new name (e.g., elsevier-harvard_mod.csl)
and then re-run your example (here shortened)
---
title: "my paper"
author: "name"
date: \today
output: pdf_document
bibliography: library.bib
citation_package: natbib
csl: elsevier-harvard_mod.csl
---
I would like to remove the DOI from this reference @Wallace2005
# Bibliography {-}

how to insert citation in a footnote of table generated with kable?

I was trying to add a citation in a footnote (or in any text in the table), but it's not working: the citation text appears exactly as typed. I thought I needed to change the table format to markdown instead of latex and to use bookdown::pdf_document2, but neither solved the problem. Another attempt, creating the citation text outside kable in a separate code chunk and then pasting it inside the footnote, also didn't work.
this is my code:
---
title: "scientific report"
output:
  pdf_document:
    fig_caption: true
    keep_tex: true
    number_sections: yes
    latex_engine: xelatex
csl: elsevier-with-titles.csl
bibliography: citations.bib
link-citations: true
linkcolor: blue
---
# This is an exaample
the number of the table below is [\ref{do}]
P.S. I wrote the superscript (a) manually in the xlsx file.
```{r echo=FALSE}
library(knitr)
library(kableExtra)
library(readxl)
dfdf <- read_excel("dyss_count.xlsx")
df <- as.data.frame(dfdf)
options(knitr.kable.NA = '')
kable(df, "latex", longtable = TRUE, booktabs = TRUE, escape = FALSE, caption = 'dosage \\label{do}', align = "c") %>%
  kable_styling(latex_options = c('repeat_header'), font_size = 7) %>%
  footnote(general = "A general footnote",
           alphabet = 'the source is @Burg_2019',
           general_title = "General: ", number_title = "Type I: ",
           alphabet_title = "Type II: ",
           footnote_as_chunk = TRUE, title_format = c("italic", "underline")
  )
```
result is:
I would be very thankful for any useful information.
Well, after many attempts it worked with conventional cross-referencing. In case someone else is having the same issue, this is what I did:
I put (ref:caption) The source is [@Burg_2019] outside the chunk and then referred to it inside the footnote: footnote(general = "A general footnote", alphabet = "(ref:caption)")
You could possibly try adding in a caption using css -> kable_styling(extra_css = ..) so you could modify its styling properties? Just a thought.

Caption numbering not in sequential order when citing the captions with captioner in Rmarkdown

I am using captioner (https://cran.r-project.org/web/packages/captioner/vignettes/using_captioner.html) to create table captions in Rmarkdown, mainly because I am using huxtable for conditional formatting and exporting to Word. This is the only way I have found to get numbered captions.
When I reference the captions, the caption numbers are not in sequential order, but only if table_nums(..., display="cite") appears before the tables. I was trying to give the range of table numbers, and that changed the number of the last table. The number isn't changed if the r table_nums('third_cars_table',display = "cite") is put after the captions. Is there a way to make sure that table numbers remain in sequential order? I'd also be happy with a better solution for numbered captions.
Reproducible example:
---
title: "Untitled"
output: bookdown::word_document2
---
```{r setup, include=FALSE}
library(captioner)
library(huxtable)
library(knitr)
library(pander)
table_nums <- captioner(prefix = "Table")
fig_nums <- captioner(prefix = "Figure")
knitr::opts_chunk$set(echo = TRUE)
```
## Description of tables
I am trying to put a description of tables
and say that these results are shown table numbers ranging
from the first table (`r table_nums('first_cars_table',display = "cite")`)
to the last table (`r table_nums('third_cars_table',display = "cite")`)
```{r, results='asis',echo=FALSE,eval.after=TRUE}
tablecap1=cat(table_nums(name="first_cars_table",caption='First car table'))
kable((cars[1:5,]))
tablecap2=cat(table_nums(name="second_cars_table",caption='second car table'))
kable(cars[6:10,])
tablecap3=cat(table_nums(name="third_cars_table",caption='third car table'))
kable(cars[10:15,])
```
The results:
A (terrible) workaround is to manually give the number ordering using display = FALSE. For example, inserting the following at the start of the document will ensure t1-t5 are sequentially numbered, no matter where the tables or first citations appear:
`r table_nums('t1', display = FALSE)`
`r table_nums('t2', display = FALSE)`
`r table_nums('t3', display = FALSE)`
`r table_nums('t4', display = FALSE)`
`r table_nums('t5', display = FALSE)`
I have not examined the captioner code, but I expect that the document is read from top to bottom once, so numbers are assigned on a first come, first served basis. I am therefore not sure there is any other way around this, as it would involve some kind of pre-processing stage.
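The first come, first served behaviour guessed at above can be illustrated with a toy model. This is a sketch in Python, not captioner's actual code: the class ToyCaptioner and its arguments are hypothetical stand-ins, and the rule "a name gets the next number on first use" is the assumption being illustrated.

```python
class ToyCaptioner:
    """Toy stand-in for a caption counter; NOT captioner's real code."""

    def __init__(self, prefix="Table"):
        self.prefix = prefix
        self.numbers = {}  # name -> number, filled in order of first use

    def __call__(self, name, caption="", display="full"):
        # Assumption: a name gets the next free number the first time it is
        # seen, no matter whether that first use is a citation or a caption.
        if name not in self.numbers:
            self.numbers[name] = len(self.numbers) + 1
        label = f"{self.prefix} {self.numbers[name]}"
        if display is False:
            return ""
        return label if display == "cite" else f"{label}: {caption}"


# Citing the first and third tables before any caption exists...
early = ToyCaptioner()
early("first_cars_table", display="cite")
early("third_cars_table", display="cite")
early_second = early("second_cars_table", "second car table")

# ...versus registering every name silently, in order, at the top.
fixed = ToyCaptioner()
for name in ("first_cars_table", "second_cars_table", "third_cars_table"):
    fixed(name, display=False)
fixed_third = fixed("third_cars_table", display="cite")

print(early_second)
print(fixed_third)
```

Under that assumption, the early citation of the third table takes number 2, so the second table's caption ends up as number 3, while silent registration in order keeps the third table at 3.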

Extracting only dates from a text file and ignoring large numbers

I have a text file and I want to extract all dates from it but somehow my code is also extracting the other values like
Procedure #: 10075453.
Below is a small sample of that file:
Patient Name: Mills, John Procedure #: 10075453
October 7, 2017
Med Rec #: 747901 Visit ID: 110408731
Patient Location: OUTPATIENT Patient Type: OUTPATIENT
DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456
Can I get an idea how I could approach this problem?
import pandas as pd

# read the file line by line into a Series
doc = []
with open('Clean.txt', encoding="utf8") as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)

def date_extract():
    # month-name dates, e.g. "October 7, 2017"
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
    # numeric dates, e.g. "07/09/1943"
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))')
    # month/year or a bare year; this also matches any 4 digits inside a longer number
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})')
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
    return pd.Series(dates.sort_values())
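One possible approach, sketched here with the standard library only, is to guard each date pattern with lookarounds so that it cannot start or end in the middle of a longer run of digits, and to drop the bare four-digit pattern, which is the one that picks digits out of values such as 10075453. The names NUMERIC_DATE, NAMED_DATE and extract_dates are hypothetical, and the patterns cover only the date shapes visible in the sample, so treat this as a starting point rather than a complete solution.

```python
import re

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Numeric dates such as 07/09/1943 or 7-9-43. The lookarounds refuse a match
# that is glued to other digits, so IDs like 10075453 are never split up.
NUMERIC_DATE = r"(?<![\d/-])\d{1,2}[/-]\d{1,2}[/-]\d{2,4}(?![\d/-])"

# Month-name dates such as "October 7, 2017", "7 Oct 2017" or "Oct 2017".
NAMED_DATE = (
    r"(?<![A-Za-z\d])(?:\d{1,2}\s+)?" + MONTH + r",?\s+"
    r"(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4}(?!\d)"
)

DATE_RE = re.compile(NUMERIC_DATE + "|" + NAMED_DATE)


def extract_dates(lines):
    """Return every date-like substring found in the given lines, in order."""
    found = []
    for line in lines:
        found.extend(match.group(0) for match in DATE_RE.finditer(line))
    return found


sample = [
    "Patient Name: Mills, John Procedure #: 10075453",
    "October 7, 2017",
    "Med Rec #: 747901 Visit ID: 110408731",
    "Patient Location: OUTPATIENT Patient Type: OUTPATIENT",
    "DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456",
]

print(extract_dates(sample))
```

On the sample from the question this keeps the two dates and skips the procedure number, record numbers and phone number. The matched strings could then be converted to dates in a separate step.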

R - does failed RegEx pattern matching originate in file conversion or use of tm package?

As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some preprocessing on a corpus of texts using R before processing them further on the NLP platform GATE. I convert the original pdf files to text as follows (the text files, unfortunately, go into the same folder):
dest <- "./MyFolderWithPDFfiles"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
Then, having loaded the tm package and physically(!) moved the text files to another folder, I create a corpus:
TextFiles <- "./MyFolderWithTXTfiles"
EU <- Corpus(DirSource(TextFiles))
I then want to perform a series of custom transformations to clean the texts. I succeeded to replace a simple string as follows:
ReplaceText <- content_transformer(function(x, from, to) gsub(from, to, x, perl=T))
EU2 <- tm_map(EU, ReplaceText, "Table of contents", "TOC")
However, a pattern that is a 1-3 digit page number followed by two line breaks and a page break is causing me problems. I want to replace it with a blank space:
EU2 <- tm_map(EU, ReplaceText, "[0-9]{1,3}\n\n\f", " ")
The ([0-9]{1,3}) and \f alone match. The line breaks don't. If I copy text from one of the original .txt files into the RegExr online tool and test the expression "[0-9]{1,3}\n\n\f", it matches. So the line breaks do exist in the original .txt file.
But when I view one of the .txt files as read into the EU corpus in R, there appear to be no line breaks even though the lines are obviously breaking before the margin, e.g.
[3] "PROGRESS TOWARDS ACCESSION"
[4] "1"
[5] ""
[6] "\fTable of contents"
Seeing this, I tried other patterns, e.g. to detect one or more blank space ("[0-9]{1,3}\s*\f"), but no patterns worked.
So my questions are:
Am I converting and reading the files into R correctly? If so, what has happened to the line breaks?
If no line breaks is normal, how can I pattern match the character on line 5? Is that not a blank space?
(A tangential concern:) When converting the pdf files, is there code that will put them directly in a new folder?
Apologies for extending this, but how can one print or inspect only a few lines of the text object? The tm commands and head(EU) print the entire object, each a very long text.
I know my problem(s) must appear simple and perhaps stupid, but one has to start somewhere and extensive searching has not revealed a source that explains comprehensively how to use RegExes to modify text objects in R. I am so frustrated and hope someone here will take pity and can help me.
Thanks for any advice you can offer.
Brigitte
p.s. I think it's not possible to upload attachments in this forum, therefore, here is a link to one of the original PDF documents: http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf
Because the doc is long, I created a snippet of the first 3 pages of the TXT doc, read it into the R corpus ('EU') and printed it to the console and this is it:
dput(EU[[2]])
structure(list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON",
"CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION ***********************",
"1", "", "\fTable of contents", "A. Introduction", "a) Preface The Context of the Progress Report",
"b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations",
"B. Criteria for membership", "1. Political criteria", "1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures",
"1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities",
"1.3. General evaluation", "2. Economic criteria", "2.1. Introduction 2.2. Economic developments since the Commission published its Opinion",
"Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation",
"3. Ability to assume the obligations of Membership", "3.1. Internal Market without frontiers General framework The Four Freedoms Competition",
"3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual",
"3.3. Economic and Fiscal Affairs Economic and Monetary Union",
"2", "", "\fTaxation Statistics "), meta = structure(list(author = character(0),
datetimestamp = structure(list(sec = 50.1142621040344, min = 33L,
hour = 15L, mday = 3L, mon = 10L, year = 114L, wday = 1L,
yday = 306L, isdst = 0L), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt",
"POSIXt"), tzone = "GMT"), description = character(0), heading = character(0),
id = "CZ1998ProgressSnippet.txt", language = "en", origin = character(0)), .Names = c("author",
"datetimestamp", "description", "heading", "id", "language",
"origin"), class = "TextDocumentMeta")), .Names = c("content",
"meta"), class = c("PlainTextDocument", "TextDocument"))
Yes, working with text in R is not always a smooth experience! But you can get a lot done quickly with some effort (maybe too much effort!)
If you could share one of your PDF files or the output of dput(EU), that might help to identify exactly how to capture your page numbers with regex. That would also add a reproducible example to your question, which is an important thing to have in questions here so that people can test their answers and make sure they work for your specific problem.
There is no need to put the PDF and text files in separate folders. Instead, you can use a pattern like so:
EU <- Corpus(DirSource(pattern = "\\.txt$"))
This will read only the text files and ignore the PDF files.
There is no 'snippet view' method in tm, which is annoying. I often just use names(EU) and EU[[1]] for quick looks.
UPDATE
With the data you've just added, I'd suggest a slightly tangential approach. Do the regex work before passing the data to the tm package formats, like so:
library(tm)
# get the PDF
download.file("http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf", "my_pdf.pdf", method = "wget")
# get the file name of the PDF
myfiles <- list.files(path = getwd(), pattern = "pdf", full.names = TRUE)
# convert to text (note that my pdftotext is in a different location from yours);
# wait = TRUE so that the text file exists before it is read below
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = TRUE))
# read plain text into R
x1 <- readLines("my_pdf.txt")
# make into a single string
x2 <- paste(x1, collapse = " ")
# do some regex...
x3 <- gsub("Table of contents", "TOC", x2)
# page number, then the space(s) left by the blank line, then the page break
x4 <- gsub("[0-9]{1,3} +\f", "", x3)
# convert to corpus for text mining operations
x5 <- Corpus(VectorSource(x4))
With the snippet of data you provided using dput, the output from this method is
inspect(x5)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
REGULAR REPORT FROM THE COMMISSION ON CZECH REPUBLIC'S PROGRESS TOWARDS ACCESSION *********************** TOC A. Introduction a) Preface The Context of the Progress Report b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations B. Criteria for membership 1. Political criteria 1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures 1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities 1.3. General evaluation 2. Economic criteria 2.1. Introduction 2.2. Economic developments since the Commission published its Opinion Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation 3. Ability to assume the obligations of Membership 3.1. Internal Market without frontiers General framework The Four Freedoms Competition 3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual 3.3. Economic and Fiscal Affairs Economic and Monetary Union Taxation Statistics
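To show why the original "\n\n\f" pattern finds nothing once the file has been read line by line, and why joining the lines first makes the page number matchable, here is a small language-neutral sketch written in Python rather than R. The four lines are taken from the corpus printout in the question; the variable names are only illustrative.

```python
import re

# Lines as a line-based reader returns them: the "\n" terminators are gone,
# so a pattern containing "\n\n\f" has nothing left to match.
lines = ["PROGRESS TOWARDS ACCESSION", "1", "", "\fTable of contents"]
has_newline = any("\n" in line for line in lines)

# Joining with a single space turns the empty line into a second space,
# so the page number is followed by two spaces and then the form feed.
text = " ".join(lines)

# Allow one or more spaces between the page number and the form feed.
cleaned = re.sub(r"\b[0-9]{1,3} +\f", "", text)

print(has_newline)
print(repr(text))
print(repr(cleaned))
```

The empty element on "line 5" is therefore not a blank space but an empty string, and it only becomes whitespace after the lines are pasted together.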