Bookdown text references not working if URL contains special characters

Bookdown text references not working if URL contains special characters - r-markdown

There is some inconsistent behaviour of text references in bookdown with URLs containing special characters such as : or _. Here are some demonstrations:
---
output: bookdown::pdf_book
---
(ref:good) [This Works](https://commons.wikimedia.org/wiki)
(ref:good)
(ref:bad) [This Does Not](https://commons.wikimedia.org/wiki/File:Linear_visible_spectrum.svg)
(ref:bad)
The link will work normally [like here](https://commons.wikimedia.org/wiki/File:Linear_visible_spectrum.svg)
Is there a way to get text references to work if they contain special characters?
This behaviour was flagged in this question here, but the symbols were not directly identified as the key issues within the question. I wanted to make a focussed thread on SO before raising it as a potential issue on GitHub.

The problem was not caused by special characters but the fact that the link was too long, so the line in LaTeX was wrapped by Pandoc by default:
(ref:bad)
\href{https://commons.wikimedia.org/wiki/File:Linear_visible_spectrum.svg}{This Does Not}
It should be considered as a bug of bookdown, but there is a workaround:
output:
bookdown::pdf_book:
pandoc_args: [--wrap=none]

Related

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically but have run into problems with a special character strewn throughout the file. The file is way too large to scan it looking for specific characters. Most of the other unwanted special characters I've been able to get rid of using some regex pattern. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and past it here I get "�", so the example of the box is from Windows character map which includes the code 'U+25A1', which I'm not sure how to interpret or if it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.

It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1

The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. This is likely to suffice <meta charset=UTF-8> in the header of the page.

I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex expression, \x{FFFD}, within Notepad++ search gave me all the squares, although recognizing them as the Replacement Character.

Having issues with regex sub function in Python 3

I am trying to remove symbols like \x92, \xa0 etc from a text that I have downloaded from website and parsed using BeautifulSoup. Then I see that I have these symbols (encoding ) everywhere. I am using re.sub(r'[^\x00-x7F]',' ',txt)
to remove these symbols from my txt, but I noticed that I have lost each occurrence of y. For example: 'Security' became 'Securit' etc.
Any help would be greatly appreciated.
Thanks.

The (erroneous) regular expression r'[^\x00-x7F]' probably should be r'[^\x00-\x7F]' (note an additional backslash).
As you have written, it contains the set not in NUL through x. So y and subsequent ascii codes are missing.

Doxygen parsing ampersands for ascii chars

I've been using Doxygen to document my project but I've ran into some problems.
My documentation is written in a language which apostrophes are often used. Although my language config parameter is properly set, when Doxygen generates the HTML output, it can't parse apostrophes so the code is shown instead of the correct character.
So, in the HTML documentation:
This should be the text: Vector d'Individus
But instead, it shows this: Vector d'Individus
That's strange, but searching the code in the HTML file, I found that what happens is that instead of using an ampersand to write the ' code, it uses the ampersand code. Well, seeing the code is easier to see:
<div class="ttdoc">Vector d&#39;Individus ... </div>
One other thing is to note that this only happens with the text inside tooltips...
But not on other places (same code, same class)...
What can I do to solve this?
Thanks!

Apostrophes in code comments must be encoded with the correct glyph for doxygen to parse it correctly. This seems particularly true for the SOURCE_TOOLTIPS popups. The correct glyph is \u2019, standing for RIGHT SINGLE QUOTATION MARK. If the keyboard you are using is not providing this glyph, you may write a temporary symbol (e.g. ') and batch replace it afterwards with an unicode capable auxiliary tool, for example: perl -pC -e "s/'/\x{2019}/g" < infile > outfile. Hope it helps.

Regarding the answer from ramkinobit, this is not necessary, doxygen can use for e.g. the Right Single quote: ’ (see doxygen documentation chapter "HTML commands").
Regarding the apostrophe the OP asks for one can use (the doxygen extension) &apos; (see also doxygen documentation chapter "HTML commands")).
There was a double 'HTML escape' in doxygen resulting in the behavior as observed for the single quote i.e. displaying '.
I've just pushed a proposed patch to github (pull request 784, https://github.com/doxygen/doxygen/pull/784).
EDIT 07/07/2018 (alternative) patch has been integrated in main branch on github.

How replacing with regex characters generated by encoding errors when embedded in text

I need to replace the following characters with regex (gsub):
ÃÆÃÂ¨ -> è
ÃÆÃÂ -> à
ÃÆÃÂ² -> ò
ÃÆÃÂ¬ -> ì
ÃÆÃÃ¹ -> ù
My strategy is to first remove the first three characters ÃÆÃ that are common to all and the move to the last, leaving à at the end since it is basically the lowest common denominator.
Now gsub correctly removes the first three but then it seams it doesn't see the final ones - like Â¨ - but I noticed it sees Ã± (for ñ).
By copy/pasting the characters into the text editor I noticed they cause weird behaviours (such as moving the cursor forward by few positions).
My dataset was downloaded from a website that itself has encoding problems for the oldest pages but not for the most recent ones (I think they corrected the encoding problem sometime in the last years). Visiting the oldest pages you can still see the very same ÃÆÃÂ¨ in plain sight. Then the problem is not (I assume) in the encoding of my file.
That is, the encoding errors are limited to regions of the dataset and are not the result of an encoding issue with the whole text corpus.

The problem when the characters are not correctly displayed is to understand exactly how they are parsed by the regex. In my case, as explained, the encoding errors where limited to few strings in my dataset. Then Encoding() was not applicable.
I solved the problem by visualising the problematic characters directly in R console. In console they appear like Ã\u0083Æ\u0092Ã\u0082Â¨ while in the R-studio they were visualised as Ã Æ Ã Â¨. What visualised in console was what I needed for a correct match with regex: gsub("Ã\u0083Æ\u0092Ã\u0082Â¨"...

Detecting Characters in an XSLT

I have encountered some odd characters that do not display properly in Internet Explorer, such as these: â€œ, â€“, and â€™. I think they're carried over from copy-and-paste Word content.
I am using XSLT to build the page content and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the style sheet, but I'm not sure how detect these encoded characters or whether it's possible.

What about simply changing the encoding for the Stylesheet as well as its output to UTF-8? The characters you mention are “, – and ’. Certainly not invalid or so, given the correct encoding (the characters are at least perfectly valid in Codepage 1252).

Using a good XML editor such as XMLSpy should highlight any errors in formatting your XSLT by validating at development time.

Jeni Tennison's Multiple string replacements may be a good starting point.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Bookdown text references not working if URL contains special characters - r-markdown

Related

How to find and replace box character in text file?

Having issues with regex sub function in Python 3

Doxygen parsing ampersands for ascii chars

How replacing with regex characters generated by encoding errors when embedded in text

Detecting Characters in an XSLT

Categories

Resources