Replacing numbers in R with a regular expression

I am trying to code up an HTML tagging tool for my R code, and I am having difficulty finding numbers and replacing them with colored numbers.
I think the following is in the right direction but I am not sure what to do:
txt <- gsub("\\<[:digit:]\\>", paste0(num.start,"\\1",num.end) , txt)
This does not seem to do the job. Overall, I would like every number that is not part of a word to be wrapped in opening and closing tags that change its color; the tags are defined by the num.start and num.end variables.
For example:
num.start <- '<span style="color: #990000"><b>'
num.end <- '</b></span>'
So I would like to be able to feed in say R code and have it write html tags when appropriate.
Rcode:
txt <- "a <- 3945 ; b <- 3453*3942*a"
gsub("\\<[:digit:]\\>", paste0(num.start,"\\1",num.end) , txt)
[1] "a <- <span style="color: #990000"><b>3945</b></span> ; b <- <span style="color: #990000"><b>3453</b></span>*<span style="color: #990000"><b>3942</b></span>*a"
The hope is that I could copy the modified R code into an HTML editor, such as the one on my blog, and all of the numbers would be color coded.
Thanks so much for any assistance!
Francis

This will do the job, though I do not recommend using regular expressions with HTML:
gsub("(\\d+)", paste0(num.start,"\\1",num.end) , txt)
The result:
[1] "a <- <span style=\"color: #990000\"><b>3945</b></span> ; b <- <span style=\"color: #990000\"><b>3453</b></span>*<span style=\"color: #990000\"><b>3942</b></span>*a"

Related

Word count in Quarto

I would love a convenient and easy way to print my word count automatically in quarto and stumbled across this nice add-in from Ben Marwick:
https://github.com/benmarwick/wordcountaddin
It works well for R Markdown, and I presumed it would work with Quarto too. However, although the add-in can count the words within my RStudio session, it doesn't print the count in my final PDF output and just returns [1] NA.
{r, #wordcountdev, message = FALSE, warning = FALSE, echo = FALSE}
wordstats <- wordcountaddin:::text_stats('CMI Write Up.qmd')
words <- substr(wordstats[3], start=19, stop=30)
print(words)
I don't understand what is going on here; it seems like it should be simple. Does anyone know of a better way to achieve this?
You could take a look at the wordcounts pandoc filter. As a starting point, it prints the number of words in the body to the console while rendering:
---
format: html
filters: [wordcounts.lua]
---
Hello there, how many words are in the body?
Alternatively, you can use the development version of the package mentioned above (install it with devtools::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)):
---
format: html
---
```{r}
#| echo: false
#| label: wordstats
#| warning: false
#| message: false
wordcount <- wordcountaddin::text_stats('wordcount.qmd')
words <- substr(wordcount[3], start=19, stop=30)
```
Hello there, how many words are in the body?
There are `r words` words in the whole document.

How to align text in Mermaid flowchart node?

I would like to align text in a Mermaid flowchart node so the Thinkpad and iPad will line up. I tried to insert a \t before them but they just got rendered as text.
flowchart TD
A[Christmas] -->|Get money| B(Go shopping)
B --> C{Let me think}
C -->|One| D["Laptop: Thinkpad\nTablet: iPad"]
style D text-align:left
C -->|Two| E[iPhone]
C -->|Three| F[fa:fa-car Car]
I added a unicode space to line it up:
flowchart TD
A[Christmas] -->|Get money| B(Go shopping)
B --> C{Let me think}
C -->|One| D["Laptop: Thinkpad\nTablet: iPad"]
style D text-align:left
C -->|Two| E[iPhone]
C -->|Three| F[fa:fa-car Car]
The simple way is to pad the text with spaces.
If you just want to know how to print a tab, try &Tab;.
Be careful: this does not work with graph, so use flowchart instead.
Example
<script src="https://cdnjs.cloudflare.com/ajax/libs/mermaid/8.14.0/mermaid.min.js"></script>
<h2>Graph</h2>
<i>Do not use <code>graph</code>; <code>\n</code> does not work with it.</i>
<div class="mermaid">
graph TD
A --> B["Laptop: Thinkpad\nTablet: iPad"]
</div>
<h2>flowchart</h2>
<div class="mermaid">
flowchart TD
A --> B["Laptop: Thinkpad\nTablet&Tab;: iPad"] %% use TAB same as ` `
C --> D["Laptop: Thinkpad\nTablet : iPad"] %% use space
style B text-align:left
style D text-align:left
</div>
Reference
charref
&Tab; , &#9829 ♥ ...
Entity codes to escape characters

Append values in regular expressions

I'm using XPath and regular expressions to obtain data from a web page.
I'm using the following XPath to get the portion I'm interested in:
response.xpath('//*[@id="business-detail"]/div/p').extract()
EDIT:
Which provides the following:
[u'<p><span class="business-phone" itemprop="telephone">(415) 287-4225</span><span class="business-address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">2180 Bryant St. STE 203, </span><span itemprop="addressLocality">San Francisco</span>,\xa0<span itemprop="addressRegion">CA</span>\xa0<span itemprop="postalCode">94110</span></span><span class="business-link">www.klopfarchitecture.com</span> <br><br></p>']
I'm interested in
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
<span itemprop="addressLocality">San Francisco</span>
<span itemprop="addressRegion">CA</span>
<span itemprop="postalCode">94110</span>
So I'm using this regex to extract the data
reg = r'"streetAddress">[0-9]+[^<]*'
reg = r'"addressLocality"[^<]*'
reg = r'"addressRegion"[^<]*'
reg = r'"postalCode"[^<]*'
The problem is that there are four of them, so I get four variables. I need to append the data so that the full address is in one variable that I can assign to an Item. What would be an efficient way to accomplish this?
EDIT2:
You're right, Roshan Jossey, I can use response.xpath('//*[@itemprop="streetAddress"]').extract()
But there are still four labels, including addressLocality, addressRegion and postalCode. How do I merge the results?
I'm looking for this result:
2180 Bryant St. STE 203, San Francisco, CA 94110
And I'm getting this format for each of the four parts:
<span itemprop="streetAddress">2180 Bryant St. STE 203, </span>
I'd recommend using just XPath to solve this problem.
response.xpath('//*[@id="business-detail"]/div/p//span[@itemprop="streetAddress"]/text()').extract()[0]
will give you the street address. You can extract the other elements in a similar fashion; then it's just a matter of concatenating them.
Regular expressions look like overkill when such a simple XPath solution exists.
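The concatenation step might look like the following Python sketch. It assumes each of the four XPath queries has already returned one text value (the strings used below are copied from the sample markup in the question), and format_address is a hypothetical helper name, not part of Scrapy.

```python
def format_address(street, locality, region, postal_code):
    """Join the four address parts into a single line.

    Each argument is assumed to be the text of one itemprop span, e.g.
    response.xpath('//span[@itemprop="streetAddress"]/text()').extract()[0]
    """
    # The markup pads values with spaces, non-breaking spaces (\xa0) and a
    # trailing comma, so normalise every part before joining.
    parts = [
        part.replace("\xa0", " ").strip().rstrip(",").strip()
        for part in (street, locality, region, postal_code)
    ]
    return "{}, {}, {} {}".format(*parts)


# Values as they appear in the sample markup from the question.
address = format_address("2180 Bryant St. STE 203, ", "San Francisco", "CA", "94110")
print(address)  # 2180 Bryant St. STE 203, San Francisco, CA 94110
```

The resulting string can then be assigned to a single Item field. Cleaning each part first means the separators are added in one place, rather than relying on the commas and spaces that happen to be in the page.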

how to read a custom part of text from html file in R

I have some html text which looks as follows.
I would like to extract the part which says 745 from this text.
For a different query the number may be something else, so I'm looking for whatever follows the word 'of'.
<div><h2>Search results</h2><h3 class="result_count left">Items: 1 to 20 of 745</h3><span id="result_sel" class="nowrap"></span><input name="EntrezSystem2.PEntrez.Spring.Spring_ResultsPanel.Spring_ResultsController.ResultCount" sid="1" type="hidden" id="resultcount" value="745" /><input name="EntrezSystem2.PEntrez.Spring.Spring_ResultsPanel.Spring_ResultsController.RunLastQuery" sid="1" type="hidden" /></div>
How can I do this using a regular expression in R?
A repeatable way that incorporates proper HTML parsing and regular expressions may be:
library(rvest)
library(stringr)
search_results <- '<div><h2>Search results</h2><h3 class="result_count left">Items: 1 to 20 of 745</h3><span id="result_sel" class="nowrap"></span><input name="EntrezSystem2.PEntrez.Spring.Spring_ResultsPanel.Spring_ResultsController.ResultCount" sid="1" type="hidden" id="resultcount" value="745" /><input name="EntrezSystem2.PEntrez.Spring.Spring_ResultsPanel.Spring_ResultsController.RunLastQuery" sid="1" type="hidden" /></div>'
pg <- read_html(search_results)
items <- html_text(html_nodes(pg, "h3.result_count"))
to_val <- as.numeric(str_match(items, "Items: [[:digit:]]+ to [[:digit:]]+ of ([[:digit:]]+)")[,2])
depending on the answer to the comment.
One could also mine the <input> tag with the id="resultcount" attribute, if that is also consistent in the HTML response.

Remove Hashes in R Output from R Markdown and Knitr

I am using RStudio to write my R Markdown files. How can I remove the hashes (##) in the final HTML output file that are displayed before the code output?
As an example:
---
output: html_document
---
```{r}
head(cars)
```
You can include in your chunk options something like
comment=NA # to remove all hashes
or
comment='%' # to use a different character
More help on knitr available from here: http://yihui.name/knitr/options
If you are using R Markdown as you mentioned, your chunk could look like this:
```{r comment=NA}
summary(cars)
```
If you want to change this globally, you can include a chunk in your document:
```{r include=FALSE}
knitr::opts_chunk$set(comment = NA)
```
Just HTML
If your output is just HTML, you can make good use of the PRE or CODE HTML tag.
Example
```{r my_pre_example,echo=FALSE,include=TRUE,results='asis'}
knitr::opts_chunk$set(comment = NA)
cat('<pre>')
print(t.test(mtcars$mpg,mtcars$wt))
cat('</pre>')
```
HTML Result:
Welch Two Sample t-test
data: mtcars$mpg and mtcars$wt
t = 15.633, df = 32.633, p-value < 0.00000000000000022
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
14.67644 19.07031
sample estimates:
mean of x mean of y
20.09062 3.21725
Just PDF
If your output is PDF, then you may need a replacement function. Here is what I am using:
```r
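library(stringr)  # provides str_replace_all()

tidyPrint <- function(data) {
  # Join the captured output lines, separated by blank lines
  content <- paste0(data, collapse = "\n\n")
  content <- str_replace_all(content, "\\t", " ")      # tabs -> spaces
  content <- str_replace_all(content, "\\ ", "\\\\ ")  # prefix each space with a backslash
  content <- str_replace_all(content, "\\$", "\\\\$")  # escape dollar signs
  content <- str_replace_all(content, "\\*", "\\\\*")  # escape asterisks
  content <- str_replace_all(content, ":", ": ")       # add a space after each colon
  return(content)
}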
```
Example
The code also needs to be a little different:
```{r my_pre_example,echo=FALSE,include=TRUE,results='asis'}
knitr::opts_chunk$set(comment = NA)
resultTTest <- capture.output(t.test(mtcars$mpg,mtcars$wt))
cat(tidyPrint(resultTTest))
```
PDF Result
PDF and HTML
If you really need the page to work for both PDF and HTML output, the last step of tidyPrint should be a little different.
```r
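library(stringr)  # provides str_replace_all()

tidyPrint <- function(data) {
  # Join the captured output lines, separated by blank lines
  content <- paste0(data, collapse = "\n\n")
  content <- str_replace_all(content, "\\t", " ")      # tabs -> spaces
  content <- str_replace_all(content, "\\ ", "\\\\ ")  # prefix each space with a backslash
  content <- str_replace_all(content, "\\$", "\\\\$")  # escape dollar signs
  content <- str_replace_all(content, "\\*", "\\\\*")  # escape asterisks
  content <- str_replace_all(content, ":", ": ")       # add a space after each colon
  # Wrap the result in <code> tags for the HTML output
  return(paste("<code>", content, "</code>\n"))
}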
```
Result
The PDF result is the same, and the HTML result is close to the previous one, but with an extra border.
It is not perfect, but it may be good enough.