Control text wrapping and hyphenation in kable tables - r-markdown

I have an R Markdown document with embedded tables, but I have trouble understanding the underlying rules for text wrapping and hyphenation of the table contents. Searching Stack Overflow and other resources hasn't provided much insight.
An example is provided below. The column widths specified are only necessary in the example to reproduce the problem I have with the real table. After some trial and error, I was able to get the last column header to hyphenate by entering it as " Manufacturer ", but this trick does not work in the rows below that header. Additional examples of text in cells either getting cut off or spilling into adjacent cells are shown in the third column (Result); the formatting of each cell entry is displayed in the second column. I've added a border between the third and fourth columns to highlight the problems. The real table has 8 columns, and I've adjusted those column widths as much as possible while preserving readability.
---
title: 'Table_7_problem'
fontsize: 11pt
output:
  bookdown::pdf_document2:
    toc: false
    number_sections: false
    latex_engine: xelatex
tables: yes
header-includes:
  - \usepackage{booktabs}
  - \usepackage{longtable}
  - \usepackage{colortbl} # to set row stripe colors
  - \usepackage{tabu}
  - \setlength{\tabcolsep}{1pt}
---
```
```{r setup, echo = TRUE, cache = FALSE, warning = FALSE, message = FALSE}
library(knitr)
```
# Table 7: Appliance durability
This table contains fictional data.
```{r table7, echo = FALSE, cache = FALSE, warning = FALSE, message = FALSE}
table7 <- data.frame(
Column_1 = c('Very long string #1 that requires a wide column to accomodate and maintain readability' ,'Very long string #2... and more of the same down rows for this column...','Very long string #3','Very long string #4','Very long string #5','Very long string #6', 'Very long string #7'),
Column_2 = c('"SampleText"',
'"Sample Text"',
'" SampleText"',
'"SampleText "',
'" SampleText "',
'"SampleText #2"',
'"Sample Text #2"'),
Column_3 = c('SampleText',
'Sample Text',
' SampleText',
'SampleText ',
' SampleText ',
'SampleText #2',
'Sample Text #2"'),
Column_4 = c('Manufacturer',
' Manufacturer',
'Manufacturer ',
' Manufacturer ',
' LongManufacturerName',
'Long_Manufacturer_Name',
"Long Manufacturer Name")
)
###
colnames(table7) <- c("Name", "Cell Content Format", "Result", " Manufacturer ")
library(kableExtra)
table7 %>%
kbl(longtable = TRUE, align = "lllc", booktabs = TRUE) %>%
kable_styling(full_width = FALSE, font_size = 8, latex_options = c("repeat_header", "striped"), stripe_color = "gray!15", repeat_header_text = "Table 7 \\textit{continued...}") %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, width = "1.5in") %>%
column_spec(2, width = "3.825in") %>%
column_spec(3, width = "0.5in") %>%
column_spec(4, width = "0.45in", border_left = TRUE)
```
The above code produces this:
Any advice or solutions on how to control the hyphenation and word wrapping to resolve these problems?
*** UPDATE 2022-09-07
Status update: I've explored several packages for making the table. So far none does everything I was looking for, but the flextable package does most of what I wanted. The updated code and PDF result are shown below. It may not be pretty, but it gets the job done. Some conflicts seem to arise when piping the formatting commands, but they work fine when entered one at a time, which is why there are multiple t7 <- ... statements (I played around with much more elaborate formatting, and the same strategy of using individual statements worked).
table7 <- data.frame(
Column_1 = c('Very long string #1 that requires a wide column to accomodate and maintain readability' ,'Very long string #2... and more of the same down rows for this column...','Very long string #3','Very long string #4','Very long string #5','Very long string #6', 'Very long string #7'),
Column_2 = c('"SampleText"',
'"Sample Text"',
'" SampleText"',
'"SampleText "',
'" SampleText "',
'"SampleText #2"',
'"Sample Text #2"'),
Column_3 = c('SampleText',
'Sample Text',
' SampleText',
'SampleText ',
' SampleText ',
'SampleText #2',
'Sample Text #2"'),
Column_4 = c('Manufacturer',
' Manufacturer',
'Manufacturer ',
' Manufacturer ',
' LongManufacturerName',
'Long_Manufacturer_Name',
"Long Manufacturer Name")
)
###
colnames(table7) <- c("Name", "Cell Content Format", "Result", "Manu-\nfacturer")
library(flextable)
library(stringr)
set_flextable_defaults(
font.family = gdtools::match_family(font = "Serif"),
font.size = 8,
padding = 3)
table7$`Manu-\nfacturer` <- str_replace(string = table7$`Manu-\nfacturer`, pattern = 'Manufacturer', replacement = 'Manu-\nfacturer')
t7 <- table7 %>% flextable() %>%
width(., width = c(1.5, 3.825, 0.5, 0.45), unit = "in") %>%
#add_header_lines(., values = "Table 7") %>%
theme_zebra(.)
t7 <- hline(t7, i = 1, border = officer::fp_border(color = "black"), part = "header")
t7 <- flextable::align(t7, i = 1, j = 1, align = "left", part = "header")
t7
The above generates the figure below. The str_replace strategy suggested by @Julian achieves the hyphenation and wrapping, and theme_zebra() in flextable preserves the row striping.

You can add line breaks to the cell contents and pass escape = FALSE to your kable function. Note that with escape = FALSE you also need to escape special characters such as # and _ yourself.
table7 <- data.frame(
Column_1 = c('Very long string 1 that requires a wide column to accomodate and maintain readability' ,'Very long string 2... and more of the same down rows for this column...','Very long string 3','Very long string 4','Very long string 5','Very long string 6', 'Very long string 7'),
Column_2 = c('"SampleText"',
'"Sample Text"',
'" SampleText"',
'"SampleText "',
'" SampleText "',
'"SampleText 2"',
'"Sample Text 2"'),
Column_3 = c('Sample\nText',
'Sample\n Text',
' Sample\nText',
'Sample\nText ',
' Sample\nText ',
'Sample\nText 2',
'Sample \nText 2"'),
Column_4 = c('Manu\nfacturer',
' Manu\nfacturer',
'Manu\nfacturer ',
' Manu\nfacturer ',
' Long\nManufacturer\nName',
'Long\nManufacturer\nName',
"Long\n Manufacturer\n Name")
)

Related

How to separate values from a string using quotes as an escape sequence?

For example, given this text:
'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
how can I separate it into array values such as Data 1, Da;ta;3;etc..., Data 4, and so on? There can be an unknown number of ; characters inside the quotes, and the content can contain arbitrary binary characters (non-UTF-8).
I tried using split:
data = line.strip().split(b';')
but that breaks on the delimiters inside quotes. I tried replacing those delimiters using:
line = re.sub(rb'(".+?);(.+?")', rb'\1 - \2', line)
but this fails when there are two or more delimiters inside the quotes.
I cannot use the csv module because it does not support reading in binary mode.
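import re
test_str = 'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
# Capture the content of every double-quoted field
regex = '"([^"]+)"'
data_list = re.findall(regex, test_str)
for data in data_list:
    # Remove each quoted field (and its trailing delimiter) from the string
    test_str = test_str.replace(f'"{data}";', "")
# Split what is left on the delimiter and append it to the quoted fields
data_list = data_list + test_str.split(';')
Here data_list would look like this (note that the original field order is not preserved): ['Da;ta;3;etc...', 'Data 1', 'Data 2', 'Data 4']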
I'm not sure I understood correctly, but if you want to split your string using " as a delimiter, it's as simple as:
line = 'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
my_array = line.split('"')
Which results in the following array:
['Data 1;Data 2;', 'Da;ta;3;etc...', ';Data 4']
Now if you want to split both by " and ; you can:
line = 'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
my_array = []
for entry in line.split('"'):
    my_array.extend(entry.split(';'))
Which results in the following array:
['Data 1', 'Data 2', '', 'Da', 'ta', '3', 'etc...', '', 'Data 4']
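Neither approach above keeps the quoted field intact while also preserving field order. As a minimal sketch, assuming Python 3 and that quotes only serve to protect delimiters (no escaped quotes inside a field), a small state machine over the raw bytes can do the split without decoding anything. The name split_quoted is hypothetical:

```python
def split_quoted(line, delimiter=b";", quote=b'"'):
    """Split a bytes line on `delimiter`, ignoring delimiters inside quotes.

    Quote characters are dropped from the output. Escaped quotes inside a
    field are NOT handled; that is an assumption of this sketch.
    """
    fields = []
    current = bytearray()
    in_quotes = False
    for i in range(len(line)):
        ch = line[i:i + 1]  # one-byte slice, so it compares equal to b";" or b'"'
        if ch == quote:
            in_quotes = not in_quotes  # entering or leaving a quoted field
        elif ch == delimiter and not in_quotes:
            fields.append(bytes(current))  # field boundary outside quotes
            current.clear()
        else:
            current += ch
    fields.append(bytes(current))  # last field has no trailing delimiter
    return fields


print(split_quoted(b'Data 1;Data 2;"Da;ta;3;etc...";Data 4'))
# [b'Data 1', b'Data 2', b'Da;ta;3;etc...', b'Data 4']
```

Because the function never decodes the input, bytes that are not valid UTF-8 pass through unchanged.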

Removing regular expressions from text string in a data-frame in R

I have a data set with 1000 rows of text containing the order descriptions of lamps. The data is full of inconsistent patterns, and although the solutions referenced below gave me some help, they do not solve the issue.
R remove multiple text strings in data frame
remove multiple patterns from text vector r
I want to remove all delimiters and also keep only the words present in the wordstoreplace vector.
I have tried removing the delimiters using lapply, and after that I created two vectors: "wordstoremove" and "wordstoreplace".
I am trying to apply str_remove_all() and str_replace_all(). The first function worked but the second did not.
Initially I tried a very naive approach, but it was too clumsy.
mydata_sample=data.frame(x=c("LAMP, FLUORESCENT;TYPE TUBE LIGHT, POWER 8 W, POTENTIAL 230 V, COLORWHITE, BASE G5, LENGTH 302.5 MM; P/N: 37755,Mnfr:SuryaREF: MODEL: FW/T5/33 GE 1/25,",
"LAMP, INCANDESCENT;TYPE HALOGEN, POWER 1 KW, POTENTIAL 230 V, COLORWHITE, BASE R7S; Make: Surya",
"BALLAST, LAMP; TYPE: ELECTROMAGNETIC, LAMP TYPE: TUBELIGHT/FLUORESCENT, POWER: 36/40 W, POTENTIAL: 240VAC 50HZ; LEGACY NO:22038 Make :Havells , Cat Ref No : LHB7904025",
"SWITCH,ELECTRICAL,TYPE:1 WCR WAY,VOLTAGE:230V,CURRENT RATED:10A,NUMBEROFPOLES:1P,ADDITIONAL INFORMATION:FOR SNAPMODULESWITCH",
"Brief Desc:HIGH PRES. SODIUM VAPOUR LAMP 250W/400WDetailed Desc:Purchase order text :Short Description :HIGH PRES. SODIUM VAPOURLAMP 250W/400W===============================Part No :SON-T 250W/400W===============================Additional Specification :HIGH PRESSURE SODIUM VAPOUR LAMPSON-T 250W/400W USED IN SURFACE INS SYSTEM TOP LIGHT"))
delimiters1=c('"',"\r\n",'-','=',';')
delimiters2=c('*',',',':')
library(dplyr)
library(stringr)
dat <- mydata_sample %>%
mutate(x1 = str_remove_all(x1, regex(str_c("\\b",delimiters1, "\\b", collapse = '|'), ignore_case = T)))
dat <- mydata_sample %>%
mutate(x1 = str_remove_all(x1, regex(str_c("\\b",delimiters2, "\\b", collapse = '|'), ignore_case = T)))
####Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
wordstoremove=c('Mnfr','MNFR',"VAPOURTYPEHIGH",'LHZZ07133099MNFR',"BJHF","BJOS",
"BGEMF","BJIR","LIGHTING","FFT","FOR","ACCOMMODATIONQUANTITY","Cat",
"Ref","No","Type","TYPE","QUANTITY","P/N")
wordstoreplace=c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ")
dat1 <- dat%>%
mutate(x1 = str_remove_all(x1, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
dat1=dat1 %>%
mutate(x1=str_replace_all(x1, wordstoreplace, 'Grade A'),ignore_case = T)
###Warning message:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length
The regex is failing because the special characters in the delimiters need to be escaped. See the differences here:
# orig delimiters1=c('"', "\r\n", '-', '=', ';')
delimiters1=c('\\"', "\r\n", '-', '\\=', ';')
# orig delimiters2=c('*', ',', ':')
delimiters2=c('\\*', ',', '\\:')
For str_replace_all(), the words need to be a single string separated by | rather than a vector of 12 patterns:
wordstoreplace <-
c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ") %>%
paste0(collapse = "|")
# "HAVELLS|Havells|Bajaj|BAJAJGrade A|PHILIPS|Philips|MAKEBAJAJ/CG|philips|Philips/Grade A/Grade A/CG/GEPurchase|CG|Bajaj|BAJAJ"
This then runs without throwing an error
dat1 <-
dat %>%
mutate(
x1 =
str_remove_all(x1, regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = T)),
x1 = str_replace_all(x1, wordstoreplace, "Grade A")
)

Trouble using UpdateCursor when setting the value

Using ArcDesktop 10.1 & Python 2.7:
I am working on a script that searches for values within 13 fields and, based on what it finds in those fields, concatenates a string and puts the result in an existing (empty) field.
It uses a search cursor to search the 13 fields, then uses the result in an update cursor to concatenate the string.
I am having trouble getting the result into the field using setValue (line 40 of the code below: urow.setValue(commentsField, easementType)). The error message is very unhelpful (RuntimeError: ERROR 999999: Error executing function.)
I am not sure how to correctly get the value set in the field desired. Any help would be greatly appreciated!
import arcpy, os, math
from itertools import izip
arcpy.env.workspace = "C:\\Users\\mdelgado\\Desktop\\WorkinDog.gdb"
#These are my variables
fc = "EASEMENTS"
commentsField = "Comments"
typeFields = ["ABANDONED", "ACCESS", "AERIAL", "BLANKET", "COMM", "DRAIN", "ELEC", "GEN_UTIL", "LANDSCAPE", "PARKING", "PIPELINE", "SAN_SEWR", "SIDEWALK", "SPECIAL", "STM_SEWR", "WATER"]
fieldNames = ["ABANDONED", "ACCESS", "AERIAL", "BLANKET", "COMMUNICATION", "DRAINAGE", "ELECTRIC", "GENERAL UTILITY", "LANDSCAPE", "PARKING", "PIPELINE", "SANITATION SEWER", "SIDEWALK", "SPECIAL", "STORM SEWER", "WATER"]
fieldValues = []
easementType = ""
#This is my search cursor
scursor = arcpy.SearchCursor(fc)
srow = scursor.next()
for field in typeFields:
srowValue = (srow.getValue(field))
fieldValues.append(srowValue)
srow = scursor.next()
print fieldValues
#This is my update cursor
ucursor = arcpy.UpdateCursor(fc)
for urow in ucursor:
#This is where I begin the loop to concatenate the comment field
for (value, name) in izip(fieldValues, fieldNames):
print str(value) + " " + name
#This is where I check each field to find out which types the easement is
if value == 1:
easementType = easementType + name + ", "
#This is where I format the final concatenated string
easementType = easementType[:-2]
print easementType
#This is where the field is updated with the final string using the cursor
urow.setValue(commentsField, easementType)
ucursor.updateRow(urow)
urow = cursor.next()
del urow
del ucursor
del srow
del scursor
The uninformative 999999 error is one of the worst.
I suggest a couple modifications to your approach that may make it simpler to troubleshoot. First, use the da Cursors -- they are faster, and the syntax is a little simpler.
Second, you don't need a separate Search and Update -- the Update can "search" other fields in the same row in addition to updating fields. (The current code, assuming it was working correctly, would be putting the same fieldValues into every row the UpdateCursor affected.)
fieldNames = ["ABANDONED", "ACCESS", "AERIAL", "BLANKET", "COMMUNICATION", "DRAINAGE",
"ELECTRIC", "GENERAL UTILITY", "LANDSCAPE", "PARKING", "PIPELINE", "SANITATION SEWER",
"SIDEWALK", "SPECIAL", "STORM SEWER", "WATER"]
cursorFields = ["ABANDONED", "ACCESS", "AERIAL", "BLANKET", "COMM", "DRAIN",
"ELEC", "GEN_UTIL", "LANDSCAPE", "PARKING", "PIPELINE", "SAN_SEWR",
"SIDEWALK", "SPECIAL", "STM_SEWR", "WATER", "Comments"]
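with arcpy.da.UpdateCursor(fc, cursorFields) as cursor:
    for row in cursor:
        easementType = ""
        # Loop over every type field; fieldNames and the type fields in
        # cursorFields are in the same order
        for x in range(len(fieldNames)):
            if row[x] == 1:
                easementType += fieldNames[x] + ", "
        # Drop the trailing ", "
        easementType = easementType[:-2]
        print easementType
        # "Comments" is the last field in cursorFields
        row[-1] = easementType
        cursor.updateRow(row)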
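The string-building step can also be tried outside ArcGIS. The following is a sketch only: build_comment, FIELD_NAMES (shortened here) and the sample flag lists are hypothetical, and plain lists stand in for cursor rows. It uses ", ".join(...), so no trailing separator has to be trimmed afterwards:

```python
# Shortened label list, for illustration only
FIELD_NAMES = ["ABANDONED", "ACCESS", "AERIAL", "BLANKET"]


def build_comment(flags, names=FIELD_NAMES):
    """Return the names whose flag equals 1, joined with ', '."""
    return ", ".join(name for flag, name in zip(flags, names) if flag == 1)


# Hypothetical rows: one 0/1 flag per easement type field
sample_rows = [[1, 0, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
for flags in sample_rows:
    print(repr(build_comment(flags)))
```

A row with no flags set yields an empty string, which matches what the [:-2] slice produces for an empty easementType.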

Removing words from a corpus of documents with a tailored list of words

The tm package has the ability to let the user 'prune' the words and punctuation in a corpus of documents:
tm_map( corpusDocs, removeWords, stopwords("english") )
Is there a way to supply tm_map with a tailored list of words that is read in from a csv file and used in place of stopwords("english")?
Thank you.
BSL
Let's take a file (wordMappings):
"from"|"to"
###Words######
"this"|"ThIs"
"is"|"Is"
"a"|"A"
"sample"|"SamPle"
First, removal of words:
readFile <- function(fileName, seperator) {
read.csv(paste0("data\\", fileName, ".txt"),
sep=seperator, #"\t",
quote = "\"",
comment.char = "#",
blank.lines.skip = TRUE,
stringsAsFactors = FALSE,
encoding = "UTF-8")
}
kelimeler <- c("this is a sample")
corpus = Corpus(VectorSource(kelimeler))
seperatorOfTokens <- ' '
words <- readFile("wordMappings", "|")
toSpace <- content_transformer(function(x, from) gsub(sprintf("(^|%s)%s(%s%s)", seperatorOfTokens, from,'$|', seperatorOfTokens, ')'), sprintf(" %s%s", ' ', seperatorOfTokens), x))
for (word in words$from) {
corpus <- tm_map(corpus, toSpace, word)
}
If you want a more flexible solution, for example replacing words rather than just removing them:
#Specific Transformations
toMyToken <- content_transformer( function(x, from, to)
gsub(sprintf("(^|%s)%s(%s%s)", seperatorOfTokens, from,'$|', seperatorOfTokens, ')'), sprintf(" %s%s", to, seperatorOfTokens), x))
for (i in seq(1:nrow(words))) {
print(sprintf("%s -> %s ", words$from[i], words$to[i]))
corpus <- tm_map(corpus, toMyToken, words$from[i], words$to[i])
}
Now a sample run:
[1] "this -> ThIs "
[1] "is -> Is "
[1] "a -> A "
[1] "sample -> SamPle "
> content(corpus[[1]])
[1] " ThIs Is A SamPle "
>
My solution, which may be cumbersome and inelegant:
#read in items to be removed
removalList = as.matrix( read.csv( listOfWordsAndPunc, header = FALSE ) )
#
#create document term matrix
termListing = colnames( corpusFileDocs_dtm )
#
#find intersection of terms in removalList and termListing
commonWords = intersect( removalList, termListing )
removalIndxs = match( commonWords, termListing )
#
#create m for term frequency, etc.
m = as.matrix( corpusFileDocs_dtm )
#
#use removalIndxs to drop irrelevant columns from m
allColIndxs = 1 : length( termListing )
keepColIndxs = setdiff( allColIndxs, removalIndxs )
m = m[ ,keepColIndxs ]
#
#thence to tf-idf analysis with revised m
Any stylistic or computational suggestions for improvement are gratefully sought.
BSL

regexp parsing in matlab

I have a cell array 3x1 like this:
name1 = text1
name2 = text2
name3 = text3
and I want to parse it into separate 1x2 cells, for example name1, text1. Later I want to treat text1 as a string to compare with other strings. How can I do it? I have tried regexp with tokens, but I cannot write a proper expression for it. If someone can help me with it, I will be grateful!
This code
input = {'name1 = text1';
         'name2 = text2';
         'name3 = text3'};
result = cell(size(input, 1), 2);
for row = 1 : size(input, 1)
    tokens = regexp(input{row}, '(.*)=(.*)', 'tokens');
    if ~isempty(tokens)
        result(row, :) = tokens{1};
    end
end
produces the outcome
result =
'name1 ' ' text1'
'name2 ' ' text2'
'name3 ' ' text3'
Note that the whitespace around the equals sign is preserved. You can change this behaviour by adjusting the regular expression; for example, '([^\s]+) *= *([^\s]+)' gives
result =
'name1' 'text1'
'name2' 'text2'
'name3' 'text3'
Edit: Based on the comments by user1578163.
MATLAB also supports lazy (non-greedy) quantifiers. For example, the regexp '(.*?) *= *(.*)' (note the question mark after the asterisk) works if the text contains spaces. It will transform
input = {'my name1 = any text1';
'your name2 = more text2';
'her name3 = another text3'};
into
result =
'my name1' 'any text1'
'your name2' 'more text2'
'her name3' 'another text3'
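The greedy versus lazy distinction is not specific to MATLAB. As a rough illustration in Python's re module (a different regex engine, shown only for comparison), a greedy and a lazy first group behave as follows on the same input:

```python
import re

lines = ["my name1 = any text1",
         "your name2 = more text2",
         "her name3 = another text3"]

# Greedy first group: grabs as much as it can, including the space before "="
greedy = re.compile(r"(.*) *= *(.*)")
# Lazy first group: stops as early as possible, so " *" consumes the spaces
lazy = re.compile(r"(.*?) *= *(.*)")

greedy_pairs = [greedy.match(s).groups() for s in lines]
lazy_pairs = [lazy.match(s).groups() for s in lines]

print(greedy_pairs[0])  # ('my name1 ', 'any text1')
print(lazy_pairs[0])    # ('my name1', 'any text1')
```

With the greedy group the trailing space stays attached to the name, which is why the lazy form is the one that yields clean strings for later comparison.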