How to separate values from a string using quotes as an escape sequence? - regex

For example, given this text:
'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
How can I separate it into array values like Data 1, Da;ta;3;etc..., Data 4, and so on? There is an unknown number of ; characters inside the quotes, and the content can contain arbitrary binary (non-UTF-8) bytes.
I tried using a split:
data = line.strip().split(b';')
but that breaks on the delimiters inside the quotes. I also tried replacing the quoted delimiters:
line = re.sub(rb'(".+?);(.+?")', rb'\1 - \2', line)
but that fails when there are two or more delimiters inside one quoted section.
I can't use the csv module, because csv does not support reading in binary mode.
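One caveat on the csv limitation: if binary input is the only blocker, the csv module can still be used by decoding with latin-1 first, since that codec maps every byte 0-255 to a single code point losslessly. A minimal sketch, assuming ';' as the delimiter and '"' as the quote character:

```python
import csv
import io

raw = b'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
# latin-1 round-trips arbitrary bytes: decode -> parse -> re-encode
text = raw.decode('latin-1')
row = next(csv.reader(io.StringIO(text), delimiter=';', quotechar='"'))
fields = [f.encode('latin-1') for f in row]
# → [b'Data 1', b'Data 2', b'Da;ta;3;etc...', b'Data 4']
```

This keeps the quote-aware parsing in csv instead of a hand-rolled regex.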

import re
test_str = 'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
regex = r'"([^"]+)"'
data_list = re.findall(regex, test_str)
for data in data_list:
    test_str = test_str.replace(f'"{data}";', '')
data_list = data_list + test_str.split(';')
Here data_list would look like this: ['Da;ta;3;etc...', 'Data 1', 'Data 2', 'Data 4']

I'm not sure I understood correctly, but if you want to split your string using " as a delimiter, it's as simple as:
line = 'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
my_array = line.split('"')
Which results in the following array:
['Data 1;Data 2;', 'Da;ta;3;etc...', ';Data 4']
Now if you want to split both by " and ; you can:
line = 'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
my_array = []
for entry in line.split('"'):
    my_array.extend(entry.split(';'))
Which results in the following array:
['Data 1', 'Data 2', '', 'Da', 'ta', '3', 'etc...', '', 'Data 4']
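Both split-based answers lose track of which fields were quoted. A regex alternative that handles quoted sections directly, and works on bytes (which matters for the non-UTF-8 requirement in the question), is a sketch like:

```python
import re

line = b'Data 1;Data 2;"Da;ta;3;etc...";Data 4'
# Each match is either a quoted field (group 1, quotes stripped)
# or a run of non-delimiter bytes (group 2)
fields = [m.group(1) if m.group(1) is not None else m.group(2)
          for m in re.finditer(rb'"([^"]*)"|([^;]+)', line)]
# → [b'Data 1', b'Data 2', b'Da;ta;3;etc...', b'Data 4']
```

One limitation to be aware of: completely empty unquoted fields (';;') produce no match and are dropped.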


Control text wrapping and hyphenation in kable tables

I have an Rmarkdown document with embedded tables but I have trouble understanding the underlying rules for text wrapping and hyphenation for the table contents. Searching through stackoverflow and other resources hasn't provided a lot of insight.
An example is provided below, the columns widths specified are only necessary in the example to reproduce the problem I have with the real table. After some trial and error, I was able to get the last column header to hyphenate by entering it as " Manufacturer " but this trick does not work in the rows below that header. Additional examples of problems with text in cells either getting cut off or spilling into adjacent cells are shown in the third column (Result) and the formatting of cell entries is displayed in the second column. I've added a border between the third and fourth columns to highlight the problems. The real table has 8 columns and I've adjusted those column widths as much as possible while preserving readability.
---
title: 'Table_7_problem'
fontsize: 11pt
output:
  bookdown::pdf_document2:
    toc: false
    number_sections: false
    latex_engine: xelatex
    tables: yes
header-includes:
  - \usepackage{booktabs}
  - \usepackage{longtable}
  - \usepackage{colortbl} # to set row stripe colors
  - \usepackage{tabu}
  - \setlength{\tabcolsep}{1pt}
---
```{r setup, echo = FALSE, cache = FALSE, warning = FALSE, message = FALSE}
library(knitr)
```
# Table 7: Appliance durability
This table contains fictional data.
```{r table7, echo = FALSE, cache = FALSE, warning = FALSE, message = FALSE}
table7 <- data.frame(
  Column_1 = c('Very long string #1 that requires a wide column to accommodate and maintain readability',
               'Very long string #2... and more of the same down rows for this column...',
               'Very long string #3',
               'Very long string #4',
               'Very long string #5',
               'Very long string #6',
               'Very long string #7'),
  Column_2 = c('"SampleText"',
               '"Sample Text"',
               '" SampleText"',
               '"SampleText "',
               '" SampleText "',
               '"SampleText #2"',
               '"Sample Text #2"'),
  Column_3 = c('SampleText',
               'Sample Text',
               ' SampleText',
               'SampleText ',
               ' SampleText ',
               'SampleText #2',
               'Sample Text #2"'),
  Column_4 = c('Manufacturer',
               ' Manufacturer',
               'Manufacturer ',
               ' Manufacturer ',
               ' LongManufacturerName',
               'Long_Manufacturer_Name',
               "Long Manufacturer Name")
)
###
colnames(table7) <- c("Name", "Cell Content Format", "Result", " Manufacturer ")
library(kableExtra)
table7 %>%
kbl(longtable = TRUE, align = "lllc", booktabs = TRUE) %>%
kable_styling(full_width = FALSE, font_size = 8, latex_options = c("repeat_header", "striped"), stripe_color = "gray!15", repeat_header_text = "Table 7 \\textit{continued...}") %>%
row_spec(0, bold = TRUE) %>%
column_spec(1, width = "1.5in") %>%
column_spec(2, width = "3.825in") %>%
column_spec(3, width = "0.5in") %>%
column_spec(4, width = "0.45in", border_left = TRUE)
```
The above code produces this:
Any advice or solutions on how to control the hyphenation and word wrapping to resolve these problems?
*** UPDATE 2022-09-07
Updating the status: I've explored several packages for making the table, and so far none does everything I was looking for, but the flextable package does most of what I wanted. The updated code and PDF result are shown below. It may not be pretty, but it gets the job done. Some conflicts seem to arise when piping the formatting commands, yet they work fine when entered one at a time, which is why there are multiple t7 <- ... statements (I played around with much more elaborate formatting, and the same strategy of individual statements worked).
table7 <- data.frame(
  Column_1 = c('Very long string #1 that requires a wide column to accommodate and maintain readability',
               'Very long string #2... and more of the same down rows for this column...',
               'Very long string #3',
               'Very long string #4',
               'Very long string #5',
               'Very long string #6',
               'Very long string #7'),
  Column_2 = c('"SampleText"',
               '"Sample Text"',
               '" SampleText"',
               '"SampleText "',
               '" SampleText "',
               '"SampleText #2"',
               '"Sample Text #2"'),
  Column_3 = c('SampleText',
               'Sample Text',
               ' SampleText',
               'SampleText ',
               ' SampleText ',
               'SampleText #2',
               'Sample Text #2"'),
  Column_4 = c('Manufacturer',
               ' Manufacturer',
               'Manufacturer ',
               ' Manufacturer ',
               ' LongManufacturerName',
               'Long_Manufacturer_Name',
               "Long Manufacturer Name")
)
###
colnames(table7) <- c("Name", "Cell Content Format", "Result", "Manu-\nfacturer")
library(flextable)
library(stringr)
set_flextable_defaults(
  font.family = gdtools::match_family(font = "Serif"),
  font.size = 8,
  padding = 3)
table7$`Manu-\nfacturer` <- str_replace(string = table7$`Manu-\nfacturer`, pattern = 'Manufacturer', replacement = 'Manu-\nfacturer')
t7 <- table7 %>% flextable() %>%
  width(., width = c(1.5, 3.825, 0.5, 0.45), unit = "in") %>%
  #add_header_lines(., values = "Table 7") %>%
  theme_zebra(.)
t7 <- hline(t7, i = 1, border = officer::fp_border(color = "black"), part = "header")
t7 <- flextable::align(t7, i = 1, j = 1, align = "left", part = "header")
t7
The above generates the figure below. The str_replace strategy suggested by @Julian achieves the hyphenation and wrapping, and theme_zebra() in flextable preserves the row striping.
What you can do is add line breaks and pass escape = FALSE to your kable function. Note that you need to escape #, _, etc. as well.
table7 <- data.frame(
  Column_1 = c('Very long string 1 that requires a wide column to accommodate and maintain readability',
               'Very long string 2... and more of the same down rows for this column...',
               'Very long string 3',
               'Very long string 4',
               'Very long string 5',
               'Very long string 6',
               'Very long string 7'),
  Column_2 = c('"SampleText"',
               '"Sample Text"',
               '" SampleText"',
               '"SampleText "',
               '" SampleText "',
               '"SampleText 2"',
               '"Sample Text 2"'),
  Column_3 = c('Sample\nText',
               'Sample\n Text',
               ' Sample\nText',
               'Sample\nText ',
               ' Sample\nText ',
               'Sample\nText 2',
               'Sample \nText 2"'),
  Column_4 = c('Manu\nfacturer',
               ' Manu\nfacturer',
               'Manu\nfacturer ',
               ' Manu\nfacturer ',
               ' Long\nManufacturer\nName',
               'Long\nManufacturer\nName',
               "Long\n Manufacturer\n Name")
)

regex in Python to remove commas and spaces

I have a string with multiple commas and spaces as delimiters between words. Here are some examples:
ex #1: string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
ex #2: string = 'word1 word2 word3'
ex #3: string = 'word1,word2,word3,'
I want to use a regex to convert any of the above 3 examples to "word1, word2, word3" (note: no comma after the last word in the result).
I used the following code:
import re
input_col = 'word1 , word2 , word3, '
test_string = ''.join(input_col)
test_string = re.sub(r'[,\s]+', ' ', test_string)
test_string = re.sub(' +', ',', test_string)
print(test_string)
I get the output as "word1,word2,word3,". Whereas I actually want "word1, word2, word3". No comma after word3.
What kind of regex and re methods should I use to achieve this?
You can use re.split to create an array and then filter out the empty strings:
import re
s = 'word1 , word2 , word3, '
r = re.split(r"[^a-zA-Z\d]+", s)
ans = ', '.join([i for i in r if len(i) > 0])
How about adding the following statement to the end of your program:
test_string = re.sub(r',+$', '', test_string)
which removes the comma at the end of the string.
One approach is to first split on an appropriate pattern, then join the resulting array by comma:
string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
parts = re.split(r"[,\s]+", string)
sep = ','
output = re.sub(r',$', '', sep.join(parts))
print(output)
word1,word2,word3
Note the final call to re.sub, which removes a possible trailing comma.
You can simply use [ ]+ to detect runs of extra spaces and ,\s*$ to detect the last comma. Then substitute [ ]+,[ ]+ with ", " and the last comma with an empty string:
import re
input_col = 'word1 , word2 , word3, '
test_string = re.sub(r'[ ]+,[ ]+', ', ', input_col)  # remove extra spaces
test_string = re.sub(r',\s*$', '', test_string)      # remove last comma
print(test_string)
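For completeness: all three example inputs from the question can be normalized to exactly "word1, word2, word3" in one pass by stripping edge separators first, then splitting on any run of commas and whitespace. A sketch:

```python
import re

# strip leading/trailing commas and spaces, split on separator runs,
# and rejoin with a uniform ", "
results = [', '.join(re.split(r'[,\s]+', s.strip(', ')))
           for s in ['word1,,,,,,, word2,,,,,, word3,,,,,,',
                     'word1 word2 word3',
                     'word1,word2,word3,']]
# → ['word1, word2, word3', 'word1, word2, word3', 'word1, word2, word3']
```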

Retaining punctuation in a word

How can I remove punctuation from a line, but retain the punctuation inside words, using re?
For example:
Input = "Hello!!!, i don't like to 'some String' .... isn't"
Output = ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
I am trying to do this:
re.sub(r'\W+', ' ', myLine.lower()).split()
But this is splitting the words like "don't" into don and t.
You can use lookarounds in your regex:
>>> input = "Hello!!!, i didn''''t don't like to 'some String' .... isn't"
>>> regex = r'\W+(?!\S*[a-z])|(?<!\S)\W+'
>>> print re.sub(regex, '', input, 0, re.IGNORECASE).split()
['Hello', 'i', "didn''''t", "don't", 'like', 'to', 'some', 'String', "isn't"]
RegEx Demo
\W+(?!\S*[a-z])|(?<!\S)\W+ matches one or more non-word characters either when no letter follows them in the same token, or when they stand at the start of a token (not preceded by a non-space character).
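Another option, if the goal is the lowercased output shown in the question, is to match the words instead of removing the separators. A sketch, assuming a word is a run of letters optionally continued by apostrophe-letter groups:

```python
import re

line = "Hello!!!, i don't like to 'some String' .... isn't"
# [a-z]+(?:'[a-z]+)* keeps internal apostrophes (don't, isn't)
# but drops the quotes around 'some String' and all other punctuation
words = re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower())
# → ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
```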

Python 2.7 : Remove elements from a multidimensional list

Basically, I have a 3-dimensional list (a list of tokens, where the first dimension is the text, the second the sentence, and the third the word).
Addressing an element of the list (let's call it mat) can be done, for example, as:
mat[2][3][4]. That would give us the fifth word of the fourth sentence in the third text.
But, some of the words are just symbols like '.' or ',' or '?'. I need to remove all of them. I thought to do that with a procedure:
def removePunc(mat):
    newMat = []
    newText = []
    newSentence = []
    for text in mat:
        for sentence in text:
            for word in sentence:
                if word not in " !@#$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Now, when I try to use that:
finalMat = removePunc(mat)
it is giving me the same list (mat is a 3 dimensional list). My idea was to iterate over the list and remove only the 'words' which are actually punctuation symbols.
I don't know what I am doing wrong but surely there is a simple logical mistake.
Edit: I need to keep the structure of the array. So, words of the same sentence should still be in the same sentence (just without the 'punctuation symbol' words). Example:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
after the changes should be:
a = [[['as'], ['w']], [['asas', '23'], ['h']]]
Thanks for reading and/or giving me a reply.
I would suspect that your data are not organized as you think they are. And although I am usually not one to propose regular expressions, I think in your case they may be among the best solutions.
I would also suggest that, instead of eliminating non-alphabetic characters from words, you process whole sentences:
>>> import re
>>> non_word = re.compile(r'\W+') # If your sentences may
>>> sentence = '''The formatting sucks, but the only change that I've made to your code was shortening the "symbols" string to one character. The only issue that I can identify is either with the "symbols" string (though it looks like all chars in it are properly escaped) that you used, or the punctuation is not actually separate words'''
>>> words = re.split(non_word, sentence)
>>> words
['The', 'formatting', 'sucks', 'but', 'the', 'only', 'change', 'that', 'I', 've', 'made', 'to', 'your', 'code', 'was', 'shortening', 'the', 'symbols', 'string', 'to', 'one', 'character', 'The', 'only', 'issue', 'that', 'I', 'can', 'identify', 'is', 'either', 'with', 'the', 'symbols', 'string', 'though', 'it', 'looks', 'like', 'all', 'chars', 'in', 'it', 'are', 'properly', 'escaped', 'that', 'you', 'used', 'or', 'the', 'punctuation', 'is', 'not', 'actually', 'separate', 'words']
>>>
The code you wrote seems solid and it looks like "it should work", but only if this:
But, some of the words are just symbols like '.' or ',' or '?'
is actually fulfilled.
I would actually expect the symbols to not be separate from words, so instead of:
["Are", "you", "sure", "?"] #example sentence
you would rather have:
["Are", "you", "sure?"] #example sentence
If this is the case, you would need to go along the lines of:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                newWord = ""
                for char in word:
                    if char not in " !@#$%^&*()-_+={}[]|\\:;'<>?,./\"":
                        newWord += char
                if newWord:
                    newSentence.append(newWord)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Finally, I found it. As expected, it was a very small logical mistake that was always there, but I couldn't see it. Here is the working solution:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                if word not in " !@#$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
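The working solution can also be written more compactly with nested list comprehensions, which keep each level's fresh list implicit in the expression. A sketch of the same filtering logic (remove_punc is just an illustrative name):

```python
def remove_punc(mat):
    # same substring membership test as the loop version:
    # single-symbol "words" are contained in this string and get dropped
    punct = " !@#$%^&*()-_+={}[]|\\:;'<>?,./\""
    return [[[word for word in sentence if word not in punct]
             for sentence in text]
            for text in mat]

a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
print(remove_punc(a))
# → [[['as'], ['w']], [['asas', '23'], ['h']]]
```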

regexp parsing in matlab

I have a cell array 3x1 like this:
name1 = text1
name2 = text2
name3 = text3
and I want to parse it into separate 1x2 cells, for example name1, text1. Later I want to treat text1 as a string to compare with other strings. How can I do it? I am trying with regexp and tokens, but I cannot write a proper expression for it. If someone can help me, I will be grateful!
This code
input = {'name1 = text1';
         'name2 = text2';
         'name3 = text3'};
result = cell(size(input, 1), 2);
for row = 1 : size(input, 1)
    tokens = regexp(input{row}, '(.*)=(.*)', 'tokens');
    if ~isempty(tokens)
        result(row, :) = tokens{1};
    end
end
produces the outcome
result =
'name1 ' ' text1'
'name2 ' ' text2'
'name3 ' ' text3'
Note that the whitespace around the equal sign is preserved. You can modify this behaviour by adjusting the regular expression, e.g. also try '([^\s]+) *= *([^\s]+)' giving
result =
'name1' 'text1'
'name2' 'text2'
'name3' 'text3'
Edit: Based on the comments by user1578163.
Matlab also supports less-greedy quantifiers. For example, the regexp '(.*?) *= *(.*)' (note the question mark after the asterisk) works, if the text contains spaces. It will transform
input = {'my name1 = any text1';
         'your name2 = more text2';
         'her name3 = another text3'};
into
result =
'my name1' 'any text1'
'your name2' 'more text2'
'her name3' 'another text3'