I have a string like this: "\n\n\some text\t goes here. some\t\t other text goes here\b\n\n\n".
What I want: "some text goes here. some other text goes here."
Here is what I am doing: re.sub('[^A-Za-z0-9]+', ' ', s)
The problem is that this removes all the punctuation as well. How do I retain it?
Here's a solution that finds all the escape characters in your string and then removes them.
import re
r = repr(s) # Convert escape sequences to literal backslashes
r = r[1:-1] # Remove the quote characters introduced by `repr`
escapes = set(re.findall(r'\\\w\d*', r)) # Get escape chars
answer = re.sub('|'.join(map(re.escape, escapes)), '', r) # Remove them
# answer is now: \ome text goes here. some other text goes here
# (the stray "\s" in "\some" is matched as an escape too, leaving a lone backslash)
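An alternative that avoids `repr` is to target control characters directly, so punctuation is never touched. The sketch below is not from the answer above: the helper name `strip_control_chars` is hypothetical, and the sample string leaves out the stray `\s` from the question on the assumption that it was a typo.

```python
import re

def strip_control_chars(text):
    """Replace ASCII control characters with spaces, then tidy the spacing."""
    # \x00-\x1f covers \n, \t, \b and similar; \x7f is DEL.
    spaced = re.sub(r"[\x00-\x1f\x7f]+", " ", text)
    # Collapse the runs of spaces left behind and trim both ends.
    return re.sub(r" {2,}", " ", spaced).strip()

# Assumption: same string as in the question, minus the stray "\s".
s = "\n\nsome text\t goes here. some\t\t other text goes here\b\n\n\n"
print(strip_control_chars(s))
```

Because only control characters are replaced, periods, commas, and other punctuation pass through unchanged.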
How can I write a Scala regular expression that captures everything between two double quotes (including escaped quotation marks)?
My aim is to find the first (unescaped) quotation mark (which is part of the string), find the paired (unescaped) quotation mark (which is also part of the string), then extract everything between them.
I'm expecting something like this:
"""??""".r findFirstMatchIn(""""abcdef\"abc"""") // Note that the real string starts at the fourth quotation mark, i.e. the real string is "abcdef\"abc"
res = Some(abcdef\"abc)
"""??""".r findFirstMatchIn(""""abcdef\"abc\t\t"""")
res = Some(abcdef\"abc\t\t)
"""??""".r findFirstMatchIn(""""abcdef\"abc\t\"\t"""")
res = Some(abcdef\"abc\t\"\t)
I tried something like """([^\"])*([\\\\]+[\"tnbr/])+([^\"]*)*""".r, but it does not work for the string "abcdef\"abc\t\"\t"
Any hints are appreciated.
edit:
My intention is to extract every character between paired double quotes:
"abc" => abc
"abc\n" => abc\n
"\t\n" => \t\n
"\\" => \\
"\" => Invalid (so it will never happen): the second quotation mark is escaped, hence the double quotes are not paired
"abc\" => abc\"
"hello\\"world\"" => Also invalid (so it will never happen): the \ is escaped, so the quotation marks are not properly escaped
"hello\\\"world\\\"" => hello\\\"world\\\"
The escaped characters can be:
\" \\ \n \t \b \r \f \/
Anything else is just plain text.
edit:
My string is JSON-style, like:
"abc": "value"
or "abc\t\n\"def": "value"
and my aim is to extract abc or abc\t\n\"def before the colon.
So, to summarize:
My aim is to find the first (unescaped) quotation mark (which is part of the string), find the paired (unescaped) quotation mark (which is also part of the string), then extract everything between them.
Try
"((?:[^"\\]|\\[\\"ntbrf])+)"
Demo: regex101
In Scala code:
val regex = """"((?:[^"\\]|\\[\\"ntbrf])+)"""".r
val examples = List(
  """"abc"""",
  """"abc\n"""",
  """"\t\n"""",
  """"\\"""",
  """"abc\""""",
  """"hello\\\"world\""""",
  """"hello\\\"world\\\""""",
  """"abc": """,
  """"value" """,
  """or "abc\t\n\"def"""",
  """: "value"""",
  """abc"def\"abc"""",
  """abc"def\"abc\t\t"""",
  """abc"def\"abc\t\"\t""""
)
for (e <- examples) {
  // .get is safe here only because every example contains a match
  println(regex.findFirstMatchIn(e).get.group(1))
}
Output:
abc
abc\n
\t\n
\\
abc\"
hello\\\"world\"
hello\\\"world\\\"
abc
value
abc\t\n\"def
value
def\"abc
def\"abc\t\t
def\"abc\t\"\t
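For comparison, the same pattern can be sketched with Python's `re` module. This is an illustration rather than part of the answer: the helper name `first_quoted` is hypothetical, and `/` has been added to the escape set on the assumption that the `\/` listed in the question should also be accepted.

```python
import re

# Either a character that is neither a quote nor a backslash,
# or a backslash followed by one of the allowed escape letters.
# Assumption: "/" is included because the question lists \/ as an escape.
QUOTED = re.compile(r'"((?:[^"\\]|\\[\\"/ntbrf])+)"')

def first_quoted(text):
    """Return the contents of the first properly quoted string, or None."""
    match = QUOTED.search(text)
    return match.group(1) if match else None

samples = [
    r'"abc"',
    r'"abc\n"',
    r'"hello\\\"world\\\""',
    r'"abc\t\n\"def": "value"',
    r'"a\/b"',
]
for sample in samples:
    print(first_quoted(sample))
```

The two alternatives inside the group never match the same character, so the engine has only one way to consume each position and does not backtrack heavily on long inputs.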
I just use
"""([^"\]|\"|\t|\n|\b|\r|\/|\f)*""".r
and it seems to work.
Thank you.
I would like to replace some escaped characters in a given text. Here is what I've tried.
import re

_RE_SPECIAL_CHARS = re.compile(r"(?:[^#\\]|\\.)+#")
text = r"ok#\#.py"
search = re.search(_RE_SPECIAL_CHARS, text)
print(text)
if search:
    print(_RE_SPECIAL_CHARS.sub("<star>", text))
else:
    print('<< NOTHING FOUND ! >>')
This prints:
ok#\#.py
<star>\#.py
What I need to have instead is ok<star>\#.py.
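You can use a lookbehind and match only the special character. Note that Python's built-in re module rejects this pattern because the two lookbehind alternatives have different widths; it needs an engine that allows that, such as PCRE or the third-party regex module:
re.compile(r"(?<=[^#\\]|\\.)#")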
See DEMO
Or you can capture the part before # in group 1 and replace with \1<star>
re.compile(r"((?:[^#\\]|\\.)+)#")
and
print(_RE_SPECIAL_CHARS.sub(r"\1<star>", text))  # raw string: a plain "\1" would be the character chr(1)
See DEMO
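Put together, the capture-group approach can be sketched as a complete script. The second option, which uses a replacement function, is an alternative that does not come from the answer; the names `with_group`, `escaped_or_hash`, and `replace_bare_hash` are hypothetical.

```python
import re

text = r"ok#\#.py"

# Option 1: capture what precedes the "#" and put it back with \1.
# The replacement is a raw string so that \1 is a group reference.
with_group = re.compile(r"((?:[^#\\]|\\.)+)#")
print(with_group.sub(r"\1<star>", text))

# Option 2 (alternative, not from the answer): match either an escaped
# pair or a bare "#", keep the escaped pair and replace only the bare "#".
def replace_bare_hash(match):
    token = match.group(0)
    return "<star>" if token == "#" else token

escaped_or_hash = re.compile(r"\\.|#")
print(escaped_or_hash.sub(replace_bare_hash, text))
```

Both calls print `ok<star>\#.py`. Option 2 also replaces a `#` at the very start of the string, which option 1 does not, because its group requires at least one preceding character.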
I have multiple lines in some text files such as
.model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2
I want to extract the text between the single quotes in MATLAB.
Much help would be appreciated.
To get all of the text inside multiple '' blocks, regexp can be used as follows:
regexp(txt,'''(.[^'']*)''','tokens')
This captures the text between ' characters, where the captured text itself contains no '. For example, consider this text with two lines (I made up a different file name for the second line):
txt = ['.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2 ', char(10), ...
'.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'' passive=2']
>> stringCell = regexp(txt,'''(.[^'']*)''','tokens');
>> stringCell{:}
ans =
'../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'
ans =
'../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'
>>
Trivia:
char(10) gives a newline character because 10 is the ASCII code for newline.
The . character in a regexp pattern (regex in the rest of the coding world) usually does not match a newline, which would make this a safer pattern. In MATLAB, a dot in regexp does match a newline, so to disable this we could add 'dotexceptnewline' as the last input argument to regexp. This is a convenient way to ensure we don't get the text outside of the quotes instead, but it is not needed here since the first match sets the precedent.
Instead of excluding a ' from the match with [^''], the match can be made non-greedy with ?, as follows: regexp(txt,'''(.*?)''','tokens').
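The difference between the negated class, the non-greedy match, and a plain greedy match can be illustrated outside MATLAB as well. The sketch below uses Python's `re` purely for comparison (it is not MATLAB code); `re.DOTALL` is passed to the greedy pattern to imitate MATLAB's default of letting the dot match newlines.

```python
import re

txt = (
    ".model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2\n"
    ".model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p' passive=2"
)

# Negated class: the capture can never run past a closing quote.
negated = re.findall(r"'([^']*)'", txt)
# Non-greedy: stops at the first closing quote it can reach.
lazy = re.findall(r"'(.*?)'", txt)
# Greedy with dot matching newlines: runs from the first quote to the last.
greedy = re.findall(r"'(.*)'", txt, flags=re.DOTALL)

print(negated)
print(lazy == negated)
print(len(greedy))
```

The first two patterns each return the two file paths, while the greedy one returns a single match that spans both lines.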
If you plan to use textscan:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','''');
fclose(fid);
output = rawdata{:}(2)
As in the other answers, the single apostrophe ' is represented by a doubled one, '', e.g. for delimiters.
Considering the comment:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','\n');
fclose(fid);
lines = rawdata{1,1};
L = size(lines,1);
output = cell(L,1);
for ii=1:L
    temp = textscan(lines{ii},'%s','delimiter','''');
    output{ii,1} = temp{:}(2);
end
One easy way is to split the string with single quote delimiter and take the even-numbered strings in the output:
str = fileread('test.txt');
out = regexp(str, '''', 'split');
out = out(2:2:end);
You can do this using regular expressions. Assuming that there is only one occurrence of text between quotation marks:
% select all chars between single quotation marks.
out = regexp(inputString,'''(.*)''','tokens','once');
After identifying which lines you want to extract info from, you could tokenize them or, if they all have the same form, do something like this:
test='.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2';
a=strfind(test,'''') % indices of the single quotes
test=test(a(1)+1:a(2)-1) % text between the first two quotes, excluding the quotes themselves
I am using the following code to remove everything other than letters, digits, question marks, exclamation points, periods, parentheses, commas, and hyphens:
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9]", ""))
I get this: hellotoyousMy#is4425235584
It should read like so: hello to yous! My # is (442) 523-5584.?,
Simply add all characters to your negated character class (take note of the space character!):
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9 ?!.(),#-]+", ""))
(I also added a repeating + to your regex, so it can replace consecutive disallowed characters in one go)
Add a space and other symbols in the regex:
MsgBox(System.Text.RegularExpressions.Regex.Replace("hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~", "[^A-Za-z0-9 \(\)\!\.,\-\?]", ""))
Regex.Replace("your text", "[^A-Za-z0-9 ?!.(),-]+", "")
The pattern [^A-Za-z0-9 ?!.(),-]+ matches each run of consecutive unwanted characters and replaces it with ""
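The whitelist idea is not specific to .NET. As a cross-check, here is a minimal sketch in Python's `re`, using the character class from the first answer (the one that also allows `#`); the variable names are hypothetical.

```python
import re

text = r"hello to you's! My # is (442) 523-5584. #$%^*<>{}[]\|/?,+-=:;`~"

# Everything NOT listed inside [^...] is removed. The hyphen comes last
# in the class so it is taken literally instead of forming a range.
allowed_only = re.compile(r"[^A-Za-z0-9 ?!.(),#-]+")
cleaned = allowed_only.sub("", text)
print(cleaned)
```

With this input the result is `hello to yous! My # is (442) 523-5584. #?,-`, so the whitelisted `#` and `-` in the trailing run of symbols are kept as well.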
I want to clean up a string that contains escaped quotation marks. I want to remove the escaped quotation marks at the beginning and end of the string but keep all quotation marks within the string intact. What I came up with is the following.
library(stringr)
s1 <- "\"He said:\"Hello\" - some word\""
str_replace_all(s1, "(^\\\")|(\\\"$)", "")
> [1] "He said:\"Hello\" - some word"
What I am struggling with now is that I want to remove the quotation marks only if there is one at the beginning AND one at the end, and otherwise leave the string alone. The following expression wrongly removes the leading one.
s2 <- "\"Hello!\" he said"
str_replace_all(s2, "(^\\\")|(\\\"$)", "")
> [1] "Hello!\" he said"
Here my regex should indicate that I only want to remove them in case the whole string is wrapped in escaped quotation marks. How can I do that?
The following regex seems to work on your examples:
s <- c("\"He said:\"Hello\" - some word\"", "\"Hello!\" he said")
The regex uses back-references (\\1) to return only the string inside the leading quote ^\" and the trailing quote \"$:
r <- gsub("^\"(.*)\"$", "\\1", s)
This results in:
cat(r, sep="\n")
He said:"Hello" - some word
"Hello!" he said