I have a text file where lines are terminated by newline characters (\n) and paragraphs are separated by double newlines (\n\n).
I want to strip out those single newlines and replace them with single spaces. But I do not want the double newlines affected.
I thought something like one of these would work:
(?!\n\n)\n
\n{1}
\n{1,1}
But no luck. Everything I try inevitably ends up affecting those double newlines too. How can I write a regex that effectively "ignores" the \n\n but captures the \n?
You can search using this regex:
(.)\n(?!\n)
And replace it with:
"\1 "
RegEx Breakup:
.\n: Match any character followed by a line break
(?!\n): Negative lookahead to assert that we don't have a line break at next position. We match one character before matching \n to make sure we don't match an empty line. Also note that this character is being captured in capture group #1. This will match all single line breaks but will skip double line breaks.
\1 : is replacement to append a space after first capture group
Python Code:
import re
# `input` is assumed to hold the text to process
repl = re.sub(r'(.)\n(?!\n)', r'\1 ', input)
print(repl)
JavaScript Code:
repl = input.replace(/(.)\n(?!\n)/g, '$1 ')
console.log(repl)
You'll need a negative lookahead and a negative lookbehind. /(?<!\n)\n(?!\n)/g would probably work off the top of my head.
That said, be aware that browser support for lookbehinds is somewhat spotty. It's gotten better since I last checked, but Safari and IE don't support it at all.
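In a flavor that does support lookbehind, such as Python's re module, the lookbehind/lookahead pattern can be tried out with a minimal sketch like this one. The function name join_single_newlines and the sample text are made up for illustration.

```python
import re

def join_single_newlines(text):
    # Replace a newline only when it is neither preceded
    # nor followed by another newline.
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

sample = "line one\nline two\n\nnew paragraph\nsame paragraph"
print(join_single_newlines(sample))
```

Runs of two or more newlines are left untouched, because every newline in such a run has another newline on at least one side.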
I thought of a simple way to do this. It may not be the right way from a regex point of view, but it is a workaround.
import re
sample = """This is a sentence in para1.
this is also a sentence in para1
The begining of paragraph2 and sentence1
this is a second line in paragraph2.
"""
print(sample)
sample = re.sub(r'\n\n\n',"NPtag",sample)
sample = re.sub(r'\n\n'," ",sample)
sample = re.sub(r"NPtag",'\n\n\n',sample)
print("OUTPUT*****\n")
print(sample)
The workaround is to replace the paragraph break (three newlines in this case, to show the spacing clearly) with a NewParagraph tag (NPtag), then substitute the single line break (two newlines in the example above, to show the spacing clearly in a notebook environment) with a space, and finally substitute the NPtag back with the paragraph break.
Hope this helps. Eager to see other regex answers too! Happy coding
I'm using Java. So I have a comma separated list of strings in this form:
aa,aab,aac
aab,aa,aac
aab,aac,aa
I want to use regex to remove aa and the trailing ',' if it is not the last string in the list. I need to end up with the following result in all 3 cases:
aab,aac
Currently I am using the following pattern:
"aa[,]?"
However it is returning:
b,c
If lookarounds are available, you can write:
,aa(?![^,])|(?<![^,])aa,
with an empty string as replacement.
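A quick way to check the lookaround pattern against the three sample lists is a short Python sketch. Python is used here only for illustration (the question is about Java), and the helper name remove_item is hypothetical.

```python
import re

# ",aa" not followed by a non-comma, or "aa," not preceded by a non-comma
PATTERN = re.compile(r',aa(?![^,])|(?<![^,])aa,')

def remove_item(csv_line):
    # Delete the matched item together with one adjacent comma.
    return PATTERN.sub('', csv_line)

for line in ("aa,aab,aac", "aab,aa,aac", "aab,aac,aa"):
    print(remove_item(line))
```

The double negatives (?![^,]) and (?<![^,]) succeed both next to a comma and at the very start or end of the string, which is why two alternatives are enough.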
Otherwise, with a POSIX ERE syntax you can do it with a capture:
^(aa(,|$))+|(,aa)+(,|$)
with the 4th group as replacement (so $4 or \4)
Without knowing your flavor, I propose this solution, which assumes that it supports \b.
I use perl as demo environment and do a replace with "_" for demonstration.
perl -pe "s/\baa,|,aa\b/_/"
\b is the "word boundary" anchor, i.e. any start or end of something that looks like a word. It makes it possible to handle line end, line start, blank and comma.
Using it, two alternatives suffice to cover all the cases in your sample input.
Output (interleaved with the input, for both lines ending in a newline and lines ending in a blank):
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
If the \b is unknown in your regex engine, then please state which one you are using, i.e. which tool (e.g. perl, awk, notepad++, sed, ...). Also in that case it might be necessary to do replacing instead of deleting, i.e. to fine tune a "," or "" as replacement. For supporting that, please show the context of your regex, i.e. the replacing mechanism you are using. If you are deleting, then please switch to replacing beforehand.
(I picked up the suggestion from a comment by gisek that the capturing groups are not needed. I usually use () generously, including in other syntaxes. In my opinion, not having to think about or look up evaluation order is a net benefit in time spent and risk taken. But after testing, I use this terser, more elegant way.)
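For readers who want to try the \b variant with deletion instead of the "_" marker, here is a minimal Python sketch. The function name strip_aa is hypothetical, and Python's re module is assumed only because it also supports \b.

```python
import re

def strip_aa(csv_line):
    # "aa," at a word boundary, or ",aa" ending at a word boundary
    return re.sub(r'\baa,|,aa\b', '', csv_line)

for line in ("aa,aab,aac", "aab,aa,aac", "aab,aac,aa"):
    print(strip_aa(line))
```

Note that \b also sees a boundary next to characters such as "-" or a blank, so this is only safe when the list items consist of word characters.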
If your regex engine supports positive lookaheads and positive lookbehinds, this should work:
,aa(?=,)|(?<=,)aa,|(,|^)aa(,|$)
You could probably use the following and replace it with nothing:
(aa,|,aa$)
Either aa, when it's at the beginning or in the middle of the string
or ,aa$ when it's at the end of the string
As you want to delete aa followed by a comma or the end of the line, this should do the trick: ,aa(?=,|$)|^aa,
I'm trying to use regex to find single quotes (so I can turn them all into double quotes) anywhere in a line that starts with mySqlQueryToArray (a function that makes a query to a SQL DB). I'm doing the regex in Sublime Text 3 which I'm pretty sure uses Perl Regex. I would like to have my regex match with every single quote in a line so for example I might have the line:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name'");
I want the regex to match in that line both of the quotes around $name but no other characters in that line. I've been trying to use (?<=mySqlQueryToArray.*)' but it tells me that the look behind assertion is invalid. I also tried (?<=mySqlQueryToArray)(?<=.*)' but that's also invalid. Can someone guide me to a regex that will accomplish what I need?
To find any number of single quotes in a line starting with your keyword you can use the \G anchor ("end of last match") by replacing:
(^\h*mySqlQueryToArray|(?!^)\G)([^\n\r']*)'
With \1\2<replacement>.
Explanation
( ^\h*mySqlQueryToArray # beginning of line: check the keyword is here
| (?!^)\G ) # if not at the BOL, check we did match sth on this line
( [^\n\r']* ) ' # capture everything until the next single quote
The general idea is to match everything until the next single quote with ([^\n\r']*)' in order to replace it with \2<replacement>, but do so only if this everything is:
right after the beginning keyword (^mySqlQueryToArray), or
after the end of the last match ((?!^)\G): in that case we know we have the keyword and are on a relevant line.
\h* accounts for any started indent, as suggested by Xælias (\h being shortcut for any kind of horizontal whitespace).
https://stackoverflow.com/a/25331428/3933728 is a better answer.
I'm not good enough with RegEx nor ST to do this in one step. But I can do it in two:
1/ Search for all mySqlQueryToArray strings
Open the search panel: ⌘F or Find->Find...
Make sure you have the Regex (.* ) button selected (bottom left) and the wrap selector (all others should be off)
Search for: ^\s*mySqlQueryToArray.*$
^ beginning of line
\s* any indentation
mySqlQueryToArray your call
.* whatever is behind
$ end of line
Click on Find All
This will select every occurrence of what you want to modify.
2/ Enter the replace mode
⌥⌘F or Find->Replace...
This time, make sure that wrap, Regex AND In selection are active.
Then search for '([^']*)' and replace with "\1".
' are your single quotes
(...) is the capturing block, referenced by \1 in the replace field
[^']* is for any character that is not a single quote, repeated
Then hit Replace All
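Outside of Sublime Text, the same two-step idea (first select the relevant lines, then replace only inside them) can be sketched in Python. This is an illustration under assumptions, not part of the editor workflow: the function name requote_query_lines is hypothetical, and the two patterns are the ones from the steps above.

```python
import re

# Step 1 pattern: a whole line that starts with the call (after indentation)
LINE_RE = re.compile(r'^\s*mySqlQueryToArray.*$', re.MULTILINE)
# Step 2 pattern: a single-quoted chunk, content captured in group 1
QUOTED_RE = re.compile(r"'([^']*)'")

def requote_query_lines(source):
    # For every selected line, swap '...' for "..." inside that line only.
    return LINE_RE.sub(
        lambda m: QUOTED_RE.sub(r'"\1"', m.group(0)),
        source,
    )

code = (
    "$x = 'keep';\n"
    "  mySqlQueryToArray($con, \"SELECT * FROM T WHERE Name='$name' and Value='1234'\");\n"
)
print(requote_query_lines(code))
```

Lines that do not start with mySqlQueryToArray keep their single quotes, which mirrors the "In selection" restriction.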
I know this is a little more complex than the other answer, but this one tackles cases where your line contains several single-quoted strings. Like this:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name' and Value='1234'");
If this is too much, I guess something like find: (?<=mySqlQueryToArray)(.*?)'([^']*)'(.*?) and replace it with \1"\2"\3 will be enough.
You can use a regex like this:
(mySqlQueryToArray.*?)'(.*?)'(.*)
You can use \K, see this regex:
mySqlQueryToArray[^']*\K'(.*?)'
I'm checking line by line in C#
Example data:
bob jones,123,55.6,,,"Hello , World",,0
jim neighbor,432,66.5,,,Andy "Blank,,1
john smith,555,77.4,,,Some value,,2
The question "Regex to pick commas outside of quotes" is the closest I have found, but its answer doesn't handle the second line.
Try the following regex:
(?!\B"[^"]*),(?![^"]*"\B)
It does not match the second line because the " you inserted does not have a closing quotation mark.
It will not match values like so: ,r"a string",10 because the letter on the edge of the " will create a word boundary, rather than a non-word boundary.
Alternative version
(".*?,.*?"|.*?(?:,|$))
This will match the content and the commas and is compatible with values that are full of punctuation marks
The regex below is for parsing each field in a line, not an entire line
Apply the methodical and desperate regex technique: Divide and conquer
Case: field does not contain a quote
abc,
abc(end of line)
[^,"]*(,|$)
Case: field contains exactly two quotes
abc"abc,"abc,
abc"abc,"abc(end of line)
[^,"]*"[^"]*"[^,"]*(,|$)
Case: field contains exactly one quote
abc"abc(end of line)
abc"abc, (and that there's no quote before the end of this line)
[^,"]*"[^,"]*$
[^,"]*"[^,"]*,(?!.*")
Now that we have all the cases, we then '|' everything together and enjoy the resultant monstrosity.
The best answer written by Vasili Syrakis does not work with negative numbers inside quotation marks such as:
bob jones,123,"-55.6",,,"Hello , World",,0
jim neighbor,432,66.5
Following regex works for this purpose:
,(?!(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$))
But I was not successful with this part of input:
,Andy "Blank,
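To see what this regex does on the sample data, here is a minimal Python sketch that uses it to split a line into fields. The name split_fields is hypothetical, and Python is assumed only for illustration (the question is about C#).

```python
import re

# A comma that is NOT followed by an odd number of double quotes
OUTSIDE_QUOTES_COMMA = re.compile(
    r',(?!(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$))'
)

def split_fields(line):
    # Split only on commas that sit outside a quoted section.
    return OUTSIDE_QUOTES_COMMA.split(line)

print(split_fields('bob jones,123,"-55.6",,,"Hello , World",,0'))
print(split_fields('jim neighbor,432,66.5,,,Andy "Blank,,1'))
```

On the line with the unbalanced quote, every comma before the stray quote is followed by an odd number of quotes and is therefore treated as quoted, which is the failure described above.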
Try this pattern: ".*?"(*SKIP)(*FAIL)|,
import re
# `string` is assumed to hold one line of input; this removes the commas inside quotes
print(re.sub(r',(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)', "", string))
I am having trouble matching the pattern, "This program cannot be run" whenever the phrase is broken over multiple lines, e.g.:
This program cannot be run
T
his program cannot be run
Thi
s program cannot be run
.
.
This pr
ogram cannot be run
The pattern can be split onto two lines at any point. I have tried using /m and /s as well as anchors and boundaries but I cannot get it to work. I am at a loss as to what I am doing wrong. I even tried using \s after every character and even that won't match! The pattern must be PCRE formatted.
s and m won't help you here. They only change the behavior of . and anchors, respectively. Anchors and boundaries won't help either, because they only assert that something is at a certain position.
The problem with all those approaches is that a line break introduces one or two new characters into the string (\n, \r or \r\n, depending on your system). Therefore, you would have to allow a line break at every possible point if you need a regex-only solution:
/T[\r\n]*h[\r\n]*i[\r\n]*s[\r\n]* [\r\n]*p[\r\n]*.../
And so on.
If you can modify the input, it would be easier to remove line breaks first by replacing
/[\r\n]+/
with an empty string and then running the pattern you already have.
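The "remove line breaks first" route can be sketched in a few lines of Python. This is a sketch, assuming the input can be modified before matching; the names PHRASE and contains_phrase are made up for the example.

```python
import re

PHRASE = "This program cannot be run"

def contains_phrase(text):
    # Drop every run of CR/LF characters, then look for the phrase as-is.
    flattened = re.sub(r'[\r\n]+', '', text)
    return PHRASE in flattened

print(contains_phrase("Thi\ns program cannot be run"))
print(contains_phrase("This pr\r\nogram cannot be run"))
```

Since the phrase itself no longer needs any line-break handling, the original pattern can stay unchanged.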
If a newline character can appear at any point in the sought substring, you will need to add a corresponding character to match that newline in the regex.
Assuming the newline characters are always \n
T\n?h\n?i\n?s\n? \n?p\n?r\n?o\n?g\n?r\n?a\n?m\n? \n?c\n?a\n?n\n?n\n?o\n?t\n? \n?b\n?e\n? \n?r\n?u\n?n
It looks horrible, and maybe someone can offer a better solution, but here it is in Python using the re.S flag:
>>> a = """
... This pr
... ogram cannot be run"""
>>> re.search("T[\n]*h[\n]*i[\n]*s[\n]* [\n]*p[\n]*r[\n]*o[\n]*",a,re.S)
<_sre.SRE_Match object at 0x7f9d746e9e68>
The easy way to make the regex if your string changes
>>> a = "This program cannot be run"
>>> b = list(a)
>>> r = '[\r\n]*'.join(b)
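Building on the join idea, a slightly fuller sketch might escape each character (so that phrases containing regex metacharacters still work) and compile the result. The function name build_split_tolerant_pattern is hypothetical.

```python
import re

def build_split_tolerant_pattern(phrase):
    # Allow any number of CR/LF characters between consecutive characters.
    # re.escape keeps characters such as "." or "?" literal.
    return re.compile(r'[\r\n]*'.join(re.escape(ch) for ch in phrase))

pattern = build_split_tolerant_pattern("This program cannot be run")
print(bool(pattern.search("This pr\nogram cannot be run")))
```

The generated pattern is the same kind of expression as the hand-written one, just produced from the phrase, so changing the phrase does not require rewriting the regex.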
I have a syntax highlighting function in vb.net. I use regular expressions to match "!IF" for instance and then color it blue. This works perfect until I tried to figure out how to do comments.
In the language I'm writing this for, a line is a comment if it starts with a single quote ', and anything following two consecutive single quotes anywhere in a line is also a comment:
'this line is a comment
!if StackOverflow = "AWESOME" ''this is also a comment
Now I know how to check whether a line starts with a single quote (^'), but I need the match to extend all the way to the end of the line so I can color the entire comment green and not just the single quotes.
You shouldn't need the code but here is a snippet just in case it helps.
For Each pass In frmColors.lbRegExps.Items
RegExp = System.Text.RegularExpressions.Regex.Matches(LCase(rtbMain.Text), LCase(pass))
For Each RegExpMatch In RegExp
rtbMain.Select(RegExpMatch.Index, RegExpMatch.Length)
rtbMain.SelectionColor = ColorTranslator.FromHtml(frmColors.lbHexColors.Items(PassNumber))
Next
PassNumber += 1
Next
Something along the lines of:
^(\'[^\r\n]+)$|(''[^\r\n]+)$
should give you the commented line (or part of the line) in a capture group (group 1 for a whole-line comment, group 2 for a trailing one)
Actually, you do not even need the groups:
^\'[^\r\n]+$|''[^\r\n]+$
If it finds something, it is a comment.
"(^'|'').*$"
mentioned by Boaz would work if applied only line by line (which may be your case).
For multi-line detection, you must be sure to avoid the 'Dotall' mode, where '.' stands also for \r and \n characters. Otherwise that pattern would match both your lines entirely.
That is why I generally prefer [^\r\n] to '.': it avoids any dependency to the mode of the pattern. Even in 'Dotall' mode, it still works and avoids trying any match on the next line.
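To illustrate the [^\r\n] variant on multi-line text, here is a minimal Python sketch (Python only for illustration; the question uses VB.NET). It assumes \n line endings and uses Python's re.MULTILINE so that ^ and $ work per line; the name find_comment_spans is hypothetical. It returns (start, length) pairs, the same two values the highlighting loop in the question passes to Select.

```python
import re

# Whole-line comment starting with ', or trailing comment starting with ''
COMMENT_RE = re.compile(r"^'[^\r\n]+$|''[^\r\n]+$", re.MULTILINE)

def find_comment_spans(text):
    # Return (start index, length) for every comment found.
    return [(m.start(), m.end() - m.start()) for m in COMMENT_RE.finditer(text)]

source = (
    "'this line is a comment\n"
    "!if StackOverflow = \"AWESOME\" ''this is also a comment\n"
    "!if x = 1\n"
)
for start, length in find_comment_spans(source):
    print(source[start:start + length])
```

Because the character class excludes \r and \n, a match can never run into the next line, whatever mode the pattern is compiled in.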
While the above would work you can simplify it:
"(^'|'').*$"
As VonC mentions - this would only work if you feed the Regex one line at a time. For multi line mode use:
"(^'|'').*?$"
The ? makes the * operator non-greedy, forcing the regex to match a single line.
Using the regex pattern: REM((\t| ).*$|$)|^\'[^\r\n]+$|''[^\r\n]+$
see more https://code.msdn.microsoft.com/How-to-find-code-comments-9d1f7a29/