I'm looking for a regex string that selects each line break that follows a certain number of characters (in my case 19).
This would select the whole line--but I would only want the line break selected that fulfills this condition:
.{19,}[^\n]
Any help would be greatly appreciated (I obviously don't really know my way around regexs.)
Essentially what I'm trying to do is a search&replace with a text editor that supports regex to get rid of line breaks from an OCR'd book. My somewhat heuristic approach is that every line shorter than 19 characters is likely a paragraph break (it's a very small book) and should keep its line break, while all other lines should have the break taken out.
Example:
This is a line that wraps
around
This one isn't.
Here begins a new paragraph
The line break after line 1 should be taken out, so the word "around" moves up. The line break after line 3 shouldn't, since that line is too short--so the transition to the next paragraph (line 4) is kept.
I hope this makes sense? (Since I'm not using a programming language, I'm assuming \K won't work--at least it doesn't in the editor I'm currently using.)
Thanks!
One very useful tool for regexps is a regexp tester, such as Regex 101.
That way you can see what your regex is doing. Make sure to clarify which flavor you are using (I program mostly in Ruby, which acts a bit differently than some others).
In yours, .{19,} looks for 19 or more characters; if you want EXACTLY 19, remove the comma.
.{19}
Then, since you don't want those 19 (or more) characters included in the match, you can use:
.{19}\K\n
\K 'forgets' what has been matched so far and moves on from that point. Very useful if your regex flavor allows it (Ruby does not, if I recall correctly?). If you want 19 or more chars counted from the beginning of the line:
^.{19,}\K\n
and don't forget the multiline and global options if you want all matches.
ALSO! Be sure to read Crayon Violent's comment above for more good advice (and an important Windows fact!)
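For readers who end up scripting this instead of using an editor, here is a minimal Python sketch of the same idea. Python's built-in re module has no \K, so this swaps in a fixed-width lookbehind, which should have the same effect here. Replacing the break with a single space and the names LONG_LINE_BREAK and unwrap are assumptions for illustration, not part of the answer above.

```python
import re

# Assumption: a "long" line has at least 19 characters before its line break.
# "." never matches "\n", so the 19 characters must all sit on the same line.
LONG_LINE_BREAK = re.compile(r"(?<=.{19})\n")


def unwrap(text: str) -> str:
    """Replace the break after every line of 19+ characters with a space."""
    return LONG_LINE_BREAK.sub(" ", text)


sample = (
    "This is a line that wraps\n"
    "around\n"
    "This one isn't.\n"
    "Here begins a new paragraph"
)
print(unwrap(sample))
```

Only the break after the first line is removed in this sample; the two short lines keep theirs.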
Continuing from this thread, I put the question up for other people looking for something like this.
In the other thread, I explained the caveats which Wiktor found a marvellous workaround for.
What I have:
* Some text working as a title starting with a Capital letter in usually a single line *
The regex code for this result:
#### Text
Line breaks always precede and follow these lines, as described in the other thread, and we must look out for spillover problems. I am not sure whether there is whitespace after the asterisks in every instance across my document. Notice also that other text which is not intended as a title can have this syntax too, so we must look for CRLFs before and after, as explained in the other thread.
I'd like not only a match but also a substitution example, so that other novices like me can follow it easily.
Cheers,
F.
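A rough Python sketch of what such a match-and-substitute could look like, under several assumptions that are not confirmed by the question: a title is a whole line of the form "* Text *" that starts with a capital letter, it has an empty line before and after it, and CRLF line endings can be normalised to LF first. The names TITLE and convert_titles are hypothetical.

```python
import re

# Assumed title shape: an empty line, then "* Capitalised text *", then an
# empty line. Whitespace next to the asterisks is optional.
TITLE = re.compile(r"(?<=\n\n)\*[ \t]*([A-Z][^*\n]*?)[ \t]*\*(?=\n\n)")


def convert_titles(text: str) -> str:
    """Turn '* Title *' lines into '#### Title' (LF line endings on output)."""
    text = text.replace("\r\n", "\n")  # normalise CRLF so one pattern suffices
    return TITLE.sub(r"#### \1", text)


sample = "intro\r\n\r\n* Some title *\r\n\r\nbody *not a title* here\r\n"
print(convert_titles(sample))
```

Because the pattern insists on an empty line on both sides, inline text such as *not a title* is left alone; a title on the very first line of the file would not be matched by this sketch.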
How can I match one line of text with a regex and follow it up with a line of dashes, exactly as many as there are characters in the initial match, to achieve text-only underlining? I intend to use this with the search and replace function (likely in the scope of a macro) inside an editor. Probably, but not necessarily, Visual Studio Code.
This is a heading
should turn into
This is a heading
-----------------
I believe I read an example of this years ago but can't find it; neither do I seem to be able to formulate a search query that gets anything useful out of Google (including variations of the question's title). If you can, I'd be interested in that, too.
The best I can come up with is this:
^(.)(?=(.*\n?))|.
Substitution
$1$2-
Syntax         Note
^(.)           match the first character of a line, capture it in group 1
(?=(.*\n?))    then look ahead for the rest of this line and capture it in group 2, including a line break if there is one
|.             or a normal character
But the text must have a line break after it, or the underline stays on the same line.
Not sure if it is of any use, but here are the test cases.
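To see the trick at work outside an editor, here is a small Python sketch that applies the same pattern and substitution. It assumes Python 3.5 or later, where an unmatched group in the replacement template expands to an empty string; the function name underline is made up for this example.

```python
import re

# Same pattern as above. With re.MULTILINE, "^" matches at every line start,
# so every line in the input gets underlined.
PATTERN = re.compile(r"^(.)(?=(.*\n?))|.", re.MULTILINE)


def underline(text: str) -> str:
    # First character of a line: emit the whole line (groups 1 and 2) plus one dash.
    # Every other character: groups are empty, so it is replaced by a single dash.
    return PATTERN.sub(r"\1\2-", text)


print(underline("This is a heading\n"))
```

Without a trailing line break the dashes end up on the same line as the heading, which is the limitation mentioned above.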
There seems to be a bug in the Notepad++ find/replace behaviour when using a backreference to find duplicate lines that may not necessarily be consecutive. I'm wondering if anyone knows what the issue could be with the regex or if they know why the regex engine might be bugging out?
Details
I wanted to use a regex to find duplicate lines in Notepad++. The duplicates needn't necessarily be contiguous i.e. on consecutive lines, there can be lines in between. I started with this post:
https://medium.com/#heitorhherzog/compare-sort-and-delete-duplicate-lines-in-notepad-2d1938ed7009
But realised that the regex mentioned there only checks for contiguous duplicates. So I wrote my own regex:
^(.+)$(?:(?:\s|.)+)^(\1)$
The above basically captures a whole line, then matches a load of stuff in between, then captures the same thing again on a line of its own.
What's wrong
The regex works, but only sometimes. I can't figure out the pattern. I've whittled it down to this so far. If I do a "Replace All" on the replacement pattern \1\2 then the "replace all" leaves me with just line 3, which is "elative backreferences32". This is wrong:
dasfdasfdsfasdfasdfadsfasdf
elative backreferenceswe
elative backreferences32
elative backreferencesd
elative backreferencdesdfdasdfsdafsd
asfasdfasdfasdfasdfasfdsaasdfas
asdfasdfafds asdfasfdsafasd asdfdasfsd
elative backreferencessfhdfg
x
y
x
But if I delete any line from that file, then only the consecutive lines x then y then x are replaced by a single line xx as I'd expect.
Notes
I'd like to keep this question focused mostly on why the regex is
bugging out. Suggestions about alternative ways to find duplicate
lines are of course good but the main reason I'm asking this is to
figure out what's going on with the regex and Notepad++.
I don't really need the replace part of this, just the find, I was just using the replace to try to figure out what groups were being captured in an attempt to debug this
The find behaviour is also buggy. I noticed this first actually. It first finds the match I'm actually looking for, and then if I click "Find Next" again, it highlights all the text.
Hypotheses
There is a bug in Notepad++ v7.8.4 64 bit. I just updated today so maybe they haven't caught it yet.
Does the in-between part of the match, (?:(?:\s|.)+), maybe cycle
around past the end of file character and loop right back to the
original match? If so, I'd say that's still a bug, because AFAIK a
regex should only consume each character once.
I thought there might be a limit to the number of characters in the file, but I disproved this hypothesis by playing around with the file, adding characters here and there. Two files with the same number of lines and the same number of characters can behave differently: one with buggy behaviour, one without.
Screenshots: Before; After without "Matches newline" (the intended configuration); After with "Matches newline" (for experimentation).
(?:\s|.) should be avoided as it causes unexpected behaviour; I suggest using [\s\S] instead:
Find what: ^(.+)$[\s\S]+?^(\1)$
Replace with: $1$2
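As a quick sanity check of that pattern outside Notepad++, here is a Python sketch. Python's re engine is not the one Notepad++ uses, so this only illustrates what the pattern is meant to do, not how Notepad++ behaves; the sample text and the name collapse_duplicates are made up.

```python
import re

# "^(.+)$" captures a whole line, "[\s\S]+?" lazily skips anything (including
# line breaks), and "^(\1)$" requires the same line to appear again.
DUPLICATE = re.compile(r"^(.+)$[\s\S]+?^(\1)$", re.MULTILINE)


def collapse_duplicates(text: str) -> str:
    """Replace each 'line ... same line' span with the two captures only."""
    return DUPLICATE.sub(r"\1\2", text)


sample = "a\nx\ny\nx"
print(DUPLICATE.search(sample).group(0))
print(collapse_duplicates(sample))
```

On this sample the x, y, x lines collapse to a single line xx, matching the behaviour the question expected.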
I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It matches from the second line of my example to the end of the quoting environment; I guess the greedy quantifier would match up to the last \end{quoting} in the file. Is there any way of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\@!\_.\)\{-} instead of \_.\{-}; this means that the matched text cannot contain a further \\}% sequence, thus achieving your aims (as far as I can determine them).
This uses the negative zero-width look-ahead \@! to ensure that each "any character" match does not start the specific text we want to avoid (other than that, anything still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\@!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
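The same "any character that doesn't start \\}%" idea can be tried out in Python, shown here only as an analogue of the Vim pattern and not as a replacement for it. The names PATTERN and drop_last_linebreak are hypothetical, and the sample is the snippet from the question.

```python
import re

# Tempered dot: consume any character (newlines included) unless a new
# "\\}%" sequence starts at that position.
PATTERN = re.compile(r"\\\\\}%((?:(?!\\\\\}%)[\s\S])*?)\\end\{quoting\}")


def drop_last_linebreak(tex: str) -> str:
    """Remove the final "\\\\" before each \\end{quoting}."""
    return PATTERN.sub(lambda m: "}%" + m.group(1) + "\\end{quoting}", tex)


sample = "\n".join([
    r"\myverse{some text \\",
    r"some more text \\}%",
    "%",
    r"\myverse{again some text \\",
    r"this is my last line \\}%",
    r"\footnote{possibly some footnotes here}%",
    "%",
    r"\end{quoting}",
])
print(drop_last_linebreak(sample))
```

A match attempt that starts at the first \\}% fails because the second one lies in between, so only the last \\}% before \end{quoting} is rewritten.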
You’re on the right track using the non-greedy multi. The Vim help files state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
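The "starts earlier wins" rule quoted from the Vim help is not specific to Vim. As an illustration, Python's lazy quantifier *? behaves the same way on the help file's own example (Python is used here only for demonstration):

```python
import re

# "a*?b" prefers few a's, but the engine still reports the leftmost match,
# so the match starts at the first "a" rather than at the "b".
match = re.search(r"a*?b", "xaaab")
print(match.group(0))  # aaab
```

This is why a non-greedy quantifier alone cannot skip an earlier \\}% in the question's text.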
I have a requirement to remove indentation from a numbered paragraph. I currently do this with a couple of regular expressions and some code, but would like to accomplish it with one or more regular expressions. The paragraph looks like this:
1. THE FIRST LINE OF THE PARAGRAPH
ANOTHER LINE IN THE PARAGRAPH
AN INDENTED LINE WITHIN THE PARAGRAPH
This needs to be transformed to retain the indentation within the paragraph, but remove the indentation of the entire paragraph as measured by the indentation of the first line.
THE FIRST LINE OF THE PARAGRAPH
ANOTHER LINE IN THE PARAGRAPH
AN INDENTED LINE WITHIN THE PARAGRAPH
The following regex accomplishes the task by replacing matches with empty strings. (note that there are no tabs expected in this content, just spaces):
(\A *\d+\. *|^ {0,5})
But it requires that the indention length of 5 characters be set explicitly. I would like a generic way of doing this that would work with any indentation length. Any ideas for how one or more regular expressions (applied cumulatively) could accomplish this?
I am using the .NET regular expression engine with multiline mode turned on.
As others have indicated, regexes (alone) probably aren't the correct tool for the job.
The major problem is that in order to strip the correct number of spaces from all the following lines, you somehow need to store how wide the first indentation was. I'm not sure this is doable with a regex engine alone.
If your desire for a regex-based approach is just to have a quick one-liner, then I think you can hack together something like the following (I'm not familiar with .NET, so I'll just provide a Python solution):
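import re

# paragraph holds the numbered paragraph as a single string.
# Outer regex: group 1 is the first-line prefix (number, dot, spaces), group 2 the rest.
# Inner regex: strip len(group 1) spaces from the start of each remaining line.
result = re.sub(r"^([\d\. ]+)(.*)$",
                lambda m: re.sub("^" + " " * len(m.group(1)),
                                 "",
                                 m.group(2),
                                 flags=re.MULTILINE),
                paragraph,
                flags=re.MULTILINE | re.DOTALL)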
The idea is to have the outer regex isolate the indentation of the first line, while the inner regex takes care of removing the correct amount from subsequent lines.
For this to work, the indentation must consist exclusively of spaces (i.e. no tabs); otherwise you'll have to make some assumptions about how many spaces a tab is worth.
That said, you would probably be better off implementing a custom parser to do the job. It would surely be cleaner and probably more efficient too.
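A minimal sketch of what such a hand-written approach might look like in Python. It assumes space-only indentation and that the first line carries the "1." prefix; the exact indentation in the sample is a guess, since the question's formatting is not preserved here, and dedent_numbered is a hypothetical name.

```python
import re


def dedent_numbered(paragraph: str) -> str:
    """Strip the numbered prefix, then up to that many spaces from later lines."""
    lines = paragraph.split("\n")
    prefix = re.match(r" *\d+\. *", lines[0])
    width = prefix.end() if prefix else 0  # width of e.g. "1.   "
    result = [lines[0][width:]]
    for line in lines[1:]:
        leading = len(line) - len(line.lstrip(" "))
        # Never cut into the text itself if a line is indented less than width.
        result.append(line[min(width, leading):])
    return "\n".join(result)


sample = (
    "1.   THE FIRST LINE OF THE PARAGRAPH\n"
    "     ANOTHER LINE IN THE PARAGRAPH\n"
    "       AN INDENTED LINE WITHIN THE PARAGRAPH"
)
print(dedent_numbered(sample))
```

The width is measured once from the first line, so the relative indentation inside the paragraph is kept.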
I am not sure how you thought it would work, but your regex matches everything under the sun due to the right side of the |.
Try this:
^((?:\d+\.)? +)
Use something like http://www.regexr.com/ to test it out.