Convert line-broken paragraphs into single paragraphs? (Folding text?) - regex

I have searched everywhere for an answer to this, but I think I must not be using the right lingo... I have text like this:
This text is actually just
one paragraph, but every
few words are broken to a
new line, and that's
annoying as hell, because
I have to go to each line
and fix it by hand...
Then there's a second
paragraph which does the
same thing.
I would like to convert that to:
This text is actually just one paragraph, but every few words are broken to a new line, and that's annoying as hell, because I have to go to each line and fix it by hand...
Then there's a second paragraph which does the same thing.
I've tried as many regex techniques as I could think of in TextMate, and can't find any macros or commands to re-wrap the text... The text in question is a result of content editors on one of my sites pasting from Word... I think they may even type this way (holdover from typewriter days!).

Based on your comment, there's probably something you can do with lookaheads. I tried it, but it didn't work (perhaps I didn't try hard enough). Instead, you can do this with a series of commands.
First replace any series of spaces with just a single space character:
:%s/ \+/ /g
Then replace all newlines with a space:
:%s/\n/ /g
Then replace all double spaces with double newlines (the pattern is two space characters):
:%s/  /^M^M/g
The ^M can be obtained in vim by doing CTRL+V CTRL+M.
Or, you could even do:
:%s/  /\r\r/g
This is a little ghetto, but it should work :)
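If you would rather do this outside the editor, here is a minimal Python sketch of the same idea. It assumes that paragraphs are separated by at least one blank line (which is what the double-space trick above relies on); the function name fold_paragraphs is just made up for illustration.

```python
import re

def fold_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into one line per paragraph.

    Assumption: paragraphs are separated by at least one blank
    (empty or whitespace-only) line.
    """
    # Split the text wherever a blank line separates two paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse every run of whitespace
    # (including the unwanted newlines) into a single space.
    folded = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Separate the paragraphs with a blank line, like the vim version does.
    return "\n\n".join(folded)
```

Calling fold_paragraphs on the contents of the file gives back one long line per paragraph, with a blank line between paragraphs.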

Deleting every 2nd line from a file using Notepad++

I am looking for some regex help.
I have a textfile, nothing super important but I would like to delete every second line from it - I have tried following this guide: Delete every other line in notepad++
However, I just can't get it to work. Is the regex I am using OK? I am new to regex.
Find:
([^\n]*\n)[^\n]*\n
Replace with:
$1
No matter what I try (mouse position at the beginning, ctrl+a and Replace All) I just can't get it to work. I appreciate any help.
I've put the regex into here: http://regexpal.com/ and if I remove the final \n it highlights the individual rows.
Make sure you select regular expression for the search mode...
Also, you may want to make that final newline optional. In the case that there are an even number of lines and you do not have a trailing newline, it won't remove the last line.
([^\n]*\n)[^\n]*\n?
Update:
Note that Windows ends lines with \r\n instead of just \n. Try updating the expression to take this into account:
([^\r\n]*[\r\n]+)[^\r\n]*[\r\n]*
Final Update:
Thanks to @zx81, I now know that N++ uses PCRE, so \R can be used for Unicode newline characters. However, [^\R] won't work (it looks for anything except a literal R), so you will need to keep [^\r\n]. This can be simplified as:
([^\r\n]*\R)[^\r\n]*\R?
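To see what the expression does outside Notepad++, here is a rough Python sketch of the same substitution. Python's re module has no \R, so the sketch spells the line ending out as an alternation; the names NEWLINE and drop_every_second_line are only illustrative.

```python
import re

# Python's re module does not support \R, so list the line endings explicitly.
NEWLINE = r"(?:\r\n|\r|\n)"

def drop_every_second_line(text: str) -> str:
    # Group 1 captures a line together with its line ending. The line after
    # it (with an optional line ending) is matched but not captured, so
    # replacing the whole match with group 1 deletes that second line.
    pattern = rf"([^\r\n]*{NEWLINE})[^\r\n]*{NEWLINE}?"
    return re.sub(pattern, r"\1", text)
```

The optional final line ending is what lets this handle a file with an even number of lines and no trailing newline.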

switch search pattern in a line to titlecase leaving remainder of line unchanged

I want to convert the UPPERCASE words in the following line:
<h3>XV. THE THOUSAND AND ONE GOALS</h3>
to titlecase, using either sed or ved (vim ed). Googling turned up ways to Titlecase whole lines but not just text that matches a pattern search (partial text in a line)
thought this might work, but no dice:
sed -ri '/<h3>/s/([A-Z ]*)<\/h3>/\L\1\END<\/h3>/;s/[[:graph:]]*/\u&/g'
after converting the search pattern to LOWERCASE (no probs of course managing that), I thought I might be able to then convert same text to Titlecase with something like this, but still no love (I actually thought this made enough sense to work, so I am unsure why it doesn't):
sed -ri 's/(<\/a>[IVX]{1,6}\.[ ]{1,})( [a-z])/\1\u\2/g'
Is there some way to edit only the text in a pattern search to Titlecase and not all the words in an entire line of text? I wonder why there is not a \T to complement the \L and \u case commands. Sure would be handy.
%s/\v<\zs(\u)(\u*)\ze>/\1\L\2/g
In vim, after you execute this command, your line will become:
<h3>Xv. The Thousand And One Goals</h3>
Vim has substitute() and other powerful functions, which are very handy for substituting within matched text if your requirement is complex enough: :s/.../\=(expression here)/
To do the replacement only inside <h3>....</h3>:
%s#<h3>\zs.*\ze</h3>#\=substitute(submatch(0), '\v<\zs(\u)(\u*)\ze>','\1\L\2',"g")#
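The same two-level idea (match the heading first, then rewrite only the words inside it) can be sketched in Python with a replacement function. This is just an illustration of the technique, not a translation of the vim command; titlecase_h3 is a made-up name, and like the vim version it only touches runs of capital letters.

```python
import re

def titlecase_h3(line: str) -> str:
    def fix_heading(match):
        # Inside the heading text, keep the first capital of each word
        # and lowercase the run of capitals that follows it.
        inner = re.sub(
            r"\b([A-Z])([A-Z]*)\b",
            lambda w: w.group(1) + w.group(2).lower(),
            match.group(2),
        )
        return match.group(1) + inner + match.group(3)

    # Only text between <h3> and </h3> is passed to fix_heading;
    # everything else on the line is left unchanged.
    return re.sub(r"(<h3>)(.*)(</h3>)", fix_heading, line)
```

Lines without an <h3> element come back untouched, which is the "only the matched part" behaviour the question asks for.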

How can I remove all the text between matches on a line?

I have this problem:
Input text:
this is my text text text and more text
this is my text myspace this is my text
this space is my text space this is my
this is my text this is my text
this space is my text space space myspace
Let's say I want to search for "space".
I would like to have this as output:
this is my text text text and more text
space
space space
this is my text this is my text
space space space space
Matches on the same line have to be separated with a space.
Line without matches must remain as it is.
Same for all other search items.
I've been trying to get this working all afternoon, without success.
Can anyone help me?
Solution:
:g/space/s/\(.*space\).*$/\1/|s/.\{-}space/ space/g|s/^ //
Explanation:
This is tricky, but it can be done. It can't be done with a single regular expression, though.
The first thing we do is get rid of anything after the last match (we actually exploit the fact that regular expressions are greedy by default here):
s/\(.*space\).*$/\1/
Then we remove anything between all the internal matches (notice we use the lazy version of * here, \{-}):
s/.\{-}space/ space/g
The previous step will leave an initial space in the result, so we get rid of that:
s/^ //
Fortunately, in vim, we can chain replacements together with the | character. So, putting it all together:
:g/space/s/\(.*space\).*$/\1/|s/.\{-}space/ space/g|s/^ //
Is this tricky line OK for you?
:g/space/s/space/^G/g|s/[^^G]//g|s/^G/space /g
To enter the ^G above, press Ctrl-V Ctrl-G.
The output of the above command is the same as your example, except for the trailing whitespace after the pattern (space in this case). That is easy to fix, e.g. by chaining another s/ $// after the :g line.
Kent's solution uses a nice trick that makes it work only for fixed strings, but it's clean and short. Ethan Brown's answer is more general, but also adds complexity with its three steps. I think the best solution can be developed based on the accepted answer in this very similar question.
Contrary to what Ethan Brown assumes, this can indeed be done with a single regular expression substitution. Here it is, in all its ugliness:
:g/space/s/\%(^\|\%(space \)*space\%( \%(.*space\)\@=\)\?\)\zs\%(\%(space \)*space\%( \%(.*space\)\@=\)\?\)\@!.\{-1,}\ze\%(\%(space \)*space\%( \%(.*space\)\@=\)\?\|$\)//g
It becomes somewhat more readable when you use the :DeleteExcept command from my PatternsOnText plugin:
:g/space/DeleteExcept/\%(space \)*space\%( \%(.*space\)\@=\)\?/
Explanation
This deletes everything except
potentially multiple sequential occurrences \%(space \)*
of the word space
including the trailing whitespace when it's not the last match in the line, i.e. there's a following match \%(.*space\)\@= so that the whitespace is not swallowed
or excluding (i.e. deleting) it \? after the last match in the line.
More practical alternative
Though it's a nice challenge to come up with the above solution, in practice, I would also favor a two-step approach, just because it's way simpler:
:g/space/DeleteExcept/space\%( \|$\)/
This leaves behind trailing whitespace that can be pruned with
:%s/ $//
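For comparison, the goal of this thread (on every line that contains a match, keep only the matches, separated by single spaces) is short to sketch in Python. This is not any of the vim approaches above, just the same result reached by collecting the matches; keep_only_matches and its parameter names are made up, and the search term is treated as a fixed string.

```python
import re

def keep_only_matches(text: str, needle: str) -> str:
    out = []
    for line in text.split("\n"):
        # Collect every occurrence of the (literal) search term on this line.
        hits = re.findall(re.escape(needle), line)
        # Lines without a match stay as they are; otherwise keep only
        # the matches, joined by single spaces.
        out.append(" ".join(hits) if hits else line)
    return "\n".join(out)
```

Because the matches are joined rather than the gaps deleted, there is no leading or trailing whitespace to clean up afterwards.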

How to read this command to remove all blanks at the end of a line

I happened across this page full of super useful and rather cryptic vim tips at http://rayninfo.co.uk/vimtips.html. I've tried a few of these and I understand what is happening enough to be able to parse it correctly in my head so that I can possibly recreate it later. One I'm having a hard time getting my head wrapped around though are the following two commands to remove all spaces from the end of every line
:%s= *$== : delete end of line blanks
:%s= \+$== : Same thing
I'm interpreting %s as string replacement on every line in the file, but after that I am getting lost in what looks like some gnarly variation of :s and regex. I'm used to seeing and using :s/regex/replacement. But the above is super confusing.
What do those above commands mean in english, step by step?
The regex delimiters don't have to be slashes, they can be other characters as well. This is handy if your search or replacement strings contain slashes. In this case I don't know why they use equal signs instead of slashes, but you can pretend that the equals are slashes:
:%s/ *$//
:%s/ \+$//
Does that make sense? As written, the first one searches for zero or more spaces, and the second one searches for one or more spaces. Each one is anchored at the end of the line with $. And then the replacement string is empty, so the spaces are deleted.
I understand your confusion, actually. If you look at :help :s you have to scroll down a few pages before you find this note:
*E146*
Instead of the '/' which surrounds the pattern and replacement string, you
can use any other character, but not an alphanumeric character, '\', '"' or
'|'. This is useful if you want to include a '/' in the search pattern or
replacement string. Example:
:s+/+//+
I do not know vim syntax, but it looks to me like these are sed-style substitution operators. In sed, the / (in s/REGEX/REPLACEMENT/) can be uniformly replaced with any other single character. Here it appears to be =. So if you mentally replace = with /, you'll get
:%s/ *$//
:%s/ \+$//
which should make more sense to you.
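If it helps to see the pattern itself at work outside vim, here is a small Python sketch of the "one or more spaces at end of line" version. Note that Python regex syntax differs from vim's (the + is not escaped), and strip_trailing_spaces is just an illustrative name.

```python
import re

def strip_trailing_spaces(text: str) -> str:
    # With re.MULTILINE, $ matches just before each newline as well as
    # at the very end of the string, so " +$" is "spaces at end of line".
    return re.sub(r" +$", "", text, flags=re.MULTILINE)
```

The empty replacement string plays the same role as the empty part between the last two delimiters in the vim commands.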

Regex: remove lines not starting with a digit

I have been fighting this problem with the help of a regex cheat sheet, trying to figure out how to do this, but I give up... I have a lengthy file open in Notepad++ and would like to remove all lines that do not start with a digit (0..9). I would use the Find/Replace functionality of N++. I mention this only because I am not sure which regex implementation N++ uses... Thank you
Example. From the following text:
1hello
foo
2world
bar
3!
I would like to extract
1hello
2world
3!
not:
1hello

2world

3!
by doing a find/replace on a regular expression.
You can clear those lines with ^[^0-9].* but it will leave blank lines.
Notepad++ uses Scintilla, and also uses its regex engine for matching.
\r and \n are never matched because in
Scintilla, regular expression searches
are made line per line (stripped of
end-of-line chars).
http://www.scintilla.org/SciTERegEx.html
To clear up those blank lines, the only way is to choose extended mode and replace \n\n with \n. If you are in Windows mode, replace \r\n\r\n with \r\n.
[^0-9] is a regular expression that matches pretty much anything, except digits. If you say ^[^0-9] you "anchor" it to the start of the line, in most regular expression systems. If you want to include the rest of the line, use ^[^0-9].+.
^[^\d].* marks a whole line whose first character is not a digit. Check that there really is no whitespace in front of the digits. Otherwise you'd have to use a different expression.
UPDATE:
You will have to do it in two steps. First empty the lines that do not start with a digit. Then remove the empty lines in extended mode.
One could also use the technique of bookmarking in Notepad++. I started benefiting from this feature (long time present but only more recently made somewhat more visible in the UI) not very long ago.
Simply bring up the find dialogue, type regex for lines not starting with digit ^\D.*$ and select Mark All. This will place blue circles, like marbles, in the left gutter - these are line bookmarks. Then just select from main menu Search -> Bookmark -> Remove bookmarked lines.
Bookmarks are cool, you could extract these lines by simply selecting to copy bookmarked lines, opening new document and pasting lines there. I sometimes use this technique when reviewing log files.
I'm not sure what you are asking, but the regexp for finding the lines with a digit at the beginning would be
^\d.*
You can keep all the lines that match the above, or alternatively remove all the lines that match this expression:
^[^\d].*
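Outside Notepad++, the whole job (including getting rid of the leftover blank lines) can be sketched as a simple line filter in Python. This filters the lines instead of doing a find/replace, so it avoids the two-step cleanup; keep_digit_lines is a hypothetical name.

```python
import re

def keep_digit_lines(text: str) -> str:
    # splitlines() copes with both \n and \r\n line endings.
    # re.match anchors at the start of the line, so this keeps only
    # lines whose first character is one of 0-9.
    kept = [line for line in text.splitlines() if re.match(r"[0-9]", line)]
    return "\n".join(kept)
```

Lines with leading whitespace before the digit are dropped too, so check for that in your data first, as noted above.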