How do I re-wrap a paragraph to a certain line length?

I have a big paragraph which I need to split into lines such that each line must not have more than 100 characters and no words must be broken. How would I go about doing this? I guess with regular expressions is the best way but I'm not sure how.

Use Text::Wrap.
Text::Wrap::wrap() is a very simple paragraph formatter. It formats a single paragraph at a time by breaking lines at word boundaries. Indentation is controlled for the first line ($initial_tab) and all subsequent lines ($subsequent_tab) independently.

While you should use a library function if you have one, as KennyTM suggested, a simple regex to solve this can be:
.{1,100}\b
This takes up to 100 characters per match and never breaks a word. It can still split at other boundaries, though: for example, the period at the end of a sentence may be separated from the last word (last word<\n>. new line).
If that's an issue, you can also try:
.{1,99}(\s|.$)
This ensures that every match ends with a whitespace character, except the final match, which ends at the end of the text.
All of these assume that you count spaces as characters, that your text probably has no newlines (a single paragraph), and that it has no words longer than 100 characters.
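As a rough illustration, the second pattern can be tried out in Python along these lines. This is only a sketch: the function name wrap_paragraph and its width parameter are made up for the example, and it carries the same assumptions as above (a single paragraph, words separated by single spaces, no word longer than the width).

```python
import re

def wrap_paragraph(text, width=100):
    """Split a single paragraph into lines of at most `width` characters."""
    # Hypothetical helper: builds .{1,N}(\s|.$) with N = width - 1,
    # so each match is at most `width` characters long.
    pattern = re.compile(rf".{{1,{width - 1}}}(?:\s|.$)")
    # Every match ends in whitespace (or at the end of the text);
    # strip that trailing whitespace from each line.
    return [m.group().rstrip() for m in pattern.finditer(text)]

paragraph = " ".join(["word"] * 50)
for line in wrap_paragraph(paragraph):
    print(len(line), line)
```

In Python, the library route would be textwrap.wrap(text, width=100) from the standard library, which is usually preferable to a hand-written pattern.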

Related

Regular expression to extract whole sentences with matching word

I would like to extract sentences with the word "flung" in the whole text.
For example, in the following text, I'd like to extract the sentence "It was exactly as if a hand had clutched them in the centre and flung them aside." using regular expression.
I tried to use this .*? flung (?<sub>.*?)\., but it starts searching from the beginning of the line.
How could I solve the problem?
As she did so, a most extraordinary thing happened. The bed-clothes gathered themselves together, leapt up suddenly into a sort of peak, and then jumped headlong over the bottom rail. It was exactly as if a hand had clutched them in the centre and flung them aside. Immediately after, .........
Here you go,
[^.]* flung [^.]*\.
OR
[^.?!]*(?<=[.?\s!])flung(?=[\s.?!])[^.?!]*[.?!]
Simply match anything between dots.
Without a trailing dot:
[A-Za-z," ]+word[A-Za-z," ]+
With a trailing dot:
[A-Za-z," ]+word[A-Za-z," ]+\.
"[A-Z]\\s?\\w*\\s?(([^(\\.\\s)|(\\?\\s)|(!\\s)])|\\s)*(?:your target\\s)(([^(\\.\\s)|(\\?\\s)|(!\\s)])|\\s)*(([^(\\.\\s)|(\\?\\s)|(!\\s)])|\\s)*[\\.|\\?|!]"
A sentence starts with a capital letter; in the middle it may contain a decimal number or an abbreviation.
(?<=^|\s)[A-Z][^!?.]*( word\s*)[^!?.]*(?=\.|\!|\?)
Before the first capital letter there is either the start of the line or a whitespace character. The sentence may then contain any number of characters (possibly none) other than !, ? or ., followed by your target word, with or without whitespace after it (in case it is at the end of the sentence), then again any number of characters (possibly none) other than !, ? or ., and it finally ends with a dot, ! or ?.
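For anyone who wants to experiment, the simplest pattern above ([^.]* flung [^.]*\.) can be tried in Python roughly like this. It is a sketch only: the helper name sentences_with is invented for the example, the sample text is the passage from the question cut off after the third sentence, and the pattern still treats every dot as a sentence end.

```python
import re

text = (
    "As she did so, a most extraordinary thing happened. "
    "The bed-clothes gathered themselves together, leapt up suddenly "
    "into a sort of peak, and then jumped headlong over the bottom rail. "
    "It was exactly as if a hand had clutched them in the centre "
    "and flung them aside."
)

def sentences_with(word, text):
    """Return the sentences (runs of non-dot characters) containing `word`."""
    # [^.]* cannot cross a dot, so a match never starts before the
    # previous sentence's final dot.
    pattern = re.compile(rf"[^.]* {re.escape(word)} [^.]*\.")
    # Matches begin with the space that follows the previous dot; strip it.
    return [m.strip() for m in pattern.findall(text)]

print(sentences_with("flung", text))
```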

Find/Match every similar words in word list in notepad++

I have a word list in alphabetical order.
It is arranged in a single column.
I do not use any programming languages.
The list is in Notepad format.
I need to match all similar words and put them on the same line.
I use regex but I can't get correct results.
The original list looks like this:
accept
accepted
accepts
accepting
calculate
calculated
calculates
calculating
fix
fixed
The list I want:
accept accepted accepts accepting
calculate calculated calculates calculating
fix fixed
This seems to work, but you will have to do Replace All multiple times:
Find (^(.+?)\s*?.*?)\R\2 and replace with \1\t\2. The ". matches newline" option should be disabled.
How it works:
It finds some characters at the start of line ^(.+?), then any linebreak \R, and those same characters again \2.
\s*?.*? is used to skip unnecessary characters after multiple Replace All. \s*? skips the first whitespace, and .*? any remaining chars on the line.
Match is replaced with \1\t\2, where \1 is anything matched in (^(.+?)\s*?.*?), and \2 is anything matched with (.+?). \t is used to insert tab character to replace linebreak.
How it breaks:
Note that this will not work well with different words with similar prefix, like:
hand
hands
handle
handles
This will be hand hands handle handles after 2 replaces.
I can imagine doing this programmatically with limited success (take the first word as a root; if a word derived from this root follows, place it on the same line, otherwise take that word as a new root and put it on a new line). This will still fail for irregular words where the root is not the same for all forms.
Without programming, there is a way only with (manual) preprocessing: if there are fewer than 4 forms for a given word in the list, you insert a blank line for each missing verb form, so there are always 4 lines for each word. Then you can use a regex to get each such quadruple onto one line.
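If a small script is an option after all, the root-based idea described above could look roughly like this in Python. Treat it as a sketch under assumptions: the function name group_by_root is made up, and the rule "shares all but at most the last letter of the root" is a heuristic chosen for this example (it lets calculating join calculate). It has the same limits already mentioned: hand, hands, handle and handles still end up on one line, and irregular forms are missed.

```python
import os

def group_by_root(words):
    """Join consecutive words that appear to share a root onto one line."""
    groups = []
    root = None
    for word in words:
        if root is not None:
            shared = len(os.path.commonprefix([root, word]))
            # Heuristic (an assumption): allow the root's last letter to differ.
            threshold = max(len(root) - 1, 1)
        if root is not None and shared >= threshold:
            groups[-1].append(word)
        else:
            # Start a new group with this word as the new root.
            root = word
            groups.append([word])
    return [" ".join(group) for group in groups]

words = ["accept", "accepted", "accepts", "accepting",
         "calculate", "calculated", "calculates", "calculating",
         "fix", "fixed"]
print("\n".join(group_by_root(words)))
```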

How do I find the separator of elements in a string?

I have a string such as "option1;option2;option3" where the ";" separator might be anything. Any string of at least 1 character that the user puts.
I am looking for a simple/clean way to determine the separator without any information other than the input string.
I can guarantee the separator exists only between 2 elements but consider the possibility of only one option in the input string. I can also guarantee that the separator will only be non alphanumerical and may contain space and $ or # or % etc.
Couldn't create a regular expression for this, but perhaps someone will be able to, though I am not particularly looking for a regex expression.
To find the separator:
import re
text = "option1;option2;option3"
separator = re.search(r"[ ;'#/.,<>?~#;,:}{\]\[+=\-_]+", text).group()
Sorry, it was easier to use a regexp for this.
Now it's back to you. You need to prove that this works as you intend against all possible inputs.
Here's a perhaps easier-to-use version:
possible = """ ;'#/.,<>?~#,:}{][+=-_"""
separator = re.search("[%s]+" % re.escape(possible), text).group()
This means that characters with special meaning in regexp can be added or taken away easier
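A variation on the same idea, sketched here under an extra assumption: if the options themselves contain only letters and digits, the separator is simply the first run of non-alphanumeric characters, so no list of candidate characters is needed. The function name find_separator is invented for the example; it returns None when the string holds a single option.

```python
import re

def find_separator(text):
    """Return the first run of non-alphanumeric characters, or None."""
    # Assumes the options consist only of ASCII letters and digits.
    match = re.search(r"[^A-Za-z0-9]+", text)
    return match.group() if match else None

print(repr(find_separator("option1;option2;option3")))
print(repr(find_separator("option1")))
```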
This would work only if you knew for certain that only the characters [A-Za-z0-9_] would appear in fields:
^(\w+)\W(\w+)\W(\w+)$
This is probably not the case, so my solution would be to:
1. Create a list of all possible separators.
2. For each of these separators, run a regex (dynamically constructed in a loop): ^([^X]+)X([^X]+)X([^X]+)$ where X is a separator character.
3. Check whether the number of matches equals the expected number of columns (or go to step 4 if you don't know the number of columns).
4. Run it for every line to see whether the number of matches changes, because a match in the first line could be blind luck.
5. If it matches everywhere, then you have your separator and the number of columns. If it doesn't match, start checking the next separator for every line.
The downside of this solution is that in worst case you'd run your regex for every line of text and for every separator.
Possible optimizations would be to:
Start checking with most common separators first
Instead of running regex for every line for every separator, just count the number of separator characters in entire text. If the number of lines divides the number of separator characters without a remainder, then there's high probability that the separator is valid.
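The per-line check can be sketched in Python as below. Note that this swaps the dynamically built regex for a plain str.count per line, which is closer to the counting optimization; the function name detect_separator and the default candidate list are assumptions made for the example.

```python
def detect_separator(lines, candidates=(",", ";", "\t", "|")):
    """Return (separator, column_count) for the first consistent candidate."""
    # Candidates are tried in order, so list the most common ones first.
    for sep in candidates:
        # Collect how often the candidate occurs on each line.
        counts = {line.count(sep) for line in lines}
        # Accept it only if every line has the same, non-zero count.
        if len(counts) == 1 and counts != {0}:
            return sep, counts.pop() + 1
    return None

print(detect_separator(["a;b;c", "d;e;f"]))
```

A candidate that also occurs inside field values will be rejected by this check, so it can miss a valid separator; it errs on the strict side.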

How can I remove all the text between matches on a line?

I have this problem:
Input text:
this is my text text text and more text
this is my text myspace this is my text
this space is my text space this is my
this is my text this is my text
this space is my text space space myspace
Let's say I want to search for "space".
I would like to have this as output:
this is my text text text and more text
space
space space
this is my text this is my text
space space space space
Matches on the same line have to be separated with a space.
Line without matches must remain as it is.
Same for all other search items.
I have been trying to achieve this all afternoon, but without success.
Can anyone help me?
Solution:
:g/space/s/\(.*space\).*$/\1/|s/.\{-}space/ space/g|s/^ //
Explanation:
This is tricky, but it can be done. It can't be done with a single regular expression, though.
The first thing we do is get rid of anything after the last match (we actually exploit the fact that regular expressions are greedy by default here):
s/\(.*space\).*$/\1/
Then we remove anything between all the internal matches (notice we use the lazy version of * here, \{-}):
s/.\{-}space/ space/g
The previous step will leave an initial space in the result, so we get rid of that:
s/^ //
Fortunately, in vim, we can chain replacements together with the | character. So, putting it all together:
:g/space/s/\(.*space\).*$/\1/|s/.\{-}space/ space/g|s/^ //
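For readers who want to check the logic outside Vim, the same three substitutions can be mirrored in Python roughly as follows. This is a sketch, not a Vim feature: the function name keep_only_matches is made up, Python's .*? plays the role of Vim's .\{-}, and the "word in line" test stands in for the :g/space/ guard.

```python
import re

def keep_only_matches(line, word="space"):
    """Reduce a line to its occurrences of `word`, separated by spaces."""
    if word not in line:
        return line  # lines without a match stay as they are
    pat = re.escape(word)
    # Step 1: drop everything after the last match (greedy .*).
    line = re.sub(rf"(.*{pat}).*$", r"\1", line)
    # Step 2: collapse the text before each match (lazy .*?).
    line = re.sub(rf".*?{pat}", lambda m: " " + word, line)
    # Step 3: remove the leading space left by step 2.
    return re.sub(r"^ ", "", line)

sample = [
    "this is my text text text and more text",
    "this is my text myspace this is my text",
    "this space is my text space this is my",
    "this is my text this is my text",
    "this space is my text space space myspace",
]
for line in sample:
    print(keep_only_matches(line))
```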
Is this tricky line OK for you?
:g/space/s/space/^G/g|s/[^^G]//g|s/^G/space /g
To enter the ^G above, press Ctrl-V Ctrl-G.
The output of the above command is the same as in your example, except for the trailing whitespace after the last pattern (space in this case). That is easy to fix, e.g. by chaining another s/ $// after the :g line.
Kent's solution uses a nice trick that makes it work only for fixed strings, but it's clean and short. Ethan Brown's answer is more general, but also adds complexity with its three steps. I think the best solution can be developed based on the accepted answer in this very similar question.
Contrary to what Ethan Brown assumes, this can indeed be done with a single regular expression substitution. Here it is, in all its ugliness:
:g/space/s/\%(^\|\%(space \)*space\%( \%(.*space\)\#=\)\?\)\zs\%(\%(space \)*space\%( \%(.*space\)\#=\)\?\)\#!.\{-1,}\ze\%(\%(space \)*space\%( \%(.*space\)\#=\)\?\|$\)//g
It becomes somewhat more readable when you use the :DeleteExcept command from my PatternsOnText plugin:
:g/space/DeleteExcept/\%(space \)*space\%( \%(.*space\)\#=\)\?/
Explanation
This deletes everything except
potentially multiple sequential occurrences \%(space \)*
of the word space
including the trailing whitespace when it's not the last match in the line, i.e. there's a following match \%(.*space\)\#= so that the whitespace is not swallowed
or excluding (i.e. deleting) it \? after the last match in the line.
More practical alternative
Though it's a nice challenge to come up with the above solution, in practice, I would also favor a two-step approach, just because it's way simpler:
:g/space/DeleteExcept/space\%( \|$\)/
This leaves behind trailing whitespace that can be pruned with
:%s/ $//

How can I word wrap a string in Perl?

I'm trying to create a loose word wrapping system via a regex in Perl. What I would like is about every 70 characters or so to check for the next whitespace occurrence and replace that space with a newline, and then do this for the whole string. The string I'm operating on may already have newlines in it, but the amount of text between newlines tends to be very lengthy.
I'd like to avoid looping one character at a time or using substr if I can, and I would prefer to edit this string in place as opposed to creating new string objects. These are just preferences, though, and if I can't achieve what I'm looking for without breaking these preferences then that's fine.
Thoughts?
Look at modules like Text::Wrap or Text::Autoformat.
Depending on your needs, even the GNU core utility fold(1) may be an option.
s/(.{70}[^\s]*)\s+/$1\n/
Consume the first 70 characters, then stop at the next whitespace, capturing everything in the process. Then, emit the captured string, omitting the whitespace at the end, adding a newline.
This doesn't guarantee that your lines will be cut off at any strict length such as 80 characters: there's no guarantee the last word it consumes won't be a billion characters long.
Welbog's answer wraps at the first space after 70 characters. This has the flaw that long words beginning close to the end of the line make an overlong line. I would suggest instead wrapping at the last space within the first, say, 81 characters, or wrapping at the first space if you have a >80 character "word", so that only truly unbreakable lines are overlong:
s/(.{1,79}\S|\S+)\s+/$1\n/g;
In modern perl:
s/(?:.{1,79}\S|\S+)\K\s+/\n/g;
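The first of these two substitutions carries over to Python's re.sub almost unchanged; the sketch below is only meant for experimenting with the pattern, not as a replacement for the Perl. The function name rewrap and its width parameter are invented for the example. One detail worth knowing: if the text does not end in whitespace, the pattern breaks the final line early at its last space, so the sketch appends a newline before substituting.

```python
import re

def rewrap(text, width=80):
    """Wrap at the last whitespace within `width` characters per line."""
    # Same pattern as above with 79 replaced by width - 1: either up to
    # `width` characters ending in a non-space, or one overlong word.
    pattern = rf"(.{{1,{width - 1}}}\S|\S+)\s+"
    # Ensure trailing whitespace so the last line is not split early.
    return re.sub(pattern, r"\1\n", text.rstrip() + "\n")

print(rewrap(" ".join(["word"] * 40)))
```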
You can get much, much more control and reliability by using Text::Format
use Text::Format;
print Text::Format->new({columns => 70})->format($text);
This is the one I've always used.
Unlike the accepted solution, it will wrap BEFORE the wrap-length (in this case, 70 characters), unless there's a really long "word" without spaces (such as a URL), in which case it will just place that word on its own line, rather than break it.
s/(?=.{70,})(.{0,70}\n?)( )/\1\2\n /g
This second form handles all line endings: Mac \r, Unix \n, Windows \r\n, and Teletype \n\r, but which one it uses as a replacement still depends on what you put in the replacement clause: I've used \n.
s/(?=.{70,})(.{0,70}(?:\r\n?|\n\r?)?)( )/\1\2\n /g
Both versions also indent all wrapped lines after the first by one space: remove the space before the last /g if you don't want that, but I usually find it nicer.