conditionally remove portion of a line in delimited file - regex

I have a ~ delimited text file with about 20 nullable columns.
I am trying to use SED (from cygwin) to "blank out" the value in column 11 if the following conditions are met...
Column 3 is a zero (0)
Column 11 is in date format mm/dd/yy (I'm not really concerned if it's a valid date)
Here's what I'm trying...
s/\([^~]*~[^~]*~0~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~\)\(\d{2}\/\d{2}\/\d{2}~\)\(.*$\)/\1~\3/
Here's a sample from the file:
Test A~7~1~~~~72742050~~~Z370~10/25/11~~~0~8.58563698~6.40910452~4.59198764~3.18239469~1.72955975~.23345372~-1.30891113~-2.89971394~1~0
Test B~7~0~~~~72742060~~~Z351~05/15/12~05/14/12~~0~18.88910518~12.69425528~9.96182381~6.76077612~6.76077612~3.86279298~.22449489~-.91021010~0~0
Test C~7~0~~~~72742060~~~Z352~06/12/12~ABC~~0~20.60845679~17.54889351~15.52912556~12.43279217~12.43279217~10.32033576~9.35296144~8.09245899~0~0
...and here's what I expect to get back
Test A~7~1~~~~72742050~~~Z370~10/25/11~~~0~8.58563698~6.40910452~4.59198764~3.18239469~1.72955975~.23345372~-1.30891113~-2.89971394~1~0
Test B~7~0~~~~72742060~~~Z351~05/15/12~~~0~18.88910518~12.69425528~9.96182381~6.76077612~6.76077612~3.86279298~.22449489~-.91021010~0~0
Test C~7~0~~~~72742060~~~Z352~06/12/12~ABC~~0~20.60845679~17.54889351~15.52912556~12.43279217~12.43279217~10.32033576~9.35296144~8.09245899~0~0
but the file comes through with line 2 completely unchanged.

You are trying to replace column 12 instead of 11:
\([^~]*~[^~]*~0~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~\)\(\d{2}\/\d{2}\/\d{2}~\)\(.*$\)
1 2 3 4 5 6 7 8 9 10 11 12
If removing one of the [^~]*~ from the end of the first group doesn't fix it, the likely cause is that your version of sed doesn't support \d, or doesn't support repetition written as {2} (in sed's default basic regex syntax the braces have to be escaped, as in \{2\}).
Here is a version that should work everywhere which replaces each \d{2} with [0-9][0-9] (and fixes the incorrect column issue mentioned above):
s/\([^~]*~[^~]*~0~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~[^~]*~\)\([0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]~\)\(.*$\)/\1~\3/
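For comparison, the same conditional blanking can be sketched outside sed by splitting each line on ~ and testing fields by position. This is only an illustration, not part of the sed answer: the function name blank_date_field and its parameters are made up here, and columns are counted from 1. In the sample rows, 05/14/12 is the 12th field when counting from 1, so the sketch takes the column as a parameter rather than hard-coding it.

```python
import re

# mm/dd/yy shape only; like the question, this does not validate the date.
DATE_RE = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{2}")


def blank_date_field(line, date_col, flag_col=3, sep="~"):
    """Blank field `date_col` when field `flag_col` is "0" and the
    target field looks like a date. Columns are 1-based (an assumption
    of this sketch)."""
    fields = line.split(sep)
    if len(fields) < max(flag_col, date_col):
        return line  # too few columns, leave the line untouched
    if fields[flag_col - 1] == "0" and DATE_RE.fullmatch(fields[date_col - 1]):
        fields[date_col - 1] = ""
    return sep.join(fields)
```

Splitting on the delimiter avoids counting [^~]*~ groups by hand, which is where the off-by-one in the regex came from.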

Regex to match some dates matching non-dates

I'm using some Regex to find date strings of the form Jan 12, 2015 or Feb 3, 1999.
The regex I'm using is \w+\s\d{1,2},\s\d{4} and it's working correctly, but the thing is that on the file are also some strings with the form:
Weg 58, 4047 or Strasse 1, 4482 and I also match them.
How can I avoid those non-date matches? My approach is:
The first string (the one of the month, Jan, Feb, etc.) has to have always length 3.
The year has to start with 1 or 2.
The thing is that I don't know how to add these two conditions to my regex. Any help please?
You can make the test right here: https://regex101.com/r/bN2pO0/1
Thanks in advance.
Since the months won't change (i.e. the values are always January through December), we can match on their first 3 characters.
We can then use a OR | operator to select years starting with 1 or 2
/((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s(1|2)\d{3})/ig
https://regex101.com/r/bN2pO0/3
Just as you used \d{1,2} to match a digit 1 or 2 times and \d{4} to match a digit 4 times, you can use \w{3} to match a word character 3 times.
For the year, you can use the pipe "or" operator |.
\w{3}\s\d{1,2},\s(?:1|2)\d{3}
Although, this will also match non-dates of form Abc xy, 1xyz
If you want, you can go with brute force approach or just get rid of regex and use code to capture the dates.
Brute force:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s[0-3]?[0-9],\s[12]\d{3}
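To see the month-list approach in action, here is a small Python sketch. The names MONTHS, DATE_RE and find_dates are invented for this example, and the \b word boundaries are an addition of the sketch rather than part of the answers above.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Month abbreviation, 1-2 digit day, comma, 4-digit year starting with 1 or 2.
DATE_RE = re.compile(r"\b(?:" + MONTHS + r")\s\d{1,2},\s[12]\d{3}\b")


def find_dates(text):
    """Return every substring that has the 'Jan 12, 2015' shape."""
    return DATE_RE.findall(text)
```

Strings such as Weg 58, 4047 are rejected twice over: Weg is not in the month list and 4047 does not start with 1 or 2.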

Bring all rows to one row

In Excel, I have rows like below:
1 2 3 4 5
6 7 8 9 0
9 8 7 6 5
...
I need to bring all of them to the first row:
1 2 3 4 5 6 7 8 9 0 9 8 7 6 5 ...
The numbers of rows and columns are fixed.
What is the fastest way I can achieve this?
Alternatively, can I solve this on a Textpad or Notepad++ using some REGEX grouping?
If you wanted to do it with an Excel formula, pasting the following, starting in column F, would produce it across the top row:
=INDIRECT("r"&CEILING(COLUMN()/5,1)&"c"&IF(MOD(COLUMN(),5)=0,5,MOD(COLUMN(),5)),FALSE)
If your table starts at A2 and the row values are to be copied to A1 onwards, the following should work:
=OFFSET($A$2, (COLUMN()-COLUMN($A$1))/5, MOD(COLUMN()-COLUMN($A$1), 5))
However, I think for just a small table of size 5X10, using '=' sign manually would be the fastest.
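Both formulas rely on the same index arithmetic: output position k (counting from 0) comes from row k divided by 5, rounded down, and column k modulo 5. A minimal Python sketch of that mapping, with a made-up function name flatten_rows and the table held as a list of lists:

```python
def flatten_rows(rows):
    """Lay out a fixed-width table row after row in a single list."""
    width = len(rows[0])
    total = width * len(rows)
    # Position k in the output comes from row k // width, column k % width.
    return [rows[k // width][k % width] for k in range(total)]
```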
I just select and drag them up there. When there was a pattern of projects I wrote a VBA function to do the job, but for most small unique projects select and drag worked for me.
In Notepad++, for Find what : \r\n, Replace with : 'space' with Search Mode Extended, Replace All, then copy result into Excel.
With images :) here:
Replace Carriage Return and Line Feed in Notepad++.

How do I remove first 5 characters in each line in a text file using vi?

How do I remove the first 5 characters in each line in a text file?
I have a file like this:
4 Alabama
4 Alaska
4 Arizona
4 Arkansas
4 California
54 Can
8 Carolina
4 Colorado
4 Connecticut
8 Dakota
4 Delaware
97 Do
4 Florida
4 Hampshire
47 Have
4 Hawaii
I'd like to remove the number and the space at the beginning of each line in my txt file.
:%s/^.\{0,5\}// should do the trick. It also handles lines with fewer than 5 characters.
Use the regular expression ^..... to match the first 5 characters of each line. Use it in a global substitution:
:%s/^.....//
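Outside the editor, the same substitution can be sketched in Python. The function name strip_prefix is invented for this example; the pattern mirrors ^.\{0,5\}, so lines shorter than 5 characters are simply emptied.

```python
import re


def strip_prefix(text, n=5):
    """Remove up to the first n characters of every line in text."""
    # With MULTILINE, ^ matches at the start of each line; '.' never
    # matches a newline, so the match cannot spill into the next line.
    return re.sub(r"^.{0,%d}" % n, "", text, flags=re.MULTILINE)
```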
As all lines are lined up, you don't need a substitution to solve this problem.
Just bring the cursor to the top left position (gg), then:
CTRL+vGwlx
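I think the easiest way is to use cut.
Just type cut -c n- <filename>, where n is the first character position to keep; cut -c 6- <filename> drops the first 5 characters of each line.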
Try
:%s/^.....//
(The % range applies it to every line; a bare :s only changes the current line.) You probably don't need the "^" (start of line), and there'd be shortcuts for the 5 characters - but simple is good :)
Since the text looks like it's columnar data, awk would usually be helpful. I'd use V to select the lines, then hit :! and use awk:
:'<,'>! awk '{ print $2 }'
to print out the second column of the data. Saves you from counting spaces altogether.
:%s/^.\{0,5\}//g also works. The % range is what applies the substitution to every line; the trailing g only matters when a pattern can match more than once per line, which an expression anchored with ^ cannot.
In my case, to delete the first 2 characters of each line I used :%s/^.\{0,2\}// and it works the same with or without g.
I am on VIM - Vi IMproved 8.2, macOS version, normal version without GUI.

Is it possible to increment numbers using regex substitution?

Is it possible to increment numbers using regex substitution? Not using evaluated/function-based substitution, of course.
This question was inspired by another one, where the asker wanted to increment numbers in a text editor. There are probably more text editors that support regex substitution than ones that support full-on scripting, so a regex might be convenient to float around, if one exists.
Also, often I've learned neat things from clever solutions to practically useless problems, so I'm curious.
Assume we're only talking about non-negative decimal integers, i.e. \d+.
Is it possible in a single substitution? Or, a finite number of substitutions?
If not, is it at least possible given an upper bound, e.g. numbers up to 9999?
Of course it's doable given a while-loop (substituting while matched), but we're going for a loopless solution here.
This question's topic amused me for one particular implementation I did earlier. My solution happens to be two substitutions so I'll post it.
My implementation environment is solaris, full example:
echo "0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909" |
perl -pe 's/\b([0-9]+)\b/0$1~01234567890/g' |
perl -pe 's/\b0(?!9*~)|([0-9])(?=9*~[0-9]*?\1([0-9]))|~[0-9]*/$2/g'
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
Pulling it apart for explanation:
s/\b([0-9]+)\b/0$1~01234567890/g
For each number (#) replace it with 0#~01234567890. The leading 0 is there in case a carry is needed (e.g. 9 becoming 10). The 01234567890 block is a lookup table for incrementing. The example text for "9 10" is:
09~01234567890 010~01234567890
The individual pieces of the next regex can be described separately; they are joined via pipes to reduce the substitution count:
s/\b0(?!9*~)/$2/g
Select the "0" digit in front of all numbers that do not need a carry into a new leading digit and discard it.
s/([0-9])(?=9*~[0-9]*?\1([0-9]))/$2/g
(?=) is positive lookahead, \1 is match group #1. So this means match all digits that are followed by 9s until the '~' mark then go to the lookup table and find the digit following this number. Replace with the next digit in the lookup table. Thus "09~" becomes "19~" then "10~" as the regex engine parses the number.
s/~[0-9]*/$2/g
This regex deletes the ~ lookup table.
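The same two substitutions carry over to other engines that support lookahead and backreferences. As a rough sketch (not the author's code), here they are with Python's re module. The function name increment_all is invented here, and the sketch relies on Python 3.5+ replacing a reference to an unmatched group such as \2 with an empty string, much as the Perl one-liner relies on an undefined $2.

```python
import re


def increment_all(text):
    """Increment every non-negative integer in text using two
    template-based substitutions (no callback function)."""
    # Substitution 1: turn each number N into 0N~01234567890.
    prepared = re.sub(r"\b([0-9]+)\b", r"0\g<1>~01234567890", text)
    # Substitution 2, three alternatives joined with |:
    #   - drop the leading 0 unless the whole number is 9s,
    #   - replace each digit followed only by 9s with its table successor,
    #   - delete the ~ and the lookup table.
    return re.sub(
        r"\b0(?!9*~)|([0-9])(?=9*~[0-9]*?\1([0-9]))|~[0-9]*",
        r"\2",
        prepared,
    )
```

Calling increment_all on the sample line from the Perl example should reproduce the output shown there.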
Wow, turns out it is possible (albeit ugly)!
In case you do not have the time or cannot be bothered to read through the whole explanation, here is the code that does it:
$str = '0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139';
$str = preg_replace("/\d+/", "$0~", $str);
$str = preg_replace("/$/", "#123456789~0", $str);
do
{
    $str = preg_replace(
        "/(?|0~(.*#.*(1))|1~(.*#.*(2))|2~(.*#.*(3))|3~(.*#.*(4))|4~(.*#.*(5))|5~(.*#.*(6))|6~(.*#.*(7))|7~(.*#.*(8))|8~(.*#.*(9))|9~(.*#.*(~0))|~(.*#.*(1)))/s",
        "$2$1",
        $str, -1, $count);
} while($count);
$str = preg_replace("/#123456789~0$/", "", $str);
echo $str;
Now let's get started.
So first of all, as the others mentioned, it is not possible in a single replacement, even if you loop it (because how would you insert the corresponding increment to a single digit). But if you prepare the string first, there is a single replacement that can be looped. Here is my demo implementation using PHP.
I used this test string:
$str = '0 1 2 3 4 5 6 7 8 9 10 11 12 13 19 20 29 99 100 139';
First of all, let's mark all digits we want to increment by appending a marker character (I use ~, but you should probably use some crazy Unicode character or ASCII character sequence that definitely will not occur in your target string).
$str = preg_replace("/\d+/", "$0~", $str);
Since we will be replacing one digit per number at a time (from right to left), we will just add that marking character after every full number.
Now here comes the main hack. We add a little 'lookup' to the end of our string (also delimited with a unique character that does not occur in your string; for simplicity I used #).
$str = preg_replace("/$/", "#123456789~0", $str);
We will use this to replace digits by their corresponding successors.
Now comes the loop:
do
{
    $str = preg_replace(
        "/(?|0~(.*#.*(1))|1~(.*#.*(2))|2~(.*#.*(3))|3~(.*#.*(4))|4~(.*#.*(5))|5~(.*#.*(6))|6~(.*#.*(7))|7~(.*#.*(8))|8~(.*#.*(9))|9~(.*#.*(~0))|(?<!\d)~(.*#.*(1)))/s",
        "$2$1",
        $str, -1, $count);
} while($count);
Okay, what is going on? The matching pattern has one alternative for every possible digit. This maps digits to successors. Take the first alternative for example:
0~(.*#.*(1))
This will match any 0 followed by our increment marker ~, then it matches everything up to our cheat-delimiter and the corresponding successor (that is why we put every digit there). If you glance at the replacement, this will get replaced by $2$1 (which will then be 1 and then everything we matched after the ~ to put it back in place). Note that we drop the ~ in the process. Incrementing a digit from 0 to 1 is enough. The number was successfully incremented, there is no carry-over.
The next 8 alternatives are exactly the same for the digits 1 to 8. Then we take care of two special cases.
9~(.*#.*(~0))
When we replace the 9, we do not drop the increment marker, but place it to the left of the resulting 0 instead. This (combined with the surrounding loop) is enough to implement carry-over propagation. Now there is one special case left. For all numbers consisting solely of 9s we will end up with the ~ in front of the number. That is what the last alternative is for:
(?<!\d)~(.*#.*(1))
If we encounter a ~ that is not preceded by a digit (therefore the negative lookbehind), it must have been carried all the way through a number, and thus we simply replace it with a 1. I think we do not even need the negative lookbehind (because this is the last alternative that is checked), but it feels safer this way.
A short note on the (?|...) around the whole pattern. This makes sure that we always find the two matches of an alternative in the same references $1 and $2 (instead of ever larger numbers down the string).
Lastly, we add the DOTALL modifier (s), to make this work with strings that contain line breaks (otherwise, only numbers in the last line will be incremented).
That makes for a fairly simple replacement string. We simply first write $2 (in which we captured the successor, and possibly the carry-over marker), and then we put everything else we matched back in place with $1.
That's it! We just need to remove our hack from the end of the string, and we're done:
$str = preg_replace("/#123456789~0$/", "", $str);
echo $str;
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 30 100 101 140
So we can do this entirely in regular expressions. And the only loop we have always uses the same regex. I believe this is as close as we can get without using preg_replace_callback().
Of course, this will do horrible things if we have numbers with decimal points in our string. But that could probably be taken care of by the very first preparation-replacement.
Update: I just realised, that this approach immediately extends to arbitrary increments (not just +1). Simply change the first replacement. The number of ~ you append equals the increment you apply to all numbers. So
$str = preg_replace("/\d+/", "$0~~~", $str);
would increment every integer in the string by 3.
I managed to get it working in 3 substitutions (no loops).
tl;dr
s/$/ ~0123456789/
s/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/$2$3$4$5/g
s/9(?=9*~)(?=.*(0))|~| ~0123456789$/$1/g
Explanation
Let ~ be a special character not expected to appear anywhere in the text.
If a character is nowhere to be found in the text, then there's no way to make it appear magically. So first we insert the characters we care about at the very end.
s/$/ ~0123456789/
For example,
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909
becomes:
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909 ~0123456789
Next, for each number, we (1) increment the last non-9 (or prepend a 1 if all are 9s), and (2) "mark" each trailing group of 9s.
s/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/$2$3$4$5/g
For example, our example becomes:
1 2 3 4 8 9 19~ 11 29~ 199~ 119~ 299~ 919~ 1999~ 1199~ 1919~ ~0123456789
Finally, we (1) replace each "marked" group of 9s with 0s, (2) remove the ~s, and (3) remove the character set at the end.
s/9(?=9*~)(?=.*(0))|~| ~0123456789$/$1/g
For example, our example becomes:
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
PHP Example
$str = '0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909';
echo $str . '<br/>';
$str = preg_replace('/$/', ' ~0123456789', $str);
echo $str . '<br/>';
$str = preg_replace('/(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)/', '$2$3$4$5', $str);
echo $str . '<br/>';
$str = preg_replace('/9(?=9*~)(?=.*(0))|~| ~0123456789$/', '$1', $str);
echo $str . '<br/>';
Output:
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909
0 1 2 3 7 8 9 10 19 99 109 199 909 999 1099 1909 ~0123456789
1 2 3 4 8 9 19~ 11 29~ 199~ 119~ 299~ 919~ 1999~ 1199~ 1919~ ~0123456789
1 2 3 4 8 9 10 11 20 100 110 200 910 1000 1100 1910
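The three substitutions are not PHP-specific. Below is a hedged Python port using the same patterns; the function name increment_three_steps is made up for this sketch. It assumes a single line of input with no trailing newline (so that $ matches only once) and Python 3.5+, where references to unmatched groups in the replacement template expand to an empty string.

```python
import re


def increment_three_steps(text):
    """Increment every non-negative integer with three loopless,
    template-only substitutions."""
    # 1. Append the character set the later steps copy from.
    step1 = re.sub(r"$", " ~0123456789", text)
    # 2. Bump the last non-9 digit (or prepend 1 if all digits are 9)
    #    and mark any trailing run of 9s with ~.
    step2 = re.sub(
        r"(?=\d)(?:([0-8])(?=.*\1(\d)\d*$)|(?=.*(1)))(?:(9+)(?=.*(~))|)(?!\d)",
        r"\2\3\4\5",
        step1,
    )
    # 3. Turn marked 9s into 0s, drop the markers, drop the character set.
    return re.sub(r"9(?=9*~)(?=.*(0))|~| ~0123456789$", r"\1", step2)
```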
Is it possible in a single substitution?
No.
If not, is it at least possible in a single substitution given an upper bound, e.g. numbers up to 9999?
No.
You can't even replace the numbers between 0 and 8 with their respective successor. Once you have matched, and grouped this number:
/([0-8])/
you need to replace it. However, regex doesn't operate on numbers, but on strings. So you can replace the "number" (or better: digit) with twice this digit, but the regex engine does not know it is duplicating a string that holds a numerical value.
Even if you'd do something (silly) as this:
/(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)/
so that the regex engine "knows" that if group 1 is matched, the digit '0' is matched, it still cannot do a replacement. You can't instruct the regex engine to replace group 1 with the digit '1', group '2' with the digit '2', etc. Sure, some tools like PHP will let you define a couple of different patterns with corresponding replacement strings, but I get the impression that is not what you were thinking about.
It is not possible by regular expression search and substitution alone.
You have to use something else to help achieve that. You have to use the programming language at hand to increment the number.
Edit:
The regular expression definition in the Single Unix Specification doesn't mention regular expressions supporting evaluation of arithmetic expressions or any capability for performing arithmetic operations.
Nonetheless, I know some flavors (TextPad, an editor for Windows) allow you to use \i as a substitution term, which is an incremental counter of how many times the search string has been found, but it doesn't evaluate or parse found strings into a number, nor does it allow adding a number to it.
I have found a solution in two steps (Javascript) but it relies on indefinite lookaheads, which some regex engines reject:
const incrementAll = s =>
s.replaceAll(/(.+)/gm, "$1\n101234567890")
.replaceAll(/(?:([0-8]|(?<=\d)9)(?=9*[^\d])(?=.*\n\d*\1(\d)\d*$))|(?<!\d)9(?=9*[^\d])(?=(?:.|\n)*(10))|\n101234567890$/gm, "$2$3");
The key thing is to add a list of digits in order at the end of the string in the first step, and in the second, to locate the relevant digit and capture the digit to its right via a lookahead. There are two other branches in the second step, one for dealing with initial nines, and the other for removing the number sequence.
Edit: I just tested it in Safari and it throws an error, but it definitely works in Firefox.
I needed to increment indices of output files by one from a pipeline I can't modify. After some searches I got a hit on this page. While the readings are meaningful, they really don't give a readable solution to the problem. Yes it is possible to do it with only regex; no it is not as comprehensible.
Here I would like to give a readable solution using Python, so that others don't need to reinvent the wheel. I can imagine many of you may have ended up with a similar solution.
The idea is to partition file name into three groups, and format your match string so that the incremented index is the middle group. Then it is possible to only increment the middle group, after which we piece the three groups together again.
import re
import sys
import argparse
from os import listdir
from os.path import isfile, join

def main():
    parser = argparse.ArgumentParser(description='index shift of input')
    parser.add_argument('-r', '--regex', type=str,
                        help='regex match string for the index to be shift')
    parser.add_argument('-i', '--indir', type=str,
                        help='input directory')
    parser.add_argument('-o', '--outdir', type=str,
                        help='output directory')
    args = parser.parse_args()
    # parse input regex string
    regex_str = args.regex
    regex = re.compile(regex_str)
    # target directories
    indir = args.indir
    outdir = args.outdir
    try:
        for input_fname in listdir(indir):
            input_fpath = join(indir, input_fname)
            if not isfile(input_fpath): # not a file
                continue
            matched = regex.match(input_fname)
            if matched is None: # not our target file
                continue
            # middle group is the index and we increment it
            index = int(matched.group(2)) + 1
            # reconstruct output
            output_fname = '{prev}{index}{after}'.format(**{
                'prev' : matched.group(1),
                'index' : str(index),
                'after' : matched.group(3)
            })
            output_fpath = join(outdir, output_fname)
            # write the command required to stdout
            print('mv {i} {o}'.format(i=input_fpath, o=output_fpath))
    except BrokenPipeError:
        pass

if __name__ == '__main__': main()
I have this script named index_shift.py. To give an example of the usage, my files are named k0_run0.csv, for bootstrap runs of machine learning models using parameter k. The parameter k starts from zero, and the desired index map starts at one. First we prepare input and output directories to avoid overwriting files:
$ ls -1 test_in/ | head -n 5
k0_run0.csv
k0_run10.csv
k0_run11.csv
k0_run12.csv
k0_run13.csv
$ ls -1 test_out/
To see how the script works, just print its output:
$ python3 -u index_shift.py -r '(^k)(\d+?)(_run.+)' -i test_in -o test_out | head -n5
mv test_in/k6_run26.csv test_out/k7_run26.csv
mv test_in/k25_run11.csv test_out/k26_run11.csv
mv test_in/k7_run14.csv test_out/k8_run14.csv
mv test_in/k4_run25.csv test_out/k5_run25.csv
mv test_in/k1_run28.csv test_out/k2_run28.csv
It generates bash mv commands to rename the files. Now we pipe the lines directly into bash.
$ python3 -u index_shift.py -r '(^k)(\d+?)(_run.+)' -i test_in -o test_out | bash
Checking the output, we have successfully shifted the index by one.
$ ls test_out/k0_run0.csv
ls: cannot access 'test_out/k0_run0.csv': No such file or directory
$ ls test_out/k1_run0.csv
test_out/k1_run0.csv
You can also use cp instead of mv. My files are kinda big, so I wanted to avoid duplicating them. You can also refactor how many you shift as input argument. I didn't bother, cause shift by one is most of my use cases.

Replacing multiple blank lines with one blank line using RegEx search and replace

I have a file that I need to reformat and remove "extra" blank lines.
I am using the Perl syntax regular expression search and replace functionality of UltraEdit and need the regular expression to put in the "Find What:" field.
Here is a sample of the file I need to re-format.
All current text
REPLACE with all the following:
Winter 2011 Class Schedule
Winter 2011 Class Registration Dates: Dec. 6, 2010 – Jan. 1, 2011
Winter 2011 Class Session Dates: Jan. 5 – Feb. 12, 2011
DANCE
Adventures in Ballet & Tap
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old develop a greater sense of rhythm, flexibility and coordination as they explore the basic elements of movement.
Saturdays 9 - 10 a.m. Jan. 8 – Feb. 12 Six-week fees: $30
African Storytelling
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old explore storytelling and fables through spoken word, music, movement and visual arts experiences.
Saturdays 10 – 11 a.m. Jan. 8 – Feb. 12 Six-week fee: $30
African Dance / Children
You'll notice that some of the double blank lines have spaces or tabs or both in them.
After the search and replace has been run I should have a file that looks like this.
All current text
REPLACE with all the following:
Winter 2011 Class Schedule
Winter 2011 Class Registration Dates: Dec. 6, 2010 – Jan. 1, 2011
Winter 2011 Class Session Dates: Jan. 5 – Feb. 12, 2011
DANCE
Adventures in Ballet & Tap
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old develop a greater sense of rhythm, flexibility and coordination as they explore the basic elements of movement.
Saturdays 9 - 10 a.m. Jan. 8 – Feb. 12 Six-week fees: $30
African Storytelling
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old explore storytelling and fables through spoken word, music, movement and visual arts experiences.
Saturdays 10 – 11 a.m. Jan. 8 – Feb. 12 Six-week fee: $30
African Dance / Children
Replacing
^(\s*\r\n){2,}
With
\r\n
Is what I ended up with.
This only selects blank lines in multiples of two or more and replaces them with one.
It depends what the line endings are. Assuming \n, replace this:
([ \t]*\n){3,}
with \n\n.
Try this Perl one-liner: perl -00pe0. If you want in-place editing, just add the -i option.
Replacing
\n\s*\n\s*
with
\n\n
should do the trick
For completeness, I want to reference the long post "Remove / delete blank and empty lines" in the UltraEdit user forums. At the bottom, after all the explanations for newbies, it contains the solution for reducing two or more lines with nothing (empty lines) or just whitespace (blank lines) to one empty line, independent of the line terminator type.
And some words on what Alan Moore wrote in his answer:
UltraEdit's Perl regular expression support is not crippled by its line-based architecture. Perl regular expression engines have a flag which determine if a dot matches all characters except newline characters like carriage return (CR) and line feed (LF) or really all characters including CR and LF. This makes the difference if a text file is interpreted as large byte stream or as a sequence of lines for Perl regular expression finds/replaces. In UltraEdit the flag is set by default to not include \r (CR) and \n (LF) by a dot in the regular expression search string. But this behavior can be easily changed in UltraEdit by starting the regular expression string with (?s) which changes the value of the flag match_not_dot_newline as posted in UltraEdit user forums at topic "." in Perl regular expressions doesn't include CRLFs?
A Perl regular expression replace working for files with
carriage return + line feed (DOS/Windows) or
only line feed (Unix, Mac OS 10.0 and later versions) or
only carriage return (Mac OS 9 and previous versions)
as line ending with optionally trailing spaces and tabs at end of a paragraph (one or more lines) and with two or more lines without (empty line) or with whitespaces (blank line) below the paragraph could be done with search string \h*(\r?\n|\r)(?:\h*\1){2,} and \1\1 as replace string.
Explanation:
\h* matches any horizontal whitespace character according to Unicode 0 or more times. This first part of the search expression matches horizontal whitespace characters at end of a line like horizontal tabs, normal spaces, no-break-spaces and some other not often used spaces.
The usage of \s is not good as this character class matches any whitespace character including the vertical whitespace characters carriage return and line feed.
(\r?\n|\r) ... is an OR expression with two arguments in a marking group. The first argument matches a line feed optionally with a preceding carriage return while the second argument matches just a carriage return. So this expression matches all three common types of line terminations completely correct. It is important for the rest of the search and the replace to match always either CR+LF (both together) or just LF or just CR.
(?:\h*\1) ... is a non marking group which matches 0 or more horizontal whitespaces and the newline as found before back-referenced with \1, i.e. CR+LF or just LF or just CR. So this part of the expression finds an empty or blank line.
{2,} ... is a multiplier for the previous expression in the non marking group which means at least two times. So after end of a paragraph there must be two or more empty or blank lines. Only one empty or blank line below a paragraph is not enough for a positive match of search expression.
The replace string \1\1 references twice the first found line break.
The advantage of this regular expression in comparison to the others posted here is that the line ending type need not be known. The search expression finds that out, and the line ending found is referenced in the replace string. Any trailing whitespace at the end of a paragraph and whitespace on the next line is also removed by this replace if there are two or more empty or blank lines below a paragraph.
{2,} can be replaced by + in search string if trimming whitespaces at end of a paragraph and on next empty or blank line should be also done on running this Perl regular expression replace. But please note that in this case the replace makes replaces which do not change anything at all if there are not trailing whitespaces at end of a paragraph and next line is an empty line.
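For readers without UltraEdit, the line-ending-agnostic search can be tried in Python. This is a sketch under stated assumptions: Python's re module has no \h, so [ \t] stands in for it (spaces and tabs only, not the full Unicode set), and the names BLANK_RUN and squeeze_blank_lines are invented here.

```python
import re

# Trailing whitespace, one line terminator (captured as group 1), then two
# or more further blank lines that use the same terminator.
BLANK_RUN = re.compile(r"[ \t]*(\r?\n|\r)(?:[ \t]*\1){2,}")


def squeeze_blank_lines(text):
    """Reduce runs of two or more empty/blank lines to one empty line."""
    # Writing the captured terminator twice keeps the file's own
    # line-ending style: end of paragraph plus one empty line.
    return BLANK_RUN.sub(r"\1\1", text)
```

A single blank line between paragraphs is left alone, because the pattern needs at least three terminators in a row to match.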
In Vim, use
:%!cat -s
I find this is the easiest way to delete extra empty line so far.
I'm not sure what UltraEdit lets you get away with in the "replace" area, but if you cannot use a newline (I've had this problem before) but can use capture references, this might work:
Find : \s*(\r\n)\s*(\r\n)\s*\r\n
Replace : $1$2
Not tested extensively, but seems to work on the sample you provided.
See this thread for what's causing the problem. As I understand it, UltraEdit regexes are greedy at the character level (i.e., within a line), but non-greedy at the line level (roughly speaking). I don't have access to UE, but I would try writing the regex so it has to match something concrete after the last blank line. For example:
search: (\r\n[ \t]*){2,}(\S)
replace: $1$2
This matches and captures two or more instances of a line separator and any horizontal whitespace that follows it, but it only retains the last one. The \S should force it to keep matching until it finds a line with at least one non-whitespace character.
I admit that I don't have a whole lot of confidence in this solution; UltraEdit's regex support is crippled by its line-based architecture. If you want an editor that does regexes right, and you don't want to learn a whole new regex syntax (like vim's), get EditPadPro.
Should also work with spaces on blank lines
Search - /\n^\s*\n/
Replace - \n\n
In my IntelliJ IDE, what worked was to search for \n\n and replace it with \n.