I've got some sloppy text that needs to be cleaned up. Somehow random line breaks got inserted in the middle of paragraphs.
This is a paragraph
and it got broken into two lines.
The manual way to deal with this would be to:
1. Place my cursor at the beginning of line 2.
2. Hit delete to bring that line up to line 1.
3. Hit space to separate the two words that get mashed together by doing this.
Is there a way to accomplish this with Find and Replace? I know I can find the offending lines with ^[a-z] and checking "Case Sensitive", but that's as far as I can get.
I'm just starting to learn how powerful pattern matching can be and I've solved all the other cleanup issues, but this one still perplexes me.
Using awk on Linux:
cat file
This is a paragraph
and it got broken into two lines.
This line is fine and should be printed.
Here is another
that has been broken.
awk 'NR>1 {printf "%s"(substr($0,1,1)~/^[[:lower:]]$/?FS:RS),a} {a=$0} END {print a}' file
This is a paragraph and it got broken into two lines.
This line is fine and should be printed.
Here is another that has been broken.
If there is really nothing more to be taken care of, search for \n([a-z]) (with both "Case sensitive" and "Grep" enabled among Find's "Matching" options) and replace with " \1", that is, a single space followed by \1. (The search expression has no leading blank, whereas the replacement does have one.)
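The same substitution can be sketched outside the editor. The following is a minimal Python illustration, not the editor's own implementation; the function name `join_broken_lines` and the sample text are made up for the example, and it assumes that a continuation line always starts with a lowercase ASCII letter.

```python
import re

def join_broken_lines(text: str) -> str:
    """Join a line onto the previous one when it starts with a lowercase letter."""
    # A newline followed by a lowercase ASCII letter becomes a space plus that letter.
    return re.sub(r"\n([a-z])", r" \1", text)

sample = (
    "This is a paragraph\n"
    "and it got broken into two lines.\n"
    "This line is fine and should be printed.\n"
    "Here is another\n"
    "that has been broken.\n"
)
print(join_broken_lines(sample), end="")
```

Lines that legitimately start with a lowercase letter (for example a list item) would be joined as well, so the result is worth a quick review.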
Related
Suppose I want to remove all lines matching a regex in my project. Is there a way to do that?
Using the global find and replace function with regexes enabled I've tried:
Replace foo|bar with an empty string. This doesn't work because it leaves the line there with an empty string. I want the newline removed.
Replace (foo|bar)\n with an empty string. This doesn't actually match anything.
Replace (foo|bar)$ with an empty string. Again, doesn't match anything.
Any ideas?
Edit: It seems like some of my files have Windows line endings so (foo|bar)\r?\n does match. However when you replace it with an empty string it actually still leaves the line endings there.
Here's a test case:
a
foo
b
It should end up like this:
a
b
Not like this (an empty line is left where foo was):
a

b
foo\n^ and (foo|bar)\n^ both work.
I just tested this in VS Code; leave the replacement string blank.
Yes, it is possible to remove entire lines with the search-across-files feature.
I'm guessing the original problem was due to a bug or otherwise unwanted behavior in an older version of VSCode. With VSCode 1.37.1, so long as the \n is included in the regex, the line is removed. In particular, the regex (foo|bar)\n, described in the original question as not working, now works fine.
Before: [screenshot]
After pressing the "Replace All" button: [screenshot]
Related observations:
This same regex works even if I set the file line endings to CRLF.
Appending ^ makes no difference. That's a bit surprising, but perhaps "after newline" counts as "beginning of line".
Appending $ causes the regex to not match anything. That is quite surprising given the behavior of ^.
I looked through the search configuration settings, but nothing seemed like it could affect this.
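For comparison, here is a small Python sketch of the same idea: the line ending has to be part of the match, otherwise an empty line stays behind. The helper name `remove_matching_lines` is hypothetical, and unlike the editor regexes above it removes any line that contains a match, not only lines that end with one.

```python
import re

def remove_matching_lines(text: str, pattern: str) -> str:
    """Delete every line that contains a match for `pattern`, including its line ending."""
    # re.MULTILINE lets ^ anchor at the start of each line. The trailing \r?\n?
    # consumes an LF or CRLF ending, so no empty line is left behind.
    line_re = re.compile(rf"^.*(?:{pattern}).*\r?\n?", re.MULTILINE)
    return line_re.sub("", text)

print(repr(remove_matching_lines("a\nfoo\nb\n", "foo|bar")))
```

The sketch works on a single string; applying it across a project would mean reading and rewriting each file, which is left out here.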
I want to remove the empty lines from this file so that there are only two \n between stanzas of the song. There appear to be spaces on lines 7, 8, and 20, but I'm guessing they aren't regular spaces because I haven't been able to remove them with substitutions that use \s.
The text is reproduced below (with the spaces marked by <-- HERE for clarity), but the Stack Overflow editor seems to have changed the special spaces into regular ones, so you'll have to look at the original file to duplicate my problem.
9a I Believe in a Hill Called Mount Calvary
1 There are things, as we travel this earth's shifting sands,
That transcend all the reason
But the things that matter the most in this world,
They can never be held in our hand
<-- HERE
<-- HERE
Chorus
I believe in a hill called mount Calvary,
I believe whatever the cost!
And when time has surrendered and earth is no more
I'll still cling to that old rugged cross
2 I believe that the Christ who was slain on the cross,
Has the power to change lives today;
For He changed me completely a new life is mine
That is why by the cross I will stay
<-- HERE
3 I believe that this life, with its great mysteries,
Surely someday will come to an end;
But faith will conquer the darkness and death
And will lead me at last to my Friend
I tried perl -pe 's/\n{3,}/\n\n/g', which didn't work because lines 7, 8 and 20 contain some kind of space.
I can't remove the space, no matter what I try. I tried the following commands:
perl -p0e 's/\s{3,}/\n\n/g'
perl -pe 's/^\s$//g'
perl -pe 's/^ $//g'
perl -pe 's/ $//g'
None of these work. I want to know why this is happening. Could there be a non-space character that acts as a blank?
What should I do to get rid of this?
If you suspect funny characters, look at the file with od -bc filename and look for unusual characters.
I have used your file, after removing the <-- HERE marks, and your first alternative perl -p0e 's/\s{3,}/\n\n/g' file works just fine. This is a strong indication (aka proof :-) that something like this is the cause.
From what I observed, the "spaces" are just non-printable characters. I suggest you try the following:
perl -p0e 's/(?:[\x80-\xFF][\x0D\x0A]{2})+//g'
I found the solution thanks to Jens' suggestion to use od -bc filename.
The dump showed the octal bytes 302 240 in place of the space on lines 7, 8 and 20.
On searching for details on those octal values, I got the following from here:
man iso_8859-1 identifies \240 as NO-BREAK SPACE
and \302 as LATIN CAPITAL LETTER A WITH CIRCUMFLEX. (Taken together, the two bytes 302 240, i.e. 0xC2 0xA0, are the UTF-8 encoding of a single NO-BREAK SPACE, U+00A0.)
I found how to remove the characters from here.
I used the command perl -pi -e 's/[^[:ascii:]]//g' filename to correct this.
Thank you for all the tips given and effort put forth.
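The diagnosis can be checked in a few lines. The sketch below is only an illustration in Python, with a made-up sample (`raw`) standing in for the song file; it converts no-break spaces to ordinary spaces instead of deleting every non-ASCII character, which is a deliberately narrower fix than the Perl command above.

```python
import re

# Hypothetical sample: two "blank" lines that really hold the bytes 0xC2 0xA0.
raw = b"line one\n\xc2\xa0\n\xc2\xa0\nChorus\n"

# The bytes od printed as octal 302 240 are 0xC2 0xA0.
assert (0o302, 0o240) == (0xC2, 0xA0)

text = raw.decode("utf-8")
# Decoded as UTF-8, that byte pair is the single code point U+00A0 (NO-BREAK SPACE).
assert "\u00a0" in text

def squeeze_blank_lines(s: str) -> str:
    """Turn no-break spaces into plain spaces, then collapse runs of blank lines."""
    s = s.replace("\u00a0", " ")
    s = re.sub(r"[ \t]+$", "", s, flags=re.MULTILINE)  # strip trailing blanks
    return re.sub(r"\n{3,}", "\n\n", s)                # keep at most one empty line

print(squeeze_blank_lines(text), end="")
```

Replacing only U+00A0 leaves any other legitimate non-ASCII characters, such as typographic apostrophes, in place.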
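I think the following script will solve your problem:
open(my $fh, '<', '/home/httpd/cgi-bin/space.txt') or die "Cannot open file: $!";
while (my $line = <$fh>)
{
    # Print only lines that contain something other than whitespace
    print $line if $line !~ /^\s*$/;
}
close($fh);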
I have a bit of a strange problem: I have some source text (it's LaTeX, but that does not matter here) that contains long lines made up of several sentences, each ending in a period.
For better version control I want to put each of these sentences on its own line.
This can be achieved via sed 's/\. /.\n/g'.
Now the problem arises if there are comments with potential periods as well.
These comments must not be altered, otherwise they will be parsed as LaTeX code and this might result in errors etc.
As a pseudo example you can use
Foo. Bar. Baz. % A. comment. with periods.
The result should be
Foo.
Bar.
Baz. % ...
Alternatively the comment might go on the next line without any problems.
It would be OK to use Perl if that works out better. I tried a few ideas with different programs (sed and perl), but none did what I expected: either the comment was altered as well, or only the first period was altered (perl -pe 's/^([^%]*?)\. /\1.\n/g').
Can you point me in the right direction?
This is tricky as you're essentially trying to match all occurrences of ". " that don't follow a "%". A negative look-behind would be useful here, but Perl doesn't support variable-width negative look-behind. (Though there are hideous ways of faking it in certain situations.) We can get by without it here using backtracking control verbs:
s/(?:%(*COMMIT)(*FAIL))|\.\K (?!%)/\n/g;
The (?:%(*COMMIT)(*FAIL)) forces replacement to stop the first time it sees a "%" by committing to a match and then unconditionally failing, which prevents back-tracking. The "real" match follows the alternation: \.\K (?!%) looks for a space that follows a period and isn't followed by a "%". The \K causes the period to not be included in the match so we don't have to include it in the replacement. We only match and replace the space.
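The effect of that substitution can also be sketched without backtracking verbs by cutting the line at the comment first. This Python version is an illustration of the intended result rather than a translation of the regex; `split_sentences` is a hypothetical helper, and it assumes the first "%" on a line starts the comment (an escaped "\%" is not handled).

```python
def split_sentences(line: str) -> str:
    """Break after each '. ' in the code part of a line; leave the % comment untouched."""
    # Assumption: the first "%" starts the comment; an escaped "\%" is not handled.
    code, sep, comment = line.partition("%")
    stripped = code.rstrip(" ")
    trailing = code[len(stripped):]  # spaces between the last sentence and the comment
    return stripped.replace(". ", ".\n") + trailing + sep + comment

print(split_sentences("Foo. Bar. Baz. % A. comment. with periods."))
```

Keeping the trailing spaces separate is what stops the space before "%" from being turned into a newline.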
Putting the comment by itself on a following line can be done with sed pretty easily, using the hold space:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G'
Or if you want the comment by itself before the rest:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/ *%.*//;s/\. /.\n/g;x;s/[^%]*%/%/;G'
Or finally, it is possible to combine the comment with the last line also:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G;s/\n\([^\n]*\)$/ \1/'
I have searched everywhere for an answer to this, but I think I must not be using the right lingo... I have text like this:
This text is actually just
one paragraph, but every
few words are broken to a
new line, and that's
annoying as hell, because
I have to go to each line
and fix it by hand...
Then there's a second
paragraph which does the
same thing.
I would like to convert that to:
This text is actually just one paragraph, but every few words are broken to a new line, and that's annoying as hell, because I have to go to each line and fix it by hand...
Then there's a second paragraph which does the same thing.
I've tried as many regex techniques as I could think of in TextMate, and can't find any macros or commands to re-wrap the text... The text in question is a result of content editors on one of my sites pasting from Word... I think they may even type this way (holdover from typewriter days!).
Based on your comment, there's probably something you can do with lookaheads. I tried it, but it didn't work (perhaps I didn't try hard enough). So you can do this with a series of commands instead.
First replace any series of spaces with just a single space character:
:%s/ \+/ /g
Then replace all newlines with a space:
:%s/\n/ /g
Then replace all double spaces (which is what the empty lines between paragraphs have become) with double newlines. Note the two spaces in the pattern:
:%s/  /^M^M/g
The ^M can be obtained in vim by doing CTRL+V CTRL+M.
Or, you could even do:
:%s/  /\r\r/g
This is a little crude, but it should work :)
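The three substitutions can be condensed into one pass. The following is a rough Python equivalent, assuming paragraphs are separated by at least one empty (or whitespace-only) line; `unwrap_paragraphs` is a name chosen for the example.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; paragraphs are separated by one or more blank lines."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() with no argument collapses every run of whitespace, newlines included.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

wrapped = (
    "This text is actually just\n"
    "one paragraph, but every\n"
    "few words are broken to a\n"
    "new line.\n"
    "\n"
    "Then there's a second\n"
    "paragraph which does the\n"
    "same thing.\n"
)
print(unwrap_paragraphs(wrapped))
```

Splitting on blank lines first avoids the intermediate "double space" marker, so text that already contains two consecutive spaces is not mistaken for a paragraph break.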
I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activates PCRE_DOTALL, which means that . matches any character, including a newline
\N matches any character except a newline, even with PCRE_DOTALL activated
.*? matches . in non-greedy mode, that is, it stops as soon as possible
^ matches the start of a line
\1 is a backreference to the first group (\s*); it is an attempt to match the same indentation as the line on which the method starts
As you can imagine, this search prints the main method in a C (*.c) source file.
I am not very good at grep, but your problem can be solved with awk.
Try this:
awk '/select/,/from/' *.sql
The command above prints every range of lines from a line matching select through the next line matching from. You then need to check whether the returned statements contain customerName; for that you can pipe the result into awk or grep again.
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
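$/ = "\n\n"; # Paragraphs
while (<>)
{
    # /s lets . match newlines so the statement may span lines; /i ignores case
    if ($_ =~ m/SELECT.*customerName.*FROM/si)
    {
        print "$ARGV\n";   # name of the file currently being read
        close(ARGV);       # skip the rest of this file and go to the next one
    }
}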
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.
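As a rough sketch of the whole approach (recursive walk plus a multi-line match), here is a Python version. It is not the Perl/File::Find solution described above: it reads each file whole instead of paragraph by paragraph, the names `contains_query` and `find_sql_files` are invented for the example, and the pattern can still match a select in one statement and a from in a later one.

```python
import os
import re

# re.DOTALL lets "." cross newlines and tabs; re.IGNORECASE mirrors grep -i.
PATTERN = re.compile(
    r"\bselect\b.*?\bcustomerName\b.*?\bfrom\b",
    re.IGNORECASE | re.DOTALL,
)

def contains_query(sql: str) -> bool:
    """True if select ... customerName ... from occurs, possibly across lines."""
    return PATTERN.search(sql) is not None

def find_sql_files(root: str):
    """Yield paths of *.sql files under `root` whose text matches PATTERN."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning in place keeps os.walk from descending into .svn directories.
        dirnames[:] = [d for d in dirnames if d != ".svn"]
        for name in filenames:
            if name.endswith(".sql"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as fh:
                    if contains_query(fh.read()):
                        yield path

print(contains_query("SELECT id,\n\tcustomerName\nFROM customers"))
```

Calling `find_sql_files(".")` would list the matching files below the current directory, much like grep -l.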