How to remove space (?) from empty line using perl? - regex

I want to remove the empty lines from this file so that there are only two \n between stanzas of the song. There appear to be spaces on lines 7, 8, and 20, but I'm guessing they aren't regular spaces because I haven't been able to remove them with substitutions that use \s.
The text is reproduced below (with the spaces marked by <-- HERE for clarity), but the Stack Overflow editor seems to have changed the special spaces into regular ones, so you'll have to look at the original file to duplicate my problem.
9a I Believe in a Hill Called Mount Calvary
1 There are things, as we travel this earth's shifting sands,
That transcend all the reason
But the things that matter the most in this world,
They can never be held in our hand
<-- HERE
<-- HERE
Chorus
I believe in a hill called mount Calvary,
I believe whatever the cost!
And when time has surrendered and earth is no more
I'll still cling to that old rugged cross
2 I believe that the Christ who was slain on the cross,
Has the power to change lives today;
For He changed me completely a new life is mine
That is why by the cross I will stay
<-- HERE
3 I believe that this life, with its great mysteries,
Surely someday will come to an end;
But faith will conquer the darkness and death
And will lead me at last to my Friend
I tried perl -pe 's/\n{3,}/\n\n/g' which didn't work as there was some space in the lines 7, 8 and 20.
I can't remove the space, no matter what I try. I tried the following commands:
perl -p0e 's/\s{3,}/\n\n/g'
perl -pe 's/^\s$//g'
perl -pe 's/^ $//g'
perl -pe 's/ $//g'
None of these work. I want to know why this is happening. Could there be a non-space character that acts as a blank?
What should I do to get rid of this?

If you suspect funny characters, look at the file with od -bc filename and look for unusual characters.
I tried your text after removing the <-- HERE marks, and your first alternative, perl -p0e 's/\s{3,}/\n\n/g' file, works just fine. Since the copy posted here contains only regular spaces, that is a strong indication (aka proof :-) that unusual characters in your original file are the cause.

As I observed, the spaces are just non-printable characters. Suggest you try the following:
perl -p0e 's/(?:[\x80-\xFF][\x0D\x0A]{2})+//g'

I found the solution thanks to Jens' suggestion to use od -bc filename.
The dump showed the bytes 302 240 (octal) in place of the space on lines 7, 8, and 20.
Looking up the octal values, I found the following:
man iso_8859-1 identifies \240 as NO-BREAK SPACE
and \302 as LATIN CAPITAL LETTER A WITH CIRCUMFLEX
(Taken together, the byte pair 302 240 is the UTF-8 encoding of U+00A0 NO-BREAK SPACE.)
I then looked up how to remove such characters.
I used the command perl -pi -e 's/[^[:ascii:]]//g' filename to correct this. Note that it deletes every non-ASCII character in the file, not just the no-break spaces.
Thank you for all the tips given and effort put forth.
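The diagnosis above can also be reproduced outside the shell. The following is a minimal Python sketch, assuming the file is UTF-8 encoded; the function names find_non_ascii and squeeze_blank_lines and the sample bytes are hypothetical, chosen only for illustration. It first reports which lines contain non-ASCII bytes (much like scanning the od dump), then blanks whitespace-only lines and collapses runs of empty lines.

```python
import re

def find_non_ascii(data: bytes):
    """Return (line_number, byte_values) for each line containing non-ASCII bytes."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        odd = [b for b in line if b > 0x7F]
        if odd:
            hits.append((lineno, odd))
    return hits

def squeeze_blank_lines(data: bytes) -> str:
    """Blank out whitespace-only lines, then keep at most one empty line in a row."""
    text = data.decode("utf-8")  # assumption: the input is valid UTF-8
    # str.strip() with no arguments also removes U+00A0 (NO-BREAK SPACE)
    lines = [line if line.strip() else "" for line in text.split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))

# Hypothetical sample: two "empty" lines that each hold the bytes 302 240 (octal)
sample = b"held in our hand\n\xc2\xa0\n\xc2\xa0\nChorus\nI believe\n"

print(find_non_ascii(sample))       # [(2, [194, 160]), (3, [194, 160])]
print(squeeze_blank_lines(sample))
```

Unlike stripping every non-ASCII character, this approach leaves accented letters elsewhere in the text untouched.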

I think the following solution can solve your problem
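# Print only the lines that contain something other than whitespace
open my $fh, '<', '/home/httpd/cgi-bin/space.txt' or die "Cannot open file: $!";
while (<$fh>) {
    print unless /^\s*$/;    # skip empty and whitespace-only lines
}
close $fh;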

Related

How to create regex search-and-replace with comments?

I have a bit of a strange problem: I have some code (it's LaTeX, but that does not matter here) that contains long lines made up of several sentences, each ending in a period.
For better version control I want to put each of these sentences on its own line.
This can be achieved via sed 's/\. /.\n/g'.
Now the problem arises if there are comments with potential periods as well.
These comments must not be altered, otherwise they will be parsed as LaTeX code and this might result in errors etc.
As a pseudo example you can use
Foo. Bar. Baz. % A. comment. with periods.
The result should be
Foo.
Bar.
Baz. % ...
Alternatively the comment might go on the next line without any problems.
It would be fine to use Perl if that works out better. I tried a few ideas with different programs (sed and perl), but none did what I expected: either the comment was altered as well, or only the first period was altered (perl -pe 's/^([^%]*?)\. /\1.\n/g').
Can you point me in the right direction?
This is tricky as you're essentially trying to match all occurrences of ". " that don't follow a "%". A negative look-behind would be useful here, but Perl doesn't support variable-width negative look-behind. (Though there are hideous ways of faking it in certain situations.) We can get by without it here using backtracking control verbs:
s/(?:%(*COMMIT)(*FAIL))|\.\K (?!%)/\n/g;
The (?:%(*COMMIT)(*FAIL)) forces replacement to stop the first time it sees a "%" by committing to a match and then unconditionally failing, which prevents back-tracking. The "real" match follows the alternation: \.\K (?!%) looks for a space that follows a period and isn't followed by a "%". The \K causes the period to not be included in the match so we don't have to include it in the replacement. We only match and replace the space.
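The same idea, stopping at the first "%" and only touching what comes before it, can be sketched without backtracking verbs by splitting the line first. This is an illustrative Python sketch rather than a translation of the Perl regex; the function name split_sentences is hypothetical, and it assumes that a "%" preceded by a backslash is an escaped percent sign rather than the start of a comment (it does not handle a doubled backslash before "%").

```python
import re

def split_sentences(line: str) -> str:
    """Break after each '. ' in the part of the line that precedes a comment."""
    # The first '%' not preceded by a backslash is taken as the comment start.
    m = re.search(r"(?<!\\)%", line)
    cut = m.start() if m else len(line)
    body, comment = line[:cut], line[cut:]
    # Replace '. ' with '.\n', unless the space is the last character of the
    # body (so the comment stays on the same line as the final sentence).
    body = re.sub(r"\. (?!$)", ".\n", body)
    return body + comment

print(split_sentences("Foo. Bar. Baz. % A. comment. with periods."))
# Foo.
# Bar.
# Baz. % A. comment. with periods.
```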
Putting the comment by itself on a following line can be done with sed pretty easily, using the hold space:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G'
Or if you want the comment by itself before the rest:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/ *%.*//;s/\. /.\n/g;x;s/[^%]*%/%/;G'
Or finally, it is possible to combine the comment with the last line also:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G;s/\n\([^\n]*\)$/ \1/'

Vonnegutian Imperative: Replace semicolons with periods *correctly*

This question is a little tongue-in-cheek, but I'm also seriously curious as to the proper and cleanest way to do this.
Background: Kurt Vonnegut said that semicolons are worthless pieces of punctuation. All they do is show that you've been to college.
Anyways, I know that if I have a text (e.g., "test.txt") file with a bunch of worthless semicolons separating "closely related" sentences, I can do a find and replace with:
sed 's/;/./g' <test.txt >better.txt
This will replace all those pesky semicolons with periods. However, now I have the problem that all the new periods are followed by words whose first letter is not capitalized (since one does not capitalize a word after a semicolon).
Is there a way (hopefully via sed) to automatically replace all the semicolons in a text file with periods AND also automatically capitalize the first letter of the words following the newly inserted periods?
Thanks,
hft
Short answer. Yes, absolutely.
Here is a way using GNU sed:
$ echo "hello;world;i;am;here" | sed 's/;\(.\)/.\U\1/g'
hello.World.I.Am.Here
For the standard prose case of semi-colons followed by a space use:
$ echo "Hello; world; blah" | sed 's/; *\(.\)/. \U\1/g'
Hello. World. Blah
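The \U escape used above is a GNU sed extension. For comparison, here is a small Python sketch of the same substitution, assuming semicolons are always followed by at least one more character on the same line; the function name semicolons_to_periods is hypothetical.

```python
import re

def semicolons_to_periods(text: str) -> str:
    # Match ';', any spaces after it, and the next character,
    # then emit '. ' followed by that character in upper case.
    return re.sub(r"; *(.)", lambda m: ". " + m.group(1).upper(), text)

print(semicolons_to_periods("Hello; world; blah"))  # Hello. World. Blah
```

A semicolon that is the last character of a line is left alone by this sketch, because "." does not match a newline.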

Using Grep in TextWrangler to move a line up

I've got some sloppy text that needs to be cleaned up. Somehow random line breaks got inserted in the middle of paragraphs.
This is a paragraph
and it got broken into two lines.
The manual way to deal with this would be to:
1. Place my cursor at the beginning of line 2
2. Hit delete to bring that line up to line 1
3. Hit space to separate the two words that get mashed together by doing this
Is there a way to accomplish this with Find and Replace? I know I can find the offending lines with ^[a-z] and checking "Case Sensitive", but that's as far as I can get.
I'm just starting to learn how powerful pattern matching can be and I've solved all the other cleanup issues, but this one still perplexes me.
Using awk in linux
cat file
This is a paragraph
and it got broken into two lines.
This line is fine and should be printed.
Here is another
that has been broken.
awk 'NR>1 {printf "%s"(substr($0,1,1)~/^[[:lower:]]$/?FS:RS),a} {a=$0} END {print a}' file
This is a paragraph and it got broken into two lines.
This line is fine and should be printed.
Here is another that has been broken.
If there is really nothing more to be taken care of, search for \n([a-z]) (with both "Case sensitive" and "Grep" enabled under the Find dialog's "Matching" options) and replace with " \1", that is, a space followed by \1. (The search expression has no leading blank, whilst the replacement does have one.)
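The find-and-replace above amounts to a single regular-expression substitution. A minimal Python sketch of it follows, assuming (as in the question) that a line starting with a lower-case ASCII letter is always a continuation of the previous line; the function name join_broken_lines is hypothetical.

```python
import re

def join_broken_lines(text: str) -> str:
    # A newline directly followed by a lower-case letter is treated as a
    # stray break: replace the newline with a space and keep the letter.
    return re.sub(r"\n([a-z])", r" \1", text)

broken = (
    "This is a paragraph\n"
    "and it got broken into two lines.\n"
    "This line is fine and should be printed.\n"
    "Here is another\n"
    "that has been broken.\n"
)
print(join_broken_lines(broken), end="")
# This is a paragraph and it got broken into two lines.
# This line is fine and should be printed.
# Here is another that has been broken.
```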

sed add text around regex

I would like to be able to go:
sed "s/^\(\w+\)$/leftside\1rightside/"
and have the group matched by \(\w+\) appear between 'leftside' and 'rightside'.
But it seems like I have to pipe it twice, one for the left of the text, another time for the right. If anyone knows a way to do it in one pass, I'd appreciate it.
It's probably not working because of the regex itself. As written, text is added at the beginning and end of the line only if the line consists entirely of word characters (and only if your version of sed supports the \w notation). Also, you didn't escape the +, which you need to do if you are not using the -r option.
Try starting with sed "s/^\(.*\)$/leftside\1rightside/" or just sed "s/.*/leftside&rightside/" and working from that.
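To illustrate the difference between the two patterns, here is a short Python sketch. The function names wrap_line and wrap_word_only are hypothetical, and Python's re syntax stands in for sed's: + needs no escaping, and \1 plays the role that \1 or & plays in the sed commands.

```python
import re

def wrap_line(line: str) -> str:
    # Wrap the entire line, whatever it contains (like s/.*/leftside&rightside/).
    return re.sub(r"^(.*)$", r"leftside\1rightside", line, count=1)

def wrap_word_only(line: str) -> str:
    # Wrap the line only if it consists solely of word characters.
    return re.sub(r"^(\w+)$", r"leftside\1rightside", line, count=1)

print(wrap_line("hello world"))       # leftsidehello worldrightside
print(wrap_word_only("hello"))        # leftsidehellorightside
print(wrap_word_only("hello world"))  # hello world (no match, line unchanged)
```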

Repeating a regex pattern

First, I don't know if this is actually possible but what I want to do is repeat a regex pattern.
The pattern I'm using is:
sed 's/[^-\t]*\t[^-\t]*\t\([^-\t]*\).*/\1/' films.txt
An input of
250. 7.9 Shutter Island (2010) 110,675
Will return:
Shutter Island (2010)
I'm matching all non-tabs (250.), then a tab, then all non-tabs (7.9), then a tab. Next I capture the film title for the backreference, then match all remaining characters (110,675).
It works fine, but I'm learning regex and this looks ugly: the regex [^-\t]*\t is repeated immediately after itself. Is there any way to repeat it, like you can repeat a single character with a{2,2}?
I've tried ([^-\t]*\t){2,2} (and variations) but I'm guessing that is trying to match [^-\t]*\t\t?
Also if there is any way to make my above code shorter and cleaner any help would be greatly appreciated.
This works for me:
sed 's/\([^\t]*\t\)\{2\}\([^\t]*\).*/\2/' films.txt
If your sed supports -r you can get rid of most of the escaping:
sed -r 's/([^\t]*\t){2}([^\t]*).*/\2/' films.txt
Change the first 2 to select different fields (0-3).
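The repeated-group idea carries over directly to other regex engines. Below is a minimal Python sketch, assuming tab-separated input like the sample row; the function name nth_field and the row variable are hypothetical. It also shows why the title needs its own group: a capturing group that is repeated keeps only its last repetition.

```python
import re

def nth_field(line: str, index: int):
    """Return the tab-separated field at 0-based `index`, or None if absent."""
    # Skip `index` fields (each followed by a tab), then capture the next one.
    pattern = r"(?:[^\t]*\t){%d}([^\t]*)" % index
    m = re.match(pattern, line)
    return m.group(1) if m else None

row = "250.\t7.9\tShutter Island (2010)\t110,675"
print(nth_field(row, 2))  # Shutter Island (2010)

# A repeated capturing group only remembers its final repetition:
print(repr(re.match(r"([^\t]*\t){2}", row).group(1)))  # '7.9\t'
```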
This will also work:
sed 's/[^\t]\+/\n&/3;s/.*\n//;s/\t.*//' films.txt
Change the 3 to select different fields (1-4).
To use repetition braces and grouping parentheses with sed properly, you may have to escape them with backslashes, like this:
sed 's/\([^-\t]*\t\)\{3\}.*/\1/' films.txt
Yes, this command will work properly with your example.
If the escaping annoys you, you can add the -r option, which enables extended regex mode, and drop the backslashes before the braces and parentheses.
sed -r 's/([^-\t]*\t){3}.*/\1/' films.txt
I found that this is almost the same as Dennis Williamson's answer, but I'm leaving it because it's a shorter expression that does the same thing.
I think you might be going about this the wrong way. If you simply want to extract the name of the film and its release year, then you could try this regex:
(?:\t)[\w ()]+(?:\t)
As seen in place here:
http://regexr.com?2sd3a
Note that it matches a tab character at the beginning and end of the actual desired string, but doesn't include them in the matching group.
You can repeat things by putting them in parentheses, like this:
([^-\t]*\t){2,2}
And the full pattern to match the title would be this:
([^-\t]*\t){2,2}([^-\t]+).*
You said you tried it. I'm not sure what is different, but the above worked for me on your sample data.
why are you doing things the hard way??
$ awk '{$1=$2=$NF=""}1' file
Shutter Island (2010)
If this is a tab separated file with a regular format I'd use cut instead of sed
cut -d' ' -f3 films.txt
Note there's a single tab between the quotes after the -d which can be typed at the shell prompt by typing ctrl+v first, i.e. ctrl+v ctrl+i
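For a regularly formatted tab-separated file, plain field splitting does the same job as the regex. A rough Python sketch of the cut approach follows; the function name cut_field is hypothetical, and treating lines with too few fields as empty is an assumption of this sketch, not a description of how cut itself behaves.

```python
import io

def cut_field(lines, field=3, delimiter="\t"):
    """Yield the 1-based `field` of each line, split on `delimiter`."""
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        # Assumption in this sketch: lines with too few fields yield ""
        yield parts[field - 1] if len(parts) >= field else ""

films = io.StringIO("250.\t7.9\tShutter Island (2010)\t110,675\n")
print(list(cut_field(films)))  # ['Shutter Island (2010)']
```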