N-Insert In Sed - regex

How would I replace the first 150 characters of every line with spaces using sed? I don't want to bring out the big guns (Python, Perl, etc.), and it seems like sed is pretty powerful in itself (some people have written the dc calculator in it, for example). A related problem is inserting 150 spaces in front of every line.
s/^.\{50\}/?????/
That matches the first 50 characters, but then what? I can't write the replacement as
s/^.\{50\}/ \{50\}/
One could use readline to simulate 150 key presses (Alt+150, then space) but that's lame.

Use the hold space.
/^.\{150\}/ {;# lines shorter than 150 chars fall through and are simply echoed
h;# make a backup
s/^.\{150\}\(.*\)/\1/;# strip 150 characters
x;# swap backup in and remainder out
s/^\(.\{150\}\).*/\1/;# only keep 150 characters
s/./ /g;# replace every character with a space
G;# append newline + remainder to spaces
s/\n//;# strip pesky newline
};
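As a quick sanity check, here is the same hold-space dance scaled down to 10 characters so it fits on one line (GNU sed, which accepts ; as a command separator):
$ echo '1234567890abc' | sed '/^.\{10\}/{h;s/^.\{10\}\(.*\)/\1/;x;s/^\(.\{10\}\).*/\1/;s/./ /g;G;s/\n//;}'
          abc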
Alternatively, you could code a loop (somewhat shorter code):
s/^.\{150\}/x/;# strip 150 characters & mark beginning of line
t l
# if first substitution matched, then loop
d; # else abort and go to next input line
:l
# begin loop
s/^/ /;# insert a space
/^ \{150\}x/!t l
# if line doesn't begin with 150 spaces, loop
s/x//;# strip beginning marker
However, I don't think you can use a loop without sed -f, or finding some other way to escape newlines. Label names seem to run to the end of the line, right through a ;.
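In practice that means saving the loop to a file and running it with -f (pad150.sed is just an illustrative name); GNU sed also terminates labels at a ;, as the :a;...;ta one-liners further down this page show, so there the whole thing can be collapsed onto one line:
$ sed -f pad150.sed file
$ sed 's/^.\{150\}/x/;tl;d;:l;s/^/ /;/^ \{150\}x/!tl;s/x//' file
Like the script above, the d drops lines shorter than 150 characters instead of printing them.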

Basically you want to do this:
sed -r "s/^.{150}/$(giveme150spaces)/"
The question is just what's the most elegant way to programmatically get 150 spaces. A variety of techniques are discussed here: Bash Hacker's Wiki. The simplest is to use printf '%150s', like this:
sed -r "s/^.{150}/$(printf '%150s')/"

The logic is to iterate over the file, print 150 spaces, and then print the rest of the line from position 151 to the end using substr(). For example, with the first 10 characters:
$ more file
1234567890abc
0987654321def
$ awk '{printf "%10s%s\n", " ",substr($0,11)}' file
          abc
          def
This is much simpler than crafting a regex.
Bash:
while read -r line
do
    printf "%10s%s\n" " " "${line:10}"
done <"file"

This might work for you:
sed ':a;/^ \{150\}/!{s/[^ ]/ /;ta}' file
This will replace the first 150 characters of a line with spaces. However, if the line is shorter than 150 characters it replaces them all with spaces but retains the original length.
This solution replaces the first 150 characters with spaces or increases the line to 150 spaces long:
sed ':a;/^ \{150\}/!{s/[^ ]/ /;ta;s/^/ /;ba}' file
And this replaces the first 150 characters of a line with spaces but leaves shorter lines untouched:
sed '/.\{150\}/!b;:a;/^ \{150\}/!{s/[^ ]/ /;ta}' file
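Checking the first command at a width of 10 instead of 150 (GNU sed):
$ echo '1234567890abc' | sed ':a;/^ \{10\}/!{s/[^ ]/ /;ta}'
          abc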

Related

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay), but any substring with special characters (including tokens consisting only of punctuation, like the !! above) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z\x27-]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The Perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z\x27-]+[?!;:,.]?|[\d.]+)$ is tested against each whitespace-delimited word (a quick way to try it on individual words is shown after the breakdown)
^ starts with
[a-zA-Z\x27-]+ one or more lowercase or capital letters, dashes, or single quotes (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
| alternately match
[\d.]+ one or more digits or dots
$ end
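A quick way to try the pattern on individual words is to feed candidates to grep -E, one per line; here [0-9.] stands in for [\d.], and the \x27 apostrophe is left out of the class only to keep the shell quoting simple:
$ printf '%s\n' 'piping-hot' 'awesome!' '2.5' '#derik' 'http://url.picture.whatever' | grep -E '^([a-zA-Z-]+[?!;:,.]?|[0-9.]+)$'
piping-hot
awesome!
2.5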
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line, for example:
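A sketch of both passes combined (same caveat: this only removes words that start with a special character, not words that merely contain one later on):
sed -E -e 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' \
       -e 's/^[^a-zA-Z0-9[:space:]][^[:space:]]*//' file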

Regex to move second line to end of first line

I have several lines with certain values and I want to merge every second line, or every line beginning with <name>, onto the end of the line ending with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so I end up with something like
<id>rd://data1/8b</id><name>DM_test1</name>
The reason why it came out like this is because I used two piped XPath queries.
Regex
Simply remove the newline at the end of a line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (which is Perl syntax). On Linux, search for (<\/id>)\n and replace it with the same thing.
awk
The ideal solution uses awk. The idea is simple: when the line number is odd, we print the line without a newline; otherwise we print it with one.
awk '{ if(NR % 2) { printf $0 } else { print $0 } }' file
sed
Using sed we place a line in the hold space when it contains <id> and append the next line to it when it's a <name> line. Then we remove the newline and print the hold buffer by exchanging it with the pattern space.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
Using pr we can achieve a similar goal:
pr -s --columns 2 file
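If the input really is strictly alternating <id>/<name> pairs like the sample, paste can also glue consecutive lines together; -d '\0' is the POSIX way to specify an empty delimiter:
$ paste -d '\0' - - < file
<id>rd://data1/8b</id><name>DM_test1</name>
<id>rd://data2/76f</id><name>DM_test_P</name>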

Line-insensitive pattern-matching – How can some context be displayed?

I'm looking for a technique to search a file for a pattern (typically a phrase) that may span multiple lines, and print the match with some surrounding context on one line. The file's lines may be too long or too short for a sensible amount of context; I'm not concerned to print a single line of the file, as you might do with grep, but rather to print onto a single line of my terminal.
Basic requirements
Show a specified number of characters before and after the match, even if it straddles lines.
Show newlines as ‘\n’ to prevent flooding the terminal with whitespace if there are many short lines.
Prefix output line with line and column number of the start of the match.
Preferably a sed oneliner.
So far, I'm assuming that the pattern has a constant length shorter than the width of the terminal, which is okay and very useful for most phrases I might want to search for.
Further considerations
I would be interested to see how the following could also be achieved using sed or the likes:
Prefix output line with line and column number range of the match.
Generalise for variable length patterns, truncating the middle of the match to ‘[…]’ if too long.
Can I avoid using something like ‘[ \n]’ between words in a phrase regex on a file that has been ‘hard-wrapped’ using newlines, without altering what's printed?
Using the output of stty size to dynamically determine the terminal width may be useful, though I'd probably prefer to leave it static in case I want to resize the terminal or use it from screen attached from terminals of different sizes.
Examples
The basic idea for 10 characters of context would be something like:
‘excessively long line with match in the middle\n’ → ‘line with match in the mi’
‘short\nlines\n\nmatch\nlots\nof\nshort\nlines\n’ → ‘rt\nlines\n\nmatch\nlots\nof\ns’
Here's a command to return the 20 characters surrounding a pattern, spanning newlines and including them as a character:
$ input="test.txt"
$ pattern="match"
$ tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g'
line with match in the mi
rt\nlines\n\nmatch\nlots\nof\ns
With row number of the match as well:
$ paste <(grep -n ${pattern} "$input" | cut -d: -f1) \
<(tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g')
1 line with match in the mi
5 rt\nlines\n\nmatch\nlots\nof\ns
I realise this doesn't quite fulfill all of your basic requirements, but am not good enough with awk to do better (guess this is technically possible in sed, but I don't want to think about what it would look like).
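For what it's worth, here is a rough sketch of that in GNU awk; pat and ctx are illustrative names, the RS trick slurps the whole file into memory, and newlines in the context are shown as \n:
awk -v pat="match" -v ctx=10 '
  BEGIN { RS = "^$" }                    # GNU awk: read the whole file as one record
  {
    hay = $0; off = 0
    while ((pos = index(hay, pat)) > 0) {
      abs = off + pos                    # 1-based offset of the match in the file
      pre = substr($0, 1, abs - 1)
      n   = split(pre, parts, "\n")
      line = (n == 0) ? 1 : n                        # line number of the match
      col  = (n == 0) ? 1 : length(parts[n]) + 1     # column of its first character
      start = abs - ctx; if (start < 1) start = 1
      snippet = substr($0, start, abs - start + length(pat) + ctx)
      gsub(/\n/, "\\\\n", snippet)       # render embedded newlines as a literal \n
      printf "%d:%d\t%s\n", line, col, snippet
      off = abs; hay = substr($0, abs + 1)
    }
  }' test.txt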

What does this sed expression from todo.sh do?

What does the sed expression: G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P do? Exactly what does it match and how does it match it?
It's from todo.sh. In context:
archive()
{
    # defragment blank lines
    sed -i.bak -e '/./!d' "$TODO_FILE"                    ## delete all empty lines
    [ $TODOTXT_VERBOSE -gt 0 ] && grep "^x " "$TODO_FILE" ## if verbose mode print completed tasks..
    grep "^x " "$TODO_FILE" >> "$DONE_FILE"               ## append completed tasks to $DONE_FILE
    sed -i.bak '/^x /d' "$TODO_FILE"                      ## delete completed tasks
    cp "$TODO_FILE" "$TMP_FILE"
    sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P' "$TMP_FILE" > "$TODO_FILE"
    ## G;                        Add a newline
    ## s/\n/&&/;                 Substitute newline with && (two newlines?)
    ## /^\([ ~-]*\n\).*\n\1/d;   Delete duplicate lines???
    ## s/\n//                    Remove newlines
    ## h                         Hold: copy pattern space to buffer
    ## P                         Print first line of pattern space
    if [ $TODOTXT_VERBOSE -gt 0 ]; then
        echo "TODO: $TODO_FILE archived."
    fi
}
Ok, you've got some of the story already. Recall that the sed expression is executed for each input line. So the G at the beginning appends the contents of the hold space to the current line (with a newline in between). The contents of the hold space is empty initially but expanded by the h command at the end of each input cycle.
Then s/\n/&&/ duplicates the first newline only, the one between the current line and what was grabbed from the hold space. This is in preparation for the next command. /^\([ -~]*\n\).*\n\1/ indeed matches if the current line is identical to a line in the hold space:
    ^\([ -~]*\n\) matches a line at the beginning of the buffer¹
        Note that this matches only if the line contains only printable ASCII characters.
        If your system supports locales, ^\([[:print:]]*\n\) would be better.
    .*\n matches at least one subsequent line
    \1 matches a line identical to the first line
The extra newline added by the previous s command takes care of the case when the duplicate is the very first line from the hold space. The point of the \n\1 is to “anchor” the duplicate at the beginning of a line, otherwise bar would be considered a duplicate of foobar. If the current line is a duplicate, the d command discards it and execution branches to the next line.
If the current line is not a duplicate, s/\n// discards that extra newline (again, no g modifier, so only the first newline is removed). Then the h command results in the hold space containing what it contained before, with the current line prepended. Finally P prints the current input line.
Ok, now what does the hold space contain? It starts empty, then gets each successive line prepended unless it's a duplicate. So the hold space contains the input lines, in reverse order, minus the duplicates.
¹ Uh, I don't know how you did that, but that should be [ -~], not [ ~-] which wouldn't make any sense.
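As a quick sanity check, the one-liner (with the corrected [ -~] class) keeps the first occurrence of each line, in input order:
$ printf 'b\na\nb\nc\na\n' | sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
b
a
c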
Here's another way of doing this, if you have a POSIX-conforming set of tools (Single Unix v2 is good enough).
<"$TMP_FILE" \
nl -s: | # add line numbers
sort -t: -k2 -u | # sort, ignoring the line numbers, and remove duplicates
sort -t: -k1 -n | # sort by line number
cut -d: -f2- # cut out the line numbers
Oh, you wanted to do this legibly and concisely? Just use awk.
<"$TMP_FILE" awk '!seen[$0] {++seen[$0]; print}'
If the current line hasn't been seen yet, mark it as seen, and print it.
Note that like the sed method, the awk method essentially stores the whole file in memory. The method above using sort has the advantage that only sort needs to keep more than one line of input at a time, and it's designed for this.
Of course, if you don't care about the order of the lines, it's as simple as sort -u.
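On that same tiny input, the awk version gives an identical result:
$ printf 'b\na\nb\nc\na\n' | awk '!seen[$0] {++seen[$0]; print}'
b
a
c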
After Gilles presented his excellent answer I found Famous Sed One-Liners Explained, which includes this exact sed expression; adding here for reference:
70. Delete duplicate, nonconsecutive lines from a file.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
This is a very tricky one-liner. It stores the unique lines in the hold buffer and, at each newly read line, tests whether the new line is already in the hold buffer. If it is, the new line is purged. If it's not, it's saved in the hold buffer for future tests and printed.
A more detailed description: at each line this one-liner appends the contents of the hold buffer to the pattern space with the "G" command. The appended string gets separated from the existing contents of the pattern space by a "\n" character. Next, a substitution replaces that "\n" character with two, "\n\n". The substitute command "s/\n/&&/" does that: the "&" means the matched string, and as the matched string was "\n", "&&" is two copies of it, "\n\n". Next, a test "/^\([ -~]*\n\).*\n\1/" is done to see if the contents of capture group 1 are repeated. Capture group 1 is all the characters from space " " to "~" (which includes all printable chars); the "[ -~]*" matches that. Replacing one "\n" with two was the key idea here: as "\([ -~]*\n\)" is greedy (matches as much as possible), the doubled newline makes sure that it matches as little text as possible. If the test is successful, the current input line was already seen and "d" purges the whole pattern space and starts script execution from the beginning. If the test was not successful, the doubled "\n\n" gets replaced with a single "\n" by the "s/\n//" command. Then "h" copies the whole string to the hold buffer, and "P" prints the new line.

Replace CR/LF in a text file only after a certain column

I have a large text file I would like to put on my ebook-reader, but the formatting becomes all wrong because all lines are hard wrapped at or before column 80 with CR/LF, and paragraphs/headers are not marked differently, only a single CR/LF there too.
What I would like is to replace all CR/LF's after column 75 with a space. That would make most paragraphs continuous. (Not a perfect solution, but a lot better to read.)
Is it possible to do this with a regex? Preferably a (Linux) Perl or sed one-liner, alternatively a Notepad++ regex.
perl -p -e 's/\s+$//; $_ .= length() <= 75 ? qq{\n} : q{ }' book.txt
Perl's -p option means: for each input line, process and print. The processing code is supplied with the -e option. In this case: remove trailing whitespace and then attach either a newline or a space, depending on line length.
Not really answering your question, but you can achieve this result in vim using this global join command. The v expands tabs into whitespace when determining line length, a feature that might be useful depending on your source text.
:g/\%>74v$\n/j
This seems to get pretty close:
sed '/^$/! {:a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta}' ebook.txt
It doesn't get the last line of a paragraph if it's shorter than 75 characters.
Edit:
This version should do it all:
sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g' ebook.txt
Edit 2:
If you want to re-wrap at word/sentence boundaries at a different width (here 65, but choose any value) to prevent words from being broken at the margin (or long lines from being truncated):
sed 's/^.\{0,74\}$/&\n/' ebook.txt | fmt -w 65 | sed '/^$/{N;s/\n//}'
To change from DOS to Unix line endings, just add dos2unix to the beginning of any of the pipes above:
dos2unix < ebook.txt | sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g'
The less fancy option would be to replace the CR/LFs that appear by themselves on a line with a single LF or CR, then remove all the remaining CR/LFs. No need for fancy/complicated stuff.
regex 1:
^\r\n$
finds lone CR/LFs. It is then trivial to replace the remaining ones. See this question for help finding CR/LFs in np++.
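A sketch of that less fancy approach with standard tools, assuming the paragraphs really are separated by lone CR/LFs (tr strips the CRs, then awk's paragraph mode joins each blank-line-separated block into a single line; unwrapped.txt is just an illustrative output name):
tr -d '\r' < book.txt | awk -v RS='' '{ gsub(/\n/, " "); print }' > unwrapped.txt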