Bash: Change "Title" of postscript file

Bash: Change "Title" of postscript file - regex

I have a postscript file where I'd like to change the "Title" attribute before generating a pdf from it.
Following the beginning of the file:
%!PS-Adobe-3.0
%%BoundingBox: 0 0 595 842
%%HiResBoundingBox: 0 0 595 842
%%Title: GMT v5.1.1_r12693 [64-bit] Document from pscoast
%%Creator: GMT5
[…]
I now match the line %%Title: GMT v5.1.1_r12693 [64-bit] Document from pscoast with ^%%Title:\s.* and would like to replace everything after the colon with the content of a variable.
My non-working code so far:
sed "s/\(^%%Title:\)\s.*$/\1 $title/g" test_file.ps
My sed knowledge is very limited and my experimentation didn't yield anything useful so far - your help will be greatly appreciated.
All the best, Chris
EDIT: added my non-working code

One of the tricks for getting sed to work correctly is getting the shell quoting right. This creates a postscript file with the new title:
newtitle="Shiny New Title"
sed 's/^%%Title:.*/%%Title: '"$newtitle/" sample.ps >new.ps
This updates the postscript in place:
newtitle="Shiny New Title"
sed -i 's/^%%Title:.*/%%Title: '"$newtitle/" sample.ps
Many of the characters that one uses in sed expressions, like $, (, or *, are shell-active. To protect them from possible shell expansion, they should be in single-quotes. However, because one wants the shell to expand the $newtitle variable, it cannot be in single-quotes. Thus, if you look carefully, you will see that the above substitute expression is in two parts, one single-quoted and one double-quoted. Adding a space between them to make it clearer:
's/^%%Title:.*/%%Title: ' "$newtitle/" # Do not use this form.
Thus, the shell-active characters are protected by single-quotes and only the parts that we want the shell to expand are in double-quotes.
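One caveat with this approach: if the title itself contains /, & or \, sed will read those characters as part of the s command rather than as literal text. A minimal sketch of one way to guard against that, assuming a single-line title, a hypothetical helper named escape_sed_replacement, and a throwaway sample file (none of which are part of the original answer):

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): backslash-escape the
# characters that are special in the replacement part of s/.../.../ when /
# is the delimiter, namely \ / and &.
escape_sed_replacement() {
    printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

# Throwaway sample file standing in for the real PostScript header.
workdir=$(mktemp -d)
printf '%s\n' '%!PS-Adobe-3.0' '%%Title: old title' '%%Creator: GMT5' > "$workdir/sample.ps"

newtitle='Coast/Rivers & Lakes'
escaped=$(escape_sed_replacement "$newtitle")

# Same quoting pattern as in the answer: single quotes around the sed
# syntax, double quotes around the expanded variable.
sed 's/^%%Title:.*/%%Title: '"$escaped/" "$workdir/sample.ps" > "$workdir/new.ps"

title_line=$(sed -n '2p' "$workdir/new.ps")
printf '%s\n' "$title_line"
```

The helper uses | as its own delimiter so that / can appear in the bracket expression without further escaping.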

Maybe this is what you're looking for:
myvar="some content"
sed -e "s/^\(%%Title:\).*/\1 $myvar/" < inputfile
# output
...
%%Title: some content
...

Related

Insert newline before/after match for TSV

I'm going grey trying to figure out how to accomplish some regex matching to insert new lines. Example input/output below...
Example TSV Data:
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001 some:other:tag:with-colons-and-hypens=MACHINE NAME Name=NAMETAG backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01 backup=true Name=SOMENAME"
Desired Output:
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001
some:other:tag:with-colons-and-hypens=MACHINE NAME
Name=NAMETAG
backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01
backup=true
Name=SOMENAME"
I can guarantee that each key=value pair within those quotes is separated by a hard (literal) tab. It may not look that way in the rendered HTML code block, but the tabs did carry over into the code block editor. The data under the Tags column is quoted so that, even though the pairs are tab-separated, they stay within the Tags column. For whatever reason I'm not able to get the desired results.
In my measly attempts, I've basically been capturing everything between the quotes as if the tabs weren't there, because of my use of wildcards: [TAB].*=.*[TAB] obviously doesn't work, since I lose everything between the first and last occurrence on each line. I've also tried storing the pieces in capture groups, without any success.
I'm looking for a unix toolset solution (sed, awk, perl and the like). Any/All help is appreciated!

This will work using any awk in any shell on any UNIX box:
$ awk 'match($0,/".*"/){str=substr($0,RSTART,RLENGTH); gsub(/\t/,"\n",str); $0=substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)} 1' file
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001
some:other:tag:with-colons-and-hypens=MACHINE NAME
Name=NAMETAG
backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01
backup=true
Name=SOMENAME"
It just extracts a string between "s from the current record, replaces all tabs with newlines within that string, then puts the record back together before it's printed.
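Because the tabs do not survive copy-and-paste from the rendered page, it can help to build a small test file with printf, which writes real tab characters. The sketch below does that with made-up sample rows (the instance IDs and tags are placeholders, not the asker's data) and then runs the same awk program on it:

```shell
#!/bin/sh
# Build a small TSV with real tabs; the rows are made-up placeholders.
sample=$(mktemp)
{
    printf 'Name\tMonitoring\tTags\n'
    printf 'i-111\tenabled\t"env=prod\tName=MACHINE NAME\tbackup=true"\n'
    printf 'i-222\tenabled\t"backup=true"\n'
} > "$sample"

# Same program as in the answer: find the quoted span, turn the tabs inside
# it into newlines, and splice it back between the untouched head and tail.
result=$(awk 'match($0, /".*"/) {
    str = substr($0, RSTART, RLENGTH)
    gsub(/\t/, "\n", str)
    $0 = substr($0, 1, RSTART - 1) str substr($0, RSTART + RLENGTH)
} 1' "$sample")

printf '%s\n' "$result"
```

The tabs outside the quotes are left alone, so the Name and Monitoring columns keep their separators.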

You can try this with GNU sed (tested with 4.4):
sed -E ':A;s/(".*)\t(.*")/\1\n\2/;tA' TSV_Data_File
With OS X sed, you can try this one (I think the \t is OK):
sed -E '
:A
s/(".*)\t(.*")/\1\
\2/
tA
' TSV_Data_File
Brief explanation:
Catch the text inside the double quotes.
Replace the last \t with \n.
If a substitution occurred, jump back to A; otherwise continue.
With awk:
awk -v RS='"' 'NR%2==0{gsub("\t","\n")}1' ORS='"' TSV_Data_File

This is basically ctac_'s sed answer converted to perl:
perl -pe'1 while s/(".*)\t(.*")/$1\n$2/s' file.tsv
Where the \t might be replaced by \t\s* if you want just one newline out of each tab-and-then-some.

This might work for you (GNU sed):
sed 's/\S\+=\S\+/\n&/2g' file
Insert a newline before the second and each subsequent non-empty string containing an =.

Finding strings across lines and replace with nothing

I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, stand alone programs, perl scripts posted by others) crash at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.

Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The d command tells sed to "delete" the matching line (i.e. not print it).
^ and $ mean "start of string" and "end of string" respectively, i.e. the line must be an exact match.
So, the above command basically says:
Print all lines that do not only contain # or +, and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
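The -0 option sets Perl's input record separator to the NUL character. Since a text file normally contains no NUL bytes, Perl then reads the entire input file in one shot instead of line by line (-0777 is the explicit "file slurp" mode). This enables multi-line search and replace.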
The -pe option allows you to run Perl code (pattern matching and replacement in this case) and display output from the command line.
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
/gm makes the substitution multiline and global.
You could also pass -i as the first parameter to perl to edit the file in place.

This may not be the most elegant solution in the world, but you can use tr to replace the \n with a null character and back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta

Try this:
sed '/^#$/{N;/\n+$/d}' file
When # is found, next line is appended to the pattern space with N.
If the next line is just +, the d command deletes both lines.
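Since the asker is on a Mac, it may be worth noting that BSD sed can be stricter than GNU sed about what follows d inside braces. A sketch of the same idea with each command passed as its own -e argument, which should be accepted by both implementations (the sample file below is a shortened stand-in for the real data):

```shell
#!/bin/sh
# Shortened stand-in for the FASTQ file in the question.
fastq=$(mktemp)
printf '%s\n' '#Sample_1' 'ACTG' '+' 'BBEE' '#' '+' '#' '+' '#Sample_4' 'ACTG' '+' 'EHHK' > "$fastq"

# On a line that is exactly "#", append the next line (N); if the pattern
# space is then "#\n+", delete it (d) so neither line is printed.
cleaned=$(sed -e '/^#$/{' -e 'N' -e '/^#\n+$/d' -e '}' "$fastq")

printf '%s\n' "$cleaned"
```

Only the bare # and + pairs are removed; the + separator inside each real record is kept because it does not directly follow a line that is exactly #.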

Using sed to replace one line (that might change) with another

I want to run a script that changes a line in the HTML code, indicating when the page was last updated. So for instance, I have the line
<d>This page was last updated on 29.04.2013 at 00:34 UTC</d>
and I am updating it now, so I want to replace that line with
<d>This page was last updated on 15.05.2013 at 15:50 UTC</d>
This is the only line in my source code that has the <d> tag, so hopefully that helps. I already have some code that generates the new string with the current date and time, but I can't figure out a way to replace the old one (which changes, so I don't know exactly what it is).
I've tried putting in a comment <!--date--> in the previous line, deleting the whole line that has <d> (with grep), and then putting in a new line after the comment that is the new string, but that fails. For example, if I want to just insert the string text after the comment, and use
sed -i 's/<!--date-->/<!--date-->text/' file.html
I get invalid command code j. I think it might be because there are some special characters like <,!, and > in the strings, but if I want to put in the date string above, I will have even more, like : and /. Thanks for any ideas on how to fix this.

This will change the text only on lines that contain <d>:
sed -i.bak "/<d>/s/on .* at [^<]*/on newdate at newtime/" file.html
I've tested this with the BSD sed that ships with MacOS X 10.8.3

You don't need your <!--date--> hack. You can use regular expressions and another delimiter besides "/" in your sed command:
sed -i.bak 's#<d>This page was last updated on.*</d>#<d>This page was last updated on 12.05.2013 at 00:38 UTC</d>#' whatever.html
Or, if you have your update in a variable called $replacement:
sed -i.bak "s#<d>This page was last updated on.*</d>#$replacement#" whatever.html
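Putting the pieces together, a minimal sketch of the whole update step might look like this. The date format string and the temporary file are assumptions made for illustration; the asker already has their own code for generating the new line.

```shell
#!/bin/sh
# Temporary stand-in for the real HTML page.
page=$(mktemp)
printf '%s\n' '<html><body>' \
    '<d>This page was last updated on 29.04.2013 at 00:34 UTC</d>' \
    '</body></html>' > "$page"

# Assumed format, e.g. "15.05.2013 at 15:50"; -u gives UTC.
stamp=$(date -u '+%d.%m.%Y at %H:%M')
replacement="<d>This page was last updated on $stamp UTC</d>"

# "#" as the delimiter means the "/" in "</d>" needs no escaping.
# -i.bak edits in place and keeps a backup copy next to the file.
sed -i.bak "s#<d>This page was last updated on.*</d>#$replacement#" "$page"

updated_line=$(grep '<d>' "$page")
printf '%s\n' "$updated_line"
```

Attaching the suffix directly to -i (as in -i.bak) is the form that both GNU sed and the BSD sed on macOS accept.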

When using the command line, try escaping special characters like this:
! ===> \!

egrep regular expression works within PHP, but doesn't work at unix shell - escaping issues?

I think my problem has something to do with escaping differences between using a regex within PHP versus using it at Bash commandline.
Here is my regex that is working in PHP:
$emailregex = '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$';
So I try giving the following at commandline and it doesn't seem to match anything.
(where emails.txt is a long plain text file with thousands of (possibly badly-formed) email addresses, one per line).
[root@host dir]# egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$' emails.txt
I have tried surrounding the regex with double-quotemarks instead of single-quotemarks, but it made no difference.
Do I need to add some backslashes into the regex?
SOLVED! Thank you!
My file was created in Windows and extra CR in the END-OF-LINE markers did not agree with the dollar sign in the regex.

Single quotes should work with bash...
It works for me with this simple case:
echo test@test.com | egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$'
In your text file, the line has to only contain the email address. Any additional spaces on the line will throw it off. For example this doesn't print anything:
echo " test@test.com" | egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$'
Your problem might be that you have a dos formatted file. In that case the extra \r will make it so that the regex doesn't match since it will think there's an extra character at the end of the line. You can run dos2unix against it, or make your regex less restrictive by removing the beginning and end markers from your regex:
egrep '[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})'
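To check whether carriage returns are the culprit before touching the regex, one option is to strip them on the fly with tr and compare match counts. The sketch below uses a deliberately simple stand-in pattern and made-up data rather than the full email regex:

```shell
#!/bin/sh
# Made-up file with DOS (CRLF) line endings.
dosfile=$(mktemp)
printf 'alice\r\nbob\r\n' > "$dosfile"

# Anchored stand-in pattern: a line consisting only of lowercase letters.
pattern='^[a-z]+$'

# grep -c exits non-zero when nothing matches, hence the "|| true".
with_cr=$(grep -Ec "$pattern" "$dosfile" || true)
without_cr=$(tr -d '\r' < "$dosfile" | grep -Ec "$pattern" || true)

printf 'matches with CR: %s, after stripping CR: %s\n' "$with_cr" "$without_cr"
```

If the second count is higher than the first, the anchors are fine and the line endings are the problem, so the $ can stay in the regex.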

Works for me:
JPP-MacBookPro-4:tmp jpp$ cat emails.txt
aa@bb.com
bb@cc.com
not an email
cc@dd.ee.ff
JPP-MacBookPro-4:tmp jpp$ egrep '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,6})$' emails.txt
aa@bb.com
bb@cc.com
cc@dd.ee.ff
JPP-MacBookPro-4:tmp jpp$
Beware trailing whitespace, tabs, and carriage returns - they have a way of biting regexes.
There is a great ref on shell quoting here http://www.mpi-inf.mpg.de/~uwe/lehre/unixffb/quoting-guide.html

How to remove nonnumeric junk from a file

Here's an output from less:
487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491
I see a bunch of nonprintable characters here. How do I remove them using sed/tr?
My try was 's/\([0-9][0-9]*\)/\1/g', but it doesn't work.
EDIT: Okay, let's go further down the source. The numbers are extracted from this file:
487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
The first line is perfectly normal and what most of the lines are. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g'), but somehow the nonprintables get into the regex, which should stop at ".
EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less. Mac's Terminal, on the other hand, uses ?? to represent such characters. I bet xterm on my Ubuntu would print that white oval with a question mark.

Classic job for either sed or Unix's tr command.
sed 's/[^0-9]//g' "$file"
(Anything that is not a digit - or newline - is deleted.)
tr -cd '0-9\012' < "$file" > "$file.1"
Delete (-d) the complement (-c) of the digits and newline...

You missed the bit where you match the rest of the line.
sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g'
                      ^^^^^^^

Try this sed command:
sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt
OUTPUT (running above command on the input file you provided)
487451
487450
487449
487448
487447
487446
487445
484300
484299
484297
484296
484295
484294
484293
483496
483495
483494
483493
483492
483491
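If the stray bytes are not valid in the current locale's encoding, . may refuse to match them, which would explain why they seemed to leak past .* in the question. A hedged sketch, assuming the junk is the two bytes 0xA3 0xBA shown by less: running sed in the C locale makes it treat the input as plain bytes.

```shell
#!/bin/sh
# Recreate two lines of the sample, writing the raw bytes 0xA3 0xBA with
# octal escapes (\243 \272).
junkfile=$(mktemp)
printf '487451\n487450\243\2721\243\2721\n' > "$junkfile"

# LC_ALL=C: every byte counts as one character, so .* can consume the junk.
numbers=$(LC_ALL=C sed 's/^\([0-9][0-9]*\).*$/\1/' "$junkfile")

printf '%s\n' "$numbers"
```

Setting LC_ALL only for the one command leaves the rest of the shell session in its usual locale.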

If you know the crap will always be inside brackets, why not delete that crap?
sed 's/<[^>]*>//g'
EDIT: Thanks, Mike that makes sense. In that case, how about:
sed -E 's/([0-9]+).*/\1/g'

If the data always is like the sample, deleting from the less-than to the end of the line would work fine.
sed -i "s/<.*$//" file
