Sed on Mac not recognizing regular expressions - regex

In terminal, I am attempting to clean up some .txt files so they can be imported into another program. Only literal search/replaces seem to be working. I cannot get regular expression searches to work.
If I attempt a search and replace with a literal string, it works:
find . -type f -name '*.txt' -exec sed -i '' 's/Title Page//' {} +
(remove the words "Title Page" from every text file)
But if I am attempting even the most basic of regular expressions, it does not work:
find . -type f -name '*.txt' -exec sed -i '' s/\n\nDOWN/\\n<DOWN\>/ {} +;
(In every text file, reformat any word "DOWN" that follows a double return: remove the extra newline and put the word in angle brackets: "\n<DOWN>")
This does not work. The only thing at all "regular expression" about this is looking for the newline.
I must be doing something incorrectly.
Any help is much appreciated.
Update: part 2
John1024's answer helped me out a lot for one aspect.
find . -type f -name '*.txt' -exec sed -i '' '/^$/{N; s/\n[0-9]+/\n/;}' {} +;
Now I am having trouble getting other types of regular expressions to respond properly. In the example above, I want to remove all numbers that appear at the beginning of a line.
Argh! What am I missing?
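A minimal sketch for the follow-up, assuming the goal is simply to strip a run of digits at the start of every line. In sed's default basic regular expressions a bare + is an ordinary character, so [0-9]+ does not mean "one or more digits". Writing [0-9][0-9]* or switching to extended syntax with -E are two ways around that. The sample input is made up for illustration.

```shell
# Hypothetical sample input: two lines start with digits, one does not.
input=$(printf '12 apples\npears\n7up\n')

# Basic regex (the default): spell "one or more digits" as [0-9][0-9]*.
bre=$(printf '%s\n' "$input" | sed 's/^[0-9][0-9]*//')

# Extended regex (-E): + means "one or more", as in most other tools.
ere=$(printf '%s\n' "$input" | sed -E 's/^[0-9]+//')

printf '%s\n' "$bre"
```

Both variables end up holding the same text. No N is needed here, because ^ already anchors the match at the start of each line that sed reads.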

By default, sed handles only one line at a time. When a line is read into sed's pattern space the newline character is removed.
I see that you want to look for an empty line followed by DOWN and, when found, remove the empty line and change the text to <DOWN>. That can be done. Consider this as the test file:
$ cat file
some
thing
DOWN

DOWN
other
Try:
$ sed '/^$/{N; s/\nDOWN/<DOWN>/;}' file
some
thing
DOWN
<DOWN>
other
How it works
/^$/
This looks for empty lines. The commands in braces which follow are executed only on empty lines.
{N; s/\nDOWN/<DOWN>/;}
The N command reads the next line into the pattern space, separated from the current line by a newline character.
If the pattern space matches an empty line followed by DOWN, the substitution command, s/\nDOWN/<DOWN>/, removes the newline and replaces the DOWN with <DOWN>.
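To see why a substitution that looks for \n never matches on its own, here is a small sketch that runs the same s command with and without N. The sample text and variable names are made up; only the sed commands follow the answer above.

```shell
# Sample: the second DOWN is preceded by an empty line.
sample=$(printf 'thing\nDOWN\n\nDOWN\nother\n')

# Without N the pattern space holds one line and no newline,
# so a pattern containing \n cannot match and nothing changes.
unchanged=$(printf '%s\n' "$sample" | sed 's/\nDOWN/<DOWN>/')

# With N on empty lines, the next line is appended after a newline,
# which the substitution can then see and remove.
joined=$(printf '%s\n' "$sample" | sed '/^$/{N;s/\nDOWN/<DOWN>/;}')

printf '%s\n' "$unchanged"
echo '---'
printf '%s\n' "$joined"
```

The first result is identical to the input, while the second has the empty line removed and the following DOWN wrapped in angle brackets.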
Special Case: DOS/Windows Files
If a file has DOS/Windows line endings, \r\n, sed will only remove the \n when the line is read in. The \r will remain. When dealing with these files, the presence of that character, if unanticipated, may lead to surprising results.
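A rough illustration of that surprise, assuming the carriage returns can simply be dropped: with \r\n endings the "empty" line still holds a \r, so /^$/ does not match it. One option is to strip the \r characters with tr before sed sees the text. The sample input is hypothetical.

```shell
# Sample with DOS/Windows line endings (\r\n).
crlf_input=$(printf 'thing\r\nDOWN\r\n\r\nDOWN\r\n')

# The "empty" line is really "\r", so /^$/ never fires and nothing is joined.
# (tr runs afterwards here only to make the result easy to compare.)
raw=$(printf '%s\n' "$crlf_input" | sed '/^$/{N;s/\nDOWN/<DOWN>/;}' | tr -d '\r')

# Stripping \r first lets the same sed command work as intended.
fixed=$(printf '%s\n' "$crlf_input" | tr -d '\r' | sed '/^$/{N;s/\nDOWN/<DOWN>/;}')

printf '%s\n' "$fixed"
```

Note that this converts the text to Unix line endings; if the importing program needs \r\n, they would have to be added back afterwards.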

Related

I need to use sed to comment out two lines in a text file

I am running a custom kernel build and have created a custom config file in a bash script, now I need to comment out two lines in Kbuild in order to prevent the bc compiler from running. The lines are...
$(obj)/$(timeconst-file): kernel/time/timeconst.bc FORCE
$(call filechk,gentimeconst)
Using Expresso, I have a regex that matches the first line...
^\$\(obj\)\/\$\(timeconst-file\): kernel\/time\/timeconst\.bc FORCE
But can't get sed to actually insert a # in front of the line.
Any help would be much appreciated.
sed -i "/<Something that matches the lines to be replaced>/s/^#*/#/g"
This uses a regex, /<something>/, to select the lines you want to comment, then substitutes (s) the start of the line, ^, plus any #s already there (#*), with a single #. So lines that are already commented are handled without stacking up extra #s. Because sed applies the command to every line that matches, this works for mass commenting; the trailing g is harmless but not needed, since a pattern anchored with ^ can match only once per line.
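I have a bash script built on the same idea. Note that, as written, this variant strips the leading #s (and any whitespace after them) from the matching lines, that is, it uncomments them (\+ and \s are GNU sed extensions):
sed -i.bkp "/$1/s/^#\+\s*//g" "$2"
-i.bkp edits the file in place and keeps a backup of the original with the suffix .bkp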
Script is called ./comment.sh <match> <filename>
The match does not have to match the entire line, just enough to make it only hit lines you want.
You can use following sed for replacement:
sed 's,^\($(obj)/$(timeconst-file): kernel/time/timeconst.bc FORCE\),#\1,'
You don't need to escape ( ) or $ here: without -r (extended regex), sed treats ( and ) as literals and uses \( \) for grouping, and $ is special only at the end of the pattern.
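As a sketch of how the two Kbuild lines from the question could be commented in one pass, the following selects each line by a short, distinctive fragment rather than by the whole text. The first line of the sample (obj-y += time.o) is a made-up neighbour added only to show that other lines are left alone.

```shell
# Sample fragment: one unrelated line plus the two lines from the question.
kbuild=$(printf '%s\n' \
  'obj-y += time.o' \
  '$(obj)/$(timeconst-file): kernel/time/timeconst.bc FORCE' \
  '$(call filechk,gentimeconst)')

# Each -e selects lines by a fragment and replaces any existing leading #s
# with exactly one #, so running it twice does not double-comment.
commented=$(printf '%s\n' "$kbuild" |
  sed -e '/timeconst\.bc FORCE/s/^#*/#/' \
      -e '/filechk,gentimeconst/s/^#*/#/')

printf '%s\n' "$commented"
```

To change the real file, the same two -e expressions could be passed to sed -i with the Kbuild path, ideally after making a backup.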

Find and replace every second occurrence with sed

Hi all I have the code below
find . -type f -exec sed -i 's#EText-No.#New EText-No. #g' {} +
I have been using the script to find and replace some characters in multiple files in folders and subfolders.
I have discovered that some values occur more than twice. Hence I need to modify my script to replace only the second instance of an attribute
find . -type f -exec sed -i '/Subject/{:a;N;/Subject.*Subject/!Ta;s/Subject/SecondSubject/2;}/g' {} +
I am trying to use the code above to achieve this, but it does not seem to work. I also need the code to use "#" as the separator, like the command above, because I have backslash characters in my file.
Any idea how I might make the code work using the separator #?
Thanks for your help
ORIGINAL FILE BEFORE PROCESSING
<tr><th scope="row">Subject</th><td>United States -- Biography</td></tr><tr><th scope="row">Subject</th><td>United States -- Short Stories</td></tr><tr><th scope="row">EText-No.</th><td>24200</td></tr><tr><th scope="row">Release Date</th><td>2008-01-07</td></tr><tr>
After processing
<tr><th scope="row">Subject</th><td>United States -- Biography</td></tr><tr><th scope="row">SecondSubject</th><td>United States -- Short Stories</td></tr><tr><th scope="row">EText-No.</th><td>24200</td></tr><tr><th scope="row">Release Date</th><td>2008-01-07</td></tr><tr>
Please note that the second Subject is changed from 'Subject' to 'SecondSubject'
Try this:
sed -i '/Subject/{:a;s/\(Subject.*\)Subject/\1SecondSubject/;tb;N;ba;:b}'
If a line appended to the pattern space (with the N command) contains more than one occurrence of the word "Subject", then you can use this command to only target the first occurrence of the appended line (the second occurrence of the pattern space):
sed -i '/Subject/{:a;/Subject.*Subject/!{N;ba;};s/Subject/newSubject/2;}'
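If both occurrences sit on the same line, as they do in the sample row from the question, a simpler sketch may be enough: the s command accepts a number as a flag to pick the Nth match on a line, and it works with # as the separator. The shortened row below is a hypothetical stand-in for the real file.

```shell
# Shortened stand-in for the HTML row: "Subject" appears twice on one line.
row='<th scope="row">Subject</th><td>A</td><th scope="row">Subject</th><td>B</td>'

# The trailing 2 tells s to replace only the second match on each line.
result=$(printf '%s\n' "$row" | sed 's#Subject#SecondSubject#2')

printf '%s\n' "$result"
```

This does not help when the two occurrences are on different lines; that case needs the N loop shown above.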

Parse file for specific word in line

In my directory there are several files with the pattern
simulation_y_t
For all files with this pattern I need to check whether the word hgip comes up in the last line of the file. The word might not be separated by spaces from the surrounding characters, but if it comes up, it will be within the last 20 characters of the line.
the last line of the file might look something like this (if it shall be removed)
((((1560:0.0129775,(1565:0.00473242,1447:0.00473242):0.00824512):0.0133245,((((1421:0.00357462,(1496:0.00352733,1472:0.00352733):4.72931e-05):0.00597691,1505:0.00955153):0.0104055,((((1465:0.00716479,(1527:0.00380709,1556:0.00380709):0.0033577):0.000984333,(1555:0.00381533,((1423:0.00169525,1411:0.00169525):0.00168847,1587:0.00338372):0.00043161):0.00433379):0.00159571,((1546:0.000908968,1584:0.000908968):0.00775293,(1492:0.00374859,1489:0.00374859):0.00491332):0.00108293):0.00962105,1594:0.0193659):0.000591157):0.00510731,(1442:0.0198716,(1525:0.00416688,(1550:0.00378343,1544:0.00378343):0.000383449):0.0157047):0.00519277):0.00123765):0.000318786,(1538:0.00713072,1530:0.00713072):0.0194901):0.000325926,((1483:0.00663734,1484:0.00663734):0.00471454,(1518:0.00352348,(1433:0.000365709,1450:0.000365709):0.00315777):0.0078284):0.0155948):0.00081517,1561:0.0277619):0.00127735):0.00271069hgip: 77113
Note that the numbers and the arrangement of the brackets could be different in each file; what matters is whether these 4 characters appear in a row in that line. If they do, the line should be removed from the file.
How would I be able to do that?
This should be easy
sed '/hgip/d' YourFile
This will delete all lines where 'hgip' is inside
For checking only last line
sed '${/hgip/d}' YourFile
This will delete only last line if 'hgip' is inside it
Use find to search for the files and then use the -exec option to delete the last line if it contains hgip
find . -type f -name '*simulation_y_t*' -exec sed -i '${/hgip/d}' {} \;
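The behaviour of the $ address can be checked on throwaway files before touching real data. This is a sketch under assumptions: the file names and contents are invented, and the loop writes to a temporary copy instead of using -i so that it does not depend on one particular sed's in-place syntax.

```shell
# Scratch directory with two made-up simulation files.
tmpdir=$(mktemp -d)
printf 'tree data\n(1:0.1,2:0.2)hgip: 77113\n' > "$tmpdir/simulation_1_2"
printf 'hgip in the first line\n(1:0.1,2:0.2);\n' > "$tmpdir/simulation_3_4"

# $ restricts the block to the last line; d deletes it only if it has hgip.
for f in "$tmpdir"/simulation_*_*; do
  sed '${/hgip/d;}' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

first=$(cat "$tmpdir/simulation_1_2")
second=$(cat "$tmpdir/simulation_3_4")
rm -rf "$tmpdir"

printf '%s\n---\n%s\n' "$first" "$second"
```

The first file loses its last line, while the second keeps both lines because hgip appears only in its first line.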

unix sed command regular expression

Can anyone explain to me how the regular expression works in the sed substitute command?
$ cat path.txt
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/sbin:/sbin:/bin/:/usr/sbin:/usr/bin:/opt/omni/bin:
/opt/omni/lbin:/opt/omni/sbin:/root/bin
$ sed 's/\(\/[^:]*\).**/\1/g' path.txt
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
From the above sed command they used back reference and save operator concept.
Can anyone explain to me how the regular expression, especially /[^:]*, works in the substitute command to get only the first path in each line?
I think you wrote an extra asterisk * in your sed code, so it should be like this:
$ sed 's/\(\/[^:]*\).*/\1/g' file
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
To change the delimiter will help to understand it a little bit better:
sed 's#\(/[^:]*\).*#\1#g'
s#something#otherthing#g is a basic sed command that looks for something and replaces it with otherthing, for every match on every line.
If you do s#\(something\)#\1#g then you "save" that something and can print it back with \1.
Hence, what it is doing is matching a pattern like /[^:]* and then printing it back. /[^:]* means / followed by any characters except :. So it takes / plus everything up to the first colon :. It stores that piece of the string and then prints it back.
Small examples:
# get the leading run of lowercase letters
$ echo "hello123bye" | sed 's#\([a-z]*\).*#\1#g'
hello
# get everything until it finds the number 3
$ echo "hello123bye" | sed 's#\([^3]*\).*#\1#g'
hello12
[^:]*
in regex would match all characters except for :, so it would match until this:
/usr/kbos/bin
also it would match these,
/usr/local/bin
/usr/jbin
/usr/bin
/usr/sas/bin
since all of these consist of characters that are not :
.* matches any character, zero or more times.
Thus, the regex [^:]*.* would match all of these expressions:
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/jbin:/usr/bin:/usr/sas/bin
/usr/bin:/usr/sas/bin
However, you get only the first field (i.e. /usr/kbos/bin) because the match starts at the leftmost possible position, the group \(/[^:]*\) stops at the first :, and .* then consumes the rest of the line; the whole line is replaced by the back reference \1.
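A short sketch of the capture at work, using a shortened version of the first line of path.txt. The second command adds another group to show that \2 picks up the next field, and the cut call is included only as a cross-check of the first result.

```shell
line='/usr/kbos/bin:/usr/local/bin:/usr/jbin'

# Group 1 is "/" plus everything up to the first colon; .* eats the rest.
first_sed=$(printf '%s\n' "$line" | sed 's#\(/[^:]*\).*#\1#')

# Two groups separated by a literal colon: \2 is the second path.
second_sed=$(printf '%s\n' "$line" | sed 's#\(/[^:]*\):\(/[^:]*\).*#\2#')

# cut splits on ":" and should agree with the first result.
first_cut=$(printf '%s\n' "$line" | cut -d: -f1)

printf '%s\n%s\n%s\n' "$first_sed" "$second_sed" "$first_cut"
```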

how to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line. (ie: the word has '-\n' inserted in it).
I would like rejoin all such split words in a text file (in a linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor in windows that did regex search/replace with newlines in the search expression, but am unaware of such in linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
Also, if you're doing OCR, you can use this perl one-liner to convert Unicode dashes to the ASCII dash character. Note the -CS option, which tells perl that the input and output are UTF-8.
# U+2010 - U+2015 (hyphens and dashes) to ascii dash; U+2009 is a thin space, so it is not included
perl -CS -pe 'tr/\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has windows line endings, you'll need to catch the cr-lf with something like:
cat file | perl -p -e 's/-\s\n//'
Hey this is my first answer post, here goes:
I suspect the \n in '-\n' stands for the line-feed character. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing. So you use 's' to substitute, a forward slash, then the unwanted string (more on that in a sec), a forward slash again, then nothing (what it's being substituted with), a final forward slash, and then the flags. In this case I will use 'g', which stands for global: replace every match on each line rather than only the first. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing because if there is a backslash in your string, it needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//g' file_with_hyphens
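When the file contains real line breaks after the hyphens (rather than the two characters backslash and n), and a sed without -z is in use, a loop with N is one possible alternative. This is a sketch on made-up text; the script is split into several -e parts because some sed implementations are strict about what may follow a label or branch on the same line.

```shell
# Made-up OCR text: two words are split across lines with a trailing hyphen.
ocr_text=$(printf 'The quick brown fox jum-\nped over the la-\nzy dog\n')

# While the pattern space ends in "-" and more input remains ($!),
# append the next line (N), delete the "-" plus newline, and loop (ba).
joined=$(printf '%s\n' "$ocr_text" |
  sed -e ':a' -e '$!{' -e '/-$/{' -e 'N' -e 's/-\n//' -e 'ba' -e '}' -e '}')

printf '%s\n' "$joined"
```

Because each join removes a line break, the affected lines are merged into longer ones; a tool such as fmt, mentioned above, can re-wrap the result.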