Find and replace every second occurrence with sed - regex

Hi all, I have the code below:
find . -type f -exec sed -i 's#EText-No.#New EText-No. #g' {} +
I have been using this script to find and replace some characters in multiple files in folders and subfolders.
I have discovered that some values occur more than once, so I need to modify my script to replace only the second instance of an attribute:
find . -type f -exec sed -i '/Subject/{:a;N;/Subject.*Subject/!Ta;s/Subject/SecondSubject/2;}/g' {} +
I am trying to use the code above to achieve this, but it does not seem to work. I also need the code to use "#" as the separator, as in the first command, because I have backslash characters in my files.
Any idea how I can make this work using # as the separator?
Thanks for your help
ORIGINAL FILE BEFORE PROCESSING
<tr><th scope="row">Subject</th><td>United States -- Biography</td></tr><tr><th scope="row">Subject</th><td>United States -- Short Stories</td></tr><tr><th scope="row">EText-No.</th><td>24200</td></tr><tr><th scope="row">Release Date</th><td>2008-01-07</td></tr><tr>
After processing
<tr><th scope="row">Subject</th><td>United States -- Biography</td></tr><tr><th scope="row">SecondSubject</th><td>United States -- Short Stories</td></tr><tr><th scope="row">EText-No.</th><td>24200</td></tr><tr><th scope="row">Release Date</th><td>2008-01-07</td></tr><tr>
Please note that the second Subject is changed from 'Subject' to 'SecondSubject'

Try this:
sed -i '/Subject/{:a;s/\(Subject.*\)Subject/\1SecondSubject/;tb;N;ba;:b}'
If a line appended to the pattern space (with the N command) may itself contain more than one occurrence of the word "Subject", use this command instead. It keeps appending lines until the pattern space holds two occurrences, then replaces only the second occurrence in the pattern space (the first one in the appended line):
sed -i '/Subject/{:a;/Subject.*Subject/!{N;ba;};s/Subject/SecondSubject/2;}'
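In the sample file from the question both cells sit on the same line, so the numeric flag of the s command may already be enough. Below is a minimal sketch using # as the delimiter, as requested; the sample line is shortened from the question and the variable names are only for illustration.

```shell
# Shortened sample line (illustrative) with two "Subject" cells on one line.
line='<th scope="row">Subject</th><td>Biography</td><th scope="row">Subject</th><td>Short Stories</td>'

# The trailing 2 makes sed replace only the second match on each line;
# '#' is the delimiter, so '/' and '\' in the data need no special care here.
result=$(printf '%s\n' "$line" | sed 's#Subject#SecondSubject#2')
printf '%s\n' "$result"
```

This only helps when both occurrences are on the same line; when they are spread over several lines, the N-based commands above are still needed.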


rename multiple files splitting filenames by '_' and retaining first and last fields

Say I have the following files:
a_b.txt a_b_c.txt a_b_c_d_e.txt a_b_c_d_e_f_g_h_i.txt
I want to rename them in such a way that I split their filenames by _ and I retain the first and last field, so I end up with:
a_b.txt a_c.txt a_e.txt a_i.txt
Thought it would be easy, but I'm a bit stuck...
I tried rename with the following regexp:
rename 's/^([^_]*).*([^_]*[.]txt)/$1_$2/' *.txt
But what I would really need to do is to actually split the filename, so I thought of awk, but I'm not so proficient with it... This is what I have so far (I know at some point I should specify FS="_" and grab the first and last field somehow...):
find . -name "*.txt" | awk -v mvcmd='mv "%s" "%s"\n' '{old=$0; <<split by _ here somehow and retain first and last fields>>; printf mvcmd,old,$0}'
Any help? I don't have a preferred method, but it would be nice to use this to learn awk. Thanks!
Your rename attempt was close; you just need to make sure the final group is greedy.
rename 's/^([^_]*).*_([^_]*[.]txt)$/$1_$2/' *_*_*.txt
I added a _ before the last opening parenthesis (this is the crucial fix), and a $ anchor at the end, and also extended the wildcard so that you don't process any files which don't contain at least two underscores.
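To check the effect of the corrected regex without renaming anything, the same substitution can be run through sed -E on the sample names. This is only a sketch: preview_name is a hypothetical helper, and sed -E stands in for the Perl regex engine used by rename.

```shell
# preview_name (hypothetical helper) prints the new name for one file name
# by applying the corrected regex through sed -E instead of rename.
preview_name() {
    printf '%s\n' "$1" | sed -E 's/^([^_]*).*_([^_]*[.]txt)$/\1_\2/'
}

# Dry run over the sample names from the question; nothing is renamed.
for name in a_b.txt a_b_c.txt a_b_c_d_e.txt a_b_c_d_e_f_g_h_i.txt; do
    printf '%s -> %s\n' "$name" "$(preview_name "$name")"
done
```

The greedy .* runs up to the last underscore, so the second group is left with only the final field and the extension.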
The equivalent in Awk might look something like
find . -name "*_*_*.txt" |
awk -F _ '{ system("mv " $0 " " $1 "_" $(NF)) }'
This is somewhat brittle because of the system call; you might need to rethink your approach if your file names could contain whitespace or other shell metacharacters. You could add quoting to partially fix that, but then the command will fail if the file name contains literal quotes. You could fix that, too, but then this will be a little too complex for my taste.
Here's a less brittle approach which should cope with completely arbitrary file names, even ones with newlines in them:
find . -name "*_*_*.txt" -exec sh -c 'for f; do
mv "$f" "${f%%_*}_${f##*_}"
done' _ {} +
find will supply a leading path before each file name, so we don't need mv -- here (there will never be a file name which starts with a dash).
The parameter expansion ${f##pattern} produces the value of the variable f with the longest available match on pattern trimmed off from the beginning; ${f%%pattern} does the same, but trims from the end of the string.
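The two expansions can be tried on a plain string before touching any files. A small sketch, assuming a hypothetical path of the kind find would supply:

```shell
f='./dir/a_b_c_d_e.txt'   # hypothetical path, as find would pass it

first=${f%%_*}   # longest match of "_*" trimmed from the end
last=${f##*_}    # longest match of "*_" trimmed from the beginning
new="${first}_${last}"

printf '%s\n' "$first" "$last" "$new"
```

Note that %% cuts at the first underscore of the whole path, so this sketch assumes underscores occur only in the file name and not in a directory name.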
With your shown samples, please try the following pure bash code, which relies on bash parameter expansion. The glob matches only .txt files with at least two underscores in their name, so files like a_b.txt are NOT picked up, as per the requirement.
for file in *_*_*.txt
do
firstPart="${file%%_*}"
secondPart="${file##*_}"
newName="${firstPart}_${secondPart}"
mv -- "$file" "$newName"
done
This answer works for your example, but @tripleee's "find" approach is more robust.
for f in a_*.txt; do mv "$f" "${f%%_*}_${f##*_}"; done
Details: https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html / https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
Here's an alternate regexp for the given samples:
$ rename -n 's/_.*_/_/' *.txt
rename(a_b_c_d_e_f_g_h_i.txt, a_i.txt)
rename(a_b_c_d_e.txt, a_e.txt)
rename(a_b_c.txt, a_c.txt)
A different rename regex:
rename 's/(\S_)[a-z_]*(\S\.txt)/$1$2/' *.txt
The same regex also works with sed, or you can use awk, inside a loop:
for a in a_*; do
name=$(echo "$a" | awk -F_ '{print $1 "_" $NF}') # Or:
#name=$(echo "$a" | sed -E 's/(\S_)[a-z_]*(\S\.txt)/\1\2/')
mv "$a" "$name"
done

Regex - how to prevent or work around interference between files when searching through them?

So, I am using a regular expression to search through a bunch of files from a corpus. The point is to find the titles of newspaper articles.
This is what I use:
cat *.txt | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
It finds lines that begin with a capital, followed by any character, but not ending with a dot and that works for these specific files.
The problem is that two files interfere with each other and the dot from the very end of one file shows up in the beginning of another and I get this:
Kõik Kataria jüngrid kinnitavad , et nende elu on pärast naeruklubiga liitumist oluliselt paranenud
.Kosmosepall teeb maailmareisi 39 kilomeetri kõrgusel.
Is there any way to prevent that interference without actually modifying the files or a way to change the regular expression, so that this dot at the beginning is excluded? I must say that I am a beginner, I tried to find solutions, but none of them were specific to my case.
The files probably do not have a newline at the end, so the last line of the first file is merged with the first line of the second one.
You can try to append a newline on the fly:
find *.txt | xargs -I{} sh -c "cat {}; echo ''" | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
Source: https://stackoverflow.com/a/44675414/580346
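The effect is easy to reproduce with two throwaway files. The sketch below uses a plain loop instead of find and xargs; the file names and contents are made up for illustration.

```shell
# Two hypothetical sample files that lack a trailing newline.
tmpdir=$(mktemp -d)
printf 'First headline' > "$tmpdir/a.txt"
printf 'Second headline.' > "$tmpdir/b.txt"

# Plain cat glues the last line of a.txt to the first line of b.txt.
merged=$(cat "$tmpdir"/*.txt)

# Emitting a newline after each file keeps the lines apart.
separated=$(for f in "$tmpdir"/*.txt; do cat "$f"; echo; done)

printf '%s\n' "$merged" "---" "$separated"
rm -r "$tmpdir"
```

Files that already end with a newline gain an empty line this way, which does no harm here because the title regex requires a capital letter at the start of the line.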

Sed on Mac not recognizing regular expressions

In terminal, I am attempting to clean up some .txt files so they can be imported into another program. Only literal search/replaces seem to be working. I cannot get regular expression searches to work.
If I attempt a search and replace with a literal string, it works:
find . -type f -name '*.txt' -exec sed -i '' 's/Title Page//' {} +;
(remove the words "Title Page" from every text file)
But if I am attempting even the most basic of regular expressions, it does not work:
find . -type f -name '*.txt' -exec sed -i '' s/\n\nDOWN/\\n<DOWN\>/ {} +;
(In every text file, reformat any word "DOWN" that follows a double return: remove the extra newline and put the word in angle brackets: "\n<DOWN>")
This does not work. The only thing at all "regular expression" about this is looking for the newline.
I must be doing something incorrectly.
Any help is much appreciated.
Update: part 2
John1024's answer helped me out a lot for one aspect.
find . -type f -name '*.txt' -exec sed -i '' '/^$/{N; s/\n[0-9]+/\n/;}' {} +;
Now I am having trouble getting other types of regular expressions to respond properly. In the example above, I want to remove all numbers that appear at the beginning of a line.
Argh! What am I missing?
By default, sed handles only one line at a time. When a line is read into sed's pattern space the newline character is removed.
I see that you want to look for an empty line followed by DOWN and, when found, remove the empty line and change the text to <DOWN>. That can be done. Consider this as the test file:
$ cat file
some
thing
DOWN

DOWN
other
Try:
$ sed '/^$/{N; s/\nDOWN/<DOWN>/;}' file
some
thing
DOWN
<DOWN>
other
How it works
/^$/
This looks for empty lines. The commands in braces which follow are executed only on empty lines.
{N; s/\nDOWN/<DOWN>/;}
The N command reads the next line into the pattern space, separated from the current line by a newline character.
If the pattern space matches an empty line followed by DOWN, the substitution command, s/\nDOWN/<DOWN>/, removes the newline and replaces the DOWN with <DOWN>.
Special Case: DOS/Windows Files
If a file has DOS/Windows line endings, \r\n, sed will only remove the \n when the line is read in. The \r will remain. When dealing with these files, the presence of that character, if unanticipated, may lead to surprising results.
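One way to sidestep this, sketched below on made-up input, is to strip the carriage returns with tr before sed sees the text. Without that step the "empty" line would contain a lone \r and /^$/ would not match it.

```shell
# Illustrative input with DOS line endings (\r\n).
# tr -d '\r' removes the carriage returns so that /^$/ sees a truly empty line.
result=$(printf 'thing\r\n\r\nDOWN\r\nother\r\n' \
    | tr -d '\r' \
    | sed '/^$/{N; s/\nDOWN/<DOWN>/;}')
printf '%s\n' "$result"
```

The output then has Unix line endings, which may or may not be acceptable for the program that imports the files.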

Parse file for specific word in line

In my directory there are several files with the pattern
simulation_y_t
For every file matching this pattern, I need to check whether the word hgip appears in the last line of the file. The word might not be separated by spaces from the surrounding characters, but if it appears, it will be within the last 20 characters of the line.
The last line of a file might look something like this (in which case it should be removed):
((((1560:0.0129775,(1565:0.00473242,1447:0.00473242):0.00824512):0.0133245,((((1421:0.00357462,(1496:0.00352733,1472:0.00352733):4.72931e-05):0.00597691,1505:0.00955153):0.0104055,((((1465:0.00716479,(1527:0.00380709,1556:0.00380709):0.0033577):0.000984333,(1555:0.00381533,((1423:0.00169525,1411:0.00169525):0.00168847,1587:0.00338372):0.00043161):0.00433379):0.00159571,((1546:0.000908968,1584:0.000908968):0.00775293,(1492:0.00374859,1489:0.00374859):0.00491332):0.00108293):0.00962105,1594:0.0193659):0.000591157):0.00510731,(1442:0.0198716,(1525:0.00416688,(1550:0.00378343,1544:0.00378343):0.000383449):0.0157047):0.00519277):0.00123765):0.000318786,(1538:0.00713072,1530:0.00713072):0.0194901):0.000325926,((1483:0.00663734,1484:0.00663734):0.00471454,(1518:0.00352348,(1433:0.000365709,1450:0.000365709):0.00315777):0.0078284):0.0155948):0.00081517,1561:0.0277619):0.00127735):0.00271069hgip: 77113
Note that the numbers and the arrangement of the brackets could be different in every file; all that matters is whether these 4 characters appear in a row in that line. If they do, the line should be removed from the file.
How would I be able to do that?
This should be easy
sed '/hgip/d' YourFile
This deletes every line that contains 'hgip'.
To check only the last line:
sed '${/hgip/d}' YourFile
This deletes the last line only if it contains 'hgip'.
Use find to locate the files, then use its -exec option to delete the last line of each file if it contains hgip:
find . -type f -name '*simulation_y_t*' -exec sed -i '${/hgip/d}' {} \;
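The question also says that hgip, if present, sits within the last 20 characters of the line. A hedged sketch of a stricter pattern that encodes this, tried on a throwaway file with made-up contents:

```shell
# Hypothetical sample: hgip appears in an earlier line and in the last line.
tmpfile=$(mktemp)
printf 'hgip in an earlier line stays\n(1:0.1,2:0.2):0.3hgip: 77113\n' > "$tmpfile"

# $ restricts the command to the last line. The pattern requires hgip to be
# followed by at most 16 characters, so all of it lies in the final 20.
result=$(sed '${/hgip.\{0,16\}$/d;}' "$tmpfile")
printf '%s\n' "$result"
rm -f "$tmpfile"
```

The sketch prints the result instead of editing in place; adding -i as in the find command above would modify the files themselves.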

search and replace substring in string in bash

I have the following task:
I have to replace several links, but only the links that end with .do.
Important: the files also contain other links, which should stay untouched.
<li>Einstellungen verwalten</li>
to
<li>Einstellungen verwalten</li>
So I have to search for links ending in .do, take the part before .do and remember it, for example as $a, replace the whole link with
<s:url action=' '/>
and paste $a between the quotes.
I thought about sed, but as far as I know sed can only search for a whole string and replace it completely.
I also tried bash parameter expansions in combination with sed, but ran into several problems with the quotes and the variables.
cat ./src/main/webapp/include/stoBox2.jsp | grep -e '<a href=".*\.do">' | while read a;
do
b=${a#*href=\"};
c=${b%.do*};
sed -i 's/href=\"$a.do\"/href=\"<s:url action=\'$a\'/>\"/g' ./src/main/webapp/include/stoBox2.jsp;
done;
any ideas ?
Thanks a lot.
sed -i 's#href="\(.*\)\.do"#href="<s:url action='"'\1'"'/>"#g' ./src/main/webapp/include/stoBox2.jsp
The parentheses capture the link without .do. The sed command is written as three adjacent quoted parts, which the shell joins into a single argument, so that the single quotes in the replacement text can be included:
's#href="\(.*\)\.do"#href="<s:url action='
"'\1'"
'/>"#g'
The -i option modifies your file directly. If you don't want that, remove it and save the result to a temporary file with > tmp.
Try this one:
sed -i "s%\(href=\"\)\([^\"]\+\)\.do%\1<s:url action='\2'/>%g" \
./src/main/webapp/include/stoBox2.jsp;
You can capture patterns with parentheses (\( and \)) and use them in the replacement pattern.
Here I capture a string that contains no " and precedes .do (\([^\"]\+\)\.do), and insert it without the .do suffix (\2).
There is a / in the replacement, so I used % to delimit the expressions instead of the traditional /.
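The capture-group approach can be tried on a couple of lines before running it against the JSP file. This is a sketch on made-up input: the link targets settings.do and index.html are assumptions, since the original links are not shown, and [^"]* is used instead of \+ so that the pattern does not rely on a GNU extension.

```shell
# Hypothetical input: one link ending in .do and one that must stay untouched.
input='<li><a href="settings.do">Einstellungen verwalten</a></li>
<li><a href="index.html">Home</a></li>'

# \([^"]*\) captures the link target up to, but not including, ".do";
# \1 puts it back inside the single quotes of the s:url tag.
result=$(printf '%s\n' "$input" \
    | sed "s#href=\"\([^\"]*\)\.do\"#href=\"<s:url action='\1'/>\"#g")
printf '%s\n' "$result"
```

Because [^"]* cannot cross a double quote, a link that does not end in .do is left alone even when it shares a line with one that does.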