Parse file for specific word in line - regex

In my directory there are several files with the pattern
simulation_y_t
For all files with this pattern, I need to check whether the word hgip comes up in the last line of the file. The word might not be separated by spaces from the surrounding characters, but if it comes up, it will be within the last 20 characters of the line.
The last line of the file might look something like this (if it should be removed):
((((1560:0.0129775,(1565:0.00473242,1447:0.00473242):0.00824512):0.0133245,((((1421:0.00357462,(1496:0.00352733,1472:0.00352733):4.72931e-05):0.00597691,1505:0.00955153):0.0104055,((((1465:0.00716479,(1527:0.00380709,1556:0.00380709):0.0033577):0.000984333,(1555:0.00381533,((1423:0.00169525,1411:0.00169525):0.00168847,1587:0.00338372):0.00043161):0.00433379):0.00159571,((1546:0.000908968,1584:0.000908968):0.00775293,(1492:0.00374859,1489:0.00374859):0.00491332):0.00108293):0.00962105,1594:0.0193659):0.000591157):0.00510731,(1442:0.0198716,(1525:0.00416688,(1550:0.00378343,1544:0.00378343):0.000383449):0.0157047):0.00519277):0.00123765):0.000318786,(1538:0.00713072,1530:0.00713072):0.0194901):0.000325926,((1483:0.00663734,1484:0.00663734):0.00471454,(1518:0.00352348,(1433:0.000365709,1450:0.000365709):0.00315777):0.0078284):0.0155948):0.00081517,1561:0.0277619):0.00127735):0.00271069hgip: 77113
Note that the numbers and the arrangement of the brackets could be different in every file; what matters is whether these 4 characters appear in a row in that line. If they do, the line should be removed from the file.
How would I be able to do that?

This should be easy
sed '/hgip/d' YourFile
This deletes every line that contains 'hgip'.
To check only the last line:
sed '${/hgip/d}' YourFile
This deletes the last line only if it contains 'hgip'. Both commands print the result to standard output and leave YourFile itself unchanged.

Use find to locate the files, then use its -exec option to run sed -i on each one, deleting the last line in place if it contains hgip:
find . -type f -name '*simulation_y_t*' -exec sed -i '${/hgip/d}' {} \;
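As a sketch of how the "last 20 characters" condition from the question could be made explicit, the address below only matches when hgip is followed by at most 16 more characters before the end of the line. The scratch file and its contents are made up for illustration, and the output is captured in variables rather than written back with -i.

```shell
# Scratch file standing in for one simulation_y_t file (hypothetical content)
f=$(mktemp)
printf '%s\n' 'first line' '(1:0.1,2:0.1):0.002hgip: 77113' > "$f"

# On the last line ($), delete only if hgip is followed by at most
# 16 characters, i.e. hgip lies within the final 20 characters.
trimmed=$(sed '${/hgip.\{0,16\}$/d;}' "$f")
printf '%s\n' "$trimmed"

# A last line where hgip appears too early is left alone.
kept=$(printf '%s\n' 'first line' 'hgip appears too early in this last line' |
  sed '${/hgip.\{0,16\}$/d;}')
printf '%s\n' "$kept"

rm -f "$f"
```

The same sed expression should be usable in place of '${/hgip/d}' in the find command if the stricter check is wanted.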

Related

Using sed, for every occurrence of a string, except the first, delete two lines

I have a bunch of text files that were generated from TN3270 screens that contain an annoying 2-line header every 24 lines. The first line of each header contains "X310A000", but I want to keep the first occurrence of the header (which is not on the first line).
I can delete all the headers with
sed '/X310A000/{N;d}' $file
but my attempt at printing everything up to the first occurrence and then deleting the rest of the headers is not working:
sed '1,/X310A000/p;/X310A000/,$ /X310A000/{N;d}' $file
sed: -e expression #1, char 28: unknown command: `/'
What can I do?
Add additional curly braces for the second range:
sed '1,/X310A000/p;/X310A000/,${/X310A000/{N;d}}' $file
If you want to give awk a chance then it is much easier:
awk 'index($0, "X310A000") { if (p) {getline; next} else p=1 } 1' file
This command sets a flag p to 1 the first time it encounters the pattern. Once the flag is set, it skips every later line containing the pattern, along with the line that follows it.
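A small check of the awk command on made-up input may help. The seven sample lines are hypothetical. The sed variant at the end is my own assumption rather than part of the answer: it uses b to skip the delete for lines 1 through the first header, so nothing is printed twice and -n is not needed.

```shell
# Hypothetical sample: a header pair, data, a repeated header pair, data
sample=$(printf '%s\n' 'intro' 'X310A000 page' 'header line 2' 'data 1' \
  'X310A000 page' 'header line 2' 'data 2')

# awk: keep the first header pair, drop every later pair
awk_out=$(printf '%s\n' "$sample" |
  awk 'index($0, "X310A000") { if (p) {getline; next} else p=1 } 1')
printf '%s\n' "$awk_out"

# sed sketch: branch to the end of the script (plain print) up to and
# including the first header, then delete each later header pair
sed_out=$(printf '%s\n' "$sample" |
  sed -e '1,/X310A000/b' -e '/X310A000/{N;d;}')
printf '%s\n' "$sed_out"
```

Like the question's setup, the sed variant assumes the first header is not on line 1.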

Regex - how to prevent or work around interference between files when searching through them?

So, I am using a regular expression to search through a bunch of files from a corpus. The point is to find the titles of newspaper articles.
This is what I use:
cat *.txt | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
It finds lines that begin with a capital, followed by any character, but not ending with a dot and that works for these specific files.
The problem is that two files interfere with each other and the dot from the very end of one file shows up in the beginning of another and I get this:
Kõik Kataria jüngrid kinnitavad , et nende elu on pärast naeruklubiga liitumist oluliselt paranenud
.Kosmosepall teeb maailmareisi 39 kilomeetri kõrgusel.
Is there any way to prevent that interference without actually modifying the files or a way to change the regular expression, so that this dot at the beginning is excluded? I must say that I am a beginner, I tried to find solutions, but none of them were specific to my case.
The files probably do not end with a newline, so the last line of the first file is merged with the first line of the second file.
You can try to append a newline on the fly:
find *.txt | xargs -I{} sh -c "cat {}; echo ''" | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
Source: https://stackoverflow.com/a/44675414/580346
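Another option, sketched here on the assumption that nothing else requires piping through cat, is to hand the files to grep directly. grep reads each file separately, so a missing final newline in one file cannot join its last line to the first line of the next file, and -h suppresses the file-name prefix. The directory, file names and the simplified ASCII pattern below are hypothetical; the original -P pattern can go in the same position.

```shell
# Two hypothetical corpus files, both WITHOUT a trailing newline
dir=$(mktemp -d)
printf 'Title one\nBody text.' > "$dir/a.txt"
printf 'Title two\nMore text.' > "$dir/b.txt"

# Each file is scanned on its own; -h omits the "file:" prefix
titles=$(grep -h '^[A-Z].*[^.]$' "$dir"/*.txt)
printf '%s\n' "$titles"

rm -rf "$dir"
```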

Find and replace every second occurence with sed

Hi all, I have the code below:
find . -type f -exec sed -i 's#EText-No.#New EText-No. #g' {} +
I have been using this script to find and replace some characters in multiple files in folders and subfolders.
I have discovered that some values occur more than once, so I need to modify my script to replace only the second instance of an attribute:
find . -type f -exec sed -i '/Subject/{:a;N;/Subject.*Subject/!Ta;s/Subject/SecondSubject/2;}/g' {} +
I am trying to use the code above to achieve this, but it does not seem to work. I also need the code to use "#" as the separator, like the first command, because I have backslash characters in my files.
Any idea how I might make the code work using the separator #?
Thanks for your help.
ORIGINAL FILE BEFORE PROCESSING
<tr><th scope="row">Subject</th><td>United States -- Biography</td></tr><tr><th scope="row">Subject</th><td>United States -- Short Stories</td></tr><tr><th scope="row">EText-No.</th><td>24200</td></tr><tr><th scope="row">Release Date</th><td>2008-01-07</td></tr><tr>
After processing
<tr><th scope="row">Subject</th><td>United States -- Biography</td></tr><tr><th scope="row">SecondSubject</th><td>United States -- Short Stories</td></tr><tr><th scope="row">EText-No.</th><td>24200</td></tr><tr><th scope="row">Release Date</th><td>2008-01-07</td></tr><tr>
Please note that the second Subject is changed from 'Subject' to 'SecondSubject'
Try this:
sed -i '/Subject/{:a;s/\(Subject.*\)Subject/\1SecondSubject/;tb;N;ba;:b}'
If a line appended to the pattern space (with the N command) may itself contain more than one occurrence of the word "Subject", use this command instead. It replaces only the second occurrence in the pattern space, which is the first occurrence in the appended line:
sed -i '/Subject/{:a;/Subject.*Subject/!{N;ba;};s/Subject/newSubject/2;}'
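For the case shown in the question, where both occurrences sit on the same line, a simpler sketch may be enough: the numeric flag on the s command replaces only that numbered match on each line, and # works as the separator just as in the EText-No. command. The shortened sample line is made up for illustration.

```shell
# Shortened, hypothetical version of the sample row from the question
line='<th scope="row">Subject</th><td>Biography</td><th scope="row">Subject</th><td>Short Stories</td>'

# The trailing 2 makes s replace only the second match on the line
changed=$(printf '%s\n' "$line" | sed 's#Subject#SecondSubject#2')
printf '%s\n' "$changed"
```

This does not help when the two occurrences are on different lines; that is the situation the multi-line commands above are meant for.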

Sed on Mac not recognizing regular expressions

In terminal, I am attempting to clean up some .txt files so they can be imported into another program. Only literal search/replaces seem to be working. I cannot get regular expression searches to work.
If I attempt a search and replace with a literal string, it works:
find . -type f -name '*.txt' -exec sed -i '' s/Title Page// {} +;
(remove the words "Title Page" from every text file)
But if I am attempting even the most basic of regular expressions, it does not work:
find . -type f -name '*.txt' -exec sed -i '' s/\n\nDOWN/\\n<DOWN\>/ {} +;
(In every text file, reformat any word "DOWN" that follows a double return: remove the extra newline and put the word in brackets: "\n<DOWN>")
This does not work. The only thing at all "regular expression" about this is looking for the newline.
I must be doing something incorrectly.
Any help is much appreciated.
Update: part 2
John1024's answer helped me out a lot for one aspect.
find . -type f -name '*.txt' -exec sed -i '' '/^$/{N; s/\n[0-9]+/\n/;}' {} +;
Now I am having trouble getting other types of regular expressions to respond properly. In the example above, I want to remove all numbers that appear at the beginning of a line.
Argh! What am I missing?
By default, sed handles only one line at a time. When a line is read into sed's pattern space the newline character is removed.
I see that you want to look for an empty line followed by DOWN and, when found, remove the empty line and change the text to <DOWN>. That can be done. Consider this as the test file (it has an empty line between the two DOWN lines):
$ cat file
some
thing
DOWN
DOWN
other
Try:
$ sed '/^$/{N; s/\nDOWN/<DOWN>/;}' file
some
thing
DOWN
<DOWN>
other
How it works
/^$/
This looks for empty lines. The commands in braces which follow are executed only on empty lines.
{N; s/\nDOWN/<DOWN>/;}
The N command reads the next line into the pattern space, separated from the current line by a newline character.
If the pattern space matches an empty line followed by DOWN, the substitution command, s/\nDOWN/<DOWN>/, removes the newline and replaces the DOWN with <DOWN>.
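The walkthrough above can be reproduced with a scratch file. This is only a sketch: the file is rebuilt with printf, including the empty line before the second DOWN, and the temporary path is arbitrary.

```shell
# Recreate the test file; '' produces the empty line before the second DOWN
f=$(mktemp)
printf '%s\n' some thing DOWN '' DOWN other > "$f"

# On an empty line, pull in the next line and rewrite "\nDOWN" as "<DOWN>"
result=$(sed '/^$/{N; s/\nDOWN/<DOWN>/;}' "$f")
printf '%s\n' "$result"

rm -f "$f"
```

The first DOWN is untouched because no empty line precedes it.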
Special Case: DOS/Windows Files
If a file has DOS/Windows line endings, \r\n, sed will only remove the \n when the line is read in. The \r will remain. When dealing with these files, the presence of that character, if unanticipated, may lead to surprising results.
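For the follow-up in the question's update (removing numbers at the beginning of a line), a possible sketch is shown below. Two things seem to get in the way of the attempted command: the /^$/{N;...} wrapper only acts right after an empty line, and in basic regular expressions + is a literal character, so -E is needed to make it a quantifier. The sample lines are made up.

```shell
# -E switches to extended regular expressions, where + means "one or more"
cleaned=$(printf '%s\n' '12 Chapter One' 'No number here' '7Seven' |
  sed -E 's/^[0-9]+//')
printf '%s\n' "$cleaned"
```

On macOS the in-place form would presumably be sed -i '' -E 's/^[0-9]+//' inside the same find command. Note that a space following the number is kept.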

Comment out file paths in a file matching lines in another file with sed and bash

I have a file (names.txt) with the following content:
/bin/pgawk
/bin/zsh
/dev/cua0
/dev/initctl
/root/.Xresources
/root/.esd_auth
... and so on. I want to read this file line by line, and use sed to comment out matches in another file. I have the code below, but it does nothing:
#!/bin/bash
while read line
do
name=$line
sed -e '/\<$name\>/s/^/#/' config.conf
done < names.txt
Lines in the input file need to be commented out in the config.conf file, as follows:
config {
#/bin/pgawk
#/bin/zsh
#/dev/cua0
#/dev/initctl
#/root/.Xresources
#/root/.esd_auth
}
I don't want to do this by hand, because the file contains more than 300 file paths. Can someone help me figure this out?
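You need to use double quotes around your sed command, otherwise shell variables will not be expanded. Because the paths themselves contain /, the address also needs a different delimiter such as |, and the leading \< has to go, since it cannot match in front of a / character. Try this:
sed "\|$name\>|s|^|#|" config.conf
However, I would recommend that you skip the bash loop entirely and do the whole thing in one go, using awk: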
awk 'NR==FNR{a[$0];next}{for(i=1;i<=NF;++i)if($i in a)$i="#"$i}1' names.txt config.conf
The awk command stores all of the file names as keys in the array a and then loops through every word in each line of the config file, adding a "#" before the word if it is in the array. The 1 at the end means that every line is printed.
It is better not to use regular expression matching here, as some of the characters in your file names (such as .) will be interpreted by the regular expression engine. This approach does a simple string match, which avoids the problem.
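A quick way to try the awk command is on two small scratch files. The file contents below are a hypothetical subset of the question's data, with /bin/zsh deliberately left out of names.txt to show that unlisted paths stay untouched.

```shell
# Hypothetical sample inputs in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' /bin/pgawk /root/.Xresources > "$tmp/names.txt"
printf '%s\n' 'config {' /bin/pgawk /bin/zsh /root/.Xresources '}' > "$tmp/config.conf"

# First file fills the lookup array; second file is printed with
# "#" prepended to every field that is an exact key in the array
commented=$(awk 'NR==FNR{a[$0];next}{for(i=1;i<=NF;++i)if($i in a)$i="#"$i}1' \
  "$tmp/names.txt" "$tmp/config.conf")
printf '%s\n' "$commented"

rm -rf "$tmp"
```

The result goes to standard output; redirecting it to a new file and then moving that over config.conf would be one way to apply it.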