Use of grep + sed based on a pattern file? - regex

Here's the problem: I have ~35k files that may or may not contain one or more of the strings in a list of 300 lines, each containing a regex.
If I run grep -rnwl 'C:\out\' --include=*.txt -E --file='comp.log', I see that a few thousand files contain a match.
Now, how do I get sed to delete every line in these files that matches one of the patterns in comp.log?
Edit: comp.log contains a simple regex on each line, but for the most part each string to be matched is unique.
This is an example of how it is structured:
server[0-9]\/files\/bobba fett.stw
[a-z]+ mochaccino
[2-9] CheeseCakes
...
etc. Silly examples aside, it goes to show that each line is unique save for a few variations, so it shouldn't affect what I really want: to see whether any of these patterns match a line in the file being worked on. It's no different from 's/pattern/replacement/', except that I want to take the patterns from the file instead of writing them inline.
OK, here's an update (S.O. gets impatient if I don't declare the question answered after a few days).
After MUCH fiddling with the @Kenavoz/@Fischer approach, I found a totally different solution, but first things first.
Creating a modified pattern list for sed to work with does work,
as does @werkritter's approach of dropping sed altogether (this is the one I find the... err... "least convoluted" way around the problem).
I couldn't make @Mklement's answer work under Windows/Cygwin (it did work under Ubuntu, so... not sure what that means. Figures.)
What ended up solving the problem in a more... long-term, reusable form was a wonderful program pointed out by a colleague, called PowerGrep. It really blows every other option out of the water. Unfortunately it's Windows-only AND it's not free. (I'm not even advertising here; the thing is not cheap, but it does solve the problem.)
So, considering @werkritter's reply was not a "proper" answer and I can't choose both @Lars Fischer's and @Kenavoz's answers as the solution (they complement each other), I am awarding @Kenavoz the tick mark for being first.
Final thoughts: I was hoping for a simpler, universal and free solution, but apparently there is none.

You can try this :
sed -f <(sed 's/^/\//g;s/$/\/d/g' comp.log) file > outputfile
Each regex in comp.log is turned into a sed address with a d command: /regex/d. This command deletes the lines matching the patterns.
The output of this inner sed is passed as a file (via process substitution) to the -f option of the outer sed, which is applied to file.
To delete just the strings matching the patterns (not the whole line):
sed -f <(sed 's/^/s\//g;s/$/\/\/g/g' comp.log) file > outputfile
Update :
The command output is redirected to outputfile.
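To make the transformation concrete, here is a minimal sketch that writes the generated script to a temporary file instead of using process substitution, so the intermediate /regex/d lines can be inspected. The contents of sample.txt are made up for illustration, and -E is assumed because the patterns use + (GNU sed also accepts -r).

```shell
# Hypothetical sample data; only the three patterns come from the question.
tmp=$(mktemp -d)
printf '%s\n' 'server[0-9]\/files\/bobba fett.stw' \
              '[a-z]+ mochaccino' \
              '[2-9] CheeseCakes' > "$tmp/comp.log"
printf '%s\n' 'keep this line' \
              'one large mochaccino' \
              'GET server3/files/bobba fett.stw' \
              '7 CheeseCakes left' \
              'also kept' > "$tmp/sample.txt"

# Turn every pattern into a "/pattern/d" command.
sed 's/^/\//;s/$/\/d/' "$tmp/comp.log" > "$tmp/comp.sed"
cat "$tmp/comp.sed"

# Apply the generated script; lines matching any pattern are dropped.
result=$(sed -E -f "$tmp/comp.sed" "$tmp/sample.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the two lines that match none of the patterns should remain. A blank line in comp.log would become //d, and an empty regex has a special meaning in sed (it refers to the last regex used), so it is probably worth stripping empty lines from the pattern file first.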

Some ideas, but not a complete solution, as it requires some adapting to your script (not shown in the question).
I would convert comp.log into a sed script containing the necessary deletes:
sed -r "s+(.*)+/\1/ d;+" comp.log > comp.sed
That would make your example comp.sed look like:
/server[0-9]\/files\/bobba fett.stw/ d;
/[a-z]+ mochaccino/ d;
/[2-9] CheeseCakes/ d;
Then I would apply the comp.sed script to each file reported by grep (with your -rnwl options, that would require some filtering to get the filename):
sed -r -i.bak -f comp.sed "$AFileReportedByGrep"
If you have GNU sed, you can use -i for in-place replacement, creating a .bak backup; otherwise, pipe the output to a temporary file.
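The two steps can be combined into a loop, sketched below under a few assumptions: grep -l is used so that only filenames are printed, the directory layout and file contents are invented for the demonstration, and filenames are assumed not to contain newlines.

```shell
# Hypothetical directory tree standing in for C:\out.
work=$(mktemp -d)
mkdir -p "$work/out"
printf '%s\n' '[a-z]+ mochaccino' '[2-9] CheeseCakes' > "$work/comp.log"
printf '%s\n' 'first line' 'a cold mochaccino' 'last line' > "$work/out/a.txt"
printf '%s\n' 'nothing to remove here' > "$work/out/b.txt"

# Build the delete script once.
sed 's/.*/\/&\/d/' "$work/comp.log" > "$work/comp.sed"

# grep -l lists only the files that contain a match; sed edits just those.
grep -rlE --include='*.txt' -f "$work/comp.log" "$work/out" |
while IFS= read -r f; do
    sed -E -i.bak -f "$work/comp.sed" "$f"
done

a_after=$(cat "$work/out/a.txt")
if [ -e "$work/out/a.txt.bak" ]; then a_backup=yes; else a_backup=no; fi
if [ -e "$work/out/b.txt.bak" ]; then b_backup=yes; else b_backup=no; fi
printf '%s\n' "$a_after"
rm -rf "$work"
```

Restricting sed to the files grep reports means untouched files get neither rewritten nor backed up, which matters with ~35k files.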

Both Kenavoz's answer and Lars Fischer's answer use the same ingenious approach:
transform the list of input regexes into a list of sed match-and-delete commands, passed as a file acting as the script to sed via -f.
To complement these answers with a single command that puts it all together, assuming you have GNU sed and your shell is bash, ksh, or zsh (to support <(...)):
find 'c:/out' -name '*.txt' -exec sed -i -r -f <(sed 's#.*#/\\<&\\>/d#' comp.log) {} +
find 'c:/out' -name '*.txt' matches all *.txt files in the subtree of dir. c:/out
-exec ... + passes as many matching files as will fit on a single command line to the specified command, typically resulting only in a single invocation.
sed -i updates the input files in-place (conceptually speaking - there are caveats); append a suffix (e.g., -i.bak) to save backups of the original files with that suffix.
sed -r activates support for extended regular expressions, which is what the input regexes are.
sed -f reads the script to execute from the specified filename, which in this case, as explained in Kenavoz's answer, uses a process substitution (<(...)) to make the enclosed sed command's output act like a [transient] file.
The s/// sed command - which uses alternative delimiter # to facilitate use of literal / - encloses each line from comp.log in /\<...\>/d to yield the desired deletion command; the enclosing of the input regex in \<...\> ensures matching as a word, as grep -w does.
This is the primary reason why GNU sed is required, because neither POSIX EREs (extended regular expressions) nor BSD/OSX sed support \< and \>.
However, you could make it work with BSD/OSX sed by replacing -r with -E, and \< / \> with [[:<:]] / [[:>:]]
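The effect of the \<...\> wrapping can be seen on a small sample. This is a sketch that assumes GNU sed (for \< and \>); the input lines are invented, and temporary script files are used instead of process substitution so that both generated scripts can be compared.

```shell
tmp=$(mktemp -d)
printf '%s\n' '[2-9] CheeseCakes' > "$tmp/comp.log"
printf '%s\n' '12 CheeseCakes' '7 CheeseCakes' 'no match here' > "$tmp/in.txt"

# Plain variant: /regex/d
sed 's#.*#/&/d#' "$tmp/comp.log" > "$tmp/plain.sed"
# Word-bounded variant: /\<regex\>/d (GNU sed only)
sed 's#.*#/\\<&\\>/d#' "$tmp/comp.log" > "$tmp/word.sed"

plain=$(sed -r -f "$tmp/plain.sed" "$tmp/in.txt")
word=$(sed -r -f "$tmp/word.sed" "$tmp/in.txt")
printf 'plain:\n%s\n' "$plain"
printf 'word-bounded:\n%s\n' "$word"
rm -rf "$tmp"
```

Without the boundaries, "12 CheeseCakes" is deleted because "2 CheeseCakes" matches inside it; with them, the line survives because the 2 does not start a word.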

Related

How to use sed to migrate from process.env.MY_VAR to env.get('MY_VAR').required() using sed regex?

I'd like to migrate from dotenv to env-var npm package for a dozen of repositories.
Therefore I am looking for a smart and easy way to search and replace a pattern on every file.
My goal is to move from this pattern process.env.MY_VAR to env.get('MY_VAR').required()
And to move from this pattern process.env.MY_VAR || DEFAULT_VALUE to env.get('MY_VAR').required().default('DEFAULT_VALUE')
For reference, I found this command clear; grep -r "process\.env\." --exclude-dir=node_modules | sed -r -n 's|^.*\.([[:upper:]_]+).*$|\1=|p' > .env.example to generate .env.example
Apparently I can use sed -e "s/pattern/result/" <file list> but I am not sure how to catch the pattern, and return this same pattern in the result.
You have already figured out the main parts of the answer, I think. But I'm unclear about what you mean by MY_VAR: whether it's actually the name MY_VAR, or just a dummy name for all variable names consisting only of uppercase characters and underscores. I expect it to be the latter. Then you could go with something like this:
sed "s/\<process\.env\.\([A-Z_]*\)\>/env.get('\1').required()/" <file list>
This will read all the files and output them all to stdout with the replacement done. But I guess you should use -i for in-place replacement directly in the file (be careful!).
Since you got several replacements you could give each replacement separately like:
sed -i -e "s/pattern1/result1/" -e "s/pattern2/result2/" <file list>
NOTE: The thing described above could for sure be done in multiple other ways, this is only one solution to my interpretation of your problem!
I would suggest that you take some tutorials on regexp to start off with. It is a handy tool that is present in one form or another in most programming languages and programming tools (sed being just one such tool).
sed -E '
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').required().default('\''\3'\'')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').required()\3/g
' myfile
It's essential that the two substitute commands happen in the above order, because the second pattern also matches the first pattern (which we don't want).
The pattern (^|[^[:alnum:]_]) is just a more portable version of the \< word boundary symbol.
Remember you can use the -i flag with sed to edit the file in place.
Running this on the third paragraph in your question (for example), we get:
My goal is to move from this pattern env.get('MY_VAR').required() to env.get('MY_VAR').required() And to move from this pattern env.get('MY_VAR').required().default('DEFAULT_VALUE') to env.get('MY_VAR').required().default('DEFAULT_VALUE')
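The ordering requirement can be checked with a small experiment. In this sketch the two substitute commands are stored in a script file (which also avoids the '\'' quoting), and a second file holds them in reverse order; the sample JavaScript line and the file names are made up for illustration.

```shell
tmp=$(mktemp -d)
# Same two commands as above, kept in a file so the quotes need no escaping.
cat > "$tmp/migrate.sed" <<'EOF'
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\2').required().default('\3')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\2').required()\3/g
EOF
# Reversed copy, to show what goes wrong with the other order.
{ sed -n '2p' "$tmp/migrate.sed"; sed -n '1p' "$tmp/migrate.sed"; } > "$tmp/reversed.sed"

line='const port = process.env.PORT || 3000; const key = process.env.API_KEY;'
in_order=$(printf '%s\n' "$line" | sed -E -f "$tmp/migrate.sed")
reversed=$(printf '%s\n' "$line" | sed -E -f "$tmp/reversed.sed")
printf '%s\n' "$in_order" "$reversed"
rm -rf "$tmp"
```

In the reversed run the simple rule fires first and consumes process.env.PORT, so the "|| 3000" fallback is left behind instead of becoming .default('3000').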

Using find and sed to do batch find/replace using HTML tags

I need to add a javascript tag immediately before the closing </head> tag on all pages on my site (both .html and .htm). I have used guidance from another question here, but my command isn't working:
find . -name "*.htm*" -print | xargs sed -i 's/<\/head>/<script language="javascript" src="https://secure.appprovider.com.js"><\/script><script language="javascript">InitiateCall('767576styisgsjgshshskshjkshkshs');<\/script><\/head>/g'
I get:
sed: 1: "./index.html": invalid command code .
I assume the issue is to do with the regex and me needing to escape particular characters?
Yes, the pattern and replacement contains / (the ones after https:), which will confuse sed.
You need to escape these, but using a different delimiter for the sed command may make it more readable:
's#</head>#<script (etc.) ;</script></head>#g'
Additionally, some sed implementations (on BSD systems, like OS X) require an argument for the -i flag. You may give it an empty string by specifying ''.
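Putting both points together, a minimal sketch might look like this. The script URL https://example.com/app.js and the page content are placeholders, not the asker's real snippet, and -i.bak (suffix attached to the flag) is used because that spelling should be accepted by both GNU and BSD sed.

```shell
tmp=$(mktemp -d)
printf '%s\n' '<html><head><title>t</title></head><body></body></html>' > "$tmp/index.html"

# '#' as delimiter, so the slashes in the URL and tags need no escaping.
find "$tmp" \( -name '*.html' -o -name '*.htm' \) -exec \
    sed -i.bak 's#</head>#<script src="https://example.com/app.js"></script></head>#g' {} +

patched=$(cat "$tmp/index.html")
printf '%s\n' "$patched"
rm -rf "$tmp"
```

Note that the replacement must not contain the chosen delimiter or an unescaped &; if the real snippet contains single quotes, keeping the sed commands in a file and passing it with -f avoids shell quoting problems.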

Converting Regex to Sed

I have the following regex.
/http:\/\/([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+:[a-zA-Z0-9\-]+\/[a-zA-Z]+\.[a-zA-Z]+/g
Which identifies matching URL's (https://regex101.com/r/sG9zR7/1). I need to modify it in order to be able to use it on the command line so it prints out the results. so I modified it to following
sed -n 's/.*\(http:\/\/\([a-zA-Z0-9\-]+\.\)+[a-zA-Z0-9\-]+:[a-zA-Z0-9\-]+\/[a-zA-Z]+\.[a-zA-Z]+\).*/\1/p' filename
(I was trying to add bold to the characters added but could not)
there were as follows
sed -n 's/.*( (in the beginning )
\ (For the inner parenthesis)
).*/\1/p' filename (at the end)
However, i get no results when i execute it.
Make it a habit to use a delimiter other than / when dealing with URLs. It makes the pattern easier to read.
sed -r -n 's~.*(http://([a-z0-9\-]+\.)+[a-z0-9\-]+:[a-z0-9\-]+/[a-z]+\.[a-z]+).*~\1~ip' file
Note that I use the i modifier to ignore case.
As hwnd comments, you should pass the -r flag to sed as well, since your pattern requires + to be treated as a quantifier. With -r, grouping parentheses are written unescaped.
The working command is:
sed -rn 's~.*(http://([a-z0-9\-]+.)*[a-z0-9\-]+:[0-9]+\/[a-z0-9]+.[a-z]+).*~\1~ip' Filename
With the help of the sample supplied (thank you hjpotler92), I was able to figure out that the escape character did not need to be applied to certain characters. I will have to find out when and how it is applied when using the -r option.
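The difference between the two syntaxes can be seen side by side. This sketch uses a made-up input line and -E (the portable spelling of -r); in basic syntax an unescaped + is a literal plus sign, which is why the original attempt printed nothing.

```shell
line='see http://sub.example.com:8080/index.html for details'

# Basic regex: \( \) group, but + is a literal character, so nothing matches.
bre=$(printf '%s\n' "$line" |
    sed -n 's~.*\(http://\([a-z0-9-]+\.\)+[a-z0-9-]+:[0-9]+/[a-z0-9]+\.[a-z]+\).*~\1~p')

# Extended regex: ( ) group and + means "one or more".
ere=$(printf '%s\n' "$line" |
    sed -En 's~.*(http://([a-z0-9-]+\.)+[a-z0-9-]+:[0-9]+/[a-z0-9]+\.[a-z]+).*~\1~p')

printf 'basic:    [%s]\n' "$bre"
printf 'extended: [%s]\n' "$ere"
```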
You can achieve the same with an xpath query via xidel:
xidel file.html -e '//a/@href[fn:matches(.,"http://[^/]*:")]/fn:substring-after(.,"=")'

Replace more than 150000 character with sed

I want to replace this LONG string with sed.
I got the string from grep and stored it in the variable var.
Here is my grep command and my var variable:
var=$(grep -P -o "[^:]//.{0,}" /home/lazuardi/project/assets/static/admin/bootstrap3/css/bootstrap.css.map | grep -P -o "//.{0,}")
Here is the output from grep: string
Then I try to replace it with this sed command:
sed -i "s|$var||g" /home/lazuardi/project/assets/static/admin/bootstrap3/css/bootstrap.css.map
But it gives me the output bash: /bin/sed: Argument list too long
How can I replace it?
NB: That string has 183544 characters on one line.
What are you actually trying to accomplish here? sed is line-oriented, so you cannot replace a multi-line string (not even if you replace literal newlines with \n .... Well, there are ways to write a sed script which effectively replaces a sequence of lines, but it gets tortured quickly).
bash$ var=$(head -n 2 /etc/mtab)
bash$ sed "s|$var||" /etc/mtab
sed: -e expression #1, char 25: unterminated `s' command
bash$ sed "s|${var//$'\n'/\\n}||" /etc/mtab | diff -u /etc/mtab -
bash$ # (didn't replace anything, so no output)
As a workaround, what you probably want could be approached by replacing the newlines in $var with \| (or possibly just |, depending on your sed dialect) similarly to what was demonstrated above, but you'd still be bumping into the ARG_MAX limit and have a bunch of other pesky wrinkles to iron out, so let's not go there.
However, what you are attempting can be magnificently completed by sed itself, all on its own. You don't need a list of the strings; after all, sed too can handle regular expressions (and nothing in the regex you are using actually requires Perl extensions, so the -P option is by and large superfluous).
sed -i 's%\([^:]\)//.*%\1%' file
There is a minor caveat -- if there are strings which occur both with and without : in front, your original command would have replaced them all (if it had worked), whereas this one will only replace the occurrences which do not have a colon in front. That means comments at beginning of line will not be touched -- if you want them removed too, just add a line anchor as an alternative; sed -i 's%\(^\|[^:]\)//.*%\1%' file
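The difference between the two variants can be tried on a few sample lines. This sketch uses -E with unescaped ( ) and | rather than the \( \) and \| spelling above, on the assumption that the extended form is accepted by more sed implementations; the CSS-like input is invented.

```shell
tmp=$(mktemp -d)
printf '%s\n' '// leading comment' \
              'a { color: red } // trailing comment' \
              'b { background: url(http://example.com/x.png) }' > "$tmp/sample.css"

# Only removes "//..." when some character other than ':' precedes it.
inner_only=$(sed -E 's%([^:])//.*%\1%' "$tmp/sample.css")
# Also removes "//..." at the very start of a line.
with_anchor=$(sed -E 's%(^|[^:])//.*%\1%' "$tmp/sample.css")

printf -- '--- without anchor ---\n%s\n' "$inner_only"
printf -- '--- with anchor ---\n%s\n' "$with_anchor"
rm -rf "$tmp"
```

In both runs the http:// URL is left alone because its slashes follow a colon.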
If you want the comments in var for other reasons, the grep can be cleaned up significantly, too. (Obviously, you'd run this before performing the replacement.)
var=$(grep -P -o '[^:]\K//.*' file)
(The \K extension is one which genuinely requires -P. And of course, the common, clear, standard, readable, portable, obvious, simple way to write {0,} is *.)
On most systems these days, the value of ARG_MAX is big enough to handle 150k without problems, but it is important to note that while the limit is called ARG_MAX and the error message indicates that the command line is too long, the real limit is the sum of the sizes of the arguments and all (exported) environment variables. Also, Linux imposes a limit of 128k (131,072 bytes) for a single argument string. Exceeding any of these limits triggers an error return of E2BIG, which is printed as "Argument list too long".
In any case, bash built-ins are exempt from the limit, so you should be able to feed the command into sed as a command file:
echo "s|$var||g" | sed -f - -i /home/lazuardi/project/assets/static/admin/bootstrap3/css/bootstrap.css.map
That may not help you much, though. Your variable is full of regex metacharacters, so it will not match the string itself. You'll need to clean it up in order to be able to use it as a regular expression.
There's probably a cleaner way to do that edit, though.
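One possible way to do that cleanup is sketched below: it backslash-escapes the characters that are special in a basic regex, plus the / delimiter, and hands the resulting command to sed through a script file so it never appears on the command line. The escape set, the sample string and the file names are assumptions for illustration; a value spanning several lines would need more work.

```shell
# Hypothetical literal text containing regex metacharacters.
literal='// see a.b*c [note] $1'

# Escape \ . * [ ^ $ and the / delimiter by prefixing each with a backslash.
escaped=$(printf '%s\n' "$literal" | sed 's#[\.*[^$/]#\\&#g')
printf '%s\n' "$escaped"

tmp=$(mktemp -d)
printf '%s\n' 'keep // see a.b*c [note] $1' \
              'keep // see aXbbc n $1' > "$tmp/in.txt"

# Write the s command to a file and run it with -f.
printf 's/%s//g\n' "$escaped" > "$tmp/strip.sed"
stripped=$(sed -f "$tmp/strip.sed" "$tmp/in.txt")
printf '%s\n' "$stripped"
rm -rf "$tmp"
```

Only the first line contains the literal text, so only that line is shortened; the second line would have matched the unescaped pattern's wildcards but is left untouched here.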

Sed substitution not doing what I want and think it should do

I am trying to use sed to get some info that is encoded within the path of a file which is passed as a parameter to my script (Bourne sh, if it matters).
From this example path, I'd like the result to be 8
PATH=/foo/bar/baz/1-1.8/sing/song
I first got the regex close by using sed as grep:
echo $PATH | sed -n -e "/^.*\/1-1\.\([0-9][0-9]*\).*/p"
This properly recognized the string, so I edited it to make a substitution out of it:
echo $PATH | sed -n -e "s/^.*\/1-1\.\([0-9][0-9]*\).*/\1/"
But this doesn't produce any output. I know I'm just not seeing something simple, but would really appreciate any ideas about what I'm doing wrong or about other ways to debug sed regular expressions.
(edit)
In the example path, the components other than the numerical one can contain numbers similar to the numeric path component that I listed, but not quite the same. I'm trying to exactly match the component that is 1-1.<some-number> and see what some-number is.
It is also possible to have an input string that the regular expression should not match, and which should produce no output.
The -n option to sed suppresses normal output, and since your second line doesn't have a p command, nothing is output. Get rid of the -n or stick a p back on the end.
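The interaction between -n and the p flag can be checked directly. In this sketch the variable names are illustrative (and deliberately not PATH); the second input is a made-up path that does not contain the 1-1. component.

```shell
sample='/foo/bar/baz/1-1.8/sing/song'
cmd='s/^.*\/1-1\.\([0-9][0-9]*\).*/\1/'

# -n without p: the substitution happens, but nothing is printed.
silent=$(printf '%s\n' "$sample" | sed -n -e "$cmd")
# -n with p: only lines where the substitution succeeded are printed.
printed=$(printf '%s\n' "$sample" | sed -n -e "${cmd}p")
# Non-matching input: -n with p prints nothing, plain sed echoes the input.
nomatch_quiet=$(printf '%s\n' '/foo/bar' | sed -n -e "${cmd}p")
nomatch_plain=$(printf '%s\n' '/foo/bar' | sed -e "$cmd")

printf '[%s] [%s] [%s] [%s]\n' "$silent" "$printed" "$nomatch_quiet" "$nomatch_plain"
```

Since non-matching input should produce no output, keeping -n and adding p is probably the better of the two fixes here.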
It looks like you're trying to get the 8 from the 1-1.8 (where 8 is any sequence of numerics), yes? If so, I would just use:
echo /foo/bar/baz/1-1.8/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
No doubt you could get it working with one sed "instruction" (-e) but sometimes it's easier just to break it down.
The first strips out everything from the start up to and including 1-1., the second strips from the first non-numeric after that to the end.
$ echo /foo/bar/baz/1-1.8/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
8
$ echo /foo/bar/baz/1-1.752/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
752
And, as an aside, this is actually how I debug sed regular expressions. I put simple ones in independent instructions (or independent part of a pipeline for other filtering commands) so I can see what each does.
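As a small illustration of that debugging habit, the two instructions can be run as separate pipeline stages so the intermediate text is visible. The variable names are only for the demonstration.

```shell
path_in='/foo/bar/baz/1-1.752/sing/song'

# Stage 1: drop everything up to and including "/1-1."
step1=$(printf '%s\n' "$path_in" | sed -e 's/.*\/1-1\.//')
# Stage 2: drop everything from the first non-digit onwards.
step2=$(printf '%s\n' "$step1" | sed -e 's/[^0-9].*//')

printf 'after step 1: %s\n' "$step1"
printf 'after step 2: %s\n' "$step2"
```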
Following your edit, this also works:
$ echo /foo/bar/baz/1-1.962/sing/song | sed -e "s/.*\/1-1\.\([0-9][0-9]*\).*/\1/"
962
As to your comment:
In the example path, the components other than the numerical one can contain numbers similar to the numeric path component that I listed, but not quite the same. I'm trying to exactly match the component that is 1-1.<some-number> and see what some-number is.
The two-part sed command I gave you should work with numerics anywhere in the string (as long as there's no 1-1. after the one you're interested in). That's because it actually deletes up to the specific 1-1. string, and thereafter from the first non-numeric. If you have some examples that don't work as expected, toss them into the question as an update and I'll adjust the answer.
You can shorten your command by using + (one or more) instead of * (zero or more); note that -n needs the p flag to print anything:
sed -n -e "s/^.*\/1-1\.\([0-9]\+\).*/\1/p"
Don't use PATH as your variable name. It clashes with the PATH environment variable, which the shell uses to find commands such as sed.
echo "$path" | sed -e 's/.*1-1\.//;s/\/.*//'
You needn't delimit your patterns with / (s/a/b/g); you may choose almost any character, so if you're dealing with paths, # is more useful than /:
echo /foo/1-1.962/sing | sed -e "s#.*/1-1\.\([0-9]\+\).*#\1#"