I am trying to match some methods in a bunch of Python scripts if certain conditions are met. The first thing I am looking at is whether import re exists in a file, and if it does, then find all cases of re.sub(something). I tried following the documentation here on how to use if-then-without-else regexes, but I can't seem to make it work with ripgrep, with or without PCRE2.
My next approach was to use groups, so rg -n "(^import.+re)|(re\.sub.+)" -r '$2', but the issue with this approach is that because the first import group matches, I get a lot of empty matches back in my output. The $2 is being handled correctly.
I am hoping to avoid an or-group capture and to use the regex if option if possible.
To summarize, what I am hoping for is: if import re appears anywhere in a file, then search for re\.sub.+ and output only the matching files and lines using ripgrep. Using ripgrep is a hard dependency.
Some sample code:
import re

for i in range(10):
    re.match(something)
    print(i)

re.sub(something)
This can be accomplished pretty easily with a shell pipeline and xargs. The idea is to use the first regex as a filter for which files to search in, and the second regex to show the places where re.sub occurs.
Here are three Python files to test with.
import-without-sub.py has an import re but no re.sub:
import re

for i in range(10):
    re.match(something)
    print(i)
import-with-sub.py has both an import re and an re.sub:
import re

for i in range(10):
    re.match(something)
    print(i)

re.sub(something)
And finally, no-import.py has no import re but does have a re.sub:
for i in range(10):
    re.match(something)
    print(i)

re.sub(something)
And now here's the command to show only matches of re.sub in files that contain import re:
rg '^import\s+re$' --files-with-matches --null | xargs -0 rg -F 're.sub('
--files-with-matches and --null print out all matching file paths separated by a NUL byte. xargs -0 then reads those file paths and turns them into arguments to be given to rg -F 're.sub('. (We use --null and -0 in order to correctly handle file names that contain spaces.)
Its output in a directory with all three of the above files is:
import-with-sub.py
7:re.sub(something)
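If you ever want the same two-pass logic without a shell pipeline, a rough Python sketch of the idea looks like this (purely illustrative, not part of ripgrep; the directory walk and patterns are assumptions):

import os
import re

# Sketch of the pipeline above: pass 1 keeps only files that import re,
# pass 2 prints the re.sub lines from those files with line numbers.
import_pat = re.compile(r'^import\s+re$', re.MULTILINE)
sub_pat = re.compile(r're\.sub\(')

for name in os.listdir('.'):
    if not name.endswith('.py'):
        continue
    with open(name) as f:
        text = f.read()
    if import_pat.search(text):
        for lineno, line in enumerate(text.splitlines(), 1):
            if sub_pat.search(line):
                print('{}:{}:{}'.format(name, lineno, line))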
Related
I have a bunch of files on a Linux machine. I want to find whether any of those files have the string foo123 bar, AND the string foo123 must not appear before that foo123 bar.
Plot twist: I want the search to do this for any number instead of "123", without me having to specify a specific number.
How can I do that?
A solution with Python's newer regex module:
import regex as re
string = """
I have a bunch of files on a Linux machine. I want to find whether any of those files have the string foo123 bar#12, AND the string foo123 must not appear before that foo123 bar#34 .
Plot twist: I want the search to do this for any number instead of "123", without me having to specify a specific number.
How can I do that?
"""
rx = re.compile(r'(?<!foo\d(?s:.*))foo123 bar#\w+')
print(rx.findall(string))
# ['foo123 bar#12']
This makes use of the regex module's infinite-width lookbehind and the inline single-line mode ((?s:.*)).
Well, that's a tricky one. Here's an imperfect solution:
grep . -Prle '(?s)(?<ref>foo\d+)\b(?! bar).*\k<ref>(*SKIP)(*FAIL)|foo\d+ bar'
Why is it imperfect? Because if you have a file containing foo123 foo456 bar foo123 bar, it won't detect the foo456 bar part. If this situation cannot happen in your set of files, then I suppose you're fine.
This makes use of the (*SKIP)(*FAIL) trick; once you learn that, the rest of the pattern should be pretty clear.
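If those verbs are new to you, here is a tiny illustrative sketch using Python's third-party regex module, which also supports them (the pattern and sample string are made up just for this demo):

import regex  # third-party module; the standard re module has no (*SKIP)/(*FAIL)

# The first alternative matches the text we want to throw away, then
# (*SKIP)(*FAIL) forces a failure and forbids retrying inside that span,
# so only the second alternative produces matches.
text = 'foo123 foo456 bar foo789 bar'
pat = regex.compile(r'foo123\b(*SKIP)(*FAIL)|foo\d+ bar')
print(pat.findall(text))  # ['foo456 bar', 'foo789 bar']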
So maybe plain regex isn't the best solution here; let's just write a one-liner script instead:
find . -type f -execdir perl -e 'while(<>) { while(/foo(\d+)( bar)?/g) { if ($2) { exit 0 if !$n{$1} } else { $n{$1} = 1 } } } exit 1;' {} \; -print
That one does the job and is hopefully more understandable :)
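For comparison, here's a rough Python sketch of the same logic (not a drop-in replacement for the find command; the file_matches helper is just for illustration):

import re
import sys

# Same idea as the Perl one-liner: a file matches as soon as we see
# "fooN bar" for an N that has not already appeared earlier as a bare "fooN".
def file_matches(path):
    seen = set()
    with open(path) as f:
        for line in f:
            for num, bar in re.findall(r'foo(\d+)( bar)?', line):
                if bar:
                    if num not in seen:
                        return True
                else:
                    seen.add(num)
    return False

if __name__ == '__main__':
    for path in sys.argv[1:]:
        if file_matches(path):
            print(path)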
I'm trying to come up with a sed greedy expression which ignores the stuff inside HTML quotes and ONLY matches the text of that element.
<p alt="100">100</p> #need to match only second 100
<img src="100.jpg">100</img> #need to match only second 100
<span alt="tel:100">100</span> #need to match only second 100
These are my attempts:
grep -E '(!?\")100(!?\")' html # this matches string as well as quotes
grep -E '[^\"]100[^\"]' html # this doesn't work either
Edit
OK, I was trying to simplify the question, but maybe that's wrong.
With the command sed -r '/?????/__replaced__/g' file I would need to see:
<p alt="100">__replaced__</p>
<img src="100.jpg">__replaced__</img>
<span alt="tel:100">__replaced__</span>
I don't think handling HTML with sed (or grep) is a good idea. Consider using python, which has an HTML push parser in its standard library. This makes separating tags from data easy. Since you only want to handle the data between tags, it could look something like this:
#!/usr/bin/python
from HTMLParser import HTMLParser
from sys import argv

class MyParser(HTMLParser):
    def handle_data(self, data):
        # data is the string between tags. You can do anything you like with it.
        # For a simple example:
        if data == "100":
            print data

# First command line argument is the HTML file to handle.
with open(argv[1], "r") as f:
    MyParser().feed(f.read())
Update for updated question: To edit HTML with this, you'll have to implement the handle_starttag and handle_endtag methods as well as handle_data in a manner that reprints the parsed tags. For example:
#!/usr/bin/python
from HTMLParser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        stdout.write("<" + tag)
        for k, v in attrs:
            stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")
    def handle_endtag(self, tag):
        stdout.write("</{}>".format(tag))
    def handle_data(self, data):
        data = re.sub("100", "__replaced__", data)
        stdout.write(data)

with open(argv[1], "r") as f:
    MyParser().feed(f.read())
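If you're on Python 3, the same approach should work with only a small change, since the parser class moved to html.parser; roughly (a sketch, not tested against your exact input):

#!/usr/bin/python3
from html.parser import HTMLParser
from sys import stdout, argv
import re

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Reprint the opening tag with its attributes untouched.
        stdout.write("<" + tag)
        for k, v in attrs:
            stdout.write(' {}="{}"'.format(k, v))
        stdout.write(">")
    def handle_endtag(self, tag):
        stdout.write("</{}>".format(tag))
    def handle_data(self, data):
        # Only text between tags is rewritten; attribute values are left alone.
        stdout.write(re.sub("100", "__replaced__", data))

with open(argv[1], "r") as f:
    MyParser().feed(f.read())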
The first warning is that HTML is not a good idea to parse with regular expressions; generally speaking, the answer is to use an HTML parser. Most scripting languages (Perl, Python, etc.) have HTML parsers.
See here for an example as to why: RegEx match open tags except XHTML self-contained tags
If you really must though:
/(?!\>)([^<>]+)(?=\<)/
You may try the below PCRE regex.
grep -oP '"[^"]*100[^"]*"(*SKIP)(*F)|\b100\b' file
or
grep -oP '"[^"]*"(*SKIP)(*F)|\b100\b' file
This matches the number 100 only when it is not inside double quotes.
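If you want to sanity-check the pattern outside grep, Python's third-party regex module understands the same verbs ((*FAIL) is the long spelling of (*F); the sample string below just repeats the lines from the question):

import regex  # third-party module with PCRE-style backtracking verbs

html = '<p alt="100">100</p> <img src="100.jpg">100</img> <span alt="tel:100">100</span>'

# Quoted attribute values are consumed and discarded by the first branch,
# so only the bare 100 between the tags is reported.
pat = regex.compile(r'"[^"]*"(*SKIP)(*FAIL)|\b100\b')
print(pat.findall(html))  # ['100', '100', '100']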
Your question's gotten kind of muddy through its evolution, but is this what you're asking for?
$ sed -r 's/>[^<]+</>__replaced__</' file
<p alt="100">__replaced__</p> #need to match only second 100
<img src="100.jpg">__replaced__</img> #need to match only second 100
<span alt="tel:100">__replaced__</span> #need to match only second 100
If not please clean up your question to just show the latest sample input and expected output and explanation.
Can someone explain to me why my sed command isn't working? I'm sure I'm doing something stupid. Here's a small text file that demonstrates my issue:
#!/usr/bin/env python
class A:
    def candy(self):
        print "cane"
Put that in a file and call it test.py
My goal is to add #profile before the def line, with the same indentation as the function declaration. I tried this:
$ sed -i '/\( *\)def /i \
\1#profile' test.py
Note that the capture group should be the set of spaces before the def and I'm referencing the group with \1.
Here's my result:
#!/usr/bin/env python
class A:
1#profile
    def candy(self):
        print "cane"
Why is that 1 being placed in there literally instead of being replaced by my capture group (four spaces)?
Thanks!
I don't know this to be true, but I'm going to assume that sed doesn't carry captures from address selectors into manually inserted text, and in fact may not be evaluating backreferences inside "literal" insert text at all.
Try sed -e 's/\( *\)def /\1#profile\n&/' test.py instead.
What about this:
sed -i -e 's/^\(.*\)\(def.*\)/\1#profile\n\2/' test.py
Just use awk:
$ awk '{orig=$0} sub(/def.*/,"#profile"); {print orig}' file
#!/usr/bin/env python
class A:
    #profile
    def candy(self):
        print "cane"
simple, portable, easily extendable, debuggable, etc., etc....
I'm after a way to batch rename files with a regex, e.g.
s/123/onetwothree/g
I recall I can use awk and sed with a regex but couldn't figure out how to pipe them together for the desired output.
You can install the Perl-based rename utility:
brew install rename
and then just use it like:
rename 's/123/onetwothree/g' *
If you'd like to test your regex without renaming any files, just add the -n switch.
An efficient way to perform the rename operation is to construct the rename commands in a sed pipeline and feed them into the shell.
ls |
sed -n 's/\(.*\)\(123\)\(.*\)/mv "\1\2\3" "\1onetwothree\3"/p' |
sh
Namechanger is super nice. It supports regular expressions for search and replace: consider that I am doing a super complex rename with the following regex:
\.sync-conflict-.*\.
That's a life saver.
Regex capture groups (Diomidis's answer being the CLI way) go into variables, I think called $1 and $2, so rename -nv 's/^(\d{2})\.(\d{2}).*/s$1e$2.mp4/' *.mp4 becomes possible. Notice the $1 and $2? Those come from capture groups one (\d{2}) and two (\d{2}) in my example.
Here's my take on a friendly recursive regex file name renamer, which by default only simulates the replacement and shows what the resulting file names would be.
Use -w to actually write changes when you are satisfied with the dry run result, -s to suppress displaying non-matching files; -h or --help will show usage notes.
Simplest usage:
# replace all occurrences of 'foo' with 'bar'
# "foo-foo.txt" >> "bar-bar.txt"
ren.py . 'foo' 'bar' -s
# only replace 'foo' at the beginning of the filename
# "foo-foo.txt" >> "bar-foo.txt"
ren.py . '^foo' 'bar' -s
Matching groups (e.g. \1, \2 etc) are supported too:
# rename "spam.txt" to "spam-spam-spam.py"
ren.py . '(.+)\.txt' '\1-\1-\1.py' -s
# rename "12-lovely-spam.txt" to "lovely-spam-12.txt"
# (assuming two digits at the beginning and a 3-character extension)
ren.py . '^(\d{2})-(.+)\.(.{3})' '\2-\1.\3' -s
NOTE: don't forget to add -w once you've tested the results and want to actually write the changes.
Works both with Python 2.x and Python 3.x.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import print_function
import argparse
import os
import fnmatch
import sys
import shutil
import re
def rename_files(args):
    pattern_old = re.compile(args.search_for)
    for path, dirs, files in os.walk(os.path.abspath(args.root_folder)):
        for filename in fnmatch.filter(files, "*.*"):
            if pattern_old.findall(filename):
                new_name = pattern_old.sub(args.replace_with, filename)
                filepath_old = os.path.join(path, filename)
                filepath_new = os.path.join(path, new_name)
                if not new_name:
                    print('Replacement regex {} returns empty value! Skipping'.format(args.replace_with))
                    continue
                print(new_name)
                if args.write_changes:
                    shutil.move(filepath_old, filepath_new)
            else:
                if not args.suppress_non_matching:
                    print('Name [{}] does not match search regex [{}]'.format(filename, args.search_for))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Recursive file name renaming with regex support')
    parser.add_argument('root_folder',
                        help='Top folder for the replacement operation',
                        nargs='?',
                        action='store',
                        default='.')
    parser.add_argument('search_for',
                        help='string to search for',
                        action='store')
    parser.add_argument('replace_with',
                        help='string to replace with',
                        action='store')
    parser.add_argument('-w', '--write-changes',
                        action='store_true',
                        help='Write changes to files (otherwise just simulate the operation)',
                        default=False)
    parser.add_argument('-s', '--suppress-non-matching',
                        action='store_true',
                        help='Hide files that do not match',
                        default=False)
    args = parser.parse_args(sys.argv[1:])
    print(args)
    rename_files(args)
files="*"
for f in $files; do
    newname=`echo "$f" | sed 's/123/onetwothree/g'`
    mv "$f" "$newname"
done
Say I have a line in a file "This is perhaps the easiest place to add new functionality." and I want to grep two words close to each other. I do
grep -ERHn "\beasiest\W+(?:\w+\W+){1,6}?place\b" *
that works and gives me the line. But when I do
grep -ERHn "\beasiest\W+(?:\w+\W+){1,10}?new\b" *
it fails, defeating the whole point of the {1,10}?
This one is listed on the regular-expressions.info site and also in a couple of regex books. They do not describe it with grep, but that should not matter.
Update
I put the regex into a Python script. It works, but doesn't have the nice grep -C thing ...
#!/usr/bin/python
import re
import sys
import os

word1 = sys.argv[1]
word2 = sys.argv[2]
dist = sys.argv[3]

regex_string = (r'\b(?:'
                + word1
                + r'\W+(?:\w+\W+){0,'
                + dist
                + '}?'
                + word2
                + r'|'
                + word2
                + r'\W+(?:\w+\W+){0,'
                + dist
                + '}?'
                + word1
                + r')\b')

regex = re.compile(regex_string)

def findmatches(PATH):
    for root, dirs, files in os.walk(PATH):
        for filename in files:
            fullpath = os.path.join(root, filename)
            with open(fullpath, 'r') as f:
                matches = re.findall(regex, f.read())
                for m in matches:
                    print "File:", fullpath, "\n\t", m

if __name__ == "__main__":
    findmatches(sys.argv[4])
Calling it as
python near.py charlie winning 6 path/to/charlie/sheen
works for me.
Do you really need the look ahead structure?
Maybe this is enough:
grep -ERHn "\beasiest\W+(\w+\W+){1,10}new\b" *
Here is what I get:
echo "This is perhaps the easiest place to add new functionality." | grep -EHn "\beasiest\W+(\w+\W+){1,10}new\b"
(standard input):1:This is perhaps the easiest place to add new functionality.
Edit
As Camille Goudeseune said:
To make it easily usable, this can be added in a .bashrc:
grepNear() {
    grep -EHn "\b$1\W+(\w+\W+){1,10}$2\b"
}
Then at a bash prompt: echo "..." | grepNear easiest new
grep does not support the non-capturing groups of Python regular expressions. When you write something like (?:\w+\W+), you are asking grep to match a question mark ? followed by a colon : followed by one or more word chars \w+ followed by one or more non-word chars \W+. ? is a special character for grep regexes, for sure, but since it is following the beginning of a group, it is automatically escaped (in the same way that the regex [?] matches the question mark).
Let's test it. I have the following file:
$ cat file
This is perhaps the easiest place to add new functionality.
grep does not match it with the expression you used:
$ grep -ERHn "\beasiest\W+(?:\w+\W+){1,10}?new\b" file
Then, I created the following file:
$ cat file2
This is perhaps the easiest ?:place ?:to ?:add new functionality.
Note that each word is preceded by ?:. In this case, your expression matches the file:
$ grep -ERHn "\beasiest\W+(?:\w+\W+){1,10}?new\b" file2
file2:1:This is perhaps the easiest ?:place ?:to ?:add new functionality.
The solution is to remove the ?: from the expression:
$ grep -ERHn "\beasiest\W+(\w+\W+){1,10}?new\b" file
file:1:This is perhaps the easiest place to add new functionality.
Since you do not even need a non-capturing group (at least as far as I can see), this is not a problem.
Bonus point: you can simplify your expression by changing {1,10} to {0,10} and dropping the ? after it:
$ grep -ERHn "\beasiest\W+(\w+\W+){0,10}new\b" file
file:1:This is perhaps the easiest place to add new functionality.