Using grep to find dynamic text - regex

Need help with a bash script. We are modifying our database structure; the problem is we have many live sites with pre-written queries referencing the current database structure. I need to find all of our scripts with references to MySQL tables. Here is what I started:
grep -ir 'from' /var/www/sites/inspection.certifymyshop.com/ > resultsList.txt
I am trying to grep through our scripts recursively and export ALL table names found to a text file; we can use the "->from" and "->join" prefixes to help us:
->from('databaseName.table_name dtn') // dtn = table alias
OR
->join('databaseName.table_name dtn') // dtn = table alias
I need to find the database and table name within the single quotes (i.e. databaseName.table_name). I also need to list the filename this was found in underneath or next to the match like so:
someDatabaseName.someTableName | /var/www/sites/blah.com/index.php | line 36

Try doing this:
grep -oPriHn -- "->(?:from|join)\('\K[^']+" . |
awk -F'[ :]' '{print $3, "|", $1, "| line " $2}'
If this fits your needs, I can explain the snippet more as well.
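For example, if a hypothetical file ./index.php contained the ->from('databaseName.table_name dtn') call from the question on its line 36, the pipeline would print:
databaseName.table_name | ./index.php | line 36
(the alias dtn is dropped because the space is also treated as a field separator by awk).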

The one problem you have with only using grep is removing the from, join or whatever identifying prefix from the result. To fix this, we can also use sed:
grep -EHroi -- '->(from|join)\('\''[^'\'' ]*' /path/to/files | sed -re 's/:.*(from|join)\('\''/:/g'
You could also use sed alone in a for loop
for i in `find /path/to/files -type f -print`
do
echo $i
sed -nre 's/^.*->(from|join)\('\''([^'\'' ]*)['\'' ].*$/\2/gp' $i
done
Edit: The above for loop breaks on filenames with spaces, so here's the previous sed statement using find instead:
find ./ -type f -exec sh -c "echo {} ; sed -nre 's/^.*->(from|join)\('\''([^'\'' ]*)['\'' ].*$/\2/gp' \"{}\" ;" \;

Pattern-based filename filtering in gnu shell command

Let's say I have an active/ directory that contains these files:
active/
foo.bar.abc
foo.bar.xyz
foo.bat.abc
archive/
foo.bat.xyz
I want to write a command that outputs only unique filenames in active/ (uniqueness based on the middle item) AND that don't match any files already in archive/ (again based on that middle term).
Sample output:
foo.bar.abc
Explanation: either foo.bar.abc or foo.bar.xyz could be output; it doesn't matter which. foo.bat.abc is excluded since foo.bat.xyz exists in archive/.
I've found this to help identify unique values based on a pattern, but I can't figure out how to combine that with my additional clause of no match in archive/.
Awk is actually not needed here; you can do it with simple grep/sed and sort:
(ls ./archive | sed 's/^/1 /'; ls ./active | sed 's/^/2 /') | \
sort --field-separator="." --key="2,2" --uniq --stable | \
grep '^2 ' | sed 's/^2 //'
Explanation:
First, list both directories and mark which lines come from which directory. Then sort both listings together by their middle parts. The option --field-separator="." splits all lines into fields on dots, and --key="2,2" tells sort to use the middle field, i.e. the part between the dots, as the key. We use a stable sort to make sure the lines from archive come first, and tell sort to print only the first of each run of duplicate lines.
Finally, we keep only the lines we marked with 2, i.e. the lines from ./active.
Example:
active/
foo.aaa.xxx
foo.bar.abc
foo.bar.xyz
foo.bat.abc
zoo.aaa.xxx
zoo.bbb.aaa
archive/
aaa.bbb.zoo
foo.bat.xyz
Result:
foo.aaa.xxx
foo.bar.abc
Another attempt using GNU grep, awk and GNU findutils
$ grep -Fxvf <(find active/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++') <(find archive/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++')
foo.bar.xyz
I'm using process substitution <() to run the find/awk commands and pass their output to grep for finding the difference.
While the find command lists the files in the specified directory, one entry per line, awk filters that list, keeping only entries whose 2nd field has not been seen before. With the delimiter set to ., !seen[$2]++ prints a line only the first time its middle part appears, by recording it in the seen array.
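As a quick illustration of the !seen[$2]++ idiom on some of the sample names from the question:
$ printf '%s\n' foo.bar.abc foo.bar.xyz foo.bat.abc | awk -F'.' '!seen[$2]++'
foo.bar.abc
foo.bat.abc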
Do remember that -printf '%P\n' in find is NOT POSIX and only works with GNU findutils; upgrade to it for this to work.
Other possible solutions with similar logic, one with join and one with comm, both part of GNU coreutils, are below:
$ join -v 2 <(find active/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++') <(find archive/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++')
foo.bar.xyz
Another with comm
$ comm -13 <(find active/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++') <(find archive/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++')
foo.bar.xyz

Pass sed output to mv

I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(the text I want to use as the filename is between HTML title tags)
Where I wrote $sed would be the output of sed.
hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
mv "$file" `sed -n 's/<title>\([^<]*\)<\/title>/\1/p;' $file| sed -e 's/[ ][ ]*/_/g'`.txt
done
So, if you have files 1.txt, 2.txt and 3.txt, each with cat, dog and my hippo in their TITLE tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted $file in case there are spaces in filenames; and added a second sed to convert any spaces in the <title> tag to _'s in the resulting filenames. NOTE: the whitespace inside the []'s in the second sed command is a literal space and tab character.
You can enclose an expression in grave accent characters (`) to make the shell insert its output where you want it. Try:
mv *.txt `sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt`.txt
It is not very flexible, but it should work.
(I haven't used it in a while and cannot test it now, so I might be wrong).
Here is the command I would use:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
done
The sed substitution searches for the pattern in each one of your .txt files. For each file it creates the string mv 'file_name' 'found_pattern'.
With the e command at the end of the sed expression, this resulting string is executed directly in the shell, which renames your files.
Some hints:
Note the use of = instead of / as the delimiter for the sed substitution: it's more readable since you already have /s in your pattern (you could use many other symbols if you don't like =), and this way you don't have to escape the /s in your pattern.
The e command for sed executes the created string.
(I'm speaking of this one below:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
^
)
So use it with caution! I would recommend first running the line without the final e: it won't execute any mv command, but will just print what would be executed if you were to add the e.
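For instance, the same loop with the final e removed is a safe dry run that only prints the mv commands it would have executed:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=" $i
done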
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (the element's text) and use it as the filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find
ls *.txt | xargs -i\{\} command "{}" ...
find -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always write the xargs substitution as -i\{\} because the resulting command stays compatible when I sometimes use it with find and its substitute {}.
Next, the -maxdepth option keeps find from descending into subdirectories; if there are no subdirs, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
The big decision for your question is: how to get the text out of the title element.
A simple solution (suitable if the opening and closing tags are on the same text line) would be grep.
A more solid solution is to use an HTML parser and navigate via DOM operations (see the xmllint sketch after the find example below).
The simple solution is based on:
get the title line
remove everything before and after the title content
So, putting it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
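For the HTML-parser route mentioned above, a minimal bash sketch, assuming libxml2's xmllint is available (the variable names are just illustrative, and 2>/dev/null hides parser warnings about non-strict HTML):
for f in *.txt; do
title=$(xmllint --html --xpath 'string(//title)' "$f" 2>/dev/null)
[ -n "$title" ] && mv -v "$f" "${title// /_}.txt"
done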
Hopefully it is what you expected.

Find directories with names matching pattern and move them

I have a bunch of directories like 001/ 002/ 003/ mixed in with others that have letters in their names. I just want to grab all the directories with numeric names and move them into another directory.
I try this:
file */ | grep ^[0-9]*/ | xargs -I{} mv {} newdir
The matching part works, but it ends up moving everything to the newdir...
I am not sure I understood correctly but here is at least something to help.
Use a combination of find and xargs to manipulate lists of files.
find -maxdepth 1 -regex './[0-9]*' -print0 | xargs -0 -I'{}' mv "{}" "newdir/{}"
Using -print0 and -0 and quoting the replacement symbol {} makes your script more robust. It will handle most situations where non-printable chars are present. This basically means the lines are passed using a \0 character as the delimiter instead of \n.
mv is not powerful enough by itself. It cannot work on patterns.
Try this approach: Rename multiple files by replacing a particular pattern in the filenames using a shell script
Either use a loop or a rename command.
With loop and array,
Your script would be something like this:
#!/bin/bash
DIR=( $(file */ | grep ^[0-9]*/ | awk -F/ '{print $1}') )
for dir in "${DIR[@]}"; do
mv "$dir" /path/to/DIRECTORY
done
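For the rename alternative mentioned above, a hedged sketch assuming the Perl-based rename (File::Rename) is installed and the target directory is on the same filesystem:
rename -n 's{^(\d+)$}{/path/to/DIRECTORY/$1}' *
The substitution only rewrites names made up entirely of digits, so everything else is left alone; -n just shows what would be renamed, drop it to actually perform the move.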

How to use find command with sed and awk to remove duplicate IP from files

Howdie do,
I'm writing a script that will remove duplicate IP's from two files. For example,
grep -rw "123.234.567" /home/test/ips/
/home/test/ips/codingte:123.234.567
/home/test/ips/codingt2:123.234.567
Ok, so that IP is in two different files and so I need to remove the IP from the second file.
The grep gives me the file path and the IP address. My thinking: store the file path in a variable with awk, then use find to go to that file and use sed to remove the duplicate IP. So I changed my grep statement to:
grep -rw "123.234.567" . | awk -F ':' '{print $1}'
which returns:
./codingte
./codingt2
I originally tried to use the full pathname in the find command, but that didn't work either:
find -name /var/cpanel/dips/codingte -exec sed '/123.234.567/d' {} \;
So, I just did a CD in the directory and changed the find command to:
find -name 'codingt2' -exec sed '/123.234.567/d' {} \;
Which runs, but doesn't delete the IP address:
cat codingt2
123.234.567
Now, I know the issue is with the dots in the IP address. They need to be escaped, but I'm not sure how to do this. I've been reading for hours on escaping the regex, but I'm not sure how to do this with sed
Any help would be appreciated. I'm just trying to learn more about regex and using them with other linux tools such as awk and find.
I haven't written the full script yet. I'm trying to break it into pieces and then bring it together in the script.
So you know what the output should look like:
codingte
123.234.567
codingt2
The second file would just have the IP removed
cat FILE1.txt | while read IP ; do sed -i "/^${IP}$/d" FILE2.txt ; done
The command does the following:
There are two files: FILE1.txt and FILE2.txt
It will remove from FILE2.txt the lines (in your case, IP addresses) found in FILE1.txt
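On the unescaped dots the question asks about, a minimal sketch (GNU sed, hypothetical variable names) that escapes them so each dot only matches a literal dot:
IP='123.234.567'
ESCAPED=$(printf '%s\n' "$IP" | sed 's/\./\\./g')   # 123\.234\.567
sed -i "/^${ESCAPED}\$/d" FILE2.txt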
You want grep -l, which only prints the filenames containing a match:
grep -lrw "123.234.567" /home/test/ips/
would print
/home/test/ips/codingte
/home/test/ips/codingt2
So, to skip the first file and work on the rest:
grep -l ... | sed 1d | while IFS= read -r filename; do
whatever with "$filename"
done
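Filling in the placeholder with the delete from the question (paths and IP as in the question, dots escaped so they match literally), a sketch:
grep -lrw "123.234.567" /home/test/ips/ | sed 1d | while IFS= read -r filename; do
sed -i '/^123\.234\.567$/d' "$filename"
done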
I think you're just missing the -i argument to sed to edit the files in place.
echo foo > test
find -name test -exec sed -i 's/foo/bar/' {} \;
seems to do the trick.

Sort by function using bash/coreutils instead of perl

I found out that if you sort a list of files by file extension rather than alphabetically before putting them in a tar archive, you can dramatically increase the compression ratio (especially for large source trees where you likely have lots of .c, .o, and .h files).
I couldn't find an easy way to sort files using the shell that works in every case the way I'd expect. An easy solution such as find | rev | sort | rev does the job but the files appear in an odd order, and it doesn't arrange them as nicely for the best compression ratio. Other tools such as ls -X don't work with find, and sort -t. -k 2,2 -k 1,1 messes up when files have more than one period in the filename (e.g. version-1.5.tar). Another quick-n-dirty option, using sed replaces the last period with a / (which never occurs in a filename), then sorts, splitting along the /:
sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
However, once again this doesn't work with the output from find, which has /s in the names, and all other characters (other than NUL) are allowed in filenames on *nix.
I discovered that using Perl, you can write a custom comparison subroutine using the same output as cmp (similar to strcmp in C), and then run the perl sort function, passing your own custom comparison, which was easy to write with perl regular expressions. This is exactly what I did: I now have a perl script which calls
@lines = <STDIN>;
print sort myComparisonFunction @lines;
However, perl is not as portable as bash, so I want to be able to do this with a shell script. In addition, find does not put a trailing / on directory names, so the script thinks directories are the same as files without an extension. Ideally, I'd like to have tar read all the directories first, then regular files (and sort them), then symbolic links, which I can achieve via
cat <(find -type d) <(find -type f | perl exsort.pl) <(find -not -type d -and -not -type f) | tar --no-recursion -T - -cvf myfile.tar
but I still run into the issue that either I have to type this monstrosity every time, or I have both a shell script for this long line AND a perl script for sorting, and perl isn't available everywhere, so stuffing everything into one perl script isn't a great solution either. (I'm mainly focused on older computers, because nowadays all modern Linux and OSX systems come with a recent enough version of perl.)
I'd like to be able to put everything together into one shell script, but I don't know how to pass a custom function to GNU sort tool. Am I out of luck, and have to use one perl script? Or can I do this with one shell script?
EDIT: Thanks for the idea of a Schwartzian Transform. I used a slightly different method, using sed. My final sorting routine is as follows:
sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' | sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' | sort -t/ -k1,1 -k2,2 -k3,3 | sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'
This handles special characters (such as *) in filenames and places files without an extension first because they are often text files. (Makefile, COPYING, README, configure, etc.).
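To illustrate with two hypothetical paths, the decorate/sort/undecorate steps behave like this:
src/main.c  ->  .c/main/src/     (after the two decorating seds: extension/name/directory)
Makefile    ->  /Makefile/
and after the sort and the final undecorating sed the output is Makefile first, then src/main.c, since the empty extension sorts before .c.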
P.S. In case anyone wants my original comparison function or thinks I could improve on it, here it is:
sub comparison {
my $first = $a;
my $second = $b;
my $fdir = $first =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
my $sdir = $second =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
my $fname = $first =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
my $sname = $second =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
my $fbase = $fname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
my $sbase = $sname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
my $fext = $fname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
my $sext = $sname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
if ($fbase eq "" && $sbase ne ""){
return -1;
}
if ($sbase eq "" && $fbase ne ""){
return 1;
}
(($fext cmp $sext) or ($fbase cmp $sbase)) or ($fdir cmp $sdir)
}
If you're familiar with Perl, you can use a Schwartzian Transform in BASH too.
A Schwartzian Transform merely means adding the sort key you want to your data, doing the sort, then removing the sort key. It was created by Randal Schwartz and is used heavily in Perl. However, it's also good to use in other languages:
You want to sort your files by extension:
find . -type f 2> /dev/null | while read file #Assuming no strange characters or white space
do
suffix=${file##*.}
printf "%-10.10s %s\n" "$suffix" "$file"
done | sort | awk '{print substr( $0, 12 ) }' > files_to_tar.txt   # 12 skips the 10-char suffix field plus the separating space
I'm reading each file in from my find. I use printf to prepend each file name with the suffix I want to sort by. Then I do my sort. The awk strips the sort key off, leaving just the file names, which are still sorted by suffix.
Now, your files_to_tar.txt file contains the names of your files sorted by suffix. You can use the -T parameter of tar to read the names of the files from this file:
$ tar -czvf backup.tar.gz -T files_to_tar.txt
You could pipe the result of find to ls -X using xargs (see the xargs man page), which should sort them by extension:
cat <(find -type d) <(find -type f | xargs ls -X ) <(find -not -type d -and -not -type f) | tar --no-recursion -T - -cvf myfile.tar
To sort by extension to group similar files, and then by md5sum to group identical files:
find $your_dir | xargs md5sum | sed 's/ /\x00/; s/\.[^.]$/&\x00&/' | sort -t'\0' -k3,3 | cut -d '' -f2
Note that sort -k3,3 is the extension sort, and the default "last resort" comparison will group the files by md5sum.
Also consider xz instead of gz if you are worried about space.