assign a portion of a word to a variable - regex

I am working on a bash script.
grep -R -l "image17" *
image17 will change to some other number when I go through my loop. When I execute the grep above, I get back the following:
slides/_rels/slide33.xml.rels
I need to put slide33 in a variable because I want to use that to rename the file named image17.jpeg to be called slide33.jpeg. I need something to check for the above format and parse out starting at slide and ending with the numbers.
Another problem is the grep statement could come up with multiple results rather than one. I need a way to check to see how many results and if one do one thing and if more than one do another.
Here is what I have so far. Now I just need to put the grep as a variable and check to see how many times it happens and if it is one then do the regular expression to get the filename.
#!/bin/sh IFS=$'\n'
where="/Users/mike/Desktop/test"
cd "${where}"
for file in $(find * -maxdepth 0 -type d)
do
cd "${where}/${file}/images"
ls -1 | grep -v ".png" | xargs -I {} rm -r "{}"
cd "${where}/${file}/ppt"
for images in $(find * -maxdepth 0 -type f)
do
if [ (grep -R -l "${images}" * | wc -l) == 1 ]
then
new_name=grep -R -l "slide[0-9]"
fi
done
done

i=0
while [ $i -lt 50 ]
do
grep -R -l "image${i}"
done
something like this might help
Or, to detect similar structured words you can do
grep -R -l "slide[0-9][0-9]"
or you can do
grep -R -l "slide[0-9]+"
to match atleast one digit and atmost any number
Check man grep for more in the "REGULAR EXPRESSION" section
this will match words starting with "slide" and ending with exactly two numbers
grep -c does count the number of matches, but does not print the matches. I think you should count the lines to detect the number of lines which grep matched and then execute the conditional statement.

Related

append epoch date at the beginning of a file in bash

I have a list of 20 files, 10 of them already have 1970-01-01- at the beginning of the name and 10 does not ( the remaining ones all start with a small letter ) .
So my task was to rename those files that do not have the epoch date in the beginning with the epoch date too. Using bash, the below code works, but I could not solve it using a regular expression for example using rename. I had to extract the basename and then further mv. An elegant solution would be just use one pipe instead of two.
Works
find ./ -regex './[a-z].*' | xargs -I {} basename {} | xargs -I {} mv {} 1970-01-01-{}
Hence looking for a solution with just one xargs or -exec?
You can just use a single rename command:
rename -n 's/^([a-z])/1970-01-01-$1/' *
Assuming you're operating on all the files present in current directory.
Note that -n flag (dry run) will only show intended actions by rename command but won't really rename any files.
If you want to combine with find then use:
find . -type f -maxdepth 1 -name '[a-z]*.txt' -execdir rename -n 's/^/1970-01-01-/' {} +
I always prefer readable code over short code.
r() {
base=$(basename "$1")
dir=$(dirname "$1")
if [[ "$base" =~ ^1970-01-01- ]]
then
: "ignore, already has correct prefix"
else
echo mv "$1" "$dir/1970-01-01-$base"
fi
}
export -f r
find . -type f -exec bash -c 'r {}' \;
This also just prints out what would have been done (for testing). Remove the echo before the mv to have to real thing.
Mind that the mv will overwrite existing files (if there is a ./a/b/c and an ./a/b/1970-01-01-c already). Use option -i to mv to be save from this.

find works in terminal but fails in bash script

So I'm writing a bash script that counts the number of files in a directory and outputs a number. The function takes a directory argument as well as an optional file-type extension argument.
I am using the following lines to set the dir variable to the directory and ext variable to a regular expression that will represent all the file types to count.
dir=$1
[[ $# -eq 2 ]] && ext="*.$2" || ext="*"
The problem I am encountering occurs when I attempt to run the following line:
echo $(find $dir -maxdepth 1 -type f -name $ext | wc -l)
Running the script from the terminal works when I provide the second file-type argument but fails when I don't.
harrison#Luminous:~$ bash Documents/howmany.sh Documents/ sh
3
harrison#Luminous:~$ bash Documents/howmany.sh Documents/
find: paths must precede expression: Desktop
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]
0
I have searched for this error and I know it's an issue with the shell expanding my wildcard as explained here. I've tried experimenting with single quotes, double quotes, and backslashes to escape the asterisk but nothing seems to work. What's particularly interesting is that when I try running this directly through the terminal, it works perfectly fine.
harrison#Luminous:~$ echo $(find Documents/ -maxdepth 1 -type f -name "*" | wc -l)
6
Simplified:
dir=${1:-.} #if $1 not set use .
name=${2+*.$2} #if $2 is set use *.$2 for name
name=${name:-*} #if name still isnt set, use *
find "$dir" -name "$name" -print #use quotes
or
name=${2+*.$2} #if $2 is set use *.$2 for name
find "${1:-.}" -name "${name:-*}" -print #use quotes
also, as #John Kugelman says, you could use:
name=${2+*.$2}
find "${1:-.}" ${name:+-name "$name"} -print
find . -name "*" -print is the same as find . -print, so if $name isn't set, there's no need to specify -name "*".
Try this:
dir="$1"
[[ $# -eq 2 ]] && ext='*.$2' || ext='*'
If that doesn't work, you can just switch to an if statement, where you use the -name pattern in a branch and you don't in the other.
A couple more points:
Those are not regular expressions, but rather shell patterns.
echo $(command) is just equivalent to command.

Pass sed output to mv

I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(the text I want to use as filename is between html title tags)`
Where I wrote $sed would be the output of sed.
hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
mv "$file" `sed -n 's/<title>\([^<]*\)<\/title>/\1/p;' $file| sed -e 's/[ ][ ]*/_/g'`.txt
done
So, if you have files 1.txt, 2.txt and 3.txt, each with cat, dog and my hippo in their TITLE tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted initial $file in case there are spaces in filenames; and added a second sed to convert any spaces in the <title> tag to _'s in resulting filenames. NOTE the whitespace inside the []'s in the second sed command is a literal space and tab character.
You can enclose expression in grave accent characters (`) to make it insert its output to the place you want. Try:
mv *.txt `sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt`.txt
It is rather not flexible, but should work.
(I haven't used it in a while and cannot test it now, so I might be wrong).
Here is the command I would use:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
done
The sed substitution search for pattern in each one of your .txt files. For each file it creates string mv 'file_name' 'found_pattern'.
With the e command at the end of sed commands, this resulting string is directly executed in terminal, thus it renames your files.
Some hints:
Note the use of =s instead of /s as delimiters for sed substition: it's more readable as you already have /s in your pattern (you could use many other symbols if you don't like =). And in this way you don't have to escape the / in your pattern.
The e command for sed executes the created string.
(I'm speaking of this one below:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
^
)
So use it with caution! I would recommand to first use the line without final e: it won't execute any mv command, but just print instead what would be executed if you were to add the e.
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (elements.text) and use it as filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find
ls '*.txt' | xargs -i\{\} command "{}" ...
find -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always replace the xargs substitues by -i\{\} because the resulting command is compatible if I use it sometimes with find and its substitute {}.
Next the -maxdepth option will help find not to dive deeper in directory, if no subdir, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
The big decision for your question is: how to get the text from title element.
A simple solution (suitable if opening and closing tag is on same textline) would be by grep
A more solid solution is to use a HTML Parser and navigate by DOM operation
The simple solution base on:
get the title line
remove the everything before and after title content
So do it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
Hopefully it is what you expected.

How can I exclude directories matching certain patterns from the output of the Linux 'find' command?

I want to use regex's with Linux's find command to dive recursively into a gargantuan directory tree, showing me all of the .c, .cpp, and .h files, but omitting matches containing certain substrings. Ultimately I want to send the output to an xargs command to do certain processing on all of the matching files. I can pipe the find output through grep to remove matches containing those substrings, but that solution doesn't work so well with filenames that contain spaces. So I tried using find's -print0 option, which terminates each filename with a nul char instead of a newline (whitespace), and using xargs -0 to expect nul-delimited input instead of space-delimited input, but I couldn't figure out how to pass the nul-delimited find through the piped grep filters successfully; grep -Z didn't seem to help in that respect.
So I figured I'd just write a better regex for find and do away with the intermediary grep filters... perhaps sed would be an alternative?
In any case, for the following small sampling of directories...
./barney/generated/bam bam.h
./barney/src/bam bam.cpp
./barney/deploy/bam bam.h
./barney/inc/bam bam.h
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
...I want the output to include all of the .h, .c, and .cpp files but NOT those ones that appear in the 'generated' and 'deploy' directories.
BTW, you can create an entire test directory (named fredbarney) for testing solutions to this question by cutting & pasting this whole line into your bash shell:
mkdir fredbarney; cd fredbarney; mkdir fred; cd fred; mkdir inc; mkdir docs; mkdir generated; mkdir deploy; mkdir src; echo x > inc/dino.h; echo x > docs/info.docx; echo x > generated/dino.h; echo x > deploy/dino.h; echo x > src/dino.cpp; cd ..; mkdir barney; cd barney; mkdir inc; mkdir docs; mkdir generated; mkdir deploy; mkdir src; echo x > 'inc/bam bam.h'; echo x > 'docs/info info.docx'; echo x > 'generated/bam bam.h'; echo x > 'deploy/bam bam.h'; echo x > 'src/bam bam.cpp'; cd ..;
This command finds all of the .h, .c, and .cpp files...
find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$"
...but if I pipe its output through xargs, the 'bam bam' files each get treated as two separate (nonexistant) filenames (note that here I'm simply using ls as a stand-in for what I actually want to do with the output):
$ find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$" | xargs -n 1 ls
ls: ./barney/generated/bam: No such file or directory
ls: bam.h: No such file or directory
ls: ./barney/src/bam: No such file or directory
ls: bam.cpp: No such file or directory
ls: ./barney/deploy/bam: No such file or directory
ls: bam.h: No such file or directory
ls: ./barney/inc/bam: No such file or directory
ls: bam.h: No such file or directory
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
So I can enhance that with the -print0 and -0 args to find and xargs:
$ find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$" -print0 | xargs -0 -n 1 ls
./barney/generated/bam bam.h
./barney/src/bam bam.cpp
./barney/deploy/bam bam.h
./barney/inc/bam bam.h
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
...which is great, except that I don't want the 'generated' and 'deploy' directories in the output. So I try this:
$ find . -regextype posix-egrep -regex ".+\.(c|cpp|h)$" -print0 | grep -v generated | grep -v deploy | xargs -0 -n 1 ls
barney fred
...which clearly does not work. So I tried using the -Z option with grep (not knowing exactly what the -Z option really does) and that didn't work either. So I figured I'd write a better regex for find and this is the best I could come up with:
find . -regextype posix-egrep -regex "(?!.*(generated|deploy).*$)(.+\.(c|cpp|h)$)" -print0 | xargs -0 -n 1 ls
...but bash didn't like that (!.*: event not found, whatever that means), and even if that weren't an issue, my regex doesn't seem to work on the regex tester web page I normally use.
Any ideas how I can make this work? This is the output I want:
$ find . [----options here----] | [----maybe grep or sed----] | xargs -0 -n 1 ls
./barney/src/bam bam.cpp
./barney/inc/bam bam.h
./fred/src/dino.cpp
./fred/inc/dino.h
...and I'd like to avoid scripts & temporary files, which I suppose might be my only option.
Thanks in advance!
-Mark
This works for me:
find . -regextype posix-egrep -regex '.+\.(c|cpp|h)$' -not -path '*/generated/*' \
-not -path '*/deploy/*' -print0 | xargs -0 ls -L1d
Changes from your version are minimal: I added exclusions of certain path patterns separately, because that's easier, and I single-quote things to hide them from shell interpolation.
The event not found is because ! is being interpreted as a request for history expansion by bash. The fix is to use single quotes instead of double quotes.
Pop quiz: What characters are special inside of a single-quoted string in sh?
Answer: Only ' is special (it ends the string). That's the ultimate safety.
grep with -Z (sometimes known as --null) makes grep output terminated with a null character instead of newline. What you wanted was -z (sometimes known as --null-data) which causes grep to interpret a null character in its input as end-of-line instead of a newline character. This makes it work as expected with the output of find ... -print0, which adds a null character after each file name instead of a newline.
If you had done it this way:
find . -regextype posix-egrep -regex '.+\.(c|cpp|h)$' -print0 | \
grep -vzZ generated | grep -vzZ deploy | xargs -0 ls -1Ld
Then the input and output of grep would have been null-delimited and it would have worked correctly... until one of your source files began being named deployment.cpp and started getting "mysteriously" excluded by your script.
Incidentally, here's a nicer way to generate your testcase file set.
while read -r file ; do
mkdir -p "${file%/*}"
touch "$file"
done <<'DATA'
./barney/generated/bam bam.h
./barney/src/bam bam.cpp
./barney/deploy/bam bam.h
./barney/inc/bam bam.h
./fred/generated/dino.h
./fred/src/dino.cpp
./fred/deploy/dino.h
./fred/inc/dino.h
DATA
Since I did this anyway to verify I figured I'd share it and save you from repetition. Don't do anything twice! That's what computers are for.
Your command:
find . -regextype posix-egrep -regex "(?!.*(generated|deploy).*$)(.+\.(c|cpp|h)$)" -print0 | xargs -0 -n 1 ls
fails because you are trying to use Posix extended regular expressions, which dont support lookaround/lookbehind etc. https://superuser.com/a/596499/658319
find does support pcre, so if you convert to pcre, this should work.

Improving Shell Script Performance

This shell script is used to extract a line of data from $2 if it contains the pattern $line.
$line is constructed using the regular expression [A-Z0-9.-]+#[A-Z0-9.-]+ (a simple email match), form the lines in file $1.
#! /bin/sh
clear
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+"`
do
echo `cat "$2" | grep -m 1 "\b$line\b"`
done
File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).
File $2 has slightly longer lines of text (> 80 to < 200) and has 2M+ lines (approx. 200MB).
The desktops this is running on has plenty of RAM (6 Gig) and Xenon processors with 2-4 cores.
Are there any quick fixes to increase performance as currently it takes 1-2 hours to completely run (and output to another file).
NB: I'm open to all suggestions but we're not in the position to complexity re-write the whole system etc. In addition the data come from a third party and is prone to random formatting.
Quick suggestions:
Avoid the useless use of cat and change cat X | grep Y to grep Y X.
You can process the grep output as it is produced by piping it rather than using backticks. Using backticks requires the first grep to complete before you can start the second grep.
Thus:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | while read line; do
grep -m 1 "\b$line\b" "$2"
done
Next step:
Don't process $2 repeatedly. It's huge. You can save up all your patterns and then execute a single grep over the file.
Replace loop with sed.
No more repeated grep:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\1/g' > patterns
grep -f patterns "$2"
Finally, using some bash fanciness (see man bash → Process Substitution) we can ditch the temporary file and do this in one long line:
grep -f <(grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g') "$2"
That's great unless you have so many patterns grep -f runs out of memory and barfs. If that happens you'll need to run it in batches. Annoying, but doable:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\1/g' > patterns
while [ -s patterns ]; do
grep -f <(head -n 100 patterns) "$2"
sed -e '1,100d' -i patterns
done
That'll process 100 patterns at a time. The more it can do at once the fewer passes it'll have to make over your 2nd file.
the problem is you are piping too many shell commands, as well as unnecessary use of cat.
one possible solution using just awk
awk 'FNR==NR{
# get all email address from file1
for(i=1;i<=NF;i++){
if ( $i ~ /[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+/){
email[$i]
}
}
next
}
{
for(i in email) {
if ($0 ~ i) {
print
}
}
}' file1 file2
I would take the loop out, since greping a 2 million line file 50k times is probably pretty expensive ;)
To allow for you to take the loop out
First create a file of all your Email Addresses with your outer grep command.
Then use this as a pattern file to do your secondary grep by using grep -f
If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. Should look like
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" $1
Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so
grep -i -o -E "[A-Z0-9._-]+#[A-Z0-9.-]+" $1
As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.
First of all, this will be a lot slower than necessary as piping would allow the two programs to run simultaneously (which is really good if they are both CPU intensive and you have multiple CPUs). However there is another very important aspect to this, the line
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+"`
may become to long for the shell to handle. Most shells (to my knowledge at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.