Is my regex too greedy? - regex

Background: We're using a tape library and the backup software NetWorker to back up data here. The client that's installed is fairly basic, and when we need to restore more than one target directory we create a script that simply calls X client instances in the background, with X lines like the following:
recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/dir/ -f -a /src/dir &
The trouble is that different partitions/directories backed up from the same machine at the same time might be spread across several different tapes, and some of those tapes may have been removed from the library between the backup and restore.
Up until recently, the only ways people here have been finding out which tapes are needed were to either wait for the library to complain that it doesn't have a particular tape, or to set up a fake restore in a crappy old desktop GUI client and hit a particular menu option. The first option is especially bad when the tape turns out to be off-site and takes a day to get back, and the second is tedious and time-consuming.
Actual Question: I've written a "meta"-script that reads the script we've already created with the commands above, feeds it into the interactive CLI client, and gets it to spit out which tapes are required and whether they're actually in the library. To do this, the script uses the following regular expressions to pull out the necessary info:
# pull out a list of the -a targets
restore_targets="`sed 's/^.* -a \([^ ]*\) .*$/\1/' $rec_script`"
# pull out a list of -c clients
restore_clients="`sed 's/^.* -c \([^ ]*\) .*$/\1/' $rec_script`"
numclients=`echo $restore_clients | uniq | wc -l`
# pull out a list of -t dates
restore_dates="`sed 's/^.* -t \"\([^\"]*\)\" .*$/\1/' $rec_script`"
numdates=`echo $restore_dates | uniq | wc -l`
I am not terribly familiar with s/\(x\)/\1/-style regexes, to the point that I don't even remember what they're called, but is this the best way of accomplishing this? The commands work, but I'm wondering if I'm using the .* needlessly.

\1 is a backreference: it refers to the text matched by the first capturing group. For example, if you match foo(.*) against foobar, the group captures bar, so replacing the whole match with \1 leaves just bar.
As for your question, it might be safer and easier to parse the arguments using Python (or another high-level scripting language):
>>> import shlex
>>> shlex.split('recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/dir/ -f -a /src/dir &')
['recover', '-c', 'client-srv', '-t', 'Mon Dec 10 08:00:00', '-s', 'barckup-srv', '-d', '/dest/dir/', '-f', '-a', '/src/dir', '&']
Now, this is much easier to work with. The quotes are gone and all of the components of the command are nicely split up into a list.
If you want this to be completely foolproof, you could use argparse and implement your own parser for this command line pretty easily. This will enable you to easily get the info, but it might be overkill for your situation.
As for your actual question, you can dissect the regex:
^.* -t "([^"]*)" .*$
Greediness is not really a problem here: the captured part uses the negated class [^"]*, which can never run past the closing quote, so it grabs exactly the quoted date. The greedy leading .* only matters if a line somehow contained two -t options, in which case it would match the last one. (sed's POSIX regexes don't support non-greedy quantifiers anyway, which is why negated character classes are the usual idiom.)
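For a concrete sanity check, here is the question's own sample line run through the same kind of capture (the hostnames are just the ones from the question):

```shell
# Run the question's sample recover line through the two sed captures
line='recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/dir/ -f -a /src/dir &'
echo "$line" | sed 's/^.* -a \([^ ]*\) .*$/\1/'      # /src/dir
echo "$line" | sed 's/^.* -t "\([^"]*\)" .*$/\1/'    # Mon Dec 10 08:00:00
```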


How do you pipe and filter text from tail as input for a variable in a script?

Backstory
I am trying to create a script that updates a "device" through the device's CLI, but it doesn't accept any form of command following the establishment of an SSH connection.
For this reason I have started using screen to log the output from the device, and then attempting to filter the log for relevant info so I can pass commands back to the remote device by stuffing them into screen's buffer. (Kind of a ramshackle way of doing it, but it's all I can think of.)
Issue
I need to use some combination of grep and sed or awk to filter out one of two outputs I'm looking for, respectively "SN12345678" '\w[a-zA-Z]\d{6-10}' and "finished", inside screenlog.2. I've got regex patterns for both of these, but I cannot seem to get the right output and assign it to a variable.
.screenrc (relevant excerpt)
screen -t script 0 ./script
screen -t local 1 bash
screen -t remote 2 bash
screen -t Shell 3 bash
./script
screen -p 2 -X log on #turns logging on window 2
screen -p 3 -X stuff 'tail -Fn 0 screenlog.2 | #SOME sed function that i cant figure out'
screen -p 2 -X stuff 'ssh -o "UserKnownHostsFile /dev/null" -o "StrictHostKeyChecking=no" admin#192.168.0.1^M' && echo "Stuffed ssh login -> window 2"
sleep 2 # wait for ssh connection
screen -p 2 -X stuff admin^M && echo "stuffed pw"
sleep 4 # wait for auth
screen -p 2 -X stuff "copy sw ftp://ftpuser:admin#192.168.0.2/dev_uimage-4_4_5-26222^M" && echo "initiated flash"
screen -p 2 -X stuff "copy license ftp://ftpuser:admin#192.168.0.2/$(result of sed from screenlog.2).lic^M" && echo "uploading license"
Sorry if this is a bit long-winded; I've been racking my brain for the last few days trying to get this to work.
Thank you for your time!
Answer
Regular Expression
Looking at the example regex you provided, I'm going to assume SN can't just be hardcoded: the first character could be an uppercase letter, lowercase letter, or digit, and the second character an uppercase or lowercase letter. So I think you are looking for:
grep -Eo '[[:alnum:]][[:alpha:]][[:digit:]]{6,10}' # Works regardless of the computer's locale settings
# OR
egrep -o '[[:alnum:]][[:alpha:]][[:digit:]]{6,10}' # Works regardless of the computer's locale settings
# OR
grep -Eo '[0-9A-Za-z][A-Za-z][0-9]{6,10}'
# OR
egrep -o '[0-9A-Za-z][A-Za-z][0-9]{6,10}'
These are exact conversions of your regular expression (they include _ as a possibility for the first character):
grep -Eo '[[:alnum:]_][[:alpha:]][[:digit:]]{6,10}' # Works regardless of the computer's locale settings
# OR
grep -Eo '[0-9A-Za-z_][A-Za-z][0-9]{6,10}'
# OR (non-extended regular expressions)
grep -o '[[:alnum:]_][[:alpha:]][[:digit:]]\{6,10\}'
grep -o '[0-9A-Za-z_][A-Za-z][0-9]\{6,10\}'
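Any of these can be sanity-checked by piping a known-good line through them; the sample line below is made up:

```shell
# Verify the pattern plucks out the serial number from a log-style line
echo 'Device SN12345678 ready' | grep -Eo '[[:alnum:]_][[:alpha:]][[:digit:]]{6,10}'   # SN12345678
```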
Reuse the Match
I don't know how you would assign the output to a variable, but I would just write it to a file and delete the file afterwards (assuming the "script" and "Shell" windows have the same pwd [present working directory]):
. . .
screen -p 3 -X stuff 'tail -Fn1 screenlog.2 | grep -Eo "[[:alnum:]][[:alpha:]][[:digit:]]{6,10}" >> SerialNumberOrID^M'
. . .
screen -p 2 -X stuff "copy license ftp://ftpuser:admin#192.168.0.2/$(cat SerialNumberOrID).lic^M" && echo "uploading license"
rm -f SerialNumberOrID
Explanation
Regular Expression
I'm fairly confident that grep, sed, and awk (and most POSIX-compliant utilities) don't support \w and \d as standard: those are Perl-style escapes (GNU grep and sed accept \w as an extension, but not \d). You can pass -E to grep and sed to make them use extended regular expressions, which will save you from having to do as much escaping.
Command Changes
Writing the match to a file seemed like the best way to reuse it. Using >> ensures that we append to the file, so grep will only ever add matching text and won't clobber the file when there's no match. This is why it's necessary to delete the file at the end of your script (so it won't mess up the next run, and so you don't have unnecessary files lying around). In the license-upload command, we use cat to insert the contents of the file in-line. I also changed the tail command to tail -Fn1 because I'm pretty sure you need at least 1 line for it to feed anything into grep.
Resources
https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
grep, sed, and awk man pages

One parameter for multiple patterns - grep

I'm trying to search PDF files from the terminal. My approach is to provide the search string from the terminal. The search string can be one word, multiple words (combined with AND/OR), or an exact phrase. I would like to keep only one parameter for all search queries. I'll save the following command as a shell script and call the shell script via an alias in .aliases, in a zsh or bash shell.
Following from sjr's answer, here: search multiple pdf files.
I've used sjr's answer like this:
find ${1} -name '*.pdf' -exec sh -c 'pdftotext "{}" - |
grep -E -m'${2}' --line-buffered --label="{}" '"${3}"' '${4}'' \;
$1 takes path
$2 limits the number of results
$3 is context parameter (it is accepting -A , -B , -C , either individually or jointly)
$4 takes search string
The issue I am facing is with $4 value. As I said earlier I want this parameter to pass my search string which can be a phrase or one word or multiple words with AND / OR relation.
I am not able to get the desired results: I was not getting any results for phrase searches until I followed Robin Green's comment, but even now the phrase results are not accurate.
Edit Text from judgments:
The original rule was that you could not claim for psychiatric injury in
negligence. There was no liability for psychiatric injury unless there was also
physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried
both about fraudulent claims and that if they allowed claims, the floodgates would
open.
The claimant was 15 metres away behind a tram and did not see the accident but
later saw blood on the road. She suffered nervous shock and had a miscarriage. She
sued for negligence. The court held that it was not reasonably foreseeable that
someone so far away would suffer shock and no duty of care was owed.
White v Chief Constable of South Yorkshire [1998] The claimants were police
officers who all had some part in helping victims at Hillsborough and suffered
psychiatric injury. The House of Lords held that rescuers did not have a special
position and had to follow the normal rules for primary and secondary victims.
They were not in physical danger and not therefore primary victims. Neither could
they establish they had a close relationship with the injured so failed as
secondary victims. It is necessary to define `nervous shock' which is the rather
quaint term still sometimes used by lawyers for various kinds of
psychiatric injury...rest of para
word1 can be: shock, (nervous shock)
word2 can be: psychiatric
exact phrase: (nervous shock)
Commands
alias s='sh /path/shell/script.sh'
export p='path/pdf/files'
In terminal:
s "$p" 10 -5 "word1/|word2" #for OR search
s "$p" 10 -5 "word1.*word2.*word3" #for AND search
s "$p" 10 -5 ""exact phrase"" #for phrase search
Second Test Sample:
An example PDF file, since the command runs on PDF documents: Test-File. It's 4 pages (part of a 361-page file).
If we run the following command on it, as the solution mentions:
s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt
we'll get the relevant text and avoid going through the entire file. I thought it would be a cool way to read what we want rather than taking the traditional approach.
You need to:
pass a double-quoted command string to sh -c in order for the embedded shell-variable references to be expanded (which then requires escaping embedded " instances as \").
quote the regex with printf %q for safe inclusion in the command string - note that this requires bash, ksh, or zsh as the shell.
dir=$1
numMatches=$2
context=$3
regexQuoted=$(printf %q "$4")
find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - |
grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \;
The 3 invocation scenarios would then be:
s "$p" 10 -5 'word1|word2' #for OR search
s "$p" 10 -5 'word1.*word2.*word3' #for AND search
s "$p" 10 -5 'exact phrase' #for phrase search
Note that there's no need to escape | and no need to add an extra layer of double quotes around exact phrase.
Also note that I've replaced --line-buffered with --with-filename, as I assume that's what you meant (to have the matching lines prefixed with the PDF file path).
Note that with the above approach a shell instance must be created for every input path, which is inefficient, so consider rewriting your command as follows, which also obviates the need for printf %q (assume regex=$4):
find "${dir}" -type f -name '*.pdf' |
while IFS= read -r file; do
pdftotext "$file" - |
grep -E -m${numMatches} --with-filename --label="$file" ${context} "${regex}"
done
The above assumes that your filenames have no embedded newlines, which is rarely a real-world concern. If it is, there are ways to solve the problem.
An additional advantage of this solution is that it uses only POSIX-compliant shell features, but note that the grep command uses nonstandard options.
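To try the loop's shape without pdftotext or any PDFs, here is a stand-in sketch that greps plain-text files instead (the paths, file names, and search phrase are all invented for the demo):

```shell
#!/bin/sh
# Stand-in demo: same find | while read pipeline, but cat replaces pdftotext
demo=/tmp/pdfgrep_demo
rm -rf "$demo" && mkdir -p "$demo"
printf 'nervous shock was claimed\n' > "$demo/a.txt"
printf 'no relevant text here\n'     > "$demo/b.txt"

find "$demo" -type f -name '*.txt' |
while IFS= read -r file; do
    cat "$file" |                     # pdftotext "$file" - in the real script
    grep -E -m1 --with-filename --label="$file" 'nervous shock'
done
```

Only a.txt matches, so the single output line is the labeled match from that file.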

How can I improve this bash command to get UI rendering times from ActivityManager?

I'm working on a sample app based on Fire App Builder and would like to programmatically measure how long (in milliseconds) it takes for an Activity to be drawn. This information is included in the ActivityManager logs:
I/ActivityManager( 1843): Displayed com.amazon.android.calypso/com.amazon.android.tv.tenfoot.ui.activities.ContentBrowseActivity: +656ms
Times above one second look like this:
I/ActivityManager( 1843): Displayed com.amazon.android.calypso/com.amazon.android.tv.tenfoot.ui.activities.ContentBrowseActivity: +1s001ms
However, Android is designed such that apps can only read their own logs. As such, Runtime.getRuntime().exec("logcat -d ActivityManager:I *:S") won't show anything. I can read the logs using a shell command, but my regular expression isn't giving me what I want.
The following command
adb logcat -d ActivityManager:I *:S | sed -n 's/ContentBrowseActivity:\s+\+\([0-9].*\)ms/\1/p'
matches, but because the expression isn't anchored to the start of the line (and sed doesn't support non-greedy expressions), the unmatched prefix is printed as well:
I/ActivityManager( 1843): Displayed com.amazon.android.calypso/com.amazon.android.tv.tenfoot.ui.activities.656
I want to get just the time. Tacking on another grep command gets the data:
adb logcat -d ActivityManager:I *:S | sed -n 's/ContentBrowseActivity:\s+\+\([0-9].*\)ms/\1/p' | grep -Po [0-9]+
For times above one second, I want to simply multiply the first number by 1,000 and add it to the second.
However, this is extremely hacky and also captures the number in I/ActivityManager(####). Anyone know of a more elegant solution?
$ echo 'I/ActivityManager( 1843): Displayed com.amazon.android.calypso/com.amazon.android.tv.tenfoot.ui.activities.ContentBrowseActivity: +1s001ms' | sed -e 's/^.*: +\(.*\)ms$/\1/' -e 's/s//'
1001
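To see both log formats go through the same pipeline (the component name below is shortened for readability; the timings are the ones from the question):

```shell
# Both sub-second and second-plus timings normalize to milliseconds:
# the first expression captures everything between ": +" and "ms",
# the second deletes the "s" separator if one is present.
fmt() { sed -e 's/^.*: +\(.*\)ms$/\1/' -e 's/s//'; }
echo 'I/ActivityManager( 1843): Displayed com.example/.Main: +656ms'   | fmt   # 656
echo 'I/ActivityManager( 1843): Displayed com.example/.Main: +1s001ms' | fmt   # 1001
```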

How to substitute words in Git history & properly debug related problems?

I'm trying to remove sensitive data like passwords from my Git history. Instead of deleting whole files I just want to substitute the passwords with removedSensitiveInfo. This is what I came up with after browsing through numerous StackOverflow topics and other sites.
git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
When I run this command it seems to be rewriting the history (it shows the commits it's rewriting and takes a few minutes). However, when I check to see if all sensitive data has indeed been removed it turns out it's still there.
For reference this is how I do the check
git grep aSecretPassword1 $(git rev-list --all)
Which shows me all the hundreds of commits that match the search query. Nothing has been substituted.
Any idea what's going on here?
I double-checked the regular expression I'm using, which seems to be correct. I'm not sure what else to check or how to properly debug this, as my Git knowledge is quite rudimentary. For example, I don't know how to test whether 1) my regular expression isn't matching anything, 2) sed isn't being run on all files, 3) the file changes are not being saved, or 4) something else is wrong.
Any help is very much appreciated.
P.S.
I'm aware of several StackOverflow threads about this topic. However, I couldn't find one that is about substituting words (rather than deleting files) in all (ASCII) files (rather than specifying a specific file or file type). Not sure whether that should make a difference, but all suggested solutions haven't worked for me.
git-filter-branch is a powerful but difficult-to-use tool: there are several obscure things you need to know to use it correctly for your task, and each one is a possible cause of the problems you're seeing. So rather than immediately trying to debug them, let's take a step back and look at the original problem:
Substitute given strings (ie passwords) within all text files (without specifying a specific file/file-type)
Ensure that the updated Git history does not contain the old password text
Do the above as simply as possible
There is a tailor-made solution to this problem:
Use The BFG... not git-filter-branch
The BFG Repo-Cleaner is a simpler alternative to git-filter-branch specifically designed for removing passwords and other unwanted data from Git repository history.
Ways in which the BFG helps you in this situation:
The BFG is 10-720x faster
It automatically runs on all tags and references, unlike git-filter-branch - which only does that if you add the extraordinary --tag-name-filter cat -- --all command-line option (Note that the example command you gave in the Question DOES NOT have this, a possible cause of your problems)
The BFG doesn't generate any refs/original/ refs - so no need for you to perform an extra step to remove them
You can express your passwords as simple literal strings, without having to worry about getting regex-escaping right. The BFG can handle regex too, if you really need it.
Using the BFG
Carefully follow the usage steps - the core bit is just this command:
$ java -jar bfg.jar --replace-text replacements.txt my-repo.git
The replacements.txt file should contain all the substitutions you want to do, in a format like this (one entry per line - note the comments shouldn't be included):
PASSWORD1 # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass # replace with 'examplePass' instead
PASSWORD3==> # replace with the empty string
regex:password=\w+==>password= # Replace, using a regex
Your entire repository history will be scanned, and all text files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.
Full disclosure: I'm the author of the BFG Repo-Cleaner.
Looks OK. Remember that filter-branch retains the original commits under refs/original/, e.g.:
$ git commit -m 'add secret password, oops!'
[master edaf467] add secret password, oops!
1 file changed, 4 insertions(+)
create mode 100644 secret
$ git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
Rewrite edaf467960ade97ea03162ec89f11cae7c256e3d (2/2)
Ref 'refs/heads/master' was rewritten
Then:
$ git grep aSecretPassword `git rev-list --all`
edaf467960ade97ea03162ec89f11cae7c256e3d:secret:aSecretPassword2
but:
$ git lola
* e530e69 (HEAD, master) add secret password, oops!
| * edaf467 (refs/original/refs/heads/master) add secret password, oops!
|/
* 7624023 Initial
(git lola is my alias for git log --graph --oneline --decorate --all). Yes, it's in there, but under the refs/original name space. Clear that out:
$ rm -rf .git/refs/original
$ git reflog expire --expire=now --all
$ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)
and then:
$ git grep aSecretPassword `git rev-list --all`
$
(as always, run filter-branch on a copy of the repo Just In Case; and then removing original refs, expiring the reflog "now", and gc'ing, means stuff is Really Gone).
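For the cautious, the whole sequence can be rehearsed in a throwaway repository first. This sketch assumes GNU sed and a reasonably recent git; the path, committer identity, and password string are all invented:

```shell
#!/bin/sh
# Throwaway-repo rehearsal: rewrite history, drop refs/original,
# expire reflogs, gc, then confirm the secret is really gone.
set -e
repo=/tmp/fb_demo
rm -rf "$repo" && mkdir -p "$repo" && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo 'password=aSecretPassword1' > config.txt
git add config.txt
git commit -qm 'add secret password, oops!'
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --tree-filter \
    "find . -type f -name '*.txt' -exec sed -i 's/aSecretPassword1/removedSensitiveInfo/g' {} +" HEAD
rm -rf .git/refs/original
git reflog expire --expire=now --all
git gc -q --prune=now
# This grep should now find nothing:
git grep aSecretPassword1 $(git rev-list --all) || echo 'secret is gone'
```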

Extracting string after matched pattern in Shell

How do you extract whatever string comes after a matched pattern in a shell script? I know how to do this in Perl scripting, but I don't know how in shell scripting.
Following is the example,
Subject_01: This is a sample subject and this may vary
I have to extract whatever string that follows "Subject_01:"
Any help please.
It depends on your shell.
If you're using a POSIX shell, such as bash, ksh or (I believe) pdksh, then you can do fancy stuff like this:
$ string="Subject_01: This is a sample subject and this may vary"
$ output="${string#*: }"
$ echo $output
This is a sample subject and this may vary
$
Note that this is pretty limited in terms of format. The line above requires exactly ONE space after your colon. If you have more, the extra spaces will end up at the beginning of $output.
If you're using some other shell, you may have to do something like this, with the cut command:
> setenv string "Subject_01: This is a sample subject and this may vary"
> setenv output "`echo "$string" | cut -d: -f2`"
> echo $output
This is a sample subject and this may vary
> setenv output "`echo "$string" | sed 's/^[^:]*: *//'`"
> echo $output
This is a sample subject and this may vary
>
The first example uses cut, which is very small and simple. The second example uses sed, which can do far more, but is a (very) little heavier in terms of CPU.
YMMV. There's probably a better way to handle this in csh (my second example uses tcsh), but I do most of my shell programming in Bourne.
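A small sketch of the padding caveat mentioned above, using an invented sample string (POSIX shell):

```shell
# ${string#*: } strips only the shortest "*: " prefix, so extra
# spaces after the colon survive; the sed variant eats them all.
string="Subject_01:   padded subject"
echo "[${string#*: }]"                              # [  padded subject]
echo "[$(echo "$string" | sed 's/^[^:]*: *//')]"    # [padded subject]
```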