Regex: Find string in file ensuring it didn't appear before

I have a bunch of files on a Linux machine. I want to find whether any of those files have the string foo123 bar, AND the string foo123 must not appear before that foo123 bar .
Plot twist: I want the search to do this for any number instead of "123", without me having to specify a specific number.
How can I do that?

A solution with Python's newer regex module:
import regex as re
string = """
I have a bunch of files on a Linux machine. I want to find whether any of those files have the string foo123 bar#12, AND the string foo123 must not appear before that foo123 bar#34 .
Plot twist: I want the search to do this for any number instead of "123", without me having to specify a specific number.
How can I do that?
"""
rx = re.compile(r'(?<!foo\d(?s:.*))foo123 bar#\w+')
print(rx.findall(string))
# ['foo123 bar#12']
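This relies on the regex module's support for variable-length (unbounded) lookbehind, plus an inline single-line group, (?s:.*), so that . also matches newlines. Note that the pattern as written is hard-coded to foo123; it does not yet generalize to an arbitrary number.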

Well, that's a tricky one. Here's an imperfect solution:
grep . -Prle '(?s)(?<ref>foo\d+)\b(?! bar).*\k<ref>(*SKIP)(*FAIL)|foo\d+ bar'
Why is it imperfect? Because if you have a file containing foo123 foo456 bar foo123 bar, it won't detect the foo456 bar part. If this situation cannot happen in your set of files, then I suppose you're fine.
This makes use of the (*SKIP)(*FAIL) trick; once you learn that, the rest of the pattern should be pretty clear.
So maybe plain regex isn't the best solution here. Let's just write a one-liner script instead:
find . -type f -execdir perl -e 'while(<>) { while(/foo(\d+)( bar)?/g) { if ($2) { exit 0 if !$n{$1} } else { $n{$1} = 1 } } } exit 1;' {} \; -print
That one does the job and is hopefully more understandable :)
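For readers who prefer Python over Perl, the same bookkeeping can be sketched as follows. This is only an illustration of the idea in the one-liner above, not a drop-in replacement for it: the function name has_fresh_foo_bar and the sample string are made up for this example.

```python
import re

# Same token pattern as the Perl one-liner: "foo<number>", optionally followed by " bar".
TOKEN = re.compile(r"foo(\d+)( bar)?")


def has_fresh_foo_bar(text):
    """Return True if some "fooN bar" occurs before any bare "fooN" with the same N."""
    seen = set()  # numbers that already appeared as a bare "fooN"
    for match in TOKEN.finditer(text):
        number, bar = match.group(1), match.group(2)
        if bar:
            if number not in seen:
                return True
        else:
            seen.add(number)
    return False


# The case the grep pattern above is said to miss: foo456 bar is "fresh".
print(has_fresh_foo_bar("foo123 foo456 bar foo123 bar"))  # True
```

Reading each file's contents and passing them to the function would reproduce the per-file check; the file walking is left out to keep the sketch short.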

Related

PCRE regex doesn't seem to work when used from shell with grep

I'm trying to output a node using a capture group for a decompiled DTS file using a PCRE regex passed into grep. I'm interested in the key-samkey {(...)}; node only.
Any ideas as to what I might be doing wrong or can you point at any alternate methods to extract the node with its contents? I can't use bash's =~ operator, because there is a requirement that we only use sh.
I tried patterns:
/(key-samkey {[.]*.+?[.]*};)/s
(key-samkey {[\s\S]*.+?(?=};))
The exact command I'm using is:
cat {input file} | grep -Po "{pattern}"
Both of these patterns seem to work correctly on online regex testing websites with PCRE syntax, but fail when executed from the shell.
The file which I'm running pattern matching on is structured like this:
/dts-v1/;
/ {
signature {
key-samkey {
required = "conf";
algo = "sha256,rsa4096";
rsa,r-squared = <xxxxxxxx>;
rsa,modulus = <xxxxxxxx>;
rsa,exponent = <0xxx 0xxxxxx>;
rsa,n0-inverse = <0xxxxxxxxx>;
rsa,num-bits = <0xxxxx>;
key-name-hint = "samkey";
};
};
};
It is simpler to select a range of lines with sed:
sed -n '/key-samkey {/,/};/p' file
You're almost there. A regular expression you can use is:
(?s)(key-samkey \{.+?\};)
(?s): The dot . matches everything (DOTALL)
\{ and \}: You have to escape these, because they have special meaning in a regex.
.+?: matches everything it can non-greedy, meaning, in this case, everything up to the first };
Then use grep's -z switch. It makes grep treat NUL bytes, rather than newlines, as line terminators, so a text file without NUL bytes is seen as one big line and the pattern can match across newlines.
Example: I stored your example in the file test.file:
> grep -Pzo '(?s)(key-samkey \{.+?\};)' test.file
key-samkey {
required = "conf";
algo = "sha256,rsa4096";
rsa,r-squared = <xxxxxxxx>;
rsa,modulus = <xxxxxxxx>;
rsa,exponent = <0xxx 0xxxxxx>;
rsa,n0-inverse = <0xxxxxxxxx>;
rsa,num-bits = <0xxxxx>;
key-name-hint = "samkey";
};
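The same DOTALL plus non-greedy idea can be sketched in Python, in case a scripting language is available on the machine. This is an assumption that goes beyond the sh-only requirement in the question, and the function name extract_node is hypothetical.

```python
import re

# re.DOTALL plays the role of (?s): "." also matches newlines.
# ".+?" is non-greedy, so the match stops at the first "};" after the opening brace.
NODE = re.compile(r"key-samkey \{.+?\};", re.DOTALL)


def extract_node(dts_text):
    """Return the key-samkey node including its braces, or None if it is absent."""
    match = NODE.search(dts_text)
    return match.group(0) if match else None
```

Because the whole file is held in one string, no equivalent of grep's -z switch is needed here.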
The answers provided will do the trick for you. However, if you don't want to change your regular expression, I found a convenient way to run PCRE expressions without pain: open your DTS file in Sublime Text, open the Find panel (Command+F on a Mac), make sure the Regular expression option is turned on, and paste your expression (I just tried it on your example). Click "Find All" and copy the result.

Extract a substring from a string in shell script

$sea = xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tgz
and I want to extract only DMJK.UK.1.0-32133-Z-1, that is, the first occurrence; the search string will always start with DMJK.
$sea = anything-anything.xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tgz matchessadds.dsdsds.21212.anything-anything
or
$sea = anything-anything.xyz-ajay-no-siddhart-ubidone-fdh-DMJK.UK.1.0-32133-Z-1.tar.bz2 matchessadds.dsdsds.21212.anything-anything
I need to extract the text that starts at DMJK and ends before either ".tgz", ".tar" or ".tar.bz2".
The output should be DMJK.UK.1.0-32133-Z-1 in the above case.
The main string can be anything that includes the format
"anything.anything-DMJK.XXX.[0-9].[0-9].[0-9]-3213309-[A-Z]-[0-9].tar.tgz anything-anything "
and it changes on the fly; the output could be DMJK.XXX.1.3.0-3213309-Z-13
I tried this, but it's not working:
echo $sea | sed 's#.*(DMJK.*).t*#\1#g'
There are better ways to do this, but you can fix your current solution with:
sed -E 's#.*(DMJK.*)\.(tgz|tar\.bz2|tar).*#\1#'
(Well, sort of. The trailing .* drops whatever follows the extension, which your sample strings need. This will perhaps not work as desired on a string with .tar.bz5. It's not entirely clear what you want to do in that case, though.)
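As one of the "better ways" hinted at above, here is a minimal Python sketch that matches the wanted part directly instead of deleting everything around it. The function name extract_stem is made up for this example, and the list of extensions is taken from the question.

```python
import re

# Lazy match from the first "DMJK" up to (not including) the first archive extension.
ARCHIVE_STEM = re.compile(r"DMJK.*?(?=\.(?:tgz|tar\.bz2|tar))")


def extract_stem(text):
    """Return the DMJK... part without its archive extension, or None if not found."""
    match = ARCHIVE_STEM.search(text)
    return match.group(0) if match else None


sea = ("anything-anything.xyz-ajay-no-siddhart-ubidone-fdh-"
       "DMJK.UK.1.0-32133-Z-1.tar.bz2 matchessadds.dsdsds.21212.anything-anything")
print(extract_stem(sea))  # DMJK.UK.1.0-32133-Z-1
```

The lookahead keeps the extension out of the result, so no capture group or trailing cleanup is needed.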

Edit within multi-line sed match

I have a very large file, containing the following blocks of lines throughout:
start :234
modify 123 directory1/directory2/file.txt
delete directory3/file2.txt
modify 899 directory4/file3.txt
Each block starts with the pattern "start : #" and ends with a blank line. Within the block, every line starts with "modify # " or "delete ".
I need to modify the path in each line, specifically appending a directory to the front. I would just use a general regex to cover the entire file for "modify #" or "delete ", but due to the enormous amount of other data in that file, there will likely be other matches to this somewhat vague pattern. So I need to use multi-line matching to find the entire block, and then perform edits within that block. This will likely result in >10,000 modifications in a single pass, so I'm also trying to keep the execution down to less than 30 minutes.
My current attempt is a sed one-liner:
sed '/^start :[0-9]\+$/ { :a /^[modify|delete] .*$/ { N; ba }; s/modify [0-9]\+ /&Appended_DIR\//g; s/delete /&Appended_DIR\//g }' file_to_edit
Which is intended to find the "start" line, loop while the lines either start with a "modify" or a "delete," and then apply the sed replacements.
However, when I execute this command, no changes are made, and the output is the same as the original file.
Is there an issue with the command I have formed? Would this be easier/more efficient to do in perl? Any help would be greatly appreciated, and I will clarify where I can.
I think you would be better off with Perl, specifically because you can work 'per record' by setting $/. If your records are delimited by blank lines, set it to \n\n.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
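while (<>) {
    # Each iteration reads one blank-line-delimited record (possibly several lines).
    if (m/^start :\d+/) {
        # /m lets ^ match at the start of every line inside the record.
        s/^(modify \d+ )/${1}Appended_DIR\//mg;
        s/^(delete )/${1}Appended_DIR\//mg;
    }
    print;
}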
Each iteration of the loop will pick out a blank line delimited chunk, check if it starts with a pattern, and if it does, apply some transforms.
It'll take data from STDIN via a pipe, or myscript.pl somefile.
Output is to STDOUT and you can redirect that in the normal way.
Your limiting factor on processing files in this way are typically:
Data transfer from disk
pattern complexity
The more complex a pattern, and especially if it has variable matching going on, the more backtracking the regex engine has to do, which can get expensive. Your transforms are simple, so packaging them doesn't make very much difference, and your limiting factor will be likely disk IO.
(If you want to do an in place edit, you can with this approach)
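The record-at-a-time idea is not specific to Perl. A rough Python sketch, assuming the file fits in memory and that blocks are separated by exactly one blank line, could look like this; prepend_in_records and the default directory name are placeholders for illustration.

```python
import re

# Lines inside a record that should get the directory prepended to their path.
EDIT = re.compile(r"^(modify [0-9]+ |delete )", re.MULTILINE)
# A record qualifies only if its first line is a "start :<number>" line.
START = re.compile(r"start :[0-9]+$", re.MULTILINE)


def prepend_in_records(text, new_dir="Appended_DIR"):
    """Split on blank lines and edit only the records that begin with a start line."""
    edited = []
    for record in text.split("\n\n"):
        if START.match(record):
            record = EDIT.sub(lambda m: m.group(1) + new_dir + "/", record)
        edited.append(record)
    return "\n\n".join(edited)
```

Splitting and re-joining on the same separator leaves all untouched records byte-for-byte identical.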
If, as noted, you can't rely on a record separator, then what you can use instead is Perl's range operator (other answers already do this; I'm just expanding it out a bit):
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
    # True from a "start :" line through the next empty line.
    if ( /^start :/ .. /^$/ ) {
        s/^(modify \d+ )/${1}Appended_DIR\//;
        s/^(delete )/${1}Appended_DIR\//;
    }
    print;
}
We don't change $/ any more, so it remains at its default of 'each line'. What we add, though, is a range operator that tests "am I currently between these two regular expressions": it is toggled true when you hit a "start" line and false once you have passed a blank line (assuming that's where you would want to stop).
It applies the pattern transformation if this condition is true, and it ... ignores and carries on printing if it is not.
sed's pattern ranges are your friend here:
sed -r '/^start :[0-9]+$/,/^$/ s/^(delete |modify [0-9]+ )/&prepended_dir\//' filename
The core of this trick is /^start :[0-9]+$/,/^$/, which is to be read as a condition under which the s command that follows it is executed. The condition is true if sed currently finds itself in a range of lines of which the first matches the opening pattern ^start :[0-9]+$ and the last matches the closing pattern ^$ (an empty line). -r enables extended regex syntax (-E on BSD seds; recent GNU sed accepts -E as well), which makes the regex more pleasant to write.
I would also suggest using perl. Although I would try to keep it in one-liner form:
perl -i -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit
Or you can use redirection of stdout:
perl -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit > new_file
With GNU sed (using BRE syntax):
sed '/^start :[0-9][0-9]*$/{:a;n;/./{s/^\(modify [0-9][0-9]* \|delete \)/\1NewDir\//;ba}}' file.txt
The approach here is not to store the whole block and to proceed to the replacements. Here, when the start of the block is found the next line is loaded in pattern space, if the line is not empty, replacements are performed and the next line is loaded, etc. until the end of the block.
Note: GNU sed supports alternation in BRE, written as \|; this may not be the case for some other sed versions.
A way with awk:
awk '/^start :[0-9]+$/,/^$/{if ($1=="modify"){$3="newdirMod/"$3;} else if ($1=="delete"){$2="newdirDel/"$2};}{print}' file.txt
This is very simple in Perl, and probably much faster than the sed equivalent.
This one-line program inserts Appended_DIR/ after any occurrence of modify 999 or delete at the start of a line. It uses the range operator to restrict those changes to blocks of text starting with start :999 and ending with a line containing no non-whitespace characters.
perl -pe"s<^(?:modify\s+\d+|delete)\s+\K><Appended_DIR/> if /^start\s+:\d+$/ .. not /\S/" file_to_edit
Good grief. sed is for simple substitutions on individual lines, that is all. Once you start using constructs other than s, g, and p (with -n) you are using the wrong tool. Just use awk:
awk '
/^start :[0-9]+$/ { inBlock=1 }
inBlock { sub(/^(modify [0-9]+|delete) /,"&Appended_DIR/") }
/^$/ { inBlock=0 }
{ print }
' file
start :234
modify 123 Appended_DIR/directory1/directory2/file.txt
delete Appended_DIR/directory3/file2.txt
modify 899 Appended_DIR/directory4/file3.txt
There are various ways you can do the above in awk, but I wrote it in this style for clarity over brevity, since I assume you aren't familiar with awk but should have no trouble following it: it reuses your own sed script's regexps and replacement text.
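The flag-based approach from the awk script translates almost line for line into other languages. Here is a hedged Python sketch of the same state machine; the names prepend_in_blocks and new_dir are made up for this example, and it streams line by line so the whole file never has to be held in memory.

```python
import re

START = re.compile(r"^start :[0-9]+$")
EDIT = re.compile(r"^(modify [0-9]+|delete) ")


def prepend_in_blocks(lines, new_dir="Appended_DIR"):
    """Yield lines, prepending new_dir/ to paths only inside start-to-blank-line blocks."""
    in_block = False
    for line in lines:
        body = line.rstrip("\n")
        if START.match(body):
            in_block = True
        if in_block:
            # Append the directory right after the matched "modify N " or "delete " prefix.
            line = EDIT.sub(lambda m: m.group(0) + new_dir + "/", line, count=1)
        if body == "":
            in_block = False
        yield line
```

Passing an open file object as lines and writing each yielded line to an output file would process the data in a single pass.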

Find all text within square brackets using regex

I have a problem that because of PHP version, I need to change my code from $array[stringindex] to $array['stringindex'];
So I want to find all the text using regex, and replace them all. How to find all strings that look like this? $array[stringindex].
Here's a solution in PHP:
$re = "/(\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\])/";
$str = "here is \$array[stringindex] but not \$array['stringindex'] nor \$3array[stringindex] nor \$array[4stringindex]";
$subst = "$1'$2'$3";
$result = preg_replace($re, $subst, $str);
The pattern only matches variable names and indexes that begin with a letter; otherwise things like $foo[42] would be converted to $foo['42'], which might not be desirable.
Note that all the solutions here will not handle every case correctly.
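Looking at the Sublime Text regex help, it would seem you could just paste (\$[[:alpha:]][[:alnum:]]+\[)([[:alpha:]][[:alnum:]]+)(\]) into the Search box and $1'$2'$3 into the Replace field. (The doubled backslashes in the PHP version are only PHP string escaping and should not be typed into the Search box.)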
It depends on the tool you want to use to do the replacement.
With sed, for example, it would be something like this:
sed "s/\(\$array\)\[\([^]]*\)\]/\1['\2']/g"
If sed is allowed you could simply do:
sed -i "s/\(\$[^[]*\[\)\([^]]*\)\]/\1'\2']/g" file
Explanation:
sed "s/pattern/replace/g" is a sed command which searches for pattern and replaces it with replace. The g option means replace multiple times per line.
\(\$[^[]*\[\)\([^]]*\)\] is the pattern. It consists of two groups, delimited by \( and \) because sed uses basic regex syntax by default. The first group is a dollar sign followed by a series of non-[ characters and an opening square bracket. The second group is a series of characters that are not closing brackets. A closing square bracket follows.
\1'\2'] is the replacement string: \1 inserts the first captured group (and analogously for \2). Basically we wrap \2 in quotes (which is what you wanted).
The -i option means that the changes are applied to the original file, which is supplied at the end.
For more information, see man sed.
This can be combined with the find command, as follows:
find . -name '*.php' -exec sed -i "s/\(\$[^[]*\[\)\([^]]*\)\]/\1'\2']/g" '{}' \;
This will apply the sed command to all php files found.
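For comparison, the same substitution can be sketched in Python. This is an illustration rather than a vetted migration tool: quote_indexes is a made-up name, and the identifier rule used here (a letter or underscore followed by word characters) is an assumption that is slightly looser than the PHP pattern above.

```python
import re

# Group 1: "$name[", group 2: a bare identifier used as index, group 3: "]".
BARE_INDEX = re.compile(r"(\$[A-Za-z_]\w*\[)([A-Za-z_]\w*)(\])")


def quote_indexes(source):
    """Wrap bare identifier indexes in single quotes, leaving everything else alone."""
    return BARE_INDEX.sub(r"\1'\2'\3", source)


sample = ("here is $array[stringindex] but not $array['stringindex'] "
          "nor $3array[stringindex] nor $array[4stringindex]")
print(quote_indexes(sample))
```

Only the first occurrence in the sample is rewritten; the already-quoted index and the two names starting with a digit are left untouched. As with the other solutions, hits inside double-quoted PHP strings deserve a manual review, since "$array[key]" is already valid there.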

Batch rename screen shots on Mac OS X

Custom batch rename files
Hello, Mac OS X saves screenshots with a very long filename format. I would like to rename any of them that sit at path /Users/me/desktop.
Here are some examples of the filenames:
Screen Shot 2012-08-02 at 1.15.29 AM.png
Screen Shot 2012-08-02 at 1.22.12 AM.png
Screen Shot 2012-08-02 at 1.22.14 PM.png
Screen Shot 2012-08-02 at 1.22.16 PM.png
I was once told not to do a for loop against ls, so I am trying globbing this time around. So far, this is all I can come up with, but I don't know how to paren-wrap the expression and then turn that into a file rename in the format I desire:
for i in *; do
screen_name=$(echo $i | grep --only-matching --extended-regexp '(Screen\ Shot)\ [0-9]+-[0-9]+-[0-9]+\ at\ [0-9]+\.[0-9]+.[0-9]+.[AP]M\.png');
echo $screen_name;
done
I am not sure about the hour of the time; it may be safest to assume possible 2 digits on all chunks of the time, so 1.14.29 and 01.15.29. The desired result is:
ss.08-02-12-01.15.29-AM.png
ss.08-02-12-01.22.12-AM.png
ss.08-02-12-01.22.14-PM.png
ss.08-02-12-01.22.16-PM.png
The end goal, is a bash script that when run will rename ALL files at the above mentioned path to the new format listed.
Thank you for any help.
for i in "Screen Shot"*.png; do
    # awk fields: $3 = date (YYYY-MM-DD), $5 = time (H.M.S), $6 = AM.png or PM.png
    new=$(echo "$i" | awk '
    {
        split($3,a,"-")
        split($5,b,".")
        printf("ss.%s-%s-%s-%02d.%02d.%02d-%s",a[2],a[3],a[1],b[1],b[2],b[3],$6)
    }
    ')
    mv "$i" "$new"
done
Before:
Screen Shot 2012-08-02 at 1.22.16 PM.png
Screen Shot 2012-09-02 at 13.42.06 PM.png
After:
ss.08-02-2012-01.22.16-PM.png
ss.09-02-2012-13.42.06-PM.png
EDIT:
as suggested by steve
printf("ss.%s-%s-%s-%02d.%02d.%02d-%s",a[2],a[3],substr(a[1],3,2),b[1],b[2],b[3],$6)
which yields
ss.08-02-12-01.22.16-PM.png
ss.09-02-12-13.42.06-PM.png
You can use the stream editor sed to match and substitute using regular expressions. You would do something like this:
echo $i | sed "s/PATTERN/REPLACE/"
to generate the filename from $i. sed will read from stdin, search (s command) for PATTERN and replace it with REPLACE.
In your regexp pattern you can mark separate groups by surrounding them with parentheses (); in most situations you will have to escape them as \( and \). You access these parts in the replacement by using \#, where # is the number of the subgroup, starting from 1. Here's a simple example:
echo "ScreenShotXYZ.png" | sed "s/ScreenShot\(.*\)\.png/\1.png/"
Here, the XYZ is matched by the expression in parentheses and can be accessed using \1 in the replacement string. The whole pattern is thus replaced by XYZ.png.
So use your regexp for matching, put parentheses around the relevant blocks and do something like
ss.\1.\2.(and so on)
for your replacement pattern. There's still room to optimize the process by first using sed to replace dashes with dots and then grouping the whole time block in just one pattern, but for a start it's easier to code like that.
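To make the capture-group idea concrete, here is a small Python sketch that computes the new name for one screenshot. It assumes the filename layout shown in the question and a two-digit year in the result; short_name is a hypothetical helper, and the actual renaming is deliberately left out so the sketch has no side effects.

```python
import re

# Groups: year, month, day, hour, minute, second, AM/PM.
SHOT = re.compile(
    r"^Screen Shot (\d{4})-(\d{2})-(\d{2}) at (\d{1,2})\.(\d{2})\.(\d{2}) ([AP]M)\.png$"
)


def short_name(filename):
    """Return the ss.MM-DD-YY-HH.MM.SS-AM.png style name, or None if it does not match."""
    match = SHOT.match(filename)
    if match is None:
        return None
    year, month, day, hour, minute, second, half = match.groups()
    # Zero-pad the hour and keep only the last two digits of the year.
    return f"ss.{month}-{day}-{year[2:]}-{int(hour):02d}.{minute}.{second}-{half}.png"


print(short_name("Screen Shot 2012-08-02 at 1.15.29 AM.png"))  # ss.08-02-12-01.15.29-AM.png
```

Pairing this with os.rename over the files in the Desktop folder would perform the rename; files that do not match the pattern return None and can simply be skipped.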