Specific Perl Regular Expression Needed

I have a Perl script that needs to process all files of a certain type from a given directory. The files should be those that end in .jup and do NOT contain 'TEMP_' in the filename, i.e. it should allow corrected.jup, but not TEMP_corrected.jup.
I have tried a look-ahead, but I obviously have the pattern wrong:
/(?!TEMP_).*\.jup$/
This returns the entire directory contents though, including files with any extension and those containing TEMP_, such as the file TEMP_corrected.jup.

The regular expression you want is:
/^((?!TEMP_).)*\.jup$/
The main difference is that your regular expression is not anchored at the start of the string, so it matches any substring that satisfies your criteria - so in the example of TEMP_corrected.jup, the substrings corrected.jup and EMP_corrected.jup both match.
(The other difference is that putting () round both the lookahead and the . ensures that TEMP_ isn't allowed anywhere in the string, as opposed to just not at the start. Not sure whether that's important to you or not!)
If you're getting files other than .jup files, then there is another problem - your expression should only match .jup files. You can test your expression with:
perl -ne 'if(/^((?!TEMP_).)*\.jup$/) {print;}'
then type strings: perl will echo them back if they match, and not if they don't. For example:
$ perl -ne 'if(/^((?!TEMP_).)*\.jup$/) {print;}'
foo
foo.jup
foo.jup <-- perl printed this because 'foo.jup' matched
TEMP_foo.jup
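The same check can be sketched outside the shell. The following is a minimal Python version, assuming Python's re module (which shares this lookahead syntax with Perl); the function name and the sample listing are hypothetical, made up for illustration.

```python
import re

# Same idea as the Perl pattern above: anchored at the start, and the
# negative lookahead is re-checked before every character that is consumed.
JUP_WITHOUT_TEMP = re.compile(r'^((?!TEMP_).)*\.jup$')

def wanted_files(names):
    """Return the names that end in .jup and contain no 'TEMP_' anywhere."""
    return [name for name in names if JUP_WITHOUT_TEMP.search(name)]

# Hypothetical directory listing, for illustration only.
listing = ['corrected.jup', 'TEMP_corrected.jup', 'old_TEMP_x.jup', 'notes.txt']
print(wanted_files(listing))
```

Without the ^ anchor, the search is free to start one character later, which is how the pattern from the question ends up accepting TEMP_corrected.jup.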

Related

Perl extract group with lookbehind from different line

I've tried a web search and have read several answers on Stack Exchange, but I still cannot grasp why my command does not extract anything. In the end I want to extract a group, using something like a lookbehind, from a different line, e.g. from
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
by finding the needed key between the Type tags and extracting the first Code above that match, so in the case above I would get test2. But I cannot extract anything at all across multiple lines, i.e.
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Type>(.*)<Type/'<test.txt
prints nothing.
I've played with removing the -l and -n switches, with adding and removing the non-greedy ?, and with trying just . in place of [\s\S\n].
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Code2>(.*)<Code2/'<test.txt
gives TEST1_best so same line extraction works.
What am I missing? Can what I want be done in one line of command?
The following command answers your question: it collects all values contained in a Code>...<Code pattern, if they are followed by a Type>...<Type pattern (with potentially other patterns in between, but no other occurrences of Code>...<Code in between):
perl -lne 's/^.*?(?=Code>)//s; for (split /Code>/) { print qq($1:$2\n) if /(.*?)<Code.*?Type>(.*?)<Type/s }' -0777 <test.txt
If e.g. test.txt contains the following lines,
Code>test4<Code Type>false<Type
Code>test3<Code
Type>true<Type
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
then the command will collect the following value pairs:
test4:false
test3:true
test2:false
Edited on 04/08/2019, 17:38 CEST: I changed the command to remove the "header part" of the file (everything before the first occurrence of Code>), since it might, through an editing mistake, contain a closing tag <Code that was opened not with Code> but with a typo such as Cde>. My assumption had been that the whole file is "syntactically correct", in the sense that it consists of elements of the form /(\w+)>.*?<\1/ separated by whitespace (including newlines). For files that do not conform to this syntax, the original command was not watertight.
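For readers who prefer to see the split-based idea outside a one-liner, here is a rough Python sketch of it. It is an illustration under the same assumptions about the file format, not a translation endorsed by the answer; the function name is hypothetical.

```python
import re

def code_type_pairs(text):
    """Pair each Code value with the first Type value that follows it,
    provided no other Code>...<Code element comes in between."""
    pairs = []
    # Splitting on the opening tag isolates one Code element per chunk.
    # Anything before the first 'Code>' lands in chunk 0 and is skipped,
    # which corresponds to removing the "header part" of the file.
    for chunk in text.split('Code>')[1:]:
        m = re.match(r'(.*?)<Code.*?Type>(.*?)<Type', chunk, re.DOTALL)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs

sample = (
    "Code>test4<Code Type>false<Type\n"
    "Code>test3<Code\n"
    "Type>true<Type\n"
    "Code>TEST1<Code Code2>best<Code2\n"
    "Code>test2<Code\n"
    "Type>false<Type\n"
)
for code, type_value in code_type_pairs(sample):
    print(f"{code}:{type_value}")
```

re.DOTALL plays the role of Perl's /s modifier here, letting . cross the newline between a Code line and the Type line below it.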
Another way to do it, using progressive matching and embedded code
perl -lne 'while (/\b(?:Code>(.*?)<Code(?{$c=$1})|Type>(.*?)<Type(?{print qq($c:$2\n) if defined $c;undef $c}))\b/g){}' -0777 <test.txt
Explanations:
Basically, the expression finds occurrences of Code>(.*?)<Code or Type>(.*?)<Type. This gives the basic form of an alternation in a non-capturing group: (?:Code>(.*?)<Code|Type>(.*?)<Type).
The word boundary assertions \b around the group ensure that the keywords Code and Type are matched, but not e.g. Code2 or TType.
The modifier g ensures progressive application of the regular expression on the string. Since I want to extract the result inside of the expression itself, I place the regex in an empty loop, i.e. while (/.../g) {}.
You suppose a grammar rule Code ⟶ Type, i.e. you look for occurrences of a Type token following a Code token. For this, a Code token is memorized in a variable $c with the code expression (?{$c=$1}). If a Type token is found, it is considered a match only if a Code token has been found before it, which is indicated by the variable $c being defined. In any case, once a Type token has been found, the variable $c is undefined again to prepare for the next search. This gives the code evaluation (?{print qq($c:$2\n) if defined $c;undef $c}) in the Type branch of the regular expression.
Note that the captures of the Code>(.*?)<Code and Type>(.*?)<Type tokens may be the empty string. This is why I am working with undef $c and if defined $c instead of the simpler $c='' and if $c.
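Python's re module has no embedded code blocks like (?{...}), but the same progressive scan can be approximated by moving the bookkeeping into the loop body. The sketch below is an assumption-laden illustration, with hypothetical names, rather than an equivalent of the Perl command in every detail.

```python
import re

# Same alternation and word-boundary assertions as in the Perl expression.
TOKEN = re.compile(r'\b(?:Code>(.*?)<Code|Type>(.*?)<Type)\b')

def code_type_pairs_scan(text):
    """Walk through the tokens in order, remembering the latest Code value."""
    pairs = []
    last_code = None                  # plays the role of $c
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:    # the Code branch matched
            last_code = m.group(1)
        else:                         # the Type branch matched
            if last_code is not None:
                pairs.append((last_code, m.group(2)))
            last_code = None          # corresponds to undef $c
    return pairs

sample = (
    "Code>test4<Code Type>false<Type\n"
    "Code>TEST1<Code Code2>best<Code2\n"
    "Code>test2<Code\n"
    "Type>false<Type\n"
)
print(code_type_pairs_scan(sample))
```

Comparing against None rather than testing truthiness mirrors the defined/undef choice explained above: an empty capture is still a valid Code value.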
If your data is in a file named d, with GNU sed:
sed -Ez 's/.*Code>(\w+)<Code\sType>\w*<Type.*/\1/' d
With Perl:
perl -ne 'BEGIN{undef $/} /Code>(\w+)<Code\nType>\w*<Type/; print $1' d

Regex to match PowerShell drive path

PowerShell's New-PsDrive Cmdlet allows for drives to be created with more-flexible names like HKLM.
I'd like to match these drive\path\file patterns in the NavigationCmdletProvider that I'm building:
csb:
csb:\
csb:\foo\bar
csb:\foo\bar\
csb:\foo\bar bar\test.txt
but not these
csb:\\
csb:\\\
([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\? matches everything that I want, but still includes the two that I don't. I can't seem to get it to match 0 or 1 \ at the end of the string.
What am I missing?
All you should need to do is tie your regular expression to the beginning and end of the line using a ^ and a $ respectively:
^([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\?$
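This is necessary almost any time you are trying to count a specific number of characters in a regex: without the anchors, the pattern can succeed on just part of the input (for example, on the csb:\ prefix of csb:\\).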
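The effect of the anchors can be checked with a short script. This is a sketch in Python rather than .NET, on the assumption that the two engines treat this particular pattern the same way; the pattern is copied unchanged from the question, and the helper name is hypothetical.

```python
import re

# Pattern body from the question, copied as-is.
BODY = r'([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\?'
UNANCHORED = re.compile(BODY)
ANCHORED = re.compile('^' + BODY + '$')

def is_drive_path(candidate):
    """True only when the whole string fits the anchored pattern."""
    return ANCHORED.search(candidate) is not None

# Backslashes are doubled because these are ordinary Python string literals.
should_match = ['csb:', 'csb:\\', 'csb:\\foo\\bar', 'csb:\\foo\\bar\\',
                'csb:\\foo\\bar bar\\test.txt']
should_not_match = ['csb:\\\\', 'csb:\\\\\\']

for path in should_match + should_not_match:
    print(path, is_drive_path(path))
```

The unanchored version still "matches" csb:\\ because it is satisfied by the leading csb:\ and simply ignores the rest.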

Regular expression for begins with and ends with

I'm trying to run a search with git to get me all the staged files in one of two folders: local or components. I only want to get JS files. The command runs in the console.
What I have so far:
STAGED_FILES=($(git diff --cached --name-only --diff-filter=ACM | grep "^(local|components).*?.js"))
This gets me all the staged files:
git diff --cached --name-only --diff-filter=ACM
This gets me all files paths beginning with local or components
grep "^(local|components)"
And this gets me all js files
grep ".js"
And this returns me nothing for some reason:
($(git diff --cached --name-only --diff-filter=ACM | grep "^(local|components).*?.js"))
What is the Regular Expression that I could search with that would get me all the js files in these two folders?
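It didn't work because plain grep uses basic regular expressions, in which (, | and ? are ordinary characters, so your pattern was looking for those characters literally. You can use -E for extended regular expressions. (Neither mode has true lazy matching; that is a Perl-style feature, which GNU grep offers through -P.)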
For example consider these
$ echo "asfasdfzasdfasdfz" | grep -E "a.*?z"
asfasdfzasdfasdfz
$ echo "asfasdfzasdfasdfz" | grep "a.*?z"
$ echo "asfasdfzasdfasdf?z" | grep "a.*?z"
asfasdfzasdfasdf?z
As you can see, without -E grep treats the ? as a literal character that must appear in the input.
Besides the regular expression based answers, you can do this directly in Git, which has the notion of a "pathspec" including shell style globbing:
git diff --cached --name-only \
--diff-filter=ACM -- 'local/**/*.js' 'components/**/*.js'
(line broken for display formatting; note that the ** support is new in Git version 1.8.2).
That said, regular expressions are "more powerful" than shell globs, so you may want to keep nu11p01n73R's answer in mind. Note, however, that non-greedy matches (*?) match as little as possible, rather than as much as possible:
pattern    input         result (matched part in parentheses)
abc.*e     0abcdefeged   0(abcdefege)d
abc.*?e    0abcdefeged   0(abcde)feged
abc.*d     0abcdefeged   0(abcdefeged)
abc.*?d    0abcdefeged   0(abcd)efeged
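The four rows can be reproduced with any engine that supports lazy quantifiers. A minimal sketch in Python follows (the helper name is made up for illustration):

```python
import re

TEXT = '0abcdefeged'

def matched_part(pattern):
    """Return the substring of TEXT that the pattern actually matches."""
    return re.search(pattern, TEXT).group()

for pattern in ('abc.*e', 'abc.*?e', 'abc.*d', 'abc.*?d'):
    print(pattern, matched_part(pattern))
```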
Your expression, ^(local|components).*?.js, says (in extended interpretations anyway): match the start of line; then match either local or components as literal text; then match as few characters as possible, perhaps none; then match any character; then match a literal j; then match a literal s. Hence this matches local-jaguar-xjs-vehicles because it begins with local, contains some text, has one character more before js, and continues on.
The shell glob pattern local/**/*.js matches only the directory local, followed by any number—possibly zero—of subdirectory components, followed by a file whose name ends with .js, with the dot matched literally. So this is equivalent to the pattern ^local/(.*/|)[^/]*\.js$: the literal text local matched at the start of the line, followed by one slash; followed by either: any number of characters ending in slash (taking as many as possible), or nothing at all; followed by any number (including none) of any character except slash, followed by a literal .js, followed by the end of the line.
Note that because this expression is anchored at both ends (must match at the beginning and end of line), and there is only one Kleene star in the middle, it does not matter whether we use a greedy or non-greedy match: the left-side anchor matches at the left and the right-side anchor matches at the right, and a greedy match takes as much of the middle as it can—i.e., all of it—while a non-greedy match takes as little of the middle as it can ... which is still "all of it".
(This does, of course, assume that the file names are being printed with just one on each line. Fortunately git diff --name-only does just that. Also, shell ** for "any number of directories" is not supported in all shells, nor all non-shell file name globbing, but it is used in Git's pathspecs (search for "pathspec").)
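To see the difference between the loose expression and the anchored, glob-equivalent one, here is a small Python sketch. It extends the local pattern above to both folders, which is my own assumption about what the asker wants; the path list is invented for illustration.

```python
import re

# Anchored pattern modelled on ^local/(.*/|)[^/]*\.js$, extended to both folders.
STAGED_JS = re.compile(r'^(local|components)/(.*/|)[^/]*\.js$')
# The expression from the question, for comparison.
LOOSE = re.compile(r'^(local|components).*?.js')

def staged_js(paths):
    """Keep only .js files that live under local/ or components/."""
    return [p for p in paths if STAGED_JS.search(p)]

# Hypothetical output of git diff --cached --name-only, one path per entry.
paths = ['local/a.js', 'local/sub/dir/b.js', 'components/x/y.js',
         'local-jaguar-xjs-vehicles', 'lib/c.js', 'local/readme.json']
print(staged_js(paths))
```

The loose expression accepts local-jaguar-xjs-vehicles, while the anchored one rejects it along with local/readme.json.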

How to substitute a string even if it contains regex metacharacters, using Shell or Perl?

I want to substitute a word that may contain regex metacharacters with another word, for example to replace .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change below line
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution disables the interpretation of regex metacharacters, so it does not have the side effect.
Unfortunately, if the word contains $ or #, this solution still does not work, because Perl treats $ and # as variable prefixes.
Can anybody help this? Many thanks.
Please note that the word I want to substitute is NOT hard coded, it comes from a input file, you can consider it as variable.
"Unfortunately, if the word contains $ or #, this solution still does not work, because Perl treats $ and # as variable prefixes."
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and #:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
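The same principle, keeping the search word in a variable and quoting it before it reaches the regex engine, can be sketched in Python, where re.escape plays roughly the role of \Q...\E. This is an illustration in a different language than the answer uses, and the function name is hypothetical.

```python
import re

def replace_literal(text, old, new):
    """Replace every literal occurrence of old with new.

    re.escape neutralises regex metacharacters in old; passing a function
    as the replacement keeps new literal as well (no backslash handling).
    """
    return re.sub(re.escape(old), lambda match: new, text)

print(replace_literal('pre$foo to .$foobar', '.$foo', '.to'))
print(replace_literal('"Precilla123";"aaaa aaa";"bbb bbb"', '.Precilla123', '.Precill'))
```

Because the word is ordinary data by the time it is escaped, characters such as $ and # never get a chance to be read as part of the program.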
sed 's/\.Precilla123/.Precill/g'
Escape the metacharacter with \.
Be careful: the metacharacters are not the same on both sides. In the search pattern they are mainly the regex ones, []{}()\.*+^$, whereas in the replacement only & and \ are special (plus, on both sides, the separator, which is whatever character follows the s).

Find and trim part of what is found using regular expression

I'm a newbie at writing regular expressions.
I have a file name like TST0101201304-123.txt, and my goal is to get the numbers between '-' and '.txt'.
So I wrote the pattern -([0-9]*)\.txt. It finds the numbers I want, but the match also includes the hyphen '-' and the trailing '.txt', so the result for the example above is '-123.txt'.
So my question is:
Is there a way in regular expressions to get only part of the matched string, like a submatch of the match without the need to trim it in my shell script code for unix?
I found this answer but it is getting the same result:
Regexp: Trim parts of a string and return what ever is left
Tip: To test my regular expressions I used this website
You can use lookbehind and lookahead
(?<=-)[0-9]*(?=[.]txt)
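I don't know whether it will work in your Unix tools: lookarounds need a Perl-compatible engine (for example GNU grep -P) and are not part of POSIX regular expressions.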
Different regex-engines are different. Since you're using expr match, you need to make two changes:
expr match expects a regex that matches the entire string; so, you need to add .* at the beginning of yours, to cover everything before the hyphen.
expr match uses POSIX Basic Regular Expressions (BREs), which use \( and \) for grouping (and capturing) rather than merely ( and ).
But, conveniently, when you give expr match a regex that contains a capture-group, its output is the content of that capture-group; you don't need to do anything else special. So:
$ expr match TST0101201304-123.txt '.*-\([0-9]*\)\.txt'
123
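For comparison, here are the capture-group approach and the lookaround approach from the earlier answer side by side in Python. This is a sketch with hypothetical function names, not something expr or sed can run.

```python
import re

def number_via_group(filename):
    """Match '-digits.txt' but return only the captured digits."""
    m = re.search(r'-([0-9]*)\.txt', filename)
    return m.group(1) if m else None

def number_via_lookaround(filename):
    """Match only the digits; '-' and '.txt' are asserted, not consumed."""
    m = re.search(r'(?<=-)[0-9]*(?=[.]txt)', filename)
    return m.group(0) if m else None

print(number_via_group('TST0101201304-123.txt'))
print(number_via_lookaround('TST0101201304-123.txt'))
```

Both return the same digits; the capture group is the more portable of the two, since it needs no lookaround support from the engine.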
sed is your friend.
echo filename | sed -e 's/.*-\([0-9]*\)\.txt$/\1/'
should get you what you want: the pattern matches the whole name, and the replacement keeps only the captured digits.