PowerShell: regex to modify paths to different *nix styles

I'm using rsync and ICACLS to sync two (Windows) directories, and to do so I need the same path translated to several 'styles': Cygwin *nix, remote *nix, UNC (see examples below).
I'm using the following code to do so, and while it works, the regexes could surely be made more robust (as you can see, I'm doing a replace of a replace, which I find ugly at best...):
$remote="remotesrv"
$path="g:\tools\example\"
$local_dos=$path
$remote_dos="\\$remote\"+(($local_dos -replace "^\w","$&$") -replace "(:\\)|(\\)","\")
$local_nix="/cygdrive/"+($local_dos -replace "(:\\)|(\\)","/")
$remote_nix="//$remote/"+(($local_dos -replace "^\w","$&$") -replace "(:\\)|(\\)","/")
"Local DOS : $local_dos"
"Remote DOS : $remote_dos"
"Local *nix : $local_nix"
"Remote *nix: $remote_nix"
the output is:
Local DOS : g:\tools\example\
Remote DOS : \\remotesrv\g$\tools\example\
Local *nix : /cygdrive/g/tools/example/
Remote *nix: //remotesrv/g$/tools/example/
Can someone please help me with the regexes above? Many thanks!

How about this:
$local_dos = $path
$remote_dos = "\\$remote\$path" -replace ':', '$'
$local_nix = "/cygdrive/$path" -replace ':?\\', '/'
$remote_nix = "//$remote/$path" -replace ':?\\', '/'
The DOS ones are pretty simple and aren't really using regex much.
For the nix ones, :?\\ means "\ optionally preceded by :". We're replacing that with a forward slash.
This should work fine for the vast majority of non-pathological cases. However, it's not bulletproof. Crazy file names with slashes or colons in them could easily break this.

The usual approach is to split a path into its component parts, translate the parts, then recombine the parts for the destination OS. For Windows/Cygwin/DOS, the component parts are:
Drive
Path
Filename (the part before the first '.' in most filesystems)
File extension(s) (where zim.tar.gz would have filename = zim and extension = .tar.gz)
Long-term, this is easier and faster than trying to do it all at once. (I know -- I've tried it all-at-once before.)
Perl's File::Spec modules may be worth a look for ideas if you can read Perl code.
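For ideas in code form, here is a minimal Perl sketch of that split-translate-recombine approach using the core File::Spec::Win32 module (the g:\ path and remotesrv name are the question's examples; zim.tar.gz is just a placeholder filename):
use File::Spec::Win32;
# Split the Windows path into volume, directory string, and filename.
my $path = 'g:\tools\example\zim.tar.gz';
my ($vol, $dirs, $file) = File::Spec::Win32->splitpath($path);
my @parts = grep { length } File::Spec::Win32->splitdir($dirs);
(my $drive = $vol) =~ s/:$//;    # 'g:' -> 'g'
# Recombine the parts for each target style.
my $cygwin = join '/', "/cygdrive/$drive", @parts, $file;         # /cygdrive/g/tools/example/zim.tar.gz
my $unc    = join "\\", "\\\\remotesrv\\$drive\$", @parts, $file; # \\remotesrv\g$\tools\example\zim.tar.gz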

Related

How to update path in Perl script in multiple files

I am working on creating some training material where I am using Perl. One of the things I want to do is have the scripts set up correctly for the student, regardless of where they extract the compressed files. I am working on a Windows batch file that will copy the Perl templates to the working location and then update the path in the copies to the correct location. The Perl templates have this as the first line:
#!_BASE_software/perl/bin/perl.exe
The batch file looks like this:
SET TRAINING=%~dp0
copy %TRAINING%\template\*.pl %TRAINING%work
%TRAINING%software\perl\bin\perl -pi.bak -e 's/_BASE_/%TRAINING%/g' %TRAINING%work\*.pl
I have a few problems with this:
Perl doesn't seem to like the wildcard in the filename
It turns out that %TRAINING% expands into a string with backslashes, which need to be converted into forward slashes and escaped within the regex.
How do I fix this?
First of all, Windows doesn't use the shebang line, so I'm not sure why you're doing any of this work in the first place.
Perl will read the shebang line and look for options if perl is found in the path, even on Windows, but that means that #!perl is sufficient if you want to pass options via the shebang line (e.g. #!perl -n).
Now, it's possible that you use Cygwin, MSYS or some other unix emulation instead of Windows to run the program, but you are placing a Windows path in the shebang line (C:...) rather than a unix path, so that doesn't make sense either.
There are three additional problems with the attempt:
cmd uses double-quotes for quoting.
cmd doesn't perform wildcard expansion like sh, so it's up to your program do it.
You are trying to generate Perl code from cmd. Ouch.
If we go ahead, we get:
"%TRAINING%software\perl\bin\perl" -MFile::DosGlob=glob -pe"BEGIN { #ARGV = map glob, #ARGV; $base = $ENV{TRAINING} =~ s{\\}{/}rg } s/_BASE_/$base/g" -i.bak -- %TRAINING%work\*.pl
If we add line breaks for readability, we get the following (that cmd won't accept):
"%TRAINING%software\perl\bin\perl"
-MFile::DosGlob=glob
-pe"
BEGIN {
#ARGV = map glob, #ARGV;
$base = $ENV{TRAINING} =~ s{\\}{/}rg
}
s/_BASE_/$base/g
"
-i.bak -- %TRAINING%work\*.pl
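(Note that the /r substitution flag used here, which returns the modified string instead of editing the variable in place, requires Perl 5.14 or later.)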

replace part of file name with wrong encoding

Need some guidance how to solve this one. I have tens of thousands of files in multiple subfolders where the encoding got screwed up. Via ls I see a filename like this: 'F'$'\366''ljesedel.pdf' (the quotes at beginning and end are part of what ls prints). That's just one example where the Swedish characters åäö went wrong; in this example the name should have been 'Följesedel.pdf'. If I run
#>find .
Then I see a list of files like this:
./F?ljesedel.pdf
Not the same encoding. How on earth do I solve this one? The most obvious ways:
myvar='$'\366''
char="ö"
find . -name *$myvar* -exec rename 's/$myvar/ö' {} \;
and other possible ways fail, since
find . -name cannot find it due to the ? shown instead of the "real" character '$'\366''.
Any suggestions or guidance would be very much appreciated.
The first question is what encoding your terminal expects. Make sure that is UTF-8.
Then you need to find what bytes the actual filename contains, not just what something might display it as. You can do this with a perl oneliner like follows, run in the directory containing the file:
perl -E'opendir my $dh, "."; printf "%s: %vX\n", $_, $_ for grep { m/jesedel\.pdf/ } readdir $dh'
This will output the filename interpreted as UTF-8 bytes (if you've set your terminal to that) followed by the hex bytes it actually contains.
Using that you can determine what your search pattern should be. Your replacement must be the UTF-8 encoded representation of ö, which it will be by default as part of the command arguments if your terminal is set to that.
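For example, once the bytes are confirmed, here is a minimal sketch of the actual rename (assuming, per the \366 that ls printed, that the stray byte is 0xF6, latin-1 'ö', and using F*jesedel.pdf as a stand-in pattern for the affected files):
perl -E'for (glob "F*jesedel.pdf") {
    (my $new = $_) =~ s/\xF6/\xC3\xB6/ or next;  # 0xC3 0xB6 is the UTF-8 encoding of "ö"
    rename $_, $new or warn "rename $_: $!";
}'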
I'm not an expert, but it might not be a problem with the file name itself (which seems to hold the correct Unicode file name) so much as with the way ls (and many other utilities) display the name to the terminal.
I was able to show the correct name by setting the terminal character encoding to Unicode. Also I've noticed the GUI programs (file manager, etc), were able to show the correct file name.
Gnome Terminal: Terminal > Set Character Encoding > Unicode (UTF-8)
It is still a challenge to select those files with many utilities (e.g. by regexp or wildcard). In a few cases you will have to match those characters using a '*' pattern. If this is a major issue, consider using ASCII only, maybe 'o' instead of 'ö'. Not sure if this is acceptable.

Regular expression to shorten file paths

I need to shorten file paths for a report I'm working on for NTFS and share permissions. For example, I'm trying to remove the leading \\ in share paths and C:\ in drive paths, and replace each slash thereafter with >. I also need to shorten the path down to just the last folder, taking spaces and special characters into account. Between the > characters and the folder name there needs to be a space.
So for example \\Finance\Accounts & Payroll\Sage becomes >> Sage.
And D:\HR\Personnel\Records\Holidays\2015 becomes >>>> 2015.
Here's a regex-based solution (that at least works with the sample data):
echo '\\Finance\Accounts & Payroll\Sage
D:\HR\Personnel\Records\Holidays\2015' \
| perl -pe 's/(^|[^\\]+)\\+/>/g; s/(>*)>/$1 /'
↓
>> Sage
>>>> 2015
(No language was specified, so I just used my personal favourite. Most regex implementations should work, though.)
It's a bit of a hack. Another way would be to split on the separators and count the parts, e.g. in Perl:
my @parts = split /\\+/, $path;
my $short = ('>' x (@parts - 2)) . ' ' . $parts[-1];
Keep in mind though that Windows (and others) generally accepts / as a separator as well, and that neither of the above takes into account things like .. and \.\. Normalising the path first would be a good idea.
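For the normalising, a sketch using File::Spec::Win32's canonpath, which does a purely logical cleanup (converting forward slashes to backslashes and collapsing repeated separators) without touching the filesystem:
use File::Spec::Win32;
my $clean = File::Spec::Win32->canonpath('D:/HR//Personnel/Records');  # 'D:\HR\Personnel\Records'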

bsd_glob behaving differently on different machines

I am using bsd_glob to get a list of files matching a regular expression for the file path. My Perl utility is working on RHEL, but not on SUSE 11/AIX/Solaris, for the exact same set of files and the same regular expression. I googled for any limitations of bsd_glob but couldn't find much information. Can someone point out what's wrong?
Below is the regular expression for the file path I am searching for:
/datafiles/data_one/level_one/*/DATA*
I need all files beginning with DATA, in any directory present under 'level_one'.
This works perfectly on my RHEL box, but not on any of the other Unixes or on SUSE Linux.
Below is the code snippet where I am using bsd_glob:
use File::Glob qw(bsd_glob GLOB_ERR);    # bsd_glob and its flags come from File::Glob

foreach my $file (bsd_glob("$fileName", GLOB_ERR)) {
    if ($fileName =~ /[[:alnum:]]\*\/\*$/) {
        next if -d $file;
        $fileList{$file} = $permissions;
        $total++;
    }
    elsif ($fileName =~ /[[:alnum:]]\*$/) {
        $fileList{$file} = $permissions;
        $total++;
    }
    else {
        $fileList{$file} = $permissions;
        $total++;
    }
}
In the case where I am facing the issue, /datafiles/data_one/level_one/*/DATA* is being passed to bsd_glob. I am building a map (%fileList) of the files returned by bsd_glob, based on the pattern I pass to it. $permissions is a predefined value.
Any help is appreciated.
The problem here looks to be that you're confusing glob patterns and regular expressions.
/[[:alnum:]]\*\/\*$/
/[[:alnum:]]\*$/
With those you're looking for a file whose name ends in a literal *, under a directory whose name also ends in a literal *; in a regex, \* matches an actual asterisk, it doesn't mean "anything" as it does in a glob.
Whilst such names are technically possible, they're really very strange, and those regexes simply cannot ever match the names your glob should find.
Do you perhaps mean:
m,\w+.*/.*$,
(different delimiter for clarity)
Also - why are you using bsd_glob specifically? From File::Glob:
Since v5.6.0, Perl's CORE::glob() is implemented in terms of bsd_glob(). Note that they don't share the same prototype--CORE::glob() only accepts a single argument. Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns, whereas bsd_glob() considers them as one pattern. But see :bsd_glob under EXPORTS, below.
Comment:
I used bsd_glob instead of glob as there was a slight difference in the way it works on different UNIX platforms. Specifically, for the above-mentioned pattern, on some UNIX platforms it didn't return a file with the exact name 'DATA', and only returned files with something appended to DATA.
I'm a little surprised at that, as they should be implementing the same mechanisms and the same POSIX standard on globbing. Is there any chance there's a permissions related problem instead?
But otherwise you could perhaps try not using glob to do the heavy lifting, and instead just compare the file name to a bunch of regular expressions. (Although note - REs have very different syntax)
foreach my $file ( glob('/datafiles/data_one/level_one/*/*') ) {
    next unless $file =~ m,DATA\w*$,;
}

grep replacement with extensive regular expression implementation

I have been using grepWin for general searching of files, and wingrep when I want to do replacements or what-have-you.
grepWin has an extensive implementation of regular expressions but, as mentioned above, doesn't do replacements.
Wingrep does replacements, but has a severely limited regular expression implementation.
Does anyone know of any (preferably free) grep tools for Windows that do replacement AND have a reasonable implementation of regular expressions?
Thanks in advance.
I think perl at the command line is the answer you are looking for. Widely portable, powerful regex support.
Let's say that you have the following file:
foo
bar
baz
quux
you can use
perl -pne 's/quux/splat!/' -i /tmp/foo
to produce
foo
bar
baz
splat!
The magic is in Perl's command-line switches:
-e: execute the next argument as a Perl command.
-n: run the command on every line of input.
-p: like -n, but also print each line after the command runs, without needing an explicit print statement.
-i: make substitutions in place, overwriting the document with the output of your command... use with caution.
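One caveat if you run this from cmd on Windows: quote the code with double quotes, and give -i a backup extension, since Windows builds of Perl generally refuse in-place editing without one. For example (foo.txt is a hypothetical file):
perl -pe "s/quux/splat!/" -i.bak foo.txt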
I use Cygwin quite a lot for this sort of task.
Unfortunately it has the world's most unintuitive installer, but once it's installed correctly it's very usable... well, apart from a few minor issues with copy and paste and the odd issue with line endings.
The good thing is that all the tools work like on a real GNU system, so if you're already familiar with Linux or similar, you don't have to learn anything new (apart from how to use that crazy installer).
Overall I think the advantages make up for the few usability issues.
If you are on Windows, you can use VBScript (requires no downloads). It comes with regex support, e.g. to change "one" to "ONE":
Set objFS=CreateObject("Scripting.FileSystemObject")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
strFileContents = objFile.ReadAll
Set objRE = New RegExp
objRE.Global = True
objRE.IgnoreCase = False
objRE.Pattern = "one"
strFileContents = objRE.Replace(strFileContents,"ONE") 'simple replacement
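'Note: the result is only echoed below. To save it back instead, a sketch
'using the same FileSystemObject would reopen the file for writing (mode 2):
'  Set objOut = objFS.OpenTextFile(strFile, 2)
'  objOut.Write strFileContents
'  objOut.Close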
WScript.Echo strFileContents
output
C:\test>type file
one
two one two
three
C:\test>cscript //nologo test.vbs file
ONE
two ONE two
three
You can read the VBScript documentation to learn more about using regex.