Perl Regex Output only characters that can be used as unix filename


I wrote a basic mp3 organizing script for myself. I have the line:
$outname = "/home/jebsky/safehouse/music/mp3/" . $initial . "/" . $artist . "/" . $year ." - ". $album . "/" . $track ." - ". $artist ." - ". $title . ".mp3";
I want a regex that changes $outname so that any character that is not safe in a filename gets replaced by an underscore.

If any of your components include "/", you really want to do the substitution on them before assembling them into $outname.
Which characters are safe can vary from one operating system and/or filesystem to another.
Many filesystems have no problem with any characters other than "/" and nul. You're probably better off deciding which characters you want to keep, for other reasons than what your filesystem allows.
The following keeps only letters and digits, replacing sequences of other characters with _:
for ( $initial, $artist, $year, $album, $track, $title ) {
s/[^A-Za-z0-9]+/_/g;
}
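As a rough sketch of the "sanitize each component first, then assemble" advice above: the helper name sanitize_component and the sample tag values are made up for illustration, and the whitelist is the same letters-and-digits class used in the loop.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of characters outside the
# whitelist into a single underscore.
sub sanitize_component {
    my ($text) = @_;
    $text =~ s/[^A-Za-z0-9]+/_/g;
    return $text;
}

# Made-up sample tags; note the "/" in the artist name.
my ( $artist, $year, $album, $track, $title ) =
    map { sanitize_component($_) }
    ( 'AC/DC', '1980', 'Back in Black', '06', 'Back in Black' );
my $initial = substr( $artist, 0, 1 );

# Only now add the real directory separators.
my $outname = "/home/jebsky/safehouse/music/mp3/$initial/$artist/"
            . "$year - $album/$track - $artist - $title.mp3";
print "$outname\n";
```

Because the "/" inside the artist name is replaced before the path is built, it cannot introduce an extra directory level.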

One quick way to escape all non-word characters in a string is to use the \Q and \E escapes, as in:
# assuming $outname already contains the required path and
# globally "unescaping" file chars / and .
($outname = "\Q$outname\E") =~ s/\\([\/\.])/$1/g;
One thing to consider is that long run-on string concatenations like yours tend to be hard to both read and maintain. A better way of representing this operation might be to break it up into logical units, like:
my $basename = '/home/jebsky/safehouse/music/mp3';
my $dirpath = "${basename}/${initial}/${artist}/${year}-${album}";
my $filename = "${track}-${artist}-${title}.mp3";
$outname = "${dirpath}/${filename}";
Within strings, writing a variable as "${varname}" ensures that the character following the variable name cannot be mistaken for part of it. It is usually a good idea even if the next character isn't alphanumeric, because it clearly marks variables within the string.
Finally, I think it's a good idea to get away from using '"' and '\'' as string delimiters, since they require escaping if the string contains the delimiter.
Use the qq// and q// operators instead (replacing the / with a char not appearing in the string if required), as in:
my $basename = q!/home/jebsky/safehouse/music/mp3!;
my $dirpath = qq!${basename}/${initial}/${artist}!;
my $filename = qq!${year}-${album}/${track}-${artist}-${title}.mp3!;
$outname = qq!${dirpath}/${filename}!;
This way, you'll rarely have to escape any char in the string.

Related

How to split a string by \ in perl?

use Data::Dumper qw(Dumper);
@arr=split('\/',"\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe");
print(Dumper \@arr);
output:
$VAR1 = [
'Program FilesMicrosoft VisualStudioVC98BinRebase.exe'
];
Required output:
$VAR1 = [
'Program Files',
'Microsoft VisualStudio',
'VC98',
'Bin',
'Rebase.exe'
];

You are splitting on the forward slash / (escaped by \), while you clearly need the \.
Since \ itself escapes things, you need to escape it, too
use warnings;
use strict;
use feature 'say';
my $str = '\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe';
my @ary = split /\\/, $str;
shift @ary if $ary[0] eq '';
say for @ary;
which prints the path components, one per line.
Since this string begins with a \, the first element of @ary is going to be an empty string, as that precedes the first \. We remove it from the array with shift, after a check.
Note that for this string one must use '', or the operator form q(...), since the double quotes try to interpolate the presumed escapes \P, \M (etc) in the string, failing with a warning. It is a good idea to use '' for literal strings and "" (or qq()) when you need to interpolate variables.
Another way to do this is with regex
my @ary = $str =~ /[^\\]+/g;
The negated character class, [^...], with the (escaped) \ matches any character that is not \. The quantifier + means that at least one such match is needed while it matches as many times as possible. Thus this matches a sequence of characters up to the first \.
With the modifier /g the matching keeps going through the string to find all such patterns.
Assigning to an array puts the match operator in list context, in which the list of matches is returned, and assigned to @ary. In a scalar context only the true (1) or false (empty string) is returned.
No capturing () are needed here, since we want everything that is matched.
Generally the () in list context are needed so that only the captured matches are returned.
With this we don't have to worry about an empty string at the beginning since there is no match before the first \, as there are no characters before it while we requested at least one non-\.
But working with paths is common and there are ready tools for that, which take care of details. The core module File::Spec is multi-platform and its splitdir breaks the path into components
use File::Spec;
my @path_components = File::Spec->splitdir($str);
The first element is again an empty string if the path starts with \ (or / on Unix/Mac). Note that File::Spec applies the rules of the operating system it runs on; to split a Windows path on another system, call File::Spec::Win32->splitdir instead.
Thanks to Sinan Ünür for a comment.
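A small side-by-side sketch of the three approaches from this answer. It requests File::Spec::Win32 explicitly, which is an assumption on my part so that the result does not depend on the system the script runs on; the variable names are only illustrative.

```perl
use strict;
use warnings;
use File::Spec::Win32;

my $str = '\Program Files\Microsoft VisualStudio\VC98\Bin\Rebase.exe';

# Plain split: the leading "\" produces an empty first element.
my @by_split = split /\\/, $str;

# Regex in list context: only non-empty runs of non-backslashes match.
my @by_regex = $str =~ /[^\\]+/g;

# Windows rules requested explicitly, so this also works on Unix.
my @by_spec = File::Spec::Win32->splitdir($str);

print scalar(@by_split), " ", scalar(@by_regex), " ", scalar(@by_spec), "\n";
print "$_\n" for @by_regex;
```

The split and splitdir results carry the leading empty element, while the regex version does not.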

How to filter a string for invalid filename characters using regex

I don't want the user to type in anything invalid, so I am trying to remove bad characters as they are typed. My regex removes everything except word characters, which also removes . , and -, but I need to keep those to make the user happy :D
In short: this script removes bad characters from an input field using a regex.
Input field:
$CustomerInbox = New-Object System.Windows.Forms.TextBox #initialization -> initializes the input box
$CustomerInbox.Location = New-Object System.Drawing.Size(10,120) #Location -> where the label is located in the window
$CustomerInbox.Size = New-Object System.Drawing.Size(260,20) #Size -> defines the size of the inputbox
$CustomerInbox.MaxLength = 30 #sets max. length of the input box to 30
$CustomerInbox.add_TextChanged($CustomerInbox_OnTextEnter)
$objForm.Controls.Add($CustomerInbox) #adding -> adds the input box to the window
Function:
$ResearchGroupInbox_OnTextEnter = {
if ($ResearchGroupInbox.Text -notmatch '^\w{1,6}$') { #regex (Regular Expression) to check if it does match numbers, words or non of them!
$ResearchGroupInbox.Text = $ResearchGroupInbox.Text -replace '\W' #replaces all non words!
}
}
Bad Characters I don't want to appear:
~ " # % & * : < > ? / \ { | } #those are the 'bad characters'

Note that if you want to replace invalid file name chars, you could leverage the solution from How to strip illegal characters before trying to save filenames?
Answering your question, if you have specific characters, put them into a character class, do not use a generic \W that also matches a lot more characters.
Use
[~"#%&*:<>?/\\{|}]+
Note that none of these chars except \ needs escaping inside a character class. Also, adding the + quantifier (matches 1 or more occurrences of the quantified subpattern) streamlines the replacement: it matches whole consecutive chunks of such characters and replaces each chunk at once with the replacement pattern (here, an empty string).
Note you may also need to account for filenames like con, lpt1, etc.
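The character class itself is not specific to PowerShell. As an illustrative sketch only, here is the same class in Perl, the language used elsewhere on this page; the sample input is made up, and the "/" is escaped solely because it is also the pattern delimiter.

```perl
use strict;
use warnings;

# The "bad characters" from the question, as one character class.
# Inside the class only the backslash needs escaping; "/" is escaped
# here because it doubles as the qr// delimiter.
my $bad = qr/[~"#%&*:<>?\/\\{|}]+/;

my $input = 'report: <draft> #2?.txt';    # made-up sample input
( my $clean = $input ) =~ s/$bad//g;
print "$clean\n";
```

Characters such as ".", "," and "-" survive because they are simply not listed in the class.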

To ensure the filename is valid, you should use the GetInvalidFileNameChars .NET method to retrieve all invalid characters and use a regex to check whether the filename is valid:
[regex]$containsInvalidCharacter = '[{0}]' -f ([regex]::Escape([System.IO.Path]::GetInvalidFileNameChars()))
if ($containsInvalidCharacter.IsMatch(($ResearchGroupInbox.Text)))
{
# filename is invalid...
}

$ResearchGroupInbox.Text -replace '~|"|#|%|\&|\*|:|<|>|\?|\/|\\|{|\||}'
Or, as @Wiketor suggests, you can shorten it to '[~"#%&*:<>?/\\{|}]+'

Getting just the file name from full path

I need to get from a full file path just the name of the file. I've tried to use:
$out_fname =~ s/[\/\w+\/]+//;
but it also "eats up" parts of the file name.
example:
for a file:
/bla/bla/folder/file.part.1.file,
it returned:
.part.1.file

You can do:
use File::Basename;
my $path = "/bla/bla/folder/file.part.1.file";
my $filename = basename($path);

Besides File::Basename, there's also Path::Class, which can be handy for more complex operations, particularly when dealing with directories, or cross-platform/filesystem operations. It's probably overkill in this case, but might be worth knowing about.
use Path::Class;
my $file = file( "/bla/bla/folder/file.part.1.file" );
my $filename = $file->basename;

I agree with the other answers, but just wanted to explain the mistake in your pattern. Regex is tricky, but worth it to learn well.
The square brackets define a character class. In your case, it matches a forward slash, a word character (from the \w), a literal + character, or a forward slash again (which is redundant). The + after the class then says to match one or more of those. Several substrings could match; the regex engine picks the earliest starting position, which is the first /, and then grabs as much as possible.
This is clearly not what you intended. For example, if you had a . in one of your directory names, the match would stop there: /blah.foo/bar/x.y.z would return .foo/bar/x.y.z.
The way to think of this is that you want to match all characters up to and including the final /.
All characters then slash: /.*\//
But to be safer, add a caret at front to make sure it starts there: /^.*\//
And to allow forward and backslashes, make a class for that: /^.*[\/\\]/ (i.e. elusive's answer).
A really good reference is Learning Perl. There are about 3 really good regex chapters. They are applicable to non-Perl regex users as well.
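To make the difference concrete, here is a minimal sketch that runs the pattern from the question and the corrected pattern on the same path, next to File::Basename for comparison. The variable names are only illustrative.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $path = '/bla/bla/folder/file.part.1.file';

# The pattern from the question: a class of "/", word characters and "+".
# It stops at the first "." because "." is not in the class.
( my $wrong = $path ) =~ s/[\/\w+\/]+//;

# Greedy ".*" runs to the end, then backtracks to the last separator.
( my $right = $path ) =~ s/^.*[\/\\]//;

print "$wrong\n";
print "$right\n";
print basename($path), "\n";
```

The first substitution leaves only the part after the first character outside the class, while the second one removes everything up to and including the final separator.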

Using split on the directory separator is another alternative. This has the same caveats as using a regex (i.e. with filenames it's better to use a module where someone else has already thought about edge cases, portability, different filesystems, etc, and so you don't need matching on both back- and forward-slashes), but useful as another general technique where you have a string with a repeated separator.
my $file = "/bla/bla/folder/file.part.1.file";
my @parts = split /\//, $file;
my $filename = $parts[-1];

This is exactly what I would expect it to retain in the given substitution. You are saying replace the longest string of slashes and word characters with nothing. So it grabs all the characters up until the first character you didn't specify and deletes them.
It's doing what you are asking it to do. I join with others in saying use File::Basename for what you are trying to do.
But here is the quickest way to do the same thing:
my $fname = substr( $out_fname, rindex( $out_fname, '/' ) + 1 );
Here, it says find the last occurrence of '/' in the string and give me the text starting one after that position. I'm not anti-regex by any stretch, but it's a simple expression of what you actually want to do. I've had to do stuff like this for so long, I wrote a last_after sub:
sub last_after {
my ( $string, $delim ) = @_;
# declare $ln here so it is still in scope for the substr below
my $ln = length( $delim // '' );
return $string // '' unless length( $string // '' ) and $ln;
my $ri = rindex( $string, $delim );
return $ri == -1 ? $string : substr( $string, $ri + $ln );
}

I also needed to pull just the last field from a bunch of path names. A command along these lines does it, without keeping the leading slash:
grep -o '[^/]*$' inputfile > outputfile

What about this:
$out_fname =~ s/^.*[\/\\]//;
It should remove everything in front of your filename.

How to clean up a string to use as a filename in Perl?

I have a job application form where people fill in their name and contact info and attach a resume.
Then the contact info gets emailed with the resume attached.
I would like to change the name of the file so that it is a combination of the competition number and their name.
How can I clean up my generated filename so that I can guarantee it has no invalid characters in it? So far I can remove all the spaces and lowercase the string.
I'd like to remove any punctuation (like apostrophes) and non-alphabetical characters (like accented letters).
For example if "André O'Hara" submitted his resume for job 555 using this form, I would be happy if all the questionable characters were removed and I ended up with a file name like:
555-andr-ohara-resume.doc
What regex can I use to remove all non-alphabetical characters ?
Here is my code so far:
# Create a cleaned up version of competition number + First Name + Last Name number to name the file
my $hr_generated_filename = $cgi->param("competition") . "-" . $cgi->param("first") . "-" . $cgi->param("last");
# change to all lowercase
$hr_generated_filename = lc( $hr_generated_filename );
# remove all whitespace
$hr_generated_filename =~ s/\s+//g;
push @{ $msg->{attach} }, {
Type => 'application/octet-stream',
Filename => $hr_generated_filename.".$file-extension",
Data => $data,
Disposition => 'attachment',
Encoding => 'base64',
};

If you are trying to "white-list" characters, your basic approach should be to use a character class complement:
[...] defines a character class in Perl regexes, which will match any characters defined inside (including ranges such as a-z). If you add a ^, it becomes a complement, so it matches any characters not defined inside the brackets.
$hr_generated_filename =~ s/[^A-Za-z0-9\-\.]//g;
That will remove anything that is not an un-accented Latin letter, a number, a dash, or a dot. To add to your white-list, just add characters inside the [^...].
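A minimal sketch of that whitelist applied to the example from the question. The three variables are hypothetical stand-ins for the CGI parameters, and the accented letter is written as an escape so the sketch does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the CGI parameters in the question.
my ( $competition, $first, $last ) = ( '555', "Andr\x{e9}", "O'Hara" );

my $name = lc( join '-', $competition, $first, $last );
$name =~ s/\s+//g;            # drop whitespace
$name =~ s/[^a-z0-9.-]//g;    # drop everything outside the whitelist
print "$name-resume.doc\n";
```

Since the string is lowercased first, the class only needs a-z; the accented letter and the apostrophe fall outside the whitelist and are removed.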

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';

/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);

This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings
\"(\\.|[^\"])*\"

As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/

Most of the solutions provided here use alternation inside a repetition, i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java, for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford, reduces the number of parsing steps and avoids most stack overflows.

/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string

"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes

/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, since it is known beforehand that a string without a closing quote wouldn't match in any case.
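A short sketch of that pattern applied to the string from the question. The capture group and variable names are additions for illustration; possessive quantifiers need Perl 5.10 or later.

```perl
use strict;
use warnings;

# The sample from the question, kept literal with a single-quoted heredoc.
chomp( my $s = <<'END' );
 function(){ return " It\'s big \"problem "; }
END

# Possessive quantifiers never give characters back, so a string with
# no closing quote fails quickly instead of backtracking.
my ($quoted) = $s =~ /("(?:[^"\\]++|\\.)*+")/;
print "$quoted\n";
```

The match starts at the first double quote, steps over each backslash-escaped character as a unit, and ends at the first double quote that is not escaped.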

This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with the character ".
It may contain any number of any characters, .*? (lazy match), ending with a non-escape character, [^\\].
Step 2 is lazily optional because the string can be empty (""): (.*?[^\\])??
Finally, every quoted string ends with the character ", but that may be preceded by an even number of escape characters, matched in pairs by (\\\\)+. This part is greedily (possessively) optional, ((\\\\)+)?+, because the string can be empty or have no trailing pairs.

An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form, generated using Jex's Regulex (JavaScript Regular Expression Visualizer); the image is on GitHub.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js

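Here is one that works with both " and ', and you can easily add other quote characters at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or '). Note that inside a character class \1 is not a backreference, so [^\1] matches almost any character, including the quote; the lazy *? is what ends the match at the closing quote.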
http://www.regular-expressions.info/backref.html

One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and a linear, manual scan. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).

A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)

If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"

I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.

If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you paste it between the double quotes it will automatically be escaped into a form the compiler accepts.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.

(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "

Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"
