regular expression matching filename with multiple extensions - regex

Is there a regular expression to match the some.prefix part of both of the following filenames?
xyz can be any character of [a-z0-9-_\ ]
some.prefix part can be any character in [a-zA-Z0-9-_\.\ ].
I intentionally included a . in some.prefix.
some.prefix.xyz.xyz
some.prefix.xyz
I have tried many combinations. For example:
(?P<prefix>[a-zA-Z0-9-_\.]+)(?:\.[a-z0-9]+\.gz|\.[a-z0-9]+)
It works with abc.def.csv by catching abc.def, but fails to catch it in abc.def.csv.gz.
I primarily use Python, but I thought the regex itself should apply to many languages.
Update: It's not possible, see discussion with @nowox below.

I think your regex works pretty well. I recommend trying it on regex101 with your example:
https://regex101.com/r/dV6cE8/3
The expression
^(?i)[ \w-]+\.[ \w-]+
should work in your case:
som e.prefix.xyz.xyz
^^^^^^^^^^^
some.prefix.xyz
^^^^^^^^^^^
abc.def.csv.gz
^^^^^^^
And in Python you can use:
import re
text = """som e.prefix.xyz.xyz
some.prefix.xyz
abc.def.csv.gz"""
# Python requires the inline flag (?i) at the very start of the pattern
print(re.findall(r'(?i)^[ \w-]+\.[ \w-]+', text, re.MULTILINE))
Which will display:
['som e.prefix', 'some.prefix', 'abc.def']
I think you may be a bit confused about your requirement. To summarize, you have a pathname made of characters and dots, such as:
foo.bar.baz.0
foobar.tar.gz
f.o.o.b.a.r
How would you separate these strings into a basename and an extension? Some patterns are recognizable: .tar.gz is definitely an extension, but is .bar.baz.0 the extension, or is it only .0?
The answer is not easy, and no regex in the world can guess the correct answer 100% of the time without some hints.
For example, you can list the acceptable extensions and define some criteria:
An extension matches the regex \.\w{1,4}$
Several extensions may be concatenated together: (\.\w{1,4}){1,4}$
The remainder is called the basename
From this you can build this regular expression:
(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$
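A minimal Python sketch of the criteria above, assuming each extension is 1 to 4 word characters and at most four are chained. The helper name split_name is hypothetical, chosen only for illustration:

```python
import re

# Heuristic from the answer: up to four chained extensions of 1-4 word characters each.
PATTERN = re.compile(r'(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$')

def split_name(filename):
    """Return (basename, extension), or (filename, '') if nothing looks like an extension."""
    m = PATTERN.match(filename)
    if m is None:
        return filename, ''
    return m.group('basename'), m.group('extension')

for name in ['foobar.tar.gz', 'foo.bar.baz.0', 'some.prefix.xyz.xyz', 'f.o.o.b.a.r']:
    print(name, '->', split_name(name))
```

Note that foo.bar.baz.0 is split into foo and .bar.baz.0, which may or may not be what you want. This is exactly the ambiguity described above: the regex only applies the stated criteria, it cannot know which parts are really extensions.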

Try this: [a-z0-9-_\\]+\.[a-z0-9-_\\]+[a-zA-Z0-9-_\.\\]+

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret a whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, but you need a lookahead assertion rather than a lookbehind, and a parenthesis was misplaced. The right approach is to put the original pattern in parentheses and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
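A minimal Python sketch of this idea, assuming the original pattern arrives as a string argument. The helper names prefilter and find_all are hypothetical, and a non-capturing group is used so that the group numbering of the original pattern is left unchanged:

```python
import re

def prefilter(pattern, prefix):
    """Wrap an arbitrary pattern so it only matches where `prefix` starts the match."""
    # (?=...) is a lookahead: it checks for the prefix without consuming characters.
    # (?:...) keeps alternations such as cat|car|bat grouped together.
    return '(?={})(?:{})'.format(re.escape(prefix), pattern)

def find_all(pattern, text):
    # finditer + group(0) returns whole matches even if the pattern has capture groups.
    return [m.group(0) for m in re.finditer(pattern, text)]

print(find_all(prefilter('cat|car|bat', 'ca'), 'cat car bat'))      # ['cat', 'car']
print(find_all(prefilter('[a-z]{4}', 'ca'), 'cart bats cave'))      # ['cart', 'cave']
```

Because the lookahead consumes nothing, the 4-letter pattern still yields 4-letter matches rather than 6-letter ones.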
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regex replace from | [duplicate]

How can I replace several different words all at once in Notepad++?
For example:
I have "good", "great" and "fine" and I want to replace them with "bad", "worse" and "not", respectively, all at once.
I know that I can replace them one by one, but the problem I am facing requires that I replace a lot of words, which is not convenient to do.
Try a regular expression replace of (good)|(great)|(fine) with (?1bad)(?2worse)(?3not).
The search looks for any of the three alternatives separated by |. Each alternative has its own capture group. The replacement uses the conditional form ?Ntrue-expression:false-expression, where N is a decimal digit; the clause checks whether capture group N matched.
Tested in Notepad++ 6.3
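For comparison outside Notepad++: Python's re module has no conditional replacement syntax like (?1bad), so the sketch below swaps in a different technique, a replacement callback with a lookup table, to get the same single-pass effect. The names REPLACEMENTS and replace_all are hypothetical:

```python
import re

# Word pairs taken from the question.
REPLACEMENTS = {'good': 'bad', 'great': 'worse', 'fine': 'not'}

# One alternation built from the keys; longer words first so that a word
# is never shadowed by a shorter word that is its prefix.
PATTERN = re.compile('|'.join(
    re.escape(word) for word in sorted(REPLACEMENTS, key=len, reverse=True)))

def replace_all(text):
    # The callback looks up whichever alternative actually matched.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)

print(replace_all('good, great and fine'))  # bad, worse and not
```

Doing all replacements in one pass means a replacement word is never itself replaced by a later rule.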
Update:
You can find good documentation about the new PCRE regular expressions, used by N++ since version 6.0, at the TWO addresses below:
http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html
The FIRST one concerns the syntax of regular expressions in SEARCH
The SECOND one concerns the syntax of regular expressions in REPLACEMENT
And, if you can understand written French, I made a tutorial about PCRE regular expressions, stored on the personal site of Christian Cuvier (cchris), at the address below:
http://oedoc.free.fr/Regex/TutorielRegex.zip
(Extracted from a posting by THEVENOT Guy at http://sourceforge.net/p/notepad-plus/discussion/331754/thread/ca059a0a/ )
Install Python Script plugin from Plugin Manager.
Create a file with your substitutions (e.g., C:/Temp/Substitutions.txt), separate values with space:
good bad
great worse
fine not
Create a new script:
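# Runs inside the Notepad++ Python Script plugin, which provides `editor`
with open('C:/Temp/Substitutions.txt') as f:
    for l in f:
        s = l.split()  # s[0] = word to find, s[1] = its replacement
        editor.replace(s[0], s[1])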
Run the new script against the text you want to substitute.
I needed to run the substitution on several files. Based on Mauricio Morales's answer, I created the following script.
# Runs inside the Notepad++ Python Script plugin, which provides `notepad` and `editor`
with open('C:/Temp/Substitutions.txt') as f:
    files = notepad.getFiles()
    for file in files:
        notepad.activateFile(file[0])
        for l in f:
            s = l.split()
            editor.replace(s[0], s[1])
        f.seek(0) # Reset file input stream
If you're replacing the same words in several different files all the time, it helps to record the action once using the macro recording buttons on the Notepad++ toolbar and save it as a macro.

Regular expression to amend Sysprep.inf file

I currently have a requirement to parse a sysprep.inf file and insert a value input by the end user.
I'm coding this utility using AutoIT and my regular expression is slightly out.
The line I need amending is as follows:
ComputerName=%DeviceName%
DeviceName is a variable injected by LANDesk. If the device has previously been in the LANDesk database, the name is injected into the file. If not, the variable name remains. The device name must go after the =.
Here is a snippet of my current code:
$FileContents = StringRegExpReplace($FileContents,'ComputerName=[a-z]','ComputerName='& $deviceNameInput)
Thanks for any guidance anyone can offer.
I'm not familiar with AutoIT or BASIC... but it looks like you need to be using something like this:
$FileContents = StringRegExpReplace($FileContents,'.*ComputerName=(\%[a-zA-Z]*\%).*', $deviceNameInput)
OR
$FileContents = StringRegExpReplace($FileContents,'ComputerName=\%[a-zA-Z]*\%', 'ComputerName='&$deviceNameInput)
This will only replace a placeholder whose name consists of the letters a-z or A-Z, not one containing digits or spaces.
Writing regular expressions can be tough because there are so many dialects of regular expressions. Assuming you are using a regex library that supports a Perl-like dialect you might want to try this for your regex:
^\s*ComputerName\s*=\s*(?:%DeviceName%|[a-zA-Z0-9_-]+)
Basically, this regex will match a line containing either the literal string ComputerName=%DeviceName% or ComputerName=<some actual device name that only contains the characters a-z, A-Z, 0-9, _, and ->. This regex is also a bit lenient in that it will match a line that contains whitespace at the beginning of the line as well as before and/or after the equals sign. The image below explains the components of this regex in greater detail.
p.s. that image was generated by RegexBuddy, an excellent regular expression IDE.
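To make the behaviour of that pattern concrete, here is a minimal sketch in Python rather than AutoIt. It makes two assumptions that are not in the answer: [ \t]* is used instead of \s* so that a multi-line match cannot swallow blank lines before the ComputerName line, and the key is captured so that the original spacing is preserved. The function name set_computer_name and the sample contents are made up for illustration:

```python
import re

# Assumption: [ \t]* instead of \s*, so that in MULTILINE mode the match
# cannot extend across preceding blank lines.
PATTERN = re.compile(
    r'^([ \t]*ComputerName[ \t]*=[ \t]*)(?:%DeviceName%|[A-Za-z0-9_-]+)',
    re.MULTILINE)

def set_computer_name(contents, device_name):
    # A function replacement avoids backslash or group-reference surprises
    # if the user-supplied name contains unusual characters.
    return PATTERN.sub(lambda m: m.group(1) + device_name, contents)

# Made-up sample file contents.
sample = '[UserData]\n\nComputerName=%DeviceName%\nOrgName=Example\n'
print(set_computer_name(sample, 'PC-042'))
```

The same call also replaces a name that LANDesk has already injected, for example a line such as ComputerName = OLDPC.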
Autoit has a great way of dealing with ini files - IniWrite
IniWrite("SysPrep.ini", "write_section_here", "ComputerName", $deviceNameInput)
creates or updates SysPrep.ini with:
[write_section_here]
ComputerName=localhost

Do calculation on captured number in regex before using it in replacement

Using a regex, I am able to find a bunch of numbers that I want to replace. However, I want to replace the number with another number that is calculated using the original - captured - number.
Is that possible in notepad++ using a kind of expression in the replacement-part?
Edit: Maybe a strange thought, but could the calculation be done in the search part, generating a second captured number that would effectively be the result?
Even if it is possible, it will almost certainly be "messy" - why not do the replacements with a simple script instead? For example..
#!/usr/bin/env ruby
f = File.new("f1.txt", File::RDWR)
contents = f.read()
contents.gsub!(/\d+/){|m|
  m.to_i + 1 # convert the current match to an integer, and add one
}
f.truncate(0) # empty the existing file
f.seek(0) # seek to the start of the file, before writing again
f.write(contents) # write modified file
f.close()
..and the output:
$ cat f1.txt
This was one: 1
This two two: 2
$ ruby replacer.rb
$ cat f1.txt
This was one: 2
This two two: 3
In reply to jeroen's comment,
I was actually interested if the possibility existed in the regular expression itself as they are so widespread
A regular expression is really just a simple pattern-matching syntax. Doing anything more advanced than search/replace with the matches would be up to the text editor, but the usefulness of this is very limited, and it can be achieved via the scripting most editors allow (Notepad++ has a plugin system, although I've no idea how easy it is to use).
Basically, if regex search-and-replace will not achieve what you want, I would say either use your editor's scripting ability or use an external script.
Is that possible in notepad++ using a kind of expression in the replacement-part?
Interpolated evaluation of regular-expression matches is a relatively advanced feature that I probably would not expect to find in a general-purpose text editing application. I played around with Notepad++ a bit but was unable to get this to work, nor could I find anything in the documentation that suggests this is possible.
Hmmm... I'd have to recommend AWK to do this.
http://en.wikipedia.org/wiki/AWK
Notepad++ has limited regular expressions built in. There are extensions that add a bit more to the regular expression find and replace, but I've found those hard to use. I would recommend writing a little external program to do it for you. Ruby, Perl, or Python would all be great for it, if you know those languages. I use Ruby and have had lots of success with it.
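As a sketch of what such a little external program could look like in Python: re.sub accepts a function as the replacement, so the calculation happens in ordinary code rather than in the regex. The function name increment_numbers and the delta parameter are hypothetical; the sample text is taken from the Ruby example earlier in this thread:

```python
import re

def increment_numbers(text, delta=1):
    # re.sub calls the function once per match; it receives the match object
    # and returns the replacement string computed from the matched digits.
    return re.sub(r'\d+', lambda m: str(int(m.group(0)) + delta), text)

print(increment_numbers('This was one: 1\nThis two two: 2'))
```

Any other arithmetic on the captured number can go inside the lambda in the same way.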

Regex: Get Filename Without Extension in One Shot?

I want to get just the filename using regex, so I've been trying simple things like
([^\.]*)
which of course works only if the filename has one extension. But if it is adfadsfads.blah.txt I just want adfadsfads.blah. How can I do this with regex?
In regards to David's question, 'why would you use regex' for this, the answer is, 'for fun.' In fact, the code I'm using is simple
length_of_ext = File.extname(filename).length
filename = filename[0,(filename.length-length_of_ext)]
but I like to learn regex whenever possible because it always comes up at Geek cocktail parties.
Try this:
(.+?)(\.[^.]*$|$)
This will:
Capture filenames that start with a dot (e.g. .logs is a file named .logs, not a file extension), which is common in Unix.
Get everything before the last dot: foo.bar.jpeg gets you foo.bar.
Handle files with no dot: secret-letter gets you secret-letter.
Note: as commenter j_random_hacker suggested, this performs as advertised, but you might want to precede things with an anchor for readability purposes.
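A quick way to check those three cases is a short Python sketch. The leading ^ anchor follows the commenter's suggestion, and the helper name strip_extension is hypothetical:

```python
import re

PATTERN = re.compile(r'^(.+?)(\.[^.]*$|$)')

def strip_extension(filename):
    m = PATTERN.match(filename)
    # An empty string does not match, because .+? needs at least one character.
    return m.group(1) if m else filename

for name in ['.logs', 'foo.bar.jpeg', 'secret-letter', 'adfadsfads.blah.txt']:
    print(name, '->', strip_extension(name))
```

The lazy .+? grows one character at a time until the rest of the string is either a final dot-plus-extension or empty, which is why a leading dot is kept as part of the name.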
Everything followed by a dot, followed by one or more characters that are not dots, followed by the end of the string:
(.+?)\.[^\.]+$
The everything-before-the-last-dot is grouped for easy retrieval.
If you aren't 100% sure every file will have an extension, try:
(.+?)(\.[^\.]+$|$)
How about two groups, one for the end and one for the filename?
E.g.:
(.+?)(?:\.[^\.]*$|$)
^(.*)\\(.*)(\..*)$
Gets the Path without the last \
The file without extension
The extension with a .
Examples:
c:\1\2\3\Books.accdb
(c:\1\2\3)(Books)(.accdb)
Does not support multiple . in file name
Does support . in file path
I realize this question is a bit outdated; however, I had some trouble finding a good source and wound up making the regex myself. To save time for whoever finds this:
If you're looking for a ~standalone~ regex
This will match the extension without the dot
\w+(?![\.\w])
This will always match the file name if it has an extension
[\w\. ]+(?=[\.])
OK, I am not sure why I would use a regular expression for this. If I know, for example, that the string is a full file path, then I would use another API to get the file name. Regular expressions are very powerful but at the same time quite complex (you have just proved that by asking how to create such a simple regex). As somebody said: you had a problem and decided to solve it using regular expressions. Now you have two problems.
Think again. If you are on .NET platform for example, then take a look at System.IO.Path class.
I used this pattern for simple search:
^\s*[^\.\W]+$
for this text:
file.ext
fileext
file.ext.ext
file.ext
fileext
It finds fileext in the second and last lines.
I applied it in a text tree view of a folder (with spaces as indents).
Just the name of the file, without path and suffix.
^.*[\\|\/](.+?)\.[^\.]+$
Try
(?<=[\\\w\d-:]*\\)([\w\d-:]*)(?=\.[\.\w\d-:]*)
Captures just the filename of any kind within an entire filepath. Purposefully excludes the file path and the file extension
Example:
C:\Log\test\bin\fee105d1-5008-410c-be39-883e5e40a33d.pdf
Doesn't capture (C:\Log\test\bin)
Captures (fee105d1-5008-410c-be39-883e5e40a33d)
Doesn't capture (.pdf)
This RegExp works for me:
(.+(?=\..+$))|(.+[^\.])
Results (input -> match):
test.txt -> test
test 234!.something123 -> test 234!
.test -> .test
.test.txt -> .test
test.test2.txt -> test.test2
. -> (no match)