Regex for truncating file name - regex

I'm having some trouble putting together a regex for trimming a file name after a certain length. This is being used to rename a large number of files simultaneously, far too many to reasonably rename by hand. Unfortunately, some of our employees like to leave notes on the end of the file name, which is what we're looking to remove.
Example file names (all of these patterns are present, which is what makes matching problematic):
ABC - A11B11 - Note.txt
ABC - A22B22 (Note).txt
ABC - A33B33 | Note.txt
All files will be identical in length, 16 characters specifically. The 1st section will be purely letters, specifically client account names. The 2nd section is a combination of numbers and letters, case ID numbers. The makeup of the 2nd section varies with each file name, but it is always 6 characters long and always a mixture of 2 letters and 4 numbers.
I've tried using regex to pinpoint the number/letter pattern in the 2nd sequence and delete everything afterwards. I've also tried leveraging the 16 character length to delete all characters beyond 16. Unfortunately, I'm not particularly good with regex and I'm not making much headway. Most of my attempts are recognized as a valid regex search, but give incorrect match results.
Any assistance I can get on this would be greatly appreciated.

The cleanest regex replacement I can conjure up would be:
Find: ^([A-Z0-9]+ - [A-Z0-9]+).*(\.\w+)$
Replace: $1$2
This approach is to match and capture the first two portions of the file name which you want to retain. It also captures the file extension. Then, in the replacement, we form the new file name, effectively removing any notes which might have followed the second section of the name.
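As a rough illustration, the same find/replace can be tried in Python with re.sub. This is only a sketch: the helper name strip_note is made up here, Python writes the backreferences as \1 and \2 rather than $1 and $2, and the three sample names are the ones from the question.

```python
import re

# Same pattern as above: keep "<account> - <case id>" and the extension,
# drop whatever sits between them.
NOTE_PATTERN = re.compile(r"^([A-Z0-9]+ - [A-Z0-9]+).*(\.\w+)$")

def strip_note(filename: str) -> str:
    """Return the file name without the trailing note (hypothetical helper)."""
    return NOTE_PATTERN.sub(r"\1\2", filename)

samples = [
    "ABC - A11B11 - Note.txt",
    "ABC - A22B22 (Note).txt",
    "ABC - A33B33 | Note.txt",
]
cleaned = [strip_note(name) for name in samples]
for old, new in zip(samples, cleaned):
    print(f"{old!r} -> {new!r}")
```

Names that do not fit the pattern are returned unchanged, which makes a dry run over the whole directory reasonably safe before any file is actually renamed.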

I think I just found a working search. It's not as hands-free as I was originally hoping for: it requires adjusting the match text for each account name and running the renames in batches, but it works.
Find: (^ABC - A11\d{2}).*
Replace: $1
It's probably better this way anyway; making automated changes to such a wide range of business documentation makes me a little nervous. This way we can roll out changes slowly to ensure accuracy and avoid data loss or mislabeling.
Thank you to everyone who pitched in ideas.

Related

Is there a list that makes the correlation between digits and visually-similar letters? Or vice-versa? (Unicode / Latin, OCR context)

Having the following European driving license info extracted using Tesseract.js (below is not the exact text, but rather an outline of the format these data fields would have on such documents), I would like to write multiple regular expressions that match different data fields on the driving license (ordering numbers below correspond to the digit preceding the field on any European driving license; the rules for labelling the data fields of these documents can be checked on Wikipedia - also, some document specimen photos there):
1. surname (last name)
2. other names (first name(s))
3. date of birth
4b. date of expiry
5. ID drivingLicense
8. address
For instance, for the first name(s) I wrote this regex: 2\.?\s*[\p{Letter}\s]*
The problem is that with such a name:
IOAN
my regex might not match because of the OCR seeing the O in IOAN as the digit 0. If I add the 0 to my regex like so: 2\.?\s*[\p{Letter}0\s]* then it will match; so the solution is simple for this scenario. Nonetheless, other times, I've had Bs mistaken for 8s or As mistaken for 4s. It might become even trickier when trying to account for other scenarios (consider all possible Unicode letter characters that could be mistaken for a digit).
Is there a list that makes the correlation between digits and their visually similar letter-counterparts? I am looking to hardcode these inside my regex for better matching. (correlation between digits and Unicode characters -not just the characters in the Latin alphabet- would help even better). Or is there any other better solution to account for all possible scenarios?
Some may think including a 0-9 in my regex inside the square brackets would be a viable solution; but something like below might happen, which does not satisfy my needs for the current scenario. Of course, I could do some whitespace parsing to account for that or even avoid the 3 digit that I know for certain will always follow on the next line by doing something like 2\.?\s*[\p{Letter}0-24-9\s]* - but still I think better solutions could be found.
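One alternative to widening every character class is to normalise the OCR text before matching. The sketch below is only an illustration: the table holds just the three confusions mentioned above (0/O, 8/B, 4/A), it is not a complete or authoritative list, and the names DIGIT_LOOKALIKES and fix_name_field are invented for this example. It should only be applied to fields that are known to contain letters, such as names.

```python
# Hypothetical lookup table: only the confusions named in the question.
DIGIT_LOOKALIKES = {"0": "O", "8": "B", "4": "A"}

_TABLE = str.maketrans(DIGIT_LOOKALIKES)

def fix_name_field(text: str) -> str:
    """Swap look-alike digits for letters in a letters-only field."""
    return text.translate(_TABLE)

print(fix_name_field("I0AN"))  # the OCR read the O as a zero
```

Keeping the table in one place means further pairs can be added as they turn up in real scans, without touching the regular expressions themselves.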

regex substitution no global modifiers available

I'm using software with a built-in regex implementation that does not support global modifiers, so I have to get this working without /g.
My test string is (the number of sections can be unlimited):
aaa%2dbbb%2dccc%2dddd%2deee
I want it to be: aaa-bbb-ccc-ddd-eee
Normally I would write (%2d) with the g flag and substitute with -
I managed to write this to match an unlimited number of occurrences:
(\w)((%2d)(\w+))+
but I have problems with the substitution rule, because my group 2 has 2 subgroups and I cannot work out how to handle them.
Can anyone help with the substitution rule?
As the comments reach the same conclusion I had come to before posting the question, I decided to post an answer to close the question properly (instead of deleting it, because even a negative answer is an answer and may save someone an hour or more of research, which is what happened to me). The general conclusion is that it's not possible to solve this with regex. Quoting the two best comments, by @ltux:
This problem can't be solved with regular expression in one go. If capture group is used with quantifier such as +, the content of the capture group will always be the last match found. In your case, the content of the 2nd capture group will be %2deee, and you can't get %2dbbb, %2dccc and so on, so there is no chance for you to substitute it. – ltux 2 days ago
Regular expression can't solve your problem. You have to try to bypass the limitations of the software by yourself, unless you tell us which software you are using. – ltux 2 days ago
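The behaviour ltux describes can be checked quickly. The snippet below uses Python's re module purely as a stand-in, since the software in question is not named; repeated capture groups behave the same way in most regex engines.

```python
import re

text = "aaa%2dbbb%2dccc%2dddd%2deee"
match = re.search(r"(\w)((%2d)(\w+))+", text)

# A quantified group only remembers its final repetition, so the
# earlier sections (%2dbbb, %2dccc, %2dddd) are not available.
print(match.group(2))
print(match.group(3), match.group(4))
```

Because only the last repetition survives, no single replacement string built from these groups can rebuild the middle sections.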
Create a file containing the line type that you want to process:
cat << EOF >> abcde.txt
aaa%2dbbb%2dccc%2dddd%2deee
EOF
Use this sed snippet as follows using the global substitution you mention as being the way you usually perform such a substitution.
sed -e "s#%2d#-#g" abcde.txt
aaa-bbb-ccc-ddd-eee
Basically you don't have to think about the characters that appear around the encoded sequence; concentrate only on the sequence itself. Replacing it every time it occurs solves the issue quite simply. In other words, pattern matching around the text you want to change is not necessary. This is a common trap that many of us fall into when dealing with regular expressions.
The substitution is saying: find the first occurrence of '%2d', replace it with a hyphen '-', and repeat for the rest of the string.
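For comparison, here is what the global substitution looks like outside the restricted software. This is a sketch in Python, not the asker's tool: re.sub replaces every occurrence unless a count is given, and since %2d is the percent-encoded form of a hyphen, urllib.parse.unquote gives the same result for this string.

```python
import re
from urllib.parse import unquote

text = "aaa%2dbbb%2dccc%2dddd%2deee"

# re.sub is global by default (count=0 means "replace all").
with_regex = re.sub(r"%2d", "-", text)

# Decoding the percent-escape produces the same string here.
with_unquote = unquote(text)

print(with_regex)
print(with_unquote)
```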

Insertion syntax for regex in Notepad++ or Perl

Shortform: searching:
"{,[0-9][0-9]," inserting Space+00... getting replaced string segment:
"{,SPACE00[0-9][0-9]," or other so-garbaged data for found [0-9][0-9] sequence ... so how do I search with a regex and insert in the middle???
Longform question:
I'm trying to do a series of simple character insertions -- digits actually -- in a series of mixed model CSV profiling data (five files each with different model parameters, several hundred lines each).
I'm visually challenged and want to insert padding characters to columnize the data, so I can focus on tweaking key values rather than on keeping my place from data file to data file.
The CSV data lines have this format:
*Variable_symbolic-name*,{##,##,* ... ('Set of CSV Numerical Data lists' ...},\n*
an actual data line:
61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
to be morphed to:
61,parameter17,\t\t{, 0070,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Give or take a tab character to align all the { numeric field starts...
I've found searching: "{,[0-9][0-9]," failed but "\{,[0-9][0-9]," succeeds for the find part of the search and replace operation... but have hit a proverbial brick wall in how to do the actual replace (with an insert) of such a short length. (Obviously with so many parameters and files, I'm moving cautiously!)
However, this Perl help tutorial leaves me in the dark as to how to keep the found ranges and insert padding before them (space, zero, zero to be specific if positive, '-00' if negative). In short, I need to know how to insert 2-3 characters via the replace field in Notepad++ while retaining the original data unchanged!
Articles herein have cited replacing paragraphs and lines, adding newlines, etc. but this simple insertion alteration seems too simple for you all. But it's been several hours of frustration for me!
Thanks! // Frank
Resolved:
Good news: ({,)([0-9][0-9],) and \1 xx\2 works fine as does ({,)(#[0-9][0-9],) and replacing with \1 xx#\2 ... whether or not tabs are utilized. Obviously the key was ([0-9][0-9],) which included the discrimination of the comma... though I have no idea why that seemed to fail an hour ago with trials made using Sobrinho's help. Must have not tried the sequence. Thanks all!
Try to type this in the search box:
(.+)(\{,[0-9][0-9].*)
And in the replace:
\1\t\t\2
When you put things between parentheses, they are "stored" by Notepad++ and can be reused in the replace box.
The groups are numbered starting from one and are accessed as \1, \2, ...
You tagged it as Perl, so here is how you do it in Perl ...
I prefer to use lookahead assertions rather than backreferences
s/(?= {,[0-9][0-9], ) /\t\t/x
Alternatively, $& contains the matched string ($0 is something different)
s/ {,[0-9][0-9], /\t\t$&/x
You will need a backreference here, meaning something which, in the replace part, will be equal to what you have matched.
Usually, the whole matched part is stored in the $0 backreference. (A capture group gives you $1, a second capture group gives you $2, and so on.)
Back to your question, you could try this:
Find:
(\{,)([0-9][0-9],)
Replace by:
\t\t$1 00$2
This replaces the part that matched \{,[0-9][0-9], with two tab characters, then the first captured part ({,), then a space and two 0's, then the second captured part (the two digits and the following comma). The net effect is to insert the tabs before the match and the padding inside it.
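To see the effect of that find/replace pair outside Notepad++, here is a minimal Python sketch. It assumes Python's re module, where group references in the replacement are written \g<1> and \g<2> instead of $1 and $2; the sample line is a shortened version of the one in the question.

```python
import re

line = "61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},"

# Group 1 is the "{," prefix, group 2 the two digits plus their comma.
padded = re.sub(r"(\{,)([0-9][0-9],)", r"\t\t\g<1> 00\g<2>", line)

print(repr(padded))
```

Printing with repr makes the inserted tab characters visible as \t, which is handy when checking alignment changes by eye.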

Select capitalized & all-caps words using RegEx

I'm trying to find names of people and companies (everything that is capitalized but not in the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.
This is what I've come up with so far:
[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
It has two problems:
It selects two characters too many in front of the hit.
In the sentence "Is this Beetle ugly?" it finds s Beetle which complicates the subsequent tagging.
When a capitalized word is preceded with an apostrophe or a colon, it isn't found. If possible I'd like to limit what characters are used for determining a sentence to just !?.
Here's the sample text I'm using to test it out:
John Adams is my hero. There's just no limits to his imagination! Is
this Beetle ugly? It sings at the: La Scala opera house. I have a
dream that I will find work at' Frame Store but not in the USA! This
way ILM could do whatever they pleased. ILM was very sweet. Visual
Effects did a good job... Neither did Animatronix?
I'm using jEdit http://jedit.org since I need something that works on both Windows and OS X.
Update: this now avoids matching at the start of the string.
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures it is not preceded by one of the !?. and a space OR by the start of a new row.
I tested it with jEdit.
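For readers without jEdit, a rough Python equivalent follows. Two things differ and are assumptions of this sketch: Python's built-in re module has no \p{Lu}/\p{Ll}, so plain ASCII ranges stand in for them, and its lookbehind must be fixed-width, so the single lookbehind is split into two.

```python
import re

text = ("John Adams is my hero. Is this Beetle ugly? "
        "It sings at the: La Scala opera house.")

# (?<![!?.]\s) rejects words right after sentence-ending punctuation,
# (?<!^) rejects the very first word of the text.
pattern = re.compile(r"(?<![!?.]\s)(?<!^)\b[A-Z][A-Za-z]+\b")

names = pattern.findall(text)
print(names)
```

With the ASCII ranges, accented capitals are not recognised; an engine with Unicode property support is needed for those.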
Update to cover Names consisting of multiple words
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)
I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words starting with uppercase letters. I also changed the + to a * to match the A in your example "My company's called A Few Good Men". But this change now causes the regex to match I as a name.
See tchrist's comment. Names are not a simple thing, and it gets really difficult if you want to cover the more complex cases.
This also works:
(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
But \p{P} covers all punctuation, which I understand is not what you want. Maybe you can find a property that fits your needs on regular-expressions.info/unicode.html.
Another mistake in your expression is the | in the character class. It's not needed: you are just adding this character to your class, and with it the expression will match words like U|S|A, so just remove it:
(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+

how to eliminate dots from filenames, except for the file extension

I have a bunch of files that look like this:
A.File.With.Dots.Instead.Of.Spaces.Extension
Which I want to transform via a regex into:
A File With Dots Instead Of Spaces.Extension
It has to be in one regex (because I want to use it with Total Commander's batch rename tool).
Help me, regex gurus, you're my only hope.
Edit
Several people suggested two-step solutions. Two steps really make this problem trivial, and I was really hoping to find a one-step solution that would work in TC. I did, BTW, manage to find a one-step solution that works as long as there's an even number of dots in the file name. So I'm still hoping for a silver bullet expression (or a proof/explanation of why one is strictly impossible).
It appears Total Commander's regex library does not support lookaround expressions, so you're probably going to have to replace a number of dots at a time, until there are no dots left. Replace:
([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$
with
$1 $2 $3.$4
(Repeat the sequence and the number of backreferences for more efficiency. You can go up to $9, which may or may not be enough.)
It doesn't appear there is any way to do it with a single, definitive expression in Total Commander, sorry.
Basically:
/\.(?=.*?\.)//
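will do it in pure regex terms. It replaces, with nothing, any period that is followed by a (non-greedy) run of characters and then another period. This is a positive lookahead. To turn the dots into spaces instead of removing them, substitute a space rather than the empty string.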
In PHP this is done as:
$output = preg_replace('/\.(?=.*?\.)/', '', $input);
Other languages vary but the principle is the same.
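As one example of another language, here is a small Python sketch of the same lookahead. It substitutes a space rather than the empty string, since that is what the question asks for; the file name is the one from the question.

```python
import re

name = "A.File.With.Dots.Instead.Of.Spaces.Extension"

# Match a dot only if another dot appears somewhere after it,
# which leaves the final (extension) dot alone.
renamed = re.sub(r"\.(?=.*?\.)", " ", name)

print(renamed)
```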
Here's one based on your almost-solution:
/\.([^.]*(\.[^.]+$)?)/\1/
This is, roughly, "any dot stuff, minus the dot, and maybe plus another dot stuff at the end of the line." I couldn't quite tell if you wanted the dots removed or turned to spaces - if the latter, change the substitution to " \1" (minus the quotes, of course).
[Edited to change the + to a *, as Helen's below.]
Or substitute all dots with space, then substitute [space][Extension] with .[Extension]
A.File.With.Dots.Instead.Of.Spaces.Extension
to
A File With Dots Instead Of Spaces Extension
to
A File With Dots Instead Of Spaces.Extension
Another pattern to find all dots but the last in a (windows) filename that I've found works for me in Mass File Renamer is:
(?!\.\w*$)\.
I don't know how useful that is to other users, but this page was an early search result and if that had been on here it would have saved me some time.
It excludes the result if it's followed by an uninterrupted sequence of alphanumeric characters leading to the end of the input (filename) but otherwise finds all instances of the dot character.
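The negative-lookahead pattern can be tried the same way. The sketch below uses Python's re module as a stand-in for Mass File Renamer, and the function name dots_to_spaces is made up for the example.

```python
import re

def dots_to_spaces(filename: str) -> str:
    """Replace every dot except the one that starts the extension."""
    # (?!\.\w*$) fails only at a dot followed by word characters up to
    # the end of the name, i.e. the extension dot.
    return re.sub(r"(?!\.\w*$)\.", " ", filename)

print(dots_to_spaces("A.File.With.Dots.Instead.Of.Spaces.Extension"))
```

Note that an extension containing non-word characters (a hyphen, for instance) would not be protected by \w*, so its dot would be replaced as well.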
You can do that with Lookahead. However I don't know which kind of regex support you have.
/\.(?=.*\.)//
This roughly translates to: any dot (/\./) that has something and then another dot after it. The last dot is the only one that does not comply. I leave out the "optionality" of something between dots, because the data looks like something will always be in between and the "optionality" has a performance cost.
Check:
http://www.regular-expressions.info/lookaround.html