Validating a list using Regex - regex

I have been asked to help sort out a system that validates some returned MS Excel forms.
A Python program already exists that takes a returned Excel form and another MS Excel workbook, which looks much like the form but has its fields filled with regexes.
The Python code then validates the returned form by testing each field against the corresponding regex from the regex workbook.
This all works as expected, and pumps out a workbook with a list of problems where a match doesn't occur.
However the validation wasn't always returning results as expected and I was asked to try sorting it out.
I have been reviewing the regexes against a document that describes what the valid responses in the form should be. I can cope with most of them, but a couple of them got me thinking. These are the ones where the valid entry in the form is a list of items, e.g. a list of words separated by commas or newlines.
Every time I come across one of these I have been using the following approach:
^[A-Z]{0,10}((, *|\n)[A-Z]{0,10})*$
So this will match the first uppercase word of up to 10 letters, then the rest of the list with each entry preceded by a comma (and optional spaces) or a newline. It works, but I wonder if there is a better way? The reason I am thinking this is because the matching pattern for each list entry has to be in the regex twice. So if a problem is spotted, it has to be corrected in two places.
Is there a better way?

Use
^\b((^|, *|\n)[A-Z]{1,10})+$
It makes sure the text starts with a word character (i.e. a-zA-Z0-9_) by checking for a word boundary, which prevents the text from starting with a , or an empty line.
Then, repeated at least once, each entry must begin with one of
Start of text
A comma followed by any number of spaces
A newline
followed by at least one and at most ten uppercase characters.
The first alternative can only match at the very start of the text, the second separates entries on the same line, and the third starts an entry on a new line.
See it here at regex101.
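To try the pattern outside regex101, a minimal Python sketch might look like this. The function name is_valid_list and the sample strings are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# Pattern from the answer: the entry pattern [A-Z]{1,10} appears only once.
LIST_PATTERN = re.compile(r"^\b((^|, *|\n)[A-Z]{1,10})+$")


def is_valid_list(text):
    """Return True if text is a comma- or newline-separated list of
    uppercase words, each 1 to 10 letters long."""
    return LIST_PATTERN.search(text) is not None


if __name__ == "__main__":
    for sample in ["ALPHA", "ALPHA, BETA\nGAMMA", ",ALPHA", "alpha", ""]:
        print(repr(sample), is_valid_list(sample))
```

Note that, unlike the original pattern with {0,10}, this one rejects an empty field, so an optional list would need separate handling.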

Related

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It is already case insensitive and allows any number of characters, including zero, before the first word found and after the second word found. I am unsure whether it is more correct to terminate in $.
Without the any order matching I am so far as:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle but I am unsure how it could work if I add this to both sets instead, as for instance this could cause a problem if the first word is the very first part of the source text it is parsing, and so no character before it.
I have seen examples where \b is used with lookaheads, but that also seems like it may cause a problem, as I want it to match when the first word is at the start of the source text but also when it is not.
How should I edit my RegEx here please?

REGEX to find first instance after set length

I'm probably going to get pilloried for asking this question, but after searching and trying to figure out this regex on my own, I'm just tired of wasting time on it. Here's the problem I'm trying to solve. I frequently use EditPad Pro to convert character strings so they will fit into a mainframe.
For instance, I want to convert a column of words from excel into an IN clause for sql. The column is 5000 words or so.
I can easily copy and paste that into the text editor and then using find and replace convert that from a column of words to a single row with ',' separating each word.
Once that's done, though I want to use a regex to split this row before or after a comma after 70 characters have gone by.
(?P<start>^.{0,70})
This will give me the first 70 characters, but then I get stuck as I can't figure out how to create the next group to find all the characters up to the next comma so I can refer to it like this
(?P<start>^.{0,70})(?P<next>????,)
If I could get that, then I could do a find and replace that would break the row after the first comma that appears after the 70th character.
I know given the rest of the day I could figure it out, but I need to move on. I've tried this before. I would even be willing to find only the first 70 characters and then the next few characters up to the comma, and repeat the find and replace multiple times if necessary, but I cannot get the regex to work.
Any assistance with this would be greatly appreciated.
Here is some sample data that I have added line breaks into as an example of what I want it to look like after the regex runs.
'Ability','Absence','Absolute','Absorb','Accident','Acclaim','Accompany',
'Accomplish','Achievement','Acquaintance','Acquire','Across','Acting','Address',
'Admire','Adorable','Advance','Advertisement','Afraid','Agriculture','Align',
'All','Allow','Allowance','Allowed','Alone','Aluminium','Always','America',
'Analyze','Android','Angle','Announce','Annual','Ant','Antarctica','Antler',
I think you should consider restricting your initial concatenation, but here's a solution to your specific implementation:
^.{0,70}[^,]*
This will select the first 70 characters (if available), then every character up to the one before the next comma.
I don't think you need groups here, but you can obviously add them to the regex:
(?P<start>^.{0,70})(?P<next>[^,]*)
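If a script is an option instead of repeated find and replace, the same idea can be applied in a loop. The sketch below is only an illustration in Python: the function name wrap_after_comma is hypothetical, the ^ anchor is dropped so the pattern can match repeatedly, and an optional trailing comma is added so each piece ends with the comma it stopped at.

```python
import re

# Up to 70 characters, then everything up to the next comma, then that comma.
CHUNK = re.compile(r".{0,70}[^,]*,?")


def wrap_after_comma(row):
    """Split a single-line, comma-separated row into pieces that each end
    at the first comma found after 70 characters."""
    # finditer can yield an empty match at the end of the string; skip it.
    return [m.group(0) for m in CHUNK.finditer(row) if m.group(0)]


if __name__ == "__main__":
    row = ",".join("'Word%d'" % i for i in range(40))
    print("\n".join(wrap_after_comma(row)))
```

Joining the pieces back together with no separator gives the original row, so nothing is lost by the split.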

Find/Match every similar words in word list in notepad++

I have a word list in alphabetical order.
It is arranged as a single column.
I do not use any programming languages.
The list is in Notepad format.
I need to match all similar words and put them on the same line.
I use regex but I can't get correct results.
The original list looks like this:
accept
accepted
accepts
accepting
calculate
calculated
calculates
calculating
fix
fixed
The list I want:
accept accepted accepts accepting
calculate calculated calculates calculating
fix fixed
This seems to work, but you will have to do Replace All multiple times:
Find (^(.+?)\s*?.*?)\R\2 and replace with \1\t\2. The ". matches newline" option should be disabled.
How it works:
It finds some characters at the start of line ^(.+?), then any linebreak \R, and those same characters again \2.
\s*?.*? is used to skip unnecessary characters after multiple Replace All. \s*? skips the first whitespace, and .*? any remaining chars on the line.
Match is replaced with \1\t\2, where \1 is anything matched in (^(.+?)\s*?.*?), and \2 is anything matched with (.+?). \t is used to insert tab character to replace linebreak.
How it breaks:
Note that this will not work well with different words with similar prefix, like:
hand
hands
handle
handles
This will be hand hands handle handles after 2 replaces.
I can imagine doing this programmatically with limited success (take the first word as a root and, if a derived word with this root follows, place it on the same line; otherwise take the word as a new root and put it on a new line). This will still fail for irregular words where the root is not the same for all forms.
Without programming, the only way is (manual) preprocessing: if there are fewer than four forms of a given word in the list, you insert a blank line for each missing form, so there are always four lines for each word. Then you can use a regex to put each such quadruple on one line.
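For readers who can run a script, the root-based idea described above could be sketched roughly as follows. This is Python, the function name group_by_root is made up for illustration, and "derived word" is assumed to mean simply "starts with the current root".

```python
def group_by_root(words):
    """Group each word with the preceding root if it starts with that root;
    otherwise start a new line with the word as the new root."""
    groups = []
    root = None
    for word in words:
        if root is not None and word.startswith(root):
            groups[-1].append(word)   # derived form: same line as its root
        else:
            root = word               # new root: start a new line
            groups.append([word])
    return [" ".join(group) for group in groups]


if __name__ == "__main__":
    words = ["accept", "accepted", "accepts", "accepting", "fix", "fixed"]
    print("\n".join(group_by_root(words)))
```

The limits mentioned above show up quickly: hand, hands, handle, handles still end up on one line, and calculating does not start with calculate, so it would begin a new line.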

Insertion syntax for regex in Notepad++ or Perl

Short form: searching for
"{,[0-9][0-9]," and inserting a space plus 00, I get the replaced string segment
"{,SPACE00[0-9][0-9]," or similarly garbled data in place of the found [0-9][0-9] sequence. So how do I search with a regex and insert in the middle?
Long form question:
I'm trying to do a series of simple character insertions -- digits actually -- in a series of mixed model CSV profiling data (five files each with different model parameters, several hundred lines each).
I'm visually challenged and want to insert padding characters to columnize the data, so I can focus on tweaking key values rather than keeping my place from data file to data file.
The CSV data lines have this format:
*Variable_symbolic-name*,{##,##,* ... ('Set of CSV Numerical Data lists' ...},\n*
an actual data line:
61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
to be morphed to:
61,parameter17,\t\t{, 0070,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Give or take a tab character to align all the { numeric field starts...
I've found searching: "{,[0-9][0-9]," failed but "\{,[0-9][0-9]," succeeds for the find part of the search and replace operation... but have hit a proverbial brick wall in how to do the actual replace (with an insert) of such a short length. (Obviously with so many parameters and files, I'm moving cautiously!)
However, this Perl help tutorial leaves me in the dark as to how to keep the found ranges and insert padding before them (space, zero, zero to be specific if positive, '-00' if negative). In short, I need to know how to insert 2-3 characters via the replace field in Notepad++ and retain the original data without altering it!
Articles herein have cited replacing paragraphs and lines, adding newlines, etc. but this simple insertion alteration seems too simple for you all. But it's been several hours of frustration for me!
Thanks! // Frank
Resolved:
Good news: ({,)([0-9][0-9],) and \1 xx\2 works fine as does ({,)(#[0-9][0-9],) and replacing with \1 xx#\2 ... whether or not tabs are utilized. Obviously the key was ([0-9][0-9],) which included the discrimination of the comma... though I have no idea why that seemed to fail an hour ago with trials made using Sobrinho's help. Must have not tried the sequence. Thanks all!
Try to type this in the search box:
(.+)(\{,[0-9][0-9].*)
And in the replace:
\1\t\t\2
When you put things between parentheses, they are "stored" by Notepad++ and can be reused in the replace box.
The groups are numbered from one in the order of their opening parentheses and are accessed as \1, \2, ...
You tagged it as Perl, so here is how you do it in Perl ...
I prefer to use lookahead assertions rather than backreferences
s/(?= {,[0-9][0-9], ) /\t\t/x
Alternatively, $& contains the matched string ($0 is something different)
s/ {,[0-9][0-9], /\t\t$&/x
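The lookahead idea is not specific to Perl. As a rough illustration, the same zero-width insertion could be written in Python like this; the function name insert_tabs is hypothetical, and the brace is escaped as in the Notepad++ search that worked.

```python
import re


def insert_tabs(line):
    """Insert two tabs in front of '{,NN,' without consuming the match."""
    # The lookahead matches an empty string at the position just before
    # '{,NN,', so the replacement is inserted and nothing is overwritten.
    return re.sub(r"(?=\{,[0-9][0-9],)", "\t\t", line)


if __name__ == "__main__":
    print(repr(insert_tabs("61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},")))
```

Because nothing is captured or consumed, there is no backreference to get wrong in the replacement.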
You will need a backreference here, meaning something which, in the replace part, will be equal to what you have matched.
Usually, the whole matched part is stored in the $0 backreference. (With one capture group you also get $1, with two capture groups $2, and so on.)
Back to your question, you could try this:
Find:
(\{,)([0-9][0-9],)
Replace by:
\t\t$1 00$2
This will insert two tab characters before the part that matched \{,[0-9][0-9], (or in other words, replace the part that matched by 2 tab characters and what you matched), then put the first captured part ({,) and then the space and double 0's and then the second captured part, the two digits and following comma.
regex101 demo
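To check the two-group find and replace on the sample line from the question, a small Python sketch can help. The function name pad_first_field is made up, and Python is used here only for illustration; the pattern is the one from the Find box above.

```python
import re

# Group 1 is the '{,' prefix, group 2 the two digits and the comma after them.
PATTERN = re.compile(r"(\{,)([0-9][0-9],)")


def pad_first_field(line):
    """Put two tabs before '{,' and ' 00' in front of the two-digit value."""
    return PATTERN.sub(
        lambda m: "\t\t" + m.group(1) + " 00" + m.group(2),
        line,
        count=1,  # only the first field after the opening brace
    )


if __name__ == "__main__":
    sample = "61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},"
    print(repr(pad_first_field(sample)))
```

A line whose first field has only one digit, or a negative value, is left unchanged by this pattern and would need its own rule.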

Regex: How to dynamically get words after the first word and not the last word in a '_' separated string?

Working on a migrations class in php.
If I have a string like this:
create_users_roles_table
and I want to get the words between the first and the last word correctly, and also get the word correctly if there's only one word in between, like:
create_users_table
How do I go about that?
I've done:
(\B)_([a-zA-Z]+)_?([a-zA-Z]+)_table
and that works fine when I do create_users_roles_table
and produces users and roles.
But when only doing create_users_table it produces user and s.
Obviously I need it to produce only users.
Anyone?
I think it should read
(\B)_([a-zA-Z]+)_?([a-zA-Z]+)?_table
But this won't work if there are three words in between. I'd suggest stripping the first and last words and then splitting the rest separately, since I don't think regular expressions can handle a variable number of capture groups.
If you can be sure of how many words there can be, you can always hard code this. For three or fewer words you can use
(\B)_([a-zA-Z]+)(?:_([a-zA-Z]+))?(?:_([a-zA-Z]+))?_table
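The strip-then-split suggestion above could be sketched as follows. The example is in Python rather than PHP purely for illustration, the function name middle_words is hypothetical, and it assumes the name always has the form <verb>_<words>_table.

```python
import re


def middle_words(name):
    """Return the words between the first word and the trailing '_table'."""
    # Capture everything between the first word and '_table' in one group,
    # then split that group on underscores instead of using one group per word.
    match = re.fullmatch(r"[a-zA-Z]+_([a-zA-Z_]+)_table", name)
    if match is None:
        return []
    return match.group(1).split("_")


if __name__ == "__main__":
    print(middle_words("create_users_roles_table"))
    print(middle_words("create_users_table"))
```

This way the number of words in between does not have to be known in advance.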