REGEX to find first instance after set length - regex

I'm probably going to get pilloried for asking this question, but after searching and trying to work out this regex on my own, I'm just tired of wasting time on it. Here's the problem I'm trying to solve. I frequently use EditPad Pro to convert character strings so they will fit into a mainframe.
For instance, I want to convert a column of words from Excel into an IN clause for SQL. The column is 5000 words or so.
I can easily copy and paste that into the text editor and then, using find and replace, convert it from a column of words to a single row with ',' separating each word.
Once that's done, though, I want to use a regex to split this row before or after a comma once 70 characters have gone by.
(?P<start>^.{0,70})
This will give me the first 70 characters, but then I get stuck as I can't figure out how to create the next group to find all the characters up to the next comma so I can refer to it like this
(?P<start>^.{0,70})(?P<next>????,)
If I could get that, then I could do a find and replace that would break the row after the first comma that appears after the 70th character.
I know that given the rest of the day I could figure it out, but I need to move on. I've tried this before. I would even be willing to find only the first 70 characters plus the next few characters up to the comma, and then repeat the find and replace multiple times if necessary, but I cannot get the regex to work.
Any assistance with this would be greatly appreciated.
Here is some sample data that I have added line breaks into as an example of what I want it to look like after the regex runs.
'Ability','Absence','Absolute','Absorb','Accident','Acclaim','Accompany',
'Accomplish','Achievement','Acquaintance','Acquire','Across','Acting','Address',
'Admire','Adorable','Advance','Advertisement','Afraid','Agriculture','Align',
'All','Allow','Allowance','Allowed','Alone','Aluminium','Always','America',
'Analyze','Android','Angle','Announce','Annual','Ant','Antarctica','Antler',

I think you should consider restricting your initial concatenation, but here's a solution to your specific implementation:
^.{0,70}[^,]*
This will select the first 70 characters (if available), then every character up to the one before the next comma. Append a , to the pattern if the match should include that comma.
I don't think you need groups here, but you can obviously add them to the regex:
(?P<start>^.{0,70})(?P<next>[^,]*)
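For anyone scripting this instead of using the editor's find and replace, the same pattern can be tried in Python. This is only a sketch: the function name wrap_after_comma and its width parameter are made up for illustration, an optional trailing comma is added to the pattern so each chunk ends after the comma, and it assumes that no word contains a comma itself.

```python
import re

def wrap_after_comma(row, width=70):
    """Split one long comma-separated row into chunks.

    Each chunk takes up to `width` characters, then runs on to the next
    comma (inclusive), mirroring ^.{0,70}[^,]* plus an optional comma.
    """
    pattern = re.compile(r".{0,%d}[^,]*,?" % width)
    # findall also returns an empty match at the end of the string; drop it.
    return [chunk for chunk in pattern.findall(row) if chunk]

# A few words from the sample data in the question.
words = ["Ability", "Absence", "Absolute", "Absorb", "Accident",
         "Acclaim", "Accompany", "Accomplish", "Achievement"]
row = ",".join("'%s'" % w for w in words)
print("\n".join(wrap_after_comma(row)))
```

Joining the chunks back together gives the original row, so nothing is lost by the split.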

Related

Find all semicolons surrounded by numbers in MS Office WORD

In a Word document, I'd like to find all semicolons that are surrounded by numbers. So I need the Find/Replace dialog box to find and select ; in 123;4 and 1;234 (but not in example;example or example; example).
Please note that in the example above, I just need the semicolon selected for formatting.
I know the basics of RegEx, but Word's so-called wildcards are different. All I could do so far was find the whole string (e.g. 123;4) using ([0-9]{1,};[0-9]{1,}), but like I said, I only need the ; so that I can change its font size, color, etc.
Please help. I don't like to spend the whole day on a stupid document.
You can work around Microsoft Word's limitations by inserting a custom sequence of characters to mark the instances of ; that you want to format, and then deleting the custom sequence of characters. In this example I will use the character sequence zzzzz as this likely doesn't appear anywhere else in your document.
The Process
First do a find-and-replace:
Find what: ([0-9]{1,});([0-9]{1,})
Replace with: \1zzzzz;\2
Now you can select all instances of zzzzz; and apply the desired format.
Lastly, do another find-and-replace:
Find what: zzzzz
Replace with: nothing
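The effect of the two passes on the text can be checked outside Word. The sketch below uses Python's re module rather than Word's wildcard syntax, so it only illustrates the marker idea and is not something Word itself runs; the names mark_semicolons and remove_markers are hypothetical.

```python
import re

MARKER = "zzzzz"  # assumed not to occur anywhere else in the text

def mark_semicolons(text):
    # Same idea as Find: ([0-9]{1,});([0-9]{1,})  Replace: \1zzzzz;\2
    return re.sub(r"([0-9]+);([0-9]+)", r"\g<1>" + MARKER + r";\g<2>", text)

def remove_markers(text):
    # Second pass: delete the marker again once formatting is applied.
    return text.replace(MARKER, "")

sample = "123;4 and 1;234 but example;example or example; example"
marked = mark_semicolons(sample)
print(marked)
print(remove_markers(marked) == sample)
```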

Validating a list using Regex

I have been asked to help sort out a system that validates some returned MS Excel forms.
A Python program already exists that takes a returned Excel form and another MS Excel workbook which looks much like the form, but whose fields are filled with regexes.
The Python code then validates the form by testing the returned form, against the Regexes in turn from the Regex form.
This all works as expected, and pumps out a workbook with a list of problems where a match doesn't occur.
However the validation wasn't always returning results as expected and I was asked to try sorting it out.
I have been reviewing the regexes against a document that describes what the valid responses in the form should be. I can cope with most of them, but a couple of them got me thinking. These are the ones where the valid entry in the form is a list of items, e.g. a list of words separated by commas or newlines.
Every time I come across one of these I have been using the following approach:
^[A-Z]{0,10}((, *|\n)[A-Z]{0,10})*$
So this will match the first uppercase word of up to 10 letters, then the rest of the list with each entry preceded by a , or a newline. It works, but I wonder if there is a better way? The reason I ask is that the matching pattern for each list entry has to be in the regex twice, so if a problem is spotted, it has to be corrected in two places.
Is there a better way?
Use
^\b((^|, *|\n)[A-Z]{1,10})+$
It makes sure the text starts with a word character (i.e. a-zA-Z0-9_) by checking for a word boundary. This prevents the text from starting with , or an empty line.
Then, repeated at least once, each entry must begin with one of
Start of text
A comma followed by any number of spaces
A newline
followed by at least one and at most ten uppercase characters.
The first alternative is only possible at the very start of the text, the second for later words on the same line, and the third for the first word on a new line.
See it here at regex101.
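Since the validator is a Python program, the item pattern can also be kept in one place in code and wrapped into the list regex when needed. This is a rough sketch rather than the existing program's code: build_list_regex, UPPER_WORDS and is_valid are hypothetical names, and it assumes every item starts with a word character so that the \b check still applies.

```python
import re

def build_list_regex(item):
    # The item pattern is written once; the alternation supplies what may
    # precede it (start of text, comma plus spaces, or a newline).
    return re.compile(r"^\b(?:(?:^|, *|\n)(?:%s))+$" % item)

UPPER_WORDS = build_list_regex(r"[A-Z]{1,10}")

def is_valid(value):
    return UPPER_WORDS.match(value) is not None

print(is_valid("ABC, DEF\nGHI"))
print(is_valid(", ABC"))
```

If the rule for a single entry changes, only the argument passed to build_list_regex needs to be edited.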

Regular expression for rest of line after first x characters

I have a bunch of lines with IDs as the first six characters, and data I don't need after. Is there a way to identify everything after the ID section so Find and Replace can replace it with whitespace?
/.{6}\K.*//
If you want something more specific, please be more specific in your question.
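Not every engine supports \K. Python's built-in re module does not, for example, so a rough equivalent there captures the ID and puts it back. The function name keep_id and the id_length parameter are assumptions made for this sketch.

```python
import re

def keep_id(text, id_length=6):
    # Capture the first id_length characters of each line and drop the rest.
    # Lines shorter than id_length do not match and are left unchanged.
    pattern = re.compile(r"^(.{%d}).*$" % id_length, re.MULTILINE)
    return pattern.sub(r"\1", text)

print(keep_id("ABC123 some data\nDEF456 more data"))
```

Changing the replacement from r"\1" to something like r"\1 " would leave a space after each ID instead of nothing.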

Regex: How to dynamically get words after the first word and not the last word in a '_' separated string?

Working on a migrations class in php.
If I have a string like this:
create_users_roles_table
and I want to get the words between the first and the last word correctly, and also get the right word when there's only one word in between, like:
create_users_table
How do I go about that?
I've done:
(\B)_([a-zA-Z]+)_?([a-zA-Z]+)_table
and that works fine when I do create_users_roles_table
and produces users and roles.
But when only doing create_users_table it produces user and s.
Obviously I need it to produce only users.
Anyone?
I think it should read
(\B)_([a-zA-Z]+)_?([a-zA-Z]+)?_table
But this won't work if there are three words in between. I'd suggest stripping the words and then splitting them separately, since I don't think regular expressions can handle variable number of capture groups.
If you can be sure of how many words there can be, you can always hard-code this. For three or fewer words you can use
(\B)_([a-zA-Z]+)(?:_([a-zA-Z]+))?(?:_([a-zA-Z]+))?_table
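The strip-then-split suggestion handles any number of words. It is sketched here in Python purely for illustration (the question is about PHP, where the same two steps apply); middle_words is a hypothetical helper, and it assumes the name always has the form <verb>_<words>_table.

```python
import re

def middle_words(name):
    # Step 1: strip the leading verb and the trailing "_table".
    match = re.fullmatch(r"[a-zA-Z]+_(.+)_table", name)
    if match is None:
        return []
    # Step 2: split what is left on underscores.
    return match.group(1).split("_")

print(middle_words("create_users_roles_table"))
print(middle_words("create_users_table"))
```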

How to Use Regex to Ensure Complete Words While Adding a Character Limit to Yahoo Pipes?

I'm pretty new to this, so excuse me if my question isn't that clear. I'm pulling an RSS Feed into Yahoo Pipes and using Regex to modify it. Here's what I'm trying to do:
Limit the number of characters in an entry, but...
Make sure the item includes complete words, and...
If the item is shortened, add an ellipsis, but...
If it falls within the limits nothing should be done to it
So, if a feed's Title is: "This article is important" and the limit is 20 characters, the result should be "This article is..." But if the Title is "Good Article," nothing should happen to it.
After doing some research I think that I want to combine an if/then statement with lookahead, i.e. go to the character limit and, if the character following it is a space, add an ellipsis; if it is a number or letter, go back to the final space within the limit and add an ellipsis; but if there isn't any character following it, don't do anything. Does this make sense? Is there an easier way to do what I'm going for?
I would really appreciate any help you could provide. Thanks!
Try replacing the title using the following pattern:
^(?=.{23})(.{0,20})(?=\s).*$
With the string
$1...
Working example: http://pipes.yahoo.com/pipes/pipe.info?_id=04158a7a5ea390b1b0b78ebccadcec79
How does it work?
(?=.{23}) - First, we check the length is at least 23 (that's for 20 + '...', you can play with that)
(.{0,20}) - Match at most 20 characters on the first group.
(?=\s) - Make sure there's a space after the last character. If not, it will try to match fewer characters.
.* - Match all the way to the end, so the rest of the line is removed.
An edge case here is a single word longer than 20 characters. If that's a problem, you can solve it by using:
^(?=.{23})(.{0,20}(?=\s)|\S{20}).*$
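To try the pattern outside Yahoo Pipes, here is a small Python sketch using the edge-case version of the regex. The names TRUNCATE and shorten are made up for this example, and Python writes the replacement as \1... where Pipes uses $1...

```python
import re

# At least 23 characters overall; keep up to 20 that end before whitespace,
# or the first 20 characters of a single long word.
TRUNCATE = re.compile(r"^(?=.{23})(.{0,20}(?=\s)|\S{20}).*$")

def shorten(title):
    # Titles shorter than 23 characters do not match and come back unchanged.
    return TRUNCATE.sub(r"\1...", title)

print(shorten("This article is important"))
print(shorten("Good Article"))
```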