Regular Expression to Clean a numbered list - regex

I've only just started playing with Regex and seem to be a little stuck! I have written a bulk find and replace using multiline in TextSoap. It is for cleaning up recipes that I have OCR'd and because there is Ingredients and Directions I cannot change a "1 " to become "1. " as this could rewrite "1 Tbsp" as "1. Tbsp".
I therefore did a check to see if the following two lines (possibly with extra rows) was the next sequential numbers using this code as the find:
^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))
and the following as the replace for each of the above:
$1. $2 $3 $4$5
My Problem is that although it works as I wanted it to, it will never perform the task for the last three numbers...
An example of the text I want to clean up:
1 This is the first step in the list
2 Second lot if instructions to run through
3 Doing more of the recipe instruction
4 Half way through cooking up a storm
5 almost finished the recipe
6 Serve and eat
And what I want it to look like:
1. This is the first step in the list
2. Second lot if instructions to run through
3. Doing more of the recipe instruction
4. Half way through cooking up a storm
5. almost finished the recipe
6. Serve and eat
Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?

dan1111 is right. You may run into trouble with similar looking data. But given the sample you provided, this should work:
^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search
$1. $2\r\n\r\n // replace
If you're not using Windows, remove the \rs from the replace string.
Explanation:
^ // beginning of the line
(\d+) // capture group 1. one or more digits
\s+ // any spaces after the digit. don't capture
([^\r\n]+) // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture
Replace:
$1. // group 1 (the digit), then period and a space
$2 // group 2
\r\n\r\n // two EOLs, to create a blank line
// (remove both \r for Linux)

What about this?
1 Tbsp salt
2 Tsp sugar
3 Eggs
You have run into a major limitation of regexes: they don't work well when your data can't be strictly defined. You may intuitively know what are ingredients and what are steps, but it isn't easy to go from that to a reliable set of rules for an algorithm.
I suggest you instead think about an approach that is based on position within the file. A given cookbook usually formats all the recipes the same: such as, the ingredients come first, followed by the list of steps. This would probably be an easier way to tell the difference.

Related

Regex parsing of digits and alphabets of variable lengths and frequency

I am still a newbie to regex and find it rather steep in grasping it all in one go. Hence, I reaching out to you all to understand how I can grab the first group of digits or alphabets in the following example
01_crop_and_animal
02_03_forestry_fishing
05_09_13_15_19_23_31_39_other_location
68201_68202_operation_of_dwellings
a_agriculture_forestry_and_hunting_01_03
b_f_secondary_production_05_43
Digits seems to appear multiple times, and can have length of 2 to 5. Alphabets occur once or twice. I would essentially like to see the output as:
01
0203
0509131519233139
6820168202
a
bf
Thanks for your help!
Rob
It can be done in 2 step.
1rst step, match the digits/letters:
^([a-z](?:_[a-z])?|\d{2,5}(?:_\d{2,5})*)(?![a-z\d])
Demo & explanation
2nd step, remove the underscores.
You will have to do in two steps, first select
^([0-9_]+|[a-z](_[a-z])?_)
Then, delete all the _ from your resulting strings.
See https://regex101.com/r/jY9Y2I/1

Bash: Regex for SVN Conflicts

So I'm trying to write a regex to use for a grep command on an SVN status command. I want only files with conflicts to be displayed, and if it's a tree conflict, the extra information SVN provides about it (which is on a line with a > character).
So, here's my description of how SVN outputs lines with conflicts, and then I'll show my regex:
[Single Char Code][Spaces][Letter "C"][Space]Filename
[Spaces][Letter "C"][Space]Filename
[Letter "C"][Space]Filename
This is what I have so far to try and get the proper regex. The second part, after the OR condition, works fine to get the tree conflict extra line. It's the first part, where I'm trying to get lines with the letter C under very specific conditions.
Anyway, I'm not exactly the greatest with Regex, so some help here (plus an explanation of what I'm doing wrong, so I can learn from this) would be great.
CONFLICTS=($(svn status | grep "^(.)*C\s\|>"))
Thanks.
This regex should match your lines :
CONFLICTS=$(svn status | grep '^[ADMRCXI?!~ ]\? *C')
^[ADMRCXI?!~ ]\?: lines starting with zero or one \?status character ^[ADMRCXI?!~ ]
*zero or more spaces
character C
I removed the extra parenthesis surrounding the command substitution.
You have to read description of svn st output more deeply and try to get at least one Tree Conflict.
I'll start it for you:
> The first seven columns in the output are each one character wide:
>...
> Seventh column: Whether the item is the victim of a tree conflict
>...
> 'C' tree-Conflicted
and note: theoretically any of these 7 columns can be non-empty
status for tree-conflict
M wc/bar.c
! C wc/qaz.c
> local missing, incoming edit upon update
D wc/qax.c
Dirty lazy draft of regexp
^[enumerate_all_chars_here]{6}C\s

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second point would be an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters;
[^a-zA-Z0-9']+
That would get you, in this case, few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if that's never necessary, and continue on processing the first sequences

EditPad: Need a regex that handles multiple possible data formats

First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.
I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.
Here is some sample data:
3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre
Here is how the above data needs to end up:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
So, there are numerous rules, such as:
In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
If a question mark is present, everything from the mark onwards should be removed.
If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
If the string contains 2 underscores, averything after the first numeric string up to a dash
(if present) should be removed, and the string "sm-" should be prepended.
Additionally there is other data that must be left untouched, including but not limited to:
113535|24905|24905
as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)
This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.
Jonathan, I can wrap all of those into one regex, except the last one (where you prepend sm- to a string that does not contain sm). It is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.
That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.
Here is how.
The solution below is tested in EPP.
Regex 1
Press Ctrl + Sh + F to enter Search / Replace mode
Enter the following Search and Replace in the appropriate boxes
At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1
Search:
(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
Replace:
\1\2\3\4\5
Regex 2
Same 1-2-3 steps as above.
Search
^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)
Replace
sm-\1\2
Chaining Regex 1 and Regex 2
Top menu: Macros, Record Macro, give it a name.
Click the Favorite searches pulldown, select Regex 1
Hit Replace All.
Click the Favorite searches pulldown, select Regex 2
Hit Replace All.
Macros, Stop recording.
Whenever you want to do your sequence of replacements, pull it by name under the Macros menu.
Testing This
I have tested my "Jonathan macro" on your input. Here is the result:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
Try this:
Toggle the Search Panel : SHIFT+CTRL+F
SEARCH: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
REPLACE: $1
Check REGEX and WORDS
Click Replace All or Hit CTRL+ALT+F3
Check the image below:

Separating out a list with regex?

I have a CSV file which has been generated by a system. The problem is with one of the fields which used to be a list of items. An example of the original list is below....
The serial number of the desk is 45TYTU
This is the second item in the list
The colour of the apple is green
The ID code is 489RUI
This is the fourth item in the list.
And unfortunately the system spits out the code below.....
The serial number of the desk is 45TYTUThis is the second item in the listThe colour of the apple is greenThe ID code is 489RUIThis is the fourth item in the list.
As you can see, it ignores the line breaks and just bunches everything up. I am unable to modify the system that generates this output so what I am trying to do is come up with some sort of regex find and replace expression that will separate them out.
My original though would be to try and detect when an upper case letter is in the middle of a lower case word, but as in one of the items in the example, when a serial number is used it throws this out.
Anyone any suggestions? Is regex the way to go?
--- EDIT ---
I think i need to simplify things for myself, if I ignore the fact that lines that end in a serial number will break things for now. I need to just create an expression that will insert a line break if it detects that an upper case letter is being used after a lower case one
--- EDIT 2 ---
Using the example given by fardjad everything works for the sample data given, the strong was...
(.(?=[A-Z][a-z]))
Now as I test with more data I can see an issue appearing, certain lines begin with numbers so it is seeing these as serial numbers, you can see an example of this at http://regexr.com?2vfi5
There are only about 10 known numbers it uses at the start of the lines such as 240v, 120v etc...
Is there a way to exclude these?
That won't be a robust solution but this is what you asked. It matches the character before an uppercase letter followed by a lowercase one. You can simply use regex replace and append a new line character:
(.(?=[A-Z][a-z]))
see this demo.
You could search for this
(?<=\p{Ll})(?=\p{Lu})
and replace with a linebreak. The regex matches the empty space between a lowercase letter \p{Ll} and an uppercase letter \p{Lu}.
This assumes you're using a Unicode-aware regex engine (.NET, PCRE, Perl for example). If not, you might also get away with
(?<=[a-z])(?=[A-Z])
but this of course only detects lower-/uppercase changes in ASCII words.