Regex parsing of digits and alphabets of variable lengths and frequency - regex

I am still new to regex and find it hard to grasp all at once. Hence, I am reaching out to understand how I can grab the first group of digits or letters in the following examples:
01_crop_and_animal
02_03_forestry_fishing
05_09_13_15_19_23_31_39_other_location
68201_68202_operation_of_dwellings
a_agriculture_forestry_and_hunting_01_03
b_f_secondary_production_05_43
Digits seem to appear multiple times, and each group can be 2 to 5 digits long. Letters occur once or twice. I would essentially like to see the output as:
01
0203
0509131519233139
6820168202
a
bf
Thanks for your help!
Rob

It can be done in 2 steps.
1st step, match the digits/letters:
^([a-z](?:_[a-z])?|\d{2,5}(?:_\d{2,5})*)(?![a-z\d])
Demo & explanation
2nd step, remove the underscores.
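For reference, here is a minimal Python sketch of those two steps. The function name leading_code is made up for illustration, and the question did not say which language is being used, so treat Python as an assumption.

```python
import re

# Step 1 pattern from the answer above: either one or two single letters,
# or one or more groups of 2-5 digits, each separated by an underscore.
PATTERN = re.compile(r"^([a-z](?:_[a-z])?|\d{2,5}(?:_\d{2,5})*)(?![a-z\d])")

def leading_code(name):
    """Return the leading digit/letter groups with underscores removed."""
    match = PATTERN.match(name)
    if match is None:
        return None
    # Step 2: remove the underscores from the matched prefix.
    return match.group(1).replace("_", "")

for name in ["01_crop_and_animal", "b_f_secondary_production_05_43"]:
    print(name, "->", leading_code(name))
```

The negative lookahead is what stops "a_agriculture..." from being read as the two letters "a" and "a".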

You will have to do it in two steps. First, select:
^([0-9_]+|[a-z](_[a-z])?_)
Then, delete all the _ from your resulting strings.
See https://regex101.com/r/jY9Y2I/1
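A rough Python equivalent of this select-then-delete approach, assuming the names arrive one per string (the helper name select_and_strip is hypothetical):

```python
import re

# Selection pattern from this answer; note that it also picks up the
# trailing underscore, which the second step deletes anyway.
SELECT = re.compile(r"^([0-9_]+|[a-z](_[a-z])?_)")

def select_and_strip(name):
    """Select the leading part, then delete every underscore from it."""
    match = SELECT.match(name)
    if match is None:
        return None
    return match.group(1).replace("_", "")

print(select_and_strip("68201_68202_operation_of_dwellings"))
```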

Related

Regular expression string division, prioritize the part lengths

I have this string
0Sc-a+nn1.ed_AI&AO1301#89
That has to be split in three parts
0Sc-a+nn1.ed_AI&AO
1301
89
I am using this RE (?P<prefix>[a-z\.\_\-\+(\&)]+\W?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?)) in Python, but for now I am testing it in https://regex101.com/.
I am having a problem identifying the first part. If I try "Sc-a+nn.ed_AI&AO1301#89" it works fine, but once I add numbers to the first part, as in the example, it doesn't.
How can I prioritize the second and third parts so that they take the maximum length allowed around the #, while the first part allows numbers at the beginning and in the middle (never at the end, because those belong to part two)? The ? is there because the preceding element is sometimes absent.
Use [a-zA-Z]{2} to capture the two letters after the &, and specify the length of each part where it is known, e.g. [\d]{4}:
(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))
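As a quick check, the pattern above can be run in Python roughly like this. The wrapper split_reference is a hypothetical name, and using fullmatch (so the whole string must be consumed) is my assumption, not something stated in the answer.

```python
import re

# The pattern from the answer, split over two lines for readability only.
PATTERN = re.compile(
    r"(?P<prefix>[A-Za-z0-9._\-+&;]+[a-zA-Z]{2}?)"
    r"(?P<num>((?P<ref_num>\d+)(#(?P<subpart_num>\d+))?))"
)

def split_reference(text):
    """Return (prefix, ref_num, subpart_num); subpart_num may be None."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    return match.group("prefix"), match.group("ref_num"), match.group("subpart_num")

print(split_reference("0Sc-a+nn1.ed_AI&AO1301#89"))
```

The greedy prefix class first swallows the digits too, then backtracks until two letters followed by digits can be matched, which is why the prefix ends at "AO" in this example. This relies on the prefix always ending in two letters.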

How do I find strings that only differ by their diacritics?

I'm comparing three lexical resources. I use entries from one of them to create queries — see first column — and see if the other two lexicons return the right answers. All wrong answers are written to a text file. Here's a sample out of 3000 lines:
réincarcérer<IND><FUT><REL><SG><1> réincarcèrerais réincarcérerais réincarcérerais
réinsérer<IND><FUT><ABS><PL><1> réinsèrerons réinsérerons réinsérerons
macérer<IND><FUT><ABS><PL><3> macèreront macéreront macéreront
répéter<IND><FUT><ABS><PL><1> répèterons répéterons répéterons
The first column is the query, the second is the reference. The third and fourth columns are the results returned by the lexicons. The values are tab-separated.
I'm trying to identify answers that only differ from the reference by their diacritics. That is, répèterons répéterons should match because the only difference between the two is that the second part has an acute accent on the e rather than a grave accent.
I'd like to match the entire line. I'd be grateful for a regex that would also identify answers that differ by their gemination — the following two lines should match because martellerait has two ls while martèlerait only has one.
modeler<IND><FUT><ABS><SG><2> modelleras modèleras modèleras
marteler<IND><FUT><REL><SG><3> martellerait martèlerait martèlerait
The last two values will always be identical. You can focus on values #2 and 3.
The first part can be achieved by doing a lossy conversion to ASCII and then doing a direct string comparison. Note, converting to ASCII effectively removes the diacritics.
The second part is not possible (as far as I know) with a regex pattern alone. You will need to look into things like the Levenshtein distance.
EDIT:
This regex will match duplicate consonants. It might be helpful for your gemination problem.
([b-df-hj-np-tv-xz])\\1+
Which means:
([b-df-hj-np-tv-xz]) # Match only consonants
\\1+ # Match one or more times again what was captured in the first capture group
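Putting both ideas together, here is a minimal Python sketch. It swaps the lossy ASCII conversion for Unicode NFD decomposition followed by dropping the combining marks, which has the same effect on these French accents. Collapsing doubled consonants before comparing is my own simplification of the gemination check, and all function names are hypothetical.

```python
import re
import unicodedata

def strip_diacritics(word):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collapse_geminates(word):
    """Reduce any run of the same consonant to a single consonant."""
    return re.sub(r"([b-df-hj-np-tv-xz])\1+", r"\1", word)

def normalise(word):
    return collapse_geminates(strip_diacritics(word))

def is_spelling_variant(line):
    """True if columns 2 and 3 of a tab-separated line match once
    diacritics and doubled consonants are ignored."""
    fields = line.rstrip("\n").split("\t")
    reference, answer = fields[1], fields[2]
    return normalise(reference) == normalise(answer)

print(is_spelling_variant("marteler<IND>\tmartellerait\tmartèlerait\tmartèlerait"))
```

Note that the single backslash in \1 is what Python's raw strings expect; the doubled form shown above is how the same pattern is written inside an ordinary string literal.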

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile(r'(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There are only 2 possible digits at each position (since you show 2 cases).
So e.g. [01] (either 0 or 1) for the first position, [94] for the next, and so on would be a valid solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second dot were an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters:
[^a-zA-Z0-9']+
That would get you, in this case, a few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if it is never necessary, and continue processing the first sequences.
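A small Python sketch of that idea, under the assumption that the discardable part is always the last field. The helper names are made up, and drop_last_field is a variation that trims the final separator and field in place, so the remaining separators are kept as they were.

```python
import re

# Same character class as above: anything that is not a letter, digit
# or apostrophe counts as a separator.
SEPARATORS = re.compile(r"[^a-zA-Z0-9']+")

def split_fields(name):
    """Split a name into its alphanumeric fields."""
    return [part for part in SEPARATORS.split(name) if part]

def drop_last_field(name):
    """Remove the final separator and the field that follows it."""
    return re.sub(r"[^a-zA-Z0-9']+[a-zA-Z0-9']+$", "", name)

print(split_fields("HAT18178_890909.098070313.1"))
print(drop_last_field("HAT18178_890909.098070313.1"))
```

This does not depend on the separator being a dot or on the suffix being a single digit, but it still cannot tell whether the last field is actually discardable.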

Regular Expression to Clean a numbered list

I've only just started playing with Regex and seem to be a little stuck! I have written a bulk find and replace using multiline in TextSoap. It is for cleaning up recipes that I have OCR'd, and because there are both Ingredients and Directions I cannot simply change a "1 " to "1. ", as this could rewrite "1 Tbsp" as "1. Tbsp".
I therefore check whether the following two lines (possibly with extra blank rows in between) start with the next sequential numbers, using this as the find:
^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))
and the following as the replace for each of the above:
$1. $2 $3 $4$5
My problem is that although it works as I wanted it to, it will never perform the task for the last three numbers...
An example of the text I want to clean up:
1 This is the first step in the list
2 Second lot if instructions to run through
3 Doing more of the recipe instruction
4 Half way through cooking up a storm
5 almost finished the recipe
6 Serve and eat
And what I want it to look like:
1. This is the first step in the list
2. Second lot if instructions to run through
3. Doing more of the recipe instruction
4. Half way through cooking up a storm
5. almost finished the recipe
6. Serve and eat
Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?
dan1111 is right. You may run into trouble with similar looking data. But given the sample you provided, this should work:
^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search
$1. $2\r\n\r\n // replace
If you're not using Windows, remove the \rs from the replace string.
Explanation:
^ // beginning of the line
(\d+) // capture group 1. one or more digits
\s+ // any spaces after the digit. don't capture
([^\r\n]+) // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture
Replace:
$1. // group 1 (the digit), then period and a space
$2 // group 2
\r\n\r\n // two EOLs, to create a blank line
// (remove both \r for Linux)
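For anyone doing this outside TextSoap, here is a rough Python version of the same search and replace. It is a simplified variant rather than an exact port: it uses [ \t]+ instead of \s+ so the match cannot run over a line break, and it leaves the existing line endings alone instead of inserting blank lines. The function name number_steps is hypothetical.

```python
import re

# A line that starts with digits, then spaces or tabs, then the step text.
STEP = re.compile(r"^(\d+)[ \t]+([^\r\n]+)", re.MULTILINE)

def number_steps(text):
    """Turn '1 Some step' into '1. Some step' on every matching line."""
    return STEP.sub(r"\1. \2", text)

sample = "1 This is the first step in the list\n6 Serve and eat\n"
print(number_steps(sample))
```

Like the original pattern, this will also rewrite an ingredient line such as "1 Tbsp salt".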
What about this?
1 Tbsp salt
2 Tsp sugar
3 Eggs
You have run into a major limitation of regexes: they don't work well when your data can't be strictly defined. You may intuitively know what are ingredients and what are steps, but it isn't easy to go from that to a reliable set of rules for an algorithm.
I suggest you instead think about an approach that is based on position within the file. A given cookbook usually formats all the recipes the same: such as, the ingredients come first, followed by the list of steps. This would probably be an easier way to tell the difference.
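A position-based approach could look something like the sketch below. It assumes, purely for illustration, that each recipe has a line reading exactly "Directions" before the steps; that heading, and the function name, are hypothetical and would need adapting to the actual OCR output.

```python
import re

# A whole line made of digits, spaces or tabs, and then the step text.
STEP = re.compile(r"^(\d+)[ \t]+(.+)$")

def number_steps_after_heading(lines, heading="Directions"):
    """Add the period only to numbered lines that come after the heading."""
    in_steps = False
    result = []
    for line in lines:
        if line.strip() == heading:
            in_steps = True
            result.append(line)
        elif in_steps:
            result.append(STEP.sub(r"\1. \2", line))
        else:
            # Still in the ingredients: leave '1 Tbsp salt' untouched.
            result.append(line)
    return result

recipe = ["Ingredients", "1 Tbsp salt", "Directions", "1 Mix well"]
print(number_steps_after_heading(recipe))
```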

Separating out a list with regex?

I have a CSV file which has been generated by a system. The problem is with one of the fields which used to be a list of items. An example of the original list is below....
The serial number of the desk is 45TYTU
This is the second item in the list
The colour of the apple is green
The ID code is 489RUI
This is the fourth item in the list.
And unfortunately the system spits out the code below.....
The serial number of the desk is 45TYTUThis is the second item in the listThe colour of the apple is greenThe ID code is 489RUIThis is the fourth item in the list.
As you can see, it ignores the line breaks and just bunches everything up. I am unable to modify the system that generates this output so what I am trying to do is come up with some sort of regex find and replace expression that will separate them out.
My original thought was to try and detect when an upper case letter is in the middle of a lower case word, but as in one of the items in the example, when a serial number is used it throws this out.
Anyone any suggestions? Is regex the way to go?
--- EDIT ---
I think I need to simplify things for myself and, for now, ignore the fact that lines ending in a serial number will break things. I just need to create an expression that will insert a line break if it detects an upper case letter being used after a lower case one.
--- EDIT 2 ---
Using the example given by fardjad, everything works for the sample data given. The string was...
(.(?=[A-Z][a-z]))
Now as I test with more data I can see an issue appearing: certain lines begin with numbers, so it treats these as serial numbers. You can see an example of this at http://regexr.com?2vfi5
There are only about 10 known numbers it uses at the start of the lines such as 240v, 120v etc...
Is there a way to exclude these?
This won't be a robust solution, but it is what you asked for. It matches the character before an uppercase letter that is followed by a lowercase one. You can simply use a regex replace and append a newline character:
(.(?=[A-Z][a-z]))
see this demo.
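In Python, for example, that replace might be written as follows (the function name is made up; the question does not say which tool is used for the find and replace):

```python
import re

def break_before_capitalised_word(text):
    """Insert a newline after any character that is directly followed by
    an uppercase letter and then a lowercase letter."""
    return re.sub(r"(.(?=[A-Z][a-z]))", r"\1\n", text)

bunched = (
    "The serial number of the desk is 45TYTUThis is the second item in the list"
    "The colour of the apple is greenThe ID code is 489RUI"
    "This is the fourth item in the list."
)
print(break_before_capitalised_word(bunched))
```

On the sample from the question this splits after the serial numbers as well, because "This" starts with an uppercase letter followed by a lowercase one.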
You could search for this
(?<=\p{Ll})(?=\p{Lu})
and replace with a linebreak. The regex matches the empty space between a lowercase letter \p{Ll} and an uppercase letter \p{Lu}.
This assumes you're using a Unicode-aware regex engine (.NET, PCRE, Perl for example). If not, you might also get away with
(?<=[a-z])(?=[A-Z])
but this of course only detects lower-/uppercase changes in ASCII words.
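As a sketch, here is the ASCII version in Python. The standard re module does not support \p{Ll} and \p{Lu}; the third-party regex module does, but the example below sticks to the standard library. The function name is hypothetical.

```python
import re

def break_at_case_change(text):
    """Insert a newline at every position between an ASCII lowercase
    letter and an ASCII uppercase letter."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "\n", text)

bunched = (
    "The serial number of the desk is 45TYTUThis is the second item in the list"
    "The colour of the apple is greenThe ID code is 489RUI"
    "This is the fourth item in the list."
)
print(break_at_case_change(bunched))
```

Because the lookbehind requires a lowercase letter, items that end in a serial number such as 45TYTU are not split from the item that follows.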