Regex: how to find text back to the first occurrence, searching backwards - regex

I want to write a regex that matches a paragraph containing a specified phrase (the Agency).
Currently my regex starts matching too early. I think I should use [^] somehow, but I have no idea how.
Can you help me with that?
\n(\d+.\s+.?the Agency.?)(\n\d+.\s) (https://regex101.com/r/osJPVK/1)
I want to match the text from "3." to "4." because it contains the phrase "the Agency".
where sucts.
1. The objectives of the STh other arrangements are not inconsistent or in conflict with this licence
or the STC or other relevant statutory requirements.
3. The objectives of the STC referred to in sub-paragraph 1(c) are the:
(a) efficient discharge of the obligations imposed upon transmission licensees by
transmission licences and the Act;
(b) development, maintenance and operation of an efficient, economical consistent therewith) facilitating such competition in the sion Licence: Standard Conditions – 1 April 2022
91
(g) compliance with the Electricity Regulation and any relevant legally binding
decision of the European Commission and/or the Agency.
4. The STC shall provide for:
(a) there to be referred to the Authority for determination such matters arising under
the STC as may be specified in the STC;
(b) a copy of the STC or any part(s) thereof

^\d+\.((?!^\d+\.).)*the Agency((?!^\d+\.).)*
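The ^ says that the match has to start at the beginning of a line. For this to work across lines, the multiline flag must be on so that ^ matches at every line start, and the single-line (dotall) flag so that . also matches newlines.
The tricky part is this:
((?!^\d+\.).)* : (?!...) is a negative lookahead; it asserts that the pattern ... does not match at the current position. Here each character is consumed only if a new line beginning with "\d+\." does not start at that position, so the match can never run into the next numbered paragraph.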
https://regex101.com/r/tW3SyX/1
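A minimal Python sketch of the same idea, assuming Python's re module and a shortened version of the sample text from the question (the variable names are illustrative only):

```python
import re

# Pattern from the answer. MULTILINE lets ^ match at every line start,
# DOTALL lets . cross line breaks.
PARAGRAPH_WITH_PHRASE = re.compile(
    r"^\d+\.((?!^\d+\.).)*the Agency((?!^\d+\.).)*",
    re.MULTILINE | re.DOTALL,
)

# Shortened sample text (an assumption, not the full text from the question).
text = (
    "1. The objectives are not inconsistent with this licence\n"
    "or other relevant statutory requirements.\n"
    "3. The objectives of the STC are the:\n"
    "(a) efficient discharge of the obligations;\n"
    "(g) compliance with any decision of the Commission and/or the Agency.\n"
    "4. The STC shall provide for:\n"
    "(a) there to be referred to the Authority\n"
)

match = PARAGRAPH_WITH_PHRASE.search(text)
matched = match.group(0) if match else None
print(matched)
```

Paragraph 1 is skipped because the tempered dot cannot step over the line start of "3.", so the phrase is never reached from there; the match begins at "3." and stops just before "4.".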

Related

Can you limit the words between two capturing groups in Regex

I have been trying to create a parser for law texts.
I need a way to find "external links" like: art. 45 alin. (1) din Lege nr. 54/2000
But the problem is that my country's law-writing style is so, soooo lacking in uniformity, which means the links might sometimes look like this: articolul 45 alineatul (1) din Legeea nr. 30/2000
On top of that, my language has word forms for days (articol, articolului, articolelor...).
That means I need to generalize the first part (art.) to catch as many forms as possible, and pray that the last part is a law number & year (54/2000).
Now here comes the hard part... Every section that starts with Articol N triggers the regex, and it goes on and on until it finds a law number & year, even though the two have absolutely no relation to each other.
This is how it looks: \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question: is there a way to limit the number of words between the two capturing groups?
Link to a doc showing what should pass and what should not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing
As Cary and James answered in the comments above, I used (?:\S+\s*){0,15}. I used \S instead of \w to include punctuation and thus abbreviated forms of the names of laws (e.g. Const. for Constitution). Using \w was the reason my original regex wasn't working even with {m,n}.
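The bounded gap can be sketched as follows. This is only an illustration in Python: the surrounding pattern (an "art..." word, an article number, then "nr. N/YYYY") and the names LINK and MAX_GAP are assumptions, not the asker's final regex, and the gap uses \s+ rather than \s* so that tokens are cleanly separated.

```python
import re

# Hypothetical pattern: an "art..." word, an article number, then at most
# MAX_GAP whitespace-separated tokens before "nr. N/YYYY".
MAX_GAP = 15
LINK = re.compile(
    r"\b[Aa]rt\S*\s+\d+\s+(?:\S+\s+){0,%d}?nr\.?\s*(\d+/\d{4})\b" % MAX_GAP
)

near = "conform art. 45 alin. (1) din Lege nr. 54/2000 se aplica"
# 40 filler tokens between the article and an unrelated law number.
far = "Articolul 7 " + "cuvant " * 40 + "nr. 30/2000"

m = LINK.search(near)
near_law = m.group(1) if m else None
far_match = LINK.search(far)

print(near_law)   # law number found within the allowed gap
print(far_match)  # None: the law number is too far away
```

Because \S+ also accepts punctuation, tokens such as "alin." and "(1)" count toward the gap, which is the point made above about \S versus \w.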

stop short of multiple strings and characters using '^'

I'm doing a regex operation that needs to stop short of either of two character sequences: { or \t\t{.
The first is OK, but the second cannot be achieved using the ^ symbol the way I have been using it.
My current regex is [\t+]?{\d+}[^\{]*
As you can see, I've used ^ effectively with a single character, but I cannot apply it to a string of characters like \t\t\{
How can the current regex be adapted to handle both of these possibilities?
Example text:
{1} The words of the blessing of Enoch, wherewith he blessed the elect and righteous, who will be living in the day of tribulation, when all the wicked and godless are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, which the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, [even] on Mount Sinai,
And appear from His camp
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
When I do this as a multi-line extract, the indentation is not maintained for the first line of each block. Ideally the extract should stop short of the \t\t{, allowing it to be picked up properly in the next extract and creating perfectly indented blocks. The reason is that when the blocks are taken from the database, the \t\t should be detected at the start of the first line to allow dynamic formatting.
[\t+]?{\d+}[\s\S]*?(?=\s*{|$)
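You can use this: a lazy [\s\S]*? followed by a lookahead, so the match stops before any whitespace that precedes the next { (or at the end of the input). See the demo: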
https://regex101.com/r/nNUHJ8/1
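A small Python sketch of the lazy-match-plus-lookahead approach. Two things here are assumptions rather than part of the answer: the braces are escaped, and \t* replaces [\t+]? (which is a character class matching a single tab or a literal plus sign) so that both leading tabs stay with their block.

```python
import re

# Lazy body, then a lookahead that stops before the whitespace in front
# of the next "{", or at the end of the text.
BLOCK = re.compile(r"\t*\{\d+\}[\s\S]*?(?=\s*\{|$)")

# Shortened sample (an assumption) with two-tab indented verse markers.
text = (
    "{3} Concerning the elect I said:\n"
    "The Holy Great One will come forth,\n"
    "\t\t{4} And the eternal God will tread upon the earth,\n"
    "And appear from His camp\n"
    "\t\t{5} And all shall be smitten with fear"
)

blocks = BLOCK.findall(text)
for block in blocks:
    print(repr(block))
```

Each block ends before the newline and tabs that introduce the next marker, so the following block begins with its own \t\t.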

Regex to match proper nouns + numbers

I am trying to make a regex that matches proper nouns, including numbers if there are any, e.g. Fifa 2017
I have this:
(?:\s*\b([A-Z][a-z]+)\b)+
...which gets the string without numbers.
Test at: http://regexr.com/3dmuo
I've fiddled around with so many approaches, but regex is, dare I say, slightly beyond my ability.
Thanks in advance for any advice.
This solution shows how to match something resembling a "proper noun" followed by a number. It explicitly matches one or more word-like strings, each starting with a capital letter followed by at least one letter or digit, and then an optional run of digits.
// Sample sentences: a riff on the original "Fifa 2017" example.
const data = [
  "I am reviewing Fifa 2017",
  "I am reviewing Mighty No 9",
  "I am writing about Unreal Engine",
  "Are you interested in MotoGP 2017?",
  "When does NASCAR 2017 start?",
  "Can Team Ferrari win Formula1 2017?",
  "Or will Red Bull take the Formula 1 2017 win?",
  "I plan to see F-1 2019, so I best start planning now!",
  "Have you used an Apple Mac Book Pro lately?",
  "Microsoft makes consumer operating systems"
];

// One or more capitalized words (letters or digits), then an optional number.
const properNoun = /(?:\b[A-Z][A-Za-z0-9]+\b)(?:\s*\b[A-Z][A-Za-z0-9]+\b)*(?:\s*\d+)?/g;

// Iterate over the values directly; for...in would iterate over the array's keys.
for (const sentence of data) {
  const match = sentence.match(properNoun);
  if (match) {
    console.log(sentence, " match: ", match);
  } else {
    console.log(sentence, " doesn't match!");
  }
}
The data used is taken as a riff on the original example of "Fifa 2017", and other major sporting seasons are also represented. There are a variety of requirements represented here.
One failing example is presented, "F-1 2019", since it fails to meet the original specification. Matching that case as well would not be difficult, but the specification would need to be expanded to suit.
There are also a few false matches, due to the specification. These matches are either due to matching text that looks like a "proper noun" (e.g. "When", "Or", "Have") or numbers within the "proper noun", but separated by space (e.g. "Formula 1 2017" matches "Formula 1", but not the "2017"). These may or may not be able to be handled strictly by a regex, and might even be too complex for solving in the general case.
If the input text is suitably constrained, this sort of searching can work, but there may be exceptions that occur unexpectedly.
I looked at the rules for proper nouns in Wikipedia: Letter Case to create a fairly comprehensive English-language proper-noun finder. I have no formal training in regex, so please point out any mistakes (still don't know how \b works, ha ha).
(\b(the\b\s\b)?((([A-Z]('[A-Z])?[a-z]+)-*)+\b((\s\b(of|the|de|los|e|van|der|von|zu|d|di|ibn)\b)*(\s\b([A-Z]('[A-Z])?[a-z]+)-*)+\b)*)+)+
The major issue with this parser is that it recognizes the beginning of each sentence as a proper noun.
This may be desirable to you, because the alternative is to sometimes recognize broken words. For example, if you implement my solution below, the input "Jade Smith is swell. Tim van Smythe isn't" will yield Smith and Smythe as the only proper nouns.
If your parser supports negative lookbehind, you can prepend (?<!([.!?;]['"]?\s\b)|^) to the regex string.
Some parsers (such as Python's re module) will treat the ^ (beginning of string) as non-fixed width, and reject your search. My solution to this problem was to remove it (making the prepend be (?<![.!?;]['"]?\s\b) instead), and prepend . to the input string.
This matches all(most) words that begin with a single capital letter. It does allow for complicated names, but obviously doesn't take everything into account. I was fairly rigid in only matching correctly-capitalized proper nouns, but regex has limitations and I'm not very good at it in the first place.
For example, here's a list of potential matches:
Tam O'Shanter or Tam-O-Shanter but not Tam o'Shanter or Tam-o-Shanter
Rio di Janeiro or Rio Di Janeiro (both correct as far as I know)
Ludvig van Halen
Shea D'Angelo and Shea d Angelo but not Shea d'Angelo
It does not match acronyms, such as NBA, FIFA, or NHL. Importantly, this means that it will not match Jonah J. Jamieson as a full proper noun (it will match Jonah and Jamieson as two separate nouns). It cannot handle single-letter proper nouns.
Try this:
(?:\s*\b([A-Z][a-z]+)\b)+\s?(\d+)?
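A quick way to try the last pattern is sketched below in Python. The helper name find_names is made up for illustration; the match is stripped because the leading \s* can pull in the space before the first capitalized word.

```python
import re

# Pattern from the answer above: capitalized words, then an optional number.
PROPER_NOUN_WITH_NUMBER = re.compile(r"(?:\s*\b([A-Z][a-z]+)\b)+\s?(\d+)?")

def find_names(sentence):
    """Return every match in the sentence, with surrounding whitespace removed."""
    return [m.group(0).strip() for m in PROPER_NOUN_WITH_NUMBER.finditer(sentence)]

fifa = find_names("I am reviewing Fifa 2017")
unreal = find_names("I am writing about Unreal Engine")
formula = find_names("take the Formula 1 2017 win")

print(fifa)
print(unreal)
print(formula)
```

The third input shows the same limitation discussed earlier in this thread: only one trailing number is captured, so "Formula 1 2017" comes back as "Formula 1".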

Weighted disjunction in Perl Regular Expressions?

I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.
My situation is this: I need to separate an address into its component parts based on a regular expression match on the "identifier elements" of the address. A comparable English example would be words like "state", "road", or "boulevard", IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English) we specified the identifier type after each name:
United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER
(Where the words in CAPS are what I have called "identifiers").
We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER
OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:
云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley
This is easy enough: a lazy match on the candidate identifier names, combined into a disjunctive list.
For China, the following are the "province-level" entities:
省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)
So my regex so far looks like this:
(.+?(?:(?:省)|(?:自治区)|(?:市)))
I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:
(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
So to match a province entity followed by a city entity:
(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.
The problem? If I have a pure disjunction of these elements, we have the following:
(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))
What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.
Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))
We have two input strings:
1. "crap catelephant crap city", where we wanted "Crap catelephant" and "crap city"
2. "crap catelephant city" , where we wanted "crap cat" "elephant city"
Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that have the same identifier but are not at the same level.
Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:
广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District
(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)
The match for a greedy search would be
江门市开平市
This is wrong, as the two adjacent entities should be separated into their constituent parts. One is at the level of a provincial city; the other is a county-level city.
Back to the original point, and I thank you for reading this far, is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. 村委会 instead of simple 村 for example, "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right in finding disjunctive matches. Is this a valid assumption to make? Should I put the most frequently-occurring identifiers first in the disjunctive list?
If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn't have to be Chinese. I think more generally it is a question about the mechanics of the regex disjunctive match: in what order does it prefer the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?
In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)
How alternations are handled depends on the particular regular expression engine. For almost all engines (including Perl's regular expression engine) the alternation matches eagerly: it tries the left-most choice first and only tries another alternative if the match fails with that one. For example, if you have /(cat|catelephant)/ and the input catelephant, it will match only cat, because the first alternative already succeeds. The solution is to reorder the choices so that the longest, most specific one comes first.
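The reordering can be sketched in Python, whose re module also tries alternatives left to right. The helper identifier_alternation is a hypothetical name; sorting by length is one simple way to put the most specific identifier first.

```python
import re

def identifier_alternation(identifiers):
    """Join identifiers into one alternation, longest first."""
    ordered = sorted(identifiers, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(i) for i in ordered) + ")"

street_ids = ["村", "村委", "村委会"]

# Shortest alternative listed first: the lazy match stops at 村.
naive = re.compile(r"(.+?(?:村|村委|村委会))")
# Longest alternative first: 村委会 is tried before its substrings.
ordered = re.compile(r"(.+?" + identifier_alternation(street_ids) + ")")

text = "保定村委会"
naive_match = naive.match(text).group(1)
ordered_match = ordered.match(text).group(1)

print(naive_match)
print(ordered_match)
```

This only addresses identifiers that are prefixes of one another; the lazy .+? still stops at the first position where any identifier matches, so adjacent entities such as 江门市 and 开平市 remain separate.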

What would be the best (runtime performance) application or pattern or code or library for matching string patterns

I have been trying to figure out a decent way of matching string patterns. I will try my best to provide as much information as I can regarding what I am trying to do.
The simplest way to put it is that there are some specified patterns, and we want to know which of these patterns match a given request completely or partially. The specified patterns hardly change. The number of requests is about 10K per day, but the results have to be provided ASAP, and thus runtime performance is the highest priority.
I have been thinking of using Assembly Compiled Regular Expression in C# for this, but I am not sure if I am headed in the right direction.
Scenario:
Data File:
Let's assume that data is provided as an XML request in a known schema format. It has anywhere between 5-20 rows of data. Each row has 10-30 columns. Each of the columns can only have data in a pre-defined pattern. For example:
A1 - "3 digits" followed by a "." followed by "2 digits" - [0-9]{3}.[0-9]{2}
A2 - "1 character" followed by "digits" - [A-Z][0-9]{4}
The sample would be something like:
<Data>
<R1>
<A1>123.45</A1>
<A2>A5567</A2>
<A4>456EV</A4>
<An>xxx</An>
</R1>
</Data>
Rule File:
Rule ID    A1             A2
1001       [0-9]{3}.45    A55[0-8]{2}
2002       12[0-9].55     [X-Z][0-9]{4}
3055       [0-9]{3}.45    [X-Z][0-9]{4}
Rule Location - I am planning to store the rule IDs in some sort of bit mask, so each rule ID is assigned a position in a bit string:
Rule ID    Location (from left to right)
1001       1
2002       2
3055       3
Pattern file: (This is not the final structure, but just a thought)
Column    Pattern          Rule Location
A1        [0-9]{3}.45      101
A1        12[0-9].55       010
A2        A55[0-8]{2}      100
A2        [X-Z][0-9]{4}    011
Now let's assume that SOMEHOW (I'm not sure how I am going to limit the search to save time) I run the regexes and make sure that the A1 column is only matched against A1 patterns and the A2 column against A2 patterns. I would end up with the following results for "Rule Location":
Column    Pattern          Rule Location
A1        [0-9]{3}.45      101
A2        A55[0-8]{2}      100
Doing an AND on the locations gives me location 1 - 1001 - Complete match.
Doing an XOR on the locations gives me location 3 - 3055 - Partial match. (I am purposely not doing an OR, because that would have returned 1001 and 3055 as the result, which would be wrong for a partial match.)
The final results I am looking for are:
1001 - Complete Match
3055 - Partial Match
Start Edit_1: Explanation of matching results
Complete Match - This occurs when all of the patterns in a given rule are matched.
Partial Match - This occurs when NOT all of the patterns in a given rule are matched, but at least one pattern matches.
Example Complete Match (AND):
Rule ID 1001 matched for A1 (101) and A2 (100). The first character in both 101 and 100 is "1". When you do an AND - 1 AND 1 - the result is 1. Thus position 1, i.e. 1001, is a Complete Match.
Example Partial Match (XOR):
Rule ID 3055 matched for A1 (101) only. The last characters in 101 and 100 are "1" and "0". When you do an XOR - 1 XOR 0 - the result is 1. Thus position 3, i.e. 3055, is a Partial Match.
End Edit_1
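The AND/XOR arithmetic from the edit above can be checked with a short Python sketch. The two masks are taken as given from the example; RULE_IDS and rules_for are illustrative names, and Python integers stand in for the bit strings.

```python
# Rule IDs by position, left to right, as in the "Rule Location" table.
RULE_IDS = ["1001", "2002", "3055"]

def rules_for(bits):
    """Decode a bit mask into rule IDs; the leftmost bit is position 1."""
    width = len(RULE_IDS)
    return [
        rule_id
        for i, rule_id in enumerate(RULE_IDS)
        if (bits >> (width - 1 - i)) & 1
    ]

a1_mask = 0b101  # A1 value matched [0-9]{3}.45
a2_mask = 0b100  # A2 value matched A55[0-8]{2}

complete = rules_for(a1_mask & a2_mask)  # bits set in both columns
partial = rules_for(a1_mask ^ a2_mask)   # bits set in exactly one column

print(complete)
print(partial)
```

Python integers are arbitrary precision, so the same operators work on masks that are 100K bits wide. With more than two columns, a chained XOR no longer means "some but not all"; computing (OR of all masks) AND NOT (AND of all masks) would be one way to keep that meaning.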
Input:
The data will be provided in some sort of XML request. It can be one big request with 100K Data nodes or 100K requests with one data node only.
Rules:
The matching values have to be initially saved as some sort of pattern to make them easier to write and edit. Let's assume that there are approximately 100K rules.
Output:
Need to know which rules matched completely and partially.
Preferences:
I would prefer doing as much of the coding as I can in C#. However if there is a major performance boost, I can use a different language.
The "Input" and "Output" are my requirements; how I manage to get the "Output" does not matter. It has to be fast: let's say each Data node has to be processed in approximately 1 second.
Questions:
Are there any existing patterns or frameworks to do this?
Is using Regex the right path, specifically Assembly Compiled Regex?
If I end up using Regex, how can I specify that A1 patterns only match against the A1 column?
If I do specify rule locations in a bit-type pattern, how do I process ANDs and XORs when it grows to be 100K characters long?
I am looking for any suggestions or options that I should consider.
Thanks..
A regular expression API typically only tells you when a pattern fully matched, not when it partially matched. What you therefore need is some variation on a regular expression API that lets you try to match multiple regular expressions at once, and at the end can tell you which matched fully and which matched partially. Ideally it would let you precompile a set of patterns so you can avoid compilation at runtime.
If you had that, then you could match your A1 patterns against the A1 column, your A2 patterns against the A2 column, and so on. Then do something with the lists of partial and full matches.
The bad news is that I don't know of any software out there that implements this.
The good news is that the strategy described in http://swtch.com/~rsc/regexp/regexp1.html should be able to implement this. In particular the State sets can be extended to have information about your current state in multiple patterns at the same time. This extended set of State sets will result in a more complex state diagram (because you're tracking more stuff), and a more complex return at the end (you're returning a set of State sets), but runtime won't be changed a bit, whether you're matching one pattern or 50.
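As a much simpler stand-in for the combined state-set engine described above (this is not that technique), the per-column idea can be sketched with ordinary precompiled patterns. The sketch is in Python rather than C#, uses the rule table from the question, and reads "partial" as "some but not all of a rule's columns matched"; the names RULES, COMPILED and classify are assumptions.

```python
import re

# Rule table from the question: rule ID -> {column: pattern}.
RULES = {
    "1001": {"A1": r"[0-9]{3}.45", "A2": r"A55[0-8]{2}"},
    "2002": {"A1": r"12[0-9].55", "A2": r"[X-Z][0-9]{4}"},
    "3055": {"A1": r"[0-9]{3}.45", "A2": r"[X-Z][0-9]{4}"},
}

# Compile once at startup, grouped by column; identical patterns are shared
# so each distinct pattern runs at most once per value.
COMPILED = {}
for rule_id, columns in RULES.items():
    for column, pattern in columns.items():
        entry = COMPILED.setdefault(column, {}).setdefault(
            pattern, (re.compile(pattern), [])
        )
        entry[1].append(rule_id)

def classify(row):
    """Return (complete, partial) rule IDs for one row of column values."""
    hits = {rule_id: 0 for rule_id in RULES}
    for column, value in row.items():
        # Only this column's patterns are tried against this column's value.
        for regex, rule_ids in COMPILED.get(column, {}).values():
            if regex.fullmatch(value):
                for rule_id in rule_ids:
                    hits[rule_id] += 1
    complete = [r for r, n in hits.items() if n == len(RULES[r])]
    partial = [r for r, n in hits.items() if 0 < n < len(RULES[r])]
    return complete, partial

complete, partial = classify({"A1": "123.45", "A2": "A5567", "A4": "456EV"})
print(complete, partial)
```

Grouping by column answers the "A1 patterns only against the A1 column" question directly. Unlike the state-set approach, the cost here still grows with the number of distinct patterns per column.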