I am a Python beginner. I tried to figure this out, but I failed. I need to find a keyword in a text file. If the keyword appears anywhere in the text, I need to select the sentences surrounding the keyword, including the sentence that contains it. The number of sentences is arbitrary, so it could be 5 or 10. There could be a blank line between sentences, so I need to include the blank line as well.
For example:
Let keyword be: compensation
Let the input text be:
"The costs incidental to our solicitation and obtaining of proxies, including the cost of reimbursing banks and brokers for forwarding proxy materials to their principals, will be borne by us. Proxies may be solicited, without extra compensation, by our officers and employees, both in person and by mail, telephone and other methods of communication."
The output I want for example: "The costs incidental... compensation... communication."
I tried to use this:
p = re.compile(r'[^.]compensation[^.]+.')
p.findall(text)
Using the above code, I can select only the sentence that contains the keyword. What I need is to select the sentences surrounding the keyword, and to control the number of sentences before and after the sentence containing the keyword. So, for example, if I want to select two sentences before the sentence containing the keyword, the sentence containing the keyword, and two sentences after it, what should I do?
Assuming your input is structured like this: <sentence> <period> <sentence> <period>
Then you first need to select the full sentence around each match; it may start with your keyword, end with your keyword, or both (although that is unlikely). Then you select the desired number of <sentence> <period> units before it, and likewise after it.
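import re

with open('text.txt', 'r') as f:
    s = f.read()

# Two sentences before, the sentence containing the keyword, three sentences after.
p = re.compile(r'(([^\.]*\.){2}[^\.]*compensation[^\.]*\.([^\.]*\.){3})')
for i in p.findall(s):
    # i is a tuple of groups; i[0] is the outermost group, i.e. the whole match.
    print("match='" + i[0] + "'")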
Because we are using the group metacharacters '(' and ')', findall() will return a list of tuples of the groups, which is not what we want. So we add a group around the whole regex (it is necessarily the first group, as it is the outermost one) and read the full match from i[0].
EDIT: Another possibility is to use non-capturing groups (?:...). findall() will only return the full matches with those.
Allowing the number of sentences matched before (2) and after (3) to vary is left as an exercise (this should be easy to do using the string formatting facilities of Python).
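One way to make the two counts configurable, as a minimal sketch. The function name surrounding_sentences and its parameters are made up for illustration, and it keeps the same assumption as above that every sentence ends with a period. It uses non-capturing groups so that findall() returns the full matches directly.

```python
import re

def surrounding_sentences(text, keyword, before=2, after=2):
    """Return each keyword sentence with `before`/`after` neighbouring sentences."""
    sentence = r'[^.]*\.'  # anything up to and including the next period
    pattern = re.compile(
        r'(?:%s){%d}[^.]*%s[^.]*\.(?:%s){%d}'
        % (sentence, before, re.escape(keyword), sentence, after)
    )
    return pattern.findall(text)

sample = "A. B. C. D. My compensation is your compensation. E. F. G. Hi. Ijjk."
matches = surrounding_sentences(sample, "compensation", before=2, after=3)
print(matches)
```

Like the original regex, this finds nothing if there are fewer than `before` sentences ahead of the keyword sentence or fewer than `after` sentences behind it.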
Output
match=' Holy bacon. The costs incidental to our solicitation and
obtaining of proxies, including the cost of reimbursing banks and
brokers for forwarding proxy materials to their principals, will be
borne by us. Proxies may be solicited, without extra compensation, by
our officers and employees, both in person and by mail, telephone and
other methods of communication. My. Oh My. God.'
match=' C. D. My compensation is your compensation. E. F. G.'
Text used
Golly. Jeez. Holy bacon. The costs incidental to our solicitation and
obtaining of proxies, including the cost of reimbursing banks and
brokers for forwarding proxy materials to their principals, will be
borne by us. Proxies may be solicited, without extra compensation, by
our officers and employees, both in person and by mail, telephone and
other methods of communication. My. Oh My. God. Feels. Good. To be.
King of the Jungle.
A. B. C. D. My compensation is your compensation. E. F. G. Hi. Ijjk.
Lllme.
Related
I have been trying to create a parser for Law texts.
I need to find a way to find "external links" like : art. 45 alin. (1) din Lege nr. 54/2000
But the problem is that my country's legal writing style badly lacks uniformity, which means the links sometimes look like this: articolul 45 alineatul (1) din Legeea nr. 30/2000
On top of that, my language has many inflected forms for each word (articol, articolului, articolelor, ...).
That means I need to generalize the first part (art.) to catch as many forms as possible, and pray that the last part is a law number and year (54/2000).
Now here comes the hard part: every section that starts with Articol N triggers the regex, which then runs on and on until it finds a law number and year that have absolutely no relation to that article.
This is how it looks \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question: is there a way to limit the number of words between the two capturing groups?
Link to a Google Doc showing what should match and what should not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing
As Cary and James answered in comments above, I used (?:\S+\s*){0,15}. I used \S instead of \w to include punctuation and thus, abbreviated forms of the names of the Law (e.g. Const. for Constitution). That was the reason why my original regex wasn't working even when using {m,n}.
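A small Python sketch of that idea. The surrounding pattern is a simplified, hypothetical stand-in for the original regex, and the name find_law_links is made up. It also uses \s+ instead of \s* inside the bounded group, so that each repetition consumes exactly one whitespace-delimited token.

```python
import re

def find_law_links(text, max_words=15):
    """Find 'art... N ... nr. X/YYYY' with at most max_words tokens in between."""
    pattern = re.compile(
        r'\b[Aa]rt\S*\s+\d+\s+'   # "art.", "articolul", ... followed by the article number
        r'(?:\S+\s+){0,%d}?'      # at most max_words tokens (punctuation allowed) in between
        r'nr\.?\s*\d+/\d{4}\b'    # law number and year
        % max_words
    )
    return [m.group(0) for m in pattern.finditer(text)]

print(find_law_links("art. 45 alin. (1) din Lege nr. 54/2000"))
print(find_law_links("Articolul 5 " + "cuvant " * 20 + "Legea nr. 54/2000"))
```

The second call returns an empty list because more than 15 tokens separate the article number from the law number.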
I have a plain text and need to extract company names from it. It's a huge document including company names, financial reports and lots of text. Here are examples of company names:
Big laundry, a.s.
AVERA, s.r.o.
Airoflot Airlines, a.s.
Is it even possible to make a regex like this? I'm a complete beginner with regex and have no idea how to create this one. Thanks for any help.
Example of text:
`There are many competitors of AVERA, s.r.o. the main one is Airflot Airlines, a.s. and Big laundry, s.r.o. These organisations hold main share of market.
Another companies:
a. Big Company, a.s.
b. Smaller company, s.r.o.
c. Huge company, a.s.`
As the question currently stands, no it is not possible to create a regex for company names.
It would be possible if you were able to define a PATTERN.
That means, for example, that a company name always:
starts with an uppercase letter
has a comma
after the comma there is always one of "a.s." or "s.r.o."
So, the difficulties that I see here are:
How many words before the comma belong to the name?
Is there always a comma with following abbreviation?
Names are always difficult to match because a name can be nearly everything, especially company names.
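Most of the examples you give follow this pattern: ([A-Z][A-Za-z]+ ?)+, (\w\.)+ (it requires every word of the name to start with an uppercase letter, so "Big laundry, a.s." is not matched).
The matching operation will depend on the tool you use.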
For example in JavaScript :
var line = "some name is Airoflot Airlines, a.s. in this line";
var m = line.match(/([A-Z][A-Za-z]+ ?)+, (\w\.)+/);
// match() returns null when nothing matches, so test m itself
if (m) console.log(m[0]);
This logs
"Airoflot Airlines, a.s."
But this isn't a very reliable solution: many real company names wouldn't fit and, more importantly perhaps, this would match text that isn't a company name. So it can only be used as a help in a solution that also incorporates some kind of validation (human or dictionary based).
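As a rough check of that pattern in Python, here is a sketch run on the question's sample sentence. The groups are rewritten as non-capturing so that findall() returns whole matches. Note that it misses "Big laundry, s.r.o." because the second word starts with a lowercase letter.

```python
import re

# Same pattern as above, with non-capturing groups.
company = re.compile(r'(?:[A-Z][A-Za-z]+ ?)+, (?:\w\.)+')

text = ("There are many competitors of AVERA, s.r.o. the main one is "
        "Airflot Airlines, a.s. and Big laundry, s.r.o. These organisations "
        "hold main share of market.")

found = company.findall(text)
print(found)  # ['AVERA, s.r.o.', 'Airflot Airlines, a.s.']
```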
I use this
(?:\s*[a-zA-Z0-9,_\.\077\0100\*\+\&\#\'\~\;\-\!\#\;]{2,}\s*)*
It matches a-z, A-Z, 0-9 and some special characters that QuickBooks supports:
https://community.intuit.com/articles/1146006-acceptable-characters-in-the-company-name-in-quickbooks-online
with your given examples, this regexp would match
Big laundry, a\.s\.|AVERA, s\.r\.o\.|Airoflot Airlines, a\.s\.
The trick is to use the alternation operator | on a set of known strings.
You may also wish to allow for missing punctuation and whitespace in the company names.
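If the list of names is known in advance, the alternation can be built programmatically instead of by hand. A minimal sketch, where the names list and the sample sentence are illustrative only: re.escape() takes care of the dots, and sorting longest-first keeps a shorter name from shadowing a longer one that starts the same way.

```python
import re

names = ["Big laundry, a.s.", "AVERA, s.r.o.", "Airoflot Airlines, a.s."]

# Escape each literal name and join them with the alternation operator.
alternatives = sorted(names, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(n) for n in alternatives))

text = "Competitors of AVERA, s.r.o. include Airoflot Airlines, a.s. and Big laundry, a.s."
found = pattern.findall(text)
print(found)
```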
I'm trying to split a string (a person's name) into components: prefix (Dr, Mr, Miss, etc.), given, middle, family, and suffix (Jr, III, etc.).
Prefixes and suffixes can be a known list of options.
Edge cases for double barreled family names like 'da Vinci' or 'di Caprio' don't really bother me too much. The da's and di's will just be dropped in the middle name, or if a middle is given (i.e. 4 names are found that don't match a prefix or suffix) then everything after the second name is dropped in the family name.
I'm thinking about writing the regex myself... but before I go and reinvent the wheel, I wonder if anyone has something that works I can use?
Thanks.
Here is a proposal in perl (I did not find a language or regex flavor requirement).
Perl supports non-capturing groups, e.g. "(?:\w+)", which I consider needed to stay below 10 captured groups.
I am using "\w+" almost everywhere, for simplicity. Names can therefore contain "_" and digits. If you do not like that, use "[[:alpha:]]+" instead.
perl -pe"s/(?:(Dr\.|Mr\.) )?(?:(\w+)(?: (\w+(?: \w+)*))? )?(?:(\w+) (Jr\.|I+))|(?:(Dr\.|Mr\.) )?(?:(\w+)(?: (\w+(?: \w+)*))? )?(\w+)/pre\1\6 give\2\7 middle\3\8 fam\4\9 post\5/"
For demonstration purposes, the code replaces, while inserting field names.
Please extract the requested regex and fill in the missing pres and posts.
What I consider the trick is to have one big alternative "|", which prefers matches with a postfix.
The fields are filled by using two groups each, one from the first, one from the second alternative. Only one of each pair is non-empty.
I tested with a test text file, containing combination of
prefix present
postfix present
given present
middle present (assuming that more middles work too)
second middle present
All test cases have a family name.
"Superman II" and "Madonna" would both only have a family name, hope that is OK, the super hero movie gets a suffix.
"Dr. Who" has a prefix and a family name.
I.e. I ignored the "Di"s, as you permitted.
I consider the output plausible.
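For comparison, here is a token-based sketch in Python. It is not a translation of the Perl regex; it is a different, simpler technique that aims at the same field semantics. The PREFIXES and SUFFIXES sets and the function name split_name are assumptions for illustration, and extra names (including "da" or "di") end up in the middle field.

```python
PREFIXES = {"Dr", "Mr", "Mrs", "Ms", "Miss"}   # hypothetical known list
SUFFIXES = {"Jr", "Sr", "II", "III", "IV"}     # hypothetical known list

def split_name(full_name):
    """Split a name into prefix, given, middle, family and suffix."""
    tokens = full_name.split()
    parts = {"prefix": "", "given": "", "middle": "", "family": "", "suffix": ""}
    if tokens and tokens[0].rstrip(".") in PREFIXES:
        parts["prefix"] = tokens.pop(0)
    # Only treat the last token as a suffix if a family name remains.
    if len(tokens) > 1 and tokens[-1].rstrip(".") in SUFFIXES:
        parts["suffix"] = tokens.pop()
    if tokens:
        parts["family"] = tokens.pop()
    if tokens:
        parts["given"] = tokens.pop(0)
    parts["middle"] = " ".join(tokens)  # whatever is left over
    return parts

print(split_name("Dr. John Ronald Reuel Tolkien Jr."))
print(split_name("Superman II"))
```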
I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.
My situation is this: I need to separate an address into its component parts based on a regular expression match on the "Identifier elements" of the address -- A comparable English example would be words like "state", "road", or "boulevard"--IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English), we specified the identifier type after each name
United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER
(Where the words in CAPS are what I have called "identifiers").
We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER
OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:
云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley
This is easy enough--a lazy match on a potential candidate identifier names, separated into a disjunctive list.
For China, the following are the "province-level" entities:
省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)
So my regex so far looks like this:
(.+?(?:(?:省)|(?:自治区)|(?:市)))
I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:
(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
So to match a province entity followed by a city entity:
(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.
The problem? If I have a pure disjunction of these elements, we have the following:
(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))
What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.
Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))
We have two input strings:
1. "crap catelephant crap city", where we wanted "crap catelephant" and "crap city"
2. "crap catelephant city", where we wanted "crap cat" and "elephant city"
Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that have the same identifier but are not at the same level.
Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:
广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District
(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)
The match for a greedy search would be
江门市开平市
This is wrong, as the two adjacent entities should be separated into their constituent parts. One is at the level of a provincial city, the other is a county-level city.
Back to the original point, and I thank you for reading this far, is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. 村委会 instead of simple 村 for example, "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right in finding disjunctive matches. Is this a valid assumption to make? Should I put the most frequently-occurring identifiers first in the disjunctive list?
If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn't have to be Chinese--I think more generally it is a question about the mechanics of the regex disjunctive match -- in what order does it preference the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?
In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)
How alternations are handled depends on the particular regular expression engine. For almost all engines (including Perl's regular expression engine) the alternation matches eagerly - that is, it matches the left-most choice first and only tries another alternative if this fails. For example, if you have /(cat|catelephant)/ it will never match catelephant. The solution is to reorder the choices so that the most specific comes first.
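Python's re module behaves the same way, so the effect of the ordering can be checked with a short sketch. The sample string is the question's 保定村委会 written without the hand-inserted hyphen.

```python
import re

address = "保定村委会"

# Shortest alternative first: the lazy prefix stops at the first 村 it can.
short_first = re.match(r'(.+?(?:村|村委|村委会))', address).group(1)

# Most specific alternative first: 村委会 is tried before 村委 and 村.
long_first = re.match(r'(.+?(?:村委会|村委|村))', address).group(1)

print(short_first)  # 保定村
print(long_first)   # 保定村委会

# The same effect with the English example from the answer.
print(re.search(r'(cat|catelephant)', 'catelephant').group(1))  # cat
print(re.search(r'(catelephant|cat)', 'catelephant').group(1))  # catelephant
```

Ordering alternatives from longest to shortest is usually enough when one identifier is a prefix of another.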
Although this seems like a trivial question, I am quite sure it is not :)
I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames from which I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English names, I think this would cut it:
^[a-z '-]+$
However, I need to support also these cases:
other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
different Unicode letter sets (accented letter, greek, japanese, chinese, and so on)
no numbers or symbols or unnecessary punctuation or runes, etc..
titles, middle initials, suffixes are not part of this data
names are already separated by surnames.
we are prepared to force ultra rare names to be simplified (there's a person named '#' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
note that many countries have laws about names so there are standards to follow
Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?
I would be looking for something similar to the many "email address" regexes that you can find on google.
I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
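For the SQL side of the Little Bobby Tables problem, the usual approach is to pass the name as a bound parameter rather than to filter its characters. A minimal sketch with Python's built-in sqlite3 module follows; the students table and the sample name are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

name = "Robert'); DROP TABLE students;--"

# The ? placeholder sends the value separately from the SQL text,
# so the quote and semicolon in the name are stored as plain data.
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

stored = conn.execute("SELECT name FROM students").fetchone()[0]
print(stored == name)
```

With this in place, the apostrophe in a name like D'Addario needs no special treatment at the database layer.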
I'll try to give a proper answer myself:
The only punctuations that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to include space.
This would sum up to this regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
var name = nameParam.Trim();
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;"); // &apos; does not work in IE
Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?
complete tested solution
using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            var names = new string[]{"Hello World",
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "आकाङ्क्षा",
                "علاء الدين",
                "אַבְרָהָם",
                "മലയാളം",
                "상",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam+" ");
                var name = nameParam.Trim();
                // Letters, combining marks, apostrophe, space, full stop and hyphen only.
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                // Encode the apostrophe before the name is written to HTML.
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}
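Python's built-in re module does not support \p{L}, so a rough equivalent of the same check can be sketched with unicodedata instead. The function name is_plausible_name is made up, and this only tests character classes; it does not decide whether something is a real name.

```python
import unicodedata

ALLOWED_PUNCTUATION = set(" .'-")  # space, full stop, apostrophe, hyphen

def is_plausible_name(name):
    """True if name contains only letters, combining marks and allowed punctuation."""
    name = name.strip()
    if not name:
        return False
    for ch in name:
        if ch in ALLOWED_PUNCTUATION:
            continue
        # Unicode general categories starting with L are letters, M are marks.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            return False
    return True

for candidate in ["João", "山田", "D'Addario", "P.A.M.", "<xss>", "' --"]:
    print(candidate, is_plausible_name(candidate))
```

As with the regex above, a string such as "' --" still passes because every character in it is allowed, which is one more reason to escape output and bind SQL parameters rather than rely on this check.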
I would just allow everything (except an empty string) and assume the user knows what his name is.
There are 2 common cases:
You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.
In case (1), you can allow all characters because you're checking against a paper document.
In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".
I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex; it will not tell you that it is a valid name.
EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:
s/[\<\>\"\'\%\;\(\)\&\+]//g;
"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)
v3.72, 2015-09-19 is a more recent version.
BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?
As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.
I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.
You could use the following regex to validate two names separated by a space:
^[A-Za-zÀ-ú]+ [A-Za-zÀ-ú]+$
or just use:
[[:lower:]] = [a-zà-ú]
[[:upper:]] =[A-ZÀ-Ú]
[[:alpha:]] = [A-Za-zÀ-ú]
[[:alnum:]] = [A-Za-zÀ-ú0-9]
It's a very difficult problem to validate something like a name due to all the corner cases possible.
Corner Cases
Anything anything here
Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potential strange (and legal) names is nearly infinite.
If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.
A very contentious subject that I seem to have stumbled upon here. However, sometimes it's nice to head dear Little Bobby Tables off at the pass and send little Robert to the headmaster's office along with his semicolons and SQL comment lines (--).
This regex in VB.NET accepts regular alphabetic characters and various accented European characters. However, poor old James Mc'Tristan-Smythe the 3rd will have to enter his pedigree as Jim the Third.
<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$">
This one worked perfectly for me in JavaScript:
^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$
Here is the method:
function isValidName(name) {
    // Inside a character class "|" is a literal pipe, so [\s-] is used instead of [\s|-].
    var found = name.search(/^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$/);
    return found > -1;
}
Steps:
first remove all accents
apply the regular expression
To strip the accents:
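// Requires: using System.Globalization; using System.Text;
private static string RemoveAccents(string s)
{
    // Decompose accented characters into base character + combining mark.
    s = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.Length; i++)
    {
        // Keep everything except the combining (non-spacing) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
    }
    return sb.ToString();
}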
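The same two steps can be sketched in Python with the standard unicodedata module. The function name remove_accents is made up here; the category "Mn" corresponds to the NonSpacingMark check above.

```python
import unicodedata

def remove_accents(s):
    """Decompose to NFD, then drop the non-spacing combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(remove_accents("João"))  # Joao
```

Letters that have no decomposition, such as "ø" or "ß", are left unchanged by this approach, so an ASCII-only regex applied afterwards would still reject them.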
This somewhat helps:
^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$
This one should work
^([A-Z]{1}+[a-z\-\.\']*+[\s]?)*
Add some special characters if you need them.