Trim pattern in a text between \n\n\n\n

I am cleaning text in R. My text has the form
but he could not avoid the subject FULLSTOP \n\n\n\n\nsimilar pieces
by the author\n\n\nlife is great 13022015\nreal men don t eath quiche
22042013\nback to the future 01072012\n\n\n\n and as he takes the
stage here wednesday night to rally democrats around hillary clinton
mr FULLSTOP obama will revisit his own promise to guide the nation
into an era of reconciliation and unity harking back to the themes
that propelled his improbable rise but that seem even more out of
reach today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for
a divided nation \n\n\n\n we get frustrated with political gridlock worry
about racial divisions are shocked and saddened by the madness of
orlando or nice mr FULLSTOP
I'm trying to get rid of
\n\n\n\n\nsimilar pieces by the author\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013\nback to the future 01072012\n\n\n\n
so to obtain something like
but he could not avoid the subject FULLSTOP and as he takes the stage
here wednesday night to rally democrats around hillary clinton mr
FULLSTOP obama will revisit his own promise to guide the nation into
an era of reconciliation and unity harking back to the themes that
propelled his improbable rise but that seem even more out of reach
today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for a
divided nation \n\n\n\n we get frustrated with political gridlock
worry about racial divisions are shocked and saddened by the madness
of orlando or nice mr FULLSTOP
I'm trying with something like
gsub("\\\n{3,}(similar pieces)?.*\\\n{3,}", "", my_string) or gsub("\\\n{3,}(similar pieces)?.*?\\\n{3,}", "", my_string)
But it either overtrims or does not work.
Any help (as well as an explanation of what I'm doing wrong and why the alternative works) would be much appreciated.

You need to match everything between the first 5 newline symbols up to the first 4 newline symbols.
I suggest a *\n{5}.*?\n{4} * regex:
* - zero or more literal spaces
\n{5} - 5 newline symbols
.*? - zero or more of any character, as few as possible, up to the first...
\n{4} - 4 LF symbols
* - zero or more literal spaces (just to trim the match)
and replace with a space.
Use sub since you only need 1 replacement:
sub(" *\n{5}.*?\n{4} *", " ", s)
See R demo
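For readers working in Python rather than R, the same idea can be sketched roughly as below. This is only an illustration under assumptions: the sample string is a shortened copy of the question's text, and re.DOTALL is needed because Python's . does not match newlines by default.

```python
import re

# Shortened sample from the question (assumed here for illustration).
s = ("but he could not avoid the subject FULLSTOP \n\n\n\n\nsimilar pieces by the author"
     "\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013"
     "\nback to the future 01072012\n\n\n\n and as he takes the stage")

# 5 newlines, then as little as possible up to the first run of 4 newlines.
# count=1 mirrors R's sub(): only the first occurrence is replaced.
cleaned = re.sub(r" *\n{5}.*?\n{4} *", " ", s, count=1, flags=re.DOTALL)
print(cleaned)
```

Because the quantifier is lazy, the block of three newlines inside the unwanted part does not end the match; only the first run of four does.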

Related

Extract the last 1 or 2 alphabetic character(s) from a quantity (mg,g,ml,l,cm,mm,m) when preceded by a digit

I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.

You could also make the pattern a bit more specific by matching either a digit or a space, and capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
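To check the pattern outside of Google Sheets, a rough Python equivalent could look like this. The helper name extract_unit is made up for this sketch, and Python's re is not the same engine as the one Sheets uses, although this particular pattern uses only features common to both.

```python
import re

# Same pattern as in the formula: a digit or space, then one known unit at the end.
UNIT_RE = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")

def extract_unit(text):
    """Return the captured unit, or an empty string when nothing matches."""
    m = UNIT_RE.search(text)
    return m.group(1) if m else ""

samples = [
    "Colgate Plax Spearmint Alcohol Free Mouthwash 500ml",
    "Banana Chips kg",
    "Value Lite Milk 2l",
    "Morning Harvest Caged Size 7 Eggs 12pk",
]
units = [extract_unit(s) for s in samples]
print(units)
```

Since there is a single capture group, only one value comes back per row, which is what avoids the two-column result from the original formula.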

Match 1 optional letter, then 1 letter anchored to end:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")

Regex for name with non-latin characters in python [duplicate]

For website validation purposes, I need first name and last name validation.
The first name should contain only letters, may consist of several words separated by spaces, and must have a minimum of three characters and a maximum of 30 characters. An empty string should not validate. Valid examples: Jason, jason, jason smith, jason smith, JASON, Jason smith, jason Smith, and jason SMITH.
The last name should be a single word, only letters, with at least three characters but at most 30 characters. Empty strings should not validate. Valid examples: lazslo, Lazslo, and LAZSLO.

Don't forget about names like:
Mathias d'Arras
Martin Luther King, Jr.
Hector Sausage-Hausen
This should do the trick for most things:
/^[a-z ,.'-]+$/i
OR Support international names with super sweet unicode:
/^[a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð ,.'-]+$/u

You make false assumptions about the format of first and last names. It is probably better not to validate the name at all, apart from checking that it is not empty.

After going through all of these answers I found a way to build a tiny regex that supports most languages and only allows for word characters. It even supports some special characters like hyphens, spaces and apostrophes. I've tested in python and it supports the characters below:
^[\w'\-,.][^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]]{2,}$
Characters supported:
abcdefghijklmnopqrstwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
áéíóúäëïöüÄ'
陳大文
łŁőŐűŰZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųū
ÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁ
ŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ.-
ñÑâê都道府県Федерации
আবাসযোগ্য জমির걸쳐 있는
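A quick way to try this pattern in Python 3 is sketched below. The sample names are picked from elsewhere in this thread; the results only show what this specific regex accepts, not what counts as a real name.

```python
import re

# The pattern from this answer, unchanged. In Python 3, \w is Unicode-aware for str.
NAME_RE = re.compile(r"^[\w'\-,.][^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]]{2,}$")

def is_valid_name(name):
    return NAME_RE.match(name) is not None

for candidate in ["陳大文", "Hector Sausage-Hausen", "Jo", "John3"]:
    print(candidate, is_valid_name(candidate))
```

Note that the pattern needs one leading character plus at least two more, so two-character names such as "Jo" are rejected.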

I have created a custom regex to deal with names.
I have tried these types of names and found that it works perfectly:
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King
Ai Wong
Chao Chang
Alzbeta Bara
My RegEx looks like this:
^([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?)
MVC4 Model:
[RegularExpression("^([a-zA-Z]{2,}\\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\\s?([a-zA-Z]{1,})?)", ErrorMessage = "Valid Characters include (A-Z) (a-z) (' space -)") ]
Please note the double \\ for escape characters
For those of you who are new to RegEx, I thought I'd include an explanation.
^ // start of line
[a-zA-Z]{2,} // will accept a name with at least two characters
\s // will look for white space between name and surname
[a-zA-Z]{1,} // needs at least 1 character
\'?-? // possibility of **'** or **-** for double-barrelled and hyphenated surnames
[a-zA-Z]{2,} // will accept a name with at least two characters
\s? // possibility of another whitespace
([a-zA-Z]{1,})? // possibility of a second surname

I have searched and searched and played and played with it and although it is not perfect it may help others making the attempt to validate first and last names that have been provided as one variable.
In my case, that variable is $name.
I used the following code for my PHP:
if (preg_match('/\b([A-Z]{1}[a-z]{1,30}[- ]{0,1}|[A-Z]{1}[- \']{1}[A-Z]{0,1}[a-z]{1,30}[- ]{0,1}|[a-z]{1,2}[ -\']{1}[A-Z]{1}[a-z]{1,30}){2,5}/', $name)) {
    # pass - successful match - do something
} else {
    # fail - unsuccessful match - do something
}
I am learning RegEx myself but I do have the explanation for the code as provided by RegEx buddy.
Here it is:
Assert position at a word boundary «\b»
Match the regular expression below and capture its match into backreference number 1
«([A-Z]{1}[a-z]{1,30}[- ]{0,1}|[A-Z]{1}[- \']{1}[A-Z]{0,1}[a-z]{1,30}[- ]{0,1}|[a-z]{1,2}[ -\']{1}[A-Z]{1}[a-z]{1,30}){2,5}»
Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»
* I NEED SOME HELP HERE WITH UNDERSTANDING THE RAMIFICATIONS OF THIS NOTE *
Note: I repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «{2,5}»
Match either the regular expression below (attempting the next alternative only if this one fails) «[A-Z]{1}[a-z]{1,30}[- ]{0,1}»
Match a single character in the range between “A” and “Z” «[A-Z]{1}»
Exactly 1 times «{1}»
Match a single character in the range between “a” and “z” «[a-z]{1,30}»
Between one and 30 times, as many times as possible, giving back as needed (greedy) «{1,30}»
Match a single character present in the list “- ” «[- ]{0,1}»
Between zero and one times, as many times as possible, giving back as needed (greedy) «{0,1}»
Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[A-Z]{1}[- \']{1}[A-Z]{0,1}[a-z]{1,30}[- ]{0,1}»
Match a single character in the range between “A” and “Z” «[A-Z]{1}»
Exactly 1 times «{1}»
Match a single character present in the list below «[- \']{1}»
Exactly 1 times «{1}»
One of the characters “- ” «- » A ' character «\'»
Match a single character in the range between “A” and “Z” «[A-Z]{0,1}»
Between zero and one times, as many times as possible, giving back as needed (greedy) «{0,1}»
Match a single character in the range between “a” and “z” «[a-z]{1,30}»
Between one and 30 times, as many times as possible, giving back as needed (greedy) «{1,30}»
Match a single character present in the list “- ” «[- ]{0,1}»
Between zero and one times, as many times as possible, giving back as needed (greedy) «{0,1}»
Or match regular expression number 3 below (the entire group fails if this one fails to match) «[a-z]{1,2}[ -\']{1}[A-Z]{1}[a-z]{1,30}»
Match a single character in the range between “a” and “z” «[a-z]{1,2}»
Between one and 2 times, as many times as possible, giving back as needed (greedy) «{1,2}»
Match a single character in the range between “ ” and “'” «[ -\']{1}»
Exactly 1 times «{1}»
Match a single character in the range between “A” and “Z” «[A-Z]{1}»
Exactly 1 times «{1}»
Match a single character in the range between “a” and “z” «[a-z]{1,30}»
Between one and 30 times, as many times as possible, giving back as needed (greedy) «{1,30}»
I know this validation totally assumes that every person filling out the form has a western name, and that may eliminate the vast majority of folks in the world. However, I feel like this is a step in the proper direction. Perhaps this regular expression is too basic for the gurus to address simplistically, or maybe there is some other reason that I was unable to find the above code in my searches. I spent way too long trying to figure this bit out; you will probably notice just how foggy my mind is on all this if you look at my test names below.
I tested the code on the following names and the results are in parentheses to the right of each name.
STEVE SMITH (fail)
Stev3 Smith (fail)
STeve Smith (fail)
Steve SMith (fail)
Steve Sm1th (passed on the Steve Sm)
d'Are to Beaware (passed on the Are to Beaware)
Jo Blow (passed)
Hyoung Kyoung Wu (passed)
Mike O'Neal (passed)
Steve Johnson-Smith (passed)
Jozef-Schmozev Hiemdel (passed)
O Henry Smith (passed)
Mathais d'Arras (passed)
Martin Luther King Jr (passed)
Downtown-James Brown (passed)
Darren McCarty (passed)
George De FunkMaster (passed)
Kurtis B-Ball Basketball (passed)
Ahmad el Jeffe (passed)
If you have basic names similar to those I used during testing, this code might be for you. Note that the name must have between two and five parts for the above code to work.
If you have any improvements, please let me know. I am just in the early stages (the first few months) of figuring out RegEx.
Thanks and good luck,
Steve
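On the RegexBuddy note about repeating a capturing group: the practical effect is that group 1 only holds the last repetition, not the whole name. A small Python sketch of that behaviour follows. It uses a simplified pattern (only the first alternative of the regex above), which is an assumption made to keep the example short.

```python
import re

# Repeated capturing group: the group itself is repeated 2 to 5 times.
repeated = re.compile(r"\b([A-Z][a-z]{1,30}[- ]?){2,5}")
# Capturing group around the repeated (non-capturing) group.
wrapped = re.compile(r"\b((?:[A-Z][a-z]{1,30}[- ]?){2,5})")

name = "Martin Luther King"
m1 = repeated.search(name)
m2 = wrapped.search(name)

print(m1.group(0))  # the whole match
print(m1.group(1))  # only the last iteration of the group
print(m2.group(1))  # all iterations, because the capture wraps the repetition
```

If only a pass/fail check is needed, as in the PHP code above, the difference does not matter. It matters once the captured text itself is used.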

I've tried almost everything on this page, then I decided to modify the most voted answer which ended up working best. Simply matches all languages and includes .,-' characters.
Here it is:
/^[\p{L} ,.'-]+$/u
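Since the question is about Python, it is worth noting that the standard re module does not support \p{L}. A common approximation is [^\W\d_], which means "a word character that is neither a digit nor an underscore". The sketch below assumes that approximation is acceptable; it is not an exact equivalent of \p{L}.

```python
import re

# [^\W\d_] roughly stands in for \p{L}; the second class holds the allowed punctuation.
NAME_RE = re.compile(r"^(?:[^\W\d_]|[ ,.'-])+$")

def is_valid_name(name):
    return NAME_RE.match(name) is not None

for candidate in ["Hector Sausage-Hausen", "Виктория", "陳大文", "Stev3 Smith", ""]:
    print(repr(candidate), is_valid_name(candidate))
```

The third-party regex package does support \p{L} directly, if an extra dependency is acceptable.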

First name would be
"([a-zA-Z]{3,30}\s*)+"
If you need the whole first name part to be shorter than 30 letters, you need to check that separately, I think. The expression ".{3,30}" should do that.
Your last name requirements would translate into
"[a-zA-Z]{3,30}"
but you should check these. There are plenty of last names containing spaces.

As maček said:
Don't forget about names like:
Mathias d'Arras
Martin Luther King, Jr.
Hector Sausage-Hausen
and to remove cases like:
..Mathias
Martin king, Jr.-
This will cover more cases:
^([a-z]+[,.]?[ ]?|[a-z]+['-]?)+$

This regex works for me (I was using it in Angular 8):
([a-zA-Z',.-]+( [a-zA-Z',.-]+)*){2,30}
It will be invalid if there is:
Any whitespace at the start or end of the name
Any symbol, e.g. #
Fewer than 2 or more than 30 characters
Example invalid First Name (whitespace)
Example valid First Name :

I'm working on an app that validates International Passports (ICAO). We support only English characters. While most foreign national characters can be represented by a character in the Latin alphabet, e.g. è by e, there are several national characters that require an extra letter to represent them, such as the German umlaut, which requires an 'e' to be added to the letter, e.g. ä by ae.
This is the JavaScript Regex for the first and last names we use:
/^[a-zA-Z '.-]*$/
The max number of characters on the international passport is up to 31.
We use maxlength="31" to better word error messages instead of including it in the regex.
Here is a snippet from our code in AngularJS 1.6 with form and error handling:
class PassportController {
constructor() {
this.details = {};
// English letters, spaces and the following symbols ' - . are allowed
// Max length determined by ng-maxlength for better error messaging
this.nameRegex = /^[a-zA-Z '.-]*$/;
}
}
angular.module('akyc', ['ngMessages'])
.controller('PassportController', PassportController);
.has-error p[ng-message] {
color: #bc111e;
}
.tip {
color: #535f67;
}
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular.min.js"></script>
<script src="https://code.angularjs.org/1.6.6/angular-messages.min.js"></script>
<main ng-app="akyc" ng-controller="PassportController as $ctrl">
<form name="$ctrl.form">
<div name="lastName" ng-class="{ 'has-error': $ctrl.form.lastName.$invalid} ">
<label for="pp-last-name">Surname</label>
<div class="tip">Exactly as it appears on your passport</div>
<div ng-messages="$ctrl.form.lastName.$error" ng-if="$ctrl.form.$submitted" id="last-name-error">
<p ng-message="required">Please enter your last name</p>
<p ng-message="maxlength">This field can be at most 31 characters long</p>
<p ng-message="pattern">Only English letters, spaces and the following symbols ' - . are allowed</p>
</div>
<input type="text" id="pp-last-name" ng-model="$ctrl.details.lastName" name="lastName"
class="form-control" required ng-pattern="$ctrl.nameRegex" ng-maxlength="31" aria-describedby="last-name-error" />
</div>
<button type="submit" class="btn btn-primary">Test</button>
</form>
</main>

Read almost all highly voted posts (only some are good). After understanding the problem in detail & doing research, here are the tight regexes:
1). ^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)$
name Z is allowed contrary to the assumption made by some in the thread.
No leading or trailing spaces are allowed, empty string is NOT allowed, string containing only spaces is NOT allowed
Supports English alphabets only
Supports hyphens (Some-Foobarbaz-name, Some foobarbaz-Name), apostrophes (David D'Costa, David D'costa, David D'costa R'Costa p'costa), periods (Dr. L. John, Robert Downey Jr., Md. K. P. Asif) and commas (Martin Luther, Jr.).
First alphabet of only the first word of a name MUST be capital.
NOT Allowed: John sTeWaRT, JOHN STEWART, Md. KP Asif, John Stewart PhD
Allowed: John Stewart, John stewart, Md. K P Asif
you can easily modify this condition.
If you also want to allow names like Queen Elizabeth 2 or Henry IV:
2). ^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*([.]?| (-----)| [1-9][0-9]*)$
replace ----- with roman numeral's regex (which itself is long) OR you can use this alternative regex which is based on KISS philosophy [IVXLCDM]+ (here I, V, X, ... in ANY random order will satisfy the regex).
I personally suggest to use this regex:
3). ^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$
Feel free to try this regex HERE & make any modifications of your choice.
I have provided with tight regex which covers every possible name I found on my research with no bug. Modify these regexes to relax some of the unwanted constraints.
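To see how regex 3 treats the examples listed above, here is a small Python check. It only reflects the behaviour of this one pattern on the names mentioned in this answer.

```python
import re

# Regex 3 from this answer, unchanged.
NAME_RE = re.compile(r"^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$")

allowed = ["John Stewart", "John stewart", "Md. K P Asif",
           "Martin Luther, Jr.", "David D'Costa", "Henry IV"]
rejected = ["John sTeWaRT", "JOHN STEWART", "Md. KP Asif", "John Stewart PhD"]

results = {name: bool(NAME_RE.match(name)) for name in allowed + rejected}
for name, ok in results.items():
    print(name, "->", "allowed" if ok else "rejected")
```

"Henry IV" passes only because of the optional trailing numeral group; with regex 1 it would be rejected.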
[UPDATE - March, 2022]
Here are 4 more regexes:
^[A-Za-z]+(([,.] |[ '-])[A-Za-z]+)*([.,'-]?)$
^((([,.'-]| )(?<!( {2}|[,.'-]{2})))*[A-Za-z]+)+[,.'-]?$
^( ([A-Za-z,.'-]+|$))+|([A-Za-z,.'-]+( |$))+$
^(([ ,.'-](?<!( {2}|[,.'-]{2})))*[A-Za-z])+[ ,.'-]?$
It's been a while since I looked back at these 4 regexes so I forgot their specifications. These 4 regexes are not tight, unlike the previous ones but do the job very well. These regexes distinguish 3 parts of a name: English alphabet, space and special character. Which one you need out of these 4 depends on your answer (Yes/No) to these questions:
have at least 1 alphabet?
can start with a space or a special character?
can end with a space or a special character?
are 2 consecutive spaces allowed?
are 2 consecutive special characters allowed?
Note: name validation should ONLY serve as a warning, NOT as a requirement a name must fulfill, because there is no fixed naming pattern. Even if there is one, it can change overnight, and thus any tight regex you come across will become obsolete at some point in the future.

There is one issue with the top voted answer here which recommends this regex:
/^[a-z ,.'-]+$/i
It takes spaces only as a valid name!
The best solution in my opinion is to add a negative lookahead to the beginning:
/^(?!\s)([a-z ,.'-]+)$/i
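The difference between the two patterns can be checked with a short Python sketch. re.IGNORECASE plays the role of the /i flag here.

```python
import re

loose = re.compile(r"^[a-z ,.'-]+$", re.IGNORECASE)
strict = re.compile(r"^(?!\s)[a-z ,.'-]+$", re.IGNORECASE)

spaces_only = "   "
real_name = "Mathias d'Arras"

print(bool(loose.match(spaces_only)))   # the original pattern accepts spaces only
print(bool(strict.match(spaces_only)))  # the lookahead rejects a leading space
print(bool(strict.match(real_name)))
```

The lookahead only guards the first character, so a name with trailing spaces would still pass.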

I use:
/^(?:[\u00c0-\u01ffa-zA-Z'-]){2,}(?:\s[\u00c0-\u01ffa-zA-Z'-]{2,})+$/i
And test for maxlength using some other means

I didn't find any answer helpful for me simply because users can pick a non-english name and simple regex are not helpful. In fact it's actually very hard to find the right expression that works for all languages.
Instead, I picked a different approach and negated all characters that should not be in the name for the valid match. Below pattern negates numerical, special characters, control characters and '\', '/'
Final regex
without punctuations: ["] ['] [,] [.], etc. :
^([^\p{N}\p{S}\p{C}\p{P}]{2,20})$
with punctuations:
^([^\p{N}\p{S}\p{C}\\\/]{2,20})$
With this, all these names are valid:
alex junior
沐宸
Nick
Sarah's Jane ---> with punctuation support
ביממה
حقیقت
Виктория
And following names become invalid:
🤣 Maria
k
١١١١١
123John
This means all names that don't have numerical characters, emojis, \ and are between 2-20 characters are allowed. You can edit the above regex if you want to add more characters to exclusion list.
For more information about the available patterns to include or exclude, check out this:
https://www.regular-expressions.info/unicode.html#prop

^\p{L}{2,}$
^ asserts position at start of a line.
\p{L} matches any kind of letter from any language
{2,} Quantifier — Matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
So it should be a name in any language containing at least 2 letters(or symbols) without numbers or other characters.

If you are looking for the simplest way, just check for at least 2 words.
/^[^\s]+( [^\s]+)+$/
Valid names
John Doe
pedro alberto ch
Ar. Gen
Mathias d'Arras
Martin Luther King, Jr.
Invalid names
John
陳大文

For simplicity's sake, you can use:
(.*)\s(.*)
The thing I like about this is that the last name is always after the first name, so if you're going to enter these matched groups into a database, and the name is John M. Smith, the 1st group will be John M., and the 2nd group will be Smith.
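A short Python sketch of how the two groups come out. The first (.*) is greedy, so it takes everything up to the last whitespace character.

```python
import re

SPLIT_RE = re.compile(r"(.*)\s(.*)")

m = SPLIT_RE.match("John M. Smith")
first, last = m.groups()
print(first)
print(last)

# A single word has no whitespace to split on, so there is no match at all.
single = SPLIT_RE.match("John")
print(single)
```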

So, together with a customer, we created this crazy regex:
(^$)|(^([^\-!#\$%&\(\)\*,\./:;\?#\[\\\]_\{\|\}¨ˇ“”€\+<=>§°\d\s¤®™©]| )+$)

For first and last names there are really only 2 things you should be looking for:
Length
Content
Here is my regular expression:
var regex = /^[A-Za-z,.-]{3,20}$/
1. Length
Here the {3,20} constrains the length of the string to be between 3 and 20 characters.
2. Content
The information between the square brackets [A-Za-z] allows uppercase and lowercase characters. All subsequent symbols (-,.) are also allowed.

The following expression will work on any language supported by UTF-16 and will ensure that there's a minimum of two components to the name (i.e. first + last), but will also allow any number of middle names.
/^(\S+ )+\S+$/u
At the time of this writing it seems none of the other answers meet all of those criteria. Even ^\p{L}{2,}$, which is the closest, falls short because it will also match "invisible" characters, such as U+FEFF (Zero Width No-Break Space).

Try these solutions, for maximum compatibility, as I have already posted here:
JavaScript:
var nm_re = /^(?:((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-.\s])){1,}(['’,\-\.]){0,1}){2,}(([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-. ]))*(([ ]+){0,1}(((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){1,})(['’\-,\.]){0,1}){2,}((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){2,})?)*)$/;
HTML5:
<input type="text" name="full_name" id="full_name" pattern="^(?:((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-.\s])){1,}(['’,\-\.]){0,1}){2,}(([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-. ]))*(([ ]+){0,1}(((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){1,})(['’\-,\.]){0,1}){2,}((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){2,})?)*)$" required>

This is what I use.
This regex accepts only names with minimum characters, from A-Z a-z ,space and -.
Names example:
Ionut Ionete, Ionut-Ionete Cantemir, Ionete Ionut-Cantemirm Ionut-Cantemir Ionete-Second
The limit of name's character is 3. If you want to change this, modify {3,} to {6,}
([a-zA-Z\-]+){3,}\s+([a-zA-Z\-]+){3,}

This seems to do the job for me:
[\S]{2,} [\S]{2,}( [\S]{2,})*

I usually write:
return /^[a-zA-Z\-\s\.\'\`\u00E0-\u00FC]+$/.test(firstName);

Fullname with only one whitespace:
^[a-zA-Z'\-\pL]+(?:(?! {2})[a-zA-Z'\-\pL ])*[a-zA-Z'\-\pL]+$

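A simple function using preg_match in PHP:
<?php
// Prints whether $name consists only of ASCII letters and spaces.
function name_validation($name) {
    // preg_match returns 1 on a match and 0 when there is no match
    if (preg_match("/^[a-zA-Z ]*$/", $name) === 1) {
        echo "$name is a valid name";
    } else {
        echo "$name is not a valid name";
    }
}
// Test: prints "89name is not a valid name"
name_validation('89name');
?>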

If you want the whole first name to be between 3 and 30 characters with no restrictions on individual words, try this :
[a-zA-Z ]{3,30}
Beware that it excludes all foreign letters such as é, è, à, ï.
If you want the limit of 3 to 30 characters to apply to each individual word, Jens's regexp will do the job.

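// Assumes this code runs inside a form handler function, so "return false" is legal.
var name = document.getElementById('login_name').value;
// Reject names shorter than 4 or longer than 30 characters
if (name.length < 4 || name.length > 30) {
    alert('Name length is mismatch');
}
// Lowercase letters, digits, dots and spaces only
var pattern = new RegExp("^[a-z.0-9 ]+$");
var return_value = pattern.exec(name);
if (return_value == null) {
    alert("Please give valid Name");
    return false;
}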

Regex match characters when not preceded by a string

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3. I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to get it to match just the string No behind the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in it, and not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb

This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will split also after Sgt. for the simple reason that a lookbehind assertion has to be fixed width in Python (what a limitation!).
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\#<![?.!]\)\( *\d\+ *\)\#!\zs
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.

You may consider a matching approach, it will offer you better control over the entities you want to count as single words, not as sentence break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.
Python demo:
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
\d+\.\s*\d+ - 1+ digits, ., 0+ whitespaces, 1+ digits
(?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
\.(?!\s+-?[A-Z0-9]) - matches a dot not followed by 1 or more whitespace and then an optional - and uppercase letters or digits
| - or
[^.!?] - any character but a ., !, and ?
(?:[.?!]|$) - a ., !, and ? or end of string.

As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you are not able to distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" and "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No. \d. (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)
2. Mask your edge cases
For each edge case, mask the string with a respective string that will 100% not be part of the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
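The four steps above could be sketched in Python roughly as follows. This is a minimal illustration that handles only the "No. <number>" edge case; the function name split_sentences is made up, and the split uses a lookbehind instead of the plain [\.\!\?]\s so that the punctuation stays attached to each sentence.

```python
import re

MASK = "======NUMBER======"  # assumed never to occur in the real text

def split_sentences(text):
    # Steps 1 and 2: find "No." followed by a space and a digit, and mask it.
    masked = re.sub(r"\bNo\.(?= \d)", MASK, text)
    # Step 3: split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Step 4: turn the mask back into "No."
    return [part.replace(MASK, "No.") for part in parts]

sentences = split_sentences("I am well. You bought me a No. 3 burger. Thanks!")
print(sentences)
```

Further edge cases such as "Sgt." would each need their own mask string in the same way.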

Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
Myself I would do it with three steps:
Replace spaces that should stay with some special character (re.sub)
Split the text (re.split)
Replace the special character with space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']

Match string of all uppercase that can be 1 or more words and has at least 2 spaces of white space before it

I need to parse names out of a PDF file. The names are always all uppercase and could be 1 or more words, e.g. CHUBBY BOY or MIKE. They are also indented, so there are spaces before the names.
preg_match('/(?=[A-Z]{2,})([A-Z]+)/', $removedStar, $mymatches)
is getting pretty close.
if ( preg_match('/(?=[A-Z]{2,})([A-Z]+)/', $removedStar, $mymatches)) {
$name_value = $removedStar;
$nameValues[$nameCount] = $mymatches[0];
$nameCount++;
}
These lines are the output of $removedStar
EXT. HOME FOR INSANE - PARKING LOT - SAME
12 12
A OLD DATSUN is BLASTING hard rock on shitty speakers-
the same song- Enjoyed!
YOUNG TEENS: a Drunk, a Jock, and a Chubby boy,
headbanging and drunk in the parking lot-
CHUBBY BOY
Shh quiet! Listen-
5.
THEY HEAR THE SCREAM IN THE DISTANCE. THEY GET QUIET- the
Drunk kid holding a beer GIGGLES-
DRUNK KID
Trippy.
this is actually getting even closer
strtoupper($removedStar) == $removedStar

Are you looking for something like this?
(?:  )([A-Z]+[A-Z ]*[A-Z])
Two groups, the first is two spaces, but not captured
At least one capital letter
Followed by as many other capitals or spaces
Ending in a capital
https://regexr.com/4n4u6
It won't match single characters, but it sounds like you don't need that.
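The pattern is engine-neutral, so it can be tried quickly in Python before moving it into preg_match. The sample lines below are an assumption: the indentation shown in the question was lost, so ten leading spaces are used for the name lines here.

```python
import re

# Two literal spaces (not captured), then an all-caps name of one or more words.
NAME_RE = re.compile(r"(?:  )([A-Z]+[A-Z ]*[A-Z])")

lines = [
    "          CHUBBY BOY",
    "      Shh quiet! Listen-",
    "          DRUNK KID",
    "      Trippy.",
]

names = []
for line in lines:
    m = NAME_RE.search(line)
    if m:
        names.append(m.group(1))
print(names)
```

Any other indented line that starts with two or more capitals would match as well, so an extra check such as the strtoupper comparison from the question may still be useful.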

Create Fill-In-The-Blanks Text From Text Chunks using Regex and Replace

I am trying to create a fill-in-the-blanks worksheet from a chunk of text, and I think regex and a replace function in a text editor will greatly expedite my project.
Example text:
HAMLET O, that this too too solid flesh would melt Thaw and resolve
itself into a dew! Or that the Everlasting had not fix'd His canon
'gainst self-slaughter! O God! God! How weary, stale, flat and
unprofitable, Seem to me all the uses of this world! Fie on't! ah fie!
'tis an unweeded garden, That grows to seed; things rank and gross in
nature Possess it merely. That it should come to this! But two months
dead: nay, not so much, not two: So excellent a king; that was, to
this, Hyperion to a satyr; so loving to my mother That he might not
beteem the winds of heaven Visit her face too roughly. Heaven and
earth! Must I remember? why, she would hang on him, As if increase of
appetite had grown By what it fed on: and yet, within a month-- Let me
not think on't--Frailty, thy name is woman!-- A little month, or ere
those shoes were old With which she follow'd my poor father's body,
Like Niobe, all tears:--why she, even she-- O, God! a beast, that
wants discourse of reason, Would have mourn'd longer--married with my
uncle, My father's brother, but no more like my father Than I to
Hercules: within a month: Ere yet the salt of most unrighteous tears
Had left the flushing in her galled eyes, She married. O, most wicked
speed, to post With such dexterity to incestuous sheets! It is not nor
it cannot come to good: But break, my heart; for I must hold my
tongue.
Replace alternate text sets with a blank ("__") whose length equals that of the text it replaces, where a text set is defined as a group of words ending with "!", ",", "--", "?", etc.
So the above text from Hamlet becomes like
HAMLET O, ___________________ Or that the
Everlasting had not fix'd His canon 'gainst self-slaughter! __
God! _____, stale, ________ ......
What is the regex that I should use to achieve this end?

Here is an attempt using perl regex:
perl -pe 's/(.*?)([\!\?\,;\.]|--)(.*?)([\!\?\,;\.]|--)/\1\2________________\4/g' file
Output:
HAMLET O,_______! Or that the Everlasting had not fix'd His
canon 'gainst self-slaughter!_______! God!_______,
stale,_______, Seem to me all the uses of this
world!_______! ah fie!_______, That grows to
seed;_______. That it should come to this!_______,
not so much,_______; that was,_______, Hyperion to a
satyr;_______. Heaven and earth!_______?
why,_______, As if increase of appetite had grown By what it
fed on: and yet,_______-- Let me not think
on't--_______, thy name is woman!_______-- A little
month,_______, Like Niobe,_______--why
she,_______-- O,_______! a beast,_______,
Would have mourn'd longer--_______, My father's
brother,_______, She married._______, most wicked
speed,_______! It is not nor it cannot come to good: But
break,_______; for I must hold my tongue.
This solution replaces each chunk with a fixed number of underscores; I have yet to figure out how to make the blank match the length of the replaced text.
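One possible way to get blanks of matching length is to do the replacement with a small script instead of a single substitution. The sketch below is in Python rather than Perl, and the function name blank_alternate_chunks and the delimiter set are assumptions for this example.

```python
import re

def blank_alternate_chunks(text):
    # Split on the delimiters but keep them: even items are text, odd items are delimiters.
    pieces = re.split(r"(--|[!?,;.])", text)
    out = []
    for i, piece in enumerate(pieces):
        is_text = i % 2 == 0
        chunk_number = i // 2
        core = piece.strip()
        if is_text and chunk_number % 2 == 1 and core:
            # Blank every second chunk, keeping surrounding whitespace and the length.
            piece = piece.replace(core, "_" * len(core), 1)
        out.append(piece)
    return "".join(out)

result = blank_alternate_chunks("O, that this! Or that, stale")
print(result)
```

The output has exactly the same length as the input, so line wrapping in the worksheet stays the same.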
