Data :
col 1
AL GHAITHA
AL ASEEL
EMARAT AL
LOREAL
ISLAND CORAL
My code :
import re

def remove_words(df, col, letters):
    regular_expression = '^' + '|'.join(letters)
    df[col] = df[col].apply(lambda x: re.sub(regular_expression, "", x))
Desired output :
col 1
GHAITHA
ASEEL
EMARAT
LOREAL
ISLAND CORAL
SUNRISE
Function call :
letters = ['AL','SUPERMARKET']
remove_words(df=df, col='col 1', letters=letters)
Basically, I want to remove the given words when they appear at either the start or the end (note: only as a separate word).
For example, "EMARAT AL" should become "EMARAT".
Note that "LOREAL" should not become "LORE".
Code to build the df :
import pandas as pd

raw_data = {'col 1': ['AL GHAITHA', 'AL ASEEL', 'EMARAT AL', 'LOREAL UAE',
                      'ISLAND CORAL', 'SUNRISE SUPERMARKET']}
df = pd.DataFrame(raw_data)
You may use
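pattern = r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters)))
df['col 1'] = df['col 1'].str.replace(pattern, r'\1', regex=True).str.strip()
The r'(?s)^(?:{0})\b|(.*)\b(?:{0})$'.format("|".join(map(re.escape, letters))) call will create a pattern like (?s)^(?:word1|word2)\b|(.*)\b(?:word1|word2)$, and it will match any of the words as a whole word at the start or end of the string. The non-capturing groups keep ^, \b and $ applied to every alternative, and regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default.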
When checking the word at the end of the string, the whole text before it will be captured into Group 1, hence the replacement pattern contains the \1 placeholder to restore that text in the resulting string.
If the items in your letters list consist only of word characters, you may omit re.escape: replace map(re.escape, letters) with letters.
The .str.strip() will remove any resulting leading/trailing whitespaces.
See the regex demo.
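As a rough illustration of the whole-word removal described above, here is a minimal sketch that uses only the standard re module on a plain list, so it runs without pandas. The helper name strip_edge_words is hypothetical, and the alternation is wrapped in non-capturing groups so that ^, \b and $ apply to every word in the list.

```python
import re

def strip_edge_words(text, words):
    """Remove any of `words` when it is a whole word at the start or end of `text`."""
    alternatives = "|".join(map(re.escape, words))
    # (?s) lets . match newlines; group 1 keeps the text before a trailing word.
    pattern = r"(?s)^(?:{0})\b|(.*)\b(?:{0})$".format(alternatives)
    # When the leading branch matches, group 1 is unmatched and \1 expands to "".
    return re.sub(pattern, r"\1", text).strip()

names = ["AL GHAITHA", "AL ASEEL", "EMARAT AL", "LOREAL UAE",
         "ISLAND CORAL", "SUNRISE SUPERMARKET"]
cleaned = [strip_edge_words(name, ["AL", "SUPERMARKET"]) for name in names]
print(cleaned)
```

"LOREAL UAE" and "ISLAND CORAL" are left alone because there is no word boundary before the "AL" inside "LOREAL" or "CORAL".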
How to remove numbers from a text in Scala?
For example, I have this text:
canon 40 22mm lens lock strength plenty orientation 321 .
after removing :
canon lens lock strength plenty orientation .
Please, try filter or filterNot
val text = "canon 40 22mm lens lock strength plenty orientation 321 ."
val without_digits = text.filter(!_.isDigit)
or
val text = "canon 40 22mm lens lock strength plenty orientation 321 ."
val without_digits = text.filterNot(_.isDigit)
\\d+\\S*\\s+
Try this: replace each match with an empty string. See the demo:
https://regex101.com/r/tS1hW2/1
Apparently you want to remove every word that contains a digit: in your example, mm is also gone because it is prefixed by a number.
val s = "That's 22m, which is gr8."
s.split(" ").filterNot(_.exists(_.isDigit)).mkString(" ")
res8: String = That's which is
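For comparison, the same token-filtering idea can be sketched in Python. This is only an illustration of the approach, not a replacement for the Scala one-liner; the function name drop_tokens_with_digits is hypothetical.

```python
def drop_tokens_with_digits(text):
    """Split on single spaces and drop every token that contains a digit."""
    tokens = text.split(" ")
    kept = [tok for tok in tokens if not any(ch.isdigit() for ch in tok)]
    return " ".join(kept)

print(drop_tokens_with_digits(
    "canon 40 22mm lens lock strength plenty orientation 321 ."))
```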
I need to identify a string in a text and replace it with an empty string. The problem is that it is not always present as a single word: there may be space characters between each letter or group of letters. For example:
For word "Decent", I may face the following values.
D ec ent,
De ce nt,
De ce n t .
Is there a way to identify these strings using "Decent" word as input with any regular expression?
I am very new to regular expressions. Please help!!
TIA!
\bD\s*e\s*c\s*e\s*n\s*t\s*
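This matches D ec ent, De ce nt and De ce n t, as well as Decent (and decent, if the match is case-insensitive),
but not the "de cent" inside "blade centimeter", because \b requires a word boundary before the D.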
If you use
'D ?e ?c ?e ?n ?t ?'
it will match the word with extra spaces
The expression "D\s*e\s*c\s*e\s*n\s*t" will do it: each letter is followed by zero or more whitespace characters (\s matches any whitespace, not only spaces). You could replace \s* with " *" (a space followed by an asterisk) if you only want literal spaces.
first a bit of code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordsWithSpaces {
    public static void main(String[] args) {
        String test = "Descent D escent De s cent desce nd";
        String word = "descent";
        String pattern = "";
        for (int i = 0; i < word.length(); i++) {
            pattern = pattern + word.charAt(i) + "\\s*";
        }
        System.err.println("pattern is: " + pattern);
        Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(test);
        while (m.find()) {
            String found = test.substring(m.start(), m.end());
            System.err.println(found + " matches");
        }
    }
}
Now for the explanation: \s is a character class for whitespace; this includes spaces, tabs and (possibly) line breaks. In this piece of code, I take every character of the word I am looking for and append "\s*", where "*" means zero or more occurrences.
Also, to avoid the match being case-sensitive, I set the CASE_INSENSITIVE flag on the pattern.
Character classes may not have the same name in your programming language of choice, but there should be one for whitespace. Check your documentation.
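The same pattern-building idea can be sketched in Python. This variant is an assumption of mine rather than a port of the Java code: it puts \s* between the letters only (not after the last one), escapes each character, and adds \b on both sides so that fragments inside other words are skipped. The name spaced_word_pattern is hypothetical.

```python
import re

def spaced_word_pattern(word):
    """Compile a case-insensitive pattern for `word` with optional whitespace between letters."""
    # Escape each character in case the word contains regex metacharacters.
    body = r"\s*".join(re.escape(ch) for ch in word)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pattern = spaced_word_pattern("Decent")
text = "A D ec ent meal, a De ce n t price, but blade centimeter stays."
print(pattern.findall(text))
print(pattern.sub("", "very De ce nt indeed"))
```

Removing the match leaves the surrounding spaces in place, so a follow-up whitespace cleanup may be wanted.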
I'm really bad with regular expressions and find them to be too complex. However, I need to use them to do some string manipulation in classic asp.
Input String :
"James John Junior
S.D. Industrial Corpn
D-2341, Focal Point, Phase 4-a,
Sarsona, Penns
Japan
Phone : 92-161-4633248 Fax : 92-161-253214
email : swerte_60#laher.com"
Desired Output string:
"JXXXX JXXX JXXXXX
S.X. IXXXXXXXXX CXXXX
D-XXXX, FXXXX PXXXX, PXXXX 4-X,
SXXXXXX, PXXXX
JXXXX
PXXXX : 9X-XXX-XXXXXXX Fax : 9X-XXX-XXXXXX
eXXXX : sXXXXX_XX#XXXXX.XXX"
Note: We need to split the original string into words on single spaces. Then, in each word, we need to replace every letter (lower and upper case) and digit except the first character with an "X".
I know it's sort of difficult, but a seasoned regex expert could nail this pretty easily, I would think. No?
Edit:
I've made some progress. Found a function (http://www.addedbytes.com/lab/vbscript-regular-expressions/) that sort of does the job. But needs a little refinement, if anyone can help
function ereg_replace(strOriginalString, strPattern, strReplacement, varIgnoreCase)
    ' Function replaces pattern with replacement
    ' varIgnoreCase must be TRUE (match is case insensitive) or FALSE (match is case sensitive)
    dim objRegExp : set objRegExp = new RegExp
    with objRegExp
        .Pattern = strPattern
        .IgnoreCase = varIgnoreCase
        .Global = True
    end with
    ereg_replace = objRegExp.replace(strOriginalString, strReplacement)
    set objRegExp = nothing
end function
I'm calling it like so:
orgstr = ereg_replace(orgstr, "\w", "X", True)
However, the result looks like -
XXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXX.
XX, XXXXX XXXX, XXXXXX XXXXXX, XXXXXXX XXXXXXX, XXXXXXXXX
XXXXX : XXX-XXX-XXXX
XXX :
XXXXX : XXXXXX#XXXXXX.XX
I'd like this to show the first character in every word. Any help out there?
This approach gets close:
Function AnonymiseWord(m, p, s)
    AnonymiseWord = Left(m, 1) & String(Len(m) - 1, "X")
End Function

Function AnonymiseText(input)
    Dim rgx: Set rgx = new RegExp
    rgx.Global = True
    rgx.Pattern = "\b\w+?\b"
    AnonymiseText = rgx.Replace(input, GetRef("AnonymiseWord"))
End Function
This might get you close enough to what you need. Otherwise, the basic approach is sound, but you may need to fiddle with the pattern so that it matches exactly the stretches of text you want to pass through AnonymiseWord.
Well, in .NET it would be easy:
resultString = Regex.Replace(subjectString,
    @"(?<=          # Assert that there is before the current position...
     \b             # a word boundary
     \w             # one alphanumeric character (= first letter/digit/underscore)
     [\w.#-]*       # any number of alnum characters or ., # or -
    )               # End of lookbehind
    [\p{L}\p{N}]    # Match any letter or digit to be replaced",
    "X", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
The result, though, would be slightly different than what you wrote:
"JXXXX JXXX JXXXXX
S.X. IXXXXXXXXX CXXXX
D-XXXX, FXXXX PXXXX, PXXXX 4-X,
SXXXXXX, PXXXX
JXXXX
PXXXX : 9X-XXX-XXXXXXX FXX : 9X-XXX-XXXXXX
eXXXX : sXXXXX_XX#XXXXX.XXX"
(observe that Fax has also been changed to FXX)
Without .NET, you could try something like
orgstr = ereg_replace(orgstr, "\b(\w)[\w.#-]*", "$1XXXX", True) ' VBScript's RegExp uses $1 for the first group and needs no doubled backslashes
which would give you
"JXXXX JXXXX JXXXX
SXXXX IXXXX CXXXX
DXXXX, FXXXX PXXXX, PXXXX 4XXXX,
SXXXX, PXXXX
JXXXX
PXXXX : 9XXXX FXXXX : 9XXXX
eXXXX : sXXXX"
You won't get it better than that with a single regex.
I have no idea about classic ASP, but if it does support (negative) lookbehinds and the only problem is the quantifier in the lookbehind, then why not turn it around and do it this way:
(?<!^)(?<!\s)[a-zA-Z0-9]
and replace with "X".
That is, replace every letter and digit that is not preceded by whitespace and is not at the start of the string/row.
See it here on Regexr
Although I love regular expressions, you could do it without them, especially because VBScript does not support look behind.
Dim myString, myArray, newString, i, j
Const forbiddenChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
myString = "James John Junior S.D. Industrial Corpn D-2341, Focal Point, Phase 4-a, Sarsona, Penns Japan Phone : 92-161-4633248 Fax : 92-161-253214 email : swerte_60#laher.com"
myArray = split(myString, " ")
For i = lbound(myArray) to ubound(myArray)
    newString = left(myArray(i), 1)
    For j = 2 to len(myArray(i))
        If instr(forbiddenChars, mid(myArray(i), j, 1)) > 0 Then
            newString = newString & "X"
        Else
            newString = newString & mid(myArray(i), j, 1)
        End If
    Next
    myArray(i) = newString
Next
myString = join(myArray, " ")
It doesn't cope with the VbNewLine character, but you will get the idea. You can do an extra split on the VbNewLine character, iterate through all elements and split each element on the space for example.
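To make the "keep the first character, mask the rest" rule concrete, here is a minimal Python sketch. It is not classic ASP and is meant only to illustrate the logic: every run of non-whitespace counts as a word (so line breaks are handled as well as spaces), and only letters and digits after the first character are replaced. The function name anonymise is hypothetical.

```python
import re

def anonymise(text):
    """Keep the first character of each word; replace later letters and digits with X."""
    def mask(match):
        word = match.group(0)
        # Punctuation such as '.', '-', '_' and '#' is left untouched.
        return word[0] + re.sub(r"[A-Za-z0-9]", "X", word[1:])
    # \S+ treats any run of non-whitespace characters as one word.
    return re.sub(r"\S+", mask, text)

print(anonymise("S.D. Industrial Corpn\nD-2341, Focal Point, Phase 4-a,"))
print(anonymise("email : swerte_60#laher.com"))
```

As with the .NET version above, "Fax" becomes "FXX" under this rule.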
I'm trying to create a regular expression for matching latitude/longitude coordinates. For matching a double-precision number I've used (\-?\d+(\.\d+)?), and tried to combine that into a single expression:
^(\-?\d+(\.\d+)?),\w*(\-?\d+(\.\d+)?)$
I expected this to match a double, a comma, perhaps some space, and another double, but it doesn't seem to work. Specifically it only works if there's NO space, not one or more. What have I done wrong?
This one will strictly match latitude and longitude values that fall within the correct range:
^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$
Matches
+90.0, -127.554334
45, 180
-90, -180
-90.000, -180.0000
+90, +180
47.1231231, 179.99999999
Doesn't Match
-90., -180.
+90.1, -100.111
-91, 123.456
045, 180
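A quick way to check the listed cases is to run them through the pattern. The sketch below does that in Python; the pattern is copied from the answer, while the names LAT_LON and is_valid_pair are hypothetical.

```python
import re

# Latitude part, comma, optional whitespace, longitude part.
LAT_LON = re.compile(
    r"^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*"
    r"[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$"
)

def is_valid_pair(text):
    """Return True when text looks like 'lat, lon' within the valid ranges."""
    return LAT_LON.match(text) is not None

should_match = ["+90.0, -127.554334", "45, 180", "-90, -180",
                "-90.000, -180.0000", "+90, +180", "47.1231231, 179.99999999"]
should_fail = ["-90., -180.", "+90.1, -100.111", "-91, 123.456", "045, 180"]

for sample in should_match + should_fail:
    print(repr(sample), is_valid_pair(sample))
```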
I am using these ones (decimal format, with 6 decimal digits):
Latitude
^(\+|-)?(?:90(?:(?:\.0{1,6})?)|(?:[0-9]|[1-8][0-9])(?:(?:\.[0-9]{1,6})?))$
Longitude
^(\+|-)?(?:180(?:(?:\.0{1,6})?)|(?:[0-9]|[1-9][0-9]|1[0-7][0-9])(?:(?:\.[0-9]{1,6})?))$
Here is a gist that tests both, reported here also, for ease of access. It's a Java TestNG test. You need Slf4j, Hamcrest and Lombok to run it:
import static org.hamcrest.Matchers.*;
import static org.hamcrest.MatcherAssert.*;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import lombok.extern.slf4j.Slf4j;
import org.testng.annotations.Test;

@Slf4j
public class LatLongValidationTest {
    protected static final String LATITUDE_PATTERN="^(\\+|-)?(?:90(?:(?:\\.0{1,6})?)|(?:[0-9]|[1-8][0-9])(?:(?:\\.[0-9]{1,6})?))$";
    protected static final String LONGITUDE_PATTERN="^(\\+|-)?(?:180(?:(?:\\.0{1,6})?)|(?:[0-9]|[1-9][0-9]|1[0-7][0-9])(?:(?:\\.[0-9]{1,6})?))$";

    @Test
    public void latitudeTest(){
        DecimalFormat df = new DecimalFormat("#.######");
        df.setRoundingMode(RoundingMode.UP);
        double step = 0.01;
        // Every value inside [-90, 90] must match.
        Double latitudeToTest = -90.0;
        while(latitudeToTest <= 90.0){
            boolean result = df.format(latitudeToTest).matches(LATITUDE_PATTERN);
            log.info("Latitude: tested {}. Result (matches regex): {}", df.format(latitudeToTest), result);
            assertThat(result, is(true));
            latitudeToTest += step;
        }
        // Values below -90 must not match.
        latitudeToTest = -90.1;
        while(latitudeToTest >= -200.0){
            boolean result = df.format(latitudeToTest).matches(LATITUDE_PATTERN);
            log.info("Latitude: tested {}. Result (matches regex): {}", df.format(latitudeToTest), result);
            assertThat(result, is(false));
            latitudeToTest -= step;
        }
        // Values above 90 must not match.
        latitudeToTest = 90.01;
        while(latitudeToTest <= 200.0){
            boolean result = df.format(latitudeToTest).matches(LATITUDE_PATTERN);
            log.info("Latitude: tested {}. Result (matches regex): {}", df.format(latitudeToTest), result);
            assertThat(result, is(false));
            latitudeToTest += step;
        }
    }

    @Test
    public void longitudeTest(){
        DecimalFormat df = new DecimalFormat("#.######");
        df.setRoundingMode(RoundingMode.UP);
        double step = 0.01;
        // Every value inside [-180, 180] must match.
        Double longitudeToTest = -180.0;
        while(longitudeToTest <= 180.0){
            boolean result = df.format(longitudeToTest).matches(LONGITUDE_PATTERN);
            log.info("Longitude: tested {}. Result (matches regex): {}", df.format(longitudeToTest), result);
            assertThat(result, is(true));
            longitudeToTest += step;
        }
        // Values below -180 must not match.
        longitudeToTest = -180.01;
        while(longitudeToTest >= -300.0){
            boolean result = df.format(longitudeToTest).matches(LONGITUDE_PATTERN);
            log.info("Longitude: tested {}. Result (matches regex): {}", df.format(longitudeToTest), result);
            assertThat(result, is(false));
            longitudeToTest -= step;
        }
        // Values above 180 must not match.
        longitudeToTest = 180.01;
        while(longitudeToTest <= 300.0){
            boolean result = df.format(longitudeToTest).matches(LONGITUDE_PATTERN);
            log.info("Longitude: tested {}. Result (matches regex): {}", df.format(longitudeToTest), result);
            assertThat(result, is(false));
            longitudeToTest += step;
        }
    }
}
Whitespace is \s, not \w
^(-?\d+(\.\d+)?),\s*(-?\d+(\.\d+)?)$
See if this works
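To see the difference between \w* and \s* concretely, here is a small Python sketch that runs both variants of the expression against the same input. The variable names with_w and with_s are just for illustration.

```python
import re

# The question's pattern (\w* after the comma) and the corrected one (\s*).
with_w = re.compile(r"^(-?\d+(\.\d+)?),\w*(-?\d+(\.\d+)?)$")
with_s = re.compile(r"^(-?\d+(\.\d+)?),\s*(-?\d+(\.\d+)?)$")

sample = "12.5, -7.25"
print(with_w.match(sample) is not None)  # the space is not a word character
print(with_s.match(sample) is not None)

m = with_s.match(sample)
print(m.group(1), m.group(3))  # groups 1 and 3 hold the two numbers
```

Without a space ("12.5,-7.25") both patterns match, which is why the original only appeared to work in that case.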
Actually, Alix Axel, the regex above is wrong with respect to the latitude and longitude ranges.
Latitude measurements range from –90° to +90°
Longitude measurements range from –180° to +180°
So the regex given below validates more accurately.
Also, in my opinion, no one should restrict the number of decimal places in latitude/longitude.
^([-+]?\d{1,2}([.]\d+)?),\s*([-+]?\d{1,3}([.]\d+)?)$
OR for Objective C
^([-+]?\\d{1,2}([.]\\d+)?),\\s*([-+]?\\d{1,3}([.]\\d+)?)$
^-?[0-9]{1,3}(?:\.[0-9]{1,10})?$
Regex breakdown:
^-?[0-9]{1,3}(?:\.[0-9]{1,10})?$
^ # Start of string
-? # Optional minus sign (accept negative values)
[0-9]{1,3} # Match 1-3 digits (i. e. 0-999)
(?: # Try to match...
\. # a decimal point
[0-9]{1,10} # followed by one to 10 digits (i. e. 0-9999999999)
)? # ...optionally
$ # End of string
@marco-ferrari I did find a way to shorten it, without lookaheads, in light of all the recent talk about regex engines:
const LAT_RE = /^[+-]?(([1-8]?[0-9])(\.[0-9]{1,6})?|90(\.0{1,6})?)$/;
const LONG_RE = /^[+-]?((([1-9]?[0-9]|1[0-7][0-9])(\.[0-9]{1,6})?)|180(\.0{1,6})?)$/;
Here is a more strict version:
^([-+]?\d{1,2}[.]\d+),\s*([-+]?\d{1,3}[.]\d+)$
Latitude = -90 -- +90
Longitude = -180 -- +180
Try this:
^(\()([-+]?)([\d]{1,2})(((\.)(\d+)(,)))(\s*)(([-+]?)([\d]{1,3})((\.)(\d+))?(\)))$
Check it out at:
http://regexpal.com/
Paste the expression in the top box, then put things like this in the bottom box:
(80.0123, -34.034)
(80.0123)
(80.a)
(980.13, 40)
(99.000, 122.000)
Regex breakdown:
^               # The string must start this way (there can't be anything before).
(\()            # An opening parenthesis (escaped with a backslash).
([-+]?)         # An optional minus or plus sign.
([\d]{1,2})     # 1 or 2 digits (0-9).
(               # Start of a sub-pattern.
(               # Start of a sub-pattern.
(\.)            # A dot (escaped with a backslash).
(\d+)           # One or more digits (0-9).
(,)             # A comma.
)               # End of a sub-pattern.
)               # End of a sub-pattern.
(\s*)           # Zero or more whitespace characters.
(               # Start of a sub-pattern.
([-+]?)         # An optional minus or plus sign.
([\d]{1,3})     # 1 to 3 digits (0-9).
(               # Start of an optional sub-pattern.
(\.)            # A dot (escaped with a backslash).
(\d+)           # One or more digits (0-9).
)?              # End of the optional sub-pattern.
(\))            # A closing parenthesis (escaped with a backslash).
)               # End of a sub-pattern.
$               # The string must end this way (there can't be anything after).
Now, what this does NOT do is restrict itself to this range:
(-90 to +90, and -180 to +180)
Instead, it simply restricts the number of integer digits (up to two for the first number, up to three for the second), which gives this range:
(-99 to +99, -999 to +999)
But the point is mainly just to break down each piece of the expression.
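One way to get the real ranges, sketched here in Python as an assumption rather than as part of the answer above, is to use a loose pattern only to pull out the two numbers and then compare them numerically. The names PAIR and parse_lat_lon are hypothetical, and unlike the expression above this sketch lets both numbers omit the decimal part.

```python
import re

# Loose shape check: "(number, number)" with optional signs and decimals.
PAIR = re.compile(r"^\(([-+]?\d{1,2}(?:\.\d+)?),\s*([-+]?\d{1,3}(?:\.\d+)?)\)$")

def parse_lat_lon(text):
    """Return (lat, lon) as floats, or None if the shape or the range is wrong."""
    m = PAIR.match(text)
    if m is None:
        return None
    lat, lon = float(m.group(1)), float(m.group(2))
    # The range test is done on numbers, not on digit patterns.
    if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
        return lat, lon
    return None

for sample in ["(80.0123, -34.034)", "(80.0123)", "(80.a)",
               "(980.13, 40)", "(99.000, 122.000)"]:
    print(sample, parse_lat_lon(sample))
```

Splitting the job this way keeps the regex short and leaves the range limits in one readable comparison.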
Python:
Latitude: result = re.match(r"^[+-]?((90\.?0*$)|(([0-8]?[0-9])\.?[0-9]*$))", '-90.00001')
Longitude: result = re.match(r"^[+-]?((180\.?0*$)|(((1[0-7][0-9])|([0-9]{0,2}))\.?[0-9]*$))", '-0.0000')
Latitude should fail in the example.
This regex shortens @marco-ferrari's solution by collapsing the repeated [0-9] alternatives into a single [0-9] preceded by an optional prefix. It also removes unnecessary non-capturing groups (?:) in various places.
lat "^([+-])?(?:90(?:\\.0{1,6})?|((?:|[1-8])[0-9])(?:\\.[0-9]{1,6})?)$";
long "^([+-])?(?:180(?:\\.0{1,6})?|((?:|[1-9]|1[0-7])[0-9])(?:\\.[0-9]{1,6})?)$";
**Matches for Lat**
Valid between -90 to +90 with up to 6 decimals.
**Matches for Long**
Valid between -180 to +180 with up to 6 decimals.
This would work for format like this: 31 ͦ 37.4' E
^[-]?\d{1,2}[ ]*ͦ[ ]*\d{1,2}\.?\d{1,2}[ ]*\x27[ ]*\w$
I believe you're using \w (word character) where you ought to be using \s (whitespace). Word characters typically consist of [A-Za-z0-9_], so that excludes your space, which then further fails to match on the optional minus sign or a digit.
Ruby
Longitude -179.99999999..180
/^(-?(?:1[0-7]|[1-9])?\d(?:\.\d{1,8})?|180(?:\.0{1,8})?)$/ === longitude.to_s
Latitude -89.99999999..90
/^(-?[1-8]?\d(?:\.\d{1,8})?|90(?:\.0{1,8})?)$/ === latitude.to_s
A complete and simple method in Objective-C for checking that a latitude and longitude string has the correct pattern is:
- (BOOL)textIsValidValue:(NSString *)searchedString
{
    NSRange searchedRange = NSMakeRange(0, [searchedString length]);
    NSError *error = nil;
    NSString *pattern = @"^[-+]?([1-8]?\\d(\\.\\d+)?|90(\\.0+)?),\\s*[-+]?(180(\\.0+)?|((1[0-7]\\d)|([1-9]?\\d))(\\.\\d+)?)$";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    NSTextCheckingResult *match = [regex firstMatchInString:searchedString options:0 range:searchedRange];
    return match ? YES : NO;
}
where searchedString is the input that the user enters in the respective text field.
PHP
Here is the PHP version (input values are $latitude and $longitude):
$latitude_pattern = '/\A[+-]?(?:90(?:\.0{1,18})?|\d(?(?<=9)|\d?)\.\d{1,18})\z/x';
$longitude_pattern = '/\A[+-]?(?:180(?:\.0{1,18})?|(?:1[0-7]\d|\d{1,2})\.\d{1,18})\z/x';
if (preg_match($latitude_pattern, $latitude) && preg_match($longitude_pattern, $longitude)) {
// Valid coordinates.
}
This one requires at least 3 digits after the decimal point to avoid false matches:
(?<latitude>-?\d+\.\d{3,10}),(?<longitude>-?\d+\.\d{3,10})
This works for the standard latitude and longitude ranges.
Latitude validation criteria:
Valid between -90 and +90, with up to 9 decimals.
Longitude validation criteria:
Valid between -180 and +180, with up to 9 decimals.
latitude : "^([+-])?(?:90(?:\.0{1,9})?|((?:|[1-8])[0-9])(?:\.[0-9]{1,9})?)$";
longitude : "^([+-])?(?:180(?:\.0{1,9})?|((?:|[1-9]|1[0-7])[0-9])(?:\.[0-9]{1,9})?)$";
You can try this:
var latExp = /^(?=.)-?((8[0-5]?)|([0-7]?[0-9]))?(?:\.[0-9]{1,20})?$/;
var lngExp = /^(?=.)-?((0?[8-9][0-9])|180|([0-1]?[0-7]?[0-9]))?(?:\.[0-9]{1,20})?$/;
Try this:
(?<!\d)([-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)),\s*([-+]?(?:180(?:\.0+)?|(?:(?:1[0-7]\d)|(?:[1-9]?\d))(?:\.\d+)?))(?!\d)
Try this:
^[-+]?(([0-8]\\d|\\d)(\\.\\d+)?|90(\\.0+)?),\\s*[-+]?((1[0-7]\\d(\\.\\d+)?)|(180(\\.0+)?)|(\\d\\d(\\.\\d+)?)|(\\d(\\.\\d+)?))$