RegExReplace - A few examples to get me started, please - regex

I'm trying to use RegExReplace to pre-process some text before it gets parsed for use in an Access database. Currently I have been defining a growing number of string patterns into a table, then use the stock Replace() function in VBA using that table. Works OK, but misses the mark in a few areas; I am pretty sure regular expressions will be a better long-term solution for me, but I am completely clueless how to construct them.
I'd like to see if the smart folks here can give me a leg up on the task using a few actual examples from my data, by illustrating the regex strings that will produce the desired result:
1. 6 IN => 6IN
2. 12.3 IN X 2 YD => 12.3IN_X_2YD
3. 6IN X 4IN => 6IN_X_4IN
4. 8X120MM => 8_X_120MM
5. 1 1/2" => 1.5IN
6. CAT, DOG => CAT DOG
7. CAT,DOG => CAT DOG
8. CAT ,DOG => CAT DOG
9. CAT , DOG => CAT DOG
My patterns fail in ways like: CATHETER INFUSION => CATHETERINFUSION
I will be using a multi-pass approach rather than attempting to come up with some terribly complex expressions.
Can anyone offer some initial guidance on any of these samples? I'm confident I will be able to leverage them to extend as needed.
[Edit:] I did just find a few helpful examples:
NewStr := RegExReplace("abc123123", "123$", "xyz") ; Returns "abc123xyz" because the $ allows a match only at the end.
NewStr := RegExReplace("abc123", "i)^ABC") ; Returns "123" because a match was achieved via the case-insensitive option.
NewStr := RegExReplace("abcXYZ123", "abc(.*)123", "aaa$1zzz") ; Returns "aaaXYZzzz" by means of the $1 backreference.
NewStr := RegExReplace("abc123abc456", "abc\d+", "", ReplacementCount) ; Returns "" and stores 2 in ReplacementCount.
[Edit 2]: Making good progress!
strText = "BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"
strResult = RegExReplace(strText, "(,|\s+)", " ", True)
strResult = RegExReplace(strResult, "\s+(IN|FT|YD)\s+", "$1 ", True)
strResult = RegExReplace(strResult, "\s+X\s+", "_X_", True)
Produces:
BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE

Some regexps that might be useful:
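/(\d)\s+IN\b/$1IN/
/\s+X\s+/_X_/
/(\d)X(\d)/$1_X_$2/
The first requires a digit before the unit and a word boundary after it, so CATHETER INFUSION is left alone. The third captures the digits on either side of the X and puts them back, since a plain (?:\d) group would consume them.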
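For anyone who wants to try the multi-pass idea outside VBA, here is a minimal Python sketch of the same passes. The function name normalize and the unit list are my own assumptions, not anything from the question, and sample 5 (the fraction) is deliberately not covered.

```python
import re

# Units to glue onto a preceding number; this list is an assumption, extend as needed.
UNITS = r"(?:IN|FT|YD|MM|CM)"

# Ordered (pattern, replacement) passes, applied one after another.
PASSES = [
    (re.compile(r"\s*,\s*"), " "),                   # commas with any spacing -> one space
    (re.compile(rf"(\d)\s+({UNITS})\b"), r"\1\2"),   # "6 IN" -> "6IN"; needs a digit first
    (re.compile(r"(?<=\d)X(?=\d)"), "_X_"),          # "8X120MM" -> "8_X_120MM"
    (re.compile(r"\s+X\s+"), "_X_"),                 # "6IN X 4IN" -> "6IN_X_4IN"
]


def normalize(text: str) -> str:
    """Run every pass in order and trim the result."""
    for pattern, replacement in PASSES:
        text = pattern.sub(replacement, text)
    return text.strip()


print(normalize("BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"))
```

Because the unit pass insists on a digit before the whitespace, CATHETER INFUSION passes through untouched, which is the failure mentioned in the question.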

Related

Replacing regex match in pandas column with modified regex

I am trying to replace a regular expression match with modified regular expression.
Following is the column in my DataFrame.
df['newcolumn']
0 Ther was a quick brown appl_product_type in ("eds") where blah blan appl_Cust_type =("value","value")
1 Ther was a quick brown appl_product_type = ("EDS") where blah blan appl_Cust_type =("value","value")
2 Ther was a quick brown appl_product_type in ("eds") where blah b
3 Ther was a quick brown appl_product_type in = ("EDS") where blah blan appl_Cust_type = ("value")
4 Ther was a quick brown where blah blan appl_Cust_type
Name: newcolumn, dtype: object
I want to replace every occurrence of strings like "appl_product_type = ('EDS')" with "upper(appl_product_type) = ('EDS')".
I am using the following code but getting an error:
newcolumn.replace(value='upper\[\w]+\s+[in=]+[\s+\([\"\w+\,+\s+]+\)', regex='[\w]+\s+[in=]+[\s+\([\"\w+\,+\s+]+\)')
error: bad escape \w at position 7
Is there a way to solve this? Please help.
A couple of things:
You can't use \w in your replacement value and expect it to know what to fill in.
Your regex, as written, is badly formed. Use r'' raw strings to make regex patterns simpler.
Your question is unclear: you ask about one specific format, while your regex attempts to catch a lot more.
I have a slightly clearer version of what you attempted, but I am unsure whether it is exactly what you wanted, given the ambiguity in your question:
df['newcolumn'] = df['newcolumn'].replace({r'([\w_]+\s+(?:in|=|\s)+\(\"(?:\w+\"(?:\,)?(?:\s+)?)+\))' : r'upper(\1)'}, regex=True)
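If the goal is to wrap only the column name in upper(...), rather than the whole matched condition, a capture group for the name plus a backreference in the replacement is one way to do it. This is a sketch with plain re, assuming a column name is any word directly followed by = or in and an opening parenthesis; the function name wrap_columns_in_upper is hypothetical.

```python
import re

# Group 1: the column name. Group 2: "=" or "in" (possibly "in =") up to the "(".
COLUMN_BEFORE_LIST = re.compile(r"\b(\w+)(\s*(?:=|\bin\b)[\s=]*\()")


def wrap_columns_in_upper(text: str) -> str:
    """Rewrite 'col = (' or 'col in (' as 'upper(col) = (' / 'upper(col) in ('."""
    return COLUMN_BEFORE_LIST.sub(r"upper(\1)\2", text)


row = 'Ther was a quick brown appl_product_type = ("EDS") where blah blan appl_Cust_type =("value","value")'
print(wrap_columns_in_upper(row))
```

On a DataFrame column the same pattern and replacement should be usable through Series.str.replace with regex=True, though I have only exercised the plain-string version here.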

Search a string by a mix of syntactical and regex patterns

I would like to use R to search a text for patterns expressed through a mix of POS and actual strings. (I have seen this functionality in a python library here: http://www.clips.ua.ac.be/pages/pattern-search).
For instance, a search pattern could be: 'NOUNPHRASE be|is|was ADJECTIVE than NOUNPHRASE', and should return all strings containing structures like: "a cat is faster than a dog".
I know that packages like openNLP and qdap offer convenient POS-tagging. Has anyone used their output for this kind of pattern matching?
As a starter, using koRpus and TreeTagger:
library(koRpus)
library(tm)
mytxt <- c("This is my house.", "A house is better than no house.", "A cat is faster than a dog.")
pattern <- "Noun, singular or mass.*?Adjective, comparative.*?Noun, singular or mass"
tagged.results <- treetag(file = mytxt, treetagger="C:/TreeTagger/bin/tag-english.bat", lang="en", format="obj", stopwords=stopwords("en"))
tagged.results <- kRp.filter.wclass(tagged.results, "stopword")
taggedText(tagged.results)$id <- factor(head(cumsum(c(0, taggedText(tagged.results)$desc == "Sentence ending punctuation")) + 1, -1))
setNames(mytxt, grepl(pattern, aggregate(desc~id, taggedText(tagged.results), FUN = paste0)$desc))
# FALSE TRUE TRUE
# "This is my house." "A house is better than no house." "A cat is faster than a dog."
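The underlying trick is independent of the tagger: join the tags of a sentence into one string and run an ordinary regex over that string. A minimal Python sketch of just that step follows. The tokens are hand-tagged with Penn Treebank style tags, so they stand in for real tagger output, and the function name is hypothetical.

```python
import re


def tag_sequence_matches(tagged_tokens, tag_pattern):
    """Return True if the space-joined tag sequence matches tag_pattern."""
    tags = " ".join(tag for _word, tag in tagged_tokens)
    return re.search(tag_pattern, tags) is not None


# Hand-assigned tags (an assumption, not produced by a tagger).
cat = [("A", "DT"), ("cat", "NN"), ("is", "VBZ"), ("faster", "JJR"),
       ("than", "IN"), ("a", "DT"), ("dog", "NN")]
house = [("This", "DT"), ("is", "VBZ"), ("my", "PRP$"), ("house", "NN")]

# Noun ... comparative adjective ... noun, mirroring the pattern in the R code.
COMPARISON = r"\bNN\b.*?\bJJR\b.*?\bNN\b"

print(tag_sequence_matches(cat, COMPARISON))
print(tag_sequence_matches(house, COMPARISON))
```

Mixing literal words into the pattern (be|is|was, than) would need the words kept alongside the tags, for example by joining word/TAG pairs instead of tags alone.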

Excel RegEx Functions in R

I regularly work with Excel Sheets where some fields (observations) contain large amounts of text content in a part structured form (at least visually)
So the content of a single Cell/Obs might be somewhat like this:
My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog
In Excel I've created a few functions which I can use to look for a string within the cell. Let's say the data is in "A1";
in "A2" I can use "=GETPOSTCODE(A1)", where the function is:
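Function GetPostCode(PostCode As Range) As String
' Late-bound RegExp object (an assumption; its declaration is not shown in the question)
Dim regex As Object, X As Object, x1 As Object
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = "[A-Z]{3}\d{3,}\b\w*"
regex.IgnoreCase = True
regex.MultiLine = True
Set X = regex.Execute(PostCode.Value)
' Return the first match only, upper-cased
For Each x1 In X
GetPostCode = UCase(x1.Value)
Exit For
Next
End Function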
What kind of structures/functions could I use in r to accomplish this?
the Cells really contain Much more data than that, its purely for example, and I have a number of different "get" functions with different regexs.
I've had a good look at all the Grep type commands but am struggling with limited/developing R skills.
I've been working around this kind of Principle, but pretty much stalled (where textfield is the column with my text in obviously!) I can get a list of all the rows where it contains a post code but not JUST the Post Code:
df$postcode <- df[(df$textfield = grep("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield), ]
Any Help appreciated!
I think you need a combination of regexpr or gregexpr (to find the matches in the string) and regmatches (to extract the matching parts of the strings):
x <- "My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"
> regmatches(x, regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE))
[1] "ABC123"
Other options probably include str_extract from stringr or stri_extract from stringi packages.
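For comparison only, since the question is about R: the same "first match, upper-cased" behaviour as the VBA function could be sketched in Python like this. The name get_postcode is hypothetical and the pattern is copied from the question.

```python
import re

# Same pattern as in the question; IGNORECASE mirrors regex.IgnoreCase = True.
POSTCODE = re.compile(r"[A-Z]{3}\d{3,}\b\w*", re.IGNORECASE)


def get_postcode(cell_text: str) -> str:
    """Return the first postcode-like match in upper case, or '' if none."""
    match = POSTCODE.search(cell_text)
    return match.group(0).upper() if match else ""


cell = "My name is John Doe\nI live at my address\nMy Post code is ABC123\nMy Favorite Pet is: A dog"
print(get_postcode(cell))
```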

Negating Alternation In Regular Expressions

I can use "Alternation" in a regular expression to match any occurrence of "cat" or "dog" thusly:
(cat|dog)
Is it possible to NEGATE this alternation, and match anything that is NOT "cat" or "dog"?
If so, how?
For Example:
Let's say I'm trying to match END OF SENTENCE in English, in an approximate way.
To Wit:
(\.)(\s+[A-Z][^.]|\s*?$)
With the following paragraph:
The quick brown fox jumps over the lazy dog. Once upon a time Dr. Sanches, Mr. Parsons and Gov. Mason went to the store. Hello World.
I incorrectly find "end of sentence" at Dr., Mr., and Gov.
(I'm testing using http://regexpal.com/ in case you want to see what I'm seeing with the above example)
Since this is incorrect, I would like to say something like:
!(Dr\.|Mr\.|Gov\.)(\.)(\s+[A-Z][^.]|\s*?$)
Of course, this isn't working, which is why I seek help.
I also tried !/(Dr.|Mr.|Gov.)/, and !~ which were no help whatsoever.
How can I avoid matches for "Dr.", "Mr." and "Gov.", etc?
Thanks in advance.
It is not possible. You would normally do this using negative lookbehind (?<!…), but JavaScript's regex flavor does not support this. Instead, you will have to filter the matches after the fact to discard those you don't want.
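To show what the lookbehind version looks like in a flavor that does support it, here is a sketch using Python's re module. That choice is an assumption on my part, since the question targets a different engine. Python needs fixed-width lookbehinds, so each abbreviation gets its own (?<!...) group; the abbreviation list comes from the question and the function name is hypothetical.

```python
import re

# A period that is NOT preceded by Dr, Mr or Gov, and IS followed by
# whitespace plus a capital letter, or by the end of the text.
SENTENCE_END = re.compile(r"(?<!Dr)(?<!Mr)(?<!Gov)\.(?=\s+[A-Z]|\s*$)")

TEXT = ("The quick brown fox jumps over the lazy dog. Once upon a time "
        "Dr. Sanches, Mr. Parsons and Gov. Mason went to the store. Hello World.")


def sentence_end_positions(text: str) -> list:
    """Return the index of every period judged to end a sentence."""
    return [m.start() for m in SENTENCE_END.finditer(text)]


print(sentence_end_positions(TEXT))
```

Where lookbehind is unavailable, the filtering suggested above amounts to the same check done by hand: find every period, then discard those whose preceding characters spell one of the abbreviations.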
In languages like Perl/awk, there's the !~ operator:
$string !~ /(cat|dog)/
In Actionscript, you can just use NOT operator ! to negate a match. See here for reference. Also here for regex flavors comparison
You can do this:
!/(cat|dog)/
EDIT: You should've included the programming language in your question. It's ActionScript, right? I'm not an ActionScript coder, but AFAIK it's done like this:
var pattern2:RegExp = !/(cat|dog)/;
(?!NotThisStuff) is what you want, otherwise known as a negative lookahead group.
Unfortunately, it will not work as you intend. /(?!Dr\.)(\.)/ will still return the periods that belong to "Dr. Sanches" because of the second grouping. The Regex parser will say to itself, "Yep, this '.' isn't 'Dr.'" /((?!Dr).)/ won't work either, though I believe it should.
And what's more, you'll end up looking through all the sentence "ends" anyway. Actionscript doesn't have a "match all," only a match first. You have to set the global flag (or add g to the end of your regex) and call exec until your result object is null.
var string = 'The quick brown fox jumps over the lazy dog. Once upon a time Dr. Sanches, Mr. Parsons and Gov. Mason went to the store. Hello World.';
var regx:RegExp = /(?!Dr\.)(\.)/g;
var result:Object = regx.exec(string);
for (var i = 0; i < 10; i++) { // paranoia
if (result == null || result.index == 0) break; // again: paranoia
trace(result.index, result);
result = regx.exec(string);
}
// trace results:
//43 .,.
//64 .,.
//77 .,.
//94 .,.
//119 .,.
//132 .,.

A regex for version number parsing

I have a version number of the following form:
version.release.modification
where version, release and modification are either a set of digits or the '*' wildcard character. Additionally, any of these numbers (and any preceding .) may be missing.
So the following are valid and parse as:
1.23.456 = version 1, release 23, modification 456
1.23 = version 1, release 23, any modification
1.23.* = version 1, release 23, any modification
1.* = version 1, any release, any modification
1 = version 1, any release, any modification
* = any version, any release, any modification
But these are not valid:
*.12
*123.1
12*
12.*.34
Can anyone provide me a not-too-complex regex to validate and retrieve the release, version and modification numbers?
I'd express the format as:
"1-3 dot-separated components, each numeric except that the last one may be *"
As a regexp, that's:
^(\d+\.)?(\d+\.)?(\*|\d+)$
[Edit to add: this solution is a concise way to validate, but it has been pointed out that extracting the values requires extra work. It's a matter of taste whether to deal with this by complicating the regexp, or by processing the matched groups.
In my solution, the groups capture the "." characters. This can be dealt with using non-capturing groups as in ajborley's answer.
Also, the rightmost group will capture the last component, even if there are fewer than three components, and so for example a two-component input results in the first and last groups capturing and the middle one undefined. I think this can be dealt with by non-greedy groups where supported.
Perl code to deal with both issues after the regexp could be something like this:
@version = ();
@groups = ($1, $2, $3);
foreach (@groups) {
next if !defined;
s/\.//;
push @version, $_;
}
($major, $minor, $mod) = (@version, "*", "*");
Which isn't really any shorter than splitting on "."
]
Use regex and now you have two problems. I would split the thing on dots ("."), then make sure that each part is either a wildcard or set of digits (regex is perfect now). If the thing is valid, you just return correct chunk of the split.
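The split-then-validate idea could be sketched in Python as follows. I am assuming the rule stated earlier in the thread (one to three components, all numeric except that the last may be *), and the function name parse_by_split is hypothetical.

```python
import re


def parse_by_split(text: str):
    """Return (version, release, modification) with '*' for 'any', or None if invalid."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 3:
        return None
    *head, last = parts
    # Every component before the last must be plain digits.
    if not all(re.fullmatch(r"[0-9]+", part) for part in head):
        return None
    # The last component may be digits or the wildcard.
    if last != "*" and not re.fullmatch(r"[0-9]+", last):
        return None
    # Missing trailing components mean "any".
    return tuple(parts + ["*"] * (3 - len(parts)))


for candidate in ["1.23.456", "1.23", "1.*", "*", "12.*.34", "12*"]:
    print(candidate, parse_by_split(candidate))
```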
Thanks for all the responses! This is ace :)
Based on OneByOne's answer (which looked the simplest to me), I added some non-capturing groups (the '(?:' parts - thanks to VonC for introducing me to non-capturing groups!), so the groups that do capture only contain the digits or * character.
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Many thanks everyone!
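A sketch of how the captured groups from that pattern could be turned into the three fields, written in Python as an assumption about the target language. With fewer than three components the middle group is simply unset, so the groups that did match are compacted and then padded with *. The function name parse_version is hypothetical.

```python
import re

# The non-capturing-group pattern from this answer, unchanged.
VERSION_RE = re.compile(r"^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$")


def parse_version(text: str):
    """Return (version, release, modification) with '*' for 'any', or None if invalid."""
    match = VERSION_RE.match(text)
    if match is None:
        return None
    # Drop groups that did not participate, e.g. the middle one for "1.23".
    parts = [group for group in match.groups() if group is not None]
    parts += ["*"] * (3 - len(parts))
    return tuple(parts)


for candidate in ["1.23.456", "1.23", "1.*", "*", "*.12", "12.*.34"]:
    print(candidate, parse_version(candidate))
```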
This might work:
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
At the top level, "*" is a special case of a valid version number. Otherwise, it starts with a number. Then there are zero, one, or two ".nn" sequences, followed by an optional ".*". This regex would accept 1.2.3.* which may or may not be permitted in your application.
The code for retrieving the matched sequences, especially the (\.\d+){0,2} part, will depend on your particular regex library.
My 2 cents: I had this scenario: I had to parse version numbers out of a string literal.
(I know this is very different from the original question, but googling to find a regex for parsing version number showed this thread at the top, so adding this answer here)
So the string literal would be something like: "Service version 1.2.35.564 is running!"
I had to parse the 1.2.35.564 out of this literal. Taking a cue from @ajborley, my regex is as follows:
(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)
A small C# snippet to test this looks like below:
void Main()
{
Regex regEx = new Regex(@"(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)", RegexOptions.Compiled);
Match version = regEx.Match("The Service SuperService 2.1.309.0) is Running!");
version.Value.Dump("Version using RegEx"); // Prints 2.1.309.0
}
I had a requirement to search for and match version numbers that follow the Maven convention or are even just a single digit, but with no qualifier in any case. It was peculiar and took me some time, but I came up with this:
'^[0-9][0-9.]*$'
This makes sure the version:
Starts with a digit
Can have any number of digits
Contains only digits and '.'
One drawback is that the version can even end with '.', but the pattern can handle a version of indefinite length (crazy versioning, if you want to call it that).
Matches:
1.2.3
1.09.5
3.4.4.5.7.8.8.
23.6.209.234.3
If you are unhappy with the trailing '.', maybe you can combine this with endswith logic.
Don't know what platform you're on but in .NET there's the System.Version class that will parse "n.n.n.n" version numbers for you.
I've seen a lot of answers, but I have a new one. It works for me at least. I've added a new restriction: no component (major, minor or patch) may start with a zero followed by other digits.
01.0.0 is not valid
1.0.0 is valid
10.0.10 is valid
1.0.0000 is not valid
^(?:(0\\.|([1-9]+\\d*)\\.))+(?:(0\\.|([1-9]+\\d*)\\.))+((0|([1-9]+\\d*)))$
It's based on a previous one, but I find this solution better... for me ;)
Enjoy!!!
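The no-leading-zero rule on its own can be written compactly as 0|[1-9][0-9]* per component. Here is a Python sketch checked against the four examples listed above. It assumes exactly three components, which is narrower than the pattern in this answer, and the function name is hypothetical.

```python
import re

# One component: a lone zero, or a non-zero digit followed by any digits.
COMPONENT = r"(0|[1-9][0-9]*)"
STRICT_VERSION = re.compile(rf"{COMPONENT}\.{COMPONENT}\.{COMPONENT}")


def is_strict_version(text: str) -> bool:
    """True for major.minor.patch where no component has a leading zero."""
    return STRICT_VERSION.fullmatch(text) is not None


for candidate in ["01.0.0", "1.0.0", "10.0.10", "1.0.0000"]:
    print(candidate, is_strict_version(candidate))
```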
I tend to agree with split suggestion.
I've created a "tester" for your problem in Perl:
#!/usr/bin/perl -w
@strings = ( "1.2.3", "1.2.*", "1.*","*" );
%regexp = ( svrist => qr/(?:(\d+)\.(\d+)\.(\d+)|(\d+)\.(\d+)|(\d+))?(?:\.\*)?/,
onebyone => qr/^(\d+\.)?(\d+\.)?(\*|\d+)$/,
greg => qr/^(\*|\d+(\.\d+){0,2}(\.\*)?)$/,
vonc => qr/^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$/,
ajb => qr/^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$/,
jrudolph => qr/^(((\d+)\.)?(\d+)\.)?(\d+|\*)$/
);
foreach my $r (keys %regexp){
my $reg = $regexp{$r};
print "Using $r regexp\n";
foreach my $s (@strings){
print "$s : ";
if ($s =~m/$reg/){
my ($main, $maj, $min,$rev,$ex1,$ex2,$ex3) = ("any","any","any","any","any","any","any");
$main = $1 if ($1 && $1 ne "*") ;
$maj = $2 if ($2 && $2 ne "*") ;
$min = $3 if ($3 && $3 ne "*") ;
$rev = $4 if ($4 && $4 ne "*") ;
$ex1 = $5 if ($5 && $5 ne "*") ;
$ex2 = $6 if ($6 && $6 ne "*") ;
$ex3 = $7 if ($7 && $7 ne "*") ;
print "$main $maj $min $rev $ex1 $ex2 $ex3\n";
}else{
print " nomatch\n";
}
}
print "------------------------\n";
}
Current output:
> perl regex.pl
Using onebyone regexp
1.2.3 : 1. 2. 3 any any any any
1.2.* : 1. 2. any any any any any
1.* : 1. any any any any any any
* : any any any any any any any
------------------------
Using svrist regexp
1.2.3 : 1 2 3 any any any any
1.2.* : any any any 1 2 any any
1.* : any any any any any 1 any
* : any any any any any any any
------------------------
Using vonc regexp
1.2.3 : 1.2. 3 any any any any any
1.2.* : 1. 2 .* any any any any
1.* : any any any 1 any any any
* : any any any any any any any
------------------------
Using ajb regexp
1.2.3 : 1 2 3 any any any any
1.2.* : 1 2 any any any any any
1.* : 1 any any any any any any
* : any any any any any any any
------------------------
Using jrudolph regexp
1.2.3 : 1.2. 1. 1 2 3 any any
1.2.* : 1.2. 1. 1 2 any any any
1.* : 1. any any 1 any any any
* : any any any any any any any
------------------------
Using greg regexp
1.2.3 : 1.2.3 .3 any any any any any
1.2.* : 1.2.* .2 .* any any any any
1.* : 1.* any .* any any any any
* : any any any any any any any
------------------------
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Perhaps a more concise one could be :
^(?:(\d+)\.){0,2}(\*|\d+)$
This can then be enhanced to 1.2.3.4.5.* or restricted exactly to X.Y.Z using * or {2} instead of {0,2}
This should work for what you stipulated. It hinges on the wild card position and is a nested regex:
^((\*)|([0-9]+(\.((\*)|([0-9]+(\.((\*)|([0-9]+)))?)))?))$
For parsing version numbers that follow these rules:
- Are only digits and dots
- Cannot start or end with a dot
- Cannot be two dots together
This one did the trick to me.
^(\d+)((\.{1}\d+)*)(\.{0})$
Valid cases are:
1, 0.1, 1.2.1
Another try:
^(((\d+)\.)?(\d+)\.)?(\d+|\*)$
This gives the three parts in groups 3, 4 and 5, BUT:
They are aligned to the right, so the first non-null one of 3, 4 or 5 gives the version field.
1.2.3 gives 1,2,3
1.2.* gives 1,2,*
1.2 gives null,1,2
* gives null,null,*
1.* gives null,1,*
My take on this, as a good exercise - vparse, which has a tiny source, with a simple function:
function parseVersion(v) {
var m = v.match(/\d*\.|\d+/g) || [];
v = {
major: +m[0] || 0,
minor: +m[1] || 0,
patch: +m[2] || 0,
build: +m[3] || 0
};
v.isEmpty = !v.major && !v.minor && !v.patch && !v.build;
v.parsed = [v.major, v.minor, v.patch, v.build];
v.text = v.parsed.join('.');
return v;
}
Sometimes version numbers might contain alphanumeric minor information (e.g. 1.2.0b or 1.2.0-beta). In this case I am using this regex:
([0-9]{1,4}(\.[0-9a-z]{1,6}){1,5})
(?ms)^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$
Does exactly match your 6 first examples, and rejects the 4 others
group 1: major or major.minor or '*'
group 2 if exists: minor or *
group 3 if exists: *
You can remove '(?ms)'
I used it to indicate to this regexp to be applied on multi-lines through QuickRex
This matches 1.2.3.* too
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
I would propose the less elegant:
(\*|\d+(\.\d+)?(\.\*)?|\d+\.\d+\.\d+)
Keep in mind regexp are greedy, so if you are just searching within the version number string and not within a bigger text, use ^ and $ to mark start and end of your string.
The regexp from Greg seems to work fine (just gave it a quick try in my editor), but depending on your library/language the first part can still match the "*" within the wrong version numbers. Maybe I am missing something, as I haven't used Regexp for a year or so.
This should make sure you can only find correct version numbers:
^(\*|\d+(\.\d+)*(\.\*)?)$
edit: actually greg added them already and even improved his solution, I am too slow :)
It seems pretty hard to have a regex that does exactly what you want (i.e. accept only the cases that you need, reject all others, and return some groups for the three components). I've given it a try and come up with this:
^(\*|(\d+(\.(\d+(\.(\d+|\*))?|\*))?))$
IMO (I've not tested extensively) this should work fine as a validator for the input, but the problem is that this regex doesn't offer a way of retrieving the components. For that you still have to do a split on period.
This solution is not all-in-one, but most times in programming it doesn't need to be. Of course this depends on other restrictions that you might have in your code.
Specifying XSD elements:
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\..*)?"/>
</xs:restriction>
</xs:simpleType>
One more solution:
^([1-9]\d*(\.[1-9]\d*)*(\.\*)?|\*)$
I found this, and it works for me:
/(\^|\~?)(\d|x|\*)+\.(\d|x|\*)+\.(\d|x|\*)+/
/^([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})(\-(alpha|beta|rc|HP|CP|SP|hp|cp|sp)[1-9]\d*)?(\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?(\.[0-9a-zA-Z]+)?$/
A normal version: ([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})
A Pre-release or patched version: (\-(alpha|beta|rc|EP|HP|CP|SP|ep|hp|cp|sp)[1-9]\d*)? (Extension Pack, Hotfix Pack, Coolfix Pack, Service Pack)
Customized version: (\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?
Internal version: (\.[0-9a-zA-Z]+)?