I am trying to replace each regular expression match with a modified version of the matched text.
Following is the column in my DataFrame.
df['newcolumn']
0 Ther was a quick brown appl_product_type in ("eds") where blah blan appl_Cust_type =("value","value")
1 Ther was a quick brown appl_product_type = ("EDS") where blah blan appl_Cust_type =("value","value")
2 Ther was a quick brown appl_product_type in ("eds") where blah b
3 Ther was a quick brown appl_product_type in = ("EDS") where blah blan appl_Cust_type = ("value")
4 Ther was a quick brown where blah blan appl_Cust_type
Name: newcolumn, dtype: object
I want to replace every occurrence of strings like "appl_product_type = ('EDS')" with "upper(appl_product_type) = ('EDS')".
I am using the following code but getting an error:
newcolumn.replace(value='upper\[\w]+\s+[in=]+[\s+\([\"\w+\,+\s+]+\)', regex='[\w]+\s+[in=]+[\s+\([\"\w+\,+\s+]+\)')
error: bad escape \w at position 7
Is there a way to solve this?
A couple of things:
You can't use \w in your replacement value and expect it to know what to fill in. The replacement has to refer back to a captured group, such as \1.
Your regex as written is badly formed. Use raw strings (r'') to make regex strings simpler.
Your question is unclear: you ask about one specific format, while your regex attempts to catch a lot more.
Here is a slightly clearer version of what you attempted, though I am unsure whether it is exactly what you wanted, given the ambiguity in your question:
df['newcolumn'] = df['newcolumn'].replace({r'([\w_]+\s+(?:in|=|\s)+\(\"(?:\w+\"(?:\,)?(?:\s+)?)+\))' : r'upper(\1)'}, regex=True)
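Note that the one-liner above puts the whole match in group 1, so upper(...) wraps the entire comparison. If the goal is to wrap only the column name, as in the question's example, one option is to capture the name and the rest of the comparison in separate groups. The following is a minimal sketch using Python's standard re module rather than pandas. The function name add_upper and the exact pattern are illustrative assumptions, and the pattern only covers the double-quoted value lists shown in the sample rows.

```python
import re

# Group 1: the column name. Group 2: the operator ("in", "=", or "in =")
# followed by a parenthesised list of double-quoted values.
# Assumption: this pattern is derived from the sample rows; it is not a SQL parser.
COMPARISON = re.compile(
    r'(\w+)((?:\s+in\b\s*=?|\s*=)\s*\("[^"]*"(?:\s*,\s*"[^"]*")*\s*\))'
)

def add_upper(text):
    """Wrap the column name of each matched comparison in upper(...)."""
    return COMPARISON.sub(r'upper(\1)\2', text)

print(add_upper('brown appl_product_type = ("EDS") where blah'))
# brown upper(appl_product_type) = ("EDS") where blah
```

With pandas, the same pattern and replacement should carry over to df['newcolumn'].str.replace(COMPARISON.pattern, r'upper(\1)\2', regex=True), since the replacement string is handled the same way as in re.sub.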
ISBN numbers come with random dash positions
978-618-81543-7-7
9786-18-81-5437-7
97-86-18-81-5437-7
How could I extract them every time without knowing the dash positions?
Just delete every - with your language of choice.
With Ruby :
"978-618-81543-7-7".delete('-')
#=> "9786188154377"
If you really want to use a regex :
"978-618-81543-7-7".gsub(/-/,'')
If you have multiple lines with isbns :
isbns = "978-618-81543-7-7
9786-18-81-5437-7
97-86-18-81-5437-7"
p isbns.scan(/\b[-\d]+\b/).map{|number_and_dash| number_and_dash.delete('-')}
#=> ["9786188154377", "9786188154377", "9786188154377"]
This is a simple use of regex. In Python:
import re
nums = re.findall(r'\d+', isbnstring)
This will give a list of the digit groups. To join them into a string:
isbn = ''.join(nums)
As per the comments below, if you're working with a file you can process it line by line:
with open(isbnfile) as desc:
    for isbnstring in desc:
        # Join the digit groups found on each line
        isbn = ''.join(re.findall(r'\d+', isbnstring))
That is just one example; there are many ways to do this. From the command line, sed is a good choice as well:
sed 's/-//g' isbnfile > newisbnfile
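For text that holds several ISBNs at once, the Ruby scan-and-delete approach shown earlier can be sketched in Python as well. The function name normalize_isbns is a hypothetical helper, and the sketch only strips dashes; it does not validate check digits.

```python
import re

def normalize_isbns(text):
    """Return every run of digits and dashes in `text` with the dashes removed."""
    # \b[-\d]+\b is the same pattern as the Ruby scan: digits and dashes
    # that start and end on a word boundary.
    return [candidate.replace('-', '') for candidate in re.findall(r'\b[-\d]+\b', text)]

isbns = "978-618-81543-7-7\n9786-18-81-5437-7\n97-86-18-81-5437-7"
print(normalize_isbns(isbns))
# ['9786188154377', '9786188154377', '9786188154377']
```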
I regularly work with Excel Sheets where some fields (observations) contain large amounts of text content in a part structured form (at least visually)
So the content of a single Cell/Obs might be somewhat like this:
My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog
In Excel I've created a few functions that I can use to look for a string within a cell. Say the data is in "A1";
in "A2" I can use "=GETPOSTCODE(A1)", where the function is:
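Function GetPostCode(PostCode As Range) As String
    ' Late-bound VBScript regex object, created here so the function is self-contained
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "[A-Z]{3}\d{3,}\b\w*"
    regex.IgnoreCase = True
    regex.MultiLine = True
    Dim X As Object, x1 As Object
    Set X = regex.Execute(PostCode.Value)
    ' Return only the first match, upper-cased
    For Each x1 In X
        GetPostCode = UCase(x1.Value)
        Exit For
    Next
End Function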
What kind of structures/functions could I use in R to accomplish this?
The cells really contain much more data than that; this is purely an example, and I have a number of different "get" functions with different regexes.
I've had a good look at all the grep-type commands but am struggling with limited (developing) R skills.
I've been working along these lines but have pretty much stalled (textfield is the column containing my text). I can get a list of all the rows that contain a post code, but not just the post code itself:
df$postcode <- df[(df$textfield = grep("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield), ]
Any Help appreciated!
I think you need a combination of regexpr or gregexpr (to find the matches in the string) and regmatches (to extract the matching parts of the strings):
x <- "My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"
> regmatches(x, regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE))
[1] "ABC123"
Other options probably include str_extract from stringr or stri_extract from stringi packages.
I'm trying to use RegExReplace to pre-process some text before it gets parsed for use in an Access database. Currently I have been defining a growing number of string patterns into a table, then use the stock Replace() function in VBA using that table. Works OK, but misses the mark in a few areas; I am pretty sure regular expressions will be a better long-term solution for me, but I am completely clueless how to construct them.
I'd like to see if the smart folks here can give me a leg up on the task using a few actual examples from my data, by illustrating the regex strings that will produce the desired result:
1. 6 IN => 6IN
2. 12.3 IN X 2 YD => 12.3IN_X_2YD
3. 6IN X 4IN => 6IN_X_4IN
4. 8X120MM => 8_X_120MM
5. 1 1/2" => 1.5IN
6. CAT, DOG => CAT DOG
7. CAT,DOG => CAT DOG
8. CAT ,DOG => CAT DOG
9. CAT , DOG => CAT DOG
My patterns fail in ways like: CATHETER INFUSION => CATHETERINFUSION
I will be using a multi-pass approach rather than attempting to come up with some terribly complex expressions.
Can anyone offer some initial guidance on any of these samples? I'm confident I will be able to leverage them and extend as needed.
[Edit:] I did just find a few helpful examples:
NewStr := RegExReplace("abc123123", "123$", "xyz") ; Returns "abc123xyz" because the $ allows a match only at the end.
NewStr := RegExReplace("abc123", "i)^ABC") ; Returns "123" because a match was achieved via the case-insensitive option.
NewStr := RegExReplace("abcXYZ123", "abc(.*)123", "aaa$1zzz") ; Returns "aaaXYZzzz" by means of the $1 backreference.
NewStr := RegExReplace("abc123abc456", "abc\d+", "", ReplacementCount) ; Returns "" and stores 2 in ReplacementCount.
[Edit 2]: Making good progress!
strText = "BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"
strResult = RegExReplace(strText, "(,|\s+)", " ", True)
strResult = RegExReplace(strResult, "\s+(IN|FT|YD)\s+", "$1 ", True)
strResult = RegExReplace(strResult, "\s+X\s+", "_X_", True)
Produces:
BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE
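Some regexps that might be useful, written as /pattern/replacement/:
/(\d)\s+IN\b/$1IN/
/\s+X\s+/_X_/
/(\d)X(\d)/$1_X_$2/
Requiring a digit before IN and a word boundary after it keeps the first pattern from firing inside "CATHETER INFUSION". The last pattern uses capture groups because a non-capturing group such as (?:\d) still consumes the digit, which the replacement would then drop.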
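The multi-pass idea from the question can be sketched as an ordered list of (pattern, replacement) pairs applied one after another. This is a Python illustration rather than VBA. The pass list, the unit list (IN, FT, YD, MM) and the name normalize are assumptions drawn from the sample rows, input is assumed to be upper-case already, and the fraction case (1 1/2" to 1.5IN) is deliberately left out.

```python
import re

# Ordered passes: each entry is (compiled pattern, replacement). Order matters.
PASSES = [
    (re.compile(r'\s*,\s*'), ' '),                     # "CAT , DOG" -> "CAT DOG"
    (re.compile(r'(\d)\s+(IN|FT|YD|MM)\b'), r'\1\2'),  # "6 IN" -> "6IN", digit required
    (re.compile(r'(\d)X(\d)'), r'\1_X_\2'),            # "8X120MM" -> "8_X_120MM"
    (re.compile(r'\s+X\s+'), '_X_'),                   # "6IN X 4IN" -> "6IN_X_4IN"
    (re.compile(r'\s+'), ' '),                         # collapse leftover whitespace
]

def normalize(text):
    """Apply every pass in order and trim the result."""
    for pattern, replacement in PASSES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(normalize("BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"))
# BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE
```

Because the unit pass only fires after a digit, "CATHETER INFUSION" passes through unchanged. A standalone word X between two other words would still be joined by the fourth pass, so that rule may need tightening for real data.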
With the ~ operator, it is simple to find all lines in a table for which a column matches a given regexp pattern :
SELECT description from book where description ~ 'hell?o'
matches lines containing hello or helo
Instead of the description, I would like to SELECT a snippet of text around each occurrence of the pattern, so if a line contains
description = "aaaheloaaabbbhellobbbcccheloccc"
I would like 3 lines as output :
"aaaheloaaa"
"bbbhellobbb"
"cccheloccc"
which I call a "grep-like" query because it can show extracts of the column where the match is found.
Try something like:
SELECT
regexp_matches(description,'(.{0,3})('||'hell?o'||')(.{0,3})','g')
FROM
book
WHERE description ~ 'hell?o'
The WHERE clause restricts the query to rows that contain at least one match for the regexp.
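To see what the windowed pattern in the query above captures, the same idea can be tried outside the database. This is a small Python sketch, not PostgreSQL; the function name snippets and its parameters are hypothetical.

```python
import re

def snippets(description, pattern=r'hell?o', context=3):
    """Return each match of `pattern` with up to `context` characters on either side."""
    # Same shape as the SQL pattern: (.{0,3})(hell?o)(.{0,3})
    windowed = re.compile(r'.{0,%d}(?:%s).{0,%d}' % (context, pattern, context))
    return [m.group(0) for m in windowed.finditer(description)]

print(snippets('aaaheloaaabbbhellobbbcccheloccc'))
# ['aaaheloaaa', 'bbbhellobbb', 'cccheloccc']
```

Matches do not overlap, so when two occurrences sit closer together than the context width, the second snippet gets less leading context than requested.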
I think you need the regexp_split_to_table function:
http://www.postgresql.org/docs/current/static/functions-matching.html
And you can use it like this:
SELECT foo FROM regexp_split_to_table('the quick brown fox jumped over the lazy dog', E'\\s+') AS foo;
The return:
foo
--------
the
quick
brown
fox
jumped
over
the
lazy
dog
(9 rows)
So in your case this would look like:
select res from book, regexp_split_to_table(book.description, E'...hell?o...') res;
I can use "Alternation" in a regular expression to match any occurrence of "cat" or "dog" like this:
(cat|dog)
Is it possible to NEGATE this alternation, and match anything that is NOT "cat" or "dog"?
If so, how?
For Example:
Let's say I'm trying to match END OF SENTENCE in English, in an approximate way.
To Wit:
(\.)(\s+[A-Z][^.]|\s*?$)
With the following paragraph:
The quick brown fox jumps over the lazy dog. Once upon a time Dr. Sanches, Mr. Parsons and Gov. Mason went to the store. Hello World.
I incorrectly find "end of sentence" at Dr., Mr., and Gov.
(I'm testing using http://regexpal.com/ in case you want to see what I'm seeing with the above example)
Since this is incorrect, I would like to say something like:
!(Dr\.|Mr\.|Gov\.)(\.)(\s+[A-Z][^.]|\s*?$)
Of course, this isn't working, which is why I seek help.
I also tried !/(Dr.|Mr.|Gov.)/, and !~ which were no help whatsoever.
How can I avoid matches for "Dr.", "Mr." and "Gov.", etc?
Thanks in advance.
This is not possible in regex flavors without lookbehind. You would normally do this using negative lookbehind (?<!…), but JavaScript's regex flavor did not support it before ES2018. Without lookbehind, you will have to filter the matches after the fact to discard those you don't want.
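In a flavor that does have lookbehind, the negative-lookbehind approach can be sketched as follows. This uses Python's re module, which is not the flavor the asker is testing with, and it swaps the consuming second group of the original pattern for a lookahead so that only the period itself is matched. The abbreviation list is just the three from the question.

```python
import re

# Lookbehinds must be fixed-width in Python's re module,
# so each abbreviation gets its own (?<!...) assertion.
SENTENCE_END = re.compile(r'(?<!Dr)(?<!Mr)(?<!Gov)\.(?=\s+[A-Z]|\s*$)')

text = ('The quick brown fox jumps over the lazy dog. Once upon a time '
        'Dr. Sanches, Mr. Parsons and Gov. Mason went to the store. Hello World.')

# Index of every period that is treated as the end of a sentence
ends = [m.start() for m in SENTENCE_END.finditer(text)]
print(ends)  # [43, 119, 132]
```

The periods after "Dr", "Mr" and "Gov" are skipped; the three remaining indexes are the real sentence ends.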
In languages like Perl and awk, there's the !~ operator:
$string !~ /(cat|dog)/
In ActionScript, you can just use the NOT operator ! to negate a match. See here for reference, and here for a comparison of regex flavors.
You can do this:
!/(cat|dog)/
EDIT: You should have included the programming language in your question. It's ActionScript, right? I'm not an ActionScript coder, but AFAIK it's done like this:
var pattern2:RegExp = !/(cat|dog)/;
(?!NotThisStuff) is what you want, otherwise known as a negative lookahead group.
Unfortunately, it will not work as you intend. /(?!Dr\.)(\.)/ will still return the periods that belong to "Dr. Sanches" because of the second grouping. The Regex parser will say to itself, "Yep, this '.' isn't 'Dr.'" /((?!Dr).)/ won't work either, though I believe it should.
And what's more, you'll end up looking through all the sentence "ends" anyway. Actionscript doesn't have a "match all," only a match first. You have to set the global flag (or add g to the end of your regex) and call exec until your result object is null.
var string = 'The quick brown fox jumps over the lazy dog. Once upon a time Dr. Sanches, Mr. Parsons and Gov. Mason went to the store. Hello World.';
var regx:RegExp = /(?!Dr\.)(\.)/g;
var result:Object = regx.exec(string);
for (var i = 0; i < 10; i++) { // paranoia
    if (result == null || result.index == 0) break; // again: paranoia
    trace(result.index, result);
    result = regx.exec(string);
}
// trace results:
//43 .,.
//64 .,.
//77 .,.
//94 .,.
//119 .,.
//132 .,.