With the ~ operator, it is simple to find all rows in a table for which a column matches a given regexp pattern:
SELECT description from book where description ~ 'hell?o'
matches lines containing hello or helo
Instead of the description, I would like to SELECT a snippet of text around each occurrence of the pattern, so if a row contains
description = "aaaheloaaabbbhellobbbcccheloccc"
I would like 3 lines as output:
"aaaheloaaa"
"bbbhellobbb"
"cccheloccc"
which I call a "grep-like" query because it can show extracts of the column where the match is found.
Try something like:
SELECT
regexp_matches(description,'(.{0,3})('||'hell?o'||')(.{0,3})','g')
FROM
book
WHERE description ~ 'hell?o'
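The WHERE clause is optional here: regexp_matches is a set-returning function, so a row with no match for the regexp produces no output rows at all. The clause only acts as an explicit pre-filter.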
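For comparison, the same "context around each match" idea can be sketched outside the database. This is only an illustrative Python sketch, not part of the answer above; the function name grep_snippets and its context parameter are made up for the example.

```python
import re

def grep_snippets(text, pattern, context=3):
    """Return every match of `pattern` with up to `context` characters on each side."""
    # Same shape as the SQL pattern: (.{0,3})(pattern)(.{0,3})
    wrapped = re.compile(rf".{{0,{context}}}(?:{pattern}).{{0,{context}}}")
    # finditer yields non-overlapping matches, left to right
    return [m.group(0) for m in wrapped.finditer(text)]

print(grep_snippets("aaaheloaaabbbhellobbbcccheloccc", "hell?o"))
# ['aaaheloaaa', 'bbbhellobbb', 'cccheloccc']
```

Because matches do not overlap, two hits that sit closer together than the context width will share their surrounding characters, in Python as in regexp_matches.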
I think you need the regexp_split_to_table function:
http://www.postgresql.org/docs/current/static/functions-matching.html
And you can use it like this:
SELECT foo FROM regexp_split_to_table('the quick brown fox jumped over the lazy dog', E'\\s+') AS foo;
The return:
foo
--------
the
quick
brown
fox
jumped
over
the
lazy
dog
(9 rows)
So in your case this would look like:
select res from book, regexp_split_to_table(book.description, E'...hell?o...') res;
I am trying to replace a regular expression match with modified regular expression.
Following is the column in my DataFrame.
df['newcolumn']
0 Ther was a quick brown appl_product_type in ("eds") where blah blan appl_Cust_type =("value","value")
1 Ther was a quick brown appl_product_type = ("EDS") where blah blan appl_Cust_type =("value","value")
2 Ther was a quick brown appl_product_type in ("eds") where blah b
3 Ther was a quick brown appl_product_type in = ("EDS") where blah blan appl_Cust_type = ("value")
4 Ther was a quick brown where blah blan appl_Cust_type
Name: newcolumn, dtype: object
I want to replace every occurrence of strings like "appl_product_type = ('EDS')" with "upper(appl_product_type) = ('EDS')".
I am using the following code but getting an error:
newcolumn.replace(value='upper\[\w]+\s+[in=]+[\s+\([\"\w+\,+\s+]+\)', regex='[\w]+\s+[in=]+[\s+\([\"\w+\,+\s+]+\)')
error: bad escape \w at position 7
Is there a way to solve this? Please help.
A couple of things:
You can't use \w in the replacement value and expect it to know what to fill in. Capture the text in a group and refer to it with a backreference such as \1.
Your regex as written is badly formatted. Use r'' raw strings to keep regex strings simple.
Your question is unclear: you ask about one specific format, while your regex attempts to catch a lot more.
I have a somewhat clearer version of what you attempted, but I am unsure whether it is exactly what you wanted, given the ambiguity in your question:
df['newcolumn'] = df['newcolumn'].replace({r'([\w_]+\s+(?:in|=|\s)+\(\"(?:\w+\"(?:\,)?(?:\s+)?)+\))' : r'upper(\1)'}, regex=True)
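Note that the line above wraps the whole condition in upper(...). If only the column name should be wrapped, as the question's example suggests, two capture groups are enough. Here is a rough sketch with plain re (the same pattern and replacement should also work as the key and value of the dict passed to replace with regex=True); the names CONDITION and wrap_column_in_upper are made up, and the pattern is an assumption about what the conditions look like.

```python
import re

# Group 1: the column name.
# Group 2: the operator ("in", "=" or "in =") up to and including the opening parenthesis.
CONDITION = re.compile(r'(\w+)(\s*(?:in|=|\s)+\()')

def wrap_column_in_upper(text):
    # \1 and \2 re-insert the captured groups; only the column name gets wrapped
    return CONDITION.sub(r'upper(\1)\2', text)

print(wrap_column_in_upper('a quick brown appl_product_type = ("EDS") where blah'))
# a quick brown upper(appl_product_type) = ("EDS") where blah
```

Rows without a parenthesised condition are returned unchanged.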
I regularly work with Excel Sheets where some fields (observations) contain large amounts of text content in a part structured form (at least visually)
So the content of a single Cell/Obs might be somewhat like this:
My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog
In Excel I've created a few functions which I can use to look for a string within the cell. Let's say the data is in "A1";
in "A2" I can use "=GETPOSTCODE(A1)", where the function is:
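Function GetPostCode(PostCode As Range) As String
    ' Late-bound regex object (the snippet as posted used regex without creating it)
    Dim regex As Object, X As Object, x1 As Object
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "[A-Z]{3}\d{3,}\b\w*"
    regex.IgnoreCase = True
    regex.MultiLine = True
    Set X = regex.Execute(PostCode.Value)
    ' Return only the first match, upper-cased
    For Each x1 In X
        GetPostCode = UCase(x1)
        Exit For
    Next
End Function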
What kind of structures/functions could I use in R to accomplish this?
The cells really contain much more data than that; this is purely an example, and I have a number of different "get" functions with different regexes.
I've had a good look at all the grep-type commands but am struggling with limited/developing R skills.
I've been working along the following lines, but have pretty much stalled (textfield is the column with my text in it). I can get a list of all the rows that contain a post code, but not JUST the post code:
df$postcode <- df[(df$textfield = grep("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield), ]
Any Help appreciated!
I think you need a combination of regexpr or gregexpr (to find the matches in the string) and regmatches to extract the matching parts of the strings:
x <- "My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"
> regmatches(x, regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE))
[1] "ABC123"
Other options probably include str_extract from stringr or stri_extract from stringi packages.
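For readers who want to compare with another language, here is the same "find the first match and return just the matched text" logic as a small Python sketch. It mirrors the VBA function from the question; get_postcode is a made-up name, and the pattern is taken unchanged from the question.

```python
import re

# Same pattern as in the question; IGNORECASE mirrors regex.IgnoreCase = True
POSTCODE = re.compile(r"[A-Z]{3}\d{3,}\b\w*", re.IGNORECASE)

def get_postcode(text):
    """Return the first post code found in `text`, upper-cased, or "" if there is none."""
    match = POSTCODE.search(text)
    return match.group(0).upper() if match else ""

cell = """My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"""
print(get_postcode(cell))  # ABC123
```

Applied to a whole column, this would be a list comprehension such as [get_postcode(t) for t in texts], which is the role regmatches plays in the R answer.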
I'm trying to use RegExReplace to pre-process some text before it gets parsed for use in an Access database. Currently I have been defining a growing number of string patterns into a table, then use the stock Replace() function in VBA using that table. Works OK, but misses the mark in a few areas; I am pretty sure regular expressions will be a better long-term solution for me, but I am completely clueless how to construct them.
I'd like to see if the smart folks here can give me a leg up on the task using a few actual examples from my data, by illustrating the regex strings that will produce the desired result:
1. 6 IN => 6IN
2. 12.3 IN X 2 YD => 12.3IN_X_2YD
3. 6IN X 4IN => 6IN_X_4IN
4. 8X120MM => 8_X_120MM
5. 1 1/2" => 1.5IN
6. CAT, DOG => CAT DOG
7. CAT,DOG => CAT DOG
8. CAT ,DOG => CAT DOG
9. CAT , DOG => CAT DOG
My patterns fail in ways like: CATHETER INFUSION => CATHETERINFUSION
I will be using a multi-pass approach rather than attempting to come up with some terribly complex expressions.
Can anyone offer some initial guidance on any of these samples? I'm confident I will be able to leverage them and extend as needed.
[Edit:] I did just find a few helpful examples:
NewStr := RegExReplace("abc123123", "123$", "xyz") ; Returns "abc123xyz" because the $ allows a match only at the end.
NewStr := RegExReplace("abc123", "i)^ABC") ; Returns "123" because a match was achieved via the case-insensitive option.
NewStr := RegExReplace("abcXYZ123", "abc(.*)123", "aaa$1zzz") ; Returns "aaaXYZzzz" by means of the $1 backreference.
NewStr := RegExReplace("abc123abc456", "abc\d+", "", ReplacementCount) ; Returns "" and stores 2 in ReplacementCount.
[Edit 2]: Making good progress!
strText = "BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"
strResult = RegExReplace(strText, "[,\s]+", " ", True)
strResult = RegExReplace(strResult, "\s+(IN|FT|YD)\s+", "$1 ", True)
strResult = RegExReplace(strResult, "\s+X\s+", "_X_", True)
Produces:
BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE
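Some regexps that might be useful (written as /pattern/replacement/):
/(\d)\s+IN\b/$1IN/
/\s+X\s+/_X_/
/(\d)X(\d)/$1_X_$2/
The digits are captured and re-inserted so that they are not lost in the replacement, and requiring a digit before IN avoids turning CATHETER INFUSION into CATHETERINFUSION.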
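To check how the passes interact, the multi-pass idea can be prototyped quickly outside VBA. The following is a rough Python sketch, assuming the units of interest are IN, FT and YD; the function name normalize is made up, and sample 5 (the fraction 1 1/2") is deliberately not handled.

```python
import re

# Each pass is (pattern, replacement); they are applied in order.
PASSES = [
    (r"[,\s]+", " "),                  # commas and runs of whitespace -> one space
    (r"(\d)\s+(IN|FT|YD)\b", r"\1\2"), # "6 IN" -> "6IN", only directly after a digit
    (r"\s+X\s+", "_X_"),               # " X " -> "_X_"
    (r"(\d)X(\d)", r"\1_X_\2"),        # "8X120" -> "8_X_120", digits kept via groups
]

def normalize(text):
    for pattern, replacement in PASSES:
        text = re.sub(pattern, replacement, text)
    return text.strip()

print(normalize("BANDAGE, ADHESIVE, 2 FT X 3.5 IN X 0.25MM, LATEX-FREE"))
# BANDAGE ADHESIVE 2FT_X_3.5IN_X_0.25MM LATEX-FREE
```

The same patterns should carry over to VBScript.RegExp, where the backreferences in the replacement are written $1 and $2 instead of \1 and \2.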
Given a string with certain words surrounded by stars, e.g.
The *quick* *brown* fox jumped over the *lazy* dog
can you transform the words surrounded by stars into uppercase versions, i.e.
The QUICK BROWN fox jumped over the LAZY dog
Given text in a column 'sentence' in table 'sentences', I can mark/extract the words as follows:
SELECT regexp_replace(sentence,'\*(.*?)\*','STARTUPPER\1ENDUPPER','g') FROM sentences;
but my first attempt at uppercase transforms doesn't work:
select regexp_replace(sentence,'\*(.*?)\*','' || upper('\1'),'g') from sentences;
I thought of using substring() to split the parts up after replacing the stars with start and end markers, but that would fail if there was more than one word starred.
You can create a PL/pgSQL function like:
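CREATE FUNCTION upper_asterisk(inp_str varchar)
RETURNS varchar AS $$
DECLARE t_match text[];
BEGIN
    -- regexp_matches returns one text[] per match; element 1 is the word between the stars
    FOR t_match IN SELECT regexp_matches(inp_str, '\*(.+?)\*', 'g')
    LOOP
        -- replace the starred word (stars included) with its upper-case form
        inp_str := replace(inp_str, '*' || t_match[1] || '*', upper(t_match[1]));
    END LOOP;
    RETURN inp_str;
END;
$$ LANGUAGE plpgsql;
(Haven't tested, may have bugs).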
Or use any available language to write such function inside DB.
Answer from the Postgresql mailing list:
Yeah, you cannot embed a function-call result in the "replace with" section;
it has to be a literal (with the group insertion meta-sequences allowed of
course).
I see two possible approaches.
1) Use pl/perl (or some variant thereof) which has facilities to do just
this.
2) Use regexp_matches(,,'g') to explode the input string into its components
parts. You can explode it so every character of the original string is in
the output with the different columns containing the "raw" and "to modify"
parts of each match. This would be done in a sub-query and then in the
parent query you would "string_agg(...)" the matches back together while
manipulating the columns needed "i.e., string_agg(c1 || upper(c3))"
HTH
David J.
SQL version, something like this:
SELECT string_agg(m, '')
FROM (
SELECT CASE WHEN n % 2 = 0 THEN upper(m) ELSE m END m
FROM (
SELECT m, row_number() OVER () n
FROM regexp_split_to_table( 'The *quick* *brown* fox j*ump*ed over the *lazy* dog' , '\*') m
) a
) a
split the string into a table with '*' and keep track of the row number
even rows should be UPPERized, if the first char is a '*', an empty row will be produced
glue all rows with string_agg
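The three steps above are easy to follow in a general-purpose language. A minimal Python sketch of the same split / upper-case every second piece / join logic, with a made-up function name upper_starred:

```python
def upper_starred(sentence):
    """Upper-case the pieces that sit between pairs of '*' and drop the stars."""
    pieces = sentence.split("*")
    # Pieces at odd 0-based positions (the even rows in the 1-based SQL numbering)
    # are the ones that were enclosed in stars.
    return "".join(
        piece.upper() if index % 2 == 1 else piece
        for index, piece in enumerate(pieces)
    )

print(upper_starred("The *quick* *brown* fox j*ump*ed over the *lazy* dog"))
# The QUICK BROWN fox jUMPed over the LAZY dog
```

As in the SQL version, a sentence that starts with a star simply yields an empty first piece, so the alternation stays in step.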
I have a string like this in PHP
<li>bla bla bla bla</li>
<li>hello hello</li>
<li>the brown fox didnt jump</li>
.
.
.
<li>aaaaaaarghhhh</li>
I want to get a string which contains the first X li elements (li tags included in the result string)
<li>.....first one....</li>
<li>.....second one....</li>
<li>.................</li>
<li>.....X one....</li>
How can this be done with a regex or something else?
I could remove the
</li>
then explode by
<li>
and get the first X elements of the array, then add the li tags again at the beginning and end of each element, but I think that's too dirty...
Any better ideas?
See if this regex works for you (replace the number 2 with the required number of li elements; the \s* allows for the line breaks between them):
((?:<li>.*?<\/li>\s*){2})
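The idea behind that regex (match each li element non-greedily and keep only the first X) can be tried out quickly in any language. A small Python sketch, just to illustrate the matching; first_li_items is a made-up name, and it assumes flat li elements without attributes or nesting:

```python
import re

def first_li_items(html, count):
    """Return the first `count` <li>...</li> elements, tags included, one per line."""
    # Non-greedy .*? stops at the nearest </li>; DOTALL lets an item span several lines
    items = re.findall(r"<li>.*?</li>", html, flags=re.DOTALL)
    return "\n".join(items[:count])

html = "<li>bla bla bla bla</li>\n<li>hello hello</li>\n<li>the brown fox didnt jump</li>"
print(first_li_items(html, 2))
# <li>bla bla bla bla</li>
# <li>hello hello</li>
```

In PHP the equivalent would be preg_match_all followed by array_slice and implode.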
How about parsing it as actual DOM elements using PHP Simple HTML DOM Parser?
You can download the script from here: http://sourceforge.net/projects/simplehtmldom/files/
If you load that script into your current script like this:
include_once("simple_html_dom.php");
Then it's as simple as:
$html = str_get_html("<li>bla bla bla bla</li>, etc, etc ......"); // parse the string into a DOM object so that ->find() is available
$list_array = array();
foreach($html->find('li') as $element) {
$list_array[] = $element->innertext;
}