Why does MariaDB regex give the contrary result?

I have a column that contains the following content:
+------------+
| name |
+------------+
| 你好世界 |
| HelloWorld |
| Hello世界  |
+------------+
and I hoped that
SELECT `name` FROM `table` WHERE `name` REGEXP '[u4e00-u9fa5]';
would give me only the rows containing Chinese, like this:
+------------+
| name |
+------------+
| 你好世界 |
+------------+
but it actually gives me the contrary result:
+------------+
| name |
+------------+
| HelloWorld |
| Hello世界  |
+------------+
I know that:
SELECT `name` FROM `table` WHERE `name` NOT REGEXP '[u4e00-u9fa5]';
works as expected, but I want to know why the MySQL REGEXP gives the contrary result. Is this the default setting, or did I make a mistake? Thanks in advance.

If you are checking to see if a utf8 string has CJK characters in it:
WHERE HEX(name) REGEXP '^(..)*E[456789]'
That will not match Chinese characters outside the Basic Multilingual Plane (BMP).
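One possible reading of the "contrary" result, sketched in Python rather than MariaDB: if the pattern reaches the regex engine exactly as written in the question (no backslashes), the class `[u4e00-u9fa5]` is just a set of ASCII characters plus the range `0`-`u`, so it matches rows containing Latin letters. The second function imitates the `HEX(name) REGEXP ...` check from the answer. The helper name `has_cjk_lead_byte` is made up for this illustration, and Python's `re` is only a stand-in for the database engine.

```python
import re

rows = ["你好世界", "HelloWorld", "Hello世界"]

# The class as written in the question: the literal characters u, 4, e, 0, 9, f, a, 5
# and the ASCII range '0'-'u'. No Chinese character falls inside it.
as_written = re.compile(r"[u4e00-u9fa5]")
matched_as_written = [r for r in rows if as_written.search(r)]


def has_cjk_lead_byte(name: str) -> bool:
    """Imitate HEX(name) REGEXP '^(..)*E[456789]' on the UTF-8 bytes of name."""
    hex_name = name.encode("utf-8").hex().upper()
    # (..)* keeps the match aligned on whole bytes; E4..E9 are the lead bytes
    # of three-byte UTF-8 sequences for U+4000 through U+9FFF.
    return re.search(r"^(..)*E[456789]", hex_name) is not None


matched_by_hex = [r for r in rows if has_cjk_lead_byte(r)]

print(matched_as_written)  # rows that contain ASCII characters from the class
print(matched_by_hex)      # rows that contain at least one character in the CJK range
```

Under these assumptions the first list reproduces the rows the question reports, while the hex-based check selects every row that contains a character in that range, including the mixed `Hello世界` row.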

Extract multiple values from a string for each id

I want to extract matches from a string column for each id. How can I achieve that?
+--------+---------------------------------------+
| id | text |
+--------+---------------------------------------+
| fsaf12 | Other Questions,Missing Document |
| sfas11 | Others,Missing Address,Missing Name |
+--------+---------------------------------------+
Desired output:
+--------+------------------+
| id | extracted |
+--------+------------------+
| fsaf12 | Other Questions |
| fsaf12 | Missing Document |
| sfas11 | Others |
| sfas11 | Missing Address |
| sfas11 | Missing Name |
+--------+------------------+
Here is the query for sample data: FIDDLE
You can use regexp_split_to_table for your requirement like below:
WITH t1 AS (
  SELECT 'fsaf12' AS id, 'Other Questions,Missing Document' AS text UNION ALL
  SELECT 'sfas11', 'Others,Missing Address,Missing Name'
)
SELECT id, regexp_split_to_table(text, ',') AS extracted
FROM t1;
OUTPUT
| id | extracted |
|-----------|-----------------------|
| fsaf12 | Other Questions |
| fsaf12 | Missing Document |
| sfas11 | Others |
| sfas11 | Missing Address |
| sfas11 | Missing Name |
Postgres is not my forte at all, but based on this older post on SO you could try to use unnest(). I included a TRIM() to remove possible trailing spaces after a split:
SELECT id, TRIM(unnest(string_to_array(text, ','))) as "extracted" FROM t1;
Or, if you want to use regexp_split_to_table():
SELECT id, regexp_split_to_table(text, '\s*,\s*') as "extracted" FROM t1;
Here we match zero or more whitespace characters, a literal comma, and again zero or more whitespace characters.
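To see what the `\s*,\s*` delimiter does outside the database, here is a minimal Python sketch of the same split. It is not a replacement for the SQL above; `split_rows` is a hypothetical helper name and the rows are the sample data from the question.

```python
import re

rows = [
    ("fsaf12", "Other Questions,Missing Document"),
    ("sfas11", "Others,Missing Address,Missing Name"),
]


def split_rows(rows):
    """Return one (id, extracted) pair per comma-separated item."""
    result = []
    for row_id, text in rows:
        # Split on a comma together with any whitespace around it.
        for part in re.split(r"\s*,\s*", text.strip()):
            result.append((row_id, part))
    return result


for row_id, extracted in split_rows(rows):
    print(row_id, "|", extracted)
```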

Regex priority of match (forward and rear looking regex)

I have a monster regex at the moment, and am currently looking at how this best functions.
My regex is listed below, and I am curious whether there is a way to prioritize the alternatives within one regex rather than just looking for a specific match wherever it may exist.
Example:
If my string has a match for ([\d]+/[\d]+) or ([\d]+ / [\d]+), it would pick that first.
If that match does not exist but ([\d]+-[\d]+) or ([\d]+ - [\d]+) does, it would pick that match.
After that, if ([\d]+) matches, it would pick that match as the end marker. If none of those exist, it would just move on to any of the other matches.
So my question is:
With Regex is there any way to prioritize which match to take first?
example: Some of my address strings are in the format of 1 - 12 example street,
often the regex will pull 12 example street rather than taking 1 - 12 example street.
Thanks!
The full regex is listed below:
New Regex("( ([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+) | CAR
SMOULDERING | GAS BOTTLE EXPLOSION | INPUT | OFF | OPPOSITE | CNR |
SPARKING | INCIC1 | INCIC3 | STRUC1 | STRUC3 | G&SC1 | G&SC3 | ALARC1 |
ALARC3 | NOSTC1| NOSTC3 | RESCC1 | RESCC3 | HIARC1 | HIARC3 | CAR
ACCIDENT - POSS PERSON TRAPPED | EXPLOSIONS HEARD | WASHAWAY AS A
RESULT OF ACCIDENT | ENTRANCE | ENT |FIRE| LHS | RHS | POWER LINES
ARCING AND SPARKING | SMOKE ISSUING FROM FAN | CAR FIRE | FIRE ALARM
OPERATING | GAS LEAK | GAS PIPE | NOW OUT | ACCIDENT | SMOKING | ROOF |
GAS | REQUIRED | FIRE | LOCKED IN CAR | SMOKE RISING | SINGLE CAR
ACCIDENT | ACCIDENT | FIRE)(.*?)(?=\SVSE| M | SVC | SVSW | SVNE | SVNW
)", RegexOptions.RightToLeft)
Change the order of the first three alternatives so that the more specific patterns are tried first:
(\d+-\d+) | (\d+ - \d+) | (\d+ )
instead of:
([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+)
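Why the order matters can be sketched in Python: a backtracking engine tries the alternatives from left to right and keeps the first one that succeeds at a given position. This is only an illustration of alternation order; it does not reproduce the .NET `RegexOptions.RightToLeft` setting from the question, and the patterns are simplified versions of the first three alternatives.

```python
import re

address = "1 - 12 example street"

# Shortest alternative first: the bare number wins before the ranges are tried.
short_first = re.compile(r"\d+|\d+-\d+|\d+ - \d+")

# Most specific alternatives first: the range gets a chance before the bare number.
long_first = re.compile(r"\d+-\d+|\d+ - \d+|\d+")

short_match = short_first.search(address).group()
long_match = long_first.search(address).group()

print(repr(short_match))
print(repr(long_match))
```

With the bare `\d+` listed first only the leading number is returned, while the reordered pattern returns the whole `1 - 12` range.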

Regex to capture dialog in Virginia Woolf's novel The Waves?

A bunch of us English grad students are studying dialog in Virginia Woolf's novel The Waves, and I've been trying to mark up the novel in TEI. To do this, it would be useful to write a regex that captures the dialog. Thankfully, The Waves is extremely regular, and almost all the dialog is in the form:
'Now they have all gone,' said Louis. 'I am alone. They have gone into the house for breakfast,'
But a speech can continue for several paragraphs. I'm trying to write a regex to match all the paragraphs of a given speaker.
This is discussed briefly in Chris Foster's blog post, where he suggests something like /'([\^,]+,)' said Louis, '(*)'/, although this would only match single paragraphs, I think. This is how I'm thinking through it:
For every paragraph containing the text "said Louis" (or any other character's name) in the first line of the paragraph, match every line until reaching another character's speech, i.e. "said Rhoda."
I could probably do this with a ton of awkward Python, but I'd love to know whether this is possible with regex.
It seems, from your link, that the text follows the following rules.
Each "line" is indeed a line in the strict sense, i.e. separated by \n.
Paragraphs are demarcated by two or more consecutive new lines, i.e. \n\n+.
Only the non-directional single quote ' is used to demarcate speech.
Here's a quick attempt (scroll all the way down to view the match groups)—flawed, I'm sure—but there's enough here that should lead you in the right direction. Note how if you concatenate the three capture groups, idiomatically known as $1, $2, and $3, you get each character's speech, including punctuation between the "said" separator. However, notice how certain quirks of language throw this regular expression off—for example, the fact that we do not close quotes at the end of paragraphs, yet open new quotes if the speech continues into the next paragraph, throws off the whole balanced-quotes strategy—and so do apostrophes.
\n\n.*?'([^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:([.]) |, )'([^^]+?)'(?=[^']*(?:'[^']')*[^']*\n\n.*'(?:[^^]+?[?]?),?' said (?:[A-Z][a-z]+)(?:[.] |, ))
Reading the pattern from left to right, its parts are:
\n\n : new paragraph
.*? : match as many non-newline characters as needed (but no more than needed)
' : match a (start-)quote
[^^]+? : match any character as needed (but no more than needed)
[?]? : match a question mark, optionally
,? : match a comma, optionally
' : match an (end-)quote
said (?:[A-Z][a-z]+) : match the "said Bernard"
(?:([.]) |, ) : match either a period followed by two spaces, or a comma followed by one space
' : match a (start-)quote
([^^]+?) : match any character as needed (but no more than needed)
' : match an (end-)quote
(?=...) : assert that this end-quote is followed by a string of non-quote characters, then zero or more strings of quoted non-quote characters, then another string of non-quote characters, a new paragraph, and the next "said Bernard"; otherwise fail.
Rubular matches (an excerpt):
Match 3
1. But when we sit together, close
2.
3. we melt into each
other with phrases. We are edged with mist. We make an
unsubstantial territory.
Match 4
1. I see the beetle
2. .
3. It is black, I see; it is green,
I see; I am tied down with single words. But you wander off; you
slip away; you rise up higher, with words and words in phrases.
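The paragraph-by-paragraph idea from the question can also be sketched without one large regex: split on blank lines, look for "said Name" in the first line of each paragraph, and attribute every following paragraph to that speaker until a new one appears. This is a rough sketch under the three formatting rules listed above; the function name `speeches_by_speaker` and the sample text are invented placeholders, not text from the novel, and apostrophes or unusual attributions may still trip it up.

```python
import re
from collections import defaultdict

# A closing quote, then "said Name" followed by a period or comma.
SPEAKER = re.compile(r"' said ([A-Z][a-z]+)[.,]")

# Placeholder text that only imitates the layout described above.
SAMPLE = (
    "'First line,' said Louis. 'More speech.\n"
    "\n"
    "'Speech continued in a new paragraph.'\n"
    "\n"
    "'Another line,' said Rhoda. 'And more.'\n"
)


def speeches_by_speaker(text):
    """Group paragraphs under the most recent "said Name" attribution."""
    speeches = defaultdict(list)
    current = None
    for paragraph in re.split(r"\n\n+", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        first_line = paragraph.split("\n", 1)[0]
        found = SPEAKER.search(first_line)
        if found:
            current = found.group(1)
        if current is not None:
            speeches[current].append(paragraph)
    return dict(speeches)


for speaker, paragraphs in speeches_by_speaker(SAMPLE).items():
    print(speaker, len(paragraphs))
```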

Regex named grouping

Can you have dynamic naming in regex groups? Something like
reg = re.compile(r"(?PText|Or|Something).*(?PTextIWant)")
r = reg.find("TextintermingledwithTextIWant")
r.groupdict()["Text"] == "TextIWant"
So that depending on what the beginning was, group["Text"] == TextIWant
Updated to make the question more clear.
Some regex engines support this, some don't. This site says that Perl, Python, PCRE (and thus PHP), and .NET support it, all with slightly different syntax:
+--------+----------------------------+----------------------+------------------+
| Engine | Syntax | Backreference | Variable |
+--------+----------------------------+----------------------+------------------+
| Perl | (?<name>...), (?'name'...) | \k<name>, \k'name' | %+{name} |
| | (?P<name>...) | \g{name}, (?&name)* | |
| | | (?P>name)* | |
+--------+----------------------------+----------------------+------------------+
| Python | (?P<name>...) | (?P=name), \g<name> | m.group('name') |
+--------+----------------------------+----------------------+------------------+
| .NET | (?<name>...), (?'name'...) | \k<name>, \k'name' | m.Groups["name"] |
+--------+----------------------------+----------------------+------------------+
| PCRE | (?<name>...), (?'name'...) | \k<name>, \k'name' | Depends on host |
| | (?P<name>...) | \g{name}, \g<name>* | language. |
| | | \g'name'*, (?&name)* | |
| | | (?P>name)* | |
+--------+----------------------------+----------------------+------------------+
This is not a complete list, but it's what I could find. If you know more flavors, add them! The backreference forms with a * are those which are "recursive" as opposed to just a back-reference; I believe this means they match the pattern again, not what was matched by the pattern. Also, I arrived at this by reading the docs, but there could well be errors—this includes some languages I've never used and some features I've never used. Let me know if something's wrong.
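For the Python row of the table, a small sketch of the three forms in use: `(?P<name>...)` to define a group, `(?P=name)` to refer back to it inside the pattern, and `\g<name>` in a replacement string. The group name `word` and the sample sentence are arbitrary choices for this example.

```python
import re

# Find a word that is immediately repeated.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

sentence = "this is is a test"
match = doubled.search(sentence)

repeated_word = match.group("word")            # access the group by name
whole_match = match.group(0)                   # the full doubled text
collapsed = doubled.sub(r"\g<word>", sentence) # keep a single copy of the word

print(repeated_word, "|", whole_match, "|", collapsed)
```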
Your question is worded kind of funny, but I think what you are looking for is a non-capturing group. Make it like this:
(?:Must_Match_This_First)What_You_Want(?:Must_Match_This_Last)
The ?: is what designates that a group matches but does not capture.
You could first build the string in a dynamic way and then pass it to the Regex engine.
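A minimal sketch of that last suggestion in Python, assuming the goal is a group whose name is whatever prefix the string starts with. The function name `capture_under_prefix_name` and its parameters are hypothetical, and the prefixes must be valid Python identifiers for them to be usable as group names.

```python
import re


def capture_under_prefix_name(text, prefixes=("Text", "Or", "Something"),
                              target="TextIWant"):
    """Match a leading prefix, then build a pattern whose group is named after it."""
    # Step 1: find out which prefix the string starts with.
    first = re.match("|".join(re.escape(p) for p in prefixes), text)
    if first is None:
        return None
    name = first.group(0)
    # Step 2: build the real pattern as a string, using the prefix as the group name.
    pattern = re.compile(rf"{re.escape(name)}.*(?P<{name}>{re.escape(target)})")
    found = pattern.search(text)
    return found.groupdict() if found else None


print(capture_under_prefix_name("TextintermingledwithTextIWant"))
print(capture_under_prefix_name("OrsomethingelseTextIWant"))
```

The group name itself is not dynamic inside the regex engine; it only looks that way because the pattern string is built after the first match is known.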

How do regular expressions work in selenium?

I want to store part of an id and throw out the rest. For example, I have an HTML element with an id of 'element-12345'. I want to throw out 'element-' and keep '12345'. How can I accomplish this?
I can capture and echo the value, like this:
| storeAttribute | //pathToMyElement#id | myId |
| echo | ${!-myId-!} | |
When I run the test, I get something like this:
| storeAttribute | //pathToMyElement#id | myId |
| echo | ${myId} | element-12345 |
I'm recording with the Selenium IDE, and copying the test over into Fitnesse, using the Selenium Bridge fixture. The problem is I'm using a clean database each time I run the test, with random ids that I need to capture and use throughout my test.
The solution is to use the JavaScript replace() function with storeEval:
| storeAttribute | //pathToMyElement#id | elementID |
| storeEval | '${elementID}'.replace("element-", "") | myID |
Now if I echo myID I get just the ID:
| echo | ${myID} | 12345 |
/element-(\d+)/i
That's a regular expression that would capture the numbers after the dash.
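How the capture group in that expression picks out the number can be shown with a short Python sketch. Selenium IDE's storeEval evaluates JavaScript, so this is only an illustration of the pattern itself, not something to paste into a test table; `extract_numeric_id` is a hypothetical helper name.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of the /i flag.
ID_PATTERN = re.compile(r"element-(\d+)", re.IGNORECASE)


def extract_numeric_id(element_id):
    """Return the digits after 'element-', or None if the id has another shape."""
    found = ID_PATTERN.search(element_id)
    return found.group(1) if found else None


print(extract_numeric_id("element-12345"))
```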
Something like this might work:
| storeAttribute | fn:replace(//pathToMyElement#id,"^element-","") | myId |
Using a regex inside XPath requires XPath 2.0; I'm not sure which version Selenium implements.