return nth match from string using regex

I am using Tableau to create a visualization and need to apply a regex to string values in my data set. I want to return the nth match from this string of data: b29f3b2f2b2f3b3f1r2f3+b3x#. The data will always be on one line. I need to break it into substrings, starting a new one each time one of the characters b, s, f, or d is encountered, and then return the nth substring. For example, each value of n should return the following:
n=1 matches b29
n=2 matches f3
n=3 matches b2
n=4 matches f2
n=5 matches b2
n=6 matches f3
n=7 matches b3
n=8 matches f1r2
n=9 matches f3+
n=10 matches b3x#
I can get the n=1 match to return the proper value using bfsd(?=[bfsd]) and have tried to get the subsequent values to return using lookahead, but can't find a regex which works. Any help is appreciated.

Your item pattern is [bfsd][^bfsd]*.
You may use ^(?:.*?([bfsd][^bfsd]*)){n} to get what you need, just update the n variable with the number you need to get.
This pattern will get you the second value:
^(?:.*?([bfsd][^bfsd]*)){2}
See regex demo.
Details
^ - start of string
(?:.*?([bfsd][^bfsd]*)){2} - two occurrences of
.*? - any 0+ chars, as few as possible
([bfsd][^bfsd]*) - b, f, s or d followed by 0+ chars other than b, f, s and d.
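As a rough check of the pattern above, here is a minimal Python sketch. It assumes Python's re module rather than Tableau's regex functions, which may behave differently, and the helper name nth_match is hypothetical. A repeated capture group keeps the text from its last repetition, so group 1 ends up holding the nth item.

```python
import re

def nth_match(text, n):
    """Return the nth (1-based) item of text, or None if there are fewer than n."""
    # {n} repeats the non-capturing group n times; capture group 1
    # keeps whatever it matched on the final repetition.
    pattern = r'^(?:.*?([bfsd][^bfsd]*)){%d}' % n
    m = re.match(pattern, text)
    return m.group(1) if m else None

data = 'b29f3b2f2b2f3b3f1r2f3+b3x#'
print(nth_match(data, 2))   # f3
print(nth_match(data, 8))   # f1r2
print(nth_match(data, 11))  # None (only 10 items exist)
```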

You can use this regex:
[bsfd][^bsfd]*
Use the 'global' flag.
This will create matches that start with one of the four letters, followed by any number of other characters.
The result will be an array with all the matches. Note the Array will start with index 0 (not 1).
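For illustration, a short Python sketch of the "collect all matches, then index" approach. Python has no 'global' flag; re.findall plays that role here, which is an assumption about how you would port the idea, not something Tableau provides.

```python
import re

data = 'b29f3b2f2b2f3b3f1r2f3+b3x#'
# findall scans the whole string, like a 'global' flag in other flavors.
items = re.findall(r'[bsfd][^bsfd]*', data)

n = 8                 # 1-based position wanted
print(items[n - 1])   # f1r2 (list indexes are 0-based)
print(len(items))     # 10
```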

If you have gawk, this will partition the input into fields according to your spec:
$ awk -v FPAT='[bfsd][^bfsd]*' '{$1=$1}1'
$ echo "b29f3b2f2b2f3b3f1r2f3+b3x#" |
awk -v FPAT='[bfsd][^bfsd]*' '{for(i=1;i<=NF;i++) print i " -> " $i}'
1 -> b29
2 -> f3
3 -> b2
4 -> f2
5 -> b2
6 -> f3
7 -> b3
8 -> f1r2
9 -> f3+
10 -> b3x#

Related

Select only letters which are followed by a number

I am trying to select some codes from a PostgreSQL table.
I only want the codes that have numbers in them, e.g.
GD123
GD564
I don't want to pick any codes like GDTG or GDCNB.
Here's my query so far:
select regexp_matches(no_, '[a-zA-Z0-9]*$')
from myschema.mytable
which of course doesn't work.
Any help appreciated.

The pattern to match a string that has at least 1 letter followed by at least 1 number is '[A-Za-z]+[0-9]+'.
Now, if valid codes must start with two letters followed by three digits, as your examples show, replace the + quantifiers with {2} and {3} respectively, and anchor the pattern with ^ and $, like this: '^[A-Za-z]{2}[0-9]{3}$'
The regex match operator is ~ which you can use in the where clause:
SELECT no_
FROM myschema.mytable
WHERE no_ ~ '[A-Za-z]+[0-9]+'
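To see how the unanchored and anchored patterns differ outside the database, here is a small Python sketch. It assumes Python's re module treats these simple patterns the same way PostgreSQL's ~ operator does, and the code XGD1234 is made up for illustration.

```python
import re

codes = ['GD123', 'GD564', 'GDTG', 'GDCNB', 'XGD1234']

# Unanchored, like no_ ~ '[A-Za-z]+[0-9]+': matches anywhere in the string.
loose = [c for c in codes if re.search(r'[A-Za-z]+[0-9]+', c)]

# Anchored: exactly two letters followed by exactly three digits.
strict = [c for c in codes if re.search(r'^[A-Za-z]{2}[0-9]{3}$', c)]

print(loose)   # ['GD123', 'GD564', 'XGD1234']
print(strict)  # ['GD123', 'GD564']
```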

You may use
CREATE TABLE tb1
(s character varying)
;
INSERT INTO tb1
(s)
VALUES
('GD123'),
('12345'),
('GDFGH')
;
SELECT * FROM tb1 WHERE s ~ '^(?![A-Za-z]+$)[a-zA-Z0-9]+$';
Result: GD123 and 12345 are returned.
Details
^ - start of string
(?![A-Za-z]+$) - a negative lookahead that fails the match if there are only letters to the end of the string
[a-zA-Z0-9]+ - 1 or more alphanumeric chars
$ - end of string.
If you want to avoid matching 12345, use
'^(?![A-Za-z]+$)(?![0-9]+$)[a-zA-Z0-9]+$'
Here, (?![0-9]+$) will similarly fail the match if, from the string start, all chars up to the end of the string are digits. Result: only GD123 is returned.
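The two lookahead patterns can be tried on the same three sample values with a short Python sketch. This assumes Python's re module handles these lookaheads the same way PostgreSQL does, which holds for patterns this simple as far as I know.

```python
import re

values = ['GD123', '12345', 'GDFGH']

# Rejects strings made of letters only.
not_only_letters = r'^(?![A-Za-z]+$)[a-zA-Z0-9]+$'
# Rejects strings made of letters only or of digits only.
mixed_only = r'^(?![A-Za-z]+$)(?![0-9]+$)[a-zA-Z0-9]+$'

first = [v for v in values if re.search(not_only_letters, v)]
second = [v for v in values if re.search(mixed_only, v)]
print(first)   # ['GD123', '12345']
print(second)  # ['GD123']
```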

Something like:
so=# with c(v) as (values('GD123'),('12345'),('GD ERT'))
select v ~ '[A-Z]{1,}[0-9]+', v from c;
?column? | v
----------+--------
t | GD123
f | 12345
f | GD ERT
(3 rows)
?..

If the format of the data you want to obtain is a set of characters followed by a set of digits (e.g., GD123), you can use the regex:
[a-zA-Z0-9]+[0-9]

This captures a run of letters together with the digits that follow it:
([A-Za-z]+\d+)

Reverse reading via regex or grabbing last match with no tail

I need to grab a two-letter code and the numbers that follow it.
What I want is:
AB 12 CD-12345-67 -> CD-12345
AB12 CD 12345-67 -> CD 12345
AB-12CD12345-6 -> CD12345
ABC1234556 -> no match, as I look for 2 character letter and numbers after that.
ABC-1234556 -> no match, as I look for 2 character letter and numbers after that.
A1-BC-12D345-56 -> no match, after 2 characters letter, numbers must come
I used this regex
[A-Z]{2}[ |\-]?\d+
This grabs both CD-12345 and AB 12 in the first example, but I only need CD-12345. It also grabs BC1234556, BC-1234556, and BC-12 in the last three examples, which I don't want.
Between the letter block and the number block there may be a space, a - character, or nothing at all.
Thank you very much.

Based on what you posted:
^.*(?<![A-Z])([A-Z]{2}[- ]?\d++)(?![A-Z])
Demo
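A Python sketch of the same idea, run against the six sample inputs from the question. Possessive quantifiers such as \d++ are only accepted by Python's re module from version 3.11, so this sketch swaps in \d+(?![A-Z\d]), which should behave the same way here: the digit run cannot be shortened to get past the lookahead. The helper name last_code is hypothetical.

```python
import re

# \d+ followed by (?![A-Z\d]) stands in for the possessive \d++(?![A-Z]).
# The greedy ^.* makes the engine prefer the rightmost candidate.
PATTERN = re.compile(r'^.*(?<![A-Z])([A-Z]{2}[- ]?\d+)(?![A-Z\d])')

def last_code(text):
    """Return the last two-letter code with its digits, or None."""
    m = PATTERN.match(text)
    return m.group(1) if m else None

samples = ['AB 12 CD-12345-67', 'AB12 CD 12345-67', 'AB-12CD12345-6',
           'ABC1234556', 'ABC-1234556', 'A1-BC-12D345-56']
for s in samples:
    print(s, '->', last_code(s))
```

The first three inputs yield CD-12345, CD 12345 and CD12345; the last three yield None.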

Convert a regex expression to erlang's re syntax?

I am having hard time trying to convert the following regular expression into an erlang syntax.
What I have is a test string like this:
1,2 ==> 3 #SUP: 1 #CONF: 1.0
And the regex that I created with regex101 is this (see below):
([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)
But I am getting weird match results if I convert it to erlang - here is my attempt:
{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
Also, I get more than four matches. What am I doing wrong?
Here is the regex101 version:
https://regex101.com/r/xJ9fP2/1

I don't know much about erlang, but I will try to explain. With your regex
>{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
>re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
{match,[{0, 28},{0,3},{8,1},{16,1},{25,3}]}
Each tuple is {Start, Length}. In {0, 28}, for example, 0 is the starting index of the match and 28 is the total number of matched characters from that starting index.
Reason for more than four groups
The first match always covers the entire string matched by the complete regex; the rest are the four captured groups you want. So there are 5 groups in total.
([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)
<------->         <---->             <--->              <--------->
First group       Second group       Third group        Fourth group
<----------------------------------------------------------------->
This regex matches the entire string and is the first match you are getting (the zeroth group).
How to find desired answer
Here we want everything except the first group (which is the entire match of the regex), so we can use all_but_first to skip it:
> re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M, [{capture, all_but_first, list}]).
{match,["1,2","3","1","1.0"]}
More info can be found here
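For comparison only, the same distinction between the whole match and the captured groups can be seen in Python's re module. This is an analogy, not Erlang: Python reports positions as (start, end) pairs, whereas the Erlang tuples above are {Start, Length}.

```python
import re

text = '1,2 ==> 3 #SUP: 1 #CONF: 1.0'
m = re.search(r'([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)', text)

print(m.span(0))   # (0, 28): the whole match, the "zeroth group"
print(m.groups())  # ('1,2', '3', '1', '1.0'): only the four captured groups
print([m.span(i) for i in range(1, 5)])  # [(0, 3), (8, 9), (16, 17), (25, 28)]
```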

If you are in doubt about the contents of the string, you can print it and check:
1> RE = "([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)".
"([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)"
2> io:format("RE: /~s/~n", [RE]).
RE: /([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)/
For the rest of the issue, there is a great answer by rock321987.

Remove last 4 digits from string if pattern matches

I'm trying to remove the last 4 digits from a string in Postgres if and only if they match a certain pattern: [0][1-9][0][1-9].
Example:
1031610101 -> 103161
1234 -> 1234
123456 -> 123456
123405 -> 123405
I've tried a few approaches using substring, but somehow can't get this to work.
The length of the string is variable.
So far I've tried:
substring(value from '([\d](3,6}[0][1-9][0][1-9])') as "Result"

Easier with regexp_replace():
SELECT regexp_replace(col, '0[1-9]0[1-9]$', '')
FROM tbl;
$ .. end of string
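The effect of the anchored pattern can be checked against the four examples from the question with a small Python sketch. It uses re.sub as a stand-in for regexp_replace(), and the helper name strip_suffix is hypothetical.

```python
import re

def strip_suffix(value):
    # Remove the last four characters only when they have the form 0X0Y,
    # with X and Y in 1-9; $ pins the pattern to the end of the string.
    return re.sub(r'0[1-9]0[1-9]$', '', value)

for v in ['1031610101', '1234', '123456', '123405']:
    print(v, '->', strip_suffix(v))
```

Only the first value changes (to 103161); the other three come back untouched.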

SELECT SUBSTR('ABCDEFGHIJKLMNOP', 1, LENGTH('ABCDEFGHIJKLMNOP') - 4);
Syntax: SUBSTR('string',from_position,length)

Regular expression for bit strings with even number of 1s

Let L= { w in (0+1)* | w has even number of 1s}, i.e. L is the set of all bit strings with even number of 1s. Which one of the regular expressions below represents L?
A) (0*10*1)*
B) 0*(10*10*)*
C) 0*(10*1)* 0*
D) 0*1(10*1)* 10*
In my view option D cannot be correct, because it does not match bit strings with zero 1s. But what about the other options? We only care whether the number of 1s is even; the number of 0s doesn't matter.
Then which is the correct option and why?

A is false. It doesn't match 0110 (or any non-empty zeros-only string).
B is OK. I won't bother proving it here since the page margins are too small.
C doesn't match 010101010 (the zero in the middle is not matched).
D, as you said, doesn't match 00 or any other string with no ones.
So only B.

To solve such a problem you should
Supply counterexample patterns to all "incorrect" regexps. This will be either a string in L that is not matched, or a matched string out of L.
To prove the remaining "correct" pattern, you should answer two questions:
Does every string that matches the pattern belong to L? This can be done by devising properties each of matched strings should satisfy--for example, number of occurrences of some character...
Is every string in L matched by the regexp? This is done by dividing L into easily analyzable subclasses, and showing that each of them matches pattern in its own way.
(No concrete answers due to [homework]).

Examining the pattern B:
^0*(10*10*)*$
^ # match beginning of string
0* # match zero or more '0'
( # start group 1
10* # match '1' followed by zero or more '0'
10* # match '1' followed by zero or more '0'
)* # end group 1 - match zero or more times
$ # end of string
It's pretty obvious that this pattern will only match strings that have 0, 2, 4, ... 1s.
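The claim can also be checked by brute force for short strings. The Python sketch below is not a proof: it only confirms that pattern B agrees with "even number of 1s" on every bit string up to a chosen length. The function name agrees and the length limit of 10 are arbitrary choices.

```python
import re
from itertools import product

B = re.compile(r'0*(10*10*)*')

def agrees(max_len):
    """True if B accepts exactly the strings with an even count of 1s, up to max_len."""
    for length in range(max_len + 1):
        for bits in product('01', repeat=length):
            w = ''.join(bits)
            in_language = w.count('1') % 2 == 0
            # fullmatch plays the role of the ^...$ anchors.
            if bool(B.fullmatch(w)) != in_language:
                return False
    return True

print(agrees(10))  # True
```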

Look for examples that should match but don't. 0, 11011, and 1100 should all match, but each one fails for one of those four

C is incorrect because it does not allow any 0s between the second 1 of one group and the first 1 of the next group.

This answer would be best for this language
(0*10*10*)

A quick Python script actually eliminated all the possibilities:
import re
a = re.compile("(0*10*1)*")
b = re.compile("0*(10*10*)*")
c = re.compile("0*(10*1)* 0*")
d = re.compile("0*1(10*1)* 10*")
candidates = [('a',a),('b',b),('c',c),('d',d)]
tests = ['0110', '1100', '0011', '11011']
for test in tests:
    for candidate in candidates:
        if not candidate[1].match(test):
            candidates.remove(candidate)
            print("removed %s because it failed on %s" % (candidate[0], test))
ntests = ['1', '10', '01', '010', '10101']
for test in ntests:
    for candidate in candidates:
        if candidate[1].match(test):
            candidates.remove(candidate)
            print("removed %s because it matched on %s" % (candidate[0], test))
the output:
removed c because it failed on 0110
removed d because it failed on 0110
removed a because it matched on 1
removed b because it matched on 10
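Two details of the script above affect its verdict. re.match only anchors at the start of the string, so a pattern that can match the empty string succeeds on every input. The spaces inside patterns c and d are also treated as literal characters. The sketch below is a variant under two assumptions: the spaces are formatting only and are dropped, and the whole string must match, so re.fullmatch is used. The string '0' is added to the positive tests, and the function name eliminate is hypothetical.

```python
import re

# Spaces from the option text are dropped: assumed to be formatting only.
options = {
    'a': r'(0*10*1)*',
    'b': r'0*(10*10*)*',
    'c': r'0*(10*1)*0*',
    'd': r'0*1(10*1)*10*',
}
should_match = ['0', '0110', '1100', '0011', '11011']   # '0' added here
should_fail = ['1', '10', '01', '010', '10101']

def eliminate(options):
    """Map each rejected option to the first test string it gets wrong."""
    rejected = {}
    for name, pattern in options.items():
        for w in should_match + should_fail:
            expected = w in should_match
            # fullmatch requires the whole string to match, unlike match.
            if bool(re.fullmatch(pattern, w)) != expected:
                rejected[name] = w
                break
    return rejected

rejected = eliminate(options)
print(rejected)                                   # {'a': '0', 'c': '11011', 'd': '0'}
print([n for n in options if n not in rejected])  # ['b']
```

Under these assumptions only b survives, which agrees with the other answers.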
