Golang regex : Ignore multiple occurrences - regex

I've got a simple need.
Given this input (string): 10 20 30 40 65 45 44 67 100 200 65 40 66 88 65
I need to get all numbers between 65 and 66.
The problem is that each limit can occur multiple times.
With a regex like (65).+(66), I capture 65 45 44 67 100 200 65 40 66, but I would like to get only 40.
How can I achieve this?
https://regex101.com/r/9HoKxr/1

It sounds like you want to prevent any '65' from appearing inside the match, up to the first occurrence of '66'. It's a bit verbose, but what about:
\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b
\b65 - Start with '65' preceded by a word boundary;
( - Open capture group;
(?:\s - Open a non-capture group that starts with a whitespace char;
(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}) - Nested non-capture group to match any integer but '65' or '66';
)+?) - Close the non-capture group and match it at least once but as few times as possible, then close the capture group;
\s66\b - Match another whitespace char followed by '66' and a word boundary.
Note:
Leading spaces in the captured group are removed with the Trim() function from the strings package;
In my examples I have used '10 20 30 40 65 45 44 40 66 200 65 40 66 88 65', which returns multiple matches. In that case I assume OP is looking for the 'shortest' matching substring;
By 'shortest' I mean the fewest elements when the substring is split on spaces (using the 'Fields' function from the above-mentioned strings package). Therefore '123456' is preferred over '1 2 3' despite being the 'longer' substring in terms of characters;
Try:
package main
import (
    "fmt"
    "regexp"
    "strings"
)
func main() {
    s := `10 20 30 40 65 45 44 40 66 200 65 40 66 88 65`
    re := regexp.MustCompile(`\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b`)
    matches := re.FindAllStringSubmatch(s, -1) // Retrieve all matches
    shortest := ``
    for _, m := range matches { // Loop over all matches; m[1] is the capture group
        if shortest == `` || len(strings.Fields(m[1])) < len(strings.Fields(shortest)) {
            shortest = strings.Trim(m[1], ` `)
        }
    }
    fmt.Println(shortest)
}
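For comparison, the same "shortest run between 65 and 66" idea can be sketched without a regex by scanning the space-separated tokens. This is only a minimal Python sketch, not part of the answer above; the function name shortest_between and its parameters are made up for illustration, and it assumes the limits appear as whole tokens.

```python
def shortest_between(text, start="65", end="66"):
    """Return the shortest non-empty run of tokens found between `start` and `end`."""
    tokens = text.split()
    best = None
    last_start = None  # index of the most recent `start` token still waiting for an `end`
    for i, tok in enumerate(tokens):
        if tok == start:
            # A later `start` always gives a shorter run, so restart from here.
            last_start = i
        elif tok == end and last_start is not None:
            run = tokens[last_start + 1:i]
            # Keep the run only if it is non-empty and has fewer elements than the best so far.
            if run and (best is None or len(run) < len(best)):
                best = run
            last_start = None
    return best or []


print(shortest_between("10 20 30 40 65 45 44 67 100 200 65 40 66 88 65"))  # ['40']
```

Comparing runs by number of tokens mirrors the "fewest elements" rule used in the Go code.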

Related

Extract number from a text after symbol "X"

I have the following text in a column, where I need to extract number next to second "X" or "x",
in the below text, it is 54.
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY
Some other sample texts:
21sOE X 12sFL 56 X 36 63" PLAIN # Result must be : 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Given:
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY # Result: 54
21sOE X 12sFL 56 X 36 63" PLAIN # Result: 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Use:
[Xx]\s?(\d+)(?:.(?![Xx]))*$
Demo and Explanation:
https://regex101.com/r/KshMUE/1
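Since the question does not name a language, here is a rough Python sketch of how that pattern could be applied to the sample rows. The helper name number_after_last_x is hypothetical. Note that the pattern really anchors on the last X that is followed by a number, which coincides with the second X in all of the samples.

```python
import re

# Pattern from the answer above: an X, an optional whitespace char, the digits we want,
# then the rest of the line, which must not contain another X/x.
LAST_X_NUMBER = re.compile(r'[Xx]\s?(\d+)(?:.(?![Xx]))*$')

SAMPLES = [
    '40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY',
    '21sOE X 12sFL 56 X 36 63" PLAIN',
    '40sC X 40sC_100 X 91_63" PLAIN',
    '16sOE x 12sLY 84 x 48 71" 3/1 DRILL',
]


def number_after_last_x(text):
    """Return the digits captured by group 1, or None when nothing matches."""
    m = LAST_X_NUMBER.search(text)
    return m.group(1) if m else None


for row in SAMPLES:
    print(number_after_last_x(row))
```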
You didn't state which tool/language this is using, so it's hard to know for sure what to suggest.
However, if possible, I would consider splitting the string on the letter "x" (or "X") as this makes the regex part much easier to follow. For example, something like:
input = '40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY'
input.split(/x/i)[2][/\d+/]
By doing this split, we first extract only the desired section of the string (in this case, ' 54 54" AWM/C129-DOBY'), so the regex (/\d+/) becomes trivial.
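The same split-then-search idea can be sketched in Python, assuming that is an acceptable language for the task. The function name number_after_second_x is made up for this example.

```python
import re


def number_after_second_x(text):
    """Split on x/X, then take the first run of digits in the piece after the second x."""
    parts = re.split(r'x', text, flags=re.IGNORECASE)
    if len(parts) < 3:
        return None  # fewer than two x separators in the text
    m = re.search(r'\d+', parts[2])
    return m.group() if m else None


print(number_after_second_x('40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY'))  # 54
```

As with the original snippet, this assumes the letter x does not occur anywhere else before the number of interest.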
Try this:
(?i)(?<=x[^x]{1,100}x.)\d+
(?i): case-insensitive
(?<=: start of positive look-behind
x[^x]{1,100}x.: an x followed by up to 100 characters other than x, followed by x and any one single character
): end of look-behind
\d+: one or more digits
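Not every engine accepts a variable-length look-behind like this one; Python's re module, for example, requires look-behinds to be fixed-width. A possible workaround, sketched here in Python under that assumption, is to match the prefix normally and read the digits from a capture group. The name SECOND_X_NUMBER and the helper are hypothetical.

```python
import re

# Same idea as the look-behind version, but the prefix is matched normally and
# the digits are taken from group 1. Explicit [xX] classes replace the (?i) flag.
SECOND_X_NUMBER = re.compile(r'[xX][^xX]{1,100}[xX].(\d+)')


def number_after_second_x(text):
    """Return the digits that follow the second x (plus one character), or None."""
    m = SECOND_X_NUMBER.search(text)
    return m.group(1) if m else None


print(number_after_second_x('16sOE x 12sLY 84 x 48 71" 3/1 DRILL'))  # 48
```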

replacing whitespace characters with '\t' string

I am trying to replace whitespace characters with '\t' string. The text file looks like this:
255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray
My code looks like:
import re
with open('rgb.txt', 'r') as f:
    for line in f:
        print(re.sub(r'\s+', r'\\t', line))
The above code gives:
255\t255\t255\twhite
\t0\t0\t0\tblack
\t47\t79\t79\tdark\tslate\tgray
\t47\t79\t79\tDarkSlateGray
\t47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim\tgray
However, I only want to replace the whitespaces which are after the first number until the color name. Also not in between the color. The output I want is:
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdarkslategray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdimgray
You can match whitespace immediately following a digit, which should solve the problem:
>>> txt = """255 255 255 white
... 0 0 0 black
... 47 79 79 dark slate gray
... 47 79 79 DarkSlateGray
... 47 79 79 DarkSlateGrey
... 105 105 105 dim gray"""
>>> for line in txt.split('\n'):
...     print(re.sub(r'[0-9]\s+', lambda m: m.group(0)[0] + r'\t', line))
...
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
I couldn't find a quick way to just ignore the digit in the replacement, so I just made a lambda instead that takes the digit that was matched and appends a \t to it.
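One way to leave the digit out of the match entirely is a look-behind, so that only the whitespace is replaced. A small sketch of that variant follows; the function name tabs_after_digits is invented for the example.

```python
import re


def tabs_after_digits(line):
    """Replace each whitespace run that directly follows a digit with a literal \\t."""
    # (?<=\d) asserts a digit before the whitespace without consuming it.
    # The replacement r'\\t' yields a backslash followed by 't', not a tab character.
    return re.sub(r'(?<=\d)\s+', r'\\t', line)


print(tabs_after_digits('47 79 79 dark slate gray'))  # 47\t79\t79\tdark slate gray
```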
I suggest using nested re.subs:
re.sub(r'^[\d\s]+', lambda x: re.sub(r'\s+', '\t', x.group()), line)
To get rid of spaces at start use line.lstrip() before running the regex:
re.sub(r'^[\d\s]+', lambda x: re.sub(r'\s+', '\t', x.group()), line.lstrip())
The first ^[\d\s]+ matches all digits and spaces at the start of line and the second re.sub replaces whitespace strings with a single tab.
Output (for lines without .lstrip()):
255\t255\t255\twhite
\t0\t0\t0\tblack
\t47\t79\t79\tdark slate gray
\t47\t79\t79\tDarkSlateGray
\t47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
Output (for lines with .lstrip()):
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
I'm not familiar enough with Python to answer quickly and accurately in it, but here's JavaScript showing the regex implementation. If the first three fields will always be strings of digits, you can handle it this way.
var input = `255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray`
var output = input.replace(/(\d+)\s+/g, '$1\\t')
console.log(output)
You can do it in two passes:
import re
txt = """
255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray
"""
for line in txt.split('\n'):
    line = re.sub(r'^\s+', '', line) # remove leading spaces
    print(re.sub(r'(?<![a-zA-Z])(\s+)', r'\\t', line)) # replace remaining whitespace with \t when it is not preceded by a letter
Output:
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
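If every line really has exactly three numeric columns followed by the color name, a regex may not be needed at all. The following is a sketch under that assumption, using str.split with a maximum number of splits; the function name tab_separate and the fields parameter are made up here.

```python
def tab_separate(line, fields=3):
    """Join the first `fields` columns and the remainder with a literal \\t."""
    # strip() drops leading/trailing whitespace; split(None, fields) splits on runs
    # of whitespace at most `fields` times, so the color name keeps its inner spaces.
    parts = line.strip().split(None, fields)
    return r'\t'.join(parts)


print(tab_separate('  47  79  79 dark slate gray'))  # 47\t79\t79\tdark slate gray
```

This keeps the spaces inside names such as "dark slate gray" untouched because the split stops after the third column.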

dart regex remove space phone

I tried all the solutions from "REGEX Remove Space", but none of them match.
I work with Dart and Flutter, and I am trying to capture only the digits from strings of this type:
case 1
aaaaaaaaa 06 12 34 56 78 aaaaaa
case 2
aaaaaaaa 0612345678 aaaaaa
case 3
aaaaaa +336 12 34 56 78 aaaaa
I want to get only 0612345678, with no spaces and no +33: just 10 digits. In the case of +33, I need to replace +33 with 0.
Currently I have this pattern, \D*(\d+)\D*?, which only works for case 2.
You may match and capture an optional +33 and then a digit followed with spaces or digits, and then check if Group 1 matched and then build the result accordingly.
Here is an example solution (tested):
var strs = ['aaaaaaaaa 06 12 34 56 78 aaaaaa', 'aaaaaaaa 0612345678 aaaaaa', 'aaaaaa +336 12 34 56 78 aaaaa', 'more +33 6 12 34 56 78'];
for (int i = 0; i < strs.length; i++) {
    var rx = new RegExp(r"(?:^|\D)(\+33)?\s*(\d[\d ]*)(?!\d)");
    var match = rx.firstMatch(strs[i]);
    var result = "";
    if (match != null) {
        // Group 2 always participates in a match; the `!` is required under null safety
        if (match.group(1) != null) {
            result = "0" + match.group(2)!.replaceAll(" ", "");
        } else {
            result = match.group(2)!.replaceAll(" ", "");
        }
        print(result);
    }
}
Prints 0612345678 four times, once for each input string.
The pattern is
(?:^|\D)(\+33)?\s*(\d[\d ]*)(?!\d)
(?:^|\D) - start of string or any char other than a digit
(\+33)? - Group 1 that captures +33 1 or 0 times
\s* - any 0+ whitespaces
(\d[\d ]*) - Group 2: a digit followed with spaces or/and digits
(?!\d) - no digit immediately to the right is allowed.
Spaces are removed from Group 2 with a match.group(2).replaceAll(" ", "") since one can't match discontinuous strings within one match operation.
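To make the two-step logic (match, then strip spaces and add the leading 0) easy to try outside Dart, here is a rough Python sketch that uses the same pattern. The names PHONE_RE and normalize_phone are hypothetical.

```python
import re

# Same pattern as above: optional +33 in group 1, digits and spaces in group 2.
PHONE_RE = re.compile(r"(?:^|\D)(\+33)?\s*(\d[\d ]*)(?!\d)")


def normalize_phone(text):
    """Return the first phone number in `text` without spaces, with +33 replaced by 0."""
    m = PHONE_RE.search(text)
    if m is None:
        return None
    digits = m.group(2).replace(" ", "")
    # group(1) is None when the optional +33 prefix did not take part in the match.
    return "0" + digits if m.group(1) else digits


for s in ["aaaaaaaaa 06 12 34 56 78 aaaaaa", "aaaaaaaa 0612345678 aaaaaa",
          "aaaaaa +336 12 34 56 78 aaaaa", "more +33 6 12 34 56 78"]:
    print(normalize_phone(s))
```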

How can I extract the first digits from string

I am trying to use powershell to extract the first digits from a long string. How can I use regex to only get the first numbers from a string?
String 1:
000660007501S W RUSSELL DLC NO 41 SLY 2.5 FT OF ELY 313.82 FT OF FOLLOWING DESCRIBED PORTION OF SAMUEL W RUSSELL DONATION CLAIM` 1000
string 2:
010454040006ALDERBROOK DIV NO 05 62000 14000040
string 3:
012000012000ALEXANDER ACRE TRS S 1/2 OF LOT 38 TGW LOT 39 TGW N 45.96 FT OF E 109.23 FT LOT 40 LESS ANY POR PLTD DEVON LANE 13000 38-39-40
I was able to do it like this:
$accountnumber = $p.Substring(0,16) -replace '\D+',''
$Parcelnumber = $accountnumber.Substring(0,10)
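The Substring approach above relies on the digits sitting in the first 16 characters. A regex anchored at the start of the string avoids that fixed width. The idea is sketched below in Python rather than PowerShell; the function name leading_digits and its limit parameter are illustrative only.

```python
import re


def leading_digits(text, limit=None):
    """Return the run of digits at the very start of `text`, optionally truncated."""
    m = re.match(r'\d+', text)  # re.match only matches at the beginning of the string
    if m is None:
        return ''
    digits = m.group()
    return digits if limit is None else digits[:limit]


s = '010454040006ALDERBROOK DIV NO 05 62000 14000040'
print(leading_digits(s))      # 010454040006
print(leading_digits(s, 10))  # 0104540400
```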

Regex to detect ASCII art on a single line.

Basically I want to find ASCII Art on one line. For me this is any 2 characters that are not alpha numeric ignoring whitespace. So a line might look like :
This is a !# Test of --> ASCII art detection ### <--
So the matches I should get are :
!#
-->
###
<--
I came up with this which still selects spaces :(
\b\W{2,}
I'm using the following website for testing:
http://gskinner.com/RegExr/
Thanks for the help, it's much appreciated!
I'd suggest something like this:
[^\w\s]{2,}
This will match any sequence of two or more characters that are not word characters (which include alphanumeric characters and underscores) or whitespace characters.
If you would also like to match underscores as part of your 'ASCII art', you'd have to be more specific:
[^a-zA-Z0-9\s]{2,}
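The difference between the two patterns can be checked quickly, for instance in Python (an assumption, since the question only mentions an online tester). The constant names below are made up for this sketch.

```python
import re

NO_UNDERSCORE = re.compile(r'[^\w\s]{2,}')           # underscores count as word chars
WITH_UNDERSCORE = re.compile(r'[^a-zA-Z0-9\s]{2,}')  # underscores count as "art"

line = 'This is a !# Test of --> ASCII art detection ### <--'
print(NO_UNDERSCORE.findall(line))           # ['!#', '-->', '###', '<--']
print(WITH_UNDERSCORE.findall('a __ b ##'))  # ['__', '##']
```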
I think this
((?=[\x21-\x7e])[\W_]){2,}
is probably equivalent to this
[[:punct:]]{2,}
Using POSIX, the supported punctuation is listed below
(to add more, just add it to the class: [[:punct:]<add here>]{2,})
33 = !
34 = "
35 = #
36 = $
37 = %
38 = &
39 = '
40 = (
41 = )
42 = *
43 = +
44 = ,
45 = -
46 = .
47 = /
58 = :
59 = ;
60 = <
61 = =
62 = >
63 = ?
64 = @
91 = [
92 = \
93 = ]
94 = ^
95 = _
96 = `
123 = {
124 = |
125 = }
126 = ~
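Some engines, Python's re module among them, do not support POSIX classes such as [[:punct:]]. In that case one option is to build the class from the characters in the table. The sketch below assumes Python and uses string.punctuation, which holds the 32 ASCII characters at code points 33-47, 58-64, 91-96 and 123-126; PUNCT_RUN is a made-up name.

```python
import re
import string

# Build a character class from the 32 ASCII punctuation characters.
# re.escape protects characters that are special inside a class, such as ], \, ^ and -.
PUNCT_RUN = re.compile('[' + re.escape(string.punctuation) + ']{2,}')

line = 'This is a !# Test of --> ASCII art detection ### <--'
print(PUNCT_RUN.findall(line))  # ['!#', '-->', '###', '<--']
```

Extra characters can be appended to the string before escaping, in the same spirit as adding them to the POSIX class.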