replacing whitespace characters with '\t' string - regex

I am trying to replace whitespace characters with '\t' string. The text file looks like this:
255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray
My code looks like:
import re
with open('rgb.txt', 'r') as f:
    for line in f:
        print(re.sub(r'\s+', r'\\t', line))
The above code gives:
255\t255\t255\twhite
\t0\t0\t0\tblack
\t47\t79\t79\tdark\tslate\tgray
\t47\t79\t79\tDarkSlateGray
\t47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim\tgray
However, I only want to replace the whitespace that comes after the first number, up to the color name. Leading whitespace and the whitespace inside the color name should not become \t. The output I want is:
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdarkslategray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdimgray

You can match whitespace immediately following a digit, which should solve the problem:
>>> txt = """255 255 255 white
... 0 0 0 black
... 47 79 79 dark slate gray
... 47 79 79 DarkSlateGray
... 47 79 79 DarkSlateGrey
... 105 105 105 dim gray"""
>>> import re
>>> for line in txt.split('\n'):
...     print(re.sub(r'[0-9]\s+', lambda m: m.group(0)[0] + r'\t', line))
...
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
I couldn't find a quick way to just ignore the digit in the replacement, so I just made a lambda instead that takes the digit that was matched and appends a \t to it.
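A lookbehind is one way to keep the digit out of the match, so no lambda is needed. This is a minimal sketch, assuming the color names never contain a digit followed by whitespace; the helper name tabify_after_digits is made up for illustration:

```python
import re

def tabify_after_digits(line: str) -> str:
    # (?<=\d) asserts that a digit precedes the whitespace without consuming it,
    # so only the whitespace run is replaced by a literal backslash followed by t.
    return re.sub(r'(?<=\d)\s+', r'\\t', line)

for line in ["255 255 255 white", "47 79 79 dark slate gray"]:
    print(tabify_after_digits(line))
```

Leading whitespace is not preceded by a digit, so it is left alone; strip it first if it should go.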

I suggest using nested re.subs:
re.sub(r'^[\d\s]+', lambda x: re.sub(r'\s+', r'\\t', x.group()), line)
To get rid of spaces at the start, call line.lstrip() before running the regex:
re.sub(r'^[\d\s]+', lambda x: re.sub(r'\s+', r'\\t', x.group()), line.lstrip())
The outer ^[\d\s]+ matches all digits and whitespace at the start of the line, and the inner re.sub replaces each whitespace run in that match with a literal \t.
Output (for lines without .lstrip()):
255\t255\t255\twhite
\t0\t0\t0\tblack
\t47\t79\t79\tdark slate gray
\t47\t79\t79\tDarkSlateGray
\t47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
Output (for lines with .lstrip()):
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray

I'm not familiar enough with Python to answer quickly and accurately in it, but here's JavaScript showing the regex implementation. If the first three fields are always strings of digits, you can handle it this way.
var input = `255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray`
var output = input.replace(/(\d+)\s+/g, '$1\\t')
console.log(output)
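The same idea can be sketched in Python. This is an assumed port of the JavaScript above, not code from the answer: \1 stands in for $1, and the doubled backslash in the replacement template emits one literal backslash.

```python
import re

text = """255 255 255 white
0 0 0 black
47 79 79 dark slate gray"""

# (\d+) captures the number, \s+ consumes the whitespace after it.
# In the replacement, \1 re-inserts the number and \\t is a literal "\t".
output = re.sub(r'(\d+)\s+', r'\1\\t', text)
print(output)
```

Because the whole text is processed at once, \s+ would also swallow a newline that directly follows a number; that does not happen here since every line ends in a color name.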

You can do it in two passes:
import re
txt = """
255 255 255 white
0 0 0 black
47 79 79 dark slate gray
47 79 79 DarkSlateGray
47 79 79 DarkSlateGrey
105 105 105 dim gray
"""
for line in txt.split('\n'):
    line = re.sub(r'^\s+', '', line) # remove leading spaces
    print(re.sub(r'(?<![a-zA-Z])(\s+)', r'\\t', line)) # change other spaces to \t when not preceded by a letter
Output:
255\t255\t255\twhite
0\t0\t0\tblack
47\t79\t79\tdark slate gray
47\t79\t79\tDarkSlateGray
47\t79\t79\tDarkSlateGrey
105\t105\t105\tdim gray
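The outputs above keep the spaces inside the color name, while the output requested in the question shows them removed (darkslategray, dimgray). If that is what is wanted, a regex is not strictly needed. The sketch below assumes every line is three numeric fields followed by a name; the function name and the collapse_name flag are hypothetical:

```python
def to_tab_fields(line: str, collapse_name: bool = True) -> str:
    # Split on runs of whitespace at most three times: r, g, b, then the rest.
    parts = line.split(None, 3)
    if len(parts) < 4:
        return line.strip()  # not a full "r g b name" line, leave it alone
    r, g, b, name = parts
    name = name.strip()
    if collapse_name:
        name = ''.join(name.split())  # "dark slate gray" -> "darkslategray"
    return '\\t'.join([r, g, b, name])

print(to_tab_fields(" 47 79 79 dark slate gray\n"))
```

Passing collapse_name=False keeps the name as written, which matches the outputs shown in the answers.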

Related

Golang regex : Ignore multiple occurrences

I've got a simple need.
Giving this input (string) : 10 20 30 40 65 45 44 67 100 200 65 40 66 88 65
I need to get all numbers between 65 and 66.
Problem is when we have multiple occurrence of each limit.
With a regex like : (65).+(66), I captured 65 45 44 67 100 200 65 40 66. But I would like to get only 40.
How could I achieve this ?
https://regex101.com/r/9HoKxr/1
It sounds like you want to exclude any further '65' between the opening '65' and the first occurrence of '66'. It's a bit verbose, but what about:
\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b
See an online demo
\b65 - Start with '65' preceded by a word boundary;
( - Open capture group;
(?:\s - Non-capture group with the constant of a whitespace char;
(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}) - Nested non-capture group to match any integer but '65' or '66';
)+?) - Close non-capture group and match it at least once but as few times as possible. Then close the capture group;
\s66\b - Match another space followed by '66' and word-boundary.
Note:
Leading spaces are handled with the Trim() function from the strings package;
My examples use '10 20 30 40 65 45 44 40 66 200 65 40 66 88 65', which returns multiple matches. In that case I assume the OP is looking for the 'shortest' matching substring;
'Shortest' means the fewest elements when the substring is split on spaces (using the Fields function from the strings package mentioned above). Therefore '123456' is preferred over '1 2 3' despite being the longer substring in terms of characters;
Try:
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	s := `10 20 30 40 65 45 44 40 66 200 65 40 66 88 65`
	re := regexp.MustCompile(`\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b`)
	matches := re.FindAllStringSubmatch(s, -1) // Retrieve all matches
	shortest := ``
	for i := range matches { // Loop over array
		if shortest == `` || len(strings.Fields(matches[i][1])) < len(strings.Fields(shortest)) {
			shortest = strings.Trim(matches[i][1], ` `)
		}
	}
	fmt.Println(shortest)
}
Try it for yourself here.
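For comparison, the same pattern and the same "fewest elements" rule can be sketched in Python. This is an assumed translation of the Go program above, and the function name shortest_between is made up:

```python
import re

# Same pattern as in the Go example: numbers between 65 and 66 that are
# themselves neither 65 nor 66.
PATTERN = re.compile(r'\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b')

def shortest_between(s: str) -> list[str]:
    # Each match's group 1 is split into its numbers; the candidate with the
    # fewest numbers wins. An empty list is returned when nothing matches.
    candidates = [m.group(1).split() for m in PATTERN.finditer(s)]
    return min(candidates, key=len, default=[])

print(shortest_between('10 20 30 40 65 45 44 40 66 200 65 40 66 88 65'))
```

On the input from the question, the first '65' cannot reach a '66' without crossing another '65', so only the second one produces a match.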

Extract number from a text after symbol "X"

I have the following text in a column, where I need to extract number next to second "X" or "x",
in the below text, it is 54.
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY
Some other sample texts:
21sOE X 12sFL 56 X 36 63" PLAIN # Result must be : 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Given:
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY # Result: 54
21sOE X 12sFL 56 X 36 63" PLAIN # Result: 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Use:
[Xx]\s?(\d+)(?:.(?![Xx]))*$
Demo and Explanation:
https://regex101.com/r/KshMUE/1
You didn't state which tool/language this is using, so it's hard to know for sure what to suggest.
However, if possible, I would consider splitting the string on the letter "x" (or "X") as this makes the regex part much easier to follow. For example, something like:
input = '40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY'
input.split(/x/i)[2][/\d+/]
By doing this split, we first extract only the desired section of the string (in this case, ' 54 54" AWM/C129-DOBY'), so the regex (/\d+/) becomes trivial.
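The split idea carries over to other languages. Here is a rough Python equivalent, assuming the number of interest always sits in the piece after the second x; the function name is hypothetical:

```python
import re

def number_after_second_x(text: str):
    # Split on "x" or "X"; index 2 is everything after the second one.
    parts = re.split(r'x', text, flags=re.IGNORECASE)
    if len(parts) < 3:
        return None  # fewer than two x characters in the text
    match = re.search(r'\d+', parts[2])
    return match.group(0) if match else None

print(number_after_second_x('40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY'))
```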
Try this:
(?i)(?<=x[^x]{1,100}x.)\d+
(?i): case-insensitive
(?<=: start of positive look-behind
x[^x]{1,100}x.: an x followed by 1 to 100 characters other than x, followed by another x and any single character
): end of look-behind
\d+: one or more digits
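Not every engine accepts a variable-length look-behind like this one; Python's built-in re module, for example, requires look-behinds to be fixed-width. A capture group gives the same result there. This is a sketch under that assumption, with a made-up function name:

```python
import re

# Same pieces as the look-behind version, but the prefix is matched normally
# and the digits are pulled out with a capture group instead.
PATTERN = re.compile(r'x[^x]{1,100}x.(\d+)', re.IGNORECASE)

def second_x_number(text: str):
    match = PATTERN.search(text)
    return match.group(1) if match else None

print(second_x_number('16sOE x 12sLY 84 x 48 71" 3/1 DRILL'))
```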

How to match multiline if some string exist while some string don't

I want to match object numbers in a PDF. I only want the number before "obj" when the dictionary that follows it, between << and >>, contains /ObjStm.
Desired match:
363 0
364 0
2 0 is not a desired object number. What regex would match this?
%PDF-1.7
363 0 obj <<
/Filter
/FlateDecode
/First 55 /Length 339 /N 8 /Type
/ObjStm
>>stream
somestring fox jupm over dog.
endstream
2 0 obj <</Type sf >> endobj
364 0 obj <</Filter/FlateDecode/First 657/Length 1492/N
75/Type/ObjStm>>stream somestream.
https://regex101.com/r/wU700E/1
This should work if lookaheads are supported (they usually are):
^\ *([\d ]+)obj\s+<<(?:(?!>>)[\s\S])+/ObjStm(?:(?!>>)[\s\S])*>>
See a demo on regex101.com.
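As a quick check, the pattern can be run against the sample from the question with Python's re module. This is only a sketch: the multiline flag is needed so that ^ matches at each line start, the escaped space is written as a plain space, and the variable names are made up.

```python
import re

SAMPLE = """%PDF-1.7
363 0 obj <<
/Filter
/FlateDecode
/First 55 /Length 339 /N 8 /Type
/ObjStm
>>stream
somestring fox jupm over dog.
endstream
2 0 obj <</Type sf >> endobj
364 0 obj <</Filter/FlateDecode/First 657/Length 1492/N
75/Type/ObjStm>>stream somestream.
"""

# (?:(?!>>)[\s\S]) consumes one character at a time but refuses to step
# onto ">>", so /ObjStm must appear inside the same << ... >> dictionary.
OBJSTM = re.compile(
    r'^ *([\d ]+)obj\s+<<(?:(?!>>)[\s\S])+/ObjStm(?:(?!>>)[\s\S])*>>',
    re.MULTILINE,
)

numbers = [m.group(1).strip() for m in OBJSTM.finditer(SAMPLE)]
print(numbers)
```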

Extract a complete sentence using a QRegularExpression

I am currently trying to extract the following sentence:
This is a rectangle. Its height is 193, its width is 193 and the word number is 12.
from the following line:
ID: 1 x: 1232 y: 2208 w: 193 h: 390 wn: 12 ln: 13 c: This is a rectangle. Its height is 193, its width is 193 and the word number is 12 !
I have to do this using QRegularExpressions. Therefore, my code is as following:
regularExpression.setPattern("[c:](?:\\s*)$");
QRegularExpressionMatch match = regularExpression.match("ID: 2 x: 845 y: 1633 w: 422 h: 491 wn: 78 ln: 12 c: qsdfgh");
if (match.hasMatch()) {
QString id = match.captured(0);
qDebug()<<"The annotation is:"<<id;
return id;
}
return 0;
However, it does not work at all and I do not understand why (maybe my regular expression is not correct). I have been stuck on this problem for several days now.
Could you help me please ?
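Use the following regex to capture everything after c: and skip any whitespace that follows it. The text ends up in capture group 1, so read it with match.captured(1), and note that the backslash must be doubled inside a C++ string literal:
regularExpression.setPattern("c:\\s*(.*$)");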
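The behavior of the capture group is easy to check outside Qt. The following Python sketch uses the same pattern on the line from the question; it is an illustration only, not Qt code:

```python
import re

line = ("ID: 1 x: 1232 y: 2208 w: 193 h: 390 wn: 12 ln: 13 c: This is a rectangle. "
        "Its height is 193, its width is 193 and the word number is 12 !")

# Group 0 would be the whole match including "c:"; group 1 is only the text after it.
match = re.search(r'c:\s*(.*$)', line)
sentence = match.group(1) if match else None
print(sentence)
```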

Regex to detect ASCII art on a single line.

Basically I want to find ASCII art on one line. For me this is any run of 2 or more characters that are not alphanumeric, ignoring whitespace. So a line might look like:
This is a !# Test of --> ASCII art detection ### <--
So the matches I should get are :
!#
-->
###
<--
I came up with this, which still selects spaces :(
\b\W{2,}
I'm using the following website for testing:
http://gskinner.com/RegExr/
Thanks for the help, it's much appreciated!!
I'd suggest something like this:
[^\w\s]{2,}
This will match any sequence of two or more characters that are not word characters (which include alphanumeric characters and underscores) or whitespace characters.
Demonstration
If you would also like to match underscores as part of your 'ASCII art', you'd have to be more specific:
[^a-zA-Z0-9\s]{2,}
Demonstration
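Both patterns can be tried with Python's re.findall. This is a small sketch; the second test string is made up to show how the two patterns treat underscores differently:

```python
import re

line = "This is a !# Test of --> ASCII art detection ### <--"

# Runs of 2+ characters that are neither word characters nor whitespace.
art = re.findall(r'[^\w\s]{2,}', line)
print(art)

# Underscore counts as a word character, so only the second pattern sees "__".
sample = "snake__case and !#"
without_underscores = re.findall(r'[^\w\s]{2,}', sample)
with_underscores = re.findall(r'[^a-zA-Z0-9\s]{2,}', sample)
print(without_underscores, with_underscores)
```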
I think this
((?=[\x21-\x7e])[\W_]){2,}
is probably equivalent to this
[[:punct:]]{2,}
Using POSIX, the supported punctuation is listed below
(to add more, just add it to the class: [[:punct:]<add here>]{2,})
33 = !
34 = "
35 = #
36 = $
37 = %
38 = &
39 = '
40 = (
41 = )
42 = *
43 = +
44 = ,
45 = -
46 = .
47 = /
58 = :
59 = ;
60 = <
61 = =
62 = >
63 = ?
64 = @
91 = [
92 = \
93 = ]
94 = ^
95 = _
96 = `
123 = {
124 = |
125 = }
126 = ~
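Engines without POSIX classes need the list spelled out. Python's re module, for instance, has no [[:punct:]], but string.punctuation holds the same 32 ASCII characters listed above, so an equivalent class can be built from it. A minimal sketch, with a variable name chosen for illustration:

```python
import re
import string

# string.punctuation is the 32 ASCII punctuation characters; re.escape makes
# characters such as ], ^, - and \ safe to place inside a character class.
PUNCT_RUN = re.compile('[' + re.escape(string.punctuation) + ']{2,}')

line = "This is a !# Test of --> ASCII art detection ### <--"
print(PUNCT_RUN.findall(line))
```

To allow extra characters, append them to the string before escaping.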