In reference to a previous question
Python data extract from text file - script stops before expected data match
How can I capture a match and the previous two lines?
I tried this but get:
unterminated subpattern at position 0 (line 1, column 1)
output = re.findall('(.*\r\n{2}random data.',f.read(), re.DOTALL)
You may use
re.findall(r'(?:.*\r?\n){2}.*random data.*', s)
Note that you must not use re.DOTALL here: with it, .* would match up to the end of the input, and you would only get the last occurrence.
See the Python demo
Pattern details
(?:.*\r?\n){2} - 2 occurrences of a sequence of
.* - any 0+ chars other than line break chars, as many as possible (a line)
\r?\n - a line ending (CRLF or LF)
.*random data.* - a line containing random data substring.
See the regex demo.
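A minimal sketch of the pattern in use. The sample text is made up for illustration (the real file is not shown in the question), and only LF line endings are used to keep the output easy to read:

```python
import re

# Hypothetical sample text standing in for f.read()
text = (
    "header\n"
    "alpha\n"
    "beta\n"
    "first random data here\n"
    "gamma\n"
    "delta\n"
    "second random data here\n"
)

# Two full lines followed by the line that contains "random data".
# No re.DOTALL: each .* has to stay within a single line.
pattern = re.compile(r'(?:.*\r?\n){2}.*random data.*')

matches = pattern.findall(text)
for block in matches:
    print(block)
    print("---")
```

Each item in matches is one string holding the two preceding lines plus the matching line, so block.splitlines() gives the three lines separately if needed.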
Related
I have a pipe delimited file which has a line
H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||
I want to replace the date (28092017), which matches the regex "[0-9]{8}", if the first character of the line is "H".
I tried the following example to test my understanding, where I'm trying to substitute "a" with "i".
str = "|123||a|"
str.gsub /\|(.*?)\|(.*?)\|(.*?)\|/, "\|\\1\|\|\\1\|i\|"
But this gives the following output:
"|123||123|i|"
Any clue how this can be achieved?
You may replace the first occurrence of 8 digits inside pipes if a string starts with H using
s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
p s.gsub(/\A(H.*?\|)[0-9]{8}(?=\|)/, '\100000000')
# or
p s.gsub(/\AH.*?\|\K[0-9]{8}(?=\|)/, '00000000')
See the Ruby demo. Here, the value is replaced with 8 zeros.
Pattern details
\A - start of string (^ is the start of a line in Ruby)
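(H.*?\|) - Capturing group 1 (you do not need it when using the variation with \K): H, then any 0+ chars other than line break chars, as few as possible, and then a | (the first one that is followed by eight digits and another |)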
\K - match reset operator that discards the text matched so far
[0-9]{8} - eight digits
(?=\|) - the next char must be |, but it is not added to the match value since it is a positive lookahead that does not consume text.
The \1 in the first gsub is a replacement backreference to the value in Group 1.
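For comparison, here is a rough Python equivalent of the first gsub variant. This is only a sketch: Python's re module does not support \K, so only the capturing-group form carries over, and the function name mask_first_date is made up for this example. Note that the replacement uses \g<1> rather than \1, since in Python the digits that follow would otherwise be read as part of the group number.

```python
import re

s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"

def mask_first_date(line):
    # Hypothetical helper: replaces the first pipe-delimited run of exactly
    # eight digits with zeros, but only if the string starts with H.
    # \g<1> puts back the text captured by group 1 (everything before the date).
    return re.sub(r'\A(H.*?\|)[0-9]{8}(?=\|)', r'\g<1>00000000', line)

masked = mask_first_date(s)
print(masked)

# A string that does not start with H is returned unchanged.
untouched = mask_first_date("D||X|28092017|")
print(untouched)
```

The 10-digit field 1010032000 is skipped because the eight digits must be followed directly by a pipe, and 25001853 is left alone because the pattern is anchored with \A and can only match once.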
I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I tried the regex .*(\-I).*(\-V), but it doesn't work. Then I tried .*(\-I), but it matches up to the last -I, the one in -IPRMKT.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
regex = r"(.*?-I\d{2})-(.*)"
Here is a test script in Python
import re

# Raw string, so that \d reaches the regex engine unchanged
regex = r"(.*?-I\d{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))  # PSC-CAMPO-GRANDE-I08
    print(match.group(2))  # V00-C09-H09-IPRMKT
else:
    print("nope")
In the regex, I'm capturing everything up to the first -I followed by two digits. Then I match, but don't capture, a -. Then I capture the rest. I can help tweak it if you have more logic that you are trying to implement.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars, as few as possible, up to the first -I (because *? is a lazy quantifier that matches up to the first occurrence of the subsequent subpattern)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
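A quick way to check this pattern in Python, as a sketch (the variable names are only for illustration):

```python
import re

value = "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT"

# Group 1: lazily up to the first -I plus the rest of that segment.
# Group 2: everything after the hyphen that follows it.
match = re.match(r'^(.*?-I[^-]*)-(.*)', value)

if match:
    first, second = match.groups()
    print(first)
    print(second)
else:
    first = second = None
    print("no -I segment found")
```

Unlike \d{2}, [^-]* does not assume how many characters follow -I, so it also covers values such as -I8 or -I123.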
How do you use regex to insert | every two characters from a starting position to the end of the line?
Using regex on the following sample (tshark output of packet data), the regex inserts | after the first two characters and the next two characters, but does not apply the pattern to the rest of the line. I think the issue is with a repeated pattern on the 2nd grouping (or lack thereof).
Sample:
1478646603.255173000 10.10.10.1 0000000000000000000000
^(.{34})(..) replace with \1|\2| OR ^(.{34})(.*?(..)) replace with \1|\2
Produces this:
1478646603.255173000 10.10.10.1 00|00|000000000000000000
What I want is:
1478646603.255173000 10.10.10.1 00|00|00|00|00|00|00|00|00|00|00
You may use
(?:\G(?!^)|^.{36})\K..(?!$)
and replace with $&|.
Details:
(?:\G(?!^)|^.{36}) - matches the location at the end of the previous successful match (with \G(?!^)) or (|) the start of a line (^) and the first 36 characters other than linebreak chars (.{36})
\K - the match reset operator that discards the whole text matched so far
.. - any 2 chars other than linebreak chars
(?!$) - that are not at the end of the string.
The replacement pattern only contains the backreference to the whole match ($&) and a | pipe symbol (a literal symbol in the replacement pattern).
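If you need the same result in Python, note that its built-in re module supports neither \G nor \K, so the pattern above cannot be used there as-is. The sketch below swaps in a different technique: it splits off the last whitespace-delimited field and joins its two-character chunks with |. It assumes the packet data is always the last field on the line, and the function name pipe_separate_payload is made up for this example.

```python
import re

def pipe_separate_payload(line):
    # Hypothetical helper. Group 1: everything up to and including the last
    # whitespace character; group 2: the trailing run of non-whitespace chars.
    m = re.match(r'(.*\s)(\S+)$', line)
    if m is None:
        return line  # no whitespace-delimited payload, leave the line alone
    prefix, payload = m.groups()
    # Cut the payload into 2-character chunks and join them with pipes,
    # so no pipe is added after the final chunk.
    pairs = [payload[i:i + 2] for i in range(0, len(payload), 2)]
    return prefix + "|".join(pairs)

sample = "1478646603.255173000 10.10.10.1 " + "00" * 11
result = pipe_separate_payload(sample)
print(result)
```

Because the payload is located by whitespace rather than by a fixed column, this does not depend on the exact width of the timestamp and address fields.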
<.*>|\n.*\s.*\sid="(\w*)".*\n+|.*>\n|\n.+
and replace with $1.
This regex extracts every id from the file.
<a href="java" class="total" id="maker" placeholder="getTheResult('local6')">master6<a>
Result is maker
How can I extract the key name passed to getTheResult,
so that my result will be local6?
I tried <.*>|\n.*\s.*\sgetTheResult('(\w*)').*\n+|.*>\n|\n.+ but it didn't help.
I assume that:
you have files with text like getTheResult('local6')
you may have several values like that on a line
you'd like to keep only those values, one value per line.
I suggest
getTheResult\('([^']*)'\)|(?:(?!getTheResult\(')[\s\S])*
and replace with $1\n. The \n will insert a newline between the values. You can then use ^\n regex (to replace with empty string) to remove empty lines.
Pattern details:
getTheResult\(' - matches getTheResult(' as a literal string (note the ( is escaped)
([^']*) - Group 1 capturing 0+ chars other than '
'\) - a literal ')
| - or
(?:(?!getTheResult\(')[\s\S])* - 0+ chars that are not starting chars of the getTheResult(' character sequence (this is a tempered greedy token).
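If a script is an option instead of search-and-replace in an editor, only the first alternative is needed. A minimal Python sketch follows; the second input line is made up to show several values, it is not from the question:

```python
import re

# First line is the sample from the question; the second is hypothetical.
html = (
    "<a href=\"java\" class=\"total\" id=\"maker\" "
    "placeholder=\"getTheResult('local6')\">master6<a>\n"
    "<a href=\"java\" class=\"total\" id=\"other\" "
    "placeholder=\"getTheResult('local7')\">master7<a>\n"
)

# With a single capturing group, findall returns just the captured key names.
keys = re.findall(r"getTheResult\('([^']*)'\)", html)

# One value per line, as in the replacement approach.
print("\n".join(keys))
```

Here findall skips everything between the matches on its own, so the tempered greedy token is only required when the unwanted text has to be consumed by the replacement itself.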
I have a multiline string like this:
SA21 abcdef
BKxyz
SA21 abcdef
I need a regex that only matches if the line ^SA21 abcdef$ is present once. So it should not match for the first example but it should match for this one:
BK udsia
SA21 abcdef
BKxyz
I tried to capture the line and make sure it matches only when the same line is not found later: /(^SA21 abcdef$)(?!\1)/m (tested on regex101), but that does not work, as it will probably always match the last line...
The regex you want should only match a line if the line is not present before or after the single occurrence of the line. This is achieved with a tempered greedy token:
/\A(?:(?!^SA21 abcdef$).)*(^SA21 abcdef$)(?:(?!^SA21 abcdef$).)*\z/ms
See the regex demo
The (?:(?!^SA21 abcdef$).)* is the token matching any text but the beginning of the SA21 abcdef line. The /s modifier is required so that a . could match a newline.
However, the construct is resource consuming, and it is a good idea to unroll it:
/\A(?:(?!SA21 abcdef$).*)?(?:\n+(?!SA21 abcdef$).*)*\n*^(SA21 abcdef)$(?:\n+(?!SA21 abcdef$).*)*\z/m
See another demo
Note that \A and \z are unambiguous start/end string anchors, the /m modifier does not affect them.
Pattern explanation:
\A - start of string
(?:(?!SA21 abcdef$).*)? - an optional first line that is not SA21 abcdef (it is needed because the first line is not preceded by a newline, so the next group cannot match it)
(?:\n+(?!SA21 abcdef$).*)* - zero or more sequences of:
\n+ - 1 or more newlines ...
(?!SA21 abcdef$) - not followed with SA21 abcdef that is the whole line
.* - zero or more chars other than a newline
\n* - zero or more newlines
^ - start of a line
(SA21 abcdef) - the line that must be single
$ - end of line
(?:\n+(?!SA21 abcdef$).*)* - see above
\z - end of string.
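To try the idea outside of a PCRE tool, here is a Python sketch of the tempered greedy token version. It is an adaptation, not a copy: re.fullmatch stands in for the \A and \z anchors, re.MULTILINE and re.DOTALL correspond to the /m and /s modifiers, and the helper name occurs_exactly_once is made up for this example.

```python
import re

TARGET = "SA21 abcdef"
line = re.escape(TARGET)

# (?:(?!^LINE$).)* consumes any text that does not start the target line.
# fullmatch forces the whole string to be consumed, like \A ... \z.
single_line_re = re.compile(
    rf"(?:(?!^{line}$).)*(^{line}$)(?:(?!^{line}$).)*",
    re.MULTILINE | re.DOTALL,
)

def occurs_exactly_once(text):
    # Hypothetical helper: True only if TARGET is a whole line exactly once.
    return single_line_re.fullmatch(text) is not None

twice = "SA21 abcdef\nBKxyz\nSA21 abcdef"
once = "BK udsia\nSA21 abcdef\nBKxyz"

print(occurs_exactly_once(twice))
print(occurs_exactly_once(once))

# Cross-check without a regex: count whole lines equal to TARGET.
print(once.splitlines().count(TARGET) == 1)
```

If a plain boolean is all you need and a regex is not a hard requirement, the line count in the last statement is simpler and avoids the backtracking cost mentioned above.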