Python re to retrieve pattern plus x number of characters after the pattern - regex

I want to use python re to search for a string, and then print out that string and the next 4 characters after the string. I can not work out how to do it.
I've tried using the .{4} parameter when I print the pattern, but nothing is displayed (see my code example)
import re
sequence = "I want to know if there are some available 123"
pattern = "available"
if re.search(pattern, sequence):
    print(pattern{.4})  # my attempt; this is the line that does not work
else:
    print("it's not there")
Whatever the next 4 characters are after the search string 'available', I would like to print out the search string and those 4 characters, so in the code example it would print 'available 123'.

You have to concatenate the .{4} to the pattern when searching:
import re
sequence = "I want to know if there are some available 123"
pattern = "available"
res = re.search(pattern + '.{4}', sequence)
if res:
    print(res.group(0))
else:
    print("it's not there")
Output:
available 123
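If the search string might contain regex metacharacters, or fewer than four characters might follow it, a slightly more defensive variant may help. This is only a sketch: the helper name `match_with_tail` and its `n` parameter are hypothetical and not part of the answer above. It escapes the literal with `re.escape` and uses `.{0,n}` so that a short tail still matches.

```python
import re

def match_with_tail(literal, text, n=4):
    """Return `literal` plus up to `n` following characters, or None.

    Hypothetical helper, for illustration only.
    """
    # re.escape keeps characters such as '.' or '+' in `literal` from being
    # read as regex syntax; .{0,n} tolerates a tail shorter than n characters.
    res = re.search(re.escape(literal) + '.{0,%d}' % n, text)
    return res.group(0) if res else None

sequence = "I want to know if there are some available 123"
print(match_with_tail("available", sequence))
print(match_with_tail("missing", sequence))
```

Note that `.` does not match a newline by default, so the tail stops at the end of the line unless `re.S` is passed.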

Related

Extracting floating point number [duplicate]

Assuming I have the following string:
str = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222
"""
I want to extract all numbers, whether int or float, that come after the word "stop" (case-insensitive), so the expected result is:
[5.02, 16.1, 5, 2.3222]
The farthest I have come so far is by using the PyPI regex module, based on another post here:
regex.compile(r'(?<=stop.*)\d+(?:\.\d+)?', regex.I)
but this expression gives me only [5.02, 16.1]
Yet another one, albeit with the newer regex module:
(?:\G(?!\A)|Stop)\D+\K\d+(?:\.\d+)?
See a demo on regex101.com.
In Python, this could be
import regex as re
string = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222
"""
pattern = re.compile(r'(?:\G(?!\A)|Stop)\D+\K\d+(?:\.\d+)?')
numbers = pattern.findall(string)
print(numbers)
And would yield
['5.02', '16.1', '5', '2.3222']
Don't name your variables after built-in functions and types such as str, list, dict and the like.
If you need to go further and limit your search within some bounds (e.g. all numbers between Stop and end), you could as well use
(?:\G(?!\A)|Stop)(?:(?!end)\D)+\K\d+(?:\.\d+)?
# changed part: (?:(?!end)\D)+ in place of \D+
See another demo on regex101.com.
You get only the first 2 numbers, as .* does not match a newline.
You can update the flags to regex.I | regex.S to have the dot match a newline.
import regex
text = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222
"""
pattern = regex.compile(r'(?<=\bstop\b.*)\d+(?:\.\d+)?', regex.I | regex.S)
print(regex.findall(pattern, text))
Output
['5.02', '16.1', '5', '2.3222']
See a Python demo
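To see the effect of the dot-matches-newline flag in isolation, here is a small sketch that uses only the standard `re` module and a shortened sample text (both are assumptions made for illustration, not the answer's code):

```python
import re

text = "Stop 5.02 16.1\nregex\n5 ,#2.3222\n"

# Without re.S the dot stops at the first newline.
first_line = re.search(r'stop(.*)', text, re.I).group(1)
# With re.S (DOTALL) the dot also matches newlines, so the rest of the text is captured.
everything = re.search(r'stop(.*)', text, re.I | re.S).group(1)

print(repr(first_line))
print(repr(everything))
```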
If you want to print the numbers after the word "stop", you can also use python re and match stop, and then capture in a group all that follows.
Then you can take that group 1 value, and find all the numbers.
import re
text = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222
"""
pattern = r"\bStop\b(.+)"
m = re.search(pattern, text, re.S|re.I)
if m:
    print(re.findall(r"\d+(?:\.\d+)*", m.group(1)))
Output
['5.02', '16.1', '5', '2.3222']
You could use:
import re
inp = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222"""
nums = []
if re.search(r'\bstop\b', inp, flags=re.I):
    inp = re.sub(r'^.*?\bstop\b', '', inp, flags=re.S|re.I)
    nums = re.findall(r'\d+(?:\.\d+)?', inp)
print(nums) # ['5.02', '16.1', '5', '2.3222']
The if logic above ensures that we only attempt to populate the array of numbers if we are certain that Stop appears in the input text. Otherwise, the default output is just an empty array. If Stop does appear, then we strip off that leading portion of the string before using re.findall to find all numbers appearing afterwards.
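The question asked for numeric values rather than strings. A minimal sketch of that last step, assuming a hypothetical helper name `numbers_after_stop` and the rule that a token containing a decimal point becomes a float and anything else an int:

```python
import re

def numbers_after_stop(text):
    """Hypothetical helper: numbers after the first 'stop' (any case), as int or float."""
    m = re.search(r'\bstop\b(.*)', text, flags=re.I | re.S)
    if not m:
        return []  # 'stop' never appears in the text
    found = re.findall(r'\d+(?:\.\d+)?', m.group(1))
    # Tokens with a decimal point become float, the rest int.
    return [float(tok) if '.' in tok else int(tok) for tok in found]

sample = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222
"""
print(numbers_after_stop(sample))  # [5.02, 16.1, 5, 2.3222]
```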
import re
_string = """
HELLO 1 Stop #$**& 5.02‼️ 16.1
regex
5 ,#2.3222
"""
start = _string.find("Stop") + len("Stop")
print(re.findall(r"[-+]?\d*\.?\d+", _string[start:])) # ['5.02', '16.1', '5', '2.3222']

Appending a +1 to string of digits with re.sub

How do I use the re.sub python method to append +1 to a phone number?
When I use the following function, it changes the string "802-867-5309" to "+1+15309". I'm trying to get "+1-802-867-5309". The examples in the docs show how to replace the entire string, but I don't want to replace the entire string, just prepend a +1.
import re
def transform_record(record):
    new_record = re.sub("[0-9]+-","+1", record)
    return new_record
print(transform_record("Some sample text 802-867-5309 some more sample text here"))
If you can match your phone numbers with a pattern you may refer to the match value using \g<0> backreference in the replacement.
So, taking the simplest pattern like \d+-\d+-\d+ that matches your phone number, you may use
new_record = re.sub(r"\d+-\d+-\d+", r"+1-\g<0>", record)
See the regex demo. See more ideas on how to match phone numbers at Find phone numbers in python script.
See the Python demo:
import re
def transform_record(record):
    new_record = re.sub(r"\d+-\d+-\d+", r"+1-\g<0>", record)
    return new_record
print(transform_record("Some sample text 802-867-5309 some more sample text here"))
# => Some sample text +1-802-867-5309 some more sample text here
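It may help to see why the pattern from the question produced "+1+15309". The sketch below runs both patterns on the bare phone number; the variable names are chosen only for illustration.

```python
import re

number = "802-867-5309"

# The question's pattern matches each "digits-" run ("802-" and "867-")
# and replaces every one of them with "+1", so only "5309" survives.
broken = re.sub(r"[0-9]+-", "+1", number)

# Matching the whole number once and re-inserting it with \g<0> keeps all digits.
fixed = re.sub(r"\d+-\d+-\d+", r"+1-\g<0>", number)

print(broken)
print(fixed)
```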
You can try this:
new_record = re.sub(r"\d+-[\d+-]+", r"+1-\g<0>", record)

Match first 4 characters in a string

I am using python and regex. I read the file using python and I want to remove some of the words/characters from the file. I am using re.sub(). This is an example of what the strings look like:
Proxy BR 1.05s [HTTPS] 200.203.144.2:50262
I managed to remove the words and all the special characters, leaving, for example,
1.20 187.94.217.693128
but I cannot get rid of the first 4 characters, which are 1.05.
This is my regex:
pattern = "[a-zA-Z\[\],:<>]"
How can I get the first 4 characters to be removed?
Use an anchor (^ represents the start of the string, and .{4} any four characters after that):
import re
re.sub('^.{4}', '', '1.20 187.94.217.693128')
Output:
' 187.94.217.693128'
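To illustrate what the anchor contributes, the following sketch compares the anchored pattern with an unanchored one and with plain slicing. The variable names are illustrative; for a single-line string of at least four characters, slicing gives the same result as the anchored substitution.

```python
import re

s = '1.20 187.94.217.693128'

anchored = re.sub(r'^.{4}', '', s)    # removes only the first four characters
unanchored = re.sub(r'.{4}', '', s)   # removes every consecutive run of four characters
sliced = s[4:]                        # plain slicing, no regex needed

print(repr(anchored))
print(repr(unanchored))
print(anchored == sliced)
```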
The code below only looks for the IPv4 address and port number in the input string. The format for an IP address and port number combination is:
digit{1,3}.digit{1,3}.digit{1,3}.digit{1,3}:digit{1,5}
import re
with open('myproxy.txt', 'r') as infile:  # renamed from input to avoid shadowing the built-in
    lines = infile.readlines()
pattern_to_find = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5})')
for line in lines:
    find_pattern = re.search(pattern_to_find, line)
    if find_pattern:
        print(find_pattern.group())
# outputs
104.248.168.64:3128
54.81.69.91:3128
78.60.130.181:30664
80.120.86.242:46771
109.74.135.246:45769
198.50.172.161:1080
103.250.166.12:47031
88.255.101.244:8080

Python re.match not matching string ending with ".number"

As part of a bigger program, I am trying to check whether a string (a filename) ends with ".number".
However, re.match (both directly and via re.compile) just won't match the pattern at the end of the string.
Code:
import re
f = ".1.txt.2"
print(re.match(r'\.\d$', f))
Output:
>>> print(re.match(r'\.\d$', f))
None
Any help will be much appreciated !
Use search instead of match
From https://docs.python.org/2/library/re.html#search-vs-match
re.match() checks for a match only at the beginning of the string,
while re.search() checks for a match anywhere in the string.
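Applied to the filename from the question, the difference can be sketched as follows (the variable names are illustrative):

```python
import re

f = ".1.txt.2"

# match() only tries position 0: ".1" is found there, but it is not at the end.
at_start = re.match(r'\.\d$', f)
# search() scans forward and finds the final ".2".
anywhere = re.search(r'\.\d$', f)
# match() can still be used if the pattern itself consumes the leading part.
with_prefix = re.match(r'.*\.\d$', f)

print(at_start)
print(anywhere.group())
print(with_prefix.group())
```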
You can try this
import re
word_list = ["c1234", "c12" ,"c"]
for word in word_list:
    m = re.search(r'.*\d+',word)
    if m is not None:
        print(m.group(),"-match")
    else:
        print(word[-1], "- nomatch")

String separation in required format, Pythonic way? (with or w/o Regex)

I have a string in the format:
t='#abc #def Hello this part is text'
I want to get this:
l=["abc", "def"]
s='Hello this part is text'
I did this:
a=t[:t.find(' ',t.rfind('#'))].strip()
s=t[t.find(' ',t.rfind('#')):].strip()
b=a.split('#')
l=[i.strip() for i in b][1:]
It works for the most part, but it fails when the text part has the '#'.
Eg, when:
t='#abc #def My email is red#hjk.com'
it fails. The #names are there in the beginning and there can be text after #names, which may possibly contain #.
Clearly I could prepend a space and look for the first word without '#', but that doesn't seem an elegant solution.
What is a pythonic way of solving this?
Building unashamedly on MrTopf's effort:
import re
rx = re.compile(r"((?:#\w+ +)+)(.*)")
t='#abc #def #xyz Hello this part is text and my email is foo#ba.r'
a,s = rx.match(t).groups()
l = re.split('[# ]+',a)[1:-1]
print(l)
print(s)
prints:
['abc', 'def', 'xyz']
Hello this part is text and my email is foo#ba.r
Justly called to account by hasen j, let me clarify how this works:
/#\w+ +/
matches a single tag - # followed by at least one alphanumeric or _ followed by at least one space character. + is greedy, so if there is more than one space, it will grab them all.
To match any number of these tags, we need to add a plus (one or more things) to the pattern for tag; so we need to group it with parentheses:
/(#\w+ +)+/
which matches one-or-more tags, and, being greedy, matches all of them. However, those parentheses now fiddle around with our capture groups, so we undo that by making them into an anonymous group:
/(?:#\w+ +)+/
Finally, we make that into a capture group and add another to sweep up the rest:
/((?:#\w+ +)+)(.*)/
A last breakdown to sum up:
((?:#\w+ +)+)(.*)
 (?:#\w+ +)+
 (  #\w+ +)
    #\w+ +
Note that in reviewing this, I've improved it - \w didn't need to be in a set, and it now allows for multiple spaces between tags. Thanks, hasen-j!
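The remark about parentheses interfering with the capture groups can be checked directly. In the sketch below (variable names chosen for illustration), a repeated capturing group keeps only its last repetition, while the non-capturing version wrapped in one outer group keeps all the tags.

```python
import re

t = '#abc #def Hello this part is text'

# A repeated capturing group reports only the last tag it matched.
repeated = re.match(r'(#\w+ +)+(.*)', t).groups()
# A non-capturing repetition inside one outer group reports the whole tag run.
wrapped = re.match(r'((?:#\w+ +)+)(.*)', t).groups()

print(repeated)
print(wrapped)
```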
t='#abc #def Hello this part is text'
words = t.split(' ')
names = []
while words:
    w = words.pop(0)
    if w.startswith('#'):
        names.append(w[1:])
    else:
        break
text = ' '.join(words)
print(names)
print(text)
How about this:
1. Split the string by spaces.
2. For each word, check:
2.1. if the word starts with #, push it to the first list;
2.2. otherwise, just join the remaining words with spaces.
You might also use regular expressions:
import re
rx = re.compile("#([\w]+) #([\w]+) (.*)")
t='#abc #def Hello this part is text and my email is foo#ba.r'
a,b,s = rx.match(t).groups()
But this all depends on what your data can look like, so you might need to adjust it. Basically, it creates groups via () and checks what is allowed in them.
[i.strip('#') for i in t.split(' ', 2)[:2]] # for a fixed number of #def
a = [i.strip('#') for i in t.split(' ') if i.startswith('#')]
s = ' '.join(i for i in t.split(' ') if not i.startswith('#'))
[edit: this is implementing what was suggested by Osama above]
This will create L based on the # variables from the beginning of the string, and then once a non # var is found, just grab the rest of the string.
t = '#one #two #three some text afterward with # symbols# meow#meow'
words = t.split(' ') # split into list of words based on spaces
L = []
s = ''
for i in range(len(words)): # go through each word
    word = words[i]
    if word[0] == '#': # grab #'s from beginning of string
        L.append(word[1:])
        continue
    s = ' '.join(words[i:]) # put spaces back in
    break # you can ignore the rest of the words
You can refactor this to be less code, but I'm trying to make what is going on obvious.
Here's just another variation that uses split() and no regexpes:
t='#abc #def My email is red#hjk.com'
tags = []
words = iter(t.split())
# iterate over words until first non-tag word
for w in words:
    if not w.startswith("#"):
        # join this word and all the following
        s = w + " " + (" ".join(words))
        break
    tags.append(w[1:])
else:
    s = "" # handle string with only tags
print(tags, s)
Here's a shorter but perhaps a bit cryptic version that uses a regexp to find the first space followed by a non-# character:
import re
t = '#abc #def My email is red#hjk.com #extra bye'
m = re.search(r"\s([^#].*)$", t)
tags = [tag[1:] for tag in t[:m.start()].split()]
s = m.group(1)
print(tags, s) # ['abc', 'def'] My email is red#hjk.com #extra bye
This doesn't work properly if there are no tags or no text. The format is underspecified. You'll need to provide more test cases to validate.
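One possible way to tolerate missing tags or missing text is sketched below. It rests on assumptions the question does not confirm: a tag is `#` followed by word characters, and it must be followed by whitespace or the end of the string. The function name `split_tags` is hypothetical.

```python
import re

def split_tags(t):
    """Hypothetical helper: return (leading tags, remaining text)."""
    # Group 1: zero or more leading "#word" tags, each followed by whitespace
    # or the end of the string. Group 2: whatever is left over.
    m = re.match(r'((?:#\w+(?:\s+|$))*)(.*)', t, re.S)
    tags = re.findall(r'#(\w+)', m.group(1))
    return tags, m.group(2)

print(split_tags('#abc #def My email is red#hjk.com #extra bye'))
print(split_tags('#abc #def'))   # tags only
print(split_tags('just text'))   # text only
```

Because every part of the pattern is optional, `re.match` always succeeds here, so no `None` check is needed.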