Appending a +1 to string of digits with re.sub - regex

How do I use the re.sub python method to append +1 to a phone number?
When I use the following function, it changes the string "802-867-5309" to "+1+15309", but I am trying to get "+1-802-867-5309". The examples in the docs show how to replace the entire string; I don't want to replace the entire string, just add +1 in front of it.
import re
def transform_record(record):
    new_record = re.sub("[0-9]+-","+1", record)
    return new_record
print(transform_record("Some sample text 802-867-5309 some more sample text here"))

If you can match your phone numbers with a pattern you may refer to the match value using \g<0> backreference in the replacement.
So, taking the simplest pattern like \d+-\d+-\d+ that matches your phone number, you may use
new_record = re.sub(r"\d+-\d+-\d+", r"+1-\g<0>", record)
See the regex demo. See more ideas on how to match phone numbers at Find phone numbers in python script.
See the Python demo:
import re
def transform_record(record):
    new_record = re.sub(r"\d+-\d+-\d+", r"+1-\g<0>", record)
    return new_record
print(transform_record("Some sample text 802-867-5309 some more sample text here"))
# => Some sample text +1-802-867-5309 some more sample text here
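If the numbers always follow the NNN-NNN-NNNN layout, a stricter pattern may be safer than \d+-\d+-\d+. The sketch below is only an illustration under that assumption: the name add_country_code and the lookbehind that skips numbers already carrying +1- are additions that do not appear in the answer above.

```python
import re

# Assumption: US-style numbers written as NNN-NNN-NNNN.
# The negative lookbehind skips numbers that already start with "+1-".
PHONE = re.compile(r"(?<!\+1-)\b\d{3}-\d{3}-\d{4}\b")

def add_country_code(record):
    # \g<0> inserts the whole match, so only the prefix is added.
    return PHONE.sub(r"+1-\g<0>", record)

print(add_country_code("Some sample text 802-867-5309 some more sample text here"))
print(add_country_code("Already prefixed: +1-802-867-5309"))
```

Because of the lookbehind, running the function twice on the same text should not stack a second prefix.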

You can try this:
new_record = re.sub(r"\d+-[\d+-]+", r"+1-\g<0>", record)

Related

convert string to regex pattern

I want to find the pattern of a regular expression from a character string. My goal is to be able to reuse this pattern to find a string in another context but checking the pattern.
From the string "1example4whatitry2do",
I want to find a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}
so that I can reuse this pattern to find another string with the same shape, for example 2eytmpxe8wsdtmdry1uo.
I can loop over each character, but I hope there is a faster way.
Thanks for your help !
You can puzzle this out:
go over your strings characterwise
if the character is a text character add a 't' to a list
if the character is a number add a 'd' to a list
if the character is something else, add itself to the list
Use itertools.groupby to group consecutive identical letters into groups.
Create a pattern from the group-key and the length of the group using some string literal formatting.
Code:
from itertools import groupby
from string import ascii_lowercase
lower_case = set(ascii_lowercase) # set for faster lookup
def find_regex(p):
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)
    grp = groupby(cum)
    return ''.join(f'\\{what}{{{how_many}}}'
                   if how_many>1 else f'\\{what}'
                   for what,how_many in ( (g[0],len(list(g[1]))) for g in grp))
pattern = "1example4...whatit.ry2do"
print(find_regex(pattern))
Output:
\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}
The ternary in the formatting drops the unneeded {1} from the pattern.
See:
str.isdigit()
If you now replace '\t' with '[a-z]', your regex should fit. Instead of the isdigit check you could also use a regex r'\d' or test membership in set(string.digits).
pattern = "1example4...whatit.ry2do"
pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
See
string module for ascii_lowercase and digits
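To tie this back to the question, here is a compact sketch that builds the pattern directly with [a-z] and then checks the second string from the question against it. The names char_class and shape_pattern are made up for this illustration, and re.escape is used for other characters instead of a plain backslash prefix.

```python
import re
from itertools import groupby

def char_class(c):
    # Map one character to the regex fragment that describes it.
    if c.isdigit():
        return r"\d"
    if "a" <= c <= "z":
        return "[a-z]"
    return re.escape(c)  # anything else must match literally

def shape_pattern(sample):
    parts = []
    # groupby collapses runs of characters that share the same fragment.
    for fragment, run in groupby(sample, key=char_class):
        n = len(list(run))
        parts.append(fragment if n == 1 else f"{fragment}{{{n}}}")
    return "".join(parts)

pat = shape_pattern("1example4whatitry2do")
print(pat)
print(bool(re.fullmatch(pat, "2eytmpxe8wsdtmdry1uo")))
```

re.fullmatch is used so that the whole candidate string has to have the same shape, not just a part of it.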

How to get all sub-strings of a specific format from a string

I have a large string and I want to get all sub-strings of format [[someword]] from it.
Meaning, get all words (list) which are wrapped in opening and closing square brackets.
One way to do this is to split the string on spaces and then filter the list, but the problem is that [[someword]] does not always stand alone as a word: it might have a comma, a space or a period right before or after it.
What is the best way to do this?
I will appreciate a solution in Scala but as this is more of a programming problem, I will convert your solution to Scala if it's in some other language I know e.g. Python.
This question is different from the marked duplicate because the regex needs to be able to accommodate characters other than English characters in between the brackets.
You can use this (?<=\[{2})[^[\]]+(?=\]{2}) regex to match and extract all the words you need that are contained in double square brackets.
Here is a Python solution,
import re
s = 'some text [[someword]] some [[some other word]]other text '
print(re.findall(r'(?<=\[{2})[^[\]]+(?=\]{2})', s))
Prints,
['someword', 'some other word']
I have never worked in Scala, but here is a solution in Java. Since Scala runs on the JVM and can use Java's regex classes, this may help.
String s = "some text [[someword]] some [[some other word]]other text ";
Pattern p = Pattern.compile("(?<=\\[{2})[^\\[\\]]+(?=\\]{2})");
Matcher m = p.matcher(s);
while(m.find()) {
    System.out.println(m.group());
}
Prints,
someword
some other word
Let me know if this is what you were looking for.
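As a variation, a capture group can do the same job as the lookarounds, and the negated character class accepts non-English characters between the brackets as well. This is just a sketch; the sample text is invented for the illustration.

```python
import re

# Sample text is made up; it puts punctuation right next to the brackets
# and uses non-English characters inside them.
text = "intro [[someword]], then [[größe]]. and [[日本語 text]] end"

# The group captures everything between [[ and ]] that is not a bracket.
words = re.findall(r"\[\[([^\[\]]+)\]\]", text)
print(words)
```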
Scala solution:
val text = "[[someword1]] test [[someword2]] test 1231"
val pattern = "\\[\\[([\\p{L}\\p{N}]+)\\]\\]".r // match bracketed words (letters and digits) and capture the content in a group
val values = pattern
  .findAllIn(text)
  .matchData
  .map(_.group(1)) // get 1st group
  .toList
println(values)

Replacing everything but the matching regex string

I've searched for this answer but haven't found an answer that exactly works.
I have the following pattern where the hashes are any digit: 102###-###:#####-### or 102###-###:#####-####
It must start with 102 and the last set in the pattern can either be 3 or 4 digits.
The problem is that a string can contain between 1 and 5 of these patterns, with any sort of characters in between (spaces, letters, etc.). The regex I posted below matches the patterns well, but I am trying to select everything that is NOT this pattern so I can remove it. The end goal is to extract all the patterns and just have all the patterns comma delimited as the output. (Pattern, Pattern, Pattern) How do I accomplish this with regex? Perhaps there is a better way than this approach? Thanks. This is using VBA.
Regex For Pattern:(\D102\d{3}-\d{3}:\d{5}-\d{3,4}\D)
String Example: type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff
There is no need to match everything you don't want just to remove it; that is more difficult. Match what you need instead and do whatever you want with it.
See regex in use here
(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)
See code in use here
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim sourcestring As String = "type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff"
        Dim re As Regex = New Regex("(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)")
        Dim mc As MatchCollection = re.Matches(sourcestring)
        For Each m As Match In mc
            Console.WriteLine(m.Groups(0).Value)
        Next
    End Sub
End Module
Result:
102456-345:56746-234
102456-345:56746-2343
102456-345:56746-234
102456-345:56746-2345
I am trying to select everything that is NOT this pattern so I can remove it. The end goal is to extract all the patterns and just have all the patterns comma delimited as the output
If you want to extract the patterns, then just do that, without removing everything around them. Example in Python: (Posted before the question's language was specified, but I'm sure the same can be done in VBA.)
>>> import re
>>> p = r"102\d{3}-\d{3}:\d{5}-\d{3,4}"
>>> text = "type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff"
>>> ",".join(re.findall(p, text))
'102456-345:56746-234,102456-345:56746-2343,102456-345:56746-234,102456-345:56746-2345'
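The two answers can be combined: the digit lookarounds from the first answer also work with re.findall, which keeps a longer digit run from being matched in part. A small sketch, where the function name find_ids is made up for the illustration:

```python
import re

# Same pattern as above, wrapped in lookarounds so a match can be
# neither preceded nor followed by another digit.
ID_PATTERN = re.compile(r"(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)")

def find_ids(text):
    return ID_PATTERN.findall(text)

text = ("type:102456-345:56746-234 102456-345:56746-2343 "
        "FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff")
print(",".join(find_ids(text)))
print(find_ids("9102456-345:56746-234"))  # leading extra digit, so nothing is matched
```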

RegEx pattern returning all words except those in parenthesis

I have a text of the form:
können {konnte, gekonnt} Verb
And I want to get a match for all words in it that are not inside the braces. That means:
können = 1st match, Verb = 2nd match
Unfortunately I still don't have the knack of regular expressions. There are a lot of tools for testing them but not much help for creating them, unless you want to read a book.
I will use them in Java or Python.
In Python you could do this:
import re
regex = re.compile(r'(?:\{.*?\})?([^{}]+)', re.UNICODE)
print('Matches: %r' % regex.findall('können {konnte, gekonnt} Verb'))
Result:
Matches: ['können ', ' Verb']
Although I would recommend simply replacing everything between { and } like so:
import re
regex = re.compile(r'\{.*?\}', re.UNICODE)
print('Output string: %r' % regex.sub('', 'können {konnte, gekonnt} Verb'))
Result:
Output string: 'können  Verb'
A regex SPLIT using this pattern will do the job:
(\s+|\s*{[^}]*\}\s*)
and ignore any empty value.
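In Python, the split approach could look roughly like this. It is a sketch: the opening brace is escaped and the outer capturing group is dropped, because re.split would otherwise return the separators too.

```python
import re

text = 'können {konnte, gekonnt} Verb'

# Split on runs of whitespace or on a {...} block with optional
# surrounding whitespace, then drop the empty pieces.
parts = [p for p in re.split(r'\s+|\s*\{[^}]*\}\s*', text) if p]
print(parts)
```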

Python extract words from a txt file

Is it possible to search for a series of words & extract the next word? For example, in a txt file, search for the word 'Test' & then return the word directly after it?
Test.txt
This is a test to test the function of the python code in the test environ_ment
I'm looking to get the results:-
to, the, environ_ment
You can use a regular expression for this:
import re
txt = "This is a test to test the function of the python code in the test environ_ment"
print(re.findall(r"test\s+(\S+)", txt)) # ['to', 'the', 'environ_ment']
The regular expression matches with "test" when it is followed by white space (\s+) and a series of non-white space characters \S+. The latter matches the words you are looking for and is put in a capture group (with parentheses) in order to return that part of the matches.
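Since the question searches for 'Test' while the text contains 'test', a case-insensitive variant may be closer to what was asked. The helper below is a sketch with a made-up name, next_words; the text is passed in as a string, so the contents of Test.txt would have to be read first.

```python
import re

def next_words(text, keyword):
    # re.escape guards against keywords containing regex metacharacters;
    # \b keeps the keyword from matching inside a longer word.
    pattern = rf"\b{re.escape(keyword)}\s+(\S+)"
    return re.findall(pattern, text, flags=re.IGNORECASE)

txt = "This is a test to test the function of the python code in the test environ_ment"
print(next_words(txt, "Test"))
```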