Find the first to last alphabet in a string - regex

I am new to Python and got pretty confused when reading the regex documentation. From what I understand, re.search searches everywhere in a string while re.match only searches the start of the string. But when do I have to use re.compile?
I tried playing around with regex but could not get it to work. If I have a string that mixes letters, punctuation, numbers and spaces, how can I obtain the part of the string that runs from the first letter to the last letter?
import re
a = "123,12 jlkjL kSljdf 12.2"
test = re.search('^[a-zA-Z]', a)
print test
The output I am trying to get is jlkjL kSljdf.

You may use re.compile to compile a regex object before using the regex operation.
There are two options to achieve what you want: matching the letters with spaces and then stripping redundant whitespace, or removing all non-letter symbols from the start/end:
import re
a = "123,12 jlkjL kSljdf 12.2"
rg = re.compile(r'[a-zA-Z ]+')
mtch = rg.search(a)
if mtch:
    print(mtch.group().strip())  # => jlkjL kSljdf
# Stripping non-letters from the start/end
rx = re.compile(r'^[^a-zA-Z]+|[^a-zA-Z]+$')
print(rx.sub('', a)) # => jlkjL kSljdf
In the first approach, add a space to the character class and put a + (1 or more occurrences) quantifier on it.
In the second approach, ^[^a-zA-Z]+ matches 1 or more (+) characters other than letters ([^a-zA-Z]) at the start of the string (^) OR (|) 1 or more chars other than letters at the end of the string ($).
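The difference between re.match and re.search, and one more way to get the span from the first to the last letter, can be sketched as follows. This is only an illustration: the names SPAN and first_to_last_letter are not from the answer above, and the pattern assumes ASCII letters and a single-line string (. does not match a newline by default).

```python
import re

text = "123,12 jlkjL kSljdf 12.2"

# re.match only tries at position 0, so it fails here: the string starts with a digit.
print(re.match(r"[a-zA-Z]+", text))            # None

# re.search scans forward and returns the first place where the pattern matches.
print(re.search(r"[a-zA-Z]+", text).group())   # jlkjL

# re.compile is never required; it just stores the pattern so it can be reused.
# Hypothetical pattern: a letter, optionally followed by anything that ends in a letter.
SPAN = re.compile(r"[a-zA-Z](?:.*[a-zA-Z])?")

def first_to_last_letter(s):
    """Return the text from the first to the last ASCII letter, or '' if there is none."""
    m = SPAN.search(s)
    return m.group() if m else ""

print(first_to_last_letter(text))              # jlkjL kSljdf
```

Because .* is greedy, the optional group stretches to the last letter in the string, which is what makes the result run "first to last".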

Related

Regex to remove trailing optional garbage

I want to clean strings that may contain garbage at the end, always separated by a forward slash / and if there is no garbage, there is no separator.
Example > expected output
Foo/Bar > Foo
Foobar > Foobar
I tried several versions like these to extract only the payload, but none of them worked:
(.*)\/.*
(.*)?\/.*
(.*)?\/*.*
And so on. The problem is that I always get only the first or the second line to match.
What would be the correct expression to extract the wanted information?
Your first and second patterns capture everything before a / and therefore require a / to be present, so they cannot match a line that has none, such as Foobar.
The third pattern matches the whole line as the /* matches an optional forward slash, so the capture group will match the whole line, and the .* will not match any characters any more as the capture group is already at the end of the line.
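A quick way to see this behaviour is to run the three attempts against both sample lines. The following is a small sketch in Python (the question does not name a language); the helper first_group and the use of re.fullmatch are choices made for this illustration only.

```python
import re

attempts = [r"(.*)\/.*", r"(.*)?\/.*", r"(.*)?\/*.*"]
lines = ["Foo/Bar", "Foobar"]

def first_group(pattern, line):
    """Return what group 1 captures when the pattern matches the whole line, else None."""
    m = re.fullmatch(pattern, line)
    return m.group(1) if m else None

results = {p: [first_group(p, line) for line in lines] for p in attempts}
for pattern, captured in results.items():
    print(pattern, captured)
# (.*)\/.*    ['Foo', None]            -> no match without a slash
# (.*)?\/.*   ['Foo', None]            -> same problem
# (.*)?\/*.*  ['Foo/Bar', 'Foobar']    -> group 1 swallows the whole line
```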
You could write the pattern with a capture group for 1 or more word characters as the first part, and an optional second part starting the match from / till the end of the string.
In the replacement you can use the first capture group.
^(\w+)(?:\/.*)?$
^ Start of string
(\w+) Capture 1+ word characters in group 1
(?:\/.*)? Optionally match / and the rest of the line (to be removed after the replacement)
$ End of string
See a regex demo.
There is no language listed, but an example using JavaScript:
const regex = /^(\w+)(?:\/.*)?$/m;
const str = `Foo/Bar
Foobar`;
const result = str.replace(regex, "$1");
console.log(result);
Example using Python
import re
regex = r"^(\w+)(?:\/.*)?$"
test_str = ("Foo/Bar\n"
"Foobar")
result = re.sub(regex, r'\1', test_str, flags=re.MULTILINE)
if result:
    print(result)
Output
Foo
Foobar
You can use replace here as:
const cleanString = (str) => str.replace(/\/.*/, "");
console.log(cleanString("Foo/Bar"));
console.log(cleanString("Foobar"));
This task doesn't need the power of regex, you need to split on the first slash, e.g. in Python:
test_string.split('/', 1)[0]
I think the reason your regex doesn't work is that Foobar has no / to match on. So for regex you need to handle none, one, or many slashes. Again, in Python:
>>> import re
>>> test = ['foobar', 'foo/bar', 'foo/bar/baz']
>>> for s in test:
...     print(re.findall('^(.*?)(?=/|$)', s))
...
['foobar']
['foo']
['foo']
The regex says: from the start of the string, group all characters (non-greedy) until either a slash or the end of the string.
You can also split on / with re.split and take the first element of the resulting list. For example, in Python:
import re
# string is the input, e.g. 'Foo/Bar'
new_string = re.split('/', string)[0]
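For completeness, here is a regex-free sketch of the same idea using str.partition, which splits at the first separator only and returns the whole string as the first element when no separator is present. The function name strip_trailing_garbage is made up for this example.

```python
def strip_trailing_garbage(text, separator="/"):
    """Return the part of text before the first separator (all of text if there is none)."""
    payload, _sep, _garbage = text.partition(separator)
    return payload

for sample in ["Foo/Bar", "Foobar", "foo/bar/baz"]:
    print(sample, ">", strip_trailing_garbage(sample))
# Foo/Bar > Foo
# Foobar > Foobar
# foo/bar/baz > foo
```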

How to print Hindi character from a string in python using regular expression?

I'm using regex in python and trying to extract 'Hindi' character from the given string and then print it but I'm not able to do so. I'm trying to extract जनवरी12 and जनवरी22 from the string. The code should search for a phrase that starts with जनवरी(or any hindi character) and ends with 12( or any number). Here is the code:
import re
string = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
mo = re.compile(r'[^(^a-zA-Z-0-9)]+\d+')
print(mo.findall(string))
Output:
[' 12', 'वें संस्करण जनवरी12', ' 12', ' जनवरी22']
I know that [^abc] matches any character that isn’t between the brackets and tried to achieve the same with [^(^a-zA-Z-0-9)]+ but the output is not what I expected.
Expected output:
जनवरी12, जनवरी22
Can anyone explain how this should be done, and how to match the start and end of such a phrase with Python's regex?
I think you just need a pattern that matches 1+ letters (with 0 or more diacritics after each) and then 1+ digits.
See a Python demo that outputs ['जनवरी12', 'जनवरी22']:
import re
s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
combining_marks = '[\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u07FD\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08D3-\u08E1\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u09FE\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0AFA-\u0AFF\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C04\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D00-\u0D03\u0D3B\u0D3C\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u1885\u1886\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C5\uA8E0-\uA8F1\uA8FF\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F\U000101FD\U000102E0\U00010376-\U0001037A\U00010A01-\U00010A03\U00010A05\U00010A06\U00010A0C-\U00010A0F\U00010A38-\U00010A3A\U00010A3F\U00010AE5\U00010AE6\U00010D24-\U00010D27\U00010F46-\U00010F50\U00011000-\U00011002\U00011038-\U00011046\U0001107F-\U00011082\U000110B0-\U000110BA\U00011100-\U00011102\U00011127-\U00011134\U00011145\U00011146\U00011173\U00011180-\U00011182\U000111B3-\U000111C0\U000111C9-\U000111CC\U0001122C-\U00011237\U0001123E\U000112DF-\U000112EA\U00011300-\U00011303\U0001133B\U0001133C\U0001133E-\U00011344\U00011347\U00011348\U0001134B-\U0001134D\U00011357\U00011362\U00011363\U00011366-\U0001136C\U00011370-\U00011374\U00011435-\U00011446\U0001145E\U000114B0-\U000114C3\U000115AF-\U000115B5\U000115B8-\U000115C0\U000115DC\U000115DD\U00011630-\U00011640\U000116AB-\U000116B7\U0001171D-\U0001172B\U0001182C-\U0001183A\U000119D1-\U000119D7\U000119DA-\U000119E0\U000119E4\U00011A01-\U00011A0A\U00011A33-\U00011A39\U00011A3B-\U00011A3E\U00011A47\U00011A51-\U00011A5B\U00011A8A-\U00011A99\U00011C2F-\U00011C36\U00011C38-\U00011C3F\U00011C92-\U00011CA7\U00011CA9-\U00011CB6\U00011D31-\U00011D36\U00011D3A\U00011D3C\U00011D3D\U00011D3F-\U00011D45\U00011D47\U00011D8A-\U00011D8E\U00011D90\U00011D91\U00011D93-\U00011D97\U00011EF3-\U00011EF6\U00016AF0-\U00016AF4\U00016B30-\U00016B36\U00016F4F\U00016F51-\U00016F87\U00016F8F-\U00016F92\U0001BC9D\U0001BC9E\U0001D165-\U0001D169\U0001D16D-\U0001D172\U0001D17B-\U0001D182\U0001D185-\U0001D18B\U0001D1AA-\U0001D1AD\U0001D242-\U0001D244\U0001DA00-\U0001DA36\U0001DA3B-\U0001DA6C\U0001DA75\U0001DA84\U0001DA9B-\U0001DA9F\U0001DAA1-\U0001DAAF\U0001E000-\U0001E006\U0001E008-\U0001E018\U0001E01B-\U0001E021\U0001E023\U0001E024\U0001E026-\U0001E02A\U0001E130-\U0001E136\U0001E2EC-\U0001E2EF\U0001E8D0-\U0001E8D6\U0001E944-\U0001E94A\U000E0100-\U000E01EF]'
mo = re.compile(r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks))
print(mo.findall(s))
Note that r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks) creates a pattern that matches
(?:[^\W\d_]{}*)+ - one or more occurrences of
[^\W\d_] - any Unicode base letter (if you want to disallow ASCII letters, add (?![A-Za-z]) right before this pattern)
{}* - zero or more occurrences of combining_marks
\d+ - 1+ digits
So, if you want to avoid matching ASCII letters, in the above code, use
r'(?:(?![A-Za-z])[^\W\d_]{}*)+\d+'
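If the phrases are always separated by whitespace, a regex-free sketch with the standard unicodedata module avoids the long character class. This is a different technique from the pattern above, not a drop-in replacement: it checks whole whitespace-separated tokens rather than searching for substrings, and it only treats ASCII 0-9 as digits. The helper names are made up for this example.

```python
import unicodedata

ASCII_DIGITS = "0123456789"

def is_letter_or_mark(ch):
    # General categories starting with "L" are letters, "M" are combining marks.
    return unicodedata.category(ch)[0] in ("L", "M")

def letters_then_digits(token):
    """True if token is letters (with optional combining marks) followed by digits."""
    head = token.rstrip(ASCII_DIGITS)   # everything before the trailing digits
    digits = token[len(head):]
    return (
        bool(head)
        and bool(digits)
        and unicodedata.category(head[0])[0] == "L"   # must start with a base letter
        and all(is_letter_or_mark(ch) for ch in head)
    )

s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
matches = [token for token in s.split() if letters_then_digits(token)]
print(matches)  # ['जनवरी12', 'जनवरी22']
```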

Regex : split by occurrences groups

I am trying to find a solution to split a string by occurrences in groups.
Strings are formatted like this: "AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD"
I want the string to split like this:
1 ) AAA/BBB/CCC/DDD
2 ) BBB/CCC/DDD
3 ) BBB/DDD
'/' is always the separator and words are always AAA, BBB, CCC and DDD.
I tried regex expression (AAA|BBB|CCC|DDD){x} with {x} to specify the number of occurrences but it seems {} works only for characters, not words.
You can use re.findall with the following positive lookahead patterns to ensure that slashes are included only if they are followed by characters that are allowed in the sequence, and use ? as a repeater to make a match of each word optional (but greedy):
import re
s = 'AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD'
re.findall('(?=[ABCD])(?:AAA(?:/(?=[BCD]))?)?(?:BBB(?:/(?=[CD]))?)?(?:CCC(?:/(?=D))?)?(?:DDD)?', s)
This returns:
['AAA/BBB/CCC/DDD', 'BBB/CCC/DDD', 'BBB/DDD']
Alternatively, you can use re.split with an alternation pattern that matches only those slashes which, according to the surrounding lookbehind and lookahead patterns, are preceded by a word that comes later in the sequence than the word following the slash:
import re
s = 'AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD'
re.split('(?:(?<=[BCD])/(?=A)|(?<=[CD])/(?=B)|(?<=D)/(?=C))', s)
This returns:
['AAA/BBB/CCC/DDD', 'BBB/CCC/DDD', 'BBB/DDD']
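The same grouping can also be sketched without regex: split on / and start a new group whenever the next word does not come later in the sequence than the previous one. The function name split_groups and the order parameter are assumptions made for this illustration.

```python
def split_groups(s, order=("AAA", "BBB", "CCC", "DDD")):
    """Split s on '/' into groups in which the words strictly follow the given order."""
    rank = {word: position for position, word in enumerate(order)}
    groups, current = [], []
    for word in s.split("/"):
        # A word that is not later in the sequence than the previous one starts a new group.
        if current and rank[word] <= rank[current[-1]]:
            groups.append("/".join(current))
            current = []
        current.append(word)
    if current:
        groups.append("/".join(current))
    return groups

print(split_groups("AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD"))
# ['AAA/BBB/CCC/DDD', 'BBB/CCC/DDD', 'BBB/DDD']
```

Unlike the regex versions, this raises a KeyError on a word that is not in order, which may or may not be what you want.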

Python String Dissection

Here is the problem:
Replace input string with the following: The first and last characters, separated by the count of distinct characters between the two.
Any non-alphabetic character in the input string should appear in the output string in its original relative location.
Here is the code I have thus far:
word = input("Please enter a word: ")
first_character = word[0]
last_character = word[-1]
unique_characters = (list(set(word[1:-1])))
unique_count = str(len(unique_characters))
print(first_character[0],unique_count,last_character[0])
For the second part, I have thought about using regex, however I have not been able to wrap my head around regex as it is not something I ever use.
You can use
import re
pat = r"\b([^\W\d_])([^\W\d_]*)([^\W\d_])\b"
s = "Testers"
print(re.sub(pat, (lambda m: "{0}{1}{2}".format(m.group(1), len(''.join(set(m.group(2)))), m.group(3))), s))
See the IDEONE demo.
The regex breakdown:
\b - word boundary (use ^ if you test an individual string)
([^\W\d_]) - Group 1 capturing any letter (in Python 3, str patterns are Unicode-aware by default; in Python 2, this matches only ASCII letters unless you pass the re.U flag)
([^\W\d_]*) - Group 2 capturing zero or more letters
([^\W\d_]) - Group 3 capturing a letter at...
\b - the trailing word boundary (replace with $ if you handle individual strings)
In the replacement pattern, the len(''.join(set(m.group(2)))) is counting the number of unique letter occurrences (see this SO post).
If you want 2-letter words such as Ts to stay unchanged (Ts > Ts) rather than become T0s, replace the * quantifier with + in the second group.
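To show how the word boundaries keep non-alphabetic characters in place, here is the same pattern wrapped in a small sketch that runs over a whole sentence. The names WORD, abbreviate and dissect are made up for this example, and len(set(...)) is used as a shorter equivalent of len(''.join(set(...))).

```python
import re

WORD = re.compile(r"\b([^\W\d_])([^\W\d_]*)([^\W\d_])\b")

def abbreviate(match):
    """Build first letter + count of distinct inner letters + last letter."""
    first, middle, last = match.groups()
    return "{}{}{}".format(first, len(set(middle)), last)

def dissect(text):
    # Only the letter runs are rewritten; punctuation and spaces are left where they are.
    return WORD.sub(abbreviate, text)

print(dissect("Testers"))        # T4s
print(dissect("hello, world!"))  # h2o, w3d!
```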

How to replace the blank

Given the code:
import clr
clr.AddReference('System')
from System.Text.RegularExpressions import *
def ReplaceBlank(st):
    return Regex.Replace(
        st, r'[^a-z\s](\s+)[^a-z\s]',
        lambda s: s.Value.Replace(' ', ''), RegexOptions.IgnoreCase)
I expect the input ABC DEF to return ABCDEF, but it doesn't work. What did I do wrong?
[^a-z\s] with ignore-case flag set matches anything other than letters and whitespace characters. ^ at the beginning of a character class (the thing between []) negates the character class.
To remove blanks, you can simply replace \s+ with an empty string or, if you only want to remove whitespace between letters, replace
(?<=[a-z])\s+(?=[a-z])
with an empty string. The second regex matches a run of whitespace between two letters; to also handle whitespace at the beginning or end of the string, use
(?<=(^|[a-z]))\s+(?=($|[a-z]))
or
\b\s+\b
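The last pattern matches whitespace between two word boundaries. Since whitespace is not a word character, a boundary next to it requires a word character (letter, digit or underscore) on the other side, so this matches whitespace between two word characters; whitespace next to symbol characters like a period, comma or hyphen is not matched.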
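For readers using CPython's re module instead of .NET, the letters-only variant can be sketched as below. The function name replace_blank is a hypothetical counterpart of ReplaceBlank. Note that the (?<=(^|[a-z])) form is not portable to re, because Python lookbehinds must have a fixed width.

```python
import re

# Whitespace that has a letter directly before and directly after it.
BETWEEN_LETTERS = re.compile(r"(?<=[a-z])\s+(?=[a-z])", re.IGNORECASE)

def replace_blank(text):
    """Remove runs of whitespace that sit between two ASCII letters."""
    return BETWEEN_LETTERS.sub("", text)

print(replace_blank("ABC DEF"))     # ABCDEF
print(replace_blank("ABC , DEF"))   # ABC , DEF (unchanged: the spaces touch a comma)
```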