Regex: Match whitespace (\s) inside [' ']

Hello, I'm trying to match whitespace characters (\s), which could be any of the values [\r\n\t\f ], but only inside the text between [' and '].
For example
$lang['some random text'] = 'some random text';
$lang['other random text'] = 'other random text';
I'm looking to replace each whitespace character with _. The example above should end up in the following format:
$lang['some_random_text'] = 'some random text';
$lang['other_random_text'] = 'other random text';
Language: Plain regex
Could someone explain what the right approach would be?
Thanks!
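One possible approach, sketched in Python since no answer is recorded here. It assumes the keys are always written as ['...'] and contain no quotes or brackets themselves; the names underscore_keys and KEY_SPACE are made up for this illustration. The lookahead pattern is the "plain regex" form: it matches a whitespace character only when a closing '] follows before any other bracket.

```python
import re

def underscore_keys(text):
    """Replace whitespace with '_' only inside ['...'] keys."""
    def fix_key(match):
        # match.group(1) is the text between [' and ']
        return "['" + re.sub(r"\s", "_", match.group(1)) + "']"
    return re.sub(r"\['([^']*)'\]", fix_key, text)

# Plain-regex alternative: a whitespace character followed, without any
# intervening bracket, by the closing '] of a key.
KEY_SPACE = r"\s(?=[^\[\]]*'\])"

sample = ("$lang['some random text'] = 'some random text';\n"
          "$lang['other random text'] = 'other random text';")

print(underscore_keys(sample))
print(re.sub(KEY_SPACE, "_", sample))
```

In an editor with regex search and replace, the KEY_SPACE pattern with _ as the replacement should do the same job, provided the regex flavor supports lookahead.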

Related

Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer that splits a string into words.
The special conditions are: split the punctuation marks . , ! ? into separate strings,
and split everything else on whitespace, e.g. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan on using the nltk package. I have looked at re.split and re.findall, but both leave me stuck:
re.split = I don't know how to split off punctuation attached to a word, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter(punctuation) while keeping it in the final results? One way of doing that would be this:
import re
sent = "I have a dog!'-4#"
# The capturing group keeps each delimiter in the result list.
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (case is ignored because of re.I),
[.!?] - any single character from your list (note that inside brackets neither a dot nor a '?' needs to be escaped),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters other than those in your list.
E.g. for a text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
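For reference, a minimal sketch that wraps the pattern above in a reusable function. The names TOKEN_RE and tokenize are made up for this illustration. Whitespace is the only thing none of the alternatives can match, so findall skips it and nothing else is dropped.

```python
import re

# Letters | one of . ! ? | any other run of non-whitespace characters
TOKEN_RE = re.compile(r"[a-z]+|[.!?]|(?:(?![.!?])\S)+", re.I)

def tokenize(text):
    """Return the tokens of text; whitespace is the only thing dropped."""
    return TOKEN_RE.findall(text)

print(tokenize("I have a dog!'-4#?."))
print(tokenize("Dog, cat."))
```

A comma directly after a word comes out as its own token here because the letter run ends before it. To force every comma into a separate token, it would have to be added to both [.!?] character classes.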

Regex to capture hyphenated words separated by new line character

I have a pattern such as word-\nword, i.e. the words are hyphenated and separated by a newline character.
I would like the output to be word-word, but I get word-\nword with the code below.
text_string = "word-\nword"
result=re.findall("[A-Za-z]+-\n[A-Za-z]+", text_string)
print(result)
I tried this, but it did not work; I get no result.
text_string = "word-\nword"
result=re.findall("[A-Za-z]+-(?=\n)[A-Za-z]+", text_string)
print(result)
How can I achieve this?
Thank You !
Edit:
Would it be efficient to do a replace and then run a simple regex:
text_string = "aaa bbb ccc-\nddd eee fff"
replaced_text = text_string.replace('-\n', '-')
result = re.findall(r"\w+-\w+", replaced_text)
print(result)
or to use the method suggested by CertainPerformance?
text_string = "word-\nword"
result = re.sub(r"(?i)(\w+)-\n(\w+)", r'\1-\2', text_string)
print(result)
You should use re.sub instead of re.findall:
result = re.sub(r"(?<=-)\n+", "", text_string)
This matches one or more newlines that follow a - and replaces them with an empty string.
You can alternatively use
(?<=-)\n(?=\w)
which matches a newline only if there is a - before it and a word character after it.
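A short sketch of why the lookahead attempt in the question finds nothing, and of the lookaround substitution in use. The sample string is the one from the question's edit; the variable name joined is an assumption for this example.

```python
import re

text_string = "aaa bbb ccc-\nddd eee fff"

# (?=\n) only checks that a newline follows; it does not consume it, so the
# next [A-Za-z]+ has to match at the newline itself and fails.
print(re.findall(r"[A-Za-z]+-(?=\n)[A-Za-z]+", text_string))

# Remove a newline only when it sits between a hyphen and a word character.
joined = re.sub(r"(?<=-)\n(?=\w)", "", text_string)
print(joined)

# The hyphenated word can now be found with a simple pattern.
print(re.findall(r"\w+-\w+", joined))
```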
If the string is composed of just that, then a pure regex solution is to use re.sub, capture the first word and the second word in a group, then echo those two groups back (without the dash and newline):
result=re.sub("(?i)([a-z]+)-\n([a-z]+)", r'\1\2', text_string)
Otherwise, if there is other stuff in the string, iterate over each match and join the groups:
text_string = "wordone-\nwordtwo wordthree-\nwordfour"
result=re.findall("(?i)([a-z]+)-\n([a-z]+)", text_string)
for match in result:
    print(''.join(match))
You can simply replace any occurrences of '-\n' with '-' instead:
result = text_string.replace('-\n', '-')

Extracting Prices with Regex

I'm looking to extract prices from a string of scraped data.
I'm using this at the moment:
re.findall(r'£(?:\d+\.)?\d+.\d+', '£1.01')
['£1.01']
Which works fine 99% of the time. However, I occasionally see this:
re.findall(r'£(?:\d+\.)?\d+.\d+', '£1,444.01')
['£1,444']
I'd like to see ['1444.01'] ideally.
This is an example of the string I'm extracting the prices from.
'\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
I'm after some help putting together a regex that gets ['1000.73', '1.26'] from the string above.
You may grab all the values with '£(\d[\d.,]*)\b' and then remove the commas:
import re
s = '\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
r = re.compile(r'£(\d[\d.,]*)\b')
print([x.replace(',', '') for x in re.findall(r, s)])
# => ['1000.73', '1.26']
The £(\d[\d.,]*)\b pattern matches £, then captures a digit followed by zero or more digits, commas, or dots, as many as possible, backtracking if necessary to the nearest position where there is a word boundary.
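If the prices are needed as numbers rather than strings, the same pattern can feed decimal.Decimal. This is only a sketch: the names PRICE_RE and extract_prices are made up here, and it assumes every captured value is a well-formed amount with commas as thousands separators and at most one decimal point.

```python
import re
from decimal import Decimal

# Same pattern as above: a pound sign, then digits with optional , and .
PRICE_RE = re.compile(r"£(\d[\d.,]*)\b")

def extract_prices(text):
    """Return every price in text as a Decimal, with thousands commas removed."""
    return [Decimal(value.replace(",", "")) for value in PRICE_RE.findall(text)]

s = '\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
print(extract_prices(s))
print(sum(extract_prices(s)))
```

Decimal avoids the rounding surprises that float arithmetic can introduce when prices are added up.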

What is the difference if I include a space in a regular expression

import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'(.*) are (.*)', line)
if matchObj:
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
When I run this code I get the output: smarter than dogs
But if I put an extra space at the end of my RE
matchObj = re.match( r'(.*) are (.*) ', line)
I get the output: smarter than
Can anyone explain why I am getting this difference in output?
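When you add the extra space, as in matchObj = re.match( r'(.*) are (.*) ', line), you are asking the second (.*) to match as many characters as it can while still being followed by a space.
In this case that is smarter than: the group first grabs everything up to the end of the line, then backtracks until the trailing space in the pattern can match the space before dogs. re.match does not require the pattern to reach the end of the string, so dogs is simply left unmatched.
Without the space, .* can match any number of characters other than a newline, so it runs to the end of the string and captures smarter than dogs.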
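The difference is easy to see by printing both matches side by side. A small sketch; the variable names no_space and with_space are made up for this example.

```python
import re

line = "Cats are smarter than dogs"

no_space = re.match(r'(.*) are (.*)', line)
with_space = re.match(r'(.*) are (.*) ', line)

# Greedy .* runs to the end of the line when nothing has to follow it.
print(repr(no_space.group(2)))

# With a trailing space in the pattern, .* gives characters back until
# the space before "dogs" can be matched.
print(repr(with_space.group(2)))

# The overall match stops right after that space; "dogs" is left over.
print(repr(with_space.group(0)))
```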
Read the documentation on regex for more info.

String separation in required format, Pythonic way? (with or w/o Regex)

I have a string in the format:
t='#abc #def Hello this part is text'
I want to get this:
l=["abc", "def"]
s='Hello this part is text'
I did this:
a=t[:t.find(' ',t.rfind('#'))].strip()
s=t[t.find(' ',t.rfind('#')):].strip()
b=a.split('#')
l=[i.strip() for i in b][1:]
It works for the most part, but it fails when the text part has the '#'.
Eg, when:
t='#abc #def My email is red#hjk.com'
it fails. The #names are always at the beginning, and there can be text after the #names, which may itself contain #.
Clearly I can append a space at the start and find the first word without '#'. But that doesn't seem an elegant solution.
What is a pythonic way of solving this?
Building unashamedly on MrTopf's effort:
import re
rx = re.compile(r"((?:#\w+ +)+)(.*)")
t='#abc #def #xyz Hello this part is text and my email is foo#ba.r'
a,s = rx.match(t).groups()
l = re.split('[# ]+',a)[1:-1]
print(l)
print(s)
prints:
['abc', 'def', 'xyz']
Hello this part is text and my email is foo#ba.r
Justly called to account by hasen j, let me clarify how this works:
/#\w+ +/
matches a single tag - # followed by at least one alphanumeric or _ followed by at least one space character. + is greedy, so if there is more than one space, it will grab them all.
To match any number of these tags, we need to add a plus (one or more things) to the pattern for tag; so we need to group it with parentheses:
/(#\w+ +)+/
which matches one-or-more tags, and, being greedy, matches all of them. However, those parentheses now fiddle around with our capture groups, so we undo that by making them into an anonymous group:
/(?:#\w+ +)+/
Finally, we make that into a capture group and add another to sweep up the rest:
/((?:#\w+ +)+)(.*)/
A last breakdown to sum up:
((?:#\w+ +)+)(.*)
 (?:#\w+ +)+
 (  #\w+ +)
    #\w+ +
Note that in reviewing this, I've improved it - \w didn't need to be in a set, and it now allows for multiple spaces between tags. Thanks, hasen-j!
t='#abc #def Hello this part is text'
words = t.split(' ')
names = []
# Pop leading tags only; the first non-tag word must stay in the list.
while words and words[0].startswith('#'):
    names.append(words.pop(0)[1:])
text = ' '.join(words)
print(names)
print(text)
How about this:
1. Split by space.
2. For each word, check:
2.1. if the word starts with #, push it to the first list;
2.2. otherwise, just join the remaining words with spaces.
You might also use regular expressions:
import re
rx = re.compile(r"#([\w]+) #([\w]+) (.*)")
t='#abc #def Hello this part is text and my email is foo#ba.r'
a,b,s = rx.match(t).groups()
But this all depends on what your data can look like, so you might need to adjust it. Basically, it creates groups via () and checks what is allowed in them.
[i.strip('#') for i in t.split(' ', 2)[:2]] # for a fixed number of #def
a = [i.strip('#') for i in t.split(' ') if i.startswith('#')]
s = ' '.join(i for i in t.split(' ') if not i.startswith('#'))
[edit: this is implementing what was suggested by Osama above]
This will create L based on the # variables from the beginning of the string, and then once a non # var is found, just grab the rest of the string.
t = '#one #two #three some text afterward with # symbols# meow#meow'
words = t.split(' ') # split into list of words based on spaces
L = []
s = ''
for i in range(len(words)): # go through each word
    word = words[i]
    if word[0] == '#': # grab #'s from beginning of string
        L.append(word[1:])
        continue
    s = ' '.join(words[i:]) # put spaces back in
    break # you can ignore the rest of the words
You can refactor this to be less code, but I'm trying to make what is going on obvious.
Here's just another variation that uses split() and no regexes:
t='#abc #def My email is red#hjk.com'
tags = []
words = iter(t.split())
# iterate over words until first non-tag word
for w in words:
    if not w.startswith("#"):
        # join this word and all the following
        s = w + " " + (" ".join(words))
        break
    tags.append(w[1:])
else:
    s = "" # handle string with only tags
print(tags, s)
Here's a shorter but perhaps a bit cryptic version that uses a regexp to find the first space followed by a non-# character:
import re
t = '#abc #def My email is red#hjk.com #extra bye'
m = re.search(r"\s([^#].*)$", t)
tags = [tag[1:] for tag in t[:m.start()].split()]
s = m.group(1)
print(tags, s) # ['abc', 'def'] My email is red#hjk.com #extra bye
This doesn't work properly if there are no tags or no text. The format is underspecified. You'll need to provide more test cases to validate.
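A possible variant that also copes with the "no tags" and "no text" cases, offered as a sketch rather than a definitive solution. The names TAGGED_RE and split_tags are made up here, and it assumes a tag is # followed by word characters and then whitespace or the end of the string.

```python
import re

# Group 1: zero or more leading tags, each followed by whitespace or the
# end of the string. Group 2: everything that is left.
TAGGED_RE = re.compile(r"((?:#\w+(?:\s+|$))*)(.*)", re.DOTALL)

def split_tags(t):
    """Return (list of leading tag names, remaining text)."""
    head, rest = TAGGED_RE.match(t.lstrip()).groups()
    tags = re.findall(r"#(\w+)", head)
    return tags, rest.strip()

print(split_tags('#abc #def My email is red#hjk.com #extra bye'))
print(split_tags('#abc #def'))
print(split_tags('no tags here'))
```

Because both groups may be empty, the pattern always matches, so there is no need to check for None before calling groups().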