How to exclude newline mark from requests.get().text - regex

I'm trying to extract the numbers from the response of http://app.lotto.pl/wyniki/?type=dl with the code below:
import requests
import re
url = 'http://app.lotto.pl/wyniki/?type=dl'
p = re.compile(r'[^\d{4}\-\d{2}\-\d{2}]\d+')
response = requests.get(url)
data = re.findall(p, response.text)
print(data)
but instead of ['7', '46', '8', '43', '9', '47'] I'm getting ['\n7', '\n46', '\n8', '\n43', '\n9', '\n47']. How can I get rid of the "\n"?

Your regex is not appropriate because [^\d{4}\-\d{2}\-\d{2}]\d+ matches any single character other than a digit, {, }, or -, followed by 1 or more digits. In other words, you turned a sequence into a character set, and that negated character class can match a newline. It can match any letter, too, and a lot more. strip() hides the symptom here but will not help in other contexts; you need to fix the regular expression.
Use
r'(?<!-)\b\d+\b(?!-)'
See the regex and IDEONE demo
This pattern matches 1+ digits (\d+) that are not preceded by a hyphen ((?<!-)) or a word character (\b), and are not followed by a word character (\b) or a hyphen ((?!-)).
Your code will look like:
import requests
import re
url = 'http://app.lotto.pl/wyniki/?type=dl'
p = re.compile(r'(?<!-)\b\d+\b(?!-)')
response = requests.get(url)
data = p.findall(response.text)
print(data)
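To compare the two patterns without hitting the network, here is a minimal sketch that runs both on a hard-coded sample. The sample text is only an assumption about the shape of the response (a date line followed by one number per line), not the site's actual output.

```python
import re

# Hypothetical sample; the real response from the URL may look different.
sample = "2016-03-12\n7\n46\n8\n43\n9\n47\n"

# Original pattern: the brackets form a negated character class, so the
# first matched character can be a newline.
old = re.compile(r'[^\d{4}\-\d{2}\-\d{2}]\d+')
# Fixed pattern: whole numbers that do not touch a hyphen on either side.
new = re.compile(r'(?<!-)\b\d+\b(?!-)')

old_result = old.findall(sample)
new_result = new.findall(sample)
print(old_result)  # ['\n7', '\n46', '\n8', '\n43', '\n9', '\n47']
print(new_result)  # ['7', '46', '8', '43', '9', '47']
```

The date parts are skipped by the fixed pattern because each of them is adjacent to a hyphen.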

You can strip \n using the strip() method:
data = [x.strip() for x in re.findall(p, response.text)]
I am assuming that \n can appear at the beginning as well as at the end.

Since your numbers are strings, you can use the lstrip() string method. It removes leading whitespace, including newline and carriage-return characters, from the left side of the string (hence lstrip).
You can try something like
print([item.lstrip() for item in data])
to remove your newlines.
Or you can as well overwrite data with the stripped version of itself:
data=[item.lstrip() for item in data]
and then simply print(data).


Two groups of alphanumeric characters (including letters, numbers, and underscores) separated by one or more whitespace characters

Fill in the code to check if the text passed has at least 2 groups of alphanumeric characters (including letters, numbers, and underscores) separated by one or more whitespace characters
import re
def check_character_groups(text):
    result = re.search(r"___", text)
    return result is not None
print(check_character_groups("One")) # False
print(check_character_groups("123 Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False
import re
def check_character_groups(text):
    result = re.search(r"\w\s+\w", text)
    return result is not None
print(check_character_groups("One ")) # False
print(check_character_groups("123 Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False
This answer returns what the exercise asks for;
the key part is \w\s+\w
Use this as the regex pattern:
\b\w+\s+\w+\b
\b word boundary, marks the start of a word (also works at the start of the string)
\w+ one or more alphanumeric characters [a-zA-Z0-9_]
\s+ one or more whitespace characters
\w+ the next group of alphanumeric characters (1+)
\b word boundary (works at the end of the string too)
So try:
import re
def check_character_groups(text):
    result = re.search(r"\b\w+\s+\w+\b", text)
    return result is not None
print(check_character_groups("One ")) # False
print(check_character_groups("123 Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False
If there should be special characters/spaces/... left and/or right of the alphanumeric groups, you must specify it.
Test it on regex101 or debuggex.
Try this regex:
re.search(r"\w\s+\w", text)
There are 2 spaces between "123" and "Ready". Since \s matches exactly one whitespace character, we should use \s+, which matches one or more whitespace characters, like so:
print(re.search(r"\w\s\w", "123  Ready Set GO"))
>>> <re.Match object; span=(9, 12), match='y S'>
print(re.search(r"\w\s+\w", "123  Ready Set GO"))
>>> <re.Match object; span=(2, 6), match='3  R'>
import re
def check_character_groups(text):
    result = re.search(r"\w+\s", text)
    return result is not None
print(check_character_groups("One")) # False
print(check_character_groups("123 Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False
Here is your output:
False
True
True
False
import re
def check_character_groups(text):
    result = re.search(r"[\w]+[\s]+[\w]", text)
    return result is not None
The answer is to use [\w]+[\s]+[\w] as regex.
This will match the condition for 2 groups of alphanumeric characters including underscore separated by one or more whitespace characters
result = re.search(r"\w+\s.*", text)
Clarification:
* repeats the preceding element zero or more times
\w matches alphanumeric characters, including the underscore
\s matches whitespace characters
A dot (.) matches any character except a newline, so .* matches everything after the first whitespace character
result = re.search(r"\b\w*\s\w*\b", text)
You can try this too.
\b defines the boundary.
Or use a simpler one:
result = re.search(r"[\w] [\w]", text)
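The suggested patterns are not equivalent. As a rough check, the sketch below runs a few of them against the exercise's inputs plus the trailing-space case "One ". The CASES list and the failures helper are names made up for this illustration, and the double space in "123  Ready Set GO" is assumed from the answer that mentions it.

```python
import re

# (input, expected result) pairs taken from the exercise, plus "One ".
CASES = [
    ("One", False),
    ("One ", False),
    ("123  Ready Set GO", True),
    ("username user_01", True),
    ("shopping_list: milk, bread, eggs.", False),
]

def failures(pattern):
    """Return the inputs for which the pattern gives the wrong answer."""
    return [text for text, expected in CASES
            if (re.search(pattern, text) is not None) != expected]

print(failures(r"\w\s+\w"))        # []
print(failures(r"\b\w+\s+\w+\b"))  # []
print(failures(r"\w+\s"))          # ['One ']
```

A pattern such as \w+\s never looks for a second group, so it accepts any word followed by whitespace.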

How to print Hindi character from a string in python using regular expression?

I'm using regex in python and trying to extract 'Hindi' character from the given string and then print it but I'm not able to do so. I'm trying to extract जनवरी12 and जनवरी22 from the string. The code should search for a phrase that starts with जनवरी(or any hindi character) and ends with 12( or any number). Here is the code:
import re
string = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
mo = re.compile(r'[^(^a-zA-Z-0-9)]+\d+')
print(mo.findall(string))
Output:
[' 12', 'वें संस्करण जनवरी12', ' 12', ' जनवरी22']
I know that [^abc] matches any character that isn’t between the brackets and tried to achieve the same with [^(^a-zA-Z-0-9)]+ but the output is not what I expected.
Expected output:
जनवरी12, जनवरी22
Can anyone explain to me how this should be done, and how to match the start and end in Python's regex?
I think you just need a pattern that matches 1+ letters (with 0 or more diacritics after each) and then 1+ digits.
See a Python demo that outputs ['जनवरी12', 'जनवरी22']:
import re
s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
combining_marks = '[\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u07FD\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08D3-\u08E1\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u09FE\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0AFA-\u0AFF\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C04\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D00-\u0D03\u0D3B\u0D3C\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u1885\u1886\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C5\uA8E0-\uA8F1\uA8FF\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F\U000101FD\U000102E0\U00010376-\U0001037A\U00010A01-\U00010A03\U00010A05\U00010A06\U00010A0C-\U00010A0F\U00010A38-\U00010A3A\U00010A3F\U00010AE5\U00010AE6\U00010D24-\U00010D27\U00010F46-\U00010F50\U00011000-\U00011002\U00011038-\U00011046\U0001107F-\U00011082\U000110B0-\U000110BA\U00011100-\U00011102\U00011127-\U00011134\U00011145\U00011146\U00011173\U00011180-\U00011182\U000111B3-\U000111C0\U000111C9-\U000111CC\U0001122C-\U00011237\U0001123E\U000112DF-\U000112EA\U00011300-\U00011303\U0001133B\U0001133C\U0001133E-\U00011344\U00011347\U00011348\U0001134B-\U0001134D\U00011357\U00011362\U00011363\U00011366-\U0001136C\U00011370-\U00011374\U00011435-\U00011446\U0001145E\U000114B0-\U000114C3\U000115AF-\U000115B5\U000115B8-\U000115C0\U000115DC\U000115DD\U00011630-\U00011640\U000116AB-\U000116B7\U0001171D-\U0001172B\U0001182C-\U0001183A\U000119D1-\U000119D7\U000119DA-\U000119E0\U000119E4\U00011A01-\U00011A0A\U00011A33-\U00011A39\U00011A3B-\U00011A3E\U00011A47\U00011A51-\U00011A5B\U00011A8A-\U00011A99\U00011C2F-\U00011C36\U00011C38-\U00011C3F\U00011C92-\U00011CA7\U00011CA9-\U00011CB6\U00011D31-\U00011D36\U00011D3A\U00011D3C\U00011D3D\U00011D3F-\U00011D45\U00011D47\U00011D8A-\U00011D8E\U00011D90\U00011D91\U00011D93-\U00011D97\U00011EF3-\U00011EF6\U00016AF0-\U00016AF4\U00016B30-\U00016B36\U00016F4F\U00016F51-\U00016F87\U00016F8F-\U00016F92\U0001BC9D\U0001BC9E\U0001D165-\U0001D169\U0001D16D-\U0001D172\U0001D17B-\U0001D182\U0001D185-\U0001D18B\U0001D1AA-\U0001D1AD\U0001D242-\U0001D244\U0001DA00-\U0001DA36\U0001DA3B-\U0001DA6C\U0001DA75\U0001DA84\U0001DA9B-\U0001DA9F\U0001DAA1-\U0001DAAF\U0001E000-\U0001E006\U0001E008-\U0001E018\U0001E01B-\U0001E021\U0001E023\U0001E024\U0001E026-\U0001E02A\U0001E130-\U0001E136\U0001E2EC-\U0001E2EF\U0001E8D0-\U0001E8D6\U0001E944-\U0001E94A\U000E0100-\U000E01EF]'
mo = re.compile(r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks))
print(mo.findall(s))
Note that r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks) creates a pattern that matches
(?:[^\W\d_]{}*)+ - one or more occurrences of
[^\W\d_] - any Unicode base letter (if you want to disallow ASCII letters, add (?![A-Za-z]) right before this pattern)
{}* - zero or more occurrences of combining_marks
\d+ - 1+ digits
So, if you want to avoid matching ASCII letters, in the above code, use
r'(?:(?![A-Za-z])[^\W\d_]{}*)+\d+'
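If the input is known to contain only Devanagari text (an assumption; the answer above handles any script), a shorter sketch is to match the Devanagari Unicode block U+0900 to U+097F directly. The block contains the combining vowel signs, so no separate list of marks is needed. Note that it also contains Devanagari digits and punctuation, which this sketch does not filter out.

```python
import re

s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"

# One or more characters from the Devanagari block, then one or more digits.
devanagari_then_digits = re.compile(r'[\u0900-\u097F]+\d+')

result = devanagari_then_digits.findall(s)
print(result)  # ['जनवरी12', 'जनवरी22']
```

ASCII letters are excluded automatically, because they fall outside the block.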

How to add a space between alphanumeric and non alphanumeric characters?

If a word contains alphanumeric characters and the first character or characters is (or are) non-alphanumeric, then how to split off each such leading non-alphanumeric character as a separate word; Whether or not the first rule was applied, if the word contains alphanumeric characters and the last character or characters is (or are) non-alphanumeric, then how to split off each such trailing non-alphanumeric character as a separate word?
For example, if I have a
string = "John had a meeting with 3managers! %nervous:( t^ria7 #manager's.!"
The output should look like this
"John had a meeting with 3managers ! % nervous : ( t^ria7 # manager's . !"
The (new) idea is to split the words by whitespaces and then to apply an alternative regex to each word. In the end, the parts are glued together again.
The expression in question:
^(\W+)|(\W+)$
This matches non-word characters at either the beginning or the end of the string; see a demo on regex101.com.
In Python, you need to check which group was captured to insert the appropriate whitespaces:
import re
string = """John had a meeting with 3managers! %nervous:( t^ria7 #manager's."""
def replacer(match):
    if match.group(1) is not None:
        return '{} '.format(match.group(1))
    else:
        return ' {}'.format(match.group(2))
rx = re.compile(r'^(\W+)|(\W+)$')
string = " ".join([rx.sub(replacer, word) for word in string.split()])
print(string)
This yields
John had a meeting with 3managers ! % nervous :( t^ria7 # manager's .
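Note that this output keeps :( together, whereas the question asks for each leading or trailing character to be split off on its own. A possible variation, sketched under the assumption that "alphanumeric" means \w (so underscores count as alphanumeric), joins the captured characters with spaces. The helper names split_edges and space_out are illustrative.

```python
import re

EDGE_PUNCT = re.compile(r'^(\W+)|(\W+)$')

def split_edges(word):
    # Words without any alphanumeric character are left untouched.
    if not re.search(r'\w', word):
        return word

    def replacer(match):
        if match.group(1) is not None:
            # Leading run: separate every character, then a trailing space.
            return ' '.join(match.group(1)) + ' '
        # Trailing run: a leading space, then every character separated.
        return ' ' + ' '.join(match.group(2))

    return EDGE_PUNCT.sub(replacer, word)

def space_out(text):
    return ' '.join(split_edges(word) for word in text.split())

text = "John had a meeting with 3managers! %nervous:( t^ria7 #manager's.!"
print(space_out(text))
# John had a meeting with 3managers ! % nervous : ( t^ria7 # manager's . !
```

Punctuation inside a word, such as the ^ in t^ria7 or the apostrophe in manager's, is not touched because the pattern is anchored to the word's edges.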

Find the first to last alphabet in a string

I am new to Python and got pretty confused when reading the regex documentation. From what I understand, re.search searches everywhere in a string while re.match only searches the start of the string. But when do I have to use re.compile?
I tried playing around with regex but could not get it to work. If I have a string that mixes letters, punctuation, numbers and spaces, how can I obtain the part of the string made of letters?
import re
a = "123,12 jlkjL kSljdf 12.2"
test = re.search('^[a-zA-Z]', a)
print(test)
The output I am trying to get is jlkjL kSljdf.
You may use re.compile to compile a pattern into a regex object before using it; this is optional, since module-level functions such as re.search also accept a pattern string directly.
There are two options to achieve what you want: matching the letters with spaces and then stripping redundant whitespace, or removing all non-letter symbols from the start/end:
import re
a = "123,12 jlkjL kSljdf 12.2"
rg = re.compile(r'[a-zA-Z ]+')
mtch = rg.search(a)
if mtch:
    print(mtch.group().strip()) # => jlkjL kSljdf
# Stripping non-letters from the start/end
rx = re.compile(r'^[^a-zA-Z]+|[^a-zA-Z]+$')
print(rx.sub('', a)) # => jlkjL kSljdf
See the Python demo
In the first approach, add a space to the character class and set a + (1 or more occurrences) quantifier on it.
In the second approach, ^[^a-zA-Z]+ matches 1 or more (+) characters other than letters ([^a-zA-Z]) at the start of the string (^) OR (|) 1 or more chars other than letters at the end of the string ($).
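On the difference between re.match, re.search and re.compile raised in the question, a small sketch on the same string may help. The variable names are illustrative only.

```python
import re

a = "123,12 jlkjL kSljdf 12.2"

# re.match only tries at position 0, where this string has a digit.
at_start = re.match(r'[a-zA-Z]+', a)
# re.search scans forward and stops at the first run of letters.
first_run = re.search(r'[a-zA-Z]+', a)
# A compiled pattern exposes the same methods and can be reused.
letters = re.compile(r'[a-zA-Z]+')
all_runs = letters.findall(a)

print(at_start)           # None
print(first_run.group())  # jlkjL
print(all_runs)           # ['jlkjL', 'kSljdf']
```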

How to replace the blank

Given the code:
import clr
clr.AddReference('System')
from System.Text.RegularExpressions import *
def ReplaceBlank(st):
    return Regex.Replace(
        st, r'[^a-z\s](\s+)[^a-z\s]',
        lambda s: s.Value.Replace(' ', ''), RegexOptions.IgnoreCase)
I expect the input ABC DEF to return ABCDEF, but it doesn't work. What did I do wrong?
[^a-z\s] with ignore-case flag set matches anything other than letters and whitespace characters. ^ at the beginning of a character class (the thing between []) negates the character class.
To replace blanks, you can simply replace \s+ with an empty string or, if you only want to remove whitespace between letters, replace
(?<=[a-z])\s+(?=[a-z])
with an empty string. The second regex matches a run of whitespace between two letters; to also account for the beginning/end of the string, use
(?<=(^|[a-z]))\s+(?=($|[a-z]))
or
\b\s+\b
The last one matches whitespace that has a word character (letter, digit, or underscore) on both sides, so whitespace next to punctuation such as a period, comma, or hyphen is left alone.
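For comparison, here is a minimal sketch of two of these patterns using Python's own re module rather than the .NET Regex class from the question. The function names are made up for this example, and behavior under .NET may differ in details.

```python
import re

def replace_blank(st):
    # Remove whitespace only when it sits between two ASCII letters.
    return re.sub(r'(?<=[a-z])\s+(?=[a-z])', '', st, flags=re.IGNORECASE)

def replace_blank_words(st):
    # Remove whitespace between any two word characters (letters, digits, _).
    return re.sub(r'\b\s+\b', '', st)

print(replace_blank("ABC DEF"))          # ABCDEF
print(replace_blank("ABC 1 DEF"))        # ABC 1 DEF
print(replace_blank_words("ABC 1 DEF"))  # ABC1DEF
print(replace_blank_words("end. Next"))  # end. Next
```

The lookarounds assert the neighboring letters without consuming them, which is why only the whitespace is removed.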