split string using regex python and "re" package - regex

I'm using Python 3 on Windows 10. Consider the following list of strings:
import re
s = ["12345", "67891", "01112"]
I want to split these zips at the third character to get the zip3, but this code throws an error.
re.split("\d{3}", s)
TypeError: cannot use a string pattern on a bytes-like object
I'm not quite sure how to get around this. Help appreciated. Thanks.

To get the first three of each, simply string-slice them:
s = ["12345", "67891", "01112"]
first_three = [p[0:3] for p in s]
print(first_three)
Output:
['123', '678', '011'] # slicing
To split all text in threes, join it, then use chunking to get chunks of 3 letters:
s = ["12345", "67891", "01112"]
k = ''.join(s)
threesome = [k[i:i+3] for i in range(0,len(k),3)]
print(threesome)
Output:
['123', '456', '789', '101', '112'] # join + chunking
See How do you split a list into evenly sized chunks? and Understanding Python's slice notation
Slicing and chunking work on strings as well; see the official Python documentation on strings and slicing.
To get the remainder as well:
s = ["12345", "67891", "01112"]
three_and_two = [[p[:3], p[3:]] for p in s]
print(three_and_two) # [['123', '45'], ['678', '91'], ['011', '12']]
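As for the error in the question: re.split() expects a single string, not a list, so passing s directly fails regardless of the pattern. If a regex is still preferred over slicing, one option is to apply the pattern to each element. A minimal sketch, assuming every entry consists of at least three digits; the names zips and zip3_pattern are illustrative and not from the original post:

```python
import re

zips = ["12345", "67891", "01112"]

# First group: the leading three digits; second group: whatever digits remain.
zip3_pattern = re.compile(r"(\d{3})(\d*)")

# re functions work on one string at a time, so match each element separately.
parts = [zip3_pattern.fullmatch(z).groups() for z in zips]
zip3 = [head for head, _tail in parts]

print(parts)
print(zip3)
```

For fixed-width prefixes like this, plain slicing remains the simpler choice; the regex version mainly helps when the input also needs validating.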

Related

Python 3.5.2 Regex in List comprehension returns all entries - inconsistent with other example

I am searching a list for a particular entry. The entry is digits followed by oblique (one or many times).
If I put an example into a string and use re.match() I get the result.
If I put the string into a list and loop through I get a result from re.match()
If I try to get the index using list comprehension I get all the list indexes returned.
Using a different list I get the correct result.
Why is the list comprehension for my regex not just returning [2] as the control list does?
Example code:
import re
import sys
from datetime import datetime
rxco = re.compile
rx = {}
#String
s = r'140/154/011/002'
#String in a list
l = ['abc', 'XX123 SHDJ FFFF', s, 'unknown', 'TTL/4/5/6', 'ORD/123']
#Regex to get what I am interested in
rx['ls_pax_split'] = rxco(r'\s?((\d+\/?)*)')
#For loop returns matches and misses
for i in l:
    m = re.match(rx['ls_pax_split'], i)
    print(m)
#List Comprehension returns ALL entries - NOT EXPECTED
idx = [i for i, item in enumerate(l) if re.match(rx['ls_pax_split'], item)]
print(idx)
#Control Comprehension returns - AS EXPECTED
fruit_list = ['raspberry', 'apple', 'strawberry']
berry_idx = [i for i, item in enumerate(fruit_list) if re.match('rasp', item)]
print(berry_idx)
re.match(rx['ls_pax_split'], item) returns a match object every time it runs, because the pattern can match an empty string at the start of any input, whereas re.match('rasp', item) returns None when the item does not start with 'rasp'. A match object is always truthy, even when the matched text is empty.
Try adding .group(0) to the re.match() call in the list comprehension to get the string that matched the regular expression; this is an empty string (i.e. a falsy value) when the pattern matched nothing.
Like this:
idx = [i for i, item in enumerate(l) if re.match(rx['ls_pax_split'], item).group(0)]
EDIT
While the above will solve this problem, there may be a better way to avoid the hassle of dealing with .group. The regular expression (\d+\/?)* will match (\d+\/?) zero or more times, meaning that it is generating a lot of false positives where it detects exactly zero matches and therefore returns a match. Changing this to (\d+\/?)+ would solve it for this example by looking for one or more (\d+\/?).
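The difference between the two quantifiers can be checked directly. A small sketch that reuses the list from the question (called entries here); star and plus are illustrative names:

```python
import re

entries = ['abc', 'XX123 SHDJ FFFF', '140/154/011/002',
           'unknown', 'TTL/4/5/6', 'ORD/123']

star = re.compile(r'\s?((\d+/?)*)')  # may match the empty string
plus = re.compile(r'\s?((\d+/?)+)')  # needs at least one digit group

# With "*", every item yields a (possibly empty) match object, which is truthy.
star_idx = [i for i, item in enumerate(entries) if star.match(item)]
# With "+", items that do not start with digits yield None.
plus_idx = [i for i, item in enumerate(entries) if plus.match(item)]

print(star_idx)
print(plus_idx)
print(repr(star.match('abc').group(0)))
```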

identify letter/number combinations using regex and storing in dictionary

import pandas as pd
df = pd.DataFrame({'Date':['This 1-A16-19 person is BL-17-1111 and other',
'dont Z-1-12 do here but NOT 12-24-1981',
'numbers: 1A-256-29Q88 ok'],
'IDs': ['A11','B22','C33'],
})
Using the dataframe above, I want to do the following: 1) use a regex to identify all letter + number combinations, e.g. 1-A16-19, and 2) store them in a dictionary.
Ideally I would like the following output (note that 12-24-1981 intentionally was not picked up by the regex since it doesn't have a letter in it e.g. 1A-24-1981)
{1: 1-A16-19, 2:BL-17-1111, 3: Z-1-12, 4: 1A-256-29Q88}
Can anybody help me do this?
This regex might do the trick.
(?=.*[a-zA-Z])(\S+-\S+-\S+)
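It matches any run of non-whitespace characters that contains at least two hyphens. The lookahead (?=.*[a-zA-Z]) rejects a candidate when no letter appears anywhere from the start of that candidate to the end of the line, which is why 12-24-1981 is skipped here.
regex101 example
As you can see, for the input you provided only 1-A16-19, BL-17-1111, Z-1-12 and 1A-256-29Q88 are returned.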
You could try:
# Extract every token with two hyphens and collect the matches in a list.
vals = df['Date'].str.extractall(r'(\S+-\S+-\S+)')[0].tolist()
# Make a list with the index range of your matches.
nums = []
for x, y in enumerate(vals):
    nums.append(x)
# Pass both lists into a dictionary.
my_dict = dict(zip(nums, vals))
print(my_dict)
{0: '1-A16-19',
1: 'BL-17-1111',
2: 'Z-1-12',
3: '12-24-1981',
4: '1A-256-29Q88'}
If you want the index to start at one, you can specify this in the enumerate function (starting again from an empty nums list).
nums = []
for x, y in enumerate(vals, 1):
    nums.append(x)
print(nums)
[1, 2, 3, 4, 5]
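The dictionary printed above still contains 12-24-1981, which the question wanted excluded. A sketch that works on the raw strings with only the re module (the texts list mirrors the Date column; the pattern is my own assumption, not taken from either answer). The lookahead (?=\S*[A-Za-z]) requires a letter inside the token itself rather than anywhere later in the line:

```python
import re

texts = [
    "This 1-A16-19 person is BL-17-1111 and other",
    "dont Z-1-12 do here but NOT 12-24-1981",
    "numbers: 1A-256-29Q88 ok",
]

# (?<!\S)          -> only start at the beginning of a whitespace-delimited token
# (?=\S*[A-Za-z])  -> the token must contain at least one letter
# \S+-\S+-\S+      -> the token must contain at least two hyphens
pattern = re.compile(r"(?<!\S)(?=\S*[A-Za-z])\S+-\S+-\S+")

matches = [m for text in texts for m in pattern.findall(text)]

# enumerate(..., start=1) numbers the matches from 1, as in the desired output.
ids = dict(enumerate(matches, start=1))
print(ids)
```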

From SAS to Python : substr

I would like to use the function substr(my_var,1,2) of SAS in Python. I tried a lot of functions like contains(), split() but it didn't work.
The functions contains() and split() work only on a single string value. I would like to apply this to a pandas Series without using a for loop.
Thanks a lot for your help
A string in Python can be sliced like any list:
>>> text = 'Hello World'
>>> text[1:3]
'el'
>>> text[1:-2]
'ello Wor'
To get substrings for multiple strings, you can use a list comprehension:
>>> strs = ['Hello World', 'Foobar']
>>> [s[1:4] for s in strs]
['ell', 'oob']
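In Python, you may try this:
my_var[1:3]
This gets the substring of my_var from index 1 up to, but not including, index 3 (indices are 0-based). SAS positions are 1-based, so the equivalent of substr(my_var,1,2) is my_var[0:2].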
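Converting a SAS-style (position, length) pair into a Python slice can be wrapped in a helper. A small sketch with a hypothetical function sas_substr (not a library function), assuming a 1-based position and a length as in SAS:

```python
def sas_substr(value, position, length):
    """Mimic SAS substr(value, position, length); position is 1-based."""
    start = position - 1              # shift to Python's 0-based indexing
    return value[start:start + length]


values = ["Hello World", "Foobar"]
first_two = [sas_substr(v, 1, 2) for v in values]
print(first_two)
```

For a pandas Series the same slice can be applied without an explicit loop through the .str accessor, for example my_series.str.slice(0, 2).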

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string into two separate variables. Then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing you can use islice() which is more performant when generating slices of larger lists.
Code would look like this:
from itertools import islice
with open('input.txt') as f:
    data = f.readlines()
first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))
first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, keep in mind that the data source may be shorter than expected. Neither islice() nor list slicing raises an error in that case; both simply return fewer (or no) items, so the resulting strings are silently empty. Only plain indexing such as data[1] raises an IndexError.
Using regex
The OP additionally asked for a list-less approach in the comments.
Reading a file produces either a single string or a list of lines, and string handling tends to end up in lists anyway, so I suggested using a regex instead.
I cannot say how regex performance compares with list/string handling. However, this should do the job:
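import re
# First line, second line, then everything else.
# \n+ tolerates an optional blank line before the rest; DOTALL lets "." span newlines.
regex = r'(?P<first>[^\n]*)\n(?P<second>[^\n]*)\n+(?P<rest>.*)'
preg = re.compile(regex, re.DOTALL)
with open('input.txt') as f:
    data = f.read()
# match is None if the input has fewer than two line breaks
match = preg.match(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')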
If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2) # This only separates the first two lines
title = lines[0]
description = lines[1]
rest = lines[2]
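Both variants assume the input has at least three lines; lines[1] or lines[2] raises an IndexError otherwise. A minimal sketch that pads missing parts with empty strings and writes the three values as one CSV row. The function name split_record and the in-memory buffer are illustrative choices, not part of the answers above:

```python
import csv
import io


def split_record(text):
    """Return (title, description, rest); missing parts become ''."""
    parts = text.split('\n', 2)          # at most three pieces
    parts += [''] * (3 - len(parts))     # pad short inputs
    return tuple(parts)


sample = "Title of a string\nDescription of the string\nLine 3\nLine 4"
title, description, rest = split_record(sample)

# Write the three values as one row; a real script would open a .csv file instead.
buffer = io.StringIO()
csv.writer(buffer).writerow([title, description, rest])
print(buffer.getvalue())
```

The csv module quotes the third field because it contains a newline, so the multi-line remainder stays in a single column.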

Single quote replacement, handling of null integers in pandas/python2.7

New to Pandas/Python and I'm having to write some kludgy code. I would appreciate any input on how you would do this and speed it up (I'll be doing this for gigabytes of data).
So, I'm using pandas/python for some ETL work. Row-wise calculations are performed so I need them as numeric types within the process (left this part out). I need to output some of the fields as an array and get rid of the single quotes, nan's, and ".0"'s.
First question, is there a way to vectorize these if else statements ala ifelse in R? Second, surely there is a better way to remove the ".0". There seem to be major issues with how pandas/numpy handles nulls in numeric types.
Finally, the .replace does not seem to work on the DataFrame for single quotes. Am I missing something? Here's the sample code, please let me know if you have any questions about it:
import numpy as np
import pandas as pd
# have some nulls and need it in integers
d = {'one' : [1.0, 2.0, 3.0, 4.0],'two' : [4.0, 3.0, np.nan, 1.0]}
dat = pd.DataFrame(d)
# make functions to get rid of the ".0" and necessarily converting to strings
def removeforval(val):
    if str(val)[-2:] == ".0":
        val = str(val)[:len(str(val))-2]
    else:
        val = str(val)
    return val
def removeforcol(col):
    col = col.apply(removeforval)
    return col
dat = dat.apply(removeforcol,axis=0)
# remove the nan's
dat = dat.replace('nan','')
# need some fields in arrays on a postgres database
quoted = ['{' + str(tuple(x))[1:-1] + '}' for x in dat.to_records(index=False)]
print "Before single quote removal"
print quoted
# try to replace single quotes using DataFrame's replace
quoted_df = pd.DataFrame(quoted).replace('\'','')
quoted_df = quoted_df.replace('\'','')
print "DataFrame does not seem to work"
print quoted_df
# use a loop
for item in range(len(quoted)):
    quoted[item] = quoted[item].replace('\'','')
print "This Works"
print quoted
Thank you!
You understand that this is very odd to make a string exactly like this. This is not valid python at all. What are you doing with this? Why are you stringifying it?
revised
In [144]: list([ "{%s , %s}" % tup[1:] for tup in df.replace(np.nan,0).astype(int).replace(0,'').itertuples() ])
Out[144]: ['{1 , 4}', '{2 , 3}', '{3 , }', '{4 , 1}']
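On why DataFrame.replace('\'', '') appears to do nothing: without regex=True it replaces only cells whose whole value equals the search string, not substrings inside a cell. The quotes can also be avoided entirely by formatting each value before joining. A pure-Python sketch under that assumption; format_cell and to_pg_array are hypothetical helper names, and rows mirrors the sample data:

```python
import math


def format_cell(value):
    """Render NaN as '', whole floats without '.0', anything else via str()."""
    if isinstance(value, float):
        if math.isnan(value):
            return ''
        if value.is_integer():
            return str(int(value))
    return str(value)


def to_pg_array(row):
    """Build a '{a, b}' style string from one row of values."""
    return '{' + ', '.join(format_cell(v) for v in row) + '}'


rows = [(1.0, 4.0), (2.0, 3.0), (3.0, float('nan')), (4.0, 1.0)]
quoted = [to_pg_array(row) for row in rows]
print(quoted)
```

Because the strings are built from already-formatted values, no quote characters or ".0" suffixes are produced in the first place.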