Extract multi-digit numbers from a string in python 3 - regex

I am doing the algorithm challenges from HackerRank and one of the problems needs me to accept input in the form of strings of numbers formatted as follows:
3 4
12 14 16
1 2
3 4
5 6
Now, I know how to iterate through the lines and assign them where they need to go, but my issue is with the second line. The others are two single-digit numbers, so I've been extracting them by just referencing their index in the string. For example, the first line of numbers would be collected with string[0] and string[-1].
The second line, however, is of indeterminate length and may include numbers shorter or longer than three digits. How would I pull those out and assign them to variables? I'm sure there is probably a way to do it with RegEx, but I don't know how to assign multiple matches in one string to multiple variables.

import re
print(re.findall(r"(\d+)", x))
"x" being your line. This will return a list with all the numbers.

You mean this?
>>> import re
>>> s = """3 4
... 12 14 16
... 1 2
... 3 4
... 5 6"""
>>> m = re.findall(r'\b\d+\b', s, re.M)
>>> m
['3', '4', '12', '14', '16', '1', '2', '3', '4', '5', '6']
Just pick up each value in the final list and assign it to variables.
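For example, when you know how many numbers a line holds, you can unpack the matches directly into variables (a minimal sketch; the names a, b and c are illustrative):
>>> line = '12 14 16'
>>> a, b, c = (int(n) for n in re.findall(r'\d+', line))
>>> a, b, c
(12, 14, 16)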

So if s is your string,
list(map(int, s.split()))
yields a list of integers (in Python 3, map returns an iterator, so it is wrapped in list()):
[3, 4, 12, 14, 16, 1, 2, 3, 4, 5, 6]
That's basically what skamazin suggested.

Given:
>>> txt='''\
... 3 4
... 12 14 16
... 1 2
... 3 4
... 5 6'''
If the lines have meaning, you can do:
>>> [list(map(int, line.split())) for line in txt.splitlines()]
[[3, 4], [12, 14, 16], [1, 2], [3, 4], [5, 6]]
If the lines have no meaning and you just want all the numbers, you can do:
>>> list(map(int, txt.split()))
[3, 4, 12, 14, 16, 1, 2, 3, 4, 5, 6]
If your source text has the possibility of strings that will not convert to integers:
>>> txt='''\
... 3 4
... 12 14 16
... 1 2
... 3 4
... 5 6
... text that won't be integers
... 99 100 101'''
You can use a conversion function:
>>> def conv(s):
...     try:
...         return int(s)
...     except ValueError:
...         return s
...
>>> [[conv(s) for s in line.split()] for line in txt.splitlines()]
[[3, 4], [12, 14, 16], [1, 2], [3, 4], [5, 6], ['text', 'that', "won't", 'be', 'integers'], [99, 100, 101]]
Or filter out the things that are not digits:
>>> list(map(int, filter(lambda s: s.isdigit(), txt.split())))
[3, 4, 12, 14, 16, 1, 2, 3, 4, 5, 6, 99, 100, 101]

Making a dictionary from 2 lists / columns?

I have a large database with several columns, and I need data from 2 of these.
The end result is to have 2 drop-down menus where the first one sets "names" and the second one holds the "numbers" values that have been merged into the name. I just need the data available so I can input it into another program.
So I want a list or dictionary that contains the unique values of the "names" list, with the numbers from the numbers list appended to them.
# Just a list of random names and numbers for testing
names = [
    "Cindi Brookins",
    "Cumberband Hamberdund",
    "Roger Ramsden",
    "Cumberband Hamberdund",
    "Lorean Dibble",
    "Lorean Dibble",
    "Coleen Snider",
    "Rey Bains",
    "Maxine Rader",
    "Cindi Brookins",
    "Catharine Vena",
    "Lanny Mckennon",
    "Berta Urban",
    "Rey Bains",
    "Roger Ramsden",
    "Lanny Mckennon",
    "Catharine Vena",
    "Berta Urban",
    "Maxine Rader",
    "Coleen Snider"
]
numbers = [6, 5, 7, 10, 3, 9, 1, 1, 2, 7, 4, 2, 8, 3, 8, 10, 4, 9, 6, 5]
So in the above example, "Berta Urban" would appear once but still have the numbers 8 and 9 assigned, and "Rey Bains" would have 1 and 3.
I have tried with
mergedlist = dict(zip(names, numbers))
But that only assigns the last of the numbers to the name.
I am not sure if I can make a dictionary with unique "names" that holds multiple "numbers".
You only get the last number associated with each name because dictionary keys are unique (otherwise they wouldn't be much use). So if you do
mergedlist["Berta Urban"] = 8
and after that
mergedlist["Berta Urban"] = 9
the result will be
{'Berta Urban': 9}
Just as if you did:
berta_urban = 8
berta_urban = 9
In that case you would expect the value of berta_urban to be 9 and not [8,9].
So, as you can see, you need an append not an assignment to your dict entry.
from collections import defaultdict
mergedlist = defaultdict(list)
for name, number in zip(names, numbers):
    mergedlist[name].append(number)
This gives:
{'Coleen Snider': [1, 5],
 'Cindi Brookins': [6, 7],
 'Cumberband Hamberdund': [5, 10],
 'Roger Ramsden': [7, 8],
 'Lorean Dibble': [3, 9],
 'Rey Bains': [1, 3],
 'Maxine Rader': [2, 6],
 'Catharine Vena': [4, 4],
 'Lanny Mckennon': [2, 10],
 'Berta Urban': [8, 9]}
which is what I think you want. Note that you will get duplicates, as in 'Catharine Vena': [4, 4] and you will also get a list of numbers for each name, even if the list has only one number in it.
You cannot have multiple keys of the same name in a dict, but your dict keys can be unique while holding a list of matching numbers. Something like:
mergedlist = {}
for i, v in enumerate(names):
    mergedlist[v] = mergedlist.get(v, []) + [numbers[i]]
print(mergedlist["Berta Urban"])  # prints [8, 9]
Not terribly efficient, though. Depending on the database you're using, chances are that the database can get you the results in the form you prefer faster than you can post-process and reconstruct the data.
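A slightly leaner variant of the same idea - a sketch, not from the original answers - uses dict.setdefault, so each list is created once and appended to in place instead of being rebuilt on every iteration:
mergedlist = {}
for name, number in zip(names, numbers):
    mergedlist.setdefault(name, []).append(number)
print(mergedlist["Berta Urban"])  # [8, 9]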

Python Dask - vertical concatenation of 2 DataFrames

I am trying to vertically concatenate two Dask DataFrames.
I have the following Dask DataFrame:
import pandas as pd
import dask.dataframe as dd

d = [
    ['A', 'B', 'C', 'D', 'E', 'F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2]
]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)
Here is the data as a Pandas DataFrame
   A  B  C   D  E   F
0  1  4  8   1  3   5
1  6  6  2   2  0   0
2  9  4  5   0  6  35
3  0  1  7  10  9   4
4  0  7  2   6  1   2
Here is the Dask DataFrame
Dask DataFrame Structure:
                   A      B      C      D      E      F
npartitions=4
0              int64  int64  int64  int64  int64  int64
1                ...    ...    ...    ...    ...    ...
2                ...    ...    ...    ...    ...    ...
3                ...    ...    ...    ...    ...    ...
4                ...    ...    ...    ...    ...    ...
Dask Name: from_pandas, 4 tasks
I am trying to concatenate 2 Dask DataFrames vertically:
ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)
but I get this error:
Traceback (most recent call last):
...
File "...", line 572, in concat
raise ValueError('All inputs have known divisions which cannot '
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
However, if I try:
dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)
then it appears to work. Is there a problem with setting this to True (in terms of performance - speed)? Or is there another way to vertically concatenate 2 Dask DataFrames?
If you inspect the divisions of the dataframe, ddf.divisions, you will find (assuming one partition) that it holds the edges of the index: (0, 4). This is useful to dask: when you run an operation on the data, it knows to skip any partition that does not include the required index values. This is also why some dask operations are much faster when the index is appropriate for the job.
When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the index values had different, non-overlapping ranges in the two partitions.
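A minimal sketch of that, reusing the question's df under the single-partition assumption (the variable names ddf1 and ddf2 are illustrative):
>>> ddf1 = dd.from_pandas(df, npartitions=1)
>>> ddf1.divisions
(0, 4)
>>> ddf2 = ddf1 + 11.5  # arithmetic keeps the index, so divisions are identical
>>> ddf2.divisions
(0, 4)
Because both ranges are (0, 4), they overlap, which is why dask asks for interleave_partitions=True.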
mdurant's answer is correct, and this answer elaborates with MCVE code snippets using Dask v2021.08.1. Examples make it easier to understand divisions and interleaving.
Vertically concatenating DataFrames
Create two DataFrames, concatenate them, and view the results.
df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf3 = dd.concat([ddf1, ddf2])
print(ddf3.compute())
   nums letters
0     1       a
1     2       b
2     3       c
3     4       d
4     5       e
5     6       f
0    88      xx
1    99      yy
Divisions metadata when vertically concatenating
Create two DataFrames, concatenate them, and illustrate that sometimes this operation will cause divisions metadata to be lost.
def print_partitions(ddf):
    for i in range(ddf.npartitions):
        print(ddf.partitions[i].compute())

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1.divisions  # (0, 3, 5)
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2.divisions  # (0, 1)
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions  # (None, None, None, None)
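The print_partitions helper defined above makes it visible that this concat simply stacks the partitions of ddf1 and ddf2, so their index ranges overlap and dask can no longer track divisions (a sketch; the output follows from the divisions shown above):
print_partitions(ddf3)
   nums letters
0     1       a
1     2       b
2     3       c
   nums letters
3     4       d
4     5       e
5     6       f
   nums letters
0    88      xx
1    99      yy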
Set interleave_partitions=True to avoid losing the divisions metadata.
ddf3_interleave = dd.concat([ddf1, ddf2], interleave_partitions=True)
ddf3_interleave.divisions # (0, 1, 3, 5)
When interleaving isn't necessary
Create two DataFrames without overlapping divisions, concatenate them, and confirm that the divisions metadata is not lost:
df = pd.DataFrame(
    {"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"], "some_index": [4, 5, 6, 7]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1 = ddf1.set_index("some_index")
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"], "some_index": [10, 20]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2 = ddf2.set_index("some_index")
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions  # (4, 6, 10, 20)
I wrote a blog post to explain this in more detail. Let me know if you'd like the link.

find the index of maximum element of 2nd column of a 2 dimensional list

I am a newbie in Python.
I need to find the index of the maximum element of the 2nd column of a 2-dimensional list as follows:
3 | 23
7 | 12
5 | 42
1 | 25
so the output should be the index of 42, i.e. 2
I tried converting the list to a numpy array and then using argmax, but in vain.
Thanks!!
>>> import numpy as np
>>> a = np.asarray([[3, 7, 5, 1], [23, 12, 42, 25]])
>>> a
array([[ 3,  7,  5,  1],
       [23, 12, 42, 25]])
>>> a[1]
array([23, 12, 42, 25])
>>> np.argmax(a[1])
2
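Note that the answer above stores the data column-wise. If your list holds rows of [x, y] pairs, as laid out in the question, select the second column first - a minimal sketch of the same argmax idea (the rows variable is illustrative):
>>> rows = [[3, 23], [7, 12], [5, 42], [1, 25]]
>>> np.argmax(np.asarray(rows)[:, 1])
2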

How to parse given types of string to list?

The user input string follows the pattern:
{ 1, 2, 4, 6, 3 }
{ 2, 5, 8, 0, 3, 45, 5 }
How to convert any one of the given user inputs to a list of integers?
Maybe split can be used, as in A.split(', '), but then we get:
A = ['{ 1', '2', '4', '6', '3 }']
but the answer we want should be:
A = [1, 2, 4, 6, 3]
Replace { and } with [ and ], then use the json module to parse the result:
>>> import json
>>> s = "{ 1, 2, 4, 6, 3 }"
>>> json.loads(s.replace("{","[").replace("}","]"))
[1, 2, 4, 6, 3]
The easy way would be to strip the braces, split on commas, and cast the types.
A = [int(x) for x in A.lstrip(' {').rstrip(' }').split(', ')]
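For example, applied to the second input from the question:
A = '{ 2, 5, 8, 0, 3, 45, 5 }'
print([int(x) for x in A.lstrip(' {').rstrip(' }').split(', ')])
# [2, 5, 8, 0, 3, 45, 5]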
The proper way to do this is the same way a compiler would do it: tokenize the input before parsing, using a library like PLY.
You can leverage try and except to accomplish this. Basically you split the string on commas and try to convert each piece to int; if successful, it is appended to a list, and if Python raises a ValueError - because the piece cannot convert to int - nothing is appended. Splitting on commas rather than iterating character by character keeps multi-digit numbers such as 45 intact. Maybe this is not the shortest code but it is definitely very readable.
s = '{ 2, 5, 8, 0, 3, 45, 5 }'
result = []
for item in s.strip('{} ').split(','):
    try:
        result.append(int(item))
    except ValueError:
        pass
print(result)  # [2, 5, 8, 0, 3, 45, 5]

How to filter dataframe in pandas by 'str' in columns name?

Following this recipe, I tried to filter a dataframe by the column names that contain the string '+'. Here's the example:
import pandas as pd

B = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', '+B', '+C'], index=[1, 2, 3, 4, 5])
So I want a dataframe C with only '+B' and '+C' columns in it.
C = B.filter(regex='+')
However, I get the error:
File "c:\users\hernan\anaconda\lib\site-packages\pandas\core\generic.py", line 1888, in filter
    matcher = re.compile(regex)
File "c:\users\hernan\anaconda\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
File "c:\users\hernan\anaconda\lib\re.py", line 244, in _compile
    raise error, v  # invalid expression
error: nothing to repeat
The recipe says it is for Python 3. I use Python 2.7; however, I don't think that is the problem here.
+ has a special meaning in regular expressions (see here). You can escape it with a backslash, preferably in a raw string:
>>> C = B.filter(regex=r'\+')
>>> C
   +B  +C
1   5   2
2   4   4
3   3   1
4   2   2
5   1   4
Or, since all you care about is the presence of +, you could use the like argument instead:
>>> C = B.filter(like="+")
>>> C
   +B  +C
1   5   2
2   4   4
3   3   1
4   2   2
5   1   4