Is there a way to test PySpark regexes?

I'd like to test different inputs to a PySpark regex to see if they fail/succeed before running a build. Is there a way to test this in Foundry before running a full build/checks?

You can downsample your input using the Preview functionality in Authoring, and specify a filter so that the sample contains exactly the rows you want to test against.
You can then run your PySpark code on this custom sample to verify that it does what you expect.
After clicking the Preview button, click the gear icon in the view that opens.
There you can describe the sample you want.
Once the sample is set up, running your regex against it is a fast and easy way to test it.

I am also a fan of writing unit tests. Create a small input DataFrame and the desired output DataFrame, write a simple function that takes the input, applies the regex, and returns the output, and then compare the result with the expected output.
import pytest
import pandas as pd  # noqa
import numpy as np
from myproject.analysis.simple_discount import calc

# Columns of the input DataFrame passed to calc()
columns = [
    "date",
    "id",
    "other",
    "brand",
    "grp_id",
    "amount",
    "pct",
    "max_amount",
    "unit",
    "total_units"
]
# Columns of the DataFrame that calc() is expected to return
output_columns = [
    "date",
    "id",
    "other",
    "brand",
    "grp_id",
    "amount",
    "pct",
    "max_amount",
    "qty",
    "total_amount"
]


@pytest.fixture
def input_df(spark_session):
    data = [
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 4],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 2],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 3],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 4],
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 1.08, 1],
        ['3/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
        ['6/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 2.6, 1],
    ]
    pdf = pd.DataFrame(data, columns=columns)
    pdf = pdf.replace({np.nan: None})  # Spark expects None rather than NaN for nulls
    return spark_session.createDataFrame(pdf)


@pytest.fixture
def output_df(spark_session):
    data = [
        ['4/1/21', 'a', '1', 'mn', '567', 0.54, 50, 1.08, 27, 14.580000000000002],
        ['3/1/21', 'b', '2', 'mn', '555', 1.3, 50, 2.6, 1, 1.3],
    ]
    pdf = pd.DataFrame(data, columns=output_columns)
    pdf = pdf.replace({np.nan: None})
    return spark_session.createDataFrame(pdf)


# ======= FIRST RUN CASE
def test_normal_input(input_df, output_df):
    calc_output_df = calc(input_df)
    # Sort the collected rows so that the comparison ignores row order
    assert sorted(calc_output_df.collect()) == sorted(output_df.collect())

#
# Folder Structure
#
# transforms-python/
# ├── ...
# └── src/
#     ├── ...
#     ├── myproject/
#     │   ├── ...
#     │   └── analysis/
#     │       ├── ...
#     │       └── simple_discount.py
#     └── tests/
#         ├── ...
#         └── unit_tests.py
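Before involving Spark at all, it can also help to check the pattern itself against a handful of strings in plain Python. The sketch below uses a hypothetical date pattern and made-up inputs (neither comes from the question). Note that Spark's regex functions use Java regular expressions, whose syntax differs from Python's re module in some details, so treat this as a first sanity check rather than a replacement for the unit test above.

```python
import re

# Hypothetical pattern and inputs, for illustration only.
PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
SAMPLE_INPUTS = ["2021-04-01", "4/1/21", "2021-4-1"]


def check_inputs(pattern, inputs):
    """Map each input string to True if the pattern matches it, else False."""
    return {s: pattern.match(s) is not None for s in inputs}


results = check_inputs(PATTERN, SAMPLE_INPUTS)
for value, ok in results.items():
    print(f"{value!r}: {'match' if ok else 'no match'}")
```

Once the pattern behaves as expected here, the same strings make good rows for the small input DataFrame in the unit test.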

Related

Merging a list of dictionaries

I have a list of dicts as below.
list = [ {id: 1, s_id:2, class: 'a', teacher: 'b'} ]
list1 = [ {id: 1, c_id:1, rank:2, area: 34}, {id:1, c_id:2, rank:1, area: 21} ]
I want to merge the two lists on the common key-value pairs (in this case 'id:1')
Merged_list = [ {id:1, s_id:2, class: 'a', teacher: 'b', list1: [ {c_id:1, rank: 2, area: 34}, {c_id:2, rank: 1, area: 21} ]} ]
How do I go about this?
Thanks
You can use
merged_list = [{**d1, **d2} for d1, d2 in zip(list1, list2)]
>>> merged_list
[{'id': 1, 's_id': 2, 'class': 'a', 'teacher': 'b', 'rank': 2, 'area': 34},
{'id': 2, 's_id': 3, 'class': 'c', 'teacher': 'd', 'rank': 1, 'area': 21}]
where {**d1, **d2} is just a neat way to combine two dictionaries. Keep in mind that for duplicate keys, the value from the second dictionary replaces the value from the first. If you're on Python 3.9 or later, you could use d1 | d2 instead.
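A small sketch of the duplicate-key behaviour, using two made-up dictionaries (the values are illustrative and not taken from the question):

```python
import sys

# Illustrative dictionaries; "id" and "rank" appear in both.
d1 = {"id": 1, "s_id": 2, "rank": 9}
d2 = {"id": 1, "rank": 2, "area": 34}

# Keys from d2 win when a key exists in both dictionaries.
merged = {**d1, **d2}
print(merged)  # {'id': 1, 's_id': 2, 'rank': 2, 'area': 34}

# The | operator for dicts exists only on Python 3.9 and later.
if sys.version_info >= (3, 9):
    assert (d1 | d2) == merged
```

Neither d1 nor d2 is modified; both forms build a new dictionary.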
EDIT: For the edit in your question, you can try this horrible one-liner (keep in mind that it creates the pair 'list1': [] when no entry in list1 has a matching id):
list_ = [{"id": 1, "s_id": 2, "class": 'a', "teacher": 'b'}]
list1 = [{"id": 1, "c_id": 1, "rank": 2, "area": 34}, {"id": 1, "c_id": 2, "rank": 1, "area": 21}]
merged_list = [{**d, **{"list1": [{k: v for k, v in d1.items() if k != "id"} for d1 in list1 if d1["id"] == d["id"]]}} for d in list_]
>>> merged_list
[{'id': 1,
's_id': 2,
'class': 'a',
'teacher': 'b',
'list1': [{'c_id': 1, 'rank': 2, 'area': 34},
{'c_id': 2, 'rank': 1, 'area': 21}]}]
This is equivalent to (with some added benefits):
merged_list = []
for d in list_:
    matched_ids = []
    for d1 in list1:
        if d["id"] == d1["id"]:
            # append a copy without the id, leaving the dictionaries in list1 unchanged
            matched_ids.append({k: v for k, v in d1.items() if k != "id"})
    if matched_ids:  # added benefit of not showing the 'list1' key if matched_ids is empty
        found = {"list1": matched_ids}
    else:
        found = {}
    merged_list.append({**d, **found})
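If both lists are large, scanning list1 once per entry of the first list gets slow. A possible alternative, sketched here under the assumption that every dictionary has an "id" key, is to group list1 by id first. The function name merge_on_id and its parameters are made up for this example.

```python
from collections import defaultdict


def merge_on_id(parents, children, key="id", nested_name="list1"):
    """Attach to each parent the children that share its key value."""
    grouped = defaultdict(list)
    for child in children:
        # Store each child without the shared key.
        grouped[child[key]].append({k: v for k, v in child.items() if k != key})

    merged = []
    for parent in parents:
        row = dict(parent)  # copy, so the input is not modified
        if parent[key] in grouped:  # a membership test does not create a new entry
            row[nested_name] = list(grouped[parent[key]])
        merged.append(row)
    return merged


list_ = [{"id": 1, "s_id": 2, "class": "a", "teacher": "b"}]
list1 = [{"id": 1, "c_id": 1, "rank": 2, "area": 34},
         {"id": 1, "c_id": 2, "rank": 1, "area": 21}]
merged_list = merge_on_id(list_, list1)
print(merged_list)
```

Each list is traversed once, and parents without a match simply get no 'list1' key.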
Try this, and don't forget to put quotes around strings when you declare them. Note that this only concatenates the two lists; it does not merge the dictionaries by id.
list_ = [ {"id": 1, "s_id": 2, "class": 'a', "teacher": 'b'}, {"id": 2, "s_id": 3, "class": 'c', "teacher": 'd'} ]
list1 = [ {"id": 1, "rank": 2, "area": 34}, {"id": 2, "rank": 1, "area": 21} ]
list2 = list1 + list_
print(list2)

Remove specific values from a Python list

Here is the list
['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', '', '00', 'STOP/HOLD ADD', '5', '', '00', 'TOWER INQ', 'T', '', '00', 'ACCT FIELD MNT', '2', '', '00', 'COMB STMT MAINT', 'C', '', '00', 'MONETARY IM80', 'W', '', '00', 'MONETARY-IM201', 'D', '', '00', 'OCF INQ', 'G', '', '00', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', '', '00', 'NAME/ADDR CHG', '4', '', '00', 'MEMO POST', 'Z', '', '00', 'FLOOR LIMITS', '0']
I would like to remove '' and '00' from the list
result should be like this
['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', 'STOP/HOLD ADD', '5', 'TOWER INQ', 'T', 'ACCT FIELD MNT', '2', 'COMB STMT MAINT', 'C', 'MONETARY IM80', 'W', 'MONETARY-IM201', 'D', 'OCF INQ', 'G', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', 'NAME/ADDR CHG', '4', 'MEMO POST', 'Z', 'FLOOR LIMITS', '0']
I tried this:
apa= [aa for aa in apa if aa != "''" or aa != "00"]
but I get the same list back.
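Here's how I would do it (apa is the list from the question):
for item in apa[:]:  # iterate over a copy, because apa is modified inside the loop
    if item == "00" or item == "":
        apa.remove(item)
You can also use:
apa.remove('00')
but that removes only the first occurrence of '00'.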
lst=['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', '', '00',
'STOP/HOLD ADD', '5', '', '00', 'TOWER INQ', 'T', '', '00',
'ACCT FIELD MNT', '2', '', '00', 'COMB STMT MAINT', 'C', '',
'00', 'MONETARY IM80', 'W', '', '00', 'MONETARY-IM201', 'D',
'', '00', 'OCF INQ', 'G', '', '00', 'ACCESS ALL FUNC', 'NO',
'RATE INQ', 'K', '', '00', 'NAME/ADDR CHG', '4', '', '00',
'MEMO POST', 'Z', '', '00', 'FLOOR LIMITS', '0']
print([x for x in lst if x != '00' and x != ''])
#Output
['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', 'STOP/HOLD ADD',
'5', 'TOWER INQ', 'T', 'ACCT FIELD MNT', '2', 'COMB STMT MAINT',
'C', 'MONETARY IM80', 'W', 'MONETARY-IM201', 'D', 'OCF INQ', 'G',
'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', 'NAME/ADDR CHG', '4',
'MEMO POST', 'Z', 'FLOOR LIMITS', '0']
res=['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', '', '00', 'STOP/HOLD ADD', '5', '', '00', 'TOWER INQ', 'T', '', '00', 'ACCT FIELD MNT', '2', '', '00', 'COMB STMT MAINT', 'C', '', '00', 'MONETARY IM80', 'W', '', '00', 'MONETARY-IM201', 'D', '', '00', 'OCF INQ', 'G', '', '00', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', '', '00', 'NAME/ADDR CHG', '4', '', '00', 'MEMO POST', 'Z', '', '00', 'FLOOR LIMITS', '0']
res=[x for x in res if x not in ('00','')]
print(res)
['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', 'STOP/HOLD ADD', '5', 'TOWER INQ', 'T', 'ACCT FIELD MNT', '2', 'COMB STMT MAINT', 'C', 'MONETARY IM80', 'W', 'MONETARY-IM201', 'D', 'OCF INQ', 'G', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', 'NAME/ADDR CHG', '4', 'MEMO POST', 'Z', 'FLOOR LIMITS', '0']
Use a while loop:
list1 = ['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', '', '00', 'STOP/HOLD ADD', '5', '', '00', 'TOWER INQ', 'T', '', '00', 'ACCT FIELD MNT', '2', '', '00', 'COMB STMT MAINT', 'C', '', '00', 'MONETARY IM80', 'W', '', '00', 'MONETARY-IM201', 'D', '', '00', 'OCF INQ', 'G', '', '00', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', '', '00', 'NAME/ADDR CHG', '4', '', '00', 'MEMO POST', 'Z', '', '00', 'FLOOR LIMITS', '0']
while '00' in list1: list1.remove('00')
print(list1)
The output will be
['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', '', 'STOP/HOLD ADD', '5', '', 'TOWER INQ', 'T', '', 'ACCT FIELD MNT', '2', '', 'COMB STMT MAINT', 'C', '', 'MONETARY IM80', 'W', '', 'MONETARY-IM201', 'D', '', 'OCF INQ', 'G', '', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', '', 'NAME/ADDR CHG', '4', '', 'MEMO POST', 'Z', '', 'FLOOR LIMITS', '0']
with all the '00' terms removed. Repeat the loop with '' to remove the empty strings as well.
One-liner (in Python 3, filter returns an iterator, so wrap it in list()):
list(filter(lambda a: a!='' and a!='00', ['DEFAULT SECURITY', 'YES', 'ACCT INQ', '3', '', '00', 'STOP/HOLD ADD', '5', '', '00', 'TOWER INQ', 'T', '', '00', 'ACCT FIELD MNT', '2', '', '00', 'COMB STMT MAINT', 'C', '', '00', 'MONETARY IM80', 'W', '', '00', 'MONETARY-IM201', 'D', '', '00', 'OCF INQ', 'G', '', '00', 'ACCESS ALL FUNC', 'NO', 'RATE INQ', 'K', '', '00', 'NAME/ADDR CHG', '4', '', '00', 'MEMO POST', 'Z', '', '00', 'FLOOR LIMITS', '0']))
See https://stackoverflow.com/a/1157160/761963
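As for why the comprehension in the question returns the same list: a condition of the form a != x or a != y is true for every value, because no string can equal both x and y at once. In addition, "''" is a two-character string made of two quote marks, not the empty string. The sketch below uses a shortened, made-up sample of the list to show the difference.

```python
# Shortened sample in the style of the question's list.
values = ["ACCT INQ", "3", "", "00", "TOWER INQ"]

# The question's condition: always true, so nothing is removed.
kept_with_or = [v for v in values if v != "''" or v != "00"]

# Corrected: keep a value only if it differs from both unwanted strings.
kept_with_and = [v for v in values if v != "" and v != "00"]

# Same result with a set of unwanted values, which scales to more of them.
unwanted = {"", "00"}
kept_with_set = [v for v in values if v not in unwanted]

print(kept_with_or)
print(kept_with_and)
print(kept_with_set)
```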

Python how to create new lists from a list containing multiple lists depending on indexes

I have a list that stores lists of 5 elements each. I want to create 5 new lists, each holding the elements at one index. I have the following code, but it does not seem like a smart way to do it.
>>> stats
[['1', '0', '36', '36', '3'], ['10', '0', '41', '77', '5'], ['1', '0', '631', '631', '63'], ['1', '0', '98', '98', '9'], ['9', '0', '52', '81', '6'], ['2', '0', '111', '167', '13'], ['1', '0', '98', '98', '9'], ['1', '0', '92', '92', '9'], ['2', '0', '241', '287', '26'], ['1', '0', '210', '210', '21'], ['2', '0', '336', '358', '34'], ['2', '0', '49', '57', '5'], ['5', '0', '52', '148', '7'], ['2', '0', '46', '76', '6'], ['3', '0', '33', '50', '4'], ['7', '0', '47', '70', '6'], ['1', '0', '94', '94', '9'], ['1', '0', '65', '65', '6'], ['1', '0', '66', '66', '6'], ['1', '0', '429', '429', '42'], ['1', '0', '337', '337', '33'], ['12', '0', '49', '126', '6'], ['1', '0', '47', '47', '4'], ['1', '0', '63', '63', '6'], ['1', '0', '79', '79', '7'], ['2', '0', '96', '100', '9'], ['1', '0', '36', '36', '3'], ['1', '0', '69', '69', '6'], ['6', '0', '44', '67', '5'], ['3', '0', '269', '385', '31'], ['2', '0', '78', '115', '9'], ['2', '0', '49', '52', '5'], ['3', '0', '26', '134', '9'], ['2', '0', '255', '561', '40'], ['1', '0', '75', '75', '7'], ['1', '0', '59', '59', '5'], ['2', '0', '59', '64', '6'], ['1', '0', '86', '86', '8'], ['1', '0', '63', '63', '6'], ['2', '0', '79', '100', '8'], ['4', '0', '825', '888', '86'], ['1', '0', '82', '82', '8'], ['3', '0', '65', '94', '7'], ['1', '0', '88', '88', '8'], ['1', '0', '344', '344', '34'], ['1', '0', '286', '286', '28'], ['1', '0', '73', '73', '7'], ['3', '0', '42', '69', '5'], ['1', '0', '151', '151', '15'], ['1', '0', '286', '286', '28'], ['2', '0', '47', '59', '5'], ['9', '0', '15', '41', '2'], ['2', '0', '343', '355', '34'], ['1', '0', '305', '305', '30'], ['1', '0', '238', '238', '23'], ['2', '0', '974', '2101', '153'], ['2', '0', '138', '142', '14'], ['7', '0', '45', '70', '5'], ['1', '0', '39', '39', '3']]
>>>
>>> num_requests,num_failures,min_response_time,max_response_time,avg_response_time = [], [], [], [], []
>>>
>>> for l in stats:
...     num_requests.append(l[0])
...     num_failures.append(l[1])
...     min_response_time.append(l[2])
...     max_response_time.append(l[3])
...     avg_response_time.append(l[4])
...
>>> num_requests
['1', '10', '1', '1', '9', '2', '1', '1', '2', '1', '2', '2', '5', '2', '3', '7', '1', '1', '1', '1', '1', '12', '1', '1', '1', '2', '1', '1', '6', '3', '2', '2', '3', '2', '1', '1', '2', '1', '1', '2', '4', '1', '3', '1', '1', '1', '1', '3', '1', '1', '2', '9', '2', '1', '1', '2', '2', '7', '1']
The result could also be stored in one list that holds the 5 sublists.
Solution
Just use zip with *:
(num_requests, num_failures, min_response_time, max_response_time,
avg_response_time) = zip(*stats)
This gives you tuples. Convert to lists if you need lists:
(num_requests, num_failures, min_response_time, max_response_time,
avg_response_time) = (list(x) for x in zip(*stats))
Details
A shorter example:
>>> data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
>>> a, b, c = zip(*data)
>>> a
(1, 10, 100)
>>> b
(2, 20, 200)
>>> c
(3, 30, 300)
This is equivalent to:
a, b, c = zip(data[0], data[1], data[2])
but works for any number of sublists.
The left side uses tuple unpacking. For example, this:
x, y, z = (10, 20, 30)
assigns 10 to x, 20 to y, and 30 to z.
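If one container is preferred over five separate variables, as the question allows, the same transpose can feed a dictionary keyed by column name. This is only a sketch: it uses the first three rows of stats from the question, and the name by_name is made up here.

```python
# First three rows of the question's stats list.
stats = [['1', '0', '36', '36', '3'],
         ['10', '0', '41', '77', '5'],
         ['1', '0', '631', '631', '63']]

names = ["num_requests", "num_failures", "min_response_time",
         "max_response_time", "avg_response_time"]

# zip(*stats) yields one tuple per column; pair each with its name.
by_name = {name: list(column) for name, column in zip(names, zip(*stats))}

print(by_name["num_requests"])       # ['1', '10', '1']
print(by_name["max_response_time"])  # ['36', '77', '631']
```

Keep in mind that zip stops at the shortest row, so a row with fewer than 5 elements would silently drop the trailing columns.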
Performance
Measure how fast it is.
Version with append:
%%timeit
num_requests,num_failures,min_response_time,max_response_time,avg_response_time = [], [], [], [], []
for l in stats:
    num_requests.append(l[0])
    num_failures.append(l[1])
    min_response_time.append(l[2])
    max_response_time.append(l[3])
    avg_response_time.append(l[4])
10000 loops, best of 3: 51 µs per loop
Version with zip:
%%timeit
(num_requests, num_failures, min_response_time, max_response_time,
avg_response_time) = zip(*stats)
100000 loops, best of 3: 8.58 µs per loop
It is about six times faster.
It takes a bit longer when you convert the tuples to lists:
%%timeit
(num_requests, num_failures, min_response_time, max_response_time,
avg_response_time) = (list(x) for x in zip(*stats))
100000 loops, best of 3: 13.3 µs per loop
Still, about four times faster.
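The %%timeit cells above need IPython. Outside of it, a comparable measurement can be sketched with the standard timeit module. The data below is synthetic (59 generated rows, not the stats list from the question), and absolute timings depend on the machine, so no expected numbers are given.

```python
import timeit

# Synthetic stand-in for stats: 59 rows of 5 strings each.
stats = [[str(i), "0", str(i * 2), str(i * 3), str(i % 7)] for i in range(59)]


def with_append():
    a, b, c, d, e = [], [], [], [], []
    for row in stats:
        a.append(row[0])
        b.append(row[1])
        c.append(row[2])
        d.append(row[3])
        e.append(row[4])
    return [a, b, c, d, e]


def with_zip():
    return [list(column) for column in zip(*stats)]


# Total time in seconds for 1000 calls of each version.
append_time = timeit.timeit(with_append, number=1000)
zip_time = timeit.timeit(with_zip, number=1000)
print(f"append: {append_time:.4f} s, zip: {zip_time:.4f} s")
```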

How do you compare integer values in two separate lists?

This is my current code:
predict_results = []
with open("newTesting.predict") as inputfile:
    for line in inputfile:
        predict_results.append(line.strip())
print(predict_results)
first_list = []
with open("newTesting") as inputfile:
    for line in inputfile:
        first_list.append(line.strip().split()[0])
print(first_list)
if predict_results[0] == first_list[0]:
    print(True)
else:
    print(False)
This is my current output:
['-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '-1', '-1', '1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '1', '-1', '1', '-1', '1', '-1', '1', '-1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '-1', '-1', '1', '1', '-1', '-1', '-1', '-1', '-1', '1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '-1']
['1', '-1', '1', '-1', '-1', '-1', '-1', '-1', '-1', '1', '1', '-1', '1', '1', '1', '1', '-1', '1', '-1', '1', '1', '1', '1', '-1', '1', '-1', '1', '1', '-1', '-1', '1', '1', '-1', '1', '1', '-1', '1', '1', '1', '1', '-1', '-1', '1', '1', '1', '1', '1', '1', '1', '-1', '-1', '1', '1', '-1', '1', '-1', '1', '-1', '-1', '1', '1', '-1', '1', '1', '1', '-1', '-1', '-1', '1', '-1', '-1', '1', '1', '1', '-1', '1', '-1', '-1', '-1', '1', '-1', '-1', '-1', '1', '1', '-1', '-1', '-1', '-1', '-1', '1', '1', '-1', '-1', '1', '-1', '1', '1', '1', '-1', '1', '-1', '-1', '-1', '1', '1', '-1', '1', '1', '-1', '-1', '-1', '-1', '-1', '1', '-1', '-1', '-1', '-1', '-1']
False
I can only check index [0], and that check works correctly. How do I compare every index in predict_results with the same index in first_list?
Thanks
Hope this will help:
>>> from pandas import *
>>> L1 = [1,2,3]
>>> L2 = [2,2,3]
>>> S1 = Series(L1)
>>> S2 = Series(L2)
>>> RES = S1==S2
>>> RES
0    False
1     True
2     True
For your case:
>>> S1 = Series(predict_results)
>>> S2 = Series(first_list)
>>> RES = S1==S2
>>> RES[0] # check predict_results[0]==first_list[0]
>>> RES[1] # check predict_results[1]==first_list[1]
And you can use a loop:
for x, y in zip(predict_results, first_list):
    if x == y:
        print(True)
    else:
        print(False)
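Without pandas, the per-index results can also be collected into a list and summarised. The sketch below uses short made-up lists in place of the two files; the names matches, mismatched_indexes and accuracy are chosen for this example.

```python
# Short stand-ins for the lists read from the two files.
predict_results = ["-1", "-1", "1", "-1"]
first_list = ["1", "-1", "1", "-1"]

# zip stops at the shorter list, so check the lengths first.
if len(predict_results) != len(first_list):
    raise ValueError("the two lists have different lengths")

# One boolean per index: True where prediction and label agree.
matches = [p == t for p, t in zip(predict_results, first_list)]
mismatched_indexes = [i for i, ok in enumerate(matches) if not ok]
accuracy = sum(matches) / len(matches)  # True counts as 1

print(matches)             # [False, True, True, True]
print(mismatched_indexes)  # [0]
print(accuracy)            # 0.75
```

The values are compared as strings here, exactly as they were read from the files; convert both sides with int() first if numeric comparison is needed.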

Google Charts API, only show every second value on hAxis

I am looking for a way to show only every second value in the hAxis labels of a Google LineChart.
The hAxis label interval is set by the hAxis.showTextEvery option, as described in the documentation. This only works for discrete (non-number) horizontal axes. For example:
function drawVisualization() {
  // Create and populate the data table.
  var data = google.visualization.arrayToDataTable([
    ['x', 'Cats', 'Blanket 1', 'Blanket 2'],
    ['A', 1, 1, 0.5],
    ['B', 2, 0.5, 1],
    ['C', 4, 1, 0.5],
    ['D', 8, 0.5, 1],
    ['E', 7, 1, 0.5],
    ['F', 7, 0.5, 1],
    ['G', 8, 1, 0.5],
    ['H', 4, 0.5, 1],
    ['I', 2, 1, 0.5],
    ['J', 3.5, 0.5, 1],
    ['K', 3, 1, 0.5],
    ['L', 3.5, 0.5, 1],
    ['M', 1, 1, 0.5],
    ['N', 1, 0.5, 1]
  ]);
  // Create and draw the visualization.
  new google.visualization.LineChart(document.getElementById('visualization')).
      draw(data, {curveType: "function",
                  width: 500, height: 400,
                  vAxis: {maxValue: 10},
                  hAxis: {showTextEvery: 2}}
      );
}
If your axis is numerical, you can set this using the hAxis.minorGridlines.count option, as follows:
hAxis: {minorGridlines: {count: 1}}
This will add a single gridline (with no label) in between every two major gridlines.