Pyspark: Using map function instead of collect for iterating RDDs - python-2.7

In PySpark, I have two RDDs structured as (key, list of lists):
input_rdd.take(2)
[(u'100', [[u'36003165800', u'70309879', u'1']]),
 (u'200', [[u'5196352600', u'194837393', u'99']])]

output_rdd.take(2)
[(u'100', [[u'875000', u'5959', u'1']]),
 (u'300', [[u'16107000', u'12428', u'1']])]
Now I want a resultant RDD (shown below) that groups the two RDDs by key and outputs a tuple in the order (key, (input list, output list)). If a key is not present in either the input or the output RDD, the list for that RDD stays empty.
[(u'100',
  ([[[u'36003165800', u'70309879', u'1']]],
   [[[u'875000', u'5959', u'1']]])),
 (u'200',
  ([[[u'5196352600', u'194837393', u'99']]],
   [])),
 (u'300',
  ([],
   [[[u'16107000', u'12428', u'1']]]))]
To obtain the resultant RDD I'm using the piece of code below:
resultant = sc.parallelize([(x, tuple(map(list, y))) for x, y in sorted(input_rdd.groupWith(output_rdd).collect())])
Is there a way I can remove .collect() and instead use .map() with groupWith to obtain the same resultant RDD in PySpark?

A full outer join gives:
input_rdd.fullOuterJoin(output_rdd).collect()
# [(u'200', ([[u'5196352600', u'194837393', u'99']], None)),
# (u'300', (None, [[u'16107000', u'12428', u'1']])),
# (u'100', ([[u'36003165800', u'70309879', u'1']], [[u'875000', u'5959', u'1']]))]
To replace None with []:
input_rdd.fullOuterJoin(output_rdd).map(lambda x: (x[0], tuple(i if i is not None else [] for i in x[1]))).collect()
# [(u'200', ([[u'5196352600', u'194837393', u'99']], [])),
# (u'300', ([], [[u'16107000', u'12428', u'1']])),
# (u'100', ([[u'36003165800', u'70309879', u'1']], [[u'875000', u'5959', u'1']]))]
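If the triple-nested lists are what you want, you can also stay entirely on the cluster and postpone collect() until you actually need the data on the driver. A minimal sketch, assuming the same input_rdd and output_rdd as above:
# groupWith groups both RDDs by key into ResultIterables; mapValues
# turns each ResultIterable into a plain list without touching the driver.
resultant = input_rdd.groupWith(output_rdd).mapValues(lambda v: tuple(map(list, v)))
resultant.take(3)  # collect/take only when you need to inspect the result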

Related

Order string list with multiple conditions

I would like to order a list using two conditions, sorting by the file extension (txt first) ascending and by the file name descending.
# Input list:
lst = ["d.csv", "a.TXT", "b.txt", "c.csv"]
# Expected output:
lst = ["b.txt", "a.TXT", "d.csv", "c.csv"]
I have sample code, but it uses the default ascending order for both keys (I haven't found any possibility to set the reverse option for the second condition only):
sorted(lst, key=lambda x: (x.split(".")[-1].lower(), x.split(".")[0].lower()))
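One way to reverse only one key is to exploit the stability of Python's sort and chain two passes, sorting by the secondary key first. A minimal sketch for the sample data above (the "txt first" ordering is forced explicitly, since it is neither plain ascending nor descending):
lst = ["d.csv", "a.TXT", "b.txt", "c.csv"]
# Pass 1: secondary key, file name descending.
lst.sort(key=lambda x: x.split(".")[0].lower(), reverse=True)
# Pass 2: primary key; False sorts before True, so txt files come first.
# The sort is stable, so ties keep the name order from pass 1.
lst.sort(key=lambda x: x.split(".")[-1].lower() != "txt")
print(lst)  # ['b.txt', 'a.TXT', 'd.csv', 'c.csv']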

error: pyspark list append operation inside foreach on dataframe gives empty list outside of loop

I am facing the following issue:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [('James','Smith','M',30), ('Anna','Rose','F',41), ('Robert','Williams','M',62)]
columns = ["firstname", "lastname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

lst = []

def func2(x):
    lst = lst.append(x.firstname)

df.foreach(func2)
# df.foreach(lambda x: func2(x))

print(len(lst))
The lst variable is always empty at the end of the loop. What is the reason for this? Any fix?
Thanks!
The reason your code does not work is that functions passed to foreach run on the executors, each in its own local Python process, so global variables on the driver are not shared with or updated by the executors.
You can use accumulators to achieve this. However, they come with a performance penalty, and PySpark does not provide a native list accumulator.
Solution using Accumulators
from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    def zero(self, init_value):
        return init_value

    def addInPlace(self, v1, v2):
        return v1 + v2

lst = spark.sparkContext.accumulator([], ListParam())

def func2(x):
    global lst
    lst += [x.firstname]

df.foreach(func2)
print(lst.value)
Output:
['James', 'Anna', 'Robert']
If you are looking to get back all the values of a particular column in PySpark, you can select that column, collect the Row objects, and then fetch the field you are interested in.
Collecting is an expensive operation that brings all the data to the driver; with a large volume of data it can cause the driver to fail.
[row.firstname for row in df.select("firstname").collect()]
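If the column is too large to collect at once, one option (my suggestion, not part of the original answer) is to stream it with toLocalIterator(), which pulls one partition at a time instead of the whole result set:
# Trades extra job scheduling for lower peak driver memory use.
names = [row.firstname for row in df.select("firstname").toLocalIterator()]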

How to combine two lists together that are being formed while iterating another list?

I'm using Scrapy to iteratively scrape some data, and the data is output as two lists in each iteration. I want to combine the two lists into one list at each iteration, so that in the end I will have one big list with many sublists (each sublist being the combination of the two lists created in one iteration).
That may be confusing so I will show my current output and code:
using Scrapy I"m iterating in the following way,
for i in response.css("tr.insider....."):
    i.css('a.tab-link::text').extract()  # creating the first list
    i.css('td::text').extract()          # creating the second list
So the current output is something like this
[A,B,C] #first iteration
[1,2,3]
[D,E,F] #second iteration
[4,5,6]
[G,H,I] #third iteration
[7,8,9]
Desired output is
[[A,B,C,1,2,3], [D,E,F,4,5,6],[G,H,I,7,8,9]]
I tried the following code but I'm getting a list of None.
x = []
for i in response.css("tr.insider....."):
    x.append(i.css('a.tab-link::text').extract().extend(i.css('td::text').extract()))
But the return is just
None
None
None
None
None.....
Thanks!
The extend method returns None, so you always append None to x.
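A quick illustration, since extend mutates the list in place rather than returning it:
>>> a = [1, 2]
>>> result = a.extend([3, 4])  # extend modifies a in place and returns None
>>> print(result)
None
>>> a
[1, 2, 3, 4]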
For your purpose, I think this is what you want:
for i in response.css("tr.insider....."):
    i.css('a.tab-link::text, td::text').extract()
You can simply add two lists together and append them to your results list.
results = []
for i in response.css("tr.insider....."):
    first = i.css('a.tab-link::text').extract()
    second = i.css('td::text').extract()
    # combine both and append to results
    results.append(first + second)

print(results)
# e.g.: [[A,B,C,1,2,3], [D,E,F,4,5,6], [G,H,I,7,8,9]]
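The same thing as a single list comprehension, assuming the same (elided) selector:
results = [
    i.css('a.tab-link::text').extract() + i.css('td::text').extract()
    for i in response.css("tr.insider.....")
]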

Comparing lists which consist of tuples using RF

Let me try to put this in a simple manner.
I have 2 lists, which look like below:
List1 = [('a', 1, 'low'), ('b', 10, 'high')] # --> Tuples in List
List2 = ["('a', 1, 'low')", "('b', 10, 'high')"] # --> Here the Tuples are actually of Type String.
List1 is output of a SQL query. List2 is defined by me as expected result.
I am using Robot Framework to compare these two lists with the Keyword Lists Should Be Equal. But it fails as List2 has strings which look like Tuple.
How can I compare these two lists? Can I convert both lists to a common type so that I can compare them? I am trying to avoid Python coding here.
It's unclear exactly what your data looks like, but since the two lists have different contents you'll have to convert one or both of them to a common format.
You could, for example, convert the first list to a string with something like this:
| ${actual}= | Evaluate | [str(x) for x in ${List1}]
I doubt that gives you exactly what you need because, again, it's unclear exactly what you need. However, the technique remains the same: use Evaluate to write a little bit of python code to convert one of your lists to be the same format as the other list before doing the compare.
This may be the longer procedure. I have used tuples like (1, 2) instead of (a, 1, low), since bare names would cause a NameError in Python, but you said the data comes from SQL, so yours will already be valid values. What is important is the difference between '(1,2)' and '(1, 2)': str() puts a space after the comma, so a space mismatch will make the comparison fail.
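A quick check of where that space comes from:
>>> str((1, 2))
'(1, 2)'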
var.py:
List1 = [(1, 2), (3, 4)]
test.robot:
*** Settings ***
Library           BuiltIn
Library           Collections
Variables         var.py
Library           String

*** Variables ***
@{appnd_li}

*** Test Cases ***
TEST
    # construct List2 = ["(1, 2)", "(3, 4)"]
    ${List2}=    Create List    (1, 2)    (3, 4)
    # List1 is imported from the variable file
    ${len}=    Get Length    ${List1}
    # initialize an empty list
    ${li}=    Create List    @{appnd_li}
    :FOR    ${I}    IN RANGE    0    ${len}
    \    ${item}=    Convert To String    ${List1[${I}]}
    \    Append To List    ${li}    ${item}
    Lists Should Be Equal    ${li}    ${List2}

how to check if previous element is similar to next element in python

I have a text file like:
abc
abc
abc
def
def
def
...
...
...
...
Now I would like to create lists:
list1=['abc','abc','abc']
list2=['def','def','def']
....
....
....
I would like to know how to check whether the next element is the same as the previous element in a Python for loop.
You can use a list comprehension and check whether the i-th element is equal to the (i-1)-th element in your list. (Note that at i == 0 this compares the first element with the last, because list1[-1] wraps around; start the range at 1 if you only want adjacent pairs.)
[list1[i] == list1[i-1] for i in range(len(list1))]
>>> list1=['abc','abc','abc']
>>> [ list1[i]==list1[i-1] for i in range(len(list1)) ]
[True, True, True]
>>> list1=['abc','abc','abd']
>>> [ list1[i]==list1[i-1] for i in range(len(list1)) ]
[False, True, False]
This can be written within a for loop as well:
aux_list = []
for i in range(len(list1)):
    aux_list.append(list1[i] == list1[i-1])
Check this post:
http://www.pythonforbeginners.com/lists/list-comprehensions-in-python/
for i in range(1, len(list)):
    if list[i] == list[i-1]:
        # over here list[i] is equal to the previous element, i.e. list[i-1]
        pass
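A common alternative idiom (my addition, not from the original answers) pairs each element with its successor using zip, avoiding index arithmetic entirely:
list1 = ['abc', 'abc', 'abd']
# zip stops at the shorter input, so this compares adjacent pairs only.
pairs_equal = [a == b for a, b in zip(list1, list1[1:])]
# [True, False]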
file = open('workfile', 'r')  # open the file
splitStr = file.read().split()
# splitStr will look like ['abc', 'abc', 'abc', 'def', ...]
I think the best way to progress from here would be to use a dictionary
words = {}
for eachStr in splitStr:
    if words.has_key(eachStr):  # we have already found this word (Python 2 only)
        words[eachStr] = words.get(eachStr) + 1  # increment the count value
    else:  # we have not found this word yet
        words[eachStr] = 1  # initialize the new key-value pair
This will create a dictionary, so the result would look like:
print words.items()
[('abc', 3), ('def', 3)]
This way you store all of the information you want. I proposed this solution because it's rather messy to create an unknown number of lists to accommodate what you want to do, but it is easy and memory-efficient to store the data in a dictionary, from which you can create lists if need be. Furthermore, using dictionaries and sets allows you to keep a single copy of each string (in this case).
If you absolutely need new lists, let me know and I will try to help you figure it out.
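If you do need the consecutive runs as separate lists, here is a minimal sketch using itertools.groupby (my suggestion, not part of the original answers):
from itertools import groupby

with open('workfile') as f:
    words = f.read().split()

# groupby clusters consecutive equal items, which matches the
# list1, list2, ... structure the question asks for.
grouped = [list(g) for _, g in groupby(words)]
# e.g. [['abc', 'abc', 'abc'], ['def', 'def', 'def'], ...]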