I'm quite new to Python coding and I can't solve the following problem:
I have a list with tracking points for different animals (ID, date, time, lat, lon), given as strings:
aList = [[id,date,time,lat,lon],
[id2,date,time,lat,lon],
[...]]
The txt file is very big and each ID (a unique animal) occurs multiple times, e.g.:
aList = [['25','20-05-13','15:16:17','34.89932','24.09421'],
['24','20-05-13','15:16:18','35.89932','23.09421'],
['25','20-05-13','15:18:15','34.89932','24.13421'],
[...]]
What I'm trying to do is organize the IDs in a dictionary so that each unique ID is a key and all the dates, times, latitudes and longitudes are its values. Then I would like to write each individual ID to a new txt file, so all the values for a specific ID end up in one txt file. The output should look like this:
{'25': [['20-05-13','15:16:17','34.89932','24.09421'],
        ['20-05-13','15:18:15','34.89932','24.13421'],
        [...]],
 '24': [['20-05-13','15:16:18','35.89932','23.09421'],
        [...]]
}
I have tried the following (and a lot of other solutions which didn't work):
items = {}
for line in aList:
    key, value = line[0], line[1:]
    items[key] = value
This results in each key holding only the last value seen for that particular key:
{'25':['20-05-13','15:18:15','34.89932','24.13421'],
'24':['20-05-13','15:16:18','35.89932','23.09421']}
How can I loop through my list and assign the same IDs to the same key and all the corresponding values?
Is there any simple solution to this? Other "easier to implement" solutions are welcome!
I hope it makes sense :)
Try appending all the lists that share the same ID as a list of lists:
aList = [['25','20-05-13','15:16:17','34.89932','24.09421'],
['24','20-05-13','15:16:18','35.89932','23.09421'],
['25','20-05-13','15:18:15','34.89932','24.13421'],
]
items = {}
for line in aList:
    key, value = line[0], line[1:]
    if key in items:
        items[key].append(value)
    else:
        items[key] = [value]
print(items)
OUTPUT:
{'24': [['20-05-13', '15:16:18', '35.89932', '23.09421']], '25': [['20-05-13', '15:16:17', '34.89932', '24.09421'], ['20-05-13', '15:18:15', '34.89932', '24.13421']]}
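As a side note, a collections.defaultdict makes the same grouping a bit shorter, since the if/else check disappears; a minimal sketch of the same idea:
from collections import defaultdict
items = defaultdict(list)  # missing keys start out as an empty list
for line in aList:
    items[line[0]].append(line[1:])  # the ID is the key; the rest of the row is appended
print(dict(items))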
In PySpark, I have two RDDs which are structured as (key, list of lists):
input_rdd.take(2)
[(u'100',
[[u'36003165800', u'70309879', u'1']]),
(u'200',
[[u'5196352600', u'194837393', u'99']]) ]
output_rdd.take(2)
[(u'100',
[[u'875000', u'5959', u'1']]),
(u'300', [[u'16107000', u'12428', u'1']])]
Now I want a resultant RDD (as shown below) which groups the two RDDs by key and gives the output as a tuple in the order (key, (input list, output list)). In case the key is not present in either the input or the output RDD, the list for that RDD remains empty.
[(u'100',
  ([[[u'36003165800', u'70309879', u'1']]],
   [[[u'875000', u'5959', u'1']]])),
 (u'200',
  ([[[u'5196352600', u'194837393', u'99']]],
   [])),
 (u'300', ([], [[[u'16107000', u'12428', u'1']]]))
]
For obtaining the resultant RDD I'm using the piece of code below:
resultant = sc.parallelize([(x, tuple(map(list, y))) for x, y in sorted(input_rdd.groupWith(output_rdd).collect())])
Is there a way I can remove .collect() and instead use .map() with the groupWith function to obtain the same resultant RDD in PySpark?
A full outer join gives:
input_rdd.fullOuterJoin(output_rdd).collect()
# [(u'200', ([[u'5196352600', u'194837393', u'99']], None)),
# (u'300', (None, [[u'16107000', u'12428', u'1']])),
# (u'100', ([[u'36003165800', u'70309879', u'1']], [[u'875000', u'5959', u'1']]))]
To replace None with []:
input_rdd.fullOuterJoin(output_rdd).map(lambda x: (x[0], tuple(i if i is not None else [] for i in x[1]))).collect()
# [(u'200', ([[u'5196352600', u'194837393', u'99']], [])),
# (u'300', ([], [[u'16107000', u'12428', u'1']])),
# (u'100', ([[u'36003165800', u'70309879', u'1']], [[u'875000', u'5959', u'1']]))]
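Since only the values change, mapValues (which leaves the keys untouched) is arguably a cleaner fit here, and the same trick applied to groupWith answers the original question without collecting to the driver. A sketch of both, using the same input RDDs:
joined = input_rdd.fullOuterJoin(output_rdd) \
                  .mapValues(lambda v: tuple(i if i is not None else [] for i in v))
# groupWith works too: materialize each ResultIterable as a plain list
grouped = input_rdd.groupWith(output_rdd) \
                   .mapValues(lambda v: tuple(map(list, v)))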
I have a list of ordered tuples in which each tuple contains a column name and value pair to be written to a csv, for example
lst = [('name','bob'),('age',19),('loc','LA')]
which holds, for bob, age 19 and location (loc) LA. I want to be able to write this to a CSV file based on the column names, but sometimes some of these columns are missing, for example in another row:
lst2 = [('name','bob'),('loc','LA')]
age is missing. How can I write these rows properly to a csv in Python?
Those tuples can be used to initialize a dict, so csv.DictWriter seems the best choice. In this example I create a dict filled with default values. For each list of tuples, I copy the dict, update it with the known values and write it out.
import csv
# sample data
lst = [('name','bob'),('age',19),('loc','LA')]
lst2 = [('name','jane'),('loc','LA')]
lists = [lst, lst2]
# columns need some sort of default... I just guessed
defaults = {'name':'', 'age':-1, 'loc':'N/A'}
with open('output.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=sorted(defaults.keys()))
    writer.writeheader()
    for row_tuples in lists:
        # copy defaults then update with known values
        kv = defaults.copy()
        kv.update(row_tuples)
        writer.writerow(kv)
# debug...
print(open('output.csv').read())
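For what it's worth, DictWriter can also fill in missing columns by itself via its restval argument, which removes the need for the copy/update step; a minimal sketch producing the same kind of output:
import csv
lists = [[('name', 'bob'), ('age', 19), ('loc', 'LA')],
         [('name', 'jane'), ('loc', 'LA')]]
with open('output.csv', 'w', newline='') as outfile:
    # restval is written for any fieldname missing from the row dict
    writer = csv.DictWriter(outfile, fieldnames=['name', 'age', 'loc'], restval='')
    writer.writeheader()
    for row_tuples in lists:
        writer.writerow(dict(row_tuples))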
You should give more examples of what exactly is required: for instance, if the location is not given in lst2, what do you want written to your csv? From what I understand, you can make a function with default arguments:
import csv
def write_tuples_to_csv(name="DefaultName", age="DefaultAge", loc="DefaultLocation"):
    with open("/path/to/csv/file", 'a', newline='') as f:  # appending to the file
        writer = csv.writer(f)
        writer.writerow((name, age, loc))
Write the header row once when the file is created, then call this function for every item in the list. This should help get you started.
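For example (assuming the path above exists), each row's tuples can be unpacked into keyword arguments, so any missing column falls back to its default:
for row in [lst, lst2]:               # the two rows from the question
    write_tuples_to_csv(**dict(row))  # missing keys use the defaults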
I have UDF output as follows.
Sample records:
({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema for the UDF: name:chararray (a single column)
Now I want to read this bag of tuples and generate output like:
Todd,240
Jon,422
I stored the output of the UDF in a temp file and read it back using a different schema:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that I am trying to use a FOREACH loop with dot notation to find the sum.
X = foreach D generate B.T.name,SUM(B.T.denom);
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1128: Cannot find field T in name:chararray,denom:int Details at
logfile: /home/training/pig_1488648405070.log
Can you please let me know how to do this? I am new to Apache Pig, so I'm not sure how to traverse a bag of tuples and compute a sum.
FLATTEN the bag first so you can GROUP the dataset on name, and then apply SUM.
flattened = FOREACH D GENERATE FLATTEN(B);
dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....
Then, GROUP them on name
grouped = GROUP flattened by name;
dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
And apply SUM() over the result
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
(Jon,106)
(Todd,100)
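Putting the steps together, the whole script is:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
flattened = FOREACH D GENERATE FLATTEN(B);
grouped = GROUP flattened BY name;
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
DUMP final_sum;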
I am trying to populate a list in Python 3 with 3 random items read from a file using a regex, but I keep getting duplicate items in the list.
Here is an example.
import re
import random as rn
data = '/root/Desktop/Selenium[FILTERED].log'
with open(data, 'r') as inFile:
    index = inFile.read()
URLS = re.findall(r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', index)
list_0 = []
for i in range(3):
    list_0.append(URLS[rn.randint(1, 30)])
for i in range(len(list_0)):
    print(list_0[i])
What would be the cleanest way to prevent duplicate items being appended to the list?
(EDIT)
This is the code that I think does the job quite well.
def random_sample(data):
    r_e = ['https://www\.\w{1,10}\.com/view\?i=\w{1,20}', '..']
    with open(data, 'r') as inFile:
        urls = re.findall(r'%s' % r_e[0], inFile.read())
    return list(set(urls))
data = '/root/Desktop/[TEMP].log'
sample = random_sample(data)
for i in range(3):
    print(sample[i])
Use a set: an unordered collection with no duplicate entries.
Use the builtin random.sample.
random.sample(population, k)
Return a k length list of unique elements chosen from the population sequence or set.
Used for random sampling without replacement.
Addendum
After seeing your edit, it looks like you've made things much harder than they have to be. I've hard-wired a list of URLs in the following, but the source doesn't matter. Selecting the (guaranteed unique) subset is essentially a one-liner with random.sample:
import random
# the following two lines are easily replaced
URLS = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6', 'url7', 'url8']
SUBSET_SIZE = 3
# the following one-liner yields the randomized subset as a list
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
print(urlList) # produces, e.g., => ['url7', 'url3', 'url4']
Note that by using len(URLS) and SUBSET_SIZE, the one-liner that does the work is hardwired neither to the size of the set nor to the desired subset size.
Addendum 2
If the original list of inputs contains duplicate values, the following slight modification will fix things for you:
URLS = list(set(URLS)) # this converts to a set for uniqueness, then back for indexing
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
Or even better, because it doesn't need two conversions (note this works only on Pythons before 3.11, where random.sample still accepts a set):
URLS = set(URLS)
urlList = random.sample(URLS, SUBSET_SIZE)
seen = set(list_0)
randValue = URLS[rn.randint(1, 30)]
# [...]
if randValue not in seen:
    seen.add(randValue)
    list_0.append(randValue)
Now you just need to check that the size of list_0 is equal to 3 to stop the loop.
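Put together, a minimal sketch of that loop, reusing URLS and rn from the question:
list_0 = []
seen = set()
while len(list_0) < 3:
    randValue = URLS[rn.randint(1, 30)]
    if randValue not in seen:  # skip anything already picked
        seen.add(randValue)
        list_0.append(randValue)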