Convert an RDD of List[String] to an RDD of Row

I'm trying to convert an RDD of fixed-size lists of strings (the result of parsing a CSV file) into an RDD of Rows, so that I can turn it into a DataFrame, because I need a DataFrame to write to Parquet. The only part I need help with is converting the RDD from lists of strings to Rows.
The RDD variable name is RDD.

I used:
import org.apache.spark.sql._
// wrap each fixed-size list of strings in a Row
val RowRDD = RDD.map(r => Row.fromSeq(r))
From there, spark.createDataFrame(RowRDD, schema) with an explicit StructType should give the DataFrame needed for the Parquet write.

Related

PySpark: converting a list of mixed-type tuples into a DataFrame gives null values

I have a function that calculates something and returns a list of tuples; it looks like this:
def check():
    [...]
    return [("valid", 1), ("wrong", 4), ("lines", ["line1", "line2"])]
Then I'd like to add all these values together to get the final counts:
rdd = lines.mapPartitions(lambda x: check()).reduceByKey(lambda a, b: a + b)
The result is something like:
[("valid", 102), ("wrong", 322), ("lines", ["test1", "test2", "test2"])]
My goal is to write the 'lines' values to a file (or multiple files) and the valid and wrong counts to a separate file.
My question is: is there a better data structure than what I'm currently using? If not, how can I find the "lines" tuple in my list?
Or better yet, is it possible to transform that RDD into a DataFrame on which I could run a SQL select?
I tried rdd.toDF().show(), but for some reason the value column of "lines" becomes null.
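One hedged way around the mixed-type column (a sketch, not a tested answer; it assumes the reduced RDD from above is bound to rdd, and the output paths are made up) is to split the values by key before converting, so each resulting RDD holds a single value type:
# sketch: `rdd` holds the reduced ("key", value) pairs from above
counts = rdd.filter(lambda kv: kv[0] in ("valid", "wrong"))  # int values only
lines = rdd.filter(lambda kv: kv[0] == "lines")              # list values only
# with homogeneous values, toDF() infers a clean schema and SQL works
counts_df = counts.toDF(["metric", "count"])
counts_df.write.csv("counts_out")  # hypothetical output path
# flatten the line lists and write them to their own files
lines.flatMap(lambda kv: kv[1]).saveAsTextFile("lines_out")  # hypothetical path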

Is it possible to keep specific values grouped using Python struct.unpack?

I created a parser for some complex binary files using numpy.fromfile, defining the various dtypes necessary for reading each portion of the binary file. The resulting numpy array was then placed into a pandas dataframe, and the same dtype that was defined for converting the binary files into the numpy array was recycled to define the column names for the pandas dataframe.
I was hoping to replicate this process using Python's struct module but ran into an issue. If part of my structure requires a value to be a group of 3 ints, I can define the dtype as numpy.dtype([('NameOfField', '>i4', 3)]) and the value returned from the binary file is [int, int, int]. Can this be replicated using struct, or do I need to regroup the values in the returned tuple based on the dtype before ingesting it into my pandas dataframe? I have read the struct documentation and have not noticed any examples of this.
Using a format of >3i returns int, int, int instead of [int, int, int] like I need.
Edit ...
Below is a generic example. This method using numpy.fromfile works perfectly, but is slow on my huge binary files, so I am trying to reimplement it using struct.
import numpy as np
import pandas as pd

def example_structure():
    dt = np.dtype([
        ('ExampleFieldName', '>i4', 3)
    ])
    return dt

# filename of binary file
file_name = 'example_binary_file'
# define the dtype for this chunk of binary data
d_type = example_structure()
# define initial indices into the file in memory
start_ind = 0
end_ind = 0
# read in the entire file generically
x = np.fromfile(file_name, dtype='u1')
# based on the dtype, find the chunk size
chunk_size = d_type.itemsize
# define the start and end index based on the chunk size
start_ind = end_ind
end_ind = chunk_size + start_ind
# extract just the first chunk
temp = x[start_ind:end_ind]
# cast the chunk as the defined dtype
temp.dtype = d_type
# store the chunk in its own pandas dataframe
example_df = pd.DataFrame(temp.tolist(), columns=temp.dtype.names)
This returns a temp[0] value of [int, int, int], which is then read into the pandas dataframe as a single entry under the column ExampleFieldName. If I attempt to replicate this using struct, the temp[0] value is int, int, int, which is not read properly into pandas. Is there a way to make struct group values like numpy can?
I'd suggest just splitting it up into a list of tuples after unpacking. It won't be as fast as numpy for huge inputs, but then that's why numpy exists :P. Assuming data holds the raw bytes that you want to split into groups of 5 uint32_ts (so len(data) must be a multiple of 20):
import struct
# each "5I" group consumes 20 bytes, so unpack the whole buffer into one flat tuple
output = struct.unpack("5I" * (len(data) // 20), data)
# then regroup the flat tuple into tuples of five
output = [output[i:i+5] for i in range(0, len(output), 5)]
Of course, this means iterating over the data twice, and since struct.unpack doesn't yield successive values (afaik), doing it in one line won't help. It might be faster to iterate over the data directly (I haven't run any tests), like this:
import struct
output, itemsize = [], struct.calcsize("5I")
for i in range(0, len(data), itemsize):
    output.append(struct.unpack("5I", data[i:i+itemsize]))
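On Python 3 there is also struct.iter_unpack, which steps through the buffer one format-sized record at a time and avoids the manual slicing (a minimal sketch, assuming len(data) is an exact multiple of the 20-byte record):
import struct

# iter_unpack yields one 5-int tuple per 20-byte record
output = list(struct.iter_unpack("5I", data))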

dataframe to dictionary: Python

So, I have a file F1.txt:
CDUS,CBSCS,CTRS,CTRS_ID
0,0,0.000000000375,056572
0,0,4.0746,0309044
0,0,0.6182,0971094
0,0,15.4834,075614
I want to insert the column names and their dtypes into a dictionary, with each column name as the key and the corresponding dtype of the column as the value.
My read statement has to be like this:
csv = pandas.read_csv('F1.txt', dtype={'CTRS_ID': str})
I'm expecting something like this:
data = {'CDUS': 'int64', 'CBSCS': 'int64', 'CTRS': 'float64', 'CTRS_ID': 'str'}
Can someone help me with this? Thanks in advance.
You can use dtypes to find the type of each column and then turn the result into a dictionary with to_dict. If you want a string representation of each type, convert the dtypes output to string first:
csv = pandas.read_csv('F1.txt', dtype={'CTRS_ID': str})
csv.dtypes.astype(str).to_dict()
Which gives the output:
{'CBSCS': 'int64', 'CDUS': 'int64', 'CTRS': 'float64', 'CTRS_ID': 'object'}
This is actually the right result, since pandas treats string as object.
I don't have enough expertise to elaborate on this, but here are a couple of references:
pandas distinction between str and object types
pandas string data types
"pandas doesn't support the internal string types (in fact they are always converted to object)" [from pandas maintainer #Jeff]

Write tuples to CSV, skipping missing columns

I have a list of ordered tuples in which each tuple contains a column name and value pair to be written to a CSV, for example:
lst = [('name', 'bob'), ('age', 19), ('loc', 'LA')]
which stands for name bob, age 19, and location (loc) LA. I want to write this to a CSV file based on the column names, and sometimes some of these columns are missing, as in another row:
lst2 = [('name', 'bob'), ('loc', 'LA')]
Here age is missing. How can I write these rows properly to a CSV in Python?
Those tuples can be used to initialize a dict, so csv.DictWriter seems the best choice. In this example I create a dict filled with default values; for each list of tuples, I copy the dict, update it with the known values, and write it out.
import csv

# sample data
lst = [('name', 'bob'), ('age', 19), ('loc', 'LA')]
lst2 = [('name', 'jane'), ('loc', 'LA')]
lists = [lst, lst2]

# columns need some sort of default... I just guessed
defaults = {'name': '', 'age': -1, 'loc': 'N/A'}

with open('output.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=sorted(defaults.keys()))
    writer.writeheader()
    for row_tuples in lists:
        # copy defaults, then update with the known values
        kv = defaults.copy()
        kv.update(row_tuples)
        writer.writerow(kv)

# debug...
print(open('output.csv').read())
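With the sample data above, output.csv should come out roughly like this (fields in sorted order, the missing age filled with the -1 default):
age,loc,name
19,LA,bob
-1,LA,jane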
You should give more examples of exactly what is required: for instance, if the location is not given in lst2, what do you want written to your CSV? From what I understand, you can make a function with default arguments:
import csv

def write_tuples_to_csv(name="DefaultName", age="DefaultAge", loc="DefaultLocation"):
    # append one row per call; write any header once, separately, not here
    with open("/path/to/csv/file", 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow((name, age, loc))
Now you can call this function for every item in the list. This should help get you started.
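For example, one way to drive it from the tuple lists (assumption: each list is first turned into a dict so the pairs map onto the keyword arguments):
for row_tuples in [lst, lst2]:
    write_tuples_to_csv(**dict(row_tuples))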

Python: make a dict of lists from a string

I have a dict of lists that is stored in the database as text:
a = {u'1': [u'12'], u'2': [u'7', u'8', u'9']}
I want to manipulate this structure as a dict of lists, e.g.
a["2"][2] = 9
but I have no idea how to convert this string back to a dict of lists.
You should store that data as JSON rather than just the repr of a dict. Then you can easily convert to and from that format using the json library.
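That said, if the rows already sitting in the database are the repr of a dict, ast.literal_eval from the standard library can parse them safely without changing the storage format (a minimal sketch):
import ast

s = "{u'1': [u'12'], u'2': [u'7', u'8', u'9']}"
a = ast.literal_eval(s)  # back to a real dict of lists
print(a["2"][2])         # prints 9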
Thanks. I can do what I want using JSON:
import json
a = {u'1': [u'12'], u'2': [u'7', u'8', u'9']}
# serialize to a JSON string
y = json.dumps(a)
# and parse it back
b = json.loads(y)
And finally we have:
>>> b["2"][1]
'8'