I'm trying to do something like this:
file = sc.textFile('mytextfile')

def myfunction(mystring):
    new_value = mystring
    for i in file.toLocalIterator():
        if i in mystring:
            new_value = i
    return new_value

rdd_row = some_data_frame.map(lambda u: Row(myfunction(u.column_name)))
But I get this error:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers
The problem is (as the error message states) that you are trying to work with an RDD inside the map. file is an RDD; you can apply transformations and actions to it (for example the toLocalIterator you are using), but only from driver code. Here you are using it inside another transformation, the map, which runs on the workers.
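A common way around this (just a sketch, assuming the text file is small enough to collect to the driver) is to pull the file's contents onto the driver once and broadcast them, so the function only touches plain Python data on the workers:

# Sketch only: assumes 'mytextfile' fits in driver memory.
patterns = sc.broadcast(sc.textFile('mytextfile').collect())

def myfunction(mystring):
    new_value = mystring
    for i in patterns.value:   # a plain Python list, safe to use inside workers
        if i in mystring:
            new_value = i
    return new_value

rdd_row = some_data_frame.map(lambda u: Row(myfunction(u.column_name)))  # or some_data_frame.rdd.map on Spark 2.x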
UPDATE
If I understand correctly, you have a dataframe df with a column URL. You also have a text file which contains blacklist values.
Let's assume for the sake of argument that your blacklist file is a CSV with a column blacklistNames and that the dataframe df's URL column is already parsed, i.e. you just want to check whether URL appears in the blacklistNames column.
What you can do is something like this:
df.join(blackListDF, df["URL"]==blackListDF["blacklistNames"], "left_outer")
This join adds a blacklistNames column to your original dataframe; it contains the matched name when the URL is in the blacklist and null otherwise. Now all you need to do is filter on whether or not the new column is null.
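A minimal sketch of the whole flow (the blacklist path, the header option and the spark session name are assumptions here):

blackListDF = spark.read.csv("blacklist.csv", header=True)   # column: blacklistNames

joined = df.join(blackListDF, df["URL"] == blackListDF["blacklistNames"], "left_outer")

clean_df       = joined.filter(joined["blacklistNames"].isNull())      # URLs not in the blacklist
blacklisted_df = joined.filter(joined["blacklistNames"].isNotNull())   # URLs that matched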
I need to read a CSV file into Dataflow that represents a table, perform a GroupBy transformation to get the number of elements in a specific column, and then write that number to a BigQuery table along with the original file.
So far I've got the first step working, reading the file from my storage bucket, and I've called a transformation, but I don't know how to get the count for a single column since the CSV has 16 columns.
public class StarterPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://bucket/data.csv"));
    PCollection<String> grouped_lines = lines.apply(GroupByKey());
    PCollection<java.lang.Long> count = grouped_lines.apply(Count.globally());

    p.run();
  }
}
You are reading whole lines from your CSV into a PCollection of strings. That's most likely not enough for you.
What you want to do is:
1. Split each whole line into multiple strings, one per column.
2. Filter the PCollection to the values that contain something in the required column. [1]
3. Apply Count. [2]
A sketch of these steps is given after the links below.
[1] https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/transforms/Filter.html
[2] https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/Count.html
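Here is a minimal sketch of those three steps using the Beam Python SDK (the question is in Java, but the shape is the same; the column index 1, the header handling and the non-empty check are assumptions to adapt to your data):

import apache_beam as beam

with beam.Pipeline() as p:
    count = (
        p
        | "ReadLines"     >> beam.io.ReadFromText("gs://bucket/data.csv", skip_header_lines=1)
        | "SplitColumns"  >> beam.Map(lambda line: line.split(","))    # 1. split each line into columns
        | "FilterColumn"  >> beam.Filter(lambda cols: cols[1] != "")   # 2. keep rows with a value in the column
        | "CountGlobally" >> beam.combiners.Count.Globally()           # 3. count what is left
    )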
It would be better if you convert that CSV into a more suitable form first. For example, parse each line into a TableRow (or a key/value pair keyed by the column you care about) and then perform a GroupByKey. This way you can tell which value belongs to which column and find the count based on that.
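As a rough illustration of that idea (again sketched with the Python SDK rather than TableRow; the column index is an assumption), you can extract the column of interest and count per value:

import apache_beam as beam

with beam.Pipeline() as p:
    per_value_counts = (
        p
        | "ReadLines"     >> beam.io.ReadFromText("gs://bucket/data.csv", skip_header_lines=1)
        | "ExtractColumn" >> beam.Map(lambda line: line.split(",")[1])   # value of the column of interest
        | "CountPerValue" >> beam.combiners.Count.PerElement()           # yields (value, count) pairs
    )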
I have a python dictionary that maps column names from a source table to a destination table.
Note: this question was answered in a previous thread for a different query string, but this query string is more complicated and I'm not sure if it can be generated using the same list comprehension method.
Dictionary:
tablemap_computer = {
'ComputerID' : 'computer_id',
'HostName' : 'host_name',
'Number' : 'number'
}
I need to dynamically produce the following query string, such that it will update properly when new column name pairs are added to the dictionary.
(ComputerID, HostName, Number) VALUES (%(computer_id.name)s, %(host_name)s, %(number)s)
I started with a list comprehension, but so far I've only been able to generate the first part of the query string with this technique.
queryStrInsert = '(' + ','.join([tm_val for tm_key, tm_val in tablemap_computer.items()]) + ')'
print(queryStrInsert)
#Output
#(computer_id,host_name,number)
#Still must generate the remaining part of the query string parameterized VALUES
If I understand what you're trying to get at, you can get it done this way:
holder = list(zip(*tablemap_computer.items()))
"insert into mytable ({0}) values ({1})".format(
    ",".join(holder[0]),
    ",".join(["%({})s".format(x) for x in holder[1]])
)
This should yield:
# 'insert into mytable (HostName,Number,ComputerID) values (%(host_name)s,%(number)s,%(computer_id)s)'
I hope this helps.
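As a possible follow-up (a sketch only, assuming a DB-API driver such as psycopg2, an open connection conn, and a hypothetical source_row dict holding the values to insert), the generated string pairs naturally with a parameter dict:

holder = list(zip(*tablemap_computer.items()))
query = "insert into mytable ({0}) values ({1})".format(
    ",".join(holder[0]),
    ",".join(["%({})s".format(x) for x in holder[1]])
)

# map each placeholder name to its value, e.g. {'computer_id': source_row['ComputerID'], ...}
params = {dest: source_row[src] for src, dest in tablemap_computer.items()}

with conn.cursor() as cur:
    cur.execute(query, params)   # the driver substitutes the %(...)s placeholders
conn.commit()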
I am new to Pig. I have the below output.
(001,Kumar,Jayasuriya,1123456754,Matara)
(001,Kumar,Sangakkara,112722892,Kandy)
(001,Rajiv,Reddy,9848022337,Hyderabad)
(002,siddarth,Battacharya,9848022338,Kolkata)
(003,Rajesh,Khanna,9848022339,Delhi)
(004,Preethi,Agarwal,9848022330,Pune)
(005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(006,Archana,Mishra,9848022335,Chennai)
(007,Kumar,Dharmasena,758922419,Colombo)
(008,Mahela,Jayawerdana,765557103,Colombo)
How can I create a map of the above so that the output will look something like,
001#{(Kumar,Jayasuriya,1123456754,Matara),(Kumar,Sangakkara,112722892,Kandy),(001,Rajiv,Reddy,9848022337,Hyderabad)}
002#{(siddarth,Battacharya,9848022338,Kolkata)}
I tried the ToMap function.
mapped_students = FOREACH students GENERATE TOMAP($0,$1..);
But I am unable to dump the output from the above command, as the process throws an error and stops there. Any help would be much appreciated.
I think what you are trying to achieve is to group records that share the same id.
The TOMAP function converts key/value expression pairs into a map, so it won't group the rest of the record's fields for you; that is why you end up with an error along the lines of "Unable to open iterator for alias ...".
To get your desired output, here is the piece of code:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',') AS (id:chararray, first:chararray, last:chararray, phone:chararray, city:chararray);
If you do not want to give a schema:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',');
B = GROUP A BY $0;  -- groups all records by the first column
DESCRIBE B;         -- shows the schema of B
DUMP B;
Hope this helps.
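For intuition only, the GROUP BY above does the same bucketing you would get from a plain Python sketch like this (hypothetical data, just to show the shape of the result):

from collections import defaultdict

records = [
    ("001", "Kumar", "Jayasuriya", "1123456754", "Matara"),
    ("001", "Kumar", "Sangakkara", "112722892", "Kandy"),
    ("002", "siddarth", "Battacharya", "9848022338", "Kolkata"),
]

groups = defaultdict(list)
for rec in records:
    groups[rec[0]].append(rec[1:])   # bucket the rest of each record under its id

# groups["001"] -> [("Kumar", "Jayasuriya", ...), ("Kumar", "Sangakkara", ...)]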
I'm using Pig UDFs in Python to read data from an HBase table, then process and parse it, and finally insert it into another HBase table. But I'm facing some issues.
- Pig's map is equivalent to Python's dictionary;
- my Python script takes (rowkey, some_mapped_values) as input and returns two strings, "key" and "content", with @outputSchema('tuple:(rowkey:chararray, some_values:chararray)');
- the core of the Python script takes a rowkey, parses it, transforms it into another rowkey, and transforms the mapped data into another string, returning the variables (key, content).
But when I try to insert those new values into another HBase table, I face two problems:
- the processing itself works, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty;
- secondly, how do I specify the new column family to insert into?
Here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;
data = LOAD 'hbase://source' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true') AS
(rowkey:chararray, curves:map[]);
data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey,curves);
STORE data_1 INTO 'hbase://destination' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
And my Python script takes (rowkey, curves) as input:
@outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey, curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key, content)
With the csv module, I loop through the rows to execute logic:
import csv

with open("file.csv", "r") as csv_read:
    r = csv.reader(csv_read, delimiter=",")
    next(r, None)  # skip the header row
    for row in r:
        pass  # logic here
I'm new to Pandas, and I want to execute the same logic, using the second column only in the csv as the input for the loop.
import pandas as pd
pd.read_csv("file.csv", usecols=[1])
Assuming the above is correct, what should I do from here to execute the logic based on the cells in column 2?
I want to use the cell values in column 2 as input for a web crawler. It takes each result and inputs it as a search term on a webpage, and then scrapes data from that webpage. Is there any way to grab each cell value in the array rather than the whole array at the same time?
Basically the pandas equivalent of your code is this:
import pandas as pd
df = pd.read_csv("file.csv", usecols=[1])
So passing usecols=[1] will only load the second column, see the docs.
Now, assuming this column has a name like 'url' (the exact name doesn't really matter), we can apply a function to every value in it:
def crawl(x):
    # do something with the single url value x
    ...

df['url'].apply(crawl)
Calling apply on the column (a Series) runs crawl on each url in your column, one value at a time.
EDIT
Alternatively, you can pass axis=1 to df.apply so that it processes each row rather than each whole column:
df.apply(crawl, axis=1)
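Putting it together, a minimal end-to-end sketch (the use of requests and the body of crawl are placeholders, not part of the original answer):

import pandas as pd
import requests   # assumption: whatever HTTP/scraping library your crawler uses

df = pd.read_csv("file.csv", usecols=[1])   # load only the second column
url_column = df.columns[0]                  # whatever that column is called in the header

def crawl(url):
    # placeholder: fetch the page for one url and return something scraped from it
    response = requests.get(url)
    return len(response.text)

results = df[url_column].apply(crawl)   # crawl is called once per cell in column 2
print(results)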