Import data from HBase and insert into another table using a Pig UDF - python-2.7

I'm using Pig UDFs in Python to read data from an HBase table, then process and parse it, and finally insert it into another HBase table. But I'm facing some issues.
Pig's map is equivalent to Python's dictionary. My Python script takes (rowkey, some_mapped_values) as input and returns two strings, "key" and "content", with the output schema #outputSchema('tuple:(rowkey:chararray, some_values:chararray)'). The core of my Python takes a rowkey, parses it, transforms it into another rowkey, and transforms the mapped data into another string, returning the variables (key, content).
But when I try to insert those new values into another HBase table, I face two problems:
The processing is done correctly, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty.
Second, how do I specify the new column family to insert into?
Here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;
data = LOAD 'hbase://source' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true') AS
(rowkey:chararray, curves:map[]);
data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey,curves);
STORE data_1 INTO 'hbase://destination' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
And my Python script takes (rowkey, curves) as input:
#outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey, curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key, content)

Related

AWS Glue transform string value from postgres to json array

I am new to AWS Glue and pyspark. I have a table in RDS which contains a varchar field id. I want to map id to a String field (let's say newId) in the output JSON, inside a JSON array field:
{
  "sources" : [
    { "newId" : "1234asdf" }
  ]
}
How can I achieve this using the transforms defined in the pyspark script of the AWS Glue job?
Use the AWS Glue Map transformation to map the string field into a field inside a JSON array in the target.
NewFrame = Map.apply(frame=OldFrame, f=map_fields)
and define the map_fields function like so:
def map_fields(rec):
    # Wrap the original id value in an object inside the "sources" array
    rec["sources"] = [{"newId": rec["id"]}]
    # Drop the original field so it does not also appear at the top level
    del rec["id"]
    return rec
Make sure to delete the original field, as done with del rec["id"], otherwise the logic doesn't work.
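For context, here is a rough sketch of how map_fields could sit inside a full Glue job script. The Data Catalog database and table names and the S3 output path are placeholders, not taken from the question:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Map

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: the RDS table crawled into the Glue Data Catalog
OldFrame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",      # placeholder
    table_name="my_rds_table")   # placeholder

def map_fields(rec):
    # Wrap the original id in an object inside the "sources" array
    rec["sources"] = [{"newId": rec["id"]}]
    del rec["id"]
    return rec

NewFrame = Map.apply(frame=OldFrame, f=map_fields)

# Hypothetical JSON sink on S3
glueContext.write_dynamic_frame.from_options(
    frame=NewFrame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder
    format="json")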

How do I read a .csv file into a GCP Dataflow and then get the count for a specific column and write it to BigQuery?

I need to read a csv file into DataFlow that represents a table, perform a GroupBy transformation to get the number of elements that are in a specific column, and then write that number to a BigQuery table along with the original file.
So far I've gotten the first step - reading the file from my storage bucket - and I've called a transformation, but I don't know how to get the count for a single column since the csv has 16 columns.
public class StarterPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
        PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://bucket/data.csv"));
        PCollection<String> grouped_lines = lines.apply(GroupByKey())
        PCollection<java.lang.Long> count = grouped_lines.apply(Count.globally())
        p.run();
    }
}
You are reading whole lines from your CSV into a PCollection of strings. That's most likely not enough for you.
What you want to do is:
Split each whole line into multiple strings, one per column.
Filter the PCollection to the values that contain something in the required column. [1]
Apply Count. [2]
A sketch of these steps follows the links below.
[1] https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/transforms/Filter.html
[2] https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/Count.html
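For illustration only, here is a minimal sketch of those steps plus a BigQuery write. The question's pipeline is in Java, but the same transforms exist in the Beam Python SDK, which is used here for brevity; the column index, the naive comma split, and the BigQuery table and schema are all placeholders:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "ReadLines" >> beam.io.ReadFromText("gs://bucket/data.csv", skip_header_lines=1)
     | "SplitColumns" >> beam.Map(lambda line: line.split(","))   # naive split; use a real csv parser for quoted fields
     | "FilterColumn" >> beam.Filter(lambda cols: len(cols) > 5 and cols[5] != "")  # keep rows with a value in column 5 (placeholder index)
     | "CountRows" >> beam.combiners.Count.Globally()
     | "ToTableRow" >> beam.Map(lambda n: {"row_count": n})
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my_project:my_dataset.my_table",   # placeholder table
           schema="row_count:INTEGER",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))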
It would be better if you convert that csv into a suitable form. For example: convert it into TableRow objects and then perform a GroupByKey. This way you can identify the column corresponding to a particular value and find the count based on that.

How to delete a row from csv file on datalake store without using usql?

I am writing a unit test for appending data to a CSV file on a data lake. I want to test it by finding my test data appended to the same file, and once I find it I want to delete the row I inserted. Basically, once I find the test data my test will pass, but since the tests run in production I have to search for my test data, i.e. find the row I inserted in the file, and delete it after the test is run.
I want to do this without using U-SQL in order to avoid the cost involved in using U-SQL. What are the other possible ways to do it?
You cannot delete a row (or any part) from a file. Azure Data Lake Store is an append-only file system. Data once committed cannot be erased or updated. If you're testing in production, your application needs to be aware of test rows and ignore them appropriately.
The other choice is to read all the rows in U-SQL and then write an output excluding the test rows.
Like other big data analytics platforms, ADLA / U-SQL does not support appending to files per se. What you can do is take an input file, append some content to it (e.g. via U-SQL) and write it out as another file. A simple example:
DECLARE @inputFilepath string = "input/input79.txt";
DECLARE @outputFilepath string = "output/output.txt";

@input =
    EXTRACT col1 int,
            col2 DateTime,
            col3 string
    FROM @inputFilepath
    USING Extractors.Csv(skipFirstNRows : 1);

@output =
    SELECT *
    FROM @input
    UNION ALL
    SELECT *
    FROM(
        VALUES
        (
            2,
            DateTime.Now,
            "some string"
        )
    ) AS x (col1, col2, col3);

OUTPUT @output
TO @outputFilepath
USING Outputters.Csv(quoting : false, outputHeader : true);
If you want further control, you can do some things via the PowerShell SDK, e.g. test that an item exists:
Test-AdlStoreItem -Account $adls -Path "/data.csv"
or move an item with Move-AzureRmDataLakeStoreItem. More details here:
Manage Azure Data Lake Analytics using Azure PowerShell
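Since the question asks for options that avoid U-SQL, one possibility is the read-and-rewrite approach from the first answer done with the azure-datalake-store Python SDK instead: read the whole file, drop the test rows, and write the file back. This is a rough sketch only; the credentials, store name, file path, and test marker are all placeholders:

from azure.datalake.store import core, lib

# Service principal login; all values are placeholders
token = lib.auth(tenant_id="my-tenant-id",
                 client_id="my-client-id",
                 client_secret="my-client-secret")
adl = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Read the whole CSV and keep everything except the rows inserted by the test
with adl.open("/data.csv", "rb") as f:
    lines = f.read().decode("utf-8").splitlines()

kept = [line for line in lines if "MY_TEST_MARKER" not in line]  # placeholder marker

# Rewrite the file without the test rows (this replaces the file, it does not edit it in place)
with adl.open("/data.csv", "wb") as f:
    f.write(("\n".join(kept) + "\n").encode("utf-8"))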

How to create a key value map when there are duplicate keys?

I am new to Pig. I have the output below.
(001,Kumar,Jayasuriya,1123456754,Matara)
(001,Kumar,Sangakkara,112722892,Kandy)
(001,Rajiv,Reddy,9848022337,Hyderabad)
(002,siddarth,Battacharya,9848022338,Kolkata)
(003,Rajesh,Khanna,9848022339,Delhi)
(004,Preethi,Agarwal,9848022330,Pune)
(005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(006,Archana,Mishra,9848022335,Chennai)
(007,Kumar,Dharmasena,758922419,Colombo)
(008,Mahela,Jayawerdana,765557103,Colombo)
How can I create a map of the above so that the output will look something like,
001#{(Kumar,Jayasuriya,1123456754,Matara),(Kumar,Sangakkara,112722892,Kandy),(Rajiv,Reddy,9848022337,Hyderabad)}
002#{(siddarth,Battacharya,9848022338,Kolkata)}
I tried the ToMap function.
mapped_students = FOREACH students GENERATE TOMAP($0,$1..);
But I am unable to dump the output from the above command, as the process throws an error and stops there. Any help would be much appreciated.
I think what you are trying to achieve is to group records with the same id into tuples.
The TOMAP function converts key/value expression pairs into a map, so you won't be able to group the rest of your records with it, and it will result in something like unable to open iterator for alias.
For your desired output, here is the piece of code:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',') AS (id:chararray,first:chararray,last:chararray,phone:chararray,city:chararray);
If you do not want to give schema then:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',');
B = GROUP A BY $0;  -- this relation will group all your records based on your first column
DESCRIBE B;         -- this will show the described schema
DUMP B;
Hope this helps..

How to iterate on each line of an RDD which contains a textFile

I'm trying to do something like this:
file = sc.textFile('mytextfile')

def myfunction(mystring):
    new_value = mystring
    for i in file.toLocalIterator():
        if i in mystring:
            new_value = i
    return new_value

rdd_row = some_data_frame.map(lambda u: Row(myfunction(u.column_name)))
But I get this error
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers
The problem is (as is clearly stated in the error message) that you are trying to work with an RDD inside the map. file is an RDD. It can have various transformations applied to it (e.g. you are trying to run a local iterator on it), but you cannot use one transformation inside another - in this case, the map.
UPDATE
If I understand correctly, you have a dataframe df with a column URL. You also have a text file which contains blacklist values.
Let's assume for the sake of argument that your blacklist file is a csv with a column blacklistNames and that the dataframe df's URL column is already parsed, i.e. you just want to check whether URL is in the blacklistNames column.
What you can do is something like this:
df.join(blackListDF, df["URL"]==blackListDF["blacklistNames"], "left_outer")
This join basically adds a blacklistNames column to your original dataframe, which contains the matched name if the URL is in the blacklist and null otherwise. Now all you need to do is filter based on whether the new column is null.
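A minimal end-to-end sketch of that approach, assuming both files can be read as CSVs with headers; the paths are placeholders, and URL and blacklistNames are the column names used above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("blacklist-check").getOrCreate()

df = spark.read.csv("path/to/urls.csv", header=True)                # has the URL column
blackListDF = spark.read.csv("path/to/blacklist.csv", header=True)  # has the blacklistNames column

# Left outer join: rows whose URL is blacklisted get a non-null blacklistNames value
joined = df.join(blackListDF, df["URL"] == blackListDF["blacklistNames"], "left_outer")

blacklisted = joined.filter(col("blacklistNames").isNotNull())
clean = joined.filter(col("blacklistNames").isNull()).drop("blacklistNames")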