I'm trying to convert a TableRow containing multiple values to a KV. I can achieve this in a DoFn, but that adds more complexity to the code I want to write later and makes my job harder.
(Basically, I need to perform a CoGroupByKey operation on two PCollections of TableRow.)
Is there any way I can convert a PCollection<TableRow> to PCollection<KV<String, String>>, where the keys and values are stored in the same format as in the TableRow?
I wrote a snippet that looks something like this, but it doesn't give me the result I want. Is there any way I can load all the entries in the TableRow and generate a KV from those values?
ImmutableList<TableRow> input = ImmutableList.of(
    new TableRow().set("val1", "testVal1").set("val2", "testVal2").set("val3", "testVal3"));
PCollection<TableRow> inputPC = p.apply(Create.of(input));
inputPC.apply(MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
    .via(tableRow -> KV.of((String) tableRow.get("val1"), (String) tableRow.get("val2"))));
It looks like what you want is a way to perform a Join on data obtained from BigQuery. There is no way to perform Joins on TableRows directly, because TableRows are not meant to be manipulated as general elements in your pipeline; their purpose is specifically reading from and writing to BigQuery IO.
In order to use existing Beam transforms, you'll want to convert those TableRows into a more useful representation, such as a Java object you write yourself or the Beam schema Row type. Since a TableRow is essentially a dictionary of JSON strings, all you need to do is write a Map function that reads the appropriate types and parses them where necessary. For example:
PCollection<TableRow> tableRows = ... // Reading from BigQuery IO.
PCollection<Foo> foos = tableRows.apply(MapElements.via(
    new SimpleFunction<TableRow, Foo>() {
      @Override
      public Foo apply(TableRow row) {
        String bar = (String) row.get("bar");
        Integer baz = Integer.parseInt((String) row.get("baz"));
        return new Foo(bar, baz);
      }
    }));
Once you have the data in a type of your choice, you can find a way to perform a Join with built-in Beam transforms. There are many potential ways to do this, so I won't list all of them, but a clear first choice to look at is the Join class.
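As a rough illustration (not taken from the question), assuming both sides have already been converted and keyed, e.g. as hypothetical PCollection<KV<String, Foo>> foosByKey and PCollection<KV<String, Bar>> barsByKey, a CoGroupByKey would look something like this:

final TupleTag<Foo> fooTag = new TupleTag<>();
final TupleTag<Bar> barTag = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(fooTag, foosByKey)
        .and(barTag, barsByKey)
        .apply(CoGroupByKey.create());

joined.apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String key = c.element().getKey();
    // All Foo and Bar values that share this key.
    Iterable<Foo> foos = c.element().getValue().getAll(fooTag);
    Iterable<Bar> bars = c.element().getValue().getAll(barTag);
    // Combine the two sides however the use case requires and output the result.
  }
}));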
To convert from a PCollection<String> into a PCollection<TableRow> you can use the following code:
static class StringConverter extends DoFn<String, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(new TableRow().set("string_field", c.element()));
  }
}
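Applying that DoFn is then just a ParDo. A minimal sketch, assuming a hypothetical input PCollection<String> named strings:

PCollection<String> strings = ...; // e.g. read with TextIO
PCollection<TableRow> rows = strings.apply(ParDo.of(new StringConverter()));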
Here you can read more on how to transform from a TableRow to a String.
Related
I'm using ML.NET to do multiclass classification. I have 3 use cases with different input models (different numbers of columns and data types), and there will be more to come, so it doesn't make sense to create a physical file for each input model for every new use case. I'd like to have, preferably, just ONE physical file that can adapt to any model if possible, and if not, to dynamically create the input model at runtime based on the column definitions defined in a JSON string retrieved from a table in a SQL Server DB. Is this even possible? If so, can you share sample code?
Here are some snippets of the prediction code that I'd like to make generic:
public class DynamicInputModel
{
    [ColumnName("ColumnA"), LoadColumn(0)]
    public string ColumnA { get; set; }

    [ColumnName("ColumnB"), LoadColumn(1)]
    public string ColumnB { get; set; }
}

PredictionEngine<DynamicInputModel, MulticlassClassificationPrediction> predEngine = _predEnginePool.GetPredictionEngine(modelName: modelName);

IDataView dataView = _mlContext.Data.LoadFromTextFile<DynamicInputModel>(
    path: testDataPath,
    hasHeader: true,
    separatorChar: ',',
    allowQuoting: true,
    allowSparse: false);

var testDataList = _mlContext.Data.CreateEnumerable<DynamicInputModel>(dataView, false).ToList();
I don't think you can do dynamic input; however, you can create pipelines from one input schema and build multiple different models based on the labels/features. The example below does that: two label columns, and you can pass in an array of which feature columns to use for the model. The one downside of this approach is that the input schema (CSV/database) has to be static (it cannot change on load):
https://github.com/bartczernicki/MLDotNet-BaseballClassification
I need to read a CSV file that represents a table into Dataflow, perform a GroupBy transformation to get the number of elements in a specific column, and then write that number to a BigQuery table along with the original file.
So far I've got the first step (reading the file from my storage bucket) and I've called a transformation, but I don't know how to get the count for a single column, since the CSV has 16.
public class StarterPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://bucket/data.csv"));
    PCollection<String> grouped_lines = lines.apply(GroupByKey());
    PCollection<java.lang.Long> count = grouped_lines.apply(Count.globally());

    p.run();
  }
}
You are reading whole lines from your CSV into a PCollection of strings. That's most likely not enough for you.
What you want to do is:
Split each line into multiple strings, one per column
Filter the PCollection down to the values that contain something in the required column [1]
Apply Count [2] (see the sketch after the links below)
[1] https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/transforms/Filter.html
[2] https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/Count.html
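A minimal sketch of those three steps, assuming the column of interest sits at a hypothetical index COLUMN_INDEX and that no field contains a quoted comma:

final int COLUMN_INDEX = 3; // hypothetical position of the column you care about

PCollection<String> columnValues = lines.apply(MapElements
    .into(TypeDescriptors.strings())
    .via(line -> line.split(",")[COLUMN_INDEX]));

PCollection<String> nonEmptyValues = columnValues.apply(Filter.by(value -> !value.isEmpty()));

PCollection<Long> count = nonEmptyValues.apply(Count.globally());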
It would be better if you converted that CSV into a more suitable form, e.g. into TableRow, and then performed a GroupByKey. That way you can identify the column corresponding to a particular value and find the count based on that.
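If what is needed is a separate count per distinct value in that column, a variation on the same idea (not this answer's exact code) is to key each line by a hypothetical column index and use Count.perKey instead of a manual GroupByKey:

PCollection<KV<String, String>> keyedLines = lines.apply(MapElements
    .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
    .via(line -> KV.of(line.split(",")[2], line))); // hypothetical: column 3 holds the grouping value

PCollection<KV<String, Long>> countsPerValue = keyedLines.apply(Count.perKey());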
I am new to Pig. I have the output below.
(001,Kumar,Jayasuriya,1123456754,Matara)
(001,Kumar,Sangakkara,112722892,Kandy)
(001,Rajiv,Reddy,9848022337,Hyderabad)
(002,siddarth,Battacharya,9848022338,Kolkata)
(003,Rajesh,Khanna,9848022339,Delhi)
(004,Preethi,Agarwal,9848022330,Pune)
(005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(006,Archana,Mishra,9848022335,Chennai)
(007,Kumar,Dharmasena,758922419,Colombo)
(008,Mahela,Jayawerdana,765557103,Colombo)
How can I create a map of the above so that the output will look something like,
001#{(Kumar,Jayasuriya,1123456754,Matara),(Kumar,Sangakkara,112722892,Kandy),(001,Rajiv,Reddy,9848022337,Hyderabad)}
002#{(siddarth,Battacharya,9848022338,Kolkata)}
I tried the ToMap function.
mapped_students = FOREACH students GENERATE TOMAP($0,$1..);
But I am unable to dump the output of the above command, as the process throws an error and stops there. Any help would be much appreciated.
I think what you are trying to achieve is to group records having the same id.
The TOMAP function converts key/value expression pairs into a map, so you won't be able to group the rest of your records with it; it will result in an error like "unable to open iterator for alias".
As per your desired output, here is the piece of code:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',') AS (id:chararray,first:chararray,last:chararray,phone:chararray,city:chararray);
If you do not want to give a schema, then:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',');
B = GROUP A BY $0;  -- this relation groups all your records based on the first column
DESCRIBE B;         -- this shows the schema of B
DUMP B;
Hope this helps..
I'm using Pig UDFs in Python to read data from an HBase table, then process and parse it, and finally insert it into another HBase table. But I'm facing some issues.
Pig's map is equivalent to Python's dictionary;
My Python script takes as input (rowkey, some_mapped_values) and
returns two strings, "key" and "content", with the output schema
@outputSchema('tuple:(rowkey:chararray, some_values:chararray)');
The core of my Python code takes a rowkey, parses it, transforms it into
another rowkey, and transforms the mapped data into another string, to
return the variables (key, content);
But when I try to insert those new values into the other HBase table, I face two problems:
Processing works fine, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty;
Second, how do I specify the new column family to insert into?
here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;
data = LOAD 'hbase://source' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true') AS
(rowkey:chararray, curves:map[]);
data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey,curves);
STORE data_1 INTO 'hbase://destination' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
and my Python script takes as input (rowkey, curves):
@outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey, curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key, content)
I have JSON documents in the following format:
{
  "Name": ...,
  "Class": ...,
  "City": ...,
  "Type": ...,
  "Age": ...,
  "Level": ...,
  "Mother": ...,
  "Father": ...
}
I have a map function like this
function(doc,meta)
{
emit([doc.Name,doc.Age,doc.Type,doc.Level],null);
}
What I can do is give "Name" and filter all results on that, but what I also want to do is give only "Age" and filter on that. Couchbase does not provide functionality to skip the "Name" key, so I would have to create a new map function that has "Age" as the first key, and likewise another one to query only on the "Level" key, and so on. I would have to create a map function for each field, which obviously is not feasible. Is there anything I can do, apart from making new map functions, to achieve this type of functionality?
I can't use N1QL because I have 150 million documents, so it would take a lot of time.
First of all, that is not a very good map function:
It does not have any filtering.
The function header should be function (doc, meta).
If you have a mixture of JSON and binary objects, add a meta.type == "json" check.
Now for the things you can do:
If you are using v4 and above (v4.1 is much more recommended), you can use N1QL, which works very much like SQL. (I didn't understand why you can't use N1QL.)
You can emit multiple items, in multiple orders.
i.e. if I have a doc in the format of
{
"name": "Roi",
"age": 31
}
I can emit two values to the index:
function (doc, meta) {
if (meta.type=="json") {
emit(doc.name, null);
emit(doc.age, null);
}
}
Now I can query by 2 values.
This is much better than creating 2 views.
Anyway, if you have something to filter by, it is always recommended to do so.