How to parse TSV data into nested objects - univocity

I'm trying to parse the following TSV data into a nested object, but my "title" field is always null within the @Nested class.
I've included the method at the bottom, which converts the TSV data to the object.
value1 | metaData1 | valueA |
value2 | metaData2 | valueB |
value3 | metaData3 | valueC |
public class Data {

    @Parsed(index = 0)
    private String value0;

    @Parsed(index = 1)
    private String foo;

    @Nested
    MetaData metaData;

    public static class MetaData {
        @Parsed
        private String title;
    }
}
public <T> List<T> convertFileToData(File file, Class<T> clazz, boolean removeHeader) {
    BeanListProcessor<T> rowProcessor = new BeanListProcessor<>(clazz);
    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setDelimiter('\t');
    settings.setProcessor(rowProcessor);
    settings.setHeaderExtractionEnabled(removeHeader);
    CsvParser parser = new CsvParser(settings);
    parser.parseAll(file);
    return rowProcessor.getBeans();
}

You forgot to define an index on your MetaData.title:

public static class MetaData {
    @Parsed(index = 1)
    private String title;
}
Also, you are setting the delimiter to \t while your input is using | as the separator.
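Putting both fixes together, the parser settings might look like this (a sketch; the '|' delimiter matches the sample input shown above):

CsvParserSettings settings = new CsvParserSettings();
settings.getFormat().setDelimiter('|'); // match the separator actually used in the input
settings.setProcessor(rowProcessor);
settings.setHeaderExtractionEnabled(removeHeader);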

Related

Using Univocity, how can I convert a date string value to a Date type in Java

I'd like to parse column zero in a CSV file to a particular datatype, in this example a Date object.
The method below is what I currently use to parse a CSV file, but I don't know how to incorporate this requirement.
import java.sql.Date;

public class Data {
    @Parsed(index = 0)
    private Date date;
}
public <T> List<T> convertFileToData(File file, Class<T> clazz) {
    BeanListProcessor<T> rowProcessor = new BeanListProcessor<>(clazz);
    CsvParserSettings settings = new CsvParserSettings();
    settings.setProcessor(rowProcessor);
    settings.setHeaderExtractionEnabled(true);
    CsvParser parser = new CsvParser(settings);
    parser.parseAll(file);
    return rowProcessor.getBeans();
}
All you need is to define the format(s) of your date and you are set:

@Format(formats = {"dd-MMM-yyyy", "yyyy-MM-dd"})
@Parsed(index = 0)
private Date date;
As an extra suggestion, you can also replace a lot of your code by using the CsvRoutines class. Try this:
List<T> beanList = new CsvRoutines(settings).parseAll(clazz, file);
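For example, the whole conversion method above could shrink to something like this (a sketch, keeping the same header-extraction setting):

public <T> List<T> convertFileToData(File file, Class<T> clazz) {
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true);
    // CsvRoutines creates the bean processor and runs the parser internally
    return new CsvRoutines(settings).parseAll(clazz, file);
}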
Hope it helps.

Text to String map reduce

I am trying to split a string using MapReduce 2 (YARN) in the Hortonworks Sandbox.
It throws an ArrayIndexOutOfBoundsException if I try to access val[1]; it works fine when I don't split the input.
Mapper:
public class MapperClass extends Mapper<Object, Text, Text, Text> {
    private Text airline_id;
    private Text name;
    private Text country;
    private Text value1;

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        if (s.length() > 1) {
            String val[] = s.split(",");
            context.write(new Text("blah"), new Text(val[1]));
        }
    }
}
Reducer:
public class ReducerClass extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String airports = "";
        if (key.equals("India")) {
            for (Text val : values) {
                airports += "\t" + val.toString();
            }
            result.set(airports);
            context.write(key, result);
        }
    }
}
MainClass:
public class MainClass {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "Flights MR");
        job.setJarByClass(MainClass.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Can you help?
Update:
Figured out that it doesn't convert Text to String.
If the string you are splitting does not contain a comma, the resulting String[] will have length 1, with the entire string at val[0].
Currently, you are only checking the string's length:

if (s.length() > 1)

But you are not checking that the split actually resulted in an array of length greater than 1; you are assuming a split took place:

context.write(new Text("blah"), new Text(val[1]));
If there was no split, this will cause an out-of-bounds error. A possible solution is to make sure the string contains at least one comma, instead of only checking its length, like so:

String s = value.toString();
if (s.indexOf(',') > -1) {
    String val[] = s.split(",");
    context.write(new Text("blah"), new Text(val[1]));
}
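Alternatively (a sketch, equivalent in effect), you can check the length of the array that split actually produced:

String[] val = value.toString().split(",");
if (val.length > 1) { // only emit when the split really yielded a second field
    context.write(new Text("blah"), new Text(val[1]));
}

On the update in the question: a Hadoop Text is not a String, so the reducer's key.equals("India") compares a Text against a String and is always false. Compare key.toString().equals("India") (or new Text("India").equals(key)) instead.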

String concatenation in mapper class of a MapReduce Program giving errors

In my mapper class I want to do a small manipulation to a string read from a file (as a line) and then send it over to the reducer to get a string count. The manipulation is replacing empty strings with "0". (The current replace-and-join part is failing my Hadoop job.)
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
.....
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static Text partString = new Text("");
    private final static IntWritable count = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // Read line by line with a BufferedReader and output the (line, count) pair
        BufferedReader bufReader = new BufferedReader(new StringReader(line));
        String l = null;
        while ((l = bufReader.readLine()) != null) {
            /**** This part is the problem ****/
            String a[] = l.split(",");
            if (a[1] == "") { // if a[1], i.e. the second string, is "" then set it to "0"
                a[1] = "0";
                l = StringUtils.join(",", a); // join the string array back into a string
            }
            /**** problematic part ends ****/
            partString.set(l);
            output.collect(partString, count);
        }
    }
}
After this is run, the mapper just fails and doesn't post any errors.
[The code is run with YARN.]
I am not sure what I am doing wrong; the same code worked without the string-join part.
Could any of you explain what is wrong with the string replace/concat? Is there a better way to do it?
Here's a modified version of your Mapper class with a few changes:
Remove the BufferedReader; it seems redundant and isn't being closed
String equality should be .equals() and not ==
Declare a String array using String[] and not String a[]
Resulting in the following code:
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private Text partString = new Text();
    private final static IntWritable count = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String[] a = line.split(",");
        if (a[1].equals("")) {
            a[1] = "0";
            line = StringUtils.join(",", a);
        }
        partString.set(line);
        output.collect(partString, count);
    }
}
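One caveat worth adding (an addition, not part of the original answer): String.split(",") drops trailing empty strings, so a line like "123," produces an array of length 1 and a[1] throws an ArrayIndexOutOfBoundsException. A negative limit preserves the trailing empty field:

// negative limit keeps trailing empty strings, so a[1] exists for input like "123,"
String[] a = line.split(",", -1);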

HBase Map/Reduce - How to access individual columns of the table?

I have a table called User with two columns: one called visitorId and the other called friend, which is a list of strings. I want to check whether the visitorId is in the friend list. Can anyone direct me as to how to access the table columns in a map function?
I'm not able to picture how data is output from a map function in HBase.
My code is as follows:
public class MapReduce {

    static class Mapper1 extends TableMapper<ImmutableBytesWritable, Text> {
        private int numRecords = 0;
        private static final IntWritable one = new IntWritable(1);
        private final IntWritable ONE = new IntWritable(1);
        private Text text = new Text();

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            // What should I do here??
            ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
            try {
                context.write(userKey, ONE);
                // context.write(text, ONE);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "CheckVisitor");
        job.setJarByClass(MapReduce.class);
        Scan scan = new Scan();
        Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("mId2"));
        scan.setFilter(f);
        scan.addFamily(Bytes.toBytes("visitor"));
        scan.addFamily(Bytes.toBytes("friend"));
        TableMapReduceUtil.initTableMapperJob("User", scan, Mapper1.class, ImmutableBytesWritable.class, Text.class, job);
    }
}
The Result values instance contains the full row from the scanner.
To get the appropriate columns from the Result, I would do something like:

VisitorIdVal = value.getColumnLatest(Bytes.toBytes(columnFamily1), Bytes.toBytes("VisitorId"))
friendlistVal = value.getColumnLatest(Bytes.toBytes(columnFamily2), Bytes.toBytes("friendlist"))

Here VisitorIdVal and friendlistVal are of type KeyValue (http://archive.cloudera.com/cdh/3/hbase/apidocs/org/apache/hadoop/hbase/KeyValue.html); to get their values out you can do Bytes.toString(VisitorIdVal.getValue()).
Once you have extracted the values from the columns, you can check for "VisitorId" in "friendlist".
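Putting that together inside map() would look something like this (a sketch: the family names "visitor" and "friend" come from the Scan in the question, while the qualifier names "VisitorId" and "friendlist" are assumptions):

// fetch the latest cell of each column from the scanned row
KeyValue visitorIdVal = values.getColumnLatest(Bytes.toBytes("visitor"), Bytes.toBytes("VisitorId"));
KeyValue friendlistVal = values.getColumnLatest(Bytes.toBytes("friend"), Bytes.toBytes("friendlist"));
if (visitorIdVal != null && friendlistVal != null) {
    String visitorId = Bytes.toString(visitorIdVal.getValue());
    String friendlist = Bytes.toString(friendlistVal.getValue());
    if (friendlist.contains(visitorId)) {
        // the visitorId is in the friend list; emit whatever your job needs here
    }
}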

How do I export NHibernate schema for SQLite if I have dots in the schema name?

I'm trying to set up SQLite for unit testing with Fluent NHibernate as shown here, but the table names aren't being generated as I expected.
Some tables have schemas with dots inside, which seems to break the generation. (Dots work perfectly well with Microsoft SQL Server, which I have in my production environment.)
Example:
[Foo.Bar.Schema].[TableName]
Result:
TestFixture failed: System.Data.SQLite.SQLiteException : SQLite error
unknown database Foo
How do I instruct SQLite to translate the dots to underscores or something, so I can run my unit tests?
(I've tried adding brackets to the schema names with no success.)
You can use a convention
http://wiki.fluentnhibernate.org/Conventions
*UPDATED
public static class PrivatePropertyHelper
{
    // from http://stackoverflow.com/questions/1565734/is-it-possible-to-set-private-property-via-reflection
    public static T GetPrivatePropertyValue<T>(this object obj, string propName)
    {
        if (obj == null) throw new ArgumentNullException("obj");
        PropertyInfo pi = obj.GetType().GetProperty(propName, BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance);
        if (pi == null) throw new ArgumentOutOfRangeException("propName", string.Format("Property {0} was not found in Type {1}", propName, obj.GetType().FullName));
        return (T)pi.GetValue(obj, null);
    }
}
public class CustomTableNameConvention : IClassConvention
{
    // Use this to set the schema to a specific value
    public void Apply(FluentNHibernate.Conventions.Instances.IClassInstance instance)
    {
        instance.Schema("My_NEw_Schema");
        instance.Table(instance.EntityType.Name.CamelToUnderscoreLower());
    }

    // Use this to alter the existing schema value.
    // Note that Schema is a private property, so you need reflection to get it.
    public void Apply(FluentNHibernate.Conventions.Instances.IClassInstance instance)
    {
        instance.Schema(instance.GetPrivatePropertyValue<string>("Schema").Replace(".", "_"));
        instance.Table(instance.EntityType.Name.CamelToUnderscoreLower());
    }
}
You must use only one of the Apply methods.
*UPDATE 2
I don't know if I would recommend this, but if you like to experiment, this seems to work. Even more reflection :)
public static void SetSchemaValue(this object obj, string schema)
{
    var mapping_ref = obj.GetType().GetField("mapping", BindingFlags.Instance | BindingFlags.IgnoreCase | BindingFlags.GetField | BindingFlags.NonPublic).GetValue(obj);
    var mapping = mapping_ref as ClassMapping;
    if (mapping != null)
    {
        mapping.Schema = schema;
    }
}

public void Apply(FluentNHibernate.Conventions.Instances.IClassInstance instance)
{
    var schema = instance.GetPrivatePropertyValue<string>("Schema");
    if (schema == null)
    {
        instance.Schema("My_New_Schema");
    }
    else
    {
        instance.SetSchemaValue("My_New_Schema");
    }
}