Cassandra Map Reduce for TimeUUID columns - mapreduce

I recently set up a 4-node Cassandra cluster for learning, with one column family that holds time-series data as:
Key -> {column name: timeUUID, column value: csv log line, ttl: 1 year}. I used the Netflix Astyanax Java client to load about 1 million log lines.
I also configured Hadoop, with 1 namenode and 4 datanodes, to run map-reduce analytics jobs on the Cassandra data.
All the examples available on the internet use column names in the SlicePredicate of the Hadoop job configuration, whereas my columns are timeUUIDs. How can I efficiently feed the Cassandra data to the Hadoop job in batches of 1000 columns at a time?
Some rows in this test data have more than 10000 columns, and I expect more in the real data.
I configure my job as:
public int run(String[] arg0) throws Exception {
Job job = new Job(getConf(), JOB_NAME);
job.setJarByClass(LogTypeCounterByDate.class);
job.setMapperClass(LogTypeCounterByDateMapper.class);
job.setReducerClass(LogTypeCounterByDateReducer.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
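// The Job works on its own copy of the Configuration, so set options on job.getConfiguration()
ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);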
SliceRange sliceRange = new SliceRange(ByteBuffer.wrap(new byte[0]),
ByteBuffer.wrap(new byte[0]), true, 1000);
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.setSlice_range(sliceRange);
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setInputRpcPort(job.getConfiguration(), INPUT_RPC_PORT);
ConfigHelper.setInputInitialAddress(job.getConfiguration(), INPUT_INITIAL_ADRESS);
ConfigHelper.setInputPartitioner(job.getConfiguration(), INPUT_PARTITIONER);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), slicePredicate);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.waitForCompletion(true);
return job.isSuccessful() ? 0 : 1;
}
But I don't understand how to define the Mapper. Could you kindly provide a template for the Mapper class?
public static class LogTypeCounterByDateMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
{
private Text key = null;
private LongWritable value = null;
@Override
protected void setup(Context context){
}
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context){
//String[] lines = columns.;
}
}

ConfigHelper.setRangeBatchSize(getConf(), 1000);
...
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(TimeUUID.asByteBuffer(startValue), TimeUUID.asByteBuffer(endValue), false, 1000));
ConfigHelper.setInputSlicePredicate(conf, predicate);
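The TimeUUID helper used in the snippet above is not shown, so the following is only a sketch of what such a helper might do, using nothing but the JDK. It builds a version 1 UUID for a given Unix timestamp and serialises it to the 16-byte form a SliceRange bound expects. The class and method names (TimeUuidBounds, forTimestamp, toByteBuffer) are hypothetical, and the low 64 bits are an arbitrary choice (variant bits only), not necessarily what a Cassandra client library would generate.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class TimeUuidBounds {
    // Number of 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000L;

    private TimeUuidBounds() {}

    /** Builds a version 1 UUID whose timestamp field encodes the given Unix time in milliseconds. */
    static UUID forTimestamp(long epochMillis) {
        long ts = epochMillis * 10_000L + UUID_EPOCH_OFFSET_100NS;
        long msb = (ts & 0xFFFFFFFFL) << 32        // time_low
                | ((ts >>> 32) & 0xFFFFL) << 16    // time_mid
                | 0x1000L                          // version 1
                | ((ts >>> 48) & 0x0FFFL);         // time_hi
        // Assumption: only the variant bits are set; clock sequence and node are left at zero.
        long lsb = 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    /** Serialises a UUID as 16 big-endian bytes, ready to be read from position 0. */
    static ByteBuffer toByteBuffer(UUID uuid) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putLong(uuid.getMostSignificantBits());
        buffer.putLong(uuid.getLeastSignificantBits());
        buffer.flip();
        return buffer;
    }

    public static void main(String[] args) {
        UUID start = forTimestamp(System.currentTimeMillis() - 3_600_000L);
        System.out.println(start + " -> " + toByteBuffer(start).remaining() + " bytes");
    }
}
```

With a helper like this, the start and end bounds of the slice could be derived from two timestamps and passed to SliceRange in place of the empty byte arrays.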

Related

File to DB load using Apache beam

I need to load a file into my database, but first I have to check whether the data is already present in the database, based on values from the file. For instance, if the file has 5 records, I have to query the database 5 times, once per record.
So how can I pass this value dynamically? I need a dynamic value instead of the hard-coded 2 in the line preparedStatement.setString(1, "2");
Here we are creating a Dataflow pipeline that loads data into the database using Apache Beam. We create a pipeline object, and use a PCollection to store the data into the database.
Pipeline p = Pipeline.create(options);
p.apply("Reading Text", TextIO.read().from(options.getInputFile()))
.apply(ParDo.of(new FilterHeaderFn(csvHeader)))
.apply(ParDo.of(new GetRatePlanID()))
.apply("Format Result", MapElements.into(
TypeDescriptors.strings()).via(
(KV<String, Integer> ABC) ->
ABC.getKey() + "," + ABC.getValue()))
.apply("Write File", TextIO.write()
.to(options.getOutputFile())
.withoutSharding());
// Retrieving data from database
PCollection<String> data =
p.apply(JdbcIO.<String>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.cj.jdbc.Driver", "jdbc:mysql://localhost:3306/XYZ")
.withUsername("root")
.withPassword("root1234"))
.withQuery("select * from xyz where z = ?")
.withCoder(StringUtf8Coder.of())
.withStatementPreparator(new JdbcIO.StatementPreparator() {
private static final long serialVersionUID = 1L;
@Override
public void setParameters(PreparedStatement preparedStatement) throws Exception {
preparedStatement.setString(1, "2");
}
})
.withRowMapper(new JdbcIO.RowMapper<String>() {
private static final long serialVersionUID = 1L;
public String mapRow(ResultSet resultSet) throws Exception {
return "Symbol: " + resultSet.getInt(1) + "\nPrice: " + resultSet.getString(2) +
"\nCompany: " + resultSet.getInt(3);
}
}));
As suggested, the most efficient approach would probably be to load the whole file into a temporary table and then run a query to update the requisite rows.
If that can't be done, you could instead read the table into Dataflow (i.e. "select * from xyz") and then do a join/CoGroupByKey to match records with those found in your file. If you expect the existing database to be very large compared to the files you're hoping to upload into it, you could instead have a DoFn that queries your database directly using JDBC (possibly caching the connection in the DoFn's @Setup method) rather than using JdbcIO.
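To make the matching step concrete, here is a minimal plain-Java sketch (not Beam code) of the logic a join would perform: given the keys already present in the table, it splits the file's lines into those that already exist and those that are new. The class name ExistingKeyFilter and the assumption that the key is the first comma-separated field are illustrative only; in a pipeline the same matching would be expressed with CoGroupByKey or a per-element lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ExistingKeyFilter {
    private ExistingKeyFilter() {}

    /**
     * Splits CSV lines by whether their key (assumed to be the first field)
     * is contained in existingKeys. The result maps true to lines already
     * present in the database and false to lines that are new.
     */
    static Map<Boolean, List<String>> splitByExistence(List<String> fileLines, Set<String> existingKeys) {
        Map<Boolean, List<String>> result = new HashMap<>();
        result.put(true, new ArrayList<>());
        result.put(false, new ArrayList<>());
        for (String line : fileLines) {
            String key = line.split(",", 2)[0].trim();
            result.get(existingKeys.contains(key)).add(line);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> keysInTable = new HashSet<>(Arrays.asList("2"));
        List<String> lines = Arrays.asList("1,a", "2,b", "3,c");
        System.out.println(splitByExistence(lines, keysInTable));
    }
}
```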

Write more than 25 items using BatchWriteItemEnhancedRequest Dynamodb JAVA SDK 2

I have a List of items to be inserted into a DynamoDB table. The size of the list may vary from 100 to 10k. I am looking for an optimised way to batch write all the items using BatchWriteItemEnhancedRequest (Java SDK 2). What is the best way to add the items to the WriteBatch builder and then write the request using BatchWriteItemEnhancedRequest?
My Current Code:
WriteBatch.Builder<T> builder = BatchWriteItemEnhancedRequest.builder().writeBatches(builder.build()).build();
items.forEach(item -> { builder.addPutItem(item); });
BatchWriteItemEnhancedRequest bwr = BatchWriteItemEnhancedRequest.builder().writeBatches(builder.build()).build();
BatchWriteResult batchWriteResult =
DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
do {
// Check for unprocessed keys which could happen if you exceed
// provisioned throughput
List<T> unprocessedItems = batchWriteResult.unprocessedPutItemsForTable(getTable());
if (unprocessedItems.size() != 0) {
unprocessedItems.forEach(unprocessedItem -> {
builder.addPutItem(unprocessedItem);
});
batchWriteResult = DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
}
} while (batchWriteResult.unprocessedPutItemsForTable(getTable()).size() > 0);
I am looking for batching logic and a better way to execute the BatchWriteItemEnhancedRequest.
I came up with a utility class to deal with that. Their batches of batches approach in v2 is overly complex for most use cases, especially when we're still limited to 25 items overall.
public class DynamoDbUtil {
private static final int MAX_DYNAMODB_BATCH_SIZE = 25; // AWS blows chunks if you try to include more than 25 items in a batch or sub-batch
/**
* Writes the list of items to the specified DynamoDB table.
*/
public static <T> void batchWrite(Class<T> itemType, List<T> items, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
List<List<T>> chunksOfItems = Lists.partition(items, MAX_DYNAMODB_BATCH_SIZE); // Guava's Lists.partition returns a List of sublists
chunksOfItems.forEach(chunkOfItems -> {
List<T> unprocessedItems = batchWriteImpl(itemType, chunkOfItems, client, table);
while (!unprocessedItems.isEmpty()) {
// some failed (provisioning problems, etc.), so write those again
unprocessedItems = batchWriteImpl(itemType, unprocessedItems, client, table);
}
});
}
/**
* Writes a single batch of (at most) 25 items to DynamoDB.
* Note that the overall limit of items in a batch is 25, so you can't have nested batches
* of 25 each that would exceed that overall limit.
*
* @return those items that couldn't be written due to provisioning issues, etc., but were otherwise valid
*/
private static <T> List<T> batchWriteImpl(Class<T> itemType, List<T> chunkOfItems, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
WriteBatch.Builder<T> subBatchBuilder = WriteBatch.builder(itemType).mappedTableResource(table);
chunkOfItems.forEach(subBatchBuilder::addPutItem);
BatchWriteItemEnhancedRequest.Builder overallBatchBuilder = BatchWriteItemEnhancedRequest.builder();
overallBatchBuilder.addWriteBatch(subBatchBuilder.build());
return client.batchWriteItem(overallBatchBuilder.build()).unprocessedPutItemsForTable(table);
}
}
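The chunk-and-retry pattern above can also be sketched without Guava or the AWS SDK, which makes the control flow easier to test in isolation. In the sketch below the actual write is passed in as a function that takes a chunk and returns the unprocessed items; with the enhanced client that function would wrap a call like batchWriteImpl above. The names ChunkedWriter and writeInChunks, the retry cap, and the doubling backoff starting at 10 ms are assumptions added for illustration, not part of the SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ChunkedWriter {
    private ChunkedWriter() {}

    /**
     * Writes items in chunks of at most chunkSize, retrying unprocessed items
     * with a doubling backoff. Returns the number of write calls made.
     */
    static <T> int writeInChunks(List<T> items, int chunkSize, int maxAttempts,
                                 Function<List<T>, List<T>> writeBatch) {
        int calls = 0;
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            List<T> pending = new ArrayList<>(items.subList(start, end));
            long backoffMillis = 10;
            int attempt = 0;
            while (!pending.isEmpty()) {
                if (attempt == maxAttempts) {
                    throw new IllegalStateException(
                            pending.size() + " items still unprocessed after " + maxAttempts + " attempts");
                }
                if (attempt > 0) {
                    // Back off before retrying so a throttled table can recover.
                    sleep(backoffMillis);
                    backoffMillis *= 2;
                }
                pending = writeBatch.apply(pending);
                attempt++;
                calls++;
            }
        }
        return calls;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

A caller would pass 25 as chunkSize and a lambda that builds and sends one batch, for example chunk -> batchWriteImpl(itemType, chunk, client, table) if that method were accessible.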

how to pass dynamic parameters in google cloud dataflow pipeline

I have written code to ingest a CSV file from GCS into BigQuery with a hardcoded project ID, dataset, table name, and GCS temp & staging locations.
I am looking for code that reads the following parameters from a BigQuery table (dynamic parameters):
ProjectID
Dataset
Table name
GCS Temp & Staging location
Code:
public class DemoPipeline {
public static TableReference getGCDSTableReference() {
TableReference ref = new TableReference();
ref.setProjectId("myprojectbq");
ref.setDatasetId("DS_Emp");
ref.setTableId("emp");
return ref;
}
static class TransformToTable extends DoFn<String, TableRow> {
@ProcessElement
public void processElement(ProcessContext c) {
String input = c.element();
String[] s = input.split(",");
TableRow row = new TableRow();
row.set("id", s[0]);
row.set("name", s[1]);
c.output(row);
}
}
public interface MyOptions extends PipelineOptions {
/*
* Param
*
*/
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
options.setTempLocation("gs://demo-xxxxxx/temp");
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-xxxxxx/student.csv"));
PCollection<TableRow> rows = lines.apply("Transform To Table",ParDo.of(new TransformToTable()));
rows.apply("Write To Table",BigQueryIO.writeTableRows().to(getGCDSTableReference())
//.withSchema(BQTableSemantics.getGCDSTableSchema())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
p.run();
}
}
Even to read these parameters from an initial table (project ID / dataset / table names), you need to hardcode that table's location somewhere. A properties file, as Haris recommended, is a good approach. Consider the following suggestions:
Java properties file. Use it when parameters have to be changed or tuned, in general for changes that should not require recompilation. The file has to live with, or be attached to, your Java classes. Reading it from GCS is feasible but an unusual option.
Pipeline execution parameters. Custom parameters can be a workaround for your question; see Creating Custom Options to understand how this can be accomplished.
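For the properties-file suggestion, a minimal sketch using only java.util.Properties might look like the following. The key names (bq.project, bq.dataset, bq.table, gcs.tempLocation) and the class name PipelineSettings are made up for illustration; the loaded values would then replace the hardcoded strings in getGCDSTableReference() and options.setTempLocation(...).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PipelineSettings {
    final String projectId;
    final String dataset;
    final String table;
    final String tempLocation;

    private PipelineSettings(String projectId, String dataset, String table, String tempLocation) {
        this.projectId = projectId;
        this.dataset = dataset;
        this.table = table;
        this.tempLocation = tempLocation;
    }

    /** Loads the settings from properties text; the key names are hypothetical. */
    static PipelineSettings load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new PipelineSettings(
                required(props, "bq.project"),
                required(props, "bq.dataset"),
                required(props, "bq.table"),
                required(props, "gcs.tempLocation"));
    }

    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        String text = "bq.project=myprojectbq\nbq.dataset=DS_Emp\nbq.table=emp\n"
                + "gcs.tempLocation=gs://demo-xxxxxx/temp\n";
        PipelineSettings settings = load(new StringReader(text));
        System.out.println(settings.projectId + ":" + settings.dataset + "." + settings.table);
    }
}
```

In a real job the Reader would typically come from a file on the classpath or a path given on the command line, which keeps the values out of the compiled code.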

How to do multiple parallel readers for data export using Google Spanner?

External Backups/Snapshots for Google Cloud Spanner recommends using queries with timestamp bounds to create snapshots for export. At the bottom of the Timestamp Bounds documentation it states:
Cloud Spanner continuously garbage collects deleted and overwritten data in the background to reclaim storage space. This process is known as version GC. By default, version GC reclaims versions after they are one hour old. Because of this, Cloud Spanner cannot perform reads at a read timestamp more than one hour in the past.
So any export would need to complete within an hour. A single reader (i.e. select * from table; using timestamp X) would not be able to export the entire table within an hour.
How can multiple parallel readers be implemented in spanner?
Note: It is mentioned in one of the comments that support for Apache Beam is coming, but it looks like that uses a single reader:
/** A simplest read function implementation. Parallelism support is coming. */
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/NaiveSpannerReadFn.java#L26
Is there a way to build the parallel reader that Beam requires today using existing APIs? Or will Beam need to use something that isn't released yet on Google Spanner?
It is possible to read data in parallel from Cloud Spanner with the BatchClient class. Follow read_data_in_parallel for more information.
If you are looking to export data from Cloud Spanner, I'd recommend using Cloud Dataflow (see the integration details here), as it provides higher-level abstractions and takes care of data processing details such as scaling and failure handling.
Edit 2018-03-30 - The example project has been updated to use the BatchClient offered by Google Cloud Spanner
After the release of the BatchClient for reading/downloading large amounts of data, the example project below has been updated to use the new batch client instead of the standard database client. The basic idea behind the project is still the same: Copy data to/from Cloud Spanner and any other database using standard jdbc functionality. The following code snippet sets the jdbc connection in batch read mode:
if (source.isWrapperFor(ICloudSpannerConnection.class))
{
ICloudSpannerConnection con = source.unwrap(ICloudSpannerConnection.class);
// Make sure no transaction is running
if (!con.isBatchReadOnly())
{
if (con.getAutoCommit())
{
con.setAutoCommit(false);
}
else
{
con.commit();
}
con.setBatchReadOnly(true);
}
}
When the connection is in 'batch read only mode', the connection will use the BatchClient of Google Cloud Spanner instead of the standard database client. When one of the Statement#execute(String) or PreparedStatement#execute() methods is called (these allow multiple result sets to be returned), the JDBC driver creates a partitioned query instead of a normal query. The result of this partitioned query is a number of result sets (one per partition) that can be fetched with the Statement#getResultSet() and Statement#getMoreResults(int) methods.
Statement statement = source.createStatement();
boolean hasResults = statement.execute(select);
int workerNumber = 0;
while (hasResults)
{
ResultSet rs = statement.getResultSet();
PartitionWorker worker = new PartitionWorker("PartionWorker-" + workerNumber, config, rs, tableSpec, table, insertCols);
workers.add(worker);
hasResults = statement.getMoreResults(Statement.KEEP_CURRENT_RESULT);
workerNumber++;
}
The result sets that are returned by the Statement#execute(String) are not executed directly, but only after the first call to ResultSet#next(). Passing these result sets to separate worker threads ensures parallel download and copying of the data.
Original answer:
This project was initially created for conversion in the other direction (from a local database to Cloud Spanner), but as it uses JDBC for both source and destination it can also be used the other way around: Converting a Cloud Spanner database to a local PostgreSQL database. Large tables are converted in parallel using a thread pool.
The project uses this open source JDBC driver instead of the JDBC driver supplied by Google. The source Cloud Spanner JDBC connection is set to read-only mode and autocommit=false. This ensures that the connection automatically creates a read-only transaction using the current time as timestamp the first time you execute a query. All subsequent queries within the same (read-only) transaction will use the same timestamp giving you a consistent snapshot of your Google Cloud Spanner database.
It works as follows:
1. Set the source database to read-only transactional mode.
2. The convert(String catalog, String schema) method iterates over all tables in the source database (Cloud Spanner).
3. For each table the number of records is determined, and depending on the size of the table, the table is copied using either the main thread of the application or a worker pool.
4. The class UploadWorker is responsible for the parallel copying. Each worker is assigned a range of records from the table (for example rows 1 to 2,400). The range is selected by a select statement in this format: 'SELECT * FROM $TABLE ORDER BY $PK_COLUMNS LIMIT $BATCH_SIZE OFFSET $CURRENT_OFFSET'
5. Commit the read-only transaction on the source database after ALL tables have been converted.
Below is a code snippet of the most important parts.
public void convert(String catalog, String schema) throws SQLException
{
int batchSize = config.getBatchSize();
destination.setAutoCommit(false);
// Set the source connection to transaction mode (no autocommit) and read-only
source.setAutoCommit(false);
source.setReadOnly(true);
try (ResultSet tables = destination.getMetaData().getTables(catalog, schema, null, new String[] { "TABLE" }))
{
while (tables.next())
{
String tableSchema = tables.getString("TABLE_SCHEM");
if (!config.getDestinationDatabaseType().isSystemSchema(tableSchema))
{
String table = tables.getString("TABLE_NAME");
// Check whether the destination table is empty.
int destinationRecordCount = getDestinationRecordCount(table);
if (destinationRecordCount == 0 || config.getDataConvertMode() == ConvertMode.DropAndRecreate)
{
if (destinationRecordCount > 0)
{
deleteAll(table);
}
int sourceRecordCount = getSourceRecordCount(getTableSpec(catalog, tableSchema, table));
if (sourceRecordCount > batchSize)
{
convertTableWithWorkers(catalog, tableSchema, table);
}
else
{
convertTable(catalog, tableSchema, table);
}
}
else
{
if (config.getDataConvertMode() == ConvertMode.ThrowExceptionIfExists)
throw new IllegalStateException("Table " + table + " is not empty");
else if (config.getDataConvertMode() == ConvertMode.SkipExisting)
log.info("Skipping data copy for table " + table);
}
}
}
}
source.commit();
}
private void convertTableWithWorkers(String catalog, String schema, String table) throws SQLException
{
String tableSpec = getTableSpec(catalog, schema, table);
Columns insertCols = getColumns(catalog, schema, table, false);
Columns selectCols = getColumns(catalog, schema, table, true);
if (insertCols.primaryKeyCols.isEmpty())
{
log.warning("Table " + tableSpec + " does not have a primary key. No data will be copied.");
return;
}
log.info("About to copy data from table " + tableSpec);
int batchSize = config.getBatchSize();
int totalRecordCount = getSourceRecordCount(tableSpec);
int numberOfWorkers = calculateNumberOfWorkers(totalRecordCount);
int numberOfRecordsPerWorker = totalRecordCount / numberOfWorkers;
if (totalRecordCount % numberOfWorkers > 0)
numberOfRecordsPerWorker++;
int currentOffset = 0;
ExecutorService service = Executors.newFixedThreadPool(numberOfWorkers);
for (int workerNumber = 0; workerNumber < numberOfWorkers; workerNumber++)
{
int workerRecordCount = Math.min(numberOfRecordsPerWorker, totalRecordCount - currentOffset);
UploadWorker worker = new UploadWorker("UploadWorker-" + workerNumber, selectFormat, tableSpec, table,
insertCols, selectCols, currentOffset, workerRecordCount, batchSize, source,
config.getUrlDestination(), config.isUseJdbcBatching());
service.submit(worker);
currentOffset = currentOffset + numberOfRecordsPerWorker;
}
service.shutdown();
try
{
service.awaitTermination(config.getUploadWorkerMaxWaitInMinutes(), TimeUnit.MINUTES);
}
catch (InterruptedException e)
{
log.severe("Error while waiting for workers to finish: " + e.getMessage());
throw new RuntimeException(e);
}
}
public class UploadWorker implements Runnable
{
private static final Logger log = Logger.getLogger(UploadWorker.class.getName());
private final String name;
private String selectFormat;
private String sourceTable;
private String destinationTable;
private Columns insertCols;
private Columns selectCols;
private int beginOffset;
private int numberOfRecordsToCopy;
private int batchSize;
private Connection source;
private String urlDestination;
private boolean useJdbcBatching;
UploadWorker(String name, String selectFormat, String sourceTable, String destinationTable, Columns insertCols,
Columns selectCols, int beginOffset, int numberOfRecordsToCopy, int batchSize, Connection source,
String urlDestination, boolean useJdbcBatching)
{
this.name = name;
this.selectFormat = selectFormat;
this.sourceTable = sourceTable;
this.destinationTable = destinationTable;
this.insertCols = insertCols;
this.selectCols = selectCols;
this.beginOffset = beginOffset;
this.numberOfRecordsToCopy = numberOfRecordsToCopy;
this.batchSize = batchSize;
this.source = source;
this.urlDestination = urlDestination;
this.useJdbcBatching = useJdbcBatching;
}
@Override
public void run()
{
// Connection source = DriverManager.getConnection(urlSource);
try (Connection destination = DriverManager.getConnection(urlDestination))
{
log.info(name + ": " + sourceTable + ": Starting copying " + numberOfRecordsToCopy + " records");
destination.setAutoCommit(false);
String sql = "INSERT INTO " + destinationTable + " (" + insertCols.getColumnNames() + ") VALUES \n";
sql = sql + "(" + insertCols.getColumnParameters() + ")";
PreparedStatement statement = destination.prepareStatement(sql);
int lastRecord = beginOffset + numberOfRecordsToCopy;
int recordCount = 0;
int currentOffset = beginOffset;
while (true)
{
int limit = Math.min(batchSize, lastRecord - currentOffset);
String select = selectFormat.replace("$COLUMNS", selectCols.getColumnNames());
select = select.replace("$TABLE", sourceTable);
select = select.replace("$PRIMARY_KEY", selectCols.getPrimaryKeyColumns());
select = select.replace("$BATCH_SIZE", String.valueOf(limit));
select = select.replace("$OFFSET", String.valueOf(currentOffset));
try (ResultSet rs = source.createStatement().executeQuery(select))
{
while (rs.next())
{
int index = 1;
for (Integer type : insertCols.columnTypes)
{
Object object = rs.getObject(index);
statement.setObject(index, object, type);
index++;
}
if (useJdbcBatching)
statement.addBatch();
else
statement.executeUpdate();
recordCount++;
}
if (useJdbcBatching)
statement.executeBatch();
}
destination.commit();
log.info(name + ": " + sourceTable + ": Records copied so far: " + recordCount + " of "
+ numberOfRecordsToCopy);
currentOffset = currentOffset + batchSize;
if (recordCount >= numberOfRecordsToCopy)
break;
}
}
catch (SQLException e)
{
log.severe("Error during data copy: " + e.getMessage());
throw new RuntimeException(e);
}
log.info(name + ": Finished copying");
}
}
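The way record ranges are assigned to workers can be looked at in isolation. The sketch below is not the project's code: it is a small stand-alone version of the same idea, in which each worker gets an offset and a record count, and the matching LIMIT/OFFSET query is rendered from them. The class and method names are hypothetical. Unlike the loop in convertTableWithWorkers, it stops as soon as all records are covered, so a worker never receives an empty or negative range when there are more workers than needed.

```java
import java.util.ArrayList;
import java.util.List;

final class WorkerRanges {
    private WorkerRanges() {}

    /**
     * Splits totalRecords into at most `workers` contiguous ranges.
     * Each element is {offset, count}; the per-worker size is rounded up.
     */
    static List<long[]> split(long totalRecords, int workers) {
        List<long[]> ranges = new ArrayList<>();
        if (totalRecords <= 0 || workers <= 0) {
            return ranges;
        }
        long perWorker = (totalRecords + workers - 1) / workers;
        for (long offset = 0; offset < totalRecords; offset += perWorker) {
            long count = Math.min(perWorker, totalRecords - offset);
            ranges.add(new long[] {offset, count});
        }
        return ranges;
    }

    /** Renders the select statement for one range, following the format quoted in the steps above. */
    static String selectFor(String table, String pkColumns, long[] range) {
        return "SELECT * FROM " + table + " ORDER BY " + pkColumns
                + " LIMIT " + range[1] + " OFFSET " + range[0];
    }

    public static void main(String[] args) {
        for (long[] range : split(10, 4)) {
            System.out.println(selectFor("Singers", "SingerId", range));
        }
    }
}
```

The table and column names in main are placeholders. The ORDER BY on the primary key matters here: without a stable order, LIMIT/OFFSET windows from different workers are not guaranteed to be disjoint.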

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job that accesses data from both an HBase table and HDFS files. E.g., the mapper reads data from the HBase table and from the HDFS files; the data share the same primary key but have different schemas. A reducer then joins all columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have any ideas. Sample code would be great.
After a few days of investigation (and help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {
public static class Map extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String s = value.toString();
String[] sa = s.split(",");
if (sa.length == 2) {
context.write(new Text(sa[0]), new Text(sa[1]));
}
}
}
public static class TableMap extends TableMapper<Text, Text> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "c1".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String key = Bytes.toString(row.get());
String val = new String(value.getValue(CF, ATTR1));
context.write(new Text(key), new Text(val));
}
}
public static class Reduce extends Reducer <Object, Text, Object, Text> {
public void reduce(Object key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String ks = key.toString();
for (Text val : values){
context.write(new Text(ks), val);
}
}
}
public static void main(String[] args) throws Exception {
Path inputPath1 = new Path(args[0]);
Path inputPath2 = new Path(args[1]);
Path outputPath = new Path(args[2]);
String tableName = "test";
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MixMR.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.addFamily(Bytes.toBytes("cf"));
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
TableMap.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setReducerClass(Reduce.class); // reducer class
job.setOutputFormatClass(TextOutputFormat.class);
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
// inputPath2 here has no effect for HBase table
MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);
}
}
There is no OOTB feature that supports this. A possible workaround could be to Scan your HBase table and write the Results to a HDFS file first and then do the reduce-side join using MultipleInputs. But this will incur some additional I/O overhead.
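For readers unfamiliar with the term, a reduce-side join usually works by having each mapper tag its output value with the source it came from, so that the reducer can tell the two sides apart and combine them per key. The following is a plain-Java sketch of that reducer logic only, with no Hadoop dependencies; the tag values, the tab separator, and the class name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReduceSideJoinSketch {
    static final String HBASE = "H"; // tag a table mapper would prepend
    static final String FILE = "F";  // tag a file mapper would prepend

    private ReduceSideJoinSketch() {}

    /** What a mapper would emit as its value: the source tag, a tab, then the payload. */
    static String tag(String source, String value) {
        return source + "\t" + value;
    }

    /**
     * What a reducer would do for one key: separate the values by tag and
     * emit one joined line per (HBase value, file value) pair (inner join).
     */
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> fromHBase = new ArrayList<>();
        List<String> fromFile = new ArrayList<>();
        for (String tagged : taggedValues) {
            int sep = tagged.indexOf('\t');
            String source = tagged.substring(0, sep);
            String value = tagged.substring(sep + 1);
            if (HBASE.equals(source)) {
                fromHBase.add(value);
            } else {
                fromFile.add(value);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String h : fromHBase) {
            for (String f : fromFile) {
                joined.add(key + "," + h + "," + f);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(tag(HBASE, "c1-value"), tag(FILE, "file-value"));
        System.out.println(reduce("row1", values));
    }
}
```

In an actual job the tagging would happen in the two Mapper classes and the loop in reduce would run over the Iterable of Text values for each key.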
A pig script or hive query can do that easily.
sample pig script
tbl = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:* ...', '-loadKey true -limit 5')
AS (id:bytearray, info_map:map[],...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...