How to solve the "too many connections" problem in ZooKeeper when I query many times in the reduce stage? - mapreduce

Sorry for my stupid question, and thank you in advance.
I need to replace the output value in the reduce stage (or map stage). However, it causes too many connections in ZooKeeper, and I don't know how to deal with it.
This is my reduce method:
public static class HbaseToHDFSReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> address = new HashSet<>();
        for (Text item : values) {
            String city = getDataByRowKey("A1", item.toString());
            address.add(city);
        }
        context.write(key, new Text(String.valueOf(address).replace("\"", "")));
    }
}
This is the query method:
public static String getDataByRowKey(String tableName, String rowKey) throws IOException {
    Table table = ConnectionFactory.createConnection(conf).getTable(TableName.valueOf(tableName));
    Get get = new Get(rowKey.getBytes());
    String data = new String();
    if (!get.isCheckExistenceOnly()) {
        Result result = table.get(get);
        for (Cell cell : result.rawCells()) {
            String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
            String value = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
            if (colName.equals(rowKey)) {
                data = value;
            }
        }
    }
    table.close();
    return data;
}
What should I do to solve it?
Thank you again

You create one connection per query, and connection creation is a heavy-weight operation. You could instead obtain a single Connection once in your reducer and change
getDataByRowKey(String tableName, String rowKey)
to
getDataByRowKey(Connection connection, String tableName, String rowKey).
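A minimal sketch of that idea, assuming the same HBase client API already used in the question (the table name "A1" and the reducer types are taken from the question; everything else is illustrative): open the Connection once in setup(), pass it into the lookup, and close it in cleanup().
```
import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HbaseToHDFSReducer extends Reducer<Text, Text, Text, Text> {

    private Connection connection;

    @Override
    protected void setup(Context context) throws IOException {
        // One connection per reducer task, reused for every lookup.
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> address = new HashSet<>();
        for (Text item : values) {
            address.add(getDataByRowKey(connection, "A1", item.toString()));
        }
        context.write(key, new Text(String.valueOf(address).replace("\"", "")));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (connection != null) {
            connection.close();
        }
    }

    private static String getDataByRowKey(Connection connection, String tableName, String rowKey)
            throws IOException {
        // getTable() is lightweight; only the Connection itself is expensive to create.
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            for (Cell cell : result.rawCells()) {
                String colName = Bytes.toString(cell.getQualifierArray(),
                        cell.getQualifierOffset(), cell.getQualifierLength());
                if (colName.equals(rowKey)) {
                    return Bytes.toString(cell.getValueArray(),
                            cell.getValueOffset(), cell.getValueLength());
                }
            }
        }
        return "";
    }
}
```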

Related

AWS SDK2 java s3 select example - how to get result bytes

I am trying to use the AWS SDK v2 for Java for S3 Select operations but am not able to extract the final data. I am looking for an example if someone has implemented it. I got some idea from [this post][1] but cannot figure out how to get and read the full data:
Fetching specific fields from an S3 document
Basically, I need the equivalent of the v1 SDK:
```
InputStream resultInputStream = result.getPayload().getRecordsInputStream(
    new SelectObjectContentEventVisitor() {
        @Override
        public void visit(SelectObjectContentEvent.StatsEvent event)
        {
            System.out.println(
                "Received Stats, Bytes Scanned: " + event.getDetails().getBytesScanned()
                + " Bytes Processed: " + event.getDetails().getBytesProcessed());
        }
        /*
         * An End Event informs that the request has finished successfully.
         */
        @Override
        public void visit(SelectObjectContentEvent.EndEvent event)
        {
            isResultComplete.set(true);
            System.out.println("Received End Event. Result is complete.");
        }
    }
);
```
In AWS SDK v2, how do I get the ResultOutputStream?
```
public byte[] getQueryResults() {
    logger.info("V2 query");
    S3AsyncClient s3Client = null;
    s3Client = S3AsyncClient.builder()
            .region(Region.US_WEST_2)
            .build();
    String fileObjKeyName = "upload/" + filePath;
    try {
        logger.info("Filepath: " + fileObjKeyName);
        ListObjectsV2Request listObjects = ListObjectsV2Request
                .builder()
                .bucket(Constants.bucketName)
                .build();
        ......
        InputSerialization inputSerialization = InputSerialization.builder()
                .json(JSONInput.builder().type(JSONType.LINES).build()).build();
        OutputSerialization outputSerialization = null;
        outputSerialization = OutputSerialization.builder()
                .json(JSONOutput.builder().build())
                .build();
        SelectObjectContentRequest selectObjectContentRequest = SelectObjectContentRequest.builder()
                .bucket(Constants.bucketName)
                .key(partFilename)
                .expression(query)
                .expressionType(ExpressionType.SQL)
                .inputSerialization(inputSerialization)
                .outputSerialization(outputSerialization)
                .scanRange(ScanRange.builder().start(0L).end(Constants.limitBytes).build())
                .build();
        final DataHandler handler = new DataHandler();
        CompletableFuture future = s3Client.selectObjectContent(selectObjectContentRequest, handler);
        // hold it till we get an end event
        EndEvent endEvent = (EndEvent) handler.receivedEvents.stream()
                .filter(e -> e.sdkEventType() == SelectObjectContentEventStream.EventType.END)
                .findFirst()
                .orElse(null);
        // Now, from here how do I get the response bytes?
        // ---> ISSUE: How do I get the ResultStream bytes ????
        return <bytes>
}
```
// handler
```
private static class DataHandler implements SelectObjectContentResponseHandler {
    private SelectObjectContentResponse response;
    private List receivedEvents = new ArrayList<>();
    private Throwable exception;

    @Override
    public void responseReceived(SelectObjectContentResponse response) {
        this.response = response;
    }

    @Override
    public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
        publisher.subscribe(receivedEvents::add);
    }

    @Override
    public void exceptionOccurred(Throwable throwable) {
        exception = throwable;
    }

    @Override
    public void complete() {
    }
}
```
[1]: https://stackoverflow.com/questions/67315601/fetching-specific-fields-from-an-s3-document
I came to your post since I was working on the same issue, trying to avoid v1.
After hours of searching I ended up finding the answer at https://github.com/aws/aws-sdk-java-v2/pull/2943/files
The answer is located in the SelectObjectContentIntegrationTest.java file:
services/s3/src/it/java/software/amazon/awssdk/services/SelectObjectContentIntegrationTest.java
The way to get the bytes is by using the RecordsEvent class. Please note that for my use case I used CSV; I am not sure whether this would be different for another file type.
In the complete() method you have access to receivedEvents. This is where you take the first element of the filtered returned results and cast it to the RecordsEvent class; that class then provides the payload as bytes:
@Override
public void complete() {
    RecordsEvent records = (RecordsEvent) this.receivedEvents.get(0);
    String result = records.payload().asUtf8String();
}
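If the result spans more than one RECORDS event, a possible extension (a sketch only, reusing the RecordsEvent cast and payload() accessor shown above; the resultBytes field is illustrative) is to concatenate the payload of every RECORDS event:
```
@Override
public void complete() {
    // Collect the payload of every RECORDS event; other event types (stats, end, ...) are skipped.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Object event : this.receivedEvents) {
        if (event instanceof RecordsEvent) {
            byte[] chunk = ((RecordsEvent) event).payload().asByteArray();
            out.write(chunk, 0, chunk.length);
        }
    }
    this.resultBytes = out.toByteArray(); // illustrative field, read back by getQueryResults()
}
```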

How do I get rid of java.lang.IllegalArgumentException: Unable to match key(s) in Corda?

I have created a counterparty session. The issuer signs the transaction by passing its key to signInitialTransaction. Then, when I call CollectSignaturesFlow to get the buyer's signature, it throws an 'Unable to match key(s)' exception.
I have no idea what went wrong.
This is my initiator flow.
package com.template.flows;

@InitiatingFlow
@StartableByRPC
public class InitiateTicketMovementFlow extends FlowLogic<String> {
    private final String buyer;
    private final String issuer;
    private final StateRef assetReference;

    public InitiateTicketMovementFlow(String buyer, String issuer, String hash, int index) {
        this.buyer = buyer;
        this.issuer = issuer;
        this.assetReference = new StateRef(SecureHash.parse(hash), index);
    }

    @Override
    @Suspendable
    public String call() throws FlowException {
        final Party notary = getServiceHub().getNetworkMapCache().getNotaryIdentities().get(0);
        AccountInfo issuerAccountInfo = UtilitiesKt.getAccountService(this)
                .accountInfo(issuer).get(0).getState().getData();
        AccountInfo receiverAccountInfo = UtilitiesKt.getAccountService(this)
                .accountInfo(buyer).get(0).getState().getData();
        AnonymousParty buyerAccount = subFlow(new RequestKeyForAccount(receiverAccountInfo));
        QueryCriteria.VaultQueryCriteria queryCriteria = new QueryCriteria.VaultQueryCriteria()
                .withStateRefs(ImmutableList.of(assetReference));
        StateAndRef<CustomTicket> ticketStateStateAndRef = getServiceHub().getVaultService()
                .queryBy(CustomTicket.class, queryCriteria).getStates().get(0);
        CustomTicket ticketState = ticketStateStateAndRef.getState().getData();
        TransactionBuilder txBuilder = new TransactionBuilder(notary);
        MoveTokensUtilities.addMoveNonFungibleTokens(txBuilder, getServiceHub(),
                ticketState.toPointer(CustomTicket.class), receiverAccountInfo.getHost());
        FlowSession buyerSession = initiateFlow(receiverAccountInfo.getHost());
        buyerSession.send(ticketState.getValuation());
        List<StateAndRef<FungibleToken>> inputs = subFlow(new ReceiveStateAndRefFlow<>(buyerSession));
        List<FungibleToken> moneyReceived = buyerSession.receive(List.class).unwrap(value -> value);
        MoveTokensUtilities.addMoveTokens(txBuilder, inputs, moneyReceived);
        SignedTransaction selfSignedTransaction = getServiceHub()
                .signInitialTransaction(txBuilder, ImmutableList.of(issuerAccountInfo.getHost().getOwningKey()));
        SignedTransaction signedTransaction = subFlow(new CollectSignaturesFlow(
                selfSignedTransaction, Arrays.asList(buyerSession),
                Collections.singleton(issuerAccountInfo.getHost().getOwningKey())));
        SignedTransaction stx = subFlow(new FinalityFlow(
                signedTransaction, ImmutableList.of(buyerSession)));
        subFlow(new UpdateDistributionListFlow(stx));
        return "\nTicket is sold to " + buyer;
    }
}
It SEEMS like the issue here is that you're getting the buyer account the wrong way, or that the finality flow call might be off. Take a look at our samples on this.
Maybe try something like this to get your account info:
AccountInfo targetAccount = accountService.accountInfo(<STRING NAME OF ACCOUNT>).get(0);
The source is our Corda samples repo: https://github.com/corda/samples-java/blob/master/Accounts/supplychain/workflows/src/main/java/net/corda/samples/supplychain/flows/SendShippingRequest.java#L80
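A minimal sketch of that lookup, reusing only the accounts API already present in the question (UtilitiesKt.getAccountService and RequestKeyForAccount); the variable names are illustrative:
```
// Resolve the buyer's account by its string name, then request a fresh key for it.
AccountInfo buyerAccountInfo = UtilitiesKt.getAccountService(this)
        .accountInfo(buyer).get(0).getState().getData();
AnonymousParty buyerParty = subFlow(new RequestKeyForAccount(buyerAccountInfo));
// Sessions are still opened with the account's host node.
FlowSession buyerSession = initiateFlow(buyerAccountInfo.getHost());
```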
Also, note how different the finality calls look:
https://github.com/corda/samples-java/blob/master/Accounts/supplychain/workflows/src/main/java/net/corda/samples/supplychain/flows/SendShippingRequest.java#L112

Mapreduce MultipleOutputs error

I want to store the output of a MapReduce job in two different directories, even though my code writes the same output to both.
My driver class code is below:
public class WordCountMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job myhadoopJob = new Job(conf);
        myhadoopJob.setJarByClass(WordCountMain.class);
        myhadoopJob.setJobName("WORD COUNT JOB");
        FileInputFormat.addInputPath(myhadoopJob, new Path(args[0]));
        myhadoopJob.setMapperClass(WordCountMapper.class);
        myhadoopJob.setReducerClass(WordCountReducer.class);
        myhadoopJob.setInputFormatClass(TextInputFormat.class);
        myhadoopJob.setOutputFormatClass(TextOutputFormat.class);
        myhadoopJob.setMapOutputKeyClass(Text.class);
        myhadoopJob.setMapOutputValueClass(IntWritable.class);
        myhadoopJob.setOutputKeyClass(Text.class);
        myhadoopJob.setOutputValueClass(IntWritable.class);
        MultipleOutputs.addNamedOutput(myhadoopJob, "output1", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(myhadoopJob, "output2", TextOutputFormat.class, Text.class, IntWritable.class);
        FileOutputFormat.setOutputPath(myhadoopJob, new Path(args[1]));
        System.exit(myhadoopJob.waitForCompletion(true) ? 0 : 1);
    }
}
My Mapper Code
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String word = null;
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            word = st.nextToken();
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
My Reducer Code is below
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    MultipleOutputs mout = null;

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        int num = 0;
        Iterator<IntWritable> ie = values.iterator();
        while (ie.hasNext()) {
            num = ie.next().get(); // 1
            count = count + num;
        }
        mout.write("output1", key, new IntWritable(count));
        mout.write("output2", key, new IntWritable(count));
    }

    @Override
    protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        mout = new MultipleOutputs<Text, IntWritable>(context);
    }
}
I am simply giving the output directories in the reduce method itself.
But when I run this MapReduce job using the command below, it does nothing. MapReduce is not even started; it just stays blank and sits idle.
hadoop jar WordCountMain.jar /user/cloudera/inputfiles/words.txt /user/cloudera/outputfiles/mapreduce/multipleoutputs
Could someone explain what went wrong and how I can correct it in my code?
Actually, what happens is that two output files with different names are stored inside /user/cloudera/outputfiles/mapreduce/multipleoutputs,
but what I need is to store the output files in different directories.
In Pig we can do this with two STORE statements pointing to different directories.
How do I achieve the same in MapReduce?
Can you try closing the MultipleOutputs object in the reducer's cleanup() method? See the sketch below.
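A minimal sketch (untested) of that suggestion, added to the reducer from the question:
```
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // MultipleOutputs buffers its record writers; closing it in cleanup() flushes
    // the named outputs ("output1", "output2") when the reducer task finishes.
    if (mout != null) {
        mout.close();
    }
}
```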

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job which accesses data from both an HBase table and HDFS files. E.g., the mapper reads data from an HBase table and from HDFS files; the data share the same primary key but have different schemas. A reducer then joins all the columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have some ideas. Sample code would be great.
After a few days of investigation (and help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {

    public static class Map extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String s = value.toString();
            String[] sa = s.split(",");
            if (sa.length == 2) {
                context.write(new Text(sa[0]), new Text(sa[1]));
            }
        }
    }

    public static class TableMap extends TableMapper<Text, Text> {
        public static final byte[] CF = "cf".getBytes();
        public static final byte[] ATTR1 = "c1".getBytes();

        public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
            String key = Bytes.toString(row.get());
            String val = new String(value.getValue(CF, ATTR1));
            context.write(new Text(key), new Text(val));
        }
    }

    public static class Reduce extends Reducer<Object, Text, Object, Text> {
        public void reduce(Object key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String ks = key.toString();
            for (Text val : values) {
                context.write(new Text(ks), val);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path inputPath1 = new Path(args[0]);
        Path inputPath2 = new Path(args[1]);
        Path outputPath = new Path(args[2]);
        String tableName = "test";

        Configuration config = HBaseConfiguration.create();
        Job job = new Job(config, "ExampleRead");
        job.setJarByClass(MixMR.class);     // class that contains mapper

        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs
        scan.addFamily(Bytes.toBytes("cf"));

        TableMapReduceUtil.initTableMapperJob(
                tableName,      // input HBase table name
                scan,           // Scan instance to control CF and attribute selection
                TableMap.class, // mapper
                Text.class,     // mapper output key
                Text.class,     // mapper output value
                job);

        job.setReducerClass(Reduce.class);  // reducer class
        job.setOutputFormatClass(TextOutputFormat.class);

        // inputPath1 here has no effect for the HBase table
        MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
        MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);

        FileOutputFormat.setOutputPath(job, outputPath);
        job.waitForCompletion(true);
    }
}
There is no OOTB feature that supports this. A possible workaround could be to scan your HBase table and write the Results to an HDFS file first, and then do the reduce-side join using MultipleInputs. But this will incur some additional I/O overhead.
A Pig script or Hive query can do that easily.
A sample Pig script:
tbl = LOAD 'hbase://SampleTable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:* ...', '-loadKey true -limit 5')
      AS (id:bytearray, info_map:map[], ...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray, ...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...;

HBase MapReduce - How to access individual columns of the table?

I have a table called User with two columns, one called visitorId and the other called friend, which is a list of strings. I want to check whether the visitorId is in the friend list. Can anyone direct me as to how to access the table columns in a map function?
I'm not able to picture how data is output from a map function in HBase.
My code is as follows:
public class MapReduce {

    static class Mapper1 extends TableMapper<ImmutableBytesWritable, Text> {
        private int numRecords = 0;
        private static final IntWritable one = new IntWritable(1);
        private final IntWritable ONE = new IntWritable(1);
        private Text text = new Text();

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            // What should I do here??
            ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
            try {
                context.write(userKey, one);
                //context.write(text, ONE);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "CheckVisitor");
        job.setJarByClass(MapReduce.class);
        Scan scan = new Scan();
        Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("mId2"));
        scan.setFilter(f);
        scan.addFamily(Bytes.toBytes("visitor"));
        scan.addFamily(Bytes.toBytes("friend"));
        TableMapReduceUtil.initTableMapperJob("User", scan, Mapper1.class, ImmutableBytesWritable.class, Text.class, job);
    }
}
So the Result values instance contains the full row from the scanner.
To get the appropriate columns from the Result, I would do something like:
VisitorIdVal = value.getColumnLatest(Bytes.toBytes(columnFamily1), Bytes.toBytes("VisitorId"))
friendlistVal = value.getColumnLatest(Bytes.toBytes(columnFamily2), Bytes.toBytes("friendlist"))
Here VisitorIdVal and friendlistVal are of type KeyValue (http://archive.cloudera.com/cdh/3/hbase/apidocs/org/apache/hadoop/hbase/KeyValue.html); to get their values out you can do Bytes.toString(VisitorIdVal.getValue()).
Once you have extracted the values from the columns, you can check for "VisitorId" in "friendlist", as in the sketch below.
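A minimal sketch of the map() body, reusing the getColumnLatest/getValue calls above (the column family names come from the question's Scan; the qualifier names are assumptions):
```
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
    KeyValue visitorKv = values.getColumnLatest(Bytes.toBytes("visitor"), Bytes.toBytes("VisitorId"));
    KeyValue friendKv  = values.getColumnLatest(Bytes.toBytes("friend"), Bytes.toBytes("friendlist"));
    if (visitorKv == null || friendKv == null) {
        return; // this row is missing one of the columns
    }
    String visitorId  = Bytes.toString(visitorKv.getValue());
    String friendList = Bytes.toString(friendKv.getValue());
    if (friendList.contains(visitorId)) {
        try {
            // Emit only the rows where the visitor appears in the friend list.
            context.write(row, new Text(visitorId));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```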