HBase Get/Scan in a Scalding job - mapreduce

I'm using Scalding with Spyglass to read from/write to HBase.
I'm doing a left outer join of table1 and table2 and writing back to table1 after transforming a column.
Both table1 and table2 are declared as Spyglass HBaseSource.
This works fine. But I need to access a different row in table1, using a row key, to compute the transformed value.
I tried the following for HBase get:
val hTable = new HTable(conf, TABLE_NAME)
val result = hTable.get(new Get(rowKey.getBytes()))
I'm getting access to the Configuration in the Scalding job as mentioned in this link:
https://github.com/twitter/scalding/wiki/Frequently-asked-questions#how-do-i-access-the-jobconf
This works when I run the Scalding job locally.
But when I run it on the cluster, conf is null when this code is executed in the reducer.
Is there a better way to do HBase get/scan in a Scalding/Cascading job for cases like this?

There are a couple of ways to do this:
1) You can use a managed resource:
class SomeJob(args: Args) extends Job(args) {
  val someConfig = HBaseConfiguration.create()
  someConfig.addResource(new Path(pathToYourXmlFile))
  lazy val hPool = new HTablePool(someConfig, 3)

  def getConf = {
    implicitly[Mode] match {
      case Hdfs(_, conf) => conf
      case _ => ??? // whatever you are doing for a local conf
    }
  }

  ... somePipe.someOperation.... {
    val gets = keys.map { key => new Get(key) }
    managed(hPool.getTable("myTableName")) acquireAndGet { table =>
      val results = table.get(gets)
      // ...do something with these results
    }
  }
}
2) You can use some more specific Cascading code, where you write a custom Scheme and override the source method (and possibly some others, depending on your needs). In there you can access the JobConf like this:
class MyScheme extends Scheme[JobConf, SomeRecordReader, SomeOutputCollector, ...] {

  @transient var jobConf: Configuration = _

  override def source(flowProcess: FlowProcess[JobConf], ...): Boolean = {
    jobConf = flowProcess match {
      case h: HadoopFlowProcess => h.getJobConf
      case _ => jobConf // fall back to whatever you use for a local conf
    }
    // ...do something with the jobConf here
  }
}

Is it possible to build an OLTP/CRUD HTTP server using AkkaHttp, AkkaStreams, Alpakka and a database?

It is clear to me that it is of course possible using Actors: for instance, https://github.com/chbatey/akka-http-typed.git uses Akka HTTP and typed actors.
But it is unclear to me whether, using just Akka Streams and its Alpakka connectors library (which includes databases), it is possible to build regular CRUD / OLTP services, or only data replication from one database to another and other OLAP / batch / stream processing scenarios.
If you know how it can be done, please give a few details; an example on GitHub, for instance, would be great.
The way I imagine it might work is that the server is involved in two conversations / stateful stream transformations: one with the outside world over HTTP, and one with the database. I am not sure whether it can be modelled like that.
https://doc.akka.io/docs/alpakka/current/slick.html seems to offer both UPDATE/INSERTs as a Sink and pointed SELECTs for a certain id as a Source. Do you know whether an example app exists, or can you broadly describe how the wiring would work with Akka HTTP?
I put a demo here; I hope it helps.
Create the table (the database is MySQL):
CREATE TABLE test(id VARCHAR(32))
sbt:
"com.lightbend.akka" %% "akka-stream-alpakka-slick" % "1.1.0",
"mysql" % "mysql-connector-java" % "5.1.40"
Code:
package tech.parasol.scala.crud

import java.sql.SQLException

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives.{complete, get, path, _}
import akka.stream.alpakka.slick.scaladsl.{Slick, SlickSession}
import akka.stream.scaladsl.Sink
import akka.stream.{ActorAttributes, ActorMaterializer, Supervision}
import com.typesafe.config.ConfigFactory

import scala.concurrent.Future
import scala.io.StdIn
import scala.util.{Failure, Success}

object CrudTest1 {

  def main(args: Array[String]): Unit = {

    implicit val system = ActorSystem("CrudTest1")
    implicit val materializer = ActorMaterializer()
    implicit val executionContext = system.dispatcher

    val hostName = "127.0.0.1"

    val rocketDbConfig =
      s"""
         |db-config {
         |  profile = "slick.jdbc.MySQLProfile$$"
         |  db {
         |    dataSourceClass = "slick.jdbc.DriverDataSource"
         |    properties = {
         |      driver = "com.mysql.jdbc.Driver"
         |      url = "jdbc:mysql://${hostName}:3306/rocket?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true&useSSL=false"
         |      user = "root"
         |      password = "passw0rd"
         |    }
         |  }
         |}
         |
       """.stripMargin

    implicit val session = SlickSession.forConfig("db-config", ConfigFactory.parseString(rocketDbConfig))

    import session.profile.api._

    def persistence(message: String) = {
      def insert(message: String): DBIO[Int] = {
        sqlu"""INSERT INTO test(id) VALUES (${message})"""
      }

      session.db.run(insert(message)).map {
        case _ => message
      }.recover {
        case e: SQLException => throw new Exception("Database error ===>")
        case e: Exception => throw new Exception("Database error.")
      }
    }

    val route = path("hello" / Segment) { name =>
      get {
        val res = persistence(name)
        onComplete(res) {
          case Success(value) => complete(s"<h1>Say hello to ${name}</h1>")
          case Failure(e) => complete(s"<h1>Failed to say hello to ${name}</h1>")
        }
      }
    }

    val bindingFuture = Http().bindAndHandle(route, "localhost", 8088)
    println(s"Server online at http://localhost:8088/\nPress RETURN to stop...")
    StdIn.readLine() // let it run until user presses return
    bindingFuture
      .flatMap(_.unbind()) // trigger unbinding from the port
      .onComplete(_ => system.terminate()) // and shutdown when done
  }
}
Yes. Basically, at every request received in Akka HTTP, we create an Akka Streams graph (typically just a pipeline): essentially the Slick Alpakka Source from the database, maybe prefixed by some operators, which is then returned in Akka HTTP, which of course supports completing a request with a Source. More details at https://www.quora.com/Is-it-possible-to-build-an-OLTP-CRUD-HTTP-server-using-Akka-HTTP-Akka-Streams-Alpakka-and-a-database-Do-you-know-any-examples-of-code-on-GitHub-or-elsewhere/answer/Nicolae-Marasoiu

how to mock/match lambda in kotlin method signature

I have some code of the following form:
@Language("SQL")
val someSql = """
    SELECT foo
    FROM bar
    WHERE foo = :foo
    """
return session.select(someSql, mapOf("foo" to foo)) {
    MyObject(
        foo = it.string("foo"),
    )
}.firstOrNull()
which uses the method below from com.github.andrewoma.kwery.core. Note the lambda in the method signature:
fun <R> select(@Language("SQL") sql: String,
               parameters: Map<String, Any?> = mapOf(),
               options: StatementOptions = defaultOptions,
               mapper: (Row) -> R): List<R>
I use mockitokotlin2.
I need to return an instance of MyObject when the session select method is called with a select query (containing "SELECT foo").
I was thinking I could pass a mock into the lambda as below (but then it won't match the method call I am trying to mock). The code below is an attempt, but it never matches in eq(function2):
val function2: (Row) -> Int = mock {
    onGeneric { invoke(any()) }.thenReturn(MyObject(foo = "test-foo"))
}
val session = mock<Session> {
    on { select(sql = any(), parameters = any(), options = any(), mapper = eq(function2)) }.thenReturn(listOf(MyObject(foo = "test-foo")))
}
function2 in my case is not really the mapper that gets passed in, so it is not eq to what I am trying to mock; it never matches and the mock is never called.
So what do I put in the mock of session.select, instead of eq(function2) in the code above, to get a MyObject instance returned?
I think you just need to specify the type that your mapper is expected to return when setting up the session mock; in your case that looks to be Function1<Row, MyObject>:
val session = mock<Session> {
    on { select(sql = anyString(), parameters = anyMap(), options = any(), mapper = any<Function1<Row, MyObject>>()) }.thenReturn(listOf(MyObject(foo = "test-foo")))
}

How do I query multiple IDs via the ContentSearchManager?

When I have an array of Sitecore IDs, for example TargetIDs from a MultilistField, how can I query the ContentSearchManager to return all the SearchResultItem objects?
I have tried the following which gives an "Only constant arguments is supported." error.
using (var s = Sitecore.ContentSearch.ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
{
    rpt.DataSource = s.GetQueryable<SearchResultItem>().Where(x => f.TargetIDs.Contains(x.ItemId));
    rpt.DataBind();
}
I suppose I could build up the Linq query manually with multiple OR queries. Is there a way I can use Sitecore.ContentSearch.Utilities.LinqHelper to build the query for me?
Assuming I got this technique to work, is it worth using it for only, say, 10 items? I'm just starting my first Sitecore 7 project and I have it in mind that I want to use the index as much as possible.
Finally, does the Page Editor support editing fields somehow with a SearchResultItem as the source?
Update 1
I wrote this function which utilises the predicate builder as dunston suggests. I don't know yet if this is actually worth using (instead of Items).
public static List<T> GetSearchResultItemsByIDs<T>(ID[] ids, bool mustHaveUrl = true)
    where T : Sitecore.ContentSearch.SearchTypes.SearchResultItem, new()
{
    Assert.IsNotNull(ids, "ids");
    if (!ids.Any())
    {
        return new List<T>();
    }
    using (var s = Sitecore.ContentSearch.ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
    {
        var predicate = PredicateBuilder.True<T>();
        predicate = ids.Aggregate(predicate, (current, id) => current.Or(p => p.ItemId == id));
        var results = s.GetQueryable<T>().Where(predicate).ToDictionary(x => x.ItemId);
        var query = from id in ids
                    let item = results.ContainsKey(id) ? results[id] : null
                    where item != null && (!mustHaveUrl || item.Url != null)
                    select item;
        return query.ToList();
    }
}
It forces the results to be in the same order as supplied in the IDs array, which in my case is important. (If anybody knows a better way of doing this, would love to know).
It also, by default, ensures that the Item has a URL.
My main code then becomes:
var f = (Sitecore.Data.Fields.MultilistField) rootItem.Fields["Main navigation links"];
rpt.DataSource = ContentSearchHelper.GetSearchResultItemsByIDs<SearchResultItem>(f.TargetIDs);
rpt.DataBind();
I'm still curious how the Page Editor copes with SearchResultItem or POCOs in general (my second question), and I am going to continue researching that now.
Thanks for reading,
Steve
You need to use the predicate builder to create multiple OR queries, or AND queries.
The code below should work.
using (var s = Sitecore.ContentSearch.ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
{
    var predicate = PredicateBuilder.True<SearchResultItem>();
    foreach (var targetId in f.TargetIDs)
    {
        var tempTargetId = targetId;
        predicate = predicate.Or(x => x.ItemId == tempTargetId);
    }
    rpt.DataSource = s.GetQueryable<SearchResultItem>().Where(predicate);
    rpt.DataBind();
}

Deletion from amazon dynamodb

Is there any efficient way to delete all the items from an Amazon DynamoDB table at once? I have gone through the AWS docs, but they only show deleting a single item.
Do the following steps:
1) Make a DeleteTable request.
2) In the response you will get the TableDescription.
3) Using the TableDescription, create the table again.
For steps 1 and 2 click here; for step 3 click here.
That's what I do in my application.
DynamoDBMapper will do the job in a few lines:
AWSCredentials credentials = new PropertiesCredentials(credentialFile);
client = new AmazonDynamoDBClient(credentials);
DynamoDBMapper mapper = new DynamoDBMapper(this.client);
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
PaginatedScanList<LogData> result = mapper.scan(LogData.class, scanExpression);

for (LogData data : result) {
    mapper.delete(data);
}
As ihtsham says, the most efficient way is to delete and re-create the table. However, if that is not practical (e.g. due to complex configuration of the table, such as Lambda triggers), here are some AWS CLI commands to delete all records. They require the jq program for JSON processing.
Deleting records one-by-one (slow!), assuming your table is called my_table, your partition key is called partition_key, and your sort key (if any) is called sort_key:
aws dynamodb scan --table-name my_table | \
jq -c '.Items[] | { partition_key, sort_key }' | \
tr '\n' '\0' | \
xargs -0 -n1 -t aws dynamodb delete-item --table-name my_table --key
Deleting records in batches of up to 25 records:
aws dynamodb scan --table-name my_table | \
jq -c '[.Items | keys[] as $i | { index: $i, value: .[$i]}] | group_by(.index / 25 | floor)[] | { "my_table": [.[].value | { "DeleteRequest": { "Key": { partition_key, sort_key }}}] }' | \
tr '\n' '\0' | \
xargs -0 -n1 -t aws dynamodb batch-write-item --request-items
If you start seeing non-empty UnprocessedItems responses, your write capacity has been exceeded. You can account for this by reducing the batch size. For me, each batch takes about a second to submit, so with a write capacity of 5 per second, I set the batch size to 5.
Just for the record, a quick solution with item-by-item delete in Python 3 (using Boto3 and scan()):
(Credentials need to be set.)
def delete_all_items(table_name):
    # Deletes all items from a DynamoDB table.
    # You need to confirm your intention by pressing Enter.
    import boto3
    client = boto3.client('dynamodb')
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(table_name)
    response = client.describe_table(TableName=table_name)
    keys = [k['AttributeName'] for k in response['Table']['KeySchema']]
    response = table.scan()
    items = response['Items']
    number_of_items = len(items)
    if number_of_items == 0:  # no items to delete
        print("Table '{}' is empty.".format(table_name))
        return
    print("You are about to delete all ({}) items from table '{}'."
          .format(number_of_items, table_name))
    input("Press Enter to continue...")
    with table.batch_writer() as batch:
        for item in items:
            key_dict = {k: item[k] for k in keys}
            print("Deleting " + str(item) + "...")
            batch.delete_item(Key=key_dict)

delete_all_items("test_table")
Obviously, this shouldn't be used for tables with a lot of items (100+). For that, the delete / recreate approach is cheaper and more efficient.
You will want to use BatchWriteItem if you can't drop the table. If all your entries are within a single HashKey, you can use the Query API to retrieve the records, and then delete them 25 items at a time. If not, you'll probably have to Scan.
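For illustration, a query-then-batch-delete pass with the AWS SDK for Java might look roughly like the sketch below. Treat it as a sketch only: the table name, key attribute names, and hash key value are made up, and pagination (LastEvaluatedKey) as well as UnprocessedItems retries are omitted.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class BatchDeleteSketch {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();

        // Fetch only the key attributes of the records under one hash key
        // (fall back to a Scan if your items span many hash keys).
        QueryRequest query = new QueryRequest()
                .withTableName("my_table")                                // assumed table name
                .withKeyConditionExpression("partition_key = :pk")        // assumed key name
                .withExpressionAttributeValues(
                        Collections.singletonMap(":pk", new AttributeValue("some-hash-key")))
                .withProjectionExpression("partition_key, sort_key");     // key attributes only

        List<WriteRequest> pending = new ArrayList<>();
        for (Map<String, AttributeValue> key : client.query(query).getItems()) {
            pending.add(new WriteRequest(new DeleteRequest().withKey(key)));
            if (pending.size() == 25) {                                   // BatchWriteItem limit
                client.batchWriteItem(new BatchWriteItemRequest()
                        .withRequestItems(Collections.singletonMap("my_table", pending)));
                pending = new ArrayList<>();
            }
        }
        if (!pending.isEmpty()) {
            client.batchWriteItem(new BatchWriteItemRequest()
                    .withRequestItems(Collections.singletonMap("my_table", pending)));
        }
    }
}
As with the CLI batching above, anything reported back in UnprocessedItems still has to be retried.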
Alternatively, you could provide a simple wrapper around AmazonDynamoDBClient (from the official SDK) that collects a Set of Hash/Range keys that exist in your table. Then you wouldn't need to Query or Scan for the items you inserted after the test, since you would already have the Set built. That would look something like this:
public class KeyCollectingAmazonDynamoDB implements AmazonDynamoDB
{
    private final AmazonDynamoDB delegate;
    // HashRangePair is something you have to define
    private final Set<Key> contents;

    public KeyCollectingAmazonDynamoDB( AmazonDynamoDB delegate )
    {
        this.delegate = delegate;
        this.contents = new HashSet<>();
    }

    @Override
    public PutItemResult putItem( PutItemRequest putItemRequest )
            throws AmazonServiceException, AmazonClientException
    {
        contents.add( extractKey( putItemRequest.getItem() ) );
        return delegate.putItem( putItemRequest );
    }

    private Key extractKey( Map<String, AttributeValue> item )
    {
        // TODO Define your hash/range key extraction here
        // Create a Key object
        return new Key( hashKey, rangeKey );
    }

    @Override
    public DeleteItemResult deleteItem( DeleteItemRequest deleteItemRequest )
            throws AmazonServiceException, AmazonClientException
    {
        contents.remove( deleteItemRequest.getKey() );
        return delegate.deleteItem( deleteItemRequest );
    }

    @Override
    public BatchWriteItemResult batchWriteItem( BatchWriteItemRequest batchWriteItemRequest )
            throws AmazonServiceException, AmazonClientException
    {
        // Similar extraction, but in bulk.
        for ( Map.Entry<String, List<WriteRequest>> entry : batchWriteItemRequest.getRequestItems().entrySet() )
        {
            String tableName = entry.getKey();
            List<WriteRequest> writeRequests = entry.getValue();
            for ( WriteRequest writeRequest : writeRequests )
            {
                PutRequest putRequest = writeRequest.getPutRequest();
                if ( putRequest != null )
                {
                    // Add to Set just like putItem
                }
                DeleteRequest deleteRequest = writeRequest.getDeleteRequest();
                if ( deleteRequest != null )
                {
                    // Remove from Set just like deleteItem
                }
            }
        }
        // Write through to DynamoDB
        return delegate.batchWriteItem( batchWriteItemRequest );
    }

    // remaining methods elided, since they're direct delegation
}
Key is a class within the DynamoDB SDK that accepts zero, one, or two AttributeValue objects in the constructor to represent a hash key or a hash/range key. Assuming its equals and hashCode methods work, you can use it within the Set I described. If they don't, you'll have to write your own Key class.
This should get you a maintained Set for use within your tests. It's not specific to a table, so you might need to add another layer of collection if you're using multiple tables. That would change Set<Key> to something like Map<TableName, Set<Key>>. You would need to look at the getTableName() property to pick the correct Set to update.
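As an illustration of that per-table variant, the bookkeeping in putItem could become something like the following (a hypothetical snippet, not part of the wrapper above; contentsByTable is an invented field name):
// Hypothetical per-table bookkeeping, keyed by table name.
private final Map<String, Set<Key>> contentsByTable = new HashMap<>();

@Override
public PutItemResult putItem( PutItemRequest putItemRequest )
        throws AmazonServiceException, AmazonClientException
{
    contentsByTable
            .computeIfAbsent( putItemRequest.getTableName(), t -> new HashSet<>() )
            .add( extractKey( putItemRequest.getItem() ) );
    return delegate.putItem( putItemRequest );
}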
Once your test finishes, grabbing the contents of the table and deleting should be straightforward.
One final suggestion: use a different table for testing than you do for your application. Create an identical schema, but give the table a different name. You probably even want a different IAM user to prevent your test code from accessing your production table. If you have questions about that, feel free to open a separate question for that scenario.
You can recreate a DynamoDB table using the AWS Java SDK:
// Init DynamoDB client
AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.standard().build();

// Get table definition
TableDescription tableDescription = dynamoDB.describeTable("my-table").getTable();

// Delete table
dynamoDB.deleteTable("my-table");

// Create table
CreateTableRequest createTableRequest = new CreateTableRequest()
        .withTableName(tableDescription.getTableName())
        .withAttributeDefinitions(tableDescription.getAttributeDefinitions())
        .withProvisionedThroughput(new ProvisionedThroughput()
                .withReadCapacityUnits(tableDescription.getProvisionedThroughput().getReadCapacityUnits())
                .withWriteCapacityUnits(tableDescription.getProvisionedThroughput().getWriteCapacityUnits())
        )
        .withKeySchema(tableDescription.getKeySchema());
dynamoDB.createTable(createTableRequest);
I use the following JavaScript code to do it:
// Assumes two SDK clients are in scope: `db` (AWS.DynamoDB, for describeTable)
// and `ddb` (the client used for scan and batchWrite, e.g. a DocumentClient).
async function truncate(table, keys) {
    const limit = (await db.describeTable({
        TableName: table
    }).promise()).Table.ProvisionedThroughput.ReadCapacityUnits;

    let total = 0;
    let lastEvaluatedKey = null;
    do {
        const qp = {
            TableName: table,
            Limit: limit,
            ExclusiveStartKey: lastEvaluatedKey,
            ProjectionExpression: keys.join(', '), // comma-separated list of key attribute names
        };
        const qr = await ddb.scan(qp).promise();
        lastEvaluatedKey = qr.LastEvaluatedKey;

        const dp = {
            RequestItems: {},
        };
        dp.RequestItems[table] = [];

        if (qr.Items) {
            for (const i of qr.Items) {
                const dr = {
                    DeleteRequest: {
                        Key: {}
                    }
                };
                keys.forEach(k => {
                    dr.DeleteRequest.Key[k] = i[k];
                });
                dp.RequestItems[table].push(dr);
                if (dp.RequestItems[table].length % 25 == 0) {
                    await ddb.batchWrite(dp).promise();
                    total += dp.RequestItems[table].length;
                    dp.RequestItems[table] = [];
                }
            }
            if (dp.RequestItems[table].length > 0) {
                await ddb.batchWrite(dp).promise();
                total += dp.RequestItems[table].length;
                dp.RequestItems[table] = [];
            }
        }
        console.log(`Deleted ${total}`);
        setTimeout(() => {}, 1000);
    } while (lastEvaluatedKey);
}

(async () => {
    truncate('table_name', ['id']);
})();
In this case, you may delete the table and create a new one.
Example:
from __future__ import print_function # Python 2/3 compatibility
import boto3
dynamodb = boto3.resource('dynamodb', region_name='us-west-2', endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')
table.delete()

How can I get EclipseLink to output valid Informix SQL for an UPDATE WHERE clause?

We have a named query like this:
UPDATE Foo f SET f.x = 0 WHERE f.x = :invoiceId
Foo in this case is an entity with a superclass, using the table-per-class inheritance strategy.
The SQL that EclipseLink generates is:
UPDATE foo_subclass SET x = ?
  WHERE EXISTS(SELECT t0.id
               FROM foo_superclass t0, foo_subclass t1
               WHERE ((t1.x = ?) AND ((t1.id = t0.id) AND (t0.DTYPE = ?))))
(The ? slots are correctly filled in.)
On Informix 11.70, we get an error that the subquery cannot access the table being changed.
Here is the documentation that I was able to find on subquery restrictions on Informix: http://publib.boulder.ibm.com/infocenter/idshelp/v115/index.jsp?topic=%2Fcom.ibm.sqls.doc%2Fids_sqs_2005.htm
Other databases also feature restrictions on subqueries like this, so although this is manifesting as an Informix issue, I'm sure that if we ran this against, say, MySQL, we would get a similar error.
How can I get EclipseLink to honor these restrictions? Is there a better query I should be using?
Instead of:
UPDATE foo_subclass SET x = ?
  WHERE EXISTS(SELECT t0.id
               FROM foo_superclass t0, foo_subclass t1
               WHERE ((t1.x = ?) AND ((t1.id = t0.id) AND (t0.DTYPE = ?))))
do this:
UPDATE foo_subclass SET x = ?
  WHERE foo_subclass.x = ? AND
        EXISTS(SELECT t0.id
               FROM foo_superclass t0
               WHERE ((foo_subclass.id = t0.id) AND (t0.DTYPE = ?)))
Note that on 11.50 you cannot use an alias for foo_subclass; it is allowed in 11.70.
I'm assuming "id" are primary keys (or at least unique identifiers).
Found the answer. It looks like EclipseLink had to handle this case for MySQL, which has similar issues.
The answer is that in your InformixPlatform subclass you need to override the following methods to solve this problem (a consolidated skeleton is sketched at the end of this answer):
supportsLocalTemporaryTables(): this needs to return true
shouldAlwaysUseTempStorageForModifyAll(): this needs to return true
dontBindUpdateAllQueryUsingTempTables(): this needs to return true
getCreateTempTableSqlPrefix(): this needs to return CREATE TEMP TABLE
getCreateTempTableSqlSuffix(): this needs to return WITH NO LOG
isInformixOuterJoin(): needs to return false
getTempTableForTable(DatabaseTable): this needs to do this:
return new DatabaseTable("TL_" + table.getName(), "" /* no table qualifier */, table.shouldUseDelimiters(), this.getStartDelimiter(), this.getEndDelimiter());
In addition, you need to override the following methods as well for proper InformixPlatform behavior:
appendBoolean(Boolean, Writer): the stock Informix platform does not write out boolean literals properly. Yours needs to do this:
if (Boolean.TRUE.equals(booleanValue)) {
    writer.write("'t'");
} else {
    writer.write("'f'");
}
You need to override writeUpdateOriginalFromTempTableSql so that it contains the same code as the H2Platform's override does:
@Override
public void writeUpdateOriginalFromTempTableSql(final Writer writer, final DatabaseTable table,
        final Collection pkFields, final Collection assignedFields) throws IOException {
    writer.write("UPDATE ");
    final String tableName = table.getQualifiedNameDelimited(this);
    writer.write(tableName);
    writer.write(" SET ");
    final int size = assignedFields.size();
    if (size > 1) {
        writer.write("(");
    }
    writeFieldsList(writer, assignedFields, this);
    if (size > 1) {
        writer.write(")");
    }
    writer.write(" = (SELECT ");
    writeFieldsList(writer, assignedFields, this);
    writer.write(" FROM ");
    final String tempTableName = this.getTempTableForTable(table).getQualifiedNameDelimited(this);
    writer.write(tempTableName);
    writeAutoJoinWhereClause(writer, null, tableName, pkFields, this);
    writer.write(") WHERE EXISTS(SELECT ");
    writer.write(((DatabaseField) pkFields.iterator().next()).getNameDelimited(this));
    writer.write(" FROM ");
    writer.write(tempTableName);
    writeAutoJoinWhereClause(writer, null, tableName, pkFields, this);
    writer.write(")");
}
Lastly, your constructor needs to call this.setShouldBindLiterals(false).
With these changes, it seems that Informix is happy.
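Pulled together, such a platform subclass might look roughly like the skeleton below. Treat it as a sketch: the class name is made up, the import paths and exact method signatures are from memory and may need adjusting to your EclipseLink version, and the bodies of appendBoolean and writeUpdateOriginalFromTempTableSql are the ones shown above.
import org.eclipse.persistence.internal.helper.DatabaseTable;
import org.eclipse.persistence.platform.database.InformixPlatform;

public class TempTableInformixPlatform extends InformixPlatform {

    public TempTableInformixPlatform() {
        super();
        // As noted above, literals must be bound for the temp-table strategy to work.
        this.setShouldBindLiterals(false);
    }

    @Override
    public boolean supportsLocalTemporaryTables() { return true; }

    @Override
    public boolean shouldAlwaysUseTempStorageForModifyAll() { return true; }

    @Override
    public boolean dontBindUpdateAllQueryUsingTempTables() { return true; }

    @Override
    public String getCreateTempTableSqlPrefix() { return "CREATE TEMP TABLE "; } // trailing space assumed

    @Override
    public String getCreateTempTableSqlSuffix() { return " WITH NO LOG"; } // leading space assumed

    @Override
    public boolean isInformixOuterJoin() { return false; }

    @Override
    public DatabaseTable getTempTableForTable(final DatabaseTable table) {
        return new DatabaseTable("TL_" + table.getName(), "" /* no table qualifier */,
                table.shouldUseDelimiters(), this.getStartDelimiter(), this.getEndDelimiter());
    }

    // appendBoolean(Boolean, Writer) and
    // writeUpdateOriginalFromTempTableSql(...) as shown earlier in this answer.
}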