I have a requirement where I have to write a PCollection of Strings to Cloud SQL using the Cloud Dataflow API.
pipeline.apply(TextIO.read().from("gs://***/sampleBigtable.csv"))
    .apply(JdbcIO.write()
        .withDataSourceConfiguration(DataSourceConfiguration
            .create("org.postgresql.Driver", "jdbc:postgresql://***:5432/test")
            .withUsername("**").withPassword("password10"))
        .withStatement("insert into person values(?,?)")
        .withPreparedStatementSetter(
            new JdbcIO.PreparedStatementSetter<Object>() {

                private static final long serialVersionUID = 1L;

                @Override
                public void setParameters(Object arg0, PreparedStatement query)
                        throws Exception {
                    query.setString(1, "Hello");
                    query.setString(2, "Hi");
                }
            }));
This is the sample code I am trying, a very simple version of what I want to do.
Also, is it feasible to write into Cloud SQL from Dataflow using a ParDo and issuing simple insert statements?
The previous transform outputs a PCollection<String>, so you need to specify String as the input type with JdbcIO.<String>write().
Something like this:
pipeline
    .apply(TextIO.read().from("gs://***/sampleBigtable.csv"))
    .apply(JdbcIO.<String>write()
        .withDataSourceConfiguration(DataSourceConfiguration
            .create("org.postgresql.Driver", "jdbc:postgresql://***:5432/test")
            .withUsername("**")
            .withPassword("password10"))
        .withStatement("insert into person values(?,?)")
        .withPreparedStatementSetter((element, query) -> {
            query.setInt(1, 1);
            query.setString(2, "Hello");
        }));
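If you want the inserted values to come from the CSV line itself rather than hard-coded constants, the setter can parse the element. This is only a sketch, assuming each line looks like "id,name"; the parsing logic is illustrative, not from the original answer:

pipeline
    .apply(TextIO.read().from("gs://***/sampleBigtable.csv"))
    .apply(JdbcIO.<String>write()
        .withDataSourceConfiguration(DataSourceConfiguration
            .create("org.postgresql.Driver", "jdbc:postgresql://***:5432/test")
            .withUsername("**")
            .withPassword("password10"))
        .withStatement("insert into person values(?,?)")
        .withPreparedStatementSetter((element, query) -> {
            // assumes each CSV line looks like "1,Hello"
            String[] fields = element.split(",");
            query.setInt(1, Integer.parseInt(fields[0]));
            query.setString(2, fields[1]);
        }));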
I want to run OCR (Tesseract) on AWS Lambda using Java. I wanted to have a basic "Hello World" setup, so I created the handler function for AWS Lambda like so:
public class Hello implements RequestHandler<Object, String> {

    @Override
    public String handleRequest(Object input, Context context) {
        context.getLogger().log("Inside handleRequest of Hello class PPPPp ");
        context.getLogger().log("Input: " + input);
        String output = "Hello, " + input + "!";
        return output;
    }
}
This is working, as I have tested it on the AWS Lambda console. Tesseract is also running on my local machine; I tested it by passing an image and it prints the words present.
But now I do not understand how to combine these two (AWS Lambda and Tesseract) into a single Java program. I found this link, which runs Tesseract on Python - https://typless.com/tesseract-on-aws-lambda-ocr-as-a-service/ - and tried to do the equivalent in Java, but was not able to, since they were manipulating .py files and I am dealing with .java files.
I also came across this GitHub link - https://github.com/tesseract-ocr/tesseract - present in the AWS Lambda documentation, and this one - https://github.com/tesseract-ocr/tesseract - but didn't understand how to use the code for my requirement.
I am stuck here and new to AWS Lambda. Any help is very much appreciated. Thanks in advance.
Please see the example at https://github.com/jlcorradi/tesseract-example/blob/e85b93f3481109c56bb0e7255e0116691463693f/src/main/java/com/playground/tesseract/TesseractPlaygroundLambdaHandle.java
@Log4j
public class TesseractPlaygroundLambdaHandle implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent input, Context context) {
        String inputBody = input.getBody();
        Tesseract tesseract = new Tesseract();
        //tesseract.setDatapath("/home/jorgecorradi/Downloads/");
        try {
            String decodedText = tesseract.doOCR(decodeToImage(inputBody));
            return createResponse(decodedText);
        } catch (TesseractException e) {
            e.printStackTrace();
            return createResponse(e.getMessage());
        }
    }

    private APIGatewayProxyResponseEvent createResponse(String responseContent) {
        APIGatewayProxyResponseEvent responseEvent = new APIGatewayProxyResponseEvent();
        responseEvent.setBody(responseContent);
        responseEvent.setStatusCode(200);
        return responseEvent;
    }

    public static BufferedImage decodeToImage(String imageString) {
        BufferedImage image = null;
        try {
            // decode the base64 request body into raw image bytes
            byte[] imageByte = Base64.getDecoder().decode(imageString);
            ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
            image = ImageIO.read(bis);
            bis.close();
        } catch (Exception e) {
            log.error("Error decoding base64 to image", e);
        }
        return image;
    }
}
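For a quick local check, the handler above expects a base64-encoded image as the request body. Here is a minimal sketch that produces such a body; the file name is just an example, not from the original post:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class EncodeImage {
    public static void main(String[] args) throws Exception {
        // read an image from disk and print its base64 encoding,
        // which can be pasted as the APIGatewayProxyRequestEvent body
        byte[] imageBytes = Files.readAllBytes(Paths.get("sample.png"));
        System.out.println(Base64.getEncoder().encodeToString(imageBytes));
    }
}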
You will need a project/dependency management tool such as Maven to allow you to specify dependencies (for Tesseract) and to create a standalone jar that you can then upload to your Lambda.
This is in its simplest form.
See https://docs.aws.amazon.com/lambda/latest/dg/java-package.html for how to deploy a standalone zip/jar to Lambda.
I have written code to load a CSV file from GCS into BigQuery with a hardcoded ProjectID, Dataset, Table name, and GCS temp & staging location.
I am looking for code that reads the following
ProjectID
Dataset
Table name
GCS Temp & Staging location parameters
from a BigQuery table (dynamic parameters).
Code:
public class DemoPipeline {

    public static TableReference getGCDSTableReference() {
        TableReference ref = new TableReference();
        ref.setProjectId("myprojectbq");
        ref.setDatasetId("DS_Emp");
        ref.setTableId("emp");
        return ref;
    }

    static class TransformToTable extends DoFn<String, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String input = c.element();
            String[] s = input.split(",");
            TableRow row = new TableRow();
            row.set("id", s[0]);
            row.set("name", s[1]);
            c.output(row);
        }
    }

    public interface MyOptions extends PipelineOptions {
        /*
         * Param
         */
    }

    public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        options.setTempLocation("gs://demo-xxxxxx/temp");
        Pipeline p = Pipeline.create(options);

        PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-xxxxxx/student.csv"));
        PCollection<TableRow> rows = lines.apply("Transform To Table", ParDo.of(new TransformToTable()));
        rows.apply("Write To Table", BigQueryIO.writeTableRows().to(getGCDSTableReference())
                //.withSchema(BQTableSemantics.getGCDSTableSchema())
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        p.run();
    }
}
Even to read from an initial table (project ID / dataset / table names) where the other data is contained, you need to hardcode that information somewhere. A properties file, as Haris recommended, is a good approach; look at the following suggestions:
Java properties file. Used when parameters have to be changed or tuned; in general, changes that don't require recompilation. It's a file that has to live with or be attached to your Java classes. Reading this file from GCS is feasible but an odd option.
Pipeline execution parameters. Custom parameters can be a workaround for your question; please check Creating Custom Options to understand how this can be accomplished. Here is a small example.
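A minimal sketch of what such custom options might look like, with the hardcoded values from the question turned into defaults; the option names are illustrative, not from the original code:

// Custom options so ProjectID, Dataset, Table and the GCS temp location can be passed at launch time.
public interface MyOptions extends PipelineOptions {
    @Description("BigQuery project id")
    @Default.String("myprojectbq")
    String getBqProject();
    void setBqProject(String value);

    @Description("BigQuery dataset")
    @Default.String("DS_Emp")
    String getBqDataset();
    void setBqDataset(String value);

    @Description("BigQuery table")
    @Default.String("emp")
    String getBqTable();
    void setBqTable(String value);

    @Description("GCS temp location")
    @Default.String("gs://demo-xxxxxx/temp")
    String getGcsTempLocation();
    void setGcsTempLocation(String value);
}

// Usage inside main(): build the TableReference from the options instead of hardcoding it.
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
TableReference ref = new TableReference();
ref.setProjectId(options.getBqProject());
ref.setDatasetId(options.getBqDataset());
ref.setTableId(options.getBqTable());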
I have 3 methods, each of which executes a SQL query and returns a list of results. I want to execute all 3 methods in parallel, so one does not have to wait for another to complete. I saw one Stack Overflow post but it is not working for me.
The links are How to execute multiple queries in parallel instead of sequentially? and Execute multiple queries in parallel via Streams.
I want to solve this using Java 8 features, but please tell me how I can call my multiple methods using the approach from those links.
Execute multiple queries in parallel via Streams works for your task. Here is sample code that demonstrates it:
public static void main(String[] args) {
    // Create Stream of tasks:
    Stream<Supplier<List<String>>> tasks = Stream.of(
            () -> getServerListFromDB(),
            () -> getAppListFromDB(),
            () -> getUserFromDB());

    List<List<String>> lists = tasks
            // Supply all the tasks for execution and collect CompletableFutures
            .map(CompletableFuture::supplyAsync).collect(Collectors.toList())
            // Join all the CompletableFutures to gather the results
            .stream()
            .map(CompletableFuture::join).collect(Collectors.toList());

    System.out.println(lists);
}

private static List<String> getUserFromDB() {
    try {
        TimeUnit.SECONDS.sleep((long) (Math.random() * 3));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName() + " getUser");
    return Arrays.asList("User1", "User2", "User3");
}

private static List<String> getAppListFromDB() {
    try {
        TimeUnit.SECONDS.sleep((long) (Math.random() * 3));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName() + " getAppList");
    return Arrays.asList("App1", "App2", "App3");
}

private static List<String> getServerListFromDB() {
    try {
        TimeUnit.SECONDS.sleep((long) (Math.random() * 3));
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println(Thread.currentThread().getName() + " getServer");
    return Arrays.asList("Server1", "Server2", "Server3");
}
The output is:
ForkJoinPool.commonPool-worker-1 getServer
ForkJoinPool.commonPool-worker-3 getUser
ForkJoinPool.commonPool-worker-2 getAppList
[[Server1, Server2, Server3], [App1, App2, App3], [User1, User2, User3]]
You can see that the default ForkJoinPool.commonPool is used and each get* method is executed on a separate thread in this pool. You just need to run your SQL queries inside these get* methods.
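For example, getUserFromDB could run a real query instead of sleeping. This is only a sketch; the JDBC URL, credentials, and SQL are placeholder assumptions, not from the original post:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

private static List<String> getUserFromDB() {
    List<String> users = new ArrayList<>();
    // placeholder connection details and query
    try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/test", "user", "password");
         PreparedStatement ps = conn.prepareStatement("SELECT name FROM users");
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            users.add(rs.getString("name"));
        }
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
    return users;
}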
The solution from Alex won't work in the expected way: CompletableFuture executes effects one after another. You will have to use parallel streams for this, or combine CompletableFuture with parallelStream.
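A minimal sketch of the parallel-stream variant this comment refers to, reusing the imports and get* methods from the answer above:

List<List<String>> lists = Stream.<Supplier<List<String>>>of(
            () -> getServerListFromDB(),
            () -> getAppListFromDB(),
            () -> getUserFromDB())
        .parallel()              // run the suppliers on the common ForkJoinPool
        .map(Supplier::get)      // invoke each query
        .collect(Collectors.toList());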
I am using an in-memory database (via ServiceStack.OrmLite.Sqlite.Windows) for unit testing in a ServiceStack-based web API. I want to test the service endpoints that depend on stored procedures through the in-memory database, for which I have gone through the link Servicestack Ormlite SqlServerProviderTests. The unit test class I am using is as follows:
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using NUnit.Framework;
using ServiceStack.Text;
using ServiceStack.Configuration;
using ServiceStack.Data;

namespace ServiceStack.OrmLite.Tests
{
    public class DummyTable
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    [TestFixture]
    public class SqlServerProviderTests
    {
        private IDbConnection db;
        protected readonly ServiceStackHost appHost;

        public SqlServerProviderTests()
        {
            appHost = TestHelper.SetUp(appHost).Init();
            db = appHost.Container.Resolve<IDbConnectionFactory>().OpenDbConnection("inventoryDb");
            if (bool.Parse(System.Configuration.ConfigurationManager.AppSettings["IsMock"]))
                TestHelper.CreateInMemoryDB(appHost);
        }

        [TestFixtureTearDown]
        public void TearDown()
        {
            db.Dispose();
        }

        [Test]
        public void Can_SqlColumn_StoredProc_returning_Column()
        {
            var sql = @"CREATE PROCEDURE dbo.DummyColumn
                @Times integer
                AS
                BEGIN
                SET NOCOUNT ON;
                CREATE TABLE #Temp
                (
                    Id integer NOT NULL,
                );
                declare @i int
                set @i=1
                WHILE @i < @Times
                BEGIN
                    INSERT INTO #Temp (Id) VALUES (@i)
                    SET @i = @i + 1
                END
                SELECT * FROM #Temp;
                DROP TABLE #Temp;
                END;";

            db.ExecuteSql("IF OBJECT_ID('DummyColumn') IS NOT NULL DROP PROC DummyColumn");
            db.ExecuteSql(sql);

            var expected = 0;
            10.Times(i => expected += i);

            var results = db.SqlColumn<int>("EXEC DummyColumn @Times", new { Times = 10 });
            results.PrintDump();
            Assert.That(results.Sum(), Is.EqualTo(expected));

            results = db.SqlColumn<int>("EXEC DummyColumn 10");
            Assert.That(results.Sum(), Is.EqualTo(expected));

            results = db.SqlColumn<int>("EXEC DummyColumn @Times", new Dictionary<string, object> { { "Times", 10 } });
            Assert.That(results.Sum(), Is.EqualTo(expected));
        }
    }
}
When I tried to execute this against the live DB it worked fine, but when I tried it against the in-memory DB I got exceptions as follows,
System.Data.SQLite.SQLiteException : SQL logic error or missing database near "IF": syntax error
near the code line,
db.ExecuteSql("IF OBJECT_ID('DummyColumn') IS NOT NULL DROP PROC DummyColumn");
I commented out the above line and executed the test case, but I am still getting an exception as follows,
System.Data.SQLite.SQLiteException : SQL logic error or missing database near "IF": syntax error
for the code line,
db.ExecuteSql(sql);
The in-memory DB is created as follows, and it works fine for the remaining cases:
public static void CreateInMemoryDB(ServiceStackHost appHost)
{
    using (var db = appHost.Container.Resolve<IDbConnectionFactory>().OpenDbConnection("ConnectionString"))
    {
        db.DropAndCreateTable<DummyData>();
        TestDataReader<TableList>("Reservation.json", "InMemoryInput").Reservation.ForEach(x => db.Insert(x));
        db.DropAndCreateTable<DummyTable>();
    }
}
Why are we facing this exception? Is there any other way to add and run stored procedures in an in-memory DB with SQLite?
The error is because you're trying to run SQL Server-specific queries (T-SQL) against an in-memory version of SQLite, i.e. a completely different, embeddable database. As the name suggests, SqlServerProviderTests only works on SQL Server; I'm confused why you would try to run this against SQLite.
SQLite doesn't support stored procedures, T-SQL, etc., so trying to execute SQL Server T-SQL statements will always result in an error. The only thing you can do is fake it with a custom Exec Filter, where you can catch the exception and return whatever custom result you like, e.g.:
public class MockStoredProcExecFilter : OrmLiteExecFilter
{
    public override T Exec<T>(IDbConnection dbConn, Func<IDbCommand, T> filter)
    {
        try
        {
            return base.Exec(dbConn, filter);
        }
        catch (Exception ex)
        {
            if (dbConn.GetLastSql() == "exec sp_name @firstName, @age")
                return (T)(object)new Person { FirstName = "Mocked" };
            throw;
        }
    }
}
OrmLiteConfig.ExecFilter = new MockStoredProcExecFilter();
I am using InstanceQuery (SQL queries) to construct my Instances, but my query results do not always come back in the same order, which is normal for SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information to the Instances construction? Is something like the workflow below possible?
Get pre-prepared header information from an ARFF file or another place.
Give this header information to the Instances construction.
Call the SQL function and get Instances (header + data).
I am using the following SQL function to get instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql,
        String pInstanceRelationName) {
    try {
        DatabaseUtils utils = new DatabaseUtils();
        InstanceQuery query = new InstanceQuery();
        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);
        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
I tried various approaches to my problem, but it seems that the Weka internal API does not allow a solution to this problem right now. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.
Accordingly, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values. I read this file with the following code.
public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return data;
}
After that, I use the following code to create the instances. I had to use a StringBuilder and the string values of each instance, then save the resulting string to a file.
public static void main(String[] args) {
    Instances SampleInstance = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(SampleInstance);

    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
    MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
    DataSource source2 = new DataSource(data2);

    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2)
                    .toString();
            sb.append(elementAsString);
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My save-instances-to-file code is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
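The main method above also calls saveInstancesToFile with an Instances argument; that overload is not shown in the post. Presumably it just delegates to the String version via the ARFF text that Instances.toString() produces, something like this sketch:

public static void saveInstancesToFile(Instances instances, String filename) {
    // Instances.toString() renders the header plus the data in ARFF format
    saveInstancesToFile(instances.toString(), filename);
}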
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with the Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning).
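getValues is not shown in the answer; per the comment in the code, it returns the attribute's nominal values as a comma-separated String. A minimal sketch of such a helper:

private static String getValues(Attribute attribute) {
    // join the attribute's nominal values with commas, e.g. "tcp,udp"
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < attribute.numValues(); i++) {
        if (i > 0) {
            sb.append(",");
        }
        sb.append(attribute.value(i));
    }
    return sb.toString();
}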