I'm trying to save data into BigQuery using the Spark BigQuery connector. Let's say I have a Java POJO like the one below:
@Getter
@Setter
@AllArgsConstructor
@ToString
@Builder
public class TagList {
private String s1;
private List<String> s2;
}
Now when I try to save this POJO into BigQuery, it throws the error below:
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Failed to load to test_table1 in job JobId{project=<project_id>, job=<job_id>, location=US}. BigQuery error was Provided Schema does not match Table <Table_Name>. Field s2 has changed type from STRING to RECORD
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:156)
at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:89)
... 35 more
Sample code:
Dataset<TagList> mapDS = inputDS.map((MapFunction<Row, TagList>) x -> {
List<String> list = new ArrayList<>();
list.add(x.get(0).toString());
list.add("temp1");
return TagList.builder()
.s1("Hello World")
.s2(list).build();
}, Encoders.bean(TagList.class));
mapDS.write().format("bigquery")
.option("temporaryGcsBucket","<bucket_name>")
.option("table", "<table_name>")
.option("project", projectId)
.option("parentProject", projectId)
.mode(SaveMode.Append)
.save();
Big Query Table:
create table <dataset>.<table_name> (
s1 string,
s2 array<string>
)
PARTITION BY
TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);
Please change the intermediateFormat to AVRO or ORC. When using Parquet, the serialization creates an intermediate structure around the list, which BigQuery then reads as a RECORD, hence the reported type change for s2. See more at https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties
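For reference, a minimal sketch of that change applied to the write from the question (the intermediateFormat option and its accepted values come from the connector properties page linked above):
mapDS.write().format("bigquery")
    .option("temporaryGcsBucket", "<bucket_name>")
    .option("table", "<table_name>")
    .option("project", projectId)
    .option("parentProject", projectId)
    // Avro avoids the intermediate Parquet list wrapper that BigQuery reads as a RECORD.
    .option("intermediateFormat", "avro")
    .mode(SaveMode.Append)
    .save();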
Related
I need to load a file into my database, but before that I have to verify, based on some of the file data, whether the data is already present in the database. For instance, if I have 5 records in a file, then I have to check the database 5 times, once for each record.
So how can I pass this value dynamically instead of the hard-coded 2 in preparedStatement.setString(1, "2")?
Here we are creating a Dataflow pipeline with Apache Beam that loads data into the database. We create a pipeline object, build the pipeline, and store the data read from the database in a PCollection.
Pipeline p = Pipeline.create(options);
p.apply("Reading Text", TextIO.read().from(options.getInputFile()))
.apply(ParDo.of(new FilterHeaderFn(csvHeader)))
.apply(ParDo.of(new GetRatePlanID()))
.apply("Format Result", MapElements.into(
TypeDescriptors.strings()).via(
(KV<String, Integer> ABC) ->
ABC.getKey() + "," + ABC.getValue()))
.apply("Write File", TextIO.write()
.to(options.getOutputFile())
.withoutSharding());
// Retrieving data from database
PCollection<String> data =
p.apply(JdbcIO.<String>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.cj.jdbc.Driver", "jdbc:mysql://localhost:3306/XYZ")
.withUsername("root")
.withPassword("root1234"))
.withQuery("select * from xyz where z = ?")
.withCoder(StringUtf8Coder.of())
.withStatementPreparator(new JdbcIO.StatementPreparator() {
private static final long serialVersionUID = 1L;
@Override
public void setParameters(PreparedStatement preparedStatement) throws Exception {
preparedStatement.setString(1, "2");
}
})
.withRowMapper(new JdbcIO.RowMapper<String>() {
private static final long serialVersionUID = 1L;
public String mapRow(ResultSet resultSet) throws Exception {
return "Symbol: " + resultSet.getInt(1) + "\nPrice: " + resultSet.getString(2) +
"\nCompany: " + resultSet.getInt(3);
}
}));
As suggested, the most efficient approach would probably be to load the whole file into a temporary table and then run a query to update the requisite rows.
If that can't be done, you could instead read the table into Dataflow (i.e. "select * from xyz") and then do a join/CoGroupByKey to match records with those found in your file. If you expect the existing database to be very large compared to the files you're hoping to upload into it, you could have a DoFn that makes queries to your database directly using JDBC (possibly caching the connection in the DoFn's setUp method) rather than using JdbcIO.
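As a hedged illustration of that last option (not part of the original answer), here is a minimal sketch of a DoFn that caches a JDBC connection in setUp and parameterizes the query per element; the count query, the assumption that each element is the z value parsed from a file record, and the reused connection details from the snippet above are all illustrative choices:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.beam.sdk.transforms.DoFn;

class ExistsInDbFn extends DoFn<String, String> {
    private transient Connection connection;

    @Setup
    public void setUp() throws Exception {
        // Cache one connection per DoFn instance instead of opening one per element.
        connection = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/XYZ", "root", "root1234");
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        String z = c.element(); // value parsed from the current file record
        try (PreparedStatement ps =
                 connection.prepareStatement("select count(*) from xyz where z = ?")) {
            ps.setString(1, z); // dynamic value instead of the hard-coded "2"
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next() && rs.getInt(1) > 0) {
                    c.output(z); // the record already exists in the database
                }
            }
        }
    }

    @Teardown
    public void tearDown() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}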
I'm using a save expression on an encrypted attribute named transactionAmount while updating data in DynamoDB. However, the update is failing with a ConditionalCheckFailedException. The data was encrypted on the client side during the initial persistence into DynamoDB, in the same way as described here. Following is the code:
Data Transfer Object:
public final class SampleDTO {
@DynamoDBHashKey(attributeName = CommonDynamoDBSchemaConstants.UNIQUE_KEY)
@Getter(onMethod = @__({ @DoNotTouch }))
private String uniqueKey;
@DynamoDBAttribute(attributeName = CommonDynamoDBSchemaConstants.EVENT_RUNNING_TIME_EPOCH)
@Getter(onMethod = @__({ @DoNotTouch }))
private Long eventRunningTimeInEpoch;
@DynamoDBAttribute(attributeName = CommonDynamoDBSchemaConstants.INSTRUMENT_TYPE)
@DynamoDBTypeConverted(converter = InstrumentTypeConverter.class)
@Getter(onMethod = @__({ @DoNotTouch }))
private InstrumentType instrumentType;
@DynamoDBAttribute(attributeName = CommonDynamoDBSchemaConstants.TRANSACTION_AMOUNT)
private String transactionAmount;
}
Data Access Code:
// fetches data from dynamoDB based on unique key passed to it.
SampleDTO sampleDTO = getSampleDTO("testLedgerUniqueKey");
sampleDTO.setInstrumentType(InstrumentType.MACHINE);
DynamoDBSaveExpression saveExpression = new DynamoDBSaveExpression();
Map<String, ExpectedAttributeValue> expressionAttributeValues =
new HashMap<String, ExpectedAttributeValue>();
expressionAttributeValues.put(
CommonDynamoDBSchemaConstants.LEDGER_UNIQUE_KEY,
new ExpectedAttributeValue(true)
.withValue(new AttributeValue(sampleDTO.getLedgerUniqueKey())));
expressionAttributeValues.put(
CommonDynamoDBSchemaConstants.TRANSACTION_AMOUNT,
new ExpectedAttributeValue(true).withValue(
new AttributeValue(sampleDTO.getTransactionAmount())));
saveExpression.setExpected(expressionAttributeValues);
saveExpression.setConditionalOperator(ConditionalOperator.AND);
dynamoDBMapper.save(sampleDTO, saveExpression, null /*dynamoDBMapperConfig*/);
ConditionalCheckFailedException:
This means you are trying to update a record that does not exist or does not satisfy your condition. Please verify your condition to make sure it matches an existing record.
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes
You specified a condition that evaluated to false. For example, you
might have tried to perform a conditional update on an item, but the
actual value of the attribute did not match the expected value in the
condition.
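As a hedged illustration (not part of the original answer): if transactionAmount was stored as client-side-encrypted ciphertext, an expected-value check against the plaintext held in the DTO will evaluate to false, which matches the description above. A minimal sketch that conditions only on the hash key, reusing the sampleDTO, dynamoDBMapper, and constants from the question:
DynamoDBSaveExpression saveExpression = new DynamoDBSaveExpression();
Map<String, ExpectedAttributeValue> expected = new HashMap<>();
// Expect only the key attribute; do not compare the encrypted transactionAmount value.
expected.put(
    CommonDynamoDBSchemaConstants.UNIQUE_KEY,
    new ExpectedAttributeValue(true)
        .withValue(new AttributeValue(sampleDTO.getUniqueKey())));
saveExpression.setExpected(expected);
dynamoDBMapper.save(sampleDTO, saveExpression, null /*dynamoDBMapperConfig*/);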
Hope it helps.
Is there a way of dynamically building a cypher query using spring data neo4j?
I have a cypher query that filters my entities similar to this one:
@Query("MATCH (n:Product) WHERE n.name IN {0} return n")
List<Product> findProductsWithNames(List<String> names);
@Query("MATCH (n:Product) return n")
List<Product> findProductsWithNames();
When the names list is empty or null, I just want to return all products. Therefore my service implementation checks the names list and calls the appropriate repository method. The given example looks clean, but it really gets ugly once the Cypher statements are more complex and the code starts to repeat itself.
You can create your own dynamic Cypher queries and use Neo4jOperations to execute them. Here is an example (with a query different from your OP) that I think can illustrate how to do that:
@Autowired
Neo4jOperations template;
public User findBySocialUser(String providerId, String providerUserId) {
String query = "MATCH (n:SocialUser{providerId:{providerId}, providerUserId:{providerUserId}})<-[:HAS]-(user) RETURN user";
final Map<String, Object> paramsMap = ImmutableMap.<String, Object>builder().
put("providerId", providerId).
put("providerUserId", providerUserId).
build();
Map<String, Object> result = template.query(query, paramsMap).singleOrNull();
return (result == null) ? null : (User) template.getDefaultConverter().convert(result.get("user"), User.class);
}
Hope it helps
Handling paging is also possible this way:
@Test
@SuppressWarnings("unchecked")
public void testQueryBuilding() {
String query = "MATCH (n:Product) return n";
Result<Map<String, Object>> result = neo4jTemplate.query(query, Collections.emptyMap());
for (Map<String, Object> r : result.slice(1, 3)) {
Product product = (Product) neo4jTemplate.getDefaultConverter().convert(r.get("n"), Product.class);
System.out.println(product.getUuid());
}
}
I need to implement an MR job that accesses data from both an HBase table and HDFS files. E.g., the mapper reads data from the HBase table and from HDFS files; these data share the same primary key but have different schemas. A reducer then joins all columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to only work for multiple HDFS data sources. Please let me know if you have some ideas. Sample code would be great.
After a few days of investigation (and get help from HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {
public static class Map extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String s = value.toString();
String[] sa = s.split(",");
if (sa.length == 2) {
context.write(new Text(sa[0]), new Text(sa[1]));
}
}
}
public static class TableMap extends TableMapper<Text, Text> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "c1".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String key = Bytes.toString(row.get());
String val = new String(value.getValue(CF, ATTR1));
context.write(new Text(key), new Text(val));
}
}
public static class Reduce extends Reducer<Object, Text, Object, Text> {
public void reduce(Object key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String ks = key.toString();
for (Text val : values){
context.write(new Text(ks), val);
}
}
}
public static void main(String[] args) throws Exception {
Path inputPath1 = new Path(args[0]);
Path inputPath2 = new Path(args[1]);
Path outputPath = new Path(args[2]);
String tableName = "test";
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MixMR.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.addFamily(Bytes.toBytes("cf"));
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
TableMap.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setReducerClass(Reduce.class); // reducer class
job.setOutputFormatClass(TextOutputFormat.class);
// the path passed for the HBase TableInputFormat (inputPath2) has no effect; the table is set by initTableMapperJob above
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);
}
}
There is no OOTB feature that supports this. A possible workaround could be to Scan your HBase table and write the Results to an HDFS file first, and then do the reduce-side join using MultipleInputs (a rough sketch of that dump step follows after the Pig sample below). But this will incur some additional I/O overhead.
A Pig script or Hive query can do that easily.
Sample Pig script:
tbl = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:* ...', '-loadKey true -limit 5')
AS (id:bytearray, info_map:map[],...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...
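For the scan-and-dump workaround mentioned above, here is a rough Java sketch (not from the original answers, written against the older HBase client API used in the MR example) that writes the scanned cf:c1 column to a text file MultipleInputs can read; the output path /tmp/hbase_dump and the key,value layout are assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseToHdfsDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test"); // same table name as in the MR example above
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c1"));
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical dump location; feed it to MultipleInputs as a TextInputFormat path afterwards.
        Path out = new Path("/tmp/hbase_dump/part-00000");
        try (FSDataOutputStream os = fs.create(out);
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                String key = Bytes.toString(r.getRow());
                String val = Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c1")));
                os.writeBytes(key + "," + val + "\n"); // same key,value layout the text mapper expects
            }
        } finally {
            table.close();
        }
    }
}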
I have a Product table which has a related table, Images, with a 1:M relation.
class Product {
private Integer productId;
private String productName;
....
....
....
private List<Image> productImageList;
....
....
....
}
class Image {
private Integer imageId;
private String imageName;
}
class ProductLite {
private Integer productId;
private String productName;
private String imageName;
}
I am trying a JPQL query to fetch the products and the first image from the productImageList, returning a ProductLite object using the NEW constructor expression.
@TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
public List<ProductLite> getAllProductLite() {
Query q = em.createQuery("SELECT NEW com.mycomp.application.entity.ProductLite(p.productId, p.productName, p.productImageList.get(0).getImageName())"
+ " from Product p"
+ " ORDER by p.productName");
List<ProductLite> prods = q.getResultList();
return prods;
}
But for some reason I am not able to get it to work; I get a NoViableAltException. So I tried moving the logic of getting the first image (a getImage() method) into the Product entity so that in the query I could just call getImage(). Even that does not seem to work.
java.lang.IllegalArgumentException: An exception occurred while creating a query in EntityManager:
Exception Description: Syntax error parsing the query [SELECT NEW com.meera.application.entity.ProductLite(distinct p.productId, p.productName, p.getImage()) from Product p, IN(p.productImageList) pil where p.category.categoryCode = :categoryCode ORDER by p.productName ], line 1, column 52: unexpected token [distinct].
Internal Exception: NoViableAltException(23#[452:1: constructorItem returns [Object node] : (n= scalarExpression | n= aggregateExpression );])
Any help is appreciated.
First, you cannot call methods of the entity class from your JPQL query. Second, to rely on the order of entities in the list, you need a persisted order.
To create an order column in the join table between Image and Product, you have to add the
@OrderColumn annotation to productImageList. For example:
@OrderColumn(name = "myimage_order")
// or don't define a name and let it default to productImageList_order
@OneToMany
private List<Image> productImageList;
Then you have to modify the query to use that order to choose only the first image:
SELECT NEW com.mycomp.application.entity.ProductLite(
p.productId, p.productName, pil.imageName)
FROM Product p JOIN p.productImageList pil
WHERE INDEX(pil) = 0
ORDER by p.productName
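For completeness, a hedged sketch (assumed, not from the original answer) of how the corrected JPQL could slot into the getAllProductLite() method from the question, using a TypedQuery:
@TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
public List<ProductLite> getAllProductLite() {
    // Join the ordered image list and keep only the first image per product (INDEX(pil) = 0).
    TypedQuery<ProductLite> q = em.createQuery(
        "SELECT NEW com.mycomp.application.entity.ProductLite(p.productId, p.productName, pil.imageName)"
        + " FROM Product p JOIN p.productImageList pil"
        + " WHERE INDEX(pil) = 0"
        + " ORDER BY p.productName", ProductLite.class);
    return q.getResultList();
}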