Creating Parquet Table in Apache Drill - amazon-web-services
I am currently running Apache Drill on a 20-node cluster and have run into an error that I am hoping you can help me with.
I am attempting to run the following query to create a Parquet table in a new S3 bucket from another table that is stored in TSV format:
create table s3_output.tmp.`<output file>` as select
columns[0], columns[1], columns[2], columns[3], columns[4], columns[5], columns[6], columns[7], columns[8], columns[9],
columns[10], columns[11], columns[12], columns[13], columns[14], columns[15], columns[16], columns[17], columns[18], columns[19],
columns[20], columns[21], columns[22], columns[23], columns[24], columns[25], columns[26], columns[27], columns[28], columns[29],
columns[30], columns[31], columns[32], columns[33], columns[34], columns[35], columns[36], columns[37], columns[38], columns[39],
columns[40], columns[41], columns[42], columns[43], columns[44], columns[45], columns[46], columns[47], columns[48], columns[49],
columns[50], columns[51], columns[52], columns[53], columns[54], columns[55], columns[56], columns[57], columns[58], columns[59],
columns[60], columns[61], columns[62], columns[63], columns[64], columns[65], columns[66], columns[67], columns[68], columns[69],
columns[70], columns[71], columns[72], columns[73], columns[74], columns[75], columns[76], columns[77], columns[78], columns[79],
columns[80], columns[81], columns[82], columns[83], columns[84], columns[85], columns[86], columns[87], columns[88], columns[89],
columns[90], columns[91], columns[92], columns[93], columns[94], columns[95], columns[96], columns[97], columns[98], columns[99],
columns[100], columns[101], columns[102], columns[103], columns[104], columns[105], columns[106], columns[107], columns[108], columns[109],
columns[110], columns[111], columns[112], columns[113], columns[114], columns[115], columns[116], columns[117], columns[118], columns[119],
columns[120], columns[121], columns[122], columns[123], columns[124], columns[125], columns[126], columns[127], columns[128], columns[129],
columns[130], columns[131], columns[132], columns[133], columns[134], columns[135], columns[136], columns[137], columns[138], columns[139],
columns[140], columns[141], columns[142], columns[143], columns[144], columns[145], columns[146], columns[147], columns[148], columns[149],
columns[150], columns[151], columns[152], columns[153], columns[154], columns[155], columns[156], columns[157], columns[158], columns[159],
columns[160], columns[161], columns[162], columns[163], columns[164], columns[165], columns[166], columns[167], columns[168], columns[169],
columns[170], columns[171], columns[172], columns[173] from s3input.`<input path>*.gz`;
This is the error output I get while running this query.
Error: DATA_READ ERROR: Error processing input: , line=2026, char=2449781. Content parsed: [ ]
Failure while reading file s3a://.gz. Happened at or shortly before byte position 329719.
Fragment 1:19
[Error Id: fe289e19-c7b7-4739-9960-c15b8a62af3b on :31010] (state=,code=0)
Do you have any idea how I can go about trying to solve this issue?
Related
Terraform: split function in output gives error when count is used while creating a resource
I am creating a resource using count. When I use the split function in an output, it gives an error, while a normal output that does not use split works fine. I am running with stack=dev right now; I expect the resource not to be created on the dev stack, but it should be created on the prod stack. Below is the piece of code which gives the error:

data "aws_cloudformation_stack" "some_name" {
  count = (local.stack == "dev" ? 0 : 1)
  name  = "${local.stack}_some_name"
}

output "public_alb_subnets" {
  value = split(",", "${data.aws_cloudformation_stack.some_name[*].outputs["PublicElbSubnets"]}")
}

It gives me this error:

Error: Invalid function argument

  on managed_alb.tf line 138, in output "public_alb_subnets":
 138:   value = split(",", "${data.aws_cloudformation_stack.some_name[*].outputs["PublicElbSubnets"]}")
    |----------------
    | data.aws_cloudformation_stack.some_name is empty tuple

Invalid value for "str" parameter: string required.

However, the following works:

output "public_alb_security_groups" {
  value = [
    data.aws_cloudformation_stack.some_name[*].outputs["PublicElbSecurityGroup"],
    data.aws_cloudformation_stack.some_name[*].outputs["InternalElbSecurityGroup"]
  ]
}

I have tried many different options from the web, but none of them worked. What am I doing wrong here? Even using count.index or 0 in place of * doesn't work.
You have to make your output conditional as well, based on your dev or prod environment:

output "public_alb_subnets" {
  value = length(data.aws_cloudformation_stack.some_name) > 0 ? split(",", "${data.aws_cloudformation_stack.some_name[*].outputs["PublicElbSubnets"]}") : null
}
File to DB load using Apache Beam
I need to load a file into my database, but before that I have to verify, based on some of the file's data, whether the data is already present in the database. For instance, if I have 5 records in a file, then I have to check the database 5 times for the separate records. So how can I get this value dynamically? We have to pass a dynamic value instead of the hard-coded "2" in the line preparedStatement.setString(1, "2"). Here we are creating a Dataflow pipeline which loads data into the database using Apache Beam. We create a pipeline object, build the pipeline, and then read from the database into a PCollection using JdbcIO.

Pipeline p = Pipeline.create(options);

p.apply("Reading Text", TextIO.read().from(options.getInputFile()))
    .apply(ParDo.of(new FilterHeaderFn(csvHeader)))
    .apply(ParDo.of(new GetRatePlanID()))
    .apply("Format Result", MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Integer> ABC) -> ABC.getKey() + "," + ABC.getValue()))
    .apply("Write File", TextIO.write()
        .to(options.getOutputFile())
        .withoutSharding());

// Retrieving data from the database
PCollection<String> data = p.apply(JdbcIO.<String>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.cj.jdbc.Driver", "jdbc:mysql://localhost:3306/XYZ")
        .withUsername("root")
        .withPassword("root1234"))
    .withQuery("select * from xyz where z = ?")
    .withCoder(StringUtf8Coder.of())
    .withStatementPreparator(new JdbcIO.StatementPreparator() {
        private static final long serialVersionUID = 1L;

        @Override
        public void setParameters(PreparedStatement preparedStatement) throws Exception {
            preparedStatement.setString(1, "2");
        }
    })
    .withRowMapper(new JdbcIO.RowMapper<String>() {
        private static final long serialVersionUID = 1L;

        public String mapRow(ResultSet resultSet) throws Exception {
            return "Symbol: " + resultSet.getInt(1)
                + "\nPrice: " + resultSet.getString(2)
                + "\nCompany: " + resultSet.getInt(3);
        }
    }));
As suggested, the most efficient approach would probably be to load the whole file into a temporary table and then run a query to update the requisite rows.

If that can't be done, you could instead read the table into Dataflow (i.e. "select * from xyz") and then do a join/CoGroupByKey to match records with those found in your file.

If you expect the existing database to be very large compared to the files you're hoping to upload into it, you could have a DoFn that makes queries to your database directly using JDBC (possibly caching the connection in the DoFn's setUp method) rather than using JdbcIO.
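A minimal sketch of that last approach, reusing the connection details and query shape from the question (the class name, output type, and existence check are illustrative assumptions, not a definitive implementation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical sketch: checks each file record against the database over a
// JDBC connection that is opened once in @Setup and reused for every element.
class CheckRecordExistsFn extends DoFn<String, KV<String, Boolean>> {
    private transient Connection connection;

    @Setup
    public void setUp() throws Exception {
        // Open the connection once per DoFn instance, not once per element.
        connection = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/XYZ", "root", "root1234");
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        String key = c.element();
        try (PreparedStatement stmt =
                 connection.prepareStatement("select 1 from xyz where z = ?")) {
            stmt.setString(1, key); // the dynamic value now comes from the file record
            try (ResultSet rs = stmt.executeQuery()) {
                c.output(KV.of(key, rs.next())); // true if a matching row exists
            }
        }
    }

    @Teardown
    public void tearDown() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}

You would apply this with ParDo.of(new CheckRecordExistsFn()) to the PCollection of file records and branch on the boolean downstream.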
Save Array<T> in BigQuery using Java
I'm trying to save data into BigQuery using the Spark BigQuery connector. Say I have a Java POJO like the one below:

@Getter
@Setter
@AllArgsConstructor
@ToString
@Builder
public class TagList {
    private String s1;
    private List<String> s2;
}

Now when I try to save this POJO into BigQuery, it throws the error below:

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Failed to load to test_table1 in job JobId{project=<project_id>, job=<job_id>, location=US}. BigQuery error was Provided Schema does not match Table <Table_Name>. Field s2 has changed type from STRING to RECORD
    at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:156)
    at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:89)
    ... 35 more

Sample code:

Dataset<TagList> mapDS = inputDS.map((MapFunction<Row, TagList>) x -> {
    List<String> list = new ArrayList<>();
    list.add(x.get(0).toString());
    list.add("temp1");
    return TagList.builder()
        .s1("Hello World")
        .s2(list)
        .build();
}, Encoders.bean(TagList.class));

mapDS.write().format("bigquery")
    .option("temporaryGcsBucket", "<bucket_name>")
    .option("table", "<table_name>")
    .option("project", projectId)
    .option("parentProject", projectId)
    .mode(SaveMode.Append)
    .save();

BigQuery table:

create table <dataset>.<table_name> (
    s1 string,
    s2 array<string>
)
PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);
Please change the intermediateFormat to AVRO or ORC. When Parquet is used as the intermediate format, the serialization wraps the array in an intermediate structure, which is why BigQuery sees the field as RECORD instead of a repeated STRING. See more at https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties
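For example, sticking with the write from the question above (the bucket and table names remain placeholders), the option could be set like this; note that per the connector documentation the Avro format may additionally require the spark-avro package on the classpath:

// Same write as in the question, but with the intermediate files written as Avro
// instead of the default Parquet, so the array<string> column keeps its type.
mapDS.write().format("bigquery")
    .option("temporaryGcsBucket", "<bucket_name>")
    .option("table", "<table_name>")
    .option("project", projectId)
    .option("parentProject", projectId)
    .option("intermediateFormat", "avro")
    .mode(SaveMode.Append)
    .save();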
ORA-00904: "E_MAIL": invalid identifier
I am using an MVC architecture. I am trying to update a record in a table, taking the customer id as input. All the data is taken as input in my viewcustomer.cpp class, whose method returns an object of type customer, which is passed to a function in modelcustomer.pc via controlcustomer.cpp (the controller). The following is a function of my modelcustomer.pc:

void modelcustomer::dbUpdateCustomerDetail(customer &c)
{
    id=c.getId();
    ph=c.getId();
    string memberFName=c.getFname();
    string memberLName=c.getLname();
    string memberStreet=c.getStreet();
    string memberCity=c.getCity();
    string memberState=c.getState();
    string memberEmail=c.getEmail();
    fn=new char[memberFName.length()+1];
    ln=new char[memberLName.length()+1];
    street=new char[memberStreet.length()+1];
    city=new char[memberCity.length()+1];
    state=new char[memberState.length()+1];
    e_mail=new char[memberEmail.length()+1];
    strcpy(fn,memberFName.c_str());
    strcpy(ln,memberLName.c_str());
    strcpy(street,memberStreet.c_str());
    strcpy(city,memberCity.c_str());
    strcpy(state,memberState.c_str());
    strcpy(e_mail,memberEmail.c_str());
    if(dbConnect())
    {
        EXEC SQL UPDATE CUSTOMER_1030082 SET CID=:id,FNAME=:fn,LNAME=:ln,PHONE=:ph,STREET=:street,STATE=:state,CITY=:city,EMAIL=e_mail;
        if(sqlca.sqlcode<0)
        {
            cout<<"error in execution"<<sqlca.sqlcode<<sqlca.sqlerrm.sqlerrmc;
        }
        EXEC SQL COMMIT WORK RELEASE;
    }
}

When I run it, a menu is displayed with some options. I select the update option, it asks me for the new details, and after that I get the following output:

connected to Oracle!
error in execution-904ORA-00904: "E_MAIL": invalid identifier
e_mail is not being used as a bind parameter; you forgot the colon:

EXEC SQL … EMAIL=:e_mail;

Note the : before e_mail.
How to report invalid data while processing data with Google Dataflow?
I am looking at the documentation and the provided examples to find out how I can report invalid data while processing data with Google's Dataflow service.

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
    .apply(new SomeTransformation())
    .apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));

p.run();

In addition to the actual input/output, I want to produce a second output file that contains records that are considered invalid (e.g. missing data, malformed data, values that are too high). I want to troubleshoot those records and process them separately.

Input: gs://.../input.csv
Output: gs://.../output.csv
List of invalid records: gs://.../invalid.csv

How can I redirect those invalid records into a separate output?
You can use PCollectionTuples to return multiple PCollections from a single transform. For example,

TupleTag<String> mainOutput = new TupleTag<>("main");
TupleTag<String> missingData = new TupleTag<>("missing");
TupleTag<String> badValues = new TupleTag<>("bad");

Pipeline p = Pipeline.create(options);

PCollectionTuple all = p
    .apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
    .apply(new SomeTransformation());

all.get(mainOutput)
    .apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
all.get(missingData)
    .apply(TextIO.Write.named("WriteMissingData").to(...));
...

PCollectionTuples can either be built up directly out of existing PCollections, or emitted from ParDo operations with side outputs, e.g.

PCollectionTuple partitioned = input.apply(ParDo
    .of(new DoFn<String, String>() {
        public void processElement(ProcessContext c) {
            if (checkOK(c.element())) {
                // Shows up in partitioned.get(mainOutput).
                c.output(...);
            } else if (hasMissingData(c.element())) {
                // Shows up in partitioned.get(missingData).
                c.sideOutput(missingData, c.element());
            } else {
                // Shows up in partitioned.get(badValues).
                c.sideOutput(badValues, c.element());
            }
        }
    })
    .withOutputTags(mainOutput, TupleTagList.of(missingData).and(badValues)));

Note that in general the various side outputs need not have the same type, and data can be emitted any number of times to any number of side outputs (rather than the strict partitioning we have here).

Your SomeTransformation class could then look something like

class SomeTransformation extends PTransform<PCollection<String>, PCollectionTuple> {
    public PCollectionTuple apply(PCollection<String> input) {
        // Filter into good and bad data.
        PCollectionTuple partitioned = ...

        // Process the good data.
        PCollection<String> processed = partitioned.get(mainOutput)
            .apply(...)
            .apply(...)
            ...;

        // Repackage everything into a new output tuple.
        return PCollectionTuple.of(mainOutput, processed)
            .and(missingData, partitioned.get(missingData))
            .and(badValues, partitioned.get(badValues));
    }
}
Robert's suggestion of using sideOutputs is great, but note that this will only work if the bad data is identified by your ParDos. There currently isn't a way to identify bad records hit during initial decoding (where the error is hit in Coder.decode). We've got plans to work on that soon.