Apache Beam can't read Avro file

I need to read in an Avro file from local or GCS, via Java.
I followed the example from the docs at https://beam.apache.org/documentation/sdks/javadoc/2.0.0/index.html?org/apache/beam/sdk/io/AvroIO.html
Pipeline p = ...;

// A Read from a GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
        .from("gs://my_bucket/path/to/records-*.avro"));
But when I try to process it through a DoFn, there doesn't appear to be any data there.
The Avro file does have data, and I was able to run a function to generate a schema from it.
If anybody has advice, please share.

I absolutely agree with Andrew; more information would be required. However, I think you should consider using AvroIO.Read, which is a more appropriate transform to read records from one or more Avro files.
https://cloud.google.com/dataflow/model/avro-io#reading-with-avroio
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.Read.named("ReadFromAvro")
        .from("gs://my_bucket/path/records-*.avro")
        .withSchema(schema));

Hey guys, thanks for looking into this. I can't share any code because it belongs to clients. I did not receive any error messages, and the debugger did see data, but we were not able to see the data from the Avro file (via the ParDo).
I did manage to fix the issue by recreating the Dataflow project using the Eclipse wizard. I even used the same code. I wonder why I did not receive any error messages.

Related

Is it possible to save reports and data transformation steps in PowerBI?

I have prepared some reports based on the files I prepared. I am wondering: is it possible to save this report (measures and visualizations) and also the steps I made while transforming the data? I want to be able to load new files (which have the same structure as the ones I used to create my report) and have the data transformation and report applied automatically to this updated data.
Is it possible?
You can save it as a template - file extension .pbit. It saves only the structure of the file, without the actual data. When you open the template it refreshes the report, and if there are parameters in the report (e.g. folder/file path or server address) it will refresh using the input values.
You can read more here:
https://learn.microsoft.com/en-us/power-bi/create-reports/desktop-templates
You can simply save your file as a template.

How to use regular expression in Google Dataflow streaming templates?

Using the Dataflow streaming templates, namely the Cloud Storage Text to BigQuery (Stream) template, it used to be possible to describe the "inputFilePattern" (i.e. the Cloud Storage location of the text you'd like to process) as a file pattern. For example, you could enter gs://my-bucket/my-files/file-to-upload* as the parameter, and all the files starting with "file-to-upload" would then be streamed.
Unfortunately it now throws this error message: "Object not found."
Is there another way to upload all files with a similar naming convention from a Google Cloud Storage location to BigQuery?
Thanks in advance.
This looks like a bug in the UI; you can pass the file pattern when you submit the job via the command line. The source code takes the file pattern as input, so there should not be any problem with the actual job:
PCollectionTuple transformedOutput =
    pipeline
        // 1) Read from the text source continuously.
        .apply(
            "ReadFromSource",
            TextIO.read()
                .from(options.getInputFilePattern())
                .watchForNewFiles(DEFAULT_POLL_INTERVAL, Growth.never()))
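If the UI keeps rejecting the glob, another option is to launch the template outside the UI altogether, for example programmatically via the Dataflow templates.launch API. The following is a rough Python sketch, not a verified recipe: the project, region, bucket, and table names are placeholders, and every parameter name other than inputFilePattern should be checked against the template's documentation.
# Hedged sketch: launching the Cloud Storage Text to BigQuery (Stream) template
# programmatically so the inputFilePattern glob bypasses the UI form.
# Requires google-api-python-client and application-default credentials.
from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')
response = dataflow.projects().locations().templates().launch(
    projectId='my-project',        # placeholder
    location='us-central1',        # placeholder
    gcsPath='gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery',
    body={
        'jobName': 'text-to-bq-stream',
        'parameters': {
            # The glob from the question is passed through unchanged here.
            'inputFilePattern': 'gs://my-bucket/my-files/file-to-upload*',
            # The remaining parameter names are from memory of the template
            # docs; verify them before relying on this sketch (schema path,
            # output table, dead-letter table).
            'JSONPath': 'gs://my-bucket/schema/schema.json',
            'outputTable': 'my-project:my_dataset.my_table',
            'outputDeadletterTable': 'my-project:my_dataset.my_table_errors',
        },
    },
).execute()
print(response)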

Beam/Dataflow ReadAllFromParquet doesn't read anything but my job still succeeds?

I have a Dataflow job which:
Reads a text file from GCS with other filenames in it
Passes the filenames to ReadAllFromParquet to read the .parquet files
Writes to BigQuery
Despite my job 'succeeding' it basically doesn't have an output collection past the ReadAllFromParquet step.
I successfully read the filenames into a list such as: ['gs://my_bucket/my_file1.snappy.parquet', 'gs://my_bucket/my_file2.snappy.parquet', 'gs://my_bucket/my_file3.snappy.parquet']
I am also confirming this list is correct and the GCS paths to the files are correct using a logger on the step before ReadAllFromParquet.
This is what my pipeline looks like (omitting the full code for brevity, but I am confident that it normally works, as I have the exact same pipeline for .csv using ReadAllFromText and it works fine):
with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
    try:
        final_data = (
            pipeline_2
            | 'Create empty PCollection' >> beam.Create([None])
            | 'Get accepted batch file: {}'.format(runtime_options.complete_batch) >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
            | 'Read all filenames into a list' >> beam.ParDo(FileIterator(runtime_options.files_bucket))
            | 'Read all files' >> beam.io.ReadAllFromParquet(columns=['locationItemId', 'deviceId', 'timestamp'])
            | 'Process all files' >> beam.ParDo(ProcessSch2())
            | 'Transform to rows' >> beam.ParDo(BlisDictSch2())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table=runtime_options.comp_table,
                schema=SCHEMA_2,
                project=pipeline_options_batch.view_as(GoogleCloudOptions).project,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,  # create the table if it does not exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND  # append to existing rows
            )
        )
    except Exception as exception:
        logging.error(exception)
        pass
That's what my job diagram looks like after:
Does somebody have an idea what might be going wrong here and what's the best way to debug?
My ideas currently:
A bucket permissions issue. I noticed the bucket I am reading from is odd, as earlier I couldn't download the files despite being a project Owner. The Owners of the project only had 'Storage Legacy Bucket Owner'. I added 'Storage Admin' and it then worked fine when manually downloading files with my own account. As per the Dataflow documentation, I have ensured that both the default compute service account and the Dataflow one have 'Storage Admin' on this bucket. However, maybe that's all a red herring, as ultimately, if there were a permissions issue, I should see it in the log and the job would fail?
ReadAllFromParquet expects the file patterns in a different format? I have shown an example of the list I supply above (in my diagram above I can see the input collection correctly shows elements added = 48 for the 48 files in the list). I know this format works for ReadAllFromText, so I assumed they are equivalent and should work.
=========
EDIT:
I noticed something else potentially consequential. Comparing against my other job, which uses ReadAllFromText and works fine, I noticed a slight mismatch in the naming that is worrying.
This is the name of the output collection for my working job:
And that's the name on my parquet job that doesn't actually read anything:
Note specifically
Read all files/ReadAllFiles/ReadRange.out0
vs
Read all files/Read all files/ReadRange.out0
The first part of the path is the name of my step for both jobs.
But I believe the second part to be the ReadAllFiles class from apache_beam.io.filebasedsource (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py), which both ReadAllFromText and ReadAllFromParquet call.
It seems like a potential bug, but I can't seem to trace it in the source code.
=============
EDIT 2
After some more digging, it seems that ReadAllFromParquet just isn't functional yet. ReadFromParquet calls apache_beam.io.parquetio._ParquetSource, whereas ReadAllFromParquet simply calls apache_beam.io.filebasedsource._ReadRange.
I wonder if there's a way to turn this on if it's an experimental function?
You didn't mention whether you are using the latest Beam SDK; try using SDK 2.16 to test the latest changes.
The docs state that ReadAllFromParquet is an experimental function, as is ReadFromParquet; nonetheless, ReadFromParquet is reported as working in this thread: Apache-Beam: Read parquet files from nested HDFS directories. You might want to try using that function.
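To illustrate that suggestion, here is a minimal, hedged sketch (not the original pipeline) that swaps the 'Read all files' step for ReadFromParquet, which takes a file pattern directly instead of a PCollection of file names; the bucket path is a placeholder and the column names are the ones from the question.
# Minimal sketch, assuming the batch files can be matched by one pattern.
import logging
import apache_beam as beam

with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        | 'Read parquet' >> beam.io.ReadFromParquet(
            'gs://my_bucket/*.snappy.parquet',   # a file pattern, not a PCollection
            columns=['locationItemId', 'deviceId', 'timestamp'])
        | 'Log rows' >> beam.Map(lambda row: logging.info(row) or row)
    )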

Is it possible to write back to data file in postman?

While working with Postman, data.someVariable returns data from within a CSV file; it can also be used as {{someVariable}} in the URI/JSON.
This gives us the data for that variable from that row/iteration.
Is there a mechanism to write back to the data file by doing something like postman.setData('responseCode') = responseCode?
This would be really helpful for storing the response code in the data file and recording call-wise details in the same format as the input CSV.
The only solution I figured out is:
to populate JSON objects in the environment with information about the data file name and the structure/values of the information to be added;
to create a separate web service (maybe in Node.js) that exposes an HTTP call to write to a file, takes as a parameter a JSON input like the one created in the environment as mentioned above, and writes it to a file / the original data file (or a copy of it) in the desired format (a rough sketch of such a service is shown after this list);
to call the above-mentioned web service at the end of each run or desired REST call execution to generate a step-wise information/debug report.
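For what it's worth, here is a rough sketch of that "separate web service" idea, written with Python's standard library rather than Node.js; the CSV path and field names are invented for the example. A Postman script could POST to it with pm.sendRequest at the end of each iteration.
# Rough sketch of a write-back service: accepts a JSON body and appends
# one row per call to a CSV file. Path and field names are examples only.
import csv
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CSV_PATH = 'postman_results.csv'
FIELDS = ['iteration', 'someVariable', 'responseCode']

class WriteBackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length) or b'{}')
        with open(CSV_PATH, 'a', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction='ignore')
            if f.tell() == 0:   # write the header only once
                writer.writeheader()
            writer.writerow(payload)
        self.send_response(204)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('localhost', 8080), WriteBackHandler).serve_forever()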
There is no way to write back to the data file in Postman as of now.
However, you can populate that in your environment file at run time using
pm.environment.set("varname", value)
Name varname in such a way that you understand this is the variable you wanted to write back into the data file.

How can I migrate issues from Redmine to Tuleap

Originally we used Redmine as our issue management system; now we are planning to migrate to Tuleap.
Both systems have features to import/export issues as .csv files.
I want to know whether there is a standard/simple way to migrate issues.
The main items inside issues are status, title and description.
What kind of data are "remaining_effort" and "cross_references"?
Since both systems can export a CSV file containing the item headers they need, but some headers differ, you need a script to map from one system to the other; a code snippet is shown below.
This can work for other ALM systems if they don't support migration from the application itself.
#!/usr/bin/env python
import csv
import sys

# read a sample tuleap csv header to avoid some field changes
tuleapcsvfile = open('tuleap.csv', 'r', newline='')
reader = csv.DictReader(tuleapcsvfile)
to_del = ["remaining_effort", "cross_references"]
# remove unneeded items
issueheader = [i for i in reader.fieldnames if i not in to_del]
# open stdout for output
w = csv.DictWriter(sys.stdout, fieldnames=issueheader, lineterminator="\n")
w.writeheader()
# read the redmine csv file for converting
redminecsvfile = open('redmine.csv', 'r', newline='')
redminereader = csv.DictReader(redminecsvfile)
for row in redminereader:
    newrow = {}
    if row['Status'] == 'New':
        newrow['status'] = "Not Started"
    # some simple one-to-one mappings
    newrow['i_want_to'] = row['Subject']
    newrow['so_that'] = row['Description']
    w.writerow(newrow)
Some items in the exported CSV can't be imported back into Tuleap, like
remaining_effort and cross_references.
These two items are shown inside the .csv file exported from Tuleap issues.
I had the same issue, and the CSV solution looked too limited to me:
the field matching between tracker and csv content must fit exactly
you can't import attachments
you can't link artifacts
...
Issues can be extracted from Redmine using the REST API or by directly reading the SQL database. Artifacts can be created in Tuleap using the REST API. You "just" need a script in the middle to extract issues from Redmine and then import them into Tuleap.
I created such a script in Python:
It has a plugin approach so that it could import issues/bugs from any bug tracker and later save them to any other bug tracker.
For now it only supports extracting issues from a Redmine SQL database and exporting to Tuleap using the REST API.
One can extend it (new plugin) to extract issues from other trackers (bugzilla/mantis/gitlab).
One can extend it (new plugin) to generate a Tuleap XML file rather than importing the artifacts using the Tuleap REST API (XML being more powerful here).
I ported hundreds of issues from Redmine to Tuleap using this and it was good enough for my needs.
Have a look at https://github.com/jpo38/TrackerIO.
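For illustration only, here is a rough skeleton of that "script in the middle" idea, not TrackerIO itself. It assumes a Redmine instance exposing the standard /issues.json REST endpoint and a Tuleap instance accepting POST /api/artifacts; the URLs, keys, tracker id, and field ids are placeholders you would look up in your own instances, and the auth header name should be checked against your Tuleap version.
# Hypothetical sketch: copy Redmine issues into a Tuleap tracker over REST.
import requests

REDMINE_URL = 'https://redmine.example.com'   # placeholder
REDMINE_KEY = 'redmine-api-key'               # placeholder
TULEAP_URL = 'https://tuleap.example.com'     # placeholder
TULEAP_KEY = 'tuleap-access-key'              # placeholder
TRACKER_ID = 42                               # placeholder tracker id
TITLE_FIELD_ID = 100                          # placeholder field ids from the tracker structure
DESC_FIELD_ID = 101

# Pull issues from Redmine (the /issues.json endpoint is paginated;
# this sketch only fetches the first page).
issues = requests.get(
    REDMINE_URL + '/issues.json',
    params={'key': REDMINE_KEY, 'status_id': '*', 'limit': 100},
).json()['issues']

# Push each issue as a new Tuleap artifact.
for issue in issues:
    payload = {
        'tracker': {'id': TRACKER_ID},
        'values': [
            {'field_id': TITLE_FIELD_ID, 'value': issue['subject']},
            {'field_id': DESC_FIELD_ID, 'value': issue.get('description', '')},
        ],
    }
    requests.post(
        TULEAP_URL + '/api/artifacts',
        headers={'X-Auth-AccessKey': TULEAP_KEY},  # assumed auth header
        json=payload,
    ).raise_for_status()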