I am trying to read values from Google Pub/Sub and Google Cloud Storage and insert them into BigQuery based on a count condition, i.e. if a value already exists in the table it should not be inserted, otherwise it can be inserted.
My code looks like this:
p_bq = beam.Pipeline(options=pipeline_options1)
logging.info('Start')
"""Pipeline starts. Create creates a PCollection from what we read from Cloud storage"""
test = p_bq | beam.Create(data)
"""The pipeline then reads from pub sub and then combines the pub sub with the cloud storage data"""
BQ_data1 = p_bq | 'readFromPubSub' >> beam.io.ReadFromPubSub(
    'mytopic') | beam.Map(parse_pubsub, param=AsList(test))
where 'data' is the value read from Cloud Storage and the Pub/Sub read provides the value from Google Analytics. parse_pubsub returns two values: one is the dictionary and the other is count (which states whether the value is present in the table or not):
count = comparebigquery(insert_record)
return (insert_record, count)
How can I make the BigQuery insertion conditional, given that the value is in a PCollection?
New edit:
class Process(beam.DoFn):
    def process(self, element, trans):
        if element['id'] in trans:
            # Emit this element to the 'present' output.
            yield pvalue.TaggedOutput('present', element)
        else:
            # Emit this element to the 'absent' output.
            yield pvalue.TaggedOutput('absent', element)
test1 = p_bq | "TransProcess" >> beam.Create(trans)
where trans is the list
BQ_data2 = BQ_data1 | beam.ParDo(Process(), trans=AsList(test1)).with_outputs('present', 'absent')
present_value = BQ_data2.present
absent_value = BQ_data2.absent
Thank you in advance
You could use
beam.Filter(lambda_function)
after the beam.Map step to filter out elements that return False when passed to the lambda_function.
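For example, a minimal sketch, assuming parse_pubsub returns an (insert_record, count) tuple as described in the question and that a count of 0 means the record is not yet in the table:
# Keep only the records whose count says they are not in BigQuery yet,
# then drop the count before any further processing.
to_insert = (
    BQ_data1
    | 'KeepNewRecords' >> beam.Filter(lambda rec: rec[1] == 0)
    | 'DropCount' >> beam.Map(lambda rec: rec[0])
)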
Alternatively, you can split the PCollection in a ParDo function using additional outputs, based on a condition.
Don't forget to provide output tags to the ParDo function via .with_outputs().
And when emitting an element of the PCollection to a specific output, use pvalue.TaggedOutput().
Then you select the PCollection you need and write it to BigQuery.
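A minimal sketch of that approach, building on BQ_data1, test1 and the Process DoFn from the question; the table spec and schema are placeholders, and writing only the 'absent' output is an assumption about which branch should be inserted:
import apache_beam as beam

# Split into 'present' and 'absent' outputs using the Process DoFn.
results = (
    BQ_data1
    | 'SplitPresentAbsent' >> beam.ParDo(
        Process(), trans=beam.pvalue.AsList(test1)).with_outputs('present', 'absent')
)

# Only the records that are not yet in the table are written to BigQuery.
_ = (
    results.absent
    | 'WriteToBQ' >> beam.io.WriteToBigQuery(
        'my_project:my_dataset.my_table',   # placeholder table spec
        schema='id:STRING,value:STRING',    # placeholder schema
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
results.present can be routed elsewhere (or simply dropped) in the same way.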
I am using AWS Textract to read and parse tables from a PDF into CSV. Lovely, AWS has documentation for it! https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I have set up the asynchronous method as they suggest, and it works for a POC. However, for some documents, some lines are not shown in my CSV.
After digging a bit into the JSON produced (the issue persists if I use the AWS CLI to run the document analysis), I noticed that the missing values have no CELL block referencing them. They are referenced in WORD and LINE blocks, but not in any CELL block. According to the script, that's exactly why they are not added to my CSV.
We could assume the OCR algorithm simply isn't that good. But the fun fact is that if I use the same PDF in the AWS Textract console, all the data is parsed into the table!
Is anyone aware of a parameter I would need to use to make sure the values are detected as CELL blocks? Or do you think that behind the scenes the console simply uses a more powerful script (one that actually uses the (x, y) coordinates of each WORD to match it to the table)?
I also compared the JSON produced by the CLI with the one from the console, and they are actually different! (Not only the IDs: as said, some values are in CELL blocks for the console, while they appear only in LINE/WORD blocks for the CLI.)
Important fact: my PDF is 3 pages long. The first page works perfectly fine with all the values, but the second one is basically missing the first 10 lines of the table. After those 10 lines, everything on that page is parsed as well.
Any suggestions? Or a script that parses the provided JSON more reliably?
Thank you!
Update: Basically the issue was the pagination of the results. There is a maximum of 1000 objects per response according to the AWS documentation: https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html#API_GetDocumentAnalysis_RequestSyntax
If you have more objects than that in a single table, the IDs are in the first 1000 while the objects themselves are returned in the second batch (1001 --> 2000). So when the script tries to add a cell to the table, it can't find the reference.
Basically the solution is quite easy. We need to alter the GetResults function to concatenate each response, and THEN run the other functions.
Here is working code:
def GetResults(jobId, file_name):
    maxResults = 1000
    paginationToken = None
    finished = False
    blocks = []

    # Page through the analysis results and concatenate all blocks
    # before parsing the tables.
    while finished == False:
        response = None
        if paginationToken == None:
            response = textract.get_document_analysis(JobId=jobId,
                                                      MaxResults=maxResults)
        else:
            response = textract.get_document_analysis(JobId=jobId,
                                                      MaxResults=maxResults,
                                                      NextToken=paginationToken)
        blocks += response['Blocks']
        if 'NextToken' in response:
            paginationToken = response['NextToken']
        else:
            finished = True

    table_csv = get_table_csv_results(blocks)

    output_file = file_name + ".csv"

    # replace content
    with open(output_file, "w") as fout:  # Important to change "at" to "w"
        fout.write(table_csv)

    # show the results
    print('Detected Document Text')
    print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
    print('OUTPUT TO CSV FILE: ', output_file)
Hope this will help people.
I have the following streaming DataFrame:
+-----------------------------------+
|sentence                           |
+-----------------------------------+
|Representative is a scientist      |
|Norman did a good job in the exam  |
|you want to go on shopping?        |
+-----------------------------------+
I have a list as follows:
val myList
As the final output I need myList to contain the above three sentences from the streaming DataFrame.
Output:
myList = [Representative is a scientist, Norman did a good job in the exam, you want to go on shopping? ]
I tried the following, which gives a streaming error:
val myList = sentenceDataframe.select("sentence").rdd.map(r => r(0)).collect.toList
Error thrown with the above method:
org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start()
Please note that the above method works with a normal DataFrame, but not with a streaming DataFrame.
Is there a way to iterate through each row of the streaming DataFrame and append the row values to a common list using Scala and Spark?
That sounds like a very unusual use case, as the stream could theoretically never end. Are you sure you are not just looking for regular Spark DataFrames?
If that is not the case, what you can do is use accumulators and Spark Structured Streaming's foreachBatch sink. I used a simple socket connection to demonstrate this. You can start a simple socket server under e.g. Ubuntu with nc -lp 3030 and just paste messages there for the stream; the resulting DataFrame will have a schema of [value: String].
val acc = spark.sparkContext.collectionAccumulator[String]
val stream = spark.readStream.format("socket").option("host", "localhost").option("port", "3030").load()
val query = stream.writeStream.foreachBatch((df: DataFrame, l: Long) => {
  df.collect.foreach(v => acc.add(v(0).asInstanceOf[String]))
}).start()
...
// For some reason you are stopping the stream here
query.stop()
val myList = acc.value
Now, one question you might have is why we are using accumulators and not just an ArrayBuffer. An ArrayBuffer would work locally, but on a cluster the code in foreachBatch might be executed on a completely different node. That means it would have no effect, and that is also the reason accumulators exist in the first place (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators).
I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using the Spark SQL rlike method as below, and it was able to handle the load as long as the incoming record count stayed below 50K.
PS: The regular expression reference data is a broadcasted dataset.
dataset.join(regexDataset.value, expr("input_column rlike regular_exp_column"))
Then I wrote a custom UDF that transforms them using Scala's native regex search, as below.
The val below collects the reference data as an array of tuples:
val regexPreCalcArray: Array[(Int, Regex)] = {
  regexDataset.value
    .select("col_1", "regex_column")
    .collect
    .map(row => (row.get(0).asInstanceOf[Int], row.get(1).toString.r))
}
Implementation of the regex matching UDF:
def findMatchingPatterns(regexDSArray: Array[(Int, Regex)]): UserDefinedFunction = {
  udf((input_column: String) => {
    for {
      text <- Option(input_column)
      matches = regexDSArray.filter(regexDSValue => regexDSValue._2.findFirstIn(text).isDefined)
      if matches.nonEmpty
    } yield matches.map(x => x._1).min
  }, IntegerType)
}
The join is done as below: in case of multiple regex matches the UDF returns a unique ID from the reference data, which is then joined back against the reference data on that unique ID to retrieve the other columns needed for the result.
dataset.withColumn("min_unique_id", findMatchingPatterns(regexPreCalcArray)($"input_column"))
.join(regexDataset.value, $"min_unique_id" === $"unique_id" , "left")
But this too gets very slow, with skew in execution (one executor task runs for a very long time), when the record count increases above 1M. Spark suggests not using UDFs because they can degrade performance. Are there any other best practices I should apply here, or a better API for Scala regex matching than what I've written? Any suggestions for doing this efficiently would be very helpful.
This is a follow-up to this question:
flatMap and `Ambiguous reference to member` error
There I am using the following code to convert an array of Records to an array of Persons:
let records = // load file from bundle
let persons = records.flatMap(Person.init)
Since this conversion can take some time for big files, I would like to monitor an index to feed into a progress indicator.
Is this possible with this flatMap construction? One possibility I thought of would be to send a notification from the init function, but I am thinking counting the records should also be possible from within flatMap?
Yup! Use enumerated().
let records = // load file from bundle
let persons = records.enumerated().flatMap { index, record in
    print(index)
    return Person(record)
}
I am trying to integrate InfluxDB with my application and process the output. I am importing the InfluxDBClient package to connect to the InfluxDB instance running on my local machine, and using query(), which returns data as an influxdb.resultset.ResultSet object.
However, I want to be able to pick specific elements from the ResultSet for my computations. I tried different functions like keys(), items() and values() from the influxdb-python manual here, but to no avail:
http://influxdb-python.readthedocs.io/en/latest/api-documentation.html
This is the sample output of the query():
Result: ResultSet({'(u'cpu', None)': [{u'usage_guest_nice': 0, u'usage_user': 0.90783871790308868, u'usage_nice': 0, u'usage_steal': 0, u'usage_iowait': 0.056348610076366427, u'host': u'xxx.xxx.hostname.com', u'usage_guest': 0, u'usage_idle': 98.184322579062794, u'usage_softirq': 0.0062609566755314457, u'time': u'2016-06-26T16:25:00Z', u'usage_irq': 0, u'cpu': u'cpu-total', u'usage_system': 0.84522915123660536}]})
I am also finding it hard to get the data in JSON format using the raw attribute mentioned in the above link. Any pointers on processing the above output would be great.
items() returns tuples in the format ((u'cpu', None), <generator>), where the generator can be looped over to get the actual data points as dictionaries. Took some time for me to figure out, but it was fun!
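For example, a small sketch of looping over items() (assuming client is the InfluxDBClient instance from the question; the field names come from the sample output above):
rs = client.query("SELECT * FROM cpu")

# Each item is a ((measurement, tags), points_generator) tuple.
for (measurement, tags), points in rs.items():
    for point in points:  # each point is a plain dict
        print(measurement, point['time'], point['usage_idle'])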
According to the docs you could use the get_points() function to retrieve results from an InfluxDB resultset. The function allows you to filter by either measurement, tag, both measurement AND tag, or simply get all the results without any filtering.
Getting all points
Using rs.get_points() will return a generator for all the points in the ResultSet.
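For example, mirroring the filtered examples below (cli is the InfluxDBClient):
rs = cli.query("SELECT * from cpu")
all_points = list(rs.get_points())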
Filtering by measurement
Using rs.get_points('cpu') will return a generator for all the points that are in a series with measurement name cpu, no matter the tags.
rs = cli.query("SELECT * from cpu")
cpu_points = list(rs.get_points(measurement='cpu'))
Filtering by tags
Using rs.get_points(tags={'host_name': 'influxdb.com'}) will return a generator for all the points that are tagged with the specified tags, no matter the measurement name.
rs = cli.query("SELECT * from cpu")
cpu_influxdb_com_points = list(rs.get_points(tags={"host_name": "influxdb.com"}))
Filtering by measurement and tags
Using a measurement name and tags will return a generator for all the points that are in a series with the specified measurement name AND whose tags match the given tags.
rs = cli.query("SELECT * from cpu")
points = list(rs.get_points(measurement='cpu', tags={'host_name': 'influxdb.com'}))