Write to one file per window in dataflow using python - google-cloud-platform

After reading data from an unbounded source like Pub/Sub, I'm applying windowing. I need to write all the records belonging to a window to a separate file. I found this in Java but couldn't find anything in Python.

There are no details about your use case in the question, so you might need to adapt some parts of the following example. One way to do it is to group elements using the window they belong to as the key. Then, we leverage filesystems.FileSystems.create to control how we want to write the files.
Here I will be using 10-second windows and some dummy data where consecutive events are 4 seconds apart, generated with:
data = [{'event': '{}'.format(event), 'timestamp': time.time() + 4*event} for event in range(10)]
We use the timestamp field to assign each element's timestamp (this is just to emulate Pub/Sub events in a controlled way). We window the events, use the windowing info as the key, group by key, and write the results to the output folder:
events = (p
          | 'Create Events' >> beam.Create(data)
          | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
          | 'Add Windows' >> beam.WindowInto(window.FixedWindows(10))
          | 'Add Window Info' >> beam.ParDo(AddWindowingInfoFn())
          | 'Group By Window' >> beam.GroupByKey()
          | 'Windowed Writes' >> beam.ParDo(WindowedWritesFn('output/')))
Where AddWindowingInfoFn is pretty straightforward:
class AddWindowingInfoFn(beam.DoFn):
    """output tuple of window(key) + element(value)"""
    def process(self, element, window=beam.DoFn.WindowParam):
        yield (window, element)
and WindowedWritesFn writes to the path that we specified in the pipeline (output/ folder in my case). Then, I use the window info for the name of the file. For convenience, I convert the epoch timestamps to human-readable dates. Finally, we iterate over all the elements and write them to the corresponding file. Of course, this behavior can be tuned at will in this function:
class WindowedWritesFn(beam.DoFn):
    """write one file per window/key"""
    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        (window, elements) = element
        window_start = str(window.start.to_utc_datetime()).replace(" ", "_")
        window_end = str(window.end.to_utc_datetime()).replace(" ", "_")
        writer = filesystems.FileSystems.create(self.outdir + window_start + ',' + window_end + '.txt')
        for row in elements:
            writer.write(str(row) + "\n")
        writer.close()
This will write elements belonging to each window to a different file. In my case, there are 5 different ones:
$ ls output/
2019-05-21_19:01:20,2019-05-21_19:01:30.txt
2019-05-21_19:01:30,2019-05-21_19:01:40.txt
2019-05-21_19:01:40,2019-05-21_19:01:50.txt
2019-05-21_19:01:50,2019-05-21_19:02:00.txt
2019-05-21_19:02:00,2019-05-21_19:02:10.txt
The first one contains only element 0 (this will vary between executions):
$ cat output/2019-05-21_19\:01\:20\,2019-05-21_19\:01\:30.txt
{'timestamp': 1558465286.933727, 'event': '0'}
The second one contains elements 1 to 3 and so on:
$ cat output/2019-05-21_19\:01\:30\,2019-05-21_19\:01\:40.txt
{'timestamp': 1558465290.933728, 'event': '1'}
{'timestamp': 1558465294.933728, 'event': '2'}
{'timestamp': 1558465298.933729, 'event': '3'}
A caveat of this approach is that all elements belonging to the same window are grouped onto the same worker. This would happen anyway if writing to a single shard or output file as in your case but, for higher loads, you might need to consider larger machine types.
Full code here
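If a single hot window produces more data than one worker can comfortably handle, an alternative to larger machine types is to key by (window, shard) instead of just the window, so each window is split across several files and workers. A minimal sketch of that variation (the num_shards value and class name are illustrative, not part of the answer above):

import random
import apache_beam as beam

class AddShardedWindowInfoFn(beam.DoFn):
    """Key each element by (window, shard) so one hot window spreads over several workers."""
    def __init__(self, num_shards=4):
        self.num_shards = num_shards

    def process(self, element, window=beam.DoFn.WindowParam):
        shard = random.randint(0, self.num_shards - 1)
        yield ((window, shard), element)

WindowedWritesFn would then unpack (window, shard) from the key and append the shard number to the file name, producing num_shards files per window.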

Related

How to filter pCollection on the first two characters (Python)

I hope someone can help!
I am fairly new to Apache Beam and Cloud Dataflow. What I would like to do is:
Read GZIP File Contents
File is fixed width
Filter said contents based on first two characters to another pCollection
What I have so far:
My ParDo function:
class FilterHeader(beam.DoFn):
    def process(self, element):
        if element[:2] == '01':
            yield element
        else:
            return 'Header Not found'  # This was originally blank; I added the return just to check whether anything came back for non-matching rows
And my pipeline is as follows
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    # Initial PCollection - full file
    rows = (
        p | 'Read daily Spot File' >> beam.io.ReadFromText(
            file_pattern='gs://<my bucket>/filename.gz',
            compression_type='gzip',
            coder=coders.BytesCoder(),
            skip_header_lines=0))

    # Header collection - filtered on first two characters = 01
    header_collection = (
        rows | 'Filter Record Type 01 to our HEADER COLLECTION' >> beam.ParDo(FilterHeader())
             | 'Output Header Rows' >> beam.io.WriteToText('gs://<destination bucket>/new_fileName.txt'))
When I remove the filter, I can output all the rows, so there isn't anything wrong with the file or the initial PCollection. Once I add the filter, the rows I am after do not come out. And yes, the data is present in the file, i.e. there is at least one row whose first two characters are 01.
Is there something simple I am missing?
Any direction greatly appreciated.
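A likely suspect, assuming Python 3: with coder=coders.BytesCoder(), ReadFromText emits bytes elements, so element[:2] == '01' compares bytes with str and never matches. A minimal sketch of the same DoFn comparing against a bytes literal instead (treat it as an untested guess):

import apache_beam as beam

class FilterHeader(beam.DoFn):
    def process(self, element):
        # element is bytes when BytesCoder is used, so compare against a bytes literal
        if element[:2] == b'01':
            yield element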

Apache Beam: How To Simultaneously Create Many PCollections That Undergo Same PTransform?

Thanks in advance!
[+] Issue:
I have a lot of files on Google Cloud, and for every file I have to:
get the file
Make a bunch of Google Cloud Storage API calls on each file to index it (e.g. name = blob.name, size = blob.size)
unzip it
search for stuff in there
put the indexing information + stuff found inside file in a BigQuery Table
I've been using Python 2.7 and the Google Cloud SDK. This takes hours if I run it linearly. Apache Beam/Dataflow was suggested to me as a way to process in parallel.
[+] What I've been able to do:
I can read from one file, perform a PTransform and write to another file.
def loadMyFile(pipeline, path):
    return pipeline | "LOAD" >> beam.io.ReadFromText(path)

def myFilter(request):
    return request

with beam.Pipeline(options=PipelineOptions()) as p:
    data = loadMyFile(p, path)
    output = data | "FILTER" >> beam.Filter(myFilter)
    output | "WRITE" >> beam.io.WriteToText(google_cloud_options.staging_location)
[+] What I want to do:
How can I load many of those files simultaneously, perform the same transform to them in parallel, then in parallel write to big query?
Diagram Of What I Wish to Perform
[+] What I've Read:
https://beam.apache.org/documentation/programming-guide/
http://enakai00.hatenablog.com/entry/2016/12/09/104913
Again, many thanks
textio accepts a file_pattern.
From the Python SDK:
file_pattern (str) – The file path to read from as a local file path or a GCS gs:// path. The path can contain glob characters
For example, suppose you have a bunch of *.txt files in gs://my-bucket/files/; you can write:
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "LOAD" >> beam.io.textio.ReadFromText(file_pattern="gs://my-bucket/files/*.txt")
     | "FILTER" >> beam.Filter(myFilter)
     | "WRITE" >> beam.io.textio.WriteToText(output_location))
If you somehow do have multiple PCollections of the same type, you can also Flatten them into a single one
merged = (
    (pcoll1, pcoll2, pcoll3)
    # A list of tuples can be "piped" directly into a Flatten transform.
    | beam.Flatten())
Ok so I resolved this by doing the following:
1) get the name of a bucket from somewhere | first PCollection
2) get a list of blobs from that bucket | second PCollection
3) do a FlatMap to get blobs individually from the list | third PCollection
4) do a ParDo that gets the metadata
5) write to BigQuery
my pipeline looks like this:
with beam.Pipeline(options=options) as pipe:
bucket = pipe | "GetBucketName" >> beam.io.ReadFromText('gs://example_bucket_eraseme/bucketName.txt')
listOfBlobs = bucket | "GetListOfBlobs" >> beam.ParDo(ExtractBlobs())
blob = listOfBlobs | "SplitBlobsIndividually" >> beam.FlatMap(lambda x: x)
dic = blob | "GetMetaData" >> beam.ParDo(ExtractMetadata())
dic | "WriteToBigQuery" >> beam.io.WriteToBigQuery(

Updating rrdtool database

This is my first post here, so I hope I have not been too verbose.
I found I was losing data points due to only having 10 rows in my rrdtool config and wanted to update from a backup source file with older data.
After fixing the row count, the config was created with:
rrdtool create dailySolax.rrd \
--start 1451606400 \
--step 21600 \
DS:toGrid:GAUGE:172800:0:100000 \
DS:fromGrid:GAUGE:172800:0:100000 \
DS:totalEnerg:GAUGE:172800:0:100000 \
DS:BattNow:GAUGE:1200:0:300 \
RRA:LAST:0.5:1d:1010 \
RRA:MAX:0.5:1d:1010 \
RRA:MAX:0.5:1M:1010
and the update line in python is
newline = ToGrid + ':' + FromGrid + ':' + TotalEnergy + ':' + battNow
UpdateE = 'N:' + (newline)
print UpdateE
try:
    rrdtool.update(
        "%s/dailySolax.rrd" % (os.path.dirname(os.path.abspath(__file__))),
        UpdateE)
This all worked fine for inputting the original data (from a crontabbed website scrape) but as I said I lost data and wanted to add back the earlier datapoints.
From my backup source I had a plain text file with lines looking like
1509386401:10876.9:3446.22:18489.2:19.0
1509408001:10879.76:3446.99:18495.7:100.0
where the first field is the timestamp. I then used this code to read in the lines for the updates:
with open("rrdRecovery.txt","r") as fp:
for line in fp:
print line
## newline = ToGrid + ':' + FromGrid + ':' + TotalEnergy + ':' + battNow
UpdateE = line
try:
rrdtool.updatev(
"%s/dailySolax.rrd" % (os.path.dirname(os.path.abspath(__file__))),
UpdateE)
When it did not work correctly with a copy of the current version of the database I tried again on an empty database created using the same config.
In each case the update results only in the timestamp data in the database and no data from the other fields.
Python is not complaining and I expected
1509386401:10876.9:3446.22:18489.2:19.0
would update the same as does
N:10876.9:3446.22:18489.2:19.0
The dump shows the lastupdate data for all fields, but then this for the RRA database:
<!-- 2017-10-31 11:00:00 AEDT / 1509408000 --> <row><v>NaN</v><v>NaN</v><v>NaN</v><v>NaN</v></row>
Not sure if I have a python issue - more likely a rrdtool understanding problem. Thanks for any pointers.
The problem you have is that RRDTool timestamps must be increasing. This means that, if you increase the length of your RRAs (back into the past), you cannot put data directly into these points - only add new data onto the end as time increases. Also, when you create a new RRD, the 'last update' time defaults to NOW.
If you have a log of your previous timestamp, then you should be able to add this history, as long as you don't do any 'now' updates before you finish doing so.
First, create the RRD, with a 'start' time earlier than the first historical update.
Then, process all of the historical updates in chronological order, with the appropriate timestamps.
Finally, you can start doing your regular 'now' updates.
I suspect what has happened is that you had your regular cronjob adding in new data before you have run all of your historical data input - or else you created the RRD with a start time after your historical timestamps.
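A minimal sketch of that recovery sequence, reusing the DS/RRA definitions from the question (the --start value is only an example and simply has to be earlier than the first timestamp in the backup file):

import rrdtool

# Recreate the RRD with a start time before the earliest historical timestamp.
rrdtool.create(
    'dailySolax.rrd',
    '--start', '1509300000',
    '--step', '21600',
    'DS:toGrid:GAUGE:172800:0:100000',
    'DS:fromGrid:GAUGE:172800:0:100000',
    'DS:totalEnerg:GAUGE:172800:0:100000',
    'DS:BattNow:GAUGE:1200:0:300',
    'RRA:LAST:0.5:1d:1010',
    'RRA:MAX:0.5:1d:1010',
    'RRA:MAX:0.5:1M:1010')

# Replay the historical updates in chronological order (timestamps must increase),
# and only then resume the regular 'N:' updates from cron.
with open('rrdRecovery.txt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            rrdtool.update('dailySolax.rrd', line)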

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values by using all lines between each key.
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
    content = inFile.read()
    rawList = content.splitlines()
    for line in rawList:
        cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
    itemID = pattern.search(cleanLine)
    print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
    for line in f:
        # process a line
It sounds, however, that you have multi-line block oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key, value in your dict:
pat = re.compile(pattern, flags)
with open(file_name) as f:
    di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
With a larger file, use a mmap:
import re, mmap

pat = re.compile(pattern, flags)
with open(file_name, 'r+') as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Note: under Python 3, searching an mmap requires a bytes pattern
    for i, m in enumerate(pat.finditer(mm)):
        # process each block accordingly...
As for the regex, I am a little unclear on what you are and are not trying to capture. I think this regex is what you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
Demo
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
... di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
    content = inFile.read()
    rawList = content.splitlines()

keyed_dict = {}
in_between_lines = ""
last_key = None
pattern = re.compile(r'^--L-[0-9]{8}')

for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
    match = pattern.search(cleanLine)
    if match is not None:
        if last_key:
            keyed_dict[last_key] = in_between_lines  # close out the previous block
        last_key = match.group(0)
        in_between_lines = ""
    else:
        in_between_lines += cleanLine + "\n"  # keep line breaks so the original spacing is preserved

if last_key:
    keyed_dict[last_key] = in_between_lines  # flush the lines collected after the last key
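With the excerpt above, keyed_dict['--L-52852243'] would then hold everything between that marker and --L-42127143 (the NOARCHIVELOG query and the archive log list output) as a single newline-joined string, with the 'SQL> ' prefixes stripped and the remaining spacing preserved.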

Multithreading in Python with the threading and queue modules

I have a file with hundreds of thousands of lines, each line of which needs to undergo the same process (calculating a covariance). I was going to multithread because it takes pretty long as is. All the examples/tutorials I have seen have been fairly complicated for what I want to do, however. If anyone could point me to a good tutorial that explains how to use the two modules together, that would be great.
Whenever I have to process something in parallel, I use something similar to this (I just ripped this out of an existing script):
#!/usr/bin/env python2
# This Python file uses the following encoding: utf-8
import os, sys, time
from multiprocessing import Queue, Manager, Process, Value, Event, cpu_count

class ThreadedProcessor(object):
    def __init__(self, parser, input_file, output_file, threads=cpu_count()):
        self.parser = parser
        self.num_processes = threads
        self.input_file = input_file
        self.output_file = output_file
        self.shared_proxy = Manager()
        self.input_queue = Queue()
        self.output_queue = Queue()
        self.input_process = Process(target=self.parse_input)
        self.output_process = Process(target=self.write_output)
        self.processes = [Process(target=self.process_row) for i in range(self.num_processes)]
        self.input_process.start()
        self.output_process.start()
        for process in self.processes:
            process.start()
        self.input_process.join()
        for process in self.processes:
            process.join()
        self.output_process.join()

    def parse_input(self):
        for index, row in enumerate(self.input_file):
            self.input_queue.put([index, row])
        for i in range(self.num_processes):
            self.input_queue.put('STOP')

    def process_row(self):
        for index, row in iter(self.input_queue.get, 'STOP'):
            self.output_queue.put([index, row[0], self.parser.parse(row[1])])
        self.output_queue.put('STOP')

    def write_output(self):
        current = 0
        buffer = {}
        for works in range(self.num_processes):
            for index, id, row in iter(self.output_queue.get, 'STOP'):
                if index != current:
                    buffer[index] = [id] + row
                else:
                    self.output_file.writerow([id] + row)
                    current += 1
                    while current in buffer:
                        self.output_file.writerow(buffer[current])
                        del buffer[current]
                        current += 1
Basically, you have two processes managing the reading and writing of the file: one reads the input and feeds rows into a queue, the other reads from the "done" queue and writes rows to your output file in order. The worker processes are spawned in between (in this case as many as your CPU has cores) and they all process elements from the input queue.
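The class expects a parser object with a parse(data) method returning a list of output columns, an iterable of input rows shaped like [id, data], and an output object exposing writerow (e.g. a csv writer). A hypothetical usage sketch under those assumptions (fork-based platforms such as Linux, matching the python2 shebang):

import csv

class CovarianceParser(object):
    """Hypothetical parser: replace the body with the real covariance calculation."""
    def parse(self, data):
        return [data]  # must return a list of output columns

if __name__ == '__main__':
    with open('input.csv') as fin, open('output.csv', 'wb') as fout:
        # Each input row is expected to look like [id, data]
        ThreadedProcessor(CovarianceParser(), csv.reader(fin), csv.writer(fout))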