I am using Dataflow to pre-process batch data.
The job reads the input from Google Cloud Storage (GCS), processes it in Dataflow, and uploads the result back to GCS.
But when I checked GCS after the job finished, I found these files:
result-001.csv
result-002.csv
result-003.csv
The output is split across several files like this.
Can I combine these files into one?
#-*- coding: utf-8 -*-
import apache_beam as beam
import csv
import json
import os
import re
from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json
def preprocessing(fields):
    fields = fields.split(",")
    header = "label"
    for i in range(0, 784):
        header += (",pixel" + str(i))
    label_list_str = "["
    label_list = []
    for i in range(0,10) :
        if fields[0] == str(i) :
            label_list_str+=str(i)
        else :
            label_list_str+=("0")
        if i!=9 :
            label_list_str+=","
    label_list_str+="],"
    for i in range(1,len(fields)) :
        label_list_str+=fields[i]
        if i!=len(fields)-1:
            label_list_str+=","
    yield label_list_str
def run(project, bucket, dataset) :
    argv = [
        "--project={0}".format(project),
        "--job_name=kaggle-dataflow",
        "--save_main_session",
        "--region=asia-northeast1",
        "--staging_location=gs://{0}/kaggle-bucket-v1/".format(bucket),
        "--temp_location=gs://{0}/kaggle-bucket-v1/".format(bucket),
        "--max_num_workers=8",
        "--worker_region=asia-northeast3",
        "--worker_disk_type=compute.googleapis.com/projects//zones//diskTypes/pd-ssd",
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        "--runner=DataflowRunner",
        "--worker_region=asia-northeast3"
    ]
    result_output = 'gs://kaggle-bucket-v1/result/result.csv'
    filename = "gs://{}/train.csv".format(bucket)
    pipeline = beam.Pipeline(argv=argv)
    ptransform = (pipeline
        | "Read from GCS" >> beam.io.ReadFromText(filename)
        | "Kaggle data pre-processing" >> beam.FlatMap(preprocessing)
    )
    (ptransform
        | "events:out" >> beam.io.WriteToText(
            result_output
        )
    )
    pipeline.run()
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Run pipeline on the cloud")
    parser.add_argument("--project", dest="project", help="Unique project ID", required=True)
    parser.add_argument("--bucket", dest="bucket", help="Bucket where your data were ingested", required=True)
    parser.add_argument("--dataset", dest="dataset", help="BigQuery dataset")
    args = vars(parser.parse_args())
    print("Correcting timestamps and writing to BigQuery dataset {}".format(args["dataset"]))
    run(project=args["project"], bucket=args["bucket"], dataset=args["dataset"])
Thank you for reading :)
beam.io.WriteToText automatically splits its output into several files (shards) for best performance. If you want only one file, pass num_shards=1 explicitly. The documentation describes the parameter as follows:
num_shards (int) – The number of files (shards) used for output. If
not set, the service will decide on the optimal number of shards.
Constraining the number of shards is likely to reduce the performance
of a pipeline. Setting this value is not recommended unless you
require a specific number of output files.
Your write step should look like this:
(ptransform
    | "events:out" >> beam.io.WriteToText(
        result_output, num_shards=1
    )
)
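With num_shards=1 alone, Beam's default shard naming still appends a suffix such as -00000-of-00001 to the prefix. A minimal sketch, assuming you want the object to be named exactly result.csv: also pass an empty shard_name_template. The helper names single_file_sink_kwargs and build_write_step are illustrative and not part of Beam.

```python
def single_file_sink_kwargs(path):
    """Keyword arguments for WriteToText that yield exactly one file named `path`."""
    return {
        "file_path_prefix": path,
        "num_shards": 1,            # force a single output shard
        "shard_name_template": "",  # drop the default -SSSSS-of-NNNNN suffix
    }


def build_write_step(path):
    # Imported lazily so the helper above can be checked without Beam installed.
    import apache_beam as beam
    return beam.io.WriteToText(**single_file_sink_kwargs(path))


if __name__ == "__main__":
    print(single_file_sink_kwargs("gs://kaggle-bucket-v1/result/result.csv"))
```

In the pipeline this would be used as ptransform | "events:out" >> build_write_step(result_output). Keep in mind that a single shard forces the final write through one worker, which is the performance cost the documentation warns about.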
I'm using SageMaker to create and deploy a SKLearn estimator (endpoint). I created a my_file.py custom script, but when I fit the estimator, in the logs I see:
TypeError: 'NoneType' object is not subscriptable
where the object is a pandas dataframe obtained by reading a parquet dataset.
If I run the my_file.py file locally, the dataset is not set to None.
Here's my SageMaker code:
import boto3
import sagemaker
from sagemaker.sklearn import SKLearn
my_estimator = SKLearn(entry_point = 'my_file.py',
                       source_dir = 'my_dir',
                       py_version = 'py3',
                       role = sagemaker.get_execution_role(),
                       instance_type = '<instance_type>',
                       instance_count = <number_of_instances>,
                       framework_version = '<version>',
                       output_path = 'my_path',
                       base_job_name = 'my_job_name')
# These directories are S3 directories
model_dir = 'my_model_dir'
dataset = 'my_dataset_dir'
file1 = 'my_file1_dir'
empty_dir1 = 'my_empty_dir1'
empty_dir2 = 'my_empty_dir2'
empty_dir3 = 'my_empty_dir3'
my_estimator.fit(inputs = {'dataset': dataset,
                           'file1': file1,
                           'emptydir1': empty_dir1,
                           'emptydir2': empty_dir2,
                           'emptydir3': empty_dir3}, logs = True)
And here's my my_file.py code:
# importing libraries
# defining some functions
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--dataset', type=str, default=os.environ['SM_CHANNEL_DATASET'])
    parser.add_argument('--file1', type=str, default=os.environ['SM_CHANNEL_FILE1'])
    parser.add_argument('--emptydir1', type=str, default=os.environ['SM_CHANNEL_EMPTYDIR1'])
    parser.add_argument('--emptydir2', type=str, default=os.environ['SM_CHANNEL_EMPTYDIR2'])
    parser.add_argument('--emptydir3', type=str, default=os.environ['SM_CHANNEL_EMPTYDIR3'])
    args, _ = parser.parse_known_args()
    filename = args.dataset
    dataset = pd.read_parquet(filename)
    print(len(dataset))
    # The following line results in the error
    dataset['column1'] = dataset['column1'].astype(str)
    # other lines of code
Printing the length of the dataset gives me the correct length, so the dataframe is not empty.
Also, the values in the dataframe are neither None nor NaN.
I don't know why I get this error. Any advice?
EDIT
After re-running the fit code a few times, the error disappeared.
Can you try running the following line locally with the same dataset?
dataset['column1'] = dataset['column1'].astype(str)
It looks like CloudWatch (CW) is reporting an error specific to this line. It is best to test it locally before putting it into a SageMaker script, because the behavior depends on the libraries you are using. You can also use SageMaker local mode to catch these errors early.
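One obstacle to running my_file.py locally is that os.environ['SM_CHANNEL_DATASET'] raises KeyError outside a training container. A minimal sketch, assuming you are happy to fall back to local directories when the SageMaker variables are absent; the helper channel_default and the paths ./local_data and ./local_model are hypothetical.

```python
import argparse
import os


def channel_default(name, fallback):
    """Return the SageMaker channel directory, or a local fallback path.

    Inside a training container SageMaker sets SM_CHANNEL_<NAME>; on a
    laptop the variable is absent, so os.environ[...] would raise KeyError.
    """
    return os.environ.get("SM_CHANNEL_" + name.upper(), fallback)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # ./local_data and ./local_model are hypothetical local paths.
    parser.add_argument("--dataset", type=str,
                        default=channel_default("dataset", "./local_data"))
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./local_model"))
    args, _ = parser.parse_known_args(argv)
    return args
```

With this in place the same script runs unchanged on a laptop and in the training job. For local mode, the SageMaker Python SDK is typically told to run the container on your own machine by passing instance_type='local' to the estimator, which requires Docker.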
I am trying to learn Cloud Composer, Beam, and Dataflow on GCP.
I have written Python functions that do some basic data cleaning, and I use a DAG in Cloud Composer to run them: it downloads a file from a bucket, processes it, and uploads it to a bucket at a set frequency.
All of that is bespoke code. I am now trying to figure out how to use a Beam pipeline on Dataflow instead, with Cloud Composer kicking off the Dataflow job.
The cleaning I am trying to do is to take a CSV input of col1,col2,col3,col4,col5 and combine the middle three columns to output a CSV of col1,combinedcol234,col5.
My questions are:
How do I pull in my own functions within a Beam pipeline to do this merge?
Should I be pulling in my own functions, or does Beam have built-in ways of doing this?
How do I then trigger a pipeline from a DAG?
Does anyone have any example code on GitHub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Python Dataflow job. The steps are:
Create your Cloud Composer environment;
Add the Dataflow job file to a bucket;
Add the input file to a bucket;
Add the following DAG to the DAGs directory of the Composer environment:
composer_dataflow_dag.py:
import datetime
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago
bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')
default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}
with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file=f'gs://<bucket name>/dataflow-job.py',
        options={
            "runner": "DataflowRunner",
            "project": project_id,
            "region": "us-central1",
            "temp_location": "gs://<bucket name>/",
            "staging_location": "gs://<bucket name>/",
        },
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
Here is the Dataflow job file, including the column concatenation you asked for:
dataflow-job.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import os
from datetime import datetime
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')
bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'
p = beam.Pipeline(options=PipelineOptions())
(p
 | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
 | beam.Map(lambda x: x.split(","))
 | beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
 | beam.io.WriteToText(output))
p.run().wait_until_finish()
After the job runs, the result is stored in the GCS bucket.
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
'''
import apache_beam as beam

def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
'''
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or which you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
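To make the "pull in my own functions" part concrete, here is a minimal sketch that assumes the merge is plain string concatenation of the three middle columns and that the input is read line by line with ReadFromText. The names merge_middle_columns and build_pipeline are illustrative; any ordinary Python function can be passed to beam.Map in the same way.

```python
import csv
import io


def merge_middle_columns(line):
    """Turn 'col1,col2,col3,col4,col5' into 'col1,<col2+col3+col4>,col5'."""
    row = next(csv.reader([line]))
    merged = [row[0], "".join(row[1:4]), row[4]]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(merged)
    return out.getvalue()


def build_pipeline(input_path, output_path, options=None):
    # Imported lazily so merge_middle_columns can be unit-tested without Beam.
    import apache_beam as beam

    p = beam.Pipeline(options=options)
    (p
     | "Read" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
     | "Merge" >> beam.Map(merge_middle_columns)
     | "Write" >> beam.io.WriteToText(output_path))
    return p
```

Calling build_pipeline(input_path, output_path).run().wait_until_finish() executes it. Using the csv module rather than str.split keeps quoted fields that contain commas intact, and keeping the merge as a separate function means it can be tested without starting a pipeline.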
Upload the file to Google Cloud Storage.
I am using Dataflow to pre-process batch data.
The job reads the input from Google Cloud Storage (GCS), processes it in Dataflow, and uploads the result back to GCS.
After the job finished, I checked the output in GCS.
#-*- coding: utf-8 -*-
import apache_beam as beam
import csv
import json
import os
import re
from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json
def preprocessing(fields):
    fields = fields.split(",")
    header = "label"
    for i in range(0, 784):
        header += (",pixel" + str(i))
    label_list_str = "["
    label_list = []
    for i in range(0,10) :
        if fields[0] == str(i) :
            label_list_str+=str(i)
        else :
            label_list_str+=("0")
        if i!=9 :
            label_list_str+=","
    label_list_str+="],"
    for i in range(1,len(fields)) :
        label_list_str+=fields[i]
        if i!=len(fields)-1:
            label_list_str+=","
    yield label_list_str
def run(project, bucket, dataset) :
    argv = [
        "--project={0}".format(project),
        "--job_name=kaggle-dataflow",
        "--save_main_session",
        "--region=asia-northeast1",
        "--staging_location=gs://{0}/kaggle-bucket-v1/".format(bucket),
        "--temp_location=gs://{0}/kaggle-bucket-v1/".format(bucket),
        "--max_num_workers=8",
        "--worker_region=asia-northeast3",
        "--worker_disk_type=compute.googleapis.com/projects//zones//diskTypes/pd-ssd",
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        "--runner=DataflowRunner",
        "--worker_region=asia-northeast3"
    ]
    result_output = 'gs://kaggle-bucket-v1/result/result.csv'
    filename = "gs://{}/train.csv".format(bucket)
    pipeline = beam.Pipeline(argv=argv)
    ptransform = (pipeline
        | "Read from GCS" >> beam.io.ReadFromText(filename)
        | "Kaggle data pre-processing" >> beam.FlatMap(preprocessing)
    )
    (ptransform
        | "events:out" >> beam.io.WriteToText(
            result_output
        )
    )
    pipeline.run()
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Run pipeline on the cloud")
    parser.add_argument("--project", dest="project", help="Unique project ID", required=True)
    parser.add_argument("--bucket", dest="bucket", help="Bucket where your data were ingested", required=True)
    parser.add_argument("--dataset", dest="dataset", help="BigQuery dataset")
    args = vars(parser.parse_args())
    print("Correcting timestamps and writing to BigQuery dataset {}".format(args["dataset"]))
    run(project=args["project"], bucket=args["bucket"], dataset=args["dataset"])
What I am trying to do now is upload a file containing a header to GCS and then add the data to it by reusing the file name. However, it doesn't work well.
How can I include the header in the CSV file?
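One possible approach, sketched under the assumption that the header should be the label,pixel0,...,pixel783 string that preprocessing() already builds but never emits: beam.io.WriteToText accepts a header argument and writes it at the top of each output file. The helper names build_header and write_with_header are illustrative only.

```python
def build_header(num_pixels=784):
    """Return 'label,pixel0,...,pixel783', matching the loop in preprocessing()."""
    return ",".join(["label"] + ["pixel%d" % i for i in range(num_pixels)])


def write_with_header(path):
    # Imported lazily so build_header can be checked without Beam installed.
    import apache_beam as beam
    # `header` is written once at the top of every output shard.
    return beam.io.WriteToText(path, header=build_header())
```

In the pipeline this would replace the existing write step: ptransform | "events:out" >> write_with_header(result_output). Because the header is repeated in every shard, it combines most naturally with a single-shard output.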
I have set up a small test using Google Dataflow (apache-beam). The use case for the experiment is to take a (csv) file and write a selected column to a (txt) file.
The code for the experiment is as listed below:
from __future__ import absolute_import
import argparse
import logging
import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
class EmitColDoFn(beam.DoFn):
    first = True
    header = ""
    def __init__(self, i):
        super(EmitColDoFn, self).__init__()
        self.line_count = Metrics.counter(self.__class__, 'lines')
        self.i = i
    def process(self, element):
        if self.first:
            self.header = element
            self.first = False
        else:
            self.line_count.inc()
            cols = re.split(',', element)
            return (cols[self.i],)
def run(argv=None):
    """Main entry point; defines and runs the wordcount pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='/users/sms/python_beam/data/MOCK_DATA (4).csv',
                        # default='gs://dataflow-samples/shakespeare/kinglear.txt',
                        help='Input file to process.')
    parser.add_argument('--output',
                        dest='output',
                        default="/users/sms/python_beam/data/",
                        # required=True,
                        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)
    # Read the text file[pattern] into a PCollection.
    lines = p | 'read' >> ReadFromText(known_args.input)
    column = (lines
              | 'email col' >> (beam.ParDo(EmitColDoFn(3)))
              | "col file" >> WriteToText(known_args.output, ".txt", shard_name_template="SS_Col"))
    result = p.run()
    result.wait_until_finish()
    if (not hasattr(result, 'has_job')  # direct runner
            or result.has_job):  # not just a template creation
        lines_filter = MetricsFilter().with_name('lines')
        query_result = result.metrics().query(lines_filter)
        if query_result['counters']:
            lines_counter = query_result['counters'][0]
            print "Lines committed", lines_counter.committed
run()
The last few lines of sample 1 are below:
990,Corabel,Feldbau,cfeldbaurh#deliciousdays.com,Female,84.102.162.190,DJ
991,Kiley,Rottcher,krottcherri#stanford.edu,Male,91.97.155.28,CA
992,Glenda,Clist,gclistrj#state.gov,Female,24.98.253.127,UA
993,Ingunna,Maher,imaherrk#army.mil,Female,159.31.127.19,PL
994,Megan,Giacopetti,mgiacopettirl#instagram.com,Female,115.6.63.52,RU
995,Briny,Dutnall,bdutnallrm#xrea.com,Female,102.81.33.24,SE
996,Jan,Caddan,jcaddanrn#jalbum.net,Female,115.142.222.106,PL
Running this produces the expected output of:
/usr/local/bin/python2.7
/Users/sms/Library/Preferences/PyCharmCE2017.1/scratches/scratch_4.py
No handlers could be found for logger "oauth2client.contrib.multistore_file"
Lines committed 996
Process finished with exit code 0
Now for the strange results. In the next run, the number of lines is increased to 1000.
994,Megan,Giacopetti,mgiacopettirl#instagram.com,Female,115.6.63.52,RU
995,Briny,Dutnall,bdutnallrm#xrea.com,Female,102.81.33.24,SE
996,Jan,Caddan,jcaddanrn#jalbum.net,Female,115.142.222.106,PL
997,Shannen,Gaisford,sgaisfordr7#rediff.com,Female,167.255.222.92,RU
998,Lorianna,Slyne,lslyner8#cbc.ca,Female,54.169.60.13,CN
999,Franklin,Yaakov,fyaakovr9#latimes.com,Male,122.1.92.236,CN
1000,Wilhelmine,Cariss,wcarissra#creativecommons.org,Female,237.48.113.255,PL
But this time the output is:
/usr/local/bin/python2.7
/Users/sms/Library/Preferences/PyCharmCE2017.1/scratches/scratch_4.py
No handlers could be found for logger "oauth2client.contrib.multistore_file"
Lines committed 999
Process finished with exit code 0
Inspection of the output file shows that the last line was NOT processed.
bdutnallrm#xrea.com
jcaddanrn#jalbum.net
sgaisfordr7#rediff.com
lslyner8#cbc.ca
fyaakovr9#latimes.com
Any ideas what is going on here?
'EmitColDoFn' skips the first line it sees, which assumes there is exactly one instance of it for each file. When you have more than 1000 lines, the DirectRunner creates two bundles: 1000 lines in the first one and 1 line in the second, so the only line in the second bundle is skipped as if it were a header. In a Beam application, the input may be split into multiple bundles that are processed in parallel. There is no correlation between the number of files and the number of bundles; the same application can process terabytes of data spread across many files.
ReadFromText has an option, 'skip_header_lines', which you can set to 1 to skip the header line in each of your input files.
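The effect can be reproduced without Beam. The sketch below is a simplified model, not Beam's actual scheduling: it assumes each bundle is handled by a fresh copy of the function, and the 1000/1 split is taken from the description above. SkipFirstFn and run_bundles are hypothetical names.

```python
class SkipFirstFn:
    """Mimics EmitColDoFn: drops the first element it sees, emits column i."""

    def __init__(self, i):
        self.i = i
        self.first = True

    def process(self, element):
        if self.first:
            self.first = False
            return []
        return [element.split(",")[self.i]]


def run_bundles(bundles, column):
    """Process each bundle with a fresh instance, as a runner is free to do."""
    out = []
    for bundle in bundles:
        fn = SkipFirstFn(column)
        for element in bundle:
            out.extend(fn.process(element))
    return out


# One header line followed by 1000 synthetic data lines.
lines = ["id,first,last,email"] + ["%d,f,l,e%d" % (n, n) for n in range(1, 1001)]

# Assumed split: 1000 lines in the first bundle, 1 line in the second.
emitted = run_bundles([lines[:1000], lines[1000:]], column=3)
print(len(emitted))  # 999: the single line in the second bundle is dropped

# Removing the header before any per-element logic runs (which is what
# skip_header_lines=1 does at read time) makes the result independent of bundling.
fixed = [e.split(",")[3] for bundle in (lines[1:1000], lines[1000:]) for e in bundle]
print(len(fixed))  # 1000
```

This matches the reported "Lines committed 999" and the missing last row. In the real pipeline the fix is ReadFromText(known_args.input, skip_header_lines=1) together with removing the first flag from the DoFn.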
I am trying to automate running Nessus scans and downloading their reports using Python. I have been using the nessrest API for Python and can successfully run a scan, but I am not able to download the report in nessus format.
Any ideas how I can do this? I have been using the scan_download module, but it executes before my scan has even finished.
Thanks for the help in advance!
Just looking back at this question, here's an example of using the nessrest API to pull down CSV report exports from your Nessus host:
#!/usr/bin/python2.7
import sys
import os
import io
from nessrest import ness6rest
file_format = 'csv' # options: nessus, csv, db, html
dbpasswd = ''
scan = ness6rest.Scanner(url="https://nessus:8834", login="admin", password="P#ssword123", insecure=True)
scan.action(action='scans', method='get')
folders = scan.res['folders']
scans = scan.res['scans']
if scan:
    scan.action(action='scans', method='get')
    folders = scan.res['folders']
    scans = scan.res['scans']
    for f in folders:
        if not os.path.exists(f['name']):
            if not f['type'] == 'trash':
                os.mkdir(f['name'])
    for s in scans:
        scan.scan_name = s['name']
        scan.scan_id = s['id']
        folder_name = next(f['name'] for f in folders if f['id'] == s['folder_id'])
        folder_type = next(f['type'] for f in folders if f['id'] == s['folder_id'])
        # skip trash items
        if folder_type == 'trash':
            continue
        if s['status'] == 'completed':
            file_name = '%s_%s.%s' % (scan.scan_name, scan.scan_id, file_format)
            file_name = file_name.replace('\\','_')
            file_name = file_name.replace('/','_')
            file_name = file_name.strip()
            relative_path_name = folder_name + '/' + file_name
            # PDF not yet supported
            # python API wrapper nessrest returns the PDF as a string object instead of a byte object, making writing and correctly encoding the file a chore...
            # other formats can be written out in text mode
            file_modes = 'wb'
            # DB is binary mode
            #if args.format == "db":
            #    file_modes = 'wb'
            with io.open(relative_path_name, file_modes) as fp:
                if file_format != "db":
                    fp.write(scan.download_scan(export_format=file_format))
                else:
                    fp.write(scan.download_scan(export_format=file_format, dbpasswd=dbpasswd))
You can see more examples here:
https://github.com/tenable/nessrest/tree/master/scripts
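The script above only exports scans whose status is already 'completed'. For the original problem, where the download starts before the scan finishes, one option is to poll the status first. This is a minimal sketch: wait_for_status and the get_status callable are hypothetical, and the failure statuses 'canceled' and 'aborted' are assumptions about what the server reports.

```python
import time


def wait_for_status(get_status, done="completed", failed=("canceled", "aborted"),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns `done`; raise on failure or timeout."""
    waited = 0
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError("scan ended with status %r" % status)
        if waited >= timeout_seconds:
            raise TimeoutError("scan still %r after %d s" % (status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


def status_of(scan, scan_id):
    """Look up one scan's status using the same listing call as the script above."""
    scan.action(action='scans', method='get')
    return next(s['status'] for s in scan.res['scans'] if s['id'] == scan_id)
```

Usage would be wait_for_status(lambda: status_of(scan, scan.scan_id)) followed by scan.download_scan(export_format=file_format). The sleep parameter is injectable so the loop can be tested without real waiting.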