How to test a transformation in Palantir Foundry? - unit-testing

We are trying to create a test function for the whole transformation.
import os
from transforms.verbs.testing.TransformRunner import TransformRunner
from transforms.api import Pipeline

from .myproject.datasets import my_transform

# This assumes your test data exists in the folder /test/fixtures/data/ within the repo next to this test
TEST_DATA_DIR = os.path.join(os.path.dirname(__file__), 'fixtures', 'data')


def test_my_transform(spark_session):
    pipeline = Pipeline()
    pipeline.add_transforms(my_transform)
    runner = TransformRunner(pipeline, '/my_fabulous_project', TEST_DATA_DIR)
    output = runner.build_dataset(spark_session, '/my_fabulous_project/output/test')
    assert output.first()['col_c'] == 3
Based on the documentation and this post, we tried to modify the import of the function, but we always get one of these errors:
transforms._errors.TransformTypeError: Expected arguments to be of type <class 'transforms.api._transform.Transform'>
ModuleNotFoundError: No module named 'test.myproject'
ValueError: attempted relative import beyond top-level package
How to create a working end-to-end testing function for a transformation?

The following transformation tests work for functions decorated with either @transform or @transform_df.
my_transform.py is located in the repository in the src/myproject/datasets folder.
from transforms.api import Input, Output, transform_df
from pyspark.sql import functions as F


@transform_df(
    Output('/some_foundry_path/my_dir/out'),
    input_a=Input('/some_foundry_path/my_dir/in'))
def compute_sum(input_a):
    df = input_a.withColumn('col_c', F.col('col_a') + F.col('col_b'))
    return df
Input file: a dataset with columns col_a and col_b (in the tests below it contains the single row 0, 1).
Approach where test inputs are stored in-memory
test_my_transform.py is located in the repository in src/test folder.
from transforms.api import Pipeline
from transforms.verbs.testing.TransformRunner import TransformRunner
from transforms.verbs.testing.datastores import InMemoryDatastore

from myproject.datasets.my_transform import compute_sum


def test_compute_sum(spark_session):
    df_in = spark_session.createDataFrame([
        (0, 1)
    ], ['col_a', 'col_b'])
    df_expected = spark_session.createDataFrame([
        (0, 1, 1)
    ], ['col_a', 'col_b', 'col_c'])

    path_in = '/some_foundry_path/my_dir/in'
    path_out = '/some_foundry_path/my_dir/out'

    pipeline = Pipeline()
    pipeline.add_transforms(compute_sum)

    store = InMemoryDatastore()
    store.store_dataframe(path_in, df_in)

    runner = TransformRunner(pipeline, datastore=store)
    df_out = runner.build_dataset(spark_session, path_out)

    assert df_out.subtract(df_expected).count() == 0
    assert df_expected.subtract(df_out).count() == 0
    assert df_out.schema == df_expected.schema
path_in and path_out are exactly the same as the Input and Output paths of the transformation, so the script is easy to follow.
Approach where test inputs are stored in .csv in repository
This approach is the one in the official documentation. It is more elaborate: it is not as easy to see which paths need to be created, and it may be harder to maintain, because if a dataset path changes, a new repository tree may need to be created.
test_my_transform.py is located in the repository in src/test folder.
from transforms.api import Pipeline
from transforms.verbs.testing.TransformRunner import TransformRunner
import os

from myproject.datasets.my_transform import compute_sum

# Taking this .py file's dir and appending the path to the test data
TEST_DATA_DIR = os.path.join(os.path.dirname(__file__), 'fixtures/data/input')


def test_compute_sum(spark_session):
    path_in_prefix = '/some_foundry_path/my_dir'
    path_out = '/some_foundry_path/my_dir/out'

    pipeline = Pipeline()
    pipeline.add_transforms(compute_sum)

    runner = TransformRunner(pipeline, path_in_prefix, TEST_DATA_DIR)
    df_out = runner.build_dataset(spark_session, path_out)

    assert df_out.head()['col_c'] == 1
The test CSV file (in.csv; it has the same name, in, as the transformation Input) is created inside the repository:
col_a,col_b
0,1
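For reference, the repository layout implied by the paths above would look roughly like this (a sketch; directory names are taken from TEST_DATA_DIR, the test file location, and the Input path):
transforms-python/
└── src/
    ├── myproject/
    │   └── datasets/
    │       └── my_transform.py
    └── test/
        ├── test_my_transform.py
        └── fixtures/
            └── data/
                └── input/
                    └── in.csv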
Note: for all of the inputs, the Input path (/some_foundry_path/my_dir/in) minus path_in_prefix (/some_foundry_path/my_dir/) should be equal to the CSV test file's full path (...src/test/fixtures/data/input/in) minus TEST_DATA_DIR (...src/test/fixtures/data/input).
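In this example both differences resolve to the same relative path: stripping /some_foundry_path/my_dir from /some_foundry_path/my_dir/in leaves /in, and stripping ...src/test/fixtures/data/input from ...src/test/fixtures/data/input/in also leaves /in, which is presumably how TransformRunner maps the Input to the in.csv fixture.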
To make tests run automatically together with checks, uncomment the following line in transforms-python/build.gradle:
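In a standard transforms-python repository this is, if I remember correctly, the commented-out pytest defaults plugin line, roughly apply plugin: 'com.palantir.transforms.lang.pytest-defaults' (treat the exact plugin name as an assumption and check it against your own build.gradle).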

After trying out several approaches under different conditions, the following approach seems cleanest to me:
- no hard-coded paths to datasets
- it is very explicit about adding or removing transformation inputs
- in-memory dataframes are used as test inputs
test_my_transform.py
from transforms.api import Pipeline
from transforms.verbs.testing.TransformRunner import TransformRunner
from transforms.verbs.testing.datastores import InMemoryDatastore

from myproject.datasets.my_transform import compute_sum


def test_compute_sum(spark_session):
    df_input1 = spark_session.createDataFrame([
        (0, 2)
    ], ['col_a', 'col_b'])
    df_input2 = spark_session.createDataFrame([
        (0, 1)
    ], ['col_a', 'col_b'])
    df_expected = spark_session.createDataFrame([
        (0, 1, 1),
        (0, 2, 2)
    ], ['col_a', 'col_b', 'col_c'])

    # If @transform_df or @transform_pandas, the key is 'bound_output'
    # If @transform, the key is the name of the Output variable
    output_map = {'out': df_expected}
    input_map = {
        'input_a': df_input1,
        'input_b': df_input2,
    }

    pipeline = Pipeline()
    pipeline.add_transforms(compute_sum)

    store = InMemoryDatastore()
    for inp_name, inp_obj in pipeline.transforms[0].inputs.items():
        store.store_dataframe(inp_obj.alias, input_map[inp_name])

    path_out = pipeline.transforms[0].outputs[list(output_map)[0]].alias
    runner = TransformRunner(pipeline, datastore=store)
    df_out = runner.build_dataset(spark_session, path_out)

    assert df_out.subtract(df_expected).count() == 0
    assert df_expected.subtract(df_out).count() == 0
    assert df_out.schema == df_expected.schema
my_transform.py
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F


@transform(
    out=Output('/some_foundry_path/my_dir/out3'),
    input_a=Input('/some_foundry_path/my_dir/in'),
    input_b=Input('/some_foundry_path/my_dir/in2'))
def compute_sum(input_a, input_b, out):
    input_a = input_a.dataframe()
    input_b = input_b.dataframe()
    df = input_a.unionByName(input_b)
    df = df.withColumn('col_c', F.col('col_a') + F.col('col_b'))
    out.write_dataframe(df)
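If the transform under test were written with @transform_df instead of @transform, the only change needed in the generic test above would be the output key, per the comment in the test, e.g.:
output_map = {'bound_output': df_expected}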

Related

How to deploy a deep learning model (computer vision) in AWS using Lambda

I have trained a background-removal model on my custom images, and I am new to deploying computer vision models. Please, can anyone share a blog post on how to deploy a PyTorch deep learning model (computer vision) using Lambda?
I have written some Lambda functions to take an input image, predict the segmentation, and return a background-removed image as output.
I am not sure whether this function is correct; please check it as well.
Define imports
try:
    import unzip_requirements
except ImportError:
    pass

import json
from io import BytesIO
import time
import os
import base64

import boto3
import numpy as np
from skimage import io
import matplotlib.pyplot as plt
from preprocessing import RescaleT, ToTensorLab
import torch
from PIL import Image
from network.u2net import U2NET

# Define two functions inside handler.py: img_to_base64_str to
# convert binary images to base64 format and load_models to
# load the four pretrained model inside a dictionary and then
# keep them in memory

def img_to_base64_str(img):
    buffered = BytesIO()
    img.save(buffered, format="PNG")
    buffered.seek(0)
    img_byte = buffered.getvalue()
    img_str = "data:image/png;base64," + base64.b64encode(img_byte).decode()
    return img_str

def load_models(s3, bucket):
    model = U2NET(3, 1)
    response = s3.get_object(
        Bucket=bucket, Key=f"models/u2net/u2net.pth")
    state = torch.load(BytesIO(response["Body"].read()), map_location=torch.device('cpu'))
    model.load_state_dict(state)
    model.eval()
    return model

def preprocess_raw_img(raw_img_array):
    """
    This function preprocesses a raw input array in a way such that it can be fed into the U-2-Net architecture
    :param raw_img_array:
    :return:
    """
    rescaler = RescaleT(320)
    rescaled_img = rescaler(raw_img_array)
    tensor_converter = ToTensorLab(flag=0)
    tensor_img = tensor_converter(rescaled_img)
    tensor_img = tensor_img.unsqueeze(0)
    return tensor_img

def normPRED(d):
    ma = torch.max(d)
    mi = torch.min(d)
    dn = (d - mi) / (ma - mi)
    return dn

def resize_img_to_orig(prediction_np, orig_img):
    image = Image.fromarray(prediction_np * 255).convert('RGB')
    image_original = image.resize((orig_img.shape[1], orig_img.shape[0]), resample=Image.BILINEAR)
    return image_original

def mask_to_orig_size(orig_img, rescale, threshold):
    mask_orig_size = np.array(orig_img, dtype=np.float64)
    mask_orig_size /= rescale
    mask_orig_size[mask_orig_size > threshold] = 1
    mask_orig_size[mask_orig_size <= threshold] = 0
    return mask_orig_size

def extract_foreground(mask_orig_size):
    shape = mask_orig_size.shape
    a_layer_init = np.ones(shape=(shape[0], shape[1], 1))
    mul_layer = np.expand_dims(mask_orig_size[:, :, 0], axis=2)
    a_layer = mul_layer * a_layer_init
    rgba_out = np.append(mask_orig_size, a_layer, axis=2)
    return rgba_out

def input_to_rgba_inp(input_arr, rescale):
    input_arr = np.array(input_arr, dtype=np.float64)
    shape = input_arr.shape
    input_arr /= rescale
    a_layer = np.ones(shape=(shape[0], shape[1], 1))
    rgba_inp = np.append(input_arr, a_layer, axis=2)
    return rgba_inp

def u2net_api_call(raw_img_array, model):
    """
    This function takes as input an image array of any size. The goal is to return only the object in the foreground of
    the image.
    Therefore, the raw input image is preprocessed and fed into the deep learning model. Afterwards the foreground of the
    original image is extracted from the mask which was generated by the deep learning model.
    """
    THRESHOLD = 0.9
    RESCALE = 255
    preprocessed_img = preprocess_raw_img(raw_img_array)
    d1, d2, d3, d4, d5, d6, d7 = model(preprocessed_img)
    prediction = d1[:, 0, :, :]
    prediction = normPRED(prediction)
    prediction_np = prediction.squeeze().cpu().data.numpy()
    img_orig_size = resize_img_to_orig(prediction_np, raw_img_array)
    mask_orig_size = mask_to_orig_size(img_orig_size, RESCALE, THRESHOLD)
    rgba_out = extract_foreground(mask_orig_size)
    rgba_inp = input_to_rgba_inp(raw_img_array, RESCALE)
    rem_back = (rgba_inp * rgba_out)
    return rem_back

s3 = boto3.client("s3")
bucket = "sagemaker-m-model"
model = load_models(s3, bucket)

def lambda_handler(event, Context):
    if event.get("source") in ["aws.events", "serverless-plugin-warmup"]:
        print('Lambda is warm!')
        return {}

    data = json.loads(event["body"])
    print("data keys :", data.keys())
    image = data["image"]
    image = image[image.find(",") + 1:]
    dec = base64.b64decode(image + "===")
    image = Image.open(BytesIO(dec))
    # image = image.convert("RGB")

    # loading the model with the selected style based on the model_id payload
    # (the model loaded above at module scope is reused here)

    # resize the image based on the load_size payload
    # load_size = int(data["load_size"])

    with torch.no_grad():
        background_removed = u2net_api_call(image, model)

    output_image = background_removed[0]
    # deprocess, (0, 1)
    output_image = output_image.data.cpu().float() * 0.5 + 0.5
    output_image = output_image.numpy()
    output_image = np.uint8(output_image.transpose(1, 2, 0) * 255)
    output_image = Image.fromarray(background_removed)

    # convert the PIL image to base64
    result = {
        "output": img_to_base64_str(output_image)
    }

    # send the result back to the client inside the body field
    return {
        "statusCode": 200,
        "body": json.dumps(result),
        "headers": {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*'
        }
    }
I have tried deploying with the Serverless framework, and I got these errors:
Running "serverless" from node_modules
Warning: Invalid configuration encountered
at 'custom.warmup.events': must be object
at 'custom.warmup.timeout': must be object
at 'functions.transformImage.warmup': must be object
Learn more about configuration validation here: http://slss.io/configuration-validation
Deploying br to stage dev (us-east-1)
Warning: WarmUp: Skipping warmer "events" creation. No functions to warm up.
Warning: WarmUp: Skipping warmer "timeout" creation. No functions to warm up.
✖ Stack br-dev failed to deploy (11s)
Environment: linux, node 16.14.0, framework 3.4.0 (local) 3.4.0v (global), plugin 6.1.2, SDK 4.3.1
Docs: docs.serverless.com
Support: forum.serverless.com
Bugs: github.com/serverless/serverless/issues
Error:
Error: `docker run --rm -v /home/suri/project1/rmbg/br/cache/cf58e2124c894818b4beab8df9ac26ac92eeb326c8c74fc7e60e8f08ea86df1e_x86_64_slspyc:/var/task:z -v /home/suri/project1/rmbg/br/cache/downloadCacheslspyc:/var/useDownloadCache:z lambci/lambda:build-python3.6 /bin/sh -c chown -R 0\:0 /var/useDownloadCache && python3.6 -m pip install -t /var/task/ -r /var/task/requirements.txt --cache-dir /var/useDownloadCache && chown -R 0\:0 /var/task && chown -R 0\:0 /var/useDownloadCache` Exited with code 1
at ChildProcess.<anonymous> (/home/suri/project1/rmbg/br/node_modules/child-process-ext/spawn.js:38:8)
at ChildProcess.emit (node:events:520:28)
at ChildProcess.emit (node:domain:475:12)
at maybeClose (node:internal/child_process:1092:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)
3 deprecations found: run 'serverless doctor' for more details

Write files to GCS using Dataflow

I am pre-processing batch data with Dataflow. The workload is read from Google Cloud Storage (GCS), processed by Dataflow, and uploaded back to GCS.
But after processing the data, I checked GCS and found:
result-001.csv
result-002.csv
result-003.csv
This is how the data is divided and stored. Can't I combine these files into one?
# -*- coding: utf-8 -*-
import apache_beam as beam
import csv
import json
import os
import re

from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json


def preprocessing(fields):
    fields = fields.split(",")
    header = "label"
    for i in range(0, 784):
        header += (",pixel" + str(i))
    label_list_str = "["
    label_list = []
    for i in range(0, 10):
        if fields[0] == str(i):
            label_list_str += str(i)
        else:
            label_list_str += ("0")
        if i != 9:
            label_list_str += ","
    label_list_str += "],"
    for i in range(1, len(fields)):
        label_list_str += fields[i]
        if i != len(fields) - 1:
            label_list_str += ","
    yield label_list_str


def run(project, bucket, dataset):
    argv = [
        "--project={0}".format(project),
        "--job_name=kaggle-dataflow",
        "--save_main_session",
        "--region=asia-northeast1",
        "--staging_location=gs://{0}/kaggle-bucket-v1/".format(bucket),
        "--temp_location=gs://{0}/kaggle-bucket-v1/".format(bucket),
        "--max_num_workers=8",
        "--worker_region=asia-northeast3",
        "--worker_disk_type=compute.googleapis.com/projects//zones//diskTypes/pd-ssd",
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        "--runner=DataflowRunner",
        "--worker_region=asia-northeast3"
    ]

    result_output = 'gs://kaggle-bucket-v1/result/result.csv'
    filename = "gs://{}/train.csv".format(bucket)

    pipeline = beam.Pipeline(argv=argv)
    ptransform = (pipeline
                  | "Read from GCS" >> beam.io.ReadFromText(filename)
                  | "Kaggle data pre-processing" >> beam.FlatMap(preprocessing)
                  )
    (ptransform
     | "events:out" >> beam.io.WriteToText(
         result_output
     )
     )
    pipeline.run()


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Run pipeline on the cloud")
    parser.add_argument("--project", dest="project", help="Unique project ID", required=True)
    parser.add_argument("--bucket", dest="bucket", help="Bucket where your data were ingested", required=True)
    parser.add_argument("--dataset", dest="dataset", help="BigQuery dataset")
    args = vars(parser.parse_args())
    print("Correcting timestamps and writing to BigQuery dataset {}".format(args["dataset"]))
    run(project=args["project"], bucket=args["bucket"], dataset=args["dataset"])
Thank you for reading :)
The beam.io.WriteToText method automatically splits the output into multiple files (shards) when writing, for best performance. You can explicitly pass the parameter num_shards=1 if you want one file only. From the documentation:
num_shards (int) – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards. Constraining the number of shards is likely to reduce the performance of a pipeline. Setting this value is not recommended unless you require a specific number of output files.
Your WriteToText step should look like this:
(ptransform
 | "events:out" >> beam.io.WriteToText(
     result_output, num_shards=1
 )
 )

How to create a pyspark udf, calling a class function from another class function in the same file?

I'm creating a pyspark udf inside a class-based view, and I have the function that I want to call inside another class-based view; both of them are in the same file (api.py). But when I inspect the content of the resulting dataframe, I get this error:
ModuleNotFoundError: No module named 'api'
I can't understand why this happens. I tried similar code in the pyspark console and it worked fine. A similar question was asked here, but the difference is that I'm trying to do it in the same file.
This is a piece of my full code:
api.py
class TextMiningMethods():
    def clean_tweet(self, tweet):
        '''
        some logic here
        '''
        return "Hello: " + tweet


class BigDataViewSet(TextMiningMethods, viewsets.ViewSet):

    @action(methods=['post'], detail=False)
    def word_cloud(self, request, *args, **kwargs):
        '''
        some previous logic here
        '''
        spark = SparkSession \
            .builder \
            .master("spark://" + SPARK_WORKERS) \
            .appName('word_cloud') \
            .config("spark.executor.memory", '2g') \
            .config('spark.executor.cores', '2') \
            .config('spark.cores.max', '2') \
            .config("spark.driver.memory", '2g') \
            .getOrCreate()
        sc.sparkContext.addPyFile('path/to/udfFile.py')

        cols = ['text']
        rows = []
        for tweet_account_index, tweet_account_data in enumerate(tweets_list):
            tweet_data_aux_pandas_df = pd.Series(tweet_account_data['tweet']).dropna()
            for tweet_index, tweet in enumerate(tweet_data_aux_pandas_df):
                row = [tweet['text']]
                rows.append(row)

        # Create a Pandas Dataframe of tweets
        tweet_pandas_df = pd.DataFrame(rows, columns=cols)

        schema = StructType([
            StructField("text", StringType(), True)
        ])

        # Converts to Spark DataFrame
        df = spark.createDataFrame(tweet_pandas_df, schema=schema)

        clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())
        clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
        clean_tweet_df.show()  # This line produces the error
This similar test in the pyspark console works fine:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf


def clean_tweet(name):
    return "This is " + name


schema = StructType([StructField("Id", IntegerType(), True), StructField("tweet", StringType(), True)])
data = [[1, "tweet 1"], [2, "tweet 2"], [3, "tweet 3"]]

df = spark.createDataFrame(data, schema=schema)

clean_tweet_udf = udf(clean_tweet, StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["tweet"]))
clean_tweet_df.show()
So these are my questions:
What is this error related to, and how can I fix it?
What is the right way to create a pyspark udf when working with class-based views? Is it bad practice to write the functions you will use as pyspark udfs in the same file where you call them? (In my case, all my API endpoints work with Django REST framework.)
Any help will be appreciated; thanks in advance.
UPDATE:
This link and this link explain how to use custom classes with pyspark using SparkContext, but not with SparkSession, which is my case; still, I used this:
sc.sparkContext.addPyFile('path/to/udfFile.py')
The problem is that I defined the class holding the functions I use as pyspark udfs in the same file where I'm creating the udf for the dataframe (as shown in my code). I couldn't find how to achieve that behaviour when the path passed to addPyFile() points to the same code. In spite of that, I moved my code and followed these steps (this also fixed another error):
Create a new folder called udf
Create a new empty __init__.py file, to make the directory a package.
And create a .py file (pyspark_udf.py) for my udf functions.
core/
udf/
├── __init__.py
├── __pycache__
└── pyspark_udf.py
api/
├── admin.py
├── api.py
├── apps.py
├── __init__.py
In this file, I tried to import the dependencies either at the beginning of the file or inside the function. In all cases I receive ModuleNotFoundError: No module named 'udf'.
pyspark_udf.py
import re
import string
import unidecode
from nltk.corpus import stopwords


class TextMiningMethods():
    """docstring for TextMiningMethods"""

    def clean_tweet(self, tweet):
        # some logic here
I have tried all of these. At the beginning of my api.py file:
from udf.pyspark_udf import TextMiningMethods
# or
from udf.pyspark_udf import *
And inside the word_cloud function
class BigDataViewSet(viewsets.ViewSet):

    def word_cloud(self, request, *args, **kwargs):
        from udf.pyspark_udf import TextMiningMethods
In the python debugger this line works:
from udf.pyspark_udf import TextMiningMethods
But when I show the dataframe, I receive the error:
clean_tweet_df.show()
ModuleNotFoundError: No module named 'udf'
Obviously, the original problem has changed into another one; now my problem is more related to this question, but I still couldn't find a satisfactory way to import the file and create a pyspark udf that calls a class function from another class function.
What am I missing?
After different tries, I couldn't find a solution by referencing a method via the path passed to addPyFile(), whether it was located in the same file where I was creating the udf (I would like to know if this is bad practice) or in another file. Technically, the addPyFile(path) documentation says:
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
So what I mention should be possible. Based on that, I had to use this solution and zip the whole udf folder from its highest level with:
zip -r udf.zip udf
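A minimal sketch of how the zipped package is then shipped to the executors and used when building the udf (the session setup and the zip path here are assumptions for illustration, not my actual code):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
# ship the whole zipped package so the workers can resolve the 'udf' module
spark.sparkContext.addPyFile('path/to/udf.zip')

from udf.pyspark_udf import TextMiningMethods

clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())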
Also, in pyspark_udf.py I had to import my dependencies as shown below to avoid this problem:
class TextMiningMethods():
    """docstring for TextMiningMethods"""

    def clean_tweet(self, tweet):
        import re
        import string
        import unidecode
        from nltk.corpus import stopwords
Instead of:
import re
import string
import unidecode
from nltk.corpus import stopwords


class TextMiningMethods():
    """docstring for TextMiningMethods"""

    def clean_tweet(self, tweet):
Then, finally, this line worked fine:
clean_tweet_df.show()
I hope this can be useful for anyone else.
Thank you! Your approach worked for me.
Just to clarify my steps:
Made a udf module with __init__.py and pyspark_udfs.py
Made a bash file to zip udfs first and then run my files on the top level:
runner.sh
echo "zipping udfs..."
zip -r udf.zip udf
echo "udfs zipped"
echo "running script..."
/opt/conda/bin/python runner.py
echo "script ended."
In the actual code, I imported my udfs from the udf.pyspark_udfs module and initialized them in the Python function where I need them, like so:
def _produce_period_statistics(self, df: pyspark.sql.DataFrame, period: str) -> pyspark.sql.DataFrame:
    """Produces basic and trend statistics based on user visits."""
    # udfs
    get_hist_vals_udf = F.udf(lambda array, bins, _range: get_histogram_values(array, bins, _range), ArrayType(IntegerType()))
    get_hist_edges_udf = F.udf(lambda array, bins, _range: get_histogram_edges(array, bins, _range), ArrayType(FloatType()))
    get_mean_udf = F.udf(get_mean, FloatType())
    get_std_udf = F.udf(get_std, FloatType())
    get_lr_coefs_udf = F.udf(lambda bar_height, bar_edges, hist_upper: get_linear_regression_coeffs(bar_height, bar_edges, hist_upper), StringType())
    ...

Strange behavior of Inception_v3

I am trying to create a generative network based on the pre-trained Inception_v3.
1) I fix all the weights in the model,
2) create a Variable whose size is (2, 3, 299, 299),
3) create targets of size (2, 1000) that I want my final-layer activations to get as close as possible to, by optimizing the Variable.
(I do not use a batch size of 1 because, unlike VGG16, Inception_v3 doesn't accept batchsize=1, but that's not the point.)
The following code should work, but gives me the error: «RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation».
# minimalist code with Inception_v3 that throws the error:
import torch
from torch.autograd import Variable
import torch.optim as optim
import torch.nn as nn
import torchvision

torch.set_default_tensor_type('torch.FloatTensor')

Iv3 = torchvision.models.inception_v3(pretrained=True)
for i in Iv3.parameters():
    i.requires_grad = False

criterion = nn.CrossEntropyLoss()
x = Variable(torch.randn(2, 3, 299, 299), requires_grad=True)
target = torch.empty(2, dtype=torch.long).random_(1000)

output = Iv3(x)
loss = criterion(output[0], target)
loss.backward()
print(x.grad)
This is very strange, because if I do the same thing with VGG16, everything works fine:
# minimalist working code with VGG16:
import torch
from torch.autograd import Variable
import torch.optim as optim
import torch.nn as nn
import torchvision

# torch.cuda.empty_cache()
# vgg16 = torchvision.models.vgg16(pretrained=True).cuda()
# torch.set_default_tensor_type('torch.cuda.FloatTensor')
torch.set_default_tensor_type('torch.FloatTensor')

vgg16 = torchvision.models.vgg16(pretrained=True)
for i in vgg16.parameters():
    i.requires_grad = False

criterion = nn.CrossEntropyLoss()
x = Variable(torch.randn(2, 3, 229, 229), requires_grad=True)
target = torch.empty(2, dtype=torch.long).random_(1000)

output = vgg16(x)
loss = criterion(output, target)
loss.backward()
print(x.grad)
Please help.
Thanks to @iacolippo the issue is solved. It turns out the problem was due to PyTorch 1.0.0; there is no problem with PyTorch 0.4.1, though.

Google Dataflow seems to drop 1000th record

I have set up a small test using Google Dataflow (apache-beam). The use case for the experiment is to take a (csv) file and write a selected column to a (txt) file.
The code for the experiment is as listed below:
from __future__ import absolute_import

import argparse
import logging
import re

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


class EmitColDoFn(beam.DoFn):
    first = True
    header = ""

    def __init__(self, i):
        super(EmitColDoFn, self).__init__()
        self.line_count = Metrics.counter(self.__class__, 'lines')
        self.i = i

    def process(self, element):
        if self.first:
            self.header = element
            self.first = False
        else:
            self.line_count.inc()
            cols = re.split(',', element)
            return (cols[self.i],)


def run(argv=None):
    """Main entry point; defines and runs the wordcount pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='/users/sms/python_beam/data/MOCK_DATA (4).csv',
                        # default='gs://dataflow-samples/shakespeare/kinglear.txt',
                        help='Input file to process.')
    parser.add_argument('--output',
                        dest='output',
                        default="/users/sms/python_beam/data/",
                        # required=True,
                        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)

    # Read the text file[pattern] into a PCollection.
    lines = p | 'read' >> ReadFromText(known_args.input)

    column = (lines
              | 'email col' >> (beam.ParDo(EmitColDoFn(3)))
              | "col file" >> WriteToText(known_args.output, ".txt", shard_name_template="SS_Col"))

    result = p.run()
    result.wait_until_finish()

    if (not hasattr(result, 'has_job')  # direct runner
            or result.has_job):         # not just a template creation
        lines_filter = MetricsFilter().with_name('lines')
        query_result = result.metrics().query(lines_filter)
        if query_result['counters']:
            lines_counter = query_result['counters'][0]
            print "Lines committed", lines_counter.committed


run()
The last few lines of sample 1 are shown below:
990,Corabel,Feldbau,cfeldbaurh#deliciousdays.com,Female,84.102.162.190,DJ
991,Kiley,Rottcher,krottcherri#stanford.edu,Male,91.97.155.28,CA
992,Glenda,Clist,gclistrj#state.gov,Female,24.98.253.127,UA
993,Ingunna,Maher,imaherrk#army.mil,Female,159.31.127.19,PL
994,Megan,Giacopetti,mgiacopettirl#instagram.com,Female,115.6.63.52,RU
995,Briny,Dutnall,bdutnallrm#xrea.com,Female,102.81.33.24,SE
996,Jan,Caddan,jcaddanrn#jalbum.net,Female,115.142.222.106,PL
Running this produces the expected output of:
/usr/local/bin/python2.7
/Users/sms/Library/Preferences/PyCharmCE2017.1/scratches/scratch_4.py
No handlers could be found for logger "oauth2client.contrib.multistore_file"
Lines committed 996
Process finished with exit code 0
Now for the strange results. In the next run, the number of lines is increased to 1000.
994,Megan,Giacopetti,mgiacopettirl#instagram.com,Female,115.6.63.52,RU
995,Briny,Dutnall,bdutnallrm#xrea.com,Female,102.81.33.24,SE
996,Jan,Caddan,jcaddanrn#jalbum.net,Female,115.142.222.106,PL
997,Shannen,Gaisford,sgaisfordr7#rediff.com,Female,167.255.222.92,RU
998,Lorianna,Slyne,lslyner8#cbc.ca,Female,54.169.60.13,CN
999,Franklin,Yaakov,fyaakovr9#latimes.com,Male,122.1.92.236,CN
1000,Wilhelmine,Cariss,wcarissra#creativecommons.org,Female,237.48.113.255,PL
But this time the output is:
/usr/local/bin/python2.7
/Users/sms/Library/Preferences/PyCharmCE2017.1/scratches/scratch_4.py
No handlers could be found for logger "oauth2client.contrib.multistore_file"
Lines committed 999
Process finished with exit code 0
Inspection of the output file shows that the last line was NOT processed.
bdutnallrm#xrea.com
jcaddanrn#jalbum.net
sgaisfordr7#rediff.com
lslyner8#cbc.ca
fyaakovr9#latimes.com
Any ideas what is going on here?
'EmitColDoFn' skips the first line, assuming there is one instance of it for each file. When you have more than 1000 lines, the DirectRunner creates two bundles: 1000 lines in the first one, and 1 line in the second. In a Beam application, the input might be split into multiple bundles for processing in parallel; there is no correlation between the number of files and the number of bundles, and the same application can process terabytes of data spread across many files.
ReadFromText has an option 'skip_header_lines', which you can set to 1 in order to skip the header line in each of your input files.
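A rough sketch of that suggestion applied to the pipeline from the question: ReadFromText drops the header, so the DoFn no longer needs the per-instance first/header bookkeeping (this is an illustrative adaptation, not code from the original post):
class EmitColDoFn(beam.DoFn):
    def __init__(self, i):
        super(EmitColDoFn, self).__init__()
        self.line_count = Metrics.counter(self.__class__, 'lines')
        self.i = i

    def process(self, element):
        # every element is now a data line, so always count and emit
        self.line_count.inc()
        return (re.split(',', element)[self.i],)

lines = p | 'read' >> ReadFromText(known_args.input, skip_header_lines=1)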