I have a microservice that contains multiple Lambda functions.
How can I share the same code between those functions?
Let's imagine I have the following structure:
handler
|
|______ a-handler.js
|
|______ b-handler.js
|
repository
|
|______ users-repository.js
I know that AWS recommends using Lambda Layers for that, but can I simply import the relevant code into each Lambda function?
Like this:
users-repository.js:
export function getUser(...) { ... }
export function createUser(...) { ... }
a-handler.js:
import {getUser} from '../repository/users-repository.js';
// ...
b-handler.js:
import {createUser} from '../repository/users-repository.js';
// ...
The build process will do tree shaking anyway and remove the unused code from each Lambda function, so it will basically build and use only the relevant code.
So what are the benefits of Lambda Layers?
Thanks!
There’s no mandate to use layers to share code. The reasons I’ll end up with a layer over an include are:
Separately versioned items that the ops team should be able to update independently of the build process
The shared item is so large that including it makes bundle sizes unreasonably large
Shared code when there is no deployment system (Pulumi, Terraform, etc.)
Includes have more or less the opposite set of use scenarios:
A dependency that will not change or be versioned separately from the Lambda
A small dependency that is not worth the added complexity of a layer
efx/
  ...
  aws_account/
    nonprod/
      account-variables.tf
      dev/
        account-variables.tf
        common.tf
        app1.tf
        app2.tf
        app3.tf
        ...
modules/
  tf_efxstack_app1
  tf_efxstack_app2
  tf_efxstack_app3
  ...
In a given environment (dev in the example above), we have multiple modules (app1, app2, app3, etc.) which are based on individual applications we are running in the infrastructure.
I am trying to update the state of one module at a time (e.g. app1.tf), and I am not sure how I can do this.
Use case: I would like only one module's launch configuration (LC) to be updated to use the latest AMI or security group.
I tried the -target option in Terraform, but this does not seem to work, because it does not check the Terraform remote state file.
terraform plan -target=app1.tf
terraform apply -target=app1.tf
Therefore, no changes take place. I believe this is a bug in Terraform.
Any ideas how I can accomplish this?
Terraform's -target should be for exceptional use cases only, and you should really know what you're doing when you use it. Note also that -target expects a resource or module address such as module.app1, not a file name like app1.tf, which is why your commands change nothing. If you genuinely need to regularly target different parts at a time, then you should separate your applications into different directories so you can easily apply a whole directory at a time.
This might mean you need to use data sources, or rethink the structure of things a bit more, but it also means you limit the blast radius of any single Terraform action, which is always useful. A rough sketch of what one split-out application directory could look like is below.
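For illustration, a minimal sketch (0.12+ syntax) of a standalone app1 directory; the bucket, key, region, module inputs, and output names are hypothetical, and only stand in for whatever shared values you would actually read via a data source:

# dev/app1/main.tf -- hypothetical standalone configuration managing only app1.
# Shared values (VPC ID, subnets, etc.) are read from the common state
# through a data source instead of being managed here.
data "terraform_remote_state" "common" {
  backend = "s3"
  config = {
    bucket = "efx-terraform-state"        # hypothetical state bucket
    key    = "nonprod/dev/common.tfstate" # hypothetical state key
    region = "us-east-1"
  }
}

module "app1" {
  source = "../../modules/tf_efxstack_app1"

  # hypothetical inputs wired from the shared state's outputs
  vpc_id     = data.terraform_remote_state.common.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.common.outputs.subnet_ids
}

With this layout, a terraform apply run inside dev/app1 only ever touches app1's resources, and no -target flag is needed.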
I completed this tutorial on distributed TensorFlow experiments within an ML Engine experiment and I am looking to define my own custom tier instead of the STANDARD_1 tier that they use in their config.yaml file. If I am using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches," which would suggest that the config.yaml file below is possible:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: complex_model_m
  workerCount: 10
  parameterServerCount: 4
Are any code changes needed to the MNIST tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers, as the tutorial suggests is possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that if you are using the Estimator API (even within your own defined model), there should not need to be any additional changes.
Does any of this change if the config.yaml specifies using GPUs? This article suggests that for the Estimator API, "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml specifically identifies the machine type for parameter servers and workers, I am expecting that within ML Engine the ClusterSpec will be configured properly based on the config.yaml file. Still, I am not able to find any ML Engine documentation that confirms no changes are needed to take advantage of GPUs.
Last, within ML Engine I am wondering whether there are any ways to identify usage of different configurations. The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches" suggests that the benefit of additional workers would be roughly linear, but I don't have any intuition around how to determine whether more parameter servers are needed. What would one be able to check (either in the cloud dashboards or TensorBoard) to determine whether they have a sufficient number of parameter servers?
are any additional code changes needed to create a custom tier of any size?
No; no changes are needed to the MNIST sample to get it to work with a different number or type of workers. To use a tf.estimator.Estimator on Cloud ML Engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads the TF_CONFIG environment variable and populates a RunConfig object with the relevant information, such as the ClusterSpec. It will automatically do the right thing on parameter server nodes, and it will use the provided Estimator to start training and evaluation.
Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
This all means that you can just change your config.yaml by adding/removing workers and/or changing their types and things should generally just work.
For sample code using a custom model_fn, see the census/customestimator example.
That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size was 50 and you were using 10 workers, each worker is processing batches of size 50, for an effective batch size of 10*50=500. If you then increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you need to adjust your learning rate accordingly (scaling it linearly with the effective batch size generally works well; ref).
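To make that arithmetic concrete, here is a tiny sketch; the batch size, worker counts, and learning rate are purely illustrative:

# Effective batch size grows linearly with the number of workers.
per_worker_batch_size = 50

for num_workers in (10, 20):
    effective_batch_size = per_worker_batch_size * num_workers
    print("%d workers -> effective batch size %d" % (num_workers, effective_batch_size))
# prints: 10 workers -> effective batch size 500
#         20 workers -> effective batch size 1000

# If a learning rate was tuned for 10 workers, linear scaling suggests roughly
# doubling it when moving to 20 workers (hypothetical numbers).
base_learning_rate = 0.01
scaled_learning_rate = base_learning_rate * (20.0 / 10.0)  # -> 0.02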
I poked around some of the other ML Engine samples and found that
reddit_tft uses distributed training, but they appear to have defined
their own runconfig.cluster_spec within their trainer package:
task.py, even though they are also using the Estimator API. So, is there
any additional configuration needed?
No additional configuration is needed. The reddit_tft sample does instantiate its own RunConfig; however, the RunConfig constructor grabs any properties not explicitly set during instantiation by using TF_CONFIG, and it does so only as a convenience to figure out how many parameter servers and workers there are.
Does any of this change if the config.yaml specifies using GPUs?
You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.
I'm learning web crawling with Python. I have a CSV file with a lot of URLs. Using Python 2.7 and Selenium, I'm currently crawling these websites to extract data such as body width (in pixels), HTTP response code, page load speed, and the meta name="viewport" tag.
I then export the results of the script to a CSV file, with each column containing one type of extracted data (see below). I'm planning to extract many more types of data by writing new crawlers.
My current script exports the data so that the CSV file looks like this:
Website | body width | HTTP response | load speed (in secs) | Viewport
www.url1.com | 690 | 200 | 2 | No
www.url2.com | 370 | 404 | 0.5 | Yes
However, my script (one single .py file) is getting longer, and thus a little more complex, as each new function adds more lines of code. I worry that the more functions I add to it, the slower and more error-prone it will get. As I see it, I have two options right now:
Option 1. Keep adding new crawling functions to the existing script file.
Option 2. Write new crawling functions in separate script files: from now on, write each new crawler in its own .py file (1 crawler = 1 .py file) and also split my current script (one single .py file) into multiple crawlers (multiple .py files).
I could then run each crawler separately and write the results of all crawlers into one single CSV file (as illustrated above). By using multiple crawler files, I assume I'll have cleaner, less error-prone, faster, and more flexible crawlers compared to having all crawlers in one .py file like I have now.
So my questions:
What are the pros and cons of option 1 & 2?
Is one option better than the other, if so why?
Is my assumption in option 2 correct?
Excuse me if my post is not specific enough, but getting my questions answered will help me tremendously!
Clean code is good. I would look to put common functions into something like a crawler_library.py, and then have your specific scripts import the functions they need from there (rough sketch below).
With regard to your assumption, it isn't axiomatically true: code split across different scripts is not functionally different from code in one script. Realistically, though, it is generally true. It's easier to maintain and improve, and for most people, putting code into functions lets them modularise what they are trying to do, makes it easier to understand, etc.
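A minimal sketch of that layout; the file names, helper functions, and selectors are only placeholders for whatever your crawlers actually need:

# crawler_library.py -- shared helpers reused by every crawler (names are placeholders)

def get_body_width(driver):
    # Return the rendered body width in pixels for the page the driver has loaded.
    return driver.execute_script("return document.body.offsetWidth")

def has_viewport_meta(driver):
    # Return True if the page declares a meta name="viewport" tag.
    return len(driver.find_elements_by_css_selector('meta[name="viewport"]')) > 0

# body_width_crawler.py -- one specific crawler stays small by importing the helpers
from crawler_library import get_body_width, has_viewport_meta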
if by "better" you mean tidier, cleaner code. Then yes, It will be a lot tidier. If you have other web crawling projects, not just this one, importing them as modules is also great, it means they are reusable, and decoupled. Being able to seamlessly switch modules in and out is a good thing.
I have a module that I'm writing unit tests for, to run with Travis CI.
In my module I perform HTTP POST operations to a web service.
One of my internal-only functions, validate_http_response(), is integral to the functions I'm creating to wrap the web service calls, so I'd like to test it. However, because there is no export validate_http_response, the function can't be "seen" by my test script and I get the error:
validate_http_response not defined
How should I structure my tests so that I don't have to copy and paste the internal functions into the test itself (there are a few of them)? I'd like to avoid having to maintain a src and a test script in parallel.
EDIT: Along with the accepted answer, I also found I could do the following at the beginning of the test script: include("../src/myfunctions.jl"), since I have a separate test script for each file in src.
Check out the documentation on modules to better understand how namespacing works. There is no forced visibility in Julia, so you can always access functions, exported or non-exported, in any module by fully qualifying the reference.
So in your case, if your module is named HTTP, you could say HTTP.validate_http_response to access your unexported function to test.
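As a minimal sketch of such a test (the module name HTTP follows the example above; the input value is just a stand-in for whatever your function actually expects):

# test/runtests.jl -- minimal sketch; the input value is a stand-in
using Test
using HTTP            # your own module from ../src, not the registered HTTP.jl package

fake_response = 200   # stand-in; pass whatever validate_http_response really expects

@testset "validate_http_response" begin
    # Unexported functions are still reachable by fully qualifying the reference.
    @test HTTP.validate_http_response(fake_response)
end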
There are two solutions:
Export the function.
Create a new module which contains validation code for HTTP requests. Move the function there. Now it's part of an official/public API and can be tested independently.
The first solution is simple but soils your API. The second one is clean but probably a lot of work.
In the hybris wiki trails, there is mention of core data vs. essential data vs. sample data. What is the difference between these three types of data?
Ordinarily, I would assume that sample data is illustrative gobbledygook data created to populate the example apparel and electronics storefronts. However, the wiki trails suggest that core data is for non-store-specific data and that sample data is for store-specific data.
On the same page, the wiki states that core data contains cockpit and catalog definitions, email templates, CMS layout, and site definitions (the countries and user groups impex files are included in this as well). This seems rather store-specific to me. Does anyone have an explanation for this?
Yes, I have an explanation. Actually, a lot of this comes down to arbitrary decisions I made when separating data between the acceleratorcore and acceleratorsampledata extensions as part of the Accelerator in 4.5 (later these had the y- prefix added).
Essential and project data are two sets of data that are used within hybris' init/update process. These steps are controlled for each extension via particular annotations on classes and methods.
Core vs. sample data is more about whether I thought an impex file, or particular lines, were specific to the sample store or were more general. You will notice your CoreSystemSetup has both essentialdata and projectdata steps.
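As a rough sketch of how those init/update steps are typically hooked in (the class name, extension name, and method bodies here are made up; the @SystemSetup annotation with its Type and Process values is the standard platform mechanism):

// MyStoreCoreSystemSetup.java -- illustrative only; names are placeholders
import de.hybris.platform.core.initialization.SystemSetup;
import de.hybris.platform.core.initialization.SystemSetupContext;

@SystemSetup(extension = "mystorecore")
public class MyStoreCoreSystemSetup
{
    // Essential data runs on both initialization and update.
    @SystemSetup(type = SystemSetup.Type.ESSENTIAL, process = SystemSetup.Process.ALL)
    public void createEssentialData(final SystemSetupContext context)
    {
        // import the coredata/essentialdata impex files here
    }

    // Project (sample) data is only imported when selected during init/update.
    @SystemSetup(type = SystemSetup.Type.PROJECT, process = SystemSetup.Process.ALL)
    public void createProjectData(final SystemSetupContext context)
    {
        // import the sampledata/projectdata impex files here
    }
}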
Lots of work has happened on various continents since then, so, like much of hybris now, it's a bit of a mess.
There are a few fun bugs related to hybris making certain things part of essentialdata, but these are in the platform, not something I can fix without complaining to various people, etc.
To confuse matters further, there is the yacceleratorinitialdata extension. This extension was a way I hoped to make projects easier, by giving you some impex skeletons for new sites and stores, generated for you during modulegen. It has rotted heavily since release, though, and is now very out of date.
For a better explanation, take a look at this answer from answers.sap.com.
Hybris imports two types of data during the initialization and update processes: the first is essentialdata and the other is projectdata.
Essentialdata is the core data setup, which is mandatory and is imported whenever you run an initialization or update.
Sampledata is your projectdata; it is not mandatory and is only imported when you select the project while updating the system.