Puppeteer: multiple user requests to the same Chromium instance

To save on system resources, I want to run user requests through the same Chromium instance in Puppeteer.
If a user submits a form on my site which calls Puppeteer, and Chromium is already running, how can I use the same Chromium instance up to a maximum of 4 tabs?
If there are more than 4 tabs open in the Chromium instance then I want to launch a new Chromium instance.
How can I achieve this? Would I need to store the browserWSEndpoint of the Chromium instance to a file and then retrieve it every time a new user submits a request? (This would be using browserWSEndpoint with puppeteer.connect()).
If I have to do it this way, let's say there are 2 Chromium browsers active. The most recent browser has the maximum of four open tabs, so I could not use it. I would then check the next browserWSEndpoint and, if it has fewer than 4 open tabs, create a new page there; if not, launch a new browser.
Does that sound OK?
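If all requests are handled by one long-running Node.js process, the plan above can be sketched without writing browserWSEndpoint to a file. This is only a minimal sketch under that assumption: BrowserPool, acquirePage and releasePage are hypothetical names, and the launcher is passed in (for example () => puppeteer.launch()) so the tab-counting logic does not depend on a particular setup. It keeps its own tab counter rather than calling browser.pages(), because a freshly launched browser already has one blank tab open.

```javascript
// Reuse a browser until it holds `maxTabs` pages, then launch another one.
class BrowserPool {
  constructor(launchBrowser, maxTabs = 4) {
    this.launchBrowser = launchBrowser; // e.g. () => puppeteer.launch()
    this.maxTabs = maxTabs;
    this.slots = []; // each slot: { browserPromise, openTabs }
  }

  async acquirePage() {
    // Reserve a tab synchronously so concurrent requests cannot overfill a browser.
    let slot = this.slots.find((s) => s.openTabs < this.maxTabs);
    if (!slot) {
      slot = { browserPromise: this.launchBrowser(), openTabs: 0 };
      this.slots.push(slot);
    }
    slot.openTabs += 1;
    try {
      const browser = await slot.browserPromise;
      const page = await browser.newPage();
      return { page, release: () => this.releasePage(slot, page) };
    } catch (err) {
      // Give the reservation back; drop the slot if nothing else is using it.
      slot.openTabs -= 1;
      if (slot.openTabs === 0) this.slots = this.slots.filter((s) => s !== slot);
      throw err;
    }
  }

  async releasePage(slot, page) {
    await page.close();
    slot.openTabs -= 1;
    if (slot.openTabs === 0) {
      // Remove the idle browser from the pool before closing it.
      this.slots = this.slots.filter((s) => s !== slot);
      const browser = await slot.browserPromise;
      await browser.close();
    }
  }
}

// Usage with Puppeteer (assumed to be installed):
// const puppeteer = require('puppeteer');
// const pool = new BrowserPool(() => puppeteer.launch());
// const { page, release } = await pool.acquirePage();
// try { await page.goto('https://example.com'); } finally { await release(); }
```

If requests are spread over several processes or machines, this counter is no longer enough, and storing endpoints for puppeteer.connect() (as described in the question) becomes the more likely route.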

You can use Lambda, which would save you cost. If you put it behind API Gateway, make sure your function finishes within API Gateway's integration timeout of roughly 30 seconds.
https://github.com/alixaxel/chrome-aws-lambda
NodeJS
const chromium = require('chrome-aws-lambda');

exports.handler = async (event, context) => {
  let result = null;
  let browser = null;
  try {
    // Launch the Chromium build bundled with chrome-aws-lambda.
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
    });
    let page = await browser.newPage();
    await page.goto(event.url || 'https://example.com');
    result = await page.title();
  } catch (error) {
    return context.fail(error);
  } finally {
    // Always close the browser so the Lambda container does not leak processes.
    if (browser !== null) {
      await browser.close();
    }
  }
  return context.succeed(result);
};

Related

AWS CloudWatch Synthetics Canary doesn't load the entire page

I basically want to load the following page every minute and create a screenshot of it: https://www.amazon.com/b?ie=UTF8&node=24088939011
It takes about one minute to create an AWS CloudWatch canary: I open AWS CloudWatch, click on "Synthetics Canaries" on the left hand side, click on "Create canary", enter the URL to a webpage, and just use the default settings except that I change it from running every 5 minutes to running every minute.
Availability tab of the canary I created:
Configuration tab of the canary I created:
The canary runs and says it's 100% successful but when I look at the screenshots I see that the page never loads fully:
The screenshots should show what you would see when opening the page in your browser: https://www.amazon.com/b?ie=UTF8&node=24088939011
This is the default script that is used by the canary:
const { URL } = require('url');
const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');
const syntheticsConfiguration = synthetics.getConfiguration();
const syntheticsLogHelper = require('SyntheticsLogHelper');
const loadBlueprint = async function () {
const urls = ['https://www.amazon.com/b?ie=UTF8&node=24088939011'];
// Set screenshot option
const takeScreenshot = true;
/* Disabling default step screen shots taken during Synthetics.executeStep() calls
* Step will be used to publish metrics on time taken to load dom content but
* Screenshots will be taken outside the executeStep to allow for page to completely load with domcontentloaded
* You can change it to load, networkidle0, networkidle2 depending on what works best for you.
*/
syntheticsConfiguration.disableStepScreenshots();
syntheticsConfiguration.setConfig({
continueOnStepFailure: true,
includeRequestHeaders: true, // Enable if headers should be displayed in HAR
includeResponseHeaders: true, // Enable if headers should be displayed in HAR
restrictedHeaders: [], // Value of these headers will be redacted from logs and reports
restrictedUrlParameters: [] // Values of these url parameters will be redacted from logs and reports
});
let page = await synthetics.getPage();
for (const url of urls) {
await loadUrl(page, url, takeScreenshot);
}
};
// Reset the page in-between
const resetPage = async function(page) {
try {
await page.goto('about:blank',{waitUntil: ['load', 'networkidle0'], timeout: 30000} );
} catch(ex) {
synthetics.addExecutionError('Unable to open a blank page ', ex);
}
}
const loadUrl = async function (page, url, takeScreenshot) {
let stepName = null;
let domcontentloaded = false;
try {
stepName = new URL(url).hostname;
} catch (error) {
const errorString = `Error parsing url: ${url}. ${error}`;
log.error(errorString);
/* If we fail to parse the URL, don't emit a metric with a stepName based on it.
It may not be a legal CloudWatch metric dimension name and we may not have an alarms
setup on the malformed URL stepName. Instead, fail this step which will
show up in the logs and will fail the overall canary and alarm on the overall canary
success rate.
*/
throw error;
}
await synthetics.executeStep(stepName, async function () {
const sanitizedUrl = syntheticsLogHelper.getSanitizedUrl(url);
/* You can customize the wait condition here. For instance, using 'networkidle2' or 'networkidle0' to load page completely.
networkidle0: Navigation is successful when the page has had no network requests for half a second. This might never happen if page is constantly loading multiple resources.
networkidle2: Navigation is successful when the page has no more than 2 network requests for half a second.
domcontentloaded: It's fired as soon as the page DOM has been loaded, without waiting for resources to finish loading. Can be used and then add explicit await page.waitFor(timeInMs)
*/
const response = await page.goto(url, { waitUntil: ['networkidle0'], timeout: 30000});
if (response) {
domcontentloaded = true;
const status = response.status();
const statusText = response.statusText();
const logResponseString = `Response from url: ${sanitizedUrl} Status: ${status} Status Text: ${statusText}`;
//If the response status code is not a 2xx success code
if (response.status() < 200 || response.status() > 299) {
throw `Failed to load url: ${sanitizedUrl} ${response.status()} ${response.statusText()}`;
}
} else {
const logNoResponseString = `No response returned for url: ${sanitizedUrl}`;
log.error(logNoResponseString);
throw new Error(logNoResponseString);
}
});
// Wait for 15 seconds to let page load fully before taking screenshot.
if (domcontentloaded && takeScreenshot) {
await page.waitFor(15000);
await synthetics.takeScreenshot(stepName, 'loaded');
await resetPage(page);
}
};
const urls = [];
exports.handler = async () => {
return await loadBlueprint();
};
I tried creating a canary in exactly the same way but for another similar page (https://www.amazon.ca/b?ie=UTF8&node=6548466011) and it just works:
What am I doing wrong? Why aren't the screenshots that the canary takes showing the fully loaded page?
Why are parts of the page missing in the screenshot but they show up correctly when I open the page (https://www.amazon.com/b?ie=UTF8&node=24088939011) in my browser?

Authenticate AWS lambda against Google Sheets API

I am trying to create an AWS Lambda function that reads rows from multiple Google Sheets documents using the Google Sheets API, merges them, and writes the result to another spreadsheet. To do so I followed all the necessary steps according to several tutorials:
Create credentials for the AWS user to have the key pair.
Create a Google Service Account, download the credentials.json file.
Share each necessary spreadsheet with the Google Service Account client_email.
When I execute the program locally it works perfectly: it successfully logs in using the credentials.json file and reads and writes all the necessary documents.
However, when I upload it to AWS Lambda using the serverless framework and google-spreadsheet, the program fails silently in the authentication step. I've tried changing the permissions as recommended in this question but it still fails. The file is read properly and I can print it to the console.
This is the simplified code:
async function getData(spreadsheet, psychologistName) {
  await spreadsheet.useServiceAccountAuth(clientSecret);
  // It never gets to this point, it fails silently
  await spreadsheet.loadInfo();
  ... etc ...
}

async function main() {
  const promises = Object.entries(psychologistSheetIDs).map(async (psychologistSheetIdPair) => {
    const [psychologistName, googleSheetId] = psychologistSheetIdPair;
    const sheet = new GoogleSpreadsheet(googleSheetId);
    psychologistScheduleData = await getData(sheet, psychologistName);
    return psychologistScheduleData;
  });
  // When all sheets are available, merge their data and write back in joint view.
  Promise.all(promises).then(async (psychologistSchedules) => {
    ... merge the data ...
  });
}

module.exports.main = async (event, context, callback) => {
  const result = await main();
  return {
    statusCode: 200,
    body: JSON.stringify(
      result,
      null,
      2
    ),
  };
};

I solved it.
Locally, Promise.all(promises).then(result => ...) eventually resolved and ran the code inside then(), but on AWS Lambda the handler returned before the promises were resolved.
This solved it:
const res = await Promise.all(promises);
mergeData(res);
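To illustrate the difference, here is a small self-contained sketch. fetchSheet is a hypothetical stand-in for the real spreadsheet read, not part of google-spreadsheet; the two functions differ only in whether Promise.all is awaited.

```javascript
// Hypothetical stand-in for an asynchronous sheet read.
const fetchSheet = (name) =>
  new Promise((resolve) => setTimeout(() => resolve(`${name}-rows`), 10));

// Not awaited: the function returns before the .then() callback has run,
// so the caller (the Lambda handler) sees undefined and finishes early.
async function mainWithoutAwait() {
  const promises = ['a', 'b'].map(fetchSheet);
  let merged;
  Promise.all(promises).then((rows) => {
    merged = rows.join(',');
  });
  return merged;
}

// Awaited: the function only returns once every promise has resolved.
async function mainWithAwait() {
  const promises = ['a', 'b'].map(fetchSheet);
  const rows = await Promise.all(promises);
  return rows.join(',');
}
```

Locally the Node.js process stays alive until the pending callbacks have run, which is why the first version appeared to work there.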

Google IOT per device heartbeat alert using Stackdriver

I'd like to alert on the lack of a heartbeat (or 0 bytes received) from any one of a large number of Google IoT Core devices. I can't seem to do this in Stackdriver. Instead, it appears to only let me alert on the entire device registry, which does not give me what I'm looking for (how would I know that a particular device is disconnected?).
So how does one go about doing this?

I have no idea why this question was downvoted as 'too broad'.
The truth is Google IOT doesn't have per device alerting, but instead offers only alerting on an entire device registry. If this is not true, please reply to this post. The page that clearly states this is here:
Cloud IoT Core exports usage metrics that can be monitored
programmatically or accessed via Stackdriver Monitoring. These metrics
are aggregated at the device registry level. You can use Stackdriver
to create dashboards or set up alerts.
The importance of having per device alerting is built into the promise assumed in this statement:
Operational information about the health and functioning of devices is
important to ensure that your data-gathering fabric is healthy and
performing well. Devices might be located in harsh environments or in
hard-to-access locations. Monitoring operational intelligence for your
IoT devices is key to preserving the business-relevant data stream.
So it's not easy today to get an alert if one among many globally dispersed devices loses connectivity. One needs to build that, and depending on what one is trying to do, it would entail different solutions.
In my case I wanted to alert if the last heartbeat time or last event state publish was older than 5 minutes. For this I need to run a looping function that scans the device registry and performs this operation regularly. The usage of this API is outlined in this other SO post: Google iot core connection status
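The staleness check itself can be sketched as a pure function. This is a minimal sketch, assuming each device object carries the ISO timestamp fields lastHeartbeatTime, lastEventTime and lastStateTime; findStaleDevices and lastSeenMs are hypothetical helper names, and fetching the device list from the registry is left out.

```javascript
const STALE_AFTER_MS = 5 * 60 * 1000; // 5 minutes

// Latest timestamp (in ms since epoch) at which the device was seen, or null.
function lastSeenMs(device) {
  const stamps = [device.lastHeartbeatTime, device.lastEventTime, device.lastStateTime]
    .map((t) => Date.parse(t))
    .filter((ms) => !Number.isNaN(ms));
  return stamps.length ? Math.max(...stamps) : null;
}

// Devices never seen, or last seen more than maxAgeMs ago.
function findStaleDevices(devices, nowMs = Date.now(), maxAgeMs = STALE_AFTER_MS) {
  return devices.filter((device) => {
    const seen = lastSeenMs(device);
    return seen === null || nowMs - seen > maxAgeMs;
  });
}
```

A scheduled job could call findStaleDevices on each scan of the registry and raise an alert for every device it returns.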

For reference, here's a Firebase function I just wrote to check a device's online status. It probably needs some tweaks and further testing, but it should give anybody else something to start with:
// Example code to call this function
// const checkDeviceOnline = functions.httpsCallable('checkDeviceOnline');
// Include 'wasOnline' key with the last known online status to force an update on the db when it differs
// const isOnline = await checkDeviceOnline({ deviceID: 'XXXX', wasOnline: true })
export const checkDeviceOnline = functions.https.onCall(async (data, context) => {
if (!context.auth) {
throw new functions.https.HttpsError('failed-precondition', 'You must be logged in to call this function!');
}
// deviceID is passed in deviceID object key
const deviceID = data.deviceID
const dbUpdate = (isOnline) => {
if (('wasOnline' in data) && data.wasOnline !== isOnline) {
db.collection("devices").doc(deviceID).update({ online: isOnline })
}
return isOnline
}
const deviceLastSeen = () => {
// We only want to use these to determine "latest seen timestamp"
const stamps = ["lastHeartbeatTime", "lastEventTime", "lastStateTime", "lastConfigAckTime", "deviceAckTime"]
// Read the timestamps from the fetched device (iotDevice), not from the request payload
return stamps.map(key => moment(iotDevice[key], "YYYY-MM-DDTHH:mm:ssZ").unix()).filter(epoch => !isNaN(epoch) && epoch > 0).sort().reverse().shift()
}
await dm.setAuth()
const iotDevice: any = await dm.getDevice(deviceID)
if (!iotDevice) {
throw new functions.https.HttpsError('failed-get-device', 'Failed to get device!');
}
console.log('iotDevice', iotDevice)
// If there is no error status and there is last heartbeat time, assume device is online
if (!iotDevice.lastErrorStatus && iotDevice.lastHeartbeatTime) {
return dbUpdate(true)
}
// Add iotDevice.config.deviceAckTime to root of object
// For some reason in all my tests, I NEVER receive anything on lastConfigAckTime, so this is my workaround
if (iotDevice.config && iotDevice.config.deviceAckTime) iotDevice.deviceAckTime = iotDevice.config.deviceAckTime
// If there is a last error status, let's make sure it's not a stale (old) one
const lastSeenEpoch = deviceLastSeen()
const errorEpoch = iotDevice.lastErrorTime ? moment(iotDevice.lastErrorTime, "YYYY-MM-DDTHH:mm:ssZ").unix() : false
console.log('lastSeen:', lastSeenEpoch, 'errorEpoch:', errorEpoch)
// Device should be online, the error timestamp is older than latest timestamp for heartbeat, state, etc
if (lastSeenEpoch && errorEpoch && (lastSeenEpoch > errorEpoch)) {
return dbUpdate(true)
}
// error status code 4 matches
// lastErrorStatus.code = 4
// lastErrorStatus.message = mqtt: SERVER: The connection was closed because MQTT keep-alive check failed.
// will also be 4 for other mqtt errors like command not sent (qos 1 not acknowledged, etc)
if (iotDevice.lastErrorStatus && iotDevice.lastErrorStatus.code && iotDevice.lastErrorStatus.code === 4) {
return dbUpdate(false)
}
return dbUpdate(false)
})
I also created a function to use with commands, to send a command to the device to check if it's online:
export const isDeviceOnline = functions.https.onCall(async (data, context) => {
if (!context.auth) {
throw new functions.https.HttpsError('failed-precondition', 'You must be logged in to call this function!');
}
// deviceID is passed in deviceID object key
const deviceID = data.deviceID
await dm.setAuth()
const dbUpdate = (isOnline) => {
if (('wasOnline' in data) && data.wasOnline !== isOnline) {
console.log( 'updating db', deviceID, isOnline )
db.collection("devices").doc(deviceID).update({ online: isOnline })
} else {
console.log('NOT updating db', deviceID, isOnline)
}
return isOnline
}
try {
await dm.sendCommand(deviceID, 'alive?', 'alive')
console.log('Assuming device is online after successful alive? command')
return dbUpdate(true)
} catch (error) {
console.log("Unable to send alive? command", error)
return dbUpdate(false)
}
})
This also uses my modified version of DeviceManager. You can find all the example code in this gist (so you get the latest update, and to keep this post small):
https://gist.github.com/tripflex/3eff9c425f8b0c037c40f5744e46c319
All of this code, just to check if a device is online or not ... which could be easily handled by Google emitting some kind of event or adding an easy way to handle this. COME ON GOOGLE GET IT TOGETHER!

Missing request headers in puppeteer

I want to read the request cookie during a test written with Puppeteer, but I noticed that most of the requests I inspect have only the referrer and user-agent headers. If I look at the same requests in Chrome DevTools, they have a lot more headers, including Cookie. To check it out, copy and paste the code below into https://try-puppeteer.appspot.com/.
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on('request', function(request) {
console.log(JSON.stringify(request.headers, null, 2));
});
await page.goto('https://google.com/', {waitUntil: 'networkidle'});
await browser.close();
Is there a restriction on which request headers you can and cannot access? Is it a limitation of Chrome itself or of Puppeteer?
Thanks for suggestions!

I also saw this when I was trying to use Puppeteer to test some CORS behaviour - I found the Origin header was missing from some requests.
Having a look around the GitHub issues, I found one which mentioned that Puppeteer does not listen to the Network.responseReceivedExtraInfo event of the underlying Chrome DevTools Protocol. This event provides extra response headers that are not available to the Network.responseReceived event. There is also a similar Network.requestWillBeSentExtraInfo event for requests.
Hooking up to these events seemed to get me all the headers I needed. Here is some sample code which captures the data from all these events and merges it onto a single object keyed by request ID:
// Setup.
const browser = await puppeteer.launch()
const page = await browser.newPage()
const cdpRequestDataRaw = await setupLoggingOfAllNetworkData(page)
// Make requests.
await page.goto('http://google.com/')
// Log captured request data.
console.log(JSON.stringify(cdpRequestDataRaw, null, 2))
await browser.close()
// Returns map of request ID to raw CDP request data. This will be populated as requests are made.
async function setupLoggingOfAllNetworkData(page) {
const cdpSession = await page.target().createCDPSession()
await cdpSession.send('Network.enable')
const cdpRequestDataRaw = {}
const addCDPRequestDataListener = (eventName) => {
cdpSession.on(eventName, request => {
cdpRequestDataRaw[request.requestId] = cdpRequestDataRaw[request.requestId] || {}
Object.assign(cdpRequestDataRaw[request.requestId], { [eventName]: request })
})
}
addCDPRequestDataListener('Network.requestWillBeSent')
addCDPRequestDataListener('Network.requestWillBeSentExtraInfo')
addCDPRequestDataListener('Network.responseReceived')
addCDPRequestDataListener('Network.responseReceivedExtraInfo')
return cdpRequestDataRaw
}
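Once the data is captured, the headers from the two request events can be combined per request. A minimal sketch, assuming the entry shape produced by the code above (one object per request ID, keyed by event name); mergedRequestHeaders and lowerCaseKeys are hypothetical helper names. Header names are lower-cased because the two events may report the same header with different casing.

```javascript
// Lower-case all header names so both events can be merged consistently.
function lowerCaseKeys(headers = {}) {
  return Object.fromEntries(
    Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value])
  );
}

// Merge headers from requestWillBeSent with those from requestWillBeSentExtraInfo.
// Values from the ExtraInfo event win when both events report the same header.
function mergedRequestHeaders(entry) {
  const basic = entry['Network.requestWillBeSent']?.request?.headers;
  const extra = entry['Network.requestWillBeSentExtraInfo']?.headers;
  return { ...lowerCaseKeys(basic), ...lowerCaseKeys(extra) };
}

// Usage with the map returned by setupLoggingOfAllNetworkData:
// for (const [requestId, entry] of Object.entries(cdpRequestDataRaw)) {
//   console.log(requestId, mergedRequestHeaders(entry).cookie)
// }
```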

That's because your browser sets a bunch of headers depending on settings and capabilities, and also includes e.g. the cookies that it has stored locally for the specific page.
If you want to add additional headers, you can use methods such as:
page.setExtraHTTPHeaders
page.setUserAgent
page.setCookie
With these you can mimic the extra headers that you see your Chrome browser dispatching.

Identity Server 3 Facebook Login Get Email

Identity server is implemented and working well. Google login is working and is returning several claims including email.
Facebook login is working, and my app is live and requests email permissions when a new user logs in.
The problem is that I can't get the email back from the oauth endpoint and I can't seem to find the access_token to manually request user information. All I have is a "code" returned from the facebook login endpoint.
Here's the IdentityServer setup.
var fb = new FacebookAuthenticationOptions
{
AuthenticationType = "Facebook",
SignInAsAuthenticationType = signInAsType,
AppId = ConfigurationManager.AppSettings["Facebook:AppId"],
AppSecret = ConfigurationManager.AppSettings["Facebook:AppSecret"]
};
fb.Scope.Add("email");
app.UseFacebookAuthentication(fb);
Then of course I've customized the AuthenticateLocalAsync method, but the claims I'm receiving only include name. No email claim.
Digging through the source code for identity server, I realized that there are some claims things happening to transform facebook claims, so I extended that class to debug into it and see if it was stripping out any claims, which it's not.
I also watched the HTTP calls with Fiddler, and I only see the following (apologies, code formatting doesn't work very well on URLs; I tried to put the query string params on their own lines but it didn't take):
(facebook.com)
/dialog/oauth
?response_type=code
&client_id=xxx
&redirect_uri=https%3A%2F%2Fidentity.[site].com%2Fid%2Fsignin-facebook
&scope=email
&state=xxx
(facebook.com)
/login.php
?skip_api_login=1
&api_key=xxx
&signed_next=1
&next=https%3A%2F%2Fwww.facebook.com%2Fv2.7%2Fdialog%2Foauth%3Fredirect_uri%3Dhttps%253A%252F%252Fidentity.[site].com%252Fid%252Fsignin-facebook%26state%3Dxxx%26scope%3Demail%26response_type%3Dcode%26client_id%3Dxxx%26ret%3Dlogin%26logger_id%3Dxxx&cancel_url=https%3A%2F%2Fidentity.[site].com%2Fid%2Fsignin-facebook%3Ferror%3Daccess_denied%26error_code%3D200%26error_description%3DPermissions%2Berror%26error_reason%3Duser_denied%26state%3Dxxx%23_%3D_
&display=page
&locale=en_US
&logger_id=xxx
(facebook.com)
POST /cookie/consent/?pv=1&dpr=1 HTTP/1.1
(facebook.com)
/login.php
?login_attempt=1
&next=https%3A%2F%2Fwww.facebook.com%2Fv2.7%2Fdialog%2Foauth%3Fredirect_uri%3Dhttps%253A%252F%252Fidentity.[site].com%252Fid%252Fsignin-facebook%26state%3Dxxx%26scope%3Demail%26response_type%3Dcode%26client_id%3Dxxx%26ret%3Dlogin%26logger_id%3Dxxx
&lwv=100
(facebook.com)
/v2.7/dialog/oauth
?redirect_uri=https%3A%2F%2Fidentity.[site].com%2Fid%2Fsignin-facebook
&state=xxx
&scope=email
&response_type=code
&client_id=xxx
&ret=login
&logger_id=xxx
&hash=xxx
(identity server)
/id/signin-facebook
?code=xxx
&state=xxx
I saw the code parameter on that last call and thought that maybe I could use the code there to get the access_token from the facebook API https://developers.facebook.com/docs/facebook-login/manually-build-a-login-flow
However when I tried that I get a message from the API telling me the code has already been used.
I also tried to change the UserInformationEndpoint to the FacebookAuthenticationOptions to force it to ask for the email by appending ?fields=email to the end of the default endpoint location, but that causes identity server to spit out the error "There was an error logging into the external provider. The error message is: access_denied".
I might be able to fix this all if I can change the middleware to send the request with response_type=id_token but I can't figure out how to do that or how to extract that access token when it gets returned in the first place to be able to use the Facebook C# sdk.
So I guess any help or direction at all would be awesome. I've spent countless hours researching and trying to solve the problem. All I need to do is get the email address of the logged-in user via IdentityServer3. Doesn't sound so hard and yet I'm stuck.

I finally figured this out. The answer has something to do with Mitra's comments although neither of those answers quite seemed to fit the bill, so I'm putting another one here. First, you need to request the access_token, not code (authorization code) from Facebook's Authentication endpoint. To do that, set it up like this
var fb = new FacebookAuthenticationOptions
{
AuthenticationType = "Facebook",
SignInAsAuthenticationType = signInAsType,
AppId = ConfigurationManager.AppSettings["Facebook:AppId"],
AppSecret = ConfigurationManager.AppSettings["Facebook:AppSecret"],
Provider = new FacebookAuthenticationProvider()
{
OnAuthenticated = (context) =>
{
context.Identity.AddClaim(new System.Security.Claims.Claim("urn:facebook:access_token", context.AccessToken, ClaimValueTypes.String, "Facebook"));
return Task.FromResult(0);
}
}
};
fb.Scope.Add("email");
app.UseFacebookAuthentication(fb);
Then, you need to catch the response once it's logged in. I'm using the following file from the IdentityServer3 Samples Repository, which overrides (read, provides functionality) for the methods necessary to log a user in from external sites. From this response, I'm using the C# Facebook SDK with the newly returned access_token claim in the ExternalAuthenticationContext to request the fields I need and add them to the list of claims. Then I can use that information to create/log in the user.
public override async Task AuthenticateExternalAsync(ExternalAuthenticationContext ctx)
{
    var externalUser = ctx.ExternalIdentity;
    // Validate the external identity before reading any of its members.
    if (externalUser == null)
    {
        throw new ArgumentNullException("externalUser");
    }
    var claimsList = externalUser.Claims.ToList();
    if (externalUser.Provider == "Facebook")
    {
        // Use the access token claim added in OnAuthenticated to fetch the extra fields.
        var extraClaims = GetAdditionalFacebookClaims(externalUser.Claims.First(claim => claim.Type == "urn:facebook:access_token"));
        claimsList.Add(new Claim("email", extraClaims.First(k => k.Key == "email").Value.ToString()));
        claimsList.Add(new Claim("given_name", extraClaims.First(k => k.Key == "first_name").Value.ToString()));
        claimsList.Add(new Claim("family_name", extraClaims.First(k => k.Key == "last_name").Value.ToString()));
    }
    var user = await userManager.FindAsync(new Microsoft.AspNet.Identity.UserLoginInfo(externalUser.Provider, externalUser.ProviderId));
    if (user == null)
    {
        ctx.AuthenticateResult = await ProcessNewExternalAccountAsync(externalUser.Provider, externalUser.ProviderId, claimsList);
    }
    else
    {
        ctx.AuthenticateResult = await ProcessExistingExternalAccountAsync(user.Id, externalUser.Provider, externalUser.ProviderId, claimsList);
    }
}
And that's it! If you have any suggestions for simplifying this process, please let me know. I was going to modify this code to perform the call to the API from FacebookAuthenticationOptions, but the Events property apparently no longer exists.
Edit: the GetAdditionalFacebookClaims method is simply a method that creates a new FacebookClient given the access token that was pulled out and queries the Facebook API for the other user claims you need. For example, my method looks like this:
protected static JsonObject GetAdditionalFacebookClaims(Claim accessToken)
{
var fb = new FacebookClient(accessToken.Value);
return fb.Get("me", new {fields = new[] {"email", "first_name", "last_name"}}) as JsonObject;
}
