Apify: Preserve headers in RequestQueue - cookies

I'm trying to crawl our local Confluence installation with the PuppeteerCrawler. My strategy is to log in first, then extract the session cookies and use them in the headers of the start URL. The code is as follows:
First, I log in manually to extract the relevant credentials:
const Apify = require("apify");
const browser = await Apify.launchPuppeteer({ slowMo: 500 });
const page = await browser.newPage();
await page.goto('https://mycompany/confluence/login.action');
await page.focus('input#os_username');
await page.keyboard.type('myusername');
await page.focus('input#os_password');
await page.keyboard.type('mypasswd');
await page.keyboard.press('Enter');
await page.waitForNavigation();
// Get cookies and close the login session
const cookies = await page.cookies();
await browser.close();
const cookie_jsession = cookies.filter( cookie => {
return cookie.name === "JSESSIONID"
})[0];
const cookie_crowdtoken = cookies.filter( cookie => {
return cookie.name === "crowd.token_key"
})[0];
Then I'm building up the crawler structure with the prepared request header:
const startURL = {
url: 'https://mycompany/confluence/index.action',
method: 'GET',
headers:
{
Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
Cookie: `${cookie_jsession.name}=${cookie_jsession.value}; ${cookie_crowdtoken.name}=${cookie_crowdtoken.value}`,
}
}
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest(new Apify.Request(startURL));
const pseudoUrls = [ new Apify.PseudoUrl('https://mycompany/confluence/[.*]')];
const crawler = new Apify.PuppeteerCrawler({
launchPuppeteerOptions: { headless: false, slowMo: 500 },
requestQueue,
handlePageFunction: async ({ request, page }) => {
const title = await page.title();
console.log(`Title of ${request.url}: ${title}`);
console.log(await page.content());
await Apify.utils.enqueueLinks({
page,
selector: 'a:not(.like-button)',
pseudoUrls,
requestQueue
});
},
maxRequestsPerCrawl: 3,
maxConcurrency: 10,
});
await crawler.run();
The manual login and cookie extraction seem to be fine (the "curlified" request works perfectly), but Confluence doesn't accept the login via Puppeteer / headless Chromium. It seems like the headers are getting lost somehow.
What am I doing wrong?

Without first going into the details of why the headers don't work, I would suggest defining a custom gotoFunction in the PuppeteerCrawler options, such as:
{
// ...
gotoFunction: async ({ request, page }) => {
await page.setCookie(...cookies); // From page.cookies() earlier.
return page.goto(request.url, { timeout: 60000 })
}
}
This way, you don't need to do the parsing and the cookies will automatically be injected into the browser before each page load.
As a note, modifying default request headers when using a headless browser is not a good practice, because it may lead to blocking on some sites that match received headers against a list of known browser fingerprints.
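If you would rather inject only the session cookies instead of everything returned by page.cookies(), a small filter is enough. This is a minimal sketch: pickCookies is a hypothetical helper (not part of the Apify SDK or Puppeteer), and the sample cookie values are made up.

```javascript
// Hypothetical helper: keep only the cookies whose names are listed.
function pickCookies(cookies, names) {
  const wanted = new Set(names);
  return cookies.filter((cookie) => wanted.has(cookie.name));
}

// Made-up sample data, shaped like the objects returned by page.cookies().
const cookies = [
  { name: 'JSESSIONID', value: 'abc', domain: 'mycompany' },
  { name: 'crowd.token_key', value: 'xyz', domain: 'mycompany' },
  { name: 'tracking', value: '1', domain: 'mycompany' },
];

const sessionCookies = pickCookies(cookies, ['JSESSIONID', 'crowd.token_key']);
console.log(sessionCookies.map((cookie) => cookie.name));
```

Inside the gotoFunction you would then call page.setCookie(...sessionCookies) instead of passing the full array.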
Update:
The below section is no longer relevant, because you can now use the Request class to override headers as expected.
The headers problem is a complex one involving request interception in Puppeteer. Here's the related GitHub issue in Apify SDK. Unfortunately, the method of overriding headers via a Request object currently does not work in PuppeteerCrawler, so that's why you were unsuccessful.

Related

How would I update the authorization header from a cookie on a graphQL apollo mutation or query

I have the following _app.js for my NextJS app.
I want to change the authorization header on login, using a cookie that will be set. I think I can handle the cookie and login functionality, but I am stuck on how to get the cookie into the ApolloClient authorization header. Is there a way to pass the headers, with a token from the cookie, into a mutation?
I have the cookie working, so I have a logged-in token, but I need to replace the ApolloClient token in _app.js with the new one from the cookie. I'm not sure how this is done.
import "../styles/globals.css";
import { ApolloClient, ApolloProvider, InMemoryCache } from "@apollo/client";
const client = new ApolloClient({
uri: "https://graphql.fauna.com/graphql",
cache: new InMemoryCache(),
headers: {
authorization: `Bearer ${process.env.NEXT_PUBLIC_FAUNA_SECRET}`,
},
});
console.log(client.link.options.headers);
function MyApp({ Component, pageProps }) {
return (
<ApolloProvider client={client}>
<Component {...pageProps} />
</ApolloProvider>
);
}
export default MyApp;
UPDATE: I've read in the Apollo docs about the following setting for passing the cookie, but I don't quite understand it.
const link = createHttpLink({
uri: '/graphql',
credentials: 'same-origin'
});
const client = new ApolloClient({
cache: new InMemoryCache(),
link,
});
UPDATE: I have made good progress with the above; it allows me to pass the header via the context in useQuery, as shown below. The only remaining problem is timing: useQuery seems to run before cookieData has loaded. If I pass in an API key directly it works, but with the fetched cookie I get "invalid db secret", even though it is the same key.
const { data: cookieData, error: cookieError } = useSWR(
"/api/cookie",
fetcher
);
console.log(cookieData);
// const { loading, error, data } = useQuery(FORMS);
const { loading, error, data } = useQuery(FORMS, {
context: {
headers: {
authorization: "Bearer " + cookieData,
},
},
});
Any ideas on this problem would be great.
If you need to run some GraphQL queries only after other data has loaded, I recommend putting the dependent queries in a separate React component that receives the secret as a prop, and rendering that component only once the data is available. Alternatively, you can use lazy queries.
Separate component:
const Form = ({ cookieData }) => {
useQuery(FORMS, {
context: {
headers: {
authorization: "Bearer " + cookieData,
},
},
});
return /* ... whatever ... */
}
const FormWrapper = () => {
const { data: cookieData, error: cookieError } = useSWR(
"/api/cookie",
fetcher
);
return cookieData ? <Form cookieData={cookieData} /> : <p>Loading...</p>;
}
I might be missing some nuances with when/how React will mount and unmount the inner component, so I suppose you should be careful with that.
Manual Execution with useLazyQuery
https://www.apollographql.com/docs/react/data/queries/#manual-execution-with-uselazyquery
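A third option, assuming you want to keep everything in one component, is the skip option of useQuery, which tells Apollo not to run the query while it is true. The sketch below only builds the options object so it can be checked without React; buildQueryOptions is a hypothetical helper name, not an Apollo API.

```javascript
// Hypothetical helper: build the options object for useQuery(FORMS, options).
function buildQueryOptions(cookieData) {
  if (!cookieData) {
    // The secret has not loaded yet: ask Apollo to skip the query for now.
    return { skip: true };
  }
  return {
    skip: false,
    context: {
      headers: { authorization: `Bearer ${cookieData}` },
    },
  };
}

console.log(buildQueryOptions(undefined));
console.log(buildQueryOptions('my-secret'));
```

In the component this would be used as useQuery(FORMS, buildQueryOptions(cookieData)), so the query first runs when the cookie value becomes available.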

Unable to set authorization header using Apollo's setContext()

So I'm trying to pass an authorization header to Apollo according to the docs on their site.
https://www.apollographql.com/docs/react/networking/authentication/?fbclid=IwAR17YAJ0VA5G8InmAJ_PxcukXKNQxFLFI8aeT4oB7wYE3DjWNaB_F67__zs
However, when I try to log request.headers on my server I don't see the authorization property on there. Next I checked whether or not the function setContext() was being run at all by putting a console.log() within it and found that it didn't run.
const httpLink = new HttpLink({ uri: 'http://localhost:4000' });
const authLink = setContext((request, { headers }) => {
const token = localStorage.getItem('library-user-token')
console.log(token) //doesn't log at all
return ({
headers: {
...headers,
authorization: token ? `Bearer ${token}` : null,
}
})
});
const client = new ApolloClient({
link: authLink.concat(httpLink),
cache: new InMemoryCache()
});
I have almost identical code in my project and it works. Are you passing the client to your ApolloProvider?
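While debugging, it can help to separate the header logic from the link wiring, so you can rule out one of the two. Here is a minimal sketch that mirrors the body of the setContext callback from the question as a plain function; withAuthHeader is a hypothetical name and the token is passed in rather than read from localStorage.

```javascript
// Hypothetical pure version of the setContext callback body from the question.
function withAuthHeader(headers, token) {
  return {
    headers: {
      ...headers, // keep whatever headers were already set
      authorization: token ? `Bearer ${token}` : null,
    },
  };
}

console.log(withAuthHeader({ accept: 'application/json' }, 'abc'));
console.log(withAuthHeader(undefined, null));
```

If this behaves as expected but the callback still never logs, the problem is more likely in how the client reaches your components than in the header code.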

request.cookies is undefined when using Supertest

I'm passing my authentication token via an HTTP-Only cookie in my NestJS API.
As such, when writing some E2E tests for my Auth endpoints, I'm having an issue with cookies not being where I expect them.
Here's my pared-down test code:
describe('auth/logout', () => {
it('should log out a user', async (done) => {
// ... code to create user account
const loginResponse: Response = await request(app.getHttpServer())
.post('/auth/login')
.send({ username: newUser.email, password });
// get cookie manually from response.headers['set-cookie']
const cookie = getCookieFromHeaders(loginResponse);
// Log out the new user
const logoutResponse: Response = await request(app.getHttpServer())
.get('/auth/logout')
.set('Cookie', [cookie]);
});
});
In my JWT Strategy, I'm using a custom cookie parser. The problem I'm having is that request.cookies is always undefined when it gets to the parser. However, the cookie will be present in request.headers.
I'm following the manual cookie example from this Medium article: https://medium.com/@juha.a.hytonen/testing-authenticated-requests-with-supertest-325ccf47c2bb, and there don't appear to be any other methods available on the request object to set cookies.
If I test the same functionality from Postman, everything works as expected. What am I doing wrong?
I know this is an old thread but...
I also had req.cookies undefined, but for a different reason.
I'm testing my router independently, not the top level app. So I bootstrap the app in beforeEach and add the route to test.
I was getting req.cookies undefined because express 4 requires the cookieParser middleware to be present to parse the cookies from the headers.
E.g.
const express = require('express');
const bodyParser = require('body-parser');
const cookieParser = require('cookie-parser');
const request = require('supertest');
const {router} = require('./index');
describe('router', () => {
let app;
beforeAll(() => {
app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));
app.use(cookieParser());
app.use('/', router);
});
beforeEach(() => jest.clearAllMocks());
it('GET to /', async () => {
const jwt = 'qwerty-1234567890';
const resp = await request(app)
.get('/')
.set('Cookie', `jwt=${jwt};`)
.set('Content-Type', 'application/json')
.send({});
});
});
Testing this way allows me to unit test a router in isolation of the app. The req.cookies turn up as expected.
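To make it clearer why req.cookies is undefined without the middleware: the raw Cookie header is just a string, and something has to turn it into an object. The following is a simplified illustration of that step, not the actual cookie-parser implementation (it does no decoding and ignores signed cookies); parseCookieHeader is a hypothetical name.

```javascript
// Simplified illustration: turn a raw Cookie header into a name/value object.
function parseCookieHeader(header) {
  const cookies = {};
  if (!header) return cookies;
  for (const part of header.split(';')) {
    const index = part.indexOf('=');
    if (index === -1) continue; // skip empty or malformed fragments
    const name = part.slice(0, index).trim();
    const value = part.slice(index + 1).trim();
    if (name) cookies[name] = value;
  }
  return cookies;
}

console.log(parseCookieHeader('jwt=qwerty-1234567890;'));
```

Without a step like this in the middleware chain, the header is still present in req.headers.cookie, but nothing ever populates req.cookies.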
Late, but I hope I can help you. The problem is in the initialization of the app object. Your main.ts file probably configures some middleware, such as CORS and cookieParser. You must also add them in your tests when you create the app.
const moduleFixture: TestingModule = await Test.createTestingModule({
imports: [AppModule],
}).compile();
const app = moduleFixture.createNestApplication();
// Add cors
app.enableCors({
credentials: true,
origin: ['http://localhost:4200'],
});
// Add cookie parser
app.use(cookieParser());
await app.init();
Compared with your code, the example in the article you're following (https://medium.com/@juha.a.hytonen/testing-authenticated-requests-with-supertest-325ccf47c2bb) differs in two ways:
1) it uses the header name in lowercase, .set('cookie', cookie), whereas your code capitalizes it ==> Have you tried lowercase in your code instead?
2) the value assigned to the 'cookie' header is not an array, whereas your code assigns an array ==> Have you tried a non-array value?
So, to sum up, can you try the following code:
describe('auth/logout', () => {
it('should log out a user', async (done) => {
// ... code to create user account
const loginResponse: Response = await request(app.getHttpServer())
.post('/auth/login')
.send({ username: newUser.email, password });
// get cookie manually from response.headers['set-cookie']
const cookie = getCookieFromHeaders(loginResponse);
// Log out the new user
const logoutResponse: Response = await request(app.getHttpServer())
.get('/auth/logout')
.set('cookie', cookie) // <== here goes the diff
.expect(200, done);
});
});
Let us know if that helps :)
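The question does not show getCookieFromHeaders, so here is one way such a helper might look, as a sketch only. It assumes the response exposes headers['set-cookie'] as an array of strings (as Node does) and returns a single string suitable for a Cookie request header; the fake response is made up for illustration.

```javascript
// Hypothetical stand-in for the helper used in the question: take each
// Set-Cookie entry, keep only its leading "name=value" pair (dropping
// attributes such as Path or HttpOnly), and join the pairs with "; ".
function getCookieFromHeaders(response) {
  const setCookie = response.headers['set-cookie'] || [];
  return setCookie.map((entry) => entry.split(';')[0].trim()).join('; ');
}

// Made-up response object for illustration.
const fakeResponse = {
  headers: {
    'set-cookie': ['jwt=abc123; Path=/; HttpOnly', 'theme=dark; Path=/'],
  },
};

console.log(getCookieFromHeaders(fakeResponse));
```

The result is a plain string, so it can be passed directly to .set('cookie', cookie) without wrapping it in an array.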

Next.js not persisting cookies

I have a server-side rendered Next.js/express app that communicates with a Django API (cross-origin). I login a user like so:
const response = await fetch('http://localhost:8000/sign-in', {
method: 'POST',
credentials: 'include',
body: JSON.stringify({ email, password }),
headers: { 'Content-Type': 'application/json' },
});
const result = await response.json();
if (response.status === 200) {
Router.push('/account');
}
Django successfully logs in the user and returns set-cookie headers for the csrftoken and sessionid cookies, however, when I navigate to a different page (like in the above code when I Router.push), the cookies don't persist.
I assume this has something to do with server-side vs. client-side, but when cookies are set in the browser I expect them to persist regardless.
How can I get these cookies, once set, to persist across all pages on the client side?
It turns out that set-cookie is the old way of doing things. It's controlled by the browser, so it's obfuscated.
I ended up sending the csrftoken and sessionid back to the client in the JSON body, and saving them to localStorage using localStorage.setItem('sessionid', 'theSessionId') and localStorage.setItem('csrftoken', 'theCsrftoken').
Then when I need to make an authenticated request, I include them in the fetch headers:
const response = await fetch(`${API_HOST}/logout`, {
method: 'POST',
headers: {
'X-CSRFToken': localStorage.getItem('csrftoken'),
sessionid: localStorage.getItem('sessionid'),
},
});
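If several requests need these headers, the lookup can be pulled into one helper. This is a minimal sketch of that idea: buildAuthHeaders is a hypothetical name, and it accepts any object with a getItem method so it can be exercised outside the browser with a fake storage.

```javascript
// Hypothetical helper: read the stored tokens and build the request headers.
function buildAuthHeaders(storage) {
  return {
    'X-CSRFToken': storage.getItem('csrftoken'),
    sessionid: storage.getItem('sessionid'),
  };
}

// Fake storage for illustration; in the browser you would pass localStorage.
const values = { csrftoken: 'theCsrftoken', sessionid: 'theSessionId' };
const fakeStorage = {
  getItem: (key) => (key in values ? values[key] : null),
};

console.log(buildAuthHeaders(fakeStorage));
```

In the browser the call becomes fetch(url, { method: 'POST', headers: buildAuthHeaders(localStorage) }).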

Express lambda return JSON from api call

I have a serverless Express app. In the app I have an app.get route for '/', which should call an API, retrieve the data from it, and send it back to the user.
https://y31q4zn654.execute-api.eu-west-1.amazonaws.com/dev
I can see data as json on the page being returned.
This is my index.js of the lambda function:
const serverless = require('serverless-http');
const express = require('express');
const request = require('request');
const app = express()
app.get('/', function (req, res) {
var options = { method: 'POST',
url: 'https://some.api.domain/getTopNstc',
headers:
{ 'Content-Type': 'application/json' },
body: {},
json: true
};
request(options, function (error, response, body) {
console.log('request call')
if (error) throw new Error(error);
// res.status(200).send(response);
res.json(response);
});
});
module.exports.handler = serverless(app);
However, I would like to be able to call the lambda '/' via axios (or another promise-based request library).
I've tried the following code to call my lambda, but the browser reports the error shown after it:
axios.get('https://y31q4zn654.execute-api.eu-west-1.amazonaws.com/dev', {
headers: {
'Content-Type': 'application/json',
},
body:{}
}).then((res) => {
console.log(res);
});
Failed to load https://y31q4zn654.execute-api.eu-west-1.amazonaws.com/dev: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'myDomain' is therefore not allowed access.
bundle.js:31 Cross-Origin Read Blocking (CORB) blocked cross-origin response https://y31q4zn654.execute-api.eu-west-1.amazonaws.com/dev with MIME type application/json. See https://www.chromestatus.com/feature/5629709824032768 for more details.
I concur with @KMo. Pretty sure this is a CORS issue. There is an npm module, cors, made exactly for this purpose; read up about it here.
To install it, run npm install --save cors
Then in your express app, add the following:
const express = require('express');
const app = express();
const cors = require('cors');
app.use(cors());
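For intuition, this is roughly the kind of work such a middleware does: it adds the Access-Control-Allow-Origin header that the browser error complains about and answers preflight OPTIONS requests. The code below is a simplified illustration, not the implementation of the cors package; allowAllOrigins and createFakeResponse are hypothetical names, and the fake response exists only so the sketch runs without Express.

```javascript
// Simplified illustration of a CORS middleware (not the cors package itself).
function allowAllOrigins(req, res, next) {
  res.setHeader('Access-Control-Allow-Origin', '*');
  if (req.method === 'OPTIONS') {
    // Preflight request: describe what is allowed and finish the response.
    res.setHeader('Access-Control-Allow-Methods', 'GET,HEAD,PUT,PATCH,POST,DELETE');
    const requested = req.headers['access-control-request-headers'];
    if (requested) res.setHeader('Access-Control-Allow-Headers', requested);
    res.statusCode = 204;
    res.end();
    return;
  }
  next();
}

// Minimal fake response object so the sketch can run without Express.
function createFakeResponse() {
  const headers = {};
  return {
    headers,
    statusCode: 200,
    ended: false,
    setHeader(name, value) { headers[name] = value; },
    end() { this.ended = true; },
  };
}

const demoResponse = createFakeResponse();
allowAllOrigins({ method: 'GET', headers: {} }, demoResponse, () => {});
console.log(demoResponse.headers);
```

In a real app, prefer the cors package as shown above, and consider restricting the allowed origin rather than using '*'.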