Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface manifest and service worker URLs in JSON results #9683

Closed
rviscomi opened this issue Sep 16, 2019 · 13 comments
Closed

Surface manifest and service worker URLs in JSON results #9683

rviscomi opened this issue Sep 16, 2019 · 13 comments

Comments

@rviscomi
Copy link
Member

When LH discovers a page with a service worker or a manifest file, it would be hugely helpful for downstream services like HTTP Archive if those file URLs were included in the JSON results somehow. Otherwise we end up doing a lot of questionable regex parsing of the HTML.

plz i can has urls?

@connorjclark
Copy link
Collaborator

There's lhr.audits['network-requests'], which will at least give you a structured array of the network requests.

If the artifacts could be added to HTTPArchive, there is the LinkElements artifact:

"LinkElements": [
    {
      "rel": "manifest",
      "href": "https://100.exterkamp.codes/manifest.json",
      "hrefRaw": "https://100.exterkamp.codes/manifest.json",
      "hreflang": "",
      "as": "",
      "crossOrigin": null,
      "source": "head"
    }
  ],

and the ServiceWorker artifact:

"ServiceWorker": {
    "versions": [
      {
        "versionId": "0",
        "registrationId": "0",
        "scriptURL": "https://100.exterkamp.codes/sw.js",
        "runningStatus": "running",
        "status": "activated",
        "scriptLastModified": 1547540209,
        "scriptResponseTime": 1568661692.977256,
        "controlledClients": [],
        "targetId": "B1249BF04CA22F295084955F7F8AE1E2"
      }
    ],
    "registrations": [
      {
        "registrationId": "0",
        "scopeURL": "https://100.exterkamp.codes/",
        "isDeleted": false
      }
    ]
  },

@brendankenny
Copy link
Member

check out report.audits['network-requests'].details.items :)

The manifest should even be resourceType: 'Manifest'.

that doesn't include the service worker URL, though, since the page itself isn't doing the network request for the file. #709 would/could do that, or we could put the service worker URL in the details of the service-worker audit

@brendankenny
Copy link
Member

I was too slow :)

If the artifacts could be added to HTTPArchive

probably won't happen, though, since the artifacts are large and growing without bound these days (and httparchive kind of already has its own artifacts it collects)

@rviscomi
Copy link
Member Author

rviscomi commented Sep 16, 2019

Aha! This sounds really promising.

(Edit: removing the part about the resource URLs, there was a bug in my lookup script :))

Would the artifacts be the only way to get the resource URLs? We already do some pruning of the LH response for size control, so it's feasible for us to only grab what we need.

https://github.com/HTTPArchive/bigquery/blob/master/dataflow/java/src/main/java/com/httparchive/dataflow/BigQueryImport.java#L268-L276

@brendankenny
Copy link
Member

(Edit: removing the part about the resource URLs, there was a bug in my lookup script :))
Would the artifacts be the only way to get the resource URLs? We already do some pruning of the LH response for size control, so it's feasible for us to only grab what we need.

ha, just saw your edit. So you do see them in there?

The only resource URL that would be missing should be the service worker (at least as far as I know the page would have to be doing something weird for it to show up in its own network requests).

@rviscomi
Copy link
Member Author

Yes, I fixed the bug and now I'm seeing the manifest URLs. 😁

As expected, I'm not seeing any ServiceWorker resource types. Oddly though, the manifest URL is missing from ~9% of reports that pass the installable manifest audit:

#standardSQL
CREATE TEMP FUNCTION getUrls(report STRING)
RETURNS STRUCT<manifest STRING, sw STRING> LANGUAGE js AS '''
var $ = JSON.parse(report);
return $.audits['network-requests'].details.items.reduce((urls, item) => {
  if (item.resourceType == 'Manifest') {
    urls.manifest = item.url;
  } else if (item.resourceType == 'ServiceWorker') {
    urls.sw = item.url;
  }
  return urls;
}, {});
''';

SELECT
  COUNTIF(urls.manifest IS NULL) AS null_manifests,
  COUNTIF(urls.sw IS NULL) AS null_sq,
  COUNT(0) AS total,
  COUNTIF(urls.manifest IS NULL) / COUNT(0) AS pct_null_manifests,
  COUNTIF(urls.sw IS NULL) / COUNT(0) AS pct_null_sw
FROM (
  SELECT
    getUrls(report) AS urls
  FROM
    `httparchive.lighthouse.2019_07_01_mobile`
  WHERE
    JSON_EXTRACT_SCALAR(report, '$.audits.installable-manifest.score') = '1')
null_manifests null_sq total pct_null_manifests pct_null_sw
7253 79345 79345 9.14% 100.00%

Any ideas why that might be?

@brendankenny
Copy link
Member

resourceType is populated by the type field of the Network.requestWillBeSent protocol event. Someone may want to dig into the implementation to see for sure, but it's possible that it's misclassified in type but still treated as a proper manifest.

Maybe try one of the problem URLs in Chrome and check out the network panel in DevTools? The resourceType in the audit should match what's in the "type" column.

@brendankenny
Copy link
Member

that's actually another one we could manually annotate (this time on the installable-manifest audit). The Page.getAppManifest return value includes the URL so we wouldn't have to go guessing from resourceType which file was actually used as the manifest.

So maybe we should fix this issue as originally requested after all :)

@connorjclark
Copy link
Collaborator

@rviscomi just confirming this is still a valuable feature for you / HTTPArchive ?

@rviscomi
Copy link
Member Author

Yes, this is still useful for us!

@brendankenny
Copy link
Member

brendankenny commented Aug 12, 2020

OK, this was a terribly confusing conversation to leave for posterity if the goal was to communicate what actually needs to be done :) but I think I've got a handle on it below. Probably best to assess/fix these two separately as they're not really related.

Service worker URL

When a page has a ServiceWorker, the browser requests the SW script, not the page, so it isn't in our network-requests audit list of requests.

Possible fix: the ServiceWorker artifact has the SW's URL, so the service-worker audit could include the SW's URL in debugData details. It looks like getControllingScopeUrl() would need to be modified to return the LH.Crdp.ServiceWorker.ServiceWorkerVersion with the most specific scopeUrl (instead of just the scopeUrl) so that its scriptURL could be included as details in the final score: 1 audit result

Web app manifest URL

When a page has a web app manifest, it should be loaded by the page (caused by lighthouse itself, I believe?) and so it will show up in report.audits['network-requests'].details.items and have the property resourceType: 'Manifest'. However, @rviscomi found that 9% of the time pages have a (supposedly) valid manifest but there is no network request with resourceType: 'Manifest'. I assume this is because it ends up based on mime type and ignores the fact that we just explicitly requested a web app manifest :)

Possible fix: Assuming that 9% is just misclassified resource types, should we find the manifest URL in the network-requests items and manually set the resourceType to 'Manifest'? Or add the URL as details on some audit (installable-manifest?)?

One issue is that GatherRunner doesn't keep the manifest URL around from driver.getAppManifest(), so we may want to add the URL to the manifest artifact? Otherwise we could go with the entry in LinkElements @connorjclark mentioned in #9683 (comment), we'll just have to deal with slightly more uncertainty than asking the protocol directly (like, what if the page declares a manifest twice, which one was the actual manifest used by Chrome?).

@patrickhulce
Copy link
Collaborator

caused by lighthouse itself, I believe?

Our fetch of the manifest does not happen while we're recording devtools logs though, so I would assume this happens somehow as a result of us loading the page and there being a link[rel=manifest]

@connorjclark
Copy link
Collaborator

connorjclark commented Sep 1, 2020

fixed by #11329 and #11330

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants