This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Archive URL preview content at the time it is posted to a room #10676

Open
ajkessel opened this issue Aug 23, 2021 · 13 comments
Labels
A-URL-Preview Issues related to generating server-side previews of remote URLs T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@ajkessel

When a user posts a URL in Slack, the preview generated at that time is saved in the channel. This can be very handy for sites with dynamic content, so someone going back through the channel history can see at least a snippet of the content that was available at the URL when it was posted. It's likewise useful for URLs that disappear over time. If the user posted a Twitter link into the channel, the entire Tweet appears there, which allows for archival access when tweets are deleted. This stored preview information is also preserved when a Slack Workspace is exported, but there is no way to import that information into Matrix.

It would be very helpful for Matrix to be able to generate and store the same information for URL previews as a configurable option. This would also enable a user migrating a Slack Workspace to import this metadata and not lose it when transitioning from Slack to Matrix.

@reivilibre
Contributor

From your message, I think you're talking about 'archiving' rather than merely 'caching'.

In summary: Archiving a URL preview

  • at that point in time
  • saved in the room
  • so that other users (even on other servers) can see it.

It's an interesting suggestion. It would probably mean that the client would have to generate the URL preview (or perhaps request it from the server first) so it can be embedded in the event — this would also solve some of the privacy implications of URL previews in end-to-end encrypted rooms (since the SENDER would include the URL preview inside the encrypted event).
However, this is something you'd need to bring up as a proposed spec change.
You could discuss in #matrix-spec:matrix.org and/or look at https://spec.matrix.org/unstable/proposals/ and https://github.com/matrix-org/matrix-doc.
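To make the "embedded in the event" idea concrete, here is a minimal sketch of what a sender's client might put in the message content. The key `org.example.url_previews` and the `fetched_ts` field are hypothetical names chosen for illustration; nothing like them is defined in the spec, which is exactly why a proposal would be needed. The `og:*` keys mirror the OpenGraph fields that preview responses typically carry.

```python
import json
from typing import Any, Dict

# Hypothetical, non-spec key used only for illustration.
PREVIEW_KEY = "org.example.url_previews"


def build_message_with_preview(
    body: str, url: str, og: Dict[str, Any], fetched_ts_ms: int
) -> Dict[str, Any]:
    """Return message content with a snapshot of the URL preview embedded."""
    # Keep only a small, well-known subset of the OpenGraph data.
    wanted = ("og:title", "og:description", "og:image")
    snapshot: Dict[str, Any] = {k: og[k] for k in wanted if k in og}
    snapshot["url"] = url
    # Record when the snapshot was taken so readers know how old it is.
    snapshot["fetched_ts"] = fetched_ts_ms
    return {"msgtype": "m.text", "body": body, PREVIEW_KEY: [snapshot]}


if __name__ == "__main__":
    content = build_message_with_preview(
        "Worth a read: https://example.com/post",
        "https://example.com/post",
        {"og:title": "Example post", "og:description": "A short summary."},
        1629676800000,
    )
    print(json.dumps(content, indent=2))
```

Because the snapshot travels inside the event content, it would be encrypted along with the rest of the message in an end-to-end encrypted room, and every server in the room would see the same data.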

@clokep
Member

clokep commented Aug 27, 2021

It would probably mean that the client would have to generate the URL preview (or perhaps request it from the server first) so it can be embedded in the event

I think this is pretty much how URL previews work already.

From having chatted with the reporter in #synapse:matrix.org, the idea is to keep them in the database (while right now we cache them for a short period of time in memory).

I do not believe this needs a spec change.

@clokep clokep added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Aug 27, 2021
@reivilibre
Contributor

I think this is pretty much how URL previews work already.

I didn't think so. I thought clients request them from their own homeserver at the time they wish to see them. Indeed, my server isn't capable of generating URL previews, and so I don't see any on my client, even if the URLs were sent by other users on other homeservers.
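As a rough illustration of that request flow, the sketch below builds the URL a client would ask its own homeserver for. It assumes the media preview endpoint path and the `url` and optional `ts` query parameters as I understand them from the client-server spec; the exact path prefix has changed between spec versions, so treat it as an assumption. The helper name `preview_request_url` is made up for this example.

```python
from typing import Optional
from urllib.parse import urlencode


def preview_request_url(
    homeserver_base: str, target_url: str, ts_ms: Optional[int] = None
) -> str:
    """Build the preview request a client sends to its OWN homeserver."""
    params = {"url": target_url}
    if ts_ms is not None:
        # Preferred point in time for the preview, in milliseconds.
        params["ts"] = str(ts_ms)
    base = homeserver_base.rstrip("/")
    # Assumed path; older and newer spec versions use different prefixes.
    return f"{base}/_matrix/media/v3/preview_url?{urlencode(params)}"


if __name__ == "__main__":
    print(preview_request_url("https://hs.example.org", "https://example.com/a?b=1"))
```

The request goes to the viewer's homeserver at viewing time, not to the sender's, which is why two users on different servers can end up with different previews, or none at all.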

@clokep
Member

clokep commented Aug 27, 2021

I think this is pretty much how URL previews work already.

I didn't think so. I thought clients request them from their own homeserver at the time they wish to see them. Indeed, my server isn't capable of generating URL previews, and so I don't see any on my client, even if the URLs were sent by other users on other homeservers.

Oh, I see what you're saying. I think having each homeserver cache it at request time would be a reasonable implementation, but that's getting into the weeds. Seems there would be some trade-offs in different implementations.

@reivilibre
Contributor

Oh, I see what you're saying. I think having each homeserver cache it at request time would be a reasonable implementation, but that's getting into the weeds. Seems there would be some trade-offs in different implementations.

I don't think the reporter's goal of importing them can be achieved other than by storing them as part of the events (unless we maintain some extra writeable structure just for URL previews, but that seems unlikely).

@clokep
Member

clokep commented Aug 27, 2021

Oh, I see what you're saying. I think having each homeserver cache it at request time would be a reasonable implementation, but that's getting into the weeds. Seems there would be some trade-offs in different implementations.

I don't think the reporter's goal of importing them can be achieved other than by storing them as part of the events (unless we maintain some extra writeable structure just for URL previews, but that seems unlikely).

Yes, I'm suggesting we maintain a separate table or storage structure just for URL previews. Essentially moving the cache out of memory into a database (or onto the file system, if you want it more like the media repo).

@reivilibre
Contributor

Yes, I'm suggesting we maintain a separate table or storage structure just for URL previews. Essentially moving the cache out of memory into a database (or onto the file system, if you want it more like the media repo).

This seems to have the problem that it'd be server-specific (or be a nuisance to federate).

@clokep
Member

clokep commented Aug 27, 2021

So... TIL that Synapse kind of does this. There's a local_media_repository_url_cache table which stores the results of downloading a URL. The results are also cached in memory for a period of time, and a lookup falls back to the table before fetching again (I think... the code is hard to follow). Note that there's no corresponding remote version of this (as @reivilibre is mentioning, this would require spec changes).

Unfortunately Synapse does remove the data after whatever the expiration on the response is (defaults to one hour, but could be different).
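The difference between the expiring cache described here and the archive being requested can be sketched with a small SQLite example. The table `url_preview_archive` and the function names are hypothetical and are not Synapse's actual schema or code; the point is only that rows are never expired and that lookups are made relative to the time a message was sent.

```python
import sqlite3
from typing import Optional

# Hypothetical archive table; NOT Synapse's local_media_repository_url_cache.
SCHEMA = """
CREATE TABLE IF NOT EXISTS url_preview_archive (
    url        TEXT    NOT NULL,
    fetched_ts INTEGER NOT NULL,  -- milliseconds since the epoch
    og_json    TEXT    NOT NULL,  -- serialized preview data
    PRIMARY KEY (url, fetched_ts)
)
"""


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def archive_preview(
    conn: sqlite3.Connection, url: str, fetched_ts: int, og_json: str
) -> None:
    """Store a snapshot permanently; nothing here ever deletes rows."""
    conn.execute(
        "INSERT OR REPLACE INTO url_preview_archive VALUES (?, ?, ?)",
        (url, fetched_ts, og_json),
    )
    conn.commit()


def preview_as_of(
    conn: sqlite3.Connection, url: str, event_ts: int
) -> Optional[str]:
    """Return the newest snapshot taken at or before the event's timestamp."""
    row = conn.execute(
        "SELECT og_json FROM url_preview_archive "
        "WHERE url = ? AND fetched_ts <= ? "
        "ORDER BY fetched_ts DESC LIMIT 1",
        (url, event_ts),
    ).fetchone()
    return row[0] if row else None
```

A store like this would still be local to one homeserver, so it does not address the federation concern raised earlier in the thread.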

@ajkessel
Author

Yes, I did mean "archive" rather than cache. I've found this very useful on other messaging systems for several years and it would be great to not lose that historical data when migrating. I'm a newbie to Matrix -- happy to file this as a new issue on the spec if that's the appropriate next step, although other folks in this thread would probably do a better job of describing it there.

@ajkessel ajkessel changed the title Cache URL preview content at the time it is posted to a room Archive URL preview content at the time it is posted to a room Sep 23, 2021
@ajkessel
Author

Is there any way in which this concept would intersect with #2752 , i.e. using oembed for URL previews? Easier to store the oembed data structure in connection with the post containing the URL as an attached item?

@clokep
Member

clokep commented Sep 23, 2021

Is there any way in which this concept would intersect with #2752 , i.e. using oembed for URL previews? Easier to store the oembed data structure in connection with the post containing the URL as an attached item?

oEmbed support is about improving the generated previews; this ticket is more about long-term archiving of the previews once you have them.

@ajkessel
Author

Is there any way in which this concept would intersect with #2752 , i.e. using oembed for URL previews? Easier to store the oembed data structure in connection with the post containing the URL as an attached item?

oEmbed support is about improving the generated previews; this ticket is more about long-term archiving of the previews once you have them.

Understood -- I'm wondering if the standardization that comes with oEmbed might present an opportunity to archive that data structure. Or just hoping that if people are paying attention to improving URL preview behavior, the idea in this issue might percolate up to the foreground as well :)

@bkil

bkil commented Apr 14, 2022

Oh, I see what you're saying. I think having each homeserver cache it at request time would be a reasonable implementation, but that's getting into the weeds. Seems there would be some trade-offs in different implementations.

I wouldn't find that implementation reasonable, as users on different HS would see a snapshot from a different point in time.

I recommend client side generation

This should ideally be done through the client directly on platforms where this is possible:

  • on non-web clients
  • if CORS allows a direct Fetch
  • if it is a site that has a simple, unauthenticated REST API which can give back the same basic information we need (title and synopsis of article, tags, maybe a cover image)
  • otherwise as a last resort, if the room is not E2EE, fetch from the HS of the sender and cite that
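One possible reading of the list above as a decision order is sketched below. The type and function names (`PreviewSource`, `PreviewContext`, `choose_preview_source`) are hypothetical, and the precedence between the first three bullets is my interpretation rather than something the comment spells out.

```python
from dataclasses import dataclass
from enum import Enum


class PreviewSource(Enum):
    CLIENT_DIRECT = "client fetches the page itself"
    CLIENT_SITE_API = "client calls the site's public API"
    SENDER_HOMESERVER = "sender's homeserver generates the preview"
    NONE = "no preview is embedded"


@dataclass
class PreviewContext:
    is_web_client: bool
    cors_allows_fetch: bool
    site_has_public_api: bool
    room_is_encrypted: bool


def choose_preview_source(ctx: PreviewContext) -> PreviewSource:
    """Pick where the sender's client gets the preview it will embed."""
    # Non-web clients are not bound by CORS; web clients need it to allow a fetch.
    if not ctx.is_web_client or ctx.cors_allows_fetch:
        return PreviewSource.CLIENT_DIRECT
    # Otherwise try a simple unauthenticated API offered by the site.
    if ctx.site_has_public_api:
        return PreviewSource.CLIENT_SITE_API
    # Last resort: ask the homeserver, but only if that leaks nothing encrypted.
    if not ctx.room_is_encrypted:
        return PreviewSource.SENDER_HOMESERVER
    return PreviewSource.NONE
```

Whichever branch is taken, the result is generated once by the sender, so every participant sees the same snapshot regardless of their homeserver.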

Embedding the preview within the sent event also allows for a killer use case:

Searching among your shared content

Welcome to the world of social bookmarking! I've been manually copying and pasting the title or keywords from every article I share so that I can find it again in the room history and reshare it in other rooms when the topic comes up. It would be great if search could also index such archived previews!

@MadLittleMods MadLittleMods added the A-URL-Preview Issues related to generating server-side previews of remote URLs label Jul 6, 2023

5 participants