Introduce search across all of HexDocs #1811

josevalim · 2023-11-10T13:11:17Z

The goal of this feature is to provide search and autocompletion across packages. We will add a new configuration, called related_deps, which is a list of package names we find related. We will improve both autocomplete and search to use this, such that:

Autocompletion
- Without related_deps
  - Only autocompletes the current project (current behaviour)
- With related_deps
  - Autocomplete the current package and all related dependencies
Full-text search
- Without related_deps
  - By default searches the current project (current behaviour)
  - We will show radio buttons that allow you to customize the search. The options are "[ ] Current project" (default) and "[ ] HexDocs"
- With related_deps
  - By default searches the current project and all related deps
  - We will show radio buttons that allow you to customize the search. The options are "[ ] Current project", "[ ] Current project + Related packages" (default), and "[ ] HexDocs"
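
The scoping rules above can be sketched as a small function. This is only an illustration under assumptions: `SearchConfig`, `autocomplete_scope`, and `search_options` are hypothetical names, not part of ExDoc. Only `related_deps` and the option labels come from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class SearchConfig:
    # Hypothetical container for the proposed configuration.
    package: str
    related_deps: list[str] = field(default_factory=list)


def autocomplete_scope(config):
    """Packages that autocompletion should cover."""
    # Without related_deps this is just the current package.
    return [config.package, *config.related_deps]


def search_options(config):
    """Return (radio button labels, default label) for full-text search."""
    if config.related_deps:
        options = [
            "Current project",
            "Current project + Related packages",
            "HexDocs",
        ]
        return options, "Current project + Related packages"
    return ["Current project", "HexDocs"], "Current project"
```

For example, a package that lists two related deps would autocomplete across three packages and default to the "Current project + Related packages" option.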

To power this feature, we will build a new service that does both autocompletion and search based on a SQLite3 database. We have proofs of concept from:

@ruslandoga who shared his notes here: https://gist.github.com/ruslandoga/7f0f5b68d760ec5b3e650e7f73f694f2
@jeregrine who posted his code here: https://github.com/jeregrine/hex-search

The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and for Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question of whether we want to host the SQLite3 builder on Hex.pm.
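
As a rough sketch of what scoped autocompletion over such a database could look like, the snippet below uses a plain table and a prefix `LIKE` query. The schema (`items` with `package`, `title`, `ref`) and the function names are assumptions made for illustration. They are not taken from either proof of concept.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory index from (package, title, ref) tuples."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items ("
        "package TEXT NOT NULL, title TEXT NOT NULL, ref TEXT NOT NULL)"
    )
    db.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
    return db


def autocomplete(db, prefix, packages, limit=10):
    """Return items whose title starts with prefix, limited to packages."""
    placeholders = ", ".join("?" for _ in packages)
    # Escape LIKE wildcards so that "_" and "%" in the prefix match literally.
    escaped = (
        prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    sql = (
        "SELECT package, title, ref FROM items "
        f"WHERE package IN ({placeholders}) AND title LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?"
    )
    return db.execute(sql, [*packages, escaped + "%", limit]).fetchall()
```

Passing the current package plus its related deps as `packages` gives the "With related_deps" behaviour. Passing only the current package gives the current behaviour.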


josevalim · 2023-11-10T13:45:28Z

Btw, I have a dump of the database already, in case someone wants to use it for a proof of concept. Just ping me elsewhere and I will send a link. We should also skip any license.html and changelog.html files we find.

ruslandoga · 2023-11-11T12:08:28Z

@josevalim 🙋‍♂️ I'd like to compare the dump with the data I've scraped.

Also, would it be possible to get access to fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.

josevalim · 2023-11-11T12:13:35Z

Getting access to logs is probably difficult but the Hex team may accept a PR that adds this computation. I cannot answer for them though, so you will have to ask. :)

jeregrine · 2023-11-11T12:49:48Z

> Also, would it be possible to get access to fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.

You could look at the dependency graph and weigh by downloads and get a crude measurement of it.

> The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and for Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question of whether we want to host the SQLite3 builder on Hex.pm.

The code actually only grabs new packages and re-indexes them since Hex can sort by updated_at. So you could run that daily and it would take seconds.
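
A minimal sketch of that incremental selection, assuming the package list arrives sorted by `updated_at` with the newest first. The dictionary shape and the function names are hypothetical and are not taken from the hex-search code.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2023-11-11T10:00:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def packages_to_reindex(packages, last_indexed_at):
    """Return names of packages updated after the previous indexing run.

    Assumes packages is sorted by updated_at, newest first, so the scan
    can stop at the first package that has not changed since the last run.
    """
    stale = []
    for pkg in packages:
        if parse_ts(pkg["updated_at"]) <= last_indexed_at:
            break
        stale.append(pkg["name"])
    return stale
```

Because the scan stops at the first unchanged package, a daily run only touches the handful of packages published since the previous run.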

One of the reasons it's so slow is that the JS files containing the indexable items, sidebar_items-<rand-id>.js and search_items-<rand-id>.js, always have a different name, so I need to GET the HTML, find the script src, then GET the JS; then do the same for the search page. Changing the rand-id to a query string for cache busting, like search_items.js?vsn=<rand-id>, would mean I could make only 2 requests and skip parsing HTML.
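
To illustrate the difference, here is a small sketch of both lookups. The regular expression, the `dist/` path, and the function names are assumptions for illustration. The real hex-search code may locate the script differently.

```python
import re

# Matches a <script> tag whose src points at a hashed search_items file.
SCRIPT_SRC = re.compile(r'<script[^>]*\bsrc="([^"]*search_items-[^"]+\.js)"')


def find_search_items_src(html):
    """Today: the file name is only known after fetching and scanning the HTML."""
    match = SCRIPT_SRC.search(html)
    return match.group(1) if match else None


def stable_search_items_url(base_url, vsn=None):
    """Proposed: a fixed file name, with the hash moved to a query string."""
    url = base_url.rstrip("/") + "/dist/search_items.js"
    return url + (f"?vsn={vsn}" if vsn else "")
```

With a stable name, a crawler can build the URL directly from the package name and never needs the HTML page.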

> @ruslandoga who shared his notes here: https://gist.github.com/ruslandoga/7f0f5b68d760ec5b3e650e7f73f694f2

@ruslandoga nice idea with the SQLite C function. I did it the lazy way with SQL and it's not too slow: https://github.com/jeregrine/hex-search/blob/main/lib/hex_docs_search/hex.ex#L50

ruslandoga · 2023-11-11T12:55:58Z

> You could look at the dependency graph and weigh by downloads and get a crude measurement of it.

Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.

josevalim · 2023-11-11T13:05:41Z

@jeregrine oh, so you skip downloading the whole docs tar?

jeregrine · 2023-11-11T14:15:20Z

> @jeregrine oh, so you skip downloading the whole docs tar?

Didn't even know it was downloadable. :-) But yeah, I don't do that. It might be faster, at the cost of more disk/memory usage. ¯_(ツ)_/¯

> Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.

Actually, the more I think about it, nvm. It's messy.

rhcarvalho · 2023-11-12T12:46:39Z

In the current design, would this require packages to update their ex_doc dependency and release a new version or would it work regardless of which version of ex_doc was used to generate the documentation?

ruslandoga · 2023-11-12T12:48:53Z

👋 @rhcarvalho

The new search functionality (assets/js) would only be present in the new ex_doc version, so I think it's more likely that the packages would need to upgrade to get global search from their documentation pages. But for a package to be indexable, they don't need to upgrade.

zachdaniel · 2024-01-12T16:23:40Z

👋 hey everyone, just checking in. Is this in progress? If so, any way I can assist? If not, I may be able to help get it off the ground :)

josevalim · 2024-01-12T16:47:11Z

There is a delay because we are also investigating if it makes sense to add embeddings to the docs, so we can also use it to provide context for LLMs (such as OpenAI). I will try to post more information soon. :)

zachdaniel · 2024-01-12T16:48:15Z

Sounds good! Thanks for your hard work. Not trying to hurry. I'm happy to wait, just want to assist if possible/warranted.

josevalim · 2024-01-12T17:14:21Z

That's really good to know. I will reach out once we have an action plan, unless you are also happy to get involved in the "figure it out" process and write some JS too? :)

zachdaniel · 2024-01-12T17:26:07Z

Yeah, I'd be very happy to be involved in any way. Cross package search is a major win for the Ash ecosystem, and is absolutely worth me spending my time on.

couhajjou · 2024-01-12T22:03:02Z

I see 4 search planes:

repo - current repo
deps - all deps (from mix.deps)
framework (set by framework author)
pinned repos (set by user)

Please empower the user

couhajjou · 2024-01-12T22:13:38Z

I am WIP-ing 'pinned repos' in ex_doc. It's the most versatile.
The idea is that any repo should have a JSON file, search_data.json

it's just the json version of this file:
https://hexdocs.pm/ash_postgres/dist/search_data-C114CB12.js

both search_data.js and search_data.json will include the package info like this:

That would allow the UI to ingest the search_data.json files of the pinned repos

and display the info like this

And we need to change the UI a bit, but that idea was already sketched in this post.
It's just a matter of a little UI design.

Pinned repos can be stored in

local storage.
or chrome extension
or an account on hexdocs. (so that hexdocs can have all our emails ;)

It's not a big change to ex_doc.

And ofc we need to keep caching and versioning as it is now in search_results_72517.js

josevalim · 2024-01-13T09:01:53Z

We explored this but sometimes those files can be really large and building an index of all of them in real time would become very expensive. Often the resulting index was so large that it would blow up local storage, which would cause us to index them every time, making it worse.

couhajjou · 2024-01-13T17:13:34Z

@josevalim, I am not sure that you read my comment here: #1811 (comment)

here it is again:

I see 4 search planes:

1- repo - current repo
2- deps - all deps (from mix.deps)
3- framework (set by framework author)
4- pinned repos (set by user)

I am addressing here the solution 4-pinned repos.

in the local storage we just store the list like this:

pinned-packages: [
  {
     package: 'ash',
     search-index-url: 'http://hexdocs/ash/searchIndex.json'
  },
  {
     package: 'ash_postgres',
     search-index-url: ....
  }
]

It's the user who decides which repos they want to 'pin'.

The Ash search index is 104KB and it's cached in the browser cache.
10 ash repos would be around 1MB.

So for ash framework users it will be a few bytes in the local storage.
And 1MB in the cache.

Please correct me if I am missing something, as I am WIP-ing this.

couhajjou · 2024-01-13T18:13:40Z

Here is the architecture and the UI I propose for search

1- repo - current repo ====>ex_doc feature. Offline and online search
2- deps - all deps ====>hexdocs search engine. Online only. not available offline
3- framework (set by framework author) ===> ex_doc feature. Offline and online
4- pinned repos (set by user) ===> ex_doc feature. Offline and online

So ex_doc search for 1 3 and 4
And hexdocs search for 2

We have to have one UI.
THE SEARCH INPUT in ex_doc will be able to do :

call to ex_doc internal (this how it works now)
call to hexdocs search API (to be implemented)

I am just WIP-ing 3 and 4
1 is working
2 is a hexdocs project. It needs something like Algolia

So with 1 3 and 4 I can do some ash and phoenix coding on the plane @zachdaniel ;)

couhajjou · 2024-01-13T18:20:41Z

A complication to discuss later: you can pin online repos and/or local repos (if they are on your HD).

Like mix.deps can have local and remote packages.

Sounds complex but can be simple
....

josevalim · 2024-01-13T18:34:22Z

Right. But you can imagine a new user would also want to pin Elixir itself, and we know for a fact Elixir was too big to cache (so we added compression). Ecto and Phoenix are also on the larger side. So I wonder if those three would not be enough to blow up session storage space?

couhajjou · 2024-01-13T20:15:54Z

Local cache is 10MB. Elixir search index is 2MB. On the plane it's not a problem, since we are loading from disk. Online we might have a cache miss, that's life :) Then the browser hits the CDN. If you want to cap everything to 10MB you can, and make it like an Amazon Kindle that tells the user there is no more storage, with a UI like this:

Pinned Repos	Size	Actions
ash	108 KB	[Unpin] [Download local doc] <- Keep coding while on the plane to ElixirConf
elixir	2.1 MB	[Unpin] [Download local doc] //sessionStorage.removeItem(elixir)
Total	2.2 MB

I am saying let's empower the user. The persona is a Dev. So it's ok if the UX is technical a bit.

---This is tangent and maybe crazy ----

This is tangent but we could also Ideate a chrome extension UX. Don't we need one for phoenix ?
It can be The Phoenix Chrome extension and we could put other things in there.

A level of gamification is to track the most pinned repos. Like github (forks/stars). It can create another dynamic, with prizes in ElixirConf. It's a design technique used in Building Architecture. You take a technical limitation and make it useful and elegant. (Designer Trick)

josevalim · 2024-01-13T20:36:11Z

Interesting...

However, I think we have to be a bit less optimistic. We still need session storage for other indexes. For example, imagine you fill in your index with 9MB without Elixir. Now, without additional space, if you try to search Elixir docs, it will go through the slow path and rebuild the index every time. So maybe 7MB of custom search max.
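
The storage arithmetic in this exchange can be made concrete. This is a rough sketch, assuming decimal megabytes and the sizes quoted in the thread (a 10MB quota, a 7MB cap for pinned indexes, 108 KB for ash, 2.1 MB for elixir). `can_pin`, `reserved_for_other_indexes`, and the constant names are hypothetical.

```python
# Sizes quoted in the thread, in bytes (decimal units are an assumption).
QUOTA_BYTES = 10 * 1000 * 1000
PINNED_CAP_BYTES = 7 * 1000 * 1000


def pinned_total(pinned_sizes):
    """Total bytes used by the pinned search indexes."""
    return sum(pinned_sizes.values())


def can_pin(pinned_sizes, new_size, cap=PINNED_CAP_BYTES):
    """True if another index of new_size bytes still fits under the cap."""
    return pinned_total(pinned_sizes) + new_size <= cap


def reserved_for_other_indexes(cap=PINNED_CAP_BYTES, quota=QUOTA_BYTES):
    """Space left for indexes of packages the user did not pin."""
    return quota - cap
```

With ash and elixir pinned (about 2.2 MB together), a further 4 MB index fits under the 7MB cap but a 5 MB one does not, and 3 MB stays free for indexes of packages that were not pinned.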

And you are right about empowering the user... but should we realistically expect users to craft their own search engines? Projects like Elixir, Phoenix, and Ash need the search to work across several repos out of the box and I would focus on that instead. The good news is that I am quite sure your ideas could be fully explored as a separate project!

PS: in any case, I don't think this solves the airplane case either. You have the search contents but not the rendered pages themselves. You could try to rebuild them from the index but not all information is available.

couhajjou · 2024-01-13T20:58:05Z

For the plane use case, when I am working within a project all I need is within my_ash_project/deps. Ex_doc should reach there. Think of my_ash_project/deps as a cache for hexdocs. I understand you want to do 2. But until then, ex_doc or a fork of it can do 1, 3 and 4. I would use it locally. My understanding is that you allow different documentation tools. So for me it's not either ex_doc or hexdocs search. It's both of them. If you decide to enforce a certain builder on hexdocs I'll respect that. And I can use ex_doc_mutirepo_search as a local book on my computer. I love to have a physical copy on my disk. Ex_doc is great and we can make it better. Thanks


josevalim · 2024-01-13T21:03:14Z

I see, that definitely feels out of scope for ExDoc then. :) I recommend exploring this on your own, something that builds the docs in the deps folder and creates a unified search interface. Bonus points if it works both online and offline. Meanwhile, let's please refocus this issue on its original description. Thank you!

couhajjou · 2024-01-15T19:55:31Z

@josevalim, in that case I would suggest moving the hexdocs search feature, as you envisioned it, to the hexdocs repo.

Here are my arguments:

- ex_doc is not hexdocs.
- ex_doc is an HTML eBook generator. The generated eBook is searchable and self-contained. The search feature is part of the generated book.
- hexdocs is a book library. The book library should have a search engine.
- ex_doc search is client based.
- hexdocs search is server based.
- The hexdocs search architecture is to be done within the hexpm/hexdocs team/project effort.
- hexpm should publish a protocol that has to be satisfied by package authors who want their documentation to be searchable by hexdocs.
- That protocol will be implemented by ex_doc version X, and the upgrade will be seamless: upgrade ex_doc, run mix docs.

From a business perspective:
- ex_doc is a product (distributed through GitHub)
- hexdocs is a service (run by the hexpm organisation)

I suggest we figure out the TechnicalDesign/Architecture of the search functionality. we have 2 product/services (ex_doc, hexdocs).

For UX I would suggest the Apple approach: one UX across physically separate, complementary devices.

One search experience through ex_doc and hexdocs, the user will not notice the discontinuity.

josevalim · 2024-01-15T20:13:02Z

That's historically how we have implemented features in Hexdocs that are used by ExDoc and that's most likely how we plan to implement this one too: Hexdocs provides a generic interface for others to hook into and ExDoc simply acts as one of the clients.

josevalim · 2024-01-15T20:15:09Z

It all depends if the Hexdocs team wants to maintain a search service. If not, then a third service will consume Hexdocs packages (Hexdocs then works as "storage") and ExDoc then talks to said service. The feature is listed here because most of the work will be done by the ExDoc team anyway.

couhajjou · 2024-01-15T20:42:34Z

Interesting. An ex_doc with a plugin architecture would be cool (embedding the search form and search results), so that ex_doc wouldn't have a code dependency on hexdocs, and integrating different search engines (including Google) would be super easy and free. And one day, AI search within the ex_doc UX. You wouldn't have to know the ex_doc code base to implement a search plugin.


ruslandoga · 2024-04-29T06:21:34Z

👋

I'm interested in working on this and would love to collaborate with anyone else currently involved! I'll start by revisiting the SQLite approaches and checking if there are better options available now (typesense, meilisearch, etc.).

josevalim · 2024-05-05T17:09:44Z

Hi @ruslandoga! At the moment, we are thinking about going with Postgres. We will compute our own text embeddings using machine learning models and store them with pgvector. What are your thoughts?

ruslandoga · 2024-05-06T01:22:02Z

👋 @josevalim oh right, I forgot about your comment above on wanting to add semantic search... Sorry! I should probably reread this thread.

With SQLite I kept the embeddings in a BLOB, loaded them all into memory on startup, and used https://github.com/elixir-nx/hnswlib as the index. That was too complicated and a bit resource-intensive; pgvector would likely make it much simpler and more efficient :)
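
For readers unfamiliar with that setup, here is a simplified sketch of the "embeddings in a BLOB, loaded into memory" idea. It deliberately swaps the HNSW index for a brute-force cosine-similarity scan to stay self-contained. The `docs` table, the float32 packing, and the function names are assumptions, not the actual implementation.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a vector as little-endian float32 values for a BLOB column."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack: 4 bytes per float32 component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def load_embeddings(db):
    """Load every (ref, vector) pair into memory, as done on startup."""
    rows = db.execute("SELECT ref, embedding FROM docs")
    return [(ref, unpack(blob)) for ref, blob in rows]


def nearest(embeddings, query, k=3):
    """Brute-force scan: rank all documents by similarity to the query."""
    ranked = sorted(
        embeddings, key=lambda item: cosine(item[1], query), reverse=True
    )
    return [ref for ref, _ in ranked[:k]]
```

The scan is linear in the number of documents, which is why an approximate index (or pgvector on the Postgres side) becomes attractive once all of HexDocs is indexed.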

But I was rather wondering about the basic global search, like a global autocomplete, is that still planned? Would Postgres be used for that as well?

josevalim · 2024-05-06T07:08:09Z

Yes, the goal would be to use PG for that as well.

ruslandoga mentioned this issue Nov 11, 2023

Adding co-occurrence metrics to facilitate finding related packages hexpm/hexpm#1218

Open

garazdawi mentioned this issue Nov 20, 2023

Convert docs to use EEP-59 and Markdown garazdawi/otp#75

Merged

26 tasks

garazdawi mentioned this issue Mar 28, 2024

New doc search issues erlang/otp#8321

Closed
