
Index Nextcloud content in an external indexation platform

Maxime LE HERICY edited this page Nov 28, 2022 · 2 revisions

For a ready-to-use Full Text Search feature with Nextcloud based on Elasticsearch, see https://github.com/nextcloud/fulltextsearch_elasticsearch/wiki

Quick documentation overview

This documentation explains how to index the content and metadata of documents stored on Nextcloud in a third-party indexation platform.

The Full Text Search app maintains a collection (a list) of the documents that have changed since the last indexation.

The following APIs can be used by a third-party app to:

  1. query the collection to find out which documents have changed,
  2. retrieve each document's content and metadata (which can then be indexed in any indexation platform),
  3. update the collection to record which documents have been indexed.

Create a collection

Create one collection for each script that indexes the content.

To list all collections:

 ./occ fulltextsearch:collection:list

To create a new collection:

 ./occ fulltextsearch:collection:init <collectionName>

To destroy a collection:

 ./occ fulltextsearch:collection:delete <collectionName>

OCS API

Using the OCS API requires admin rights on the account used.
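
The curl calls in this section send an OCS-APIRequest: true header and HTTP Basic credentials. A minimal Python sketch of the same request, using only the standard library, could look like this. build_ocs_request and list_changed_url are hypothetical helper names, and the host and credentials are the placeholders used in the examples on this page:

```python
import base64
import urllib.request


def build_ocs_request(
    url: str, user: str, password: str, method: str = "GET"
) -> urllib.request.Request:
    """Build a request carrying the OCS header and HTTP Basic credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        method=method,
        headers={
            "OCS-APIRequest": "true",
            "Authorization": f"Basic {token}",
        },
    )


def list_changed_url(base_url: str, collection: str, length: int = 50) -> str:
    """URL of the endpoint that lists the documents waiting to be indexed."""
    return (
        f"{base_url}/ocs/v2.php/apps/fulltextsearch/collection/"
        f"{collection}/index?format=json&length={length}"
    )


request = build_ocs_request(
    list_changed_url("https://cloud.example.net", "test"), "admin", "password"
)
```

Passing request to urllib.request.urlopen would send it; the sketch stops before that step so that it can be read and run without a Nextcloud server.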

List the documents that need to be indexed (using test as the collection name):

curl -X GET "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/index?format=json&length=50" -H "OCS-APIRequest: true" -u "admin:password"
{
  "ocs": {
    "meta": {
      "status": "ok",
      "statuscode": 200,
      "message": "OK"
    },
    "data": [
      {
        "url": "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/document/files/597996",
        "status": 28
      }
    ]
  }
}
  • url is the link to the document,
  • status is a bit field combining the following flags (28 in the example above is 4 + 8 + 16):
    • 4 => metadata has been modified,
    • 8 => content has been modified,
    • 16 => parts have been modified,
    • 32 => document has been removed.
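
The status value can be split back into its flags with a bitwise AND. The following is a minimal Python sketch; the IndexStatus class and its member names are hypothetical labels chosen for illustration, and only the numeric values come from the list above:

```python
from enum import IntFlag


class IndexStatus(IntFlag):
    """Status bits from the list above (member names are illustrative)."""

    META = 4       # metadata has been modified
    CONTENT = 8    # content has been modified
    PARTS = 16     # parts have been modified
    REMOVED = 32   # document has been removed


def describe_status(status: int) -> list[str]:
    """Return the names of the flags that are set in a status value."""
    return [flag.name for flag in IndexStatus if status & flag]


print(describe_status(28))  # ['META', 'CONTENT', 'PARTS']
```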

Get data and metadata from a document:

curl -X GET "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/document/files/597996" -H "OCS-APIRequest: true" -u "admin:password"
{
  "ocs": {
    "meta": {
      "status": "ok",
      "statuscode": 200,
      "message": "OK"
    },
    "data": {
      "id": "597996",
      "providerId": "files",
      "access": {
        "ownerId": "cult",
        "viewerId": "",
        "users": ["test1", "test2"],
        "groups": [],
        "circles": [],
        "links": []
      },
      "index": {
        "ownerId": "cult",
        "providerId": "files",
        "collection": "test",
        "source": "files_local",
        "documentId": "597996",
        "lastIndex": 0,
        "errors": [],
        "errorCount": 0,
        "status": 28,
        "options": []
      },
      "title": "640-240-max.png",
      "link": "http://nc23.local/index.php/f/597996",
      "parts": {
        "comments": "<test1> This is a comment !"
      },
      "content": "VGhlIHF1aWNrIGJyb3duIGZveApqdW1wcyBvdmVyCnRoZSBsYXp5IGRvZy4=",
      "isContentEncoded": 1
    }
  }
}

content is encoded with base64. For a text file, the text itself is available as the encoded content. For an Office document, the whole content of the file is available this way. For an image, the content is the text extracted by OCR; this is the file used in our current example:

[image: 640-240-max]

$ php -r "echo base64_decode('VGhlIHF1aWNrIGJyb3duIGZveApqdW1wcyBvdmVyCnRoZSBsYXp5IGRvZy4=');"
The quick brown fox
jumps over
the lazy dog.
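
The same decoding can be done from a script that has already parsed the JSON response. This is a minimal Python sketch: document_text is a hypothetical helper name, and treating the decoded bytes as UTF-8 and returning content unchanged when isContentEncoded is not set are assumptions, not documented behaviour:

```python
import base64


def document_text(data: dict) -> str:
    """Return the text of the "data" object of a document response."""
    content = data.get("content", "")
    if data.get("isContentEncoded"):
        # Assumption: the decoded bytes are UTF-8 text.
        return base64.b64decode(content).decode("utf-8")
    return content


sample = {
    "content": "VGhlIHF1aWNrIGJyb3duIGZveApqdW1wcyBvdmVyCnRoZSBsYXp5IGRvZy4=",
    "isContentEncoded": 1,
}
print(document_text(sample))
```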

Mark a document as indexed:

curl -X POST "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/document/files/597996/done" -H "OCS-APIRequest: true" -u "admin:password"
{
  "ocs": {
    "meta": {
      "status": "ok",
      "statuscode": 200,
      "message": "OK"
    },
    "data": []
  }
}
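
Put together, the three calls form one indexing pass: list the changed documents, fetch each one, then mark it as done. The Python sketch below shows that loop under stated assumptions. call(method, url) is a hypothetical transport that performs the authenticated request and returns the parsed JSON, and index_document and remove_document stand in for whatever the external indexation platform needs. Skipping the fetch for documents flagged as removed (status bit 32) is also an assumption that this page does not confirm:

```python
from typing import Any, Callable

REMOVED = 32  # status bit: the document has been removed


def run_indexing_pass(
    call: Callable[[str, str], Any],
    index_document: Callable[[dict], None],
    remove_document: Callable[[str], None],
    base_url: str,
    collection: str,
    length: int = 50,
) -> int:
    """Process one batch of changed documents and return how many were handled."""
    list_url = (
        f"{base_url}/ocs/v2.php/apps/fulltextsearch/collection/"
        f"{collection}/index?format=json&length={length}"
    )
    entries = call("GET", list_url)["ocs"]["data"]
    for entry in entries:
        if entry["status"] & REMOVED:
            # Assumption: a removed document is dropped from the external
            # index instead of being fetched.
            remove_document(entry["url"])
        else:
            document = call("GET", entry["url"])["ocs"]["data"]
            index_document(document)
        # Tell Nextcloud that this document has been handled.
        call("POST", entry["url"] + "/done")
    return len(entries)
```

A script could call run_indexing_pass repeatedly until it returns 0, which in this sketch means the list of changed documents came back empty.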