diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb new file mode 100644 index 000000000000..39f655a5a9a8 --- /dev/null +++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb @@ -0,0 +1,967 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "71bdcc40", + "metadata": {}, + "source": [ + "# Learn to delete data with Druid API\n", + "\n", + "\n", + "In working with data, Druid retains copies of existing data segments in deep storage and on Historical processes. As new data is added to Druid, deep storage grows over time unless data is explicitly removed.\n", + "\n", + "While deep storage is an important part of Druid's elastic, fault-tolerant design, data accumulation in deep storage can lead to increased storage costs. Periodically deleting data reclaims storage space and promotes optimal resource allocation.\n", + "\n", + "This notebook provides a tutorial on deleting existing data in Druid using the Coordinator API endpoints. \n", + "\n", + "## Table of contents\n", + "\n", + "- [Prerequisites](#Prerequisites)\n", + "- [Ingest data](#Ingest-data)\n", + "- [Deletion steps](#Deletion-steps)\n", + "- [Delete by time interval](#Delete-by-time-interval)\n", + "- [Delete by segment ID](#Delete-by-segment-ID)\n", + "- [Delete entire table](#Delete-entire-table)\n", + "- [Conclusion](#Conclusion)\n", + "- [Learn more](#Learn-more)\n", + "\n", + "For the best experience, use JupyterLab so that you can always access the table of contents."
+ ] + }, + { + "cell_type": "markdown", + "id": "6fc260fc", + "metadata": {}, + "source": [ + "\n", + "## Prerequisites\n", + "\n", + "This tutorial works with Druid 26.0.0 or later.\n", + "\n", + "\n", + "Launch this tutorial and all prerequisites using the `druid-jupyter`, `kafka-jupyter`, or `all-services` profiles of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n", + "\n", + "If you do not use the Docker Compose environment, you need the following:\n", + "\n", + "* A running Druid instance.
\n", + " Update the `druid_host` variable to point to your Router endpoint. For example:\n", + " ```\n", + " druid_host = \"http://localhost:8888\"\n", + " ```" + ] + }, + { + "cell_type": "markdown", + "id": "7b8a7510", + "metadata": {}, + "source": [ + "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens.\n", + "\n", + "`druid_host` is the hostname and port for your Druid deployment. In a distributed environment, you can point to other Druid services. In this tutorial, you'll use the Router service as the `druid_host`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed52d809", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import json\n", + "\n", + "# druid_host is the hostname and port for your Druid deployment. \n", + "# In the Docker Compose tutorial environment, this is the Router\n", + "# service running at \"http://router:8888\".\n", + "# If you are not using the Docker Compose environment, edit the `druid_host`.\n", + "\n", + "druid_host = \"http://router:8888\"\n", + "druid_host" + ] + }, + { + "cell_type": "markdown", + "id": "e429b61e", + "metadata": {}, + "source": [ + "If your cluster is secure, you'll need to provide authorization information on each request. You can automate it by using the Requests `session` feature. Although this tutorial assumes no authorization, the configuration below defines a session as an example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfa75fc5", + "metadata": {}, + "outputs": [], + "source": [ + "session = requests.Session()" + ] + }, + { + "cell_type": "markdown", + "id": "6f3c9a92", + "metadata": {}, + "source": [ + "Before proceeding with the tutorial, use the `/status/health` endpoint to verify that the cluster is up and running. 
This endpoint returns the value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18a8a495", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/status/health'\n", + "response = session.get(endpoint)\n", + "response.text" + ] + }, + { + "cell_type": "markdown", + "id": "19144be9", + "metadata": {}, + "source": [ + "In the rest of this tutorial, the code cells update the `endpoint` and other variables to call the Druid endpoint needed for each task." + ] + }, + { + "cell_type": "markdown", + "id": "7a281144", + "metadata": {}, + "source": [ + "## Ingest data\n", + "\n", + "Apache Druid partitions data by time chunk into segments and supports deleting data by dropping entire segments. To start, ingest the quickstart Wikipedia data and partition it by hour to create multiple segments.\n", + "\n", + "First, set the endpoint to the `sql/task` endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa1e227f", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql/task'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "02e4f551", + "metadata": {}, + "source": [ + "Next, use the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion and create a `wikipedia_hour` datasource with hour segmentation. \n", + "\n", + "To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1208f3ac", + "metadata": {}, + "outputs": [], + "source": [ + "sql = '''\n", + "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n", + "WITH \"ext\" AS (SELECT *\n", + "FROM TABLE(\n", + " EXTERN(\n", + " '{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n", + " '{\"type\":\"json\"}'\n", + " )\n", + ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR, \"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR, \"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\" VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR, \"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n", + "SELECT\n", + " TIME_PARSE(\"time\") AS \"__time\",\n", + " \"channel\",\n", + " \"cityName\",\n", + " \"comment\",\n", + " \"countryIsoCode\",\n", + " \"countryName\",\n", + " \"isAnonymous\",\n", + " \"isMinor\",\n", + " \"isNew\",\n", + " \"isRobot\",\n", + " \"isUnpatrolled\",\n", + " \"metroCode\",\n", + " \"namespace\",\n", + " \"page\",\n", + " \"regionIsoCode\",\n", + " \"regionName\",\n", + " \"user\",\n", + " \"delta\",\n", + " \"added\",\n", + " \"deleted\"\n", + "FROM \"ext\"\n", + "PARTITIONED BY HOUR\n", + "'''" + ] + }, + { + "cell_type": "markdown", + "id": "1cf78bb7", + "metadata": {}, + "source": [ + "The following cell builds up a Python map that represents the Druid `SqlRequest` object."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "543b03ee", + "metadata": {}, + "outputs": [], + "source": [ + "sql_request = {\n", + " 'query': sql\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "8312c9d4", + "metadata": {}, + "source": [ + "With the SQL request ready, use the `json` parameter of the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. The result is a Requests `Response`, which is saved in a variable.\n", + "\n", + "Now, run the next cell to start the ingestion." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9540926f", + "metadata": {}, + "outputs": [], + "source": [ + "response = session.post(endpoint, json=sql_request)\n", + "response.status_code" + ] + }, + { + "cell_type": "markdown", + "id": "e79ec7a3-7924-4397-b032-f21a6fa1873f", + "metadata": {}, + "source": [ + "It takes a while for Druid to load the resulting segments. Run the following cell and wait for the ingestion status to display \"The ingestion is complete\" before moving on."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9776919-5240-4dde-9a09-a4bf76ae9a44", + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "# Avoid naming this variable `json`, which would shadow the imported module.\n", + "response_json = response.json()\n", + "ingestion_taskId = response_json['taskId']\n", + "\n", + "endpoint = druid_host + f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n", + "status_json = session.get(endpoint).json()\n", + "\n", + "ingestion_status = status_json['status']['status']\n", + "\n", + "if ingestion_status == \"RUNNING\":\n", + " print(\"The ingestion is running...\")\n", + "\n", + "# Poll while the task is running; stop on SUCCESS or FAILED.\n", + "while ingestion_status == \"RUNNING\":\n", + " time.sleep(5) # 5 seconds \n", + " status_json = session.get(endpoint).json()\n", + " ingestion_status = status_json['status']['status']\n", + " \n", + "if ingestion_status == \"SUCCESS\": \n", + " print(\"The ingestion is complete.\")\n", + "else:\n", + " print(\"The ingestion task failed:\", status_json)" + ] + }, + { + "cell_type": "markdown", + "id": "cab33e7e", + "metadata": {}, + "source": [ + "Once the data has been ingested, Druid is populated with segments for each segment interval that contains data. You should see 24 segments associated with `wikipedia_hour`. \n", + "\n", + "For demonstration, let's view the segments generated for the `wikipedia_hour` datasource before any deletion is made. Run the following cell to set the endpoint to `/druid/v2/sql`. For more information on this endpoint, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n", + "\n", + "Using this endpoint, you can query the `sys` [metadata table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "956abeee", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "701550dd", + "metadata": {}, + "source": [ + "Now, you can query the metadata table to retrieve segment information. The following cell sends a SQL query to retrieve `segment_id` information for the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to `objectLines`. This helps format the response with newlines and makes it easier to parse the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb54a6b7", + "metadata": {}, + "outputs": [], + "source": [ + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "f06e24e5", + "metadata": {}, + "source": [ + "Observe the response retrieved from the previous cell. In total, there are 24 `segment_id` records, each containing the datasource name `wikipedia_hour` along with the start and end of its hour interval. The tail end of the ID is a timestamp, the segment's version, which records when the segment was created. \n", + "\n", + "For this tutorial, we are concerned with the start and end interval of each `segment_id`. \n", + "\n", + "For example: \n", + "`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}` indicates this segment contains data from `2015-09-12T00:00:00.000Z` to `2015-09-12T01:00:00.000Z`."
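+ ] + }, + { + "cell_type": "markdown", + "id": "0a1b2c3d", + "metadata": {}, + "source": [ + "To make the ID structure concrete, the next cell splits the example `segment_id` above into its parts. This parsing is illustrative only; the helper logic is not part of the Druid API, and it assumes the datasource name is `wikipedia_hour`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a1b2c3e", + "metadata": {}, + "outputs": [], + "source": [ + "# Split an example segment_id into interval start, interval end,\n", + "# and version (illustrative helper, not a Druid API).\n", + "example_id = \"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"\n", + "datasource = \"wikipedia_hour\"\n", + "start, end, version = example_id[len(datasource) + 1:].split(\"_\")\n", + "print(\"start:  \", start)\n", + "print(\"end:    \", end)\n", + "print(\"version:\", version)"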
+ ] + }, + { + "cell_type": "markdown", + "id": "ca79f5f9", + "metadata": {}, + "source": [ + "## Deletion steps" + ] + }, + { + "cell_type": "markdown", + "id": "b6cd1c8c", + "metadata": {}, + "source": [ + "Permanent deletion of a segment in Druid has two steps:\n", + "\n", + "1. Mark a segment as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. Note that marking a segment as \"unused\" is a soft delete: the segment is no longer available for querying, but its files remain in deep storage and its records remain in the metadata store. \n", + "2. Send a kill task to permanently remove \"unused\" segments. This deletes the segment file from deep storage and removes its record from the metadata store. This is a hard delete: the data is unrecoverable unless you have a backup." + ] + }, + { + "cell_type": "markdown", + "id": "b9bc7f00", + "metadata": {}, + "source": [ + "## Delete by time interval" + ] + }, + { + "cell_type": "markdown", + "id": "1040bdaf", + "metadata": {}, + "source": [ + "You can delete segments within a specified time interval. This begins with marking all segments in the interval as \"unused\", then sending a kill request to delete them permanently from deep storage.\n", + "\n", + "First, set the endpoint variable to the Coordinator API endpoint `/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the datasource ingested is `wikipedia_hour`, let's specify that in the endpoint."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9db8786d", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "863576a9", + "metadata": {}, + "source": [ + "The following cell constructs a JSON payload with the interval of segments to be deleted. This marks the segments in the interval from `18:00:00.000` (inclusive) to `20:00:00.000` (exclusive) as \"unused.\" This payload is sent to the endpoint in a `POST` request." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79387e72", + "metadata": {}, + "outputs": [], + "source": [ + "sql_request = {\n", + " \"interval\": \"2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z\"\n", + "}\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "response.text" + ] + }, + { + "cell_type": "markdown", + "id": "89e2fcb4", + "metadata": {}, + "source": [ + "The response from the above cell should return a JSON object with the property `\"numChangedSegments\"` and the value `2`. This refers to the following segments:\n", + "\n", + "* `{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z\"}`\n", + "* `{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z\"}`" + ] + }, + { + "cell_type": "markdown", + "id": "e61cae23", + "metadata": {}, + "source": [ + "Next, verify that the segments have been soft deleted. The following cell sets the endpoint variable to `/druid/v2/sql` and sends a `POST` request querying for the existing `segment_id`s. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea7c0d26", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "747bd12c", + "metadata": {}, + "source": [ + "Observe the response above. There should now be only 22 segments: the \"unused\" segments no longer appear. \n", + "\n", + "However, as you've only soft deleted the segments, they remain in deep storage.\n", + "\n", + "Before permanently deleting the segments, you can verify that they've only been soft deleted by inspecting your deep storage. The soft deleted segments are still there. This step is optional; you can move on to the next set of cells without completing it." + ] + }, + { + "cell_type": "markdown", + "id": "943b36cc", + "metadata": {}, + "source": [ + "[OPTIONAL] If you are running Druid externally from the Docker Compose environment, follow these instructions to view segments in deep storage:\n", + " \n", + "* Navigate to the Druid distribution directory; this is the same place from which you run `./bin/start-druid` to start Druid.\n", + "* Run this command: `ls -l1 var/druid/segments/wikipedia_hour`."
+ ] + }, + { + "cell_type": "markdown", + "id": "8ecedcaa", + "metadata": {}, + "source": [ + "[OPTIONAL] If you are running Druid within the Docker Compose environment, follow these instructions to view the segments in deep storage:\n", + "\n", + "* Navigate to your Docker terminal.\n", + "* Run this command: `docker exec -it historical ls /opt/shared/segments/wikipedia_hour`" + ] + }, + { + "cell_type": "markdown", + "id": "12737023", + "metadata": {}, + "source": [ + "The output should look similar to this:\n", + "\n", + "```bash\n", + "$ ls -l1 var/druid/segments/wikipedia_hour/\n", + "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n", + "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n", + "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n", + "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n", + "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n", + "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n", + "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n", + "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n", + "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n", + "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n", + "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n", + "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n", + "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n", + "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n", + "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n", + "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n", + "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n", + "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n", + "2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z\n", + "2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z\n", + "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n", + "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n", + "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n", + "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n", + "```" + ] + }, + { + "cell_type": 
"markdown", + "id": "38cca397", + "metadata": {}, + "source": [ + "Now you can move on to sending a kill task to permanently delete the segments from deep storage. You can do this with the `/druid/coordinator/v1/datasources/:dataSource/intervals/:interval` endpoint.\n", + "\n", + "The following cell uses the endpoint, setting the `dataSource` path parameter to `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. \n", + "\n", + "Notice that the interval is set to `2015-09-12_2015-09-13`, which covers all 22 remaining segments. Druid only permanently deletes the \"unused\" segments within this interval. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "672ad739", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "751380d5", + "metadata": {}, + "source": [ + "Run the next cell to send the `DELETE` request." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a6bdc6c", + "metadata": {}, + "outputs": [], + "source": [ + "response = session.delete(endpoint)\n", + "print(response.status_code)" + ] + }, + { + "cell_type": "markdown", + "id": "69d6e89a", + "metadata": {}, + "source": [ + "You can verify that the segments are completely gone by inspecting your deep storage, as in the optional step earlier. 
The command should return an output like the following:" + ] + }, + { + "cell_type": "markdown", + "id": "84ee47fd", + "metadata": {}, + "source": [ + "```bash\n", + "$ ls -l1 var/druid/segments/wikipedia_hour/\n", + "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n", + "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n", + "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n", + "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n", + "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n", + "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n", + "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n", + "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n", + "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n", + "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n", + "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n", + "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n", + "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n", + "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n", + "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n", + "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n", + "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n", + "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n", + "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n", + "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n", + "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n", + "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "8b2d59c8", + "metadata": {}, + "source": [ + "## Delete by segment ID" + ] + }, + { + "cell_type": "markdown", + "id": "a4e8453e", + "metadata": {}, + "source": [ + "In addition to deleting by interval, you can delete segments by using `segment_id`. Run the next cell to retrieve the `segment_id` of each segment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f4e4ed7", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "017d63e9", + "metadata": {}, + "source": [ + "If you know the `segment_id`, you can mark specific segments \"unused\" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e320037a", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "16861873", + "metadata": {}, + "source": [ + "In the next cell, construct a payload with a `segmentIds` property containing an array of `segment_id` values. The payload should contain the segment IDs for the intervals `01:00:00.000` to `02:00:00.000` and `05:00:00.000` to `06:00:00.000`, which will be marked as \"unused.\"\n", + "\n", + "Fill in the `segmentIds` array with the `segment_id` corresponding to these intervals, then run the cell."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1cec240", + "metadata": {}, + "outputs": [], + "source": [ + "sql_request = {\n", + " \"segmentIds\": [\n", + " \"\",\n", + " \"\"\n", + " ]\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "response.text" + ] + }, + { + "cell_type": "markdown", + "id": "cda3213c", + "metadata": {}, + "source": [ + "You should see a response with the `numChangedSegments` property and the value `2` for the two segments marked as \"unused.\"\n", + "\n", + "Run the cell below to view changes in the datasource's segments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9301e6df", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "671521b6", + "metadata": {}, + "source": [ + "Lastly, run the following cells to permanently delete the segments from deep storage." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9dd59374", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n", + "endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5f31f08", + "metadata": {}, + "outputs": [], + "source": [ + "response = session.delete(endpoint)\n", + "response.status_code" + ] + }, + { + "cell_type": "markdown", + "id": "a2bb5047-20bd-46bd-be82-de19acb5c1a1", + "metadata": {}, + "source": [ + "If you inspect your deep storage again, the directories for the two segments you killed are removed, along with their segment files."
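+ ] + }, + { + "cell_type": "markdown", + "id": "0b2c3d4e", + "metadata": {}, + "source": [ + "If you'd rather not copy segment IDs by hand, you can build the `segmentIds` payload programmatically. The following cell is a sketch of that approach: it queries `sys.segments` for the IDs of segments whose `start` falls in a given interval, then marks them \"unused.\" The helper function name and its parameters are illustrative, not part of any Druid API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b2c3d4f", + "metadata": {}, + "outputs": [], + "source": [ + "# Sketch: query sys.segments for the IDs of segments whose start\n", + "# falls in [start, end), then POST them to the markUnused endpoint.\n", + "def mark_unused_by_interval(datasource, start, end):\n", + "    sql = (\"SELECT segment_id FROM sys.segments \"\n", + "           f\"WHERE \\\"datasource\\\" = '{datasource}' \"\n", + "           f\"AND \\\"start\\\" >= '{start}' AND \\\"start\\\" < '{end}'\")\n", + "    rows = session.post(druid_host + '/druid/v2/sql', json={'query': sql}).json()\n", + "    payload = {'segmentIds': [row['segment_id'] for row in rows]}\n", + "    unused_endpoint = druid_host + f'/druid/coordinator/v1/datasources/{datasource}/markUnused'\n", + "    return session.post(unused_endpoint, json=payload).json()"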
+ ] + }, + { + "cell_type": "markdown", + "id": "f0d0578c", + "metadata": {}, + "source": [ + "## Delete entire table" + ] + }, + { + "cell_type": "markdown", + "id": "b8d0260a", + "metadata": {}, + "source": [ + "You can delete an entire table the same way you delete segments of a table, using an interval that covers the whole table.\n", + "\n", + "Run the following cell to reset the endpoint to `/druid/coordinator/v1/datasources/:dataSource/markUnused`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd354886", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "ed3cf7d3", + "metadata": {}, + "source": [ + "Next, send a `POST` with the payload `{\"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"}` to mark the entirety of the table as \"unused.\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25639752", + "metadata": {}, + "outputs": [], + "source": [ + "sql_request = {\n", + " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "response.status_code" + ] + }, + { + "cell_type": "markdown", + "id": "bbbba823", + "metadata": {}, + "source": [ + "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eac1db4c", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)" + ] + }, + { + "cell_type": "markdown", + "id": "deae8727", + "metadata": {}, + "source": [ + "Run the next cell to view the response. You should see that `response.text` returns nothing, but `response.status_code` returns a 200. \n", + "\n", + "The query returns the remaining segments, but since you deleted the table, there are no segments to return." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12c11291", + "metadata": {}, + "outputs": [], + "source": [ + "print(response.text)\n", + "response.status_code" + ] + }, + { + "cell_type": "markdown", + "id": "2e9a74da", + "metadata": {}, + "source": [ + "So far, you've soft deleted the table. Run the following cells to permanently delete the table from deep storage:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c3d7ec9", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n", + "endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98834167", + "metadata": {}, + "outputs": [], + "source": [ + "response = session.delete(endpoint)\n", + "response.status_code" + ] + }, + { + "cell_type": "markdown", + "id": "55745cb5-2869-4086-a247-62d78956c8b4", + "metadata": {}, + "source": [ + "If you inspect your deep storage again, the directory for the datasource is removed along with its corresponding segments."
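+ ] + }, + { + "cell_type": "markdown", + "id": "0c3d4e5f", + "metadata": {}, + "source": [ + "As an optional final check, you can ask the Coordinator which datasources still have used segments. After the kill task above completes, `wikipedia_hour` should no longer appear in the returned list." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c3d4e60", + "metadata": {}, + "outputs": [], + "source": [ + "# List datasources that still have used segments; wikipedia_hour\n", + "# should be absent after the kill task completes.\n", + "endpoint = druid_host + '/druid/coordinator/v1/datasources'\n", + "response = session.get(endpoint)\n", + "print(response.json())"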
+ ] + }, + { + "cell_type": "markdown", + "id": "998682dc", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this tutorial, you learned how to mark data segments as \"unused\" for soft deletion and send a kill task to permanently delete segments from deep storage. \n", + "\n", + "You learned two different methods for marking segments as \"unused\": \n", + "1. Deletion by time interval\n", + "2. Deletion by segment ID\n", + "\n", + "With the knowledge gained from this tutorial, you can now programmatically manage your storage for Druid. Go forth and Druid!" + ] + }, + { + "cell_type": "markdown", + "id": "048a8a1f", + "metadata": {}, + "source": [ + "## Learn more\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "0bc13660", + "metadata": {}, + "source": [ + "To dive deeper into concepts associated with data deletion, see the following:\n", + "\n", + "* [Deep storage](https://druid.apache.org/docs/latest/dependencies/deep-storage.html)\n", + "* [Segments](https://druid.apache.org/docs/latest/design/segments.html)\n", + "* [Data management](https://druid.apache.org/docs/latest/data-management/index.html)\n", + "* [API references](https://druid.apache.org/docs/latest/operations/api-reference.html)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}