From 5917958eec72f341111a5e7bc22c4be61a27828d Mon Sep 17 00:00:00 2001
From: demo-kratia <56242907+demo-kratia@users.noreply.github.com>
Date: Tue, 8 Aug 2023 14:56:40 -0700
Subject: [PATCH 1/8] draft

---
 .../04-api/01-delete-api-tutorial.ipynb | 938 ++++++++++++++++++
 1 file changed, 938 insertions(+)
 create mode 100644 examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb

diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
new file mode 100644
index 000000000000..256f751c734d
--- /dev/null
+++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "\n",
+ "\n",
+ "In working with data, Druid retains copies of the existing data segments in deep storage and on Historical processes. As new data is ingested, deep storage grows over time unless data is explicitly removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic, fault-tolerant design, data accumulation in deep storage can lead to increased storage costs over time. Periodically deleting data reclaims storage space and promotes optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid using the Coordinator API endpoints. 
\n", + "\n", + "## Table of contents\n", + "\n", + "- [Prerequisites](#Prerequisites)\n", + "- [Ingest data](#Ingest-data)\n", + "- [Deletion steps](#Deletion-steps)\n", + "- [Delete by time interval](#Delete-by-time-interval)\n", + "- [Delete entire table](#Delete-entire-table)\n", + "- [Delete by segment ID](#Delete-by-segment-ID)\n", + "\n", + "For the best experience, use JupyterLab so that you can always access the table of contents." + ] + }, + { + "cell_type": "markdown", + "id": "6fc260fc", + "metadata": {}, + "source": [ + "\n", + "## Prerequisites\n", + "\n", + "This tutorial works with Druid 26.0.0 or later.\n", + "\n", + "\n", + "Launch this tutorial and all prerequisites using the `druid-jupyter`, `kafka-jupyter`, or `all-services` profiles of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n", + "\n", + "If you do not use the Docker Compose environment, you need the following:\n", + "\n", + "* A running Druid instance.
\n",
+ " Update the `druid_host` variable to point to your Router endpoint. For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a distributed environment, you can point to other Druid services. In this tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the `druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health` endpoint to verify that the cluster is up and running. This endpoint returns `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, code cells update the `endpoint` and other variables to call a different Druid endpoint for each task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and supports deleting data by dropping segments. Before dropping any data, ingest the quickstart Wikipedia data with an indexing spec that creates hourly segments.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/druid/indexer/v1/task`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "051655c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, construct a JSON payload with the ingestion spec to create a `wikipedia_hour` datasource with hourly segmentation. There are many different [methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods) to ingest data; this tutorial uses [native batch ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html) and the `/druid/indexer/v1/task` endpoint. For more information on constructing an ingestion spec, see the [ingestion spec reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ff9d098", + "metadata": {}, + "outputs": [], + "source": [ + "payload = json.dumps({\n", + " \"type\": \"index_parallel\",\n", + " \"spec\": {\n", + " \"dataSchema\": {\n", + " \"dataSource\": \"wikipedia_hour\",\n", + " \"timestampSpec\": {\n", + " \"column\": \"time\",\n", + " \"format\": \"iso\"\n", + " },\n", + " \"dimensionsSpec\": {\n", + " \"useSchemaDiscovery\": True\n", + " },\n", + " \"metricsSpec\": [],\n", + " \"granularitySpec\": {\n", + " \"type\": \"uniform\",\n", + " \"segmentGranularity\": \"hour\",\n", + " \"queryGranularity\": \"none\",\n", + " \"intervals\": [\n", + " \"2015-09-12/2015-09-13\"\n", + " ],\n", + " \"rollup\": False\n", + " }\n", + " },\n", + " \"ioConfig\": {\n", + " \"type\": \"index_parallel\",\n", + " \"inputSource\": {\n", + " \"type\": \"local\",\n", + " \"baseDir\": \"quickstart/tutorial/\",\n", + " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n", + " },\n", + " \"inputFormat\": {\n", + " \"type\": \"json\"\n", + " },\n", + " \"appendToExisting\": False\n", + " },\n", + " \"tuningConfig\": {\n", + " \"type\": \"index_parallel\",\n", + " \"maxRowsPerSegment\": 5000000,\n", + " \"maxRowsInMemory\": 25000\n", + " }\n", + " }\n", + "})\n", + "\n", + "headers = {\n", + " 'Content-Type': 'application/json'\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "1cf78bb7", + "metadata": {}, + "source": [ + "With the payload and headers ready, run the next cell to send a `POST` request to the endpoint." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
+ " \n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid is populated with a segment for each segment interval that contains data. Since the `wikipedia_hour` datasource was ingested with `HOUR` granularity, there are 22 segments associated with `wikipedia_hour`: one for each hour of the day that contains data. (Two hours of the sample data contain no events.) \n",
+ "\n",
+ "For demonstration, let's view the segments generated for the `wikipedia_hour` datasource before any deletion is made. Run the following cell to set the endpoint to `/druid/v2/sql`. For more information on this endpoint, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "Using this endpoint, you can query the `sys` [metadata table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "956abeee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "701550dd",
+ "metadata": {},
+ "source": [
+ "Now, you can query the metadata table to retrieve segment information. The following cell sends a SQL query to retrieve the `segment_id` values for the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to `objectLines`, which formats the response with newlines and makes the output easier to parse."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb54a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ " \n",
+ "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there are 22 `segment_id` values, each containing the datasource name `wikipedia_hour` along with the start and end of its hour-long interval. The tail end of each ID is the version timestamp recorded when the data was ingested. \n",
+ "\n",
+ "For this tutorial, we are concerned with the start and end interval of each `segment_id`. \n",
+ "\n",
+ "For example: \n",
+ "`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}` indicates this segment contains data from `2015-09-12T00:00:00.000Z` to `2015-09-12T01:00:00.000Z`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca79f5f9",
+ "metadata": {},
+ "source": [
+ "## Deletion steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6cd1c8c",
+ "metadata": {},
+ "source": [
+ "Permanent deletion of a segment in Apache Druid has two steps:\n",
+ "\n",
+ "1. A segment is marked as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. Marking a segment as \"unused\" is a soft delete: the segment is no longer available for querying, but the segment files remain in deep storage and the segment records remain in the metadata store. 
\n",
+ "2. A kill task is sent to permanently remove \"unused\" segments. This deletes the segment files from deep storage and removes their records from the metadata store. This is a hard delete: the data is unrecoverable unless you have a backup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9bc7f00",
+ "metadata": {},
+ "source": [
+ "## Delete by time interval"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1040bdaf",
+ "metadata": {},
+ "source": [
+ "Segments can be deleted within a specified time interval. This begins with marking all segments in the interval as \"unused\", then sending a kill task to delete them permanently from deep storage.\n",
+ "\n",
+ "First, set the endpoint variable to the Coordinator API endpoint `/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the datasource ingested is `wikipedia_hour`, let's specify that in the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9db8786d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "863576a9",
+ "metadata": {},
+ "source": [
+ "The following cell constructs a JSON payload with the interval of segments to be deleted. It marks the segments from `18:00:00.000` (inclusive) to `20:00:00.000` (exclusive) as \"unused.\" The payload is sent to the endpoint in a `POST` request."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79387e72", + "metadata": {}, + "outputs": [], + "source": [ + "payload = json.dumps({\n", + " \"interval\": \"2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z\"\n", + "})\n", + "headers = {\n", + " 'Content-Type': 'application/json'\n", + "}\n", + "\n", + "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "89e2fcb4", + "metadata": {}, + "source": [ + "The response from the above cell should return a JSON object with the property `\"numChangedSegments\"` and the value `2`. This refers to the following segments:\n", + "\n", + "* `{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z\"}`\n", + "* `{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z\"}`" + ] + }, + { + "cell_type": "markdown", + "id": "e61cae23", + "metadata": {}, + "source": [ + "Next, verify that the segments have been soft deleted. The following cell sets the endpoint variable to `/druid/v2/sql` and sends a `POST` request querying for the existing `segment_id`s. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea7c0d26", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "payload = json.dumps({\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "})\n", + "headers = {\n", + " 'Content-Type': 'application/json'\n", + "}\n", + "\n", + "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "747bd12c", + "metadata": {}, + "source": [ + "Observe the response above. 
There should now be only 20 segments: the two segments marked \"unused\" have been soft deleted. \n",
+ "\n",
+ "However, because you've only soft deleted the segments, their files remain in deep storage.\n",
+ "\n",
+ "Before permanently deleting the segments, let's observe what this looks like in deep storage. This step is optional; you can move on to the next set of cells without completing it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "943b36cc",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid outside the Docker Compose environment, follow these instructions to list the segments in deep storage:\n",
+ " \n",
+ "* Navigate to the distribution directory for Druid; this is the same place where you run `./bin/start-druid` to start up Druid.\n",
+ "* Run this command: `ls -l1 var/druid/segments/wikipedia_hour/`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8ecedcaa",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid within the Docker Compose environment, follow these instructions to list the segments in deep storage:\n",
+ "\n",
+ "* Navigate to your Docker terminal.\n",
+ "* Run this command: `docker exec -it historical ls /opt/shared/segments/wikipedia_hour`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12737023",
+ "metadata": {},
+ "source": [
+ "The output should look similar to this:\n",
+ "\n",
+ "```bash\n",
+ "$ ls -l1 var/druid/segments/wikipedia_hour/\n",
+ "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n",
+ "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n",
+ "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n",
+ "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n",
+ "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n",
+ "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n",
+ "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n",
+ "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n",
+ "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n",
+
"2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n", + "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n", + "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n", + "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n", + "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n", + "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n", + "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n", + "2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z\n", + "2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z\n", + "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n", + "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n", + "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n", + "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "38cca397", + "metadata": {}, + "source": [ + "Now, you can move onto sending a kill task to permanently delete the segments from deep storage. This can be done with the `/druid/coordinator/v1/datasources/:dataSource/intervals/:interval` endpoint.\n", + "\n", + "The following cell uses the endpoint, setting the `dataSource` path parameter as `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. \n", + "\n", + "Notice that the interval is set to `2015-09-12_2015-09-13` which covers the entirety of the 22 segments. Druid will only permanently delete the \"unused\" segments within this interval. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "672ad739", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n", + "endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "751380d5", + "metadata": {}, + "source": [ + "Run the next cell to send the `DELETE` request." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a6bdc6c", + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.request(\"DELETE\", endpoint)\n", + "print(response.status_code)" + ] + }, + { + "cell_type": "markdown", + "id": "69d6e89a", + "metadata": {}, + "source": [ + "Last, observe that the segments have been deleted from deep storage in the following sample output. " + ] + }, + { + "cell_type": "markdown", + "id": "84ee47fd", + "metadata": {}, + "source": [ + "```bash\n", + "$ ls -l1 var/druid/segments/wikipedia_hour/\n", + "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n", + "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n", + "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n", + "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n", + "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n", + "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n", + "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n", + "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n", + "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n", + "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n", + "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n", + "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n", + "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n", + "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n", + "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n", + "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n", + "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n", + "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n", + "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n", + "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "f0d0578c", + "metadata": {}, + "source": [ + "## Delete entire table" + ] + }, + { + "cell_type": "markdown", + "id": "b8d0260a", + "metadata": {}, + "source": [ + "You can delete entire tables the same way you can delete 
parts of a table, using intervals.\n",
+ "\n",
+ "Run the following cell to reset the endpoint to `/druid/coordinator/v1/datasources/:dataSource/markUnused`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd354886",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed3cf7d3",
+ "metadata": {},
+ "source": [
+ "Next, send a `POST` with the payload `{\"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"}` to mark the entirety of the table as \"unused.\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "25639752",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ "\n",
+ "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
+ "\n",
+ "print(response.status_code)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbbba823",
+ "metadata": {},
+ "source": [
+ "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eac1db4c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ "\n",
+ "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "deae8727",
+ "metadata": {},
+ "source": [
+ "Run the next cells to view the response. 
You should see that `response.text` returns nothing, while `response.status_code` returns a 200. \n",
+ "\n",
+ "The query would normally return the remaining segments, but since every segment in the table has been marked \"unused,\" there are none to return."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12c11291",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(response.text)\n",
+ "print(response.status_code)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2e9a74da",
+ "metadata": {},
+ "source": [
+ "So far, you've only soft deleted the table. Run the following cells to permanently delete it from deep storage:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c3d7ec9",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "98834167",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"DELETE\", endpoint)\n",
+ "print(response.status_code)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b2d59c8",
+ "metadata": {},
+ "source": [
+ "## Delete by segment ID"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a4e8453e",
+ "metadata": {},
+ "source": [
+ "In addition to deleting by interval, you can delete segments by using `segment_id`. Let's load in some new data to work with.\n",
+ "\n",
+ "Run the following cell to ingest a new set of data for `wikipedia_hour`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f512f4e", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/indexer/v1/task'\n", + "payload = json.dumps({\n", + " \"type\": \"index_parallel\",\n", + " \"spec\": {\n", + " \"dataSchema\": {\n", + " \"dataSource\": \"wikipedia_hour\",\n", + " \"timestampSpec\": {\n", + " \"column\": \"time\",\n", + " \"format\": \"iso\"\n", + " },\n", + " \"dimensionsSpec\": {\n", + " \"useSchemaDiscovery\": True\n", + " },\n", + " \"metricsSpec\": [],\n", + " \"granularitySpec\": {\n", + " \"type\": \"uniform\",\n", + " \"segmentGranularity\": \"hour\",\n", + " \"queryGranularity\": \"none\",\n", + " \"intervals\": [\n", + " \"2015-09-12/2015-09-13\"\n", + " ],\n", + " \"rollup\": False\n", + " }\n", + " },\n", + " \"ioConfig\": {\n", + " \"type\": \"index_parallel\",\n", + " \"inputSource\": {\n", + " \"type\": \"local\",\n", + " \"baseDir\": \"quickstart/tutorial/\",\n", + " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n", + " },\n", + " \"inputFormat\": {\n", + " \"type\": \"json\"\n", + " },\n", + " \"appendToExisting\": False\n", + " },\n", + " \"tuningConfig\": {\n", + " \"type\": \"index_parallel\",\n", + " \"maxRowsPerSegment\": 5000000,\n", + " \"maxRowsInMemory\": 25000\n", + " }\n", + " }\n", + "})\n", + "\n", + "headers = {\n", + " 'Content-Type': 'application/json'\n", + "}\n", + "\n", + "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + " \n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "e51f6b54", + "metadata": {}, + "source": [ + "Now that you have a brand new datasource to work with, let's view the segment information for it.\n", + "\n", + "Run the next cell to retrieve the `segment_id` of each segment." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7f4e4ed7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ "\n",
+ "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "017d63e9",
+ "metadata": {},
+ "source": [
+ "With the `segment_id` values known, you can mark specific segments \"unused\" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e320037a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "16861873",
+ "metadata": {},
+ "source": [
+ "In the next cell, construct a payload with a `segmentIds` property containing an array of `segment_id` values. The payload marks the segments covering the intervals `02:00:00.000` to `03:00:00.000` and `06:00:00.000` to `07:00:00.000` as \"unused.\"\n",
+ "\n",
+ "Fill in the `segmentIds` array with the `segment_id` values corresponding to these intervals, then run the cell."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1cec240", + "metadata": {}, + "outputs": [], + "source": [ + "payload = json.dumps({\n", + " \"segmentIds\": [\n", + " \"\",\n", + " \"\"\n", + " ]\n", + "})\n", + "headers = {\n", + " 'Content-Type': 'application/json'\n", + "}\n", + "\n", + "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "cda3213c", + "metadata": {}, + "source": [ + "You should see a response with the `numChangedSegments` property and the value `2` for the two segments marked as \"unused.\"\n", + "\n", + "Run the cell below to view changes in the datasource's segments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9301e6df", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", + "payload = json.dumps({\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "})\n", + "headers = {\n", + " 'Content-Type': 'application/json'\n", + "}\n", + "\n", + "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "\n", + "print(response.text)" + ] + }, + { + "cell_type": "markdown", + "id": "671521b6", + "metadata": {}, + "source": [ + "Last, run the following cells to permanently delete the segments from deep storage." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9dd59374",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f5f31f08",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"DELETE\", endpoint)\n",
+ "print(response.status_code)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From 59d119cc3e1b7e688606ec06462cdc5d58f06510 Mon Sep 17 00:00:00 2001
From: demo-kratia <56242907+demo-kratia@users.noreply.github.com>
Date: Tue, 8 Aug 2023 16:13:47 -0700
Subject: [PATCH 2/8] Add learn more and conclusion

---
 .../04-api/01-delete-api-tutorial.ipynb | 89 +++++++++++++++++--
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
index 256f751c734d..50f475c512d0 100644
--- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
+++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
@@ -40,6 +40,8 @@
 "- [Delete by time interval](#Delete-by-time-interval)\n",
 "- [Delete entire table](#Delete-entire-table)\n",
 "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "- [Conclusion](#Conclusion)\n",
+ "- [Learn more](#Learn-more)\n",
 "\n",
 "For the best experience, use JupyterLab so 
that you can always access the table of contents." ] @@ -91,7 +93,7 @@ "# service running at \"http://router:8888\".\n", "# If you are not using the Docker Compose environment, edit the `druid_host`.\n", "\n", - "druid_host = \"http://host.docker.internal:8888\"\n", + "druid_host = \"http://router:8888\"\n", "druid_host" ] }, @@ -712,10 +714,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 224, "id": "5f512f4e", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"task\":\"index_parallel_wikipedia_hour_kphccjgh_2023-08-08T22:04:06.269Z\"}\n" + ] + } + ], "source": [ "endpoint = druid_host + '/druid/indexer/v1/task'\n", "payload = json.dumps({\n", @@ -782,10 +792,41 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 225, "id": "7f4e4ed7", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + 
"{\"segment_id\":\"wikipedia_hour_2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "{\"segment_id\":\"wikipedia_hour_2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", + "\n", + "\n" + ] + } + ], "source": [ "endpoint = druid_host + '/druid/v2/sql'\n", "payload = json.dumps({\n", @@ -912,6 +953,44 @@ "response = requests.request(\"DELETE\", endpoint, headers=headers, data=payload)\n", "print(response.status_code)" ] + }, + { + "cell_type": "markdown", + "id": "998682dc", + "metadata": {}, + "source": [ + "## 
Conclusion\n", + "\n", + "In this tutorial, you learned how to mark data segments as \"unused\" for soft deletion and send a kill task to permanent deletion from deep storage. \n", + "\n", + "By working through this tutorial, you learned two different methods:\n", + "1. Deletion by time interval\n", + "2. Deletion by segment ID\n", + "\n", + "With the knowledge gained here, you can now efficiently manage storage allocation. Go forth and Druid!" + ] + }, + { + "cell_type": "markdown", + "id": "048a8a1f", + "metadata": {}, + "source": [ + "## Learn more\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "0bc13660", + "metadata": {}, + "source": [ + "To dive deeper into concepts associated with data deletion, see the following:\n", + "\n", + "* [Deep storage](https://druid.apache.org/docs/latest/dependencies/deep-storage.html)\n", + "* [Segments](https://druid.apache.org/docs/latest/design/segments.html)\n", + "* [Data management](https://druid.apache.org/docs/latest/data-management/index.html)\n", + "* [API references](https://druid.apache.org/docs/latest/operations/api-reference.html)" + ] } ], "metadata": { From 6c6ade2ecb969c575e960f081cf45ce8e3c97c2c Mon Sep 17 00:00:00 2001 From: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Date: Tue, 8 Aug 2023 16:17:06 -0700 Subject: [PATCH 3/8] clear cell output --- .../04-api/01-delete-api-tutorial.ipynb | 47 ++----------------- 1 file changed, 4 insertions(+), 43 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb index 50f475c512d0..4d0136b20bf2 100644 --- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb @@ -714,18 +714,10 @@ }, { "cell_type": "code", - "execution_count": 224, + "execution_count": null, "id": "5f512f4e", "metadata": 
{}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{\"task\":\"index_parallel_wikipedia_hour_kphccjgh_2023-08-08T22:04:06.269Z\"}\n" - ] - } - ], + "outputs": [], "source": [ "endpoint = druid_host + '/druid/indexer/v1/task'\n", "payload = json.dumps({\n", @@ -792,41 +784,10 @@ }, { "cell_type": "code", - "execution_count": 225, + "execution_count": null, "id": "7f4e4ed7", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - 
"{\"segment_id\":\"wikipedia_hour_2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "{\"segment_id\":\"wikipedia_hour_2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z_2023-08-08T17:49:07.610Z\"}\n", - "\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "endpoint = druid_host + '/druid/v2/sql'\n", "payload = json.dumps({\n", From 831668175f6887f40423bcb3fed52bd412656f30 Mon Sep 17 00:00:00 2001 From: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Date: Wed, 9 Aug 2023 14:03:00 -0700 Subject: [PATCH 4/8] update to sql-based ingestion, change requests to session.post --- .../04-api/01-delete-api-tutorial.ipynb | 331 +++++++++--------- 1 file changed, 162 insertions(+), 169 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb index 4d0136b20bf2..5c305c47aaf5 100644 --- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb 
+++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb @@ -28,7 +28,7 @@ "\n", "In working with data, Druid retains a copies of the existing data segments in deep storage and Historical processes. As new data is added into Druid, deep storage grows and becomes larger over time unless explicitly removed.\n", "\n", - "While deep storage is an important part of Druid's elastic, fault-tolerant design, over time, data accumulation in deep storage can lead to increased storage costs. Periodically deleting data can reclaim storage space and promote optimal resource allocation.\n", + "While deep storage is an important part of Druid's elastic, fault-tolerant design, data accumulation over time in deep storage can lead to increased storage costs. Periodically deleting data can reclaim storage space and promote optimal resource allocation.\n", "\n", "This notebook provides a tutorial on deleting existing data in Druid using the Coordinator API endpoints. \n", "\n", @@ -97,12 +97,30 @@ "druid_host" ] }, + { + "cell_type": "markdown", + "id": "e429b61e", + "metadata": {}, + "source": [ + "If your cluster is secure, you'll need to provide authorization information on each request. You can automate it by using the Requests `session` feature. Although this tutorial assumes no authorization, the configuration below defines a session as an example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfa75fc5", + "metadata": {}, + "outputs": [], + "source": [ + "session = requests.Session()" + ] + }, { "cell_type": "markdown", "id": "6f3c9a92", "metadata": {}, "source": [ - "Before we proceed with the tutorial, let's use the `/status/health` endpoint to verify that the cluster if up and running. This endpoint returns the Python value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`." 
+ "Before proceeding with the tutorial, use the `/status/health` endpoint to verify that the cluster is up and running. This endpoint returns the value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`."
 ]
 },
 {
@@ -132,19 +150,19 @@
 "source": [
 "## Ingest data\n",
 "\n",
- "Apache Druid stores data partitioned by time chunks into segments and supports deleting data by dropping segments. Before dropping data, we will use the quickstart Wikipedia data ingested with an indexing spec that creates hourly segments.\n",
+ "Apache Druid stores data partitioned by time chunks into segments and supports deleting data by dropping segments. To start, ingest the quickstart Wikipedia data and partition it by hour to create multiple segments.\n",
 "\n",
- "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+ "First, set the endpoint to the `sql/task` endpoint."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "051655c9",
+ "id": "aa1e227f",
 "metadata": {},
 "outputs": [],
 "source": [
- "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint = druid_host + '/druid/v2/sql/task'\n",
 "endpoint"
 ]
 },
@@ -153,62 +171,51 @@
 "id": "02e4f551",
 "metadata": {},
 "source": [
- "Next, construct a JSON payload with the ingestion specs to create a `wikipedia_hour` datasource with hour segmentation. There are many different [methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods) to ingest data, this tutorial uses [native batch ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html) and the `/druid/indexer/v1/task` endpoint. For more information on construction an ingestion spec, see [ingestion spec reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+ "Next, use the multi-stage query (MSQ) task engine and its `sql/task` endpoint to perform SQL-based ingestion and create a `wikipedia_hour` datasource with hour segmentation. \n", + "\n", + "To learn more about SQL-based ingestion, see [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). For information about the endpoint specifically, see [SQL-based ingestion and multi-stage query task API](https://druid.apache.org/docs/latest/multi-stage-query/api.html)." ] }, { "cell_type": "code", "execution_count": null, - "id": "9ff9d098", + "id": "1208f3ac", "metadata": {}, "outputs": [], "source": [ - "payload = json.dumps({\n", - " \"type\": \"index_parallel\",\n", - " \"spec\": {\n", - " \"dataSchema\": {\n", - " \"dataSource\": \"wikipedia_hour\",\n", - " \"timestampSpec\": {\n", - " \"column\": \"time\",\n", - " \"format\": \"iso\"\n", - " },\n", - " \"dimensionsSpec\": {\n", - " \"useSchemaDiscovery\": True\n", - " },\n", - " \"metricsSpec\": [],\n", - " \"granularitySpec\": {\n", - " \"type\": \"uniform\",\n", - " \"segmentGranularity\": \"hour\",\n", - " \"queryGranularity\": \"none\",\n", - " \"intervals\": [\n", - " \"2015-09-12/2015-09-13\"\n", - " ],\n", - " \"rollup\": False\n", - " }\n", - " },\n", - " \"ioConfig\": {\n", - " \"type\": \"index_parallel\",\n", - " \"inputSource\": {\n", - " \"type\": \"local\",\n", - " \"baseDir\": \"quickstart/tutorial/\",\n", - " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n", - " },\n", - " \"inputFormat\": {\n", - " \"type\": \"json\"\n", - " },\n", - " \"appendToExisting\": False\n", - " },\n", - " \"tuningConfig\": {\n", - " \"type\": \"index_parallel\",\n", - " \"maxRowsPerSegment\": 5000000,\n", - " \"maxRowsInMemory\": 25000\n", - " }\n", - " }\n", - "})\n", - "\n", - "headers = {\n", - " 'Content-Type': 'application/json'\n", - "}" + "sql = '''\n", + "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n", + "WITH \"ext\" AS (SELECT *\n", + "FROM TABLE(\n", + " EXTERN(\n", + " 
'{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR, \"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR, \"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\" VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR, \"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"time\") AS \"__time\",\n",
+ " \"channel\",\n",
+ " \"cityName\",\n",
+ " \"comment\",\n",
+ " \"countryIsoCode\",\n",
+ " \"countryName\",\n",
+ " \"isAnonymous\",\n",
+ " \"isMinor\",\n",
+ " \"isNew\",\n",
+ " \"isRobot\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"metroCode\",\n",
+ " \"namespace\",\n",
+ " \"page\",\n",
+ " \"regionIsoCode\",\n",
+ " \"regionName\",\n",
+ " \"user\",\n",
+ " \"delta\",\n",
+ " \"added\",\n",
+ " \"deleted\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY HOUR\n",
+ "'''"
 ]
 },
 {
@@ -216,7 +223,7 @@
 "id": "1cf78bb7",
 "metadata": {},
 "source": [
- "With the payload and headers ready, run the next cell to send a `POST` request to the endpoint."
+ "The following cell builds up a Python map that represents the Druid `SqlRequest` object."
 ]
 },
 {
@@ -226,9 +233,30 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
- " \n",
- "print(response.text)"
+ "sql_request = {\n",
+ " 'query': sql\n",
+ "}"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "8312c9d4",
+ "metadata": {},
+ "source": [
+ "With the SQL request ready, use the `json` parameter to the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. 
The result is a Requests `Response` which is saved in a variable.\n", + "\n", + "Now, run the next cell to start the ingestion. You will see an asterisk `[*]` in the left margin while the task runs. It takes a while for Druid to load the resulting segments. Wait for the table to become ready." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9540926f", + "metadata": {}, + "outputs": [], + "source": [ + "response = session.post(endpoint, json=sql_request)\n", + "response.status_code" ] }, { @@ -236,7 +264,7 @@ "id": "cab33e7e", "metadata": {}, "source": [ - "Once the data has been ingested, Druid will be populated with segments for each segment interval that contains data. Since the `wikipedia_hour` was ingested with `HOUR` granularity, there will be 24 segments associated with `wikipedia_hour`. \n", + "Once the data has been ingested, Druid is populated with segments for each segment interval that contains data. You should see 24 segments associated with `wikipedia_hour`. \n", "\n", "For demonstration, let's view the segments generated for the `wikipedia_hour` datasource before any deletion is made. Run the following cell to set the endpoint to `/druid/v2/sql/`. For more information on this endpoint, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n", "\n", @@ -269,15 +297,12 @@ "metadata": {}, "outputs": [], "source": [ - "payload = json.dumps({\n", + "sql_request = {\n", " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", " \"resultFormat\": \"objectLines\"\n", - "})\n", - "headers = {\n", - " 'Content-Type': 'application/json'\n", "}\n", - " \n", - "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", "\n", "print(response.text)" ] @@ -287,7 +312,7 @@ "id": "f06e24e5", "metadata": {}, "source": [ - "Observe the response retrieved from the previous cell. 
In total, there are 24 `segment_id`, each containing the datasource name `wikipedia_hour`, along with the start and end hour interval. The tail end of the ID also contains the timestamp of when the request was made. \n", + "Observe the response retrieved from the previous cell. In total, there are 24 `segment_id` records, each containing the datasource name `wikipedia_hour`, along with the start and end hour interval. The tail end of the ID also contains the timestamp of when the request was made. \n", "\n", "For this tutorial, we are concerned with observing the start and end interval for each `segment_id`. \n", "\n", @@ -308,10 +333,10 @@ "id": "b6cd1c8c", "metadata": {}, "source": [ - "Permanent deletion of a segment in Apache Druid has two steps:\n", + "Permanent deletion of a segment in Druid has two steps:\n", "\n", - "1. A segment is marked as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. Note that marking a segment as \"unused\" is a soft delete, it is no longer available for querying but the segment files remain in deep storage and segment records remain in the metadata store. \n", - "2. A kill task is sent to permanently remove \"unused\" segments. This deletes the segment file from deep storage and removes its record from the metadata store. This is a hard delete: the data is unrecoverable unless you have a backup." + "1. Mark a segment as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. 
Note that marking a segment as \"unused\" is a soft delete: it is no longer available for querying, but the segment files remain in deep storage and segment records remain in the metadata store. \n",
+ "2. Send a kill task to permanently remove \"unused\" segments. This deletes the segment file from deep storage and removes its record from the metadata store. This is a hard delete: the data is unrecoverable unless you have a backup."
 ]
 },
 {
@@ -348,7 +373,7 @@
 "id": "863576a9",
 "metadata": {},
 "source": [
- "The following cell constructs a JSON payload with the interval of segments to be deleted. This will mark the intervals from `18:00:00.000` to `20:00:00.000` non-inclusive as \"unused.\" This payload is sent to the endpoint in a `POST` request."
+ "The following cell constructs a JSON payload with the interval of segments to be deleted. This marks the intervals from `18:00:00.000` to `20:00:00.000` non-inclusive as \"unused.\" This payload is sent to the endpoint in a `POST` request."
 ]
 },
 {
@@ -358,16 +383,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "payload = json.dumps({\n",
+ "sql_request = {\n",
 " \"interval\": \"2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z\"\n",
- "})\n",
- "headers = {\n",
- " 'Content-Type': 'application/json'\n",
 "}\n",
+ "response = session.post(endpoint, json=sql_request)\n",
 "\n",
- "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
- "\n",
- "print(response.text)"
+ "response.text"
 ]
 },
 {
@@ -397,17 +418,14 @@
 "outputs": [],
 "source": [
 "endpoint = druid_host + '/druid/v2/sql'\n",
- "payload = json.dumps({\n",
+ "sql_request = {\n",
 " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n",
 " \"resultFormat\": \"objectLines\"\n",
- "})\n",
- "headers = {\n",
- " 'Content-Type': 'application/json'\n",
 "}\n",
 "\n",
- "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
+ "response = session.post(endpoint, 
json=sql_request)\n",
 "\n",
- "print(response.text)"
+ "response.text"
 ]
 },
 {
@@ -419,7 +437,7 @@
 "\n",
 "However, as you've only soft deleted the segments, it remains in deep storage.\n",
 "\n",
- "Before permanently deleting the segments, let's observe how this can change in deep storage. This step is optional, you can move onto the next set of cells without completing this step."
+ "Before permanently deleting the segments, you can verify that they've only been soft deleted by inspecting your deep storage. The soft-deleted segments are still there. This step is optional; you can move on to the next set of cells without completing this step."
 ]
 },
 {
@@ -427,7 +445,7 @@
 "id": "943b36cc",
 "metadata": {},
 "source": [
- "[OPTIONAL] If you are running Druid externally from the Docker Compose environment, follow these instructions to retrieve segments from deep storage:\n",
+ "[OPTIONAL] If you are running Druid externally from the Docker Compose environment, follow these instructions to view segments in deep storage:\n",
 " \n",
 "* Navigate to the distribution directory for Druid, this is the same place where you run `./bin/start-druid` to start up Druid.\n",
 "* Run this command: `ls -l1 var/druid/segments/wikipedia-hour/`."
 ]
 },
 {
@@ -438,7 +456,7 @@
 "id": "8ecedcaa",
 "metadata": {},
 "source": [
- "[OPTIONAL] If you are running Druid within the Docker Compose environment, follow these instructions to retrieve segments from deep storage:\n",
+ "[OPTIONAL] If you are running Druid within the Docker Compose environment, follow these instructions to view the segments in deep storage:\n",
 "\n",
 "* Navigate to your Docker terminal.\n",
 "* Run this command: `docker exec -it historical ls /opt/shared/segments/wikipedia_hour`"
 ]
 },
@@ -487,7 +505,7 @@
 "\n",
 "The following cell uses the endpoint, setting the `dataSource` path parameter as `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. 
\n",
 "\n",
- "Notice that the interval is set to `2015-09-12_2015-09-13` which covers the entirety of the 22 segments. Druid will only permanently delete the \"unused\" segments within this interval. "
+ "Notice that the interval is set to `2015-09-12_2015-09-13` which covers the entirety of the 22 segments. Druid only permanently deletes the \"unused\" segments within this interval. "
 ]
 },
 {
@@ -516,7 +534,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "response = requests.request(\"DELETE\", endpoint)\n",
+ "response = session.delete(endpoint);\n",
 "print(response.status_code)"
 ]
 },
@@ -602,14 +620,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "payload = json.dumps({\n",
+ "sql_request = {\n",
 " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n",
- "})\n",
- "headers = {\n",
- " 'Content-Type': 'application/json'\n",
 "}\n",
 "\n",
- "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n",
+ "response = session.post(endpoint, json=sql_request);\n",
 "\n",
 "print(response.status_code)"
 ]
@@ -630,15 +645,12 @@
 "outputs": [],
 "source": [
 "endpoint = druid_host + '/druid/v2/sql'\n",
- "payload = json.dumps({\n",
+ "sql_request = {\n",
 " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n",
 " \"resultFormat\": \"objectLines\"\n",
- "})\n",
- "headers = {\n",
- " 'Content-Type': 'application/json'\n",
 "}\n",
 "\n",
- "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)"
+ "response = session.post(endpoint, json=sql_request);"
 ]
 },
 {
@@ -690,7 +702,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "response = requests.request(\"DELETE\", endpoint, headers=headers, data=payload)\n",
+ "response = session.delete(endpoint);\n",
 "print(response.status_code)"
 ]
 },
@@ -719,57 +731,48 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "endpoint = druid_host + '/druid/indexer/v1/task'\n",
- "payload = json.dumps({\n",
- " \"type\": \"index_parallel\",\n",
- "
\"spec\": {\n", - " \"dataSchema\": {\n", - " \"dataSource\": \"wikipedia_hour\",\n", - " \"timestampSpec\": {\n", - " \"column\": \"time\",\n", - " \"format\": \"iso\"\n", - " },\n", - " \"dimensionsSpec\": {\n", - " \"useSchemaDiscovery\": True\n", - " },\n", - " \"metricsSpec\": [],\n", - " \"granularitySpec\": {\n", - " \"type\": \"uniform\",\n", - " \"segmentGranularity\": \"hour\",\n", - " \"queryGranularity\": \"none\",\n", - " \"intervals\": [\n", - " \"2015-09-12/2015-09-13\"\n", - " ],\n", - " \"rollup\": False\n", - " }\n", - " },\n", - " \"ioConfig\": {\n", - " \"type\": \"index_parallel\",\n", - " \"inputSource\": {\n", - " \"type\": \"local\",\n", - " \"baseDir\": \"quickstart/tutorial/\",\n", - " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n", - " },\n", - " \"inputFormat\": {\n", - " \"type\": \"json\"\n", - " },\n", - " \"appendToExisting\": False\n", - " },\n", - " \"tuningConfig\": {\n", - " \"type\": \"index_parallel\",\n", - " \"maxRowsPerSegment\": 5000000,\n", - " \"maxRowsInMemory\": 25000\n", - " }\n", - " }\n", - "})\n", - "\n", - "headers = {\n", - " 'Content-Type': 'application/json'\n", + "endpoint = druid_host + '/druid/v2/sql/task'\n", + "sql = '''\n", + "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n", + "WITH \"ext\" AS (SELECT *\n", + "FROM TABLE(\n", + " EXTERN(\n", + " '{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n", + " '{\"type\":\"json\"}'\n", + " )\n", + ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR, \"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR, \"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\" VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR, \"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n", + "SELECT\n", + " TIME_PARSE(\"time\") AS 
\"__time\",\n", + " \"channel\",\n", + " \"cityName\",\n", + " \"comment\",\n", + " \"countryIsoCode\",\n", + " \"countryName\",\n", + " \"isAnonymous\",\n", + " \"isMinor\",\n", + " \"isNew\",\n", + " \"isRobot\",\n", + " \"isUnpatrolled\",\n", + " \"metroCode\",\n", + " \"namespace\",\n", + " \"page\",\n", + " \"regionIsoCode\",\n", + " \"regionName\",\n", + " \"user\",\n", + " \"delta\",\n", + " \"added\",\n", + " \"deleted\"\n", + "FROM \"ext\"\n", + "PARTITIONED BY HOUR\n", + "'''\n", + "\n", + "\n", + "sql_request = {\n", + " 'query': sql\n", "}\n", "\n", - "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", - " \n", - "print(response.text)" + "response = session.post(endpoint, json=sql_request)\n", + "response.status_code" ] }, { @@ -790,16 +793,12 @@ "outputs": [], "source": [ "endpoint = druid_host + '/druid/v2/sql'\n", - "payload = json.dumps({\n", + "sql_request = {\n", " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", " \"resultFormat\": \"objectLines\"\n", - "})\n", - "headers = {\n", - " 'Content-Type': 'application/json'\n", "}\n", "\n", - "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", - "\n", + "response = session.post(endpoint, json=sql_request)\n", "print(response.text)" ] }, @@ -839,19 +838,16 @@ "metadata": {}, "outputs": [], "source": [ - "payload = json.dumps({\n", + "sql_request = {\n", " \"segmentIds\": [\n", - " \"\",\n", - " \"\"\n", + " \"wikipedia_hour_2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z_2023-08-09T20:47:22.402Z\",\n", + " \"wikipedia_hour_2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z_2023-08-09T20:47:22.402Z\"\n", " ]\n", - "})\n", - "headers = {\n", - " 'Content-Type': 'application/json'\n", "}\n", "\n", - "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "response = session.post(endpoint, json=sql_request)\n", "\n", - "print(response.text)" + "response.text" ] }, 
{ @@ -872,15 +868,12 @@ "outputs": [], "source": [ "endpoint = druid_host + '/druid/v2/sql'\n", - "payload = json.dumps({\n", + "sql_request = {\n", " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", " \"resultFormat\": \"objectLines\"\n", - "})\n", - "headers = {\n", - " 'Content-Type': 'application/json'\n", "}\n", "\n", - "response = requests.request(\"POST\", endpoint, headers=headers, data=payload)\n", + "response = session.post(endpoint, json=sql_request)\n", "\n", "print(response.text)" ] @@ -911,8 +904,8 @@ "metadata": {}, "outputs": [], "source": [ - "response = requests.request(\"DELETE\", endpoint, headers=headers, data=payload)\n", - "print(response.status_code)" + "response = session.delete(endpoint)\n", + "response.status_code" ] }, { From 7da612310fc4e838c778341734c1fa23a91d7bb4 Mon Sep 17 00:00:00 2001 From: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Date: Wed, 9 Aug 2023 21:27:19 -0700 Subject: [PATCH 5/8] typo --- .../04-api/01-delete-api-tutorial.ipynb | 32 +++++++++++-------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb index 5c305c47aaf5..1963b0ee0711 100644 --- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb @@ -131,8 +131,8 @@ "outputs": [], "source": [ "endpoint = druid_host + '/status/health'\n", - "response = requests.request(\"GET\", endpoint)\n", - "print(response.text)" + "response = session.get(endpoint)\n", + "response.text" ] }, { @@ -266,7 +266,7 @@ "source": [ "Once the data has been ingested, Druid is populated with segments for each segment interval that contains data. You should see 24 segments associated with `wikipedia_hour`. 
\n", "\n", - "For demonstration, let's view the segments generated for the `wikipedia_hour` datasource before any deletion is made. Run the following cell to set the endpoint to `/druid/v2/sql/`. For more information on this endpoint, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n", + "For demonstration, let's view the segments generated for the `wikipedia_hour` datasource before any deletion is made. Run the following cell to set the endpoint to `/druid/v2/sql`. For more information on this endpoint, see [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n", "\n", "Using this endpoint, you can query the `sys` [metadata table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)." ] @@ -425,7 +425,7 @@ "\n", "response = session.post(endpoint, json=sql_request)\n", "\n", - "response.text" + "print(response.text)" ] }, { @@ -448,7 +448,7 @@ "[OPTIONAL] If you are running Druid externally from the Docker Compose environment, follow these instructions to view segments in deep storage:\n", " \n", "* Navigate to the distribution directory for Druid, this is the same place where you run `./bin/start-druid` to start up Druid.\n", - "* Run this command: `ls -l1 var/druid/segments/wikipedia-hour/`." + "* Run this command: `ls -l1 var/druid/segments/wikipedia_hour`." 
] }, { @@ -472,9 +472,11 @@ "```bash\n", "$ ls -l1 var/druid/segments/wikipedia_hour/\n", "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n", + "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n", "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n", "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n", "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n", + "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n", "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n", "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n", "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n", @@ -554,9 +556,11 @@ "```bash\n", "$ ls -l1 var/druid/segments/wikipedia_hour/\n", "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n", + "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n", "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n", "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n", "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n", + "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n", "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n", "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n", "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n", @@ -626,7 +630,7 @@ "\n", "response = session.post(endpoint, json=sql_request);\n", "\n", - "print(response.status_code)" + "response.status_code" ] }, { @@ -670,8 +674,8 @@ "metadata": {}, "outputs": [], "source": [ - "print(response.text)\n", - "print(response.status_code)" + "response.text\n", + "response.status_code" ] }, { @@ -703,7 +707,7 @@ "outputs": [], "source": [ "response = session.delete(endpoint);\n", - "print(response.status_code)" + "response.status_code" ] }, { @@ -840,8 +844,8 @@ "source": [ "sql_request = {\n", " \"segmentIds\": [\n", - " \"wikipedia_hour_2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z_2023-08-09T20:47:22.402Z\",\n", - " \"wikipedia_hour_2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z_2023-08-09T20:47:22.402Z\"\n", + " \"\",\n", + " \"\"\n", " ]\n", "}\n", "\n", @@ -915,13 
+919,13 @@ "source": [ "## Conclusion\n", "\n", - "In this tutorial, you learned how to mark data segments as \"unused\" for soft deletion and send a kill task to permanent deletion from deep storage. \n", + "In this tutorial, you learned how to mark data segments as \"unused\" for soft deletion and send a kill task to permanently delete segment from deep storage. \n", "\n", - "By working through this tutorial, you learned two different methods:\n", + "You learned two different methods for marking segments as \"unused\": \n", "1. Deletion by time interval\n", "2. Deletion by segment ID\n", "\n", - "With the knowledge gained here, you can now efficiently manage storage allocation. Go forth and Druid!" + "With the knowledge gained from this tutorial, you can now efficiently manage storage allocation. Go forth and Druid!" ] }, { From 85474cf26380593479af7ea337267acaec9c14aa Mon Sep 17 00:00:00 2001 From: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Date: Mon, 14 Aug 2023 20:50:19 -0700 Subject: [PATCH 6/8] feedback --- .../04-api/01-delete-api-tutorial.ipynb | 69 ++++++++++++++++--- 1 file changed, 58 insertions(+), 11 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb index 1963b0ee0711..954300b3b1a0 100644 --- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb @@ -184,7 +184,7 @@ "outputs": [], "source": [ "sql = '''\n", - "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n", + "REPLACE INTO \"wikipedia_hour4\" OVERWRITE ALL\n", "WITH \"ext\" AS (SELECT *\n", "FROM TABLE(\n", " EXTERN(\n", @@ -245,7 +245,7 @@ "source": [ "With the SQL request ready, use the the `json` parameter to the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. 
The result is a Requests `Response` which is saved in a variable.\n",
     "\n",
-    "Now, run the next cell to start the ingestion. You will see an asterisk `[*]` in the left margin while the task runs. It takes a while for Druid to load the resulting segments. Wait for the table to become ready."
+    "Now, run the next cell to start the ingestion."
    ]
   },
   {
@@ -259,6 +259,45 @@
     "response.status_code"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e79ec7a3-7924-4397-b032-f21a6fa1873f",
+   "metadata": {},
+   "source": [
+    "It takes a while for Druid to load the resulting segments. Run the following cell and wait for the ingestion status to display \"The ingestion is complete\" before moving on."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d9776919-5240-4dde-9a09-a4bf76ae9a44",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "json = response.json()\n",
+    "ingestion_taskId = json['taskId']\n",
+    "\n",
+    "endpoint = druid_host + f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+    "json = session.get(endpoint).json()\n",
+    "\n",
+    "ingestion_status = json['status']['status']\n",
+    "\n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "    print(\"The ingestion is running...\")\n",
+    "\n",
+    "while ingestion_status == \"RUNNING\":\n",
+    "    time.sleep(5) # poll every 5 seconds until the task leaves the RUNNING state\n",
+    "    json = session.get(endpoint).json()\n",
+    "    ingestion_status = json['status']['status']\n",
+    "\n",
+    "if ingestion_status == \"SUCCESS\":\n",
+    "    print(\"The ingestion is complete.\")\n",
+    "else:\n",
+    "    print(\"The ingestion task failed:\", json)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "cab33e7e",
    "metadata": {},
@@ -435,7 +474,7 @@
    "source": [
     "Observe the response above. There should now be only 22 segments, and the \"unused\" segments have been soft deleted. 
\n", "\n", - "However, as you've only soft deleted the segments, it remains in deep storage.\n", + "However, as you've only soft deleted the segments, they remain in deep storage.\n", "\n", "Before permanently deleting the segments, you can verify that they've only been soft deleted by inspecting your deep storage. The soft deleted segments are still there. This step is optional, you can move onto the next set of cells without completing this step." ] @@ -507,7 +546,7 @@ "\n", "The following cell uses the endpoint, setting the `dataSource` path parameter as `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. \n", "\n", - "Notice that the interval is set to `2015-09-12_2015-09-13` which covers the entirety of the 22 segments. Druid only permanently delete the \"unused\" segments within this interval. " + "Notice that the interval is set to `2015-09-12_2015-09-13` which covers the entirety of the 22 segments. Druid only permanently deletes the \"unused\" segments within this interval. " ] }, { @@ -545,7 +584,7 @@ "id": "69d6e89a", "metadata": {}, "source": [ - "Last, observe that the segments have been deleted from deep storage in the following sample output. " + "You can verify that the segments are completely gone by inspecting your deep storage like the optional step earlier. The command should return an output like the following:" ] }, { @@ -593,7 +632,7 @@ "id": "b8d0260a", "metadata": {}, "source": [ - "You can delete entire tables the same way you can delete parts of a table, using intervals.\n", + "You can delete entire tables the same way you can delete segments of a table, using intervals.\n", "\n", "Run the following cell to reset the endpoint to `/druid/coordinator/v1/datasources/:dataSource/markUnused`." ] @@ -638,7 +677,7 @@ "id": "bbbba823", "metadata": {}, "source": [ - "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and send a SQL-based request. 
" + "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request. " ] }, { @@ -664,7 +703,7 @@ "source": [ "Run the next cells to view the response. You should see that the `response.text` returns nothing, but `response.status_code` returns a 200. \n", "\n", - "The response should return the remaining segments, but since the table was deleted, there are no segments to return." + "The response returns the remaining segments, but since you deleted the table, there are no segments to return." ] }, { @@ -710,6 +749,14 @@ "response.status_code" ] }, + { + "cell_type": "markdown", + "id": "55745cb5-2869-4086-a247-62d78956c8b4", + "metadata": {}, + "source": [ + "If you inspect your deep storage again, the directory for the datasource would be removed along with its corresponding segments." + ] + }, { "cell_type": "markdown", "id": "8b2d59c8", @@ -919,13 +966,13 @@ "source": [ "## Conclusion\n", "\n", - "In this tutorial, you learned how to mark data segments as \"unused\" for soft deletion and send a kill task to permanently delete segment from deep storage. \n", + "In this tutorial, you learned how to mark data segments as \"unused\" for soft deletion and send a kill task to permanently delete segments from deep storage. \n", "\n", "You learned two different methods for marking segments as \"unused\": \n", "1. Deletion by time interval\n", "2. Deletion by segment ID\n", "\n", - "With the knowledge gained from this tutorial, you can now efficiently manage storage allocation. Go forth and Druid!" + "With the knowledge gained from this tutorial, you can now programmatically manage your storage for Druid. Go forth and Druid!" 
] }, { @@ -967,7 +1014,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.8" + "version": "3.11.4" } }, "nbformat": 4, From ad8f82d73b362b985238f452d4873bab9ac66c55 Mon Sep 17 00:00:00 2001 From: demo-kratia <56242907+demo-kratia@users.noreply.github.com> Date: Tue, 15 Aug 2023 11:58:52 -0700 Subject: [PATCH 7/8] feedback 2, update structure, remove wikipedia_hour4 mistake --- .../04-api/01-delete-api-tutorial.ipynb | 247 +++++++----------- 1 file changed, 96 insertions(+), 151 deletions(-) diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb index 954300b3b1a0..ac84a97f65b2 100644 --- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb +++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb @@ -38,8 +38,8 @@ "- [Ingest data](#Ingest-data)\n", "- [Deletion steps](#Deletion-steps)\n", "- [Delete by time interval](#Delete-by-time-interval)\n", - "- [Delete entire table](#Delete-entire-table)\n", "- [Delete by segment ID](#Delete-by-segment-ID)\n", + "- [Delete entire table](#Delete-entire-table)\n", "- [Conclusion](#Conclusion)\n", "- [Learn more](#Learn-more)\n", "\n", @@ -184,7 +184,7 @@ "outputs": [], "source": [ "sql = '''\n", - "REPLACE INTO \"wikipedia_hour4\" OVERWRITE ALL\n", + "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n", "WITH \"ext\" AS (SELECT *\n", "FROM TABLE(\n", " EXTERN(\n", @@ -621,117 +621,126 @@ }, { "cell_type": "markdown", - "id": "f0d0578c", + "id": "8b2d59c8", "metadata": {}, "source": [ - "## Delete entire table" + "## Delete by segment ID" ] }, { "cell_type": "markdown", - "id": "b8d0260a", + "id": "a4e8453e", "metadata": {}, "source": [ - "You can delete entire tables the same way you can delete segments of a table, using intervals.\n", - "\n", - "Run the following cell to reset the 
endpoint to `/druid/coordinator/v1/datasources/:dataSource/markUnused`." + "In addition to deleting by interval, you can delete segments by using `segment_id`. Run the next cell to retrieve the `segment_id` of each segment." ] }, { "cell_type": "code", "execution_count": null, - "id": "dd354886", + "id": "7f4e4ed7", "metadata": {}, "outputs": [], "source": [ - "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", - "endpoint" + "endpoint = druid_host + '/druid/v2/sql'\n", + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "print(response.text)" ] }, { "cell_type": "markdown", - "id": "ed3cf7d3", + "id": "017d63e9", "metadata": {}, "source": [ - "Next, send a `POST` with the payload `{\"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"}` to mark the entirety of the table as \"unused.\"" + "With known `segment_id`, you can mark specific segments \"unused\" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values." ] }, { "cell_type": "code", "execution_count": null, - "id": "25639752", + "id": "e320037a", "metadata": {}, "outputs": [], "source": [ - "sql_request = {\n", - " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n", - "}\n", - "\n", - "response = session.post(endpoint, json=sql_request);\n", - "\n", - "response.status_code" + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", + "endpoint" ] }, { "cell_type": "markdown", - "id": "bbbba823", + "id": "16861873", "metadata": {}, "source": [ - "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request. 
" + "In the next cell, construct a payload with `segmentIds` property and an array of `segment_id`. This payload should send the segments responsible for the interval `01:00:00.000` to `02:00:00.000` and `5:00:00.000` to `6:00:00.000` to be marked as \"unused.\"\n", + "\n", + "Fill in the `segmentIds` array with the `segment_id` corresponding to these intervals, then run the cell." ] }, { "cell_type": "code", "execution_count": null, - "id": "eac1db4c", + "id": "e1cec240", "metadata": {}, "outputs": [], "source": [ - "endpoint = druid_host + '/druid/v2/sql'\n", "sql_request = {\n", - " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", - " \"resultFormat\": \"objectLines\"\n", + " \"segmentIds\": [\n", + " \"\",\n", + " \"\"\n", + " ]\n", "}\n", "\n", - "response = session.post(endpoint, json=sql_request);" + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "response.text" ] }, { "cell_type": "markdown", - "id": "deae8727", + "id": "cda3213c", "metadata": {}, "source": [ - "Run the next cells to view the response. You should see that the `response.text` returns nothing, but `response.status_code` returns a 200. \n", + "You should see a response with the `numChangedSegments` property and the value `2` for the two segments marked as \"unused.\"\n", "\n", - "The response returns the remaining segments, but since you deleted the table, there are no segments to return." + "Run the cell below to view changes in the datasource's segments." 
] }, { "cell_type": "code", "execution_count": null, - "id": "12c11291", + "id": "9301e6df", "metadata": {}, "outputs": [], "source": [ - "response.text\n", - "response.status_code" + "endpoint = druid_host + '/druid/v2/sql'\n", + "sql_request = {\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", + "}\n", + "\n", + "response = session.post(endpoint, json=sql_request)\n", + "\n", + "print(response.text)" ] }, { "cell_type": "markdown", - "id": "2e9a74da", + "id": "671521b6", "metadata": {}, "source": [ - "So far, you've soft deleted the table. Run the following cells to permanently delete the table from deep storage:" + "Last, run the following cells to permanently delete the segments from deep storage." ] }, { "cell_type": "code", "execution_count": null, - "id": "6c3d7ec9", - "metadata": { - "scrolled": true - }, + "id": "9dd59374", + "metadata": {}, "outputs": [], "source": [ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n", @@ -741,17 +750,17 @@ { "cell_type": "code", "execution_count": null, - "id": "98834167", + "id": "f5f31f08", "metadata": {}, "outputs": [], "source": [ - "response = session.delete(endpoint);\n", + "response = session.delete(endpoint)\n", "response.status_code" ] }, { "cell_type": "markdown", - "id": "55745cb5-2869-4086-a247-62d78956c8b4", + "id": "a2bb5047-20bd-46bd-be82-de19acb5c1a1", "metadata": {}, "source": [ "If you inspect your deep storage again, the directory for the datasource would be removed along with its corresponding segments." 
@@ -759,189 +768,117 @@ }, { "cell_type": "markdown", - "id": "8b2d59c8", + "id": "f0d0578c", "metadata": {}, "source": [ - "## Delete by segment ID" + "## Delete entire table" ] }, { "cell_type": "markdown", - "id": "a4e8453e", + "id": "b8d0260a", "metadata": {}, "source": [ - "In addition to deleting by interval, you can delete segments by using `segment_id`. Let's load in some new data to work with.\n", + "You can delete entire tables the same way you can delete segments of a table, using intervals.\n", "\n", - "Run the following cell to ingest a new set of data for `wikipedia_hour`. " + "Run the following cell to reset the endpoint to `/druid/coordinator/v1/datasources/:dataSource/markUnused`." ] }, { "cell_type": "code", "execution_count": null, - "id": "5f512f4e", + "id": "dd354886", "metadata": {}, "outputs": [], "source": [ - "endpoint = druid_host + '/druid/v2/sql/task'\n", - "sql = '''\n", - "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n", - "WITH \"ext\" AS (SELECT *\n", - "FROM TABLE(\n", - " EXTERN(\n", - " '{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n", - " '{\"type\":\"json\"}'\n", - " )\n", - ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR, \"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR, \"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\" VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR, \"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n", - "SELECT\n", - " TIME_PARSE(\"time\") AS \"__time\",\n", - " \"channel\",\n", - " \"cityName\",\n", - " \"comment\",\n", - " \"countryIsoCode\",\n", - " \"countryName\",\n", - " \"isAnonymous\",\n", - " \"isMinor\",\n", - " \"isNew\",\n", - " \"isRobot\",\n", - " \"isUnpatrolled\",\n", - " \"metroCode\",\n", - " \"namespace\",\n", - " 
\"page\",\n", - " \"regionIsoCode\",\n", - " \"regionName\",\n", - " \"user\",\n", - " \"delta\",\n", - " \"added\",\n", - " \"deleted\"\n", - "FROM \"ext\"\n", - "PARTITIONED BY HOUR\n", - "'''\n", - "\n", - "\n", - "sql_request = {\n", - " 'query': sql\n", - "}\n", - "\n", - "response = session.post(endpoint, json=sql_request)\n", - "response.status_code" + "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", + "endpoint" ] }, { "cell_type": "markdown", - "id": "e51f6b54", + "id": "ed3cf7d3", "metadata": {}, "source": [ - "Now that you have a brand new datasource to work with, let's view the segment information for it.\n", - "\n", - "Run the next cell to retrieve the `segment_id` of each segment." + "Next, send a `POST` with the payload `{\"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"}` to mark the entirety of the table as \"unused.\"" ] }, { "cell_type": "code", "execution_count": null, - "id": "7f4e4ed7", + "id": "25639752", "metadata": {}, "outputs": [], "source": [ - "endpoint = druid_host + '/druid/v2/sql'\n", "sql_request = {\n", - " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", - " \"resultFormat\": \"objectLines\"\n", + " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n", "}\n", "\n", - "response = session.post(endpoint, json=sql_request)\n", - "print(response.text)" - ] - }, - { - "cell_type": "markdown", - "id": "017d63e9", - "metadata": {}, - "source": [ - "With known `segment_id`, you can mark specific segments \"unused\" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e320037a", - "metadata": {}, - "outputs": [], - "source": [ - "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n", - "endpoint" + "response = session.post(endpoint, json=sql_request);\n", + "\n", + "response.status_code" ] }, { "cell_type": "markdown", - "id": "16861873", + "id": "bbbba823", "metadata": {}, "source": [ - "In the next cell, construct a payload with `segmentIds` property and an array of `segment_id`. This payload should send the segments responsible for the interval `01:00:00.000` to `02:00:00.000` and `5:00:00.000` to `6:00:00.000` to be marked as \"unused.\"\n", - "\n", - "Fill in the `segmentIds` array with the `segment_id` corresponding to these intervals, then run the cell." + "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request. " ] }, { "cell_type": "code", "execution_count": null, - "id": "e1cec240", + "id": "eac1db4c", "metadata": {}, "outputs": [], "source": [ + "endpoint = druid_host + '/druid/v2/sql'\n", "sql_request = {\n", - " \"segmentIds\": [\n", - " \"\",\n", - " \"\"\n", - " ]\n", + " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", + " \"resultFormat\": \"objectLines\"\n", "}\n", "\n", - "response = session.post(endpoint, json=sql_request)\n", - "\n", - "response.text" + "response = session.post(endpoint, json=sql_request);" ] }, { "cell_type": "markdown", - "id": "cda3213c", + "id": "deae8727", "metadata": {}, "source": [ - "You should see a response with the `numChangedSegments` property and the value `2` for the two segments marked as \"unused.\"\n", + "Run the next cells to view the response. You should see that the `response.text` returns nothing, but `response.status_code` returns a 200. \n", "\n", - "Run the cell below to view changes in the datasource's segments." 
+ "The response returns the remaining segments, but since you deleted the table, there are no segments to return." ] }, { "cell_type": "code", "execution_count": null, - "id": "9301e6df", + "id": "12c11291", "metadata": {}, "outputs": [], "source": [ - "endpoint = druid_host + '/druid/v2/sql'\n", - "sql_request = {\n", - " \"query\": \"SELECT segment_id FROM sys.segments WHERE \\\"datasource\\\" = 'wikipedia_hour'\",\n", - " \"resultFormat\": \"objectLines\"\n", - "}\n", - "\n", - "response = session.post(endpoint, json=sql_request)\n", - "\n", - "print(response.text)" + "response.text\n", + "response.status_code" ] }, { "cell_type": "markdown", - "id": "671521b6", + "id": "2e9a74da", "metadata": {}, "source": [ - "Last, run the following cells to permanently delete the segments from deep storage." + "So far, you've soft deleted the table. Run the following cells to permanently delete the table from deep storage:" ] }, { "cell_type": "code", "execution_count": null, - "id": "9dd59374", - "metadata": {}, + "id": "6c3d7ec9", + "metadata": { + "scrolled": true + }, "outputs": [], "source": [ "endpoint = druid_host + '/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n", @@ -951,14 +888,22 @@ { "cell_type": "code", "execution_count": null, - "id": "f5f31f08", + "id": "98834167", "metadata": {}, "outputs": [], "source": [ - "response = session.delete(endpoint)\n", + "response = session.delete(endpoint);\n", "response.status_code" ] }, + { + "cell_type": "markdown", + "id": "55745cb5-2869-4086-a247-62d78956c8b4", + "metadata": {}, + "source": [ + "If you inspect your deep storage again, the directory for the datasource would be removed along with its corresponding segments." 
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "998682dc",

From 8290505a21c3e43be7160da69db6d936a25e5679 Mon Sep 17 00:00:00 2001
From: demo-kratia <56242907+demo-kratia@users.noreply.github.com>
Date: Tue, 15 Aug 2023 16:15:04 -0700
Subject: [PATCH 8/8] feedback

---
 .../notebooks/04-api/01-delete-api-tutorial.ipynb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
index ac84a97f65b2..39f655a5a9a8 100644
--- a/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
+++ b/examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb
@@ -657,7 +657,7 @@
    "id": "017d63e9",
    "metadata": {},
    "source": [
-    "With known `segment_id`, you can mark specific segments \"unused\" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values."
+    "If you know the `segment_id`, you can mark specific segments \"unused\" by sending a request to the `/druid/coordinator/v1/datasources/wikipedia_hour/markUnused` endpoint with an array of `segment_id` values."
    ]
   },
   {
@@ -676,7 +676,7 @@
    "id": "16861873",
    "metadata": {},
    "source": [
-    "In the next cell, construct a payload with `segmentIds` property and an array of `segment_id`. This payload should send the segments responsible for the interval `01:00:00.000` to `02:00:00.000` and `5:00:00.000` to `6:00:00.000` to be marked as \"unused.\"\n",
+    "In the next cell, construct a payload with a `segmentIds` property containing an array of `segment_id` values. This payload marks the segments covering the intervals `01:00:00.000` to `02:00:00.000` and `05:00:00.000` to `06:00:00.000` as \"unused.\"\n",
     "\n",
     "Fill in the `segmentIds` array with the `segment_id` corresponding to these intervals, then run the cell."
] @@ -763,7 +763,7 @@ "id": "a2bb5047-20bd-46bd-be82-de19acb5c1a1", "metadata": {}, "source": [ - "If you inspect your deep storage again, the directory for the datasource would be removed along with its corresponding segments." + "If you inspect your deep storage again, the directory for the datasource is removed along with its corresponding segments." ] }, { @@ -901,7 +901,7 @@ "id": "55745cb5-2869-4086-a247-62d78956c8b4", "metadata": {}, "source": [ - "If you inspect your deep storage again, the directory for the datasource would be removed along with its corresponding segments." + "If you inspect your deep storage again, the directory for the datasource is removed along with its corresponding segments." ] }, {