Memory estimation endpoint returns "0" for non-empty dataset. #49140

przemekwitek · 2019-11-15T08:50:21Z

Dataset: barcelona_accidents

With the following request:

{
  "source": {
    "index": "barcelona_accidents"
  },
  "analysis": {
    "outlier_detection": {}
  }
}

the '_estimate_memory_usage' endpoint returns the following response:

{
  "expected_memory_without_disk" : "0",
  "expected_memory_with_disk" : "0"
}

Apparently the problem lies in data extraction, as the following search query produced by the data extractor yields no results:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "exists": {
            "field": "day",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.day",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.hour",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.location.lat",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.location.lon",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.mild_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.serious_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.vehicles_involved",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "doc.victims",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "hour",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "mild_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "serious_injuries",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "vehicles_involved",
            "boost": 1
          }
        },
        {
          "exists": {
            "field": "victims",
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "track_total_hits": 2147483647
}

However, it starts working fine if the fields without a doc. prefix are removed from the query.
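To illustrate why the query above matches nothing, here is a minimal Python sketch that imitates the semantics of a bool filter made of exists clauses: a document matches only if every listed field has a value. The helper names (field_exists, matches_bool_filter) are made up for this illustration and the document is a trimmed copy of the data; it is a simplification, not how Elasticsearch evaluates the query internally.

```python
from typing import Any, Dict, Iterable


def field_exists(source: Dict[str, Any], path: str) -> bool:
    """Rough stand-in for an `exists` clause: the dotted path resolves to a non-null value."""
    node: Any = source
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return node is not None and node != []


def matches_bool_filter(source: Dict[str, Any], fields: Iterable[str]) -> bool:
    """All filter clauses must hold, so a single missing field rejects the document."""
    return all(field_exists(source, f) for f in fields)


# Trimmed document: values live only under the "doc" object.
doc = {"doc": {"day": 1, "hour": 2, "victims": 1}}

all_fields = ["day", "doc.day", "doc.hour", "doc.victims", "hour", "victims"]
prefixed_only = [f for f in all_fields if f.startswith("doc.")]

print(matches_bool_filter(doc, all_fields))     # False
print(matches_bool_filter(doc, prefixed_only))  # True
```

With the unprefixed fields in the filter, no document can satisfy all clauses, which is consistent with the zero hits seen above.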


elasticmachine · 2019-11-15T08:50:22Z

Pinging @elastic/ml-core (:ml)

przemekwitek · 2019-11-15T09:02:27Z

Turns out that although fields like day, hour, etc. exist in the mapping, they do not exist in any document. That's why the query fails to find results.

Mapping:

{
  "barcelona_accidents": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "day": {
          "type": "short"
        },
        "district_name": {
          "type": "keyword"
        },
        "doc": {
          "properties": {
            "@timestamp": {
              "type": "date"
            },
            "day": {
              "type": "long"
            },
            "district_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "hour": {
              "type": "long"
            },
            "id": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "location": {
              "properties": {
                "lat": {
                  "type": "float"
                },
                "lon": {
                  "type": "float"
                }
              }
            },
            "mild_injuries": {
              "type": "long"
            },
            "month": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "part_of_the_day": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "serious_injuries": {
              "type": "long"
            },
            "street": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "vehicles_involved": {
              "type": "long"
            },
            "victims": {
              "type": "long"
            },
            "weekday": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "hour": {
          "type": "short"
        },
        "id": {
          "type": "text"
        },
        "location": {
          "type": "geo_point"
        },
        "mild_injuries": {
          "type": "short"
        },
        "month": {
          "type": "keyword"
        },
        "part_of_the_day": {
          "type": "keyword"
        },
        "serious_injuries": {
          "type": "short"
        },
        "street": {
          "type": "keyword"
        },
        "vehicles_involved": {
          "type": "short"
        },
        "victims": {
          "type": "short"
        },
        "weekday": {
          "type": "keyword"
        }
      }
    }
  }
}

Documents:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "barcelona_accidents",
        "_id": "NAawx2gBoWuqnhwrb0Ry",
        "_score": 1,
        "_source": {
          "doc": {
            "id": "2017S000002    ",
            "district_name": "Eixample",
            "street": "GV CORTS CATALANES                                ",
            "weekday": "Sunday",
            "month": "January",
            "day": 1,
            "hour": 2,
            "part_of_the_day": "Night",
            "mild_injuries": 1,
            "serious_injuries": 0,
            "victims": 1,
            "vehicles_involved": 2,
            "location": {
              "lat": 41.39968,
              "lon": 2.1823759999999996
            },
            "@timestamp": "2017-01-01T00:00:00"
          }
        }
      }
    ]
  }
}
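A quick way to spot this kind of mismatch is to compare the numeric leaf fields of the mapping against a sample document. The sketch below does that in plain Python on a trimmed copy of the mapping and document above. The function names are hypothetical, and restricting the check to numeric types is an assumption based on the fields that appear in the generated query.

```python
from typing import Any, Dict, Iterator, List

# Elasticsearch numeric field types. Assumption: only numeric fields matter here.
NUMERIC_TYPES = {
    "long", "integer", "short", "byte",
    "double", "float", "half_float", "scaled_float",
}


def numeric_leaf_fields(properties: Dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of numeric leaf fields, descending into object fields."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            yield from numeric_leaf_fields(spec["properties"], path + ".")
        elif spec.get("type") in NUMERIC_TYPES:
            yield path


def has_value(source: Dict[str, Any], path: str) -> bool:
    """True if the dotted path resolves to a non-null value in the document."""
    node: Any = source
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return node is not None


def mapped_but_missing(properties: Dict[str, Any], source: Dict[str, Any]) -> List[str]:
    """Fields present in the mapping but absent from the sample document."""
    return sorted(p for p in numeric_leaf_fields(properties) if not has_value(source, p))


# Trimmed versions of the mapping and the document shown above.
properties = {
    "day": {"type": "short"},
    "doc": {"properties": {
        "day": {"type": "long"},
        "hour": {"type": "long"},
        "location": {"properties": {"lat": {"type": "float"}, "lon": {"type": "float"}}},
    }},
    "hour": {"type": "short"},
    "location": {"type": "geo_point"},
}
source = {"doc": {"day": 1, "hour": 2, "location": {"lat": 41.39968, "lon": 2.182376}}}

print(mapped_but_missing(properties, source))  # ['day', 'hour']
```

A single document is only a sample, so a field reported here may still exist in other documents.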

przemekwitek · 2019-11-15T09:20:19Z

After talking to @dimitris-athanasiou, we agreed that the data extractor behavior is correct, i.e. the extractor should require all the analysable fields to exist for the outlier detection analysis (which does not support missing values).

So, to run the analysis correctly, the user should explicitly exclude the missing fields from the analysis using:

{
  ...
  "analyzed_fields": {
    "excludes": [ "day", "hour", "mild_injuries", "serious_injuries", "vehicles_involved", "victims" ]
  },
  ...
}
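For completeness, a small Python sketch of assembling such a request body. It assumes that analyzed_fields sits at the top level next to source and analysis, as the snippet above suggests; build_estimate_request is a hypothetical helper, not part of any client library.

```python
import json
from typing import Any, Dict, List


def build_estimate_request(index: str, excludes: List[str]) -> Dict[str, Any]:
    """Build the request body, excluding fields that exist in the mapping only."""
    return {
        "source": {"index": index},
        "analysis": {"outlier_detection": {}},
        # Assumption: analyzed_fields is a top-level key of the request body.
        "analyzed_fields": {"excludes": list(excludes)},
    }


body = build_estimate_request(
    "barcelona_accidents",
    ["day", "hour", "mild_injuries", "serious_injuries", "vehicles_involved", "victims"],
)
print(json.dumps(body, indent=2))
```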

To improve the UX, the memory estimation endpoint should throw early if it sees no analysable data, so that the user is not confused by receiving a "0" estimate.
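The proposed behaviour can be sketched roughly as follows. This is only an illustration in Python with made-up names (EmptyDataFrameError, check_rows_before_estimating); it is not the actual change, which lives in the Elasticsearch code base.

```python
class EmptyDataFrameError(ValueError):
    """Raised when the extractor finds no rows that can be analysed."""


def check_rows_before_estimating(num_rows: int, index: str) -> None:
    """Fail fast instead of letting the estimate silently come back as "0"."""
    if num_rows == 0:
        raise EmptyDataFrameError(
            f"No analysable documents found in [{index}]: every analysed field "
            "must be present in a document. Consider excluding fields that "
            "exist in the mapping only via analyzed_fields.excludes."
        )


try:
    check_rows_before_estimating(0, "barcelona_accidents")
except EmptyDataFrameError as err:
    print(f"error: {err}")
```

The error message points the user at the likely cause, which a bare "0" cannot do.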

przemekwitek added the :ml Machine learning label Nov 15, 2019

przemekwitek self-assigned this Nov 15, 2019

przemekwitek mentioned this issue Nov 15, 2019

Throw an exception when memory usage estimation endpoint encounters empty data frame. #49143

Merged

przemekwitek closed this as completed Nov 19, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed
