In hierarchical documents, under certain scenarios, snippets for descendant documents don't turn up #40572

Closed
kowndinyav opened this issue Mar 28, 2019 · 5 comments

Comments

@kowndinyav

kowndinyav commented Mar 28, 2019

Elasticsearch version: AWS Elasticsearch Service 5.1

Plugins installed: []

JVM version: Whatever JVM is bundled with AWS Elasticsearch Service

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
We have an application that stores documents hierarchically. At a high level, the structure is "board -> post -> comment -> reply".

The requirement is to search for a given text within the "post" hierarchy: if the text matches any field of a post, comment, or reply document, the post should be returned along with the matched snippets. The problem we observe is that the post is always returned when any post, comment, or reply field matches, but in some cases the snippets are missing.

For example, in the scenario from the attachments, a search for the text "document" fails to produce the comment snippet for '202_post' (the post itself is returned).

As I understand it, since the post is always returned when the criteria match, the matching itself works; the problem appears to be in how the snippets are prepared for the response.

Steps to reproduce
The attachments should help in reproducing the issue:

    • create.txt - contains ES fragment to create a test index and the documents
    • query.txt - contains intended ES query
    • response.txt - contains the response
    • picture1.png - a visual representation of what goes wrong in the response

create.txt
Picture1
query.txt
response.txt

@jimczi
Contributor

jimczi commented Mar 28, 2019

Your query contains two inner_hits at the same level, one named comment:

"inner_hits": {
	"name": "comment",
	"highlight": {
		"fields": {
			"commentText_en": {}
		}
	}
}

and the other one with no name:

"inner_hits": {}

inner_hits without names are keyed in the response by the type of the has_child query in which they appear. Since the type in this case is comment, the two inner_hits definitions clash. The last inner_hits always wins, so the results of the first one are discarded. You can "fix" your query by setting an explicit name in the empty inner_hits definition or by changing the first one to use another name. Note that starting in 6.7.0 we throw an error if two inner_hits have the same name: #37645.
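
The keying rule described above can be checked on the client side before a query is sent. The following is a minimal Python sketch, not part of Elasticsearch: `inner_hits_keys`, `clashing_keys`, and `comment_clause` are hypothetical helper names, the `match_all` child queries are placeholders, and the sketch only looks at has_child clauses, treating anything nested below a clause that declares inner_hits as belonging to a deeper level.

```python
from collections import Counter


def inner_hits_keys(node):
    """Collect the response keys of the inner_hits found at one level of a query."""
    keys = []
    if isinstance(node, list):
        for item in node:
            keys.extend(inner_hits_keys(item))
    elif isinstance(node, dict):
        for name, value in node.items():
            if name == "has_child" and isinstance(value, dict) and "inner_hits" in value:
                # An unnamed inner_hits falls back to the child type as its key.
                keys.append(value["inner_hits"].get("name", value["type"]))
                # Anything nested deeper belongs to another level; do not descend.
            else:
                keys.extend(inner_hits_keys(value))
    return keys


def clashing_keys(query):
    """Return the keys that more than one inner_hits at this level would use."""
    counts = Counter(inner_hits_keys(query))
    return sorted(key for key, n in counts.items() if n > 1)


def comment_clause(inner_hits):
    """Build a placeholder has_child clause on the comment type (illustration only)."""
    return {"has_child": {"type": "comment", "query": {"match_all": {}}, "inner_hits": inner_hits}}


# One named "comment" plus one unnamed inner_hits on type "comment": both use the key "comment".
clashing = {"bool": {"should": [comment_clause({"name": "comment"}), comment_clause({})]}}
# Giving the second one an explicit, different name removes the clash.
fixed = {"bool": {"should": [comment_clause({"name": "comment"}), comment_clause({"name": "comment2"})]}}

print(clashing_keys(clashing))  # ['comment']
print(clashing_keys(fixed))     # []
```

A check like this only mirrors the naming rule stated in this comment; the server remains the authority on what it accepts.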

@jimczi jimczi closed this as completed Mar 28, 2019
@kowndinyav
Author

kowndinyav commented Mar 30, 2019

Hello @jimczi,

If I remove the empty inner_hits, the reply document snippets are not returned. Please try the same on your end. If my understanding is wrong, could you please share a query that returns both comments and replies without missing any of the matching snippets? That would be immensely helpful.

-Kowndinya

@jimczi
Contributor

jimczi commented Apr 2, 2019

I tested the following query in 5.6:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match_phrase_prefix": {
                        "postText_en": "document"
                    }
                },
                {
                    "has_child": {
                        "type": "comment",
                        "query": {
                            "match_phrase_prefix": {
                                "commentText_en": "document"
                            }
                        },
                        "inner_hits": {
                            "name": "comment",
                            "highlight": {
                                "fields": {
                                    "commentText_en": {}
                                }
                            }
                        }
                    }
                },
                {
                    "has_child": {
                        "type": "comment",
                        "query": {
                            "has_child": {
                                "type": "reply",
                                "query": {
                                    "match_phrase_prefix": {
                                        "replyText_en": "document"
                                    }
                                },
                                "inner_hits": {
                                    "name": "reply",
                                    "highlight": {
                                        "fields": {
                                            "replyText_en": {}
                                        }
                                    }
                                }
                            }
                        },
                        "inner_hits": {
                            "name": "comment2"
                        }
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "postText_en": {}
        }
    }
}

As I said above, you need to set an explicit name in your inner_hits in order to avoid a name clash. In the example above, I replaced the empty inner_hits with:

"inner_hits": {
  "name": "comment2"
}

and the highlights are returned as expected.
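
With distinct names in place, the highlights for each level sit under the corresponding inner_hits key of every hit. As a rough illustration of how a client might gather them, here is a Python sketch that walks one hit recursively. The function name `collect_highlights` is hypothetical, and the sample hit is invented for illustration (the ids and fragments are not taken from the attached response.txt); only the general layout of inner_hits blocks and highlight entries is assumed.

```python
def collect_highlights(hit, path=()):
    """Yield (inner_hits path, document id, highlight) for a hit and its nested inner hits."""
    if "highlight" in hit:
        yield path, hit["_id"], hit["highlight"]
    for name, block in hit.get("inner_hits", {}).items():
        for inner in block["hits"]["hits"]:
            # Recurse so that grandchildren (replies under comments) are found as well.
            yield from collect_highlights(inner, path + (name,))


# Invented sample hit; ids and highlight fragments are placeholders.
sample_hit = {
    "_id": "post-1",
    "inner_hits": {
        "comment": {"hits": {"hits": [
            {"_id": "comment-1",
             "highlight": {"commentText_en": ["a <em>document</em> comment"]}},
        ]}},
        "comment2": {"hits": {"hits": [
            {"_id": "comment-2",
             "inner_hits": {"reply": {"hits": {"hits": [
                 {"_id": "reply-1",
                  "highlight": {"replyText_en": ["a <em>document</em> reply"]}},
             ]}}}},
        ]}},
    },
}

for path, doc_id, highlight in collect_highlights(sample_hit):
    print("/".join(path), doc_id, highlight)
```

Because the walk is driven by whatever keys appear in the response, it does not need to know in advance whether the intermediate level is called comment or comment2.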

@kowndinyav
Author

kowndinyav commented Apr 3, 2019

Thanks @jimczi for the query. With it I get comments and replies as expected, but I have a few questions.

When I first wrote the query, I did not have the empty "inner_hits" fragment. When reply documents were not turning up in the highlights section, adding the empty inner_hits was guesswork, as the documentation did not make clear how to resolve this. So why is an additional dummy inner_hits required for reply documents to turn up?

Also, the clash you are referring to is between two different "match" sections inside "should". One matches documents on comment fields and the other matches on reply fields. As an end user of Elasticsearch, I expected this to work without the additional inner_hits. Is this an implementation issue that has already been addressed in releases after 5.1?

For me, the issue with adding an arbitrary name for inner_hits is that it breaks certain assumptions we make while parsing the response. Essentially, we go by type name.

@jimczi
Contributor

jimczi commented Apr 5, 2019

> When I first time wrote the query, I did not have empty "inner_hits" fragment. When reply documents were not turning up in highlights section, it was a guess work to add the empty inner_hits as it was not clear from documentation how to resolve this. So, why is it required to add an additional dummy inner_hits for reply documents to turn-up?

You have two levels of parent-child relations, comment and reply. Inner hits for the reply section are grandchildren of the root document, so you need to provide inner_hits for the intermediate level (comment) to link the reply documents in the response.
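
To make the linking requirement concrete, the following Python sketch flags inner_hits that sit below a has_child clause declaring no inner_hits of its own. It is a client-side illustration only: `unlinked_inner_hits` is a hypothetical helper, the `match_all` query is a placeholder, and the rule it encodes is simply the behaviour described in this thread for the 5.x line, which other versions may not share.

```python
def unlinked_inner_hits(node, missing=None):
    """Return (inner_hits key, ancestor type) pairs that lack an intermediate inner_hits."""
    found = []
    if isinstance(node, list):
        for item in node:
            found.extend(unlinked_inner_hits(item, missing))
    elif isinstance(node, dict):
        for key, value in node.items():
            if key == "has_child" and isinstance(value, dict):
                below = missing
                if "inner_hits" in value:
                    if missing is not None:
                        # An ancestor has_child declared no inner_hits to attach this one to.
                        name = value["inner_hits"].get("name", value["type"])
                        found.append((name, missing))
                elif missing is None:
                    # Remember the first level that declares no inner_hits.
                    below = value["type"]
                found.extend(unlinked_inner_hits(value.get("query", {}), below))
            else:
                found.extend(unlinked_inner_hits(value, missing))
    return found


reply_clause = {"has_child": {"type": "reply", "query": {"match_all": {}},
                              "inner_hits": {"name": "reply"}}}
# The comment level declares no inner_hits, so the reply inner_hits has nothing to hang from.
without_link = {"has_child": {"type": "comment", "query": reply_clause}}
# Declaring inner_hits on the comment level provides the link.
with_link = {"has_child": {"type": "comment", "query": reply_clause,
                           "inner_hits": {"name": "comment2"}}}

print(unlinked_inner_hits(without_link))  # [('reply', 'comment')]
print(unlinked_inner_hits(with_link))     # []
```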

> Also, the clash that you are referring to is between two different "match" sections inside "should".

These two inner_hits are at the same level, so we need a way to differentiate them; hence their names must be different.

> For me the issue with adding a random name for inner_hits is that it breaks certain assumptions that we make while parsing the response. Essentially we go by type name.

This works if you have a single inner_hits per level, but in your example you have two has_child queries at the same level. One thing you can do is group the comment queries into a single has_child clause:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match_phrase_prefix": {
                        "postText_en": "document"
                    }
                },
                {
                    "has_child": {
                        "type": "comment",
                        "query": {
                            "bool": {
                                "should": [
                                    {
                                        "match_phrase_prefix": {
                                            "commentText_en": "document"
                                        }
                                    },
                                    {
                                        "has_child": {
                                            "type": "reply",
                                            "query": {
                                                "match_phrase_prefix": {
                                                    "replyText_en": "document"
                                                }
                                            },
                                            "inner_hits": {
                                                "name": "reply",
                                                "highlight": {
                                                    "fields": {
                                                        "replyText_en": {}
                                                    }
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        },
                        "inner_hits": {
                            "name": "comment",
                            "highlight": {
                                "fields": {
                                    "commentText_en": {}
                                }
                            }
                        }
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "postText_en": {}
        }
    }
}
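
For an application that issues this search with varying text, the grouped query can be assembled programmatically. The Python sketch below rebuilds the same structure as the JSON above; the helper names `match`, `highlight`, and `build_post_search` are hypothetical, and the field and type names are the ones used in this issue.

```python
import json


def match(field, text):
    """match_phrase_prefix clause on a single field."""
    return {"match_phrase_prefix": {field: text}}


def highlight(field):
    """Highlight section requesting default highlighting for one field."""
    return {"highlight": {"fields": {field: {}}}}


def build_post_search(text):
    """Build the grouped post/comment/reply query with one inner_hits per level."""
    reply_clause = {
        "has_child": {
            "type": "reply",
            "query": match("replyText_en", text),
            "inner_hits": {"name": "reply", **highlight("replyText_en")},
        }
    }
    comment_clause = {
        "has_child": {
            "type": "comment",
            # A comment qualifies if its own text matches or one of its replies does.
            "query": {"bool": {"should": [match("commentText_en", text), reply_clause]}},
            "inner_hits": {"name": "comment", **highlight("commentText_en")},
        }
    }
    return {
        "query": {"bool": {"should": [match("postText_en", text), comment_clause]}},
        **highlight("postText_en"),
    }


body = build_post_search("document")
print(json.dumps(body, indent=4))
```

Since each level now carries exactly one inner_hits and its name equals the type name, response parsing that goes by type name keeps working.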

@kowndinyav, if you have any further questions, please don't post them in this issue but open a topic in the discuss forum; we reserve GitHub for verified bugs and feature requests.
