get_uid() returns incorrect UID for non-existent taxon #436

Closed
snubian opened this issue Jun 26, 2015 · 12 comments

snubian commented Jun 26, 2015

Using taxize 0.6.0

I have hit a couple of cases where, for a scientific name that should return NA, get_uid() returns the UID of an unrelated taxon. It seems that get_uid() returns a match based on only part of the search term, such as matching the species epithet to an unrelated genus. E.g.:

> get_uid(sciname = "Fringella morel")

Retrieving data for taxon 'Fringella morel'

[1] "39407"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"uri")
[1] "http://www.ncbi.nlm.nih.gov/taxonomy/39407"

The search term Fringella morel is a misspelling of a bird species name and does not actually exist, but the UID returned, 39407, is for the taxon Morchella esculenta, a fungus which has the GenBank common name morel (in case that is relevant).

If I alter the search term slightly I get the correct result:

> get_uid(sciname = "Fringella morelx")

Retrieving data for taxon 'Fringella morelx'

[1] NA
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"

I tried using the division parameter to narrow the search but without success.

Another example is:

> get_uid(sciname = "Aratinga acuticauda")

Retrieving data for taxon 'Aratinga acuticauda'

[1] "866279"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"uri")
[1] "http://www.ncbi.nlm.nih.gov/taxonomy/866279"

which returns the UID for the aphid genus Acuticauda. I tried using rank = "species" here but it had no effect.

Note that when manually doing the above searches in the NCBI Taxonomy Browser you get the standard Did you mean ... response with a list of suggested taxa.

sckott added the bug label Jun 26, 2015
sckott added this to the 0.6.2 milestone Jun 26, 2015
Contributor

sckott commented Jun 26, 2015

Hi @snubian, looking at this. There are a few things here. NCBI doesn't do fuzzy searching as far as I know, so they don't attempt to match Fringella to the very close Fringilla. In this case they did have a match for morel and returned those records.

For the division and rank fields, we are using those after the request comes back to filter results. We could instead add any division or rank to the query itself. I'm looking at this now; more soon.

One approach with mis-spelled names (which you may be aware of already) is to try to make sure you have correct spellings first. e.g., using gnr_resolve(), data source 4 is NCBI

<r> gnr_resolve(names = "Fringella morel", data_source_ids = 4)
$results
[1] "no results found"

$preferred
NULL

Then search for Fringella alone since no results above

<r> gnr_resolve(names = "Fringella", data_source_ids = 4)
$results
  submitted_name matched_name data_source_title score
1      Fringella    Fringilla              NCBI   0.5

$preferred
NULL

Author

snubian commented Jun 26, 2015

Thanks @sckott - very helpful advice. I wasn't expecting NCBI to return a fuzzy match, but I thought it would only return a UID for an exact match or a synonym.

After posting the above I had a quick check to see what was happening under the hood. The XML returned by the Entrez search:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Fringella+morel
is:

<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult>
  <Count>1</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>39407</Id>
  </IdList>
  <TranslationSet/>
  <TranslationStack>
    <TermSet>
      <Term>morel[All Names]</Term>
      <Field>All Names</Field>
      <Count>1</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>GROUP</OP>
  </TranslationStack>
  <QueryTranslation>morel[All Names]</QueryTranslation>
  <ErrorList>
    <PhraseNotFound>Fringella</PhraseNotFound>
  </ErrorList>
</eSearchResult>

So it's matched on the partial term morel. I had a look at the docs on this and couldn't see any way of forcing it to search for the entire term. I couldn't get any results using AND with the taxonomy db. E.g., this:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Homo+AND+sapiens
returned No items found.

I will definitely check names beforehand from now on, but are you aware of a way to force a search to match only on the full term parameter?
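
One way to guard against this on the client side is to inspect the ErrorList element of the esearch response and treat any PhraseNotFound entry as "no match". The following is only a sketch: it assumes the xml2 package, runs on an abbreviated copy of the XML quoted above rather than a live request, and strict_ids() is a hypothetical helper, not part of taxize.

```r
library(xml2)

# Abbreviated copy of the esearch response quoted above (DOCTYPE omitted)
resp <- '<eSearchResult>
  <Count>1</Count>
  <IdList><Id>39407</Id></IdList>
  <QueryTranslation>morel[All Names]</QueryTranslation>
  <ErrorList><PhraseNotFound>Fringella</PhraseNotFound></ErrorList>
</eSearchResult>'

# Hypothetical helper: return the UIDs only when Entrez reports that every
# phrase in the query was found; otherwise return NA.
strict_ids <- function(response_xml) {
  doc <- read_xml(response_xml)
  # Phrases that Entrez silently dropped from the query
  dropped <- xml_text(xml_find_all(doc, "//ErrorList/PhraseNotFound"))
  if (length(dropped) > 0) {
    return(NA_character_)
  }
  ids <- xml_text(xml_find_all(doc, "//IdList/Id"))
  if (length(ids) == 0) NA_character_ else ids
}

strict_ids(resp)
```

For the response above, the dropped phrase Fringella causes the helper to return NA instead of 39407.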

Author

snubian commented Jun 26, 2015

Tried adding a field to the term and [ALLN] seemed to have the desired effect but not exactly sure why:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Fringella+morel[ALLN]

does not return the false-positive UID for morel.

While for a valid taxon:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Aratinga+brevipes[ALLN]

it returns the correct UID.
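
For anyone wanting to repeat these requests from R, the URLs above can be assembled with a small helper. This is a sketch only: esearch_url() is a hypothetical function, and it percent-encodes the space and brackets rather than using + as in the hand-written URLs, which should be equivalent.

```r
# Hypothetical helper: build an esearch URL for the taxonomy database,
# optionally appending a field tag such as ALLN to the search term.
esearch_url <- function(term, field = NULL) {
  base <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  if (!is.null(field)) {
    term <- paste0(term, "[", field, "]")
  }
  # reserved = TRUE also encodes characters such as "[" and "]"
  paste0(base, "?db=taxonomy&term=", utils::URLencode(term, reserved = TRUE))
}

url_alln <- esearch_url("Fringella morel", field = "ALLN")
url_plain <- esearch_url("Fringella morel")
```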

Contributor

sckott commented Jun 26, 2015

Where did you find ALLN? I can't seem to find any documentation on that.

Contributor

sckott commented Jun 26, 2015

I don't see anything about searching on an entire term when more than one word, unfortunately, unless that's what ALLN does? Ah, it looks like ALLN stands for [All Names]

sckott added a commit that referenced this issue Jun 26, 2015
…difier params, #436

changed filt() fxn to actually filter by given value instead of filter if present
Contributor

sckott commented Jun 26, 2015

@snubian I made some changes (see the examples): I added a few new params and renamed the division and rank params to division_filter and rank_filter, respectively.

The filtering across all get_*() functions now actually filters. Before, I only applied the filter when it matched something, and gave back the unfiltered results when it matched nothing. Now filtering returns only records that match the filter; see changes.
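
The difference between the two behaviours can be illustrated on a toy data frame. This is a sketch of the idea as described above, not the actual filt() code in taxize; filt_lenient() and filt_strict() are hypothetical names and the table is made up for illustration.

```r
# Toy result set standing in for records returned by a search
hits <- data.frame(
  uid = c("39407", "866279"),
  rank = c("species", "genus"),
  stringsAsFactors = FALSE
)

# Old behaviour as described: filter only if the filter matches something,
# otherwise hand back the unfiltered rows
filt_lenient <- function(df, column, value) {
  keep <- df[df[[column]] == value, , drop = FALSE]
  if (nrow(keep) > 0) keep else df
}

# New behaviour as described: always return only the matching rows
filt_strict <- function(df, column, value) {
  df[df[[column]] == value, , drop = FALSE]
}

nrow(filt_lenient(hits, "rank", "family"))  # 2: falls back to all rows
nrow(filt_strict(hits, "rank", "family"))   # 0: nothing matches
```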

Author

snubian commented Jun 27, 2015

On the ALLN thing - I found some explanation of the fields here:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=taxonomy

As you say, ALLN seems to set the scope of the search to all names, so I have no idea why it had the effect I saw above. I would still expect it to find a partial match on morel as a common name. But never mind.

Thanks Scott for making these changes so quickly. I installed the updates, but not sure if I'm doing the right thing to use the new parameters. I can see from the examples the distinction between rank_query and rank_filter. But I still can't get it to return NA when given a bad taxon name - it is still wanting to return a partial match for anything else it can find. I tried using rank_query = "species" and modifier = "Scientific Name" to prevent a match on an incorrect common name, but no change. In fact I couldn't get modifier = "Scientific Name" to return anything. Maybe I'm on the wrong track.

Author

snubian commented Jun 27, 2015

Also, I tried to use +AND+ in an Entrez search to separate genus and species, as a surrogate for a "full phrase" search, but couldn't get it to work, e.g.:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Homo+AND+sapiens

returns no matches.

Contributor

sckott commented Jun 27, 2015

Right, I'm aware that this still doesn't fix your problem, but I think these fixes should help in general. I hope we can in fact solve your problem; we're just not there yet.

An example using the modifier parameter:

<r> get_uid(sciname = "Fringilla", modifier = "Scientific Name")

Retrieving data for taxon 'Fringilla'

[1] "36254"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"uri")
[1] "http://www.ncbi.nlm.nih.gov/taxonomy/36254"

Author

snubian commented Jun 27, 2015

Thanks again Scott, these changes are a nice improvement for sure. I can work around my problem given your suggestions above. Your effort is really appreciated, don't get me wrong.

My problem with modifier = "Scientific Name" was that I was using it on a search for a species name that is actually a synonym. Using the modifier returned nothing, omitting the modifier gave the correct UID.

> get_uid(sciname = "Aratinga brevipes", modifier = "Scientific Name")

Retrieving data for taxon 'Aratinga brevipes'

[1] NA
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
> get_uid(sciname = "Aratinga brevipes")

Retrieving data for taxon 'Aratinga brevipes'

[1] "867385"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"uri")
[1] "http://www.ncbi.nlm.nih.gov/taxonomy/867385"

Contributor

sckott commented Jul 10, 2015

Thanks for further info. Getting back to this soon.

sckott self-assigned this Aug 5, 2015
sckott modified the milestones: 0.6.4 - SQL, 0.6.2 Aug 5, 2015
sckott modified the milestones: v0.6.4, v0.6.6 - SQL Sep 30, 2015
Contributor

sckott commented Sep 30, 2015

get_uid(sciname = "Aratinga brevipes", modifier = "Synonym") works for this eg

Been looking over this today. I don't think there's anything else I can do besides making the documentation a bit more clear, letting users know their options of how to modify requests with modifiers and other arguments. And note that Entrez does funny things with fuzzy search, matching epithets alone to other unrelated taxa, etc.

sckott closed this as completed in 80dcda6 Sep 30, 2015