Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check duplicate DOI #6333

Merged
merged 21 commits into from
May 15, 2020
Merged

Check duplicate DOI #6333

merged 21 commits into from
May 15, 2020

Conversation

koppor
Copy link
Member

@koppor koppor commented Apr 21, 2020

Result of #latemob session with @skufer322.

Fixes koppor#339.

  • Includes quick refactorings to increase code quality.
  • Should also speedup our integrity check and reduce memory footprint (since the checkers are created per database check run - and not per entry check run)
  • Tests missing. Proposal: Leave tests for students 😇 ---> Add simple Unit tests #6207

grafik

  • Change in CHANGELOG.md described (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Apr 21, 2020
@koppor koppor changed the title Check duplicate doi Check duplicate DOI Apr 22, 2020
@Override
public List<IntegrityMessage> check(BibEntry entry) {
if (errors == null) {
errors = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly initialize that HashMap in the field declaration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAybe you did not see that the later code uses the BibtexDatabase from the constructor. We decided to put the code into check - and not into the constructor - to have a lean constructor and not have high CPU usage when initializing the checker, but at the first time an entry is checked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like premature optimization for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you caching the errors anyway? Also in case you are worrying about performance, why do you choose to implement a solution that has O(n^2) (with n = entries). Why not implement a O(n) solution (e.g. https://stackoverflow.com/a/31341963/873661) which you call in checkDatabase below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See in line 30 that the method is called once per entry. Is is by design of the Checker interface that there is method for checking each entry. There is no method for checking a complete database. There is also no interface for a complete database.

At the first call, the result map is filled. At all subsequent calls do not fill the map and reuse it.

We are O(n+m) where n is the size of the database and m is the number of entries with DOIs.

Copy link
Member Author

@koppor koppor Apr 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the code more reable, I moved the call to the error check to the constructor. See fd7621d

BiMap<DOI, List<BibEntry>> duplicateMap = HashBiMap.create(bibEntries.size());
for (BibEntry bibEntry : bibEntries) {
bibEntry.getDOI().ifPresent(doi ->
duplicateMap.computeIfAbsent(doi, x -> new ArrayList<>()).add(bibEntry));
Copy link
Member Author

@koppor koppor Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is doi, not entry. However, doi cannot be used as the variable name is already bound.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koppor Please have a close look. The second argument of duplicateMap.computeIfAbsent refers to the value and this clearly is of type List. So then maybe name it entries

Copy link
Member Author

@koppor koppor Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Siedlerchr Please explain the JavaDoc to me. The Map is FROM doi TO list. https://www.geeksforgeeks.org/hashmap-computeifabsent-method-in-java-with-examples/ explains the computeIfAbsent.

To me, it reads, that the key is passed to Function<? super K, ? extends V> remappingFunction. The key is clearly (!) a DOI not a list. -- Am I wrong?

Copy link
Member Author

@koppor koppor Apr 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the code more reable, I called the variable absentDoi. See 6f9b148

BiMap<DOI, List<BibEntry>> duplicateMap = HashBiMap.create(bibEntries.size());
for (BibEntry bibEntry : bibEntries) {
bibEntry.getDOI().ifPresent(doi ->
duplicateMap.computeIfAbsent(doi, x -> new ArrayList<>()).add(bibEntry));
Copy link
Member Author

@koppor koppor Apr 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is doi, not entry. However, doi cannot be used as the variable name is already bound.

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems way more complicated than it need to be: https://stackoverflow.com/a/31341963/873661

}

public List<IntegrityMessage> checkDatabase() {
private void initCheckers(BibDatabaseContext bibDatabaseContext, BibtexKeyPatternPreferences bibtexKeyPatternPreferences, JournalAbbreviationRepository journalAbbreviationRepository) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't see any advantage of this refactoring. Just makes the code more complex and harder to maintain in my opinion.

If you really feel like the original code needs a refactoring, then you create a list of all checkers that need to be run a) always, b) bibtex c)biblatex (still I would create these lists only in the checkdatabase method). This would get ride of the repeated check methods below. But to be honest, I don't think this yields a more readable code either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We initialize the checkers once per database check - not once per entry check. In case of a 20k database, approx 100k less java objects in memory per quality check.

We ensured that each checker is stateless - thus, it can be reused for acroess entries.

I agree with your proposal though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With your proposal, the code is much more reable. 🎉 03dd2a0

@Override
public List<IntegrityMessage> check(BibEntry entry) {
if (errors == null) {
errors = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you caching the errors anyway? Also in case you are worrying about performance, why do you choose to implement a solution that has O(n^2) (with n = entries). Why not implement a O(n) solution (e.g. https://stackoverflow.com/a/31341963/873661) which you call in checkDatabase below?

@koppor koppor removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Apr 22, 2020
@koppor
Copy link
Member Author

koppor commented Apr 22, 2020

That seems way more complicated than it need to be: stackoverflow.com/a/31341963/873661

I dont get in the stack overlfow thing, how they check

Stream((entry1, doi1), (entry2, doi1), (enty3, doi3))

Think, I have to implement a comparator comparing two bibentries by doi only. At the DOI comparison, I need to make a DOI object. I could cache that (which is unwanted) or I can reconstruct it every time (which is slower than oru approach above -- there, we construct the DOI object once per bib entry)

We need to construct the DOI object to be able to compare two DOIs. We decided that equals is a domain function belonging to the DOI object. (This respects the antipattern AnemicDomainModel https://martinfowler.com/bliki/AnemicDomainModel.html)

@koppor koppor marked this pull request as draft April 23, 2020 10:38
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice refactoring! Looks good to me, just a bit of nitpicking.

import org.jabref.model.entry.BibEntry;

@FunctionalInterface
public interface Checker {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to EntryChecker?

return duplicateMap.inverse().keySet().stream()
.filter(list -> list.size() > 1)
.flatMap(list -> list.stream())
.map(item -> new IntegrityMessage(Localization.lang("Unique DOI used in multiple entries"), item, StandardField.DOI))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "The same DOI is used..." is better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just "Same DOI..." will do.

}

public List<IntegrityMessage> checkDatabase() {
List<IntegrityMessage> executeAllCheckers() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name it simply check?

@koppor koppor self-assigned this Apr 26, 2020
# Conflicts:
#	src/main/java/org/jabref/logic/integrity/IntegrityCheck.java
#	src/test/java/org/jabref/logic/integrity/IntegrityCheckTest.java
@koppor koppor marked this pull request as ready for review May 15, 2020 09:29
Copy link
Member Author

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@koppor koppor merged commit 7bfbe44 into master May 15, 2020
@koppor koppor deleted the check-duplicate-doi branch May 15, 2020 09:37
Siedlerchr added a commit that referenced this pull request May 16, 2020
* upstream/master: (50 commits)
  Keep group pane size when resizing window (#6180) (#6423)
  Changelog: Fix missing citation for biblatex-mla
  Update AUTHORS
  Check duplicate DOI (#6333)
  Fix missing citation for biblatex-mla
  Change EasyBind dependency (#6480)
  Add testing of latest dev version as mandatory
  Squashed 'src/main/resources/csl-styles/' changes from 5dad23d..586e0b8
  Fix libre office connection and other progress dialogs (#6478)
  Fix clear year and month field when converting to biblatex (#6434)
  Add truncate as a BibTex key modifier (#6427)
  Add new authors (not all - they need more work)
  Remove empty line
  Add simple Unit Tests for #6207 (#6240)
  Enforce LeftCurly rule (#6452)
  Implement task progress indicator (and dialog) in the toolbar (#6443)
  Consider empty brackets
  Changelog update
  Added a test
  Fixed brackets in regular expressions
  ...
Siedlerchr added a commit that referenced this pull request Jul 1, 2020
# By dependabot-preview[bot] (18) and others
# Via GitHub (17) and others
* upstream/master: (77 commits)
  Reenable caching of gradle
  Refactor BibtexKeyPatternPreferences (#6489)
  Update CHANGELOG.md
  Add changelog entry and remove unnecessary code
  EasyBind revision part two
  Fix Drag and Drop on empty database
  Truncates DOIs and URLs in the column "Linked identifiers" in main table, if too long (#6498)
  Bump flexmark-ext-gfm-tasklist from 0.61.26 to 0.61.30
  Bump flexmark from 0.61.26 to 0.61.30
  Bump xmlunit-matchers from 2.6.4 to 2.7.0
  Bump java-string-similarity from 1.2.1 to 2.0.0
  Bump flexmark-ext-gfm-strikethrough from 0.61.26 to 0.61.30
  Bump xmlunit-core from 2.6.4 to 2.7.0
  Truncates the link and/or the link description in the column "linked files" in main table, if too long (#6179)
  Keep group pane size when resizing window (#6180) (#6423)
  Changelog: Fix missing citation for biblatex-mla
  Update AUTHORS
  Check duplicate DOI (#6333)
  Fix missing citation for biblatex-mla
  Change EasyBind dependency (#6480)
  ...

# Conflicts:
#	src/main/java/org/jabref/gui/actions/ActionHelper.java
#	src/main/java/org/jabref/gui/customentrytypes/CustomEntryTypeDialogViewModel.java
#	src/main/java/org/jabref/gui/customentrytypes/CustomizeEntryTypeDialogView.java
#	src/main/java/org/jabref/model/entry/field/FieldFactory.java
koppor pushed a commit that referenced this pull request Dec 15, 2022
f6c778e Update emerald-harvard.csl (#6335)
d6c6a16 Fix Brazilian quotes on  chicago-author-date.csl (#6317)
a1549b6 Update medizinische-hochschule-hannover.csl (#6330)
da88073 Update journal-of-the-american-college-of-cardiology.csl (#6334)
a520d8e Bump nokogiri from 1.13.9 to 1.13.10 (#6333)
ba54b44 Update royal-society-of-chemistry.csl (#6328)
1378ba7 LUSEM: Remove full stop (#6332)
9e3cf89 Create interpreting.csl (#6254)
bef74ed Create conservation-science-and-practice.csl (#6258)
9fb7eb7 Bug fixes triangle.csl (#6251)
e6112ba Update ucl-university-college-apa.csl (#6250)
6dcba3a Update engineering-technology-and-applied-science-research.csl (#6247)
00fe4a2 Create constructivist-foundations.csl (#6243)
03ad71b Create les-mondes-du-travail.csl (#6234)
a2bce86 Corrections based on author instructions (#6306)

git-subtree-dir: buildres/csl/csl-styles
git-subtree-split: f6c778e
koppor pushed a commit that referenced this pull request Dec 17, 2022
84dba23 Update international-union-of-crystallography.csl (#6279)
13dd9e8 Update zeitschrift-fur-deutsche-philologie.csl (#6340)
d95b652 Create cahiers-mondes-anciens.csl (#6203)
ded567c Create rassegna-degli-archivi-di-stato-bibliografia-generale.csl (#6275)
124777a Create scientific-online-letters-on-the-atmosphere.csl (#6261)
3c276e7 Create american-medical-association-no-url-alphabetical.csl (#6252)
595ad95 Bump mathieudutour/github-tag-action from 6.0 to 6.1 (#6287)
7008128 Create cambridge-a (#6336)
17e930c Update norsk-apa-manual-note.csl (#6338)
b360859 Update norsk-apa-manual.csl (#6337)
f6c778e Update emerald-harvard.csl (#6335)
d6c6a16 Fix Brazilian quotes on  chicago-author-date.csl (#6317)
a1549b6 Update medizinische-hochschule-hannover.csl (#6330)
da88073 Update journal-of-the-american-college-of-cardiology.csl (#6334)
a520d8e Bump nokogiri from 1.13.9 to 1.13.10 (#6333)
ba54b44 Update royal-society-of-chemistry.csl (#6328)
1378ba7 LUSEM: Remove full stop (#6332)
9e3cf89 Create interpreting.csl (#6254)
bef74ed Create conservation-science-and-practice.csl (#6258)
9fb7eb7 Bug fixes triangle.csl (#6251)
e6112ba Update ucl-university-college-apa.csl (#6250)
6dcba3a Update engineering-technology-and-applied-science-research.csl (#6247)
00fe4a2 Create constructivist-foundations.csl (#6243)
03ad71b Create les-mondes-du-travail.csl (#6234)
a2bce86 Corrections based on author instructions (#6306)

git-subtree-dir: buildres/csl/csl-styles
git-subtree-split: 84dba23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quality Check: A DOI should not appear at two entries
5 participants