Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smiles importer test for improvements #31

Merged
merged 22 commits into from
Mar 1, 2024

Conversation

JonasSchaub
Copy link
Collaborator

@SamuelBehr what was the purpose of this branch / these changes? Are these changes worth keeping/merging? Do you still need to do sth here?

@FelixBaensch
Copy link
Owner

Please relax the restriction, why should a file be limited to 2 or 3 entries per line.
And check more (maybe 10) lines to identify the delimiter.

@FelixBaensch FelixBaensch requested review from SamuelBehr and removed request for SamuelBehr February 20, 2024 15:43
# Conflicts:
#	src/main/java/de/unijena/cheminf/mortar/model/io/Importer.java
#	src/test/java/de/unijena/cheminf/mortar/model/io/ImporterTest.java
@JonasSchaub
Copy link
Collaborator Author

@FelixBaensch

Please relax the restriction, why should a file be limited to 2 or 3 entries per line.

The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?

And check more (maybe 10) lines to identify the delimiter.

Do you think a SMILES file can have such an extensive header? Or am I missing the point?

@SamuelBehr
Copy link
Collaborator

Raising the number of lines to be checked when identifying the delimiter would also statistically increase the risk of falsely identifying a file as SMILES file or of identifying the false delimiter. It would also increase the time consumption the delimiter identification could take in a worst case scenario. So I would not raise this number to infinity ...

I chose the number of 3 lines to cover the case of potentially having a headline followed up by a blank line. Increasing the number further to a count of maybe 5 would also take the risk of having lines with unparseable SMILES codes into account.

@FelixBaensch
Copy link
Owner

FelixBaensch commented Feb 22, 2024

The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?

Sounds good

Do you think a SMILES file can have such an extensive header? Or am I missing the point?

I am not talking about a header, but about a large file with many SMILES and an undefined number of lines that do not contain any parsable SMILES. However, the valid SMILES should still be imported. I think testing only the first 3 of 500 lines, i.e. not even 1%, is too small. What is the problem with testing the first 10? In terms of time, this should have no influence.

I chose the number of 3 lines to cover the case of potentially having a headline followed up by a blank line. Increasing the number further to a count of maybe 5 would also take the risk of having lines with unparseable SMILES codes into account.

As mentioned above I don't see any problems with non-parsable SMILES.

…or SMILES file separators; increased maximum nr of lines of SMILES files to check to 10; reworking of SMILES file import code, WIP; identified bug at parsing molecule from String containing SMILES, tests are failing;
… first white space character, even if the input string goes on after that, this creates the problems;
tmpMolSet = DynamicSMILESFileReader.readFile(Paths.get(tmpURL.toURI()).toFile(), tmpFormat);
Assertions.assertEquals(50, tmpMolSet.getAtomContainerCount());
Assertions.assertEquals("CNP0000001", tmpMolSet.getAtomContainer(0).getProperty(Importer.MOLECULE_NAME_PROPERTY_KEY));
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to split this test (above) into several smaller tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7145dec

String tmpSmilesFileFirstLine = tmpSmilesFileCurrentLine;
findSeparatorLoop:
while (!Thread.currentThread().isInterrupted() && tmpCurrentLineInFileCounter < DynamicSMILESFileReader.MAXIMUM_LINE_NUMBER_TO_CHECK_IN_SMILES_FILES) {
tmpSmilesFileCurrentLine = tmpSmilesFileBufferedReader.readLine();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the first line skipped?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it might be a headline not containing a SMILES code. But we can actually try to parse it anyway. Fixed in cd86c88

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

* are tested first.
*/
public static final String[] POSSIBLE_SMILES_FILE_SEPARATORS = {"\n", ",", ";", " ", "\t"};
//
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline as a separator?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a workaround for files that have only the SMILES column, hence no separator. Yes, it was dirty, so fixed in cd86c88

private static final Logger LOGGER = Logger.getLogger(DynamicSMILESFileReader.class.getName());
//
private DynamicSMILESFileReader() {

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cd86c88, constructor actually does sth now.

@JonasSchaub
Copy link
Collaborator Author

Still to do:

  • look at / fix SonarCloud issues
  • think about how to report faulty molecules that could not be parsed from the input file.

@JonasSchaub JonasSchaub added bug Something isn't working enhancement New feature or request labels Feb 28, 2024
@JonasSchaub
Copy link
Collaborator Author

@FelixBaensch and @SamuelBehr , if you have time and motivation, you are welcome to re-review these changes. Otherwise, please simply approve, thank you.

Copy link

sonarcloud bot commented Feb 29, 2024

Quality Gate Passed Quality Gate passed

Issues
0 New issues
16 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@JonasSchaub JonasSchaub merged commit d2f3481 into production Mar 1, 2024
2 checks passed
@JonasSchaub JonasSchaub deleted the SMILES-Importer_TestForImprovements branch March 1, 2024 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants