-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smiles importer test for improvements #31
Conversation
…xpected results, but further testing should be done
Please relax the restriction, why should a file be limited to 2 or 3 entries per line. |
# Conflicts: # src/main/java/de/unijena/cheminf/mortar/model/io/Importer.java # src/test/java/de/unijena/cheminf/mortar/model/io/ImporterTest.java
The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?
Do you think a SMILES file can have such an extensive header? Or am I missing the point? |
Raising the number of lines to be checked when identifying the delimiter would also statistically increase the risk of falsely identifying a file as SMILES file or of identifying the false delimiter. It would also increase the time consumption the delimiter identification could take in a worst case scenario. So I would not raise this number to infinity ... I chose the number of 3 lines to cover the case of potentially having a headline followed up by a blank line. Increasing the number further to a count of maybe 5 would also take the risk of having lines with unparseable SMILES codes into account. |
Sounds good
I am not talking about a header, but about a large file with many SMILES and an undefined number of lines that do not contain any parsable SMILES. However, the valid SMILES should still be imported. I think testing only the first 3 of 500 lines, i.e. not even 1%, is too small. What is the problem with testing the first 10? In terms of time, this should have no influence.
As mentioned above I don't see any problems with non-parsable SMILES. |
…or SMILES file separators; increased maximum nr of lines of SMILES files to check to 10; reworking of SMILES file import code, WIP; identified bug at parsing molecule from String containing SMILES, tests are failing;
… first white space character, even if the input string goes on after that, this creates the problems;
…used for Chemaxon cxSMILES;
tmpMolSet = DynamicSMILESFileReader.readFile(Paths.get(tmpURL.toURI()).toFile(), tmpFormat); | ||
Assertions.assertEquals(50, tmpMolSet.getAtomContainerCount()); | ||
Assertions.assertEquals("CNP0000001", tmpMolSet.getAtomContainer(0).getProperty(Importer.MOLECULE_NAME_PROPERTY_KEY)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to split this test (above) into several smaller tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 7145dec
String tmpSmilesFileFirstLine = tmpSmilesFileCurrentLine; | ||
findSeparatorLoop: | ||
while (!Thread.currentThread().isInterrupted() && tmpCurrentLineInFileCounter < DynamicSMILESFileReader.MAXIMUM_LINE_NUMBER_TO_CHECK_IN_SMILES_FILES) { | ||
tmpSmilesFileCurrentLine = tmpSmilesFileBufferedReader.readLine(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the first line skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it might be a headline not containing a SMILES code. But we can actually try to parse it anyway. Fixed in cd86c88
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
* are tested first. | ||
*/ | ||
public static final String[] POSSIBLE_SMILES_FILE_SEPARATORS = {"\n", ",", ";", " ", "\t"}; | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline as a separator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a workaround for files that have only the SMILES column, hence no separator. Yes, it was dirty, so fixed in cd86c88
private static final Logger LOGGER = Logger.getLogger(DynamicSMILESFileReader.class.getName()); | ||
// | ||
private DynamicSMILESFileReader() { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in cd86c88, constructor actually does sth now.
… the separators removed, clean-up, and documentation;
Still to do:
|
…word can be interpreted as an empty SMILES code followed by the name of the structure
…th the other import methods, primarily SDF import;
@FelixBaensch and @SamuelBehr , if you have time and motivation, you are welcome to re-review these changes. Otherwise, please simply approve, thank you. |
Quality Gate passedIssues Measures |
@SamuelBehr what was the purpose of this branch / these changes? Are these changes worth keeping/merging? Do you still need to do sth here?