Smiles importer test for improvements #31

JonasSchaub · 2024-01-25T12:26:31Z

@SamuelBehr what was the purpose of this branch / these changes? Are these changes worth keeping/merging? Do you still need to do something here?

FelixBaensch · 2024-02-20T15:43:32Z

Please relax the restriction. Why should a file be limited to 2 or 3 entries per line?
And check more (maybe 10) lines to identify the delimiter.

JonasSchaub · 2024-02-22T12:55:06Z

@FelixBaensch

> Please relax the restriction. Why should a file be limited to 2 or 3 entries per line?

The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?

> And check more (maybe 10) lines to identify the delimiter.

Do you think a SMILES file can have such an extensive header? Or am I missing the point?

SamuelBehr · 2024-02-22T13:49:45Z

Raising the number of lines checked when identifying the delimiter would also statistically increase the risk of falsely identifying a file as a SMILES file or of identifying the wrong delimiter. It would also increase the time the delimiter identification could take in a worst-case scenario. So I would not raise this number to infinity ...

I chose the number of 3 lines to cover the case of a header line followed by a blank line. Increasing the number further, to maybe 5, would also account for the risk of lines with unparseable SMILES codes.

FelixBaensch · 2024-02-22T14:22:44Z

> The file can have as many columns as it wants, but only the first 2 (or 3) are checked for valid SMILES strings. The remaining columns are ignored. What do you think about this?

Sounds good.

> Do you think a SMILES file can have such an extensive header? Or am I missing the point?

I am not talking about a header, but about a large file with many SMILES and an undefined number of lines that do not contain any parsable SMILES. However, the valid SMILES should still be imported. I think testing only the first 3 of 500 lines, i.e. not even 1%, is too little. What is the problem with testing the first 10? In terms of time, this should make no difference.

> I chose the number of 3 lines to cover the case of a header line followed by a blank line. Increasing the number further, to maybe 5, would also account for the risk of lines with unparseable SMILES codes.

As mentioned above, I don't see any problems with non-parsable SMILES.
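
To make the agreed approach concrete, here is a minimal sketch of separator detection that inspects at most 10 lines and probes only the first 3 columns of each line. It is not the code from this PR: the class `SeparatorSniffer`, its constants, the order of the candidate separators, and the `isSmiles` predicate (a stand-in for an actual SMILES parser) are all assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the DynamicSMILESFileReader code from this PR. */
final class SeparatorSniffer {
    /** Candidate separators in an assumed test order; a one-column file matches none of them. */
    static final String[] CANDIDATE_SEPARATORS = {"\t", ";", ",", " "};
    /** Upper bound on the number of lines inspected (the value 10 discussed above). */
    static final int MAX_LINES_TO_CHECK = 10;
    /** Only the first columns of a line are probed for a SMILES code. */
    static final int MAX_COLUMNS_TO_CHECK = 3;

    /**
     * Returns the first candidate separator that splits one of the checked lines
     * into at least two columns, one of the first three of which is accepted by
     * isSmiles. Returns null if no such separator is found.
     */
    static String detectSeparator(List<String> lines, Predicate<String> isSmiles) {
        int limit = Math.min(lines.size(), MAX_LINES_TO_CHECK);
        for (int i = 0; i < limit; i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no information about the separator
            }
            for (String separator : CANDIDATE_SEPARATORS) {
                // A limit of columns-to-check + 1 keeps all surplus columns in one trailing chunk.
                String[] columns = line.split(Pattern.quote(separator), MAX_COLUMNS_TO_CHECK + 1);
                if (columns.length < 2) {
                    continue; // this separator does not occur in the line
                }
                int columnsToCheck = Math.min(columns.length, MAX_COLUMNS_TO_CHECK);
                for (int c = 0; c < columnsToCheck; c++) {
                    if (isSmiles.test(columns[c].trim())) {
                        return separator;
                    }
                }
            }
        }
        return null; // possibly a one-column file, or not a SMILES file at all
    }
}
```

Unparsable lines simply fail the predicate and the loop moves on, so they cost one split per candidate separator and nothing more. A `null` result leaves it to the caller to decide whether to treat the file as a single SMILES column.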

…or SMILES file separators; increased maximum nr of lines of SMILES files to check to 10; reworking of SMILES file import code, WIP; identified bug at parsing molecule from String containing SMILES, tests are failing;
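
The parsing bug mentioned in this commit message comes from the parser accepting a string such as `CCO ethanol` because it stops reading at the first whitespace character. One way to guard against this is to check the whole candidate string against a set of allowed characters before handing it to the parser. The sketch below illustrates that idea only; the class `SmilesCharacterGuard` and its character set are assumptions, and the set is not claimed to be complete or identical to the one used in this PR.

```java
/** Hypothetical pre-check for illustration; the allowed character set is an assumption. */
final class SmilesCharacterGuard {
    /** Letters, digits, and a non-exhaustive selection of SMILES punctuation; no whitespace. */
    private static final String ALLOWED_CHARACTERS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
            + "[]()=#$:/\\.%*+-@";

    /**
     * Returns true only if every character of the candidate is in the allowed set.
     * A string containing whitespace is rejected as a whole, so a parser that stops
     * at the first whitespace character never gets to accept its leading part.
     */
    static boolean containsOnlySmilesCharacters(String candidate) {
        if (candidate == null || candidate.isEmpty()) {
            return false;
        }
        for (int i = 0; i < candidate.length(); i++) {
            if (ALLOWED_CHARACTERS.indexOf(candidate.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

This check is deliberately weaker than parsing: it accepts any word made of letters, so it only makes sense as a filter in front of a real SMILES parser, not as a replacement for one.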

FelixBaensch · 2024-02-27T08:00:10Z

src/test/java/de/unijena/cheminf/mortar/model/io/DynamicSMILESFileReaderTest.java

+        tmpMolSet = DynamicSMILESFileReader.readFile(Paths.get(tmpURL.toURI()).toFile(), tmpFormat);
+        Assertions.assertEquals(50, tmpMolSet.getAtomContainerCount());
+        Assertions.assertEquals("CNP0000001", tmpMolSet.getAtomContainer(0).getProperty(Importer.MOLECULE_NAME_PROPERTY_KEY));
+    }


I think it would be better to split this test (above) into several smaller tests.

Fixed in 7145dec

FelixBaensch · 2024-02-27T08:10:12Z

src/main/java/de/unijena/cheminf/mortar/model/io/DynamicSMILESFileReader.java

+            String tmpSmilesFileFirstLine = tmpSmilesFileCurrentLine;
+            findSeparatorLoop:
+            while (!Thread.currentThread().isInterrupted() && tmpCurrentLineInFileCounter < DynamicSMILESFileReader.MAXIMUM_LINE_NUMBER_TO_CHECK_IN_SMILES_FILES) {
+                tmpSmilesFileCurrentLine = tmpSmilesFileBufferedReader.readLine();


Why is the first line skipped?

Because it might be a header line that does not contain a SMILES code. But we can actually try to parse it anyway. Fixed in cd86c88.

FelixBaensch · 2024-02-27T08:11:04Z

src/main/java/de/unijena/cheminf/mortar/model/io/DynamicSMILESFileReader.java

+     * are tested first.
+     */
+    public static final String[] POSSIBLE_SMILES_FILE_SEPARATORS = {"\n", ",", ";", " ", "\t"};
+    //


Newline as a separator?

This was a workaround for files that have only the SMILES column and hence no separator. Yes, it was dirty, so it is fixed in cd86c88.

FelixBaensch · 2024-02-27T08:11:28Z

src/main/java/de/unijena/cheminf/mortar/model/io/DynamicSMILESFileReader.java

+    private static final Logger LOGGER = Logger.getLogger(DynamicSMILESFileReader.class.getName());
+    //
+    private DynamicSMILESFileReader() {
+


Fixed in cd86c88, the constructor actually does something now.

JonasSchaub · 2024-02-27T16:21:50Z

Still to do:

- look at / fix SonarCloud issues
- think about how to report faulty molecules that could not be parsed from the input file
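
One possible shape for reporting faulty entries is sketched below: the reader collects the 1-based numbers of all lines it had to skip alongside the accepted SMILES codes, so the caller can log or display them afterwards. This is only a sketch under assumptions, not the solution adopted in this PR: the class `SkippedLineReport` is hypothetical, the SMILES code is assumed to sit in the first column, and `isSmiles` again stands in for a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical report object for illustration; not part of the MORTAR code base. */
final class SkippedLineReport {
    /** SMILES codes accepted by the predicate, in file order. */
    final List<String> acceptedSmiles = new ArrayList<>();
    /** 1-based numbers of lines whose first column was not accepted. */
    final List<Integer> skippedLineNumbers = new ArrayList<>();

    /**
     * Splits each line at the given separator (null means one-column file),
     * trims the first column, and records it as accepted or skipped.
     */
    static SkippedLineReport read(List<String> lines, String separator, Predicate<String> isSmiles) {
        SkippedLineReport report = new SkippedLineReport();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            String firstColumn = separator == null
                    ? line
                    : line.split(Pattern.quote(separator), 2)[0].trim();
            if (isSmiles.test(firstColumn)) {
                report.acceptedSmiles.add(firstColumn);
            } else {
                report.skippedLineNumbers.add(i + 1); // blank and unparsable lines end up here
            }
        }
        return report;
    }
}
```

Keeping line numbers rather than only a counter lets a log message or dialog point the user at the exact entries that failed.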

JonasSchaub · 2024-02-28T14:44:36Z

@FelixBaensch and @SamuelBehr, if you have time and motivation, you are welcome to re-review these changes. Otherwise, please simply approve, thank you.

sonarcloud · 2024-02-29T13:00:05Z

Quality Gate passed

Issues
0 New issues
16 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

SamuelBehr added 2 commits January 5, 2023 12:47

Made files with three columns readable; test method still gives the expected results, but further testing should be done

949a302

Updated the test method and added an additional test file to the resources

a253eff

JonasSchaub assigned JonasSchaub and SamuelBehr Jan 25, 2024

JonasSchaub mentioned this pull request Feb 20, 2024

Allow more elements per line in SMILES file #40

Closed

JonasSchaub removed their assignment Feb 20, 2024

FelixBaensch requested review from SamuelBehr and removed request for SamuelBehr February 20, 2024 15:43

JonasSchaub added 2 commits February 22, 2024 11:15

Merge branch 'production' into SMILES-Importer_TestForImprovements

ee994ea

# Conflicts: # src/main/java/de/unijena/cheminf/mortar/model/io/Importer.java # src/test/java/de/unijena/cheminf/mortar/model/io/ImporterTest.java

Fix spotless complaint

e9d018d

JonasSchaub added 9 commits February 22, 2024 17:04

Getting closer to the problem, the SmilesParser parses only until the first white space character, even if the input string goes on after that, this creates the problems;

96d804a

WIP added check for only valid SMILES characters in potential SMILES strings;

5f30157

Added + and @ as possible SMILES characters;

c7950fc

Removed double % (thank you, SonarCloud);

818105d

Removed vertical bar as possible SMILES file separator because it is used for Chemaxon cxSMILES;

5ddb218

Moved SMILES file parsing to separate class, so far mostly unchanged, WIP;

a33557e

WIP but we are getting somewhere, tests are passing again;

6c822b4

WIP started clean-up;

2b45f01

FelixBaensch reviewed Feb 27, 2024

JonasSchaub added 2 commits February 27, 2024 13:09

Separated tests into multiple methods;

7145dec

First line is not skipped anymore, workaround for one-column-files in the separators removed, clean-up, and documentation;

cd86c88

JonasSchaub added 2 commits February 28, 2024 13:10

Added more trim statements for input file strings because whitespace-word can be interpreted as an empty SMILES code followed by the name of the structure

44089b0

Added .csv to the list of possible input file types;

9353334

JonasSchaub added 3 commits February 28, 2024 14:19

Extended doc;

35c0294

Fixed some SonarCloud complaints;

20276da

Fixed counters for skipped lines and aligned the logging behaviour with the other import methods, primarily SDF import;

778f58a

JonasSchaub assigned JonasSchaub and unassigned SamuelBehr Feb 28, 2024

JonasSchaub requested review from FelixBaensch and SamuelBehr February 28, 2024 14:42

JonasSchaub added the bug (Something isn't working) and enhancement (New feature or request) labels Feb 28, 2024

JonasSchaub added 2 commits February 28, 2024 16:24

Fixed a bug and added the file that helped me detect it as a test;

bb1be22

Small final adjustments and improvements; ready for review;

da86cbd

FelixBaensch approved these changes Mar 1, 2024

JonasSchaub merged commit d2f3481 into production Mar 1, 2024
2 checks passed

JonasSchaub deleted the SMILES-Importer_TestForImprovements branch March 1, 2024 16:19
