Update score log structure #49

nebfield · 2024-09-13T16:30:57Z

What's changed

Added pydantic models for:

Custom scoring file headers
PGS Catalog scoring file headers (inherited from the base header model, could be reused for validation 👀 )
A score log (before we chucked a mix of useful data into a JSON file)

Integrated these models with the ScoringFile object and combine CLI

There is a breaking change in log structure. Changes include:

New compatible_effect_type field (bool)
The dynamic accession field ( {'PGSXXXX': { nested_dict }) has been made static because of pydantic limitations (replaced with "header" field)

Why changed

#46 (comment)

Example log file

PGS000001 is normalised OK and will be in the output, PGS004255 is ignored but warned about.

[
  {
    "header": {
      "pgs_id": "PGS000001",
      "pgs_name": "PRS77_BC",
      "trait_reported": "Breast cancer",
      "genome_build": "NR",
      "format_version": "2.0",
      "trait_mapped": [
        "breast carcinoma"
      ],
      "trait_efo": [
        "EFO_0000305"
      ],
      "variants_number": 77,
      "weight_type": null,
      "pgp_id": "PGP000001",
      "citation": "Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036",
      "HmPOS_build": "GRCh38",
      "HmPOS_date": "2022-07-29",
      "HmPOS_match_pos": "{\"True\": null, \"False\": null}",
      "HmPOS_match_chr": "{\"True\": null, \"False\": null}",
      "license": "PGS obtained from the Catalog should be cited appropriately, and used in accordance with any licensing restrictions set by the authors. See EBI Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional details."
    },
    "compatible_effect_type": true,
    "pgs_id": "PGS000001",
    "is_harmonised": true,
    "sources": [
      "ENSEMBL"
    ]
  },
  {
    "header": {
      "pgs_id": "PGS004255",
      "pgs_name": "GenoBoost_rheumatoid_arthritis_0",
      "trait_reported": "Rheumatoid arthritis",
      "genome_build": "GRCh37",
      "format_version": "2.0",
      "trait_mapped": [
        "rheumatoid arthritis"
      ],
      "trait_efo": [
        "EFO_0000685"
      ],
      "variants_number": 30,
      "weight_type": "beta",
      "pgp_id": "PGP000546",
      "citation": "Ohta R et al. Nat Commun (2024). doi:10.1038/s41467-024-48654-x",
      "HmPOS_build": "GRCh38",
      "HmPOS_date": "2024-06-11",
      "HmPOS_match_pos": "{\"True\": null, \"False\": null}",
      "HmPOS_match_chr": "{\"True\": null, \"False\": null}",
      "license": "PGS obtained from the Catalog should be cited appropriately, and used in accordance with any licensing restrictions set by the authors. See EBI Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional details."
    },
    "compatible_effect_type": false,
    "pgs_id": "PGS004255",
    "is_harmonised": true,
    "sources": null
  }
]

codecov · 2024-09-13T16:31:17Z

Codecov Report

Attention: Patch coverage is 95.85253% with 9 lines in your changes missing coverage. Please review.

Project coverage is 89.51%. Comparing base (aee104f) to head (435e692).

Files with missing lines	Patch %	Lines
pgscatalog.core/src/pgscatalog/core/lib/models.py	94.88%	9 Missing ⚠️

Additional details and impacted files

@@                 Coverage Diff                 @@
##           fix-nonadditive      #49      +/-   ##
===================================================
+ Coverage            89.18%   89.51%   +0.33%     
===================================================
  Files                   40       39       -1     
  Lines                 2320     2413      +93     
===================================================
+ Hits                  2069     2160      +91     
- Misses                 251      253       +2

Flag	Coverage Δ
pgscatalog.core	`90.68% <95.85%> (+0.52%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

nebfield · 2024-09-13T16:32:06Z

pgscatalog.core/src/pgscatalog/core/lib/scorefiles.py

-        "pgs_id",
-        "pgp_id",
-        "pgs_name",
-        "genome_build",
-        "variants_number",
-        "trait_reported",
-        "trait_efo",
-        "trait_mapped",
-        "weight_type",
-        "citation",
-        "HmPOS_build",
-        "HmPOS_date",
-        "format_version",
-        "license",
-    )
-
-    _default_license_text = (
-        "PGS obtained from the Catalog should be cited appropriately, and "
-        "used in accordance with any licensing restrictions set by the authors. See "
-        "EBI "
-        "Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional "
-        "details."
-    )
-
-    def __init__(
-        self,
-        *,
-        pgs_name,
-        genome_build,
-        pgs_id=None,
-        pgp_id=None,
-        variants_number=None,
-        trait_reported=None,
-        trait_efo=None,
-        trait_mapped=None,
-        weight_type=None,
-        citation=None,
-        HmPOS_build=None,
-        HmPOS_date=None,
-        format_version=None,
-        license=None,
-    ):
-        """kwargs are forced because this is a complicated init and from_path() is
-        almost always the correct thing to do.
-
-        Mandatory/optional fields in a header are less clear than columns. The
-        Catalog provides this information automatically but scoring files from other
-        places might not.
-
-        We don't want to annoy people and force them to reformat their custom scoring
-        files, but we do require some minimum information for the calculator,
-        so pgs_name and genome_build are mandatory.
-        """
-        self.pgs_name = pgs_name
-        self.genome_build = GenomeBuild.from_string(genome_build)
-
-        if self.genome_build is None:
-            # try overwriting with harmonised data
-            self.genome_build = GenomeBuild.from_string(HmPOS_build)
-
-        if self.pgs_name is None:
-            raise ValueError("pgs_name cannot be None")
-
-        # the rest of these fields are optional
-        self.pgs_id = pgs_id
-        self.pgp_id = pgp_id
-
-        try:
-            self.variants_number = int(variants_number)
-        except TypeError:
-            self.variants_number = None
-
-        self.trait_reported = trait_reported
-        self.trait_efo = trait_efo
-        self.trait_mapped = trait_mapped
-        self.weight_type = weight_type
-        self.citation = citation
-        self.HmPOS_build = GenomeBuild.from_string(HmPOS_build)
-        self.HmPOS_date = HmPOS_date
-        self.format_version = format_version
-        self.license = license
-
-        if self.license is None:
-            self.license = self._default_license_text
-
-    def __repr__(self):
-        values = {x: getattr(self, x) for x in self.fields}
-        value_strings = ", ".join([f"{key}='{value}'" for key, value in values.items()])
-        return f"{type(self).__name__}({value_strings})"
-
-    @classmethod
-    def from_path(cls, path):
-        raw_header: dict = read_header(path)
-
-        if len(raw_header) == 0:
-            raise ValueError(f"No header detected in scoring file {path=}")
-
-        header_dict = {k: raw_header.get(k) for k in cls.fields}
-
-        return cls(**header_dict)


deleted because migrated to a pydantic model

nebfield · 2024-09-13T16:33:42Z

pgscatalog.core/src/pgscatalog/core/lib/scorefiles.py

+    @property
+    def pgs_id(self):
+        return self._pgs_id

-        if (build := self._header.HmPOS_build) is not None:
-            self.genome_build = build
-            self.harmonised = True
+    @property
+    def is_harmonised(self):
+        return self._header.is_harmonised
+
+    @property
+    def genome_build(self):
+        if self.is_harmonised:
+            return self._header.HmPOS_build
        else:
-            self.genome_build = self._header.genome_build
-            self.harmonised = False
+            return self._header.genome_build
+
+    @property
+    def header(self):
+        return self._header


I added these properties because they're a really helpful shortcut in the CLI

nebfield · 2024-09-13T16:34:37Z

pgscatalog.core/src/pgscatalog/core/lib/scorefiles.py

-    def get_log(self, drop_missing=False, variant_log=None):
-        """Create a JSON log from a ScoringFile's header and variant rows."""
-
-        logger.debug(f"Creating JSON log for {self!r}")
-
-        log = {}
-
-        for attr in self._header.fields:
-            if (extracted_attr := getattr(self._header, attr, None)) is not None:
-                log[attr] = str(extracted_attr)
-            else:
-                log[attr] = None
-
-        if log["variants_number"] is None:
-            # custom scoring files might not have this information
-            log["variants_number"] = variant_log["n_variants"]
-
-        if (
-            variant_log is not None
-            and int(log["variants_number"]) != variant_log["n_variants"]
-            and not drop_missing
-        ):
-            logger.warning(
-                f"Mismatch between header ({log['variants_number']}) and output row count ({variant_log['n_variants']}) for {self.pgs_id}"
-            )
-            logger.warning(
-                "This can happen with older scoring files in the PGS Catalog (e.g. PGS000028)"
-            )
-
-        # multiple terms may be separated with a pipe
-        if log["trait_mapped"]:
-            log["trait_mapped"] = log["trait_mapped"].split("|")
-
-        if log["trait_efo"]:
-            log["trait_efo"] = log["trait_efo"].split("|")
-
-        log["columns"] = self._fields
-        log["use_harmonised"] = self.harmonised
-
-        if variant_log is not None:
-            log["sources"] = [k for k, v in variant_log.items() if k != "n_variants"]
-
-        return {self.pgs_id: log}
-


Replaced by the pydantic models + model_dump()

nebfield · 2024-09-13T16:35:31Z

pgscatalog.core/src/pgscatalog/core/lib/models.py

+class ScoreVariant(CatalogScoreVariant):
+    """This model includes attributes useful for processing and normalising variants
+
+    >>> variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "row_nr": 0, "accession": "test"})
+    >>> variant  # doctest: +ELLIPSIS
+    ScoreVariant(rsID=None, chr_name='1', chr_position=1, effect_allele=Allele(allele='A', ...
+    >>> variant.is_complex
+    False
+    >>> variant.is_non_additive
+    False
+    >>> variant.is_harmonised
+    False
+    >>> variant.effect_type
+    EffectType.ADDITIVE
+
+    >>> variant_missing_positions = ScoreVariant(**{"rsID": None, "chr_name": None, "chr_position": None, "effect_allele": "A", "effect_weight": 0.5,  "row_nr": 0, "accession": "test"}) # doctest: +ELLIPSIS
+    Traceback (most recent call last):
+    ...
+    TypeError: Bad position: self.rsID=None, self.chr_name=None, self.chr_position=None
+
+    >>> harmonised_variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "hm_chr": "1", "hm_pos": 1, "hm_rsID": "rs1921", "hm_source": "ENSEMBL",  "row_nr": 0, "accession": "test"})
+    >>> harmonised_variant.is_harmonised
+    True
+
+    >>> variant_badly_harmonised = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "hm_chr": None, "hm_pos": None, "hm_rsID": "rs1921", "hm_source": "ENSEMBL",  "row_nr": 0, "accession": "test"})
+    Traceback (most recent call last):
+    ...
+    TypeError: Missing harmonised column data: hm_chr
+
+    >>> variant_nonadditive = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "dosage_0_weight": 0, "dosage_1_weight": 1,  "row_nr": 0, "accession": "test"})
+    >>> variant_nonadditive.is_non_additive
+    True
+    >>> variant_nonadditive.is_complex
+    False
+    >>> variant_nonadditive.effect_type
+    EffectType.NONADDITIVE
+
+    >>> variant_complex = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "is_haplotype": True,  "row_nr": 0, "accession": "test"})
+    >>> variant_complex.is_complex
+    True
+    """
+
+    model_config = ConfigDict(use_enum_values=True)
+
+    row_nr: int = Field(
+        title="Row number",
+        description="Row number of variant in scoring file (first variant = 0)",
+    )
+    accession: str = Field(title="Accession", description="Accession of score variant")
+    is_duplicated: Optional[bool] = Field(
+        default=False,
+        title="Duplicated variant",
+        description="In a list of variants with the same accession, is ID duplicated?",
+    )
+
+    # column names for output are used by __iter__ and when writing out
+    output_fields: ClassVar[tuple[str]] = (
+        "chr_name",
+        "chr_position",
+        "effect_allele",
+        "other_allele",
+        "effect_weight",
+        "effect_type",
+        "is_duplicated",
+        "accession",
+        "row_nr",
+    )
+
+    def __iter__(self):
+        for attr in self.output_fields:
+            yield getattr(self, attr)


thought it would be best to keep all pydantic models in this module 🤔

nebfield added 4 commits September 12, 2024 12:01

include example of validating a scoring file header

a08e196

update header models

52f8a2e

integrate ScoreLog and ScoreLogs

8507b19

fix doctest imports

84a68c4

nebfield commented Sep 13, 2024

View reviewed changes

add log warning re: missing variants

eb8fbaa

nebfield marked this pull request as ready for review September 16, 2024 08:33

nebfield added 2 commits September 16, 2024 11:00

test log output with incompatible effect type

aa095c8

make ScoreLogs a RootModel to act as a json list

435e692

nebfield requested a review from fyvon September 16, 2024 10:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update score log structure #49

Update score log structure #49

nebfield commented Sep 13, 2024 •

edited

Loading

codecov bot commented Sep 13, 2024 •

edited

Loading

nebfield Sep 13, 2024

nebfield Sep 13, 2024

nebfield Sep 13, 2024

nebfield Sep 13, 2024

Update score log structure #49

Are you sure you want to change the base?

Update score log structure #49

Conversation

nebfield commented Sep 13, 2024 • edited Loading

What's changed

Why changed

Example log file

codecov bot commented Sep 13, 2024 • edited Loading

Codecov Report

nebfield Sep 13, 2024

Choose a reason for hiding this comment

nebfield Sep 13, 2024

Choose a reason for hiding this comment

nebfield Sep 13, 2024

Choose a reason for hiding this comment

nebfield Sep 13, 2024

Choose a reason for hiding this comment

nebfield commented Sep 13, 2024 •

edited

Loading

codecov bot commented Sep 13, 2024 •

edited

Loading