
Merge refactor to main #88

Merged
Merged 140 commits on Feb 13, 2023

Commits (140)
716523c
output callback can do multiple options
braceal Jan 27, 2023
70f4385
inference dataset for improved efficiency
braceal Jan 28, 2023
bda0435
remove inference logic from train script
braceal Jan 28, 2023
81216ee
typing
braceal Jan 28, 2023
2891e47
scalable inference output script
braceal Jan 28, 2023
7bb5b83
scalable inference output script
braceal Jan 28, 2023
a996b80
scalable inference output script
braceal Jan 28, 2023
f984039
scalable inference output script
braceal Jan 28, 2023
7bd9bdf
scalable inference output script
braceal Jan 28, 2023
b1d5736
scalable inference output script
braceal Jan 28, 2023
f789558
scalable inference output script
braceal Jan 28, 2023
8140c6d
scalable inference output script
braceal Jan 28, 2023
2a8aa2f
scalable inference output script
braceal Jan 28, 2023
38d7483
scalable inference output script
braceal Jan 28, 2023
4f63151
ragged
braceal Jan 30, 2023
20bfafa
Testing h5 distributed inference
KPHippe Jan 30, 2023
ce681a6
Testing h5 distributed inference
KPHippe Jan 30, 2023
05e44f1
Testing h5 distributed inference
KPHippe Jan 30, 2023
21efa90
Testing h5 distributed inference
KPHippe Jan 30, 2023
eb9f57a
Testing h5 distributed inference
KPHippe Jan 30, 2023
4931a17
Testing h5 distributed inference
KPHippe Jan 30, 2023
9060e62
Testing h5 distributed inference
KPHippe Jan 30, 2023
d28d006
Testing h5 distributed inference
KPHippe Jan 30, 2023
c05fe57
Testing h5 distributed inference
KPHippe Jan 31, 2023
591ee8d
Testing h5 distributed inference
KPHippe Jan 31, 2023
14b83b3
Testing h5 distributed inference
KPHippe Jan 31, 2023
8fd8e82
Testing h5 distributed inference
KPHippe Jan 31, 2023
9bd1723
Testing h5 distributed inference
KPHippe Jan 31, 2023
1d875a9
Testing h5 distributed inference
KPHippe Jan 31, 2023
ad9ba86
Testing h5 distributed inference
KPHippe Jan 31, 2023
0f3a84f
write indices to h5
braceal Jan 31, 2023
8284caa
layer bounds
braceal Jan 31, 2023
4d78b28
compression
braceal Jan 31, 2023
ccb9bd3
compression
braceal Jan 31, 2023
acbb436
bug fix. node local storgae
braceal Jan 31, 2023
306cc75
no compress
braceal Jan 31, 2023
e5b5236
dbg
braceal Jan 31, 2023
3f8e798
dbg
braceal Jan 31, 2023
1b9b5a0
dbg
braceal Jan 31, 2023
677bc00
dbg
braceal Jan 31, 2023
5ef7016
Removing unused code
KPHippe Jan 31, 2023
68f7790
Testing gather h5
KPHippe Jan 31, 2023
ce8e1a5
Testing gather h5
KPHippe Jan 31, 2023
f9a8393
Testing gather h5
KPHippe Jan 31, 2023
0efb987
Testing gather h5
KPHippe Jan 31, 2023
62910b1
Testing gather h5
KPHippe Jan 31, 2023
13189c5
Testing gather h5
KPHippe Jan 31, 2023
0254e1f
Testing gather h5
KPHippe Jan 31, 2023
6798377
Testing h5 distributed inference
KPHippe Jan 31, 2023
a008f88
Testing gather h5
KPHippe Jan 31, 2023
248f442
Setting seed to already existing embeddings
KPHippe Jan 31, 2023
4dd81f8
Testing gather h5
KPHippe Jan 31, 2023
5ba942d
Testing gather h5
KPHippe Feb 1, 2023
6beed1e
testing seq hashes
KPHippe Feb 1, 2023
df66282
testing seq hashes
KPHippe Feb 1, 2023
86389be
testing seq hashes
KPHippe Feb 1, 2023
462e20a
testing seq hashes
KPHippe Feb 1, 2023
2ff6e96
testing seq hashes, speedtest
KPHippe Feb 1, 2023
75f3672
testing seq hashes, speedtest
KPHippe Feb 1, 2023
cdeae91
testing seq hashes, speedtest
KPHippe Feb 1, 2023
95aa06c
testing logits, speedtest
KPHippe Feb 1, 2023
2d868a7
testing logits, speedtest
KPHippe Feb 1, 2023
6f77788
testing logits, speedtest
KPHippe Feb 1, 2023
be501f4
testing logits, speedtest
KPHippe Feb 1, 2023
6ce449a
testing logits, speedtest
KPHippe Feb 1, 2023
158d8d4
testing logits, speedtest
KPHippe Feb 1, 2023
3b82e4f
testing logits, speedtest
KPHippe Feb 1, 2023
8c3dd16
testing logits, speedtest
KPHippe Feb 1, 2023
8f0b338
testing logits, speedtest
KPHippe Feb 1, 2023
03f53bb
testing logits, speedtest
KPHippe Feb 1, 2023
c1209f1
testing logits, speedtest
KPHippe Feb 1, 2023
b4a4068
testing logits, speedtest
KPHippe Feb 1, 2023
ca42bcd
testing logits, speedtest
KPHippe Feb 1, 2023
cef05a4
testing logits, speedtest
KPHippe Feb 1, 2023
3198f0e
testing logits, speedtest
KPHippe Feb 1, 2023
6afb6c8
testing logits, speedtest
KPHippe Feb 1, 2023
4a9561c
testing logits, speedtest
KPHippe Feb 1, 2023
8d9ecb3
testing logits, speedtest
KPHippe Feb 1, 2023
fc66932
testing logits, speedtest
KPHippe Feb 2, 2023
68bed40
testing logits, speedtest
KPHippe Feb 2, 2023
263a1dc
testing logits, speedtest
KPHippe Feb 2, 2023
131d313
testing logits, speedtest
KPHippe Feb 2, 2023
d765c5e
counter speed test
KPHippe Feb 2, 2023
1eb2a2f
counter speed test
KPHippe Feb 2, 2023
1a159d4
counter speed test
KPHippe Feb 2, 2023
28c5aac
counter speed test
KPHippe Feb 2, 2023
8a03f73
testing layer bounds
KPHippe Feb 2, 2023
c9d3533
testing layer bounds
KPHippe Feb 2, 2023
b3bdd1e
testing layer bounds
KPHippe Feb 2, 2023
51ee154
testing layer bounds
KPHippe Feb 2, 2023
10ddda3
testing layer bounds
KPHippe Feb 2, 2023
596f9f1
testing layer bounds
KPHippe Feb 2, 2023
4db46f9
testing layer bounds
KPHippe Feb 2, 2023
6efa110
testing layer bounds
KPHippe Feb 2, 2023
ddd4b55
testing layer bounds
KPHippe Feb 2, 2023
08904d5
revising gather
KPHippe Feb 2, 2023
3560c0f
revising gather
KPHippe Feb 2, 2023
71c1852
revising gather
KPHippe Feb 2, 2023
4d0448e
need to squeeze embeddings
KPHippe Feb 2, 2023
5400398
adding hash directly to dataset
KPHippe Feb 2, 2023
a423038
adding hash directly to dataset
KPHippe Feb 2, 2023
0a0a315
adding hash directly to dataset
KPHippe Feb 2, 2023
909593e
adding hash directly to dataset
KPHippe Feb 2, 2023
9dce930
adding hash directly to dataset
KPHippe Feb 2, 2023
f4d19fc
raw seq hash
KPHippe Feb 2, 2023
54ca8c7
Silence read fasta
KPHippe Feb 2, 2023
d2e6c78
See if model can be warmed
KPHippe Feb 2, 2023
5fce549
See if model can be warmed
KPHippe Feb 2, 2023
4fca768
Testing bulk inference
KPHippe Feb 2, 2023
78de511
Testing bulk inference
KPHippe Feb 2, 2023
bb80912
Testing bulk inference
KPHippe Feb 2, 2023
6e86f00
Testing bulk inference
KPHippe Feb 2, 2023
855238d
Testing bulk inference
KPHippe Feb 2, 2023
8a9a8cd
Testing bulk inference
KPHippe Feb 2, 2023
ca8f9a3
Testing bulk inference
KPHippe Feb 2, 2023
55f5752
Testing bulk inference
KPHippe Feb 2, 2023
7e6a1c8
Testing bulk inference
KPHippe Feb 2, 2023
a46ee50
Need to see env vars
KPHippe Feb 3, 2023
46c686a
Need to see env vars
KPHippe Feb 3, 2023
30dca76
Need to see env vars
KPHippe Feb 3, 2023
857776b
Note that this script does not work as intended (at all)
KPHippe Feb 3, 2023
bfda2f4
Adding logic for gathering logits
KPHippe Feb 6, 2023
e4b79b5
Update logit glob pattern
KPHippe Feb 7, 2023
94711e9
Update logit outfile name
KPHippe Feb 7, 2023
95cb055
Exposing verbosity
KPHippe Feb 7, 2023
964d393
moving h5 to after predict start
KPHippe Feb 8, 2023
bc770c9
silence logging
KPHippe Feb 8, 2023
c90c482
silence logging
KPHippe Feb 8, 2023
43a3ae6
Testing failures
KPHippe Feb 10, 2023
5386d55
FIXING SEQ LEN
KPHippe Feb 10, 2023
d7852ac
Squeeze the life out of me
KPHippe Feb 10, 2023
8085ccc
Double checking sequence lengths
KPHippe Feb 10, 2023
89ecf4e
Verifying that the dataset (tokenizer) is and forever shall be correct
KPHippe Feb 10, 2023
f39aa6a
Refactor/cleanup for inference at scale
KPHippe Feb 13, 2023
14fc66b
Merge branch 'patch/output-callback' of github.com:ramanathanlab/gene…
KPHippe Feb 13, 2023
5019990
Remove bogus utils :)
KPHippe Feb 13, 2023
ef98ba2
Linting and documentation
KPHippe Feb 13, 2023
f6cfffb
Remove mean reduction references
KPHippe Feb 13, 2023
dbcbac1
Merge pull request #85 from ramanathanlab/patch/output-callback
KPHippe Feb 13, 2023
670d53b
Increase version corresponding to recent refactor
KPHippe Feb 13, 2023
2 changes: 1 addition & 1 deletion genslm/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "0.0.2a1"
+__version__ = "0.0.3a1"

# Public imports
from genslm.dataset import SequenceDataset # noqa
40 changes: 0 additions & 40 deletions genslm/cmdline/gather_embeddings.py

This file was deleted.

124 changes: 124 additions & 0 deletions genslm/cmdline/gather_inference_h5.py
@@ -0,0 +1,124 @@
"""
Gathers embeddings written by `run_inference.py`. Gathers
rank files into single h5py file with ExternalLinks to
the original files. This is necesary for matching new H5 files to original
fasta files, but makes the dataset brittle to being transferred to new locations. But if
we try and copy dataset to new file it becomes very very slow.

Current implementation coupled to the output format of `run_inference.py`.
"""
import re
from argparse import ArgumentParser
from pathlib import Path
from typing import Optional

import h5py


def gather_logits(
    input_dir: Path,
    output_path: Optional[Path] = None,
    glob_pattern: str = "logits-*.h5",
    verbose: bool = False,
) -> None:
    """Gather logits produced via DDP into a single h5 file."""
    if output_path is None:
        output_path = input_dir / "logits_gathered.h5"

    # Glob the logit and index files written by each rank
    input_files = list(input_dir.glob(glob_pattern))
    with h5py.File(output_path, "w") as output_file:
        output_file.create_group("logits")
        output_file.create_group("na-hashes")
        for h5_file in input_files:
            if verbose:
                print("Loading", h5_file)
            with h5py.File(h5_file, "r") as input_file:
                resolved_path = h5_file.resolve()

                # Link each per-sequence logit dataset back to the rank file
                for seq_fasta_index in input_file["logits"].keys():
                    output_file["logits"][str(seq_fasta_index)] = h5py.ExternalLink(
                        str(resolved_path), f"logits/{seq_fasta_index}"
                    )

                # Map fasta indices to the nucleic acid sequence hashes
                hashes = input_file["na-hashes"]
                indices = input_file["fasta-indices"]
                for fasta_idx, na_hash in zip(indices, hashes):
                    output_file["na-hashes"].create_dataset(
                        f"{fasta_idx}", data=na_hash
                    )
    if verbose:
        print("Wrote gathered output to", output_path, "\n")


def gather_embeddings(
    input_dir: Path,
    output_path: Optional[Path] = None,
    glob_pattern: Optional[str] = None,
    verbose: bool = False,
) -> None:
    """Gather embeddings produced via DDP into a single h5 file."""
    if glob_pattern is None:
        glob_pattern = "*.h5"

    if output_path is None:
        output_path = input_dir / "embeddings_gathered.h5"

    # Glob the embedding and index files written by each rank
    input_files = list(input_dir.glob(glob_pattern))
    with h5py.File(output_path, "w") as output_file:
        output_file.create_group("embeddings")
        output_file.create_group("na-hashes")
        for h5_file in input_files:
            if verbose:
                print("Loading", h5_file)
            with h5py.File(h5_file, "r") as input_file:
                resolved_path = h5_file.resolve()

                # Link each per-sequence embedding dataset back to the rank file
                for seq_fasta_index in input_file["embeddings"].keys():
                    output_file["embeddings"][seq_fasta_index] = h5py.ExternalLink(
                        str(resolved_path), f"embeddings/{seq_fasta_index}"
                    )

                # Map fasta indices to the nucleic acid sequence hashes
                hashes = input_file["na-hashes"]
                indices = input_file["fasta-indices"]
                for fasta_idx, na_hash in zip(indices, hashes):
                    output_file["na-hashes"].create_dataset(
                        f"{fasta_idx}", data=na_hash
                    )
    if verbose:
        print("Wrote gathered output to", output_path, "\n")


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("-i", "--input_dir", type=Path, required=True)
    parser.add_argument("-o", "--output_path", type=Path, required=True)
    parser.add_argument(
        "-g", "--embeddings_glob_pattern", type=str, default="embeddings-*.h5"
    )
    parser.add_argument("-l", "--logits_glob_pattern", type=str, default="logits-*.h5")
    parser.add_argument("--embeddings", action="store_true", help="Gather embeddings.")
    parser.add_argument("--logits", action="store_true", help="Gather logits.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Verbose output.")
    args = parser.parse_args()

    if args.embeddings:
        # Collect the set of layer indices present in the embedding file names
        files = list(args.input_dir.glob(args.embeddings_glob_pattern))
        layers = set()
        layer_pattern = re.compile(r"layer-(\d+)")
        for file in files:
            if "layer" in file.name:
                layer = layer_pattern.search(file.name).group(1)
                layers.add(layer)

        # Gather each layer's embeddings into its own output file
        for layer in layers:
            glob_pattern = f"*layer-{layer}*.h5"
            out_path = args.output_path / f"embeddings-gathered-layer-{layer}.h5"
            gather_embeddings(args.input_dir, out_path, glob_pattern, args.verbose)

    if args.logits:
        out_path = args.output_path / "logits-gathered.h5"
        gather_logits(args.input_dir, out_path, args.logits_glob_pattern, args.verbose)
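The ExternalLink indirection used above is what makes gathering cheap but the result brittle to relocation: the gathered file stores only pointers to the per-rank files. A minimal, self-contained sketch of that behavior (the file names and tiny dataset here are hypothetical, not actual `run_inference.py` output):

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

tmp = Path(tempfile.mkdtemp())

# Write a per-rank file with a layout like the one the script expects.
with h5py.File(tmp / "logits-0.h5", "w") as f:
    f.create_group("logits")
    f["logits"].create_dataset("0", data=np.arange(4, dtype=np.float32))

# The gathered file stores only an ExternalLink, not a copy of the data,
# so "gathering" never touches the (potentially huge) arrays themselves.
with h5py.File(tmp / "logits_gathered.h5", "w") as f:
    f.create_group("logits")
    f["logits"]["0"] = h5py.ExternalLink(str(tmp / "logits-0.h5"), "logits/0")

# Reads follow the link transparently -- but fail if the per-rank file
# is moved, which is the brittleness the module docstring warns about.
with h5py.File(tmp / "logits_gathered.h5", "r") as f:
    data = f["logits"]["0"][...]

print(data.tolist())  # [0.0, 1.0, 2.0, 3.0]
```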
183 changes: 0 additions & 183 deletions genslm/cmdline/inference_outputs.py

This file was deleted.
