Initial Extraction From Dolomite Engine #1
Conversation
@fabianlim I would also suggest dropping
@aldopareja this is more or less ok, but missing the notices. what do we want to put in the header of every file? like this?
We should also add the same publishing CI that we use elsewhere in instructlab so that it's easy to get stuff published.
This one: instructlab/training#31
Sorry, not that one, this one: instructlab/training#42
Never mind, I created a PR for publishing here, ignore the above comments: #2
wow!!! 4k lines of code already :)
Yikes!! All of my comments were ignored 🤣
@mayank31398 I thought you only gave these 2 comments. Was there anything else?
same as above
Do we need this checkpointing logic? Or should this live in the instructlab training repo?
We can move this code to GPTDolomiteConfig, I think.
Since this repo is just GPTDolomite, maybe we don't need the CommonConfig class.
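For illustration, a minimal sketch of that suggestion, folding the base-class fields directly into GPTDolomiteConfig; the field names and defaults below are placeholders, not the actual CommonConfig attributes:

```python
# Hedged sketch: merge CommonConfig into GPTDolomiteConfig so only one config
# class remains. The fields and defaults here are illustrative placeholders.
from transformers import PretrainedConfig


class GPTDolomiteConfig(PretrainedConfig):
    model_type = "gpt_dolomite"

    def __init__(
        self,
        vocab_size: int = 50257,
        n_embd: int = 768,
        n_layer: int = 12,
        n_head: int = 12,
        position_embedding_type: str = "learned_absolute",
        normalization_function: str = "rmsnorm",
        **kwargs,
    ) -> None:
        self.vocab_size = vocab_size
        self.n_embd = n_embd
        self.n_layer = n_layer
        self.n_head = n_head
        self.position_embedding_type = position_embedding_type
        self.normalization_function = normalization_function
        super().__init__(**kwargs)
```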
Let's drop this since we are removing all the other normalization implementations.
do we need this as a file?
same as above
```python
class YaRNScaledRoPE(RoPE):
    def __init__(
        self,
        head_dim: int,
        max_position_embeddings: int = 2048,
        base: int = 10000,
        scale: float = 1,
        original_max_position_embeddings: int = 2048,
        extrapolation_factor: float = 1,
        attn_factor: float = 1,
        beta_fast: int = 32,
        beta_slow: int = 1,
    ) -> None:
        torch.nn.Module.__init__(self)

        self.head_dim = head_dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.scale = scale
        self.original_max_position_embeddings = original_max_position_embeddings
        self.extrapolation_factor = extrapolation_factor
        self.attn_factor = attn_factor
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow

        # Get n-d magnitude scaling corrected for interpolation
        self.mscale = _yarn_get_mscale(self.scale) * self.attn_factor

        self.reset_parameters()

    def reset_parameters(self) -> None:
        pos_freqs = self.base ** (
            torch.arange(0, self.head_dim, 2).float() / self.head_dim
        )
        inv_freq_extrapolation = 1.0 / pos_freqs
        inv_freq_interpolation = 1.0 / (self.scale * pos_freqs)

        low, high = _yarn_find_correction_range(
            self.beta_fast,
            self.beta_slow,
            self.head_dim,
            self.base,
            self.original_max_position_embeddings,
        )
        inv_freq_mask = (
            (1 - _yarn_linear_ramp_mask(low, high, self.head_dim // 2).float())
            * self.extrapolation_factor
        )  # Get n-d rotational scaling corrected for extrapolation
        inv_freq = (
            inv_freq_interpolation * (1 - inv_freq_mask)
            + inv_freq_extrapolation * inv_freq_mask
        )
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # pylint: disable=no-value-for-parameter
        self._set_cos_sin_cache(
            self.max_position_embeddings, dtype=torch.get_default_dtype()
        )
```
Let's drop YaRN; I haven't tested this, so I am not sure the logic is correct.
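For context on what would be dropped, here is a small standalone sketch of the frequency blending that `reset_parameters()` performs; the values and the constant mask are illustrative only (the real mask comes from the beta_fast/beta_slow ramp):

```python
import torch

# Illustrative values only; not taken from any shipped config.
head_dim, base, scale = 8, 10000, 4.0

pos_freqs = base ** (torch.arange(0, head_dim, 2).float() / head_dim)
inv_freq_extrapolation = 1.0 / pos_freqs            # plain RoPE frequencies
inv_freq_interpolation = 1.0 / (scale * pos_freqs)  # position-interpolated frequencies

# Placeholder mask; YaRN builds this from the beta_fast/beta_slow linear ramp.
inv_freq_mask = torch.full((head_dim // 2,), 0.5)
inv_freq = (
    inv_freq_interpolation * (1 - inv_freq_mask)
    + inv_freq_extrapolation * inv_freq_mask
)
print(inv_freq)  # blended per-dimension inverse frequencies
```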
```python
# Inverse dim formula to find dim based on number of rotations
def _yarn_find_correction_dim(
    num_rotations: int, dim: int, base: int = 10000, max_position_embeddings: int = 2048
) -> float:
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (
        2 * math.log(base)
    )


# Find dim range bounds based on rotations
def _yarn_find_correction_range(
    low_rot: int,
    high_rot: int,
    dim: int,
    base: int = 10000,
    max_position_embeddings: int = 2048,
) -> int:
    low = math.floor(
        _yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings)
    )
    high = math.ceil(
        _yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings)
    )
    return max(low, 0), min(high, dim - 1)  # Clamp values just in case


def _yarn_linear_ramp_mask(min: float, max: float, dim: int) -> torch.Tensor:
    if min == max:
        max += 0.001  # Prevent singularity

    linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
    ramp_func = torch.clamp(linear_func, 0, 1)
    return ramp_func


def _yarn_get_mscale(scale: float = 1) -> float:
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0
```
same as above comment about dropping yarn
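If it helps while deciding, here is a quick numeric check of the correction-range formulas above, using illustrative values (head_dim=128, base=10000, original context 2048, beta_fast=32, beta_slow=1):

```python
import math


def correction_dim(num_rotations, dim=128, base=10000, max_pos=2048):
    # Same inverse-dim formula as _yarn_find_correction_dim above.
    return (dim * math.log(max_pos / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


low = max(math.floor(correction_dim(32)), 0)       # beta_fast bound
high = min(math.ceil(correction_dim(1)), 128 - 1)  # beta_slow bound
print(low, high)  # roughly 16 and 41 for these values
```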
move CommonConfig logic to this class maybe?
This might also not be needed, since it is only used for gradient checkpointing / FSDP wrapping.
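For reference, roughly what that wrapping looks like when it is done outside the model instead; `GPTDolomiteBlock` is a hypothetical stand-in for this repo's transformer block class, `model` is the instantiated model, and an initialized torch.distributed process group is assumed:

```python
import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# GPTDolomiteBlock is a placeholder name; substitute the actual block class.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GPTDolomiteBlock},
)

# Assumes torch.distributed has already been initialized and `model` exists.
model = FSDP(model, auto_wrap_policy=wrap_policy)
```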
* addressed missed out comments in #1, except checkpointing
* ruff + lint
* removed gradient checkpointing
* moved config file and commented on rope scaling

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
This is the initial extraction from the dolomite engine repo.

Extracted models:
- hf_models/models/gpt_dolomite
- hf_models/models/moe_dolomite

Conversion from HF supported:
- hf_models/model_conversion/bigcode
- hf_models/model_conversion/llama
- hf_models/model_conversion/mixtral
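A hedged sketch of how the extracted model could then be exposed through the HF Auto classes; the import path and the `GPTDolomiteForCausalLM` name follow the dolomite engine naming and are assumptions, not confirmed for this repo:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical import path; adjust to however this package is actually laid out.
from gpt_dolomite import GPTDolomiteConfig, GPTDolomiteForCausalLM

AutoConfig.register("gpt_dolomite", GPTDolomiteConfig)
AutoModelForCausalLM.register(GPTDolomiteConfig, GPTDolomiteForCausalLM)

# After registration, configs/checkpoints with model_type "gpt_dolomite"
# resolve through the usual Auto classes.
model = AutoModelForCausalLM.from_config(GPTDolomiteConfig())
```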
TODO:
- modeling_utils/normalization/rmsnorm/torchtitan.py
- modeling_utils/normalization/rmsnorm/apex.py
- modeling_utils/normalization/layernorm/apex.py
- modeling_utils/normalization/layernorm/apex_persistent.py
- modeling_utils/embedding/ParameterizedEmbedding
- modeling_utils/linear/ParameterizedLinear