
Initial Extraction From Dolomite Engine #1

Merged: 14 commits merged into main from the initial branch on Jun 21, 2024
Conversation

@fabianlim (Collaborator) commented Jun 20, 2024

This is the initial extraction from the dolomite engine repo.

Extracted models:

  • hf_models/models/gpt_dolomite
  • hf_models/models/moe_dolomite

Conversion from HF is supported for the following (see the loading sketch after the TODO list):

  • hf_models/model_conversion/bigcode
  • hf_models/model_conversion/llama
  • hf_models/model_conversion/mixtral

TODO:

  • add CI (PyPI, linting)
  • remove some more unused code:
    • modeling_utils/normalization/rmsnorm/torchtitan.py
    • modeling_utils/normalization/rmsnorm/apex.py
    • modeling_utils/normalization/layernorm/apex.py
    • modeling_utils/normalization/layernorm/apex_persistent.py
    • modeling_utils/embedding/ParameterizedEmbedding
    • modeling_utils/linear/ParameterizedLinear
  • add some more notices
  • add the conversion utilities
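For orientation, here is a minimal sketch of how one of the extracted models might be loaded once a checkpoint has been converted. The registration calls are the standard transformers auto-class API; the import path and class names (hf_models.models.gpt_dolomite, GPTDolomiteConfig, GPTDolomiteForCausalLM) are assumptions carried over from upstream dolomite-engine and are not confirmed by this PR:

# Hedged sketch: the import path and class names below are assumptions based on
# the upstream dolomite-engine layout; the extracted package here may differ.
from transformers import AutoConfig, AutoModelForCausalLM

from hf_models.models.gpt_dolomite import GPTDolomiteConfig, GPTDolomiteForCausalLM

# Register the extracted architecture with the HF auto classes so that
# from_pretrained() can resolve the "gpt_dolomite" model_type.
AutoConfig.register("gpt_dolomite", GPTDolomiteConfig)
AutoModelForCausalLM.register(GPTDolomiteConfig, GPTDolomiteForCausalLM)

# "path/to/converted-checkpoint" is a placeholder for output of the
# bigcode/llama/mixtral conversion utilities listed above.
model = AutoModelForCausalLM.from_pretrained("path/to/converted-checkpoint")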

@mayank31398

@fabianlim I would also suggest dropping ParameterizedEmbedding and ParameterizedLinear and using the Linear and Embedding from torch directly.
They are just for an experimental project I was working on.
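For illustration, a minimal sketch of that suggestion, assuming the Parameterized* classes are thin wrappers around the stock torch modules; the layer sizes are placeholders:

import torch.nn as nn

# Use the stock torch modules in place of ParameterizedEmbedding / ParameterizedLinear.
# Any custom weight-init std the wrappers carried can instead be applied in the
# model's _init_weights hook.
wte = nn.Embedding(num_embeddings=50257, embedding_dim=768)      # placeholder sizes
c_fc = nn.Linear(in_features=768, out_features=3072, bias=True)  # placeholder sizes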

@fabianlim (Collaborator, Author) commented Jun 21, 2024

@aldopareja this is more or less OK, but it's missing the notices. What do we want to put in the header of every file?

like this?

# this code has been extracted from https://github.com/ibm-granite/dolomite-engine

@fabianlim changed the title from "Initial Extraction From Dolomite Engine: DO NOT MERGE" to "Initial Extraction From Dolomite Engine" on Jun 21, 2024
@fabianlim force-pushed the initial branch 3 times, most recently from 587d240 to 3636587 on June 21, 2024 at 16:06
@RobotSail (Member)

We should also add the same publishing CI that we use elsewhere in instructlab so that it's easy to get stuff published.

@RobotSail (Member)

This one instructlab/training#31

@RobotSail (Member)

This one instructlab/training#31

Sorry not that one, this one: instructlab/training#42

@RobotSail (Member)

Nvm, I created a PR for publishing here, ignore the above comments: #2

README.md: outdated review thread (resolved)
@mayank31398

wow!!! 4k lines of code already :)

@fabianlim merged commit 6d0760e into main on Jun 21, 2024 (5 checks passed)
@fabianlim deleted the initial branch on June 21, 2024 at 23:56
@mayank31398

Yikes!! all of my comments were ignored 🤣

@fabianlim (Collaborator, Author)

@mayank31398 I thought you only gave these 2 comments:

  • removed the Parameterized version: 0b7367b
  • adjusted the authorship: 853c987

Was there anything else?

Inline review comments on the changed files:

same as above


Do we need this checkpointing logic, or should it live in the instructlab training repo?


I think we can move this code into GPTDolomiteConfig. Since this repo is just GPTDolomite, we may not need the CommonConfig class.
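For illustration only, a rough sketch of what folding the shared config fields into a single GPTDolomiteConfig could look like; the field names here are GPT-2-style placeholders, not the actual attributes of CommonConfig:

from transformers import PretrainedConfig


class GPTDolomiteConfig(PretrainedConfig):
    """Single config class; placeholder fields, not the real CommonConfig contents."""

    model_type = "gpt_dolomite"

    def __init__(
        self,
        vocab_size: int = 50257,
        n_positions: int = 1024,
        n_embd: int = 768,
        n_layer: int = 12,
        n_head: int = 12,
        **kwargs,
    ) -> None:
        self.vocab_size = vocab_size
        self.n_positions = n_positions
        self.n_embd = n_embd
        self.n_layer = n_layer
        self.n_head = n_head
        super().__init__(**kwargs)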


Let's drop this since we are removing all the other normalization implementations.


do we need this as a file?


same as above

Comment on lines +75 to +132
class YaRNScaledRoPE(RoPE):
    def __init__(
        self,
        head_dim: int,
        max_position_embeddings: int = 2048,
        base: int = 10000,
        scale: float = 1,
        original_max_position_embeddings: int = 2048,
        extrapolation_factor: float = 1,
        attn_factor: float = 1,
        beta_fast: int = 32,
        beta_slow: int = 1,
    ) -> None:
        torch.nn.Module.__init__(self)

        self.head_dim = head_dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.scale = scale
        self.original_max_position_embeddings = original_max_position_embeddings
        self.extrapolation_factor = extrapolation_factor
        self.attn_factor = attn_factor
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow

        # Get n-d magnitude scaling corrected for interpolation
        self.mscale = _yarn_get_mscale(self.scale) * self.attn_factor

        self.reset_parameters()

    def reset_parameters(self) -> None:
        pos_freqs = self.base ** (
            torch.arange(0, self.head_dim, 2).float() / self.head_dim
        )
        inv_freq_extrapolation = 1.0 / pos_freqs
        inv_freq_interpolation = 1.0 / (self.scale * pos_freqs)

        low, high = _yarn_find_correction_range(
            self.beta_fast,
            self.beta_slow,
            self.head_dim,
            self.base,
            self.original_max_position_embeddings,
        )
        inv_freq_mask = (
            (1 - _yarn_linear_ramp_mask(low, high, self.head_dim // 2).float())
            * self.extrapolation_factor
        )  # Get n-d rotational scaling corrected for extrapolation
        inv_freq = (
            inv_freq_interpolation * (1 - inv_freq_mask)
            + inv_freq_extrapolation * inv_freq_mask
        )
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # pylint: disable=no-value-for-parameter
        self._set_cos_sin_cache(
            self.max_position_embeddings, dtype=torch.get_default_dtype()
        )


Let's drop YaRN; I haven't tested it, so I'm not sure the logic is correct.

Comment on lines +148 to +186
# Inverse dim formula to find dim based on number of rotations
def _yarn_find_correction_dim(
    num_rotations: int, dim: int, base: int = 10000, max_position_embeddings: int = 2048
) -> float:
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (
        2 * math.log(base)
    )


# Find dim range bounds based on rotations
def _yarn_find_correction_range(
    low_rot: int,
    high_rot: int,
    dim: int,
    base: int = 10000,
    max_position_embeddings: int = 2048,
) -> int:
    low = math.floor(
        _yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings)
    )
    high = math.ceil(
        _yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings)
    )
    return max(low, 0), min(high, dim - 1)  # Clamp values just in case


def _yarn_linear_ramp_mask(min: float, max: float, dim: int) -> torch.Tensor:
    if min == max:
        max += 0.001  # Prevent singularity

    linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
    ramp_func = torch.clamp(linear_func, 0, 1)
    return ramp_func


def _yarn_get_mscale(scale: float = 1) -> float:
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


Same as the above comment about dropping YaRN.
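For reference, a standalone numeric check of the attention-magnitude factor computed by _yarn_get_mscale in the excerpt above (the function body is copied from the diff; the sample scale values are illustrative):

import math


def _yarn_get_mscale(scale: float = 1) -> float:
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


# A context-extension factor of 1 leaves attention unscaled; larger factors
# grow the scaling only logarithmically.
for scale in (1, 2, 4, 16):
    print(scale, round(_yarn_get_mscale(scale), 4))
# prints: 1 1.0, 2 1.0693, 4 1.1386, 16 1.2773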


Maybe move the CommonConfig logic into this class?


This might also not be needed, since it is used for gradient checkpointing / FSDP wrapping.
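To ground that remark, a hedged sketch of the usual pattern: the transformer block class is handed to FSDP's auto-wrap policy. The torch FSDP API calls are standard; the import path and the GPTDolomiteBlock name are hypothetical:

import functools

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Hypothetical module path and class name for the block being discussed.
from hf_models.models.gpt_dolomite.layer import GPTDolomiteBlock

# Wrap each transformer block in its own FSDP unit.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GPTDolomiteBlock},
)
# model = FSDP(model, auto_wrap_policy=auto_wrap_policy)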

fabianlim added a commit that referenced this pull request Jun 22, 2024
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
mayank31398 pushed a commit that referenced this pull request Jun 22, 2024
* addressed missed out comments in #1, except checkpointing

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* ruff + lint

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* removed gradient checkpointing

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* moved config file and commented on rope scaling.

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

---------

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>