API Documentation

sirfshampoo.SIRFShampoo

SIRFShampoo(model: Module, params: Optional[Union[List[Parameter], List[Dict[str, Any]]]] = None, lr: float = 0.001, beta2: float = 0.01, alpha1: float = 0.9, alpha2: float = 0.5, lam: float = 0.001, kappa: float = 0.0, batch_size: Union[int, Callable[[Tuple[Tensor, ...]], int]] = get_batch_size, T: Union[int, Callable[[int], bool]] = 1, structures: Union[str, Dict[int, Union[str, Tuple[str, ...]]]] = 'dense', preconditioner_dtypes: Optional[Union[dtype, Dict[int, Union[None, dtype, Tuple[Union[None, dtype], ...]]]]] = None, combine_params: Tuple[PreconditionerGroup, ...] = DEFAULT_COMBINE_PARAMS, verbose_init: bool = False)

Bases: Optimizer

Structured inverse-free and root-free Shampoo optimizer.

Attributes:

SUPPORTED_STRUCTURES (Dict[str, Type[StructuredMatrix]]) –

A dictionary mapping structure names to the respective classes of structured matrices that can be used for the pre-conditioner.
STATE_ATTRIBUTES (List[str]) –

Attributes that belong to the optimizer's state but are not stored inside the self.state attribute. They will be saved and restored when the optimizer is check-pointed (by calling .state_dict() and .load_state_dict()).

Set up the optimizer.

Notation based on "Can We Remove the Square-Root in Adaptive Gradient Methods?" (https://openreview.net/pdf?id=vuMD71R20q).

Note

We rewrite the parameter groups such that parameters sharing a pre-conditioner are in one group. This simplifies the internal book-keeping when updating the pre-conditioner and parameters.

Parameters:

model (Module) –

The model to optimize. The optimizer needs access to the model to figure out weights/biases of one layer.
params (Optional[Union[List[Parameter], List[Dict[str, Any]]]], default: None ) –

The parameters to optimize. If None, all parameters of the model are optimized. Default: None.
lr (float, default: 0.001 ) –

Learning rate for the parameter update. Default: 0.001.
beta2 (float, default: 0.01 ) –

Learning rate for the preconditioner update. Default: 0.01.
alpha1 (float, default: 0.9 ) –

Momentum for the parameter update. Default: 0.9.
alpha2 (float, default: 0.5 ) –

Riemannian momentum on the pre-conditioners. Default: 0.5.
lam (float, default: 0.001 ) –

Damping for the pre-conditioner update. Default: 0.001.
kappa (float, default: 0.0 ) –

Weight decay. Default: 0.0.
batch_size (Union[int, Callable[[Tuple[Tensor, ...]], int]], default: get_batch_size ) –

The batch size as integer or a callable from the input tensors of the neural network to the batch size (will be installed as pre-forward hook). If not specified, we detect the batch size by using the first input tensor's leading dimension.
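A custom callable may be useful when the batch dimension is not the leading one. The following is a minimal sketch, assuming sequence-first inputs of shape `(time, batch, features)`; the function name `sequence_first_batch_size` is hypothetical and not part of sirfshampoo. A `SimpleNamespace` with a `shape` attribute stands in for a tensor so that the snippet runs without PyTorch.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def sequence_first_batch_size(inputs: Sequence[Any]) -> int:
    """Read the batch size from dimension 1 of the first input.

    Assumes inputs shaped (time, batch, features), which is an
    illustration only and not something the library prescribes.
    """
    return int(inputs[0].shape[1])


# Stand-in for a tensor of shape (time=50, batch=16, features=128).
fake_input = SimpleNamespace(shape=(50, 16, 128))
detected = sequence_first_batch_size((fake_input,))
```

Based on the signature above, such a callable would be passed as `batch_size=sequence_first_batch_size` when constructing the optimizer.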
T (Union[int, Callable[[int], bool]], default: 1 ) –

The pre-conditioner update frequency as integer or callable from the optimizer's global step to a boolean that is True if the pre-conditioner should be updated at that iteration. Default: 1.
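As a minimal sketch of the callable form of `T`, the hypothetical helper below (`warmup_then_every` is not part of sirfshampoo) builds a schedule that updates the pre-conditioner on every step during a warm-up phase and only on every `interval`-th step afterwards. It assumes the step counter starts at 0, as `global_step` does in the constructor.

```python
from typing import Callable


def warmup_then_every(warmup_steps: int, interval: int) -> Callable[[int], bool]:
    """Return a schedule mapping the global step to an update decision."""

    def should_update(global_step: int) -> bool:
        # Update at every step while warming up.
        if global_step < warmup_steps:
            return True
        # Afterwards, update on every `interval`-th step.
        return (global_step - warmup_steps) % interval == 0

    return should_update


schedule = warmup_then_every(warmup_steps=3, interval=4)
update_steps = [step for step in range(12) if schedule(step)]
```

Going by the signature above, the resulting callable could be passed as `T=schedule`.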
structures (Union[str, Dict[int, Union[str, Tuple[str, ...]]]], default: 'dense' ) –

Specification of which structures the preconditioner matrices should use. There are multiple ways to specify this:

- If a single string, each of the N factors of an Nd tensor's preconditioner will use the structure specified by the string.
- If specified as a dictionary, each key is the number of dimensions of a preconditioned tensor and its value specifies the structure as a string or tuple. E.g. {1: 'dense', 2: ('dense', 'diagonal'), 3: 'diagonal'} means that 1d tensors will be preconditioned with a single dense Kronecker factor, 2d tensors with a dense and a diagonal factor, and 3d tensors with three diagonal factors.

Supported choices are 'dense', 'diagonal', 'block30diagonal', 'hierarchical15_15', 'triltoeplitz', and 'triutoeplitz'. See Figure 5 (https://arxiv.org/pdf/2312.05705v3) for an illustration.
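To make the two specification formats concrete, here is a small sketch of how a `structures` argument could be expanded into one structure name per Kronecker factor. The helper `expand_structures` is hypothetical; it mirrors the description above and is not the library's internal `_standardize_structures`. In particular, raising on a tuple of the wrong length is an assumption of this sketch.

```python
from typing import Dict, Tuple, Union

StructureSpec = Union[str, Dict[int, Union[str, Tuple[str, ...]]]]


def expand_structures(spec: StructureSpec, ndim: int) -> Tuple[str, ...]:
    """Return one structure name per factor of an `ndim`-dimensional tensor."""
    # A dictionary is looked up by the tensor's number of dimensions.
    entry = spec[ndim] if isinstance(spec, dict) else spec
    # A single string is repeated for every factor.
    if isinstance(entry, str):
        return (entry,) * ndim
    if len(entry) != ndim:
        raise ValueError(f"Expected {ndim} structures, got {len(entry)}.")
    return tuple(entry)


spec = {1: "dense", 2: ("dense", "diagonal"), 3: "diagonal"}
factors_1d = expand_structures(spec, 1)
factors_2d = expand_structures(spec, 2)
factors_3d = expand_structures(spec, 3)
uniform_2d = expand_structures("dense", 2)
```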
preconditioner_dtypes (Optional[Union[dtype, Dict[int, Union[None, dtype, Tuple[Union[None, dtype], ...]]]]], default: None ) –

The data type to use for the pre-conditioner. There are multiple ways to specify this and the format is identical to that of structures. E.g. {1: bfloat16, 2: (float32, float16), 3: float32} means that 1d tensors will use bfloat16, 2d tensors will use float32 for the first and float16 for the second factor, and 3d tensors will use float32 for all factors. If None, the parameter's data type will be used. Default: None.
combine_params (Tuple[PreconditionerGroup, ...], default: DEFAULT_COMBINE_PARAMS ) –

A tuple of PreconditionerGroup objects that specify how to combine parameters into groups that share a pre-conditioner. Leading rules are prioritized over trailing ones, i.e. if a parameter matches multiple rules, the earliest rule is used. By default, this tuple contains a single rule that gives each parameter of the neural network its own independent pre-conditioner.
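The priority rule can be illustrated with a short sketch. It works on parameter names rather than tensors, and the function `resolve_groups` is hypothetical. How the library treats a group that only partially overlaps with an earlier one is not stated here; the sketch assumes such a group is skipped entirely.

```python
from typing import List, Sequence, Set, Tuple

# A rule is a name plus the groups of parameter names it identifies.
Rule = Tuple[str, List[List[str]]]


def resolve_groups(rules: Sequence[Rule]) -> List[Tuple[str, List[str]]]:
    """Assign parameter groups to rules, earlier rules taking priority."""
    claimed: Set[str] = set()
    resolved: List[Tuple[str, List[str]]] = []
    for rule_name, groups in rules:
        for group in groups:
            # Skip groups containing a parameter an earlier rule already took.
            if any(name in claimed for name in group):
                continue
            claimed.update(group)
            resolved.append((rule_name, group))
    return resolved


rules = [
    ("LinearWeightBias", [["fc.weight", "fc.bias"]]),
    ("PerParameter", [["embed.weight"], ["fc.weight"], ["fc.bias"]]),
]
resolved = resolve_groups(rules)
```

Here the linear layer's weight and bias end up in one joint group because the first rule claims them, and only the embedding weight falls through to the per-parameter rule.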
verbose_init (bool, default: False ) –

Whether to print information at initialization, i.e. how parameters are grouped and what pre-conditioners are used. Default: False.

Source code in sirfshampoo/optimizer.py

def __init__(
    self,
    model: Module,
    params: Optional[Union[List[Parameter], List[Dict[str, Any]]]] = None,
    lr: float = 0.001,  # beta1 in the paper
    beta2: float = 0.01,
    alpha1: float = 0.9,
    alpha2: float = 0.5,
    lam: float = 0.001,
    kappa: float = 0.0,
    batch_size: Union[int, Callable[[Tuple[Tensor, ...]], int]] = get_batch_size,
    T: Union[int, Callable[[int], bool]] = 1,
    structures: Union[str, Dict[int, Union[str, Tuple[str, ...]]]] = "dense",
    preconditioner_dtypes: Optional[
        Union[dtype, Dict[int, Union[None, dtype, Tuple[Union[None, dtype], ...]]]]
    ] = None,
    combine_params: Tuple[PreconditionerGroup, ...] = DEFAULT_COMBINE_PARAMS,
    verbose_init: bool = False,
):
    """Set up the optimizer.

    Notation based on [Can We Remove the Square-Root in Adaptive Gradient
    Methods?](https://openreview.net/pdf?id=vuMD71R20q).

    Note:
        We rewrite the parameter groups such that parameters sharing a
        pre-conditioner are in one group. This simplifies the internal
        book-keeping when updating the pre-conditioner and parameters.

    Args:
        model: The model to optimize. The optimizer needs access to the model
            to figure out weights/biases of one layer.
        params: The parameters to optimize. If `None`, all parameters of the
            model are optimized. Default: `None`.
        lr: Learning rate for the parameter update. Default: `0.001`.
        beta2: Learning rate for the preconditioner update. Default: `0.01`.
        alpha1: Momentum for the parameter update. Default: `0.9`.
        alpha2: Riemannian momentum on the pre-conditioners. Default: `0.5`.
        lam: Damping for the pre-conditioner update. Default: `0.001`.
        kappa: Weight decay. Default: `0.0`.
        batch_size: The batch size as integer or a callable from the input tensors
            of the neural network to the batch size (will be installed as
            pre-forward hook). If not specified, we detect the batch size by using
            the first input tensor's leading dimension.
        T: The pre-conditioner update frequency as integer or callable from the
            optimizer's global step to a boolean that is `True` if the
            pre-conditioner should be updated at that iteration. Default: `1`.
        structures: Specification of which structures the preconditioner matrices
            should use. There are multiple ways to specify this:
            - If a single string, each of the `N` factors of an `N`d tensor's
              preconditioner will use the same structure specified by the string.
            - If specified as dictionary, each key represents the dimension of a
              preconditioned tensor and its value specifies the structure as string
              or tuple. E.g. `{1: 'dense', 2: ('dense', 'diagonal'), 3: 'diagonal'}`
              means that 1d tensors will be preconditioned with a single dense
              Kronecker factor, 2d tensors with a dense and a diagonal factor, and
              3d tensors with three diagonal factors.
            Supported choices are `'dense'`, `'diagonal'`, `'block30diagonal'`,
            `'hierarchical15_15'`, `'triltoeplitz'`, and `'triutoeplitz'`. See
            [Figure 5](https://arxiv.org/pdf/2312.05705v3) for an illustration.
        preconditioner_dtypes: The data type to use for the pre-conditioner. There
            are multiple ways to specify this and the format is identical to that of
            `structures`. E.g. `{1: bfloat16, 2: (float32, float16), 3: float32}`
            means that 1d tensors will use `bfloat16`, 2d tensors will use `float32`
            for the first and `float16` for the second factor, and 3d tensors will
            use `float32` for all factors. If `None`, the parameter's data type will
            be used. Default: `None`.
        combine_params: A tuple of `PreconditionerGroup` objects that specify how to
            combine parameters into combinations which share a pre-conditioner.
            Leading rules are prioritized over trailing entries, i.e. if a parameter
            matches with multiple rules, the earlier rule is used. By default, this
            tuple contains a single rule that treats each parameter of a neural
            network with an independent pre-conditioner.
        verbose_init: Whether to print information at initialization, i.e. how
            parameters are grouped and what pre-conditioners are used.
            Default: `False`.
    """
    defaults = dict(
        lr=lr,
        beta2=beta2,
        alpha1=alpha1,
        alpha2=alpha2,
        lam=lam,
        kappa=kappa,
        T=T,
        structures=structures,
        preconditioner_dtypes=preconditioner_dtypes,
        combine_params=combine_params,
    )

    if params is None:
        params = [p for p in model.parameters() if p.requires_grad]
    super().__init__(params, defaults)

    self.param_to_names = {p.data_ptr(): n for n, p in model.named_parameters()}
    self.global_step = 0

    # batch size detection
    if callable(batch_size):
        # install as module hook that updates the batch size in every forward pass
        self.batch_size_valid = self.global_step
        self.batch_size = 0

        def hook(module: Module, inputs: Tuple[Tensor, ...]):
            """Forward hook to accumulate the batch size in the optimizer.

            Args:
                module: The module that is called.
                inputs: The input tensors to the module.
            """
            # batch size is outdated because optimizer has stepped
            if self.batch_size_valid != self.global_step:
                self.batch_size_valid = self.global_step
                self.batch_size = 0

            # do not accumulate batch size during evaluation
            if module.training:
                self.batch_size += batch_size(inputs)

        model.register_forward_pre_hook(hook)
    else:
        self.batch_size = batch_size
        self.batch_size_valid = "always"

    # we rewrite the original parameter groups and create new ones such that each
    # parameter group contains the parameters that are treated jointly with one
    # pre-conditioner. This simplifies book-keeping when updating the
    # pre-conditioner and taking a step.
    self._one_param_group_per_preconditioner(model)
    # convert structure and dtype arguments into tuples
    self._standardize_structures()
    self._standardize_preconditioner_dtypes()
    self._verify_hyperparameters()

    self._initialize_momentum_buffers()

    # The pre-conditioner for one group is a list of matrices (the Kronecker
    # factors). For a layer with 2d weight of shape `(D_out, D_in)`, the entries are
    # (C, K) from the paper where C is `(D_out, D_out)` and K is `(D_in, D_in)`.
    self.preconditioner: List[List[StructuredMatrix]] = (
        self._initialize_preconditioner("identity")
    )
    # same for the momenta, i.e. (m_C, m_K) from the paper for a 2d weight
    self.preconditioner_momenta: List[List[Union[StructuredMatrix, None]]] = (
        self._initialize_preconditioner("zero", is_momentum=True)
    )

    if verbose_init:
        self.print_group_info()
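The batch size detection in the constructor accumulates over all training-mode forward passes between two optimizer steps, which matters for gradient accumulation. The following sketch replays that logic in plain Python; `BatchSizeTracker` and its method names are hypothetical stand-ins for the hook and the step counter, not library classes.

```python
class BatchSizeTracker:
    """Replays the hook's accumulation logic without a model."""

    def __init__(self) -> None:
        self.global_step = 0
        self.batch_size_valid = self.global_step
        self.batch_size = 0

    def on_forward(self, batch_size: int, training: bool = True) -> None:
        # Reset the accumulator once the optimizer has stepped.
        if self.batch_size_valid != self.global_step:
            self.batch_size_valid = self.global_step
            self.batch_size = 0
        # Evaluation-mode passes do not contribute.
        if training:
            self.batch_size += batch_size

    def on_step(self) -> None:
        self.global_step += 1


tracker = BatchSizeTracker()
tracker.on_forward(8)
tracker.on_forward(8)                  # second micro-batch before stepping
tracker.on_forward(4, training=False)  # ignored: evaluation mode
accumulated = tracker.batch_size
tracker.on_step()
tracker.on_forward(8)                  # first pass after the step resets the count
after_step = tracker.batch_size
```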

step

step(closure: Optional[Callable] = None) -> None

Perform a single optimization step.

Parameters:

closure (Optional[Callable], default: None ) –

Not supported. Default: None.

Raises:

NotImplementedError –

If closure is not None.

Source code in sirfshampoo/optimizer.py

def step(self, closure: Optional[Callable] = None) -> None:
    """Perform a single optimization step.

    Args:
        closure: Not supported. Default: `None`.

    Raises:
        NotImplementedError: If `closure` is not `None`.
    """
    if closure is not None:
        raise NotImplementedError("Closure is not supported.")

    for group_idx, _ in enumerate(self.param_groups):
        self._step(group_idx)

    self.global_step += 1

state_dict

state_dict() -> Dict[str, Any]

Return a save-able state of the optimizer.

Returns:

Dict[str, Any] –

A dictionary containing the optimizer state.

Source code in sirfshampoo/optimizer.py

def state_dict(self) -> Dict[str, Any]:
    """Return a save-able state of the optimizer.

    Returns:
        A dictionary containing the optimizer state.
    """
    state_dict = super().state_dict()

    for name in self.STATE_ATTRIBUTES:
        assert name not in state_dict.keys()
        state_dict[name] = getattr(self, name)

    return state_dict

load_state_dict

load_state_dict(state_dict: Dict[str, Any]) -> None

Load an optimizer state.

Parameters:

state_dict (Dict[str, Any]) –

A dictionary containing a valid state obtained from this class's .state_dict() method.

Source code in sirfshampoo/optimizer.py

def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
    """Load an optimizer state.

    Args:
        state_dict: A dictionary containing a valid state obtained from this
            class's `.state_dict()` method.
    """
    attributes = {name: state_dict.pop(name) for name in self.STATE_ATTRIBUTES}
    super().load_state_dict(state_dict)

    for name, value in attributes.items():
        setattr(self, name, value)
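One consequence of the code above is that `load_state_dict` pops the extra attributes from the dictionary it receives, so the caller's dictionary is modified. The sketch below reproduces the pattern with a stand-in base class instead of `torch.optim.Optimizer`; the class names and the contents of `STATE_ATTRIBUTES` are assumptions made for illustration.

```python
from typing import Any, Dict, List


class BaseOptimizer:
    """Stand-in for the parent optimizer, reduced to state handling."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}

    def state_dict(self) -> Dict[str, Any]:
        return {"state": dict(self.state)}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.state = dict(state_dict["state"])


class ToyOptimizer(BaseOptimizer):
    # Hypothetical attribute list; the real one is defined by the library.
    STATE_ATTRIBUTES: List[str] = ["global_step", "batch_size"]

    def __init__(self) -> None:
        super().__init__()
        self.global_step = 0
        self.batch_size = 0

    def state_dict(self) -> Dict[str, Any]:
        state_dict = super().state_dict()
        for name in self.STATE_ATTRIBUTES:
            state_dict[name] = getattr(self, name)
        return state_dict

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Popping removes the extra keys from the dictionary passed in.
        attributes = {name: state_dict.pop(name) for name in self.STATE_ATTRIBUTES}
        super().load_state_dict(state_dict)
        for name, value in attributes.items():
            setattr(self, name, value)


source = ToyOptimizer()
source.global_step, source.batch_size = 7, 32
checkpoint = source.state_dict()

restored = ToyOptimizer()
restored.load_state_dict(dict(checkpoint))  # shallow copy keeps `checkpoint` intact
```

Passing a shallow copy, as in the last line, leaves the original checkpoint dictionary reusable for a second load.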

sirfshampoo.PreconditionerGroup

Bases: ABC

Interface for treating multiple tensors with one pre-conditioner.

Users who want to specify their own rule for grouping parameters into a tensor, which is then treated with a Kronecker-factored pre-conditioner, should implement this interface.

group `abstractmethod`

group(tensors: List[Tensor]) -> Tensor

Combine multiple tensors into one.

This is the inverse operation of ungroup.

Parameters:

tensors (List[Tensor]) –

List of tensors to combine.

Returns:

Tensor –

Combined tensor.

Raises:

NotImplementedError –

Must be implemented by a child class.

Source code in sirfshampoo/combiner.py

@abstractmethod
def group(self, tensors: List[Tensor]) -> Tensor:
    """Combine multiple tensors into one.

    This is the inverse operation of `ungroup`.

    Args:
        tensors: List of tensors to combine.

    Returns:
        Combined tensor.

    Raises:
        NotImplementedError: Must be implemented by a child class.
    """
    raise NotImplementedError

identify `abstractmethod`

identify(model: Module) -> List[List[Parameter]]

Detect parameters that should be treated jointly.

Parameters:

model (Module) –

The neural network.

Returns:

List[List[Parameter]] –

A list whose entries are lists of parameters that are treated jointly.

Raises:

NotImplementedError –

Must be implemented by a child class.

Source code in sirfshampoo/combiner.py

@abstractmethod
def identify(self, model: Module) -> List[List[Parameter]]:
    """Detect parameters that should be treated jointly.

    Args:
        model: The neural network.

    Returns:
        A list whose entries are lists of parameters that are treated jointly.

    Raises:
        NotImplementedError: Must be implemented by a child class.
    """
    raise NotImplementedError

ungroup `abstractmethod`

ungroup(grouped_tensor: Tensor, tensor_shapes: List[Size]) -> List[Tensor]

Split a combined tensor into multiple tensors.

This is the inverse operation of group.

Parameters:

grouped_tensor (Tensor) –

Combined tensor.
tensor_shapes (List[Size]) –

Shapes of the tensors to split into.

Returns:

List[Tensor] –

List of tensors.

Raises:

NotImplementedError –

Must be implemented by a child class.

Source code in sirfshampoo/combiner.py

@abstractmethod
def ungroup(
    self, grouped_tensor: Tensor, tensor_shapes: List[Size]
) -> List[Tensor]:
    """Split a combined tensor into multiple tensors.

    This is the inverse operation of `group`.

    Args:
        grouped_tensor: Combined tensor.
        tensor_shapes: Shapes of the tensors to split into.

    Returns:
        List of tensors.

    Raises:
        NotImplementedError: Must be implemented by a child class.
    """
    raise NotImplementedError

sirfshampoo.PerParameter

Bases: PreconditionerGroup

Pre-conditioner group to treat each parameter with its own pre-conditioner.

sirfshampoo.LinearWeightBias

Bases: PreconditionerGroup

Treat weight and bias of a linear layer jointly.

Appends the bias as the last column of the weight matrix.

Attributes:

LINEAR_CLS (Tuple[Type[Module]]) –

Classes that should be detected as linear layers.
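The stacking described for this class, and the requirement that `ungroup` inverts `group`, can be sketched with nested lists in place of tensors. The functions below follow the argument layout of the `PreconditionerGroup` interface but are illustrative only; they are not the library's implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def group(tensors: list) -> Matrix:
    """Append bias[i] as the last entry of row i of the weight matrix."""
    weight, bias = tensors
    return [row + [b] for row, b in zip(weight, bias)]


def ungroup(grouped: Matrix, tensor_shapes: Sequence[Tuple[int, ...]]) -> list:
    """Split the combined matrix back into weight and bias."""
    # The weight's shape tells us where the bias column starts.
    (_, d_in), _ = tensor_shapes
    weight = [row[:d_in] for row in grouped]
    bias = [row[d_in] for row in grouped]
    return [weight, bias]


weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (D_out=2, D_in=3)
bias = [0.1, 0.2]                            # shape (D_out=2,)
grouped = group([weight, bias])              # shape (2, 4)
restored_weight, restored_bias = ungroup(grouped, [(2, 3), (2,)])
```

With the shape convention noted in the constructor source, a combined tensor of shape `(D_out, D_in + 1)` would presumably be preconditioned by factors of size `(D_out, D_out)` and `(D_in + 1, D_in + 1)`.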

sirfshampoo.FlattenEmbedding

Bases: FlattenAndConcatenate

Treat flattened embedding weights with a pre-conditioner.

Attributes:

EMBEDDING_CLS (Tuple[Type[Module]]) –

Classes that should be detected as embedding layers.
