API Documentation
sirfshampoo.SIRFShampoo
SIRFShampoo(model: Module, params: Optional[Union[List[Parameter], List[Dict[str, Any]]]] = None, lr: float = 0.001, beta2: float = 0.01, alpha1: float = 0.9, alpha2: float = 0.5, lam: float = 0.001, kappa: float = 0.0, batch_size: Union[int, Callable[[Tuple[Tensor, ...]], int]] = get_batch_size, T: Union[int, Callable[[int], bool]] = 1, structures: Union[str, Dict[int, Union[str, Tuple[str, ...]]]] = 'dense', preconditioner_dtypes: Optional[Union[dtype, Dict[int, Union[None, dtype, Tuple[Union[None, dtype], ...]]]]] = None, combine_params: Tuple[PreconditionerGroup, ...] = DEFAULT_COMBINE_PARAMS, verbose_init: bool = False)
Bases: Optimizer
Structured inverse-free and root-free Shampoo optimizer.
Attributes:
- SUPPORTED_STRUCTURES (Dict[str, Type[StructuredMatrix]]) – A dictionary mapping structure names to the respective classes of structured matrices that can be used for the pre-conditioner.
- STATE_ATTRIBUTES (List[str]) – Attributes that belong to the optimizer's state but are not stored inside the self.state attribute. They will be saved and restored when the optimizer is checkpointed (by calling .state_dict() and .load_state_dict()).
Set up the optimizer.
The notation is based on the paper "Can We Remove the Square-Root in Adaptive Gradient Methods?".
Note
We rewrite the parameter groups such that parameters sharing a pre-conditioner are in one group. This simplifies the internal bookkeeping when updating the pre-conditioner and parameters.
Parameters:
- model (Module) – The model to optimize. The optimizer needs access to the model to determine which weights and biases belong to one layer.
- params (Optional[Union[List[Parameter], List[Dict[str, Any]]]], default: None) – The parameters to optimize. If None, all parameters of the model are optimized. Default: None.
- lr (float, default: 0.001) – Learning rate for the parameter update. Default: 0.001.
- beta2 (float, default: 0.01) – Learning rate for the preconditioner update. Default: 0.01.
- alpha1 (float, default: 0.9) – Momentum for the parameter update. Default: 0.9.
- alpha2 (float, default: 0.5) – Riemannian momentum on the pre-conditioners. Default: 0.5.
- lam (float, default: 0.001) – Damping for the pre-conditioner update. Default: 0.001.
- kappa (float, default: 0.0) – Weight decay. Default: 0.0.
- batch_size (Union[int, Callable[[Tuple[Tensor, ...]], int]], default: get_batch_size) – The batch size as an integer, or a callable that maps the input tensors of the neural network to the batch size (it will be installed as a pre-forward hook). If not specified, the batch size is detected from the leading dimension of the first input tensor.
- T (Union[int, Callable[[int], bool]], default: 1) – The pre-conditioner update frequency as an integer, or a callable that maps the optimizer's global step to a boolean that is True if the pre-conditioner should be updated at that iteration. Default: 1.
- structures (Union[str, Dict[int, Union[str, Tuple[str, ...]]]], default: 'dense') – Specification of which structures the preconditioner matrices should use. There are multiple ways to specify this:
  - If a single string, each of the N factors of an Nd tensor's preconditioner will use the structure specified by the string.
  - If specified as a dictionary, each key represents the dimension of a preconditioned tensor and its value specifies the structure as a string or tuple. E.g. {1: 'dense', 2: ('dense', 'diagonal'), 3: 'diagonal'} means that 1d tensors will be preconditioned with a single dense Kronecker factor, 2d tensors with a dense and a diagonal factor, and 3d tensors with three diagonal factors.
  Supported choices are 'dense', 'diagonal', 'block30diagonal', 'hierarchical15_15', 'triltoeplitz', and 'triutoeplitz'. See Figure 5 for an illustration.
- preconditioner_dtypes (Optional[Union[dtype, Dict[int, Union[None, dtype, Tuple[Union[None, dtype], ...]]]]], default: None) – The data type to use for the pre-conditioner. There are multiple ways to specify this and the format is identical to that of structures. E.g. {1: bfloat16, 2: (float32, float16), 3: float32} means that 1d tensors will use bfloat16, 2d tensors will use float32 for the first and float16 for the second factor, and 3d tensors will use float32 for all factors. If None, the parameter's data type will be used. Default: None.
- combine_params (Tuple[PreconditionerGroup, ...], default: DEFAULT_COMBINE_PARAMS) – A tuple of PreconditionerGroup objects that specify how to combine parameters into groups that share a pre-conditioner. Leading rules take priority over trailing ones, i.e. if a parameter matches multiple rules, the earliest rule is used. By default, this tuple contains a single rule that treats each parameter of a neural network with an independent pre-conditioner.
- verbose_init (bool, default: False) – Whether to print information at initialization, i.e. how parameters are grouped and what pre-conditioners are used. Default: False.
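The structures format described above (and the identically formatted preconditioner_dtypes) can be read as a lookup from a tensor's number of dimensions to one entry per Kronecker factor. The following is a minimal sketch of that reading, not the library's implementation: resolve_spec and update_every are hypothetical helper names, and raising an error for a tuple of the wrong length is an assumption. update_every builds a callable of the kind accepted by T; whether the optimizer's global step starts at 0 or 1 is not stated here, so the schedule only illustrates the expected signature.

```python
from typing import Callable, Dict, Tuple, Union

Spec = Union[str, Dict[int, Union[str, Tuple[str, ...]]]]


def resolve_spec(spec: Spec, ndim: int) -> Tuple[str, ...]:
    """Return one structure name per Kronecker factor of an ``ndim``-d tensor.

    Hypothetical helper that mirrors the documented format of ``structures``.
    """
    # A single string applies to tensors of every dimension.
    entry = spec if isinstance(spec, str) else spec[ndim]
    if isinstance(entry, str):
        # One string: all ``ndim`` factors share the same structure.
        return (entry,) * ndim
    if len(entry) != ndim:
        # Assumption: a tuple must name exactly one structure per factor.
        raise ValueError(f"Expected {ndim} structures, got {len(entry)}.")
    return tuple(entry)


def update_every(k: int) -> Callable[[int], bool]:
    """Build a ``T``-style callable that is True on every ``k``-th step."""

    def should_update(step: int) -> bool:
        return step % k == 0

    return should_update


structures = {1: "dense", 2: ("dense", "diagonal"), 3: "diagonal"}
for ndim in (1, 2, 3):
    print(ndim, resolve_spec(structures, ndim))
```

A callable such as update_every(5) would be passed as T, and the dictionary as structures, when constructing the optimizer.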
Source code in sirfshampoo/optimizer.py
step
Perform a single optimization step.
Parameters:
- closure (Optional[Callable], default: None) – Not supported. Default: None.
Raises:
- NotImplementedError – If closure is not None.
Source code in sirfshampoo/optimizer.py
state_dict
Return a save-able state of the optimizer.
Returns:
- Dict[str, Any] – A dictionary containing the optimizer state.
Source code in sirfshampoo/optimizer.py
load_state_dict
Load an optimizer state.
Parameters:
- state_dict (Dict[str, Any]) – A dictionary containing a valid state obtained from this class's .state_dict() method.
Source code in sirfshampoo/optimizer.py
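STATE_ATTRIBUTES describes state that lives outside self.state but still has to survive a checkpoint. The sketch below shows one way such a pattern can work, using a toy class rather than the real optimizer: ToyOptimizer, its global_step attribute, and the dictionary keys are all hypothetical, and nothing here reflects how SIRFShampoo actually lays out its state dictionary.

```python
import copy
from typing import Any, Dict, List


class ToyOptimizer:
    """Toy stand-in that checkpoints attributes stored outside ``self.state``."""

    # Hypothetical attribute names; the real list is defined by the library.
    STATE_ATTRIBUTES: List[str] = ["global_step"]

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self.global_step = 0

    def state_dict(self) -> Dict[str, Any]:
        # Start from the regular state, then add the extra attributes.
        result: Dict[str, Any] = {"state": copy.deepcopy(self.state)}
        for name in self.STATE_ATTRIBUTES:
            result[name] = copy.deepcopy(getattr(self, name))
        return result

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restore the regular state and every extra attribute.
        self.state = copy.deepcopy(state_dict["state"])
        for name in self.STATE_ATTRIBUTES:
            setattr(self, name, copy.deepcopy(state_dict[name]))


source = ToyOptimizer()
source.global_step = 7
source.state["momentum"] = [0.1, 0.2]

restored = ToyOptimizer()
restored.load_state_dict(source.state_dict())
print(restored.global_step, restored.state)
```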
sirfshampoo.PreconditionerGroup
Bases: ABC
Interface for treating multiple tensors with one pre-conditioner.
Users who want to specify their own rule for grouping parameters into a single tensor, which is then treated with a Kronecker-factored pre-conditioner, should implement this interface.
group
abstractmethod
Combine multiple tensors into one.
This is the inverse operation of ungroup.
Parameters:
- tensors (List[Tensor]) – List of tensors to combine.
Returns:
- Tensor – Combined tensor.
Raises:
- NotImplementedError – Must be implemented by a child class.
Source code in sirfshampoo/combiner.py
identify
abstractmethod
Detect parameters that should be treated jointly.
Parameters:
- model (Module) – The neural network.
Returns:
- List[List[Parameter]] – A list whose entries are lists of parameters that are treated jointly.
Raises:
- NotImplementedError – Must be implemented by a child class.
Source code in sirfshampoo/combiner.py
ungroup
abstractmethod
Split a combined tensor into multiple tensors.
This is the inverse operation of group.
Parameters:
- grouped_tensor (Tensor) – Combined tensor.
- tensor_shapes (List[Size]) – Shapes of the tensors to split into.
Returns:
- List[Tensor] – List of tensors.
Raises:
- NotImplementedError – Must be implemented by a child class.
Source code in sirfshampoo/combiner.py
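The contract between group and ungroup is that one undoes the other: ungroup(group(tensors), shapes) should return the original tensors. As a rough illustration, the sketch below implements one simple rule that satisfies this contract (flatten everything and concatenate). Plain nested lists stand in for tensors so the example runs without PyTorch, tuples stand in for Size, and the free functions are hypothetical; a real implementation would subclass PreconditionerGroup and operate on Tensor objects.

```python
from math import prod
from typing import Any, List, Sequence, Tuple


def flatten(nested: Any) -> List[float]:
    """Flatten an arbitrarily nested list into a flat list of numbers."""
    if not isinstance(nested, list):
        return [nested]
    flat: List[float] = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def reshape(flat: Sequence[float], shape: Tuple[int, ...]) -> List[Any]:
    """Rebuild a nested list of the given (non-empty) shape from flat data."""
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])  # number of entries per slice along the first axis
    return [reshape(flat[i * step : (i + 1) * step], shape[1:]) for i in range(shape[0])]


def group(tensors: List[Any]) -> List[float]:
    """Combine several tensors into one flat vector."""
    return [value for tensor in tensors for value in flatten(tensor)]


def ungroup(grouped: List[float], tensor_shapes: List[Tuple[int, ...]]) -> List[Any]:
    """Split the flat vector back into tensors of the requested shapes."""
    tensors, start = [], 0
    for shape in tensor_shapes:
        size = prod(shape)
        tensors.append(reshape(grouped[start : start + size], shape))
        start += size
    return tensors


weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (2, 3)
bias = [7.0, 8.0]  # shape (2,)
grouped = group([weight, bias])
print(grouped)
print(ungroup(grouped, [(2, 3), (2,)]))
```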
sirfshampoo.PerParameter
Bases: PreconditionerGroup
Pre-conditioner group to treat each parameter with its own pre-conditioner.
sirfshampoo.LinearWeightBias
Bases: PreconditionerGroup
Treat weight and bias of a linear layer jointly.
Stacks the bias as last column to the weight matrix.
Attributes:
- LINEAR_CLS (Tuple[Type[Module]]) – Classes that should be detected as linear layers.
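To make "stacks the bias as last column" concrete: for a weight matrix of shape (out, in) and a bias of shape (out,), the combined matrix has shape (out, in + 1). The sketch below shows this with plain nested lists in place of tensors; stack_weight_bias and split_weight_bias are hypothetical names, and the real class works on the layer's Tensor parameters instead.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def stack_weight_bias(weight: Matrix, bias: List[float]) -> Matrix:
    """Append the bias as the last column of the weight matrix."""
    return [row + [b] for row, b in zip(weight, bias)]


def split_weight_bias(combined: Matrix) -> Tuple[Matrix, List[float]]:
    """Undo the stacking: all but the last column is weight, the last is bias."""
    weight = [row[:-1] for row in combined]
    bias = [row[-1] for row in combined]
    return weight, bias


weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (2, 3)
bias = [0.5, -0.5]  # shape (2,)

combined = stack_weight_bias(weight, bias)  # shape (2, 4)
print(combined)
print(split_weight_bias(combined))
```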
sirfshampoo.FlattenEmbedding
Bases: FlattenAndConcatenate
Treat flattened embedding weights with a pre-conditioner.
Attributes:
- EMBEDDING_CLS (Tuple[Type[Module]]) – Classes that should be detected as embedding layers.