Paper Alignment
This document details the alignment between the engram-peft implementation and the official DeepSeek Engram paper: "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" (arXiv:2601.07372).
Core Architecture Mapping
| Paper Component | Code Implementation | Implementation Detail |
|---|---|---|
| Tokenizer Compression (Sec 2.2) | compression.py | Implements the surjective mapping $P: V \to V'$ using NFKC + Lowercase + Strip. |
| Multi-Head Hashing (Sec 2.2) | hashing.py | Polynomial hashing followed by bitwise XOR. Uses prime-sized tables $M_{n,k}$. |
| Context-Aware Gating (Sec 2.3) | layer.py:ContextAwareGating | Formula: $\alpha_t = \sigma( \text{RMSNorm}(h_t)^\top \text{RMSNorm}(k_t) / \sqrt{d} )$. |
| Short-term Memory (Sec 2.3) | layer.py:ShortConv | Depth-wise 1D Conv with kernel size 4 and dilation equal to max N-gram size. |
| Mixed Optimization (Sec 4.1) | utils.py:get_optimizer | SparseAdam for retrieval embeddings with $5\times$ learning rate multiplier. |
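
The first two rows of the table can be illustrated with a small, dependency-free sketch. It is not the code in compression.py or hashing.py: the function names, the toy vocabulary, the multipliers, and the table sizes are all hypothetical, and the multiply-then-XOR form is only one plausible reading of "polynomial hashing followed by bitwise XOR".

```python
import unicodedata


def compress_vocab(vocab):
    """Build a raw-ID -> canonical-ID map using NFKC + lowercase + strip."""
    canonical_ids = {}  # normalized text -> canonical ID
    mapping = {}        # raw token ID -> canonical ID
    for token_id in sorted(vocab):
        key = unicodedata.normalize("NFKC", vocab[token_id]).lower().strip()
        # Tokens that normalize to the same text share one canonical ID.
        mapping[token_id] = canonical_ids.setdefault(key, len(canonical_ids))
    return mapping


def hash_ngram(canonical_ngram, multipliers, table_size):
    """Multiply each position by its coefficient, XOR the products, then reduce."""
    mixed = 0
    for token_id, multiplier in zip(canonical_ngram, multipliers):
        mixed ^= token_id * multiplier
    return mixed % table_size


def multi_head_indices(raw_ngram, mapping, heads):
    """Return one table index per head for an N-gram of raw token IDs."""
    canonical = [mapping[t] for t in raw_ngram]
    return [hash_ngram(canonical, mults, size) for mults, size in heads]


# Hypothetical toy vocabulary: three surface forms of "apple" plus "pie".
# Token 2 is "apple" written in full-width Unicode letters.
vocab = {0: "Apple", 1: " apple", 2: "\uff41\uff50\uff50\uff4c\uff45", 3: "pie"}
mapping = compress_vocab(vocab)

# Two heads, each with made-up multipliers and a prime table size.
heads = [((3, 5), 13), ((7, 11), 17)]
indices_a = multi_head_indices((0, 3), mapping, heads)  # "Apple" + "pie"
indices_b = multi_head_indices((2, 3), mapping, heads)  # full-width "apple" + "pie"
```

Because the three spellings of "apple" collapse to one canonical ID, `indices_a` and `indices_b` are identical, which is the property that lets differently tokenized surface forms retrieve the same memory rows.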
Hyperparameters (Appendix A Table 5)
Our EngramConfig defaults match the configurations the paper specifies for the 27B and 40B models. The table below compares each default against the paper's 27B values.
| Parameter | Paper Value (27B) | EngramConfig Default |
|---|---|---|
| Engram Dim $d_{mem}$ | 1280 | embedding_dim: 1280 |
| Engram Vocab Size | 2,262,400 | engram_vocab_size_per_ngram: [1131200, 1131200] |
| Hash Heads $K$ | 8 | n_head_per_ngram: 8 |
| Target Layers | [2, 15] | target_layers: [2, 15] |
| N-gram Orders $n$ | [2, 3] | ngram_sizes: [2, 3] |
| LR Multiplier | $5\times$ | learning_rate_multiplier: 5.0 |
| Conv Zero Init | True | conv_zero_init: True |
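
The table lists a single Engram vocab size, while the default is a per-order list. The sketch below mirrors the defaults in a stand-in dataclass and checks that the two agree, assuming the list holds one entry per N-gram order. `EngramConfigSketch` and `total_vocab_size` are hypothetical names used only for illustration; the field names and values are copied from the table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EngramConfigSketch:
    """Hypothetical stand-in for EngramConfig, with defaults from the table."""
    embedding_dim: int = 1280
    engram_vocab_size_per_ngram: List[int] = field(
        default_factory=lambda: [1131200, 1131200]
    )
    n_head_per_ngram: int = 8
    target_layers: List[int] = field(default_factory=lambda: [2, 15])
    ngram_sizes: List[int] = field(default_factory=lambda: [2, 3])
    learning_rate_multiplier: float = 5.0
    conv_zero_init: bool = True

    def total_vocab_size(self) -> int:
        """Sum the per-order vocab sizes (assumed: one entry per N-gram order)."""
        return sum(self.engram_vocab_size_per_ngram)


config = EngramConfigSketch()
total = config.total_vocab_size()  # compare against the paper's 2,262,400
```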
Official Implementation Details (via Demo)
The following specific implementation details from the DeepSeek official demo are incorporated:
- Gating Activation: `gate = gate.abs().clamp_min(1e-6).sqrt() * gate.sign()`. This ensures numerical stability and matches the exact gating behavior of the original research.
- Polynomial Hash Coefficients: Unique random multipliers are generated for each head and each layer to minimize hash collisions across the architecture.
- Efficiency: Embedding tables for all heads in a layer are concatenated into a single larger `nn.Embedding` and indexed using offsets for maximum GPU throughput.
- mHC (multi-Head Hyper-connection): We support `hc_mult=4`, which expands the hidden states before gating, as described in the paper's efficient hyper-connection section.
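
To make the gating activation concrete, here is a scalar, pure-Python sketch of the gate from the Core Architecture Mapping table combined with the signed square root quoted above. It assumes the signed square root is applied to the scaled dot product before the sigmoid, and it omits the learned RMSNorm scale; `rms_norm`, `signed_sqrt`, and `gate_value` are hypothetical helper names, not functions from layer.py.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def signed_sqrt(g, floor=1e-6):
    """Scalar form of gate.abs().clamp_min(1e-6).sqrt() * gate.sign()."""
    sign = (g > 0) - (g < 0)  # -1, 0, or 1, like torch.sign
    return math.sqrt(max(abs(g), floor)) * sign


def gate_value(h, k):
    """sigma(signed_sqrt(RMSNorm(h) . RMSNorm(k) / sqrt(d))) for one position."""
    d = len(h)
    score = sum(a * b for a, b in zip(rms_norm(h), rms_norm(k))) / math.sqrt(d)
    return 1.0 / (1.0 + math.exp(-signed_sqrt(score)))


# Hypothetical 4-dimensional hidden state and memory keys.
h = [0.5, -1.0, 2.0, 0.25]
aligned = gate_value(h, h)                 # key points the same way as h
opposed = gate_value(h, [-v for v in h])   # key points the opposite way
```

The clamp keeps the square root away from zero, where its gradient is unbounded, while the multiplication by the sign preserves the direction of the original score. An aligned key therefore yields a gate above 0.5 and an opposed key a gate below 0.5.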
Weight Reuse & Knowledge Transfer
One of the key practical advantages of the Engram design, implied by its deterministic hashing and modular nature, is the ability to reuse learned memory across different environments. We implement several enhancements beyond the paper's base training logic:
- Structural Invariance: Because each N-gram head is independent, weights can be migrated between models with different `target_layers` or different `engram_vocab_size_per_ngram` (via slicing/padding).
- Logic Alignment (Seeds & Tokenizers): The paper emphasizes "normalized textual equivalence." By leveraging this, we can align weights between different tokenizers (e.g., Llama-2 vs Qwen) by using character-level offset mappings on a reference corpus to synchronize the logical hashes.
- Cross-Model Knowledge Distillation: A trained Engram module can be treated as a portable "knowledge pack." Our implementation supports loading weights even when hashing seeds differ, by using a best-effort remapping strategy that recovers the semantic mapping from a sample of text data.
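
As a rough illustration of the slicing/padding mentioned under Structural Invariance, the sketch below resizes one head's embedding table, represented as a plain list of rows. `resize_table` is a hypothetical helper, and zero-filling the new rows is an assumption. Whether a resized table still lines up with the target model's hash indices depends on how those indices are computed, which this sketch does not address.

```python
def resize_table(rows, new_size, dim):
    """Truncate or zero-pad a list of embedding rows to exactly new_size rows."""
    if any(len(row) != dim for row in rows):
        raise ValueError("every row must have length dim")
    # Slicing: keep at most new_size existing rows (copied, not aliased).
    kept = [list(row) for row in rows[:new_size]]
    # Padding: append zero rows if the target table is larger (assumed init).
    padding = [[0.0] * dim for _ in range(new_size - len(kept))]
    return kept + padding


# Hypothetical 3-row table with 2-dimensional embeddings.
source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
smaller = resize_table(source, new_size=2, dim=2)
larger = resize_table(source, new_size=4, dim=2)
```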