Conversation
Code Review
This pull request implements a comprehensive "Strict Resume" feature for Transformers models, enabling the restoration of full training state including optimizer, scheduler, scaler, RNG states, and data progress. Key changes involve implementing load_training_state and read_training_progress across the model, server, and client layers, alongside dataloader enhancements to support sample-level skipping for map-style datasets. Feedback highlights several critical improvements: ensuring deterministic RNG in distributed settings by avoiding unseeded random states, addressing the deprecated use of StopIteration in generators, improving security by using weights_only=True during checkpoint loading, and removing an accidental BOM character in the client generator. Additionally, a more robust approach for re-initializing the dataloader is suggested to avoid modifying private PyTorch attributes.
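The StopIteration point follows from PEP 479: since Python 3.7, a `StopIteration` that escapes inside a generator body is converted to a `RuntimeError`, so a resume/skip generator must `return` instead. A minimal, standalone sketch of the correct pattern (`skip_n_then_yield` is an illustrative name, not the PR's actual function):

```python
def skip_n_then_yield(iterable, n):
    """Yield items after skipping the first n.

    Pre-PEP 479 code sometimes wrote `raise StopIteration` when the
    source ran dry; under PEP 479 that becomes a RuntimeError.  The
    correct pattern is simply to `return` from the generator.
    """
    it = iter(iterable)
    for _ in range(n):
        try:
            next(it)
        except StopIteration:
            return  # correct: end the generator, do NOT re-raise
    yield from it


print(list(skip_n_then_yield(range(5), 2)))   # [2, 3, 4]
print(list(skip_n_then_yield(range(2), 10)))  # []
```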
```python
self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True
```
Accessing and modifying the private attribute __initialized of torch.utils.data.DataLoader is brittle and relies on internal implementation details of PyTorch that could change. A safer approach to update the sampler stack after the dataloader has been created is to simply re-instantiate the underlying self.dataloader using the stored self.dataloader_params.
```diff
-self.dataloader.__initialized = False
-self._rebuild_sampler_stack()
-self.dataloader.__initialized = True
+if self.dataloader is not None:
+    self.dataloader = None
+    self._lazy_init_dataloader()
```
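The suggested re-instantiation approach can be sketched as follows; `DummyLoader` stands in for `torch.utils.data.DataLoader`, and the attribute names `dataloader_params` / `_lazy_init_dataloader` follow the review comment but are assumptions about the project's actual wrapper:

```python
class DummyLoader:
    """Stand-in for torch.utils.data.DataLoader."""
    def __init__(self, dataset, batch_size=1, sampler=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.sampler = sampler


class LoaderWrapper:
    def __init__(self, dataset, **params):
        self.dataset = dataset
        # Keep the constructor kwargs so the loader can be rebuilt later.
        self.dataloader_params = params
        self.dataloader = None
        self._lazy_init_dataloader()

    def _lazy_init_dataloader(self):
        if self.dataloader is None:
            self.dataloader = DummyLoader(self.dataset, **self.dataloader_params)

    def rebuild_with_sampler(self, sampler):
        # Instead of poking the private DataLoader.__initialized flag,
        # drop the old instance and build a fresh one from stored params.
        self.dataloader_params['sampler'] = sampler
        self.dataloader = None
        self._lazy_init_dataloader()


w = LoaderWrapper(list(range(8)), batch_size=4)
w.rebuild_with_sampler(sampler=[3, 2, 1, 0])
print(w.dataloader.sampler)  # [3, 2, 1, 0]
```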
/gemini summary
Summary of Changes

This pull request introduces robust checkpoint resumption capabilities to the training framework. By enabling the restoration of full training states (including optimizer, scheduler, and RNG configurations) and implementing precise data skipping in the dataloader, the changes ensure that training can be reliably resumed after interruptions. Additionally, the PR optimizes checkpoint handling for FSDP2 strategies and adds necessary API endpoints to support these features in distributed and remote training environments.

Highlights
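The "precise data skipping" for a map-style dataset usually amounts to dropping the first `consumed_train_samples` indices from a deterministically shuffled order. A minimal sketch of the idea (the `SkipSampler` class name and seeding scheme are illustrative, not the PR's implementation):

```python
import random


class SkipSampler:
    """Deterministically shuffles indices, then skips already-consumed ones."""

    def __init__(self, dataset_len, consumed_samples=0, seed=42):
        self.dataset_len = dataset_len
        self.consumed_samples = consumed_samples
        self.seed = seed

    def __iter__(self):
        order = list(range(self.dataset_len))
        # Seeded shuffle so every rank / restart sees the same order.
        random.Random(self.seed).shuffle(order)
        # Resume mid-epoch by skipping what was already trained on.
        return iter(order[self.consumed_samples % self.dataset_len:])

    def __len__(self):
        return self.dataset_len - self.consumed_samples % self.dataset_len


full = list(SkipSampler(10, consumed_samples=0))
resumed = list(SkipSampler(10, consumed_samples=4))
assert full[4:] == resumed  # the resumed run continues exactly where it left off
```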
```diff
@@ -99,21 +99,29 @@ def train():
 # model.set_lr_scheduler('LinearLR')
```
It is indeed a typo, but self_congnition.py already exists on main; would it be more appropriate to fix it in a separate PR?
```python
)
response.raise_for_status()

def load_training_state(self, name: str, **kwargs) -> Dict[str, Any]:
```
What is the difference between load_training_state and read_training_progress? Could they be merged into one?
```python
twinkle_path = model.save(
    name=f'twinkle-epoch-{epoch}',
    save_optimizer=True,
    consumed_train_samples=consumed_train_samples,
```
dataloader.get_consumed_samples()?
Alternatively, dataloader.get_state() would be more general.
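A `get_state()` / `load_state()` pair along the lines suggested might look like this; the class, method names, and state fields are a sketch of the idea, not the project's actual interface:

```python
class ResumableDataloader:
    """Illustrative dataloader that tracks and restores its progress."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
        self.consumed_samples = 0

    def __iter__(self):
        start = self.consumed_samples
        for i in range(start, len(self.dataset), self.batch_size):
            batch = self.dataset[i:i + self.batch_size]
            self.consumed_samples += len(batch)
            yield batch

    def get_state(self):
        # Everything needed to resume, in one serializable dict.
        return {'consumed_samples': self.consumed_samples,
                'batch_size': self.batch_size}

    def load_state(self, state):
        self.consumed_samples = state['consumed_samples']


dl = ResumableDataloader(list(range(10)), batch_size=2)
it = iter(dl)
next(it); next(it)        # consume two batches: [0, 1] and [2, 3]
state = dl.get_state()    # consumed_samples == 4

dl2 = ResumableDataloader(list(range(10)), batch_size=2)
dl2.load_state(state)
print(next(iter(dl2)))    # [4, 5] — resumes right after the consumed samples
```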
Also, please additionally test torchrun/ray compatibility here, as well as compatibility across both the megatron and transformers model backends.
- `model.save(name, save_optimizer=True, consumed_train_samples=...)` saves weights together with optimizer, scheduler, scaler, RNG, and `trainer_state.json`.
- `model.load(name, output_dir=..., adapter_name=...)` restores LoRA / adapter model weights.
- `model.read_training_progress(checkpoint_dir, ...)` reads checkpoint metadata such as `cur_step`, `gradient_accumulation_steps`, and `consumed_train_samples`.
These two are quite similar; would it be appropriate to merge them into one? For example:

```python
training_progress = model.resume_from_checkpoint(xxx)
dataloader.resume_from_checkpoint(training_progress.get('dataloader'))
```

Something like that?
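The combined API proposed in the comment could be shaped roughly like this; `resume_from_checkpoint` and the returned keys are hypothetical, with stub bodies standing in for the real restore logic:

```python
class Model:
    def resume_from_checkpoint(self, path):
        # In a real implementation this would read trainer_state.json
        # and restore optimizer / scheduler / RNG; here we just return
        # the metadata the caller needs (values are placeholders).
        return {'cur_step': 120,
                'dataloader': {'consumed_train_samples': 960}}


class Dataloader:
    def __init__(self):
        self.consumed_train_samples = 0

    def resume_from_checkpoint(self, state):
        if state:
            self.consumed_train_samples = state['consumed_train_samples']


model, dataloader = Model(), Dataloader()
progress = model.resume_from_checkpoint('ckpt/twinkle-epoch-1')
dataloader.resume_from_checkpoint(progress.get('dataloader'))
print(dataloader.consumed_train_samples)  # 960
```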
```python
if optimizer is not None:
    optimizer_path = os.path.join(output_dir, 'optimizer.pt')
    if hasattr(self.strategy, 'save_optimizer_checkpoint'):
```
The division of responsibility here is a bit unclear: why do some strategies have save_optimizer_checkpoint while others do not?
A reader of the code will be left wondering under exactly which circumstances the strategy is supposed to handle saving.
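One way to make the responsibility uniform is to give the strategy base class a default implementation, so the model layer always delegates instead of using a `hasattr` check. This is a sketch under assumed names (`BaseStrategy`, `Fsdp2Strategy`), not the project's code; `pickle` stands in for `torch.save`:

```python
import os
import pickle
import tempfile


class BaseStrategy:
    def save_optimizer_checkpoint(self, model, optimizer_state, output_path):
        # Default: plain single-process save.  Subclasses override only
        # when the optimizer state is sharded and must be gathered first.
        with open(output_path, 'wb') as f:
            pickle.dump(optimizer_state, f)


class Fsdp2Strategy(BaseStrategy):
    def save_optimizer_checkpoint(self, model, optimizer_state, output_path):
        # Real code would gather the sharded FSDP2 state dict here;
        # wrapping it is just a marker for the override point.
        gathered = {'gathered': optimizer_state}
        super().save_optimizer_checkpoint(model, gathered, output_path)


# The model layer can now delegate unconditionally, with no hasattr check:
path = os.path.join(tempfile.gettempdir(), 'optimizer.pt')
Fsdp2Strategy().save_optimizer_checkpoint(None, {'lr': 0.1}, path)
with open(path, 'rb') as f:
    print(pickle.load(f))  # {'gathered': {'lr': 0.1}}
```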
```python
adapter_name = kwargs.pop('adapter_name', _default_adapter_name)
optimizer_config = self.optimizer_group[adapter_name]

if not Platform.is_master():
```
Both ray and torchrun need to be verified correct here, and the megatron side needs to be considered correspondingly as well.
```python
def save_optimizer_checkpoint(self, model, optimizer, output_path: str):
    fsdp_plugin = self._get_fsdp_plugin()
    if fsdp_plugin is not None and fsdp_plugin.fsdp_version == 2:
```
PR type
PR information
Implement full restoration of training state in TransformersModel and MultiLoraModel, including optimizer, scheduler, RNG configuration, and dataset skipping.