
Add Gemma 4 MLX install-path support #19065

Open
zeel2104 wants to merge 16 commits into pytorch:main from zeel2104:gemma4-mlx-install-path

Conversation


@zeel2104 zeel2104 commented Apr 23, 2026

Summary

Enable Gemma 4 on the MLX backend through the HuggingFace export/run path.

This PR:

  • adds Gemma 4 support for backends/mlx/examples/llm/export_llm_hf.py
  • adds Gemma 4 text-only support to backends/mlx/examples/llm/run_llm_hf.py
  • fixes Gemma 4 hybrid-cache handling for shared KV layout and mixed sliding/full-attention cache types
  • makes the normal installed package path work without PYTHONPATH
  • limits MLX docs and CI coverage to the exact Gemma 4 configuration that was validated

This PR does not add Gemma 4 support to the internal export_llm / examples/models/gemma4/ path.

Test plan

Manual validation on Apple Silicon macOS using the installed package from .venv/site-packages:

python -m executorch.backends.mlx.examples.llm.export_llm_hf \
  --model-id google/gemma-4-E2B-it \
  --output /tmp/gemma4_custom_qlinear_only_installed.pte \
  --qlinear 4w \
  --use-custom-sdpa \
  --use-custom-kv-cache

python -m executorch.backends.mlx.examples.llm.run_llm_hf \
  --pte /tmp/gemma4_custom_qlinear_only_installed.pte \
  --model-id google/gemma-4-E2B-it \
  --prompt "What is the capital of France?" \
  --max-new-tokens 50

Validation

  • installed import path resolves from .venv/lib/python3.12/site-packages/executorch/...
  • MLXBackend is registered from the installed package
  • export succeeds for google/gemma-4-E2B-it with --qlinear 4w --use-custom-sdpa --use-custom-kv-cache
  • runtime succeeds without PYTHONPATH
  • generated output contains "Paris"

Additional notes:

  • I also retried the clean editable install flow with python install_executorch.py --editable and verified that MLXBackend registers correctly there as well
  • I attempted a Gemma 3 smoke test, but google/gemma-3-1b-it is gated in my current environment (401 Unauthorized / GatedRepoError), so I could not complete a local Gemma 3 end-to-end rerun


pytorch-bot Bot commented Apr 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19065

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 Awaiting Approval, 2 New Failures

As of commit 719d2e8 with merge base d0b7934:

AWAITING APPROVAL - The following workflows need approval before CI can run:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-cla Bot commented Apr 23, 2026

Hi @zeel2104!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


meta-cla Bot commented Apr 23, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2026
@zeel2104 zeel2104 marked this pull request as draft April 23, 2026 12:23

# Check if model uses sliding window attention
sliding_window = getattr(model.config, "sliding_window", None)
# Check if model uses sliding window attention. Multimodal configs like
Contributor

Does this regress gemma3?

Author

I don’t expect this to regress Gemma 3. The change is just switching the sliding-window lookup to model.config.get_text_config(), which also covers the plain text config case and is needed for Gemma 4 where those attrs live under text_config. I scoped the logic to the same attribute lookup, not a Gemma-4-specific branch. I can also rerun a Gemma 3 smoke test and report back.
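For reference, a minimal sketch of the lookup change being described (variable names follow the diff above; get_text_config() is the standard transformers accessor, which returns the config object itself for plain text-only models):

# Before: read sliding_window off the top-level config; multimodal configs
# like Gemma 4 keep it under text_config, so this returned None for them.
# sliding_window = getattr(model.config, "sliding_window", None)

# After: resolve the text config first. For text-only models such as
# Gemma 3, get_text_config() returns the same config object, so the
# lookup is unchanged for them.
text_config = model.config.get_text_config()
sliding_window = getattr(text_config, "sliding_window", None)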

Contributor

Yeah, it would be great to try gemma3 as a smoke test.

If you are unable to access the version from Google, try the unsloth version unsloth/gemma-3-1b-it (https://github.com/pytorch/executorch/blob/main/.github/workflows/mlx.yml#L469C18-L469C39)

Comment thread backends/mlx/examples/llm/run_llm_hf.py Outdated
logger = logging.getLogger(__name__)


def _iter_mlx_backend_candidates():
Contributor

This code should not be needed. Did you do:

python install_executorch.py --editable

on a mac machine with xcode installed? If so, in the install logs, did you see a comment about MLX installation being skipped for some reason?

Comment thread backends/mlx/runtime/MLXBackend.cpp Outdated
}

try {
std::cerr << "MLX init: constructing handle" << std::endl;
Contributor

Remove debug logging?

Author

Yes, this was debug-only while I was chasing the install/runtime registration issue. I’ll remove the std::cerr logging before merge.

@metascroy
Contributor

Looks fantastic!

A couple questions:

  1. Are we regressing gemma3 support at all?
  2. Does it work without "--use-custom-sdpa --use-custom-kv-cache" flags? If not, why? (This PR can stay focused on the custom path, I'm just curious what went wrong)
  3. Did you try embedding quant? If so, did something go wrong?

Comment thread .github/workflows/mlx.yml

QEMBEDDING_ARGS="--qembedding ${QCONFIG}"
if [ "${MODEL_ID}" = "google/gemma-4-E2B-it" ]; then
QEMBEDDING_ARGS=""
Contributor

Why no embedding?

Comment thread backends/mlx/examples/llm/run_llm_hf.py Outdated

logger.info(f"Loading model from {pte_path}...")
et_runtime = Runtime.get()
et_runtime = _ensure_mlx_backend_registered()
Contributor

This shouldn't be needed, see comment on the install process.

Author

That’s fair. I added this while debugging the installed-package path locally because MLXBackend was not being registered from the installed package, and I wanted a way to keep validating the runtime path. Since the install-path issue is now fixed, I’ll remove it and rely on the normal install flow.

# Decode only the newly generated tokens (not the input prompt)
new_tokens = generated_tokens[seq_len:]
generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
generated_text = text_processor.decode(new_tokens, skip_special_tokens=True)
Contributor

Does this break the path where uses_processor=False?

Can we unify these two paths somehow?

Author

I ended up unifying this path. text_processor is now either an AutoProcessor or an AutoTokenizer, and both decode through text_processor.decode(...), so the uses_processor=False case should still work. The remaining split is only at encode time, where AutoProcessor needs processor(text=..., return_tensors="pt") and AutoTokenizer still uses encode(...).
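Roughly, the unified shape looks like this (a hedged sketch; the exact variable names in run_llm_hf.py may differ):

from transformers import AutoProcessor, AutoTokenizer

# text_processor ends up as either an AutoProcessor (multimodal repos) or
# an AutoTokenizer (text-only repos); both implement decode().
try:
    text_processor = AutoProcessor.from_pretrained(model_id)
    uses_processor = True
except Exception:  # repo ships no processor config
    text_processor = AutoTokenizer.from_pretrained(model_id)
    uses_processor = False

# Encode: the only remaining split between the two paths.
if uses_processor:
    input_ids = text_processor(text=prompt, return_tensors="pt").input_ids
else:
    input_ids = text_processor.encode(prompt, return_tensors="pt")

# Decode: shared, since both classes expose decode().
generated_text = text_processor.decode(new_tokens, skip_special_tokens=True)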

@zeel2104
Author

zeel2104 commented Apr 24, 2026

  1. Are we regressing gemma3 support at all?
  2. Does it work without "--use-custom-sdpa --use-custom-kv-cache" flags? If not, why? (This PR can stay focused on the custom path, I'm just curious what went wrong)
  3. Did you try embedding quant? If so, did something go wrong?

I don’t expect a Gemma 3 regression from these changes.
I kept this PR to the Gemma 4 path I could validate end to end.

I did not get a non-custom Gemma 4 path to a validated state here; the issues I hit were around Gemma 4’s hybrid/shared-KV cache layout and mixed sliding/full-attention behavior, so I focused on the custom SDPA + custom KV cache path.
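To illustrate what "mixed sliding/full-attention" means for the cache, here is a hedged sketch (layer_types and its values come from the transformers Gemma configs; the allocation step is illustrative, not the PR's exact code):

# Hybrid Gemma models tag each decoder layer as sliding or full attention;
# a hybrid KV cache has to size each layer's buffers accordingly.
text_config = model.config.get_text_config()
layer_types = getattr(text_config, "layer_types", None) or []

for layer_idx, layer_type in enumerate(layer_types):
    if layer_type == "sliding_attention":
        # sliding layers only ever need a window-sized ring buffer
        max_cache_len = text_config.sliding_window
    else:  # "full_attention"
        # full layers need the whole context length
        max_cache_len = max_seq_len
    # ... allocate this layer's K/V buffers with max_cache_len ...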

I also did not land --qembedding because I wasn’t able to validate it reliably for Gemma 4. The configuration that worked consistently was:
--qlinear 4w --use-custom-sdpa --use-custom-kv-cache

That’s why docs and CI are limited to that exact configuration in this PR.

@zeel2104
Author

@metascroy
I retried the clean editable flow locally and it works now.

I ran python install_executorch.py --editable and verified that MLXBackend is registered from the editable install, so you’re right that this runtime fallback should not be needed. I’ve removed the helper in the latest commit and now rely on the normal install flow.

@zeel2104 zeel2104 marked this pull request as ready for review April 24, 2026 01:30
@zeel2104
Author

zeel2104 commented Apr 24, 2026

I tried to rerun a Gemma 3 smoke test locally, but I’m currently blocked by Hugging Face access on google/gemma-3-1b-it rather than by an MLX/export failure.

The request fails at model download with:

  • 401 Unauthorized
  • GatedRepoError: Cannot access gated repo

So I wasn’t able to complete a Gemma 3 end-to-end rerun in this environment. I still don’t expect this change to regress Gemma 3, since the relevant change here is switching the sliding-window lookup to model.config.get_text_config(), which should also cover the plain text config case, but I haven’t been able to revalidate Gemma 3 locally yet due to model access.

Let me know if you’d like me to dig further into the non-custom path or embedding quant in a follow-up.

@metascroy
Contributor

I tried to rerun a Gemma 3 smoke test locally, but I’m currently blocked by Hugging Face access on google/gemma-3-1b-it rather than by an MLX/export failure.

The request fails at model download with:

  • 401 Unauthorized
  • GatedRepoError: Cannot access gated repo

So I wasn’t able to complete a Gemma 3 end-to-end rerun in this environment. I still don’t expect this change to regress Gemma 3, since the relevant change here is switching the sliding-window lookup to model.config.get_text_config(), which should also cover the plain text config case, but I haven’t been able to revalidate Gemma 3 locally yet due to model access.

Let me know if you’d like me to dig further into the non-custom path or embedding quant in a follow-up.

For gemma3 verification, you can use the unsloth version model_id="unsloth/gemma-3-1b-it", which isn't gated. This is what we use in CI: https://github.com/pytorch/executorch/blob/main/.github/workflows/mlx.yml#L469

@metascroy
Contributor

  1. Are we regressing gemma3 support at all?
  2. Does it work without "--use-custom-sdpa --use-custom-kv-cache" flags? If not, why? (This PR can stay focused on the custom path, I'm just curious what went wrong)
  3. Did you try embedding quant? If so, did something go wrong?

I don’t expect a Gemma 3 regression from these changes. I kept this PR to the Gemma 4 path I could validate end to end.

I did not get a non-custom Gemma 4 path to a validated state here; the issues I hit were around Gemma 4’s hybrid/shared-KV cache layout and mixed sliding/full-attention behavior, so I focused on the custom SDPA + custom KV cache path.

I also did not land --qembedding because I wasn’t able to validate it reliably for Gemma 4. The configuration that worked consistently was: --qlinear 4w --use-custom-sdpa --use-custom-kv-cache

That’s why docs and CI are limited to that exact configuration in this PR.

I also did not land --qembedding because I wasn’t able to validate it reliably for Gemma 4

Can you say a bit more on what you mean by reliably? Did it fail to lower or run? Or did you run into model quality issues with quantized embeddings?

On non-custom path: I think it is fine to leave as follow-up, I was just curious about the specific errors you saw.

ET_LOG(Error, "MLX execute failed: %s", e.what());
return Error::Internal;
} catch (...) {
ET_LOG(Error, "MLX execute failed: unknown non-std exception");
Contributor

Did you hit this case?

Author

No, I did not specifically hit those C++ catch-all paths.

The failures I was debugging were earlier in the flow:

  • Python/export-time Gemma 4 compatibility issues in the HF export path
  • installed/editable install issues around getting the MLX path working cleanly
  • the DEBUG=release editable install failure in setup.py

So those catch blocks were not the source of the Gemma 4 bring-up work here.

}
return Error::InvalidProgram;
} catch (...) {
ET_LOG(Error, "Failed to load MLX program: unknown non-std exception");
Contributor

Did you hit this case?

Comment thread setup.py Outdated
# is Release.
def get_build_type(is_debug=None) -> str:
debug = int(os.environ.get("DEBUG", 0) or 0) if is_debug is None else is_debug
if is_debug is None:
Contributor

Were these changes for debugging only?

Author

No, these were not debug-only.

This came from the editable install path failing in my environment because DEBUG=release, while the existing code assumed DEBUG was always integer-like. The get_build_type() change makes that handling robust for string values like release / debug / true / false, which unblocked python install_executorch.py --editable for me.

I re-ran the editable install after this change and verified that MLXBackend registers correctly there.
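A hedged sketch of the more tolerant parsing described above (it mirrors the comment rather than the exact committed code):

import os

def get_build_type(is_debug=None) -> str:
    # Accept integer-like values (DEBUG=1) as well as string values such as
    # DEBUG=release / debug / true / false, instead of assuming int() works.
    if is_debug is None:
        raw = str(os.environ.get("DEBUG", "0")).strip().lower()
        if raw in ("1", "true", "on", "debug"):
            debug = True
        elif raw in ("", "0", "false", "off", "release"):
            debug = False
        else:
            raise ValueError(f"Unrecognized DEBUG value: {raw!r}")
    else:
        debug = bool(is_debug)
    return "Debug" if debug else "Release"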

Contributor

I'd rather not touch setup.py for this task, unless it is actually needed.

If things work with "python install_executorch.py --editable", then let's leave these setup improvements for another PR.

@zeel2104
Author

  1. Are we regressing gemma3 support at all?
  2. Does it work without "--use-custom-sdpa --use-custom-kv-cache" flags? If not, why? (This PR can stay focused on the custom path, I'm just curious what went wrong)
  3. Did you try embedding quant? If so, did something go wrong?

I don’t expect a Gemma 3 regression from these changes. I kept this PR to the Gemma 4 path I could validate end to end.
I did not get a non-custom Gemma 4 path to a validated state here; the issues I hit were around Gemma 4’s hybrid/shared-KV cache layout and mixed sliding/full-attention behavior, so I focused on the custom SDPA + custom KV cache path.
I also did not land --qembedding because I wasn’t able to validate it reliably for Gemma 4. The configuration that worked consistently was: --qlinear 4w --use-custom-sdpa --use-custom-kv-cache
That’s why docs and CI are limited to that exact configuration in this PR.

I also did not land --qembedding because I wasn’t able to validate it reliably for Gemma 4

Can you say a bit more on what you mean by reliably? Did it fail to lower or run? Or did you run into model quality issues with quantized embeddings?

On non-custom path: I think it is fine to leave as follow-up, I was just curious about the specific errors you saw.

For --qembedding, I didn’t fully characterize it for Gemma 4 before scoping it out of the PR. On the Gemma 3 smoke rerun I just did, the failure happened earlier in export in the custom KV-cache path, before I could isolate embedding quant behavior separately. So I still don’t have a clean “quality vs lower vs runtime” answer for --qembedding specifically; I just don’t have enough validated signal yet to document or CI-enable it.

@zeel2104
Author

I tried to rerun a Gemma 3 smoke test locally, but I’m currently blocked by Hugging Face access on google/gemma-3-1b-it rather than by an MLX/export failure.
The request fails at model download with:

  • 401 Unauthorized
  • GatedRepoError: Cannot access gated repo

So I wasn’t able to complete a Gemma 3 end-to-end rerun in this environment. I still don’t expect this change to regress Gemma 3, since the relevant change here is switching the sliding-window lookup to model.config.get_text_config(), which should also cover the plain text config case, but I haven’t been able to revalidate Gemma 3 locally yet due to model access.
Let me know if you’d like me to dig further into the non-custom path or embedding quant in a follow-up.

For gemma3 verification, you can use the unsloth version model_id="unsloth/gemma-3-1b-it", which isn't gated. This is what we use in CI: https://github.com/pytorch/executorch/blob/main/.github/workflows/mlx.yml#L469

I reran the Gemma 3 smoke test locally using the ungated CI model unsloth/gemma-3-1b-it, and I did hit a failure in the current custom Gemma path.

Tested with:
--qlinear 4w --qembedding 4w --use-custom-sdpa --use-custom-kv-cache

The failure happens during export, before lowering/runtime:

  • inside replace_hf_cache_with_mlx_ring_buffer(...)
  • when constructing HFStaticCache
  • it reaches HF early_initialization
  • then fails with:

TypeError: zeros(): argument 'size' failed to unpack the object at pos 2 with error "type must be tuple of ints, but got list"

So at least in my local environment, this does look like a Gemma 3 regression in the custom KV-cache path rather than just a Gemma 4-only issue.

The later run_llm_hf.py failure was only because the .pte file was never produced after export failed.

@zeel2104
Author

Happy to investigate that Gemma 3 custom-cache regression further if you want that covered before merge, or I can keep this PR scoped strictly to the Gemma 4 path that was validated.

@metascroy
Contributor

Happy to investigate that Gemma 3 custom-cache regression further if you want that covered before merge, or I can keep this PR scoped strictly to the Gemma 4 path that was validated.

Let's see what CI says. You can keep the change scoped to gemma 4, but we cannot have gemma 3 regressing because of your change.

@metascroy
Contributor

metascroy commented Apr 27, 2026

@zeel2104 CI for gemma3 (custom path only) is failing with:

File ".../transformers/cache_utils.py", line 798, in early_initialization
    fake_keys_tensor = torch.zeros((batch_size, num_heads, 0, head_dim), dtype=dtype, device=device)
TypeError: zeros(): argument 'size' failed to unpack the object at pos 2 with error
  "type must be tuple of ints, but got list"

whereas it was previously passing. I suspect there is a breaking change in HF interfaces, and your changes for the custom path are implicitly depending on this breaking change. Can you make sure your changes work against the pin in https://github.com/pytorch/executorch/blob/main/.ci/docker/ci_commit_pins/optimum-executorch.txt (which is what we run in CI).

See the tests in https://github.com/pytorch/executorch/blob/main/.github/workflows/mlx.yml for setup (and transformer version we pin against).

Let me know if this isn't possible to do for Gemma4.

@zeel2104
Author

Makes sense. The setup.py changes were packaging/install-path improvements rather than core Gemma 4 MLX support, so I’ll take them out of this PR and keep them for a follow-up if still useful.

@zeel2104
Author

@zeel2104 CI for gemma3 (custom path only) is failing with:

File ".../transformers/cache_utils.py", line 798, in early_initialization
    fake_keys_tensor = torch.zeros((batch_size, num_heads, 0, head_dim), dtype=dtype, device=device)
TypeError: zeros(): argument 'size' failed to unpack the object at pos 2 with error
  "type must be tuple of ints, but got list"

whereas it was previously passing. I suspect there is a breaking change in HF interfaces, and your changes for the custom path are implicitly depending on this breaking change. Can you make sure your changes work against the pin in https://github.com/pytorch/executorch/blob/main/.ci/docker/ci_commit_pins/optimum-executorch.txt (which is what we run in CI).

See the tests in https://github.com/pytorch/executorch/blob/main/.github/workflows/mlx.yml for setup (and transformer version we pin against).

Let me know if this isn't possible to do for Gemma4.

Thanks, I tracked this down to an HF cache API compatibility issue.

My custom cache replacement had started assuming newer HF cache-layer behavior than the version pinned in CI. I updated it to handle both the older pinned interface and the newer Gemma 4-capable interface.

After the fix:

  • Gemma 3 custom path export works again with unsloth/gemma-3-1b-it
  • Gemma 4 custom path still works with google/gemma-4-E2B-it

So this should avoid the Gemma 3 regression while keeping the Gemma 4 support intact.

@metascroy
Contributor

Re-running CI

@metascroy
Contributor

metascroy commented Apr 28, 2026

Gemma3 is working again, but it looks like gemma4 failed in CI :(

2026-04-28T03:02:31.6119920Z [INFO 2026-04-28 03:01:56,682 _client.py:1025] HTTP Request: HEAD https://huggingface.co/google/gemma-4-E2B-it/resolve/main/preprocessor_config.json "HTTP/1.1 404 Not Found"
2026-04-28T03:02:31.6120570Z [INFO 2026-04-28 03:01:56,685 run_llm_hf.py:60] Loaded processor from HuggingFace: google/gemma-4-E2B-it
2026-04-28T03:02:31.6121060Z [INFO 2026-04-28 03:01:56,685 run_llm_hf.py:103] Loading model from /tmp/gemma4-e2b.pte...
2026-04-28T03:02:31.6121430Z [INFO 2026-04-28 03:01:56,705 run_llm_hf.py:108] Model input_ids max seq len: 511
2026-04-28T03:02:31.6121840Z [INFO 2026-04-28 03:02:23,619 run_llm_hf.py:112] Encoding prompt: 'What is the capital of France?'
2026-04-28T03:02:31.6122390Z [INFO 2026-04-28 03:02:23,718 run_llm_hf.py:121] Input shape: torch.Size([1, 16])
2026-04-28T03:02:31.6122800Z [INFO 2026-04-28 03:02:23,719 run_llm_hf.py:139] Running full-prompt prefill (16 tokens)...
2026-04-28T03:02:31.6123170Z [cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
2026-04-28T03:02:31.6123520Z [cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
2026-04-28T03:02:31.6123890Z [INFO 2026-04-28 03:02:27,735 run_llm_hf.py:145] Prefill time: 4.017s (4.0 tokens/sec)
2026-04-28T03:02:31.6124270Z [INFO 2026-04-28 03:02:27,738 run_llm_hf.py:156] Generating up to 50 tokens...
2026-04-28T03:02:31.6124480Z 
2026-04-28T03:02:31.6124550Z Prefill time: 4.017s (4.0 tok/s)
2026-04-28T03:02:31.6124760Z Decode time:  2.137s (50 tokens, 23.4 tok/s)
2026-04-28T03:02:31.6124930Z 
2026-04-28T03:02:31.6125010Z ============================================================
2026-04-28T03:02:31.6125310Z Generated text:
2026-04-28T03:02:31.6125490Z ============================================================
2026-04-28T03:02:31.6125720Z The
2026-04-28T03:02:31.6125880Z ============================================================
2026-04-28T03:02:31.6126090Z + grep -iq Paris
[... the test script then re-echoes the full captured log: HuggingFace HTTP request noise followed by a duplicate of the run output above; elided ...]
2026-04-28T03:02:31.6184300Z + echo 'Failed: Expected '\''Paris'\'' not found in output'
2026-04-28T03:02:31.6184560Z Failed: Expected 'Paris' not found in output
2026-04-28T03:02:31.6184770Z + exit 1
2026-04-28T03:02:31.6184930Z Traceback (most recent call last):
2026-04-28T03:02:31.6185340Z   File "/Users/runner/work/executorch/executorch/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
2026-04-28T03:02:31.6185790Z     main()
2026-04-28T03:02:31.6185970Z     ~~~~^^
2026-04-28T03:02:31.6186360Z   File "/Users/runner/work/executorch/executorch/test-infra/.github/scripts/run_with_env_secrets.py", line 61, in main
2026-04-28T03:02:31.6186860Z     run_cmd_or_die(f"bash {os.environ.get('RUNNER_TEMP', '')}/exec_script")
2026-04-28T03:02:31.6187170Z     ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-04-28T03:02:31.6187620Z   File "/Users/runner/work/executorch/executorch/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
2026-04-28T03:02:31.6188150Z     raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
2026-04-28T03:02:31.6188600Z RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 1
2026-04-28T03:02:31.6402700Z ##[error]Process completed with exit code 1.
Comment thread .github/workflows/mlx.yml Outdated
if [ "${MODEL_ID}" = "google/gemma-4-E2B-it" ]; then
# Gemma 4 requires a newer Transformers build than the CI-wide
# optimum-executorch pin currently brings in.
${CONDA_RUN} pip install -U "transformers @ git+https://github.com/huggingface/transformers.git"
Contributor

@metascroy metascroy Apr 28, 2026

Can we pin on something specific? Whatever version you pin on, add to README under gemma4 section.

@zeel2104
Author

@metascroy

The Gemma 4 failure changed after the last CI fix: export and runtime now work, but output quality regressed on a floating transformers HEAD.

I pinned Gemma 4 to the transformers commit I validated locally:
61461a7bcb458db7cf6eeea49678b9ab776a7821

I added the same pin to the README as well.

I haven’t touched the Qwen35-MoE threshold yet since that still looks separate.

@metascroy
Contributor

@zeel2104 it looks like the gemma4 test is failing in CI with:

2026-04-28T21:04:23.9629920Z Prefill time: 4.052s (3.9 tok/s)
2026-04-28T21:04:23.9630110Z Decode time:  2.477s (50 tokens, 20.2 tok/s)
2026-04-28T21:04:23.9630250Z 
2026-04-28T21:04:23.9630330Z ============================================================
2026-04-28T21:04:23.9630540Z Generated text:
2026-04-28T21:04:23.9630700Z ============================================================
2026-04-28T21:04:23.9630890Z The
2026-04-28T21:04:23.9631040Z ============================================================'
2026-04-28T21:04:23.9631230Z + grep -iq Paris
2026-04-28T21:04:23.9631430Z + echo 'Failed: Expected '\''Paris'\'' not found in output'
2026-04-28T21:04:23.9631670Z Failed: Expected 'Paris' not found in output
2026-04-28T21:04:23.9631890Z + exit 1
2026-04-28T21:04:24.0010380Z ##[error]Process completed with exit code 1.

@zeel2104
Author

I pushed one more Gemma 4 follow-up.

CI is getting through export and runtime now, so I updated run_llm_hf.py to prefer AutoTokenizer for this text-only flow and only fall back to AutoProcessor if needed.
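The flipped preference is small (a hedged sketch; the fallback condition in the actual diff may be narrower):

from transformers import AutoProcessor, AutoTokenizer

# Text-only flow: try the tokenizer first, so we don't depend on processor
# configs the repo may not ship; fall back to AutoProcessor if needed.
try:
    text_processor = AutoTokenizer.from_pretrained(model_id)
except Exception:
    text_processor = AutoProcessor.from_pretrained(model_id)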

I left the Qwen35-MoE threshold unchanged since that still looks separate.

@metascroy
Contributor

I pushed one more Gemma 4 follow-up.

CI is getting through export and runtime now, so I updated run_llm_hf.py to prefer AutoTokenizer for this text-only flow and only fall back to AutoProcessor if needed.

I left the Qwen35-MoE threshold unchanged since that still looks separate.

Yeah, you can ignore the Qwen3.5-MOE. It is unrelated.

Re-running CI with your latest changes.

@zeel2104
Author

The gemma4 CI failed again; this time it was due to moving dependencies in the validation path.

First, CI was using a transformers version that couldn’t load gemma4. After fixing that, CI was still pulling a newer Hugging Face model snapshot than the one I had validated locally.

To make the Gemma 4 path reproducible, I pinned both:

  • transformers commit: 61461a7bcb458db7cf6eeea49678b9ab776a7821
  • model revision: b4a601102c3d45e2b7b50e2057a6d5ec8ed4adcf

I also wired the model revision through the export/run scripts and added the same pins to the README.
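For reference, threading the pinned revision through the HF loading calls looks roughly like this (hedged; the scripts' flag plumbing and model class may differ):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-E2B-it"
REVISION = "b4a601102c3d45e2b7b50e2057a6d5ec8ed4adcf"  # pinned snapshot

# Every from_pretrained call accepts revision=, so export and run both use
# the exact snapshot that was validated instead of a floating main.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REVISION)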
Hopefully it works now.

else:
raise NotImplementedError(f"Support for input {arg} is not implemented")

placeholder_nodes = {
Contributor

I don't follow this change.

Why is gemma4 sensitive to this?

Author

I got here by diffing a previously working Gemma 4 .pte against a fresh export.

What changed there was the slot assignment for the two rotary constants used by sliding-window vs full attention. This change was just to make that assignment deterministic instead of depending on raw placeholder traversal order.
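A hedged sketch of what deterministic assignment could look like (placeholder iteration comes from the torch.export graph; the sort key here is illustrative):

# Instead of depending on the raw traversal order of graph placeholders
# (which can shuffle which slot each rotary-embedding constant lands in),
# sort them by a stable key before assigning slots.
placeholder_nodes = sorted(
    (n for n in exported_program.graph.nodes if n.op == "placeholder"),
    key=lambda n: n.name,
)
slot_for_node = {node: idx for idx, node in enumerate(placeholder_nodes)}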

Gemma 4 is where I noticed it because that model exercises both constants in the same path.

If you’d prefer, I can drop this.

@zeel2104
Author

@metascroy
I tried a few different things to narrow this down. Export and runtime are working, and it doesn’t seem to be coming from the custom SDPA path, custom KV cache path, or the custom HF wrapper path anymore.

At this point it looks more like a Gemma 4 4w issue specifically. Would you like me to keep digging on 4w here, or should we narrow the Gemma 4 scope for now and handle 4w separately?

@metascroy
Contributor

@metascroy I tried a few different things to narrow this down. Export and runtime are working, and it doesn’t seem to be coming from the custom SDPA path, custom KV cache path, or the custom HF wrapper path anymore.

At this point it looks more like a Gemma 4 4w issue specifically. Would you like me to keep digging on 4w here, or should we narrow the Gemma 4 scope for now and handle 4w separately?

It would be good to get 4w working. Let me try checking out your PR today to see if I notice anything.
