Hi,
Is there a specific reason why the input shape seems to be [time, mel_bins] in vision_transformer.py:
PretrainedSED/models/asit/vision_transformer.py
Lines 118 to 123 in 1aa47e4
```python
class VisionTransformer(nn.Module):
    """ Vision Transformer """
    def __init__(self, audio_size=[1024, 128], patch_size=[16, 16], in_chans=3, num_classes=0, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
                 drop_path_rate=0., norm_layer=nn.LayerNorm, **kwargs):
```
but it is [mel_bins, time] when the model is defined in ASIT_wrapper.py?
PretrainedSED/models/asit/ASIT_wrapper.py
Lines 10 to 16 in 1aa47e4
```python
self.asit = vit_base(
    patch_size=[16, 16],
    audio_size=[128, 592],
    stride=[16, 16],
    in_chans=1,
    num_classes=0
)
```
I would expect the order of these dimensions to be fixed, since the positional encodings will change otherwise.
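To illustrate the concern, here is a minimal sketch (not from the repository; it only assumes non-overlapping [16, 16] patches, i.e. stride equal to patch size as in both snippets above) of how the patch grid, and therefore the set of positional embeddings, differs between the two dimension orders:

```python
# Hypothetical helper: compute the patch grid implied by an audio_size,
# assuming non-overlapping 16x16 patches (stride == patch_size).
def patch_grid(audio_size, patch_size=(16, 16)):
    return (audio_size[0] // patch_size[0], audio_size[1] // patch_size[1])

# Default in vision_transformer.py: audio_size=[1024, 128] ([time, mel_bins])
grid_a = patch_grid([1024, 128])  # (64, 8) -> 64 * 8 = 512 patches

# Call in ASIT_wrapper.py: audio_size=[128, 592] ([mel_bins, time])
grid_b = patch_grid([128, 592])   # (8, 37) -> 8 * 37 = 296 patches

print(grid_a, grid_b)
```

Since the two orderings yield different patch grids (64×8 vs. 8×37), the learned positional embeddings would be laid out over different axes, which is why a fixed convention for the dimension order seems necessary.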