Hi,
Is there a specific reason why the input shape seems to be [time, mel_bins] in vision_transformer.py:
PretrainedSED/models/asit/vision_transformer.py
Lines 118 to 123 in 1aa47e4
```python
class VisionTransformer(nn.Module):
    """ Vision Transformer """
    def __init__(self, audio_size=[1024, 128], patch_size=[16, 16], in_chans=3, num_classes=0, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
                 drop_path_rate=0., norm_layer=nn.LayerNorm, **kwargs):
```
but it is [mel_bins, time] when the model is defined in ASIT_wrapper.py?
PretrainedSED/models/asit/ASIT_wrapper.py
Lines 10 to 16 in 1aa47e4
```python
self.asit = vit_base(
    patch_size=[16, 16],
    audio_size=[128, 592],
    stride=[16, 16],
    in_chans=1,
    num_classes=0
)
```
I would expect the order of these dimensions to be fixed, since the positional encodings will change otherwise.
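To illustrate the concern, here is a minimal sketch (not from the repository; it only assumes non-overlapping [16, 16] patches, i.e. stride equal to patch size as in both snippets above) of how the patch grid, and therefore the set of positional embeddings, differs between the two dimension orders:

```python
# Hypothetical helper: compute the patch grid implied by an audio_size,
# assuming non-overlapping 16x16 patches (stride == patch_size).
def patch_grid(audio_size, patch_size=(16, 16)):
    return (audio_size[0] // patch_size[0], audio_size[1] // patch_size[1])

# Default in vision_transformer.py: audio_size=[1024, 128] ([time, mel_bins])
grid_a = patch_grid([1024, 128])  # (64, 8) -> 64 * 8 = 512 patches

# Call in ASIT_wrapper.py: audio_size=[128, 592] ([mel_bins, time])
grid_b = patch_grid([128, 592])   # (8, 37) -> 8 * 37 = 296 patches

print(grid_a, grid_b)
```

Since the two orderings yield different patch grids (64×8 vs. 8×37), the learned positional embeddings would be laid out over different axes, which is why a fixed convention for the dimension order seems necessary.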