# Configuration

PLSC uses [YAML](https://en.wikipedia.org/wiki/YAML) files for unified configuration. The aim is to make all experimental results clearly expressed and reproducible. A configuration file contains several sections, including:

* Global
* FP16
* DistributedStrategy
* Model
* Loss
* Metric
* LRScheduler
* Optimizer
* DataLoader
* Export

## Global

```yaml
# example
Global:
  task_type: recognition
  train_epoch_func: default_train_one_epoch
  eval_func: face_verification_eval
  checkpoint: null
  finetune: False
  pretrained_model: null
  output_dir: ./output/
  device: gpu
  save_interval: 1
  max_num_latest_checkpoint: 0
  eval_during_train: True
  eval_interval: 2000
  eval_unit: "step"
  accum_steps: 1
  epochs: 25
  print_batch_step: 100
  use_visualdl: True
  seed: 2022
```

* `task_type`: Task type. Currently `classification` and `recognition` are supported. Default is `classification`.
* `train_epoch_func`: The training function, usually defined in `plsc/engine/task_type/train.py`. Each task defines a default `default_train_one_epoch` function. If the provided training function does not meet your needs, you can add a custom one.
* `eval_func`: The evaluation function, analogous to `train_epoch_func` and usually defined in `plsc/engine/task_type/evaluation.py`. Default is `default_eval`.
* `checkpoint`: The saved checkpoint prefix used to resume training after it was terminated midway, e.g. `output/IResNet50/latest`. Default is `null`.
* `pretrained_model`: Pre-trained weight path prefix, which needs to be set together with the `finetune` parameter, e.g. `output/IResNet50/best_model`. Default is `null`.
* `finetune`: Indicates whether the loaded pre-trained weights are used for fine-tuning. Default is `False`.
* `output_dir`: Output directory path.
* `device`: Device type. Currently only `cpu` and `gpu` are supported.
* `save_interval`: The number of epochs between checkpoint saves.
* `max_num_latest_checkpoint`: The number of most recent checkpoints to keep; older ones are deleted.
* `eval_during_train`: Indicates whether to evaluate during training.
* `eval_interval`: The frequency of evaluation, which needs to be set together with `eval_unit`.
* `eval_unit`: The unit of `eval_interval`, either `step` or `epoch`.
* `accum_steps`: Gradient accumulation (merging). When a device cannot hold the configured `batch_size`, set `accum_steps` > 1 to enable it. When enabled, each batch is split into `accum_steps` runs. This only works in training mode. Default is `1`.
* `epochs`: The total number of training epochs.
* `print_batch_step`: The number of steps between log prints.
* `use_visualdl`: Whether to enable VisualDL.
* `seed`: Random number seed.
* `max_train_step`: Maximum number of training steps. When the current number of training steps exceeds this value, training stops early. Not set by default, in which case this feature is ignored.
* `flags`: A dictionary of FLAGS to set, for example `FLAGS_cudnn_exhaustive_search=0`. If not set, only `FLAGS_cudnn_exhaustive_search=1`, `FLAGS_cudnn_batchnorm_spatial_persistent=1`, and `FLAGS_max_inplace_grad_add=8` are enabled.

## FP16

```yaml
# example
FP16:
  level: O1 # 'O0', 'O1', 'O2'
  fp16_custom_white_list: []
  fp16_custom_black_list: []
  GradScaler:
    init_loss_scaling: 27648.0
    max_loss_scaling: 2.**32
    incr_ratio: 2.0
    decr_ratio: 0.5
    incr_every_n_steps: 1000
    decr_every_n_nan_or_inf: 2
    use_dynamic_loss_scaling: True
    no_unscale_list: ['dist']
```

When the FP16 section is not set, the `O0` level is used by default. The parameters above do not all need to be set explicitly. Any that are missing take the default values from the class initialization function.

* `level`: AMP optimization level, one of `O0`, `O1`, `O2`.
  `O0` turns AMP off. `O1` keeps parameters and gradients in FP32 and uses FP16 for activations. `O2` uses FP16 for parameters, gradients, and activations. Note that with `O2`, the master weight of the parameters is not configured here but in the Optimizer section.
* `no_unscale_list`: A special-purpose option. If a name listed in `no_unscale_list` appears in a parameter's name, the gradient of that parameter is not unscaled.

## DistributedStrategy

```yaml
# example
DistributedStrategy:
  data_parallel: True
  data_sharding: False
  recompute:
    layerlist_interval: 1
    names: []
```

Note: The distributed strategy configuration currently supports only data parallelism and recompute.

* `data_parallel`: Whether to use data parallelism.
* `data_sharding`: Whether to use data sharding parallelism. This is mutually exclusive with `data_parallel`.
* `layerlist_interval`: If `recompute` is set and the model contains an `nn.LayerList` layer, `layerlist_interval` sets the block interval at which recompute is enabled.
* `names`: If `recompute` is set, any layer whose name contains one of the entries in `names` enables recompute. This is mutually exclusive with `data_parallel`.

## Model

```yaml
# example
Model:
  name: IResNet50
  num_features: 512
  data_format: "NHWC"
  class_num: 93431
  pfc_config:
    sample_ratio: 0.1
    model_parallel: True
```

The `Model` section contains all configuration related to the network model. Each model may take different parameters, so it is best to check the definition in the model file directly. The `name` field must be set; the function or class with that name is instantiated. The other fields are passed as arguments to that function or to the class initialization function.

## Loss

```yaml
# example
Loss:
  Train:
    - ViTCELoss:
        weight: 1.0
        epsilon: 0.0001
  Eval:
    - CELoss:
        weight: 1.0
```

The `Loss` section contains a `Train` field and an optional `Eval` field. Each field can contain multiple loss functions.
For parameters, refer to the initialization function of the corresponding Loss class. Each loss function has a `weight` field, which sets its weight when multiple loss functions are combined.

## Metric

```yaml
# example
Metric:
  Train:
    - TopkAcc:
        topk: [1, 5]
  Eval:
    - TopkAcc:
        topk: [1, 5]
```

The `Metric` section contains a `Train` field and an optional `Eval` field. Each field can contain multiple metric functions. For parameters, refer to the initialization function of the corresponding Metric class.

## LRScheduler

```yaml
# example
LRScheduler:
  name: Step
  boundaries: [10, 16, 22]
  values: [0.2, 0.02, 0.002, 0.0002]
  decay_unit: epoch
```

The `LRScheduler` section contains all configuration related to the learning rate scheduler. Each `LRScheduler` may take different parameters, so it is best to check the definitions in `plsc/scheduler/` directly. The `name` field must be set; the function or class with that name is instantiated. The other fields are passed as arguments to that function or to the class initialization function.

## Optimizer

```yaml
# example
Optimizer:
  name: AdamW
  betas: (0.9, 0.999)
  epsilon: 1e-8
  weight_decay: 0.3
  use_master_param: False
  grad_clip:
    name: ClipGradByGlobalNorm
    clip_norm: 1.0
```

The `Optimizer` section contains all configuration related to the optimizer. Each `Optimizer` may take different parameters, so it is best to check the definitions in `plsc/optimizer/` directly. The `name` field must be set; the function or class with that name is instantiated. The other fields are passed as arguments to that function or to the class initialization function. When the optimizer is instantiated, the model parameters are organized into parameter groups.

* `use_master_param`: Indicates whether to use master weights during FP16 `O2` training.
* `grad_clip`: Configuration for gradient clipping. **Note:** Gradient clipping is performed separately for each param group.
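
To illustrate what clipping "separately for each param group" means, here is a minimal pure-Python sketch. It assumes the usual global-norm rule, in which every gradient in a group is scaled by `clip_norm / max(global_norm, clip_norm)`. The helper names and the group names `backbone` and `head` are hypothetical and are not taken from PLSC.

```python
import math
from typing import Dict, List

Grad = List[float]  # one flattened gradient tensor


def clip_group_by_global_norm(grads: List[Grad], clip_norm: float) -> List[Grad]:
    """Scale all gradients of ONE group so their joint L2 norm is at most clip_norm."""
    if clip_norm <= 0:
        raise ValueError("clip_norm must be positive")
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the group norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads]


def clip_per_param_group(
    param_groups: Dict[str, List[Grad]], clip_norm: float
) -> Dict[str, List[Grad]]:
    """Apply the clip to each group independently (hypothetical helper)."""
    return {
        name: clip_group_by_global_norm(grads, clip_norm)
        for name, grads in param_groups.items()
    }


# Hypothetical groups: the first exceeds clip_norm, the second does not.
groups = {
    "backbone": [[3.0, 4.0]],  # L2 norm 5.0, gets scaled down
    "head": [[0.3, 0.4]],      # L2 norm 0.5, left unchanged
}
clipped = clip_per_param_group(groups, clip_norm=1.0)
print(clipped)
```

Under this assumption, a large gradient in one group does not shrink the gradients of another group. A single clip over all parameters would scale both groups by the same factor.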
## DataLoader

```yaml
# example
DataLoader:
  Train:
    dataset:
      name: FaceIdentificationDataset
      image_root: ./dataset/MS1M_v3/
      cls_label_path: ./dataset/MS1M_v3/label.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.5, 0.5, 0.5]
            std: [0.5, 0.5, 0.5]
            order: ''
        - ToCHWImage:
    sampler:
      name: DistributedBatchSampler
      batch_size: 128
      drop_last: False
      shuffle: True
    loader:
      num_workers: 8
      use_shared_memory: True
  Eval:
    dataset:
      name: FaceVerificationDataset
      image_root: ./dataset/MS1M_v3/agedb_30
      cls_label_path: ./dataset/MS1M_v3/agedb_30/label.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.5, 0.5, 0.5]
            std: [0.5, 0.5, 0.5]
            order: ''
        - ToCHWImage:
    sampler:
      name: BatchSampler
      batch_size: 128
      drop_last: False
      shuffle: False
    loader:
      num_workers: 0
      use_shared_memory: True
```

The `DataLoader` section contains `Train` and `Eval` fields.

* `dataset`: Each `dataset` may take different parameters, so it is best to check the definitions in `plsc/data/dataset` directly. For data preprocessing operations, see `plsc/data/preprocess`.
* `sampler`: In general, `DistributedBatchSampler` meets the requirements of most data-parallel training. If it does not meet your needs, you can add a custom batch sampler in `plsc/data/sampler`, e.g. `RepeatedAugSampler`.
* `loader`: Multi-process configuration for data preprocessing.

## Export

```yaml
# example
Export:
  export_type: onnx
  input_shape: [None, 3, 112, 112]
```

The `Export` section contains the parameters required to export the model.

* `export_type`: The type of the exported model. Currently only `paddle` and `onnx` are supported.
* `input_shape`: Specifies the input shape of the exported model.
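
The `None` entry in the `input_shape` example presumably marks a dynamic dimension such as the batch size. A plain YAML loader reads an unquoted `None` as the string `"None"`, because YAML's null literals are `null` and `~`. How PLSC itself converts this value is not described here, so the sketch below only shows one way such a shape could be normalized. `normalize_input_shape` is a hypothetical helper and not part of PLSC.

```python
from typing import List, Optional, Sequence, Union

ShapeEntry = Union[int, str, None]

# Spellings treated as "dynamic dimension" in this sketch (an assumption).
_DYNAMIC_MARKERS = {"none", "null", "~"}


def normalize_input_shape(raw_shape: Sequence[ShapeEntry]) -> List[Optional[int]]:
    """Convert a YAML-loaded shape into ints, using None for dynamic dimensions."""
    shape: List[Optional[int]] = []
    for dim in raw_shape:
        is_marker = isinstance(dim, str) and dim.strip().lower() in _DYNAMIC_MARKERS
        if dim is None or is_marker:
            shape.append(None)  # dynamic dimension, e.g. the batch size
            continue
        size = int(dim)  # raises ValueError for non-numeric strings
        if size <= 0:
            raise ValueError(f"dimension must be positive, got {dim!r}")
        shape.append(size)
    return shape


# The shape from the example, as a plain YAML loader would deliver it.
print(normalize_input_shape(["None", 3, 112, 112]))
```

Normalizing the shape once, right after loading the file, keeps the string-versus-null ambiguity out of the export code.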