
Deep Learning Cheatsheet

Table of Contents

  1. Foundations of Deep Learning
  2. Neural Network Fundamentals
  3. Advanced Neural Architectures
  4. Training Dynamics and Optimization
  5. Regularization and Generalization
  6. Deep Learning for Specific Domains
  7. Advanced Training Techniques
  8. Best Practices and Advanced Tips

Foundations of Deep Learning

The Neural Network as a Universal Function Approximator

  • Cybenko's theorem and its implications
  • Depth vs. width trade-offs in network design

Information Theory in Deep Learning

  • Mutual information and the Information Bottleneck Theory
  • Implications for layer design and network depth

Tip: Understanding these theoretical foundations can guide your intuition when designing network architectures.

Neural Network Fundamentals

Activation Functions: A Deeper Dive

  • ReLU: f(x) = max(0, x)
    • Pros: No vanishing gradient for positive values, computationally efficient
    • Cons: "Dying ReLU" problem for negative inputs
  • Leaky ReLU: f(x) = max(αx, x), where α is a small constant (e.g., 0.01)
    • Addresses the dying ReLU problem
  • Parametric ReLU (PReLU): Learns the α parameter during training
  • Exponential Linear Unit (ELU): f(x) = x if x > 0, else α(e^x - 1)
    • Smooth function, can produce negative outputs
  • Scaled Exponential Linear Unit (SELU): Self-normalizing properties
    • Particularly useful for deep networks
  • Swish: f(x) = x * sigmoid(x)
    • Smooth, non-monotonic function

Tip: Experiment with different activation functions. While ReLU is a good default, others might perform better for specific tasks.
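
For reference, here is a minimal NumPy sketch of a few of the activations listed above (the α defaults are illustrative only; deep learning frameworks ship optimized, numerically careful implementations):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope avoids the "dying ReLU" problem
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth for x <= 0, can produce negative outputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x):
    return x * (1.0 / (1.0 + np.exp(-x)))  # x * sigmoid(x)
```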

Loss Functions: Advanced Considerations

  • Focal Loss: Addresses class imbalance in object detection
  • Dice Loss: Useful for image segmentation tasks
  • Contrastive Loss: For similarity learning in siamese networks
  • Triplet Loss: Used in face recognition and image retrieval

Tip: Custom loss functions can significantly improve performance for specific tasks. Don't hesitate to design task-specific losses.
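
As an example of a task-specific loss, here is a hedged PyTorch sketch of binary focal loss in the spirit of Lin et al.; the alpha and gamma values shown are common defaults, not universal settings, and targets are assumed to be 0/1 floats:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-element binary cross-entropy on raw logits
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    # Down-weight easy examples via the (1 - p_t)^gamma modulating factor
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```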

Optimizers: Beyond the Basics

  • Adam: Adaptive Moment Estimation
    • Combines ideas from RMSprop and momentum
    • Default choice for many practitioners
  • AdamW: Adam with decoupled weight decay
    • Often performs better than standard Adam
  • Lookahead: Can be combined with other optimizers
    • Maintains a slow weights copy, potentially improving convergence
  • LAMB: Layer-wise Adaptive Moments optimizer for Batch training
    • Useful for training with large batch sizes
  • Ranger: Combines Rectified Adam and Lookahead
    • Often provides fast convergence and good generalization

Tip: While Adam is a great default, experimenting with other optimizers can lead to faster convergence or better generalization.
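
Switching optimizers is usually a one-line change. A minimal PyTorch example using AdamW with decoupled weight decay (the model and hyperparameters are placeholders, not recommendations):

```python
import torch

model = torch.nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
```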

Advanced Neural Architectures

Convolutional Neural Networks (CNNs): Advanced Techniques

  • Depthwise Separable Convolutions: Used in MobileNets for efficiency
  • Dilated (Atrous) Convolutions: Increase receptive field without increasing parameters
  • Deformable Convolutions: Adapt to geometric variations in input
  • Squeeze-and-Excitation Blocks: Model interdependencies between channels
  • Inverted Residuals: Used in MobileNetV2 for efficient feature extraction

Tip: These advanced CNN techniques can significantly improve performance or efficiency. Consider them when designing custom architectures.
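
As a concrete example of the first bullet, here is a minimal depthwise separable convolution block in PyTorch (a depthwise convolution followed by a 1x1 pointwise convolution, in the spirit of MobileNet; channel sizes and the BN/ReLU ordering are illustrative choices):

```python
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, kernel_size=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_ch, bias=False),  # depthwise
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),            # pointwise
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```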

Recurrent Neural Networks: Beyond LSTMs and GRUs

  • Attention Mechanisms in RNNs: Allow the model to focus on specific parts of the input sequence
  • Quasi-Recurrent Neural Networks (QRNNs): Combine benefits of CNNs and RNNs
  • Independently Recurrent Neural Networks (IndRNNs): Address vanishing/exploding gradients
  • Hierarchical Multiscale LSTMs: Model different timescales in sequences

Tip: While Transformers have largely replaced RNNs for many tasks, these advanced RNN architectures can still be useful, especially for tasks with limited data.
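
To illustrate the first bullet, a simplified attention-pooling layer over RNN encoder outputs in PyTorch (full Bahdanau-style attention also conditions on the decoder state; the dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, encoder_outputs):  # (batch, seq_len, hidden_dim)
        # Score each timestep, normalize, and pool the outputs
        energies = self.score(torch.tanh(self.proj(encoder_outputs)))  # (batch, seq_len, 1)
        weights = torch.softmax(energies, dim=1)
        context = (weights * encoder_outputs).sum(dim=1)               # (batch, hidden_dim)
        return context, weights
```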

Transformer Architecture: In-depth Analysis

  • Multi-Head Attention: Allows attending to different parts of the input simultaneously
  • Positional Encoding: Techniques beyond sinusoidal encoding (e.g., learned positional embeddings)
  • Layer Normalization: Crucial for training stability in Transformers
  • Adaptive Computation Time: Dynamically adjust the number of computation steps per input
  • Sparse Transformers: Efficient attention mechanisms for long sequences

Tip: Understanding the intricacies of Transformers is crucial for many modern NLP and even computer vision tasks.
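
The core of multi-head attention is scaled dot-product attention; a minimal sketch is shown below (in practice you would use nn.MultiheadAttention or a fused kernel, and the tensor layout here is one common convention):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```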

Graph Neural Networks (GNNs): Advanced Topics

  • Graph Convolutional Networks (GCNs): Extend convolution to graph-structured data
  • Graph Attention Networks (GATs): Apply attention mechanisms to graphs
  • GraphSAGE: Efficient sampling-based approach for large graphs
  • Gated Graph Neural Networks: Incorporate LSTM-like gating mechanisms

Tip: GNNs are powerful for tasks involving relational data. Consider using them for problems that can be naturally represented as graphs.
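
As a sketch of the first bullet, a single dense GCN layer in the style of Kipf & Welling (for real workloads, prefer a library such as PyTorch Geometric or DGL and sparse adjacency matrices):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: dense (num_nodes, num_nodes) adjacency matrix
        adj_hat = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
        deg = adj_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        norm_adj = d_inv_sqrt @ adj_hat @ d_inv_sqrt               # symmetric normalization
        return torch.relu(norm_adj @ self.linear(x))
```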

Training Dynamics and Optimization

Learning Rate Schedules: Advanced Techniques

  • Cyclical Learning Rates: Cycle between lower and upper learning rate boundaries
  • One Cycle Policy: A single cycle that ramps the learning rate up and then anneals it back down (often with cosine annealing)
  • Stochastic Weight Averaging (SWA): Average weights over different points in training
  • Layerwise Adaptive Rate Scaling (LARS): Adjust learning rates per layer
  • Gradient Centralization: Center gradients to improve training stability

Tip: Proper learning rate scheduling can often improve both convergence speed and final performance.
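
PyTorch ships several of these schedules; a minimal one-cycle example is shown below (the max_lr, steps_per_epoch, and epochs values are placeholders):

```python
import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, steps_per_epoch=100, epochs=10)

# Inside the training loop, step the scheduler once per batch:
# optimizer.step(); scheduler.step()
```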

Batch Normalization Alternatives

  • Layer Normalization: Normalizes across features, useful for RNNs and Transformers
  • Instance Normalization: Useful for style transfer tasks
  • Group Normalization: Compromise between Layer and Instance Normalization
  • Weight Standardization: Standardize weights instead of activations

Tip: While Batch Normalization is powerful, these alternatives can be crucial for certain architectures or tasks.
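
These alternatives are drop-in modules in PyTorch (the feature, channel, and group sizes below are illustrative):

```python
import torch.nn as nn

layer_norm = nn.LayerNorm(512)                             # normalize over features (RNNs, Transformers)
instance_norm = nn.InstanceNorm2d(64)                      # per-sample, per-channel (style transfer)
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)   # batch-size independent compromise
```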

Gradient Accumulation and Large Batch Training

  • Techniques for training with limited GPU memory
  • Effective batch size considerations
  • Scaling learning rates with batch size

Tip: Gradient accumulation can allow you to effectively use larger batch sizes than your GPU memory would normally allow.
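
A minimal gradient accumulation sketch: gradients are accumulated over several micro-batches before each optimizer step, so the effective batch size is accumulation_steps × micro-batch size. The model, data, and hyperparameters here are placeholders:

```python
import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = torch.nn.CrossEntropyLoss()
# Placeholder for any iterable of (inputs, targets) micro-batches
loader = [(torch.randn(8, 128), torch.randint(0, 10, (8,))) for _ in range(16)]

accumulation_steps = 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # scale so gradients average
    loss.backward()                                              # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```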

Regularization and Generalization

Advanced Regularization Techniques

  • Spectral Normalization: Constrains the spectral norm of weight matrices
  • Dropblock: Structured dropout for convolutional networks
  • Shakeout: Combines L1 regularization with dropout
  • Cutout and Random Erasing: Image augmentation techniques that act as regularizers
  • Mixup and CutMix: Data augmentation techniques that combine multiple training examples

Tip: Combining multiple regularization techniques can lead to better generalization, but be careful not to over-regularize.
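
As an example of the last bullet, a minimal mixup sketch (Zhang et al.): blend random pairs of examples and their one-hot labels; alpha controls the mixing strength and is a tunable hyperparameter:

```python
import numpy as np
import torch

def mixup(inputs, targets_onehot, alpha=0.2):
    lam = np.random.beta(alpha, alpha)          # mixing coefficient
    perm = torch.randperm(inputs.size(0))       # random pairing within the batch
    mixed_inputs = lam * inputs + (1 - lam) * inputs[perm]
    mixed_targets = lam * targets_onehot + (1 - lam) * targets_onehot[perm]
    return mixed_inputs, mixed_targets
```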

Uncertainty Estimation

  • Monte Carlo Dropout: Use dropout at inference time for uncertainty estimation
  • Deep Ensembles: Train multiple models for robust predictions
  • Bayesian Neural Networks: Explicitly model weight uncertainties

Tip: Estimating model uncertainty is crucial for many real-world applications, especially in high-stakes domains.
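
A minimal Monte Carlo dropout sketch: keep dropout active at inference and average several stochastic forward passes; the spread across passes gives a rough uncertainty signal. This assumes the model contains dropout layers, and note that model.train() also affects BatchNorm:

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, num_samples=20):
    model.train()   # keeps dropout layers stochastic at inference time
    preds = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(num_samples)])
    model.eval()
    return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and dispersion
```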

Deep Learning for Specific Domains

Computer Vision: State-of-the-Art Techniques

  • Vision Transformers (ViT): Applying Transformers to image tasks
  • DETR (DEtection TRansformer): End-to-end object detection with Transformers
  • Swin Transformer: Hierarchical Transformer for various vision tasks
  • MoCo and SimCLR: Self-supervised learning for visual representations

Tip: Keep an eye on the rapidly evolving landscape of vision Transformers. They're increasingly competitive with CNNs.

Natural Language Processing: Advanced Methods

  • BERT and its variants: RoBERTa, ALBERT, DistilBERT
  • GPT series: Autoregressive language models
  • T5: Text-to-Text Transfer Transformer
  • ELECTRA: Sample-efficient encoder pre-training via a replaced-token detection objective

Tip: Fine-tuning pre-trained language models is often more effective than training from scratch, even for specialized domains.
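
A hedged sketch of loading a pre-trained BERT-style model for fine-tuning with the Hugging Face transformers library; the checkpoint name, number of labels, and example text are illustrative:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tokenizer(["an example sentence"], padding=True, truncation=True,
                  return_tensors="pt")
outputs = model(**batch)  # fine-tune by backpropagating a loss on outputs.logits
```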

Reinforcement Learning: Deep RL Techniques

  • Proximal Policy Optimization (PPO): Stable policy gradient method
  • Soft Actor-Critic (SAC): Off-policy algorithm for continuous action spaces
  • Rainbow DQN: Combination of multiple improvements to DQN
  • AlphaZero: Self-play reinforcement learning for perfect information games

Tip: In deep RL, implementation details matter a lot. Pay close attention to hyperparameters and normalization techniques.

Advanced Training Techniques

Meta-Learning

  • Model-Agnostic Meta-Learning (MAML): Learn to quickly adapt to new tasks
  • Prototypical Networks: Few-shot learning technique
  • Reptile: Simplified version of MAML

Tip: Meta-learning can be powerful when you need to adapt to new tasks with limited data.
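
To make the idea concrete, a minimal Reptile-style outer update (Nichol et al.): adapt a copy of the model on one task with a few SGD steps, then nudge the meta-parameters toward the adapted weights. Here task_batches is a placeholder for (inputs, targets) batches drawn from a single task:

```python
import copy
import torch

def reptile_update(model, task_batches, inner_lr=0.01, outer_lr=0.1):
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    loss_fn = torch.nn.MSELoss()
    for inputs, targets in task_batches:          # inner-loop adaptation on one task
        inner_opt.zero_grad()
        loss_fn(adapted(inputs), targets).backward()
        inner_opt.step()
    with torch.no_grad():                         # meta-update toward the adapted weights
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p.add_(outer_lr * (p_adapted - p))
```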

Neural Architecture Search (NAS)

  • DARTS: Differentiable architecture search
  • ProxylessNAS: Memory-efficient NAS
  • Once-for-All Networks: Train a single network to support multiple sub-networks

Tip: While powerful, NAS can be computationally expensive. Consider using pre-designed efficient architectures unless you have significant computational resources.

Federated Learning

  • Techniques for training on decentralized data
  • Secure aggregation protocols
  • Differential privacy in federated learning

Tip: Federated learning is crucial when data cannot be centralized due to privacy concerns or regulatory requirements.
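
As a sketch of the first bullet, a FedAvg-style aggregation step: average client model weights, weighted by the number of local examples. Real deployments layer secure aggregation and differential privacy on top of this, and the state-dict handling here is simplified:

```python
import torch

def federated_average(client_state_dicts, client_sizes):
    total = sum(client_sizes)
    avg_state = {}
    for key in client_state_dicts[0]:
        # Weighted average of each parameter tensor across clients
        avg_state[key] = sum(
            (n / total) * sd[key].float()
            for sd, n in zip(client_state_dicts, client_sizes))
    return avg_state
```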

Continual Learning

  • Elastic Weight Consolidation (EWC): Prevent catastrophic forgetting
  • Progressive Neural Networks: Grow network capacity for new tasks
  • Memory-based approaches: Store and replay examples from previous tasks

Tip: Continual learning is essential for systems that need to adapt to new tasks without forgetting old ones.

Best Practices and Advanced Tips

  1. Hyperparameter Optimization:

    • Use Bayesian optimization tools like Optuna or Ray Tune
    • Consider multi-fidelity optimization techniques like Hyperband
  2. Mixed Precision Training:

    • Use FP16 or bfloat16 to speed up training and reduce memory usage (see the sketch after this list)
    • Be aware of potential numerical instabilities
  3. Debugging Deep Neural Networks:

    • Use gradient and activation histograms to diagnose issues
    • Implement unit tests for custom layers and loss functions
    • Use tools like Deepchecks for systematic testing of deep learning models
  4. Model Interpretability:

    • Implement Grad-CAM for CNN visualization
    • Use SHAP (SHapley Additive exPlanations) values for feature importance
    • Explore counterfactual explanations for individual predictions
  5. Efficient Data Pipeline:

    • Use data loading libraries like DALI for GPU-accelerated data loading
    • Implement prefetching and parallel data loading
    • Consider using TFRecords (for TensorFlow) or Lightning Data Modules (for PyTorch Lightning)
  6. Model Distillation and Compression:

    • Use techniques like quantization-aware training for efficient deployment
    • Explore lottery ticket hypothesis for model pruning
    • Consider neural architecture search for hardware-aware model design
  7. Experiment Management:

    • Use tools like MLflow, Weights & Biases, or Neptune.ai for tracking experiments
    • Implement version control for datasets and models
    • Document your experiments thoroughly, including failed attempts
  8. Reproducibility:

    • Set random seeds for all sources of randomness
    • Record software versions and hardware specifications
    • Consider using Docker containers for consistent environments
  9. Ethical Considerations:

    • Regularly audit your models for biases
    • Implement fairness constraints in your training process
    • Consider the environmental impact of large-scale model training
  10. Staying Updated:

    • Follow top conferences (NeurIPS, ICML, ICLR, CVPR, ACL)
    • Join discussion forums like /r/MachineLearning or participate in Kaggle competitions
    • Contribute to open-source projects to learn from and collaborate with others
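
To illustrate item 2 above, a minimal mixed precision training step using torch.cuda.amp; the model, data, and optimizer are placeholders, and the scaler guards against FP16 gradient underflow:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(8, 128, device=device)
targets = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = loss_fn(model(inputs), targets)   # forward pass runs in FP16 where safe
scaler.scale(loss).backward()                # scale the loss to avoid underflow
scaler.step(optimizer)
scaler.update()
```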

Remember, deep learning is as much an art as it is a science. While these advanced techniques can be powerful, always start with a simple baseline and gradually increase complexity. Continuous experimentation and a solid understanding of the fundamentals are key to success in this rapidly evolving field. Happy deep learning!