Commit 9a975c1 (parent 320b487)

chore: cleanup repo

77 files changed: +3 / -1946 lines

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+__pycache__

File renamed without changes.

README.md

Lines changed: 2 additions & 232 deletions

@@ -1,234 +1,4 @@
# Tetris AI Project

> This repository is a fork of [truonging/Tetris-A.I](https://github.com/truonging/Tetris-A.I).
## **Demo Video**

[![Tetris AI Demo](https://img.youtube.com/vi/D8MjBG5kSzU/0.jpg)](https://www.youtube.com/watch?v=D8MjBG5kSzU)

## **Tetris AI in Action**

![Tetris AI Playing](assets/tetris_ai_demo.gif)

## **Genetic Algorithm in Action**

![Genetic Algorithm Training](assets/ga_ai_demo.gif)
## Overview

This project is an AI-driven Tetris player built with **Python** and **Pygame**. It combines **Deep Q-Networks (DQN), Double DQN, Prioritized Experience Replay, and a Genetic Algorithm** to train an agent that plays Tetris efficiently. The project underwent significant optimization from **Version 1** to **Version 2** to improve training speed and efficiency.
## Environment

The game environment follows NES Tetris rules, implementing:

- A scoring system similar to NES Tetris.
- Gravity mechanics for line clears.

The AI interacts with the game through state-based decisions, selecting moves from all possible placements and rotations.
## AI Agent

The initial agent was based on **Deep Q-Learning (DQN)**, which uses a **single neural network** to estimate both **current and target Q-values**. This approach suffered from **Q-value overestimation** and **early convergence**, leading me to explore the improvements below.
### Why Q-Learning and DQN?

- Tetris has a **well-defined state space**: the board is represented by **6 features** (`total_height, bumpiness, holes, line_cleared, y_pos, pillar`).
- The agent **selects only one action per move**, making Q-learning a good fit for evaluating discrete actions efficiently.
- **Experience Replay** stabilized learning by letting the agent learn from past moves, improving long-term decision-making.
- With this setup, some agents **achieved 500+ lines** by **game 10,000**, demonstrating strong learning potential.
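To make the feature list concrete, here is a sketch of how the static board features (`total_height`, `bumpiness`, `holes`) could be computed from an occupancy grid. This is an illustrative implementation, not the repository's actual code, and `board_features` is a hypothetical name; the remaining features (`line_cleared`, `y_pos`, `pillar`) depend on the move just played.

```python
import numpy as np

def board_features(board: np.ndarray) -> np.ndarray:
    """Static heuristic features for a Tetris board.

    `board` is a 2D 0/1 occupancy array (rows x columns), row 0 at the top.
    Illustrative sketch, not the repository's actual code.
    """
    rows, cols = board.shape
    # Height of each column: distance from the topmost filled cell to the floor.
    heights = np.where(board.any(axis=0), rows - board.argmax(axis=0), 0)
    total_height = int(heights.sum())
    # Bumpiness: summed absolute height difference between adjacent columns.
    bumpiness = int(np.abs(np.diff(heights)).sum())
    # Holes: empty cells with at least one filled cell somewhere above them.
    holes = sum(
        int((board[rows - heights[c]:, c] == 0).sum())
        for c in range(cols) if heights[c] > 0
    )
    return np.array([total_height, bumpiness, holes])
```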
### Transition to Double Q-Learning

- I first implemented **Double Q-Learning**, which **separates action selection from Q-value estimation** to **reduce overestimation bias**.
- This produced more accurate value estimates and improved learning stability.
### Switching to Double DQN (DDQN)

I later adopted **Double DQN (DDQN)**, which extends Double Q-Learning by using **two separate neural networks**:

- **Primary Network**: Predicts actions and updates **every 200 pieces placed**.
- **Target Network**: Computes target Q-values and updates **every 1000 pieces**, providing more stable targets.

This approach **reduces instability** in training, **prevents premature convergence**, and allows the agent to **generalize better across different board states**.
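The selection/evaluation split at the heart of DDQN can be sketched framework-agnostically with NumPy: the primary network picks the next action, the target network scores it. `ddqn_targets` and its argument names are illustrative, not the project's API.

```python
import numpy as np

def ddqn_targets(q_primary_next, q_target_next, rewards, dones, gamma=0.999):
    """Double-DQN target computation (illustrative sketch).

    q_primary_next / q_target_next: (batch, n_actions) Q-values for the
    next state from the primary and target networks respectively.
    """
    best_actions = q_primary_next.argmax(axis=1)  # primary net selects the action
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]  # target net evaluates it
    # Terminal states (dones == 1) contribute only their immediate reward.
    return rewards + gamma * evaluated * (1.0 - dones)
```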
### Prioritized Experience Replay (PER)

Initially, my agent used plain **Experience Replay**, where past experiences were **randomly sampled** for training. This helped the agent make **long-term decisions** by learning from **past moves** rather than relying solely on recent experience.

However, **random sampling treats all experiences equally**, even though some provide **more learning value** than others. To address this, I implemented **Prioritized Experience Replay (PER)**.

#### Why Prioritized Experience Replay?

- Instead of sampling at random, **PER selects experiences based on their TD error** (**Temporal Difference error**).
- **TD error = difference between the predicted Q-value and the target Q-value**.
- **High TD error** → the agent's prediction was far off, meaning **there is more to learn from this experience**.
- **Low TD error** → the agent already understands this experience well, meaning **less learning value**.

By prioritizing high-**TD-error** experiences, the agent **learns from its biggest mistakes first**, leading to **faster and more efficient training**, especially in the early stages.
#### Implementation of PER

- I replaced the traditional deque-based replay buffer with a **heap-based structure**, allowing efficient retrieval of **high-priority experiences**.
- The heap keeps track of the **maximum TD error**, ensuring that the most **informative experiences are sampled most frequently**.

This significantly improved early training efficiency, letting the agent **focus on valuable experiences** rather than wasting computation on redundant ones.
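A minimal sketch of a heap-ordered buffer using Python's `heapq`; the class name is hypothetical and the real buffer would also re-insert sampled transitions with refreshed priorities rather than discarding them.

```python
import heapq
import itertools

class PrioritizedReplay:
    """Pop transitions with the largest |TD error| first (illustrative sketch)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so heapq never compares transitions.
        self._counter = itertools.count()

    def push(self, td_error, transition):
        # heapq is a min-heap, so store negative |TD error| to pop the max first.
        heapq.heappush(self._heap, (-abs(td_error), next(self._counter), transition))

    def sample(self, batch_size):
        n = min(batch_size, len(self._heap))
        batch = [heapq.heappop(self._heap) for _ in range(n)]
        return [t for _, _, t in batch]
```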
### Reward Function Design

A well-balanced reward function was necessary to help the agent learn **long-term strategies**. Rewarding only line clears left the signal too sparse and produced poor planning, so I added **shaping rewards** to encourage **better board management**.

#### Key Objectives of a Good Board State:

- **Minimal bumpiness** → smoother surfaces make line clears easier.
- **Minimal holes** → avoid trapped empty cells.
- **Small pillars** → prevent difficult-to-clear structures.

#### Reward & Penalty System:

- **Penalties for** increasing bumpiness, holes, or large pillars.
- **Punishment for stacking too high**, to prevent early game over.
- **Encouragement for moves that improve board stability.**
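A sketch of how such penalties might combine into one reward. The weights, thresholds, and feature-dict shape here are placeholders for illustration, not the values the project (or its Genetic Algorithm) actually uses; only the NES-style line-clear scores and the penalty categories come from the text above.

```python
def shaped_reward(prev, curr, lines_cleared, game_over):
    """Illustrative shaped reward; all weights are hypothetical.

    `prev` / `curr` are dicts holding the board features named above
    (bumpiness, holes, pillar, total_height) before and after the move.
    """
    # Base reward: NES-style line-clear scoring (single/double/triple/Tetris).
    reward = {0: 0, 1: 40, 2: 100, 3: 300, 4: 1200}[lines_cleared]
    reward -= 0.5 * (curr["bumpiness"] - prev["bumpiness"])  # penalize rougher surfaces
    reward -= 2.0 * (curr["holes"] - prev["holes"])          # penalize new holes
    reward -= 1.0 * (curr["pillar"] - prev["pillar"])        # penalize deep wells
    if curr["total_height"] > 150:                           # discourage stacking high
        reward -= 5.0
    if game_over:                                            # punish topping out
        reward -= 50.0
    return reward
```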
#### Handling Delayed Rewards (Temporal Credit Assignment Problem)

A good move in Tetris **does not always have an immediate payoff**. The agent may place a piece that **sets up a Tetris many moves later**.

- **Short-term rewards** (clearing a single line) may seem optimal, but **setting up a Tetris (4-line clear) is more valuable**.
- **Experience Replay** lets the agent revisit **earlier moves that contributed to major rewards later**, reinforcing good strategies.
- A **discount factor of `gamma = 0.999`** ensures the agent **values long-term rewards** (a reward 40 moves away still retains roughly 96% of its value, since 0.999^40 ≈ 0.96), preventing greed for short-term gains.

By **considering the delayed impact of moves**, the agent learns to **set up better board states** instead of focusing only on immediate rewards.
### Exploration vs. Exploitation Strategy

Instead of relying solely on a **typical decay schedule**, I combined it with an **alternating strategy** between **high exploration and high exploitation** in **500-game cycles**. This **sped up learning while maintaining stability**.

#### High Exploration Phase (500 games)

- **Epsilon:** `0.3 → 0.0001`
- **Learning Rate (LR):** `0.01 → 0.001`
- Since the agent has **10-40 move choices per state**, high exploration **encourages broader strategy discovery**.
- A **higher learning rate** allows more aggressive updates, helping the agent learn **board-setup strategies faster**.

#### High Exploitation Phase (500 games)

- **Epsilon:** `0.0001`
- **Learning Rate (LR):** `0.001`
- The agent **tests the strategies it learned** during the exploration phase.
- A **lower LR prevents drastic updates**, refining the strategy without overfitting.
- This phase **stabilizes** the agent's learning, similar to how **stocks correct after a surge**.

#### Second Cycle of Exploration & Exploitation

- **First cycle**: the agent explored **without prior knowledge**.
- **Second cycle**: the agent **explored with refined strategies**, leading to more **targeted discoveries**.
- **Another 500-game exploration phase** allowed for additional improvements.
- A **final exploitation phase** fine-tuned an even better strategy.

This **alternating method** allowed the agent to **learn, refine, explore deeper, and perfect its strategy**.
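The alternating schedule can be expressed as a small function of the game index. The geometric decay shape is an assumption for illustration; only the 500-game cycle length and the endpoint values come from the text above.

```python
def phase_schedule(game_idx, cycle=500):
    """Alternate 500-game exploration and exploitation phases.

    Illustrative sketch: even cycles explore (epsilon and LR decay
    geometrically across the cycle), odd cycles exploit at the floor values.
    """
    in_exploration = (game_idx // cycle) % 2 == 0
    t = (game_idx % cycle) / cycle  # progress through the current cycle, 0..1
    if in_exploration:
        epsilon = 0.3 * (0.0001 / 0.3) ** t  # 0.3 -> 0.0001 over the cycle
        lr = 0.01 * (0.001 / 0.01) ** t      # 0.01 -> 0.001 over the cycle
    else:
        epsilon, lr = 0.0001, 0.001          # pure exploitation settings
    return epsilon, lr
```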
### Genetic Algorithm (GA)

Balancing the reward function for Tetris AI proved **extremely difficult**:

- **Punishing holes too heavily** led agents to build tall pillars.
- **Punishing pillars too heavily** made agents cover them too early, avoiding **Tetris clears**.
- **Over-rewarding Tetris clears** made agents stack high and wait for an I-piece, often leading to failure.
- **Under-rewarding Tetris clears** led to single and double line clears, missing higher scores.

Initially, **tuning these rewards meant manually adjusting values** and running **500+ games per test**: an impractical and slow process. A **Genetic Algorithm (GA)** provided an automated way to search these parameters efficiently.
### Evolutionary Strategy

Taking inspiration from **natural selection (survival of the fittest)**, I designed the GA to evolve **the best reward function** by:

- **Exploring heavily early on**, allowing diverse strategies to develop.
- **Gradually transitioning to exploitation**, refining the best strategies over generations.

Each agent's fitness was its **average number of lines cleared over 500 games**.
### **Selection Process**

I used a **hybrid of elite selection and tournament selection**:

- **Elite selection (50%)**: the **top 50%** of agents were **passed directly** to the next generation, preserving high-performing strategies.
- **Tournament selection (50%)**: the remaining 50% were selected **randomly from the top-performing agents**, maintaining diversity.
### **Crossover Strategy**

- **Offspring inherited reward-function parameters from their parents.**
- **A mix of uniform and alpha crossover was used**:
  - **100% uniform crossover in early generations** (high randomness).
  - **Gradual transition to 100% alpha crossover by generation 100** (favoring one parent's values).
- This **ensured high exploration early on and stable exploitation later**.
### **Mutation Strategy**

- **50% mutation rate early on**, ensuring **diverse strategies**.
- **Gradually decayed to 5% by generation 100**, stabilizing learned behaviors.
- Mutations introduced **small adjustments** to reward parameters, preventing premature convergence.

This **exploration-to-exploitation strategy** allowed me to **discover an optimal balance of rewards**, creating a **highly competitive AI**.
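One way to sketch a single generation combining the selection, crossover, and mutation schedules above. Everything here is an assumption for illustration (the blend factor, mutation noise, and the linear schedules finishing at generation 100); only the 50/50 split, the uniform-to-alpha transition, and the 50% → 5% mutation decay come from the text.

```python
import random

def next_generation(population, fitness, generation,
                    mutation_hi=0.5, mutation_lo=0.05):
    """One GA generation over reward-weight vectors (illustrative sketch)."""
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[: len(ranked) // 2]          # top 50% pass through unchanged
    progress = min(generation / 100, 1.0)        # schedules finish by generation 100
    mut_rate = mutation_hi + (mutation_lo - mutation_hi) * progress
    children = []
    while len(elites) + len(children) < len(population):
        p1, p2 = random.sample(elites, 2)        # parents drawn from top performers
        if random.random() < 1.0 - progress:     # uniform crossover dominates early
            child = [random.choice(pair) for pair in zip(p1, p2)]
        else:                                    # alpha/blend crossover dominates later
            alpha = 0.8                          # hypothetical blend factor
            child = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
        child = [w + random.gauss(0, 0.1) if random.random() < mut_rate else w
                 for w in child]                 # small Gaussian perturbations
        children.append(child)
    return elites + children
```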
---
## **Optimizations (Version 1 → Version 2)**

- **Version 1:** The project was **not originally designed** to handle multiple game boards in one window. As a workaround, I used **multiprocessing**, giving each agent its **own CPU core**. This **limited me to 10 agents**, constrained by available CPU cores.

- **Version 2:** Knowing I wanted **many agents running at once**, I **redesigned the project** to support multiple boards within a single process, **eliminating the need for multiprocessing**. With the optimizations below, the agent count grew from **10 to 250**.
### **Profiling revealed two major bottlenecks**:

1. **Rendering inefficiencies**: redrawing **static elements** every frame.
2. **State-calculation overhead**: dropping pieces into **all possible positions** consumed excessive time.
### **Rendering Optimizations**

- **Old approach**: redrew **every block** in every frame.
- **New approach**: used **dirty rects** (only updating changed areas).
- **Result**: rendering time reduced from **90s → 5s**.
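The dirty-rect idea can be illustrated without Pygame itself: track which board cells changed since the last frame and emit only those pixel rectangles. In real code the resulting list would be passed to `pygame.display.update(rect_list)`; the class below is a hypothetical helper, not the repository's implementation.

```python
class DirtyRectTracker:
    """Report only the screen regions whose cells changed (illustrative sketch)."""

    def __init__(self, cell_size=30):
        self.cell_size = cell_size
        self.prev = {}  # last frame's (row, col) -> color mapping

    def dirty_rects(self, board_cells):
        """`board_cells` maps (row, col) -> color; returns pixel rects to redraw."""
        rects = []
        for pos in set(self.prev) | set(board_cells):
            if self.prev.get(pos) != board_cells.get(pos):  # cell appeared/vanished/recolored
                r, c = pos
                rects.append((c * self.cell_size, r * self.cell_size,
                              self.cell_size, self.cell_size))
        self.prev = dict(board_cells)
        return rects
```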
### **State Calculation Optimizations**

- **Old approach**: pure **Python loops** made `calc_all_states()` slow (**~180s**).
- **New approach**: rewrote the hot path with **Numba's `njit`** for **machine-code execution**.
- **Result**: execution time reduced from **180s → 15s**.
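A toy example of the pattern: a tight per-cell loop decorated with Numba's `@njit`. The function is illustrative of the kind of loop worth compiling, not the repository's actual `calc_all_states()`; a no-op fallback keeps the sketch runnable when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:          # fallback so the sketch still runs without Numba
    def njit(func=None, **kwargs):
        return func if func is not None else (lambda f: f)

@njit
def column_heights(board):
    """Column heights for a 0/1 occupancy grid (rows x cols, row 0 on top)."""
    rows, cols = board.shape
    heights = np.zeros(cols, dtype=np.int64)
    for c in range(cols):
        for r in range(rows):
            if board[r, c] == 1:
                heights[c] = rows - r  # first filled cell from the top
                break
    return heights
```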
### **Additional Optimizations**

- **Blitting optimization**: rendered **directly to the main screen** instead of through intermediate surfaces.
- **Batch processing**: consolidated **multiple small calculations** into fewer large ones.
- **Reduced redundant board operations**: minimized **unnecessary board evaluations**.

These optimizations allowed **seamless Genetic Algorithm training**, unlocking **massive scalability improvements**.
---
### **Version 1 Profiling (500 games)**

```plaintext
223807016 function calls (210035110 primitive calls) in 483.520 seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.155    2.155  481.741  481.741 train.py:85(run_simulation)
    42254    0.305    0.000  206.318    0.005 tetris.py:87(play_full)
    82161    0.575    0.000  205.107    0.002 tetris.py:139(play_step)
    42383   12.129    0.000  180.555    0.004 game.py:203(calc_all_states)
    82160    0.253    0.000  114.913    0.001 game.py:250(run)
  1005317    4.698    0.000   96.764    0.000 game.py:311(hard_drop)
 10562398   16.251    0.000   92.066    0.000 game.py:316(move_down)
```
### **Version 2 Profiling (500 games)**

```plaintext
22190082 function calls (20214157 primitive calls) in 52.530 seconds

Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.423    0.423   50.491   50.491 main_screen.py:155(run2)
        46619    3.519    0.000   20.576    0.000 main_screen.py:113(play_action)
        31/21    0.000    0.000   17.968    0.856 _ops.py:291(fallthrough)
698733/140673    0.902    0.000   17.785    0.000 module.py:1735(_wrapped_call_impl)
698733/140673    1.156    0.000   17.579    0.000 module.py:1743(_call_impl)
       139515    1.450    0.000   17.034    0.000 model.py:12(forward)
```
### **Key Takeaways**

- **Total runtime reduced from 483.52s → 52.53s (≈89% reduction)**
- **`calc_all_states()` reduced from ~180s → ~15s**
- **Rendering reduced from ~90s → ~5s**
- **Overall, training is significantly faster and more scalable.**
---
## **Running the Project**

To run the AI, navigate to the appropriate version and execute:

### **Install requirements**

```bash
pip install -r requirements.txt
```

### **Version 1**

```bash
cd Version1
python -c "import train; train.run_game(True)"   # Enable slow drop
python -c "import train; train.run_game(False)"  # Disable slow drop
```

### **Version 2**

```bash
cd Version2
python genetic_algo.py
```
> Its objective is to develop a competitive Tetris bot capable of playing in multiplayer duels.
7 binary files removed (7.82 KB, 4.35 KB, 24.8 KB, 6.09 KB, 3.02 KB, 3.24 KB, 3.3 KB); contents not shown.
