# Tetris AI Project

> This repository is a fork of [truonging/Tetris-A.I](https://github.com/truonging/Tetris-A.I).
## **Demo Video**
[Watch the demo on YouTube](https://www.youtube.com/watch?v=D8MjBG5kSzU)

## **Tetris AI in Action**

## **Genetic Algorithm in Action**

## Overview
This project is an AI-driven Tetris player built using **Python** and **Pygame**. It leverages **Deep Q-Networks (DQN), Double DQN, Prioritized Experience Replay, and Genetic Algorithms** to train an agent that plays Tetris efficiently. The project underwent significant optimizations from **Version 1** to **Version 2** to improve training speed and efficiency.
## Environment
The game environment follows NES Tetris rules, implementing:
- A scoring system similar to NES Tetris.
- Gravity mechanics for line clears.

The AI interacts with the game through state-based decisions, selecting a move from all possible placements and rotations.
## AI Agent
The initial AI agent was based on **Deep Q-Learning (DQN)**, which uses a **single neural network** to estimate both **current and target Q-values**. However, this approach suffered from **Q-value overestimation** and **early convergence**, leading me to explore improvements.

### Why Q-Learning and DQN?
- Tetris has a **well-defined state space**: the board is represented using **6 features** (`total_height, bumpiness, holes, line_cleared, y_pos, pillar`).
- The agent **selects only one action per move**, making Q-learning a good fit for evaluating discrete actions efficiently.
- **Experience Replay** helped stabilize learning by allowing the agent to learn from past moves, improving long-term decision-making.
- With this setup, some agents **cleared 500+ lines** by **game 10,000**, demonstrating strong learning potential.
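The grid-derived features above can be sketched as a small extraction function. This is an illustrative reconstruction, not the repository's actual code: the exact pillar definition (a well at least 3 cells deeper than both neighbors) is an assumption, and `line_cleared`/`y_pos` are omitted because they come from the placement itself rather than the grid.

```python
import numpy as np

def board_features(board: np.ndarray) -> dict:
    """Compute board features from a binary grid (1 = filled cell).

    Hypothetical sketch of the grid features named above; the actual
    feature code in this repository may differ.
    """
    rows, cols = board.shape
    # Column height = number of rows from the topmost filled cell down.
    heights = np.where(board.any(axis=0), rows - np.argmax(board, axis=0), 0)
    total_height = int(heights.sum())
    # Bumpiness: summed height difference between adjacent columns.
    bumpiness = int(np.abs(np.diff(heights)).sum())
    # Holes: empty cells with at least one filled cell above them.
    holes = sum(int((board[rows - heights[c]:, c] == 0).sum()) for c in range(cols))
    # Pillars (assumed definition): wells >= 3 cells deeper than both
    # neighbors, i.e. gaps that only an I-piece can clear cleanly.
    pillar = 0
    for c in range(cols):
        left = heights[c - 1] if c > 0 else rows
        right = heights[c + 1] if c < cols - 1 else rows
        depth = min(left, right) - heights[c]
        if depth >= 3:
            pillar += depth
    return {"total_height": total_height, "bumpiness": bumpiness,
            "holes": holes, "pillar": pillar}
```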
### Transition to Double Q-Learning
- Initially, I implemented **Double Q-Learning**, which **separates action selection from Q-value estimation** to **reduce overestimation bias**.
- This led to more accurate value estimates and improved learning stability.

### Switching to Double DQN (DDQN)
I later adopted **Double DQN (DDQN)**, which extends Double Q-Learning by using **two separate neural networks**:
- **Primary Network**: Predicts actions and updates **every 200 pieces placed**.
- **Target Network**: Computes target Q-values and updates **every 1,000 pieces** to provide more stable training.

This approach **reduces instability** in training, **prevents premature convergence**, and helps the agent **generalize better across different board states**.
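The selection/evaluation split can be sketched in PyTorch. Everything here is an assumption for illustration (network shape, and a state-value formulation with one output per candidate board, which is common for Tetris agents); it is not the repository's actual `model.py`.

```python
import torch
import torch.nn as nn

def make_net(n_features: int = 6) -> nn.Sequential:
    # Illustrative network: 6 board features in, one value out.
    return nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

primary, target = make_net(), make_net()
target.load_state_dict(primary.state_dict())  # start the two nets in sync

def ddqn_target(reward, next_states, done, gamma=0.999):
    """DDQN target: the primary net *selects* the best next state,
    the target net *evaluates* it, which reduces overestimation bias."""
    with torch.no_grad():
        best = primary(next_states).squeeze(-1).argmax()       # selection
        value = target(next_states[best]).squeeze(-1)          # evaluation
    return reward + gamma * value * (1.0 - done)
```

Periodically copying `primary`'s weights into `target` (every 1,000 pieces, per the schedule above) keeps the evaluation targets stable between syncs.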
### Prioritized Experience Replay (PER)
Initially, my agent used **Experience Replay**, where past experiences were **randomly sampled** for training. This helped the agent make **long-term decisions** by letting it learn from **past moves** rather than relying solely on recent experiences.

However, **random sampling treats all experiences equally**, even though some experiences provide **more learning value** than others. To address this, I implemented **Prioritized Experience Replay (PER)**.

#### Why Prioritized Experience Replay?
- Instead of selecting experiences at random, **PER selects experiences based on their TD error** (**Temporal Difference error**).
- **TD error = difference between predicted and actual Q-values**.
  - **High TD error** → The agent's prediction was far off, meaning **there is more to learn from this experience**.
  - **Low TD error** → The agent already understands this experience well, meaning **less learning value**.

By prioritizing high-**TD-error** experiences, the agent **learns from its biggest mistakes first**, leading to **faster and more efficient training**, especially in the early stages.

#### Implementation of PER
- I replaced the traditional deque-based replay buffer with a **heap-based structure**, allowing efficient retrieval of **high-priority experiences**.
- The heap keeps track of the **maximum TD error**, ensuring that the most **informative experiences are sampled most frequently**.

This approach **significantly improved early training efficiency**, allowing the agent to **focus on valuable experiences** rather than wasting computation on redundant ones.
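A minimal sketch of the heap idea, assuming greedy max-TD-error retrieval as described above (textbook PER instead samples proportionally via a sum-tree; the repository's buffer may differ in either direction):

```python
import heapq
import itertools

class HeapReplayBuffer:
    """Max-heap replay buffer keyed by TD error (illustrative sketch).

    heapq is a min-heap, so priorities are stored negated; a running
    counter breaks ties so experiences themselves are never compared.
    """
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.heap = []
        self.counter = itertools.count()

    def push(self, td_error: float, experience) -> None:
        item = (-abs(td_error), next(self.counter), experience)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        else:
            # O(n) eviction of the lowest-priority item; fine for a sketch.
            worst = max(self.heap)
            if item < worst:
                self.heap.remove(worst)
                heapq.heapify(self.heap)
                heapq.heappush(self.heap, item)

    def sample(self, k: int) -> list:
        """Return the k experiences with the largest TD error."""
        return [exp for _, _, exp in heapq.nsmallest(k, self.heap)]
```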
### Reward Function Design
A well-balanced reward function was necessary to help the agent learn **long-term strategies**. Rewarding only line clears resulted in poor planning, so I introduced **intermediate shaping rewards** to encourage **better board management**.

#### Key Objectives of a Good Board State:
- **Minimal bumpiness** → Smoother surfaces make line clears easier.
- **Minimal holes** → Avoid trapped empty spaces.
- **Small pillars** → Prevent deep wells that are difficult to clear.

#### Reward & Penalty System:
- **Penalties** for increasing bumpiness, holes, or large pillars.
- **Punishment for stacking too high**, to prevent early game overs.
- **Encouragement for moves that improve board stability.**

#### Handling Delayed Rewards (Temporal Credit Assignment Problem)
A good move in Tetris **does not always have an immediate impact**. The agent may place a piece that **sets up a Tetris many moves later**.

- **Short-term rewards** (clearing a single line) might seem optimal, but **setting up a Tetris (4-line clear) is more valuable**.
- **Experience Replay** helps the agent revisit **earlier moves that contributed to major rewards later**, reinforcing good strategies.
- A **discount factor of gamma = 0.999** ensures that the agent **values long-term rewards** instead of greedily chasing short-term gains.

By **considering the delayed impact of moves**, the agent learns **how to set up better board states** instead of focusing only on immediate rewards.
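The penalty/encouragement scheme can be sketched as a signed delta over board features. The weights, the NES-style line bonuses, and the height threshold below are illustrative placeholders, not the values the genetic algorithm later evolves:

```python
# Hypothetical reward shaping for a single piece placement.
WEIGHTS = {"holes": -4.0, "bumpiness": -1.0, "pillar": -2.0, "height": -0.5}
LINE_REWARD = {0: 0.0, 1: 40.0, 2: 100.0, 3: 300.0, 4: 1200.0}
MAX_SAFE_HEIGHT = 15   # stacking above this is punished to avoid topping out

def placement_reward(before: dict, after: dict, lines_cleared: int) -> float:
    """Line-clear bonus plus signed penalties for worsening the board.

    `before`/`after` are feature dicts around the placement. Because the
    weights are negative, increases in holes, bumpiness, pillars, or
    height are punished, while moves that reduce them are rewarded.
    """
    reward = LINE_REWARD[lines_cleared]
    for key, w in WEIGHTS.items():
        reward += w * (after[key] - before[key])
    if after["height"] > MAX_SAFE_HEIGHT:
        reward -= 50.0   # flat punishment for stacking dangerously high
    return reward
```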
### Exploration vs. Exploitation Strategy
Instead of relying solely on a **typical decay schedule**, I combined it with an **alternating strategy** between **high exploration and high exploitation** in **500-game cycles**. This **sped up learning while maintaining stability**.

#### High Exploration Phase (500 games)
- **Epsilon:** `0.3 → 0.0001`
- **Learning Rate (LR):** `0.01 → 0.001`
- Since the agent has **10-40 move choices per state**, high exploration **encourages broader strategy discovery**.
- A **higher learning rate** allows more aggressive updates, helping the agent learn **board-setup strategies faster**.

#### High Exploitation Phase (500 games)
- **Epsilon:** `0.0001`
- **Learning Rate (LR):** `0.001`
- The agent **tests the strategies it learned** during the exploration phase.
- A **lower LR prevents drastic updates**, refining the strategy without overfitting.
- This phase **stabilizes** the agent's learning, similar to how **stocks correct after a surge**.

#### Second Cycle of Exploration & Exploitation
- **First cycle**: The agent explored **without prior knowledge**.
- **Second cycle**: The agent **explored with refined strategies**, leading to more **targeted discoveries**.
- **Another 500-game exploration phase** allowed for additional improvements.
- A **final exploitation phase** fine-tuned an even better strategy.

This **alternating method** allowed the agent to **learn, refine, explore deeper, and perfect its strategy**.
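The alternating schedule above can be sketched as a function of the game number. The endpoints come from the text; the exponential interpolation within the exploration phase is an assumption (the repository may decay differently):

```python
def schedule(game: int) -> tuple:
    """Return (epsilon, learning_rate) for a given game number.

    Games alternate in 500-game phases: exploration (decaying epsilon
    and LR) then exploitation (both pinned at their floor values).
    """
    phase = (game // 500) % 2          # 0 = exploration, 1 = exploitation
    if phase == 0:
        t = (game % 500) / 499         # progress through the phase, 0 -> 1
        epsilon = 0.3 * (0.0001 / 0.3) ** t    # 0.3 -> 0.0001
        lr = 0.01 * (0.001 / 0.01) ** t        # 0.01 -> 0.001
    else:
        epsilon, lr = 0.0001, 0.001
    return epsilon, lr
```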
### Genetic Algorithm (GA)
Balancing the reward function for Tetris proved **extremely difficult**:
- **Punishing holes too much** led agents to build tall pillars.
- **Punishing pillars too much** made agents cover them too early, forgoing **Tetris clears**.
- **Over-rewarding Tetris clears** made agents stack high and wait for an I-piece, often leading to failure.
- **Under-rewarding Tetris clears** led to single and double line clears, missing higher scores.

Initially, **tuning these rewards meant manually adjusting values** and running **500+ games per test**, an impractical and slow process. A **Genetic Algorithm (GA)** provided a way to search this parameter space automatically.

### Evolutionary Strategy
Taking inspiration from **natural selection (survival of the fittest)**, I designed the GA to evolve **the best reward function** by:
- **High exploration early on**, allowing diverse strategies to develop.
- **Gradual transition to exploitation**, refining the best strategies over generations.

Each agent's fitness was measured by its **average number of lines cleared over 500 games**.

### **Selection Process**
I used a **hybrid of elite selection and tournament selection**:
- **Elite Selection (50%)**: The **top 50%** of agents were **passed directly** to the next generation, preserving high-performing strategies.
- **Tournament Selection (50%)**: The remaining 50% were produced from parents selected **randomly among the top performers**, maintaining diversity.

### **Crossover Strategy**
- **Offspring inherited reward-function parameters from their parents.**
- **A mix of uniform and alpha crossover was used**:
  - **100% uniform crossover in early generations** (high randomness).
  - **Gradual transition to 100% alpha crossover by generation 100** (blending toward one parent's values).
  - This **ensured high exploration early on and stable exploitation later**.

### **Mutation Strategy**
- **50% mutation rate early on**, ensuring **diverse strategies**.
- **Gradual decay to 5% by generation 100**, stabilizing learned behaviors.
- Mutations introduced **small adjustments** to reward parameters, preventing premature convergence.

This **exploration-to-exploitation strategy** allowed me to **discover an optimal balance of rewards**, creating a **highly competitive AI**.
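One GA generation combining the selection, crossover, and mutation rules above can be sketched as follows. The schedule endpoints come from the text; the blend range for alpha crossover, the Gaussian mutation size, and the assumption of a population of at least four are placeholders:

```python
import random

def next_generation(pop, fitness, gen, total_gens=100):
    """Evolve one generation of reward-weight dicts (illustrative sketch).

    pop      : list of {parameter: value} dicts (assumed len >= 4)
    fitness  : parallel list of average lines cleared over 500 games
    gen      : current generation, used to schedule crossover/mutation
    """
    ranked = [p for _, p in sorted(zip(fitness, pop), key=lambda x: -x[0])]
    elites = ranked[: len(pop) // 2]          # elite selection: top 50% survive
    progress = min(gen / total_gens, 1.0)
    mut_rate = 0.5 - 0.45 * progress          # 50% -> 5% mutation by gen 100
    children = []
    while len(elites) + len(children) < len(pop):
        a, b = random.sample(elites, 2)       # parents drawn from top performers
        child = {}
        for k in a:
            if random.random() < progress:    # alpha crossover: blend, favor a
                alpha = random.uniform(0.5, 1.0)
                child[k] = alpha * a[k] + (1 - alpha) * b[k]
            else:                             # uniform crossover: pick a parent
                child[k] = random.choice((a[k], b[k]))
            if random.random() < mut_rate:    # small random perturbation
                child[k] += random.gauss(0, 0.1)
        children.append(child)
    return elites + children
```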
---

## **Optimizations (Version 1 → Version 2)**

- **Version 1:** The project was **not originally designed** to handle multiple game boards in one window. As a workaround, I used **multiprocessing**, giving each agent its **own CPU core**. However, this approach **limited me to 10 agents**, constrained by the number of available CPU cores.

- **Version 2:** Knowing I wanted **many agents running at once**, I **redesigned the project** to support multiple boards within a single process. This **eliminated the need for multiprocessing**, letting the system schedule the work efficiently on its own. Thanks to these optimizations, I increased the number of agents from **10 to 250**.

### **Profiling revealed two major bottlenecks:**
1. **Rendering inefficiencies** – redrawing **static elements** every frame.
2. **State calculation overhead** – dropping pieces into **all possible positions** consumed excessive time.
### **Rendering Optimizations**
- **Old Approach**: Redrew **every block** every frame.
- **New Approach**: Used **dirty rects** (only updating changed areas).
  - **Result**: Rendering time reduced from **90s → 5s**.
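The dirty-rect idea can be sketched in a few lines of Pygame. This is a minimal illustration, not the repository's renderer; the 20-pixel cell size and the dummy video driver (which allows headless runs) are assumptions:

```python
import os
os.environ.setdefault("SDL_VIDEODRIVER", "dummy")  # headless-safe for demos
import pygame

CELL = 20                                  # assumed cell size in pixels
pygame.init()
screen = pygame.display.set_mode((10 * CELL, 20 * CELL))

def draw_changed_cells(changed):
    """Redraw only the cells touched this frame and push only those rects.

    changed: iterable of (col, row, rgb) tuples for cells that differ
    from the previous frame; everything else on screen is left alone.
    """
    dirty = []
    for col, row, color in changed:
        rect = pygame.Rect(col * CELL, row * CELL, CELL, CELL)
        pygame.draw.rect(screen, color, rect)
        dirty.append(rect)
    pygame.display.update(dirty)           # refresh only the dirty rects
    return dirty
```

`pygame.display.update(rect_list)` is what makes this cheap: unlike `pygame.display.flip()`, it pushes only the listed rectangles to the display.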
### **State Calculation Optimizations**
- **Old Approach**: Used **Python loops**, making `calc_all_states()` slow (**~180s**).
- **New Approach**: Rewrote it with **Numba's `njit`**, compiling the hot loops to **machine code**.
  - **Result**: Execution time reduced from **180s → 15s**.
### **Additional Optimizations**
- **Blitting Optimization**: Rendered **directly to the main screen** instead of intermediate surfaces.
- **Batch Processing**: Consolidated **multiple small calculations** into fewer large ones.
- **Reduced Redundant Board Operations**: Minimized **unnecessary board evaluations**.

These optimizations allowed **seamless Genetic Algorithm training**, unlocking **massive scalability improvements**.

---
### **Version 1 Profiling (500 games)**
```plaintext
223807016 function calls (210035110 primitive calls) in 483.520 seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.155    2.155  481.741  481.741 train.py:85(run_simulation)
    42254    0.305    0.000  206.318    0.005 tetris.py:87(play_full)
    82161    0.575    0.000  205.107    0.002 tetris.py:139(play_step)
    42383   12.129    0.000  180.555    0.004 game.py:203(calc_all_states)
    82160    0.253    0.000  114.913    0.001 game.py:250(run)
  1005317    4.698    0.000   96.764    0.000 game.py:311(hard_drop)
 10562398   16.251    0.000   92.066    0.000 game.py:316(move_down)
```

### **Version 2 Profiling (500 games)**
```plaintext
22190082 function calls (20214157 primitive calls) in 52.530 seconds

Ordered by: cumulative time

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              1    0.423    0.423   50.491   50.491 main_screen.py:155(run2)
          46619    3.519    0.000   20.576    0.000 main_screen.py:113(play_action)
          31/21    0.000    0.000   17.968    0.856 _ops.py:291(fallthrough)
 698733/140673    0.902    0.000   17.785    0.000 module.py:1735(_wrapped_call_impl)
 698733/140673    1.156    0.000   17.579    0.000 module.py:1743(_call_impl)
         139515    1.450    0.000   17.034    0.000 model.py:12(forward)
```

### **Key Takeaways**
- **Total runtime reduced from 483.52s → 52.53s (≈89% reduction, roughly 9× faster).**
- **`calc_all_states()` reduced from ~180s → ~15s.**
- **Rendering reduced from ~90s → ~5s.**
- **Overall, training is significantly faster and more scalable.**

---

## **Running the Project**
To run the AI, navigate to the appropriate version and execute:

### **Install requirements**
```bash
pip install -r requirements.txt
```

### **Version 1**
```bash
cd Version1
python -c "import train; train.run_game(True)"   # Enable slow drop
python -c "import train; train.run_game(False)"  # Disable slow drop
```

### **Version 2**
```bash
cd Version2
python genetic_algo.py
```
The objective of this fork is to develop a competitive Tetris bot capable of playing in multiplayer duels.