Commit 374eca1

Update README.md
1 parent de15555 commit 374eca1

1 file changed

+19
-12
lines changed

README.md

Lines changed: 19 additions & 12 deletions
@@ -85,11 +85,16 @@ These implementations should only be a few lines long.
 
 In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`
 
-This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. However, note
-that they use shared memory in Example 39-1; don't do that yet. Instead, write
+This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
+taught shared memory, but you **shouldn't use it yet**. Example 39-1 uses
+shared memory, but is limited to operating on very small arrays! Instead, write
 this using global memory only. As a result of this, you will have to do
 `ilog2ceil(n)` separate kernel invocations.
 
+Beware of errors in Example 39-1 in the book; the pseudocode (Example 2) is
+probably correct, but the CUDA code has a few small errors (missing braces, bad
+indentation, etc.)
+
 Make sure your implementation works on non-power-of-two sized arrays (see
 `ilog2ceil`).
 
@@ -101,9 +106,18 @@ Make sure your implementation works on non-power-of-two sized arrays (see
 In `stream_compaction/efficient.cu`, implement
 `StreamCompaction::Efficient::scan`
 
-This is equivalent to the "Work-Efficient Parallel Scan" from the slides and
-*GPU Gems 3* section 39.2.2. Instead of using shared memory as in Example 39-2,
-use global memory only. This will again require multiple kernel invocations.
+This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
+taught shared memory, but you **shouldn't use it yet**. Example 39-1 uses
+shared memory, but is limited to operating on very small arrays! Instead, write
+this using global memory only. As a result of this, you will have to do
+`ilog2ceil(n)` separate kernel invocations.
+
+Beware of errors in Example 39-1 in the book; the pseudocode (Example 2) is
+probably correct, but the CUDA code has a few small errors (missing braces, bad
+indentation, etc.)
+
+Make sure your implementation works on non-power-of-two sized arrays (see
+`ilog2ceil`).
 
 ### 3.2. Stream Compaction
 In `stream_compaction/efficient.cu`, implement
@@ -117,13 +131,6 @@ In `stream_compaction/common.cu`, implement these for use in `compact`:
 * `StreamCompaction::Common::kernMapToBoolean`
 * `StreamCompaction::Common::kernScatter`
 
-Beware of errors in Example 39-2 in the book; the pseudocode (Examples 3/4) is
-correct, but the CUDA code has a few errors (missing braces, bad indentation,
-etc.)
-
-Make sure your implementation works on non-power-of-two sized arrays (see
-`ilog2ceil`).
-
 
 ## Part 4: Using Thrust's Implementation
 
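For readers of this diff, the pass structure that the README's naive-scan paragraph describes (global memory only, `ilog2ceil(n)` separate kernel invocations, any array length) can be sketched on the CPU in plain C++. This is only an illustration under stated assumptions, not the assignment's reference solution. The body of `ilog2ceil` is a guess at what the project's helper computes. `naivePass` and `naiveExclusiveScan` are hypothetical names. The final shift to an exclusive scan assumes that is the convention the project expects.

```cpp
#include <utility>
#include <vector>

// Smallest k such that 2^k >= n. The README names a helper `ilog2ceil`;
// this body is an assumption about its behavior.
int ilog2ceil(int n) {
    int k = 0;
    while ((1 << k) < n) ++k;
    return k;
}

// Hypothetical CPU stand-in for one kernel launch. On the GPU, each value
// of k would be one thread, and `in`/`out` would be global-memory buffers.
void naivePass(int n, int offset, const std::vector<int>& in,
               std::vector<int>& out) {
    for (int k = 0; k < n; ++k) {
        // Elements below the offset have no partner and are copied through.
        out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
    }
}

// Naive scan with two ping-pong buffers: one pass per level, with
// ilog2ceil(n) passes in total. Works for any n, not only powers of two,
// because the bounds check in naivePass handles the ragged edge.
std::vector<int> naiveExclusiveScan(const std::vector<int>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<int> a(input), b(n);
    const int levels = ilog2ceil(n);
    for (int d = 1; d <= levels; ++d) {
        naivePass(n, 1 << (d - 1), a, b);  // would be one kernel invocation
        std::swap(a, b);                   // output becomes next pass's input
    }
    // `a` now holds the inclusive scan; shift right by one and insert the
    // identity (0) to obtain the exclusive scan (assumed convention).
    std::vector<int> result(n, 0);
    for (int k = 1; k < n; ++k) result[k] = a[k - 1];
    return result;
}
```

Two buffers are used because a pass that read and wrote the same array would let some threads see partial sums from the current level. Swapping after every pass keeps each level's inputs stable.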