Parallelism problem and read/write performance in ZNS mode

Hello,

I noticed that physical pages always land in channel 0 or channel 1, no matter how many channels I configure, because `CH_BITS` is set to 1. I think this hampers parallelism and read/write performance, so I changed `CH_BITS` to 3 and shrank `rsv` to 6 bits in order to use all 8 channels. That left me with two confusing problems. The original definition is:
```
#define CH_BITS     (1)

struct ppa {
    union {
        struct {
            uint64_t spg  : SPG_BITS; // sub-page
            uint64_t pg   : PG_BITS;
            uint64_t blk  : BLK_BITS;
            uint64_t fc   : FC_BITS;
            uint64_t pl   : PL_BITS;
            uint64_t ch   : CH_BITS;
            uint64_t V    : 1; // padding page or not
            uint64_t rsv  : 8;
        } g;

        uint64_t ppa;
    };
};
```
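To illustrate why a 1-bit channel field can only ever name channels 0 and 1, here is a minimal sketch. The two structs are hypothetical, simplified stand-ins for `struct ppa` (the 55-bit `rest` field is a placeholder for all the other fields, whose real widths are not shown above), and the helper names are made up for this example. It relies on `uint64_t` bit-fields being accepted by the compiler, as the original code already does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for struct ppa: only the channel
 * field matters here; `rest` lumps together every other field. */
struct ppa_ch1 {
    uint64_t rest : 55;
    uint64_t ch   : 1;   /* CH_BITS = 1 */
    uint64_t rsv  : 8;
};

struct ppa_ch3 {
    uint64_t rest : 55;
    uint64_t ch   : 3;   /* CH_BITS = 3 */
    uint64_t rsv  : 6;   /* shrunk by 2 so the total stays at 64 bits */
};

/* Widening ch without shrinking rsv would push the struct past 64 bits. */
static_assert(sizeof(struct ppa_ch1) == sizeof(uint64_t), "must stay 64 bits");
static_assert(sizeof(struct ppa_ch3) == sizeof(uint64_t), "must stay 64 bits");

/* Store a channel index and return what the 1-bit field actually holds. */
unsigned stored_channel_1bit(unsigned ch)
{
    struct ppa_ch1 p = {0};
    p.ch = ch;           /* unsigned bit-field: value is reduced modulo 2 */
    return (unsigned)p.ch;
}

/* Same, with a 3-bit field: values 0..7 survive unchanged. */
unsigned stored_channel_3bit(unsigned ch)
{
    struct ppa_ch3 p = {0};
    p.ch = ch;           /* value is reduced modulo 8 */
    return (unsigned)p.ch;
}
```

In this sketch, storing channel 5 in the 1-bit field reads back as 1, while the 3-bit field keeps 5. Whether the real mapping code truncates in exactly this way depends on how the channel index is computed before it is stored.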

My ZNS configuration is:
- `LOGICAL_PAGE_SIZE` = `ZNS_PAGE_SIZE` = 4KB
- 8 channels, 4 chips/channel, 2 planes/chip, 32 blocks/plane
- 1GB per zone, 32 zones in total
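As a quick check of the address widths this geometry needs, the sketch below computes the minimum number of bits per level. The function names are hypothetical and not part of the emulator; which `*_BITS` macro corresponds to which level is not asserted here.

```c
#include <assert.h>

/* Smallest number of address bits that can index n units (n >= 1). */
unsigned bits_needed(unsigned n)
{
    assert(n >= 1);
    unsigned bits = 0;
    while (bits < 31 && (1u << bits) < n) {
        bits++;
    }
    return bits;
}

/* Number of planes that could, in principle, be busy at the same time. */
unsigned parallel_units(unsigned channels, unsigned chips_per_ch,
                        unsigned planes_per_chip)
{
    return channels * chips_per_ch * planes_per_chip;
}
```

For the configuration above, `bits_needed(8)` is 3 for the channels, `bits_needed(4)` is 2 for the chips per channel, `bits_needed(2)` is 1 for the planes per chip, and `bits_needed(32)` is 5 for the blocks per plane; `parallel_units(8, 4, 2)` is 64.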
 
My first question is: **why does improving parallelism seem to hurt read performance while helping write performance?**

I used fio to test read/write performance with the following commands:
`fio --ioengine=psync --direct=1 --filename=/dev/nvme0n1 --rw=write --iodepth=16 --bs=32k --group_reporting --zonemode=zbd --name=seqwrite --offset_increment=0z --size=16z`

`fio --ioengine=psync --direct=1 --filename=/dev/nvme0n1 --rw=read --offset_increment=0z --size=2z --group_reporting --zonemode=zbd --bs=32k --name=seqread --numjobs=8`

With `CH_BITS=1`, the results are shown below: write bandwidth is only 19.6 MiB/s (I interrupted the write run after 150 MiB) and read bandwidth is 241 MiB/s.

```
seqwrite: (g=0): rw=write, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=16
fio-3.38-4-gcd56
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
^Cbs: 1 (f=1): [W(1)][0.8%][w=20.0MiB/s][w=640 IOPS][eta 13m:56s]
fio: terminating on signal 2

seqwrite: (groupid=0, jobs=1): err= 0: pid=1111: Thu Nov 14 07:34:59 2024
  write: IOPS=626, BW=19.6MiB/s (20.5MB/s)(150MiB/7661msec); 0 zone resets
    clat (usec): min=29, max=68982, avg=1591.88, stdev=8555.80
     lat (usec): min=30, max=68984, avg=1593.26, stdev=8555.78
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   40], 10.00th=[   41], 20.00th=[   41],
     | 30.00th=[   42], 40.00th=[   47], 50.00th=[   55], 60.00th=[   57],
     | 70.00th=[   59], 80.00th=[   59], 90.00th=[   63], 95.00th=[   77],
     | 99.00th=[49021], 99.50th=[49021], 99.90th=[49021], 99.95th=[49021],
     | 99.99th=[68682]
   bw (  KiB/s): min=18432, max=20480, per=100.00%, avg=20206.93, stdev=637.92, samples=15
   iops        : min=  576, max=  640, avg=631.47, stdev=19.94, samples=15
  lat (usec)   : 50=45.74%, 100=50.78%, 250=0.31%, 500=0.02%
  lat (msec)   : 50=3.12%, 100=0.02%
  cpu          : usr=0.61%, sys=2.13%, ctx=4585, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4801,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=19.6MiB/s (20.5MB/s), 19.6MiB/s-19.6MiB/s (20.5MB/s-20.5MB/s), io=150MiB (157MB), run=7661-7661msec

Disk stats (read/write):
  nvme0n1: ios=51/4800, sectors=2112/307200, merge=0/0, ticks=6/7514, in_queue=7519, util=98.69%


seqread: (g=0): rw=read, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=1
...
fio-3.38-4-gcd56
Starting 8 processes
Jobs: 3 (f=3): [R(1),_(4),R(2),_(1)][88.5%][r=1047MiB/s][r=33.5k IOPS][eta 00m:09s]
seqread: (groupid=0, jobs=8): err= 0: pid=1092: Thu Nov 14 07:30:57 2024
  read: IOPS=7699, BW=241MiB/s (252MB/s)(16.0GiB/68095msec)
    clat (usec): min=13, max=29836, avg=137.70, stdev=290.41
     lat (usec): min=13, max=29836, avg=137.92, stdev=290.52
    clat percentiles (usec):
     |  1.00th=[   20],  5.00th=[   22], 10.00th=[   23], 20.00th=[   24],
     | 30.00th=[   26], 40.00th=[   28], 50.00th=[   31], 60.00th=[   37],
     | 70.00th=[   47], 80.00th=[  420], 90.00th=[  445], 95.00th=[  478],
     | 99.00th=[  498], 99.50th=[  506], 99.90th=[  586], 99.95th=[ 1057],
     | 99.99th=[15270]
   bw (  KiB/s): min= 2048, max=3113088, per=100.00%, avg=349634.07, stdev=69274.13, samples=729
   iops        : min=   64, max=97284, avg=10925.81, stdev=2164.80, samples=729
  lat (usec)   : 20=1.65%, 50=69.44%, 100=3.46%, 250=0.38%, 500=24.37%
  lat (usec)   : 750=0.64%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.71%, sys=4.32%, ctx=858341, majf=0, minf=154
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=241MiB/s (252MB/s), 241MiB/s-241MiB/s (252MB/s-252MB/s), io=16.0GiB (17.2GB), run=68095-68095msec

Disk stats (read/write):
  nvme0n1: ios=516011/0, sectors=33024704/0, merge=0/0, ticks=60955/0, in_queue=60954, util=99.96%
```
Then I changed `CH_BITS` to 3 and ran the same fio experiments. Write bandwidth rises to 72.3 MiB/s as expected, but read bandwidth falls to 78.2 MiB/s. Why would this happen?

```
seqwrite: (g=0): rw=write, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=16
fio-3.38-4-gcd56
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 1 (f=1): [W(1)][100.0%][w=72.0MiB/s][w=2304 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=1048: Thu Nov 14 07:58:45 2024
  write: IOPS=2312, BW=72.3MiB/s (75.8MB/s)(16.0GiB/226764msec); 0 zone resets
    clat (usec): min=26, max=31048, avg=429.29, stdev=2160.66
     lat (usec): min=27, max=31050, avg=430.43, stdev=2160.66
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   38], 10.00th=[   38], 20.00th=[   38],
     | 30.00th=[   38], 40.00th=[   38], 50.00th=[   38], 60.00th=[   39],
     | 70.00th=[   39], 80.00th=[   41], 90.00th=[   58], 95.00th=[   67],
     | 99.00th=[12256], 99.50th=[12256], 99.90th=[12387], 99.95th=[12387],
     | 99.99th=[22152]
   bw (  KiB/s): min=67584, max=75927, per=100.00%, avg=74078.05, stdev=1383.76, samples=452
   iops        : min= 2112, max= 2372, avg=2314.90, stdev=43.23, samples=452
  lat (usec)   : 50=87.74%, 100=9.01%, 250=0.09%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=3.13%, 50=0.01%
  cpu          : usr=1.45%, sys=7.37%, ctx=489934, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=72.3MiB/s (75.8MB/s), 72.3MiB/s-72.3MiB/s (75.8MB/s-75.8MB/s), io=16.0GiB (17.2GB), run=226764-226764msec

Disk stats (read/write):
  nvme0n1: ios=50/524000, sectors=2104/33536000, merge=0/0, ticks=2/218353, in_queue=218355, util=100.00%

seqread: (g=0): rw=read, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=psync, iodepth=1
...
fio-3.38-4-gcd56
Starting 8 processes
Jobs: 1 (f=1): [_(3),R(1),_(4)][98.1%][r=70.4MiB/s][r=2252 IOPS][eta 00m:04s]                    
seqread: (groupid=0, jobs=8): err= 0: pid=1066: Thu Nov 14 08:02:46 2024
  read: IOPS=2502, BW=78.2MiB/s (82.0MB/s)(16.0GiB/209537msec)
    clat (usec): min=26, max=23734, avg=449.98, stdev=264.21
     lat (usec): min=26, max=23734, avg=450.45, stdev=264.22
    clat percentiles (usec):
     |  1.00th=[  371],  5.00th=[  396], 10.00th=[  404], 20.00th=[  420],
     | 30.00th=[  429], 40.00th=[  433], 50.00th=[  437], 60.00th=[  445],
     | 70.00th=[  465], 80.00th=[  482], 90.00th=[  494], 95.00th=[  498],
     | 99.00th=[  523], 99.50th=[  537], 99.90th=[  586], 99.95th=[ 1172],
     | 99.99th=[16057]
   bw (  KiB/s): min= 3328, max=574879, per=100.00%, avg=170196.40, stdev=18692.56, samples=1631
   iops        : min=  104, max=17964, avg=5318.13, stdev=584.13, samples=1631
  lat (usec)   : 50=0.04%, 100=0.01%, 250=0.01%, 500=95.30%, 750=4.60%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.03%, 50=0.01%
  cpu          : usr=0.62%, sys=3.03%, ctx=999070, majf=0, minf=151
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=78.2MiB/s (82.0MB/s), 78.2MiB/s-78.2MiB/s (82.0MB/s-82.0MB/s), io=16.0GiB (17.2GB), run=209537-209537msec

Disk stats (read/write):
  nvme0n1: ios=523929/0, sectors=33531456/0, merge=0/0, ticks=210808/0, in_queue=210808, util=100.00%
```
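As a sanity check on the logs above, here is a small sketch of the arithmetic behind them. With `psync`, fio caps the queue depth at 1 (as its own note in the write logs says), so one job's bandwidth should be roughly the block size divided by the average latency. The same kind of arithmetic estimates how many reads were in flight on average during the 8-job runs. The function names are hypothetical helpers for this calculation, not part of fio, and the inputs are copied from the logs.

```c
#include <assert.h>

/* Bandwidth in MiB/s of one queue-depth-1 job: one block per average latency. */
double qd1_bandwidth_mib_s(double block_kib, double avg_lat_us)
{
    assert(avg_lat_us > 0.0);
    return (block_kib / 1024.0) / (avg_lat_us / 1e6);
}

/* Average number of I/Os in flight: summed I/O time over wall-clock time. */
double effective_concurrency(double total_ios, double avg_lat_us,
                             double runtime_ms)
{
    assert(runtime_ms > 0.0);
    return (total_ios * avg_lat_us / 1e6) / (runtime_ms / 1e3);
}
```

For the writes, `qd1_bandwidth_mib_s(32, 1593.26)` comes out at about 19.6 and `qd1_bandwidth_mib_s(32, 430.43)` at about 72.6, which is close to the reported 19.6 MiB/s and 72.3 MiB/s. For the reads, `effective_concurrency(524288, 137.92, 68095)` is about 1.06 and `effective_concurrency(524288, 450.45, 209537)` is about 1.13, so by this estimate the eight read jobs had only about one I/O in flight on average in both runs. This only restates the numbers; it does not by itself explain why the per-read latency went up with `CH_BITS=3`.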

My second question is: **why is the performance improvement capped when I try to speed up reads and writes?**
For example, if we manage to compress the data of multiple LPNs into one PPN, write performance should clearly improve. But in my tests the improvement seems to be limited to around 5x, even when the compression ratio is very high.
I base this on two observations:
1. When I compress 118 LPNs into 1 PPN, write bandwidth improves by 5.6x. When I set the compressed data size to zero, which means the whole dataset is written into 1 PPN no matter how large it actually is, the improvement is still 5.6x.
2. The maximum improvement is the same with `CH_BITS` set to 1 and to 3. I would expect a larger improvement with `CH_BITS=3`, because it can use all 8 channels.
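One possible way to reason about a ceiling like this, offered as a sketch and not as a diagnosis, is an Amdahl-style model: if only a fraction `f` of the per-write time shrinks with compression and the rest is fixed, the speedup cannot exceed `1 / (1 - f)` however high the compression ratio gets. The function names below are hypothetical, and whether such a fixed per-write cost actually exists in the emulator is an assumption.

```c
#include <assert.h>

/* Amdahl-style speedup: a fraction f of the per-write time shrinks by a
 * factor s, while the remaining (1 - f) is unaffected. */
double bounded_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Fixed fraction implied by an observed speedup ceiling (s -> infinity). */
double implied_fixed_fraction(double max_speedup)
{
    assert(max_speedup >= 1.0);
    return 1.0 / max_speedup;
}
```

Under this model, a 5.6x ceiling would correspond to `implied_fixed_fraction(5.6)`, roughly 18% of the baseline per-write time not being reduced by compression. The model does not say where that time is spent.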

These two problems have been confusing me for a long time. I'd be very grateful if you would kindly answer them. Thanks a lot!

