
Some questions and suggestions  #5


@jjfumero

Hi Jürgen,
this is awesome exploratory work. I have 2 questions and 1 suggestion (sorry for the long issue):

  1. How do we know the results are correct? I am running the tornado-multi-tg branch with the PTX backend and I get the following results:
tornado --jvm="-Dtb.device=0:0 -Dol.device=0:0 -DUseVectorAPI=true -Dtornado.device.memory=2GB" --classpath bin com.otabuzzman.llmj.TestGpt2
WARNING: Using incubator modules: jdk.incubator.vector
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
padded_vocab_size: 50304
num_layers: 12
num_heads:12
channels: 768
num_parameters: 124475904
[State]
batch_size: 4
seq_len: 64
num_activations: 73347840
final matmul forward took 178 ms
forward pass took 813 ms
initial matmul_backward took 1926 ms
backward pass took 6448 ms
-43.431618, -43.431759
-39.836346, -39.836483
-43.065910, -43.066025
-42.828045, -42.828171
-43.529541, -43.529675
-44.318398, -44.318546
-41.227425, -41.227543
-41.270760, -41.270866
-42.541393, -42.541557
-42.394997, -42.395157
OK (LOGITS), max_diff = 5.355835e-03
LOSS OK: 5.269892 5.270009
dwte
OK -0.002320 -0.002320
OK 0.002072 0.002072
OK 0.003716 0.003717
OK 0.001307 0.001307
OK 0.000631 0.000632
TENSOR OK, maxdiff = 1.363754e-03
dwpe
OK -0.005118 -0.005110
OK -0.000001 -0.000012
OK -0.003267 -0.003262
OK 0.009910 0.009909
OK 0.002155 0.002145
TENSOR OK, maxdiff = 5.488563e-05
dln1w
OK -0.007520 -0.007523
OK 0.008624 0.008643
OK 0.005003 0.005029
OK -0.011099 -0.011095
OK -0.001666 -0.001664
TENSOR OK, maxdiff = 3.605247e-03
dln1b
OK -0.038494 -0.038458
OK -0.030547 -0.030600
OK 0.010189 0.010223
OK 0.080135 0.080176
OK -0.060991 -0.060901
TENSOR OK, maxdiff = 1.531513e-03
dqkvw
OK -0.000031 -0.000031
OK -0.000026 -0.000025
OK -0.000064 -0.000064
OK 0.000074 0.000074
OK 0.000020 0.000020
TENSOR OK, maxdiff = 5.583316e-04
dqkvb
OK -0.000414 -0.000411
OK -0.000410 -0.000412
OK 0.000113 0.000113
OK -0.000564 -0.000565
OK 0.000574 0.000570
TENSOR OK, maxdiff = 3.140289e-04
dattprojw
OK 0.000081 0.000080
OK -0.000005 -0.000005
OK -0.000019 -0.000019
OK 0.000005 0.000004
OK 0.000031 0.000031
TENSOR OK, maxdiff = 2.241824e-04
dattprojb
OK 0.000456 0.000470
OK -0.009969 -0.009979
OK -0.001794 -0.001804
OK 0.037638 0.037584
OK -0.031287 -0.031239
TENSOR OK, maxdiff = 2.013203e-04
dln2w
OK -0.018372 -0.018312
OK 0.004812 0.004813
OK 0.008084 0.008091
OK -0.001465 -0.001470
OK -0.002739 -0.002737
TENSOR OK, maxdiff = 1.153624e-02
dln2b
OK -0.026405 -0.026368
OK -0.016711 -0.016695
OK 0.001067 0.001074
OK 0.034754 0.034711
OK -0.028630 -0.028584
TENSOR OK, maxdiff = 9.740144e-04
dfcw
OK 0.000438 0.000440
OK -0.000000 -0.000000
OK -0.000153 -0.000154
OK -0.000165 -0.000165
OK 0.000404 0.000405
TENSOR OK, maxdiff = 9.585470e-04
dfcb
OK 0.003282 0.003293
OK 0.002038 0.002043
OK -0.001386 -0.001386
OK 0.000381 0.000386
OK 0.001602 0.001604
TENSOR OK, maxdiff = 2.334092e-04
dfcprojw
OK 0.000678 0.000681
OK 0.000073 0.000073
OK -0.000415 -0.000416
OK -0.000059 -0.000061
OK -0.000603 -0.000604
TENSOR OK, maxdiff = 4.579828e-04
dfcprojb
OK 0.003572 0.003584
OK -0.007148 -0.007158
OK -0.001955 -0.001964
OK 0.001466 0.001462
OK 0.001219 0.001217
TENSOR OK, maxdiff = 1.408001e-04
dlnfw
OK -0.000022 -0.000022
OK 0.000811 0.000811
OK 0.001161 0.001161
OK -0.002956 -0.002957
OK 0.001146 0.001145
TENSOR OK, maxdiff = 3.411621e-04
dlnfb
OK -0.011101 -0.011101
OK 0.008007 0.008007
OK -0.004763 -0.004769
OK -0.002110 -0.002113
OK -0.005903 -0.005905
TENSOR OK, maxdiff = 6.320933e-05
step 0: loss 5.269892 (took 8446 ms) OK = true
final matmul forward took 155 ms
forward pass took 425 ms
initial matmul_backward took 1550 ms
backward pass took 5711 ms
step 1: loss 4.059387 (took 6201 ms) OK = true
final matmul forward took 154 ms
forward pass took 414 ms
initial matmul_backward took 1569 ms
backward pass took 5726 ms
step 2: loss 3.374211 (took 6199 ms) OK = true
final matmul forward took 155 ms
forward pass took 417 ms
initial matmul_backward took 1578 ms
backward pass took 5715 ms
step 3: loss 2.800125 (took 6197 ms) OK = true
final matmul forward took 148 ms
forward pass took 398 ms
initial matmul_backward took 1577 ms
backward pass took 5736 ms
step 4: loss NaN (took 6191 ms) OK = false
final matmul forward took 146 ms
forward pass took 387 ms
initial matmul_backward took 1562 ms
backward pass took 5495 ms
step 5: loss NaN (took 5939 ms) OK = false
final matmul forward took 149 ms
forward pass took 386 ms
initial matmul_backward took 1569 ms
backward pass took 5515 ms
step 6: loss NaN (took 5957 ms) OK = false
final matmul forward took 149 ms
forward pass took 385 ms
initial matmul_backward took 1561 ms
backward pass took 5484 ms
step 7: loss NaN (took 5926 ms) OK = false
final matmul forward took 145 ms
forward pass took 377 ms
initial matmul_backward took 1557 ms
backward pass took 5501 ms
step 8: loss NaN (took 5935 ms) OK = false
final matmul forward took 146 ms
forward pass took 387 ms
initial matmul_backward took 1578 ms
backward pass took 5515 ms
step 9: loss NaN (took 5959 ms) OK = false
overall okay: false
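
To narrow down where the run above goes wrong, it may help to record the first step at which the loss stops being finite, and to make the tolerance comparison reject NaN explicitly. The following is a minimal sketch under assumptions, not the check that TestGpt2 actually performs: the class name `LossCheck`, both helper methods, and the tolerance value are hypothetical. The loss values are copied from the log above.

```java
public final class LossCheck {
    // Hypothetical tolerance; the threshold used by the real test is not shown in the log.
    public static final float TOLERANCE = 1e-2f;

    // True only if both values are real numbers and differ by at most tol.
    // NaN is rejected up front: every comparison against NaN is false, so a
    // check written as "fail if diff > tol" would silently accept it.
    public static boolean closeEnough(float actual, float expected, float tol) {
        if (Float.isNaN(actual) || Float.isNaN(expected)) {
            return false;
        }
        return Math.abs(actual - expected) <= tol;
    }

    // Index of the first NaN or infinite value, or -1 if all values are finite.
    public static int firstNonFinite(float[] values) {
        for (int i = 0; i < values.length; i++) {
            if (!Float.isFinite(values[i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Losses of steps 0 to 4 as printed in the log above.
        float[] losses = {5.269892f, 4.059387f, 3.374211f, 2.800125f, Float.NaN};
        System.out.println("first non-finite step: " + firstNonFinite(losses));
        // Step 0 loss against its reference value from the "LOSS OK" line.
        System.out.println("step 0 within tolerance: "
                + closeEnough(5.269892f, 5.270009f, TOLERANCE));
    }
}
```

Applying `firstNonFinite` to the gradient tensors after each backward pass, not only to the loss, would show whether the NaN first appears in the forward pass, the backward pass, or the parameter update.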
  2. I am getting some errors when using the OpenCL backend:
[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

Are you using a different thread-block size for each backend, or do you leave this decision to the TornadoVM runtime? I saw that the code was actually generated and installed using the OpenCL driver, but the thread-block size seems to be the problem, so I wonder if this could be a bug in TornadoVM.
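
One way to test the thread-block hypothesis would be to launch the failing kernel with an explicitly capped work-group size and see whether the CL_OUT_OF_RESOURCES error disappears. The sketch below only computes such a size: it picks the largest local size that does not exceed a cap and evenly divides the global size. The class `WorkGroupSize`, the method `pickLocalSize`, and the cap of 256 are hypothetical. How the value is handed to the runtime is deliberately left out.

```java
public final class WorkGroupSize {
    // Largest local size <= maxLocalSize that evenly divides globalSize.
    // Falls back to 1, which always divides the global size.
    public static int pickLocalSize(int globalSize, int maxLocalSize) {
        if (globalSize <= 0 || maxLocalSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        for (int local = Math.min(globalSize, maxLocalSize); local > 1; local--) {
            if (globalSize % local == 0) {
                return local;
            }
        }
        return 1;
    }

    public static void main(String[] args) {
        // Global size taken from the run above: batch_size * seq_len * channels.
        int global = 4 * 64 * 768;
        // 256 is an assumed cap for the experiment, not a device limit read from the GPU.
        System.out.println("local size: " + pickLocalSize(global, 256));
    }
}
```

If the kernel runs with a smaller cap but fails with the size the runtime picks by default, that would point at the work-group selection and not at the generated code.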

  3. Just a minor thing. I noticed the Makefile could be a bit more generic by using the TORNADO_SDK variable, at least on Linux.
diff --git a/Makefile b/Makefile
index 1a47726..fcff010 100644
--- a/Makefile
+++ b/Makefile
@@ -23,7 +23,7 @@ ifdef winos
                src\com\otabuzzman\llmj\*.java
 else
        javac \
-               -classpath "../TornadoVM/bin/sdk/share/java/tornado/*" \
+               -classpath "${TORNADO_SDK}/share/java/tornado/*" \
                -g:vars --enable-preview \
                --add-modules jdk.incubator.vector \
                -target 21 -source 21 \
