- HPU usage with TFHE-rs
- General card usage
- Debug
- Common issues
The HPU can be controlled directly from the Rust library TFHE-rs, starting from version v1.2.
One example is located in tfhe/examples/hpu/matmul.rs.
git clone https://github.com/zama-ai/tfhe-rs.git
cd tfhe-rs
source setup_hpu.sh --config v80 --init-qdma
cargo run --profile devo --features=hpu-v80 --example hpu_matmul
For faster build times during development, the devo profile is recommended.
cargo test --release --features hpu-v80 --test hpu
make bench_integer_hpu
or
RUSTFLAGS="-C target-cpu=native" __TFHE_RS_BENCH_OP_FLAVOR=DEFAULT __TFHE_RS_FAST_BENCH=FALSE __TFHE_RS_BENCH_TYPE=latency \
cargo bench --bench integer-bench --features=integer,internal-keycache,pbs-stats,hpu,hpu-v80 -p tfhe-benchmark -- --quick
Warning
Vivado must be installed on the machine and JTAG connected from host to the board (USB cable).
Let's say you made modifications in the firmware fw/arm and want to use the compiled elf on board.
source setup.sh
cd versal
just compile_fw
# firmware will be located in versal/output/amc.elf
xsdb
connect
ta 3
dow versal/output/amc.elf
rst -proc
con
To run card diagnostics, you must install xbtest and have an example AVED bitstream loaded into the FPGA.
Tip
You should check out our AVED xbtest branch if you use the 'amd_v80_gen5x8_23.2_exdes_2_xbtest_stress_20240409.zip' package.
Once done, compile the driver and load the ami module.
git clone https://github.com/zama-ai/AVED.git
cd AVED
git checkout xbtest # for version 20240409
cd sw/AMI/driver/
make clean && make
sudo modprobe -r ami && sudo insmod ami.ko
Warning
You must have booted the FPGA with an original AVED bitstream for running xbtest.
To launch memory diagnostics, run:
# Get pcie device number
PCIE_CARD=$(lspci -d 10ee:50b4)
DEVICE="${PCIE_CARD%% *}"
xbtest -d $DEVICE -c memory
The variable $DEVICE holds your board's Bus Device Function (BDF). You can find yours with lspci -d 10ee:50b4.
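The extraction above silently yields an empty $DEVICE when lspci finds no card. A minimal sketch of a guarded variant follows; the helper name bdf_from_lspci and the bus number 01 in the sample line are illustrative assumptions, not part of TFHE-rs or AVED.

```shell
# Extract the Bus:Device.Function field from one line of `lspci` output.
# Prints nothing and returns 1 when the input is empty (no card found).
bdf_from_lspci() {
    line=$1
    [ -n "$line" ] || return 1
    # The BDF is everything before the first space.
    printf '%s\n' "${line%% *}"
}

# Example with a captured line; on a real host, pass "$(lspci -d 10ee:50b4)".
sample='01:00.0 Processing accelerators: Xilinx Corporation Device 50b4'
if DEVICE=$(bdf_from_lspci "$sample"); then
    echo "DEVICE=$DEVICE"
else
    echo "no V80 physical function found" >&2
fi
```

Failing early this way avoids passing an empty -d argument to xbtest or ami_tool.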
You can read internal registers with HPUtils in TFHE-rs.
In order to build it you can launch: cargo build --profile devo --features=hpu-v80,utils --bin hputil
Then you can read registers with this tool.
source setup_hpu.sh --config v80 --init-qdma
./target/devo/hputil read --name info::version
You can also dump sets of parameters read from the HPU:
./target/devo/hputil dump arch    # dumps crypto parameter set and HPU parameters
./target/devo/hputil dump isc     # dumps Instruction Scheduler registers
./target/devo/hputil dump pe-mem  # dumps Load/Store processing element registers
./target/devo/hputil dump pe-pbs  # dumps PBS processing element registers
./target/devo/hputil dump pe-alu  # dumps ALU processing element registers
Proceed the same way as instructed by Xilinx: sudo ami_tool debug_verbosity -d $DEVICE -l debug.
This lets you see more of the messages published by the firmware running on the ARM core (RPU). By default, only errors are shown.
Warning
If you loaded the FPGA through JTAG, this solution will not work.
We currently don't have a general reset. To work around this, you can run a reload of type sbr ("reload -t sbr").
This command triggers a secondary bus reset, which resets part of the FPGA control logic and then entirely reloads the FPGA matrix from the contents of the flash. It uses the current partition.
sudo ami_tool reload -d $DEVICE -t sbr
lspci -d 10ee:50b4
lspci -d 10ee:50b5
The Bus Device Function of the Xilinx V80 board has the form 0X:00.Y.
X varies from server to server. Y can only be 0 or 1, because the board exposes only two Physical Functions.
The previous commands should output:
0X:00.0 Processing accelerators: Xilinx Corporation Device 50b4
0X:00.1 Processing accelerators: Xilinx Corporation Device 50b5
If this is not the case, you can try to remove and rescan the two physical functions.
sudo bash -c "echo 1 > /sys/bus/pci/devices/0000:{DEVICE}:00.0/remove"
sudo bash -c "echo 1 > /sys/bus/pci/devices/0000:{DEVICE}:00.1/remove"
sudo bash -c "echo 1 > /sys/bus/pci/rescan"
If you still cannot find the device after this, we suggest a cold reboot.
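The {DEVICE} placeholder in the remove commands stands for the bus number only (the 0X part of 0X:00.Y), not the full Bus Device Function. The sketch below shows one way to build both sysfs paths from a BDF; the helper names are hypothetical, PCI domain 0000 is assumed as in the commands above, and 01 is an illustrative bus number.

```shell
# Derive the bus number from a BDF such as 01:00.0 (text before the first colon).
bus_from_bdf() {
    printf '%s\n' "${1%%:*}"
}

# Print the sysfs "remove" path for each of the two physical functions.
# $1 is the PCI bus number; domain 0000 is assumed.
pf_remove_paths() {
    bus=$1
    for fn in 0 1; do
        printf '/sys/bus/pci/devices/0000:%s:00.%s/remove\n' "$bus" "$fn"
    done
}

# Only prints the paths; writing 1 to each of them requires root.
pf_remove_paths "$(bus_from_bdf 01:00.0)"
```

Each printed path can then be used in place of the hand-edited path in the sudo bash -c "echo 1 > ..." commands.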
Warning
JTAG must be plugged.
xsdb gives you information about the current status of the System On Chip (SoC), including the processors, the FPGA and the PMC.
xsdb
connect
ta
You should see the processor RPU as 3 Cortex-R5 #0 (Running).
If you cannot connect, check that Future Technology Devices International appears in the output of lsusb.
If it does not, the JTAG is unplugged or has an issue.
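The lsusb check can be scripted. This is a minimal sketch: the helper name has_ftdi is hypothetical, and the sample line is illustrative only (the exact product string reported for the board's adapter may differ; only the vendor name matters here).

```shell
# Succeeds when the lsusb listing on stdin contains an FTDI device,
# which is the vendor string the JTAG adapter is expected to report.
has_ftdi() {
    grep -q 'Future Technology Devices International'
}

# On a real host: lsusb | has_ftdi
sample='Bus 001 Device 004: ID 0403:6011 Future Technology Devices International, Ltd'
if printf '%s\n' "$sample" | has_ftdi; then
    echo "JTAG adapter detected"
else
    echo "JTAG adapter not found: check the USB cable"
fi
```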
xsdb
connect
ta 1
device status jtag_status
If the Done bit is '0', the FPGA has not been properly programmed and you will most likely need to re-program it.
If something is suspect you can have a look at this documentation.
Note
boot mode should be 'b1000 or 'b0100
boot mode 1000 means boot mode is OSPI: you will be able to boot from flash
boot mode 0100 means boot mode is JTAG: you will be able to boot from JTAG
- If you need to boot from the flash (OSPI), run xsdb versal/jtag/write_ospi_mode.tcl
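The two boot mode values listed in the note can be captured in a small lookup. This is a sketch only: boot_mode_name is a hypothetical helper, and it expects the four boot mode bits as you read them from the jtag_status output, since the exact layout of that output is not described here.

```shell
# Map the 4-bit boot mode field to the boot source named in the note above.
boot_mode_name() {
    case $1 in
        1000) echo "OSPI" ;;       # boot from flash
        0100) echo "JTAG" ;;       # boot from JTAG
        *)    echo "unexpected" ;; # any other value is not covered by the note
    esac
}

boot_mode_name 1000
boot_mode_name 0100
```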
xsdb
connect
ta 1
device status -hex error_status
All outputs should be zero. If any bit is set (this can happen even with a functional bitstream), refer to this documentation for the meaning of each bit.
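If you copy the reported values into a script, an all-zero check can look like the sketch below. The helper is_zero_hex and both sample words are assumptions for illustration; the actual format printed by device status -hex error_status may differ and would need its own parsing.

```shell
# Succeeds when a hex word such as 0x00000000 has no bit set.
# An empty argument counts as "not zero" so a failed read is not mistaken for success.
is_zero_hex() {
    digits=${1#0x}
    case $digits in
        '')     return 1 ;;
        *[!0]*) return 1 ;;
        *)      return 0 ;;
    esac
}

# Hypothetical sample values, not real register dumps.
for word in 0x00000000 0x00000400; do
    if is_zero_hex "$word"; then
        echo "$word: clear"
    else
        echo "$word: at least one error bit set"
    fi
done
```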
Note
We noticed that GSW ERROR can be raised with a working bitstream.
NOC NCR has already occurred during development; it likely indicates a serious issue introduced in the block design.
This means that there is an incompatibility between software and firmware versions. It is most likely caused by the version of your ami: its major version number probably does not match the AMC firmware major version. We modified the version of both pieces of software.
Simply compile and load the new ami module:
git clone https://github.com/zama-ai/AVED.git
cd AVED/sw/AMI/driver
make
sudo modprobe -r ami && sudo insmod ami.ko
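The mismatch described above is a comparison of major version numbers. A minimal sketch of that comparison follows; major_of and same_major are hypothetical helpers, the version strings are made up, and how you obtain the real ami and AMC versions is left to your setup.

```shell
# Print the major component of a version string such as 2.3.0 or v2.3.0.
major_of() {
    v=${1#v}
    printf '%s\n' "${v%%.*}"
}

# Succeeds when both version strings share the same major number.
same_major() {
    [ "$(major_of "$1")" = "$(major_of "$2")" ]
}

# Illustrative values only.
if same_major 2.3.0 2.1.4; then
    echo "driver and firmware majors match"
else
    echo "major version mismatch: rebuild and reload the ami module"
fi
```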
This likely means that your software is not properly synchronized between the app/api and the driver. This is common when several users share a machine.
You can work around this by using the relative path of the application. Make sure to recompile and reload the driver beforehand.
We noticed that on the V80, the board's communication over the I2C bus can get stuck, which prevents the system from booting.
The solution for now is simple: turn off your machine and unplug it, wait long enough for all the power to dissipate, and only then replug and reboot your machine ;-)