DATE23-BPAN, Benchmarking LLMs for Automated Verilog RTL Code Generation
Fine-tuning
CodeGen, 345M to 16B
GitHub: 50K files / ~300 MB; Verilog books: 100 MB
Yes |
CodeGen 2B: 1 epoch, 2×RTX 8000 (48 GB), 2 days; CodeGen 6B: 1 epoch, 4×RTX 8000 (48 GB), 4 days; CodeGen 16B: 1 epoch, 3×A100, 6 days.
Code generation from HDLBits website |
1. Compiled completions; 2. functional pass |
1. Fine-tuning increases the rate of completions that compile significantly (measured over 10 different completions per problem; see the compile-check sketch below); 2. Fine-tuned models are still poor at functional correctness on intermediate and advanced problems.
LLMs are only good at small-scale / lightweight tasks.
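Below is a minimal sketch of the "compiled completions" metric described in this block: sample several completions per HDLBits problem and count how many compile with Icarus Verilog. The `iverilog` invocation and file handling are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch (not the paper's actual harness) of the "compiled completions"
# metric: sample several completions per HDLBits problem and count how many
# compile with Icarus Verilog. Assumes `iverilog` is installed and on PATH.
import os
import subprocess
import tempfile

def compiles(verilog_source: str) -> bool:
    """Return True if iverilog accepts the Verilog source without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(verilog_source)
        path = f.name
    try:
        result = subprocess.run(
            ["iverilog", "-o", os.devnull, path], capture_output=True
        )
        return result.returncode == 0
    finally:
        os.remove(path)

def compile_pass_rate(completions: list[str]) -> float:
    """Fraction of sampled completions (e.g. 10 per problem) that compile."""
    return sum(compiles(c) for c in completions) / len(completions)
```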
ChipNeMo
Fine-tuning |
LLaMA2, 7B, 13B, 70B |
Internal data (bug summaries, design source, documentation, verification, other): ~22B tokens; Wiki: 1.5B tokens [natural language]; GitHub: 0.7B tokens, C++/Python/Verilog [code].
No |
7B: 2710 A100 hours; 13B: 5100 A100 hours |
Script generation; chatbot (88 practical questions across architecture/design/verification); bug summarization and analysis
Mostly human rating |
A larger learning rate (3e-4 vs. 5e-6) degrades performance significantly (see the fine-tuning sketch below).
In most cases, the 70B model without fine-tuning is better than the 13B model with fine-tuning.
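As a reference point for the learning-rate finding above, here is a minimal sketch of domain-adaptive causal-LM fine-tuning with Hugging Face Transformers at the smaller rate. The model name, corpus file, and batch settings are placeholders; this is not ChipNeMo's internal training stack.

```python
# Minimal sketch of domain-adaptive causal-LM fine-tuning at the smaller
# learning rate (5e-6). Model name, corpus file, and batch settings are
# placeholders; this is NOT ChipNeMo's internal training stack.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"          # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token        # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical plain-text domain corpus (docs, scripts, RTL), one sample per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="dapt-checkpoint",
    learning_rate=5e-6,                # per the note above, 3e-4 hurts badly
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```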
ChatEDA |
Fine-Tuning |
LLaMA2, 70B |
Generated via in-context learning (examples given in the prompt): ~1,500 instructions, followed by proofreading.
No |
15 epochs, 8×A100 (80 GB)
Task planning; script generation
1. Is the task planning accurate? 2. Is the generated script accurate?
|
Autoregressive objective? (See the data-formatting sketch below.)
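The autoregressive-objective note can be made concrete: each (instruction, script) pair is packed into one token sequence and trained with the standard next-token loss, typically masking the prompt tokens so only the response is scored. The prompt template, field names, and base tokenizer below are assumptions, not ChatEDA's released format.

```python
# Sketch of turning (instruction, script) pairs into autoregressive training
# examples: standard next-token loss, with prompt tokens masked out so only
# the response is scored. Prompt template, field names, and the base tokenizer
# are assumptions, not ChatEDA's released format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")  # placeholder

def build_example(instruction: str, response: str, max_len: int = 2048) -> dict:
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_len]
    # -100 labels are ignored by the loss, so gradients come only from the
    # generated EDA script / task plan, not from the instruction itself.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```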
RTL-Coder |
Fine-Tuning |
Mistral-7B-v0.1
27k+ samples (problem + RTL code pairs)
Yes |
4×RTX 4090
RTL code generation |
VerilogEval + RTLLM |
The new training scheme seems very effective; using the generation method in RTLLM, functional correctness even reaches 60% for GPT-4 (see the pass@k sketch below).
|
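For reference on how VerilogEval-style functional-correctness numbers are computed, here is the standard unbiased pass@k estimator from the HumanEval paper, which VerilogEval adopts. The sample counts in the usage line are made up for illustration and are not results from RTL-Coder.

```python
# Standard unbiased pass@k estimator (from the HumanEval paper), shown as a
# reference for how VerilogEval-style functional-correctness numbers are
# computed. The sample counts below are made up for illustration, not results
# from RTL-Coder.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples that pass, k: budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 12 functionally correct.
print(pass_at_k(20, 12, 1), pass_at_k(20, 12, 5))
```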