
Survey of LLMs for VLSI

Each entry below records the same fields: category, base model, training data, open-source status, training cost, target problem, evaluation metric, experimental conclusions, and other remarks.

DATE23-BPAN, "Benchmarking Large Language Models for Automated Verilog RTL Code Generation"
- Category: Fine-tuning
- Base model: CodeGen, 345M - 16B
- Data: GitHub: 50K files / ~300 MB of Verilog; Verilog textbooks: ~100 MB
- Open-source: Yes
- Training cost: CodeGen-2B: 1 epoch, 2x RTX 8000 (48 GB), 2 days; CodeGen-6B: 1 epoch, 4x RTX 8000 (48 GB), 4 days; CodeGen-16B: 1 epoch, 3x A100, 6 days
- Problem: code generation for problems from the HDLBits website
- Metric: 1. compiling completions; 2. functional pass (see the sketch below)
- Experimental conclusions: 1. fine-tuning significantly increases the rate of compiling completions (measured over 10 completions per problem); 2. even after fine-tuning, functional correctness on intermediate and advanced problems remains poor
- Other remarks: LLMs are only good at small-scale / lightweight tasks
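
The two metrics above can be reproduced with a small harness that feeds each generated completion to an open-source simulator. Below is a minimal sketch assuming Icarus Verilog (iverilog/vvp) is installed and that the per-problem testbench prints "ALL TESTS PASSED" on success; the file names and the pass string are illustrative assumptions, not the paper's actual harness.

```python
import subprocess
import tempfile
from pathlib import Path

def check_completion(design_code: str, testbench_code: str) -> tuple[bool, bool]:
    """Return (compiles, passes_functional_test) for one generated Verilog completion."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "design.v").write_text(design_code)
        (tmp / "tb.v").write_text(testbench_code)

        # Metric 1: does the completion compile?
        compile_proc = subprocess.run(
            ["iverilog", "-o", str(tmp / "sim.out"), str(tmp / "design.v"), str(tmp / "tb.v")],
            capture_output=True, text=True,
        )
        if compile_proc.returncode != 0:
            return False, False

        # Metric 2: does it pass the functional testbench?
        try:
            sim_proc = subprocess.run(
                ["vvp", str(tmp / "sim.out")],
                capture_output=True, text=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return True, False  # compiled, but the simulation hung
        return True, "ALL TESTS PASSED" in sim_proc.stdout
```

Running this over the 10 sampled completions per problem and averaging would give approximately the two rates the entry above refers to.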

ChipNeMo
- Category: Fine-tuning
- Base model: LLaMA2, 7B / 13B / 70B
- Data: internal data (bug summaries, design source, documentation, verification, other): ~22B tokens; Wikipedia: 1.5B tokens (natural language); GitHub: 0.7B tokens of C++, Python, and Verilog (code)
- Open-source: No
- Training cost: 7B: 2,710 A100-hours; 13B: 5,100 A100-hours
- Problem: script generation; chatbot (88 practical questions on architecture/design/verification); bug summarization and analysis
- Metric: mostly human rating
- Experimental conclusions: a larger learning rate (3e-4 vs. 5e-6) degrades performance significantly (see the sketch below)
- Other remarks: in most cases, the 70B model without fine-tuning is better than the 13B model with fine-tuning
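
The learning-rate finding above is essentially a trainer-configuration choice. Below is a minimal domain-adaptive pretraining sketch using Hugging Face transformers; the model name, toy corpus, and every hyperparameter except the learning-rate comparison are placeholders, not ChipNeMo's actual (non-public) recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works for the sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Toy stand-in for the in-domain corpus (bug reports, design source, documentation, ...).
texts = ["module counter(input clk, input rst, output reg [3:0] q); /* ... */ endmodule"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="dapt-checkpoints",
    learning_rate=5e-6,               # the smaller rate; 3e-4 reportedly degrades results
    num_train_epochs=1,
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```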

ChatEDA
- Category: Fine-tuning
- Base model: LLaMA2, 70B
- Data: ~1,500 instructions generated via in-context learning (examples given in the prompt), followed by manual proofreading (see the sketch below)
- Open-source: No
- Training cost: 15 epochs, 8x A100 (80 GB)
- Problem: task planning; script generation
- Metric: 1. whether the task planning is accurate; 2. whether the generated script is accurate
- Other remarks: autoregressive training objective?
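
The instruction data above is built by prompting a stronger model with a few in-context examples and then proofreading its output. Below is a minimal sketch of that pattern with the OpenAI Python client; the seed instructions, prompt wording, and model name are illustrative assumptions, not ChatEDA's actual prompt or generator.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hand-written seed instructions for EDA tool automation (illustrative only).
seed_examples = [
    "Run synthesis on design 'top' with a 500 MHz clock constraint and report timing.",
    "Perform placement and routing, then dump the post-route DEF file.",
]

prompt = (
    "You write instructions for an EDA tool-automation assistant.\n"
    "Here are example instructions:\n"
    + "\n".join(f"- {ex}" for ex in seed_examples)
    + "\nGenerate 5 new, diverse instructions in the same style, one per line."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",              # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)

candidates = [line.strip("- ").strip()
              for line in response.choices[0].message.content.splitlines()
              if line.strip()]

# Each candidate would then be proofread / filtered before being used for fine-tuning.
for c in candidates:
    print(c)
```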

RTL-Coder
- Category: Fine-tuning
- Base model: Mistral-7B-v0.1
- Data: 27K+ samples (problem / RTL code pairs)
- Open-source: Yes
- Training cost: 4x RTX 4090
- Problem: RTL code generation
- Metric: VerilogEval + RTLLM (see the sketch below)
- Experimental conclusions: the new training scheme appears to be very effective; using the generation method from RTLLM, functional correctness even reaches 60% for GPT-4
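
VerilogEval reports functional correctness as pass@k over multiple sampled completions (RTLLM uses a related multi-sample success criterion). Below is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval methodology, which this style of reporting follows; the sample counts in the usage example are illustrative.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n completions sampled, c of them functionally correct.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 20 completions sampled per problem, 7 pass the benchmark testbench.
print(f"pass@1  = {pass_at_k(20, 7, 1):.3f}")
print(f"pass@5  = {pass_at_k(20, 7, 5):.3f}")
print(f"pass@10 = {pass_at_k(20, 7, 10):.3f}")
```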