GPT-20B vs ERNIE-21B LLM Benchmark Log Deep Comparison

AI Orchestrator 2025. 11. 5. 11:56

This article was analyzed using AI.

GPT-20B vs ERNIE-4.5-21B LLM Benchmark Log Deep Comparison
(English-based Analysis)

Analyzed Files
- gpt-20b_benchmark_20251104_182901.json
- ERNIE-4.5-21B_benchmark_20251104_181832.json
Test Date (Log Timestamp): November 4, 2025
Test Environment: Windows 10, Intel 6c/6t CPU, 32GB RAM, NVIDIA GeForce RTX 2060 SUPER (8GB), PyTorch 2.5.1+cu121, CUDA 12.1
- Acceleration: cublas_enabled: true
- llama_cpp_info (CPU ISA): AVX: true, AVX2: true, FMA: true, F16C: true
Pipeline Connection: Disabled (connect_pipeline: false), single LLM call, run_mode: simultaneous
Common Benchmark Parameters: llm_max_tokens=512, repeat_count=5, repeat_min_len=15, test_runs=3, 3 prompts
Model Families / Versions:
- GPT-20B Q4 (local inference)
- ERNIE-4.5-21B Q4 (local inference)
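The logs do not show how the benchmark tool derives TTFT and tokens/sec, so the following is only a minimal sketch of one common approach: time a streaming token iterator, record when the first token arrives, and divide the token count by the total elapsed time. The names `measure_stream`, `run_benchmark`, and `generate` are hypothetical, and the sketch assumes `test_runs` applies per prompt, which the log does not confirm.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and return TTFT (ms), tokens/sec and token count."""
    start = clock()
    first = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        count += 1
    end = clock()
    if first is None:  # the stream produced nothing
        return {"ttft_ms": None, "tokens_per_sec": 0.0, "tokens": 0}
    total = end - start
    return {
        "ttft_ms": (first - start) * 1000.0,
        # Assumption: throughput is counted over the whole call, TTFT included.
        "tokens_per_sec": count / total if total > 0 else 0.0,
        "tokens": count,
    }


def run_benchmark(generate, prompts, test_runs=3, max_tokens=512):
    """Call generate(prompt, max_tokens) test_runs times per prompt (assumed layout)."""
    results = []
    for prompt in prompts:
        for run in range(test_runs):
            stats = measure_stream(generate(prompt, max_tokens))
            results.append({"prompt": prompt, "run": run, **stats})
    return results
```

Here `generate` stands in for whatever streaming call the local runtime exposes; any function that returns an iterator of tokens will work. Whether decode speed should include the time before the first token is a design choice, so numbers from this sketch are not directly comparable to the table below.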

1) Performance Metrics Summary (Average)

Metric	GPT-20B Q4	ERNIE-21B Q4	Δ (ERNIE − GPT)	Interpretation
Model Load Time (s)	4.726	7.161	+2.435	ERNIE loads slower on average, driven mostly by the P1 load (11.98 s); its P2 and P3 loads are within about 0.2 s of GPT.
TTFT (ms)	732.56	818.13	+85.57	ERNIE’s average first-token latency is higher, again mainly because of P1; on P2 and P3 the gap is about 5 ms or less.
Tokens/sec	10.893	10.844	−0.049	Nearly identical decoding speeds.
Inference Time (s)	42.54	45.19	+2.64	ERNIE is slower overall, mostly because of P1 (+8.22 s).
CPU Utilization (%)	47.6	52.1	+4.5	ERNIE uses more CPU (possibly lower thread efficiency).
VRAM (GB)	7.788	7.825	+0.037	ERNIE uses slightly more VRAM.
GPU Power (W)	68.10	68.67	+0.57	Similar power consumption.
GPU Temp (°C)	52.0	50.7	−1.3	ERNIE runs marginally cooler.
Total Test Duration (s)	257.22	271.14	+13.92	ERNIE takes longer to complete full runs.

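Summary: GPT-20B is faster on average in load time, TTFT, and total inference time, although most of the gap comes from the first prompt (P1); on P2 and P3 the two models are nearly level. GPU power and temperature are nearly identical, while ERNIE shows higher CPU utilization on every prompt.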
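To put the absolute differences in proportion, the averages from the table can be turned into relative differences. This is a small sketch using only the numbers printed above; the names `AVERAGES`, `compare`, and `summarize` are made up for illustration.

```python
# Averages copied from the summary table: (GPT-20B Q4, ERNIE-21B Q4).
AVERAGES = {
    "Model Load Time (s)": (4.726, 7.161),
    "TTFT (ms)": (732.56, 818.13),
    "Tokens/sec": (10.893, 10.844),
    "Inference Time (s)": (42.54, 45.19),
    "CPU Utilization (%)": (47.6, 52.1),
    "VRAM (GB)": (7.788, 7.825),
    "GPU Power (W)": (68.10, 68.67),
    "GPU Temp (°C)": (52.0, 50.7),
    "Total Test Duration (s)": (257.22, 271.14),
}


def compare(gpt, ernie):
    """Return (absolute delta, delta as a percentage of the GPT value)."""
    delta = ernie - gpt
    return delta, 100.0 * delta / gpt


def summarize(averages):
    """Apply compare() to every metric in the table."""
    return {name: compare(g, e) for name, (g, e) in averages.items()}


if __name__ == "__main__":
    for name, (delta, pct) in summarize(AVERAGES).items():
        print(f"{name:<26} {delta:+9.3f} {pct:+8.2f}%")
```

Read this way, the load-time gap is about 52% and the TTFT gap about 12%, while tokens/sec differs by less than 0.5% and total test duration by roughly 5.4%.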

2) Prompt-Level Comparison & English Output Quality

Prompt	Category	GPT-20B Q4	ERNIE-4.5-21B Q4	Notes (Δ = ERNIE − GPT)
P1. “I had such an amazing day today, I feel like I'm floating!”	Load Time (s)	4.80	11.98	+7.18
	TTFT (ms)	705.6	965.2	+259.6
	Inference Time (s)	30.90	39.12	+8.22
	CPU Usage (%)	45	51	+6
	Total Duration (s)	36	51	+15
	Output Quality	Fluent and expressive English; natural conversational tone.	Fragmented syntax; repetitive meta-instructions (“You must respond in the user’s language”).	GPT clearer and more natural
P2. “The project deadline is looming, and I feel so anxious because I can't seem to focus on anything.”	Load Time (s)	4.63	4.84	+0.21
	TTFT (ms)	737.2	739.5	+2.3
	Inference Time (s)	47.88	47.99	+0.11
	CPU Usage (%)	48	52	+4
	Total Duration (s)	53	56	+3
	Output Quality	Smooth structure, clear advice, logical paragraphs.	Logical but stilted; includes translated directives (“Answer in Korean”).	GPT maintains flow
P3. “I'm excited about the advancement of AI technology, but at the same time, I'm worried it might reduce the number of jobs.”	Load Time (s)	4.75	4.67	−0.08
	TTFT (ms)	754.8	749.7	−5.1
	Inference Time (s)	48.84	48.45	−0.39
	CPU Usage (%)	49	53	+4
	Total Duration (s)	50	52	+2
	Output Quality	Balanced and nuanced English; grammatically consistent.	Mechanically literal phrasing, incomplete sentence (“The well-?”).	GPT better balanced
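The per-prompt rows make it possible to check how much of the average gap comes from P1 alone. The sketch below recomputes the mean ERNIE − GPT gap with and without P1, using only the values in the table; `PER_PROMPT`, `mean_gap`, and `gap_report` are illustrative names.

```python
from statistics import mean

# Per-prompt values copied from the table: metric -> [(GPT, ERNIE) for P1, P2, P3].
PER_PROMPT = {
    "Load Time (s)": [(4.80, 11.98), (4.63, 4.84), (4.75, 4.67)],
    "TTFT (ms)": [(705.6, 965.2), (737.2, 739.5), (754.8, 749.7)],
    "Inference Time (s)": [(30.90, 39.12), (47.88, 47.99), (48.84, 48.45)],
}


def mean_gap(pairs):
    """Mean of (ERNIE - GPT) over the given prompt pairs."""
    return mean(e - g for g, e in pairs)


def gap_report(per_prompt):
    """Mean gap over all prompts and over P2/P3 only, per metric."""
    return {
        metric: {
            "all_prompts": mean_gap(pairs),
            "without_p1": mean_gap(pairs[1:]),  # drop the first prompt
        }
        for metric, pairs in per_prompt.items()
    }


if __name__ == "__main__":
    for metric, gaps in gap_report(PER_PROMPT).items():
        print(f"{metric:<20} all: {gaps['all_prompts']:+8.2f}   "
              f"without P1: {gaps['without_p1']:+8.2f}")
```

With P1 excluded, the mean gaps shrink to roughly +0.07 s for load time, −1.4 ms for TTFT, and −0.14 s for inference time, so on P2 and P3 the two models are effectively tied on speed. The log does not say why P1 is slower for ERNIE, so a cold-start effect is only one possible explanation.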

3) Language Output Quality Patterns

Category	GPT-20B	ERNIE-21B	Analysis
Prompt Leakage	Rare	Frequent	ERNIE often exposes internal system prompts.
Language Consistency (EN)	Stable	Unstable	ERNIE outputs partial literal translations.
Fluency	Natural	Fragmented	ERNIE’s syntax less natural due to translation residue.
Semantic Coherence	High	Medium	ERNIE adds redundant policy lines; GPT stays focused.

Interpretation: With all outputs normalized to English, GPT-20B retains grammatical fluency and tone, while ERNIE-4.5-21B shows clear prompt echo and translation residue, which points to a problem in how its language policy is handled.
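"Rare" and "Frequent" in the table are qualitative labels. One way to make prompt leakage measurable is to count how many responses contain a known leaked phrase. The sketch below uses the two phrases quoted in section 2 as patterns; the function names are hypothetical and the pattern list would need extending for other logs.

```python
import re

# The two leaked phrases quoted in the prompt-level table.
LEAK_PATTERNS = [
    re.compile(r"you must respond in the user['’]s language", re.IGNORECASE),
    re.compile(r"answer in korean", re.IGNORECASE),
]


def find_leaks(text):
    """Return every leaked phrase found in one model response."""
    hits = []
    for pattern in LEAK_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits


def leak_rate(responses):
    """Fraction of responses that contain at least one leaked phrase."""
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if find_leaks(r))
    return flagged / len(responses)
```

For example, `leak_rate(["ok", "Answer in Korean. hello"])` returns `0.5`. Running this over each model's saved responses would give a per-model rate to put next to the qualitative labels.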

4) Operational Recommendations

Post-Processing Pipeline – filter system instructions, remove token artifacts.
Stop/Chat Template Review – reinforce <|end|> stop tokens to prevent leakage.
Language Policy Enforcement – enforce consistent output language (English).
Latency Optimization – implement KV-cache warmup to reduce TTFT.
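The first three recommendations can be sketched as a small post-processing step: cut the response at the stop marker, strip leaked directives, and flag output that is not in the target language. This is only an illustration under assumptions. The directive patterns are the two phrases quoted in section 2, and the language check simply looks for Hangul characters, on the assumption that stray Korean is the main failure mode in these logs.

```python
import re

STOP_MARKERS = ("<|end|>",)  # stop marker named in the recommendations
DIRECTIVE = re.compile(
    r"(you must respond in the user['’]s language|answer in korean)[.!]?\s*",
    re.IGNORECASE,
)
HANGUL = re.compile(r"[\uac00-\ud7a3]")  # Unicode Hangul syllables block


def clean_response(text):
    """Truncate at the first stop marker, then remove leaked directives."""
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    text = DIRECTIVE.sub("", text)
    # Collapse whitespace left by the removals (this also flattens paragraphs).
    return re.sub(r"\s+", " ", text).strip()


def violates_english_policy(text):
    """True if the response contains Hangul, i.e. is not purely English."""
    return bool(HANGUL.search(text))
```

A filter like this treats the symptom only. If the chat template or stop-token configuration is the real cause, fixing it at generation time is preferable, and the filter then serves as a safety net.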

5) Comprehensive Conclusion

Aspect	Observation
Model Type	Both Q4 (4-bit GGUF) quantized, tested on 8GB VRAM.
Practical Usability	Execution possible but unstable for real use; near-zero VRAM margin → high OOM risk.
Recommended VRAM	12GB+ required for stable inference.
Performance	GPT-20B faster in Load, TTFT, total time.
Language Output Quality	GPT-20B more fluent; ERNIE suffers from prompt leakage.
CPU Efficiency	GPT-20B better (47.6% vs 52.1%).
Total Runtime	GPT ≈ 257s (4m17s), ERNIE ≈ 271s (4m31s).
Overall Verdict	GPT-20B superior in speed, linguistic consistency, and stability.
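The "near-zero VRAM margin" can be quantified from the section 1 averages. The sketch below assumes a nominal 8.0 GB card; usable VRAM is typically somewhat lower, so the real margin is smaller than these figures. `vram_headroom` is an illustrative helper, not part of the benchmark tool.

```python
def vram_headroom(used_gb, total_gb=8.0):
    """Return (free GB, free share in percent) for a given VRAM usage."""
    free = total_gb - used_gb
    return free, 100.0 * free / total_gb


if __name__ == "__main__":
    # Average VRAM usage from the section 1 table.
    for name, used in (("GPT-20B Q4", 7.788), ("ERNIE-21B Q4", 7.825)):
        for total in (8.0, 12.0):
            free, pct = vram_headroom(used, total)
            print(f"{name:<13} on {total:>4.0f} GB: {free:.3f} GB free ({pct:.1f}%)")
```

On the 8 GB card this leaves about 0.21 GB for GPT-20B and 0.18 GB for ERNIE, under 3% in both cases. The same usage on a 12 GB card would leave more than 4 GB free, which is consistent with the 12GB+ recommendation.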
