PKC AI Project

Building a multimodal chatbot on an entry-level graphics card with AI


GPT-20B vs ERNIE-21B LLM Benchmark Log Deep Comparison

AI Orchestrator 2025. 11. 5. 11:56

The analysis in this article was produced with AI.

GPT-20B vs ERNIE-4.5-21B LLM Benchmark Log Deep Comparison
(English-based Analysis)

  • Analyzed Files
    • gpt-20b_benchmark_20251104_182901.json
    • ERNIE-4.5-21B_benchmark_20251104_181832.json
  • Test Date (Log Timestamp): November 4, 2025
  • Test Environment: Windows 10, Intel 6c/6t CPU, 32GB RAM, NVIDIA GeForce RTX 2060 SUPER (8GB), PyTorch 2.5.1+cu121, CUDA 12.1
    • Acceleration: cublas_enabled: true
    • llama_cpp_info (CPU ISA): AVX: true, AVX2: true, FMA: true, F16C: true
  • Pipeline Connection: Disabled (connect_pipeline: false), single LLM call, run_mode: simultaneous
  • Common Benchmark Parameters: llm_max_tokens=512, repeat_count=5, repeat_min_len=15, test_runs=3, 3 prompts
  • Model Families / Versions:
    • GPT-20B Q4 (local inference)
    • ERNIE-4.5-21B Q4 (local inference)

1) Performance Metrics Summary (Average)

| Metric | GPT-20B Q4 | ERNIE-21B Q4 | Δ (ERNIE − GPT) | Interpretation |
|---|---|---|---|---|
| Model Load Time (s) | 4.726 | 7.161 | +2.435 | ERNIE loads slower on average; the gap comes almost entirely from P1 (11.98 s vs 4.80 s). |
| TTFT (ms) | 732.56 | 818.13 | +85.57 | ERNIE's first-token latency is longer on average, so its initial response is slower. |
| Tokens/sec | 10.893 | 10.844 | −0.049 | Nearly identical decoding speeds. |
| Inference Time (s) | 42.54 | 45.19 | +2.64 | ERNIE is slightly slower to generate a full response. |
| CPU Utilization (%) | 47.6 | 52.1 | +4.5 | ERNIE shows higher CPU utilization. |
| VRAM (GB) | 7.788 | 7.825 | +0.037 | ERNIE uses slightly more VRAM. |
| GPU Power (W) | 68.10 | 68.67 | +0.57 | Similar power consumption. |
| GPU Temp (°C) | 52.0 | 50.7 | −1.3 | ERNIE runs marginally cooler. |
| Total Test Duration (s) | 257.22 | 271.14 | +13.92 | ERNIE takes longer to complete the full run. |

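Summary: On average, GPT-20B is faster in model load, TTFT, and total inference time, although the prompt-level results in section 2 show that most of the gap comes from P1. GPU power draw and temperature are nearly identical, while ERNIE shows about 4.5 percentage points higher CPU utilization.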
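
The difference column in the table above holds absolute differences (ERNIE minus GPT), which are hard to compare across metrics with different units. A minimal sketch that recomputes those differences and adds a relative difference in percent is shown below. The values are copied from the table; the names `METRICS`, `compare`, and `report` are illustrative and not part of the benchmark tool.

```python
# Average metrics copied from the summary table: (GPT-20B Q4, ERNIE-4.5-21B Q4).
METRICS = {
    "Model Load Time (s)": (4.726, 7.161),
    "TTFT (ms)": (732.56, 818.13),
    "Tokens/sec": (10.893, 10.844),
    "Inference Time (s)": (42.54, 45.19),
    "CPU Utilization (%)": (47.6, 52.1),
    "VRAM (GB)": (7.788, 7.825),
    "GPU Power (W)": (68.10, 68.67),
    "GPU Temp (°C)": (52.0, 50.7),
    "Total Test Duration (s)": (257.22, 271.14),
}


def compare(gpt: float, ernie: float) -> tuple[float, float]:
    """Return (absolute difference, relative difference in %) of ERNIE vs GPT."""
    delta = ernie - gpt
    return delta, 100.0 * delta / gpt


def report(metrics=METRICS) -> dict:
    """Map each metric name to its (delta, percent) pair."""
    return {name: compare(gpt, ernie) for name, (gpt, ernie) in metrics.items()}


if __name__ == "__main__":
    for name, (delta, pct) in report().items():
        print(f"{name:<26} {delta:+9.3f} {pct:+7.2f}%")
```

Seen this way, the load-time gap is roughly 52% of GPT-20B's value, while the tokens/sec gap is under 0.5%, which is consistent with the "nearly identical decoding speeds" reading.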

[Figure: Result Comparison Log]



2) Prompt-Level Comparison & English Output Quality

| Prompt | Category | GPT-20B Q4 | ERNIE-4.5-21B Q4 | Δ (ERNIE − GPT) / Notes |
|---|---|---|---|---|
| P1. “I had such an amazing day today, I feel like I'm floating!” | Load Time (s) | 4.80 | 11.98 | +7.18 |
| | TTFT (ms) | 705.6 | 965.2 | +259.6 |
| | Inference Time (s) | 30.90 | 39.12 | +8.22 |
| | CPU Usage (%) | 45 | 51 | +6 |
| | Total Duration (s) | 36 | 51 | +15 |
| | Output Quality | Fluent and expressive English; natural conversational tone. | Fragmented syntax; repetitive meta-instructions (“You must respond in the user’s language”). | GPT is clearer and more natural. |
| P2. “The project deadline is looming, and I feel so anxious because I can't seem to focus on anything.” | Load Time (s) | 4.63 | 4.84 | +0.21 |
| | TTFT (ms) | 737.2 | 739.5 | +2.3 |
| | Inference Time (s) | 47.88 | 47.99 | +0.11 |
| | CPU Usage (%) | 48 | 52 | +4 |
| | Total Duration (s) | 53 | 56 | +3 |
| | Output Quality | Smooth structure, clear advice, logical paragraphs. | Logical but stilted; includes translated directives (“Answer in Korean”). | GPT maintains better flow. |
| P3. “I'm excited about the advancement of AI technology, but at the same time, I'm worried it might reduce the number of jobs.” | Load Time (s) | 4.75 | 4.67 | −0.08 |
| | TTFT (ms) | 754.8 | 749.7 | −5.1 |
| | Inference Time (s) | 48.84 | 48.45 | −0.39 |
| | CPU Usage (%) | 49 | 53 | +4 |
| | Total Duration (s) | 50 | 52 | +2 |
| | Output Quality | Balanced and nuanced English; grammatically consistent. | Mechanically literal phrasing; incomplete sentence (“The well-?”). | GPT is better balanced. |
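
The per-prompt numbers above suggest that the average gaps are dominated by P1. The sketch below recomputes the mean ERNIE-minus-GPT gap with and without P1. Treating P1 as a possible cold-start outlier is an assumption made for illustration only; the logs do not state why P1 differs. `PER_PROMPT` and `mean_gap` are hypothetical names.

```python
from statistics import mean

# Per-prompt values copied from the table above; index 0 = P1, 1 = P2, 2 = P3.
PER_PROMPT = {
    "Load Time (s)": {
        "GPT": [4.80, 4.63, 4.75],
        "ERNIE": [11.98, 4.84, 4.67],
    },
    "TTFT (ms)": {
        "GPT": [705.6, 737.2, 754.8],
        "ERNIE": [965.2, 739.5, 749.7],
    },
    "Inference Time (s)": {
        "GPT": [30.90, 47.88, 48.84],
        "ERNIE": [39.12, 47.99, 48.45],
    },
}


def mean_gap(values: dict, skip_first: bool = False) -> float:
    """Mean ERNIE value minus mean GPT value, optionally ignoring P1."""
    start = 1 if skip_first else 0
    return mean(values["ERNIE"][start:]) - mean(values["GPT"][start:])


if __name__ == "__main__":
    for name, values in PER_PROMPT.items():
        all_prompts = mean_gap(values)
        without_p1 = mean_gap(values, skip_first=True)
        print(f"{name:<20} all: {all_prompts:+8.3f}   without P1: {without_p1:+8.3f}")
```

With P1 excluded, the load-time gap shrinks to well under 0.1 s and the TTFT and inference-time gaps turn slightly negative, so the speed ranking on P2 and P3 alone is close to a tie. Three prompts are too few to draw a firm conclusion either way.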

3) Language Output Quality Patterns

| Category | GPT-20B | ERNIE-21B | Analysis |
|---|---|---|---|
| Prompt Leakage | Rare | Frequent | ERNIE often exposes internal system prompts. |
| Language Consistency (EN) | Stable | Unstable | ERNIE outputs partial literal translations. |
| Fluency | Natural | Fragmented | ERNIE's syntax is less natural due to translation residue. |
| Semantic Coherence | High | Medium | ERNIE adds redundant policy lines; GPT stays focused. |

Interpretation: After normalization to English, GPT-20B retains grammatical fluency and tone, while ERNIE-4.5-21B shows clear prompt echo and translation residue, which points to a problem in how the language policy is handled.


4) Operational Recommendations

  1. Post-Processing Pipeline – filter system instructions, remove token artifacts.
  2. Stop/Chat Template Review – reinforce <|end|> stop tokens to prevent leakage.
  3. Language Policy Enforcement – enforce consistent output language (English).
  4. Latency Optimization – implement KV-cache warmup to reduce TTFT.
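
Recommendations 1 and 2 can be sketched as a small post-processing step: cut the response at the first stop marker, then drop lines that echo a system directive. This is a minimal illustration, not the pipeline used in the benchmark. The two directive patterns are taken from the ERNIE outputs quoted in section 2, and `LEAK_PATTERNS`, `STOP_MARKERS`, and `clean_response` are hypothetical names.

```python
import re

# Illustrative patterns based on the leaked directives quoted in section 2.
# A real deployment would need a list tuned to its own system prompt.
LEAK_PATTERNS = [
    re.compile(r"you must respond in the user[’']s language", re.IGNORECASE),
    re.compile(r"answer in korean", re.IGNORECASE),
]

# Stop markers after which any generated text is discarded.
STOP_MARKERS = ["<|end|>"]


def clean_response(text: str) -> str:
    """Truncate at the earliest stop marker, then drop lines that echo a directive."""
    cut = len(text)
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    text = text[:cut]

    kept = [
        line
        for line in text.splitlines()
        if not any(pattern.search(line) for pattern in LEAK_PATTERNS)
    ]
    return "\n".join(kept).strip()


if __name__ == "__main__":
    raw = (
        "You must respond in the user's language.\n"
        "That sounds like a wonderful day!<|end|>leftover tokens"
    )
    print(clean_response(raw))
```

The filter removes a whole line whenever it contains a directive, so a line that mixes a directive with useful text is lost as well. Passing the stop markers to the inference call itself, where the runtime supports it, avoids generating the extra tokens in the first place.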

5) Comprehensive Conclusion

| Aspect | Observation |
|---|---|
| Model Type | Both models are Q4 (4-bit GGUF) quantized and were tested on 8GB VRAM. |
| Practical Usability | Execution is possible but unstable for real use; near-zero VRAM margin → high OOM risk. |
| Recommended VRAM | 12GB or more for stable inference. |
| Performance | GPT-20B is faster on average in load time, TTFT, and total time. |
| Language Output Quality | GPT-20B is more fluent; ERNIE suffers from prompt leakage. |
| CPU Efficiency | GPT-20B uses less CPU (47.6% vs 52.1%). |
| Total Runtime | GPT ≈ 257s (4m17s), ERNIE ≈ 271s (4m31s). |
| Overall Verdict | GPT-20B is superior in speed, linguistic consistency, and stability. |
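
The "near-zero VRAM margin" observation can be made concrete with a short calculation. The sketch below assumes the card exposes exactly its nominal 8 GB and that the measured averages from section 1 are representative; actual usable memory reported by the driver may differ. `vram_headroom` is a hypothetical helper.

```python
def vram_headroom(total_gb: float, used_gb: float) -> tuple[float, float]:
    """Return (free VRAM in GB, free VRAM as a percentage of the total)."""
    free = total_gb - used_gb
    return free, 100.0 * free / total_gb


if __name__ == "__main__":
    NOMINAL_VRAM_GB = 8.0  # assumption: nominal capacity of the RTX 2060 SUPER
    for model, used in (("GPT-20B Q4", 7.788), ("ERNIE-4.5-21B Q4", 7.825)):
        free, pct = vram_headroom(NOMINAL_VRAM_GB, used)
        print(f"{model:<18} free: {free:.3f} GB ({pct:.2f}% of {NOMINAL_VRAM_GB:.0f} GB)")
```

Under these assumptions both models leave roughly 0.2 GB free, under 3% of the card, which is why a longer context or any other GPU workload could push the run out of memory.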