This article was analyzed using AI.
GPT-20B vs ERNIE-4.5-21B LLM Benchmark Log Deep Comparison
(English-based Analysis)
- Analyzed Files
  - gpt-20b_benchmark_20251104_182901.json
  - ERNIE-4.5-21B_benchmark_20251104_181832.json
- Test Date (Log Timestamp): November 4, 2025
- Test Environment: Windows 10, Intel 6c/6t CPU, 32GB RAM, NVIDIA GeForce RTX 2060 SUPER (8GB), PyTorch 2.5.1+cu121, CUDA 12.1
- Acceleration: cublas_enabled: true
- llama_cpp_info (CPU ISA): AVX: true, AVX2: true, FMA: true, F16C: true
- Pipeline Connection: Disabled (connect_pipeline: false), single LLM call, run_mode: simultaneous
- Common Benchmark Parameters: llm_max_tokens=512, repeat_count=5, repeat_min_len=15, test_runs=3, 3 prompts
- Model Families / Versions:
  - GPT-20B Q4 (local inference)
  - ERNIE-4.5-21B Q4 (local inference)
1) Performance Metrics Summary (Average)
| Metric | GPT-20B Q4 | ERNIE-21B Q4 | Δ (ERNIE − GPT) | Interpretation |
| Model Load Time (s) | 4.726 | 7.161 | +2.435 | ERNIE loads slower (larger model initialization). |
| TTFT (ms) | 732.56 | 818.13 | +85.57 | ERNIE’s first token latency is longer → slower initial response. |
| Tokens/sec | 10.893 | 10.844 | −0.049 | Nearly identical decoding speeds. |
| Inference Time (s) | 42.54 | 45.19 | +2.64 | ERNIE takes slightly longer per response overall. |
| CPU Utilization (%) | 47.6 | 52.1 | +4.5 | ERNIE uses more CPU, which suggests lower thread efficiency. |
| VRAM (GB) | 7.788 | 7.825 | +0.037 | ERNIE uses slightly more VRAM. |
| GPU Power (W) | 68.10 | 68.67 | +0.57 | Similar power consumption. |
| GPU Temp (°C) | 52.0 | 50.7 | −1.3 | ERNIE runs marginally cooler. |
| Total Test Duration (s) | 257.22 | 271.14 | +13.92 | ERNIE takes longer to complete full runs. |
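Summary: On average, GPT-20B starts responding faster (load time and TTFT) and finishes inference sooner, while decoding throughput, GPU power, and GPU temperature are nearly identical. ERNIE shows about 4.5 percentage points higher CPU utilization. The per-prompt results in section 2 show that most of the load-time and TTFT gap comes from P1; on P2 and P3 the two models are nearly tied.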
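
The fourth column of the table is an absolute difference (ERNIE minus GPT), which can hide how large a gap is relative to the baseline. A minimal Python sketch of that arithmetic, using only averages copied from the table; the function name `compare` and the choice of GPT-20B as the baseline are illustrative assumptions, not part of the benchmark logs:

```python
# Averages copied from the summary table: (GPT-20B Q4, ERNIE-21B Q4).
AVERAGES = {
    "Model Load Time (s)": (4.726, 7.161),
    "TTFT (ms)": (732.56, 818.13),
    "Tokens/sec": (10.893, 10.844),
    "Total Test Duration (s)": (257.22, 271.14),
}


def compare(gpt, ernie):
    """Return (absolute delta, relative delta in percent) of ERNIE vs GPT."""
    delta = ernie - gpt
    return round(delta, 3), round(100.0 * delta / gpt, 1)


for metric, (gpt, ernie) in AVERAGES.items():
    delta, pct = compare(gpt, ernie)
    print(f"{metric}: {delta:+} ({pct:+}%)")
```

Read this way, the load-time gap is large in relative terms (roughly half again as long), while the tokens/sec difference stays well under one percent.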

2) Prompt-Level Comparison & English Output Quality
| Prompt | Metric | GPT-20B Q4 | ERNIE-4.5-21B Q4 | Δ (ERNIE − GPT) / Notes |
| P1. “I had such an amazing day today, I feel like I'm floating!” | Load Time (s) | 4.80 | 11.98 | +7.18 |
| | TTFT (ms) | 705.6 | 965.2 | +259.6 |
| | Inference Time (s) | 30.90 | 39.12 | +8.22 |
| | CPU Usage (%) | 45 | 51 | +6 |
| | Total Duration (s) | 36 | 51 | +15 |
| | Output Quality | Fluent and expressive English; natural conversational tone. | Fragmented syntax; repetitive meta-instructions (“You must respond in the user’s language”). | GPT clearer and natural |
| P2. “The project deadline is looming, and I feel so anxious because I can't seem to focus on anything.” | Load Time (s) | 4.63 | 4.84 | +0.21 |
| | TTFT (ms) | 737.2 | 739.5 | +2.3 |
| | Inference Time (s) | 47.88 | 47.99 | +0.11 |
| | CPU Usage (%) | 48 | 52 | +4 |
| | Total Duration (s) | 53 | 56 | +3 |
| | Output Quality | Smooth structure, clear advice, logical paragraphs. | Logical but stilted; includes translated directives (“Answer in Korean”). | GPT maintains flow |
| P3. “I'm excited about the advancement of AI technology, but at the same time, I'm worried it might reduce the number of jobs.” | Load Time (s) | 4.75 | 4.67 | −0.08 |
| | TTFT (ms) | 754.8 | 749.7 | −5.1 |
| | Inference Time (s) | 48.84 | 48.45 | −0.39 |
| | CPU Usage (%) | 49 | 53 | +4 |
| | Total Duration (s) | 50 | 52 | +2 |
| | Output Quality | Balanced and nuanced English; grammatically consistent. | Mechanically literal phrasing, incomplete sentence (“The well-?”). | GPT better balance |
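
The per-prompt numbers suggest that P1 accounts for most of the average gap. One plausible reading, which the logs shown here do not confirm, is a cold-start effect on the first ERNIE run. The sketch below recomputes the mean gap with and without P1, using only values copied from the table; the helper `mean_gap` is a hypothetical name introduced for illustration:

```python
from statistics import mean

# Per-prompt values copied from the table, ordered P1, P2, P3.
LOAD_TIME_S = {
    "GPT-20B": [4.80, 4.63, 4.75],
    "ERNIE-4.5-21B": [11.98, 4.84, 4.67],
}
TTFT_MS = {
    "GPT-20B": [705.6, 737.2, 754.8],
    "ERNIE-4.5-21B": [965.2, 739.5, 749.7],
}


def mean_gap(values, skip_first=False):
    """Mean ERNIE value minus mean GPT value, optionally ignoring P1."""
    start = 1 if skip_first else 0
    ernie = mean(values["ERNIE-4.5-21B"][start:])
    gpt = mean(values["GPT-20B"][start:])
    return ernie - gpt


for name, values in (("Load time (s)", LOAD_TIME_S), ("TTFT (ms)", TTFT_MS)):
    print(
        f"{name}: all prompts {mean_gap(values):+.2f}, "
        f"P2-P3 only {mean_gap(values, skip_first=True):+.2f}"
    )
```

With P1 excluded, the load-time gap shrinks to well under 0.1 s and the TTFT gap changes sign, so a larger number of prompts and runs would be needed before treating the average difference as a stable property of the models.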
3) Language Output Quality Patterns
| Category | GPT-20B | ERNIE-21B | Analysis |
| Prompt Leakage | Rare | Frequent | ERNIE often exposes internal system prompts. |
| Language Consistency (EN) | Stable | Unstable | ERNIE outputs partial literal translations. |
| Fluency | Natural | Fragmented | ERNIE’s syntax less natural due to translation residue. |
| Semantic Coherence | High | Medium | ERNIE adds redundant policy lines; GPT stays focused. |
Interpretation: After normalization to English, GPT-20B retains grammatical fluency and tone, while ERNIE-4.5-21B shows clear prompt echo and translation residue, which points to a problem in how the output-language policy is handled.
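
Language consistency can be checked automatically rather than by eye. The following is a minimal sketch, assuming the unwanted language is Korean (as the quoted “Answer in Korean” directive suggests) and that counting Hangul characters is a good enough proxy. The function names and the zero-tolerance threshold are illustrative assumptions, not something taken from the benchmark tooling:

```python
# Unicode ranges for Hangul syllables, jamo, and compatibility jamo.
HANGUL_RANGES = (
    ("\uac00", "\ud7a3"),
    ("\u1100", "\u11ff"),
    ("\u3130", "\u318f"),
)


def is_hangul(ch):
    """True if the character falls in one of the Hangul ranges above."""
    return any(low <= ch <= high for low, high in HANGUL_RANGES)


def hangul_ratio(text):
    """Fraction of alphabetic characters in the text that are Hangul."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_hangul(ch) for ch in letters) / len(letters)


def is_english_only(text, max_ratio=0.0):
    """Assumed policy: flag any output whose Hangul share exceeds max_ratio."""
    return hangul_ratio(text) <= max_ratio


print(is_english_only("That sounds like a wonderful day!"))
print(hangul_ratio("Hi 안녕"))
```

A check like this only catches script mixing; literal-translation phrasing in otherwise English text would still need human review or a fluency metric.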
4) Operational Recommendations
- Post-Processing Pipeline – filter system instructions, remove token artifacts.
- Stop/Chat Template Review – reinforce <|end|> stop tokens to prevent leakage.
- Language Policy Enforcement – enforce consistent output language (English).
- Latency Optimization – implement KV-cache warmup to reduce TTFT.
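
The first two recommendations can be sketched as a small post-processing step. This is a hedged illustration, not the pipeline used in the benchmark: the leak patterns are built only from the two directives quoted in section 2, `<|end|>` is the only stop marker named here, and all function names are hypothetical.

```python
import re

# Assumed patterns, derived from the leaked directives quoted in section 2.
LEAK_PATTERNS = [
    re.compile(r"you must respond in the user['’]s language", re.IGNORECASE),
    re.compile(r"answer in korean", re.IGNORECASE),
]

# Only "<|end|>" is named in the text; other markers depend on the chat template.
STOP_MARKERS = ("<|end|>",)


def truncate_at_stop(text, markers=STOP_MARKERS):
    """Cut the text at the earliest stop marker, if one is present."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def drop_leaked_lines(text, patterns=LEAK_PATTERNS):
    """Remove whole lines that match a known leaked directive."""
    kept = [
        line
        for line in text.splitlines()
        if not any(p.search(line) for p in patterns)
    ]
    return "\n".join(kept).strip()


def clean_output(text):
    """Truncate at the stop marker first, then filter leaked directives."""
    return drop_leaked_lines(truncate_at_stop(text))


raw = "You must respond in the user's language.\nThat sounds great!<|end|>extra"
print(clean_output(raw))
```

Filtering after generation hides the symptom but still spends tokens on leaked text, so passing the stop marker to the inference call and fixing the chat template remain the more durable fix.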
5) Comprehensive Conclusion
| Aspect | Observation |
| Model Type | Both are Q4 (4-bit GGUF) quantized models, tested on 8GB VRAM. |
| Practical Usability | Execution possible but unstable for real use; near-zero VRAM margin → high OOM risk. |
| Recommended VRAM | 12GB or more recommended for stable inference. |
| Performance | GPT-20B faster on average in Load, TTFT, total time. |
| Language Output Quality | GPT-20B more fluent; ERNIE suffers from prompt leakage. |
| CPU Efficiency | GPT-20B better (47.6% vs 52.1%). |
| Total Runtime | GPT ≈ 257s (4m17s), ERNIE ≈ 271s (4m31s). |
| Overall Verdict | GPT-20B superior in speed, linguistic consistency, and stability. |
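
The "near-zero VRAM margin" and the runtime conversions in the table can be reproduced with a few lines of arithmetic. This sketch assumes a nominal 8.0 GB capacity for the RTX 2060 SUPER and that the logged VRAM figures use the same unit; the logs shown here do not state whether they report GB or GiB, so the headroom values are approximate:

```python
GPU_VRAM_GB = 8.0  # assumed nominal capacity of the test GPU

# Average VRAM usage and total durations copied from the tables.
VRAM_USED_GB = {"GPT-20B Q4": 7.788, "ERNIE-4.5-21B Q4": 7.825}
TOTAL_DURATION_S = {"GPT-20B Q4": 257.22, "ERNIE-4.5-21B Q4": 271.14}


def headroom(used_gb, capacity_gb=GPU_VRAM_GB):
    """Return (free GB rounded to 3 places, free share of capacity in percent)."""
    free = capacity_gb - used_gb
    return round(free, 3), 100.0 * free / capacity_gb


def format_duration(seconds):
    """Format whole seconds as minutes and seconds, e.g. 4m17s."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m{secs:02d}s"


for model, used in VRAM_USED_GB.items():
    free_gb, free_pct = headroom(used)
    runtime = format_duration(TOTAL_DURATION_S[model])
    print(f"{model}: {free_gb} GB free ({free_pct:.1f}%), runtime {runtime}")
```

Under these assumptions both models leave only about 0.2 GB free, which is consistent with the OOM risk noted in the table.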