
llama-bench results on Intel GPUs

  HojiOShi · 11 days ago · 481 views

    Hardware

    • GPU: Intel Arc A770 + A750
    • Chipset: X570

    Software

    Results

    llama-bench -m <model path>

    test \ backend (t/s)    ipex-llm    sycl      vulkan
    pp512                   458.82      192.35    64.35
    tg128                   7.09        6.55      11.60
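
To make the backend comparison easier to read, here is a minimal Python sketch that finds the fastest backend for each test and expresses the others as a fraction of it. The numbers are copied from the results above. The `results` dict and the `summarize` helper are illustrative names, not part of llama-bench.

```python
# Throughput in tokens/s, copied from the results in this post.
results = {
    "pp512": {"ipex-llm": 458.82, "sycl": 192.35, "vulkan": 64.35},
    "tg128": {"ipex-llm": 7.09, "sycl": 6.55, "vulkan": 11.60},
}


def summarize(results):
    """Return, per test, the fastest backend and each backend's speed relative to it."""
    summary = {}
    for test, by_backend in results.items():
        best = max(by_backend, key=by_backend.get)
        relative = {b: v / by_backend[best] for b, v in by_backend.items()}
        summary[test] = (best, relative)
    return summary


if __name__ == "__main__":
    for test, (best, relative) in summarize(results).items():
        print(f"{test}: fastest = {best}")
        for backend, frac in relative.items():
            print(f"  {backend:<9} {frac:6.1%} of fastest")
```

On these numbers ipex-llm leads prompt processing (pp512) while vulkan leads token generation (tg128), so neither backend wins both tests.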

    Reference results for comparison

    RTX 4090 (driver 572.60) + X670E, llama-b4820-bin-win-cuda-cu12.4-x64

    pp512: 2291.15, tg128: 40.55
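
As a rough way to size the gap, the sketch below divides the RTX 4090 numbers by the best Intel result for each test (ipex-llm for pp512, vulkan for tg128). It assumes both machines ran the same model, which the post does not state. The dict and function names are illustrative.

```python
# tokens/s from this post: best Intel result per test vs. the RTX 4090 CUDA run.
# Assumption: both runs used the same model file (not stated in the post).
intel_best = {"pp512": 458.82, "tg128": 11.60}  # ipex-llm for pp512, vulkan for tg128
rtx4090 = {"pp512": 2291.15, "tg128": 40.55}


def speedup(reference, baseline):
    """Ratio reference / baseline for every test present in baseline."""
    return {test: reference[test] / baseline[test] for test in baseline}


ratios = speedup(rtx4090, intel_best)
for test, ratio in ratios.items():
    print(f"{test}: 4090 is {ratio:.2f}x the best Intel backend")
```

With these figures the 4090 comes out at about 5.0x on pp512 and about 3.5x on tg128.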

    https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/download/v1.8/OpenCL-Benchmark-Windows.exe

    .-----------------------------------------------------------------------------.
    |----------------.------------------------------------------------------------|
    | Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
    | Device ID    1 | Intel(R) Arc(TM) A750 Graphics                             |
    |----------------'------------------------------------------------------------|
    |----------------.------------------------------------------------------------|
    | Device ID      | 0                                                          |
    | Device Name    | Intel(R) Arc(TM) A770 Graphics                             |
    | Device Vendor  | Intel(R) Corporation                                       |
    | Device Driver  | 32.0.101.6559 (Windows)                                    |
    | OpenCL Version | OpenCL C 3.0                                               |
    | Compute Units  | 512 at 2400 MHz (4096 cores, 19.661 TFLOPs/s)              |
    | Memory, Cache  | 16255 MB VRAM, 16384 KB global / 64 KB local               |
    | Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
    |----------------'------------------------------------------------------------|
    | Info: OpenCL C code successfully compiled.                                  |
    | FP64  compute                                          not supported        |
    | FP32  compute                                        12.196 TFLOPs/s (2/3 ) |
    | FP16  compute                                        18.425 TFLOPs/s ( 1x ) |
    | INT64 compute                                         1.191  TIOPs/s (1/16) |
    | INT32 compute                                         5.687  TIOPs/s (1/4 ) |
    | INT16 compute                                        30.045  TIOPs/s ( 2x ) |
    | INT8  compute                                        29.282  TIOPs/s ( 1x ) |
    | Memory Bandwidth ( coalesced read      )                        223.97 GB/s |
    | Memory Bandwidth ( coalesced      write)                        432.86 GB/s |
    | Memory Bandwidth (misaligned read      )                        400.16 GB/s |
    | Memory Bandwidth (misaligned      write)                        438.62 GB/s |
    | PCIe   Bandwidth (send                 )                          9.30 GB/s |
    | PCIe   Bandwidth (   receive           )                          9.00 GB/s |
    | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.90 GB/s |
    |-----------------------------------------------------------------------------|
    |----------------.------------------------------------------------------------|
    | Device ID      | 1                                                          |
    | Device Name    | Intel(R) Arc(TM) A750 Graphics                             |
    | Device Vendor  | Intel(R) Corporation                                       |
    | Device Driver  | 32.0.101.6559 (Windows)                                    |
    | OpenCL Version | OpenCL C 3.0                                               |
    | Compute Units  | 448 at 2400 MHz (3584 cores, 17.203 TFLOPs/s)              |
    | Memory, Cache  | 8095 MB VRAM, 16384 KB global / 64 KB local                |
    | Buffer Limits  | 3967 MB global, 4062248 KB constant                        |
    |----------------'------------------------------------------------------------|
    | Info: OpenCL C code successfully compiled.                                  |
    | FP64  compute                                          not supported        |
    | FP32  compute                                        10.693 TFLOPs/s (2/3 ) |
    | FP16  compute                                        16.177 TFLOPs/s ( 1x ) |
    | INT64 compute                                         1.090  TIOPs/s (1/16) |
    | INT32 compute                                         5.043  TIOPs/s (1/3 ) |
    | INT16 compute                                        26.553  TIOPs/s ( 2x ) |
    | INT8  compute                                        26.611  TIOPs/s ( 2x ) |
    | Memory Bandwidth ( coalesced read      )                        210.06 GB/s |
    | Memory Bandwidth ( coalesced      write)                        434.85 GB/s |
    | Memory Bandwidth (misaligned read      )                        399.86 GB/s |
    | Memory Bandwidth (misaligned      write)                        441.22 GB/s |
    | PCIe   Bandwidth (send                 )                          9.35 GB/s |
    | PCIe   Bandwidth (   receive           )                          9.04 GB/s |
    | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.94 GB/s |
    |-----------------------------------------------------------------------------|
    |-----------------------------------------------------------------------------|
    | Done. Press Enter to exit.                                                  |
    '-----------------------------------------------------------------------------'
    
    |----------------.------------------------------------------------------------|
    | Device ID      | 0                                                          |
    | Device Name    | NVIDIA GeForce RTX 4090                                    |
    | Device Vendor  | NVIDIA Corporation                                         |
    | Device Driver  | 572.60 (Windows)                                           |
    | OpenCL Version | OpenCL C 3.0                                               |
    | Compute Units  | 128 at 2535 MHz (16384 cores, 83.067 TFLOPs/s)             |
    | Memory, Cache  | 24563 MB VRAM, 3584 KB global / 48 KB local                |
    | Buffer Limits  | 6140 MB global, 64 KB constant                             |
    |----------------'------------------------------------------------------------|
    | Info: OpenCL C code successfully compiled.                                  |
    | FP64  compute                                         1.401 TFLOPs/s (1/64) |
    | FP32  compute                                        85.239 TFLOPs/s ( 1x ) |
    | FP16  compute                                        88.567 TFLOPs/s ( 1x ) |
    | INT64 compute                                         4.204  TIOPs/s (1/24) |
    | INT32 compute                                        44.164  TIOPs/s (1/2 ) |
    | INT16 compute                                        38.203  TIOPs/s (1/2 ) |
    | INT8  compute                                       133.384  TIOPs/s ( 2x ) |
    | Memory Bandwidth ( coalesced read      )                        925.72 GB/s |
    | Memory Bandwidth ( coalesced      write)                        898.38 GB/s |
    | Memory Bandwidth (misaligned read      )                        923.73 GB/s |
    | Memory Bandwidth (misaligned      write)                        212.93 GB/s |
    | PCIe   Bandwidth (send                 )                         15.66 GB/s |
    | PCIe   Bandwidth (   receive           )                         14.80 GB/s |
    | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   15.24 GB/s |
    |-----------------------------------------------------------------------------|
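
The OpenCL-Benchmark output above can be boiled down to a couple of ratios. This minimal sketch compares the measured FP32 throughput with the theoretical figure from each "Compute Units" line, and compares coalesced read bandwidth between the 4090 and the A770. All values are copied from the tables above. The `devices` dict and `fp32_efficiency` helper are illustrative names.

```python
# Per device: (theoretical FP32 TFLOPs/s from the "Compute Units" line,
#              measured FP32 TFLOPs/s, coalesced read bandwidth in GB/s)
devices = {
    "Arc A770": (19.661, 12.196, 223.97),
    "Arc A750": (17.203, 10.693, 210.06),
    "RTX 4090": (83.067, 85.239, 925.72),
}


def fp32_efficiency(theoretical, measured):
    """Measured FP32 throughput as a fraction of the theoretical figure."""
    return measured / theoretical


for name, (peak, fp32, read_bw) in devices.items():
    eff = fp32_efficiency(peak, fp32)
    print(f"{name}: FP32 at {eff:.0%} of theoretical, read {read_bw:.0f} GB/s")

bw_ratio = devices["RTX 4090"][2] / devices["Arc A770"][2]
print(f"4090 / A770 coalesced read bandwidth: {bw_ratio:.2f}x")
```

Both Arc cards reach roughly 62% of their theoretical FP32 figure in this run, while the 4090 slightly exceeds its own. Token generation is usually limited by memory bandwidth, so the read-bandwidth ratio of about 4.1x may be a more useful lens on the tg128 gap than raw FLOPs, though the post's data alone cannot confirm that.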
    
    