I attended AMD's Radeon series blogger study session, part 1: Undressing the Radeon HD7770

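I say "undressing," but I'm not actually going to take the card apart. If I did that, I'd have to apologize to AMD.
I originally meant to run some OpenCL benchmarks first, but I couldn't find a solid block of time, so I've decided to push through and just start writing.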

I ask of you: is this your performance?

AMD's APP SDK ships with a handy tool called clinfo, which plays the same role as Device Query in NVIDIA's SDK. Here is what it reports for the HD7770:

  Device Type:					 CL_DEVICE_TYPE_GPU
  Device ID:					 4098
  Board name:					 AMD Radeon HD 7700 Series
  Device Topology:				 PCI[ B#1, D#0, F#0 ]
  Max compute units:				 10
  Max work items dimensions:			 3
    Max work items[0]:				 256
    Max work items[1]:				 256
    Max work items[2]:				 256
  Max work group size:				 256
  Preferred vector width char:			 16
  Preferred vector width short:			 8
  Preferred vector width int:			 4
  Preferred vector width long:			 2
  Preferred vector width float:			 4
  Preferred vector width double:		 2
  Native vector width char:			 16
  Native vector width short:			 8
  Native vector width int:			 4
  Native vector width long:			 2
  Native vector width float:			 4
  Native vector width double:			 2
  Max clock frequency:				 1100Mhz
  Address bits:					 32
  Max memory allocation:			 536870912
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 8192
  Max image 2D height:				 8192
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 16
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 2048
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 831520768
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Platform ID:					 0x7f5c29384140
  Name:						 Capeverde
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 1.2 
  Driver version:				 CAL 1.4.1741 (VM)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 AMD-APP (923.1)
  Extensions:					 cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt

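Max work group size is 256, which looks small next to something like a GeForce GTX580, but that is to be expected. It is presumably just how the GCN architecture is designed.
Local memory is a 32KB scratchpad. The hardware is supposed to carry 64KB, but apparently part of it is used for things like synchronization, so not all of it is exposed to the programmer. Fair enough.
Now, today I want to focus on two of the entries listed under Extensions:

cl_amd_media_ops
cl_amd_popcnt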
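
Before relying on either extension, a program would normally check that the device actually reports it. The following is a minimal sketch in plain C that searches a space-separated extension string, like the one clinfo prints after "Extensions:", for a whole-token match. The helper name has_extension is hypothetical; in a real host program the string would typically be obtained from clGetDeviceInfo with CL_DEVICE_EXTENSIONS, which is left out here so the sketch runs without an OpenCL installation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if `name` appears as a whole
 * space-separated token in `ext_list`. A plain substring search is not
 * enough, because one extension name can be a prefix of another. */
bool has_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    if (len == 0) {
        return false;
    }
    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at the beginning or right after a space... */
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        /* ...and end at a space or at the end of the string. */
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token) {
            return true;
        }
        p += len; /* keep searching after this partial match */
    }
    return false;
}
```

With the extension string from the clinfo output above, has_extension(ext, "cl_amd_media_ops") and has_extension(ext, "cl_amd_popcnt") would both return true.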

cl_amd_media_ops

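The name more or less says it all.
It refers to the multimedia-oriented extensions that AMD's GPUs implement on their own.
Naturally, they only work on AMD GPUs.
Enabling this extension adds the following functions: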

      uint    amd_pack (float4 src)
      floatn  amd_unpack3 (uintn src)
      floatn  amd_unpack2 (uintn src)
      floatn  amd_unpack1 (uintn src)
      floatn  amd_unpack0 (uintn src)
      uintn   amd_bitalign (uintn src0, uintn src1, uintn src2)
      uintn   amd_bytealign (uintn src0, uintn src1, uintn src2)
      uintn   amd_lerp (uintn src0, uintn src1, uintn src2)
      uintn   amd_sad (uintn src0, uintn src1, uintn src2)
      uintn   amd_sadhi (uintn src0, uintn src1, uintn src2)
      uint    amd_sad4 (uint4 src0, uint4 src1, uint src2)

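Here the n in uintn is one of {1,4,8,16}.
So uint4, for example, is four unsigned ints packed together.
It's the usual vector type you see all the time in OpenCL.
These functions pack a vector into a scalar (image data often fits in 8 bits per element, so four unsigned elements can be packed into a single 32-bit value), handle alignment, and so on.
Personally I think the amd_sad family of built-ins is lovely: being able to get the SAD (sum of absolute differences), the similarity measure used in things like template matching, in a single call is wonderful.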
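
To make the packing and SAD ideas concrete, here is a CPU-side reference sketch in plain C. It is not AMD's implementation. It follows my reading of the extension, namely that amd_sad treats each 32-bit word as four unsigned bytes and adds the sum of their absolute differences to the third argument. The names pack_u8x4, sad_u8x4, and block_sad are hypothetical, and pack_u8x4 takes bytes directly rather than the float4 that amd_pack accepts.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Pack four 8-bit values into one 32-bit word, lowest byte first. */
uint32_t pack_u8x4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Return acc plus the sum of |a_byte - b_byte| over the four bytes
 * packed in a and b (assumed semantics, see the text above). */
uint32_t sad_u8x4(uint32_t a, uint32_t b, uint32_t acc)
{
    for (int shift = 0; shift < 32; shift += 8) {
        int x = (int)((a >> shift) & 0xFFu);
        int y = (int)((b >> shift) & 0xFFu);
        acc += (uint32_t)abs(x - y);
    }
    return acc;
}

/* SAD between an image patch and a template, both stored as packed
 * 8-bit pixels, four per word. A lower result means a better match. */
uint32_t block_sad(const uint32_t *img, const uint32_t *tmpl, size_t n_words)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n_words; ++i) {
        acc = sad_u8x4(img[i], tmpl[i], acc);
    }
    return acc;
}
```

In template matching, block_sad would be evaluated at every candidate position and the position with the smallest value kept. The appeal of the built-in is that the inner loop of sad_u8x4 collapses into one call per packed word.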

cl_amd_popcnt

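Popcnt stands for Population Count: it quickly computes the number of non-zero bits (bits set to 1) in a value.
Fast ways to compute it are worked out in books like Hacker's Delight and Binary Hacks, and it turns out to be used more often than you'd expect (if that's not rude to say). It quietly shows up in things like encoding, too.
Since the population count of the XOR of two values is their Hamming distance, it widens what you can do on the GPU.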
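
As an illustration of the kind of fast software method those books describe, here is a small plain C sketch of a branch-free population count and the Hamming distance built on it. It is a CPU reference only and makes no claim about how the GPU implements the extension. The function names popcount32 and hamming32 are hypothetical.

```c
#include <stdint.h>

/* Count the bits set to 1 using the classic parallel reduction:
 * sum adjacent bits into 2-bit fields, then 4-bit, then 8-bit fields,
 * and finally add the four byte counts with one multiplication. */
uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* total in top byte */
}

/* Hamming distance: the number of bit positions where a and b differ. */
uint32_t hamming32(uint32_t a, uint32_t b)
{
    return popcount32(a ^ b);
}
```

For example, hamming32(0xB, 0xD) compares binary 1011 with 1101, which differ in two positions.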

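Anyway, what I really wanted to do was measure the performance of these two and guess at how they are implemented internally (surely they can't be implemented in hardware), but it's late and I'm too sleepy, so that will have to wait for another day.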