Goのインタフェースがパフォーマンスに及ぼす影響

Dec 11, 2015 golang

Go Advent Calendar 2015 その3 11日目です。その3まであるなんてGo大人気ですね。

Gopherというのはいろいろな人がいてLLからGoへ、という方も多いかと思います。

LLではそもそも全てがオブジェクトだったりで話題になりませんが、よりマシンに近く~~変態的に~~速度を重視される方が多いC++では例えば仮想関数や実行時キャストのコストが議論になります。 Goにおいてこういう多態性はインタフェースで表現されます。結論からいうと、 Goのインタフェースにもそれなりのコストがあります。 なので極限までパフォーマンスを要求される場合には 例えばインターフェースを使わない というも選択肢に入ってくるのではないかと思います。

Go言語におけるインタフェースの内部表現

さて、Goはブートストラップ化(言語処理系をその言語自身で実装する)を進めており、1.5ではGoのコンパイラ、ランタイムはほぼほぼGoで実装されています。ということはインターフェースもGoで実装されているはずですね。

Goのインタフェースは２個のポインタを持った struct です。例えば、 reflect/value.go では以下のようなコードがあります。

reflect/value.go :

 1// emptyInterface is the header for an interface{} value.
 2type emptyInterface struct {
 3    typ  *rtype
 4    word unsafe.Pointer
 5}
 6
 7// nonEmptyInterface is the header for a interface value with methods.
 8type nonEmptyInterface struct {
 9    // see ../runtime/iface.go:/Itab
10    itab *struct {
11        ityp   *rtype // static interface type
12        typ    *rtype // dynamic concrete type
13        link   unsafe.Pointer
14        bad    int32
15        unused int32
16        fun    [100000]unsafe.Pointer // method table
17    }
18    word unsafe.Pointer
19}

コメントに書かれているように実態は runtime/iface.go ですがこちらのほうが簡略化されていてわかりやすいでしょう。1個目のポインタは「空のインタフェース( iterface{} )」(これを eface という)の場合、型情報、そうでない場合(これを iface という)の場合 itab と呼ばれるC++における仮想テーブルのようなものです。2個目に値が格納されます。2個目はどちらもポインタであることに注意してください。

インターフェース経由のメソッド呼び出し

ここで、インターフェースを使用するとどの程度パフォーマンスペナルティがあるのか、様々なケースで見てみます。

まずは、直接メソッドを呼び出すのと、インタフェース経由で呼び出す場合の違いを見てみましょう。

 1type MyError struct{}
 2func (e *MyError) Error() string { return "error" }
 3
 4func BenchmarkNoInterface(b *testing.B) {
 5    for i := 0; i < b.N; i++ {
 6        err := &MyError{}
 7        err.Error()
 8    }
 9}
10
11func BenchmarkInterface(b *testing.B) {
12    for i := 0; i < b.N; i++ {
13        var err error = &MyError{}
14        err.Error()     
15    }
16}

結果は以下のようになります。10倍程度遅くなります。単位が単位なので微々たる違いですが極限までパフォーマンスを追求したい場合は考慮しておいたほうがいいかもしれませんね。

1BenchmarkNoInterface    2000000000               0.73 ns/op
2BenchmarkInterface      200000000                9.56 ns/op

しかし実はこの中には *MyError を error インタフェースに変換するコストも含まれていて、そのコストを除くと5倍くらいの遅さです。次はそのインタフェースへの変換コストを詳しく見ていこうと思います。

インターフェースへの変換コスト(非ポインタ型)

Goにおけるインタフェースの変換時はC++におけるキャストよりも複雑なことが発生します。

Goでは「なんでも」を表すために interface{} を使いますね。 int などの組み込み型かつ非ポインタ型を interface{} として処理するとなにが起こるでしょうか？以下のようにフィボナッチを計算させてみます。

 1func IntFibNoInterface(n int) int {
 2    if n < 2 {
 3        return n
 4    }
 5    return IntFibNoInterface(n-1) + IntFibNoInterface(n-2)
 6}
 7
 8func IntFibInterface(o interface{}) interface{} {
 9    n := o.(int)
10    if n < 2 {
11        return o
12    }
13    return IntFibInterface(n-1).(int) + IntFibInterface(n-2).(int)
14}
15
16func BenchmarkNoInterface(b *testing.B) {
17    for i := 0; i < b.N; i++ {
18        IntFibNoInterface(20)
19    }
20}
21
22func BenchmarkInterface(b *testing.B) {
23    for i := 0; i < b.N; i++ {
24        IntFibInterface(20)
25    }
26}

結果は以下。40 - 50倍くらいおそくなりますね。これを見ると気軽にインタフェースを使いたくなくなるかもしれません。

1BenchmarkNoInterface       30000             57938 ns/op               0 B/op          0 allocs/op
2BenchmarkInterface           500           2448250 ns/op          525376 B/op      32836 allocs/op

はじめに書いたように、インタフェースは ポインタとして値を保持します 。そのためコード上アロケートしていなくても、内部的に int をアロケートしそこに値をコピーせねばなりません。

どのような処理が実行されたか、プロファイルを見てみましょう。

 11.46s of 1.46s total (  100%)
 2Showing top 20 nodes out of 35 (cum >= 0.01s)
 3      flat  flat%   sum%        cum   cum%
 4     0.37s 25.34% 25.34%      0.68s 46.58%  runtime.mallocgc
 5     0.26s 17.81% 43.15%      1.44s 98.63%  12.IntFibInterface
 6     0.20s 13.70% 56.85%      0.20s 13.70%  runtime.memmove
 7     0.16s 10.96% 67.81%      0.16s 10.96%  runtime.mSpan_Sweep.func1
 8     0.12s  8.22% 76.03%      0.96s 65.75%  runtime.convT2E
 9     0.11s  7.53% 83.56%      0.31s 21.23%  runtime.typedmemmove
10     0.10s  6.85% 90.41%      0.26s 17.81%  runtime.heapBitsSweepSpan
11     0.05s  3.42% 93.84%      0.22s 15.07%  runtime.assertE2T
12     0.02s  1.37% 95.21%      0.70s 47.95%  runtime.newobject
13     0.02s  1.37% 96.58%      0.02s  1.37%  runtime.prefetchnta
14     0.01s  0.68% 97.26%      0.01s  0.68%  runtime.(*gcControllerState).findRunnableGCWorker
15     0.01s  0.68% 97.95%      0.01s  0.68%  runtime.lock
16     0.01s  0.68% 98.63%      0.01s  0.68%  runtime.mSpanList_InsertBack
17     0.01s  0.68% 99.32%      0.01s  0.68%  runtime.xadd64
18     0.01s  0.68%   100%      0.01s  0.68%  runtime.xchg
19         0     0%   100%      1.44s 98.63%  12.BenchmarkInterface
20         0     0%   100%      0.01s  0.68%  runtime.GC
21         0     0%   100%      0.01s  0.68%  runtime.deductSweepCredit
22         0     0%   100%      0.01s  0.68%  runtime.gc
23         0     0%   100%      0.01s  0.68%  runtime.gc.func4

非ポインタ型のインターフェースへの変換コスト

では、ということで自分でアロケーションしてみましょう。 int ではなく *int を使ってみます。

 1func IntFibNoInterface(n int) int {
 2    if n < 2 {
 3        return n
 4    }
 5    return IntFibNoInterface(n-1) + IntFibNoInterface(n-2)
 6}
 7
 8func IntFibPointer(i *int) *int {
 9    n := *i
10    if n < 2 {
11        return i
12    }
13    a1 := new(int)
14    *a1 = n - 1
15    a2 := new(int)
16    *a2 = n - 2
17    v1 := IntFibPointer(a1)
18    v2 := IntFibPointer(a2)
19    ret := new(int)
20    *ret = *v1 + *v2
21    return ret
22}
23
24func IntFibInterface(o interface{}) interface{} {
25    n := *(o.(*int))
26    if n < 2 {
27        return o
28    }
29    a1 := new(int)
30    *a1 = n - 1
31    a2 := new(int)
32    *a2 = n - 2
33    v1 := IntFibInterface(a1).(*int)
34    v2 := IntFibInterface(a2).(*int)
35    ret := new(int)
36    *ret = *v1 + *v2
37    return ret
38}
39
40func BenchmarkNoInterface(b *testing.B) {
41    for i := 0; i < b.N; i++ {
42        IntFibNoInterface(20)
43    }
44}
45
46func BenchmarkPointer(b *testing.B) {
47    for i := 0; i < b.N; i++ {
48        a := new(int)
49        *a = 20
50        IntFibPointer(a)
51    }
52}
53
54func BenchmarkInterface(b *testing.B) {
55    for i := 0; i < b.N; i++ {
56        a := new(int)
57        *a = 20
58        IntFibInterface(a)
59    }
60}

結果は以下のようになります。20-30倍というところでしょうか。こちらのほうがコードだけ見ると遅そうですが、なんと int を使った場合より速いのです。そして、インタフェースへの変換コストは少ない、ということがわかります。

1BenchmarkNoInterface       30000             55558 ns/op               0 B/op          0 allocs/op
2BenchmarkPointer            1000           1297598 ns/op          525376 B/op      32836 allocs/op
3BenchmarkInterface          1000           1357172 ns/op          525376 B/op      32836 allocs/op

 11.54s of 1.54s total (  100%)
 2Showing top 20 nodes out of 37 (cum >= 0.01s)
 3      flat  flat%   sum%        cum   cum%
 4     0.75s 48.70% 48.70%      1.24s 80.52%  runtime.mallocgc
 5     0.22s 14.29% 62.99%      0.22s 14.29%  runtime.mSpan_Sweep.func1
 6     0.16s 10.39% 73.38%      0.38s 24.68%  runtime.heapBitsSweepSpan
 7     0.14s  9.09% 82.47%      1.38s 89.61%  runtime.newobject
 8     0.11s  7.14% 89.61%      1.49s 96.75%  12.IntFibInterface
 9     0.08s  5.19% 94.81%      0.08s  5.19%  runtime.prefetchnta
10     0.02s  1.30% 96.10%      0.02s  1.30%  runtime.heapBitsForObject
11     0.01s  0.65% 96.75%      0.01s  0.65%  runtime.(*bucket).mp
12     0.01s  0.65% 97.40%      0.01s  0.65%  runtime.atomicload64
13     0.01s  0.65% 98.05%      0.41s 26.62%  runtime.mCentral_CacheSpan
14     0.01s  0.65% 98.70%      0.03s  1.95%  runtime.scanblock
15     0.01s  0.65% 99.35%      0.01s  0.65%  runtime.schedule
16     0.01s  0.65%   100%      0.01s  0.65%  runtime.xchg
17         0     0%   100%      1.49s 96.75%  12.BenchmarkInterface
18         0     0%   100%      0.04s  2.60%  runtime.backgroundgc
19         0     0%   100%      0.01s  0.65%  runtime.deductSweepCredit
20         0     0%   100%      0.04s  2.60%  runtime.gc
21         0     0%   100%      0.02s  1.30%  runtime.gc.func1
22         0     0%   100%      0.01s  0.65%  runtime.gc.func3
23         0     0%   100%      0.01s  0.65%  runtime.gc.func4

非ポインタ型とポインタ型の違いは？

プロファイルをみるとわかりますが、 runtime.convT2E , runtime.assertE2T が非ポインタ型の場合重荷になっています。ここまで「ポインタ型」「非ポインタ型」という曖昧な表現をしていましたが、正確にはこれは isdirectiface という関数の結果が true かどうかです。これによりインターフェス変換の処理が変わります。

 1// Can this type be stored directly in an interface word?
 2// Yes, if the representation is a single pointer.
 3func isdirectiface(t *Type) bool {
 4    switch t.Etype {
 5    case TPTR32,
 6        TPTR64,
 7        TCHAN,
 8        TMAP,
 9        TFUNC,
10        TUNSAFEPTR:
11        return true
12
13        // Array of 1 direct iface type can be direct.
14    case TARRAY:
15        return t.Bound == 1 && isdirectiface(t.Type)
16
17        // Struct with 1 field of direct iface type can be direct.
18    case TSTRUCT:
19        return t.Type != nil && t.Type.Down == nil && isdirectiface(t.Type.Type)
20    }
21
22    return false
23}

これが false になるオブジェクトをインタフェースとして利用する際は注意です。

時にはインタフェースを使わない、という選択も

以上のことから、特に性能が必要な場合、インタフェースを使わず擬似共用体(Goに共用体はないので)のような実装をしたほうがよい場合が出てきます。フィボナッチの例を struct で書いてみましょう。

 1const (
 2    TypeInt int = iota
 3    TypeBool
 4)
 5
 6type Object struct {
 7    Type      int
 8    IntValue  int
 9    BoolValue bool
10}
11
12func IntFibNoInterface(n int) int {
13    if n < 2 {
14        return n
15    }
16    return IntFibNoInterface(n-1) + IntFibNoInterface(n-2)
17}
18
19func IntFibStruct(o Object) Object {
20    if o.IntValue < 2 {
21        return o
22    }
23    return Object{TypeInt,
24        IntFibStruct(Object{TypeInt, o.IntValue - 1, false}).IntValue +
25            IntFibStruct(Object{TypeInt, o.IntValue - 2, false}).IntValue,
26        false}
27}                                                               
28
29func BenchmarkNoInterface(b *testing.B) {                       
30    for i := 0; i < b.N; i++ {
31        IntFibNoInterface(20)
32    }
33}
34
35func BenchmarkStruct(b *testing.B) {
36    for i := 0; i < b.N; i++ {
37        IntFibStruct(Object{TypeInt, 20, false})
38    }  
39}

速度ですが、インタフェースを使うより断然早く、2-3倍程度しか遅くなりません。当たり前ですが、暗黙的なメモリアロケーションも発生しません。

1BenchmarkNoInterface       30000             58491 ns/op               0 B/op          0 allocs/op
2BenchmarkStruct            10000            139924 ns/op               0 B/op          0 allocs/op

reflect を使えば・・・というのもありますが、結局 reflect.ValueOf の引数が interface{} なので素直に使うと interface{} の場合と同様の遅さです。

 1func IntFibNoInterface(n int) int {
 2    if n < 2 {
 3        return n
 4    }
 5    return IntFibNoInterface(n-1) + IntFibNoInterface(n-2)
 6}
 7
 8func IntFibReflect(r reflect.Value) reflect.Value {
 9    n := r.Int()
10    if n < 2 {
11        return r
12    }     
13    return reflect.ValueOf(IntFibReflect(reflect.ValueOf(n-1)).Int() + IntFibReflect(reflect.ValueOf(n-2)).Int())
14}                         
15
16func BenchmarkNoInterface(b *testing.B) {
17    for i := 0; i < b.N; i++ {
18        IntFibNoInterface(20)
19    }     
20}         
21
22func BenchmarkReflect(b *testing.B) {
23    for i := 0; i < b.N; i++ {
24        IntFibReflect(reflect.ValueOf(20))
25    }                        
26}

1BenchmarkNoInterface       30000             56238 ns/op               0 B/op          0 allocs/op
2BenchmarkReflect            1000           2271026 ns/op          525376 B/op      32836 allocs/op

ちなみにGopherLuaでは

拙作のGo言語によるLua実装 GopherLua では一番おそい、非ポインタ型をインタフェースとして使う実装になっています。なのでGopherLuaはおそらく現在でもGo上で動くスクリプト言語としては最速の部類ですがさらに速くしようと思えばできたのかもなあと思っています(まぁこの例ほど言語実装は単純ではないし、擬似共用体にするにしてもサイズが大きくなるので、速くなるかどうかはやってみないとなんともいえないのですが。あとサイズがおおきくなるのでスタック型のAPIにするよりないでしょうね)。

これは使い易さや実装のし易さやメモリ効率、そして「Goらしさ」とのトレードオフだと思っています。私は、とにかく使い易いものがほしかったので一番使い易く、そして実装が楽なものを選びました。

とはいえ、特に数値型をインタフェースに変換する負荷を軽減するために、独自のアロケータを実装しています。これは一定数の float64 をまとめてアロケートすることでインタフェース変換コストを減らしています。かなり強引なのですが興味のある方は gopher-lua/alloc.go をご参照ください。

最後に

インタフェースはGoで中心的な役割を果たしていますが、時に大幅なパフォーマンス劣化を起こす可能性があります。特に、非ポインタ型を interface{} などとして引数に渡す場合が要注意です。

Go Advent Calendar 2015 その3 12日目は Ladicle さんです。

inforno

information + inferno ... inforno