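Overview

This post optimizes the hyperparameters of a deep learning (DL) model that evaluates darts skill.
For the optimization I use Optuna, developed by Preferred Networks. The company also maintains Chainer, but Optuna is not tied to Chainer and works with other frameworks too. Here I combine Optuna with Keras.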

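About hyperparameter optimization

Conceptually, the page linked below gives a very well-organized summary.
- There are Japanese explanations floating around the net too, but most of them are sloppy, so I think you are better off reading a solid English page like this one.
- If I feel like it, I'll post a full translation. towardsdatascience.com
These days there are many hyperparameter optimization libraries: Hyperopt, Optuna, SMAC, MOE, Spearmint, and so on.
The page above recommends Hyperopt, but Optuna was released on 2018/12/03. Like Hyperopt, it can use TPE (Tree-structured Parzen Estimator), and on top of that it computes more efficiently thanks to "pruning of trials based on learning curves" and "parallel distributed optimization". See the article below for details.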

research.preferred.jp
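Note that TPE needs a threshold y* that splits the observed losses into a better group and a worse group; the article says this is set the same way as in Hyperopt. The reference paper describes it as "a quantile cutoff point of previous values", so I take it that y* is simply chosen as some quantile (perhaps the median) of the objective values evaluated so far. Either way, the user does not have to set it. The reference paper is below.
- I have not read the relevant source code, so apologies if this is wrong. If you know the correct details, please let me know.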
https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
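
To make the "quantile cutoff" idea concrete, here is a minimal sketch of splitting past objective values into a good group and a bad group. The function name `split_by_quantile` and the value `gamma=0.25` are assumptions chosen for illustration; they are not taken from the Optuna or Hyperopt source code.

```python
import math


def split_by_quantile(losses, gamma=0.25):
    """Split observed losses into a 'good' group (at or below y*) and a 'bad' group.

    gamma is a hypothetical quantile used only for this illustration.
    """
    if not losses:
        raise ValueError("need at least one observation")
    ordered = sorted(losses)
    # Number of observations that go into the good group (at least one).
    n_good = max(1, math.ceil(gamma * len(ordered)))
    good, bad = ordered[:n_good], ordered[n_good:]
    # y* is the boundary: every value in the good group is <= y*.
    y_star = good[-1]
    return y_star, good, bad


losses = [0.40, 0.12, 0.33, 0.09, 0.27, 0.51, 0.18, 0.22]
y_star, good, bad = split_by_quantile(losses, gamma=0.25)
print(y_star, good, bad)
```

In TPE as described in the paper, the hyperparameters of the two groups are then modeled with two separate densities, and new candidates are favored where the "good" density is high relative to the "bad" one. With `gamma=0.5` the cutoff in this sketch becomes the median.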

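Trying hyperparameter optimization right away

I want to tune hyperparameters with Optuna! Where should I start? There is sample code on GitHub, so I use that as a reference.
Since Optuna is made by Preferred Networks, the samples are written with Chainer, the company's own library.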

github.com

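Basic structure of hyperparameter optimization code

Pretty much every hyperparameter optimizer has the same basic structure.
Hyperparameter tuning ultimately comes down to minimizing an objective function, so all you need is an optimization routine and the objective function you pass to it.
Roughly, written Optuna-style, it looks like this.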

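import optuna


def objective(trial):
    # Generate a hyperparameter candidate from the trial object
    x = trial.suggest_uniform('x', -10.0, 10.0)
    # Define and evaluate the objective
    # (a toy quadratic stands in for model training here)
    value = (x - 2.0) ** 2
    # Return the objective value
    return value


if __name__ == '__main__':
    study = optuna.create_study()  # create the optimization (study) instance
    study.optimize(objective, n_trials=100)  # n_trials = number of trials

    # Print the result of the hyperparameter optimization
    print(study.best_params, study.best_value)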
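
The same two-part structure (an objective function plus an optimization routine that calls it) can also be shown without any library. The sketch below deliberately swaps in plain random search instead of TPE, so it is not what Optuna does internally; `random_search` and the toy `objective` are hypothetical names made up for this illustration.

```python
import random


def objective(params):
    """Toy objective: a quadratic bowl whose minimum is at x=2, y=-1."""
    return (params["x"] - 2.0) ** 2 + (params["y"] + 1.0) ** 2


def random_search(objective_fn, space, n_trials, seed=0):
    """Minimize objective_fn by sampling each parameter uniformly from its range."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        # One trial: draw a candidate, evaluate it, keep it if it is the best so far.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        value = objective_fn(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


space = {"x": (-10.0, 10.0), "y": (-10.0, 10.0)}
best_params, best_value = random_search(objective, space, n_trials=200)
print(best_params, best_value)
```

A library such as Optuna replaces the uniform sampling line with a smarter sampler that uses the history of past trials, but the division of roles between the two functions stays the same.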

Sample source

chainer_simple.py is a good reference.

https://github.com/pfnet/optuna/blob/master/examples/chainer_simple.py


from __future__ import print_function

import chainer
import chainer.functions as F
import chainer.links as L
import numpy as np
import pkg_resources

if pkg_resources.parse_version(chainer.__version__) < pkg_resources.parse_version('4.0.0'):
    raise RuntimeError('Chainer>=4.0.0 is required for this example.')


N_TRAIN_EXAMPLES = 3000
N_TEST_EXAMPLES = 1000
BATCHSIZE = 128
EPOCH = 10


def create_model(trial):
    # We optimize the numbers of layers and their units.
    n_layers = trial.suggest_int('n_layers', 1, 3)

    layers = []
    for i in range(n_layers):
        n_units = int(trial.suggest_loguniform('n_units_l{}'.format(i), 4, 128))
        layers.append(L.Linear(None, n_units))
        layers.append(F.relu)
    layers.append(L.Linear(None, 10))

    return chainer.Sequential(*layers)


def create_optimizer(trial, model):
    # We optimize the choice of optimizers as well as their parameters.
    optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'MomentumSGD'])
    if optimizer_name == 'Adam':
        adam_alpha = trial.suggest_loguniform('adam_alpha', 1e-5, 1e-1)
        optimizer = chainer.optimizers.Adam(alpha=adam_alpha)
    else:
        momentum_sgd_lr = trial.suggest_loguniform('momentum_sgd_lr', 1e-5, 1e-1)
        optimizer = chainer.optimizers.MomentumSGD(lr=momentum_sgd_lr)

    weight_decay = trial.suggest_loguniform('weight_decay', 1e-10, 1e-3)
    optimizer.setup(model)
    optimizer.add_hook(chainer.optimizer.WeightDecay(weight_decay))
    return optimizer


# FYI: Objective functions can take additional arguments
# (https://optuna.readthedocs.io/en/stable/faq.html#objective-func-additional-args).
def objective(trial):
    # Model and optimizer
    model = L.Classifier(create_model(trial))
    optimizer = create_optimizer(trial, model)

    # Dataset
    rng = np.random.RandomState(0)
    train, test = chainer.datasets.get_mnist()
    train = chainer.datasets.SubDataset(
        train, 0, N_TRAIN_EXAMPLES, order=rng.permutation(len(train)))
    test = chainer.datasets.SubDataset(
        test, 0, N_TEST_EXAMPLES, order=rng.permutation(len(test)))
    train_iter = chainer.iterators.SerialIterator(train, BATCHSIZE)
    test_iter = chainer.iterators.SerialIterator(test, BATCHSIZE, repeat=False, shuffle=False)

    # Trainer
    updater = chainer.training.StandardUpdater(train_iter, optimizer)
    trainer = chainer.training.Trainer(updater, (EPOCH, 'epoch'))
    trainer.extend(chainer.training.extensions.Evaluator(test_iter, model))
    log_report_extension = chainer.training.extensions.LogReport(log_name=None)
    trainer.extend(chainer.training.extensions.PrintReport(
        ['epoch', 'main/loss', 'validation/main/loss',
         'main/accuracy', 'validation/main/accuracy']))
    trainer.extend(log_report_extension)

    # Run!
    trainer.run()

    # Set the user attributes such as loss and accuracy for train and validation sets
    log_last = log_report_extension.log[-1]
    for key, value in log_last.items():
        trial.set_user_attr(key, value)

    # Return the validation error
    val_err = 1.0 - log_report_extension.log[-1]['validation/main/accuracy']
    return val_err


if __name__ == '__main__':
    import optuna
    study = optuna.create_study()
    study.optimize(objective, n_trials=100)

    print('Number of finished trials: ', len(study.trials))

    print('Best trial:')
    trial = study.best_trial

    print('  Value: ', trial.value)

    print('  Params: ')
    for key, value in trial.params.items():
        print('    {}: {}'.format(key, value))

    print('  User attrs:')
    for key, value in trial.user_attrs.items():
        print('    {}: {}'.format(key, value))

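The objective function is defined in the function objective.
- The model is defined in create_model and the optimization method in create_optimizer.
- The search range of each hyperparameter is written with the trial.suggest_... family of functions so that the value can vary. You need to pick the suggest function according to whether the value is an integer, a discrete value, a float, or a float spanning several orders of magnitude. optuna.readthedocs.io
Finally, study.optimize runs the optimization. The best result is stored in study.best_trial.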
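
Why a "float spanning several orders of magnitude" needs its own suggest function is easier to see with numbers. The following sketch compares uniform and log-uniform sampling over the range 1e-5 to 1e-1 using only the standard library; the helper names are hypothetical and this is not Optuna's implementation.

```python
import math
import random


def sample_uniform(rng, low, high):
    """Uniform sampling: every value between low and high is equally likely."""
    return rng.uniform(low, high)


def sample_loguniform(rng, low, high):
    """Log-uniform sampling: uniform in log space, so each decade is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
n = 10000
low, high = 1e-5, 1e-1

# Fraction of samples that fall below 1e-3, i.e. in the two smallest decades.
uniform_small = sum(sample_uniform(rng, low, high) < 1e-3 for _ in range(n)) / n
loguniform_small = sum(sample_loguniform(rng, low, high) < 1e-3 for _ in range(n)) / n

# Roughly 1% for uniform versus roughly 50% for log-uniform.
print(uniform_small, loguniform_small)
```

With uniform sampling, small learning rates would almost never be tried. Also note that the sample wraps suggest_loguniform in int() to get a unit count, so the value Optuna records stays a float and the layer actually uses the truncated integer.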

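The darts evaluation model

In my case I was using Keras + TensorFlow, so I write the objective function in Keras.
The optimizer can be chosen from SGD, Adam, and Nadam.
I varied the number of layers, the number of units, the dropout rate, the decay rate, the learning rate, and so on.
The objective value is the validation loss.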

import optuna
from keras.layers import (Input, concatenate, Activation, Flatten, Dense,
                          Dropout, BatchNormalization)
from keras.models import Model

import pandas as pd
import numpy as np
import glob
import os
import csv

import keras

count = 0

def create_model(trial):

    N_LAYERS_FINGER = trial.suggest_int('n_layers_finger', 1,10)
    N_LAYERS_BODY = trial.suggest_int('n_layers_body', 1,10)
    N_LAYERS_INTEGRATED = trial.suggest_int('n_layers_integrated', 1,10)

    finger_input_shape = (60, 6)
    body_input_shape = (60, 30)

    fingers_input = Input(shape=finger_input_shape)
    body_input = Input(shape=body_input_shape)


    x1 = Flatten()(fingers_input)
    for i in range(N_LAYERS_FINGER):
        n_units = int(trial.suggest_loguniform('n_units_finger_l{}'.format(i), 10, 400))
        drop_out_rate = trial.suggest_uniform('dropout_rate_finger_l{}'.format(i), 0.0, 1.0)
        x1 = Dense(n_units, name='finger_fc{}'.format(i))(x1)
        x1 = BatchNormalization()(x1)
        x1 = Activation('relu')(x1)
        x1 = Dropout(drop_out_rate)(x1)

    x2 = Flatten()(body_input)
    for i in range(N_LAYERS_BODY):
        n_units = int(trial.suggest_loguniform('n_units_body_l{}'.format(i), 10, 400))
        drop_out_rate = trial.suggest_uniform('dropout_rate_body_l{}'.format(i), 0.0, 1.0)
        x2 = Dense(n_units, name='body_fc{}'.format(i))(x2)
        x2 = BatchNormalization()(x2)
        x2 = Activation('relu')(x2)
        x2 = Dropout(drop_out_rate)(x2)

    x = concatenate([x1, x2])
    for i in range(N_LAYERS_INTEGRATED):
        n_units = int(trial.suggest_loguniform('n_units_integrated_l{}'.format(i), 10, 400))
        drop_out_rate = trial.suggest_uniform('dropout_rate_integrated_l{}'.format(i), 0.0, 1.0)
        x = Dense(n_units, name='integrated_fc{}'.format(i))(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x = Dropout(drop_out_rate)(x)

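    # Use a dedicated name here: reusing 'n_units_integrated_l{i}' would make
    # Optuna hand back the value already suggested for the last integrated layer.
    n_units = int(trial.suggest_loguniform('n_units_final', 5, 100))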
    x = Dense(n_units, activation='relu', name='fc_final')(x)
    x = Dense(1, activation='sigmoid', name='fc_final_sigmoid')(x)

    model = Model(inputs=[fingers_input, body_input], outputs=x)

    return model


def create_optimizer(trial):
    optimizer_name = trial.suggest_categorical('optimizer', ['SGD', 'ADAM', 'NADAM'])

    if optimizer_name == 'SGD':
        sgd_lr = trial.suggest_loguniform('sgd_lr', 1e-5, 1e-2)
        opt = keras.optimizers.SGD(lr=sgd_lr, nesterov=True)
    elif optimizer_name == 'ADAM':
        adam_lr = trial.suggest_loguniform('adam_lr', 1e-6, 1e-2)
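        # Note: Keras' `decay` argument is a learning-rate decay, not L2 weight decay;
        # the parameter keeps the name 'weight_decay' to match the logged results.
        weight_decay = trial.suggest_loguniform('weight_decay', 1e-10, 1e-3)
        opt = keras.optimizers.Adam(lr=adam_lr, decay=weight_decay)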
    else:
        nadam_lr = trial.suggest_loguniform('nadam_lr', 1e-6, 1e-2)
        schedule_decay = trial.suggest_loguniform('schedule_decay', 1e-10, 1e-3)
        opt = keras.optimizers.Nadam(lr=nadam_lr, schedule_decay=schedule_decay)

    return opt


def objective(trial):
    global count
    count = count + 1
    print("progress : {}".format(count))

    # # ==========================================================================
    # #
    # # Set Model
    # #
    # # ==========================================================================

    model = create_model(trial)

    # # ==========================================================================
    # #
    # # Set Optimizer
    # #
    # # ==========================================================================

    opt = create_optimizer(trial)

    # # ==========================================================================
    # #
    # # Set Data Config
    # #
    # # ==========================================================================

    # center location vector for normalization
    center_location_norm_body = [1.8447725036231883, -0.17879784788302278, 0.3172548949275362]

    # scale value for normalization
    # assume person only moves 2.0 [m] at most
    scale_norm_body = 2.0

    scale_norm_finger_motion = 10000.0
    scale_norm_finger_pressure = 50000.0

    data_dir = "./*_log.csv"

    # ==========================================================================
    #
    # Load Data
    #
    # ==========================================================================

    X_fingers = np.empty((0, 60, 6), dtype='float32')
    X_body = np.empty((0, 60, 30), dtype='float32')

    y = np.empty((0, 1), dtype='float32')

    for files in glob.glob(data_dir):

        basename = os.path.basename(files)
        dir_name = os.path.dirname(files)

        # output_filename1 = dir_name + '/' + basename[:-4] + '_finger_normalized.csv'
        # output_filename2 = dir_name + '/' + basename[:-4] + '_body_normalized.csv'

        loss_filename = dir_name + '/' + basename[:-8] + '_loss.csv'

        # -------------------------------------------------------------------------------
        # load loss value

        with open(loss_filename, 'r') as f:
            reader = csv.reader(f)
            # header = next(reader)  # use this to skip a header row

            for row in reader:
                # arr.append(row[0])
                # print(float(row[0]))
                y = np.append(y, [[float(row[0])]], axis=0)

                # print(row[0])
                break

        # -------------------------------------------------------------------------------
        # load finger motion and pressure

        df = pd.read_csv(files)

        df_finger_motion = df[['finger0', 'finger2', 'finger4', 'finger6', 'finger8', ]] / scale_norm_finger_motion
        df_finger_pressure = df[['finger1']] / scale_norm_finger_pressure
        df_fingers = pd.concat([df_finger_motion, df_finger_pressure], axis=1)

        # df_fingers.to_csv(output_filename1)

        X_fingers = np.append(X_fingers, [df_fingers.values], axis=0)

        # -------------------------------------------------------------------------------
        # load body motion

        df_body_head = (df[['head_x', 'head_y', 'head_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_neck = (df[['neck_x', 'neck_y', 'neck_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_torso = (df[['torso_x', 'torso_y', 'torso_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_waist = (df[['waist_x', 'waist_y', 'waist_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_left_shoulder = (df[['left_shoulder_x', 'left_shoulder_y',
                                     'left_shoulder_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_left_elbow = (df[['left_elbow_x', 'left_elbow_y',
                                  'left_elbow_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_left_hand = (df[['left_hand_x', 'left_hand_y',
                                 'left_hand_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_right_shoulder = (df[['right_shoulder_x', 'right_shoulder_y',
                                      'right_shoulder_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_right_elbow = (df[['right_elbow_x', 'right_elbow_y',
                                   'right_elbow_z', ]] - center_location_norm_body) / scale_norm_body
        df_body_right_hand = (df[['right_hand_x', 'right_hand_y',
                                  'right_hand_z']] - center_location_norm_body) / scale_norm_body

        df_body_normalized = pd.concat(
            [df_body_head, df_body_neck, df_body_torso, df_body_waist, df_body_left_shoulder, df_body_left_elbow,
             df_body_left_hand, df_body_right_shoulder, df_body_right_elbow, df_body_right_hand], axis=1)

        # df_body_normalized.to_csv(output_filename2)
        X_body = np.append(X_body, [df_body_normalized.values], axis=0)

    model.compile(loss='mean_squared_error', optimizer=opt)
    hist_model = model.fit(x=[X_fingers, X_body], y=y, epochs=100, validation_split=0.2, batch_size=32)
    val = hist_model.history['val_loss'][-1]

    trial.set_user_attr('loss', hist_model.history['loss'])
    trial.set_user_attr('val_loss', hist_model.history['val_loss'])
    trial.set_user_attr('loss_final', hist_model.history['loss'][-1])
    trial.set_user_attr('val_loss_final', hist_model.history['val_loss'][-1])

    return val


if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(objective, n_trials=500)

    print('Number of finished trials : ', len(study.trials))

    print('Best trial:')
    trial = study.best_trial

    print(' Value: ', trial.value)
    print(' Params: ')
    for key, value in trial.params.items():
        print('     {} : {}'.format(key,value))

    print(' User attrs: ')
    for key, value in trial.user_attrs.items():
        print('     {} : {}'.format(key,value))

    path_w1 = 'result_params.txt'
    with open(path_w1, mode='w') as f:
        for key, value in trial.params.items():
            # Write one "key : value" pair per line
            f.write('     {} : {}\n'.format(key, value))

    path_w2 = 'result_user_attrs.txt'
    with open(path_w2, mode='w') as f:
        for key, value in trial.user_attrs.items():
            f.write('     {} : {}\n'.format(key, value))
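
The objective above returns the validation loss of the last epoch, `hist_model.history['val_loss'][-1]`. If a run overfits late, the last epoch is not necessarily the best one, so it can be worth looking at both numbers. The sketch below only illustrates that comparison on a made-up loss curve; the helper `final_and_best` is hypothetical and is not part of the script above.

```python
def final_and_best(val_losses):
    """Return (last-epoch value, minimum value, 1-based epoch of the minimum)."""
    if not val_losses:
        raise ValueError("empty history")
    best = min(val_losses)
    best_epoch = val_losses.index(best) + 1
    return val_losses[-1], best, best_epoch


# Made-up validation-loss curve that starts overfitting after epoch 4
# (illustrative numbers only, not measured results).
history = [0.090, 0.061, 0.045, 0.038, 0.041, 0.047, 0.052]
final, best, best_epoch = final_and_best(history)
print(final, best, best_epoch)
```

Which of the two values to hand back to the optimizer is a design choice: the last-epoch value rewards configurations that stay stable for the full run, while the minimum assumes you would stop training at the best epoch.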

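Results

1st try

To start, I ran 100 epochs of training per objective evaluation and set the hyperparameter search to 500 trials.
I figured about 8 hours on my PC would be enough, so I went to bed and checked in the morning.
The result: memory usage had maxed out, the computation had slowed to a crawl, and the run was nowhere near finished. Maybe 16 GB was too little, or maybe there were too many trials.
I stopped it partway; at the last trial it had reached, the status was as follows.
- "Current best value" is the objective value for the hyperparameters judged best so far.
- With my hand-designed model the value was about 0.0374, so this is only a slight improvement. I wish it had improved more...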

Current best value is 0.03668692404261002
with parameters: {
'n_units_body_l2': 289.4605170198315,
'dropout_rate_body_l0': 0.8364339589557341,
'dropout_rate_finger_l0': 0.8859765071323414,
'dropout_rate_body_l1': 0.15985137870934507,
'weight_decay': 2.488304298517825e-05,
'dropout_rate_finger_l1': 0.341986720233033,
'dropout_rate_integrated_l0': 0.5575471944470913,
'n_units_finger_l0': 70.38427639988954,
'dropout_rate_body_l3': 0.07259930169805667,
'n_units_body_l3': 53.536974517157255,
'n_layers_body': 4,
'dropout_rate_body_l2': 0.2272064995599263,
'n_layers_integrated': 1,
'n_units_body_l1': 46.97294594609835,
'n_units_integrated_l0': 34.01545234457688,
'n_units_body_l0': 399.85159118050177,
'optimizer': 'ADAM',
'adam_lr': 0.00014800027376498905,
'n_units_finger_l1': 29.76813380928397,
'n_layers_finger': 2}.

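2nd try

I wondered whether the memory blow-up came from too many objective evaluations piling up evaluation-history data, so this time I ran 1000 epochs of training per objective evaluation and cut the search to 50 trials.
Also, in the previous run almost only Nadam or Adam was being selected toward the end, so I excluded SGD from the search.
It finished in about two and a half hours with the result below.
The final value was about 0.036185, so val_loss improved by roughly 0.0012 compared with the hand-designed model (about 0.0374). On top of that, the number of parameters is drastically smaller.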

Best trial:
 Value:  0.03618520897669861
 Params: 
     n_units_body_l2 : 235.38956630848222
     dropout_rate_integrated_l1 : 0.5737162875000977
     n_layers_integrated : 2
     dropout_rate_body_l2 : 0.9891436475997171
     dropout_rate_body_l1 : 0.1555186733167152
     dropout_rate_finger_l2 : 0.3464849954214746
     n_units_finger_l2 : 49.48920178671224
     dropout_rate_body_l3 : 0.3426272345026113
     n_units_finger_l1 : 103.31434565170868
     n_units_body_l0 : 34.571351359102025
     dropout_rate_finger_l0 : 0.7604449168769942
     optimizer : NADAM
     schedule_decay : 9.964837485839651e-06
     nadam_lr : 1.3501789024608659e-06
     dropout_rate_finger_l1 : 0.3228038832180243
     n_units_body_l3 : 185.55310387909378
     n_units_body_l1 : 14.346523075661459
     n_units_finger_l0 : 25.094171936721253
     n_layers_finger : 3
     n_units_integrated_l0 : 151.11179196621413
     dropout_rate_integrated_l0 : 0.38696485442569295
     n_units_integrated_l1 : 101.67919179045141
     n_layers_body : 4
     dropout_rate_body_l0 : 0.1023105039740452
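
As a sanity check on the memory hypothesis from the start of this try, the size of the stored per-epoch histories can be estimated with simple arithmetic. This is a rough sketch under the assumption that each trial keeps two Python lists of floats ('loss' and 'val_loss') and nothing else; `history_bytes` is a hypothetical helper, and the per-value cost assumes 64-bit CPython.

```python
import sys


def history_bytes(n_trials, n_epochs, n_series=2):
    """Rough size of per-epoch histories kept as Python lists of floats.

    n_series=2 corresponds to storing both 'loss' and 'val_loss' per trial.
    Each entry costs one float object plus one list slot (a pointer).
    """
    per_value = sys.getsizeof(0.5) + 8  # float object + 8-byte list slot
    return n_trials * n_series * n_epochs * per_value


first_try = history_bytes(n_trials=500, n_epochs=100)
second_try = history_bytes(n_trials=50, n_epochs=1000)
print(first_try / 1e6, second_try / 1e6)  # sizes in MB
```

Under these assumptions both runs store the same number of values (500 x 100 = 50 x 1000), and the total is a few megabytes, far below 16 GB. So the history data alone probably does not explain the slowdown. A common source of memory growth when many Keras models are built in one process is backend graph state piling up, which `keras.backend.clear_session()` is meant to release; I have not verified that this is the cause here.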

Diagram of the final model

[Figure: darts skill evaluation DL model with hyperparameters optimized by Optuna]

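Impressions

I tried hyperparameter optimization with Optuna. I think it produced a model that performs somewhat better than my hand-designed one.
That said, I think part of the reason hyperparameter optimization did not bring a dramatic improvement lies in the training data. There is little data to begin with, and (though I built the dataset myself) its quality is not high either. I'd like to try again with better data.
The amount of data per class is heavily imbalanced, so I suspect that introducing Focal Loss, which has been getting attention lately, might help.
- It down-weights the loss for the easy-to-classify majority and up-weights the loss for the hard-to-classify minority classes. It is said to be especially effective when the amount of data varies widely across classes. It is also used in PointRCNN, which holds the SOTA for 3D Object Detection on the KITTI Benchmark.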

arxiv.org
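
To show the re-weighting described above in numbers, here is a minimal pure-Python sketch of the binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name `binary_focal_loss` is hypothetical, and gamma=2 and alpha=0.25 are the defaults reported in the Focal Loss paper, used here only as example values. It is not wired into the Keras model in this post.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class, y: true label (0 or 1).
    p_t is the probability assigned to the true class; well-classified
    examples (p_t close to 1) are down-weighted by the factor (1 - p_t)**gamma.
    """
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(p=0.95, y=1)  # positive example, confidently correct
hard = binary_focal_loss(p=0.10, y=1)  # positive example, confidently wrong
print(easy, hard)
```

With gamma=0 the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which makes it easy to compare the two losses on the same data.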

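Next steps

Try Focal Loss and introduce LSTM and GRU. I may also add a bit of CNN.
- Honestly, I have tried LSTM in a quick-and-dirty way before, but I'm still unsure how to use it effectively... I hope to write up where that thinking lands at some point. I need to read some papers.
I'd like to run Optuna on a version with proper time-series processing and post another article.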

Azarashi Tech Blog

Everyday things and tech things from daily life

Optimizing the hyperparameters of a DL model written in Keras with Optuna (1)
