Fluentd: trouble shooting on Windows Server

Envrironment

確認環境下

OS: Windows Server 2016
td-agent: 3.1.1

確認したのは Windows Server 2016でのみだが、Linuxでも一部当てはまる。

buffer enqueue中にtd-agentのconfを書き換えてrestartしてしまった

何かしら例えばchunkがenqueue出来ずbufferに溜まり続ける事象が発生し、その場でconfを変更しfluentdのthreadを再起動した場合、 conf変更前と変更後のchunkが存在しenqueueする時に不整合が発生してしまい当該事象が直らない可能性がある.

confを変更した後は一度fluentdをstopしてからchunkをflush、もしくはdeleteしてからstartする。

chunk_limit_size を超える chunk を ES へ送ろうとしてエラー

Transport::Errors::RequestEntityTooLarge error なるエラーが td-agent.log に出力されている。

[warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=yyyy-mm-dd tt:mm:ss +0900 chunk="xxx" error_class=Elasticsearch::Transport::Transport::Errors::RequestEntityTooLarge error="[413] {\"Message\":\"Request size exceeded 104857600 bytes\"}"
[error]: #0 unexpected error error_class=Errno::EMFILE error="Too many open files - C:/path/to/log/iis_httperr.pos"

当該時間帯からchunkがbufferに溜まり始める。
更にElasticsearch側がHTTPリクエスト可能なpayload dataサイズが100 MiBまでである為、それを超えるchunkもElasticsearch側が受け付けず溜まり始める。
- node sizeにより最大サイズは異なる。
chunkがbufferに溜まり続け、 [error] Too many open files のエラーが大量に出始める。

chunkを確認したところ、数百MBにもなるchunkが存在した。この状態ではfluentdを再起動しても failed to flush the buffer が出続ける。

そこで chunk_limit_size を100MiB以内に抑えfluentdを再起動。
生成されるchunkは指定したサイズ以下で生成されるようになり、当該エラーは解消された。

flush_intervalを指定してもflushされない

flush_interval を設定している場合、 chunkサイズが指定した閾値を超えたか、 flush_intervalの時間を超過した場合にflushされる。
flush_interval の設定が無い場合もchunkサイズが指定閾値を超えた場合はflushされるが、加えてtimekey(+timekey_wait)で指定した時間を超過した場合にflushされる。

はずだったが、 flush_interval の指定間隔でflushされない事象が発生。一部のstoreで当該オプションが効いていない模様。 flush_interval を指定せずtimekey方式でflushすることにした。

数週間に1回程度chunkが溜まる

chunkが滞留し始めた時のエラーは次の通り。

[warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds= chunk="xxx" error_class=Elasticsearch::Transport::Transport::Error error="Cannot get new connection from pool.

It seems to be the issue that AWS Elasticsearch is not 100% compatible with native ES when it comes to reloading connections to ES. Reloading connections happens every 10,000 requests by default. This can be useful when you've multiple hosts configure, but in the case of AWS ES there is only a single (HA) endpoint.

github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15

Elasticsearch Serviceがconnectionをreloadする際にエラーが発生しているが、Amazon Elasticsearch ServiceとnativeなESに完全な互換性が無いことに問題がある模様。
connectionのreloadは10000回リクエスト毎に発生する。
- 複数のhostで構成されている場合にnode listの更新を行うのを目的としている場合に役立つが、Amazon Elasticsearch Serviceは単一endpointである(nodeのIPが見えない)。
reload_connections を無効化する (default true) ことにより当該事象の発生を防ぐ。

ES側fieldと投入レコードのfieldが合致せずES側がRejectしてくる

してくる。

[warn]: #0 Elasticsearch rejected 1 documents due to invalid field arguments
Could not push logs to Elasticsearch, resetting connection and trying again. An established connection was aborted by the software in your host machine.
[warn]: #0 Elasticsearch rejected 1 documents due to invalid field arguments
[error]: #0 Could not push log to Elasticsearch: {"took"=>15, "errors"=>true, "items"=>[{"index"=>{"_index"=>"test_index", "_type"=>"test_type_2", "_id"=>"xxx", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Rejecting mapping update to [test_index] as the final mapping would have more than 1 type: [test_type_1, test_type_2]"}}}]}
[warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2019-08-16 14:17:15 +0900 chunk="xxx"

test_index の _type field (test_type_2) が、ES側の既存のレコード(test_type_1)と差異があり、td-agentの転送がESからRejectされている状態。

当該fieldの既存レコードのみ全削除を試みるも、同じエラーで失敗。

curl -XDELETE http://localhost/test_index/fluent/_query/ -H 'Content-type: application/json' -d '
{
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "_type": "test_type_1"
                }
            },
            "filter": {
                "range": {
                    "@timestamp": {
                        "from": "yyyy-mm-ddT00:00:00",
                        "to": "yyyy-mm-ddT00:00:00"
                    }
                }
            }
        }
    }
}'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Rejecting mapping update to [test_index] as the final mapping would have more than 1 type: [fluent, type]"}],"test_type_1":"illegal_argument_exception","reason":"Rejecting mapping update to [test_index] as the final mapping would have more than 1 type: [fluent, type]"},"status":400}

当該Indexを削除し、td-agent側をrestartすることで復帰した。

curl -XDELETE http://localhost/test_index
{"acknowledged":true}

`error="Too many open files - socket(2) for \"127.0.0.1\" port 10000"`

Log (meta) ファイルが数千単位で開かれている場合、open file descriptor の制限値を超える為、システムエラーとなる。恐らく Windows のデフォルト制限値は 512。

Window has a default limit of 512 open file descriptors (pretty close to the 2 pipes x 255 subprocesses).

change the maximum number of open files in Windows

[error]: #0 unexpected error error_class=Errno::EMFILE error="Too many open files - socket(2) for \"127.0.0.1\" port 10000"

tail -f /dev/null

If you haven't had any obstacles lately, you're not challenging. be the worst.