💰 Data Talk: Be careful catboost users

Zimnat Insurance Recommendation Challenge

Helping Zimbabwe

$5 000 USD

Completed (over 5 years ago)

Skills you will learn

Prediction

Collaborative Filtering

1780 joined

612 active

Start

Jul 01, 20

Sep 13, 20

Reveal

Sep 13, 20

Pavel

Be careful catboost users

Notebooks · 27 Aug 2020, 08:16 · edited 1 minute later · 18

CatBoost cannot reproduce the exact same result on the GPU (the metric fluctuates from run to run), and it really seems to me that the same happens on the CPU. I checked the CPU only once, on a Xeon rented from Vast.ai, because training takes a very long time on my i7.

Discussion 18 answers

dead-mazai

Yes, I agree, this is weird. At first I thought the problem was in my data split, but after checking everything I realized that something is off with CatBoost, because the loss is different every time. No idea how to fix it. Training on CPU is unrealistic, and it's not even certain that it would help: even 100 iterations take forever....

27 Aug 2020, 08:32

Upvotes 0

Pavel

Yeah, I switched to another algorithm, and everything is fine there.

replied to dead-mazai · 27 Aug 2020, 08:33

Upvotes 0

dead-mazai

Hm?

replied to Pavel · 27 Aug 2020, 08:48

Upvotes 0

It's not CatBoost, it's a GPU specific: a GPU cannot guarantee the order of floating-point operations, and because of that there is a small variability in the results.
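A minimal sketch of why the order of floating-point operations matters, in plain Python rather than on a GPU. The numbers and the helper name `sum_in_order` are illustrative assumptions, not anything from CatBoost; the point is only that adding the same values in a different order can give a different total.

```python
def sum_in_order(values, order):
    """Add values one by one in the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


# Illustrative numbers chosen so that the rounding is easy to see.
values = [1e16, 1.0, -1e16, 1.0]

# Same four numbers, two different summation orders.
forward = sum_in_order(values, [0, 1, 2, 3])    # the first 1.0 is rounded away next to 1e16
reordered = sum_in_order(values, [0, 2, 1, 3])  # the two large terms cancel first

print(forward, reordered)
```

The two totals differ even though the inputs are identical. A parallel reduction that does not fix the order in which partial sums are combined can show the same kind of run-to-run variation, usually at a much smaller scale.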

replied to dead-mazai · 27 Aug 2020, 10:39

Upvotes 0

dead-mazai

Ah, got it. Not very pleasant, I'd say((

Training on CPU takes a long time...

replied to AK · 27 Aug 2020, 10:42

Upvotes 0

dead-mazai

But thank you very much for the information)))

replied to dead-mazai · 27 Aug 2020, 10:42

Upvotes 0

Pavel

Yes, thanks, I was already told this on the forum. But it seems to me that it is unstable on CPU too: I managed to run it a couple of times on a Xeon on a remote machine and also got different scores. I didn't test further because it takes so long that you could grow old in the process..... Maybe it's something specific to multiclass and probabilities.

replied to AK · 27 Aug 2020, 10:44 (edited 1 minute later)

Upvotes 0

johnpateha

LightGBM has some instability too (on both CPU and GPU), but not much. LGB is much faster than CatBoost and usually has better accuracy.

replied to Pavel · 27 Aug 2020, 13:52

Upvotes 0

Could you recommend some great tips on preventing overfitting with LGB?

replied to johnpateha · 27 Aug 2020, 14:10

Upvotes 0

johnpateha

- Cross-validation: pay attention to every fold. Sometimes the overall CV score improves only because one fold has a big gain; don't trust that. Good generalization == gain on most CV folds.

- A high value for min data in leaf helps to kill weak splits.

- Avoid overly precise parameters: 0.8 is always better than 0.7744522.
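The tips above can be sketched as a small per-fold check. This is only an illustration under stated assumptions: the helper names, the 60% threshold, and all score values are hypothetical, and the metric is assumed to be a loss where lower is better. `min_data_in_leaf` and `feature_fraction` are LightGBM parameter names, but the values shown are placeholders, not recommendations from the thread.

```python
def folds_improved(baseline_scores, candidate_scores, lower_is_better=True):
    """Count the folds in which the candidate beats the baseline."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("both runs must use the same number of folds")
    wins = 0
    for base, cand in zip(baseline_scores, candidate_scores):
        improved = cand < base if lower_is_better else cand > base
        wins += int(improved)
    return wins


def accept_change(baseline_scores, candidate_scores, min_fraction=0.6):
    """Accept a change only if it helps on most folds, not just on average."""
    wins = folds_improved(baseline_scores, candidate_scores)
    return wins / len(baseline_scores) >= min_fraction


# Hypothetical per-fold log-loss values (lower is better).
baseline = [0.50, 0.50, 0.50, 0.50, 0.50]
one_fold_gain = [0.40, 0.51, 0.51, 0.51, 0.51]  # better mean, but only one fold improved
broad_gain = [0.49, 0.49, 0.49, 0.51, 0.49]     # small gain on four of five folds

print(accept_change(baseline, one_fold_gain))
print(accept_change(baseline, broad_gain))

# Rounded, conservative parameter values in the spirit of the tips (placeholders).
lgb_params = {"min_data_in_leaf": 100, "feature_fraction": 0.8}
```

The first candidate has the lower mean loss yet is rejected, because the gain comes from a single fold; the second is accepted because most folds improve.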

replied to AK · 27 Aug 2020, 14:52

Upvotes 0

Thank you, appreciated! For some reason no matter how I tune LGB in this comp I can't beat my CB score, but I'm going to keep trying

replied to johnpateha · 27 Aug 2020, 15:43

Upvotes 0

pchlq

I used two CV schemes, stratified by target and k-fold by ID (all rows of an ID in one fold), but the results were unstable and often overfit as well. What should I pay attention to during CV?
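For reference, the "each ID in one fold" scheme can be sketched in plain Python as below. The function name `group_fold_assignment` and the toy IDs are assumptions made for illustration; scikit-learn's `GroupKFold` offers a library version of the same idea.

```python
from collections import Counter


def group_fold_assignment(groups, n_splits):
    """Return a fold number per row so that all rows of a group share one fold."""
    rows_per_group = Counter(groups)
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    # Place the largest groups first, each into the currently smallest fold,
    # so that fold sizes stay roughly balanced.
    ordered = sorted(rows_per_group.items(), key=lambda item: (-item[1], str(item[0])))
    for group, count in ordered:
        target = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = target
        fold_sizes[target] += count
    return [fold_of_group[group] for group in groups]


# Toy example: seven rows belonging to four hypothetical customer IDs.
ids = ["a", "a", "b", "c", "c", "c", "d"]
folds = group_fold_assignment(ids, n_splits=2)

# Each ID should appear in exactly one fold.
folds_per_id = {}
for row_id, fold in zip(ids, folds):
    folds_per_id.setdefault(row_id, set()).add(fold)
print(folds, folds_per_id)
```

Keeping an ID inside a single fold prevents rows of the same customer from appearing in both the training and the validation part of a split.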

replied to johnpateha · 29 Aug 2020, 11:16

Upvotes 0

It is only a small difference and I don't think Zindi will mind; it does not change the score that much.

See https://catboost.ai/docs/features/training-on-gpu.html for more details.

27 Aug 2020, 16:25

Upvotes 0

Pavel

In my case the difference in score was about 20 places in the ranking table.... so it would be sad if Zindi checked the code and the result didn't reproduce.....

replied to FC · 27 Aug 2020, 18:21

Upvotes 0

Hmm, it is unlikely that GPU non-determinism can cause such a difference; in my case the difference is small, in the 5th or 6th digit. So in short, it shouldn't affect your rank at all.
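One way to check how large the run-to-run difference really is: compare the predictions of two runs against a tolerance. This is a minimal sketch; the helper names and the probability values are hypothetical, and in practice the two lists would come from two training runs with the same seed and data.

```python
def max_abs_diff(run_a, run_b):
    """Largest absolute element-wise difference between two prediction lists."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must have the same number of predictions")
    return max(abs(a - b) for a, b in zip(run_a, run_b))


def reproducible_within(run_a, run_b, tolerance):
    """True if no prediction differs by more than the tolerance."""
    return max_abs_diff(run_a, run_b) <= tolerance


# Hypothetical class probabilities from two runs of the same pipeline.
run_1 = [0.250000, 0.500000, 0.250000]
run_2 = [0.250002, 0.499999, 0.249999]

print(max_abs_diff(run_1, run_2))
print(reproducible_within(run_1, run_2, tolerance=1e-5))
print(reproducible_within(run_1, run_2, tolerance=1e-7))
```

If the largest difference sits around the 5th or 6th decimal place, it is in the range described above; whether that is enough to move a leaderboard position depends on how closely the scores are packed.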

replied to Pavel · 28 Aug 2020, 09:19

Upvotes 0

I agree, have you fixed your random seed?

replied to AK · 28 Aug 2020, 09:23

Upvotes 0

Pavel

Of course I know about the random seed. The difference is in the 4th or 5th digit, and in this competition that makes a difference of about 20 places in the ranking table.

replied to FC · 28 Aug 2020, 09:28

Upvotes 0

Pavel

Anyway, I chose another library that is more stable and faster on CPU for this task. Now it's OK and my result is reproducible.

replied to FC · 28 Aug 2020, 09:35

Upvotes 0
