Squabbling with Cliff again
Oct. 17th, 2016 09:29 am
Over the weekend Arno approved a little piece of my code; good timing, Cliff was off somewhere and didn't manage to criticize it. Now he writes:
fyi - allowing a setAny is one of those performance leaks I've been talking about. If you call it by the 1'sies and 10'sies it's all fine. If you call it by the billions, you'll be drowning in GC costs - you made at least the billion objects that are getting passed to it, but probably lots more to get there.
Having the programming model specifically disallow slow-at-scale coding has been a design goal of H2O - and as a consequence nearly always the code is fast by default. You have to think a little bit more up front, but it's hella faster in the long run.
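To make concrete where the garbage comes from, here is a minimal sketch of my own; the setAny signature below is my guess from the discussion, not the actual H2O API (only set(long row, double value) is quoted later):

```java
// A minimal sketch, not the actual H2O code: the setAny signature is my
// assumption; set(long, double) is the typed setter Cliff refers to.
public class BoxingSketch {
    private final double[] store;

    public BoxingSketch(int n) { store = new double[n]; }

    // Typed setter: the value travels as a primitive, zero allocation.
    public void set(long row, double value) {
        store[(int) row] = value;
    }

    // "Comfortable" setter: takes Object, so a primitive argument is
    // auto-boxed into a fresh java.lang.Double at every call site.
    public void setAny(long row, Object value) {
        if (value instanceof Double) store[(int) row] = (Double) value; // unbox
        else throw new IllegalArgumentException("unsupported type: " + value);
    }

    public static void main(String[] args) {
        BoxingSketch vec = new BoxingSketch(1_000);
        for (long i = 0; i < 1_000_000_000L; i++) {
            // Each call allocates one Double on the heap: a billion
            // short-lived objects for the GC, exactly Cliff's complaint.
            vec.setAny(i % 1_000, (double) i);
        }
    }
}
```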
Maybe the C guys would agree with him, but I think he came off rather silly there.
And here's what I wrote back:
Talking about setAny specifically, I consider it a temporary solution that will eventually be removed and replaced with something more type-safe; of course, making an ad-hoc decision for each value is not nice. Eventually I'm planning to have type-safe classes for doing things safely and efficiently.
In general, I think that not giving our users a choice between a comfortable API and an arguably more efficient one (nobody has measured) may be a disservice to our customers. The speed of development, I believe, beats the speed of execution. An hour of a developer's time is worth about 10^15 operations on an Amazon small instance, if I'm not mistaken.
I'd also love to see specific performance results from profilers. Without measurable, verifiable data, it's a kind of religious argument about whose invisible god is better.
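(A back-of-envelope check on that 10^15 figure, with rates I'm assuming myself: at roughly 10^9 simple operations per second, a small instance does about 3.6 × 10^12 operations per hour, so 10^15 operations is on the order of 300 instance-hours. A 100-node cluster at, say, 10^10 ops/s per node burns through it in about 1000 seconds, so my claim and Cliff's retort below are not actually in conflict.)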
----------an update from Cliff---------- (the guy is definitely new to trolling)
Yeah, and 10^15 ops takes about a few minutes on a cluster.
You say "arguably more efficient (nobody measured) API" - bullshit!!!!
Of course we measured, over and over again. Please go do some serious
measurement - and don't bother quoting "without measurement it's a
religious discussion" bullshit at me; I f*cking made Java performant, by
religiously measuring everything - and then talking about it every forum
I could about the importance of measurement.
Sigh - we're clearly talking right past each other.
I said "Cost model is important" and you said "Type Safety".
Type safety is wonderful, until the program runs so slow that no one
bothers.... which will pretty much be the case for all datasets > 1bn
elements (or even 1m elements if you get slow enough!).
Type-Safety is not "efficient", although you can have both type safety
and efficiency (Example: set(long row, double value) - is both type-safe
and efficient).
H2O is/was intended for data that's: too big for a single machine, and
too big for e.g. Python style approaches, because it's too slow to
manipulate that way. There has always been a core need for speed in
these problems, one which you (and to some extent Pasha) appear to be
throwing away as fast as you can. It won't be coming back easily! So
before you go about busily adding Yet Another Function Call overhead, or
Yet Another Auto-Boxing call, I suggest YOU do some performance measures
- and see what the cost to use your fancy setAny call is vs what's there
already.
Execution time for Double vs double HAS been measured, a lot. It's
between 100x and 1000x slower. Slow enough that it's a Real Bug - some
stuff simply does not complete soon enough to be bothered with.
And - to head this off at the pass - 90% of the time I see somebody
without any performance experience talk benchmarks, they invariably
start with some amazingly slow code, and add some modest extra overhead -
and report "it's about the same". I suggest you start with e.g. the
RollupStats code, which is fairly complex but also fairly fast... and
try adding a few Double vs double calls and see what happens. Be sure
to measure on big enough data that It Matters, and data big enough
relative to your heap that It Matters - i.e., you should be seeing tons
of GC if you make a few billion Doubles, and if you're not seeing it,
then you are "cheating" - using more heap & hardware than data.
Cliff
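Out of curiosity, I'd sketch the measurement he's asking for roughly like this. This is a crude, self-contained harness of my own (not the RollupStats code), and without warm-up it's only indicative:

```java
// A crude timing sketch of Double vs double; all names are mine, and a
// serious measurement should use JMH rather than this.
public class DoubleVsDouble {
    static final int N = 100_000_000;

    // Primitive loop: values stay in registers, no allocation.
    static double sumPrimitive(double[] a) {
        double s = 0;
        for (double x : a) s += x;
        return s;
    }

    // Boxed loop: every += unboxes s, adds, and boxes a fresh Double,
    // generating N short-lived objects for the GC.
    static double sumBoxed(double[] a) {
        Double s = 0.0;
        for (double x : a) s += x;
        return s;
    }

    public static void main(String[] args) {
        // Run with a modest heap (e.g. -Xmx2g) so the GC cost actually shows.
        double[] a = new double[N];
        for (int i = 0; i < N; i++) a[i] = i * 0.5;

        long t0 = System.nanoTime();
        double s1 = sumPrimitive(a);
        long t1 = System.nanoTime();
        double s2 = sumBoxed(a);
        long t2 = System.nanoTime();

        System.out.printf("primitive: %.3fs (sum=%g)%n", (t1 - t0) / 1e9, s1);
        System.out.printf("boxed:     %.3fs (sum=%g)%n", (t2 - t1) / 1e9, s2);
    }
}
```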
=========my answer==========
Thank you, it was very interesting to read. I deeply apologize; I'm new to this company. So far I could not find any performance tests in our codebase. I'll be happy to use some, to make sure my new code does not degrade anything. I have some improvements in mind, but without a profiling framework I can't test them.
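For what it's worth, the standard harness for this in the Java world is JMH; a benchmark for the two setters could look roughly like this (again my sketch, reusing the hypothetical BoxingSketch class from above, not the real H2O API):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// A minimal JMH sketch; the setter names mirror the hypothetical
// BoxingSketch above.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class SetterBench {
    BoxingSketch vec;
    long row;

    @Setup
    public void setup() {
        vec = new BoxingSketch(1_000);
    }

    @Benchmark
    public void typed() {
        vec.set(row++ % 1_000, 42.0);             // primitive path, no allocation
    }

    @Benchmark
    public void viaObject() {
        vec.setAny(row++ % 1_000, (Object) 42.0); // forces a boxed Double per call
    }
}
```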