juan_gandhi | funny reads

You're viewing

juan_gandhi's journal
Create a Dreamwidth Account Learn More

Reload page in style: site light

juan_gandhi

Quora: What is the most interesting bug you have ever solved in a computer program?

Threaded | Flat

From:

sab123.livejournal.com

Гораздо интереснее вопрос: если пытаться залогиниться через гугель, зачем кворе нужно "manage your contacts"?

From:

stas

Compiler ones (esp. ones that "technically correct because standard allows it", like the insane pointer thing) are the worst.

Though Turkey's "i" thing is pretty annoying too, had to deal with it personally, no fun.

From:

sab123.livejournal.com

The most epic ones I remember were:

1. My application crashed by jumping to a random address when I connected more than 1000 clients. Eventually I've traced it to some bits in the return address on the stack of an OS authentication function getting flipped at random. Which happened because the function did the NIS authentication and used select() to control the sockets. I've build my code with the option that supported up to 10240 bits in FD_SET but the system library was built with the default settings with only 1024 bits in FD_SET, and allocated the FD_SET variables on the stack. So when more than 1024 sockets were open, it got a socket at index over 1024 and select() corrupted the random bits on the stack. Hm, don't remember how I fixed it or if I fixed it at all. Maybe I've moved the authentication into a separate process.

2. The machines in a Veritas cluster committed suicide. The 2-node clusters are inherently unstable, trying to decide whether the other machine had failed or if the network got partitioned (in case of partitioning the master machine is supposed to continue and the slave machine is supposed to kill itself). The slave machine killed itself out of nowhere. It worked the worst when the master machine was getting stopped, so it would move the load to the slave machine, and that one would kill itself in the middle of fail-over. It took probably close to a year: they would send a dump, I would add more diagnostics and a potential fix, send them a patch, in a month or so they would have the problem reoccur and send another dump, I'll look at the diagnostics in it and do another iteration... Kind of spectacularly, I've solved the problem onsite: the customer wanted someone to be present and hold their hand while they have a service window and install the latest patch, and the support engineer had personal plans, so I went. This last patch contained the diagnostics to the point where it watched for the unusual timing between the events. So we've installed the patch, and it has turned out that once in a while the OS time just stopped for about 4 seconds. Which broke all the heartbeat algorithms. The weirdest part is that this timer bug was a known one, fixed on another edition of the OS but somehow it never occurred to them to port the fix to this one too. But while working on it , I think I've fixed a few genuine race conditions in the clustering code too :-)

From:

bytebuster463.livejournal.com

Всё не асилил, но то, что прочёл, в over 146% случаев есть следствие mutable state.
В самых клинических случаях (как top voted) — mutable code.

Q.E.D.

Edited Date: 2016-08-04 08:12 pm (UTC)