Do not ever assume

02:40 yalazi

#sysadminday


Happy SysAdmin Day!
Two days ago was #sysadminday, System Administrator Appreciation Day, and Murphy was making sneaky and hideous plans for us.

And please don't tell me Murphy is not to blame but the Cosmos, God, or some other supernatural power is. The rule is named after him, so he will have to cope with the accusations ;)

#symptoms 


It was a slow Friday, and it all started after I went to the cafeteria to have some tea with friends. At that point I was not aware of any Murphy activity. Some time later, coming down the stairs to my workstation, I saw that our DBA had a cluster of people around her.

Oh this is never good. 

Our ERP system was flaky, and the culprit seemed to be the PostgreSQL cluster. Gulçin had been working on it for a while since the issues began, but she could not pinpoint the cause.

Here I need to tell you that our home-baked ERP system is the heart of our operations. Every department needs to use some part of it to run the company. It was the start of a five-day Ramadan festive vacation, so every other part of the company was preparing for the upcoming days (every other department because, hey, we are sysadmins; if we are not working, then we have done our jobs well). That day the ERP was a bit more important than usual. The issue had to be solved at once.

At this point my colleagues decided the issue might be the master PostgreSQL server and that it was time to retire it: a promotion out of thin air for the long-waiting slave.

Everything should have been fine now, except the issue reappeared in no time. So it was not a DB problem. It seemed to be a more complex issue, and a broader, fresh look was needed. It was time for me to step in.

After talking with my colleagues about the issue and learning the symptoms, something was not adding up. The systems were steady in CPU usage, our SAN was not out of breath, our DB actually showed strangely low write activity, and the network was neither saturated nor showing any latency. There was little or no system logging on the DB. Load was high, but not because of CPU or disk IO? What the heck?
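For context, these are roughly the first-pass checks I mean; nothing exotic, just standard Linux tooling (iostat comes with the sysstat package):

    uptime                 # load averages
    vmstat 1 5             # runnable vs. blocked processes, swap, CPU split
    iostat -x 1 5          # per-device utilisation and await
    dmesg | tail -n 50     # anything the kernel is complaining about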

The only strange log entries were about some tasks being hung for more than 120 seconds and terminated, all of them related to PostgreSQL.
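Those entries read like the kernel's hung-task warnings; something along these lines digs them out (the -T flag needs a reasonably recent util-linux, otherwise drop it):

    dmesg -T | grep -iA3 'blocked for more than 120 seconds'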

#theinvestigation

The Team

Then I started to investigate.

I saw many "idle in transaction" Postgres processes. I attached strace to some of them and saw that they were waiting for data from a file descriptor. With the help of lsof and the /proc filesystem, I found out that the fds belonged to the network sockets of our clients' pgbouncer. So I opened another terminal tab, sshed into our client apps, and straced the counterpart processes. The strange thing was that they seemed to be stalled waiting for data from PostgreSQL.
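If you want to follow along, this is a sketch of that dance (the PID is a placeholder, and the pg_stat_activity columns are as named in PostgreSQL 9.2 and later):

    # find the idle-in-transaction backends and their PIDs
    psql -c "SELECT pid, state, query FROM pg_stat_activity WHERE state = 'idle in transaction';"

    # attach to one of them and see what it is blocked on
    strace -tt -p 12345

    # map the file descriptor it is reading from to the peer socket
    ls -l /proc/12345/fd
    lsof -nP -p 12345 | grep TCP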

A race condition or something? Why would each side be waiting for the other? I was clueless at this point.

Meanwhile my colleagues were trying brute-force strategies, shutting down different parts of our stack to see the effects. Was it our cronjobs? Celery tasks? Web apps? Nothing was helping.

I needed a break at this point, so I diverted my efforts to another, seemingly unrelated issue we were having: sudo was responding very late. Our systems are all connected to our Active Directory LDAP servers, and sudo is no exception. Most of the time a late sudo is caused by unresponsive DNS servers or, in sudo-ldap setups, by LDAP connection issues. But neither was guilty this time.
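The usual suspects are quick to rule out, roughly like this (the domain and LDAP host below are placeholders for our AD servers):

    time sudo -l                                   # how slow is sudo, really?
    dig +time=2 +tries=1 example.com               # is DNS answering promptly?
    time ldapsearch -x -H ldap://dc1.example.com -b '' -s base   # is LDAP reachable and fast?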

Still, I was not ready to face the gruesome monster we had, so I defaulted to my Swiss Army knife, strace, to look further into this sudo issue. And it was intriguing. I had been assuming sudo was not responding because of DNS, an LDAP connection problem, or PostgreSQL hogging the CPU. Could the two issues be related? Yes, intriguing.

Here I will give you my 2¢. Have you ever tried stracing sudo? If you are not root, sudo will bail out because it is missing its setuid privileges: strace is not setuid, and the kernel does not honour the setuid bit on a traced binary, so the execed sudo never gets elevated. The easiest solution is to run strace as root.
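Something like this, from a root shell (the output file name is just an example):

    strace -f -tt -o /tmp/sudo.trace sudo -l
    less /tmp/sudo.trace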

strace showed that sudo was waiting on /dev/log for a worrisome and noticeable amount of time, on the order of seconds.
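Incidentally, you do not need sudo to feel that kind of pain; anything that writes to the local syslog socket will do, so a quick probe like this might have shown the same stall:

    time logger "syslog latency test"   # logger talks to the local syslog socket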

Syslog is important
Then I had the moment. The moment when you solve every issue at once and feel enlightened. Everything was in place now. Everything was adding up.

It was Logstash; no, to be more specific, it was Elasticsearch. We had just begun collecting syslogs on a central server, and it was not responding in time because of failed Elasticsearch shards.
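In hindsight, one look at the cluster would have given it away; something like this (localhost:9200 is a placeholder, and the _cat API needs Elasticsearch 1.0 or later):

    curl -s 'http://localhost:9200/_cluster/health?pretty'
    curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED   # anything not STARTED is trouble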

After I fixed Elasticsearch and restarted the rsyslog instances, everything began to work flawlessly. Sure, we still had degraded database and Elasticsearch clusters, but the ERP was OK for now.
 

#theverdict

 

I had been assuming many things, and that led us to this situation. After I built central logging with Logstash, the single-node Elasticsearch cluster had already failed once, and I thought it would hold until we solved our storage problem and could introduce more nodes, and that it would not affect anything even if it failed. I was dead wrong.

  • I assumed local syslog would never block. -- wrong (see the rsyslog sketch after this list)
  • I assumed seemingly remote and unrelated systems would not affect each other. Elasticsearch was too remote, seemed innocent, and was never even brought up in the investigation.
  • I assumed I had time to add Elasticsearch nodes.
  • I assumed the late-responding sudo was an unrelated issue.
  • I assumed sudo was late because of LDAP or DNS issues, or as a byproduct of the high load on the system.
  • We assumed the culprit was the old master PostgreSQL server.
  • I assumed that slow Friday would end as slowly as it started.
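About that first assumption: the concrete fix is to make sure the forwarding action can never back up into /dev/log. This is only a sketch of the idea, not our exact config; the target host, file name, and sizes are made up, but the parameters are rsyslog's own queue settings (a disk-assisted queue that retries forever and discards instead of blocking when full):

    # /etc/rsyslog.d/30-forward-central.conf (sketch)
    # forward to the central collector without ever blocking local logging
    action(type="omfwd" target="central-log.example.com" port="514" protocol="tcp"
           queue.type="LinkedList"
           queue.filename="fwd-central"
           queue.maxDiskSpace="1g"
           queue.saveOnShutdown="on"
           queue.timeoutEnqueue="0"
           action.resumeRetryCount="-1")

With something like that in place, a dead collector costs you some disk space on the client, not a hung /dev/log.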
 
I learned for the thousandth time to "Never assume". I am a fast learner, but I am also fast at forgetting :)
