Our latest anomaly and the numbers game



In systems administration (devops is not a different discipline, so this applies whatever you call what you do), whether it is maintenance, monitoring, or troubleshooting, we are not actually playing with code or systems; we are playing with numbers. We have to think in terms of numbers, because in computing everything is about numbers.

 
Just like Mathematics.

When building a database cluster, the intention is not to install a database server. We are trying to serve some number of clients and some number of transactions, to solve some number of problems, and to store some amount of data. Everything reduces to a few stats and calculations.
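For instance, a back-of-the-envelope sizing calculation (every number below is made up for illustration) might look like this:

# Toy capacity math for a database cluster; all inputs are invented.
peak_clients = 20000         # concurrent clients at peak
tx_per_client_min = 3        # transactions each client fires per minute
node_capacity_tps = 500      # transactions/second one node can sustain

peak_tps = peak_clients * tx_per_client_min / 60.0
nodes_needed = -(-int(peak_tps) // node_capacity_tps)   # ceiling division

print(peak_tps)       # 1000.0 transactions/second at peak
print(nodes_needed)   # 2 -> at least two nodes, before any headroom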

So it is very important for us to have numbers as inputs and functions over those numbers to achieve our goals, whether we are building a LAMP stack, troubleshooting a screw-up, or running a Hadoop cluster.
And watching the numbers always pays off.

Latest anomaly


A few weeks ago, we had an anomaly. Someone with malicious intent was targeting us, but lacked the resources to cause recognizable harm; if we hadn't been watching closely, it would have been easy to miss.

We forward our uwsgi clusters' performance data to our central graphite/carbon server. uwsgi speaks the carbon protocol through its carbon plugin, so we just point it at our carbon collector's address and let things roll. For more info: http://uwsgi-docs.readthedocs.org/en/latest/Carbon.html

carbon = ipaddress:portno

As I always do, I was looking at our application servers' average response times. On a good day it averages 150-200 ms, and the only peaks are deployments or anomalies.
This time I found we were not at a peak but at a lower average. Our uwsgi was doing considerably, unbelievably better than normal.

Of course this was curious. I checked with my colleagues about intended changes and looked into the data of integrated systems. Nothing was different, except that we "seemed to be" doing considerably better.




Sometimes being better is not a sign of good things, and this was a textbook anomaly.

Maybe we lost some traffic? Put up some bullshit campaign and got hit with some kind of boycott? Did it start to rain outside?

Let's check our request numbers.

It had actually gone up threefold?



We were being targeted with a high number of small requests with fast responses, so our average response time appeared to be getting lower. Three times the requests and half the response time! That's how you can manipulate statistics ;)
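Here is a toy version of that arithmetic in python (all numbers invented, but in the ballpark of our graphs):

# Mixing a flood of fast, cheap requests into normal traffic
# drags the average down. Every number here is made up.
normal_reqs, normal_ms = 1000, 175    # usual traffic, ~175 ms average
attack_reqs, attack_ms = 2000, 44     # cheap, fast HTTP 400s

total_reqs = normal_reqs + attack_reqs
avg_ms = (normal_reqs * normal_ms + attack_reqs * attack_ms) / total_reqs

print(total_reqs / normal_reqs)   # 3.0  -> three times the requests
print(round(avg_ms, 1))           # 87.7 -> roughly half of 175 ms

The mean tells you nothing about the mix; a percentile breakdown would have made the two populations visible much sooner.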



If we were getting strange traffic, it must have been logged by our web servers. And here you can see the histogram of the log lines of this malicious traffic.

Do you see the correlation with the request count graph?





And here are some log lines from the attack. It's my duty to protect the innocent (this time, the guilty), so I had to do some paint job on the lines.

The attacker was targeting our user system and trying to register accounts. But with little luck: she was missing the CSRF tokens and could not build correct requests, so she was getting HTTP 400 Bad Request over and over. So this time I'm not really sure about the real intention. Some kind of DoS? This was hardly impacting our users; we could easily have missed it.

Some kind of user data exposure? But we did not expose any data with an HTTP 400.

Registering fake customers? Hey gimme a break..

This attack was so badly engineered that the attacking IP (yes, the IP, not IPs) was exposing some SQL dumps and PHP code about the targets. It was aimed at us and at one of our competitors, was run from a French VPS, and lacked the horsepower to impact or gain anything.

Still, we are baffled about the intentions. Any thoughts?


By the way, the screenshots are from graphite and logstash/kibana. Use them; they are great tools.

http://www.logstash.net/
http://graphite.wikidot.com/

Do not ever assume

#sysadminday


Happy SysAdmin Day!
Two days ago was #sysadminday, System Administrator Appreciation Day, and Murphy was making sneaky and hideous plans for us.

And please don't tell me Murphy is not to be blamed but rather the Cosmos, or God, or some other supernatural power. The rule is named after him, and he will have to cope with the accusations ;)

#symptoms 


It was a slow Friday, and it all started after I went to the cafeteria to have some tea with friends. I was not aware of any Murphy activity. Some time later, coming down the stairs to my workstation, I saw that our DBA had a cluster of people around her.

Oh this is never good. 

Our ERP system was flaky, and the culprit seemed to be the PostgreSQL cluster. Gulçin was working on it, and it had been a while since the issues had begun, but she could not pinpoint the cause.

Here I need to tell you that our home-baked ERP system is the heart of our operations; every department needs some part of it to run the company. It was the start of the five-day Ramadan festive vacation, so every other part of the company was preparing for the upcoming days - every other department, because hey, we are sysadmins: if we are not working, then we have done our jobs well. That day the ERP was a bit more important than usual. The issue had to be solved at once.

At this point my colleagues decided the issue might be the master postgres server and that it was time to retire it. A promotion out of thin air for the long-waiting slave.

Everything would have been OK, except the issue reappeared in no time. It was not a DB problem. It seemed to be something more complex, and a broader, fresh look was needed. So it was time for me to step in.

After talking with my colleagues about the issue and learning the symptoms, something was not adding up. The systems were steady in CPU usage, our SAN was not out of breath, our DB actually showed strangely low write activity, and the network was neither saturated nor showing any latency. There were few or no system logs on the DB. Load was high, but not because of CPU or disk IO? What the heck?

The only strange log entries were kernel warnings that some tasks had been blocked for more than 120 seconds, all of them related to postgresql.

#theinvestigation

The Team

Then I started to investigate.

I saw many "idle in transaction" postgres processes. I attached strace to some of them and saw they were waiting for data on a file descriptor. With the help of lsof and the /proc filesystem, I found that those fds belonged to network sockets from our clients' pgbouncers. So I opened another terminal tab, sshed into our client apps, and straced the counterpart processes there. The strange thing was that they, in turn, seemed to be stalled waiting for data from PostgreSQL.
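If you want to do the /proc part programmatically, a minimal python sketch (the pid is hypothetical, and you need the right permissions, usually root) looks like this:

import os

pid = 12345                      # a stuck postgres backend, for example
fd_dir = "/proc/%d/fd" % pid
for fd in sorted(os.listdir(fd_dir), key=int):
    target = os.readlink(os.path.join(fd_dir, fd))
    print(fd, "->", target)      # network sockets show up as "socket:[inode]"

The socket inode printed there can then be matched against lsof -i or /proc/net/tcp to find the peer on the other end.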

A race condition or something? Why would each side be waiting on the other? I was clueless at this point.

Meanwhile my colleagues were trying brute-force strategies, shutting down different parts of our stack to see the effect. Was it our cronjobs? Celery tasks? Web apps? Nothing helped.

I needed a break at this point, so I diverted my efforts to another, seemingly unrelated issue we were having: sudo was responding very late. Our systems are all connected to our Active Directory LDAP servers, and sudo is no exception. Most of the time a late sudo is caused by unresponsive DNS servers or, in sudo-ldap setups, by LDAP connection issues. But neither was guilty this time.

Still, I was not ready to face the gruesome monster we had, so I defaulted to my swiss army knife, strace, to look deeper into the sudo issue. It was intriguing. I was assuming it was DNS, or the LDAP connection, or PostgreSQL hogging the CPU so that sudo could not respond. Could the two issues be related? Yes, intriguing.

Here I will give you my 2¢. Have you ever tried stracing sudo? If you are not root, sudo will bail out: the kernel does not honor the setuid bit on a binary executed under strace, so sudo runs without elevated privileges. The easiest solution is to run strace as root.

strace showed that sudo was waiting on /dev/log for a worrisome, noticeable time, on the order of seconds.

Syslog is important
Then I had the moment. The moment where you solve every issue at once and feel enlightened. Everything fell into place. Everything was adding up.

It was logstash; no, to be more specific, it was Elasticsearch. We had just begun collecting syslogs on a central server, and syslog was not responding in time because of failed Elasticsearch shards. Our rsyslog instances were forwarding to the central logstash/Elasticsearch; when Elasticsearch stopped accepting data, the forwarding queues filled up, writes to /dev/log started to block, and everything that logged through syslog - sudo and postgres alike - stalled with them.

After I fixed Elasticsearch and restarted the rsyslog instances, everything began to work flawlessly. Sure, we had a degraded database cluster and a degraded Elasticsearch, but the ERP was OK for now.
 

#theverdict

 

I had assumed many things, and that led us to this situation. After I built central logging with logstash, the single-node Elasticsearch cluster failed once, and I thought it would hold until we solved our storage problem and could introduce more nodes, and that it would not affect anything even if it failed. I was dead wrong.

  • I assumed local syslog would never block. -- wrong
  • I assumed seemingly remote and unrelated systems would not affect each other. Elasticsearch was too remote, seemed innocent, and never came up in the investigation.
  • I assumed I had time to add Elasticsearch nodes.
  • I assumed the late-responding sudo was an unrelated issue.
  • I assumed sudo was late because of LDAP or DNS issues, or as a byproduct of the high load on the system.
  • We assumed the culprit was the old master PostgreSQL server.
  • I assumed that slow Friday would end as slowly as it started.
 
I learned for the thousandth time to "never assume". I'm a fast learner, but I'm also fast at forgetting :)

PHP, LDAP, and custom CA-signed certificates


Have you ever had a problem connecting to a secure LDAP server using PHP's ldap functions?

It is probably a certificate trust issue, and nine times out of ten the root cause is that the custom CA is not recognized by the PHP interpreter. And like any self-respecting devops engineer, you would solve it by introducing the CA's certificate to the system.

But PHP is a damned monster: the normal way to do this, putting the certificate into the cacerts dir, never works. Even adding a
TLS_CACERT [full.path.to.ca.cert]
line to ldap.conf won't help you. Of course some lucky ones with special deals with Murphy may not have this issue, but believe me, php.net's documentation comments and the internet are full of unresolved CA issues. And sometimes the solution presented is "recompile PHP against openssl/gnutls/libressl with this custom conf".

Unless the gates of hell have opened and the earth is crawling with demons, that solution cannot be accepted. There must be another one.

And of course there is a simple one. PHP's ldap extension is built on the OpenLDAP client library, and that library respects some environment variables even when you cannot make it respect an ldap.conf file. So adding the following lines to your PHP app, with the paths adjusted to your setup, will fix the issue:

<?php
   putenv('LDAPTLS_CACERTDIR=/etc/ssl/certs/');
   putenv('LDAPTLS_CACERT=/etc/ssl/certs/myCustomCA.crt');
?>

But sometimes even adding the CA cert is not enough. You may have expired certificates or something similar, so you may need to scrap the whole certificate validation process. Adding the

<?php
   putenv('LDAPTLS_REQCERT=never');
?>
line at the beginning of your script will take care of it. And I strongly advise you against doing this.

But these are simple workarounds that fit small scripts or isolated incidents. What if you have an established application you cannot modify? One solution is to set these environment variables globally for the PHP server; another is to use PHP's mostly unknown and discarded gem, auto_prepend_file.

Just create a file containing only the environment-variable-setting lines and use php.ini to prepend it to all PHP files.
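For example (the file name and path here are mine, pick your own), put just the putenv() lines from above into a file of their own, say /etc/php.d/ldap-ca-env.php, and add one line to php.ini:

auto_prepend_file = /etc/php.d/ldap-ca-env.php

Since the prepended file runs before every script, the environment is in place before the first ldap_connect() call happens.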





Devil is in the details.

This is why Google kicks ass with so many of its tools.

I didn't notice this one until it was too late. Simple, informative, useful, and detail-oriented. The world is too small.

How I did a streaming test

Hi,

I nearly promised to write about "the syn that's not replied" and "the port that can not be opened", but I have something better to write about now. Well, I've already written most of it anyway ;)

This is about a streaming load test I performed for the GSM operator I'm consulting for in Thailand. Have you ever had to load test a streaming server? Well, I have in the last few weeks. It was a success, apart from the bandwidth bottlenecks at my VPS provider, [1] Vpshispeed of Thailand. I will possibly retry these tests later with more Thailand bandwidth, to reach my test targets. Any tips on huge-bandwidth VPS or cloud providers in Thailand?

I'm bound to thank Diederik dee Gee for all the help he provided. Thank you, Diederik; it's been a real pleasure to work and talk with you! I must also say that the bottleneck was no fault of Diederik's: I had simply consumed all the bandwidth available to his service, and he even let me use more than he had promised. His service exceeds his promises ;) I strongly recommend him to anyone who needs a VPS in Thailand.

Returning to the foundation of my load test: as you already know, load testing is a hard process, and seemingly there are virtually no tools for load testing a streaming server. So I had to improvise.

My concerns for a streaming load test were as follows:
  • No tools -
    Build in-house scripts or tools.
  • Bandwidth -
    Streaming is bandwidth-intensive, so you need huge amounts of it.
  • What to monitor and how to analyse the data? -
    For a web server load test we have many tools and patterns to use, like ab, jmeter, and many more, and the structure of the test is easy: "fire requests and receive responses." But how do you load test a streaming server? It's not HTTP.
So I prepared a foundation on which to design my tests. My foundation is as follows:
  1. Receive assumption: users have enough bandwidth to saturate their links and keep the server as busy as possible.
  2. Receive assumption: users can consume all the data they receive, or have enough buffer to keep receiving constantly.
  3. Receive assumption: users have enough time and patience to consume streams from beginning to end, without exception.
  4. Consume assumption: users have a limited buffer, but data that overflows it is carried over and counted with the next burst.
  5. Consume assumption: users consume data at a constant bitrate relative to the bitrate of the stream.
  6. Server assumption: servers are more bandwidth-constrained than memory- or CPU-constrained.
  7. Server assumption: different streaming protocols have only a minimal effect on CPU and memory usage on the server side.

After laying this foundation I started searching for tools to use. Because of assumptions 6 and 7, I decided to focus on a single protocol, and with the openRTSP tool [2] (from the live555 Streaming Media Library [3]) available, RTSP seemed a suitable choice.

But as-is, openRTSP would not give me any data to analyse. So I quickly hacked one of its samples into reporting the stream bursts while throwing the received data into the void. As this only yields raw data about network usage, I built a simple python tool to analyse it in a simplistic but informative way. The tool produces a [4] gnuplot data file of the data bursts, which can then be used to graph the bursts visually.
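The analysis idea is roughly this (a stripped-down sketch, not the actual tool from the repository; the input format of one "timestamp bytes" pair per line is an assumption here):

import sys
from collections import defaultdict

# Sum the burst sizes into one bucket per second of the stream.
bytes_per_second = defaultdict(int)
with open(sys.argv[1]) as raw:                # the raw burst log
    for line in raw:
        timestamp, nbytes = line.split()
        bytes_per_second[int(float(timestamp))] += int(nbytes)

# Write a two-column gnuplot data file: second, bytes received.
with open("bursts.dat", "w") as out:
    for second in sorted(bytes_per_second):
        out.write("%d %d\n" % (second, bytes_per_second[second]))

Something like plot "bursts.dat" using 1:2 with lines in gnuplot then turns the file into a bandwidth graph.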

I've put the foundation on github; it can be pulled from [5] https://github.com/fsniper/streaming-loadtest-foundation . There is more info in the README file in the repository.

All pull requests are welcome ;) Please don't be harsh, and try to be constructive with your comments. These are hacks as they stand right now.

[1] http://www.vpshispeed.com/en/
[2] http://www.live555.com/openRTSP/
[3] http://www.live555.com/liveMedia/
[4] http://www.gnuplot.info
[5] https://github.com/fsniper/streaming-loadtest-foundation

Hello for my torments! And what's happened?

It has been too long since I last wrote on the internet. Too much has come and gone. I've done many things and let many pass. I don't even know why I started writing again. Especially, why in English? I don't believe I have a clear answer to that question.

Maybe I need to practice writing in English? Maybe I need a bigger audience? It doesn't matter. I will be writing in English, and I will make too many mistakes, but I won't care. I'll just write if I can. I don't promise anything; promises are a burden if they can't be kept. I even have one on my shoulders that I could not manage to fulfil: once I promised some of my college mates to cook for them. I still feel the urge. It's been so long, and I've even lost contact with many of them. Never mind, that's a lost cause.

Since the last post I tried to write once and could not finish. I'm more inclined to finish this time. I know that if I pause or postpone, it means another half-done effort lost.

I'm in a hotel bed in a foreign country now. I'm in Bangkok, and I've started to feel the urge to write again. I've been abroad since my last post: North Cyprus, Bahrain, the Greek islands, and now Thailand. Some of it for work, some for leisure, but always with a leisure sauce. In Bahrain I worked for Batelco, Ericsson, and Parkyeri as an outside consultant. In Thailand I'm consulting at Parkyeri, this time for TrueLife. The Greek islands and Cyprus were all leisure.

In Bahrain I travelled the country in a rental car and got stuck in the desert, to be rescued by U.S. soldiers (thank you, Uncle Sam!). In North Cyprus I gambled and won, only to lose it all on adultery ;) In the Greek islands I was on a cruise ship with my ex. And now I'm in Thailand. Not much leisure yet, but I'm sure to get some in the coming days.

What will I be rambling about? As the name suggests, I'm a Linux system administrator, and I will be rambling about my experiences - or my torments - in all things system administration and anything complementary to it. Like the torture of my two beloved friends Barış and Barış, and their cycling tour designed to torture me! For the tour route, look at http://dunyadangecerken.com/?p=38756

Now I have to get some sleep. So see you! Maybe next I will write about my last struggle with the "unreplied syn" and its friend, "the port that can not be listened on".