Keeping track of 16 million metrics – Behind the scenes with the Server Monitoring Team

With multiple services being developed at any one time, you may wonder how we keep track of all the resources required. This time we interviewed Paul Traylor (who is part of the Server Monitoring Team) to find out more.

It was now or never!

Tell me about your background

I’m from the USA and I grew up in a pretty rural area of North Carolina near a town called Fuquay-Varina. After graduating from university where I studied computer science, I joined a start-up in San Francisco.

The business I first joined was a social platform for gamers. I was a web developer with a focus on the back-end. While building functionality for the users, I was also developing internal tools to support the business.

From there I joined another start-up, this time as an operations engineer with a specific focus on automation and tools. For example, I combined multiple existing parts of the infrastructure into a single virtual machine for development. I also spent time making and improving tools that supported error handling and monitoring. I was mainly using technologies like Python, SaltStack, and JIRA, among others.

How did you end up in Fukuoka?

While in San Francisco, I started learning Japanese as a hobby, and after a couple of years of study, I thought I’d like to try living in Japan. After a couple of years of working, I felt that it was a case of “now or never,” so I found a school in Fukuoka and signed up for a year-long course.

How did you come to join LINE Fukuoka?

LINE is a large-scale communication tool with a huge user base and multiple services, which obviously drew me to the business. However, LINE Fukuoka in particular stood out to me as an interesting opportunity for a number of reasons.

First, the role itself really appealed to my skill set. The position was with the Developer Support Team, and I would be working on the development environment and the surrounding infrastructure. It was the perfect opportunity to utilize my existing skill set in a new, forward-thinking environment.

Another reason I was drawn to LINE Fukuoka was the diversity of the business. Although the traditional image of Japanese businesses is that of uniformity, LINE Fukuoka’s development team is a roughly 50/50 split between Japanese and non-Japanese members. It’s really refreshing to be in such an international environment and to be able to share knowledge with others from around the world.

16 million metrics across 17,000 targets

You started with the Developer Support Team; what exactly were you doing?

As a Developer Support Team member, I was responsible for maintaining and improving the development environment. Provisioning and preparing servers, troubleshooting issues, and creating support tools – these were all things that developers needed to do their jobs without difficulty.

For a role like that, you need to understand not only the developer side but also the infrastructure behind the scenes so you can diagnose and solve a whole host of issues. You also need to understand, at a general level, how all the different services work. If a developer comes to you saying they want to run a specific service, you need to be able to visualize exactly what they’ll need in terms of databases, server provisioning, etc.

It was a really interesting position.

Now you’re with the Server Monitoring Team?

Around the same time that I started doing developer support, we noticed that our previous monitoring system had several shortcomings, and there were many areas that we wanted to improve. I was assigned to improve our monitoring system to make it even more useful for our developers.

The monitoring system is one of the most important parts of the development environment. Without it, you’re not really sure if your services are actually running and available. Also, as LINE introduces more and more services (currently 100 or so), the need for constant and accurate monitoring across the whole business increases. We wanted to make a system that was easy to use and universal so developers could simply select their own metrics and thresholds.

The system is based on Prometheus (an open-source monitoring tool written in Go), which scales very well for the current trend of running many microservices and monitoring several data points for each (memory, network, CPU, etc.). For visualization, we use Grafana to build dashboards that are easy to understand and visually appealing.
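For readers unfamiliar with Prometheus, each monitored target exposes its data points as plain-text samples that the server collects. The following is a minimal sketch of what one such sample looks like, not LINE's actual setup: the metric name, the label, and the helper functions are hypothetical and exist only for illustration.

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Mapping[str, str], value: float) -> str:
    """Render one sample as a line of Prometheus-style text exposition."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable from one call to the next.
    pairs = [f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())]
    return f"{name}{{{','.join(pairs)}}} {value}"


# Hypothetical metric and label, chosen only for this example.
line = render_sample("memory_used_ratio", {"host": "web-01"}, 0.75)
print(line)
```

Grafana would then chart many such samples over time for each target.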

Currently, the system is monitoring around 16 million metrics across 17,000 targets, generating terabytes of data.
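To put those figures in perspective, a quick back-of-the-envelope calculation gives the average number of metrics per target. Both inputs are the rounded numbers quoted above ("around 16 million" and "17,000"), so the result is only an estimate.

```python
# Rounded figures quoted in the interview; both are approximate.
TOTAL_METRICS = 16_000_000
TOTAL_TARGETS = 17_000

# Average number of metrics collected from each monitored target.
average_per_target = TOTAL_METRICS / TOTAL_TARGETS
print(f"About {average_per_target:.0f} metrics per target on average")
```

In other words, each target contributes somewhat under a thousand metrics on average.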

What is most rewarding about the work you do?

Currently, one of the biggest issues is how to store past data for future use. Right now I have about 50 TB of data that I’m storing, and that’s only a couple of months’ worth! This is an issue that affects every business, and there’s no easy solution. I make sure to keep up to date on the latest research and technologies to ensure that we’re at the leading edge when it comes to tackling these issues.

I also love taking complex technologies and creating simple tools to benefit the end user. As mentioned, we based our monitoring system on Prometheus, which can be a little tricky to use by itself, so I also used Django (Python) to build an easy-to-use dashboard so that developers can configure things themselves. The system enables developers to designate what data points they want to measure, establish threshold values, and then decide how they’d like to be notified if the threshold values are crossed (a message via Slack, for example).
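The workflow described above (pick a data point, set a threshold, choose a notification channel) can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual dashboard code: the AlertRule class, the evaluate function, and the metric name are all hypothetical, and the sketch builds the message text rather than sending anything to Slack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """A developer-defined rule: which metric, what limit, where to notify."""
    metric: str
    threshold: float
    channel: str  # for example "slack"; hypothetical channel name


def evaluate(rule: AlertRule, value: float) -> Optional[str]:
    """Return a notification message if the value exceeds the threshold."""
    if value > rule.threshold:
        return f"[{rule.channel}] {rule.metric} is {value}, above threshold {rule.threshold}"
    return None  # Threshold not crossed, so nothing to send.


# Hypothetical rule: notify via Slack when CPU usage goes above 90 percent.
rule = AlertRule(metric="cpu_usage_percent", threshold=90.0, channel="slack")
print(evaluate(rule, 95.5))
print(evaluate(rule, 42.0))
```

Keeping the rule as plain data is what makes it easy to expose through a web form: the developer fills in three fields, and the evaluation logic stays the same for every service.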

Finally, one of the things I love most about my job is the idea of “kaizen,” or continuous improvement. The needs of developers are constantly changing and evolving, and so are the technologies that are available. I constantly assess and evaluate current practices with the mindset of “How can we make this better?” Although it’s a never-ending task, it’s one of the elements of the job I enjoy the most.

How have you developed in the role?

Because I have a large degree of freedom when it comes to making decisions, I really feel I’ve improved my ownership and decision-making skills. I had the freedom to select which tools would get the best results for the project based on the research I had done.

I’ve also really developed my communication ability. Working directly with developers can be rewarding yet tricky: they may be able to identify the issues they’re having, but what they see as the solution might not actually be the best way to tackle the issue. Being able to communicate clearly and get down to the heart of the issue is an important skill that I’ve improved.

The support of a large company, the mindset of a start-up

What is it like to work at LINE Fukuoka as an engineer?

LINE Fukuoka is a large company, and some aspects of that are reflected in my everyday work (how machines are provisioned, security protocols, etc.). However, at a team level, LINE Fukuoka feels closer to a start-up. Teams are free to decide on their own stack; we have teams mostly using Java and others using Perl, so there’s a lot of flexibility. We have the benefit of being able to choose the tech that will bring about the best outcome rather than being told what to use.

One of the aspects I love the most about LINE Fukuoka is that every week during our regular meeting, there’s an opportunity for developers to give a presentation on any topic they’re interested in. This is not only a great opportunity to improve one’s public speaking skills, but it offers insight into areas of tech I might not otherwise have known much about. I enjoy hearing about the latest Android or iOS trends and catching up on the latest JavaScript frameworks.

Finally, one of the nice aspects of the office here is how international it is. We have Japanese speakers learning English and English speakers learning Japanese, so there’s a shared understanding of how difficult it can be to communicate in a language that isn’t your mother tongue. This means we communicate together from a position of mutual respect and cooperation.

What’s next for you?

I love the idea of “kaizen” (continuous improvement), so I’ll continue to study and improve my skill set. For example, there are a lot of interesting tools written in Go, and I’d like to look into them further. I’m also interested in how people manage time and agendas, even going as far as reading RFCs on calendar syncing to learn more.

I continue to enjoy studying Japanese and trying to improve my communication skills. When I get tired of staring at computer screens, I like to walk around Fukuoka taking photos. It’s a great way to relax.

Thanks, Paul!

LINE Engineering Blog
