Monthly Archives: September 2014

The Power of Rundeck

A big part of the DevOps movement is the passion and commitment to “automate everything” and provide as much self-service as humanly possible. I’m a big believer in automation…not because I’m lazy, but rather because I have a desire to make all things repeatable, reliable and robust. I call those the “Three R’s” and they are a huge part of why I became a big believer in a 4th “R” which is called Rundeck. Note, I’m not the author of Rundeck. The awesome guys at SimplifyOps were the authors. I’m just a user, fan and admirer of cool, easy to use technology. Below is a quick passage about Rundeck…

Rundeck is an open-source software Job scheduler and Run Book Automation system for automating routine processes across development and production environments. It combines task scheduling, multi-node command execution, workflow orchestration and logs everything that happens. Access control policy governs who executes actions across nodes via the configured “node executor” (default for unix uses SSH) and does not require any additional remote software.[1] to be installed on them. Jobs and plugins can be written in scripting languages or Java. The workflow system can be extended by creating custom step plugins to interface external tools and services.

Wikipedia Comparison of Open Source Automation Tools

I’m a big fan of Rundeck for a number of reasons. My first reason is pretty straightforward. Basically, it’s a simple web application that provides the basic controls and workflow for self-service. The simple web gui is just so easy to use that anyone can understand how to use it with little training. My second reason is that it pretty much can make use of any automation/scripting framework out on the market. Third, it gives developers, operations engineers or even support staff a simple workflow for doing work on a server without ever logging into the server. Fourth and certainly not last is that it provides an audit and tracking system. There are other key things such as scheduling and reporting, which are super easy-to-use features to enjoy as well.

Source: Rundeck.org

Long before the SimplifyOps guys built Rundeck, my old team at Blackboard built an automation engine we called Galileo. My team built it in Groovy/Grails. It was a lot like Rundeck, but not as simple to contribute and extend. It served a great purpose during its time. It helped us achieve so many of the needs I listed above. It required a listener on each destination client. Rundeck works without an installation on the client system. All that’s needed is an SSH key or simply passing login credentials for a trusted user within the script.

Crowd-Sourcing Development

One of the cool things that the SimplifyOps guys do is crowd-source their development via Trello, which is one of the best kanban boards available (for freemium btw) on the market. Their board is public for everyone to follow, vote and even contribute.

Making Time for a Side Project Using a Commitment Device

My team is getting used to my style and attitude about work. One core value I believe in is making time for other work (that’s relevant to one’s career) outside of the normal velocity of a sprint to accomplish additional learning or work. If you have a chance, take a look at my presentation about PTOn which is about applying a commitment device (ie: scheduling of time for Paid Time On) to ensure that the work is accounted for and is not disruptive to a team’s work velocity.  

A really good friend of mine (David Hafley) sent me this article today which is directly in line with my presentation about PTOn (Paid Time On). Teams (and individuals) need time to work through a work problem (project of role) or a problem that could yield incredible inspiration (project of passion). The challenge that I see with software development (engineering teams in general) is that teams focus on scheduling every ounce of time imaginable. If the team has 12 months in a year and they follow a 1 month velocity, then they have 12 units. The same applies to a 2-week velocity in which the team works off a 26 unit schedule. What I’m really getting at is that software teams tend to build utilization models that account for work and vacation. Occasionally, these work models account for training or an off-site. You get my point which is teams tend to over-schedule their team members like they are a bunch of Carbon Based Units.

CBLF meaning - what does CBLF stand for?

If you read the article closely, you will see it emphasizes creating “personal time” which I personally find difficult. I have a wife, kids, hobbies, etc…I will agree that finding personal time is important, but in the same grain, I would suggest that in the 40+ hours we spend at work (some 60+), we need to “find work time” for learning. 

 

Balancing Testing versus Measurement

One of the advantages of having a SAAS application is the ability to capture true production telemetry. This telemetry consists of functional and non-functional (performance and security) data points. These data points can be and should be optimized for use by our team to make us a more informed development team about the quality of our product. This by no means implies that live production metrics should be leveraged 100% in lieu of testing. There should be a balance of testing and measurement.

octopusK

I covered my testing philosophy in one of my earliest blogs in which I stressed and advocated for the need for robust build/test pipelines complete with quality inspection (unit, static, integration and acceptance). This pipeline is nothing original or unique that I’m proposing. The pipeline is a component of Continuous Integration in which developers commit early and often. The pipeline grows in complexity and maturity in an iterative fashion with each day as the team’s commits becomes a robust product or module ready for deployment. Consider this early phase more of an incubation phase in which the product is nothing more than executable code, but not deployable ore useable. When code is being incubated, teams should be placing more emphasis on testing and evaluation. This testing is more Unit and API, not acceptance testing.

11LEFTHANDED

If the product is ready for acceptance testing, then the product is ready for a deployment (synthetic or production). If the product is deployed, then it should be measured with deep telemetry (dynamic analysis) such as RUM (Real User Measurement), APM (Application Performance Management) and ASM (Application Security Management). Artifacts such as log files and live telemetry from component systems (Queuing Systems, Ephemeral Caches, RDBMS and Non-Relational Structures) should be captured and used. Why…Because the data is there. Why ignore passive data that can be analyzed, captured and organized in an automated fashion?

I can’t really explain why the data often gets ignored. It simply does because so many development organizations focus on the discrete activities of testing. They often fail to capture the more meaningful data that comes from embedded telemetry into the development process. That same telemetry data that can be captured in the testing process can be captured from live production systems. It’s like a golden egg that gets laid every day. The team has to take advantage of this goldmine of data.

cloud_22

I had the chance to talk with Badri Sridharan from LinkedIn about a year ago. Badri and I both ran Performance Engineering practices in our careers. We were exchanging perspectives on the current and future of Performance Engineering. During the call, Badri shared insight into a system called EKG that the Development and Operations teams introduced at LinkedIn. The blog was written by the Operations team, so it shows a lot of infrastructure data points visually. If you look toward the bottom of the blog, you will see the reference to exception counts and a “variety of other metrics”. Those other metrics as Badri explained are functional verification data points. Teams at LinkedIn can get live production data for their Canary and A/B deployments before they promote code throughout the whole system

EKG compares exception counts, network usage, CPU usage, GC performance, service call fanout, and a variety of other metrics between the canary and the control groups, helping to quickly identify any potential issues in the new code.

I’m still learning what telemetry exists in our systems right now. I’m eager to hear from all of our teams about what data is captured, where it is stored, how it’s made actionable and how the data is brought back into the development process.