Tuesday, December 31, 2013

Challenges For On-premise Vendors Transitioning To SaaS

As more and more on-premise software vendors begin their journey to become SaaS vendors, they are going to face some obvious challenges. Here's my view on what those might be.

The street is mean but you can educate investors

The contrast between Amazon and Apple is sharp. Even though Amazon has been in business for a long time with soaring revenue in mature categories, the street sees it as a high-growth company and tolerates near-zero margins and the surprises Jeff Bezos brings in every quarter. Bezos has managed to convince the street that Amazon is still in heavy growth mode and hasn't yet arrived. On the other hand, despite Apple's significant revenue growth—in mature as well as in new disruptive categories—investors treat Apple very differently and have crazy revenue and margin expectations.

Similarly, a traditional pure SaaS company such as Salesforce is considered a high-growth company where investors are focused on growth and not margins. But if you're an on-premise vendor transitioning to SaaS, the street won't tolerate a hit on your margins. The street expects mature on-premise companies to deliver continuous low double-digit growth as well as margins, without any blips and dips, during their transition to SaaS. As on-premise vendors change their product, delivery, and revenue models, investors will be hard on them, and the stock might take a nosedive if investors don't quite understand where the vendors are going with their transition. As much as investors love the annuity model of SaaS, they don't like uncertainty, and they will punish vendors for the investors' own lack of understanding of the vendor's model. It's a vendor's job to educate investors and continuously communicate with them about the transition.

Isolating on-premise and SaaS businesses is not practical

Hybrid on-premise vendors should (and they do) report on-premise and subscription (SaaS) revenue separately to give investors insight into their revenue growth and revenue transition. They also report their data-center-related cost (to deliver software) as cost of revenue. But there's no easy way, if there is one at all, to split and report separate SG&A costs for their on-premise and SaaS businesses. In fact, combined sales and marketing units are the weapons incumbent on-premise vendors have to successfully transition to SaaS. More on that later in this post.

The basic idea behind achieving economies of scale and keeping overall cost down (remember margins?) is to share and tightly integrate business functions wherever possible. Even though vendors sometimes refer to their SaaS and on-premise businesses as separate lines of business (LoBs), in reality they are not. These LoBs are intertwined and report their numbers as a single P&L.

Not being able to charge more for SaaS is a myth

Many people I have spoken to assume that SaaS is a volume-only business and that you can't charge customers what you would typically charge in your traditional license-and-maintenance revenue model. This is absolutely not true. If you look at the deal sizes and contract lengths of pure SaaS companies, they do charge a premium when they have unique differentiation, regardless of volume. Customers are not necessarily against paying a premium - for them it is all about bringing down their overall TCO and increasing their ROI with reduced time to value. If a vendor's product and its delivery model allow customers to accomplish these goals, the vendor can charge a premium. In fact, in most cases this could be the only way out. For a vendor transitioning from on-premise to SaaS, cost is going to go up: they will continue to invest in building new products and transitioning existing ones, and they will assume much of the cost of running operations on behalf of their customers to deliver software as a service. They will have to grow their top line not only to meet growth expectations but also to offset some of that cost to maintain margins.
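To make the annuity math concrete, here is a toy comparison of the two revenue models. All dollar figures are hypothetical, chosen only to illustrate why a subscription can match or exceed license-plus-maintenance revenue over a multi-year relationship:

```python
# Illustrative only: cumulative vendor revenue from a perpetual license
# (upfront fee plus annual maintenance) versus a SaaS subscription.
# Every figure below is hypothetical, not drawn from any real vendor.

def license_revenue(years, license_fee=100_000, maintenance_rate=0.20):
    """Upfront license plus annual maintenance as a % of the license fee."""
    return [license_fee + license_fee * maintenance_rate * y for y in years]

def saas_revenue(years, annual_subscription=40_000):
    """Pure subscription: revenue accrues year over year (the annuity model)."""
    return [annual_subscription * y for y in years]

years = range(1, 8)
for y, lic, saas in zip(years, license_revenue(years), saas_revenue(years)):
    print(f"year {y}: license ${lic:,.0f} vs SaaS ${saas:,.0f}")
```

With these made-up numbers the subscription trails the license deal early on and catches up around year five, which is the trade the street has to be educated about: a near-term revenue dip in exchange for a larger, more predictable annuity later.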

Prime advantage on-premise incumbents have over SaaS entrants

So, what does work in favor of on-premise vendors who are going through this transition?

It's the sales and marketing machine, my friends.

The dark truth about selling enterprise software is that you need salespeople wearing suits driving around in their BMWs to sell it. There's no way out. If you look at high-growth SaaS companies, they spend most of what they earn on sales and marketing. Excluding Workday, there is not much difference in R&D cost across vendors, on-premise or SaaS. Workday is building out its portfolio, and I expect to see this cost go down in a few years.

Over a period of time, many on-premise vendors have built a great brand and achieved amazing market penetration. As these vendors go through the SaaS transition, they won't have to spend as much time and money educating the market and customers. In fact, I would argue they should thank other SaaS vendors for doing that job for them. On-premise vendors have also built an amazing sales machine with deep relationships with customers and reliable sales processes. If they can maintain their SG&A numbers, they will have enough room to deal with a possible initial hit on revenue and the additional cost they would incur as they go through this transition.

Be in charge of your own destiny and be aggressive

It's going to be a tough transition regardless of your loyal customer base and differentiated products. It will test the execution excellence of on-premise vendors. They are walking on a tightrope with not much room for mistakes. The street is very unforgiving.

Bezos and Benioff have consistently managed to convince the street that they run high-growth companies and should be treated as such. There's an important lesson here for on-premise vendors. There is no reason to label yourself an on-premise vendor simply making a transition. You could do a lot more than that: invest in new disruptive categories and rethink your existing portfolio. Don't just chase SaaS for its subscription pricing, but make an honest and explicit attempt to become a true SaaS vendor. The street will take notice and you might catch a break.

Thursday, November 21, 2013

Rise Of Big Data On Cloud

Growing up as an engineer and a programmer, I was reminded every step along the way that resources—computing as well as memory—are scarce. Programs were designed around these constraints. Then the cloud revolution happened and we told people not to worry about scarce computing. We saw the rise of MapReduce, Hadoop, and countless other NoSQL technologies. Software was the new hardware. We owe it to all the software development, especially computing frameworks, that allowed developers to leverage the cloud—computational elasticity—without having to understand the complexity underneath it. What has changed in the last two to three years is that a) the underlying file systems and computational frameworks have matured, and b) adoption of Big Data is driving the demand for scale-out and responsive I/O in the cloud.

Three years back, I wrote a post, The Future Of The BI In Cloud, where I highlighted two challenges of using the cloud as a natural platform for Big Data. The first was creating a large-scale data warehouse, and the second was the lack of scale-out computing for I/O-intensive applications.

A year back, Amazon announced Redshift, a data warehouse service in the cloud, and last week it announced high-I/O instances for EC2. We have come a long way, and the more I look at current capabilities and trends, the closer Big Data, at scale, on the cloud, seems to reality.

From a batched data warehouse to interactive analytic applications:

Hadoop was never designed for I/O-intensive applications, but because Hadoop is a compelling scale-out computational platform, developers had a strong desire to use it for their data warehousing needs. This made Hive and HiveQL popular analytic frameworks, but it was a suboptimal solution that worked well for batch loads and wasn't suitable for responsive, interactive analytic applications. Several vendors realized there's no real reason to stick to the original style of MapReduce. They stuck with HDFS but invested significantly in alternatives to Hive that are far faster.

There is a series of such projects/products being developed with HDFS and MapReduce as a foundation but with special data management layers added on top to run interactive queries much faster than plain vanilla Hive. Examples include Impala from Cloudera and Apache Drill from MapR (both based on Dremel), HAWQ from EMC, Stinger from Hortonworks, and many other start-ups. Not only vendors but also early adopters such as Facebook have created such projects: Presto, an accelerated alternative to Hive, which Facebook recently open sourced.
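The workloads these engines compete on are ordinary SQL aggregations. As a rough illustration (using Python's built-in SQLite in place of a real Hive/Impala cluster, with an invented table and columns), this is the shape of query that is painfully slow as a batch MapReduce job but that users now expect to return interactively:

```python
import sqlite3

# A toy "page views" fact table standing in for an HDFS-backed Hive table.
# The table name, columns, and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INT, url TEXT, ms_spent INT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [(1, "/home", 120), (1, "/pricing", 300),
     (2, "/home", 80), (3, "/pricing", 450)],
)

# The HiveQL-style aggregation that Impala, HAWQ, Stinger, and Presto
# race to answer in seconds rather than as a batch MapReduce job.
rows = conn.execute(
    "SELECT url, COUNT(*) AS visits, AVG(ms_spent) AS avg_ms "
    "FROM page_views GROUP BY url"
).fetchall()
for url, visits, avg_ms in rows:
    print(url, visits, avg_ms)
```

The SQL itself is unremarkable; the entire race among these vendors is about executing it at scale with interactive latency.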

From raw data access frameworks to higher level abstraction tools: 

As vendors continue to build more and more Hive alternatives, I am also observing vendors investing in higher-level abstraction frameworks. Pig was among the first higher-level frameworks that made it easier to express data analysis programs. But now we are witnessing even richer, higher-layer frameworks such as Cascading and Cascalog, not only for writing SQL-like queries but for writing interactive programs in higher-level languages such as Clojure and Java. I'm a big believer in empowering developers with the right tools. Working directly against Hadoop has a significant learning curve, and developers often end up spending time on plumbing and other things that can be abstracted away in a tool. For web development, the popularity of Angular and Bootstrap shows how the right frameworks and tools can make developers far more efficient by not having to deal with raw HTML, CSS, and JavaScript controls.
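To make the plumbing argument concrete, here is a word count written twice in plain Python: once as hand-rolled map/reduce-style plumbing, and once through a higher-level abstraction. This is only a sketch of the jump that frameworks like Pig and Cascading offer over raw MapReduce; real frameworks add distribution across a cluster, which is not shown here:

```python
from collections import Counter
from itertools import chain

docs = ["big data on cloud", "big data at scale", "cloud at scale"]

# Raw MapReduce-style plumbing: emit (word, 1) pairs, then group and sum.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

def reducer(pairs):
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

low_level = reducer(chain.from_iterable(mapper(d) for d in docs))

# The same job through a higher-level abstraction: one declarative line.
high_level = Counter(chain.from_iterable(d.split() for d in docs))

assert low_level == dict(high_level)
```

Both produce identical counts; the difference is how much mechanical emit/group/sum code the developer has to write and debug, which is exactly what the higher-level frameworks take off the table.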

From solid state drives to in-memory data structures: 

Solid state drives were the first step in upstream innovation to make I/O much faster, but I am observing this trend go further, with vendors investing in building in-memory data management layers on top of HDFS. Shark and Spark are among the popular ones. Databricks has made big bets on Spark and recently raised $14M. Shark, built on Spark, is designed to be compatible with Hive but to run queries up to 100x faster by using in-memory data structures, columnar representation, and by optimizing MapReduce not to write intermediate results back to disk. This looks a lot like MapReduce Online, a research paper published a few years back. I do see a UC Berkeley connection here.
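The disk-versus-memory point can be sketched in a few lines of plain Python. This is a toy simulation, not Spark's actual API: one pipeline spills its intermediate result to a file between stages, the way classic MapReduce does, while the other chains the stages entirely in memory:

```python
import json
import os
import tempfile

data = list(range(10))

# Classic MapReduce style: each stage writes its intermediate result to
# disk and the next stage reads it back (a temp file stands in for HDFS).
def staged_pipeline(values):
    stage1 = [v * v for v in values]
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    with open(path, "w") as f:
        json.dump(stage1, f)          # spill intermediate result to "HDFS"
    with open(path) as f:
        stage1 = json.load(f)         # read it back for the next stage
    return sum(stage1)

# Spark-style: keep the intermediate result in memory and chain stages.
def in_memory_pipeline(values):
    return sum(v * v for v in values)

assert staged_pipeline(data) == in_memory_pipeline(data) == 285
```

The answers are identical; the performance gap in a real cluster comes from eliminating that round trip through the disk between every pair of stages.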

Photo courtesy: Trey Ratcliff

Thursday, October 31, 2013

How I Accomplished My Personal Goal Of Going To Fewer Meetings

As part of my job I have to go to a lot of meetings. As it turns out, not all meetings are equally important. Many times, either during or after a meeting, I end up asking myself why the hell I went to it. Sound familiar?

A couple of years back, instead of just whining about it, I decided to do something about this situation. I set a personal goal to cut down the meetings I go to by 20%. Not only did I succeed, but I kept the same goal the year after and accomplished that as well.

This is how I did it:

Ask for prep documents and an upfront agenda

If a meeting that I am invited to does not have an agenda in the meeting request, I ask for one before I commit. This approach has two positive effects: 1) it forces the organizer to think about what they want to accomplish, which invariably results in a productive meeting, and 2) I have an opportunity to opt out if I don't receive an agenda or the agenda doesn't require my presence. I also ask for prep documents; I prepare for all my meetings, and I firmly believe that meeting time should be judiciously used to discuss what people think about the information and to make important decisions, as opposed to gathering information, which could have been done before the meeting.

Opt-out with an alternative ahead of a meeting

If I believe the agenda is partially useful but I won't add any value by being part of the meeting, I connect with the organizer ahead of time to clarify a few things or give my input, either in person or via email or phone. In most cases, reaching out to the organizer serves the purpose and I don't have to go to the actual meeting. If I do end up having to go to such a meeting, I ask the organizer for permission to either walk in late or leave early. This saves me a lot of time, and I don't have to sit through a meeting when I am not required to be there.

Postpone a non-critical meeting

If I see that I am invited to a non-critical meeting, I ask to postpone it by a few days, citing my non-availability. In many cases, the issue gets resolved in those few days and we are no longer required to meet. It is important to decline the original meeting request and ask the organizer to create a new one in the future, even if your intent is to postpone and not cancel the meeting. Most people don't create a new meeting request, and I don't hear back from them.

DVR the meeting

I ask organizers to record certain meetings when I believe that parts of a meeting would be useful at a later stage. I fast-forward through the uninteresting parts of such meetings and listen to the parts I care about. I underestimated the effectiveness of listening to a recording until I organized a few meetings as podcasts and listened to them during my commute. The most fascinating part of this approach, other than the ability to fast-forward, is being able to treat a meeting as an information session without having to worry about understanding every detail or feeling the anxiety of making decisions.

If everything else fails, multitask

I believe it is somewhat rude and distracting when people bring their laptops or tablets to a meeting and keep working instead of paying attention. But this isn't true when the meeting is an audio conference. I don't work on my laptop or tablet when I am in a meeting room; I am fully committed to the meeting and completely present. However, for certain meetings, when I know I don't have an option to opt out and it is going to be a waste of time, I dial into the meeting instead of being there in person and continue to work on my laptop while participating. This is not to be confused with remote meetings that I participate in or lead when not all attendees are in the same location; I am fully present for those.

Before you ask, yes, I did meticulously measure the time I saved. I had a simple spreadsheet that did a great job. I was a little hesitant in the beginning to push back on meetings, but I became more comfortable as I started saving more and more time. I would highly encourage you to follow these rules, or create your own, and save yourself some quality time that you can use to do other useful things.
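For the curious, the spreadsheet arithmetic is trivial to reproduce. A hypothetical sketch (the minutes below are invented, not my actual data):

```python
# A stand-in for the spreadsheet: log the meetings you were invited to
# and the ones you actually attended, then compute the reduction.
invited  = [60, 30, 45, 60, 30, 90, 30, 60]   # minutes, hypothetical
attended = [60, 30, 45, 60, 30, 90]           # opted out of two meetings

invited_total = sum(invited)
attended_total = sum(attended)
saved = invited_total - attended_total
reduction = 100 * saved / invited_total
print(f"saved {saved} minutes ({reduction:.0f}% reduction)")
```

With these invented numbers the cut works out to about 22%, which is the kind of running total that makes it easy to check progress against a 20% goal.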

Photo courtesy: Ho John Lee

Monday, October 21, 2013

Big Data Platform As Technology Continuum

Source: Wikipedia
A Russian chemist, Dmitri Mendeleev, invented the first periodic table of elements. Prior to that, scientists had identified a few elements, but the scientific world lacked a consistent framework to organize them. Dmitri built upon the existing work of these scientists and invented the first periodic table based on a set of design principles. What fascinates me most about his design is that he left a few cells empty because he predicted that new elements would soon be discovered. Not only did he design the first periodic table to create a foundation for how elements can be organized, but he anticipated what might happen in the future and included that consideration in his design.

It is unfortunate that a lot of us are trained to chase a perfect answer as opposed to designing something that is less than perfect, useful, and inspirational for future generations to build on. We look at technology in a small snapshot and think about what it can do for me and others now. We don't think of technology disruption as a continuum that solves a series of problems. The Internet started that way, and the first set of start-ups failed because they defined the problem too narrowly. The companies that succeeded, such as Google, Amazon, and eBay, saw the Internet as a long-term trend and didn't think of it in a small snapshot. Cloud and Big Data are the same. Every day I see problems being narrowly defined as if this were just a fad and companies wanted to capitalize on it before it disappears.

Build that first periodic table and give others the imagination to extend it. As an entrepreneur, you were not the first and you are not going to be the last trying to solve this problem.

Monday, September 30, 2013

The Dark Side Of Big Data

Latanya Sweeney, a Harvard professor, Googled her own name and found, next to her name, an ad for a background check hinting that she had been arrested. She dug deeper and concluded that so-called black-identifying names were significantly more likely to be the targets of such ads. She documented this in her paper, Discrimination in Online Ad Delivery. It is up to an advertiser how they pick keywords and other criteria to show their ads. Google, like most other companies for which advertising is the primary source of revenue, would never disclose the details of the algorithms behind its ad offerings. Google denied that AdWords is discriminatory in any way.

Facebook just announced that it is planning to give users more options to provide feedback on which ads are relevant to them and which are not. While on the surface this might sound like a good idea to get rid of irrelevant ads and keep marketers as well as users happy, the approach has far more severe consequences than you might think. In the case of the Google AdWords discrimination scenario, the algorithm is supposedly blind and has no knowledge of who is searching for what (assuming you're not logged in and there is no cookie effect), but in the case of Facebook, ads are targeted based on you as an individual and what Facebook might know about you. Algorithms are written by human beings, and knowingly or unknowingly they could certainly introduce subtle or blatant discrimination. As marketers, and the companies that serve ads on their behalf, know more about you as an individual and about your social and professional network, they are a step closer to discriminating against their users, knowingly or unknowingly. There's a fine line between stereotyping and what marketers call "segmentation."

AirBnB crunched their data and concluded that older hosts tend to be more hospitable and younger guests tend to be more generous with their reviews. If this is just for informational purposes, it's interesting. But what if AirBnB uses this information to knowingly or unknowingly discriminate against young hosts and old guests?

A combination of massively parallel computing, sophisticated algorithms that leverage this parallelism, and the ability of algorithms to learn and adapt to be more relevant, almost in real time, is going to cause a lot more of these issues to surface. As a customer, you simply don't know whether the products or services you are offered, or the prices at which they are offered, are based on any discriminatory practices. To complicate this further, in many cases even the companies don't know whether the insights they derive from vast amounts of internal as well as external data are discriminatory. This is the dark side of Big Data.

The challenge with Big Data is not Big Data itself but what companies could do with your data, combined with any other data, without your explicit understanding of how the algorithms work. To prevent discriminatory practices, we audit employment practices to ensure equal opportunity and college admissions to ensure a fair process, but I don't see how anyone is going to audit these algorithms and data practices.

I have no intention of painting a gloomy picture and blaming technology. Disruptive technology always surfaces socioeconomic issues that either didn't exist before or were not obvious and imminent. Some people get worked up because they don't quite understand how the technology works. I still remember politicians trying to blame GMail for "reading" emails to show ads. I believe Big Data is yet another such disruption that is going to cause similar issues. We should not shy away from these issues; we should collaboratively work hard to highlight and amplify what they might be and address them, as opposed to branding the technology as evil.

Photo Courtesy: Jonathan Kos-Read 

Saturday, August 31, 2013

Purple Squirrels

It is fashionable to talk about the talent shortage in Silicon Valley. People whine about how hard it is to find and hire the "right" candidates. What no one wants to talk about is how the hiring process is completely broken.

I need to fill headcount: This is a line you hear a lot at large companies. Managers want to hire simply because they are entitled to, under a "hire or lose headcount" clause. They spend more time worrying about losing headcount and less time finding the right people the right way.

Chasing a mythical candidate: Managers like to chase purple squirrels. They have outrageous expectations and are far removed from the reality of the talent market. Managers are also unclear on exactly what kind of people they are looking to hire.

Bizarre interview practices: "How many golf balls can fit in a school bus?" or "Can you write code with your right hand while drawing a tree with your left?" We all have our favorite bizarre interview stories. But even when not bizarre, interview practices are, by and large, unscientific, inconsistent, and highly subjective. Most companies don't have a good way to objectively conduct interviews and identify the right candidates to hire. This sounds silly, but unfortunately it's true.

If we are really serious about talent, we should focus on our ability to attract, acquire, and retain it, as opposed to whining about it.

Always be sourcing

Cultivate a hiring culture; always keep looking for people in your network even if you have no immediate plans to hire. In many cases, the best hires are the ones who are not actively looking for a job. References from your best current employees are the right place to start. Go to conferences and talk about your company and projects. Use this as a learning opportunity to calibrate your understanding of the market and seek out an outsider's perspective on the right hiring strategy for your organization. You are constantly making an effort to attract talent; treat this as an ongoing task as opposed to a one-time hiring activity.

Pulse has redesigned its technical hiring process by introducing a "try before you buy" model where prospects actually work with Pulse's team on a real project as part of the interview process. Hiring someone is a critical decision, and this approach is a win-win. This is also the reason why interns make good hires: both sides get enough time to check each other out.

Treat interviewing as an important skill

Most employees are trained to do their work, but they have little or no training in interviewing other people. I find it astounding that we hire social scientists, ethnographers, and user researchers to meticulously and scientifically interview users to better understand their behavior and eventually design a product that meets or exceeds their expectations. But we don't spend any time training our own employees to better understand the prospects and hire the ones who would actually design these products.
"Years ago, we did a study to determine whether anyone at Google is particularly good at hiring. We looked at tens of thousands of interviews, and everyone who had done the interviews and what they scored the candidate, and how that person ultimately performed in their job. We found zero relationship."
I have seen interviewers either reject interviewees in the first few minutes of an interview, relying solely on hunch and intuition, or mistake an interviewee's confidence for competence, without any kind of objectivity. I find it strange that technical as well as business folks who believe in science, have been trained to trust empirical evidence, and possess great analytical skills fall for subjective interpretations based on their preconceived biases. Interviewing objectively is hard because it is boring to follow an objective approach, leaving your subjective smartness aside. Very boring and very hard.

Look for behaviors and not just skills

Design your interviews to measure past behaviors and not skills alone. Skills are easy to learn, but behaviors are hard, if not impossible, to change. Start with your most successful employees and identify what behaviors they exhibit and how those behaviors have made them successful and valuable to your company. I cringe when I hear the words "chemistry" and "cultural fit." These are actually behaviors that people find hard to describe and evaluate. There's a way to break down chemistry and cultural fit into measurable behaviors that you can look for during an interview. Don't judge people based on what they can do during an interview, because it does not represent a real working-life scenario. Asking people to solve a puzzle or draw something on a whiteboard during an interview doesn't prove much. Google, infamous for its ridiculous interview practices, has confessed they are a complete waste of time.
"On the hiring side, we found that brainteasers are a complete waste of time. How many golf balls can you fit into an airplane? How many gas stations in Manhattan? A complete waste of time. They don’t predict anything. They serve primarily to make the interviewer feel smart." -- Laszlo Bock, senior vice president of people operations at Google.
Unless you have designed a consistent interviewing process focused on questions that objectively assess candidates based on the behaviors they exhibited in previous jobs, you will become a victim of your own biases and subjective interpretations.

Retaining talent is as important as attracting and acquiring talent. A separate blog post on that topic some other time.

Photo courtesy: Harvard Business Review

Wednesday, July 31, 2013

Chasing That Killer Application Of Big Data

I often get asked, "what is the killer application of Big Data?" Unfortunately, the answer is not that simple.

In the early days of enterprise software, it was automation that fueled the growth of enterprise applications. The vendors that eventually managed to stay in business and get bigger were, and are, the ones that expanded their footprint to automate more business processes in more industries. The idea behind the killerness of some of these applications was merely the existence, in alternate forms, of somewhat mature business processes. Organizations did have financials and supply chains, but those processes were paper-based or partially realized in a set of tools that didn't scale. The objective was to replace these homegrown, non-scalable processes and tools with standardized packaged software that would automate the processes after being customized to the needs of an organization. Some vendors did work hard to understand what problems they set out to solve, but most didn't; they poured concrete into existing processes.

The traditional Business Intelligence (BI) market grew the same way; customers were looking to solve a specific set of reporting problems to run their businesses. The enterprise applications that automated the business processes were not powerful enough to deliver the kind of reporting that organizations expected in order to gain insight into their operations and make decisions. These applications were designed to automate processes, not to provide insights. The BI vendors created packaged tools and technology solutions to address this market. Once again, the vendors didn't have to think about what application problems organizations were trying to solve.

Now, with the rise of Big Data, the same vendors, and some new ones, are asking that same question: what's the killer application? If Big Data turns out to be as big a wave as the Internet or the cloud, we are certainly at a very early stage. This wave is different from the previous ones in a few ways; it is technology-led innovation that is opening up new ways of running a business. We are at an inflection point of cheap commodity hardware and MPP software designed from the ground up to treat data as a first-class citizen. This is not about automation or filling a known gap. I live this life working with IT and business leaders of small and large organizations worldwide as they struggle to figure out how best to leverage Big Data. These organizations know there's something in this trend for them, but they can't quite put a finger on it.

As a vendor, the best way to shape your strategy is to help customers with their Big Data efforts without chasing a killer application. The killer applications will be emergent if you pay attention and observe patterns across your customers. Make Big Data tangible for your customers and design tools that take them away from the complexity of the technology layer. Organizations continue to have massive challenges with semantics as well as with the location and format of their data sources. This is not an exciting domain for many vendors, but help these organizations bring their data together. And, most importantly, try hard to become a trusted advisor and a go-to vendor for Big Data, regardless of your portfolio of products and solutions. Waiting for a killer application before getting started, or marketing your product as THE killer application of Big Data, are perhaps not the smartest things to do right now.

Big Data is a nascent category; an explosive, promising, but nascent category. Organizations are still trying to get a handle on what it means to them. The maturity of business processes and well-defined unsolved problems in this domain are not yet clear. While this category plays out on its own, don't chase those killer applications or place your bets on one or two of them. Just get started and help your customers. I promise you shall stumble upon that killer application during your journey.

About the picture: I took this picture inside a historic fort in Jaisalmer, India, a place with a rich history. History has taught me a lot about all things enterprise software as well as non-enterprise-software.

Sunday, June 30, 2013

Celebrating Failures

Being a passionate design thinker, I am a big believer in failing fast and failing often. I have taken this one step further: I celebrate one failure every week. Here's why:

You get more comfortable looking for failures, analyzing them, and learning from them

I have sat through numerous post-mortem workshops and concluded that the root causes of failures are usually the same: abstract concepts such as lack of communication, unrealistic scope, insufficient training, and so on. If that's true, why do we repeat the same mistakes, causing failure to remain so common? Primarily because many people find it hard to imagine and react to abstractions, but can relate much better when these concepts are contextualized into their own situation. A post-mortem of a project tells you what you already suspected; it's hindsight, and it's a little too late. I have always advocated a "pre-mortem workshop" to prepare for failure at the beginning. Visualize all the things that could go wrong by imagining that the project has already failed. This gives the team an opportunity to proactively look at risks and prepare to prevent and mitigate them.

Failures just like successes become nothing more than events with different outcomes

A failure or a success is nothing but an event. Just as in sports, you put in your best effort and can still fail, because you control your effort, dedication, and passion but not the outcome. While it is absolutely essential to analyze mistakes and make sure you don't repeat them, in some cases, looking back, you would not have done anything differently. When you look at more failures more often, they tend to become events with different outcomes as opposed to one-off situations that you regret.

It changes your attitude to take more risk because you are not afraid of outcome

When failures are not one-off events and you anticipate and celebrate them more often, it changes how you think about many things, personally as well as professionally. It helps you minimize regret, not failures.

I don't want to imply that failure is actually a good thing. No one really wants to fail, and yet failure is the only certainty. But it's all about failing fast, failing often, and correcting course before it's too late. Each failure presents an opportunity to learn. Don't waste a failure; celebrate it.

About the picture: I took this picture inside the Notre Dame in Paris. I see lights as a medium to celebrate everything: the victory of good over evil as celebrated during the Hindu festival of Diwali, and a candlelight vigil to show support and motivate people for change.

Thursday, June 13, 2013

Hacking Into The Indian Education System Reveals Score Tampering

Debarghya Das has a fascinating story on how he managed to bypass a flimsy web security layer to get access to the results of 150,000 ICSE (10th grade) and 65,000 ISC (12th grade) students in India. While the lack of security and the total failure to safeguard sensitive information is an interesting topic, what is more fascinating about this episode is the analysis of the results, which unearthed score tampering. The school boards changed students' scores to give them "grace" points to bump them up to the passing level. The boards also seem to have tampered with some other scores, but the motive for that tampering remains unclear (at least to me).

I would encourage you to read the entire analysis and the comments, but a tl;dr version is:

32, 33 and 34 were visibly absent. This chain of 3 consecutive numbers is the longest chain of absent numbers. Coincidentally, 35 happens to be the pass mark.
Here's a complete list of unattained marks -
36, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 56, 57, 59, 61, 63, 65, 67, 68, 70, 71, 73, 75, 77, 79, 81, 82, 84, 85, 87, 89, 91, 93. Yes, that's 33 numbers!
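The gap analysis behind this finding needs nothing more than counting which marks never occur. A minimal sketch, using made-up scores since the scraped dataset isn't public:

```python
from collections import Counter

def absent_marks(scores, lo=0, hi=100):
    """Return every mark in [lo, hi] that no student attained."""
    counts = Counter(scores)
    return [m for m in range(lo, hi + 1) if counts[m] == 0]

# Made-up scores: every mark appears except the suspicious 32-34 band.
scores = [m for m in range(0, 101) if m not in (32, 33, 34)] * 100
print(absent_marks(scores))  # → [32, 33, 34]
```

With a couple hundred thousand real scores, gaps like these are exactly what jump out of such a histogram.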

The comments are even more fascinating, where people are pointing out flaws in his approach and challenging the use of the CLT (central limit theorem) with a rebuttal. If there had been no tampering with the scores, the observed gaps would defy the CLT with near certainty. In other words, the chances are almost zero, if not zero, that he is wrong about his inferences and conclusions.

He used fairly simple statistical techniques and MapReduce-style computing to analyze a fairly decent-sized data set to infer and prove a specific hypothesis (most people, including me, believed that grace points existed but had no evidence to prove it). He even created a public GitHub repository of his work, which he later made private.

I am not a lawyer and I don't know whether what he did was legal, but I do admire his courage in not posting this anonymously, as many people in the comments suggested he should have. I hope he doesn't get into any trouble.

Spending a little more time trying to comprehend this situation I have two thoughts:

The first shocking, but unfortunately not surprising, observation: how careless the school boards are in making such sensitive information available on their websites without basic security. It is not as if it is hard to find web developers in India who understand basic or even advanced security; it's simply laziness and carelessness on the school boards' part not to bother with it. I hope that all government as well as non-government institutes will learn from this breach and tighten up their access and data security.

The second revelation: it's not a terribly bad idea to publicly distribute this very dataset, and similar ones, after removing PII (personally identifiable information), to let people legitimately go crazy at it. If this dataset is publicly available, people will analyze it, find patterns, and challenge the fundamental education practices. Open source has been living proof that opening software up to the public to hack it and find flaws makes it more secure, because the flaws can then be fixed. Knowing the Indian bureaucracy, I don't see them going in this direction. Turns out I have seen this movie before: I have been an advocate of making electronic voting machines available to researchers to examine the validity of a fair election process. Instead of allowing security researchers access to an electronic voting machine, Indian officials accused a researcher of stealing a voting machine and arrested him. However, if India is serious about competing globally in education, this might very well be the first step toward transparency.

Friday, May 31, 2013

Unsupervised Machine Learning, Most Promising Ingredient Of Big Data

Orange (France Telecom), one of the largest mobile operators in the world, issued a challenge, "Data for Development," by releasing a dataset of their subscribers in Ivory Coast. The dataset contained 2.5 billion records: calls and text messages exchanged between 5 million anonymous users in Ivory Coast, Africa. Various researchers got access to this dataset and submitted proposals on how the data could be used for development purposes in Ivory Coast. It would be an understatement to say these proposals and projects were mind-blowing. I have never seen so many different ways of looking at the same data to accomplish so many different things. Here's a book [very large pdf] that contains all the proposals. My personal favorite is AllAboard, where IBM researchers used the cell-phone data to redraw optimal bus routes. The researchers used several algorithms, including supervised and unsupervised machine learning, to analyze the dataset, resulting in a variety of scenarios.

In my conversations and work with CIOs and LOB executives, the breakthrough scenarios always come from a problem that they didn't even know existed or could be solved. For example, the point-of-sale data that you use for out-of-stock analysis could reveal new hyper-segments, using clustering algorithms such as k-means, that you didn't even know existed, and could also help you build a recommendation system using collaborative filtering. The data that you use to manage your fleet could help you identify outliers or unproductive routes using SOM (self-organizing maps) with dimensionality reduction. Smart meter data that you use for billing could help you identify outliers and prevent theft using a variety of ART (Adaptive Resonance Theory) algorithms. I see endless scenarios based on a variety of unsupervised machine learning algorithms, similar to using cell phone data to redraw optimal bus routes.
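To make the clustering scenario concrete, here is a toy k-means sketch over hypothetical (basket size, visits per month) point-of-sale features. A real system would use a library implementation and far richer features; this only illustrates how the algorithm surfaces segments you never defined upfront:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means for 2-D points; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c
            else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Hypothetical point-of-sale features: (basket size, visits per month).
shoppers = [(2, 1), (3, 1), (2, 2), (20, 8), (22, 9), (21, 10)]
segments = sorted(kmeans(shoppers, 2))
print(segments)  # two centroids: small/infrequent vs large/frequent baskets
```

Nobody told the algorithm that "small infrequent" and "large frequent" shoppers existed; the segments emerged from the data, which is the whole appeal of the unsupervised approach.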

Supervised and semi-supervised machine learning algorithms are equally useful, and I see them complementing unsupervised machine learning in many cases. For example, in retail, you could start with k-means to unearth new shopping behavior and end up with Bayesian regression followed by exponential smoothing to predict future behavior based on targeted campaigns, further monetizing the newly discovered shopping behavior. However, unsupervised machine learning algorithms are by far the best that I have seen at unearthing breakthrough scenarios, due to their very nature of not requiring you to know a lot upfront about the data (labels) to be analyzed. In most cases you don't even know what questions you could ask.
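The exponential smoothing step in that retail pipeline is simple enough to sketch. This is a minimal illustration with made-up weekly purchase counts for a newly discovered segment, not a production forecaster:

```python
def exp_smooth(series, alpha=0.5):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        # New level blends the latest observation with the running estimate.
        level = alpha * x + (1 - alpha) * level
    return level

# Hypothetical weekly purchases from a newly discovered shopper segment.
weekly = [10, 12, 11, 13, 14]
print(exp_smooth(weekly, alpha=0.5))  # → 13.0
```

The `alpha` parameter trades responsiveness against stability: closer to 1 tracks recent behavior, closer to 0 smooths out campaign-driven spikes.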

Traditionally, BI has been built on pillars of highly structured data with well-understood semantics. This legacy has made most enterprise people operate with a narrow mindset: I know the exact problem that I want to solve and the exact question that I want to ask, and Big Data is going to make all this possible and even faster. This is the biggest challenge I see in embracing and realizing the full potential of Big Data. With Big Data there's an opportunity to ask a question that you never thought or imagined you could ask. Unsupervised machine learning is the most promising ingredient of Big Data.

Wednesday, May 22, 2013

Lead, Follow, Or Get Out Of The Way

If you have been following this blog you would know that I mainly blog about enterprise software, cloud, and big data, with a few occasional posts on design and design thinking. That's what I am most passionate about. Having spent my entire career building enterprise software, I have realized that success and competitive differentiation in the marketplace boil down to an organization's unique ability to get three things right, where management plays a key role: 1) people who can continuously learn and adapt to change, 2) processes that are nimble and evolve as the company evolves, and 3) products that solve a real problem and delight their end users. While I continue to blog about enterprise software, I have decided to evolve this blog further by adding a few management posts going forward.

There are a series of management topics I am interested in, but let's start with the basic one: my core management philosophy, which is "lead, follow, or get out of the way." In any situation I ask myself whether I should be leading, or following someone else's lead and extending my full support. If neither makes sense, I simply get out of the way and let people do their job. Building, selling, and supporting software, like many other things, requires a loosely coupled (to put it in software terms) organization where leaders lead and follow other leaders at the same time. This gets more and more complicated as the size and portfolio of an organization grow over the years. People draw artificial boundaries and lose sight of the mission and the big picture.

Leading is hard, following is harder, and getting out of the way is the hardest; it requires a conscious effort to empower people to do their job without getting in their way. But it is an approach that works, and I encourage you to try it out and share it with others.

Photo courtesy: Pison Jaujip 

Tuesday, April 30, 2013

Justifying Big Data Investment

Traditionally, companies invest in software that has been proven to meet their needs and has a clear ROI. This model falls apart when a disruptive technology such as Big Data comes around. Most CIOs have started to hear about Big Data, and depending on where they sit on the conservative-to-progressive spectrum, they have either started to think about investing or have already started. The challenge these CIOs face is not so much whether they should invest in Big Data but what they should do with it. Large companies have complex landscapes that serve multiple LOBs, and all these LOBs have their own ideas about what they want to get out of Big Data. Most LOB executives are even more excited about the potential of Big Data but less informed about the upstream technical impact and the change of mindset that IT will have to go through to embrace it. But these LOBs do hold a stronger lever: money to spend, if they see that technology can help them accomplish something they could not accomplish before.

As more and more IT executives get excited about the potential of Big Data, they underestimate the challenge of getting access to meaningful data in a single repository. Data movement has been one of the most painful problems of a traditional BI system, and it stays that way for Big Data systems. A vast majority of companies have most of their data locked into on-premise systems. If the Big Data platform happens to be a cloud platform, it is not only inconvenient but actually impractical to move this data to the cloud for analysis. These companies also have a hybrid landscape where a subset of data resides in the cloud, inside some of the cloud solutions they use. It's even harder to get data out of these systems to move it to either a cloud-based or an on-premise Big Data platform. Most SaaS solutions are designed to support ad hoc point-to-point or hub-and-spoke RESTful integration, but they are not designed to efficiently dump data for external consumption.

Integrating semantics is yet another challenge. As organizations start to combine several data sources, the quality as well as the semantics of the data remain big challenges. Managing semantics for a single source isn't easy in itself; when you add multiple similar or dissimilar sources to the mix, the challenge is further amplified. It has been the job of the application layer to make sense of the underlying data, but when that layer goes away, the underlying semantics become more challenging.

If you're a vendor, you should think hard about the business value of your Big Data technology: not what it is to you, but what it could do for your customers. The spending pie for customers hasn't changed, and coming up with money to spend on (yet another) technology is quite a challenge. My humble opinion: vendors have to go beyond technology talk, understand the impact of Big Data and the magnitude of these challenges, and then educate customers on the potential and, especially, help them with a business case. I disagree with people who think Big Data is a technology play/sale. It is not.

Photo Courtesy: Kurtis Garbutt

Sunday, March 31, 2013

Strive For Precision Not Accuracy

Jake Porway, who was a data scientist at the New York Times R&D lab, has a great perspective on why multi-disciplinary teams are important to avoid bias and bring different perspectives to data analysis. He discusses a story where data gathered by Über in Oakland suggested that prostitution arrests increased in Oakland on Wednesdays, but increased arrests didn't necessarily imply increased crime. He also outlines the data analysis done by the Grameen Foundation, where the analysis of Ugandan farm workers could label the farmers "good" or "bad" depending on which perspective you consider. This story validates one more attribute of my point of view regarding data scientists: data scientists should be design thinkers. Working in a multi-disciplinary team where people champion their perspectives is one of the core tenets of design thinking.

One of Jake's viewpoints that I don't agree with:

"Any data scientist worth their salary will tell you that you should start with a question, NOT the data."

In many cases you don't even know what question to ask. Sometimes an anomaly or a pattern in the data tells a story, and this story informs us what questions we might ask. I do see many data scientists start with a known question and then pull in the data they need, but I advocate the other approach, where you bring in the sources and let the data tell you a story. Referring to design, Henry Ford once said, "Every object tells a story if you know how to read it." Listen to the data, a story, without any preconceived bias and see where it leads you.

You can only ask what you know to ask, and that limits your ability to unearth groundbreaking insights. Chasing a perfect answer to a perfect question is a trap that many data scientists fall into. In reality, what the business wants is a good-enough answer to a question, or an insight that is actionable. In most cases, getting to an answer that is 95% accurate requires little effort, but getting the remaining 5% takes exponentially disproportionate time with disproportionately low return.

Strive for precision, not accuracy. The first answer could well be of low precision. That's perfectly acceptable as long as you know what the precision is and can continuously refine it to make it good enough. Being able to rapidly iterate and reframe the question is far more important than knowing upfront what question to ask; data analysis is a journey, not a step in a process.

Photo credit: Mario Klingemann

Friday, March 15, 2013

We Got Hacked, Now What?

Hopefully you have a good answer for this. Getting hacked is no longer a distant probability; it's a harsh reality. The most recent incident was Evernote losing customer information, including email addresses and passwords, to a hacker. I'm an Evernote customer and watched the drama unfold from the perspective of an end user. I have no visibility into what level of security response planning Evernote had in place, but this is what I would encourage all critical services to have:


You are as secure as your weakest link; do anything and everything you can to prevent such incidents. This includes hardening your systems, educating employees on social engineering, and enforcing security policies. Broadly speaking there are two kinds of incidents: hijacking of specific accounts, and unauthorized access to a large set of data. Both could be devastating, and they need to be prevented differently. In the case of Evernote, they did turn on two-factor authentication, but that doesn't solve the problem of data being stolen from their systems. Google has done an outstanding job hardening their security to prevent account hijacking. Explore shared-secret options where partial data loss doesn't lead to compromised accounts.


If you do get hacked, is your system instrumented to respond to such an incident? That includes locking accounts down, taking critical systems offline, assessing the extent of the damage, and so on. In the case of Evernote, I found out about the breach from Twitter long before Evernote sent me an email asking me to change my password. This approach has a major flaw: if someone already had my password (it's hard to decrypt a salted and hashed value, but still), they could have logged in, changed the password, and had full access to my account. And this move, logging in and changing the password, wouldn't have raised any alarms on the Evernote side, since that's exactly what they would expect users to do. A pretty weak approach. A slightly better way would have been to force a password reset and then follow up with an email verification process before users could access the account.
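The reset-then-verify flow can be sketched in a few lines. This is a hypothetical in-memory version (a real service would use a database, rate limiting, and audit logging); the point is that only a hash of the token is stored server-side and the token is single-use and time-limited:

```python
import hashlib
import hmac
import secrets
import time

def issue_reset_token(store, user, ttl=3600):
    """Create a one-time reset token; only its hash is persisted."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode()).hexdigest()
    store[user] = (digest, time.time() + ttl)
    return token  # emailed to the user, never stored in clear text

def redeem_reset_token(store, user, token):
    """Verify the token, enforce expiry, and burn it after one use."""
    entry = store.pop(user, None)  # pop: a token can never be replayed
    if entry is None:
        return False
    digest, expires = entry
    presented = hashlib.sha256(token.encode()).hexdigest()
    return hmac.compare_digest(digest, presented) and time.time() < expires

store = {}
t = issue_reset_token(store, "alice@example.com")
print(redeem_reset_token(store, "alice@example.com", t))  # True
print(redeem_reset_token(store, "alice@example.com", t))  # False: single use
```

Because an attacker who already holds the old password never sees the emailed token, the "log in and change the password" loophole described above is closed.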


If accounts did get hacked and the hackers did get control over certain accounts and access to sensitive information, what would you do? It turns out companies don't have a good answer, or any answer, for this; they just wish such things won't happen to them. But that's no longer realistic. There have been horror stories of people losing access to their Google accounts. Such accounts are then used for malicious activities, such as emailing all your contacts asking them to wire money because you were supposedly robbed while traveling. Do you have a multi-disciplinary SWAT team (tech, support, and communication) identified for when you end up in such a situation? And, lastly, have you tested your security response? The impact of many catastrophes, natural or otherwise, such as floods, earthquakes, and terrorist attacks, can be reduced if people are prepared to anticipate and respond. Getting hacked is no different.

Photo courtesy: Daniele Margaroli

Thursday, February 28, 2013

A Data Scientist's View On Skills, Tools, And Attitude

I recently came across this interview (thanks, Dharini, for the link!) with Nick Chamandy, a statistician, a.k.a. a data scientist, at Google. I would encourage you to read it; it has some great points. I found the following snippets interesting:

Recruiting data scientists:
When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don’t want to miss out on a great candidate because he or she comes from a non-statistics background and doesn’t search for the right keyword. On my team alone, we have had successful “statisticians” with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.
I share the same view. The best data scientists I have met are not statisticians by academic training. They are domain experts and design thinkers, and they all share one common trait: they love data! When people ask how they might build a team of data scientists, I highly recommend looking beyond traditional wisdom. You will be in good shape as long as you don't end up in a situation like this :-)

The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or python scripts for chaining together data aggregation and analysis steps into “pipelines.”
Most companies won't have the kind of highly skilled development army that Google has, but then not all companies have Google-scale problems to deal with. I suggest two things: a) build a very strong community of data scientists using social tools so that they can collaborate on the challenges and tools they use, and b) make sure the chief data scientist (if you have one) has a very high level of management buy-in to make things happen; otherwise he or she will spend all their time in "alignment" meetings as opposed to doing the real work.

Data preparation:
There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data.
I disagree. I strongly believe the tools need to evolve to do some of these things, and data scientists should not be spending their time compensating for the inefficiencies of the tools. Becoming intimate with the data (having empathy for the problem) is certainly a necessity, but spending time on pulling, fixing, and aggregating data is not the best use of their time.

To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one’s attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics.
As I would say: bring tools and knowledge, but leave bias and expectations aside. The best data scientists are the ones who are passionate about data, can quickly learn a new domain, and are willing to make and fail and fail and make.

Image courtesy: xkcd

Friday, February 15, 2013

Commoditizing Data Science

My ongoing conversations with several people continue to reaffirm my belief that data science is still perceived to be a sacred discipline, and data scientists are perceived to be highly skilled statisticians who walk around wearing white lab coats. The best data scientists are not the ones who know the most about data; they are the ones flexible enough to take on any domain with the curiosity to unearth insights. Apparently this is not well understood. There are two parts to data science: the domain and the algorithms; in other words, knowledge about the problem and knowledge about how to solve it.

One of the main aspects of Big Data that I get excited about is an opportunity to commoditize this data science—the how—by making it mainstream.

The rise of interest in Big Data platforms, driven by disruptive technology and the desire to do something interesting with data, opens up opportunities to implement some of these well-known algorithms so that they are easy to execute without a performance penalty. Run k-means if you want, and if you don't like the result, run Bayesian linear regression or something else. Access to algorithms should not be limited to the "scientists"; anyone who wants to look at their data to know the unknown should be able to execute those algorithms without sophisticated training, experience, or skills. You don't have to be a statistician to find the standard deviation of a data set. Do you really have to be a statistician to run a classification algorithm?
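To illustrate how small the barrier really is, here is a complete classification algorithm (1-nearest-neighbor, on made-up data) in about a dozen lines; with a library like scikit-learn it shrinks to a couple of calls:

```python
def nearest_neighbor(train, point):
    """1-NN: classify a point by the label of its closest training example."""
    label, _ = min(
        (
            (lbl, sum((a - b) ** 2 for a, b in zip(feats, point)))
            for feats, lbl in train
        ),
        key=lambda pair: pair[1],  # smallest squared distance wins
    )
    return label

# Hypothetical labeled users: ((logins/week, purchases/month), segment).
train = [((1, 1), "casual"), ((1, 2), "casual"), ((9, 8), "power"), ((8, 9), "power")]
print(nearest_neighbor(train, (2, 1)))  # → casual
print(nearest_neighbor(train, (9, 9)))  # → power
```

Nothing here requires a statistics degree, which is exactly the point: the "how" can be packaged so that anyone curious about their data can run it.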

Data science should not be a sacred discipline, and data scientists shouldn't be treated as practitioners of voodoo.

There should not be a performance penalty, or upfront hesitation about what to do with the data. People should be able to iterate as fast as possible to get to the result they want without worrying about how to set up a "data experiment." Data scientists should be design thinkers.

So, what about traditional data scientists? What will they do?

I expect people who are "scientists" in the traditional sense to elevate themselves in their Maslow's hierarchy by focusing on more advanced aspects of data science and machine learning, such as designing tools that recommend algorithms that might fit the data (we have already witnessed this trend in visualization). There's also significant potential to invent new algorithms based on existing machine learning algorithms that have been in existence for a while. Which algorithms to execute when could still be a science to some extent, but that's what data scientists should focus on, not sampling, preparing, and waiting for hours to analyze their data sets. We finally have Big Data for that.

Image courtesy: scikit-learn

Thursday, January 31, 2013

Empathize Not Sympathize

Many enterprise software vendors sympathize: "We know it's a bad experience," or "We will fix the usability." One of the reasons the software is not usable is that the makers never had any empathy for the end users. In many cases the makers didn't even know who their end users were; they only knew who would buy the software. As far as enterprise software is concerned, the people who write checks don't use the software, and the people who use the software don't write checks and have little or no influence on what gets bought. Though the dynamics are now changing.

Usability is the last step; it's about making software usable for the tasks it is designed for. It's not useful at all when the software is designed to solve the wrong problem. Perfectly usable software can be completely useless.

It's the job of the product manager, designer, and developer to assess end-user needs (to have empathy for them) and then design software that meets or exceeds those needs in a way that is usable. That way they don't have to sympathize later on.

Design thinking encourages people to stay in the problem space for a longer duration without jumping to a solution. What problem is being solved (needs) is far more important than how it is solved (usability). Next time you hear someone say software is not usable, ask whether it's the what or the how. The how part is relatively easy to fix; the what part is not. For fixing the "what" you need empathy for your end users, not sympathy.

Wednesday, January 16, 2013

A Journey From SQL to NoSQL to NewSQL

Two years ago I wrote that the primary challenge with NoSQL is that it's not SQL. SQL has played a huge role in making relational databases popular for the last forty years or so. Whenever developers wanted to design an(y) application, they put an RDBMS underneath and used SQL from all possible layers. Over time, RDBMSs grew in functions and features, such as binary storage, faster access, clustering, and sophisticated access control, and applications reaped these benefits. The traditional RDBMS became a poor fit for cloud-scale applications that fundamentally required scale at a whole different level. Traditional RDBMSs could not support this scale, and even if they could, they became prohibitively expensive for developers to use. Traditional RDBMSs also became too restrictive due to strict upfront schema requirements that are not suitable for modern large-scale consumer web and mobile applications. For these two primary reasons, and many others, we saw the rise of NoSQL. The cloud movement further fueled this growth, and we started to see a variety of NoSQL offerings.

Each NoSQL store is unique in how a programmer accesses it. NoSQL did solve the scalability and flexibility problems of a traditional database, but it introduced a set of new problems, the primary ones being the lack of ubiquitous access and of consistency options, especially for OLTP workloads on schema-less data stores.

This has now led to the NewSQL movement (a term initially coined by Matt Aslett in 2011), whose working definition is: "NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional single-node database system." NewSQL's focus appears to be on gaining performance and scalability for OLTP workloads by supporting SQL as well as custom programming models, and on eliminating cumbersome, error-prone management tasks such as manual sharding, without breaking the bank. It's a good first step toward a scalable distributed database that supports SQL. It doesn't say anything about mixed OLTP and OLAP workloads, which are one of the biggest challenges for organizations that want to embrace Big Data.
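For readers who haven't had to do it, manual sharding is exactly the kind of application-level plumbing NewSQL aims to make unnecessary. A minimal sketch of hash-based routing, with hypothetical shard names:

```python
import hashlib

def shard_for(key, shards):
    """Hash-based routing: the application has to know the shard topology."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

shards = ["db-0", "db-1", "db-2"]  # hypothetical shard names
print(shard_for("order:12345", shards))
```

Every query now has to route through `shard_for`, cross-shard joins become the application's problem, and adding a fourth shard remaps most keys; pushing all of this below the SQL interface is precisely the NewSQL pitch.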

From SQL to NoSQL to NewSQL, one thing that is common: SQL.

Let's not underestimate the power of a simple non-procedural language such as SQL. I believe programmers should focus on the what (non-procedural, such as SQL) and not the how. Exposing the "how" invariably ends up making the system harder to learn and harder to use. Hadoop is a great example of this phenomenon. Even though Hadoop has seen widespread adoption, it's still limited to silos within organizations. You won't find a large number of applications written exclusively for Hadoop. Developers first have to learn how to structure and organize data in a way that makes sense for Hadoop, and then write extensive procedural logic to operate on that dataset. Hive is an effort to simplify a lot of these steps, but it still hasn't gained the desired popularity. The lesson here for the NewSQL vendors: don't expose the internals to application developers. Let a few developers who are closer to the database deal with storing and configuring the data, but provide easy, ubiquitous access to application developers. Enterprise software is all about SQL. Embracing, extending, and augmenting SQL is a smart thing to do. I expect the vendors to converge somewhere. This is how the RDBMS and SQL grew: the initial RDBMSs were far from perfect, but SQL always worked, and the RDBMSs eventually got better.
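The "what, not how" point is easy to see with SQL itself. Using Python's built-in sqlite3 module as a stand-in and made-up sales rows, one declarative statement replaces the grouping and aggregation logic a developer would otherwise hand-code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# The "what": one declarative statement; storage and access strategy
# are left entirely to the engine.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('east', 150.0), ('west', 250.0)]
```

The same aggregation in a MapReduce-style program would require the developer to spell out partitioning, shuffling, and combining by hand, which is exactly the "how" that keeps such systems siloed.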

Distributed databases are just one part of the bigger puzzle. Enterprise software is more about mixing OLAP and OLTP workloads; this is the biggest challenge. SQL skills and tools are highly prevalent in this ecosystem, and, more importantly, people have a SQL mindset that is much harder to change. The challenge for vendors is to keep this abstraction intact and extend it without exposing the underlying architectural decisions to the end users.

The challenge that I threw out a couple of years back was:

"Design a data store that has ubiquitous interface for the application developers and is independent of consistency models, upfront data modeling (schema), and access algorithms. As a developer you start storing, accessing, and manipulating the information treating everything underneath as a service. As a data store provider you would gather upstream application and content metadata to configure, optimize, and localize your data store to provide ubiquitous experience to the developers. As an ecosystem partner you would plug-in your hot-swappable modules into the data stores that are designed to meet the specific data access and optimization needs of the applications."

We are not there yet, but I do see signs of convergence. As a Big Data enthusiast, I love this energy. Curt Monash has started his year blogging about NewSQL. I have blogged about a couple of NewSQL vendors, NimbusDB (NuoDB) and GenieDB, in the past, and I have also discussed the challenges with OLAP workloads in the cloud due to their I/O-intensive nature. I am hoping that NewSQL will be inclusive of OLAP and keep SQL its first priority. The industry is finally on to something, and some of these start-ups are set out to disrupt in a big way.

Photo Courtesy: Liz