Adopting the Lean Startup Methodology with MATLAB to Develop Algorithms for Big Data Analysis
By Chetan Jadhav, New York Life Investment Management
The lean startup methodology was initially developed to help high-technology startups minimize the risk of failure by shortening development cycles and reducing capital expenditure. The goal is to rapidly create a prototype, or minimum viable product (MVP), using as few resources as possible. If the MVP shows promise, it can be turned into a marketable product. If not, the organization can try a different idea without having wasted much time or budget.
The same advantages that make lean startup an ideal fit for new, small companies also make it attractive to business units in established corporations. Often, these business units are also resource-constrained; few can afford to engage in a six-month development effort just to test out a new idea. When an idea can be validated in a matter of weeks by a single developer, however, the cost-benefit ratio changes, and many more potentially high-reward projects can be evaluated.
Recently, my colleagues and I at Credit Suisse used the lean startup approach with MATLAB® to develop an application for identifying prospective clients from big data, saving $300,000 in software vendor costs. MATLAB is well suited to lean startup: with it, we were able to develop prototype software and iteratively explore new ideas much faster than would have been possible with Java®, C, or another lower-level language.
From Waterfall to Lean Startup
Many organizations today still use the traditional waterfall approach for software development. Waterfall involves exhaustively defining requirements and then proceeding methodically through lengthy design, implementation, and verification phases. The process can take months or even years, and if the resulting product fails to meet customer needs, much of that effort is wasted.
In many ways, lean startup is the philosophical antithesis to the waterfall model. Organizations following a lean startup approach rapidly iterate through three short phases: build, measure, and learn (Figure 1). These phases provide many opportunities to course-correct and reduce the risk of failure.
In the build phase, an idea for a new business process or product is distilled to a measurable hypothesis (for example, “It is possible to identify prospective new customers from the personal and professional networks of existing customers.”). The team then builds an MVP, incorporating only the basic functionality needed to test the hypothesis.
In the measure phase, the hypothesis is tested using the results produced by the MVP. In the learn phase, the team makes a pivot-preserve-or-perish decision. If the hypothesis has not been validated and the team sees no viable way forward with the MVP, then the project perishes. If the hypothesis is partially validated, then the team may decide to pivot, making needed changes to the MVP and beginning the loop again. Finally, if the hypothesis is fully validated, then the team can preserve what they have started, for example, by developing a scalable, production version of the MVP.
Lean Startup with MATLAB: Identifying Client Prospects from Big Data
This project was based on the hypothesis that it is easier to acquire new customers from a pool of individuals who have personal and professional relationships with a firm’s existing clients than from a pool of individuals with no such relationship.
To test this hypothesis, our MVP was to be software that could match individuals in internal databases of clients and employees with individuals in external databases obtained from a previous project. These external databases did not provide a unique identifier, such as a social security number, for each individual. This meant that our first challenge was to develop algorithms that matched records in the internal databases with those in the external databases by comparing names and addresses.
A simple string comparison of the names and addresses would yield poor results due to common misspellings, abbreviations, and nicknames (“Will” for “William,” for example). We needed a fuzzy matching algorithm. We found third-party software designed for this purpose, but licensing costs were about $300,000 bi-annually, and we had no guarantee it would work. In the spirit of lean startup, we decided to build our own fuzzy matching algorithm with MATLAB.
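To see why exact comparison falls short, consider the short sketch below. The names and addresses are made-up examples, not records from the project; the point is only that MATLAB's built-in exact comparisons (`strcmp`, and the case-insensitive `strcmpi`) treat a nickname or an abbreviation as a complete mismatch.

```matlab
% Made-up example values; not data from the project.
sameName  = strcmp('William','William');                  % identical strings
exactName = strcmp('William','Will');                     % nickname
exactAddr = strcmpi('123 Main Street','123 Main St.');    % abbreviation

% Exact comparison gives an all-or-nothing answer, with no notion of
% "close enough" for nicknames, abbreviations, or misspellings.
fprintf('identical: %d, nickname: %d, abbreviation: %d\n', ...
        sameName, exactName, exactAddr);
```

A fuzzy measure replaces this yes-or-no answer with a graded similarity score, which is what the rest of the project relies on.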
The string comparison algorithm was based on the Jaro-Winkler distance measure, which was developed in part by William E. Winkler of the Bureau of the Census to link database records. On the Rochester Institute of Technology website, I located one version of the Jaro-Winkler algorithm implemented in about 120 lines of C code. I re-implemented the algorithm in MATLAB using fewer than 30 lines of code:
function jw = jarowinkler(s1,s2)
% jw = jarowinkler(s1,s2) calculates the Jaro-Winkler string distance between
% strings s1 and s2
s1 = lower(s1);
s2 = lower(s2);
s1l = length(s1);
s2l = length(s2);
s11 = repmat(s1,s2l,1);
s22 = repmat(s2',1,s1l);
% Calculate the matching distance
mtxwin = floor(max(length(s1),length(s2))/2)-1;
mtxmat = double(s11 == s22);
% Ignore the matched characters outside of matching distance
mtxmat = mtxmat - tril(mtxmat,-mtxwin-1) - triu(mtxmat,mtxwin+1);
% Ignore additional matches beyond first match
m = zeros(s2l,s1l);
for i=1:s1l
    if any(mtxmat(:,i))
        fmatid = find(~any(m,2) & mtxmat(:,i),1);
        m(fmatid,i) = mtxmat(fmatid,i);
    end
end
s1m = any(m);
s2m = any(m,2);
% Calculate the number of transpositions
t = sum(~mtxmat(sub2ind(size(m),find(s2m),find(s1m)')))/2;
% Calculate the number of matches
m = sum(s1m);
% Jaro distance
dj = (m/s1l + m/s2l + (m-t)/m)/3;
% Winkler modification
jw = dj + 0.1*(s1(1:3)==s2(1:3))*cumprod(s1(1:3)==s2(1:3))'*(1-dj);
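In formula form, the last two statements of the function compute the following, where m is the number of matched characters, t is half the number of matched characters that appear in a different order in the two strings, and ℓ is the length of the common prefix. This listing caps ℓ at 3 because it compares only the first three characters. The worked example uses the pair "martha" and "marhta", which is an illustrative choice and not data from the project.

```latex
d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right),
\qquad
d_{jw} = d_j + 0.1\,\ell\,(1 - d_j)

% Worked example: s_1 = \text{martha},\ s_2 = \text{marhta}
% m = 6,\ t = 1,\ \ell = 3
d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{5}{6}\right) = \frac{17}{18} \approx 0.944,
\qquad
d_{jw} = \frac{17}{18} + 0.3 \cdot \frac{1}{18} \approx 0.961
```

Note that, as written, the listing indexes `s1(1:3)` and `s2(1:3)` and divides by m, so it assumes both strings have at least three characters and share at least one matching character.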
The string comparison algorithm operated on name and address data stored in several large databases. The internal databases contained approximately 100,000 records, while the external databases held about 200,000. As a result, a full matching of all records required about 20 billion comparisons (100,000 × 200,000). To minimize database access times and eliminate the network overhead of client-server databases, I set up an SQLite database with local data storage. After loading all the available data into SQLite, I accessed it from within MATLAB using Database Toolbox™.
Improving the MVP with Build-Measure-Learn Iterations
I built the first version of the MVP using the Jaro-Winkler algorithm I implemented in MATLAB. The algorithm produced accurate results and effectively joined the internal and external databases, but it was slow, taking more than 16 hours to perform the 20 billion matching operations on a single processing core.
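As a rough check on these figures, the sketch below works out the comparison count and the implied throughput. It uses only the approximate numbers quoted in the text; because the run took more than 16 hours, the computed rate is an upper bound on what the first MVP achieved.

```matlab
% Approximate record counts quoted in the article.
nInternal = 100000;
nExternal = 200000;

% Every internal record is compared with every external record.
nPairs = nInternal * nExternal;            % 2e10, i.e. 20 billion

% The run took more than 16 hours, so this rate is an upper bound.
elapsedSeconds = 16 * 3600;
pairsPerSecond = nPairs / elapsedSeconds;

fprintf('%.0f comparisons, at most about %.0f per second\n', ...
        nPairs, pairsPerSecond);
```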
At this point, our MVP showed promise, so we decided to pivot and improve it. To speed up the MVP, I created a MATLAB MEX file in which I implemented a C component that performed the computationally expensive part of the Jaro-Winkler calculation.
Although our second MVP was faster than the first, it still had one drawback: it required manual steps to determine a key threshold setting used in the algorithm. Each Jaro-Winkler measurement produces a result between 0 and 1, with 0 representing no similarity between the strings and 1 representing an identical match. A pair of records is linked when its score reaches the threshold. In the first two versions of the MVP, we had to manually inspect the results to determine the best value for this threshold. When the threshold was set too high, the system missed legitimate matches, and when it was too low, it linked records that were not legitimate matches.
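The trade-off can be illustrated with a small sketch. The scores and the hand-assigned labels below are made up for illustration, and the variable names are hypothetical; the code simply counts, for three candidate thresholds, how many true matches are missed and how many false links are created.

```matlab
% Made-up similarity scores and hand-assigned labels (1 = true match).
scores  = [0.98 0.93 0.88 0.84 0.71 0.55];
isMatch = logical([1 1 1 0 0 0]);

thresholds = [0.95 0.86 0.60];             % too high, reasonable, too low
missed     = zeros(size(thresholds));
falseLinks = zeros(size(thresholds));

for k = 1:numel(thresholds)
    linked = scores >= thresholds(k);      % pairs the system would link
    missed(k)     = sum(isMatch & ~linked);   % true matches not linked
    falseLinks(k) = sum(~isMatch & linked);   % non-matches linked anyway
end

disp([thresholds; missed; falseLinks]);
```

With this toy data, the highest threshold misses two of the three true matches, the lowest one creates two false links, and the middle one separates the two groups cleanly. Real data rarely separates this neatly, which is why choosing the threshold by inspection was so laborious.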
The manual inspection was too time-consuming to use on databases with hundreds of thousands of records. To address this issue, I updated the MVP a second time, adding automatic classification based on the typographical error rates in the databases and chance matching rates. Working in MATLAB, I implemented a statistical mixture model (again based on William E. Winkler’s work), which we used to automatically identify thresholds for the string comparison algorithm.
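The general idea of letting a mixture model pick the threshold can be sketched as follows. This is a deliberately simplified stand-in and not the model used in the project: it fits a two-component Gaussian mixture directly to made-up similarity scores with a hand-written EM loop, whereas the approach described above is based on typographical error rates and chance matching rates. All data and variable names are hypothetical.

```matlab
% Made-up similarity scores for candidate record pairs.
scores = [0.30 0.35 0.40 0.42 0.45 0.50 0.55 0.90 0.93 0.95 0.97 0.99];

% Gaussian density, written out so that no toolbox is required.
gauss = @(x,mu,s) exp(-0.5*((x-mu)./s).^2) ./ (s*sqrt(2*pi));

% Initial guesses: component 1 = non-matches, component 2 = matches.
mu    = [min(scores) max(scores)];
sigma = [std(scores) std(scores)];
w     = [0.5 0.5];

for iter = 1:200
    % E-step: probability that each score belongs to the match component.
    p1 = w(1)*gauss(scores,mu(1),sigma(1));
    p2 = w(2)*gauss(scores,mu(2),sigma(2));
    r  = p2 ./ (p1 + p2);
    % M-step: re-estimate weights, means, and standard deviations.
    w     = [mean(1-r) mean(r)];
    mu    = [sum((1-r).*scores)/sum(1-r), sum(r.*scores)/sum(r)];
    sigma = [sqrt(sum((1-r).*(scores-mu(1)).^2)/sum(1-r)), ...
             sqrt(sum(r.*(scores-mu(2)).^2)/sum(r))];
    sigma = max(sigma,1e-3);               % guard against collapse
end

% Threshold: lowest score at which "match" is the more likely component.
x    = 0:0.001:1;
pm   = w(2)*gauss(x,mu(2),sigma(2));
pn   = w(1)*gauss(x,mu(1),sigma(1));
post = pm ./ (pm + pn);
threshold = x(find(post >= 0.5,1));
fprintf('Estimated threshold: %.3f\n', threshold);
```

The design choice here is that the threshold falls out of the fitted model (the point where a pair is as likely to be a match as a non-match) instead of being tuned by eye, so it can be recomputed automatically whenever the data changes.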
Altogether, the final MVP identified potential clients among the personal and professional networks of our existing clients, representing a potential new market of $1.7 trillion.
$300,000 and Months of Development Effort Saved
As the client prospect identification project demonstrates, the lean startup methodology with MATLAB enabled us to quickly explore new ideas at low cost and minimal risk. Instead of paying a vendor $300,000 for a packaged solution, we built our own solution in MATLAB. Because we were already using MATLAB for other projects, we incurred no additional cost.
I completed most of the development work myself. Total development time for all versions was about eight weeks. If I had used Java or C++ instead of MATLAB, the same project would have taken me six to eight months. In fact, without MATLAB I would not have dared to take this project on, because the time commitment and the risk of failure would have been too great.
Published 2017 - 93087v00