
Friday, November 1, 2013

(404)^n, the upshot of dirty data


The entire point of searching, locating, linking, retrieving, merging, reordering, indexing, and analyzing data originating in various data repositories (digital or otherwise) is to reduce uncertainty in order to make accurate, value-adding decisions. To the extent that data are "dirty" (riddled with errors), this objective is thwarted. Worse, the resulting datasets borne of such problematic inquiry then themselves frequently become source data for subsequent query, iteratively, recursively. Should you be on the receiving end of bad data manipulation, the consequences can range from the irritatingly trivial to the catastrophic. We all have our hair-pulling stories regarding the mistakes bequeathed us by those who sloppily muck about in our information and misinformation. I certainly have mine.
  • We recently sold our house in Las Vegas, clearing a nice six-figure net sum (we'd luckily bought ahead of the Vegas real estate Bubble in 2003, on a fixed 15-year note). We opted to have the funds wire-transferred to our bank account, and were told by the title company it would post within 24 hours. It most certainly did not. It took five days and repeated emails and phone calls to clear things up. The title company initially blew us off, claiming that they'd transferred the funds and gotten a confirmation number. Finally they had to admit that the transfer had been "kicked back" by the bank owing to a mismatch: the second "d" of my last name "Gladd" had been truncated to "Glad" by the system (yeah, right), and the transfer was refused.

    It was a clerical transcription error at the title company (the most common kind). They would not admit to that, but it had to be the case. We finally got our proceeds posted, after an anxious five days wondering whether our money would just disappear into the cyber-ether.

  • In early 2002 a Las Vegas Constable's deputy knocked on my door and served me with a divorce proceeding subpoena, one initiated by a Bronx lawyer whose client was a woman I'd never even met. They'd heard that the estranged husband had migrated to the southwest U.S., and, given that we had the nominal "exact" same name, "Robert E. Gladd," I had to be The Guy. My middle name is "Eugene," whereas this soon-to-be ex hubby's is "Edwin" or something (it's not clear). No matter; Counselor had his target and had set a court date. One wherein I might be found liable for court costs, fees, alimony, and/or division of community property.

    This lawyer came to know Bad Bobby. It cost me a lot of time and expense. He never admitted to having done anything wrong (but dropped the scheduled hearing). My letter to The Aggrieved Missus:
Dear Nxxxxx,

Your attorney has made a reckless mistake in identifying me as your estranged spouse—from whom you have filed for divorce—and then having me served with a Summons by the Bronx County Supreme Court (case # 3309/02) falsely naming me as the Defendant in your divorce petition. Enclosed is a copy of the letter I have sent via Certified USPS mail to both your attorney (Frank J. Giordano) and the Clerk of the Bronx Supreme Court.

Given that he is probably charging you for all of the expenses associated with this erroneous filing, he is wasting both your time and money. I’d be insisting that he incur any such expenses and move on to get it right.

I certainly wish you well and hope you can have your divorce granted as soon as possible in order to put the episode behind you, but if the process entails certifiably notifying the spouse/Defendant prior to the granting of your Decree, Sherlock J. Holmes has some additional investigatory work to complete, for I will not stand in as the surrogate.

I regret having to bother you with this. Be well.
  • My wife and my late daughter ("Sissy," her stepdaughter) have the same first name, "Cheryl." When we applied for a car loan, one of Sissy's delinquent medical lab bills originating in Atlanta (where we'd never lived nor had any medical encounters) showed up in my wife's credit bureau file. A puny $29 in arrears that had gone to collections. After a good bit of effort, we got it corrected. When we subsequently moved to Las Vegas, it popped up again, throwing sand in our mortgage application gears. "Cheryl Gladd." Good enough for these incompetent, indifferent people, notwithstanding that my wife has never taken nor used my last name -- and notwithstanding that the Social Security Number in the file was clearly not my wife's.

  • Trivially, last week I got my new auto insurance card from MetLife. They misspelled the street name of my new address.
Yeah, we all have our stories, don't we?

Bad data and HealthCare.gov:
Health insurers getting bad data from healthcare.gov
Summary: Insurance companies tell the Wall Street Journal that they are receiving erroneous application data from the troubled healthcare.gov site.


A story in the Wall Street Journal gives more detail on earlier reports that healthcare.gov, the federal health insurance exchange site created pursuant to the Patient Protection and Affordable Care Act (PPACA, also known as ObamaCare), is sending erroneous data to insurers. The implications could be serious for the applicants.

As the WSJ and other sources have reported, the front-end errors and delays in healthcare.gov have begun to subside. In the process they have exposed other problems.

Those few applicants who managed to complete the application process may consider themselves lucky. Insurance companies say that the data is still coming slowly, but even so they are being overburdened because of the frequent errors. The WSJ cited industry executives as saying that the enrollment data includes "duplicate enrollments, spouses reported as children, missing data fields and suspect eligibility determinations." One company also reported applications listing three spouses each.

The insurance companies must clean up the enrollment data. This is usually a manual process and, in some cases, impossible to do conclusively without further information. For example, the enrollment records are not time-stamped when they arrive at the insurer, so if two applications differ in some detail, it is unclear which is the correct one. Blue Cross & Blue Shield of Nebraska has hired temps to contact enrollees for clarification...

__
Errors propagate; they do not "cancel out." Uncontrolled, they metastasize, actually. This is "SPC 101" -- Statistical Process Control, the field wherein I cut my professional teeth within the walls of a forensic environmental radiation laboratory in Oak Ridge in the 1980s under the mentorship of James W. Dillard PhD (pdf). Consequently, I am an unmitigated, pedantic hardass when it comes to the "DQO" -- the Data Quality Objective.
The DQO Process, defined by the U.S. Environmental Protection Agency (EPA), is a series of planning steps to identify and design more efficient and timely data collection programs. The DQO process relies heavily on customer and supplier communication to define data requirements and acceptable levels of errors in decision making before major resources are expended on data collection and to assure the customer (whether internal or external) is satisfied with the results.
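By way of illustration, here's a minimal sketch (my own toy example, not EPA boilerplate; the field names and the 1% ceiling are hypothetical) of what a DQO-style acceptance gate might look like in Python: agree up front on what counts as a defective record and on the maximum tolerable defect rate, then check each batch against that agreement before it flows downstream.

    # Hypothetical DQO-style acceptance gate: customer and supplier agree in
    # advance on what counts as a defective record and on a maximum tolerable
    # defect rate, BEFORE any data change hands.

    MAX_DEFECT_RATE = 0.01  # agreed ceiling: at most 1% defective records per batch

    def is_defective(record):
        """Crude defect test for an enrollment-like record (illustrative only)."""
        required = ("last_name", "ssn", "dob")
        if any(not record.get(field) for field in required):
            return True                                # missing or empty required field
        ssn = record["ssn"].replace("-", "")
        return not (ssn.isdigit() and len(ssn) == 9)   # malformed SSN

    def accept_batch(records):
        """Return (accepted?, observed defect rate) for a batch of records."""
        defects = sum(is_defective(r) for r in records)
        rate = defects / len(records)
        return rate <= MAX_DEFECT_RATE, rate

    batch = [
        {"last_name": "Gladd", "ssn": "123-45-6789", "dob": "1946-02-01"},
        {"last_name": "Glad",  "ssn": "123-45-678",  "dob": ""},  # truncated SSN, missing DOB
    ]
    accepted, rate = accept_batch(batch)
    print(f"observed defect rate {rate:.0%} -> {'accept' if accepted else 'reject and send back'}")
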
Did HHS and its contractors do any of this? Do the feds suffer from IT best practices amnesia? The cavalier fashion with which IT people treat data is a continuing source of aggravation to me. I guess it's a narrower slice of the cavalier manner with which people in general treat the truth. One need look no further than the slovenly state of our political discourse of late.

But, that's another, larger matter, notwithstanding the tangential relevance. As the issue pertains to Health IT broadly and the current travails of HealthCare.gov specifically, recall my closing remarks from my prior post:
I've not heard much at all these past two weeks about the extent to which bad data in the various distributed databases comprising the under-the-hood guts of HealthCare.gov have contributed to this fiasco. Your programming logic and module interfaces may be airtight, but you cannot code your way out of bad data already resident in your far-flung RDBMS -- other than to write expensive, laborious remediation code that goes through the data repositories and rectifies the ID'd and suspected errors in the tables ("data scrubbing"). Problematic, that idea.
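To put a little flesh on the "data scrubbing" notion, here's a minimal sketch of the kind of remediation pass I mean -- normalize a few fields, then collapse the obvious duplicates. The field names and records are hypothetical; this illustrates the approach, not anything resembling the actual HealthCare.gov back end.

    import re

    def normalize(record):
        """Normalize a few fields so that trivially different duplicates collide."""
        return {
            "last_name": re.sub(r"[^a-z]", "", record.get("last_name", "").lower()),
            "ssn": re.sub(r"\D", "", record.get("ssn", "")),   # digits only
            "zip": record.get("zip", "").strip()[:5],          # 5-digit ZIP
        }

    def scrub(records):
        """De-duplicate on a normalized (ssn, last_name) key. The last record
        wins -- itself a guess, since (as noted above) these feeds carry no
        timestamps telling us which duplicate is authoritative."""
        seen = {}
        for rec in map(normalize, records):
            seen[(rec["ssn"], rec["last_name"])] = rec
        return list(seen.values())

    raw = [
        {"last_name": "Gladd", "ssn": "123-45-6789", "zip": "89117"},
        {"last_name": "GLADD", "ssn": "123456789",   "zip": "89117-1234"},  # same person, dirtier
    ]
    print(scrub(raw))   # one record survives
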
I have direct, long, and deep experience wrestling with the upshots of crap data. As I wrote eleven years ago during my banking tenure:
"I now work in revolving credit risk assessment (a privately-held issuer of VISA and MasterCard accounts), where our department has the endless and difficult task of trying to statistically separate the “goods” from the “bads” using data mining technology and modeling methods such as factor analysis, cluster analysis, general linear and logistic regression, CART analysis (Classification and Regression Tree) and related techniques.

Curiously, our youngest cardholder is 3.7 years of age (notwithstanding that the minimum contractual age is 18), the oldest 147. We have customers ostensibly earning $100,000 per month—odd, given that the median monthly (unverified self-reported) income is approximately $1,700 in our active portfolio.
 

Yeah. Mistakes. We spend a ton of time trying to clean up such exasperating and seemingly intractable errors. Beyond that, for example, we undertake a new in-house credit score modeling study and immediately find that roughly 4% of the account IDs we send to the credit bureau cannot be merged with their data (via Social Security numbers or name/address/phone links).

I guess we’re supposed to be comfortable with the remaining data because they matched up -- and for the most part look plausible. Notwithstanding that nearly everyone has their pet stories about credit bureau errors that gave them heartburn or worse.
 

In addition to credit risk modeling, an ongoing portion of my work involves cardholder transaction analysis and fraud detection. Here again the data quality problems are legion, often going beyond the usual keystroke data processing errors that plague all businesses. Individual point-of-sale events are sometimes posted multiple times, given the holes in the various external and internal data processing systems that fail to block exact dupes. Additionally, all customer purchase and cash advance transactions are tagged by the merchant processing vendor with a 4-digit “SIC code” (Standard Industrial Classification) categorizing the type of sale. These are routinely and persistently miscoded, often laughably. A car rental event might come back to us with a SIC code for “3532- Mining Machinery and Equipment”; booze purchases at state-run liquor stores are sometimes tagged “9311- Taxation and Monetary Policy”; a mundane convenience store purchase in the U.K. is seen as “9711- National Security”, and so forth.

Interestingly, we recently underwent training regarding our responsibilities pursuant to the Treasury Department’s FinCEN (Financial Crimes Enforcement Network) SAR program (Suspicious Activity Reports). The trainer made repeated soothing references to our blanket indemnification under this system, noting approvingly that we are not even required to substantiate a “good faith effort” in filing a SAR. In other words, we could file egregiously incorrect information that could cause an innocent customer a lot of grief, and we can’t be sued.
 

He accepted uncritically that this was a necessary and good idea."
See also my 2008 blog post Privacy and the 4th Amendment amid the "War on Terror."
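The absurdities in that old quote (3.7-year-old cardholders, $100,000-a-month self-reported incomes against a ~$1,700 median, liquor runs coded as "Taxation and Monetary Policy") are exactly what simple plausibility checks are for. A minimal sketch, with made-up thresholds:

    def plausibility_flags(account):
        """Return the reasons an account record looks suspect.
        Thresholds are illustrative, not anybody's production rules."""
        flags = []
        if not 18 <= account.get("age", 0) <= 110:
            flags.append("implausible age")
        if account.get("monthly_income", 0) > 50_000:        # vs. a ~$1,700 portfolio median
            flags.append("implausible self-reported income")
        if account.get("sic_code") in {"9311", "9711"}:      # taxation; national security
            flags.append("suspect SIC code for a consumer purchase")
        return flags

    print(plausibility_flags({"age": 3.7, "monthly_income": 100_000, "sic_code": "9711"}))
    # -> ['implausible age', 'implausible self-reported income',
    #     'suspect SIC code for a consumer purchase']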
__

As I watch the developments pertaining to the remedial activities pursuant to the woeful HealthCare.gov rollout, I am struck by the lack of media discourse on the likely impact of poor data quality on the performance of HealthCare.gov.

A couple of necessary definitions:
  • Accuracy: the extent to which a result maps to a known reference standard;
  • Precision: the extent to which results can be reproduced identically in repeated trials.
"Accuracy" and "precision" are not the same thing. You can be utterly "precise" and quite precisely wrong. Accuracy is consistently "hitting the bullseye."


Clustering your shots closely in an outer target quadrant would be "precise" but inaccurate.
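A quick numerical way to see the distinction, assuming a known "bullseye" of 100: one measurement process below is tightly clustered around the wrong value (precise but inaccurate), the other is centered on the true value but noisy (accurate but imprecise). Illustrative numbers only.

    import random
    import statistics

    random.seed(1)
    TRUE_VALUE = 100.0   # the known reference standard -- the "bullseye"

    # Precise but inaccurate: tightly clustered around the wrong value (biased).
    precise_wrong  = [110.0 + random.gauss(0, 0.5) for _ in range(1000)]
    # Accurate but imprecise: centered on the true value, but noisy.
    accurate_noisy = [TRUE_VALUE + random.gauss(0, 10.0) for _ in range(1000)]

    for label, xs in (("precise but wrong", precise_wrong),
                      ("accurate but noisy", accurate_noisy)):
        print(f"{label:20s} mean={statistics.mean(xs):7.2f}  stdev={statistics.stdev(xs):5.2f}")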

OK, back to my opening Photoshop metaphor.


This is a bit simplistic, but makes an opening point. Assume a set of fair dice. Assume further that "rolling snake eyes" (a pair of ones, my "404 error" proxies) is our analogy for finding incorrect data during an RDBMS search, i.e., a 1/36th (2.78%) probability. Equivalently, this would mean that the data in each database are 35/36ths, or 97.2%, "accurate." What, then, is the likelihood of encountering bad data during the course of a 10-database search?

Well, 1 minus 35/36ths raised to the tenth power -- the complement of going error-free ten times in a row (a straightforward application of multiplicative conjunctive probability for independent events).

Stick this expression in Google: 1-(35/36)^10=

You have roughly a 1 in 4 chance of encountering erroneous data in this thought experiment -- i.e., can you go ten searches without hitting the bad-data "snake eyes"?

The odds of avoiding it worsen with each additional database searched.

Now, commercial and government databases having only a 1/36th error rate (our "snake eyes") would be considered "highly accurate" (and rare).
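The same arithmetic in a few lines of Python, assuming independent searches and equally "accurate" databases (a generous assumption):

    def p_hit_bad_data(error_rate, n_searches):
        """Probability of hitting at least one error across n independent searches."""
        return 1 - (1 - error_rate) ** n_searches

    for n in (1, 5, 10, 20):
        print(f"{n:2d} database(s): {p_hit_bad_data(1/36, n):.1%}")
    # -> roughly 2.8%, 13.1%, 24.6%, and 43.1%, respectively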

Some literature:
Modeling Database Error Rates
Elizabeth Pierce, Indiana University of Pennsylvania, Data Quality, Sept 1997
How good are my data? This question is being asked more and more often as managers use data stored in databases for decision-making. Redman estimated that payroll record changes have a 1% error rate, billing records have a 2-7% error rate, and the error rate for credit records may be as high as 30%. In 1992, The Wall Street Journal reported that 25 of 50 information executives it surveyed believed their corporate information was less than 95% accurate. Almost all of them said that databases maintained by individual departments were not good enough to be used for important decisions. Knight reached a similar decision after surveying 501 corporations having annual sales of more than $20 million. Two-thirds of the Information Systems managers he polled reported data quality problems...
A Model of Error Propagation in Conjunctive Decisions and its Application to Database Quality Management
Irit Askira Gelman (DQIQ, USA)
Nearly every organization is plagued by bad data, which result in higher costs, angry customers, compromised decisions, and greater difficulty for the organization to align departments. The overall cost of poor data quality to businesses in the US has been estimated to be over 600 billion dollars a year (Eckerson, 2002), and the cost to individual organizations is believed to be 10%-20% of their revenues (Redman, 2004). Evidently, these estimates are not expected to dramatically improve anytime soon. A survey that covered a wide range of organizations in the US and several other countries showed that about half of the organizations had no plans for improving data quality in the future (Eckerson, 2002).

The low motivation of organizations to improve the quality of their data is often explained by the general difficulty of assessing the economic consequences of the quality factor (Eckerson, 2002; Redman, 2004). The economic aspect of data quality has been drawing a growing research interest in recent years. An understanding of this aspect can be crucial for convincing organizations to address the data quality issue. It can guide decisions on how much to invest in data quality and how to allocate limited organizational resources. The economics of data quality, however, is partly determined by the relationship between the quality of the data and the quality of the information that the information system outputs (Note that, in this paper, the term “data” will largely describe the raw, unprocessed input of an information system; the term “information” will mostly designate the output of the system). An increasing number of management information systems (MIS) studies have centered on this relationship, while parallel questions have been studied in numerous research areas (e.g., Condorcet, 1785; Cover, 1974; Clemen & Winkler, 1984; Grofman, Owen, & Feld 1983; Kuncheva & Whitaker, 2003). However, our grasp of the relationship between an information system’s data quality and its output information quality is still often limited...
None of this is exactly news. Below, a paper published in 1969.


See also Error Propagation in Distributed Databases. The issue has been rather extensively studied for decades. But, perhaps not at CGI Federal or QSSI.
__

Back to the Data Quality Objective

I worked in subprime credit risk modeling from 2000 to 2005 (my risk score project white paper: large pdf). We could be "wrong" 99% of the time as long as the 1% we got "right" paid for everything and turned a profit (which they did; the bank set new profitability records year after year across my entire tenure).

We bought pre-selected direct marketing prospect lists and launched massive direct mail, internet, and phone campaigns. We got about a 5% response rate (or, equivalently, a 95% "error" rate). We then culled people not making the initial booking criteria cut (more "errors"), and subsequently booked those passing muster. Many of those would go delinquent and eventually "charge off"  (yet again more "errors"). The tiny minority who proved profitable paid for the entire operation.
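A back-of-the-envelope version of that funnel, with made-up round numbers purely to show how a tiny profitable tail can carry the whole operation:

    # Hypothetical funnel figures -- illustrative only, not the bank's actual numbers.
    mailed     = 1_000_000
    responders = int(mailed * 0.05)        # ~5% response rate (a "95% error rate")
    booked     = int(responders * 0.60)    # survivors of the initial booking criteria
    charge_off = int(booked * 0.30)        # later go delinquent and charge off
    good       = booked - charge_off       # the tiny minority that pays for everything

    # Assumed unit economics: lifetime profit per good account, loss per
    # charge-off, and all-in campaign cost per mailed prospect.
    net = good * 900 - charge_off * 600 - mailed * 0.50
    print(f"good accounts: {good:,} ({good / mailed:.1%} of prospects), net ${net:,.0f}")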

Such a DQO would never suffice for criminal or anti-terror investigations.

Or HealthCare.gov.

Our decades-long indifference to data quality is now coming back to bite us ever more acutely as more and more bits and pieces of information about us get recorded in myriad RDBMS and then merged and mined for various and sundry purposes.

THE "ERROR CASCADE"

Error propagation increases the likelihood of the eventual "error cascade."
In medical jargon, an “error cascade” is something very specific: a series of escalating errors in diagnosis or treatment, each one amplifying the effect of the previous one. This is a well established term in the medical literature: this abstract is quite revealing about the context of use.
The principle goes beyond the escalation of erroneous dx and px/tx in medicine. To wit:


In general, see "Cascading Failure" in the Wiki.

Add to the foregoing the iterative/recursive nature of data mining and aggregation -- as I alluded to above -- and you end up with increased "data pollution." It risks a vicious downward spiral. "Big Data" enthusiasts might want to give this a bit more thought.
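A crude way to see the "data pollution" dynamic: simulate a record set that gets repeatedly merged with sources carrying their own error rates, while only a small share of the existing errors ever gets caught and fixed. The rates below are invented; the direction of travel is the point.

    def pollute(error_rate, merge_error_rate, correction_rate, rounds):
        """Yield the fraction of erroneous records after each merge/aggregation
        pass. Each round, the merge corrupts some clean records while only a
        small share of existing errors gets caught and fixed. All rates invented."""
        for _ in range(rounds):
            error_rate = (error_rate * (1 - correction_rate)        # surviving old errors
                          + (1 - error_rate) * merge_error_rate)    # newly introduced ones
            yield error_rate

    for i, rate in enumerate(pollute(0.03, 0.02, 0.05, 10), start=1):
        print(f"after merge {i:2d}: {rate:.1%} of records erroneous")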

FMEA CONSIDERATIONS

"Failure Mode and Effects Analysis" (FMEA). A fundamental of risk assessment and risk management. apropos of IT, data errors range from the inconsequential and easily remediable to the intractable and the (sometimes deadly) show-stopper.

The Social Security number (SSN, or "Social" in the jargon) is the closest data element we have to a unique "No Dupes, No Nulls Primary Key," but, as we know, the "No Dupes" part -- beyond errors -- is a joke. The IRS and Social Security Administration (SSA) know full well that Socials are used fraudulently, many times mapping to dozens of actual living (and deceased) persons (mostly for employment "verification" by illegal immigrants). SSA simply shrugs: "Not our job to vet SSNs to claimants." Taxes withheld from the pay of illegals simply (and conveniently, for the feds) pile up in what are known as "suspense accounts."

Consequently, we end up having to triangulate our way into "authentication" approximations via "multi-key" processes (think "Patient Locator Service"), processes vulnerable to all of the bad data liabilities I've alluded to in the foregoing.
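Here's a toy version of that multi-key triangulation -- score candidate matches on several noisy fields rather than trusting any single identifier outright. The field names, weights, threshold, and records are all fabricated for illustration; real record-locator services are far more elaborate, and they still inherit every upstream data error.

    from difflib import SequenceMatcher

    WEIGHTS = {"last_name": 0.4, "dob": 0.3, "zip": 0.2, "phone": 0.1}

    def similarity(a, b):
        """Crude string similarity in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_score(claimant, on_file, weights=WEIGHTS):
        """Weighted multi-field score; no single field (not even the SSN) is trusted outright."""
        return sum(w * similarity(claimant.get(f, ""), on_file.get(f, ""))
                   for f, w in weights.items())

    claimant = {"last_name": "Gladd", "dob": "1946-02-01", "zip": "89117", "phone": "7025551234"}
    on_file  = {"last_name": "Glad",  "dob": "1946-02-01", "zip": "89117", "phone": "7025551234"}

    score = match_score(claimant, on_file)
    print(f"match score {score:.2f}:", "probable match" if score > 0.85 else "kick out for human review")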

How much this crap data materially impacts the HealthCare.gov architecture and processes is unclear to me at this point. But the questions should be posed to HHS, its contractors, and all of the data providers involved. I didn't hear anything relating to this in the just-concluded House hearings, and I watched all of them. I will review the transcripts to see what I might have missed.

Stay tuned.
__

SATURDAY MORNING UPDATE

Everybody and her brother now seem to be chiming in with offers to "help fix the code" of HealthCare.gov:

As computer experts hired by the U.S. government scramble to fix the much-maligned healthcare.gov website, a corps of independent kibitzers is chiming in from around the world, publicizing coding flaws that they’ve discovered and offering suggestions for fixing them.

Much of the constructive criticism is coming from members of the “open source” community, a passionate but loose-knit group that advocates openness and collaboration as a means of writing better computer software. Their desire to help solve the federal government’s website woes in part stems from an early decision by the Department of Health and Human Services to make the healthcare.gov code available for examination – a promise that was never fully fulfilled...


Open-source advocates were excited when Health and Human Services CTO Bryan Sivak said this spring that the code for the site would be open for examination. But only the part of the front-end code produced by Development Seed was made available through GitHub, and that effort has been criticized by open-source advocates as incomplete.

Then, after the Oct. 1 launch of healthcare.gov, people started using the comments section to vent anger about the site’s usability rather than talking about the code itself. The repository was removed at the government’s request...


Matthew McCall, an open-source advocate who has been a Presidential Innovation Fellow, has posted a petition on the White House website asking the government to release all the source code written by CGI Federal. “It is believed that the enrollment issues with healthcare.gov are likely due to poor coding practices in components that are unavailable to the world's development community to evaluate,” the petition says. “Please release the code so we may help fix any found issues.”

By Thursday, however, the petition had fewer than 3,000 of the 100,000 signatures needed by Nov. 19 to gain a response from the Obama administration.


In the meantime, the public appears divided on whether the website is repairable. In an NBC News/Wall Street Journal poll taken over the weekend, 37 percent said these are short-term technical woes that can be fixed, while 31 percent believe they point to a longer-term issue with the law’s design that can’t be corrected, and 30 percent think it’s too soon to say.
“The only option is to fix it,” said Reed, who believes that starting over from the ground up, as some have suggested the government do, isn’t practical because of the amount of time that would take. “And the code is fixable. It’s not the worst code that I’ve ever seen.”
These white knights appear to be uniformly fixated on "the code" and HealthCare.gov website "performance." Not one word in this article points to data accuracy problems within the far-flung, multi-agency, multi-corporate RDBMS.

From a post on Sulia:


I've tried using the #ObamaCare web site, and it's appalling. After verifying my account, I can't log in. And it took two days to "verify" the account. It basically now takes me in an infinite loop when I try to log in. I may be verified. I may not be. Who knows? Certainly not their apparently misconfigured Oracle clusters.

The screen shot you see is the site speed grade given to it by Yahoo's Firebug Extension #YSlow, which is a web developer tool that helps measure site speed. They give it a D. This is for a web site where the developers were paid $88 million. I could have done this site myself with the help of a couple colleagues I know for a whole helluva lot less than that, and it wouldn't have taken 3 1/2 years.

My own web site, https://createamixer.com/, has a grade of B, and I'm the only developer, and I can tell you about 50 more things I need to do to optimize the site.

There are tons of things wrong with the healthcare site, based on what I've seen by inspecting the network traffic.

Let's start with the most basic stuff - why on earth do they bring in Facebook API calls? Facebook calls are notoriously slow. And they don't need a damn Like button, for God's sake. They are also talking to the #Twitter API. Really? Twitter? For a health care app? I'm going to want to Like the page and then Tweet about my experience? You better hope I don't spam twitter with my experiences.

They also seem to be trying to talk to a variety of other government agencies. Obviously I can't know all the details on why, but as a new web site they should be making other agencies talk to THEM through an API.

They also have tons of NON-minified CSS and JS. My God, people, this is web development 101.

There are also 77 static components that are not on CDN.

The list goes on, but that mostly just affects the browser experience. If you have a slow browser, whether because it is just old or you just happen to even have a lot of tabs open on a computer with not a lot of RAM, you are going to be in a world of hurt. Not to mention the strain on obamacare servers from having to pull 77 static components for millions of users EVERY TIME a user hits the site. Hello?

C.D.N. My God.

This is all just front end incompetence. I hate to guess what happens on the back end, which is harder to peer into.

Please, after this disaster, Mr. #Obama, please please post the code on GitHub. You're going to get thrashed for it, but do it anyway.
"This is all just front end incompetence. I hate to guess what happens on the back end, which is harder to peer into."

Ahhh... a waft of allusion to the RDBMS ("the back end").
__

MONDAY MORNING UPDATE

From The Washington Post:
What went wrong with HealthCare.gov
 

HealthCare.gov, built by 55 contractors, is one of the most complex pieces of software ever created for the federal government. It communicates in real time with at least 112 different computer systems across the country. In the first 10 days, it received 14.6 million unique visits, according to the Obama administration.

Look in particular at Steps 3 and 4. This is where the problem of bad data comes to the fore.

NOV 5TH UPDATE
Can it get worse? Obamacare website gives out SC man’s private information
November 5th, 2013, Michael Dorstewitz, BizPac Review


Because a South Carolina attorney wanted to shop around for cheaper health insurance, he unwittingly walked into a security breach nightmare. Now a North Carolina man has all the attorney’s private information and is unable to enter his own.

Elgin, S.C. attorney Thomas Dougall spent the evening a month ago browsing for insurance on the Affordable Care Act’s HealthCare.gov website. When he got home Friday night, he had a shocking voicemail message waiting for him, according to America Now News.

“I believe somehow the ACA — the Healthcare website — has sent me your information, is what it looks like,” said Justin Hadley, a North Carolina resident who could access Tom’s information on healthcare.gov. “I think there’s a problem with the wrong information getting to the wrong people.”

Hadley indicated that whenever he entered his own username and password he got Dougall’s personal information instead of his own...
Below, a poll at the bottom of this article. Secretary Sebelius has 25 days to see that this mess is corrected.

She doesn't have a lot of friends in the general public at this point.
___

More to come...

6 comments:

  1. Thanks for the post, Bobby. I learned a lot.

    What strikes me about this is how thoroughly the designers of the site set themselves up for failure by designing access to a dozen or so of these "dirty" databases right in the first step, and requiring that all the data match or the process would fail.

    It does seem, from a process point of view, that many simpler ways of establishing identity (ways considered adequate for paying taxes, for instance, or renting a car) could have been used, with the question of fraud left to enforcement action (as it is for taxes). Checking whether someone was a veteran, or a Native American, could have been left until people claimed some special status, instead of doing it for every single person.

    And they could have left the whole question of identity until later in the process, after people had "shopped," just like most shopping sites do.

  2. You're welcome, Joe. And thank you for commenting. Praise from YOU means a lot to me.

  3. Wow, I had no idea that dirty data was such a factor in this saga.

    Too bad...having worked with data while a research fellow, I certainly appreciate the importance of getting data collection right.
