The KHIT Blog: "Data Science?"

Monday, October 1, 2018

"Data Science?"

The latest fad? Last year it was profitably fashionable to add "crypto" and/or "blockchain" to one's resume or startup company name. I've alluded to the phrase "data science" in a number of prior posts, in the context of Health InfoTech. See, e.g., "Health IT: process mining and analytics for healthcare QI."

(BTW: Blockchain update.)

This (below) is a pretty good illustrative graphic of the subtopical components:

I have direct work experience in a number of these areas, but not in "machine learning" or "large scale distributed computing" (and I have some methodological concerns about the latter, which I will get to). "BPM" is "Business Process Management." What is here called "process mining," we called "operations analytics."

The allusion to "databases," one assumes, includes the critical subject of "database architectures." The heterogeneity of widely distributed "big data" (often of materially varying quality and pedigree) has to be a concern. In fairness, though, my waning programmer / database architect chops are pretty old-school: RDBMS work on in-house (e.g., local server) "structured data."

By "machine learning," I assume they include "artificial intelligence," "deep learning," and "natural language processing (NLP)."

I'm reading up.

Just getting started with these; stay tuned. Looking for clear, consistent definitions at the outset, for one thing.

From the MIT book:

1. What Is Data Science?

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. Many of the elements of data science have been developed in related fields such as machine learning and data mining. In fact, the terms data science, machine learning, and data mining are often used interchangeably. The commonality across these disciplines is a focus on improving decision making through the analysis of data. However, although data science borrows from these other fields, it is broader in scope. Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns from data. Data mining generally deals with the analysis of structured data and often implies an emphasis on commercial applications. Data science takes all of these considerations into account but also takes up other challenges, such as the capturing, cleaning, and transforming of unstructured social media and web data; the use of big-data technologies to store and process big, unstructured data sets; and questions related to data ethics and regulation...

Kelleher, John D. Data Science (MIT Press Essential Knowledge series). The MIT Press. Kindle Edition.

From the "AI Science" book:

What is Data Science?

Data science is multidisciplinary field that relies on scientific methods, statistics and algorithms to extract meaningful insights from data. At its core, data science is all about discovering useful patterns in data that can then be presented as information to tell a story or make informed decisions. It would be noticed that data science depends on techniques from a bunch of other fields such as computer science, mathematics, statistics and business analytics. It is common for data scientists to have skills across this range. Data science can be employed to derive insights from both small and large datasets and it is often a misconception that data science is only suited to so called big data.

Morgan, Peter. Data Science from Scratch with Python: Step-by-Step Guide (Kindle Locations 337-344). AI Sciences LLC. Kindle Edition.

OK. Their Venn diagram:

Another engrossing book that I'm way deep into at the moment, written by the AI eminence Judea Pearl.

This one is a total whack upside the head.

…We live in an era that presumes Big Data to be the solution to all our problems. Courses in “data science” are proliferating in our universities, and jobs for “data scientists” are lucrative in the companies that participate in the “data economy.” But I hope with this book to convince you that data are profoundly dumb. Data can tell you that the people who took a medicine recovered faster than those who did not take it, but they can’t tell you why. Maybe those who took the medicine did so because they could afford it and would have recovered just as fast without it.

Over and over again, in science and in business, we see situations where mere data aren’t enough. Most big-data enthusiasts, while somewhat aware of these limitations, continue the chase after data-centric intelligence, as if we were still in the Prohibition era.

As I mentioned earlier, things have changed dramatically in the past three decades. Nowadays, thanks to carefully crafted causal models, contemporary scientists can address problems that would have once been considered unsolvable or even beyond the pale of scientific inquiry. For example, only a hundred years ago, the question of whether cigarette smoking causes a health hazard would have been considered unscientific. The mere mention of the words “cause” or “effect” would create a storm of objections in any reputable statistical journal.

Even two decades ago, asking a statistician a question like “Was it the aspirin that stopped my headache?” would have been like asking if he believed in voodoo. To quote an esteemed colleague of mine, it would be “more of a cocktail conversation topic than a scientific inquiry.” But today, epidemiologists, social scientists, computer scientists, and at least some enlightened economists and statisticians pose such questions routinely and answer them with mathematical precision. To me, this change is nothing short of a revolution. I dare to call it the Causal Revolution, a scientific shakeup that embraces rather than denies our innate cognitive gift of understanding cause and effect.

Pearl, Judea. The Book of Why: The New Science of Cause and Effect (pp. 6-7). Basic Books. Kindle Edition.

"If I could sum up the message of this book in one pithy phrase, it would be that you are smarter than your data. Data do not understand causes and effects; humans do." [pg. 21]

So much for the liturgy of "Data-Driven."
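
Pearl's medicine example is easy to make concrete. Here is a minimal Python sketch, with numbers I made up purely for illustration: a hypothetical cohort in which the drug does nothing at all and recovery time depends only on whether the patient is wealthy. It assumes wealth is the only confounder, which real observational data never guarantees.

```python
from collections import defaultdict

# Hypothetical cohort: (wealthy, took_medicine, n_people, mean_recovery_days).
# By construction the medicine does nothing: recovery depends on wealth alone.
COHORT = [
    (True,  True,  800, 5.0),
    (True,  False, 200, 5.0),
    (False, True,  200, 9.0),
    (False, False, 800, 9.0),
]

def mean_days(rows):
    """Weighted mean recovery time over a list of cohort rows."""
    total_people = sum(n for _, _, n, _ in rows)
    return sum(n * days for _, _, n, days in rows) / total_people

def naive_difference(rows):
    """Treated minus untreated mean recovery time, ignoring wealth."""
    treated = [r for r in rows if r[1]]
    untreated = [r for r in rows if not r[1]]
    return mean_days(treated) - mean_days(untreated)

def adjusted_difference(rows):
    """Compare treated vs. untreated within each wealth stratum, then
    average the per-stratum differences weighted by stratum size."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[0]].append(row)
    total_people = sum(r[2] for r in rows)
    effect = 0.0
    for stratum_rows in strata.values():
        stratum_size = sum(r[2] for r in stratum_rows)
        effect += naive_difference(stratum_rows) * stratum_size / total_people
    return effect

if __name__ == "__main__":
    print(f"naive difference:    {naive_difference(COHORT):+.1f} days")
    print(f"adjusted difference: {adjusted_difference(COHORT):+.1f} days")
```

The naive comparison makes the drug look like it shaves about 2.4 days off recovery. Compare like with like within each wealth stratum and the "effect" is zero. Nothing in the table itself says to stratify on wealth; that knowledge has to come from a causal model.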

Among numerous other virtues, The Book of Why provides the best explication of Bayesian Networks I've ever read. I'm already long up to speed on applications of Bayes' Theorem ("base rates matter"), but Pearl's Bayesian Networks stuff is off the hook, and foundational to his compelling argument.
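
For anyone who wants the "base rates matter" point spelled out, here is a minimal sketch of Bayes' Theorem applied to a screening test. The function name and the prevalence, sensitivity, and specificity figures are my own hypothetical choices, not taken from Pearl's book.

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' Theorem.

    prior:       base rate of the condition in the tested population
    sensitivity: P(positive | condition)
    specificity: P(negative | no condition)
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)

if __name__ == "__main__":
    # Same hypothetical test, two very different base rates.
    for base_rate in (0.01, 0.30):
        p = posterior(base_rate, sensitivity=0.90, specificity=0.91)
        print(f"base rate {base_rate:.0%}: P(condition | positive) = {p:.1%}")
```

With a 1% base rate, a positive result from this fairly good test still leaves the odds at roughly 9%; at a 30% base rate the same result puts it above 80%. Same evidence, very different conclusions.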

UPDATE

Michael Lewis' new book is out. I read it all immediately.

…in the space of a few years, the interest in data analysis went from curiosity to fad. The fetish for data overran everything from political campaigns to the management of baseball teams. Inside LinkedIn, DJ presided over an explosion of job titles that described similar tasks: analyst, business analyst, data analyst, research sci. The people in human resources complained to him that the company had too many data-related job titles. The company was about to go public, and they wanted to clean up the organization chart. To that end DJ sat down with his counterpart at Facebook, who was dealing with the same problem. What could they call all these data people? “Data scientist,” his Facebook friend suggested. “We weren’t trying to create a new field or anything, just trying to get HR off our backs,” said DJ. He replaced the job titles for some openings with “data scientist.” To his surprise, the number of applicants for the jobs skyrocketed. “Data scientists” were what people wanted to be.

Lewis, Michael. The Fifth Risk (pp. 157-158). W. W. Norton & Company. Kindle Edition.

A compelling, albeit by turns depressing and infuriating, read. Highly recommended.
___

"DATA SCIENCE," STANFORD IS ON IT

sdsi.stanford.edu

I saw a presentation about this stuff given by Stanford's Carlos Bustamante last December during the Health 2.0 Technology for Precision Health conference.

From the SDSI website:

Science of Data Science

Science is experiencing simultaneous challenges and opportunities at an unprecedented rate:

From new sources of data, especially in large quantity and unconventional structure, often from “non-scientific” sources, such as social media;

From new algorithmic techniques potentially expanding greatly the ability to reason from data but whose interpretation, validity and fairness can not be established by our current statistical and computational techniques;

From the crucial need for scientifically valid advice on questions of the greatest importance to the future of society, of life and of the earth itself---advice that must be effectively communicated to society.

In all of these, data science is clearly central. Recent computational, statistical and other research has been of great value. Much more needs to be done, however, and with a sense of urgency.

Validity of algorithmic inferences:
Algorithmic techniques to infer patterns and structure have had exceptional success recently in many areas of practical value. They can also be important, even revolutionary, for science in many areas. Data as divergent as social media interactions on one hand and satellite or drone images on the other may provide vital results through such algorithms.

However, the scientific validity of the results can not be assumed. Conventional concepts such as random sampling of the intended population are rarely relevant. A deeper understanding of the data sources and the computations applied will be essential.

Fairness of algorithmic decisions:
Beyond the scientific validity of inferences, the use of algorithmic results to recommend practical actions raises important questions of fairness and equitable treatment. Data science needs to search for valid notions of fairness, to ensure that the results of analysis and the data-based algorithms using them are fair to all demographic and other cohorts.

Privacy and the public interest:
Huge quantities of data exist for individuals, through social media, other internet activities and databases of medical, governmental, employment and commercial records. Computational and statistical techniques are needed that satisfy both the right to privacy and society’s need to deal with important questions. Progress has been made with new approaches such as differential privacy and distributed inference on private data. Much more needs to be done given the increasing attraction of mining such data sources, with the potential risks to individual rights.

Causality:
Some of the richest sources of extensive data for scientific study are observational (“non-randomized”) data bases made available by the explosion of technology (the internet and digital records in medicine, government and business). Naive application of inferential techniques to infer causal mechanisms will be seriously misleading on such data, potentially with disastrously mistaken conclusions. Research in new statistical and computational techniques to adjust for such data sources is needed.

The reproducibility crisis:
Repeated and often highly visible incidents have highlighted failures to reproduce “scientific” conclusions; for example, frequent editorials in prestigious journals such as Science and Nature have documented and apologized for many failures to reproduce published results.

Issues of scientific and academic culture are undoubtedly part of the problem. However, the radical changes in sources of data and algorithms applied mean that the practice of data analysis has changed enormously. Data science needs to find new inferential paradigms that allow data exploration prior to the formulation of hypotheses.
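
The "Privacy and the public interest" item above mentions differential privacy in passing. As a rough sketch of the idea (my illustration, not anything from the SDSI site), the classic Laplace mechanism answers a counting query by adding noise scaled to 1/epsilon, so that any one person's presence or absence changes the output distribution only slightly. The function names, the epsilon values, and the patient count below are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(2018)  # seeded only so the demo is repeatable
    true_count = 120           # hypothetical number of patients with a diagnosis
    for epsilon in (0.1, 1.0):
        print(epsilon, round(private_count(true_count, epsilon, rng), 1))
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means a more accurate answer and a weaker guarantee. Picking that trade-off is a policy question as much as a technical one.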

SDSI on Data Science in the health care space:

Data Science for Human Health
It is clear that data science will be a driving force in transitioning the world’s healthcare systems from reactive “sick-based” care to proactive, preventive care.

First, and most importantly, data science has the power to empower the consumer, giving them more control over their own care. People can make better, more informed decisions if their care providers are able to make better, more data-based recommendations. Imagine your care provider could access your genetic information in a proactive healthcare system, measure your genetic risk for disease—not just as an individual but also as a member of a larger population—and then help you manage that risk throughout your life course.

This is the kind of personalized, patient-focused medicine that current reactive healthcare systems cannot facilitate, because they are designed to wait until things go wrong with the human body before addressing the problem, and every individual is deemed responsible for managing his/her own health and risk. In a data-based proactive healthcare system, public education could inform people of what it means to have different levels of risk. Since we all carry some level of risk (some more than others for specific diseases), individuals could be informed of their individual and collective health risks early on, enhancing control over their own health at every stage of their lifespan.

Second, data science enables more cost-effective drug discovery, helping us do the right thing for the right person. Rather than have someone trying and failing ten different drugs at great expense to the individual and the acute-based care system (not to mention worsening quality of life for the patient), data science can help us choose the right one on the first try. Although that drug in isolation is more expensive for the system, it would have been even more expensive if we didn’t have data science because that person would have had ten different things tried and failed. Additionally, data science allows us to bring things to market more quickly, because we’re not beholden to the hypothesis-driven routine.

Third, data science technologies are capable of improving patient outcomes and conditions with variable outcomes. They can capture data inputs, weed out subtypes, and distill best practices when combating disease, such as brain or other neurological cancers.

Lastly, data science technology can also reconfigure the costs associated with delivery of care by utilizing continuous data capture, analytics, and new key insights in order to inform physicians and clinicians when things have gone wrong in the human body before patients feel unwell. That understanding could then be integrated into a new model of care, which would enable early intervention, thus preventing that individual from having to go to the hospital. Recent Stanford research has begun to explore the possibilities of monitoring cardiomyopathy patients at home and monitoring children in the ER and ICU: we believe these studies are leading us toward a future of proactive, consumer-based care.

We recognize fully that technological advancement and unprecedented growth in biomedical data have created great opportunities, but they have also introduced great challenges for protecting the privacy and security of patient and other research data. We must work with stakeholders and experts in the private sector and federal agencies, such as the NIH, to promote and practice robust and proactive information-security procedures to ensure appropriate stewardship of patient and research-participant data while at the same time enabling scientific and medical advances.

Highly recommend you read all of their topical domain info.

"Ethics and Data Science?" Yeah, I'm gonna get there too. For one thing, I gotta get around to evaluating this (below).

_____________

More to come...

Another important read (pdf)

I love this kind of stuff. It sustains and humbles me. "As politicians, advertisers, salesmen, and propagandists for various political, economic, moral, religious, psychic, environmental, dietary, and artistic doctrinaire positions know only too well, fallible human minds are easily tricked, by clever verbiage... Common language—or at least, the English language—has an almost universal tendency to disguise epistemological statements by putting them into a grammatical form which suggests to the unwary an ontological statement. A major source of error in current probability theory arises from an unthinking failure to perceive this."

Quotes

"An economist is a person who sees something that works in practice and tries to figure out whether it will work in theory."

- J.D. Kleinke, medical economist
___

"The only person who enjoys change is a baby with a wet diaper."

"Every misspent dollar in our health care system is part of somebody's paycheck."

- Brent James, M.D., M.Stat

“We could do healthcare, at markedly higher quality, for everyone in this country, without rationing or denying anybody the care that they need, without having the government dictate how doctors practice or whether hospitals could expand, at half the cost we do it now.”

- Health Care Futurist Joe Flower

"Most of the sciences, unlike parts of medical science, are not concerned with the impossible. There is not complementary and alternative physics, or chemistry, or biochemistry, or engineering. These disciplines compare their ideas against reality, and, if the ideas are found wanting, abandoned."

- Mark A. Crislip, MD

"Q: How much alcohol is too much?
A: More than your doctor drinks."

- a physician I once heard speak during a CME presentation

“Just because science doesn’t know everything, doesn’t mean you get to fill in the gaps with whatever fairy tale most appeals to you.”

- Dara O’Briain

'[I]t is one small step from using the computer for "helping" doctors to monitoring them, judging them, dictating to them what to do, and withdrawing payment for computer non-compliance. The use of computer data is a multi-edged sword. It can be used for the "good," facilitating diagnosis and treatment and making it more accurate and up-to-date, and for “evil,” invading privacy, inviting security breaches, and making decisions based on the opinions of remote authorities rather than those present at the patient-doctor encounter.'

- Richard Reece, MD

“[T]here ARE statistics which are non-political. Just because The Washington Post/Fox News reports the temperature is 75 degrees doesn’t mean it’s really snowing and sunscreen is a liberal/conservative plot. Even if you earn a living being ideological.”

- Michael L. Millenson

"It is generally a fairly convincing argument that people shouldn’t have to be subsidized to undertake a change which is in their best interest.

The reconciliation seems to be that EHR is not supposed to make a doctor’s practice more efficient and higher quality. It is supposed to make the system of care more efficient and higher quality, which is not the same thing. Those of you who took calc recall that maximizing the total of variables is not achieved by maximizing any one variable and this is a perfect example of that.

Those of you who have served in combat certainly noticed that too — if everyone works as a team the unit takes fewer casualties. If you try to save your own hide, you might, but at the expense of more casualties overall."

- Al Lewis

"There are two ideas to keep in mind about Bayesian reasoning and how we tend to mess it up. The first is that base rates matter, even in the presence of evidence about the case at hand. This is often not intuitively obvious. The second is that intuitive impressions of the diagnosticity of evidence are often exaggerated."

- Daniel Kahneman, "Thinking, Fast and Slow"

"Physicians apply advanced scientific knowledge, but they must do so without the favorable conditions that experimental scientists create for themselves. Multitasking is forced on physicians, often in chaotic environments and under severe time and resource constraints."

- Lawrence and Lincoln Weed, "Medicine in Denial"

"It’s time to stop the whining about Obama care and acknowledge we already have universal health care. We just pay for it in the stupidest way possible that ensures problems are that much more disastrous and complicated when they’re finally treated."

- Mark Hoofnagle, MD, PhD

"Every act of conscious learning requires the willingness to suffer an injury to one's self-esteem. That is why young children, before they are aware of their own self-importance, learn so easily."

- Thomas Szasz, MD
___

"Of course, one reason that process metrics* are so popular is that processes are much easier to define and measure than outcomes."

- The Skeptical Scalpel
___

"There is an “illusion of validity” for any random data point, a seductive sense that is colored by what we hope will be true. Mountains of pharmaceutical claims are often made from mere molehills of data."

- Danielle Ofri, MD
___

"Joy empowers people. It is a source of energy that enables people to hope and plan and change their lives for the better. Spend some time around someone who is relentlessly negative and how do you feel–drained, right? More and more research shows that joy is not something that just happens to you, like a bolt of lightning out of the blue. Joy is, instead, a habit to cultivate. Negative thinking and despair are the crabgrass of our souls–weeds that take root and spread, sometimes to all areas of life. Joy, in contrast, is a soul’s rose–hardy when cared for, able to put down roots over time and withstand disease and extremes. Like a rose, however, your joy can become blighted from neglect or harsh conditions. We all need to tend to our joy–to prune away the badness, and to know that, even though it may look like a prickly bare root, if you invest time in a joyous outlook, gorgeous things will bloom, even in the harshest conditions."

- Dr. Jan Gurley
___

"'Solutions' exist only in mathematics."

- Karen Martin
___

"The issue of how to regulate clinical software is, in the long run, indistinguishable from the issue of how to regulate medicine. The only difference is that medicine is practiced in the open, without secrecy, subject to peer review, and under a merit-based state license."

- Adrian Gropper, MD
___

"Economist, rope, tree: some assembly required."

- Source unknown

DISCLAIMER:

I write this blog wholly on my own time and my own dime. The views proffered are expressly my own as a concerned and active citizen/taxpayer (in addition to being the result of my substantive experience in the various IT fields), and in no way reflect any policy views of my former employer, notwithstanding that some of the thinking has indeed obviously been spurred by the implications of the work I have been doing for them.

FAIR USE POLICY
I cite a ton of news and web sources spanning the breadth of relevant technical and policy domains, sometimes at substantial length. I believe I remain well within the bounds of "Fair Use," as [1] I am not doing any of this for profit, [2] I always provide attribution and links -- which, [3] far from negatively impacting any copyright holders' commercial interests, might actually increase traffic to and interest in their offerings.

Nonetheless, should I post anything of yours regarding which you have any objection, just let me know and I will remove it forthwith.
