
Monday, October 1, 2018

"Data Science?"

The latest fad? Last year it was profitably fashionable to add "crypto" and/or "blockchain" to one's resume or startup company name. I've alluded to the phrase "data science" in a number of prior posts, in the context of Health InfoTech. See, e.g., "Health IT: process mining and analytics for healthcare QI."

(BTW: Blockchain update.)

This (below) is a pretty good illustrative graphic of the subtopical components:


I have direct work experience in a number of these areas, but not in "machine learning" or "large scale distributed computing" (and I have some methodological concerns about the latter, which I will get to). "BPM" is "Business Process Management." We called "process mining" "operations analytics."
The allusion to "databases," one assumes, includes the critical subject of "database architectures." The heterogeneity of widely distributed "big data" (often of materially varying quality pedigree) has to be a concern. In fairness, though, my waning programmer / database architect chops are pretty old-school RDBMS comprising in-house (e.g., local server) "structured data."
By "machine learning," I assume they include "artificial intelligence," "deep learning," and "natural language processing (NLP)."

I'm reading up.


Just getting started with these; stay tuned. Looking for clear, consistent definitions at the outset, for one thing.

From the MIT book:
1. What Is Data Science? 

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. Many of the elements of data science have been developed in related fields such as machine learning and data mining. In fact, the terms data science, machine learning, and data mining are often used interchangeably. The commonality across these disciplines is a focus on improving decision making through the analysis of data. However, although data science borrows from these other fields, it is broader in scope. Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns from data. Data mining generally deals with the analysis of structured data and often implies an emphasis on commercial applications. Data science takes all of these considerations into account but also takes up other challenges, such as the capturing, cleaning, and transforming of unstructured social media and web data; the use of big-data technologies to store and process big, unstructured data sets; and questions related to data ethics and regulation...

Kelleher, John D. Data Science (MIT Press Essential Knowledge series). The MIT Press. Kindle Edition.
From the "AI Sciences" book:
What is Data Science?

Data science is multidisciplinary field that relies on scientific methods, statistics and algorithms to extract meaningful insights from data. At its core, data science is all about discovering useful patterns in data that can then be presented as information to tell a story or make informed decisions. It would be noticed that data science depends on techniques from a bunch of other fields such as computer science, mathematics, statistics and business analytics. It is common for data scientists to have skills across this range. Data science can be employed to derive insights from both small and large datasets and it is often a misconception that data science is only suited to so called big data.


Morgan, Peter. Data Science from Scratch with Python: Step-by-Step Guide (Kindle Locations 337-344). AI Sciences LLC. Kindle Edition.
OK. Their Venn diagram:


Another engrossing book that I'm way deep into at the moment, written by the AI eminence Judea Pearl.


This one is a total whack upside the head.
…We live in an era that presumes Big Data to be the solution to all our problems. Courses in “data science” are proliferating in our universities, and jobs for “data scientists” are lucrative in the companies that participate in the “data economy.” But I hope with this book to convince you that data are profoundly dumb. Data can tell you that the people who took a medicine recovered faster than those who did not take it, but they can’t tell you why. Maybe those who took the medicine did so because they could afford it and would have recovered just as fast without it.

Over and over again, in science and in business, we see situations where mere data aren’t enough. Most big-data enthusiasts, while somewhat aware of these limitations, continue the chase after data-centric intelligence, as if we were still in the Prohibition era.

As I mentioned earlier, things have changed dramatically in the past three decades. Nowadays, thanks to carefully crafted causal models, contemporary scientists can address problems that would have once been considered unsolvable or even beyond the pale of scientific inquiry. For example, only a hundred years ago, the question of whether cigarette smoking causes a health hazard would have been considered unscientific. The mere mention of the words “cause” or “effect” would create a storm of objections in any reputable statistical journal.

Even two decades ago, asking a statistician a question like “Was it the aspirin that stopped my headache?” would have been like asking if he believed in voodoo. To quote an esteemed colleague of mine, it would be “more of a cocktail conversation topic than a scientific inquiry.” But today, epidemiologists, social scientists, computer scientists, and at least some enlightened economists and statisticians pose such questions routinely and answer them with mathematical precision. To me, this change is nothing short of a revolution. I dare to call it the Causal Revolution, a scientific shakeup that embraces rather than denies our innate cognitive gift of understanding cause and effect.

Pearl, Judea. The Book of Why: The New Science of Cause and Effect (pp. 6-7). Basic Books. Kindle Edition.
"If I could sum up the message of this book in one pithy phrase, it would be that you are smarter than your data. Data do not understand causes and effects; humans do." [pg. 21]
So much for the liturgy of "Data-Driven."
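Pearl's medicine example is easy to sketch in a few lines of Python. To be clear, every number below is made up for illustration (none of it comes from the book), and the toy world is rigged so that the medicine does nothing at all: wealth alone drives both who takes the drug and how fast people recover.

```python
# Hypothetical population, invented purely to illustrate confounding.
# Each row: (wealth group, took medicine?, number of people, mean recovery days).
# Within a wealth group, recovery time is identical with or without the
# medicine, so the true effect of the medicine is zero by construction.
cells = [
    ("wealthy", True, 800, 5.0),
    ("wealthy", False, 200, 5.0),
    ("poor", True, 200, 9.0),
    ("poor", False, 800, 9.0),
]


def mean_days(rows):
    """People-weighted mean recovery time over the given rows."""
    total_people = sum(n for _, _, n, _ in rows)
    return sum(n * days for _, _, n, days in rows) / total_people


def gap(rows):
    """Non-takers' mean recovery days minus takers' mean recovery days."""
    takers = [r for r in rows if r[1]]
    non_takers = [r for r in rows if not r[1]]
    return mean_days(non_takers) - mean_days(takers)


# What the raw data "say": takers recover faster.
naive_gap = gap(cells)

# Same comparison made separately inside each wealth group.
within_gap = {
    wealth: gap([r for r in cells if r[0] == wealth])
    for wealth in ("wealthy", "poor")
}

print(f"naive gap: {naive_gap:.1f} days")
print(f"within-group gaps: {within_gap}")
```

The pooled data show takers recovering more than two days faster, while the gap inside each wealth group is zero. The data alone cannot tell you which comparison is the right one; you need a causal story (here, "wealth affects both") to know to stratify.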
Among numerous other virtues, The Book of Why provides the best explication of Bayesian Networks I've ever read. I'm already long up to speed on applications of Bayes' Theorem ("base rates matter"), but Pearl's Bayesian Networks stuff is off the hook, and foundational to his compelling argument.
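For anyone rusty on why "base rates matter," here is a minimal sketch of Bayes' Theorem applied to a screening test. The function name and all of the rates are hypothetical, picked only to make the arithmetic visible; they are not drawn from Pearl's book or from any real test.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test), computed with Bayes' Theorem."""
    true_pos = prior * sensitivity                  # P(positive and condition)
    false_pos = (1 - prior) * false_positive_rate   # P(positive and no condition)
    return true_pos / (true_pos + false_pos)


# Hypothetical test: 90% sensitivity, 9% false-positive rate.
# Rare condition: 1% of the tested population has it.
p_rare = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.09)

# Identical test, but now 30% of the tested population has the condition.
p_common = posterior(prior=0.30, sensitivity=0.90, false_positive_rate=0.09)

print(f"rare condition:   {p_rare:.3f}")
print(f"common condition: {p_common:.3f}")
```

Same test, same accuracy figures, yet a positive result means roughly a 9% chance of having the condition in the first population and roughly 81% in the second. The only thing that changed is the base rate.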
UPDATE

Michael Lewis' new book is out. I read it all immediately.

…in the space of a few years, the interest in data analysis went from curiosity to fad. The fetish for data overran everything from political campaigns to the management of baseball teams. Inside LinkedIn, DJ presided over an explosion of job titles that described similar tasks: analyst, business analyst, data analyst, research sci. The people in human resources complained to him that the company had too many data-related job titles. The company was about to go public, and they wanted to clean up the organization chart. To that end DJ sat down with his counterpart at Facebook, who was dealing with the same problem. What could they call all these data people? “Data scientist,” his Facebook friend suggested. “We weren’t trying to create a new field or anything, just trying to get HR off our backs,” said DJ. He replaced the job titles for some openings with “data scientist.” To his surprise, the number of applicants for the jobs skyrocketed. “Data scientists” were what people wanted to be.

Lewis, Michael. The Fifth Risk (pp. 157-158). W. W. Norton & Company. Kindle Edition.
A compelling, albeit by turns depressing and infuriating, read. Highly recommended.
___

"DATA SCIENCE," STANFORD IS ON IT

sdsi.stanford.edu
I saw a presentation about this stuff given by Stanford's Carlos Bustamante last December during the Health 2.0 Technology for Precision Health conference.

From the SDSI website:
Science of Data Science

Science is experiencing simultaneous challenges and opportunities at an unprecedented rate:
  • From new sources of data, especially in large quantity and unconventional structure, often from “non-scientific” sources, such as social media;
  • From new algorithmic techniques potentially expanding greatly the ability to reason from data but whose interpretation, validity and fairness can not be established by our current statistical and computational techniques;
  • From the crucial need for scientifically valid advice on questions of the greatest importance to the future of society, of life and of the earth itself---advice that must be effectively communicated to society.
In all of these, data science is clearly central. Recent computational, statistical and other research has been of great value. Much more needs to be done, however, and with a sense of urgency.

Validity of algorithmic inferences:

Algorithmic techniques to infer patterns and structure have had exceptional success recently in many areas of practical value. They can also be important, even revolutionary, for science in many areas. Data as divergent as social media interactions on one hand and satellite or drone images on the other may provide vital results through such algorithms.

However, the scientific validity of the results can not be assumed. Conventional concepts such as random sampling of the intended population are rarely relevant. A deeper understanding of the data sources and the computations applied will be essential.

Fairness of algorithmic decisions:
Beyond the scientific validity of inferences, the use of algorithmic results to recommend practical actions raises important questions of fairness and equitable treatment. Data science needs to search for valid notions of fairness, to ensure that the results of analysis and the data-based algorithms using them are fair to all demographic and other cohorts.

Privacy and the public interest:
Huge quantities of data exist for individuals, through social media, other internet activities and databases of medical, governmental, employment and commercial records. Computational and statistical techniques are needed that satisfy both the right to privacy and society’s need to deal with important questions. Progress has been made with new approaches such as differential privacy and distributed inference on private data. Much more needs to be done given the increasing attraction of mining such data sources, with the potential risks to individual rights.

Causality:
Some of the richest sources of extensive data for scientific study are observational (“non-randomized”) data bases made available by the explosion of technology (the internet and digital records in medicine, government and business). Naive application of inferential techniques to infer causal mechanisms will be seriously misleading on such data, potentially with disastrously mistaken conclusions. Research in new statistical and computational techniques to adjust for such data sources is needed.

The reproducibility crisis:
Repeated and often highly visible incidents have highlighted failures to reproduce “scientific” conclusions; for example, frequent editorials in prestigious journals such as Science and Nature have documented and apologized for many failures to reproduce published results.
Issues of scientific and academic culture are undoubtedly part of the problem. However, the radical changes in sources of data and algorithms applied mean that the practice of data analysis has changed enormously. Data science needs to find new inferential paradigms that allow data exploration prior to the formulation of hypotheses.
SDSI on Data Science in the health care space:
Data Science for Human Health
It is clear that data science will be a driving force in transitioning the world’s healthcare systems from reactive “sick-based” care to proactive, preventive care.

First, and most importantly, data science has the power to empower the consumer, giving them more control over their own care. People can make better, more informed decisions if their care providers are able to make better, more data-based recommendations. Imagine your care provider could access your genetic information in a proactive healthcare system, measure your genetic risk for disease—not just as an individual but also as a member of a larger population—and then help you manage that risk throughout your life course.

This is the kind of personalized, patient-focused medicine that current reactive healthcare systems cannot facilitate, because they are designed to wait until things go wrong with the human body before addressing the problem, and every individual is deemed responsible for managing his/her own health and risk. In a data-based proactive healthcare system, public education could inform people of what it means to have different levels of risk. Since we all carry some level of risk (some more than others for specific diseases), individuals could be informed of their individual and collective health risks early on, enhancing control over their own health at every stage of their lifespan.

Second, data science enables more cost-effective drug discovery, helping us do the right thing for the right person. Rather than have someone trying and failing ten different drugs at great expense to the individual and the acute-based care system (not to mention worsening quality of life for the patient), data science can help us choose the right one on the first try. Although that drug in isolation is more expensive for the system, it would have been even more expensive if we didn’t have data science because that person would have had ten different things tried and failed. Additionally, data science allows us to bring things to market more quickly, because we’re not beholden to the hypothesis-driven routine.

Third, data science technologies are capable of improving patient outcomes and conditions with variable outcomes. They can capture data inputs, weed out subtypes, and distill best practices when combating disease, such as brain or other neurological cancers.

Lastly, data science technology can also reconfigure the costs associated with delivery of care by utilizing continuous data capture, analytics, and new key insights in order to inform physicians and clinicians when things have gone wrong in the human body before patients feel unwell. That understanding could then be integrated into a new model of care, which would enable early intervention, thus preventing that individual from having to go to the hospital. Recent Stanford research has begun to explore the possibilities of monitoring cardiomyopathy patients at home and monitoring children in the ER and ICU: we believe these studies are leading us toward a future of proactive, consumer-based care.

We recognize fully that technological advancement and unprecedented growth in biomedical data have created great opportunities, but they have also introduced great challenges for protecting the privacy and security of patient and other research data. We must work with stakeholders and experts in the private sector and federal agencies, such as the NIH, to promote and practice robust and proactive information-security procedures to ensure appropriate stewardship of patient and research-participant data while at the same time enabling scientific and medical advances.
Highly recommend you read all of their topical domain info.


"Ethics and Data Science?" Yeah, I'm gonna get there too. For one thing, I gotta get around to evaluating this (below).

_____________

More to come...
