
Tuesday, September 19, 2017

GAFA, Really Big Data, and their sociopolitical implications

The September issue of my ASQ Quality Progress journal showed up in the snailmail.

Pretty interesting cover story. Some of it ties nicely into recent topics of mine such as "AI" and its adolescent cousin "NLP" as they go to the health IT space. More on that in a bit, but, first, let me set the stage via some relevant recent reading.

Franklin Foer has had a good book launch. Lots of interviews and articles online and a number of news show appearances.

I loved the book; read it in one day. I will cite just a bit from it. "GAFA," BTW, is the EU's sarcastic shorthand for "Google-Amazon-Facebook-Apple," the for-profit miners of the biggest of "big data."
UNTIL RECENTLY, it was easy to define our most widely known corporations. Any third grader could describe their essence. Exxon sells oil; McDonald’s makes hamburgers; Walmart is a place to buy stuff. This is no longer so. The ascendant monopolies of today aspire to encompass all of existence. Some of these companies have named themselves for their limitless aspirations. Amazon, as in the most voluminous river on the planet, has a logo that points from A to Z; Google derives from googol, a number (1 followed by 100 zeros) that mathematicians use as shorthand for unimaginably large quantities.

Where do these companies begin and end? Larry Page and Sergey Brin founded Google with the mission of organizing all knowledge, but that proved too narrow. Google now aims to build driverless cars, manufacture phones, and conquer death. Amazon was once content being “the everything store,” but now produces television shows, designs drones, and powers the cloud. The most ambitious tech companies— throw Facebook, Microsoft, and Apple into the mix— are in a race to become our “personal assistant.” They want to wake us in the morning, have their artificial intelligence software guide us through the day, and never quite leave our sides. They aspire to become the repository for precious and private items, our calendar and contacts, our photos and documents. They intend for us to unthinkingly turn to them for information and entertainment, while they build unabridged catalogs of our intentions and aversions. Google Glass and the Apple Watch prefigure the day when these companies implant their artificial intelligence within our bodies.

More than any previous coterie of corporations, the tech monopolies aspire to mold humanity into their desired image of it. They believe that they have the opportunity to complete the long merger between man and machine— to redirect the trajectory of human evolution. How do I know this? Such suggestions are fairly commonplace in Silicon Valley, even if much of the tech press is too obsessed with covering the latest product launch to take much notice of them. In annual addresses and townhall meetings, the founding fathers of these companies often make big, bold pronouncements about human nature— a view of human nature that they intend to impose on the rest of us.

There’s an oft-used shorthand for the technologist’s view of the world. It is assumed that libertarianism dominates Silicon Valley, which isn’t wholly wrong. High-profile devotees of Ayn Rand can be found there. But if you listen hard to the titans of tech, that’s not the worldview that emerges. In fact, it is something much closer to the opposite of a libertarian’s veneration of the heroic, solitary individual. The big tech companies believe we’re fundamentally social beings, born to collective existence. They invest their faith in the network, the wisdom of crowds, collaboration. They harbor a deep desire for the atomistic world to be made whole. By stitching the world together, they can cure its ills. Rhetorically, the tech companies gesture toward individuality— to the empowerment of the “user”— but their worldview rolls over it. Even the ubiquitous invocation of users is telling, a passive, bureaucratic description of us.

The big tech companies— the Europeans have charmingly, and correctly, lumped them together as GAFA (Google, Apple, Facebook, Amazon)— are shredding the principles that protect individuality. Their devices and sites have collapsed privacy; they disrespect the value of authorship, with their hostility to intellectual property. In the realm of economics, they justify monopoly with their well-articulated belief that competition undermines our pursuit of the common good and ambitious goals. When it comes to the most central tenet of individualism— free will— the tech companies have a different way. They hope to automate the choices, both large and small, that we make as we float through the day. It’s their algorithms that suggest the news we read, the goods we buy, the path we travel, the friends we invite into our circle.

It’s hard not to marvel at these companies and their inventions, which often make life infinitely easier. But we’ve spent too long marveling. The time has arrived to consider the consequences of these monopolies, to reassert our own role in determining the human path. Once we cross certain thresholds— once we transform the values of institutions, once we abandon privacy— there’s no turning back, no restoring our lost individuality…

Foer, Franklin. World Without Mind: The Existential Threat of Big Tech (pp. 1-3). Penguin Publishing Group. Kindle Edition.
I've considered these issues before. See, e.g., "The old internet of data, the new internet of things and "Big Data," and the evolving internet of YOU."

A useful volume of historical context also comes to mind.

Machines are about control. Machines give more control to humans: control over their environment, control over their own lives, control over others. But gaining control through machines means also delegating it to machines. Using the tool means trusting the tool. And computers, ever more powerful, ever smaller, and ever more networked, have given ever more autonomy to our instruments. We rely on the device, plane and phone alike, trusting it with our security and with our privacy. The reward: an apparatus will serve as an extension of our muscles, our eyes, our ears, our voices, and our brains.

Machines are about communication. A pilot needs to communicate with the aircraft to fly it. But the aircraft also needs to communicate with the pilot to be flown. The two form an entity: the pilot can’t fly without the plane, and the plane can’t fly without the pilot. But these man-machine entities aren’t isolated any longer. They’re not limited to one man and one machine, with a mechanical interface of yokes, throttles, and gauges. More likely, machines contain a computer, or many, and are connected with other machines in a network. This means many humans interact with and through many machines. The connective tissue of entire communities has become mechanized. Apparatuses aren’t simply extensions of our muscles and brains; they are extensions of our relationships to others— family, friends, colleagues, and compatriots. And technology reflects and shapes those relationships.

Control and communication began to shift fundamentally during World War II. It was then that a new set of ideas emerged to capture the change: cybernetics. The famously eccentric MIT mathematician Norbert Wiener coined the term, inspired by the Greek verb kybernan, which means “to steer, navigate, or govern.” Cybernetics; or, Control and Communication in the Animal and the Machine, Wiener’s pathbreaking book, was published in the fall of 1948. The volume was full of daredevil prophecies about the future: of self-adaptive machines that would think and learn and become cleverer than “man,” all made credible by formidable mathematical formulas and imposing engineering jargon…

…From today’s vantage point, the future is hazy, dim, and formless. But these questions aren’t new. The future of machines has a past. And mastering our future with machines requires mastering our past with machines. Stepping back twenty or forty or even sixty years brings the future into sharper relief, with exaggerated clarity, like a caricature, revealing the most distinct and marked features. And cybernetics was a major force in molding these features.

That cybernetic tension of dystopian and utopian visions dates back many decades. Yet the history of our most potent ideas about the future of technology is often neglected. It doesn’t enter archives in the same way that diplomacy and foreign affairs would. For a very long period of time, utopian ideas have dominated; ever since Wiener’s death in March 1964, the future of man’s love affair with the machine was a starry-eyed view of a better, automated, computerized, borderless, networked, and freer future. Machines, our own cybernetic creations, would be able to overcome the innate weaknesses of our inferior bodies, our fallible minds, and our dirty politics. The myth of the clean, infallible, and superior machines was in overdrive, out of balance.

By the 1990s, dystopia had returned. The ideas of digital war, conflict, abuse, mass surveillance, and the loss of privacy— even if widely exaggerated— can serve as a crucial corrective to the machine’s overwhelming utopian appeal. But this is possible only if contradictions are revealed— contradictions covered up and smothered by the cybernetic myth. Enthusiasts, driven by hope and hype, overestimated the power of new and emerging computer technologies to transform society into utopia; skeptics, often fueled by fear and foreboding, overestimated the dystopian effects of these technologies. And sometimes hope and fear joined forces, especially in the shady world of spies and generals. But misguided visions of the future are easily forgotten, discarded into the dustbin of the history of ideas. Still, we ignore them at our own peril. Ignorance risks repeating the same mistakes.

Cybernetics, without doubt, is one of the twentieth century’s biggest ideas, a veritable ideology of machines born during the first truly global industrial war that was itself fueled by ideology. Like most great ideas, cybernetics was nimble and shifted shape several times, adding new layers to its twisted history decade by decade. This book peels back these layers, which were nearly erased and overwritten again and again, like a palimpsest of technologies. This historical depth, although almost lost, is what shines through the ubiquitous use of the small word “cyber” today…

Rid, Thomas (2016-06-28). Rise of the Machines: A Cybernetic History (Kindle Locations 167-233). W. W. Norton & Company. Kindle Edition.
Still reading this one. A fine read; it spans roughly the period from WWII to 2016.

I can think of a number of relevant others I've heretofore cited, but these will do for now.


The Deal With Big Data

Move over! Big data analytics and standardization are the next big thing in quality
by Michele Boulanger, Wo Chang, Mark Johnson and T.M. Kubiak

Just the Facts

  • More and more organizations have realized the important role big data plays in today’s marketplaces.

  • Recognizing this shift toward big data practices, quality professionals must step up their understanding of big data and how organizations can use and take advantage of their transactional data.

  • Standards groups realize big data is here to stay and are beginning to develop foundational standards for big data and big data analytics.

The era of big data is upon us. While providing a formidable challenge to the classically trained quality practitioner, big data also offers substantial opportunities for redirecting a career path into a computational and data-intensive environment.

The change to big data analytics from the status quo of applying quality principles to manufacturing and service operations could be considered a paradigm shift comparable to the changes quality professionals experienced when statistical computing packages became widely available, or when control charts were first introduced.

The challenge for quality practitioners is to recognize this shift and secure the training and understanding necessary to take full advantage of the opportunities.

What’s the big deal?
What exactly is big data? You’ve probably noticed that big data often is associated with transactional data sets (for example, American Express and Amazon), social media (for example, Facebook and Twitter) and, of course, search engines (for example, Google). Most formal definitions of big data involve some variant of the four V’s:

  • Volume: Data set size.
  • Variety: Diverse data types residing in multiple locations.
  • Velocity: Speed of generation and transmission of data.
  • Variability: Nonconstancy of volume, variety and velocity.
This set of V’s is attributable originally to Gartner Inc., a research and advisory company, and documented by the National Institute of Standards and Technology (NIST) in the first volume of a set of seven documents. Big data clearly is the order of the day when the quality practitioner is confronted with a data set that exceeds the laptop’s memory, which may be by orders of magnitude.
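The "exceeds the laptop's memory" point is concrete: once a data set won't fit in RAM, the practitioner has to stream it rather than load it. A minimal stdlib-only sketch of that idea (the file name and column are hypothetical, invented for illustration):

```python
# Out-of-core processing in miniature: stream a CSV one row at a time and
# keep only running aggregates, so memory use stays O(1) no matter how
# large the file is.
import csv
import io

def streaming_mean(lines, column):
    """Compute the mean of one column a row at a time -- constant memory."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for row in reader:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# Demo on an in-memory "file"; in practice `lines` would be open("big.csv").
data = io.StringIO("defects\n3\n5\n4\n")
print(streaming_mean(data, "defects"))  # 4.0
```

The same pattern (read, update aggregates, discard) generalizes to variances, control-chart statistics, and so on; distributed frameworks essentially scale this idea across machines.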

In this article, we’ll reveal the big data era to the quality practitioner and describe the strategy being taken by standardization bodies to streamline their entry into the exciting and emerging field of big data analytics. This is all done with an eye on preserving the inherently useful quality principles that underlie the core competencies of these standardization bodies.

Primary classes of big data problems
The 2016 ASQ Global State of Quality reports included a spotlight report titled "A Trend? A Fad? Or Is Big Data the Next Big Thing?" hinting that big data is here to stay. If the conversion from acceptance sampling, control charts or design of experiments seems a world away from the tools associated with big data, rest assured that the statistical bases still apply. 
Of course, the actual data, per the four V’s, are different. Relevant formulations of big data problems, however, enjoy solutions or approaches that are statistical, though the focus is more on retrospective data and causal models in traditional statistics, and more forward-looking data and predictive analytics in big data analytics. Two primary classes of problems occur in big data:
  • Supervised problems occur when there is a dependent variable of interest that relates to a potentially large number of independent variables. For this, regression analysis comes into play, for which the typical quality practitioner likely has some background.
  • Unsupervised problems occur when unstructured data are the order of the day (for example, doctor’s notes, medical diagnostics, police reports or internet transactions).
Unsupervised problems seek to find the associations among the variables. In these instances, cluster and association analysis can be used. The quality practitioner can easily pick up such techniques...
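The two problem classes the article names can be shown in a toy, stdlib-only sketch: a least-squares fit stands in for supervised regression, and a tiny 1-D two-means clustering stands in for unsupervised cluster analysis. All numbers are invented for illustration.

```python
# Supervised: fit y = a*x + b by ordinary least squares (labels present).
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # (slope, intercept)

# Unsupervised: no labels -- find structure by grouping points around
# two centers (a 1-D k-means with k=2, kept deliberately tiny).
def two_means(points, iters=10):
    c1, c2 = min(points), max(points)
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted([c1, c2])

print(fit_line([1, 2, 3], [2, 4, 6]))   # (2.0, 0.0)
print(two_means([1.0, 1.2, 9.0, 9.4]))  # centers near 1.1 and 9.2
```

The point of the contrast: `fit_line` needs a dependent variable to learn from, while `two_means` discovers the grouping with no labels at all, which is why it suits unstructured data like free-text notes once they are converted to numeric features.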
Good piece. It's firewalled, unfortunately, but ASQ does provide free registration for non-member "open access" to certain content, including this article.

The article, as we might expect, is focused on the technical and operational aspects of using "big data" for analytics -- i.e., the requisite interdisciplinary skill sets, the data heterogeneity problems going to "interoperability," and the chronic related problems associated with "data quality." These considerations differ materially from those of "my day" -- I joined ASQ in the '80s when it was still "ASQC," the "American Society for Quality Control." Consequently, I am an old-school Deming-Shewhart guy. I successfully sat for the rigorous ASQ Certified Quality Engineer (CQE) exam in 1992. At the time it was comprised of about two-thirds applied industrial statistics -- sampling theory, probability calculations, design of experiments -- all aimed principally at assessing and improving things like "fraction defective" in production, and maintaining "SPC," Statistical Process Control, etc.

That kind of work pretty much assumed relative homogeneity of data under tight in-house control.
Such was even the case during my time in bank credit risk modeling and management. See, e.g., my 2003 whitepaper "FNBM Credit Underwriting Model Development" (large pdf). Among our data resources, we maintained a fairly large Oracle data warehouse comprising several million accounts from which I could pull customer-related data into SAS for analytics.
While such analytic methods do in fact continue to be deployed, those issues pale in comparison to the challenges we face in a far-flung, "cloud-based" "big data" world comprised of data of wildly varying pedigree. The Quality Progress article provides a good overview of the current terrain and the issues requiring our attention.

One publicly available linked resource the article provides:

The NBD-PWG was established together with the industry, academia and government to create a consensus-based extensible Big Data Interoperability Framework (NBDIF) which is a vendor-neutral, technology- and infrastructure-independent ecosystem. It can enable Big Data stakeholders (e.g. data scientists, researchers, etc.) to utilize the best available analytics tools to process and derive knowledge through the use of standard interfaces between swappable architectural components. The NBDIF is being developed in three stages with the goal to achieve the following with respect to the NIST Big Data Reference Architecture (NBD-RA), which was developed in Stage 1:
  1. Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic;
  2. Define general interfaces between the NBD-RA components with the goals to aggregate low-level interactions into high-level general interfaces and produce set of white papers to demonstrate how NBD-RA can be used;
  3. Validate the NBD-RA by building Big Data general applications through the general interfaces.
The "Use Case" pages contains links to work spanning the breadth of big data application domains, e.g., government operations, commercial, defense, healthcare and life sciences, deep learning and social media, the ecosystem for research, astronomy and physics, environmental and polar sciences, and energy.

From the Electronic Medical Record (EMR) Data "use case" document, by Shaun Grannis, Indiana University:
As health care systems increasingly gather and consume electronic medical record data, large national initiatives aiming to leverage such data are emerging, and include developing a digital learning health care system to support increasingly evidence-based clinical decisions with timely accurate and up-to-date patient-centered clinical information; using electronic observational clinical data to efficiently and rapidly translate scientific discoveries into effective clinical treatments; and electronically sharing integrated health data to improve healthcare process efficiency and outcomes. These key initiatives all rely on high-quality, large-scale, standardized and aggregate health data.  Despite the promise that increasingly prevalent and ubiquitous electronic medical record data hold, enhanced methods for integrating and rationalizing these data are needed for a variety of reasons. Data from clinical systems evolve over time. This is because the concept space in healthcare is constantly evolving: new scientific discoveries lead to new disease entities, new diagnostic modalities, and new disease management approaches. These in turn lead to new clinical concepts, which drives the evolution of health concept ontologies. Using heterogeneous data from the Indiana Network for Patient Care (INPC), the nation's largest and longest-running health information exchange, which includes more than 4 billion discrete coded clinical observations from more than 100 hospitals for more than 12 million patients, we will use information retrieval techniques to identify highly relevant clinical features from electronic observational data. We will deploy information retrieval and natural language processing techniques to extract clinical features. Validated features will be used to parameterize clinical phenotype decision models based on maximum likelihood estimators and Bayesian networks. 
Using these decision models we will identify a variety of clinical phenotypes such as diabetes, congestive heart failure, and pancreatic cancer…

Patients increasingly receive health care in a variety of clinical settings. The subsequent EMR data is fragmented and heterogeneous. In order to realize the promise of a Learning Health Care system as advocated by the National Academy of Science and the Institute of Medicine, EMR data must be rationalized and integrated. The methods we propose in this use-case support integrating and rationalizing clinical data to support decision-making at multiple levels.
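The pipeline Grannis describes (NLP feature extraction feeding likelihood-based phenotype models) can be caricatured in a few lines. This is a hedged sketch only: the tokenizer stands in for real clinical NLP, the scoring is a bare-bones unigram maximum-likelihood model, and every note, term, and phenotype label below is fabricated for illustration.

```python
# Toy version of "extract clinical features, then classify a phenotype
# by maximum likelihood." Real systems use validated features, ontologies,
# and Bayesian networks; this is the skeleton of the idea.
import math
from collections import Counter

def extract_features(note):
    """Crude tokenizer standing in for NLP feature extraction."""
    return Counter(note.lower().replace(".", "").split())

def train(notes_by_label):
    """Aggregate per-phenotype word counts from labeled example notes."""
    model = {label: Counter() for label in notes_by_label}
    for label, notes in notes_by_label.items():
        for note in notes:
            model[label].update(extract_features(note))
    return model

def classify(model, note):
    """Pick the phenotype maximizing the smoothed unigram log-likelihood."""
    vocab = set().union(*model.values())
    def log_lik(label):
        counts = model[label]
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        return sum(f * math.log((counts[w] + 1) / total)
                   for w, f in extract_features(note).items())
    return max(model, key=log_lik)

model = train({
    "diabetic": ["elevated glucose. metformin started.", "a1c high glucose"],
    "cardiac": ["ejection fraction low. dyspnea.", "edema dyspnea fluid"],
})
print(classify(model, "glucose remains elevated"))  # diabetic
```

Even this caricature surfaces the use case's central problem: the model is only as good as the vocabulary it was trained on, and clinical vocabularies evolve, which is exactly why the document stresses rationalizing and integrating heterogeneous EMR data.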
This document is dated August 11, 2013. Let's don't hurry up or anything. "Interoperababble."



Back to the themes with which I began this post. From a recent news piece:


ABOUT A WEEK ago, Stanford University researchers posted online a study on the latest dystopian AI: They'd made a machine learning algorithm that essentially works as gaydar. After training the algorithm with tens of thousands of photographs from a dating site, the algorithm could, for example, guess if a white man in a photograph was gay with 81 percent accuracy. The researchers’ motives? They wanted to protect gay people. “[Our] findings expose a threat to the privacy and safety of gay men and women,” wrote Michal Kosinski and Yilun Wang in the paper. They built the bomb so they could alert the public about its dangers.

Alas, their good intentions fell on deaf ears. In a joint statement, LGBT advocacy groups Human Rights Campaign and GLAAD condemned the work, writing that the researchers had built a tool based on “junk science” that governments could use to identify and persecute gay people. AI expert Kate Crawford of Microsoft Research called it “AI phrenology” on Twitter. The American Psychological Association, whose journal was readying their work for publication, now says the study is under “ethical review.” Kosinski has received e-mail death threats.

But the controversy illuminates a problem in AI bigger than any single algorithm. More social scientists are using AI intending to solve society’s ills, but they don’t have clear ethical guidelines to prevent them from accidentally harming people, says ethicist Jake Metcalf of Data and Society. “There aren’t consistent standards or transparent review practices,” he says. The guidelines governing social experiments are outdated and often irrelevant—meaning researchers have to make ad hoc rules as they go…
Yeah, it's about "AI" rather than "big data" per se. But, to me, the direct linkage is pretty obvious.

Again, apropos, I refer you to my prior post "The old internet of data, the new internet of things and "Big Data," and the evolving internet of YOU."

Tangentially, see also my "Watson and cancer" post.


By no means exhaustive. I also recommend you read a bunch of Kevin Kelly, among others.


From Scientific American: 
Searching for the Next Facebook or Google: Bloomberg Helps Launch Tech Incubator
The former mayor speaks with Scientific American about the new Cornell Tech campus in New York City: “Culture attracts capital a lot quicker than capital will attract culture.”

More to come...
