Data Science Ethics



Data Science in a Box

datasciencebox.org

Cornell College
DSC 223 - Fall 2022

October 10th, 2022

Misrepresentation

Causality

Time

LA Times

Original study

Moore, Steven C., et al. “Association of leisure-time physical activity with risk of 26 types of cancer in 1.44 million adults.” JAMA Internal Medicine 176.6 (2016): 816-825.

  • Volunteers were asked about their physical activity level over the preceding year.
  • Half exercised less than about 150 minutes per week, half exercised more.
  • Compared to the bottom 10% of exercisers, the top 10% had lower rates of esophageal, liver, lung, endometrial, colon, and breast cancer.
  • Researchers found no association between exercising and 13 other cancers (e.g. pancreatic, ovarian, and brain).

Axes and scale

What is the difference between these two pictures? Which presents a better way to represent these data?

What is wrong with this picture? How would you correct it?

Cost of Gas

What is wrong with this picture? How would you correct it?
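The charts themselves don’t survive in text form, but the trick both of these slides illustrate is easy to reproduce: truncating the y-axis makes a small change look dramatic. A minimal matplotlib sketch, with hypothetical gas prices rather than the actual figures from the slide:

```python
# A minimal sketch (hypothetical gas prices) of how a truncated y-axis
# exaggerates change while a zero-based axis keeps it in proportion.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
price = [3.40, 3.45, 3.52, 3.57]  # hypothetical $/gallon values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(months, price, marker="o")
ax1.set_ylim(3.38, 3.60)          # truncated axis: a ~5% rise looks dramatic
ax1.set_title("Truncated y-axis")

ax2.plot(months, price, marker="o")
ax2.set_ylim(0, 4)                # zero-based axis: same data, honest scale
ax2.set_title("Zero-based y-axis")

fig.suptitle("Same data, different impressions")
plt.tight_layout()
plt.show()
```

Same data in both panels; only the axis limits change, and with them the story the picture appears to tell.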

Maps and areas

Do you recognize this map? What does it show?

Right? Wrong?

Visualizing uncertainty

Do you want Catalonia to become an independent state?

On December 19, 2014, the front page of the Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.

Do you want Catalonia to become an independent state?
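The teaching point is that the gap between “yes” and “no” was small relative to the survey’s uncertainty, which the original front-page chart did not convey. A minimal sketch of a more honest version, using hypothetical shares and sample size rather than the actual poll figures:

```python
# A minimal sketch of showing survey uncertainty: hypothetical poll shares
# with 95% confidence intervals.
import numpy as np
import matplotlib.pyplot as plt

n = 1000                       # hypothetical sample size
options = ["Yes", "No"]
p = np.array([0.44, 0.45])     # hypothetical observed shares

se = np.sqrt(p * (1 - p) / n)  # standard error of a proportion
ci = 1.96 * se                 # 95% confidence interval half-width

plt.bar(options, p, yerr=ci, capsize=8)
plt.ylabel("Share of respondents")
plt.title("Independence poll with 95% CIs (hypothetical numbers)")
plt.show()
```

When the intervals overlap like this, a headline about one side pulling ahead overstates what the survey can support.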

Further reading

How Charts Lie
Getting Smarter about Visual Information

by Alberto Cairo

Calling Bullshit
The Art of Skepticism in a
Data-Driven World

by Carl Bergstrom and Jevin West

Data privacy

Case study: AOL search data leak

Case study: OkCupid

OkCupid data breach

  • In 2016, researchers published data on 70,000 OkCupid users, including usernames, political leanings, drug usage, and intimate sexual details

  • Researchers didn’t release the real names and pictures of OkCupid users, but their identities could easily be uncovered from the details provided (e.g. usernames)

“Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.”

Researchers Emil Kirkegaard and Julius Daugbjerg Bjerrekær

When analyzing data that individuals willingly shared publicly on a given platform (e.g. social media), how do you make sure you don’t violate their reasonable expectations of privacy?

Case study: Facebook & Cambridge Analytica

Algorithmic bias

First a bit of fun…

The Hathaway Effect

Automated trading algorithms that run sentiment analysis on news coverage appear to have conflated positive stories about actress Anne Hathaway with Berkshire Hathaway (ticker BRK.A):

  • Oct. 3, 2008: Rachel Getting Married opens, BRK.A up 0.44%
  • Jan. 5, 2009: Bride Wars opens, BRK.A up 2.61%
  • Feb. 8, 2010: Valentine’s Day opens, BRK.A up 1.01%
  • March 5, 2010: Alice in Wonderland opens, BRK.A up 0.74%
  • Nov. 24, 2010: Love and Other Drugs opens, BRK.A up 1.62%
  • Nov. 29, 2010: Anne announced as co-host of the Oscars, BRK.A up 0.25%

Algorithmic bias and gender

Google Translate

Amazon’s experimental hiring algorithm

  • Used AI to give job candidates scores ranging from one to five stars – much like shoppers rate products on Amazon
  • Amazon’s system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way; it taught itself that male candidates were preferable

“Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said.”

Reuters (2018)
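Nothing about Amazon’s actual model is public, but the failure mode it illustrates is easy to reproduce. A minimal sketch on entirely synthetic data: train a classifier on historical decisions that favored one group, and it scores otherwise identical candidates differently.

```python
# A minimal sketch (entirely synthetic data, not Amazon's system) of how a
# model trained on historically biased hiring decisions learns the bias.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
skill = rng.normal(0, 1, n)    # true qualification signal
group = rng.integers(0, 2, n)  # 0/1 proxy feature (e.g. inferred gender)

# Biased historical labels: group 1 was hired more often at the same skill level
hired = (skill + 1.0 * group + rng.normal(0, 1, n)) > 0.5

model = LogisticRegression().fit(np.column_stack([skill, group]), hired)

# Two candidates with identical skill who differ only in the proxy feature
same_skill = np.array([[0.0, 0], [0.0, 1]])
print(model.predict_proba(same_skill)[:, 1])  # group 1 gets a higher score
```

The model never sees an instruction to prefer one group; it simply reproduces the pattern in the labels it was trained on.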

Algorithmic bias and race

Facial recognition

Criminal sentencing

“There’s software used across the country to predict future criminals.
And it’s biased against blacks.”

A tale of two convicts

“Although these measures were crafted with the best of intentions, I am concerned that they inadvertently undermine our efforts to ensure individualized and equal justice,” he said, adding, “they may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society.”

Then U.S. Attorney General Eric Holder (2014)

ProPublica analysis

Data:

Risk scores assigned to more than 7,000 people arrested in Broward County, Florida, in 2013 and 2014 + whether they were charged with new crimes over the next two years

ProPublica analysis

Results:

  • 20% of those predicted to commit violent crimes actually did
  • Algorithm had higher accuracy (61%) when the full range of crimes was taken into account (e.g. misdemeanors)
  • Algorithm was more likely to falsely flag black defendants as future criminals, at almost twice the rate of white defendants (see the sketch after this list)
  • White defendants were mislabeled as low risk more often than black defendants
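The “twice the rate” claim is a comparison of false positive rates by group: among people who did not go on to reoffend, what share was labeled high risk? A minimal sketch of that computation, using hypothetical counts chosen to mirror the pattern rather than ProPublica’s actual Broward County numbers:

```python
# A minimal sketch of the disparity check, on hypothetical counts.
# False positive rate: share of people who did NOT reoffend but were
# nevertheless labeled high risk.
def false_positive_rate(high_risk_no_reoffend, low_risk_no_reoffend):
    """FPR among people who did not reoffend."""
    return high_risk_no_reoffend / (high_risk_no_reoffend + low_risk_no_reoffend)

# Hypothetical counts of non-reoffenders per group
fpr_black = false_positive_rate(high_risk_no_reoffend=450, low_risk_no_reoffend=550)
fpr_white = false_positive_rate(high_risk_no_reoffend=230, low_risk_no_reoffend=770)

print(f"FPR (black defendants): {fpr_black:.0%}")  # 45%
print(f"FPR (white defendants): {fpr_white:.0%}")  # 23%
print(f"Ratio: {fpr_black / fpr_white:.1f}x")      # roughly 2x: the disparity
```

Note that an algorithm can have similar overall accuracy across groups while its errors fall very differently on each group, which is exactly the pattern ProPublica reported.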

How to write a racist AI without trying
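The demonstration this slide links to builds a sentiment scorer from word embeddings and shows that people’s names inherit bias from the training corpus. A minimal sketch of the mechanism with toy hand-made vectors (the real demonstration uses pretrained embeddings such as GloVe; every vector below is invented for illustration):

```python
# A minimal sketch (toy vectors, not real embeddings) of scoring sentiment
# by averaging word vectors. Names should be neutral, but corpus
# co-occurrence patterns can push them toward positive or negative
# regions of the embedding space.
import numpy as np

embeddings = {
    "great":    np.array([ 1.0,  0.2]),
    "awful":    np.array([-1.0, -0.2]),
    "dinner":   np.array([ 0.1,  0.0]),
    "emily":    np.array([ 0.4,  0.1]),   # hypothetical inherited association
    "shaniqua": np.array([-0.3, -0.1]),   # hypothetical inherited association
}

sentiment_axis = embeddings["great"] - embeddings["awful"]

def sentiment(text):
    # Average the word vectors, then project onto the great-awful axis.
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return float(np.mean(vecs, axis=0) @ sentiment_axis)

print(sentiment("dinner with emily"))     # comes out positive
print(sentiment("dinner with shaniqua"))  # comes out negative: inherited bias
```

No one wrote a racist rule here; the bias rides in on the embeddings, which is the point of the title.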

Further reading

Machine Bias

by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner

Ethics and Data Science

by Mike Loukides, Hilary Mason, DJ Patil
(Free Kindle download)

Weapons of Math Destruction
How Big Data Increases Inequality and Threatens Democracy

by Cathy O’Neil

Algorithms of Oppression
How Search Engines Reinforce Racism

by Safiya Umoja Noble

Parting thoughts

  • At some point during your data science learning journey you will learn tools that can be used unethically
  • You might also be tempted to use your knowledge in ethically questionable ways, whether because of business goals, the pursuit of further knowledge, or because your boss told you to

How do you train yourself to make the right decisions (or reduce the likelihood of accidentally making the wrong decisions) at those points?

Do good with data

Further watching