Security And Threat Modeling (All Too Briefly)

Today's notes are more of an outline. They are accompanied by the slides.

Example 1: The Parler Data Leak

Some time ago, hackers were able to obtain nearly all data from the Parler website. It turns out that this was technically pretty easy to do, because of a variety of engineering decisions. These included:

  • insecure direct object references: post IDs were consecutive, and so a for loop could reliably and efficiently extract every post on the site (see the sketch after this list).
  • no rate limiting: the above-mentioned for loop could run quickly.
  • failure to scrub geo-location data from uploaded images: the physical location of users was compromised.
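
To make the first two bullets concrete, here's a minimal sketch (in Python) of the kind of loop involved. The endpoint and ID range are made up for illustration; the point is that consecutive IDs plus no rate limiting reduce "download everything" to simple counting.

```python
import requests  # third-party HTTP client (pip install requests)

# Hypothetical endpoint; the real API differed, but the shape of the attack is the same.
BASE_URL = "https://social-site.example/api/posts/{post_id}"

def scrape_all_posts(max_id: int) -> list[dict]:
    """Walk the ID space in order and collect every post that exists.

    Consecutive IDs mean there's no guesswork, and with no rate limiting
    there's nothing to slow the loop down.
    """
    posts = []
    for post_id in range(1, max_id + 1):
        response = requests.get(BASE_URL.format(post_id=post_id))
        if response.status_code == 200:
            posts.append(response.json())
    return posts
```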

Multiple security mistakes, combined, tend to add up to more than the sum of their individual impacts.

Example 2: Phishing Email

Here's a screenshot of a real email someone received a couple of years ago.

There was a lot more to follow, carefully crafted to trigger fear and loss aversion in the recipient. The sender claimed that they would toss the (salacious and embarrassing) information they had in exchange for a bit of money... in Bitcoin, of course.

This is an example of a social engineering attack: no technical aspect of a system is compromised, but still its human or institutional facets can be hacked. Social engineering is pretty broad, but other examples include a hacker calling Google to reset your password, or a trojan-horse USB drive being left in a bank parking lot.

Example 3: Taxis

Today's discussion: Of Taxis and Rainbows. There's also a corresponding Hacker News thread.

Let's talk about the core anonymization issue that the article reported on.

  • What's in the data, and what was anonymized?
  • What did they do to anonymize the data before release?
  • What went wrong? (How was the anonymization circumvented? See the sketch after this list.)
  • Is this attack a generally useful technique? For what?
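
On the "what went wrong" question: the release "anonymized" taxi medallion and hack license numbers by MD5-hashing them, but valid medallion numbers follow a handful of short, known formats, so the entire space can be enumerated and hashed into a reverse-lookup table in minutes (the "rainbows" of the title). Here's a minimal Python sketch using one simplified medallion format; see the article for the real formats and details.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def md5_hex(value: str) -> str:
    """MD5 hash of a string, as lowercase hex (the form used in the released data)."""
    return hashlib.md5(value.encode()).hexdigest()

def build_lookup_table() -> dict[str, str]:
    """Precompute hashes for one (simplified) medallion format: digit, letter, digit, digit.

    That's only 10 * 26 * 10 * 10 = 26,000 candidates; even covering every
    real format is tiny by brute-force standards.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        medallion = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(medallion)] = medallion
    return table

# "De-anonymizing" a record is then just a dictionary lookup:
# lookup = build_lookup_table()
# original = lookup.get(hashed_medallion_from_dataset)
```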

Takeaways:

  • True anonymization is hard, and it's difficult if not impossible to undo mistakes.
  • Often the legitimate interests of historians, social scientists, journalists, and others compete with privacy interests. There is not always a single right answer, and there certainly isn't an absolute answer that is right for all such cases.
  • When presented with such a situation in tech, think carefully. Seek advice, if it's available, on both the technical and non-technical angles. Here, the FOIA request was answered in 2 days.

Leaving the anonymization problems aside, what else could an adversary do with this data?

  • What about the passengers? There's no identifying information, right?
  • Is there any risk of correlation with other data?

Let's analyze this Hacker News exchange.

There were 2 arguments near the top that I thought were interesting:

  • "NYC is too dense for reasonable correlation"
  • "nobody who lives in a non-dense part of NYC can afford to take a taxi anyway"

New York City is a lot more than skyscrapers. It includes, say, Staten Island:

Here's a random Google street view:

Take Malte's 2390, Julia's 1952B, and other such courses if you think this sort of thing is interesting. Most importantly, if you think of "social implications" as something separate from engineering, stop. The two aren't always inseparable, but they frequently are. As with much else, there's nuance.

Are there possible defenses against this?

Assume that you want to give weight to the historian's goals—you can't just say "no". We'll also ignore the specific obligations of the Freedom of Information Act to explore the design space.

Think, then click.

The historian needs data at the granularity of individual trips.

Here are some potential mitigations. Of course, if you find yourself in this situation for real, consult with experts rather than these notes. Crucially, none of these ideas is "perfect". Different situations warrant different approaches.

  • Reduce the spatial detail: snap locations to a grid or add random noise (see the sketch after this list). This might, however, affect the ability to tell (e.g.) at which corners people tend to hail taxis. Perhaps we could adjust the level of noise based on the density of the area.
  • Reduce the temporal detail: report rides only at (e.g.) the per-hour granularity. This causes other problems for the historian, but also helps to preserve privacy.
  • Provide opt-out options, and provide both individual trips (for those who don't opt out) and aggregate data (for those who do). This is admittedly difficult to implement for this specific case.
  • Don't provide fields where certain very rare values (vs. the overall corpus) could aid deanonymization. Classic examples often involve self-reported ethnicity, gender identity, etc. Suppose I asked, in an anonymous feedback form, for your country of origin along with the usual survey questions. Then, unless country of origin is pretty uniform across respondents, I might be able to deanonymize the data.
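
Here's a rough Python sketch of the first two mitigations: grid snapping, random noise, and hour-level timestamps. The cell size and noise scale are arbitrary numbers chosen for illustration, not recommendations; a real release would need expert review.

```python
import random

def snap_to_grid(lat: float, lon: float, cell_degrees: float = 0.005) -> tuple[float, float]:
    """Round a coordinate pair to the nearest point on a fixed grid.

    0.005 degrees of latitude is roughly half a kilometer; in practice the
    cell size might vary with the density of the neighborhood.
    """
    return (round(lat / cell_degrees) * cell_degrees,
            round(lon / cell_degrees) * cell_degrees)

def add_noise(lat: float, lon: float, scale: float = 0.002) -> tuple[float, float]:
    """Perturb a coordinate pair with small uniform random noise."""
    return (lat + random.uniform(-scale, scale),
            lon + random.uniform(-scale, scale))

def coarsen_pickup_hour(hour: int, minute: int) -> int:
    """Report only the hour of the pickup, discarding the minutes."""
    return hour

# Example: a pickup near a (hypothetical) Manhattan corner.
print(snap_to_grid(40.748817, -73.985428))
print(add_noise(40.748817, -73.985428))
print(coarsen_pickup_hour(14, 37))
```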

Don't Forget the People Involved

Here's a classical experiment in psychology.

Variation 1

Suppose you're shown four cards. The deck these cards have been taken from has colors on one side (either yellow or red), and numbers (between 1 and 100) on the other. At first, you only see one side of each. Perhaps you see:

1, 2, yellow, red

You're asked to determine if these four cards obey a simple rule: if the card's number is even, then the card's color is red. The question is, which cards do you need to turn over to check the rule?

Variation 2

Suppose you're shown four cards. The deck these cards have been taken from has drink orders on one side (e.g., water, juice, or beer), and ages (between 1 and 100) on the other. At first, you only see one side of each. Perhaps you see:

beer, water, 17, 23

You're asked to determine if these four cards obey a simple rule: if the drink order is alcoholic, then the age is no less than 21. The question is, which cards do you need to turn over to check the rule?

Psychological Context

These are called [Wason Selection Tasks](https://en.wikipedia.org/wiki/Wason_selection_task) and have been studied extensively in psychology.

Surprisingly, humans tend to do poorly on the first task, but far better on the second. Why? Because context matters. If you can catch someone outside of their security mindset with an apparently rote task, you have a higher chance of exploiting them.

Good security involves far more than just the software. But whether we're talking about software or the human brain, bias can be attacked, and the attacker can abuse leaky abstractions.

A Cognitive Tool: Threat Modeling

We tend to think of security as being about confidentiality. But that's too limiting. Are there other kinds of security goals you might want?

Here are 6 very broad classes of goal; the threats they defend against spell out the S.T.R.I.D.E. framework. More on these soon.

Authentication (vs. Spoofing)

Integrity (vs. Tampering)

Non-repudiation (vs. Repudiation)

Confidentiality (vs. Information Disclosure)

Availability (vs. Denial of Service)

Authorization (vs. Elevation of Privilege)

Elevation of Privilege