Some of these are probably fingerprinting but the Twitch one isn't (I worked on the video system at Twitch for a number of years).

"player-core-variant-....js" is the Javascript video player and it uses WebGL as a way to guess what video format the browser can handle. A lot of the times mobile android devices will say "I can play video X, Y, Z" but when sent a bitstream of Y it fails to decode it. WebGL allows you to get a good indication of the supported hardware features which you can use to guess what the actual underlying hardware decoder is.

This is sadly the state of a lot of "lower end" mobile SoCs. They will pretend to do anything but in reality it just... doesn't.



JavaScript legitimately needs to know about its runtime environment. The problem with fingerprinting is not the act of examining the environment itself, but rather sending the results of that examination to the server as a form of identification. I would rather confront the second problem more directly with a "per tab Little Snitch" type solution that eliminates communication that is not in the user's interest, rather than eliminate fingerprinting, for precisely the reasons you give, slashink.


Unfortunately web apps can defeat that by bundling unwanted telemetry type stuff in with other API calls, directly or by batching. So, it would need to be a complex tool to deal with that or suffer limited applicability. Perhaps if it stayed under the radar it could be effective in many cases without instigating countermeasures by site owners.


Let's not make perfect the enemy of good. Counter-measures like user-interactive, request-specific firewalls (like Little Snitch) can always be defeated by a motivated malefactor willing to commit resources. That does not mean it isn't worth doing.

Consider that virtually all physical locks are trivial to pick by someone who knows how (see youtube.com/lockpickinglawyer), and yet we still use locks. Pickable locks improve security because they increase the cost to the attacker enough that it deters most attacks.


The point is there isn't any extra cost nor difficulty in circumventing the checks you describe. You just run a library.


I write webapps for a living. If a browser plugin wanted to selectively allow XHR/fetch calls based on payload, there is very little I could do about it.

The implementation might be to have your plugin content script wrap the DOM XHR/fetch in a proxy. The proxy runs a predicate on the payload, and if true, lets it go through. The predicate would be something like "No PII", which would also imply that the traffic be unencrypted.

An app could remove the proxy. But it seems to me that most people wouldn't bother. It's also possible that there are other mechanisms, for example a special Plugin IO API that cannot be changed by content scripts.
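The wrapping step described above might look like the sketch below. The `containsPII` predicate and the blocking policy are stand-ins of mine, not a real extension's API; a production predicate would need to be far more robust.

```javascript
// Naive stand-in predicate: flag any payload that looks like it
// contains an email address. A real "No PII" check would be much broader.
function containsPII(body) {
  return /[^\s@]+@[^\s@]+\.[^\s@]+/.test(body || "");
}

// Wrap a fetch-like function so requests whose string body fails the
// predicate are rejected before they ever hit the network.
function wrapFetch(realFetch, predicate) {
  return function (url, opts = {}) {
    const body = typeof opts.body === "string" ? opts.body : "";
    if (predicate(body)) {
      return Promise.reject(new Error("blocked: payload failed predicate"));
    }
    return realFetch(url, opts);
  };
}

// In a content script you would install it roughly like this:
if (typeof window !== "undefined") {
  window.fetch = wrapFetch(window.fetch.bind(window), containsPII);
}
```

As noted, a determined page can still grab a clean `fetch` from a fresh iframe, so this raises the bar rather than closing the door.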


> I write webapps for a living.

I’d imagine most people in this thread do or have. Myself included. It’s a pretty massive industry :)

What you’re missing is that whatever you do to remove fingerprints itself adds a unique metric to fingerprint. This is compounded by how easy, cheap, and legal it is to add fingerprinting tech to one’s site. Literally the only way to break fingerprinting is if the majority of the web browsing population ran systems that randomised fake responses. But as it stands at the moment, it’s possible to:

1. Identify when a plug-in is overloading a builtin function

2. Identify which users are consistently doing so, because so few are, and there are methods of fingerprinting that exist outside of your JS VM.

I don’t have the link to hand but there’s a website that you can visit and it tells you how identifiable you are. I used to think it was possible to hide until I visited that site and then I realised just how many different metrics they collect and how a great many of them are literally impossible to block or rewrite without breaking the website entirely.


You may have been thinking of this one: https://coveryourtracks.eff.org/


It was this one: https://www.amiunique.org/fp

It goes into more detail than the EFF link where it breaks down your uniqueness per each metric (and how much entropy each metric adds) as well as giving you an overall uniqueness.

It's a fantastic but also scary resource :)


The cover your tracks implementation also breaks down uniqueness per category, per metric.

They are nice resources, but don't get too scared!

Frankly, both are exaggerating a little - e.g. including stuff like browser version numbers which only appear as unique as they do because the time-span they cover is long enough to overlap update cycles (AmIUnique even seems to have it cover the entire history by default??? That's just noise), yet not stable for more than a short period of time. AmIUnique includes the exact referer, which is likely not nearly as useful as that would make it seem.

Then there's stuff like "Upgrade Insecure Requests" and "Do not track", which is likely extremely highly correlated with browser version choice.

Both sites can't really tell you how reliable the identification is, only how unique you are at this moment. And that matters a lot, because if identification is unreliable (i.e. the same person in some metric has multiple distinct fingerprints) the end result is that for reliable overall identification a fingerprinter may need many times as many bits of entropy as a naive estimate might assume, especially if visits are occasionally sparse and thus changes to fingerprints may frequently come all at once.

Clearly over the very short term you are likely uniquely identifiable as a visitor. However, it's less clear how stable that is.


uMatrix. It does what you describe and I always use it.

But the solution isn't good per se.

It provides a high level of granularity and could theoretically provide even more. But it's already an advanced tool that an average user will never use.


uMatrix is end of life, is it not?


Not all JavaScript does, and the kind that does isn't necessarily something I asked for. I would be plenty happy if GitHub and Stripe couldn't show their 3D globe animations until I request them, for the sake of privacy.


After realizing what an unbelievable CPU hog the GitHub globe is, I simply added an element hiding rule for that crap. Not sure if element hiding rules help prevent fingerprinting from such sources.


How would that work? Once JS has the information, it can send it anywhere.


Incidentally, part of my anti-fingerprinting script looks like this:

    let UNMASKED_RENDERER_WEBGL = ["ANGLE (AMD Radeon HD 7900 Series Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (Intel(R) HD Graphics 4000 Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (Intel(R) HD Graphics 4600 Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (Intel(R) HD Graphics 520 Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (Intel(R) HD Graphics 530 Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (Intel(R) HD Graphics Family Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (NVIDIA GeForce GTX 960 Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (NVIDIA GeForce GTX 1070 Direct3D11 vs_5_0 ps_5_0)",
                               "ANGLE (NVIDIA GeForce GTX 760 Direct3D11 vs_5_0 ps_5_0)"]

those were the most popular desktop GPUs according to Steam a year or two ago.
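For illustration, a hedged sketch of how such a list might be wired in. The helper name and the trimmed list are mine, and 0x9246 is the `UNMASKED_RENDERER_WEBGL` enum value from the WEBGL_debug_renderer_info extension:

```javascript
const COMMON_RENDERERS = [
  "ANGLE (Intel(R) HD Graphics 4000 Direct3D11 vs_5_0 ps_5_0)",
  "ANGLE (NVIDIA GeForce GTX 1070 Direct3D11 vs_5_0 ps_5_0)",
  "ANGLE (AMD Radeon HD 7900 Series Direct3D11 vs_5_0 ps_5_0)",
];

// Replace a WebGL context's getParameter so the UNMASKED_RENDERER_WEBGL
// query (0x9246) returns a popular renderer string instead of the real one.
function maskRenderer(gl, renderers = COMMON_RENDERERS) {
  const real = gl.getParameter.bind(gl);
  const fake = renderers[Math.floor(Math.random() * renderers.length)];
  gl.getParameter = (param) => (param === 0x9246 ? fake : real(param));
  return gl;
}
```

As discussed elsewhere in the thread, a naive override like this is itself detectable via `toString()`.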


Won't a custom script make you special and thus very easy to precisely fingerprint?

I can imagine that having several typical configs and switching between them at random would help blend in.


>I can imagine that having several typical configs and switching between them at random would help blend in.

You have to be careful with that too. An anti-anti-fingerprinting implementation can record the values and compare them across visits to see whether they stay the same. If they change every few months that's reasonable (eg. changing hardware), but if they change every day or every week there's most certainly spoofing involved.
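A toy sketch of what that server-side check could look like. This is entirely hypothetical; the function name, data shape, and the "4 changes per year" threshold are all mine:

```javascript
// visits: [{ day, fingerprint }] sorted by day. Real hardware changes a
// few times a year at most; much faster churn suggests spoofing.
function looksSpoofed(visits, maxChangesPerYear = 4) {
  if (visits.length < 2) return false;
  let changes = 0;
  for (let i = 1; i < visits.length; i++) {
    if (visits[i].fingerprint !== visits[i - 1].fingerprint) changes++;
  }
  const spanDays = visits[visits.length - 1].day - visits[0].day;
  if (spanDays <= 0) return false;
  return changes / (spanDays / 365) > maxChangesPerYear;
}
```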


It should change every few requests. The point is not to conceal spoofing, but to foil attempts to fingerprint.

Maybe an explicit sign of spoofing is even better, it sends a message in a gentle way.

Unspoofing on the server side, if at all possible, would likely be too expensive for whatever gain it might bring.


Unless a major anti-fingerprinting solution uses the same list of GPUs as you, doing this puts you in a tiny bucket and provides massive entropy to trackers, possibly even enough to exactly identify you given many WebGL calls.


You could seed your random number generator with a hash of the hostname, guaranteeing consistency across all the random values you return to the one host.
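A sketch of that approach using a simple string hash (djb2) to seed a small deterministic PRNG (mulberry32). Both are common public-domain snippets; the wiring is illustrative:

```javascript
// djb2 string hash: turns a hostname into a 32-bit seed.
function djb2(str) {
  let h = 5381;
  for (let i = 0; i < str.length; i++) h = ((h * 33) ^ str.charCodeAt(i)) >>> 0;
  return h;
}

// mulberry32: tiny seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same hostname -> same seed -> same spoofed values on every visit.
const host = typeof location !== "undefined" ? location.hostname : "example.com";
const rand = mulberry32(djb2(host));
```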


Anti-anti-fingerprinting? :/

I feel like going there and telling them to stop following me around.


You can't do that because of the anti-anti-anti-fingerprinting, I know this because of reasons


Does it matter if they know you're spoofing, as long as it prevents them from linking two separate sessions together?


you'll run into this problem: https://xkcd.com/1105/


Would you be willing to share this script?


Don't bother. It's hard to do it correctly. If you look through the snippets (or the MDN docs[1]), the value is retrieved using the getParameter() function. You might be tempted to override the function by doing something like

    gl.getParameter = () => "test"
but that's easily detectable. If you run

    gl.getParameter.toString()
You get back

    "() => "test""
whereas the original function you get back

    "function getParameter() { [native code] }"
In general, don't try to fix fingerprinting via content scripts[2]. It's very much detectable. Your best bet is a browser that handles it natively.

[1] https://developer.mozilla.org/en-US/docs/Web/API/WEBGL_debug...

[2] https://palant.info/2020/12/10/how-anti-fingerprinting-exten...


You can easily hide it by hijacking Function.prototype.toString to check whether `this == fake gl.getParameter or this == fake toString`. Then the JS code needs to find a real Function.prototype.toString by creating an iframe, but then you can detect that too. Then I'm out of ideas on how to rescue the original toString.
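A sketch of that hijack, assuming a `fakes` map from each spoofed function to the native-looking string it should report (the names here are mine):

```javascript
const nativeToString = Function.prototype.toString;

// Install a toString that lies about the spoofed functions and about
// itself, and defers to the real implementation for everything else.
function installToStringMask(fakes) {
  function maskedToString() {
    if (fakes.has(this)) return fakes.get(this);
    if (this === maskedToString) {
      return "function toString() { [native code] }";
    }
    return nativeToString.call(this);
  }
  Function.prototype.toString = maskedToString;
}
```

As the comment says, this still loses to a clean `toString` fetched from a fresh iframe.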


So the issue is that the fingerprinting code can detect the anti-fingerprinting code? Doesn't that mean the best solution is for everyone to override the same functions with the same dummy information?


This can be fixed by overriding valueOf() and toString() on the prototype. Just return another native function, like JSON.stringify ;)


Sadly there are still things you can't programmatically override/proxy, like the StorageManager:

    await navigator.storage.estimate()


gl.getParameter.toString = () => 'function getParameter() { [native code] }'


    -> gl.getParameter.toString.toString()
    <- "() => 'function getParameter() { [native code] }'"
Not to mention the iframe trick mentioned in palant's article.


Is that Firefox? In Chrome I get

    gl.getParameter.toString = () => 'function getParameter() { [native code] }'
    gl.getParameter.toString()
    "function getParameter() { [native code] }"
    gl.getParameter.toString().toString()
    "function getParameter() { [native code] }"
    gl.getParameter.toString().toString().toString()
    "function getParameter() { [native code] }"
iframes, worker, sharedworker, serviceWorker are all covered. Good luck timing the difference.


You're running

  gl.getParameter.toString().toString()
what the comment you're replying to is trying to tell you to run is:

  gl.getParameter.toString.toString()
Call toString on the toString function, not on its result.



nice try NSA


Do you present a set of 9 GPUs or pick one at random?


UNMASKED_RENDERER_WEBGL[Math.floor(Math.random()*UNMASKED_RENDERER_WEBGL.length)]


Agreed: fingerprinting is using ways one browser or device consistently differs from others to derive a stable identifier.

Several others on this list are also not used for fingerprinting, and are instead detecting robots/spam/abuse. Unfortunately, there isn't any technical way for the public to verify that, because client-side all it looks like is collecting a bunch of high-entropy signals.

All the major browsers have said they intend to remove identifying bits to where fingerprinting is not possible, which will also make these other uses stop working.


That was always a short-term hack anyway. The limiting case is that spammers simply pay humans somewhere to continue whatever "abuse" was formerly automated, as currently happens with captchas.


Almost all spam is not profitable enough to pay humans to do every step manually, so if you make it that expensive, it's the same thing as winning.


This is interesting. I can see how it could have legitimate uses to get around bad implementations that misstate what capabilities are possible.

However, that being said, it's like fingerprinting to sort out what their system really has. Still an abuse of the system.


That's a fair take. In a sense it is a "fingerprinting" method, although I personally think fingerprinting implies using this data to track devices between contexts.

If you're interested in why this data was exposed in the first place, the MDN docs have good info: https://developer.mozilla.org/en-US/docs/Web/API/WEBGL_debug... . "Generally, the graphics driver information should only be used in edge cases to optimize your WebGL content or to debug GPU problems."

Sadly that's the reality of some of these tools. The intent was good and in many cases they are a necessity to create a web experience that works on every device. On the flip side this allows people to use this data to fingerprint.


As per my sibling (cousin?) comment, pretty much any legitimate capability-determining data can be used for fingerprinting.

This can only be solved with legislation, IMO. There is no way for an industry to self-regulate something like this; the candy bowl is too big and the candy too sweet.


Yeah, I think the world would be better off without ads altogether. Just allow people to search for what they need or want by themselves, that's enough.


...and pay out of pocket for the use of a search engine? Well, I'd do that; e.g. neeva.com is accepting sign-ups.

Also, ads without personal targeting, much like dead-trees newspaper / magazine ads, could still work and prop up certain web publishers.


>...and pay out of pocket for the use of a search engine?

That's the only way service providers will see a cent from me going forward. Ads present not only a privacy risk but they're increasingly becoming a security risk too. I will not allow them on any of the devices I own or that connect to my home WiFi.


I think that static ads served from the first-party CDN are security-wise no worse than the content itself.

Blocking scripts and requests by third-party ad networks makes complete sense from security perspective, though.

Affiliate links going directly to relevant item pages in a store are fine by me, too. They have to be relevant for anyone to click on them, they don't play video or make sound, etc. They do give some tracking opportunity, but without third-party cookies and third-party requests, it's hard to achieve anything resembling the privacy-invading precise tracking which current ad networks routinely do.

In any case, I much prefer the absence of ads and an honest donation button.


To make fingerprinting illegal, you mean? Or were you thinking of some other way to solve the problem with regulation?


More like (the spirit of) GDPR, where data collection itself becomes a legal and financial liability to point where it's not worth it to collect and retain it for a typical entity.


So absolutely no data on the device obtained via WebGL is used for marketing or other BI workloads? None of it is shared with third parties (especially advertisers)? It's entirely used for the user experience and then discarded when no longer relevant?

FWIW while I'm playing hardball here I really appreciate your answer and expertise.


I don't work at Twitch anymore, so I can't answer your question without guessing, and I'd rather not.

There is always a likelihood that data gets used for reasons beyond the original purpose. In the best of worlds, the hardware that runs on consumers' devices would do the right thing, which would allow the web to be a perfect sandbox. I think we're slowly getting there; in terms of video, it's slowly getting to a point where H.264 support is "universally true" rather than a minefield, although VP9 and AV1 are a bit of a return to square one.

I think the spirit of my original comment was not to say "I promise that X company isn't doing Y" more to explain why this code existed in the first place. A search engine doesn't need to know what WebGL capabilities a device has as it doesn't deal with rendering whereas a site that has to work with hardware decoders most likely does need to know.


Just looked at it; it only triggers when the player encounters an error while decoding video.


> FWIW while I'm playing hardball here I really appreciate your answer and expertise.

You’re actually just outright accusing them of lying.


I didn't interpret the comment as an accusation against my original comment.

It's good to ask the hard questions and even though I'm not able to answer it in detail I still think that 'tmpz22' brought up a good point in that data can be used for both good and bad at the same time.


"Outright" is literal; this isn't outright, it's just challenging their answer, and journalistically it's a good sequence of questions.

You need to be able to tell the good from the bad and IMO you're wearing these trousers the wrong way round.


Considering companies have lied before when it comes to privacy & ad tracking (see Facebook's "promise" of not using 2FA phone numbers for advertising purposes) his concerns are totally reasonable.


Do you see those question marks? Those denote questions. Questions are not accusations.


Have you stopped beating your wife?

Edit: this is the typical example of a loaded question, not an actual accusation against the parent comment https://en.m.wikipedia.org/wiki/Loaded_question


Really? never?

So when did you stop beating your wife?


I understand the point you are trying to make, with the question incorporating an accusation (i.e. that the person beat their wife in the past). That is different from asking pointed questions about potential actions (e.g. "have you ever beat your wife?").


Tell that to Socrates.


I'm curious if something like [1] would work for those SoCs, e.g. ask if it supports "video/mp4-liar-liar-pants-on-fire-from-twitchtv"

[1]: https://devblogs.microsoft.com/oldnewthing/?p=40663


Good question!

So the problem here is a bit different. It's not that devices will say "I can play Format X" and then not play it. It's that devices say "I can play Format X at Resolution A, B, C". When you give the device resolution A and B it succeeds but at resolution C it fails to decode it.

In H.264 this would be the "Level": https://en.wikipedia.org/wiki/Advanced_Video_Coding#Levels . A device may say that it can decode Level 4.2 but in reality it can only do 4.1. That means it can play back 1080p30 but not 1080p60. The only way to know is to actually try and observe the failure (which is often a silent failure from the browser's point of view, meaning you need to rely on user reports).
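To make the level business concrete: in the RFC 6381 codec strings that MSE consumes, the last two hex digits are the H.264 level_idc, so the gap between "claims 4.2" and "decodes 4.1" is literally one byte in the capability query. A small illustrative helper (mine, not Twitch's):

```javascript
// "avc1.640029" -> level_idc 0x29 = 41 -> "4.1" (1080p30-ish ceiling)
// "avc1.64002A" -> level_idc 0x2A = 42 -> "4.2" (1080p60)
function avcLevel(codec) {
  const idc = parseInt(codec.slice(-2), 16);
  return (idc / 10).toFixed(1);
}

// In the browser, the (unreliable) capability claim looks like:
if (typeof MediaSource !== "undefined") {
  const claims42 = MediaSource.isTypeSupported('video/mp4; codecs="avc1.64002A"');
  // true here means "the driver says yes", not "decoding will work"
}
```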


Wouldn't it be just as easy to test videos in formats A, B, and C and see if they play? You could check that video.currentTime advances. If it lies about that you could draw to a canvas and check the values. That seems more robust than checking WebGL.


Also a good question.

The issue here is the architectural difference between the hardware decoder and the GPU. What happens under the hood with MSE ( https://developer.mozilla.org/en-US/docs/Web/API/Media_Sourc...) is that you are responsible for handing off a buffer to the hardware decoder as a bitstream. Underneath, the GPU sets up a texture and sends the bitstream to the hardware decoder that's responsible for painting the decoded video into that texture.

What often ends up happening is that the GPU driver says "yes, the hardware decoder can do this", accepts the bitstream, and sets up the texture for you, which is bound to your canvas in HTML. It starts playing the video and moves the timeline playhead, but the actual buffer is just an empty black texture. From the software's point of view, the pipeline is doing what it's supposed to; because the hardware decoder is a black box from the JavaScript perspective, it's impossible to know if it "actually" worked. Good decoders will throw errors or refuse to advance the PTS; bad decoders won't.

Knowing this, your second suggestion was to read back the canvas and detect video. That would work but the problem here is "what constitutes working video". We can detect if the video is just a black box but what if the video plays back but at 1 frame per second, or plays back with the wrong colors. It's impossible to know without knowing the exact source content, a luxury that a UGC platform like Twitch does not have.

For this reason just doing heuristics with WebGL is often the "best" path to detecting bad actors when it comes to decoders.


My point with the video to canvas is if you create samples of the various formats in various resolutions then you can check a video with known content (solid red on top, solid green on left, solid blue on right, solid yellow on bottom) and check if that video works. If it does then other videos of the same format/res should render? I've written conformance tests that do this.

At worst it seems like you'd need to do this once per format per user per device but only if that user hasn't already had the test for that video size/format. (save a cookie/indexed-db/local-storage that their device supports that format) so after that only new sizes and formats need to be checked.

Just an idea. No idea what problems would crop up.
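A sketch of the pixel-probe step of that idea. In the browser, `readPixel` would come from `ctx.drawImage(video, 0, 0)` plus `ctx.getImageData`; the probe points and tolerance here are my guesses:

```javascript
// Check four probe points of a decoded test frame against the clip's
// known quadrant colors (red top, green left, blue right, yellow bottom).
// Tolerance absorbs chroma subsampling and colorspace conversion error.
function decodeLooksCorrect(readPixel, w, h, tol = 60) {
  const close = (a, b) => Math.abs(a - b) <= tol;
  const matches = ([r, g, b], [er, eg, eb]) =>
    close(r, er) && close(g, eg) && close(b, eb);
  return (
    matches(readPixel((w / 2) | 0, (h * 0.1) | 0), [255, 0, 0]) &&
    matches(readPixel((w * 0.1) | 0, (h / 2) | 0), [0, 255, 0]) &&
    matches(readPixel((w * 0.9) | 0, (h / 2) | 0), [0, 0, 255]) &&
    matches(readPixel((w / 2) | 0, (h * 0.9) | 0), [255, 255, 0])
  );
}
```

An all-black frame from a lying decoder fails every probe.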


That said, surely this "functional" information can also be valuable fingerprinting data, no? What's stopping an enterprising data science team from pulling it into their data lake, using to build ad models for logged-out users, maybe submitting it to 3rd-party machine learning vendors, etc. and generally making it undeletable?


... but that is sorta fingerprinting, right? At least, inferring a user's device capabilities via WebGL is hardly using WebGL itself.

I think it's a beautiful, legitimate use, but I can't fault the author for labeling it fingerprinting.


Is fingerprinting not trying to uniquely identify a user?


Since there is no way to 100% fingerprint a device, therefore there is no way to uniquely identify anyone with 100% confidence using pure fingerprinting techniques.

My view is that fingerprinting is a set of tools which can be used for "good or evil" if that makes sense. If you are gathering meta-data to determine the capabilities of the device, then this is part of the wider framework of data points which can, in principle, be used for fingerprinting a user. This data can be imported into a completely different system by a sophisticated adversary, so it needs to be treated as a security vector, imho


>Since there is no way to 100% fingerprint a device, therefore there is no way to uniquely identify anyone with 100% confidence using pure fingerprinting techniques.

Pedantic point, so forgive me, but 100% uniquely identifying a device does not imply 100% uniquely identifying the user of the device. We call them User-Agents for a reason. Anyone could be using it.

It's critical people not fall into the habit of conflating users and user-agents. Two completely different things, and increasingly, law enforcement has gotten more and more gung-ho about surreptitiously forgetting the difference.

Ad networks and device/User-Agent based surveillance only makes it worse.

There are several initiatives to implement UUIDs for devices. There is the Android Advertising ID, systemd's machine-id file, and Intel burns a unique identifier into every CPU.

IPv6 (without address randomization) would also work as a poor man's UUID.

It's frighteningly easy, and you'll be surprised how unintentionally one can be implementing something seemingly innocent and end up furthering the purposes of those seeking to surveil.


You could fingerprint the user as well:

- look at the statistical behavior of how they operate the mouse

- estimate their reading speed based on their scrolling

- for mobile devices, use the IMU to fingerprint their walking gait and angle at which they hold the phone (IMU needs no permissions)

- measure how the IMU responds at the exact moment a touch event occurs. this tells you a quite a bit about how they hold their phone

- if they ever accidentally drop their phone, use the IMU to detect that and measure the fall time, which tells you the distance from the ground to the height they held the phone. then assuming the phone is held normal to the eyes, you can use the angle they held the phone to extrapolate the location of the eyes and estimate the user's approximate height
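Just the first step of the gait item above could be sketched like this. It's a toy feature extraction; real gait fingerprinting uses far richer features, and every name and rate here is illustrative:

```javascript
// Estimate dominant oscillation frequency (roughly, steps per second)
// from accelerometer magnitude samples via zero crossings about the mean.
function stepFrequency(samples, sampleRateHz) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  let crossings = 0;
  for (let i = 1; i < samples.length; i++) {
    if ((samples[i - 1] - mean) * (samples[i] - mean) < 0) crossings++;
  }
  // two zero crossings per full oscillation
  return crossings / 2 / (samples.length / sampleRateHz);
}

// In the browser, devicemotion delivers the samples (with no permission
// prompt on many platforms, though iOS now gates it):
if (typeof window !== "undefined") {
  const mags = [];
  window.addEventListener("devicemotion", (e) => {
    const a = e.accelerationIncludingGravity;
    if (a && a.x != null) mags.push(Math.hypot(a.x, a.y, a.z));
  });
}
```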


That's a lot of extraneous data to be adding to a stream leaving the phone (or dumping to a locally stored DB file), but you're technically correct, though not infallibly so.

The level of noise is incredibly problematic. My leading cause of dropped phone, for instance, is forgetting I have it in my shirt pocket, on my lap, off my desk, or from my back pocket if I don't put it in just right. Am I a different person in each of those circumstances? The statistical answer would be no, but that answer only comes from widening the scope of collected data. Suppose I fiddle with it? Dance with it? Have a habit of leaving it in a car? Without a control, you have a different set of relative patterns. At best you know there is a user with X. Yes, you can make some statistical assumptions, but at best, when it really counts, it still needs to line up with a hell of a lot of other circumstantial datapoints to hold water.

Furthermore, I guarantee not a single person would dare make any high-impact assumption based on that metadata, given that once it gets out, it's so adversarially exploitable it isn't even funny. Imagine a phone unlock you could do just by changing your gait. Or worse, a phone that locks the moment you get a cramp or blister. Madness. Getting different ads because you started walking like someone else for a bit. Do I become a different person because I try to read something without my glasses, or dwell on a passage to re-read it several times? Or blaze through a section because I already know where it is going? These are not slam-dunk "fingerprints" by a long shot. More like corroborating data than anything else, and in that sense, even more dangerous, because people are not at all naturally inclined to look at these things with a sense of perspective.

It can lead a group of non-data-savvy folks to thinking there is a much cleaner, tighter case than there necessarily is, and on top of that, it mandates that people be okay with the gathering of that data in the first place, which has only been acceptable up until now because there was no social imperative to disclose that collection.

Going off on a tangent here, so I'll close with the following.

There is the argument to be made that that exact kind of practice is why defensive software analysis should be taught as a matter of basic existence nowadays. If I find symbols that line up with libraries or namespaces that access those resources, why should I be running that software in the first place?

I can't overstate this: over 90% of the software I come across, I won't even recommend anymore without digging into it first. There's just too much willingness to spread data around and repurpose it for revenue extraction. It does more harm than good. What people don't know can most certainly hurt them, and software is a goldmine for creating profitable information asymmetries.


> My leading cause of dropped phone, for instance is forgetting I have it in my shirt pocket, on my lap, off my desk, or from my back pocket if I don't put it in just right. Am I a different person in each of those circumstances? The statistical answer would be no, but only cones from widening the scope of collected data. Suppose I fiddle with it? Dance with it? Have a habit of leaving it in a car? Without a control,

Oh, but all of these can be added to your statistical model and learned over time! If we figure out that you suddenly walk with a limp, and all the other metrics match, we can recommend painkillers! Or if the other metrics match and you start dancing, we start recommending dance instructors! Hell, we can even figure out how well you dance using the IMU and recommend classes of the appropriate skill level.

For a recommendation system, like ads, the consequences of mis-identification wouldn't be that high either. You'd still target much better than random, which is the alternative in the absence of fingerprinting.


This is an excellent point! Thank you for pointing that out +1


Fingerprinting works because devices are surprisingly easy to identify just by enumerating their capabilities. If you are collecting this data, you are likely fingerprinting (read: uniquely identifying) machines even if you aren't trying to.

The same is true of humans, by the way. Even something as innocuous as surname, gender, and county of residence could be enough.
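A back-of-envelope version of that claim, with made-up but roughly US-scale magnitudes. Real distributions are far from uniform, so this is an upper bound, but it shows how quickly innocuous attributes add up:

```javascript
// Upper-bound entropy if values were uniformly distributed:
// ~30,000 surnames, 2 recorded genders, ~3,000 counties (assumed figures).
const bits =
  Math.log2(30000) + // ~14.9 bits
  Math.log2(2) +     //  1.0 bit
  Math.log2(3000);   // ~11.6 bits

// Uniquely naming one person among ~300M needs about:
const needed = Math.log2(300e6); // ~28.2 bits

// bits ~ 27.4: three innocuous fields get you almost all the way there.
```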


+1. Just because fingerprinting with WebGL has practical applications and legitimate use cases does not mean it's not fingerprinting.


It would only be fingerprinting if the "fingerprint" is persisted alongside some other information about you as a user, and subsequently used in attempts to identify other activity as belonging to said user. That is not at all what was implied by the approach described above (which would just be used at the time of initializing every video streaming session).


I stand corrected. You make a good point.



