I'm not unbiased when it comes to Nate, who is one of my oldest friends, because he's dragooned me into being an advisor for SourceDNA. I've promised to donate all proceeds from his venture to charity, unless it returns enough to buy me a private jet, in which case I'm going to buy a private jet and then donate the rest to charity. I almost quit Matasano to join him; the day after I flew out to work out a role, we got the acquisition offer, and I had to stay.
Nate is way underselling himself. He's essentially not only acquired most of the contents of most of the app stores, and not only decompiled them, but has then built up a comparative analytics framework that can answer questions based on code similarity (as a first order of available facts) and behavior (as a sort of second-order thing).
I'm really curious to see what ideas other people would have for this kind of data set. If you could answer virtually any question about the behavior of any/every app in the app store, what would you do with that capability?
Also: people should ask him questions about how this stuff works. It's really neat.
To elaborate: he's got an up-to-the-minute view on library and technology choices made across the entire App Store. I figure this would basically be killer for e.g. sales teams at API companies, analysts following them, etc.
(Do a for loop over your customers and profile as many signals as you can. For example, I don't know, "adopted analytics solution X." Search store for firms who match the signals which correlate with adopting you but have not yet adopted your app. Call them to pitch.)
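A toy version of that loop in Python, where every app name and signal is invented for illustration:

```python
# Hypothetical sketch: rank prospects by how many "customer-correlated"
# signals they share with your existing customers. All data is made up.
from collections import Counter

# Signals detected per app (e.g. from an SDK-detection dataset).
app_signals = {
    "acme-todo":   {"analytics_x", "crash_reporter_y", "payments_z"},
    "fit-tracker": {"analytics_x", "ads_q"},
    "photo-fun":   {"crash_reporter_y", "payments_z"},
    "note-taker":  {"ads_q"},
}

customers = {"acme-todo"}  # apps that already adopted you

# 1. Profile: count how often each signal appears among customers.
signal_counts = Counter()
for app in customers:
    signal_counts.update(app_signals[app])

# 2. Score every non-customer by overlap with customer signals.
def score(app):
    return sum(signal_counts[s] for s in app_signals[app])

prospects = sorted(
    (a for a in app_signals if a not in customers),
    key=score,
    reverse=True,
)

for app in prospects:
    print(app, score(app))  # call the top-scoring ones to pitch
```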
> library and technology choices made across the entire App Store. I figure this would basically be killer for e.g. sales teams at API companies, analysts following them
You nailed it, Patrick, and I wish I'd talked to you sooner. We were collecting this app data at MixRank for a while before it occurred to us to track every SDK and API integration. It turns out that knowing all the integrations an app has (ads, analytics, social, cloud services, tracking, libraries, etc) is really useful for predicting and understanding your customers!
> Do a for loop over your customers and profile as many signals as you can.
The most effective signal is the most obvious one: Did the app just integrate my competitor's SDK? This usually means the app is in a trial period and still deciding; you couldn't ask for better timing. Next best is: Did the app just uninstall my competitor's SDK? Then they're likely unhappy and searching for alternatives. (Source: we work with a bunch of sales teams that use us for exactly this.)
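As a toy sketch of that diff (SDK names invented):

```python
# Hypothetical sketch: diff two scans of an app's detected SDKs to
# surface the two sales triggers described above.
def sdk_event(previous: set, current: set, competitor_sdks: set):
    added = (current - previous) & competitor_sdks
    removed = (previous - current) & competitor_sdks
    if added:
        return "trial"      # just integrated a competitor: still deciding
    if removed:
        return "churning"   # just dropped a competitor: shopping around
    return None

last_week = {"ads_q", "competitor_analytics"}
this_week = {"ads_q"}
print(sdk_event(last_week, this_week, {"competitor_analytics"}))  # churning
```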
This was the most surprising thing to me. Apart from the InfoWorld "Hot API of the week" columnist (who's writing for the CEO who mostly only cares about looking informed), I had a hard time imagining anyone would pay to know who uses what library. Cool that I'm wrong.
I would be interested in pointing a (distributed, imprecise) symbolic executor at each app to gain a sense of the "state depth" of that app, and then correlating that with bugs discovered and rate of code churn in the app, and comparing the "state depth" across apps in a given 'vertical' and all apps globally.
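Roughly the correlation end of that idea, with made-up numbers:

```python
# Toy sketch with invented data: given a per-app "state depth" estimate
# from the symbolic executor and that app's observed bug count, measure
# how well one predicts the other.
from statistics import correlation  # Python 3.10+

state_depth = [12.0, 45.0, 33.0, 70.0, 21.0]  # hypothetical estimates
bugs_found  = [3,    11,   7,    19,   4]     # hypothetical bug counts

print(f"Pearson r = {correlation(state_depth, bugs_found):.2f}")
```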
Interesting, thanks. We've discussed providing some kind of code quality score. Also, we can definitely do something interesting by comparing how other apps have performed that share code features with yours. The ideal would be a predictor that makes recommended changes based on the insights we've learned from other code we've seen.
As a novice mobile engineer, I might go to Stack Overflow to see how to implement things like exception handling, caching, etc.
I'd like to know the best way to do these things, given your corpus of data. Decompile snippets of code from the most popular mobile apps and show me how they do all these things that everyone has to do.
We've actually resisted going after malware because every binary analysis technology seems to be pulled into its orbit eventually. It's like the Black Hole of Reversing. :)
What I do wonder is if there are more interesting questions to answer about the intent of code besides "pass/fail"?
Answering questions about information flow might be interesting in this regard. You could quantify what sources of information an application retrieves and how far through the application that information traverses.
Downsides are that it's very difficult to get a binary verdict out of something like that, even as a consumer of the information. You see a report that says "it takes video data and streams it out to the Internet!" and if the app purports to be a flashlight app that's bad, but if it's Skype then that's expected.
Maybe the additional insight into "where does my data go in this app" would be useful? Maybe people could start asking "why do you need to take my contacts and pass them to the network, why don't you just work with the data on my device?"
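A minimal taint-propagation toy of that idea, with invented source/sink APIs and a fake trace (a real engine would work over an IL, not a flat list):

```python
# Walk a toy trace of (operation, destination, inputs) tuples and flag
# when data derived from a sensitive source reaches a network sink.
SOURCES = {"read_contacts", "camera_frame"}  # hypothetical source APIs
SINKS = {"http_post"}                        # hypothetical sink API

trace = [
    ("call", "contacts", ["read_contacts"]),
    ("assign", "payload", ["contacts"]),
    ("call", None, ["http_post", "payload"]),
]

tainted = set()
for op, dest, inputs in trace:
    if any(i in SOURCES or i in tainted for i in inputs):
        if dest:
            tainted.add(dest)
        if op == "call" and inputs[0] in SINKS:
            print(f"flow: tainted data reaches {inputs[0]} via {inputs[1:]}")
```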
Hi, I'm happy to discuss how we're finding hidden flaws in millions of apps that even developers didn't know about. We've built a really cool binary code search engine that has indexed the structure and behavior of apps. Our engine allows us to quickly find apps that exhibit particular problems, such as calling a broken API or using a version of a library that has a vulnerability.
I need to write more about how it works. We translate the app code into an intermediate language (like LLVM bitcode) and index features derived from both the structure (callgraph/control flow graph) and syntax (opcodes) of each function. This allows you to search for snippets of code that match particular patterns or discover the relationship between modules by assessing the similarity of each. Since we use an IL, we can match code cross-platform.
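To make that concrete, here's a deliberately simplified toy of the indexing idea (the real feature set, IL, and similarity metric are much richer; everything here is a stand-in):

```python
# Per function, derive features from syntax (opcode trigrams) and
# structure (CFG edge/block counts), then look up candidates that
# share enough features with a query.
from collections import defaultdict

def features(opcodes, cfg_edges):
    ngrams = {tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)}
    blocks = {b for edge in cfg_edges for b in edge}
    structure = {("edges", len(cfg_edges)), ("blocks", len(blocks))}
    return ngrams | structure

index = defaultdict(set)  # feature -> set of function ids

def add_function(fn_id, opcodes, cfg_edges):
    for f in features(opcodes, cfg_edges):
        index[f].add(fn_id)

def search(opcodes, cfg_edges, min_shared=2):
    hits = defaultdict(int)
    for f in features(opcodes, cfg_edges):
        for fn_id in index[f]:
            hits[fn_id] += 1
    return [fn for fn, n in hits.items() if n >= min_shared]

add_function("libfoo/parse",
             ["load", "cmp", "branch", "call", "ret"],
             [("b0", "b1"), ("b0", "b2")])
print(search(["load", "cmp", "branch", "call", "ret"],
             [("b0", "b1"), ("b0", "b2")]))  # ['libfoo/parse']
```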
I'd love to talk about it here if you have questions.
While that's a great joke, we actually have a unique way of minimizing false positives compared to the average static analysis tool.
We apply code similarity techniques to reuse analysis across apps, so we don't naively treat a binary as a random jumble of code. So if there's a flaw in a particular parameter passed to a single function, we can find all apps that have the relevant code and then zoom in on just calls to this function. The smaller the number of candidates a matching algorithm has to consider, the less chance of a false positive.
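A simplified illustration of that narrowing, with invented fingerprints and app data:

```python
# First match apps that actually reuse the flawed library code, then
# check only the one call of interest in those apps, rather than
# flagging every app blindly.
FLAWED_FN_FINGERPRINT = "sha1:abc123"  # hypothetical function hash
BAD_ARG = 0                            # the parameter value that's broken

apps = {
    "app-a": {"fingerprints": {"sha1:abc123"}, "calls": [("risky_fn", 0)]},
    "app-b": {"fingerprints": {"sha1:abc123"}, "calls": [("risky_fn", 1)]},
    "app-c": {"fingerprints": {"sha1:zzz999"}, "calls": [("risky_fn", 0)]},
}

# Step 1: only apps that contain the flawed library code.
candidates = [a for a, d in apps.items()
              if FLAWED_FN_FINGERPRINT in d["fingerprints"]]

# Step 2: within those, only calls that pass the broken parameter.
flagged = [a for a in candidates
           if any(fn == "risky_fn" and arg == BAD_ARG
                  for fn, arg in apps[a]["calls"])]

print(flagged)  # ['app-a']: app-c never had the code, app-b calls it safely
```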
Just how specifically does it identify its findings? Will it just say "this looks like a common buffer overflow/format string bug/etc." or can it say things like "this embeds openssl 0.9.8q, which is horribly insecure"?
We give very specific instructions on how to fix issues, telling devs exactly which files or components are causing the problem and why. You might also like the native library dependency graph, which helps you understand the linkage of various libs even in apps that don't have any bugs. We're not going to dump a list of thousands of random potential flaws because we think devs are busy enough as it is.
The chord graphs show the linkage relationships amongst all the native libraries in Android apps. Each library has an edge to the other libraries it depends on. To keep it clear, we don't show platform libraries that are too common, like libc.
We think it might help devs understand how dependencies are dragged in, especially from code that's not theirs. I'd appreciate feedback on if it actually is helpful or what other things we could make more clear.
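For the curious, the data behind it is just an edge list with overly common platform libs filtered out; a toy version with invented library names:

```python
# Build the lib -> lib edge list that a chord renderer would consume,
# dropping platform libraries that would clutter every diagram.
PLATFORM_LIBS = {"libc.so", "libm.so", "liblog.so"}

deps = {
    "libapp.so":    ["libcrypto.so", "libpng.so", "libc.so"],
    "libcrypto.so": ["libc.so"],
    "libpng.so":    ["libz.so", "libc.so"],
}

edges = [(src, dst)
         for src, dsts in deps.items()
         for dst in dsts
         if dst not in PLATFORM_LIBS]

for src, dst in edges:
    print(f"{src} -> {dst}")  # feed these edges to the chord renderer
```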
It looks like a wormhole and a couple of the edges are hard to read, but it does display in an interesting way which libraries are sucking in all the crap.
Do you plan to make the tool available to general application development, e.g. outside of apps available on Google Play/AppStore?
If I understand correctly, your architecture translates the binary code into an intermediate language (effectively a reverse compiler) so that patterns across different build targets can be recognized as the same API use? Can you translate from platforms like Java? Can your platform recognize a call to a poor native API if done through a platform like Java? (e.g. a Java program calls a function that the JVM translates into a native call, but the parameters are setting up the caller for a vulnerability?) Does your platform handle transitive data identification? My point is that the call itself may be generalized, and specific sources of that call may be the source of the problem, but that may be abstracted by several layers of functions.
How does your product differentiate itself from the other static code analyzers available in the marketplace?
We're focused on applying our engine to solve problems in the app stores, so no plans to make it a standalone software product.
Yes, we track JNI links from Java -> native. Yes, we do data flow analysis as well, but it's targeted via control flow analysis.
In other words, first we identify a code path of interest and then target the data values/types that appear on that path. This way we're not wasting time analyzing data that is unimportant.
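A toy of that ordering, with an invented graph: enumerate only the control-flow paths that reach an interesting call, then examine the values that live on those paths.

```python
# "Control flow first, data flow second": skip values on blocks that
# never lead to the target call.
cfg = {                  # block -> successors
    "entry": ["a", "b"],
    "a": ["jni_call"],
    "b": ["exit"],
    "jni_call": ["exit"],
    "exit": [],
}
values_on_block = {"a": ["user_buffer"], "b": ["counter"]}

def paths_to(target, node="entry", path=None):
    path = (path or []) + [node]
    if node == target:
        yield path
    for succ in cfg[node]:
        yield from paths_to(target, succ, path)

# Only analyze values on blocks that actually reach the JNI call;
# "counter" on block b is never considered.
for path in paths_to("jni_call"):
    for block in path:
        for v in values_on_block.get(block, []):
            print(f"track {v!r} on path {' -> '.join(path)}")
```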
We're not a static code analyzer, we're a service for helping developers improve mobile apps. We'll never tell you "Here are 8488 potential integer overflows to investigate." Instead, we've pre-processed the issues to give targeted recommendations that any dev can understand. Quality over volume, in other words.
Yes, in two ways. First, it's often useful to target malware or other systems by the common packer code they use. This doesn't require deobfuscating anything.
More interestingly, the matching algorithms we developed are by design resistant to many common obfuscation schemes. For example, ProGuard renames variables and functions, as well as discarding dead code. But other aspects of the code that we index, including control flow structures and data references, survive. We've designed our matching engine to apply a combination of all these factors to be resilient to changes in subsets of them.
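A simplified sketch of that combination (the weights and feature channels here are invented stand-ins):

```python
# Combine independent feature channels so that renaming, which kills
# the symbol channel, still leaves enough signal to match.
CHANNELS = {"symbols": 0.2, "cfg": 0.4, "data_refs": 0.4}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def match_score(fn1, fn2):
    return sum(w * jaccard(fn1[c], fn2[c]) for c, w in CHANNELS.items())

original = {"symbols": {"parseHeader"},
            "cfg": {"loop", "diamond"},
            "data_refs": {"magic_0x89"}}
proguarded = {"symbols": {"a"},          # renamed by the obfuscator
              "cfg": {"loop", "diamond"},
              "data_refs": {"magic_0x89"}}

print(f"{match_score(original, proguarded):.2f}")  # 0.80: still a strong match
```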
Compilation is one of the simplest forms of "obfuscation", so we started by handling different levels of optimization that discard code or change the opcodes.
If a program is self-decrypting, then we have to apply standard unpacking techniques to get back to a reasonable format before ingesting it. Nothing magic there.
We depend on a huge number of program features, so mangling the CFG alone isn't enough to throw off a match (data references, function layout, linker behavior, and much more are included in our matching).
You're right that, at some point, you've destroyed enough of the features that matching will fail. For that case, you'll always need targeted techniques, such as for virtualizing obfuscators. Since those impose a big performance hit, almost no legitimate mobile apps are willing to pay the battery and speed cost.
It's really common in Android (10-20% of apps), but happens occasionally in iOS too.
The most common Android obfuscator is ProGuard, but we also see Codeguard and others. The most surprising thing we found is that many developers build their own custom obfuscator. I think it's because they're trying to keep people from ripping off their code or hide secrets on the client side. Because Android allows a variety of techniques (custom classloaders, native code extensions that can overwrite their own code), there's a proliferation of methods there.
In iOS, it's less common. You'll see a few frameworks like RNCryptor used to hide client-side secrets. Some apps use UAObfuscatedString to try to hide class names and selectors, but our similarity algorithms are based on a number of independent program features, including control-flow graph structure, which are unaffected by this.
There are obfuscating compilers too, but it's very rare for anyone to use them.
> our similarity algorithms are based on a number of independent program features, including control-flow graph structure, which are unaffected by this.
I wonder if/when that will change. Obscuring CFG fingerprints wasn't a problem before. I suppose it depends on whether people are truly trying to hide their dependencies, or if the 3rd-party code just gets caught up as part of trying to prevent the main code from being decompiled.
It's usually the latter, which is that 3rd-party code just gets pulled into what the developer is actually trying to protect. Nobody tries to hide all their dependencies, as far as I can tell. Occasionally, they're trying to hide one particular extension, for example, a banking auth framework.