How we mine Google Play Data

Collection of the initial set of Android application ids (example app id: "com.facebook.katana")

We collected the initial set of app ids using simple crawling scripts and manual selection. We also drew app ids from Androzoo to seed the crawler.

To date, Androzoo has collected over 7 million Android applications.
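
Because the seed ids come from several sources (scripts, manual selection, Androzoo), it can help to validate and de-duplicate them before they reach the crawler. The following is a minimal sketch, not part of the original scripts: the helper name `cleanAppIds` and the pattern are assumptions, based on Android's application ID naming rules (at least two dot-separated segments, each starting with a letter and containing only letters, digits, or underscores).

```javascript
// Hypothetical helper (not from the repo): keep only well-formed, unique app ids.
// Assumed pattern: two or more dot-separated segments, each starting with a
// letter and containing only letters, digits, or underscores.
const APP_ID_PATTERN = /^[A-Za-z][A-Za-z0-9_]*(\.[A-Za-z][A-Za-z0-9_]*)+$/;

function cleanAppIds(candidates) {
  const unique = new Set();
  for (const raw of candidates) {
    const id = raw.trim(); // tolerate stray whitespace from scraped lists
    if (APP_ID_PATTERN.test(id)) {
      unique.add(id); // a Set drops duplicates across sources
    }
  }
  return [...unique];
}

const seedIds = cleanAppIds([
  "com.facebook.katana",
  " com.facebook.katana ", // duplicate once trimmed
  "not an id",             // rejected: contains spaces and no dots
  "com.whatsapp",
]);
console.log(seedIds.length); // 2
```

Ids that fail the check are silently dropped here; a real seeding script might prefer to log them for manual review.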

Crawler architecture

The crawler uses a client-server architecture with work queues. Separate scripts and queues are responsible for mining the metadata, apks, and reviews of a given Android application. Once we push the app ids to the work queue:

  1. The metadata script calls the Google Play API services and dumps the response to a MongoDB instance. It enqueues the app id and its version code (obtained from the metadata) to a second queue for downloading apks. It also enqueues the app id, along with the number of review pages that need to be crawled, to a third queue for crawling reviews.
  2. Once an app id and its corresponding version code are available in the apk queue, the apk script starts downloading the apk. Note: the version code determines which version of the apk is downloaded.
  3. The review script, which serves the reviews queue, starts crawling an app's reviews once its app id is available in that queue.
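
The flow between the three queues can be illustrated with a small in-memory sketch. This is not the crawler's actual code: plain arrays stand in for the work queues, a plain object stands in for MongoDB, and `runPipeline`, `fetchMetadata`, `downloadApk`, and `fetchReviewPage` are hypothetical names for the work done by the real scripts. The workers also run one after another here, whereas the real scripts run independently against their own queues.

```javascript
// In-memory sketch of the three-queue pipeline. All names are hypothetical;
// the services object stands in for the calls the real scripts make.
function runPipeline(appIds, services) {
  const metadataQueue = [...appIds];
  const apkQueue = [];
  const reviewQueue = [];
  const store = { metadata: [], apks: [], reviews: [] }; // stand-in for MongoDB

  // Step 1: the metadata worker stores the response and fans out two jobs.
  while (metadataQueue.length > 0) {
    const appId = metadataQueue.shift();
    const meta = services.fetchMetadata(appId);
    store.metadata.push(meta);
    apkQueue.push({ appId, versionCode: meta.versionCode });
    reviewQueue.push({ appId, reviewPages: meta.reviewPages });
  }

  // Step 2: the apk worker downloads the version named by the version code.
  while (apkQueue.length > 0) {
    const { appId, versionCode } = apkQueue.shift();
    store.apks.push(services.downloadApk(appId, versionCode));
  }

  // Step 3: the review worker crawls the requested number of review pages.
  while (reviewQueue.length > 0) {
    const { appId, reviewPages } = reviewQueue.shift();
    for (let page = 0; page < reviewPages; page++) {
      store.reviews.push(...services.fetchReviewPage(appId, page));
    }
  }

  return store;
}

// Stub services with made-up values, so the sketch runs without network access.
const stubServices = {
  fetchMetadata: (appId) => ({ appId, versionCode: 42, reviewPages: 2 }),
  downloadApk: (appId, versionCode) => `${appId}-${versionCode}.apk`,
  fetchReviewPage: (appId, page) => [{ appId, page, text: `review on page ${page}` }],
};

const result = runPipeline(["com.facebook.katana"], stubServices);
console.log(result.apks[0]); // com.facebook.katana-42.apk
```

Passing the services in as an argument keeps the queue logic separate from the network calls, so the same flow can be exercised with stubs, as above.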

Note: Steps 1 and 2 (metadata and apk scripts) use the npm package google-play-cli. Step 3 (reviews script) uses the npm package google-play-scraper. New app ids are also discovered using the google-play-scraper package and enqueued for metadata crawling.

The scripts to crawl this data, along with package installation instructions, are available in this repo.