We collected the initial set of app ids using simple crawling scripts and manual selection. We also used AndroZoo as a source of app ids for the crawler.
AndroZoo has collected over 7 million Android applications to date.
The crawler uses a client-server architecture with work queues. Separate scripts and queues are responsible for mining the metadata, APKs, and reviews of a given Android application. Once we push the app ids to the work queue,
Note: Steps 1 and 2 (metadata and APK scripts) use the npm package google-play-cli. Step 3 (reviews script) uses the npm package google-play-scraper. Additional app ids are discovered using the google-play-scraper package and enqueued for metadata.
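
The queue-per-step layout described above can be illustrated with a minimal in-memory sketch. It is not the repo's actual implementation: `runPipeline`, the handler names, and the `maxApps` cap are hypothetical, the chaining order (metadata, then APK, then reviews) is an assumption, and the handlers are synchronous stubs standing in for the real google-play-cli and google-play-scraper calls, which would be asynchronous network requests.

```javascript
// Hypothetical sketch: one queue per step, with newly discovered app ids
// fed back into the metadata queue. Plain arrays stand in for a real
// work-queue service.
function runPipeline(seedAppIds, handlers, maxApps = 100) {
  const queues = { metadata: [], apk: [], reviews: [] };
  const results = { metadata: [], apk: [], reviews: [] };
  const seen = new Set();

  // Enqueue an app id for metadata once, up to the maxApps cap.
  const enqueue = (appId) => {
    if (seen.has(appId) || seen.size >= maxApps) return;
    seen.add(appId);
    queues.metadata.push(appId);
  };
  seedAppIds.forEach(enqueue);

  while (queues.metadata.length || queues.apk.length || queues.reviews.length) {
    if (queues.metadata.length) {
      // Step 1: metadata, then hand the app id to the APK queue.
      const appId = queues.metadata.shift();
      results.metadata.push(handlers.fetchMetadata(appId));
      queues.apk.push(appId);
    } else if (queues.apk.length) {
      // Step 2: APK download, then hand the app id to the reviews queue.
      const appId = queues.apk.shift();
      results.apk.push(handlers.downloadApk(appId));
      queues.reviews.push(appId);
    } else {
      // Step 3: reviews, plus discovery of new app ids for step 1.
      const appId = queues.reviews.shift();
      results.reviews.push(handlers.fetchReviews(appId));
      handlers.discoverAppIds(appId).forEach(enqueue);
    }
  }
  return { results, seen: [...seen] };
}

// Stub handlers (assumptions for illustration only, no network access).
const stubHandlers = {
  fetchMetadata: (appId) => ({ appId, title: `stub title for ${appId}` }),
  downloadApk: (appId) => `${appId}.apk`,
  fetchReviews: (appId) => [],
  discoverAppIds: (appId) =>
    appId === 'com.example.seed' ? ['com.example.found'] : [],
};

const outcome = runPipeline(['com.example.seed'], stubHandlers);
console.log(outcome.seen);
```

The `seen` set keeps the feedback loop from re-enqueuing an app id that has already been processed, and the cap bounds how far discovery can expand from the seed set.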
The scripts used to crawl this data, along with package installation instructions, are available in this repo.