CIC Droid Sandbox - Our Datasets

Dataset 3:
Extensible Android Malware Detection and Family Classification Using Network-Flows and API-Calls (CICInvesAndMal2019 - Second part of the CICAndMal2017):

We make the second part of the CICAndMal2017 dataset publicly available. It includes permissions and intents as static features and API calls as dynamic features. The following table compares the CICAndMal2017 dataset with previously available datasets based on our 15 proposed criteria.

According to this table, previous datasets have several shortcomings: they lack app installation on real phones, interaction scenarios with the installed apps, diverse malware categories and families, a balanced number of malware and benign samples, and so on. The CICAndMal2017 dataset, in contrast, addresses all of these shortcomings. The following figure, based on this table, visualizes the comparison of the CICAndMal2017 dataset with previous ones. As the figure shows, the CICAndMal2017 dataset satisfies every one of the criteria.

We also used Droidkin to derive a graph of the similarity relations among our samples according to their binary and metadata characteristics. The figure below presents this graph, with the important hub connections between benign samples and some malware samples.

Our proposed analysis framework consists of two layers: Static Binary Classification (SBC) and Dynamic Malware Classification (DMC). We assume that if the static first layer flags a sample as suspicious, that sample is more likely to have malicious intentions. We therefore believe that a sample flagged as suspicious by the static layer should be treated as malware by the next layer. This reduces the risk of trusting unknown samples.
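The two-layer flow described above can be sketched roughly as follows. This is a minimal illustration, not the classifiers used in the paper: the function name `analyze`, the toy models, the feature values, and the output labels are all hypothetical placeholders. The only part taken from the text is the control flow, where the static layer decides benign versus suspicious and only suspicious samples are passed to the dynamic layer.

```python
from typing import Callable, Sequence

# Hypothetical model interfaces (assumptions, not from the paper):
# the static layer returns True when a sample looks suspicious,
# the dynamic layer returns a malware label for a suspicious sample.
StaticModel = Callable[[Sequence[float]], bool]
DynamicModel = Callable[[Sequence[float]], str]


def analyze(static_features: Sequence[float],
            dynamic_features: Sequence[float],
            sbc: StaticModel,
            dmc: DynamicModel) -> str:
    """Run the static layer first; only flagged samples reach the dynamic layer."""
    if not sbc(static_features):
        return "benign"
    # A sample flagged by the static layer is treated as malware from here on.
    return dmc(dynamic_features)


# Toy stand-ins for trained models, for illustration only.
def toy_sbc(features: Sequence[float]) -> bool:
    return sum(features) >= 2  # e.g. at least two flagged permissions or intents


def toy_dmc(features: Sequence[float]) -> str:
    return "label_a" if features[0] > 0.5 else "label_b"
```

In this sketch a sample that the static layer does not flag never reaches the dynamic layer, so the cost of dynamic analysis is paid only for suspicious samples.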

In the following table, we compare the classification results for the first part of the CICAndMal2017 dataset with our current results for the second part of this dataset in this paper. According to Table III, we observe a noticeable improvement in precision and recall.

Our contributions to this research are as follows:
• Proposing a comparison among previously available Android malware datasets based on 15 essential criteria.
• Providing the second part of the CICAndMal2017 dataset which includes Permission and Intent as static features and API calls as dynamic features.
• Presenting a two-layer Android malware analyzer based on static and dynamic features.
• Improving our previous network-flow analyses by appending the extracted n-gram sequential relations of API calls.
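As an illustration of the n-gram idea in the last contribution, the sketch below turns an ordered list of API calls into overlapping n-grams and counts them. The function names and the short call names in the usage example are hypothetical; the source does not specify the value of n or how the resulting n-grams are encoded as features.

```python
from collections import Counter
from typing import List, Tuple


def api_ngrams(calls: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all overlapping n-grams of an ordered API call sequence."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields no n-grams.
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def ngram_counts(calls: List[str], n: int) -> Counter:
    """Count how often each n-gram occurs in the sequence."""
    return Counter(api_ngrams(calls, n))
```

For example, `ngram_counts(["open", "read", "send", "read", "send"], 2)` counts the pair `("read", "send")` twice, which captures the order of calls rather than only their frequency.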


Dataset 2:
Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification (CICAndMal2017-part1)

We generated a new Android malware dataset, named CICAndMal2017, which is fully labeled and includes network traffic, logs, API/SYS calls, phone statistics, and memory dumps of 42 malware families. The following figure shows the network architecture for the experiment setup and configuration. In this experiment, we used three laptops connected to three real smartphones.

To create a comprehensive scenario for data capturing, we first created a taxonomy to understand malware behavior and its characteristics. The taxonomy comprises 20 types of attacks (A1-A20) and 4 types of C&C communications (C1-C4), as shown in the table below.

Based on this taxonomy, we created a specific scenario for each malware category. The table below lists the scenario for each malware category.

Also, we defined three states of data capturing in order to overcome the stealthiness of advanced malware:
1) Installation: The first state of data capturing, which occurs immediately after installing the malware (1-3 min).
2) Before restart: The second state of data capturing, which occurs 15 min before rebooting the phones.
3) After restart: The last state of data capturing, which occurs 15 min after rebooting the phones.
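One way to picture the three capture states is as time windows around the install and reboot events. The sketch below is an assumption-laden illustration, not the capture tooling used for the dataset: it reads "1-3 min" as a window of up to 3 minutes after installation and reads the two restart states as 15-minute windows on either side of the reboot. The function and constant names are hypothetical.

```python
from typing import Optional

INSTALL_WINDOW_S = 3 * 60    # assumption: upper bound of the 1-3 min range
RESTART_WINDOW_S = 15 * 60   # assumption: 15 min window on each side of the reboot


def capture_state(t: float, install_t: float, reboot_t: float) -> Optional[str]:
    """Label timestamp t (in seconds) with a capture state, or None if outside all windows."""
    if install_t <= t < install_t + INSTALL_WINDOW_S:
        return "installation"
    if reboot_t - RESTART_WINDOW_S <= t < reboot_t:
        return "before_restart"
    if reboot_t <= t < reboot_t + RESTART_WINDOW_S:
        return "after_restart"
    return None
```

With an install at t = 0 and a reboot at t = 3600, a packet at t = 60 falls in the installation state, one at t = 3000 in the before-restart state, and one at t = 3700 in the after-restart state.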

Feature extraction and selection:
We captured network traffic (.pcap files) and extracted more than 80 features with CICFlowMeter-V3 during all three states mentioned above (installation, before restart, and after restart).
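To give a sense of what flow-based features look like, the sketch below groups packets into flows and computes a few simple statistics per flow. It is not CICFlowMeter's implementation: it uses one-directional flows keyed by a 5-tuple, and the `Packet` type, the function name, and the four feature names are hypothetical choices for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    ts: float      # capture timestamp in seconds
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    length: int    # packet length in bytes


FlowKey = Tuple[str, str, int, int, str]


def flow_features(packets: List[Packet]) -> Dict[FlowKey, Dict[str, float]]:
    """Group packets by 5-tuple and compute basic per-flow statistics."""
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for p in packets:
        flows[(p.src, p.dst, p.sport, p.dport, p.proto)].append(p)

    features: Dict[FlowKey, Dict[str, float]] = {}
    for key, pkts in flows.items():
        times = [p.ts for p in pkts]
        sizes = [p.length for p in pkts]
        features[key] = {
            "duration_s": max(times) - min(times),
            "packet_count": len(pkts),
            "total_bytes": sum(sizes),
            "mean_packet_len": sum(sizes) / len(sizes),
        }
    return features
```

Running such a function separately on the traffic of each capture state would give one feature table per state, which mirrors the per-state extraction described above.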

The full research paper outlining the details of the dataset and its underlying principles:

Arash Habibi Lashkari, Andi Fitriah A. Kadir, Laya Taheri, and Ali A. Ghorbani, “Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification”, in the Proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.


Dataset 1:
Android Adware and General Malware Dataset (AAGM):

A labeled dataset of mobile malware traffic from real smartphones, built with nine new flow-based network traffic features. This dataset includes 1900 benign and malicious apps in 12 different families. The 400 malware apps are from two categories: adware (250), and general malware (150).
The creation of the AAGM dataset is divided into three phases, as shown in the figure below.

To further analyze our dataset, we employed Droidkin, a lightweight detector of similarity between Android apps. Droidkin is used to investigate the relationships between the app categories: adware, general malware, and benign. The following figure visualizes the result of the detection analysis. The red circles represent categories, and the small black circles represent the apps that belong to those categories. Overall, there is only a weak relationship between these three categories.
For more information about this dataset, please find the related published paper here:

Arash Habibi Lashkari, Andi Fitriah A. Kadir, Hugo Gonzalez, Kenneth Fon Mbah, and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, in the Proceedings of the 15th International Conference on Privacy, Security and Trust (PST), Calgary, Canada, 2017.

To request the datasets, please visit the following links:
Dataset | Link | Description
AndMal2017 - part2 | CICInvesAndMal2019 | Android malware while running on real phones, along with network traffic, memory dump, logs, permission, API calls, and phone statistics data
AndMal2017 - part1 | CICAndMal2017 | Android malware while running on real phones, includes permissions and intents as static features, and API calls as dynamic features
AAGM | CICAAGM | Android Adware and General Malware Dataset
If you have used our datasets in your research, please cite the corresponding published papers as well.