Introduction

This section describes the data used to complete the final project of the Big Data Specialization offered by the University of California San Diego on Coursera. The project focuses on developing recommendations for increasing revenue in a fictitious online game called “Catch the Pink Flamingo”.

The data for the project consists of 9 files with data on in-app purchases, ad clicks and game-specific information as well as 6 files with chat data. The data were downloaded from the Coursera website. All the data for the project were generated by a development team of data scientists. The data were designed to simulate many aspects of game play and in-game user activity.

The project entails classification analysis by fitting a decision tree and cluster analysis using k-means. It also uses graph analytics to identify the chattiest users / teams, longest conversations and active user groups.

The analysis in this project was performed using Spark, Python, Splunk, KNIME and Neo4j.

In the remainder of this section we will first describe the Catch the Pink Flamingo game, then we will provide an overview of the Catch the Pink Flamingo data model and the data sets.

Data exploration is described in the next section.

Description of Catch the Pink Flamingo Online Game

One of the products of the imaginary company Eglence Inc. is a highly popular mobile game called “Catch the Pink Flamingo”. The objective of the game is to catch as many Pink Flamingos as possible by following the missions provided by real-time prompts in the game, and to cover the map provided for each level. The levels become more demanding in mission speed and map complexity as users move from level to level.

It is a multi-user game in which players have to catch Pink Flamingos that randomly pop up on a gridded world map, based on missions that change in real time. For the player or team to move to the next complexity level, they need to have at least one point in every map grid cell, i.e., cover the whole world map. An example mission would be “Catch the Flamingos on land with stars on their belly”, in which the player should click only on flamingos that match the mission criteria, in this case having stars and being on land. If the player tags any other flamingo on the map, the player or the player’s team gets a negative point (-1) on that map location.
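The scoring and level-completion rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the game's implementation: the function names are hypothetical, and the +1 awarded for a correct catch is an assumption (the description only states the -1 penalty and the requirement of at least one point in every cell).

```python
from typing import List

Grid = List[List[int]]


def new_map(rows: int, cols: int) -> Grid:
    """Create a score grid in which every cell starts at 0 points."""
    return [[0 for _ in range(cols)] for _ in range(rows)]


def record_click(grid: Grid, row: int, col: int, matches_mission: bool) -> None:
    """Apply one click to a cell.

    Assumption: a flamingo matching the mission earns +1.
    From the description: any other flamingo costs -1 at that location.
    """
    grid[row][col] += 1 if matches_mission else -1


def level_complete(grid: Grid) -> bool:
    """A level is complete once every grid cell holds at least one point."""
    return all(cell >= 1 for line in grid for cell in line)
```

For example, on a 2x2 map, one correct catch in each of the four cells completes the level; a single wrong click afterwards drops that cell back to 0, so the level no longer counts as complete.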

After the initial sign-up, a player (user) is asked to play Level 1 individually without joining any team. This is where the user gets trained as a player and starts building a game history. Level 1 is an easy entry to the game, composed of only 64 (8x8) grid cells and longer, more obvious, fun missions. Upon completion of Level 1, the player is asked whether she/he wants to join an existing team or form a new one, and plays the rest of the game as a team player, even if that means being a one-person team. Each user is a member of at most one team.

At the beginning of each level, the game creates a brand new map with more cells than the level before. The complexity of the missions also increases. The missions change more frequently as the levels increase.

The players keep in touch via chat boards assigned to the teams and also via social media, e.g., Twitter.

There are some things to consider while designing an information system for this game:

Overview of the Catch the Pink Flamingo Data Model

The data generation scripts create several log files recording the activities of people playing Catch the Pink Flamingo. This document describes the fields in those log files.

The image below is an Entity Relationship Diagram (ERD) for the Catch the Pink Flamingo game data model.

Data Set Overview

Data on In-App Purchases, Ad Clicks and Game-Specific Information

The listing below covers each of the files available for analysis, with a short description of the file and the fields found in it.

ad-clicks.csv
Description: A line is added to this file when a player clicks on an advertisement in the Flamingo app.
Fields:
timestamp: when the click occurred.
txId: a unique id (within ad-clicks.csv) for the click.
userSessionid: the id of the user session for the user who made the click.
teamid: the current team id of the user who made the click.
userid: the user id of the user who made the click.
adId: the id of the ad clicked on.
adCategory: the category/type of ad clicked on.

buy-clicks.csv
Description: A line is added to this file when a player makes an in-app purchase in the Flamingo app.
Fields:
timestamp: when the purchase was made.
txId: a unique id (within buy-clicks.csv) for the purchase.
userSessionId: the id of the user session for the user who made the purchase.
team: the current team id of the user who made the purchase.
userId: the user id of the user who made the purchase.
buyId: the id of the item purchased.
price: the price of the item purchased.

users.csv
Description: This file contains a line for each user playing the game.
Fields:
timestamp: when the user first played the game.
userId: the user id assigned to the user.
nick: the nickname chosen by the user.
twitter: the Twitter handle of the user.
dob: the date of birth of the user.
country: the two-letter country code where the user lives.

team.csv
Description: This file contains a line for each team terminated in the game.
Fields:
teamId: the id of the team.
name: the name of the team.
teamCreationTime: the timestamp when the team was created.
teamEndTime: the timestamp when the last member left the team.
strength: a measure of team strength, roughly corresponding to the success of a team.
currentLevel: the current level of the team.

team-assignments.csv
Description: A line is added to this file each time a user joins a team. A user can be in at most a single team at a time.
Fields:
timestamp: when the user joined the team.
team: the id of the team.
userId: the id of the user.
assignmentId: a unique id for this assignment.

level-events.csv
Description: A line is added to this file each time a team starts or finishes a level in the game.
Fields:
timestamp: when the event occurred.
eventId: a unique id for the event.
teamId: the id of the team.
teamLevel: the level started or completed.
eventType: the type of event, either start or end.

user-session.csv
Description: Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one started.
Fields:
timestamp: a timestamp denoting when the event occurred.
userSessionId: a unique id for the session.
userId: the current user’s ID.
teamId: the current user’s team.
assignmentId: the team assignment id for the user to the team.
sessionType: whether the event is the start or end of a session.
teamLevel: the level of the team during this session.
platformType: the type of platform of the user during this session.

game-clicks.csv
Description: A line is added to this file each time a user performs a click in the game.
Fields:
timestamp: when the click occurred.
clickId: a unique id for the click.
userId: the id of the user performing the click.
userSessionId: the id of the session of the user when the click is performed.
isHit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0).
teamId: the id of the team of the user.
teamLevel: the current level of the team of the user.

combined_data.csv
Description: Combines data from 3 of the log files: user-session.csv, buy-clicks.csv, and game-clicks.csv.
Fields:
userid: User ID
userSessionid: User session ID
team_level: User’s team level
platformType: Platform used by user
count_gameclicks: Total number of game clicks for user session
count_hits: Total number of game hits for user session
count_buyid: Total number of purchases for user session
avg_price: Average purchase price for user session
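To make the relationship between the click logs and the per-session fields of combined_data.csv more concrete, the sketch below derives count_gameclicks, count_hits, count_buyid and avg_price from game-clicks and buy-clicks rows. It is only a plausible reconstruction under stated assumptions: combined_data.csv was supplied with the project and the way it was actually built is not described here, the sample rows and their timestamp format are made up for illustration, the column headers are assumed to match the field names listed above, and the join with user-session.csv (for team_level and platformType) is left out for brevity. The function name `summarize_sessions` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the layouts described above (not project data).
GAME_CLICKS = """timestamp,clickId,userId,userSessionId,isHit,teamId,teamLevel
2016-05-26 15:06:55,105,1038,5916,0,25,1
2016-05-26 15:07:09,154,1038,5916,1,25,1
2016-05-26 15:07:14,229,1193,5930,1,44,1
"""

BUY_CLICKS = """timestamp,txId,userSessionId,team,userId,buyId,price
2016-05-26 15:36:54,6004,5916,25,1038,2,3.0
2016-05-26 15:40:11,6005,5916,25,1038,4,10.0
"""


def summarize_sessions(game_clicks_csv: str, buy_clicks_csv: str) -> dict:
    """Aggregate clicks and purchases per user session id."""
    totals = defaultdict(
        lambda: {"count_gameclicks": 0, "count_hits": 0,
                 "count_buyid": 0, "total_price": 0.0}
    )

    # Every game click counts once; isHit is 1 for a hit and 0 for a miss.
    for row in csv.DictReader(io.StringIO(game_clicks_csv)):
        session = totals[row["userSessionId"]]
        session["count_gameclicks"] += 1
        session["count_hits"] += int(row["isHit"])

    # Every purchase counts once and contributes its price to the total.
    for row in csv.DictReader(io.StringIO(buy_clicks_csv)):
        session = totals[row["userSessionId"]]
        session["count_buyid"] += 1
        session["total_price"] += float(row["price"])

    summary = {}
    for session_id, t in totals.items():
        purchases = t["count_buyid"]
        summary[session_id] = {
            "count_gameclicks": t["count_gameclicks"],
            "count_hits": t["count_hits"],
            "count_buyid": purchases,
            # Sessions without purchases have no average price.
            "avg_price": t["total_price"] / purchases if purchases else None,
        }
    return summary
```

Calling `summarize_sessions(GAME_CLICKS, BUY_CLICKS)` on the sample rows gives session 5916 two game clicks, one hit, two purchases and an average price of 6.5, while session 5930 has one click, one hit and no purchases.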

Schema of the Graph Database for Chats

The schema of the 6 CSV files used to construct the graph database for chats is described in the listing below.

chat_create_team_chat.csv
Description: A line is added to this file when a player creates a new chat with their team.
Columns: userid, teamid, TeamChatSessionID, timestamp
Example rows:
559,48,6288,14567
876,15,6289,24244
1166,68,6290,65522

chat_item_team_chat.csv
Description: Creates nodes labeled ChatItems. Column 0 is User id, column 1 is the TeamChatSession id, column 2 is the ChatItem id (i.e., the id property of the ChatItem node), column 3 is the timestamp for an edge labeled “CreateChat”. Also creates an edge labeled “PartOf” from the ChatItem node to the TeamChatSession node. This edge has a timestamp property using the value from Column 3.
Columns: userid, teamchatsessionid, chatitemid, timestamp
Example rows:
1,956,629,963,051,460,000,000
2,081,629,663,111,460,000,000
1,166,629,063,161,460,000,000

chat_join_team_chat.csv
Description: Creates an edge labeled “Joins” from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Joins edge.
Columns: userid, TeamChatSessionID, timestamp
Example rows:
559,628,812,345
876,628,915,468
1,166,629,015,648

chat_leave_team_chat.csv
Description: Creates an edge labeled “Leaves” from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Leaves edge.
Columns: userid, teamchatsessionid, timestamp
Example rows:
124,468,211,464,241,000.00
107,468,381,464,243,000.00
35,067,771,464,246,600.00

chat_mention_team_chat.csv
Description: Creates an edge labeled “Mentioned”. Column 0 is the id of the ChatItem, column 1 is the id of the User, and column 2 is the timestamp of the edge going from the ChatItem to the User.
Columns: ChatItem, userid, timestamp
Example rows:
63,492,508
63,662,491
6,371,104

chat_respond_team_chat.csv
Description: A line is added to this file when a player with chatid2 responds to a chat post by another player with chatid1.
Columns: chatid1, chatid2, timestamp
Example rows:
6,326,630,521,564
6,364,632,654,544
6,371,636,654,567
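As a rough illustration of how the chat files support questions such as "who are the chattiest users", the sketch below counts chat items per user from rows in the chat_item_team_chat.csv column order (userid, teamchatsessionid, chatitemid, timestamp). It is a plain-Python approximation under assumptions, not the graph query used in the project: the sample rows are invented, the file is assumed to have no header row, and `chattiest_users` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Invented rows: userid, teamchatsessionid, chatitemid, timestamp.
SAMPLE_ROWS = """101,9001,5001,1464233000
102,9001,5002,1464233050
101,9001,5003,1464233100
103,9002,5004,1464233200
101,9002,5005,1464233300
102,9002,5006,1464233400
"""


def chattiest_users(csv_text: str, top_n: int = 10) -> list:
    """Return (userid, chat item count) pairs, most active users first."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        counts[int(row[0])] += 1  # column 0 is the user who created the item
    return counts.most_common(top_n)
```

With the invented rows, `chattiest_users(SAMPLE_ROWS, top_n=2)` ranks user 101 first with three chat items, followed by user 102 with two.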

Each of the 6 CSV files was loaded into Neo4j using the LOAD CSV command, which reads each row of the file and assigns the imported values to the nodes and edges of the graph.

For example, the code below loads the nodes and edges from chat_join_team_chat.csv. Each row in this file has 3 values: userid, TeamChatSessionID and timestamp. As the code reads each row of the file, it merges the value from the first column into a node of the type “User” and the value from the second column into a node of the type “TeamChatSession”. The value from the third column is stored as the timestamp property of an edge of the type “Join”, which links each User to the User’s TeamChatSession.

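// Read the file row by row; each row is a list [userid, TeamChatSessionID, timestamp]
LOAD CSV FROM "file:///chat-data/chat_join_team_chat.csv" AS row
// Find or create the User node identified by column 0
MERGE (u:User {id: toInteger(row[0])})
// Find or create the TeamChatSession node identified by column 1
MERGE (c:TeamChatSession {id: toInteger(row[1])})
// Find or create a Join edge from the user to the session, with column 2 as its timestamp
MERGE (u)-[:Join{timestamp: row[2]}]->(c)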
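For readers less familiar with Cypher, the find-or-create behavior of MERGE in the statement above can be imitated in plain Python with sets. This is a conceptual sketch only, not a replacement for Neo4j: the helper name `load_join_edges` and the sample rows in the usage note are hypothetical, and, as in the Cypher statement, the timestamp is kept as the string read from the file.

```python
import csv
import io


def load_join_edges(csv_text: str):
    """Imitate LOAD CSV + MERGE for rows of userid, TeamChatSessionID, timestamp.

    Sets give the same "create only if missing" effect as MERGE: a repeated
    row adds no new node or edge.
    """
    users = set()      # User node ids
    sessions = set()   # TeamChatSession node ids
    edges = set()      # Join edges as (user id, session id, timestamp)

    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        user_id = int(row[0])       # like toInteger(row[0])
        session_id = int(row[1])    # like toInteger(row[1])
        timestamp = row[2]          # left as a string, as in the Cypher code

        users.add(user_id)
        sessions.add(session_id)
        # The edge is identified by both endpoints and its timestamp property,
        # so the same user joining the same session at another time is a new edge.
        edges.add((user_id, session_id, timestamp))

    return users, sessions, edges
```

Given the made-up input `"1,10,100\n2,10,150\n1,10,100\n1,10,200\n"`, the function yields two users, one session and three Join edges, because the duplicated third row is merged into the first.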

Next step: Data Exploration