Introduction

This section describes the data used to complete the final project of the Big Data Specialization offered by the University of California San Diego on Coursera. The project focuses on developing recommendations for increasing revenue in a fictitious online game called “Catch the Pink Flamingo”.

The data for the project consist of 9 files with data on in-app purchases, ad clicks and game-specific information, as well as 6 files with chat data. The data were downloaded from the Coursera website. All the data were generated by a development team of data scientists and were designed to simulate many aspects of game play and in-game user activity.

The project entails classification analysis by fitting a decision tree and cluster analysis using k-means. It also uses graph analytics to identify the chattiest users / teams, longest conversations and active user groups.

The analysis in this project was performed using Spark, Python, Splunk, KNIME and Neo4j.

In the remainder of this section we will first describe the Catch the Pink Flamingo game, then we will provide an overview of the Catch the Pink Flamingo data model and the data sets.

Data exploration is described in the next section.

Description of Catch the Pink Flamingo Online Game

One of the products of the imaginary company Eglence Inc. is a highly popular mobile game called “Catch the Pink Flamingo”. The objective of the game is to catch as many Pink Flamingos as possible by following the missions provided by real-time prompts in the game and covering the map provided for each level. The levels get more complicated in mission speed and map complexity as the users move from level to level.

It’s a multi-user game where the players have to catch Pink Flamingos that randomly pop up on a gridded world map, based on missions that change in real time. For the player or team to move to the next complexity level, they need to have at least one point in every map grid cell, i.e., cover the whole world map. An example mission would be “Catch the Flamingos on land with stars on their belly”, in which the player should only click on flamingos that match the mission criteria: in this case, being on land and having stars on the belly. If the player tags any other flamingo on the map, the player or the player's team gets a negative point (-1) on that map location.
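The coverage and penalty rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the +1 award for a correct tag is an assumption, since the text specifies only the -1 penalty and the requirement of at least one point per grid cell.

```python
from typing import List


def new_grid(rows: int, cols: int) -> List[List[int]]:
    """Create a score grid with zero points in every cell."""
    return [[0] * cols for _ in range(rows)]


def record_tag(grid: List[List[int]], row: int, col: int, matches_mission: bool) -> None:
    """Apply one tag: +1 (assumed) if the flamingo matches the mission, -1 otherwise."""
    grid[row][col] += 1 if matches_mission else -1


def level_complete(grid: List[List[int]]) -> bool:
    """The map is covered once every cell holds at least one point."""
    return all(points >= 1 for row in grid for points in row)
```

For Level 1, `new_grid(8, 8)` would give the 64-cell map; a single wrong tag on a cell with one point brings it back to zero, so the level is no longer covered.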

After the initial sign-up, a player (user) is asked to play Level 1 individually, without joining a team. This is where the user gets trained as a player and starts building a game history. Level 1 is an easy entry to the game, composed of only 64 (8x8) grid cells and longer, more obvious, fun missions. Upon completing Level 1, the player is asked whether to join an existing team or form a new one, and continues the rest of the game as a team player, even if that means being a one-person team. Each user is a member of at most one team.

At the beginning of each level, the game creates a brand new map with more cells than the level before. The complexity of the missions also increases. The missions change more frequently as the levels increase.

The players keep in touch via chat boards assigned to the teams and also via social media, e.g., Twitter.

There are some things to consider while designing an information system for this game:

Ranking of Users: Each user will be ranked individually by the speed and accuracy of their click to completion. The rankings get tracked in real-time and can be viewed both via the mobile app and the website for the game. In addition to score, speed and accuracy based ranking, the other players can see what parts of the map the user has the most points for. The players are also categorized based on their history as “rising star”, “veteran”, “coach”, “social butterfly” and “hot flamingo”. These refer to the qualities of players in addition to the game statistics.
Ranking of Teams: The teams are ranked publicly. A team has a minimum of 1 member and a maximum of 30 members. Players “ask” to join a team and get voted in when 80% of the team members allow it. A team may choose to “recruit” a player if they think the player can contribute, or “outvote” a player who is not contributing. Players are also allowed to change teams and bring all their points along. The competition is built on a “point-based economy”, which is encouraged by the game providers. When all players leave a team, the team is automatically removed from public view and archived by Eglence Inc.
In-game Purchases: Users are allowed in-game purchases, including binoculars to spot the mission-specific flamingos, special flamingos that count for more than one grid point, ice blocks to freeze a mission for 20 seconds when needed, and trading cards to transfer the extra points from some grid cells to the ones without any points.
Game Completion: The game never ends, meaning that there will always be a more complicated next level. A challenge for Eglence Inc. is to keep the game interesting and engaging for players who have been around for a long time. They make use of big data analytics to make sure the veteran players are still around.
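As a small illustration of the team membership rules listed above, the sketch below checks whether a join request succeeds. The function name is hypothetical, and two points are assumptions rather than stated rules: that “80%” means at least 80% of current members, and that a full team of 30 cannot accept anyone.

```python
MAX_TEAM_SIZE = 30      # maximum team size stated in the text
APPROVAL_PERCENT = 80   # share of members that must allow the join


def can_join(team_size: int, approvals: int) -> bool:
    """Return True if a player may join a team of `team_size` members
    after `approvals` of those members voted to allow it."""
    if not 0 <= approvals <= team_size:
        raise ValueError("approvals must be between 0 and team_size")
    if team_size < 1 or team_size >= MAX_TEAM_SIZE:
        return False  # no members to vote, or the team is already full
    # Integer comparison avoids floating-point rounding at the threshold.
    return approvals * 100 >= APPROVAL_PERCENT * team_size
```

With these assumptions, a team of 5 needs 4 approving votes, since 3 of 5 is only 60%.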

Overview of the Catch the Pink Flamingo Data Model

The data generation scripts create several log files recording the activities of people playing Catch the Pink Flamingo. This document describes the fields in those log files.

The image below is an Entity Relationship Diagram (ERD) for the Catch the Pink Flamingo game data model.

Data Set Overview

Data on In-App Purchases, Ad Clicks and Game-Specific Information

The table below lists each of the files available for analysis with a short description of what is found in each one.

File Name	Description	Fields
ad-clicks.csv	A line is added to this file when a player clicks on an advertisement in the Flamingo app.	timestamp: when the click occurred. txId: a unique id (within ad-clicks.csv) for the click. userSessionid: the id of the user session for the user who made the click. teamid: the current team id of the user who made the click. userid: the user id of the user who made the click. adId: the id of the ad clicked on. adCategory: the category/type of ad clicked on.
buy-clicks.csv	A line is added to this file when a player makes an in-app purchase in the Flamingo app.	timestamp: when the purchase was made. txId: a unique id (within buy-clicks.csv) for the purchase. userSessionId: the id of the user session for the user who made the purchase. team: the current team id of the user who made the purchase. userId: the user id of the user who made the purchase. buyId: the id of the item purchased. price: the price of the item purchased.
users.csv	This file contains a line for each user playing the game.	timestamp: when user first played the game. userId: the user id assigned to the user. nick: the nickname chosen by the user. twitter: the twitter handle of the user. dob: the date of birth of the user. country: the two-letter country code where the user lives.
team.csv	This file contains a line for each team terminated in the game.	teamId: the id of the team. name: the name of the team. teamCreationTime: the timestamp when the team was created. teamEndTime: the timestamp when the last member left the team. strength: a measure of team strength, roughly corresponding to the success of a team. currentLevel: the current level of the team.
team-assignments.csv	A line is added to this file each time a user joins a team. A user can be in at most a single team at a time.	timestamp: when the user joined the team. team: the id of the team. userId: the id of the user. assignmentId: a unique id for this assignment.
level-events.csv	A line is added to this file each time a team starts or finishes a level in the game.	timestamp: when the event occurred. eventId: a unique id for the event. teamId: the id of the team. teamLevel: the level started or completed. eventType: the type of event, either start or end.
user-session.csv	Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one started.	timestamp: a timestamp denoting when the event occurred. userSessionId: a unique id for the session. userId: the current user’s ID. teamId: the current user’s team. assignmentId: the team assignment id for the user to the team. sessionType: whether the event is the start or end of a session. teamLevel: the level of the team during this session. platformType: the type of platform of the user during this session.
game-clicks.csv	A line is added to this file each time a user performs a click in the game.	timestamp: when the click occurred. clickId: a unique id for the click. userId: the id of the user performing the click. userSessionId: the id of the session of the user when the click is performed. isHit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0). teamId: the id of the team of the user. teamLevel: the current level of the team of the user.
combined_data.csv	Combines data from 3 of the log files: user-session.csv, buy-clicks.csv, and game-clicks.csv.	userid: user ID. userSessionid: user session ID. team_level: user’s team level. platformType: platform used by the user. count_gameclicks: total number of game clicks for the user session. count_hits: total number of game hits for the user session. count_buyid: total number of purchases for the user session. avg_price: average purchase price for the user session.
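To make the per-session fields of the combined file concrete, here is a minimal Python sketch that reads rows with the column names listed above and derives a hit rate for each session. The two sample rows, including the platform values, are made up for illustration and are not taken from the real file; `hit_rates` is a hypothetical helper, and the real file may spell or order its header differently.

```python
import csv
import io

# Illustrative rows only: invented values using the column names from the table.
SAMPLE = """userid,userSessionid,team_level,platformType,count_gameclicks,count_hits,count_buyid,avg_price
1,100,1,android,50,5,1,3.0
2,101,2,iphone,20,4,2,10.0
"""


def hit_rates(csv_text: str) -> dict:
    """Map each user session id to count_hits / count_gameclicks."""
    rates = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        clicks = int(row["count_gameclicks"])
        hits = int(row["count_hits"])
        # Guard against sessions with no clicks to avoid division by zero.
        rates[int(row["userSessionid"])] = hits / clicks if clicks else 0.0
    return rates
```

Calling `hit_rates(SAMPLE)` gives one ratio per session; a derived feature like this is the kind of input that the classification and clustering steps could use alongside the purchase counts.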

Schema of the Graph Database for Chats

The schema of the 6 CSV files used to construct the graph database for chats is described in the table below.

File Name	Description	Example
chat_create_team_chat.csv	A line is added to this file when a player creates a new chat with their team.	userid, teamid, TeamChatSessionID, timestamp 559,48,6288,14567 876,15,6289,24244 1166,68,6290,65522
chat_item_team_chat.csv	Creates nodes labeled ChatItems. Column 0 is User id, column 1 is the TeamChatSession id, column 2 is the ChatItem id (i.e., the id property of the ChatItem node), column 3 is the timestamp for an edge labeled “CreateChat”. Also creates an edge labeled “PartOf” from the ChatItem node to the TeamChatSession node. This edge has a timestamp property using the value from Column 3.	userid, teamchatsessionid, chatitemid, timestamp 1956,6299,6305,1460000000 2081,6296,6311,1460000000 1166,6290,6316,1460000000
chat_join_team_chat.csv	Creates an edge labeled “Joins” from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Joins edge.	userid, TeamChatSessionID, timestamp 559,6288,12345 876,6289,15468 1166,6290,15648
chat_leave_team_chat.csv	Creates an edge labeled “Leaves” from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Leaves edge.	userid, teamchatsessionid, timestamp 1244,6821,1464241000.00 1074,6838,1464243000.00 350,6777,1464246600.00
chat_mention_team_chat.csv	Creates an edge labeled “Mentioned”. Column 0 is the id of the ChatItem, column 1 is the id of the User, and column 2 is the timestamp of the edge going from the chatItem to the User.	ChatItem, userid, timestamp 63,492,508 63,662,491 6,371,104
chat_respond_team_chat.csv	A line is added to this file when a player with chatid2 responds to a chat post by another player with chatid1.	chatid1, chatid2, timestamp 6326,6305,21564 6364,6326,54544 6371,6366,54567

Each of the 6 CSV files was loaded into Neo4j using the LOAD CSV command that reads each row of the file and then assigns the imported values to the nodes and edges of the graph.

For example, the code below loads the nodes and edges from chat_join_team_chat.csv. Each row in this file has 3 values: userid, TeamChatSessionID and timestamp. As the code reads each row of the file, it merges the value from the first column into the id of a node of the type “User” and the value from the second column into the id of a node of the type “TeamChatSession”. It then merges an edge of the type “Join” that links each User to the User’s TeamChatSession and stores the value from the third column in the edge's timestamp property.

LOAD CSV FROM "file:///chat-data/chat_join_team_chat.csv" AS row 
MERGE (u:User {id: toInteger(row[0])}) 
MERGE (c:TeamChatSession {id: toInteger(row[1])}) 
MERGE (u)-[:Join{timestamp: row[2]}]->(c)
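The effect of the MERGE statements above can be mimicked in plain Python to show why repeated rows do not create duplicate nodes or edges. This is a sketch of the semantics only, not how Neo4j implements them; `load_joins` and the sample rows are hypothetical. As in the Cypher code, the ids are converted to integers while the timestamp is kept as the string read from the file.

```python
import csv
import io


def load_joins(csv_text: str):
    """Build User nodes, TeamChatSession nodes and Join edges from header-less rows."""
    users, sessions, joins = set(), set(), set()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        user_id, session_id, timestamp = int(row[0]), int(row[1]), row[2]
        users.add(user_id)          # like MERGE (u:User {id: ...})
        sessions.add(session_id)    # like MERGE (c:TeamChatSession {id: ...})
        # Like MERGE on the edge: identical (user, session, timestamp) is stored once.
        joins.add((user_id, session_id, timestamp))
    return users, sessions, joins


# Made-up rows; the third row repeats the first on purpose.
ROWS = "1,10,1000\n2,10,1005\n1,10,1000\n"
users, sessions, joins = load_joins(ROWS)
```

Here the repeated row leaves two users, one session and two edges, which mirrors how MERGE matches an existing pattern before creating a new one.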

Next step: Data Exploration