Overview
This section explores a Twitter dataset by querying MongoDB. The queries were performed in the Cloudera environment.
This project is based on assignments from the Coursera course Big Data Integration and Processing by the University of California, San Diego.
MongoDB was started by executing the following commands:
./mongodb/bin/mongod --dbpath db
./mongodb/bin/mongo
Here, --dbpath db in the first line sets db as the directory for MongoDB's data files. The second line is executed from a different terminal window; it starts the MongoDB shell so that we can query the server.
Exploration of Twitter Data
The command show dbs lists the databases:
journaldev 0.000GB
local 0.000GB
sample 0.004GB
test 0.000GB
We will switch to the sample database, which contains the Twitter JSON data, by executing use sample.
Then we will list its collections by executing show collections:
collection
users
The Twitter data are stored in the users collection.
The following command outputs the number of documents in the users collection:
db.users.count()
The output shows that there are 11,188 records in the collection.
Next, we will output one of the documents:
db.users.findOne()
The following output allows us to examine the contents of the document:
{
"_id" : ObjectId("578ffa8e7eb9513f4f55a935"),
"user_name" : "koteras",
"retweet_count" : 0,
"tweet_followers_count" : 461,
"source" : "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"coordinates" : null,
"tweet_mentioned_count" : 1,
"tweet_ID" : "755891629932675072",
"tweet_text" : "RT @ochocinco: I beat them all for 10 straight hours #FIFA16KING https://t.co/BFnV6jfkBL",
"user" : {
"CreatedAt" : ISODate("2011-12-27T09:04:01Z"),
"FavouritesCount" : 5223,
"FollowersCount" : 461,
"FriendsCount" : 619,
"UserId" : 447818090,
"Location" : "501"
}
}
The document contains several fields, including nested fields under “user”.
We can use the distinct command to output the distinct values of a specific field:
db.users.distinct("user_name")
The above command returns the following output:
[
"koteras",
"AllieLovesR5_1D",
"Tonkatol",
...
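As a rough illustration of what distinct computes, the sketch below reproduces the idea in plain JavaScript on a small in-memory array. The sample documents and the helper name distinctValues are hypothetical; they are not part of the dataset or of MongoDB's API.

```javascript
// Hypothetical sample documents (not taken from the real collection).
const sampleDocs = [
  { user_name: "alice", tweet_text: "first tweet" },
  { user_name: "bob", tweet_text: "second tweet" },
  { user_name: "alice", tweet_text: "third tweet" },
];

// Hypothetical helper: collect each value of `field` once, in first-seen order.
function distinctValues(docs, field) {
  return [...new Set(docs.map((doc) => doc[field]))];
}

const names = distinctValues(sampleDocs, "user_name");
console.log(names); // "alice" appears twice in the input but once in the result
```

Three documents go in and two names come out, which is why the distinct output can be shorter than the document count.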
The next line of code searches for documents with a specific field value, here “AllieLovesR5_1D” in user_name:
db.users.find({user_name: "AllieLovesR5_1D"}).pretty()
The results are shown below:
{
"_id" : ObjectId("578ffa8f7eb9513f4f55a937"),
"user_name" : "AllieLovesR5_1D",
"retweet_count" : 0,
"tweet_followers_count" : 4601,
"source" : "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"coordinates" : null,
"tweet_mentioned_count" : 3,
"tweet_ID" : "755891632759681024",
"tweet_text" : "RT @NiallOfficial: @Louis_Tomlinson @socceraid when I retired from playing because of my knee . I went and did my uefa A badges in Dublin",
"user" : {
"CreatedAt" : ISODate("2012-10-19T00:47:23Z"),
"FavouritesCount" : 15758,
"FollowersCount" : 4601,
"FriendsCount" : 5059,
"UserId" : 890030330,
"Location" : null
}
}
The code below selects only one field from the tweet above, the tweet_ID:
db.users.find({user_name: "AllieLovesR5_1D"}, {tweet_ID: 1})
which results in the following output:
{ "_id" : ObjectId("578ffa8f7eb9513f4f55a937"), "tweet_ID" : "755891632759681024" }
The next line of code removes the primary key _id from the results:
db.users.find({user_name: "AllieLovesR5_1D"}, {tweet_ID: 1, _id: 0})
and shows the following output:
{ "tweet_ID" : "755891632759681024" }
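The effect of the two projections above can be sketched in plain JavaScript. This is only a simplified model of inclusion projections: the helper name project and the sample document are hypothetical, and the real server supports more projection forms than this sketch covers.

```javascript
// Hypothetical document shaped like the ones in the users collection.
const doc = { _id: "hypothetical-id", user_name: "alice", tweet_ID: "123" };

// Hypothetical helper: keep fields marked 1; keep _id unless it is marked 0.
function project(document, spec) {
  const result = {};
  if (spec._id !== 0 && "_id" in document) {
    result._id = document._id;
  }
  for (const [field, flag] of Object.entries(spec)) {
    if (field !== "_id" && flag === 1 && field in document) {
      result[field] = document[field];
    }
  }
  return result;
}

const withId = project(doc, { tweet_ID: 1 });            // _id kept by default
const withoutId = project(doc, { tweet_ID: 1, _id: 0 }); // _id suppressed
console.log(withId);
console.log(withoutId);
```

The sketch mirrors the behavior seen in the two query outputs: _id is returned unless the projection explicitly sets it to 0.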
Next, we run a regular expression search for tweets whose text contains the string “football”, and count the results:
db.users.find({tweet_text: /football/}).count()
This search results in 2,868 documents.
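Note that the pattern /football/ is a case-sensitive substring match, so it skips “Football” and also matches longer words such as “footballer”. The sketch below, run on hypothetical tweet texts in plain JavaScript, shows how the i flag and word boundaries change what is counted. MongoDB uses its own regular expression engine, so treat this as an approximation of the matching rules rather than a reproduction of the server's behavior.

```javascript
// Hypothetical tweet texts (not taken from the real collection).
const texts = [
  "I love football",
  "Football season is here",
  "footballer of the year",
  "no sports today",
];

// Count how many texts a given regular expression matches.
function countMatches(pattern) {
  return texts.filter((text) => pattern.test(text)).length;
}

const substringCount = countMatches(/football/);      // case-sensitive substring
const ignoreCaseCount = countMatches(/football/i);    // also matches "Football"
const wholeWordCount = countMatches(/\bfootball\b/i); // excludes "footballer"
console.log(substringCount, ignoreCaseCount, wholeWordCount);
```

If the goal is to count tweets that mention the word in any capitalization, a pattern such as /\bfootball\b/i may be closer to the intent than /football/.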
The next line outputs the count of all tweets with tweet_mentioned_count greater than 2:
db.users.find({tweet_mentioned_count: {$gt: 2}}).count()
This search results in 271 documents.
Next, we would like to count the number of tweets with tweet_mentioned_count greater than tweet_followers_count:
db.users.find({$where : "this.tweet_mentioned_count > this.tweet_followers_count"}).count()
This search results in 18 documents.
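The $where operator evaluates a JavaScript expression once per document, with this bound to the document being tested. The sketch below emulates that evaluation on an in-memory array; the sample documents and the function name are hypothetical.

```javascript
// Hypothetical sample documents (not taken from the real collection).
const tweets = [
  { tweet_mentioned_count: 3, tweet_followers_count: 461 },
  { tweet_mentioned_count: 5, tweet_followers_count: 2 },
  { tweet_mentioned_count: 1, tweet_followers_count: 1 },
];

// The predicate reads fields through `this`, as the $where string does.
function mentionsExceedFollowers() {
  return this.tweet_mentioned_count > this.tweet_followers_count;
}

// Call the predicate once per document with `this` bound to that document.
const matching = tweets.filter((doc) => mentionsExceedFollowers.call(doc));
console.log(matching.length);

// Assumption: on MongoDB 3.6 or later, the same comparison could be written
// without JavaScript as
//   db.users.find({$expr: {$gt: ["$tweet_mentioned_count", "$tweet_followers_count"]}})
```

Because the expression runs for every document, $where cannot use indexes for the comparison, which is worth keeping in mind on collections much larger than this one.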
Finally, we would like to count documents with tweet_text ending with the word “football” and tweet_mentioned_count greater than 2:
db.users.find({$and : [ {tweet_text : /football$/}, {tweet_mentioned_count: {$gt: 2}}]}).count()
This search results in 3 documents.
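The small result is partly explained by the $ anchor: a tweet matches only if “football” is the very last thing in the text, so a trailing link or space rules it out. The sketch below applies both conditions to hypothetical in-memory documents in plain JavaScript; the sample data and predicate names are assumptions made for illustration.

```javascript
// Hypothetical sample documents (not taken from the real collection).
const docs = [
  { tweet_text: "I love football", tweet_mentioned_count: 3 },
  { tweet_text: "I love football", tweet_mentioned_count: 1 },
  { tweet_text: "football https://example.com", tweet_mentioned_count: 5 },
  { tweet_text: "watching football ", tweet_mentioned_count: 4 },
];

// First condition: the text ends with "football" (nothing may follow it).
const endsWithFootball = (doc) => /football$/.test(doc.tweet_text);
// Second condition: the mention count is strictly greater than 2.
const mentionedMoreThanTwo = (doc) => doc.tweet_mentioned_count > 2;

// $and keeps only the documents that satisfy every listed condition.
const bothCount = docs.filter(
  (doc) => endsWithFootball(doc) && mentionedMoreThanTwo(doc)
).length;
console.log(bothCount);
```

Since the two conditions apply to different fields, the query could also be written without $and as db.users.find({tweet_text: /football$/, tweet_mentioned_count: {$gt: 2}}), which MongoDB treats as an implicit AND.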