Reddit comments dataset
This dataset contains publicly-available comments on Reddit that go back to December, 2005, to March, 2023, and contains over 14B rows of data. The raw data is in JSON format in compressed files and the rows look like the following:
{"controversiality":0,"body":"A look at Vietnam and Mexico exposes the myth of market liberalisation.","subreddit_id":"t5_6","link_id":"t3_17863","stickied":false,"subreddit":"reddit.com","score":2,"ups":2,"author_flair_css_class":null,"created_utc":1134365188,"author_flair_text":null,"author":"frjo","id":"c13","edited":false,"parent_id":"t3_17863","gilded":0,"distinguished":null,"retrieved_on":1473738411}
{"created_utc":1134365725,"author_flair_css_class":null,"score":1,"ups":1,"subreddit":"reddit.com","stickied":false,"link_id":"t3_17866","subreddit_id":"t5_6","controversiality":0,"body":"The site states \"What can I use it for? Meeting notes, Reports, technical specs Sign-up sheets, proposals and much more...\", just like any other new breeed of sites that want us to store everything we have on the web. And they even guarantee multiple levels of security and encryption etc. But what prevents these web site operators fom accessing and/or stealing Meeting notes, Reports, technical specs Sign-up sheets, proposals and much more, for competitive or personal gains...? I am pretty sure that most of them are honest, but what's there to prevent me from setting up a good useful site and stealing all your data? Call me paranoid - I am.","retrieved_on":1473738411,"distinguished":null,"gilded":0,"id":"c14","edited":false,"parent_id":"t3_17866","author":"zse7zse","author_flair_text":null}
{"gilded":0,"distinguished":null,"retrieved_on":1473738411,"author":"[deleted]","author_flair_text":null,"edited":false,"id":"c15","parent_id":"t3_17869","subreddit":"reddit.com","score":0,"ups":0,"created_utc":1134366848,"author_flair_css_class":null,"body":"Jython related topics by Frank Wierzbicki","controversiality":0,"subreddit_id":"t5_6","stickied":false,"link_id":"t3_17869"}
{"gilded":0,"retrieved_on":1473738411,"distinguished":null,"author_flair_text":null,"author":"[deleted]","edited":false,"parent_id":"t3_17870","id":"c16","subreddit":"reddit.com","created_utc":1134367660,"author_flair_css_class":null,"score":1,"ups":1,"body":"[deleted]","controversiality":0,"stickied":false,"link_id":"t3_17870","subreddit_id":"t5_6"}
{"gilded":0,"retrieved_on":1473738411,"distinguished":null,"author_flair_text":null,"author":"rjoseph","edited":false,"id":"c17","parent_id":"t3_17817","subreddit":"reddit.com","author_flair_css_class":null,"created_utc":1134367754,"score":1,"ups":1,"body":"Saft is by far the best extension you could tak onto your Safari","controversiality":0,"link_id":"t3_17817","stickied":false,"subreddit_id":"t5_6"}
A shoutout to Percona for the motivation behind ingesting this dataset, which we have downloaded and stored in an S3 bucket.
The following commands were executed on a Production instance of ClickHouse Cloud with the minimum memory set to 720GB. To run this on your own cluster, replace default
in the s3Cluster
function call with the name of your cluster. If you do not have a cluster, then replace the s3Cluster
function with the s3
function.
- Let's create a table for the Reddit data:
CREATE TABLE reddit
(
subreddit LowCardinality(String),
subreddit_id LowCardinality(String),
subreddit_type Enum('public' = 1, 'restricted' = 2, 'user' = 3, 'archived' = 4, 'gold_restricted' = 5, 'private' = 6),
author LowCardinality(String),
body String CODEC(ZSTD(6)),
created_date Date DEFAULT toDate(created_utc),
created_utc DateTime,
retrieved_on DateTime,
id String,
parent_id String,
link_id String,
score Int32,
total_awards_received UInt16,
controversiality UInt8,
gilded UInt8,
collapsed_because_crowd_control UInt8,
collapsed_reason Enum('' = 0, 'comment score below threshold' = 1, 'may be sensitive content' = 2, 'potentially toxic' = 3, 'potentially toxic content' = 4),
distinguished Enum('' = 0, 'moderator' = 1, 'admin' = 2, 'special' = 3),
removal_reason Enum('' = 0, 'legal' = 1),
author_created_utc DateTime,
author_fullname LowCardinality(String),
author_patreon_flair UInt8,
author_premium UInt8,
can_gild UInt8,
can_mod_post UInt8,
collapsed UInt8,
is_submitter UInt8,
_edited String,
locked UInt8,
quarantined UInt8,
no_follow UInt8,
send_replies UInt8,
stickied UInt8,
author_flair_text LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (subreddit, created_date, author);
The names of the files in S3 start with RC_YYYY-MM
where YYYY-MM
goes from 2005-12
to 2023-02
. The compression changes a couple of times though, so the file extensions are not consistent. For example:
- the file names are initially
RC_2005-12.bz2
toRC_2017-11.bz2
- then they look like
RC_2017-12.xz
toRC_2018-09.xz
- and finally
RC_2018-10.zst
toRC_2023-02.zst
- We are going to start with one month of data, but if you want to simply insert every row - skip ahead to step 8 below. The following file has 86M records from December, 2017:
INSERT INTO reddit
SELECT *
FROM s3(
'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/reddit/original/RC_2017-12.xz',
'JSONEachRow'
);
- It will take a while depending on your resources, but when it's done verify it worked:
SELECT formatReadableQuantity(count())
FROM reddit;
ββformatReadableQuantity(count())ββ
β 85.97 million β
βββββββββββββββββββββββββββββββββββ
- Let's see how many unique subreddits were in December of 2017:
SELECT uniqExact(subreddit)
FROM reddit;
ββuniqExact(subreddit)ββ
β 91613 β
ββββββββββββββββββββββββ
1 row in set. Elapsed: 1.572 sec. Processed 85.97 million rows, 367.43 MB (54.71 million rows/s., 233.80 MB/s.)
- This query returns the top 10 subreddits (in terms of number of comments):
SELECT
subreddit,
count() AS c
FROM reddit
GROUP BY subreddit
ORDER BY c DESC
LIMIT 20;
ββsubredditββββββββ¬βββββββcββ
β AskReddit β 5245881 β
β politics β 1753120 β
β nfl β 1220266 β
β nba β 960388 β
β The_Donald β 931857 β
β news β 796617 β
β worldnews β 765709 β
β CFB β 710360 β
β gaming β 602761 β
β movies β 601966 β
β soccer β 590628 β
β Bitcoin β 583783 β
β pics β 563408 β
β StarWars β 562514 β
β funny β 547563 β
β leagueoflegends β 517213 β
β teenagers β 492020 β
β DestinyTheGame β 477377 β
β todayilearned β 472650 β
β videos β 450581 β
βββββββββββββββββββ΄ββββββββββ
20 rows in set. Elapsed: 0.368 sec. Processed 85.97 million rows, 367.43 MB (233.34 million rows/s., 997.25 MB/s.)
- Here are the top 10 authors in December of 2017, in terms of number of comments posted:
SELECT
author,
count() AS c
FROM reddit
GROUP BY author
ORDER BY c DESC
LIMIT 10;
ββauthorβββββββββββ¬βββββββcββ
β [deleted] β 5913324 β
β AutoModerator β 784886 β
β ImagesOfNetwork β 83241 β
β BitcoinAllBot β 54484 β
β imguralbumbot β 45822 β
β RPBot β 29337 β
β WikiTextBot β 25982 β
β Concise_AMA_Bot β 19974 β
β MTGCardFetcher β 19103 β
β TotesMessenger β 19057 β
βββββββββββββββββββ΄ββββββββββ
10 rows in set. Elapsed: 8.143 sec. Processed 85.97 million rows, 711.05 MB (10.56 million rows/s., 87.32 MB/s.)
- We already inserted some data, but we will start over:
TRUNCATE TABLE reddit;
- This is a fun dataset and it looks like we can find some great information, so let's go ahead and insert the entire dataset from 2005 to 2023. For practical reasons, it works well to insert the data by years starting with...
INSERT INTO reddit
SELECT *
FROM s3Cluster(
'default',
'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/reddit/original/RC_2005*',
'JSONEachRow'
)
SETTINGS zstd_window_log_max = 31;
...and ending with:
INSERT INTO reddit
SELECT *
FROM s3Cluster(
'default',
'https://clickhouse-public-datasets.s3.amazonaws.com/reddit/original/RC_2023*',
'JSONEachRow'
)
SETTINGS zstd_window_log_max = 31;
If you do not have a cluster, use s3
instead of s3Cluster
:
INSERT INTO reddit
SELECT *
FROM s3(
'https://clickhouse-public-datasets.s3.amazonaws.com/reddit/original/RC_2005*',
'JSONEachRow'
)
SETTINGS zstd_window_log_max = 31;
- To verify it worked, here are the number of rows per year (as of February, 2023):
SELECT
toYear(created_utc) AS year,
formatReadableQuantity(count())
FROM reddit
GROUP BY year;
ββyearββ¬βformatReadableQuantity(count())ββ
β 2005 β 1.07 thousand β
β 2006 β 417.18 thousand β
β 2007 β 2.46 million β
β 2008 β 7.24 million β
β 2009 β 18.86 million β
β 2010 β 42.93 million β
β 2011 β 28.91 million β
β 2012 β 260.31 million β
β 2013 β 402.21 million β
β 2014 β 531.80 million β
β 2015 β 667.76 million β
β 2016 β 799.90 million β
β 2017 β 972.86 million β
β 2018 β 1.24 billion β
β 2019 β 1.66 billion β
β 2020 β 2.16 billion β
β 2021 β 2.59 billion β
β 2022 β 2.82 billion β
β 2023 β 474.86 million β
ββββββββ΄ββββββββββββββββββββββββββββββββββ
- Let's see how many rows were inserted and how much disk space the table is using:
SELECT
sum(rows) AS count,
formatReadableQuantity(count),
formatReadableSize(sum(bytes)) AS disk_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size
FROM system.parts
WHERE (table = 'reddit') AND active;
Notice the compression of disk storage is about 1/3 of the uncompressed size:
ββββββββcountββ¬βformatReadableQuantity(sum(rows))ββ¬βdisk_sizeββ¬βuncompressed_sizeββ
β 14688534662 β 14.69 billion β 1.03 TiB β 3.26 TiB β
βββββββββββββββ΄ββββββββββββββββββββββββββββββββββββ΄ββββββββββββ΄ββββββββββββββββββββ
1 row in set. Elapsed: 0.005 sec.
- The following query shows how many comments, authors and subreddits we have for each month:
SELECT
toStartOfMonth(created_utc) AS firstOfMonth,
count() AS c,
bar(c, 0, 50000000, 25) AS bar_count,
uniq(author) AS authors,
bar(authors, 0, 5000000, 25) AS bar_authors,
uniq(subreddit) AS subreddits,
bar(subreddits, 0, 100000, 25) AS bar_subreddits
FROM reddit
GROUP BY firstOfMonth
ORDER BY firstOfMonth ASC;
This is a substantial query that has to process all 14.69 billion rows, but we still get an impressive response time (about 48 seconds):
ββfirstOfMonthββ¬βββββββββcββ¬βbar_countββββββββββββββββββ¬ββauthorsββ¬βbar_authorsββββββββββββββββ¬βsubredditsββ¬βbar_subredditsβββββββββββββ
β 2005-12-01 β 1075 β β 394 β β 1 β β
β 2006-01-01 β 3666 β β 791 β β 2 β β
β 2006-02-01 β 9095 β β 1464 β β 18 β β
β 2006-03-01 β 13859 β β 1958 β β 15 β β
β 2006-04-01 β 19090 β β 2334 β β 21 β β
β 2006-05-01 β 26859 β β 2698 β β 21 β β
β 2006-06-01 β 29163 β β 3043 β β 19 β β
β 2006-07-01 β 37031 β β 3532 β β 22 β β
β 2006-08-01 β 50559 β β 4750 β β 24 β β
β 2006-09-01 β 50675 β β 4908 β β 21 β β
β 2006-10-01 β 54148 β β 5654 β β 31 β β
β 2006-11-01 β 62021 β β 6490 β β 23 β β
β 2006-12-01 β 61018 β β 6707 β β 24 β β
β 2007-01-01 β 81341 β β 7931 β β 23 β β
β 2007-02-01 β 95634 β β 9020 β β 21 β β
β 2007-03-01 β 112444 β β 10842 β β 23 β β
β 2007-04-01 β 126773 β β 10701 β β 26 β β
β 2007-05-01 β 170097 β β 11365 β β 25 β β
β 2007-06-01 β 178800 β β 11267 β β 22 β β
β 2007-07-01 β 203319 β β 12482 β β 25 β β
β 2007-08-01 β 225111 β β 14124 β β 30 β β
β 2007-09-01 β 259497 β β β 15416 β β 33 β β
β 2007-10-01 β 274170 β β β 15302 β β 36 β β
β 2007-11-01 β 372983 β β β 15134 β β 43 β β
β 2007-12-01 β 363390 β β β 15915 β β 31 β β
β 2008-01-01 β 452990 β β β 18857 β β 126 β β
β 2008-02-01 β 441768 β β β 18266 β β 173 β β
β 2008-03-01 β 463728 β β β 18947 β β 292 β β
β 2008-04-01 β 468317 β β β 18590 β β 323 β β
β 2008-05-01 β 536380 β β β 20861 β β 375 β β
β 2008-06-01 β 577684 β β β 22557 β β 575 β β β
β 2008-07-01 β 592610 β β β 23123 β β 657 β β β
β 2008-08-01 β 595959 β β β 23729 β β 707 β β β
β 2008-09-01 β 680892 β β β 26374 β β β 801 β β β
β 2008-10-01 β 789874 β β β 28970 β β β 893 β β β
β 2008-11-01 β 792310 β β β 30272 β β β 1024 β β β
β 2008-12-01 β 850359 β β β 34073 β β β 1103 β β β
β 2009-01-01 β 1051649 β β β 38978 β β β 1316 β β β
β 2009-02-01 β 944711 β β β 43390 β β β 1132 β β β
β 2009-03-01 β 1048643 β β β 46516 β β β 1203 β β β
β 2009-04-01 β 1094599 β β β 48284 β β β 1334 β β β
β 2009-05-01 β 1201257 β β β 52512 β β β 1395 β β β
β 2009-06-01 β 1258750 β β β 57728 β β β 1473 β β β
β 2009-07-01 β 1470290 β β β 60098 β β β 1686 β β β
β 2009-08-01 β 1750688 β β β 67347 β β β 1777 β β β
β 2009-09-01 β 2032276 β β β 78051 β β β 1784 β β β
β 2009-10-01 β 2242017 β β β 93409 β β β 2071 β β β
β 2009-11-01 β 2207444 β β β 95940 β β β 2141 β β β
β 2009-12-01 β 2560510 β ββ β 104239 β β β 2141 β β β
β 2010-01-01 β 2884096 β ββ β 114314 β β β 2313 β β β
β 2010-02-01 β 2687779 β ββ β 115683 β β β 2522 β β β
β 2010-03-01 β 3228254 β ββ β 125775 β β β 2890 β β β
β 2010-04-01 β 3209898 β ββ β 128936 β β β 3170 β β β
β 2010-05-01 β 3267363 β ββ β 131851 β β β 3166 β β β
β 2010-06-01 β 3532867 β ββ β 139522 β β β 3301 β β β
β 2010-07-01 β 806612 β β β 76486 β β β 1955 β β β
β 2010-08-01 β 4247982 β ββ β 164071 β β β 3653 β β β
β 2010-09-01 β 4704069 β βββ β 186613 β β β 4009 β β β
β 2010-10-01 β 5032368 β βββ β 203800 β β β 4154 β β β
β 2010-11-01 β 5689002 β βββ β 226134 β ββ β 4383 β β β
β 2010-12-01 β 3642690 β ββ β 196847 β β β 3914 β β β
β 2011-01-01 β 3924540 β ββ β 215057 β β β 4240 β β β
β 2011-02-01 β 3859131 β ββ β 223485 β β β 4371 β β β
β 2011-03-01 β 2877996 β ββ β 208607 β β β 3870 β β β
β 2011-04-01 β 3859131 β ββ β 248931 β ββ β 4881 β ββ β
β 2011-06-01 β 3859131 β ββ β 267197 β ββ β 5255 β ββ β
β 2011-08-01 β 2943405 β ββ β 259428 β ββ β 5806 β ββ β
β 2011-10-01 β 3859131 β ββ β 327342 β ββ β 6958 β ββ β
β 2011-12-01 β 3728313 β ββ β 354817 β ββ β 7713 β ββ β
β 2012-01-01 β 16350205 β βββββββββ β 696110 β ββββ β 14281 β ββββ β
β 2012-02-01 β 16015695 β ββββββββ β 722892 β ββββ β 14949 β ββββ β
β 2012-03-01 β 17881943 β βββββββββ β 789664 β ββββ β 15795 β ββββ β
β 2012-04-01 β 19044534 β ββββββββββ β 842491 β βββββ β 16440 β ββββ β
β 2012-05-01 β 20388260 β βββββββββββ β 886176 β βββββ β 16974 β βββββ β
β 2012-06-01 β 21897913 β βββββββββββ β 946798 β βββββ β 17952 β βββββ β
β 2012-07-01 β 24087517 β ββββββββββββ β 1018636 β βββββ β 19069 β βββββ β
β 2012-08-01 β 25703326 β βββββββββββββ β 1094445 β ββββββ β 20553 β ββββββ β
β 2012-09-01 β 23419524 β ββββββββββββ β 1088491 β ββββββ β 20831 β ββββββ β
β 2012-10-01 β 24788236 β βββββββββββββ β 1131885 β ββββββ β 21868 β ββββββ β
β 2012-11-01 β 24648302 β βββββββββββββ β 1167608 β ββββββ β 21791 β ββββββ β
β 2012-12-01 β 26080276 β βββββββββββββ β 1218402 β ββββββ β 22622 β ββββββ β
β 2013-01-01 β 30365867 β ββββββββββββββββ β 1341703 β βββββββ β 24696 β βββββββ β
β 2013-02-01 β 27213960 β ββββββββββββββ β 1304756 β βββββββ β 24514 β βββββββ β
β 2013-03-01 β 30771274 β ββββββββββββββββ β 1391703 β βββββββ β 25730 β βββββββ β
β 2013-04-01 β 33259557 β βββββββββββββββββ β 1485971 β ββββββββ β 27294 β βββββββ β
β 2013-05-01 β 33126225 β βββββββββββββββββ β 1506473 β ββββββββ β 27299 β βββββββ β
β 2013-06-01 β 32648247 β βββββββββββββββββ β 1506650 β ββββββββ β 27450 β βββββββ β
β 2013-07-01 β 34922133 β ββββββββββββββββββ β 1561771 β ββββββββ β 28294 β βββββββ β
β 2013-08-01 β 34766579 β ββββββββββββββββββ β 1589781 β ββββββββ β 28943 β ββββββββ β
β 2013-09-01 β 31990369 β ββββββββββββββββ β 1570342 β ββββββββ β 29408 β ββββββββ β
β 2013-10-01 β 35940040 β ββββββββββββββββββ β 1683770 β βββββββββ β 30273 β ββββββββ β
β 2013-11-01 β 37396497 β βββββββββββββββββββ β 1757467 β βββββββββ β 31173 β ββββββββ β
β 2013-12-01 β 39810216 β ββββββββββββββββββββ β 1846204 β ββββββββββ β 32326 β ββββββββ β
β 2014-01-01 β 42420655 β ββββββββββββββββββββββ β 1927229 β ββββββββββ β 35603 β βββββββββ β
β 2014-02-01 β 38703362 β ββββββββββββββββββββ β 1874067 β ββββββββββ β 37007 β ββββββββββ β
β 2014-03-01 β 42459956 β ββββββββββββββββββββββ β 1959888 β ββββββββββ β 37948 β ββββββββββ β
β 2014-04-01 β 42440735 β ββββββββββββββββββββββ β 1951369 β ββββββββββ β 38362 β ββββββββββ β
β 2014-05-01 β 42514094 β ββββββββββββββββββββββ β 1970197 β ββββββββββ β 39078 β ββββββββββ β
β 2014-06-01 β 41990650 β βββββββββββββββββββββ β 1943850 β ββββββββββ β 38268 β ββββββββββ β
β 2014-07-01 β 46868899 β ββββββββββββββββββββββββ β 2059346 β βββββββββββ β 40634 β βββββββββββ β
β 2014-08-01 β 46990813 β ββββββββββββββββββββββββ β 2117335 β βββββββββββ β 41764 β βββββββββββ β
β 2014-09-01 β 44992201 β βββββββββββββββββββββββ β 2124708 β βββββββββββ β 41890 β βββββββββββ β
β 2014-10-01 β 47497520 β ββββββββββββββββββββββββ β 2206535 β βββββββββββ β 43109 β βββββββββββ β
β 2014-11-01 β 46118074 β βββββββββββββββββββββββ β 2239747 β ββββββββββββ β 43718 β βββββββββββ β
β 2014-12-01 β 48807699 β βββββββββββββββββββββββββ β 2372945 β ββββββββββββ β 43823 β βββββββββββ β
β 2015-01-01 β 53851542 β βββββββββββββββββββββββββ β 2499536 β βββββββββββββ β 47172 β ββββββββββββ β
β 2015-02-01 β 48342747 β βββββββββββββββββββββββββ β 2448496 β βββββββββββββ β 47229 β ββββββββββββ β
β 2015-03-01 β 54564441 β βββββββββββββββββββββββββ β 2550534 β βββββββββββββ β 48156 β ββββββββββββ β
β 2015-04-01 β 55005780 β βββββββββββββββββββββββββ β 2609443 β βββββββββββββ β 49865 β βββββββββββββ β
β 2015-05-01 β 54504410 β βββββββββββββββββββββββββ β 2585535 β βββββββββββββ β 50137 β βββββββββββββ β
β 2015-06-01 β 54258492 β βββββββββββββββββββββββββ β 2595129 β βββββββββββββ β 49598 β βββββββββββββ β
β 2015-07-01 β 58451788 β βββββββββββββββββββββββββ β 2720026 β ββββββββββββββ β 55022 β ββββββββββββββ β
β 2015-08-01 β 58075327 β βββββββββββββββββββββββββ β 2743994 β ββββββββββββββ β 55302 β ββββββββββββββ β
β 2015-09-01 β 55574825 β βββββββββββββββββββββββββ β 2672793 β ββββββββββββββ β 53960 β ββββββββββββββ β
β 2015-10-01 β 59494045 β βββββββββββββββββββββββββ β 2816426 β ββββββββββββββ β 70210 β ββββββββββββββββββ β
β 2015-11-01 β 57117500 β βββββββββββββββββββββββββ β 2847146 β βββββββββββββββ β 71363 β ββββββββββββββββββ β
β 2015-12-01 β 58523312 β βββββββββββββββββββββββββ β 2854840 β βββββββββββββββ β 94559 β ββββββββββββββββββββββββ β
β 2016-01-01 β 61991732 β βββββββββββββββββββββββββ β 2920366 β βββββββββββββββ β 108438 β βββββββββββββββββββββββββ β
β 2016-02-01 β 59189875 β βββββββββββββββββββββββββ β 2854683 β βββββββββββββββ β 109916 β βββββββββββββββββββββββββ β
β 2016-03-01 β 63918864 β βββββββββββββββββββββββββ β 2969542 β βββββββββββββββ β 84787 β ββββββββββββββββββββββ β
β 2016-04-01 β 64271256 β βββββββββββββββββββββββββ β 2999086 β βββββββββββββββ β 61647 β ββββββββββββββββ β
β 2016-05-01 β 65212004 β βββββββββββββββββββββββββ β 3034674 β ββββββββββββββββ β 67465 β βββββββββββββββββ β
β 2016-06-01 β 65867743 β βββββββββββββββββββββββββ β 3057604 β ββββββββββββββββ β 75170 β βββββββββββββββββββ β
β 2016-07-01 β 66974735 β βββββββββββββββββββββββββ β 3199374 β ββββββββββββββββ β 77732 β ββββββββββββββββββββ β
β 2016-08-01 β 69654819 β βββββββββββββββββββββββββ β 3239957 β βββββββββββββββββ β 63080 β ββββββββββββββββ β
β 2016-09-01 β 67024973 β βββββββββββββββββββββββββ β 3190864 β ββββββββββββββββ β 62324 β ββββββββββββββββ β
β 2016-10-01 β 71826553 β βββββββββββββββββββββββββ β 3284340 β βββββββββββββββββ β 62549 β ββββββββββββββββ β
β 2016-11-01 β 71022319 β βββββββββββββββββββββββββ β 3300822 β βββββββββββββββββ β 69718 β ββββββββββββββββββ β
β 2016-12-01 β 72942967 β βββββββββββββββββββββββββ β 3430324 β ββββββββββββββββββ β 71705 β ββββββββββββββββββ β
β 2017-01-01 β 78946585 β βββββββββββββββββββββββββ β 3572093 β ββββββββββββββββββ β 78198 β ββββββββββββββββββββ β
β 2017-02-01 β 70609487 β βββββββββββββββββββββββββ β 3421115 β βββββββββββββββββ β 69823 β ββββββββββββββββββ β
β 2017-03-01 β 79723106 β βββββββββββββββββββββββββ β 3638122 β βββββββββββββββββββ β 73865 β βββββββββββββββββββ β
β 2017-04-01 β 77478009 β βββββββββββββββββββββββββ β 3620591 β ββββββββββββββββββ β 74387 β βββββββββββββββββββ β
β 2017-05-01 β 79810360 β βββββββββββββββββββββββββ β 3650820 β βββββββββββββββββββ β 74356 β βββββββββββββββββββ β
β 2017-06-01 β 79901711 β βββββββββββββββββββββββββ β 3737614 β βββββββββββββββββββ β 72114 β ββββββββββββββββββ β
β 2017-07-01 β 81798725 β βββββββββββββββββββββββββ β 3872330 β ββββββββββββββββββββ β 76052 β βββββββββββββββββββ β
β 2017-08-01 β 84658503 β βββββββββββββββββββββββββ β 3960093 β ββββββββββββββββββββ β 77798 β ββββββββββββββββββββ β
β 2017-09-01 β 83165192 β βββββββββββββββββββββββββ β 3880501 β ββββββββββββββββββββ β 78402 β ββββββββββββββββββββ β
β 2017-10-01 β 85828912 β βββββββββββββββββββββββββ β 3980335 β ββββββββββββββββββββ β 80685 β βββββββββββββββββββββ β
β 2017-11-01 β 84965681 β βββββββββββββββββββββββββ β 4026749 β βββββββββββββββββββββ β 82659 β βββββββββββββββββββββ β
β 2017-12-01 β 85973810 β βββββββββββββββββββββββββ β 4196354 β βββββββββββββββββββββ β 91984 β βββββββββββββββββββββββ β
β 2018-01-01 β 91558594 β βββββββββββββββββββββββββ β 4364443 β ββββββββββββββββββββββ β 102577 β βββββββββββββββββββββββββ β
β 2018-02-01 β 86467179 β βββββββββββββββββββββββββ β 4277899 β ββββββββββββββββββββββ β 104610 β βββββββββββββββββββββββββ β
β 2018-03-01 β 96490262 β βββββββββββββββββββββββββ β 4422470 β ββββββββββββββββββββββ β 112559 β βββββββββββββββββββββββββ β
β 2018-04-01 β 98101232 β βββββββββββββββββββββββββ β 4572434 β βββββββββββββββββββββββ β 105284 β βββββββββββββββββββββββββ β
β 2018-05-01 β 100109100 β βββββββββββββββββββββββββ β 4698908 β ββββββββββββββββββββββββ β 103910 β βββββββββββββββββββββββββ β
β 2018-06-01 β 100009462 β βββββββββββββββββββββββββ β 4697426 β ββββββββββββββββββββββββ β 101107 β βββββββββββββββββββββββββ β
β 2018-07-01 β 108151359 β βββββββββββββββββββββββββ β 5099492 β βββββββββββββββββββββββββ β 106184 β βββββββββββββββββββββββββ β
β 2018-08-01 β 107330940 β βββββββββββββββββββββββββ β 5084082 β βββββββββββββββββββββββββ β 109985 β βββββββββββββββββββββββββ β
β 2018-09-01 β 104473929 β βββββββββββββββββββββββββ β 5011953 β βββββββββββββββββββββββββ β 109710 β βββββββββββββββββββββββββ β
β 2018-10-01 β 112346556 β βββββββββββββββββββββββββ β 5320405 β βββββββββββββββββββββββββ β 112533 β βββββββββββββββββββββββββ β
β 2018-11-01 β 112573001 β βββββββββββββββββββββββββ β 5353282 β βββββββββββββββββββββββββ β 112211 β βββββββββββββββββββββββββ β
β 2018-12-01 β 121953600 β βββββββββββββββββββββββββ β 5611543 β βββββββββββββββββββββββββ β 118291 β βββββββββββββββββββββββββ β
β 2019-01-01 β 129386587 β βββββββββββββββββββββββββ β 6016687 β βββββββββββββββββββββββββ β 125725 β βββββββββββββββββββββββββ β
β 2019-02-01 β 120645639 β βββββββββββββββββββββββββ β 5974488 β βββββββββββββββββββββββββ β 125420 β βββββββββββββββββββββββββ β
β 2019-03-01 β 137650471 β βββββββββββββββββββββββββ β 6410197 β βββββββββββββββββββββββββ β 135924 β βββββββββββββββββββββββββ β
β 2019-04-01 β 138473643 β βββββββββββββββββββββββββ β 6416384 β βββββββββββββββββββββββββ β 139844 β βββββββββββββββββββββββββ β
β 2019-05-01 β 142463421 β βββββββββββββββββββββββββ β 6574836 β βββββββββββββββββββββββββ β 142012 β βββββββββββββββββββββββββ β
β 2019-06-01 β 134172939 β βββββββββββββββββββββββββ β 6601267 β βββββββββββββββββββββββββ β 140997 β βββββββββββββββββββββββββ β
β 2019-07-01 β 145965083 β βββββββββββββββββββββββββ β 6901822 β βββββββββββββββββββββββββ β 147802 β βββββββββββββββββββββββββ β
β 2019-08-01 β 146854393 β βββββββββββββββββββββββββ β 6993882 β βββββββββββββββββββββββββ β 151888 β βββββββββββββββββββββββββ β
β 2019-09-01 β 137540219 β βββββββββββββββββββββββββ β 7001362 β βββββββββββββββββββββββββ β 148839 β βββββββββββββββββββββββββ β
β 2019-10-01 β 145909884 β βββββββββββββββββββββββββ β 7160126 β βββββββββββββββββββββββββ β 152075 β βββββββββββββββββββββββββ β
β 2019-11-01 β 138512489 β βββββββββββββββββββββββββ β 7098723 β βββββββββββββββββββββββββ β 164597 β βββββββββββββββββββββββββ β
β 2019-12-01 β 146012313 β βββββββββββββββββββββββββ β 7438261 β βββββββββββββββββββββββββ β 166966 β βββββββββββββββββββββββββ β
β 2020-01-01 β 153498208 β βββββββββββββββββββββββββ β 7703548 β βββββββββββββββββββββββββ β 174390 β βββββββββββββββββββββββββ β
β 2020-02-01 β 148386817 β βββββββββββββββββββββββββ β 7582031 β βββββββββββββββββββββββββ β 170257 β βββββββββββββββββββββββββ β
β 2020-03-01 β 166266315 β βββββββββββββββββββββββββ β 8339049 β βββββββββββββββββββββββββ β 192460 β βββββββββββββββββββββββββ β
β 2020-04-01 β 178511581 β βββββββββββββββββββββββββ β 8991649 β βββββββββββββββββββββββββ β 202334 β βββββββββββββββββββββββββ β
β 2020-05-01 β 189993779 β βββββββββββββββββββββββββ β 9331358 β βββββββββββββββββββββββββ β 217357 β βββββββββββββββββββββββββ β
β 2020-06-01 β 187914434 β βββββββββββββββββββββββββ β 9085003 β βββββββββββββββββββββββββ β 223362 β βββββββββββββββββββββββββ β
β 2020-07-01 β 194244994 β βββββββββββββββββββββββββ β 9321706 β βββββββββββββββββββββββββ β 228222 β βββββββββββββββββββββββββ β
β 2020-08-01 β 196099301 β βββββββββββββββββββββββββ β 9368408 β βββββββββββββββββββββββββ β 230251 β βββββββββββββββββββββββββ β
β 2020-09-01 β 182549761 β βββββββββββββββββββββββββ β 9271571 β βββββββββββββββββββββββββ β 227889 β βββββββββββββββββββββββββ β
β 2020-10-01 β 186583890 β βββββββββββββββββββββββββ β 9396112 β βββββββββββββββββββββββββ β 233715 β βββββββββββββββββββββββββ β
β 2020-11-01 β 186083723 β βββββββββββββββββββββββββ β 9623053 β βββββββββββββββββββββββββ β 234963 β βββββββββββββββββββββββββ β
β 2020-12-01 β 191317162 β βββββββββββββββββββββββββ β 9898168 β βββββββββββββββββββββββββ β 249115 β βββββββββββββββββββββββββ β
β 2021-01-01 β 210496207 β βββββββββββββββββββββββββ β 10503943 β βββββββββββββββββββββββββ β 259805 β βββββββββββββββββββββββββ β
β 2021-02-01 β 193510365 β βββββββββββββββββββββββββ β 10215033 β βββββββββββββββββββββββββ β 253656 β βββββββββββββββββββββββββ β
β 2021-03-01 β 207454415 β βββββββββββββββββββββββββ β 10365629 β βββββββββββββββββββββββββ β 267263 β βββββββββββββββββββββββββ β
β 2021-04-01 β 204573086 β βββββββββββββββββββββββββ β 10391984 β βββββββββββββββββββββββββ β 270543 β βββββββββββββββββββββββββ β
β 2021-05-01 β 217655366 β βββββββββββββββββββββββββ β 10648130 β βββββββββββββββββββββββββ β 288555 β βββββββββββββββββββββββββ β
β 2021-06-01 β 208027069 β βββββββββββββββββββββββββ β 10397311 β βββββββββββββββββββββββββ β 291520 β βββββββββββββββββββββββββ β
β 2021-07-01 β 210955954 β βββββββββββββββββββββββββ β 10063967 β βββββββββββββββββββββββββ β 252061 β βββββββββββββββββββββββββ β
β 2021-08-01 β 225681244 β βββββββββββββββββββββββββ β 10383556 β βββββββββββββββββββββββββ β 254569 β βββββββββββββββββββββββββ β
β 2021-09-01 β 220086513 β βββββββββββββββββββββββββ β 10298344 β βββββββββββββββββββββββββ β 256826 β βββββββββββββββββββββββββ β
β 2021-10-01 β 227527379 β βββββββββββββββββββββββββ β 10729882 β βββββββββββββββββββββββββ β 283328 β βββββββββββββββββββββββββ β
β 2021-11-01 β 228289963 β βββββββββββββββββββββββββ β 10995197 β βββββββββββββββββββββββββ β 302386 β βββββββββββββββββββββββββ β
β 2021-12-01 β 235807471 β βββββββββββββββββββββββββ β 11312798 β βββββββββββββββββββββββββ β 313876 β βββββββββββββββββββββββββ β
β 2022-01-01 β 256766679 β βββββββββββββββββββββββββ β 12074520 β βββββββββββββββββββββββββ β 340407 β βββββββββββββββββββββββββ β
β 2022-02-01 β 219927645 β βββββββββββββββββββββββββ β 10846045 β βββββββββββββββββββββββββ β 293236 β βββββββββββββββββββββββββ β
β 2022-03-01 β 236554668 β βββββββββββββββββββββββββ β 11330285 β βββββββββββββββββββββββββ β 302387 β βββββββββββββββββββββββββ β
β 2022-04-01 β 231188077 β βββββββββββββββββββββββββ β 11697995 β βββββββββββββββββββββββββ β 316303 β βββββββββββββββββββββββββ β
β 2022-05-01 β 230492108 β βββββββββββββββββββββββββ β 11448584 β βββββββββββββββββββββββββ β 323725 β βββββββββββββββββββββββββ β
β 2022-06-01 β 218842949 β βββββββββββββββββββββββββ β 11400399 β βββββββββββββββββββββββββ β 324846 β βββββββββββββββββββββββββ β
β 2022-07-01 β 242504279 β βββββββββββββββββββββββββ β 12049204 β βββββββββββββββββββββββββ β 335621 β βββββββββββββββββββββββββ β
β 2022-08-01 β 247215325 β βββββββββββββββββββββββββ β 12189276 β βββββββββββββββββββββββββ β 337873 β βββββββββββββββββββββββββ β
β 2022-09-01 β 234131223 β βββββββββββββββββββββββββ β 11674079 β βββββββββββββββββββββββββ β 326325 β βββββββββββββββββββββββββ β
β 2022-10-01 β 237365072 β βββββββββββββββββββββββββ β 11804508 β βββββββββββββββββββββββββ β 336063 β βββββββββββββββββββββββββ β
β 2022-11-01 β 229478878 β βββββββββββββββββββββββββ β 11543020 β βββββββββββββββββββββββββ β 323122 β βββββββββββββββββββββββββ β
β 2022-12-01 β 238862690 β βββββββββββββββββββββββββ β 11967451 β βββββββββββββββββββββββββ β 331668 β βββββββββββββββββββββββββ β
β 2023-01-01 β 253577512 β βββββββββββββββββββββββββ β 12264087 β βββββββββββββββββββββββββ β 332711 β βββββββββββββββββββββββββ β
β 2023-02-01 β 221285501 β βββββββββββββββββββββββββ β 11537091 β βββββββββββββββββββββββββ β 317879 β βββββββββββββββββββββββββ β
ββββββββββββββββ΄ββββββββββββ΄ββββββββββββββββββββββββββββ΄βββββββββββ΄ββββββββββββββββββββββββββββ΄βββββββββββββ΄ββββββββββββββββββββββββββββ
203 rows in set. Elapsed: 48.492 sec. Processed 14.69 billion rows, 213.35 GB (302.91 million rows/s., 4.40 GB/s.)
- Here are the top 10 subreddits of 2022:
SELECT
subreddit,
count() AS count
FROM reddit
WHERE toYear(created_utc) = 2022
GROUP BY subreddit
ORDER BY count DESC
LIMIT 10;
ββsubredditβββββββ¬ββββcountββ
β AskReddit β 72312060 β
β AmItheAsshole β 25323210 β
β teenagers β 22355960 β
β worldnews β 17797707 β
β FreeKarma4U β 15652274 β
β FreeKarma4You β 14929055 β
β wallstreetbets β 14235271 β
β politics β 12511136 β
β memes β 11610792 β
β nba β 11586571 β
ββββββββββββββββββ΄βββββββββββ
10 rows in set. Elapsed: 5.956 sec. Processed 14.69 billion rows, 126.19 GB (2.47 billion rows/s., 21.19 GB/s.)
- Let's see which subreddits had the biggest increase in comments from 2018 to 2019:
SELECT
subreddit,
newcount - oldcount AS diff
FROM
(
SELECT
subreddit,
count(*) AS newcount
FROM reddit
WHERE toYear(created_utc) = 2019
GROUP BY subreddit
)
ALL INNER JOIN
(
SELECT
subreddit,
count(*) AS oldcount
FROM reddit
WHERE toYear(created_utc) = 2018
GROUP BY subreddit
) USING (subreddit)
ORDER BY diff DESC
LIMIT 50
SETTINGS joined_subquery_requires_alias = 0;
It looks like memes and teenagers were busy on Reddit in 2019:
ββsubredditβββββββββββββ¬βββββdiffββ
β AskReddit β 18765909 β
β memes β 16496996 β
β teenagers β 13071715 β
β AmItheAsshole β 12312663 β
β dankmemes β 12016716 β
β unpopularopinion β 6809935 β
β PewdiepieSubmissions β 6330844 β
β Market76 β 5213690 β
β relationship_advice β 4060717 β
β Minecraft β 3328659 β
β freefolk β 3227970 β
β classicwow β 3063133 β
β Animemes β 2866876 β
β gonewild β 2457680 β
β PublicFreakout β 2452288 β
β gameofthrones β 2411661 β
β RoastMe β 2378781 β
β ShitPostCrusaders β 2345414 β
β AnthemTheGame β 1813152 β
β nfl β 1804407 β
β Showerthoughts β 1797968 β
β Cringetopia β 1764034 β
β pokemon β 1763269 β
β entitledparents β 1744852 β
β HistoryMemes β 1721645 β
β MortalKombat β 1718184 β
β trashy β 1684357 β
β ChapoTrapHouse β 1675363 β
β Brawlstars β 1663763 β
β iamatotalpieceofshit β 1647381 β
β ukpolitics β 1599204 β
β cursedcomments β 1590781 β
β Pikabu β 1578597 β
β wallstreetbets β 1535225 β
β AskOuija β 1533214 β
β interestingasfuck β 1528910 β
β aww β 1439008 β
β wholesomememes β 1436566 β
β SquaredCircle β 1432172 β
β insanepeoplefacebook β 1290686 β
β borderlands3 β 1274462 β
β FreeKarma4U β 1217769 β
β YangForPresidentHQ β 1186918 β
β FortniteCompetitive β 1184508 β
β AskMen β 1180820 β
β EpicSeven β 1172061 β
β MurderedByWords β 1112476 β
β politics β 1084087 β
β barstoolsports β 1068020 β
β BattlefieldV β 1053878 β
ββββββββββββββββββββββββ΄βββββββββββ
50 rows in set. Elapsed: 10.680 sec. Processed 29.38 billion rows, 198.67 GB (2.75 billion rows/s., 18.60 GB/s.)
- One more query: let's compare ClickHouse mentions to other technologies like Snowflake and Postgres. This query is a big one because it has to search all 14.69 billion comments three times for a substring, but the performance is actually quite impressive. (Unfortunately ClickHouse users are not very active on Reddit yet):
SELECT
toStartOfQuarter(created_utc) AS quarter,
sum(if(positionCaseInsensitive(body, 'clickhouse') > 0, 1, 0)) AS clickhouse,
sum(if(positionCaseInsensitive(body, 'snowflake') > 0, 1, 0)) AS snowflake,
sum(if(positionCaseInsensitive(body, 'postgres') > 0, 1, 0)) AS postgres
FROM reddit
GROUP BY quarter
ORDER BY quarter ASC;
βββββquarterββ¬βclickhouseββ¬βsnowflakeββ¬βpostgresββ
β 2005-10-01 β 0 β 0 β 0 β
β 2006-01-01 β 0 β 2 β 23 β
β 2006-04-01 β 0 β 2 β 24 β
β 2006-07-01 β 0 β 4 β 13 β
β 2006-10-01 β 0 β 23 β 73 β
β 2007-01-01 β 0 β 14 β 91 β
β 2007-04-01 β 0 β 10 β 59 β
β 2007-07-01 β 0 β 39 β 116 β
β 2007-10-01 β 0 β 45 β 125 β
β 2008-01-01 β 0 β 53 β 234 β
β 2008-04-01 β 0 β 79 β 303 β
β 2008-07-01 β 0 β 102 β 174 β
β 2008-10-01 β 0 β 156 β 323 β
β 2009-01-01 β 0 β 206 β 208 β
β 2009-04-01 β 0 β 178 β 417 β
β 2009-07-01 β 0 β 300 β 295 β
β 2009-10-01 β 0 β 633 β 589 β
β 2010-01-01 β 0 β 555 β 501 β
β 2010-04-01 β 0 β 587 β 469 β
β 2010-07-01 β 0 β 601 β 696 β
β 2010-10-01 β 0 β 1246 β 505 β
β 2011-01-01 β 0 β 758 β 247 β
β 2011-04-01 β 0 β 537 β 113 β
β 2011-07-01 β 0 β 173 β 64 β
β 2011-10-01 β 0 β 649 β 96 β
β 2012-01-01 β 0 β 4621 β 662 β
β 2012-04-01 β 0 β 5737 β 785 β
β 2012-07-01 β 0 β 6097 β 1127 β
β 2012-10-01 β 0 β 7986 β 600 β
β 2013-01-01 β 0 β 9704 β 839 β
β 2013-04-01 β 0 β 8161 β 853 β
β 2013-07-01 β 0 β 9704 β 1028 β
β 2013-10-01 β 0 β 12879 β 1404 β
β 2014-01-01 β 0 β 12317 β 1548 β
β 2014-04-01 β 0 β 13181 β 1577 β
β 2014-07-01 β 0 β 15640 β 1710 β
β 2014-10-01 β 0 β 19479 β 1959 β
β 2015-01-01 β 0 β 20411 β 2104 β
β 2015-04-01 β 1 β 20309 β 9112 β
β 2015-07-01 β 0 β 20325 β 4771 β
β 2015-10-01 β 0 β 25087 β 3030 β
β 2016-01-01 β 0 β 23462 β 3126 β
β 2016-04-01 β 3 β 25496 β 2757 β
β 2016-07-01 β 4 β 28233 β 2928 β
β 2016-10-01 β 2 β 45445 β 2449 β
β 2017-01-01 β 9 β 76019 β 2808 β
β 2017-04-01 β 9 β 67919 β 2803 β
β 2017-07-01 β 13 β 68974 β 2771 β
β 2017-10-01 β 12 β 69730 β 2906 β
β 2018-01-01 β 17 β 67476 β 3152 β
β 2018-04-01 β 3 β 67139 β 3986 β
β 2018-07-01 β 14 β 67979 β 3609 β
β 2018-10-01 β 28 β 74147 β 3850 β
β 2019-01-01 β 14 β 80250 β 4305 β
β 2019-04-01 β 30 β 70307 β 3872 β
β 2019-07-01 β 33 β 77149 β 4164 β
β 2019-10-01 β 22 β 113011 β 4369 β
β 2020-01-01 β 34 β 238273 β 5133 β
β 2020-04-01 β 52 β 454467 β 6100 β
β 2020-07-01 β 37 β 406623 β 5507 β
β 2020-10-01 β 49 β 212143 β 5385 β
β 2021-01-01 β 56 β 151262 β 5749 β
β 2021-04-01 β 71 β 119928 β 6039 β
β 2021-07-01 β 53 β 110342 β 5765 β
β 2021-10-01 β 92 β 121144 β 6401 β
β 2022-01-01 β 93 β 107512 β 6772 β
β 2022-04-01 β 120 β 91560 β 6687 β
β 2022-07-01 β 183 β 99764 β 7377 β
β 2022-10-01 β 123 β 99447 β 7052 β
β 2023-01-01 β 126 β 58733 β 4891 β
ββββββββββββββ΄βββββββββββββ΄ββββββββββββ΄βββββββββββ
70 rows in set. Elapsed: 325.835 sec. Processed 14.69 billion rows, 2.57 TB (45.08 million rows/s., 7.87 GB/s.)