Scraping Dcard with cloudscraper
This post goes over the steps for scraping data from Dcard at regular intervals and saving the data to a SQL database.
- Intro
- Installing dependencies
- Testing the Dcard API
- Parsing the JSON response
- Creating DataFrame from JSON
- Getting forum names
- Persisting data to SQL
- Testing the logger
- Testing the scheduler
- Putting everything together
- Recap
Dcard is a popular social networking platform in Taiwan, and as such offers great resources for text mining and NLP model building. Our goal in this post is to scrape data from Dcard at regular intervals and persist it to a SQL database without duplicating records. We'll be leveraging the Dcard API v2 as well as the following libraries, which are not included in Python's standard library:
- cloudscraper: for bypassing Cloudflare's anti-bot page
- pandas: for organizing the scraped data into a tabular format, which can then be easily saved as a SQL table
- schedule: for scheduling tasks
Since pandas is preinstalled on Colab, we only need to install cloudscraper and schedule.
!pip install cloudscraper
!pip install schedule
#!pip install pandas  # uncomment this if pandas is not installed in your environment
Let's check out the Python version on Colab.
!python --version
Let's first test the Dcard API v2 with cloudscraper, the syntax of which is much like that of requests. The only difference is that we'll have to first create a scraper instance with cloudscraper.create_scraper(). For each HTTP request, we'll get a batch of 30 posts. The forum variable in the URL is the English name of a forum, and we'll first test the one named stock.
import cloudscraper
forum = "stock"
URL = f"https://www.dcard.tw/service/api/v2/forums/{forum}/posts"
scraper = cloudscraper.create_scraper()
batch = scraper.get(URL).json()
len(batch)
Here's what one single post looks like.
import pprint
pprint.pprint(batch[0])
Then we'll parse the JSON response to get the data we're interested in, including title, createdAt, categories, excerpt, and topics.
cols = ['title', 'createdAt', 'categories', 'excerpt', 'topics']
title = [item.get(cols[0]) for item in batch]
createdAt = [item.get(cols[1]) for item in batch]
categories = [item.get(cols[2]) for item in batch]
excerpt = [item.get(cols[3]) for item in batch]
topics = [item.get(cols[4]) for item in batch]
For instance, the topics column contains a list of topic terms for each post, but the list may be empty.
topics
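Since these lists will be flattened with str.join() in the next step, it's worth noting that joining an empty list simply yields an empty string, so empty topics won't cause trouble. A quick check (the sample terms below are made up for illustration):
# " | ".join() turns a list of topic terms into a single string;
# an empty list becomes an empty string
print(" | ".join(["topic_a", "topic_b"]))  # topic_a | topic_b
print(repr(" | ".join([])))                # ''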
Now let's define a function called parse_batch() that takes the JSON response as input and returns a DataFrame instance.
import pandas as pd
def parse_batch(batch):
    createdAt = [item.get('createdAt', 'None') for item in batch]
    title = [item.get('title', 'None') for item in batch]
    excerpt = [item.get('excerpt', 'None') for item in batch]
    dummy = []
    categories = [item.get('categories', dummy) for item in batch]  # every element is a list
    topics = [item.get('topics', dummy) for item in batch]  # every element is a list
    data = {
        'createdAt': createdAt,
        'title': title,
        'excerpt': excerpt,
        'categories': categories,
        'topics': topics,
    }
    df = pd.DataFrame(data)
    df.loc[:, 'categories'] = df['categories'].apply(lambda x: " | ".join(x))
    df.loc[:, 'topics'] = df['topics'].apply(lambda x: " | ".join(x))
    return df
Here are the first five rows of our scraped data.
stock = parse_batch(batch)
stock.head()
As of Sep 11, 2021, there were 527 forums in total on Dcard.
import cloudscraper
URL = "https://www.dcard.tw/service/api/v2/forums"
scraper = cloudscraper.create_scraper()
result = scraper.get(URL).json()
len(result)
For each forum, we can get its English name, Chinese name, and the number of users who subscribe to it, as shown in the following dataframe.
import pandas as pd
alias = [item.get('alias') for item in result]
name = [item.get('name') for item in result]
subscriptionCount = [item.get('subscriptionCount') for item in result]
df = pd.DataFrame({"name": name, "alias": alias, "subs": subscriptionCount})
df
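As a side note, if you already know a forum's Chinese name, you can look up its English alias directly from this dataframe. A minimal sketch (assuming the name appears in df exactly as written):
# Look up the English alias of the 股票 (stock) forum;
# returns an empty result if no forum has that exact name
df.loc[df["name"] == "股票", "alias"]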
Let's just focus on the top 20 forums in terms of subscriptions.
df.sort_values(by=['subs'], ascending=False, inplace=True)
top20 = df.head(20)
top20
To get a better visual representation, let's plot out top20 with plotly, which has better support for Chinese characters than matplotlib.
import plotly.express as px
fig = px.bar(
    top20,  # df object
    x="name",
    y="subs",
    color="subs",
    title="Dcard各版訂閱數",
    barmode="group",
    height=300,
)
fig.show()
We'll use the sqlite3 module to interact with a SQL database. First, the sqlite3.connect() function creates and then connects to a .db file, which we name Dcard.db. The next important thing to do is to create a table in the database. The create_table variable contains the SQL syntax for creating a table named Posts with five columns: createdAt, title, excerpt, categories, and topics. Crucially, we make the createdAt column the primary key so that a new post with the same timestamp as an existing row will be ignored. The assumption here is that posts with the same timestamp are duplicates, though this might not always be the case. But in the absence of information like post IDs, we'll just make do with timestamps.
import sqlite3
conn = sqlite3.connect('Dcard.db')
cursor = conn.cursor()
create_table = """
CREATE TABLE IF NOT EXISTS Posts (
    createdAt TIMESTAMP PRIMARY KEY ON CONFLICT IGNORE,
    title,
    excerpt,
    categories,
    topics);
"""
cursor.execute(create_table)
conn.commit()
Then we save the stock dataframe to the table we just created.
stock.to_sql('Posts', conn, if_exists='append', index=False)
conn.commit()
To make sure the data is properly saved, let's load back the dataframe from the database.
new_stock = pd.read_sql("SELECT * FROM Posts;", conn)
new_stock
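We can also verify that the ON CONFLICT IGNORE clause is doing its job: appending the same stock dataframe a second time should leave the row count unchanged, since every createdAt value already exists in the table. A quick sketch:
# Re-inserting the same batch should be a no-op thanks to ON CONFLICT IGNORE
before = pd.read_sql("SELECT COUNT(*) AS n FROM Posts;", conn).iloc[0, 0]
stock.to_sql('Posts', conn, if_exists='append', index=False)
conn.commit()
after = pd.read_sql("SELECT COUNT(*) AS n FROM Posts;", conn).iloc[0, 0]
print(before, after)  # the two counts should be identical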
We'll use the logging module to create a log file named logging.txt, which can be configured with the logging.basicConfig() function. I'd like the logging format to be [{timestamp}] {logging level} | {logging message}, so the value of the format argument is [%(asctime)s] %(levelname)s | %(message)s. In addition, the format of the timestamp can be set with the datefmt argument.
import logging
logging.basicConfig(
    filename='logging.txt',
    filemode="a",
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
Let's test out three types of logs and check out the logging file.
logging.info("This is an info.")
logging.error("This an error!")
logging.warning("This is a warning!")
!head logging.txt
We'll use the schedule library to activate our Dcard scraper at regular intervals. As a test of the scheduling function, the scheduler.py script simply logs the current time to logging.txt every three seconds. The first step in scheduling a job is to define a function, which is named job() in this case. The job can then be put on schedule by calling schedule.every({num}).{unit}.do({job}), where {num} is an integer, {unit} is a unit of time such as seconds, minutes, or hours, and {job} is the function scheduled to run. Finally, if we call schedule.run_pending() within a while loop, the program will run indefinitely. Run the following cell to create scheduler.py.
%%writefile scheduler.py
import schedule
import time
from datetime import datetime
import logging
logging.basicConfig(
    filename='logging.txt',
    filemode="a",
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

def job():
    now = datetime.now()
    message = f"Hello, the current time is {now}."
    logging.info(message)

schedule.every(3).seconds.do(job)
schedule.run_all()  # without this line, the job will start in 3 seconds rather than immediately
while True:
    schedule.run_pending()
    time.sleep(1)
Now run python scheduler.py in the terminal to test the scheduler, which will keep running unless stopped! If you run it for a while and then stop it, logging.txt will look something like this:
!tail logging.txt
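If your environment doesn't offer a separate terminal (e.g., a stock Colab runtime), one workaround is to launch the script in the background directly from a notebook cell before tailing the log. This is just a sketch, not part of the original workflow:
# Start the scheduler in the background, detached from the cell;
# stop it later with: !pkill -f scheduler.py
!nohup python scheduler.py > /dev/null 2>&1 &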
Now that we've covered all the components we need, let's remove logging.txt and Dcard.db to start afresh. To do that, just run rm logging.txt Dcard.db in the terminal.
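If you'd rather stay in the notebook, the equivalent cell would be something like the following (the -f flag simply suppresses errors if the files don't exist yet):
# Delete the log file and the database to start from a clean slate
!rm -f logging.txt Dcard.db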
Finally, it's time to put everything together. Run the following cell to create Dcard_scraper.py. The only thing new here is that this time around we are going to scrape multiple forums rather than just one. So we first create a dictionary called forums, where the keys are forum names in English and the values are their Chinese equivalents. We'll need the English forum names to build the full API URLs. Plus, we add two more columns to the Posts table of Dcard.db (i.e. forum_en and forum_zh) to store the forum names. The main() function takes care of iterating over every forum stored in the forums variable as well as some basic exception handling.
%%writefile Dcard_scraper.py
import cloudscraper
import logging
import pandas as pd
from random import randint
import schedule
import sqlite3
import time
# Configuring the logging.txt file
logging.basicConfig(
    filename='logging.txt',
    filemode="a",
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
# Dcard API base URL
baseURL = "https://www.dcard.tw/service/api/v2/forums/"
# List of forums. Add as many as you want. Here I'm just picking 18 forums.
forums = {
    "dressup": "穿搭",
    "relationship": "感情",
    "makeup": "美妝",
    "food": "美食",
    "horoscopes": "星座",
    "talk": "閒聊",
    "trending": "時事",
    "money": "理財",
    "funny": "有趣",
    "girl": "女孩",
    "netflix": "Netflix",
    "youtuber": "YouTuber",
    "mood": "心情",
    "pet": "寵物",
    "weight_loss": "減肥",
    "fitness": "健身",
    "stock": "股票",
    "job": "工作",
}
# Creating a SQLite database and a table named Posts
conn = sqlite3.connect('Dcard.db')
cursor = conn.cursor()
create_table = """
CREATE TABLE IF NOT EXISTS Posts (
    createdAt TIMESTAMP PRIMARY KEY ON CONFLICT IGNORE,
    title,
    excerpt,
    categories,
    topics,
    forum_en,
    forum_zh);
"""
cursor.execute(create_table)
conn.commit()
# Parsing a batch of JSON response and creating a dataframe out of it
def parse_batch(batch):
    createdAt = [item.get('createdAt', 'None') for item in batch]
    title = [item.get('title', 'None') for item in batch]
    excerpt = [item.get('excerpt', 'None') for item in batch]
    dummy = []
    categories = [item.get('categories', dummy) for item in batch]  # every element is a list
    topics = [item.get('topics', dummy) for item in batch]  # every element is a list
    data = {
        'createdAt': createdAt,
        'title': title,
        'excerpt': excerpt,
        'categories': categories,
        'topics': topics,
    }
    df = pd.DataFrame(data)
    df.loc[:, 'categories'] = df['categories'].apply(lambda x: " | ".join(x))
    df.loc[:, 'topics'] = df['topics'].apply(lambda x: " | ".join(x))
    return df
# Main scraper
def main():
    scraper = cloudscraper.create_scraper()
    sec = randint(1, 15)
    for forum_en, forum_zh in forums.items():
        result = scraper.get(baseURL + forum_en + "/posts")
        if result.status_code == 200:
            batch = result.json()
            try:
                df = parse_batch(batch)
                df["forum_en"] = forum_en
                df["forum_zh"] = forum_zh
                logging.info(f"{df.shape[0]} posts on {forum_en} have been scraped.")
                df.to_sql("Posts", conn, if_exists="append", index=False)
                conn.commit()
                cursor.execute("SELECT COUNT(*) from Posts;")
                rows = cursor.fetchone()[0]
                logging.info(f"There are in total {rows} posts in the DB.")
            except Exception as argument:
                logging.error(argument)
        else:
            logging.error(f"The request on {forum_en} was unsuccessful.")
        time.sleep(sec)
# Setting the scraping interval
schedule.every(30).minutes.do(main)
schedule.run_all()  # run the job once immediately instead of waiting 30 minutes
while True:
    schedule.run_pending()
    time.sleep(1)
Now it's harvest time! Run python Dcard_scraper.py in the terminal to start the scraper, which will run every 30 minutes unless stopped. If everything goes well, the logging.txt file will look like this:
!tail logging.txt
And here's the result of our hard work! In my case, I ran the scraper for around 3 minutes and got 540 posts.
conn = sqlite3.connect('Dcard.db')
data = pd.read_sql("SELECT * FROM Posts;", conn)
data
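Beyond eyeballing the full table, a quick per-forum breakdown can confirm that each forum contributed roughly one batch of 30 posts. A sketch using a GROUP BY query:
# Count how many posts were collected from each forum
pd.read_sql(
    "SELECT forum_zh, COUNT(*) AS num_posts FROM Posts GROUP BY forum_zh ORDER BY num_posts DESC;",
    conn,
)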
In this post, we used cloudscraper to scrape data from Dcard and schedule to regularly run the scraper. Both are powerful and elegant libraries that can be applied to any other scraping project. As a side note, I was able to run the Dcard scraper for several days in a row without running into any errors!