Intro

Dcard is a popular social networking platform in Taiwan and, as such, offers great material for text mining and NLP model building. Our goal in this post is to scrape data from Dcard at regular intervals and persist it to a SQL database without duplicating records. We'll be leveraging the Dcard API v2 as well as the following libraries, which are not included in Python's standard library:

  • cloudscraper: for bypassing Cloudflare's anti-bot page
  • pandas: for organizing the scraped data into a tabular format, which can then be easily saved as a SQL table
  • schedule: for scheduling tasks

Installing dependencies

Since pandas is preinstalled on Colab, we only need to install cloudscraper and schedule.

!pip install cloudscraper
!pip install schedule
#!pip install pandas # uncomment this if pandas is not installed in your environment

Collecting cloudscraper
  Downloading cloudscraper-1.2.58-py2.py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 4.5 MB/s 
Requirement already satisfied: requests>=2.9.2 in /usr/local/lib/python3.7/dist-packages (from cloudscraper) (2.23.0)
Requirement already satisfied: pyparsing>=2.4.7 in /usr/local/lib/python3.7/dist-packages (from cloudscraper) (2.4.7)
Collecting requests-toolbelt>=0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 2.4 MB/s 
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.9.2->cloudscraper) (2021.5.30)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.9.2->cloudscraper) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.9.2->cloudscraper) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.9.2->cloudscraper) (1.24.3)
Installing collected packages: requests-toolbelt, cloudscraper
Successfully installed cloudscraper-1.2.58 requests-toolbelt-0.9.1
Collecting schedule
  Downloading schedule-1.1.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: schedule
Successfully installed schedule-1.1.0

Let's check out the Python version on Colab.

!python --version
Python 3.7.11

Testing the Dcard API

Let's first test the Dcard API v2 with cloudscraper, whose syntax is much like that of requests. The only difference is that we first have to create a scraper instance with cloudscraper.create_scraper(). Each HTTP request returns a batch of 30 posts. The forum variable in the URL is the English name of a forum, and we'll start by testing the one named stock.

import cloudscraper

forum = "stock"
URL = f"https://www.dcard.tw/service/api/v2/forums/{forum}/posts"
scraper = cloudscraper.create_scraper() 
batch = scraper.get(URL).json()
len(batch)
30

Here's what a single post looks like.

import pprint
pprint.pprint(batch[0])

{'activityAvatar': '',
 'anonymousDepartment': True,
 'anonymousSchool': False,
 'categories': ['請益'],
 'commentCount': 0,
 'createdAt': '2021-09-12T14:22:22.276Z',
 'customStyle': None,
 'elapsedTime': 515,
 'excerpt': '如題,我知道官股存股首推兆豐金,配息又配股,可是會配股的股票代表股本要很大,且每年獲利如果沒跟上的話,EPS會掉,股價也會跟著掉,官股獲利跟十年前比有如一攤死水,小弟我真的不曉得為什麼存股比起存民營金',
 'excerptComments': [],
 'forumAlias': 'stock',
 'forumId': '2fb88b62-aa28-4b18-af51-dda08dd037a9',
 'forumName': '股票',
 'gender': 'M',
 'hidden': False,
 'id': 236953704,
 'isModerator': False,
 'isSuspiciousAccount': False,
 'layout': 'classic',
 'likeCount': 0,
 'media': [],
 'mediaMeta': [],
 'memberType': '',
 'meta': {'layout': 'classic'},
 'nsfw': False,
 'pinned': False,
 'postAvatar': '',
 'reactions': [],
 'replyId': None,
 'replyTitle': None,
 'reportReason': '',
 'reportReasonText': '',
 'school': '弘光科技大學',
 'spoilerAlert': False,
 'tags': [],
 'title': '#請益 #請益 金融股存股疑問 官股 民股',
 'topics': ['請益', '金融', '投資', '官股', '民營金控'],
 'totalCommentCount': 0,
 'updatedAt': '2021-09-12T14:22:22.276Z',
 'verifiedBadge': False,
 'withImages': False,
 'withNickname': False,
 'withVideos': False}

Parsing the JSON response

Then we'll parse the JSON response to get the data we're interested in, including title, createdAt, categories, excerpt, and topics.

cols = ['title', 'createdAt', 'categories', 'excerpt', 'topics']
title = [item.get(cols[0]) for item in batch]
createdAt = [item.get(cols[1]) for item in batch]
categories = [item.get(cols[2]) for item in batch]
excerpt = [item.get(cols[3]) for item in batch]
topics = [item.get(cols[4]) for item in batch]

For instance, the topics column contains a list of topic terms for each post, but the list may be empty.

topics

[['請益', '金融', '投資', '官股', '民營金控'],
 ['分析', '台股', '當沖', '波段', '技術分析'],
 ['投資', '股票', '理財', '台股', '股市'],
 ['請益'],
 ['兇', '韭菜'],
 ['投資', '股票', '美股', 'ETF', '新聞'],
 ['投資', '股票', '美股', '股市', '新手'],
 ['股票', '投資', '理財', '台股', '當沖'],
 ['投資', '股票', '理財', '台股', '生活'],
 ['股票', '美股', '技術分析', '狼王'],
 ['請益', '新聞', '影響', '投資', '股票'],
 ['投資', '股票', '理財', '台股', '股市'],
 ['時事', '分享', '股票', '理財', '新聞'],
 ['老師', '直播', '筆記', '股票', '投資'],
 ['海外', '券商', '法律'],
 ['分享', '金融', '投資', '股票', '理財'],
 ['投資', 'app'],
 ['股票'],
 ['股票', '分享'],
 ['分享', '股票', '投資', '當沖', '分析'],
 ['股息', '分享', '股利', '股票', '投資'],
 ['投資', '理財', '台股', '股票', '股市'],
 ['股票', '台股'],
 ['投資', '股票', '理財', '台股', '股市'],
 ['股市', '當沖', '波段', '大盤'],
 ['etf', '投資'],
 ['美股', '股票', '技術分析', '狼王'],
 ['美股', '股票', '理財', 'NVIDIA', '投資'],
 ['股票', '投資'],
 ['請益', '券商', '股票', '投資', '台股']]
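
Note that when a post has no topics at all, joining the empty list simply yields an empty string instead of raising an error, which is why the parse_batch() function defined below can apply " | ".join() to every row without special-casing. A minimal check:

print(" | ".join(['請益', '金融', '投資']))  # 請益 | 金融 | 投資
print(repr(" | ".join([])))                 # '' (an empty list joins to an empty string)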

Creating DataFrame from JSON

Now let's define a function called parse_batch() that takes the JSON response as input and returns a DataFrame instance.

import pandas as pd

def parse_batch(batch):
    createdAt = [item.get('createdAt', 'None') for item in batch]
    title = [item.get('title', 'None') for item in batch]
    excerpt = [item.get('excerpt', 'None') for item in batch]
    dummy = []
    categories = [item.get('categories', dummy) for item in batch] # every element is a list
    topics = [item.get('topics', dummy) for item in batch] # every element is a list
    data = {
        'createdAt': createdAt,
        'title': title,
        'excerpt': excerpt,
        'categories': categories,
        'topics': topics,    
        }
    df = pd.DataFrame(data)
    df.loc[:, 'categories'] = df['categories'].apply(lambda x: " | ".join(x))
    df.loc[:, 'topics'] = df['topics'].apply(lambda x: " | ".join(x))
    return df

Here are the first five rows of our scraped data.

stock = parse_batch(batch)
stock.head()
createdAt title excerpt categories topics
0 2021-09-12T14:22:22.276Z #請益 #請益 金融股存股疑問 官股 民股 如題,我知道官股存股首推兆豐金,配息又配股,可是會配股的股票代表股本要很大,且每年獲利如果沒... 請益 請益 | 金融 | 投資 | 官股 | 民營金控
1 2021-09-12T13:59:36.831Z #分享 9/12隔日當沖+波段分析 **無推薦跟單之意**,**純個人操作分享**,**損益自負**,本人當沖熱愛Tick流玩法... 分享 分析 | 台股 | 當沖 | 波段 | 技術分析
2 2021-09-12T13:02:01.859Z #分享 09/12類股分享-技術、籌碼分析 以下為個人技術及籌碼面分析,僅供參考,進出場請依照個人看法做決定。每日會有一篇更詳細的類股分... 分享 投資 | 股票 | 理財 | 台股 | 股市
3 2021-09-12T11:55:08.194Z #請益 長榮成本17元。11年前買的 請問一下。長榮海運股票10張。成本17元…11年前買的,,,忘記自己有 這檔股票。何時該出場賣掉? 請益 請益
4 2021-09-12T11:25:08.604Z #其他 簡訊越來越兇了啦 現在報明牌的簡訊越來越兇了~大家有發現嗎?啊每天那麼多封簡訊~一下ㄟ咪~一下candy,都不... 其他 兇 | 韭菜

Getting forum names

As of Sep 11, 2021, there are in total 527 forums on Dcard.

import cloudscraper

URL = "https://www.dcard.tw/service/api/v2/forums" 
scraper = cloudscraper.create_scraper() 
result = scraper.get(URL).json()
len(result)
527

For each forum, we can get its English name, Chinese name, and the number of users who subscribe to it, as shown in the following dataframe.

import pandas as pd

alias = [item.get('alias') for item in result]
name = [item.get('name') for item in result]
subscriptionCount = [item.get('subscriptionCount') for item in result]
df = pd.DataFrame({"name": name, "alias": alias, "subs": subscriptionCount})
df
name alias subs
0 午夜實驗室 midnightlab 1711
1 時光膠囊 timecapsule 4284
2 母親節 mother 373
3 聖誕CiaoCiao merryxmas 16807
4 父親節 father 363
... ... ... ...
522 スポーツ jp_sport 110
523 ミーム jp_meme 48
524 MAMAMOO mamamoo 4316
525 無性戀 asexuality 769
526 學士後 post_bachelor 592

527 rows × 3 columns
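
As a side note, if you need the alias of one specific forum, say the 股票 (stock) forum we tested above, a simple boolean filter on the dataframe will return it:

# Look up the row for the 股票 forum; its alias column should read "stock".
df[df['name'] == '股票']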

Let's just focus on the top 20 forums in terms of subscriptions.

df.sort_values(by=['subs'], ascending=False, inplace=True)
top20 = df.head(20)
top20
name alias subs
373 西斯 sex 639112
224 穿搭 dressup 586341
228 感情 relationship 583232
217 美妝 makeup 487542
233 梗圖 meme 476599
273 美食 food 413792
230 閒聊 talk 375398
270 星座 horoscopes 364226
346 時事 trending 358119
340 理財 money 323464
231 有趣 funny 295991
287 Netflix netflix 295088
234 女孩 girl 289326
212 YouTuber youtuber 283766
229 心情 mood 281588
328 減肥 weight_loss 255195
261 寵物 pet 248843
447 股票 stock 239823
327 健身 fitness 239310
347 工作 job 231564

To get a better visual representation, let's plot out top20 with plotly, which has better support for Chinese characters than matplotlib.

import plotly.express as px

fig = px.bar(
            top20, # df object
            x="name", 
            y="subs",
            color="subs",
            title="Dcard各版訂閱數",
            barmode="group",
            height=300,
            )
fig.show()

Persisting data to SQL

We'll use the sqlite3 module to interact with a SQL database. First, the sqlite3.connect() function creates and connects to a .db file, which we name Dcard.db. The next important step is to create a table in the database. The create_table variable contains the SQL statement for creating a table named Posts with five columns: createdAt, title, excerpt, categories, and topics. Crucially, we make the createdAt column the primary key with ON CONFLICT IGNORE, so that posts with an already-seen primary key are silently skipped. The assumption here is that posts with the same timestamp are duplicates, though this might not always be the case. But since we're not storing post IDs in our table, we'll just make do with timestamps.

import sqlite3

conn = sqlite3.connect('Dcard.db')  
cursor = conn.cursor()
create_table = """
CREATE TABLE IF NOT EXISTS Posts (
    createdAt TIMESTAMP PRIMARY KEY ON CONFLICT IGNORE,
    title,
    excerpt, 
    categories, 
    topics);
"""
cursor.execute(create_table)  
conn.commit()

Then we save the stock dataframe to the table we just created.

stock.to_sql('Posts', conn, if_exists='append', index=False) 
conn.commit()
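
To see the ON CONFLICT IGNORE clause in action, we can try appending the exact same dataframe a second time. Since every createdAt value already exists in the table, the duplicates are silently skipped and the row count stays unchanged. This is just a quick sanity check, not part of the final pipeline:

stock.to_sql('Posts', conn, if_exists='append', index=False)  # every row conflicts with an existing primary key
conn.commit()
cursor.execute("SELECT COUNT(*) FROM Posts;")
cursor.fetchone()[0]  # should still be 30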

To make sure the data is properly saved, let's load back the dataframe from the database.

new_stock = pd.read_sql("SELECT * FROM Posts;", conn)
new_stock

createdAt title excerpt categories topics
0 2021-09-12T14:22:22.276Z #請益 #請益 金融股存股疑問 官股 民股 如題,我知道官股存股首推兆豐金,配息又配股,可是會配股的股票代表股本要很大,且每年獲利如果沒... 請益 請益 | 金融 | 投資 | 官股 | 民營金控
1 2021-09-12T13:59:36.831Z #分享 9/12隔日當沖+波段分析 **無推薦跟單之意**,**純個人操作分享**,**損益自負**,本人當沖熱愛Tick流玩法... 分享 分析 | 台股 | 當沖 | 波段 | 技術分析
2 2021-09-12T13:02:01.859Z #分享 09/12類股分享-技術、籌碼分析 以下為個人技術及籌碼面分析,僅供參考,進出場請依照個人看法做決定。每日會有一篇更詳細的類股分... 分享 投資 | 股票 | 理財 | 台股 | 股市
3 2021-09-12T11:55:08.194Z #請益 長榮成本17元。11年前買的 請問一下。長榮海運股票10張。成本17元…11年前買的,,,忘記自己有 這檔股票。何時該出場賣掉? 請益 請益
4 2021-09-12T11:25:08.604Z #其他 簡訊越來越兇了啦 現在報明牌的簡訊越來越兇了~大家有發現嗎?啊每天那麼多封簡訊~一下ㄟ咪~一下candy,都不... 其他 兇 | 韭菜
5 2021-09-12T10:23:31.827Z #分享 這週方舟機構ARK持股變化 分享這週(9/6 ~9/10)ARK持股變化,股票代碼-ARKQ 所屬ETF ARKQ,PA... 分享 投資 | 股票 | 美股 | ETF | 新聞
6 2021-09-12T10:10:13.080Z #分享 美股屑財報-本週財報與重點事件 .,IG :美股餅乾屑,(週報固定每週日晚上6點更新),.,本週財報真的是有些無聊,但!要發... 分享 投資 | 股票 | 美股 | 股市 | 新手
7 2021-09-12T09:55:47.721Z #分享 明日當沖觀盤重點 歡迎大家追蹤我一起學習哦! 分享 股票 | 投資 | 理財 | 台股 | 當沖
8 2021-09-12T09:47:05.404Z #標的 聯電以及智原個股分析 這禮拜最後一次寫聯電發現大家真的對聯電很有興趣🤣那我們廢話不多說,馬上開始吧~聯電(2303... 標的 投資 | 股票 | 理財 | 台股 | 生活
9 2021-09-12T08:03:31.864Z #分享 狼王9月11日周六特輯 粉絲個股投票時間以及那些可以 中線佈局的股票們 ROKU ADSK CHWY SAVA SA... 分享 股票 | 美股 | 技術分析 | 狼王
10 2021-09-12T05:46:00.967Z #請益 關於這新聞,各位怎麼看?真的會影響嗎? 像這種新聞,對於股市的影響程度會有影響嗎?股市常常起起伏伏,真的很怕被這種新聞給狙擊 請益 請益 | 新聞 | 影響 | 投資 | 股票
11 2021-09-12T05:26:03.419Z #分享 分享個股看法 2390 3450 本次解析一下云辰和聯鈞,歡迎下方留言處討論,有任何地方標示錯誤請指教。我是股海一滴水,我們下... 分享 投資 | 股票 | 理財 | 台股 | 股市
12 2021-09-12T04:44:13.512Z #分享 時事分享—以股分交換作為企業佈局手段 大鯨魚吃小蝦米的故事,在資本市場其實很常發生,尤其是在歐美市場很流行以併購的方式,來壯大公司... 分享 時事 | 分享 | 股票 | 理財 | 新聞
13 2021-09-12T03:57:18.993Z #分享 Ashin老師9/10直播筆記 分享 老師 | 直播 | 筆記 | 股票 | 投資
14 2021-09-12T03:33:35.076Z #請益 海外券商閒置太久會有什麼法律上的問題嗎 如題目 最近想開始存美股了 哪其實上網查到很多有用的資訊了 剩下這個問題 如果閒置太久會有法... 請益 海外 | 券商 | 法律
15 2021-09-12T02:57:04.070Z #分享 搭上轉型列車-金融數位化 金融轉型勢必為未來趨勢,金融數位化,也是現在台灣金融產業正在著手進行的事情,而開發金也不難看... 分享 分享 | 金融 | 投資 | 股票 | 理財
16 2021-09-12T02:38:43.280Z #請益 【請教】XQ全球贏家APP 新功能-贏家選股使用心得 最近在XQ全球贏家的FB粉專上,看到他們的手機APP有推出一個全新的功能—贏家選股,聽說有高... 請益 投資 | app
17 2021-09-12T02:00:14.107Z #其他 本周紀錄 前幾天發文沒開到卡稱,再發一次紀錄,這邊只留短線操作紀錄,長線是定期定額ETF,一詮,欣興小... 其他 股票
18 2021-09-11T18:23:49.445Z #分享 各位到底知道股票為什麼會漲嗎? 我資歷大約12年,從高中時就知道,一定要學股票,因為多數的基金也是靠買股票賺,那何不自己學起... 分享 股票 | 分享
19 2021-09-11T14:55:53.481Z #分享 當沖Tick流分享 這兩天很多人問我這個問題,這邊來分享一下當沖Tick玩法,單純分享自身經驗、操作模式,沒有任... 分享 分享 | 股票 | 投資 | 當沖 | 分析
20 2021-09-11T12:21:23.991Z #分享 近期高殖利率標的 下週一一張國揚(2505)可以領1.5,下週二一張聲寶(1604)可以領2.5,目前這兩檔都... 分享 股息 | 分享 | 股利 | 股票 | 投資
21 2021-09-11T12:17:40.308Z #分享 編劇給我找出來 不囉嗦上圖,你們就看看麗珠,早在幾年前就已經看好航運了,你各位現在才追?附上人權啦,豪冷,2... 分享 投資 | 理財 | 台股 | 股票 | 股市
22 2021-09-11T04:46:19.407Z #分享 高伯精選股-2481多 單純看K棒來說 股價來打左點紅棒低點 停損小,現在進場大約目標就是110的位置 我自己停損大... 分享 股票 | 台股
23 2021-09-11T02:42:41.606Z #分享 東海彼得 - 9/10盤後分析 9/10 盤後分析,近期美股的走勢相對平穩,相較於一個月前的波動,可以說反差非常劇烈,最主要... 分享 投資 | 股票 | 理財 | 台股 | 股市
24 2021-09-11T01:38:21.750Z #分享 如我上週所判斷,那股市下週該如何.. 上週分享的..也被我說中了..,以下是上週分享的觀點,那本週的狀況..,有關注的就會發現大盤... 分享 股市 | 當沖 | 波段 | 大盤
25 2021-09-11T01:09:59.029Z #請益 定期定額買etf 想請教大家,我是股市新手,同時也是社會新鮮人,由於現在台股一直都在1w7左右,應該是台股最旺... 請益 etf | 投資
26 2021-09-11T01:04:15.684Z #分享 狼王9月10日美股復盤 震蕩的一周 貌似上週推演就提醒危險了~?今天我出手買貨了哦~ MA TSLA AAPL NF... 分享 美股 | 股票 | 技術分析 | 狼王
27 2021-09-10T16:19:52.902Z #標的 美股 Nvidia 輝達($NVDA) 看多 今天要來談的標的是 $NVDA。如果你有在打遊戲,就一定會知道他們的超強顯卡。他們的顯卡不只... 標的 美股 | 股票 | 理財 | NVIDIA | 投資
28 2021-09-10T15:55:10.143Z #請益 請問這是600萬賠到300萬嗎 很謝謝大家的回覆但原本的資訊好像不太清楚,這是完整內容再麻煩大家解惑了謝謝,————————... 請益 股票 | 投資
29 2021-09-10T15:51:03.972Z #請益 #請益 請問這是哪家券商的介面 請問這是哪家券商的下單系統,可以顯示百分比!,跪求各位大神‍️‍️‍️ 請益 請益 | 券商 | 股票 | 投資 | 台股

Testing the logger

We'll use the logging module to write logs to a file named logging.txt, configured via the logging.basicConfig() function. I'd like the logging format to be [{timestamp}] {logging level} | {logging message}, so the value of the format argument is [%(asctime)s] %(levelname)s | %(message)s. In addition, the format of the timestamp is controlled by the datefmt argument.

import logging

logging.basicConfig(
        filename='logging.txt',
        filemode="a",
        level=logging.INFO,
        format="[%(asctime)s] %(levelname)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

Let's test out three types of logs and check out the logging file.

logging.info("This is an info.")
logging.error("This is an error!")
logging.warning("This is a warning!")
!head logging.txt
[2021-09-12 08:31:31] INFO | This is an info.
[2021-09-12 08:31:31] ERROR | This is an error!
[2021-09-12 08:31:31] WARNING | This is a warning!
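
One thing to keep in mind is that messages below the configured level are dropped. Since we set level=logging.INFO, a debug message never makes it into logging.txt:

logging.debug("This debug message is filtered out and never written to the file.")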

Testing the scheduler

We'll use the schedule library to activate our Dcard scraper at regular intervals. As a test of the scheduling workflow, scheduler.py simply logs the current time to logging.txt every three seconds. The first step in scheduling a job is to define a function, named job() in this case. The job is then put on schedule by calling schedule.every({num}).{unit}.do({job}), where {num} is an integer, {unit} is a unit of time like seconds, minutes, or hours, and {job} is the function scheduled to run. Finally, calling schedule.run_pending() inside a while loop keeps the program running indefinitely. Run the following cell to create scheduler.py.

%%writefile scheduler.py

import schedule 
import time
from datetime import datetime
import logging

logging.basicConfig(
        filename='logging.txt',
        filemode="a",
        level=logging.INFO,
        format="[%(asctime)s] %(levelname)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

def job():
    now = datetime.now()
    message = f"Hello, the current time is {now}."
    logging.info(message)

schedule.every(3).seconds.do(job)
schedule.run_all() # Without this line, the job will start in 3 seconds rather than immediately.

while True:
    schedule.run_pending()
    time.sleep(1)

Now run python scheduler.py in the terminal to test the scheduler, which will keep running unless stopped! If you run it for a while and then stop it, logging.txt will look something like this:

!tail logging.txt
[2021-09-12 08:31:31] ERROR | This is an error!
[2021-09-12 08:31:31] WARNING | This is a warning!
[2021-09-12 08:37:40] INFO | Hello, the current time is 2021-09-12 08:37:40.643431.
[2021-09-12 08:37:43] INFO | Hello, the current time is 2021-09-12 08:37:43.647108.
[2021-09-12 08:37:46] INFO | Hello, the current time is 2021-09-12 08:37:46.651045.
[2021-09-12 08:37:49] INFO | Hello, the current time is 2021-09-12 08:37:49.655009.
[2021-09-12 08:37:52] INFO | Hello, the current time is 2021-09-12 08:37:52.658558.
[2021-09-12 08:37:55] INFO | Hello, the current time is 2021-09-12 08:37:55.662393.
[2021-09-12 08:37:58] INFO | Hello, the current time is 2021-09-12 08:37:58.666367.
[2021-09-12 08:38:01] INFO | Hello, the current time is 2021-09-12 08:38:01.669397.

Now that we've covered all the components we need, let's remove logging.txt and Dcard.db to start afresh and put everything together. To do that, just run rm logging.txt Dcard.db in the terminal.
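
If you'd rather do the cleanup from a notebook cell instead of the terminal, the same command works with a shell escape (the -f flag just keeps rm quiet if the files don't exist yet):

!rm -f logging.txt Dcard.db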

Putting everything together

Finally, it's time to put everything together. Run the following cell to create Dcard_scraper.py. The only new thing here is that this time around we're going to scrape multiple forums rather than just one. So we first create a dictionary called forums, where the keys are forum names in English and the values are their Chinese equivalents. We need the English forum names to build the full API URLs. Plus, we add two more columns to the Posts table of Dcard.db (i.e. forum_en and forum_zh) to store the forum names. The main() function takes care of iterating over every forum stored in the forums variable as well as some basic exception handling.

%%writefile Dcard_scraper.py

import cloudscraper
import logging
import pandas as pd
from random import randint
import schedule
import sqlite3
import time

# Configuring the logging.txt file
logging.basicConfig(
        filename='logging.txt',
        filemode="a",
        level=logging.INFO,
        format="[%(asctime)s] %(levelname)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

# Dcard API base URL
baseURL = "https://www.dcard.tw/service/api/v2/forums/"

# List of forums. Add as many as you want. Here I'm just picking 18 forums. 
forums = {
    "dressup": "穿搭",
    "relationship": "感情",
    "makeup": "美妝",
    "food": "美食",
    "horoscopes": "星座",
    "talk": "閒聊",
    "trending": "時事",
    "money": "理財",
    "funny": "有趣",
    "girl": "女孩",
    "netflix": "Netflix",
    "youtuber": "YouTuber",
    "mood": "心情",
    "pet": "寵物",
    "weight_loss": "減肥",
    "fitness": "健身",
    "stock": "股票",
    "job": "工作",
}

# Creating a SQLite database and a table named Posts
conn = sqlite3.connect('Dcard.db')  
cursor = conn.cursor()
create_table = """
CREATE TABLE IF NOT EXISTS Posts (
    createdAt TIMESTAMP PRIMARY KEY ON CONFLICT IGNORE,
    title,
    excerpt, 
    categories, 
    topics,
    forum_en,
    forum_zh);
"""
cursor.execute(create_table)  
conn.commit()

# Parsing a batch of JSON response and creating a dataframe out of it
def parse_batch(batch):
    createdAt = [item.get('createdAt', 'None') for item in batch]
    title = [item.get('title', 'None') for item in batch]
    excerpt = [item.get('excerpt', 'None') for item in batch]
    dummy = []
    categories = [item.get('categories', dummy) for item in batch] # every element is a list
    topics = [item.get('topics', dummy) for item in batch] # every element is a list
    data = {
        'createdAt': createdAt,
        'title': title,
        'excerpt': excerpt,
        'categories': categories,
        'topics': topics,    
        }
    df = pd.DataFrame(data)
    df.loc[:, 'categories'] = df['categories'].apply(lambda x: " | ".join(x))
    df.loc[:, 'topics'] = df['topics'].apply(lambda x: " | ".join(x))
    return df

# Main scraper
def main():
    scraper = cloudscraper.create_scraper()
    sec = randint(1, 15)  # random pause (in seconds) between requests, drawn once per run

    for forum_en, forum_zh in forums.items():
        result = scraper.get(baseURL + forum_en + "/posts")

        if result.status_code == 200:
            batch = result.json()
            try:
                df = parse_batch(batch)
                df["forum_en"] = forum_en
                df["forum_zh"] = forum_zh
                logging.info(f"{df.shape[0]} posts on {forum_en} have been scraped.")
                df.to_sql("Posts", conn, if_exists="append", index=False)
                conn.commit()
                cursor.execute(f"SELECT COUNT(*) from Posts;")
                rows = cursor.fetchone()[0]
                logging.info(f"There are in total {rows} posts in the DB.")
            except Exception as argument:
                logging.error(argument)
        else:
            logging.error(f"The request on {forum_en} was unsuccessful.")

        time.sleep(sec)

# Setting the scraping interval
schedule.every(30).minutes.do(main)  
schedule.run_all()

while True:
    schedule.run_pending()
    time.sleep(1)

Now it's harvest time! Run python Dcard_scraper.py in the terminal to start the scraper, which will run every 30 minutes unless stopped. If everything goes well, the logging.txt file will look like this:

!tail logging.txt
[2021-09-12 14:44:21] INFO | 30 posts on pet have been scraped.
[2021-09-12 14:44:21] INFO | There are in total 420 posts in the DB.
[2021-09-12 14:44:33] INFO | 30 posts on weight_loss have been scraped.
[2021-09-12 14:44:33] INFO | There are in total 450 posts in the DB.
[2021-09-12 14:44:46] INFO | 30 posts on fitness have been scraped.
[2021-09-12 14:44:46] INFO | There are in total 480 posts in the DB.
[2021-09-12 14:44:58] INFO | 30 posts on stock have been scraped.
[2021-09-12 14:44:58] INFO | There are in total 510 posts in the DB.
[2021-09-12 14:45:10] INFO | 30 posts on job have been scraped.
[2021-09-12 14:45:10] INFO | There are in total 540 posts in the DB.

And here's the result of our hard work! In my case, I ran the scraper for around 3 minutes and got 540 posts.

conn = sqlite3.connect('Dcard.db')  
data = pd.read_sql("SELECT * FROM Posts;", conn)
data

createdAt title excerpt categories topics forum_en forum_zh
0 2021-09-12T14:35:07.314Z 問air force真假 ️第一次發文,不知道發在穿搭版可不可以,排版不好請見諒若有違反規定會刪文,前陣子在蝦皮購買一... 問 | force | 真假 | 穿搭 | 蝦皮 dressup 穿搭
1 2021-09-12T14:18:12.598Z 疫情買的衣服分享🙌淘寶居多 我是女生!,本人156/45,衣服都蠻平價的~1⃣️,洋裝:淘寶,包:Toae,鞋子:淘寶,... 疫情 | 衣服 | 分享 dressup 穿搭
2 2021-09-12T13:56:35.559Z 我與室友的穿搭分享 趁颱風天沒事來分享我跟室友的穿搭~(沒戴口罩的是疫情前拍的呦),先分享室友的,1. 單車褲穿... 穿搭 | 女生穿搭 dressup 穿搭
3 2021-09-12T13:54:26.217Z #問 求包包的關鍵字 想請問俞丁背的這種包叫什麼名字,有點像送子鳥包,可是我在蝦皮都找不到類似的,或是有人在哪些網... 包包 | 關鍵字 dressup 穿搭
4 2021-09-12T13:36:51.407Z #問 北臉包包代購 小妹想買這個包包很久了,但北臉的包包是第一次購買,怕買到仿冒品,想請教各位版友,有推薦的賣家... 問 | 北臉 | 包包 | 真假 dressup 穿搭
... ... ... ... ... ... ... ...
535 2021-09-12T12:37:12.067Z 轉職 通勤or租屋請益 大家好,小妹預計10月初到新公司(林口)報到,家住新北汐止,目前煩惱要開車通勤(40-50分... 通勤 | 租屋 job 工作
536 2021-09-12T12:24:17.122Z 早八晚五工作 請問有什麼工作是早八晚五,(正職),但是放假不是見紅就休?而是排休的? 工作 | 工作經驗 job 工作
537 2021-09-12T12:20:05.422Z #問 會問生活體驗是希望得到什麼答案 如題,面試時公司給了一張基本資料表,第一個問我對“工作”(沒確切說是應徵職位還是工作本身)的... 工作 | 求職 job 工作
538 2021-09-12T12:11:54.507Z 工作幾年後還會想回學校讀書嗎? 以前老師常說要讀就一口氣讀不要中斷不然很難工作後再回來讀,我自己是工作2年後離職準備一年考上... 學校 | 讀書 | 工作 job 工作
539 2021-09-12T12:10:44.789Z #問 大學研究中心徵才 本人為私立應屆,最近在求職中,看到很多四大四中的研究中心之類的在徵才,也有投遞履歷約面試,職... 徵才 | 問 | 應屆畢業生 job 工作

540 rows × 7 columns
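
Now that everything lives in a single SQL table, simple aggregations are one query away. For instance, here's a quick follow-up query (not part of the scraper itself) that counts how many posts each forum has contributed so far:

counts = pd.read_sql(
    "SELECT forum_zh, COUNT(*) AS n_posts FROM Posts GROUP BY forum_zh ORDER BY n_posts DESC;",
    conn,
)
counts.head()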

Recap

In this post, we used cloudscraper to scrape data from Dcard and schedule to run the scraper at regular intervals. Both are powerful and elegant libraries that can be applied to any other scraping project. As a side note, I was able to run the Dcard scraper for several days in a row without hitting a single error!