Updated: Apr 1
⚠️ Disclaimer: This blog post represents my personal research and analysis conducted independently. The views, methods, and conclusions described herein are strictly my own and do not represent or imply any official position, policy, or endorsement from my affiliated institutions or organizations.
1 Research Background
I am conducting a study on Canadian immigration as part of my final project for the AIDI 1003 course, which is part of the Artificial Intelligence – Architecture, Design, and Implementation postgraduate certificate program at Georgian College. The core idea is to investigate whether users’ discussions about immigration-related issues on Reddit show any significant differences before and after a key policy change.
I consider May 31, 2023, as a watershed point and thus divide the timeline into:
Pre-policy: 2022-12-01 – 2023-05-31 (6 months)
Post-policy: 2023-06-01 – 2023-11-30 (6 months)
On that day, Immigration, Refugees and Citizenship Canada (IRCC) announced a major reform: category-based Express Entry selection, with invitation rounds targeting candidates in specific occupations or with French-language proficiency (Canada.ca announcement). This change might affect how people discuss the immigration process, occupational demand, and their attitudes toward future applications. Therefore, I aim to observe whether there is any noticeable change in discussion intensity or themes within Reddit’s immigration communities following the announcement.
2 Research Purpose
2.1 Exploratory Research: Emphasis on Machine Learning (ML) Practice
This project is intended as an exploratory study, focused primarily on applying the machine learning (ML) tools I learned during the course and testing their usefulness and limitations in large-scale text analysis. Rather than following a traditional academic approach with a comprehensive literature review, my goal is to experiment with techniques like unsupervised topic modeling (e.g., BERTopic) and sentiment analysis, which are especially useful for large, unstructured social media datasets, to see if they can reveal valuable early insights or trends from real Reddit data.
2.2 A Practice-Driven Approach
As part of this process, the study focuses on designing a custom data pipeline and experimenting with various ML tools to assess their effectiveness in handling large-scale text data and their potential to discover meaningful insights. In addition to simply running these algorithms, I plan to evaluate them through measures such as topic coherence, interpretability of keywords in topic modeling, and accuracy or recall for sentiment analysis. This ensures that I’m not just “running the models,” but actually gauging how well they perform and whether they produce actionable findings.
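To make this evaluation plan a little more concrete, below is a minimal, hedged sketch of how topic coherence could be computed with gensim’s CoherenceModel. The tokenized documents and topic word lists are illustrative placeholders, not outputs from the actual Reddit data; accuracy or recall for sentiment analysis would similarly be measured against a small hand-labelled sample.

```python
# A hedged sketch of one planned evaluation step: topic coherence via gensim.
# The documents and topic word lists below are illustrative placeholders only.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

tokenized_docs = [
    ["express", "entry", "draw", "score"],
    ["study", "permit", "extension", "college"],
    ["work", "permit", "pgwp", "application"],
]
topic_words = [
    ["express", "entry", "draw"],
    ["study", "permit", "extension"],
]

dictionary = Dictionary(tokenized_docs)
cm = CoherenceModel(
    topics=topic_words,
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence="c_v",
)
print(cm.get_coherence())  # higher scores generally indicate more interpretable topics
```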
Even if the outcomes do not align with my initial expectations, the process itself will serve as a valuable learning experience—enhancing my understanding of the practical limitations and real-world feasibility of ML techniques in research settings. By identifying concrete evaluation criteria for each model, I hope to strike a balance between hands-on experimentation and a more systematic assessment of how these tools can reveal insights about the Reddit discussions.
2.3 Relationship with More Academic Research
Despite its exploratory nature, this study still carries academic value — it can serve as a foundation for more systematic and comprehensive research in the future. At this stage, however, my focus remains on demonstrating the feasibility of the tools and methods themselves. For those engaged in academic research, this project may be seen as a small-scale “methodological experiment,” offering a glimpse into how ML techniques, when combined with human interpretation, could inspire ideas for future, more rigorous studies. Furthermore, this methodological experiment could later be expanded with labeled evaluation sets or integrated into a larger theoretical framework, allowing for more in-depth academic insights.
3 Subreddit Selection
This study focuses on two subreddits: r/ImmigrationCanada and r/CanadaImmigrant.
As of now, r/ImmigrationCanada has over 256K members (ranking in the top 1% of all subreddits), while r/CanadaImmigrant exceeds 11K members (top 7%). Both exhibit sufficient activity to provide enough data for analysis before and after the policy change.
These communities frequently discuss visa applications, work/study permit questions, and broader opinions on immigration policy. Their scope is relatively comprehensive, covering various immigration programs and diverse user backgrounds. Hence, they serve as valuable vantage points for capturing how people respond to federal-level immigration policy changes.
In contrast, other specialized or regional subreddits, such as r/ExpressEntry, tend to have narrower discussions. Such spaces may offer in-depth information on a single aspect (e.g., Express Entry draws) but may not reflect a more general, nationwide immigration discourse. Since I aim to observe macro-level trends across different applicant types, r/ImmigrationCanada and r/CanadaImmigrant are better suited to capturing diverse perspectives, from policy talk to practical visa concerns.
4 Obtaining Reddit Data
4.1 Initial Attempts and Limitations
4.1.1 Reddit API
Initially, I intended to use Reddit’s official API through PRAW, but I soon ran into two major obstacles.
First, the search API typically returns only about 1,000 of the most recent or relevant posts per query—far too few to cover several months of historical data from any given subreddit.
Second, strict rate limits and complex pagination made it difficult to handle large volumes of data. To retrieve daily or weekly posts spanning multiple months, one would need to carefully manage segmented queries and track post IDs, which quickly becomes cumbersome.
In practice, these constraints make the official Reddit API unsuitable for complete historical coverage of specific subreddits; I was ultimately able to retrieve only a limited subset of recent posts.
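For illustration, here is a minimal sketch of the listing cap I ran into. The credentials are placeholders, and the exact count varies, but it stays near the roughly 1,000-item ceiling no matter how far back the subreddit’s history goes.

```python
# A minimal sketch illustrating Reddit's listing cap via PRAW.
# The credentials below are placeholders for a registered Reddit app.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="immigration-study-demo",
)

# Even with limit=None, a listing stops after roughly 1,000 items,
# so months of subreddit history cannot be paged through this way.
posts = list(reddit.subreddit("ImmigrationCanada").new(limit=None))
print(len(posts))  # typically ~1,000 at most
```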
4.1.2 Keyword Searches
I also attempted to retrieve data with Python scripts that perform keyword-based searches through Reddit’s official API (via PRAW), but this method proved unreliable for systematically collecting large volumes of historical immigration-related posts.
First, keyword variations and synonyms pose a significant obstacle. For example, searching simply for “Canada immigration” or “visa” would certainly retrieve many posts—but would miss numerous alternative spellings (e.g., “IM Canada,” “enter Canada”) or synonymous expressions (e.g., “work permit,” “PR process”) frequently used in casual Reddit conversations. Capturing all relevant variations would require constructing an extensive keyword list, inevitably bringing in substantial irrelevant content and demanding additional spelling corrections.
Second, keyword searches typically produce substantial amounts of irrelevant noise. Searching “visa,” for instance, returns posts unrelated to immigration, such as discussions about stolen “VISA cards.” Moreover, if I also need strict coverage of a particular time period (e.g., all posts within a certain month), keyword search alone does not guarantee capturing all relevant content, as results are influenced by search ranking, relevance sorting, and the API's inherent limitations.
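For concreteness, a keyword search through PRAW looks roughly like the sketch below (again with placeholder credentials). Even when scoped to one subreddit and a time filter, the results depend on Reddit’s search index and ranking rather than guaranteeing month-by-month completeness.

```python
# A minimal sketch of the keyword-search attempt; credentials are placeholders.
# A broad query like "visa" also pulls in off-topic posts (e.g., VISA cards).
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="immigration-study-demo",
)

results = reddit.subreddit("ImmigrationCanada").search(
    "visa", sort="new", time_filter="year", limit=None
)
for submission in results:
    print(submission.created_utc, submission.title)
# Coverage still depends on Reddit's search index and ranking, so there is no
# guarantee that every relevant post from a given month is returned.
```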
Therefore, such an approach fails to guarantee comprehensive data capture due to unavoidable keyword omissions, informal language variations, and noise filtering challenges. To overcome these limitations, I adopted a more reliable method in subsequent steps—downloading monthly .zst data archives.
4.2 Using Monthly .zst Archives
4.2.1 Finding the Dump Post in r/pushshift: Why Choose .zst Files?
Due to the limitations described above, I ultimately turned to using monthly .zst archives, which provide Reddit’s complete historical data without the constraints imposed by the official API or keyword searches. Specifically, on r/pushshift, I found a post titled “Dump files from 2005-06 to 2024-12” by user Watchful1, which provides torrent files containing monthly Reddit submissions (RS) and comments (RC) from June 2005 to December 2024. These archives must be downloaded via a BitTorrent client.
Each .zst file is approximately 15–20GB compressed, expanding to over 50GB after decompression, thus requiring substantial disk space. Unlike keyword searches or API-based methods, these archives guarantee complete coverage of all posts across every subreddit during each month, effectively bypassing the 1,000-post API limit. The primary drawback is the need for patience and sufficient bandwidth during torrent downloads; however, this method remains the most reliable approach for obtaining comprehensive historical Reddit data.
4.2.2 Download and Filtering
To obtain all posts from the selected subreddits, I followed three main steps:
Step 1: Torrent Download (Selecting Required Months)
I installed a torrent client (qBittorrent) and loaded the torrent file provided in the r/pushshift post. I decided not to include RC (comments) files because, at this stage, I’m not analyzing the interplay between top-level posts and the subsequent discussion threads. Future expansions may consider comment dynamics.
Step 2: Stream Decompression and Subreddit Filtering (Python + zstandard)
Since directly decompressing these large files would consume excessive disk space and memory, I employed stream-based decompression using Python's zstandard library. My Python script (GitHub link) reads and decompresses each file incrementally (in ~1MB chunks), avoiding memory overload. Each line of the decompressed stream is a JSON-formatted Reddit submission, and the script filters these submissions by checking whether the subreddit field matches "immigrationcanada" or "canadaimmigrant" (lowercase). Matching submissions are immediately written into smaller monthly .jsonl files (e.g., filtered_YYYY-MM.jsonl), while irrelevant entries are skipped. This approach significantly reduces storage overhead and produces smaller files that are easier to load for subsequent analysis.
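The sketch below illustrates the general pattern of this step rather than the exact script linked above; it assumes the zstandard package is installed, and the dump filename is hypothetical.

```python
# A minimal sketch of the streaming filter step (not the exact script from the post).
import json
import zstandard

DUMP_PATH = "RS_2023-06.zst"            # hypothetical monthly submissions dump
OUT_PATH = "filtered_2023-06.jsonl"
TARGET_SUBS = {"immigrationcanada", "canadaimmigrant"}

with open(DUMP_PATH, "rb") as fh, open(OUT_PATH, "w", encoding="utf-8") as out:
    # The dumps use a long zstd window, so raise the decoder's window limit.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with dctx.stream_reader(fh) as reader:
        buffer = b""
        while True:
            chunk = reader.read(2**20)   # read ~1 MB at a time
            if not chunk:
                break
            buffer += chunk
            lines = buffer.split(b"\n")
            buffer = lines[-1]           # keep the trailing partial line for next round
            for raw in lines[:-1]:
                if not raw:
                    continue
                post = json.loads(raw)
                if post.get("subreddit", "").lower() in TARGET_SUBS:
                    out.write(raw.decode("utf-8") + "\n")  # keep matching submissions
```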
Step 3: JSONL to CSV Conversion (Python + csv)
To further simplify analysis and modeling, I converted the filtered monthly .jsonl files into .csv format. Another Python script (also to be shared soon on GitHub) reads each JSON-formatted submission line-by-line, extracts relevant fields (subreddit, title, selftext, and created_utc), and writes them directly into structured CSV files. CSV files are convenient for subsequent analyses using Excel, R, Python pandas, or various machine learning tools.
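A minimal sketch of this conversion step is shown below; the filenames are hypothetical, and the actual script may extract or clean fields differently.

```python
# A minimal sketch of converting a filtered .jsonl file to CSV.
import csv
import json

IN_PATH = "filtered_2023-06.jsonl"   # hypothetical output of the previous step
OUT_PATH = "filtered_2023-06.csv"

fields = ["subreddit", "title", "selftext", "created_utc"]

with open(IN_PATH, encoding="utf-8") as src, \
     open(OUT_PATH, "w", newline="", encoding="utf-8") as dst:
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for line in src:
        post = json.loads(line)
        # Keep only the fields needed for downstream analysis.
        writer.writerow({f: post.get(f, "") for f in fields})
```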
Thus, by the end of this step, I obtained a set of monthly CSV files containing only the target subreddits’ posts within the study timeline, significantly cutting down on storage and preparing the data for the next stage.
5 Subsequent Analysis: Foundations for Unsupervised Learning and Sentiment Analysis
Having completed the above steps, I now have the Reddit posts from both the “pre-policy” and “post-policy” periods. Next, I plan to load them into an unsupervised topic modeling tool (e.g., BERTopic) for comparative analysis, and check whether users’ topics or sentiments differ between the two timeframes.
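As a preview, the comparison could be set up roughly as follows. The CSV filenames, the merged pre/post files, and the default BERTopic settings are assumptions for illustration; the actual modeling choices will be covered in the next post.

```python
# A hedged sketch of the planned pre/post comparison with BERTopic.
# File paths and column names are illustrative; columns match the CSV fields above.
import pandas as pd
from bertopic import BERTopic

pre = pd.read_csv("pre_policy_posts.csv")    # merged 2022-12 to 2023-05 posts
post = pd.read_csv("post_policy_posts.csv")  # merged 2023-06 to 2023-11 posts

def fit_topics(df: pd.DataFrame) -> BERTopic:
    # Combine title and body text into one document per post.
    docs = (df["title"].fillna("") + " " + df["selftext"].fillna("")).tolist()
    model = BERTopic(language="english", verbose=False)
    model.fit_transform(docs)
    return model

pre_model = fit_topics(pre)
post_model = fit_topics(post)

# Inspect the most frequent topics in each period (topics are fit separately,
# so they must be compared by their keywords rather than by topic IDs).
print(pre_model.get_topic_info().head(10))
print(post_model.get_topic_info().head(10))
```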
If you’re also interested in large-scale Reddit data collection, feel free to connect with me in the comments. In my next post, I’ll describe how to load these datasets into a model, tweak parameters, and share more thoughts on the May 2023 policy changes.
⚠️ Copyright Notice: This blog post and its associated scripts contain original work created by me. You are welcome to reference or adapt the information provided here for your own purposes, but please clearly cite this source.
© Fanmei Wang, 2025. All rights reserved.