Web Crawler Project for Automated Data Collection
This is a simple PHP-based project designed to collect data from websites and forums.
A while ago, I needed to gather data on a specific topic from a forum containing over 100 pages of archived posts, with each page holding 10 posts. Manually collecting this volume of data was a tedious and time-consuming task, so I developed a PHP web crawler script to automatically collect and store all the targeted content in a database. This project evolved into an exciting venture for me, leading to additional features like collecting images from specific websites and navigating nested links for deeper searches.
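To give a sense of how the crawl works, here is a minimal sketch of the paginated loop; the forum URL scheme and the `<div class="post">` wrapper are hypothetical placeholders, and the real script inserts each post into MySQL instead of printing it:

```php
<?php
// Minimal sketch of the paginated crawl loop. The URL scheme and the
// <div class="post"> wrapper are made-up placeholders for illustration.
$baseUrl = 'https://forum.example.com/topic-123?page='; // hypothetical URL

for ($page = 1; $page <= 100; $page++) {
    $html = @file_get_contents($baseUrl . $page);
    if ($html === false) {
        continue; // skip pages that fail to load
    }
    // Capture everything between each post's opening and closing tag.
    preg_match_all('/<div class="post">(.*?)<\/div>/s', $html, $matches);
    foreach ($matches[1] as $post) {
        echo $post . PHP_EOL; // the real script stores this row in MySQL
    }
}
```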
The first version of this script is available on GitHub, with further details about its functionality provided below.
Technical Details:
- Script Language: PHP
- Database: MySQL
Key Features:
- Collects post data from forums and websites
- Stores collected data in a database
- Enables searching within stored data using AJAX (a sketch of such an endpoint follows this list)
- Displays stored data in a paginated format
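For illustration, the AJAX search and paginated display could be backed by a small server-side endpoint along these lines; the `posts` table, `content` column, and query parameters are assumptions, not the project's actual schema:

```php
<?php
// Hypothetical search endpoint an AJAX call could request, e.g.
// search.php?q=poem&page=2 (table and column names are assumptions).
require 'db.php'; // assumed to expose a mysqli connection in $conn

$term   = $_GET['q'] ?? '';
$page   = max(1, (int)($_GET['page'] ?? 1));
$limit  = 10;                      // 10 posts per page, as on the source forum
$offset = ($page - 1) * $limit;

$stmt = $conn->prepare(
    'SELECT content FROM posts WHERE content LIKE ? LIMIT ? OFFSET ?'
);
$like = '%' . $term . '%';
$stmt->bind_param('sii', $like, $limit, $offset);
$stmt->execute();

header('Content-Type: application/json');
echo json_encode($stmt->get_result()->fetch_all(MYSQLI_ASSOC));
```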
Example Video:
The demo video shows the script extracting all posts on the topic “Most Beautiful Modern Poems (Short Poems)” from the “Shahr Sakht Afzar” forum.
Note: The script extracts all content between specified tags, including HTML tags if present.
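To make that behavior concrete, here is a tiny example with a made-up `<blockquote>` tag pair, showing how nested HTML survives the capture:

```php
<?php
// The lazy ".*?" stops at the first closing tag, and the "s" modifier
// lets "." match newlines, so inner tags like <b> are kept in the match.
$html = '<blockquote>A poem with <b>bold</b> words</blockquote>';

if (preg_match('/<blockquote>(.*?)<\/blockquote>/s', $html, $m)) {
    echo $m[1]; // prints: A poem with <b>bold</b> words
}
```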
The raw database required by the script is available on GitHub. After downloading the project, import the database and place the remaining project files in a server directory or on localhost. Database connection settings are in the db.php file.
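As a rough sketch, db.php might look like the following; the host, credentials, and database name here are placeholders for whatever your environment uses:

```php
<?php
// db.php: placeholder connection settings; replace with your own values.
$conn = new mysqli('localhost', 'root', '', 'crawler_db');

if ($conn->connect_error) {
    die('Database connection failed: ' . $conn->connect_error);
}
$conn->set_charset('utf8mb4'); // safe default for mixed-language forum text
```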
To view the source code, please visit the GitHub repository via the link below: