
Web Crawler Project for Automated Data Collection

This is a simple PHP-based project designed to collect data from websites and forums.


A while ago, I needed to gather data on a specific topic from a forum with more than 100 pages of archived posts, 10 posts per page, or over 1,000 posts in total. Collecting that much data by hand would have been tedious and time-consuming, so I wrote a PHP web crawler script that collects all the targeted content automatically and stores it in a database. The project grew from there, and I later added features such as collecting images from specific websites and following nested links for deeper searches.

The first version of this script is available on GitHub, with further details about its functionality provided below.

Technical Details:

  • Script Language: PHP
  • Database: MySQL

Key Features:

  • Collects post data from forums and websites
  • Stores collected data in a database
  • Enables searching within stored data using AJAX
  • Displays stored data in a paginated format
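The search and pagination features above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the table name `posts`, the column `content`, and the function names `buildSearchQuery` and `pageCount` are all assumptions made for the example. It only builds the parameterized query and the page arithmetic, so it runs without a database.

```php
<?php
declare(strict_types=1);

/**
 * Build a parameterized search query for one page of results.
 * The table `posts` and column `content` are hypothetical names.
 */
function buildSearchQuery(string $term, int $page, int $perPage = 10): array
{
    $page    = max(1, $page);            // clamp to the first page
    $perPage = max(1, $perPage);
    $offset  = ($page - 1) * $perPage;   // number of rows to skip

    // Escape LIKE wildcards so the search term is matched literally.
    $pattern = '%' . addcslashes($term, '%_\\') . '%';

    return [
        'sql'    => 'SELECT id, content FROM posts WHERE content LIKE :pattern '
                  . 'ORDER BY id LIMIT :limit OFFSET :offset',
        'params' => [':pattern' => $pattern, ':limit' => $perPage, ':offset' => $offset],
    ];
}

/** Number of pages needed to display $totalRows rows. */
function pageCount(int $totalRows, int $perPage = 10): int
{
    return (int) ceil(max(0, $totalRows) / max(1, $perPage));
}
```

For example, `buildSearchQuery('poem', 3)` yields an offset of 20, and `pageCount(1000)` returns 100. If the query is executed through PDO with MySQL, the `:limit` and `:offset` values would normally need to be bound as integers (for instance with `bindValue` and `PDO::PARAM_INT`), since string-bound values in a `LIMIT` clause can cause a syntax error when prepared statements are emulated.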

Example Video:

In the demo video, we extract all posts related to the topic “Most Beautiful Modern Poems (Short Poems)” from the “Shahr Sakht Afzar” forum.

Note: The script extracts everything between the specified opening and closing tags, so any HTML markup nested inside them is stored along with the text.
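To make this behavior concrete, here is a small sketch of delimiter-based extraction. It assumes a regular-expression approach, which may differ from what the script in the repository actually does; the function names `extractBetween` and `toPlainText` and the sample markup are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Return every fragment found between $open and $close in $html.
 * Inner markup is kept exactly as it appears in the page.
 */
function extractBetween(string $html, string $open, string $close): array
{
    // Non-greedy match; the "s" flag lets "." span line breaks.
    $pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/s';
    preg_match_all($pattern, $html, $matches);

    return array_map('trim', $matches[1]);
}

/** Optional cleanup step: drop the markup and keep only the text. */
function toPlainText(string $fragment): string
{
    return trim(html_entity_decode(strip_tags($fragment), ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

// Hypothetical page fragment containing two posts.
$page = '<div class="post"><p>First <b>poem</b></p></div>'
      . '<div class="post"><p>Second &amp; last</p></div>';

$posts = extractBetween($page, '<div class="post">', '</div>');
// $posts[0] is '<p>First <b>poem</b></p>' (inner tags are preserved)
// toPlainText($posts[0]) is 'First poem'
```

Because the match is non-greedy, it stops at the first closing delimiter it finds. If the targeted element contains another element with the same closing tag, the fragment will be cut short, and a DOM parser such as PHP's `DOMDocument` would be a more robust choice.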

The raw database that the script requires is available on GitHub. After downloading the project, import the database and copy the remaining project files to a directory on your server or localhost. The database connection settings are in the db.php file.

To view the source code, please visit the GitHub repository via the link below:


Ehsan Heydari

I began my career in web and software development in 2011. Previously, I worked as an Android application developer using Java, and I am now proficient in PHP, JavaScript, and Python, with my main focus currently on developing web applications. Additionally, I have a strong interest in capital markets, blockchain, and the decentralized world of Web3, which has shaped my future roadmap.