Case Study

Automated Solution for Contact Information Crawling

Background

A high-profile company in the web analytics field approached us to aggregate contact information from a large number of domain homepages.
Their aim was to offer businesses and individuals access to regularly updated contact information sourced from an extensive array of websites. However, they confronted a host of challenges:

The client needed data to be extracted, organized, and presented in a unique format on a weekly basis.

The information, pulled from more than 50,000 domains weekly, was not just diverse and extensive but also unstructured.

Each domain housed 20-40 contact data points, with websites’ architectural frameworks varying considerably.

The domains spanned different countries and industries and were constructed using a range of technologies.

Various bot detection mechanisms were deployed across these domains.

The team had to verify and quality-check the information manually and frequently, which increased the resources the project required and extended the data delivery timeline.

Faced with these obstacles, our task was to design and execute a meticulously structured data extraction, cleaning, parsing, and transformation pipeline that guaranteed high-precision output.

Impact

Our solution had a profound impact on our client’s operation. By providing a consistent, automated, and structured data stream, we allowed the client to significantly streamline their data handling process.

Operational efficiency improved by 60% (as evaluated by an internal audit) because the data collection and update process became several times faster. Our solution also allowed the client to cut costs by up to 30% by eliminating the manual work previously needed for data collection and reassigning staff to other tasks.

Automation and significantly reduced errors in data extraction made it possible to automate data quality assessment as well. Our client was able to offer a more comprehensive and accurate service, contributing significantly to their business intelligence solution.

Challenges & Solutions

Challenge

Website Accessibility

Among the more than 50,000 domains the client sought to harvest data from, many posed accessibility issues due to diverse technological implementations and strict security protocols.

Solution

Building a Crawling Pipeline

To address the website accessibility problem, we built a crawling pipeline organized as a logical decision tree. If a website could not be accessed with the standard approach through a country proxy, the pipeline tried various other options until it achieved full data extraction.
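The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the production pipeline: it flattens the decision tree into an ordered chain of strategies, and the fetcher names (plain_http, country_proxy, headless_browser) are hypothetical stand-ins that simulate outcomes rather than make network requests.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns the page HTML, or raises on failure.
Fetcher = Callable[[str], str]


@dataclass
class FetchResult:
    url: str
    html: Optional[str]
    strategy: Optional[str]  # name of the strategy that succeeded, if any
    errors: list             # (strategy name, message) for each failed attempt


def fetch_with_fallback(url: str, strategies: Sequence[Tuple[str, Fetcher]]) -> FetchResult:
    """Try each (name, fetcher) pair in order and stop at the first success."""
    errors = []
    for name, fetcher in strategies:
        try:
            html = fetcher(url)
        except Exception as exc:  # a blocked or failed attempt moves on to the next option
            errors.append((name, str(exc)))
            continue
        if html and html.strip():
            return FetchResult(url, html, name, errors)
        errors.append((name, "empty response"))
    return FetchResult(url, None, None, errors)


# Hypothetical fetchers that only simulate typical outcomes.
def plain_http(url: str) -> str:
    raise ConnectionError("blocked by bot detection")


def country_proxy(url: str) -> str:
    return ""  # simulates an empty page


def headless_browser(url: str) -> str:
    return "<html><body>Contact: info@example.com</body></html>"


result = fetch_with_fallback(
    "https://example.com",
    [
        ("plain_http", plain_http),
        ("country_proxy", country_proxy),
        ("headless_browser", headless_browser),
    ],
)
print(result.strategy, result.errors)
```

Recording why each earlier option failed makes it possible to reorder strategies per domain on later crawls, although whether the real pipeline did so is not stated here.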

Challenge

Data Inconsistency

Websites, with their varying structures and designs, complicated the task of uniformly extracting and structuring contact data. Each website contained 20-40 contact data points, and because no two sites were identical, the task grew increasingly complex.

Solution

Dynamic Cleaning Mechanism

Acknowledging the diversity in website structures, we crafted a dynamic parsing algorithm that could adjust to a multitude of designs. Data processing involved stripping superfluous data from the collected HTML and JavaScript.
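As a rough illustration of the cleaning step, the sketch below removes script, style, and noscript content and keeps only whitespace-normalized visible text. It uses Python's standard html.parser module. The class and function names are our own for this example, and the actual cleaning rules were more extensive.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping script, style, and noscript content."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the visible text of a page, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var tracker = 1;</script></head>"
    "<body><p>Call  us: +1 555 0100</p>"
    "<footer>info@example.com</footer></body></html>"
)
print(visible_text(sample))
```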

Challenge

Data Accessibility

Often, vital contact information was hidden behind poorly structured HTML elements or dynamic JavaScript elements. This made standard data extraction methods difficult to apply at large scale.

Solution

Data Structuring

We unified the diverse contact data into the client’s preferred JSON format. Additionally, an automated data delivery system was implemented for weekly transfers to the client’s AWS S3 bucket.
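The structuring step might look like the sketch below. The client's actual JSON schema is not described here, so the field names (domain, crawl_date, contacts), the newline-delimited layout, and the weekly object key pattern are all assumptions made for illustration. The upload to S3 itself is left out.

```python
import json
from datetime import date


def build_record(domain: str, crawl_date: date, contacts: dict) -> dict:
    """Normalize one domain's contacts into a predictable record (assumed schema)."""
    return {
        "domain": domain.lower(),
        "crawl_date": crawl_date.isoformat(),
        # Sort and de-duplicate so identical crawls serialize identically.
        "contacts": {
            kind: sorted(set(values)) for kind, values in sorted(contacts.items())
        },
    }


def to_json_lines(records: list) -> str:
    """Serialize records as newline-delimited JSON, one domain per line."""
    return "\n".join(
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in records
    )


def weekly_object_key(crawl_date: date) -> str:
    """Hypothetical object key for a weekly delivery file, based on the ISO week."""
    year, week, _ = crawl_date.isocalendar()
    return f"contacts/{year}-W{week:02d}.jsonl"


record = build_record(
    "Example.COM",
    date(2024, 1, 15),
    {"phone": ["+1 555 0100", "+1 555 0100"], "email": ["info@example.com"]},
)
print(weekly_object_key(date(2024, 1, 15)))
print(to_json_lines([record]))
```

Deterministic ordering keeps weekly files easy to compare, which also helps when new data is checked against historical data.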

Challenge

Datapoint Recognition

Recognizing and categorizing data points posed a significant challenge, given that most of them were not universally structured. For instance, the format and structure of phone numbers varied greatly by country and region.

Solution

Parsing and Data Points Identification

We incorporated over a hundred parsing methodologies to discover and extract all possible data on each website. Each website’s data was scrutinized by quality assurance algorithms to minimize the incidence of false positives and negatives.
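To give a sense of what a single parsing method with a basic quality check can look like, here is a simplified sketch for emails and phone numbers. The regular expressions and the 8 to 15 digit rule are assumptions chosen for this example, not the rules used in the project, which combined many more methods.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone candidate: optional +, then digits mixed with common separators.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_emails(text: str) -> list:
    """Return unique email addresses in lowercase."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


def extract_phones(text: str) -> list:
    """Return phone candidates as digit strings, keeping a leading + when present."""
    phones = set()
    for match in PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", match)
        # Quality check (assumed rule): discard digit runs that are too short
        # or longer than the 15 digits allowed for international numbers.
        if 8 <= len(digits) <= 15:
            phones.add(("+" if match.startswith("+") else "") + digits)
    return sorted(phones)


text = (
    "Sales: +44 20 7946 0958, support@Example.com, order 12-34, "
    "US office (415) 555-0132."
)
print(extract_emails(text))
print(extract_phones(text))
```

The length filter is a small example of reducing false positives: the order number 12-34 never becomes a phone candidate.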

Challenge

Historical Data Management

The client needed to know when target websites updated their contact information. This added a recurring data collection goal: comparing each new crawl with historical data.

Solution

Historical Data Handling

We incorporated a module to extract, clean, and structure contact data from previously crawled domains, providing comprehensive, uniform historical data that included all contact information updates.
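Detecting updates between two crawls of the same domain can be sketched as a simple set comparison. The snapshot shape used here (a mapping from contact type to a list of values) and the function name diff_contacts are assumptions for illustration.

```python
def diff_contacts(previous: dict, current: dict) -> dict:
    """Compare two crawls of one domain and report added and removed values."""
    changes = {}
    for kind in sorted(set(previous) | set(current)):
        old = set(previous.get(kind, []))
        new = set(current.get(kind, []))
        added, removed = sorted(new - old), sorted(old - new)
        if added or removed:
            changes[kind] = {"added": added, "removed": removed}
    return changes


last_week = {"email": ["info@example.com"], "phone": ["+15550100"]}
this_week = {"email": ["info@example.com", "sales@example.com"], "fax": ["+15550199"]}
print(diff_contacts(last_week, this_week))
```

An empty result means the domain's contact information did not change between the two crawls.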

Key Takeaways

Collecting Real-Time Data Across Platforms

Building a cross-platform solution demanded precise structuring, as real estate websites varied in layout, naming, and logic. Consistent data formatting was crucial for quick comparison and analysis.

Creating Flexible Investment Filters

To meet evolving investment strategies, the dashboard had to support complex, customizable filters. We made it easy for the client to set and adjust investment rules without needing technical intervention.

Automating Opportunity Detection

Speed was essential. Automating Slack alerts tied to specific criteria allowed the client to identify underpriced properties minutes after they were listed, gaining a major competitive advantage.

Visualizing Market Trends Effectively

Interactive dashboards empowered the client to track median price changes, listing volume, and price per square foot by region, allowing smarter and faster acquisition decisions.

Structuring Data for Long-Term Insights

By organizing listing history into a standardized and searchable format, the client could monitor property performance over time and make data-driven portfolio adjustments.


Conclusion

The project was a significant success, reducing costs by up to 30% and making data acquisition several times faster. It also shows that large-scale data collection from varied sources is already achievable with high accuracy, provided modern data recognition algorithms are used.

These achievements underline the importance of proficiently handling complex data extraction, structuring, and delivery tasks, especially when dealing with fluctuating and continuously evolving data structures.

Our robust solution not only addressed the client’s immediate requirements but also adapted to potential changes in the web data landscape. We delivered approximately 1,500,000 data points weekly, significantly reducing manual processes and error rates.

Take Action Now

We unlock data’s ability to transform.

Unlock the power of data to drive innovation, optimize operations, and make smarter decisions with Datamam’s comprehensive, integrated solutions.