Content scraping, also known as “web scraping” or “data scraping,” is the practice of taking content from your site and republishing it on another domain. Webmasters may “scrape” content manually or use automated software. Content scraping is essentially a form of plagiarism and can infringe copyright.
The act of scraping content is also disrespectful to webmasters who have invested time, money, and resources into creating original content and employing genuine SEO services on their sites. Below, find out whether your content has been scraped and, if so, how to handle it.
Content scrapers may fully copy and republish content from your site or use slight modifications to avoid easy detection. Modifications may include the use of synonyms or other kinds of automated generation techniques. Generally, any replication that doesn’t add some kind of unique value to the reader could be deemed scraped content.
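To illustrate how a lightly modified copy can still be recognized, here is a minimal sketch in Python. It compares two passages by their overlapping three-word sequences ("shingles") and reports the share they have in common (Jaccard similarity). This is one common approach, not necessarily what any particular plagiarism checker uses; the function names and the sample sentences are made up for the example.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample passages for illustration only.
original = "Content scrapers may fully copy and republish content from your site."
suspect = "Content scrapers may fully copy and repost content from your site."
unrelated = "Our bakery opens at seven every morning and sells fresh bread."

print(jaccard_similarity(original, suspect))    # 0.5
print(jaccard_similarity(original, unrelated))  # 0.0
```

Swapping a single word for a synonym only disturbs the shingles that contain it, so the score stays well above that of unrelated text. Heavier rewording lowers the score, which is why a threshold has to be chosen by hand for each use.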
Content can be stolen manually or by way of automated software. This kind of software utilizes bots that crawl websites and collect information in seconds. Manually scraping content, on the other hand, is a laborious process. Web scrapers generally opt for the sophisticated bots.
The type of content scraping we’re talking about now is often employed with malicious intent, without a webmaster’s permission and in order to steal organic traffic and web rankings. Content scraping deceives readers, leading them to believe that the site with scraped content is legitimate and responsible for the content’s creation.
Types of content that may be scraped include, but are not limited to:
Webmasters typically use scraped content to bolster the number of pages on their site, operating under the assumption that more pages equate to more value. This assumption, however, fails to consider the importance of uniqueness.
Google Search Console notes that: “Purely scraped content, even from high-quality sources, may not provide any added value to your users without additional useful services or content provided by your site; it may also constitute copyright infringement in some cases.”
So while the content being scraped and republished may be keyword-rich, topically relevant, and even high in quality, the lack of originality can severely limit the value it adds to the site.
It’s important to know the difference between content scraping and syndication. Content scraping, as noted, occurs when someone else republishes your content without your permission. Syndication, on the other hand, is when you republish your content (or give permission for your original content to be republished) on websites that aren’t yours. Republishing copyrighted work without permission can be illegal; syndication, which is done with permission, is not.
Webmasters may republish guest posts that were first published on other sites. Guest blogging, a form of syndication, allows links back to a site’s original content. This is simply an act of giving credit where credit is due. However, syndicated content has the same limitations of originality and unique value. Publishing syndicated articles on your site may help users discover authors and learn about important subjects, but from an SEO perspective it is still no substitute for hosting your own original content.
For the most part, content scraping (as the copying and republishing of someone else’s work) is bad — for SEO, for your site, and for content creators in general. Done without proper attribution and links to the original, it is plagiarism. Even with adequate attributions, scraped content is really not a viable substitute for original content, especially on a large scale. However, there is one positive lens that some webmasters may view it through.
Copying and republishing someone else’s copyrighted content without permission is illegal. A 2019 court case ruled that “any data that is publicly available and not copyrighted is fair game for web crawlers.” Data generally cannot be collected from sites that require authentication, as this usually violates their Terms of Service. Publicly available pages, however, can be viewed without agreeing to any Terms of Service, meaning that their data is once again fair game for web crawlers.
It is important to note, however, that content scraping on its own is not illegal. This is because content scraping refers only to the act of gathering information. It is the act of republishing scraped content without attribution that is illegal, because that is when it becomes plagiarism.
Allowing your content to be scraped and republished is sometimes tolerable on the basis that it passively supports link building. If scraped content contains links back to your site’s domain, these links may be crawled and indexed as legitimate backlinks, with the emphasis on “may be.” Links from scraped content are typically not quality links. They likely won’t hurt you, since search engines aren’t going to automatically view your site negatively just because other sites scrape and republish your content, but they also won’t be a benefit. Even so, you may decide to go after content theft or try to reduce the number of links pointing to your site from these sources.
Below are a few tools that may help you catch content scrapers:
Copyscape is a free online tool that checks whether your content has been duplicated. Simply enter the URL of your site and the service will identify whether other sites have republished it.
Trackbacks in WordPress notify webmasters when a piece of their content has been linked to. Adding internal links to your content is key to having trackbacks appear from content scrapers: when a scraper copies a post wholesale, the links are copied with it and point back to your site. Remember to also use good anchor text in your links. This will help catch content scrapers and provide your site with link juice.
If Google is your preferred search engine, use their Webmaster Tools (now part of Google Search Console) to see who is linking to your site and why. Select your site in Webmaster Tools, then go to “Your Site on the Web,” then “Links to Your Site,” and check the “Linked Pages” column.
Google Alerts can be used to keep track of content published on your site. Create a Google Alert by putting the exact title of a single post in quotation marks, so that the alert matches only pages that reuse the title word for word. Alerts can be delivered by email or to an RSS feed.
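If you have many posts to watch, the quoted queries can be prepared in bulk. The sketch below is a hypothetical helper, not part of any Google tool: it wraps a post title in quotation marks and, as an optional extra that this article does not cover, appends the `-site:` search operator to leave your own domain out of the results. Whether an alert honors that operator is an assumption worth testing with one alert before relying on it. The query string still has to be pasted into Google Alerts by hand.

```python
from typing import Optional


def exact_phrase_query(title: str, exclude_domain: Optional[str] = None) -> str:
    """Build an exact-match query string for a post title.

    exclude_domain is an optional, assumed extra: it appends "-site:<domain>"
    so matches on your own site are filtered out.
    """
    # Drop embedded double quotes (straight and curly) and collapse whitespace,
    # so the title forms one clean quoted phrase.
    cleaned = title
    for quote in ('"', "\u201c", "\u201d"):
        cleaned = cleaned.replace(quote, " ")
    cleaned = " ".join(cleaned.split())

    query = f'"{cleaned}"'
    if exclude_domain:
        query += f" -site:{exclude_domain}"
    return query


# Hypothetical titles and domain for illustration only.
titles = ["What Is Content Scraping?", "How to Handle  Stolen Content"]
for t in titles:
    print(exact_phrase_query(t, exclude_domain="example.com"))
```

One alert per title keeps the results easy to trace back to the post that was copied.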
If you’ve decided you want to stop all content scraping and republication of your own materials, you have a few options. Your first option is to speak to content scrapers directly. Go through their site’s contact form, email address, or social media accounts to get in touch.
If you cannot find contact information, try a WHOIS lookup tool. This search may reveal who owns a given domain, provided it is not privately registered. If neither of these approaches works, contact the domain registrar or hosting company for the website. Tell them which domain is involved and that your content has been stolen. The website may be suspended or banned as a result.