

SES 2007 – Duplicate Content & Multiple Site Issues

  • By calcoastwebdesign
  • 1 Tags
  • 1 Comments
  • 22 Aug 2007

…I just have to say before I post this that I KNOW this duplicate stuff is extensive, but I am trying to help our virtual assistants, and I just want to get the most accurate notes possible. Angela will beat me with a stick if I don’t 🙂 There’s a lot to be learned by reading this stuff, but if you are falling asleep…. hopefully you will just realize how important it is, hire us because we are so passionate about it, and be done with it. VAs, let me know if you have questions.

Day 3
1:30-3:00pm
Duplicate Content & Multiple Site Issues – duplicate content is a major concern for SEOs and webmasters, especially people with multiple sites or big companies with dynamic sites. How can you avoid getting penalized? Do you really have to rewrite everything or hire that geeky programmer?

2 Speakers, with 3 major search engines present:

Shari Thurow, Omni Marketing Interactive
Mikkel deMib Svendsen, deMib.com
Peter Linsley, Ask.com
Greg Grothaus, Google
Priyank Garg, Yahoo! Search

Shari Thurow started with “what exactly is duplicate content?” Duplicate content is hard to describe; it’s no longer an exact replica of text or pages. Duplicate content can now simply be similar pages or text, so as webmasters, we have to be very careful.

Search engines don’t like duplicate content because it’s bad for the user. Google and others will tend to cluster similar sites and run a “cluster” filter to avoid the same site taking all top 10 positions. You have probably noticed this – the results which are indented on Google’s SERP.

What do Search Filters consider when sniffing out your duplicate content?

1. Content properties. Search engines strip your template, nav, and boilerplate to check your content for uniqueness. They will also strip Java, JavaScript, Flash, ads, etc. to get to the main meat of the page and make sure it is unique.
2. Linkage properties – inbound and outbound links to the pages. If you need to find linkage properties, go to Yahoo Site Explorer and click on the links pages. You can do this for your site OR a competitor’s site, in order to see which links are going where.
3. Content Evolution – Average page mutation, how frequently content changes.
4. Host name resolutions – how many hostnames point to the same domain name? Black hats will try to trick the system with this but the search engines are on to that game.
5. Shingle comparison – Andrei Broder’s shingling research says that every document has a unique fingerprint that can be broken down into “shingles”. Search engines look at shingles to see if sites are duplicating content. For instance: if you have 3 web pages, each with unique URLs, and the same “word sets” appear on all 3 pages, then you are duplicating content, no matter how you try to justify it.
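For anyone curious what a shingle actually is, here’s a toy Python sketch – definitely not the engines’ real algorithm, just the basic idea of breaking text into overlapping word sets and comparing them:

```python
# Toy sketch of Broder-style shingling: break each document into
# overlapping k-word "shingles" and compare the sets with Jaccard
# similarity. Illustration only, not any engine's implementation.

def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two pages with identical word sets score 1.0; pages sharing almost no shingles score near 0.0 – that, roughly, is what the filter is sniffing for.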

This is very important for ecommerce sites that sort by price, color, etc. and have unique URLs for each. Use your robots.txt file to EXCLUDE certain pages that have duplicate content from the search. Exclude printer-friendly versions of your pages too! Use your web analytics to determine which pages are most important, and use the wildcard in robots.txt to tell search engines what is most important. Focus on quality, not quantity. Think conversion duckets.
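To make that concrete, here’s a hypothetical robots.txt along those lines – the paths and parameter names are made up for the example, and wildcard support varies by engine:

```
# Hypothetical robots.txt sketch -- the paths and parameter names
# are invented for illustration, not from the session itself.
User-agent: *
# keep printer-friendly duplicates out of the index
Disallow: /print/
# wildcard exclusion of sort-order variations of product pages
# (wildcard support varies by engine)
Disallow: /*?sort=
Disallow: /*&sort=
```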

Shari suggested reading Matt Cutts’ blog: http://www.mattcutts.com/blog/ . Matt Cutts works for Google and he’s a celebrity in the SEO world. Just take what he has to say as SEO Bible and you’re all set 🙂

Shari referenced a case study of an international real estate website that was having trouble getting all its pages indexed by the major search engines. They used the robots.txt “wildcard,” and within 3 days Google reacted. Within a month there was a 30% increase in natural traffic. By telling Google which pages were the best for the rank, and targeting the best audience, they benefited big time by getting QUALITY URLs in the search 🙂 It’s really not about how many anymore; it’s what you are returning with what you have.

Most duplicate content is not considered spam, but beware if your duplicate content IS considered spam. It can take you 6 months plus to get back into the search engines. That’s 6 months of hard work, so don’t copy or duplicate content, people, it’s not worth it!!! What do you do if you think people are duplicating your content? Use Copyscape and the Wayback Machine when you need to see if someone is a dirty copycat. Archive.org is also very handy!

Side note: I just have to laugh because I went back on the Wayback Machine to our first Cal Coast website, the first AskAngie website, and the first Advanced Access website. I will not link, to protect the innocent (well…maybe a little bit guilty :). We have all come so far. Rock on.

Back to reality. DMCA (Digital Millennium Copyright Act) policy with Google was referenced and I should probably link to it. Damn I can’t find the link. Just kidding.

More side notes…I searched for this term in Google yesterday and I couldn’t find it. Now today I type it in, and the crazy wiki is at the top for me. Search is crazy… you f*(^* wikis…do you sleep??? Thank you!

Next up was Mikkel, Red Suit as we love it, talking more in depth about some 301 geekiness. Angela and I also noted later in the evening, after multiple glasses of wine, that our hotel address is also 301. Scary spooky. So…..what are the common problems that come up which require 301 redirects?

– multiple domains, sub or test domains
– with and without www, http, https or ww1
– session ids
– url rewriting
– sort order parameters
– bread crumb navigation

A lot of our clients use these things, so this is why I am here taking notes for you 🙂 It’s the 3rd day and my head is literally spinning. It’s not rocket science though, we can do it. It’s just boring. I will pay a friendly VA to do it 🙂

If you plan on using multiple domains to the same website, choose one domain as your brand, and use that and that only for links. All other forwarding domains need a 301 redirect to your brand.
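A hypothetical Apache .htaccess for one of those forwarding domains might look like this – the brand domain name is a placeholder, and this assumes your host has mod_rewrite enabled:

```
# Sketch: hosted on the forwarding domain, permanently redirecting
# every request to the brand domain. "brand-domain.com" is made up.
RewriteEngine On
RewriteRule ^(.*)$ http://www.brand-domain.com/$1 [R=301,L]
```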

For subdomains, you just need to make sure pages can only be accessed on one of the subdomains at a time. Do not put the same content on multiple subdomains!

Test and development domains should not be available at all – Mikkel recommends password protecting them. We agree. This can get embarrassing. He told a story of a “Christmas in July” type accident; others have similar.

What about www and no www? Major search engines know this now, and you should be ok. But what about your links? If people link sometimes to your domain with the w’s, and other times without, you are spreading your link equity too thin. Decide which you’re going with, and make sure link partners act accordingly. One way is not necessarily better than the other….you just gotta pick something and stick with it.

If you are paranoid, you can solve w’s vs no w’s by using a 301 redirect. If people type in yourdomain.com and your brand is with the w’s, then redirect yourdomain.com to http://www.yourdomain.com/ … make sense?
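If your site runs on Apache, one common way to do that 301 looks like this – a sketch, assuming mod_rewrite is enabled, with yourdomain.com as a placeholder:

```
# Sketch: send non-www requests to the www version with a 301.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
```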

Session IDs, yikes! Dump all session information in a cookie for all users, or identify spiders and strip the session ID for them only. WTF? This is over my head. In any case, you need to handle this as a webmaster/programmer and not expect the search engines to deal with your virtual mess. In layman’s terms, you better have a GOOD programmer, don’t be scrimpin.
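For the programmers: here’s a minimal Python sketch of the “strip the session ID” idea. The parameter name sid is a made-up example – use whatever your platform actually appends – and this is just an illustration of the URL cleanup, not anyone’s production code:

```python
# Minimal sketch: strip a session parameter from a URL so every visit
# exposes one canonical address. The parameter name "sid" is a made-up
# example; substitute whatever your platform appends.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_session_id(url, param="sid"):
    """Return the URL with the session parameter removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Whether you strip it for spiders only or dump sessions into cookies for everyone, the goal is the same: one URL per page.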

What about URL rewriting with WordPress? Use http://www.seoconsultants.com/tools/headers.asp as your guideline, and you can easily redirect your own URL. Also keep in mind different parameters. If people link to a post in your blog (through the Blogger or WordPress link) and you have your blog hosted on your domain, then the search engines could see this as duplicate content. A 301 redirect solves the problem.

Sort order parameters:
Mikkel’s example table had columns for publisher, site, unique visitors, and % +/-. There were more…I should write them down……but this crap bores me……

Breadcrumb navigation: breadcrumbs are so nice for website users, but if you offer different breadcrumb trails through different sections of your site which lead to the same product or content, you will again need to use a 301 redirect. Have your programmer choose one “standard” breadcrumb format, and even if the user did not drill down that way, put the correct or best-practice breadcrumb at the top of the page.
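A tiny sketch of that “one standard breadcrumb” idea – the product names and categories here are invented for illustration:

```python
# Sketch: always render one canonical breadcrumb per product, no matter
# which navigation path the user took to reach it. The catalog mapping
# below is made up for the example.
CANONICAL_TRAIL = {
    "blue-widget": ["Home", "Widgets", "Blue Widget"],
    "red-gadget": ["Home", "Gadgets", "Red Gadget"],
}

def breadcrumb(product_slug, separator=" > "):
    """Return the canonical breadcrumb string for a product page."""
    trail = CANONICAL_TRAIL.get(product_slug, ["Home"])
    return separator.join(trail)
```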

This stuff is complicated. If you want to run a test spider on your site, try “Xenu” – it’s a download, don’t be scerd 🙂

Priyank Garg with Yahoo came up next, talking about dynamic URL rewriting and what Yahoo is doing to solve the problem. They just announced this feature yesterday!! Yahoo used to not play well with dynamic URLs or session IDs… or any of that fun stuff that helps us gauge conversions. Glad to see we’re makin some progress here 🙂

You can log in to your Yahoo account as a webmaster and instruct the spider using a BRAND NEW tool which tells Y! what parameters and actions your site is running. This way they understand you better. It’s under the Actions tab – go check it out. You should use this because:

– fewer duplicate URLs are crawled
– better and deeper site coverage due to freed-up crawl quota
– more unique content discovered (can’t wait to try this for ComputerGiants)
– fewer chances of crawler traps

For more info go to siteexplorer.search.yahoo.com or ysearchblog.com

Afterward…..
Ask.com did pipe up and say they don’t support wildcards, damn. They are considering it for the future though. Spider your site AND your access logs to see if people are viewing, or search engines can see, duplicate pages.

Greg with Google said to make sure that the address in the bar changes too when you are redirecting (aka forward without a mask), and remember to do your 301 properly, because otherwise Google can’t really tell which site you are trying to get found.

Yahoo said they don’t think of duplicate content as bad or evil all the time, but it may not get indexed because it’s not considered valuable for the user. That being said, don’t go copying copyrighted stuff. As I said earlier, if they DO penalize you, there’s major time and work involved to come back. Cal Coast has a one-strike-and-you’re-nuked policy with our VAs.

Remember, domains are NOT case sensitive, but directories are. If you accidentally rename a long link with a capital letter in one of your pages, it could throw a loop in your plan. Be consistent with your names. It was recommended to link to the root folder, not the full URL (aka www.domainname.com/relevantname/ –VS– www.domainname.com/relevantname/index.asp) if it is a deep link, so if your servers change from .html to .aspx you don’t have to do a million 301 redirects. Yay…headache saver!! Smart programmers who plan ahead will save you thousands; we love them. They come at a cost, though 🙂
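Here’s a little sketch of what “link to the folder, not the file” can look like in practice – the index filenames below are just examples:

```python
# Sketch: rewrite internal deep links to point at the folder instead of
# the index file, so a later switch from .asp to .aspx (or .html)
# doesn't break them. The index filenames below are just examples.
INDEX_FILES = ("index.asp", "index.aspx", "index.html", "index.htm")

def normalize_link(url):
    """Strip a trailing index file from a link, keeping the folder."""
    for name in INDEX_FILES:
        if url.lower().endswith("/" + name):
            return url[: -len(name)]
    return url
```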

All in all, it seems like duplicate content is not as serious for small businesses as it is for huge dynamic websites. You can go ahead and republish a valuable article if it’s relevant to your website and you have permission. Search engines are more concerned right now with all these dynamic site loops keeping their spiders tied up, not with all the “copycats” out there. Bad news for the copycats: they already know you copied, and you are not likely to get indexed. If you blatantly stole, you may have some big issues to deal with though.

More…You can’t hire a garage programmer to make your dynamic site and expect it to be coded for the search engines. Your programmers NEED this session. Scott, let me know if there are questions; I am sure your programmers will benefit from these notes. If you are a big company and your website is catching spiders, not only will you have irrelevant pages indexed, you could be ignored completely and lose thousands if not millions of dollars. Glad we know how to fix that leaky ducket 🙂

COMMENTS
Well again I have learned something new with this post. I’ve always considered duplicate content to be two exact pages, content or imagery. It’s good to know that duplicate content can be considered things that are not exact copies but content that is similar in some fashion.

It’s also good to find out that they are sniffing out Linkage properties and that links could contribute to some form of duplicate content. I also see that the evolution of your pages now helps keep your page or pages from being labeled as duplicates. Just one more good reason to constantly update your pages and fill them with unique content. It’s also kind of scary to learn that 3 pages with the same word sets (even if it’s justified) can hurt you!

301 redirects have always been a double-edged sword in some respects, at least that’s what I’ve been taught, and this post kind of reiterates that. You do have to be careful and know when and when not to use the redirects. As far as to have or not to have www’s, that’s kind of a no-brainer, but it was good to learn that search engines are now hip to either and it really doesn’t matter any more. Something else that was good to learn was that “the address in the bar needs to change too when you are redirecting (aka forward without a mask)”. Yet another item that would have never crossed my mind. Man, this is all confusing!

Wow so now, capitals in your URL can affect things? That’s pretty crazy! Linking to the root folder too instead of the full URL makes sense to me and I can see how that helps.

I don’t understand why, though, duplicate content isn’t as damning for small sites or businesses as it is for the larger websites. The big popular websites generally don’t duplicate anything; that’s one reason why they are on top!