Wednesday, 28 December 2016

Data Mining - Retrieving Information From Data

Data mining is the process of retrieving information from data. It has become very important nowadays because processed data is usually kept for future reference and, in many companies, for security purposes. Data is transformed into information, and how that information is used depends on what is being extracted and from where.

Data mining is commonly used in marketing, scientific research, fraud detection, surveillance and many other fields, and most of this work is done on a computer. Related terms include data snooping, data fishing and data dredging, although these usually describe mining data without a prior hypothesis and are often used critically. One must understand what data mining is in order to make good use of data.

Extracting patterns from data by hand has been practiced for centuries and continues today. Two early methods are Bayes' theorem and regression analysis. Both remain in use, but advances in computing have changed how they are applied and have added many newer techniques alongside them.

With the introduction of computers, saving information became fast and easy. Computers have made this work easier, and through computer science one can learn more about data crawling and about how data is stored and processed.

Computer science is a field of study that sharpens one's skills and deepens one's understanding of data crawling and of what data mining means. Topics covered in computer science that are relevant here include clustering, support vector machines and decision trees.

This knowledge must then be applied in practice. Government institutions, small-scale businesses and supermarkets all use data.

The main reason most companies use data mining is that it assists in collecting information and observations from the company's daily activities. Such information is vital to any company's profile and needs to be checked and updated for future reference in case something happens.

Businesses that use data crawling focus mainly on return on investment, and they can tell within a very short period whether they are making a profit or a loss. If the business is making a profit, it can offer customers a deal on the products it sells in order to increase that profit further. Data mining is also valuable in human resource departments, where it helps identify a person's character traits in terms of job performance.

Most people who use this method believe that it is ethically neutral. The way it is used nowadays, however, raises many questions about the security and privacy of the people whose data is mined. Data mining requires good data preparation, and that preparation can uncover different types of information, including information that should remain private.

A very common way in which this occurs is through data aggregation.

Data aggregation is when information is retrieved from different sources and put together so that it can be analyzed as a whole. Combining sources in this way can reveal more than any single source does, so the combined data must be kept secure. If one is collecting data, it is vital to know the following:

    How will the data being collected be used?
    Who will mine the data and use it?
    Is the data secure when I am away, or can someone else access it?
    How can the data be updated when new information is needed?
    If the computer crashes, is there a backup somewhere?

It is important to be very careful with documents that contain a company's private information so that the information cannot easily be manipulated.

Source: http://ezinearticles.com/?Data-Mining---Retrieving-Information-From-Data&id=5054887

Monday, 19 December 2016

One of the Main Differences Between Statistical Analysis and Data Mining

Two methods of analyzing data that are common in both academic and commercial fields are statistical analysis and data mining. While statistical analysis has a long scientific history, data mining is a more recent method of data analysis that has arisen from Computer Science. In this article I want to give an introduction to these methods and outline what I believe is one of the main differences between the two fields of analysis.

Statistical analysis commonly involves an analyst formulating a hypothesis and then testing the validity of this hypothesis by running statistical tests on data that may have been collected for the purpose. For example, if an analyst were studying the relationship between income level and the ability to get a loan, the analyst may hypothesize that there will be a correlation between income level and the amount of credit someone may qualify for.

The analyst could then test this hypothesis with the use of a data set that contains a number of people along with their income levels and the credit available to them. A test could be run that indicates, for example, that there may be a high degree of confidence that there is indeed a correlation between income and available credit. The main point here is that the analyst has formulated a hypothesis and then used a statistical test along with a data set to provide evidence in support of or against that hypothesis.
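As a rough illustration of that workflow, the sketch below states the hypothesis up front (income and available credit are correlated) and then tests it with a Pearson correlation and a simple permutation test. The income and credit figures are invented for the example, and the permutation test is just one of several tests an analyst might choose.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Share of random shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical sample: annual income and available credit, in thousands.
income = [22, 31, 38, 45, 52, 64, 78, 95]
credit = [3, 5, 6, 9, 10, 14, 17, 22]

r = pearson_r(income, credit)
p = permutation_p_value(income, credit)
print("r = %.3f, p = %.4f" % (r, p))
```

A small p value here would count as evidence in support of the stated hypothesis; a large one would count against it.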

Data mining is another area of data analysis that has arisen more recently from computer science and that has a number of differences from traditional statistical analysis. Firstly, many data mining techniques are designed to be applied to very large data sets, while statistical analysis techniques are often designed to form evidence in support of or against a hypothesis from a more limited set of data.

Probably the most significant difference here, however, is that data mining techniques are not used so much to form confidence in a hypothesis, but rather to extract unknown relationships that may be present in the data set. This is probably best illustrated with an example. Rather than in the above case, where a statistician may form a hypothesis about income levels and an applicant's ability to get a loan, in data mining there is not typically an initial hypothesis. A data mining analyst may have a large data set on loans that have been given to people, along with demographic information about these people such as their income level, their age, any existing debts they have and whether they have ever defaulted on a loan before.

A data mining technique may then search through this large data set and extract a previously unknown relationship between income levels, people's existing debt and their ability to get a loan.

While there are quite a few differences between statistical analysis and data mining, I believe this difference is at the heart of the issue. A lot of statistical analysis is about analyzing data to either form confidence for or against a stated hypothesis while data mining is often more about applying an algorithm to a data set to extract previously unforeseen relationships.
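To make the contrast concrete, here is a deliberately small sketch of the data mining style: no hypothesis is stated, and the code simply scans every pair of columns and ranks them by strength of correlation. The column names and values are made up, and real data mining techniques are far more sophisticated than a pairwise scan; this only illustrates the "search first, interpret later" direction.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical loan records, one list per column.
loans = {
    "income": [22, 31, 38, 45, 52, 64, 78, 95],
    "age": [25, 47, 33, 58, 29, 41, 36, 52],
    "existing_debt": [12, 4, 9, 2, 15, 6, 3, 1],
    "loan_amount": [3, 5, 6, 9, 10, 14, 17, 22],
}

def rank_relationships(table):
    """Return (strength, column_a, column_b) for every pair, strongest first."""
    pairs = []
    for a, b in combinations(sorted(table), 2):
        pairs.append((abs(pearson_r(table[a], table[b])), a, b))
    return sorted(pairs, reverse=True)

ranked = rank_relationships(loans)
for strength, a, b in ranked:
    print("%-14s %-14s %.3f" % (a, b, strength))
```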

Source: http://ezinearticles.com/?One-of-the-Main-Differences-Between-Statistical-Analysis-and-Data-Mining&id=4578250

Tuesday, 13 December 2016

Web Data Extraction Services

Web data extraction from dynamic pages is one of the services that may be acquired through outsourcing. It is possible to pull information from established websites through the use of data scraping software, and the information is applicable in many areas of business. Solutions such as data collection, screen scraping, email extraction and web data mining services, among others, are available from companies providing such services, for example Scrappingexpert.com.

Data mining is common in the outsourcing business. Many companies outsource data mining services, and companies providing these services can earn a lot of money, especially given the growth in outsourcing and internet business in general. With web data extraction, you pull data into a structured, organized format, even when the source of the information is unstructured or semi-structured.

In addition, it is possible to pull data that was originally presented in a variety of formats, including PDF, HTML and text, among others. The web data extraction service therefore supports a diversity of information sources. Large organizations have used data extraction services to obtain large amounts of data on a daily basis. It is possible to get highly accurate information efficiently and affordably.

Web data extraction services are important for collecting data and web-based information on the internet. Data collection services are very important for consumer research, and research is becoming vital for companies today. Companies need to adopt strategies that lead to fast and efficient data extraction, organized output formats and flexibility.

In addition, people prefer software that is flexible in how it can be applied. Some software can be customized according to the needs of customers, which plays an important role in fulfilling diverse customer needs. Companies selling such software therefore need to provide features that deliver an excellent customer experience.

It is possible for companies to extract email addresses and other contact details from certain sources, as long as they are valid email addresses. This can be done without producing duplicates. You can extract email addresses from web pages in a variety of formats, including HTML files, text files and other formats. These services can be carried out quickly and reliably, and hence software providing such capability is in high demand. It can help businesses and companies quickly find contacts for the people to whom email messages are to be sent.
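A minimal sketch of what such an extractor does is shown below: it finds address-like strings in page text and drops duplicates while keeping first-seen order. The regular expression is deliberately simple and does not cover every valid address form, and the page content and the function name extract_emails are invented for the example.

```python
import re

# Deliberately simple pattern; it does not cover every valid address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return addresses found in text, lower-cased, without duplicates."""
    seen = set()
    unique = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in seen:
            seen.add(address)
            unique.append(address)
    return unique

# Hypothetical page content with a repeated address.
page = """
<p>Sales: <a href="mailto:sales@example.com">sales@example.com</a></p>
<p>Support: Support@Example.com or help@example.org.</p>
"""

emails = extract_emails(page)
print(emails)
```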

It is also possible to use software to sort large amounts of data and extract information, an activity termed data mining. This way, the company reduces costs, saves time and increases return on investment. In this practice, the company will also carry out metadata extraction, data scanning and other tasks.

Source: http://ezinearticles.com/?Web-Data-Extraction-Services&id=4733722

Wednesday, 7 December 2016

Data Mining vs Screen-Scraping

Data mining isn't screen-scraping. I know that some people in the room may disagree with that statement, but they're actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, where data mining allows you to analyze information. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen-scraping" comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can "crawl" or "spider" through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large stores of data for patterns." In other words, you already have the data, and you're now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what's already there.

The difficulty is that people who don't know the term "screen-scraping" will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks; for example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose "scraping" is sort of like "ripping"). So it presents a bit of a problem: we don't necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.

Source: http://ezinearticles.com/?Data-Mining-vs-Screen-Scraping&id=146813

Saturday, 3 December 2016

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
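For a sense of what this looks like in practice, here is a small sketch in Python (rather than Perl) that pulls URLs and link titles out of a page fragment with one regular expression. The fragment is invented, and the pattern assumes double-quoted href attributes; real pages would need a more forgiving expression.

```python
import re

# Hypothetical page fragment used only for illustration.
html = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a class="ext" href="http://example.com/news/2">Second headline</a></li>
</ul>
"""

# Match an <a> tag, capturing its href value and the text up to </a>.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

links = LINK_RE.findall(html)
for url, title in links:
    print(url, "->", title)
```

Note how the `[^>]*?` portions let the pattern tolerate extra attributes such as `class`, which is the kind of fuzziness that keeps a regular expression working through minor page changes.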

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.
- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.
- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.
- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.
- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.
- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.
- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.
- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.
- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.
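To give a flavor of the vocabulary side of that problem, the sketch below maps several spellings of "number of bedrooms" and "number of bathrooms" onto two fields. It is only a lookup-table stand-in written for this illustration: the spellings, the field names and the function extract_fields are assumptions, and a real ontology-based engine is considerably richer than this.

```python
import re

# Tiny hand-written vocabulary: each field maps to spellings seen in ads.
# These spellings are illustrative, not the vocabulary used in the project.
VOCABULARY = {
    "bedrooms": ["bedrooms", "bedroom", "bdrms", "bdrm", "beds", "bed", "brs", "br", "bd"],
    "bathrooms": ["bathrooms", "bathroom", "baths", "bath", "ba"],
}

def build_pattern(terms):
    """Build a regex matching a number followed by any of the given terms."""
    # Longest alternatives first so "bedrooms" is tried before "bed".
    ordered = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(
        r"(\d+(?:\.\d+)?)\s*-?\s*(?:%s)\b" % "|".join(ordered),
        re.IGNORECASE,
    )

PATTERNS = {field: build_pattern(terms) for field, terms in VOCABULARY.items()}

def extract_fields(ad_text):
    """Return a dict of the fields found in one classified ad."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ad_text)
        if match:
            record[field] = float(match.group(1))
    return record

# Hypothetical raw ad chunks, as a crawler might hand them over.
ads = [
    "Sunny 3 BR, 2 bath home near park",
    "Charming 2-bedroom cottage, 1.5 ba.",
    "4bdrm/2ba, large yard",
]
records = [extract_fields(ad) for ad in ads]
for record in records:
    print(record)
```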

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Monday, 21 November 2016

How Xpath Plays Vital Role In Web Scraping Part 2

This post is a follow-up to How Xpath Plays Vital Role In Web Scraping.

Let's dive into a real-world example: scraping the Amazon website for information about its deals of the day. Deals of the day on Amazon can be found at this URL. Navigate to the Amazon deals of the day page in Firefox and find the XPath selectors. Right-click on the deal you like and select "Inspect Element with Firebug".

In the inspected markup, you can find the source of the deal's image and the name of the deal in the src and alt attributes, respectively.

So now let's write a generic XPath which gathers the name and image source of the product (deal).

  //img[@role="img"]/@src   ## for image source
  //img[@role="img"]/@alt   ## for product name

If you have an interest in Python and web scraping, you may have already played with the nice requests library to get the content of pages from the Web. Maybe you have toyed around with Scrapy selectors or lxml to make the content extraction easier. Below are some tips we found valuable when using XPath in the trenches, using both lxml and Scrapy selectors for HTML parsing.

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.

Here is why: the expression .//text() yields a collection of text nodes, that is, a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first node only.

from scrapy import Selector

html_code = """<a href="#">Click here to go to the <strong>Next Page</strong></a>"""
sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()    # let's type this only once

print(xp('//a//text()'))                 # take a peek at the node-set
# [u'Click here to go to the ', u'Next Page']

print(xp('string(//a//text())'))         # convert it to a string
# [u'Click here to go to the ']

Let's do the same with lxml. You can run XPath through either lxml or a Scrapy selector, as the XPath expression is the same for both.

lxml code:

from lxml import html

html_code = """<a href="#">Click here to go to the <strong>Next Page</strong></a>"""
parsed_body = html.fromstring(html_code)            # parse the text into a tree

print(parsed_body.xpath('//a//text()'))             # take a peek at the node-set
# [u'Click here to go to the ', u'Next Page']

print(parsed_body.xpath('string(//a//text())'))     # convert it to a string
# Click here to go to the    (lxml returns a single string here, not a list)

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> xp('//a[1]')  # selects the first a node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

>>> xp('string(//a[1])')  # converts it to string
[u'Click here to go to the Next Page']

Beware of the difference between //node[1] and (//node)[1]. //node[1] selects all the nodes occurring first under their respective parents, whereas (//node)[1] selects all the nodes in the document and then gets only the first of them.

from scrapy import Selector

html_code = """<ul class="list">
<li>1</li>
<li>2</li>
<li>3</li>
</ul>

<ul class="list">
<li>4</li>
<li>5</li>
<li>6</li>
</ul>"""

sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()

xp("//li[1]") # get all first LI elements under whatever it is its parent

[u'<li>1</li>', u'<li>4</li>']

xp("(//li)[1]") # get the first LI element in the whole document

[u'<li>1</li>']

xp("//ul/li[1]")  # get all first LI elements under an UL parent

[u'<li>1</li>', u'<li>4</li>']

xp("(//ul/li)[1]") # get the first LI element under an UL parent in the document

[u'<li>1</li>']

Also,

//a[starts-with(@href, '#')][1] gets a collection of the local anchors that occur first under their respective parents, and (//a[starts-with(@href, '#')])[1] gets the first local anchor in the document.

When selecting by class, be as specific as necessary.

If you want to select elements by a CSS class, the XPath way to do the same job is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

Let’s cook up some examples:

>>> sel = Selector(text='<p class="content-author">Someone</p><p class="content text-wrap">Some content</p>')

>>> xp = lambda x: sel.xpath(x).extract()

BAD: because there are multiple classes in the attribute

>>> xp("//*[@class='content']")

[]

BAD: gets more content than we need

>>> xp("//*[contains(@class,'content')]")

[u'<p class="content-author">Someone</p>',
 u'<p class="content text-wrap">Some content</p>']

GOOD:

>>> xp("//*[contains(concat(' ', normalize-space(@class), ' '), ' content ')]")
[u'<p class="content text-wrap">Some content</p>']

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']

>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

Learn to use all the different axes.

It is handy to know how to use the axes; you can follow through these examples.

In particular, you should note that following and following-sibling are not the same thing; this is a common source of confusion. The same goes for preceding and preceding-sibling, and also for ancestor and parent.
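A short sketch of those differences, using lxml on a tiny made-up XML document (written for Python 3; the element names and ids are invented for the example):

```python
from lxml import etree

# Small made-up document: two <div> parents holding <p> children.
doc = etree.fromstring(
    "<doc>"
    "<div><p id='a'/><p id='b'/></div>"
    "<div><p id='c'/></div>"
    "</doc>"
)

# following-sibling: later nodes that share the same parent.
siblings = doc.xpath("//p[@id='a']/following-sibling::p/@id")

# following: every later node in document order, whatever its parent.
following = doc.xpath("//p[@id='a']/following::p/@id")

# parent: the single direct parent; ancestor: every enclosing element.
parent_tags = [e.tag for e in doc.xpath("//p[@id='c']/parent::*")]
ancestor_tags = [e.tag for e in doc.xpath("//p[@id='c']/ancestor::*")]

print(siblings)      # ['b']
print(following)     # ['b', 'c']
print(parent_tags)   # ['div']
print(sorted(ancestor_tags))
```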

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents: 

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content of the script and style tags and also skips whitespace-only text nodes.
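Here is a minimal sketch of that expression at work with lxml, on an invented page (written for Python 3):

```python
from lxml import html

# Made-up page containing style, script and whitespace-only text.
page = html.fromstring(
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello</p>\n<script>var x = 1;</script>"
    "<div> World </div></body></html>"
)

texts = page.xpath(
    "//*[not(self::script or self::style)]/text()[normalize-space(.)]"
)

# The predicate only filters out blank nodes; surrounding spaces are
# still present in the results, so strip them separately if needed.
cleaned = [t.strip() for t in texts]
print(cleaned)
```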

Tools & Libraries Used:

Firefox
Firefox inspect element with firebug
Scrapy : 1.1.1
Python : 2.7.12
Requests : 2.11.0

Source: http://blog.datahut.co/how-xpath-plays-vital-role-in-web-scraping-part-2/

Wednesday, 19 October 2016

Web Scraping with Python: A Beginner’s Guide

In the Big Data world, web scraping or data extraction services are a primary requisite for Big Data analytics. Pulling data from the web has become almost unavoidable for companies that want to stay in business. The next question is how to go about web scraping as a beginner.

Data can be extracted or scraped from a web source using a number of methods. Popular websites like Google, Facebook and Twitter offer APIs to view and extract the available data in a structured manner, which removes the need for other methods that the provider may not welcome. However, the need to scrape a website arises when the information is not readily offered through an API. Python, an open source programming language, is often used for web scraping because it is simple and has a rich ecosystem. A third-party library called "BeautifulSoup" is commonly used for this task. Let's take a deeper look into web scraping using Python.

Setting up a Python Environment:

To carry out web scraping using Python, you will first have to install the Python environment, which lets you run code written in the Python language. The following libraries perform the data scraping:

Beautiful Soup is a convenient-to-use Python library and one of the finest tools for extracting information from a webpage. It can be used to scrape information from web pages in the form of tables, lists or paragraphs, and filters can be added to extract specific information. Urllib2 is a Python 2 module that fetches URLs; it can be used in combination with BeautifulSoup to fetch the web pages. (In Python 3 the same functionality lives in urllib.request.)

For Mac OS X:

To install the Python libraries on Mac OS X, open a terminal window and type in the following commands, one command at a time:

sudo easy_install pip

pip install BeautifulSoup4

pip install lxml

For Windows 7 & 8 users:

Windows 7 & 8 users need to ensure that the Python environment is installed first. Once the environment is installed, open the command prompt, change to the root C:/ directory and type in the following commands:

easy_install BeautifulSoup4

easy_install lxml

Once the libraries are installed, it is time to write the data scraping code.

Running Python:

Data scraping should be done with a distinct objective, such as scraping the current stock of a retail store. First, use a web browser to navigate to the website that contains this data. After identifying the table that holds it, right-click anywhere on the table and select "Inspect Element" from the dropdown menu. This opens a panel at the bottom or side of your screen displaying the website's HTML code. You might need to scan through the HTML until you find the line of code that highlights the table on the webpage.

Python offers some other alternatives for HTML scraping apart from BeautifulSoup. They include:

    Scrapy
    Scrapemark
    Mechanize

 Web scraping converts unstructured data from HTML code into structured form such as tabular data in an Excel worksheet. Web scraping can be done in many ways ranging from the use of Google Docs to programming languages. For people who do not have any programming knowledge or technical competencies, it is possible to acquire web data by using web scraping services that provide ready to use data from websites of your preference.

HTML Tags:

To perform web scraping, users must have a sound knowledge of HTML tags. It helps to know that HTML links are defined using the anchor tag, <a>, as in <a href="http://…">The link text goes here</a>. An HTML list is either unordered (<ul>) or ordered (<ol>), and each list item starts with <li>.

HTML tables are defined with <table>, rows with <tr>, and each row is divided into data cells with <td>. Some other basics:

    <!DOCTYPE html>: An HTML document starts with a document type declaration
    The visible part of the HTML document is defined between the <body> and </body> tags
    The headings in HTML are defined using the heading tags from <h1> to <h6>
    Paragraphs are defined with the <p> tag in HTML
    An entire HTML document is contained between <html> and </html>
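
To make the tags above concrete, here is a minimal sketch that walks a small page and collects its link targets and list items. It uses html.parser from the Python standard library rather than BeautifulSoup so that it runs without extra installs; the sample page, the TagCollector class, and the example.com URL are all made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, made up for this illustration.
SAMPLE_HTML = """<!DOCTYPE html>
<html><body>
<h1>Stock list</h1>
<p>See the <a href="http://example.com/catalog">catalog</a>.</p>
<ul><li>Apples</li><li>Pears</li></ul>
</body></html>"""


class TagCollector(HTMLParser):
    """Collects link targets (<a href>) and list item text (<li>)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        elif tag == "li":
            # Start a new, empty list item and begin collecting its text.
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
print(collector.items)
```

The same idea carries over to BeautifulSoup, where the tag names are used to look elements up instead of being handled one event at a time.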

Using BeautifulSoup in Scraping:

When scraping a webpage with BeautifulSoup, the main concern is to identify the final objective. For instance, if you would like to extract a list from a webpage, a stepwise approach is required:

    The first step is to import the required libraries:

#import the library used to query a website (Python 2; in Python 3 use urllib.request instead)

import urllib2

#specify the url

wiki = "https://"

#Query the website and return the html to the variable 'page'

page = urllib2.urlopen(wiki)

#import the Beautiful Soup functions to parse the data returned from the website

from bs4 import BeautifulSoup

#Parse the html in the 'page' variable with the lxml parser installed earlier, and store it in Beautiful Soup format

soup = BeautifulSoup(page, "lxml")

    Use the function prettify() to visualize the nested structure of the HTML page
    Working with Soup tags:

soup.<tag> returns the content between the opening and closing tag, including the tag itself.

    In[30]:soup.title

 Out[30]:<title>List of Presidents in India till 2010 – Wikipedia, the free encyclopedia</title>

    soup.<tag>.string: Returns the string within the given tag
    In [38]:soup.title.string
    Out[38]:u'List of Presidents in India and Brazil till 2010 in India – Wikipedia, the free encyclopedia'
    Find all the links within the page's <a> tags: Links are marked with the <a> tag. Note that soup.a returns only the first <a> tag on the page; to get every link, use soup.find_all('a') and read each tag's href attribute. Let's try soup.a first.
    In [40]:soup.a

Out[40]:<a id=”top”></a>

    Find the right table:

Since we are looking for a table with information about Presidents in India and Brazil till 2010, it is important to identify the right table first. Here's a command that returns the information enclosed in all table tags.

all_tables = soup.find_all('table')

To identify the right table, filter on the table's "class" attribute. You can find the class name by right-clicking the required table on the web page, as follows:

    Inspect element
    Copy the class name, or find the class name of the right table in the last command's output.

right_table = soup.find('table', class_='wikitable sortable plainrowheaders')

right_table

That’s how we can identify the right table.

    Extract the information to a DataFrame: Iterate through each row (tr), assign each element of the row (td) to a variable, and append it to a list. Start by looking at the HTML structure of the table; the table headings are held in <th> elements and need to be extracted as well.

To access the value of each element, use the "find(text=True)" option on each element. Finally, the collected lists are loaded into a DataFrame.
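
The row-by-row extraction described above can be sketched as follows. This is only an illustration under stated assumptions: it swaps in the standard-library html.parser for BeautifulSoup so that it runs without extra installs, and the table contents, the TableRows class, and the column names are made up.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table, made up for this illustration.
SAMPLE_TABLE = """
<table class="wikitable">
  <tr><th>Name</th><th>Took office</th></tr>
  <tr><td>Person A</td><td>1950</td></tr>
  <tr><td>Person B</td><td>1962</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Turns every <tr> into a list of its <th>/<td> cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableRows()
parser.feed(SAMPLE_TABLE)
header, *records = parser.rows

# Write the structured rows out as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(records)
print(buffer.getvalue())
```

If pandas is installed, the same rows can be loaded with pandas.DataFrame(records, columns=header) to get the DataFrame the text refers to.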

There are various other ways to scrape data using BeautifulSoup that reduce the manual effort of collecting data from web pages. Code written with BeautifulSoup is generally considered more robust than code based on regular expressions. The web scraping method we discussed uses the BeautifulSoup and urllib2 libraries in Python. That was a brief beginner's guide to using Python for web scraping.

Source: https://www.promptcloud.com/blog/web-scraping-python-guide

Friday, 30 September 2016

Easy Web Scraping using PHP Simple HTML DOM Parser Library


Web scraping is often the only way to get data from a website when the website doesn't provide an API to access its data. Web scraping involves the following steps:

    Make a request to the web page
    Parse/extract the data that you want to scrape from the website
    Store the data for final output (Excel, CSV, MySQL database, etc.)
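
The three steps above can be sketched as a small pipeline. The sketch is written in Python purely for illustration (the rest of this article uses PHP), and the function names, the canned HTML, and the "price" class are assumptions made up for the example. The request step is stubbed out so the code runs offline.

```python
import csv
import io
import re


def fetch(url):
    """Step 1: make the request.

    Stubbed with canned HTML so the sketch runs offline; a real scraper
    would download the page at `url` here.
    """
    return '<ul><li class="price">19.99</li><li class="price">5.49</li></ul>'


def parse(html):
    """Step 2: extract the values we care about (hypothetical 'price' items)."""
    return [float(p) for p in re.findall(r'<li class="price">([\d.]+)</li>', html)]


def store(prices, out):
    """Step 3: store the result, here as CSV text."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["price"])
    for price in prices:
        writer.writerow([price])


out = io.StringIO()
store(parse(fetch("http://example.com/products")), out)
print(out.getvalue())
```

Keeping the three steps in separate functions makes it easy to swap the storage target (CSV, spreadsheet, database) without touching the parsing code.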

Web scraping can be implemented in any language, such as PHP, Java, .NET, or Python: any language that lets you make a web request and read the web page content (HTML text) into a variable. In this article I will show you how to use the Simple HTML DOM PHP library to do web scraping using PHP.
PHP Simple HTML DOM Parser

Simple HTML DOM is a PHP library for parsing data from web pages. In short, you can use this library to do web scraping using PHP and even store the data in a MySQL database. Simple HTML DOM has the following features:

    The parser library is written in PHP 5+
    It requires PHP 5+ to run
    The parser supports parsing invalid HTML
    It lets you select HTML tags with jQuery-style selectors
    Supports Xpath and CSS path based web extraction
    Supports both object-oriented and procedural styles of writing code

Scrape All Links

<?php
include "simple_html_dom.php";

//create object
$html=new simple_html_dom();

//load specific URL
$html->load_file("http://www.google.com");

// This will Find all links
foreach($html->find('a') as $element)
   echo $element->href . '<br>';

?>

Scrape images

<?php
include "simple_html_dom.php";

//create object
$html=new simple_html_dom();

//load specific url
$html->load_file("http://www.google.com");

// This will find all images
foreach($html->find('img') as $element)
   echo $element->src . '<br>';

?>

This is just a small taste of how you can do web scraping using PHP. Keep in mind that Xpath can make your job simple and fast. You can find all the available methods on the SimpleHTMLDom documentation page.

Source: http://webdata-scraping.com/web-scraping-using-php-simple-html-dom-parser-library/

Tuesday, 20 September 2016

Web Scraping – A trending technique in data science!!!


Web scraping is emerging as an important technique in data science and is becoming an integral part of many businesses; sometimes whole companies are formed around web scraping. Web scraping and the extraction of relevant data give businesses insight into market trends, competition, potential customers, business performance, etc. So what actually is web scraping, and where is it used? Let us explore web scraping, web data extraction, web mining/data mining, and screen scraping in detail.

What is Web Scraping?

Web data scraping is a technique for extracting unstructured data from websites and transforming it into structured data that can be stored and analyzed in a database. Web scraping is also known as web data extraction, web data scraping, web harvesting, or screen scraping.

Whatever you can see on the web can be extracted. Extracting targeted information from websites helps you make effective decisions in your business.

Web scraping is a form of data mining. The overall goal of the web scraping process is to extract information from websites and transform it into an understandable structure such as a spreadsheet, database, or CSV file. Data like item pricing, stock pricing, reports, market pricing, product details, and business leads can be gathered through web scraping.

There are countless uses and potential scenarios, either business oriented or non-profit. Public institutions, companies and organizations, entrepreneurs, professionals etc. generate an enormous amount of information/data every day.

Uses of Web Scraping:

The following are some of the uses of web scraping:

  •     Collecting data from real estate listings
  •     Collecting data from retailer sites on a daily basis
  •     Extracting offers and discounts from a website
  •     Scraping job postings
  •     Monitoring prices against competitors
  •     Gathering leads from online business directories – directory scraping
  •     Keyword research
  •     Gathering targeted emails for email marketing – email scraping
  •     And many more

There are various techniques used for data gathering as listed below:

  •     Human copy-and-paste – takes a lot of time to finish when the data is huge
  •     Programming a custom web scraper as per your needs
  •     Using web scraping software available on the market

Are you in search of a web data scraping expert or specialist? Then you are in the right place. We are a team of web scraping experts who can extract data from websites and structure the useful unstructured data to uncover patterns. This helps businesses make decisions that increase sales and reach a wider customer base, ultimately leading the business towards growth and success.

We have expertise in all the web scraping techniques used in providing web scraping services, including scraping data from complex Ajax-enabled websites, bypassing CAPTCHAs, and forming anonymous HTTP requests.

Source: http://webdata-scraping.com/web-scraping-trending-technique-in-data-science/

Wednesday, 7 September 2016

How Web Scraping for Brand Monitoring is used in Retail Sector


Structured or unstructured, business data always plays an instrumental part in driving growth, development, and innovation for your dream venture. Irrespective of industrial sector or vertical, big data is of paramount significance for every business or enterprise.

The unsurpassed popularity and increasing importance of big data gave birth to the concept of web scraping, thus enhancing growth opportunities for startups. Large or small, every business establishment will now achieve successful website monitoring and tracking.
How does web scraping serve your branding needs?

Web scraping helps in extracting unorganized data and ordering it into organized and manageable formats. So if your brand is being talked about in multiple places (on social media, on expert forums, in comments, etc.), you can set the scraping tool to fetch only data that contains references to the brand. As a result, marketers and business owners can gauge brand sentiment and tweak their launch marketing campaign to enhance visibility.

Look around and you will discover numerous web scraping solutions ranging from manual to fully automated systems. From Reputation Tracking to Website monitoring, your web scraper can help create amazing insights from seemingly random bits of data (both in structured as well as unstructured format).
Using web scraping

The concept of web scraping revolutionizes the use of big data for business. With its availability across sectors, retailers are on cloud nine. Here’s how the retail market is utilizing the power of Web Scraping for brand monitoring.

Determining pricing strategy

The retail market is filled with competition. Whether it is products or pricing strategies, every retailer competes hard to stay ahead of the growth curve. Web scraping techniques will help you crawl price comparison sites’ pricing data, product descriptions, as well as images to receive data for comparison, affiliation, or analytics.

As a result, retailers will have the opportunity to sell their products at competitive prices, potentially increasing profit margins by as much as 10%.

Tracking online presence

Current trends in ecommerce herald the need for a strong online presence. Web scraping takes its cue from this, scraping reviews and profiles on websites. By providing you a crystal clear picture of product performance, customer behavior, and interactions, web scraping will help you achieve online brand intelligence and monitoring.
Detection of fraudulent reviews

Present-day purchasers have this unique habit of referring to reviews, before finalizing their purchase decisions. Web scraping helps in the identification of opinion-spamming, thus figuring out fake reviews. It will further extend support in detecting, reviewing, streamlining, or blocking reviews, according to your business needs.
Online reputation management

Web data scraping helps in figuring out avenues to take your online reputation management (ORM) objectives forward. With the help of the scraped data, you learn about both the impactful and the vulnerable areas for online reputation management. You can have the web crawler break down opinions by demographic attributes such as age group, gender, sentiment, and geolocation.

Social media analytics

Since social media happens to be one of the most crucial factors for retailers, it is imperative to scrape social media websites and extract data from Twitter. Web scraping technology will help you watch your brand on social media along with fetching data for social media analytics. With monitoring services for social media channels such as Twitter, you will strengthen your firm's branding even more than before.
Advantages of Brand Monitoring (BM)

As a business, you might want to monitor your brand on social media to gain deep insights into your brand's popularity and current consumer behavior. Brand monitoring companies will watch your brand on social media and come up with crucial data for social media analytics. This process has immense benefits for your business, which are summarized here:

Locate Infringers

Leading brands often face the challenge thrown by infringers. When brand monitoring companies keep a close look at products available in the market, there is less probability of a copyright infringement. The biggest infringement happens in the packaging, naming and presentation of products. With constant monitoring and legal support provided by the Trademark Law, businesses could remain protected from unethical competitors and illicit business practices.

Manage Consumer Reaction and Competitor’s Challenges

A good business keeps a check on the current consumer sentiment in the targeted demographic and positively manages the same in the interest of their brand. The feedback from your consumers could be affirmative or negative but if you have a hold on the social media channels, web platforms and forums, you, as a brand will be able to propagate trust at all times.

When competitor brands indulge in backbiting or false publicity about your brand, you can easily tame their negative comments by throwing in a positive image in front of your target audience. So, brand monitoring and its active implementation do help in positive image building and management for businesses.
Why Web scraping for BM?

Web scraping for brand monitoring gives you a second pair of eyes to look at your brand as a general consumer. Considering the flowing consumer sentiment in the market during a specific business season, you could correct or simply innovate better ways to mold the target audience in your brand’s favor. Through a systematic approach towards online brand intelligence and monitoring, future business strategies and possible brand responses could be designed, keeping your business actively prepared for both types of scenarios.

For effective web scraping, businesses extract data from Twitter that helps them understand what is trending in their business domain. They also get a more realistic view of brand perception, user interaction, and brand visibility among their clientele. Web scraping professionals or companies scrape social media websites to gather relevant data related to your brand or your competitors' brands that has the potential to affect your growth as a business. This data is then managed and organized to extract significant, reference-building facts. Brand monitoring professionals design the future strategy for your brand with the facts accumulated through web scraping in mind. The data obtained through web scraping helps in:

Knowing the actual brand potential,
Expanding brand coverage,
Devising brand penetration,
Analyzing scope and possibilities for a brand and
Designing thoughtful and insightful brand strategies.

In simple words, web scraping provides a business enough base of information that could be used to devise future plans and to make suggestive changes in the current business strategy.

Advantages of Web scraping for BM

Web scraping has made things seamless for businesses involved in managing their brands and active brand monitoring. There is no doubt, that web scraping for brand monitoring comes with immense benefits, some of these are –

Improved customer insight

When you have first-hand, factual knowledge about your consumer base through social media channels, you are in a strong position to portray a positive image as a brand. With more realistic data in your hands, you can develop strategies more effectively and set realistic goals for your brand's improvement. Social media insights also allow marketers to create highly targeted and custom marketing messages, leading to a better likelihood of sales conversion.

Monitoring your Competition

Web scraping helps you realize where your brand stands in the market among the competition. The actual penetration of your brand in the targeted segment helps in getting a clear picture of your present business scenario. Through careful removal of competition in your concerned business category, you could strengthen your brand image.

Staying Informed

When your brand monitoring team is keeping track of all social media channels, it becomes easier for you to stay informed about latest comments about your business on sites like Facebook, Twitter and social forums etc. You could have deep knowledge about the consumer behavior related to your brand and your competitors on these web destinations.

Improved Consumer Satisfaction and Sales

Reputation tracking done through web scraping helps in generating planned response at times of crisis. It also mends the communication gap between consumer and the brand, hence improving the consumer satisfaction. This automatically translates into trust building and brand loyalty improving your brand’s sales.

To sign off

By granting opportunities to monitor your social media data, web scraping is undoubtedly helping retail businesses take a significant step towards perfect branding. If you are one of the key players in this sector, there’s reason for celebration ahead!

Source: https://www.promptcloud.com/blog/How-Web-Scraping-for-Brand-Monitoring-is-used-in-Retail-Sector

Monday, 29 August 2016

How to use Social Media Scraping to be your Competitors’ Nightmare


Big data and competitive intelligence have been in the limelight for quite some time now. The almost magical power of big data to help a company make just the right decisions has been talked about a lot. When it comes to big data, the kind of benefits a business can get depends entirely on the sources it acquires the data from. Social media is one of the best sources of data that helps your business in a multitude of ways. Now that every business is deeply rooted in the internet, social media data becomes all the more relevant and crucial. Here is how you can use data scraped from social media sites to get an edge over the competition.

Keeping watch on your competitors

Social media is the best place to watch your competitors' activity and take counter-initiatives to keep up with or overtake them. If you want to know what your competitors are up to, a social media scraping setup that scrapes the posts mentioning your competitors' brand or product names can do the trick. You can also learn a thing or two from their activities on social media and take corresponding measures to stay ahead of them. For example, you could find out that your competitor is running a special promotional offer at the moment and come up with something better to keep up. This can do wonders if you are in a highly competitive industry like ecommerce. If you are not using web scraping technology to keep a close watch on your competitors, you could easily get left behind in this fast-paced business scene.

Solving customer issues at the earliest

Customers are vocal about their experience with different products and services on social media sites these days. If you have a customer whose issue was left unsolved, there is a good chance that he/she will take it to the social media to vent the frustration. Watching out for such instances and giving them prompt support should be something you should do if you want to retain these customers and stop them from ruining your brand’s image. By scraping social media sites for posts that mention your product/service, you can easily find out if there are such grievances from customers. This can make sure to an extent that you don’t let unhappy customers stay that way, which eventually hurts your business in the long run. Customers can make or break your company, so using social media scraping to serve the customers better can help you succeed eventually.

Sentiment analysis

Social media data can do a good job of helping you understand user sentiment. With the help of social media scraping, a business can get the big picture of how users perceive its brand. This can go a long way, since this level of feedback can help you quickly fix unnoticed issues with your company and service. By rectifying them, you can make your brand more appealing to customers. Sentiment analysis gives you the opportunity to shape your business into what customers want it to be. Social media scraping is one of the most direct ways to get access to the user sentiment data that can help you optimize your business for your customers.
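
As a rough illustration of how scraped posts can be turned into a sentiment signal, here is a minimal word-counting sketch. The word lists, the sample posts, and the sentiment_score function are all made up for this example; real sentiment analysis relies on much larger lexicons or trained models.

```python
import re

# Tiny hypothetical word lists; real sentiment lexicons are far larger.
POSITIVE = {"love", "great", "fast", "helpful"}
NEGATIVE = {"broken", "slow", "refund", "terrible"}


def sentiment_score(post):
    """Return (number of positive words) - (number of negative words) for one post."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Hypothetical scraped posts mentioning a brand.
posts = [
    "Love the new app, support was fast and helpful",
    "Arrived broken and the refund is slow",
    "Ordered it last week",
]
scores = [sentiment_score(p) for p in posts]
print(scores)
```

A positive score suggests a favorable post, a negative score a complaint worth following up on, and zero a neutral mention.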

Web crawling for social media data

When social media data possesses so much value to businesses, it makes sense to look for efficient ways to gather and use this data. Manually scrolling through millions of tweets doesn't make sense, which is why you should use social media scraping to aggregate the relevant data for your business. Besides, web scraping technologies make it possible to handle huge amounts of data with ease. Since business requirements involve very large volumes of data, web scraping is the most scalable solution worth considering. To make things even simpler, there are reliable web scraping solutions that offer social media scraping services for brand monitoring.

Bottom line

Since social media has become an integral part of online businesses, the data available on these sites possesses immense value to companies in every industry. Social media scraping can be used for brand monitoring and for gaining competitive intelligence that helps you optimize your business model for maximum effectiveness. This will in turn make your company stand out from the competition, and the added insights gained from social media data will help you overtake your competitors.

Source: https://www.promptcloud.com/blog/social-media-scraping-for-competitive-intelligence

Saturday, 20 August 2016

How Web Data Extraction Services Will Save Your Time and Money by Automatic Data Collection


Data scraping is the process of extracting data from the web by using a software program, from proven websites only. Anyone can use the extracted data for any purpose they wish in various industries, as the web holds much of the world's important data. We provide the best web data extraction software. We have expertise and one-of-a-kind knowledge in web data extraction, image scraping, screen scraping, email extraction services, data mining, and web grabbing.

Who can use Data Scraping Services?

Data scraping and extraction services can be used by any organization, company, or firm that wants data from a particular industry, data about targeted customers or a particular company, or anything else available on the web, such as email IDs, website names, or search terms. Most often, a marketing company uses data scraping and data extraction services to market a particular product in a certain industry and to reach targeted customers. For example, if company X wants to contact restaurants in a California city, our software can extract the data for restaurants in that city, and a marketing company can use this data to market its restaurant-related product. MLM and network marketing companies also use data extraction and data scraping services to find new customers: they extract data on prospective customers, contact them by telephone, postcard, or email marketing, and in this way build a large network and a large group for their own product and company.

We have helped many companies find particular data as per their needs, for example through the following:

Web Data Extraction

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. We help you create this kind of API, which lets you scrape data as per your needs. We provide quality and affordable web data extraction applications.

Data Collection

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all. That is why the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end-user.

Email Extractor

An email extractor is a tool that automatically extracts email IDs from any reliable source. It basically serves the function of collecting business contacts from various web pages, HTML files, text files, or any other format, without duplicate email IDs.

Screen scraping

Screen scraping refers to the practice of reading text information from a computer display terminal's screen and collecting visual data from a source, instead of parsing data as in web scraping.

Data Mining Services

Data mining is the process of extracting patterns from information. Data mining is becoming an increasingly important tool for transforming data into information. Results can be delivered in any format, including MS Excel, CSV, HTML, and many others, according to your requirements.
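
A minimal sketch of such an extractor is shown below. The regular expression is deliberately simple and is an assumption of this example (it will miss some valid addresses and accept some invalid ones), and the sample text and the extract_emails function are made up for illustration.

```python
import re

# Deliberately simple pattern; not a full validator of email syntax.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique addresses in order of first appearance.

    Addresses are compared case-insensitively so that the same contact
    written with different capitalization is not reported twice.
    """
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


sample = "Contact sales@example.com or Sales@Example.com; support: help@example.org."
print(extract_emails(sample))
```

The same function can be applied to the text of web pages, HTML files, or plain text files once their contents have been read into a string.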

Web spider

A Web spider is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.
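
The "methodical, automated manner" can be sketched as a breadth-first traversal with a queue and a set of already-seen pages. To keep the example runnable offline, the site is a made-up in-memory dictionary of links; in a real spider, the get_links function (hypothetical here) would download each page and extract its links.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/1", "/products/2", "/"],
    "/about": ["/"],
    "/products/1": [],
    "/products/2": ["/products/1"],
}


def crawl(start, get_links):
    """Visit every reachable page once, breadth-first, and return the visit order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:
                # Mark the page as seen when queuing it so it is never queued twice.
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited


order = crawl("/", lambda page: SITE.get(page, []))
print(order)
```

The set of seen pages is what keeps the spider from looping forever when pages link back to each other.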

Web Grabber

Web grabber is just another name for data scraping or data extraction.

Web Bot

Web Bot is a software program that is claimed to be able to predict future events by tracking keywords entered on the Internet. Web bot software is also used to pull out articles, blogs, relevant website content, and other website-related data. We have worked with many clients on data extraction, data scraping, and data mining, and they are really happy with our services. We provide quality services and make your data work easy and automatic.

Source: http://ezinearticles.com/?How-Web-Data-Extraction-Services-Will-Save-Your-Time-and-Money-by-Automatic-Data-Collection&id=5159023

Tuesday, 9 August 2016

Difference between Data Mining and KDD


Data, in its raw form, is just a collection of things from which little information can be derived. With the development of information discovery methods (data mining and KDD), the value of the data is significantly improved.

Data mining is one of the steps of Knowledge Discovery in Databases (KDD). KDD is a multi-step process that supports the conversion of data into useful information. Data mining is the pattern extraction phase of KDD. Data mining can take several forms, the choice being influenced by the desired outcomes.

Knowledge Discovery in Databases Steps
Data Selection

KDD cannot be carried out without human interaction. Choosing the data set and the subset of it to work on requires knowledge of the domain from which the data is taken. Removing unrelated information elements from the dataset reduces the search space during the data mining phase of KDD. The sample size and structure are also established at this point, if the dataset is to be assessed using a sample of the data.
Pre-processing

Databases often contain incorrect or missing data. During the pre-processing phase, the information is cleaned. This involves removing "outliers", if appropriate; choosing approaches for handling missing data fields; accounting for time sequence information; and applying suitable normalization to the data.
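
The cleaning steps named above can be sketched on a small list of values. Everything here is an assumption made for illustration: the sample readings, the choice of the median to fill missing fields, and the arbitrary "more than three times the median" outlier rule.

```python
from statistics import median

# Hypothetical raw readings; None marks a missing value, 980.0 is a suspicious entry.
raw = [12.0, 15.0, None, 14.0, 980.0, 13.0]

# 1. Handle missing fields: here we fill them with the median of the known values.
known = [x for x in raw if x is not None]
fill = median(known)
filled = [fill if x is None else x for x in raw]

# 2. Remove outliers: drop values more than 3x the median (an arbitrary rule for this sketch).
cleaned = [x for x in filled if x <= 3 * fill]

# 3. Normalize to the range [0, 1] (min-max scaling).
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]

print(cleaned)
print(normalized)
```

Which fill strategy and outlier rule are appropriate depends on the domain, which is one reason this phase needs human judgment.
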
Transformation

In the transformation phase, attempts are made to reduce the number of data elements while preserving the quality of the information. During this stage, information is organized, converted from one type to another (e.g. from nominal to numeric), and new or "derived" attributes are defined.
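
As a small illustration of this phase, the sketch below converts a nominal attribute to numeric codes and defines a derived attribute. The records, field names, and coding scheme are made up for the example.

```python
# Hypothetical records, made up for this illustration.
records = [
    {"size": "small", "price": 10.0, "quantity": 3},
    {"size": "large", "price": 25.0, "quantity": 1},
    {"size": "medium", "price": 15.0, "quantity": 2},
    {"size": "small", "price": 9.0, "quantity": 5},
]

# Nominal -> numeric: assign each distinct category an integer code.
categories = sorted({r["size"] for r in records})
codes = {name: index for index, name in enumerate(categories)}

transformed = [
    {
        "size_code": codes[r["size"]],
        # Derived attribute: computed from existing fields, replacing two columns with one.
        "revenue": r["price"] * r["quantity"],
    }
    for r in records
]

print(codes)
print(transformed)
```

Integer codes imply an ordering between categories; when that is not wanted, a separate 0/1 column per category is a common alternative.
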
Data mining

Now the data is subjected to one or several data mining methods such as regression, classification, or clustering. The data mining part of KDD usually requires repeated, iterative application of particular data mining methods. Different data mining techniques or models can be used depending on the expected outcome.
Evaluation

The final step is the documentation and interpretation of the outcomes from the previous steps. Steps during this period might consist of returning to a previous step in the KDD process to help refine the acquired knowledge, or converting the knowledge into a form that is clear to the user. In this stage the extracted data patterns are visualized for further review.
Conclusion

Data mining is a very crucial step of the KDD process.


Source: http://nocodewebscraping.com/difference-data-mining-kdd/

Thursday, 4 August 2016

Three Common Methods For Web Data Extraction


Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.
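
As a rough sketch of the regular expression approach, the following pulls URLs and link titles out of a snippet of HTML. It is written in Python rather than Perl purely for illustration, and the sample markup and the pattern itself are made up for this example.

```python
import re

# Hypothetical snippet of a page, made up for this illustration.
html = """
<a href="http://example.com/news/1">First headline</a>
<a class="more" href='http://example.com/news/2'>Second
headline</a>
"""

# Tolerates extra attributes, either quote style, and line breaks inside the link text.
LINK_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

# Collapse runs of whitespace in each title so multi-line link text reads as one line.
links = [(url, " ".join(title.split())) for url, title in LINK_RE.findall(html)]
print(links)
```

The pattern copes with small variations in the markup, but a larger change to the page, such as nested tags inside the link text, would still require updating it.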

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.
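
As a minimal sketch of the regular-expression approach, the Python snippet below pulls URLs and link titles out of a small HTML fragment. The sample markup, the pattern, and the function name `extract_links` are illustrative assumptions; they are not taken from any particular site or from the screen-scraper product.

```python
import re

# Hypothetical sample markup; a real page would be fetched over HTTP first.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a class="top" href="http://example.com/news/2">Second   headline</a></li>
</ul>
"""

# Match an anchor tag and capture the href value and the link text.
# [^>]*? tolerates extra attributes before href, which gives the pattern
# some of the "fuzziness" mentioned in the advantages list.
LINK_PATTERN = re.compile(
    r'<a\b[^>]*?\bhref="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, title) pairs found in the HTML string."""
    links = []
    for url, title in LINK_PATTERN.findall(html):
        # Collapse runs of whitespace inside the link text.
        links.append((url, " ".join(title.split())))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, "->", title)
```

The pattern also illustrates the maintenance cost listed under the disadvantages: if the site switched to single-quoted `href` attributes, this expression would silently stop matching and would need to be updated.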

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.
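
To give a rough feel for the vocabulary problem mentioned above, here is a heavily simplified Python sketch that maps several spellings of "number of bedrooms" to one field. It is only a stand-in for a real ontology-based engine, not the code used in the project; the list of variants and the function name `extract_bedrooms` are assumptions made up for illustration.

```python
import re

# Hypothetical variants of "number of bedrooms"; a real vocabulary would be
# much larger and built from actual classified ads.
BEDROOM_TERMS = ["bedrooms", "bedroom", "bdrms", "bdrm", "beds", "bed", "br", "bd"]

# A number followed by one of the variants, e.g. "3 bdrm", "2BR", "4-bedroom".
BEDROOM_PATTERN = re.compile(
    r"(\d+)\s*-?\s*(?:" + "|".join(BEDROOM_TERMS) + r")\b",
    re.IGNORECASE,
)


def extract_bedrooms(ad_text):
    """Return the bedroom count as an int, or None if no variant matches."""
    match = BEDROOM_PATTERN.search(ad_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for ad in ["Sunny 3 bdrm apt near park", "2BR/1BA condo", "Studio, utilities included"]:
        print(ad, "->", extract_bedrooms(ad))
```

Whatever the ad's wording, the result lands in the same field, which is what lets the extracted pieces be inserted into a database in a uniform way.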

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Monday, 1 August 2016

Best Alternative For Linkedin Data Scraping

When I started my career in sales, one of the things that my VP of sales told me was, “In sales, assumptions are the mother of all f**k ups.” I know the F word sounds a bit inappropriate, but that is the exact word he used. He was trying to convey the simple point that every prospect is different, so don’t guess; use data to come up with decisions.

I joined Datahut and we are working on a product that helps sales people. I thought I should discuss it with you guys and take your feedback.

Let me tell you how the idea evolved. At Datahut, we get to hear a lot of problems customers want to solve. Almost 30 percent of all the inbound leads ask us to help them with lead generation.

Most of them simply ask, “Can you scrape LinkedIn for me?”

Every time, we politely refused.

But not anymore, we figured out a way to solve their problem without scraping Linkedin.

This should raise some questions in your mind.

1) What problem is he trying to solve? – Most of the time, their sales team does not have accurate data about the prospects. This leads to total chaos. It ends up wasting both time and money on selling to leads that are not sales qualified.

2) Why do they need data specifically from Linkedin? – LinkedIn is the world’s largest business network. In his view, there is no better place to find leads for his business than Linkedin. It is right in a way.

3) Ok, then what is wrong in scraping Linkedin? – Scraping Linkedin is against its terms and it can lead to legal issues. Linkedin has an excellent anti-scraping mechanism which can make the scraping costly.

4) How severe is the problem? – The problem has a direct impact on the revenues as the productivity of the sales team is too low. Without enough sales, the company is a joke.

5) Is there a better way? – Of course, yes. The people with profiles on LinkedIn are on other sites too, e.g., Google Plus, CrunchBase, etc. If we can mine and correlate the data, we can generate leads with rich information. It will have better quality than scraping LinkedIn.

6) What to do when the machine intelligence fails? – We have to use human intelligence. Period!

Datahut is working on a platform that can help you get leads that match your ideal buyer persona. It will be a complete business intelligence platform powered by machine and human intelligence for efficient lead research and discovery. We named it Leadintel. We’ve also established some partnerships that help to enrich the data and save the trouble of lawsuits.

We are opening our platform for beta users. You can request an invitation using the contact form. What do you think about this? What are your suggestions?

Thanks for reading this blog post. Datahut offers affordable data extraction services (DaaS). If you need help with your web scraping projects, let us know and we will be glad to help.

Source: http://blog.datahut.co/best-alternative-for-linkedin-data-scraping/

Monday, 11 July 2016

Web Scraping Best Practices

Extracting data from the World Wide Web has several challenges, as more webmasters are working day and night to reduce scraping and crawling of their data in order to survive in a competitive world. There are various other problems you may face when web scraping, and most of them can be avoided by adopting and implementing the web scraping best practices discussed in this article.

Have knowledge of the scraping tools

By acquiring adequate knowledge of the hurdles that may be encountered during web scraping, you will be able to have a smooth web scraping experience and stay on the safe side of the law. Conduct thorough research on the types of tools you will use for scraping and crawling. Firsthand knowledge of these tools will help you find the data you need without being blocked.

Proper proxy software that acts as the middle party works well when you know how to work with HTTP and HTML. Use tools that can change crawling patterns, URLs and data retrieved even when you are crawling on one domain. This will help you abide by the rules and regulations that come with web scraping activities and avoid legal issues.

Conduct your scraping activities during off-peak hours

You may opt to extract data at times when fewer people are accessing the site, for instance over weekends, late at night, or on public holidays. Visiting a website several times to retrieve the same data is a waste of bandwidth. It is advisable to download the entire site content to your computer once, so that you can access it locally whenever the need arises.

Hide your scraping activities

There is a thin line between ethical and unethical crawling, hence you should avoid appearing on the top user list of a particular website. Cover your tracks as best you can by making use of proxy IPs to avoid any legal problems. You may also use multiple IP addresses or VPN services to conceal your scraping activities and lower the chances of landing on a website’s blacklist.

Website owners today are very protective of their data and any other information existing under their unique URL. Read carefully through the terms and conditions indicated by websites, as they may consider crawling an infringement of their privacy. Simple etiquette goes a long way. Your web scraping efforts will be fruitful if the site owner supports the idea of sharing data.

Keep record of your activities

Web scraping involves large amounts of data. Because of this, you may not always remember each and every piece of information you have acquired; gathering statistics will help you monitor your activities.

Load data in phases

Web scraping demands a lot of patience when using crawlers to get the needed information. Take the process slowly by loading data one piece at a time. Several parallel requests to the same domain can crash the entire site or allow the scraping attempts to be traced back to your local machine.

Loading data in small bits will save you the hassle of scraping afresh if your activity is interrupted, because you will have already stored part of the data required. You can reduce the load on an individual domain through various techniques, such as caching pages that you have already scraped to avoid redundant requests. Use auto-throttling mechanisms to limit the amount of traffic you send to the website, and pause between requests to avoid getting banned.
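
The caching and pausing described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a production crawler: the class name `PoliteFetcher`, the helper `http_get`, and the two-second default delay are all hypothetical choices, and the cache lives only in memory.

```python
import time
import urllib.request


def http_get(url):
    """Download a URL and return its body as text (UTF-8 is assumed)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class PoliteFetcher:
    """Fetch pages one at a time, pausing between requests and caching results."""

    def __init__(self, fetch=http_get, delay=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch          # function that actually downloads a URL
        self.delay = delay          # minimum seconds between two requests
        self.sleep = sleep
        self.clock = clock
        self.cache = {}             # url -> body of pages already scraped
        self.last_request = None    # clock value of the previous request

    def get(self, url):
        # Serve repeated requests from the cache instead of hitting the site.
        if url in self.cache:
            return self.cache[url]
        # Wait until at least `delay` seconds have passed since the last request.
        if self.last_request is not None:
            wait = self.delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        body = self.fetch(url)
        self.last_request = self.clock()
        self.cache[url] = body
        return body
```

Calling `PoliteFetcher().get(url)` in a plain loop keeps requests sequential rather than parallel. To survive an interrupted run, the cache would need to be written to disk; that step is left out here to keep the sketch short.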

Conclusion

Through these few web scraping best practices, you will be able to work around websites and gather the data required as per clients’ requests without major hurdles along the way. The ultimate goal of every web scraper is to be able to access vital information and at the same time remain on the good side of the law.

Source URL : http://nocodewebscraping.com/web-scraping-best-practices/

Content Scrapers – How to Find Out Who is Stealing Your Content & What to Do About It

If you have been blogging for a while, chances are you are familiar with content scrapers. Content scrapers are websites that steal your content for their own blogs without your permission. Some content scrapers will just copy the content off of your blog, but most use automated software that takes the content from your RSS feed and posts your content to their site like it is a new post.

In this post, we are going to look at some potential link building benefits to content scrapers, how to find out what sites are scraping your content, and what you can do if you want to either benefit from the linking standpoint or have them take it down.

Linking Benefits of Content Scrapers

Last week, I was happy to see that I was listed in ProBlogger’s 20 Bloggers to Watch in 2012. Within 24 hours, I received a notification in my WordPress dashboard that a page on my blog had been linked to in the post on ProBlogger’s site.

After receiving the original notification from the ProBlogger post, I also received another 18 trackbacks from sites that had stolen the content in their post verbatim. Trackbacks are WordPress’ way of letting you know that another website has linked to a post on your blog. In this case, these 18 sites had posted the content exactly like the original post – with the links back to my blog still intact.

It was then that I started contemplating the potential link building benefits of content scrapers. These are not by any means quality links – the highest Google PageRank was a PR 2 domain, many were stealing content in a variety of languages, and one even had the nerve to use some kind of redirection script to take away the link juice of outgoing links! So while these links didn’t have the same authority that the original post had, they still count as links.

How to Catch Content Scrapers

Unfortunately, unless you want to continuously search for your post titles in Google, you’ll only be able to easily track down sites that keep your in-content links active. If you want to know what websites are scraping your content, here are a few tips to sniff them out.

Copyscape

Copyscape is a simple search engine that allows you to enter the URL of your content to find out if there are duplicates of it on the Internet. You can get a few results using their free search, or you can pay for a premium account to check up to 10,000 pages on your site and more.

Trackbacks

One way is through your trackbacks in WordPress. Many of these will show up in the spam folder if you use Akismet. The key to getting trackbacks to appear from content scrapers is to always include links to other posts in your content. Be sure those links have great anchor text too, if you’re going for a little extra link juice. And even if you are not, internal linking with strong anchor text is good for your on-site optimization too!

Anyone thinking about link building benefits at this point is probably noting the sheer volume of links from these sites, some of which are content scrapers. Essentially any site that is linking to a lot of your posts that isn’t a social network, social bookmarking site, or a die-hard fan who just loves linking to you is potentially a content scraper. You’ll have to go to their website to be sure. To find your links on their site, click on one of the domains to see the details of what pages on your site they are linking to specifically.

In my case, they are just blatantly copying my post titles. When I visited one of the links, sure enough, they are copying my entire posts in their full glory onto their site.

Google Alerts

If you don’t post often or want to keep up with any mentions of your top blog posts on other websites, you can create a Google Alert using the exact match for your post’s title by putting the title in quotation marks.

I deliver all of my Google Alerts to an RSS feed so I can manage them in Google Reader, but you can also have them delivered regularly by email. You’ll even get an instant preview of the types of results you will get.

How to Get Credit for Scraped Posts

If you use WordPress, then you definitely want to try out the RSS footer plugin. This plugin allows you to place a custom piece of text at the top or bottom of your RSS feed content.

Even if you aren’t using it for the purpose of getting credit back to your posts when content thieves steal them, you can still use it for a little extra bit of advertising, with the possible benefit of people who subscribe to your RSS feed clicking through to your website or social profiles. And when someone does scrape your content from your RSS feed, it shows up there too.

So in the event that someone finds your scraped content, they will hopefully notice the credit before assuming it was created by the blog that stole it. If you don’t have WordPress, you can simply include a note at the top or bottom of your content that includes the same information.
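
For readers who are not on WordPress, the idea of adding a credit line to every feed item can be sketched in a few lines of Python. This is not how the RSS footer plugin itself is implemented; the function name `add_credit_footer`, the sample feed, and the wording of the footer are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical one-item RSS 2.0 feed used only for demonstration.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item>
  <title>My Post</title>
  <link>http://example.com/my-post</link>
  <description>Post body.</description>
</item>
</channel></rss>"""


def add_credit_footer(feed_xml, blog_name):
    """Append a credit line with a link back to each item's description."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        # The footer is HTML; ElementTree escapes it when serializing,
        # which is the usual way HTML is carried inside an RSS description.
        footer = '<p>{} originally appeared on <a href="{}">{}</a>.</p>'.format(
            escape(title), escape(link, quote=True), escape(blog_name)
        )
        description.text = (description.text or "") + footer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_credit_footer(SAMPLE_FEED, "Example Blog"))
```

Because the footer travels inside the feed content, a scraper that republishes the feed verbatim also republishes the link back to the original post.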

How to Stop Content Scrapers

If you’re not interested in anyone copying your content, then you have a few options to choose from. You can start by contacting the site that is stealing your content and sending them a notice that you want all of your content removed immediately. You can do this through the site’s contact form, email address, or post it to any social accounts they list.

If there is no contact information on the website stealing your content, you can do a Whois Lookup to (hopefully) find out who owns the domain.

If it is not privately registered, you should find an administrative contact’s email address. If not, you should at least see the domain registrar which, in this case, is GoDaddy and/or the hosting company for the website which, in this case, is HostGator. You can try to contact both companies (HostGator has a DMCA form and GoDaddy has an email) and let them know that the domain in question is stealing copyrighted content in hopes that the website will be suspended or removed.

You can also visit the DMCA and use their takedown services to remove anyone who is copying your photos, video, audio, blog, or other content. They even offer a WordPress plugin to incorporate a DMCA protected badge on your site to warn potential thieves.

Have you ever dealt with content scrapers and thieves? Do you leave it alone for the link benefits, or do you fight back? What other tools, services, or other preventative tactics do you use to block content scrapers? Please share your thoughts and experiences in the comments!

Source URL : https://blog.kissmetrics.com/content-scrapers/

Sunday, 10 July 2016

Data Scraping – Will Definitely Benefit a Business Startup

With more and more data being shared over the internet, both the amount of data collected and the number of use cases are growing at an unbelievable pace. We’ve entered the “Big Data” age, and data scraping is among the resources that supply big data engines with the latest data for analytics, competitor monitoring, or just to steal the data.

From a technology viewpoint, competent data scraping is fairly complicated. There are many open-source projects that allow anybody to run a web data scraper on their own. Nevertheless, it’s an entirely different story when scraping needs to be at the core of your business and you must not only maintain your scrapers but also scale them and extract the data intelligently as you need it.

That is the reason why different services are selling “data scraping” as a service. Their job is to take care of all the technical aspects so that you can have the data you require without any technical knowledge. Fundamentally, all these startups focus on collecting the data and then extracting its value to sell it to customers.

Let’s take some examples:

• Sales Intelligence – The scrapers monitor competitors, marketplaces, online directories, and data from the public markets to discover leads. For instance, some tools track websites that add or drop competitors’ JavaScript tags, so you can treat those sites as qualified leads.
• Price Intelligence – A very common use is price monitoring. Whether in e-commerce, travel, or the property industry, monitoring competitors’ prices and adjusting yours accordingly is generally key. These services monitor prices and, using analytical algorithms, may advise you on where the puck is heading.
• Marketing – Data scraping may also be used for monitoring how competitors are doing. From the reviews they have on marketplaces to press coverage and published financial data, one can find out a lot. On the marketing side, there is even a growth hacking class that teaches how to use scraping for marketing objectives.

The same goes for financial intelligence, economic intelligence, and so on: as more and more financial, political, and economic data becomes accessible online, new types of services that collect and aggregate it are on the rise.

Let’s go through some points concerned with the market:

• It’s tough to evaluate how large the data scraping market is because it sits at the intersection of many big industries such as sales, IT security, finance and marketing intelligence. Scraping is certainly a small part of each of these industries, but it is expected to grow in the coming years.
• It’s a safe bet that more and more SaaS products will find innovative applications for web data scraping, and that more and more startups will use data scraping services from a security standpoint.
• As these startups are generally entering huge markets using niche products / approaches (web data scraping isn’t a solution to everything, it’s more like a feature), they are expected to be acquired by larger players (within the security, sales, or marketing tools industries). The technological barriers are also there.

Source URL : http://www.3idatascraping.com/data-scraping-will-definitely-benefit-a-business-startup.php

Thursday, 7 July 2016

Web Scraping Services : Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside HTML, PDF or other documents and collecting relevant information into databases and spreadsheets for later retrieval. On most websites, the text is easily accessible in the source code, but an increasing number of businesses are using Adobe PDF format (Portable Document Format: a format which can be viewed with the free Adobe Acrobat Reader software on almost any operating system). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from, making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is sometimes stored as an image, from which you often cannot easily copy and paste. PDF scraping is the process of data scraping information contained in PDF files. To scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly, a search on Google only turned up one business that will create a customized PDF scraping utility for your project. A handful of off-the-shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source URL :  http://yellowpagesdatascraping.blogspot.in/2015/06/web-scraping-services-making-modern.html