Scraping Yahoo Finance: 2014

Sunday, 28 December 2014

Scraping By

In his classic 1976 Chesapeake portrait, Beautiful Swimmers, William Warner described the scrape boat as "a workboat unlike any other I had ever seen on the Bay." Seeming half as wide as it was long, he said, it looked like a "a miniature battleship." There's a reason for that, of course. It's a classic case of form following function; the boat evolved for one purpose, to ply the Bay's grassy shallows for shedding blue crabs.

Said to "float on a heavy dew," scrape boats run from 26 to 30 feet long and 9 to 10 feet wide. The hull is a shallow-V deadrise that quickly flattens toward the stern, enabling the boat to pull its twin scrapes—rectangular steel frames, each with a trailing mesh bag—in knee-deep waters. The broad beam might sound ungainly, but the hull tapers toward the stern—betraying its sailboat origins. And it has a graceful sheer, flowing from a bow height of a few feet to little more than a foot above the water amidships.

And you want a low freeboard when you spend the whole day hoisting aboard scrapes, which weigh 50 pounds apiece, not including the load of sea grass and crabs that come in too. Low sides or not, there's a higher than average inci-dence of back problems among scrape boat crabbers. They spend long days bending in precisely the position back doctors say puts undue pressure on the lower back as they sort through rolls of grasses to pluck out the peelers and softies. And that alone may be why crab potting is now the far more common way of catching soft crabs.

Some people think that's good, assuming that dragging a scrape across the Bay's beleaguered grass flats must be destructive. But the smooth bar of the scrape, unlike a toothed dredge, doesn't uproot grasses. In fact, where scraping is traditional, the grass beds seem relatively resilient. I've often thought if Maryland and Virginia had stuck with scraping as the major legal way to soft-crab, overfishing might not have become a problem. Pots can be deployed everywhere and by the thousands, whereas scraping is limited to grass beds and to ground covered at three miles per hour; and even the sturdiest waterman can only pull two of them by hand. But peeler pots seem here to stay, and other soft crabbers have taken to using a single, large scrape operated from larger workboats by hydraulic power.

The bottom line is that these lovely, superbly functional expressions of Chesapeake crabbing culture now number only in the dozens, if you count working, wooden models. There are some fiberglass scrape boat hulls in service, and a Carolina skiff or two has been adapted for the task. They are functional, but have little art to them.

It is probably a sign of how fast scrape boats are going that the Smithsonian Institution recently took the lines off Darlene, a scraper worked by Morris Marsh of Smith Island, for its archives. You can see photos of scrape boats, and learn more about the 140-year old history of scraping, from Paula Johnson's fine book, The Workboats of Smith Island. Mr. Marsh, still going strong in his late 60s, is the scraper who took Warner out nearly 40 years ago when he was researching Beautiful Swimmers.

Indeed, scraping seems to win over those who master it. Marsh's father-in-law, Ed Harrison, scraped for almost 70 years, nearly wearing through the cross-planked bottom of his boat—from the inside—with decades of walking the planks, tending his scrapes. And an islander who scrapes with Marsh today, David Laird, says he is 71—one year younger than Scotty Boy, the scrape boat he took over from his dad in 1958. "I wouldn't even know how to crab in another boat," Laird says.

Soft crabs may well be caught—or farmed—a century from now on the Chesapeake; but no one will devise a way to take them so intimately and beautifully from the shallowest marsh edges and tiniest crevices in the shore as the scrapers do.

Source:http://www.articlesbase.com/culture-articles/scraping-by-1560919.html

Thursday, 25 December 2014

Choose Mining Wear Parts Wisely

It is important to choose a reputable supplier of mining wear parts; one that has been acknowledged as a leader in mining expertise. You will want to research and seek out a company that specializes in the engineering, manufacturing, procurement and design of mining wear parts and who has access to a multitude of patterns and templates to choose from.

It is vital to find a company that invites you to put them to the test; a company that is committed to selling more than just a product, standing behind the parts that they design and manufacture with an unprecedented industry guarantee. Some companies are so confident in their products that each wear part is stamped with their logo, identifying it as a superior product.

You will also want to find a company that takes pride in establishing strong customer relationships and who employs people who are as equally committed to providing outstanding service with customer satisfaction a priority. Your research will help you find a mining wear parts company that guarantees that if they do not have the part available, that they will find it for you or are capable of custom designing products to your exact specifications.

If you stop to consider the ramifications of an equipment malfunction or breakdown on production quotas, the significance of reliable parts becomes readily apparent. The impact can be far reaching if it halts production while the necessary repairs are completed. The ugly reality is that downtime incurs financial losses.

While the cost of aftermarket replacement mining wear parts is one factor, the installation of the part is equally as important. It is vital that aftermarket parts are built to a rugged standard to endure the rigorous industrial demands placed on them. Mining wear parts are routinely subjected to high stress abrasion and impact. The fabricated parts need to have the structural strength to be wear resistant with extended usage. Hardened manganese is the preferred material of choice to impart added strength and avoid premature breakage and replacement. Using inferior quality parts may result in the necessity of replacing them prematurely if they do not withstand the wear and tear that they are subjected to daily. While a few dollars may be saved initially by purchasing inferior mining wear parts, production costs can dramatically increase if frequent breakdowns occur and manpower hours are wasted in the field. Efficient use of manpower is an important budget consideration. Reliability is an absolute necessity w
hen you have production deadlines to meet and operations can quickly grind to a standstill when production is halted.

Quality assurance management monitors the consistency of the parts, demanding that they are machined within precise measurements. In addition, they focus on striving to improve the quality of parts as new technology becomes available. Using precision made, high quality wear parts can make your business more competitive, giving you an advantage and improving your bottom line.

Source:http://ezinearticles.com/?Choose-Mining-Wear-Parts-Wisely&id=6691631

Monday, 22 December 2014

Scraping table from any web page with R or CloudStat

Scraping table from any web page with R or CloudStat:

You need to use the data from internet, but don’t type, you can just extract or scrape them if you know the web URL.

Thanks to XML package from R. It provides amazing readHTMLtable() function.

For a study case,

I want to scrape data:

    US Airline Customer Score.
    World Top Chess Players (Men).

A. Scraping US Airline Customer Score table from

http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

airline = ‘http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines’

airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)

Result:

> library(XML)

Warning message:

package "XML" was built under R version 2.14.1

> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table

                     Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1          Southwest        78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2         All Others        NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3           Airlines        72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4        Continental        67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5           American        70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6             United        71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7         US Airways        72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8              Delta        77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines        69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61

11 PreviousYear%Change FirstYear%Change

1 81                 2.5              3.8
3 65                -1.5             -9.7
4 64                -9.9             -4.5
5 63                 0.0            -10.0
7 61                -1.6            -15.3
8 56                -9.7            -27.3
9 #                 N/A              N/A

>

B. Scraping World Top Chess players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = ‘http://ratings.fide.com/top.phtml?list=men’
chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)

Result:

> chess = "http://ratings.fide.com/top.phtml?list=men"
> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table

     Rank                       Name Title Country Rating Games B-Year

1      1           Carlsen, Magnus    g    NOR 2835   17 1990
2      2            Aronian, Levon    g    ARM 2805   25 1982
3      3         Kramnik, Vladimir    g    RUS 2801   17 1975
4      4        Anand, Viswanathan    g    IND 2799   17 1969
5      5         Radjabov, Teimour    g    AZE 2773    9 1987
6      6          Topalov, Veselin    g    BUL 2770    9 1975
7      7          Karjakin, Sergey    g    RUS 2769   16 1990
8      8         Ivanchuk, Vassily    g    UKR 2766   16 1969
9      9     Morozevich, Alexander    g    RUS 2763    6 1977
10    10           Gashimov, Vugar    g    AZE 2761    9 1986
11    11       Grischuk, Alexander    g    RUS 2761    8 1983
12    12          Nakamura, Hikaru    g    USA 2759   17 1987
13    13            Svidler, Peter    g    RUS 2749   17 1976
14    14    Mamedyarov, Shakhriyar    g    AZE 2747    9 1985
15    15       Tomashevsky, Evgeny    g    RUS 2740    0 1987
16    16            Gelfand, Boris    g    ISR 2739    9 1968
17    17          Caruana, Fabiano    g    ITA 2736   19 1992
18    18       Nepomniachtchi, Ian    g    RUS 2735   16 1990
19    19                 Wang, Hao    g    CHN 2733    6 1989
20    20              Kamsky, Gata    g    USA 2732    0 1974
21    21 Dominguez Perez, Leinier    g    CUB 2730    6 1983
22    22         Jakovenko, Dmitry    g    RUS 2729    0 1983
23    23        Ponomariov, Ruslan    g    UKR 2727   13 1983
24    24          Vitiugov, Nikita    g    RUS 2726    1 1987
25    25            Adams, Michael    g    ENG 2724   17 1971
26    26               Leko, Peter    g    HUN 2720    9 1979
27    27            Almasi, Zoltan    g    HUN 2717    8 1976
28    28               Giri, Anish    g    NED 2714   15 1994
29    29            Le, Quang Liem    g    VIE 2714    0 1991
30    30             Navara, David    g    CZE 2712    8 1985
31    31            Shirov, Alexei    g    LAT 2710   13 1972
32    32             Polgar, Judit    g    HUN 2710    0 1976
33    33     Riazantsev, Alexander    g    RUS 2710    0 1985
34    34       Wojtaszek, Radoslaw    g    POL 2706    8 1987
35    35      Moiseenko, Alexander    g    UKR 2706    7 1980
36    36   Vallejo Pons, Francisco    g    ESP 2705   15 1982
37    37        Malakhov, Vladimir    g    RUS 2705    0 1980
38    38            Jobava, Baadur    g    GEO 2704   23 1983
39    39           Bacrot, Etienne    g    FRA 2704   14 1983
40    40          Laznicka, Viktor    g    CZE 2704    8 1988
41    41            Sutovsky, Emil    g    ISR 2703    8 1977
42    42        Naiditsch, Arkadij    g    GER 2702   14 1985
43    43         Movsesian, Sergei    g    ARM 2700    9 1978
44    44       Sasikiran, Krishnan    g    IND 2700    9 1981
45    45   Vachier-Lagrave, Maxime    g    FRA 2699   13 1990
46    46            Dreev, Aleksey    g    RUS 2698    6 1969
47    47           Efimenko, Zahar    g    UKR 2695    8 1985
48    48         Volokitin, Andrei    g    UKR 2695    0 1986
49    49                 Wang, Yue    g    CHN 2694    6 1987
50    50        Fressinet, Laurent    g    FRA 2693   17 1981
51    51                Li, Chao b    g    CHN 2693    6 1989
52    52            Grachev, Boris    g    RUS 2693    0 1986
53    53      Nielsen, Peter Heine    g    DEN 2693    0 1973
54    54            Van Wely, Loek    g    NED 2692   13 1972
55    55    Bruzon Batista, Lazaro    g    CUB 2691   19 1982
56    56           McShane, Luke J    g    ENG 2691    8 1984
57    57            Eljanov, Pavel    g    UKR 2690   10 1983
58    58      Kasimdzhanov, Rustam    g    UZB 2689   14 1979
59    59         Inarkiev, Ernesto    g    RUS 2689    6 1985
60    60         Zvjaginsev, Vadim    g    RUS 2688    8 1976
61    61         Andreikin, Dmitry    g    RUS 2688    0 1990
62    62    Areshchenko, Alexander    g    UKR 2688    0 1986
63    63         Rublevsky, Sergei    g    RUS 2686    0 1974
64    64         Akopian, Vladimir    g    ARM 2685    8 1971
65    65          Potkin, Vladimir    g    RUS 2684    0 1982
66    66       Sargissian, Gabriel    g    ARM 2683   15 1983
67    67            Berkes, Ferenc    g    HUN 2682   16 1985
68    68           Bologan, Viktor    g    MDA 2680   15 1971
69    69          Bauer, Christian    g    FRA 2679   24 1977
70    70          Tiviakov, Sergei    g    NED 2677   22 1973
71    71            Short, Nigel D    g    ENG 2677   15 1965
72    72        Motylev, Alexander    g    RUS 2677    6 1979
73    73         Gharamian, Tigran    g    FRA 2676    0 1984
74    74          Kobalia, Mikhail    g    RUS 2673    0 1978
75    75              Meier, Georg    g    GER 2671    9 1987
76    76       Onischuk, Alexander    g    USA 2670   13 1975
77    77              Bu, Xiangzhi    g    CHN 2670    6 1985
78    78          Alekseev, Evgeny    g    RUS 2670    0 1985
79    79            Azarov, Sergei    g    BLR 2667    0 1983
80    80        Kryvoruchko, Yuriy    g    UKR 2666    0 1986
81    81             Balogh, Csaba    g    HUN 2665    8 1987
82    82           Harikrishna, P.    g    IND 2665    6 1986
83    83       Khismatullin, Denis    g    RUS 2664    8 1984
84    84   Nguyen, Ngoc Truong Son    g    VIE 2662    6 1990
85    85           Fridman, Daniel    g    GER 2660   11 1976
86    86              Smirin, Ilia    g    ISR 2660    7 1968
87    87               Ding, Liren    g    CHN 2660    6 1992
88    88         Sadler, Matthew D    g    ENG 2660    3 1974
89    89            Korobov, Anton    g    UKR 2660    0 1985
90    90          Cheparinov, Ivan    g    BUL 2659   18 1986
91    91          Timofeev, Artyom    g    RUS 2659    0 1985
92    92           Georgiev, Kiril    g    BUL 2658   17 1965
93    93           Bartel, Mateusz    g    POL 2658    9 1985
94    94          Zhigalko, Sergei    g    BLR 2658    8 1989
95    95         Feller, Sebastien    g    FRA 2658    0 1991
96    96            Ragger, Markus    g    AUT 2655   17 1988
97    97         Jones, Gawain C B    g    ENG 2653   27 1987
98    98                So, Wesley    g    PHI 2653    5 1993
99    99              Milov, Vadim    g    SUI 2653    0 1972
100 100           Gupta, Abhijeet    g    IND 2652    9 1989
101 101            Postny, Evgeny    g    ISR 2652    8 1981
102 102             Roiz, Michael    g    ISR 2652    6 1983
103 103           Gyimesi, Zoltan    g    HUN 2652    4 1977
104 104          Nikolic, Predrag    g    BIH 2652    2 1960

>

Done. You had successfully scraping data from any web page with R or CloudStat.

Then, you can analyze as usual! Great! No more retype the data. Enjoy!

Source: http://www.r-bloggers.com/scraping-table-from-any-web-page-with-r-or-cloudstat/

Thursday, 18 December 2014

Extracting Wisdom Teeth Tips

It is believed that due to evolution, our jaws are now smaller than our ancient ancestors'. For this reason, our mouths often do not have adequate room to accommodate the third molars, making them basically useless and in some cases detrimental. Even if they are not impacted, wisdom teeth may be hard to clean, and therefore require removal to reduce the probability of caries and infection.

As part of your routine dental visits, your dentist will likely take X-rays to monitor the development of your third molars. Your dentist will likely recommend removing them as soon as possible to avoid any complications. The extraction of wisdom teeth can sometimes be a costly and daunting procedure; for these reasons many patients delay having them extracted. However, if the impacted teeth become infected, it is important to see your dental professional at once. Symptoms of infection due to impacted wisdom teeth include;

•    Pain in the gums and surrounding areas
•    Red or inflamed gums
•    Tender or bleeding gums
•    Inflammation around the face and jaw
•    Bad breath (halitosis)
•    Frequent headaches

If a single molar needs to be extracted, local anesthetic will be used. In the case where several or all the teeth need extraction, the patient will usually be "put under" using a general anesthetic. If you have an infection or medical complications that put you at a higher than normal risk, the surgery may be performed at a hospital. Extraction of the wisdom teeth is a day surgery, and patients are usually able to return to normal activities in a day or so. You may be prescribed antibiotics prior to the surgery, and you will likely be asked not to eat or drink the night before the surgery.

During the surgery, your dentist makes an incision in the gum tissue covering the tooth. Once the tooth is exposed, the dentist may cut the tooth into smaller pieces to make extraction easier. After the extraction you will be given stitches to mend the gum tissue. You may need to return a few days later to have the stitches removed. You will be monitored after the surgery to ensure that you are not bleeding excessively.

The best time for extraction is when the patient is in their late teens to avoid unnecessary complications. Wisdom teeth extractions performed later in life are still beneficial, but the removal may be more difficult and healing may take longer. Therefore it is wise to have a conversation with your dentist regarding your wisdom teeth as early as possible.

Most people will experience the emergence of their wisdom teeth at some point in their life, and extraction is sometimes necessary as a preventative measure or to fix an actual problem or to prevent problem. It is best to deal with any problems regarding your wisdom teeth as soon as possible to avoid unnecessary difficulties.

Source:http://ezinearticles.com/?Extracting-Wisdom-Teeth-Tips&id=7788863

Wednesday, 17 December 2014

Importance of Data Mining Services in Business

Data mining is used in re-establishment of hidden information of the data of the algorithms. It helps to extract the useful information starting from the data, which can be useful to make practical interpretations for the decision making.

It can be technically defined as automated extraction of hidden information of great databases for the predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making. Although data mining is a relatively new term, the technology is not. It is thus also known as Knowledge discovery in databases since it grip searching for implied information in large databases.

It is primarily used today by companies with a strong customer focus - retail, financial, communication and marketing organizations. It is having lot of importance because of its huge applicability. It is being used increasingly in business applications for understanding and then predicting valuable data, like consumer buying actions and buying tendency, profiles of customers, industry analysis, etc. It is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, e-commerce, customer relationship management and financial services.

However, the use of some advanced technologies makes it a decision making tool as well. It is used in market research, industry research and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, scientific tests, genetics, financial services and utilities.

Data mining consists of major elements:

•    Extract and load operation data onto the data store system.
•    Store and manage the data in a multidimensional database system.
•    Provide data access to business analysts and information technology professionals.
•    Analyze the data by application software.
•    Present the data in a useful format, such as a graph or table.

The use of data mining in business makes the data more related in application. There are several kinds of data mining: text mining, web mining, relational databases, graphic data mining, audio mining and video mining, which are all used in business intelligence applications. Data mining software is used to analyze consumer data and trends in banking as well as many other industries.

Source:http://ezinearticles.com/?Importance-of-Data-Mining-Services-in-Business&id=2601221

Monday, 15 December 2014

Autoscraping casts a wider net

We have recently started letting more users into the private beta for our Autoscraping service. We’re receiving a lot of applications following the shutdown of Needlebase and we’re increasing our capacity to accommodate these users.

Natalia made a screencast to help our new users get started:

It’s also a great introduction to what this service can do.

We released slybot as an open source integration of the scrapely extraction library and the scrapy framework. This is the core technology behind the autoscraping service and we will make it easy to export autoscraping spiders from Scrapinghub and run them completely with slybot – allowing our users to have the flexibility and freedom provided by open source.

Source:http://blog.scrapinghub.com/2012/02/27/autoscraping-casts-a-wider-net/

Saturday, 13 December 2014

ScraperWiki: A story about two boys, web scraping and a worm

“It’s like a buddy movie.” she said.
Not quite the kind of story lead I’m used to. But what do you expect if you employ journalists in a tech startup?
“Tell them about that computer game of his that you bought with your pocket money.”
She means the one with the risqué name.
I think I’d rather tell you about screen scraping, and why it is fundamental to the nature of data.

About how Julian spent almost a decade scraping himself to death until deciding to step back out and build a tool to make it easier.

I’ll give one example.
Two boys
In 2003, Julian wanted to know how his MP had voted on the Iraq war.
The lists of votes were there, on the www.parliament.uk website. But buried behind dozens of mouse clicks.
Julian and I wrote some software to read the pages for us, and created what eventually became TheyWorkForYou.

We could slice and dice the votes, mix them with some knowledge from political anaroks, and create simple sentences. Mini computer generated stories.

“Louise Ellman voted very strongly for the Iraq war.”
You can see it, and other stories, there now. Try the postcode of the ScraperWiki office, L3 5RF.

I remember the first lobbiest I showed it to. She couldn’t believe it. Decades of work done in an instant by a computer. An encyclopedia of data there in a moment.

Web Scraping

It might seem like a trick at first, as if it was special to Parliament. But actually, everyone does this kind of thing.

Google search is just a giant screen scraper, with one secret sauce algorithm guessing its ranking data.
Facebook uses scraping as a core part of its viral growth to let users easily import their email address book.

There’s lots of messy data in the world. Talk to a geek or a tech company, and you’ll find a screen scraper somewhere.

Why is this?
It’s Tautology

On the surface, screen scrapers look just like devices to work round incomplete IT systems.

Parliament used to publish quite rough HTML, and certainly had no database of MP voting records. So yes, scrapers are partly a clever trick to get round that.

But even if Parliament had published it in a structured format, their publishing would never have been quite right for what we wanted to do.

We still would have had to write a data loader (search for ‘ETL’ to see what a big industry that is). We still would have had to refine the data, linking to other datasets we used about MPs. We still would have had to validate it, like when we found the dead MP who voted.

It would have needed quite a bit of programming, that would have looked very much like a screen scraper.

And then, of course, we still would have had to build the application, connecting the data to the code that delivered the tool that millions of wonks and citizens use every year.

Core to it all is this: When you’re reusing data for a new purpose, a purpose the original creator didn’t intend, you have to work at it.

Put like that, it’s a tautology.
A journalist doesn’t just want to know what the person who created the data wanted them to know.
Scrape Through
So when Julian asked me to be CEO of ScraperWiki, that’s what went through my head.
Secrets buried everywhere.

The same kind of benefits we found for politics in TheyWorkForYou, but scattered across a hundred countries of public data, buried in a thousand corporate intranets.

If only there was a tool for that.
A Worm
And what about my pocket money?
Nicola was talking about Fat Worm Blows a Sparky.
Julian’s boss’s wife gave it its risqué name while blowing bubbles in the bath. It was 1986. Computers were new. He was 17.

Fat Worm cost me £9.95. I was 12.
[Loading screen]
I was on at most £1 a week, so that was ten weeks of savings.
Luckily, the 3D graphics were incomprehensibly good for the mid 1980s. Wonder who the genius programmer is.
I hadn’t met him yet, but it was the start of this story.

Source:https://blog.scraperwiki.com/2011/05/scraperwiki-a-story-about-two-boys-web-scraping-and-a-worm/

Thursday, 11 December 2014

Seven tools for web scraping – To use for data journalism & creating insightful content

I’ve been creating a lot of (data driven) creative content lately and one of the things I like to do is gathering as much data as I can from public sources. I even have some cases it is costing to much time to create and run database queries and my personal build PHP scraper is faster so I just wanted to share some tools that could be helpful. Just a short disclaimer: use these tools on your own risk! Scraping websites could generate high numbers of pageviews and with that, using bandwidth from the website you are scraping.

1. Scraper (Chrome plugin)

Scraper is a simple data mining extension for Google Chrome™ that is useful for online research when you need to quickly analyze data in spreadsheet form.

You can select a specific data point, a price, a rating etc and then use your browser menu: click Scrape Similar and you will get multiple options to export or copy your data to Excel or Google Docs. This plugin is really basic but does the job it is build for: fast and easy screen scraping.

2. Simple PHP Scraper
PHP has a DOMXpath function. I’m not going to explain how this function works, but with the script below you can easily scrape a list of URLs. Since it is PHP, use a cronjob to hourly, daily or weekly scrape the desired data. If you are not used to creating Xpath references, use the Scraper for Chrome plugin by selecting the data point and see the Xpath reference directly.

scraper-example

– Click here to download the example script.

3. Kimono Labs

Kimono has two easy ways to scrape specific URLs: just paste the URL into their website or use their bookmark. Once you have pointed out the data you need, you can set how often and when you want the data to be collected. The data is saved in their database. I like the facts that their learning curve is not that steep and it doesn’t look like you need a PHD in engineering to use their software. The disadvantage of this tool is the fact you can’t upload multiple URLs at once.

4. Import.io

Import.io is a browser based web scraping tool. By following their easy step-by-step plan you select the data you want to scrape and the tool does the rest. It is a more sophisticated tool compared to Kimono. I like it because of the fact it shows a clear overview of all the scrapers you have active and you can scrape multiple URLs at once.

5. Outwit Hub

I will start with the two biggest differences compared to the previous tool: it is a softwarepackage to use on your PC or laptop and to use its full potential it will cost you 75 USD. The free version can only scrape 100 rows of data. What I do like is the number of preprogrammed options to scrape which makes it easy to start and learn about web scraping.

6. ScraperWiki

This tool is really for people wanting to scrape on a massive scale. You can code your own scrapers (in PHP, Ruby & Python) and pricing is really cheap looking to what you can get: 29USD / month for 100 datasets. You are completely free in using libraries and timers. And if your programming skills are not good enough, they can help you out (paid service though). Compared to other tools, this is the most advanced tool that offers the basics of web scraping.

7. Fminer.com
This tool made it possible to finally scrape all the data inside Google Webmaster Tools since it can deal with JavaScript and AJAX interfaces. Read my extensive review on this page: Scraping Webmaster Tools with FMiner!

But on the end, building your individual project scrapers will always be more effective than using predefined scrapers. Am I missing any tools in this sum up of tools?

Source: http://www.notprovided.eu/7-tools-web-scraping-use-data-journalism-creating-insightful-content/

Thursday, 4 December 2014

Multiple Listing Service Gets Favorable Appellate Ruling in Scraping Lawsuit

This is a follow-up to our massive post on anti-scraping lawsuits in the real estate industry from New Year’s Eve 2012 (Note: the portion on MRIS is about halfway through the post, labeled “Same Writ, Different Plaintiff”).

AHRN is a California real estate broker that owns and operates NeighborCity.com. The site gets its data in part by scraping from MLS databases–in this case, MRIS. As part of the scraping, however, AHRN had collected and displayed copyrighted photographs among the bits and pieces of general textual information about the properties. MRIS sent a cease and desist letter to AHRN, and filed suit alleging various copyright claims after the parties failed to agree on a license to use the photographs. Ultimately, a district court in Maryland granted a motion made by MRIS for a preliminary injunction.

When we last left off, the district court had revised its preliminary injunction order to enjoin only AHRN’s use of MRIS’s photographs–not the compilation itself or any textual elements that may be considered a part of it. Since then, AHRN appealed the injunction. On July 18th, the Fourth Circuit Court of Appeals affirmed.

Background

shutterstock_108008486.jpgAHRN argued that MRIS failed to show a likelihood of success on its copyright infringement claim because MRIS: (1) failed to register its copyright in the individual photographs when it registered the database, and (2) did not have a copyright interest in the photographs because the subscribers’ electronic agreement to MRIS’s terms of use failed to transfer those rights.

MRIS Did Not Fail to Register Its Interest in the Photographs

This first question revolved around the scope of MRIS’s registrations. AHRN argued that MRIS’s collective work registrations did not cover the individual photographs because MRIS did not identify the names of the authors and titles of those works. MRIS argued that 17 U.S.C. §409 did not require any such identification when applied to collective works, and that its general description of the pre-existing photographs’ inclusion sufficed.

The court began its discussion by noting the “ambiguous” nature of §409’s language and its varying judicial interpretations. Some courts have barred infringement suits because the collective work registrant failed to list the authors, while others have allowed infringement suits where the registrant owns the rights to the component works as well as the collective work.

In this case, the court agreed with MRIS and found that the latter approach was more consistent with the relevant statutes and regulations:

    Adding impediments to automated database authors’ attempts to register their own component works conflicts with the general purpose of Section 409 to encourage prompt registration . . . and thwarts the specific goal embodied in Section 408 of easing the burden on group registrations[.]

As part of its decision, the court looked favorably upon the 3Taps case, in which Craigslist sued 3Taps and Padmapper for scraping and repackaging its online classified ads. In that case, the court reasoned that it would be “inefficient” to require registrants to list each author of an extremely large number of component works to which the registrant already had obtained an exclusive license.

Having found that MRIS’s general description satisfied § 409’s pre-suit registration requirement, the court moved on to the merits of MRIS’s infringement claim–more specifically, the question of whether MRIS’s Terms of Use actually transferred a copyright interest to its subscribers’ photographs.

E-SIGN Applies to Assignments of Copyrights and Overrides § 204

AHRN challenged MRIS’s ownership of the photographs by arguing that an MLS subscriber’s electronic agreement to MRIS’s Terms of Use does not operate as an assignment of rights under § 204, which requires a signed “writing.”

In a bad sign for AHRN, the court began its discussion by volunteering an argument that MRIS did not even bring up:

    [I]n situations where “the copyright [author] appears to have no dispute with its [assignee] on this matter, it would be anomalous to permit a third party infringer to invoke [Section 204(a)’s signed writing requirement] against the [assignee].”

With that in mind, the court went on to discuss the E-SIGN act’s impact on the conveyance of copyrights. After establishing the meaning of “e-signature,” the court focused on whether the act was limited from covering this type of situation.

    The Act provides that it “does not . . . limit, alter, or otherwise affect any requirement imposed by a statute, regulation, or rule of law . . . other than a requirement that contracts or other records be written, signed, or in nonelectric form[.]”

The court emphasized the phrase “other than,” reasoning that a plain reading of the E-SIGN language showed that Congress intended the provisions to limit § 204. It also noted that Congress did not list copyright assignments among the various agreements to which E-SIGN did not apply–nor was there a catchall that included such assignments.

The court then turned to the Hermosilla case, in which a district court in Florida upheld the validity of a copyright conveyance via e-mail. It emphasized the Hermosilla court’s reliance on the purpose of § 204–“to resolve disputes between copyright owners and transferees and to protect copyright holders from persons mistakenly or fraudulently claiming oral licenses or copyright ownership.” The appellate court agreed with the Hermosilla court that allowing assignment via e-mail actually helped cut down on these types of disputes.

    To invalidate copyright transfer agreements solely because they were made electronically would thwart the clear congressional intent embodied in the E-Sign Act.

All in all, the court basically said “we don’t see why E-SIGN shouldn’t apply.” Note that it did not pass judgment specifically on whether MRIS’s Terms of Use constituted a valid contract. It simply mentioned that AHRN waived that argument by not bringing it up sooner.

Source: http://blog.ericgoldman.org/archives/2013/07/multiple_listin_1.htm

Monday, 1 December 2014

The Roots of Web Scraping and the Wisdom behind It

You may be wondering how data mining came into existence. This effective and innovative trend in business and research is indeed something commendable and the genius behind it is worth great reward. To have a clear view of the origin of web scraping, the following important factors that contribute to the creation of this phenomenon called data collection or web scraping are considered.

Foundations

Unlike any other innovation, no specific date can be clearly pointed out as the birthdate of data mining. It has come into existence as a result of several problem solving processes in major data gathering and handling situations. It appears that cyber technology has opened a Pandora box of “anything can happen” experiences. Moreover, the shift from physical to virtual data collection has resulted in a bulk of database that needed to be organized, analyzed and utilized.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/roots-web-scraping-wisdom-behind/

Friday, 28 November 2014

Scraping Online Communities for your Outreach Campaigns

Online communities offer a wealth of intelligence for blog owners and business owners alike.

Exploring the data within popular communities will help you to understand who the major influencers are, what content is popular and who are the key content aggregators within your niche.

This is all fair and well to talk about, but is it feasible to be manually sorting through online communities to find this information out? Probably not.

This is where data scraping comes in.
What is Scraping and What Can it do?

I’m not going to go into great detail on what data scraping actually means, but to simplify this, here’s a definition from the Wikipedia page:

    “Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.”

Let me explain this with a little example…

Imagine a huge community full of individuals within your industry. Each person within the community has a personal profile page that contains information about their interests, contact details, social profiles, etc.

If you were tasked with gathering all of this data on all of the individuals then you might start to hyperventilating at the thought of all the copy and pasting you’d need to do.

Well, an alternative is to scrape all of this content so that you can automate all of this process and easily export all of this information into a manageable, more consumable format in a matter of seconds. It’d be pretty awesome, right?
Luckily for you, I’m going to show you how to do just that!
The Example of Inbound.org

Recently, I wanted to gather a list of digital marketers that were fairly active on social media and shared a lot of content online within communities. These people were going to be some of my core targets to get content from the blog in front of.

To do this, I first found some active communities online where these types of individuals hang out. Being a digital marketer myself, this process was fairly easy and I chose Inbound.org as my starting place.

Scoping out Data Requirements
Each community is different and you’ll be able to gather varying information within each.

The place to look for this information is within the individual user profile pages. This is usually where the contact information or links to social media accounts are likely to be displayed.

For this particular exercise, I wanted to gather the following information:

    Full name
    Job title
    Company name and URL
    Location
    Personal website URL
    Twitter URL, handle and follower/following stats
    Google+ URL, follower count and list of contributor URLs
    Profile image URL
    Facebook URL
    LinkedIn URL

With all of this information I’ll be able to get a huge amount of intelligence about the community members. I’ll also have a list of social media accounts to add and engage with.
On top of this, with all the information on their websites and sites that they write for, I’ll have a wealth of potential link building prospects to work on.

Inbound.org Profiles

You’ll see in the above screenshot that a few of the pieces of data are available to see on the Inbound.org user profiles. We’ll need to get the other bits of information from the likes of Twitter and Google+, but this will all stem from the scraping of Inbound.org.

Sign Up To My Newsletter
Scraping the Data

The idea behind this is that we can set up a template based on one of the user profiles and then automate the data gathering across the rest of the profiles on the site.

This is where you’ll need to install the SEO Tools plugin for Excel (it’s free). If you’ve not used this plugin before, don’t worry – I’ve put together a full tutorial here.

Once you’ve installed the plugin, you’re good to go on the actual scraping side of things…
Quick Note: Don’t worry if you don’t have a good knowledge of coding – you don’t need it. All you’ll need is a very basic understanding of reading some code and some basic Excel skills.

To begin with, you’ll need to do a little Excel admin. Simply add in some column titles based around the data that you’re gathering. For example, with my example of Inbound.org, I had, ‘Name’, ‘Position’, ‘Company’, ‘Company URL’, etc. which you can see in the screenshot below. You’ll also want to add in a sample profile URL to work on building the template around.

spreadsheet admin
Now it’s time to start getting hands on with XPath.
How to Use XPathOnURL()

This handy little formula is made possible within Excel by the SEO Tools plugin. Now, I’m going to keep this very basic because there are loads of XPath tutorials available online that can go into the very advanced queries that are possible to use.

For this, I’m simply going to show you how to get the data we want and you can have a play around yourself afterwards (you can download the full template at the end of this post).

Here’s an example of an XPath query that gathers the name of the person within the profile that we’re scraping:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

A2 is simply referencing the cell that contains the URL that we’re scraping. You’ll see in the screenshot above that this is Jason Acidre’s profile page.

The next part of the formula is the XPath.

What this essentially says is to scrape through the HTML to find a tag that has ‘user-profile’ id attached to it. This could be a div, span, a or whatever.

Once it’s found this tag, it then needs to look at the first h2 tag within this area and grab the text within it. This is Jason’s name, which you’ll see in the screenshot below of the code:

website code

Don’t be put off at this stage because you don’t need to go manually trawling through code to build these queries, there’s a much simpler way.

The easiest way to do this is by right-clicking on the element you want to scrape on the webpage (within Chrome); for example, on Inbound.org, this would be the profile name. Now click ‘Inspect element’.

inspect element

The developer tools window should now appear at the bottom of your browser (or in a separate window). Within that, you should see the element that you’ve drilled down on.

All you need to do now is right-click on it and press ‘Copy XPath’.
copy XPath

This will now copy the XPath code for your Excel formula to the clipboard. You’ll just need to add in the first part of the query, i.e. =XPathOnUrl(A2,

You can then paste in the copied XPath after this and add a closing bracket.

Note: When you use ‘Copy XPath’ it will wrap some parts of the code in double apostrophes (“) which you’ll need to change to single apostrophes. You’ll also need to wrap the copied XPath in double apostrophes.

Your finished code will look like this:
=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

You can then apply this formula against any Inbound.org profile and it will automatically grab the user’s full name. Pretty good, right?

Check out the full video tutorial below that I’ve put together that talks you through this whole process:

[sws_blue_box box_size=””] Want more useful video tutorials? Subscribe to my YouTube channel now![/sws_blue_box]

XPath Examples for Grabbing Other Data

As you’re probably starting to see, this technique could be scaled across any website online. This makes big data much more attainable and gives you the kind of results that an expensive paid tool would offer without any of the cost – bonus!

Here’s a few more examples of XPath that you can use in conjunction with the SEO Tools plugin within Excel to get some handy information.

Twitter Follower Count

If you want to grab the number of followers for a Twitter user then you can use the following formula. Simply replace A2 with the Twitter profile URL of the user you want data on. Just a quick word of warning with this one; it looks like it’s really long and complicated, but really I’ve just used another Excel formula to snip of the text ‘followers’ from the end.

=RIGHT(XPathOnUrl(D57,"//li[@class='ProfileNav-item ProfileNav-item--followers']"),LEN(XPathOnUrl(D57,"//li[@class='ProfileNav-item ProfileNav-item--followers']"))-10)

Google+ Follower Count

Like with the Twitter follower formula, you’ll need to replace A2 with the full Google+ profile URL of the user you want this data for.

=XPathOnUrl(H67,"//span[@class='BOfSxb']")

List of ‘Contributor to’ URLs

I don’t think I need to tell you the value of pulling in a list of websites that someone contributes content to. If you do want to know then check out this post that I wrote.

This formula is a little more complex than the rest. This is because I’m pulling in a list of URLs as opposed to just one entity. This requires me to use the StringJoin function to separate all of the outputs with a comma (or whatever character you’d like).

Also, you may notice that there is an additional section to the XPath query, “href”. This pulls in the link within the specific code block instead of the text.

As you’ll see in the full Inbound.org scraper template that I’ve made, this is how I pull in the Twitter, Google+, Facebook and LinkedIn profile links.

You’ll want to replace A2 with the Google+ profile URL of the person you wish to gather data on.

=StringJoin(", ",XPathOnUrl(A2,"//a[@rel='contributor-to nofollow']","href"))

Twitter Profile Image URL
If you want to get a large version of someone’s Twitter profile image then I’ve got just the thing for you.
Again, you’ll just need to substitute A2 with their Twitter profile URL.
=XPathOnUrl(A2,"//*[@class='profile-picture media-thumbnail js-tooltip']","data-resolved-url-large")

Some Findings from the Data I’ve Gathered

With all big data sets will come some interesting findings. Here’s a few little things that I’ve found from the top 100 influential users on Inbound.org.

average followers chart

The chart above maps out the average number of followers that the top 100 users have on both Twitter (12,959) and Google+ (9,601). As well as this, it shows the average number of users that they follow on Twitter (1,363).

The next thing that I’ve looked at is the job titles of the top 100 users. You can see the most common occurrences of terms within the tag cloud below:

Job titlesFinally, I had a look through all of the domains listed within each of the top 100 Inbound.org users’ Google+ ‘contributor to’ sections and mapped out the most frequently mentioned sites.

Here’s the spread of domains that were the most popular to be contributed to:

domain frequency
It Doesn’t Stop There

As you’ve probably gathered, this can be scaled out across pretty much any community/forum/directory/website online.

With this kind of intelligence in your armoury, you’ll be able to gather more intelligence on your targets and increase the effectiveness of your outreach campaigns dramatically.

Also, as promised, you can download my full Inbound.org scraper template below:

[sdfile url=”http://www.matthewbarby.com/goodies/MatthewBarby-Inbound-Scraper.xlsx” redirect=”http://www.matthewbarby.com/thanks-downloading-inbound-scraper/”]

TL;DR

    Online communities hold valuable data on your target audiences – use it!
    Scale out your intelligence gathering by brushing up on your XPath.
    Download my Inbound.org scraper template and let it work its magic.

Source: http://www.matthewbarby.com/scraping-communities-with-xpath/

Wednesday, 26 November 2014

Web Data Extraction: driving data your way

Most businesses rely on the web to gather data such as product specifications, pricing information, market trends, competitor information, and regulatory details. More often than not, companies collect this data manually—a process that not only takes a significant amount of time, but also has the potential to introduce costly errors.

By automating data extraction, you're able to free yourself (and your pointer finger) from hours of copy/pasting, eliminate human errors, and focus on the parts of your job that make you feel great.

Web data extraction: What it is, why it's used, and how to get it right on an ongoing basis

Web data extraction, screen scraping, web harvesting—while these terms may have different connotations they all essentially point to the same thing: plucking data from the web and populating it in an organized way in another place for further analysis or more focused use. In an era where “big data” has become a commonplace concept, the appeal of web data extraction has grown; it’s an extremely efficient alternative to web browsing, and culls very specific data for a focused purpose.

How it's used

While each company’s needs vary, data extraction is often used for:

    Competitive intelligence, including web popularity, social perception, other sites linking to them, and placement of competitor advertisements

    Gathering financial data including stock market movement, product pricing, and more

    Creating continuity between price sheets and online websites, catalogs, or inventory databases

    Capturing product specifications like dimensions, color, and materials

    Pulling tabular data from multiple sources for in-depth analysis

Interestingly, some people even find that web data extraction can aid them in their leisure time as well, pulling data from blogs and websites that pertain to their hobbies or interests, and creating their own library of organized information on a topic. Perhaps, for instance, you want a list of all the designers that George Clooney wears (hey- we won’t question what you do in your free time). By using web scraping tools, you could automatically extract this type of data from, say, a fashion blogger who follows celebrity style, and create your own up-to-date shopping list of items.

How it's done

When you think of gathering data from the web, you should mentally juxtapose two different images: one of gathering a bucket of sand one grain at a time, and one of filling a bucket with a shovel that has the perfect scoop size to fill it completely in one sitting. While clearly the second method makes the most sense, the majority of web data extraction happens much like the first process--manually, and slowly.

Let’s take a look at a few different ways organizations extract data today.

The least productive way: manually

While this method is the least efficient, it’s also the most widespread. On the plus side, you need to learn absolutely nothing except “Ctrl+C/V” to use this method, which explains why it is the generally preferred method, despite the hours of time it can take. Imagine, for instance, managing a sales spreadsheet that keeps inventory up to date so that the information can be properly disseminated to a global sales team. Not only does it take a significant amount of time to update the spreadsheet with information from, say, your internal database and manufacturer’s website, but information may change rapidly, leaving sales reps with inaccurate information regardless.

Finding someone in the organization with a talent for programming languages like Python

Generally, automating a task without dedicated automation software requires programming, and therefore an internal resource with a solid familiarity with programming languages to create the task and corresponding script. While most organizations do, in fact, have a resource in IT or engineering with this type of ability, it often doesn’t seem like a worthy time investment for that person to derail the initiatives he or she is working on to automate web data extraction. Additionally, if companies do choose to automate using in-house resources, that person will find himself beholden to a continuing obligation, since he or she will need to adjust scripting if web objects and attributes change, disabling the task.

Outsourcing via Elance or oDesk

Unless there is a dedicated resource ready to automate and maintain data extraction processes (and most organizations wouldn’t necessarily choose to use their in-house employee time this way), companies might turn to outsourcing companies such as Elance or oDesk to hire contract help. While this is an effective way to automate a task using a resource that has a level of acumen in automation, it represents an additional cost--be it one time or on a regular basis as data extraction requirements change or increase.

Using Excel web queries

Since more often than not, data extracted from the web is often populated into an Excel spreadsheet, it’s no wonder that Excel includes web query tools expressly for that purpose. These tools are particularly useful in pulling tabular data from a website (such as product specifications, legal codes, stock prices, and a host of other information) and automatically pushing the data into a spreadsheet. Excel queries do have limitations and a learning curve, however, particularly when creating dynamic web queries. And clearly, if you’re hoping to populate the information in other sources, such as external databases, there is yet another level of difficulty to navigate.

How automation simplifies web data extraction

Culling web data quickly

Using automation is the simplest way to extract web data. As you execute the steps necessary to perform the task one time, a macro recorder captures each action, automatically generates an easily-editable script, and lets you specify how often you would like to repeat the task, and at what speed.

Maintaining the highest level of accuracy

With humans copy/pasting data, or comparing between multiple screens and entering data manually into a spreadsheet, you’re likely to run into accuracy issues (sometimes directly proportionate to the amount of time spent on the task and amount of coffee in the office!) Automation software ensures that “what you see is what you get,” and that data is picked up from the web and put back down where you want it without a hitch.

Storing web data in your preferred format

Not only can you accurately transfer data with automation software, you can also ensure that it’s populated into spreadsheets or databases in the format you prefer. Rather than simply dumping the data into a spreadsheet, you can ensure that the right information is put into the proper column, row, field, and style (think, for instance, of the difference between writing a birth date as “03/13/1912” and “12/3/13”).

Simplifying data analysis

Automation software allows you to aggregate data from disparate sources or enormous stockpiles of structured or unstructured data in a way that makes sense for your business analysis needs. This way, the majority of employees in an organization can perform some level of analysis on their own, making it easier to surface information that informs business decisions.

Reacting to changes without a hitch

Because automation software is built to recognize icons, images, symbols, and other objects regardless of their position on a screen, it can automate processes in a self-perpetuating manner. For example, let’s say you automate data retrieval from a certain chart on a retailer’s website without automation software. If the retailer decides to move that object to another area of the screen, your task would no longer produce accurate results (or work at all), leaving you to make changes to the script (or find someone who can), or re-record the task altogether. With image recognition capabilities, however, the system “memorizes” the object itself, not merely its coordinates, so that the task can continue to run irrespective of changes.

The wide sweeping appeal of automation software

Companies often pick a comprehensive automation solution not only because of its ability to effectively automate any web data extraction task, but also because it goes beyond data extraction. Automation software can permeate into other areas of the business as well, making tasks such as application integration, data migration, IT processes, Excel automation, testing, and routine tasks such as launching applications or formatting files faster and more accurate. Because it requires no programming experience to use, adoption rates are higher and businesses get more “bang for their buck.”

Almost any organization can benefit from using automation software, particularly as they grow and scale. If you are looking to quit “moving grains of sand” and start claiming back time in your day, there are a few steps you can take:

Watch a short video that shows how web data extraction is done with automation software

Download a free trial and start reaping the benefits of downloading even just a couple of tasks today.

See how tasks are automated with our short, step-by-step how-to-sheets (and then give it a try yourself!)

Source: https://www.automationanywhere.com/web-data-extraction

Wednesday, 19 November 2014

Web Scraping for Fun & Profit

There’s a number of ways to retrieve data from a backend system within mobile projects. In an ideal world, everything would have a RESTful JSON API – but often, this isn’t the case.Sometimes, SOAP is the language of the backend. Sometimes, it’s some proprietary protocol which might not even be HTTP-based. Then, there’s scraping.

Retrieving information from web sites as a human is easy. The page communicates information using stylistic elements like headings, tables and lists – this is the communication protocol of the web. Machines retrieve information with a focus on structure rather than style, typically using communication protocols like XML or JSON. Web scraping attempts to bridge this human protocol into a machine-readable format like JSON. This is what we try to achieve with web scraping.

As a means of getting to data, it don’t get much worse than web scraping. Scrapers were often built with Regular Expressions to retrieve the data from the page. Difficult to craft, impossible to maintain, this means of retrieval was far from ideal. The risks are many – even the slightest layout change on a web page can upset scraper code, and break the entire integration. It’s a fragile means for building integrations, but sometimes it’s the only way.

Having built a scraper service recently, the most interesting observation for me is how far we’ve come from these “dark days”. Node.js, and the massive ecosystem of community built modules has done much to change how these scraper services are built.

Effectively Scraping Information

Websites are built on the Document Object Model, or DOM. This is a tree structure, which represents the information on a page.By interpreting the source of a website as a DOM, we can retrieve information much more reliably than using methods like regular expression matching. The most popular method of querying the DOM is using jQuery, which enables us to build powerful and maintainable queries for information. The JSDom Node module allows us to use a DOM-like structure in serverside code.

For purpose of Illustration, we’re going to scrape the blog page of FeedHenry’s website. I’ve built a small code snippet that retrieves the contents of the blog, and translates it into a JSON API. To find the queries I need to run, first I need to look at the HTML of the page. To do this, in Chrome, I right-click the element I’m looking to inspect on the page, and click “Inspect Element”.

Screen Shot 2014-09-30 at 10.44.38

Articles on the FeedHenry blog are a series of ‘div’ elements with the ‘.itemContainer’ class

Searching for a pattern in the HTML to query all blog post elements, we construct the `div.itemContainer` query. In jQuery, we can iterate over these using the .each method:

var posts = [];

$('div.itemContainer').each(function(index, item){

// Make JSON objects of every post in here, pushing to the posts[] array

});

From there, we pick off the heading, author and post summary using a child selector on the original post, querying the relevant semantic elements:

    Post Title, using jQuery:

    $(item).find('h3').text()trim() // trim, because titles have white space either side

    Post Author, using jQuery:

    $(item).find('.catItemAuthor a').text()

    Post Body, using jQuery:

    $(item).find('p').text()

Adding some JSDom magic to our snippet, and pulling together the above two concept (iterating through posts, and picking off info from each post), we get this snippet:

var request = require('request'),

jsdom = require('jsdom');

jsdom.env(

"http://www.feedhenry.com/category/blog",

["http://code.jquery.com/jquery.js"],

function (errors, window) {

    var $ = window.$, // Alias jQUery

    posts = [];

    $('div.itemContainer').each(function(index, item){

      item = $(item); // make queryable in JQ

      posts.push({

        heading : item.find('h3').text().trim(),

        author : item.find('.catItemAuthor a').text(),

        teaser : item.find('p').text()

      });

    });

    console.log(posts);

}

);

A note on building CSS Queries

As with styling web sites with CSS, building effective CSS queries is equally as important when building a scraper. It’s important to build queries that are not too specific, or likely to break when the structure of the page changes. Equally important is to pick a query that is not too general, and likely to select extra data from the page you don’t want to retrieve.

A neat trick for generating the relevant selector statement is to use Chrome’s “CSS Path” feature in the inspector. After finding the element in the inspector panel, right click, and select “Copy CSS Path”. This method is good for individual items, but for picking repeating patterns (like blog posts), this doesn’t work though. Often, the path it gives is much too specific, making for a fragile binding. Any changes to the page’s structure will break the query.

Making a Re-usable Scraping Service

Now that we’ve retrieved information from a web page, and made some JSON, let’s build a reusable API from this. We’re going to make a FeedHenry Blog Scraper service in FeedHenry3. For those of you not familiar with service creation, see this video walkthrough.

We’re going to start by creating a “new mBaaS Service”, rather than selecting one of the off-the-shelf services. To do this, we modify the application.js file of our service to include one route, /blog, which includes our code snippet from earlier:

// just boilerplate scraper setup

var mbaasApi = require('fh-mbaas-api'),

express = require('express'),

mbaasExpress = mbaasApi.mbaasExpress(),

cors = require('cors'),

request = require('request'),

jsdom = require('jsdom');

var app = express();

app.use(cors());

app.use('/sys', mbaasExpress.sys([]));

app.use('/mbaas', mbaasExpress.mbaas);

app.use(mbaasExpress.fhmiddleware());

// Our /blog scraper route

app.get('/blog', function(req, res, next){

jsdom.env(

    "http://www.feedhenry.com/category/blog",

    ["http://code.jquery.com/jquery.js"],

    function (errors, window) {

      var $ = window.$, // Alias jQUery

      posts = [];

      $('div.itemContainer').each(function(index, item){

        item = $(item); // make queryable in JQ

        posts.push({

          heading : item.find('h3').text().trim(),

          author : item.find('.catItemAuthor a').text(),

          teaser : item.find('p').text()

        });

      });

      return res.json(posts);

    }

);

});

app.use(mbaasExpress.errorHandler());

var port = process.env.FH_PORT || process.env.VCAP_APP_PORT || 8001;

var server = app.listen(port, function() {});

We’re also going to write some documentation for our service, so we (and other developers) can interact with it using the FeedHenry discovery console. We’re going to modify the README.md file to document what we’ve just done using API Blueprint documentation format:

# FeedHenry Blog Web Scraper

This is a feedhenry blog scraper service. It uses the `JSDom` and `request` modules to retrieve the contents of the FeedHenry developer blog, and parse the content using jQuery.

# Group Scraper API Group

# blog [/blog]

Blog Endpoint

## blog [GET]

Get blog posts endpoint, returns JSON data.

+ Response 200 (application/json)

    + Body

            [{ blog post}, { blog post}, { blog post}]

We can now try out the scraper service in the studio, and see the response:

Scraping – The Ultimate in API Creation?

Now that I’ve described some modern techniques for effectively scraping data from web sites, it’s time for some major caveats. First, WordPress blogs like ours already have feeds and APIs available to developers - there’s no need to ever scrape any of this content. Web Scraping is not a replacement for an API. It should be used only as a last resort, after every endeavour to discover an API has already been made. Using a web scraper in a commercial setting requires much time set aside to maintain the queries, and an agreement with the source data is being scraped on to alert developers in the event the page changes structure.

With all this in mind, it can be a useful tool to iterate quickly on an integration when waiting for an API, or as a fun hack project.

Source: http://www.feedhenry.com/web-scraping-fun-profit/

Monday, 17 November 2014

Get started with screenscraping using Google Chrome’s Scraper extension

How do you get information from a website to a Excel spreadsheet? The answer is screenscraping. There are a number of softwares and plattforms (such as OutWit Hub, Google Docs and Scraper Wiki) that helps you do this, but none of them are – in my opinion – as easy to use as the Google Chrome extension Scraper, which has become one of my absolutely favourite data tools.

What is a screenscraper?

I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreads, thousands or even more pages it can be an incredibly powerful tool.

In its most simple form, the one that we will look at in this blog post, it gathers information from one webpage only.

Google Chrome’s Scraper

Scraper is an Google Chrome extension that can be installed for free at Chrome Web Store.

Image

Now if you installed the extension correctly you should be able to see the option “Scrape similar” if you right-click any element on a webpage.

The Task: Scraping the contact details of all Swedish MPs

Image

This is the site we’ll be working with, a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and chose Scrape similar. This should open the following window.

Understanding XPaths

At w3schools you’ll find a broader introduction to XPaths.

Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define what parts of the webpage that we want to collect.

A typical XPath might look something like this:

    //div[@id="content"]/table[1]/tr

Which in plain English translates to:

    // - Search the whole document...

    div[@id="content"] - ...for the div tag with the id "content".

    table[1] - Select the first table.

    tr - And in that table, grab all rows.

Over to Scraper then. I’m given the following suggested XPath:

    //section[1]/div/div/div/dl/dt/a

The results look pretty good, but it seems we only get names starting with an A. And we would also like to collect to phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.

Right-click one of the MPs and chose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.

And if we open the section tag we find the list of MPs in div tags.

We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.

Writing our XPaths

In step one we want to try to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover the tags in the Elements window to see what tags correspond to what elements on the page.

In our case this is the last tag that contains all the data we are looking for:

    //section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl

Click Scrape to test run the XPath. It should give you a list that looks something like this.

Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.

I have highlighted the parts that we want to extract. Grab them with the following XPaths:

    name: dt/a
    party: dd[1]
    region: dd[2]/span[1]
    seat: dd[2]/span[2]
    phone: dd[3]
    e-mail: dd[4]/span/a

Insert these paths in the Columns field and click Scrape to run the scraper.

Click Export to Google Docs to get the data into a spreadsheet.

Source: http://dataist.wordpress.com/2012/10/12/get-started-with-screenscraping-using-google-chromes-scraper-extension/

Sunday, 16 November 2014

Screen-scraping with WWW::Mechanize

Screen-scraping is the process of emulating an interaction with a Web site - not just downloading pages, but filling out forms, navigating around the site, and dealing with the HTML received as a result. As well as for traditional lookups of information - like the example we'll be exploring in this article - we can use screen-scraping to enhance a Web service into doing something the designers hadn't given us the power to do in the first place. Here's an example:

I do my banking online, but get quickly bored with having to go to my bank's site, log in, navigate around to my accounts and check the balance on each of them. One quick Perl module (Finance::Bank::HSBC) later, and now I can loop through each of my accounts and print their balances, all from a shell prompt. Some more code, and I can do something the bank's site doesn't ordinarily let me - I can treat my accounts as a whole instead of individual accounts, and find out how much money I have, could possibly spend, and owe, all in total.

Another step forward would be to schedule a crontab every day to use the HSBC option to download a copy of my transactions in Quicken's QIF format, and use Simon Cozens' Finance::QIF module to interpret the file and run those transactions against a budget, letting me know whether I'm spending too much lately. This takes a simple Web-based system from being merely useful to being automated and bespoke; if you can think of how to write the code, then you can do it. (It's probably wise for me to add the caveat, though, that you should be extremely careful working with banking information programatically, and even more careful if you're storing your login details in a Perl script somewhere.)

Back to screen-scrapers, and introducing WWW::Mechanize, written by Andy Lester and based on Skud's WWW::Automate. Mechanize allows you to go to a URL and explore the site, following links by name, taking cookies, filling in forms and clicking "submit" buttons. We're also going to use HTML::TokeParser to process the HTML we're given back, which is a process I've written about previously.

The site I've chosen to demonstrate on is the BBC's Radio Times site, which allows users to create a "Diary" for their favorite TV programs, and will tell you whenever any of the programs is showing on any channel. Being a London Perl M[ou]nger, I have an obsession with Buffy the Vampire Slayer. If I tell this to the BBC's site, then it'll tell me when the next episode is, and what the episode name is - so I can check whether it's one I've seen before. I'd have to remember to log into their site every few days to check whether there was a new episode coming along, though. Perl to the rescue! Our script will check to see when the next episode is and let us know, along with the name of the episode being shown.

Here's the code:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;

If you're going to run the script yourself, then you should register with the Radio Times site and create a diary, before giving the e-mail address you used to do so below.

my $email = ";
die "Must provide an e-mail address" unless $email ne ";

We create a WWW::Mechanize object, and tell it the address of the site we'll be working from. The Radio Times' front page has an image link with an ALT text of "My Diary", so we can use that to get to the right section of the site:

my $agent = WWW::Mechanize->new();
$agent->get("http://www.radiotimes.beeb.com/");
$agent->follow("My Diary");

The returned page contains two forms - one to allow you to choose from a list box of program types, and then a login form for the diary function. We tell WWW::Mechanize to use the second form for input. (Something to remember here is that WWW::Mechanize's list of forms, unlike an array in Perl, is indexed starting at 1 rather than 0. Our index is, therefore,'2.')

$agent->form(2);

Now we can fill in our e-mail address for the '<INPUT name="email" type="text">' field, and click the submit button. Nothing too complicated.

$agent->field("email", $email);
$agent->click();

WWW::Mechanize moves us to our diary page. This is the page we need to process to find the date details from. Upon looking at the HTML source for this page, we can see that the HTML we need to work through is something like:

<input>
<tr><td></td></tr>
<tr><td></td><td></td><td class="bluetext">Date of episode</td></tr>
<td></td><td></td>
<td class="bluetext">Time of episode</td></tr>
<a href="page_with_episode_info"></a>

This can be modeled with HTML::TokeParser as below. The important methods to note are get_tag - which will move the stream on to the next opening for the tag given - and get_trimmed_text, which returns the text between the current and given tags. For example, for the HTML code "Bold text here", my $tag = get_trimmed_text("/b") would return "Bold text here" to $tag.

Also note that we're initializing HTML::TokeParser on '\$agent->{content}' - this is an internal variable for WWW::Mechanize, exposing the HTML content of the current page.

my $stream = HTML::TokeParser->new(\$agent->{content});
my $date;
 # <input>
$stream->get_tag("input");
# <tr><td></td></tr><tr>
$stream->get_tag("tr"); $stream->get_tag("tr");
# <td></td><td></td>
$stream->get_tag("td"); $stream->get_tag("td");
# <td class="bluetext">Date of episode</td></tr>
my $tag = $stream->get_tag("td");
if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
 $date = $stream->get_trimmed_text("/td");
 # The date contains ' ', which we'll translate to a space.
 $date =~ s/\xa0/ /g;
}
 # <td></td><td></td>
$stream->get_tag("td");
# <td class="bluetext">Time of episode
$tag = $stream->get_tag("td");
if ($tag->[1]{class} eq "bluetext") {
 $stream->get_tag("b");
 # This concatenates the time of the showing to the date.
 $date .= ", from " . $stream->get_trimmed_text("/b");
}
# </td></tr><a href="page_with_episode_info"></a>
$tag = $stream->get_tag("a");
# Match the URL to find the page giving episode information.
$tag->[1]{href} =~ m!src=(http://.*?)'!;

We have a scalar, $date, containing a string that looks something like "Thursday 23 January, from 6:45pm to 7:30pm.", and we have an URL, in $1, that will tell us more about that episode. We tell WWW::Mechanize to go to the URL:

$agent->get($1);

The navigation we want to perform on this page is far less complex than on the last page, so we can avoid using a TokeParser for it - a regular expression should suffice. The HTML we want to parse looks something like this:

 Episode The Episode Title 

We use a regex delimited with '!' in order to avoid having to escape the slashes present in the HTML, and store any number of alphanumeric characters after some whitespace, all between tags after the Episode header:

$agent->{content} =~ m! Episode \s+?(\w+?) !;

$1 now contains our episode, and all that's left to do is print out what we've found:

my $episode = $1;
print "The next Buffy episode ($episode) is on $date.\n";

And we're all set. We can run our script from the shell:

$ perl radiotimes.pl

The next Buffy episode (Gone) is Thursday Jan. 23, from 6:45 to 7:30 p.m.
I hope this gives a light-hearted introduction to the usefulness of the modules involved. As a note for your own experiments, WWW::Mechanize supports cookies - in that the requestor is a normal LWP::UserAgent object - but they aren't enabled by default. If you need to support cookies, then your script should call "use HTTP::Cookies; $agent->cookie_jar(HTTP::Cookies->new);" on your agent object in order to enable session-volatile cookies for your own code.
Happy screen-scraping, and may you never miss a Buffy episode again.

Source: http://www.perl.com/pub/2003/01/22/mechanize.html