Tutorial: web scraping in Python with Scrapy


Details: Created: Saturday 28 April 2018 10:20; Written by ConsultingIT

Tutorial: web scraping in Python with Scrapy, proof of concept

I usually build SEO scraping bots (in JavaScript) that scrape the articles and comments of WordPress shops, to promote Android or iOS mobile app development by increasing daily user visits (verified with Google Analytics). The client works with Scrapy, so I have to adapt my scraping method.

A question? Ask it here

Help with application development

What is Scrapy?

Scrapy is an open-source framework written in Python for building web crawlers that collect data (email addresses, biographies, locations, etc.).

Scrapy's strength is its modularity, which helps us get past some of the barriers that websites put in place.
Scrapy is built around classes called spiders, which define how a site's data is fetched, processed and organised.
Alongside the spiders, a few components give you freedom in configuration:
- Settings, to manage a number of parameters (we will come back to them later)
- Items, where you define the fields in which the collected data is stored.
- Pipelines, where the collected data stored in the items is processed.
In addition, spiders remain highly customisable through extensions that modify their behaviour.
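To make the item/pipeline split concrete, here is a minimal sketch of a pipeline, assuming items are plain dicts (which Scrapy accepts). The class name `CleanEmailPipeline` and the `email` field are hypothetical; `process_item(self, item, spider)` is the method Scrapy calls on a pipeline for each scraped item.

```python
class CleanEmailPipeline:
    """Hypothetical pipeline: normalises the 'email' field of each item."""

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; plain dicts are valid items.
        email = item.get("email")
        if email is not None:
            item["email"] = email.strip().lower()
        # Returning the item hands it to the next pipeline in the chain.
        return item


if __name__ == "__main__":
    pipeline = CleanEmailPipeline()
    print(pipeline.process_item({"email": "  Contact@Example.COM "}, spider=None))
```

In a real project such a class would live in pipelines.py and be enabled through the ITEM_PIPELINES setting.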

The idea: use Python tools, with Scrapy for the crawling, and develop a Scrapy module to collect information.

Installing Scrapy, the wrong way:

After a bit of research, I found some nice links. Here they are:

https://scraperwiki.com/

https://www.makeuseof.com/tag/build-basic-web-crawler-pull-information-website-2/
https://www.youtube.com/watch?v=tVNIqytHUJQ

https://scrapy.org/

Pycrust
Python 2.7

http://www.edwardhk.com/os/linux/scrapy-web-crawling-setup/

https://docs.scrapy.org/en/latest/intro/tutorial.html
http://installion.co.uk/kali/kali/main/p/python-scrapy/install/index.html

This command is apparently meant to install Python and Scrapy in one go, but it does not work. Worse, it breaks the tools that use Python on your machine! Do not type it:

sudo apt-get install python-scrapy

I use sudo for security reasons: I do not use the root account, and I added my user to the sudoers group.

I typed it to test, because nobody told me it would break everything. Was I lied to? :-P

Reading package lists... Done
Building dependency tree
Reading state information... Done
Package python-scrapy is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

And bam, a crashed Debian/Kali installation! Russian hackers at it again...

trololo troll

After reinstalling and reconfiguring everything, here is the right way.

Installing Scrapy, the right way

Create a "virtual environment". Install Scrapy in a virtual environment so as not to break existing tools that depend on specific versions of Python.

See: https://docs.scrapy.org/en/latest/intro/install.html#using-a-virtual-environment-recommended

https://virtualenv.pypa.io/en/stable/installation/

sudo pip install virtualenv
« Successfully installed virtualenv-15.2.0 »

sudo virtualenv ENV
« New python executable in /etc/test/ENV/bin/python
Installing setuptools, pip, wheel...done. »
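For readers on Python 3, similar isolation is available from the standard library's `venv` module, without installing virtualenv. This is a swapped-in alternative, not what the commands above use. The directory name `ENV` mirrors the one above, and `with_pip=False` is chosen only to keep the sketch fast.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path, name: str = "ENV") -> Path:
    """Create a bare virtual environment under `parent` and return its path."""
    env_dir = parent / name
    # with_pip=False skips bootstrapping pip; pass True for a usable environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = create_env(Path(tmp))
        # pyvenv.cfg marks the directory as a virtual environment.
        print((env / "pyvenv.cfg").exists())
```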

To delete an environment:
rm -r /path/to/ENV
sudo rm -r /etc/test/ENV/

Open a terminal:
cd Documents
sudo virtualenv ENV4

New python executable in /username/Documents/ENV4/bin/python
Installing setuptools, pip, wheel...done.

cd ENV4
cd bin
source activate

Run source without sudo: it is a shell builtin, so sudo source bin/activate fails.

The prompt changes to show ENV4: we are now in the new environment.
(ENV4)....
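Besides looking at the prompt, the interpreter itself can report whether it runs inside a virtual environment. The sketch below assumes Python 3; the helper name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and recent
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```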

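We can now install Scrapy, with Python 3 or Python 2 as you prefer.

sudo pip install scrapy
« Successfully installed PyDispatcher-2.0.5 cssselect-1.0.3 parsel-1.4.0 queuelib-1.5.0 scrapy-1.5.0 w3lib-1.19.0 »

Be aware that, with its default configuration, sudo does not see the activated environment, so this command installs Scrapy into the system Python (hence the /usr/local/lib/python2.7/dist-packages path in the startproject output further down). To keep Scrapy inside ENV4, create the environment without sudo and run pip install scrapy without sudo as well.

That's it, Scrapy 1.5.0 is installed!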
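To confirm which versions ended up in the current interpreter, the standard library can be queried directly. This is a small sketch assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the pip output above.
    for name in ("Scrapy", "parsel", "w3lib"):
        print(name, installed_version(name))
```

A `None` next to Scrapy means the package is not visible from the interpreter that ran the script, which is a quick way to spot an install that went to a different Python.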

Now let's create the spider.

We are going to scrape this website, https://consultingit.fr, to retrieve its pages.

Some nice Python tutorials:
http://www.diveintopython3.net/
or
https://docs.python.org/3/tutorial


Creating the Scrapy project

Create the project along with its directory:

sudo scrapy startproject consultingit_fr

We use an underscore because dots are not allowed in the project name.
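The underscore is needed because the project name doubles as a Python package name (the output below refers to modules such as `consultingit_fr.spiders`). A rough sketch of turning a domain into a usable name follows. The helper `to_project_name` is hypothetical, and Scrapy's own validation may differ in detail.

```python
import re


def to_project_name(domain: str) -> str:
    """Turn a domain such as 'consultingit.fr' into a valid Python identifier."""
    # Replace every character that is not a letter, digit or underscore.
    name = re.sub(r"\W", "_", domain.strip().lower())
    # Identifiers cannot start with a digit, so prefix an underscore if needed.
    if name and name[0].isdigit():
        name = "_" + name
    return name


if __name__ == "__main__":
    print(to_project_name("consultingit.fr"))
```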

New Scrapy project 'consultingit_fr', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
/username/Documents/ENV4/bin/tutorial/tutorial/spiders/consultingit_fr

You can start your first spider with:
cd consultingit_fr
scrapy genspider consultingit_fr consultingit.fr

I run the command:
sudo scrapy genspider scrappons consultingit.fr

Created spider 'scrappons' using template 'basic' in module:
consultingit_fr.spiders.scrappons

Inside consultingit_fr, we find this directory tree:

consultingit_fr/
    scrapy.cfg            # deploy configuration file
    consultingit_fr/      # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
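As an illustration of what settings.py holds, here is a sketch limited to the values that appear in the "Overridden settings" line of the crawl log further down, plus one added assumption (`DOWNLOAD_DELAY`) that is not part of the original project.

```python
# settings.py (sketch): values marked "from the log" appear in this tutorial's
# crawl output; DOWNLOAD_DELAY is an added assumption, not part of the original.

BOT_NAME = "consultingit_fr"                  # from the log

SPIDER_MODULES = ["consultingit_fr.spiders"]  # from the log
NEWSPIDER_MODULE = "consultingit_fr.spiders"  # from the log

# Fetch robots.txt first and skip the URLs it disallows.
ROBOTSTXT_OBEY = True                         # from the log

# Assumption: wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1
```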

In the « consultingit_fr/spiders » directory, edit the spider « scrappons.py »:
sudo gedit scrappons.py

This is the generated source code:

# -*- coding: utf-8 -*-
import scrapy


class ScrapponsSpider(scrapy.Spider):
    name = 'scrappons'
    allowed_domains = ['consultingit.fr']
    start_urls = ['http://consultingit.fr/']

    def parse(self, response):
        pass

We replace this generated code with the following:

import scrapy


class ScrapponsSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl scrappons
    name = "scrappons"

    def start_requests(self):
        # Pages to download.
        urls = [
            'https://consultingit.fr/fr/programmation-applications-mobiles-android-1',
            'https://consultingit.fr/fr/programmation-applications-mobiles-android-2',
            'https://consultingit.fr/fr/programmation-applications-mobiles-android-3',
            'https://consultingit.fr/fr/formation-devenez-developpeur-java-4',
            'https://consultingit.fr/fr/formation-devenez-developpeur-android-5',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The redirected URLs end with "/", so [-2] is the page name.
        page = response.url.split("/")[-2]
        filename = 'consultingitfr-%s.html' % page
        # Save the raw HTML body of the response.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
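Note that `response.url.split("/")[-2]` yields the page name only when the final URL ends with a slash, as the redirect targets in this crawl do. Without the trailing slash it would return `fr`. A sketch of a more tolerant alternative, using only the standard library, is shown here. The helper name `page_slug` is made up for the example.

```python
from urllib.parse import urlparse


def page_slug(url: str) -> str:
    """Return the last non-empty path segment of `url`."""
    path = urlparse(url).path
    # Strip trailing slashes so '/a/b/' and '/a/b' both yield 'b'.
    return path.rstrip("/").split("/")[-1]


if __name__ == "__main__":
    base = "https://www.consultingit.fr/fr/formation-devenez-developpeur-java-4"
    print(page_slug(base))
    print(page_slug(base + "/"))
```

Inside the spider, this could replace the split expression with `page = page_slug(response.url)`.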

Save the file.


Running the « scrappons » spider with Scrapy

Just run this command:

sudo scrapy crawl scrappons

We then get this log output, which shows that everything was scraped correctly:

2018-04-28 15:13:42 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: consultingit_fr)
2018-04-28 15:13:42 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14+ (default, Mar 13 2018, 15:23:44) - [GCC 7.3.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Linux-4.14.0-kali3-amd64-x86_64-with-Kali-kali-rolling-kali-rolling
2018-04-28 15:13:42 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'consultingit_fr.spiders', 'SPIDER_MODULES': ['consultingit_fr.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'consultingit_fr'}
2018-04-28 15:13:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-04-28 15:13:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-28 15:13:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-28 15:13:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-28 15:13:42 [scrapy.core.engine] INFO: Spider opened
2018-04-28 15:13:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-28 15:13:42 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-28 15:13:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.consultingit.fr/robots.txt> from <GET https://consultingit.fr/robots.txt>
2018-04-28 15:13:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/robots.txt> (referer: None)
2018-04-28 15:13:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.consultingit.fr/fr/programmation-applications-mobiles-android-1/> from <GET https://consultingit.fr/fr/programmation-applications-mobiles-android-1>
2018-04-28 15:13:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/robots.txt> (referer: None)
2018-04-28 15:13:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.consultingit.fr/fr/programmation-applications-mobiles-android-2/> from <GET https://consultingit.fr/fr/programmation-applications-mobiles-android-2>
2018-04-28 15:13:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.consultingit.fr/fr/programmation-applications-mobiles-android-3/> from <GET https://consultingit.fr/fr/programmation-applications-mobiles-android-3>
2018-04-28 15:13:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.consultingit.fr/fr/formation-devenez-developpeur-java-4/> from <GET https://consultingit.fr/fr/formation-devenez-developpeur-java-4>
2018-04-28 15:13:43 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.consultingit.fr/fr/formation-devenez-developpeur-android-5/> from <GET https://consultingit.fr/fr/formation-devenez-developpeur-android-5>
2018-04-28 15:13:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/fr/programmation-applications-mobiles-android-1> (referer: None)
2018-04-28 15:13:43 [scrappons] DEBUG: Saved file consultingitfr-programmation-applications-mobiles-android-1.html
2018-04-28 15:13:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/fr/formation-devenez-developpeur-java-4> (referer: None)
2018-04-28 15:13:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/fr/programmation-applications-mobiles-android-3> (referer: None)
2018-04-28 15:13:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/fr/programmation-applications-mobiles-android-2> (referer: None)
2018-04-28 15:13:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.consultingit.fr/fr/formation-devenez-developpeur-android-5> (referer: None)
2018-04-28 15:13:44 [scrappons] DEBUG: Saved file consultingitfr-formation-devenez-developpeur-java-4.html
2018-04-28 15:13:44 [scrappons] DEBUG: Saved file consultingitfr-programmation-applications-mobiles-android-3.html
2018-04-28 15:13:44 [scrappons] DEBUG: Saved file consultingitfr-programmation-applications-mobiles-android-2.html
2018-04-28 15:13:44 [scrappons] DEBUG: Saved file consultingitfr-formation-devenez-developpeur-android-5.html
2018-04-28 15:13:44 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-28 15:13:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4056,
'downloader/request_count': 13,
'downloader/request_method_count/GET': 13,
'downloader/response_bytes': 84197,
'downloader/response_count': 13,
'downloader/response_status_count/200': 7,
'downloader/response_status_count/301': 6,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 28, 13, 13, 44, 286668),
'log_count/DEBUG': 19,
'log_count/INFO': 7,
'memusage/max': 53669888,
'memusage/startup': 53669888,
'response_received_count': 7,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2018, 4, 28, 13, 13, 42, 644867)}
2018-04-28 15:13:44 [scrapy.core.engine] INFO: Spider closed (finished)
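The stats block can be cross-checked with a little arithmetic. The log shows six 301 redirects (robots.txt plus the five pages, each redirected from consultingit.fr to www.consultingit.fr) and seven 200 responses, which together account for the 13 responses. The sketch below copies the relevant values from the log; the helper name `summarize` is made up. The stats times are two hours behind the log timestamps, which is consistent with UTC versus French summer time.

```python
from datetime import datetime

# Values copied from the "Dumping Scrapy stats" block above.
stats = {
    "downloader/request_count": 13,
    "downloader/response_count": 13,
    "downloader/response_status_count/200": 7,
    "downloader/response_status_count/301": 6,
    "start_time": datetime(2018, 4, 28, 13, 13, 42, 644867),
    "finish_time": datetime(2018, 4, 28, 13, 13, 44, 286668),
}


def summarize(stats: dict) -> dict:
    """Derive a few totals from a Scrapy stats dictionary."""
    ok = stats["downloader/response_status_count/200"]
    redirects = stats["downloader/response_status_count/301"]
    elapsed = stats["finish_time"] - stats["start_time"]
    return {
        "ok": ok,
        "redirects": redirects,
        # True when 200s and 301s add up to the total number of responses.
        "accounted_for": ok + redirects == stats["downloader/response_count"],
        "elapsed_seconds": elapsed.total_seconds(),
    }


if __name__ == "__main__":
    print(summarize(stats))
```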

Listing the directory with ls shows the following files:

consultingitfr-formation-devenez-developpeur-android-5.html
consultingitfr-formation-devenez-developpeur-java-4.html
consultingitfr-programmation-applications-mobiles-android-1.html
consultingitfr-programmation-applications-mobiles-android-2.html
consultingitfr-programmation-applications-mobiles-android-3.html

These files contain the HTML code of the scraped pages.

To view the scraped content, type the command:

sudo gedit consultingitfr-programmation-applications-mobiles-android-3.html

Result of the scraping

There we go: we have scraped, or "hoovered up", the whole list of use cases:

Match behaviors Predict Geo-locations Proximite events Report Detect neighbors Trace routes Reverse geocode to real address Historise location Find the best itinerary Find points of interests along the way get notices about nearby friends get notices about friends deal with our current positions offer information about places we have been keep in touch with users we’ve met Avoid filling interfaces as much as possible forecast my trips forecasting my future positions Telling where i will be Integrate predictive models into apps Make matchings between mobile users Make matchings between points of interest Activate geolocation functions with only a few line of code API gives mobile developers access to geolocation functionality analytics Provide geolocation Provide routing integrate the API with 6 lines of code integrate the API in a few minutes detect preferential moving targets easily find neighbors nearby proactively predict users’ movements organize user meetings offer ideal matches be affordable to all developers easily accessible to all developers. 
estimate in real time where the user is actually going estimate where the user will go next estimate when the user will go next Run geolocation in the mobile’s background preserve the user’s battery Notify the user Detect motion Encrypt data Tell the developer to use a laptop request for “matches” between users request for “matches” between static points-of-interest detects a match sends a push notification to the mobile device sends a push notification to the servers match users using their past positions match users using their current positions match users using current destinations match users using future proximity match users using future intent-to-travel Tag every user within his “Birthdate” parameter Request matches of similarity for ages Request matches of proximity of users feed matching algorithms match users with non-user objects feed database locate point-of-interest describe the objects with similar tags that users have request for matching between users request for the future positions of users Send encrypted tags Send matchings Send events Send reports on matchings Send locations Send interests Send POI infos Send stocks Accept that their personal data is required Collects geolocation data collect location history collect paths collect journeys Encrypt data to databases Store Encryption Key on the telephone Store Encryption Key in in-house secret server OpenSource IOS code OpenSource ANDROID code Notify the user of what we are doing with the geolocation Show the user what is known about him Delete all user data Create white zones when application goes asleep Create white zones where application goes asleep Make functions “ready to use” Make functions “ customizable ” Make applicaiton in native iOS (Swift and ObjectiveC) Make application in native Android (Java) Embed application with a few lines (duplicate use case) geo-track in the background Make all the self-dependent functions work immediately Generate API Key Provide code examples Use the 
interface add points-of-interest and tag them with your specific application data request information about users (geolocations, geo-predictions, geo-correlations) request for backend notifications in case of events or matches trigger events into your application mobile nodes request for user-privacy data to be fully removed from our backend Use console generate and manage API keys for each of your applications visualize the movements of your users (or groups of users) over time test and validate Geo-correlation matching syntax
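The saved files hold raw HTML, so getting from a file to plain text like the above means stripping the markup. Here is a minimal sketch of that step with the standard library's `html.parser`. It is not necessarily how the listing above was produced, and the names `TextExtractor` and `html_to_text` as well as the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text chunks of `html`, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # Made-up sample markup, not the real page.
    sample = "<table><tr><td>Match behaviors</td><td>Predict Geo-locations</td></tr></table>"
    print(html_to_text(sample))
```

To apply it to one of the saved files, read the file into a string first (assuming UTF-8 encoding) and pass that string to `html_to_text`.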

Going further...

Next, we can dig deeper with a bot that scrapes Facebook. We will look at that in part 2.

To query the Facebook API you need credentials: a login and password, an access token and a private key.
https://developers.facebook.com/docs/facebook-login/access-tokens

Example:

https://www.forbes.com/sites/thomasbrewster/2018/04/16/huge-facebook-facial-recognition-database-built-by-ex-israeli-spies/

Scraping a page:

http://minimaxir.com/2015/07/facebook-scraper/

Facebook Graph API documentation: https://developers.facebook.com/docs/graph-api/reference

Base page to scrape to retrieve the information:
https://developers.facebook.com/docs/graph-api/reference/page

Then we retrieve the information from the feed:
https://developers.facebook.com/docs/graph-api/reference/v2.4/page/feed

Each user has a unique « id », which can be used to build a link to the information of interest.

Interesting elements to scrape:

page_id
link
created_time
message
type
likes
comments
shares
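One way to hold these elements once they have been fetched is a small record type, sketched below. It does not call the Graph API. The names `PostRecord` and `to_record` are hypothetical, and the input layout (a flat dictionary with counts as plain integers) is an assumption, since a real response may nest these values differently depending on the API version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostRecord:
    """One row per post, using the element names listed above."""
    page_id: str
    link: Optional[str]
    created_time: Optional[str]
    message: Optional[str]
    type: Optional[str]
    likes: int
    comments: int
    shares: int


def to_record(page_id: str, post: dict) -> PostRecord:
    """Build a PostRecord from a decoded post dictionary (layout assumed)."""
    return PostRecord(
        page_id=page_id,
        link=post.get("link"),
        created_time=post.get("created_time"),
        message=post.get("message"),
        type=post.get("type"),
        # Assumption: counts are already plain integers; default to 0 if absent.
        likes=int(post.get("likes", 0)),
        comments=int(post.get("comments", 0)),
        shares=int(post.get("shares", 0)),
    )


if __name__ == "__main__":
    print(to_record("123", {"message": "hi", "likes": 2}))
```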


There you go, I hope you enjoyed it. Your comments and remarks are welcome.