Archives de catégorie : Informatique

Web scraping with python (part 1 : crawling)

Example One : I am looking for my next job. So I subscribe to many job sites in order to receive notifications by email of new job ads (example = Monster…). But I’d rather check these in my RSS aggregator instead of my mailbox. Or in some sort of aggregating Web platform. Thus, I would be able to do many filtering/sorting/ranking/comparison operations in order to navigate through these numerous job ads.

Example Two : I want to buy a digital camcorder. So I want to compare the available models. Such a comparison implies that I rank the most common models according to their characteristics. Unfortunately, the many sites providing reviews or comparisons of camcorders are not often comprehensive and they don’t offer me the capability of comparing them with respect to my way of ranking and weighting the camcorder features (example = dvspot). So I would prefer pumping all the technical stuff from these sites and manipulate this data locally on my computer. Unfortunately, this data is merged within HTML. And it may be complex to extract it automatically from all the presentation code.

These are common situations : interesting data spread all over the web and merged in HTML presentation code. How to consolidate this data so that you can analyze and process it with your own tools ? In some near future, I expect this data will be published so that it is directly processable by computers (this is what the Semantic Web is intending to do). For now, I was used to do it with Excel (importing Web data, then cleaning it and the like) and I must admit that Excel is fairly good at it. But I’d like some more automation for this process. I’d like some more scripting for this operation so that I don’t end with inventing complex Excel macros or formulas just to automate Web site crawling, HTML extraction and data cleaning. With such an itch to scratch, I tried to address this problem with python.

This series of messages introduces my current hacks that automate web sites crawling and data extraction from HTML pages. The current output of these scripts is a bunch of CSV files that can be further processed … in Excel. I wish I would output RDF instead of CSV. So there remains much room for further improvement (see RDF Web Scraper for a similar but approach). Anyway… Here is part One : how to crawl complex web sites with Python ?. The next part will deal with data extraction from the retrieved web pages, involving much HTML cleansing and parsing.

My crawlers are fully based on the John L. Lee’s mechanize framework for python. There are other tools available in Python. And several other approaches are available when you want to deal with automating the crawling of web sites. Note that you can also try to scrape the screens of legacy terminal-based applications with the help of python (this is called “screen scraping”). Some approaches of web crawling automation rely on recording the behaviour of a user equipped with a web browser and then reproduce this same behaviour in an automated session. That is an attractive and futuristic approach. But this implies that you find a way to guess what the intended automatic crawling behaviour will be from a simple example. In other words, with this approach, you have either to ask the user to click on every web link (all the job postings…) and this gives no value to the automation of the task. Or your system “guesses” what automatic behaviour is expected just by recording a sample of what a human agent would do. Too complex… So I preferred a more down-to-earth solution implying that you write simple crawling scripts “by hand”. (You may still be interested in automatically record user sessions in order to be more productive when producing your crawling scripts.) As a summary : my approach is fully based on mechanize so you may consider the following code as example of uses of mechanize in “real-world” situations.

For purpose of clarity, let’s first focus on the code part that is specific to your crawling session (to the site you want to crawl) . Let’s take the example of the dvspot.com site which you may try to crawl in order to download detailed description of camcorders :

    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
    #
    # Navigate through the paginated list of cameras
    #
    next_page = 0
    while next_page == 0:
     #
     # Display and save details of every listed item
     #
     url = b.response.url
     next_element = 0
     while next_element >= 0:
      try:
       b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
       next_element = next_element + 1
       print save_response(b,"dvspot_camera_"+str(next_element))
       # go back to home page
       b.open(url)
       # if you crawled too many items, stop crawling
       if next_element*next_page > MAX_NR_OF_ITEMS_PER_SESSION:
          next_element = -1
          next_page = -1
      except LinkNotFoundError:
       # You certainly reached the last item in this page
       next_element = -1
    #
     try:
      b.open(url)
      b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
      print "processing Next Page"
     except LinkNotFoundError:
      # You reached the last page of the listing of items
      next_page = -1

You noticed that the structure of this code (conditional loops) depends on the organization of the site you are crawling (paginated results, …). You also have to specify the rule that will trigger “clicks” from your crawler. In the above example, your script first follows every link containing “cameraDetail” in its URL (url_regex). Then it follows every link containing “Next Page” in the hyperlink text (text_regex).

This kind of script is usually easy to design and write but it can become complex when the web site is improperly designed. There are two sources of difficulties. The first one is bad HTML. Bad HTML may crash the mechanize framework. This is the reason why you often have to pre-process the HTML either with the help of a HTML tidying library or with simple but string substitutions when your tidy library breaks the HTML too much (this may be the case when the web designer improperly decided to used nested HTML forms). Designing the proper HTML pre-processor for the Web site you want to crawl can be tricky since you may have to dive into the faulty HTML and the mechanize error tracebacks in order to identify the HTML mistakes and workaround them. I hope that future versions of mechanize would implement more robust HTML parsing capabilities. The ideal solution would be to integrate the Mozilla HTML parsing component but I guess this will be some hard work to do. Let’s cross our fingers.

Here are useful examples of pre-processors (as introduced by some other mechanize users and developpers) :

class TidyProcessor(BaseProcessor):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding='latin1',
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
class MyProcessor(BaseProcessor):
      def http_response(self, request, response):
          r = response.read()
          r = r.replace('"image""','"image"')
          r = r.replace('"','"')
          return FakeResponse(response, r)
      https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())

The second source of difficulties comes from non-RESTful sites. As an example the APEC site (a French Monster-like job site) is based on a proprietary web framework that implies that you cannot rely on links URLs to automate your browsing session. It took me some time to understand that, once loggin in, every time you click on a link, you are presented with a new frameset referring to the URLs that contain the interesting data you are looking for. And these URLs seem to be dependent on your session. No permalink, if you prefer. This makes the crawling process even more tricky. In order to deal with this source of difficulty when you write your crawling script, you have to open both your favorite text editor (to write the script) and your favorite web browser (Firefox of course !). One key knowledge is to know mechanize “find_link” capabilities. These capabilities are documented in _mechanize.py source code, in the find_link method doc strings. They are the arguments you will provide to b.follow_link in order to automate your crawler “clicks”. For more convenience, let me reproduce them here :

text: link text between link tags: <a href=”blah”>this bit</a> (as
returned by pullparser.get_compressed_text(), ie. without tags but
with opening tags “textified” as per the pullparser docs) must compare
equal to this argument, if supplied
text_regex: link text between tag (as defined above) must match the
regular expression object passed as this argument, if supplied
name, name_regex: as for text and text_regex, but matched against the
name HTML attribute of the link tag
url, url_regex: as for text and text_regex, but matched against the
URL of the link tag (note this matches against Link.url, which is a
relative or absolute URL according to how it was written in the HTML)
tag: element name of opening tag, eg. “a”
predicate: a function taking a Link object as its single argument,
returning a boolean result, indicating whether the links
nr: matches the nth link that matches all other criteria (default 0)

Links include anchors (a), image maps (area), and frames (frame,iframe).

Enough with explanations. Now comes the full code in order to automatically download camcorders descriptions from dvspot.com. I distribute this code here under the GPL (legally speaking, I don’t own the copyleft of this entire code since it is based on several snippets I gathered from the web and wwwsearch mailing list). Anyway, please copy-paste-taste !

from mechanize import Browser,LinkNotFoundError
from ClientCookie import BaseProcessor
from StringIO import StringIO
# import tidy
#
import sys
import re
from time import gmtime, strftime
#
# The following two line is specific to the site you want to crawl
# it provides some capabilities to your crawler for it to be able
# to understand the meaning of the data it is crawling ;
# as an example for knowing the age of the crawled resource
#
from datetime import date
# from my_parser import parsed_resource
#
"""
 Let's declare some customized pre-processors.
 These are useful when the HTML you are crawling through is not clean enough for mechanize.
 When you crawl through bad HTML, mechanize often raises errors.
 So either you tidy it with a strict tidy module (see TidyProcessor)
 or you tidy some errors you identified "by hand" (see MyProcessor).
 Note that because the tidy module is quite strict on HTML, it may change the whole
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce
 unintended results such as closing the form too early or making it empty. This is the reason
 you may have to use MyProcessor instead of TidyProcessor.
"""
#
class FakeResponse:
      def __init__(self, resp, nudata):
          self._resp = resp
          self._sio = StringIO(nudata)
#
      def __getattr__(self, name):
          try:
              return getattr(self._sio, name)
          except AttributeError:
              return getattr(self._resp, name)
#
class TidyProcessor(BaseProcessor):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding='latin1',
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
class MyProcessor(BaseProcessor):
      def http_response(self, request, response):
          r = response.read()
          r = r.replace('"image""','"image"')
          r = r.replace('"','"')
          return FakeResponse(response, r)
      https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
#
""""
 Let's declare some utility methods that will enhance mechanize browsing capabilities
"""
#
def find(b,searchst):
    b.response.seek(0)
    lr = b.response.read()
    return re.search(searchst, lr, re.I)
#
def save_response(b,kw='file'):
    """Saves last response to timestamped file"""
    name = strftime("%Y%m%d%H%M%S_",gmtime())
    name = name + kw + '.html'
    f = open('./'+name,'w')
    b.response.seek(0)
    f.write(b.response.read())
    f.close
    return "Response saved as %s" % name
#
"""
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.
"""
#
def dvspot_crawl():
    """
     Here starts the browsing session.
     For every move, I could have put as a comment an equivalent PBP command line.
     PBP is a nice scripting layer on top of mechanize.
     But it does not allow looping or conditional browsing.
     So I preferred scripting directly with mechanize instead of using PBP
     and then adding an additional layer of scripting on top of it.
    """
#
    MAX_NR_OF_ITEMS_PER_SESSION = 500
    #
    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
    #
    # Navigate through the paginated list of cameras
    #
    next_page = 0
    while next_page == 0:
     #
     # Display and save details of every listed item
     #
     url = b.response.url
     next_element = 0
     while next_element >= 0:
      try:
       b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
       next_element = next_element + 1
       print save_response(b,"dvspot_camera_"+str(next_element))
       b.open(url)
       # if you crawled too many items, stop crawling
       if next_element*next_page > MAX_NR_OF_ITEMS_PER_SESSION:
          next_element = -1
          next_page = -1
      except LinkNotFoundError:
       # You reached the last item in this page
       next_element = -1
    #
     try:
      b.open(url)
      b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
      print "processing Next Page"
     except LinkNotFoundError:
      # You reached the last page of the listing of items
      next_page = -1
    #
    return
#
#
#
if __name__ == '__main__':
#
    """ Note that you may need to specify your proxy first.
    On windows, you do :
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080
    """
    #
    dvspot_crawl()

In order to run this code, you will have to install mechanize 0.0.8a, pullparser 0.0.5b, clientcookie 0.4.19, clientform 0.0.16 and utidylib. I used Python 2.3.3. Latest clientcookie’s version was to be integrated into Python 2.4 I think. In order to install mechanize, pullparser, clientcookie and clientform, you just have to do the usual way :

python setup.py build
python setup.py install
python setup.py test

Last but not least : you should be aware that you may be breaking some terms of service from the website you are trying to crawl. Thanks to dvspot for providing such valuable camcorders data to us !

Next part will deal with processing the downloaded HTML pages and extract useful data from them.

Identity Management et Customer Relationship Management

Un nouveau concept vient d’être introduit par le Gartner dans le domaine de la gestion de la relation client : le Customer Interaction Hub désigne l’intégration en une seule plate-forme de toutes les applications liées à la gestion de la relation client.
Selon Phil Windley, la mise en oeuvre de ces concentrateurs de la relation clients suppose des méthodes de travail identiques à celles qui permettent l’intégration des identités électroniques (synchronisation, fédération, provisioning, …).

The three laws of identity

Microsoft has definetely not been the leader in identity management systems. For sure, MS Active Directory is being widely deployed. But it is an authentication and administration infrastructure, not an identity management solution. Because of MS ADAM (lightweight LDAP directory) and MS IIS (directory synchronization aka metadirectory) lacking features (compared to their competitors), I often thought that Microsoft was lacking some visionary approach of identity management. I was definitely wrong. They’ve got a guy who seems to have perfectly identified the (long term) future of identity management.

RDF vu par un pouet

La technologie RDF est la brique de base du web sémantique. Dans une grange, un poète vous rappelle qu’à l’école maternelle, vous saviez déjà faire du RDF avec des ours funambules.

Firefox gagne 5% des parts de marché de MS Internet Explorer

Le navigateur web Firefox, de la fondation Mozilla, a pris 5% des parts de marché des navigateurs web à son concurrent, Microsoft Internet Explorer. Et pourtant cela ne fait que quelques semaines que la version 1.0 de Firefox a été publiée. Mais la supériorité de Firefox sur IE est déjà largement vantée par la presse, ce qui accélère le mouvement de migration…
Ces données de part de marché sont publiées par OneStat, le “leader mondial des statistiques du web en temps réel”, bref une source a priori fiable. Plus de commentaires, sur Slashdot.

Alexandre Gueniot va trouver un job

C’est sûr ! Alexandre Gueniot va trouver un job, pas de doute vu que son CV, en plus, d’être original, amusant et bien fait, est en train de faire le tour de la planète par messagerie interposée. Vous avez bien un collègue/ami plus ou moins informaticien qui vous envoie de temps en temps des blagues ou autres fichiers .pps par mail ? Et bien il va bientôt vous envoyer un lien vers le CV d’Alexandre. Ne manquez pas d’aller consulter ce CV pas comme les autres !

Carnets web en entreprise : suite

Pour faire suite à mes deux derniers messages sur les carnets web en entreprise, ne pas louper la lecture des articles connexes de Padawan et de Loic Le Meur. Au passage, bonjour à Gilles (celui qui est en vrac et en ligne) qui s’intéresse québecquoisement aux mêmes sujets ! :)

Portails / CMS en J2EE

Pour créer un portail d’entreprise en J2EE, il y a le choix entre acheter un coûteux portail propriétaire (IBM ou BEA pour ne citer que les leaders des serveurs d’application J2EE) ou recourir à un portail J2EE open source. Mais autant l’offre open source en matière de serveurs d’application J2EE (JBoss, Jonas) atteint une certaine maturité qui la rend crédible pour des projets de grande envergure, autant l’offre open source en matière de portails J2EE semble largement immature. Ceci semble fermer à l’open source le marché des portails et de la gestion de contenu des grandes entreprises pour encore de nombreuses années.

Aux yeux de la communauté J2EE, des cabinets de conseil du secteur et des gros éditeurs, le meilleur produit du marché sera nécessairement celui qui supportera au moins les deux standards du moment : JSR 168 pour garantir la portabilité des portlets d’un produit à l’autre, et WSRP pour garantir l’interopérabilité des portlets distantes entre leur serveur d’application et le portail qui les agrège et les publie. Il y a donc dans cette gamme de produit une course à celui qui sera le plus dans la mode de la “SOA” (Service-Oriented Architecture). Comme portails J2EE open source, on cite fréquemment Liferay et Exo. Cette offre open source n’est pas étrangère à la fanfaronnade SOA (il faut bien marketer les produits, eh oui…). Du coup, l’effort de développement des portails J2EE open source semble davantage porter sur l’escalade de la pile SOA que sur l’implémentation de fonctionnalités utiles. C’est sûrement ce qui amène la communauté J2EE à constater que les portails J2EE open source manquent encore beaucoup de maturité et de richesse fonctionnelle surtout lorsqu’on les compare à Plone, leader du portail / CMS open source. En effet, Plone s’appuie sur un serveur d’application Python (Zope) et non Java (a fortiori non J2EE) ; il se situe donc hors de la course à JSR168 et semble royalement ignorer le bluff WSRP.

Nombreuses sont les entreprises qui s’évertuent à faire de J2EE une doctrine interne en matière d’architecture applicative. Confrontées au choix d’un portail, elles éliminent donc rapidement l’offre open source J2EE (pas assez mûre). Et, plutôt que de choisir un portail non J2EE reconnu comme plus mûr, plus riches en fonctionnalités et moins coûteux, elles préfèrent se cantonner à leur idéologie J2EE sous prétexte qu’il n’y a point de salut hors J2EE/.Net. Pas assez buzzword compliant, mon fils… Pfff, ne suivez pas mon regard… :-(

Blogs, klogs, plogs… en entreprise

On les appelle couramment des weblogs, ou blogs pour faire plus court, ou carnets web pour faire plus francophone. Certains carnets web s’étant spécialisés, on a poussé les néologismes : klogs désigne les carnets web dédiés au partage de connaissance (k~~nowledge~~logs) ; plogs désigne les carnets de bord d’équipes projets (p~~roject~~ logs) ; moblogs désigne les carnets web dont la mise à jour s’effectue depuis un PDA ou un téléphone portable (mob~~ile~~ logs). Sans compter les photologs et autres vlogs (v~~ideo~~ logs). Et puisque les carnets Web font leur entrée dans le monde de l’entreprise, on en revient à dire que blog = business-log.

Un journaliste du magazine CIO (dédié aux directeurs informatiques) confirme que parmi les plus grosses entreprises au monde (Fortune 500), un nombre significatif utilisent des blogs au sein de leurs départements informatiques notamment en tant que p~~roject~~ logs pour coordonner et commenter l’avancement de projets informatiques.

Il cite dans son article les motivations de ces entreprises, et ses lecteurs en ajoutent quelques unes :

[leur] mélange de commentaires critiques est vu davantage comme constructif que l’inverse
si j’étais un gestionnaire des ventes d’un géant pharmaceutique, j’apprécierais de pouvoir de temps en temps parcourir le carnet de mon interlocteur informatique qui installe un système d’automatisation des forces de vente
[on peut] difficilement imaginer un meilleur moyen d’ancrer les nouveaux membres d’un service informatique dans un même contexte
un plog donne l’occasion à un leader d’observer dans son ensemble le “storyboard continu” [de son projet] pour évaluer si les actions ou les réflexions en cours vont permette de produire les livrables attendus pour le projet

Il peut ainsi réagir comme le ferait un réalisateur ou un metteur en scène
contrairement aux approches top-down habituelles du knowledge management,

les plogs et leurs cousins permettent au savoir de rester proche du créateur de ce savoir

Les carnets web sont des outils individuels et qui valorisent la contribution de l’individu plutôt que de le noyer dans la masse
les blogs prennent le relais électronique de la machine à café

L’auteur de cet article, et ceux qui l’ont commenté, citent également divers risques qu’il s’agit de gérer intelligemment dans l’adoption des carnets web en entreprise :

rester constructif : la motivation des lecteurs d’un carnet de projet doit davantage être la curiosité (savoir où en est le projet par exemple) que la volonté d’interférer
prévenir les ingérances indues : à la lecture d’un plog, grande peut être la tentation de devenir un micromanager qui interfère indûment dans les affaires en cours
prévenir les crises de blogorrhée :

la ligne entre la libre expression et l’auto-indulgence est effroyablement fine

et les carnetiers peuvent avoir tendance à verser dans l’auto-promotion ou le noyage de leurs lecteurs potentiels dans une prose égocentrique qui n’intéresse qu’eux-mêmes
éviter de communiquer plutôt que de travailler : il arrive qu’à force de prendre du plaisir à communiquer avec ses collègues, on en perde le sens des priorités !
ne pas se laisser abuser par une belle communication : un plog peut devenir un outil de politique de couloirs, une caisse de résonance pour ceux qui savent que leur manager n’est pas capable de distinguer les vantards des collaborateurs efficaces
être efficace : pour que les carnets web ne soient pas “encore une autre tentive de gérer les connaissances”, il s’agit que leur adoption soit guidée par le pragmatisme et les usages qu’en font les utilisateurs pilotes et non par les concepts ou les outils

Dave Pollard avait quant à lui, sur son carnet Web, réuni un certain nombre de (bons) conseils pour mettre en oeuvre un politique de carnettage dans une entreprise :

Les blogs sont individuels (non aux carnets d’équipes)
La taxonomie d’un blog doit rester propre à son auteur (et ne pas se perdre dans une politique bureaucratique ou technocratique de classements/catégorisation de concepts !). Elle reprendra typiquement la manière dont l’auteur organise le répertoire “Mes documents” de son poste de travail ou bien sa boîte aux lettres ou plus simplement son armoire.
Les meilleurs candidats au carnettage en entreprise sont ceux qui ont déjà l’habitude de publier abondamment en entreprise : éditeurs de newsletters, experts, communiquants. Ce sont ceux que l’on citera spontanément en répondant à la question : “lequel de vos collaborateurs a des fichiers dont le contenu vous serait le plus utile dans votre travail ?”
Pour chaque possesseur d’un carnet web, demandez à vos informaticiens de convertir en HTML et de mettre en ligne dans son carnet Web l’ensemble de ses fichiers bureautique, pour constituer une archive qui apportera une valeur immédiate à ses lecteurs.
Avec l’aide de vos équipes marketing, créer chez vos clients l’envie d’accéder à certains carnets web de vos collaborateurs, comme si il s’agissait d’un canal privilégié de relation avec l’entreprise.

Le journaliste de CIO.com estime que

les organisations IT qui utilisent efficacement les blogs comme outils de management (ou comme ressources pour la communication) sont probablement des environnements de développement [humain] qui prennent au sérieux les personnes et les idées.

Il estime enfin que

lorsqu’un développeur ou un manager ou un chargé de support clientèle réussit à produire un plog qui suscite l’attention, qui sensibilise et qui suscite le changement, alors c’est une compétence qui mérite reconnaissance et récompense.

Linux vu par le Gartner

Le Gartner Group a récemment réalisé une enquête auprès de reponsables de centres de données. Voici quelques extraits des résultats de cette enquête : 42% des interrogés sont encore en train d’expérimenter Linux pour en évaluer l’intérêt pour eux, 34% ont adopté Linux dans leur centre après avoir reconnu ses avantages et sa maturité, 9% en sont déjà à une phase où ils en tirent des bénéfices tangibles. Chose plus étonnante, 30% d’entre eux prévoient de déployer Linux pour supporter des applications départementales ou sectorielles (et non pas seulement en tant qu’OS pour les serveurs d’infrastructure). L’adoption de Linux se fait principalement aux dépends d’Unix propriétaires mais également en partie aux dépends de serveurs Windows. Les trois fournisseurs préférés de service pour l’accompagnement de déploiements Linux en datacenters sont IBM, Red Hat et HP. Les principaux freins à l’adoption de Linux dans les datacenters sont surtout le manque de compétences des personnels mais aussi le manque d’applications et le manque d’outils d’administration et de supervision (la gestion de datacenters implique de forts besoin en la matière).

Mozilla Firefox plutôt que Microsoft Internet Explorer ?

Dans les grandes entreprises comme ailleurs, la sécurité est un objectif prioritaire (si ce n’est L’OBJECTIF prioritaire) des directions informatiques. A l’heure où certaines se demandent si il est opportun d’inscrire le navigateur Internet Explorer comme standard interne obligatoire pour le Web, la concurrence entre navigateurs semble enfin se réveiller.

Le marché est actuellement dominé par Internet Explorer, l’offre de Microsoft (environ 95% des parts de marché ?). Mais les innombrables failles de sécurité qui ont été identifiées cette année dans IE ont sapé la confiance de nombre d’informaticiens envers ce produit. A tel point que le CERT, observatoire de la sécurité informatique dépendant du ministère de l’intérieur américain recommande aux utilisateurs d’envisager l’abandon d’IE au profit de navigateurs alternatifs, tels que Firefox, de la fondation Mozilla. Depuis lors, les recommandations d’experts au sujet de Firefox se sont multipliées dans la presse spécialisée (voir les articles de 01Net et du Journal du Net) mais aussi dans la presse professionnelle généraliste aux USA (voir ci-après). Et les observateurs ont constaté que, pour la première fois depuis longtemps, le monopole d’IE sur le marché des navigateurs semble vaciller.

On a ainsi pu lire dans Business Week :

Bien que la part de marché du navigateur Web de Mozilla soit encore petite, sa croissance tire partie de ceux qui sont soucieux des failles de sécurité du produit dominant de chez Microsoft […] Pour la première fois depuis plus de sept ans, Microsoft perd des parts dans le marché des navigateurs Web. Et il ne s’agit pas seulement d’un effet de bord. […] Firefox est-il supérieur à IE ? Les analystes qui ont comparé les deux disent que Firefox s’affiche plus rapidement et ouvre les pages Web plus rapidement. Il a quelques fonctionnalités qu’IE n’a pas […].

On a également pu lire dans le Wall Street Journal :

[Comment] naviguer en sécurité : je suggère d’abandonner le navigateur Web Internet Explorer de Microsoft qui a accumulé les failles de sécurité. Je recommande de le remplacer par Mozilla Firefox, qui est gratuit et librement disponible sur www.mozilla.org. Il s’agit d’un produit non seulement plus sûr mais également plus moderne et plus avancé avec la navigation par onglets qui permet à de multiples pages de s’afficher sur un même écran et un bloqueur de publicité “pop-up” meilleur que celui que Microsoft a tardivement ajouté à IE.

Parmi les autres innovations apportées par Firefox, on peut également retenir Live Bookmarks, une solution de suivi d’actualités en ligne (Yahoo, BBC, presse française, carnets Web, …) et de partage de bookmarks en ligne.

Le site news.com rapporte que les parts de marché de Firefox ont progressé au détriment de Microsoft. Sur ce site d’actualité informatique grand public, les utilisateurs Firefox seraient passés de 8% en janvier 2004 à 18% en septembre 2004. D’après une entreprise de mesure d’audience, 1,8% des utilisateurs des sites de e-commerce auraient abandonné IE au profit de Firefox entre juin et septembre 2004, soit une part de marché de 5,2% pour Firefox en septembre 2004.

Le responsable de la sécurité informatique chez Microsoft a même admis (malencontreusement) dans la presse (magazine Wired) qu’il avait dû abandonner IE au profit de Firefox afin de protéger son ordinateur contre une menace de piratage.

Pour la récente publication de la version 1.0 PREVIEW de firefox, début septembre 2004, le site de Mozilla a enregistré plus d’un million de téléchargements en moins de 100 heures.

La vision de chez Mac Donald Bradley au sujet du web sémantique

J’ai été très impressionné par la qualité de la vision du directeur scientifique de chez Mc Donald Bradley au sujet du web sémantique. Il présente non seulement de très justes illustrations de la vision de Tim Berner’s Lee mais il la remet également de manière très pertinente dans le contexte général de l’évolution de l’informatique sur les dernières décennies, à travers notamment la perspective d’applications concrètes pour l’entreprise. Sa déclaration d’indépendance des données laisse présager un avenir excellent pour la nouvelle discipline informatique qu’est l’architecture de l’information. McDonald Bradley est une entreprise que je trouve d’autant plus intéressante qu’elle se positionne sur des marchés verticaux clairement délimités, au sein du secteur public (et donc précurseurs en matière d’open source) : les services de renseignement, la défense, la sécurité, les finances publiques et les collectivités locales. A rapprocher des interrogations de Kendall Grant Clark au sujet de l’appropriation du web sémantique par les communautés du libre ? Malheureusement, je crains qu’il n’existe pas d’entreprise équivalente en France…

Web scraping with Python

Here is a set of resources for scraping the web with the help of Python. The best solution seems to be Mechanize plus Beautiful Soup.

Solution open source de gestion des identités

Linagora propose une solution complète de gestion des identités électroniques appuyée sur des logiciels open source : InterLDAP (s’appuyant notamment sur AACLS).
Avantages : une couverture fonctionnelle très large (WebSSO, gestion de listes de diffusion, administration déléguée, infrastructure d’annuaires avec réplication, flexibilité extrême dans la gestion des règles de contrôle d’accès, user provisioning…), un coût de licences nul, des choix technologiques prudents et “industriels” (J2EE, OpenLDAP, respect des standards ouverts…).
Inconvénients : encore peu “packagée” d’un point de vue marketing, pas de grosse référence dans le secteur privé, difficile à acheter par un acteur privé (cf. ci-dessous), difficilement comparable avec un produit propriétaire concurrent.

Pour reprendre des thèmes évoqués avec Nicolas Chauvat de Logilab lors d’une récente conversation, ce type d’offres souffre de défauts communs à la plupart des offres produits open source lorsqu’il s’agit de les vendre notamment au secteur privé. Comment les acheteurs de grandes sociétés du CAC 40 peuvent-ils évaluer leur performance (et donc leur prime de fin d’année ?), eux qui ont l’habitude de la mesurer à l’aune du différentiel entre le prix public des licences et le prix négocié ? Les modes d’achats logiciels des gros acteurs privés ne sont vraiment pas adaptés à l’achat de solutions open source. En effet, le coût de licence étant nul, c’est l’évaluation d’un coût total de possession qui peut seul permettre la comparaison entre une offre open source et une offre propriétaire. Or, outre le fait que les modèles de TCO sont généralement peu fiables ou alors très coûteux à mettre en place, il est difficile de prévoir le coût des développements spécifiques/personnalisation. L’achat d’un développement au forfait suppose normalement que l’acheteur soit capable de fournir un cahier des charges fonctionnel du besoin détaillé et stable pour que le fournisseur en concurrence puisse s’engager sur un coût prévisionnel et prenne à sa charge le coût du risque lié à cette prévision. Mais le problème, c’est que l’acheteur est généralement incapable de fournir un tel dossier car les services informatiques sont généralement trop contraints par les délais et l’imprévisibilité des maîtrises d’ouvrage pour pouvoir formaliser ce cahier des charges dans de bonnes conditions. Cela entraîne des pratiques d’achat dans lesquelles on compare d’une part un coût de licence et de support et d’autre part un coût du jour.homme pour le développement spécifique. Dès lors comment comparer un produit propriétaire pour lequel l’essentiel du coût présenté à l’achat est celui de la licence avec un produit open source pour lequel l’essentiel du coût présenté à l’achat est celui de l’intégration ?

Etant données la marge d’incertitude liée aux spécifications fonctionnelles (imprécises, peu stables), la marge d’incertitude liée au modèle de calcul du TCO et la marge d’incertitude liée à l’évaluation de l’adéquation d’un produit au besoin fonctionnel, il paraît relativement vain aux acteurs privés de vouloir considérer des solutions open source dans leurs appels d’offres. Ce n’est que lorsque la nature de la relation du client à son fournisseur est un élément stratégique de décision que le choix open source paraît devenir évident. Lorsque le client cherche une solution qui lui donne plus d’autonomie vis-à-vis de l’éditeur, c’est alors qu’il peut voir la valeur d’une offre open source. Mais, même dans ce cas, peut-on réellement quantifier cette valeur ? Est-elle défendable devant des acheteurs et des décideurs du secteur privé ? Sans doute. Pourtant, je n’ai encore jamais vu d’argumentaire marketing soutenu par un modèle financier solide et reconnu par la presse professionnelle qui arrive à démontrer cette valeur. Les entreprises françaises croient encore qu’open source signifie soit gratuit, soit bricolé par des non professionnels, soit “non industriel” ou, dans le meilleur des cas, croient que l’open source signifie “prétenduement vraiment moins cher”. Plus encore, je pense que la mentalité générale consiste à considérer l’open source comme “non gérable”. Lorsque cette mentalité aura changé, c’est que les entreprises du secteur de l’open source auront réussi.

PS : A ces difficultés, il faut ajouter le fait que la plupart des SS2L regroupent des passionnés des technologies qui n’ont pas forcément les compétences marketing et de vente que réunissent habituellement les éditeurs propriétaires. Mais ce dernier point n’est sans doute qu’un péché de jeunesse ?

XML.com: Implementing REST Web Services: Best Practices and Guidelines

On me reproche parfois d’être un peu trop théorique. Alors, concernant le style architectural REST, voici quelques bonnes pratiques et guide de conduite pour construire des services Web conformes au style REST.

Plone as a semantic aggregator

Here is an output of my imagination (no code, sorry, just a speech) : what if a CMS such as Plone could be turned into a universal content aggregator. It would become able to retrieve any properly packaged content/data from the Web and import it so that it can be reused, enhanced, and processed with the help of Plone content management features. As a universal content aggregator, it would be able to “import” (or “aggregate”) any content whatever its structure and semantic may be. Buzzwords ahead : Plone would be a schema-agnostic aggregator. It would be a semantic-enabled aggretor

Example : On site A, beer-lovers gather. Site A’s webmaster has setup a specific data schema for the description of beers, beer flabours, beer makers, beer drinkers, and so on. Since site A is rich in terms of content and its community of users is enthusiastic, plenty of beers have been described there. Then site B, powered by a semantic aggregator (and CMS), is interested in any data regarding beverages and beverages impact on human’s health. So site B retrieves beer data from site A. In fact it retrieves both the description of beer1, beer2, beerdrinker1, … and the description of what a beer is, how data is structured when it describes a beer, what the relationship is between a beer and a beer drinker. So site B now knows many things about beer in general (data structure = schema) and many beers specifically (beers data). All this beer data on site B is presented and handled as specific content types. Site B’s users are now able to handle beer descriptions as content items, to process them through workflows, to rate them, to blog on them, and so on. And finallly to republish site B’s own output in such a way it can be aggregated again from other sites. That would be the definitive birth of the semantic web !

There are many news aggregators (RSSBandit, …) that know how to retrieve news items from remote sites. But they are only able to aggregate news data. They only know one possible schema for retrievable data : the structure of a news item (a title + a link + a description + a date + …). This schema is specified in the (many) RSS standard(s).

But now that CMS such as Plone are equipped with schema management engines (called “Archetypes” for Plone), they are able to learn new data schema specified in XML files. Currently, Plone’s archetypes is able to import any schema specified in the form of an XMI file output by any UML modelizing editor.

But XMI files are not that common on the Web. And the W3C published some information showing that any UML schema (class diagram I mean) is the equivalent of an RDF-S schema. And there even is a testbed converter from RDF-S to XMI. And there even are web directories inventoring existing RDF schemas as RDF-S files. Plus RSS 1.0 is based on RDF. Plus Atom designers designed it in such a way it is easily converted to RDF.

So here is my easy speech (no code) : let’s build an RDF aggregator product from Plone. This product would retrieve any RDF file from any web site. (It would store it in the Plone’s triplestore called ROPE for instance). It would then retrieve the associated RDF-S file (and store it in the same triplestore). It would convert it to an XMI file and import it as an Archetypes content type with the help of the ArchGenXML feature. Then it would import the RDF data as AT items conforming to the newly created AT content type. Here is a diagram summarizing this : Plone as a semantic aggregator

By the way, Gillou (from Ingeniweb) did not wait for my imagination output to propose a similar project. He called it ATXChange. The only differences I see between his proposal and what is said above are, first, that Gillou might not be aware about RDF and RDF-S capabilities (so he might end with a Archetypes-specific aggregator inputting and outputting content to and from Plone sites only) and that Gillou must be able to provide code sooner or later whereas I may not be !

Last but not least : wordpress is somewhat going in the same direction. The semweb community is manifesting some interest in WP structured blogging features. And some plugins are appearing that try to incorporate more RDF features in WP (see also seeAlso).

Is the Semantic Web stratospheric enough ?

Did you think the Semantic Web is a stratospheric concept for people smoking too many HTTP connections ? If so, don’t even try to understand what Pierre Levy is intending to do. He and the associatied network of people say they are preparing the next step after the Semantic Web. Well… In fact, I even heard Pierre Levy saying he is preparing the next step in the evolution of mankind, so this is not such a surprise. The worst point in this story is that his ambitious work may be extremely relevant and insightful for all of us, mortals. :)

Free Text-to-speech technologies

Here is a short review of freely available (open source or not) “text-to-speech” technologies. I digged in this topic because I wanted to check whether anyone invented some software package turning my RSS aggregator into a personalized radio. More precisely, while I am doing some other task (feeding one of my kids, brushing my teeth, having my breakfast, …) I would like to be able to check my favorite blogs for news without having to read stuff. My conclusion : two packages come near to the expected result.

Regarding features, the most advanced one is NewsAloud from nextup.com. It acts as a simple and limited news aggregator combined with a text-to-speech engine that reads selected newsfeeds loud. But it still lacks some important features (loading my OPML subscription file so that I don’t have to enter my favorites RSS feeds one by one, displaying a scrolling text as it is read, …) and worst : it is NOT open source.

The second nice-looking package going in the expected direction is just a nice hack called BlogTalker and enabling any IBlogExtension-compatible .Net aggregator (RSSBandit, NewsGator…) to read any blog entry. But it is just a proof-of-concept since it cannot be setup so that it reads a whole newsfeed nor any set of newsfeeds. It seems to me that adding TTS abilities to existing news aggregators is the way to go (compared to NewsAloud which is coming from TTS technologies and trying to build a news aggregator from there). And BlogTalker passes successfully the “is it open source ?” test.

Both packages depend on third party text-to-speech engines (the “voices” you install on your system). As such, they are dependent on the quality of the underlying TTS engine. For example, if you are a Windows user, you can freely use some Microsoft voices (Mike, Mary, Sam, robot voices, …) or Lernout & Hauspie voices or many other freely available TTS engines that support the Microsoft Speech API version 4 or 5 (or above ?). The problem is that these voices do not sound good enough to me. As a native French speaker, I am comfortable with the LH Pierre or LH Veronique French voices even if they still sound like automat voices. But for listening to English newsfeeds on the long run, the MS, LH or other voices are not good enough. Fortunately, AT&T invented its “natural voices” which sound extremely … natural according to the samples provided online. Unfortunately, you have to purchase them. I will wait for this new kind of natural voices to become commoditized.

Meanwhile, I have to admit that TTS-enabled news aggregators are not ready for end-users. You can assemble a nice proof-of-concept but the quality is still lacking with the above three issues : aggregators are not fully mature (from a usability point-of-view), high-quality TTS engines are still rare, nobody has achieved to integrate them well one with the other yet. With the maturation of audio streaming technologies, I expect some hacker some day to TTS-enable my favorite CMS : Plone. With the help some of the Plone aggregation modules (CMFFeed, CMFSin, …), it would be able to stream personalized audio newsfeed directly to WinAmp… Does it sound like a dream ? Not sure…

During my tests, I encountered several other TTS utilities that are open source (or free or included in Windows) :

Windows Narrator is a nice feature that reads any Windows message box for more accessibility. It seems to be bundled in all the recent Windows releases. Windows TTS features are also delivered with the help of the friendly-but-useless Microsoft Agents.
Speakerdaemon‘s concept is simple : it monitors any set of local files or URLs and it speaks a predefined message at any change in the local or remote resource (“Your favorite weblog has been updated !”). Too bad it cannot read the content or excerpts (think regular expressions) of these resources.
SayzMe sits in your icon tray and reads any text that is pasted by Windows into the clipboard. Limited but easy.
Clip2Speech offer the same simple set of features as SayzMe plus it allows you to convert text to .WAV files.
Voxx Open Source is somewhat ambitious. It offers both TTS features (read any highlighted text when you hit Ctrl-3, read message boxes, read any text file, convert text to .WAV or .MP3, …) and speech recognition. Once again, it is “just” a packaging (front-end) of third party speech recognition engines. As such, it uses by default Microsoft Speech recognizer which is not available in French (but in U.S. English, Chinese and Japanese if I remember properly). I have still to try it in its U.S. English with a headset microphone since my laptop microphones catches too much noise for it to be usable. The speech recognition feature allows the user to dictate a text or to command Voxx or Windows via voice. So it is an open source competitor to IBM ViaVoice or ScanSoft Dragon Naturally Speaking.
PhantomSpeech is middleware that plugs into TTS engines and allows application developers to add TTS capabilities to their applications. It is said to be distributed with addins for Office 2000. Indeed I could display a PhantomSpeech toolbar in Word 2003. It could read a text but only using the female Microsoft voice. And this toolbar had unexpected behaviors and errors within Office. Not reliable as a front-end application. Anyway, the use and configuration of speech engines is really a mess. The result is that PhantomSpeech does not look as really intended for end-users but maybe just for developers.
CHIPSpeaking is a nice utility for “the vocally disabled” (people who cannot speak). It allows the user to dictate sentences with a virtual keyboard and to record predefined sentences that are read aloud with one click.
ReadPlease (the free version) is just a nice simple text reader made by developers who played too much with Warcraft (click on the faces and you’ll see why). The word being read is highlighted. Simple options allow users to change the voices with one click (which is cool when you switch between several languages) or to customize the size of the text, …
Spacejock’s yRead is another text reader that includes a pronunciation editor (read km as “kilometers” please) and also allows the download of public domain texts available from Project Gutenberg. The phrase being read is highlighted, you can easily switch from one voice (and language) to another. Too bad its Window always sucks the focus when it reads a new phrase.
For the *nix-inclined people, I should also mention the famous Festival suite of TTS components (Festival, FLite, Festvox). For the java-inclined people, don’t miss the FreeTTS engine (that is based on Festival Lite !) and the associated end-user applications. An example of an end-user application based on Festival is the CMU Communicator, see its sample conversation as a demo.
Last but not least, do not miss Euler and the underlying MBROLA package. Euler is a simple open source reading machine based on MBROLA that implements a huge number of voices in many many languages plus these voices can include very natural intonations and vocal stresses. Euler + MBROLA were produces by an academic research program. They are free for non-commercial use and their source code is available (BTW, it is said that MBROLA could not be distributed under an open source license because of a France Telecom software patent !). Beware : the installation of MBROLA may be quite tricky. First, download the MBROLATools Windows binaries package, download patch #1 and read the instructions included, (I had problems when trying patch #2 so I did not use it), download as many MBROLA voices as you want (wow ! that many languages supported !), then download Win Euler (or any other MBROLA compatible TTS engine from third parties ; note that MBROLA is supported by festival).

Further ranting about TTS engines : I feel like the ecosystem of speech engines is really not mature enough. Sure several vendors provide speech engines. But they are not uniformly supported by the O.S.. There was a Microsoft S.A.P.I. version 4 (SDK available here) which is now version 5.1 but people even mention v.6 (included in Office 2003 U.S. ?) and a v.7 to be included in LongHorn (note that there also is another TTS API : the Java Speech API 1.0 – JSAPI– as implemented by FreeTTS… bridgeable with MS SAPI ?). But as any Microsoft standard, these API are … not that standardized (i.e. they seem to be Microsoft-specific). Even worst : they seem rather unstable since the installation of various speech engines give strange results : some software detects most of the installed TTS engines, other only detect SOME of the SAPI v.4 TTS engines, some other display a mix of some of your SAPI4 TTS engines and some of your SAPI5 TTS engines…. In order to be able to use SAPI5 engines and control panel I had to install Microsoft Reader and to TTS-enable it (additional download). What a mess ! The result is that you cannot easily control which voices you will be using on your computer (which one will be supported or not ?). As a further example, I could not install and use the free CMU Kal Diphone voice and I still don’t know why. Is it the API fault ? the engine’s fault ? Don’t know… Last remark on this point : Festival seems to be the main open source stream in the field of TTS technologies but it does not seem to be fully mature ; and the end-user applications based on it seem to be quite rare. Let’s wait some more years before it becomes a mainstream, user-friendly and free technology.

More precisely, the TTS puzzle seems to be made with the following parts :

a TTS engine made with three parts :
- a text processing system that takes a text as input and produces phonetic and prosodic (duration of phonemes and a piecewise linear description of pitch) commands
- a speech synthesizer that transforms phonemes plus a prosody (think “speech melody”) into a speech
- a “voice” that is a reference database that allows the speech to be synthesized according to the tone, characteristics and accent of a given voice
an A.P.I. that hosts this engine and publishes its features toward end-user applications and may provide some general features such as a control panel
an end-user application (a reading machine, a file monitor with audio alerts, a audio news aggregator, …) that invokes the dedicated speech API

You can get more detailed information from the MRBOLA project site.

These were my notes and ranting about text-to-speech technologies. Please drop me a comment if you feel like my explanations were wrong or biased as I don’t know this field in details and I may have made a lot of errors here. Thanks !

RSS : The Next Big Thing Online

Voici un papier blanc (si, si…) au sujet de RSS (à nouveau via l’excellent Outils Froids). Enfin un document qui présente l’écosystème RSS en des termes marketing compréhensible par une D.S.I. de grande entreprise… enfin j’espère. Je testerai ce document sur mes collègues et supérieurs à mon retour de congés.

Mathemagenic: learning and KM insights – Thursday, June 10, 2004

Voici une explication illustrée des usages de ces outils qu’on appelle les blogs ou carnets Web (via Outils Froids du Web). En dehors du fond très juste de ces illustrations, je trouve que leur forme permet de manière élégante d’appréhender des usages technologiques. L’outil, c’est bien. Mais ce sont les usages que chacun batît autour qui en font une technologie.
(Oh ! ben tiens, j’ai réussi à publier un message sur mon carnet entre deux biberons ! auto-félicitation !) :)

Bytes for good

Innover, Servir, Entreprendre !