Here is my HOW-TO if you want to set up a portal that aggregates news
from remote web sites (through RDF/RSS syndication or through « screen
scraping » of search engines). Want your own news aggregator and
portal, hey?
I will use the following products: Plone + ZEO + CMFNewsFeed +
CMFWebAgent. Personally, I installed them on a Windows 2000 platform.
I have to say that the installation process is rather long and
tricky… I would welcome a Plone distribution that included the right
version of ZEO, with Plone already configured as a ZEO client. It
should also include the XML library needed by CMFNewsFeed.
- Download everything you will need
  - Download Plone 1.0.1 from http://prdownloads.sourceforge.net/plone/Plone-1.0.1.exe?download
  - Download ZEO in its latest CVS version from within the ZODB 3.3.1 CVS at http://cvs.zope.org/ZEO/ZEO/ZEO.tar.gz?tarball=1&only_with_tag=ZODB3-3_1-branch (because ZEO 2.0 cannot run with Zope 2.6.x, which is included in Plone 1.0.1)
  - Download CMFNewsFeed 1.1 and CMFWebAgent 1.0 from http://sourceforge.net/projects/collective
  - Download the PyXML library, version 0.8.1, as a tar.gz file from http://sourceforge.net/project/showfiles.php?group_id=6473 (note that the 0.8.1 exe versions are specific either to Python 2.1 or to Python 2.2)
  - Download my plone_conf.zip file, which includes some config files that, I suppose, I gathered mainly from the CMFNewsFeed distributions
- Install, unzip and move everything to its right place
  - Install Plone to C:\Plone (do not ask Plone to start automatically and do not start it manually either)
  - Add C:\Plone\Python to your PATH environment variable if the Plone installer did not do it
  - Unzip ZEO to C:\Plone\ZEO
  - Unzip CMFNewsFeed to C:\Plone\CMFNewsFeed-r1_1, then rename it to C:\Plone\CMFNewsFeed for convenience
  - Unzip CMFWebAgent to C:\Plone\CMFWebAgent-r0_1, then rename it to C:\Plone\CMFWebAgent for convenience
  - Read C:\Plone\ZEO\docs\ZopeREADME.txt
  - Move C:\Plone\ZEO\ZEO to C:\Plone\Zope\lib\python\ZEO
  - Unzip PyXML-0.8.1.tar.gz into C:\Plone\PyXML-0.8.1
  - Read C:\Plone\PyXML-0.8.1\README
  - From a command line, go to C:\Plone\PyXML-0.8.1 and run « python setup.py build ». You will run into some errors, but that’s not important for our purpose
  - Move all the files and directories in C:\Plone\PyXML-0.8.1\build\lib.win32-2.1\_xmlplus to C:\Plone\Python\lib\xml, replacing every existing file (I know this must be a very dirty way to install it, but I don’t know an easy way to do better since I did not want to install a standalone Python distribution outside Plone)
  - Unzip the plone_conf.zip file to C:\
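The « move everything from _xmlplus into lib\xml » step can also be scripted. Here is a minimal sketch of how I would do it, written in modern Python 3 (so it would need a recent interpreter of its own, not the Python 2.1 bundled with Plone 1.0.1). The merge_tree helper is a name I made up for the illustration, and it copies rather than moves, so the build directory stays intact.

```python
import shutil
from pathlib import Path


def merge_tree(src: Path, dst: Path) -> int:
    """Copy every file under src into dst, overwriting existing files.

    Files that exist only in dst are left untouched.
    Returns the number of files copied.
    """
    copied = 0
    for path in sorted(src.rglob("*")):
        target = dst / path.relative_to(src)
        if path.is_dir():
            # Recreate the directory layout of src inside dst.
            target.mkdir(parents=True, exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # replaces an existing file
            copied += 1
    return copied
```

With the paths used in this HOW-TO, the call would be merge_tree(Path(r"C:\Plone\PyXML-0.8.1\build\lib.win32-2.1\_xmlplus"), Path(r"C:\Plone\Python\lib\xml")). Back up lib\xml first if you want an easy way back.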
- Start up ZEO and Plone
  - Execute C:\Plone\1.start_zeo.bat
  - Wait a few seconds (or more…) and check C:\Plone\Data\var\ZEO_Server.log to see if ZEO started properly (you should see several lines explaining that ZEO created a StorageServer, and so on)
  - Set Plone’s emergency user with the Windows « Plone controller »
  - Execute C:\Plone\2.start_plone.bat
  - Wait a few seconds (or more…) and check C:\Plone\Data\var\debug.log to see if Zope (Plone) started properly
  - Point your browser to http://localhost, then to http://localhost:8080/manage, to see if Plone works properly and you can log into the Plone management interface as your emergency user. It should work (well, it works for me…).
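Instead of « wait a few seconds and check the log » by hand, the waiting can be automated. The sketch below (modern Python 3, run outside Plone) polls a log file until some marker text shows up. The wait_for_marker name is mine, and the « StorageServer » marker is only an assumption based on what I saw in my own ZEO_Server.log; the exact wording in your log may differ.

```python
import time
from pathlib import Path


def wait_for_marker(log_path, marker, timeout=60.0, poll=1.0):
    """Return True once marker appears in the log file, False on timeout."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout
    while True:
        # The log may not exist yet right after the .bat file is launched.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

For example, wait_for_marker(r"C:\Plone\Data\var\ZEO_Server.log", "StorageServer") before launching 2.start_plone.bat. The same function could watch debug.log, provided you know a line that only appears once Zope is up.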
- Set up and start CMFNewsFeed as a ZEO client
  - Go to http://localhost and register as a new user called « newsfeed »: this will be the username CMFNewsFeed uses for retrieving content from the Net and posting it into Plone.
  - Log into http://localhost:8080/manage as your emergency user and give « newsfeed » the « Reviewer » role (go into Plone/acl_users, click on newsfeed and give it the Reviewer role). I suppose newsfeed should still keep its « Member » role.
  - Open C:\Plone\Data\getnews.conf and set the member_name variable to ‘newsfeed’ (the default value is ‘rssfeeder’)
  - Set up a new RSS source as follows: go to http://localhost and log in as ‘newsfeed’, then click on the « my folder » link (in the navigation bar). Create a new folder: select Folder in the list box, click on the « add a new element » button, fill in the form (« my_slashdot_source » as id/name and « My Slashdot RSS source » as title) and validate. Then create a link in this new folder. It must be named ‘RDF’ (mandatory), it could be titled ‘the Slashdot RSS link’, and its URL points to the RSS file you want to retrieve (http://slashdot.org/slashdot.rdf).
  - Hack CMFNewsFeed to adapt it to Plone: open C:\Plone\CMFNewsFeed\CMFFeedApp.py and replace ‘Portal Folder’ with ‘Plone Folder’. Still in CMFFeedApp.py, find the line containing « _edit » and, just below it, comment out the « description=description, » line, then add a « new_link.description = description » line below the « new_link.title = title » line
  - Open a command line, go to C:\Plone\Data, then execute « python C:\Plone\CMFNewsFeed\getnews.py » (or just run the 3rd .bat file I prepared in my plone_conf.zip file)
  - You should find your new news items under the « my_slashdot_source » folder. If they don’t display (although the getnews.py command line reported that they were retrieved), it may be a ZEO cache issue. A quick and dirty fix is to restart Plone. But, of course, you may have to fix your zope.conf file in order to avoid this kind of issue. For the moment, I don’t know how to fix that. I’ll try later.
  - Schedule ‘cmd.exe « C:\Plone\Python\python C:\Plone\CMFNewsFeed\getnews.py »’ to run once a day so that your news is fresh every day (never run it more often than once every 30 minutes or you may be banned by the news sources). You may use a Windows version of cron to do this.
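If you are afraid of hammering the news sources by accident (a misconfigured scheduler, an impatient manual run…), a small wrapper can enforce the 30-minute floor. This is only a sketch in modern Python 3, not something shipped with CMFNewsFeed: should_run, run_getnews and the stamp file are names I invented, and the wrapper simply records the time of the last run in a text file.

```python
import subprocess
import time
from pathlib import Path

MIN_INTERVAL = 30 * 60  # seconds: the 30-minute floor mentioned above


def should_run(stamp_file, now=None, min_interval=MIN_INTERVAL):
    """True if no run was recorded within the last min_interval seconds."""
    now = time.time() if now is None else now
    stamp = Path(stamp_file)
    if not stamp.exists():
        return True
    try:
        last = float(stamp.read_text().strip())
    except ValueError:
        return True  # unreadable stamp: treat it as "never ran"
    return now - last >= min_interval


def run_getnews(command, stamp_file, cwd=None, now=None):
    """Run command unless it ran too recently; return True if it was run."""
    if not should_run(stamp_file, now=now):
        return False
    subprocess.run(command, cwd=cwd, check=True)
    # Record the run time only after the command succeeded.
    stamp_time = time.time() if now is None else now
    Path(stamp_file).write_text(str(stamp_time))
    return True
```

The scheduler would then call something like run_getnews([r"C:\Plone\Python\python", r"C:\Plone\CMFNewsFeed\getnews.py"], r"C:\Plone\Data\getnews.stamp", cwd=r"C:\Plone\Data"), where getnews.stamp is a hypothetical file name. The now parameter is only there to make the function easy to test.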
- Set up and start CMFWebAgent as a ZEO client
  - OK. You are a big boy/girl now. So try and follow similar steps to make CMFWebAgent run. You may have to fix some of the CMFWebAgent search engine scripts, since the engines’ web interfaces may have changed since CMFWebAgent (and this doc) were released. Dirty hacks in sight…
- Last but not least: please drop a comment here to tell me if this works for you, how hard it was to set up and so on… Or maybe you know of a better way to make these damned CMFstuffagents work!
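To give an idea of what « fixing a search engine script » means, here is a generic screen-scraping sketch in modern Python 3. It is not CMFWebAgent’s code (I am not reproducing its scripts here); LinkExtractor and extract_links are names of my own. It just pulls the links and their text out of a result page, which is the part that usually breaks when an engine changes its HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return the list of (URL, text) pairs found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A real agent would then keep only the links that sit in the engine’s result list, and that filter is exactly what has to be rewritten each time the engine redesigns its pages.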