39 Commits

Author SHA1 Message Date
ec2b84df2a Add requirements
To make setting up the environment for this module easier
2017-09-23 21:09:58 +02:00
88848cb084 Prepare Version test-v4 for release
Add a README.md file for this project
2017-09-23 20:32:13 +02:00
5057aed0d3 Merge branch 'fs#157-lowercase-title' into develop 2017-09-09 21:47:03 +02:00
02e53475f1 Prevent lowercase article titles in Parser
Since truly lowercase article titles are not allowed, make sure to
convert the first letter of each article title to uppercase. This is
necessary since pywikibot may return article titles with a lowercase
first letter.

Related Task: [FS#157](https://fs.golderweb.de/index.php?do=details&task_id=157)
2017-09-09 21:35:36 +02:00
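The normalization described in the commit above can be sketched in plain Python; `normalize_title` is a hypothetical helper name, not the bot's actual code:

```python
def normalize_title(title: str) -> str:
    """Uppercase the first letter, as MediaWiki does for article titles."""
    if not title:
        return title
    return title[0].upper() + title[1:]
```

For example, `normalize_title("foo bar")` yields `"Foo bar"`, matching what the real title would be on-wiki.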
d6f9b460c9 Merge branch 'fs#156-dbapi-charset' into develop 2017-09-02 22:13:20 +02:00
ff03ca8f13 Explicitly set charset for PyMySQL-Connection
Since the PyMySQL connection otherwise uses the charset 'latin-1',
explicitly set the connection charset to 'utf8'

http://docs.sqlalchemy.org/en/rel_1_0/dialects/mysql.html#charset-selection
http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html?highlight=url#sqlalchemy.engine.url.URL

Related Task: [FS#156](https://fs.golderweb.de/index.php?do=details&task_id=156)
2017-09-02 22:10:25 +02:00
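The resulting connection URL can be illustrated with a small sketch; `build_dsn` is a hypothetical helper, while the real code builds the URL via `sqlalchemy.engine.url.URL` with `query={'charset': 'utf8'}`:

```python
from urllib.parse import urlencode

def build_dsn(user: str, password: str, host: str, database: str,
              charset: str = "utf8") -> str:
    # Append the charset as a query parameter so the driver does not
    # silently fall back to latin-1.
    query = urlencode({"charset": charset})
    return f"mysql+pymysql://{user}:{password}@{host}/{database}?{query}"
```
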
88692ca678 Merge branch 'fs#155-article-surouding-space' into develop 2017-09-02 22:08:31 +02:00
d9b4fcc0bd Strip spaces before adding articles to redfam
Some article links have surrounding spaces in their link text. Remove them
before adding the article to the RedFam to get a canonical title

Related Task: [FS#155](https://fs.golderweb.de/index.php?do=details&task_id=155)
2017-09-02 22:06:30 +02:00
22ff78ea98 Merge branch 'fs#154-categorie-colons-missing' into develop 2017-09-02 16:02:45 +02:00
b3cfcdc259 Improve title detection to get correct behaviour
Make sure that category links start with a colon and that non-article
pages are returned with their namespace.

Related Task: [FS#154](https://fs.golderweb.de/index.php?do=details&task_id=154)
2017-09-02 15:59:34 +02:00
b3e0ace2f4 Merge branch 'fs#153-nested-templates' into develop 2017-09-02 14:25:21 +02:00
f8002c85da Do not search for templates recursivly
Since nested templates did not get an index in global wikicode object
searching for index of an nested template results in ValueError

Related Task: [FS#153](https://fs.golderweb.de/index.php?do=details&task_id=153)
2017-09-02 14:23:25 +02:00
49bc05d29b Merge branch 'fs#151-normalize-article-titles-anchor' into develop 2017-09-02 13:36:17 +02:00
8a26b6d92a Normalize article titles with anchors
In our DB, article titles with anchors are stored with underscores in
the anchor string. Therefore we need to replace spaces in the anchor
string returned by pywikibot.Page.title().

Related Task: [FS#151](https://fs.golderweb.de/index.php?do=details&task_id=151)
2017-08-25 18:11:41 +02:00
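The idea can be sketched like this (hypothetical helper name; the bot applies the same split-and-replace on the anchor part only):

```python
def normalize_anchor(title: str) -> str:
    # pywikibot.Page.title() returns the anchor part with spaces; in the
    # DB the anchor part is stored with underscores instead.
    parts = title.split("#", 1)
    if len(parts) == 2:
        parts[1] = parts[1].replace(" ", "_")
    return "#".join(parts)
```

Note that spaces in the title part itself are left untouched; only the anchor is rewritten.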
49a8230d76 Merge branch 'fs#141-place-notice-after-comment' into develop 2017-08-25 17:11:28 +02:00
31c10073a2 Prevent index errors searching for comments
Make sure not to exceed the existing indexes of the wikicode object while
searching for comments

Related Task: [FS#141](https://fs.golderweb.de/index.php?do=details&task_id=141)
2017-08-25 17:09:38 +02:00
642a29b022 Improve regex for blank lines
Do not match consecutive linebreaks as one

Related Task: [FS#141](https://fs.golderweb.de/index.php?do=details&task_id=141)
2017-08-24 18:47:18 +02:00
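One way to express this (a sketch, not necessarily the bot's exact pattern): match a single blank line, allowing stray spaces or tabs on it, with a lookahead instead of consuming the following line break, so consecutive breaks count separately:

```python
import re

# A blank line: a newline, optional non-newline whitespace, then a
# lookahead for the next newline (consecutive breaks match separately).
BLANK_LINE = re.compile(r"\n[^\S\n]*(?=\n)")
```

A greedy pattern like `\n\s*\n` would swallow `"a\n\n\nb"` as a single match, whereas this pattern finds two blank lines there.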
2f90751dc2 Merge branch 'fs#146-famhash-generator' into develop 2017-08-24 12:27:54 +02:00
024be69fe1 Use famhash as generator
If famhash is defined, explicitly fetch that redfam from the DB and work
only on it

Related Task: [FS#146](https://fs.golderweb.de/index.php?do=details&task_id=146)
2017-08-24 12:27:13 +02:00
b6d7268a7f select by famhash: Add methods to get param in bot
We need a callback method to get bot-specific params passed through
to our bot class.
Introduce the -famhash parameter to work on a specific famhash

Related Task: [FS#146](https://fs.golderweb.de/index.php?do=details&task_id=146)
2017-08-24 12:27:13 +02:00
526184c1e1 Merge branch 'fs#148-articles-mixed-up' into develop 2017-08-24 12:26:53 +02:00
3aa6c5fb1c Disable PreloadingGenerator temporarily
PreloadingGenerator mixes up the yielded Pages. This is very inconvenient
for the semi-automatic workflow with manual checks, as the articles of
the RedFams do not follow each other.

Related Task: [FS#148](https://fs.golderweb.de/index.php?do=details&task_id=148)
2017-08-24 12:23:17 +02:00
ec8f459db5 Merge branch 'fs#138-marked-articles-shown-again' into develop 2017-08-24 12:19:24 +02:00
3b2cb95f36 Do not fetch marked redfams from db
Exclude marked Redfams from DB-Query to prevent marking them again

Related Task: [FS#138](https://fs.golderweb.de/index.php?do=details&task_id=138)
2017-08-24 12:09:43 +02:00
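The idea can be sketched with stdlib sqlite3 and an assumed minimal schema (the real bot uses SQLAlchemy against MySQL, filtering with `status NOT LIKE '%marked%'`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE redfam (famhash TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO redfam VALUES (?, ?)",
    [("a1", "archived"), ("b2", "archived,marked")],
)

# Exclude already marked redfams directly in the query instead of
# re-fetching and re-marking them.
rows = conn.execute(
    "SELECT famhash FROM redfam "
    "WHERE status LIKE '%archived%' AND status NOT LIKE '%marked%'"
).fetchall()
```
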
41e5cc1a9d Merge branch 'fs#141-place-notice-after-comment' into develop 2017-08-24 12:06:03 +02:00
9b9d50c4d2 Improve detection of empty lines
Search with RegEx as empty lines could also contain spaces

Related Task: [FS#141](https://fs.golderweb.de/index.php?do=details&task_id=141)
2017-08-24 12:04:45 +02:00
a755288700 Merge branch 'fs#147-templates-in-heading' into develop 2017-08-23 14:55:43 +02:00
14ec71dd09 Rewrite get_disc_link to handle special cases
Use methods of the pywikibot site object and mwparser to get rid of any
special elements like templates or links in headings when constructing
our disc link.
Replace `&nbsp;` by hand as it otherwise will occur as a normal space and
won't work

Related Task: [FS#147](https://fs.golderweb.de/index.php?do=details&task_id=147)
2017-08-23 14:53:22 +02:00
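The manual replacement can be sketched as follows (`heading_to_anchor` is a hypothetical helper; `.C2.A0` is the MediaWiki anchor encoding of a non-breaking space, as used in the actual diff below):

```python
def heading_to_anchor(heading: str) -> str:
    heading = heading.strip()
    # A literal non-breaking space would otherwise be treated like a
    # normal space, so replace it by hand with its anchor encoding.
    heading = heading.replace("\u00a0", ".C2.A0")
    # Ordinary spaces become underscores in MediaWiki anchors.
    return heading.replace(" ", "_")
```
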
e283eb78ac Merge branch 'fs#140-also-mark-redirects' into develop 2017-08-22 21:59:22 +02:00
cc02006fd2 Do not exclude redirects from beeing marked
In accordance with Zulu55 redirect discussion pages should also get
a notice, therefore do not exclude redirects.

Related Task: [FS#140](https://fs.golderweb.de/index.php?do=details&task_id=140)
2017-08-22 21:59:07 +02:00
37b0cbef08 Merge branch 'fs#138-marked-articles-shown-again' into develop 2017-08-22 21:58:22 +02:00
4137d72468 Look for existing notice by simple in-check
To detect notices that may already be present (even uncommented ones),
check for them using just a simple Python `x in y` check over the whole
wikicode

Related Task: [FS#138](https://fs.golderweb.de/index.php?do=details&task_id=138)
2017-08-22 21:56:43 +02:00
cd87d1c2bb Fix already marked articles was reshown bug
Since we search for matching states for articles to include or exclude
in a loop, we could not control the outer loop via default break/
continue. Python docs recommend using Exceptions and try/except
structures to realise that most conveniently.

https://docs.python.org/3/faq/design.html#why-is-there-no-goto

Related Task: [FS#138](https://fs.golderweb.de/index.php?do=details&task_id=138)
2017-08-22 21:45:58 +02:00
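The pattern from the Python FAQ looks roughly like this (simplified from the actual filter logic):

```python
class Continue(Exception):
    """Raised in a nested loop to continue the outer loop."""

class Break(Exception):
    """Raised in a nested loop to break the outer loop."""

kept = []
for item in [1, 2, 3, 4]:
    try:
        # Inner loop over per-item states; a plain continue/break here
        # would only affect this inner loop, not the outer one.
        for status in (["skip"] if item == 2 else []):
            raise Continue()
        if item == 4:
            raise Break()
    except Continue:
        continue
    except Break:
        break
    kept.append(item)
# kept is now [1, 3]: item 2 was skipped, item 4 stopped the loop
```
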
456b2ba3d4 Merge branch 'fs#141-place-notice-after-comment' into develop 2017-08-21 22:11:51 +02:00
47b85a0b5e Add missing line break if there is no template
To make sure our notice template resides on its own line in every case

Related Task: [FS#141](https://fs.golderweb.de/index.php?do=details&task_id=141)
2017-08-21 22:09:59 +02:00
a6fdc974bd Merge branch 'fs#144-PyMySQL-instead-oursql' into develop 2017-08-21 13:58:34 +02:00
30de2a2e12 Replace oursql with PyMySQL
Since PyMySQL is preferred on Toolforge and works out of the box after
installing via pip, replace oursql, which caused some problems.
In particular, oursql was not able to connect to the DB via an SSH tunnel.

Related Task: [FS#144](https://fs.golderweb.de/index.php?do=details&task_id=144)
2017-08-21 13:55:33 +02:00
4a6855cf7b Merge branch 'fs#141-place-notice-after-comment' into develop 2017-08-21 13:51:32 +02:00
8422d08cb6 Keep comments and leading templates together
Prevent splitting up existing comments and templates, as those often
document the behaviour of archive templates

Related Task: [FS#141](https://fs.golderweb.de/index.php?do=details&task_id=141)
2017-08-21 13:49:34 +02:00
6 changed files with 248 additions and 69 deletions

README.md Normal file

@@ -0,0 +1,51 @@
jogobot-red
===========
Dependencies
------------
* pywikibot-core
* mwparserfromhell
The libraries above need to be installed and configured manually following the [documentation of pywikibot-core](https://www.mediawiki.org/wiki/Manual:Pywikibot).
* SQLAlchemy
* PyMySQL
Those can be installed using pip and the _requirements.txt_ file provided with this package:
pip install -r requirements.txt
Versions
--------
* test-v4
- Feature _markpages_ working in semi-automatic mode using command
python red.py -task:markpages -family:wikipedia
- Work on specific redfam using param
-famhash:[sha1-famhash]
- Use _PyMySQL_ instead of _OurSQL_
- Correctly parse redfams containing articles with a leading lowercase character or surrounding spaces in the wikilink
* test-v3
* test-v2
* test-v1
License
-------
GPLv3
Author Information
------------------
Copyright 2017 Jonathan Golder jonathan@golderweb.de https://golderweb.de/
alias Wikipedia.org-User _Jogo.obb_ (https://de.wikipedia.org/Benutzer:Jogo.obb)


@@ -26,6 +26,7 @@ Bot to mark pages which were/are subjects of redundance discussions
with templates
"""
import re
from datetime import datetime
import pywikibot
@@ -61,6 +62,9 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
# Init attribute
self.__redfams = None # Will hold a generator with our redfams
if "famhash" in kwargs:
self.famhash = kwargs["famhash"]
# We do not use the predefined genFactory as there is no sensible case
# for giving a generator via cmd-line for this right now
self.genFactory = pagegenerators.GeneratorFactory()
@@ -101,8 +105,15 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
end_after = datetime.strptime(
jogobot.config["red.markpages"]["mark_done_after"],
"%Y-%m-%d" )
- self.__redfams = list( RedFamWorker.gen_by_status_and_ending(
- "archived", end_after) )
+ if hasattr(self, "famhash"):
+ self.__redfams = list(
+ RedFamWorker.session.query(RedFamWorker).filter(
+ RedFamWorker.famhash == self.famhash ) )
+ else:
+ self.__redfams = list( RedFamWorker.gen_by_status_and_ending(
+ "archived", end_after) )
return self.__redfams
@@ -114,8 +125,12 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
self.genFactory.gens.append( self.redfam_talkpages_generator() )
# Set generator to pass to super class
- self.gen = pagegenerators.PreloadingGenerator(
- self.genFactory.getCombinedGenerator() )
+ # Since PreloadingGenerator mixes up the Pages, do not use it right now
+ # (FS#148)
+ # We can do so for automatic runs (FS#150)
+ # self.gen = pagegenerators.PreloadingGenerator(
+ #     self.genFactory.getCombinedGenerator() )
+ self.gen = self.genFactory.getCombinedGenerator()
def redfam_talkpages_generator( self ):
"""
@@ -131,7 +146,6 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
for talkpage in pagegenerators.PageWithTalkPageGenerator(
redfam.article_generator(
filter_existing=True,
- filter_redirects=True,
exclude_article_status=["marked"] ),
return_talk_only=True ):
@@ -172,25 +186,34 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
# None if change was not accepted by user
save_ret = self.put_current( self.new_text, summary=summary )
+ # Normalize title with anchor (replace spaces in anchor)
+ article = self.current_page.toggleTalkPage().title(
+ asLink=True, textlink=True)
+ article = article.strip("[]")
+ article_parts = article.split("#", 1)
+ if len(article_parts) == 2:
+ article_parts[1] = article_parts[1].replace(" ", "_")
+ article = "#".join(article_parts)
# Status
if add_ret is None or ( add_ret and save_ret ):
self.current_page.redfam.article_remove_status(
"note_rej",
- title=self.current_page.title(withNamespace=False))
+ title=article)
self.current_page.redfam.article_remove_status(
"sav_err",
- title=self.current_page.title(withNamespace=False))
+ title=article)
self.current_page.redfam.article_add_status(
"marked",
- title=self.current_page.title(withNamespace=False))
+ title=article)
elif save_ret is None:
self.current_page.redfam.article_add_status(
"note_rej",
- title=self.current_page.title(withNamespace=False))
+ title=article)
else:
self.current_page.redfam.article_add_status(
"sav_err",
- title=self.current_page.title(withNamespace=False))
+ title=article)
def add_disc_notice_template( self ):
"""
@@ -214,12 +237,37 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
# There is none on empty pages, so we need to check
if leadsec:
# Get the last template in leadsec
- ltemplates = leadsec.filter_templates()
+ ltemplates = leadsec.filter_templates(recursive=False)
# If there is one, add notice after this
if ltemplates:
- self.current_wikicode.insert_after( ltemplates[-1],
- self.disc_notice )
+ # Make sure not to separate template and possibly following comment
insert_after_index = self.current_wikicode.index(
ltemplates[-1] )
# If there is more content
if len(self.current_wikicode.nodes) > (insert_after_index + 1):
# Filter one linebreak
if isinstance( self.current_wikicode.get(
insert_after_index + 1),
mwparser.nodes.text.Text) and \
re.search( r"^\n[^\n\S]+$", self.current_wikicode.get(
insert_after_index + 1 ).value ):
insert_after_index += 1
while len(self.current_wikicode.nodes) > \
(insert_after_index + 1) and \
isinstance(
self.current_wikicode.get(insert_after_index + 1),
mwparser.nodes.comment.Comment ):
insert_after_index += 1
self.current_wikicode.insert_after(
self.current_wikicode.get(insert_after_index),
self.disc_notice )
# To have it on its own line we need to add a linebreak before
self.current_wikicode.insert_before(self.disc_notice, "\n" )
@@ -228,13 +276,16 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
else:
self.current_wikicode.insert( 0, self.disc_notice )
# To have it on its own line we need to add a linebreak after it
self.current_wikicode.insert_after(self.disc_notice, "\n" )
# If there is no leadsec (and therefore no template in it), we will add
# before the first element
else:
self.current_wikicode.insert( 0, self.disc_notice )
# To have it on its own line we need to add a linebreak after it
self.current_wikicode.insert_after(self.disc_notice, "\n" )
# To have it on its own line we need to add a linebreak after it
self.current_wikicode.insert_after(self.disc_notice, "\n" )
# Notice was added
return True
@@ -243,6 +294,10 @@ class MarkPagesBot( CurrentPageBot ): # sets 'current_page' on each treat()
"""
Checks if disc notice which shall be added is already present.
"""
if self.disc_notice in self.current_wikicode:
return True
# Iterate over templates with the same name (if any) and compare their
# links to decide if they are the same
for present_notice in self.current_wikicode.ifilter_templates(


@@ -46,12 +46,14 @@ import sqlalchemy.types as types
Base = declarative_base()
- url = URL( "mysql+oursql",
+ url = URL( "mysql+pymysql",
username=config.db_username,
password=config.db_password,
host=config.db_hostname,
port=config.db_port,
- database=config.db_username + jogobot.config['db_suffix'] )
+ database=config.db_username + jogobot.config['db_suffix'],
+ query={'charset': 'utf8'} )
engine = create_engine(url, echo=True)


@@ -282,7 +282,14 @@ class RedFamParser( RedFam ):
articlesList = []
for link in heading.ifilter_wikilinks():
- article = str( link.title )
+ article = str( link.title ).strip()
# Short circuit empty links
if not article:
continue
# Make sure first letter is uppercase
article = article[0].upper() + article[1:]
# Split in title and anchor part
article = article.split("#", 1)
@@ -515,46 +522,67 @@ class RedFamWorker( RedFam ):
@type filter_redirects bool/None
"""
# Helper to leave multidimensional loop
# https://docs.python.org/3/faq/design.html#why-is-there-no-goto
class Continue(Exception):
pass
class Break(Exception):
pass
# Iterate over articles in redfam
for article in self.articlesList:
# Not all list elements contain articles
if not article:
# To be able to control outer loop from inside child loops
try:
# Not all list elements contain articles
if not article:
raise Break()
page = pywikibot.Page( pywikibot.Link(article),
pywikibot.Site() )
# Filter existing pages if requested with filter_existing=False
if page.exists():
self.article_remove_status( "deleted", title=article )
if filter_existing is False:
raise Continue()
# Filter non existing Pages if requested with
# filter_existing=True
else:
self.article_add_status( "deleted", title=article )
if filter_existing:
raise Continue()
# Filter redirects if requested with filter_redirects=True
if page.isRedirectPage():
self.article_add_status( "redirect", title=article )
if filter_redirects:
raise Continue()
# Filter noredirects if requested with filter_redirects=False
else:
self.article_remove_status("redirect", title=article )
if filter_redirects is False:
raise Continue()
# Exclude by article status
for status in exclude_article_status:
if self.article_has_status( status, title=article ):
raise Continue()
# Only include by article status
for status in onlyinclude_article_status:
if not self.article_has_status( status, title=article ):
raise Continue()
# Proxy loop control to outer loop
except Continue:
continue
except Break:
break
page = pywikibot.Page(pywikibot.Link(article), pywikibot.Site())
# Filter existing pages if requested with filter_existing=False
if page.exists():
self.article_remove_status( "deleted", title=article )
if filter_existing is False:
continue
# Filter non existing Pages if requested with filter_existing=True
else:
self.article_add_status( "deleted", title=article )
if filter_existing:
continue
# Filter redirects if requested with filter_redirects=True
if page.isRedirectPage():
self.article_add_status( "redirect", title=article )
if filter_redirects:
continue
# Filter noredirects if requested with filter_redirects=False
else:
self.article_remove_status("redirect", title=article )
if filter_redirects is False:
continue
# Exclude by article status
for status in exclude_article_status:
if self.article_has_status( status, title=article ):
continue
# Only include by article status
for status in onlyinclude_article_status:
if not self.article_has_status( status, title=article ):
continue
# Yield filtered pages
yield page
@@ -590,22 +618,22 @@ class RedFamWorker( RedFam ):
@rtype str
"""
# We need to Replace Links with their linktext
anchor_code = mwparser.parse( self.heading.strip() )
for link in anchor_code.ifilter_wikilinks():
if link.text:
text = link.text
else:
text = link.title
# Expand templates using pwb site object
site = pywikibot.Site()
anchor_code = site.expand_text(self.heading.strip())
anchor_code.replace( link, text )
# Remove possibly embedded files
anchor_code = re.sub( r"\[\[\w+:[^\|]+(?:\|.+){2,}\]\]", "",
anchor_code )
# Whitespace is replaced with underscores
anchor_code.replace( " ", "_" )
# Replace non-breaking-space by correct urlencoded value
anchor_code = anchor_code.replace( " ", ".C2.A0" )
# We try it with out any more parsing as mw will do while parsing page
return ( self.redpage.pagetitle + "#" +
str(anchor_code).strip() )
# Use mwparser to strip and normalize
anchor_code = mwparser.parse( anchor_code ).strip_code()
# We try it without any more parsing as mw will do while parsing page
return ( self.redpage.pagetitle + "#" + anchor_code.strip() )
def generate_disc_notice_template( self ):
"""
@@ -678,6 +706,7 @@ class RedFamWorker( RedFam ):
# RedFamWorker._status.like('archived'),
# RedFamWorker._status.like("%{0:s}%".format(status)),
text("status LIKE '%archived%'"),
+ text("status NOT LIKE '%marked%'"),
RedFamWorker.ending >= ending ):
yield redfam

red.py

@@ -60,7 +60,7 @@ def prepare_bot( task_slug, subtask, genFactory, subtask_args ):
@rtype tuple
"""
# kwargs are passed to selected bot as **kwargs
- kwargs = dict()
+ kwargs = subtask_args
if not subtask or subtask == "discparser":
# Default case: discparser
@@ -83,6 +83,25 @@ def prepare_bot( task_slug, subtask, genFactory, subtask_args ):
return ( subtask, Bot, genFactory, kwargs )
def parse_red_args( argkey, value ):
"""
Process additional args for red.py
@param argkey The arguments key
@type argkey str
@param value The arguments value
@type value str
@return Tuple with (key, value) if given pair is relevant, else None
@rtype tuple or None
"""
if argkey.startswith("-famhash"):
return ( "famhash", value )
return None
def main(*args):
"""
Process command line arguments and invoke bot.
@@ -110,7 +129,7 @@ def main(*args):
# Parse local Args to get information about subtask
( subtask, genFactory, subtask_args ) = jogobot.bot.parse_local_args(
- local_args )
+ local_args, parse_red_args )
# select subtask and prepare args
( subtask, Bot, genFactory, kwargs ) = prepare_bot(

requirements.txt Normal file

@@ -0,0 +1,23 @@
# This is a PIP 6+ requirements file for using jogobot-red
#
# All dependencies can be installed using:
# $ sudo pip install -r requirements.txt
#
# It is good practise to install packages using the system
# package manager if it has a packaged version. If you are
# unsure, please use pip as described at the top of the file.
#
# To get a list of potential matches, use
#
# $ awk -F '[#>=]' '{print $1}' requirements.txt | xargs yum search
# or
# $ awk -F '[#>=]' '{print $1}' requirements.txt | xargs apt-cache search
# Needed for Database-Connection
# SQLAlchemy Python ORM-Framework
SQLAlchemy>=1.1
# PyMySQL DB-Connector
PyMySQL>=0.7
# Also needed, but not covered here, is a working copy of pywikibot-core
# which also brings mwparserfromhell