Discussion: Using PageRank to ascertain quality (Foundation help needed!)
Brian
2007-11-08 22:58:42 UTC
Several collaborators and I are preparing to expand on previous work to
automatically ascertain the quality of Wikipedia articles on the English
Wikipedia (presented at Wikimania '07 [0]). PageRank is Google's hallmark
quality metric, and the foundation actually has access to these numbers
through the Google Webmaster Tools website. If a foundation representative
were to create a Google account and verify that they were a "webmaster,"
they could download the PageRank for every article on the English Wikipedia
in a convenient tabular format. This data would likely serve as a fantastic
predictor. I would also like to compare the Google-computed PageRank to the
PageRank computed via Wikipedia's internal link structure. I don't see any
privacy implications in releasing this data. It also doesn't seem to help
spammers much, as they already know the pages that have a very high
PageRank, and we include rel="nofollow" on outbound links. Nonetheless, I
would of course be willing to keep the data private.
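
For concreteness, the internal-link PageRank I have in mind is just the
standard power iteration over the article graph. Here is a minimal Python
sketch, assuming we have already extracted an article-to-article edge list
from a dump (the file name links.tsv and the iteration counts are
placeholders, not part of any existing pipeline):

    from collections import defaultdict

    def pagerank(edges, damping=0.85, iterations=50):
        # Build the adjacency structure of the internal link graph.
        out_links = defaultdict(list)
        nodes = set()
        for src, dst in edges:
            out_links[src].append(dst)
            nodes.update((src, dst))
        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}
        for _ in range(iterations):
            new_rank = {node: (1.0 - damping) / n for node in nodes}
            for src in nodes:
                targets = out_links[src]
                if targets:
                    # Distribute this page's rank along its outgoing links.
                    share = damping * rank[src] / len(targets)
                    for dst in targets:
                        new_rank[dst] += share
                else:
                    # Dangling page: spread its rank uniformly.
                    for node in nodes:
                        new_rank[node] += damping * rank[src] / n
            rank = new_rank
        return rank

    # links.tsv: one "source<TAB>target" article pair per line (placeholder).
    edges = [line.rstrip("\n").split("\t") for line in open("links.tsv")]
    ranks = pagerank(edges)
    for title, score in sorted(ranks.items(), key=lambda kv: -kv[1])[:20]:
        print("%.6f\t%s" % (score, title))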

This would only take a few minutes if it were approved. Is anyone out there
who has the power to make it happen?

Cheers :)
Brian

[0]
http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf
P. Birken
2007-11-09 00:27:54 UTC
Erik should be able to help you. I read your paper, and you might think
about rewriting the conclusions. In particular, correctness is not and
cannot be evaluated by your method; it therefore cannot point readers to
articles that are most likely correct, only to articles that are well
written. Your measure of the method's accuracy is also a bit dubious,
since the tags are not uniform (take two featured articles of different
ages and they will be of very different quality), so recovering them with
100% accuracy is not a reasonable goal. However, I believe that your
method is reasonable for finding articles that are badly written.

Bye,

Philipp
Brian
2007-11-09 01:16:33 UTC
Thanks for reading it. Articles that score high by the measures used in
that paper tend to score high along all dimensions of quality. These
dimensions are correlated not only with articles that are well written but
with articles that are correct. I'm not sure there is a good argument
against the point that Featured articles will tend to be more correct than
A-class articles, which will tend to be more correct than Good articles,
and so on; that comment was made as more of an aside in the conclusion.
These machine-learning algorithms aren't doing much more than searching
for correlations, so the method is no better suited to finding poorly
written articles than to finding correct ones. It does not discover
causation. It does "reverse engineer" the human ratings, in the sense that
it finds features that correlate with them. Correctness likely correlates
with quality, and the number of references, a feature we included, likely
correlates with correctness.

The distribution of the tags is skewed towards Start articles. If you
train a classifier on an un-normalized dataset, it will do the
"intelligent" thing: classify all articles as Start. Click the "Random
page" link a couple of dozen times and you can see that this is indeed a
good way to get roughly 70% of the classifications correct. However, we
removed the skew from our dataset by using equal numbers of each class,
capped at the number of A-class articles, since those are the rarest in
the encyclopedia. Thus we trained on 650 articles of each class and, from
this extremely limited dataset, achieved decent performance.
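
The balancing step itself is trivial. A sketch, with hypothetical variable
names (this is not our actual code):

    import random
    from collections import defaultdict

    def balance(labeled_articles, seed=0):
        # labeled_articles: iterable of (features, label) pairs.
        by_label = defaultdict(list)
        for features, label in labeled_articles:
            by_label[label].append((features, label))
        # Downsample every class to the size of the rarest one
        # (A-class, ~650 articles in our case), so the classifier
        # cannot win by always predicting the majority class.
        smallest = min(len(group) for group in by_label.values())
        rng = random.Random(seed)
        balanced = []
        for group in by_label.values():
            balanced.extend(rng.sample(group, smallest))
        rng.shuffle(balanced)
        return balanced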

Of course, this was only a class project, intended as a proof of concept.
It is well known that support vector machine classification consistently
outperforms other methods in the domain of text classification, and if we
had only been interested in high numbers, we could have boosted them that
way.
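
For what it's worth, such an SVM baseline is only a few lines with a
modern toolkit. A sketch assuming scikit-learn and a balanced list of
(article_text, assessment_label) pairs like the one produced above -- not
the code we actually ran:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    # balanced_articles: list of (article_text, assessment_label) pairs,
    # e.g. the output of the balancing step above (assumed, not shown).
    texts, labels = zip(*balanced_articles)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    # Bag-of-words features weighted by TF-IDF, linear-kernel SVM.
    vectorizer = TfidfVectorizer(max_features=50000)
    clf = LinearSVC()
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    print(classification_report(
        y_test, clf.predict(vectorizer.transform(X_test))))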

Cheers,
Brian
P. Birken
2007-11-09 04:25:33 UTC
Well, of course there is a correlation between quality of style and
correctness. But it is not so strong that you could say a well-written
article is probably correct. You can only deduce that a badly written
article is probably incorrect. This is an important difference that cannot
be stressed often enough.

Best wishes,

Philipp
John Erling Blad
2007-11-09 10:08:00 UTC
Interesting reading!

I believe that a correct evaluation of article quality must be combined
with the writer's reputation, and most likely also with how the writer
interacts with other users. The article itself also doesn't exist in a
vacuum, as you suggest in the final notes about the PageRank algorithm.
Incoming links are very useful for evaluating article quality, but it
takes time for them to emerge. It is therefore highly likely that it will
be necessary to use different approaches to assess article quality,
depending not only on the article's category but also on its age.

A lot of those measures will interact. For example, person A writes the
article but has previously written articles that don't rate very well due
to factual errors. He does, however, write good English (most likely
that's not me.. ;) Now person B writes rotten English (oh, that's me!) but
writes factually correct articles. Because of his very bad English, other
contributors revert his edits or rewrite them. Both of these people will
rank very badly, and their articles even worse. Still, when they team up
they can produce excellent articles.
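
A toy illustration of that interaction, with entirely invented numbers: if
reputation is tracked per dimension (style, accuracy) rather than as one
lumped score, the team effect becomes visible:

    # All numbers are invented for the example.
    reputation = {
        "A": {"style": 0.9, "accuracy": 0.3},  # writes well, facts shaky
        "B": {"style": 0.2, "accuracy": 0.9},  # writes badly, facts solid
    }

    def lumped(author):
        # One averaged reputation score per person.
        scores = list(reputation[author].values())
        return sum(scores) / len(scores)

    def team_quality(authors):
        # If collaborators cover each other's weaknesses, the team is as
        # strong as its best member on each dimension, and the article is
        # only as good as its weakest dimension.
        return min(max(reputation[a][dim] for a in authors)
                   for dim in ("style", "accuracy"))

    print(lumped("A"), lumped("B"))   # 0.6 0.55 -- both look mediocre alone
    print(team_quality(["A", "B"]))   # 0.9 -- together they excel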

When I first started to look into estimating writer reputation and article
quality, I expected to find some fairly obvious features to use. What I
found instead was that there were several connected systems, and that all
of them (at least the most prominent ones) should be taken into account.
Even then, there will be a fairly large number of erroneous
classifications.

John E
Ivan Beschastnikh
2007-11-09 19:20:10 UTC
As another note, quality can typically be associated with something of
high value. The problem is that value is a vague and subjective concept;
determining the value of a particular object/page to a particular
individual is difficult at best. However, there are numerous proxy
features for value. One such feature is popularity, which is what the
PageRank algorithm gets at. Another is the number of readers who navigate
to a page. This is something like popularity, except that it also
encompasses the many-eyes principle: the more people see an article, the
better, i.e. the higher quality, it will become.

Of course this may be disputed, but if you think value is what you're
really trying to get at, a potential direction is to go with page views
(e.g. as Priedhorsky et al. do in "Creating, Destroying, and Restoring
Value in Wikipedia" -- http://tinyurl.com/269lpq ).
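
A sketch of turning raw view counts into such a feature, assuming a
hypothetical per-page log of tab-separated title/views lines (the file
name and format are placeholders, not any particular Wikimedia export):

    import math
    from collections import Counter

    # pageviews.tsv: hypothetical "title<TAB>views" lines, one per page
    # per hour.
    views = Counter()
    with open("pageviews.tsv") as log:
        for line in log:
            title, count = line.rstrip("\n").split("\t")
            views[title] += int(count)

    # View counts are heavy-tailed, so a log transform gives a
    # better-behaved "value proxy" feature than the raw counts.
    popularity = {title: math.log1p(count)
                  for title, count in views.items()}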

ivan.