The corpus

Following the example of the British National Corpus (BNC), a corpus was compiled from popular science texts drawn from various sources, including the BBC, CNN, and New Scientist. The corpus contains 100,000 words across 150 texts. Statistical analysis of this corpus gives results similar to those of the BNC, with a list of 150 common words that account for about 50% of the words in a text. The definite article alone accounts for 6% of the words in the corpus.
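To make the kind of statistic quoted above concrete, here is a minimal Python sketch of how the share of a text covered by its most frequent words might be computed. The function names `tokenize` and `coverage_of_top_words`, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration; they are not the method or the data used to build the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def coverage_of_top_words(tokens, n):
    """Return the fraction of all tokens accounted for by the n most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    top_total = sum(count for _, count in counts.most_common(n))
    return top_total / len(tokens)


# Toy sample invented for this sketch, not taken from the corpus.
sample = (
    "The team said the cells in the brain use the same system. "
    "The researchers found that the cells can change."
)
tokens = tokenize(sample)
counts = Counter(tokens)

# Share of the sample covered by its single most frequent word ("the").
print(round(coverage_of_top_words(tokens, 1), 2))  # 0.3
```

On a real corpus, calling `coverage_of_top_words(tokens, 150)` would give the figure that the paragraph above reports as roughly 50%.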

Table of the most frequently used words in the corpus texts:

the   but    he           up        scientists  much         another   team
of    they   into         cells     use         system       cell      see
to    an     could        been      university  using        where     however
a     can    when         had       who         our          each      phone
and   or     so           your      first       I            get       between
in    will   also         people    technology  US           web       body
that  you    these        how       do          number       should    results
is    more   about        some      his         brain        software  year
it    which  would        them      make        over         world     based
for   has    if           time      now         different    say       known
be    said   researchers  two       study       those        before    need
are   not    were         computer  many        same         light     power
as    there  may          because   species     disease      three     large
on    one    there        only      no          work         being     long
with  we     all          what      years       called       change    small
by    was    its          research  even        while        data      systems
at    new    such         used      then        after        internet  too
this  than   like         found     any         information  made      around
from  says   out          way       very        high         able      plant
have  other  most         just      human       through      levels    water

 

LEGEND:

10% of words 25% of words 40% of words
15% of words 30% of words 45% of words
20% of words 35% of words 50% of words
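One plausible reading of the legend is that each shade marks the group of words which, taken together with all more frequent words, reaches the stated cumulative share of the corpus. The Python sketch below illustrates that reading only. The function name `coverage_bands`, the threshold tuple, and the toy token list are hypothetical, and the page itself does not confirm that the bands were computed this way.

```python
from collections import Counter

# Legend thresholds in percent, as listed above.
LEGEND_THRESHOLDS = (10, 15, 20, 25, 30, 35, 40, 45, 50)


def coverage_bands(tokens, thresholds=LEGEND_THRESHOLDS):
    """Map each frequent word to the smallest threshold (in %) that its
    cumulative share of the tokens does not exceed.

    Words are visited from most to least frequent; the walk stops once the
    cumulative share passes the last threshold.
    """
    total = len(tokens)
    bands = {}
    cumulative = 0
    for word, count in Counter(tokens).most_common():
        cumulative += count
        # Integer comparison avoids floating-point rounding at band edges.
        band = next((t for t in thresholds if cumulative * 100 <= t * total), None)
        if band is None:
            break
        bands[word] = band
    return bands


# Toy distribution invented for this sketch: 100 tokens in total.
toy_tokens = (
    ["the"] * 6 + ["of"] * 4 + ["to"] * 3 + ["a"] * 2
    + [f"w{i}" for i in range(85)]
)
bands = coverage_bands(toy_tokens)
print(bands["the"], bands["to"])  # 10 15
```

In this toy distribution "the" and "of" together make up 10% of the tokens, so both fall in the first band, while "to" and "a" bring the running total to 15% and fall in the second.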

An interactive example...