Utilizator:Flyax/categories

The following script was originally written by Ariel Glenn for the Greek Wiktionary. I made a few changes so that it can be used here. Its purpose is to list all entries that belong to a given category.

To run this script we need a computer running Linux, with bash, curl, sed, awk and grep available (the script calls all of them).
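Whether the programs called by the script (curl, sed, awk and grep) are installed can be checked from a terminal. The following is only a sketch: the function name missing_tools is made up for this example and is not part of the original script.

```shell
#!/bin/bash
# Print the name of every listed command that cannot be found on the PATH.
# missing_tools is a hypothetical helper, not part of wordsincategory.sh.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v succeeds only if the shell can find the command
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# the external programs that wordsincategory.sh calls
missing_tools curl sed awk grep
```

If nothing is printed, all four programs were found.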

In our home folder (e.g. /home/flyax) we create a new folder named, say, "rowikt". In it we create a new text document, paste the code into it and save it as "wordsincategory.sh". Then we right-click on the file, choose Properties, then Permissions, and check the "Execute" option.
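The same preparation can also be done from a terminal instead of the file manager. This is a minimal sketch, assuming the folder and file names used above; prepare_folder is a hypothetical helper name chosen for the example.

```shell
#!/bin/bash
# Sketch of the folder setup done from the terminal.
# prepare_folder is a hypothetical helper; the names follow the text above.
prepare_folder() {
  local dir="$1"
  # create the folder if it does not exist yet
  mkdir -p "$dir"
  # once the script has been saved there, mark it as executable
  # (the same effect as checking "Execute" in the Permissions dialog)
  if [ -f "$dir/wordsincategory.sh" ]; then
    chmod +x "$dir/wordsincategory.sh"
  fi
}

# usage: prepare_folder "$HOME/rowikt"
```

Calling it once before saving the script creates the folder; calling it again afterwards sets the execute permission.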

We open a terminal and type:

> cd rowikt

> ./wordsincategory.sh Română

After a few minutes we will find a new sub-folder named "cat_tmp" containing 3 files. We open the file "titles.March-04-2011.txt" (the date in the name is the day on which the script was run) and see all the entries belonging to the category "Română", each written as a wiki link in double square brackets. Now we return to the terminal and type:

> cd cat_tmp

> grep 'ţ' titles.March-04-2011.txt > t1.list

We open the file t1.list and see all entries containing a ţ.

> grep 'ş' titles.March-04-2011.txt > t2.list

We open the file t2.list and see all entries containing a ş.

> cat t1.list t2.list | sort | uniq > move.list

The file move.list now holds a sorted list, without duplicates, of all entries containing a ţ or a ş.
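The three commands above can also be combined into one. The following is a sketch, not part of the original instructions; titles_with_cedilla is a hypothetical name, and the file names in the usage line are the ones from the example above.

```shell
#!/bin/bash
# Sketch: collect all titles containing ţ or ş in a single pass.
# titles_with_cedilla is a hypothetical helper name.
titles_with_cedilla() {
  # -e may be repeated: a line is kept if it matches either pattern;
  # sort -u sorts and removes duplicate lines in one step
  grep -e 'ţ' -e 'ş' "$1" | sort -u
}

# usage: titles_with_cedilla titles.March-04-2011.txt > move.list
```

This skips the intermediate files t1.list and t2.list, so the three-step version is still the one to use if we want to look at the ţ and ş entries separately.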

wordsincategory.sh

#!/bin/bash

usage() {
  echo "Usage: $0 category"
  echo "where category is the name of the category for which to retrieve titles"
  echo 
  echo "For example:"
  echo "$0 'Română'";
  exit 1
}

if [ -z "$1"  ]; then
  usage
fi
cat=`echo "$1" | sed -e 's/ /_/g;'`
cat="Categorie:$cat"
tmp="./cat_tmp"
today=`date +"%B-%d-%Y"`
ext="$today"
mkdir -p $tmp
titles="$tmp/titles.$ext"
cmcontinue=""
step=500

rm -f  $titles.*  

count=1
while [ 1 ]; do

    echo "getting category titles $count to $(( count + step - 1 ))"

    # fetch the next batch of up to $step titles

    echo "$titles.xml.temp"   # debug output: name of the temporary file
    # build the request URL (https, since Wikimedia sites redirect plain http);
    # cmcontinue is only sent once the first batch has been fetched
    url="https://ro.wiktionary.org/w/api.php?action=query&list=categorymembers&cmtitle=$cat&cmprop=title&cmlimit=$step&format=xml"
    if [ -n "$cmcontinue" ]; then
        url="$url&cmcontinue=$cmcontinue"
    fi
    curl --retry 10 -H 'Expect:' -f "$url" | sed -e 's/>/>\n/g;' > $titles.xml.temp
    # $? would be the exit status of sed, so read curl's status from PIPESTATUS
    status=${PIPESTATUS[0]}
    if [ "$status" -ne 0 ]; then
        echo "Error $status from curl, unable to get xml pages, bailing"
        exit 1
    fi
    cat $titles.xml.temp >> $titles.xml
    # get continue param

    cmcontinue=`grep cmcontinue $titles.xml.temp`
    if [ -z "$cmcontinue" ]; then
        break
    else
        cmcontinue=`echo $cmcontinue | awk -F'"' '{ print $2 }' | sed -e 's/ /%20/g;'`
    fi
    sleep 6
    count=$(( $count+$step ))
done

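# each category member is a line of the form <cm ns="10" title="Română" />;
# split on double quotes, field 4 is the title, which is then wrapped in [[ ]]
grep '<cm ' $titles.xml | awk -F'"' '{ print $4 }' | sed -e 's/^/[[/; s/$/]]/' > $titles.txt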
# done!
echo "done!"
exit 0
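
To see how the script reads an API response without contacting the server, the parsing steps can be tried on a small sample. This is only a sketch: the sample text is made up for the example and merely mimics the shape given in the script's own comment, and the function names split_tags, extract_titles and extract_continue are hypothetical. Like the script, it relies on GNU sed (as found on Linux) to insert the line breaks.

```shell
#!/bin/bash
# Made-up sample, NOT real API output: two category members and a
# continuation value, all on one line as the server would send them.
sample='<api><query><categorymembers><cm ns="0" title="apă" /><cm ns="0" title="casă" /></categorymembers></query><query-continue><categorymembers cmcontinue="casa mare|123" /></query-continue></api>'

# put every tag on its own line, as the script does (GNU sed syntax)
split_tags() {
  sed -e 's/>/>\n/g'
}

# on a <cm ... /> line split on double quotes, field 4 is the title;
# the title is then wrapped in [[ ]] to make a wiki link
extract_titles() {
  grep '<cm ' | awk -F'"' '{ print $4 }' | sed -e 's/^/[[/; s/$/]]/'
}

# on the line mentioning cmcontinue, field 2 is the continuation value;
# spaces are replaced by %20 so that the value can be put in a URL
extract_continue() {
  grep cmcontinue | awk -F'"' '{ print $2 }' | sed -e 's/ /%20/g'
}

printf '%s' "$sample" | split_tags | extract_titles
printf '%s' "$sample" | split_tags | extract_continue
```

For this sample the first command prints [[apă]] and [[casă]] on separate lines, and the second prints casa%20mare|123, which is the value the script would send as cmcontinue in its next request.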