Word lists - Separating into different files according to word-length (and removing symbol words)

Hello. I am posting because I thought others may find this useful.

I have written some VB code which takes a word list (a single text file with 1 word per line: OriginalList.txt), goes through the entire list, and separates the words into different files according to the length of the word, i.e.:
-2_letters.txt
-3_letters.txt
-4_letters.txt
etc.

Words that contain symbols (apostrophes, etc.) are not included in the above files and get dumped into another file (invalid.txt). Words shorter than 2 letters or longer than 6 letters also go into invalid.txt.

The procedure takes a few minutes to complete, depending on the size of the list and how quick your PC is, because the output file is opened and closed again for every word.

I originally wrote this code to prepare some word lists for use in my game "Covert Word" (winner of the API competition).

I will be using it again, once Oxford have released the new WordList endpoint that they are working on, that has all forms of words (run, runs, ran, running, etc), instead of just the base-forms.

Here is the code. It handles words from 2 to 6 letters in length, but could be modified for other lengths.

Imports System.IO


Public Class Form1

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

        Dim pth As String
        pth = "C:\newWordLists\"

        Using r As StreamReader = New StreamReader(pth & "OriginalList.txt")

            Dim wrd As String

            wrd = r.ReadLine

            Do While (Not wrd Is Nothing)

                'normalise every word (not just the first one) before testing it
                wrd = Trim(wrd)
                wrd = wrd.ToLower

                'change this line if less than 2, or more than 6 letters is required:

                If wrd.Length < 2 OrElse wrd.Length > 6 OrElse Not isAlpha(wrd) Then 'if word is invalid

                    My.Computer.FileSystem.WriteAllText(pth & "invalid.txt", Chr(13) & Chr(10) & wrd, True)

                Else 'word is valid


                    If wrd.Length = 2 Then
                        My.Computer.FileSystem.WriteAllText(pth & "2_letters.txt", Chr(13) & Chr(10) & wrd, True)
                    ElseIf wrd.Length = 3 Then
                        My.Computer.FileSystem.WriteAllText(pth & "3_letters.txt", Chr(13) & Chr(10) & wrd, True)
                    ElseIf wrd.Length = 4 Then
                        My.Computer.FileSystem.WriteAllText(pth & "4_letters.txt", Chr(13) & Chr(10) & wrd, True)
                    ElseIf wrd.Length = 5 Then
                        My.Computer.FileSystem.WriteAllText(pth & "5_letters.txt", Chr(13) & Chr(10) & wrd, True)
                    ElseIf wrd.Length = 6 Then
                        My.Computer.FileSystem.WriteAllText(pth & "6_letters.txt", Chr(13) & Chr(10) & wrd, True)

                     'add more of the above here if less than 2, or more than 6 letters is required

                    End If



                End If

                'read the next word (Nothing at end of file)
                wrd = r.ReadLine
            Loop

        End Using

        MsgBox("finished")

    End Sub




    Function isAlpha(ByVal str As String) As Boolean
        Dim iPos As Integer
        Dim bolValid As Boolean
        iPos = 1
        bolValid = True

        While iPos <= Len(str) And bolValid
            If Asc(UCase(Mid(str, iPos, 1))) < Asc("A") Or _
             Asc(UCase(Mid(str, iPos, 1))) > Asc("Z") Then _
              bolValid = False

            iPos = iPos + 1
        End While

        isAlpha = bolValid
    End Function

End Class
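
If anyone wants to handle more word lengths, or finds the above too slow on a big list, here is a rough sketch of one possible variant. It is not the code I used for the game, just an illustration under a few assumptions: the module name WordListSplitter, the SplitByLength and IsAsciiLetters names and the minLen/maxLen parameters are all made up for this example. It keeps one StreamWriter open per output file instead of reopening the file for every word, and it builds the file name from the word length, so no extra ElseIf branches are needed. Note that it writes each word followed by a line break (the code above writes the line break first), and that it appends to existing files just like the original.

```vb
Imports System.Collections.Generic
Imports System.IO

Module WordListSplitter

    ' Hypothetical helper: splits OriginalList.txt in folder pth into
    ' N_letters.txt files for minLen <= N <= maxLen; everything else
    ' goes to invalid.txt.
    Sub SplitByLength(pth As String, minLen As Integer, maxLen As Integer)
        ' One open writer per output file, keyed by file name.
        Dim writers As New Dictionary(Of String, StreamWriter)
        Try
            For Each line As String In File.ReadLines(Path.Combine(pth, "OriginalList.txt"))
                Dim wrd As String = line.Trim().ToLower()

                Dim target As String
                If wrd.Length >= minLen AndAlso wrd.Length <= maxLen AndAlso IsAsciiLetters(wrd) Then
                    target = wrd.Length.ToString() & "_letters.txt"
                Else
                    target = "invalid.txt"
                End If

                Dim writer As StreamWriter = Nothing
                If Not writers.TryGetValue(target, writer) Then
                    ' True = append to the file if it already exists
                    writer = New StreamWriter(Path.Combine(pth, target), True)
                    writers.Add(target, writer)
                End If
                writer.WriteLine(wrd)
            Next
        Finally
            ' Flush and close every file, even if an error occurred.
            For Each openWriter As StreamWriter In writers.Values
                openWriter.Dispose()
            Next
        End Try
    End Sub

    ' True only if every character is a lowercase letter a-z
    ' (the caller lowercases the word first).
    Function IsAsciiLetters(s As String) As Boolean
        For Each c As Char In s
            If c < "a"c OrElse c > "z"c Then Return False
        Next
        Return True
    End Function

End Module
```

To get the same 2 to 6 letter split as above, you would call SplitByLength("C:\newWordLists\", 2, 6) from the button click handler.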
